On 19.04.2012 22:46, Luigi Rizzo wrote:
> On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
>> On 19.04.2012 15:30, Luigi Rizzo wrote:
>>> I have been running some performance tests on UDP sockets,
>>> using the netsend program in tools/tools/netrate/netsend
>>> and instrumenting the source code and the kernel to return at
>>> various points of the path. Here are some results which
>>> I hope you find interesting.
>>
>> Jumping over very interesting analysis...
>>
>>> - the next expensive operation, consuming another 100ns,
>>> is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
>>> seems to scale decently at least with 4 cores. The copyin() is
>>> relatively inexpensive (not reported in the data below, but
>>> disabling it saves only 15-20ns for a short packet).
>>>
>>> I have not followed the details, but the allocator calls the zone
>>> allocator and there is at least one critical_enter()/critical_exit()
>>> pair, and the highly modular architecture invokes long chains of
>>> indirect function calls both on allocation and release.
>>>
>>> It might make sense to keep a small pool of mbufs attached to the
>>> socket buffer instead of going to the zone allocator.
>>> Or defer the actual encapsulation to the
>>> (*so->so_proto->pr_usrreqs->pru_send)() which is called inline anyway.
>>
>> The UMA mbuf allocator is certainly not perfect but rather good.
>> It has a per-CPU cache of mbufs that are very fast to allocate
>> from. Once it has used them up it needs to refill from the global
>> pool, which may happen from time to time and show up in the averages.
>
> indeed i was pleased to see no difference between 1 and 4 threads.
> This also suggests that the global pool is accessed very seldom,
> and for short times, otherwise you'd see the effect with 4 threads.

Robert did the per-CPU mbuf allocator pools a few years ago.
Excellent engineering.

> What might be moderately expensive are the critical_enter()/
> critical_exit() calls around individual allocations.

Can't get away from those, as a thread must not migrate away while
manipulating the per-CPU mbuf pool.

> The allocation happens while the code already has an exclusive
> lock on so->snd_buf, so a pool of fresh buffers could be attached
> there.

Ah, but it is not necessary to hold the snd_buf lock while doing the
allocate+copyin. With soreceive_stream() (which is experimental and
not enabled by default) I did just that for the receive path. It's
quite a significant gain there.

IMHO it's better to resolve the locking order than to juggle yet
another mbuf sink.

> But the other consideration is that one could defer the mbuf
> allocation to a later time when the packet is actually built (or
> anyway right before the thread returns).
>
> What i envision (and this would fit nicely with netmap) is the
> following:
>
> - have a (possibly readonly) template for the headers (MAC+IP+UDP)
> attached to the socket, built on demand, and cached and managed
> with similar invalidation rules as used by fastforward;

That would require cross-pointering the rtentry and whatnot again.
We want to get away from that to untangle the (locking) mess that
eventually results from it.

> - possibly extend the pru_send interface so one can pass down the
> uio instead of the mbuf;
>
> - make an opportunistic buffer allocation in some place downstream,
> where the code already has an x-lock on some resource (could be
> the snd_buf, the interface, ...) so the allocation comes for free.

ETOOCOMPLEXOVERTIME.
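
As background for the critical_enter()/critical_exit() point above,
here is a minimal sketch of why a per-CPU free list needs that pair.
The percpu_cache structure, the cache[] array and percpu_cache_alloc()
are made-up names for illustration only; this is not the actual UMA
code.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/pcpu.h>

/* Hypothetical per-CPU mbuf free list, much simpler than real UMA. */
struct percpu_cache {
	struct mbuf	*head;
};

static struct percpu_cache cache[MAXCPU];

static struct mbuf *
percpu_cache_alloc(void)
{
	struct percpu_cache *pc;
	struct mbuf *m;

	/*
	 * critical_enter() prevents the thread from being preempted
	 * and migrated to another CPU while it dereferences "its"
	 * cache; without it two threads could race on one free list.
	 */
	critical_enter();
	pc = &cache[curcpu];
	m = pc->head;
	if (m != NULL)
		pc->head = m->m_next;
	critical_exit();

	/* Fall back to the global zone when the per-CPU list is empty. */
	if (m == NULL)
		m = m_get(M_NOWAIT, MT_DATA);
	return (m);
}

The pair only disables preemption on the local CPU and takes no lock,
which is consistent with the 1-vs-4-thread numbers discussed above.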
>>> - another big bottleneck is the route lookup in ip_output()
>>> (between entries 51 and 56). Not only does it eat another
>>> 100ns+ on an empty routing table, but it also causes huge
>>> contention when multiple cores are involved.
>>
>> This is indeed a big problem. I'm working (rough edges remain) on
>> changing the routing table locking to an rmlock (read-mostly) which
>
> i was wondering, is there a way (and/or any advantage) to use the
> fastforward code to look up the route for locally sourced packets?

No. The main advantage/difference of fastforward is the short code
path and processing to completion.

--
Andre
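
As background for the rmlock (read-mostly) change mentioned above,
here is a sketch of what the read side of such a lookup could look
like. The rt_table_lock name and the rt_table_find() helper are
invented for illustration; they are not taken from the actual patch.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>
#include <sys/socket.h>
#include <net/route.h>

/* Made-up lock name; the real patch differs. */
static struct rmlock rt_table_lock;

/* Hypothetical lookup helper, defined elsewhere. */
static struct rtentry *rt_table_find(struct sockaddr *);

static void
rt_locking_init(void)
{
	rm_init(&rt_table_lock, "rtable");
}

static struct rtentry *
rt_lookup_read(struct sockaddr *dst)
{
	struct rm_priotracker tracker;
	struct rtentry *rt;

	/*
	 * Read-mostly lock: readers run concurrently and pay very
	 * little; only the (rare) table updates take the write lock
	 * with rm_wlock()/rm_wunlock().
	 */
	rm_rlock(&rt_table_lock, &tracker);
	rt = rt_table_find(dst);
	rm_runlock(&rt_table_lock, &tracker);
	return (rt);
}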