Re: Some performance measurements on the FreeBSD network stack

From: Andre Oppermann <andre_at_freebsd.org>
Date: Fri, 20 Apr 2012 11:26:01 +0200
On 20.04.2012 08:35, Luigi Rizzo wrote:
> On Fri, Apr 20, 2012 at 12:37:21AM +0200, Andre Oppermann wrote:
>> On 20.04.2012 00:03, Luigi Rizzo wrote:
>>> On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:
>>>> On 19.04.2012 22:46, Luigi Rizzo wrote:
>>>>> The allocation happens while the code already has an exclusive
>>>>> lock on so->snd_buf, so a pool of fresh buffers could be attached
>>>>> there.
>>>>
>>>> Ah, there it is not necessary to hold the snd_buf lock while
>>>> doing the allocate+copyin.  With soreceive_stream() (which is
>>>
>>> it is not held in the tx path either -- but there is a short section
>>> before m_uiotombuf() which does
>>>
>>> 	...
>>> 	SOCKBUF_LOCK(&so->so_snd);
>>> 	// check for pending errors, sbspace, so_state
>>> 	SOCKBUF_UNLOCK(&so->so_snd);
>>> 	...
>>>
>>> (some of this is slightly dubious, but that's another story)
>>
>> Indeed the lock isn't held across the m_uiotombuf().  You're talking
>> about filling a sockbuf mbuf cache while holding the lock?
>
> all i am thinking is that when we have a serialization point we
> could use it for multiple related purposes. In this case yes we
> could keep a small mbuf cache attached to so_snd. When the cache
> is empty either get a new batch (say 10-20 bufs) from the zone
> allocator, possibly dropping and regaining the lock if the so_snd
> must be a leaf.  Besides for protocols like TCP (does it use the
> same path ?) the mbufs are already there (released by incoming acks)
> in the steady state, so it is not even necessary to refill the
> cache.
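
For concreteness, a minimal sketch of what such a per-so_snd cache
could look like.  All names here (sb_mcache, sb_mcache_get,
SB_MCACHE_BATCH) are invented for illustration; nothing like this
exists in the tree, and a real version would also need per-sockbuf
storage and a drain path:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/socketvar.h>

#define SB_MCACHE_BATCH 16              /* refill batch size, made up */

struct sb_mcache {
        struct mbuf     *smc_head;      /* singly-linked free list */
        int              smc_count;
};

/*
 * Take an mbuf from the cache, refilling it in a batch from the
 * zone allocator when empty.  M_NOWAIT avoids sleeping while the
 * sockbuf lock is held.
 */
static struct mbuf *
sb_mcache_get(struct sockbuf *sb, struct sb_mcache *mc)
{
        struct mbuf *m;
        int i;

        SOCKBUF_LOCK_ASSERT(sb);

        if (mc->smc_count == 0) {
                for (i = 0; i < SB_MCACHE_BATCH; i++) {
                        m = m_get(M_NOWAIT, MT_DATA);
                        if (m == NULL)
                                break;
                        m->m_next = mc->smc_head;
                        mc->smc_head = m;
                        mc->smc_count++;
                }
                if (mc->smc_count == 0)
                        return (NULL);
        }
        m = mc->smc_head;
        mc->smc_head = m->m_next;
        m->m_next = NULL;
        mc->smc_count--;
        return (m);
}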

I'm sure things can be tuned towards particular cases, but almost
always that comes at the expense of versatility.  I was looking
at netmap for a project.  It's great when there is one thing being
done by one process at great speed.  However, as soon as I have to
dispatch certain packets somewhere else for further processing,
in another process, things quickly become complicated and fall
apart.  It would have meant replicating what the kernel does
with protosw & friends in userspace, coated with IPC.  Not to
mention re-inventing the socket layer abstraction again.

So netmap is fantastic for simple, bulk and repetitive tasks with
little variance.  Things like packet routing, bridging, encapsulation,
perhaps inspection and acting as a traffic sink/source.  There are
plenty of use cases for that.

Coming back to your UDP test case: while the 'hacks' you propose
may benefit bulk sending on a bound socket, they may not help, or
may even pessimize, the DNS server case where a large number of
packets is sent to a large number of destinations.

The layering abstractions we have in BSD are excellent and have
served us quite well so far.  Adding new protocols is a simple
task and so on.  Of course this comes with trade-offs: extra
indirections, and not being bare-metal fast.  Yes, there is a lot
of potential in optimizing the locking strategies we currently
have within the BSD network stack layering.  Your profiling work
is immensely helpful in identifying where to aim.  Once that
is fixed we should stop there.  Anyone who needs a particular,
as-close-to-the-bare-metal-as-possible UDP packet blaster should
fork the tree and do their own short-cuts and whatnot.  But FreeBSD
should stay a reasonable general-purpose OS.  It won't be a Ferrari,
but an Audi S6 is a damn nice car as well and it can carry your
whole family. :)

> This said, i am not 100% sure that the 100ns I am seeing are all
> spent in the zone allocator.  As i said the chain of indirect calls
> and other ops is rather long on both acquire and release.
>
>>>>> But the other consideration is that one could defer the mbuf allocation
>>>>> to a later time when the packet is actually built (or anyways
>>>>> right before the thread returns).
>>>>> What i envision (and this would fit nicely with netmap) is the following:
>>>>> - have a (possibly readonly) template for the headers (MAC+IP+UDP)
>>>>>    attached to the socket, built on demand, and cached and managed
>>>>>    with similar invalidation rules as used by fastforward;
>>>>
>>>> That would require cross-pointering the rtentry and whatnot again.
>>>
>>> i was planning to keep a copy, not a reference. If the copy becomes
>>> temporarily stale, no big deal, as long as you can detect it reasonably
>>> quickly -- routes are not guaranteed to be correct, anyways.
>>
>> Be wary of disappearing interface pointers...
>
> (this reminds me, what prevents a route grabbed from the flowtable
> from disappearing and releasing the ifp reference ?)

It has to keep a refcounted reference to the rtentry.

> In any case, it seems better to keep a more persistent ifp reference
> in the socket rather than grab and release one on every single
> packet transmission.

The socket doesn't and shouldn't know anything about ifp's.
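
On the header template itself, keeping a copy (not a reference, and
no ifp pointer) with a cheap staleness check might look roughly like
the sketch below.  Every name here (rt_gen_count, so_hdrtmpl,
so_hdrtmpl_valid) is invented for illustration and only loosely in
the spirit of the fastforward-style invalidation Luigi mentions:

#include <sys/types.h>
#include <net/ethernet.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/udp.h>

/* Bumped on any routing table change; hypothetical. */
extern volatile uint64_t rt_gen_count;

/* Prebuilt MAC+IP+UDP header copy hung off the socket. */
struct so_hdrtmpl {
        uint64_t        ht_gen;         /* rt_gen_count when filled in */
        u_char          ht_hdr[ETHER_HDR_LEN + sizeof(struct ip) +
                            sizeof(struct udphdr)];
};

/*
 * A stale template is tolerable as long as it is noticed soon;
 * routes are not guaranteed to be current anyway.  The caller
 * rebuilds the template when this returns 0.
 */
static int
so_hdrtmpl_valid(const struct so_hdrtmpl *ht)
{
        return (ht->ht_gen == rt_gen_count);
}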

>>>>> - possibly extend the pru_send interface so one can pass down the uio
>>>>>    instead of the mbuf;
>>>>> - make an opportunistic buffer allocation in some place downstream,
>>>>>    where the code already has an x-lock on some resource (could be
>>>>>    the snd_buf, the interface, ...) so the allocation comes for free.
>>>>
>>>> ETOOCOMPLEXOVERTIME.
>>>
>>> maybe. But i want to investigate this.
>>
>> I fail to see what passing down the uio would gain you.  The snd_buf lock
>> isn't obtained again after the copyin.  Not that I want to prevent you
>> from investigating other ways. ;)
>
> maybe it can open the way to other optimizations, such as reducing
> the number of places where you need to lock, saving some data
> copies, or reducing fragmentation, etc.
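
For reference, what is being floated is essentially a variant of the
pru_send entry point that takes the uio instead of a ready-made mbuf
chain.  A purely illustrative sketch; pru_send below matches the
existing interface, while pru_send_uio is made up and does not exist:

struct mbuf;
struct sockaddr;
struct socket;
struct thread;
struct uio;

/*
 * Today sosend() builds the mbuf chain with m_uiotombuf() and hands
 * it to the protocol via pru_send().  The hypothetical pru_send_uio
 * would pass the uio down instead, so a lower layer could allocate
 * the mbuf and do the copyin at a point where it already holds the
 * lock it needs anyway.
 */
struct pr_usrreqs_sketch {
        int     (*pru_send)(struct socket *so, int flags, struct mbuf *m,
                    struct sockaddr *addr, struct mbuf *control,
                    struct thread *td);
        int     (*pru_send_uio)(struct socket *so, int flags,
                    struct uio *uio, struct sockaddr *addr,
                    struct mbuf *control, struct thread *td);
};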

I appreciate your profiling work very much and will try my best to
help you minimize the contention points.  I hope the rtable locking
changes will solve one of the biggest choke points.

-- 
Andre
Received on Fri Apr 20 2012 - 07:25:13 UTC
