Re: Some performance measurements on the FreeBSD network stack

From: Andre Oppermann <andre_at_freebsd.org> Date: Thu, 19 Apr 2012 22:05:37 +0200

On 19.04.2012 15:30, Luigi Rizzo wrote:
> I have been running some performance tests on UDP sockets,
> using the netsend program in tools/tools/netrate/netsend
> and instrumenting the source code and the kernel to return at
> various points of the path. Here are some results which
> I hope you find interesting.

Jumping over the very interesting analysis...

> - the next expensive operation, consuming another 100ns,
>    is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
>    seems to scale decently at least with 4 cores.  The copyin() is
>    relatively inexpensive (not reported in the data below, but
>    disabling it saves only 15-20ns for a short packet).
>
>    I have not followed the details, but the allocator calls the zone
>    allocator and there is at least one critical_enter()/critical_exit()
>    pair, and the highly modular architecture invokes long chains of
>    indirect function calls both on allocation and release.
>
>    It might make sense to keep a small pool of mbufs attached to the
>    socket buffer instead of going to the zone allocator.
>    Or defer the actual encapsulation to the
>    (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.

The UMA mbuf allocator is certainly not perfect, but it is rather good.
It has a per-CPU cache of mbufs that are very fast to allocate
from.  Once a CPU has used up its cache it has to refill from the
global pool, which may happen from time to time and show up in the averages.
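
To illustrate the fast path versus the refill path, here is a minimal
userland sketch of a per-CPU cache in front of a global pool.  It is not
UMA code: the names (pool_alloc, pool_refill, NCPU, CACHE_SIZE) and the
sizes are made up for the example, and the global lock is only indicated
by a comment and a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NCPU       4	/* hypothetical number of CPUs */
#define CACHE_SIZE 8	/* hypothetical per-CPU cache capacity */

struct buf {
	struct buf *next;	/* link used only on the global free list */
	char data[256];
};

struct cpu_cache {
	struct buf *items[CACHE_SIZE];
	int count;
};

struct pool {
	struct cpu_cache cache[NCPU];
	struct buf *global_free;	/* a real allocator would lock this */
	unsigned long fast_allocs;	/* served from a per-CPU cache */
	unsigned long refills;		/* trips to the global pool */
};

static void
pool_init(struct pool *p)
{
	for (int i = 0; i < NCPU; i++)
		p->cache[i].count = 0;
	p->global_free = NULL;
	p->fast_allocs = 0;
	p->refills = 0;
}

/* Slow path: fill the cache from the global list, then from malloc(). */
static int
pool_refill(struct pool *p, struct cpu_cache *c)
{
	p->refills++;	/* the global lock would be taken here */
	while (c->count < CACHE_SIZE) {
		struct buf *b = p->global_free;

		if (b != NULL)
			p->global_free = b->next;
		else if ((b = malloc(sizeof(*b))) == NULL)
			break;
		c->items[c->count++] = b;
	}
	return (c->count);
}

/* Fast path: pop from the caller's own cache, no shared state touched. */
static struct buf *
pool_alloc(struct pool *p, int cpu)
{
	struct cpu_cache *c;

	assert(cpu >= 0 && cpu < NCPU);
	c = &p->cache[cpu];
	if (c->count == 0) {
		if (pool_refill(p, c) == 0)
			return (NULL);
	} else
		p->fast_allocs++;
	return (c->items[--c->count]);
}

/* Free into the local cache; spill to the global list when it is full. */
static void
pool_free(struct pool *p, int cpu, struct buf *b)
{
	struct cpu_cache *c;

	assert(cpu >= 0 && cpu < NCPU);
	c = &p->cache[cpu];
	if (c->count < CACHE_SIZE) {
		c->items[c->count++] = b;
		return;
	}
	b->next = p->global_free;
	p->global_free = b;
}
```

With these sizes one allocation in eight takes the slow path, which is
the kind of occasional cost that gets folded into an averaged
per-packet number.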

> - another big bottleneck is the route lookup in ip_output()
>    (between entries 51 and 56). Not only does it eat another
>    100ns+ on an empty routing table, but it also
>    causes huge contention when multiple cores
>    are involved.

This is indeed a big problem.  I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly lock), which
doesn't produce any lock contention or cache pollution.  Also skipping
the per-route lock while the table read-lock is held should help some
more.  All in all this should give a massive gain in high pps situations
at the expense of costlier routing table changes.  However, such changes
are rare to essentially nonexistent with a single default route.
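
The trade-off (cheap reads, expensive writes) can be sketched in
userland with C11 atomics.  This is not the rmlock(9) implementation; it
is a simpler distributed-reader-counter lock with made-up names
(rm_sketch, rms_rlock, NSLOTS), assuming 64-byte cache lines.  Each
reader only writes to its own cache line, while a writer has to visit
every slot and wait for it to drain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NSLOTS     4	/* hypothetical: one slot per CPU */
#define CACHE_LINE 64	/* assumed cache line size */

struct reader_slot {
	_Alignas(CACHE_LINE) atomic_int readers;	/* own cache line */
};

struct rm_sketch {
	atomic_bool writer;		/* read-only for readers */
	struct reader_slot slot[NSLOTS];
};

static void
rms_init(struct rm_sketch *l)
{
	atomic_init(&l->writer, false);
	for (int i = 0; i < NSLOTS; i++)
		atomic_init(&l->slot[i].readers, 0);
}

/* Read lock: announce in the local slot, back off if a writer is active. */
static void
rms_rlock(struct rm_sketch *l, int cpu)
{
	assert(cpu >= 0 && cpu < NSLOTS);
	for (;;) {
		atomic_fetch_add(&l->slot[cpu].readers, 1);
		if (!atomic_load(&l->writer))
			return;
		atomic_fetch_sub(&l->slot[cpu].readers, 1);
		while (atomic_load(&l->writer))
			;	/* spin until the writer is done */
	}
}

/* Must be called with the same slot index that was used for rms_rlock(). */
static void
rms_runlock(struct rm_sketch *l, int cpu)
{
	assert(cpu >= 0 && cpu < NSLOTS);
	atomic_fetch_sub(&l->slot[cpu].readers, 1);
}

/* Write lock: claim the writer flag, then wait for every slot to drain. */
static void
rms_wlock(struct rm_sketch *l)
{
	bool expected = false;

	while (!atomic_compare_exchange_weak(&l->writer, &expected, true))
		expected = false;
	for (int i = 0; i < NSLOTS; i++)
		while (atomic_load(&l->slot[i].readers) != 0)
			;	/* cost grows with the number of slots */
}

static void
rms_wunlock(struct rm_sketch *l)
{
	atomic_store(&l->writer, false);
}
```

A packet path would wrap the route lookup in rms_rlock()/rms_runlock(),
and a routing table change would take rms_wlock().  The writer pays for
touching all slots, which is acceptable if changes are as rare as
described above.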

After that the ARP table will get the same treatment, and the lock
contention points in the lower stack should be gone for good.

-- 
Andre