RE: 4.7 vs 5.2.1 SMP/UP bridging performance

From: Gerrit Nagelhout <gnagelhout_at_sandvine.com>
Date: Tue, 4 May 2004 18:17:39 -0400

>>>>I would like to move to CURRENT for new hardware support, and the 
>>>>ability to properly use multi-threading in user-space, but can't do
>>>>this until the performance bottlenecks are solved.  I realize that 
>>>>5.x is still a work in progress and hasn't been tuned as well as 4.7 
>>>>yet, but are there any plans for optimizations in this area?  Does 
>>>>anyone have any suggestions on what else I can try?
>>>
>>>
>>>Try rwatson's netperf patches:
>>>
>>>  http://www.watson.org/~robert/freebsd/netperf/
>>>
>>>There is at least one outstanding panic condition known, but more
>>>testing will be a great help.
>>>
>>>Kris
>>>
>>>P.S. You didn't mention the status of WITNESS, but I'm assuming you
>>>read the docs and disabled it since it's a huge performance killer.
> 
> 
>>WITNESS and INVARIANTS are turned off for the 5.2.1 release bits.
>>However, the debug.mpsafenet sysctl is also turned off.  Turning this
>>on might give a significant performance boost for bridging.
> 
> 
>>Scott
> 
> 
> Thanks for all the responses so far.  WITNESS is definitely disabled, 
> as are the other INVARIANTS.  I had a look through the netperf patches, 
> but I don't think they will affect bridging very much.  They seem to be 
> directed more towards the socket layer and above.
> 
> I still think that one of the bigger bottlenecks is the cost of all
> the mutexes in SMP mode, and some of the new bus_dma and mbuf code that
> was introduced.  
> 
> With previous platforms I have worked on (vxWorks), we had similar 
> issues, and ended up pushing buckets of packets through the data path, 
> so each mutex was only taken once for every 10-100 packets.
> 
> Also, polling is currently done by only one CPU at a time.  If this 
> were changed to have multiple threads poll multiple devices at the
> same time, the performance should become much better.
> 
> Thanks,
> 
> Gerrit

>You are correct about the netperf patches being directed towards the 
>socket layer.  The IP stack and below was locked for 5.2, but the
>benefits won't be seen unless you turn on debug.mpsafenet.  During
>the 5.2 development cycle I believe that benchmarking was done that
>showed that mpsafenet bridging was significantly faster than non-
>mpsafenet, and nearly as fast as 4.x if not a little faster.

>I'd be interested to know more about your comments about polling from
>multiple CPUs.  Did you have a thread bound to each CPU, and did
>each thread poll every interface, or only an exclusive subset of the
>interfaces?

>Scott

>I tried enabling debug.mpsafenet, but it didn't make any difference.
>Which parts of the bridging path do you think should be faster with
>that enabled?

>I haven't actually tried implementing polling from multiple CPUs, but
>suggested it because I think it would help performance for certain
>applications (such as bridging).  What I would probably do 
>(without having given this a great deal of thought) is to:
>1) Have a variable controlling how many threads to use for polling
>2) Either lock an interface to a thread, or have interfaces switch
>   between threads depending on their load dynamically.
>One obvious problem with this approach will be mutex contention
>between threads.  Even though the source interface would be owned
>by a thread, the destination would likely be owned by a different
>thread.  I'm assuming that with the current mutex setup, only one
>thread can receive from or transmit to an interface at a time.

>Before this becomes feasible though, the cost of the mutexes should
>be addressed first (assuming that is the current bottleneck for SMP).

>Gerrit

I ran the following fragment of code to determine the cost of a LOCK & 
UNLOCK on both UP and SMP:

#define	EM_LOCK(_sc)		mtx_lock(&(_sc)->mtx)
#define	EM_UNLOCK(_sc)		mtx_unlock(&(_sc)->mtx)

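    /* rdtsc() returns a 64-bit cycle count; keep the full width. */
    uint64_t startTime, endTime, delta;
    int i;

    startTime = rdtsc();
    for (i = 0; i < 100; i++)
    {
        EM_LOCK(adapter);       /* uncontended acquire */
        EM_UNLOCK(adapter);     /* immediate release */
    }
    endTime = rdtsc();
    delta = endTime - startTime;    /* total cycles for 100 pairs */
    printf("delta %ju start %ju end %ju\n", (uintmax_t)delta,
        (uintmax_t)startTime, (uintmax_t)endTime);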

On a single hyperthreaded 2.8GHz Xeon, each LOCK/UNLOCK pair took ~30 cycles
under UP and ~300 cycles under SMP (the total delta divided by the 100
iterations).  Assuming 10 locks for every packet (which is conservative),
at 500Kpps this accounts for:
300 * 10 * 500000 = 1.5 billion cycles per second (out of 2.8 billion
cycles per second, or roughly 54% of the CPU)
Any comments?

Thanks,

Gerrit