Re: FreeBSD 5.3 Bridge performance take II

From: Scott Long <scottl_at_samsco.org>
Date: Wed, 08 Sep 2004 12:17:07 -0600
Gerrit Nagelhout wrote:
> Hi,
> 
> I have just finished some profiling and analysis of the FREEBSD_5_BP code 
> running a standard 4-port ethernet bridge (not netgraph).  On the upside, 
> some of the features such as the netperf stuff, MUTEX_PROFILING and 
> UMA are very cool, and (I think) give the potential for a really fast bridge 
> (or similar application).  However, the current performance is still rather
> poor compared to 4.x, but I think that with the groundwork now in place, some
> minor changes, and a couple of new features, it can be made much, much faster.
> I would like to discuss some possible optimizations (I will suggest some below);
> we are willing to take on some of them and give the code back to FreeBSD.
> Hopefully these changes can be made on RELENG_5 to be used by 5.4.
> The tests that I have run so far have focused on the difference between
> running in polling mode (dual 2.8GHz Xeon, two 2-port em NICs) versus interrupt
> mode (with debug.mpsafenet=1, and no INVARIANTS/WITNESS or anything 
> like that).  In both setups I actually get similar throughput (300kpps total in 
> and out divided evenly over the 4 ports).  I think it should be possible to
> get >> 1Mpps bridging on this platform.
> 
> In the polling case, there is still only one active thread, and the limiting
> factor seems to be simply the number of mutex acquisitions (11 per packet
> according to MUTEX_PROFILING), plus overhead from UMA, bus_dma, etc.
> With polling disabled, I think the problems came from PREEMPTION being
> disabled (I can't even boot with it on) and from some sub-optimal mutex
> usage that resulted in a lot of collisions, even though in theory all 4
> cores should be able to run simultaneously.

Neither the ULE nor the 4BSD scheduler seems to perform very well with
HyperThreading turned on.  At best I've measured a small dip in
performance, at worst a 50% dip.  There are a lot of possible
explanations for this, but no serious investigation has been done yet
that I know of.  It could be that we are ping-ponging threads between
processors too much, or that each processor is contending too much on
locks like the sched_lock, or that we are doing too many unneeded IPIs,
etc.  It would also be interesting to compare scalability on a true
4-way system (with HTT off) with that on a 2x2 HTT system.

[...]

> 
> The latest generation Xeons (Nocona) have a couple of new features that are
> very useful for optimizing code.  One of them is the ability to prefetch a cache line
> for which the page is not yet in the TLB.  It should be possible to strategically sprinkle
> a few prefetches in the code and get a big performance boost.  This is probably
> pretty platform-specific, though, so I don't know how to do it in general; it will
> only benefit some platforms (I don't know about AMD/alpha) and may slightly
> hurt some others.
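
For illustration, here is a minimal, self-contained sketch of that kind of
software prefetch.  The descriptor ring layout and process_packet() are
hypothetical (this is not the em(4) receive path); __builtin_prefetch() is
the GCC intrinsic, which compiles to a prefetch hint, or to nothing,
depending on the target:

/*
 * Sketch: prefetch the next packet's data while the current one is
 * still being processed.  Ring layout and process_packet() are made up
 * for illustration.
 */
#include <stddef.h>

struct rx_desc {
        void    *buf;           /* packet data, e.g. an mbuf cluster */
        size_t   len;
};

void    process_packet(void *buf, size_t len);  /* provided elsewhere */

void
rx_ring_poll(struct rx_desc *ring, int count)
{
        int i;

        for (i = 0; i < count; i++) {
                /*
                 * Hint the next descriptor's data into the cache while
                 * the current packet is being processed.
                 */
                __builtin_prefetch(ring[(i + 1) % count].buf);
                process_packet(ring[i].buf, ring[i].len);
        }
}

On CPUs that cannot start a page walk from a prefetch the hint is simply
dropped, which is why it may help some platforms and do nothing (or even
slightly hurt) on others.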

I recently compared a 2x 2.8GHz Nocona running BETA3/amd64 with a 2x
3.0GHz P4 running BETA3/i386 on a socket-intensive benchmark.  The
Nocona scored about 10% lower than the P4, which somewhat surprised me.
I should probably switch the Nocona to 32-bit mode and benchmark that
as well to see whether long mode is helping or hurting it.

> 
> In terms of cache efficiency, I am not sure that using the UMA mbuf packet zone
> is the best way to go.  To be able to put a cluster on a DMA descriptor, you
> currently need to read the mbuf header to get the cluster's pointer.  It may be
> more efficient to have local caches of just clusters and just mbufs.  To allocate
> a cluster you then only need to read the bucket array, and can add the cluster to
> the descriptor without having anything but the array itself in cache.  Once the
> packet is filled up, it can be coupled to an mbuf header.  The other advantage of
> this is that, since the pointers for both are always easily available in an array,
> they lend themselves well to s/w prefetching.
> 
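
As a rough illustration of that allocation pattern, here is a small sketch.
The cache structure and helper names are hypothetical and only show the idea
of handing clusters to descriptors without touching mbuf headers; real code
would refill from and drain to UMA rather than a bare array:

/*
 * Sketch of a cluster-first cache.  Names are made up; attach_header()
 * stands in for whatever couples a filled cluster to an mbuf header.
 */
#include <stddef.h>

#define CLUSTER_CACHE_SIZE      256

struct cluster_cache {
        void    *clusters[CLUSTER_CACHE_SIZE]; /* preallocated clusters */
        int      avail;                        /* valid entries remaining */
};

/*
 * Pull a cluster for a receive descriptor.  Only the pointer array is
 * touched, so no mbuf header has to be brought into the cache here.
 */
void *
cluster_get(struct cluster_cache *cc)
{

        if (cc->avail == 0)
                return (NULL);          /* real code would refill from UMA */
        return (cc->clusters[--cc->avail]);
}

/* Hypothetical: called only once the NIC has filled the cluster. */
struct mbuf     *attach_header(void *cluster, size_t len);

The point is that filling descriptors only dereferences the pointer array;
the mbuf header is pulled into the cache only when the packet is actually
used.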

Robert Watson has some experimental patches for creating per-thread UMA
mbuf caches.  It's not exactly what you are talking about here, but it
might be interesting to look at.

> 
> The choice of schedulers, and use of PREEMPTION will probably make a bit of a 
> difference for these tests too, but I did not do much experimentation because I 
> couldn't even boot with the ULE scheduler & PREEMPTION enabled.  I suspect
> that preemption will help quite a bit when there are mutex collisions.

The default for the immediate future is likely to be the 4BSD scheduler.
PREEMPTION also seems to work much better with it.  The problem that you
had booting with ULE+PREEMPTION was a known problem at RELENG_5_BP time,
so your report doesn't surprise me.  5.3-BETA4 is going to have the
latest scheduler fixes and the switch to 4BSD+PREEMPTION.

> 
> This is all I have for now.  As I mentioned previously, I'd like to generate some 
> discussion on some of these points, as well as hear ideas for additional optimizations.
> We will definitely implement some of these features ourselves, but would much
> rather give back the code and make this a "cooperative effort".
> Also, I haven't done any testing on the netgraph side of things yet, but that will
> probably be next on the list.
> Comments?
> Thanks,
> 
> Gerrit Nagelhout