RE: 4.7 vs 5.2.1 SMP/UP bridging performance

From: Gerrit Nagelhout <gnagelhout_at_sandvine.com>
Date: Tue, 11 May 2004 11:30:21 -0400
Robert Watson wrote:

> On Fri, 7 May 2004, Gerrit Nagelhout wrote:
> 
>> The biggest problem I still see with all of this is that even if I
>> could compile the kernel for the P4, under SMP there is still no
>> fast locking mechanism in place (that I am aware of, although I am
>> researching that).
> 
> Are you doing a uni-directional packet stream or a bi-directional
> packet stream?
>

I am doing a bi-directional packet stream.  I am still using device
polling though, so it doesn't really matter because only one CPU
will be used at a time.  I will try out your patches in a couple of
days without device polling.  The polling vs non-polling performance
on 5.2.1 is similar, except that without polling, the system will
livelock if I send too much data through.

 
> Speaking of unsafe but fun to try: at one point I did experiment with
> ifdef'ing out all of the locking macros and just using Giant, and I
> found that the lack of ability to properly preempt resulted in
> substantial latency problems that in turn substantially slowed down
> performance measurements for stream protocols like TCP.

I can believe this.  Fine-grained locking can provide a lot of
benefits, as long as the cost of the locks themselves can be
mitigated.

> Actually, I think the memory allocation code is one of the areas where
> we need to think very seriously about the current level of locking --
> perhaps more so than the other locking that's around.  I've noticed a
> pretty high number of locking operations to perform memory
> allocations, and we're doing a lot of them.  With mbuf allocation
> moving to using UMA, we'll be able to run statistics and optimize just
> one allocator, not two.

I just had a quick look through some UMA documentation, and it sounds
like it could work quite well for mbufs.  Is any of that work
scheduled for 5.3?  Having per-CPU memory pools should work well if
the interfaces are bound to CPUs as well.
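
For what it's worth, below is a minimal sketch of how a zone with
per-CPU caches is created and used through the UMA interface.  This
is only an illustration of the allocator API, not the actual
mbuf-over-UMA work; the zone name and item size are made up:

#include <sys/param.h>
#include <sys/malloc.h>
#include <vm/uma.h>

static uma_zone_t example_zone;

static void
example_zone_init(void)
{
	/*
	 * Items from this zone are cached per CPU, so the common-case
	 * alloc/free does not have to touch a global lock.
	 */
	example_zone = uma_zcreate("example buffers", MCLBYTES,
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
}

static void *
example_alloc(void)
{
	/* M_NOWAIT: fail rather than sleep in the receive path. */
	return (uma_zalloc(example_zone, M_NOWAIT));
}

static void
example_free(void *item)
{
	uma_zfree(example_zone, item);
}
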
One approach that I found works well with bridging-type applications
is to allocate the clusters separately from the mbufs.  Ideally you
would allocate a cluster in such a way that it did not have to be
pulled into the cache, and then add it to the DMA receive ring.  Once
the DMA engine has filled it with data, an mbuf would be allocated and
the cluster attached to it at that point.  The benefit of this is that
the cluster only needs to be pulled into the cache once instead of
twice, and the number of active mbufs is reduced.  With this setup,
the number of mbufs is small enough that they will likely stay in the
cache all the time, so the only cache misses will be reading the
cluster data.
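
Roughly, the receive-completion side I have in mind would look like
the sketch below.  The rx_slot structure and the cluster_alloc()/
cluster_attach() helpers are hypothetical placeholders (the attach
step would really be MEXTADD() or an equivalent); this only shows the
flow, not a working driver change:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

/* Hypothetical per-descriptor state: a bare cluster on the DMA ring. */
struct rx_slot {
	void		*cluster;	/* cluster handed to the NIC */
	bus_addr_t	 paddr;		/* its bus address */
};

/* Placeholders for whatever cluster pool ends up being used. */
void	*cluster_alloc(void);
void	 cluster_attach(struct mbuf *m, void *cluster, int size);

/*
 * Called once the DMA engine has filled the cluster in 'slot' with
 * 'len' bytes of packet data.
 */
static struct mbuf *
rx_complete(struct rx_slot *slot, int len)
{
	struct mbuf *m;

	/*
	 * Only the mbuf header is allocated now; the packet data is
	 * already sitting in the cluster, so it gets pulled into the
	 * cache once (when the stack reads it) rather than twice.
	 */
	MGETHDR(m, M_DONTWAIT, MT_DATA);
	if (m == NULL)
		return (NULL);

	/* Attach the filled cluster as external storage. */
	cluster_attach(m, slot->cluster, MCLBYTES);
	m->m_len = m->m_pkthdr.len = len;

	/* Hand the ring a fresh, untouched cluster. */
	slot->cluster = cluster_alloc();
	return (m);
}
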
If the cost of allocating one cluster is significant, it may also be
worthwhile to be able to allocate a "slab" of these at once for each
interface.  If these were returned in some kind of array-type
structure, it would be very efficient to add them to the DMA ring.
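
As an illustration, the driver-facing side of such an interface could
look like the sketch below (reusing the hypothetical rx_slot from the
previous sketch); cluster_alloc_bulk() is purely made up, no such call
exists today:

/* Hypothetical bulk interface: fill 'array' with up to 'count'
   clusters in one call and return how many were obtained. */
int	cluster_alloc_bulk(void **array, int count);

#define	RX_BATCH	32

/* Refill 'count' empty ring slots with one allocator call per batch,
   amortizing any per-call locking across RX_BATCH clusters. */
static void
rx_refill_ring(struct rx_slot *ring, int count)
{
	void	*batch[RX_BATCH];
	int	 i, n;

	while (count > 0) {
		n = (count > RX_BATCH) ? RX_BATCH : count;
		n = cluster_alloc_bulk(batch, n);
		if (n == 0)
			break;		/* pool empty, try again later */
		for (i = 0; i < n; i++)
			ring[i].cluster = batch[i];
		ring += n;
		count -= n;
	}
}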

 
> BTW, on a related note, I've also been exploring passing packet chains
> into a number of routines that currently accept only single packets.
> This raises more of the "retain ordering" spectre, but might also have
> some benefits.  If we do explore this model seriously, I think we need
> to introduce an 'mbuf queue' structure/object that can be passed
> around in order to improve our ability to use type safety to avoid
> nasties.  I've seen too many bugs where code assumes m_nextpkt is
> NULL when it's not, or doesn't carefully maintain synchronization and
> coherency of fields across sleeping in 4.x, etc.
> 
> Before committing to an approach based on short work queues over
> single packets, I'd definitely want to see observable performance
> improvements in some example cases, though.  I think it's also worth
> continuing to follow down our current locking path to make sure we
> have a decent first cut MPSAFE system where we have well documented
> synchronization semantics.

I like the idea of using short work queues, and it should work very
well for bridge/netgraph/etc.  I ran a test where I changed the em
driver to generate a chain of mbufs and pass it to ether_input, thus
avoiding a lock/unlock pair per packet (much like you described).
This changed the throughput from ~500kpps to ~540kpps, which is
slightly better than just removing the semaphore.  I think there are
several advantages to this approach:
1) Fewer locks
2) Fewer function calls
3) Better instruction cache usage.  Especially on a platform like the
Xeon, where the decoded instructions are stored in a trace cache, this
is likely to have large benefits.  In the case of bridging, I am
pretty sure that the entire code path is too big to fit in that cache,
but the individual loops created by batching would likely fit.

Using work queues may also make it possible to use software prefetches
to ensure the next packet is in the cache while working on the current
one.
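
For the curious, the shape of the em change was roughly the
following.  This is not the actual diff; rx_next_completed() and
ifp_input_chain() are stand-ins for the real descriptor handling and
for a chain-aware ether_input()-style entry point:

#include <sys/param.h>
#include <sys/mbuf.h>

struct ifnet;

/* Stand-ins for the real descriptor harvesting and for a chain-aware
   input routine. */
struct mbuf	*rx_next_completed(void);
void		 ifp_input_chain(struct ifnet *ifp, struct mbuf *chain);

static void
rx_harvest(struct ifnet *ifp)
{
	struct mbuf *head = NULL, **tail = &head;
	struct mbuf *m;

	/* First pull every completed packet off the ring, linking the
	   packets together through m_nextpkt... */
	while ((m = rx_next_completed()) != NULL) {
		m->m_nextpkt = NULL;
		*tail = m;
		tail = &m->m_nextpkt;
	}

	/* ...then pay the lock/unlock and function-call cost once for
	   the whole batch instead of once per packet. */
	if (head != NULL)
		ifp_input_chain(ifp, head);
}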

After (unsafely :) ) removing the obvious semaphores in the bridge
path, the throughput went up to around 700kpps, and after profiling,
most of the cycles seemed to be going to mbuf handling (allocating,
freeing) and busdma.  This is unfortunately still nowhere close to
where 4.7 was, but with the UMA allocator it should become a lot
better.


> Something I'd like to explore in detail is a careful comparison of
> performing delivery to completion in the interrupt thread vs using a
> set of netisr threads pinned to CPUs.  Avoiding lots of context
> switches is a good thing, but I don't have any measurements that
> suggest at what layer in the inbound packet path we get the most
> coalescing: we know interrupt mitigation and coalescing works well at
> a high level, but are we currently seeing substantial coalescing due
> to long-running interrupt threads, handoffs to netisrs, etc?  Some
> decent measurements with counters at each of the hand-off points
> might go a long way.

If there are any other tests you'd like me to try out, let me know.
I have a pretty good setup for stress & throughput testing bridging.

> I would think that multi-directional bridging would be a win due to
> real latency improvements from parallelism in the kernel.  Likewise,
> I would expect SMP application/kernel scenarios such as web servers
> on SMP boxes to benefit through substantially reduced contention.
> Giant and a non-parallel kernel hurt us in a lot of these scenarios.
> 

As long as we can reduce the locking costs, I definitely agree
with this.  
Thanks for all the excellent feedback so far,

Gerrit
Received on Tue May 11 2004 - 06:30:39 UTC
