RE: 4.7 vs 5.2.1 SMP/UP bridging performance

From: Robert Watson <rwatson@freebsd.org>
Date: Fri, 7 May 2004 19:17:01 -0400 (EDT)

On Fri, 7 May 2004, Gerrit Nagelhout wrote:

> The biggest problem I still see with all of this, is that even if I
> could compile the kernel for the P4, under SMP there is still no fast
> locking mechanism in place (that I am aware of, although I am
> researching that).  I ran a few more tests and did some more
> calculations to determine the impact of (removing) mutexes, and here is
> what I found: for UP, I was able to get 850kpps, which is 3294
> cycles/packet (at 2.8 GHz); for SMP, it was 500kpps, which is 5600
> cycles/packet, or an additional 2306 cycles/packet, which presumably
> goes mostly towards the atomic locked operations.  At ~120 cycles/lock
> extra for SMP, this means that there should be around 19 atomic
> operations per packet.  After getting rid of one mutex (from IF_DEQUEUE,
> this is not safe, but fun to try), the performance went to 530kpps, or
> 5283 cycles/packet.  This is a savings of ~317 cycles per packet. 

Are you doing a uni-directional packet stream or a bi-directional packet
stream?

Speaking of unsafe but fun to try: at one point I did experiment with
ifdef'ing out all of the locking macros and just using Giant, and I found
that the inability to preempt properly resulted in substantial latency
problems that in turn significantly hurt measured performance for stream
protocols like TCP.

> After a quick look through the bridge code path, I found the following
> atomic operations (I probably missed some, and might have some that don't
> always lock, but the total seems about right)
> 
> em_process_receive_interrupts (EM_LOCK)
> bus_dma ?
> mb_alloc ? (MBP_PERSISTENT flag is set, where is this first locked?)

Actually, I think the memory allocation code is one of the areas where we
need to think very seriously about the current level of locking -- perhaps
more so than the other locking that's around.  I've noticed a pretty high
number of locking operations to perform memory allocations, and we're
doing a lot of them.  With mbuf allocation moving to UMA, we'll be able to
gather statistics for and optimize just one allocator, not two.

> bridge_in (BDG_LOCK)
> if_handoff (IF_LOCK)
> em_start (EM_LOCK)
> IF_DEQUEUE (IF_LOCK)
> m_free (atomic_cmpset_int)
> m_free (atomic_subtract_int)

Same goes here: I suspect we're doing more locking than we should for
memory allocation.  Could you turn on mutex profiling and take a look at
where the locking operations are taking place?
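
For reference, roughly what I have in mind is the MUTEX_PROFILING(9)
machinery -- the option and sysctl names below are from memory, so please
double-check them against your 5.2.1 sources:

    # kernel configuration
    options         MUTEX_PROFILING

    # at runtime: reset, enable, run the bridging test, then dump stats
    sysctl debug.mutex.prof.reset=1
    sysctl debug.mutex.prof.enable=1
    ... run the test load ...
    sysctl debug.mutex.prof.enable=0
    sysctl debug.mutex.prof.stats

The per-mutex acquisition counts and hold times in the stats output should
tell us which of the locks in your list are actually the expensive ones
per packet.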

I'm not familiar with our interrupt routine and interrupt thread
scheduling behavior, but if we're not yet pinning interrupt threads to
CPUs while they run, we should be.  I seem to recall SCHED_ULE avoids
migrating interrupt threads, but I haven't checked lately.
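
If we do need to add pinning, I'd expect it to look something along these
lines -- just a sketch, with ithread_bind_cpu() being a made-up name for
wherever the ithread setup code would do this, and the sched_bind(9)
calling convention recalled from memory:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/proc.h>
    #include <sys/sched.h>

    /*
     * Hypothetical sketch: bind an interrupt thread to a CPU so that it
     * always runs (and takes its locks) on the same processor.  In 5.x,
     * sched_bind() is called with sched_lock held.
     */
    static void
    ithread_bind_cpu(struct thread *td, int cpu)
    {
            mtx_lock_spin(&sched_lock);
            sched_bind(td, cpu);
            mtx_unlock_spin(&sched_lock);
    }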

> At 2 atomic operations per mutex (acquire and release), this adds up to
> about 16 atomic operations per packet.
> 
> I think that some of the changes that Robert mentioned before about
> putting mbufs in a list before releasing the lock should help a lot for
> the Xeons.  I am willing to try out some of these changes (both testing
> for performance, and making the actual code changes) because we can't
> switch over to 5.x until the performance is back up to where 4.7 was. 

BTW, on a related note, I've also been exploring passing packet chains
into a number of routines that currently accept only single packets.  This
raises more of the "retain ordering" spectre, but might also have some
benefits.  If we do explore this model seriously, I think we need to
introduce an 'mbuf queue' structure/object that can be passed around in
order to improve our ability to use type safety to avoid nasties.  I've
seen too many bugs where code assumes m_nextpkt is NULL when it's not, or
doesn't carefully maintain synchronization and coherency of fields across
sleeping in 4.x, etc.
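
To make that concrete, I'm picturing something as simple as the following
-- struct mbuf_queue and the helpers are hypothetical names for the sake
of illustration, not anything in the tree today:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    /*
     * Hypothetical mbuf queue object: a list of packets chained through
     * m_nextpkt, with a tail pointer for O(1) append.  Handing one of
     * these around, rather than a bare mbuf with a surprise m_nextpkt
     * chain, makes the "is this one packet or many?" question explicit
     * in the type.
     */
    struct mbuf_queue {
            struct mbuf     *mq_head;
            struct mbuf     *mq_tail;
            int              mq_len;
    };

    static __inline void
    mbq_init(struct mbuf_queue *q)
    {
            q->mq_head = q->mq_tail = NULL;
            q->mq_len = 0;
    }

    static __inline void
    mbq_enqueue(struct mbuf_queue *q, struct mbuf *m)
    {
            m->m_nextpkt = NULL;
            if (q->mq_tail == NULL)
                    q->mq_head = m;
            else
                    q->mq_tail->m_nextpkt = m;
            q->mq_tail = m;
            q->mq_len++;
    }

    static __inline struct mbuf *
    mbq_dequeue(struct mbuf_queue *q)
    {
            struct mbuf *m;

            if ((m = q->mq_head) != NULL) {
                    if ((q->mq_head = m->m_nextpkt) == NULL)
                            q->mq_tail = NULL;
                    m->m_nextpkt = NULL;
                    q->mq_len--;
            }
            return (m);
    }

The driver's receive path could fill one of these under its own lock and
then hand the whole queue to the bridge or stack in one go, amortizing the
lock operations over the batch.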

Before committing to an approach based on short work queues over single
packets, I'd definitely want to see observable performance improvements in
some example cases, though.  I think it's also worth continuing to follow
down our current locking path to make sure we have a decent first cut
MPSAFE system where we have well documented synchronization semantics.

Something I'd like to explore in detail is a careful comparison of
performing delivery to completion in the interrupt thread vs using a set
of netisr threads pinned to CPUs.  Avoiding lots of context switches is a
good thing, but I don't have any measurements that suggest at what layer
in the inbound packet path we get the most coalescing: we know interrupt
mitigation and coalescing work well at a high level, but are we
currently seeing substantial coalescing due to long-running interrupt
threads, handoffs to netisrs, etc.?  Some decent measurements with counters
at each of the hand-off points might go a long way.
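
Roughly what I mean by counters at the hand-off points is something like
the following -- the counter names and the debug.handoff sysctl node are
made up purely for illustration:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    /*
     * Hypothetical hand-off counters: bump the "batches" counter once per
     * hand-off and add the batch size to "packets", so packets/batches
     * gives the average coalescing at each stage.
     */
    static unsigned long handoff_ithread_batches;
    static unsigned long handoff_ithread_packets;
    static unsigned long handoff_netisr_batches;
    static unsigned long handoff_netisr_packets;

    SYSCTL_NODE(_debug, OID_AUTO, handoff, CTLFLAG_RW, 0,
        "packet hand-off coalescing counters");
    SYSCTL_ULONG(_debug_handoff, OID_AUTO, ithread_batches, CTLFLAG_RD,
        &handoff_ithread_batches, 0, "batches handed off by ithreads");
    SYSCTL_ULONG(_debug_handoff, OID_AUTO, ithread_packets, CTLFLAG_RD,
        &handoff_ithread_packets, 0, "packets handed off by ithreads");
    SYSCTL_ULONG(_debug_handoff, OID_AUTO, netisr_batches, CTLFLAG_RD,
        &handoff_netisr_batches, 0, "batches drained by netisr");
    SYSCTL_ULONG(_debug_handoff, OID_AUTO, netisr_packets, CTLFLAG_RD,
        &handoff_netisr_packets, 0, "packets drained by netisr");

Comparing the packets/batches ratios at the driver->ithread and
ithread->netisr boundaries would show where the coalescing actually
happens under load.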

> Most of my experience with FreeBSD (from the last 1/2 year or so of
> looking at (and changing a few things) the code) is in the area of the
> low level network drivers (em) and some of the lower stack layers.  This
> is why I have focused on the bridging data path to compare the
> performance.  I must admit that I don't know exactly what code changes
> are going on in the stack, but if fine-grained locking means a (large) 
> increase in the number of mutexes throughout the stack, I am quite
> concerned about the performance of the whole system on P4/Xeons.  With
> fine-grained locking I think that the cost of individual functions will
> go up (a lot on the Xeons :( ), but the overall performance may still be
> better because multiple threads can do work simultaneously if there is
> nothing else for the other processors to do.  What I am concerned about
> is that if you have a dual-xeon system with enough kernel (stack) work
> to keep one processor busy, and enough user-space work to keep the other
> 3 processors busy on 4.7, what will happen on 5.x? 

I would think that multi-directional bridging would be a win due to real
latency improvements from parallelism in the kernel.  Likewise, I would
expect SMP application/kernel scenarios such as web servers on SMP boxes
to benefit through substantially reduced contention.  Giant and a
non-parallel kernel hurt us in a lot of these scenarios.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Senior Research Scientist, McAfee Research