RE: 4.7 vs 5.2.1 SMP/UP bridging performance

From: Gerrit Nagelhout <gnagelhout_at_sandvine.com>
Date: Fri, 7 May 2004 14:32:59 -0400

Scott Long wrote:
> Robert Watson wrote:
> > On Fri, 7 May 2004, Brad Knowles wrote:
> > 
> > 
> >>At 10:55 PM -0400 2004/05/06, Robert Watson wrote:
> >>
> >>
> >>> On occasion, I've had conversations with Peter Wemm about providing HAL
> >>> modules with optimized versions of various common routines for specific
> >>> hardware platforms.  However, that would require us to make a trade-off
> >>> between the performance benefits of inlining and the performance benefits
> >>> of a HAL module...
> >>
> >>	I'm confused.  Couldn't you just do this sort of stuff as
> >>conditional macros, which would have both benefits? 
> > 
> > 
> > Well, the goal of introducing HAL modules would be that you don't have to
> > recompile the kernel in order to perform local hardware-specific
> > optimization of low level routines.  I.e., you could substitute faster
> > implementations of zeroing, synchronization, certain math routines, etc
> > based on the CPU discovered at run-time.  While you can have switch
> > statements, etc, it's faster just to relink the kernel to use the better
> > implementation for the available CPU.  However, if you do that, you still
> > end up with the function call cost, which might well outweigh the
> > benefits of specialization.
> > 
> > Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> > robert_at_fledge.watson.org      Senior Research Scientist, McAfee Research
> > 
> 
> It really depends on how you link the HAL module in.  Calling indirectly
> through function pointers is pretty darn slow, and I suspect that the
> long pipeline of a P4 makes this even worse.  Switching to a better
> instruction might save you 20 cycles, but the indirect call to do it
> might cost you 30, and that assumes that the branched instruction stream
> is still in the L1 cache and that twiddling %esp and %ebp causes no
> pipeline stalls itself.  Even without the indirect jump, all of the
> housekeeping that goes into making a function call might drown out most
> of the benefits.  The only way that this might make sense is if you move
> the abstraction upwards and have it encompass more common code, or do
> some sort of self-modifying code scheme early in the boot.  The
> alternative might be to have the HAL be a compile-time option like Brad
> hinted at.
> 
> Scott
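
For anyone following along, the kind of function-pointer dispatch being
discussed looks roughly like the sketch below.  It is a made-up userland
illustration (the hal_ops struct and its contents are invented, not
anything in the tree), but it shows where the indirect call that Scott is
worried about comes from.

#include <stdio.h>
#include <string.h>

/*
 * Hypothetical HAL: a table of function pointers filled in once at boot,
 * based on the CPU that was detected.  Every later call goes through an
 * indirect jump, which is the cost being debated above.
 */
struct hal_ops {
	void	*(*hal_memcpy)(void *dst, const void *src, size_t len);
};

/* A real HAL would also have, e.g., an SSE2 copy to select on a P4. */
static void *
memcpy_generic(void *dst, const void *src, size_t len)
{
	return (memcpy(dst, src, len));
}

static struct hal_ops hal = { .hal_memcpy = memcpy_generic };

int
main(void)
{
	char src[64] = "hello", dst[64];

	/*
	 * Indirect call: a load of the pointer plus a call through a
	 * register, versus the inlined copy the compiler could have
	 * emitted here if the implementation were known at compile time.
	 */
	hal.hal_memcpy(dst, src, sizeof(src));
	printf("%s\n", dst);
	return (0);
}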

The biggest problem I still see with all of this is that even if I could
compile the kernel for the P4, under SMP there is still no fast locking
mechanism in place (that I am aware of, although I am researching that).
I ran a few more tests and did some more calculations to determine the
impact of (removing) mutexes, and here is what I found:
for UP, I was able to get 850kpps, which is 3294 cycles/packet (at 2.8 GHz);
for SMP, it was 500kpps, which is 5600 cycles/packet, or an additional
2306 cycles/packet, which presumably goes mostly towards the atomic
locked operations.  At ~120 cycles/lock extra for SMP, this means that
there should be around 19 atomic operations per packet.  After getting
rid of one mutex (from IF_DEQUEUE; this is not safe, but fun to try), the
performance went to 530kpps, or 5283 cycles/packet.  This is a savings
of ~317 cycles per packet.
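
For anyone who wants to sanity-check the ~120 cycles/lock figure on their
own hardware, a rough userland measurement of a LOCK-prefixed operation
looks something like the sketch below (only an illustration; it ignores
rdtsc and loop overhead and assumes the TSC runs at the nominal clock).

#include <stdio.h>

static inline unsigned long long
rdtsc(void)
{
	unsigned int lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return (((unsigned long long)hi << 32) | lo);
}

int
main(void)
{
	volatile int counter = 0;
	const int iters = 1000000;
	unsigned long long start, end;
	int i;

	start = rdtsc();
	for (i = 0; i < iters; i++) {
		/* The same kind of bus-locked RMW a mutex acquire does. */
		__asm__ __volatile__("lock; addl $1, %0"
		    : "+m" (counter) : : "cc");
	}
	end = rdtsc();

	printf("~%.1f cycles per locked add\n",
	    (double)(end - start) / iters);
	return (0);
}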

After a quick look through the bridge code path, I found the following
atomic operations (I probably missed some, and might have some that don't
always lock, but the total seems about right):

em_process_receive_interrupts (EM_LOCK)
bus_dma ?
mb_alloc ? (MBP_PERSISTENT flag is set, where is this first locked?)
bridge_in (BDG_LOCK)
if_handoff (IF_LOCK)
em_start (EM_LOCK)
IF_DEQUEUE (IF_LOCK)
m_free (atomic_cmpset_int)
m_free (atomic_subtract_int)

At 2 atomic operations per mutex (one to acquire, one to release), this
adds up to about 16 atomic operations per packet.
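
On the uncontested path, each of those acquires and releases is a single
locked compare-and-swap.  Here is a toy userland model of the tally (not
the kernel's mutex code, just its shape; atomic_cmpset_int below is a
hand-rolled userland stand-in for the kernel primitive of the same name).

/* Toy model: count the LOCK-prefixed operations on the bridge path above. */
#include <stdio.h>

static unsigned long atomic_ops;	/* LOCK-prefixed ops issued */

static int
atomic_cmpset_int(volatile int *dst, int expect, int src)
{
	unsigned char res;

	__asm__ __volatile__("lock; cmpxchgl %3, %1; sete %0"
	    : "=q" (res), "+m" (*dst), "+a" (expect)
	    : "r" (src)
	    : "cc");
	atomic_ops++;
	return (res);
}

struct toy_mtx {
	volatile int	owner;		/* 0 = unowned */
};

static void
toy_mtx_lock(struct toy_mtx *m, int tid)
{
	/* Uncontested acquire: a single locked cmpxchg. */
	while (atomic_cmpset_int(&m->owner, 0, tid) == 0)
		;			/* a real mutex would spin/block */
}

static void
toy_mtx_unlock(struct toy_mtx *m, int tid)
{
	/* Uncontested release: one more locked cmpxchg. */
	atomic_cmpset_int(&m->owner, tid, 0);
}

int
main(void)
{
	/*
	 * Seven lock/unlock pairs from the list above: EM_LOCK twice,
	 * BDG_LOCK, IF_LOCK twice, and the bus_dma/mb_alloc question
	 * marks counted as one each.
	 */
	struct toy_mtx m[7] = { { 0 } };
	int i;

	for (i = 0; i < 7; i++) {
		toy_mtx_lock(&m[i], 1);
		toy_mtx_unlock(&m[i], 1);
	}
	atomic_ops += 2;	/* m_free: atomic_cmpset_int + atomic_subtract_int */

	printf("atomic ops per packet: %lu\n", atomic_ops);	/* prints 16 */
	return (0);
}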

I think that some of the changes that Robert mentioned before, about
putting mbufs in a list before releasing the lock (roughly the pattern
sketched below), should help a lot for the Xeons.  I am willing to try
out some of these changes (both testing for performance and making the
actual code changes) because we can't switch over to 5.x until the
performance is back up to where 4.7 was.
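
My understanding of that change, which may be off, is something like the
sketch below: grab the queue lock once, pull a batch of packets onto a
local list, drop the lock, and only then process them, so the per-packet
lock/unlock pair becomes a per-batch pair.  The names (struct pkt,
process_pkt) and the pthread mutex are made up for a userland
illustration; this is not the actual mbuf code.

/* cc -o batch batch.c -lpthread */
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

struct pkt {
	struct pkt	*next;
	int		 id;		/* stand-in for the mbuf contents */
};

struct pktq {
	pthread_mutex_t	 mtx;
	struct pkt	*head;
};

/* Stand-in for the real per-packet work (bridge_in/if_start/etc.). */
static void
process_pkt(struct pkt *p)
{
	printf("packet %d\n", p->id);
}

static void
drain_queue_batched(struct pktq *q)
{
	struct pkt *batch, *p;

	/* One lock/unlock pair per batch instead of one per packet. */
	pthread_mutex_lock(&q->mtx);
	batch = q->head;
	q->head = NULL;
	pthread_mutex_unlock(&q->mtx);

	/* Process the whole chain without holding the queue lock. */
	while ((p = batch) != NULL) {
		batch = p->next;
		process_pkt(p);
	}
}

int
main(void)
{
	struct pkt p2 = { NULL, 2 }, p1 = { &p2, 1 };
	struct pktq q;

	pthread_mutex_init(&q.mtx, NULL);
	q.head = &p1;
	drain_queue_batched(&q);
	pthread_mutex_destroy(&q.mtx);
	return (0);
}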

Most of my experience with FreeBSD (from the last half year or so of
reading, and changing a few things in, the code) is in the area of
the low-level network drivers (em) and some of the lower stack layers.
This is why I have focused on the bridging data path to compare the
performance.  I must admit that I don't know exactly what code changes
are going on in the stack, but if fine-grained locking means a (large)
increase in the number of mutexes throughout the stack, I am quite
concerned about the performance of the whole system on P4/Xeons.
With fine-grained locking I think that the cost of individual functions
will go up (a lot on the Xeons :( ), but the overall performance may still
be better because multiple threads can do work simultaneously if there
is nothing else for the other processors to do.  What I am concerned
about is this: if you have a dual-Xeon system with enough kernel (stack)
work to keep one processor busy, and enough user-space work to keep the
other 3 processors busy on 4.7, what will happen on 5.x?

Thanks,

Gerrit