On Fri, 7 May 2004, Gerrit Nagelhout wrote:

> The biggest problem I still see with all of this, is that even if I
> could compile the kernel for the P4, under SMP there is still no fast
> locking mechanism in place (that I am aware of, although I am
> researching that).  I ran a few more tests and did some more
> calculations to determine the impact of (removing) mutexes, and here
> is what I found: for UP, I was able to get 850kpps, which is 3294
> cycles/packet (at 2.8 GHz); for SMP, it was 500kpps, which is 5600
> cycles/packet, or an additional 2306 cycles/packet, which presumably
> goes mostly towards the atomic locked operations.  At ~120 cycles/lock
> extra for SMP, this means that there should be around 19 atomic
> operations per packet.  After getting rid of one mutex (from
> IF_DEQUEUE -- this is not safe, but fun to try), the performance went
> to 530kpps, or 5283 cycles/packet.  This is a savings of ~317 cycles
> per packet.

Are you doing a uni-directional packet stream or a bi-directional packet
stream?

Speaking of unsafe but fun to try: at one point I did experiment with
ifdef'ing out all of the locking macros and just using Giant, and I found
that the inability to preempt properly resulted in substantial latency
problems, which in turn substantially slowed down performance measurements
for stream protocols like TCP.

> After a quick look through the bridge code path, I found the following
> atomic operations (I probably missed some, and might have some that
> don't always lock, but the total seems about right):
>
> em_process_receive_interrupts (EM_LOCK)
> bus_dma ?
> mb_alloc ? (MBP_PERSISTENT flag is set, where is this first locked?)

Actually, I think the memory allocation code is one of the areas where we
need to think very seriously about the current level of locking -- perhaps
more so than the other locking that's around.  I've noticed a pretty high
number of locking operations per memory allocation, and we perform a lot
of allocations.  With mbuf allocation moving to UMA, we'll be able to run
statistics on, and optimize, just one allocator rather than two.

> bridge_in (BDG_LOCK)
> if_handoff (IF_LOCK)
> em_start (EM_LOCK)
> IF_DEQUEUE (IF_LOCK)
> m_free (atomic_cmpset_int)
> m_free (atomic_subtract_int)

Same goes here: I suspect we're doing more locking than we should for
memory allocation.  Could you turn on mutex profiling and take a look at
where the locking operations are actually taking place?

I'm not familiar with our interrupt routine and interrupt thread
scheduling behavior, but if we're not yet pinning interrupt threads to
CPUs while they run, we should be.  I seem to recall that SCHED_ULE avoids
migrating interrupt threads, but I haven't checked lately.

> At 2 locks/mutex this adds up to about 16 atomic operations per packet.
>
> I think that some of the changes that Robert mentioned before about
> putting mbufs in a list before releasing the lock should help a lot for
> the Xeons.  I am willing to try out some of these changes (both testing
> for performance, and making the actual code changes) because we can't
> switch over to 5.x until the performance is back up to where 4.7 was.

BTW, on a related note, I've also been exploring passing packet chains
into a number of routines that currently accept only single packets.
This raises more of the "retain ordering" spectre, but might also have
some benefits.
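To make the "mbufs in a list before releasing the lock" idea a bit more
concrete, here is a minimal sketch of how it might look in a transmit start
routine.  This is illustrative only, not the real em(4) code: the function
name is made up, error and queue-full handling are omitted, and it assumes
the 5.x-era ifqueue macros (IF_LOCK/IF_UNLOCK and the unlocked _IF_DEQUEUE)
from <net/if_var.h>.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

static void
xx_start_batched(struct ifnet *ifp)
{
        struct mbuf *m, *head, *tail;

        head = tail = NULL;

        /* Pay for the queue lock once per burst rather than once per packet. */
        IF_LOCK(&ifp->if_snd);
        for (;;) {
                _IF_DEQUEUE(&ifp->if_snd, m);
                if (m == NULL)
                        break;
                m->m_nextpkt = NULL;
                if (tail == NULL)
                        head = m;
                else
                        tail->m_nextpkt = m;
                tail = m;
        }
        IF_UNLOCK(&ifp->if_snd);

        /* Process the private chain with the queue lock dropped. */
        while ((m = head) != NULL) {
                head = m->m_nextpkt;
                m->m_nextpkt = NULL;
                /* ... encapsulate m and hand it to the TX descriptor ring ... */
        }
}

The point is that the cost of the queue mutex is amortized over the whole
burst rather than paid per packet, which is exactly where the Xeon's
expensive locked operations hurt the most.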
If we do explore this model seriously, I think we need to introduce an
'mbuf queue' structure/object that can be passed around, so that we can
use type safety to avoid nasties (a rough sketch of such an object is at
the end of this message).  I've seen too many bugs where code assumes
m_nextpkt is NULL when it's not, or doesn't carefully maintain
synchronization and coherency of fields across sleeping in 4.x, etc.
Before committing to an approach based on short work queues rather than
single packets, though, I'd definitely want to see observable performance
improvements in some example cases.  I also think it's worth continuing
down our current locking path to make sure we have a decent first-cut
MPSAFE system with well documented synchronization semantics.

Something I'd like to explore in detail is a careful comparison of
performing delivery to completion in the interrupt thread vs. using a set
of netisr threads pinned to CPUs.  Avoiding lots of context switches is a
good thing, but I don't have any measurements that suggest at what layer
in the inbound packet path we get the most coalescing: we know interrupt
mitigation and coalescing work well at a high level, but are we currently
seeing substantial coalescing due to long-running interrupt threads,
hand-offs to netisrs, etc.?  Some decent measurements with counters at
each of the hand-off points might go a long way.

> Most of my experience with FreeBSD (from the last 1/2 year or so of
> looking at (and changing a few things in) the code) is in the area of
> the low-level network drivers (em) and some of the lower stack layers.
> This is why I have focused on the bridging data path to compare the
> performance.  I must admit that I don't know exactly what code changes
> are going on in the stack, but if fine-grained locking means a (large)
> increase in the number of mutexes throughout the stack, I am quite
> concerned about the performance of the whole system on P4/Xeons.  With
> fine-grained locking I think that the cost of individual functions will
> go up (a lot on the Xeons :( ), but the overall performance may still
> be better because multiple threads can do work simultaneously if there
> is nothing else for the other processors to do.  What I am concerned
> about is that if you have a dual-Xeon system with enough kernel (stack)
> work to keep one processor busy, and enough user-space work to keep the
> other 3 processors busy on 4.7, what will happen on 5.x?

I would think that multi-directional bridging would be a win due to real
latency improvements from parallelism in the kernel.  Likewise, I would
expect SMP application/kernel scenarios such as web servers on SMP boxes
to benefit through substantially reduced contention.  Giant and a
non-parallel kernel hurt us in a lot of these scenarios.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert_at_fledge.watson.org      Senior Research Scientist, McAfee Research
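For reference, here is a minimal sketch of the kind of "mbuf queue" object
discussed above, chained through m_nextpkt.  The structure and function
names are invented for illustration -- this is not an existing kernel
API -- and locking is deliberately left to the caller.

#include <sys/param.h>
#include <sys/mbuf.h>

struct mbuf_queue {
        struct mbuf     *mq_head;       /* first packet in the chain */
        struct mbuf     *mq_tail;       /* last packet in the chain */
        int              mq_len;        /* number of packets queued */
};

static __inline void
mbufq_init(struct mbuf_queue *mq)
{

        mq->mq_head = mq->mq_tail = NULL;
        mq->mq_len = 0;
}

static __inline void
mbufq_enqueue(struct mbuf_queue *mq, struct mbuf *m)
{

        m->m_nextpkt = NULL;
        if (mq->mq_tail == NULL)
                mq->mq_head = m;
        else
                mq->mq_tail->m_nextpkt = m;
        mq->mq_tail = m;
        mq->mq_len++;
}

static __inline struct mbuf *
mbufq_dequeue(struct mbuf_queue *mq)
{
        struct mbuf *m;

        m = mq->mq_head;
        if (m != NULL) {
                mq->mq_head = m->m_nextpkt;
                if (mq->mq_head == NULL)
                        mq->mq_tail = NULL;
                m->m_nextpkt = NULL;
                mq->mq_len--;
        }
        return (m);
}

Routines like if_handoff() or bridge_in() could then take a struct
mbuf_queue * instead of a single struct mbuf *, which makes the "is
m_nextpkt meaningful here?" question explicit in the type rather than an
unstated assumption.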