Luigi Rizzo wrote:
> On Wed, May 05, 2004 at 07:38:38PM -0400, Gerrit Nagelhout wrote:
> > Robert Watson wrote:
> ...
> > > Getting polling and SMP to play nicely would be a very good thing,
> > > but isn't something I currently have the bandwidth to work on.
> > > Don't suppose we could interest you in that? :-)
> ...
> > I won't be able to work on that feature anytime soon, but if some
> > prototyping turns out to have good results, and the mutex cost issues
> > are worked out, it's quite likely that we'll try to implement it. The
> > original author of the polling code (Luigi?) may have some input on
> > this as well.
>
> ENOTIME at the moment, but surely i would like to figure out first some
> locking issues, e.g. related to the clustering of packets to reduce the
> number of locking ops.
> The other issue is the partitioning of work -- no point in having
> multiple polling loops work on the same interface. Possibly we might
> try to restructure the processing in the network stack by using one
> processor/polling loop that quickly determines the tasks that need work
> and then posts the results so that other processors can grab one task
> each. Kind of a dataflow architecture, in a sense.
>
> In any case, I am really impressed by the numbers Gerrit achieved in
> the UP/4.7 case -- i never went above 800kpps, though in my case i
> think i was limited by the PCI bus (64 bit, 100MHz) or possibly even
> the chipset.

The numbers I achieved on 4.7 were on a 2.8GHz Xeon with 64-bit/100MHz
PCI-X. The setup was a 2-port bridge, with each port receiving &
transmitting up to 600kpps (1.2Mpps aggregate). On this particular
system, I've seen it as high as 700kpps when using 133MHz and enabling
hardware prefetch in the em driver (this is not a supported feature
though and might hang the chip). Using a 4-port bridge, the aggregate
can go a little higher (~1.6Mpps).

In order to get this performance, I had to make quite a few tweaks to
both the em driver and the way mbufs/clusters are used. The first
bottleneck (the one 4.7 currently has) is due to the PCI accesses in
the em driver. The following changes will make this better:

1) Don't update the tail pointer for every packet added to the receive
   ring. Doing this only once every 64 packets or so eliminates a lot
   of PCI accesses. (I later noticed that the Linux driver supplied by
   Intel already does this.)

2) Same as 1), but for the transmit ring. This one is a little trickier
   because it may add some latency. I only updated it after "n" packets,
   and at the end of em_poll.

3) Have the transmitter write back only every nth descriptor instead of
   every one. This makes the transmit ring cleanup code a bit more
   expensive, but it's well worth it.

After making these changes, the bottleneck typically becomes non-cached
memory accesses. Changing em_get_buf to use the mcl_pool cache in
uipc_mbuf.c makes the receive path a little faster, but it's still not
optimal. Ideally, when a new buffer (mbuf & cluster) is added to the
receive ring, it shouldn't have to be pulled into the cache until the
packet has been filled in and is ready to be processed. Because mbufs
are kept on a linked list, each one gets pulled into the cache just to
read its next pointer. To avoid this, I created a cached array (stack)
of cluster pointers (not attached to mbufs). When adding clusters to
the receive ring, the cluster itself doesn't need to be accessed at
all, saving a memory read. Once the packet is ready to be processed, an
mbuf is allocated and the cluster is attached to it.
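Roughly, the receive-side refill then looks something like the sketch
below. This is only a minimal illustration of the data structure, not
the actual patch; cl_stack, cl_push, cl_pop, rx_refill and the
descriptor layout are made-up names.

    #include <stddef.h>
    #include <stdint.h>

    #define CL_STACK_SIZE 2048

    struct cl_stack {
            void    *cls_ptr[CL_STACK_SIZE];  /* free cluster pointers */
            int      cls_top;                 /* number of entries on the stack */
    };

    /* Fictitious receive descriptor: the NIC only needs a DMA address. */
    struct rx_desc {
            uint64_t buf_addr;
            uint32_t status;
    };

    static void
    cl_push(struct cl_stack *s, void *cl)
    {
            if (s->cls_top < CL_STACK_SIZE)
                    s->cls_ptr[s->cls_top++] = cl;
    }

    static void *
    cl_pop(struct cl_stack *s)
    {
            return (s->cls_top > 0 ? s->cls_ptr[--s->cls_top] : NULL);
    }

    /*
     * Refill one ring slot.  The cluster is never dereferenced here;
     * only its (pretend) physical address is written into the
     * descriptor, so the cluster's cache lines stay cold until the NIC
     * has filled it and the packet is actually processed.
     */
    static int
    rx_refill(struct cl_stack *s, struct rx_desc *d)
    {
            void *cl;

            if ((cl = cl_pop(s)) == NULL)
                    return (-1);                    /* out of clusters */
            d->buf_addr = (uint64_t)(uintptr_t)cl;  /* stand-in for vtophys()/busdma */
            d->status = 0;
            return (0);
    }

In this sketch, cl_push() is what the "free the cluster" path would
call instead of handing the cluster back through an mbuf, and the
array itself is small and hot enough to stay cached.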
In the bridging code, the number of mbufs was small enough to always be
in cache, and therefore there was only one random memory lookup per
packet by the CPU. This also made it possible to create much larger
receive rings (2048 entries) in order to avoid dropping packets under
high loads when the processor got "distracted" for a bit. Also, this
kind of performance was only possible using polling. One advantage that
UP has (at least on 4.7) is the idle_poll feature.

One more optimization that I hacked together once, but have not been
able to implement properly, is to use 4M pages for the mbufs and
clusters. I found that software prefetches can help performance
significantly, but on the i386 you can only prefetch something that is
already in the TLB. Since there are only 64 entries, the odds of a
cluster being in the TLB when its packet arrives are very small. With
4M pages, it only takes a few pages to map all the mbufs & clusters in
the system. The advantages are that you can prefetch, and that you
avoid TLB misses & thrashing under high loads. (A rough sketch of the
prefetch idea is included below.)

After all these changes, the bottleneck becomes raw CPU cycles. That's
why I'd like to get multiple processors doing polling simultaneously.

I've always meant to submit some of these changes back to FreeBSD, but
wasn't sure how much interest there would be. If anyone is interested
in helping me out with this, let me know and I will try to get that
process started.

Gerrit
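A minimal sketch of the prefetch-next-cluster idea, assuming GCC's
__builtin_prefetch(); poll_ring and process_packet are made-up names,
and the prefetch only pays off if the cluster pool is mapped with large
pages so the target page is already in the TLB:

    /* Stand-in for the real per-packet work (bridging, ether_input(), ...). */
    static void
    process_packet(void *cl)
    {
            (void)cl;
    }

    /*
     * Walk the packets harvested from the receive ring, prefetching the
     * first cache line of the next cluster while the current one is
     * being processed.  __builtin_prefetch(addr, 0, 0) is a read
     * prefetch with no temporal locality (roughly prefetchnta on
     * i386/SSE).
     */
    static void
    poll_ring(void **clusters, int count)
    {
            int i;

            for (i = 0; i < count; i++) {
                    if (i + 1 < count)
                            __builtin_prefetch(clusters[i + 1], 0, 0);
                    process_packet(clusters[i]);
            }
    }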