On Tue, 8 Apr 2008, Darren Reed wrote:

> Is there a performance analysis of the copy vs zerocopy available? (I don't
> see one in the paper, just a "to do" item.)
>
> The numbers I'm interested in seeing are how many Mb/s you can capture
> before you start suffering packet loss. This needs to be done with
> sequenced packets so that you can observe gaps in the sequence captured.

We've done some analysis, and a couple of companies have the zero-copy BPF
code deployed. I hope to generate a more detailed analysis before the
developer summit so that we can review it at BSDCan.

The basic observation is that, for quite a few types of network links, the
win isn't in reduced packet loss per se, but in reduced CPU use, freeing up
CPU for other activities. There are several sources of win:

- Reduced system call overhead: as load increases, the number of system
  calls goes down, especially once a two-CPU pipeline gets going.

- Reduced memory access: especially for larger buffer sizes, zero-copy
  avoids filling the cache twice (first in copyout(), then again when
  userspace reads the buffer).

- Reduced lock contention: only a single thread, the device driver ithread,
  acquires the BPF descriptor's lock, and it no longer contends with the
  user thread.

One interesting, and in retrospect reasonable, side effect is that user CPU
time goes up in the SMP scenario, as cache misses on the BPF buffer move
from the read() system call to userspace. And, as you observe, you have to
use somewhat larger buffer sizes: in the previous scheme there were three
buffers (two kernel buffers and a user buffer), whereas now there are simply
two kernel buffers shared directly with user space.

The originally committed version had a problem in that it allowed only one
kernel buffer to be "owned" by userspace at a time, which could lead to
excess calls to select(); this has now been corrected, so anyone who has run
performance benchmarks should update to the new code and re-run them.
I don't have numbers off-hand, but improvements in the 5%-25% range appeared
in some of the measurements, and I'd like to think that the recent fix will
improve on that further. For 10Gb/s, something we need to think about is how
to modify the structure of BPF to allow different BPF devices to attach to
different input queues...

Robert N M Watson
Computer Laboratory
University of Cambridge

Received on Tue Apr 08 2008 - 10:28:19 UTC