Re: HEADS UP: zerocopy bpf commits impending

From: Robert Watson <rwatson_at_FreeBSD.org>
Date: Tue, 8 Apr 2008 13:28:18 +0100 (BST)
On Tue, 8 Apr 2008, Darren Reed wrote:

> Is there a performance analysis of the copy vs zerocopy available? (I don't 
> see one in the paper, just a "to do" item.)
>
> The numbers I'm interested in seeing are how many Mb/s you can capture 
> before you start suffering packet loss.  This needs to be done with 
> sequenced packets so that you can observe gaps in the sequence captured.

We've done some analysis, and a couple of companies have the zero-copy BPF 
code deployed.  I hope to generate a more detailed analysis before the 
developer summit so we can review it at BSDCan.  The basic observation is that 
for quite a few types of network links, the win isn't in packet loss per se, 
but in reduced CPU use, freeing up CPU for other activities.  There are a 
number of sources of win:

- Reduced system call overhead -- as load increases, the number of system calls
   goes down, especially if you get a two-CPU pipeline going.

- Reduced memory access, especially for larger buffer sizes, avoids filling
   the cache twice (first in copyout, then again in using the buffer in
   userspace).

- Reduced lock contention, as only a single thread, the device driver ithread,
   is acquiring the bpf descriptor's lock, and it's no longer contending with
   the user thread.

One interesting, and in retrospect reasonable, side effect is that user CPU 
time goes up in the SMP scenario, as cache misses on the BPF buffer move from 
the read() system call to userspace.  And, as you observe, you have to use 
somewhat larger buffer sizes, as in the previous scenario there were three 
buffers: two kernel buffers and a user buffer, and now there are simply two 
kernel buffers shared directly with user space.

The original committed version has a problem in that it allows only one kernel 
buffer to be "owned" by userspace at a time, which can lead to excess calls to 
select(); this has now been corrected, so if people have run performance 
benchmarks, they should update to the new code and re-run them.

I don't have numbers off-hand, but 5%-25% were numbers that appeared in some 
of the measurements, and I'd like to think that the recent fix will further 
improve that.

For 10gbps, something we need to think about is how to modify the structure of 
BPF to allow different BPF devices for different input queues...

Robert N M Watson
Computer Laboratory
University of Cambridge
Received on Tue Apr 08 2008 - 10:28:19 UTC