Re: HEADS UP: zerocopy bpf commits impending

From: Darren Reed <darrenr_at_freebsd.org>
Date: Tue, 08 Apr 2008 04:35:11 -0700
Robert Watson wrote:
> On Mon, 17 Mar 2008, Christian S.J. Peron wrote:
>
>> Just wanted to give a heads up that I plan to start merging the work 
>> located in the zerocopy bpf perforce branch.  We have been working on 
>> this project for about a year now and feel that it is ready to come 
>> into the tree.
>>
>> I will begin to merge hopefully today [assuming nobody has any 
>> concerns] or tomorrow.  Zerocopy bpf will be disabled by default, and 
>> can be enabled globally through the use of a sysctl variable. Once the 
>> kernel bits are in and we sort out a couple of minor nits in 
>> libpcap+tcpdump, we will be looking at getting our libpcap patches 
>> committed upstream.  I will post a patch for people to experiment 
>> with in the meantime after the kernel commits are complete.
>>
>> We do not anticipate this will have any effect on existing bpf 
>> consumers like libpcap, tcpdump etc., so if something breaks, it 
>> shouldn't have, and we need to know about it :)  We were pretty 
>> careful about preserving the ABI. The only exception to this is that 
>> netstat will need a recompile because the size of its bpf stats 
>> structure changed.
>>
>> So if there are any objections or concerns, now is the time to raise 
>> them.
>
> Per previous posts, interested parties can find the slides on the 
> design from the BSDCan 2007 developer summit here:
>
>   http://www.watson.org/~robert/freebsd/2007bsdcan/20070517-devsummit-zerocopybpf.pdf

Is there a performance analysis of the copy vs zerocopy available?
(I don't see one in the paper, just a "to do" item.)

The numbers I'm interested in seeing are how many Mb/s you can capture
before you start suffering packet loss.  This needs to be measured with
sequenced packets so that you can observe gaps in the captured sequence.
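
Something along these lines is what I have in mind (an untested
sketch; the address, port and packet layout are placeholders): the
sender stamps each UDP payload with an incrementing sequence number,
and the analysis side reports any gap it sees.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    struct sockaddr_in dst;
    unsigned char pkt[1024];
    uint32_t seq, wire;
    int s;

    s = socket(AF_INET, SOCK_DGRAM, 0);
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9999);                     /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr); /* placeholder addr */

    for (seq = 0;; seq++) {
        wire = htonl(seq);
        memcpy(pkt, &wire, sizeof(wire));       /* seq in first 4 bytes */
        if (sendto(s, pkt, sizeof(pkt), 0,
            (struct sockaddr *)&dst, sizeof(dst)) == -1)
            break;
    }
    close(s);
    return 0;
}

/* On the capture side, call this for each captured UDP payload; any
 * jump in the sequence is packet loss. */
void
check_seq(const unsigned char *payload)
{
    static uint32_t expect;
    uint32_t seq;

    memcpy(&seq, payload, sizeof(seq));
    seq = ntohl(seq);
    if (seq != expect)
        printf("gap: expected %u got %u (%u packets lost)\n",
            expect, seq, seq - expect);
    expect = seq + 1;
}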

I kind of experimented with this back in 2004:

http://mail-index.netbsd.org/tech-net/2004/05/02/0001.html
http://mail-index.netbsd.org/tech-net/2004/05/21/0001.html

Rather than map the user space memory into the kernel, I used mmap(2)
to access the kernel's buffer from user space and then did the ioctl
thing to move pointers.  I also played with making the primary buffer
smaller but having more alternate buffers, so that while one buffer
was mapped out to user space, one (or more) buffers were still
available to the kernel.
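
From memory, the user space side of that experiment looked roughly
like this (a sketch, not the actual patch - BIOCROTBUF is a stand-in
name for the buffer-rotation ioctl those patches added, not a stock
bpf(4) ioctl):

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BIOCROTBUF  _IO('B', 200)   /* hypothetical request number */

int
main(void)
{
    size_t buflen = 1 << 20;        /* assumed kernel buffer size */
    void *buf;
    int fd;

    if ((fd = open("/dev/bpf0", O_RDONLY)) == -1)
        return 1;
    /* Map the kernel's capture buffer instead of copying it out with
     * read(2).  (BIOCSETIF etc. omitted for brevity.) */
    buf = mmap(NULL, buflen, PROT_READ, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;
    for (;;) {
        /* Ask the kernel to hand us the filled buffer and keep
         * capturing into one of the alternate buffers. */
        if (ioctl(fd, BIOCROTBUF) == -1)
            break;
        /* ... walk the bpf records in 'buf' here ... */
    }
    munmap(buf, buflen);
    close(fd);
    return 0;
}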

Speed improvement?  Slight (less than 2%) in the testing I did.

Why only slight?

Because there's another factor here: how long it takes to process the
data in the buffer and free it up for the kernel.  Whatever time you
gain from having more buffer space available in the kernel, you lose
(in part) to the overhead of managing those buffers.

In the end I decided that the change, while interesting, didn't really
solve the problem, which was that the speed at which capturing could
effectively be done was bounded by the time spent analysing the data
captured.  If packets that you want to analyse arrive faster than you
can do the analysis, then you will drop packets - end of story.

So why isn't there a huge performance increase?  My $0.02...
When you use read(2) to get bpf data, you straight away transfer the
data from the kernel to the user space buffer, which immediately
frees up that buffer in the kernel for more capture.  When you share
the buffer between the kernel and user space, you either (1) delay
kernel access to that buffer while you process all the contents,
and if there are any bits that you want to keep, you need to copy
them out, or (2) do another copy from the shared buffer to a private
buffer, releasing contention for the shared buffer but again doing
a copy, so the end result is not much different.  The problem with
(1) is that you always have less buffer space available at the kernel
level for storing packet data than you do without that segment "held"
for user space activity.  So even if you do a write(2) of the buffer
used in (1) straight away, the turnaround time for the buffer is
delayed by however long your disk I/O takes to complete.
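
For comparison, the read(2) path I'm talking about is the usual loop
(a minimal sketch; /dev/bpf0 and em0 are just examples):

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/bpf.h>
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    struct bpf_hdr *bh;
    struct ifreq ifr;
    u_int buflen;
    char *buf, *p;
    ssize_t n;
    int fd;

    if ((fd = open("/dev/bpf0", O_RDONLY)) == -1)
        err(1, "open /dev/bpf0");
    memset(&ifr, 0, sizeof(ifr));
    strlcpy(ifr.ifr_name, "em0", sizeof(ifr.ifr_name)); /* example NIC */
    if (ioctl(fd, BIOCSETIF, &ifr) == -1)
        err(1, "BIOCSETIF");
    if (ioctl(fd, BIOCGBLEN, &buflen) == -1)
        err(1, "BIOCGBLEN");
    if ((buf = malloc(buflen)) == NULL)
        err(1, "malloc");

    for (;;) {
        /* This read(2) is the copy: the kernel's store buffer is
         * copied into 'buf' and is free for new packets as soon as
         * the syscall returns. */
        if ((n = read(fd, buf, buflen)) <= 0)
            break;
        /* Walk the bpf records packed into the buffer. */
        for (p = buf; p < buf + n;) {
            bh = (struct bpf_hdr *)p;
            /* analyse bh->bh_caplen bytes at p + bh->bh_hdrlen */
            p += BPF_WORDALIGN(bh->bh_hdrlen + bh->bh_caplen);
        }
    }
    free(buf);
    close(fd);
    return 0;
}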

And someone asked about packet capture direct to disk - too slow if
you do it through a vnode with an eye on 10G.  Heck, at 10G speeds,
you need to be handling 2.5GB/sec - can any affordable disk write that
fast?

Why 2.5GB/sec?
To successfully sniff a 10G link, you need two 10G NICs, for a
combined total of 20G incoming (remember, full duplex: 10G going in
each direction... and you thought plugging your single NIC into a
full-duplex monitor port on a switch was always enough... ha!).
That's 20Gb/s divided by 8 bits per byte, or 2.5GB/sec to move.

State-of-the-art packet capture has moved to hardware-assisted cards,
such as those from Endace:
http://www.endace.com/our-products/dag-network-monitoring-cards/ethernet

If you want 10G capture on FreeBSD, get drivers for those cards
written for FreeBSD.  Those cards are absolutely necessary on Linux to
get performance anywhere near FreeBSD's ;)

Darren