Re: New optimized soreceive_stream() for TCP sockets, proof of concept

From: Robert Watson <rwatson_at_FreeBSD.org>
Date: Sat, 3 Mar 2007 22:30:52 +0000 (GMT)

On Fri, 2 Mar 2007, Andre Oppermann wrote:

> Instead of the unlock-lock dance, soreceive_stream() pulls a properly sized 
> chunk of data (relative to the receive system call buffer space) from the 
> socket buffer, drops the lock, and gives copyout as much time as it needs.  
> In the meantime, the lower half can happily add as many new packets as it 
> wants without having to wait for a lock.  It also allows the upper and lower 
> halves to run on different CPUs without much interference.  There is an 
> unsolved nasty race condition in the patch, though.  When the socket closes 
> and we still have data around, or the copyout failed, it tries to put the 
> data back into the socket buffer, which is already gone by then, leading to 
> a panic.  Work is underway to find a reliable fix for this.  I wanted to get 
> this out to the community nonetheless to give it some more exposure.

I'll try to take a look at this in the next few days.

However, I find the description above of soreceive() a bit odd -- I'm pretty 
sure it doesn't do some of the things you're describing.  For example, 
soreceive() does release the locks acquired by the network input processing 
path while copying to user space: there should be no contention during the 
copyout(), only while processing the socket buffer between copyout() calls. 
This is possible because the socket receive sleep lock (not the mutex) holds 
sb_mb constant if it is non-NULL, making copyout() of sb_mb->m_data safe while 
not holding the socket buffer mutex in the current implementation.
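
A rough userspace sketch of that locking pattern follows.  It is only a 
model: struct mbuf_model, struct sockbuf_model, sb_append_model() and 
sb_receive_model() are hypothetical names, pthread mutexes stand in for both 
the socket buffer mutex and the receive sleep lock, and memcpy() stands in 
for copyout().  The real soreceive() handles far more cases than this.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structures, not the real ones. */
struct mbuf_model {
    struct mbuf_model *m_next;
    const char        *m_data;
    size_t             m_len;
};

struct sockbuf_model {
    pthread_mutex_t    sb_mtx;    /* models the socket buffer mutex */
    pthread_mutex_t    sb_sleep;  /* models the receive sleep lock */
    struct mbuf_model *sb_mb;     /* head of the buffered data */
    size_t             sb_cc;     /* bytes currently buffered */
};

/* Lower half: append at the tail under the mutex only.  A non-NULL
 * sb_mb head pointer is never replaced here. */
static void
sb_append_model(struct sockbuf_model *sb, struct mbuf_model *m)
{
    struct mbuf_model **pp;

    pthread_mutex_lock(&sb->sb_mtx);
    for (pp = &sb->sb_mb; *pp != NULL; pp = &(*pp)->m_next)
        ;
    m->m_next = NULL;
    *pp = m;
    sb->sb_cc += m->m_len;
    pthread_mutex_unlock(&sb->sb_mtx);
}

/* Upper half: copy up to len bytes.  The sleep lock is held for the
 * whole call; the mutex is dropped around each copy. */
static size_t
sb_receive_model(struct sockbuf_model *sb, char *dst, size_t len)
{
    size_t done = 0;

    pthread_mutex_lock(&sb->sb_sleep);
    pthread_mutex_lock(&sb->sb_mtx);
    while (done < len && sb->sb_mb != NULL) {
        struct mbuf_model *m = sb->sb_mb;
        size_t n = m->m_len < len - done ? m->m_len : len - done;

        pthread_mutex_unlock(&sb->sb_mtx);
        memcpy(dst + done, m->m_data, n);   /* stands in for copyout() */
        pthread_mutex_lock(&sb->sb_mtx);

        /* Only the holder of the sleep lock removes data from the
         * head, so the head must not have changed meanwhile. */
        assert(sb->sb_mb == m);
        m->m_data += n;
        m->m_len -= n;
        sb->sb_cc -= n;
        done += n;
        if (m->m_len == 0)
            sb->sb_mb = m->m_next;
    }
    pthread_mutex_unlock(&sb->sb_mtx);
    pthread_mutex_unlock(&sb->sb_sleep);
    return done;
}
```

In this model an appender only ever waits for the short bookkeeping sections 
between copies, never for a copy itself, and sb_cc shrinks only after each 
copy has finished.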

In my experience, soreceive() is an incredibly complicated function, and could 
stand significant simplification.  However, it has to be done very carefully 
for exactly this reason :-).  There are some existing bugs in soreceive(), one 
involving incorrect handling of interlaced I/O due to a label being in the 
wrong place, that we should resolve.

BTW, the point of not pulling the data out of the socket buffer until 
copyout() is complete is not so much to allow reverting on error as to avoid 
changing the advertised window size until the copy is done, since the data 
hasn't yet been delivered to user space.  copyout() can take a very long time 
to run, due to page faults, for example, and the socket buffer represents a 
maximum bound on in-flight traffic as specified by the application.  Whether 
this is a property we want to keep is another question, but I believe that's 
the rationale.
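
To make the window argument concrete, here is a small arithmetic sketch.  It 
assumes the advertised window is simply the free space in the receive buffer 
(high-water mark minus bytes buffered); the real TCP code also takes mbuf 
limits and window scaling into account.  struct rcvbuf_model and the 
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a receive buffer: limit and current usage. */
struct rcvbuf_model {
    size_t hiwat;   /* buffer size requested by the application */
    size_t cc;      /* bytes currently accounted to the buffer */
};

/* Simplified advertised window: free space in the buffer. */
static size_t
adv_window(struct rcvbuf_model sb)
{
    return sb.cc >= sb.hiwat ? 0 : sb.hiwat - sb.cc;
}

/* Data stays accounted to the buffer until the copy has finished, so
 * the window seen by the peer during the copy is unchanged. */
static size_t
window_during_copy_keep(struct rcvbuf_model sb, size_t n)
{
    (void)n;
    return adv_window(sb);
}

/* Data is detached before the copy starts, so the window opens by up
 * to n bytes while those bytes are still held in the kernel. */
static size_t
window_during_copy_detach(struct rcvbuf_model sb, size_t n)
{
    sb.cc -= n > sb.cc ? sb.cc : n;
    return adv_window(sb);
}

/* Worst-case bytes held in the kernel on behalf of the socket while a
 * copy of n detached bytes is stalled and the peer fills the window. */
static size_t
max_held_detach(struct rcvbuf_model sb, size_t n)
{
    return sb.hiwat + (n > sb.cc ? sb.cc : n);
}
```

With a full 64 KB buffer and a 32 KB read, the first strategy keeps the 
window at zero for the duration of the copy, while the second advertises 
32 KB immediately and can end up holding 96 KB for the socket if the copy 
stalls on a page fault.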

Robert N M Watson
Computer Laboratory
University of Cambridge

>
> The patch is here:
>
> http://people.freebsd.org/~andre/soreceive_stream-20070302.diff
>
> Any testing, especially on 10Gig cards, and feedback appreciated.
>
> -- 
> Andre
Received on Sat Mar 03 2007 - 21:30:53 UTC