Re: Much improved sendfile(2) kernel implementation

From: Robert Watson <rwatson_at_FreeBSD.org>
Date: Thu, 21 Sep 2006 14:59:07 +0100 (BST)

On Thu, 21 Sep 2006, Andre Oppermann wrote:

>> There should be unconditional M_NOWAIT. Oops, the M_DONTWAIT in the current 
>> code is incorrect. It is present since rev. 1.171. If the m_uiotombuf() 
>> fails the current code returns from syscall without error! Before rev. 
>> 1.171, there wasn't m_uiotombuf(), the mbuf header was allocated below, 
>> with correct wait argument.
>> 
>> The wait argument for m_uiotombuf() should be changed to M_WAITOK, but in a 
>> separate commit.
<snip>
>> This one should be M_WAITOK always. It is M_TRYWAIT (equal to M_WAITOK) in 
>> the current code.
>
> The reason why I changed the mbuf allocations with SS_NBIO is the rationale 
> of sendfile() and the performance evaluation that was done by alc_at_ students. 
> sendfile() has two flags which control its blocking behavior.  Non blocking 
> socket (SS_NBIO) and SF_NODISKIO.  The latter is necessary because file 
> reads or writes are normally not considered to be blocking.  The most 
> optimal sendfile() usage is with a single process doing accept(), parsing 
> and then sendfile that should never ever block on anything.  This way the 
> main process then can use kqueue for all the socket stuff and it can 
> transfer all sends that require disk I/O to a child process or thread to 
> provide a context for the read.  Meanwhile the main process is free to 
> accept further connections and to continue serving existing connections. 
> Having sendfile() block in mbuf allocation for the header, on sfbufs or 
> anything else is not desirable and must be avoided.  I know I'm extending 
> the traditional definition of SS_NBIO a bit but it's fully in line with the 
> semantics and desired operational behavior of sendfile().  The paper by 
> alc_at_'s students clearly identifies this as the main property of a sendfile 
> implementation besides its zero copy nature.

The semantics with regard to waiting are a bit confusing, but the existing 
model has a fairly specific meaning that has some benefits.  Normally we have 
three dispositions for a network I/O operation:

(1) Fully blocking -- the default disposition.  The operation may block for
     several reasons, most commonly insufficient buffer space/data in the
     socket buffer, insufficient memory for the kernel to perform the
     operation (usually mbufs), or a user space page fault while reading or
     writing the data.

(2) Non-blocking -- SS_NBIO, MSG_NBIO, etc.  The operation will not block if
     there is insufficient data/buffer space.  Typically, this is aligned with
     select()/poll()/kqueue()'s notion of data or space.

(3) Non-waiting -- MSG_DONTWAIT.  The operation will not sleep in kernel for
     any reason, either as part of I/O blocking, or for memory allocation.  It
     may still sleep if a page fault occurs, but as kernel senders send using
     pinned kernel memory, this isn't an issue.
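
To make the three dispositions concrete, here is a minimal sketch of them as a
policy table.  This is a model for discussion, not kernel code: the enum, the
struct, and policy_for() are names invented for illustration and exist nowhere
in the tree.

```c
#include <stdbool.h>

/* Hypothetical model of the three dispositions; not real kernel types. */
enum io_disposition { IO_BLOCKING, IO_NONBLOCKING, IO_NONWAITING };

struct sleep_policy {
    bool may_wait_for_sockbuf;  /* may wait for socket buffer space/data */
    bool may_sleep_for_memory;  /* may sleep in allocation, e.g. mbufs */
};

struct sleep_policy
policy_for(enum io_disposition d)
{
    /* (1) Fully blocking: both kinds of wait are allowed. */
    struct sleep_policy p = { true, true };

    /* (2) Non-blocking: no wait on the socket buffer, memory sleep is fine. */
    if (d == IO_NONBLOCKING)
        p.may_wait_for_sockbuf = false;

    /* (3) Non-waiting: no sleeping in the kernel for either reason. */
    if (d == IO_NONWAITING) {
        p.may_wait_for_sockbuf = false;
        p.may_sleep_for_memory = false;
    }
    return p;
}
```

The point of the table is that (2) and (3) differ only in the memory column.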

There are a few known bugs -- for example, in zero-copy mode, we may block 
waiting for an sf_buf with MSG_DONTWAIT set (this used to be the case, haven't 
checked lately).  However, for applications, you typically run in (1) or (2) 
of the above, where the notion of blocking is aligned with a notion of buffer 
space or data, not with a notion of kernel sleeping.  In particular, it has to 
do with the definition used by select()/kqueue()/poll().  If you make SS_NBIO 
sockets return immediately when there is no memory free for sendfile(), this 
will be inconsistent with the normal behavior, in which select() returning 
writable means that you will be able to write.  An application that sees the 
socket as writable via select() might then sit there spinning, retrying the 
I/O operation and repeatedly getting an error saying it wasn't ready.
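
As a rough userland illustration of that spin, consider the usual "wait for
writable, then write" loop.  The sketch below uses plain poll() and send()
rather than sendfile() so that it stays portable; send_when_writable() and its
max_spins cap are assumptions made for the example, not an existing API.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Wait until poll() reports fd writable, then try to send.  Returns the
 * number of bytes sent, or -1 on error.  *spins counts how many times the
 * socket was reported writable but the send still failed with EAGAIN.
 * Gives up with EAGAIN after max_spins such failures.
 */
ssize_t
send_when_writable(int fd, const void *buf, size_t len, int max_spins,
    int *spins)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };

    *spins = 0;
    for (;;) {
        if (poll(&pfd, 1, -1) < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }

        ssize_t n = send(fd, buf, len, 0);
        if (n >= 0)
            return n;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;

        /* Reported writable, yet not ready: this is the wasted wakeup. */
        if (++*spins >= max_spins) {
            errno = EAGAIN;
            return -1;
        }
    }
}
```

With today's semantics the spin counter should normally stay at zero.  If a
memory shortage were reported as "not ready", poll() would keep returning
immediately and the loop would burn CPU until memory became available.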

My feeling is that we should restrict absolute non-sleeping to the 
MSG_DONTWAIT case -- if desired, we could add an SF_DONTWAIT flag to control 
whether sleeping happens at all.  SS_NBIO should not return an error in a 
low-memory case; it should sleep waiting on memory, as sleeping (mutexes, 
memory allocation, ...) is not considered blocking.  Blocking should continue 
to refer to the socket buffer-related behavior, and specifically sbwait().
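
In flag terms, the policy I'm arguing for might look like the sketch below.
The EX_ constants are stand-in values defined locally for illustration, not
the real kernel definitions, and SF_DONTWAIT does not exist today; the two
helper functions are likewise hypothetical.

```c
/* Stand-in values for illustration only; real kernel definitions differ. */
#define EX_SS_NBIO      0x1   /* socket is marked non-blocking */
#define EX_SF_DONTWAIT  0x2   /* hypothetical "never sleep" sendfile flag */

enum ex_wait { EX_M_WAITOK, EX_M_NOWAIT };

/*
 * Allocation policy: only an explicit don't-wait request forbids sleeping
 * for memory.  SS_NBIO deliberately has no effect here.
 */
enum ex_wait
ex_alloc_wait(int so_state, int sf_flags)
{
    (void)so_state;
    return (sf_flags & EX_SF_DONTWAIT) ? EX_M_NOWAIT : EX_M_WAITOK;
}

/*
 * Socket buffer policy: return an error instead of calling sbwait() when
 * the socket is non-blocking or the caller asked never to sleep.
 */
int
ex_fail_instead_of_sbwait(int so_state, int sf_flags)
{
    return (so_state & EX_SS_NBIO) != 0 ||
        (sf_flags & EX_SF_DONTWAIT) != 0;
}
```

Under this split a non-blocking socket still gets an immediate error when the
socket buffer is full, but a memory shortage is waited out.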

However, we should fix any bugs in MSG_DONTWAIT for sosend/soreceive (and 
hence sendmsg, recvmsg) that cause it to sleep improperly.  I'm not sure if 
the zero-copy case still gets this wrong, but that's potentially a problem if 
we ever support zero-copy send from within the kernel, as sosend/soreceive can 
be called while a mutex is held or in network interrupt context, which is why 
the flag is needed.

Robert N M Watson
Computer Laboratory
University of Cambridge
Received on Thu Sep 21 2006 - 11:59:10 UTC
