Re: Socket related code duplication in NFS

From: Rick Macklem <rmacklem_at_uoguelph.ca> Date: Thu, 21 May 2009 16:32:03 -0400 (EDT)

On Wed, 20 May 2009, Andre Oppermann wrote:

>
> e) The socket buffer is most efficient when it can aggregate a number of
>    packets together before they are processed.  Can the NFS code set a low
>    water mark on the socket to get called only after a few packets have
>    arrived instead of each one? (In the select and taskqueue model.)
>
I think the answer to this one is "no". NFS traffic is RPC requests and
replies, which are mostly rather small messages (the write request,
read reply and readdir reply are the exceptions). NFS performance is very
sensitive to RPC RTT, which means anything that introduces delay in
getting an RPC message through (such as waiting a little while for more
data/messages) is normally a detriment, from what I've seen. It might be
possible to handle the exceptions as a special case, but it isn't going
to be easy, since TCP doesn't handle record marks, so knowing when a
large message is coming would require something like "peeking" in the
data for the RPC record marks. (Sun RPC puts a 32-bit number in network
byte order in front of each RPC message, which is its length in bytes.
A quirk on top of this is that the high-order bit of this mark
indicates whether or not it is the last segment of a message,
i.e. an RPC message can be several record-marked segments.)
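
To make the record mark layout concrete, here is a minimal sketch in
Python (userland, not kernel code). It assumes the layout described above:
4 bytes in network byte order, high-order bit as the "last segment" flag,
and the remaining 31 bits as the segment length. The function and constant
names are made up for illustration.

```python
import struct

LAST_FRAGMENT = 0x80000000  # high-order bit: set on the final segment
LENGTH_MASK = 0x7FFFFFFF    # low 31 bits: segment length in bytes


def decode_record_mark(header: bytes) -> tuple[bool, int]:
    """Split a 4-byte record mark into (is_last_segment, length)."""
    if len(header) != 4:
        raise ValueError("a record mark is exactly 4 bytes")
    (mark,) = struct.unpack("!I", header)  # "!" selects network byte order
    return bool(mark & LAST_FRAGMENT), mark & LENGTH_MASK


def encode_record_mark(length: int, last: bool = True) -> bytes:
    """Build the 4-byte record mark for a segment of `length` bytes."""
    if not 0 <= length <= LENGTH_MASK:
        raise ValueError("segment length must fit in 31 bits")
    return struct.pack("!I", length | (LAST_FRAGMENT if last else 0))
```

For example, the bytes 80 00 00 1c decode to "last segment, 28 bytes".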

> f) I've been thinking of a modular socket filter approach (much like the
>    accept filter) scanning for upper layer specific markers or boundaries
>    and then signalling data availability.
>
If by this you mean scanning for the RPC message boundaries in the TCP
stream (similar to what I said above), this could be very useful. So
long as a message gets passed along as soon as you have a complete one,
this sounds like a good idea to me.
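
As a rough illustration of scanning for RPC message boundaries in the
stream, here is a userland sketch in Python. It is not the proposed
socket filter, just the boundary logic such a filter would need: buffer
bytes as they arrive, and hand back each message the moment its last
segment is complete. The class name and feed() interface are hypothetical.

```python
import struct


class RpcRecordScanner:
    """Reassemble record-marked RPC messages from an arbitrary byte stream."""

    def __init__(self):
        self._buf = bytearray()  # bytes received but not yet consumed
        self._segments = []      # segments of the message being assembled

    def feed(self, data: bytes) -> list[bytes]:
        """Add stream bytes; return every message that is now complete."""
        self._buf += data
        complete = []
        while len(self._buf) >= 4:
            (mark,) = struct.unpack_from("!I", self._buf, 0)
            length = mark & 0x7FFFFFFF
            if len(self._buf) < 4 + length:
                break  # segment still partial, wait for more data
            self._segments.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
            if mark & 0x80000000:  # last segment: the message is complete
                complete.append(b"".join(self._segments))
                self._segments = []
        return complete
```

Each call to feed() returns an empty list until a whole message is
present, so small requests are passed along without any added delay.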

Btw, although FreeBSD currently uses 32Kbyte reads/writes, Solaris 10 is
using up to 1Mbyte and I'd like to see that happening in FreeBSD too.
(When you have 1Mbyte write request and read reply messages, delaying
  an upcall until you have an entire message might work well.)
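
For the large-message case, a userland analogue of "delay the wakeup
until the whole message is here" could be sketched as follows, in Python.
This is only an assumption about how one might combine peeking at the
record mark with a receive low water mark (SO_RCVLOWAT); it is not how
the kernel NFS code does upcalls, and SO_RCVLOWAT behaviour differs
between systems. The helper names and the `cap` parameter are
hypothetical; `cap` should stay below the socket's receive buffer size,
otherwise the socket might never be reported readable.

```python
import socket
import struct

RECORD_MARK_SIZE = 4


def lowat_for_mark(header: bytes, cap: int) -> int:
    """Low water mark covering the record mark plus its segment, up to cap."""
    (mark,) = struct.unpack("!I", header)
    wanted = RECORD_MARK_SIZE + (mark & 0x7FFFFFFF)
    return max(1, min(wanted, cap))


def arm_lowat(sock: socket.socket, cap: int) -> int:
    """Peek at the next record mark and set SO_RCVLOWAT to match it.

    Blocks until some data is available. If fewer than 4 bytes have
    arrived, only ask to be woken once the mark itself is complete.
    """
    header = sock.recv(RECORD_MARK_SIZE, socket.MSG_PEEK)  # does not consume
    if len(header) < RECORD_MARK_SIZE:
        lowat = RECORD_MARK_SIZE
    else:
        lowat = lowat_for_mark(header, cap)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVLOWAT, lowat)
    return lowat
```

With a 1Mbyte segment and a large enough cap this asks for one wakeup
per segment; for a small RPC the mark is only a few bytes, so nothing
is held back.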

Good luck with it, it sounds like an interesting project, rick