Re: NFS write() calls lead to read() calls?

From: Yar Tikhiy <yar_at_comp.chem.msu.su>
Date: Thu, 29 Mar 2007 04:11:48 +0400
Greetings,

On Wed, Mar 28, 2007 at 11:38:44AM +0200, Ulrich Spoerlein wrote:
> 
> I observe a strange effect, when using the following setup: Three
> FreeBSD 6.2[1] machines on Gigabit Ethernet using em(4) interfaces.
> 
> HostC is the NFS server, HostB has /net/share mounted from HostC. I
> will use HostA and HostB to demonstrate the issue. Picture this:
> 
> hostA # scp 500MB hostB:/net/share/
> 
> Iff the file "500MB" does not yet exist on the NFS share, I can see X
> MB/s going out of HostA, X MB/s coming in on HostB, X MB/s going out
> on hostB again and finally X MB/s coming in on HostC.
> 
> If I run the scp again, I can see X MB/s going out from HostA, 2*X
> MB/s coming in on HostB, and X MB/s out plus X MB/s in on HostC.
> What's happening is that HostB issues one NFS READ call for every
> WRITE call. The traffic flows like this:
> 
>  ----->   ----->
> A        B        C
>           <-----
> 
> If I rm(1) the file on the NFS share, then the first scp(1) will not
> show this behaviour. This only happens when overwriting existing
> files.
> 
> The real weirdness comes into play, when I simply cp(1) from HostB
> itself like this:
> 
> hostB # cp 500MB /net/share/
> 
> I can do this over and over again, and _never_ get any noteworthy
> number of NFS READ calls, only WRITEs. The network traffic is also
> as you would expect.
> 
> Then I tested using ssh(1) instead of scp(1), like this:
> 
> hostA # cat 500MB | ssh hostB "cat >/net/share/500MB"
> 
> This works, too. Probably because sh(1) is truncating the file?
> 
> So, can someone please explain to me what is happening and if/how
> it can be avoided?

My first guess is that scp and Samba use too small an I/O block
size.  Forget NFS and simply imagine that an application issues
writes in 128-byte blocks while the disc block size is 512 bytes.
If the OS is simple, like MS-DOS :-), then it will read the whole
disc block each time and replace just 128 bytes in it on each
write by the application.  If the OS is a bit more sophisticated,
say FreeBSD ;-), it will use a buffer cache to alleviate the disc
churn.  However, it will still have to read the disc block once,
on the first small write to it, because it has no way of knowing
that the application is about to overwrite the whole block anyway.
So each disc block is read once and written once; the OS is forced
into that extra read by the poor choice of write block size.
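
If you want to see this effect from the NFS side, here is a rough
way to check it with dd(1) and the client RPC counters shown by
nfsstat(1).  This is an untested sketch; the file name and sizes
are made up, and conv=notrunc keeps dd from truncating the file
the way cp(1) or a shell redirection would:

hostB # dd if=/dev/zero of=/net/share/testfile bs=32k count=32  # create it once
hostB # nfsstat -c      # note the client READ and WRITE counts
hostB # dd if=/dev/zero of=/net/share/testfile bs=128 count=8192 conv=notrunc
hostB # nfsstat -c      # READ should have grown roughly in step with WRITE

If you then repeat the overwrite with bs set to the mount's wsize
(32k by default, if I remember right) on a block-aligned file, the
READ counter should stay almost flat, because each write then
covers a whole buffer cache block and there is nothing left to
read back.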

Of course, my scenario implies that the file already contains data
and the writes go over that data, not beyond the end of the file.
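
That would also explain why your "cat | ssh" variant behaves: the
shell's > redirection truncates the file before any data is
written, so there is no old data left to read back.  If I remember
right, scp opens an existing target without O_TRUNC and only
ftruncate(2)s it to the final length when the copy is done, while
cp(1) opens the target with O_TRUNC; but I haven't checked the
sources just now.  Again as an untested sketch with a made-up file
name:

hostB # : > /net/share/testfile     # truncate first, as sh's > does
hostB # dd if=/dev/zero of=/net/share/testfile bs=128 count=8192 conv=notrunc
hostB # nfsstat -c                  # the writes now extend the file: no READs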

Something similar (but maybe a bit more complex) should be going
on in your case.

-- 
Yar
Received on Wed Mar 28 2007 - 22:11:53 UTC