Re: [RFC] vfs.read_min proposal

From: Maksim Yevmenkin <maksim.yevmenkin_at_gmail.com> Date: Thu, 28 Mar 2013 10:11:25 -0700 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:36 UTC

On Thu, Mar 28, 2013 at 12:52 AM, Konstantin Belousov
<kostikbel_at_gmail.com> wrote:

>> i would like to get some reviews, opinions and/or comments on the patch below.
>>
>> a little bit background, as far as i understand, cluster_read() can
>> initiate two disk i/o's: one for exact amount of data being requested
>> (rounded up to a filesystem block size) and another for a configurable
>> read ahead. read ahead data are always extra and do not super set data
>> being requested. also, read ahead can be controlled via f_seqcount (on
>> per descriptor basis) and/or vfs.read_max (global knob).
>>
>> in some cases and/or on some work loads it can be beneficial to bundle
>> original data and read ahead data in one i/o request. in other words,
>> read more than caller has requested, but only perform one larger i/o,
>> i.e. super set data being requested and read ahead.
>
> The totread argument to the cluster_read() is supplied by the filesystem
> to indicate how many data in the current request is specified. Always
> overriding this information means two things:
> - you fill the buffer and page cache with potentially unused data.

it very well could be

>   For some situations, like partial reads, it would be really bad.

certainly possible

> - you increase the latency by forcing the reader to wait for the whole
>   cluster which was not asked for.

perhaps, however, modern drives are fast, and, in fact, are a lot
better at reading continuous chunks without introducing significant
delays.

> So it looks as very single- and special-purpose hack. Besides, the
> global knob is obscure and probably would not have any use except your
> special situation. Would a file flag be acceptable for you ?

flag would probably work just as well, but having global knob allows
for runtime tuning without software re-configuration and re-start.

> What is the difference in the numbers you see, and what numbers ?
> Is it targeted for read(2) optimizations, or are you also concerned
> with the read-ahead done at the fault time ?

ok. please consider the following. modern high performance web server
- nginx - with aio + sendfile(SF_NODISKIO) method of delivering data.

for those who are not familiar how it works, here is a very quick
overview of nginx's aio + sendfile

1) nginx always uses sendfile() with SF_NODISKIO to deliver content.
it means that if requested pages are not in the buffer cache, send as
much as available and return EBUSY

2) when nginx sees EBUSY return code from sendfile() it would issue
aio_read() for 1 byte. the idea here being that OS will issue a read
ahead and fill up buffer cache with some pages

3) when aio_read() completes, nginx calls sendfile() with SF_NODISKIO
again, making an assumption that at least some of the pages will be in
the buffer cache.

this model allow for completely non-blocking asynchronous data send
without copying anything back to user space. sounds awesome, right?
well, it almost is. here is the problem: aio_read() for 1 byte will
issue read for 1 filesystem block size, and, *another* read for read
ahead data. say, we configure, read ahead to be 1mb (via
ioctl(F_READAHEAD and/or vfs.read_max), and, say, our filesystem block
size is 64k. we end up with 2 disk reads: one for 64k and another for
1mb, two trips to VM, and our average size per disk transfer is (64k +
1mb) / 2 = 544kb.

now, if we use vfs.read_min and set it to 16, i.e. 1mb reads with 64k
filesystem block size, turn off read ahead completely, i.e. set
vfs.read_max to zero, then *all* cluster_reads() become nice 1mb
chunks, and, we only do one disk i/o and one trip to VM to get the
same data. so we effectively doubled (or halfed) our iops. also
average size per disk transfer is around 900kb (there are still some
filesystem block sized i/o that are not going via cluser_read()) which
is a lot nicer for modern disks.

finally, vfs.read_min allows us to control size of orignal disk reads,
and, vfs.read_max allows us to control of additional read ahead. so,
ww control both sides here. in fact, we can have 1mb reads and 1mb
read aheads together. granted, its not going to be optimal for all
loads. that is why vfs.read_min default is 1. however, i strongly
suspect that there are quite a few workloads where this could really
help with disk i/o.

thanks,
max