Re: [RFC] vfs.read_min proposal

From: Maksim Yevmenkin <maksim.yevmenkin_at_gmail.com> Date: Fri, 29 Mar 2013 16:57:48 -0700 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:36 UTC

On Fri, Mar 29, 2013 at 1:58 PM, Konstantin Belousov
<kostikbel_at_gmail.com> wrote:
> On Thu, Mar 28, 2013 at 10:11:25AM -0700, Maksim Yevmenkin wrote:
>> On Thu, Mar 28, 2013 at 12:52 AM, Konstantin Belousov
>> <kostikbel_at_gmail.com> wrote:
>>
>> >> i would like to get some reviews, opinions and/or comments on the patch below.
>> >>
>> >> a little bit background, as far as i understand, cluster_read() can
>> >> initiate two disk i/o's: one for exact amount of data being requested
>> >> (rounded up to a filesystem block size) and another for a configurable
>> >> read ahead. read ahead data are always extra and do not super set data
>> >> being requested. also, read ahead can be controlled via f_seqcount (on
>> >> per descriptor basis) and/or vfs.read_max (global knob).
>> >>
>> >> in some cases and/or on some work loads it can be beneficial to bundle
>> >> original data and read ahead data in one i/o request. in other words,
>> >> read more than caller has requested, but only perform one larger i/o,
>> >> i.e. super set data being requested and read ahead.
>> >
>> > The totread argument to the cluster_read() is supplied by the filesystem
>> > to indicate how many data in the current request is specified. Always
>> > overriding this information means two things:
>> > - you fill the buffer and page cache with potentially unused data.
>>
>> it very well could be
>>
>> >   For some situations, like partial reads, it would be really bad.
>>
>> certainly possible
>>
>> > - you increase the latency by forcing the reader to wait for the whole
>> >   cluster which was not asked for.
>>
>> perhaps, however, modern drives are fast, and, in fact, are a lot
>> better at reading continuous chunks without introducing significant
>> delays.
>>
>> > So it looks as very single- and special-purpose hack. Besides, the
>> > global knob is obscure and probably would not have any use except your
>> > special situation. Would a file flag be acceptable for you ?
>>
>> flag would probably work just as well, but having global knob allows
>> for runtime tuning without software re-configuration and re-start.
> My point is that the 'tuning' there has too wide scope to be acceptable
> as the feature for the general-puspose OS.

i understand. i was just making parallels with vfs.read_max which has
just as wider scope. granted, it can be set on per-descriptor basis,
but only lowered, imo.

>> > What is the difference in the numbers you see, and what numbers ?
>> > Is it targeted for read(2) optimizations, or are you also concerned
>> > with the read-ahead done at the fault time ?
>>
>> ok. please consider the following. modern high performance web server
>> - nginx - with aio + sendfile(SF_NODISKIO) method of delivering data.
>>
>> for those who are not familiar how it works, here is a very quick
>> overview of nginx's aio + sendfile
>>
>> 1) nginx always uses sendfile() with SF_NODISKIO to deliver content.
>> it means that if requested pages are not in the buffer cache, send as
>> much as available and return EBUSY
>>
>> 2) when nginx sees EBUSY return code from sendfile() it would issue
>> aio_read() for 1 byte. the idea here being that OS will issue a read
>> ahead and fill up buffer cache with some pages
>>
>> 3) when aio_read() completes, nginx calls sendfile() with SF_NODISKIO
>> again, making an assumption that at least some of the pages will be in
>> the buffer cache.
>>
>> this model allow for completely non-blocking asynchronous data send
>> without copying anything back to user space. sounds awesome, right?
>> well, it almost is. here is the problem: aio_read() for 1 byte will
>> issue read for 1 filesystem block size, and, *another* read for read
>> ahead data. say, we configure, read ahead to be 1mb (via
>> ioctl(F_READAHEAD and/or vfs.read_max), and, say, our filesystem block
>> size is 64k. we end up with 2 disk reads: one for 64k and another for
>> 1mb, two trips to VM, and our average size per disk transfer is (64k +
>> 1mb) / 2 = 544kb.
>>
>> now, if we use vfs.read_min and set it to 16, i.e. 1mb reads with 64k
>> filesystem block size, turn off read ahead completely, i.e. set
>> vfs.read_max to zero, then *all* cluster_reads() become nice 1mb
>> chunks, and, we only do one disk i/o and one trip to VM to get the
>> same data. so we effectively doubled (or halfed) our iops. also
>> average size per disk transfer is around 900kb (there are still some
>> filesystem block sized i/o that are not going via cluser_read()) which
>> is a lot nicer for modern disks.
> I think this is definitely a feature that should be set by a flag to
> either file descriptor used for aio_read, or aio_read call itself.
> Adding a flag to aio_read() might be cumbersome from the ABI perspective.

something along the lines of ioctl(F_READAHEAD)/f_seqcount might be
acceptable, dont you think?

>> finally, vfs.read_min allows us to control size of orignal disk reads,
>> and, vfs.read_max allows us to control of additional read ahead. so,
>> ww control both sides here. in fact, we can have 1mb reads and 1mb
>> read aheads together. granted, its not going to be optimal for all
>> loads. that is why vfs.read_min default is 1. however, i strongly
>> suspect that there are quite a few workloads where this could really
>> help with disk i/o.
>
> In fact, the existing OS behaviour is reasonable for the arguments
> which are passed to the syscall. The specified read size is 1, and the
> current read-ahead algorithm tries to satisfy the request with minimal
> latency and without creating additional load under memory pressure,
> while starting the useful optimization of the background read.

like i said, i really don't buy latency argument here. i personally
did not observe any noticeable latency increase. if anything, two
side-by-side disk i/o's will like to have more latency than one i/o
spanning across both ranges. there is also a potentially bad
interaction with cache inside drive itself. but that is really hard to
see as we dont get to see inside the drive.

> Not lying to the OS could be achieved by somehow specifying to
> aio_read() that you do not need copyout, and issuing the request for
> read of the full range. This is definitely more work than read_min,
> but I think that the result could be useful for the wide audience.

the whole idea is to make sure we dont copy anything back user space.
anyway... we will probably think of something similar. i just wanted
to throw idea out there :)

thanks
max