On Mar 29, 2013, at 2:58 PM, Konstantin Belousov <kostikbel_at_gmail.com> wrote:

> I think this is definitely a feature that should be set by a flag to
> either file descriptor used for aio_read, or aio_read call itself.
> Adding a flag to aio_read() might be cumbersome from the ABI perspective.

Fine if you think that there should be a corresponding fcntl() operation, but I see good reason to also have a vfs.read_min that complements vfs.read_max. It's no less obscure.

>> finally, vfs.read_min allows us to control the size of original disk reads,
>> and vfs.read_max allows us to control the additional read ahead. so,
>> we control both sides here. in fact, we can have 1mb reads and 1mb
>> read aheads together. granted, it's not going to be optimal for all
>> loads. that is why the vfs.read_min default is 1. however, i strongly
>> suspect that there are quite a few workloads where this could really
>> help with disk i/o.
>
> In fact, the existing OS behaviour is reasonable for the arguments
> which are passed to the syscall. The specified read size is 1, and the
> current read-ahead algorithm tries to satisfy the request with minimal
> latency and without creating additional load under memory pressure,
> while starting the useful optimization of the background read.

The doubled transaction made a lot of sense back when disks were very slow. Now, let's use a modern example:

Default UFS block size = 16k
Default vfs.read_max = 8 (128k)
Time spent transferring a 16k block over 3Gbps SATA: 54us
Time spent transferring a 128k block over 3Gbps SATA: 436us
Time spent seeking to the 16k/128k block: average 8ms on modern disks
% time spent on data vs seek, 16k: 0.68%
% time spent on data vs seek, 128k: 5.4%

It'll take you 5% longer to get a completion back. Not nothing, but it's also not something that would be turned on by default, at least not right now. For 6Gbps SATA, it'll be half of that. However, this is a very idealized example.
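The figures above can be sanity-checked with a quick calculation. This sketch assumes 8b/10b line encoding on 3Gbps SATA, giving roughly 300 MB/s of payload bandwidth; the resulting transfer times confirm the percentages quoted against an 8ms average seek:

```python
# Sanity-check the transfer-time vs. seek-time figures above.
# Assumes 8b/10b line encoding, so 3 Gbps SATA carries ~300 MB/s of payload.
PAYLOAD_BW = 3e9 * 8 / 10 / 8   # bytes/sec (300e6)
SEEK = 8e-3                     # 8 ms average seek

for size in (16 * 1024, 128 * 1024):
    xfer = size / PAYLOAD_BW
    pct = xfer / SEEK * 100
    print(f"{size // 1024}k: transfer {xfer * 1e6:.1f} us, "
          f"{pct:.2f}% of an 8 ms seek")
```

This prints roughly 54.6us (0.68%) for the 16k block and 436.9us (5.46%) for the 128k block; halving the transfer times for 6Gbps SATA halves the percentages as well.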
When you start getting a busy disk and the seek times reach the hundreds of milliseconds, this overhead goes well into the noise. At the same time, reducing the number of concurrent, unbalanced transactions to the disk makes them perform much better when they are at their performance saturation point, and we have very solid numbers to prove it.

I think that there's still a place for doubled transactions for read ahead, and that place would likely be with low-latency flash, but there are a lot of other factors that get in the way of that right now in FreeBSD, like the overhead of the threaded handoffs in GEOM. As this area is developed over the next 6 months, and as we have more time to build and test more models, I'm sure we'll get some interesting data. But for now, I'll argue that Max's proposal is sound and is low maintenance.

> Not lying to the OS could be achieved by somehow specifying to
> aio_read() that you do not need copyout, and issuing the request for
> read of the full range. This is definitely more work than read_min,
> but I think that the result could be useful for the wide audience.

A side effect of the aio_mlock() work that's also going on right now is that we won't need to lie to the OS anymore. We still may not want to do a doubled transaction for read ahead, though, because we're constrained on disk transactional bandwidth and we don't know that we'll always actually use the data that gets read ahead.

In any case, it's hard for me to reconcile the arguments for giving FreeBSD the tools to let people make it faster in demonstrable ways with then arguing that the tools offered won't be used and are too obscure. Let's move forward with this.

Scott

Received on Sat Mar 30 2013 - 06:51:25 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:36 UTC