Re: [RFC] vfs.read_min proposal

From: Konstantin Belousov <kostikbel_at_gmail.com> Date: Sun, 31 Mar 2013 15:37:30 +0300 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:36 UTC

On Sat, Mar 30, 2013 at 01:51:15AM -0600, Scott Long wrote:
> 
> On Mar 29, 2013, at 2:58 PM, Konstantin Belousov <kostikbel_at_gmail.com> wrote:
> 
> >> 
> > I think this is definitely a feature that should be set by a flag to
> > either file descriptor used for aio_read, or aio_read call itself.
> > Adding a flag to aio_read() might be cumbersome from the ABI perspective.
> > 
> 
> Fine if you think that there should be a corresponding fcntl() operation, but
> I see good reason to also have a vfs.read_min that compliments vfs_read_max.
> It's no less obscure.
> 
> >> 
> >> finally, vfs.read_min allows us to control size of orignal disk reads,
> >> and, vfs.read_max allows us to control of additional read ahead. so,
> >> ww control both sides here. in fact, we can have 1mb reads and 1mb
> >> read aheads together. granted, its not going to be optimal for all
> >> loads. that is why vfs.read_min default is 1. however, i strongly
> >> suspect that there are quite a few workloads where this could really
> >> help with disk i/o.
> > 
> > In fact, the existing OS behaviour is reasonable for the arguments
> > which are passed to the syscall. The specified read size is 1, and the
> > current read-ahead algorithm tries to satisfy the request with minimal
> > latency and without creating additional load under memory pressure,
> > while starting the useful optimization of the background read.
> > 
> 
> The doubled transaction made a lot of sense back when disks were very
> slow.  Now, let's use a modern example:
> 
> Default UFS block size = 16k
> Default vfs.read_max = 8 (128k)
> Time spent transferring a 16k block over 3Gbps SATA: 54ns
> Time spent transferring a 128k block over 3Gbps SATA: 436ns
> Time spent seeking to the 16k/128k block: Average 8ms on modern disks.
> % time spent on data vs seek, 16k: 0.68%
> % time spent on data vs seek, 128k: 5.4%
> 
> It'll take you 5% longer to get a completion back.  Not nothing, but it's
> also not something that would be  turned on by default, at least not
> right now.  For 6Gbps SATA, it'll be half of that.  However, this is a
> very idealized example.  When you start getting a busy disk and the
> seek times reach the hundreds of milliseconds, this overhead goes
> well into the noise.  At the same time, reducing the number of
> concurrent, unbalanced transactions to the disk makes them perform
> much better when they are at their performance saturation point, and
> we have very solid numbers to prove it.
Hm, what you are talking about is not what the patch does. Patch fakes
the read size from the user.

Let me reformulate this in the following way:
assume that the file extent at the current point is contiguous, and
the read-ahead was specified by the file system. Then, is makes absolutely
no sense to not read the maximal available contigous extent up to maxra.

Current code limits the amount of first clustered read to the amount
specified in the usermode read request. I think this is bug, and we do
not need a tunable to disable the bug.

> 
> I think that there's still a place for doubled transactions for read ahead,
> and that place would likely be with low-latency flash, but there's a lot
> of other factors that get in the way of that right now in FreeBSD, like the
> overhead of the threaded handoffs in GEOM.  As this area is developed
> over the next 6 months, and as we have more time to build and test
> more models, I'm sure well get some interesting data.  But for now, I'll
> argue that Max's proposal is sound and is low maintenance.
Even with such low-latency disk, I think that not reading contiguous
chunk when there is read-ahead, is wrong.

Could you try this, please, instead of read_min ? The patch can be buggy,
but it should demonstrate what I mean.

diff --git a/sys/kern/vfs_cluster.c b/sys/kern/vfs_cluster.c
index 91a0443..c926dda 100644
--- a/sys/kern/vfs_cluster.c
+++ b/sys/kern/vfs_cluster.c
_at__at_ -201,7 +201,7 _at__at_ cluster_read(struct vnode *vp, u_quad_t filesize, daddr_t lblkno, long size,
 		 */
 		if (ncontig) {
 			/* Account for our first block. */
-			ncontig = min(ncontig + 1, nblks);
+			ncontig = min(ncontig + 1, maxra);
 			if (ncontig < nblks)
 				nblks = ncontig;
 			bp = cluster_rbuild(vp, filesize, lblkno,