Re: Time to increase MAXPHYS?

From: Tomoaki AOKI <junchoon_at_dec.sakura.ne.jp>
Date: Sat, 10 Jun 2017 16:14:10 +0900
It's what I proposed first. ;-)

But looking through this thread, I now like Konstantin's idea, in
conjunction with quirks mechanism.

With a single MAXPHYS across the whole OS, a lot of older hardware could
be left behind if it is set larger, but the larger MAXPHYS is, the better
virtual instances (AWS, Azure, ...) run.

It seems there should be a flexible way to make MAXPHYS "per
consumer" (devices, drivers, virtual instances, ...), and Konstantin's
idea looks good to me. (Although there would be some risk of memory
leaks.)

One more possibility: abuse quirks to allow a larger MAXPHYS only where
it is known to work, and keep the current default otherwise. This way,
only the things that require a larger value would be affected, IMHO.

 *I guess quirks should only be used for problematic things, though.
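
Just to illustrate what I mean by abusing quirks, a rough sketch (the
table and names below are made up for illustration, not existing code):

  #include <sys/types.h>
  #include <stdint.h>

  /* Hypothetical quirk table: raise the per-device transfer limit only
   * for hardware known to handle it, keep the current default otherwise. */
  struct maxio_quirk {
          uint16_t        vendor;
          uint16_t        device;
          u_int           maxio;          /* bytes */
  };

  static const struct maxio_quirk maxio_quirks[] = {
          { 0x8086, 0x0953, 1024 * 1024 },        /* hypothetical entry */
          { 0, 0, 0 }
  };

  static u_int
  maxio_for_device(uint16_t vendor, uint16_t device)
  {
          const struct maxio_quirk *q;

          for (q = maxio_quirks; q->maxio != 0; q++)
                  if (q->vendor == vendor && q->device == device)
                          return (q->maxio);
          return (128 * 1024);            /* current default */
  }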


On Sun, 4 Jun 2017 12:40:55 +0000
Rick Macklem <rmacklem_at_uoguelph.ca> wrote:

> There is an array in aio.h sized on MAXPHYS as well.
> 
> A simpler possibility might be to leave MAXPHYS as a compile
> time setting, but allow it to be set "per arch" and make it bigger
> for amd64.
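> 
> Something like this in sys/sys/param.h would do it (just a sketch to
> show the shape; the amd64 value here is a guess, not a tested number):
> 
>   #ifndef MAXPHYS
>   #if defined(__amd64__)
>   #define MAXPHYS         (1024 * 1024)   /* bigger default on amd64 */
>   #else
>   #define MAXPHYS         (128 * 1024)    /* keep today's default elsewhere */
>   #endif
>   #endif
> 
> That keeps it a compile-time constant, so nothing sized from it changes
> shape at run time.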
> 
> Good luck with it, rick
> ________________________________________
> From: owner-freebsd-current_at_freebsd.org <owner-freebsd-current_at_freebsd.org> on behalf of Konstantin Belousov <kostikbel_at_gmail.com>
> Sent: Sunday, June 4, 2017 4:10:32 AM
> To: Warner Losh
> Cc: Allan Jude; FreeBSD Current
> Subject: Re: Time to increase MAXPHYS?
> 
> On Sat, Jun 03, 2017 at 11:28:23PM -0600, Warner Losh wrote:
> > On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude <allanjude_at_freebsd.org> wrote:
> >
> > > On 2017-06-03 22:35, Julian Elischer wrote:
> > > > On 4/6/17 4:59 am, Colin Percival wrote:
> > > >> On January 24, 1998, in what was later renumbered to SVN r32724, dyson_at_
> > > >> wrote:
> > > >>> Add better support for larger I/O clusters, including larger physical
> > > >>> I/O.  The support is not mature yet, and some of the underlying
> > > >>> implementation
> > > >>> needs help.  However, support does exist for IDE devices now.
> > > >> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase it
> > > >> again,
> > > >> or do we need to wait at least two decades between changes?
> > > >>
> > > >> This is hurting performance on some systems; in particular, EC2 "io1"
> > > >> disks
> > > >> are optimized for 256 kB I/Os, EC2 "st1" (throughput optimized
> > > >> spinning rust)
> > > >> disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS)
> > > >> recommends
> > > >> using a maximum I/O size of 1 MB (and despite NFS not being *physical*
> > > >> I/O it
> > > >> seems to still be limited by MAXPHYS).
> > > >>
> > > > We increased it in FreeBSD 8 and 10.3 on our systems, with only good results.
> > > >
> > > > sys/sys/param.h:#define MAXPHYS         (1024 * 1024)   /* max raw I/O
> > > > transfer size */
> > > >
> > >
> > > At some point Warner and I discussed how hard it might be to make this a
> > > boot time tunable, so that big amd64 machines can have a larger value
> > > without causing problems for smaller machines.
> > >
> > > ZFS supports a block size of 1mb, and doing I/Os in 128kb negates some
> > > of the benefit.
> > >
> > > I am preparing some benchmarks and other data along with a patch to
> > > increase the maximum size of pipe I/O's as well, because using 1MB
> > > offers a relatively large performance gain there as well.
> > >
> >
> > It doesn't look to be hard to change this, though struct buf depends on
> > MAXPHYS:
> >         struct  vm_page *b_pages[btoc(MAXPHYS)];
> > and b_pages isn't the last member of the struct, so changing MAXPHYS at boot
> > time would cause an ABI change. IMHO, we should move it to the last element
> > so that wouldn't happen. IIRC all buf allocations come from a fixed pool.
> > We'd have to audit anybody that creates one on the stack knowing it will be
> > persisted. Given how things work, I don't think this is possible, so we may
> > be safe. Thankfully, struct bio doesn't seem to be affected.
> >
> > As for making it boot-time configurable, it shouldn't be too horrible with
> > the above change. We should have enough of the tunables mechanism up early
> > enough to pull this in before we create the buf pool.
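> >
> > Roughly, something like this (the tunable name "kern.maxphys" and the
> > clamping below are illustrative only, not a worked-out patch):
> >
> >   #include <sys/param.h>
> >   #include <sys/systm.h>
> >   #include <sys/kernel.h>
> >
> >   static u_long maxphys = 128 * 1024;   /* today's default */
> >
> >   static void
> >   maxphys_init(void *dummy __unused)
> >   {
> >           /* Read loader.conf/kenv before the buf pool gets sized. */
> >           TUNABLE_ULONG_FETCH("kern.maxphys", &maxphys);
> >           if (maxphys < 64 * 1024)
> >                   maxphys = 64 * 1024;  /* clamp obvious typos */
> >           maxphys = roundup2(maxphys, PAGE_SIZE);
> >   }
> >   SYSINIT(maxphys_tune, SI_SUB_TUNABLES, SI_ORDER_ANY, maxphys_init,
> >       NULL);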
> >
> > Netflix runs a MAXPHYS of 8MB. There are issues with something this big, to
> > be sure, especially on memory-limited systems. Lots of hardware can't do an
> > I/O this big, and some drivers can't cope even if the underlying hardware
> > can. Since we don't use such drivers at work, I don't have a list handy
> > (though I think the SG list for NVMe limits it to 1MB). 128k is a totally
> > reasonable bump by default, but I think going larger by default should be
> > approached with some caution, given the overhead it adds to struct buf.
> > Having it be a run-time tunable would be great.
> The most important side-effect of bumping MAXPHYS as high as you did,
> which is somewhat counter-intuitive and also probably does not matter
> for the typical Netflix cache box load (as I understand it), is increased
> fragmentation of UFS volumes.
> 
> MAXPHYS limits the max cluster size, and the larger the cluster we try to
> build, the larger the probability of failure.  We might end up with
> single-block writes more often, defeating the reallocblk defragmenter.
> This might be somewhat theoretical, and it can probably be mitigated in
> the clustering code if real, but it is something to look at.
> 
> WRT making MAXPHYS tunable, I do not like the proposal of converting
> b_pages[] into a flexible array.  I think that making b_pages a pointer
> to an off-structure page run is better.  One reason is that buf cache
> buffers are not the only buffers in the system.  There are several cases
> where buffers are malloced, like markers for iterating queues; in those
> cases b_pages[] can be eliminated entirely.  (I believe I changed all
> local struct bufs to be allocated with malloc.)
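>
> In rough code (a sketch only; names and the malloc type below are mine,
> not from an actual patch):
>
>   #include <sys/param.h>
>   #include <sys/malloc.h>
>   #include <vm/vm.h>
>
>   struct buf {
>           /* ... existing fields, with b_pages moved off-structure ... */
>           vm_page_t       *b_pages;       /* pointer, no longer an array */
>           int             b_npages;
>   };
>
>   /* Buf cache bufs get a page run sized from the (tunable) maxphys;
>    * malloced marker bufs simply leave b_pages NULL.  M_TEMP stands in
>    * for a dedicated malloc type here. */
>   static void
>   buf_alloc_pages(struct buf *bp, u_long maxphys)
>   {
>           bp->b_pages = malloc(btoc(maxphys) * sizeof(vm_page_t),
>               M_TEMP, M_WAITOK | M_ZERO);
>           bp->b_npages = 0;
>   }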
> 
> Another supply of buffers, outside the buf cache, is the phys buffer
> pool; see vm/vm_pager.c.
> 
> >
> > There are a number of places in userland that depend on MAXPHYS, which is
> > unfortunate since they assume a fixed value and don't pick it up from the
> > kernel or kernel config. Thankfully, there are only a limited number of
> > them.
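> >
> > One way out would be for userland to ask the kernel at run time instead
> > of hard-coding the constant, e.g. via a sysctl (the name "kern.maxphys"
> > below is hypothetical, it does not exist today):
> >
> >   #include <sys/types.h>
> >   #include <sys/sysctl.h>
> >   #include <stdio.h>
> >
> >   int
> >   main(void)
> >   {
> >           u_long maxphys = 128 * 1024;    /* fallback if sysctl is missing */
> >           size_t len = sizeof(maxphys);
> >
> >           if (sysctlbyname("kern.maxphys", &maxphys, &len, NULL, 0) == -1)
> >                   fprintf(stderr, "kern.maxphys not available, using default\n");
> >           printf("max physical I/O: %lu bytes\n", maxphys);
> >           return (0);
> >   }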
> >
> > Of course, there are times when I/Os can return much more than this.
> > Reading drive log pages, for example, can generate tens or hundreds of MB
> > of data, and there's no way to do that in one transaction today. If drive
> > makers were perfect, we could use the generally defined offset and length
> > fields to read them out piecemeal, if the log is stable, which is a big if
> > for some of the snapshots of internal-state logs that are sometimes
> > necessary to investigate problems... It sure would be nice if there were a
> > way to do super-huge I/O on an exception basis for these situations.
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"
> 


-- 
Tomoaki AOKI    <junchoon_at_dec.sakura.ne.jp>
Received on Sat Jun 10 2017 - 05:14:14 UTC
