Hi

One possibility would be to make it a machine-dependent (MD) build-time kernel option, defaulting to 1M on regular systems and 128k on smaller ones.

Of course, I guess making it a tunable (or sysctl) would be best, though. (Rough sketches of the build-time option, a boot-time tunable, and a userland fallback are appended at the bottom of this mail.)

On Sat, 3 Jun 2017 23:49:01 -0600
Warner Losh <imp_at_bsdimp.com> wrote:

> On Sat, Jun 3, 2017 at 11:28 PM, Warner Losh <imp_at_bsdimp.com> wrote:
>
> > On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude <allanjude_at_freebsd.org> wrote:
> >
> >> On 2017-06-03 22:35, Julian Elischer wrote:
> >> > On 4/6/17 4:59 am, Colin Percival wrote:
> >> >> On January 24, 1998, in what was later renumbered to SVN r32724, dyson_at_ wrote:
> >> >>> Add better support for larger I/O clusters, including larger physical I/O. The support is not mature yet, and some of the underlying implementation needs help. However, support does exist for IDE devices now.
> >> >> and increased MAXPHYS from 64 kB to 128 kB. Is it time to increase it again, or do we need to wait at least two decades between changes?
> >> >>
> >> >> This is hurting performance on some systems; in particular, EC2 "io1" disks are optimized for 256 kB I/Os, EC2 "st1" (throughput-optimized spinning rust) disks are optimized for 1 MB I/Os, and Amazon's NFS service (EFS) recommends using a maximum I/O size of 1 MB (and despite NFS not being *physical* I/O, it seems to still be limited by MAXPHYS).
> >> >>
> >> > We increased it in FreeBSD 8 and 10.3 on our systems, with only good results.
> >> >
> >> > sys/sys/param.h:#define MAXPHYS (1024 * 1024) /* max raw I/O transfer size */
> >> >
> >> At some point Warner and I discussed how hard it might be to make this a boot-time tunable, so that big amd64 machines can have a larger value without causing problems for smaller machines.
> >>
> >> ZFS supports a block size of 1 MB, and doing I/Os in 128 kB negates some of the benefit.
> >>
> >> I am preparing some benchmarks and other data along with a patch to increase the maximum size of pipe I/Os as well, because using 1 MB offers a relatively large performance gain there as well.
> >
> > It doesn't look to be hard to change this, though struct buf depends on MAXPHYS:
> >
> >     struct vm_page *b_pages[btoc(MAXPHYS)];
> >
> > and b_pages isn't the last member of the struct, so changing MAXPHYS at boot time would cause an ABI change. IMHO, we should move it to the last element so that wouldn't happen. IIRC all buf allocations are from a fixed pool. We'd have to audit anybody that creates one on the stack knowing it will be persisted. Given how things work, I don't think this is possible, so we may be safe. Thankfully, struct bio doesn't seem to be affected.
> >
> > As for making it boot-time configurable, it shouldn't be too horrible with the above change. We should have enough of the tunables mechanism up early enough to pull this in before we create the buf pool.
> >
> > Netflix runs a MAXPHYS of 8MB. There are issues with something this big, to be sure, especially on memory-limited systems. Lots of hardware can't do this big an I/O, and some drivers can't cope, even if the underlying hardware can.
> > Since we don't use such drivers at work, I don't have a list handy (though I think the SG list for NVMe limits it to 1MB). 128k is a totally reasonable bump by default, but I think going larger by default should be approached with some caution given the overhead that adds to struct buf. Having it be a run-time tunable would be great.
> >
> Of course 128k is reasonable, it's the current default :). I meant to say that doubling it would have a limited impact. 1MB might be a good default, but it might be too big for smaller systems (nothing says it has to be an MI constant, though). It would be a perfectly fine default if it were a tunable.
>
> > There are a number of places in userland that depend on MAXPHYS, which is unfortunate since they assume a fixed value and don't pick it up from the kernel or kernel config. Thankfully, there are only a limited number of these.
> >
> There are a number of other places that assume MAXPHYS is constant. The ahci driver uses it to define the max number of SG operations you can have, for example. aio has an array sized based off of it. There are some places that use this when they should use 128k instead. There are several places that use it to define other constants, and it would take a while to run them all to ground to make sure they are all good. We might need to bump DFLTPHYS as well, so it might also make a good tunable. There are a few places that check things in terms of a fixed multiple of MAXPHYS, rules of thumb that kinda work today, maybe by accident, or maybe the 100 * MAXPHYS is highly scientific. It's hard to say without careful study.
>
> For example, until recently, nvmecontrol would use MAXPHYS. But it's the system default MAXPHYS. And even if it isn't, there's currently a hard limit of 1MB for an I/O imposed by how the driver uses NVMe's SG lists. But it doesn't show up as MAXPHYS, but rather as NVME_MAX_XFER_SIZE in places. It totally surprised me when I hit this problem at runtime and tracked it to ground.
>
> > Of course, there are times when I/Os can return much more than this. Reading drive log pages, for example, can generate tens or hundreds of MB of data, and there's no way to do that with one transaction today. If drive makers were perfect, we could use the generally defined offset and length fields to read them out piecemeal. If the log is stable, that is, a big if for some of the snapshots of internal-state logs that are sometimes necessary to investigate problems... It sure would be nice if there were a way to have super-huge I/O on an exception basis for these situations.
> >
> The hardest part about doing this is chasing down all the references, since it winds up in the craziest of places.
>
> Warner

-- 
Tomoaki AOKI <junchoon_at_dec.sakura.ne.jp>
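
[Sketch 1: MD build-time option]

Only a rough sketch of the build-time knob I mean above, not a patch. I am assuming MAXPHYS would get an entry in sys/conf/options routed through opt_global.h so that sys/sys/param.h can see it, and that param.h keeps (or gains) an #ifndef guard so the option can override the default; the BIGIRON config name is just an example.

  # sys/conf/options (assumed new entry; opt_global.h is force-included
  # in every kernel compile, which param.h needs)
  MAXPHYS         opt_global.h

  # sys/amd64/conf/BIGIRON (example kernel config for a big amd64 box)
  options         MAXPHYS=(1024*1024)

  /* sys/sys/param.h: keep the default, but let the option override it */
  #ifndef MAXPHYS
  #define MAXPHYS         (128 * 1024)   /* max raw I/O transfer size */
  #endif

Smaller (e.g. 32-bit or embedded) configs would simply leave the option out and keep 128k.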
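[Sketch 2: boot-time tunable]

A sketch of what the run-time tunable Warner describes could look like, with all the caveats he lists about struct buf. Everything here is hypothetical: there is no maxphys variable, kern.maxphys tunable, or maxphys_init() in the tree today, and the clamping policy is only an illustration.

  /* Hypothetical: replace the MAXPHYS compile-time constant with a
   * variable that the loader can override before the buf pool is sized. */
  #include <sys/param.h>
  #include <sys/systm.h>
  #include <sys/kernel.h>

  u_long maxphys = 128 * 1024;            /* keep today's default */

  static void
  maxphys_init(void *dummy __unused)
  {
          /* Pull kern.maxphys from the loader environment, if set. */
          TUNABLE_ULONG_FETCH("kern.maxphys", &maxphys);

          /* Clamp to something sane: at least DFLTPHYS, page-rounded. */
          if (maxphys < DFLTPHYS)
                  maxphys = DFLTPHYS;
          maxphys = roundup(maxphys, PAGE_SIZE);
  }
  SYSINIT(maxphys_tune, SI_SUB_TUNABLES, SI_ORDER_ANY, maxphys_init, NULL);

With b_pages moved to the tail of struct buf (or sized from maxphys when the buf pool is allocated), a larger value would only grow the fixed pool instead of shifting every member behind it, which is the ABI problem Warner points out.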
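[Sketch 3: userland fallback]

For the userland consumers of MAXPHYS that Warner mentions, the tools could ask the running kernel and fall back to the compile-time value. This assumes a kern.maxphys sysctl that does not exist today; nvmecontrol, for one, would additionally still have to respect the driver's NVME_MAX_XFER_SIZE limit.

  /* Hypothetical helper: prefer the running kernel's value, fall back to
   * the compiled-in MAXPHYS when no kern.maxphys sysctl is exported. */
  #include <sys/param.h>
  #include <sys/sysctl.h>

  static size_t
  get_maxphys(void)
  {
          unsigned long val;
          size_t len = sizeof(val);

          if (sysctlbyname("kern.maxphys", &val, &len, NULL, 0) == 0)
                  return ((size_t)val);
          return (MAXPHYS);
  }

Callers would then size their I/O buffers with get_maxphys() instead of a bare MAXPHYS, so a kernel built (or booted) with a bigger limit is picked up automatically.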