On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:
>
> :All of the above I have successfully tested over the last few months with
> :MAXPHYS of 1MB on i386 and amd64 platforms.
> :
> :So my questions are:
> :- does somebody know of any issues preventing an increase of MAXPHYS in HEAD?
> :- are there any specific opinions about the value? 512K, 1MB, MD?
> :
> :--
> :Alexander Motin
>
> (nswbuf * MAXPHYS) of KVM is reserved for pbufs, so on i386 you
> might hit up against KVM exhaustion issues in unrelated subsystems.
> nswbuf typically maxes out at around 256.  For i386 1MB is probably
> too large (256M of reserved KVM is a lot for i386).  On amd64 there
> shouldn't be a problem.
>

Yes, this needs to be addressed.  I've never gotten a clear answer from VM
people like Peter Wemm and Alan Cox on what should be done.

> Diminishing returns get hit pretty quickly with larger MAXPHYS values.
> As long as the I/O can be pipelined the reduced transaction rate
> becomes less interesting when the transaction rate is less than a
> certain level.  Off the cuff I'd say 2000 tps is a good basis for
> considering whether it is an issue or not.  256K is actually quite
> a reasonable value.  Even 128K is reasonable.
>

I agree completely.  I did quite a bit of testing on this in 2008 and 2009.
I even added some hooks into CAM to support this, and I thought that I had
discussed this extensively with Alexander at the time.  Guess it was yet
another wasted conversation with him =-(  I'll repeat it here for the record.

What I call the silly-i/o-test, filling a disk up with the dd command, yields
performance improvements up to a MAXPHYS of 512K.  Beyond that the gains are
negligible, and larger sizes actually start running into contention on the VM
page queues lock.  There is some work underway to break down this lock, so
it's worth revisiting in the future.

For the non-silly-i/o-test, where I do real file i/o using various sequential
and random patterns, there was a modest improvement up to 256K, and a slight
improvement up to 512K.  This surprised me, as I had figured that most
filesystem i/o would be in UFS block-sized chunks.  Then I realized that the
UFS clustering code was actually taking advantage of the larger I/Os.  The
improvement really depends on the workload, of course, and I wouldn't expect
it to be noticeable for most people unless they're running something like a
media server.

Besides the nswbuf sizing problem, there is a real problem in that a lot of
drivers have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are
particular values, and they've sized their data structures accordingly.
Before these values are changed, an audit needs to be done OF EVERY SINGLE
STORAGE DRIVER.  No exceptions.  This isn't a case of changing MAXPHYS in the
ata driver, testing that your machine boots, and then committing the change
to source control.  Some drivers will have non-obvious restrictions based on
the number of S/G elements allowed in a particular command format.  MPT comes
to mind (its multi-message S/G code seemed to be broken when I tried testing
a large MAXPHYS on it), but I bet that there are others.

Windows has a MAXPHYS equivalent of 1MB.  Linux has an equivalent that is an
odd number less than 512K.  For the purpose of benchmarking against these
OSes, having comparable capabilities is essential; Linux easily beats FreeBSD
in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD
typically stomps Linux in real I/O because of vastly better latency and
caching algorithms).  I'm fine with raising MAXPHYS in production once the
problems are addressed.
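
To put rough numbers on the two constraints above (the pbuf KVM reservation
and the per-driver S/G limits), here is a small stand-alone C sketch.  It is
not FreeBSD kernel code: nswbuf = 256 and PAGE_SIZE = 4096 come from the
numbers quoted in the thread and common defaults, the S/G segment counts are
hypothetical driver limits, and the (nsegments - 1) * PAGE_SIZE bound assumes
the usual worst case of a transfer that does not start on a page boundary.

    /*
     * Back-of-the-envelope numbers for the two constraints discussed above:
     *  1. KVM reserved for pbufs is roughly nswbuf * MAXPHYS.
     *  2. A driver with room for a fixed number of S/G elements can only map
     *     (nsegments - 1) * PAGE_SIZE bytes per transfer in the worst case,
     *     because the buffer is not guaranteed to be page aligned.
     * Stand-alone sketch with assumed constants, not kernel code.
     */
    #include <stdio.h>

    #define PAGE_SIZE 4096UL
    #define NSWBUF    256UL            /* typical upper bound quoted above */

    static unsigned long pbuf_kvm(unsigned long maxphys)
    {
        return NSWBUF * maxphys;       /* bytes of KVM tied up in pbuf mappings */
    }

    static unsigned long max_xfer_for_segs(unsigned long nsegments)
    {
        /* Worst case: the transfer starts mid-page, so one segment is "lost". */
        return (nsegments - 1) * PAGE_SIZE;
    }

    int main(void)
    {
        unsigned long sizes[] = { 128 * 1024, 256 * 1024, 512 * 1024, 1024 * 1024 };
        unsigned long segs[]  = { 17, 33, 65, 129, 257 }; /* hypothetical driver limits */

        for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
            printf("MAXPHYS %4luK -> ~%4luM of KVM reserved for pbufs\n",
                   sizes[i] / 1024, pbuf_kvm(sizes[i]) / (1024 * 1024));

        for (size_t i = 0; i < sizeof(segs) / sizeof(segs[0]); i++)
            printf("%3lu S/G segments -> at most %4luK per transfer\n",
                   segs[i], max_xfer_for_segs(segs[i]) / 1024);

        return 0;
    }

On those assumptions, a 1MB MAXPHYS ties up roughly 256MB of KVM in pbufs,
which is why i386 is the problem case, and a driver whose command format only
leaves room for 33 S/G entries tops out at 128K per transfer no matter what
MAXPHYS is set to.
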
> Nearly all the issues I've come up against in the last few years have
> been related more to pipeline algorithms breaking down than to
> I/O size.  The cluster_read() code is especially vulnerable to
> algorithmic breakdowns when fast media (such as an SSD) is involved.
> e.g. I/Os queued from the previous cluster op can create stall
> conditions in subsequent cluster ops before they can issue new I/Os
> to keep the pipeline hot.
>

Yes, this is another very good point.  It's time to start really figuring out
what SSD means for FreeBSD I/O.

Scott
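
As a purely illustrative aside, the pipeline point can be shown with a toy
throughput model rather than any real cluster_read() code: compare a device
that is kept busy back to back with one that sits idle for a fixed gap
between cluster ops while the next batch of I/Os is being set up.  The
cluster size, service times, and stall length below are all made-up
parameters.

    /*
     * Toy model of the pipeline-stall effect described above: if the next
     * cluster of I/Os is only issued after a gap ("stall") following the
     * previous cluster, the stall is noise on a slow disk but dominates on
     * fast media such as an SSD.  All numbers here are illustrative.
     */
    #include <stdio.h>

    static double mbps(double cluster_kb, double service_us, double stall_us)
    {
        /* One cluster of cluster_kb is transferred every (service + stall) us. */
        return (cluster_kb / 1024.0) / ((service_us + stall_us) / 1e6);
    }

    int main(void)
    {
        const double cluster_kb = 128.0;  /* one cluster op, in KB (assumed) */
        const double stall_us   = 200.0;  /* gap before the next cluster is issued */

        /* Assumed per-cluster service times: spinning disk vs. SSD. */
        const double disk_us = 8000.0;
        const double ssd_us  = 300.0;

        printf("disk: %6.1f MB/s pipelined, %6.1f MB/s with stalls\n",
               mbps(cluster_kb, disk_us, 0.0), mbps(cluster_kb, disk_us, stall_us));
        printf("ssd:  %6.1f MB/s pipelined, %6.1f MB/s with stalls\n",
               mbps(cluster_kb, ssd_us, 0.0), mbps(cluster_kb, ssd_us, stall_us));
        return 0;
    }

With these made-up numbers, the same 200us gap costs the spinning disk a
couple of percent but costs the SSD roughly 40% of its throughput, which is
the sense in which fast media exposes algorithmic stalls that a larger
MAXPHYS cannot fix.
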