On Mar 20, 2010, at 11:53 AM, Matthew Dillon wrote:
>
> :All of the above I have successfully tested over the last few months with
> :MAXPHYS of 1MB on i386 and amd64 platforms.
> :
> :So my questions are:
> :- does somebody know of any issues preventing an increase of MAXPHYS in HEAD?
> :- are there any specific opinions about the value? 512K, 1MB, MD?
> :
> :--
> :Alexander Motin
>
> (nswbuf * MAXPHYS) of KVM is reserved for pbufs, so on i386 you
> might hit up against KVM exhaustion issues in unrelated subsystems.
> nswbuf typically maxes out at around 256.  For i386 1MB is probably
> too large (256M of reserved KVM is a lot for i386).  On amd64 there
> shouldn't be a problem.
>

Yes, this needs to be addressed.  I've never gotten a clear answer from VM
people like Peter Wemm and Alan Cox on what should be done.

> Diminishing returns get hit pretty quickly with larger MAXPHYS values.
> As long as the I/O can be pipelined the reduced transaction rate
> becomes less interesting when the transaction rate is less than a
> certain level.  Off the cuff I'd say 2000 tps is a good basis for
> considering whether it is an issue or not.  256K is actually quite
> a reasonable value.  Even 128K is reasonable.
>

I agree completely.  I did quite a bit of testing on this in 2008 and 2009.
I even added some hooks into CAM to support this, and I thought that I had
discussed this extensively with Alexander at the time.  Guess it was yet
another wasted conversation with him =-(  I'll repeat it here for the record.

What I call the silly-i/o-test, filling a disk up with the dd command, yields
performance improvements up to a MAXPHYS of 512K.  Beyond that the gains are
negligible, and larger sizes actually start running into contention on the VM
page queues lock.  There is some work underway to break down this lock, so
it's worth revisiting in the future.

For the non-silly-i/o-test, where I do real file i/o using various sequential
and random patterns, there was a modest improvement up to 256K, and a slight
improvement up to 512K.  This surprised me, as I had figured that most
filesystem i/o would be in UFS block-sized chunks.  Then I realized that the
UFS clustering code was actually taking advantage of the larger I/Os.  The
improvement really depends on the workload, of course, and I wouldn't expect
it to be noticeable for most people unless they're running something like a
media server.

Besides the nswbuf sizing problem, there is a real problem in that a lot of
drivers have incorrectly assumed over the years that MAXPHYS and DFLTPHYS are
particular values, and they've sized their data structures accordingly.
Before these values are changed, an audit needs to be done OF EVERY SINGLE
STORAGE DRIVER.  No exceptions.  This isn't a case of changing MAXPHYS in the
ata driver, testing that your machine boots, and then committing the change
to source control.  Some drivers will have non-obvious restrictions based on
the number of S/G elements allowed in a particular command format.  MPT comes
to mind (its multi-message S/G code seemed to be broken when I tried testing
a large MAXPHYS on it), but I bet that there are others.

Windows has a MAXPHYS equivalent of 1MB.  Linux has an equivalent that is an
odd number less than 512K.  For the purpose of benchmarking against these
OSes, having comparable capabilities is essential; Linux easily beats FreeBSD
in the silly-i/o-test because of the MAXPHYS difference (though FreeBSD
typically stomps Linux in real I/O because of vastly better latency and
caching algorithms).  I'm fine with raising MAXPHYS in production once the
problems are addressed.
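
To put rough numbers on the two constraints above (the pbuf KVM reservation
and the per-driver S/G limits), here is a small stand-alone C sketch.  It is
not FreeBSD kernel code: nswbuf = 256 and PAGE_SIZE = 4096 come from the
numbers quoted in the thread and common defaults, the S/G segment counts are
hypothetical driver limits, and the (nsegments - 1) * PAGE_SIZE bound assumes
the usual worst case of a transfer that does not start on a page boundary.

    /*
     * Back-of-the-envelope numbers for the two constraints discussed above:
     *  1. KVM reserved for pbufs is roughly nswbuf * MAXPHYS.
     *  2. A driver with room for a fixed number of S/G elements can only map
     *     (nsegments - 1) * PAGE_SIZE bytes per transfer in the worst case,
     *     because the buffer is not guaranteed to be page aligned.
     * Stand-alone sketch with assumed constants, not kernel code.
     */
    #include <stdio.h>

    #define PAGE_SIZE 4096UL
    #define NSWBUF    256UL            /* typical upper bound quoted above */

    static unsigned long pbuf_kvm(unsigned long maxphys)
    {
        return NSWBUF * maxphys;       /* bytes of KVM tied up in pbuf mappings */
    }

    static unsigned long max_xfer_for_segs(unsigned long nsegments)
    {
        /* Worst case: the transfer starts mid-page, so one segment is "lost". */
        return (nsegments - 1) * PAGE_SIZE;
    }

    int main(void)
    {
        unsigned long sizes[] = { 128 * 1024, 256 * 1024, 512 * 1024, 1024 * 1024 };
        unsigned long segs[]  = { 17, 33, 65, 129, 257 }; /* hypothetical driver limits */

        for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
            printf("MAXPHYS %4luK -> ~%4luM of KVM reserved for pbufs\n",
                   sizes[i] / 1024, pbuf_kvm(sizes[i]) / (1024 * 1024));

        for (size_t i = 0; i < sizeof(segs) / sizeof(segs[0]); i++)
            printf("%3lu S/G segments -> at most %4luK per transfer\n",
                   segs[i], max_xfer_for_segs(segs[i]) / 1024);

        return 0;
    }

On those assumptions, a 1MB MAXPHYS ties up roughly 256MB of KVM in pbufs,
which is why i386 is the problem case, and a driver whose command format only
leaves room for 33 S/G entries tops out at 128K per transfer no matter what
MAXPHYS is set to.
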
> Nearly all the issues I've come up against in the last few years have
> been related more to pipeline algorithms breaking down than to
> I/O size.  The cluster_read() code is especially vulnerable to
> algorithmic breakdowns when fast media (such as an SSD) is involved.
> e.g. I/Os queued from the previous cluster op can create stall
> conditions in subsequent cluster ops before they can issue new I/Os
> to keep the pipeline hot.
>

Yes, this is another very good point.  It's time to start really figuring out
what SSD means for FreeBSD I/O.

Scott
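
As a purely illustrative aside, the pipeline point can be shown with a toy
throughput model rather than any real cluster_read() code: compare a device
that is kept busy back to back with one that sits idle for a fixed gap
between cluster ops while the next batch of I/Os is being set up.  The
cluster size, service times, and stall length below are all made-up
parameters.

    /*
     * Toy model of the pipeline-stall effect described above: if the next
     * cluster of I/Os is only issued after a gap ("stall") following the
     * previous cluster, the stall is noise on a slow disk but dominates on
     * fast media such as an SSD.  All numbers here are illustrative.
     */
    #include <stdio.h>

    static double mbps(double cluster_kb, double service_us, double stall_us)
    {
        /* One cluster of cluster_kb is transferred every (service + stall) us. */
        return (cluster_kb / 1024.0) / ((service_us + stall_us) / 1e6);
    }

    int main(void)
    {
        const double cluster_kb = 128.0;  /* one cluster op, in KB (assumed) */
        const double stall_us   = 200.0;  /* gap before the next cluster is issued */

        /* Assumed per-cluster service times: spinning disk vs. SSD. */
        const double disk_us = 8000.0;
        const double ssd_us  = 300.0;

        printf("disk: %6.1f MB/s pipelined, %6.1f MB/s with stalls\n",
               mbps(cluster_kb, disk_us, 0.0), mbps(cluster_kb, disk_us, stall_us));
        printf("ssd:  %6.1f MB/s pipelined, %6.1f MB/s with stalls\n",
               mbps(cluster_kb, ssd_us, 0.0), mbps(cluster_kb, ssd_us, stall_us));
        return 0;
    }

With these made-up numbers, the same 200us gap costs the spinning disk a
couple of percent but costs the SSD roughly 40% of its throughput, which is
the sense in which fast media exposes algorithmic stalls that a larger
MAXPHYS cannot fix.
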