Re: Increasing MAXPHYS

From: Scott Long <scottl_at_samsco.org>
Date: Mon, 22 Mar 2010 10:27:15 -0600
On Mar 22, 2010, at 9:52 AM, Alexander Sack wrote:
> On Mon, Mar 22, 2010 at 8:39 AM, John Baldwin <jhb_at_freebsd.org> wrote:
>> On Monday 22 March 2010 7:40:18 am Gary Jennejohn wrote:
>>> On Sun, 21 Mar 2010 19:03:56 +0200
>>> Alexander Motin <mav_at_FreeBSD.org> wrote:
>>> 
>>>> Scott Long wrote:
>>>>> Are there non-CAM drivers that look at MAXPHYS, or that silently assume
>>>>> that MAXPHYS will never be more than 128k?
>>>> 
>>>> That is a question.
>>>> 
>>> 
>>> I only did a quick&dirty grep looking for MAXPHYS in /sys.
>>> 
>>> Some drivers redefine MAXPHYS to be 512KiB.  Some use their own local
>>> MAXPHYS which is usually 128KiB.
>>> 
>>> Some look at MAXPHYS to figure out other things; the details escape me.
>>> 
>>> There's one driver which actually uses 100*MAXPHYS for something, but I
>>> didn't check the details.
>>> 
>>> Lots of them were non-CAM drivers AFAICT.
>> 
>> The problem is the drivers that _don't_ reference MAXPHYS.  The driver author
>> at the time "knew" that MAXPHYS was 128k, so he did the MAXPHYS-dependent
>> calculation and just put the result in the driver (e.g. only supporting up to
>> 32 segments (32 4k pages == 128k) in a bus dma tag as a magic number to
>> bus_dma_tag_create() w/o documenting that the '32' was derived from 128k or
>> what the actual hardware limit on nsegments is).  These cannot be found by a
>> simple grep; they require manually inspecting each driver.
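
As a concrete sketch of the pattern being described here (hypothetical foo_* names and an assumed hardware limit of 128 segments, neither taken from a real driver), the segment count can be derived from MAXPHYS rather than hard-coded:

    #include <sys/param.h>
    #include <sys/bus.h>
    #include <machine/bus.h>

    struct foo_softc {
            bus_dma_tag_t foo_dmat;
    };

    /* Assumed hardware S/G limit; a hypothetical value for illustration. */
    #define FOO_HW_MAX_SEGS   128
    /* Worst case: an unaligned MAXPHYS transfer touches one extra page. */
    #define FOO_MAXPHYS_SEGS  (howmany(MAXPHYS, PAGE_SIZE) + 1)
    #define FOO_NSEGS         MIN(FOO_HW_MAX_SEGS, FOO_MAXPHYS_SEGS)

    static int
    foo_alloc_dma(device_t dev, struct foo_softc *sc)
    {
            /* nsegments is derived from MAXPHYS, not a magic 32. */
            return (bus_dma_tag_create(
                bus_get_dma_tag(dev),       /* parent */
                1, 0,                       /* alignment, boundary */
                BUS_SPACE_MAXADDR,          /* lowaddr */
                BUS_SPACE_MAXADDR,          /* highaddr */
                NULL, NULL,                 /* filter, filterarg */
                MAXPHYS,                    /* maxsize tracks MAXPHYS */
                FOO_NSEGS,                  /* nsegments */
                PAGE_SIZE,                  /* maxsegsize */
                0,                          /* flags */
                NULL, NULL,                 /* lockfunc, lockarg */
                &sc->foo_dmat));
    }

With that shape, a grep for MAXPHYS actually finds the dependency, and raising MAXPHYS only changes the tag, not the driver logic.
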
> 
> 100% awesome comment.  On another kernel, I myself was guilty of this
> crime (though I did have a nice comment above the define).
> 
> This has been a great thread, since our application really needs some
> of the optimizations being thrown around here.  We have found in
> real-life performance testing that we are almost always either
> controller bound (i.e. adding more disks to spread IOPs has little to
> no effect on throughput in large array configurations; we suspect we
> are hitting the RAID controller's firmware limitations) or tps bound,
> so I never thought going from 128k -> 256k per transaction would have
> a dramatic effect on throughput (but I never verified that).
> 
> Back to HBAs: AFAIK, every modern iteration of the most popular HBAs
> can easily do way more than a 128k scatter/gather I/O.  Do you guys
> know of any *modern* HBAs (from within the last 3-4 years) that cannot
> do more than 128k at a shot?

I/O's larger than 64K are broken in MPT at the moment.  The hardware can do it, the driver thinks it can do it, but it fails.  AAC hardware traditionally cannot, but maybe the firmware has been improved in the past few years.  I know that there are other low-performance devices that can't do more than 64 or 128K, but none are coming to mind at the moment.  Still, it shouldn't be a universal assumption that all hardware can do big I/O's.

Another consideration is that some hardware can do big I/O's, but not very efficiently.  Not all DMA engines are created equal, and moving to compound commands and excessively long S/G lists can be a pessimization.  For example, MFI hardware does a hinted prefetch on the segment list, but once you exceed a certain limit, that prefetch doesn't work anymore and the firmware has to take the slow path to execute the I/O.  I haven't quantified this penalty yet, but it's something that should be thought about.
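
One way to express this per controller is through the maxio field in CAM's XPT_PATH_INQ CCB, which lets an HBA driver bound the transfer size the upper layers will build for it (the field may postdate parts of this discussion).  A minimal sketch, with a hypothetical foo_ prefix and an assumed 64-segment "efficient" cutoff rather than a real MFI number:

    #include <sys/param.h>
    #include <cam/cam.h>
    #include <cam/cam_ccb.h>

    /* Assumed firmware-efficient segment count; hypothetical value. */
    #define FOO_EFFICIENT_SEGS      64

    /*
     * Called from the driver's XPT_PATH_INQ handling.  Capping
     * cpi->maxio below MAXPHYS keeps the upper layers from building
     * transfers that would push the firmware onto its slow path,
     * even if MAXPHYS itself is raised.
     */
    static void
    foo_path_inq_maxio(struct ccb_pathinq *cpi)
    {
            cpi->maxio = MIN(MAXPHYS, FOO_EFFICIENT_SEGS * PAGE_SIZE);
    }

That way a bigger MAXPHYS helps the controllers that benefit from it without penalizing the ones that don't.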

> 
> In other words, I've always thought the limit was kernel imposed and
> not what the memory controller on the card can do (I certainly never
> got the impression talking with some of the IHVs over the years that
> they were designing their hardware for a 128k limit - I sure hope
> not!).

You'd be surprised at the engineering compromises and handicaps that are committed at IHVs because of misguided marketers.

Scott
Received on Mon Mar 22 2010 - 15:27:19 UTC
