Re: Getting/Forcing Greater than 4KB Buffer Allocations

From: Scott Long <scottl_at_samsco.org>
Date: Thu, 19 Jul 2007 00:56:34 -0400
David Christensen wrote:
>>  > Thanks Pyun but I'm really just looking for a way to test that I
>>  > can handle the number of segments I've advertised that I can
>>  > support.  I believe my code is correct but when all I see are
>>  > allocations of 3 segments I just can't prove it.  I was hoping
>>  > that running a utility such as "stress" would help fragment memory
>>  > and force more variable responses but that hasn't happened yet.
>>  > 
>>
>> It seems you've used the following code to create the jumbo DMA tag.
>>         /*
>>          * Create a DMA tag for RX mbufs.
>>          */
>>         if (bus_dma_tag_create(sc->parent_tag,
>>                         1,
>>                         BCE_DMA_BOUNDARY,
>>                         sc->max_bus_addr,
>>                         BUS_SPACE_MAXADDR,
>>                         NULL, NULL,
>>                         MJUM9BYTES,
>>                         BCE_MAX_SEGMENTS,
>>                         MJUM9BYTES,
>>                         ^^^^^^^^^^
>>                         0,
>>                         NULL, NULL,
>>                 &sc->rx_mbuf_tag)) {
>>                 BCE_PRINTF("%s(%d): Could not allocate RX mbuf DMA tag!\n",
>>                         __FILE__, __LINE__);
>>                 rc = ENOMEM;
>>                 goto bce_dma_alloc_exit;
>>         }
>> If you want to have more than 9 DMA segments, change maxsegsz
>> (MJUM9BYTES) to 1024. bus_dma honors the maxsegsz argument, so you
>> won't get a DMA segment larger than maxsegsz. With an MJUM9BYTES
>> maxsegsz you would get up to 4 DMA segments on systems with a 4K
>> PAGE_SIZE. (You would have gotten up to 3 DMA segments if you had
>> used PAGE_SIZE as the alignment argument.)
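
(For illustration, a minimal sketch of the change being suggested: the
call below is the quoted bce(4) tag creation with only the maxsegsz
argument lowered to 1024.  BCE_MAX_SEGMENTS would also have to be
raised, since a 9k cluster split into 1024-byte segments needs more
than 8 entries.)

        if (bus_dma_tag_create(sc->parent_tag,
                        1,                      /* alignment */
                        BCE_DMA_BOUNDARY,       /* boundary */
                        sc->max_bus_addr,       /* lowaddr */
                        BUS_SPACE_MAXADDR,      /* highaddr */
                        NULL, NULL,             /* filter, filterarg */
                        MJUM9BYTES,             /* maxsize */
                        BCE_MAX_SEGMENTS,       /* nsegments */
                        1024,                   /* maxsegsz, was MJUM9BYTES */
                        0,                      /* flags */
                        NULL, NULL,             /* lockfunc, lockarg */
                &sc->rx_mbuf_tag)) {
                rc = ENOMEM;
                goto bce_dma_alloc_exit;
        }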
> 
> I don't want more segments; I just want to get a distribution of
> segment counts up to the maximum number I specified.  For example,
> since my BCE_MAX_SEGMENTS size is 8, I want to make sure I get mbufs
> that are spread over 1, 2, 3, 4, 5, 6, 7, and 8 segments.
> 
> It turns out that if I reduce the amount of memory in the system (from
> 8GB to 2GB), I get more mbufs coalesced into 2 segments rather than
> the more typical 3 segments, but that's good enough for my testing for
> now.
> 

Dave,

I'm trying to catch up on this thread, but I'm utterly confused as to
what you're looking for.  Let's try talking through a few scenarios
here:

1. Your hardware has slots for 3 SG elements, and all three MUST be
filled without exception.  Therefore, you want segments that are 4k, 4k,
and 1k (or some slight variation of that if the buffer is misaligned).
To do this, set the maxsegs to 3 and the maxsegsize to 4k (see the
sketch after this list).  This will ensure that busdma does no
coalescing (more on this topic later) and will always give you 3
segments for 9k of contiguous buffers.  If the actual buffer winds up
being <= 8k, busdma won't guarantee that you'll get 3 segments, and
you'll have to fake something up in your driver.  If the buffer winds
up being a fragmented mbuf chain, it won't guarantee that you'll get 3
segments either, but that's already handled now via m_defrag().

2. Your hardware can only handle 4k segments, but is less restrictive on
the min/max number of segments.  The solution is the same as above.

3. Your hardware has slots for 8 SG elements, and all 8 MUST be filled
without exception.  There's no easy solution for this, as it's a fairly
bizarre situation.  I'll only discuss it further if you confirm that
it's actually the case here.
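
To make scenarios 1 and 2 concrete, here is a sketch of such a tag.  It
reuses the identifiers from the bce(4) code quoted above; another
driver would substitute its own parent tag, boundary, and address
limits.

        /*
         * Cap each segment at one 4k page and allow at most 3 segments,
         * so a 9k contiguous cluster loads as roughly 4k + 4k + 1k.
         */
        if (bus_dma_tag_create(sc->parent_tag,
                        1,                      /* alignment */
                        BCE_DMA_BOUNDARY,       /* boundary */
                        sc->max_bus_addr,       /* lowaddr */
                        BUS_SPACE_MAXADDR,      /* highaddr */
                        NULL, NULL,             /* filter, filterarg */
                        MJUM9BYTES,             /* maxsize */
                        3,                      /* nsegments (maxsegs) */
                        PAGE_SIZE,              /* maxsegsz, 4k here */
                        0,                      /* flags */
                        NULL, NULL,             /* lockfunc, lockarg */
                &sc->rx_mbuf_tag)) {
                rc = ENOMEM;
                goto bce_dma_alloc_exit;
        }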

As for coalescing segments, I'm considering a new busdma back-end that
greatly streamlines loads by eliminating cycle-consuming tasks like
segment coalescing.  The original justification for coalescing was that
DMA engines operated faster with fewer segments.  That might still be
true, but the extra host CPU cycles and cache-line misses probably
result in a net loss.  I'm also going to axe bounce-buffer support,
since it bloats the instruction cache.  The target for this new
back-end is drivers for hardware that doesn't need these services and
that is sensitive to the number of host CPU cycles consumed, i.e.
modern 1Gb and 10Gb adapters.

The question I have is whether this new back-end should be accessible
directly, through yet another bus_dmamap_load_foo variant that drivers
have to know about specifically, or indirectly and automatically via
the existing bus_dmamap_load_foo variants.  The tradeoff is further API
pollution versus the opportunity for even more efficiency: no indirect
function calls and no cache misses from accessing the busdma tag.  I
don't like API pollution, since it makes code harder to maintain, but
the opportunity for the best possible performance is also appealing.
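
To make the question concrete, here is a purely hypothetical sketch of
the two shapes this could take.  Neither the function name nor the flag
below exists; both are invented only to illustrate the choice.

#include <machine/bus.h>

struct mbuf;

/*
 * Option A: a new, explicit load variant that a driver calls directly,
 * skipping the indirect dispatch through the tag.  (Hypothetical name.)
 */
int bus_dmamap_load_mbuf_sg_direct(bus_dma_tag_t tag, bus_dmamap_t map,
        struct mbuf *m, bus_dma_segment_t *segs, int *nsegs, int flags);

/*
 * Option B: the driver opts in at tag-creation time (hypothetical flag)
 * and keeps calling the existing bus_dmamap_load_mbuf_sg(); busdma
 * selects the streamlined path internally.
 */
#define BUS_DMA_NOCOALESCE      0x40000000      /* hypothetical */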

Scott