Re: Getting/Forcing Greater than 4KB Buffer Allocations

From: Scott Long <scottl_at_samsco.org>
Date: Mon, 23 Jul 2007 19:40:15 -0600
David Christensen wrote:
>> I'm trying to catch up on this thread, but I'm utterly confused as to
>> what you're looking for.  Let's try talking through a few scenarios
>> here:
> 
> My goal is simple.  I've modified my driver to support up to 8 segments
> in an mbuf and I want to verify that it works correctly.  It's simple to
> test when every mbuf has the same number of segments, but I want to make
> sure my code is robust enough to support cases where one mbuf is made of
> 3 segments while the next is made of 5 segments.  The best case would be
> to get a distribution of sizes from the min to the max (i.e. 1 to 8).
> I'm not trying to test for performance, only for proper operation under
> a worst case load.
> 
>> 1. Your hardware has slots for 3 SG elements, and all three MUST be
>> filled without exception.  Therefore, you want segments that are 4k,
>> 4k, and 1k (or some slight variation of that if the buffer is
>> misaligned).  To do this, set the maxsegs to 3 and the maxsegsize to
>> 4k.  This will ensure that busdma does no coalescing (more on this
>> topic later) and will always give you 3 segments for 9k of contiguous
>> buffers.  If the actual buffer winds up being <= 8k, busdma won't
>> guarantee that you'll get 3 segments, and you'll have to fake
>> something up in your driver.  If the buffer winds up being a
>> fragmented mbuf chain, it also won't guarantee that you'll get 3
>> segments either, but that's already handled now via m_defrag().
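
For concreteness, the tag setup described above looks roughly like the
following.  This is only a sketch: the parent tag, the softc field, and
the surrounding error handling are placeholders for whatever the driver
already uses.

	error = bus_dma_tag_create(
	    bus_get_dma_tag(dev),	/* parent */
	    1, 0,			/* alignment, boundary */
	    BUS_SPACE_MAXADDR,		/* lowaddr */
	    BUS_SPACE_MAXADDR,		/* highaddr */
	    NULL, NULL,			/* filter, filterarg */
	    MJUM9BYTES,			/* maxsize: one 9k jumbo frame */
	    3,				/* nsegments: the 3 SG slots */
	    PAGE_SIZE,			/* maxsegsz: 4k, so no segment grows past a page */
	    0,				/* flags */
	    NULL, NULL,			/* lockfunc, lockfuncarg */
	    &sc->rx_tag);		/* sc->rx_tag is a made-up name */
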
> 
> My hardware supports multiples of 255 buffer descriptors (255, 510,
> 765, etc.).  If all mbufs have 1 segment (common for 1500 MTU) then
> I can handle multiples of 255 mbufs.  If all mbufs have 3 segments
> (common for 9000 MTU) then I can handle multiples of 85 mbufs.  If
> the mbufs have a varying number of segments (anywhere from 1 to 8)
> then a varying number of mbufs can be buffered.  This last case is
> the most complicated to handle and I want to make sure my code is
> robust enough to handle it.  I've found that reducing the system 
> memory from 8GB to 2GB has allowed me to see both 2-segment and
> 3-segment mbufs (the former, I assume, occurs because of coalescing)
> but I haven't been able to load the system in such a way to cause
> any other number of segments to occur.

There is no way to tell busdma "give me no less than N number of
segments" because that is generally not a useful task for bio and mbuf
operations (though it is useful for static allocations, and is a feature
that is on the TODO list).  For the purposes of validation like what you
want, you're going to have to write either a custom mbuf injector that
creates an mbuf chain of a header and N number of clusters that are each
allocated separately and only filled partially, or you're going to have
to do manual splitting of the segments that busdma gives you.
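
A rough sketch of the injector approach (test-only code; all names here
are made up and the error paths are minimal):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/*
 * Build a chain of 'nsegs' mbufs, each with its own separately
 * allocated cluster and only partially filled, so the loaded map
 * can't be coalesced into fewer than 'nsegs' segments.
 */
static struct mbuf *
make_sparse_chain(int nsegs, int fill)
{
	struct mbuf *m, *top, **nextp;
	int i;

	if (nsegs < 1 || fill < 1 || fill > MCLBYTES)
		return (NULL);
	top = NULL;
	nextp = &top;
	for (i = 0; i < nsegs; i++) {
		/* Each mbuf gets its own 2k cluster. */
		m = m_getcl(M_DONTWAIT, MT_DATA, i == 0 ? M_PKTHDR : 0);
		if (m == NULL) {
			m_freem(top);
			return (NULL);
		}
		m->m_len = fill;	/* only partially filled */
		*nextp = m;
		nextp = &m->m_next;
	}
	top->m_pkthdr.len = nsegs * fill;
	return (top);
}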

> 
>> 2. Your hardware can only handle 4k segments, but is less restrictive
>> on the min/max number of segments.  The solution is the same as above.
> 
> No practical limit on the segment size.  Anything between 1 byte and 
> 9KB is fine.
> 
>> 3. Your hardware has slots for 8 SG elements, and all 8 MUST be filled
>> without exception.  There's no easy solution for this, as it's a
>> fairly bizarre situation.  I'll only discuss it further if you confirm
>> that it's actually the case here.
> 
> The number of SG elements per mbuf can vary anywhere from 1 to 8.  If
> the first mbuf uses 2 slots then there's no problem with the second
> using 8 slots, and the third using 4.  The only difficulty comes in
> keeping the ring full, since the number of slots used won't always
> match the number of slots available.  I think I can handle this
> correctly, but it's difficult to test since all of the mbufs currently
> use the same number of slots (which also happens to divide evenly into
> the total number of slots available in the ring).
> 
>> As for coalescing segments, I'm considering a new busdma back-end that
>> greatly streamlines loads by eliminating cycle-consuming tasks like
>> segment coalescing.  The original justification for coalescing was
>> that DMA engines operated faster with fewer segments.  That might
>> still be true, but the extra host CPU cycles and cache-line misses
>> probably result in a net loss.  I'm also going to axe bounce-buffer
>> support since it bloats the I cache.  The target for this new back-end
>> is drivers that support hardware that doesn't need these services and
>> that are also sensitive to the amount of host CPU cycles being
>> consumed, i.e. modern 1Gb and 10Gb adapters.  The question I have is
>> whether this new back-end should be accessible directly through yet
>> another bus_dmamap_load_foo variant that the drivers need to know
>> specifically about, or indirectly and automatically via the existing
>> bus_dmamap_load_foo variants.  The tradeoff is further API pollution
>> vs the opportunity for even more efficiency through no indirect
>> function calls and no cache misses from accessing the busdma tag.  I
>> don't like API pollution since it makes it harder to maintain code,
>> but the opportunity for the best performance possible is also
>> appealing.
> 
> Others have reported that single, larger segments provide better
> performance than multiple, smaller segments.  (Kip Macy recently
> forwarded me a patch to test which shows a performance improvement
> on the cxgb adapter when this is used.)  I haven't done enough
> performance testing on bce to know if this helps overall, hurts,
> or makes no overall difference.  One thing I am interested in is
> finding a way to allocate receive mbufs such that I can split the
> header into a single buffer and then place the data into one or
> more page-aligned buffers, similar to what a transmit mbuf looks
> like.  Any way to support that in the current architecture?
> 

Useful data points.  For the case where the operation is limited by host
CPU cycles (as is often the case with 10Gb right now), offloading more
of the DMA segmentation work to the chip is desirable, even if the
overall outcome isn't completely ideal.

For your question about RX mbufs, look at the if_ti driver.  It was
designed specifically with RX header splitting in mind and handles
exactly what you are talking about.  If header splitting becomes more
interesting again, it might be useful to move some of this out of the
drivers and into the mbuf allocator.
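
This is not if_ti's actual code, but the shape of a header-split RX
buffer is roughly the following, assuming the page-sized jumbo cluster
zone is available; the function name is made up.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/*
 * A small mbuf for the NIC to DMA the protocol headers into, chained
 * to a page-aligned cluster for the payload, so the received packet
 * ends up looking much like a typical transmit chain.
 */
static struct mbuf *
alloc_split_rx_buf(void)
{
	struct mbuf *hdr, *data;

	hdr = m_gethdr(M_DONTWAIT, MT_DATA);	/* headers land here */
	if (hdr == NULL)
		return (NULL);
	/* Page-sized (and page-aligned) cluster for the data. */
	data = m_getjcl(M_DONTWAIT, MT_DATA, 0, MJUMPAGESIZE);
	if (data == NULL) {
		m_free(hdr);
		return (NULL);
	}
	hdr->m_len = 0;		/* lengths are set at RX completion time */
	data->m_len = 0;
	hdr->m_next = data;
	return (hdr);
}
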

Scott