On 08.12.2011 16:34, Luigi Rizzo wrote:
> On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
>> On 12/08/11 05:08, Luigi Rizzo wrote:
> ...
>>> I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
>>> seems slightly faster than HEAD), with MTU=1500 and various
>>> combinations of card capabilities (hwcsum, tso, lro), different
>>> window sizes and interrupt mitigation configurations.
>>>
>>> The default interrupt mitigation latency is 16us; l=0 means no
>>> interrupt mitigation. "lro" is the software LRO implementation
>>> (tcp_lro.c); "hwlro" is the hardware one (on the 82599). Using a
>>> window of 100 Kbytes seems to give the best results.
>>>
>>> Summary:
>>
>> [snip]
>>
>>> - enabling software lro on the transmit side actually slows
>>>   down the throughput (4-5 Gbit/s instead of 8.0).
>>>   I am not sure why (perhaps acks are delayed too much)?
>>>   Adding a couple of lines in tcp_lro to reject
>>>   pure acks seems to have a much better effect.
>>>
>>> The tcp_lro patch below might actually be useful also for
>>> other cards.
>>>
>>> --- tcp_lro.c  (revision 228284)
>>> +++ tcp_lro.c  (working copy)
>>> @@ -245,6 +250,8 @@
>>>
>>>  	ip_len = ntohs(ip->ip_len);
>>>  	tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
>>> +	if (tcp_data_len == 0)
>>> +		return -1;	/* not on ack */
>>>
>>>  	/*
>>
>> There is a bug with our LRO implementation (first noticed by Jeff
>> Roberson) that I started fixing some time back but dropped the ball
>> on. The crux of the problem is that we currently only send an ACK
>> for the entire LRO chunk instead of all the segments contained
>> therein. Given that most stacks rely on the ACK clock to keep things
>> ticking over, the current behaviour kills performance. It may well
>> be the cause of the performance loss you have observed.
>
> I should clarify better.
> First of all, I tested two different LRO implementations: our
> "Software LRO" (tcp_lro.c), and the "Hardware LRO" implemented by
> the 82599 (called RSC, receive-side coalescing, in the 82599 data
> sheets). Jack Vogel and Navdeep Parhar (both in Cc) can probably
> comment on the logic of both.
>
> In my tests, either SW or HW LRO on the receive side HELPED A LOT,
> not just in terms of raw throughput but also in terms of system
> load on the receiver. On the receive side, LRO packs multiple data
> segments into one that is passed up the stack.
>
> As you mentioned, this also reduces the number of acks generated,
> but not dramatically (consider that the merging is bounded by the
> number of segments received in the mitigation interval).
> In my tests the number of reads() on the receiver was reduced by
> approximately a factor of 3 compared to the !LRO case, meaning 4-5
> segments merged by LRO. Navdeep reported similar numbers for cxgbe.
>
> Using Hardware LRO on the transmit side had no ill effect.
> Since it is done in hardware, I have no idea how it is implemented.
>
> Using Software LRO on the transmit side did give a significant
> throughput reduction. I can't explain the exact cause, though it is
> possible that, between reducing the number of segments seen by the
> receiver and collapsing the ACKs it generates, the sender starves.
> But it could well be that it is the extra delay in passing up the
> ACKs that limits performance.
> Either way, since the HW LRO did a fine job, I was trying to figure
> out whether avoiding LRO on pure acks could help, and the two-line
> patch above did help.
>
> Note, my patch was just a proof-of-concept, and may cause
> reordering if a data segment is followed by a pure ack.
> But this can be fixed easily, by handling a pure ack as
> an out-of-sequence packet in tcp_lro_rx().
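To make the suggested fix concrete, here is a minimal standalone sketch
(plain user-space C, not the actual tcp_lro.c code) of the two steps
involved: classify a segment with zero TCP payload as a pure ack, and on
seeing one, flush whatever has already been merged for the flow before
passing the ack up unmerged, so data-then-ack ordering is preserved. The
lro_is_pure_ack/lro_check_pure_ack names and the flush_cb callback are
illustrative assumptions, not existing FreeBSD API.

#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>

/*
 * Return nonzero if the TCP segment carries no payload, i.e. it is a
 * "pure ack" that should not be merged by LRO.  The length computation
 * mirrors the one in the quoted tcp_lro.c patch.
 */
static int
lro_is_pure_ack(const struct ip *ip, const struct tcphdr *th)
{
	int ip_len = ntohs(ip->ip_len);
	int tcp_data_len = ip_len - (th->th_off << 2) - (int)sizeof(*ip);

	return (tcp_data_len == 0);
}

/*
 * Decision helper: on a pure ack, first flush whatever has been merged
 * for this flow (flush_cb is a hypothetical stand-in for flushing the
 * flow's pending lro_entry), then reject the ack so it is passed up
 * immediately.  Returns -1 for "reject / pass up unmerged", 0 for
 * "candidate for merging".
 */
static int
lro_check_pure_ack(const struct ip *ip, const struct tcphdr *th,
    void (*flush_cb)(void *arg), void *arg)
{
	if (lro_is_pure_ack(ip, th)) {
		if (flush_cb != NULL)
			flush_cb(arg);	/* avoid data-then-ack reordering */
		return (-1);
	}
	return (0);
}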
>
>> WIP patch is at:
>> http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch
>>
>> Jeff tested the WIP patch and it *doesn't* fix the issue. I don't
>> have LRO capable hardware set up locally to figure out what I've
>> missed. Most of the machines in my lab are running em(4) NICs which
>> don't support LRO, but I'll see if I can find something which does
>> and perhaps resurrect this patch.

LRO can always be done in software. You can do it at the driver,
ether_input() or ip_input() level.

> A few comments:
>
> 1. I don't think it makes sense to send multiple acks on
> coalesced segments (and the 82599 does not seem to do that).
> First of all, the acks would get out with minimal spacing (ideally
> less than 100ns), so chances are that the remote end will see
> them in a single burst anyway. Secondly, the remote end can
> easily tell that a single ACK is reporting multiple MSS and
> behave as if an equivalent number of acks had arrived.

ABC (appropriate byte counting) gets in the way though (see the
byte-counting sketch after this message).

> 2. I am a big fan of LRO (and similar solutions), because it can
> save a lot of repeated work when passing packets up the stack, and
> the mechanism becomes more and more effective as the system load
> increases, which is a wonderful property in terms of system
> stability.
>
> For this reason, I think it would be useful to add support for
> software LRO to the generic code (sys/net/if.c) so that drivers can
> directly use the software implementation even without hardware
> support.

It hurts on higher-RTT links in the general case. For LAN RTTs it's
good.

> 3. Similar to LRO, it would make sense to implement a "software TSO"
> mechanism where the TCP sender pushes a large segment down to
> ether_output, with code in if_ethersubr.c doing the segmentation
> and checksum computation (a segmentation sketch appears after this
> message). This would save multiple traversals of the various layers
> of the stack recomputing essentially the same information for all
> segments.

All modern NICs support hardware TSO. There's little benefit in having
a parallel software implementation. And then you run into the mbuf
chain copying issue further down the layer. The win won't be much.

-- 
Andre
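On the ABC remark above: RFC 3465 grows cwnd by the number of bytes
newly acknowledged, but caps the growth per ACK at L*SMSS, so a single
coalesced ACK covering many segments cannot "behave as if an equivalent
number of acks had arrived". A minimal sketch of the slow-start rule,
in standalone C with illustrative names (not the FreeBSD
implementation):

#include <stdint.h>

/*
 * RFC 3465 Appropriate Byte Counting, slow-start case: cwnd grows by
 * the number of bytes newly acked, capped at L*SMSS per ACK (L = 2 is
 * the commonly cited limit).
 */
#define ABC_L_LIMIT	2

static uint32_t
abc_slow_start_cwnd(uint32_t cwnd, uint32_t bytes_acked, uint32_t smss)
{
	uint32_t incr = bytes_acked;
	uint32_t cap = ABC_L_LIMIT * smss;

	if (incr > cap)
		incr = cap;	/* per-ACK cap is the coalescing penalty */
	return (cwnd + incr);
}

With L = 2, one ACK covering five segments buys 2*SMSS of cwnd growth
where five separate ACKs would have bought 5*SMSS, which is the
interaction Andre is pointing at.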
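On point 3, the sketch below shows the core of what a software TSO in
if_ethersubr.c would have to do: carve one large payload into MSS-sized
segments with advancing sequence numbers, once, instead of re-deriving
that per segment across full stack traversals. Standalone C with
hypothetical names; mbuf chain handling, header replication, and
checksum computation are deliberately omitted (they are exactly where
Andre expects the copying cost to bite).

#include <stddef.h>
#include <stdint.h>

struct seg {
	uint32_t seq;	/* TCP sequence number of this segment */
	size_t   off;	/* payload offset into the large buffer */
	size_t   len;	/* payload bytes in this segment */
};

/*
 * Carve a large send of 'total' payload bytes starting at sequence
 * number 'seq' into at most 'max' MSS-sized segments.  Returns the
 * number of segments produced.  A real in-kernel version would build
 * an mbuf chain per segment and fill in IP/TCP headers instead.
 */
static size_t
sw_tso_split(uint32_t seq, size_t total, size_t mss,
    struct seg *out, size_t max)
{
	size_t n = 0, off = 0;

	while (total > 0 && n < max) {
		size_t len = (total < mss) ? total : mss;

		out[n].seq = seq;
		out[n].off = off;
		out[n].len = len;
		seq += (uint32_t)len;
		off += len;
		total -= len;
		n++;
	}
	return (n);
}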