On 08.12.2011 16:34, Luigi Rizzo wrote:
> On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
>> On 12/08/11 05:08, Luigi Rizzo wrote:
> ...
>>> I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
>>> seems slightly faster than HEAD), with MTU=1500 and various
>>> combinations of card capabilities (hwcsum, tso, lro), different
>>> window sizes and interrupt mitigation configurations.
>>>
>>> The default interrupt mitigation latency is 16us; l=0 means no
>>> interrupt mitigation. "lro" is the software LRO implementation
>>> (tcp_lro.c); "hwlro" is the hardware one (on the 82599). Using a
>>> window of 100 Kbytes seems to give the best results.
>>>
>>> Summary:
>>
>> [snip]
>>
>>> - enabling software lro on the transmit side actually slows
>>>   down the throughput (4-5 Gbit/s instead of 8.0).
>>>   I am not sure why (perhaps acks are delayed too much)?
>>>   Adding a couple of lines in tcp_lro to reject
>>>   pure acks seems to have a much better effect.
>>>
>>> The tcp_lro patch below might actually be useful also for
>>> other cards.
>>>
>>> --- tcp_lro.c  (revision 228284)
>>> +++ tcp_lro.c  (working copy)
>>> @@ -245,6 +250,8 @@
>>>
>>>  	ip_len = ntohs(ip->ip_len);
>>>  	tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
>>> +	if (tcp_data_len == 0)
>>> +		return -1;	/* not on ack */
>>>
>>>  	/*
>>
>> There is a bug with our LRO implementation (first noticed by Jeff
>> Roberson) that I started fixing some time back but dropped the ball
>> on. The crux of the problem is that we currently only send an ACK
>> for the entire LRO chunk instead of all the segments contained
>> therein. Given that most stacks rely on the ACK clock to keep things
>> ticking over, the current behaviour kills performance. It may well
>> be the cause of the performance loss you have observed.
>
> I should clarify better.
> First of all, I tested two different LRO implementations: our
> "Software LRO" (tcp_lro.c), and the "Hardware LRO" implemented by
> the 82599 (called RSC, receive-side coalescing, in the 82599 data
> sheets). Jack Vogel and Navdeep Parhar (both in Cc) can probably
> comment on the logic of both.
>
> In my tests, either SW or HW LRO on the receive side HELPED A LOT,
> not just in terms of raw throughput but also in terms of system
> load on the receiver. On the receive side, LRO packs multiple data
> segments into one that is passed up the stack.
>
> As you mentioned, this also reduces the number of acks generated,
> but not dramatically (consider that the merging is bounded by the
> number of segments received in the mitigation interval).
> In my tests the number of reads() on the receiver was reduced by
> approximately a factor of 3 compared to the !LRO case, meaning 4-5
> segments merged by LRO. Navdeep reported similar numbers for cxgbe.
>
> Using Hardware LRO on the transmit side had no ill effect.
> Since it is done in hardware, I have no idea how it is implemented.
>
> Using Software LRO on the transmit side did give a significant
> throughput reduction. I can't explain the exact cause, though it is
> possible that, between reducing the number of segments seen by the
> receiver and collapsing the ACKs it generates, the sender starves.
> But it could well be that it is the extra delay in passing up the
> ACKs that limits performance.
> Either way, since the HW LRO did a fine job, I was trying to figure
> out whether avoiding LRO on pure acks could help, and the two-line
> patch above did help.
>
> Note, my patch was just a proof-of-concept, and may cause
> reordering if a data segment is followed by a pure ack.
> But this can be fixed easily, by handling a pure ack as
> an out-of-sequence packet in tcp_lro_rx().
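To make the suggested fix concrete, here is a minimal standalone sketch
(plain user-space C, not the actual tcp_lro.c code) of the two steps
involved: classify a segment with zero TCP payload as a pure ack, and on
seeing one, flush whatever has already been merged for the flow before
passing the ack up unmerged, so data-then-ack ordering is preserved. The
lro_is_pure_ack/lro_check_pure_ack names and the flush_cb callback are
illustrative assumptions, not existing FreeBSD API.

#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>

/*
 * Return nonzero if the TCP segment carries no payload, i.e. it is a
 * "pure ack" that should not be merged by LRO.  The length computation
 * mirrors the one in the quoted tcp_lro.c patch.
 */
static int
lro_is_pure_ack(const struct ip *ip, const struct tcphdr *th)
{
	int ip_len = ntohs(ip->ip_len);
	int tcp_data_len = ip_len - (th->th_off << 2) - (int)sizeof(*ip);

	return (tcp_data_len == 0);
}

/*
 * Decision helper: on a pure ack, first flush whatever has been merged
 * for this flow (flush_cb is a hypothetical stand-in for flushing the
 * flow's pending lro_entry), then reject the ack so it is passed up
 * immediately.  Returns -1 for "reject / pass up unmerged", 0 for
 * "candidate for merging".
 */
static int
lro_check_pure_ack(const struct ip *ip, const struct tcphdr *th,
    void (*flush_cb)(void *arg), void *arg)
{
	if (lro_is_pure_ack(ip, th)) {
		if (flush_cb != NULL)
			flush_cb(arg);	/* avoid data-then-ack reordering */
		return (-1);
	}
	return (0);
}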
>
>> WIP patch is at:
>> http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch
>>
>> Jeff tested the WIP patch and it *doesn't* fix the issue. I don't
>> have LRO capable hardware set up locally to figure out what I've
>> missed. Most of the machines in my lab are running em(4) NICs which
>> don't support LRO, but I'll see if I can find something which does
>> and perhaps resurrect this patch.

LRO can always be done in software. You can do it at the driver,
ether_input() or ip_input() level.

> A few comments:
>
> 1. I don't think it makes sense to send multiple acks on
> coalesced segments (and the 82599 does not seem to do that).
> First of all, the acks would get out with minimal spacing (ideally
> less than 100ns), so chances are that the remote end will see
> them in a single burst anyway. Secondly, the remote end can
> easily tell that a single ACK is reporting multiple MSS and
> behave as if an equivalent number of acks had arrived.

ABC (appropriate byte counting) gets in the way though (see the
byte-counting sketch after this message).

> 2. I am a big fan of LRO (and similar solutions), because it can
> save a lot of repeated work when passing packets up the stack, and
> the mechanism becomes more and more effective as the system load
> increases, which is a wonderful property in terms of system
> stability.
>
> For this reason, I think it would be useful to add support for
> software LRO to the generic code (sys/net/if.c) so that drivers can
> directly use the software implementation even without hardware
> support.

It hurts on higher-RTT links in the general case. For LAN RTTs it's
good.

> 3. Similar to LRO, it would make sense to implement a "software TSO"
> mechanism where the TCP sender pushes a large segment down to
> ether_output, with code in if_ethersubr.c doing the segmentation
> and checksum computation (a segmentation sketch appears after this
> message). This would save multiple traversals of the various layers
> of the stack recomputing essentially the same information for all
> segments.

All modern NICs support hardware TSO. There's little benefit in having
a parallel software implementation. And then you run into the mbuf
chain copying issue further down the layer. The win won't be much.

-- 
Andre
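On the ABC remark above: RFC 3465 grows cwnd by the number of bytes
newly acknowledged, but caps the growth per ACK at L*SMSS, so a single
coalesced ACK covering many segments cannot "behave as if an equivalent
number of acks had arrived". A minimal sketch of the slow-start rule,
in standalone C with illustrative names (not the FreeBSD
implementation):

#include <stdint.h>

/*
 * RFC 3465 Appropriate Byte Counting, slow-start case: cwnd grows by
 * the number of bytes newly acked, capped at L*SMSS per ACK (L = 2 is
 * the commonly cited limit).
 */
#define ABC_L_LIMIT	2

static uint32_t
abc_slow_start_cwnd(uint32_t cwnd, uint32_t bytes_acked, uint32_t smss)
{
	uint32_t incr = bytes_acked;
	uint32_t cap = ABC_L_LIMIT * smss;

	if (incr > cap)
		incr = cap;	/* per-ACK cap is the coalescing penalty */
	return (cwnd + incr);
}

With L = 2, one ACK covering five segments buys 2*SMSS of cwnd growth
where five separate ACKs would have bought 5*SMSS, which is the
interaction Andre is pointing at.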
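On point 3, the sketch below shows the core of what a software TSO in
if_ethersubr.c would have to do: carve one large payload into MSS-sized
segments with advancing sequence numbers, once, instead of re-deriving
that per segment across full stack traversals. Standalone C with
hypothetical names; mbuf chain handling, header replication, and
checksum computation are deliberately omitted (they are exactly where
Andre expects the copying cost to bite).

#include <stddef.h>
#include <stdint.h>

struct seg {
	uint32_t seq;	/* TCP sequence number of this segment */
	size_t   off;	/* payload offset into the large buffer */
	size_t   len;	/* payload bytes in this segment */
};

/*
 * Carve a large send of 'total' payload bytes starting at sequence
 * number 'seq' into at most 'max' MSS-sized segments.  Returns the
 * number of segments produced.  A real in-kernel version would build
 * an mbuf chain per segment and fill in IP/TCP headers instead.
 */
static size_t
sw_tso_split(uint32_t seq, size_t total, size_t mss,
    struct seg *out, size_t max)
{
	size_t n = 0, off = 0;

	while (total > 0 && n < max) {
		size_t len = (total < mss) ? total : mss;

		out[n].seq = seq;
		out[n].off = off;
		out[n].len = len;
		seq += (uint32_t)len;
		off += len;
		total -= len;
		n++;
	}
	return (n);
}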