CFT: re(4)

From: Pyun YongHyeon <pyunyh_at_gmail.com>
Date: Tue, 29 May 2007 21:18:37 +0900
Dear all,

I've committed a fix for a bus_dma(9) bug that resulted in poor Tx
performance with the TSO-enabled re(4) driver. With the fix and a revised
re(4), I got saner performance. Because so many different chips rely
on re(4), I'd like to hear any success or failure reports before the
revised re(4) hits the tree.
For PCIe hardware users, it would be great if you could submit
performance numbers for both the stock re(4) and the revised one. The
revised re(4) can be found at the following URL:
http://people.freebsd.org/~yongari/re/re.HEAD.patch

Note: you need the latest kernel to get correct performance numbers.

Changes:
o For 8169 GigEs, increased the number of Rx/Tx descriptors to 256
  because it's hard to push the hardware to its limit with the default
  64 descriptors. TSO requires a large number of Tx descriptors to pass
  a full-sized TCP segment (a 65535-byte IP packet) to the hardware.
  Previously it consumed 32 Tx descriptors, assuming an MCLBYTES DMA
  segment size, to send such a TCP segment, which means re(4) couldn't
  queue more than two full-sized IP packets.
  The 8139C+ still uses 64 Rx/Tx descriptors due to its hardware
  limitations. With these changes there is a (very) small waste of
  memory for 8139C+ users, but I don't think it will matter to them in
  most cases.
o Various bus_dma(9) fixes.
   - The hardware supports DAC (dual address cycle), so allow 64-bit
     DMA operations.
   - Removed the BUS_DMA_ALLOC_NOW flag. Using this flag is almost
     always a bug.
   - Increased the DMA segment size from MCLBYTES to 4096, as TSO
     consumes too many descriptors with MCLBYTES-sized DMA segments.
   - Added bus_dmamap_load_mbuf_sg(9) support on both the Tx and Rx
     sides. With these changes the code is more readable than before
     and gets (slightly) better performance, as it no longer needs to
     pass/decode arguments to/from a callback function.
   - Removed the unnecessary callback function re_dmamap_desc() and
     nuked the rl_dmaload_arg structure that was used in the callback.
   - Additional protection against DMA map load failures. On failure,
     reuse the current map instead of returning a bogus DMA map.
   - Deferred DMA map unload/sync operations, for maximum performance,
     until we really need to load a new DMA map. If we happen to reuse
     the current map (e.g. on an input error) there is no need to
     sync/unload/load again.
   - The number of allowable Tx DMA segments for an mbuf chain is now
     32 instead of a magic nseg value. If there are too few available
     Tx descriptors to send a highly fragmented mbuf chain, an
     optimized re_defrag() is called to collapse the chain; it is
     supposed to be much faster than m_defrag(9). re_defrag() was
     borrowed from ath(4).
   - Split the common DMA tag into separate Rx and Tx DMA tags so that
     the Rx DMA tag correctly uses DMA maps created with the DMA
     alignment restriction (64-bit alignment). The Tx DMA tag has no
     such alignment restriction.
   - Added additional sanity checks for ring DMA map load failures.
   - Added a spare Rx DMA map for graceful handling of Rx DMA map load
     failures.
   - Fixed misused bus_dmamap_sync(9) calls and added missing
     bus_dmamap_sync(9) calls in re_encap()/re_txeof()/re_rxeof().
o Don't touch the DMA address of a Tx descriptor in re_txeof(); it's
  not needed.
o Fixed an incorrect update of the if_ierrors counter. On an Rx buffer
  shortage it should update if_qdrops instead, as the buffer is reused.
o Added checks for unsupported H/W revisions, returning ENXIO for such
  hardware. This is required so that re_probe() does no resource
  allocation, like the device probe routines of other drivers.
o Modified the descriptor index manipulation macros, as it's now
  possible to have different numbers of Rx and Tx descriptors.
o In re_start(), to save a lock operation, use IFQ_DRV_IS_EMPTY before
  trying to invoke IFQ_DRV_DEQUEUE. Also, don't blindly call re_encap(),
  since we already know the number of available Tx descriptors in
  advance.
o Removed RL_TX_DESC_THLD, which was used to reserve RL_TX_DESC_THLD
  descriptors in the Tx path. No such limitation is mentioned in the
  8139C+/8169/8110/8168/8101/8111 datasheets, and it seems to work OK
  without reserving those descriptors.
o Fixed a comment for RL_GTXSTART; it is an 8-bit register.
o Added comments for 8169/8139C+ hardware restrictions on descriptors.
o Removed the forward declaration of "struct rl_softc"; it's not needed.
o Added a new structure, rl_txdesc, for Tx descriptor management and a
  structure, rl_rxdesc, for Rx descriptor management.
o Removed the unused member variable rl_intlock from the driver softc.
  There are still several unused member variables that are supposed to
  be used to access the hardware statistics counters, but it seems that
  access to the hardware counters has not been implemented yet.
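
The descriptor arithmetic in the first item can be sanity-checked with a
small userland C sketch. It assumes MCLBYTES is 2048 (its usual value),
one Tx descriptor per DMA segment, and no extra per-frame descriptors;
ring_next() is a hypothetical helper showing index wrap-around with a
per-ring descriptor count, not the driver's actual macro.

```c
#include <assert.h>

/* Same definition as howmany() in FreeBSD's <sys/param.h>. */
#define howmany(x, y)	(((x) + ((y) - 1)) / (y))

#define TSO_MAX_FRAME	65535	/* full-sized IP packet handed to the NIC */

/*
 * Descriptors needed for one full-sized TSO frame when each DMA segment
 * is at most seg_size bytes (assumption: one descriptor per segment).
 */
static int
descs_per_tso_frame(int seg_size)
{
	return (howmany(TSO_MAX_FRAME, seg_size));
}

/* Full-sized TSO frames that fit in a Tx ring of ring_cnt descriptors. */
static int
tso_frames_per_ring(int ring_cnt, int seg_size)
{
	return (ring_cnt / descs_per_tso_frame(seg_size));
}

/*
 * Hypothetical index helper: advance idx on a ring whose size is passed
 * in, so Rx and Tx rings may have different sizes. The mask trick only
 * works for power-of-two sizes, which the assert enforces.
 */
static int
ring_next(int idx, int cnt)
{
	assert(cnt > 0 && (cnt & (cnt - 1)) == 0);
	return ((idx + 1) & (cnt - 1));
}
```

Under these assumptions, tso_frames_per_ring(64, 2048) is 2, matching the
old limit of two full-sized IP packets, while tso_frames_per_ring(256, 4096)
is 16.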

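A minimal userland C sketch of the spare Rx DMA map idea: the replacement
buffer is loaded into the spare map first, and the slot's map is swapped
only once that load succeeds, so a load failure simply leaves the current
buffer in place for reuse and nothing is unloaded or reloaded. struct
fake_map, struct rx_slot, rx_refill() and the load callbacks are
hypothetical stand-ins for bus_dmamap_t and bus_dmamap_load_mbuf_sg(9),
not code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bus_dmamap_t. */
struct fake_map {
	int	id;
	bool	loaded;		/* true while a buffer is mapped */
};

/* Hypothetical Rx ring slot: the map and buffer backing one descriptor. */
struct rx_slot {
	struct fake_map	*map;
	void		*buf;
};

/* Stand-in for bus_dmamap_load_mbuf_sg(9): 0 on success, errno on failure. */
typedef int (*load_fn)(struct fake_map *map, void *buf);

/*
 * Refill one Rx slot with newbuf. The new buffer goes into the spare map
 * first; on failure the slot keeps its old map and buffer (the caller
 * would drop the frame and reuse the buffer). On success the old map is
 * unloaded and becomes the new spare.
 */
static int
rx_refill(struct rx_slot *slot, struct fake_map **spare, void *newbuf,
    load_fn load)
{
	struct fake_map *old;
	int error;

	assert(slot != NULL && spare != NULL && *spare != NULL);

	error = load(*spare, newbuf);
	if (error != 0)
		return (error);		/* slot is untouched and still valid */

	/* Only now is the old map unloaded; its slot gets the spare map. */
	old = slot->map;
	old->loaded = false;
	slot->map = *spare;
	slot->buf = newbuf;
	*spare = old;
	return (0);
}

/* Simulated loader that always succeeds. */
static int
load_ok(struct fake_map *map, void *buf)
{
	(void)buf;
	map->loaded = true;
	return (0);
}

/* Simulated loader that always fails, as under memory pressure. */
static int
load_fail(struct fake_map *map, void *buf)
{
	(void)map;
	(void)buf;
	return (ENOMEM);
}
```

The point of the swap is that the ring never holds an unloaded map: a
failed refill costs one dropped frame rather than a descriptor pointing at
a bogus DMA map.
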
Thanks.
-- 
Regards,
Pyun YongHyeon
Received on Tue May 29 2007 - 10:18:45 UTC