Re: 'nfe' stalls (analysis and partial solution)

From: Pyun YongHyeon <pyunyh_at_gmail.com> Date: Sat, 26 Apr 2008 14:33:37 +0900

On Fri, Apr 25, 2008 at 06:00:39PM +0200, Luigi Rizzo wrote:
 > just for the record and the mail archives - i have been experiencing
 > a lot of unrecovered stalls of the network card with the 'nfe'
 > driver under heavy load (this was on 7.0-i386 and 7.0-amd64, but
 > it is hardware related so it is cross-platform).
 > 
 > After 2-3 days of investigation, and with the help of
 > Pyun YongHyeon (yongari) i finally managed to pin down the
 > problem and start working on a solution.
 > 
 > I would be grateful if others could report similar problems
 > with the 'nfe' driver so we can see whether the patch we come
 > up with also fixes their problem.
 > 
 > THE PROBLEM:
 > under heavy load (e.g. full speed ssh transfers, disk activity,
 > Xwindows...) causing the receive ring to fill up, it seems that
 > some nfe-supported cards (at least the MCP67) enter a state where
 > they stop looking at the ring buffers and drop incoming packets.
 > 
 > The driver does not recover from the error so you manually have
 > to 'ifconfig down; ifconfig up' the interface to restart
 > receiving.
 > 

I tried to reproduce this on CK804 MCP9 hardware, but nfe(4)
recovered successfully from the Rx ring full condition.
I still don't know how to reproduce the Rx stalls reliably, but
an Rx ring full condition alone does not seem to trigger them
on CK804 MCP9.
As Luigi said, it is possible that only some NVIDIA chips have
this issue. If you see it, please let us know which
chip/model you have.

The Rx ring full condition can be triggered easily by sending
lots of UDP packets with a network benchmark program. To make it
more likely, run buildworld while the benchmark is in progress;
that would certainly trigger the condition.
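
If no benchmark program is at hand, a small UDP blaster along the
following lines should serve the same purpose. This is only a sketch:
the function name udp_burst() and its parameters are made up for
illustration, it handles IPv4 only, and it has not been tested against
nfe(4) hardware.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send `count` UDP datagrams of `payload_len` bytes to host:port as fast
 * as the socket allows.  Returns the number of datagrams the kernel
 * accepted, or -1 if the address or the socket could not be set up.
 * udp_burst() is a hypothetical helper, not part of any existing tool.
 */
long
udp_burst(const char *host, unsigned short port, size_t payload_len,
    long count)
{
	char buf[1472];	/* largest UDP payload in a 1500-byte IPv4 MTU */
	struct sockaddr_in dst;
	long i, sent = 0;
	int s;

	/* Clamp the payload so it always fits in one unfragmented frame. */
	if (payload_len > sizeof(buf))
		payload_len = sizeof(buf);
	memset(buf, 0xa5, sizeof(buf));

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(port);
	if (inet_pton(AF_INET, host, &dst.sin_addr) != 1)
		return (-1);

	s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return (-1);

	/* Count only the datagrams that were handed to the kernel in full. */
	for (i = 0; i < count; i++) {
		if (sendto(s, buf, payload_len, 0,
		    (struct sockaddr *)&dst, sizeof(dst)) ==
		    (ssize_t)payload_len)
			sent++;
	}
	close(s);
	return (sent);
}
```

Calling something like udp_burst("192.0.2.10", 9, 64, 10000000) from
another host (the address is a placeholder for the machine under test)
sends small packets, which stress the Rx ring more per byte than large
ones do.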

 > SOLUTION:
 > I have not yet determined the exact conditions causing the error,
 > so as a temporary workaround i am calling nfe_init_locked()
 > from the watchdog routine every time a receive error of some kind
 > is experienced.
 > 
 > I definitely need to apply stricter checks on the error condition,
 > but a few extra card resets are certainly better than losing contact
 > with the machine. Unfortunately there is no documentation on this
 > behaviour of the card, and the linux driver (forcedeth) has no
 > error checking/recovery at all so it is of no help.
 > 
 > 	cheers
 > 	luigi
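
For readers of the archive, the workaround Luigi describes can be
modeled roughly as below. This is a user-space sketch, not the actual
patch: struct softc_model, model_init_locked() and model_watchdog() are
hypothetical stand-ins, and the real nfe_init_locked() takes the
driver's own softc and reprograms the hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver's per-device state. */
struct softc_model {
	uint64_t rx_errors;	 /* running count of receive errors */
	uint64_t rx_errors_seen; /* rx_errors as of the last watchdog tick */
	unsigned reinit_count;	 /* number of reinitializations so far */
};

/* Stand-in for nfe_init_locked(); here it only records the call. */
static void
model_init_locked(struct softc_model *sc)
{
	sc->reinit_count++;
}

/*
 * Watchdog tick: if any receive error was recorded since the previous
 * tick, reinitialize the card.  Returns true when a reinit was done.
 */
bool
model_watchdog(struct softc_model *sc)
{
	if (sc->rx_errors == sc->rx_errors_seen)
		return (false);
	sc->rx_errors_seen = sc->rx_errors;
	model_init_locked(sc);
	return (true);
}
```

Because any receive error triggers a reinit, the model resets more
often than strictly needed, which is the trade-off Luigi mentions;
stricter checks would go in the comparison at the top of
model_watchdog().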
-- 
Regards,
Pyun YongHyeon