'nfe' stalls (analysis and partial solution)

From: Luigi Rizzo <rizzo_at_iet.unipi.it> Date: Fri, 25 Apr 2008 18:00:39 +0200

just for the record and the mail archives - i have been experiencing
a lot of unrecovered stalls of the network card with the 'nfe'
driver under heavy load (this was on 7.0-i386 and 7.0-amd64, but
it is hardware related, so it is cross-platform).

After 2-3 days of investigation, and with the help of
Pyun YongHyeon (yongari) i finally managed to pin down the
problem and start working on a solution.

I would be grateful if others could report similar problems
with the 'nfe' driver so we can see if the patch we come
up with also fixes their problem.

THE PROBLEM:
under heavy load (e.g. full speed ssh transfers, disk activity,
Xwindows...) causing the receive ring to fill up, it seems that
some nfe-supported cards (at least the MCP67) enter a state where
they stop looking at the ring buffers and drop incoming packets.

The driver does not recover from the error, so you have to
manually 'ifconfig down; ifconfig up' the interface to restart
receiving.

SOLUTION:
I have not yet determined the exact conditions causing the error,
so as a temporary workaround i am calling nfe_init_locked()
from the watchdog routine every time a receive error of some kind
is experienced.
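
To make the idea concrete, here is a rough userland sketch of the
workaround. It is NOT the actual patch and not real driver code:
the struct and all the nfe_model_* names are made up for
illustration, and nfe_model_init_locked() only stands in for the
real nfe_init_locked(). The model assumes the receive path bumps
an error counter, and the watchdog resets the card whenever that
counter has moved since the previous tick.

```c
/* Simplified model of the workaround; hypothetical names throughout. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's per-device state. */
struct nfe_model_softc {
	unsigned long	rx_errors;	/* receive errors counted so far */
	unsigned long	rx_errors_seen;	/* counter value at the last tick */
	unsigned int	reinit_count;	/* number of resets performed */
	bool		rx_running;	/* false models the stalled receiver */
};

/* Stand-in for nfe_init_locked(): pretend to reset and restart RX. */
static void
nfe_model_init_locked(struct nfe_model_softc *sc)
{
	sc->rx_running = true;
	sc->reinit_count++;
}

/* Called by the (modelled) receive path when it sees any RX error. */
static void
nfe_model_rx_error(struct nfe_model_softc *sc)
{
	sc->rx_errors++;
	sc->rx_running = false;	/* pessimistic: assume the card stalled */
}

/*
 * Periodic watchdog tick.  If any receive error was recorded since
 * the previous tick, reinitialize the card.  Returns true when a
 * reset was done.
 */
static bool
nfe_model_watchdog(struct nfe_model_softc *sc)
{
	assert(sc != NULL);
	if (sc->rx_errors == sc->rx_errors_seen)
		return (false);		/* nothing new, leave the card alone */
	sc->rx_errors_seen = sc->rx_errors;
	nfe_model_init_locked(sc);
	return (true);
}
```

Note that this resets at most once per watchdog tick no matter how
many errors were counted in between, which at least bounds the
number of extra resets.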

I definitely need to apply stricter checks on the error condition,
but a few extra card resets are certainly better than losing contact
with the machine. Unfortunately there is no documentation on this
behaviour of the card, and the linux driver (forcedeth) has no
error checking/recovery at all so it is of no help.

	cheers
	luigi