Re: amd64/115126: [nfe] nfe0: watchdog timeout (missed Tx interrupts) -- recovering (UP with SCHED_ULE)

From: Luigi Rizzo <rizzo_at_iet.unipi.it> Date: Tue, 22 Apr 2008 09:28:39 +0200

Related to this bug, I am seeing similar problems with RELENG_7 on amd64,
with an ASUS M2N-VM DVI motherboard
http://www.asus.com/products.aspx?modelmenu=1&model=1841&l1=3&l2=101&l3=567&l4=0
and an Athlon64-BE2400 dual core CPU.

Under heavy load, e.g. scp-ing a large file over the local network
while doing a buildkernel or building a port, with X11 active
(using the 'vesa' xorg driver), the network card stalls and doesn't
recover. I waited over 10 minutes hoping for the watchdog or some
timeout to kick in; the only way to bring the link back up was:

	ifconfig nfe0 down ; ifconfig nfe0 up
	dhclient nfe0

Doing only ifconfig down/up or only dhclient did not help; I needed both.
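
For anyone who wants to script the workaround, here is a minimal sketch of
the same three commands wrapped in a shell function. The function name
recover_iface and the RUN variable are hypothetical, introduced only for
illustration: leave RUN unset to execute the commands for real (as root),
or set RUN=echo to print them without touching the interface.

```sh
# Hypothetical helper: bounce an interface, then renew its DHCP lease.
# RUN is an assumed dry-run hook: unset/empty runs the commands,
# RUN=echo only prints them.
recover_iface() {
    ifname=$1
    ${RUN:-} ifconfig "$ifname" down
    ${RUN:-} ifconfig "$ifname" up
    ${RUN:-} dhclient "$ifname"
}
```

A dry run would be "RUN=echo recover_iface nfe0". This only automates the
manual recovery; it does nothing about the underlying stall.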

vmstat -i says the network card has irq256 (???) and it is not shared with
other devices. Ehci, sound, ohci, ata, and others have low irq numbers
(6, 14, 20, 21, 22), some shared, some not.

Changing the BIOS setting for PnP OS from 'yes' to 'no' or vice versa
does not change the situation.

The stall seems related to the presence of other activity: if I
leave the bulk scp transfer running on its own, I get a happy
10-10.5 Mbytes/s (over a 100 Mbit/s link).

When the stall occurs, I see no interrupts (the vmstat -i count
for irq256 confirms this).
Packets are still transmitted and received on the other side; it's
the rx side of the card that becomes deaf. I don't see any
watchdog timeout or other error messages in /var/log/messages.
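
The interrupt check above can be scripted roughly as follows. This is a
sketch under assumptions: it expects the usual "irqN: device  total  rate"
line layout of vmstat -i with an unshared interrupt, and the names
irq_total and is_stalled are hypothetical helpers, not existing tools.

```sh
# Print the cumulative interrupt count for device $1, reading
# `vmstat -i` style output on stdin. Exits non-zero if the device
# does not appear (assumes an unshared "irqN: dev total rate" line).
irq_total() {
    awk -v dev="$1" '
        $1 ~ /^irq[0-9]+:$/ && $2 == dev { print $3; found = 1 }
        END { exit !found }
    '
}

# Succeeds when the counter did not advance between two samples.
is_stalled() {
    [ "$1" -eq "$2" ]
}
```

Usage would be along the lines of taking "vmstat -i | irq_total nfe0"
twice, a few seconds apart while traffic should be flowing, and passing
the two values to is_stalled. An idle link also shows no new interrupts,
so an unchanged count alone is not proof of a stall.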

Also, enabling polling does not help get traffic in
(with a kernel built with DEVICE_POLLING,
doing sysctl kern.polling.enable=1 and "ifconfig nfe0 polling").

So I suspect that for some reason the rx ring becomes confused
and does not recover.

Hope this helps...

cheers
luigi