Re: Frequent network access freeze (in 7.0)

From: Robert Watson <rwatson_at_FreeBSD.org>
Date: Wed, 20 Feb 2008 11:35:34 +0000 (GMT)
On Wed, 20 Feb 2008, Unga wrote:

> I'm running 7.0-PRERELEASE (RC2, dated 15/02/2008), compiled from sources on 
> i386 machine (512MB RAM, 3.0GHz, tx0: <SMC EtherPower II 10/100>).
>
> Network access freezes very frequently. Cannot ping to any ip address. The 
> only way to get networking working again is reboot.
>
> I'm having this problem on 7.0 ever since I tried it from BETA4. I have 
> reported also to this list before but sadly nobody was interested on it.
>
> If somebody is interested to look into this problem, I could furnish with 
> more detail and participate in testing.

This sort of problem frequently turns out to be a bug in a device driver or a 
problem with interrupt probing/configuration, so my first guess would be a 
problem with the if_tx driver.  The usual starting diagnostic when ping fails 
is to use tcpdump to determine whether receive or transmit is failing (or 
both).  Quiet the network between the two endpoints as much as you can so that 
unrelated traffic doesn't clutter the dumps, and dump arp and icmp at both 
endpoints.  Now try to ping from each endpoint to the other.  One potential 
source of confusion is that ping requires ARP to work, and ARP can be a 
slightly confusing protocol: it usually resolves actively (query, response), 
but sometimes it receives passive updates or extends existing entries.
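
As a rough illustration of the capture step, here is a small Python sketch that 
builds a tcpdump command line restricted to ARP and ICMP.  The interface name 
tx0 comes from the report above; the peer address 192.0.2.10 and the helper 
name tcpdump_command are placeholders of mine, so substitute your own.

```python
import shlex


def tcpdump_command(interface, peer_ip=None):
    """Build a tcpdump argument list that captures only ARP and ICMP.

    interface: name of the NIC to listen on (e.g. "tx0").
    peer_ip:   optional address of the other endpoint, used to narrow
               the capture to traffic involving that host.
    """
    # -n: don't resolve names (avoids extra traffic while debugging)
    # -e: print link-level headers, handy for seeing who answers ARP
    # -i: interface to capture on
    args = ["tcpdump", "-n", "-e", "-i", interface]
    capture_filter = "arp or icmp"
    if peer_ip is not None:
        capture_filter = f"({capture_filter}) and host {peer_ip}"
    args.append(capture_filter)
    return args


if __name__ == "__main__":
    # Print a shell-quoted command to run (as root) on each endpoint.
    print(shlex.join(tcpdump_command("tx0", "192.0.2.10")))
```

Run the resulting command on both machines before starting the pings, so each 
side has a record of what it believes it sent and what it actually saw arrive.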

What you want to look for is a packet sent by one side that isn't received by 
the other.  You might find, for example, that your host receives packets fine, 
but the packets it transmits are never received.  This would indicate a driver 
bug, such as a failure to properly handle (for example) transmit queues 
filling, which might only trigger under very high load.  Or, you might find 
that your host never receives anything the other side transmits, but can send 
fine.  This might indicate a driver bug involving the receive code, or a 
problem with how interrupts are being handled more generally.
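
The comparison can be sketched in Python as follows.  This assumes you have 
already reduced each tcpdump log by hand to a list of (source, destination, 
kind, sequence) tuples; that record format, the function names, and the 
addresses are hypothetical and only serve to show the logic.

```python
def missing_packets(sent_log, received_log):
    """Return packets in the sender's capture that the receiver never saw."""
    received = set(received_log)
    return [pkt for pkt in sent_log if pkt not in received]


def diagnose(host_capture, peer_capture, host_ip):
    """Classify the failure as 'transmit', 'receive', 'both', or 'neither'.

    Each capture is a list of (src, dst, kind, seq) tuples taken from the
    tcpdump output on that machine.  host_ip is the suspect machine.
    """
    # Packets each side believes it put on the wire.
    host_sent = [pkt for pkt in host_capture if pkt[0] == host_ip]
    peer_sent = [pkt for pkt in peer_capture if pkt[0] != host_ip]

    tx_lost = missing_packets(host_sent, peer_capture)  # host -> peer lost
    rx_lost = missing_packets(peer_sent, host_capture)  # peer -> host lost

    if tx_lost and rx_lost:
        return "both"
    if tx_lost:
        return "transmit"
    if rx_lost:
        return "receive"
    return "neither"


if __name__ == "__main__":
    host = [("10.0.0.1", "10.0.0.2", "echo-request", 1)]
    peer = []  # the peer never saw the request
    print(diagnose(host, peer, "10.0.0.1"))
```

Note that tcpdump on the sending host shows packets as they are handed to the 
driver, so a packet that appears there but not at the peer points at the 
transmit side (or something on the wire in between).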

It looks like the last non-routine maintenance to the driver was done by 
Maxime in about 2003; the more recent changes have all been updates to 
newbus/busdma infrastructure, ifnet changes, locking changes, etc.  I've CC'd 
him as it sounds like he may have hardware...  My advice would be to do the 
above tests and see if you can narrow down whether it's transmit, receive, or 
both failing.

Robert N M Watson
Computer Laboratory
University of Cambridge
Received on Wed Feb 20 2008 - 10:35:35 UTC
