> On Mon, 17 Oct 2005 11:46:55 +0200
> Dan Bilik <dan_at_mail.neosystem.cz> wrote:
>
> > Situation:
> > Single-purpose machines only serving http requests for static content,
> > ...
> > Problem:
> > After some time of serving requests the ethernet interface in the
> > machine stops communicating on the wire. It does not respond to any
> > packets (ping, http, nfs, ssh) and vmstat(8) shows stopped interrupt
> > ...
>
> Some fresh additional info:
>
> Today one of the problem machines got stuck again. I was able to log on
> through second functional interface and watch it more closely. Sending
> packets from the box worked (its arp requests were appearing on other
> boxes in the subnet) but it could not receive any packet. And another
> thing... It seems that running tcpdump (ie. entering and leaving
> promiscuous mode) on the interface resolved the problem
> and made the machine to appear back on the network. It's running
> with no problem from that moment.
>
> Any ideas what's going on here? Does it make sense to anyone?
>
> Dan

With the fxp driver, entering promiscuous mode implies reinitializing
the NIC from the ground up. The fxp_ioctl() handler will call fxp_init()
when any of the interface flags change, IFF_PROMISC included, so running
tcpdump is equivalent to doing "ifconfig fxp0 up" in this case. It should
come as no surprise that reinitializing the card brings it back to life.

There are a couple of reasons why you could be getting into this state:

- The chip has experienced an RX overrun, where all of the descriptors
  in its RX DMA ring have been filled by the chip before the driver has
  had a chance to drain them. When this happens, the chip may require
  the RX unit to be resumed.

- For some reason, the RX handler code in the driver has fallen out of
  sync with the chip, i.e. the current descriptor index has gotten
  clobbered, or maybe the chip was restarted and the index wasn't
  properly reset.

In both of these cases, calling fxp_init() will reinitialize the RX unit
and get the chip working again.

RX overruns are obviously the result of a very busy network (or a very
busy host processor that can't service the NIC frequently enough to drain
the RX ring). If the network is busy, it would be with a lot of small
packets. With heavy streaming traffic, you'd get a lot of large frames,
and with 1500 byte frames you max out a 100Mbps ethernet at only about
8100 frames/second. By contrast, it takes 148000 frames/sec to max out a
100Mbps pipe with 64 byte frames.

Ideally, the driver should recover gracefully from an RX overrun, though
I'll bet a quarter nobody's really tested it very thoroughly. ("I can
browse the intarweb so it must be ok!") Also, I discovered recently that
the chip can do really strange things if you make the mistake of issuing
an RX unit start command twice, instead of just once. (If you do this,
it's apparently possible for the chip to DMA into the same packet buffer
twice, which has the effect of clobbering a packet after it's already
been passed to the stack for processing.)

You should run vmstat -i or something to monitor the interrupt rate on
the failing interface and see if it peaks right before it goes deaf.

-Bill

--
=============================================================================
-Bill Paul            (510) 749-2329 | Senior Engineer, Master of Unix-Fu
                 wpaul_at_windriver.com | Wind River Systems
=============================================================================
              <adamw> you're just BEGGING to face the moose
=============================================================================

Received on Tue Oct 18 2005 - 18:30:17 UTC
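
Editorial footnote on the frame-rate figures quoted in the message: the
numbers can be reproduced with a little arithmetic. The sketch below is
not from the original post. It assumes standard Ethernet framing, where
every frame costs an extra 20 bytes on the wire (8 bytes of preamble and
start delimiter plus a 12-byte inter-frame gap), and it reads "1500 byte
frames" as a 1500-byte payload, i.e. a 1518-byte frame including header
and FCS. The function and constant names are made up for illustration.

```python
# Rough line-rate calculation for Ethernet frames.
# Assumption: per-frame wire overhead is 8 bytes (preamble + SFD)
# plus 12 bytes (inter-frame gap), as in standard Ethernet framing.
WIRE_OVERHEAD_BYTES = 8 + 12


def max_frames_per_second(frame_bytes, link_bps=100_000_000):
    """Return the maximum frame rate for back-to-back frames.

    frame_bytes is the frame size including the MAC header and FCS
    (64 for a minimum-size frame, 1518 for a 1500-byte payload).
    """
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


if __name__ == "__main__":
    for size in (64, 1518):
        rate = max_frames_per_second(size)
        print(f"{size:5d}-byte frames: {rate:,.0f} frames/sec")
```

Under those assumptions the two rates land close to the roughly 148000
and 8100 frames/sec mentioned above, which is why a flood of small
packets is so much harder on the RX ring than bulk streaming traffic.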