It seems Robert Watson wrote: > How fast are your systems, speaking of which? I live in the world of > 300-500 mhz machines at work, and 300-800 mhz boxes at home. If you're > using multi-ghz boxes, that could well be the distinguishing factor > between our configurations... Server is 533MhzVIA C3, clients everything from 300Mhz PII to 2.6G P4. > Ok, here's the strategy I was planning to take once I could reproduce it: > > (1) Attempt to further narrow down responsibility to client/server. In > particular, see if an apparent hang on one client affects the other > clients. For me its just the server end that fails, I've not seen the client hang. > (2) Investigate Soren's report that killing and restarting nfsd on the > server would clear the hang. Yups, that works, in fact I have that in my crontab now every minute to keep NFS from hosing my setup here. NOTE: I also still need to ifconfig done/up my interfaces on some boxes or the netstack will freeze (again done every minute in crontab). However when NFS locks up it seems totatlly unrelated, ie all other network traffic works... > (3) Look at stack traces of involved processes on both the client and > server: in particular, look at traces for any client blocked in NFS, > any nfsiod processes on the client, and the nfsd processes on the > server. Also look at the wait channels on clients and servers for > these processes. Particularly interested in whether nfsd processes > are blocked trying to grab locks. Ok, will do.. > (4) Look at netstat information for NFS sockets, in particular, if the > buffers are full, or not being drained. In particular, on the server, > is the input queue not being drained by nfsd worker threads? Netstat doesn't seem to give any hints or even usefull info here, any special cmdøs you want the output from ? > (5) Try backing out src/sys/nfsserver/nfs_serv.c:1.137, which removed > another deadlock problem, but did change locking behavior in the NFS > server. No change already tried. > (6) Look at packet traces between the client and server with ethereal, > which has pretty good NFS decoding. Is the client retransmitting an > RPC to the server and the server just isn't responding, or is the > client failing to transmit? At the point of the hang, what sorts of > RPCs are outstanding to the server? In the past, we've seen "apparent > hangs" when some or another more obscure unusual error case on the NFS > server fails to respond to an RPC, which causes the client to "wait > forever". I can try that easily, I'll get a trace to you later tonight... > Things to look for: normally, idle nfsd and nfsiod processes have a WCHAN > of "-" (ps -lax), which indicates they're blocked waiting for some event > to kick them off. If you see nfsd processes "hung" in another state, it's > a good sign we've identified a server problem. In the nfs client > processes, "nfsrcvlk" typically indicates a process has sent out an RPC > and is now waiting on a response. I see the idle '-' wchan here when things go bad IIRC... -SørenReceived on Mon Nov 10 2003 - 08:44:01 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:37:28 UTC