Re: hard deadlock(?) on -current; some debugging info, need help

From: Peter Jeremy <PeterJeremy_at_optushome.com.au> Date: Sat, 28 May 2005 06:43:02 +1000

On Fri, 2005-May-27 08:27:52 -0700, Ted Faber wrote:
>work something out, but I do have a laptop running in the same
>environment (and with a kernel from the same source) that does not
>exhibit this problem.

That's a useful snippet.  I missed the bit about same source before.
What are the differences between the systems (including kernel
compilation options)?  That might provide a clue as to the underlying
problem.  Have you tried running the same sort of workload on your
laptop?  Is it feasible to run one of the kernels on both systems?

>> It might be useful to know some more details about that NFS mount
>> (fsid 0x0600ff07).  Can you tell us the mount parameters and what the
>> server is (OS type)?
>
>Most of the nfs filesystems are automounted.  I'm on the machine now, so
>I can't look at debugger output, but I can tell you that most of the NFS
>mounts that I can imagine either psi or bash looking at are automounted.
>The mount parameters are: timeo=8,retrans=9,intr

I didn't notice amd before.  If you can't avoid NFS, any chance of (at
least temporarily) hard-mounting all the relevant filesystems and
disabling amd?  amd acts as an NFS server to detect activity on the
automount filesystems.  Both the backtraces you posted show that one
process is blocked on an NFS request and amd is blocked on ufs.  The
locks on the second backtrace show that the bash waiting on an NFS
request is a root of the deadlock tree.  If that NFS request is
supposed to be handled by amd, you close the deadlock cycle.
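
As a rough sketch of that reasoning (not taken from the backtraces
themselves), the situation can be modelled as a wait-for graph where
each blocked process points at whatever it is waiting on.  The edge
from amd back to bash assumes the ufs lock amd wants is held, directly
or indirectly, by bash; "psi" and "nfs_server" are placeholder names
for illustration only.

```python
def find_cycle(waits_for):
    """Return the list of nodes forming a wait-for cycle, or None.

    waits_for maps each blocked process to the single process it is
    waiting on (a simplification: one outstanding wait per process).
    """
    for start in waits_for:
        path = []
        seen = set()
        node = start
        # Follow the chain of waits until it ends or revisits a node.
        while node is not None and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for.get(node)
        if node is not None:
            # We came back to a node already on the path: that is the cycle.
            return path[path.index(node):]
    return None


# Assumption: the NFS request bash sleeps on is served by amd.
amd_serves_request = {
    "psi": "bash",   # hypothetical: psi queued behind a lock bash holds
    "bash": "amd",   # bash sleeps on nfsreq, waiting for amd to answer
    "amd": "bash",   # amd blocked on a ufs lock assumed held by bash
}

# Assumption: the request goes to the real NFS server instead.
server_serves_request = {
    "psi": "bash",
    "bash": "nfs_server",  # placeholder for a server that is not blocked
    "amd": "bash",
}

print(find_cycle(amd_serves_request))
print(find_cycle(server_serves_request))
```

In the first case the chain loops back on itself, so nothing can make
progress; in the second the chain ends at the server and the request
can eventually complete.  That is why taking amd out of the picture
is worth trying.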

Also, if your mounts are interruptible, that nfsreq sleep is
interruptible - you could try dropping into DDB, finding the process
sleeping on nfsreq and killing it ("kill signal_number pid" in ddb,
no '-' on the signal number), then using "cont" to recover.  That
might break the deadlock.

>For completeness, the server is a Solaris box.  Don't laugh:
>boreas:~$ uname -a
>SunOS boreas.isi.edu 5.9 Generic_117171-12 sun4u sparc

Sun's NFS implementations should be trustworthy :-).

> If moving the config does not solve it, is there some output from
>the debugger I should get about the file system?

I can't see any DDB command to dump the mount table and doing it
manually would be painful.  Have you managed to get a crash dump?  (If
not, what does "call doadump" do?)  Alternatively, have you ever tried
running remote GDB?

>  It really helps to talk these things
>out with someone knowledgeable.

Unfortunately, no-one knowledgeable has shown up :-).

-- 
Peter Jeremy