Re: hard deadlock(?) on -current; some debugging info, need help

From: Ted Faber <faber_at_isi.edu> Date: Fri, 27 May 2005 08:27:52 -0700

On Fri, May 27, 2005 at 06:37:34PM +1000, Peter Jeremy wrote:
> On Thu, 2005-May-26 13:32:43 -0700, Ted Faber wrote:
> >On Thu, May 26, 2005 at 09:08:46AM -0700, Ted Faber wrote:
> >Next lock up is now.  Same kernel, pics are at
> >
> >http://www.isi.edu/~faber/tmp/deadlock/DSCN048{83,84,85,86,87,88,89,90,91}.JPG
> 
> After comparing it with the last URL, I worked out it was actually
> http://www.isi.edu/~faber/tmp/deadlock/DSCN04{83,84,85,86,87,88,89,90,91}.JPG

Sorry.  Typo.

> 
> >My inexpert reading is that one of the threads of the psi jabber client
> >is locked on something.  "Something" is why I need help. :-)
> 
> There are two filesystem locks:
> - The psi process (pid 6936) is holding a lock on ad0s1a (probably /)
>   The thread in question is waiting on a nfs lock.
> - A bash process (pid 6598) is holding an NFS lock and waiting on nfsreq
> 
> According to the vnode locks, there's one process waiting on the NFS
> lock held by bash and 7 processes waiting on the ufs lock held by psi.
> Without access to the actual process and lock structures, I can't be
> certain but it looks very much like psi is waiting on the NFS lock held
> by bash (there are no other processes waiting on nfs).
> 
> It's looking more like an NFS problem.  I'm not sure where to go next
> but I'd more strongly suggest that you try to get the system running
> without NFS.

For debugging or for my own sanity? :-)

It's going to be fairly problematic to move from NFS and keep things
going reasonably here.  If it's a step we need to take to debug I can
work something out, but I do have a laptop running in the same
environment (and with a kernel from the same source) that does not
exhibit this problem.
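
To make the lock picture quoted above concrete, here is a minimal sketch of it as a wait-for graph. The process names and pids come from the analysis above; the edge from psi to bash is only the suspected one (nobody has confirmed it from the lock structures), and the function name wait_chain is made up for illustration.

```python
# Wait-for graph built from the lock summary quoted above.
# Each edge points from a waiter to whatever it is waiting on.
# ASSUMPTION: the psi -> bash edge is suspected, not confirmed.
waits_on = {
    "7 procs (ufs lock)": "psi (pid 6936)",
    "psi (pid 6936)": "bash (pid 6598)",
    "bash (pid 6598)": "nfsreq",
}


def wait_chain(start, graph):
    """Follow waits-on edges from start; return (chain, cycle_found)."""
    chain = [start]
    seen = {start}
    node = start
    while node in graph:
        node = graph[node]
        chain.append(node)
        if node in seen:
            # We came back to something already on the chain: a true cycle.
            return chain, True
        seen.add(node)
    return chain, False


chain, cycle = wait_chain("7 procs (ufs lock)", waits_on)
print(" -> ".join(chain))
print("cycle found" if cycle else "no cycle; everything is queued behind " + chain[-1])
```

If the suspected edge is right, this is a chain that ends at an outstanding NFS request rather than a closed cycle, which fits the suggestion that the NFS side is where to look.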

> 
> It might be useful to know some more details about that NFS mount
> (fsid 0x0600ff07).  Can you tell us the mount parameters and what the
> server is (OS type)?

Most of the NFS filesystems are automounted.  I'm on the machine now, so
I can't look at debugger output, but I can tell you that most of the NFS
mounts that I can imagine either psi or bash looking at are automounted.
The mount parameters are: timeo=8,retrans=9,intr

Ummmmm.

As I look this up, I realize that the amd config file in which this
stuff resides is itself on an NFS file system.  Not an automounted one,
but an NFS filesystem nonetheless.  I've got a very bad feeling about
that all of a sudden.  I can picture the automounter being asked to
mount a filesystem that it has to look up in this config file while the
file is temporarily unavailable due to a network glitch (or some NFS
race, or someone locking the file to edit it), and that seems bad.

I'm going to move that configuration file.

How does that possibility sound to you?
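
A quick way to sanity-check this kind of dependency is to ask which mount a given file actually lives on. The sketch below does a longest-prefix match of a path against a mount table. The mount table and the paths in it are hypothetical placeholders, not the real layout of this machine, and the helper names are invented for illustration.

```python
import posixpath

# HYPOTHETICAL mount table: (mount point, filesystem type).
# Substitute the real entries from the machine in question.
MOUNTS = [
    ("/", "ufs"),
    ("/nfs/conf", "nfs"),   # assumed home of the amd config file
    ("/usr/local", "ufs"),
]


def fs_type_for(path, mounts):
    """Return (mount point, fs type) of the longest mount point containing path."""
    path = posixpath.normpath(path)
    best = None
    for mount_point, fs_type in mounts:
        mp = posixpath.normpath(mount_point)
        # "/" contains every absolute path; otherwise require a whole
        # path component match so /nfs/confx does not match /nfs/conf.
        if mp == "/" or path == mp or path.startswith(mp + "/"):
            if best is None or len(mp) > len(best[0]):
                best = (mp, fs_type)
    return best


def depends_on_nfs(path, mounts):
    """True if the file would be read through an NFS mount."""
    hit = fs_type_for(path, mounts)
    return hit is not None and hit[1] == "nfs"


print(depends_on_nfs("/nfs/conf/amd.map", MOUNTS))
print(depends_on_nfs("/usr/local/etc/amd.map", MOUNTS))
```

A config file for which depends_on_nfs is true is one the automounter can only read while NFS itself is healthy, which is the circular dependency I'm worried about.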

For completeness, the server is a Solaris box.  Don't laugh:
boreas:~$ uname -a
SunOS boreas.isi.edu 5.9 Generic_117171-12 sun4u sparc

I'll move that configuration.  With any luck this will solve my problem,
though if you see something else more promising, don't hesitate to speak
up.  If moving the config does not solve it, is there some output from
the debugger I should get about the file system?

Thanks again for all your help.  It really helps to talk these things
out with someone knowledgeable.  I hope it does turn out to be this NFS
double jeopardy/pilot error, but if not I'll speak up again.

-- 
Ted Faber
http://www.isi.edu/~faber           PGP: http://www.isi.edu/~faber/pubkeys.asc
Unexpected attachment on this mail? See http://www.isi.edu/~faber/FAQ.html#SIG