Re: 5.1-CURRENT hangs on disk i/o? sysctl_old_user() non-sleepable locks

From: Don Lewis <truckman_at_FreeBSD.org>
Date: Tue, 17 Jun 2003 19:32:59 -0700 (PDT)

On 17 Jun, Chris Shenton wrote:
> Don Lewis <truckman_at_FreeBSD.org> writes:
> 
>> I doubt it.  I checked in a fix for this problem today so you should get
>> the fix when you next cvsup.
> 
> Yup, many thanks.
> 
>> Can you break into ddb and do a ps to find out what state all the
>> processes are in?
> 
> I'm a newbie to ddb.  Was able to get a ps from a hung system but
> didn't know how to capture it to send to you.  Any hints?

If you have another machine and a null modem cable you can redirect the
system console of the machine to be debugged to a serial port and run
some comm software on the other machine so that you can capture all the
output from ddb.

Lacking that, there's the pencil and paper method that I used for far
too long.

> 
>> You might want to try adding the DEBUG_VFS_LOCKS options to your
>> kernel config to see if that turns up anything.
> 
> Oh, man, I'm getting killed here now. Rebuilt the kernel with that
> option (not found in GENERIC or other examples in /usr/src/sys/i386/conf/).
> 
> Now the system is dropping into ddb every minute or so with complaints
> like the following on the screen, and in /var/log/messages:
> 
> Jun 17 21:06:08 PECTOPAH kernel: VOP_GETVOBJECT: 0xc584eb68 is not locked but should be
> Jun 17 21:08:04 PECTOPAH last message repeated 3 times
> ...
> Jun 17 21:18:55 PECTOPAH kernel: VOP_GETVOBJECT: 0xc59346d8 is not locked but should be
> Jun 17 21:18:59 PECTOPAH last message repeated 5 times
> 
> Lots 'n' lots of 'em, with a few of the same hex value then another
> set for a different hex value.

Been there, but that was quite a while ago.  I run this way all the time
and hardly ever see problems these days.  You must be exercising some
file system code that I don't.  At the ddb prompt, you can do a "tr"
command to get a stack trace, which is likely to be very helpful in
pointing out the offending code.

If you're getting a lot of VFS lock violation reports, the underlying
locking violations could be the reason that your machine deadlocks.

Post some representative stack traces.  These problems are generally
easy to fix.

>> There is also a ddb command to list the locked vnodes: "show
>> lockedvnods".
> 
> After I type "cont" at ddb a few times the system runs for a while
> again, only to repeat.  When it drops to ddb again that show command
> doesn't list anything. 
> 
> I may have to remove that option from my kernel just to get to run a
> bit, even tho eventually the system will hang.  It's (of course) my
> main box which the other systems NFS off, mail server, etc. :-(

At the ddb prompt you should be able to use the write command to tweak a
couple of variables to modify this behavior.  If you set the
vfs_badlock_panic variable to zero, the kernel will no longer drop into
DDB when one of these lock violations occurs.  If you set the
vfs_badlock_print variable to zero, the kernel will stop printing the
warnings.
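
To make the effect of the two variables concrete, here is a minimal
userland sketch of the gating described above.  It is not the kernel
source: report_badlock() and the BADLOCK_* constants are hypothetical
names made up for illustration, and the initial values of 1 are an
assumption based on the behavior you are seeing.  At the ddb prompt
itself, the change would presumably be something along the lines of
"write vfs_badlock_panic 0".

```c
#include <stdio.h>

/*
 * Userland stand-ins for the two kernel variables named above.
 * Assumed to start out nonzero, matching the behavior in the report.
 */
int vfs_badlock_panic = 1;	/* nonzero: stop in the debugger */
int vfs_badlock_print = 1;	/* nonzero: print the warning */

/* Hypothetical result bits, for illustration only. */
#define BADLOCK_PRINTED	0x1
#define BADLOCK_STOPPED	0x2

/*
 * Hypothetical helper modelling what happens on a lock violation.
 * Returns a bitmask of the actions taken.
 */
int
report_badlock(const char *vop, void *vp)
{
	int action = 0;

	if (vfs_badlock_print) {
		printf("%s: %p is not locked but should be\n", vop, vp);
		action |= BADLOCK_PRINTED;
	}
	if (vfs_badlock_panic) {
		/* The real kernel would enter DDB here; this only records it. */
		action |= BADLOCK_STOPPED;
	}
	return (action);
}
```

With both variables nonzero the sketch prints and "stops"; with
vfs_badlock_panic zeroed it only prints; with both zeroed it stays
silent and keeps going.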

If you are running the NFS *client* code on this machine, there is one
lock assertion that is easy to trigger.  The stack trace will show the
nfsiod process calling nfssvc_iod(), which calls nfs_doio(), which
complains about a lock not being held.  If you run into that problem,
just comment out the line:
	 ASSERT_VOP_LOCKED(vp, "nfs_doio");
in nfs_doio(), in the file sys/nfsclient/nfs_bio.c.  I haven't been able
to figure out the correct fix for this problem, and so far I haven't
seen any ill effects from leaving it unfixed.

> 
>> Are you using nullfs or unionfs which are a bit fragile?
> 
> Nope.  I'd be happy to mail you my kernel config if you want. I've
> posted it to http://chris.shenton.org/PECTOPAH but if the system's
> hung again, naturally it won't be available :-(
> 
> 
> Thanks for your help.  Any other things I might try?
> 
> Dunno if this matters, but I'm using a DELL CERC ATA RAID card with
> disks showing up as amrd*.  Was flawless at
> 5.0-{CURRENT,RELEASE}.