On 17 Jun, Chris Shenton wrote:
> Don Lewis <truckman_at_FreeBSD.org> writes:
>
>> I doubt it. I checked in a fix for this problem today so you should get
>> the fix when you next cvsup.
>
> Yup, many thanks.
>
>> Can you break into ddb and do a ps to find out what state all the
>> processes are in?
>
> I'm a newbie to ddb. Was able to get a ps from a hung system but
> didn't know how to capture it to send to you. Any hints?

If you have another machine and a null modem cable you can redirect the
system console of the machine to be debugged to a serial port and run
some comm software on the other machine so that you can capture all the
output from ddb. Lacking that, there's the pencil and paper method that
I used for far too long.

>
>> You might want to try adding the DEBUG_VFS_LOCKS options to your
>> kernel config to see if that turns up anything.
>
> Oh, man, I'm getting killed here now. Rebuilt the kernel with that
> option (not found in GENERIC or other examples in /usr/src/sys/i386/conf/).
>
> Now the system is dropping into ddb every minute or so with complaints
> like the following on the screen, and in /var/log/messages:
>
>   Jun 17 21:06:08 PECTOPAH kernel: VOP_GETVOBJECT: 0xc584eb68 is not locked but should be
>   Jun 17 21:08:04 PECTOPAH last message repeated 3 times
>   ...
>   Jun 17 21:18:55 PECTOPAH kernel: VOP_GETVOBJECT: 0xc59346d8 is not locked but should be
>   Jun 17 21:18:59 PECTOPAH last message repeated 5 times
>
> Lots 'n' lots of 'em, with a few of the same hex value then another
> set for a different hex value.

Been there, but that was quite a while ago. I run this way all the time
and hardly ever see problems these days. You must be exercising some
file system code that I don't. At the ddb prompt, you can do a "tr"
command to get a stack trace, which is likely to be very helpful in
pointing out the offending code.

If you're getting a lot of VFS lock violation reports, the underlying
locking violations could be the reason that your machine deadlocks. Post
some representative stack traces. These problems are generally easy to
fix.

>> There is also a ddb command to list the locked vnodes "show
>> lockedvnods".
>
> After I type "cont" at ddb a few times the system runs for a while
> again, only to repeat. When it drops to ddb again that show command
> doesn't list anything.
>
> I may have to remove that option from my kernel just to get it to run a
> bit, even tho eventually the system will hang. It's (of course) my
> main box which the other systems NFS off, mail server, etc. :-(

At the ddb prompt you should be able to use the write command to tweak a
couple of variables to modify this behavior. If you set the
vfs_badlock_panic variable to zero, the kernel will no longer drop into
DDB when one of these lock violations occurs. If you set the
vfs_badlock_print variable to zero, the kernel will stop printing the
warnings.

If you are running the NFS *client* code on this machine, there is one
lock assertion that is easy to trigger. The stack trace will show the
nfsiod process calling nfssvc_iod(), which calls nfs_doio(), which
complains about a lock not being held. If you run into that problem,
just comment out the line:
	ASSERT_VOP_LOCKED(vp, "nfs_doio");
in nfs_doio(), in the file sys/nfsclient/nfs_bio.c. I haven't been able
to figure out the correct fix for this problem, and so far I haven't
run into any trouble from leaving it unfixed.

>
>> Are you using nullfs or unionfs which are a bit fragile?
>
> Nope. I'd be happy to mail you my kernel config if you want. I've
> posted it to http://chris.shenton.org/PECTOPAH but if the system's
> hung again, naturally it won't be available :-(
>
>
> Thanks for your help. Any other things I might try?
>
> Dunno if this matters, but I'm using a DELL CERC ATA RAID card with
> disks showing up as amrd*. Was flawless at
> 5.0-{CURRENT,RELEASE}.

Received on Tue Jun 17 2003 - 17:33:12 UTC