Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

From: Kenneth D. Merry <ken_at_freebsd.org>
Date: Wed, 22 Jun 2011 14:09:19 -0600

On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
> On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
> > These two are interesting:
> > 
> > > http://img825.imageshack.us/img825/1249/21062011014m.jpg
> > > http://img839.imageshack.us/img839/3791/21062011015.jpg
> > 
> > It looks like the GEOM event thread is stuck inside the cd(4) driver.  The
> > cd(4) driver is trying to acquire the peripheral lock, and is sleeping
> > until it gets it.
> > 
> > What isn't clear is who is holding it.  The ps output shows an idle thread
> > running on CPU 1, and thread 100014 (taskq) running on CPU 0.
> > Unfortunately I don't see a stack trace for that.  (I might have missed
> > it.)
> > 
> > Do you happen to have the image with the stack trace for that thread?
> 
> I don't have the image because no disks are mounted at that stage and the 
> swap slice is not attached. But I can issue more specific DDB commands to 
> narrow it down, just say what you need in detail.
> 
> BTW, the machine has 2 DVD drives, both attached to a Marvell IDE plain ATA 
> interface; they always worked before.
> 
> Are you sure that something is holding the lock? 'show lock' shows absolutely 
> nothing, it is empty.

Well, after looking at the code a little more, it looks like the "lock"
that is being held is the periph lock, which is really just a flag.
So 'show lock' wouldn't show anything relevant.  Here's cam_periph_hold():

int
cam_periph_hold(struct cam_periph *periph, int priority)
{
	int error;

	/*
	 * Increment the reference count on the peripheral
	 * while we wait for our lock attempt to succeed
	 * to ensure the peripheral doesn't disappear out
	 * from under us while we sleep.
	 */

	if (cam_periph_acquire(periph) != CAM_REQ_CMP)
		return (ENXIO);

	mtx_assert(periph->sim->mtx, MA_OWNED);
	while ((periph->flags & CAM_PERIPH_LOCKED) != 0) {
		periph->flags |= CAM_PERIPH_LOCK_WANTED;
		if ((error = mtx_sleep(periph, periph->sim->mtx, priority,
		     "caplck", 0)) != 0) {
			cam_periph_release_locked(periph);
			return (error);
		}
	}

	periph->flags |= CAM_PERIPH_LOCKED;
	return (0);
}

The GEOM event thread is stuck sleeping in the mtx_sleep() call above.  So
that tells me that one of several things is going on:

 - There is a path in the cd(4) driver where it can call cam_periph_hold()
   but not cam_periph_unhold().

 - There is another thread in the system that has called cam_periph_hold(),
   and has gotten stuck before it can call cam_periph_unhold().

 - The hold/unhold logic is broken, and there is a case where a thread
   waiting for the lock can miss the wakeup.  After looking at the code, I
   don't think this is the case, but I may have missed something.
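
To make the first case concrete, here is a minimal userspace sketch of the
flag-based hold/unhold pattern.  It is not kernel code: the names
(model_periph, model_hold, model_unhold, model_open) and the flag values are
hypothetical, and where the real code would sleep in mtx_sleep() the model
returns EWOULDBLOCK so the stuck state can be observed in a single thread.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values; the real CAM_PERIPH_* constants differ. */
#define MODEL_LOCKED		0x01
#define MODEL_LOCK_WANTED	0x02

struct model_periph {
	int flags;
};

/*
 * Model of the hold path.  Where the kernel code would mtx_sleep(),
 * this returns EWOULDBLOCK so a caller can see it would have blocked.
 */
static int
model_hold(struct model_periph *p)
{
	if ((p->flags & MODEL_LOCKED) != 0) {
		p->flags |= MODEL_LOCK_WANTED;
		return (EWOULDBLOCK);
	}
	p->flags |= MODEL_LOCKED;
	return (0);
}

/*
 * Model of the unhold path: clear both flags.  A real implementation
 * would also wake up any thread sleeping on the peripheral.
 */
static void
model_unhold(struct model_periph *p)
{
	assert((p->flags & MODEL_LOCKED) != 0);
	p->flags &= ~(MODEL_LOCKED | MODEL_LOCK_WANTED);
}

/*
 * Hypothetical open routine with the bug from the first case above:
 * the error path returns without calling model_unhold().
 */
static int
model_open(struct model_periph *p, int media_error)
{
	int error;

	if ((error = model_hold(p)) != 0)
		return (error);
	if (media_error)
		return (EIO);	/* BUG: the hold is leaked here */
	model_unhold(p);
	return (0);
}
```

In this model, one call to model_open() that takes the error path leaves
MODEL_LOCKED set, and every later call returns EWOULDBLOCK until something
calls model_unhold().  In the kernel, that later caller would sleep
indefinitely, which is consistent with what the GEOM event thread is doing.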

So it is probably one of the first two cases.  From the dmesg, I only see
cd1 listed, not cd0.  So it is possible that cd0 is stuck in the probe code
somewhere, and the geom code just gets stuck trying to open it when the
probe hasn't completed.

Seeing the stack trace for the taskq thread that is running on CPU 0
(thread 100014) might be enlightening, but it's hard to say; it may or may
not show the issue.

It's possible that this issue is directly related to the commit in
question; perhaps there is an error being returned that wasn't returned
before and it isn't being handled right in the cd(4) driver.  (The cd(4)
driver wasn't touched in the commit.)

It's also possible that the commit in question just changed the timing and
your system is hitting a race that was there previously.

Ken
-- 
Kenneth Merry
ken_at_FreeBSD.ORG