Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

From: Justin T. Gibbs <gibbs_at_FreeBSD.org>
Date: Thu, 23 Jun 2011 23:01:17 -0400

On 6/22/11 4:09 PM, Kenneth D. Merry wrote:
>  On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
> > On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
> >> These two are interesting:
> >>
> >>> http://img825.imageshack.us/img825/1249/21062011014m.jpg
> >>> http://img839.imageshack.us/img839/3791/21062011015.jpg
> >>
> >> It looks like the GEOM event thread is stuck inside the cd(4) driver. The
> >> cd(4) driver is trying to acquire the peripheral lock, and is sleeping
> >> until it gets it.
> >>
> >> What isn't clear is who is holding it.

...

>  The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So
>  that tells me that one of several things is going on:
>
>  - There is a path in the cd(4) driver where it can call cam_periph_hold()
>  but not cam_periph_unhold().
>
>  - There is another thread in the system that has called cam_periph_hold(),
>  and has gotten stuck before it can call cam_periph_unhold().
>
>  - The hold/unhold logic is broken, and there is a case where a thread
>  waiting for the lock can miss the wakeup. After looking at the code, I
>  don't think this is the case, but I may have missed something.
>
>  So it is probably one of the first two cases.

...

I have a theory for the cause of this hang.

The commit that triggers this problem added calls to g_access() during the
geom_dev probe.  I believe this hit a race in cdregister() where
the periph hold lock is dropped around the changer probe code.  Why the
periph hold lock is dropped there, I do not know as I haven't fully
reviewed the changer probe code.

The drop of the lock in cdregister() can allow geom classes to probe and
thus call g_access()->g_disk_access()->cdopen() before a probe is initiated
in the "normal way" by cdregister().  cdopen() checks for media presence
by issuing immediate ccbs.  When the race is exploited, the peripheral will
be in the "probe state" when the immediate ccbs are requested.  This will
cause the device probe to be performed before the immediate ccb is returned.
When the cdopen() activity finally unwinds, cdregister() will again
take the periph hold lock and schedule the peripheral, expecting probe
processing to complete and release the hold lock.  However, since the
periph is already in the normal state (due to the successful probe performed
indirectly by the cdopen() call), that unlock never happens, thus wedging
the device.
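
The sequence described above can be sketched as a small single-threaded
model.  This is only an illustration of the theory, not the cd(4) driver
code: the struct, the enum, and every function name below
(model_cdregister, model_cdopen, periph_schedule) are hypothetical, and
the hold is reduced to a boolean flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real CAM structures. */
enum periph_state { STATE_PROBE, STATE_NORMAL };

struct periph {
    enum periph_state state;
    bool held;                      /* models the periph hold lock */
};

/*
 * Models scheduling the peripheral.  Only probe completion releases
 * the hold; in the normal state nothing does.
 */
static void periph_schedule(struct periph *p)
{
    assert(p != NULL);
    if (p->state == STATE_PROBE) {
        p->state = STATE_NORMAL;    /* probe completes ... */
        p->held = false;            /* ... and releases the hold */
    }
}

/*
 * Models an open arriving via geom.  Requesting an immediate ccb while
 * the periph is still in the probe state runs the probe as a side effect.
 */
static void model_cdopen(struct periph *p)
{
    assert(p != NULL);
    if (p->state == STATE_PROBE)
        p->state = STATE_NORMAL;
}

/*
 * Models registration.  Returns true if the hold is still set at the
 * end, i.e. the device is wedged.
 */
static bool model_cdregister(struct periph *p, bool geom_open_in_window)
{
    assert(p != NULL);
    p->state = STATE_PROBE;
    p->held = true;
    p->held = false;                /* hold dropped around changer probe */
    if (geom_open_in_window)
        model_cdopen(p);            /* the racing g_access() path */
    p->held = true;                 /* hold retaken */
    periph_schedule(p);             /* expected to end in an unhold */
    return p->held;
}
```

In this model, model_cdregister(&p, false) returns false because the
probe completes and releases the hold, while model_cdregister(&p, true)
returns true: the periph is already in the normal state when it is
scheduled, so nothing ever releases the hold.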

To test this theory, apply the following patch.  I do not know if this
is safe for changer devices, but I will review the changer code if this
patch fixes ache's problem.

--
Justin


Received on Fri Jun 24 2011 - 01:01:25 UTC
