Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)

From: Scott Long <scottl_at_samsco.org>
Date: Thu, 23 Jun 2011 23:18:40 -0600
On Jun 23, 2011, at 9:01 PM, Justin T. Gibbs wrote:

> On 6/22/11 4:09 PM, Kenneth D. Merry wrote:
>> On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
>> > On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
>> >> These two are interesting:
>> >>
>> >>> http://img825.imageshack.us/img825/1249/21062011014m.jpg
>> >>> http://img839.imageshack.us/img839/3791/21062011015.jpg
>> >>
>> >> It looks like the GEOM event thread is stuck inside the cd(4) 
> driver. The
>> >> cd(4) driver is trying to acquire the peripheral lock, and is sleeping
>> >> until it gets it.
>> >>
>> >> What isn't clear is who is holding it.
> 
> ...
> 
>> The GEOM event thread is stuck sleeping in the mtx_sleep() call above. So
>> that tells me that one of several things is going on:
>> 
>> - There is a path in the cd(4) driver where it can call cam_periph_hold()
>> but not cam_periph_unhold().
>> 
>> - There is another thread in the system that has called cam_periph_hold(),
>> and has gotten stuck before it can call cam_periph_unhold().
>> 
>> - The hold/unhold logic is broken, and there is a case where a thread
>> waiting for the lock can miss the wakeup. After looking at the code, I
>> don't think this is the case, but I may have missed something.
>> 
>> So it is probably one of the first two cases.
> 
> ...
> 
> I have a theory for the cause of this hang.
> 
> The commit that triggers this problem added calls to g_access() during the
> geom_dev probe.  I believe this hit a race in cdregister() where
> the periph hold lock is dropped around the changer probe code.  Why the
> periph hold lock is dropped there, I do not know as I haven't fully
> reviewed the changer probe code.
> 

Are you talking about this?

        disk_create(softc->disk, DISK_VERSION);
        cam_periph_lock(periph);
        cam_periph_unhold(periph);
[...]
        if (((cgd->ccb_h.target_lun > 0)
          && ((softc->quirks & CD_Q_NO_CHANGER) == 0))
         || ((softc->quirks & CD_Q_CHANGER) != 0)) {

The unhold there compliments the hold that was done prior to disk_create().  The hold/unhold is done as a hack around the need to drop the periph/sim mutex while calling disk_create(), due to the later's insistence on using blocking calls.  I've wanted to re-think how that pattern is done (it's the same gross hack in nearly all of the periph drivers), but haven't gotten around to it.  If the 'hold' semaphore needs to be held longer to prevent the race that you're theorizing, then it should be possible to simply extend its coverage in the code block, but I'm not sure if it'll result in an unintended deadlock with the changer enumeration/matching code.  I _think_ that it'll be ok, but the density of magic in the code is a bit overwhelming at this time of night =-)

Scott
Received on Fri Jun 24 2011 - 03:18:45 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:15 UTC