Re: CAM breaks USB [was Re: USB causing boot to hang]

From: Warner Losh <imp_at_bsdimp.com>
Date: Fri, 6 Dec 2019 17:33:33 -0700
On Fri, Dec 6, 2019 at 4:41 PM Steve Kargl <sgk_at_troutmask.apl.washington.edu>
wrote:

> On Fri, Dec 06, 2019 at 06:15:32PM -0500, Alexander Motin wrote:
> > On 06.12.2019 17:52, Steve Kargl wrote:
> > > On Fri, Dec 06, 2019 at 03:33:09PM -0700, Warner Losh wrote:
> > >> On Fri, Dec 6, 2019 at 3:31 PM Steve Kargl <sgk_at_troutmask.apl.washington.edu>
> > >> wrote:
> > >>
> > >>> On Fri, Dec 06, 2019 at 12:23:16PM -0800, Steve Kargl wrote:
> > >>>> I updated /usr/src to r355452, and updated my kernel and
> > >>>> world.  Upon rebooting, verbose boot messages suggest
> > >>>> the system is hanging when USB starts to attach.  With
> > >>>> the 3-week-old kernel, verbose boot shows:
> > >>>>
> > >>>> ...
> > >>>> pcm4: Playback channel matrix is: 2.0 (unknown)
> > >>>> usbus0: 5.0Gbps Super Speed USB v3.0
> > >>>> ...
> > >>>>
> > >>>> end with a prompt on the console.  With today's kernel,
> > >>>> boot is hung after the last pcm4: message and no usbus0
> > >>>> is displayed.
> > >>>>
> > >>>> The booting kernel/system is a
> > >>>>
> > >>>> % uname -a
> > >>>> FreeBSD 13.0-CURRENT #1 r354658: Wed Nov 13 11:23:32 PST 2019, amd64
> > >>>>
> > >>>> Again, the failing kernel is r355452
> > >>>>
> > >>>
> > >>> The problem seems to be caused by 355010.  This is a commit to
> > >>> fix CAM, which seems to break USB.
> > >>>
> > >>
> > >> Yes. mav_at_ made this change...
> > >>
> > >
> > > src/UPDATING seems to be missing an entry about CAM breaking USB.
> >
> > And also that moon is made of cheese. :-\
> >
>
> Not sure what you mean.  You made a change, and the commit log
> even notes that there could be an issue.  Yet, you want a user
> to waste half a day finding the root cause of the problem.
>
> > > The commit message for 355010 states:
> > >
> > >    Devices appearing on USB bus later may still require setting
> > >    kern.cam.boot_delay, but hopefully those are minority.
> > >
> > > There is no statement about "where" kern.cam.boot_delay should be set.
> > > There is no statement about "what" value(s) kern.cam.boot_delay
> > > should be.
> >
> > If you never needed it before, you still don't need it.
>
> Prior to 355010 the system just boots up.  After 355010
> the system hangs.  Will  kern.cam.boot_delay paper over
> whatever (latent?) bug you've exposed?
>
> > > For the record, adding kern.cam.boot_delay to /boot/loader.conf with
> > > the values 0, 1, and "1" did not allow the system to boot.
> >
> > The boot_delay value is measured in milliseconds, so values of 0 and 1
> > mean close to nothing.  You may try to set it to something like 10000
> > if you really want to try to delay CAM device attach, but I doubt it
> > will help.
>
> 0 and 1 were my guesses that boot_delay was an integer representation
> of a boolean value; 0 being disable the new code; 1 being enable new
> code.  Looks like I guessed wrong given the documentation.
>
>
> > > The system
> > > will not boot with or without
> > >
> > > umass0 on uhub1
> > > umass0: <Seagate BUP SL, class 0/0, rev 3.00/1.00, addr 2> on usbus0
> > > umass0:  SCSI over Bulk-Only; quirks = 0x0100
> > > umass0:9:0: Attached to scbus9
> > > da0 at umass-sim0 bus 0 scbus9 target 0 lun 0
> > > da0: <Seagate BUP SL 0304> Fixed Direct Access SPC-4 SCSI device
> > > da0: Serial Number NA7PEG27
> > > da0: 400.000MB/s transfers
> > > da0: 3815447MB (7814037167 512 byte sectors)
> > > da0: quirks=0x2<NO_6_BYTE>
> > >
> > > plugged into the port.
> >
> > If system hangs even without any USB disk attached, then I don't see a
> > relation between CAM and USB here.  My change could affect some timings
> > of the boot process, but without closer debugging it is hard to guess
> > something.  To be sure whether USB is related I would try to disable all
> > USB controllers either in BIOS or with set of loader tunables like
> > hint.ehci.0.disabled=1 , hint.ohci.0.disabled=1 ,
> > hint.xhci.0.disabled=1, ...
>
> Yep.  Completely disabling USB allows the system to boot.  I don't
> see how this would be unexpected, as umass uses CAM.
>
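
For anyone following along, the tunables discussed above would look
roughly like this in /boot/loader.conf. The 10000 value and the hint
names come from the messages quoted above; the unit numbers (the ".0")
are assumptions that depend on which controllers a given machine has,
and this is only meant for narrowing the problem down, not as a fix:

```
# Delay CAM device attach; the value is in milliseconds.
kern.cam.boot_delay="10000"

# Disable USB controllers one driver at a time to test the USB theory.
hint.ehci.0.disabled="1"
hint.ohci.0.disabled="1"
hint.xhci.0.disabled="1"
```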

There is a long, tangled history of multiple mechanisms being used to
control releasing mountroot() to do its thing. CAM historically used one
method, while USB used another. I've not closely reviewed this change to
see what the issue might be, but if the system booted before and doesn't
now, then there's been a de facto bug introduced or exposed by this change.
Maybe it would be better to back out 355010 and have it reviewed and tested
more carefully. It's been tricky in the past to get right, and since issues
have come up, it might be best to take a more conservative approach. If we
can't get a quick resolution, I'd recommend that we go this route...

Looking at the change, I see that it is a bit weird...  It ditches the 'do
all the config intr hook stuff' approach, which completes before we look at
the root holds, in favor of using the root holds themselves. In theory, this
should be fine... however, USB does root holds for its uhub exploration,
which then finds umass, which needs its own enumeration before it's usable
as root. I need to see what interlocks are there, but it does look a little
like there might be a chance for USB config and CAM config to race more now
than before. I say 'might' because I've not looked at all the places where
things were held, released, etc.
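
To make the worry concrete, here is a toy userland model of hold
counting. It is a sketch only: struct root_holds, hold_take(),
hold_drop(), and explore_finds_umass() are made-up names for
illustration, not the kernel interfaces, and it does not claim to show
what 355010 actually does. It only shows why the ordering matters when
one holder (uhub exploration) discovers a second thing (umass) that
needs its own hold:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userland model of root holds; NOT the kernel API. */
struct root_holds {
	int count;	/* outstanding holds; mountroot waits for 0 */
};

static void
hold_take(struct root_holds *rh)
{
	rh->count++;
}

static void
hold_drop(struct root_holds *rh)
{
	assert(rh->count > 0);
	rh->count--;
}

/* mountroot may proceed only when nobody is holding it back. */
static bool
mountroot_may_proceed(const struct root_holds *rh)
{
	return (rh->count == 0);
}

/*
 * Model hub exploration that discovers a umass device.  If 'overlap'
 * is true, the umass side takes its hold before the hub releases its
 * own; otherwise the umass hold arrives only after the hub let go.
 * Returns true if there was a window in which mountroot could have
 * proceeded before the umass device finished enumerating.
 */
static bool
explore_finds_umass(bool overlap)
{
	struct root_holds rh = { 0 };
	bool window;

	hold_take(&rh);			/* uhub exploration starts */
	if (overlap)
		hold_take(&rh);		/* umass hold taken in time */
	hold_drop(&rh);			/* uhub exploration done */
	window = mountroot_may_proceed(&rh);
	if (!overlap)
		hold_take(&rh);		/* umass hold arrives late */
	hold_drop(&rh);			/* umass enumeration done */
	assert(mountroot_may_proceed(&rh));
	return (window);
}
```

In this model, explore_finds_umass(true) never lets the count reach zero
early, while explore_finds_umass(false) does. Whether the real code has
such a gap is exactly what I haven't verified.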

Disabling USB is a big clue, but I'm not entirely sure what it's a clue of.
I think it means it's disabling the other half of the race, but it could
also be disabling a deadlock between threads that could never deadlock
before.

Warner
Received on Fri Dec 06 2019 - 23:33:48 UTC