Re: GPF on boot with devmatch

From: Xin Li <delphij_at_delphij.net>
Date: Mon, 12 Oct 2020 15:42:52 -0700
On 10/12/20 11:13, Warner Losh wrote:
> 
> 
> On Mon, Oct 5, 2020 at 3:39 PM Alexander Motin <mav_at_freebsd.org> wrote:
> 
>     On 05.10.2020 17:20, Warner Losh wrote:
>     > On Mon, Oct 5, 2020 at 12:36 PM Alexander Motin <mav_at_freebsd.org> wrote:
>     >
>     >     I can add that we've received a report about an identical panic
>     >     on FreeBSD releng/12.2 of r365436, AKA TrueNAS 12.0-RC1:
>     >     https://jira.ixsystems.com/browse/NAS-107578 .  So it looks
>     >     a) pretty rare (one report from thousands of early adopters and
>     >     none in our lab), and b) it is in stable/12 too, not only head.
>     >
>     > Thanks! I'll see if I can recreate it here...  But we're accessing
>     > the sysctl tree from devmatch to get some information, which should
>     > always be OK (the fact that it isn't suggests either a bug in some
>     > driver leaving bad pointers, a race, or both)...  It would be nice
>     > to know which nodes they were, or to have a kernel panic I can
>     > look at...
> 
>     All we have now in this case is a screenshot, which you can see in
>     the ticket.  Previously the same user, on some earlier version of
>     stable/12, also reported other very weird panics in some other
>     sysctls inside kern.proc, where the process lock was dropped in a
>     place where it can't be.  If we guess those are related, I suspect
>     there may be some kind of memory corruption happening, but I have no
>     clue where.  Unfortunately we have only textdumps for those.  So if
>     Xin is able to reproduce it locally, that may be our best chance to
>     debug it, at least this specific issue.
> 
> 
> That's totally weird. 
> 
> Xin Li's backtrace led to code I just rewrote in current, while this
> one leads to code that's been there for a long time and hasn't been
> MFC'd.  That suggests either that Xin Li's backtrace isn't to be
> trusted, or that there are two issues at play.  Both are plausible.
> I've fixed a minor signedness bug and a possible one-byte overflow
> that might have happened in the code I just rewrote.  But I suspect
> this is due to something else related to how children are handled
> after we've raced.  Maybe there's something special about how USB does
> things, because other buses will create the child early, so the child
> list is stable.  If USB's discovery code is adding something and is
> racing with devd's walking of the tree, that might explain it...  It
> would be nice if there were some way to provoke the race on a system I
> could get a core from, for deeper analysis...

There might be some other players; I just haven't had much time
recently to track it down.  The system is somewhat critical for my
internal network, so I can't afford long downtimes.  (Controllable
downtimes are fine: if you want me to deliberately panic the system and
collect some debugging data, please feel free to ask, as long as I can
still boot at the end of the experiment. :))

From what I have observed, it seems to be some kind of race condition
between the USB stack and the sysctl tree; however, the race may be
delicate, as I have never successfully provoked the panic on my laptop,
which also runs -CURRENT.  If you want to add some instrumentation to
the code, please let me know and I'll patch the tree to try to catch it.

Cheers,

> Warner
>  
> 
>     -- 
>     Alexander Motin
> 


Received on Mon Oct 12 2020 - 20:43:01 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:25 UTC