On 10/12/20 11:13, Warner Losh wrote:
> On Mon, Oct 5, 2020 at 3:39 PM Alexander Motin <mav_at_freebsd.org> wrote:
>> On 05.10.2020 17:20, Warner Losh wrote:
>>> On Mon, Oct 5, 2020 at 12:36 PM Alexander Motin <mav_at_freebsd.org> wrote:
>>>> I can add that we've received a report of an identical panic on
>>>> FreeBSD releng/12.2 of r365436, AKA TrueNAS 12.0-RC1:
>>>> https://jira.ixsystems.com/browse/NAS-107578 . So it looks a) pretty
>>>> rare (one report from thousands of early adopters and none in our
>>>> lab), and b) it is in stable/12 too, not only head.
>>>
>>> Thanks! I'll see if I can recreate it here.... But we're accessing the
>>> sysctl tree from devmatch to get some information, which should always
>>> be OK (the fact that it isn't suggests either a bug in some driver
>>> leaving bad pointers, or some race, or both)... It would be nice to
>>> know which nodes they were, or to have a kernel panic I can look at...
>>
>> All we have now in this case is a screenshot you may see in the ticket.
>> Also, the same user previously reported, on some earlier version of
>> stable/12, other very weird panics in some other sysctls inside
>> kern.proc, where the process lock was dropped where it can't be. If we
>> guess those are related, I suspect there may be some kind of memory
>> corruption happening, but I have no clue where. Unfortunately we have
>> only textdumps for those. So if Xin is able to reproduce it locally,
>> it may be our best chance to debug it, at least this specific issue.
>
> That's totally weird.
>
> Xin Li's traceback leads to code I just rewrote in current, while this
> code leads to code that has been there for a long time and hasn't been
> MFC'd. This suggests that either Xin Li's backtrace isn't to be
> trusted, or there are two issues at play. Both are plausible.
>
> I've fixed a minor signedness bug and a possible one-byte overflow that
> might have happened in the code I just rewrote. But I suspect this is
> due to something else related to how children are handled after we've
> raced. Maybe there's something special about how USB does things,
> because other buses will create the child early and the child list is
> stable. If USB's discovery code is adding something and is racing with
> devd's walking of the tree, that might explain it... It would be nice
> if there were some way to provoke the race on a system I could get a
> core from for deeper analysis....

There might be some other players; I just haven't had a lot of time
recently to chase it down. The system is somewhat critical for my
internal network, so I can't afford long downtimes (controlled downtime
is fine: if you want me to deliberately panic the system and collect
some debugging data, please feel free to ask, as long as I can still
boot at the end of the experiment :)).

From what I was observing, it seems to be some kind of race condition
between the USB stack and the sysctl tree; however, the race must be
delicate, as I never successfully provoked the panic on my laptop, which
also runs -CURRENT. If you want to add some instrumentation to the
code, please let me know and I'll patch the tree to try to catch it.

Cheers,

> Warner
>
>> --
>> Alexander Motin
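For readers of the archive: the "signedness bug and a possible one-byte overflow" Warner mentions is a classic C pitfall pair when copying a sysctl-style name into a fixed buffer. The sketch below is purely illustrative (`copy_name` and its caller are made up, not the actual devmatch/sysctl code): it uses an unsigned `size_t` for the length so the check can't go negative, and reserves the extra byte for the terminating NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch, not the real kernel code.  Two pitfalls avoided:
 *   1. signedness: tracking remaining space in a signed int can go
 *      negative and then compare as "huge" once converted to size_t;
 *   2. off-by-one: forgetting to reserve the byte for the trailing NUL
 *      overflows the buffer by exactly one byte.
 */
static int
copy_name(char *dst, size_t dstlen, const char *src)
{
	size_t n = strlen(src);

	if (n + 1 > dstlen)	/* reserve room for the NUL */
		return (-1);	/* would overflow: refuse to copy */
	memcpy(dst, src, n + 1);	/* copies the NUL as well */
	return (0);
}
```

With an 8-byte buffer, "dev.usb" (7 chars + NUL) fits exactly, while "dev.usb0" (8 chars + NUL) is correctly rejected rather than overflowing by one byte.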
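The suspected race (USB discovery inserting children while devd/devmatch walks the tree) follows a familiar pattern. The userland sketch below is an assumption-laden illustration, not FreeBSD's actual newbus or sysctl locking; all names (`add_child`, `count_children`, `tree_lock`) are invented. The point is simply that the inserter and the walker must hold the same lock, and a node must be fully initialized before it is linked, or the walker can dereference a half-constructed child.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical child list guarded by one lock shared by both sides. */
struct node {
	int		 unit;
	struct node	*next;
};

static struct node *head;
static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;

/* Discovery side: link a new child only after it is fully set up. */
static void
add_child(int unit)
{
	struct node *n = malloc(sizeof(*n));

	if (n == NULL)
		return;
	n->unit = unit;			/* initialize before publishing */
	pthread_mutex_lock(&tree_lock);
	n->next = head;
	head = n;			/* now visible to walkers */
	pthread_mutex_unlock(&tree_lock);
}

/* Walker side: take the same lock, so no half-linked node is seen. */
static int
count_children(void)
{
	int count = 0;

	pthread_mutex_lock(&tree_lock);
	for (struct node *n = head; n != NULL; n = n->next)
		count++;
	pthread_mutex_unlock(&tree_lock);
	return (count);
}
```

If the walker skipped the lock, it could observe `head` pointing at a node whose fields were still being written on another CPU, which is exactly the kind of "bad pointers, or some race, or both" failure discussed above.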
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:25 UTC