Re: sysctl -a causes kernel trap 12

From: Brandon Gooch <jamesbrandongooch_at_gmail.com>
Date: Fri, 18 Jan 2013 23:58:14 -0600

On Fri, Jan 18, 2013 at 2:56 PM, Xin Li <delphij_at_delphij.net> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
> On 01/18/13 12:50, Brandon Gooch wrote:
> > On Thu, Jan 10, 2013 at 4:25 PM, Xin Li <delphij_at_delphij.net> wrote:
> >
> > -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256
> >
> > To all: this has become harder and harder to replicate lately.  I've
> > tried these options, and the most important progress is that it's
> > possible to get a crashdump when debug.debugger_on_panic=0.  I
> > managed to get a backtrace which indicates that the panic occurs when
> > trying to do mtx_lock(&Giant) -> __mtx_lock_sleep -> turnstile_wait
> > -> propagate_priority, but after I added some instrumentation to
> > the surrounding code and enabled INVARIANTS and/or WITNESS, it
> > mysteriously went away.
> >
> > Reverting my instrumentation code and updating to the latest svn made
> > the issue disappear for one day.  I hit it again today but
> > unfortunately didn't get a successful dump, and after a reboot I can't
> > reproduce it again :(
> >
> > Still trying...
> >
> >
> > Any updates Xin?
>
> No, it mysteriously disappeared for now.  As far as I can tell from
> recent svn commits, nobody has committed anything that fixes it, but
> I can no longer panic my system, with or without debugging code :(
>
> > I was actually hitting what I believe to be exactly the same issue
> > as you on one of my systems, and, as you've seen, adding any extra
> > debugging or diagnostics seemed to eliminate the issue.
> >
> > I was able to generate quite a few vmcores and still have these
> > sitting around in my filesystem (along with the kernels that helped
> > produce them).
> >
> > I can recreate this crash on my system by compiling the NVIDIA
> > driver with clang at -O1 and above. Although it's been noted that
> > this issue has been seen in scenarios without an NVIDIA driver in
> > the mix, whatever is happening in the kernel to cause the panic is
> > somehow triggered by this, at least on my system.
>
> I'm not sure if this is the same problem.  Could you please try using
> gcc to compile the NVIDIA driver and see if that "fixes" the problem?
>
> Cheers,
> - --
> Xin LI <delphij_at_delphij.net>    https://www.delphij.net/
> FreeBSD - The Power to Serve!           Live free or die
>

Indeed, a gcc-compiled NVIDIA module eliminates the issue; sorry if I
didn't mention this earlier.

At first, my system would simply hang while booting. I was able to figure
out that the hang happened during /etc/rc.d/initrandom. At one point I
removed the call to sysctl -a from 'better_than_nothing()' in
/etc/rc.d/initrandom just to have a booting system. I finally managed to
get a panic by adding SW_WATCHDOG to my kernel and running watchdogd(8).

For me, this panic would come and go seemingly at random as well, and when
I first started seeing it I couldn't learn much of anything by fumbling
around in the debugger. I ended up modularizing everything I could in my
kernel config, then loading modules one by one and rebooting over and over
until I finally found what appeared to be the problem: the NVIDIA module
compiled with clang.

Oh, another thing: at times it seemed as though the number of loaded
modules mattered, as I could get the hang with 41 modules loaded, but not
with 40 or 42?! I admit that when I was seeing that behavior, I hadn't yet
eliminated the NVIDIA driver from my loaded modules. I need to revisit the
panic situation to confirm this particular strangeness.

Here's the last panic I had:

Unread portion of the kernel message buffer:
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1175 (sysctl)

(kgdb) bt
#0  doadump (textdump=1694704112) at pcpu.h:229
#1  0xffffffff802fab82 in db_fncall (dummy1=<value optimized out>,
    dummy2=<value optimized out>, dummy3=<value optimized out>,
    dummy4=<value optimized out>)
    at /usr/src/sys/ddb/db_command.c:578
#2  0xffffffff802fa85a in db_command (last_cmdp=<value optimized out>,
    cmd_table=<value optimized out>, dopager=1)
    at /usr/src/sys/ddb/db_command.c:449
#3  0xffffffff802fa612 in db_command_loop ()
    at /usr/src/sys/ddb/db_command.c:502
#4  0xffffffff802fcf60 in db_trap (type=<value optimized out>, code=0)
    at /usr/src/sys/ddb/db_main.c:231
#5  0xffffffff804a7b93 in kdb_trap (type=12, code=0,
    tf=<value optimized out>)
    at /usr/src/sys/kern/subr_kdb.c:654
#6  0xffffffff807157c5 in trap_fatal (frame=0xffffff8865032670,
    eva=<value optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:867
#7  0xffffffff80715adb in trap_pfault (frame=0x0, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:698
#8  0xffffffff8071529b in trap (frame=0xffffff8865032670)
    at /usr/src/sys/amd64/amd64/trap.c:463
#9  0xffffffff806ff382 in calltrap () at exception.S:228
#10 0xffffffff8047bd50 in sysctl_sysctl_next_ls (lsp=<value optimized out>,
    name=0xffffff8865032a80, namelen=<value optimized out>,
    next=0xffffff8865032898, len=0xffffff8865032904, level=3)
    at /usr/src/sys/kern/kern_sysctl.c:759
#11 0xffffffff8047be5e in sysctl_sysctl_next_ls (lsp=0xfffffe000d3f0080,
    name=0xffffff8865032a7c, namelen=<value optimized out>,
    next=0xffffff8865032894, len=0xffffff8865032904, level=2)
    at /usr/src/sys/kern/kern_sysctl.c:786
#12 0xffffffff8047be5e in sysctl_sysctl_next_ls (lsp=0xfffffe000d3f0080,
    name=0xffffff8865032a78, namelen=<value optimized out>,
    next=0xffffff8865032890, len=0xffffff8865032904, level=1)
    at /usr/src/sys/kern/kern_sysctl.c:786
#13 0xffffffff8047bca3 in sysctl_sysctl_next (oidp=<value optimized out>,
    arg1=0xffffff8865032a78, arg2=4, req=0xffffff88650329a8)
    at /usr/src/sys/kern/kern_sysctl.c:808
#14 0xffffffff8047b03f in sysctl_root (arg1=<value optimized out>,
    arg2=<value optimized out>)
    at /usr/src/sys/kern/kern_sysctl.c:1513
#15 0xffffffff8047b5d8 in userland_sysctl (td=<value optimized out>,
    name=0xffffff8865032a70, namelen=<value optimized out>,
    old=<value optimized out>, oldlenp=<value optimized out>,
    inkernel=<value optimized out>, new=<value optimized out>,
    newlen=<value optimized out>, retval=<value optimized out>,
    flags=1694706064)
    at /usr/src/sys/kern/kern_sysctl.c:1623
#16 0xffffffff8047b3c4 in sys___sysctl (td=0xfffffe001e2d4900,
    uap=0xffffff8865032b80)
    at /usr/src/sys/kern/kern_sysctl.c:1549
#17 0xffffffff807160f7 in amd64_syscall (td=0xfffffe001e2d4900, traced=0)
    at subr_syscall.c:135
#18 0xffffffff806ff66b in Xfast_syscall () at exception.S:387
#19 0x000000080093697a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal

Any ideas on where to look through this vmcore?

-Brandon