Re: amd64 head -r329465 (non-debug build, but with symbols): "panic: spin lock held too long" during make check-old, reported during a sys_vfork

From: Trond Endrestøl <Trond.Endrestol_at_fagskolen.gjovik.no>
Date: Sun, 18 Feb 2018 13:25:17 +0100 (CET)
On Sat, 17 Feb 2018 17:39-0800, Mark Millard wrote:

> This is for FreeBSD running under Hyper-V on a Windows 10 Pro machine.
> The FreeBSD "disk" bindings are to SSDs, not the insides of NTFS files.
> 29 logical processors assigned to FreeBSD (on a 32-thread Ryzen
> Threadripper 1950X). No other Hyper-V use.
> 
> This happened during:
> 
> # ~/sys_build_scripts.amd64-host/make_powerpc64vtsc_nodebug_clang_altbinutils-amd64-host.sh check-old DESTDIR=/usr/obj/DESTDIRs/clang-powerpc64-installworld_altbinutils
> Script started, output file is /root/sys_typescripts/typescript_make_powerpc64vtsc_nodebug_clang_altbinutils-amd64-host-2018-02-17:15:56:20
> >>> Checking for old files
> 
> 
> (Hand typed from a picture of a window's content
> at slighly different times, expect typos:)
> 
> KDB: enter: panic
> [thread pid 42170 tid 100752 ]
> Stopped at kdb_enter+0x3b: movq $0,kdb_why
> db> call doadump
> Dumping 4825 out of 110559 MB: ... (omitted) ...
> Dump complete
> = 0
> 
> 
> (The "pid 42170" identification as the process reporting the
> issue does not seem to appear in the core.txt.0 file.)
> 
> 
> # ls -lTdt /var/crash/*
> -rw-r--r--  1 root  wheel      100792 Feb 17 16:09:18 2018 /var/crash/core.txt.0
> lrwxr-xr-x  1 root  wheel           8 Feb 17 16:09:08 2018 /var/crash/vmcore.last -> vmcore.0
> lrwxr-xr-x  1 root  wheel           6 Feb 17 16:09:08 2018 /var/crash/info.last -> info.0
> -rw-------  1 root  wheel  5060296704 Feb 17 16:09:08 2018 /var/crash/vmcore.0
> -rw-------  1 root  wheel         392 Feb 17 16:08:59 2018 /var/crash/info.0
> -rw-r--r--  1 root  wheel           2 Feb 17 16:08:59 2018 /var/crash/bounds
> -rw-r--r--  1 root  wheel           5 Nov 22 04:34:36 2017 /var/crash/minfree
> 
> >From /var/crash/core.txt.0 :
> 
> Unread portion of the kernel message buffer:
> spin lock 0xffffffff81b2dcc0 (sched lock 5) held by 0xfffff8011d936560 (tid 100691) too long
> panic: spin lock held too long
> cpuid = 5
> time = 1518911834
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00f10094d0
> vpanic() at vpanic+0x18d/frame 0xfffffe00f1009530
> panic() at panic+0x43/frame 0xfffffe00f1009590
> _mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x71/frame 0xfffffe00f10095a0
> thread_lock_flags_() at thread_lock_flags_+0xdb/frame 0xfffffe00f1009610
> statclock_cnt() at statclock_cnt+0xdc/frame 0xfffffe00f1009650
> handleevents() at handleevents+0x113/frame 0xfffffe00f10096a0
> timercb() at timercb+0xa9/frame 0xfffffe00f10096f0
> lapic_handle_timer() at lapic_handle_timer+0xa7/frame 0xfffffe00f1009730
> timerint_u() at timerint_u+0x96/frame 0xfffffe00f1009810
> thread_lock_flags_() at thread_lock_flags_+0xc1/frame 0xfffffe00f1009880
> fork1() at fork1+0x1b9f/frame 0xfffffe00f1009930
> sys_vfork() at sys_vfork+0x4c/frame 0xfffffe00f1009980
> amd64_syscall() at amd64_syscall+0xa48/frame 0xfffffe00f1009ab0
> fast_syscall_common() at fast_syscall_common+0x101/frame 0x7fffffffcfc0
> KDB: enter: panic
> 
> __curthread () at ./machine/pcpu.h:230
> 230             __asm("movq %%gs:%1,%0" : "=r" (td)
> (kgdb) #0  __curthread () at ./machine/pcpu.h:230
> #1  doadump (textdump=-2122191464) at /usr/src/sys/kern/kern_shutdown.c:347
> #2  0xffffffff8040b42c in db_fncall_generic (addr=<optimized out>, 
>    rv=<optimized out>, nargs=<optimized out>, args=<optimized out>)
>    at /usr/src/sys/ddb/db_command.c:609
> #3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>, 
>    dummy3=<optimized out>, dummy4=<optimized out>)
>    at /usr/src/sys/ddb/db_command.c:657
> #4  0xffffffff8040af79 in db_command (last_cmdp=<optimized out>, 
>    cmd_table=<optimized out>, dopager=<optimized out>)
>    at /usr/src/sys/ddb/db_command.c:481
> #5  0xffffffff8040acf4 in db_command_loop ()
>    at /usr/src/sys/ddb/db_command.c:534
> #6  0xffffffff8040df9f in db_trap (type=<optimized out>, code=<optimized out>)
>    at /usr/src/sys/ddb/db_main.c:250
> #7  0xffffffff80b370e3 in kdb_trap (type=3, code=-61456, tf=<optimized out>)
>    at /usr/src/sys/kern/subr_kdb.c:697
> #8  0xffffffff80fa2c5c in trap (frame=0xfffffe00f1009400)
>    at /usr/src/sys/amd64/amd64/trap.c:547
> #9  <signal handler called>
> #10 kdb_enter (why=0xffffffff811f280b "panic", msg=<optimized out>)
>    at /usr/src/sys/kern/subr_kdb.c:479
> #11 0xffffffff80aef17a in vpanic (fmt=<optimized out>, ap=0xfffffe00f1009570)
>    at /usr/src/sys/kern/kern_shutdown.c:801
> #12 0xffffffff80aeefc3 in panic (fmt=0x0)
>    at /usr/src/sys/kern/kern_shutdown.c:739
> #13 0xffffffff80acfa31 in _mtx_lock_indefinite_check (m=<optimized out>, 
>    ldap=<optimized out>) at /usr/src/sys/kern/kern_mutex.c:1224
> #14 0xffffffff80acfb9b in thread_lock_flags_ (td=0xfffff818719d1000, 
>    opts=<optimized out>, file=<optimized out>, line=<optimized out>)
>    at /usr/src/sys/kern/kern_mutex.c:913
> #15 0xffffffff80a89d6c in statclock_cnt (cnt=1, usermode=<optimized out>)
>    at /usr/src/sys/kern/kern_clock.c:768
> #16 0xffffffff810d0003 in handleevents (now=43230207690178, fake=0)
>    at /usr/src/sys/kern/kern_clocksource.c:196
> #17 0xffffffff810d0709 in timercb (et=0xffffffff81c528e8 <lapic_et>, 
>    arg=<optimized out>) at /usr/src/sys/kern/kern_clocksource.c:353
> #18 0xffffffff8110dad7 in lapic_handle_timer (frame=0xfffffe00f1009740)
>    at /usr/src/sys/x86/x86/local_apic.c:1305
> #19 0xffffffff80f849d0 in timerint_u ()
>    at /usr/src/sys/amd64/amd64/apic_vector.S:132
> #20 0xfffffe00f1009828 in ?? ()
> #21 0x000000000000b6b1 in ?? ()
> #22 0x0000000000002000 in ?? ()
> #23 0x00000000ffffdfff in ?? ()
> #24 0x00c11c08e43e7fd5 in ?? ()
> #25 0x00000000000003e8 in ?? ()
> #26 0x00000000fffff1eb in ?? ()
> #27 0xfffffe00f1009828 in ?? ()
> #28 0xfffffe00f1009810 in ?? ()
> #29 0x00000000800e6d01 in ?? ()
> #30 0x0000000000000064 in ?? ()
> #31 0xfffff8011d936560 in ?? ()
> #32 0xfffffe00f1009828 in ?? ()
> #33 0xffffffff81771014 in mtx_delay ()
> #34 0x0000000000000000 in ?? ()
> (kgdb) 
> 
> . . .
> UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN   STAT TT     TIME COMMAND
> . . .
>   0  1110  1102   0  33  0 12024  3076 ttyin    D+    -  0:00.00 [sh]
>   0  1120  1044   0  20  0 18572  7936 select   Ds    -  0:00.00 [sshd]
> 1001  1123  1120   0  20  0 18936  8044 select   D     -  0:00.00 [sshd]
> 1001  1124  1123   0  34  0 12120  3196 wait     Ds    -  0:00.00 [sh]
>   0  1134  1124   0  22  0 12060  3148 wait     D     -  0:00.00 [su]
>   0  1135  1134   0  20  0 12312  3244 wait     D     -  0:00.00 [sh]
>   0 42072  1135   0  25  0 11464  3060 wait     D+    -  0:00.00 [sh]
>   0 42072  1135   0  25  0 11464  3060 wait     D+    -  0:00.00 [sh]
>   0 42075 42072   0  20  0 10928  2480 select   D+    -  0:00.00 [script]
>   0 42076 42075   0  52  0 10160  1396 wait     Ds+   -  0:00.00 [make]
>   0 42108 42076   0  52  0 12236  3224 wait     D+    -  0:00.00 [make]
>   0 42168 42108   0  52  0 11496  3068 wait     D+    -  0:00.00 [sh]
>   0 42169 42168   0  52  0 12608  3568 pipewr   D+    -  0:00.00 [make]
>   0 42170 42168   0  72  0 10956  2284 -        R+    -  0:00.00 [xargs]
>   0 42171 42168   0  35  0 11500  3064 piperd   D+    -  0:00.00 [sh]
>   0 46769 42170   0  72  0 10956  2284 -        ?VL+  -  0:00.00 [xargs]
> . . .
> 
> 
> 
> 
> Context details:
> 
> # uname -apKU
> FreeBSD FBSDFSSD 12.0-CURRENT FreeBSD 12.0-CURRENT  r329465M  amd64 amd64 1200058 1200058
> 
> # svnlite status /usr/src | sort
> ?       /usr/src/sys/amd64/conf/GENERIC-DBG
> ?       /usr/src/sys/amd64/conf/GENERIC-NODBG
> ?       /usr/src/sys/arm/conf/GENERIC-DBG
> ?       /usr/src/sys/arm/conf/GENERIC-NODBG
> ?       /usr/src/sys/arm64/conf/GENERIC-DBG
> ?       /usr/src/sys/arm64/conf/GENERIC-NODBG
> ?       /usr/src/sys/powerpc/conf/GENERIC64vtsc-DBG
> ?       /usr/src/sys/powerpc/conf/GENERIC64vtsc-NODBG
> ?       /usr/src/sys/powerpc/conf/GENERICvtsc-DBG
> ?       /usr/src/sys/powerpc/conf/GENERICvtsc-NODBG
> M       /usr/src/contrib/llvm/lib/Target/PowerPC/PPCFrameLowering.cpp
> M       /usr/src/contrib/llvm/tools/lld/ELF/Arch/PPC64.cpp
> M       /usr/src/crypto/openssl/crypto/armcap.c
> M       /usr/src/lib/libkvm/kvm_powerpc.c
> M       /usr/src/lib/libkvm/kvm_private.c
> M       /usr/src/stand/defs.mk
> M       /usr/src/stand/powerpc/boot1.chrp/Makefile
> M       /usr/src/stand/powerpc/kboot/Makefile
> M       /usr/src/sys/arm64/arm64/identcpu.c
> M       /usr/src/sys/conf/kmod.mk
> M       /usr/src/sys/conf/ldscript.powerpc
> M       /usr/src/sys/kern/subr_pcpu.c
> M       /usr/src/sys/powerpc/aim/mmu_oea64.c
> M       /usr/src/sys/powerpc/ofw/ofw_machdep.c
> M       /usr/src/sys/powerpc/powerpc/interrupt.c
> M       /usr/src/sys/powerpc/powerpc/mp_machdep.c
> M       /usr/src/sys/powerpc/powerpc/trap.c
> M       /usr/src/sys/vm/uma_core.c
> M       /usr/src/usr.bin/top/machine.c
> 
> The GENERIC* files include GENERIC and then set explicit
> debug status choices of mine. Most of the rest is tied to
> historical powerpc and powerpc64 investigations. I also
> have top report the maximum swap-in-use figure that it
> sees during its run.
> 
> # svnlite diff /usr/src/sys/vm/uma_core.c
> Index: /usr/src/sys/vm/uma_core.c
> ===================================================================
> --- /usr/src/sys/vm/uma_core.c	(revision 329465)
> +++ /usr/src/sys/vm/uma_core.c	(working copy)
> _at__at_ -3428,7 +3428,7 _at__at_
> void
> uma_reclaim_wakeup(void)
> {
> -
> +printf("limit %lX, total %lX, needed %d\n", uma_kmem_limit, uma_kmem_total, uma_reclaim_needed);
> 	if (atomic_fetchadd_int(&uma_reclaim_needed, 1) == 0)
> 		wakeup(uma_reclaim);
> }
> 
> 
> 
> Side note: It took the automatic fsck and 2 manual fsck
> runs to get back to a clean status (before I could get to
> multi-user).

The same thing happened to me on a VirtualBox VM running base/head 
r329515. This VM previously ran r329101 and I can easily revert to 
this version where the problem doesn't exist.

See https://ximalas.info/~trond/base-head-r329515/ for core.txt.0 and 
info.0. Crash dump of roughly 1 GiB is available on request.

I achieved an uptime of roughly 26 minutes before the panic set in. I 
was in the process of upgrading print/texinfo using portmaster when 
the crash occurred. xargs was the running process.

Unread portion of the kernel message buffer:
spin lock 0xffffffff80e76e80 (sched lock 0) held by 0xfffff800c58d4000 (tid 100746) too long
panic: spin lock held too long
cpuid = 0
time = 1518955069
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff804d281b = db_trace_self_wrapper+0x2b/frame 0xfffffe004b204610
vpanic() at 0xffffffff807106cd = vpanic+0x18d/frame 0xfffffe004b204670
panic() at 0xffffffff80710533 = panic+0x43/frame 0xfffffe004b2046d0
_mtx_lock_indefinite_check() at 0xffffffff806f0fa1 = _mtx_lock_indefinite_check+0x71/frame 0xfffffe004b2046e0
thread_lock_flags_() at 0xffffffff806f110b = thread_lock_flags_+0xdb/frame 0xfffffe004b204750
statclock_cnt() at 0xffffffff806ab0dc = statclock_cnt+0xdc/frame 0xfffffe004b204790
handleevents() at 0xffffffff80a72d73 = handleevents+0x113/frame 0xfffffe004b2047e0
timercb() at 0xffffffff80a73479 = timercb+0xa9/frame 0xfffffe004b204830
lapic_handle_timer() at 0xffffffff80ad6237 = lapic_handle_timer+0xa7/frame 0xfffffe004b204870
timerint_u() at 0xffffffff80a1f080 = timerint_u+0x96/frame 0xfffffe004b204950
thread_lock_flags_() at 0xffffffff806f10f1 = thread_lock_flags_+0xc1/frame 0xfffffe004b2049c0
fork1() at 0xffffffff806d124f = fork1+0x1b9f/frame 0xfffffe004b204a70
sys_vfork() at 0xffffffff806d17ec = sys_vfork+0x4c/frame 0xfffffe004b204ac0
amd64_syscall() at 0xffffffff80a3d6c8 = amd64_syscall+0xa48/frame 0xfffffe004b204bf0
fast_syscall_common() at 0xffffffff80a1e03d = fast_syscall_common+0x101/frame 0x7fffffffe000
Uptime: 26m5s
Dumping 1003 out of 6123 MB:..2%..12%..21%..31%..42%..52%..61%..71%..82%..91%

__curthread () at ./machine/pcpu.h:230
230	./machine/pcpu.h: No such file or directory.
(kgdb) #0  __curthread () at ./machine/pcpu.h:230
#1  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:347
#2  0xffffffff80710200 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:416
#3  0xffffffff8071072d in vpanic (fmt=<optimized out>, ap=0xfffffe004b2046b0)
    at /usr/src/sys/kern/kern_shutdown.c:812
#4  0xffffffff80710533 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:739
#5  0xffffffff806f0fa1 in _mtx_lock_indefinite_check (m=<optimized out>, 
    ldap=<optimized out>) at /usr/src/sys/kern/kern_mutex.c:1224
#6  0xffffffff806f110b in thread_lock_flags_ (td=0xfffff8006205d000, 
    opts=<optimized out>, file=<optimized out>, line=<optimized out>)
    at /usr/src/sys/kern/kern_mutex.c:913
#7  0xffffffff806ab0dc in statclock_cnt (cnt=1, usermode=<optimized out>)
    at /usr/src/sys/kern/kern_clock.c:768
#8  0xffffffff80a72d73 in handleevents (now=6721246917305, fake=0)
    at /usr/src/sys/kern/kern_clocksource.c:196
#9  0xffffffff80a73479 in timercb (et=0xffffffff80f80ad8 <lapic_et>, 
    arg=<optimized out>) at /usr/src/sys/kern/kern_clocksource.c:353
#10 0xffffffff80ad6237 in lapic_handle_timer (frame=0xfffffe004b204880)
    at /usr/src/sys/x86/x86/local_apic.c:1305
#11 0xffffffff80a1f080 in timerint_u ()
    at /usr/src/sys/amd64/amd64/apic_vector.S:132
#12 0xfffffe004b204968 in ?? ()
#13 0x0000000000001b8d in ?? ()
#14 0x0000000000000800 in ?? ()
#15 0x00000000fffff7ff in ?? ()
#16 0x0000000080000034 in ?? ()
#17 0x00000000000001f4 in ?? ()
#18 0x00000000fffff8eb in ?? ()
#19 0xfffffe004b204968 in ?? ()
#20 0xfffffe004b204950 in ?? ()
#21 0x000000008000ba2f in ?? ()
#22 0x0000000000000032 in ?? ()
#23 0xfffff800c58d4000 in ?? ()
#24 0xfffffe004b204968 in ?? ()
#25 0xffffffff80c41014 in mtx_delay ()
#26 0x0000000000000000 in ?? ()

Looking at the commit logs, I suspect r329419, r329420, r329422, 
and/or r320449 being possible candidates.

I'll try revert my source tree to r329418, and take it from there.

-- 
Trond.
Received on Sun Feb 18 2018 - 11:25:40 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:15 UTC