Hi there,

There seems to be some racy code in kern_event.c which is causing me to run into crashes. I've attached the test program used to generate them (build it and run the "go" script). The crashes were produced in a VM with 4 cores, on 11 Alpha 3 (and originally 10.3). They come in a few varieties:

1. "userret: returning with the following locks held". This one is the easiest to hit (assuming witness is enabled).

userret: returning with the following locks held:
exclusive sleep mutex process lock (process lock) r = 0 (0xfffff80006956120) locked @ /usr/src/sys/kern/kern_event.c:2125
panic: witness_warn
cpuid = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000039d8e0
vpanic() at vpanic+0x182/frame 0xfffffe000039d960
kassert_panic() at kassert_panic+0x126/frame 0xfffffe000039d9d0
witness_warn() at witness_warn+0x3c6/frame 0xfffffe000039daa0
userret() at userret+0x9d/frame 0xfffffe000039dae0
amd64_syscall() at amd64_syscall+0x406/frame 0xfffffe000039dbf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe000039dbf0
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x800b8a0ba, rsp = 0x7fffffffea98, rbp = 0x7fffffffeae0 ---
KDB: enter: panic
[ thread pid 64855 tid 100106 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db> show all locks
Process 64855 (watch) thread 0xfffff800066c3000 (100106)
exclusive sleep mutex process lock (process lock) r = 0 (0xfffff80006956120) locked @ /usr/src/sys/kern/kern_event.c:2125
Process 64855 (watch) thread 0xfffff8000696a500 (100244)
exclusive sleep mutex pmap (pmap) r = 0 (0xfffff800068c3138) locked @ /usr/src/sys/amd64/amd64/pmap.c:4067
exclusive sx vm map (user) (vm map (user)) r = 0 (0xfffff800068f6080) locked @ /usr/src/sys/vm/vm_map.c:3315
exclusive sx vm map (user) (vm map (user)) r = 0 (0xfffff800068c3080) locked @ /usr/src/sys/vm/vm_map.c:3311
db> ps
  pid  ppid  pgrp   uid  state   wmesg  wchan  cmd
64855   690   690     0  R+ (threaded)         watch
100106                   Run     CPU 2         main
100244                   Run     CPU 1         procmaker
100245                   Run     CPU 3         reaper
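For context: in the ps output above, "main" is the kqueue watcher, "procmaker" the forker, and "reaper" the reaper. The watcher's registration boils down to something like this (a minimal sketch rather than the attached program verbatim; watch_child() is an invented name):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <err.h>

    /*
     * Sketch of the watcher thread's registration; not the attached
     * test program verbatim.  "child" is a pid just produced by the
     * forker thread.  NOTE_TRACK asks the kernel to attach knotes to
     * the child's descendants automatically, so knotes are constantly
     * being created and torn down.
     */
    static void
    watch_child(int kq, pid_t child)
    {
            struct kevent kev;

            EV_SET(&kev, child, EVFILT_PROC, EV_ADD,
                NOTE_EXIT | NOTE_EXEC | NOTE_FORK | NOTE_TRACK, 0, NULL);
            if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
                    err(1, "kevent register");
    }

The watcher then loops in kevent() collecting events; that loop is where "main" is sitting inside kqueue_scan() when the panics fire.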
2. "Sleeping thread owns a non-sleepable lock". This one first drew my attention by showing up in a real-world application at work.

Sleeping thread (tid 100101, pid 76857) owns a non-sleepable lock
KDB: stack backtrace of thread 100101:
sched_switch() at sched_switch+0x2a5/frame 0xfffffe0000257690
mi_switch() at mi_switch+0xe1/frame 0xfffffe00002576d0
sleepq_catch_signals() at sleepq_catch_signals+0x16c/frame 0xfffffe0000257730
sleepq_timedwait_sig() at sleepq_timedwait_sig+0xf/frame 0xfffffe0000257760
_sleep() at _sleep+0x234/frame 0xfffffe00002577e0
kern_kevent_fp() at kern_kevent_fp+0x38a/frame 0xfffffe00002579d0
kern_kevent() at kern_kevent+0x9f/frame 0xfffffe0000257a30
sys_kevent() at sys_kevent+0x12a/frame 0xfffffe0000257ae0
amd64_syscall() at amd64_syscall+0x2d4/frame 0xfffffe0000257bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0000257bf0
--- syscall (363, FreeBSD ELF64, sys_kevent), rip = 0x800b6afea, rsp = 0x7fffffffea88, rbp = 0x7fffffffead0 ---
panic: sleeping thread
cpuid = 3
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000225590
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0000225640
vpanic() at vpanic+0x126/frame 0xfffffe0000225680
panic() at panic+0x43/frame 0xfffffe00002256e0
propagate_priority() at propagate_priority+0x166/frame 0xfffffe0000225710
turnstile_wait() at turnstile_wait+0x282/frame 0xfffffe0000225750
__mtx_lock_sleep() at __mtx_lock_sleep+0x26b/frame 0xfffffe00002257d0
__mtx_lock_flags() at __mtx_lock_flags+0x5e/frame 0xfffffe00002257f0
proc_to_reap() at proc_to_reap+0x46/frame 0xfffffe0000225840
kern_wait6() at kern_wait6+0x202/frame 0xfffffe00002258f0
sys_wait4() at sys_wait4+0x72/frame 0xfffffe0000225ae0
amd64_syscall() at amd64_syscall+0x2d4/frame 0xfffffe0000225bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0000225bf0
--- syscall (7, FreeBSD ELF64, sys_wait4), rip = 0x800b209ba, rsp = 0x7fffdfdfcf48, rbp = 0x7fffdfdfcf80 ---
KDB: enter: panic
[ thread pid 76857 tid 100225 ]
Stopped at      kdb_enter+0x3e: movq    $0,kdb_why
db> show allchains
chain 1:
 thread 100225 (pid 76857, reaper) blocked on lock 0xfffff800413105f0 (sleep mutex) "process lock"
 thread 100101 (pid 76857, main) inhibited (3./4.)

There are a few others that I hit less frequently ("page fault while in kernel mode", "Kernel page fault with the following non-sleepable locks held"). I don't have a backtrace handy for those, but I believe they all have more or less the same cause.

The crashes occur because we acquire a knlist lock via the KN_LIST_LOCK macro, but by the time we call KN_LIST_UNLOCK, the knote's knlist reference (kn->kn_knlist) has been cleared by another thread. We are therefore unable to unlock the previously acquired lock, and we hold it until something causes a crash (such as the witness code noticing that we're returning to userland with the lock still held).
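For reference, here is roughly what the two macros look like, paraphrased from my reading of the sources (see sys/kern/kern_event.c for the exact text):

    /*
     * Paraphrased from sys/kern/kern_event.c.  Note that the LOCK and
     * UNLOCK macros each test kn->kn_knlist for themselves: if another
     * thread clears the pointer between the two, the unlock is
     * silently skipped and the lock stays held.
     */
    #define KN_LIST_LOCK(kn) do {                                       \
            if (kn->kn_knlist != NULL)                                  \
                    kn->kn_knlist->kl_lock(kn->kn_knlist->kl_lockarg);  \
    } while (0)

    #define KN_LIST_UNLOCK(kn) do {                                     \
            if (kn->kn_knlist != NULL)                                  \
                    kn->kn_knlist->kl_unlock(kn->kn_knlist->kl_lockarg);\
    } while (0)

Nothing ties the unlock-side test to whatever the lock side actually locked, which I believe is the crux of all of the crashes above.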
A walkthrough of what happens in the test program: there are 3 threads: one forks off short-lived child processes, one reaps them, and one tracks them via a kqueue (NOTE_EXIT | NOTE_EXEC | NOTE_FORK | NOTE_TRACK). I believe a crash generally looks like this:

1. The forker thread creates a short-lived child. That child dies and triggers a NOTE_EXIT event.

2. The kqueue thread is somewhere in kqueue_scan(), probably blocked at a KN_LIST_LOCK call.

3. The dying process calls into filt_proc() and notices that the KN_DETACHED flag is not set. It therefore calls knlist_remove_inevent() to take the knote out of the knlist. Importantly, this sets kn->kn_knlist to NULL, meaning the knlist lock can no longer be reached from the knote.

4. The kqueue thread, still in kqueue_scan(), acquires the lock via KN_LIST_LOCK. It does some work and then invokes the KN_LIST_UNLOCK macro. The macro finds that the knote no longer has a reference to a knlist and thus takes no action, leaving the lock in the locked state.

I believe there's also a small window where the KN_LIST_LOCK macro checks kn->kn_knlist and finds it non-NULL, but by the time it actually dereferences it, it has become NULL. That would produce the "page fault while in kernel mode" crash.

If someone familiar with this code sees an obvious fix, I'll be happy to test it. Otherwise, I'd appreciate any advice on fixing this. My first thought is that a 'struct knote' ought to have its own mutex controlling access to the flag fields and, ideally, the kn_knlist field: you would first acquire the knote's lock and then the knlist lock, ensuring that nobody could clear kn_knlist while you hold the knlist lock. The knlist lock, however, usually comes from whichever event-producing entity the knote tracks, so getting the lock ordering right between the per-knote mutex and that other lock seems potentially hard; I've sketched the two conflicting paths below. (Sometimes we call into kern_event.c functions with the knlist lock already held, acquired in code outside of kern_event.c. Consider, for example, calling KNOTE_LOCKED from kern_exit.c: the PROC_LOCK macro has already been used to acquire the process lock, which also serves as the knlist lock.)

Apropos of the knlist lock and its provenance: why is a lock belonging to the event-producing entity used to control access to the knlist and knote? Is it generally desirable to, say, hold the process lock while operating on a knlist attached to that process? It's not obvious to me that this is required, or even desirable. Perhaps a knlist should have its own lock rather than borrowing one from the event-producing entity, which might make this problem more straightforward to address.
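To make the ordering concern above concrete, here is a sketch of the two paths as they'd look with the mutex I'm imagining. To be clear, kn_lock does not exist in the tree; assume 'struct mtx kn_lock;' were added to struct knote, and that this code lives inside kern_event.c where the KN_LIST_* macros are visible:

    /* Path A: the kqueue_scan() side would take knote, then knlist. */
    static void
    scan_side(struct knote *kn)
    {
            mtx_lock(&kn->kn_lock);     /* hypothetical; pins kn->kn_knlist */
            KN_LIST_LOCK(kn);           /* order: knote -> knlist */
            /* ... deliver the event ... */
            KN_LIST_UNLOCK(kn);         /* now guaranteed to unlock what
                                           KN_LIST_LOCK actually locked */
            mtx_unlock(&kn->kn_lock);
    }

    /*
     * Path B: exit1() already holds the process lock (which doubles as
     * the knlist lock) when it posts NOTE_EXIT, and filt_proc() would
     * then need the knote lock: order knlist -> knote.
     */
    static void
    exit_side(struct proc *p)
    {
            PROC_LOCK(p);
            KNOTE_LOCKED(&p->p_klist, NOTE_EXIT);   /* -> filt_proc() ->
                                                       mtx_lock(&kn->kn_lock) */
            PROC_UNLOCK(p);
    }

Witness would flag A and B as a lock order reversal, which is why I suspect a bare per-knote mutex isn't the whole answer without also rethinking where the knlist lock comes from.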
Many thanks for any help, and please let me know if I've failed to make anything clear.

Cheers,
Eric