process stuck in stat/../cache_lookup: ktorrent, zfs

From: Andriy Gapon <avg_at_icyb.net.ua>
Date: Sun, 06 Dec 2009 13:31:12 +0200
System is recent 9-current, amd64.
I see that sometimes ktorrent gets stuck during heavy download (multiple files
in parallel, high speed).  It is completely unresponsive and not killable even
with SIGKILL.

Here's procstat output for it (many thanks to Robert again):
$ procstat -k 1882
  PID    TID COMM             TDNAME           KSTACK
 1882 100270 ktorrent         -                mi_switch sleepq_switch
sleepq_catch_signals sleepq_wait_sig _cv_wait_sig seltdwait poll syscall
Xfast_syscall
 1882 100277 ktorrent         -                mi_switch sleepq_switch
sleepq_wait __lockmgr_args vop_stdlock VOP_LOCK1_APV _vn_lock vget cache_lookup
vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_statat_vnhook kern_statat
kern_stat stat syscall
 1882 100279 ktorrent         -                mi_switch sleepq_switch
sleepq_catch_signals sleepq_timedwait_sig _cv_timedwait_sig seltdwait
kern_select select syscall Xfast_syscall
 1882 100283 ktorrent         -                mi_switch sleepq_switch
sleepq_catch_signals sleepq_wait_sig _sleep do_cv_wait __umtx_op_cv_wait
_umtx_op syscall Xfast_syscall
 1882 100284 ktorrent         -                mi_switch sleepq_switch
sleepq_catch_signals sleepq_wait_sig _sleep do_cv_wait __umtx_op_cv_wait
_umtx_op syscall Xfast_syscall

ktrace shows that threads 100270 and 100279 are still live.
I think that it is a thread doing stat that is stuck and holding other threads.

Here's info for that thread with -kk:
 1882 100277 ktorrent         -                mi_switch+0x1cf
sleepq_switch+0xdb sleepq_wait+0x45 __lockmgr_args+0x3d4 vop_stdlock+0x51
VOP_LOCK1_APV+0x82 _vn_lock+0x48 vget+0x88 cache_lookup+0x8e3
vfs_cache_lookup+0xad VOP_LOOKUP_APV+0x7c lookup+0x64a namei+0x68f
kern_statat_vnhook+0x97 kern_statat+0x15 kern_stat+0x1c stat+0x2d syscall+0x3f6

Even better, kgdb on /dev/mem:
(kgdb) thr 255
[Switching to thread 255 (Thread 100277)]#0  sched_switch
(td=0xffffff012a6c5700, newtd=0xffffff0001533380, flags=Variable "flags" is not
available.
) at /usr/src/sys/kern/sched_ule.c:1865
1865                    cpuid = PCPU_GET(cpuid);
(kgdb) bt
#0  sched_switch (td=0xffffff012a6c5700, newtd=0xffffff0001533380,
flags=Variable "flags" is not available.
) at /usr/src/sys/kern/sched_ule.c:1865
#1  0xffffffff80374baf in mi_switch (flags=260, newtd=0x0) at
/usr/src/sys/kern/kern_synch.c:449
#2  0xffffffff803a795b in sleepq_switch (wchan=Variable "wchan" is not available.
) at /usr/src/sys/kern/subr_sleepqueue.c:509
#3  0xffffffff803a8645 in sleepq_wait (wchan=0xffffff0105b457f8, pri=80) at
/usr/src/sys/kern/subr_sleepqueue.c:588
#4  0xffffffff80351184 in __lockmgr_args (lk=0xffffff0105b457f8, flags=2097408,
ilk=0xffffff0105b45820, wmesg=Variable "wmesg" is not available.
) at /usr/src/sys/kern/kern_lock.c:216
#5  0xffffffff803ea171 in vop_stdfsync (ap=0x0) at
/usr/src/sys/kern/vfs_default.c:587
#6  0xffffffff8058b582 in VOP_LOCK1_APV (vop=Variable "vop" is not available.
) at vnode_if.c:1994
#7  0xffffffff80408318 in _vn_lock (vp=0xffffff0105b45760, flags=2097408,
file=0xffffff0000000823 "", line=-2141062687) at vnode_if.h:880
#8  0xffffffff803fb798 in vget (vp=0xffffff0105b45760, flags=2097408,
td=0xffffff012a6c5700) at /usr/src/sys/kern/vfs_subr.c:2089
#9  0xffffffff803e7953 in cache_lookup (dvp=0xffffff0055877b10,
vpp=0xffffff8085155960, cnp=0xffffff8085155988) at /usr/src/sys/kern/vfs_cache.c:568
#10 0xffffffff803e7b3d in vfs_cache_lookup (ap=Variable "ap" is not available.
) at vnode_if.h:80
#11 0xffffffff8058c08c in VOP_LOOKUP_APV (vop=Variable "vop" is not available.
) at vnode_if.c:129
#12 0xffffffff803eeeaa in lookup (ndp=0xffffff8085155930) at
/usr/src/sys/kern/vfs_lookup.c:697
#13 0xffffffff803efeaf in namei (ndp=0xffffff8085155930) at
/usr/src/sys/kern/vfs_lookup.c:276
#14 0xffffffff803ffa47 in kern_statat_vnhook (td=0xffffff012a6c5700,
flag=Variable "flag" is not available.
) at /usr/src/sys/kern/vfs_syscalls.c:2348
#15 0xffffffff803ffc05 in kern_lstat (td=0x0, path=0x0, pathseg=UIO_USERSPACE,
sbp=0x0) at /usr/src/sys/kern/vfs_syscalls.c:2399
#16 0xffffffff803ffd0c in nstat (td=0x0, uap=0x0) at
/usr/src/sys/kern/vfs_syscalls.c:2448
#17 0xffffffff803ffdad in stat () at /usr/src/sys/kern/vfs_syscalls.c:2291
#18 0xffffff8085155ad0 in ?? ()
#19 0xffffffff803731de in runq_remove (rq=Variable "rq" is not available.
) at /usr/src/sys/kern/kern_switch.c:486
#20 0xffffffff805363a6 in syscall (frame=0xffffff8085155c80) at
/usr/src/sys/amd64/amd64/trap.c:994
#21 0xffffffff8051c121 in Xfast_syscall32 () at
/usr/src/sys/amd64/amd64/exception.S:385

How can I inspect what is causing this situation, presumably dead-lock?

-- 
Andriy Gapon
Received on Sun Dec 06 2009 - 10:32:02 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:39:58 UTC