On Fri, Jul 12, 2013 at 08:11:36PM +0200, Ian FREISLICH wrote:
> John Baldwin wrote:
> > On Thursday, July 11, 2013 6:54:35 am Ian FREISLICH wrote:
> > > John Baldwin wrote:
> > > > On Thursday, July 04, 2013 5:03:29 am Ian FREISLICH wrote:
> > > > > Konstantin Belousov wrote:
> > > > > >
> > > > > > Care to provide any useful information?
> > > > > >
> > > > > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > > > >
> > > > > Well, the system doesn't deadlock; it's perfectly usable as long
> > > > > as you don't touch the file that's wedged.  A lot of the time
> > > > > the userland process is unkillable, but often it is killable.
> > > > > How do I get from the PID to where the FS is stuck in the kernel?
> > > >
> > > > Use kgdb.  'proc <pid>', then 'bt'.
> > >
> > > So, I set up a remote kgdb session, but I still can't figure out
> > > how to get at the information we need.
> > >
> > > (kgdb) proc 5176
> > > only supported for core file target
> > >
> > > In the meantime, I'll just force it to make a core dump from ddb.
> > > However, I can't recreate the issue while the mirror (gmirror) is
> > > rebuilding, so we'll have to wait for that to finish.
> >
> > Sorry, just run 'sudo kgdb' on the box itself.  You can inspect the
> > running kernel without having to stop it.
>
> So, this machine's installworld *always* stalls installing clang.
> The install can be stopped (ctrl-c), leaving behind this process:
>
> root 23147  0.0  0.0  9268  1512   1  D  7:51PM  0:00.01 install -s -o root -g wheel -m 555 clang /usr/bin/clang
>
> This is the backtrace from kgdb.  I suspect frame 4.
>
> (kgdb) proc 23147
> [Switching to thread 117 (Thread 100059)]#0  sched_switch (
>     td=0xfffffe000c012920, newtd=0x0, flags=<value optimized out>)
>     at /usr/src/sys/kern/sched_ule.c:1954
> 1954            cpuid = PCPU_GET(cpuid);
> Current language:  auto; currently minimal
> (kgdb) bt
> #0  sched_switch (td=0xfffffe000c012920, newtd=0x0,
>     flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1954
> #1  0xffffffff8047539e in mi_switch (flags=260, newtd=0x0)
>     at /usr/src/sys/kern/kern_synch.c:487
> #2  0xffffffff804acbea in sleepq_wait (wchan=0x0, pri=0)
>     at /usr/src/sys/kern/subr_sleepqueue.c:620
> #3  0xffffffff80474ee9 in _sleep (ident=<value optimized out>,
>     lock=0xffffffff80a20300, priority=84, wmesg=0xffffffff8071129a "wdrain",
>     sbt=<value optimized out>, pr=0, flags=<value optimized out>)
>     at /usr/src/sys/kern/kern_synch.c:249
> #4  0xffffffff804e6523 in waitrunningbufspace ()
>     at /usr/src/sys/kern/vfs_bio.c:564
> #5  0xffffffff804e6073 in bufwrite (bp=<value optimized out>)
>     at /usr/src/sys/kern/vfs_bio.c:1226
> #6  0xffffffff804f05ed in cluster_wbuild (vp=0xfffffe008fec4000, size=32768,
>     start_lbn=136, len=<value optimized out>, gbflags=<value optimized out>)
>     at /usr/src/sys/kern/vfs_cluster.c:1002
> #7  0xffffffff804efbc3 in cluster_write (vp=0xfffffe008fec4000,
>     bp=0xffffff80f83da6f0, filesize=4456448, seqcount=127,
>     gbflags=<value optimized out>) at /usr/src/sys/kern/vfs_cluster.c:592
> #8  0xffffffff805c1032 in ffs_write (ap=0xffffff8121c81850)
>     at /usr/src/sys/ufs/ffs/ffs_vnops.c:801
> #9  0xffffffff8067fe21 in VOP_WRITE_APV (vop=<value optimized out>,
>     a=<value optimized out>) at vnode_if.c:999
> #10 0xffffffff80511eca in vn_write (fp=0xfffffe006a5f7410,
>     uio=0xffffff8121c81a90, active_cred=0x0, flags=<value optimized out>,
>     td=0x0) at vnode_if.h:413
> #11 0xffffffff8050eb3a in vn_io_fault (fp=0xfffffe006a5f7410,
>     uio=0xffffff8121c81a90, active_cred=0xfffffe006a6ca000, flags=0,
>     td=0xfffffe000c012920) at /usr/src/sys/kern/vfs_vnops.c:983
> #12 0xffffffff804b506a in dofilewrite (td=0xfffffe000c012920, fd=5,
>     fp=0xfffffe006a5f7410, auio=0xffffff8121c81a90,
>     offset=<value optimized out>, flags=0) at file.h:290
> #13 0xffffffff804b4cde in sys_write (td=0xfffffe000c012920,
>     uap=<value optimized out>) at /usr/src/sys/kern/sys_generic.c:460
> #14 0xffffffff8061807a in amd64_syscall (td=0xfffffe000c012920, traced=0)
>     at subr_syscall.c:134
> #15 0xffffffff806017ab in Xfast_syscall ()
>     at /usr/src/sys/amd64/amd64/exception.S:387
> #16 0x000000000044e75a in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> (kgdb)

Please apply the (mostly debugging) patch below, then reproduce the issue.
I need the backtrace of the 'main' hung process, assuming it is stuck in
waitrunningbufspace().  Also, from the same kgdb session, print
runningbufreq, runningbufspace and lorunningspace.
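Schematically, the two sides the patch touches pair up as below.  This is
a condensed paraphrase of the wait/wakeup pair in vfs_bio.c, not the
verbatim source; in particular, the wakeup-side transition test is my
simplification of the "only acquire the lock and wakeup on the transition"
comment visible in the hunk context.  The point is that if the increments
and decrements of runningbufspace ever fail to balance, the counter either
stays above hirunningspace forever, stranding sleepers on "wdrain" like
frame #4 above, or it underflows, which the new KASSERT catches.

static void
waitrunningbufspace(void)
{

	mtx_lock(&rbreqlock);
	while (runningbufspace > hirunningspace) {
		++runningbufreq;	/* the patch makes this a flag */
		msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
	}
	mtx_unlock(&rbreqlock);
}

static void
runningbufwakeup(struct buf *bp)
{
	long space, bspace;

	bspace = bp->b_runningbufspace;
	if (bspace == 0)
		return;
	space = atomic_fetchadd_long(&runningbufspace, -bspace);
	bp->b_runningbufspace = 0;
	/*
	 * Wake waiters only on the downward crossing of lorunningspace;
	 * space holds the counter value from before the subtraction.
	 * (Simplified; the real edge test differs in detail.)
	 */
	if (space < lorunningspace || space - bspace > lorunningspace)
		return;
	mtx_lock(&rbreqlock);
	if (runningbufreq > 0) {
		runningbufreq = 0;
		wakeup(&runningbufreq);
	}
	mtx_unlock(&rbreqlock);
}

Given that shape, the runningbufreq = 1 change in the patch presumably
just turns the request counter into a plain flag: a waiter only needs the
wakeup side to know that someone is sleeping, not how many.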
diff --git a/sys/kern/vfs_bio.c b/sys/kern/vfs_bio.c
index 68021e0..205e9b3 100644
--- a/sys/kern/vfs_bio.c
+++ b/sys/kern/vfs_bio.c
@@ -474,10 +474,12 @@ runningbufwakeup(struct buf *bp)
 {
 	long space, bspace;
 
-	if (bp->b_runningbufspace == 0)
-		return;
-	space = atomic_fetchadd_long(&runningbufspace, -bp->b_runningbufspace);
 	bspace = bp->b_runningbufspace;
+	if (bspace == 0)
+		return;
+	space = atomic_fetchadd_long(&runningbufspace, -bspace);
+	KASSERT(space >= bspace, ("runningbufspace underflow %ld %ld",
+	    space, bspace));
 	bp->b_runningbufspace = 0;
 	/*
 	 * Only acquire the lock and wakeup on the transition from exceeding
@@ -561,7 +563,7 @@ waitrunningbufspace(void)
 
 	mtx_lock(&rbreqlock);
 	while (runningbufspace > hirunningspace) {
-		++runningbufreq;
+		runningbufreq = 1;
 		msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
 	}
 	mtx_unlock(&rbreqlock);
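In case the KASSERT reads oddly: atomic_fetchadd_long(9) returns the value
the variable held before the addition, so immediately after subtracting
bspace the returned value must be at least bspace, or the counter has gone
negative.  A minimal userland sketch of that invariant, with C11 atomics
standing in for atomic(9) and illustrative names throughout:

#include <assert.h>
#include <stdatomic.h>

static _Atomic long runningbufspace;

static void
running_dec(long bspace)
{
	long space;

	if (bspace == 0)
		return;
	/* atomic_fetch_add() returns the old value, like atomic_fetchadd_long(9). */
	space = atomic_fetch_add(&runningbufspace, -bspace);
	assert(space >= bspace && "runningbufspace underflow");
}

int
main(void)
{
	atomic_store(&runningbufspace, 65536);
	running_dec(32768);	/* old value 65536 >= 32768: fine */
	running_dec(32768);	/* old value 32768 >= 32768: fine, counter now 0 */
	running_dec(32768);	/* simulated double completion: assertion fires */
	return (0);
}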