Konstantin Belousov wrote:
> On Fri, Jul 12, 2013 at 08:11:36PM +0200, Ian FREISLICH wrote:
> > John Baldwin wrote:
> > > On Thursday, July 11, 2013 6:54:35 am Ian FREISLICH wrote:
> > > > John Baldwin wrote:
> > > > > On Thursday, July 04, 2013 5:03:29 am Ian FREISLICH wrote:
> > > > > > Konstantin Belousov wrote:
> > > > > > >
> > > > > > > Care to provide any useful information?
> > > > > > >
> > > > > > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > > > > >
> > > > > > Well, the system doesn't deadlock; it's perfectly usable so long
> > > > > > as you don't touch the file that's wedged.  Much of the time the
> > > > > > userland process is unkillable, but sometimes it is killable.  How
> > > > > > do I get from the PID to where the FS is stuck in the kernel?
> > > > >
> > > > > Use kgdb.  'proc <pid>', then 'bt'.
> > > >
> > > > So, I set up a remote kgdb session, but I still can't figure out how
> > > > to get at the information we need.
> > > >
> > > > (kgdb) proc 5176
> > > > only supported for core file target
> > > >
> > > > In the meantime, I'll just force it to make a core dump from ddb.
> > > > However, I can't recreate the issue while the mirror (gmirror) is
> > > > rebuilding, so we'll have to wait for that to finish.
> > >
> > > Sorry, just run 'sudo kgdb' on the box itself.  You can inspect the
> > > running kernel without having to stop it.
> >
> > So, this machine's installworld *always* stalls while installing clang.
> > The install can be stopped (ctrl-c), leaving behind this process:
> >
> > root  23147  0.0  0.0  9268  1512   1  D  7:51PM  0:00.01  install -s -o root -g wheel -m 555 clang /usr/bin/clang
> >
> > This is the backtrace from kgdb.  I suspect frame 4.
> >
> > (kgdb) proc 23147
> > [Switching to thread 117 (Thread 100059)]#0  sched_switch (
> >     td=0xfffffe000c012920, newtd=0x0, flags=<value optimized out>)
> >     at /usr/src/sys/kern/sched_ule.c:1954
> > 1954            cpuid = PCPU_GET(cpuid);
> > Current language:  auto; currently minimal
> > (kgdb) bt
> > #0  sched_switch (td=0xfffffe000c012920, newtd=0x0,
> >     flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1954
> > #1  0xffffffff8047539e in mi_switch (flags=260, newtd=0x0)
> >     at /usr/src/sys/kern/kern_synch.c:487
> > #2  0xffffffff804acbea in sleepq_wait (wchan=0x0, pri=0)
> >     at /usr/src/sys/kern/subr_sleepqueue.c:620
> > #3  0xffffffff80474ee9 in _sleep (ident=<value optimized out>,
> >     lock=0xffffffff80a20300, priority=84, wmesg=0xffffffff8071129a "wdrain",
> >     sbt=<value optimized out>, pr=0, flags=<value optimized out>)
> >     at /usr/src/sys/kern/kern_synch.c:249
> > #4  0xffffffff804e6523 in waitrunningbufspace ()
> >     at /usr/src/sys/kern/vfs_bio.c:564
> > #5  0xffffffff804e6073 in bufwrite (bp=<value optimized out>)
> >     at /usr/src/sys/kern/vfs_bio.c:1226
> > #6  0xffffffff804f05ed in cluster_wbuild (vp=0xfffffe008fec4000, size=32768,
> >     start_lbn=136, len=<value optimized out>, gbflags=<value optimized out>)
> >     at /usr/src/sys/kern/vfs_cluster.c:1002
> > #7  0xffffffff804efbc3 in cluster_write (vp=0xfffffe008fec4000,
> >     bp=0xffffff80f83da6f0, filesize=4456448, seqcount=127,
> >     gbflags=<value optimized out>) at /usr/src/sys/kern/vfs_cluster.c:592
> > #8  0xffffffff805c1032 in ffs_write (ap=0xffffff8121c81850)
> >     at /usr/src/sys/ufs/ffs/ffs_vnops.c:801
> > #9  0xffffffff8067fe21 in VOP_WRITE_APV (vop=<value optimized out>,
> >     a=<value optimized out>) at vnode_if.c:999
> > #10 0xffffffff80511eca in vn_write (fp=0xfffffe006a5f7410,
> >     uio=0xffffff8121c81a90, active_cred=0x0, flags=<value optimized out>,
> >     td=0x0) at vnode_if.h:413
> > #11 0xffffffff8050eb3a in vn_io_fault (fp=0xfffffe006a5f7410,
> >     uio=0xffffff8121c81a90, active_cred=0xfffffe006a6ca000, flags=0,
> >     td=0xfffffe000c012920) at /usr/src/sys/kern/vfs_vnops.c:983
> > #12 0xffffffff804b506a in dofilewrite (td=0xfffffe000c012920, fd=5,
> >     fp=0xfffffe006a5f7410, auio=0xffffff8121c81a90,
> >     offset=<value optimized out>, flags=0) at file.h:290
> > #13 0xffffffff804b4cde in sys_write (td=0xfffffe000c012920,
> >     uap=<value optimized out>) at /usr/src/sys/kern/sys_generic.c:460
> > #14 0xffffffff8061807a in amd64_syscall (td=0xfffffe000c012920, traced=0)
> >     at subr_syscall.c:134
> > #15 0xffffffff806017ab in Xfast_syscall ()
> >     at /usr/src/sys/amd64/amd64/exception.S:387
> > #16 0x000000000044e75a in ?? ()
> > Previous frame inner to this frame (corrupt stack?)
> > (kgdb)
>
> Please apply (mostly debugging) patch below, then reproduce the issue.
> I need the backtrace of the 'main' hung process, assuming it is stuck
> in the waitrunningbufspace().  Also, from the same kgdb session, print
> runningbufreq, runningbufspace and lorunningspace.

This is the hung process after applying the patch.  I see no console
messages on recreating the issue.
(kgdb) proc 23113
[Switching to thread 102 (Thread 100146)]#0  sched_switch (
    td=0xfffffe006971e000, newtd=0x0, flags=<value optimized out>)
    at /usr/src/sys/kern/sched_ule.c:1954
1954            cpuid = PCPU_GET(cpuid);
Current language:  auto; currently minimal
(kgdb) bt
#0  sched_switch (td=0xfffffe006971e000, newtd=0x0,
    flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1954
#1  0xffffffff8047539e in mi_switch (flags=260, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:487
#2  0xffffffff804acbea in sleepq_wait (wchan=0x0, pri=0)
    at /usr/src/sys/kern/subr_sleepqueue.c:620
#3  0xffffffff80474ee9 in _sleep (ident=<value optimized out>,
    lock=0xffffffff80a20300, priority=84, wmesg=0xffffffff8071129a "wdrain",
    sbt=<value optimized out>, pr=0, flags=<value optimized out>)
    at /usr/src/sys/kern/kern_synch.c:249
#4  0xffffffff804e6527 in waitrunningbufspace ()
    at /usr/src/sys/kern/vfs_bio.c:566
#5  0xffffffff804e6073 in bufwrite (bp=<value optimized out>)
    at /usr/src/sys/kern/vfs_bio.c:1228
#6  0xffffffff804f05ed in cluster_wbuild (vp=0xfffffe007f90d750, size=32768,
    start_lbn=825, len=<value optimized out>, gbflags=<value optimized out>)
    at /usr/src/sys/kern/vfs_cluster.c:1002
#7  0xffffffff804efbc3 in cluster_write (vp=0xfffffe007f90d750,
    bp=0xffffff80f86bcd90, filesize=27033600, seqcount=127,
    gbflags=<value optimized out>) at /usr/src/sys/kern/vfs_cluster.c:592
#8  0xffffffff805c1032 in ffs_write (ap=0xffffff8121e34850)
    at /usr/src/sys/ufs/ffs/ffs_vnops.c:801
#9  0xffffffff8067fe21 in VOP_WRITE_APV (vop=<value optimized out>,
    a=<value optimized out>) at vnode_if.c:999
#10 0xffffffff80511eca in vn_write (fp=0xfffffe000ccb5870,
    uio=0xffffff8121e34a90, active_cred=0x0, flags=<value optimized out>,
    td=0x0) at vnode_if.h:413
#11 0xffffffff8050eb3a in vn_io_fault (fp=0xfffffe000ccb5870,
    uio=0xffffff8121e34a90, active_cred=0xfffffe000cf28800, flags=0,
    td=0xfffffe006971e000) at /usr/src/sys/kern/vfs_vnops.c:983
#12 0xffffffff804b506a in dofilewrite (td=0xfffffe006971e000, fd=5,
    fp=0xfffffe000ccb5870, auio=0xffffff8121e34a90,
    offset=<value optimized out>, flags=0) at file.h:290
#13 0xffffffff804b4cde in sys_write (td=0xfffffe006971e000,
    uap=<value optimized out>) at /usr/src/sys/kern/sys_generic.c:460
#14 0xffffffff8061807a in amd64_syscall (td=0xfffffe006971e000, traced=0)
    at subr_syscall.c:134
#15 0xffffffff806017ab in Xfast_syscall ()
    at /usr/src/sys/amd64/amd64/exception.S:387
#16 0x000000000044e75a in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) frame 4
#4  0xffffffff804e6527 in waitrunningbufspace ()
    at /usr/src/sys/kern/vfs_bio.c:566
566             msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
(kgdb) print runningbufreq
$1 = 1
(kgdb) print runningbufspace
$2 = 0
(kgdb) print lorunningspace
$3 = 4587520
(kgdb) print hirunningspace
$4 = 4194304

--
Ian Freislich
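[Editorial note: for context on what frame 4 is waiting for, the "wdrain"
throttle in sys/kern/vfs_bio.c pairs a sleeper in waitrunningbufspace()
with a wakeup issued from the write-completion path.  The following is a
simplified sketch of FreeBSD logic from around that time, reconstructed
for illustration only; it is not the exact (or patched) source, though
the global names match the variables printed above.]

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/proc.h>        /* PVM sleep priority */
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/buf.h>
    #include <machine/atomic.h>

    static struct mtx rbreqlock;    /* protects runningbufreq */
    static long runningbufspace;    /* bytes of write I/O in flight */
    static long hirunningspace;     /* writers sleep above this */
    static long lorunningspace;     /* sleepers woken at/below this */
    static int runningbufreq;       /* nonzero while a writer sleeps */

    /*
     * Throttle called from bufwrite(): a writer that has pushed the
     * in-flight write total above hirunningspace sleeps on "wdrain"
     * until completions drain it back down.
     */
    static void
    waitrunningbufspace(void)
    {

            mtx_lock(&rbreqlock);
            while (runningbufspace > hirunningspace) {
                    runningbufreq = 1;
                    msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
            }
            mtx_unlock(&rbreqlock);
    }

    /*
     * Completion side, reached as each write finishes.  In this sketch
     * the in-flight total is decremented outside rbreqlock; only the
     * runningbufreq check-and-wakeup runs under the mutex.
     */
    static void
    runningbufwakeup(struct buf *bp)
    {

            if (bp->b_runningbufspace == 0)
                    return;
            atomic_subtract_long(&runningbufspace, bp->b_runningbufspace);
            bp->b_runningbufspace = 0;
            mtx_lock(&rbreqlock);
            if (runningbufreq != 0 && runningbufspace <= lorunningspace) {
                    runningbufreq = 0;
                    wakeup(&runningbufreq);
            }
            mtx_unlock(&rbreqlock);
    }

Read against the values printed above, the captured state looks like a
missed wakeup rather than genuinely outstanding write I/O: the thread is
asleep on "wdrain" with runningbufreq == 1 even though runningbufspace is
already 0, so the wake condition has long been satisfied.  It is perhaps
also worth noting that lorunningspace ($3 = 4587520) is tuned above
hirunningspace ($4 = 4194304), the opposite of the usual low/high
hysteresis.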