Re: r244036 kernel hangs under load.

From: Rick Macklem <rmacklem_at_uoguelph.ca> Date: Thu, 13 Dec 2012 22:14:29 -0500 (EST) · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:33 UTC

Konstantin Belousov wrote:
> On Wed, Dec 12, 2012 at 10:01:39PM -0500, Rick Macklem wrote:
> > Konstantin Belousov wrote:
> > > On Tue, Dec 11, 2012 at 08:58:47PM -0500, Rick Macklem wrote:
> > > > Ok, I'll test r243598 and then r243599 and r243835, just to
> > > > see if it really is this.
> > > >
> > > > I'll email when I have done this.
> > > If you test only r243598, I am sure that you would experience
> > > corruption.
> > > The r243599 should cause the deadlocks.
> > >
> > I think you meant r243599 will result in corruptions and
> > r243835 deadlocks.
> >
> > I have run r243598 for a while without a hang. (r243599 doesn't
> > build) I haven't tried r243835 yet.
> >
> > > >
> > > > > >
> > > > > >  Also, do you use the post-r244095 kernel ?
> > > > > >
> > > > > > Before and after. The most recent tests were post-r244095.
> > > > > > (If anything the more recent kernels hang more easily.)
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Is your machine SMP ?
> > > > > >
> > > > > > Old, slow single core i386.
> > > > >
> > > > > Try this. Please note that this is mostly a debugging
> > > > > facility.
> > > > >
> > > > It seemed to help, but didn't stop the hangs completely.
> > > > r244125 without the patch would hang somewhere in a kernel
> > > > build. r244125 plus this patch ran almost 2 kernel builds
> > > > before it got hung.
> > >
> > > Can you try to look into the system state on the hang (on the
> > > kernel
> > > with the if (1 || patch applied) ? Using the ddb and recipe from
> > > the
> > > web page. Basically watch for a thread looping in the mnt_active
> > > iterator and threads owning vnode interlocks.
> >
> > Ok, there is only one process in the mnt_active iterator and its
> > trace is as follows (syncer):
> > dle+0x12d/frame 0xdfe33adc (I suspect the screen lost an 'I')
> > intr_execute_handlers(c5e3d064,dfe33b20,0,dfe33b64,c0ec2115,...) at
> > intr_execute_handlers+0x49/frame 0xdfe33afc
> > lapic_handle_intr(3c,dfe33b20) at lapic_handle_intr+0x36/frame
> > 0xdfe33b10
> > Xapic_isr1() at Xapic_isr1+0x35/frame 0xdfe33b10
> > --- interrupt, eip = 0xc0eca8db, esp = 0xdfe33b60, ebp = 0xdfe33b64
> > ---
> > spinlock_exit(c128be90,4,c10b5017,130,1710,...) at
> > spinlock_exit+0x2b/frame 0xdfe33b64
> > __mtx_unlock_spin_flags(c128be90,0,c10b80be,25d,0,...) at
> > __mtx_unlock_spin_flags+0x112/frame 0xdfe33b90
> > kern_yield(ffffffff,0,c10c75c9,127b,c8b05238,...) at
> > kern_yield+0x125/frame 0xdfe33bbc
> > __mnt_vnode_next_active(dfe33c08,c857ba80,c10c75c9,dac,5d7,...) at
> > __mnt_vnode_next_active+0xda/frame 0xdfe33be0
> > vfs_msync(c857ba80,2,2,e6b,c857ba80,...) at vfs_msync+0x175/frame
> > 0xdfe33c18
> > sync_fsync(dfe33ca8,c85cf470,80400,c10c75c9,6f4,...) at
> > sync_fsync+0x141/frame 0xdfe33c64
> > VOP_FSYNC_APV(c12008a0,dfe33ca8,c10c75c9,6f4,4e20,...) at
> > VOP_FSYNC_APV+0xb4/frame 0xdfe33c64
> > sched_sync(0,dfe33d08,c10b0e10,3db,c85395a0,...) at
> > sched_sync+0x399/frame 0xdfe33cc8
> > fork_exit(c0b79420,0,dfe33d08) at fork_exit+0xc0/frame 0xdfe33cf4
> > fork_trampoline() at fork_trampoline+0x8/frame 0xdfe33cf4
> > --- trap 0, eip = 0, esp = 0xdfe33d40, ebp = 0 ---
> >
> > This process holds:
> > exclusive lockmgr syncer (syncer) r = 0 (0xc85cf4c8) locked _at_
> > kern/vfs_subr.c:1780
> >
> > The only other process that is doing anything in the VFS subsystem
> > holds the vnode interlock. It's trace is:
> > dle+0x12d/frame 0xdfe6a850
> > intr_execute_handlers(c5f721c0,dfe6a894,0,dfe6a908,c0ec2115,...) at
> > intr_execute_handlers+0x49/frame 0xdfe6a870
> > lapic_handle_intr(31,dfe6a894) at lapic_handle_intr+0x36/frame
> > 0xdfe6a884
> > Xapic_isr1() at Xapic_isr1+0x35/frame 0xdfe6a884
> > --- interrupt, eip = 0xc0b2206a, esp = 0xdfe6a8d4, ebp = 0xdfe6a908
> > ---
> > witness_unlock(c8972a74,8,c10c75c9,965,0,...) at
> > witness_unlock+0x3a/frame 0xdfe6a908
> > __mtx_unlock_flags(c8972a84,0,c10c75c9,965,c89729fc,...) at
> > __mtx_unlock_flags+0x9f/frame 0xdfe6a938
> > vdropl(c89729fc,dfe6a974,c10c75c9,8e7,c1238020,...) at
> > vdropl+0x63/frame 0xdfe6a95c
> > vputx(dfe6aa04,c0b67acc,c89729fc,dfe6a9e4,dfe6abc4,...) at
> > vputx+0x300/frame 0xdfe6a994
> > vput(c89729fc,dfe6a9e4,dfe6abc4,31d,dfe6a9e4,...) at vput+0x10/frame
> > 0xdfe6a99c
> > lookup(dfe6ab84,c857e000,0,ce,c13c83c8,...) at lookup+0x9bc/frame
> > 0xdfe6aa04
> > namei(dfe6ab84,0,0,246,0,...) at namei+0x4fe/frame 0xdfe6aa80
> > vn_open_cred(dfe6ab84,dfe6ac24,1a4,0,c5dd4580,...) at
> > vn_open_cred+0x2c0/frame 0xdfe6ab40
> > vn_open(dfe6ab84,dfe6ac24,1a4,c85922a0,c853a2d0,...) at
> > vn_open+0x3b/frame 0xdfe6ab60
> > kern_openat(c85c55e0,ffffff9c,2882dcc0,0,8001,...) at
> > kern_openat+0x1e2/frame 0xdfe6ac0c
> > kern_open(c85c55e0,2882dcc0,0,8000,1b6,...) at kern_open+0x35/frame
> > 0xdfe6ac2c
> > sys_open(c85c55e0,dfe6accc,c02acde7,7307f55d,5e5b00,...) at
> > sys_open+0x30/frame 0xdfe6ac48
> > syscall(dfe6ad08) at syscall+0x2e5/frame 0xdfe6acfc
> > Xint0x80_syscall() at Xint0x80_syscall+0x21/frame 0xdfe6acfc
> > --- syscall (5, FreeBSD ELF32, sys_open), eip = 0x84a1667, esp =
> > 0xbfbfcffc, ebp = 0xbfbfd018 ---
> >
> > The locks this process holds are:
> > exclusive sleep mutex vnode interlock (vnode interlock) r = 0
> > (0x8972a74) locked _at_ kern/vfs_subr.c:2513
> > shared lockmgr ufs (ufs) r = 0 (0xc8bd181c) locked _at_
> > kern/vfs_subr.c:2161
> >
> > The only other lock held by any thread/process is:
> > Process 12 (intr) thread 0xc5dfc5e0 (100012)
> > exclusive sleep mutex Giant (Giant) r = 1 (0xc127b690) locked _at_
> > dev/syscons/syscons.c:724
> >
> > The only 2 locked vnodes are the ufs one and the syncer one, as
> > shown above.
> >
> > The rest of the processes/threads don't hold any locks and don't
> > seem
> > to be in vfs code.
> >
> > db> show pcpu (this machine is single core)
> > cpuid = 0
> > dynamic pcpu = 0x5e5b00
> > curthread = 0xc5dfc5e0: pid 12 "swi6: Giant taskq"
> > curpcb = 0xc592fd60
> > fpcurthread = none
> > idlethread = 0xc5dfb8d0: tid 100003 "idle: cpu0"
> > APIC ID = 0
> > currentldt = 0x50
> > spin locks held:
> > db>
> >
> > When I do a "db> cont", wait a while and then go
> > back into the debugger with <ctl><alt><esc>, everything
> > looks the same.
> >
> > I'll leave this machine like this, in case you want me
> > to do anything else in the debugger. The kernel is
> > r244125 + your little patch and it took about 5 kernel
> > build cycles to get it hung. (Without your little
> > patch, it usually hangs in the first kernel build cycle.)
> >
> > Do you happen to know how I can tell where it was executing
> > when I <ctl><alt><esc>? (show registers just has the eip in
> > kdb_enter)
> > Can I work up a the stack referenced by esp and find a saved
> > eip?
> The thread that was executing is shown as curthread from 'show pcpu'.
> Was it taskq always, or was taskq only sporadic, and syncer process
> rest of the time ?
> 
> FWIW, you could try this patch too, at least I have a theory now why
> the if (1) patch did not helped. My guess is that kern_yield()
> effectively
> returned the same high-priority syncer process back to the CPU.
> 
> diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
> index 67e078d..2325201 100644
> --- a/sys/kern/vfs_subr.c
> +++ b/sys/kern/vfs_subr.c
> _at__at_ -4710,32 +4710,54 _at__at_ __mnt_vnode_markerfree_all(struct vnode **mvp,
> struct mount *mp)
> * These are helper functions for filesystems to traverse their
> * active vnodes. See MNT_VNODE_FOREACH_ACTIVE() in sys/mount.h
> */
> -struct vnode *
> -__mnt_vnode_next_active(struct vnode **mvp, struct mount *mp)
> +static void
> +mnt_vnode_markerfree_active(struct vnode **mvp, struct mount *mp)
> +{
> +
> + KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list
> mismatch"));
> +
> + MNT_ILOCK(mp);
> + MNT_REL(mp);
> + MNT_IUNLOCK(mp);
> + free(*mvp, M_VNODE_MARKER);
> + *mvp = NULL;
> +}
> +
> +#ifdef SMP
> +#define ALWAYS_YIELD (mp_ncpus == 1)
> +#else
> +#define ALWAYS_YIELD 1
> +#endif
> +
> +static struct vnode *
> +mnt_vnode_next_active(struct vnode **mvp, struct mount *mp)
> {
> struct vnode *vp, *nvp;
> 
> - if (should_yield())
> - kern_yield(PRI_UNCHANGED);
> - mtx_lock(&vnode_free_list_mtx);
> -restart:
> + mtx_assert(&vnode_free_list_mtx, MA_OWNED);
> KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
> +restart:
> vp = TAILQ_NEXT(*mvp, v_actfreelist);
> + TAILQ_REMOVE(&mp->mnt_activevnodelist, *mvp, v_actfreelist);
> while (vp != NULL) {
> if (vp->v_type == VMARKER) {
> vp = TAILQ_NEXT(vp, v_actfreelist);
> continue;
> }
> if (!VI_TRYLOCK(vp)) {
> - if (should_yield()) {
> + if (ALWAYS_YIELD || should_yield()) {
> + TAILQ_INSERT_BEFORE(vp, *mvp, v_actfreelist);
> mtx_unlock(&vnode_free_list_mtx);
> - kern_yield(PRI_UNCHANGED);
> + kern_yield(PRI_USER);
> mtx_lock(&vnode_free_list_mtx);
> + goto restart;
> }
> - goto restart;
> + continue;
> }
> - if (vp->v_mount == mp && vp->v_type != VMARKER &&
> - (vp->v_iflag & VI_DOOMED) == 0)
> + KASSERT(vp->v_type != VMARKER, ("locked marker %p", vp));
> + KASSERT(vp->v_mount == mp || vp->v_mount == NULL,
> + ("alien vnode on the active list %p %p", vp, mp));
> + if (vp->v_mount == mp && (vp->v_iflag & VI_DOOMED) == 0)
> break;
> nvp = TAILQ_NEXT(vp, v_actfreelist);
> VI_UNLOCK(vp);
> _at__at_ -4745,22 +4767,31 _at__at_ restart:
> /* Check if we are done */
> if (vp == NULL) {
> mtx_unlock(&vnode_free_list_mtx);
> - __mnt_vnode_markerfree_active(mvp, mp);
> - mtx_assert(MNT_MTX(mp), MA_NOTOWNED);
> + mnt_vnode_markerfree_active(mvp, mp);
> return (NULL);
> }
> - TAILQ_REMOVE(&mp->mnt_activevnodelist, *mvp, v_actfreelist);
> TAILQ_INSERT_AFTER(&mp->mnt_activevnodelist, vp, *mvp, v_actfreelist);
> mtx_unlock(&vnode_free_list_mtx);
> ASSERT_VI_LOCKED(vp, "active iter");
> KASSERT((vp->v_iflag & VI_ACTIVE) != 0, ("Non-active vp %p", vp));
> return (vp);
> }
> +#undef ALWAYS_YIELD
> +
> +struct vnode *
> +__mnt_vnode_next_active(struct vnode **mvp, struct mount *mp)
> +{
> +
> + if (should_yield())
> + kern_yield(PRI_UNCHANGED);
> + mtx_lock(&vnode_free_list_mtx);
> + return (mnt_vnode_next_active(mvp, mp));
> +}
> 
> struct vnode *
> __mnt_vnode_first_active(struct vnode **mvp, struct mount *mp)
> {
> - struct vnode *vp, *nvp;
> + struct vnode *vp;
> 
> *mvp = malloc(sizeof(struct vnode), M_VNODE_MARKER, M_WAITOK |
> M_ZERO);
> MNT_ILOCK(mp);
> _at__at_ -4770,44 +4801,14 _at__at_ __mnt_vnode_first_active(struct vnode **mvp,
> struct mount *mp)
> (*mvp)->v_mount = mp;
> 
> mtx_lock(&vnode_free_list_mtx);
> -restart:
> vp = TAILQ_FIRST(&mp->mnt_activevnodelist);
> - while (vp != NULL) {
> - if (vp->v_type == VMARKER) {
> - vp = TAILQ_NEXT(vp, v_actfreelist);
> - continue;
> - }
> - if (!VI_TRYLOCK(vp)) {
> - if (should_yield()) {
> - mtx_unlock(&vnode_free_list_mtx);
> - kern_yield(PRI_UNCHANGED);
> - mtx_lock(&vnode_free_list_mtx);
> - }
> - goto restart;
> - }
> - if (vp->v_mount == mp && vp->v_type != VMARKER &&
> - (vp->v_iflag & VI_DOOMED) == 0)
> - break;
> - nvp = TAILQ_NEXT(vp, v_actfreelist);
> - VI_UNLOCK(vp);
> - vp = nvp;
> - }
> -
> - /* Check if we are done */
> if (vp == NULL) {
> mtx_unlock(&vnode_free_list_mtx);
> - MNT_ILOCK(mp);
> - MNT_REL(mp);
> - MNT_IUNLOCK(mp);
> - free(*mvp, M_VNODE_MARKER);
> - *mvp = NULL;
> + mnt_vnode_markerfree_active(mvp, mp);
> return (NULL);
> }
> - TAILQ_INSERT_AFTER(&mp->mnt_activevnodelist, vp, *mvp,
> v_actfreelist);
> - mtx_unlock(&vnode_free_list_mtx);
> - ASSERT_VI_LOCKED(vp, "active iter first");
> - KASSERT((vp->v_iflag & VI_ACTIVE) != 0, ("Non-active vp %p", vp));
> - return (vp);
> + TAILQ_INSERT_BEFORE(vp, *mvp, v_actfreelist);
> + return (mnt_vnode_next_active(mvp, mp));
> }
> 
> void
> _at__at_ -4817,14 +4818,8 _at__at_ __mnt_vnode_markerfree_active(struct vnode
> **mvp, struct mount *mp)
> if (*mvp == NULL)
> return;
> 
> - KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list
> mismatch"));
> -
> mtx_lock(&vnode_free_list_mtx);
> TAILQ_REMOVE(&mp->mnt_activevnodelist, *mvp, v_actfreelist);
> mtx_unlock(&vnode_free_list_mtx);
> - MNT_ILOCK(mp);
> - MNT_REL(mp);
> - MNT_IUNLOCK(mp);
> - free(*mvp, M_VNODE_MARKER);
> - *mvp = NULL;
> + mnt_vnode_markerfree_active(mvp, mp);
> }

Good work. This patch seems to have done the trick. I've run
quite a few kernel build cycles without a hang. I'll keep running
them, but I would have expected to see a hang by now.

Maybe Tim can test the patch as well?
(I needed to add #include <sys/smp.h> btw.)

rick