Re: analysis of snapshot-related system deadlock

From: Don Lewis <truckman_at_FreeBSD.org> Date: Wed, 28 Sep 2005 13:06:20 -0700 (PDT) · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:44 UTC

On 28 Sep, Eric Anderson wrote:
> Don Lewis wrote:
>> I've been experimenting with Peter Holm's kernel stress test suite and
>> file system snapshots.  I've been frequently seeing system deadlocks, so
>> I went looking for the cause.
>> 
>> In the latest instance, there were 12 threads waiting on "snaplk", and
>> the thread holding "snaplk" was sleeping on "wdrain".  Two of the
>> threads waiting on "snaplk" were syncer and bufdaemon, which is not a
>> good sign.
>> 
>> Ordinarily, I/O activity should eventually reduce runningbufspace below
>> lorunningspace and wake up the thread sleeping on "wdrain", but this is
>> where the problem gets interesting.  The stack trace of the thread
>> sleeping on "wdrain" is:
>> 
>> #0  0xc0653913 in sched_switch (td=0xc23fe300, newtd=0xc2275480, flags=1)
>>     at /usr/src/sys/kern/sched_4bsd.c:973
>> #1  0xc0649158 in mi_switch (flags=1, newtd=0x0)
>>     at /usr/src/sys/kern/kern_synch.c:356
>> #2  0xc066073c in sleepq_switch (wchan=0x0)
>>     at /usr/src/sys/kern/subr_sleepqueue.c:427
>> #3  0xc0660920 in sleepq_wait (wchan=0xc0984404)
>>     at /usr/src/sys/kern/subr_sleepqueue.c:539
>> #4  0xc0648dc9 in msleep (ident=0xc0984404, mtx=0xc0984420, priority=68, 
>>     wmesg=0xc0876f1c "wdrain", timo=0) at /usr/src/sys/kern/kern_synch.c:227
>> #5  0xc0687592 in bufwrite (bp=0xd648f558) at /usr/src/sys/kern/vfs_bio.c:383
>> #6  0xc0687bbd in bawrite (bp=0x0) at buf.h:401
>> #7  0xc077ca98 in ffs_copyonwrite (devvp=0xc2933770, bp=0xd6543e90)
>>     at /usr/src/sys/ufs/ffs/ffs_snapshot.c:2119
>> #8  0xc0788ec5 in ffs_geom_strategy (bo=0xc2933830, bp=0xd6543e90)
>>     at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1686
>> #9  0xc068750e in bufwrite (bp=0xd6543e90) at buf.h:415
>> #10 0xc0788e32 in ffs_bufwrite (bp=0xd6543e90)
>>     at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1663
>> #11 0xc0775a09 in ffs_update (vp=0xc5095cc0, waitfor=0) at buf.h:401
>> #12 0xc0793670 in ufs_mkdir (ap=0xeb785bb8)
>>     at /usr/src/sys/ufs/ufs/ufs_vnops.c:1556
>> #13 0xc08149e7 in VOP_MKDIR_APV (vop=0xc0910b60, a=0xeb785bb8)
>> 
>> 
>> The problem is that bufs passed through ffs_copyonwrite() get double
>> counted in runningbufspace, once for each pass through bufwrite().  This
>> includes the bufs being processed by all the threads that are waiting on
>> "snaplk".  If enough threads get backed up waiting for "snaplk", the
>> total size bufs they are processing will exceed lorunningspace and any
>> threads sleeping on wdrain will sleep forever.
>> 
>> Probably the easiest fix would be to call runningbufwakeup() from
>> ffs_copyonwrite() before grabbing "snaplk", and increase runningbufspace
>> again before returning from ffs_copyonwrite().  The bufs waiting for
>> "snaplk" aren't yet async writes currently running, to borrow from the
>> comment on waitrunningbufspace().
> 
> This sounds like a problem I've had with snapshot creation taking hours 
> on a large filesystem.  I've seen the same thing, but simply waited and 
> eventually it would complete, however during the process, many processes 
> were held in the SNAPLK state, while others were idle in wdrain.  I'm 
> currently running 5-STABLE on this box.

I don't know if this has the same cause or if there is another problem.
It would be useful to know what the thread that is creating the snapshot
is doing during this time.  What things is it waiting on?

Here's a patch vs. HEAD.  It has survived an hour running the stress
test suite while periodically creating and deleting snapshots.  I
haven't tried applying it to RELENG_5, but I plan to MFC whatever fix
gets committed to HEAD back to RELENG_6 and RELENG_5.

The bufdaemon and syncer threads could be converted to use the
TDP_NORUNNINGBUF flag to avoid special casing them in bufwrite().

Index: sys/kern/vfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.493
diff -u -r1.493 vfs_bio.c
--- sys/kern/vfs_bio.c	3 Aug 2005 05:02:08 -0000	1.493
+++ sys/kern/vfs_bio.c	28 Sep 2005 18:01:11 -0000
_at__at_ -318,7 +318,7 _at__at_
  * runningbufwakeup() - in-progress I/O accounting.
  *
  */
-static __inline void
+void
 runningbufwakeup(struct buf *bp)
 {

_at__at_ -848,7 +848,8 _at__at_
 		 * to deadlock.
 		 */
 		if (curthread->td_proc != bufdaemonproc &&
-		    curthread->td_proc != updateproc)
+		    curthread->td_proc != updateproc &&
+		    (curthread->td_pflags & TDP_NORUNNINGBUF) == 0)
 			waitrunningbufspace();
 	}

Index: sys/sys/buf.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/buf.h,v
retrieving revision 1.189
diff -u -r1.189 buf.h
--- sys/sys/buf.h	8 Sep 2005 06:30:05 -0000	1.189
+++ sys/sys/buf.h	28 Sep 2005 17:15:40 -0000
_at__at_ -477,6 +477,7 _at__at_
 extern int	cluster_pbuf_freecnt;	/* Number of pbufs for clusters */
 extern int	vnode_pbuf_freecnt;	/* Number of pbufs for vnode pager */

+void	runningbufwakeup(struct buf *);
 caddr_t	kern_vfs_bio_buffer_alloc(caddr_t v, long physmem_est);
 void	bufinit(void);
 void	bwillwrite(void);
Index: sys/sys/proc.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/proc.h,v
retrieving revision 1.434
diff -u -r1.434 proc.h
--- sys/sys/proc.h	15 Sep 2005 19:05:37 -0000	1.434
+++ sys/sys/proc.h	28 Sep 2005 17:58:09 -0000
_at__at_ -378,6 +378,7 _at__at_
 #define	TDP_SCHED4	0x00008000 /* Reserved for scheduler private use */
 #define	TDP_GEOM	0x00010000 /* Settle GEOM before finishing syscall */
 #define	TDP_SOFTDEP	0x00020000 /* Stuck processing softdep worklist */
+#define	TDP_NORUNNINGBUF 0x00040000 /* Ignore runningbufspace check */

 /*
  * Reasons that the current thread can not be run yet.
Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================
RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_snapshot.c,v
retrieving revision 1.103
diff -u -r1.103 ffs_snapshot.c
--- sys/ufs/ffs/ffs_snapshot.c	3 Apr 2005 12:03:44 -0000	1.103
+++ sys/ufs/ffs/ffs_snapshot.c	28 Sep 2005 18:07:00 -0000
_at__at_ -1998,6 +1998,12 _at__at_
 		return (0);
 	}
 	/*
+	 * Since I/O on bp isn't yet in progress and it may be blocked
+	 * for a long time waiting on snaplk, back it out of
+	 * runningbufspace, possibly waking other threads waiting for space.
+	 */
+	runningbufwakeup(bp);
+	/*
 	 * Not in the precomputed list, so check the snapshots.
 	 */
 retry:
_at__at_ -2028,7 +2034,7 _at__at_
 				goto retry;
 			}
 			snapshot_locked = 1;
-			td->td_pflags |= TDP_COWINPROGRESS;
+			td->td_pflags |= TDP_COWINPROGRESS | TDP_NORUNNINGBUF;
 			error = UFS_BALLOC(vp, lblktosize(fs, (off_t)lbn),
 			   fs->fs_bsize, KERNCRED, BA_METAONLY, &ibp);
 			td->td_pflags &= ~TDP_COWINPROGRESS;
_at__at_ -2065,7 +2071,7 _at__at_
 			goto retry;
 		}
 		snapshot_locked = 1;
-		td->td_pflags |= TDP_COWINPROGRESS;
+		td->td_pflags |= TDP_COWINPROGRESS | TDP_NORUNNINGBUF;
 		error = UFS_BALLOC(vp, lblktosize(fs, (off_t)lbn),
 		    fs->fs_bsize, KERNCRED, 0, &cbp);
 		td->td_pflags &= ~TDP_COWINPROGRESS;
_at__at_ -2120,10 +2126,16 _at__at_
 		if (dopersistence && VTOI(vp)->i_effnlink > 0)
 			(void) ffs_syncvnode(vp, MNT_WAIT);
 	}
-	if (snapshot_locked)
+	if (snapshot_locked) {
 		lockmgr(vp->v_vnlock, LK_RELEASE, NULL, td);
-	else
+		td->td_pflags &= ~TDP_NORUNNINGBUF;
+	} else
 		VI_UNLOCK(devvp);
+	/*
+	 * I/O on bp will now be started, so count it in runningbufspace.
+	 */
+	if (bp->b_runningbufspace)
+		atomic_add_int(&runningbufspace, bp->b_runningbufspace);
 	return (error);
 }