Re: analysis of snapshot-related system deadlock

From: Eric Anderson <anderson_at_centtech.com>
Date: Sat, 01 Oct 2005 19:21:51 -0500
Don Lewis wrote:
> On 28 Sep, Eric Anderson wrote:
> 
>>Don Lewis wrote:
>>
>>>I've been experimenting with Peter Holm's kernel stress test suite and
>>>file system snapshots.  I've been frequently seeing system deadlocks, so
>>>I went looking for the cause.
>>>
>>>In the latest instance, there were 12 threads waiting on "snaplk", and
>>>the thread holding "snaplk" was sleeping on "wdrain".  Two of the
>>>threads waiting on "snaplk" were syncer and bufdaemon, which is not a
>>>good sign.
>>>
>>>Ordinarily, I/O activity should eventually reduce runningbufspace below
>>>lorunningspace and wake up the thread sleeping on "wdrain", but this is
>>>where the problem gets interesting.  The stack trace of the thread
>>>sleeping on "wdrain" is:
>>>
>>>#0  0xc0653913 in sched_switch (td=0xc23fe300, newtd=0xc2275480, flags=1)
>>>    at /usr/src/sys/kern/sched_4bsd.c:973
>>>#1  0xc0649158 in mi_switch (flags=1, newtd=0x0)
>>>    at /usr/src/sys/kern/kern_synch.c:356
>>>#2  0xc066073c in sleepq_switch (wchan=0x0)
>>>    at /usr/src/sys/kern/subr_sleepqueue.c:427
>>>#3  0xc0660920 in sleepq_wait (wchan=0xc0984404)
>>>    at /usr/src/sys/kern/subr_sleepqueue.c:539
>>>#4  0xc0648dc9 in msleep (ident=0xc0984404, mtx=0xc0984420, priority=68, 
>>>    wmesg=0xc0876f1c "wdrain", timo=0) at /usr/src/sys/kern/kern_synch.c:227
>>>#5  0xc0687592 in bufwrite (bp=0xd648f558) at /usr/src/sys/kern/vfs_bio.c:383
>>>#6  0xc0687bbd in bawrite (bp=0x0) at buf.h:401
>>>#7  0xc077ca98 in ffs_copyonwrite (devvp=0xc2933770, bp=0xd6543e90)
>>>    at /usr/src/sys/ufs/ffs/ffs_snapshot.c:2119
>>>#8  0xc0788ec5 in ffs_geom_strategy (bo=0xc2933830, bp=0xd6543e90)
>>>    at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1686
>>>#9  0xc068750e in bufwrite (bp=0xd6543e90) at buf.h:415
>>>#10 0xc0788e32 in ffs_bufwrite (bp=0xd6543e90)
>>>    at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1663
>>>#11 0xc0775a09 in ffs_update (vp=0xc5095cc0, waitfor=0) at buf.h:401
>>>#12 0xc0793670 in ufs_mkdir (ap=0xeb785bb8)
>>>    at /usr/src/sys/ufs/ufs/ufs_vnops.c:1556
>>>#13 0xc08149e7 in VOP_MKDIR_APV (vop=0xc0910b60, a=0xeb785bb8)
>>>
>>>
>>>The problem is that bufs passed through ffs_copyonwrite() get double
>>>counted in runningbufspace, once for each pass through bufwrite().  This
>>>includes the bufs being processed by all the threads that are waiting on
>>>"snaplk".  If enough threads get backed up waiting for "snaplk", the
>>>total size bufs they are processing will exceed lorunningspace and any
>>>threads sleeping on wdrain will sleep forever.
>>>
>>>Probably the easiest fix would be to call runningbufwakeup() from
>>>ffs_copyonwrite() before grabbing "snaplk", and increase runningbufspace
>>>again before returning from ffs_copyonwrite().  The bufs waiting for
>>>"snaplk" aren't yet async writes currently running, to borrow from the
>>>comment on waitrunningbufspace().
>>
>>This sounds like a problem I've had with snapshot creation taking hours 
>>on a large filesystem.  I've seen the same thing, but simply waited and 
>>eventually it would complete, however during the process, many processes 
>>were held in the SNAPLK state, while others were idle in wdrain.  I'm 
>>currently running 5-STABLE on this box.
> 
> 
> I don't know if this has the same cause or if there is another problem.
> It would be useful to know what the thread that is creating the snapshot
> is doing during this time.  What things is it waiting on?

I've seen various states, but here's what I currently see when doing the 
mksnap_ffs command, and while that is running, doing something in 
another window that accesses the partition in some way:

mksnap_ffs toggles between biord and biowr, mostly in biord state:
10410 root        1  -8    0  1308K   736K biord    0:00  0.00% mksnap_ffs

Here's what an 'ls -al /mnt/' does while mksnap is running on /mnt 
mounted partition.  It will block until the snapshot finishes (or nearly 
finishes):
10425 anderson    1  -4    0  1632K   964K ufs      0:00  0.00% ls

Same with doing a 'cd /mnt', not surprisingly:
10431 anderson    1  -4    0  2340K  1808K ufs      0:00  0.00% bash

If you need a ktrace or something else more detailed, let me know what 
you want.

Here's how I reproduce this:

dd if=/dev/zero bs=512 count=1 oseek=4000m of=/tmp/test-snapshot
mdconfig -a -t vnode -f /tmp/test-snapshot
newfs /dev/md0
mount /dev/md0 /mnt
time mksnap_ffs /mnt /mnt/SNAP000

Now, while the mksnap_ffs is running, do anything to the /mnt area, and 
it will hang until the snapshot is done, which is a long time.  From 
what I've read, McKusick describes how it should only hold processes 
from writing during a few stages, which should be very fast anyhow, yet 
it blocked for about 2 hours while the mksnap was running:

Window A:
[ 00:33:45 root_at_neutrino ~ ]# time mksnap_ffs /mnt /mnt/SNAP

real    113m48.075s
user    0m0.000s
sys     0m14.256s
[ 02:27:50 root_at_neutrino ~ ]#

Window B:
[ 00:35:15 44 ~/ ]$ ls -al /mnt/
total 1203062
drwxr-xr-x   3 root  wheel               512 Oct  1 00:34 .
drwxr-xr-x  38 root  wheel              1024 Sep 28 19:13 ..
drwxrwxr-x   2 root  operator            512 Oct  1 00:26 .snap
-r--------   1 root  wheel     2147483648000 Oct  1 01:59 SNAP
[ 01:59:20 45 ~/ ]$

I'm currently running -CURRENT from right before you committed your 
patches, so I'm going to update and try after your patches and see how 
things act, and report back.

Thanks for looking at this.

Eric



-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------
Received on Sat Oct 01 2005 - 22:22:00 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:44 UTC