useful workaround and analysis of vnode-backed md deadlock

From: Peter Edwards <pmedwards_at_eircom.net> Date: Wed, 10 Sep 2003 19:20:02 +0100

There have been a few reports of deadlocks in md on the lists recently,
and I walked into one while trying to generate flash images for my shiny
new Soekris box. In particular, a previous mail mentioned something
getting stuck in "wdrain" (Message-ID <20030806104332.GA42110_at_sunbay.com>
from ru_at_freebsd.org).

For the impatient, a way I found around the problem was to mount
the md-backed filesystems with the "sync" option.

I analysed the deadlock a little, and here's a synopsis, in case
it's of use to anyone.

I tracked this down as well as I could, and it appears to be an
interaction between three processes. This may not be (and most likely
isn't) the only md deadlock, but as long as I otherwise leave the
backing file alone, I don't experience any problems once I mount the
filesystem sync. And, because the underlying filesystem is async,
access to the md filesystem isn't painfully slower than normal.

1: One thread is operating on the filesystem.  In general, this
	thread is creating dirty buffers for later processing by
	the bufdaemon, and also making direct write requests.  This
	doesn't actually participate in the deadlock, but does set
	the stage for it.

2: The "md" thread, processing requests from (1), attempts to lock
	the vnode for the underlying md device, in order to fulfill
	a queued write request on the md device.

3: Meanwhile, the bufdaemon has kicked in, and is flushing dirty
	buffers. Some of these are for the files on the md filesystem,
	some are for the vnode backing the md device itself (actually,
	I assume that the flushing of the former causes a sudden
	surge in the latter, as the writes to the md filesystem are
	converted to writes to the backing vnode).

The bufdaemon has locked the md vnode in order to write bufs to it.
However, it needs to wait for "runningbufspace", which is designed
to limit the number of in-flight async buffer writes.
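
To make the throttle concrete, here is a toy model of it in Python.
It is only a sketch, assuming a simple high/low watermark scheme; the
class and method names are made up for illustration and are not the
kernel's, and the byte counts are arbitrary.

```python
class RunningBufSpace:
    """Toy model of a high/low watermark throttle on in-flight write bytes.

    Hypothetical names throughout; this is not the kernel's code.
    """

    def __init__(self, lo, hi):
        self.lo = lo            # writers resume at or below this many bytes
        self.hi = hi            # writers are throttled above this many bytes
        self.in_flight = 0      # bytes of async writes issued but not completed
        self.throttled = False  # True while writers would have to sleep

    def start_write(self, nbytes):
        """Account for a newly issued write; return True if the writer may go on."""
        self.in_flight += nbytes
        if self.in_flight > self.hi:
            self.throttled = True
        return not self.throttled

    def complete_write(self, nbytes):
        """Account for a completed write; return True if writers may go on."""
        self.in_flight -= nbytes
        if self.throttled and self.in_flight <= self.lo:
            self.throttled = False
        return not self.throttled


space = RunningBufSpace(lo=512, hi=1024)
print(space.start_write(600))     # True: 600 bytes in flight, under the high mark
print(space.start_write(600))     # False: 1200 bytes in flight, over the high mark
print(space.complete_write(600))  # False: 600 bytes left, still above the low mark
print(space.complete_write(600))  # True: nothing in flight, writers resume
```

The point to notice is that a throttled writer only resumes when
writes complete. If nothing completes, it stays blocked.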

Once the running buffer space exceeds a high threshold, the thread
scheduling the write is blocked, to be awakened when completed async
writes bring it back under the low threshold. However, a large chunk
of the running buf space is sitting queued for the md thread to
process. The md thread can't continue without the vnode lock, so the
running buffer space will not fall, and the bufdaemon cannot continue
without running buffer space, so it will never release the vnode lock.
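
The circular wait can be written down as a small wait-for graph. The
Python below is just an illustration of the reasoning above, not
anything taken from the kernel: the node names are my own labels, and
find_cycle is a hypothetical helper that follows "blocked on" edges
until it either runs out or comes back to where it has been.

```python
def find_cycle(waits_for):
    """Return the nodes forming a cycle in a wait-for graph, or None.

    waits_for maps each party to the single party it is blocked on.
    """
    for start in waits_for:
        seen = []
        node = start
        # Follow the chain of "blocked on" edges from this starting point.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            # We came back to a node already on the chain: that is the cycle.
            return seen[seen.index(node):]
    return None


waits_for = {
    "md thread": "vnode lock",        # needs the lock to service queued writes
    "vnode lock": "bufdaemon",        # the lock is held by the bufdaemon
    "bufdaemon": "runningbufspace",   # sleeping until running buf space drains
    "runningbufspace": "md thread",   # drains only when the md thread completes writes
}
print(find_cycle(waits_for))

# Remove any one edge and the cycle disappears, e.g. if the bufdaemon
# were not waiting on running buffer space:
no_wait = dict(waits_for)
del no_wait["bufdaemon"]
print(find_cycle(no_wait))
```

The thread from (1) does not appear in the graph, which matches the
observation that it sets the stage but is not itself part of the
deadlock.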
-- 
Peter Edwards.