Re: Deadlock between GEOM and devfs device destroy and process exit.

From: Alexander Motin <mav_at_FreeBSD.org>
Date: Sat, 30 Jan 2010 20:51:27 +0200

Pawel Jakub Dawidek wrote:
> On Sat, Jan 30, 2010 at 12:27:49PM +0100, Pawel Jakub Dawidek wrote:
>> On Sat, Jan 30, 2010 at 12:58:26AM +0200, Alexander Motin wrote:
>>> Experimenting with SATA hot-plug I've found quite repeatable deadlock
>>> case. Problem observed when several SATA devices, opened via devfs,
>>> disappear at exactly same time. In my case, at time of unplugging SATA
>>> Port Multiplier with several disks beyond it. All I have to do is to run
>>> several `dd if=/dev/adaX of=/dev/null bs=1m &` commands and unplug
>>> multiplier. That causes predictable I/O errors and devices destruction.
>>> But with high probability several dd processes getting stuck in kernel.
>> [...]
>>
>> I observed the same thing yesterday while stress-testing HAST:
>>
>>  3659  2504  3659     0  DE+     GEOM top 0x8079a348 dd
>>  3658  2102  2102     0  DE+     GEOM top 0x8079a348 hastd
>>     2     0     0     0  DL      devdrn   0x85b1bc68 [g_event]
>>
>> Both dd(1) and hastd(8) wait for the GEOM topology lock in the exit path,
>> which is already held by the g_event thread.
> 
> Maybe I'll add how I understand what's going on:
> 
> GEOM calls destroy_dev() while holding the topology lock.
> 
> Destroy_dev() wants to destroy device, but can't because there are
> threads that still have it open.
> 
> The threads can't close it, because to close it they need the topology
> lock.
> 
> The deadlock is quite obvious, IMHO.

You are right, but since it does not happen every time, I was curious
why. After a closer look I found two different scenarios.

In the first case the application receives an I/O error and closes the
device. On device close CAM calls disk_destroy(), which schedules device
destruction. By the time destroy_dev() is called, the device is already
free and there is no problem, as these events are always asynchronous.

In the second case the application also receives an I/O error, but
before it is able to react, GEOM starts handling disk_gone(), called by
CAM. As a result, destroy_dev() is called with the device still open,
and it can never be closed because the topology lock is held.
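
To make the difference between the two cases concrete, here is a small
userland model in Python. It is only a sketch, not kernel code: the
function names find_cycle() and wait_graph() are made up for this
illustration, and the node names "g_event" and "dd" just mirror the ps
output quoted above. It builds a who-waits-for-whom graph at the moment
destroy_dev() is called and looks for a cycle.

```python
# Userland model only: names mirror the thread's terms, not kernel APIs.

def find_cycle(waits_for):
    """Return the nodes forming a cycle in a wait-for graph, or None."""
    for start in waits_for:
        path = [start]
        node = waits_for.get(start)
        # Follow the single outgoing edge until it ends or repeats.
        while node is not None and node not in path:
            path.append(node)
            node = waits_for.get(node)
        if node is not None:
            return path[path.index(node):]
    return None


def wait_graph(device_open_at_destroy):
    """Build who-waits-for-whom when g_event calls destroy_dev()."""
    waits_for = {}
    if device_open_at_destroy:
        # g_event holds the topology lock and sleeps until dd drops its
        # reference to the device (the "devdrn" wait in the ps output).
        waits_for["g_event"] = "dd"
        # dd needs the topology lock to finish its close, but g_event
        # still holds it.
        waits_for["dd"] = "g_event"
    return waits_for


first = find_cycle(wait_graph(device_open_at_destroy=False))
second = find_cycle(wait_graph(device_open_at_destroy=True))
print("first scenario:", first)
print("second scenario:", second)
```

In this model the first scenario has no edges at all, because the
device was closed before destruction started; the second has the
two-node cycle described above.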

I've played a bit with destroy_dev_sched(), but the locking indeed does
not look easy. Is there some known good practice? destroy_dev_sched_cb()
looks a bit more promising.
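
For illustration, the general idea behind a deferred destroy with a
completion callback can be modelled in userland Python. This is a rough
sketch under my own assumptions, not the kernel implementation:
run_model(), sched_destroy(), and the "ada0" name are hypothetical, a
threading.Lock stands in for the topology lock, and a queue stands in
for whatever context performs the deferred destroy. The point is only
that the wait for the last close happens without the lock held.

```python
import queue
import threading


def run_model(timeout=5.0):
    """Run the three-thread model and return the recorded event order."""
    topology = threading.Lock()   # stands in for the GEOM topology lock
    closed = threading.Event()    # set once the last opener has closed
    pending = queue.Queue()       # stands in for the deferred-destroy queue
    events = []                   # list.append is atomic in CPython

    def sched_destroy(name, callback):
        # Hypothetical stand-in for a deferred destroy: record, queue,
        # and return at once instead of sleeping until the close.
        events.append("scheduled")
        pending.put((name, callback))

    def g_event():
        with topology:
            sched_destroy("ada0", lambda: events.append("callback"))
        # The topology lock is dropped here, before anyone waits.

    def dd_exit():
        with topology:            # the close path needs the topology lock
            events.append("closed")
        closed.set()

    def destroyer():
        name, callback = pending.get()
        closed.wait()             # waits WITHOUT holding the topology lock
        events.append("destroyed " + name)
        callback()                # completion callback runs last

    workers = [threading.Thread(target=f, daemon=True)
               for f in (destroyer, g_event, dd_exit)]
    for t in workers:
        t.start()
    for t in workers:
        t.join(timeout)
    return events


events = run_model()
print(events)
```

The model leaves out the hard part: deciding which GEOM state must stay
valid between scheduling and the callback, which is exactly where the
locking gets difficult.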

-- 
Alexander Motin
Received on Sat Jan 30 2010 - 17:51:33 UTC
