On Sat, Jan 30, 2010 at 08:51:27PM +0200, Alexander Motin wrote:
> Pawel Jakub Dawidek wrote:
> > On Sat, Jan 30, 2010 at 12:27:49PM +0100, Pawel Jakub Dawidek wrote:
> >> On Sat, Jan 30, 2010 at 12:58:26AM +0200, Alexander Motin wrote:
> >>> Experimenting with SATA hot-plug I've found a quite repeatable deadlock
> >>> case. The problem is observed when several SATA devices, opened via
> >>> devfs, disappear at exactly the same time -- in my case, when unplugging
> >>> a SATA Port Multiplier with several disks behind it. All I have to do is
> >>> run several `dd if=/dev/adaX of=/dev/null bs=1m &` commands and unplug
> >>> the multiplier. That causes the expected I/O errors and device
> >>> destruction, but with high probability several dd processes get stuck
> >>> in the kernel.
> >> [...]
> >>
> >> I observed the same thing yesterday while stress-testing HAST:
> >>
> >>  3659  2504  3659    0  DE+  GEOM top  0x8079a348  dd
> >>  3658  2102  2102    0  DE+  GEOM top  0x8079a348  hastd
> >>     2     0     0    0  DL   devdrn    0x85b1bc68  [g_event]
> >>
> >> Both dd(1) and hastd(8) wait for the GEOM topology lock in the exit path,
> >> which is already held by the g_event thread.
> >
> > Maybe I'll add how I understand what's going on:
> >
> > GEOM calls destroy_dev() while holding the topology lock.
> >
> > destroy_dev() wants to destroy the device, but can't, because there are
> > threads that still have it open.
> >
> > The threads can't close it, because to close it they need the topology
> > lock.
> >
> > The deadlock is quite obvious, IMHO.
>
> You are right, but since it doesn't happen every time, I was interested
> in why. After a closer look I found two different scenarios.
>
> In the first case the application receives an I/O error and closes the
> device. On device close CAM calls disk_destroy(), which schedules device
> destruction. By the time destroy_dev() is called the device is already
> free and there is no problem, as these events are always asynchronous.
>
> In the second case the application also receives an I/O error, but before
> it is able to react, GEOM starts handling disk_gone(), called by CAM. As
> a result, destroy_dev() is called with the device still open, and it can
> never be closed because the topology lock is held.
>
> I've played a bit with destroy_dev_sched(), but the locking indeed does
> not look easy. Is there some known good practice? destroy_dev_sched_cb()
> looks a bit more promising.

What do you mean by the locking not being easy?
destroy_dev_sched(dev) == destroy_dev_sched_cb(dev, NULL, NULL).
There is even a man page describing the interface.

The main issue with destroy_dev_sched() is the window between the moment
when the device is scheduled for destruction, and is thus kept in a
half-demolished state, and the actual removal of the devfs node. My
exemplary case was snp(4) before the tty layer got rewritten, see
r. 1.107 of sys/dev/snp/snp.c.

None of the calls to destroy_dev_sched() that I placed in src/ are kept
around, which is good, because the corresponding subsystems got serious
rewrites.
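[Editor's note: to make the difference concrete, below is a minimal sketch of
how a driver's "device gone" path can use destroy_dev_sched_cb() instead of
destroy_dev(). It is not the actual CAM/GEOM code; the xxx_* names and the
softc layout are hypothetical. The point is that destroy_dev_sched_cb() only
schedules the destruction and returns, so it is safe to call with subsystem
locks held, and the callback runs once the devfs node is actually gone.]

    /*
     * Hypothetical sketch, not the real CAM/GEOM code.  destroy_dev()
     * sleeps until every opener has closed the node, so calling it with
     * a lock that the close path also needs can deadlock (the GEOM
     * topology lock in the scenario above).  destroy_dev_sched_cb()
     * merely queues the destruction and fires a callback afterwards.
     */
    #include <sys/param.h>
    #include <sys/conf.h>
    #include <sys/malloc.h>

    struct xxx_softc {
            struct cdev     *xxx_dev;
            /* ... the rest of the driver state ... */
    };

    /* Runs from a safe context after the devfs node has been removed. */
    static void
    xxx_dev_destroyed(void *arg)
    {
            struct xxx_softc *sc = arg;

            free(sc, M_DEVBUF);
    }

    /*
     * Called when the underlying hardware disappears, possibly with a
     * lock held that the open()/close() paths also need.  destroy_dev()
     * here could deadlock; destroy_dev_sched_cb() returns immediately
     * and the real teardown happens later.
     */
    static void
    xxx_gone(struct xxx_softc *sc)
    {
            destroy_dev_sched_cb(sc->xxx_dev, xxx_dev_destroyed, sc);
    }

[destroy_dev_sched(dev) behaves the same way, just without the callback; the
caller then cannot rely on the node being gone when the function returns,
which is the half-demolished window mentioned above.]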