deadlock in ata_queue_request()

From: Bruce Evans <bde_at_zeta.org.au>
Date: Thu, 25 Dec 2003 15:38:16 +1100 (EST)
ata_queue_request() sleeps in an interrupt handler here:

% void
% ata_queue_request(struct ata_request *request)
% {
%     /* mark request as virgin (it might be a reused one) */
%     request->result = request->status = request->error = 0;
%     request->flags &= ~ATA_R_DONE;
%
%     /* put request on the locked queue at the specified location */
%     mtx_lock(&request->device->channel->queue_mtx);
%     if (request->flags & ATA_R_AT_HEAD)
% 	TAILQ_INSERT_HEAD(&request->device->channel->ata_queue, request, chain);
%     else
% 	TAILQ_INSERT_TAIL(&request->device->channel->ata_queue, request, chain);
%     mtx_unlock(&request->device->channel->queue_mtx);
%
%     /* should we skip start ? */
%     if (!(request->flags & ATA_R_SKIPSTART))
% 	ata_start(request->device->channel);
%
%     /* if this was a requeue op callback/sleep already setup */
%     if (request->flags & ATA_R_REQUEUE)
% 	return;
%
%     /* if this is not a callback and we havn't seen DONE yet -> sleep */
%     if (!request->callback) {
% 	while (!(request->flags & ATA_R_DONE))
% 	    tsleep(request, PRIBIO, "atareq", hz/10);
  	    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
%     }
% }

when it is called from an interrupt handler.  It is called from an interrupt
handler as part of timeout processing:

...
msleep(...)
ata_queue_request(...)
ata_via_family_setmode(...)
ata_identify_devices(...)
ata_reinit(...)
ata_timeout(...)
softclock(...)
ithread_loop(...)
...

The timeout was called here shortly after ad2 hung:

Dec 25 12:28:27 besplex kernel: ad2: TIMEOUT - READ_DMA retrying (2 retries left)
Dec 25 12:28:27 besplex kernel: ata1: resetting devices ..
Dec 25 12:28:27 besplex kernel: ad2: FAILURE - already active DMA on this device
Dec 25 12:28:27 besplex kernel: ad2: setting up DMA failed

ATA_R_DONE was never set and wakeup_request() was never called either, so
softclock() was deadlocked and tsleep() never returned.

The system ran surprisingly well with softclock() deadlocked.  ad0 worked
and everything that didn't use timeouts worked.  Examples of things that
didn't work because they use timeouts:
- syscons screen updates.
- statistics in top and systat.
- sleep 1 in shells.
- mbmon (shows the status, then never repeats).

I tried the following to recover:
- call wakeup(request) using ddb.  This worked, but ATA_R_DONE was never
  set so ata_queue_request() just looped.
- also ignore the ATA_R_DONE check using ddb.  This un-deadlocked
  softclock(), but ata1 remained wedged.
- then call "atacontrol reinit 1".  This partly worked:

Dec 25 14:31:12 besplex kernel: ata1: resetting devices ..
Dec 25 14:31:44 besplex kernel: ad2: WARNING - removed from configuration
Dec 25 14:31:44 besplex kernel: done

but the ata driver caused a null pointer panic an instant later.

- ad2 didn't come back after a hard reset.
- ad2 came back after a power cycle.

This was on an undermydesktop.  Problems resuming on laptops may be similar.
The hardware may really be wedged.  Then the software shouldn't make things
worse by sleeping or spinning in the timeout handler.

Bruce
Received on Wed Dec 24 2003 - 19:38:24 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:37:35 UTC