SATA disk error and hang until "atacontrol reinit" ?

From: Nathaniel W Filardo <nwf_at_cs.jhu.edu>
Date: Thu, 29 Oct 2009 17:39:29 -0400
I have a FreeBSD/SPARC machine:
  FreeBSD hydra.priv.oc.ietfng.org 9.0-CURRENT FreeBSD 9.0-CURRENT
  #11: Mon Oct 19 22:08:50 EDT 2009
  root_at_hydra.priv.oc.ietfng.org:/systank/obj/systank/src/sys/NWFKERN
  sparc64
with a
  atapci1: <Marvell 88SX6081 SATA300 controller> port 0x300-0x3ff
    mem 0x600000-0x6fffff,0x800000-0xbfffff at device 1.0 on pci3
and eight SATA2 disks:
  ad0: 305245MB <Seagate ST3320620AS 3.AAJ> at ata4-master SATA300
  ad1: 305245MB <Seagate ST3320620AS 3.AAE> at ata5-master SATA300
  ad2: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA300
  ad3: 305245MB <Seagate ST3320620AS 3.AAJ> at ata7-master SATA300
  ad4: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata8-master SATA300
  ad5: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata9-master SATA300
  ad6: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata10-master SATA300
  ad7: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata11-master SATA300

The two sets of four disks are each RAIDZ'd together, and the two RAIDZs are
in one storage pool.

I've been stress-testing the disks by scrubbing and find that after a few
days of uptime, I will get
  ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=0 LBA=103200892
(It's always ad0 that fails) and all I/O directed at this storage pool through
ZFS hangs. (I have not yet tested with dd from the raw disks; didn't think
to do it, sorry.)  During this period, zpool status reports 1 checksum error
from ad0, though I don't know whether this occurs before, after, or at the
same time as the ad0 READ_DMA FAILURE.
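
For reference, the status value in the READ_DMA message above can be decoded
by hand. The sketch below assumes the printed "status=51" is hexadecimal and
that the bits follow the standard ATA status register layout; the bit names
are chosen to match the ones the kernel printed (READY, DSC, ERROR), and the
names for the other bits, as well as the helper decode_status, are
illustrative assumptions rather than anything taken from the driver source.

```python
# Standard ATA status register bits, most significant first.
# Names for bits not seen in the log line are assumptions for illustration.
ATA_STATUS_BITS = [
    (0x80, "BUSY"),
    (0x40, "READY"),
    (0x20, "DMA_READY"),
    (0x10, "DSC"),
    (0x08, "DRQ"),
    (0x04, "CORRECTABLE"),
    (0x02, "INDEX"),
    (0x01, "ERROR"),
]


def decode_status(status):
    """Return the names of all bits set in an ATA status byte."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


if __name__ == "__main__":
    # 0x51 = 0x40 | 0x10 | 0x01
    print(decode_status(0x51))  # ['READY', 'DSC', 'ERROR']
```

Under those assumptions the drive reports itself ready with the error bit
set, while error=0 gives no further detail about the cause.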

Previously, I just rebooted, but this time I thought to run "atacontrol
reinit ata4" (which is the channel holding ad0).  That caused the kernel to
say
  ad0: WARNING - WRITE_DMA48 requeued due to channel reset LBA=625104384
  ad0: FAILURE - already active DMA on this device
  ad0: setting up DMA failed
zpool status now indicates that the scrub is proceeding again, and that ad0
has suffered 3 read, 1 write, and 1 checksum error.  I/O directed at the
storage tank works again.

Is my disk going bad, or is something stranger going on here?  Even if the
disk is going bad, shouldn't the controller time out the request eventually?

Thanks much in advance.
--nwf;

Received on Thu Oct 29 2009 - 20:51:00 UTC
