I have a FreeBSD/SPARC machine:

  FreeBSD hydra.priv.oc.ietfng.org 9.0-CURRENT FreeBSD 9.0-CURRENT #11: Mon Oct 19 22:08:50 EDT 2009 root_at_hydra.priv.oc.ietfng.org:/systank/obj/systank/src/sys/NWFKERN sparc64

with an

  atapci1: <Marvell 88SX6081 SATA300 controller> port 0x300-0x3ff mem 0x600000-0x6fffff,0x800000-0xbfffff at device 1.0 on pci3

and eight SATA2 disks:

  ad0: 305245MB <Seagate ST3320620AS 3.AAJ> at ata4-master SATA300
  ad1: 305245MB <Seagate ST3320620AS 3.AAE> at ata5-master SATA300
  ad2: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA300
  ad3: 305245MB <Seagate ST3320620AS 3.AAJ> at ata7-master SATA300
  ad4: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata8-master SATA300
  ad5: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata9-master SATA300
  ad6: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata10-master SATA300
  ad7: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata11-master SATA300

The two sets of four disks are each RAIDZ'd together, and the two RAIDZs
are in one storage pool.

I've been stress-testing the disks by scrubbing and find that after a few
days of uptime, I will get

  ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=0 LBA=103200892

(it's always ad0 that fails) and all I/O directed at this storage pool
through ZFS hangs. (I have not yet tested with dd from the raw disks;
didn't think to do it, sorry.) During this period, zpool status reports
1 checksum error from ad0, though I don't know whether this occurs before,
after, or in synchrony with the ad0 READ_DMA FAILURE.

Previously, I just rebooted, but this time I thought to run
"atacontrol reinit ata4" (ata4 being the channel holding ad0). That caused
the kernel to say

  ad0: WARNING - WRITE_DMA48 requeued due to channel reset LBA=625104384
  ad0: FAILURE - already active DMA on this device
  ad0: setting up DMA failed

zpool status now indicates that the scrub is proceeding again, and that
ad0 has suffered 3 read, 1 write, and 1 checksum error. I/O directed at
the storage tank works again.
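For reference, the status=51 value in the READ_DMA failure can be decoded bit by bit. The following is a minimal Python sketch, assuming the value is printed in hexadecimal and that the bit names follow the standard ATA status register layout as the ata(4) driver labels it; the table and the helper name decode_ata_status are illustrative, not taken from the driver source.

```python
# Assumed ATA status register layout (bit mask, name as it appears in the log).
ATA_STATUS_BITS = [
    (0x80, "BUSY"),
    (0x40, "READY"),
    (0x20, "DMA_READY"),
    (0x10, "DSC"),
    (0x08, "DRQ"),
    (0x04, "CORRECTABLE"),
    (0x02, "INDEX"),
    (0x01, "ERROR"),
]


def decode_ata_status(status):
    """Return the names of the bits set in an ATA status byte, high bit first."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


# status=51 from the log, read as hexadecimal
print(decode_ata_status(0x51))
```

Under these assumptions 0x51 decodes to READY, DSC and ERROR, which matches the names the kernel printed. The accompanying error=0 means the error register itself had no bits set, so the drive flagged an error without saying which kind.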
Is my disk going bad, or is there something funnier going on here? Even if
the disk is going bad, shouldn't the controller time out the request
eventually?

Thanks much in advance.
--nwf;