Re: ata timeouts under load

From: Wes Morgan <morganw_at_chemikals.org>
Date: Tue, 15 Sep 2009 23:31:28 -0500 (CDT)
On Sun, 13 Sep 2009, Kris Kennaway wrote:

> Alexander Motin wrote:
>> Kris Kennaway wrote:
>>> I am getting timeouts on 8.0b4/HEAD when I do a lot of ZFS I/O to a pool
>>> on ad4:
>>> 
>>> atapci0: <VIA 6420 SATA150 controller> port
>>> 0xc800-0xc807,0xc400-0xc403,0xc000-0xc007,0xb800-0xb803,0xb400-0xb40f,0xb000-0xb0ff
>>> irq 20 at device 15.0 on pci0
>>> ata2: <ATA channel 0> on atapci0
>>> ata3: <ATA channel 1> on atapci0
>>> ata0: <ATA channel 0> on atapci1
>>> ata1: <ATA channel 1> on atapci1
>>> 
>>> ad4: 476940MB <WDC WD5000AAKS-00TMA0 12.01C01> at ata2-master SATA150
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing
>>> request directly
>>> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing
>>> request directly
>>> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
>>> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=344052040
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>>> completing request directly
>>> 
>>> It becomes stuck in a loop displaying the above and is unable to
>>> complete further I/O operations.  I wonder if it is just batching up a
>>> lot of I/O and then timing out because it is busy, and then not
>>> recovering from this state?
>>> 
>>> Any ideas what could be wrong?
>> 
>> There are two different kinds of timeouts here:
>>  - The first, "ad4: WARNING - ...", is just a queue-wait timeout. It
>> is not the cause of the problem but a consequence of it, and I have
>> doubts that it is reasonable to do it.
>>  - The second, "TIMEOUT - WRITE_DMA48 ...", is a real command execution
>> timeout. I don't know whether this is the result of some improper error
>> recovery, or whether your drive has indeed lost required servo
>> information near LBA=344052040 and is taking too long to find it. You
>> can try to read that sector and nearby ones with dd.
>> 
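
The read test suggested above can be sketched in a few lines. This is only an illustration, not something posted in the thread: the function name read_sectors, the 512-byte sector size, and the window of 8 sectors on each side are all assumptions, and reading a raw device such as /dev/ad4 normally requires root. A roughly equivalent one-off check with dd would be something like "dd if=/dev/ad4 of=/dev/null bs=512 skip=344052032 count=16".

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def read_sectors(path, lba, before=8, after=8, sector_size=SECTOR_SIZE):
    """Read each sector in [lba - before, lba + after] individually.

    Returns a list of (sector_number, ok) pairs, where ok is True only
    if a full sector was read without an I/O error.
    """
    first = max(lba - before, 0)
    results = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for n in range(first, lba + after + 1):
            try:
                data = os.pread(fd, sector_size, n * sector_size)
                results.append((n, len(data) == sector_size))
            except OSError:
                # A failed read is recorded rather than aborting the scan.
                results.append((n, False))
    finally:
        os.close(fd)
    return results


# Hypothetical usage (device path and LBA taken from the log above):
#   for sector, ok in read_sectors("/dev/ad4", 344052040):
#       print(sector, "ok" if ok else "FAILED")
```

Reading one sector per call is slow but makes it obvious which sector, if any, fails or stalls.
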
>
> It's always that sequence (SETFEATURES timing out first, then the DMA
> command later), and the block number varies widely, as does whether it's
> a read or a write. The disk itself and the data it contains appear to be
> OK as far as I have been able to determine so far.
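
One way to check the pattern described above (queue-wait warnings first, then a command timeout at varying LBAs) is to tally the two message kinds from a saved console log. The sketch below is an assumption-laden illustration: the regular expressions are derived only from the log lines quoted in this thread, and the names QUEUE_RE, EXEC_RE, and classify are hypothetical.

```python
import re
from collections import Counter

# Patterns inferred from the messages quoted in this thread only.
QUEUE_RE = re.compile(r"^(?P<dev>\w+): WARNING - (?P<cmd>.+?) taskqueue timeout")
EXEC_RE = re.compile(
    r"^(?P<dev>\w+): TIMEOUT - (?P<cmd>\S+) retrying .*LBA=(?P<lba>\d+)"
)


def classify(lines):
    """Count queue-wait vs. command-execution timeouts.

    Returns (counts, lbas), where lbas is a list of (command, LBA)
    pairs taken from the command-execution timeout lines.
    """
    counts = Counter()
    lbas = []
    for line in lines:
        line = line.strip()
        if QUEUE_RE.match(line):
            counts["queue_wait"] += 1
            continue
        m = EXEC_RE.match(line)
        if m:
            counts["command_exec"] += 1
            lbas.append((m.group("cmd"), int(m.group("lba"))))
    return counts, lbas
```

If the collected LBAs are scattered across the disk and cover both reads and writes, that argues against a single damaged region.
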

This may not be meaningful, but I used to have a lot of very similar
problems (the messages, the loop, etc. were exactly the same) with VIA
chipsets and an AMD CPU. It seemed to be triggered by a certain drive, but
I never could figure it out completely. I moved to an Intel board and CPU
and have never seen it since. Yours looks like an older SATA1 chipset, so
perhaps it is the same problem. In my case the problem was not related to
ZFS.
Received on Wed Sep 16 2009 - 02:31:34 UTC
