Re: ata timeouts under load

From: Kris Kennaway <kris_at_FreeBSD.org>
Date: Sun, 13 Sep 2009 22:02:10 +0100
Alexander Motin wrote:
> Kris Kennaway wrote:
>> I am getting timeouts on 8.0b4/HEAD when I do a lot of ZFS I/O to a pool
>> on ad4:
>>
>> atapci0: <VIA 6420 SATA150 controller> port
>> 0xc800-0xc807,0xc400-0xc403,0xc000-0xc007,0xb800-0xb803,0xb400-0xb40f,0xb000-0xb0ff
>> irq 20 at device 15.0 on pci0
>> ata2: <ATA channel 0> on atapci0
>> ata3: <ATA channel 1> on atapci0
>> ata0: <ATA channel 0> on atapci1
>> ata1: <ATA channel 1> on atapci1
>>
>> ad4: 476940MB <WDC WD5000AAKS-00TMA0 12.01C01> at ata2-master SATA150
>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>> completing request directly
>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>> completing request directly
>> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing
>> request directly
>> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing
>> request directly
>> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
>> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=344052040
>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>> completing request directly
>> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout -
>> completing request directly
>>
>> It becomes stuck in a loop displaying the above and is unable to
>> complete further I/O operations.  I wonder if it is just batching up a
>> lot of I/O and then timing out because it is busy, and then not
>> recovering from this state?
>>
>> Any ideas what could be wrong?
> 
> There are two different kinds of timeouts here:
>  - The first, "ad4: WARNING - ...", is just a queue waiting timeout. It
> is not the cause of the problem but a consequence of it, and I have
> doubts that this timeout is reasonable at all.
>  - The second, "TIMEOUT - WRITE_DMA48 ...", is a real command execution
> timeout. I don't know whether it is the result of some improper error
> recovery, or whether your drive has indeed lost the servo information it
> needs near LBA=344052040 and is taking too long to find it. You can try
> to read that sector and nearby ones with dd.
> 
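
The dd suggestion above can be made concrete with a small helper. This is only a sketch: the 512-byte sector size, the /dev/ad4 device node, and the 1024-sector window on each side are assumptions, not something stated in the thread. The command it builds only reads from the disk and discards the data.

```python
# Sketch: build a read-only dd command covering the sectors around a
# suspect LBA. Assumed, not taken from the thread: 512-byte sectors,
# the device node /dev/ad4, and a window of 1024 sectors on each side.

def dd_read_command(lba, window=1024, sector_size=512, device="/dev/ad4"):
    """Return a dd command line that reads [lba - window, lba + window]."""
    start = max(lba - window, 0)        # do not seek before sector 0
    count = lba + window - start + 1    # inclusive range of sectors
    return "dd if=%s of=/dev/null bs=%d skip=%d count=%d" % (
        device, sector_size, start, count)

if __name__ == "__main__":
    # LBA taken from the WRITE_DMA48 timeout message quoted above.
    print(dd_read_command(344052040))
```

If the read finishes without new TIMEOUT messages in the kernel log, that would point away from a bad spot on the platter, though it does not prove the drive is healthy.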

It's always that sequence (the SETFEATURES commands time out first, then the
DMA command later), and the block number varies widely, as does whether it's
a read or a write.  The disk itself and the data it contains appear to be OK
as far as I have been able to determine so far.
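
To check that pattern against a longer log, the two kinds of message can be separated mechanically. The following is a minimal sketch: the regular expressions are written only against the messages quoted in this thread, and they assume each message sits on a single line (as in dmesg, not as wrapped by the mail client). The function names are hypothetical.

```python
import re

# Patterns based only on the log lines quoted in this thread; other
# ata(4) message variants are not covered.
QUEUE_RE = re.compile(r"^(\w+): WARNING - (.+?) taskqueue timeout")
EXEC_RE = re.compile(r"^(\w+): TIMEOUT - (\S+) retrying .*LBA=(\d+)")

def classify(line):
    """Return a tuple describing the timeout kind, or None if no match."""
    m = QUEUE_RE.match(line)
    if m:
        return ("queue", m.group(1), m.group(2))
    m = EXEC_RE.match(line)
    if m:
        return ("exec", m.group(1), m.group(2), int(m.group(3)))
    return None

def summarize(lines):
    """Count queue-wait warnings and collect (command, LBA) for real timeouts."""
    queue_count = 0
    exec_timeouts = []
    for line in lines:
        event = classify(line.strip())
        if event is None:
            continue
        if event[0] == "queue":
            queue_count += 1
        else:
            exec_timeouts.append((event[2], event[3]))
    return queue_count, exec_timeouts

if __name__ == "__main__":
    sample = [
        "ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly",
        "ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly",
        "ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=344052040",
    ]
    print(summarize(sample))
```

Collecting the LBAs this way makes it easy to see whether the command timeouts cluster in one region of the disk or are spread widely.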

Kris
Received on Sun Sep 13 2009 - 19:02:09 UTC
