Re: ata timeouts under load

From: Alexandre Sunny <gaijin.k_at_gmail.com> Date: Mon, 14 Sep 2009 10:09:41 -0400

On Sun, 13 Sep 2009 22:02:10 +0100
Kris Kennaway <kris_at_FreeBSD.org> wrote:

> Alexander Motin wrote:
> > Kris Kennaway wrote:
> >> I am getting timeouts on 8.0b4/HEAD when I do a lot of ZFS I/O to
> >> a pool on ad4:
> >>
> >> atapci0: <VIA 6420 SATA150 controller> port 0xc800-0xc807,0xc400-0xc403,0xc000-0xc007,0xb800-0xb803,0xb400-0xb40f,0xb000-0xb0ff irq 20 at device 15.0 on pci0
> >> ata2: <ATA channel 0> on atapci0
> >> ata3: <ATA channel 1> on atapci0
> >> ata0: <ATA channel 0> on atapci1
> >> ata1: <ATA channel 1> on atapci1
> >>
> >> ad4: 476940MB <WDC WD5000AAKS-00TMA0 12.01C01> at ata2-master SATA150
> >> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> >> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> >> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> >> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> >> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
> >> ad4: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=344052040
> >> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> >> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> >>
> >> It becomes stuck in a loop displaying the above and is unable to
> >> complete further I/O operations.  I wonder if it is just batching
> >> up a lot of I/O and then timing out because it is busy, and then
> >> not recovering from this state?
> >>
> >> Any ideas what could be wrong?
> > 
> > There are two different kinds of timeouts here:
> >  - The first, "ad4: WARNING - ...", is just a queue-wait timeout.
> > It is not the cause but a consequence of the problem, and I doubt
> > it is reasonable to have this timeout at all.
> >  - The second, "TIMEOUT - WRITE_DMA48 ...", is a real command
> > execution timeout. I don't know whether this is the result of some
> > improper error recovery, or whether your drive has indeed lost the
> > required servo information near LBA=344052040 and is taking too
> > long to find it. You can try to read that sector and nearby ones
> > with dd.
> > 
> 
> It's always that sequence (with SETFEATURES timing out first, then
> the DMA later), and the block number varies widely, as does whether
> it's a read or a write.  The disk itself and the data it contains
> appear to be OK as far as I have been able to determine so far.
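
For the dd check suggested above, here is a rough sketch of how the
range around a reported LBA could be worked out. It assumes 512-byte
logical sectors and the device node /dev/ad4; the helper name
dd_read_command and the 1024-sector margin are made up for
illustration. Since the reported block varies, a single LBA is only a
starting point.

```python
import shlex

SECTOR_SIZE = 512  # assumption: the drive exposes 512-byte logical sectors


def dd_read_command(lba, margin=1024, device="/dev/ad4"):
    """Build a dd command that reads `margin` sectors on each side of `lba`."""
    start = max(lba - margin, 0)        # do not seek before sector 0
    count = (lba - start) + margin + 1  # sectors before + the LBA itself + sectors after
    return [
        "dd",
        f"if={device}",
        "of=/dev/null",                 # read-only test, data is discarded
        f"bs={SECTOR_SIZE}",
        f"skip={start}",
        f"count={count}",
    ]


if __name__ == "__main__":
    # LBA taken from the WRITE_DMA48 timeout in the quoted log.
    print(shlex.join(dd_read_command(344052040)))
```

The command only reads from the disk, so it should be safe to run on a
live pool, though it will compete with the ZFS I/O that triggers the
timeouts.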

Does smartctl -A /dev/ad4 report "Seek Error Rate" and/or "ECC Error
Rate", and, if so, do those values change while errors are being
reported?

"Replaced Sector Count" or something similar might give some insight
too.
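
To see whether the values change while the errors are being reported,
something along these lines could compare two smartctl -A snapshots.
This is only a sketch: it assumes the usual ten-column attribute table
that smartmontools prints for ATA drives, and the function names are
my own. Attribute names differ between vendors, so it reports every
raw value that changed rather than looking for specific ones.

```python
import subprocess


def parse_smart_attributes(text):
    """Map attribute name -> raw value string from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID;
        # the header row starts with "ID#" and is skipped.
        if len(fields) == 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9].strip()
    return attrs


def changed_attributes(before, after):
    """Return {name: (old, new)} for attributes whose raw value differs."""
    return {
        name: (before.get(name), value)
        for name, value in after.items()
        if before.get(name) != value
    }


def snapshot(device="/dev/ad4"):
    """Run smartctl -A once and parse it (needs smartmontools and root)."""
    # smartctl can exit non-zero while still printing the table,
    # so the exit status is deliberately not checked here.
    out = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    ).stdout
    return parse_smart_attributes(out)
```

Take one snapshot() before starting the heavy I/O and another after
the timeouts appear, then print changed_attributes(before, after).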

--
Alexandre Kovalenko.