Re: ata timeouts under load

From: Miroslav Lachman <000.fbsd_at_quip.cz>
Date: Mon, 14 Sep 2009 17:21:34 +0200

Alexandre Sunny wrote:
> On Sun, 13 Sep 2009 22:02:10 +0100
> Kris Kennaway <kris_at_FreeBSD.org> wrote:
> 
> 
>>Alexander Motin wrote:
>>
>>>Kris Kennaway wrote:

[...]

>>>There are two different kinds of timeouts we can see:
>>> - first one, "ad4: WARNING - ..." is just a queue waiting timeout.
>>>It is not the reason, but consequence of the problem. And I have
>>>doubts that it is reasonable to do it.
>>> - second one, "TIMEOUT - WRITE_DMA48 ..." is a real command
>>>execution timeout. I don't know whether this is result of some
>>>improper error recovery, or you drive indeed lost required servo
>>>information near LBA=344052040 and tries to find it too long. You
>>>can try to read that sector and nearby ones with dd.
>>>
>>
>>It's always that sequence (with setfeatures timing out first, then
>>the dma later)...and the block number varies widely, also whether
>>it's read/write.  The disk itself & the data it contains appears to
>>be OK as far as I have been able to determine so far.
> 
> 
> Does smartctl -A /dev/ad4 report "Seek Error Rate" and/or "ECC Error
> Rate", and, if so, do those values change while errors are being
> reported?
> 
> "Replaced Sector Count" or something similar might give some insight
> too.

I have a very similar problem with one disk in a gmirror, but it is on 7.2,
not current.

Sep 14 04:48:29 jimi kernel: ad6: timeout waiting to issue command
Sep 14 04:48:29 jimi kernel: ad6: error issuing FLUSHCACHE command
Sep 14 04:48:29 jimi kernel: ad6: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=447001516
Sep 14 04:48:29 jimi kernel: ad6: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=447001516
Sep 14 04:48:29 jimi kernel: GEOM_MIRROR: Request failed (error=5). ad6[READ(offset=228864776192, length=2048)]
Sep 14 04:48:29 jimi kernel: GEOM_MIRROR: Device gm0: provider ad6 disconnected.
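
The LBA and the GEOM_MIRROR offset in the log describe the same spot on
the disk if you assume 512-byte sectors: 447001516 * 512 = 228864776192.
Here is a minimal Python sketch that checks this and builds a dd command
for reading the sectors around the failing LBA, as suggested earlier in
the thread. The helper names and the window of 8 sectors before the LBA
are my own assumptions, not anything from the ata driver.

```python
SECTOR_SIZE = 512  # assumption: the disk reports 512-byte logical sectors


def lba_to_offset(lba, sector_size=SECTOR_SIZE):
    """Convert a sector number (LBA) to a byte offset on the device."""
    return lba * sector_size


def dd_read_command(device, lba, before=8, count=16, sector_size=SECTOR_SIZE):
    """Build a dd command line that reads `count` sectors, starting
    `before` sectors ahead of the given LBA (hypothetical helper)."""
    start = max(lba - before, 0)
    return "dd if={} of=/dev/null bs={} skip={} count={}".format(
        device, sector_size, start, count
    )


if __name__ == "__main__":
    # Values taken from the kernel log above.
    print(lba_to_offset(447001516))
    print(dd_read_command("/dev/ad6", 447001516))
```

The script only prints the command, so nothing touches the disk until you
run the dd line yourself.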

But no errors in SMART log:

Device Model:     Hitachi HDP725050GLA360
Firmware Version: GM4OA52A
User Capacity:    500,107,862,016 bytes

SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       1
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       151
  3 Spin_Up_Time            0x0007   116   116   024    Pre-fail  Always       -       312 (Average 350)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   129   129   020    Pre-fail  Offline      -       30
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       13911
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       545
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       545
194 Temperature_Celsius     0x0002   240   240   000    Old_age   Always       -       25 (Lifetime Min/Max 20/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
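
To watch whether these values change while the errors are being reported,
something like the following Python sketch could be run against saved
`smartctl -A` output. It is only a rough illustration: the function names
and the set of "watched" attributes are my own choice, it assumes one
attribute per line in the usual ten-column layout, and raw values are
vendor specific, so a nonzero raw value is a hint and not proof of a
failing drive.

```python
# Assumption: attributes where a nonzero raw value usually deserves a look.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       13911
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {name: {"value", "thresh", "raw"}} from `smartctl -A` text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = {
                "value": int(fields[3]),
                "thresh": int(fields[5]),
                "raw": int(fields[9]),  # first token only, e.g. "312 (Average 350)"
            }
    return attrs


def suspicious(attrs):
    """List attributes at or below threshold, or watched ones with raw != 0."""
    hits = []
    for name, a in attrs.items():
        if a["thresh"] > 0 and a["value"] <= a["thresh"]:
            hits.append(name)
        elif name in WATCHED and a["raw"] != 0:
            hits.append(name)
    return hits


if __name__ == "__main__":
    print(suspicious(parse_attributes(SAMPLE)))
```

For the values shown above nothing is flagged, which matches what I see:
the drive times out under load while SMART looks clean.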

As has been discussed many times, it should be fixed by increasing the
hardcoded timeouts.
Is it time to make the ATA timeouts sysctl tunables?
There were patches from FreeNAS and some PRs about longer timeouts.

kern/136182: [ata] Heavy disk writes (e.g. ZFS resilver to a drive) can 
cause "adX: TIMEOUT - FLUSHCACHE retrying (1 retry left)" on console.
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/136182

kern/111023: [ata] [request] [patch] please expand ata timeouts
http://www.freebsd.org/cgi/query-pr.cgi?pr=111023

ATA/SATA DMA timeout issues
http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting#line-53

HowTo: Fix SATA DMA timeout issues on FreeBSD
http://linux-bsd-sharing.blogspot.com/2009/03/howto-fix-sata-dma-timeout-issues-on.html

Western Digital hard disks and ATA timeouts
http://www.mail-archive.com/freebsd-hardware_at_freebsd.org/msg03135.html

ata FLUSHCACHE timeout errors? [patch]
http://lists.freebsd.org/pipermail/freebsd-current/2009-April/005939.html

And I am sure you can find many more reports floating around.

Miroslav Lachman