TIMEOUT - WRITE_DMA and smart questions

From: Ion-Mihai Tetcu <itetcu_at_apropo.ro>
Date: Mon, 11 Oct 2004 14:09:31 +0300
[ please reply only on questions_at_ if this is not appropriate for current_at_ ]


While doing nothing special, the system started printing TIMEOUT -
WRITE_DMA errors and eventually, after I ran atacontrol mode 0 PIO4 PIO4,
it hung completely at 04:20.

After the restart I got a few more TIMEOUTs but no hang; however, the
machine is idle.

SMART was enabled, as seen below, but smartd wasn't running (stupid, huh
:-/ ).

Obvious question: is the HDD dying?

Second question, as I'm not familiar with SMART: how much can one trust
SMART reports?

Third question: could you suggest some settings for smartd? I'm asking
this because I don't fully understand the man pages for smartctl and
smartd; a link explaining more about SMART would also be appreciated.

System details:

Local system status (last daily mail):
 3:01AM  up 2 days, 11:56, 2 users, load averages: 1.04, 1.07, 0.95

 % uname -a
FreeBSD it.buh.cameradicommercio.ro 5.3-BETA7 FreeBSD 5.3-BETA7 #3: Mon Oct  4 21:57:25 EEST 2004     root_at_it.buh.tecnik93.com:/usr/obj/usr/src/sys/IT53_d  i386

Oct 11 04:06:51 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020
Oct 11 04:07:02 it kernel: ata0: reiniting channel ..
Oct 11 04:07:02 it kernel: ata0: reset tp1 mask=03 ostat0=d0 ostat1=d0
Oct 11 04:07:02 it kernel: ad0: stat=0xd0 err=0xd0 lsb=0xd0 msb=0xd0
Oct 11 04:07:02 it last message repeated 95 times
Oct 11 04:07:02 it kernel: ad0: stat=0x50 err=0x01 lsb=0x00 msb=0x00
Oct 11 04:07:02 it kernel: ata0-slave:  stat=0x00 err=0x01 lsb=0x00 msb=0x00
Oct 11 04:07:02 it kernel: ata0: reset tp2 stat0=50 stat1=00 devices=0x1<ATA_MASTER>
Oct 11 04:07:02 it kernel: ata0: resetting done ..
Oct 11 04:07:02 it kernel: ad0: pio=0x0c wdma=0x22 udma=0x45 cable=80pin
Oct 11 04:07:02 it kernel: ad0: setting PIO4 on VIA 8235 chip
Oct 11 04:07:02 it kernel: ad0: setting UDMA100 on VIA 8235 chip
Oct 11 04:07:02 it kernel: ata0: device config done ..
Oct 11 04:07:16 it kernel: (probe0:ata0:0:0:0): error 22
Oct 11 04:07:16 it kernel: (probe0:ata0:0:0:0): Unretryable Error
Oct 11 04:07:16 it kernel: (probe1:ata0:0:1:0): error 22
Oct 11 04:07:16 it kernel: (probe1:ata0:0:1:0): Unretryable Error

 # grep LBA /var/log/messages
Oct 11 04:06:51 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210020
Oct 11 04:07:52 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165839908
Oct 11 04:08:48 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165849220
Oct 11 04:09:12 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165851556
Oct 11 04:09:32 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=165859748
Oct 11 04:10:44 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103
Oct 11 04:11:23 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186210916
Oct 11 04:11:36 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=186211044
Oct 11 04:11:58 it kernel: acd0: FAILURE - ATA_IDENTIFY status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=0
Oct 11 04:13:21 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=309294340
Oct 11 04:14:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421156
Oct 11 04:14:24 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=175421156
Oct 11 04:15:04 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421796
Oct 11 04:15:48 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=130261540
Oct 11 04:16:10 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421892
Oct 11 04:16:53 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=173918724
Oct 11 04:18:50 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=309924420
Oct 11 04:19:14 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4920283
Oct 11 04:40:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=4918975
Oct 11 04:40:56 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6067199
Oct 11 10:46:52 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103
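
To make the list above easier to read, here is a minimal sketch that counts how often each LBA shows up in the TIMEOUT lines, so one can see whether the same sectors recur or the failures are spread across the disk. The sample lines are copied from the grep output; the regular expression and the function name timeout_lbas are my own and only assume the message format shown here.

```python
import re
from collections import Counter

# A few lines copied from the grep output above (not the full log).
LOG = """\
Oct 11 04:14:00 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=175421156
Oct 11 04:14:24 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=175421156
Oct 11 04:10:44 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103
Oct 11 10:46:52 it kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=6343103
Oct 11 04:11:58 it kernel: acd0: FAILURE - ATA_IDENTIFY status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=0
"""

# Assumed format: "<dev>: TIMEOUT - <cmd> retrying (<n> retry|retries left) LBA=<lba>"
TIMEOUT_RE = re.compile(
    r"(?P<dev>\w+): TIMEOUT - (?P<cmd>\w+) retrying "
    r"\((?P<left>\d+) retr(?:y|ies) left\) LBA=(?P<lba>\d+)"
)

def timeout_lbas(text):
    """Return a Counter mapping LBA -> number of TIMEOUT lines that mention it."""
    counts = Counter()
    for line in text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:  # non-TIMEOUT lines (e.g. the acd0 FAILURE) are skipped
            counts[int(m.group("lba"))] += 1
    return counts

counts = timeout_lbas(LOG)
repeated = sorted(lba for lba, n in counts.items() if n > 1)
print("LBAs seen more than once:", repeated)
```

In the full list above, the LBAs range from roughly 4.9 million to 309.9 million, and only two of them (175421156 and 6343103) appear twice.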

 # grep sw /var/log/messages
Oct 11 04:14:24 it kernel: swap_pager: indefinite wait buffer: device: ad0s1e, blkno: 14841, size: 4096
Oct 11 04:14:24 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 14381, size: 4096
Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 60732, size: 4096
Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 33481, size: 4096
Oct 11 04:16:53 it kernel: swap_pager: indefinite wait buffer: device: ad0s3d, blkno: 33488, size: 4096

The disk is:
 # atacontrol cap 0 0
ATA channel 0, Master, device ad0:

Protocol              ATA/ATAPI revision 6
device model          WDC WD1600JB-00EVA0
serial number         WD-WCAEK1298992
firmware revision     15.05R15
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       312579695 sectors
dma supported
overlap not supported

Feature                      Support  Enable    Value   Vendor
write cache                    yes      no
read ahead                     yes      yes
dma queued                     no       no      0/0x00
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      0/0x00
automatic acoustic management  yes      yes     254/0xFE        128/0x80

 # smartctl -a /dev/ad0
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device Model:     WDC WD1600JB-00EVA0
Serial Number:    WD-WCAEK1298992
Firmware Version: 15.05R15
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Oct 11 12:37:32 2004 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

The SMART RETURN STATUS return value (smartmontools -H option/Directive)
 can not be retrieved with this version of ATAng, please do not rely on this value
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x05) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  40) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                 (5061) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  67) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   155   147   021    Pre-fail  Always       -       2775
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       464
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x000b   200   199   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3360
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       462
194 Temperature_Celsius     0x0022   124   253   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0
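
For reading the attribute table above, a minimal sketch that compares each normalized VALUE against its THRESH and lists non-zero raw counters. It assumes the usual smartctl column order (ID, name, flag, VALUE, WORST, THRESH, type, updated, when-failed, raw value) and integer raw values; the function name and the WATCH_IDS list are my own choices for illustration, not anything smartmontools defines.

```python
# A few rows copied from the attribute table above.
ATTRS = """\
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   124   253   000    Old_age   Always       -       26
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       2
"""

# Hypothetical watch list: counters where a non-zero raw value seems worth a look.
WATCH_IDS = {5, 197, 198, 199}

def parse_attributes(text):
    """Parse attribute rows into dicts; lines that do not look like rows are skipped."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) < 10 or not f[0].isdigit():
            continue
        rows.append({
            "id": int(f[0]), "name": f[1],
            "value": int(f[3]), "worst": int(f[4]), "thresh": int(f[5]),
            "type": f[6], "raw": int(f[9]),  # assumes a plain integer raw value
        })
    return rows

rows = parse_attributes(ATTRS)

# An attribute counts as failed once its normalized value drops to the threshold.
failing = [r["name"] for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]

# Raw counters on the watch list that are no longer zero.
watch = [(r["id"], r["raw"]) for r in rows if r["id"] in WATCH_IDS and r["raw"] > 0]

print("at or below threshold:", failing)
print("non-zero watched counters:", watch)
```

On the rows above, nothing is at or below its threshold, while Reallocated_Sector_Ct (raw 8) and UDMA_CRC_Error_Count (raw 2) are non-zero.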

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Interrupted (host reset)      80%        77         -
# 2  Extended offline    Aborted by host               90%        77         -
# 3  Conveyance offline  Completed without error       00%        76         -
# 4  Short offline       Completed without error       00%        76         -
# 5  Conveyance offline  Completed without error       00%       233         -
# 6  Short captive       Interrupted (host reset)      90%       233         -

SMART Selective self-test log data structure revision number 1
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing

Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Received on Mon Oct 11 2004 - 09:26:26 UTC

