Re: ad0 errors on 6.0-RC1

From: Dan Langille <dan_at_langille.org>
Date: Wed, 12 Oct 2005 22:55:22 -0400

On 12 Oct 2005 at 19:58, Mike Tancsa wrote:

> At 05:48 PM 12/10/2005, Dan Langille wrote:
> >I'm seeing these errors but I do not know if it's an HDD problem
> >or an OS problem.  Clues please?
> 
> They look like hard errors, but I have seen similar problems with bad 
> drive trays.  smartmontools out of the ports will help you narrow it 
> down. (eg check the output of smartctl -a /dev/ad0).

We did that yesterday.  I don't know enough about the output to 
judge, but it seems OK.  I also posted it to http://pastebin.com/391872

[root_at_mtwenty:/usr/ports/sysutils/smartmontools] # smartctl -a /dev/ad0
smartctl version 5.33 [i386-portbld-freebsd6.0] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     Maxtor 6Y080L0
Serial Number:    Y3KLWA7E
Firmware Version: YAR41BW0
User Capacity:    81,964,302,336 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Tue Oct 11 08:45:22 2005 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 182) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  40) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   200   200   063    Pre-fail  Always       -       16714
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always       -       77
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline      -       0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0027   251   247   187    Pre-fail  Always       -       36405
  9 Power_On_Minutes        0x0032   243   243   000    Old_age   Always       -       317h+56m
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always       -       84
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       36
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always       -       3036
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0008   198   196   000    Old_age   Offline      -       4
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always       -       4
202 TA_Increase_Count       0x000a   253   252   000    Old_age   Always       -       0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always       -       0
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always       -       0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always       -       0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always       -       0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always       -       0
209 Offline_Seek_Performnce 0x0024   198   198   000    Old_age   Offline      -       0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 4
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4 occurred at disk power-on lifetime: 3332 hours (138 days + 20 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 38 1f a2 e0  Error: ICRC, ABRT at LBA = 0x00a21f38 = 10624824

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 37 38 1f a2 e0 08      00:06:15.120  READ DMA
  c8 00 09 2f 1f a2 e0 08      00:06:15.120  READ DMA
  c8 00 36 f9 1e a2 e0 08      00:06:15.120  READ DMA
  c8 00 0a ef 1e a2 e0 08      00:06:15.120  READ DMA
  c8 00 35 ba 1e a2 e0 08      00:06:15.120  READ DMA

Error 3 occurred at disk power-on lifetime: 3332 hours (138 days + 20 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 ba 1e a2 e0  Error: ICRC, ABRT at LBA = 0x00a21eba = 10624698

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 35 ba 1e a2 e0 08      00:06:15.056  READ DMA
  c8 00 0b af 1e a2 e0 08      00:06:15.056  READ DMA
  c8 00 34 7b 1e a2 e0 08      00:06:15.056  READ DMA
  c8 00 0c 6f 1e a2 e0 08      00:06:15.056  READ DMA
  c8 00 02 6f 1e a2 e0 08      00:06:15.056  READ DMA

Error 2 occurred at disk power-on lifetime: 3332 hours (138 days + 20 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 c1 aa a2 e0  Error: ICRC, ABRT at LBA = 0x00a2aac1 = 10660545

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 3e c1 aa a2 e0 08      00:06:14.880  READ DMA
  c8 00 02 bf aa a2 e0 08      00:06:14.880  READ DMA
  c8 00 34 0b 3b 53 e0 08      00:06:14.880  READ DMA
  c8 00 0c ff 3a 53 e0 08      00:06:14.880  READ DMA
  c8 00 01 7e 00 00 e0 08      00:06:14.880  READ DMA

Error 1 occurred at disk power-on lifetime: 3332 hours (138 days + 20 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 79 96 0e e0  Error: ICRC, ABRT at LBA = 0x000e9679 = 956025

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 16 79 96 0e e0 08      00:06:14.736  READ DMA
  c8 00 2a 4f 96 0e e0 08      00:06:14.736  READ DMA
  c8 00 02 33 54 53 e0 08      00:06:14.736  READ DMA
  c8 00 08 f7 aa a2 e0 08      00:06:14.736  READ DMA
  c8 00 08 f7 aa a2 e0 08      00:06:14.736  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3370         -
# 2  Short offline       Completed without error       00%         7         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root_at_mtwenty:/usr/ports/sysutils/smartmontools] #
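
For anyone checking the error log above by hand, here is a minimal sketch (not part of smartctl) of how the printed LBA can be rebuilt from the SN, CL, CH and DH register bytes. It assumes plain 28-bit LBA addressing, where the low four bits of DH carry LBA bits 24-27. The names lba28, error_flags and ERRORS are made up for illustration; the byte values are copied from the four log entries, and only the two error-register bits that appear in this log are decoded.

```python
# Hypothetical helpers for reading the SMART error log above by hand.
# Assumption: 28-bit LBA mode, low nibble of DH = LBA bits 24-27.

def lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the ATA register bytes SN, CL, CH, DH."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

def error_flags(er):
    """Decode only the two error-register bits seen in this log."""
    known = {0x80: "ICRC", 0x04: "ABRT"}
    return [name for bit, name in known.items() if er & bit]

# error number -> (ER, SN, CL, CH, DH), copied from the log entries
ERRORS = {
    4: (0x84, 0x38, 0x1F, 0xA2, 0xE0),
    3: (0x84, 0xBA, 0x1E, 0xA2, 0xE0),
    2: (0x84, 0xC1, 0xAA, 0xA2, 0xE0),
    1: (0x84, 0x79, 0x96, 0x0E, 0xE0),
}

for num, (er, sn, cl, ch, dh) in ERRORS.items():
    lba = lba28(sn, cl, ch, dh)
    print(f"Error {num}: {', '.join(error_flags(er))} at LBA 0x{lba:08x} = {lba}")
```

The four results agree with the LBA values smartctl printed (10624824, 10624698, 10660545 and 956025), so the register dump and the decoded line are at least consistent with each other.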

> 
>          ---Mike
> 
> 
> >The following was also posted at http://pastebin.com/391670
> >
> >Oct 11 03:40:00 mtwenty kernel: ad0: FAILURE - READ_DMA status=7f<READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=7f<UNCORRECTABLE,MEDIA_CHANGED,NID_NOT_FOUND,MEDIA_CHANGE_REQEST,ABORTED,NO_MEDIA,ILLEGAL_LENGTH> LBA=802719
> >Oct 11 03:40:00 mtwenty kernel: g_vfs_done():ad0s1a[READ(offset=410959872, length=16384)]error = 5
> >Oct 11 03:40:06 mtwenty kernel: ad0: FAILURE - READ_DMA status=7f<READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=7f<UNCORRECTABLE,MEDIA_CHANGED,NID_NOT_FOUND,MEDIA_CHANGE_REQEST,ABORTED,NO_MEDIA,ILLEGAL_LENGTH> LBA=802175
> >Oct 11 03:40:06 mtwenty kernel: g_vfs_done():ad0s1a[READ(offset=410681344, length=8192)]error = 5
> >Oct 11 03:40:06 mtwenty kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=4857391
> >Oct 11 03:40:01 mtwenty cron[82160]: login_getclass: retrieving class information: Input/output error
> >Oct 11 03:44:49 mtwenty kernel: ad0: FAILURE - READ_DMA status=7f<READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=7f<UNCORRECTABLE,MEDIA_CHANGED,NID_NOT_FOUND,MEDIA_CHANGE_REQEST,ABORTED,NO_MEDIA,ILLEGAL_LENGTH> LBA=151787983
> >Oct 11 03:44:49 mtwenty kernel: g_vfs_done():ad0s1f[READ(offset=74097885184, length=14336)]error = 5
> >Oct 11 03:44:56 mtwenty kernel: ad0: FAILURE - WRITE_DMA status=7f<READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=7f<UNCORRECTABLE,MEDIA_CHANGED,NID_NOT_FOUND,MEDIA_CHANGE_REQEST,ABORTED,NO_MEDIA,ILLEGAL_LENGTH> LBA=4857391
> >Oct 11 03:44:56 mtwenty kernel: g_vfs_done():ad0s1d[WRITE(offset=969719808, length=10240)]error = 5
> >Oct 11 03:44:56 mtwenty kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=92997387
> >Oct 11 03:55:07 mtwenty kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=4092687
> >Oct 11 13:04:08 mtwenty kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=4092687
> >Oct 11 13:52:08 mtwenty kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=4092687
> >Oct 11 13:55:07 mtwenty kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=4092687
> >Oct 11 13:55:33 mtwenty kernel: ad0: timeout waiting to issue command
> >Oct 11 13:55:33 mtwenty kernel: ad0: error issueing WRITE_DMA command
> >Oct 11 13:55:33 mtwenty kernel: ad0: timeout waiting to issue command
> >Oct 11 13:55:33 mtwenty kernel: ad0: error issueing WRITE_DMA command
> >Oct 11 13:55:33 mtwenty kernel: ad0: timeout waiting to issue command
> >Oct 11 13:55:33 mtwenty kernel: ad0: error issueing WRITE_DMA command
> >Oct 11 13:55:33 mtwenty kernel: ad0: timeout waiting to issue command
> >Oct 11 13:55:33 mtwenty kernel: ad0: error issueing WRITE_DMA command
> >Oct 11 13:55:33 mtwenty kernel: g_vfs_done():ad0s1f[WRITE(offset=42777804800, length=16384)]error = 5
> >Oct 11 13:55:33 mtwenty kernel: g_vfs_done():ad0s1f[WRITE(offset=43163189248, length=16384)]error = 5
> >Oct 11 13:55:33 mtwenty kernel: g_vfs_done():ad0s1a[WRITE(offset=131072, length=16384)]error = 5
> >Oct 11 13:55:33 mtwenty kernel: g_vfs_done():ad0s1a[WRITE(offset=147456, length=16384)]error = 5
> >Oct 11 13:55:38 mtwenty kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=786815
> >Oct 11 15:44:31 mtwenty shutdown: reboot by dan:
> >
> >Oct 11 16:13:03 mtwenty su: dan to root on /dev/ttyp1
> >Oct 11 19:51:04 mtwenty kernel: ad0: timeout waiting to issue command
> >Oct 11 19:51:09 mtwenty kernel: ad0: error issueing WRITE_DMA command
> >Oct 11 19:51:09 mtwenty kernel: ad0: timeout waiting to issue command
> >Oct 11 19:51:09 mtwenty kernel: ad0: error issueing WRITE_DMA command
> >Oct 11 19:51:09 mtwenty kernel: g_vfs_done():ad0s1f[WRITE(offset=49576368128, length=2048)]error = 5
> >Oct 11 19:51:09 mtwenty kernel: g_vfs_done():ad0s1f[WRITE(offset=49767104512, length=16384)]error = 5
> >Oct 11 19:51:09 mtwenty kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=104266895
> >Oct 11 20:17:45 mtwenty kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=319
> >Oct 12 17:23:37 mtwenty syslogd: kernel boot file is /boot/kernel/kernel
> >Oct 12 17:23:37 mtwenty kernel: g_vfs_done():ad0s1d[WRITE(offset=969867264, length=8192)]error = 6
> >Oct 12 17:23:37 mtwenty kernel: g_vfs_done():ad0s1d[WRITE(offset=963559424, length=16384)]error = 6
> >Oct 12 17:23:37 mtwenty kernel: unknown: TIMEOUT - READ_DMA retrying (0 retries left) LBA=153118463
> >Oct 12 17:23:37 mtwenty kernel: unknown: FAILURE - READ_DMA timed out LBA=153118463
> >Oct 12 17:23:37 mtwenty kernel: g_vfs_done():ad0s1f[READ(offset=74779090944, length=2048)]error = 5
> >Oct 12 17:23:37 mtwenty kernel: g_vfs_done():ad0s1f[READ(offset=74779097088, length=2048)]error = 6
> >Oct 12 17:23:37 mtwenty kernel: g_vfs_done():ad0s1f[READ(offset=74202345472, length=2048)]error = 6
> >Oct 12 17:23:37 mtwenty kernel: g_vfs_done():ad0s1f[READ(offset=75589498880, length=2048)]error = 6
> >
> >Thanks
> >--
> >Dan Langille : http://www.langille.org/
> >BSDCan - The Technical BSD Conference - http://www.bsdcan.org/
> >
> >
> 
> 
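
A rough sketch for cross-checking the quoted kernel messages, under two assumptions the log itself does not confirm: sectors are 512 bytes, and each ad0 FAILURE line belongs with the g_vfs_done() line logged right after it. Dividing the byte offset by the sector size and subtracting it from the reported LBA gives the sector at which the partition would have to start. The names SECTOR_BYTES, PAIRS and implied_start are hypothetical; the numbers are the two ad0s1a READ failures from Oct 11 03:40.

```python
# Hypothetical cross-check of kernel LBA vs. g_vfs_done() byte offset.
# Assumptions: 512-byte sectors; each FAILURE line pairs with the
# g_vfs_done() line that follows it in the log.

SECTOR_BYTES = 512

# (LBA from the FAILURE line, offset from the following g_vfs_done line)
PAIRS = [
    (802719, 410959872),  # ad0s1a READ, Oct 11 03:40:00
    (802175, 410681344),  # ad0s1a READ, Oct 11 03:40:06
]

def implied_start(lba, offset, sector_bytes=SECTOR_BYTES):
    """Return the first sector of the partition implied by one LBA/offset pair."""
    sectors, remainder = divmod(offset, sector_bytes)
    if remainder:
        raise ValueError("offset is not a whole number of sectors")
    return lba - sectors

for lba, offset in PAIRS:
    print(f"LBA {lba}, offset {offset}: partition would start at sector "
          f"{implied_start(lba, offset)}")
```

Both pairs give the same starting sector, 63, so the kernel LBAs and the filesystem offsets appear to describe the same reads. That figure would fit a slice beginning at sector 63, but the partition layout is not shown anywhere in this thread.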


-- 
Dan Langille : http://www.langille.org/
BSDCan - The Technical BSD Conference - http://www.bsdcan.org/
Received on Thu Oct 13 2005 - 00:55:26 UTC