Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

From: O. Hartmann <o.hartmann_at_walstatt.org>
Date: Wed, 13 Dec 2017 09:54:43 +0100

On Tue, 12 Dec 2017 14:55:49 -0800 (PST)
"Rodney W. Grimes" <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net> wrote:

> > On Tue, 12 Dec 2017 10:52:27 -0800 (PST)
> > "Rodney W. Grimes" <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net> wrote:
> > 
> > 
> > Thank you for answering so fast!
> >   
> > > > Hello,
> > > > 
> > > > running CURRENT (recent r326769), I noticed that smartd sends some console
> > > > messages when booting the box:
> > > > 
> > > > [...]
> > > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently unreadable (pending) sectors
> > > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Offline uncorrectable sectors
> > > > [...]
> > > > 
> > > > Checking the drive's SMART log with smartctl (it is one of four 3TB disk drives),
> > > > I get the following information:
> > > > 
> > > > [... smartctl -x /dev/ada6 ...]
> > > > Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days + 15 hours)
> > > >   When the command that caused the error occurred, the device was active or idle.
> > > > 
> > > >   After command completion occurred, registers were:
> > > >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> > > >   -- -- -- == -- == == == -- -- -- -- --
> > > >   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
> > > > 
> > > >   Commands leading to the command that caused the error were:
> > > >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
> > > >   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
> > > >   60 00 b0 00 88 00 00 c2 7a 73 20 40 08     23:38:12.195  READ FPDMA QUEUED
> > > >   60 00 b0 00 80 00 00 c2 7a 72 70 40 08     23:38:12.195  READ FPDMA QUEUED
> > > >   2f 00 00 00 01 00 00 00 00 00 10 40 08     23:38:12.195  READ LOG EXT
> > > >   60 00 b0 00 70 00 00 c2 7a 73 20 40 08     23:38:09.343  READ FPDMA QUEUED
> > > >   60 00 b0 00 68 00 00 c2 7a 72 70 40 08     23:38:09.343  READ FPDMA QUEUED
> > > > [...]
> > > > 
> > > > and
> > > > 
> > > > [...]
> > > > SMART Attributes Data Structure revision number: 16
> > > > Vendor Specific SMART Attributes with Thresholds:
> > > > ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> > > >   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    64
> > > >   3 Spin_Up_Time            POS--K   178   170   021    -    6075
> > > >   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
> > > >   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
> > > >   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
> > > >   9 Power_On_Hours          -O--CK   066   066   000    -    25339
> > > >  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
> > > >  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
> > > >  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> > > > 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> > > > 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
> > > > 194 Temperature_Celsius     -O---K   122   109   000    -    28
> > > > 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> > > > 197 Current_Pending_Sector  -O--CK   200   200   000    -    1
> > > > 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> > > > 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> > > > 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
> > > >                             ||||||_ K auto-keep
> > > >                             |||||__ C event count
> > > >                             ||||___ R error rate
> > > >                             |||____ S speed/performance
> > > >                             ||_____ O updated online
> > > >                             |______ P prefailure warning
> > > > 
> > > > [...]    
> > > 
> > > The data up to this point tells us that you have 1 bad sector
> > > on a 3TB drive. That is actually an expected event: given the data
> > > error rate of these drives, you are going to see one of these now
> > > and again.
> > > 
> > > Given you have 1 single event I would not suspect that this drive
> > > is dying, but it would be prudent to prepare for that possibility.  
> > 
> > Hello.
> > 
> > Well, I simply copied one single event out of those that have been logged so far.
> > 
> > As you (and I) can see, it is error #42. After I posted here, a reboot took place
> > because the estimated time of the "repair" process on the pool suddenly increased, and
> > now I am at error #47. Interestingly, it is a different block that is damaged this
> > time, and the SMART attribute fields currently show this:
> 
> Can you send the complete output of smartctl -a /dev/foo? I somehow missed
> that 40+ other errors had occurred.

Yes, here it is, but please do not beat me up over its size ;-). It is "smartctl -x" that
shows me the errors. See the attached file named "smart_ada.txt". It should contain
everything of interest about the drive, I guess.

> 
> > [...]
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> >   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    69
> >   3 Spin_Up_Time            POS--K   178   170   021    -    6075
> >   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
> >   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0  
> 
> Interesting, no reallocation has occurred....
> 
> >   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
> >   9 Power_On_Hours          -O--CK   066   066   000    -    25343
> >  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
> >  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
> >  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> > 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> > 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746  
> 
> Hmm, I just noticed this.  25k power-on hours and 2M load cycles: that is
> very hard on a hard drive.  Your drive is going into power save mode
> and unloading the heads.  In fact, at a rate of 81 times per hour?
> Oh, I cannot believe that.  Either way we need to get this stopped,
> as it will wear your drives out.
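
For reference, the rate mentioned above can be checked directly from the two raw values in the quoted attribute table (Load_Cycle_Count and Power_On_Hours). This is only a back-of-the-envelope sketch in Python; the variable names are made up for illustration, and the result is a lifetime average that says nothing about how the cycles are spread over time.

```python
# Raw values copied from the second SMART attribute listing quoted in this thread.
load_cycle_count = 2055746   # attribute 193, raw value
power_on_hours = 25343       # attribute 9, raw value

# Lifetime average; the real rate may be higher while the disk is idle.
cycles_per_hour = load_cycle_count / power_on_hours
seconds_per_cycle = 3600 / cycles_per_hour

print(f"{cycles_per_hour:.1f} load cycles per hour")
print(f"one load cycle every {seconds_per_cycle:.0f} s on average")
```

That works out to roughly 81 load cycles per hour, i.e. more than one per minute.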
> 
> > 194 Temperature_Celsius     -O---K   122   109   000    -    28
> > 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> > 197 Current_Pending_Sector  -O--CK   200   200   000    -    0
> > 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> > 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> > 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
> >                             ||||||_ K auto-keep
> >                             |||||__ C event count
> >                             ||||___ R error rate
> >                             |||____ S speed/performance
> >                             ||_____ O updated online
> >                             |______ P prefailure warning
> > [...]
> > 
> > 
> > 197 Current_Pending_Sector has dropped to zero for now, but with every reboot, the
> > error count seems to increase:
> 
> OK, some drive firmware will even test the pending sector list at power-on
> and clear an entry if it can actually read the sector.
> 
> > 
> > [...]
> > Error 47 [22] occurred at disk power-on lifetime: 25343 hours (1055 days + 23 hours)
> >   When the command that caused the error occurred, the device was active or idle.
> > 
> >   After command completion occurred, registers were:
> >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >   -- -- -- == -- == == == -- -- -- -- --
> >   40 -- 51 00 00 00 00 c2 19 d9 88 40 00  Error: UNC at LBA = 0xc219d988 = 3256473992
> > 
> >   Commands leading to the command that caused the error were:
> >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
> >   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
> >   60 00 b0 00 d0 00 00 c2 19 da 28 40 08  1d+07:12:34.336  READ FPDMA QUEUED
> >   60 00 b0 00 c8 00 00 c2 19 d9 78 40 08  1d+07:12:34.336  READ FPDMA QUEUED
> >   2f 00 00 00 01 00 00 00 00 00 10 40 08  1d+07:12:34.336  READ LOG EXT
> >   60 00 b0 00 b8 00 00 c2 19 da 28 40 08  1d+07:12:31.484  READ FPDMA QUEUED
> >   60 00 b0 00 b0 00 00 c2 19 d9 78 40 08  1d+07:12:31.483  READ FPDMA QUEUED
> > 
> > 
> > I think this is watching a HDD dying, isn't it?  
> 
> It could be; we need to see as many of the other 46 errors as we can to make
> a decision on that.   Probably only 5 are in the log, though.

As said above, see attached file ;-)
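
To compare the two failing blocks, the LBA can be rebuilt from the six LBA bytes (the LBA_48 and LH/LM/LL columns) that smartctl prints in the "registers were" lines quoted above. The following Python sketch does that for errors 42 and 47. The helper name `lba_from_registers` is hypothetical, and the 512-byte logical sector size is an assumption about this drive, not something taken from the log.

```python
def lba_from_registers(hex_bytes: str) -> int:
    """Combine the six LBA register bytes (most significant first) into one integer."""
    return int("".join(hex_bytes.split()), 16)

# LBA bytes from the "registers were" lines of error 42 and error 47.
lba_err42 = lba_from_registers("00 00 c2 7a 72 98")
lba_err47 = lba_from_registers("00 00 c2 19 d9 88")

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

distance = abs(lba_err42 - lba_err47)
print(lba_err42, lba_err47)
print(f"{distance} sectors apart, about {distance * SECTOR_BYTES / 2**30:.1f} GiB")
```

The decoded values match the decimal LBAs smartctl reports, and they confirm that the two errors hit different blocks.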

> 
> > I'd say, a broken cabling would produce different errors, wouldn't it?  
> Yes, a cabling problem would show up as CRC errors instead.
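
One way to check this against the listing is to look at the raw value of attribute 199 (UDMA_CRC_Error_Count). The Python sketch below parses a few rows pasted from the second smartctl listing in this thread. The function name `parse_attributes` is hypothetical, and the parsing assumes exactly the eight-column layout shown here.

```python
# Rows pasted from the second attribute listing in this thread.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   200   200   000    -    1
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
"""

def parse_attributes(text: str) -> dict:
    """Map attribute ID to (name, raw value), assuming 8 whitespace-separated columns."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], int(fields[7]))
    return attrs

attrs = parse_attributes(SMART_ROWS)
for attr_id in (199, 5, 197, 198):
    name, raw = attrs[attr_id]
    print(f"{attr_id:3d} {name:<24} raw={raw}")
```

With a raw CRC count of 0, the listing gives no hint of a cabling problem, while attribute 198 still reports one uncorrectable sector.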
> 
> > The Western Digital Green series HDD is a useful fellow when it is used as a
> > single drive. I think there might be an issue with pairing 4 HDDs, 3 of them "GREEN",
> > in a RAIDZ and physically sitting next to each other. Maybe it is time to replace
> > them one by one ...
> 
> I am more suspicious of them loading and unloading the heads at a rate of
> more than once per minute!
> 
[ ... snip ... ]

-- 
O. Hartmann

I object to the use or transmission of my data for advertising
purposes or for market or opinion research (§ 28 Abs. 4 BDSG).