Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

From: Rodney W. Grimes <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net>
Date: Wed, 13 Dec 2017 05:23:24 -0800 (PST)
> Am Tue, 12 Dec 2017 14:55:49 -0800 (PST)
> "Rodney W. Grimes" <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net> schrieb:
> > > Am Tue, 12 Dec 2017 10:52:27 -0800 (PST)
> > > "Rodney W. Grimes" <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net> schrieb:
> > > 
> > > Thank you for answering that fast!

Not so fast this time, had to sleep :)

> > > > > Hello,
> > > > > 
> > > > > running CURRENT (recent r326769), I realised that smartd sends out some console
> > > > > messages when booting the box:
> > > > > 
> > > > > [...]
> > > > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently unreadable (pending) sectors
> > > > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Offline uncorrectable sectors
> > > > > [...]
> > > > > 
> > > > > Checking the drive's SMART log with smartctl (it is one of four 3TB disk drives),
> > > > > I gather this information:
> > > > > 
> > > > > [... smartctl -x /dev/ada6 ...]
> > > > > Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days + 15 hours)
> > > > >   When the command that caused the error occurred, the device was active or idle.
> > > > > 
> > > > >   After command completion occurred, registers were:
> > > > >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> > > > >   -- -- -- == -- == == == -- -- -- -- --
> > > > >   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
> > > > > 
> > > > >   Commands leading to the command that caused the error were:
> > > > >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
> > > > >   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
> > > > >   60 00 b0 00 88 00 00 c2 7a 73 20 40 08     23:38:12.195  READ FPDMA QUEUED
> > > > >   60 00 b0 00 80 00 00 c2 7a 72 70 40 08     23:38:12.195  READ FPDMA QUEUED
> > > > >   2f 00 00 00 01 00 00 00 00 00 10 40 08     23:38:12.195  READ LOG EXT
> > > > >   60 00 b0 00 70 00 00 c2 7a 73 20 40 08     23:38:09.343  READ FPDMA QUEUED
> > > > >   60 00 b0 00 68 00 00 c2 7a 72 70 40 08     23:38:09.343  READ FPDMA QUEUED
> > > > > [...]
> > > > > 
> > > > > and
> > > > > 
> > > > > [...]
> > > > > SMART Attributes Data Structure revision number: 16
> > > > > Vendor Specific SMART Attributes with Thresholds:
> > > > > ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> > > > >   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    64
> > > > >   3 Spin_Up_Time            POS--K   178   170   021    -    6075
> > > > >   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
> > > > >   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
> > > > >   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
> > > > >   9 Power_On_Hours          -O--CK   066   066   000    -    25339
> > > > >  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
> > > > >  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
> > > > >  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> > > > > 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> > > > > 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
> > > > > 194 Temperature_Celsius     -O---K   122   109   000    -    28
> > > > > 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> > > > > 197 Current_Pending_Sector  -O--CK   200   200   000    -    1
> > > > > 198 Offline_Uncorrectable   ----CK   200   200   000    -    1

Note here, we have a pending sector and we have an offline uncorrectable.
An offline uncorrectable needs to end up in the remap list and, IIRC,
should never be cleared and put back among the good blocks, but then
again firmware gets changed, so maybe it is possible to return
this to a good sector.  Either way it looks as if at this point
in time we may in fact have 2 separate blocks that are bad.
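
A cheap way to confirm is a long self-test: the drive surface-scans
itself in the background and logs the first LBA it cannot read.
Something like (device name as in your report):

    smartctl -t long /dev/ada6      # kick off the offline surface scan
    smartctl -l selftest /dev/ada6  # check the result when it is done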

I have some long-used, heavily worn drives that have tens of remapped
sectors and they are still running fine.  I would not use them for
mission critical or heavy-use situations, but they are good
for cold storage and other non-critical uses.  A total of 2 reallocations
I would not worry much about, unless I am seeing a growth rate.
Note that when these drives are shipped brand new, for the first N
Power On Hours they are in a special mode that is very quick to simply
remap a "weak" sector.  I.e., for any sector that requires some threshold
of M bits of error correction, the ECC already corrected the data, but the
vendor has decided that these are weak sectors and it should just remap them.
Some firmware does not even count them as reallocated sectors, and adds
them to the manufacturer's P-list.
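
If you want to watch for a growth rate, a dated snapshot from cron is
enough; the log path here is just an example:

    # append a dated copy of the remap-related attributes (5, 196, 197, 198)
    ( date -u '+%Y-%m-%dT%H:%M:%SZ' ; \
      smartctl -A /dev/ada6 | awk '$1==5 || $1==196 || $1==197 || $1==198' ) \
      >> /var/log/ada6-remap.log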

> > > > > 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> > > > > 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
> > > > >                             ||||||_ K auto-keep
> > > > >                             |||||__ C event count
> > > > >                             ||||___ R error rate
> > > > >                             |||____ S speed/performance
> > > > >                             ||_____ O updated online
> > > > >                             |______ P prefailure warning
> > > > > 
> > > > > [...]    
> > > > 
> > > > The data up to this point informs us that you have 1 bad sector
> > > > on a 3TB drive; that is actually an expected event, given that the
> > > > data error rate on this stuff is such that you're going to have
> > > > these now and again.
> > > > 
> > > > Given you have 1 single event I would not suspect that this drive
> > > > is dying, but it would be prudent to prepare for that possibility.  
> > > 
> > > Hello.
> > > 
> > > Well, I simply copied "one single event" that had been logged so far.
> > > 
> > > As you (and I) can see, it is error #42. After I posted here, a reboot took place
> > > because the "repair" process on the pool suddenly increased its estimated time, and
> > > now I'm at error #47. Interestingly, it is a new block that is damaged, but the SMART
> > > attribute fields show this for now:  
> > 
> > Can you send the complete output of smartctl -a /dev/foo? I somehow missed
> > that 40+ other errors had occurred.
> 
> 
> Yes, here it is, but please do not beat me over its size ;-). It is "smartctl -x" that
> shows me the errors. See the attached file named "smart_ada.txt". It is everything of
> interest about the drive, I guess.

This was not that large:
     358    2901   17940 smart_ada.txt
 
> > > [...]
> > > SMART Attributes Data Structure revision number: 16
> > > Vendor Specific SMART Attributes with Thresholds:
> > > ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> > >   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    69
> > >   3 Spin_Up_Time            POS--K   178   170   021    -    6075
> > >   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
> > >   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0  
> > 
> > Interesting, no reallocation has occurred....
> > 
> > >   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
> > >   9 Power_On_Hours          -O--CK   066   066   000    -    25343
> > >  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
> > >  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
> > >  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> > > 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> > > 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746  
> > 
> > Hum, just noticed this.  25k hours power on, 2M load cycles; this is
> > very hard on a hard drive.  Your drive is going into power save mode
> > and unloading the heads.  In fact at a rate of 81 times per hour?
> > Oh, I can not believe that.  Either way we need to get this stopped;
> > it will wear your drives out.
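
If the drive honors ATA APM you may be able to calm the parking down
with camcontrol; note that many WD Greens ignore APM and need the
vendor wdidle3/idle3 tool instead, so treat this as a sketch:

    # see what the drive currently advertises
    camcontrol identify ada6 | grep -i 'power management'
    # 254 = maximum performance, i.e. as little head parking as the
    # firmware allows (only effective if the drive honors APM)
    camcontrol apm ada6 -l 254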
> > 
> > > 194 Temperature_Celsius     -O---K   122   109   000    -    28
> > > 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> > > 197 Current_Pending_Sector  -O--CK   200   200   000    -    0
> > > 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> > > 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> > > 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
> > >                             ||||||_ K auto-keep
> > >                             |||||__ C event count
> > >                             ||||___ R error rate
> > >                             |||____ S speed/performance
> > >                             ||_____ O updated online
> > >                             |______ P prefailure warning
> > > [...]
> > > 
> > > 
> > > 197 Current_Pending_Sector decreased to zero so far, but with every reboot, the error
> > > count seems to increase:  
> > 
> > Ok, some drive firmware will, at power on, try to test the pending
> > sector list and clear an entry if it can actually read the sector.
> > 
> > > 
> > > [...]
> > > Error 47 [22] occurred at disk power-on lifetime: 25343 hours (1055 days + 23 hours)
> > >   When the command that caused the error occurred, the device was active or idle.
> > > 
> > >   After command completion occurred, registers were:
> > >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> > >   -- -- -- == -- == == == -- -- -- -- --
> > >   40 -- 51 00 00 00 00 c2 19 d9 88 40 00  Error: UNC at LBA = 0xc219d988 = 3256473992
> > > 
> > >   Commands leading to the command that caused the error were:
> > >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
> > >   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
> > >   60 00 b0 00 d0 00 00 c2 19 da 28 40 08  1d+07:12:34.336  READ FPDMA QUEUED
> > >   60 00 b0 00 c8 00 00 c2 19 d9 78 40 08  1d+07:12:34.336  READ FPDMA QUEUED
> > >   2f 00 00 00 01 00 00 00 00 00 10 40 08  1d+07:12:34.336  READ LOG EXT
> > >   60 00 b0 00 b8 00 00 c2 19 da 28 40 08  1d+07:12:31.484  READ FPDMA QUEUED
> > >   60 00 b0 00 b0 00 00 c2 19 d9 78 40 08  1d+07:12:31.483  READ FPDMA QUEUED
> > > 
> > > 
> > > I think this is watching an HDD die, isn't it?  
> > 
> > It could be; we need to see as many of the other 46 errors as we can to
> > make a decision on that.  Probably only the last 5 are in the log, though.
> 
> As said above, see attached file ;-)

I see 2 LBAs that have come up, and as I suspected only the last few errors
are in the log:
  40 -- 51 00 00 00 00 c2 19 d9 88 40 00  Error: UNC at LBA = 0xc219d988 = 3256473992
  40 -- 51 00 00 00 00 c2 19 d9 88 40 00  Error: UNC at LBA = 0xc219d988 = 3256473992
  40 -- 51 00 00 00 00 c2 19 d9 88 40 00  Error: UNC at LBA = 0xc219d988 = 3256473992
  40 -- 51 00 00 00 00 c2 19 d9 88 40 00  Error: UNC at LBA = 0xc219d988 = 3256473992
  40 -- 51 00 00 00 00 c2 19 d9 88 40 00  Error: UNC at LBA = 0xc219d988 = 3256473992
  40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
  40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
  40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
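
You can poke at exactly those two sectors with dd to see whether they
are still unreadable (FreeBSD dd understands iseek):

    # a read error here means the sector is still bad
    dd if=/dev/ada6 of=/dev/null bs=512 iseek=3256473992 count=1
    dd if=/dev/ada6 of=/dev/null bs=512 iseek=3262804632 count=1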

> > > I'd say broken cabling would produce different errors, wouldn't it?  
> > Yes, there is a CRC error counter (attribute 199) that would increment on a cabling error.
> > 
> > > The Western Digital Green series HDD is a useful fellow when used as a
> > > single drive. I think there might be an issue with pairing 4 HDDs, 3 of them "Green",
> > > in a RAIDZ and physically sitting next to each other. Maybe it is time to replace
> > > them one by one ...  
> > 
> > I am more suspicious of them loading and unloading the heads at a rate
> > of more than once per minute!
> > 
> [ ... schnipp ... ]

At this point I would turn one of the ddrescue-type recovery tools loose
on just those 2 blocks and let it try, say, 100 times to read them.  If
the tool can read a block, use dd to write it back to the same place,
and that should let the drive do a repair.
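
With GNU ddrescue from ports (sysutils/ddrescue) that would look
something like this for the first LBA; positions are given in bytes,
hence the * 512, and the output file names are arbitrary:

    # 100 retry passes over one 512-byte sector at LBA 3256473992
    ddrescue -r100 -i $((3256473992 * 512)) -s 512 /dev/ada6 sector.bin sector.map
    # if it got the data, write it straight back so the drive can
    # rewrite (and, if need be, remap) the sector
    dd if=sector.bin of=/dev/ada6 oseek=3256473992 bs=512 count=1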

If a block cannot be read back after 100 tries, I would just nuke the block
with a dd oseek=N bs=512 count=1 if=/dev/zero of=/dev/FOO.  Check to see
whether the drive added a reallocation, and then check that you can now read
the block back with dd.  If both of those are true, I would run a zfs scrub
to get the correct data into the zeroed block(s).
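
Strung together for one of the LBAs, with the pool name as a placeholder:

    # force the drive to rewrite (or remap) the dead sector
    dd if=/dev/zero of=/dev/ada6 oseek=3256473992 bs=512 count=1
    # did Reallocated_Sector_Ct move, and does the sector read back now?
    smartctl -A /dev/ada6 | egrep 'Reallocated|Pending|Offline_Un'
    dd if=/dev/ada6 of=/dev/null bs=512 iseek=3256473992 count=1
    # let ZFS put the correct data back ("tank" is a placeholder)
    zpool scrub tank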


-- 
Rod Grimes                                                 rgrimes_at_freebsd.org
Received on Wed Dec 13 2017 - 12:23:27 UTC
