Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

From: Rodney W. Grimes <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net> Date: Tue, 12 Dec 2017 14:55:49 -0800 (PST)

> Am Tue, 12 Dec 2017 10:52:27 -0800 (PST)
> "Rodney W. Grimes" <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net> schrieb:
> 
> 
> Thank you for answering that fast!
> 
> > > Hello,
> > > 
> > > running CURRENT (recent r326769), I realised that smartd (smartmontools) sends some console
> > > messages when booting the box:
> > > 
> > > [...]
> > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently unreadable (pending) sectors
> > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Offline uncorrectable sectors
> > > [...]
> > > 
> > > Checking the drive's SMART log with smartctl (it is one of four 3TB disk drives), I
> > > gathered this information:
> > > 
> > > [... smartctl -x /dev/ada6 ...]
> > > Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days + 15 hours)
> > >   When the command that caused the error occurred, the device was active or idle.
> > > 
> > >   After command completion occurred, registers were:
> > >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> > >   -- -- -- == -- == == == -- -- -- -- --
> > >   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
> > > 
> > >   Commands leading to the command that caused the error were:
> > >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
> > >   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
> > >   60 00 b0 00 88 00 00 c2 7a 73 20 40 08     23:38:12.195  READ FPDMA QUEUED
> > >   60 00 b0 00 80 00 00 c2 7a 72 70 40 08     23:38:12.195  READ FPDMA QUEUED
> > >   2f 00 00 00 01 00 00 00 00 00 10 40 08     23:38:12.195  READ LOG EXT
> > >   60 00 b0 00 70 00 00 c2 7a 73 20 40 08     23:38:09.343  READ FPDMA QUEUED
> > >   60 00 b0 00 68 00 00 c2 7a 72 70 40 08     23:38:09.343  READ FPDMA QUEUED
> > > [...]
> > > 
> > > and
> > > 
> > > [...]
> > > SMART Attributes Data Structure revision number: 16
> > > Vendor Specific SMART Attributes with Thresholds:
> > > ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> > >   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    64
> > >   3 Spin_Up_Time            POS--K   178   170   021    -    6075
> > >   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
> > >   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
> > >   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
> > >   9 Power_On_Hours          -O--CK   066   066   000    -    25339
> > >  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
> > >  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
> > >  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> > > 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> > > 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
> > > 194 Temperature_Celsius     -O---K   122   109   000    -    28
> > > 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> > > 197 Current_Pending_Sector  -O--CK   200   200   000    -    1
> > > 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> > > 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> > > 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
> > >                             ||||||_ K auto-keep
> > >                             |||||__ C event count
> > >                             ||||___ R error rate
> > >                             |||____ S speed/performance
> > >                             ||_____ O updated online
> > >                             |______ P prefailure warning
> > > 
> > > [...]  
> > 
> > The data up to this point tells us that you have 1 bad sector
> > on a 3TB drive. That is actually an expected event: given the data
> > error rate of these drives, you are going to see one now
> > and again.
> > 
> > Given you have 1 single event I would not suspect that this drive
> > is dying, but it would be prudent to prepare for that possibility.
> 
> Hello.
> 
> Well, I simply copied just one of the events that have been logged so far.
> 
> As you (and I) can see, it is error #42. After I posted here, a reboot took place
> because the time estimate of the "repair" process on the pool suddenly increased, and now I'm
> at error #47. Interestingly, it is a new block that is damaged, and the SMART attribute
> fields show this for now:

Can you send the complete output of smartctl -a /dev/foo? I somehow missed
that 40+ other errors had occurred.

> [...]
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    69
>   3 Spin_Up_Time            POS--K   178   170   021    -    6075
>   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0

Interesting, no reallocation has occurred....

>   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
>   9 Power_On_Hours          -O--CK   066   066   000    -    25343
>  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
>  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
>  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746

Hmm, just noticed this.  25k power-on hours and 2M load cycles: this is
very hard on a hard drive.  Your drive is going into power save mode
and unloading the heads.  In fact, at a rate of 81 times per hour?
Oh, I cannot believe that.  Either way we need to get this stopped;
it will wear your drives out.
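
The rate quoted here is attribute 193 (Load_Cycle_Count) divided by attribute 9 (Power_On_Hours). A small sketch of that arithmetic, using the raw values from the second smartctl output; the helper name cycles_per_hour is made up for illustration:

```shell
# cycles_per_hour LOAD_CYCLES POWER_ON_HOURS
# Prints the average number of head load/unload cycles per power-on hour.
cycles_per_hour() {
    awk -v cycles="$1" -v hours="$2" 'BEGIN { printf "%.1f\n", cycles / hours }'
}

cycles_per_hour 2055746 25343    # prints 81.1
```

That works out to roughly 1.35 unloads per minute, averaged over the whole life of the drive.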

> 194 Temperature_Celsius     -O---K   122   109   000    -    28
> 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> 197 Current_Pending_Sector  -O--CK   200   200   000    -    0
> 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> [...]
> 
> 
> 197 Current_Pending_Sector decreased to zero so far, but with every reboot, the error
> count seems to increase:

OK, some drive firmware will, at the power-on event, try to test the
pending sector list and clear an entry if it can actually read the sector.

> 
> [...]
> Error 47 [22] occurred at disk power-on lifetime: 25343 hours (1055 days + 23 hours)
>   When the command that caused the error occurred, the device was active or idle.
> 
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 c2 19 d9 88 40 00  Error: UNC at LBA = 0xc219d988 = 3256473992
> 
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 b0 00 d0 00 00 c2 19 da 28 40 08  1d+07:12:34.336  READ FPDMA QUEUED
>   60 00 b0 00 c8 00 00 c2 19 d9 78 40 08  1d+07:12:34.336  READ FPDMA QUEUED
>   2f 00 00 00 01 00 00 00 00 00 10 40 08  1d+07:12:34.336  READ LOG EXT
>   60 00 b0 00 b8 00 00 c2 19 da 28 40 08  1d+07:12:31.484  READ FPDMA QUEUED
>   60 00 b0 00 b0 00 00 c2 19 d9 78 40 08  1d+07:12:31.483  READ FPDMA QUEUED
> 
> 
> I think this is watching a HDD dying, isn't it?

It could be; we need to see as many of the other 46 errors as we can to make
a decision on that.   Probably only 5 are in the log, though.

> I'd say, a broken cabling would produce different errors, wouldn't it?

Yes, a cabling problem would show up as CRC errors (attribute 199,
UDMA_CRC_Error_Count, which is still 0 in your output).

> The Western Digital Green series HDD is a useful fellow when the HDD is used as a single
> drive. I think there might be an issue with pairing 4 HDDs, 3 of them "GREEN", in a RAIDZ
> and physically sitting next to each other. Maybe it is time to replace them one by one ...

I am more suspicious of them loading and unloading the heads at a rate of
more than once per minute!

> > 
> > 
> > > 
> > > The ZFS pool is RAIDZ1, comprised of 3 WD Green 3TB HDDs and one WD RED 3TB HDD. The
> > > failure occurred on one of the WD Green 3TB HDDs.
> > Ok, so the data is redundantly protected.  This helps a lot.
> > 
> > > The pool is marked as "resilvered" - I do scrubbing on a regular basis and the
> > > "resilvering" message has now appeared for the second time in a row. What I found on
> > > the net about SMART attribute 197 errors (in my case the count is one), combined with
> > > the problems that occurred, suggests that I should replace the disk.
> > 
> > It is probably putting the RAIDZ in that state as the scrub is finding a block
> > it can not read.
> > 
> > > 
> > > Well, here comes the problem. The box is built from "electronic waste" made by
> > > ASRock - it is a Socket 1150/IvyBridge board, which got its last firmware/BIOS update
> > > in 2013, and since then UEFI booting FreeBSD from an HDD hasn't been possible (just to
> > > indicate that I'm aware of having issues with crap, but that is some other issue
> > > right now). The board's SATA connectors are all populated.
> > > 
> > > So: Due to the lack of adequate backup space I can only selectively back up portions;
> > > most of the space is occupied by scientific modelling data, which I had worked on. So
> > > a backup exists, in one way or another. My concern is how to replace the faulty HDD!
> > > Most HowTos describe preparing a replacement disk and then swapping it in via ZFS's
> > > replace command. This isn't applicable here.
> > > 
> > > Question: is it possible to simply pull the faulty disk (implies I know exactly which
> > > one to pull!) and then prepare and add the replacement HDD and let the system do its
> > > job resilvering the pool?  
> > 
> > That may work, but I think I have a simpler solution.
> > 
> > > 
> > > Next question is: I'm about to replace the 3 TB HDD with a more recent and modern 4 TB
> > > HDD (WD RED 4TB). I'm aware of the fact that I can only use 3 TB as the other disks
> > > are 3 TB, but I'd like to know whether FreeBSD's ZFS is capable of handling it?   
> > 
> > Someone else?
> > 
> > > 
> > > This is the first time I have issues with ZFS and a faulty drive, so if some of my
> > > questions sound naive, please forgive me.  
> > 
> > One thing to try is to see if we can get the drive to fix itself. First order
> > of business: can you take this server out of service?  If so, I would
> > simply try:
> > repeat 100 dd if=/dev/whicheverhdisbad of=/dev/null bs=512 count=1 iseek=3262804632 conv=noerror,sync
> > 
> > That tries to read that block 100 times (repeat is a csh/tcsh builtin); if it is
> > successful even 1 time, SMART should remap the block and you are all done.
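
For anyone not running csh/tcsh, where repeat is not available, the same idea can be sketched in plain /bin/sh. This is only an illustration: the function name try_read is made up, it stops at the first successful read instead of always running 100 times, and it uses skip=, which the FreeBSD dd manual documents as synonymous with iseek=.

```shell
# try_read DEVICE LBA [TRIES]
# Read one 512-byte sector at LBA up to TRIES times (default 100) and stop
# as soon as one read succeeds. Nothing is written to the device.
try_read() {
    dev=$1
    lba=$2
    tries=${3:-100}
    i=1
    while [ "$i" -le "$tries" ]; do
        if dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
            echo "read succeeded on attempt $i"
            return 0
        fi
        i=$((i + 1))
    done
    echo "all $tries attempts failed"
    return 1
}

# Example (device name is a placeholder, as in the mail above):
#   try_read /dev/whicheverhdisbad 3262804632
```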
> 
> Given the fact that this erroneous block is like a moving target, is this solution
> still the favorite one? I'll try, but I already have the replacement 4 TB HDD at hand.

It could also be experiencing a problem like the one I have with one of my 500G
Western Digital drives.  It has a sector that is marginal; it fails to read now and
then, so it ends up in the pending list.  As soon as I write to that sector the drive
decides it is fine and removes it from the pending list.   This has been
repeating for 10 years now.  When I have UFS on the drive I use the old badsect
tools to hide it away, but eventually I wipe the drive and end up back in this
situation.

> > 
> > If that fails we can try to zero the block. There is a risk here, but raidz should just
> > handle this as data corruption of a block.  This could possibly lead to data loss,
> > so USE AT YOUR OWN RISK ASSESSMENT.
> > dd if=/dev/zero of=/dev/whateverdrivehasissues bs=512 count=1 oseek=3262804632
> 
> It would then be oseek=3256473992, too.

Maybe do a dd read over a range of blocks, 100k blocks on each side of these
suspect areas:
dd if=/dev/FOO of=/dev/null bs=512 iseek=3262704632 count=200k
dd if=/dev/FOO of=/dev/null bs=512 iseek=3256373992 count=200k
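
The starting block for each window is just the failing LBA from the SMART error log minus the margin. A minimal sketch of that arithmetic, taking the hex LBA as smartctl prints it; the helper name read_window is made up, and it prints a count of exactly 2 x margin blocks, whereas dd's "200k" means 204800:

```shell
# read_window LBA [MARGIN]
# Print the iseek/count operands for a read that covers MARGIN blocks
# (default 100000) on each side of LBA. LBA may be hex (0x...) or decimal.
read_window() {
    lba=$(( $1 ))
    margin=${2:-100000}
    echo "iseek=$(( lba - margin )) count=$(( 2 * margin ))"
}

read_window 0xc27a7298    # prints iseek=3262704632 count=200000
read_window 0xc219d988    # prints iseek=3256373992 count=200000
```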

I have tools that do this type of operation but record the time
to complete each operation; I look over that list for outliers
indicating the drive has done retries.   OHHHH.. um.. there is
a sysctl to turn off ata retries, probably should hit that too.

kern.cam.ada.retry_count: 4
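
The "look for timing outliers" step can be roughed out with sort and awk. This is not the tool mentioned above, only a sketch under assumptions: the per-read timings have already been collected as lines of "LBA SECONDS", and a read counts as an outlier when it takes more than FACTOR times the median. The name flag_slow_reads and the default factor of 5 are arbitrary choices.

```shell
# flag_slow_reads [FACTOR]
# Read "LBA SECONDS" pairs on stdin and print the pairs whose time is more
# than FACTOR times the median time (default factor: 5).
flag_slow_reads() {
    LC_ALL=C sort -k2,2n | LC_ALL=C awk -v factor="${1:-5}" '
        { lba[NR] = $1; secs[NR] = $2 }
        END {
            if (NR == 0) exit
            # Input is sorted by time, so the middle entry is the median.
            median = secs[int((NR + 1) / 2)]
            for (i = 1; i <= NR; i++)
                if (secs[i] > factor * median)
                    print lba[i], secs[i]
        }'
}

# Example, with timings.txt being a hypothetical file of "LBA SECONDS" lines:
#   flag_slow_reads 10 < timings.txt
```

Blocks that show up here took far longer than their neighbours, which hints that the drive retried internally before returning data.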

Do not attempt to write the sector unless it reproducibly gives a
read error, as it won't do any good.

> > 
> > That should forcibly overwrite the bad block with 0's. The SMART firmware
> > will see this in the pending list, write the data, and read it back; if successful,
> > it removes the block from the pending list; if that fails, it reallocates the block,
> > writes the 0's to the reallocated sector, and adds 1 to the remapped block count.
> > 
> > You might google for "how to fix a pending reallocation"
> > 
> > > Thanks in advance,
> > > Oliver
> > > -- 
> > > O. Hartmann  
> > 
> 
> Kind regards,
> Oliver
> -- 
> O. Hartmann

-- 
Rod Grimes                                                 rgrimes_at_freebsd.org