Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

From: Alan Somers <asomers_at_freebsd.org> Date: Tue, 12 Dec 2017 11:38:43 -0700

On Tue, Dec 12, 2017 at 11:21 AM, O. Hartmann <ohartmann_at_walstatt.org>
wrote:

> Hello,
>
> running CURRENT (recent r326769), I realised that smartd sends out some
> console messages when booting the box:
>
> [...]
> Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently unreadable (pending) sectors
> Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Offline uncorrectable sectors
> [...]
>
> Checking the drive's SMART log with smartctl (it is one of four 3TB disk
> drives), I gathered this information:
>
> [... smartctl -x /dev/ada6 ...]
> Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days + 15 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
>
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 b0 00 88 00 00 c2 7a 73 20 40 08     23:38:12.195  READ FPDMA QUEUED
>   60 00 b0 00 80 00 00 c2 7a 72 70 40 08     23:38:12.195  READ FPDMA QUEUED
>   2f 00 00 00 01 00 00 00 00 00 10 40 08     23:38:12.195  READ LOG EXT
>   60 00 b0 00 70 00 00 c2 7a 73 20 40 08     23:38:09.343  READ FPDMA QUEUED
>   60 00 b0 00 68 00 00 c2 7a 72 70 40 08     23:38:09.343  READ FPDMA QUEUED
> [...]
>
> and
>
> [...]
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    64
>   3 Spin_Up_Time            POS--K   178   170   021    -    6075
>   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
>   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
>   9 Power_On_Hours          -O--CK   066   066   000    -    25339
>  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
>  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
>  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
> 194 Temperature_Celsius     -O---K   122   109   000    -    28
> 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> 197 Current_Pending_Sector  -O--CK   200   200   000    -    1
> 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
>
> [...]
>
> The ZFS pool is a RAIDZ1 consisting of three WD Green 3 TB HDDs and one
> WD RED 3 TB HDD. The failure occurred on one of the WD Green 3 TB HDDs.
>
> The pool is marked as "resilvered" - I do scrubbing on a regular basis and
> the "resilvering" message has now appeared for the second time in a row.
> Searching the net on SMART attribute 197 errors (in my case the count is
> one), in combination with the problems that occurred, the recommendation
> is that I should replace the disk.
>
> Well, here comes the problem. The box is built from "electronic waste"
> made by ASRock - it is a Socket 1150/IvyBridge board, which got its last
> Firmware/BIOS update in 2013, and since then UEFI booting FreeBSD from a
> HDD isn't possible (just to indicate that I'm aware of having issues with
> crap, but that is some other issue right now). The board's SATA connectors
> are all populated.
>
> So: Due to the lack of adequate backup space I can only selectively back up
> portions; most of the space is occupied by scientific modelling data, which
> I had worked on. So a backup exists, in one way or another. My concern is
> how to replace the faulty HDD! Most HowTos describe preparing a replacement
> disk and then "replacing" the old one via ZFS's replace command. This isn't
> applicable here.
>
> Question: is it possible to simply pull the faulty disk (implies I know
> exactly which one to pull!) and then prepare and add the replacement HDD
> and let the system do its job resilvering the pool?
>

Absolutely.  If you don't know which disk to pull, then it's better to
power down and check serial numbers.  After you power back on, you can
replace the disk with a command like this:

    zpool replace <POOLNAME> <MISSING DISK GUID> <NEW DISK DEVNAME>

The missing disk's GUID can be obtained, after removing the old disk, from
"zpool status".

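As a rough sketch of the replace step above: the helper below only prints
the command so it can be reviewed before anything touches the pool.  The
function name replace_cmd, the pool name "tank" and the GUID are made-up
placeholders, not values from this thread; ada6 is the device name from the
report and is only assumed to be what the new disk attaches as.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints the "zpool replace" command
# instead of running it, so the arguments can be checked first.
replace_cmd() {
    pool=$1; guid=$2; newdev=$3
    # Refuse to build a command if any argument is missing.
    if [ -z "$pool" ] || [ -z "$guid" ] || [ -z "$newdev" ]; then
        echo "usage: replace_cmd POOL GUID NEWDEV" >&2
        return 1
    fi
    printf 'zpool replace %s %s %s\n' "$pool" "$guid" "$newdev"
}

# Example with placeholder values.  Take the real GUID from "zpool status"
# after the old disk has been removed.
replace_cmd tank 1234567890123456789 ada6
```

Once the printed line looks right, run it as root and follow the resilver
with "zpool status".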
>
> Next question is: I'm about to replace the 3 TB HDD with a more recent and
> modern 4 TB HDD (WD RED 4TB). I'm aware of the fact that I can only use
> 3 TB as the other disks are 3 TB, but I'd like to know whether FreeBSD's
> ZFS is capable of handling it?
>

Yes.  ZFS will have no problem using the first 3TB of a 4TB drive.
However, you ought to check the disk's physical sector size.  If the old
disks had 512B physical sectors and the new disk has 4096B physical
sectors, then performance will suffer, because the vdev keeps the sector
alignment (ashift) it was created with.  You can check by using the command
"diskinfo -v /dev/daXXX" (adaXXX for your SATA disks).  Look for the
"stripesize" row.

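A minimal sketch of the sector-size check, assuming "diskinfo -v" prints
each value followed by a "# name" comment (for example "4096  # stripesize").
The function names stripesize and compare_stripesize are hypothetical, and
the device names in the usage comment are only examples.

```shell
#!/bin/sh
# Print the stripesize value from "diskinfo -v" output read on stdin.
# Assumes lines of the form: <value> # <name>
stripesize() {
    awk '$2 == "#" && $3 == "stripesize" { print $1 }'
}

# Compare the value of an old disk with that of the new disk and report
# whether the new disk has the larger physical sector size.
compare_stripesize() {
    old=$1; new=$2
    if [ "$new" -gt "$old" ]; then
        echo "mismatch: new disk $new, old disks $old"
    else
        echo "ok: new disk $new, old disks $old"
    fi
}

# Usage on the FreeBSD box (device names are examples):
#   old=$(diskinfo -v /dev/ada5 | stripesize)
#   new=$(diskinfo -v /dev/ada6 | stripesize)
#   compare_stripesize "$old" "$new"
```

A "mismatch" result corresponds to the 512B-old, 4096B-new case described
above.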
>
> This is the first time I have issues with ZFS and a faulty drive, so if
> some of my questions sound naive, please forgive me.
>
> Thanks in advance,
>
> Oliver
>

-Alan