> On Tue, 12 Dec 2017 10:52:27 -0800 (PST),
> "Rodney W. Grimes" <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net> wrote:
>
> Thank you for answering that fast!
>
> > > Hello,
> > >
> > > running CURRENT (recent r326769), I realised that smartd sends out some console
> > > messages when booting the box:
> > >
> > > [...]
> > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently unreadable (pending) sectors
> > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Offline uncorrectable sectors
> > > [...]
> > >
> > > Checking the drive's SMART log with smartctl (it is one of four 3TB disk drives), I
> > > gather this information:
> > >
> > > [... smartctl -x /dev/ada6 ...]
> > > Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days + 15 hours)
> > >   When the command that caused the error occurred, the device was active or idle.
> > >
> > >   After command completion occurred, registers were:
> > >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> > >   -- -- -- == -- == == == -- -- -- -- --
> > >   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
> > >
> > >   Commands leading to the command that caused the error were:
> > >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
> > >   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
> > >   60 00 b0 00 88 00 00 c2 7a 73 20 40 08     23:38:12.195  READ FPDMA QUEUED
> > >   60 00 b0 00 80 00 00 c2 7a 72 70 40 08     23:38:12.195  READ FPDMA QUEUED
> > >   2f 00 00 00 01 00 00 00 00 00 10 40 08     23:38:12.195  READ LOG EXT
> > >   60 00 b0 00 70 00 00 c2 7a 73 20 40 08     23:38:09.343  READ FPDMA QUEUED
> > >   60 00 b0 00 68 00 00 c2 7a 72 70 40 08     23:38:09.343  READ FPDMA QUEUED
> > > [...]
> > >
> > > and
> > >
> > > [...]
> > > SMART Attributes Data Structure revision number: 16
> > > Vendor Specific SMART Attributes with Thresholds:
> > > ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> > >   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    64
> > >   3 Spin_Up_Time            POS--K   178   170   021    -    6075
> > >   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
> > >   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
> > >   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
> > >   9 Power_On_Hours          -O--CK   066   066   000    -    25339
> > >  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
> > >  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
> > >  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> > > 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> > > 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
> > > 194 Temperature_Celsius     -O---K   122   109   000    -    28
> > > 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> > > 197 Current_Pending_Sector  -O--CK   200   200   000    -    1
> > > 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> > > 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> > > 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
> > >                             ||||||_ K auto-keep
> > >                             |||||__ C event count
> > >                             ||||___ R error rate
> > >                             |||____ S speed/performance
> > >                             ||_____ O updated online
> > >                             |______ P prefailure warning
> > >
> > > [...]
> >
> > The data up to this point informs us that you have 1 bad sector
> > on a 3TB drive. That is actually an expected event: given the data
> > error rate of these drives, you are going to have these now
> > and again.
> >
> > Given you have 1 single event I would not suspect that this drive
> > is dying, but it would be prudent to prepare for that possibility.
>
> Hello.
>
> Well, I simply copied one single event out of those logged so far.
>
> As you (and I) can see, it is error #42.
> After I posted here, a reboot took place because the estimated time of the "repair"
> process on the pool suddenly increased, and now I am at error #47. Interestingly, it is
> a new block that is damaged, and the SMART attribute fields show this for now:

Can you send the complete output of smartctl -a /dev/foo? I somehow missed
that 40+ other errors had occurred.

> [...]
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    69
>   3 Spin_Up_Time            POS--K   178   170   021    -    6075
>   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0

Interesting, no reallocation has occurred....

>   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
>   9 Power_On_Hours          -O--CK   066   066   000    -    25343
>  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
>  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
>  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746

Hum, just noticed this. 25k hours power on, 2M load cycles: this is
very hard on a hard drive. Your drive is going into power save mode
and unloading the heads. In fact, at a rate of 81 times per hour?
Oh, I cannot believe that. Either way we need to get this stopped;
it will wear your drives out.

> 194 Temperature_Celsius     -O---K   122   109   000    -    28
> 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> 197 Current_Pending_Sector  -O--CK   200   200   000    -    0
> 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> [...]
>
>
> 197 Current_Pending_Sector decreased to zero so far, but with every reboot, the error
> count seems to increase:

OK, some drive firmware will, at power-on, even try to test the pending
sector list and clear an entry if it can actually read the sector.

>
> [...]
> Error 47 [22] occurred at disk power-on lifetime: 25343 hours (1055 days + 23 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 c2 19 d9 88 40 00  Error: UNC at LBA = 0xc219d988 = 3256473992
>
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 b0 00 d0 00 00 c2 19 da 28 40 08  1d+07:12:34.336  READ FPDMA QUEUED
>   60 00 b0 00 c8 00 00 c2 19 d9 78 40 08  1d+07:12:34.336  READ FPDMA QUEUED
>   2f 00 00 00 01 00 00 00 00 00 10 40 08  1d+07:12:34.336  READ LOG EXT
>   60 00 b0 00 b8 00 00 c2 19 da 28 40 08  1d+07:12:31.484  READ FPDMA QUEUED
>   60 00 b0 00 b0 00 00 c2 19 d9 78 40 08  1d+07:12:31.483  READ FPDMA QUEUED
>
>
> I think this is watching a HDD dying, isn't it?

It could be; we need to see as many of the other 46 errors as we can to
make a decision on that. Probably only 5 in the log though.

> I'd say, a broken cabling would produce different errors, wouldn't it?

Yes, there is a CRC error that would occur on a cabling error.

> The Western Digital Green series HDD is a useful fellow when the HDD is used as a single
> drive. I think there might be an issue with pairing 4 HDDs, 3 of them "GREEN", in a RAIDZ
> and physically sitting next to each other. Maybe it is time to replace them one by one ...

I am more suspicious of them loading and unloading the heads at a rate
of more than once per minute!
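The head load/unload rate mentioned above follows directly from two raw values in the quoted attribute table. A minimal Python sketch of that arithmetic, assuming the raw values of attributes 9 and 193 are plain hour and cycle counts (it is not part of the original exchange):

```python
# Raw values copied from the quoted SMART attribute table.
power_on_hours = 25343   # attribute 9, Power_On_Hours
load_cycles = 2055746    # attribute 193, Load_Cycle_Count

# Average number of head load/unload cycles per powered-on hour.
cycles_per_hour = load_cycles / power_on_hours

# Average time between two load cycles, in seconds.
seconds_between_cycles = 3600 / cycles_per_hour

print(f"{cycles_per_hour:.1f} load cycles per hour")
print(f"one load cycle every {seconds_between_cycles:.0f} seconds on average")
```

This gives roughly 81 cycles per hour, or one cycle about every 44 seconds, which is consistent with "more than once per minute".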
>
> > >
> > > The ZFS pool is RAIDZ1, comprised of 3 WD Green 3TB HDDs and one WD RED 3 TB HDD. The
> > > failure occurred on one of the WD Green 3 TB HDDs.
> >
> > Ok, so the data is redundantly protected. This helps a lot.
> >
> > > The pool is marked as "resilvered" - I do scrubbing on a regular basis and the
> > > "resilvering" message has now appeared for the second time in a row. Searching the net
> > > on SMART attribute 197 errors (in my case the count is one), the recommendation, in
> > > combination with the problems that occurred, is that I should replace the disk.
> >
> > It is probably putting the RAIDZ in that state as the scrub is finding a block
> > it can not read.
> >
> > > Well, here comes the problem. The box is built from "electronic waste" made by
> > > ASRock - it is a Socket 1150/IvyBridge board, which got its last firmware/BIOS update
> > > in 2013, and since then UEFI booting FreeBSD from a HDD isn't possible (just to
> > > indicate that I'm aware of having issues with crap, but that is some other issue
> > > right now). The board's SATA connectors are all populated.
> > >
> > > So: due to the lack of adequate backup space I can only selectively back up portions;
> > > most of the space is occupied by scientific modelling data, which I had worked on. So
> > > a backup exists! In one way or the other. My concern is how to replace the faulty HDD!
> > > Most HowTos describe a replacement disk being prepared and then "replaced" via ZFS's
> > > replace command. This isn't applicable here.
> > >
> > > Question: is it possible to simply pull the faulty disk (implies I know exactly which
> > > one to pull!) and then prepare and add the replacement HDD and let the system do its
> > > job resilvering the pool?
> >
> > That may work, but I think I have a simpler solution.
> >
> > > Next question is: I'm about to replace the 3 TB HDD with a more recent and modern 4 TB
> > > HDD (WD RED 4TB).
> > > I'm aware of the fact that I can only use 3 TB as the other disks
> > > are 3 TB, but I'd like to know whether FreeBSD's ZFS is capable of handling it?
> >
> > Someone else?
> >
> > > This is the first time I have issues with ZFS and a faulty drive, so if some of my
> > > questions sound naive, please forgive me.
> >
> > One thing to try is to see if we can get the drive to fix itself. First order
> > of business: can you take this server out of service? If so I would
> > simply try to do a
> > repeat 100 dd if=/dev/whicheverhdisbad of=/dev/null bs=512 count=1 conv=noerror,sync iseek=3262804632
> >
> > That tries to read that block 100 times; if it is successful even 1 time,
> > SMART should remap the block and you are all done.
>
> Given the fact that this erroneous block is like a moving target, is this solution
> still the favorite one? I'll try, but I already have the replacement 4 TB HDD at hand.

It could also be experiencing a problem I have with one of my 500G Western
Digital drives. It has a sector that is marginal: it fails to read now
and then, so it ends up in the pending list. As soon as I write to that
sector the drive decides it is fine and removes it from the pending list.
This has been repeating for 10 years now. If I have UFS on the drive I
used the old badsect tools to hide it away, but eventually I wipe the drive
and end up back in this situation.

> > If that fails we can try to zero the block. There is a risk here, but raidz should just
> > handle this as a data corruption of a block. This could possibly lead to data loss,
> > so USE AT YOUR OWN RISK ASSESSMENT.
> > dd if=/dev/zero of=/dev/whateverdrivehasissues bs=512 count=1 oseek=3262804632
>
> It would then be oseek=3256473992, too.
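Both decimal block numbers used as iseek/oseek values above come from the hexadecimal register bytes in the quoted SMART error log entries. A minimal Python sketch, not part of the original exchange, that re-derives them; the helper name `lba_from_registers` is made up for illustration:

```python
def lba_from_registers(hex_bytes):
    """Join space-separated hex bytes (most significant first) into one LBA."""
    return int(hex_bytes.replace(" ", ""), 16)

# The LBA_48 and LH LM LL columns from the two quoted error log entries.
errors = {
    42: "00 00 c2 7a 72 98",
    47: "00 00 c2 19 d9 88",
}

lbas = {number: lba_from_registers(raw) for number, raw in errors.items()}

for number, lba in lbas.items():
    print(f"error {number}: LBA {lba} (0x{lba:x})")
```

The results match the "UNC at LBA" values that smartctl printed for errors 42 and 47.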
Maybe do a dd read over a range of blocks, 100k blocks on each side of
these suspect areas:
dd if=/dev/FOO of=/dev/null bs=512 iseek=3262704632 count=200k
dd if=/dev/FOO of=/dev/null bs=512 iseek=3256373992 count=200k

I have tools that do this type of operation but record the time to
complete each operation; I look over that list for outliers indicating
the drive has done retries.

OHHHH.. um.. there is a sysctl to turn off ata retries, probably should
hit that too.
kern.cam.ada.retry_count: 4

Do not attempt to write the sector unless it reproducibly gives a read
error, as it won't do any good.

> > That should forcibly overwrite the bad block with 0's. The SMART firmware
> > will see this in the pending list, write the data, and read it back; if successful
> > it removes the block from the pending list, and if that fails it reallocates the
> > block, writes the 0's to the reallocation, and adds 1 to the remapped block count.
> >
> > You might google for "how to fix a pending reallocation"
> >
> > > Thanks in advance,
> > > Oliver
> > > --
> > > O. Hartmann
>
> Kind regards,
> Oliver
> --
> O. Hartmann

-- 
Rod Grimes                                                 rgrimes_at_freebsd.org
Received on Tue Dec 12 2017 - 21:55:53 UTC
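The range read suggested in the reply above needs a start block and a block count for each suspect LBA. A minimal Python sketch, assuming "100k" means 100000 blocks (which is what the first command's iseek value implies); `read_window` and `MARGIN` are illustrative names, and the script only prints dd command lines, it does not run them:

```python
MARGIN = 100_000  # blocks on each side; equals 3262804632 - 3262704632

def read_window(lba, margin=MARGIN):
    """Return (iseek, count) covering `margin` blocks before and after `lba`."""
    start = max(lba - margin, 0)          # never seek before block 0
    count = (lba - start) + margin + 1    # +1 includes the suspect block itself
    return start, count

for lba in (3262804632, 3256473992):
    iseek, count = read_window(lba)
    # /dev/FOO is the placeholder device name used in the message.
    print(f"dd if=/dev/FOO of=/dev/null bs=512 iseek={iseek} count={count}")
```

Writing the count out as a plain decimal number avoids depending on how dd interprets a size suffix such as `k`.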