On 2016-08-04 07:22, Ultima wrote:
> Hello,
>
> I recently had some issue with a PSU and ran several scrubs on a pool with
> around 35T. Random drives would drop and require a zpool online, this found
> checksum errors. (as expected) However, after all the scrubs I ran, I think
> I may have found a bug with zpool online resilvering process.
>
> 24 disks total, 4 vdevs raidz2 (6 drives each).
>
> Before this next part... I had a backup PSU, however it was also going bad
> and waiting for RMA. The current one seemed to be dieing but ran fine with
> less drives. So I decided I would run the server short 4 drives.
>
> Started by offline(or already removed from psu) 4 drives from different
> vdevs, then ran a scrub to verify everything. Many sum errors were present
> on some of the drives, but this was expected due to faulty psu. Then
> offlined 4 different drives and onlined the other 4 and scrubbed once
> again. After resilver, again, many sum errors on these drives as expected.
>
> After the scrub completed, I decided to offline 4 different drives, then
> online the ones that were out of pool for awhile. During the resilver,
> checksum errors were once again found. I was surprised due to the recent
> scrub, So I decided to run another scrub, and it found even more checksum
> errors on these recently onlined drives. I didn't think much about it,
> however after the replacement PSU arrived, I onlined all the drives out of
> pool and again, resilver had checksum errors as well as another scrub with
> more sum errors.
>
> Is this issue known? Is it common for a scrub to be required after onlining
> a disk that was out of pool for some time?
>
> The drives are ST4000NM0033, and until recent have never had a single
> checksum error in they're lifetime.(at least with zfs)
> FreeBSD S1 12.0-CURRENT FreeBSD 12.0-CURRENT #19 r303224: Sat Jul 23
> 10:41:12 EDT 2016
> root_at_S1:/usr/src/head/obj/usr/src/head/src/sys/MYKERNEL-NODEBUG
> amd64
>
>
> Sorry for the wall of text, but I hope this helps in tracking down this
> possible bug.
>

Perhaps one or more of the drives is running out of spare sectors for
reallocation? I once had a case where smartctl showed no issues but a ZFS
scrub showed a defect; some weeks later smartctl was showing some
reallocated sectors, and one week after that the HD was out of spare
sectors. Have you already tested every single HD for SMART issues?

-- 
olli

Received on Wed Aug 10 2016 - 16:56:21 UTC
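
For checking all 24 drives as suggested above, a minimal Python sketch could look like the following. It is only an illustration, not something from the thread: it assumes smartmontools is installed, that `smartctl -A` prints the usual ATA attribute table with the raw value in the tenth column, and that the three attributes named in `WATCHED` are the ones worth looking at. The function names and any device paths passed in (for example /dev/da0) are hypothetical.

```python
import subprocess

# Attributes that usually point at failing media (an assumption of this
# sketch; vendors differ in which attributes they report).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} parsed from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit test.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is non-zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


def read_smart(device):
    """Run `smartctl -A` on a device (hypothetical path) and return its text."""
    # smartctl uses its exit status as a bit mask, so a non-zero status is
    # not treated as a failure here.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return result.stdout
```

Calling suspicious(parse_smart_attributes(read_smart(dev))) for each disk in the pool returns an empty dict for a drive with clean counters. As the case described above shows, clean counters do not rule out a failing drive, so a long self-test (smartctl -t long) on each disk is still worth the time.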