Re: Possible zpool online, resilvering issue

From: Allan Jude <allanjude_at_freebsd.org>
Date: Wed, 10 Aug 2016 13:48:53 -0400

On 2016-08-10 12:53, Ultima wrote:
> Hello,
> 
>> I didn't see any reply on the list, so I thought I might let you know
> 
> Sorry, never received this reply (till now) xD
> 
>> what I assume is happening:
> 
>> ZFS never updates data in place, which affects inode updates, e.g. if
>> a file has been read and access times must be updated. (For that reason,
>> many ZFS file systems are configured to ignore access time updates).
> 
>> Even if there were only R/O accesses to files in the pool, there will
>> have been updates to the inodes, which were missed by the offlined
>> drives (unless you ignore atime updates).
> 
>> But even if there are no access time updates, ZFS might have written
>> new uberblocks and other meta information. Check the POOL history and
>> see if there were any TXGs created during the scrub.
> 
>> If you scrub the pool while it is off-line, it should stay stable
>> (but if any information about the scrub, the offlining of drives etc.
>> is recorded in the pool's history log, differences are to be expected).
> 
>> Just my $.02 ...
> 
>> Regards, Stefan
> 
> Thanks for the reply. I'm not completely sure what would be considered a
> TXG. I maintained normal operations during most of this noise, and this
> pool has quite a bit of activity during normal operations. My zpool history
> looks like it goes on forever and the last scrub shows it repaired 9.48G.
> Was that all for these access time updates? I guess that would be a little
> less than 2.5G per disk.
> 
> The zpool history looks like it goes on forever (733373 lines). Much of
> this pool's activity comes from poudriere. All the entries I see are clone,
> destroy, rollback and snapshot operations. I can't really say how many, but
> at least 500 entries (prob many more than that) between the last two scrubs.
> Atime is off on all datasets.
> 
> So to be clear, this is expected behavior with atime=off + TXGs during
> offline time? I had thought that the resilver after onlining the disk would
> bring that disk up-to-date with the pool. I guess my understanding was a
> bit off.
> 
> Ultima

A new transaction group (TXG) is created at LEAST every
vfs.zfs.txg.timeout seconds (the default is 5).
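
To put a rough number on that, here is a small sketch that computes a floor
on how many TXGs a drive misses while it is offline. It assumes only the
timeout triggers new TXGs; a busy pool will create more. The function name
and the 3 hour example are made up for illustration; only the 5 second
default comes from the sysctl above.

```python
import math


def min_txgs_missed(offline_seconds, txg_timeout=5):
    """Lower bound on TXGs created while a drive was offline.

    Assumes a TXG is created at least once per txg_timeout seconds
    (vfs.zfs.txg.timeout). Write load can force more, so this is a floor.
    """
    if offline_seconds < 0 or txg_timeout <= 0:
        raise ValueError("offline_seconds must be >= 0 and txg_timeout > 0")
    return math.floor(offline_seconds / txg_timeout)


# Hypothetical example: a drive that was offline for 3 hours.
print(min_txgs_missed(3 * 60 * 60))
```

Even a few hours offline means thousands of TXGs that the drive never saw.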

If you offline a drive for hours or more, every block with a 'birth time'
newer than the last transaction that was recorded on the offlined drive
must be replayed onto it when it comes back online, to catch that drive up
to the other drives in the pool.
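
As a toy model of that rule (not how ZFS actually implements it), the
selection can be pictured as a filter on each block's birth TXG. The Block
type, the block names, and the TXG numbers below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str       # hypothetical label, for illustration only
    birth_txg: int  # TXG in which the block was written


def blocks_to_resilver(blocks, last_txg_on_drive):
    """Return the blocks born after the last TXG the offlined drive saw."""
    return [b for b in blocks if b.birth_txg > last_txg_on_drive]


pool_blocks = [
    Block("a", 100),
    Block("b", 150),
    Block("c", 151),
    Block("d", 400),
]

# Assume the drive went offline right after TXG 150 was recorded on it.
stale = blocks_to_resilver(pool_blocks, last_txg_on_drive=150)
print([b.name for b in stale])  # ['c', 'd']
```

Blocks "a" and "b" are already on the drive, so only the newer ones need to
be written to it.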

As long as you have enough redundancy, the checksum errors can be
corrected without concern.

In the end, the checksum errors can be written off as being caused by
the bad hardware. After you finish the scrub and everything is OK, do:
'zpool clear poolname', and it will reset all of the error and checksum
counts to 0, so you can track if any more ever show up.

-- 
Allan Jude