Re: ZFS: Silent/hidden errors, nothing logged anywhere

From: Thomas Backman <serenity_at_exscape.org>
Date: Sat, 13 Jun 2009 09:32:09 +0200
On Jun 12, 2009, at 11:01 PM, Kip Macy wrote:

> On Fri, Jun 12, 2009 at 10:32 AM, Thomas Backman <serenity_at_exscape.org> wrote:
>> OK, so I filed a PR late May (kern/135050):
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=135050 .
>> I don't know if this is a "feature" or a bug, but it really should be
>> considered the latter. The data could be repaired in the background
>> without the user ever knowing - until the disk dies completely. I'd
>> prefer to have warning signs (i.e. checksum errors) so that I can buy
>> a replacement drive *before* that.
>>
>> Not only does this mean that errors can go unnoticed, but also that
>> it's impossible to figure out which disk is broken, if ZFS has
>> *temporarily* repaired the broken data! THAT is REALLY bad!
>> Is this something that we can expect to see changed before
>> 8.0-RELEASE?
>
>
> I'm fairly certain that we've discussed this already. Solaris uses FMA
> - I don't think that I'll get to a "real fix" any time soon. The time
> that I do have will go to addressing stability problems (memory
> over-allocation, NFS interaction, control directory mounts) all of
> which cause panics. Maintaining them persistently in the label doesn't
> make sense  -  when do you drop them? Would a simple log message about
> the number of checksum errors suffice?
>
> Cheers,
> Kip
Yes, I suppose a log message would be OK, especially if there's a
semi-simple way of mailing root automatically (either by the ZFS libs
themselves, or by a simple log analyzer daemon, of which I'm sure there
are plenty already). I do think that storing the error counts in the
label makes sense, though; but if Solaris doesn't do it, I suppose we
shouldn't, either. IF stored that way, they should IMHO remain until a
"zpool clear" is executed on the device (a device that causes errors is
a device that causes errors - most of the time, this is a great way for
the disk to say "hey, I'm dying here!"). In practice, the counts are
already cleared on reboot (although the relevant functions are of course
never called).
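
To illustrate the kind of simple analyzer I mean, here is a rough sketch
in Python. It is only an illustration under assumptions: it scans the
text of "zpool status" rather than a log message (which does not exist
yet), it assumes the usual NAME/STATE/READ/WRITE/CKSUM column layout with
plain integer counts, and the function name, the sample output and the
device names (ad4, ad6) are made up for the example.

```python
import re

# Matches one device row of `zpool status`: NAME STATE READ WRITE CKSUM.
# Assumption: counts are plain integers and STATE is one of the words below.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def checksum_errors(status_text):
    """Return {device: cksum_count} for every row with a non-zero CKSUM column."""
    errors = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match and int(match.group(5)) > 0:
            errors[match.group(1)] = int(match.group(5))
    return errors


# Hypothetical sample output, made up for this example.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     7
"""

if __name__ == "__main__":
    # Print one line per affected device; print nothing when all counts are zero.
    for device, count in checksum_errors(SAMPLE).items():
        print(f"{device}: {count} checksum error(s)")
```

If something like this were fed the real "zpool status" output from a
cron job, cron would mail any printed lines to the job's owner, so root
would hear about it only when a count is non-zero. It obviously cannot
help with counts that are never recorded or that are lost on reboot,
which is the actual problem here.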

Regards,
Thomas
Received on Sat Jun 13 2009 - 05:32:18 UTC
