On Sat, 12 Jul 2008 02:55:08 pm Brooks Davis wrote:
> On Sat, Jul 12, 2008 at 10:43:09AM +1000, Duncan Young wrote:
> > Be careful: I've just had a 6-disk raidz array die. It was a complete
> > failure which required a restore from backup. The controller card,
> > which had access to 4 of the disks, lost one disk, then a second, at
> > which point the machine panicked; upon reboot the raidz array was
> > useless ("Metadata corrupted"). I'm also getting reasonably frequent
> > machine lockups (panics) in the ZFS code. I'm going to start
> > collecting crash dumps to see if anyone can help in the next week or
> > two.
>
> If you look at the research on disk corruption and failure modes, both
> in recent proceedings of FAST and in the latest issue of ;LOGIN:, it's
> clear that any RAID-like scheme that does not tolerate double faults
> is likely to fail. In theory, ZFS should tolerate certain classes of
> faults better than some other technologies, but it can't deal with
> full-disk double faults unless you use raidz2.

In this case the problem, I believe, was the controller card/software.
Unfortunately, on a home system I can't spread my disks across
different controllers. The disks themselves are fine (I'm using them
right now). My only issue is that the pool decided it was corrupted as
soon as an error occurred on the second disk, even though I would have
thought that the metadata (which I believe is spread across multiple
disks) should not have had a chance to become irreparable.

i.e. from /var/log/messages:

<lots more I/O failure messages on /dev/da0>
Jul 10 18:39:27 triple0 root: ZFS: vdev I/O failure, zpool=big path=/dev/da0 offset=500028077056 size=1024 error=22
Jul 10 18:39:27 triple0 root: ZFS: vdev failure, zpool=big type=vdev.open_failed
Jul 10 18:40:14 triple0 kernel: hptrr: start channel [0,0]
Jul 10 18:40:25 triple0 kernel: hptrr: [0 0 ] failed to perform Soft Reset
Jul 10 18:40:25 triple0 kernel: hptrr: [0,0,0] device disconnected on channel
Jul 10 18:40:25 triple0 root: ZFS: vdev I/O failure, zpool=big path=/dev/da1 offset=56768110080 size=512 error=22
<panic and reboot>
Jul 10 20:15:36 triple0 syslogd: kernel boot file is /boot/kernel/kernel

I have had quite a few corruptions from writing to non-parity
USB/FireWire drives (which have an unfortunate tendency to "drop out"
partway through a send/receive), but those just require a scrub and
destroying the corrupted snapshot (rough command sequence in the P.S.
below). I had never had a problem with importing the pool, until now.

regards
Duncan

>
> > I guess what I'm trying to say is that you can still lose everything
> > on an entire pool, so backups are still essential, and a couple of
> > smaller pools is probably preferable to one big pool (restore time
> > is less). ZFS is not 100% (yet?). The lack of any type of fsck still
> > causes me concern.
>
> Regardless of the technology, backups are essential. If you actually
> value your data, off-site backups are essential.
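P.S. For anyone hitting the same USB/FireWire drop-outs, the clean-up I
mentioned above is roughly the sequence below. This is only a sketch:
"big" is my pool, and "big/data@backup" is a placeholder for whichever
snapshot "zpool status -v" actually reports as holding damaged blocks.

    # List the files/snapshots ZFS has flagged with permanent errors
    zpool status -v big

    # Re-read and checksum every block; repairs what redundancy allows
    zpool scrub big

    # Once the scrub completes, re-check the remaining error list
    zpool status -v big

    # Destroy the snapshot that holds the unrecoverable blocks
    # (placeholder name; substitute whatever zpool status -v reported)
    zfs destroy big/data@backup

    # Clear the pool's logged error counters
    zpool clear big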