Re: bg fsck and fs corruption

From: Robert Watson <rwatson_at_freebsd.org>
Date: Sat, 12 Jun 2004 15:55:40 -0400 (EDT)

On Sat, 12 Jun 2004, Anthony Ginepro wrote:

> > If you allow bgfsck to complete, does it eventually clean this up? 
> 
> I already had similar "corruptions" (as I never lost a file that way it
> isn't as terrible as a really corrupted file). 

It's worth noting that the problem I'm describing isn't actually
corruption as defined by soft updates: the on-disk state is consistent
under the soft updates consistency model.  The problem is that it conflicts
with reasonable user expectation ("No, there really isn't anything in the
directory, so don't tell me it's not empty!").
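
As a rough illustration of how a directory can list as empty and yet
refuse removal, here is a toy Python model.  It assumes the mismatch comes
from a stored link count that had not yet been decremented when the system
went down; nothing in this thread confirms that this is the actual cause.
ToyDir, listing, expected_nlink and rmdir_allowed are made-up names for
this sketch, not UFS interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDir:
    """Hypothetical stand-in for an on-disk directory (not a UFS structure)."""
    nlink: int = 2                                # "." plus the entry in the parent
    entries: dict = field(default_factory=dict)   # name -> ToyDir (subdirectories only)

    def listing(self):
        """What a listing would show: names only, the link count is not consulted."""
        return sorted(self.entries)

    def expected_nlink(self):
        """2 plus one ".." back-reference per subdirectory entry."""
        return 2 + len(self.entries)

    def rmdir_allowed(self):
        """Assumed rule: refuse removal unless there are no entries and nlink is 2."""
        return self.nlink == 2 and not self.entries


# Assumed post-crash state: the subdirectory's entry is gone, but the
# parent's stored link count was never decremented.
lib = ToyDir(nlink=3)
```

In this model lib.listing() is empty while lib.rmdir_allowed() is false.
Setting lib.nlink to lib.expected_nlink(), the kind of repair a link-count
check would make, lets the removal proceed.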

> A complete bgfsck doesn't clean this up, as it often chokes on this error
> (I can't remember the exact error report, something like "SOFTDEP
> INCONSISTENCY"). 

This is a result of the assumptions of soft updates being violated.  There
are a few reasons this might happen:

(1) Bug in UFS/soft updates, resulting in things being sent to disk
    without correct dependency ordering (or corruption or whatever).

(2) Bug in the storage layer, be it GEOM, device driver, et al, which
    causes ordering requirements to be lost, or acknowledges a write
    request as complete when it's not.

(3) Bug in the hardware, such as acknowledging a write request as complete
    when it's not.

They're all serious issues, especially in the presence of system failure
(e.g., power failure, panic, etc).  Soft updates offers some nice
efficiency gains and fairly reasonable guarantees, but a lot of cheap PC
hardware completely fails to meet its requirements now (drives will lie
and indicate a change was committed to disk when it's really just in
cache, for example).  That makes it hard to tell which of the three causes
is responsible in any given case.
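
The lying-drive case can be sketched with a toy Python model.  It assumes
a single ordering rule of the kind soft updates depends on (an inode must
be durable before a directory entry points at it) and a drive that
acknowledges writes while they still sit in volatile cache.  ToyDisk,
ordered_update and survives_consistently are hypothetical names for this
sketch; real drives and the real dependency tracking are far more
involved.

```python
class ToyDisk:
    """Toy drive with a volatile write cache (hypothetical, for illustration)."""

    def __init__(self, honest):
        self.honest = honest    # honest drives ack only once the block is durable
        self.platter = []       # blocks that survive power loss
        self.cache = []         # acknowledged blocks still in volatile cache

    def write(self, block):
        """Accept a write and return True once the drive acknowledges it."""
        if self.honest:
            self.platter.append(block)
        else:
            self.cache.append(block)   # acknowledged, but not yet durable
        return True

    def destage(self, block):
        """Move one cached block to the platter, in whatever order the drive picks."""
        self.cache.remove(block)
        self.platter.append(block)

    def power_loss(self):
        """Drop everything still in the cache and return what survived."""
        self.cache.clear()
        return list(self.platter)


INODE = "A: initialise inode"
DIRENT = "B: directory entry pointing at inode"


def ordered_update(disk):
    """Issue two dependent writes: B is only issued after A is acknowledged."""
    if disk.write(INODE):
        disk.write(DIRENT)


def survives_consistently(platter):
    """B on the platter without A means the ordering requirement was broken."""
    return INODE in platter or DIRENT not in platter


honest = ToyDisk(honest=True)
ordered_update(honest)
honest_state = honest.power_loss()

lying = ToyDisk(honest=False)
ordered_update(lying)
lying.destage(DIRENT)           # the drive happens to write B first
lying_state = lying.power_loss()
```

The file system issued the writes in the right order both times; only the
lying drive leaves B on the platter without A, which is a state the
recovery logic was never designed to expect.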

The cases I find most interesting, though, are the ones where we know the
system halted for a reason that doesn't give the disks an excuse not to
have eventually committed their writes.  A panic in the network stack, for
example, if it is fail-stop, shouldn't result in corruption that bgfsck
can't recover from.  And I've seen cases where bgfsck couldn't recover --
since there's no power-off, and there's a long delay before reboot, it's
unlikely either that the disks/controllers are losing state, or that the
state was discarded during the soft reboot.

> However an fsck from single user always cleaned this.

Still highly undesirable, though, as it means the expected consistency on
the disk that soft updates relies on isn't present.  UFS and UFS-like file
systems (ext2fs, etc) aren't laid out in such a way that it's possible to
be highly tolerant of disk corruption.  The UFS implementation tolerates
some sorts of flaws, but can panic (cycles in the name space) or
experience additional corruption for other flaws (such as multiple files
owning the same block).  Some of those corruption modes are more likely
than others in the presence of simple failures (power loss, etc).  I had a
conversation with Tom Van Vleck recently on the file system used in
Multics, which was capable of detecting and tolerating a broad range of
corruption, and there are some interesting ideas there I'd love to see in
a modern UNIX... 

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert_at_fledge.watson.org      Senior Research Scientist, McAfee Research

> > > 
> > > twinsun# rm -rf old
> > > rm: old/26422/usr/local/lib: Directory not empty
> > > rm: old/26422/usr/local: Directory not empty
> > > rm: old/26422/usr: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib/perl5/5.8.4/mach/auto/threads: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib/perl5/5.8.4/mach/auto: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib/perl5/5.8.4/mach: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib/perl5/5.8.4: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib/perl5: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf/lib: Directory not empty
> > > rm: old/26422/var/tmp/instmp.laCtQf: Directory not empty
> > > rm: old/26422/var/tmp: Directory not empty
> > > rm: old/26422/var: Directory not empty
> > > rm: old/26422: Directory not empty
> > > rm: old: Directory not empty
> > > twinsun# ls -l old/26422/usr/local/lib
> > > total 0
> > > 
> > > bg fsck noticed the usual softdep problems, but did not report or fix
> > > the corruption:
> > > 
> > > [...]
> > > Jun 12 07:38:47 twinsun fsck: /dev/da1c: INCORRECT BLOCK COUNT I=4381849 (4 should be 0) (CORRECTED)
> > > Jun 12 07:38:47 twinsun fsck: /dev/da1c: INCORRECT BLOCK COUNT I=4381850 (4 should be 0) (CORRECTED)
> > > Jun 12 07:38:47 twinsun fsck: /dev/da1c: INCORRECT BLOCK COUNT I=4381853 (4 should be 0) (CORRECTED)
> > > Jun 12 07:38:47 twinsun fsck:
> > > 
> > > Note the lack of summary line.  I don't know if it was trying to log
> > > the more serious corruption but didn't because of a bug, or if it just
> > > didn't detect it.
> > > 
> > > Kris
> > > 