Re: kern/93942: panic: ufs_dirbad: bad dir

From: David Rhodus <drhodus_at_machdep.com>
Date: Wed, 1 Mar 2006 15:10:38 -0500

On 2/28/06, Yarema <yds_at_coolrat.org> wrote:
>
>
> --On February 28, 2006 2:53:43 PM -0500 Kris Kennaway <kris_at_obsecurity.org>
> wrote:
>
> > On Tue, Feb 28, 2006 at 10:35:36AM -0500, Yarema wrote:
> >>
> >> > Number:         93942
> >> > Category:       kern
> >> > Synopsis:       panic: ufs_dirbad: bad dir
> >> > Confidential:   no
> >> > Severity:       critical
> >> > Priority:       high
> >> > Responsible:    freebsd-bugs
> >> > State:          open
> >> > Quarter:
> >> > Keywords:
> >> > Date-Required:
> >> > Class:          sw-bug
> >> > Submitter-Id:   current-users
> >> > Arrival-Date:   Tue Feb 28 15:40:06 GMT 2006
> >> > Closed-Date:
> >> > Last-Modified:
> >> > Originator:     Yarema <yds_at_CoolRat.org>
> >> > Release:        FreeBSD 6.1-PRERELEASE i386
> >> > Organization:
> >> > Environment:
> >> System: FreeBSD 6.1-PRERELEASE #0: Mon Feb 27 04:52:11 EST 2006 i386
> >>
> >> > Description:
> >>
> >> This is at least the third file system which got hosed for me by the
> >> ufs_dirbad bug on three different hard drives since 5.3 STABLE.
> >> I suspect this is related to the following PRs:
> >> http://www.FreeBSD.org/cgi/query-pr.cgi?pr=49079
> >> http://www.FreeBSD.org/cgi/query-pr.cgi?pr=51001
> >>
> >> In every case a process would lock up making the whole system
> >> unresponsive.  A reboot, fsck -y in single user mode and another
> >> reboot would produce the following during the mount of the corrupt
> >> fs in rw mode:
> >>
> >> bad dir ino 2 at  offset 16384: mangled entry
> >> panic: ufs_dirbad: bad dir
> >> cpuid = 0
> >>
> >> Another reboot, fsck -y in single user mode and reboot produces the
> >> same results repeatedly.  Previously I had recovered by mounting the
> >> corrupt fs in ro mode, backup, newfs, restore.
> >>
> >> Recently I noticed Matthew Dillon commit the following to the
> >> DragonFly src repository:
> >>
> >> http://leaf.DragonFlyBSD.org/mailarchive/commits/2006-02/msg00057.html
> >>
> >> dillon      2006/02/21 10:46:56 PST
> >>
> >> DragonFly src repository
> >>
> >>   Modified files:
> >>     sys/kern             vfs_cluster.c
> >>   Log:
> >>   bioops.io_start() was being called in a situation where the buffer
> >>   could be brelse()'d afterwards instead of I/O being initiated.  When
> >>   this occurs, the buffer may contain softupdates-modified data which is
> >>   never reverted, resulting in serious filesystem corruption.  When
> >>   io_start is called on a buffer, I/O MUST be initiated and terminated
> >>   with a biodone() or the buffer's data may not be properly reverted.
> >>
> >>   Solve the problem by moving the io_start() call a little further on in
> >>   the code, after the potential brelse().
> >>
> >>   There is a possibility that this bug is responsible for the 'dirbad'
> >>   panics often reported in DragonFly and FreeBSD circles.
> >>
> >>   Revision  Changes    Path
> >>   1.16      +7 -6      src/sys/kern/vfs_cluster.c
> >>
> >> http://www.DragonFlyBSD.org/cvsweb/src/sys/kern/vfs_cluster.c.diff?r1=1.15&r2=1.16&f=u
> >>
> >> Below is the equivalent patch to the FreeBSD RELENG_6 branch of
> >> src/sys/kern/vfs_cluster.c
> >>
> >> Hope this helps track down the problem.
> >
> > Does it work for you? :)
> >
> > Kris
>
> No way for me to know yet.  From what I gathered, mostly from this thread:
> <http://docs.FreeBSD.org/cgi/getmsg.cgi?fetch=331058+0+archive/2006/freebsd-current/20060108.freebsd-current>
> and from Matt Dillon's message:
> <http://docs.FreeBSD.org/cgi/getmsg.cgi?fetch=217892+0+/usr/local/www/db/text/2006/freebsd-current/20060226.freebsd-current>
> the corruption occurs much earlier than any consequences can be felt.
> The patch may prevent the corruption from occurring in the first place.
> But the patch does nothing for me now that I have a huge /home slice
> which cannot even be mounted as read-only in single user mode without
> triggering a page fault kernel panic in the mount process no matter
> how many times I run fsck -f on it.
>
> FWIW the page fault in the mount process is a different sort of kernel
> panic than what is described in this kern/93942 PR above.  The page fault
> occurs while attempting to mount read-only.  Attempting to mount read-write
> causes the panic: ufs_dirbad: bad dir
>
> One more note: while the machine is locked up, before the reboot and the
> mount attempt that causes the panic, pressing the power button produces
> the following output every time:
>
> kernel: acpi: suspend request ignored (not ready yet)
>
> Seems like there are two separate problems:
> 1) the root cause of the bad dir corruption.
> 2) fsck -f doesn't fix it no matter how many times you run it.
>
> Any pointers on how to recover my /home slice will be greatly appreciated.
>
> --
> Yarema

I have been working with the bad dir problem for several months and I
have not had corruption which fsck would not correct.

-DR
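
For readers following the quoted DragonFly commit log, the ordering
problem it describes can be sketched roughly as below.  This is only a
toy user-space model, not the kernel code or the actual patch: every
name here (toy_buf, toy_io_start, toy_biodone, toy_brelse, buggy_path,
fixed_path) is made up for illustration, and "rolling the data back to
0" is an assumed stand-in for whatever softupdates really does to the
buffer in io_start.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a kernel buffer; all names are hypothetical. */
struct toy_buf {
	int  data;        /* current in-memory contents */
	int  saved;       /* contents stashed by the io_start hook */
	bool rolled_back; /* true while data holds the temporary image */
};

/* Models an io_start hook that temporarily modifies the buffer. */
void toy_io_start(struct toy_buf *bp)
{
	bp->saved = bp->data;
	bp->data = 0;          /* assumed "rolled back" image */
	bp->rolled_back = true;
}

/* Models I/O completion: the only place the modification is reverted. */
void toy_biodone(struct toy_buf *bp)
{
	assert(bp->rolled_back);
	bp->data = bp->saved;
	bp->rolled_back = false;
}

/* Models brelse(): the buffer is released, no I/O, nothing is reverted. */
void toy_brelse(struct toy_buf *bp)
{
	(void)bp;
}

/* Ordering the log describes as buggy: io_start before a possible brelse. */
int buggy_path(struct toy_buf *bp, bool skip_io)
{
	toy_io_start(bp);
	if (skip_io) {
		toy_brelse(bp);    /* modified data is never reverted */
		return bp->data;
	}
	toy_biodone(bp);
	return bp->data;
}

/* Ordering after the fix: io_start only once I/O is certain to happen. */
int fixed_path(struct toy_buf *bp, bool skip_io)
{
	if (skip_io) {
		toy_brelse(bp);    /* buffer was never touched */
		return bp->data;
	}
	toy_io_start(bp);
	toy_biodone(bp);
	return bp->data;
}
```

In this model, calling buggy_path() with skip_io set leaves the buffer
holding the temporary image, while fixed_path() leaves the original
contents intact on both branches.  That matches the log's statement
that io_start must always be paired with a biodone().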