Re: kern/93942: panic: ufs_dirbad: bad dir

From: Yarema <yds_at_CoolRat.org>
Date: Tue, 28 Feb 2006 18:43:58 -0500

--On February 28, 2006 2:53:43 PM -0500 Kris Kennaway <kris_at_obsecurity.org> 
wrote:

> On Tue, Feb 28, 2006 at 10:35:36AM -0500, Yarema wrote:
>>
>> > Number:         93942
>> > Category:       kern
>> > Synopsis:       panic: ufs_dirbad: bad dir
>> > Confidential:   no
>> > Severity:       critical
>> > Priority:       high
>> > Responsible:    freebsd-bugs
>> > State:          open
>> > Quarter:
>> > Keywords:
>> > Date-Required:
>> > Class:          sw-bug
>> > Submitter-Id:   current-users
>> > Arrival-Date:   Tue Feb 28 15:40:06 GMT 2006
>> > Closed-Date:
>> > Last-Modified:
>> > Originator:     Yarema <yds_at_CoolRat.org>
>> > Release:        FreeBSD 6.1-PRERELEASE i386
>> > Organization:
>> > Environment:
>> System: FreeBSD 6.1-PRERELEASE #0: Mon Feb 27 04:52:11 EST 2006 i386
>>
>> > Description:
>>
>> This is at least the third file system which got hosed for me by the
>> ufs_dirbad bug on three different hard drives since 5.3 STABLE.
>> I suspect this is related to the following PRs:
>> http://www.FreeBSD.org/cgi/query-pr.cgi?pr=49079
>> http://www.FreeBSD.org/cgi/query-pr.cgi?pr=51001
>>
>> In every case a process would lock up making the whole system
>> unresponsive.  A reboot, fsck -y in single user mode and another
>> reboot would produce the following during the mount of the corrupt
>> fs in rw mode:
>>
>> bad dir ino 2 at  offset 16384: mangled entry
>> panic: ufs_dirbad: bad dir
>> cpuid = 0
>>
>> Another reboot, fsck -y in single user mode and reboot produces the
>> same results repeatedly.  Previously I had recovered by mounting the
>> corrupt fs in ro mode, backup, newfs, restore.
>>
>> Recently I noticed Matthew Dillon commit the following to the
>> DragonFly src repository:
>>
>> http://leaf.DragonFlyBSD.org/mailarchive/commits/2006-02/msg00057.html
>>
>> dillon      2006/02/21 10:46:56 PST
>>
>> DragonFly src repository
>>
>>   Modified files:
>>     sys/kern             vfs_cluster.c
>>   Log:
>>   bioops.io_start() was being called in a situation where the buffer
>>   could be brelse()'d afterwards instead of I/O being initiated.  When
>>   this occurs, the buffer may contain softupdates-modified data which is
>>   never reverted, resulting in serious filesystem corruption.  When
>>   io_start is called on a buffer, I/O MUST be initiated and terminated
>>   with a biodone() or the buffer's data may not be properly reverted.
>>
>>   Solve the problem by moving the io_start() call a little further on in
>>   the code, after the potential brelse().
>>
>>   There is a possibility that this bug is responsible for the 'dirbad'
>>   panics often reported in DragonFly and FreeBSD circles.
>>
>>   Revision  Changes    Path
>>   1.16      +7 -6      src/sys/kern/vfs_cluster.c
>>
>> http://www.DragonFlyBSD.org/cvsweb/src/sys/kern/vfs_cluster.c.diff?r1=1.15&r2=1.16&f=u
>>
>> Below is the equivalent patch to the FreeBSD RELENG_6 branch of
>> src/sys/kern/vfs_cluster.c
>>
>> Hope this helps track down the problem.
>
> Does it work for you? :)
>
> Kris

No way for me to know yet.  From what I gathered, mostly from this thread:
<http://docs.FreeBSD.org/cgi/getmsg.cgi?fetch=331058+0+archive/2006/freebsd-current/20060108.freebsd-current>
and from Matt Dillon's message:
<http://docs.FreeBSD.org/cgi/getmsg.cgi?fetch=217892+0+/usr/local/www/db/text/2006/freebsd-current/20060226.freebsd-current>
the corruption occurs much earlier than any consequences can be felt.
The patch may prevent the corruption from occurring in the first place.
But the patch does nothing for me now that I have a huge /home slice
which cannot even be mounted as read-only in single user mode without
triggering a page fault kernel panic in the mount process no matter
how many times I run fsck -f on it.
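
For reference, the ordering problem described in the quoted DragonFly commit
log can be modelled in user space roughly as follows.  This is only a toy
sketch, not the kernel code and not the patch itself: struct toy_buf,
toy_io_start(), toy_biodone(), toy_brelse(), cluster_buggy(), cluster_fixed()
and toy_left_modified() are all hypothetical names, and the "modified" flag
merely stands in for the softupdates-modified data that the log says must be
reverted once I/O completes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a buffer; this is NOT the kernel's struct buf. */
struct toy_buf {
    int  data;      /* contents that would go to disk */
    int  saved;     /* in-memory contents stashed by toy_io_start() */
    bool modified;  /* true while data holds the temporary, unreverted value */
};

/* Models io_start: temporarily alter the buffer before it is written. */
static void toy_io_start(struct toy_buf *bp)
{
    assert(!bp->modified);  /* this toy model does not nest io_start */
    bp->saved = bp->data;
    bp->data = 0;
    bp->modified = true;
}

/* Models biodone: the I/O finished, so the alteration is reverted. */
static void toy_biodone(struct toy_buf *bp)
{
    if (bp->modified) {
        bp->data = bp->saved;
        bp->modified = false;
    }
}

/* Models brelse: the buffer is released without I/O; nothing is reverted. */
static void toy_brelse(struct toy_buf *bp)
{
    (void)bp;
}

/* Ordering described as buggy: io_start runs before a possible brelse. */
void cluster_buggy(struct toy_buf *bp, bool do_io)
{
    toy_io_start(bp);
    if (!do_io) {
        toy_brelse(bp);     /* data is left in its altered state */
        return;
    }
    toy_biodone(bp);
}

/* Ordering described as the fix: io_start moved after the possible brelse. */
void cluster_fixed(struct toy_buf *bp, bool do_io)
{
    if (!do_io) {
        toy_brelse(bp);
        return;
    }
    toy_io_start(bp);
    toy_biodone(bp);
}

/* Returns true if the chosen path leaves the buffer unreverted. */
bool toy_left_modified(void (*path)(struct toy_buf *, bool), bool do_io)
{
    struct toy_buf b = { .data = 42, .saved = 0, .modified = false };
    path(&b, do_io);
    return b.modified;
}
```

In this model only cluster_buggy() with do_io set to false leaves the buffer
unreverted; every other combination ends with the original contents back in
place, which matches the commit log's rule that io_start must always be
followed by real I/O and a biodone().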

FWIW the page fault in the mount process is a different sort of kernel 
panic than what is described in this kern/93942 PR above.  The page fault 
occurs while attempting to mount read-only.  Attempting to mount read-write 
causes the panic: ufs_dirbad: bad dir
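
As a rough illustration of what a "mangled entry" can mean, the sketch below
applies the kind of sanity checks a UFS-style directory entry is expected to
pass.  It is a simplified user-space model, not the kernel's routine: struct
toy_dirent, TOY_DIRBLKSIZ, toy_min_reclen() and toy_entry_is_mangled() are
hypothetical names, and the 512-byte directory block size and 8-byte entry
header are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_DIRBLKSIZ 512u  /* assumed directory block size */

/* Fixed-size header of a directory entry; the name bytes would follow. */
struct toy_dirent {
    uint32_t d_ino;     /* inode number */
    uint16_t d_reclen;  /* length of this record in bytes */
    uint8_t  d_type;    /* file type */
    uint8_t  d_namlen;  /* length of the name */
};

/* Smallest record that can hold the 8-byte header, the name and a
 * terminating NUL, rounded up to a multiple of 4 bytes. */
static unsigned toy_min_reclen(unsigned namlen)
{
    return (8u + namlen + 1u + 3u) & ~3u;
}

/* Returns true if the entry fails any of the basic consistency checks. */
bool toy_entry_is_mangled(const struct toy_dirent *ep, unsigned offset_in_dir)
{
    /* Bytes left in the directory block that contains this entry. */
    unsigned room = TOY_DIRBLKSIZ - (offset_in_dir & (TOY_DIRBLKSIZ - 1u));

    if ((ep->d_reclen & 0x3u) != 0)
        return true;    /* record length not 4-byte aligned */
    if (ep->d_reclen > room)
        return true;    /* record would cross the block boundary */
    if (ep->d_reclen < toy_min_reclen(ep->d_namlen))
        return true;    /* record too short for its own name */
    return false;
}
```

With these assumptions, an entry whose d_reclen is zero, unaligned, or larger
than the space left in its block is reported as mangled, while a 12-byte
record with a one-character name at offset 16384 passes.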

One more note: when the machine is locked up (before the reboot and mount 
attempt that cause the panic), pressing the power button produces the 
following output every time:

kernel: acpi: suspend request ignored (not ready yet)

Seems like there are two separate problems:
1) the root cause of the bad dir corruption.
2) fsck -f doesn't fix it no matter how many times you run it.

Any pointers on how to recover my /home slice will be greatly appreciated.

-- 
Yarema