Re: SU+J systems do not fsck themselves

From: Scott Long <scottl_at_samsco.org>
Date: Tue, 27 Dec 2011 23:54:44 -0700
On Dec 27, 2011, at 10:14 PM, David Thiel wrote:

> On Tue, Dec 27, 2011 at 02:48:22PM -0800, Xin Li wrote:
>>>> - use journalled fsck;
>>>> - use normal fsck to check if the journalled fsck did the
>>>>   right thing.
> 
> Ok, here is the log of fsck with and without journal.
> 
> http://redundancy.redundancy.org/fscklog3
> 

The first run of fsck, using the journal, gives the results that I would expect.  The second run seems to imply that the fixes made on the first run didn't actually get written to disk.  This is definitely an oddity.  I see that you're using geli; maybe there's some strange side-effect there, but I have no idea.  Please report this as a bug; it is definitely undesired behavior.

> That was done the very next boot, after a clean shutdown. The errors 
> from the previous live fsck aren't there (oddly), but there are still 
> apparently some corrections made. The next fsck still complains, but 
> doesn't give any salvage prompts.
> 
> Here is jsa_at_'s, done on a live FS with SU+J:
> 
> http://redundancy.redundancy.org/fscklog4
> 

For the love of all that is good and holy, don't ever run fsck on a live filesystem.  It's going to report these kinds of problems!  That's normal: filesystem metadata updates stay cached in memory, and fsck reads the disk directly, bypassing that cache.  Also, what you see in your log is a file that has been unlinked but is still held open.  This is a common Unix idiom, and one that gets cleaned up by fsck on reboot, whether through the SUJ intent log processing or through a traditional fsck.
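
For anyone unfamiliar with the unlinked-but-held-open idiom, here is a minimal Python sketch of it. It is not taken from the logs in this thread; the function name and the temporary file are purely illustrative. The file loses its directory entry, yet the data stays reachable through the open descriptor until it is closed, and that in-between state is what a checker reading the raw disk sees as an unreferenced file.

```python
import os
import tempfile


def unlinked_but_open_demo():
    """Create a file, unlink it while it is still open, then keep using it.

    Illustrative only: the name and the temporary file are assumptions.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)                        # remove the only directory entry
        name_exists = os.path.exists(path)     # the name is gone
        link_count = os.fstat(fd).st_nlink     # typically 0 on Unix filesystems
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)                # data is still readable via the fd
    finally:
        os.close(fd)                           # last reference dropped; space can be reclaimed
    return name_exists, link_count, data


if __name__ == "__main__":
    print(unlinked_but_open_demo())
```

If the machine crashes or loses power while such a file is open, the inode is left allocated with no name pointing at it, and the cleanup falls to the next boot-time check.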

> I'm not actually looking to solve my particular problem per se. The 
> issue is that almost everyone I've checked with that's running SU+J gets 
> unref'd file and other errors when they check their filesystem (with the 
> fs live). Unless I'm missing something, a running FS should never have 
> those kinds of errors unless you deliberately disabled fsck.
> 

Nope, you are completely incorrect here.

> This leaves only a couple options:
> 
> - SU+J and fsck do not work correctly together to fix corruption on 
> boot, i.e. bgfsck isn't getting run when it should

The point of SUJ is to eliminate the need for bgfsck.  Effectively, they are mutually exclusive ideas.  It's possible that there are still problems with SUJ and how fsck processes and commits the journal entries.  However, bgfsck has nothing to do with this, and I'd also like to know if your use of geli is complicating the problem.

> - Stuff is getting completely screwed up after boot

Possible, but unlikely.

> - fsck is giving incorrect results

Very unlikely.

> - I'm completely clueless about how SU+J is supposed to behave or be 
> deployed

No comment =-)

> 
> I'm pretty certain that the first is the issue here. It would be great 
> if others could check their own SU+J filesystems so we could get a few 
> more data points.
> 

Indeed, more data is needed.

Scott
Received on Wed Dec 28 2011 - 06:14:18 UTC