Re: SU+J systems do not fsck themselves

From: Scott Long <scottl_at_samsco.org>
Date: Wed, 28 Dec 2011 00:57:55 -0700
On Dec 28, 2011, at 12:34 AM, David Thiel wrote:

> On Tue, Dec 27, 2011 at 11:54:20PM -0700, Scott Long wrote:
>> The first run of fsck, using the journal, gives results that I would 
>> expect.  The second run seems to imply that the fixes made on the 
>> first run didn't actually get written to disk.  This is definitely an 
>> oddity.  I see that you're using geli, maybe there's some strange 
>> side-effect there.  No idea.  Report as a bug, this is definitely 
>> undesired behavior.
> 
> Not impossible, but I was seeing similar issues on two non-geli systems 
> as well, i.e. tons of errors fixed when doing a single-user 
> non-journalled fsck, but journalled fsck not fixing stuff. I'll try to 
> replicate on a test machine, as I already lost data on the last 
> (non-geli) machine this happened to.
> 
>> For the love that is all good and holy, don't ever run fsck on a live 
>> filesystem.  It's going to report these kinds of problems!  It's 
>> normal; filesystem metadata updates stay cached in memory, and fsck 
>> bypasses that cache.  
> 
> Ok. I expected fsck would be softupdate-aware in that way, but I 
> understand it not doing so.
> 
>>> - SU+J and fsck do not work correctly together to fix corruption on 
>>> boot, i.e. bgfsck isn't getting run when it should
>> 
>> The point of SUJ is to eliminate the need for bgfsck.  Effectively, 
>> they are exclusive ideas.  
> 
> This is surprising to me. It is my impression that under Linux at least, 
> ext3fs is checked against the journal, and gets a full e2fsck if it 
> finds it's still dirty. Additionally, there's a periodic fsck after 180 
> days continuous runtime or x number of mounts (see tune2fs -i and -c).  
> Is SU+J somehow implemented in such a way that this is unnecessary? What 
> does it do that the ext3fs people have missed?
> 

SUJ isn't like ext3 journaling; it doesn't do 100% metadata logging.  Instead, it's an extension of softupdates.  Softupdates (SU) is still responsible for ordering dependent writes to the disk to maintain consistency.  What SU can't handle is the Unix/POSIX idiom of unlinking a file from the namespace while keeping its inode active through refcounts.  When you have an unclean shutdown, you wind up with stale blocks allocated to orphaned inodes.  The point of bgfsck was to scan the filesystem for these allocations and free them, just as fsck does, but to do it in the background so that the boot could continue.  SUJ is essentially just an intent log for this case; it tells fsck where to find these allocations so that fsck doesn't have to do the lengthy scan.  FWIW, this problem is present in almost any journaling implementation and is usually solved via intent records in a journal, not unlike SUJ.
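The win described above can be sketched in a toy model.  This is purely illustrative (the class names, record format, and recovery routines below are invented for this sketch and bear no relation to UFS's actual on-disk layout): an intent log records which inodes were unlinked while still open, so recovery visits only those inodes instead of walking the whole inode table.

```python
# Toy model of the SUJ idea: replay a small intent log to find orphaned
# inodes after a crash, instead of scanning every inode the way a
# traditional fsck must.  All names here are illustrative only.

class Inode:
    def __init__(self, num, nblocks):
        self.num = num
        self.nblocks = nblocks
        self.nlink = 1        # references from the namespace
        self.refcount = 0     # open file descriptors

class ToyFS:
    def __init__(self):
        self.inodes = {}
        self.journal = []     # SUJ-style intent log
        self.freed_blocks = 0

    def create(self, num, nblocks):
        self.inodes[num] = Inode(num, nblocks)

    def open(self, num):
        self.inodes[num].refcount += 1

    def unlink(self, num):
        ino = self.inodes[num]
        ino.nlink -= 1
        if ino.nlink == 0 and ino.refcount > 0:
            # POSIX keeps the blocks allocated until the last close.
            # Record an intent so crash recovery knows where the
            # orphan lives without scanning for it.
            self.journal.append(("unlinked", num))

    def recover_with_journal(self):
        # SUJ-style recovery: free only the inodes the journal names.
        visited = 0
        for _op, num in list(self.journal):
            ino = self.inodes.pop(num)
            self.freed_blocks += ino.nblocks
            visited += 1
        self.journal.clear()
        return visited

    def recover_full_scan(self):
        # Traditional fsck: every inode must be examined for orphans.
        return len(self.inodes)
```

With 1000 files and two orphans, the journal replay touches 2 inodes where a full scan would have to examine all of them; that difference is the "lengthy scan" SUJ avoids.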

So, there's an assumption with SUJ+fsck that SU is keeping the filesystem consistent.  Maybe that's a bad assumption, and I'm not trying to discredit your report.  But the intention with SUJ is to eliminate the need for anything more than a cursory check of the superblocks and a processing of the SUJ intent log; if either of these fails, then fsck reverts to a traditional scan.  In the same vein, ext3 and most other traditional journaling filesystems assume that the journal is correct and is preserving consistency: they do nothing more than a cursory data structure scan and a journal replay, and revert to a full scan only if that fails.  (ZFS seems to be an exception here, with there being no actual fsck available for it.)
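That "trust the journal, else revert" decision can be sketched as follows.  The checksum scheme and record format here are invented for illustration, not the actual UFS or ext3 journal formats: validate the journal first, replay it if it checks out, and fall back to a traditional full scan if it doesn't.

```python
# Sketch of the recovery decision: a journal guarded by a CRC.  If the
# stored checksum matches, replay the intent records; if not, treat the
# journal as suspect and do a full traditional scan instead.
import zlib

def write_journal(records):
    """Serialize intent records behind a 4-byte CRC32 header (invented format)."""
    payload = "\n".join(records).encode()
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def recover(journal_bytes):
    stored = int.from_bytes(journal_bytes[:4], "big")
    payload = journal_bytes[4:]
    if zlib.crc32(payload) != stored:
        return "full-scan"          # journal suspect: traditional fsck pass
    for record in payload.decode().splitlines():
        pass                        # replay each intent record here
    return "journal-replay"
```

A clean journal yields the fast "journal-replay" path; flipping even a single payload byte fails the CRC and forces the "full-scan" path, mirroring the fallback behavior described above.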

As for the 180 day forced scan on ext3, I have no public comment.  SU has matured nicely over the last 10+ years, and I'm happy with the progress that SUJ has made in the last 2-3 years.  If there are bugs, they need to be exposed and addressed ASAP.

Scott
Received on Wed Dec 28 2011 - 06:58:02 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:22 UTC