Re: [PANIC] ufs_dirbad: bad dir

From: Matthew Dillon <dillon_at_apollo.backplane.com>
Date: Sun, 16 Oct 2005 17:01:28 -0700 (PDT)
:
:On 16 Oct, Matthew Dillon wrote:
:>     Ach.  sigh.  Another false alarm.  Sorry.  The code is fine.  It's
:>     because the 'end' block is calculated inclusively, e.g.
:>     end_lbn = start_lbn + len - 1.  I'm still investigating it.
:> 
:>     There is a bug if the range reallocblks is called with spans
:>     more than two blockmaps, but I don't think that case can occur in real
:>     life due to limitations in the range passed by the caller.  Probably
:>     worth a KASSERT, though.
:
:Is there any correlation between this problem and the file system block
:size?  I've *never* encountered this problem, but I've only used block
:sizes up to 16K, and mostly just 4K and 8K.  I seem to have a dim memory
:of a mention of problems of some sort with large block sizes.

    It's possible but unlikely.  Ours tend to be 1K/8K or 2K/16K.  The
    frag ratio is 1:8 in both cases so it doesn't hit the funny frag masking
    code in the fragment allocator.  The error is not occurring in a fragment,
    either; so far it has only occurred in blocks running through an
    indirect block.

    So far the two crash dumps I've looked at show corruption in the
    first or second block addressed by the first indirect block (lbn 12
    and 13), which implies that an indirect block is getting trashed.  But
    the indirect block itself looks ok.

    From the crash dumps I have, the indirect block (-12) was in the
    buffer cache and I was able to look at it.  The contents of the block
    looked just fine:

    (kgdb) print $13.b_data
    $14 = 0xc1f00000 "ðAi"
    (kgdb) x/x $14
    0xc1f00000:     0x006941f0
    (kgdb) 
    0xc1f00004:     0x006944e0	<<<< this one
    (kgdb) 
    0xc1f00008:     0x00000000
    (kgdb) 
    0xc1f0000c:     0x00000000
    (kgdb) 
    0xc1f00010:     0x00000000
    (kgdb) 
    0xc1f00014:     0x00000000
    (kgdb) 
    0xc1f00018:     0x00000000
    (kgdb) 
    0xc1f0001c:     0x00000000
    ...

    This is consistent with the directory that the panic occurred on.  So
    the indirect block itself does not appear to be garbage.  The DATA BLOCK
    looks properly connected:

    (kgdb) print bp->b_lblkno  
    $18 = 13
	^^^^^^ corresponds to the filesystem block 0x006944e0 above.

    (kgdb) printf "%08x\n", bp->b_bio.bio_blkno >> 1
    006944e0		(bio_blkno is in device blocks, i.e. 512 bytes each,
			 so divide by 2 to get filesystem blocks).
	^^^^^ Matches the data found in the indirect block (1K/8K)

    (kgdb) print $21->ufsmount_u.fs
    $25 = (struct fs *) 0xd1acb800
    (kgdb) print *$25
    $26 = {
      fs_firstfield = 0, 
      fs_unused_1 = 0, 
      fs_sblkno = 16, 
      fs_cblkno = 24, 
      fs_iblkno = 32, 
      fs_dblkno = 1400, 
      fs_cgoffset = 2048, 
      fs_cgmask = -1, 
      fs_time = 1129469246, 
      fs_size = 37771928, 
      fs_dsize = 36610722, 
      fs_ncg = 839, 
      fs_bsize = 8192, 			<<<<<< 1K/8K blocks
      fs_fsize = 1024, 
      fs_frag = 8, 
      fs_minfree = 8, 
      fs_rotdelay = 0, 
      fs_rps = 60, 
      fs_bmask = -8192, 
      fs_fmask = -1024, 
      fs_bshift = 13, 
      fs_fshift = 10, 

    But the contents of the data block are not a directory.  They look like
    a piece of some other file:

    (kgdb) print bp
    $27 = (struct buf *) 0xc13ef8a0
    (kgdb) print bp->b_data
    $28 = 0xc341c000 "1_CA_CRT, &output);\n\n  if (output & GNUTLS_CERT_INVALID)\n    {\n      fprintf (stderr, \"Not trusted\");\n\n      if (output & GNUTLS_CERT_SIGNER_NOT_CA)\n\tfprintf (stderr, \": Issuer is not a CA\\n\");\n      "...
    (kgdb) 

    One of my users is reporting that multiple fscks are required to clean
    up the filesystem after the dirbad panic.  I haven't gotten the 
    fsck output from him but my guess is that there are duplicate blocks.

    David O'Brien has indicated that the problem occurs with softupdates 
    turned on or off, so it isn't softupdates specifically.

    So my guess is that there is something going on in UFS or the buffer
    cache.  I have a ton of bitmap sanity checks in DragonFly and none
    of them are being hit.  I have background bitmap writes turned off in
    DragonFly, so it has nothing to do with them.  I am investigating a
    number of things but at the moment I am at a loss as to the cause.  

					-Matt
					Matthew Dillon 
					<dillon_at_backplane.com>
Received on Sun Oct 16 2005 - 22:01:42 UTC