Re: dump trying to access incorrect block numbers?

From: Mark Millard <markmi_at_dsl-only.net> Date: Sat, 8 Jul 2017 10:45:41 -0700

[A normal multi-user boot's automatic fsck -B
 activity can also hit the problem.]

On 2017-Jul-8, at 9:45 AM, Mark Millard <markmi at dsl-only.net> wrote:

> [I add notes about a problem that happens after the
> "fsck -B". Also forgot to mention: production style
> kernel and world builds were in use. And I tried a
> powerpc64 build and it works the same.]
> 
> On 2017-Jul-7, at 11:09 PM, Mark Millard <markmi at dsl-only.net> wrote:
> 
>> [This note has more information than one sent with extra text
>> in the subject but with a partially different "to" list.]
>> 
>> Peter Jeremy peter at rulingia.com wrote on
>> Sat Jul 8 02:00:47 UTC 2017 :
>> 
>>> When did you first notice this (what SVN revision)?
>>> Do you know what the last good SVN revision was?
>>> Is this a new or old filesystem?
>>> Is the filesystem mounted/active or not when you dump it?
>>> What are the relevant parameters for the filesystem on ada0s3a?
>>> Are you running softupdates, journalling etc?
>>> Which dump(8) phase is reporting the errors?
>>> What are the exact dump and fsck commands you ran?
>> 
>> I can add a little information with some contrast
>> and only "fsck -B" in use (with an unclean file
>> system from a prior crash), no dump use. Still:
>> a snapshot is involved in the below.
>> 
>> Unfortunately two problems with major consequences
>> in my context limit the svn range that I can
>> cover for this activity. The problem version
>> ranges are:
>> 
>> -r319722 through -r320651 (fixed by -r320652)
>> (actually this is why I had used "boot -s" 
>> in what I report later: I could get to a
>> shell prompt that way instead of crashing
>> before any login prompt; the crashes left
>> the file system in need of repair)
>> 
>> -r320509 through -r320561 (fixed by -r320570)
>> 
>> So I was using -r320570 to avoid one of the
>> two problems.
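
The overlap of the two ranges can be checked mechanically. The following is a minimal sketch, not part of the original report: the revision numbers are the ones listed above, while the function names `in_range` and `affected_by` and the labels A and B are hypothetical, chosen only for illustration.

```sh
#!/bin/sh
# Range A: -r319722 through -r320651 (fixed by -r320652)
# Range B: -r320509 through -r320561 (fixed by -r320570)

# in_range REV LOW HIGH: succeed when LOW <= REV <= HIGH
in_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# affected_by REV: print the labels of the problem ranges that cover REV
affected_by() {
    rev=$1
    out=""
    in_range "$rev" 319722 320651 && out="${out}A"
    in_range "$rev" 320509 320561 && out="${out}B"
    echo "${out:-none}"
}

affected_by 320570
```

For -r320570 this reports only range A, which matches the statement that -r320570 avoids one of the two problems but not the other.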
>> 
>> 
>> 
>> Context: 32-bit powerpc FreeBSD used on PowerMac G5
>> so-called "Quad-core". (So big-endian as well.)
>> Softupdates, no journalling. Long-in-use file
>> system having lots of FreeBSD version updates
>> and port rebuilds over time.
>> 
>> The following is from now, not from the time of the
>> example messages:
>> 
>> # dumpfs / | more
>> magic   19540119 (UFS2) time    Fri Jul  7 22:53:34 2017
>> superblock location     65536   id      [ <OMITTED> ]
>> ncg     158     size    25165823        blocks  24372006
>> bsize   32768   shift   15      mask    0xffff8000
>> fsize   4096    shift   12      mask    0xfffff000
>> frag    8       shift   3       fsbtodb 3
>> minfree 8%      optim   time    symlinklen 120
>> maxbsize 32768  maxbpg  4096    maxcontig 4     contigsumsize 4
>> nbfree  2130375 ndir    65518   nifree  11769796        nffree  425065
>> bpg     20032   fpg     160256  ipg     80128   unrefs  0
>> nindir  4096    inopb   128     maxfilesize     2252349704110079
>> sbsize  4096    cgsize  32768   csaddr  5048    cssize  4096
>> sblkno  24      cblkno  32      iblkno  40      dblkno  5048
>> cgrotor 127     fmod    0       ronly   0       clean   0
>> metaspace 6408  avgfpdir 64     avgfilesize 16384
>> flags   soft-updates trim 
>> fsmnt   /
>> volname FBSDG4Srootfs   swuid   0       providersize    25165823
>> . . .
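
As a sanity check on the superblock values listed above, a few of the fields can be cross-checked against each other with plain shell arithmetic. This is a hedged sketch: the numbers are copied from the dumpfs output, the `check` and `dumpfs_checks` helpers are hypothetical, and the `fsbtodb` relation assumes 512-byte device blocks.

```sh
#!/bin/sh
# Values copied from the dumpfs output above.
bsize=32768; fsize=4096; frag=8; fsbtodb=3
bpg=20032; fpg=160256; size=25165823; ncg=158

# check LABEL EXPECTED COMPUTED: hypothetical helper, prints ok or FAIL.
check() {
    if [ "$2" -eq "$3" ]; then
        echo "ok   $1"
    else
        echo "FAIL $1: expected $2, computed $3"
    fi
}

dumpfs_checks() {
    check "frag = bsize / fsize"    "$frag"  $((bsize / fsize))
    # Assumption: 512-byte device blocks.
    check "fsize = 512 << fsbtodb"  "$fsize" $((512 << fsbtodb))
    check "fpg = bpg * frag"        "$fpg"   $((bpg * frag))
    # size is counted in fragments; round up to whole cylinder groups.
    check "ncg = ceil(size / fpg)"  "$ncg"   $(( (size + fpg - 1) / fpg ))
}

dumpfs_checks
```

All four relations hold for the values shown, so the geometry fields themselves look self-consistent.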
>> 
>> 
>> 
>> What I had done that produced the messages was:
>> 
>> <Prior failed multi-user boot from system problem
>> leaves root (only) file system not marked clean
>> so fsck -B will actually do something below>
>> 
>> boot -s (so: single user mode)
>> # The next 3 lines are the content of a generic, manually-run script.
>> mount -u /
>> mount -a -t ufs (but there is no other file system)
>> swapon -a       (there is a swap partition)
>> #
>> fsck -B
>> 
>> That "fsck -B" caused the same kinds of lines
>> reported by Michael Butler, happening as fsck
>> makes a snapshot for the background processing
>> to use. (I have camera pictures and could type
>> in some of the lines if needed.)
>> 
>> After those lines was text like (typed in from
>> an example camera picture):
>> 
>> ** //.snap/fsck_snapshot
>> ** Last Mount on /
>> ** Root file system
>> ** Phase 1 - Check Blocks and Sizes
>> ** Phase 2 - Check Pathnames
>> ** Phase 3 - Check Connectivity
>> ** Phase 4 - Check Reference Counts
>> ** Phase 5 - Check Cyl groups
>> Reclaimed: 0 directories, 1 files, 22680 fragments
>> 780914 files, 4797127 used, 19552199 free (443479 frags, 3288590 blocks, 1.8% fragmentation)
>> 
>> ***** FILE SYSTEM MARKED CLEAN *****
> 
> [I forgot to mention that the context was a
> production style kernel and world build,
> no invariants or other such.]
> 
> Since I'm running a patched -r320570 for the
> issue:
> 
> -r319722 through -r320651 (fixed by -r320652)
> 
> I went back and forced a power-off without
> shutdown and did the sequence:
> 
> boot -s (so: single user mode)
> # The next 3 lines are the content of a generic, manually-run script.
> mount -u /
> mount -a -t ufs (but there is no other file system)
> swapon -a       (there is a swap partition)
> #
> fsck -B
> 
> but always waited briefly after the fsck -B finished.
> 
> Like before, the following happens as it tries to trim:
> (typed in from camera picture)
> 
> panic: ffs_blkfree_cq: freeing free block
> cpuid = 2 (varies, of course)
> time = (varies)
> KDB: stack backtrace
> (stack addresses can vary: just an example here)
> 0xd23b17e0: at kdb_backtrace+0x5c
> 0xd23b1850: at vpanic+0x1e8
> 0xd23b18c0: at panic+0x54
> 0xd23b1910: at ffs_blkfree_cq+0x278
> 0xd23b1980: at ffs_blkfree_trim_task+0x60
> 0xd23b19b0: at taskqueue_run_locked+0x10
> 0xd23b1a10: at taskqueue_thread_loop+0x174
> 0xd23b1a50: at fork_exit+0xf4
> 0xd23b1a80: at fork_trampoline+0xc
> KDB: enter: panic
> [ thread pid 0 tid 1000082 ]
> Stopped at kdb_enter+0x70: addi r0,r0,0x0
> 
> 
> I've tried this on a powerpc64 and it works
> the same, complete with the "freeing free
> block" issue.

I tried a sequence using a normal boot to multi-user
with a file system that was not clean. It did an
automatic fsck -B and I got the messages and the
later "freeing free block" crash.

It appears that mksnap_ffs (and the equivalent code
in other programs) being broken in turn breaks fsck -B
fairly badly. (Michael Butler did the mksnap_ffs test at
Rodney W. Grimes' request.)

I've been using the following to clean things up
when I'm done with an experimental sequence that
leaves things needing a fsck:

boot -s (a single user boot)
fsck -F

So far it has resulted in a clean file system.
With that status fsck -B then has no such
problem: apparently it then does not create
a snapshot by default. So then a multi-user boot
works okay for its automatic fsck use.
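
The clean status can be read from the "clean" field in dumpfs output. The following is a minimal sketch under stated assumptions: `clean_flag` and `suggest_fsck` are hypothetical helper names, the sample line is copied from the dumpfs output quoted earlier, and the flag is probably only meaningful when the file system is not mounted read-write (the dumpfs output above, taken from a live root, shows clean 0).

```sh
#!/bin/sh
# clean_flag: read dumpfs-style text on stdin, print the value after "clean".
clean_flag() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "clean") { print $(i + 1); exit } }'
}

# suggest_fsck: hypothetical helper that turns the flag into a hint.
suggest_fsck() {
    flag=$(clean_flag)
    if [ "$flag" = "1" ]; then
        echo "clean: no foreground fsck needed"
    else
        echo "not clean: boot -s, then fsck -F"
    fi
}

# Sample line copied from the dumpfs output quoted earlier in this message.
printf 'cgrotor 127     fmod    0       ronly   0       clean   0\n' | suggest_fsck
```

On a real system the input would come from something like `dumpfs / | suggest_fsck`. The hint only mirrors the manual sequence described above and does not run fsck itself.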

===
Mark Millard
markmi at dsl-only.net