Re: kernel: failed: cg 5, cgp: 0xd11ecd0d != bp: 0x63d3ff1d

From: O. Hartmann <ohartmann_at_walstatt.org> Date: Thu, 22 Feb 2018 15:18:25 +0100

On Thu, 22 Feb 2018 09:26:20 +0100
Gary Jennejohn <gljennjohn_at_gmail.com> wrote:

> On Thu, 22 Feb 2018 08:37:07 +0100
> "O. Hartmann" <ohartmann_at_walstatt.org> wrote:
> 
> > On Tue, 20 Feb 2018 12:39:53 +0100
> > Gary Jennejohn <gljennjohn_at_gmail.com> wrote:
> >   
> > > On Mon, 19 Feb 2018 14:18:15 -0800
> > > "Chris H" <bsd-lists_at_BSDforge.com> wrote:
> > >     
> > > > I'm seeing a number of messages like the following:
> > > > kernel: failed: cg 5, cgp: 0xd11ecd0d != bp: 0x63d3ff1d
> > > > 
> > > > and was wondering if it's anything to be concerned with, or whether
> > > > fsck(8) is fixing them.
> > > > This began to happen when the power went out on a new install:
> > > > FreeBSD dns0 12.0-CURRENT FreeBSD 12.0-CURRENT #0: Wed Dec 13 06:07:59
> > > > PST 2017 root_at_dns0:/usr/obj/usr/src/amd64.amd64/sys/DNS0 amd64
> > > > which hadn't yet been hooked up to the UPS.
> > > > I performed an fsck in single user mode upon power-up, which ended with
> > > > the mount points being marked CLEAN. I was asked if I wanted to use the
> > > > JOURNAL. I answered Y.
> > > > FWIW the systems are UFS2 (ffs), have gpart labels, and were newfs'd
> > > > thusly: newfs -U -j
> > > > 
> > > > Thank you for all your time, and consideration.
> > > >       
> > > 
> > > fsck fixes these errors only when the user does NOT use the journal.
> > > You should re-do the fsck.
> > >     
> > 
> > When these mysterious errors first occurred on several boxes running
> > CURRENT, in December 2017 if I remember correctly, I also witnessed
> > mysterious and frequent crashes on several SSD-driven machines on which
> > the error described above occurred.
> > 
> > While the error somehow vanished in the meantime as CURRENT moved on,
> > the crashes continued. On two boxes, I restored the OS onto the system's
> > SSD via dump/restore after reformatting the SSD from scratch (UFS2, soft
> > updates + journaling). On those boxes the mysterious crashes have not
> > come back since then!
> > 
> > One box is left so far: my workstation. This box continues to crash, and
> > it crashed again today while compiling world/kernel.
> > 
> > The fun part is: even after a clean shutdown, after which I cannot detect
> > any filesystem inconsistencies, and a reboot with, again, no
> > inconsistencies reported on the console or in messages/logs, the box
> > crashes spontaneously. Today I could trigger the reboot by starting
> > "make -j4 buildworld buildkernel": after showing the initial
> > compiler/build framework statements, the box went to Nirvana. A
> > well-known phenomenon by now.
> > 
> > I have now checked the consistency of the filesystem; here is the result
> > for the /usr/obj tree, which is a dedicated GPT partition
> > (label: /dev/gpt/usr.obj):
> > 
> > 
> > [...]
> >  root_at_box1:~ # fsck -fy /dev/gpt/usr.obj
> > ** /dev/gpt/usr.obj
> > ** Last Mounted on /usr/obj
> > ** Phase 1 - Check Blocks and Sizes
> > ** Phase 2 - Check Pathnames
> > UNALLOCATED  I=515  OWNER=root MODE=0
> > SIZE=0 MTIME=Feb 22 07:25 2018 
> > NAME=/usr/src/amd64.amd64/sys/BOX1/config.c.new
> > 
> > UNEXPECTED SOFT UPDATE INCONSISTENCY
> > 
> > REMOVE? yes
> > 
> > DIRECTORY CORRUPTED  I=169691  OWNER=root MODE=40775
> > SIZE=1536 MTIME=Feb 22 05:16 2018 
> > DIR=/usr/src/amd64.amd64/sys/BOX1/modules/usr/src/sys/modules/nfsd
> > 
> > UNEXPECTED SOFT UPDATE INCONSISTENCY
> > 
> > SALVAGE? yes
> > 
> > ** Phase 3 - Check Connectivity
> > ** Phase 4 - Check Reference Counts
> > ** Phase 5 - Check Cyl groups
> > FREE BLK COUNT(S) WRONG IN SUPERBLK
> > SALVAGE? yes
> > 
> > SUMMARY INFORMATION BAD
> > SALVAGE? yes
> > 
> > BLK(S) MISSING IN BIT MAPS
> > SALVAGE? yes
> > 
> > 126922 files, 848197 used, 1178482 free (89210 frags, 136159 blocks, 4.4%
> > fragmentation)
> > 
> > ***** FILE SYSTEM MARKED DIRTY *****
> > 
> > ***** FILE SYSTEM WAS MODIFIED *****
> > 
> > ***** PLEASE RERUN FSCK *****
> > 
> > [...]
> > 
> > Before doing an installworld, I pre-emptively run "fsck -yf" twice in
> > single-user mode before mounting the partitions. In most cases the
> > filesystems are reported clean, but sometimes, especially for those under
> > high I/O (/usr/src and mostly /usr/obj on this build machine), corruption
> > is reported.
> > 
> > As I reported, the very same behaviour occurred on three boxes
> > simultaneously, and I got rid of it by completely reformatting the SSDs
> > (I have never had issues so far with HDD-based boxes!).
> > 
> > I hope I can refurbish the remaining box this weekend. I could then
> > report, if desired, whether this box returns to a healthy state like the
> > others or whether my observation was a simple coincidence of issues ...
> > 
> > Thanks for the patience,
> >   
> 
> I also see such problems only with SSDs.  Probably because the SSDs
> are buffering writes internally which never make it into the flash
> chips, although the SSDs report that the writes were completed.
> 
> HDDs apparently don't do that, or have a smaller cache.
> 
> I then also run fsck in single-user mode, but I explicitly do NOT
> use the journal, i.e., I do NOT run fsck -y.  But I guess that using
> fsck -fy is equivalent to not using the journal.
> 
> In my case the SSDs are error free after doing the fsck without
> using the journal until the next crash happens.  My box with a
> Ryzen 5 1600 tends to hang for no apparent reason, so I see these
> errors fairly frequently because I have to reset the box :(
> 

In my case here, I do not have to wait for a crash with an inconsistent
filesystem to have some weird behaviour with the journaling.

Somehow, in my naive terms, there is some strange problem hidden on the
partitions.

Since December last year I have had very weird and bad corruptions of the
filesystem when performing "make installworld": the boot process stopped at BTX
or claimed to have no loader, although the installation process had got as far
as installing everything in /boot/; other folders like /sbin or /libexec
contained nullified files. These corruptions happened even when I "fsck'ed" the
SSD in single-user mode prior to "make installworld". The result was a
reinstallation from a USB flash drive and then, again, rebuilding world, kernel,
and so on.
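
A minimal sketch of the kind of pre-installworld check described above, meant
for single-user mode with the partitions unmounted. It repeats a full
"fsck -fy" (the invocation shown in the quoted output) until a pass no longer
prints the "FILE SYSTEM WAS MODIFIED" or "PLEASE RERUN FSCK" markers. The
function names, the FSCK override variable and the default limit of 3 passes
are assumptions made for illustration only:

```sh
#!/bin/sh

# Hypothetical helper: reads fsck output on stdin and succeeds if it contains
# one of the markers that mean "this pass changed something, check again".
fsck_needs_rerun() {
    grep -Eq 'FILE SYSTEM WAS MODIFIED|PLEASE RERUN FSCK'
}

# Hypothetical helper: run a forced full check (-f) answering yes (-y) on the
# given device until one pass reports no modifications, or until the pass
# limit is reached. FSCK can be set to another command for a dry run.
full_fsck() {
    dev=$1
    max=${2:-3}          # assumed default: give up after 3 passes
    pass=1
    while [ "$pass" -le "$max" ]; do
        # The exit status is ignored on purpose; this sketch judges a pass
        # only by the markers in its output.
        out=$(${FSCK:-fsck} -fy "$dev" 2>&1) || :
        printf '%s\n' "$out"
        if ! printf '%s\n' "$out" | fsck_needs_rerun; then
            return 0     # this pass made no changes
        fi
        pass=$((pass + 1))
    done
    return 1             # still being modified after $max passes
}
```

Called as, for example, full_fsck /dev/gpt/usr.obj, it returns non-zero when
the filesystem was still being modified on the last pass.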

Those horrible failures went away on all SSD-based systems after
reformatting /usr/src, /usr/obj and /tmp (all dedicated partitions in my case),
where the inconsistencies occurred most.

On those systems where I also reformatted /, all of these problems went away!

The remaining box, where I haven't reformatted / so far, is the box in question
here. Now that /usr/obj and /usr/src are newly formatted, the horror corruptions
while performing installworld have disappeared, but the crashes go on. Heavy I/O
with lots of storage operations in particular triggers spontaneous crashes.

To me, it looks like there is something really fishy with UFS2. Since I perform
buildworlds with CURRENT almost daily on three boxes, I guess something
happened to the filesystem when CURRENT had hiccups, and the "inconsistency"
carried on until a complete newfs of the whole SSD.

I'm sorry that I cannot provide more qualified data ...

Regards,
Oliver