Re: kernel: failed: cg 5, cgp: 0xd11ecd0d != bp: 0x63d3ff1d

From: Rodney W. Grimes <freebsd-rwg_at_pdx.rh.CN85.dnsmgr.net> Date: Thu, 22 Feb 2018 05:50:09 -0800 (PST)

> On Thu, 22 Feb 2018 08:37:07 +0100
> "O. Hartmann" <ohartmann_at_walstatt.org> wrote:
> 
> > On Tue, 20 Feb 2018 12:39:53 +0100
> > Gary Jennejohn <gljennjohn_at_gmail.com> wrote:
> > 
> > > On Mon, 19 Feb 2018 14:18:15 -0800
> > > "Chris H" <bsd-lists_at_BSDforge.com> wrote:
> > >   
> > > > I'm seeing a number of messages like the following:
> > > > kernel: failed: cg 5, cgp: 0xd11ecd0d != bp: 0x63d3ff1d
> > > > 
> > > > and was wondering if it's anything to be concerned with, or whether
> > > > fsck(8) is fixing them.
> > > > This began to happen when the power went out on a new install:
> > > > FreeBSD dns0 12.0-CURRENT FreeBSD 12.0-CURRENT #0: Wed Dec 13 06:07:59 PST
> > > > 2017 root_at_dns0:/usr/obj/usr/src/amd64.amd64/sys/DNS0 amd64
> > > > which hadn't yet been hooked up to the UPS.
> > > > I performed an fsck in single user mode upon power-up, which ended with the
> > > > mount points being marked CLEAN. I was asked if I wanted to use the JOURNAL.
> > > > I answered Y.
> > > > FWIW the filesystems are UFS2 (ffs), have gpart labels, and were newfs'd thusly:
> > > > newfs -U -j
> > > > 
> > > > Thank you for all your time, and consideration.
> > > >     
> > > 
> > > fsck fixes these errors only when the user does NOT use the journal.
> > > You should re-do the fsck.
> > >   
> > 
> > When these mysterious errors first occurred on several boxes running CURRENT
> > (in December 2017, if I'm right), I also witnessed mysterious and frequent
> > crashes on several SSD-driven machines where the error described above
> > occurred.
> > 
> > While the error somehow vanished in the meantime as CURRENT progressed, the
> > crashes continued. On two boxes, I used dump/restore to put the OS back on the
> > system's SSD after reformatting the SSD from scratch (UFS2, soft updates +
> > journaling). On those boxes the mysterious crashes have not recurred since!
> > 
> > One box is left so far: my workstation. This box continues to crash, and it
> > started crashing again today while compiling world/kernel.
> > 
> > The fun part is: even after a clean shutdown, where I cannot detect any
> > filesystem inconsistencies, and a reboot with, again, no inconsistencies
> > reported on the console or in the messages/logs, the box crashes
> > spontaneously. Today I could trigger the reboot by starting "make -j4
> > buildworld buildkernel": after showing the initial compiler/build framework
> > statements, the box went to Nirvana. A well-known phenomenon by now.
> > 
> > I have now checked the consistency of the filesystem. Here is the result for
> > the /usr/obj tree, which is a dedicated GPT partition
> > (label: /dev/gpt/usr.obj):
> > 
> > 
> > [...]
> >  root_at_box1:~ # fsck -fy /dev/gpt/usr.obj
> > ** /dev/gpt/usr.obj
> > ** Last Mounted on /usr/obj
> > ** Phase 1 - Check Blocks and Sizes
> > ** Phase 2 - Check Pathnames
> > UNALLOCATED  I=515  OWNER=root MODE=0
> > SIZE=0 MTIME=Feb 22 07:25 2018 
> > NAME=/usr/src/amd64.amd64/sys/BOX1/config.c.new
> > 
> > UNEXPECTED SOFT UPDATE INCONSISTENCY
> > 
> > REMOVE? yes
> > 
> > DIRECTORY CORRUPTED  I=169691  OWNER=root MODE=40775
> > SIZE=1536 MTIME=Feb 22 05:16 2018 
> > DIR=/usr/src/amd64.amd64/sys/BOX1/modules/usr/src/sys/modules/nfsd
> > 
> > UNEXPECTED SOFT UPDATE INCONSISTENCY
> > 
> > SALVAGE? yes
> > 
> > ** Phase 3 - Check Connectivity
> > ** Phase 4 - Check Reference Counts
> > ** Phase 5 - Check Cyl groups
> > FREE BLK COUNT(S) WRONG IN SUPERBLK
> > SALVAGE? yes
> > 
> > SUMMARY INFORMATION BAD
> > SALVAGE? yes
> > 
> > BLK(S) MISSING IN BIT MAPS
> > SALVAGE? yes
> > 
> > 126922 files, 848197 used, 1178482 free (89210 frags, 136159 blocks, 4.4%
> > fragmentation)
> > 
> > ***** FILE SYSTEM MARKED DIRTY *****
> > 
> > ***** FILE SYSTEM WAS MODIFIED *****
> > 
> > ***** PLEASE RERUN FSCK *****
> > 
> > [...]
> > 
> > When doing an installworld, I pre-emptively run "fsck -yf" twice in single
> > user mode before mounting the partitions. In most cases the filesystems are
> > reported clean, but sometimes, especially for those under high I/O (/usr/src
> > and mostly /usr/obj on this build machine), corruption is reported.
> > 
> > As I reported, the very same behaviour occurred on three boxes simultaneously,
> > and I got rid of it by completely reformatting the SSDs (I have never had
> > issues so far with HDD-based boxes!).
> > 
> > I hope I can refurbish the remaining box this weekend, and I can report, if
> > desired, whether this box returns to a healthy state like the others or
> > whether my observation was a simple coincidence of issues ...
> > 
> > Thanks for the patience,
> > 
> 
> I also see such problems only with SSDs.  Probably because the SSDs
> are buffering writes internally which never make it into the flash
> chips, although the SSDs report that the writes were completed.
> 
> HDDs apparently don't do that, or have a smaller cache.
> 
> I then also run fsck in single-user mode, but I explicitly do NOT
> use the journal, i.e., I do NOT run fsck -y.  But I guess that using
> fsck -fy is equivalent to not using the journal.

fsck -y is not the same as running fsck and answering the special question
that has been added to fix something.  (I believe this is to turn the
cgcksum thing back on.)

Kirk would have to correct me if I am wrong there.

> In my case the SSDs are error free after doing the fsck without
> using the journal until the next crash happens.  My box with a
> Ryzen 5 1600 tends to hang for no apparent reason, so I see these
> errors fairly frequently because I have to reset the box :(

Instead of running fsck -fy or fsck -y I would recommend running
'fsck' in single-user mode and seeing what it finds, and what options
it might give you to "fix" or "enable".
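
Before going to single user, it may help to see how many of these kernel
messages there are and which cylinder groups they mention.  Below is a rough
Python sketch, assuming the log lines look exactly like the one quoted at the
top of this thread.  The function name count_cg_mismatches and the sample
lines are made up for illustration; where the messages end up (commonly
/var/log/messages) depends on your syslog setup.  Note that the message as
quoted does not name the filesystem, so this only tallies cylinder group
numbers.

```python
import re
from collections import Counter

# Matches the message format quoted in this thread, e.g.
#   kernel: failed: cg 5, cgp: 0xd11ecd0d != bp: 0x63d3ff1d
CG_MISMATCH = re.compile(
    r"failed: cg (?P<cg>\d+), "
    r"cgp: (?P<cgp>0x[0-9a-fA-F]+) != bp: (?P<bp>0x[0-9a-fA-F]+)"
)


def count_cg_mismatches(lines):
    """Return a Counter mapping cylinder group number to message count."""
    counts = Counter()
    for line in lines:
        match = CG_MISMATCH.search(line)
        if match:
            counts[int(match.group("cg"))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample input; on a real box, pass the lines of the
    # log file instead (its location is an assumption about your setup).
    sample = [
        "kernel: failed: cg 5, cgp: 0xd11ecd0d != bp: 0x63d3ff1d",
        "kernel: some unrelated message",
    ]
    for cg, n in sorted(count_cg_mismatches(sample).items()):
        print(f"cg {cg}: {n} mismatch message(s)")
```

This only summarizes what the kernel logged; it repairs nothing, and the
interactive fsck run is still what shows the actual state of the filesystem.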

-- 
Rod Grimes                                                 rgrimes_at_freebsd.org