Re: fsck_ufs after every reboot

From: Jeremy Chadwick <koitsu_at_FreeBSD.org>
Date: Wed, 12 Nov 2008 13:27:50 -0800
On Wed, Nov 12, 2008 at 08:21:15PM +0100, Attilio Rao wrote:
> 2008/11/12, Jeremy Chadwick <koitsu_at_freebsd.org>:
> > On Wed, Nov 12, 2008 at 05:20:56PM +0100, Attilio Rao wrote:
> >  > 2008/11/12, Jeremy Chadwick <koitsu_at_freebsd.org>:
> >  > > On Wed, Nov 12, 2008 at 04:52:59PM +0100, Attilio Rao wrote:
> >  > >  > 2008/11/12, Jeremy Chadwick <koitsu_at_freebsd.org>:
> >  > >  > > On Wed, Nov 12, 2008 at 04:44:52PM +0100, Attilio Rao wrote:
> >  > >  > >  > 2008/11/12, Jeremy Chadwick <koitsu_at_freebsd.org>:
> >  > >  > >  > > On Wed, Nov 12, 2008 at 02:44:05PM +0000, O. Hartmann wrote:
> >  > >  > >  > >  > I run FreeBSD 8.0/AMD64 on two boxes (one is a UP older AMD64 Athlon64
> >  > >  > >  > >  > 3500, other an 8-Core Dell Poweredge 1950).
> >  > >  > >  > >  >
> >  > >  > >  > >  > After nearly every reboot the box does fsck on all UFS2 filesystems. In
> >  > >  > >  > >  > most cases, while shutting down, the box reports processes that are not
> >  > >  > >  > >  > willing to die, and after a reboot the filesystems are unclean.
> >  > >  > >  > >  >
> >  > >  > >  > >  > Is this a common problem at the moment or special?
> >  > >  > >  > >
> >  > >  > >  > >
> >  > >  > >  > > I've seen this happen on my CURRENT box at home when using "shutdown -p
> >  > >  > >  > >  now".  Instead of the box powering off, it would lock up near the very
> >  > >  > >  > >  end of the shutdown process (before marking the filesystems clean).
> >  > >  > >  > >
> >  > >  > >  > >  Oddly, this works fine in RELENG_7, so I'm guessing there's some ACPI
> >  > >  > >  > >  development going on (I can't complain, it *is* CURRENT).
> >  > >  > >  >
> >  > >  > >  > This could come from my VFS work.
> >  > >  > >  > Could you spend some time on this?
> >  > >  > >  > I will tell you what to look at.
> >  > >  > >
> >  > >  > >
> >  > >  > > Sure thing!
> >  > >  > >
> >  > >  > >  Let me know what I need to do to help, what information you need, or if
> >  > >  > >  I should revert some commits to see if the behaviour changes.  Build
> >  > >  > >  date of the box (src-all csup'd about 45 minutes prior to the build
> >  > >  > >  date):
> >  > >  > >
> >  > >  > >  FreeBSD icarus.home.lan 8.0-CURRENT FreeBSD 8.0-CURRENT #0: Fri Nov  7 14:19:03 PST 2008     root_at_icarus.home.lan:/usr/obj/usr/src/sys/X7SBA_CURRENT_amd64  amd64
> >  > >  >
> >  > >  > Is this reproducible?
> >  > >
> >  > >
> >  > > I don't have an answer at this time.  I've only performed "shutdown -p
> >  > >  now" on this box twice since running CURRENT, and both times the problem
> >  > >  described occurred.
> >  > >
> >  > >
> >  > >  > I need you to build a kernel with the following options:
> >  > >  > INVARIANT_SUPPORT
> >  > >  > INVARIANTS
> >  > >  > DEBUG_VFS_LOCKS
> >  > >  > WITNESS
> >  > >  > and without WITNESS_SKIPSPIN
> >  > >
> >  > >
> >  > > Will do.  Relevant options I use:
> >  > >
> >  > >  makeoptions     DEBUG=-g                # Build kernel with gdb(1) debug symbols
> >  > >  options         SCHED_ULE               # ULE scheduler
> >  > >  options         PREEMPTION              # Enable kernel thread preemption
> >  > >  options         BREAK_TO_DEBUGGER       # Sending a serial BREAK drops to DDB
> >  > >  options         KDB                     # Enable kernel debugger support
> >  > >  options         KDB_TRACE               # Print stack trace automatically on panic
> >  > >  options         DDB                     # Support DDB
> >  > >  options         GDB                     # Support remote GDB
> >  > >  options         INVARIANTS              # Enable calls of extra sanity checking
> >  > >  options         INVARIANT_SUPPORT       # Extra sanity checks of internal structures, required by INVARIANTS
> >  > >  options         WITNESS                 # Enable checks to detect deadlocks and cycles
> >  > >  options         DEBUG_VFS_LOCKS         # vfs lock debugging
> >  > >
> >  > >  I have physical access to the console of this machine on a regular
> >  > >  basis.
> >  >
> >  > It's fine, great.
> >
> >
> > And as luck would have it, I can't reproduce the problem any more.  I've
> >  shutdown -p now'd literally 6 times in a row without any sort of lock
> >  up, and this is running on the old kernel.  The same behaviour is now
> >  seen with the new kernel.
> >
> >  So, the 2-3 times I've seen "shutdown -p now" not fully power off the
> >  machine were either flukes, or who knows what/why.
> >
> >  I simply can't reproduce the problem any longer.  I'm sorry.
> 
> Can you recompile your kernel with the old option (read: not use the
> old kernel, but recompile it with the old options) and see if it
> hangs?

Here's the behaviour and details:

Old kernel, built 2008/11/07, csup'd 11/07, kernel config without
WITNESS: shutdown -p now failed 2-3 times, but appears to work now.  Not
sure what/where the fluke was.

New kernel, built 2008/11/12, csup'd 11/12, kernel config with
WITNESS: shutdown -p now works.

I'll try rebuilding the 2008/11/12 kernel (with the same csup sources)
but without WITNESS and see if a couple shutdown -p now's work OK.
I'm not sure when I'll get to this (see below) though.

I'm not sure how much longer I'll be able to test CURRENT, because I
keep encountering seriously broken shit (pardon my language) that I do
not have the tolerance to deal with (I REALLY need to just build a 2nd
FreeBSD box for my home to run CURRENT and test for folks).  This is not
the thread to put this in, but I do not see the point in starting a new
thread about this because I guarantee people will go "looks like a local
problem, sounds hardware related", especially since some others cannot
reproduce it themselves, and there are some high temperatures with my
hardware (for unknown reasons).

I went with CURRENT because I kept encountering a deadlocked kernel on
RELENG_7 whenever attempting to use USB umass/da.  CURRENT has the same
problem as RELENG_7 in this regard, even with USB4BSD.  The latest
USB4BSD busdma patch fixes that issue, but there are still times during
file copies (USB write operations) where the copy will literally sit for
20-30 full seconds doing nothing, yet dd speed/bandwidth tests show no
sign of such stalls.  Possibly this "doing nothing for 20-30 seconds" is
a symptom of the next thing I'm seeing.

I've done a write-up on this completely bizarre problem where processes
on CURRENT are getting wedged for random amounts of time, chewing up
very large amounts of processor time (60-100%) on my dual-core
system; load average skyrockets (above 7.xx), and coretemp(4) on the
cores shows a tremendous increase in temperature (indicating the
processors *really are* getting hammered by something).  Yet ktrace and
truss on the processes show nothing happening, and they flip between "-"
and "wait" state in ps.  I don't know what to do about it though, and
after reading my own write-up, I realise the visual symptoms are so
bizarre that it can't be taken seriously.
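
One way to pin down when the processes flip between "-" and "wait" would be
to sample ps periodically and keep only the busy rows.  A minimal sketch in
Python, assuming the input is the output of "ps -axo pid,state,pcpu,wchan,comm";
the function names and the 60% threshold (the low end of the range above)
are illustrative and not part of any existing tool:

```python
def parse_ps(output):
    """Parse `ps -axo pid,state,pcpu,wchan,comm` output into a list of dicts.

    The first line is assumed to be the column header and is skipped.
    """
    rows = []
    for line in output.splitlines()[1:]:
        # comm is the last column and may contain spaces, so split at most 4 times.
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # blank or truncated line
        pid, state, pcpu, wchan, comm = fields
        rows.append({
            "pid": int(pid),
            "state": state,
            "pcpu": float(pcpu),
            "wchan": wchan,  # "-" when the process is not sleeping on anything
            "comm": comm,
        })
    return rows


def busy(rows, threshold=60.0):
    """Return the rows whose CPU usage is at or above `threshold` percent."""
    return [r for r in rows if r["pcpu"] >= threshold]
```

Feeding it the text from subprocess.run(["ps", "-axo",
"pid,state,pcpu,wchan,comm"], capture_output=True, text=True).stdout every
few seconds and logging busy(parse_ps(...)) with a timestamp would give a
timeline of state and wait channel for each affected process.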

But there's more: even when the system is sitting idle (no processes in
that weird state), I believe the overall temperature of my cores is
8-10C higher than on RELENG_7 (I was seeing core temperatures of
32-34C when idling on RELENG_7, and in CURRENT I'm seeing 40-42C; and
without powerd in CURRENT, I see temps of 50-51C while idling).  But I
should be fair with regards to this paragraph: I need to do a *full
reinstall* of RELENG_7 and gather statistics/evidence before stating
"yeah CURRENT is churning CPU and increasing CPU temps".  For all I know
there could be something evil going on with my hardware that has nothing
to do with FreeBSD, at least with regards to this issue.
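
Gathering that evidence could be as simple as logging coretemp(4) readings
on both branches while idle and comparing the means.  A minimal sketch in
Python, assuming the readings come from "sysctl -n dev.cpu.0.temperature"
and look like either "41" or "41.0C" (the exact format is an assumption and
may differ between branches); the function names are hypothetical:

```python
import re
import statistics


def parse_temp(raw):
    """Parse one reading such as '41' or '41.0C' into degrees Celsius."""
    m = re.match(r"\s*(-?\d+(?:\.\d+)?)\s*C?\s*$", raw)
    if m is None:
        raise ValueError(f"unrecognised temperature reading: {raw!r}")
    return float(m.group(1))


def summarise(readings):
    """Return count, min, max and mean of a list of raw readings."""
    temps = [parse_temp(r) for r in readings]
    return {
        "n": len(temps),
        "min": min(temps),
        "max": max(temps),
        "mean": statistics.mean(temps),
    }


def idle_delta(baseline, candidate):
    """Mean of `candidate` minus mean of `baseline`, in degrees Celsius."""
    return summarise(candidate)["mean"] - summarise(baseline)["mean"]


if __name__ == "__main__":
    # Endpoints of the idle ranges quoted above, used only as sample input;
    # real runs would need many samples per branch under the same conditions.
    releng7 = ["32", "34"]
    current = ["40.0C", "42.0C"]
    print(idle_delta(releng7, current))
```

The comparison only means something if both sets are taken on the same
hardware, at the same ambient temperature, and with powerd in the same
state on both branches.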

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Received on Wed Nov 12 2008 - 20:27:54 UTC