Re: PANIC (watchdog)

From: Bill Paul <wpaul_at_FreeBSD.ORG>
Date: Thu, 20 Oct 2005 21:27:55 +0000 (GMT)

> > I'm waiting for the part where you explain why you have software
> > watchdog support in kern_clock.c turned on. As far as I know, it's
> > not enabled by default in the GENERIC kernel config, and even if
> > it is, you have to twiddle debug.watchdog_enable in order to make
> > it trigger.
> 
> All I twiddle is watchdogd_enable in /etc/rc.conf... ;)
> 
> > I can only conclude that you turned on watchdog support for some
> > reason, set up a watchdog app to reset the watchdog timeout every
> > so often, and then forgot about it (or else it was enabled as part
> > of some other change you made and you weren't aware of it -- maybe one
> > of those unrelated things you foolishly chose not to tell us about).
> > Presumably the watchdog app crashed while you were doing your
> > installworld. If that's the case, you shouldn't be surprised when
> > the watchdog expiration occurs and dumps you into ddb.
> 
> Ok, then this panic is intended if I understood you correctly. I thought it 
> would trigger any kind of CPU reset, not a panic.
> Everything is fine then...
> 
> Thanks,
> 
> -Harry

Actually, it's not a panic (it's only a panic in a kernel without
DDB compiled in). It really just does kdb_enter(), which brings
you to the kernel debugger prompt. You should be able to resume the
system like this:

db> w watchdog_enabled 0
db> continue

The idea is that once the watchdog is enabled, hardclock() will dump
you into the kernel debugger _unless_ something resets the watchdog
timer periodically. That something is a user space app which pokes
debug.watchdog_reset. If the watchdog timeout is set for 20 seconds
and you reset the timer every 10 seconds, the system will keep running.
If the watchdog app dies, or if the kernel seizes up and stops scheduling
user processes, the timer will reach 0 and the kernel debugger will
come up.
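
To make the countdown concrete, here is a rough user space model of the
logic described above. It is only a sketch and is not the kern_clock.c
code: the struct, the function names and the one-tick-per-second
simplification are all made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the software watchdog; not the real kernel code. */
struct watchdog {
    bool enabled;    /* stands in for the enable knob */
    int  timeout;    /* reload value, in simulated ticks */
    int  remaining;  /* ticks left before the watchdog fires */
};

/* What the user space app does when it pokes the reset knob. */
static void watchdog_reset(struct watchdog *wd)
{
    wd->remaining = wd->timeout;
}

/* One simulated hardclock() tick; returns true when the watchdog fires. */
static bool watchdog_tick(struct watchdog *wd)
{
    if (!wd->enabled)
        return false;
    if (wd->remaining > 0)
        wd->remaining--;
    return wd->remaining == 0;
}

/*
 * Run for `horizon` ticks. The simulated app resets the timer every
 * `reset_every` ticks until it dies at tick `app_dies_at` (negative
 * means it never dies). Returns the tick on which the watchdog fired,
 * or -1 if it never fired.
 */
static int simulate(int timeout, int reset_every, int app_dies_at, int horizon)
{
    struct watchdog wd = { true, timeout, timeout };

    for (int t = 1; t <= horizon; t++) {
        bool app_alive = (app_dies_at < 0 || t < app_dies_at);

        if (app_alive && t % reset_every == 0)
            watchdog_reset(&wd);
        if (watchdog_tick(&wd))
            return t;  /* the real kernel would enter the debugger here */
    }
    return -1;
}
```

With a 20 tick timeout and a reset every 10 ticks, simulate(20, 10, -1, 1000)
never fires. If the app dies at tick 35, its last reset was at tick 30, so
simulate(20, 10, 35, 1000) fires 20 ticks later, on tick 49.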

The watchdog is supposed to give you a way to debug thread deadlocks
or stuck loops that occur in kernel mode. When such a condition arises,
interrupts may still occur and be handled, but user processes never
get a chance to run. This would stop the user space watchdog app,
so eventually the watchdog timeout would expire. In a system without
DDB, you'd get a panic instead of dropping into the kernel debugger.
Obviously, this is useful if you have an unattended machine to which
you have no easy console access: if the machine wedges, the watchdog
will fire and reboot the system, which hopefully will bring it back
to a working state and let you analyze and fix the problem remotely.

Unfortunately, there are many cases where the watchdog won't work.
Sometimes the system can wedge with interrupts disabled, or experience
some kind of hardware fault that prevents even hardclock() from running.
If that happens, you're screwed, unless you've rigged up some way to
deliver an NMI that can force the CPU to trap into the debugger.

-Bill

--
=============================================================================
-Bill Paul            (510) 749-2329 | Senior Engineer, Master of Unix-Fu
                 wpaul_at_windriver.com | Wind River Systems
=============================================================================
              <adamw> you're just BEGGING to face the moose
=============================================================================