Re: RELENG_7 and HEAD: bge causes system hang

From: Robert Watson <rwatson_at_FreeBSD.org>
Date: Mon, 26 Nov 2007 15:51:07 +0000 (GMT)
On Mon, 26 Nov 2007, Cristian KLEIN wrote:

> Great to hear this problem was solved. I still have one big fat question. 
> Why did the system hang and not allow the kernel debugger to show up? I 
> strongly believe that this bug would have been easily spotted had KDB 
> responded. Is it perhaps possible to "harden" KDB, so that such 
> issues are easier to find and fix in future?

I don't know the details of this particular situation, but I can speak to at 
least one known issue in DDB: right now, getting into DDB from a serial 
console is a very quick and straightforward path, requiring only the delivery 
of the serial interrupt and execution of its fast handler.  Keypresses on the 
regular video console take a much more circuitous route: syscons isn't MPSAFE, 
so they also require scheduling an ithread and acquiring Giant.  As such, 
I've found breaking into the debugger much easier from a serial console for 
several years.  As Giant has been pushed off larger and larger parts of the 
kernel, the syscons break path has gotten a lot more reliable.  There will 
always be certain cases where a console break (serial or video) will not work, 
and those include cases where interrupts are disabled on all CPUs (such as if 
spinlocks are held on all CPUs, perhaps due to one being leaked and then a 
cascading deadlock).  In that situation, there's nothing like a nice NMI 
button or IPMI NMI to get into the debugger :-).

We have a feature on i386 and amd64 called MP_WATCHDOG, which allows one CPU 
to be dedicated to being a watchdog for the others.  On lower-end hardware this 
isn't so useful, as CPUs aren't plentiful, but as the number of cores 
increases, it becomes more and more possible to run this without disrupting 
normal operation of the machine.  When it notices the kernel is no longer 
running callouts, it delivers an NMI to the other CPUs and kicks (hopefully) 
one of them into DDB.  There are a number of issues with the implementation, 
not least that we do actually run some other code on the watchdog CPU 
sometimes, as our interrupt routing and scheduler need a bit more adaptation, 
but it can be quite useful nonetheless.
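
To make the mechanism concrete, here is a minimal user-space sketch in C of 
the detection logic only.  It is not the MP_WATCHDOG code: the names wd_state, 
wd_init and wd_check and the threshold parameter are hypothetical, and the 
heartbeat counter merely stands in for whatever evidence the kernel uses that 
callouts are still running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state kept by a dedicated watchdog CPU; not the kernel's own. */
struct wd_state {
	uint64_t last_seen;	/* heartbeat value seen at the last check */
	unsigned stalled;	/* consecutive checks with no progress */
	unsigned threshold;	/* stalled checks tolerated before firing */
};

static void
wd_init(struct wd_state *wd, unsigned threshold)
{
	assert(threshold > 0);
	wd->last_seen = 0;
	wd->stalled = 0;
	wd->threshold = threshold;
}

/*
 * One poll by the watchdog CPU.  'heartbeat' stands in for a counter that
 * a periodic callout would bump on the other CPUs.  Returns true when the
 * watchdog should fire; in a kernel, that is the point at which an NMI
 * would be sent to the other CPUs.
 */
static bool
wd_check(struct wd_state *wd, uint64_t heartbeat)
{
	if (heartbeat != wd->last_seen) {
		/* Progress was made: remember it and reset the stall count. */
		wd->last_seen = heartbeat;
		wd->stalled = 0;
		return false;
	}
	wd->stalled++;
	return wd->stalled >= wd->threshold;
}
```

A caller would invoke wd_check() at a fixed interval with the current counter 
value.  Requiring several stalled checks in a row, rather than one, is a 
design choice in this sketch to avoid firing on a single late callout.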

Robert N M Watson
Computer Laboratory
University of Cambridge
Received on Mon Nov 26 2007 - 14:51:14 UTC
