Re: [PATCH] Netdump for review and testing -- preliminary version

From: Robert N. M. Watson <rwatson_at_freebsd.org> Date: Fri, 15 Oct 2010 20:45:55 +0100

On 15 Oct 2010, at 20:39, Garrett Cooper wrote:

>    But there are already some cases in the ddb area, dealing with
> dumping, that aren't properly handled today. Take for instance the
> following two scenarios:
> 1. Call doadump twice from the debugger.
> 2. Call doadump, exit the debugger, reenter the debugger, and call
> doadump again.
>    Both of these scenarios hang reliably for me.
>    I'm not saying that we should regress things further, but I'm just
> noting that there is most likely a chunk of edge cases that aren't
> being handled properly when doing dumps that could be handled better /
> fixed.

Right: one of the points I've made to Attilio is that we need to move to a more principled model of what sorts of things we allow in various kernel environments. Early boot is a special environment -- so is the debugger, but the debugger on panic is not the same as the debugger when you can continue. Likewise, the crash dumping code is special, but also not the same as the debugger. Right now, exceptional behaviour to limit hangs and similar failures is applied inconsistently. We need to develop a set of principles that tell us what is permitted in what contexts, and then use that to drive design decisions, normalizing what's there already.

This is not dissimilar to what we do with locking already, BTW: we define a set of kernel environments (fast interrupt handlers, non-sleepable threads, sleepable threads holding non-sleepable locks, etc.), and based on those principles prevent significant sources of instability that might otherwise arise in a complex, concurrent kernel. We need to apply the same sort of approach to handling kernel debugging and crashing.
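
To make the idea concrete, here is a rough sketch of what a "what is permitted in what context" table could look like. This is illustration only, written as plain userspace C: the context names, the capability flags, and the contents of the table are all assumptions made up for the example, not existing FreeBSD interfaces or an agreed policy.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical execution contexts, following the distinctions above. */
enum kctx {
	KCTX_EARLY_BOOT,
	KCTX_NORMAL,
	KCTX_DDB_CONTINUABLE,	/* debugger entered, system may resume */
	KCTX_DDB_PANIC,		/* debugger entered from a panic */
	KCTX_DUMP,		/* crash dump in progress */
	KCTX_COUNT
};

/* Hypothetical capabilities that a piece of code may require. */
#define	KCAP_SLEEP	0x01u	/* may sleep */
#define	KCAP_LOCK	0x02u	/* may acquire ordinary locks */
#define	KCAP_INTR_IO	0x04u	/* may rely on interrupt-driven I/O */
#define	KCAP_POLL_IO	0x08u	/* may do polled I/O */
#define	KCAP_CONTINUE	0x10u	/* system may resume normal operation */

/* Illustrative policy only: one bitmask of permitted capabilities per context. */
static const unsigned kctx_allowed[KCTX_COUNT] = {
	[KCTX_EARLY_BOOT]	= KCAP_POLL_IO | KCAP_CONTINUE,
	[KCTX_NORMAL]		= KCAP_SLEEP | KCAP_LOCK | KCAP_INTR_IO |
				  KCAP_POLL_IO | KCAP_CONTINUE,
	[KCTX_DDB_CONTINUABLE]	= KCAP_POLL_IO | KCAP_CONTINUE,
	[KCTX_DDB_PANIC]	= KCAP_POLL_IO,
	[KCTX_DUMP]		= KCAP_POLL_IO,
};

/* Return true if every capability in 'need' is permitted in context 'ctx'. */
static bool
kctx_permits(enum kctx ctx, unsigned need)
{
	assert((unsigned)ctx < KCTX_COUNT);
	return ((kctx_allowed[ctx] & need) == need);
}
```

With a table like this, a check such as kctx_permits(KCTX_DUMP, KCAP_SLEEP) fails in one well-defined place, much as the locking assertions do today, rather than each subsystem carrying its own ad hoc special cases.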

BTW, my view is that except in very exceptional cases, it should not be possible to continue after generating a dump. Dumps often cause disk controllers to get reset, which may leave outstanding I/O in nasty situations. Unless the dump device and model are known not to interfere with operation, we should set state indicating that the system is non-continuable once a dump has occurred.
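
A minimal sketch of the "non-continuable after dump" state, again as userspace C with made-up names (dump_dev, known_safe, try_dump, may_continue are all hypothetical, not kernel symbols). Refusing a second dump outright is an assumption of the sketch, chosen as one way to avoid the repeated-doadump hangs mentioned above; it is not a description of current behaviour.

```c
#include <stdbool.h>

/* Hypothetical description of a dump target. */
struct dump_dev {
	bool known_safe;	/* dumping is known not to disturb normal I/O */
};

enum dump_result { DUMP_OK, DUMP_REFUSED };

static bool dump_done;			/* a dump has already been taken */
static bool sys_continuable = true;	/* system may resume after the debugger */

/*
 * Take a dump at most once.  Unless the device is known to be safe,
 * mark the system as non-continuable before touching the hardware.
 */
static enum dump_result
try_dump(const struct dump_dev *dev)
{
	if (dump_done)
		return (DUMP_REFUSED);	/* refuse rather than risk a hang */
	dump_done = true;
	if (!dev->known_safe)
		sys_continuable = false;
	/* The actual dump would be written here. */
	return (DUMP_OK);
}

/* The debugger's "continue" command would consult this before resuming. */
static bool
may_continue(void)
{
	return (sys_continuable);
}
```

The point of setting the state before the dump starts is that a dump which fails halfway has still potentially reset the controller, so the system should be treated as non-continuable either way.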

Robert