Re: Cdiff patch for kernel gdb and mi_switch panic in freebsd 5.4 STABLE

From: Lonnie VanZandt <lonnie.vanzandt_at_ngc.com> Date: Thu, 1 Dec 2005 13:55:03 -0700 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:38:48 UTC

Early testing seems to indicate that the proposed patch doesn't work as well 
as the one I came up with. (I'm not saying mine is better because there may 
be other issues which I neglected that yours addresses. I'm just saying that 
for the immediate need of being able to reliably trap into and out of kdb on 
my SMP box, my patch worked better.)

So, I will revert to the edits that I had and retry the current testing. If 
the mi_switch/kdb_enter issues continue then I can let you know if my patch 
also isn't a viable solution.

Lonnie.

PS: It's hard to be precise because I'm attempting to debug kernel code (KLDs) 
which very likely can step out of bounds.

On Thursday 01 December 2005 10:33 am, Lonnie VanZandt wrote:
> Well, a patch in this area remains needed for 6.0 STABLE/REL_ENG. I just
> completed our 6.0 upgrade and got back to doing some kgdb debugging on our
> SMP box and blip! immediately encountered this kernel panic.
>
> So, now motivated, I'm applying your alternative patch and will report back
> should it not suffice.
>
> Lonnie.
>
> On Thursday 03 November 2005 12:29 pm, Lonnie VanZandt wrote:
> > I think I follow the proposal. Sure, I'll apply your patch and run with
> > it on my SMP box. It may take a while to reach a conclusion on its merits
> > due to the racy nature of the crash.
> >
> > On Thursday 03 November 2005 11:27 am, John Baldwin wrote:
> > > On Sunday 09 October 2005 05:49 pm, Lonnie VanZandt wrote:
> > > > Attached is the patch for the revised subr_kdb.c from FreeBSD 5.4
> > > > STABLE. (the rcsid is __FBSDID("$FreeBSD: src/sys/kern/subr_kdb.c,v
> > > > 1.5.2.2.2.1 2005/05/01 05:38:14 dwhite Exp $"); )
> > >
> > > I've looked at this, but I think t could maybe be done slightly
> > > differently. Here's a suggested patch that would close the race you are
> > > seeing I think while allowing semantics such that if two CPUs try to
> > > enter KDB at the same time, they would serialize and the second CPU
> > > would enter kdb after the first had exited.  Could you at least test it
> > > to see if it addresses your race condition?
> > >
> > > --- //depot/projects/smpng/sys/kern/subr_kdb.c	2005/10/27 19:51:50
> > > +++ //depot/user/jhb/ktrace/kern/subr_kdb.c	2005/11/03 18:24:38
> > > _at__at_ -39,6 +39,7 _at__at_
> > >  #include <sys/smp.h>
> > >  #include <sys/sysctl.h>
> > >
> > > +#include <machine/cpu.h>
> > >  #include <machine/kdb.h>
> > >  #include <machine/pcb.h>
> > >
> > > _at__at_ -462,12 +463,21 _at__at_
> > >  		return (0);
> > >
> > >  	/* We reenter the debugger through kdb_reenter(). */
> > > -	if (kdb_active)
> > > +	if (kdb_active == PCPU_GET(cpuid) + 1)
> > >  		return (0);
> > >
> > >  	critical_enter();
> > >
> > > -	kdb_active++;
> > > +	/*
> > > +	 * If more than one CPU tries to enter KDB at the same time
> > > +	 * then force them to serialize and go one at a time.
> > > +	 */
> > > +	while (!atomic_cmpset_int(&kdb_active, 0, PCPU_GET(cpuid) + 1)) {
> > > +		critical_exit();
> > > +		while (kdb_active)
> > > +			cpu_spinwait();
> > > +		critical_enter();
> > > +	}
> > >
> > >  #ifdef SMP
> > >  	if ((did_stop_cpus = kdb_stop_cpus) != 0)
> > > _at__at_ -484,13 +494,17 _at__at_
> > >
> > >  	handled = kdb_dbbe->dbbe_trap(type, code);
> > >
> > > +	/*
> > > +	 * We have to exit KDB before resuming the other CPUs so that they
> > > +	 * may run in a debugger-less context.
> > > +	 */
> > > +	kdb_active = 0;
> > > +
> > >  #ifdef SMP
> > >  	if (did_stop_cpus)
> > >  		restart_cpus(stopped_cpus);
> > >  #endif
> > >
> > > -	kdb_active--;
> > > -
> > >  	critical_exit();
> > >
> > >  	return (handled);