Re: Freeze

From: Peter Holm <peter_at_holm.cc>
Date: Mon, 20 Dec 2004 12:04:11 +0100

On Thu, Dec 16, 2004 at 03:21:44PM -0500, John Baldwin wrote:
> On Monday 06 December 2004 08:59 am, Peter Holm wrote:
> > On Fri, Nov 19, 2004 at 05:10:19PM -0500, John Baldwin wrote:
> > > On Friday 19 November 2004 02:59 am, Peter Holm wrote:
> > > > On Mon, Nov 15, 2004 at 03:46:15PM -0500, John Baldwin wrote:
> > > > > On Friday 12 November 2004 07:33 am, Peter Holm wrote:
> > > > > > GENERIC HEAD from Nov 11 08:05 UTC
> > > > > >
> > > > > > The following stack traces etc. were done before my first
> > > > > > cup of coffee, so it's not so informative as it could have been :-(
> > > > > >
> > > > > > The test box appeared to have been frozen for more than 6 hours,
> > > > > > but was pingable.
> > > > > >
> > > > > > http://www.holm.cc/stress/log/cons86.html
> > > > >
> > > > > A weak guess is that you have the system in some sort of livelock due
> > > > > to fork()?  Have you tried running with 'debug.mpsafevm=1' set from
> > > > > the loader?
> > > > >
> > > > > --
> > > > > John Baldwin <jhb_at_FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
> > > > > "Power Users Use the Power to Serve"  =  http://www.FreeBSD.org
> > > >
> > > > OK, I've got some more info:
> > > >
> > > > http://www.holm.cc/stress/log/cons88.html
> > > >
> > > > Looks like a spin in uma_zone_slab() when slab_zalloc() fails?
> > >
> > > Yes, I think if you specify M_WAITOK, then that might happen. 
> > > slab_zalloc() can fail if any of the init functions fail for example, in
> > > which case it would loop forever.  You can try this hack (though it may
> > > very well be wrong) to return failure if that is what is triggering:
> > >
> > > Index: uma_core.c
> > > ===================================================================
> > > RCS file: /usr/cvs/src/sys/vm/uma_core.c,v
> > > retrieving revision 1.110
> > > diff -u -r1.110 uma_core.c
> > > --- uma_core.c	6 Nov 2004 11:43:30 -0000	1.110
> > > +++ uma_core.c	19 Nov 2004 22:08:26 -0000
> > > @@ -1998,6 +1998,10 @@
> > >  		 */
> > >  		if (flags & M_NOWAIT)
> > >  			flags |= M_NOVM;
> > > +
> > > +		/* XXXHACK */
> > > +		if (flags & M_WAITOK)
> > > +			break;
> > >  	}
> > >  	return (slab);
> > >  }
> > >
> > > --
> > > John Baldwin <jhb_at_FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
> > > "Power Users Use the Power to Serve"  =  http://www.FreeBSD.org
> >
> > I instrumented the code with this:
> > $ cvs diff -u
> > cvs diff: Diffing .
> > Index: uma_core.c
> > ===================================================================
> > RCS file: /home/ncvs/src/sys/vm/uma_core.c,v
> > retrieving revision 1.110
> > diff -u -r1.110 uma_core.c
> > --- uma_core.c  6 Nov 2004 11:43:30 -0000       1.110
> > +++ uma_core.c  6 Dec 2004 13:49:36 -0000
> > @@ -1926,6 +1926,7 @@
> >  {
> >         uma_slab_t slab;
> >         uma_keg_t keg;
> > +       int i;
> >
> >         keg = zone->uz_keg;
> >
> > @@ -1943,7 +1944,8 @@
> >
> >         slab = NULL;
> >
> > -       for (;;) {
> > +       for (i = 0;;i++) {
> > +               KASSERT(i < 10000, ("uma_zone_slab is looping"));
> >                 /*
> >                  * Find a slab with some space.  Prefer slabs that are
> >                  * partially used over those that are totally full.  This helps to reduce
> >
> > and now during test of Jeff Roberson's "SMP FFS" patch the assert
> > triggered: http://www.holm.cc/stress/log/cons92.html
> 
> Hmm.  Does the hack patch above make the hang go away or does it just break 
> things worse?
> 
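
For readers following the thread, the behaviour discussed above can be
modelled in userland. This is only a rough sketch, not the code in
uma_core.c: toy_slab_alloc(), toy_zone_slab(), the TOY_* flags and the
"limit" parameter are all made up for illustration. It assumes an
allocation that fails on every pass, and shows the retry loop running to
the bound (the role the KASSERT counter plays) unless the XXXHACK-style
early exit for a "wait OK" caller is enabled.

```c
#include <stddef.h>

/* Hypothetical flag values for this sketch; not the kernel's M_* constants. */
#define TOY_NOWAIT 0x1
#define TOY_WAITOK 0x2

struct toy_slab {
	int free_items;
};

/* Stand-in for a slab allocation that fails every time. */
static struct toy_slab *
toy_slab_alloc(void)
{
	return (NULL);
}

/*
 * Retry until a slab is obtained, the caller may not wait, the optional
 * early exit for TOY_WAITOK fires, or 'limit' passes have been made.
 * The number of passes taken is stored in *passes.
 */
static struct toy_slab *
toy_zone_slab(int flags, int bail_on_waitok, int limit, int *passes)
{
	struct toy_slab *slab = NULL;
	int i;

	for (i = 0; i < limit; i++) {
		slab = toy_slab_alloc();
		if (slab != NULL)
			break;
		if (flags & TOY_NOWAIT)
			break;
		/* Models the XXXHACK: give up instead of spinning. */
		if (bail_on_waitok && (flags & TOY_WAITOK))
			break;
	}
	*passes = (i < limit) ? i + 1 : limit;
	return (slab);
}
```

With bail_on_waitok set to 0 the loop only stops at the bound; with it
set to 1 it returns NULL after a single pass, which is what lets callers
see a NULL they may not be prepared for.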

I have been testing your patch for quite a while. If it's OK for
m_getcl with M_TRYWAIT to return NULL, then your patch revealed a missing
test for NULL in kern/uipc_socket.c:750

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x1c
fault code              = supervisor write, page not present
instruction pointer     = 0x8:0xc0647d77
stack pointer           = 0x10:0xcfa9cbf0
frame pointer           = 0x10:0xcfa9cc38
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 67417 (net)
[thread pid 67417 tid 100890 ]
Stopped at      sosend+0x227:   movl    $0,0x1c(%eax)
db> where
Tracing pid 67417 tid 100890 td 0xc1ae8000
sosend(c3454dec,0,cfa9cc90,0,0) at sosend+0x227
soo_write(c1db9374,cfa9cc90,c1aa6180,0,c1ae8000) at soo_write+0x2d
dofilewrite(3,bfbfe740,400,ffffffff,ffffffff) at dofilewrite+0x99
write(c1ae8000,cfa9cd14,3,d,246) at write+0x48
syscall(2f,bfbf002f,bfbf002f,3,bfbfe740) at syscall+0x128
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (4, FreeBSD ELF32, write), eip = 0x280bfbf7, esp = 0xbfbfe71c, ebp = 0xbfbfeb68 ---
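
The fault address 0x1c together with the faulting store
"movl $0,0x1c(%eax)" is consistent with %eax being 0, i.e. a store
through a NULL pointer to a field at offset 0x1c. A minimal userland
sketch of that pattern and of the missing test follows. struct toy_buf,
toy_alloc() and toy_send() are hypothetical and are not the real mbuf
layout or the sosend() code; the fixed-width fields are only there to
put the field at offset 0x1c on any platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a packet buffer; NOT the real struct mbuf. */
struct toy_buf {
	uint32_t hdr[7];	/* 7 * 4 = 0x1c bytes of placeholder header */
	uint32_t pkt_len;	/* lands at offset 0x1c */
};

/* Hypothetical allocator that can fail even for a caller willing to wait. */
static struct toy_buf *
toy_alloc(int simulate_failure)
{
	if (simulate_failure)
		return (NULL);
	return (calloc(1, sizeof(struct toy_buf)));
}

/* Returns 0 on success, -1 if the allocation failed. */
static int
toy_send(int simulate_failure)
{
	struct toy_buf *b;

	b = toy_alloc(simulate_failure);
	if (b == NULL)
		return (-1);	/* without this test the store below hits 0x1c */
	b->pkt_len = 0;
	free(b);
	return (0);
}
```

Whether the right fix is a test like this in the caller, or making the
allocator never return NULL for a waiting caller, is the open question.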
-- 
Peter Holm