Re: panic: uma_zone_slab is looping

From: Bosko Milekic <bmilekic_at_technokratis.com>
Date: Sun, 26 Dec 2004 20:37:45 -0500
On Sun, Dec 26, 2004 at 11:56:51PM +0100, Peter Holm wrote:
> On Sun, Dec 26, 2004 at 01:17:38PM -0500, Bosko Milekic wrote:
> > On Sun, Dec 26, 2004 at 05:11:53PM +0100, Peter Holm wrote:
> > > 
> > > Yes, I think that I have verified your excellent analysis of the
> > > problem: http://www.holm.cc/stress/log/freeze04.html
> > > 
> > > So, do you have any fix suggestions? :-)
> > 
> >   Not yet, because the problem is non-obvious from the trace.
> > 
> >   I need to know exactly when the UMA RCntSlabs zone recurses _first_,
> >   and I need to confirm that it is an actual recursion.  I've looked at
> >   the VM code and I don't see how/why recursion on the RCntSlabs zone
> >   would happen.
> > 
> >   Please modify the printf code to look exactly like this:
> > 
> >    if (keg->uk_flags & UMA_ZFLAG_INTERNAL && keg->uk_recurse != 0) {
> > 	if ((zone == slabzone) || (zone == slabrefzone))
> > 		panic("Zone %s forced to fail due to recurse non-null: %d\n",
> > 		    zone->uz_name, keg->uk_recurse);
> >    	return (NULL);
> >    }
> > 
> >   (You don't need to check any global counter -- the counter is imperfect
> >   anyway -- because even a single recursion on slabzone or slabrefzone
> >   should be illegal).
> > 
> >   I'd like to see the trace from the above panic, if possible.
> 
> Here it is: http://www.holm.cc/stress/log/freeze05.html

  I have checked the code here and looked at possible code paths and
  have unfortunately had to resort to guessing again, but I now believe
  I have identified a problematic scenario.

  Consider this particular timeline (time moves downward):
  [I hope you can handle ASCII art]

  By the way, the stack trace you show would correspond to that of
  thread 2. I refer to a frame number below.

  thread 1 (t1)                      thread 2 (t2)
-------------------------------------------------------------------------

  t1.a) Allocating from a zone,
  needs slab header from one of
  the slab header zones (either
  slabzone or slabrefzone). Let's
  assume it is slabzone, as in
  your trace above. The allocation
  is performed with M_WAITOK.

                                     t2.a) Needs to allocate from
                                     a zone, and it needs a
                                     slab header too.  The allocation will
                                     be performed with M_WAITOK.  Let's
                                     assume that the slab header zone
                                     we're allocating is also slabzone.

  t1.b) in uma_zone_slab(), has
  slabzone's keg lock, increments
  keg's uk_recurse.
  Enters slab_zalloc().

                                     t2.b) Blocks on zone lock.

  t1.c) Drops zone lock to
  allocate from VM, uk_recurse
  for the slabzone is currently
  1 (we incremented it in t1.b).

                                     t2.c) Takes zone lock for slabzone,
                                     now in uma_zone_slab() (Frame 11),
                                     and since uk_recurse is 1, it
                                     decides recursion happened.  Wants
                                     to return NULL even though
                                     allocation was done with M_WAITOK.
                                     Our panic is triggered.
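
  As a rough illustration of the timeline above, here is a minimal
  single-threaded sketch that replays steps t1.b, t1.c and t2.c in
  order against one shared counter.  Everything prefixed with toy_ is
  hypothetical and only models the check; it is not the real UMA code,
  and locking is implied by the ordering of the steps rather than
  implemented.

```c
#include <assert.h>

/* Hypothetical stand-in for a keg; only the fields the check reads. */
struct toy_keg {
	int	uk_internal;	/* models UMA_ZFLAG_INTERNAL being set */
	int	uk_recurse;	/* per-keg counter, shared by all threads */
};

/* Models the check in uma_zone_slab(): nonzero means "return NULL". */
static int
toy_refuses(const struct toy_keg *keg)
{
	return (keg->uk_internal && keg->uk_recurse != 0);
}

/*
 * Replays steps t1.b, t1.c and t2.c in order on one keg and reports
 * what each thread's check decided.
 */
static void
toy_replay_timeline(int *t1_refused, int *t2_refused)
{
	struct toy_keg slabkeg = { 1, 0 };

	/* t1.b: t1 holds the lock, passes the check, bumps the counter. */
	*t1_refused = toy_refuses(&slabkeg);
	slabkeg.uk_recurse++;

	/* t1.c: t1 drops the lock to go to the VM; counter is still 1. */

	/* t2.c: t2 gets the lock and runs the same check. */
	*t2_refused = toy_refuses(&slabkeg);

	/* t1 comes back from the VM and undoes its increment. */
	slabkeg.uk_recurse--;
	assert(slabkeg.uk_recurse == 0);
}
```

  After toy_replay_timeline() returns, *t1_refused is 0 and
  *t2_refused is 1.  t2 never recursed, but a counter that is shared
  per keg cannot distinguish "the same thread re-entered" from "another
  thread is in the middle of slab_zalloc()".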

  I'll have to reserve some more time to think about this.  One way I
  think it might be solvable would be to change the check that
  triggers the NULL return to explicitly check for the bucketzone, and
  not for all UMA_ZFLAG_INTERNAL zones; I need to think this through a
  little more.
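
  To make the idea concrete, here is a small sketch comparing the
  check as it stands with a variant narrowed to the bucket zone.  The
  toy_ types and functions are hypothetical stand-ins, the bucket zone
  is passed in as a parameter rather than read from a global, and
  whether the narrowed check is actually safe is exactly the part that
  still needs thought.

```c
#include <assert.h>

/* Hypothetical stand-ins; the real structures carry far more state. */
struct toy_keg {
	int	uk_internal;	/* models UMA_ZFLAG_INTERNAL being set */
	int	uk_recurse;
};

struct toy_zone {
	const char	*uz_name;
	struct toy_keg	*uz_keg;
};

/* The check as it stands: any internal zone with a nonzero counter. */
static int
toy_refuses_current(const struct toy_zone *zone)
{
	return (zone->uz_keg->uk_internal && zone->uz_keg->uk_recurse != 0);
}

/* The narrowed variant: only the bucket zone is ever refused. */
static int
toy_refuses_narrowed(const struct toy_zone *zone,
    const struct toy_zone *bucketzone)
{
	return (zone == bucketzone && zone->uz_keg->uk_recurse != 0);
}

/* Usage: with the counter at 1 on both kegs, compare the two checks. */
static void
toy_demo(void)
{
	struct toy_keg sk = { 1, 1 }, bk = { 1, 1 };
	struct toy_zone slab = { "toy slabzone", &sk };
	struct toy_zone bucket = { "toy bucketzone", &bk };

	assert(toy_refuses_current(&slab));		/* today: t2 fails */
	assert(!toy_refuses_narrowed(&slab, &bucket));	/* narrowed: t2 proceeds */
	assert(toy_refuses_narrowed(&bucket, &bucket));	/* buckets still guarded */
}
```

  With the narrowed check, a thread in t2's position would no longer
  be refused on slabzone just because another thread is inside
  slab_zalloc(), while the bucket zone keeps its guard.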

  Does the scenario seem likely to you?

Cheers,
-- 
Bosko Milekic
bmilekic_at_technokratis.com
bmilekic_at_FreeBSD.org
Received on Mon Dec 27 2004 - 00:37:50 UTC
