Re: reproducible panic in netisr

From: John Baldwin <jhb_at_freebsd.org>
Date: Fri, 7 Aug 2009 08:35:19 -0400

On Thursday 06 August 2009 10:11:26 am Robert Watson wrote:
> On Thu, 6 Aug 2009, Larry Rosenman wrote:
> 
> > On Thu, 6 Aug 2009, Robert Watson wrote:
> >
> >> On Tue, 4 Aug 2009, Navdeep Parhar wrote:
> >> 
> >>>>> This occurs on today's HEAD + some unrelated patches.  That makes it 
> >>>>> 8.0BETA2+ code.  I haven't tried older builds.
> >>>> 
> >>>> We have finally been able to reproduce this ourselves yesterday and
> >>> 
> >>> Well, it happens every single time on all of my amd64 machines. After I'd 
> >>> already sent my email I noticed that the netisr mutex has an odd address 
> >>> (pun intended :-))
> >>> 
> >>> m=0xffffffff8144d867
> >> 
> >> Heh, indeed.  We just spotted the same result here.  In this case it's 
> >> causing a panic because it leads to a non-atomic read due to mtx_lock 
> >> spanning a cache line boundary, followed shortly by a panic because it's 
> >> not a valid thread pointer when it's dereferenced, as we get a fractional 
> >> pointer.
> > [snip]
> >
> > Do we have an ETA for a testable patch?
> 
> RSN, I'm afraid.  We can eliminate the effect by reverting the use of DPCPU in 
> netisr.c (basically reverting to pre-r195019 of netisr.c).  The interesting 
> question is where the problem originates -- is gcc/ld/etc not laying out the 
> elf section properly, or are the MD parts not providing an aligned base? 
> There are also probably issues in the DPCPU handling of modules along similar 
> lines, but first things first.

No, gcc/ld/etc. are doing the right thing.  However, the DPCPU and VNET code
implicitly assumes that the dpcpu/vnet sets start off with a specific alignment,
and, as it turns out, that assumption is false.
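
To make the failure mode concrete, here is a rough userland sketch, not the
kernel code.  It assumes a 64-byte cache line and, purely for illustration,
that mtx_lock sits 24 bytes into the mutex.  The helper names and the 0x1040 /
0x1019 / 0x8000 addresses are made up.  The only value taken from this thread
is m=0xffffffff8144d867.  The idea: if a set is copied to an aligned base but
the set's own start is not aligned, a symbol that was aligned at link time
loses its alignment in the copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions for illustration only; not taken from the FreeBSD sources. */
#define CACHE_LINE_ASSUMED      64u /* typical amd64 cache line size */
#define MTX_LOCK_OFFSET_ASSUMED 24u /* hypothetical offset of mtx_lock */

/* True if addr is a multiple of align (align must be a power of two). */
static bool
is_aligned(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((addr & (align - 1)) == 0);
}

/* True if the object [addr, addr + size) crosses a cache line boundary. */
static bool
spans_cache_line(uint64_t addr, size_t size)
{
    uint64_t first = addr / CACHE_LINE_ASSUMED;
    uint64_t last = (addr + size - 1) / CACHE_LINE_ASSUMED;

    return (first != last);
}

/*
 * Hypothetical relocation: a symbol at link-time address sym, living in a
 * set that starts at set_start, is accessed in a copy placed at copy_base.
 * Alignment of sym survives only if set_start has at least that alignment.
 */
static uint64_t
relocate(uint64_t sym, uint64_t set_start, uint64_t copy_base)
{
    return (copy_base + (sym - set_start));
}
```

With those assumptions, is_aligned(0xffffffff8144d867, 8) is false, and
spans_cache_line(0xffffffff8144d867 + MTX_LOCK_OFFSET_ASSUMED, 8) is true,
which would match the non-atomic read described above.  Likewise,
relocate(0x1040, 0x1019, 0x8000) gives 0x8027: the symbol was 64-byte aligned
at link time, but not after being rebased against an unaligned set start.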

-- 
John Baldwin