Re: reproducible panic in netisr

From: Robert Watson <rwatson_at_FreeBSD.org>
Date: Thu, 6 Aug 2009 00:17:11 +0100 (BST)

On Tue, 4 Aug 2009, Navdeep Parhar wrote:

>>> This occurs on today's HEAD + some unrelated patches.  That makes it 
>>> 8.0BETA2+ code.  I haven't tried older builds.
>>
>> We have finally been able to reproduce this ourselves yesterday and
>
> Well, it happens every single time on all of my amd64 machines. After I'd 
> already sent my email I noticed that the netisr mutex has an odd address 
> (pun intended :-))
>
> m=0xffffffff8144d867

Heh, indeed.  We just spotted the same result here.  In this case the 
misaligned mutex leaves mtx_lock spanning a cache line boundary, so reads of 
it are not atomic and we get a fractional pointer.  The panic follows shortly 
afterwards, when that value is dereferenced as a thread pointer and turns out 
not to be a valid one.
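
To make the cache line point concrete, here is a minimal sketch of the check, 
not code from the kernel.  The 64-byte line size and the 24-byte offset of 
mtx_lock within struct mtx are assumptions for illustration; spans_line() is a 
hypothetical helper.  Only the mutex address comes from the quoted report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; neither is stated in this thread. */
#define ASSUMED_CACHE_LINE      64u     /* typical amd64 cache line size */
#define ASSUMED_MTX_LOCK_OFFSET 24u     /* guessed offset of mtx_lock in struct mtx */

/* Address of the netisr mutex as reported in the quoted text. */
#define REPORTED_MUTEX_ADDR     0xffffffff8144d867ULL

/*
 * Return true if the object occupying [addr, addr + size) crosses a
 * boundary of the given power-of-two line size.
 */
static bool
spans_line(uint64_t addr, size_t size, uint64_t line)
{
	assert(size > 0 && line != 0 && (line & (line - 1)) == 0);
	return ((addr / line) != ((addr + size - 1) / line));
}
```

Under those assumptions, spans_line(REPORTED_MUTEX_ADDR + 
ASSUMED_MTX_LOCK_OFFSET, 8, ASSUMED_CACHE_LINE) is true: the 8-byte word would 
start at ...d87f and end at ...d886, straddling the line boundary at ...d880.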

> It's a bit unusual for the mutex struct to start at a completely unaligned 
> address.  I hope things are better on sparc64 etc., not everyone is as 
> forgiving as amd64.

amd64 isn't as forgiving either, it turns out. :-)

> The mutex led me to some DPCPU stuff that I didn't quite get.
>
> (kgdb) p/x dpcpu_off
> $2 = {0x8407d7, 0xffffff807f4037d7, 0x0 <repeats 30 times>}
> (kgdb) p dpcpu
> $3 = (void *) 0xffffff8000010000
> (kgdb) p &__start_set_pcpu
> $4 = (uintptr_t **) 0xffffffff80c0c829
> (kgdb) p/x 0xffffff8000010000 - 0xffffffff80c0c829
> $5 = 0xffffff807f4037d7
>
> It's not clear why we prefer to store offsets from DPCPU_START, instead of 
> the base address of the dpcpu area directly.  On amd64, the dpcpu area for 
> cpu 0 is above kernbase (immediately after kernbase + thread0's stack). 
> For the other CPUs it's below kernbase.  This makes the pointer arithmetic 
> that calculates offsets more "interesting."
>
> Why have a dpcpu_off[] instead of a dpcpu_base[]?

Each field in DPCPU is named with respect to the start of a "master" dpcpu 
copy, which holds the static initialization.  This makes the per-CPU name:

    (&master_name_for_variable - DPCPU_START) + per-cpu-base

What Jeff has done is factor out the DPCPU_START subtraction, since it's a 
constant subtraction across all DPCPU use, and do it once when calculating 
dpcpu_off.  This should all be fine; the question is why we're losing the 
alignment during linking of the kernel.  netisr is linked into the base 
kernel, so I guess it's some problem with the way the linker set is being laid 
out at compile-time.  I expect we may have a similar issue with the run-time 
allocation of DPCPU space as well.
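
As a rough illustration of that arithmetic, the sketch below plugs in the 
numbers from the kgdb session quoted above.  It is plain userland C, not the 
kernel's DPCPU macros, and the helper names are hypothetical.  It also assumes 
the reported mutex lives in CPU 0's copy, which the thread does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the quoted kgdb session and mutex report. */
#define START_SET_PCPU  0xffffffff80c0c829ULL  /* &__start_set_pcpu (DPCPU_START) */
#define DPCPU_OFF_CPU0  0x00000000008407d7ULL  /* dpcpu_off[0] */
#define DPCPU_BASE_SEEN 0xffffff8000010000ULL  /* the printed dpcpu pointer */
#define MUTEX_ADDR_SEEN 0xffffffff8144d867ULL  /* reported netisr mutex address */

/* Offset stored per CPU: per-cpu base minus DPCPU_START, modulo 2^64. */
static uint64_t
dpcpu_offset(uint64_t percpu_base)
{
	return (percpu_base - START_SET_PCPU);
}

/* Per-CPU address of a variable, given its address in the master copy. */
static uint64_t
dpcpu_addr(uint64_t master_addr, uint64_t off)
{
	return (master_addr + off);
}

/* Distance of addr past the previous multiple of a power-of-two alignment. */
static uint64_t
misalignment(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1));
}
```

dpcpu_offset(DPCPU_BASE_SEEN) reproduces the 0xffffff807f4037d7 stored in 
dpcpu_off[1].  Working backwards under the CPU 0 assumption, MUTEX_ADDR_SEEN - 
DPCPU_OFF_CPU0 gives a master address of 0xffffffff80c0d090, which is 16-byte 
aligned, yet dpcpu_addr() of it lands 7 bytes past an 8-byte boundary.  That 
would be consistent with the odd start address of the set being subtracted 
from a page-aligned per-CPU base, though this is only an inference from the 
numbers.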

Robert N M Watson
Computer Laboratory
University of Cambridge
Received on Wed Aug 05 2009 - 21:17:12 UTC
