Re: reproducible panic in netisr

From: Robert Watson <rwatson_at_FreeBSD.org>
Date: Thu, 6 Aug 2009 15:11:26 +0100 (BST)

On Thu, 6 Aug 2009, Larry Rosenman wrote:

> On Thu, 6 Aug 2009, Robert Watson wrote:
>
>> On Tue, 4 Aug 2009, Navdeep Parhar wrote:
>> 
>>>>> This occurs on today's HEAD + some unrelated patches.  That makes it 
>>>>> 8.0BETA2+ code.  I haven't tried older builds.
>>>> 
>>>> We have finally been able to reproduce this ourselves yesterday and
>>> 
>>> Well, it happens every single time on all of my amd64 machines. After I'd 
>>> already sent my email I noticed that the netisr mutex has an odd address 
>>> (pun intended :-))
>>> 
>>> m=0xffffffff8144d867
>> 
>> Heh, indeed.  We just spotted the same result here.  In this case the odd 
>> address makes mtx_lock span a cache line boundary, so it is read 
>> non-atomically; a panic follows shortly afterwards because the resulting 
>> fractional pointer is not a valid thread pointer when it's dereferenced.
> [snip]
>
> Do we have an ETA for a testable patch?

RSN, I'm afraid.  We can eliminate the effect by reverting the use of DPCPU in 
netisr.c (basically reverting to pre-r195019 of netisr.c).  The interesting 
question is where the problem originates: is gcc/ld/etc. not laying out the 
ELF section properly, or are the MD (machine-dependent) parts not providing an 
aligned base?  There are probably similar issues in the DPCPU handling of 
modules, but first things first.
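
To make the "odd address" concrete, here is a small user-space sketch of the 
arithmetic.  It is illustrative only: the 64-byte cache line and the 24-byte 
offset of the lock word within the mutex are assumptions, not values taken 
from the tree, and the helper names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; not taken from the FreeBSD sources. */
#define	CACHE_LINE_SIZE		64u	/* typical amd64 cache line */
#define	LOCK_WORD_OFFSET	24u	/* hypothetical offset of mtx_lock in the mutex */

/* True if addr is a multiple of align (align must be a power of two). */
static inline bool
is_aligned(uint64_t addr, uint64_t align)
{
	return ((addr & (align - 1)) == 0);
}

/* True if the object [addr, addr + size) crosses a cache line boundary. */
static inline bool
spans_cache_line(uint64_t addr, size_t size)
{
	return (addr / CACHE_LINE_SIZE !=
	    (addr + size - 1) / CACHE_LINE_SIZE);
}
```

With the reported m=0xffffffff8144d867, is_aligned(m, 8) is false, and under 
the assumed offset the 8-byte lock word would start at ...d87f and straddle 
two cache lines, which is where a torn (non-atomic) read becomes possible.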

We'll be adding alignment assertions to the various lock init functions so 
that this is caught explicitly in future.  There are probably one or two other 
places where we have very strong alignment requirements on i386/amd64, such as 
the td_ucred pointer that we check for change on system calls/traps to see if 
we need to refresh the thread's credential from the process credential.
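
As a rough idea of what such an alignment assertion could look like, here is 
a minimal user-space sketch.  The struct and function names (demo_lock, 
demo_ptr_aligned, demo_lock_init) are hypothetical stand-ins, not the real 
mutex code, and plain assert() is used where kernel code would more likely 
use KASSERT.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel lock; not the real struct mtx. */
struct demo_lock {
	volatile uintptr_t dl_owner;	/* lock word: owner pointer, or 0 */
};

/* True if p is aligned to the size of a pointer-sized word. */
static inline int
demo_ptr_aligned(const volatile void *p)
{
	return (((uintptr_t)p & (sizeof(uintptr_t) - 1)) == 0);
}

/* Initialise the lock, refusing to proceed on a misaligned lock word. */
static inline void
demo_lock_init(struct demo_lock *l)
{
	/* Catch the problem at init time rather than as a later panic. */
	assert(demo_ptr_aligned(&l->dl_owner));
	l->dl_owner = 0;
}
```

The point of checking at init time is that a misaligned lock then fails 
immediately at the call site that set it up, instead of much later when a 
fractional pointer is dereferenced.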

Robert N M Watson
Computer Laboratory
University of Cambridge