Re: panic: double fault with 11.0-CURRENT r258504

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Thu, 28 Nov 2013 09:56:10 +0200
On Wed, Nov 27, 2013 at 01:11:35PM -0800, Don Lewis wrote:
> On 27 Nov, Konstantin Belousov wrote:
> > On Wed, Nov 27, 2013 at 11:35:19AM -0800, Don Lewis wrote:
> >> On 27 Nov, Konstantin Belousov wrote:
> >> > On Wed, Nov 27, 2013 at 11:02:57AM -0800, Don Lewis wrote:
> >> >> On 27 Nov, Konstantin Belousov wrote:
> >> >> > On Wed, Nov 27, 2013 at 10:33:30AM -0800, Don Lewis wrote:
> >> >> >> On 27 Nov, Konstantin Belousov wrote:
> >> >> >> > On Wed, Nov 27, 2013 at 09:41:36AM -0800, Don Lewis wrote:
> >> >> >> >> On 27 Nov, Konstantin Belousov wrote:
> >> >> >> >> > On Wed, Nov 27, 2013 at 02:49:12AM -0800, Don Lewis wrote:
> >> >> >> >> >> <http://people.freebsd.org/~truckman/doublefault2.JPG>
> >> >> >> >> > 
> >> >> >> >> > What is the instruction at cpu_switch+0x9b ?
> >> >> >> >> 
> >> >> >> >> movl 0x8(%edx),%eax
> >> >> >> > So it is line 176 in swtch.s. Is the machine still in ddb, or did you
> >> >> >> > obtain the core? If yes, please print out the content of words at
> >> >> >> > 0xe4f62bb0 + 4, +8 (*), +16. Please print the content of the word at
> >> >> >> > address (*) + 8.
> >> >> >> 
> >> >> >> It is still in ddb.
> >> >> >> 
> >> >> >> <http://people.freebsd.org/~truckman/doublefault3.JPG>, though not in
> >> >> >> the above order.
> >> >> > Uhm, sorry, I mistyped the last part of the instructions.
> >> >> > 
> >> >> > The new thread pointer is 0xd2f4e000, there is nothing incriminating.
> >> >> > Please print the word at 0xd2f4e000+0x254 == 0xd2f4e254, which would be
> >> >> > the address of the new thread pcb. It is the load from pcb + 8 that
> >> >> > faults.
> >> >> 
> >> >> 0xf3d44d60
> >> > Again, the pointer looks fine, and its tail is 0xd60, which is correct for
> >> > the pcb offset in the last page of the thread stack.
> >> > 
> >> > Please do 'show thread 0xd2f4e000' before trying the instructions below.
> >> 
> >> Ok, see below:
> >>  
> >> > What happens if you try to read word at 0xf3d44d68 ?
> >> 
> >> Nothing bad ...
> >> 
> >> <http://people.freebsd.org/~truckman/doublefault4.JPG>
> >> 
> > So the thread structure looks sane, the stack region is in place where
> > it is supposed to be, all the gathered data looks self-consistent. And,
> > the access to the faulted address from ddb does not fault.
> > 
> > Thread stacks can only be invalidated when the process is swapped out and
> > kernel stack is written to swap.  Your thread flags indicate that it is
> > in memory, and TDF_CANSWAP is not set.  I do not believe that our swapout
> > code would invalidate stack mapping in such situation, otherwise we would
> > have too many complaints already.
> > 
> > Just in case, do you use swap on this box ?
> 
> I do.
> 
> > And, as a last resort (I do understand that this sounds like giving up),
> > do you monitor the temperature of the CPUs? BTW, which CPUs are those?
> > Please show the CPU identification lines from the boot dmesg.
> 
> I don't monitor the temperature, but I do hear the CPU fan speed ramping
> up and down when I'm building ports like this.  Even though I'm pretty
> much keeping one core busy the whole time, the temperature must drop
> enough at times to let the fan speed drop.
> 
> I can run math/mprime on this machine for a while to see if anything
> shows up.  I also have a very similar machine (same motherboard but
> different CPU) that I can move the drive over to and test.
> 
> Here's the full dmesg.boot:
> 
> Copyright (c) 1992-2013 The FreeBSD Project.
> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
> 	The Regents of the University of California. All rights reserved.
> FreeBSD is a registered trademark of The FreeBSD Foundation.
> FreeBSD 11.0-CURRENT #63 r258614M: Tue Nov 26 00:29:01 PST 2013
>     dl_at_scratch.catspoiler.org:/usr/obj/usr/src/sys/GENERICSMB i386
> FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610
> WARNING: WITNESS option enabled, expect reduced performance.
> CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (2500.06-MHz 686-class CPU)
>   Origin = "AuthenticAMD"  Id = 0x60fb1  Family = 0xf  Model = 0x6b  Stepping = 1
>   Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
>   Features2=0x2001<SSE3,CX16>
>   AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
>   AMD Features2=0x11f<LAHF,CMP,SVM,ExtAPIC,CR8,Prefetch>

The errata list for the Athlon 64 X2 is quite long.  Do you have the
latest BIOS?  I am not sure whether AMD provides standalone firmware
update blocks for their CPUs.  If any Linux distribution ships updates
for AMD CPUs, it might be useful to load the update with cpucontrol(8).
Even if we are not hitting a CPU bug, it would give me more certainty
that we are not chasing a ghost.

Other things to try, though likely in vain, are compiling the kernel
with gcc or disabling SMP.

Peter, could you please try to reproduce the issue?  It does not look
like a random hardware failure, since in all cases it is the curthread
access that faults.  The issue has been reported only by Don, and so far
only for i386 SMP.
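
For reference, the address arithmetic from the quoted exchange can be
re-checked with a short script.  This is only a sketch: the addresses
and the 0x254 offset are copied from the messages above, the 4096-byte
page size is the i386 base page size, and the names (page_offset,
TD_PCB_OFFSET, fault_addr) are made up for illustration and are not
kernel identifiers.

```python
PAGE_SIZE = 4096  # i386 base page size (assumed; no large pages involved)


def page_offset(addr: int) -> int:
    """Return the offset of addr within its page."""
    return addr & (PAGE_SIZE - 1)


td = 0xD2F4E000         # new thread pointer quoted in the thread
TD_PCB_OFFSET = 0x254   # offset of the pcb pointer field, as quoted
pcb = 0xF3D44D60        # word read back from td + 0x254
fault_addr = pcb + 8    # movl 0x8(%edx),%eax loads from pcb + 8

print(hex(td + TD_PCB_OFFSET))    # where the pcb pointer is stored
print(hex(page_offset(pcb)))      # tail of the pcb address within its page
print(hex(fault_addr))            # address of the faulting load
# The 4-byte load must end inside the same page as the pcb itself.
print(page_offset(fault_addr) + 4 <= PAGE_SIZE)
```

The values printed match those discussed above: the pcb pointer lives at
0xd2f4e254, its tail is 0xd60, and the faulting load targets 0xf3d44d68,
which ddb could read without a fault.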

Received on Thu Nov 28 2013 - 06:56:18 UTC
