Re: panic: double fault with 11.0-CURRENT r258504

From: Don Lewis <truckman_at_FreeBSD.org>
Date: Thu, 28 Nov 2013 00:56:37 -0800 (PST)

On 28 Nov, Konstantin Belousov wrote:
> On Wed, Nov 27, 2013 at 01:11:35PM -0800, Don Lewis wrote:
>> On 27 Nov, Konstantin Belousov wrote:
>> > On Wed, Nov 27, 2013 at 11:35:19AM -0800, Don Lewis wrote:
>> >> On 27 Nov, Konstantin Belousov wrote:
>> >> > On Wed, Nov 27, 2013 at 11:02:57AM -0800, Don Lewis wrote:
>> >> >> On 27 Nov, Konstantin Belousov wrote:
>> >> >> > On Wed, Nov 27, 2013 at 10:33:30AM -0800, Don Lewis wrote:
>> >> >> >> On 27 Nov, Konstantin Belousov wrote:
>> >> >> >> > On Wed, Nov 27, 2013 at 09:41:36AM -0800, Don Lewis wrote:
>> >> >> >> >> On 27 Nov, Konstantin Belousov wrote:
>> >> >> >> >> > On Wed, Nov 27, 2013 at 02:49:12AM -0800, Don Lewis wrote:
>> >> >> >> >> >> <http://people.freebsd.org/~truckman/doublefault2.JPG>
>> >> >> >> >> > 
>> >> >> >> >> > What is the instruction at cpu_switch+0x9b ?
>> >> >> >> >> 
>> >> >> >> >> movl 0x8(%edx),%eax
>> >> >> >> > So it is line 176 in swtch.s. Is machine still in ddb, or did you
>> >> >> >> > obtained the core ? If yes, please print out the content of words at
>> >> >> >> > 0xe4f62bb0 + 4, +8 (*), +16. Please print the content of the word at
>> >> >> >> > address (*) + 8.
>> >> >> >> 
>> >> >> >> It is still in ddb.
>> >> >> >> 
>> >> >> >> <http://people.freebsd.org/~truckman/doublefault3.JPG>, though not in
>> >> >> >> the above order.
>> >> >> > Uhm, sorry, I mistyped the last part of the instructions.
>> >> >> > 
>> >> >> > The new thread pointer is 0xd2f4e000, there is nothing incriminating.
>> >> >> > Please print the word at 0xd2f4e000+0x254 == 0xd2f4e254, which would be
>> >> >> > the address of the new thread pcb. It is load from the pcb + 8 which
>> >> >> > faults.
>> >> >> 
>> >> >> 0xf3d44d60
>> >> > Again, the pointer looks fine, and its tail is 0xd60, which is correct for
>> >> > the pcb offset in the last page of the thread stack.
>> >> > 
>> >> > Please do 'show thread 0xd2f4e000' before trying below instructions.
>> >> 
>> >> Ok, see below:
>> >>  
>> >> > What happens if you try to read word at 0xf3d44d68 ?
>> >> 
>> >> Nothing bad ...
>> >> 
>> >> <http://people.freebsd.org/~truckman/doublefault4.JPG>
>> >> 
>> > So the thread structure looks sane, the stack region is in place where
>> > it is supposed to be, all the gathered data looks self-consistent. And,
>> > the access to the faulted address from ddb does not fault.
>> > 
>> > Thread stacks can only be invalidated when the process is swapped out and
>> > kernel stack is written to swap.  Your thread flags indicate that it is
>> > in memory, and TDF_CANSWAP is not set.  I do not believe that our swapout
>> > code would invalidate stack mapping in such situation, otherwise we would
>> > have too many complaints already.
>> > 
>> > Just in case, do you use swap on this box ?
>> 
>> I do.
>> 
>> > And, as the last resort, I do understand that this sounds as giving up,
>> > do you monitor the temperature of the CPUs ? BTW, which CPUs are that,
>> > please show the cpu identification lines from the boot dmesg.
>> 
>> I don't monitor the temperature, but I do hear the CPU fan speed ramping
>> up and down when I'm building ports like this.  Even though I'm pretty
>> much keeping one core busy the whole time, the temperature must drop
>> enough at times to let the fan speed drop.
>> 
>> I can run math/mprime on this machine for a while to see if anything
>> shows up.  I also have a very similar machine (same motherboard but
>> different CPU) that I can move the drive over to and test.
>> 
>> Here's the full dmesg.boot:
>> 
>> Copyright (c) 1992-2013 The FreeBSD Project.
>> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>> 	The Regents of the University of California. All rights reserved.
>> FreeBSD is a registered trademark of The FreeBSD Foundation.
>> FreeBSD 11.0-CURRENT #63 r258614M: Tue Nov 26 00:29:01 PST 2013
>>     dl_at_scratch.catspoiler.org:/usr/obj/usr/src/sys/GENERICSMB i386
>> FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610
>> WARNING: WITNESS option enabled, expect reduced performance.
>> CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ (2500.06-MHz 686-class CPU)
>>   Origin = "AuthenticAMD"  Id = 0x60fb1  Family = 0xf  Model = 0x6b  Stepping = 1
>>   Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
>>   Features2=0x2001<SSE3,CX16>
>>   AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
>>   AMD Features2=0x11f<LAHF,CMP,SVM,ExtAPIC,CR8,Prefetch>
> 
> The errata list for the Athlon 64 X2 is quite long.  Do you have latest
> BIOS ?  I am not sure if AMD provides standalone firmware update blocks
> for their CPUs.  If any Linux distribution ships updates for AMD CPUs,
> it might be useful to load the update with cpucontrol(8).  Even if we
> do not hit a CPU bug, it would provide me with more certainity that we
> are not chasing ghost.

I haven't figured out how to find the currently installed BIOS version.
The motherboard is Abit, which is no more, but I found an archive of all
of their downloads.  I'll also check into updates from the Linux world.

> Another things to try, in vain, is to compile kernel with gcc or disable
> SMP.

It has survived 10 hours running two copies of mprime.  I just moved the
boot drive over to another machine with the same type of motherboard,
but a different model AMD X2 CPU.

CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (2200.05-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0x40fb2  Family = 0xf  Model = 0x4b  Stepping = 2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x2001<SSE3,CX16>
  AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x1f<LAHF,CMP,SVM,ExtAPIC,CR8>
real memory  = 2147483648 (2048 MB)
avail memory = 1940611072 (1850 MB)
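
For anyone matching these two CPUs against the errata list: the Id value
in the dmesg is the CPUID leaf 1 signature, and the Family/Model/Stepping
fields printed next to it can be recomputed from it.  Below is a rough
sketch in Python, assuming the usual AMD convention that the extended
family and extended model bits only apply when the base family is 0xf.
The function name is made up for illustration; it is not taken from the
kernel sources.

```python
def decode_cpuid_signature(sig):
    """Split a CPUID leaf 1 EAX signature into (family, model, stepping).

    Assumes the AMD convention: the extended family/model fields are
    only folded in when the base family field is 0xf.
    """
    stepping = sig & 0xF             # bits 3:0
    base_model = (sig >> 4) & 0xF    # bits 7:4
    base_family = (sig >> 8) & 0xF   # bits 11:8
    ext_model = (sig >> 16) & 0xF    # bits 19:16
    ext_family = (sig >> 20) & 0xFF  # bits 27:20

    if base_family == 0xF:
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family, model = base_family, base_model
    return family, model, stepping


# The two Id values reported in this thread (4800+ and 4200+).
for sig in (0x60FB1, 0x40FB2):
    family, model, stepping = decode_cpuid_signature(sig)
    print(f"Id={sig:#x} family={family:#x} model={model:#x} stepping={stepping}")
```

Both signatures decode to the values the kernel printed (family 0xf,
models 0x6b and 0x4b, steppings 1 and 2), so the two boxes are different
revisions of the same family.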

I also have a fairly new quad core AMD box I can test on, as well as an
old dual P III machine.

This machine gets updated every month or so and I've never had stability
problems with it until just recently.  It's definitely been using clang
for quite a while without any problems other than the ports mess.

> Peter, could you, please, try to reproduce the issue ?  It does not look
> like a random hardware failure, since in all cases, it is curthread access
> which is faulting.  The issue is only reported by Don, and so far only
> for i386 SMP.

The workload that is triggering this is
	portupgrade -fr lang/perl5.16

I've got 1000+ ports installed and this causes 400+ to be rebuilt.  That
seems to cause it to panic about half the time.  The last time it made
it through 268 ports before it croaked.

Received on Thu Nov 28 2013 - 07:56:52 UTC
