Re: 8.0RC2 amd64 - kernel panic running make buildworld

From: Kai Gallasch <gallasch_at_free.de>
Date: Thu, 12 Nov 2009 19:59:32 +0100
On Wed, 11 Nov 2009 15:04:14 -0500,
John Baldwin <jhb_at_freebsd.org> wrote:

> On Wednesday 11 November 2009 2:15:18 pm S.N.Grigoriev wrote:
> > 
> > 10.11.09, 09:15, "Mark Atkinson" <atkin901_at_yahoo.com>
> > wrote:
> > 
> > > Andriy Gapon wrote:
> > > > on 10/11/2009 17:22 gary.jennejohn_at_freenet.de said the
> > > > following:
 
> > > > Not a trivial issue unless it is hardware indeed.
> > > > 
> > > Also, you can try adding:
> > > hw.mca.enabled="1" in /boot/loader.conf, reboot,  and then see if
> > > there is a machine check exception on the console during the
> > > buildworld.
> > 
> > Mark,
> > 
> > I've added hw.mca.enabled="1" in /boot/loader.conf and got the
> > following screen during the buildworld:
> > 
> > .....
> >  -c /usr/src/gnu/usr.bin/binutils/as/../../../../contrib/binutils/gas/sb.c
> > 
> > MCA: CPU3 UNCOR PCC OVER DTLB L1 error
> > MCA: Address 0x8015fb000
> 
> Your hardware is broken and it is telling you so.  You have had
> multiple machine checks with the most severe one being an
> uncorrectable error in your data TLB (i.e. in the CPU itself).

John,

I also set hw.mca.enabled="1" and vm.pmap.pg_ps_enabled="1"
in /boot/loader.conf on my Opteron ProLiant server, which reboots
spontaneously under load.

The server was upgraded to FreeBSD 8.0-PRERELEASE today.

This is what happened:


---- machine check trap, first run ----

sonnenkraft:/usr/obj # MCA: CPU 5 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x80e5c8000


Fatal trap 28: machine check trap while in user mode
cpuid = 5; apic id = 05
instruction pointer     = 0x43:0x691688
stack pointer           = 0x3b:0x7fffffffdf90
frame pointer           = 0x3b:0x6a2
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 3, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, IOPL = 0
current process         = 29319 (cc1)
[thread pid 29319 tid 100086 ]
Stopped at      0x691688:       leal    0x1(%rax),%edx
db> where  
Tracing pid 29319 tid 100086 td 0xffffff000e065390
WAKEUP_cpu() at 0x691688
*** error reading from address 6aa ***
db> bt  
Tracing pid 29319 tid 100086 td 0xffffff000e065390
WAKEUP_cpu() at 0x691688
*** error reading from address 6aa ***
db> call doadump  
Cannot dump. Device not defined or unavailable.
= 0x30
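
A side note on reading the trap frame above: the values before the colon in
the instruction pointer (0x43) and stack pointer (0x3b) are x86 segment
selectors. The sketch below splits them into their standard fields (index,
table indicator, requested privilege level). The helper name
decode_selector is made up for illustration; only the bit layout comes from
the x86 architecture.

```python
def decode_selector(sel):
    """Split an x86 segment selector into its three standard fields."""
    return {
        "index": sel >> 3,                       # descriptor table index
        "table": "LDT" if sel & 0x4 else "GDT",  # table indicator bit
        "rpl": sel & 0x3,                        # requested privilege level
    }

# Selectors taken from the trap output in this message.
cs = decode_selector(0x43)  # code segment part of the instruction pointer
ss = decode_selector(0x3b)  # stack segment part of the stack pointer

print(cs)
print(ss)
```

Both selectors decode to privilege level 3, which is consistent with the
"while in user mode" and "DPL 3" lines in the trap output.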


---- machine check trap, second run - this
                    time with dumpdev defined ----

sonnenkraft:~ # MCA: CPU 2 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x8011d3000


Fatal trap 28: machine check trap while in user mode
cpuid = 2; apic id = 02
instruction pointer     = 0x43:0x6b1241
stack pointer           = 0x3b:0x7fffffffe200
frame pointer           = 0x3b:0x7fffffffe240
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 3, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, IOPL = 0
current process         = 69498 (cc1)
[thread pid 69498 tid 100338 ]
Stopped at      0x6b1241:       call    0x6af140
db> where  
Tracing pid 69498 tid 100338 td 0xffffff000ef75720
WAKEUP_cpu() at 0x6b1241
db> bt  
Tracing pid 69498 tid 100338 td 0xffffff000ef75720
WAKEUP_cpu() at 0x6b1241
db> call doadump  
Physical memory: 20462 MB
Dumping 2303 MB: 2288 2272 2256 2240 2224 2208 2192 2176 2160 2144 2128
2112 2096 2080 2064 2048 2032 2016 2000 1984 1968 1952 1936 1920 1904
1888 1872 1856 1840 1824 1808 1792 1776 1760 1744 1728 1712 1696 1680
1664 1648 1632 1616 1600 1584 1568 1552 1536 1520 1504 1488 1472 1456
1440 1424 1408 1392 1376 1360 1344 1328 1312 1296 1280 1264 1248 1232
1216 1200 1184 1168 1152 1136 1120 1104 1088 1072 1056 1040 1024 1008
992 976 960 944 928 912 896 880 864 848 832 816 800 784 768 752 736 720
704 688 672 656 640 624 608 592 576 560 544 528 512 496 480 464 448 432
416 400 384 368 352 336 320 304 288 272 256 240 224 208 192 176 160 144
128 112 96 80 64 48 32 16
Dump complete
= 0
db> reboot  
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 2


---- machine check trap, third run - BIOS: static low
               power mode enabled, to rule out power/heat issue ----

sonnenkraft:~ # MCA: CPU 4 UNCOR PCC OVER DTLB L1 error
MCA: Address 0x8011fd000


Fatal trap 28: machine check trap while in user mode
cpuid = 4; apic id = 04
instruction pointer     = 0x43:0x76127d
stack pointer           = 0x3b:0x7fffffffe068
frame pointer           = 0x3b:0x7fffffffe090
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 3, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, IOPL = 0
current process         = 73135 (cc1)
[thread pid 73135 tid 100146 ]
Stopped at      0x76127d:       xorl    %edx,%edx
db> where  
Tracing pid 73135 tid 100146 td 0xffffff00071caab0
WAKEUP_cpu() at 0x76127d
db> bt  
Tracing pid 73135 tid 100146 td 0xffffff00071caab0
WAKEUP_cpu() at 0x76127d
db> call doadump  
Physical memory: 20462 MB
Dumping 2335 MB: 2320 2304 2288 2272 2256 2240 2224 2208 2192 2176 2160
2144 2128 2112 2096 2080 2064 2048 2032 2016 2000 1984 1968 1952 1936
1920 1904 1888 1872 1856 1840 1824 1808 1792 1776 1760 1744 1728 1712
1696 1680 1664 1648 1632 1616 1600 1584 1568 1552 1536 1520 1504 1488
1472 1456 1440 1424 1408 1392 1376 1360 1344 1328 1312 1296 1280 1264
1248 1232 1216 1200 1184 1168 1152 1136 1120 1104 1088 1072 1056 1040
1024 1008 992 976 960 944 928 912 896 880 864 848 832 816 800 784 768
752 736 720 704 688 672 656 640 624 608 592 576 560 544 528 512 496 480
464 448 432 416 400 384 368 352 336 320 304 288 272 256 240 224 208 192
176 160 144 128 112 96 80 64 48 32 16
Dump complete
= 0
db> reboot  
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 4

---- END: ----
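
To compare the three runs at a glance, the MCA console lines can be tallied
with a small script. This is only a rough sketch: the regular expression
assumes the line layout seen above ("MCA: CPU <n> <flags> <unit> <level>
error") and is not a documented format, and the names parse_mca and
summarize are hypothetical.

```python
import re
from collections import Counter

# The three MCA console lines reported in this message.
CONSOLE_LINES = [
    "MCA: CPU 5 UNCOR PCC OVER DTLB L1 error",
    "MCA: CPU 2 UNCOR PCC OVER DTLB L1 error",
    "MCA: CPU 4 UNCOR PCC OVER DTLB L1 error",
]

# Assumed layout: "MCA: CPU <n> <flags...> <unit> <level> error".
# Lines without any flag words are not matched by this pattern.
MCA_RE = re.compile(r"MCA: CPU\s*(\d+)\s+(.*?)\s+(\S+)\s+(L\d)\s+error")


def parse_mca(line):
    """Return a dict for one MCA line, or None if it does not match."""
    m = MCA_RE.search(line)
    if m is None:
        return None
    cpu, flags, unit, level = m.groups()
    return {"cpu": int(cpu), "flags": flags.split(), "unit": unit, "level": level}


def summarize(lines):
    """Count errors per CPU and per (unit, level, flags) combination."""
    records = [r for r in map(parse_mca, lines) if r is not None]
    cpus = Counter(r["cpu"] for r in records)
    kinds = Counter((r["unit"], r["level"], tuple(r["flags"])) for r in records)
    return cpus, kinds


cpus, kinds = summarize(CONSOLE_LINES)
print("errors per CPU:", dict(cpus))
print("error kinds:", dict(kinds))
```

For the lines above, this reports one error each on CPUs 5, 2 and 4, all of
the same kind (DTLB, L1, with the UNCOR, PCC and OVER flags set).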

Which hardware parts are defective and need replacing: CPU, memory,
or mainboard?

I now have two vmcores plus the crashinfo core.txt files available on
the server. Would they be of any use for getting further information?

--Kai.


-- 
Draft beer, not people.
Received on Thu Nov 12 2009 - 17:59:35 UTC
