Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT

From: John Baldwin <jhb_at_freebsd.org>
Date: Thu, 17 Nov 2011 11:33:34 -0500
On Thursday, November 17, 2011 3:59:43 am Stefan Esser wrote:
> On 16.11.2011 17:16, John Baldwin wrote:
> > On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
> >> ...
> >> WARNING: WITNESS option enabled, expect reduced performance.
> >> Table 'FACP' at 0xba918a58
> >> Table 'APIC' at 0xba918b50
> >> Table 'SSDT' at 0xba918be8
> >> Table 'MCFG' at 0xba918dc0
> >> Table 'HPET' at 0xba918e00
> >> ACPI: No SRAT table found
> >> Preloaded elf kernel "/boot/kernel/kernel" at 0xffffffff81109000
> >> Preloaded elf obj module "/boot/kernel/zfs.ko" at 0xffffffff81109370 <--
> >> kldload: unexpected relocation type 67108875
> >> kernel trap 12 with interrupts disabled
> >>
> >> The irritating detail is the load address of "zfs.ko", which is just
> >> 0x370 bytes above the kernel load address ...
> > 
> > That isn't unusual.  Those are the addresses of the metadata provided by the 
> > loader, not the base address of the kernel or zfs.ko object themselves.  The 
> > unexpected relocation type is interesting however.  That value in hex is 
> > 0x400000b.  0xb is the R_X86_64_32S relocation type which is normal for the 
> > kernel.  I think you just have a single-bit memory error due to a failing 
> > DIMM.
> 
> Thanks for the information about the load address semantics. The other
> unexpected relocation type I observed was 268435457 == 0x10000001, which
> also hints at a single bit error. But today the system failed with a
> different error:
> 
> ath0: ...
> ioapic0: routing interrupt 18 to ...
> panic: vm_page_insert: page already inserted
> 
> This could of course also be caused by a single bit error ...

Yes, very likely.
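
To make the single-bit reasoning concrete, here is a minimal sketch (not
code from the kernel or the loader) that flips each bit of a reported
relocation type and checks whether the result is a known amd64 relocation
type.  The table is only a small subset of the x86-64 ELF relocation
numbers, and the helper name single_bit_candidates is made up for this
illustration.

```python
# Subset of x86-64 ELF relocation type numbers (illustrative, not complete).
KNOWN_RELOC_TYPES = {
    1: "R_X86_64_64",
    2: "R_X86_64_PC32",
    4: "R_X86_64_PLT32",
    10: "R_X86_64_32",
    11: "R_X86_64_32S",
}


def single_bit_candidates(observed, width=32):
    """Return (bit, name) pairs for every single-bit flip of `observed`
    that lands on a known relocation type.

    `single_bit_candidates` is a hypothetical helper for this example.
    """
    hits = []
    for bit in range(width):
        candidate = observed ^ (1 << bit)  # flip exactly one bit
        if candidate in KNOWN_RELOC_TYPES:
            hits.append((bit, KNOWN_RELOC_TYPES[candidate]))
    return hits


if __name__ == "__main__":
    # The two values reported in this thread.
    for value in (67108875, 268435457):
        print(hex(value), single_bit_candidates(value))
```

For the two values seen in this thread, 67108875 (0x400000b) is one flip
of bit 26 away from R_X86_64_32S, and 268435457 (0x10000001) is one flip
of bit 28 away from R_X86_64_64, which is consistent with isolated
single-bit errors rather than a corrupted module file.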

> Hmmm, perhaps there is a problem with components at room temperature
> and the system is still significantly warmer after 3 hours?

Yes, I strongly suspect a thermal effect: the RAM "works" once it has
warmed up.  If you have data you care about on the machine, I would
replace the RAM now rather than wait for the failure to get worse.

-- 
John Baldwin
Received on Thu Nov 17 2011 - 15:33:35 UTC
