[SOLVED]: HW defect (was: Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT)

From: Stefan Esser <se_at_freebsd.org>
Date: Wed, 30 Nov 2011 11:16:25 +0100

On 17.11.2011 17:33, John Baldwin wrote:
> On Thursday, November 17, 2011 3:59:43 am Stefan Esser wrote:
>> On 16.11.2011 17:16, John Baldwin wrote:
[...]
>>> That isn't unusual.  Those are the addresses of the metadata provided by the 
>>> loader, not the base address of the kernel or zfs.ko object themselves.  The 
>>> unexpected relocation type is interesting however.  That value in hex is 
>>> 0x400000b.  0xb is the R_X86_64_32S relocation type which is normal for the 
>>> kernel.  I think you just have a single-bit memory error due to a failing 
>>> DIMM.
>>
>> Thanks for the information about the load address semantics. The other
>> unexpected relocation type I observed was 268435457 == 0x10000001, which
>> also hints at a single bit error. But today the system failed with a
>> different error:
>>
>> ath0: ...
>> ioapic0: routing interrupt 18 to ...
>> panic: vm_page_insert: page already inserted
>>
>> This could of course also be caused by a single bit error ...
> 
> Yes, very likely.
> 
>> Hmmm, perhaps there is a problem with components at room temperature
>> and the system is still significantly warmer after 3 hours?
> 
> Yes, I strongly suspect it is a thermal effect that the RAM "works" once it
> is warmed up.  If you have data you care about on the machine, I would just
> go ahead and replace the RAM now before waiting for the RAM's failure to
> become worse.

Thanks a lot, John!

I should have checked the hardware earlier, but since the system
was perfectly stable once it was up and running, I had suspected
an initialization bug rather than defective RAM.

In fact, one of the 4GB DIMMs in the system returns bogus data
(0x10000000 or 0x04000000 instead of 0) for some 40 to 50 seconds
after power-on. Once warmed up, memtest86+ runs for days without a
single further data error (I wanted an estimate of how likely it is
that the defect has led to damaged data in files on disk).
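
The two bogus values match the unexpected relocation types quoted above:
each observed value is a normal relocation type with exactly one extra
bit set. The following is a minimal sketch of that cross-check, not
anything taken from the kernel linker. The type numbers 1 (R_X86_64_64)
and 11 (R_X86_64_32S) are from the x86-64 ELF psABI; the helper name
single_bit_origin and the table VALID_TYPES are made up for illustration,
and the table is deliberately limited to these two types.

```python
# Relocation type numbers from the x86-64 ELF psABI.
R_X86_64_64 = 1
R_X86_64_32S = 11

# Illustrative subset only; a real check would list every type the
# kernel linker accepts.
VALID_TYPES = {R_X86_64_64: "R_X86_64_64", R_X86_64_32S: "R_X86_64_32S"}


def single_bit_origin(observed, valid=VALID_TYPES):
    """Return (name, mask) for each valid type exactly one bit away from observed."""
    hits = []
    for rtype, name in valid.items():
        diff = observed ^ rtype
        # diff is a power of two iff exactly one bit differs.
        if diff != 0 and diff & (diff - 1) == 0:
            hits.append((name, diff))
    return hits


# The two unexpected relocation types reported in this thread.
for observed in (0x400000b, 268435457):
    for name, mask in single_bit_origin(observed):
        bit = mask.bit_length() - 1
        print(f"{observed:#x}: {name} with bit {bit} flipped (mask {mask:#010x})")
```

The masks this reports, 0x04000000 and 0x10000000, are the same two
values the DIMM returned instead of 0.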

When I was still doing hardware work, I always had a freezer aerosol
on my desk, which allowed me to quickly cool down a DUT by a few tens
of degrees. Without such a tool, I had to wait for the components
to cool down overnight between tests.

Best regards, STefan
Received on Wed Nov 30 2011 - 09:29:43 UTC
