Re: [CURRENT]: weird memory/linker problem?

From: Willem Jan Withagen <wjw_at_digiware.nl>
Date: Tue, 01 Jul 2014 17:57:26 +0200

On 2014-07-01 17:33, O. Hartmann wrote:
> On Tue, 01 Jul 2014 17:23:14 +0200
> Willem Jan Withagen <wjw_at_digiware.nl> wrote:
>
>> On 2014-07-01 16:48, Rang, Anton wrote:
>>> DOT => DOD
>>>
>>> 444F54 => 444F44
>>>
>>> That's a single-bit flip.  Bad memory, perhaps?
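
The single-bit claim above can be checked by XOR-ing the two strings byte by byte and listing the bits that differ. What follows is only a minimal sketch: the helper name `flipped_bits` is made up for illustration and is not part of any tool mentioned in this thread.

```python
def flipped_bits(a, b):
    """Return (byte_index, bit_index) for every bit that differs between a and b.

    Hypothetical helper for illustration; bit_index 0 is the least
    significant bit of the byte.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    diffs = []
    for i, (x, y) in enumerate(zip(a, b)):
        delta = x ^ y  # set bits in delta mark positions where x and y differ
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((i, bit))
    return diffs


# 'T' is 0x54 and 'D' is 0x44; they differ only in bit 4 (value 0x10).
print(flipped_bits(b"DOT", b"DOD"))  # -> [(2, 4)]
```

A result with exactly one entry means one bit changed, which is what you would expect from a single faulty memory cell rather than from a software bug.
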
>>
>> Very likely, especially if the system does not have ECC....
>> It just happens on rare occasions that an alpha particle, a power cycle, or
>> anything else disruptive damages a memory cell. And it could be that
>> it requires a special pattern of accesses to actually exhibit the error.
>>
>> In the past (199x's) 'make buildworld' used to be a rather good memory
>> tester. But nowadays look at
>> 	http://www.memtest.org/
>>
>> This tool has found all of the bad memory in all the systems I have used
>> and/or built for others...
>> Note that it might take a few runs and some more heat to actually
>> trigger the faulty cell, but memtest86 will usually find it.
>>
>> Note that on big systems with lots of memory it can take a loooooong
>> time to run just one full testset to completion.
>>
>> --WjW
>
> I already tested via memtest86+ (had to download the Linux image, the port on FreeBSD is
> broken on CURRENT). It didn't find anything strange so far.
>
> I will do another test.
>
> I realised that on that specific box the chipset temperature is 81 degrees Celsius.
> The chipset is an Eaglelake P45 - in which the memory controller resides on that old
> platform. dmidecode gives:
>
>          Manufacturer: ASUSTeK Computer INC.
>          Product Name: P5Q-WS
>          Version: Rev 1.xx

Hi Oliver,

I've built several (5+) systems with these boards (from memory they date
from around 2009). And if I recall right, one of them is still functional.
The first one broke down in a couple of weeks, and the others did not
survive over time either.

The auxiliary chips on that board do run hot, but I never realized they
ran this hot. Is 81C the CPU temp from sysctl, or did you measure the
heatsink on the motherboard? In the latter case it is probably just too hot.
But even if it is the temp on the chip itself, I've rarely seen temps
go up this high.

You may need to run memtest86 for more than 6-10 complete runs with
all the tests.

If the memtests do not reveal anything broken, then you get into even
more wizardry stuff, like bad power etc... Especially since it only
occurs on occasion, it is going to be a nightmare to find the root cause
of this, other than by replacing hardware piece by piece, which won't be
easy given the age of the board and parts.

You could go into the BIOS and try to configure RAM access at a slower
speed and see if the problem goes away. If it does, it could be that you
are running on the edge of the spec with regard to RAM timing.

But like I said, it is all lots of funky details that can interact in 
strange and unexpected ways.

--WjW
Received on Tue Jul 01 2014 - 13:57:32 UTC
