Re: [CURRENT]: weird memory/linker problem?

From: O. Hartmann <ohartman_at_zedat.fu-berlin.de>
Date: Tue, 1 Jul 2014 18:56:03 +0200
On Tue, 01 Jul 2014 17:57:26 +0200,
Willem Jan Withagen <wjw_at_digiware.nl> wrote:

> On 2014-07-01 17:33, O. Hartmann wrote:
> > On Tue, 01 Jul 2014 17:23:14 +0200,
> > Willem Jan Withagen <wjw_at_digiware.nl> wrote:
> >
> >> On 2014-07-01 16:48, Rang, Anton wrote:
> >>> DOT => DOD
> >>>
> >>> 444F54 => 444F44
> >>>
> >>> That's a single-bit flip.  Bad memory, perhaps?
> >>
> >> Very likely, especially if the system does not have ECC....
> >> It just happens on rare occasions that an alpha particle, power cycle, or
> >> anything else disruptive damages a memory cell. And it could be that
> >> it requires a special pattern of accesses to actually exhibit the error.
> >>
> >> In the past (199x's) 'make buildworld' used to be a rather good memory
> >> tester. But nowadays look at
> >> 	http://www.memtest.org/
> >>
> >> This tool has found all of the bad memory in all the systems I have used
> >> and/or built for others...
> >> Note that it might take a few runs and some more heat to actually
> >> trigger the faulty cell, but memtest86 will usually find it.
> >>
> >> Note that on big systems with lots of memory it can take a loooooong
> >> time to run just one full testset to completion.
> >>
> >> --WjW
> >
> > I already tested via memtest86+ (had to download the Linux image, the port on FreeBSD
> > is broken on CURRENT). It didn't find anything strange so far.
> >
> > I will do another test.
> >
> > I realised that on that specific box the chipset temperature is 81 degrees Celsius.
> > The chipset is an Eaglelake P45, in which the memory controller resides on that old
> > platform. dmidecode gives:
> >
> >          Manufacturer: ASUSTeK Computer INC.
> >          Product Name: P5Q-WS
> >          Version: Rev 1.xx
>


Hello Willem,

 
> Hi Oliver,
> 
> I've built several (5+) systems with these boards (from memory they date
> from around 2009??). And if I recall right, one of them is still functional.
> The first one broke down in a couple of weeks, and the others did not
> survive time either.
> 
> The auxiliary chips on that board do run hot, but I never realized they ran
> this hot. Is 81C the CPU temp from sysctl, or did you measure the cooling
> body on the motherboard? In the latter case it is just too hot, probably.
> But even if it is the temp on the chip itself, I've rarely seen temps
> go up this high.

The temperature is shown in the BIOS and by one of those health daemons found in
ports (I forgot the name).
There is no sysctl MIB showing the chipset temperature on that board, as far as I know.

> 
> You may need to run memtest86 for more than 6-10 complete runs with
> all the tests.

Last time I ran memtest86+ it took ~ 1 1/2 days to finish.

> 
> If the memtests do not reveal anything broken, then you get into even 
> more wizardry stuff, like bad power etc... Especially since it only 
> occurs on occasion, it is going to be a nightmare to find the root cause 
> of this. Other than replacing hardware piece by piece, which won't be 
> easy given the age of the board and parts.
> 
> You could go into the BIOS, and try to configure RAM access at a slower
> speed and see if the problem goes away. Then it could be that you are
> running on the edge of the spec with regards to RAM timing.
> 
> But like I said, it is all lots of funky details that can interact in 
> strange and unexpected ways.
> 
> --WjW

I will check the memory again over the next few days.
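
For reference, the single-bit-flip observation quoted at the top of this
thread (DOT => DOD, 444F54 => 444F44) can be checked with a small sketch.
The helper name bit_diff is made up for illustration and is not part of any
tool mentioned here; it only XORs the two byte strings and counts the bits
that differ.

```python
def bit_diff(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # XOR leaves a 1 exactly where the two bytes disagree; count those 1s.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


good = b"DOT"   # hex 444F54
bad = b"DOD"    # hex 444F44

print(good.hex().upper(), "=>", bad.hex().upper())
print("differing bits:", bit_diff(good, bad))
```

Only the last byte changes (0x54 vs. 0x44), and those two values differ in
a single bit (0x10), which is why the corruption looks like a memory error
rather than a software bug.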

Regards,
Oliver


Received on Tue Jul 01 2014 - 14:56:17 UTC
