Re: [CURRENT]: weird memory/linker problem?

From: O. Hartmann <ohartman_at_zedat.fu-berlin.de>
Date: Fri, 18 Jul 2014 00:22:52 +0200

On Tue, 01 Jul 2014 17:23:14 +0200
Willem Jan Withagen <wjw_at_digiware.nl> wrote:

> On 2014-07-01 16:48, Rang, Anton wrote:
> > DOT => DOD
> >
> > 444F54 => 444F44
> >
> > That's a single-bit flip.  Bad memory, perhaps?
> 
> Very likely, especially if the system does not have ECC....
> It just happens on rare occasions that an alpha particle, a power cycle, or 
> anything else disruptive damages a memory cell. And it could be that 
> it requires a special pattern of accesses to actually exhibit the error.
> 
> In the past (199x's) 'make buildworld' used to be a rather good memory 
> tester. But nowadays look at
> 	http://www.memtest.org/
> 
> This tool has found all of the bad memory in all the systems I have used 
> and/or built for others...
> Note that it might take a few runs and some more heat to actually 
> trigger the faulty cell, but memtest86 will usually find it.
> 
> Note that on big systems with lots of memory it can take a loooooong 
> time to run just one full testset to completion.
> 
> --WjW
> 
> 
> >
> > Anton
> >
> > -----Original Message-----
> > From: owner-freebsd-current_at_freebsd.org [mailto:owner-freebsd-current_at_freebsd.org] On
> > Behalf Of O. Hartmann Sent: Tuesday, July 01, 2014 8:08 AM
> > To: Dimitry Andric
> > Cc: Adrian Chadd; FreeBSD CURRENT
> > Subject: Re: [CURRENT]: weird memory/linker problem?
> >
> > On Mon, 23 Jun 2014 17:22:25 +0200
> > Dimitry Andric <dim_at_FreeBSD.org> wrote:
> >
> >> On 23 Jun 2014, at 16:31, O. Hartmann <ohartman_at_zedat.fu-berlin.de> wrote:
> >>> On Sun, 22 Jun 2014 10:10:04 -0700
> >>> Adrian Chadd <adrian_at_freebsd.org> wrote:
> >>>> When they segfault, where do they segfault?
> >> ...
> >>> GIMP, LaTeX work, nothing special, but a bit memory consuming
> >>> regarding GIMP) I tried updating the ports tree and surprisingly the
> >>> tree is left over in an unclean condition while /usr/bin/svn segfaults
> >>> (on console: pid 18013 (svn), uid 0: exited on signal 11 (core dumped)).
> >>>
> >>> Using /usr/local/bin/svn, which is from the devel/subversion port,
> >>> performs well, while FreeBSD 11's svn contribution dies as described. It did not
> >>> hours ago!
> >>
> >> I think what Adrian meant was: can you run svn (or another crashing
> >> program) in gdb, and post a backtrace?  Or maybe run ktrace, and see
> >> where it dies?
> >>
> >> Alternatively, put a core dump and the executable (with debug info) in
> >> a tarball, and upload it somewhere, so somebody else can analyze it.
> >>
> >> -Dimitry
> >>
> >
> > It's me again, with the same weird story.
> >
> > After a couple of days of silence, the mysterious entity in my computer is back. This
> > time it is again a weird compiler failure message (while trying to buildworld):
> >
> > [...]
> > c++  -O2 -pipe -O3 -O3
> > c++ -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/include
> > -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/tools/clang/include
> > -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/lib/Support -I.
> > -I/usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/../../lib/clang/include
> > -DLLVM_ON_UNIX -DLLVM_ON_FREEBSD -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS
> > -fno-strict-aliasing -DLLVM_DEFAULT_TARGET_TRIPLE=\"x86_64-unknown-freebsd11.0\"
> > -DLLVM_HOST_TRIPLE=\"x86_64-unknown-freebsd11.0\" -DDEFAULT_SYSROOT=\"\"
> > -Qunused-arguments -I/usr/obj/usr/src/tmp/legacy/usr/include -std=c++11
> > -fno-exceptions -fno-rtti -Wno-c++11-extensions
> > -c /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/lib/Support/Host.cpp -o Host.o
> > --- GraphWriter.o ---
> > In file included from /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/lib/Support/GraphWriter.cpp:14:
> > /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/include/llvm/Support/GraphWriter.h:269:10: error: use of undeclared identifier 'DOD'; did you mean 'DOT'?
> >     O << DOD::EscapeString(Label);
> >          ^~~
> >          DOT
> > /usr/src/lib/clang/libllvmsupport/../../../contrib/llvm/include/llvm/Support/GraphWriter.h:35:11: note: 'DOT' declared here
> > namespace DOT {  // Private functions...
> >           ^
> > 1 error generated.
> > *** [GraphWriter.o] Error code 1
> >
> >
> > Well, in the past I saw many of those messages, especially about labels of routines
> > not being found in shared objects/libraries, or even "funny" misspellings like the
> > one shown above.
> >
> > I cannot reproduce them after a reboot, but once the error has occurred, it is
> > sticky for as long as the system keeps running. So in order to compile the OS
> > successfully, I reboot.
> >
> > Does anyone have an idea what this could be? Since it affects at the moment only one
> > machine (the other CoreDuo has been retired in the meanwhile), it feels a bit like a
> > miscompilation on a certain type of CPU.
> >
> > Thanks for your patience,
> >
> > Oliver

Hello all.

Well, I'd like to share an update. It doesn't resolve the underlying problem, but it may
add to the collective experience.

The box in question is now running with only 4 GB - and is operable as expected. With
8 GB, I see the weird bugs reported earlier, and they have indeed turned out to be bit
flips. I cannot reproduce them on demand; they occur spontaneously, but I can raise their
frequency by permuting the RAM sticks. That is the state so far. As reported, the
memtest86+ test doesn't show anything, even after three days(!) of testing!
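
The "single-bit flip" diagnosis from earlier in this thread (DOT => DOD, 444F54 =>
444F44) can be checked by XOR-ing the expected and the observed bytes. The following is
only a minimal sketch; the helper name `flipped_bits` is made up for this example and is
not part of any existing tool.

```python
def flipped_bits(expected, observed):
    """Return a list of (byte_index, bit_index) for every differing bit.

    Hypothetical helper for illustration; bit_index 0 is the least
    significant bit of the byte.
    """
    if len(expected) != len(observed):
        raise ValueError("inputs must have the same length")
    diffs = []
    for i, (a, b) in enumerate(zip(expected, observed)):
        x = a ^ b  # a set bit in x marks a position where the bytes differ
        for bit in range(8):
            if x & (1 << bit):
                diffs.append((i, bit))
    return diffs


# 'T' is 0x54 and 'D' is 0x44; they differ only in bit 4 (value 0x10).
print(flipped_bits(b"DOT", b"DOD"))  # prints [(2, 4)]
```

Exactly one entry in the result means exactly one bit differs between the two strings,
which is what one would expect from a single faulty memory cell rather than from a
software bug.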

The box was built in 2009 as a development system with 4 GB RAM. At that time, the
developer ordered special and expensive overclocker RAM, Ballistix, from Crucial. Usually,
I purchase JEDEC-conformant RAM - I have an allergic reaction to this stupid overclocking
and "planned destruction with fun" of silicon by overdriving it, especially when it
concerns equipment we have to rely on. The box was later upgraded with a further 4 GB of
RAM (two sticks) of the same type and brand, running at 2+ volts (as far as I know).

Last summer, after 4 years of trouble-free operation, I suddenly had to fight spontaneous
crashes and blamed FreeBSD CURRENT, but very quickly the memory turned out to be the
culprit. The funny thing was: the box literally "roasted" the upper 4 GB bank, and the
sticks got so hot that you could seriously burn your fingers touching them (I did!). The
end of that game was, after a cascade of tests and swapping RAM sticks, that the sticks in
the upper slots (B1 and B2) were destroyed! After I replaced the RAM completely with 8 GB
of JEDEC-conformant modules, the system ran perfectly, until this summer. When the
temperatures in my lab went beyond 20 degrees Celsius at the end of May, the box started
having these bit flip issues.

I guess that there is a temperature-triggered problem with the voltage regulation, or
something else that is slowly killing the RAM modules. This is only a guess. As I
reported, the chipset itself reports 81 - 85 degrees C (in the BIOS and with healthd).
This high temperature appeared suddenly last year, and at first I thought it could be a
mismeasurement.

After testing VBox and occupying all available memory without any obvious error or crash,
I tried compiling the OS, and it seems that the notable load LLVM/Clang produces when
building world/kernel in parallel also triggers these bit flips, which very quickly result
in strange errors like those reported earlier in this thread. The ultimate failure arose
when I tried to install Windows 7 on a spare hard drive, with 8 GB RAM in the box: the
install process died with a file corruption or a file that was not copied. I didn't dare
to try a FreeBSD installation, since I know from the past that FreeBSD's copying also very
quickly exposes hardware issues, if there are any (overheating and its siblings). With
only 4 GB everything works as expected, but 4 GB is a pain in the ass with ZFS and
11.0-CURRENT alone, not to mention the pain when doing memory-intensive
calculations/simulations or even running VBox.

In the end, the conclusion is mixed. I realise that I cannot trust the verdict of
memtest86+. There is no suitable "burn-in" test for FreeBSD that consumes, stresses and
tortures memory and bus systems as well as all cores of the CPU, starting with Core2Duo
CPUs, since cpuburn/burncpu from the ports do not utilise AVX/SIMD or other "hot"
facilities of modern Intel-like CPUs, nor do they stress the integrated memory controller
in a "brutal" way. Prime95 is only available for i386 - and that is a pity on amd64 with
> 4 GB RAM.
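
For illustration only, the basic idea behind a pattern-based memory check can be sketched
in a few lines of Python. This is a user-space toy under stated assumptions: the function
names `check_pattern` and `run_passes`, the buffer size and the pattern set are all made
up for this example, it only touches whatever virtual memory the OS hands out, and reads
may be served from CPU caches. It is not a replacement for memtest86+ or a real burn-in
tool, and it does not load the CPU cores or the memory controller the way a parallel
buildworld does.

```python
def check_pattern(size_bytes, pattern):
    """Fill a buffer with one byte pattern and return offsets that read back wrong.

    Hypothetical helper for illustration; `pattern` is an int in 0..255.
    """
    buf = bytearray([pattern]) * size_bytes
    # Fast path: every byte still holds the pattern.
    if buf.count(pattern) == size_bytes:
        return []
    # Slow path: locate the mismatching offsets.
    return [i for i, b in enumerate(buf) if b != pattern]


def run_passes(size_bytes, passes=1, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Repeat the check for several patterns; return (pass, pattern, offset) tuples."""
    errors = []
    for n in range(passes):
        for p in patterns:
            for off in check_pattern(size_bytes, p):
                errors.append((n, p, off))
    return errors


if __name__ == "__main__":
    # 64 MiB is an arbitrary example size, chosen to exceed typical CPU caches.
    print(run_passes(64 * 1024 * 1024, passes=2))
```

An empty list only means that no mismatch was seen in this particular run; with
intermittent, temperature-dependent faults like the ones described here, a clean result
proves very little.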

There is also no reason to purchase again a "workstation-grade" mainboard, as advertised
by ASUS, for instance, with this overclocking crap. It leaves a very bitter taste - in my
personal view. Since the memory problems I experienced do not show up as permanent,
"steady-state" problems, I fear data corruption that no protection mechanism indicates -
so for the future, ECC is a must. And this means, even for "low end" workstations, bye-bye
cheap crappy Intel toy CPUs! At least a Xeon-type, ECC-capable processor is a
prerequisite, and I wish AMD had not followed the cheap man's path of ripping the ECC
facilities out of their consumer CPUs. It is a matter of fact that even in the academic
environment "cheap" ECC-less systems are purchased for "cost effectiveness".

Finally, I personally wish for some massive torture tools like cpuburn, or something more
sophisticated, to stress the CPU to its limit - to test the reliability, the cooling and
the power supply (flaky under heavy load, etc.?). FreeBSD's ports do not even have the
simplest Prime95 in a 64-bit version, as is available for Linux or Windows. I'm sure some
professionals are capable of pulling together some massive stress test tools, but could
these please be made available for the not-so-professional, more "common" users? Maybe a
naive Christmas wish?

I need to replace the system since I cannot rely on that flaky box anymore, even when
using encrypted devices. That is, after a painful time and many hopes, the final
conclusion.

Regards and thanks for your patience in reading this far,
Oliver