Re: ECC memory driver in FreeBSD 10?

From: Andrew Boyer <aboyer_at_averesystems.com> Date: Mon, 9 Apr 2012 09:32:01 -0400

On Apr 9, 2012, at 6:04 AM, O. Hartmann wrote:

> On 04/08/12 14:53, Miroslav Lachman wrote:
>> Nikolay Denev wrote:
>>> On Apr 6, 2012, at 2:48 PM, O. Hartmann wrote:
>>> 
>>>> I'm looking for a way to force FreeBSD 10 to maintain/watch ECC errors
>>>> reported by UEFI (or BIOS).
>>>> Since ECC is said to be essential for server systems both in business
>>>> and science, and I do not question this, I was wondering whether I can
>>>> have ECC errors reported via a watchdog or UEFI (ACPI?) report to the
>>>> syslog facility on FreeBSD.
>>>> FreeBSD is supposed to be a server operating system, as far as I know,
>>>> so I believe there must be something that hasn't revealed itself
>>>> to me yet.
>> 
>>> 
>>> If the hardware supports it, such errors should be logged as MCEs
>>> (Machine Check Exceptions).
>>> I can say for sure it works pretty well with Dell servers, as I had
>>> one with a failing RAM module, and
>>> it reported the corrected ECC errors in dmesg.
>> 
>> Memory ECC errors are logged to messages and you can decode them with
>> sysutils/mcelog. I did it in the past on one of our Sun Fire X2100 M2
>> with FreeBSD 8.x.
>> 
>> Miroslav Lachman
> 
> Seems that I have been blessed with non-faulty memory over the past
> three or four years. Last time I saw errors was around 2000. All of our
> 24/7 servers do have ECC RAM.
> 
> So, your replies all imply that if I log the system's messages via syslog
> properly (as we do remotely on a centralized server), then ECC errors
> should be reported by FreeBSD/kernel in a canonical way as the UEFI/BIOS
> reports them?
> Without special drivers/tools, scripts that scan for those errors
> should report occurrences?
> 
> That my (FreeBSD) boxes haven't shown errors of that kind - a colleague's
> Linux boxes did once! - doesn't imply missing capabilities.
> This is nice to hear/read.
> 
> Thanks a lot,
> 
> Oliver
> 

This is what you see in syslog when sys/x86/x86/mca.c detects a memory error:
> Mar 16 12:37:33 hostname kernel: MCA: Bank 8, Status 0x8c0000400001009f
> Mar 16 12:37:33 hostname kernel: MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000000
> Mar 16 12:37:33 hostname kernel: MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 0
> Mar 16 12:37:33 hostname kernel: MCA: CPU 0 COR (1) RD channel ?? memory error
> Mar 16 12:37:33 hostname kernel: MCA: Address 0xb43ca6240
> Mar 16 12:37:33 hostname kernel: MCA: Misc 0x4ac8111000064808

mcelog will help you figure out which DIMM is affected.
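
If you want a script to watch for these records, the "Bank ..., Status ..." line can be picked out of syslog and the status word split into fields. This is only a rough sketch: the bit positions are the architectural IA32_MCi_STATUS layout from the Intel SDM as I understand it, the corrected-error count field is assumed to be meaningful only on CPUs that advertise CMCI, and the names decode_status and scan are made up for illustration. It does not replace mcelog, which is still what maps an error to a DIMM.

```python
import re

# Matches the "MCA: Bank N, Status 0x..." line shown in the syslog sample.
BANK_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")

# Request types used in memory-controller error codes (assumed SDM naming).
MEM_REQUEST = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


def decode_status(status):
    """Split a machine-check bank status word into a few named fields."""
    code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        # Bits 52:38, assumed valid only when the CPU supports CMCI.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": code,
    }
    # Memory-controller errors have the form 000F 0000 1MMM CCCC;
    # the F (filter) bit is ignored by the mask.
    if (code & 0xEF80) == 0x0080:
        channel = code & 0xF
        info["memory_request"] = MEM_REQUEST.get((code >> 4) & 0x7, "??")
        # Channel 0xF means the channel was not specified.
        info["memory_channel"] = None if channel == 0xF else channel
    return info


def scan(lines):
    """Yield (bank, decoded_status) for every MCA bank line found."""
    for line in lines:
        match = BANK_RE.search(line)
        if match:
            yield int(match.group(1)), decode_status(int(match.group(2), 16))


sample = "Mar 16 12:37:33 hostname kernel: MCA: Bank 8, Status 0x8c0000400001009f"
for bank, info in scan([sample]):
    print(bank, info)
```

For the status value in the sample, this yields a valid, corrected (not uncorrected) error with a count of 1, request type RD and no channel given, which agrees with the "COR (1) RD channel ?? memory error" line the kernel printed.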

Also, if your server includes an IPMI controller, the BIOS should be set up to log memory errors to the IPMI system event log (SEL).  You can look at the SEL with ipmitool from the ports collection.  'ipmitool sel list' will show you if any errors have been reported.
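
The SEL check can be scripted along the same lines. A minimal sketch, assuming ipmitool is installed and on the PATH and that the script runs with enough privileges to talk to the BMC. The keyword match is a guess on my part: SEL wording differs between vendors, so treat KEYWORDS as a starting point, and read_sel and memory_events are hypothetical helper names.

```python
import subprocess

# Assumed keywords; adjust to whatever your BMC actually writes to the SEL.
KEYWORDS = ("memory", "ecc")


def read_sel():
    """Run 'ipmitool sel list' and return its output as a list of lines."""
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ipmitool fails
    )
    return result.stdout.splitlines()


def memory_events(lines):
    """Keep only the lines that mention one of the keywords (case-insensitive)."""
    return [
        line for line in lines
        if any(keyword in line.lower() for keyword in KEYWORDS)
    ]
```

Calling memory_events(read_sel()) from cron and mailing any non-empty result would give you a report alongside the syslog one.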

-Andrew

--------------------------------------------------
Andrew Boyer	aboyer_at_averesystems.com