Re: kernel: MCA: CPU 0 COR (1) internal parity error

From: Jeremy Chadwick <jdc_at_koitsu.org>
Date: Sat, 17 Jan 2015 13:46:53 -0800
On Sat, Jan 17, 2015 at 06:43:26PM +0100, Matthias Apitz wrote:
> On Friday, January 16, 2015 at 03:04:52PM -0500, Eric van Gyzen wrote:
> 
> > On 01/16/2015 14:45, Matthias Apitz wrote:
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error
> > 
> > Try ports/sysutils/mcelog.
> 
> I have installed that port and launched it as
> 
> # mcelog > mcelog.txt
> ...
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> ...
> 
> (the messages are STDERR);
> 
> in 'mcelog.txt' it has for the last event from /var/log/messages:
> 
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
> Jan 17 18:23:54 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error
> 
> the following lines (the uptime matches):
> 
> ...
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> MCE 32
> CPU 0 BANK 0 TSC 36eec80fd688 [at 1397 Mhz 0 days 12:0:41 uptime (unreliable)]
> MCG status:
> MCi status:
> Error enabled
> MCA: Unknown Error 5
> STATUS 90000040000f0005 MCGSTATUS 0
> MCGCAP c07 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 69
> 
> Questions:
> a) Is the output of mcelog valid (regardless of the msg on STDERR of
>    'unsupported model')?

It may or may not be reliable.  For MCE decoding to work accurately, the
software (read: kernel) needs to have full support for the processor
model and revision in question.  mcelog simply tries to decode the
output that the kernel spits out and provide a more "user-friendly"
explanation.

That isn't as simple as just modifying some table of supported CPUs; it
involves reading Intel documentation and implementing what can be
figured out through that.  VMware has a small KB about this, to give you
some insight into the complexity:

http://kb.vmware.com/kb/1005184

There are some capabilities of MCA that are "semi-universal" across
series of CPUs, so sometimes those can be decoded (mostly) accurately,
but other times that isn't the case.  Sometimes certain MCEs have to
be ignored by the kernel (i.e. the kernel MCE support has to be
updated to reflect changes in MCEs for that newer model of
processor).
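
To illustrate the "semi-universal" part: the top-level layout of the
IA32_MCi_STATUS register is architectural, so those fields can be pulled
out of the status value quoted above without any model-specific
knowledge.  The following is only a sketch in Python based on my reading
of the Intel SDM layout, not what the kernel or mcelog actually does.
The function name and dictionary keys are made up for illustration, and
the corrected-error count is only meaningful when MCG_CAP advertises
CMCI support (bit 10, which is set in 0xc07).

```python
# Sketch: split an IA32_MCi_STATUS value into its architectural fields.
# Bit positions follow my reading of the Intel SDM; names are illustrative.

def decode_mci_status(status):
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }

# Status value taken from the log lines quoted above.
fields = decode_mci_status(0x90000040000F0005)
for name, value in fields.items():
    print(name, value)
```

For the status above this reports valid and enabled set, uncorrected
clear, a corrected count of 1, and MCA error code 5, which lines up with
the kernel's "COR (1) internal parity error" line.  Everything in bits
31:16 (0x000f here) is model-specific and is exactly the part that needs
per-CPU support.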

The version of mcelog available in ports is extremely old, and I imagine
the amount of work to upgrade it to the latest Linux mcelog (1.08) would
be quite large:

http://git.kernel.org/cgit/utils/cpu/mce/mcelog.git

The existing FreeBSD port involves a large number of patches written by
John Baldwin, and whether or not those can be correctly carried forward
to newer mcelog releases is unknown.

I really need to renounce my maintainer flag of that port and let
someone else take care of it.

> b) Is it worth to contact the dealer or wait until it is broken
>    completely?

To me, the above message indicates that one of the CPU cores is
damaged/misbehaving.  I cannot determine whether it's referring to L1,
L2, or L3 cache; I don't see any clear indicator of that (possibly
because of the accuracy problem I described above).
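
For what it's worth, here is a hedged sketch of why I wouldn't expect a
cache level to show up.  As I read the Intel SDM, only the "compound"
MCA error codes (cache hierarchy, TLB, bus/interconnect) carry a two-bit
level field, while "simple" codes such as 0x0005 do not.  The function
name and the returned strings are mine, and the bit patterns should be
checked against the SDM before relying on them.

```python
# Sketch: classify the low 16 bits of IA32_MCi_STATUS (the MCA error code).
# Patterns are from my reading of the Intel SDM; bit 12 (the correction
# report filtering bit) is masked off before matching.

SIMPLE_CODES = {
    0x0000: "no error",
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
}
LEVELS = ("L0", "L1", "L2", "generic")

def classify_mca_code(code):
    """Return (description, level); level is None when no level field exists."""
    code &= 0xEFFF                       # drop bit 12
    level = LEVELS[code & 0x3]           # LL field, only valid for some codes
    if (code & 0xF800) == 0x0800:        # 0000 1PPT RRRR IILL
        return ("bus/interconnect error", level)
    if (code & 0xFF00) == 0x0100:        # 0000 0001 RRRR TTLL
        return ("cache hierarchy error", level)
    if (code & 0xFF80) == 0x0080:        # 0000 0000 1MMM CCCC
        return ("memory controller error", None)
    if (code & 0xFFF0) == 0x0010:        # 0000 0000 0001 TTLL
        return ("TLB error", level)
    if (code & 0xFFFC) == 0x000C:        # 0000 0000 0000 11LL
        return ("generic cache hierarchy error", level)
    return (SIMPLE_CODES.get(code, "other/unknown simple code"), None)

print(classify_mca_code(0x0005))   # code from the log above
print(classify_mca_code(0x0135))   # a made-up compound code, for contrast
```

The code from the log comes back as an internal parity error with no
level, whereas the made-up 0x0135 comes back as a cache hierarchy error
at L1.  So a missing cache level here is not necessarily a decoding
failure; the code itself may simply not say.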

However, I will point you to this thread, which may indicate that the
model of CPU in question (or a series of Intel CPU models) has MCEs
that are considered "normal" and are thus not being decoded correctly:

https://lists.freebsd.org/pipermail/freebsd-questions/2014-January/255873.html

I would suggest providing the relevant dmesg lines about the exact
processor in this system and possibly asking for help from either John
Baldwin or someone on freebsd-hackers_at_.  I myself cannot help with this.
The dmesg lines I'm referring to, by the way, look like this (all of
them matter, particularly the first two):

CPU: Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz (2833.59-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x10677  Family = 0x6  Model = 0x17  Stepping = 7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x8e3fd<SSE3,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant, performance statistics

The OP of that freebsd-questions thread should have provided this but
didn't (the post just says "Intel i3-4310", which isn't precise enough),
so whether or not you two are using the same CPU is unknown.
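
The reason the Id/Family/Model/Stepping line matters: the "Id" value is
the raw CPUID signature, and the displayed model is assembled from two
separate fields.  A small sketch, assuming the standard Intel encoding
(the extended model is only folded in for family 0x6 and 0xF); the
function name is mine.

```python
# Sketch: decode a CPUID signature (the "ID"/"Id" value in the kernel
# output) into display family, model and stepping.

def decode_cpu_signature(sig):
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    if family in (0x6, 0xF):
        model |= ext_model << 4      # extended model becomes the high nibble
    if family == 0xF:
        family += ext_family
    return family, model, stepping

# 0x40651 is the ID from the MCA lines; 0x10677 is from the example dmesg.
for sig in (0x40651, 0x10677):
    family, model, stepping = decode_cpu_signature(sig)
    print(f"0x{sig:x}: family 0x{family:x} model 0x{model:x} "
          f"({model}) stepping {stepping}")
```

For 0x40651 this gives family 6, model 0x45 (69 decimal), stepping 1.
That presumably also explains why mcelog complains about "Model 45" on
STDERR but prints "Model 69" in its report: same number, hex versus
decimal.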

There simply could be "new MCEs" or changes to the MCA that Intel
implemented in some newer models of Core iX that aren't being handled
correctly by the kernel (i.e. misreporting or mis-decoding).

Good luck!

-- 
| Jeremy Chadwick                                   jdc_at_koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
Received on Sat Jan 17 2015 - 20:46:58 UTC
