Re: FreeBsd MCA Panic Crash !!

From: Anna Wilcox <AWilcox_at_Wilcox-Tech.com>
Date: Mon, 4 Jan 2016 08:05:46 -0600
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 04/01/16 04:34, shahzaibcb wrote:
> Hi,
>
> We've switched to FreeBSD recently to accommodate large video storage, as we
> are running a video streaming website. The job of the FreeBSD server is to
> transcode the uploaded videos using ffmpeg and serve them to users via the
> nginx webserver, but so far our experience with it has not been very good.
> It crashes every 2-3 days and we're unable to track down the problem. The
> server specs are pretty high:
>
>
> Supermicro X5690 (12 cores, 24 threads - 2u)
> 96GB RAM
> 12x3TB RAID-10 (HBA-LSI9211)
>
> Here is the screenshot of recent crash :
>
> http://prntscr.com/9er3pk
>
> One thing worth mentioning is that before going down there's no load on the
> server, and free RAM is usually around 12GB. We've tried the following
> solutions so far:
>
>
> - Updated FreeBSD OS
> - Replaced 800W PS with 900W
> - We've reduced CMOS from MAX(26x) to 18x as suggested in this post
>
> http://unix.stackexchange.com/questions/60574/determining-cause-of-linux-kernel-panic
>
> The one solution we have not tried so far is:
>
> - Disable mca using (hw.mca.enabled: 0) - As we're getting MCA panics.
>
> Here is the crash dump :
>
> [root_at_cw001 /var/crash]# mcelog --no-dmi --ascii --file core.txt.1
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 3 BANK 5
> MISC 0 ADDR 802bf6a69
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 3 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 2 BANK 5
> MISC 0 ADDR 802bf6a69
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 2 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 3 BANK 5
> MISC 0 ADDR 802bf6a69
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 3 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 2 BANK 5
> MISC 0 ADDR 802bf6a69
> MCG status:MCIP
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 2 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 44
>
>
> -----------------------------------
>
> I showed those hardware errors to the vendor from whom we purchased the
> Supermicro servers. This is what he had to say:
>
> -----------------------------------
> Why do you not made one test environment with CentOS or one other Linux
> that you know to use, and see if you have same errors ??? if not than you
> know that the errors come from OS not from hardware. ( CentOS, RedHead….work
> diferend like FreeBSD – work direct on hardware if you don’t have the right
> kernel settings can the server crashed. CentOS , RedHead…. don’t work direct
> on hardware and distribute the resource load better and you have better
> control and you can better debug one situation)
> -----------------------------------
>
> Now we're stuck and unable to determine whether the issue is with FreeBSD
> or the hardware. We're thinking of disabling MCA in loader.conf, but people
> are advising against it. If you guys can help us, it'd be very kind.
>

Hello there,

This looks to me like a CPU failure.  Can you try replacing the CPU
itself?  I've seen this exact message on a different board, and the
cause was a failing CPU.

Please do note that, as the message says, this is not a software error.
It is a failure of the hardware.  Your vendor can try to blame FreeBSD
all they want, but it is so improbable as to be almost impossible that
FreeBSD is the problem.  You might also note to your vendor that it is
"Red Hat" Linux, not Red Head.
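
As a rough cross-check of the mcelog summary quoted above, the STATUS and
MCGSTATUS values can be decoded by hand.  The following is only a minimal
Python sketch, assuming the architectural IA32_MCi_STATUS and IA32_MCG_STATUS
bit layouts from the Intel SDM; the function and table names are hypothetical
and are not part of mcelog or FreeBSD.

```python
# Assumed architectural bit positions in IA32_MCi_STATUS (Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# Assumed bit positions in IA32_MCG_STATUS.
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}

# Simple MCA error code that mcelog reports as "Internal Timer error".
INTERNAL_TIMER = 0x0400


def set_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_status(status):
    """Split an MCi_STATUS value into its flags and error-code fields."""
    return {
        "flags": set_flags(status, STATUS_FLAGS),
        "mca_code": status & 0xFFFF,                   # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


# Values copied from the quoted dump.
decoded = decode_status(0xBE00000000800400)
print(decoded["flags"])
print(hex(decoded["mca_code"]), decoded["mca_code"] == INTERNAL_TIMER)
print(set_flags(0x4, MCG_FLAGS))
```

Under these assumptions the set flags are VAL, UC, EN, MISCV, ADDRV and PCC,
and the low 16 bits are the internal timer code, which agrees with what
mcelog printed.  The fields are reported by the CPU itself, not by the
operating system.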

Hope this helps.
- --arw
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJWinw5AAoJEMspy1GSK50UXI8QANH5y9c36q8uX2xtQtjQ79DR
ENN5O0cuxfiCn3mo7Kn+R0wD4Ahf1Qn6uR70WXwKDtdpre6VqsBxpZak7GVpHR9j
x0C0jJJQLU3qs3XREzs6DjWCOge8j7zDZG0i9gZt3NT3WnEUxrqI+dLm/1I1Cy3f
nSSHb3V3Sf9SxbB132NhCfiHfQNIVNGZsnrLCCIEWN0gI5vvEe2Av1e4PYoa1TJF
7B0qTmQ+nBb0zX/mccAbTXtMCAO7PBOrVkyxrwZN/J9kGYaPe2UEpsdHjXp76sui
fFzb7voaKYXvqu3XJEYU0Pxulape5cUGSuQWmWBmDZhnFmn7YYRlfRr+5anwwhxu
/EVDvOrdPNm4LpR3DCwR+FtHQb+fs9rfMEGIQ9EiLLF/rXXbs0Pfq+FzjHwk6RsX
ls339Qn2juM3hYnhVJVD2QUw00M+6BYxCpi1+cwwDJ4RA5sgI9E8YxXTevhf5/3q
Wc7lLrIC7LuXNtEJDiIwmX0N9fHAnMqw90leC4CLP6MbYmsmCZWkK62mgeGbTt6e
1QOJN6e7T4DFA/RIbtYcA+uQMCsefP6F0UPB8Al8nbby3wtteI31RNWVmFqwwueW
AdAtslEPqHSdyscbHxZD9NwqchW447v3UVY0oL1nkGSBaD0z/BV41S8jRTXWPj2Y
mIBxsB86TaDdFim3KCaV
=IqIn
-----END PGP SIGNATURE-----
Received on Mon Jan 04 2016 - 13:16:45 UTC
