Re: FreeBsd MCA Panic Crash !!

From: Slawa Olhovchenkov <slw_at_zxy.spb.ru>
Date: Mon, 4 Jan 2016 23:07:07 +0300
On Mon, Jan 04, 2016 at 03:34:09AM -0700, shahzaibcb wrote:

> Hi,
> 
> We recently switched to FreeBSD to accommodate large video storage, as we
> run a video streaming website. The FreeBSD server's job is to transcode
> uploaded videos with ffmpeg and serve them to users via the nginx web
> server, but so far our experience with it has not been good. It crashes
> every 2-3 days and we have been unable to track down the problem. The
> server specs are fairly high:
> 
> 
> Supermicro X5690 (12 cores, 24 threads - 2u)
> 96GB RAM
> 12x3TB RAID-10 (HBA-LSI9211)
> 
> Here is a screenshot of the most recent crash:
> 
> http://prntscr.com/9er3pk
> 
> One thing worth mentioning: before the server goes down there is no load
> on it, and free RAM is usually around 12GB. We've tried the following
> solutions so far:
> 
> 
> - Updated FreeBSD OS
> - Replaced 800W PS with 900W
> - We've reduced CMOS from MAX(26x) to 18x as suggested in this post

Did you try replacing the CPU?

> http://unix.stackexchange.com/questions/60574/determining-cause-of-linux-kernel-panic
> 
> The one solution we have not tried so far is:
> 
> - Disabling MCA (hw.mca.enabled: 0), since we're getting MCA panics.
> 
> Here is the crash dump :
> 
> [root_at_cw001 /var/crash]# mcelog --no-dmi --ascii --file core.txt.1 
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 3 BANK 5 
> MISC 0 ADDR 802bf6a69 
> MCG status:MCIP 
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 3 SOCKETID 0 
> CPUID Vendor Intel Family 6 Model 44
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 2 BANK 5 
> MISC 0 ADDR 802bf6a69 
> MCG status:MCIP 
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 2 SOCKETID 0 
> CPUID Vendor Intel Family 6 Model 44
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 3 BANK 5 
> MISC 0 ADDR 802bf6a69 
> MCG status:MCIP 
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 3 SOCKETID 0 
> CPUID Vendor Intel Family 6 Model 44
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 2 BANK 5 
> MISC 0 ADDR 802bf6a69 
> MCG status:MCIP 
> MCi status:
> Uncorrected error
> Error enabled
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: Internal Timer error
> STATUS be00000000800400 MCGSTATUS 4
> MCGCAP 1c09 APICID 2 SOCKETID 0 
> CPUID Vendor Intel Family 6 Model 44
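
For readers who want to cross-check the mcelog text against the raw
registers, here is a minimal sketch that decodes the quoted STATUS and
MCGSTATUS values by hand. The bit positions are taken from the
IA32_MCi_STATUS and IA32_MCG_STATUS layouts in the Intel SDM; the function
and table names are made up for illustration and are not part of mcelog or
FreeBSD.

```python
# Sketch only: decode the architectural bits of IA32_MCi_STATUS and
# IA32_MCG_STATUS. Names below are hypothetical helpers, not an existing API.

STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

MCG_STATUS_FLAGS = ["RIPV", "EIPV", "MCIP"]  # bits 0, 1, 2


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }


def decode_mcg_status(value):
    """Return the names of the low three MCG_STATUS bits that are set."""
    return [name for i, name in enumerate(MCG_STATUS_FLAGS) if (value >> i) & 1]


decoded = decode_mci_status(0xBE00000000800400)  # STATUS from the dump
mcg = decode_mcg_status(0x4)                     # MCGSTATUS from the dump

for flag in decoded["flags"]:
    print(flag)
print("MCA error code: 0x%04x" % decoded["mca_error_code"])
print("MCG status:", ", ".join(mcg))
```

The set flags agree with what mcelog printed (uncorrected, enabled, MISC
and ADDR valid, processor context corrupt), and the low 16 bits come out as
0x0400, the code mcelog reports as "Internal Timer error".
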
> 
> -----------------------------------------------------------------------------------
> 
> I showed those hardware errors to the vendor from whom we purchased the
> Supermicro servers. This is what he had to say:
> 
> -----------------------------------
> Why not set up a test environment with CentOS or another Linux that you
> know how to use, and see whether you get the same errors? If not, then you
> know the errors come from the OS and not from the hardware. (CentOS, Red
> Hat, etc. work differently from FreeBSD, which works directly on the
> hardware; if you don't have the right kernel settings, the server can
> crash. CentOS, Red Hat, etc. don't work directly on the hardware, they
> distribute the resource load better, and they give you better control and
> make it easier to debug a situation.)
> -----------------------------------
> 
> Now we're stuck and unable to determine whether the issue is with FreeBSD
> or with the hardware. We're thinking of disabling MCA in loader.conf, but
> people are advising against it. If you can help us, we'd be very grateful.
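
One way to narrow this down is to tally the mcelog records by socket, CPU
and bank and see whether they cluster on one package. A rough sketch,
assuming the record layout is exactly what the output quoted above shows
(a "CPU n BANK n" line, an "MCA:" line, then a line carrying "SOCKETID n");
the sample text is abbreviated from that output and the function name is
hypothetical.

```python
import re
from collections import Counter

# Abbreviated from the four records in the quoted mcelog output.
SAMPLE = """\
> CPU 3 BANK 5
> MCA: Internal Timer error
> MCGCAP 1c09 APICID 3 SOCKETID 0
> CPU 2 BANK 5
> MCA: Internal Timer error
> MCGCAP 1c09 APICID 2 SOCKETID 0
> CPU 3 BANK 5
> MCA: Internal Timer error
> MCGCAP 1c09 APICID 3 SOCKETID 0
> CPU 2 BANK 5
> MCA: Internal Timer error
> MCGCAP 1c09 APICID 2 SOCKETID 0
"""


def tally_records(text):
    """Count records per (socket, cpu, bank, error description)."""
    counts = Counter()
    cpu = bank = error = None
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate email quoting
        m = re.match(r"CPU (\d+) BANK (\d+)", line)
        if m:
            # Start of a new record.
            cpu, bank, error = int(m.group(1)), int(m.group(2)), None
            continue
        if line.startswith("MCA:"):
            error = line[len("MCA:"):].strip()
            continue
        m = re.search(r"SOCKETID (\d+)", line)
        if m and cpu is not None:
            # The SOCKETID line closes the record in this layout.
            counts[(int(m.group(1)), cpu, bank, error)] += 1
            cpu = bank = error = None
    return counts


counts = tally_records(SAMPLE)
for (socket, cpu, bank, error), n in sorted(counts.items()):
    print(f"socket {socket} cpu {cpu} bank {bank}: {error} x{n}")
```

In the dump above every record is on socket 0, bank 5, with the same
error, which is the kind of pattern worth checking by swapping the CPUs
between sockets and seeing whether the error follows the chip.
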
Received on Mon Jan 04 2016 - 19:07:17 UTC
