Routine Panic on HP ProLiant Gen10

From: Dave Robison <davewrobison_at_gmail.com>
Date: Tue, 11 Sep 2018 14:13:36 -0700

Hiya,

I'm currently evaluating two classes of server which we source through NEC. However, the motherboards for these machines are HP. I can routinely panic both of these machines using 12.0-A4, as well as 11.1-R with a shoehorned-in SES/SMARTPQI driver and 11.2-R with its native SES/SMARTPQI driver. NEC seems to think this is a ZFS issue, and they may be correct. If so, I suspect ARC, though as I explain further down, I haven't had a problem on other hardware.

I've managed to get core dumps on 11.1 and 11.2. On 12.0, when the panic occurs, I can get a backtrace and force a panic, and the system claims it is writing out a core dump, but on reboot there is no core dump on disk.
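
For completeness, the dump setup on these boxes is just the stock one, roughly the following (a sketch; these are the defaults I'm assuming, nothing custom):

# /etc/rc.conf -- stock crash dump settings (sketch; defaults I'm assuming)
dumpdev="AUTO"          # dump to the configured swap device on panic
dumpdir="/var/crash"    # where savecore(8) writes the vmcore at next boot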

Machine A: HP ProLiant DL360 Gen10 with a Xeon Bronze 3106, 16 GB of RAM, and three hard drives.

Machine B: HP ProLiant DL380 Gen10 with a Xeon Silver 4114, 32 GB of RAM, and five hard drives.

I install 12.0-A4 using ZFS on root, with 8 GB of swap; otherwise it's a standard FreeBSD install. I can panic these machines rather easily within 10-15 minutes by firing up six instances of bonnie++ and a few memtesters, three using 2 GB and three using 4 GB (a rough sketch of the workload is below). I've done the same on the 11.x installs without memtester and gotten panics within 10-15 minutes. Those gave me core dumps, but the panic error is different from the one on 12.0-A4. I have also run tests using UFS2 and did not manage to force a panic.
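
For anyone trying to reproduce this, the workload is roughly the script below (a sketch: the scratch directory and the bonnie++ flags are illustrative rather than my exact command lines; the instance counts and memtester sizes match what I described above):

#!/bin/sh
# Rough reproduction of the stress load: six bonnie++ instances on the ZFS
# root pool plus six memtesters (3 x 2 GB, 3 x 4 GB). Paths and bonnie++
# options are illustrative.
mkdir -p /stress
for i in 1 2 3 4 5 6; do
    bonnie++ -d /stress -u root -x 100 > /tmp/bonnie.$i.log 2>&1 &
done
for size in 2G 2G 2G 4G 4G 4G; do
    memtester $size > /dev/null 2>&1 &   # loops until killed (or the box panics)
done
wait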

At first I thought the problem was the HPE RAID card, which uses the SES/SMARTPQI driver, so I put in a recent LSI MegaRAID card using the MRSAS driver, and I can panic that as well. I've managed to panic Machine B while using either RAID card to build two mirrors and one hot spare, and I've also managed to panic it when letting the RAID cards pass the hard drives through so I could create a raidz of four drives and one hot spare. I know many people immediately think "Don't use a RAID card with ZFS!", but I've done this for years without a problem using the LSI MegaRAID in a variety of configurations.
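
For reference, in the pass-through case the Machine B pool looks roughly like this (device names are illustrative); in the hardware-RAID case the card builds the mirrors itself and ZFS just sits on the resulting logical volumes:

# pass-through drives on Machine B: four-disk raidz plus a hot spare
# (pool and device names are illustrative)
zpool create tank raidz da0 da1 da2 da3 spare da4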

It really seems to me that the panic occurs when ARC starts to ramp up and hits a lot of memory contention. However, I've been running the same test on a previous-generation NEC server with an LSI MegaRAID using the MRSAS driver under 11.2-R, and it has been running like clockwork for 11 days. We use that iteration of server extensively, and if this were a problem with ARC, I would assume (perhaps presumptuously) that I would see the same problems there. I also have servers running 11.2-R with ZFS and rather large, very heavily used JBOD arrays, and I have never had an issue with them.
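
For what it's worth, the ARC ramp-up is easy to watch with the standard sysctl counters while the test runs, something like:

# watch ARC size against its cap and free memory during the stress run
sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max vm.stats.vm.v_free_count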

The HPE RAID card info, from pciconf -lv:

smartpqi0@pci0:92:0:0:  class=0x010700 card=0x0654103c chip=0x028f9005 rev=0x01 hdr=0x00
    vendor     = 'Adaptec'
    device     = 'Smart Storage PQI 12G SAS/PCIe 3'
    class      = mass storage
    subclass   = SAS

And from dmesg:

root@hvm2d:~ # dmesg | grep smartpq
smartpqi0: <E208i-a SR Gen10> port 0x8000-0x80ff mem 0xe6c00000-0xe6c07fff at device 0.0 on pci9
smartpqi0: using MSI-X interrupts (40 vectors)
da0 at smartpqi0 bus 0 scbus0 target 0 lun 0
da1 at smartpqi0 bus 0 scbus0 target 1 lun 0
ses0 at smartpqi0 bus 0 scbus0 target 69 lun 0
pass3 at smartpqi0 bus 0 scbus0 target 1088 lun 0

However, since I can panic these with either RAID card, I don't suspect the HPE RAID card as the culprit.

Here is an image with the scant bt info I got from the last panic:

https://ibb.co/dzFOn9

This thread from Saturday on -stable sounded all too familiar:

https://lists.freebsd.org/pipermail/freebsd-stable/2018-September/089623.html

I'm at a loss, so I have gathered as much info as I can to anticipate questions and requests for more information. I'm hoping someone can point me in the right direction for further troubleshooting, or at least help isolate the problem to a specific area.

Thanks for your time,

Dave