lsi

From: Aurelien <beorn_at_binaries.fr>
Date: Fri, 22 Mar 2019 10:06:11 +0100

Hi the list,

I have been using FreeBSD at home and in production for years, and today
I stumbled upon a question I could not answer.

Context

-----------------------------------------

I'm building a backup server on a machine with this HBA:

3:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
    Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9271-8i
    Flags: bus master, fast devsel, latency 0, IRQ 34
    I/O ports at e000
    Memory at fb160000 (64-bit, non-prefetchable)
    Memory at fb100000 (64-bit, non-prefetchable)
    Expansion ROM at fb140000 [disabled]
    Capabilities: [50] Power Management version 3
    Capabilities: [68] Express Endpoint, MSI 00
    Capabilities: [d0] Vital Product Data
    Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [1e0] Secondary PCI Express <?>
    Capabilities: [1c0] Power Budgeting <?>
    Capabilities: [190] Dynamic Power Allocation <?>
    Capabilities: [148] Alternative Routing-ID Interpretation (ARI)

After pushing the server's I/O to its limits, the machine had a very nasty
crash.

This happens very rarely: in roughly 10 years and petabytes of storage
servers I have kept running, it has always been hardware or driver/firmware
related.

    Shortening read at 4292967280 from 16 to 15
    ZFS: i/o error - all block copies unavailable
    ZFS: can't read object set for dataset 52
    ZFS: can't open root filesystem
    gptzfsboot: failed to mount default pool zroot

Reinstalling the bootloaders and checking the partition tables changed
nothing, so I went digging through the FreeBSD codebase and found that it
was a ZFS problem.

The nasty crash was indeed due to ZFS data corruption, hence the
checksum errors while scrubbing the zpool from a rescue network boot image:

      pool: zroot
     state: ONLINE                                                                     
    status: One or more devices has experienced an unrecoverable error.  An            
            attempt was made to correct the error.  Applications are unaffected.       
    action: Determine if the device needs to be replaced, and clear the errors         
            using 'zpool clear' or replace the device with 'zpool replace'.            
       see: http://illumos.org/msg/ZFS-8000-9P                                         
      scan: scrub in progress since Fri Mar 15 15:15:25 2019                           
            49.6G scanned out of 1.65T at 109M/s, 4h15m to go                          
            677M repaired, 2.94% done                                                  
    config:                                                                            
            NAME              STATE     READ WRITE CKSUM                               
            zroot             ONLINE       0     0     0                               
              raidz2-0        ONLINE       0     0     0                               
                mfisyspd0p3   ONLINE       0     0 5.44K  (repairing)                  
                mfisyspd1p3   ONLINE       0     0 4.76K  (repairing)                  
                mfisyspd10p3  ONLINE       0     0 4.35K  (repairing)                  
                mfisyspd11p3  ONLINE       0     0 5.17K  (repairing)                  
                mfisyspd2p3   ONLINE       0     0 4.76K  (repairing)                  
                mfisyspd3p3   ONLINE       0     0 4.24K  (repairing)                  
                mfisyspd4p3   ONLINE       0     0 4.75K  (repairing)                  
                mfisyspd5p3   ONLINE       0     0 5.20K  (repairing)                  
                mfisyspd6p3   ONLINE       0     0 4.51K  (repairing)                  
                mfisyspd7p3   ONLINE       0     0 4.65K  (repairing)                  
                mfisyspd8p3   ONLINE       0     0 4.70K  (repairing)                  
                mfisyspd9p3   ONLINE       0     0 3.81K  (repairing)  
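
With this many devices it can help to pull out only the rows whose CKSUM column is non-zero. The following is a minimal sketch, not part of the original troubleshooting: `cksum_errors` is a hypothetical helper name, and it assumes the column layout shown above (NAME STATE READ WRITE CKSUM).

```shell
# Sketch: list vdevs with a non-zero CKSUM column in `zpool status` output.
# cksum_errors is a hypothetical helper; it reads the status text on stdin.
cksum_errors() {
    awk '
        # Device rows have at least 5 fields: NAME STATE READ WRITE CKSUM.
        # Shorter lines such as " state: ONLINE" are skipped by the NF test.
        NF >= 5 && ($2 == "ONLINE" || $2 == "DEGRADED" || $2 == "FAULTED") {
            if ($5 != "0") print $1, $5
        }
    '
}

# Usage on a live system (assumes a pool named zroot):
#   zpool status zroot | cksum_errors
```

Fed the output above, this would print every mfisyspd*p3 member with its counter and skip the zroot and raidz2-0 rows, whose counters are 0.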

At this point the server was still unable to reboot. I had to force the
boot files to be rewritten with a dumb:

    mv /boot{,.dist}

    cp -pr /boot{.dist,}

Which turned out to be fine.
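
The same move-then-copy step can be written without brace expansion and with a check afterwards. This is a minimal sketch under assumptions: `recopy_dir` is a hypothetical helper name, and the `.dist` suffix simply follows the commands above. Presumably the copy helps because every file is read and written out again in full.

```shell
# Sketch: rewrite every file under a directory by copying it afresh.
# recopy_dir is a hypothetical helper; the original tree is kept as DIR.dist.
recopy_dir() {
    dir=$1
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    [ ! -e "$dir.dist" ] || { echo "$dir.dist already exists" >&2; return 1; }
    mv "$dir" "$dir.dist" || return 1
    # -p preserves modes and timestamps, -R copies recursively.
    cp -pR "$dir.dist" "$dir" || return 1
    # Compare both trees; diff exits non-zero if anything differs.
    diff -r "$dir.dist" "$dir" >/dev/null
}

# Usage (as root, on the affected system):
#   recopy_dir /boot && echo "copy verified"
```

Keeping the `.dist` tree around until the machine has booted cleanly leaves a way back if the copy turns out to be incomplete.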

Digging further, I finally killed the zpool for good. It took me some time,
and I stumbled upon the mfi(4) and mrsas(4) man pages and code.

     The mfi driver supports the following hardware:

     o   LSI MegaRAID SAS 1078

     o   LSI MegaRAID SAS 8408E

     o   LSI MegaRAID SAS 8480E

     o   LSI MegaRAID SAS 9240

     o   LSI MegaRAID SAS 9260

     o   Dell PERC5

     o   Dell PERC6

     o   IBM ServeRAID M1015 SAS/SATA

     o   IBM ServeRAID M1115 SAS/SATA

     o   IBM ServeRAID M5015 SAS/SATA

     o   IBM ServeRAID M5110 SAS/SATA

     o   IBM ServeRAID-MR10i

     o   Intel RAID Controller SRCSAS18E

     o   Intel RAID Controller SROMBSAS18E

     The mrsas driver supports the following hardware:

     [ Thunderbolt 6Gb/s MR controller ]

     o   LSI MegaRAID SAS 9265

     o   LSI MegaRAID SAS 9266

     o   LSI MegaRAID SAS 9267

     o   LSI MegaRAID SAS 9270

     o   LSI MegaRAID SAS 9271

     o   LSI MegaRAID SAS 9272

     o   LSI MegaRAID SAS 9285

     o   LSI MegaRAID SAS 9286

     o   DELL PERC H810

     o   DELL PERC H710/P

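There was a detection priority problem: mfi(4) had claimed the controller
(hence the mfisyspd device names above), although the 9271 is on the
mrsas(4) list. The tunable that gives mrsas(4) priority is: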

    hw.mfi.mrsas_enable=1
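
One way to make that setting persistent is to append it to a boot-time hints file. The following is a minimal sketch under assumptions: `enable_mrsas` is a hypothetical helper name, and the default path `/boot/device.hints` and the quoted value form are assumptions based on mrsas(4) rather than something stated in this message.

```shell
# Sketch: add the tunable to a hints file unless it is already present.
# enable_mrsas is a hypothetical helper; the default path is an assumption.
enable_mrsas() {
    file=${1:-/boot/device.hints}
    line='hw.mfi.mrsas_enable="1"'
    if grep -q '^hw\.mfi\.mrsas_enable=' "$file" 2>/dev/null; then
        # Leave an existing setting alone, whatever its value.
        echo "already set in $file"
    else
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage (as root), followed by a reboot so the driver probe runs again:
#   enable_mrsas
```

The function only appends and never rewrites an existing entry, so running it twice does not duplicate the line.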