Re: lsi

From: Aurelien "beorn" ROUGEMONT <beorn_at_binaries.fr>
Date: Fri, 22 Mar 2019 10:12:02 +0100

On 3/22/19 10:06 AM, Aurelien "beorn" ROUGEMONT wrote:
> Hi the list,
>
> I have been using FreeBSD at home and in production for years, and today
> I stumbled upon a question I could not answer.
>
>
> Context
>
> -----------------------------------------
>
> I'm building a backup server on a server with this HBA :
>
> 3:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
>     Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9271-8i
>     Flags: bus master, fast devsel, latency 0, IRQ 34
>     I/O ports at e000
>     Memory at fb160000 (64-bit, non-prefetchable)
>     Memory at fb100000 (64-bit, non-prefetchable)
>     Expansion ROM at fb140000 [disabled]
>     Capabilities: [50] Power Management version 3
>     Capabilities: [68] Express Endpoint, MSI 00
>     Capabilities: [d0] Vital Product Data
>     Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
>     Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
>     Capabilities: [100] Advanced Error Reporting
>     Capabilities: [1e0] Secondary PCI Express <?>
>     Capabilities: [1c0] Power Budgeting <?>
>     Capabilities: [190] Dynamic Power Allocation <?>
>     Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
>
> After pushing the server's I/O to its limits, the server had a very nasty
> crash.
>
> This happens very seldom: in roughly 10 years, across the petabytes of
> storage servers I have kept running, it was always hardware or
> driver/firmware related.
>
>     Shortening read at 4292967280 from 16 to 15
>     ZFS: i/o error - all block copies unavailable
>     ZFS: can't read object set for dataset 52
>     ZFS: can't open root filesystem
>     gptzfsboot: failed to mount default pool zroot
>
> After simply reinstalling the bootloaders (which changed nothing) and
> checking the partition tables, I went digging through the FreeBSD
> codebase. I found that it was a ZFS problem.
>
> The nasty crash was indeed due to ZFS data corruption, hence the
> checksum errors while scrubbing the zpool from a rescue network boot image:
>
>       pool: zroot                                                                                                                                                                                                       
>      state: ONLINE                                                                     
>     status: One or more devices has experienced an unrecoverable error.  An            
>             attempt was made to correct the error.  Applications are unaffected.       
>     action: Determine if the device needs to be replaced, and clear the errors         
>             using 'zpool clear' or replace the device with 'zpool replace'.            
>        see: http://illumos.org/msg/ZFS-8000-9P                                         
>       scan: scrub in progress since Fri Mar 15 15:15:25 2019                           
>             49.6G scanned out of 1.65T at 109M/s, 4h15m to go                          
>             677M repaired, 2.94% done                                                  
>     config:                                                                            
>             NAME              STATE     READ WRITE CKSUM                               
>             zroot             ONLINE       0     0     0                               
>               raidz2-0        ONLINE       0     0     0                               
>                 mfisyspd0p3   ONLINE       0     0 5.44K  (repairing)                  
>                 mfisyspd1p3   ONLINE       0     0 4.76K  (repairing)                  
>                 mfisyspd10p3  ONLINE       0     0 4.35K  (repairing)                  
>                 mfisyspd11p3  ONLINE       0     0 5.17K  (repairing)                  
>                 mfisyspd2p3   ONLINE       0     0 4.76K  (repairing)                  
>                 mfisyspd3p3   ONLINE       0     0 4.24K  (repairing)                  
>                 mfisyspd4p3   ONLINE       0     0 4.75K  (repairing)                  
>                 mfisyspd5p3   ONLINE       0     0 5.20K  (repairing)                  
>                 mfisyspd6p3   ONLINE       0     0 4.51K  (repairing)                  
>                 mfisyspd7p3   ONLINE       0     0 4.65K  (repairing)                  
>                 mfisyspd8p3   ONLINE       0     0 4.70K  (repairing)                  
>                 mfisyspd9p3   ONLINE       0     0 3.81K  (repairing)  
>
> At this point the server was still unable to boot. I had to force the
> boot data to be rewritten with a dumb:
>
>     mv /boot{,.dist}
>
>     cp -pr /boot{.dist,}
>
> Which turned out to be fine.
>
> Going further, I finally killed the zpool for good. It took me some time,
> and I stumbled upon the mfi(4) and mrsas(4) man pages and code.
>
>      The mfi driver supports the following hardware:
>
>      o   LSI MegaRAID SAS 1078
>
>      o   LSI MegaRAID SAS 8408E
>
>      o   LSI MegaRAID SAS 8480E
>
>      o   LSI MegaRAID SAS 9240
>
>      o   LSI MegaRAID SAS 9260
>
>      o   Dell PERC5
>
>      o   Dell PERC6
>
>      o   IBM ServeRAID M1015 SAS/SATA
>
>      o   IBM ServeRAID M1115 SAS/SATA
>
>      o   IBM ServeRAID M5015 SAS/SATA
>
>      o   IBM ServeRAID M5110 SAS/SATA
>
>      o   IBM ServeRAID-MR10i
>
>      o   Intel RAID Controller SRCSAS18E
>
>      o   Intel RAID Controller SROMBSAS18E
>
>
>      The mrsas driver supports the following hardware:
>
>      [ Thunderbolt 6Gb/s MR controller ]
>
>      o   LSI MegaRAID SAS 9265
>
>      o   LSI MegaRAID SAS 9266
>
>      o   LSI MegaRAID SAS 9267
>
>      o   LSI MegaRAID SAS 9270
>
>      o   LSI MegaRAID SAS 9271
>
>      o   LSI MegaRAID SAS 9272
>
>      o   LSI MegaRAID SAS 9285
>
>      o   LSI MegaRAID SAS 9286
>
>      o   DELL PERC H810
>
>      o   DELL PERC H710/P
>
There was a detection priority problem: mfi wins the probe and attaches to an HBA that belongs to mrsas.

The fix was to add hw.mfi.mrsas_enable=1 to /boot/loader.conf.
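
For anyone who wants to script the change, here is a minimal sketch that appends the tunable only if it is not already set. The function name enable_mrsas and its optional file argument are made up for illustration (the argument just lets you try it on a scratch file first); the value is quoted as in the loader.conf(5) examples.

```shell
# enable_mrsas [FILE]: add the tunable to a loader.conf-style file
# unless a line setting it is already present.
# FILE defaults to /boot/loader.conf (hypothetical helper, not a FreeBSD tool).
enable_mrsas() {
    conf="${1:-/boot/loader.conf}"
    if ! grep -q '^hw\.mfi\.mrsas_enable=' "$conf" 2>/dev/null; then
        echo 'hw.mfi.mrsas_enable="1"' >> "$conf"
    fi
}
```

Run with no argument (as root) it targets /boot/loader.conf, and the setting only takes effect at the next boot. Afterwards, `pciconf -l | grep mrsas` should list the controller if mrsas really attached. As far as I understand mrsas(4), the disks then show up through CAM as da(4) devices instead of the mfi names, so anything that refers to the old device names may need adjusting.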

After this the server behaved correctly.
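
One way to back up "behaved correctly" is to scrub again and check that no device reports checksum errors. A minimal sketch, assuming the `zpool status` layout shown in the quoted output (a NAME/STATE/READ/WRITE/CKSUM header followed by one line per device); the function name cksum_errors is hypothetical:

```shell
# cksum_errors: read `zpool status` output on stdin and print the name
# and CKSUM value of every device whose CKSUM column is not "0".
cksum_errors() {
    awk '
        # The column header marks the start of the device table.
        $1 == "NAME" && $5 == "CKSUM"     { in_config = 1; next }
        # A blank line ends the device table.
        in_config && NF == 0              { in_config = 0 }
        # Device rows: NAME STATE READ WRITE CKSUM [notes]
        in_config && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}
```

Usage would be `zpool status zroot | cksum_errors`; empty output after a completed scrub means no device in the table has a non-zero CKSUM counter.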

Should this be fixed for everyone?

NB: sorry, my last email was mistakenly sent unfinished.