Hi list,

I have been using FreeBSD at home and in production for years, and today I stumbled upon a question I could not answer.

Context
-----------------------------------------

I'm building a backup server on a machine with this HBA:

3:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
        Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9271-8i
        Flags: bus master, fast devsel, latency 0, IRQ 34
        I/O ports at e000
        Memory at fb160000 (64-bit, non-prefetchable)
        Memory at fb100000 (64-bit, non-prefetchable)
        Expansion ROM at fb140000 [disabled]
        Capabilities: [50] Power Management version 3
        Capabilities: [68] Express Endpoint, MSI 00
        Capabilities: [d0] Vital Product Data
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [1e0] Secondary PCI Express <?>
        Capabilities: [1c0] Power Budgeting <?>
        Capabilities: [190] Dynamic Power Allocation <?>
        Capabilities: [148] Alternative Routing-ID Interpretation (ARI)

After I pushed the server's I/O to its limits, it had a very nasty crash. This happens very rarely: in roughly 10 years, across the petabytes of storage servers I have kept running, it has always been hardware or driver/firmware related.

Shortening read at 4292967280 from 16 to 15
ZFS: i/o error - all block copies unavailable
ZFS: can't read object set for dataset 52
ZFS: can't open root filesystem
gptzfsboot: failed to mount default pool zroot

After reinstalling the bootloaders (to no effect) and checking the partition tables, I went digging deep into the FreeBSD codebase. I found that it was a ZFS problem. The nasty crash was indeed due to ZFS data corruption, hence the checksum errors when scrubbing the zpool from a rescue network boot image:

  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Mar 15 15:15:25 2019
        49.6G scanned out of 1.65T at 109M/s, 4h15m to go
        677M repaired, 2.94% done
config:

        NAME              STATE     READ WRITE CKSUM
        zroot             ONLINE       0     0     0
          raidz2-0        ONLINE       0     0     0
            mfisyspd0p3   ONLINE       0     0 5.44K  (repairing)
            mfisyspd1p3   ONLINE       0     0 4.76K  (repairing)
            mfisyspd10p3  ONLINE       0     0 4.35K  (repairing)
            mfisyspd11p3  ONLINE       0     0 5.17K  (repairing)
            mfisyspd2p3   ONLINE       0     0 4.76K  (repairing)
            mfisyspd3p3   ONLINE       0     0 4.24K  (repairing)
            mfisyspd4p3   ONLINE       0     0 4.75K  (repairing)
            mfisyspd5p3   ONLINE       0     0 5.20K  (repairing)
            mfisyspd6p3   ONLINE       0     0 4.51K  (repairing)
            mfisyspd7p3   ONLINE       0     0 4.65K  (repairing)
            mfisyspd8p3   ONLINE       0     0 4.70K  (repairing)
            mfisyspd9p3   ONLINE       0     0 3.81K  (repairing)

At this point the server was still unable to reboot. I had to force the boot data to be rewritten with a dumb:

    mv /boot{,.dist}
    cp -pr /boot{.dist,}

which turned out to be fine. Going further, I finally killed the zpool for good. It took me some time, and I stumbled upon the mfi(4) and mrsas(4) man pages and code.
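The mfisyspdN device names in the scrub output are a hint about which driver had claimed the controller: mfi(4) exposes drives as mfidN or mfisyspdN, whereas mrsas(4) goes through CAM, so its drives normally show up as daN. A minimal sketch of that check follows. The function name guess_raid_driver is hypothetical, and since daN can come from other HBAs too, the "cam" answer is only a hint:

```shell
#!/bin/sh
# Guess which driver owns the pool's disks from the device names found in
# `zpool status` output read on stdin (hypothetical helper, not a FreeBSD tool).
#   mfidN / mfisyspdN -> mfi(4)
#   daN               -> a CAM disk, which is how mrsas(4) presents drives
#                        (other HBAs do too, so treat this as a hint only)
guess_raid_driver() {
    awk '
        $1 ~ /^mfi(syspd|d)[0-9]+/ { mfi++ }
        $1 ~ /^da[0-9]+/           { da++ }
        END {
            if (mfi > 0)     print "mfi"
            else if (da > 0) print "cam"
            else             print "unknown"
        }
    '
}
```

Typical use would be `zpool status zroot | guess_raid_driver`.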
The mfi driver supports the following hardware:

    o   LSI MegaRAID SAS 1078
    o   LSI MegaRAID SAS 8408E
    o   LSI MegaRAID SAS 8480E
    o   LSI MegaRAID SAS 9240
    o   LSI MegaRAID SAS 9260
    o   Dell PERC5
    o   Dell PERC6
    o   IBM ServeRAID M1015 SAS/SATA
    o   IBM ServeRAID M1115 SAS/SATA
    o   IBM ServeRAID M5015 SAS/SATA
    o   IBM ServeRAID M5110 SAS/SATA
    o   IBM ServeRAID-MR10i
    o   Intel RAID Controller SRCSAS18E
    o   Intel RAID Controller SROMBSAS18E

The mrsas driver supports the following hardware:

    [ Thunderbolt 6Gb/s MR controller ]
    o   LSI MegaRAID SAS 9265
    o   LSI MegaRAID SAS 9266
    o   LSI MegaRAID SAS 9267
    o   LSI MegaRAID SAS 9270
    o   LSI MegaRAID SAS 9271
    o   LSI MegaRAID SAS 9272
    o   LSI MegaRAID SAS 9285
    o   LSI MegaRAID SAS 9286
    o   DELL PERC H810
    o   DELL PERC H710/P

There was a detection priority problem:

    hw.mfi.mrsas_enable=1

Received on Fri Mar 22 2019 - 08:06:21 UTC
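For anyone hitting the same thing, here is a hedged sketch of making the hw.mfi.mrsas_enable setting persistent so that mrsas(4) gets to claim the controller at boot. The helper name prefer_mrsas is hypothetical, and which file to use is an assumption to check against mfi(4) and mrsas(4) for your release. The path is a required argument so the function can first be tried on a scratch copy:

```shell
#!/bin/sh
# Append hw.mfi.mrsas_enable="1" to the given boot configuration file unless a
# setting for that tunable is already present (hypothetical helper).
# Usage: prefer_mrsas /path/to/file
prefer_mrsas() {
    conf="$1"
    if [ -z "$conf" ]; then
        echo "usage: prefer_mrsas <file>" >&2
        return 2
    fi
    if grep -q '^hw\.mfi\.mrsas_enable=' "$conf" 2>/dev/null; then
        echo "already set in $conf"
    else
        printf 'hw.mfi.mrsas_enable="1"\n' >> "$conf"
        echo "added to $conf"
    fi
}
```

Running it as `prefer_mrsas /boot/loader.conf` takes effect at the next boot. Expect the disks to be renamed once a different driver attaches (mfisyspdN would typically become daN), so anything that refers to the old device names needs checking before rebooting.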