On 3/22/19 10:06 AM, Aurelien "beorn" ROUGEMONT wrote:
> Hi list,
>
> I have been using FreeBSD at home and in production for years, and today
> I stumbled upon a question I could not answer.
>
> Context
> -----------------------------------------
>
> I'm building a backup server on a machine with this HBA:
>
> 3:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
>     Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9271-8i
>     Flags: bus master, fast devsel, latency 0, IRQ 34
>     I/O ports at e000
>     Memory at fb160000 (64-bit, non-prefetchable)
>     Memory at fb100000 (64-bit, non-prefetchable)
>     Expansion ROM at fb140000 [disabled]
>     Capabilities: [50] Power Management version 3
>     Capabilities: [68] Express Endpoint, MSI 00
>     Capabilities: [d0] Vital Product Data
>     Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
>     Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
>     Capabilities: [100] Advanced Error Reporting
>     Capabilities: [1e0] Secondary PCI Express <?>
>     Capabilities: [1c0] Power Budgeting <?>
>     Capabilities: [190] Dynamic Power Allocation <?>
>     Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
>
> After I pushed the server's I/O to its limits, it had a very nasty
> crash.
>
> That happens very seldom: in roughly 10 years, across the petabytes of
> storage servers I have kept running, it has always been hardware or
> driver/firmware related.
>
> Shortening read at 4292967280 from 16 to 15
> ZFS: i/o error - all block copies unavailable
> ZFS: can't read object set for dataset 52
> ZFS: can't open root filesystem
> gptzfsboot: failed to mount default pool zroot
>
> After simply reinstalling the bootloaders (to no effect) and checking the
> partition tables, I went digging in the FreeBSD codebase. I found
> that it was a ZFS problem.
>
> The nasty crash was indeed due to ZFS data corruption.
> Hence the checksum errors while scrubbing the zpool from a rescue
> network boot image:
>
>   pool: zroot
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error. An
>         attempt was made to correct the error. Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://illumos.org/msg/ZFS-8000-9P
>   scan: scrub in progress since Fri Mar 15 15:15:25 2019
>         49.6G scanned out of 1.65T at 109M/s, 4h15m to go
>         677M repaired, 2.94% done
> config:
>
>         NAME              STATE     READ WRITE CKSUM
>         zroot             ONLINE       0     0     0
>           raidz2-0        ONLINE       0     0     0
>             mfisyspd0p3   ONLINE       0     0 5.44K  (repairing)
>             mfisyspd1p3   ONLINE       0     0 4.76K  (repairing)
>             mfisyspd10p3  ONLINE       0     0 4.35K  (repairing)
>             mfisyspd11p3  ONLINE       0     0 5.17K  (repairing)
>             mfisyspd2p3   ONLINE       0     0 4.76K  (repairing)
>             mfisyspd3p3   ONLINE       0     0 4.24K  (repairing)
>             mfisyspd4p3   ONLINE       0     0 4.75K  (repairing)
>             mfisyspd5p3   ONLINE       0     0 5.20K  (repairing)
>             mfisyspd6p3   ONLINE       0     0 4.51K  (repairing)
>             mfisyspd7p3   ONLINE       0     0 4.65K  (repairing)
>             mfisyspd8p3   ONLINE       0     0 4.70K  (repairing)
>             mfisyspd9p3   ONLINE       0     0 3.81K  (repairing)
>
> At this point the server was still unable to reboot. I had to force
> the boot files to be rewritten with a dumb:
>
> mv /boot{,.dist}
> cp -pr /boot{.dist,}
>
> Which turned out to be fine.
>
> Going further, I finally killed the zpool for good. It took me some time,
> and I stumbled upon the mfi(4) and mrsas(4) man pages and code.
>
> The mfi driver supports the following hardware:
>
> o LSI MegaRAID SAS 1078
> o LSI MegaRAID SAS 8408E
> o LSI MegaRAID SAS 8480E
> o LSI MegaRAID SAS 9240
> o LSI MegaRAID SAS 9260
> o Dell PERC5
> o Dell PERC6
> o IBM ServeRAID M1015 SAS/SATA
> o IBM ServeRAID M1115 SAS/SATA
> o IBM ServeRAID M5015 SAS/SATA
> o IBM ServeRAID M5110 SAS/SATA
> o IBM ServeRAID-MR10i
> o Intel RAID Controller SRCSAS18E
> o Intel RAID Controller SROMBSAS18E
>
> The mrsas driver supports the following hardware:
>
> [ Thunderbolt 6Gb/s MR controller ]
>
> o LSI MegaRAID SAS 9265
> o LSI MegaRAID SAS 9266
> o LSI MegaRAID SAS 9267
> o LSI MegaRAID SAS 9270
> o LSI MegaRAID SAS 9271
> o LSI MegaRAID SAS 9272
> o LSI MegaRAID SAS 9285
> o LSI MegaRAID SAS 9286
> o DELL PERC H810
> o DELL PERC H710/P
>

There was a detection priority problem: mfi(4) wins the probe for an HBA
that mrsas(4) should drive. The SAS 9271-8i is on the mrsas list above,
yet the disks showed up as mfisyspd devices.

The fix was to add hw.mfi.mrsas_enable=1 to /boot/loader.conf.

After this the server behaved correctly.

Should it be fixed for everyone?

NB: sorry, my last email was mistakenly sent unfinished.

Received on Fri Mar 22 2019 - 08:12:05 UTC
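
For anyone who wants to apply the same tunable, here is a minimal sketch of how the loader.conf change could be scripted. The helper name ensure_tunable is made up for illustration and is not part of FreeBSD. It only appends the line when no line already sets that name, so it can be run more than once. It assumes a plain name="value" loader.conf and does not rewrite an existing entry that has a different value.

```shell
#!/bin/sh
# Hypothetical helper (not a FreeBSD tool): append a loader tunable to a
# loader.conf-style file unless a line setting that name already exists.
ensure_tunable() {
    file=$1
    name=$2
    value=$3
    if grep -q "^${name}=" "$file" 2>/dev/null; then
        echo "already set in $file"
    else
        # loader.conf entries are conventionally written as name="value"
        printf '%s="%s"\n' "$name" "$value" >> "$file"
        echo "added to $file"
    fi
}

# Usage on the affected host (as root), followed by a reboot:
#   ensure_tunable /boot/loader.conf hw.mfi.mrsas_enable 1
```

One thing to check before rebooting: as far as I understand, mrsas(4) exposes disks through CAM, so device names are likely to change from mfisyspdN to daN. A pool imported by GPT label or GUID should not care, but anything that hard-codes the mfisyspd names would need updating.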