Today something unusual happened on one of my machines:

kernel: (ada0:ahcich0:0:0:0): lost device
kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected flags 0x18 refcount 1
kernel: adaasync: Unable to attach to new device due to status 0x6

It looks like the disk disappeared from the bus and then re-appeared on the
bus, but not to the OS.
One of the partitions that the disk hosted was a swap partition and it seems to
be the cause of some of the following consequences.

The consequences:
* ZFS properly noticed disappearance of the disk, but its diagnostic was a
little bit misleading:

  pool: pond
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 8h55m with 0 errors on Sat Dec 22 12:06:30 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        pond                                            DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            12725235722288301230                        REMOVED      0     0     0  was /dev/gptid/fcf3558b-493b-11de-a8b9-001cc08221ff
            gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0

Yes, I agree that the disk got removed/lost, but disagree that "the
administrator" did it.
* geom_event thread started consuming 100% of CPU in g_wither_washer()
* /dev/ada0 disappeared but camcontrol devlist still reported ada0:
<ST3500410AS CC34>  at scbus0 target 0 lun 0 (pass0,ada0)
* As seen in the system messages, the CAM layer refused to re-attach the disk
* gpart command would just crash

So, I can explain the behavior of the geom_event thread: apparently
swapgeom_orphan doesn't do anything that is really meaningful to GEOM, and so
g_wither_washer is stuck waiting until the swap consumer goes away (drops its
access bits).
(Another sad thing about this state is that I couldn't swapoff the device,
because there was no device entry.)

I am not sure if the "attempt to re-allocate valid device" failure was caused by
this, but it could be, if something in the CAM layer was waiting for the GEOM
layer to be done with the disk.

It would be nice if the swap code properly supported disappearance of the
underlying disks. Especially in this case, where the swap was actually never
used / touched at all (a few hours after reboot and a completely idle system).

-- 
Andriy Gapon

Received on Sun Jan 20 2013 - 18:00:29 UTC
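
The g_wither_washer explanation in the message above can be illustrated with a toy model. This is only a sketch under stated assumptions, not FreeBSD kernel code: the classes `Provider` and `Consumer`, the single `access` counter, and the functions `orphan`, `wither_pass` and `passes_until_destroyed` are all hypothetical names invented for illustration. It assumes, as the message suggests, that a withering provider can only be destroyed once every consumer has dropped its access, and that an orphan handler which ignores the event never does so.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Consumer:
    """Toy stand-in for a GEOM consumer (hypothetical, not the kernel struct)."""
    name: str
    access: int  # non-zero while the consumer still holds the provider open
    on_orphan: Callable[["Consumer"], None]


@dataclass
class Provider:
    """Toy stand-in for the provider that lost its underlying disk."""
    name: str
    consumers: List[Consumer] = field(default_factory=list)
    withering: bool = False


def orphan_noop(consumer: Consumer) -> None:
    """Orphan handler that ignores the event (the suspected swap behavior)."""


def orphan_release(consumer: Consumer) -> None:
    """Orphan handler that drops its access so cleanup can proceed."""
    consumer.access = 0


def orphan(provider: Provider) -> None:
    """Mark the provider as withering and notify every attached consumer."""
    provider.withering = True
    for consumer in provider.consumers:
        consumer.on_orphan(consumer)


def wither_pass(provider: Provider) -> bool:
    """One cleanup pass: detach idle consumers, report whether provider is gone."""
    provider.consumers = [c for c in provider.consumers if c.access > 0]
    return provider.withering and not provider.consumers


def passes_until_destroyed(provider: Provider, max_passes: int = 1000) -> Optional[int]:
    """Return the pass on which cleanup finished, or None if it never did."""
    orphan(provider)
    for n in range(1, max_passes + 1):
        if wither_pass(provider):
            return n
    return None  # the real thread has no such cap and would keep spinning


stuck = Provider("lost-disk", [Consumer("swap", 1, orphan_noop)])
clean = Provider("lost-disk", [Consumer("swap", 1, orphan_release)])
print(passes_until_destroyed(stuck))
print(passes_until_destroyed(clean))
```

In this model the first provider is never destroyed within the pass limit, which corresponds to the thread burning CPU, while the second is cleaned up on the first pass because its consumer released its access when orphaned.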