Re: Locked up nfsd after avg_at_ sendfile patch

From: Andriy Gapon <avg_at_freebsd.org>
Date: Thu, 14 Oct 2010 20:20:49 +0300

On 13/10/2010 21:43, Andriy Gapon said the following:
> Further walking child zio hierarchy we reach the one that looks like this:
> $59 = {io_bookmark = {zb_objset = 400, zb_object = 0, zb_level = -1, zb_blkid =
> 22437}, io_prop = {zp_checksum = ZIO_CHECKSUM_INHERIT, zp_compress =
> ZIO_COMPRESS_INHERIT, zp_type = DMU_OT_NONE,
>     zp_level = 0 '\0', zp_ndvas = 0 '\0'}, io_type = ZIO_TYPE_WRITE, io_child_type
> = ZIO_CHILD_VDEV, io_cmd = 0, io_priority = 0 '\0', io_reexecute = 0 '\0',
> io_state = "\001", io_txg = 0,
>   io_spa = 0xffffff00056c6000, io_bp = 0xffffff01acdbaa30, io_bp_copy = {blk_dva =
> {{dva_word = {12884902144, 1678614837}}, {dva_word = {0, 0}}, {dva_word = {0,
> 0}}}, blk_prop = 9225910817809957119,
>     blk_pad = {0, 0, 0}, blk_birth = 236695, blk_fill = 0, blk_cksum = {zc_word =
> {15569186404091016741, 3408946246337318984, 400, 22437}}}, io_parent_list =
> {list_size = 48, list_offset = 16,
>     list_head = {list_next = 0xffffff000826b7c0, list_prev = 0xffffff000826b7c0}},
> io_child_list = {list_size = 48, list_offset = 32, list_head = {list_next =
> 0xffffff00080aca98,
>       list_prev = 0xffffff00080aca98}}, io_walk_link = 0x0, io_logical =
> 0xffffff0008b8d660, io_transform_stack = 0x0, io_ready = 0, io_done =
> 0xffffffff80b99ab0 <vdev_mirror_child_done>,
>   io_private = 0xffffff00b5f469a8, io_bp_orig = {blk_dva = {{dva_word =
> {12884902144, 1678614837}}, {dva_word = {0, 0}}, {dva_word = {0, 0}}}, blk_prop =
> 9225910817809957119, blk_pad = {0, 0, 0},
>     blk_birth = 236695, blk_fill = 0, blk_cksum = {zc_word =
> {15569186404091016741, 3408946246337318984, 400, 22437}}}, io_data =
> 0xffffff80e6565000, io_size = 131072, io_vd = 0xffffff00084cd000,
>   io_vsd = 0x0, io_vsd_free = 0, io_offset = 859454990848, io_deadline = 20883,
> io_offset_node = {avl_child = {0x0, 0x0}, avl_pcb = 18446742974333891893},
> io_deadline_node = {avl_child = {0x0, 0x0},
>     avl_pcb = 1}, io_vdev_tree = 0xffffff00084cd578, io_flags = 179, io_stage =
> ZIO_STAGE_VDEV_IO_START, io_pipeline = 47104, io_orig_flags = 131, io_orig_stage =
> ZIO_STAGE_READY,
>   io_orig_pipeline = 47104, io_error = 0, io_child_error = {0, 0, 0}, io_children
> = {{0, 0}, {0, 0}, {0, 0}}, io_stall = 0x0, io_gang_leader = 0x0, io_gang_tree = 0x0,
>   io_executor = 0xffffff000875a8a0, io_waiter = 0x0, io_lock = {lock_object =
> {lo_name = 0xffffffff80c29a8b "zio->io_lock", lo_flags = 40960000, lo_data = 0,
> lo_witness = 0x0}, sx_lock = 1},
>   io_cv = {cv_description = 0xffffffff80c29a9a "zio->io_cv)", cv_waiters = 0},
> io_ena = 0, io_task = {ost_task = {ta_running = 0x0, ta_link = {stqe_next = 0x0},
> ta_pending = 0, ta_priority = 0,
>       ta_func = 0, ta_context = 0x0}, ost_func = 0, ost_arg = 0x0, ost_magic = 0}}

So, after some more investigation, it looks like this zio is genuinely stuck:
its bio is stuck in GEOM because its ccb/command is stuck in arcmsr.
It looks like the driver (controller/firmware) isn't processing any more requests.
Perhaps it's a hardware issue, but I reckon that the driver should have detected
the situation, timed out the commands and reset the hardware (if needed).
Anyway, it looks like this is not related to ZFS[*].  Maybe the firmware and BIOS
should be updated, or maybe the hardware should be replaced.

[*] Perhaps ZFS should have its own zio timeout mechanism.
And/or GEOM.
And/or peripheral or transport layer of CAM.
But, IMO, the SIM drivers must have it.
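
To make the idea of a per-command timeout concrete, here is a rough userspace
sketch of the kind of watchdog pass I have in mind. It is only an illustration:
struct pending_cmd, cmd_timed_out() and watchdog_scan() are made-up names, not
anything from arcmsr, CAM, GEOM or ZFS, and a real SIM would hook this into its
own timers and its own reset path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-flight command record; not the arcmsr or CAM layout. */
struct pending_cmd {
	uint32_t id;		/* caller-chosen tag */
	uint64_t issued_ms;	/* when the command was handed to the hardware */
	uint64_t timeout_ms;	/* how long it may stay outstanding */
	bool	 completed;	/* set by the completion path */
};

/* True if the command is still outstanding past its deadline. */
static bool
cmd_timed_out(const struct pending_cmd *c, uint64_t now_ms)
{
	if (c->completed || now_ms < c->issued_ms)
		return (false);
	return (now_ms - c->issued_ms > c->timeout_ms);
}

/*
 * One watchdog pass over the in-flight table.  Returns the number of
 * timed-out commands and stores up to max_ids of their tags in stuck_ids.
 * The caller decides what to do with a non-zero result, e.g. abort the
 * commands or reset the controller.
 */
static size_t
watchdog_scan(const struct pending_cmd *cmds, size_t n, uint64_t now_ms,
    uint32_t *stuck_ids, size_t max_ids)
{
	size_t stuck = 0;

	assert(cmds != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!cmd_timed_out(&cmds[i], now_ms))
			continue;
		if (stuck < max_ids)
			stuck_ids[stuck] = cmds[i].id;
		stuck++;
	}
	return (stuck);
}
```

Called periodically, a non-zero return from watchdog_scan() is the point where
the commands would be failed back up the stack, so that the bio and the zio
above them complete with an error instead of waiting forever.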
-- 
Andriy Gapon
Received on Thu Oct 14 2010 - 15:20:54 UTC
