ZFS-related panic: "possible" spa->spa_errlog_lock deadlock

From: Fabian Keil <freebsd-listen_at_fabiankeil.de> Date: Sun, 7 Sep 2014 15:02:19 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:51 UTC

Using a kernel built from FreeBSD 11.0-CURRENT r271182 I got the following
panic yesterday:

[...]
Unread portion of the kernel message buffer:
[6880] panic: deadlkres: possible deadlock detected for 0xfffff80015289490, blocked for 1800503 ticks
[6880] 
[6880] cpuid = 1
[6880] KDB: stack backtrace:
[6880] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe009432e9e0
[6880] kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe009432ea90
[6880] panic() at panic+0x1c1/frame 0xfffffe009432eb50
[6880] deadlkres() at deadlkres+0x588/frame 0xfffffe009432ebb0
[6880] fork_exit() at fork_exit+0x9a/frame 0xfffffe009432ebf0
[6880] fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe009432ebf0
[6880] --- trap 0, rip = 0, rsp = 0xfffffe009432ecb0, rbp = 0 ---
[6880] KDB: enter: panic
[6880] Uptime: 1h54m40s
[6880] Dumping 354 out of 1973 MB:..5%..14%..23%..32%..41%..55%..64%..73%..82%..91%

Reading symbols from /boot/kernel/zfs.ko.symbols...done.
[...]
Loaded symbols for /boot/kernel/profile.ko.symbols
#0  doadump (textdump=1) at pcpu.h:219
219	pcpu.h: No such file or directory.
	in pcpu.h
(kgdb) where
#0  doadump (textdump=1) at pcpu.h:219
#1  0xffffffff80597bad in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:447
#2  0xffffffff80598100 in panic (fmt=<value optimized out>) at /usr/src/sys/kern/kern_shutdown.c:746
#3  0xffffffff80539b98 in deadlkres () at /usr/src/sys/kern/kern_clock.c:240
#4  0xffffffff8055e8da in fork_exit (callout=0xffffffff80539610 <deadlkres>, arg=0x0, frame=0xfffffe009432ec00) at /usr/src/sys/kern/kern_fork.c:977
#5  0xffffffff8083fb5e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:605
#6  0x0000000000000000 in ?? ()
Current language:  auto; currently minimal
(kgdb) set $td=(struct thread *)0xfffff80015289490
(kgdb) tid $td->td_tid
[Switching to thread 1184 (Thread 101428)]#0  sched_switch (td=0xfffff80015289490, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1932
1932			cpuid = PCPU_GET(cpuid);
(kgdb) bt
#0  sched_switch (td=0xfffff80015289490, newtd=<value optimized out>, flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1932
#1  0xffffffff805a23c1 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:493
#2  0xffffffff805e4bca in sleepq_wait (wchan=0x0, pri=0) at /usr/src/sys/kern/subr_sleepqueue.c:631
#3  0xffffffff805a12b2 in _sx_xlock_hard (sx=0xfffff800062ed820, tid=18446735277971510416, opts=<value optimized out>, file=0x0, line=0) at /usr/src/sys/kern/kern_sx.c:676
#4  0xffffffff805a0add in _sx_xlock (sx=0x0, opts=0, file=<value optimized out>, line=0) at sx.h:154
#5  0xffffffff8114a691 in spa_get_errlog_size (spa=0xfffff800062ed000) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_errlog.c:142
#6  0xffffffff8113f549 in spa_get_stats (name=0xfffffe0006126000 "spaceloop", config=0xfffffe00950e58e8, altroot=0xfffffe0006126430 "", buflen=2048)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:3287
#7  0xffffffff81189a45 in zfs_ioc_pool_stats (zc=0xfffffe0006126000) at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:1656
#8  0xffffffff81187290 in zfsdev_ioctl (dev=<value optimized out>, zcmd=<value optimized out>, arg=<value optimized out>, flag=<value optimized out>, td=<value optimized out>)
    at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:6136
#9  0xffffffff80464a55 in devfs_ioctl_f (fp=0xfffff80017107f50, com=3222821381, data=0xfffff8004fb4b740, cred=<value optimized out>, td=0xfffff80015289490) at /usr/src/sys/fs/devfs/devfs_vnops.c:757
#10 0xffffffff805f3c3d in kern_ioctl (td=0xfffff80015289490, fd=<value optimized out>, com=0) at file.h:311
#11 0xffffffff805f381c in sys_ioctl (td=0xfffff80015289490, uap=0xfffffe00950e5b80) at /usr/src/sys/kern/sys_generic.c:702
#12 0xffffffff8085c2db in amd64_syscall (td=0xfffff80015289490, traced=0) at subr_syscall.c:133
#13 0xffffffff8083f90b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:390
#14 0x00000008019fc3da in ?? ()
(kgdb) f 3
#3  0xffffffff805a12b2 in _sx_xlock_hard (sx=0xfffff800062ed820, tid=18446735277971510416, opts=<value optimized out>, file=0x0, line=0) at /usr/src/sys/kern/kern_sx.c:676
676				sleepq_wait(&sx->lock_object, 0);
(kgdb) p sx->lock_object
$14 = {lo_name = 0xffffffff81202163 "spa->spa_errlog_lock", lo_flags = 40960000, lo_data = 0, lo_witness = 0x0}

This happened several minutes after a couple of zpool processes
stopped responding while accessing information about the following
pool:

fk_at_r500 ~ $zpool status -v spaceloop
  pool: spaceloop
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 2.98M in 16h27m with 0 errors on Sun Sep  7 12:25:36 2014
config:

	NAME                   STATE     READ WRITE CKSUM
	spaceloop              ONLINE       0     0     0
	  label/spaceloop.eli  ONLINE       0     0    26

errors: No known data errors

Dump time was 21:58:48, the vdev disappeared a while before that,
therefore I believe the "possible" deadlock is a real one:

Sep  6 21:18:09 r500 kernel: [4441] GEOM_ELI: g_eli_write_done() failed (error=5) label/spaceloop.eli[WRITE(offset=2235700736, length=16384)]
Sep  6 21:18:09 r500 kernel: [4441] GEOM_ELI: g_eli_read_done() failed (error=5) label/spaceloop.eli[READ(offset=4008976384, length=8192)]
Sep  6 21:18:09 r500 kernel: [4441] GEOM_ELI: g_eli_read_done() failed (error=5) label/spaceloop.eli[READ(offset=4009238528, length=8192)]
Sep  6 21:18:09 r500 kernel: [4441] GEOM_ELI: g_eli_read_done() failed (error=5) label/spaceloop.eli[READ(offset=270336, length=8192)]
Sep  6 21:18:27 r500 kernel: [4459] (da1:umass-sim1:1:0:0): WRITE(10). CDB: 2a 00 00 60 65 0c 00 00 40 00 
Sep  6 21:18:27 r500 kernel: [4459] (da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Sep  6 21:18:27 r500 kernel: [4459] (da1:umass-sim1:1:0:0): Retrying command
Sep  6 21:18:45 r500 kernel: [4477] (da1:umass-sim1:1:0:0): WRITE(10). CDB: 2a 00 00 60 65 0c 00 00 40 00 
Sep  6 21:18:45 r500 kernel: [4477] (da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Sep  6 21:18:45 r500 kernel: [4477] (da1:umass-sim1:1:0:0): Retrying command
Sep  6 21:19:03 r500 kernel: [4495] (da1:umass-sim1:1:0:0): WRITE(10). CDB: 2a 00 00 60 65 0c 00 00 40 00 
Sep  6 21:19:03 r500 kernel: [4495] (da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Sep  6 21:19:03 r500 kernel: [4495] (da1:umass-sim1:1:0:0): Retrying command
Sep  6 21:19:21 r500 kernel: [4514] (da1:umass-sim1:1:0:0): WRITE(10). CDB: 2a 00 00 60 65 0c 00 00 40 00 
Sep  6 21:19:21 r500 kernel: [4514] (da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Sep  6 21:19:21 r500 kernel: [4514] (da1:umass-sim1:1:0:0): Retrying command
Sep  6 21:19:40 r500 kernel: [4532] (da1:umass-sim1:1:0:0): WRITE(10). CDB: 2a 00 00 60 65 0c 00 00 40 00 
Sep  6 21:19:40 r500 kernel: [4532] (da1:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Sep  6 21:19:40 r500 kernel: [4532] (da1:umass-sim1:1:0:0): Error 5, Retries exhausted
Sep  6 21:19:40 r500 kernel: [4532] GEOM_ELI: g_eli_write_done() failed (error=5) label/spaceloop.eli[WRITE(offset=3234469888, length=32768)]
[...]
Sep  6 21:29:46 r500 kernel: [5138] GEOM_ELI: Device label/takems.eli destroyed.
Sep  6 21:29:52 r500 kernel: [5144] ugen0.3: <SMI Corporation> at usbus0 (disconnected)
Sep  6 21:29:52 r500 kernel: [5144] umass2: at uhub0, port 1, addr 3 (disconnected)
Sep  6 21:29:52 r500 kernel: [5144] da2 at umass-sim2 bus 2 scbus4 target 0 lun 0
Sep  6 21:29:52 r500 kernel: [5144] da2: <takeMS USB Mini 1100> s/n AA04012900007508 detached
Sep  6 21:29:52 r500 kernel: [5144] pass4 at umass-sim2 bus 2 scbus4 target 0 lun 0
Sep  6 21:29:52 r500 kernel: [5144] pass4: <takeMS USB Mini 1100> s/n AA04012900007508 detached
Sep  6 21:29:52 r500 kernel: [5144] (pass4:umass-sim2:2:0:0): Periph destroyed
Sep  6 21:29:52 r500 kernel: [5144] (da2:umass-sim2:2:0:0): Periph destroyed
Sep  6 21:44:56 r500 kernel: [6049] umass1: at uhub1, port 1, addr 3 (disconnected)
Sep  6 21:44:56 r500 kernel: [6049] da1 at umass-sim1 bus 1 scbus3 target 0 lun 0
Sep  6 21:44:56 r500 kernel: [6049] da1: <General USB Flash Disk 1.0> s/n 0000000000002511 detached
Sep  6 21:44:56 r500 kernel: [6049] pass3 at umass-sim1 bus 1 scbus3 target 0 lun 0
Sep  6 21:44:56 r500 kernel: [6049] pass3: <General USB Flash Disk 1.0> s/n 0000000000002511 detached
Sep  6 21:44:56 r500 kernel: [6049] (pass3:umass-sim1:1:0:0): Periph destroyed
Sep  6 21:44:56 r500 kernel: [6049] GEOM_ELI: Device label/spaceloop.eli destroyed.
Sep  6 21:44:56 r500 kernel: [6049] GEOM_ELI: Detached label/spaceloop.eli on last close.
Sep  6 21:44:56 r500 kernel: [6049] (da1:umass-sim1:1:0:0): Periph destroyed
Sep  6 21:44:57 r500 ZFS: vdev is removed, pool_guid=13593515477493551420 vdev_guid=2311923824790456307
Sep  6 21:44:57 r500 kernel: [6049] umass1: <General USB Flash Disk, class 0/0, rev 2.00/1.00, addr 3> on usbus1
Sep  6 21:44:57 r500 kernel: [6049] umass1:  SCSI over Bulk-Only; quirks = 0x4101
Sep  6 21:44:57 r500 kernel: [6049] umass1:3:1: Attached to scbus3
Sep  6 21:46:10 r500 kernel: [6122] (probe0:umass-sim1:1:0:0): INQUIRY. CDB: 12 00 00 00 24 00 
Sep  6 21:46:10 r500 kernel: [6122] (probe0:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Sep  6 21:46:10 r500 kernel: [6122] (probe0:umass-sim1:1:0:0): Retrying command
Sep  6 21:46:45 r500 kernel: [6157] ugen1.3: <General> at usbus1 (disconnected)
Sep  6 21:46:45 r500 kernel: [6157] umass1: at uhub1, port 1, addr 3 (disconnected)
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): INQUIRY. CDB: 12 00 00 00 24 00 
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): Retrying command
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): INQUIRY. CDB: 12 00 00 00 24 00 
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): Retrying command
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): INQUIRY. CDB: 12 00 00 00 24 00 
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): Retrying command
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): INQUIRY. CDB: 12 00 00 00 24 00 
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): CAM status: CCB request completed with an error
Sep  6 21:46:45 r500 kernel: [6157] (probe0:umass-sim1:1:0:0): Error 5, Retries exhausted

Two other pools, one using USB as well, were unaffected until
the panic was triggered.

A pleasant surprise was that the system entered ddb from Xorg
and dumping on /dev/ada0s1b worked despite /dev/ada0s1b.eli being
configured as swap device. I wasn't aware that this was working now.

Fabian