Re: ZFS-related panic: "possible" spa->spa_errlog_lock deadlock

From: Xin Li <delphij_at_delphij.net>
Date: Sun, 07 Sep 2014 23:56:41 +0800
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 9/7/14 11:23 PM, Fabian Keil wrote:
> Xin Li <delphij_at_delphij.net> wrote:
> 
>> On 9/7/14 9:02 PM, Fabian Keil wrote:
>>> Using a kernel built from FreeBSD 11.0-CURRENT r271182 I got
>>> the following panic yesterday:
>>> 
>>> [...] Unread portion of the kernel message buffer: [6880]
>>> panic: deadlkres: possible deadlock detected for
>>> 0xfffff80015289490, blocked for 1800503 ticks
>> 
>> Any chance to get all backtraces (e.g. thread apply all bt full
>> 16)? I think a different thread that held the lock have been
>> blocked, probably related to your disconnected vdev.
> 
> Output of "thread apply all bt full 16" is available at: 
> http://www.fabiankeil.de/tmp/freebsd/kgdb-output-spa_errlog_lock-deadlock.txt
>
>  A lot of the backtraces prematurely end with "Cannot access memory
> at address", therefore I also added "thread apply all bt" output.
> 
> Apparently there are at least two additional threads blocking below
> spa_get_stats():
> 
> Thread 1182 (Thread 101989): #0  sched_switch
> (td=0xfffff800628cc490, newtd=<value optimized out>, flags=<value
> optimized out>) at /usr/src/sys/kern/sched_ule.c:1932 #1
> 0xffffffff805a23c1 in mi_switch (flags=260, newtd=0x0) at
> /usr/src/sys/kern/kern_synch.c:493 #2  0xffffffff805e4bca in
> sleepq_wait (wchan=0x0, pri=0) at
> /usr/src/sys/kern/subr_sleepqueue.c:631 #3  0xffffffff80539f10 in
> _cv_wait (cvp=0xfffff80025534a50, lock=0xfffff80025534a30) at
> /usr/src/sys/kern/kern_condvar.c:139 #4  0xffffffff811721db in
> zio_wait (zio=<value optimized out>) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1442 
> #5  0xffffffff81102111 in dbuf_read (db=<value optimized out>,
> zio=<value optimized out>, flags=<value optimized out>) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:649 
> #6  0xffffffff81108e6d in dmu_buf_hold (os=<value optimized out>,
> object=<value optimized out>, offset=<value optimized out>,
> tag=0x0, dbp=0xfffffe00955c6648, flags=<value optimized out>) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:172 
> #7  0xffffffff81163986 in zap_lockdir (os=0xfffff8002b7ab000,
> obj=92, tx=0x0, lti=RW_READER, fatreader=1, adding=0, zapp=<value
> optimized out>) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:467
>
> 
#8  0xffffffff811644ad in zap_count (os=0x0, zapobj=0,
count=0xfffffe00955c66d8) at
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:712
> #9  0xffffffff8114a6dc in spa_get_errlog_size
> (spa=0xfffff800062ed000) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_errlog.c:149
>
> 
- ---Type <return> to continue, or q <return> to quit---
> #10 0xffffffff8113f549 in spa_get_stats (name=0xfffffe0044cac000
> "spaceloop", config=0xfffffe00955c68e8, altroot=0xfffffe0044cac430
> "", buflen=2048) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:3287 
> #11 0xffffffff81189a45 in zfs_ioc_pool_stats
> (zc=0xfffffe0044cac000) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:1656
>
> 
#12 0xffffffff81187290 in zfsdev_ioctl (dev=<value optimized out>,
zcmd=<value optimized out>, arg=<value optimized out>, flag=<value
optimized out>, td=<value optimized out>)
> at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:6136
>
> 
#13 0xffffffff80464a55 in devfs_ioctl_f (fp=0xfffff80038bd00a0,
com=3222821381, data=0xfffff800067b80a0, cred=<value optimized out>,
td=0xfffff800628cc490) at /usr/src/sys/fs/devfs/devfs_vnops.c:757
> #14 0xffffffff805f3c3d in kern_ioctl (td=0xfffff800628cc490,
> fd=<value optimized out>, com=0) at file.h:311 #15
> 0xffffffff805f381c in sys_ioctl (td=0xfffff800628cc490,
> uap=0xfffffe00955c6b80) at /usr/src/sys/kern/sys_generic.c:702 #16
> 0xffffffff8085c2db in amd64_syscall (td=0xfffff800628cc490,
> traced=0) at subr_syscall.c:133 #17 0xffffffff8083f90b in
> Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:390 #18
> 0x00000008019fc3da in ?? () Previous frame inner to this frame
> (corrupt stack?)

Yes, thread 1182 owned the lock and is waiting for the zio be done.
Other threads that wanted the lock would have to wait.

I don't have much clue why the system entered this state, however, as
the operations should have errored out (the GELI device is gone on
21:44:56 based on your log, which suggests all references were closed)
instead of waiting.

Adding mav_at_ as he may have some idea.

Cheers,
-----BEGIN PGP SIGNATURE-----

iQIcBAEBCgAGBQJUDIA5AAoJEJW2GBstM+nsdsgP/RT4nBT4mvOtpF2IEL7VFexJ
OjipGsWIDmG9kc8CEEN+qh3Q4+prJO3IwHGTUPa0Vu13jRZ3T/uoZj/ncVAqnCyW
s+SeEBjVVhZs/B08LqT2A8fZI9jVdqvFVWEL02z3ibWggoCnP60oag1OyvkNGqQU
apWtXkjTnrFmOEcbB95viD8QEhfUzlQgBbeRuK8ADtK/jQNEl6A8xdE5HCT2DIN4
icIwoj9eXqyLB0/aGVIFycRID4hWAZaqPehXVtGhCdnlGut7itdufuXtfmTCEDWs
z3vkGjTCJ3qsyKSxl/2ABGRgAH/lpR6J/N2yZOSNMRTt9PbjN1iLppu2IfD33OdY
QlQrI2HhbwvoZmYe4f4B/1/8qzKag77hzYG2J2aN0OGn45zkThOwoQt854zm7OHq
f3O3pysxUInTnIJrdBa53cT0arhQRtYn1xL5CYyvK4Nto74ht6g/AuBJbBWzWM/B
t2fKuZmpGt32lf+vHjWS0O2VWdkc6I6s5rVZJsSLzAMfc1rWePIcokPdUk9RucyX
qRrNDpMW53APWhPKGpkK+Utyx/OLKhO62d7iiYrGhVX0FU+trBV/6QdwPdkaM8Lx
6v51hDK6ZjF/K5ZCYFwQOPcSaSHiMIpQBQr0mFh2s/05p3/f7jNAthTx7npEIfYZ
PZ7DoMna8cwC4vtfoR1t
=Jxfg
-----END PGP SIGNATURE-----
Received on Sun Sep 07 2014 - 13:56:44 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:51 UTC