----- Original Message ----- From: "Xin Li" <delphij_at_delphij.net> To: "Fabian Keil" <freebsd-listen_at_fabiankeil.de>; <freebsd-current_at_freebsd.org> Cc: "Alexander Motin" <mav_at_ixsystems.com> Sent: Sunday, September 07, 2014 4:56 PM Subject: Re: ZFS-related panic: "possible" spa->spa_errlog_lock deadlock > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA512 > > On 9/7/14 11:23 PM, Fabian Keil wrote: >> Xin Li <delphij_at_delphij.net> wrote: >> >>> On 9/7/14 9:02 PM, Fabian Keil wrote: >>>> Using a kernel built from FreeBSD 11.0-CURRENT r271182 I got >>>> the following panic yesterday: >>>> >>>> [...] Unread portion of the kernel message buffer: [6880] >>>> panic: deadlkres: possible deadlock detected for >>>> 0xfffff80015289490, blocked for 1800503 ticks >>> >>> Any chance to get all backtraces (e.g. thread apply all bt full >>> 16)? I think a different thread that held the lock have been >>> blocked, probably related to your disconnected vdev. >> >> Output of "thread apply all bt full 16" is available at: >> http://www.fabiankeil.de/tmp/freebsd/kgdb-output-spa_errlog_lock-deadlock.txt >> >> A lot of the backtraces prematurely end with "Cannot access memory >> at address", therefore I also added "thread apply all bt" output. >> >> Apparently there are at least two additional threads blocking below >> spa_get_stats(): >> >> Thread 1182 (Thread 101989): #0 sched_switch >> (td=0xfffff800628cc490, newtd=<value optimized out>, flags=<value >> optimized out>) at /usr/src/sys/kern/sched_ule.c:1932 #1 >> 0xffffffff805a23c1 in mi_switch (flags=260, newtd=0x0) at >> /usr/src/sys/kern/kern_synch.c:493 #2 0xffffffff805e4bca in >> sleepq_wait (wchan=0x0, pri=0) at >> /usr/src/sys/kern/subr_sleepqueue.c:631 #3 0xffffffff80539f10 in >> _cv_wait (cvp=0xfffff80025534a50, lock=0xfffff80025534a30) at >> /usr/src/sys/kern/kern_condvar.c:139 #4 0xffffffff811721db in >> zio_wait (zio=<value optimized out>) at >> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1442 >> #5 0xffffffff81102111 in dbuf_read (db=<value optimized out>, >> zio=<value optimized out>, flags=<value optimized out>) at >> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:649 >> #6 0xffffffff81108e6d in dmu_buf_hold (os=<value optimized out>, >> object=<value optimized out>, offset=<value optimized out>, >> tag=0x0, dbp=0xfffffe00955c6648, flags=<value optimized out>) at >> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:172 >> #7 0xffffffff81163986 in zap_lockdir (os=0xfffff8002b7ab000, >> obj=92, tx=0x0, lti=RW_READER, fatreader=1, adding=0, zapp=<value >> optimized out>) at >> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:467 >> >> > #8 0xffffffff811644ad in zap_count (os=0x0, zapobj=0, > count=0xfffffe00955c66d8) at > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:712 >> #9 0xffffffff8114a6dc in spa_get_errlog_size >> (spa=0xfffff800062ed000) at >> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_errlog.c:149 >> >> > - ---Type <return> to continue, or q <return> to quit--- >> #10 0xffffffff8113f549 in spa_get_stats (name=0xfffffe0044cac000 >> "spaceloop", config=0xfffffe00955c68e8, altroot=0xfffffe0044cac430 >> "", buflen=2048) at >> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:3287 >> #11 0xffffffff81189a45 in zfs_ioc_pool_stats >> (zc=0xfffffe0044cac000) at >> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:1656 >> >> > #12 0xffffffff81187290 in zfsdev_ioctl (dev=<value optimized out>, > zcmd=<value optimized out>, arg=<value optimized out>, flag=<value > optimized out>, td=<value optimized out>) >> at >> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:6136 >> >> > #13 0xffffffff80464a55 in devfs_ioctl_f (fp=0xfffff80038bd00a0, > com=3222821381, data=0xfffff800067b80a0, cred=<value optimized out>, > td=0xfffff800628cc490) at /usr/src/sys/fs/devfs/devfs_vnops.c:757 >> #14 0xffffffff805f3c3d in kern_ioctl (td=0xfffff800628cc490, >> fd=<value optimized out>, com=0) at file.h:311 #15 >> 0xffffffff805f381c in sys_ioctl (td=0xfffff800628cc490, >> uap=0xfffffe00955c6b80) at /usr/src/sys/kern/sys_generic.c:702 #16 >> 0xffffffff8085c2db in amd64_syscall (td=0xfffff800628cc490, >> traced=0) at subr_syscall.c:133 #17 0xffffffff8083f90b in >> Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:390 #18 >> 0x00000008019fc3da in ?? () Previous frame inner to this frame >> (corrupt stack?) > > Yes, thread 1182 owned the lock and is waiting for the zio be done. > Other threads that wanted the lock would have to wait. > > I don't have much clue why the system entered this state, however, as > the operations should have errored out (the GELI device is gone on > 21:44:56 based on your log, which suggests all references were closed) > instead of waiting. > > Adding mav_at_ as he may have some idea. We're seen a disk drop invalidating a pool before, which should fail all reads / writes but process have instead just wedged in the kernel. >From experience I'd say it happens ~5% of time, so its quite hard to catch. Unfortunately never managed to get a dump of it. Regards SteveReceived on Sun Sep 07 2014 - 14:26:53 UTC
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:40:51 UTC