Re: Deadlocks / hangs in ZFS

From: Alexander Leidinger <Alexander_at_leidinger.net>
Date: Thu, 12 Jul 2018 14:42:29 +0200
Quoting Alexander Leidinger <Alexander_at_leidinger.net> (from Mon, 04  
Jun 2018 22:31:08 +0200):

> Quoting Slawa Olhovchenkov <slw_at_zxy.spb.ru> (from Sun, 3 Jun 2018  
> 22:28:14 +0300):
>
>> On Sun, Jun 03, 2018 at 09:14:50PM +0200, Alexander Leidinger wrote:
>>
>>> Quoting Alexander Leidinger <Alexander_at_leidinger.net> (from Mon, 28
>>> May 2018 09:02:01 +0200):
>>>
>>>> Quoting Slawa Olhovchenkov <slw_at_zxy.spb.ru> (from Mon, 28 May 2018
>>>> 01:06:12 +0300):
>>>>
>>>>> On Sun, May 27, 2018 at 09:41:59PM +0200, Kirill Ponomarev wrote:
>>>>>
>>>>>> On 05/22, Slawa Olhovchenkov wrote:
>>>>>>> > It has been a while since I tried Karl's patch the last time, and I
>>>>>>> > stopped because it didn't apply to -current anymore at some point.
>>>>>>> > Will what is provided right now in the patch work on -current?
>>>>>>>
>>>>>>> I am mean yes, after s/vm_cnt.v_free_count/vm_free_count()/g
>>>>>>> I am don't know how to have two distinct patch (for stable and
>>>>>>> current) in one review.
>>>>>>
>>>>>> I'm experiencing these issues sporadically as well, would you mind
>>>>>> to publish this patch for fresh current?
>>>>>
>>>>> Week ago I am adopt and publish patch to fresh current and stable, is
>>>>> adopt need again?
>>>>
>>>> I applied the patch in the review yesterday to rev 333966, it
>>>> applied OK (with some fuzz). I will try to reproduce my issue with
>>>> the patch.
>>>
>>> The behavior changed (or the system was long enough in this state
>>> without me noticing it). I have a panic now:
>>> panic: deadlkres: possible deadlock detected for 0xfffff803766db580,
>>> blocked for 1803003 ticks
>>
>> Hmm, may be first determinate locked function
>>
>> addr2line -ie /boot/kernel/kernel 0xfffff803766db580
>>
>> or
>>
>> kgdb
>> x/10i 0xfffff803766db580
>
> Both don'T produce any sensible output:
> (kgdb) x/10i 0xfffff803766db580
> 0xfffff803766db580:     subb   $0x80,-0x78(%rsi)
> 0xfffff803766db584:     (bad)
> 0xfffff803766db585:     (bad)
> 0xfffff803766db586:     (bad)
> 0xfffff803766db587:     incl   -0x7f7792(%rax)
> 0xfffff803766db58d:     (bad)
> 0xfffff803766db58e:     (bad)
> 0xfffff803766db58f:     incl   -0x7f7792(%rax)
> 0xfffff803766db595:     (bad)
> 0xfffff803766db596:     (bad)
>
>
> Seems I need to provoke a real kernel dump instead of a textdump for this.

Finally... time to recompile the kernel with crashdump-compress  
support and changing from textdump to normal dump and to install a  
recent gdb from ports...

The dump is with r336194 and the zfs patch as of 20180527.

---snip---
# kgdb -c /var/crash/vmcore.2 /boot/kernel/kernel
GNU gdb (GDB) 8.1 [GDB v8.1 for FreeBSD]
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd12.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...Reading symbols from  
/usr/lib/debug//boot/kernel/kernel.debug...done.
done.

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x20
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff81391fbe
stack pointer           = 0x0:0xfffffe0000457b10
frame pointer           = 0x0:0xfffffe0000457bb0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 15 (arc_reclaim_thread)
trap number             = 12
panic: page fault
cpuid = 1
time = 1531394214
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00004577d0
vpanic() at vpanic+0x1a3/frame 0xfffffe0000457830
panic() at panic+0x43/frame 0xfffffe0000457890
trap_fatal() at trap_fatal+0x35f/frame 0xfffffe00004578e0
trap_pfault() at trap_pfault+0x62/frame 0xfffffe0000457930
trap() at trap+0x2ba/frame 0xfffffe0000457a40
calltrap() at calltrap+0x8/frame 0xfffffe0000457a40
--- trap 0xc, rip = 0xffffffff81391fbe, rsp = 0xfffffe0000457b10, rbp  
= 0xfffffe0000457bb0 ---
arc_reclaim_thread() at arc_reclaim_thread+0x42e/frame 0xfffffe0000457bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe0000457bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0000457bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 38m3s
Dumping 2378 out of 8037 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at ./machine/pcpu.h:230
230             __asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) bt
#0  __curthread () at ./machine/pcpu.h:230
#1  doadump (textdump=1) at /usr/src/sys/kern/kern_shutdown.c:366
#2  0xffffffff80485e11 in kern_reboot (howto=260) at  
/usr/src/sys/kern/kern_shutdown.c:446
#3  0xffffffff804863f3 in vpanic (fmt=<optimized out>, ap=0xfffffe0000457870)
     at /usr/src/sys/kern/kern_shutdown.c:863
#4  0xffffffff80486443 in panic (fmt=<unavailable>) at  
/usr/src/sys/kern/kern_shutdown.c:790
#5  0xffffffff8075279f in trap_fatal (frame=0xfffffe0000457a50,  
eva=32) at /usr/src/sys/amd64/amd64/trap.c:892
#6  0xffffffff80752812 in trap_pfault (frame=0xfffffe0000457a50,  
usermode=<optimized out>)
     at /usr/src/sys/amd64/amd64/trap.c:728
#7  0xffffffff80751e1a in trap (frame=0xfffffe0000457a50) at  
/usr/src/sys/amd64/amd64/trap.c:427
#8  <signal handler called>
#9  0xffffffff81391fbe in arc_check_uma_cache (lowest=-1011712)
     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:4532
#10 arc_reclaim_thread (unused=<optimized out>)
     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:4657
#11 0xffffffff8044ca74 in fork_exit (callout=0xffffffff81391b90  
<arc_reclaim_thread>, arg=0x0,
     frame=0xfffffe0000457c00) at /usr/src/sys/kern/kern_fork.c:1057
#12 <signal handler called>
(kgdb) up 9
#9  0xffffffff81391fbe in arc_check_uma_cache (lowest=-1011712)
     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:4532
4532                    lowest +=  
uma_zone_get_free_size(zio_data_buf_cache[n]->kc_zone);
(kgdb) list
4527            int                     iter = 4;
4528            int                     step = 1 << (SPA_MAXBLOCKSHIFT  
- SPA_MINBLOCKSHIFT - 3);
4529            int                     n = (SPA_MAXBLOCKSIZE >>  
SPA_MINBLOCKSHIFT) - 1;
4530
4531            while (n >= 0) {
4532                    lowest +=  
uma_zone_get_free_size(zio_data_buf_cache[n]->kc_zone);
4533                    if (lowest >= 0)
4534                            return lowest;
4535                    n -= step;
4536                    if(--iter == 0) {
(kgdb) print n
$1 = 32767
(kgdb) print zio_data_buf_cache[n]
$2 = (kmem_cache_t *) 0x0
(kgdb)

---snip---

Bye,
Alexander.

-- 
http://www.Leidinger.net Alexander_at_Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild_at_FreeBSD.org  : PGP 0x8F31830F9F2772BF

Received on Thu Jul 12 2018 - 10:43:15 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:17 UTC