Re: Deadlocks / hangs in ZFS

From: Alexander Leidinger <Alexander_at_leidinger.net>
Date: Sat, 26 May 2018 21:54:10 +0200
Quoting Steve Wills <swills_at_freebsd.org> (from Tue, 22 May 2018  
08:17:00 -0400):

> I may be seeing similar issues. Have you tried leaving top -SHa  
> running and seeing what threads are using CPU when it hangs? I did  
> and saw pid 17 [zfskern{txg_thread_enter}] using lots of CPU but no  
> disk activity happening. Do you see similar?
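[The suggestion above can also be captured non-interactively, so the busy thread is on record when the hang occurs. A sketch, not from the thread itself; flag semantics per FreeBSD top(1): -S shows system/kernel processes, -H shows threads, -a shows full argv, -b is batch mode, -d and -s are display count and delay in seconds. The log path is only an example.]

```shell
# Log thread-level CPU usage every 5 seconds for about an hour (720 x 5s),
# appending to a file that survives the hang for later inspection.
top -SHa -b -d 720 -s 5 >> /var/log/top-threads.log 2>&1 &
```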

For me it is a different ZFS process/kthread: l2arc_feed_thread.  
Please note that there are still 31 GB free, so it doesn't look like  
resource exhaustion. What I consider strange is the swap usage. I  
watched the system and it started to use swap while there were >30 GB  
listed as free (in/out rates visible from time to time, and plenty of  
RAM free... ???).
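[A few counters that may help tell whether the swap usage comes from real paging pressure or something else. This is a diagnostic sketch, not commands from the original mail; the sysctl names are the standard FreeBSD ones.]

```shell
# Pages actually swapped in/out since boot (nonzero and growing rates
# mean real paging activity, matching the in/out rates seen in top).
sysctl vm.stats.vm.v_swappgsin vm.stats.vm.v_swappgsout

# Per-device swap usage, human-readable.
swapinfo -h

# Current ARC size vs. its configured cap, to rule out ARC growth
# pushing processes to swap.
sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max
```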

last pid: 93392;  load averages:  0.16,  0.44,  1.03    up 1+15:36:34  22:35:45
1509 processes: 17 running, 1392 sleeping, 3 zombie, 97 waiting
CPU:  0.1% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Mem: 597M Active, 1849M Inact, 6736K Laundry, 25G Wired, 31G Free
ARC: 20G Total, 9028M MFU, 6646M MRU, 2162M Anon, 337M Header, 1935M Other
     14G Compressed, 21G Uncompressed, 1.53:1 Ratio
Swap: 4096M Total, 1640M Used, 2455M Free, 40% Inuse

   PID     JID USERNAME      PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
    10       0 root          155 ki31     0K   256K CPU1    1  35.4H 100.00% [idle{idle: cpu1}]
    10       0 root          155 ki31     0K   256K CPU11  11  35.2H 100.00% [idle{idle: cpu11}]
    10       0 root          155 ki31     0K   256K CPU3    3  35.2H 100.00% [idle{idle: cpu3}]
    10       0 root          155 ki31     0K   256K CPU15  15  35.1H 100.00% [idle{idle: cpu15}]
    10       0 root          155 ki31     0K   256K RUN     9  35.1H 100.00% [idle{idle: cpu9}]
    10       0 root          155 ki31     0K   256K CPU5    5  35.0H 100.00% [idle{idle: cpu5}]
    10       0 root          155 ki31     0K   256K CPU14  14  35.0H 100.00% [idle{idle: cpu14}]
    10       0 root          155 ki31     0K   256K CPU0    0  35.8H  99.12% [idle{idle: cpu0}]
    10       0 root          155 ki31     0K   256K CPU6    6  35.3H  98.79% [idle{idle: cpu6}]
    10       0 root          155 ki31     0K   256K CPU8    8  35.1H  98.31% [idle{idle: cpu8}]
    10       0 root          155 ki31     0K   256K CPU12  12  35.0H  97.24% [idle{idle: cpu12}]
    10       0 root          155 ki31     0K   256K CPU4    4  35.4H  96.71% [idle{idle: cpu4}]
    10       0 root          155 ki31     0K   256K CPU10  10  35.0H  92.37% [idle{idle: cpu10}]
    10       0 root          155 ki31     0K   256K CPU7    7  35.2H  92.20% [idle{idle: cpu7}]
    10       0 root          155 ki31     0K   256K CPU13  13  35.1H  91.90% [idle{idle: cpu13}]
    10       0 root          155 ki31     0K   256K CPU2    2  35.4H  90.97% [idle{idle: cpu2}]
    11       0 root          -60    -     0K   816K WAIT    0  15:08   0.82% [intr{swi4: clock (0)}]
    31       0 root          -16    -     0K    80K pwait   0  44:54   0.60% [pagedaemon{dom0}]
 45453       0 root           20    0 16932K  7056K CPU9    9   4:12   0.24% top -SHaj
    24       0 root           -8    -     0K   256K l2arc_  0   4:12   0.21% [zfskern{l2arc_feed_thread}]
  2375       0 root           20    0 16872K  6868K select 11   3:52   0.20% top -SHua
  7007      12    235         20    0 18017M   881M uwait  12   0:00   0.19% [java{ESH-thingHandler-35}]
    32       0 root          -16    -     0K    16K psleep 15   5:03   0.11% [vmdaemon]
 41037       0 netchild       27    0 18036K  9136K select  4   2:20   0.09% tmux: server (/tmp/tmux-1001/default) (t
    36       0 root          -16    -     0K    16K -       6   2:02   0.09% [racctd]
  7007      12    235         20    0 18017M   881M uwait   9   1:24   0.07% [java{java}]
  4746       0 root           20    0 13020K  3792K nanslp  8   0:52   0.05% zpool iostat space 1
     0       0 root          -76    -     0K 10304K -       4   0:16   0.05% [kernel{if_io_tqg_4}]
  5550       8    933         20    0  2448M   607M uwait   8   0:41   0.03% [java{java}]
  5550       8    933         20    0  2448M   607M uwait  13   0:03   0.03% [java{Timer-1}]
  7007      12    235         20    0 18017M   881M uwait   0   0:39   0.02% [java{java}]
  5655       8    560         20    0 21524K  4840K select  6   0:21   0.02% /usr/local/sbin/hald{hald}
    30       0 root          -16    -     0K    16K -       4   0:25   0.01% [rand_harvestq]
  1259       0 root           20    0 18780K 18860K select 14   0:19   0.01% /usr/sbin/ntpd -c /etc/ntp.conf -p /var/
     0       0 root          -76    -     0K 10304K -      12   0:19   0.01% [kernel{if_config_tqg_0}]
    31       0 root          -16    -     0K    80K psleep  0   0:38   0.01% [pagedaemon{dom1}]
     0       0 root          -76    -     0K 10304K -       5   0:04   0.01% [kernel{if_io_tqg_5}]
  7007      12    235         20    0 18017M   881M uwait   1   0:16   0.01% [java{Karaf Lock Monitor }]
 12622       2     88         20    0  1963M   247M uwait   7   0:13   0.01% [mysqld{mysqld}]
 27043       0 netchild       20    0 18964K  9124K select  6   0:01   0.01% sshd: netchild_at_pts/0 (sshd)
  7007      12    235         20    0 18017M   881M uwait   8   0:10   0.01% [java{openHAB-job-schedul}]
  7007      12    235         20    0 18017M   881M uwait   6   0:10   0.01% [java{openHAB-job-schedul}]
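[One way to see where l2arc_feed_thread is spinning or blocked: dump the in-kernel call stacks of the zfskern threads. A sketch; pid 24 is taken from the listing above and will differ on other systems.]

```shell
# Print the kernel call chain of every thread of zfskern (pid 24 above).
# Repeating this a few times and seeing the l2arc_feed_thread stack end
# in the same function each time is a good hint for where it is stuck.
procstat -kk 24
```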


> On 05/22/18 04:17, Alexander Leidinger wrote:
>> Hi,
>>
>> does someone else experience deadlocks / hangs in ZFS?
>>
>> What I see is that if, on a 2-socket / 4-core -> 16-thread system,  
>> I do a lot in parallel (e.g. updating ports in several jails), the  
>> system may get into a state where I can log in, but any exit (e.g.  
>> from top) or logout of a shell blocks somewhere. Sometimes it helps  
>> to CTRL-C all updates to get the system into a good shape again,  
>> but most of the time it doesn't.
>>
>> On another system at the same rev (333966) with far fewer CPUs  
>> (and AMD instead of Intel), I don't see such behavior.
>>
>> Bye,
>> Alexander.
>>
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"
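[For the hung-on-exit symptom described in the quoted mail, the wait channel and kernel stack of the stuck process usually identify the lock involved. A diagnostic sketch, not commands from the thread; the pgrep target is hypothetical, and the DDB step requires a kernel built with DDB and console access.]

```shell
# Kernel stack of the newest hung process (here: a top that won't exit).
procstat -kk $(pgrep -n top)

# If the whole system is wedged, drop to the in-kernel debugger on the
# console and inspect lock chains from there:
sysctl debug.kdb.enter=1
# then in DDB: show allchains / show lockedvnods / ps
```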


-- 
http://www.Leidinger.net Alexander_at_Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild_at_FreeBSD.org  : PGP 0x8F31830F9F2772BF

Received on Sat May 26 2018 - 17:54:39 UTC