Re: r358252 causes intermittent hangs where processes are stuck sleeping on btalloc

From: Rick Macklem <rmacklem_at_uoguelph.ca>
Date: Tue, 16 Jun 2020 00:06:39 +0000
Rick Macklem wrote:
>r358098 will hang fairly easily, in 1-3 cycles of the kernel build over NFS.
>I thought this was the culprit, since I did 6 cycles of r358097 without a hang.
>However, I just got a hang with r358097, but it looks rather different.
>The r358097 hang did not have any processes sleeping on btalloc. They
>appeared to be waiting on two different locks in the buffer cache.
>As such, I think it might be a different problem. (I'll admit I should have
made notes about this one before rebooting, but I was frustrated that
it happened and rebooted before looking at it in much detail.)
Ok, so I did 10 cycles of the kernel build over NFS for r358096 and never
got a hang.
--> It seems that r358097 is the culprit and r358098 makes it easier
      to reproduce.
      --> Basically runs out of kernel memory.

It is not obvious if I can revert these two commits without reverting
other ones, since there were a bunch of vm changes after these.

I'll take a look, but if you guys have any ideas on how to fix this, please
let me know.

Thanks, rick

Jeff, to fill you in, I have been getting intermittent hangs on a Pentium 4
(single core i386) with 1.25Gbytes ram when doing kernel builds using
head kernels from this winter. (I also saw one when doing a kernel build
on UFS, so they aren't NFS specific, although easier to reproduce that way.)
After a typical hang, there will be a bunch of processes sleeping on "btalloc"
and several processes holding the following lock:
exclusive sx lock @ vm/vm_map.c:4761
- I have seen hangs where that is the only lock held by any process except
   the interrupt thread.
- I have also seen processes waiting on the following locks:
kern/subr_vmem.c:1343
kern/subr_vmem.c:633

I can't be absolutely sure r358098 is the culprit, but it seems to make the
problem more reproducible.

If anyone has a patch suggestion, I can test it.
Otherwise, I will continue to test r358097 and earlier, to see which hangs
occur. (I've done 8 cycles of testing r356776 without difficulty, but that
doesn't guarantee it isn't broken.)

There is a bunch more of the stuff I got for Kostik and Ryan below.
I can do "db" when it is hung, but it is a screen console, so I need to
transcribe the output to email by hand. (i.e., if you need something
specific I can do that, but trying to do everything Kostik and Ryan asked
for isn't easy.)

rick



Konstantin Belousov wrote:
>On Fri, May 22, 2020 at 11:46:26PM +0000, Rick Macklem wrote:
>> Konstantin Belousov wrote:
>> >On Wed, May 20, 2020 at 11:58:50PM -0700, Ryan Libby wrote:
>> >> On Wed, May 20, 2020 at 6:04 PM Rick Macklem <rmacklem_at_uoguelph.ca> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > Since I hadn't upgraded a kernel through the winter, it took me a while
>> >> > to bisect this, but r358252 seems to be the culprit.
No longer true. I succeeded in reproducing the hang today running an
r358251 kernel.

I haven't had much luck so far, but see below for what I have learned.

>> >> >
>> >> > If I do a kernel build over NFS using my not so big Pentium 4 (single core,
>> >> > 1.25Gbytes RAM, i386), about every second attempt will hang.
>> >> > When I do a "ps" in the debugger, I see processes sleeping on btalloc.
>> >> > If I revert to r358251, I cannot reproduce this.
As above, this is no longer true.

>> >> >
>> >> > Any ideas?
>> >> >
>> >> > I can easily test any change you might suggest to see if it fixes the
>> >> > problem.
>> >> >
>> >> > If you want more debug info, let me know, since I can easily
>> >> > reproduce it.
>> >> >
>> >> > Thanks, rick
>> >>
>> >> Nothing obvious to me.  I can maybe try a repro on a VM...
>> >>
>> >> ddb ps, acttrace, alltrace, show all vmem, show page would be welcome.
>> >>
>> >> "btalloc" is "We're either out of address space or lost a fill race."
From what I see, I think it is "out of address space".
For one of the hangs, when I did "show alllocks", everything except the
intr thread, was waiting for the
exclusive sx lock @ vm/vm_map.c:4761

>> >
>> >Yes, I would be not surprised to be out of something on 1G i386 machine.
>> >Please also add 'show alllocks'.
>> Ok, I used an up to date head kernel and it took longer to reproduce a hang.
Go down to Kostik's comment about kern.maxvnodes for the rest of what I've
learned. (The time it takes to reproduce one of these varies greatly, but I usually
get one within 3 cycles of a full kernel build over NFS. I have had it happen
once when doing a kernel build over UFS.)

>> This time, none of the processes are stuck on "btalloc".
> I'll try and give you most of the above, but since I have to type it in by hand
> from the screen, I might not get it all. (I'm no real typist;-)
> > show alllocks
> exclusive lockmgr ufs (ufs) r = 0 locked @ kern/vfs_subr.c:3259
> exclusive lockmgr nfs (nfs) r = 0 locked @ kern/vfs_lookup.c:737
> exclusive sleep mutex kernel area domain (kernel arena domain) r = 0 locked @ kern/subr_vmem.c:1343
> exclusive lockmgr bufwait (bufwait) r = 0 locked @ kern/vfs_bio.c:1663
> exclusive lockmgr ufs (ufs) r = 0 locked @ kern/vfs_subr.c:2930
> exclusive lockmgr syncer (syncer) r = 0 locked @ kern/vfs_subr.c:2474
> Process 12 (intr) thread 0x.. (1000008)
> exclusive sleep mutex Giant (Giant) r = 0 locked @ kern/kern_intr.c:1152
>
> > ps
> - Not going to list them all, but here are the ones that seem interesting...
> 18 0 0 0 DL vlruwt 0x11d939cc [vnlru]
> 16 0 0 0 DL (threaded)               [bufdaemon]
> 100069  D  qsleep                      [bufdaemon]
> 100074  D  -                               [bufspacedaemon-0]
> 100084  D  sdflush  0x11923284 [/ worker]
> - and more of these for the other UFS file systems
> 9 0 0 0   DL psleep  0x1e2f830  [vmdaemon]
> 8 0 0 0   DL (threaded)               [pagedaemon]
> 100067  D   psleep 0x1e2e95c   [dom0]
> 100072  D   launds 0x1e2e968   [laundry: dom0]
> 100073  D   umarcl 0x12cc720   [uma]
> … a bunch of usb and cam ones
> 100025  D   -           0x1b2ee40  [doneq0]
> …
> 12 0 0 0 RL  (threaded)               [intr]
> 100007  I                                     [swi6: task queue]
> 100008  Run           CPU 0           [swi6: Giant taskq]
> …
> 100000  D   swapin 0x1d96dfc    [swapper]
> - and a bunch more in D state.
> Does this mean the swapper was trying to swap in?
>
> > acttrace
> - just shows the keyboard
> kdb_enter() at kdb_enter+0x35/frame
> vt_kbdevent() at vt_kbdevent+0x329/frame
> kbdmux_intr() at kbdmux_intr+0x19/frame
> taskqueue_run_locked() at taskqueue_run_locked+0x175/frame
> taskqueue_run() at taskqueue_run+0x44/frame
> taskqueue_swi_giant_run(0) at taskqueue_swi_giant_run+0xe/frame
> ithread_loop() at ithread_loop+0x237/frame
> fork_exit() at fork_exit+0x6c/frame
> fork_trampoline() at 0x../frame
>
> > show all vmem
> vmem 0x.. 'transient arena'
>   quantum: 4096
>   size:  23592960
>   inuse: 0
>   free: 23592960
>   busy tags:   0
>   free tags:    2
>                inuse   size   free   size
>   16777216     0       0      1      23592960
> vmem 0x.. 'buffer arena'
>   quantum:  4096
>   size:   94683136
>   inuse: 94502912
>   free: 180224
>   busy tags:    1463
>   free tags:      3
>                inuse   size       free   size
>   16384        2       32768      1      16384
>   32768        39      1277952    1      32768
>   65536        1422    93192192   0      0
>   131072       0       0          1      131072
> vmem 0x.. 'i386trampoline'
>   quantum:      1
>   size:               24576
>   inuse:             20860
>   free:               3716
>   busy tags:     9
>   free tags:      3
>                inuse   size    free   size
>   32           1       48      1      52
>   64           2       208     0      0
>   128          2       280     0      0
>   2048         1       2048    1      3664
>   4096         2       8192    0      0
>   8192         1       10084   0      0
> vmem 0x.. 'kernel rwx arena'
>   quantum:    4096
>   size: 0
>   inuse: 0
>   free:   0
>   busy tags: 0
>   free tags:  0
> vmem 0x.. 'kernel area dom'
>   quantum:  4096
>   size: 56623104
>   inuse: 56582144
>   free: 40960
>   busy tags: 11224
>   free tags: 3
>I think this is the trouble.
>
>Did you try reducing kern.maxvnodes?  What is the default value for
>the knob on your machine?
The default is 84342.
I have tried 64K, 32K and 128K and they all hung sooner or later.
For the 32K case, I did see vnodes being recycled for a while before it hung,
so it isn't just when the limit is hit.

Although it is much easier for me to reproduce on an NFS mount, I did see
a hang while doing a kernel build on UFS (no NFS mount on the machine at
that time).

So, I now know that the problem pre-dates r358252 and is not NFS specific.

I'm now bisecting back further to try and isolate the commit that causes this.
(Unfortunately, each test cycle can take days. I now know that I have to do
several of these kernel builds, which take hours each, to see if a hang is going
to happen.)

I'll post if/when I have more, rick

We scaled maxvnodes for ZFS and UFS; it might be that NFS is even more
demanding, having a larger node size.

>                inuse   size       free   size
>   4096         11091   45428736   0      0
>   8192         63      516096     0      0
>   16384        12      196608     0      0
>   32768        6       196608     0      0
>   40960        2       81920      1      40960
>   65536        16      1048576    0      0
>   94208        1       94208      0      0
>   110592       1       110592     0      0
>   131072       15      2441216    0      0
>   262144       15      3997696    0      0
>   524288       1       524288     0      0
>   1048576      1       1945600    0      0
> vmem 0x.. 'kernel arena'
>   quantum: 4096
>   size: 390070272
>   inuse: 386613248
>   free: 3457024
>   busy tags:  873
>   free tags:   3
>                inuse   size        free   size
>   4096         35      143360      1      4096
>   8192         2       16384       2      16384
>   12288        1       12288       0      0
>   16384        30      491520      0      0
>   20480        140     2867200     0      0
>   65536        1       65536       0      0
>   131072       631     82706432    0      0
>   1048576      0       0           1      1339392
>   2097152      27      56623104    1      2097152
>   8388608      1       13774848    0      0
>   16777216     3       74883072    0      0
>   33554432     1       36753408    0      0
>   67108864     1       118276096   0      0
>
> > alltrace
> - I can't face typing too much more, but I'll put a few
>   here that look interesting
>
> - for csh
> sched_switch()
> mi_switch()
> kern_yield()
> getblkx()
> breadn_flags()
> ffs_update()
> ufs_inactive()
> VOP_INACTIVE()
> vinactivef()
> vput_final()
> vm_object_deallocate()
> vm_map_process_deferred()
> kern_munmap()
> sys_munmap()
>
> - For cc
> sched_switch()
> mi_switch()
> sleepq_switch()
> sleepq_timedwait()
> _sleep()
> pause_sbt()
> vmem_bt_alloc()
> keg_alloc_slab()
> zone_import()
> cache_alloc()
> cache_alloc_retry()
> uma_zalloc_arg()
> bt_fill()
> vmem_xalloc()
> vmem_alloc()
> kmem_alloc()
> kmem_malloc_domainset()
> page_alloc()
> keg_alloc_slab()
> zone_import()
> cache_alloc()
> cache_alloc_retry()
> uma_zalloc_arg()
> nfscl_nget()
> nfs_create()
> vop_sigdefer()
> nfs_vnodeops_bypass()
> VOP_CREATE_APV()
> vn_open_cred()
> vn_open()
> kern_openat()
> sys_openat()
>
> Then there are a bunch of these... for cc, make
> sched_switch()
> mi_switch()
> sleepq_switch()
> sleepq_catch_signals()
> sleepq_wait_sig()
> kern_wait6()
> sys_wait4()
>
> - for vnlru
> sched_switch()
> mi_switch()
> sleepq_switch()
> sleepq_timedwait()
> _sleep()
> vnlru_proc()
> fork_exit()
> fork_trampoline()
>
> - for syncer
> sched_switch()
> mi_switch()
> critical_exit_preempt()
> intr_event_handle()
> intr_execute_handlers()
> lapic_handle_intr()
> Xapic_isr1()
> - interrupt
> memset()
> cache_alloc()
> cache_alloc_retry()
> uma_zalloc_arg()
> vmem_xalloc()
> vmem_bt_alloc()
> keg_alloc_slab()
> zone_import()
> cache_alloc()
> cache_alloc_retry()
> uma_zalloc_arg()
> bt_fill()
> vmem_xalloc()
> vmem_alloc()
> bufkva_alloc()
> getnewbuf()
> getblkx()
> breadn_flags()
> ffs_update()
> ffs_sync()
> sync_fsync()
> VOP_FSYNC_APV()
> sched_sync()
> fork_exit()
> fork_trampoline()
>
> - For bufdaemon (a bunch of these)
> sched_switch()
> mi_switch()
> sleepq_switch()
> sleepq_timedwait()
> _sleep()
> buf_daemon()
> fork_exit()
> fork_trampoline()
>
> vmdaemon and pagedaemon are basically just like above,
> sleeping in
> vm_daemon()
> or
> vm_pageout_worker()
> or
> vm_pageout_laundry_worker()
> or
> uma_reclaim_worker()
>
> That's all the typing I can take right now.
> I can probably make this happen again if you want more specific stuff.
>
> rick
>
_______________________________________________
freebsd-current_at_freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"

Received on Mon Jun 15 2020 - 22:06:44 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:24 UTC