Re: r358252 causes intermittent hangs where processes are stuck sleeping on btalloc

From: Rick Macklem <rmacklem_at_uoguelph.ca> Date: Fri, 3 Jul 2020 07:17:25 +0000 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:24 UTC

Ryan Libby wrote:
>On Sun, Jun 28, 2020 at 9:57 PM Rick Macklem <rmacklem_at_uoguelph.ca> wrote:
>>
>> Just in case you were waiting for another email, I have now run several
>> cycles of the kernel build over NFS on a recent head kernel with the
>> one line change and it has not hung.
>>
>> I don't know if this is the correct fix, but it would be nice to get something
>> into head to fix this.
>>
>> If I don't hear anything in the next few days, I'll put it in a PR so it
>> doesn't get forgotten.
>>
>> rick
>
>Thanks for the follow through on this.
>
>I think the patch is not complete.  It looks like the problem is that
>for systems that do not have UMA_MD_SMALL_ALLOC, we do
>        uma_zone_set_allocf(vmem_bt_zone, vmem_bt_alloc);
>but we haven't set an appropriate free function.  This is probably why
>UMA_ZONE_NOFREE was originally there.  When NOFREE was removed, it was
>appropriate for systems with uma_small_alloc.
>
>So by default we get page_free as our free function.  That calls
>kmem_free, which calls vmem_free ... but we do our allocs with
>vmem_xalloc.  I'm not positive, but I think the problem is that in
>effect we vmem_xalloc -> vmem_free, not vmem_xfree.
>
>Three possible fixes:
> 1: The one you tested, but this is not best for systems with
>    uma_small_alloc.
> 2: Pass UMA_ZONE_NOFREE conditional on UMA_MD_SMALL_ALLOC.
> 3: Actually provide an appropriate vmem_bt_free function.
>
>I think we should just do option 2 with a comment, it's simple and it's
>what we used to do.  I'm not sure how much benefit we would see from
>option 3, but it's more work.
I set hw.physmem to 1Gbyte on my amd64 system (did not have the patch)
and ran 6 cycles of the kernel build over NFS without a hang, so I don't
think any fix is needed for systems that support UMA_MD_SMALL_ALLOC.

The trivial patch for option 2 is attached.
I didn't do a comment, since you understand this and can probably
describe it more correctly.

Thanks, rick

Ryan

>
> ________________________________________
> From: owner-freebsd-current_at_freebsd.org <owner-freebsd-current_at_freebsd.org> on behalf of Rick Macklem <rmacklem_at_uoguelph.ca>
> Sent: Thursday, June 18, 2020 11:42 PM
> To: Ryan Libby
> Cc: Konstantin Belousov; Jeff Roberson; freebsd-current_at_freebsd.org
> Subject: Re: r358252 causes intermittent hangs where processes are stuck sleeping on btalloc
>
> Ryan Libby wrote:
> >On Mon, Jun 15, 2020 at 5:06 PM Rick Macklem <rmacklem_at_uoguelph.ca> wrote:
> >>
> >> Rick Macklem wrote:
> >> >r358098 will hang fairly easily, in 1-3 cycles of the kernel build over =
> NFS.
> >> >I thought this was the culprit, since I did 6 cycles of r358097 without =
> a hang.
> >> >However, I just got a hang with r358097, but it looks rather different.
> >> >The r358097 hang did not have any processes sleeping on btalloc. They
> >> >appeared to be waiting on two different locks in the buffer cache.
> >> >As such, I think it might be a different problem. (I'll admit I should h=
> ave
> >> >made notes about this one before rebooting, but I was flustrated that
> >> >it happened and rebooted before looking at it mush detail.)
> >> Ok, so I did 10 cycles of the kernel build over NFS for r358096 and never
> >> got a hang.
> >> --> It seems that r358097 is the culprit and r358098 makes it easier
> >>       to reproduce.
> >>       --> Basically runs out of kernel memory.
> >>
> >> It is not obvious if I can revert these two commits without reverting
> >> other ones, since there were a bunch of vm changes after these.
> >>
> >> I'll take a look, but if you guys have any ideas on how to fix this, plea=
> se
> >> let me know.
> >>
> >> Thanks, rick
> >
> >Interesting.  Could you try re-adding UMA_ZONE_NOFREE to the vmem btag
> >zone to see if that rescues it, on whatever base revision gets you a
> >reliable repro?
> Good catch! That seems to fix it. I've done 8 cycles of kernel build over
> NFS without a hang (normally I'd get one in the first 1-3 cycles).
>
> I don't know if the intend was to delete UMA_ZONE_VM and r358097
> had a typo in it and deleted UMA_ZONE_NOFREE or ???
>
> Anyhow, I just put it back to UMA_ZONE_VM | UMA_ZONE_NOFREE and
> the hangs seem to have gone away.
>
> The small patch I did is attached, in case that isn't what you meant.
>
> I'll run a few more cycles just in case, but I think this fixes it.
>
> Thanks, rick
>
> >
> > Jeff, to fill you in, I have been getting intermittent hangs on a Pentium=
>  4
> > (single core i386) with 1.25Gbytes ram when doing kernel builds using
> > head kernels from this winter. (I also saw one when doing a kernel build
> > on UFS, so they aren't NFS specific, although easier to reproduce that wa=
> y.)
> > After a typical hang, there will be a bunch of processes sleeping on "bta=
> lloc"
> > and several processes holding the following lock:
> > exclusive sx lock _at_ vm/vm_map.c:4761
> > - I have seen hangs where that is the only lock held by any process excep=
> t
> >    the interrupt thread.
> > - I have also seen processes waiting on the following locks:
> > kern/subr_vmem.c:1343
> > kern/subr_vmem.c:633
> >
> > I can't be absolutely sure r358098 is the culprit, but it seems to make t=
> he
> > problem more reproducible.
> >
> > If anyone has a patch suggestion, I can test it.
> > Otherwise, I will continue to test r358097 and earlier, to try and see wh=
> at hangs
> > occur. (I've done 8 cycles of testing of r356776 without difficulties, bu=
> t that
> > doesn't guarantee it isn't broken.)
> >
> > There is a bunch more of the stuff I got for Kostik and Ryan below.
> > I can do "db" when it is hung, but it is a screen console, so I need to
> > transcribe the output to email by hand. (ie. If you need something
> > specific I can do that, but trying to do everything Kostik and Ryan asked
> > for isn't easy.)
> >
> > rick
> >
> >
> >
> > Konstantin Belousov wrote:
> > >On Fri, May 22, 2020 at 11:46:26PM +0000, Rick Macklem wrote:
> > >> Konstantin Belousov wrote:
> > >> >On Wed, May 20, 2020 at 11:58:50PM -0700, Ryan Libby wrote:
> > >> >> On Wed, May 20, 2020 at 6:04 PM Rick Macklem <rmacklem_at_uoguelph.ca>=
>  wrote:
> > >> >> >
> > >> >> > Hi,
> > >> >> >
> > >> >> > Since I hadn't upgraded a kernel through the winter, it took me a=
>  while
> > >> >> > to bisect this, but r358252 seems to be the culprit.
> > No longer true. I succeeded in reproducing the hang to-day running a
> > r358251 kernel.
> >
> > I haven't had much luck sofar, but see below for what I have learned.
> >
> > >> >> >
> > >> >> > If I do a kernel build over NFS using my not so big Pentium 4 (si=
> ngle core,
> > >> >> > 1.25Gbytes RAM, i386), about every second attempt will hang.
> > >> >> > When I do a "ps" in the debugger, I see processes sleeping on bta=
> lloc.
> > >> >> > If I revert to r358251, I cannot reproduce this.
> > As above, this is no longer true.
> >
> > >> >> >
> > >> >> > Any ideas?
> > >> >> >
> > >> >> > I can easily test any change you might suggest to see if it fixes=
>  the
> > >> >> > problem.
> > >> >> >
> > >> >> > If you want more debug info, let me know, since I can easily
> > >> >> > reproduce it.
> > >> >> >
> > >> >> > Thanks, rick
> > >> >>
> > >> >> Nothing obvious to me.  I can maybe try a repro on a VM...
> > >> >>
> > >> >> ddb ps, acttrace, alltrace, show all vmem, show page would be welco=
> me.
> > >> >>
> > >> >> "btalloc" is "We're either out of address space or lost a fill race=
> ."
> > From what I see, I think it is "out of address space".
> > For one of the hangs, when I did "show alllocks", everything except the
> > intr thread, was waiting for the
> > exclusive sx lock _at_ vm/vm_map.c:4761
> >
> > >> >
> > >> >Yes, I would be not surprised to be out of something on 1G i386 machi=
> ne.
> > >> >Please also add 'show alllocks'.
> > >> Ok, I used an up to date head kernel and it took longer to reproduce a=
>  hang.
> > Go down to Kostik's comment about kern.maxvnodes for the rest of what I'v=
> e
> > learned. (The time it takes to reproduce one of these varies greatly, but=
>  I usually
> > get one within 3 cycles of a full kernel build over NFS. I have had it ha=
> ppen
> > once when doing a kernel build over UFS.)
> >
> > >> This time, none of the processes are stuck on "btalloc".
> > > I'll try and give you most of the above, but since I have to type it in=
>  by hand
> > > from the screen, I might not get it all. (I'm no real typist;-)
> > > > show alllocks
> > > exclusive lockmgr ufs (ufs) r =3D 0 locked _at_ kern/vfs_subr.c: 3259
> > > exclusive lockmgr nfs (nfs) r =3D 0 locked _at_ kern/vfs_lookup.c:737
> > > exclusive sleep mutex kernel area domain (kernel arena domain) r =3D 0 =
> locked _at_ kern/subr_vmem.c:1343
> > > exclusive lockmgr bufwait (bufwait) r =3D 0 locked _at_ kern/vfs_bio.c:166=
> 3
> > > exclusive lockmgr ufs (ufs) r =3D 0 locked _at_ kern/vfs_subr.c:2930
> > > exclusive lockmgr syncer (syncer) r =3D 0 locked _at_ kern/vfs_subr.c:2474
> > > Process 12 (intr) thread 0x.. (1000008)
> > > exclusive sleep mutex Giant (Giant) r =3D 0 locked _at_ kern/kern_intr.c:1=
> 152
> > >
> > > > ps
> > > - Not going to list them all, but here are the ones that seem interesti=
> ng...
> > > 18 0 0 0 DL vlruwt 0x11d939cc [vnlru]
> > > 16 0 0 0 DL (threaded)               [bufdaemon]
> > > 100069  D  qsleep                      [bufdaemon]
> > > 100074  D  -                               [bufspacedaemon-0]
> > > 100084  D  sdflush  0x11923284 [/ worker]
> > > - and more of these for the other UFS file systems
> > > 9 0 0 0   DL psleep  0x1e2f830  [vmdaemon]
> > > 8 0 0 0   DL (threaded)               [pagedaemon]
> > > 100067  D   psleep 0x1e2e95c   [dom0]
> > > 100072  D   launds 0x1e2e968   [laundry: dom0]
> > > 100073  D   umarcl 0x12cc720   [uma]
> > > =E2=80=A6 a bunch of usb and cam ones
> > > 100025  D   -           0x1b2ee40  [doneq0]
> > > =E2=80=A6
> > > 12 0 0 0 RL  (threaded)               [intr]
> > > 100007  I                                     [swi6: task queue]
> > > 100008  Run           CPU 0           [swi6: Giant taskq]
> > > =E2=80=A6
> > > 100000  D   swapin 0x1d96dfc    [swapper]
> > > - and a bunch more in D state.
> > > Does this mean the swapper was trying to swap in?
> > >
> > > > acttrace
> > > - just shows the keyboard
> > > kdb_enter() at kdb_enter+0x35/frame
> > > vt_kbdevent() at vt_kdbevent+0x329/frame
> > > kdbmux_intr() at kbdmux_intr+0x19/frame
> > > taskqueue_run_locked() at taskqueue_run_locked+0x175/frame
> > > taskqueue_run() at taskqueue_run+0x44/frame
> > > taskqueue_swi_giant_run(0) at taskqueue_swi_giant_run+0xe/frame
> > > ithread_loop() at ithread_loop+0x237/frame
> > > fork_exit() at fork_exit+0x6c/frame
> > > fork_trampoline() at 0x../frame
> > >
> > > > show all vmem
> > > vmem 0x.. 'transient arena'
> > >   quantum: 4096
> > >   size:  23592960
> > >   inuse: 0
> > >   free: 23592960
> > >   busy tags:   0
> > >   free tags:    2
> > >                      inuse        size       free        size
> > >   16777216   0               0           1            23592960
> > > vmem 0x.. 'buffer arena'
> > >   quantum:  4096
> > >   size:   94683136
> > >   inuse: 94502912
> > >   free: 180224
> > >   busy tags:    1463
> > >   free tags:      3
> > >                        inuse      size         free        size
> > >   16384           2             32768     1             16384
> > >   32768           39           1277952 1              32768
> > >   65536            1422       93192192 0           0
> > >   131072          0             0            1              131072
> > > vmem 0x.. 'i386trampoline'
> > >   quantum:      1
> > >   size:               24576
> > >   inuse:             20860
> > >   free:               3716
> > >   busy tags:     9
> > >   free tags:      3
> > >                        inuse      size      free      size
> > >   32                 1             48        1           52
> > >   64                  2            208       0           0
> > >   128                2            280       0            0
> > >   2048              1            2048     1           3664
> > >   4096              2            8192     0           0
> > >   8192              1            10084   0           0
> > > vmem 0x.. 'kernel rwx arena'
> > >   quantum:    4096
> > >   size: 0
> > >   inuse: 0
> > >   free:   0
> > >   busy tags: 0
> > >   free tags:  0
> > > vmem 0x.. 'kernel area dom'
> > >   quantum:  4096
> > >   size: 56623104
> > >   inuse: 56582144
> > >>   free: 40960
> > >>   busy tags: 11224
> > >>   free tags: 3
> > >I think this is the trouble.
> > >
> > >Did you tried to reduce kern.maxvnodes ?  What is the default value for
> > >the knob on your machine ?
> > The default is 84342.
> > I have tried 64K, 32K and 128K and they all hung sooner or later.
> > For the 32K case, I did see vnodes being recycled for a while before it g=
> ot hung,
> > so it isn't just when it hits the limit.
> >
> > Although it is much easier for me to reproduce on an NFS mount, I did see
> > a hang while doing a kernel build on UFS (no NFS mount on the machine at
> > that time).
> >
> > So, I now know that the problem pre-dates r358252 and is not NFS specific=
> .
> >
> > I'm not bisecting back further to try and isolate the commit that causes =
> this.
> > (Unfortunately, each test cycle can take days. I now know that I have to =
> do
> > several of these kernel builds, which take hours each, to see if a hang i=
> s going
> > to happen.)
> >
> > I'll post if/when I have more, rick
> >
> > We scaled maxvnodes for ZFS and UFS, might be NFS is even more demanding,
> > having larger node size.
> >
> > >                 inuse     size          free              size
> > >   4096      11091    45428736 0                 0
> > >   8192       63         516096     0                 0
> > >   16384     12          196608    0                  0
> > >   32768      6           196608    0                  0
> > >   40960      2           81920      1                  40960
> > >   65536      16         1048576   0                 0
> > >   94208      1            94208      0                 0
> > >   110592     1          110592     0                 0
> > >   131072      15       2441216    0                0
> > >   262144      15       3997696    0                 0
> > >   524288      1         524288      0                 0
> > >   1048576     1        1945600     0                0
> > > vmem 0x.. 'kernel arena'
> > >   quantum: 4096
> > >   size: 390070272
> > >   inuse: 386613248
> > >   free: 3457024
> > >   busy tags:  873
> > >   free tags:   3
> > >                    inuse       size         free        size
> > > 4096           35           143360    1            4096
> > > 8192           2              16384      2           16384
> > > 12288         1              12288      0           0
> > > 16384         30             491520   0           0
> > > 20480         140           2867200  0          0
> > > 65536          1              65536     0           0
> > > 131072        631          82706432 0         0
> > > 1048576      0              0               1         1339392
> > > 2097152      27            56623104   1        2097152
> > > 8388608      1              13774848    0       0
> > > 16777216    3              74883072   0        0
> > > 33554432    1              36753408   0        0
> > > 67108864    1              118276096  0       0
> > >
> > > > alltrace
> > > - I can't face typing too much more, but I'll put a few
> > >   here that look interesting
> > >
> > > - for csh
> > > sched_switch()
> > > mi_switch()
> > > kern_yield()
> > > getblkx()
> > > breadn_flags()
> > > ffs_update()
> > > ufs_inactive()
> > > VOP_INACTIVE()
> > > vinactivef()
> > > vput_final()
> > > vm_object_deallocate()
> > > vm_map_process_deferred()
> > > kern_munmap()
> > > sys_munmap()
> > >
> > > - For cc
> > > sched_switch()
> > > mi_switch()
> > > sleepq_switch()
> > > sleepq_timedwait()
> > > _sleep()
> > > pause_sbt()
> > > vmem_bt_alloc()
> > > keg_alloc_slab()
> > > zone_import()
> > > cache_alloc()
> > > cache_alloc_retry()
> > > uma_zalloc_arg()
> > > bt_fill()
> > > vmem_xalloc()
> > > vmem_alloc()
> > > kmem_alloc()
> > > kmem_malloc_domainset()
> > > page_alloc()
> > > keg_alloc_slab()
> > > zone_import()
> > > cache_alloc()
> > > cache_alloc_retry()
> > > uma_zalloc_arg()
> > > nfscl_nget()
> > > nfs_create()
> > > vop_sigdefer()
> > > nfs_vnodeops_bypass()
> > > VOP_CREATE_APV()
> > > vn_open_cred()
> > > vn_open()
> > > kern_openat()
> > > sys_openat()
> > >
> > > Then there are a bunch of these... for cc, make
> > > sched_switch()
> > > mi_switch()
> > > sleepq_switch()
> > > sleepq_catch_signals()
> > > sleepq_wait_sig()
> > > kern_wait6()
> > > sys_wait4()
> > >
> > > - for vnlru
> > > sched_switch()
> > > mi_switch()
> > > sleepq_switch()
> > > sleepq_timedwait()
> > > _sleep()
> > > vnlru_proc()
> > > fork_exit()
> > > fork_trampoline()
> > >
> > > - for syncer
> > > sched_switch()
> > > mi_switch()
> > > critical_exit_preempt()
> > > intr_event_handle()
> > > intr_execute_handlers()
> > > lapic_handle_intr()
> > > Xapic_isr1()
> > > - interrupt
> > > memset()
> > > cache_alloc()
> > > cache_alloc_retry()
> > > uma_zalloc_arg()
> > > vmem_xalloc()
> > > vmem_bt_alloc()
> > > keg_alloc_slab()
> > > zone_import()
> > > cache_alloc()
> > > cache_alloc_retry()
> > > uma_zalloc_arg()
> > > bt_fill()
> > > vmem_xalloc()
> > > vmem_alloc()
> > > bufkva_alloc()
> > > getnewbuf()
> > > getblkx()
> > > breadn_flags()
> > > ffs_update()
> > > ffs_sync()
> > > sync_fsync()
> > > VOP_FSYNC_APV()
> > > sched_sync()
> > > fork_exit()
> > > fork_trampoline()
> > >
> > > - For bufdaemon (a bunch of these)
> > > sched_switch()
> > > mi_switch()
> > > sleepq_switch()
> > > sleepq_timedwait()
> > > _sleep()
> > > buf_daemon()
> > > fork_exit()
> > > fork_trampoline()
> > >
> > > vmdaemon and pagedaemon are basically just like above,
> > > sleeping in
> > > vm_daemon()
> > > or
> > > vm_pageout_worker()
> > > or
> > > vm_pageout_laundry_worker()
> > > or
> > > uma_reclaim_worker()
> > >
> > > That's all the typing I can take right now.
> > > I can probably make this happen again if you want more specific stuff.
> > >
> > > rick
> > >
> > >
> > >
> > >
> > _______________________________________________
> > freebsd-current_at_freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org=
> "
> > _______________________________________________
> > freebsd-current_at_freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org=
> "
> > _______________________________________________
> > freebsd-current_at_freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> > To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org