Re: r358252 causes intermittent hangs where processes are stuck sleeping on btalloc

From: Rick Macklem <rmacklem_at_uoguelph.ca> Date: Mon, 29 Jun 2020 04:57:23 +0000 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:24 UTC

Just in case you were waiting for another email, I have now run several
cycles of the kernel build over NFS on a recent head kernel with the
one line change and it has not hung.

I don't know if this is the correct fix, but it would be nice to get something
into head to fix this.

If I don't hear anything in the next few days, I'll put it in a PR so it
doesn't get forgotten.

rick

________________________________________
From: owner-freebsd-current_at_freebsd.org <owner-freebsd-current_at_freebsd.org> on behalf of Rick Macklem <rmacklem_at_uoguelph.ca>
Sent: Thursday, June 18, 2020 11:42 PM
To: Ryan Libby
Cc: Konstantin Belousov; Jeff Roberson; freebsd-current_at_freebsd.org
Subject: Re: r358252 causes intermittent hangs where processes are stuck sleeping on btalloc

Ryan Libby wrote:
>On Mon, Jun 15, 2020 at 5:06 PM Rick Macklem <rmacklem_at_uoguelph.ca> wrote:
>>
>> Rick Macklem wrote:
>> >r358098 will hang fairly easily, in 1-3 cycles of the kernel build over =
NFS.
>> >I thought this was the culprit, since I did 6 cycles of r358097 without =
a hang.
>> >However, I just got a hang with r358097, but it looks rather different.
>> >The r358097 hang did not have any processes sleeping on btalloc. They
>> >appeared to be waiting on two different locks in the buffer cache.
>> >As such, I think it might be a different problem. (I'll admit I should h=
ave
>> >made notes about this one before rebooting, but I was flustrated that
>> >it happened and rebooted before looking at it mush detail.)
>> Ok, so I did 10 cycles of the kernel build over NFS for r358096 and never
>> got a hang.
>> --> It seems that r358097 is the culprit and r358098 makes it easier
>>       to reproduce.
>>       --> Basically runs out of kernel memory.
>>
>> It is not obvious if I can revert these two commits without reverting
>> other ones, since there were a bunch of vm changes after these.
>>
>> I'll take a look, but if you guys have any ideas on how to fix this, plea=
se
>> let me know.
>>
>> Thanks, rick
>
>Interesting.  Could you try re-adding UMA_ZONE_NOFREE to the vmem btag
>zone to see if that rescues it, on whatever base revision gets you a
>reliable repro?
Good catch! That seems to fix it. I've done 8 cycles of kernel build over
NFS without a hang (normally I'd get one in the first 1-3 cycles).

I don't know if the intend was to delete UMA_ZONE_VM and r358097
had a typo in it and deleted UMA_ZONE_NOFREE or ???

Anyhow, I just put it back to UMA_ZONE_VM | UMA_ZONE_NOFREE and
the hangs seem to have gone away.

The small patch I did is attached, in case that isn't what you meant.

I'll run a few more cycles just in case, but I think this fixes it.

Thanks, rick

>
> Jeff, to fill you in, I have been getting intermittent hangs on a Pentium=
 4
> (single core i386) with 1.25Gbytes ram when doing kernel builds using
> head kernels from this winter. (I also saw one when doing a kernel build
> on UFS, so they aren't NFS specific, although easier to reproduce that wa=
y.)
> After a typical hang, there will be a bunch of processes sleeping on "bta=
lloc"
> and several processes holding the following lock:
> exclusive sx lock _at_ vm/vm_map.c:4761
> - I have seen hangs where that is the only lock held by any process excep=
t
>    the interrupt thread.
> - I have also seen processes waiting on the following locks:
> kern/subr_vmem.c:1343
> kern/subr_vmem.c:633
>
> I can't be absolutely sure r358098 is the culprit, but it seems to make t=
he
> problem more reproducible.
>
> If anyone has a patch suggestion, I can test it.
> Otherwise, I will continue to test r358097 and earlier, to try and see wh=
at hangs
> occur. (I've done 8 cycles of testing of r356776 without difficulties, bu=
t that
> doesn't guarantee it isn't broken.)
>
> There is a bunch more of the stuff I got for Kostik and Ryan below.
> I can do "db" when it is hung, but it is a screen console, so I need to
> transcribe the output to email by hand. (ie. If you need something
> specific I can do that, but trying to do everything Kostik and Ryan asked
> for isn't easy.)
>
> rick
>
>
>
> Konstantin Belousov wrote:
> >On Fri, May 22, 2020 at 11:46:26PM +0000, Rick Macklem wrote:
> >> Konstantin Belousov wrote:
> >> >On Wed, May 20, 2020 at 11:58:50PM -0700, Ryan Libby wrote:
> >> >> On Wed, May 20, 2020 at 6:04 PM Rick Macklem <rmacklem_at_uoguelph.ca>=
 wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > Since I hadn't upgraded a kernel through the winter, it took me a=
 while
> >> >> > to bisect this, but r358252 seems to be the culprit.
> No longer true. I succeeded in reproducing the hang to-day running a
> r358251 kernel.
>
> I haven't had much luck sofar, but see below for what I have learned.
>
> >> >> >
> >> >> > If I do a kernel build over NFS using my not so big Pentium 4 (si=
ngle core,
> >> >> > 1.25Gbytes RAM, i386), about every second attempt will hang.
> >> >> > When I do a "ps" in the debugger, I see processes sleeping on bta=
lloc.
> >> >> > If I revert to r358251, I cannot reproduce this.
> As above, this is no longer true.
>
> >> >> >
> >> >> > Any ideas?
> >> >> >
> >> >> > I can easily test any change you might suggest to see if it fixes=
 the
> >> >> > problem.
> >> >> >
> >> >> > If you want more debug info, let me know, since I can easily
> >> >> > reproduce it.
> >> >> >
> >> >> > Thanks, rick
> >> >>
> >> >> Nothing obvious to me.  I can maybe try a repro on a VM...
> >> >>
> >> >> ddb ps, acttrace, alltrace, show all vmem, show page would be welco=
me.
> >> >>
> >> >> "btalloc" is "We're either out of address space or lost a fill race=
."
> From what I see, I think it is "out of address space".
> For one of the hangs, when I did "show alllocks", everything except the
> intr thread, was waiting for the
> exclusive sx lock _at_ vm/vm_map.c:4761
>
> >> >
> >> >Yes, I would be not surprised to be out of something on 1G i386 machi=
ne.
> >> >Please also add 'show alllocks'.
> >> Ok, I used an up to date head kernel and it took longer to reproduce a=
 hang.
> Go down to Kostik's comment about kern.maxvnodes for the rest of what I'v=
e
> learned. (The time it takes to reproduce one of these varies greatly, but=
 I usually
> get one within 3 cycles of a full kernel build over NFS. I have had it ha=
ppen
> once when doing a kernel build over UFS.)
>
> >> This time, none of the processes are stuck on "btalloc".
> > I'll try and give you most of the above, but since I have to type it in=
 by hand
> > from the screen, I might not get it all. (I'm no real typist;-)
> > > show alllocks
> > exclusive lockmgr ufs (ufs) r =3D 0 locked _at_ kern/vfs_subr.c: 3259
> > exclusive lockmgr nfs (nfs) r =3D 0 locked _at_ kern/vfs_lookup.c:737
> > exclusive sleep mutex kernel area domain (kernel arena domain) r =3D 0 =
locked _at_ kern/subr_vmem.c:1343
> > exclusive lockmgr bufwait (bufwait) r =3D 0 locked _at_ kern/vfs_bio.c:166=
3
> > exclusive lockmgr ufs (ufs) r =3D 0 locked _at_ kern/vfs_subr.c:2930
> > exclusive lockmgr syncer (syncer) r =3D 0 locked _at_ kern/vfs_subr.c:2474
> > Process 12 (intr) thread 0x.. (1000008)
> > exclusive sleep mutex Giant (Giant) r =3D 0 locked _at_ kern/kern_intr.c:1=
152
> >
> > > ps
> > - Not going to list them all, but here are the ones that seem interesti=
ng...
> > 18 0 0 0 DL vlruwt 0x11d939cc [vnlru]
> > 16 0 0 0 DL (threaded)               [bufdaemon]
> > 100069  D  qsleep                      [bufdaemon]
> > 100074  D  -                               [bufspacedaemon-0]
> > 100084  D  sdflush  0x11923284 [/ worker]
> > - and more of these for the other UFS file systems
> > 9 0 0 0   DL psleep  0x1e2f830  [vmdaemon]
> > 8 0 0 0   DL (threaded)               [pagedaemon]
> > 100067  D   psleep 0x1e2e95c   [dom0]
> > 100072  D   launds 0x1e2e968   [laundry: dom0]
> > 100073  D   umarcl 0x12cc720   [uma]
> > =E2=80=A6 a bunch of usb and cam ones
> > 100025  D   -           0x1b2ee40  [doneq0]
> > =E2=80=A6
> > 12 0 0 0 RL  (threaded)               [intr]
> > 100007  I                                     [swi6: task queue]
> > 100008  Run           CPU 0           [swi6: Giant taskq]
> > =E2=80=A6
> > 100000  D   swapin 0x1d96dfc    [swapper]
> > - and a bunch more in D state.
> > Does this mean the swapper was trying to swap in?
> >
> > > acttrace
> > - just shows the keyboard
> > kdb_enter() at kdb_enter+0x35/frame
> > vt_kbdevent() at vt_kdbevent+0x329/frame
> > kdbmux_intr() at kbdmux_intr+0x19/frame
> > taskqueue_run_locked() at taskqueue_run_locked+0x175/frame
> > taskqueue_run() at taskqueue_run+0x44/frame
> > taskqueue_swi_giant_run(0) at taskqueue_swi_giant_run+0xe/frame
> > ithread_loop() at ithread_loop+0x237/frame
> > fork_exit() at fork_exit+0x6c/frame
> > fork_trampoline() at 0x../frame
> >
> > > show all vmem
> > vmem 0x.. 'transient arena'
> >   quantum: 4096
> >   size:  23592960
> >   inuse: 0
> >   free: 23592960
> >   busy tags:   0
> >   free tags:    2
> >                      inuse        size       free        size
> >   16777216   0               0           1            23592960
> > vmem 0x.. 'buffer arena'
> >   quantum:  4096
> >   size:   94683136
> >   inuse: 94502912
> >   free: 180224
> >   busy tags:    1463
> >   free tags:      3
> >                        inuse      size         free        size
> >   16384           2             32768     1             16384
> >   32768           39           1277952 1              32768
> >   65536            1422       93192192 0           0
> >   131072          0             0            1              131072
> > vmem 0x.. 'i386trampoline'
> >   quantum:      1
> >   size:               24576
> >   inuse:             20860
> >   free:               3716
> >   busy tags:     9
> >   free tags:      3
> >                        inuse      size      free      size
> >   32                 1             48        1           52
> >   64                  2            208       0           0
> >   128                2            280       0            0
> >   2048              1            2048     1           3664
> >   4096              2            8192     0           0
> >   8192              1            10084   0           0
> > vmem 0x.. 'kernel rwx arena'
> >   quantum:    4096
> >   size: 0
> >   inuse: 0
> >   free:   0
> >   busy tags: 0
> >   free tags:  0
> > vmem 0x.. 'kernel area dom'
> >   quantum:  4096
> >   size: 56623104
> >   inuse: 56582144
> >>   free: 40960
> >>   busy tags: 11224
> >>   free tags: 3
> >I think this is the trouble.
> >
> >Did you tried to reduce kern.maxvnodes ?  What is the default value for
> >the knob on your machine ?
> The default is 84342.
> I have tried 64K, 32K and 128K and they all hung sooner or later.
> For the 32K case, I did see vnodes being recycled for a while before it g=
ot hung,
> so it isn't just when it hits the limit.
>
> Although it is much easier for me to reproduce on an NFS mount, I did see
> a hang while doing a kernel build on UFS (no NFS mount on the machine at
> that time).
>
> So, I now know that the problem pre-dates r358252 and is not NFS specific=
.
>
> I'm not bisecting back further to try and isolate the commit that causes =
this.
> (Unfortunately, each test cycle can take days. I now know that I have to =
do
> several of these kernel builds, which take hours each, to see if a hang i=
s going
> to happen.)
>
> I'll post if/when I have more, rick
>
> We scaled maxvnodes for ZFS and UFS, might be NFS is even more demanding,
> having larger node size.
>
> >                 inuse     size          free              size
> >   4096      11091    45428736 0                 0
> >   8192       63         516096     0                 0
> >   16384     12          196608    0                  0
> >   32768      6           196608    0                  0
> >   40960      2           81920      1                  40960
> >   65536      16         1048576   0                 0
> >   94208      1            94208      0                 0
> >   110592     1          110592     0                 0
> >   131072      15       2441216    0                0
> >   262144      15       3997696    0                 0
> >   524288      1         524288      0                 0
> >   1048576     1        1945600     0                0
> > vmem 0x.. 'kernel arena'
> >   quantum: 4096
> >   size: 390070272
> >   inuse: 386613248
> >   free: 3457024
> >   busy tags:  873
> >   free tags:   3
> >                    inuse       size         free        size
> > 4096           35           143360    1            4096
> > 8192           2              16384      2           16384
> > 12288         1              12288      0           0
> > 16384         30             491520   0           0
> > 20480         140           2867200  0          0
> > 65536          1              65536     0           0
> > 131072        631          82706432 0         0
> > 1048576      0              0               1         1339392
> > 2097152      27            56623104   1        2097152
> > 8388608      1              13774848    0       0
> > 16777216    3              74883072   0        0
> > 33554432    1              36753408   0        0
> > 67108864    1              118276096  0       0
> >
> > > alltrace
> > - I can't face typing too much more, but I'll put a few
> >   here that look interesting
> >
> > - for csh
> > sched_switch()
> > mi_switch()
> > kern_yield()
> > getblkx()
> > breadn_flags()
> > ffs_update()
> > ufs_inactive()
> > VOP_INACTIVE()
> > vinactivef()
> > vput_final()
> > vm_object_deallocate()
> > vm_map_process_deferred()
> > kern_munmap()
> > sys_munmap()
> >
> > - For cc
> > sched_switch()
> > mi_switch()
> > sleepq_switch()
> > sleepq_timedwait()
> > _sleep()
> > pause_sbt()
> > vmem_bt_alloc()
> > keg_alloc_slab()
> > zone_import()
> > cache_alloc()
> > cache_alloc_retry()
> > uma_zalloc_arg()
> > bt_fill()
> > vmem_xalloc()
> > vmem_alloc()
> > kmem_alloc()
> > kmem_malloc_domainset()
> > page_alloc()
> > keg_alloc_slab()
> > zone_import()
> > cache_alloc()
> > cache_alloc_retry()
> > uma_zalloc_arg()
> > nfscl_nget()
> > nfs_create()
> > vop_sigdefer()
> > nfs_vnodeops_bypass()
> > VOP_CREATE_APV()
> > vn_open_cred()
> > vn_open()
> > kern_openat()
> > sys_openat()
> >
> > Then there are a bunch of these... for cc, make
> > sched_switch()
> > mi_switch()
> > sleepq_switch()
> > sleepq_catch_signals()
> > sleepq_wait_sig()
> > kern_wait6()
> > sys_wait4()
> >
> > - for vnlru
> > sched_switch()
> > mi_switch()
> > sleepq_switch()
> > sleepq_timedwait()
> > _sleep()
> > vnlru_proc()
> > fork_exit()
> > fork_trampoline()
> >
> > - for syncer
> > sched_switch()
> > mi_switch()
> > critical_exit_preempt()
> > intr_event_handle()
> > intr_execute_handlers()
> > lapic_handle_intr()
> > Xapic_isr1()
> > - interrupt
> > memset()
> > cache_alloc()
> > cache_alloc_retry()
> > uma_zalloc_arg()
> > vmem_xalloc()
> > vmem_bt_alloc()
> > keg_alloc_slab()
> > zone_import()
> > cache_alloc()
> > cache_alloc_retry()
> > uma_zalloc_arg()
> > bt_fill()
> > vmem_xalloc()
> > vmem_alloc()
> > bufkva_alloc()
> > getnewbuf()
> > getblkx()
> > breadn_flags()
> > ffs_update()
> > ffs_sync()
> > sync_fsync()
> > VOP_FSYNC_APV()
> > sched_sync()
> > fork_exit()
> > fork_trampoline()
> >
> > - For bufdaemon (a bunch of these)
> > sched_switch()
> > mi_switch()
> > sleepq_switch()
> > sleepq_timedwait()
> > _sleep()
> > buf_daemon()
> > fork_exit()
> > fork_trampoline()
> >
> > vmdaemon and pagedaemon are basically just like above,
> > sleeping in
> > vm_daemon()
> > or
> > vm_pageout_worker()
> > or
> > vm_pageout_laundry_worker()
> > or
> > uma_reclaim_worker()
> >
> > That's all the typing I can take right now.
> > I can probably make this happen again if you want more specific stuff.
> >
> > rick
> >
> >
> >
> >
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org=
"
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org=
"
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org