Re: maxswzone NOT used correctly and defaults incorrect?

From: Konstantin Belousov <kostikbel_at_gmail.com> Date: Sun, 25 Nov 2018 12:16:26 +0200

On Sat, Nov 24, 2018 at 12:09:34PM -0800, John-Mark Gurney wrote:
> Konstantin Belousov wrote this message on Sat, Nov 24, 2018 at 12:40 +0200:
> > On Sat, Nov 24, 2018 at 01:04:29AM -0800, John-Mark Gurney wrote:
> > > I have a BeagleBone Black.  I'm running a recent snapshot:
> > > FreeBSD generic 13.0-CURRENT FreeBSD 13.0-CURRENT r340239 GENERIC  arm
> > > 
> > > aka:
> > > FreeBSD-13.0-CURRENT-arm-armv7-BEAGLEBONE-20181107-r340239.img.xz
> > > 
> > > It has 512MB of memory on board.  I created a 4GB swap file.  According
> > > to loader(8), this should be within the default capacity:
> > >                    in bytes of KVA space.  If no value is provided, the system
> > >                    allocates enough memory to handle an amount of swap that
> > >                    corresponds to eight times the amount of physical memory
> > >                    present in the system.
> > > 
> > > avail memory = 505909248 (482 MB)
> > > 
> > > but I get this:
> > > warning: total configured swap (1048576 pages) exceeds maximum recommended amount (248160 pages).
> > > warning: increase kern.maxswzone or reduce amount of swap.
> > > 
> > > So it appears that the limit is only 2x the amount of memory, NOT 8x as
> > > the documentation says.
> > > 
> > > When running make in sbin/ggate/ggated, make consumes a large amount
> > > of memory.  Just before the OOM killer kicked in, top showed:
> > > Mem: 224M Active, 4096 Inact, 141M Laundry, 121M Wired, 57M Buf, 2688K Free
> > > Swap: 1939M Total, 249M Used, 1689M Free, 12% Inuse, 1196K Out
> > > 
> > >   PID    UID      THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
> > >  1029   1001        1  44    0   594M  3848K RUN      2:03  38.12% make
> > > 
> > > swapinfo -k showed:
> > > /dev/md99         4194304   254392  3939912     6%
> > > 
> > > sysctl:
> > > vm.swzone: 4466880
> > > vm.swap_maxpages: 496320
> > > kern.maxswzone: 0
> > > 
> > > dmesg when OOM strikes:
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 1029 (make), uid 1001, was killed: out of swap space
> > > pid 984 (bash), uid 1001, was killed: out of swap space
> > > pid 956 (bash), uid 1001, was killed: out of swap space
> > > pid 952 (sshd), uid 0, was killed: out of swap space
> > > pid 1043 (bash), uid 1001, was killed: out of swap space
> > > pid 626 (dhclient), uid 65, was killed: out of swap space
> > > pid 955 (sshd), uid 1001, was killed: out of swap space
> > > pid 1025 (bash), uid 1001, was killed: out of swap space
> > > swblk zone ok
> > > lock order reversal:
> > >  1st 0xd374d028 filedesc structure (filedesc structure) @ /usr/src/sys/kern/sys_generic.c:1451
> > >  2nd 0xd41a5bc4 devfs (devfs) @ /usr/src/sys/kern/vfs_vnops.c:1513
> > > stack backtrace:
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 981 (tmux), uid 1001, was killed: out of swap space
> > > pid 983 (tmux), uid 1001, was killed: out of swap space
> > > pid 1031 (bash), uid 1001, was killed: out of swap space
> > > pid 580 (dhclient), uid 0, was killed: out of swap space
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 577 (dhclient), uid 0, was killed: out of swap space
> > > pid 627 (devd), uid 0, was killed: out of swap space
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 942 (getty), uid 0, was killed: out of swap space
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 1205 (init), uid 0, was killed: out of swap space
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 1206 (init), uid 0, was killed: out of swap space
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > swblk zone ok
> > > 
> > > So, as you can see, despite having plenty of swap, and swap usage being
> > > well below any of the maximums, the OOM killer kicked in, and killed off
> > > a bunch of processes.
> > OOM is guided by the pagedaemon's progress, not by the amount of swap left.
> > If the system cannot meet the pagedaemon target after
> > $(sysctl vm.pageout_oom_seq) back-to-back page daemon passes,
> > it declares an OOM condition. E.g. if you have a very active process which
> > keeps a lot of memory active by referencing the pages, and simultaneously
> > a slow or stuck swap device, then you get into this state.
> > 
> > Just by looking at the top stats, you have a single page in the inactive
> > queue, which means that pagedaemon desperately frees clean pages and
> > moves dirty pages into the laundry.  Also, you have a relatively large
> > laundry queue, which supports the theory about slow swap.
> 
> Yes, swap is "slow" by modern standards, but not really that slow... I'm
> swapping out at over 10MB/sec... For such a system, this is quite
> fast...
> 
> Though maybe I wasn't explicit, it's very clear that I'm running out
> of the swap blk zone, per the very first message, and the vmstat -z
> stats below (and the resulting failures):
> swap blk zone exhausted
> 
> > You may try to increase vm.pageout_oom_seq to move the OOM trigger further
> > out, after the system is overloaded with swapping.
> > 
> > > 
> > > It also looks like the algorithm for calculating kern.maxswzone is not
> > > correct.
> > > 
> > > I just tried to run the system w/:
> > > kern.maxswzone: 21474836
> > > 
> > > and it again died w/ plenty of swap free:
> > > /dev/md99         4194304   238148  3956156     6%
> > > 
> > > This time I had vmstat -z | grep sw running, and saw:
> > > swpctrie:                48,  62084,     145,     270,     203,   0,   0
> > > swblk:                   72,  62040,   56357,      18,   56587,   0,   0
> > > 
> > > after the system died, I logged back in and saw:
> > > swpctrie:                48,  62084,      28,     387,     240,   0,   0
> > > swblk:                   72,  62040,     175,   61865,   62957,  16,   0
> > > 
> > > so, it clearly ran out of swblk space VERY early, when only consuming
> > > around 232MB of swap...
> > > 
> > > Hmm... it looks like swblk and swpctrie are not affected by the setting
> > > of kern.maxswzone...  I just set it to:
> > > kern.maxswzone: 85899344
> > > 
> > > and the limits for the zones did not increase at ALL:
> > > swpctrie:                48,  62084,       0,       0,       0,   0,   0
> > > swblk:                   72,  62040,       0,       0,       0,   0,   0
> > The swap metadata zones must have all the KVA reserved in advance,
> > because we cannot wait for address space or memory while we try to free
> > some memory. At boot, the swap init code allocates KVA starting with the
> > requested amount. If the allocation fails, it reduces the amount to
> > about 2/3 of the previous attempt and retries, until the allocation
> > succeeds. What you see in the limits is the actual amount of KVA that
> > your platform is able to provide for the reserve, so increasing
> > maxswzone only results in more iterations to allocate.
> 
> Except that I don't see the warning "Swap blk zone entries reduced
> from" in the dmesg, which I'd expect to see if that code were triggered...
> 
> I find it hard to believe that it can't allocate more than 5MB of KVA
> at boot...  per above, 72*62040 ~= 4.26MB...
> 
> It does look like the calculation is correct for swblk assuming maxswzone
> is not set (0), as:
> vm.stats.vm.v_page_count: 124041
> 
> and:
> n = vm_cnt.v_page_count / 2;
> 
> I'll be adding a print for maxswzone to make sure it's getting set,
> though it'll take me a while to get a kernel built...
> 
> and kenv does show it set:
> [freebsd@generic ~]$ sysctl kern.maxswzone
> kern.maxswzone: 85899344
> [freebsd@generic ~]$ kenv | grep kern.maxswzone
> kern.maxswzone="85899344"
> 
> so how that code isn't being triggered is quite strange...

Try this

diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c
index 54370523086..b5e92bc97ee 100644
--- a/sys/vm/swap_pager.c
+++ b/sys/vm/swap_pager.c
@@ -547,12 +547,12 @@ swap_pager_swap_init(void)
 	mtx_unlock(&pbuf_mtx);
 
 	/*
-	 * Initialize our zone, guessing on the number we need based
-	 * on the number of pages in the system.
+	 * Initialize our zone, taking the user sizing or guessing on
+	 * the number we need based on the number of pages in the
+	 * system.
 	 */
-	n = vm_cnt.v_page_count / 2;
-	if (maxswzone && n > maxswzone / sizeof(struct swblk))
-		n = maxswzone / sizeof(struct swblk);
+	n = maxswzone != 0 ? maxswzone / sizeof(struct swblk) :
+	    vm_cnt.v_page_count / 2;
 	swpctrie_zone = uma_zcreate("swpctrie", pctrie_node_size(), NULL, NULL,
 	    pctrie_zone_init, NULL, UMA_ALIGN_PTR, UMA_ZONE_VM);
 	if (swpctrie_zone == NULL)
@@ -580,7 +580,7 @@ swap_pager_swap_init(void)
 	n = uma_zone_get_max(swblk_zone);
 
 	if (n < n2)
-		printf("Swap blk zone entries reduced from %lu to %lu.\n",
+		printf("Swap blk zone entries changed from %lu to %lu.\n",
 		    n2, n);
 	swap_maxpages = n * SWAP_META_PAGES;
 	swzone = n * sizeof(struct swblk);