Re: maxswzone NOT used correctly and defaults incorrect?

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Sat, 24 Nov 2018 12:40:32 +0200
On Sat, Nov 24, 2018 at 01:04:29AM -0800, John-Mark Gurney wrote:
> I have a BeagleBone Black.  I'm running a recent snapshot:
> FreeBSD generic 13.0-CURRENT FreeBSD 13.0-CURRENT r340239 GENERIC  arm
> 
> aka:
> FreeBSD-13.0-CURRENT-arm-armv7-BEAGLEBONE-20181107-r340239.img.xz
> 
> It has 512MB of memory on board.  I created a 4GB swap file.  According
> to loader(8), this should be within the default capacity:
>                    in bytes of KVA space.  If no value is provided, the system
>                    allocates enough memory to handle an amount of swap that
>                    corresponds to eight times the amount of physical memory
>                    present in the system.
> 
> avail memory = 505909248 (482 MB)
> 
> but I get this:
> warning: total configured swap (1048576 pages) exceeds maximum recommended amount (248160 pages).
> warning: increase kern.maxswzone or reduce amount of swap.
> 
> So, this appears that it's only 2x amount of memory, NOT 8x like the
> documentation says.
> 
> When running make in sbin/ggate/ggated, make consumes a large amount
> of memory.  Before the OOM killer just kicked in, top showed:
> Mem: 224M Active, 4096 Inact, 141M Laundry, 121M Wired, 57M Buf, 2688K Free
> Swap: 1939M Total, 249M Used, 1689M Free, 12% Inuse, 1196K Out
> 
>   PID    UID      THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
>  1029   1001        1  44    0   594M  3848K RUN      2:03  38.12% make
> 
> swapinfo -k showed:
> /dev/md99         4194304   254392  3939912     6%
> 
> sysctl:
> vm.swzone: 4466880
> vm.swap_maxpages: 496320
> kern.maxswzone: 0
> 
> dmesg when OOM strikes:
> swap blk zone exhausted, increase kern.maxswzone
> pid 1029 (make), uid 1001, was killed: out of swap space
> pid 984 (bash), uid 1001, was killed: out of swap space
> pid 956 (bash), uid 1001, was killed: out of swap space
> pid 952 (sshd), uid 0, was killed: out of swap space
> pid 1043 (bash), uid 1001, was killed: out of swap space
> pid 626 (dhclient), uid 65, was killed: out of swap space
> pid 955 (sshd), uid 1001, was killed: out of swap space
> pid 1025 (bash), uid 1001, was killed: out of swap space
> swblk zone ok
> lock order reversal:
>  1st 0xd374d028 filedesc structure (filedesc structure) _at_ /usr/src/sys/kern/sys_generic.c:1451
>  2nd 0xd41a5bc4 devfs (devfs) _at_ /usr/src/sys/kern/vfs_vnops.c:1513
> stack backtrace:
> swap blk zone exhausted, increase kern.maxswzone
> pid 981 (tmux), uid 1001, was killed: out of swap space
> pid 983 (tmux), uid 1001, was killed: out of swap space
> pid 1031 (bash), uid 1001, was killed: out of swap space
> pid 580 (dhclient), uid 0, was killed: out of swap space
> swblk zone ok
> swap blk zone exhausted, increase kern.maxswzone
> pid 577 (dhclient), uid 0, was killed: out of swap space
> pid 627 (devd), uid 0, was killed: out of swap space
> swblk zone ok
> swap blk zone exhausted, increase kern.maxswzone
> pid 942 (getty), uid 0, was killed: out of swap space
> swblk zone ok
> swap blk zone exhausted, increase kern.maxswzone
> pid 1205 (init), uid 0, was killed: out of swap space
> swblk zone ok
> swap blk zone exhausted, increase kern.maxswzone
> pid 1206 (init), uid 0, was killed: out of swap space
> swblk zone ok
> swap blk zone exhausted, increase kern.maxswzone
> swblk zone ok
> swap blk zone exhausted, increase kern.maxswzone
> swblk zone ok
> 
> So, as you can see, despite having plenty of swap, and swap usage being
> well below any of the maximums, the OOM killer kicked in, and killed off
> a bunch of processes.
OOM is guided by the pagedaemon's progress, not by the amount of swap
left. If the system cannot meet the pagedaemon's target within
$(sysctl vm.pageout_oom_seq) back-to-back pagedaemon passes, it
declares an OOM condition. E.g. if you have a very active process which
keeps a lot of memory active by referencing its pages, and simultaneously
a slow or stuck swap device, then you get into this state.

Just by looking at the top stats, you have a single page in the inactive
queue, which means that the pagedaemon is desperately freeing clean pages
and moving dirty pages into the laundry.  Also, you have a relatively
large laundry queue, which supports the theory of a slow swap device.

You may try increasing vm.pageout_oom_seq to push the OOM trigger
further out when the system is overloaded with swapping.
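For example, the tuning could look like this (the value 4096 below is
only an illustration of raising the threshold, not a recommendation):

```shell
# Inspect the current number of back-to-back pagedaemon passes
# allowed to miss the target before an OOM condition is declared:
sysctl vm.pageout_oom_seq

# Raise it on the running system:
sysctl vm.pageout_oom_seq=4096

# Persist the setting across reboots:
echo 'vm.pageout_oom_seq=4096' >> /etc/sysctl.conf
```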

> 
> It also looks like the algorithm for calculating kern.maxswzone is not
> correct.
> 
> I just tried to run the system w/:
> kern.maxswzone: 21474836
> 
> and it again died w/ plenty of swap free:
> /dev/md99         4194304   238148  3956156     6%
> 
> This time I had vmstat -z | grep sw running, and saw:
> swpctrie:                48,  62084,     145,     270,     203,   0,   0
> swblk:                   72,  62040,   56357,      18,   56587,   0,   0
> 
> after the system died, I logged back in as see:
> swpctrie:                48,  62084,      28,     387,     240,   0,   0
> swblk:                   72,  62040,     175,   61865,   62957,  16,   0
> 
> so, it clearly ran out of swblk space VERY early, when only consuming
> around 232MB of swap...
> 
> Hmm... it looks like swblk and swpctrie are not affected by the setting
> of kern.maxswzone...  I just set it to:
> kern.maxswzone: 85899344
> 
> and the limits for the zones did not increase at ALL:
> swpctrie:                48,  62084,       0,       0,       0,   0,   0
> swblk:                   72,  62040,       0,       0,       0,   0,   0
The swap metadata zones must have all their KVA reserved in advance,
because we cannot wait for address space or memory while we are trying
to free some memory. At boot, the swap init code allocates KVA starting
with the requested amount. If the allocation fails, it reduces the
request to two-thirds of the previous amount and retries until the
allocation succeeds. What you see in the zone limits is the actual
amount of KVA that your platform is able to provide for the reserve, so
increasing maxswzone only results in more iterations before the
allocation succeeds.

> 
> Thoughts?
> 
> -- 
>   John-Mark Gurney				Voice: +1 415 225 5579
> 
>      "All that I will do, has been done, All that I have, has not."
> _______________________________________________
> freebsd-arm_at_freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-arm
> To unsubscribe, send any mail to "freebsd-arm-unsubscribe_at_freebsd.org"
Received on Sat Nov 24 2018 - 09:40:46 UTC
