Re: After update to r357104 build of poudriere jail fails with 'out of swap space'

From: Mark Millard <marklmi_at_yahoo.com> Date: Sat, 25 Jan 2020 12:02:07 -0800

Yasuhiro KIMURA yasu at utahime.org wrote on
Sat Jan 25 14:45:13 UTC 2020 :

> I use VirtualBox to run 13-CURRENT. Host is 64bit Windows 10 1909 and
> spec of VM is as following.
> 
> * 4 CPU
> * 8GB memory
> * 100GB disk
>   - 92GB ZFS pool (zroot)
>   - 8GB swap
> 
> Today I updated this VM to r357104. And after that I tried to update
> poudriere jail with `poudriere jail -u -j jailname -b`. But it failed
> at install stage. After the failure I found following message is
> written to syslog.
> 
> Jan 25 19:18:25 rolling-vm-freebsd1 kernel: pid 7963 (strip), jid 0, uid 0, was killed: out of swap space

The detailed wording of this message is a misnomer.
Do you also have any messages of the form:

. . . sentinel kernel: swap_pager_getswapspace(32): failed

If yes: you really were out of swap space.
If no:  you were not out of swap space,
        or at least it is highly unlikely that you were.
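
A minimal sh sketch of that check, assuming the kernel
messages end up in a plain-text log such as
/var/log/messages. The function name classify_oom_log
and its output strings are made up for illustration;
only the two message patterns come from the text above.

```sh
# classify_oom_log FILE
# Report whether FILE shows a real swap allocation failure or only
# the (possibly misleading) "out of swap space" kill message.
classify_oom_log() {
    # In a basic regular expression the parentheses are literal.
    if grep -q 'swap_pager_getswapspace([0-9]*): failed' "$1"; then
        echo "swap allocation failed: really out of swap"
    elif grep -q 'was killed: out of swap space' "$1"; then
        echo "kill logged without swap failure: probably not out of swap"
    else
        echo "no OOM kill message found"
    fi
}
```

Example use: classify_oom_log /var/log/messages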

FreeBSD kills processes for multiple potential reasons.
For example:

a) Still low on free RAM after a number of tries to increase it above a threshold.
b) Slow paging I/O.
c) . . . (I do not know the full list) . . .

Unfortunately, FreeBSD does not report which category
of problem led to a given kill.

You might learn more by watching how things are going
via top or some other monitoring tool.
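
As one alternative to top, a rough sh sketch that
samples free RAM and swap use while the build runs.
It assumes a FreeBSD system with sysctl(8) and
swapinfo(8); the function names and the 5 second
default interval are illustrative choices, not
anything from the original report.

```sh
# pages_to_mib PAGES PAGESIZE
# Convert a page count to whole MiB.
pages_to_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

# watch_free_ram [INTERVAL]
# Print free RAM and the last line of swapinfo until interrupted.
# FreeBSD only: reads hw.pagesize and vm.stats.vm.v_free_count.
watch_free_ram() {
    pagesize=$(sysctl -n hw.pagesize)
    while :; do
        free=$(sysctl -n vm.stats.vm.v_free_count)
        echo "$(date '+%H:%M:%S') free RAM: $(pages_to_mib "$free" "$pagesize") MiB"
        swapinfo -k | tail -n 1
        sleep "${1:-5}"
    done
}
```

Running watch_free_ram in a second terminal during
the poudriere run shows whether free RAM or swap is
actually near exhaustion when the kill happens.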

Below are some notes about specific tunables that might
or might not be of help. (There may be more tunables
that can help that I do not know about.)

For (a) there is a way to test if it is the issue by
adding to the number of tries before it gives up and
starts killing things. That will either:

1) let it get more done before kills start
2) let it complete before the count is reached
3) make no significant difference

(3) would imply that (b) or (c) are involved instead.

(1) might be handled by having it do even more tries.

To extend how long persistently low free RAM is
tolerated, one can increase vm.pageout_oom_seq from
its default of 12. I have less experience with
managing slow paging, but I do have some notes
about it below.

Examples follow that I use in contexts with
sufficient RAM that I do not have to worry about
out of swap/page space. These I've set in
/etc/sysctl.conf . (Of course, I'm not trying to
deliberately run out of RAM.)

#
# Delay when persistent low free RAM leads to
# Out Of Memory killing of processes:
vm.pageout_oom_seq=120

I'll note that figures like 1024 or 1200 or
even more are possible. The figure controls how
many tries at regaining sufficient free RAM are
made before giving up. After that, Out Of Memory
kills start in order to free some RAM.

No figure is designed to make the delay
unbounded. There may be figures large enough
that the delay is effectively beyond any
reasonable time to wait.
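
For trying out different figures, a small sh sketch
that keeps exactly one assignment of a name in a
sysctl.conf-style file. The helper name
set_conf_line is hypothetical, and it is best tried
on a copy of the file first.

```sh
# set_conf_line FILE NAME VALUE
# Ensure FILE holds exactly one "NAME=VALUE" line, dropping any
# earlier assignment of NAME and leaving all other lines as they are.
set_conf_line() {
    tmp=$(mktemp) || return 1
    if [ -f "$1" ]; then
        # Keep every line that does not start with the literal "NAME=".
        awk -v n="$2" 'index($0, n "=") != 1' "$1" > "$tmp"
    fi
    echo "$2=$3" >> "$tmp"
    cat "$tmp" > "$1"
    rm -f "$tmp"
}
```

Example use (as root, on the real file):
set_conf_line /etc/sysctl.conf vm.pageout_oom_seq 120
The file is only read at boot, so running
sysctl vm.pageout_oom_seq=120 as root is what should
make the new figure take effect right away.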

As for paging I/O (this is specific to 13,
or was last I checked):

#
# For plenty of swap/paging space (will not
# run out), avoid pageout delays leading to
# Out Of Memory killing of processes:
vm.pfault_oom_attempts=-1

(Note: In my context "plenty" really means
sufficient RAM that paging is rare. But
others have reported using the -1 in
contexts where paging was heavy at times,
and the assignment eliminated the OOM kills
that had been happening.)

I've no experience with the below alternative
to that -1 use:

#
# For possibly insufficient swap/paging space
# (might run out), increase the pageout delay
# that leads to Out Of Memory killing of
# processes:
#vm.pfault_oom_attempts= ???
#vm.pfault_oom_wait= ???
# (The multiplication is the total but there
# are other potential tradeoffs in the factors
# multiplied, even for nearly the same total.)
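
The multiplication mentioned in the comment can be
sketched in sh as below. This assumes that
vm.pfault_oom_wait is in seconds and that -1 for
vm.pfault_oom_attempts means retrying indefinitely;
the function name is illustrative. The defaults in
13 are, as far as I know, 3 attempts and a wait of
10, but check the live values with sysctl rather
than relying on that.

```sh
# pfault_total_wait ATTEMPTS WAIT
# Print the approximate total seconds a page fault may wait before
# OOM handling starts, or "unbounded" when ATTEMPTS is -1.
pfault_total_wait() {
    if [ "$1" -eq -1 ]; then
        echo "unbounded"
    else
        echo $(( $1 * $2 ))
    fi
}
```

For example, pfault_total_wait 3 10 and
pfault_total_wait 6 5 both print 30: the same total,
but with a different number of retries within it.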

I'm not claiming that these 3 vm.???_oom_???
figures are always sufficient. Nor am I
claiming that tunables are always available
that would be sufficient. Nor that it is easy
to find the ones that do exist that might
help for specific OOM kill issues.

I have seen reports of OOM kills for other
reasons when both vm.pageout_oom_seq and
vm.pfault_oom_attempts=-1 were in use.
As I understand it, FreeBSD did not report
what kind of condition led to the
decision to do an OOM kill.

So the above notes may or may not help you.

> To make sure I shutdown both VM and host, restarted them and tried
> update of jail again. Then the problem was reproduced.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)