On 2020-Jan-27, at 19:53, bob prohaska <fbsd at www.zefox.net> wrote:

> On Mon, Jan 27, 2020 at 06:22:20PM -0800, Mark Millard wrote:
>>
>> So far as I know, in the past progress was only made when someone
>> already knowledgeable got involved in isolating what was happening
>> and how to control it.
>>
> Indeed. One can only hope said knowledgeables are reading....

Maybe I can suggest something that might kick-start the evidence
gathering a little: add 4 unconditional printf's to the kernel code,
one just before each of the vm_pageout_oom(. . .) calls, with each
message uniquely identifying which of the 4 call sites it is.

The details of what I found that suggested this follow.

I found:

#define VM_OOM_MEM      1
#define VM_OOM_MEM_PF   2
#define VM_OOM_SWAPZ    3

In vm_fault(. . .):

. . .
        if (vm_pfault_oom_attempts < 0 ||
            oom < vm_pfault_oom_attempts) {
                oom++;
                vm_waitpfault(dset, vm_pfault_oom_wait * hz);
                goto RetryFault_oom;
        }
        if (bootverbose)
                printf(
        "proc %d (%s) failed to alloc page on fault, starting OOM\n",
                    curproc->p_pid, curproc->p_comm);
        vm_pageout_oom(VM_OOM_MEM_PF);
. . .

(I'd not have guessed that bootverbose would control messages about
OOM activity.)

The call above looks to be blocked by the "-1" setting
(vm.pfault_oom_attempts=-1) that we have been using.

In vm_pageout_mightbe_oom(. . .):

. . .
        if (starting_page_shortage <= 0 || starting_page_shortage !=
            page_shortage)
                vmd->vmd_oom_seq = 0;
        else
                vmd->vmd_oom_seq++;
        if (vmd->vmd_oom_seq < vm_pageout_oom_seq) {
                if (vmd->vmd_oom) {
                        vmd->vmd_oom = FALSE;
                        atomic_subtract_int(&vm_pageout_oom_vote, 1);
                }
                return;
        }

        /*
         * Do not follow the call sequence until OOM condition is
         * cleared.
         */
        vmd->vmd_oom_seq = 0;

        if (vmd->vmd_oom)
                return;

        vmd->vmd_oom = TRUE;
        old_vote = atomic_fetchadd_int(&vm_pageout_oom_vote, 1);
        if (old_vote != vm_ndomains - 1)
                return;

        /*
         * The current pagedaemon thread is the last in the quorum to
         * start OOM.  Initiate the selection and signaling of the
         * victim.
         */
        vm_pageout_oom(VM_OOM_MEM);

        /*
         * After one round of OOM terror, recall our vote.  On the
         * next pass, current pagedaemon would vote again if the low
         * memory condition is still there, due to vmd_oom being
         * false.
         */
        vmd->vmd_oom = FALSE;
        atomic_subtract_int(&vm_pageout_oom_vote, 1);
. . .

The above is where the other setting we have been using
(vm.pageout_oom_seq) extends the number of tries before doing the OOM
kill. If the rate of attempts increased, less time would go by for the
same figure. So this case might still be happening, even with the
> 4000 figure used on the 5 GiByte amd64 system with the i386 jail
that was reported. There is no OOM-specific printf in this path as
things stand.

In swp_pager_meta_build(. . .):

. . .
                if (uma_zone_exhausted(swblk_zone)) {
                        if (atomic_cmpset_int(&swblk_zone_exhausted,
                            0, 1))
                                printf("swap blk zone exhausted, "
                                    "increase kern.maxswzone\n");
                        vm_pageout_oom(VM_OOM_SWAPZ);
                        pause("swzonxb", 10);
                } else
                        uma_zwait(swblk_zone);
. . .
                if (uma_zone_exhausted(swpctrie_zone)) {
                        if (atomic_cmpset_int(&swpctrie_zone_exhausted,
                            0, 1))
                                printf("swap pctrie zone exhausted, "
                                    "increase kern.maxswzone\n");
                        vm_pageout_oom(VM_OOM_SWAPZ);
                        pause("swzonxp", 10);
                } else
                        uma_zwait(swpctrie_zone);
. . .

The above is something we have not been controlling: uma zone
exhaustion for swblk_zone and swpctrie_zone. (Not that I'm familiar
with them or the rest of this material.) On a small-memory machine,
there may be nothing that can be done about that directly without
other, nasty tradeoffs. Of course, there might be reasons that one or
both of these zones exhaust faster than they used to. There are the
2 printf messages shown above, but they are conditional.
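To make the suggested instrumentation concrete, here is a sketch of
what an unconditional marker might look like at the VM_OOM_MEM_PF call
site in vm_fault(. . .). The message text is my own choosing, not
something already in the tree; the other 3 vm_pageout_oom call sites
would get analogous printf's naming their own contexts:

        /*
         * Unconditional marker (not gated by bootverbose) so the
         * console/log records which context is about to call
         * vm_pageout_oom().
         */
        printf("vm_fault: starting OOM (VM_OOM_MEM_PF) for proc %d (%s)\n",
            curproc->p_pid, curproc->p_comm);
        vm_pageout_oom(VM_OOM_MEM_PF);

The same pattern, with its own identifying text, would go just before
vm_pageout_oom(VM_OOM_MEM) in vm_pageout_mightbe_oom(. . .) and before
each of the two vm_pageout_oom(VM_OOM_SWAPZ) calls in
swp_pager_meta_build(. . .).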
Still, those two conditional messages give something else to look for
in console or log output.

One possibility is always having an unconditional printf just before
each of the 4 vm_pageout_oom calls, each identifying which of the 4
contexts is making the call. That would at least be a start at
figuring things out. (swp_pager_meta_build's code shows that the
argument to vm_pageout_oom is not, by itself, specific enough for such
identification: both of its call sites pass VM_OOM_SWAPZ.)

The vm_pageout_oom(. . .) routine has:

. . .
        if (bigproc != NULL) {
                if (vm_panic_on_oom != 0)
                        panic("out of swap space");
                PROC_LOCK(bigproc);
                killproc(bigproc, "out of swap space");
                sched_nice(bigproc, PRIO_MIN);
                _PRELE(bigproc);
                PROC_UNLOCK(bigproc);
        }
. . .

That is where the can-be-a-misnomer "out of swap space" message comes
from. It looks to be accurate for some conditions, but not for the
conditions we have historically hit in our contexts. It takes looking
at other messages to tell whether it is a misnomer: a different
message carries the actual out-of-swap-space information, and if that
message is not present then the report based on the code above is a
misnomer.

vm_pageout_oom could use its argument to be somewhat more specific in
the text it passes to killproc(. . .); a sketch of that idea is
appended after the reference material and signature below.

For reference:

# grep -r "VM_OOM_" /usr/src/sys/ | more
/usr/src/sys/vm/vm_fault.c:     vm_pageout_oom(VM_OOM_MEM_PF);
/usr/src/sys/vm/vm_pageout.c:           vm_pageout_oom(VM_OOM_MEM);
/usr/src/sys/vm/vm_pageout.c:   if (shortage == VM_OOM_MEM_PF &&
/usr/src/sys/vm/vm_pageout.c:   if (shortage == VM_OOM_MEM || shortage == VM_OOM_MEM_PF)
/usr/src/sys/vm/swap_pager.c:                   vm_pageout_oom(VM_OOM_SWAPZ);
/usr/src/sys/vm/swap_pager.c:                   vm_pageout_oom(VM_OOM_SWAPZ);
/usr/src/sys/vm/vm_pageout.h:#define    VM_OOM_MEM      1
/usr/src/sys/vm/vm_pageout.h:#define    VM_OOM_MEM_PF   2
/usr/src/sys/vm/vm_pageout.h:#define    VM_OOM_SWAPZ    3

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)
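As for making the killproc(. . .) text reflect the triggering context,
here is one possible shape for that part of vm_pageout_oom(. . .). This
is a sketch only, not existing kernel code, and the reason strings are
my own wording; it just switches on the routine's existing shortage
argument instead of always passing "out of swap space":

        if (bigproc != NULL) {
                if (vm_panic_on_oom != 0)
                        panic("out of swap space");
                PROC_LOCK(bigproc);
                /* Name the OOM context in the kill message. */
                switch (shortage) {
                case VM_OOM_MEM:
                        killproc(bigproc,
                            "sustained free memory shortage");
                        break;
                case VM_OOM_MEM_PF:
                        killproc(bigproc,
                            "page allocation failures during page faults");
                        break;
                case VM_OOM_SWAPZ:
                        killproc(bigproc,
                            "swap metadata zone exhausted");
                        break;
                default:
                        killproc(bigproc, "out of swap space");
                        break;
                }
                sched_nice(bigproc, PRIO_MIN);
                _PRELE(bigproc);
                PROC_UNLOCK(bigproc);
        }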