Re: After update to r357104 build of poudriere jail fails with 'out of swap space'

From: Mark Millard <marklmi_at_yahoo.com> Date: Mon, 27 Jan 2020 14:25:59 -0800

On 2020-Jan-27, at 12:48, Cy Schubert <Cy.Schubert at cschubert.com> wrote:

> In message <BA0CE7D8-CFA1-40A3-BEFA-21D0C230B082_at_yahoo.com>, Mark Millard writes:
>> 
>> 
>> 
>> On 2020-Jan-27, at 10:20, Cy Schubert <Cy.Schubert at cschubert.com> wrote:
>> 
>>> On January 27, 2020 5:09:06 AM PST, Cy Schubert <Cy.Schubert_at_cschubert.com> wrote:
>>>>> . . . 
>>>> 
>>>> Setting a lower arc_max at boot is unlikely to help. Rust was building on
>>>> the 8 GB and 5 GB 4 core machines last night. It completed successfully on
>>>> the 8 GB machine, while using 12 MB of swap. ARC was at 1307 MB.
>>>> 
>>>> On the 5 GB 4 core machine the rust build died of OOM. 328 KB swap was 
>>>> used. ARC was reported at 941 MB. arc_min on this machine is 489.2 MB.
>>> 
>>> MAKE_JOBS_NUMBER=3 worked building rust on the 5 GB 4 core machine. ARC is
>>> at 534 MB with 12 MB swap used.
>> 
>> If you increase vm.pageout_oom_seq to, say, 10 times what you now use,
>> does MAKE_JOBS_NUMBER=4 complete --or at least go notably longer before
>> getting OOM behavior from the system? (The default is 12 last I checked.
>> So that might be what you are now using.)
> 
> It's already 4096 (default is 12).

Wow. Then the count of tries to get free RAM above the threshold
does not seem likely to be the source of the OOM kills.

>> 
>> Have you tried also having: vm.pfault_oom_attempts="-1" (Presuming
>> you are not worried about actually running out of swap/page space,
>> or can tolerate a deadlock if it does run out.) This setting presumes
>> head, not release or stable. (Last I checked anyway.)
> 
> Already there.

Then page-out delay does not seem likely to be the source of the OOM kills.
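
For reference, a sketch of how the two settings discussed above might look
in /etc/sysctl.conf, using the values you reported. The file placement is
my assumption; the same values could instead be set at run time with
sysctl(8).

```
# /etc/sysctl.conf fragment (a sketch; values as reported in this thread)

# Default is 12, last I checked.
vm.pageout_oom_seq=4096

# Presumes head, not release or stable. Risks a deadlock if swap/page
# space actually runs out.
vm.pfault_oom_attempts=-1
```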

> The box is a sandbox with remote serial console access so deadlocks are ok.
> 
>> 
>> It would be interesting to know what difference those two settings
>> together might make for your context: it seems to be a good context
>> for testing in this area. (But you might already have set them.
>> If so, it would be good to report the figures in use.)
>> 
>> Of course, my experiment ideas need not be your actions.
> 
> It's a sandbox machine. We already know 8 GB works with 4 threads on as 
> many cores. And, 5 GB works with 3 threads on 4 cores.
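
Purely as rough arithmetic on the figures above, the RAM per build job for
the three cases can be sketched as below. The helper name ram_per_job_mb
is hypothetical, and the calculation ignores ARC, the kernel, and every
other consumer of RAM, so it is only a ballpark.

```shell
#!/bin/sh
# Rough RAM per build job in MB, using integer division.
# Assumption: all RAM is split evenly across jobs, which ignores ARC,
# the kernel, and everything else running on the machine.
ram_per_job_mb() {
    ram_mb=$1
    jobs=$2
    echo $(( ram_mb / jobs ))
}

ram_per_job_mb 8192 4   # 8 GB, MAKE_JOBS_NUMBER=4: build completed
ram_per_job_mb 5120 3   # 5 GB, MAKE_JOBS_NUMBER=3: build completed
ram_per_job_mb 5120 4   # 5 GB, MAKE_JOBS_NUMBER=4: OOM kill
```

This says nothing about why the kernel chose to kill; it only puts the
three reported cases side by side.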

It would be nice to find out what category of issue in the kernel
is driving the OOM kills for your 5GB context with MAKE_JOBS_NUMBER=4.
Too bad the first kill does not report a backtrace spanning the
code choosing to do the kill (or otherwise report the type of issue
leading to the kill).

Your report is consistent with the small arm board folks reporting that
contexts that were doing buildworld and the like fine under somewhat
older kernels have recently started getting OOM kills, despite the two
settings.

At the moment I'm not sure how to find the category(s) of issue(s) that
is(are) driving these OOM kills.

Thanks for reporting what settings you were using.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)