Re: spurious out of swap kills

From: Don Lewis <truckman_at_FreeBSD.org>
Date: Sat, 14 Sep 2019 18:17:25 -0700 (PDT)
On 13 Sep, Konstantin Belousov wrote:
> On Thu, Sep 12, 2019 at 05:42:00PM -0700, Don Lewis wrote:
>> On 12 Sep, Mark Johnston wrote:
>> > On Thu, Sep 12, 2019 at 04:00:17PM -0700, Don Lewis wrote:
>> >> My poudriere machine is running 13.0-CURRENT and gets updated to the
>> >> latest version of -CURRENT periodically.  At least in the last week or
>> >> so, I've been seeing occasional port build failures when building my
>> >> default set of ports, and I finally had some time to do some
>> >> investigation.
>> >> 
>> >> It's a 16-thread Ryzen machine, with 64 GB of RAM and 40 GB of swap.
>> >> Poudriere is configured with
>> >>   USE_TMPFS="wrkdir data localbase"
>> >> and I have
>> >>   .if ${.CURDIR:M*/www/chromium}
>> >>   MAKE_JOBS_NUMBER=16
>> >>   .else
>> >>   MAKE_JOBS_NUMBER=7
>> >>   .endif
>> >> in /usr/local/etc/poudriere.d/make.conf, since this gives me the best
>> >> overall build time for my set of ports.  This hits memory pretty hard,
>> >> especially when chromium, firefox, libreoffice, and both versions of
>> >> openoffice are all building at the same time.  During this time, the
>> >> amount of space consumed by tmpfs for /wrkdir gets large when building
>> >> these large ports.  There is not enough RAM to hold it all, so some of
>> >> the older data spills over to swap.  Swap usage peaks at about 10 GB,
>> >> leaving about 30 GB of free swap.  Nevertheless, I see these errors,
>> >> with rustc being the usual victim:
>> >> 
>> >> Sep 11 23:21:43 zipper kernel: pid 16581 (rustc), jid 43, uid 65534, was killed: out of swap space
>> >> Sep 12 02:48:23 zipper kernel: pid 1209 (rustc), jid 62, uid 65534, was killed: out of swap space
>> >> 
>> >> Top shows the size of rustc as about 2 GB, so I doubt that it
>> >> suddenly needs an additional 30 GB of swap.
>> >> 
>> >> I'm wondering if the cause of the problem might be a transient kmem
>> >> shortage triggering a malloc(..., M_NOWAIT) failure in the swap
>> >> allocation path.
>> > 
>> > Perhaps this is a consequence of r351114?  To confirm this, you might
>> > try increasing the value of vm.pfault_oom_wait to a larger value, like
>> > 20 or 30, and see if the OOM kills still occur.
>> 
>> I wonder if increasing vm.pfault_oom_attempts might also be a good idea.
> If you are sure that you cannot exhaust your swap space, set
> attempts to -1 to disable this mechanism.

I had success just by increasing vm.pfault_oom_attempts from 3 to 10.
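
For anyone who wants to try the same thing, a minimal sketch of how this
tuning can be applied (the values are just the ones that worked here;
-1 disables the mechanism entirely, as suggested above):

  # take effect immediately
  sysctl vm.pfault_oom_attempts=10
  # or, if swap can never be exhausted, disable the OOM kill entirely
  sysctl vm.pfault_oom_attempts=-1
  # make the setting persistent across reboots
  echo 'vm.pfault_oom_attempts=10' >> /etc/sysctl.conf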

> Basically, the page fault handler waits vm.pfault_oom_wait *
> vm.pfault_oom_attempts seconds for a page allocation before killing the
> process.  The default is 30 secs, and if you cannot get a page for 30
> secs, there is something very wrong with the machine.
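
Spelling out the arithmetic, with vm.pfault_oom_wait at its stock value
of 10 seconds (which matches the 30-second default quoted above, given
the default of 3 attempts):

  default:  10 s * vm.pfault_oom_attempts (3)  =  30 s before the kill
  tuned:    10 s * vm.pfault_oom_attempts (10) = 100 s before the kill

so raising the attempt count gives the machine well over a minute to
satisfy a page allocation, which comfortably covers the 30+ second
low-memory episodes described below.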

There is nothing really wrong with the machine.  The load is just high.
Probably pretty bad for interactivity, but throughput is just fine, with
CPU %idle pretty much pegged at zero the whole time.

I kept an eye on the machine for a while during a run with the new
tuning.  Most of the time, free memory bounced between 2 and 4 GB, with
little page out activity.  There were about 60 running processes, most
of which were writing to 16 tmpfs filesystems.  Sometimes free memory
dropped into the 1 to 2 GB range and pageouts spiked.  This condition
could persist for 30 seconds or more, which is probably the reason for
the OOM kills with the default tuning.  I sometimes saw free memory drop
below 1 GB.  The lowest I saw was 470 MB.  I'm guessing that this code
fails page allocations when free memory drops below some threshold, in
order to avoid potential deadlocks.
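
For reference, the free-page count and the watermarks the pagedaemon
works against can be watched while a build is running (standard
sysctls; pages are 4 KB on amd64):

  # current free pages plus the min/target thresholds
  sysctl vm.stats.vm.v_free_count vm.v_free_min vm.v_free_target
  # page-out and fault activity, sampled every 5 seconds
  vmstat 5

Whether the page fault OOM logic keys off exactly these watermarks is
still a guess on my part, but they line up with the free-memory dips
described above.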

Swap on this machine consists of a gmirror pair of partitions on two
1 TB WD Green drives that are now on their third computer.  The
remainder of the space on the drives is used for the mirrored vdev of
the system zpool.  Not terribly fast, even in the days when these
drives were new, but mostly fast enough to keep all the CPU cores busy,
except during poudriere startup and wind-down when there isn't enough
work to go around.  I could spend money on faster storage, but it
really wouldn't decrease the poudriere run time much.  The storage is
probably close enough to its limit that I would need something faster
if I swapped the Ryzen for a Threadripper.
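
For completeness, a mirrored swap setup like this is typically wired up
along the following lines (device names are hypothetical, not the
actual ones on this machine):

  # load the mirror class at boot: add to /boot/loader.conf
  geom_mirror_load="YES"
  # label the two swap partitions as one gmirror device
  gmirror label -v swap /dev/ada0p2 /dev/ada1p2
  # /etc/fstab entry pointing swap at the mirror
  /dev/mirror/swap  none  swap  sw  0  0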