Re: spurious out of swap kills

From: Warner Losh <imp_at_bsdimp.com>
Date: Sun, 15 Sep 2019 01:47:44 -0600
On Sun, Sep 15, 2019, 3:17 AM Don Lewis <truckman_at_freebsd.org> wrote:

> On 13 Sep, Konstantin Belousov wrote:
> > On Thu, Sep 12, 2019 at 05:42:00PM -0700, Don Lewis wrote:
> >> On 12 Sep, Mark Johnston wrote:
> >> > On Thu, Sep 12, 2019 at 04:00:17PM -0700, Don Lewis wrote:
> >> >> My poudriere machine is running 13.0-CURRENT and gets updated to the
> >> >> latest version of -CURRENT periodically.  At least in the last week or
> >> >> so, I've been seeing occasional port build failures when building my
> >> >> default set of ports, and I finally had some time to do some
> >> >> investigation.
> >> >>
> >> >> It's a 16-thread Ryzen machine, with 64 GB of RAM and 40 GB of swap.
> >> >> Poudriere is configured with
> >> >>   USE_TMPFS="wrkdir data localbase"
> >> >> and I have
> >> >>   .if ${.CURDIR:M*/www/chromium}
> >> >>   MAKE_JOBS_NUMBER=16
> >> >>   .else
> >> >>   MAKE_JOBS_NUMBER=7
> >> >>   .endif
> >> >> in /usr/local/etc/poudriere.d/make.conf, since this gives me the best
> >> >> overall build time for my set of ports.  This hits memory pretty hard,
> >> >> especially when chromium, firefox, libreoffice, and both versions of
> >> >> openoffice are all building at the same time.  During this time, the
> >> >> amount of space consumed by tmpfs for /wrkdir gets large when building
> >> >> these large ports.  There is not enough RAM to hold it all, so some of
> >> >> the older data spills over to swap.  Swap usage peaks at about 10 GB,
> >> >> leaving about 30 GB of free swap.  Nevertheless, I see these errors,
> >> >> with rustc being the usual victim:
> >> >>
> >> >> Sep 11 23:21:43 zipper kernel: pid 16581 (rustc), jid 43, uid 65534, was killed: out of swap space
> >> >> Sep 12 02:48:23 zipper kernel: pid 1209 (rustc), jid 62, uid 65534, was killed: out of swap space
> >> >>
> >> >> Top shows the size of rustc being about 2 GB, so I doubt that it
> >> >> suddenly needs an additional 30 GB of swap.
> >> >>
> >> >> I'm wondering if there might be a transient kmem shortage that is
> >> >> causing a malloc(..., M_NOWAIT) failure in the swap allocation path
> >> >> that is the cause of the problem.
> >> >
> >> > Perhaps this is a consequence of r351114?  To confirm this, you might
> >> > try increasing the value of vm.pfault_oom_wait to a larger value, like
> >> > 20 or 30, and see if the OOM kills still occur.
> >>
> >> I wonder if increasing vm.pfault_oom_attempts might also be a good idea.
> > If you are sure that you cannot exhaust your swap space, set
> > attempts to -1 to disable this mechanism.
>
> I had success just by increasing vm.pfault_oom_attempts from 3 to 10.
>
> > Basically, page fault handler waits for vm.pfault_oom_wait *
> > vm.pfault_oom_attempts for a page allocation before killing the process.
> > Default is 30 secs, and if you cannot get a page for 30 secs, there is
> > something very wrong with the machine.
>
> There is nothing really wrong with the machine.  The load is just high.
> Probably pretty bad for interactivity, but throughput is just fine, with
> CPU %idle pretty much pegged at zero the whole time.
>
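Kostik's formula above (vm.pfault_oom_wait * vm.pfault_oom_attempts) sets how long a faulting process waits for a page allocation before it is killed. A minimal sketch of the arithmetic, assuming the stock values implied by the thread (3 attempts at a 10-second wait yielding the 30-second default; check `sysctl vm.pfault_oom_wait vm.pfault_oom_attempts` on your own system):

```shell
# Effective page-fault OOM timeout = vm.pfault_oom_attempts * vm.pfault_oom_wait.
# Defaults assumed from the thread: 3 attempts * 10 s wait = 30 s before a kill.
attempts=3
wait_secs=10
echo "default timeout: $((attempts * wait_secs)) s"

# Don's tuning raised attempts from 3 to 10, so a process gets ~100 s to
# obtain a page during pageout spikes before the OOM kill triggers:
echo "tuned timeout:   $((10 * wait_secs)) s"
```

Setting vm.pfault_oom_attempts to -1, as Kostik suggests, disables the mechanism entirely; that is safe only if swap genuinely cannot be exhausted.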

And many low-quality SD cards have extreme write-performance cliffs once
whatever free NAND pages in the card are exhausted, especially when its
append store is highly fragmented and high write amplification kicks in.
Swap traffic tends to trigger this because its writes are random and may
be smaller than the internal page size these days... seconds or tens of
seconds per operation are not uncommon in these scenarios... this is why
high-quality NAND, spinning media, or SSDs make the problem go away.

Warner

> I kept an eye on the machine for a while during a run with the new
> tuning.  Most of the time, free memory bounced between 2 and 4 GB, with
> little page out activity.  There were about 60 running processes, most
> of which were writing to 16 tmpfs filesystems.  Sometimes free memory
> dropped into the 1 to 2 GB range and pageouts spiked.  This condition
> could persist for 30 seconds or more, which is probably the reason for
> the OOM kills with the default tuning.  I sometimes saw free memory drop
> below 1 GB.  The lowest I saw was 470 MB.  I'm guessing that this code
> fails page allocation when free memory is below some threshold to avoid
> potential deadlocks.
>
> Swap on this machine consists of a gmirror pair of partitions on a pair
> of 1 TB WD Green drives, that are now on their third computer.  The
> remainder of the space on the drives is used for the mirrored vdev for
> the system zpool. Not terribly fast, even in the days when these drives
> were new, but mostly fast enough to keep all the CPU cores busy other
> than during poudriere startup and wind down when there isn't enough work
> to go around.  I could spend money on faster storage, but it really
> wouldn't decrease poudriere run time much. It probably is close enough
> to the limit that I would need to improve storage speed if I swapped the
> Ryzen for a Threadripper.
>
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"
>
Received on Sun Sep 15 2019 - 05:47:59 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:21 UTC