Konstantin Belousov kostikbel at gmail.com wrote on
Fri Sep 13 05:53:41 UTC 2019 :

> Basically, page fault handler waits for vm.pfault_oom_wait *
> vm.pfault_oom_attempts for a page allocation before killing the process.
> Default is 30 secs, and if you cannot get a page for 30 secs, there is
> something very wrong with the machine.

The following was not for something like a Ryzen, but for an armv7
board using a USB device for the file system and swap/paging
partition. Still it may be a suggestive example of writing out a
large amount of laundry.

There was an exchange I had with Warner L. that suggested long waits
in the queue are easy to get when trying to write out the laundry (or
other such) in low-end contexts. I extract some of it below.

dT: 1.006s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
   56    312      0      0    0.0    312  19985  142.6      0      0    0.0   99.6| da0

Note: L(q) could be a lot bigger than 56, but I work with the example
figures that I used at the time and that Warner commented on. The
142.6 ms/w includes time waiting in the queue and was vastly more
stable than the L(q) figures.

Warner wrote, in part:

QUOTE
142.6ms/write is the average of the time that the operations that
completed during the polling interval took to complete. There's no
estimating here. So, at 6 or 7 per second for the operation to
complete, coupled with a parallel factor of 1 (typical for low end
junk flash), we wind up with 56 operations in the queue taking 8-10s
to complete.
END QUOTE
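
(Added Note: as a rough, after-the-fact sanity check on Warner's
8-10s figure, using only the gstat sample above. This arithmetic is
mine, not from the original exchange:)

    # 56 queued operations, each taking ~142.6 ms to complete, with an
    # effective parallelism of 1, need roughly 56 * 0.1426 s to drain:
    echo "scale=1; 56 * 142.6 / 1000" | bc
    # => 7.9   (seconds, i.e. on the order of the 8-10s Warner estimated)
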
Things went on from there but part of it was based on a reporting
patch that Mark Johnston had provided.

Me:
It appears to me that, compared to an observed capacity of roughly
20 MiBytes/sec for writes, large amounts of bytes are being queued up
to be written in a short time, for which it just takes a while for
the backlog to be finished.

Warner:
Yes. That matches my expectation as well. In other devices, I've
found that I needed to rate-limit things to more like 50-75% of the
max value to keep variance in performance low. It's the whole reason
I wrote the CAM I/O scheduler.

Me:
The following is from multiple such runs, several manually stopped
but some killed because of sustained low free memory. I had left
vm.pageout_oom_seq=12 in place for this, making the kills easier to
get than the 120 figure would. It does not take very long generally
for some sort of message to show up. (Added Note: 65s and 39s were at
the large end of what I reported at the time.)

. . .
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 164064, size: 12288
waited 65s for async swap write
waited 65s for swap buffer
waited 65s for async swap write
waited 65s for async swap write
waited 65s for async swap write
v_free_count: 955, v_inactive_count: 1
Aug 20 06:11:49 pine64 kernel: pid 1047 (stress), uid 0, was killed: out of swap space
waited 5s for async swap write
waited 5s for swap buffer
waited 5s for async swap write
waited 5s for async swap write
waited 5s for async swap write
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314021, size: 12288
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314084, size: 32768
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314856, size: 32768
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314638, size: 131072
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312518, size: 4096
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312416, size: 16384
waited 39s for async swap write
waited 39s for swap buffer
waited 39s for async swap write
waited 39s for async swap write
waited 39s for async swap write
swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314802, size: 24576
. . .

Warner:
These numbers are consistent with the theory that the swap device
becomes overwhelmed, spiking latency and causing crappy down-stream
effects. You can use the I/O scheduler to limit the write rates at
the low end. You might also be able to schedule a lower write queue
depth at the top end as well, but I've not seen good ways to do that.
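
(Added Note: a minimal sketch of adjusting the knobs mentioned in
this thread, assuming they behave as ordinary vm.* sysctls. The
specific values are only illustrative choices of mine, not
recommendations from Warner or Konstantin:)

    # Make the pageout-based OOM kills harder to trigger than the
    # vm.pageout_oom_seq=12 used in my runs (the "120 figure" above):
    sysctl vm.pageout_oom_seq=120
    # Stretch the page fault handler's total wait before it kills a
    # process; per kib's note at the top, the total is
    # vm.pfault_oom_wait * vm.pfault_oom_attempts (30 secs by default).
    # Illustrative values giving about 90 secs:
    sysctl vm.pfault_oom_wait=15
    sysctl vm.pfault_oom_attempts=6
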
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early
2018-Mar)

Received on Sat Sep 14 2019 - 21:56:35 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:21 UTC