On Wed, Apr 4, 2018, 13:20 Don Lewis <truckman@freebsd.org> wrote:

> On 4 Apr, Mark Johnston wrote:
> > On Tue, Apr 03, 2018 at 09:42:48PM -0700, Don Lewis wrote:
> >> On 3 Apr, Don Lewis wrote:
> >> > I reconfigured my Ryzen box to be more similar to my default
> >> > package builder by disabling SMT and half of the RAM, to limit it
> >> > to 8 cores and 32 GB, and then started bisecting to try to track
> >> > down the problem.  For each test, I first filled ARC by tarring
> >> > /usr/ports/distfiles to /dev/null.  The commit range that I was
> >> > searching was r329844 to r331716.  I narrowed the range to
> >> > r329844 to r329904.  With r329904 and newer, ARC is totally
> >> > unresponsive to memory pressure and the machine pages heavily.
> >> > I see ARC sizes of 28-29 GB and 30 GB of wired RAM, so there is
> >> > not much left over for getting useful work done.  Active memory
> >> > and free memory both hover under 1 GB each.  Looking at the
> >> > commit logs over this range, the most likely culprit is:
> >> >
> >> > r329882 | jeff | 2018-02-23 14:51:51 -0800 (Fri, 23 Feb 2018) | 13 lines
> >> >
> >> > Add a generic Proportional Integral Derivative (PID) controller
> >> > algorithm and use it to regulate page daemon output.
> >> >
> >> > This provides much smoother and more responsive page daemon
> >> > output, anticipating demand and avoiding pageout stalls by
> >> > increasing the number of pages to match the workload.  This is a
> >> > reimplementation of work done by myself and mlaier at Isilon.
> >> >
> >> > It is quite possible that the recent fixes to the PID controller
> >> > will fix the problem.  Not that r329844 was trouble-free ... I
> >> > left tar running over lunchtime to fill ARC, and the OOM killer
> >> > nuked top, tar, ntpd, both of my ssh sessions into the machine,
> >> > and multiple instances of getty while I was away.  I was able to
> >> > log in again and successfully run poudriere, and ARC did respond
> >> > to the memory pressure and cranked itself down to about 5 GB by
> >> > the end of the run.  I did not see the same problem with tar when
> >> > I did the same with r329904.
> >>
> >> I just tried r331966 and see no improvement.  No OOM process kills
> >> during the tar run to fill ARC, but with ARC filled, the machine is
> >> thrashing itself at the start of the poudriere run while trying to
> >> build ports-mgmt/pkg (39 minutes so far).  ARC appears to be
> >> unresponsive to memory demand.  I've seen no decrease in ARC size
> >> or wired memory since starting poudriere.
> >
> > Re-reading the ARC reclaim code, I see a couple of issues which
> > might be at the root of the behaviour you're seeing.
> >
> > 1. zfs_arc_free_target is too low now.  It is initialized to the
> > page daemon wakeup threshold, which is slightly above v_free_min.
> > With the PID controller, the page daemon uses a setpoint of
> > v_free_target.  Moreover, it now wakes up regularly rather than
> > having wakeups be synchronized by a mutex, so it will respond
> > quickly if the free page count dips below v_free_target.  The free
> > page count will dip below zfs_arc_free_target only in the face of
> > sudden and extreme memory pressure now, so the FMT_LOTSFREE case
> > probably isn't getting exercised.  Try initializing
> > zfs_arc_free_target to v_free_target.
>
> Changing zfs_arc_free_target definitely helps.  My previous poudriere
> run failed when poudriere timed out the ports-mgmt/pkg build after
> two hours.  After changing this setting, poudriere seems to be
> running properly, ARC has dropped from 29 GB to 26 GB ten minutes
> into the run, and I'm not seeing processes in the swread state.
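(A quick way to see Mark's point 1 on a live system, and to try his
suggestion without rebuilding the kernel: all three thresholds are
exported as sysctls.  The following is an illustrative userland
sketch, not kernel code; it assumes only the sysctl names already used
in this thread plus vm.stats.vm.v_free_count, the standard export of
the free page count, and the read_sysctl() helper is local to the
example.)

#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>

static u_int
read_sysctl(const char *name)
{
	u_int val;
	size_t len = sizeof(val);

	if (sysctlbyname(name, &val, &len, NULL, 0) != 0)
		err(1, "sysctlbyname(%s)", name);
	return (val);
}

int
main(void)
{
	u_int free_count = read_sysctl("vm.stats.vm.v_free_count");
	u_int free_target = read_sysctl("vm.v_free_target");
	u_int arc_target = read_sysctl("vfs.zfs.arc_free_target");

	printf("v_free_count:        %u pages\n", free_count);
	printf("v_free_target:       %u pages\n", free_target);
	printf("zfs_arc_free_target: %u pages\n", arc_target);

	/*
	 * Mark's point: the page daemon now holds v_free_count near
	 * v_free_target, so it rarely dips below the much smaller
	 * default zfs_arc_free_target, and the ARC never sees pressure.
	 */
	printf("ARC %s memory pressure right now\n",
	    free_count < arc_target ? "sees" : "does not see");
	return (0);
}

(Since vfs.zfs.arc_free_target is a writable sysctl, the suggestion
can also be tried at runtime with something like
sysctl vfs.zfs.arc_free_target=$(sysctl -n vm.v_free_target).)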
> > 2. In the inactive queue scan, we used to compute the shortage
> > after running uma_reclaim() and the lowmem handlers (which includes
> > a synchronous call to arc_lowmem()).  Now it's computed before, so
> > we're not taking into account the pages that get freed by the ARC
> > and UMA.  The following rather hacky patch may help.  I note that
> > the lowmem logic is now somewhat broken when multiple NUMA domains
> > are configured, however, since it fires only when domain 0 has a
> > free page shortage.
>
> I will try this next.
>
> > Index: sys/vm/vm_pageout.c
> > ===================================================================
> > --- sys/vm/vm_pageout.c (revision 331933)
> > +++ sys/vm/vm_pageout.c (working copy)
> > @@ -1114,25 +1114,6 @@
> >  	boolean_t queue_locked;
> >  
> >  	/*
> > -	 * If we need to reclaim memory ask kernel caches to return
> > -	 * some.  We rate limit to avoid thrashing.
> > -	 */
> > -	if (vmd == VM_DOMAIN(0) && pass > 0 &&
> > -	    (time_uptime - lowmem_uptime) >= lowmem_period) {
> > -		/*
> > -		 * Decrease registered cache sizes.
> > -		 */
> > -		SDT_PROBE0(vm, , , vm__lowmem_scan);
> > -		EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
> > -		/*
> > -		 * We do this explicitly after the caches have been
> > -		 * drained above.
> > -		 */
> > -		uma_reclaim();
> > -		lowmem_uptime = time_uptime;
> > -	}
> > -
> > -	/*
> >  	 * The addl_page_shortage is the number of temporarily
> >  	 * stuck pages in the inactive queue.  In other words, the
> >  	 * number of pages from the inactive count that should be
> > @@ -1824,6 +1805,26 @@
> >  	atomic_store_int(&vmd->vmd_pageout_wanted, 1);
> >  
> >  	/*
> > +	 * If we need to reclaim memory ask kernel caches to return
> > +	 * some.  We rate limit to avoid thrashing.
> > +	 */
> > +	if (vmd == VM_DOMAIN(0) &&
> > +	    vmd->vmd_free_count < vmd->vmd_free_target &&
> > +	    (time_uptime - lowmem_uptime) >= lowmem_period) {
> > +		/*
> > +		 * Decrease registered cache sizes.
> > +		 */
> > +		SDT_PROBE0(vm, , , vm__lowmem_scan);
> > +		EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
> > +		/*
> > +		 * We do this explicitly after the caches have been
> > +		 * drained above.
> > +		 */
> > +		uma_reclaim();
> > +		lowmem_uptime = time_uptime;
> > +	}
> > +
> > +	/*
> >  	 * Use the controller to calculate how many pages to free in
> >  	 * this interval.
> >  	 */
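(For context on the patch above: the handlers that
EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES) runs are registered
through the EVENTHANDLER(9) interface, and because they are invoked
synchronously, any pages they free are visible as soon as the call
returns — which is why computing the shortage after the handlers run,
as the patch does, picks them up.  A rough sketch of the ARC side of
that hookup, paraphrased rather than quoted from arc.c, looks like
this:)

#include <sys/param.h>
#include <sys/eventhandler.h>

static eventhandler_tag arc_event_lowmem;

/*
 * Called synchronously by the page daemon via
 * EVENTHANDLER_INVOKE(vm_lowmem, ...): ask the ARC reclaim thread to
 * shrink the cache and give it a chance to free some pages before
 * returning.
 */
static void
arc_lowmem(void *arg __unused, int howto __unused)
{
	/* signal arc_reclaim_thread() and wait briefly for progress */
}

static void
arc_lowmem_init(void)
{
	arc_event_lowmem = EVENTHANDLER_REGISTER(vm_lowmem, arc_lowmem,
	    NULL, EVENTHANDLER_PRI_FIRST);
}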
My powerpc64 embedded machine is virtually unusable since these vm
changes.  I tried setting vfs.zfs.arc_free_target as suggested, and
that didn't help at all.  Eventually the machine hangs and just gets
stuck in vmdaemon, with many processes in wait channel btalloc.

- Justin
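(Lastly, for readers unfamiliar with the algorithm at the center of
this thread: the controller added in r329882 is a standard PID loop.
The toy program below shows only the general update step; the gains,
numbers, and names are invented for illustration, and this is not the
kernel's implementation.  The proportional term reacts to the current
shortage, while the integral and derivative terms are what let the
controller "anticipate demand", as the commit message puts it.)

#include <stdio.h>

struct pid {
	double kp, ki, kd;	/* proportional/integral/derivative gains */
	double integral;	/* accumulated error */
	double prev_error;	/* error from the previous interval */
};

/*
 * One control interval: given a setpoint (e.g. v_free_target) and the
 * measured input (e.g. the free page count), return the controller
 * output, e.g. how many pages the page daemon should try to reclaim.
 */
static double
pid_step(struct pid *p, double setpoint, double input)
{
	double error = setpoint - input;
	double output;

	p->integral += error;
	output = p->kp * error + p->ki * p->integral +
	    p->kd * (error - p->prev_error);
	p->prev_error = error;
	return (output > 0 ? output : 0);
}

int
main(void)
{
	struct pid p = { .kp = 0.5, .ki = 0.1, .kd = 0.2 };
	double free_count = 10000;	/* pretend free page count */
	double target = 40000;		/* pretend v_free_target */

	for (int i = 0; i < 5; i++) {
		double pages = pid_step(&p, target, free_count);
		printf("interval %d: reclaim ~%.0f pages\n", i, pages);
		free_count += pages / 2;	/* pretend reclaim progress */
	}
	return (0);
}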