Re: Strange ARC/Swap/CPU on yesterday's -CURRENT

From: Don Lewis <truckman@FreeBSD.org>
Date: Wed, 4 Apr 2018 11:17:16 -0700 (PDT)
On  4 Apr, Mark Johnston wrote:
> On Tue, Apr 03, 2018 at 09:42:48PM -0700, Don Lewis wrote:
>> On  3 Apr, Don Lewis wrote:
>> > I reconfigured my Ryzen box to be more similar to my default package
>> > builder by disabling SMT and half of the RAM, limiting it to 8 cores
>> > and 32 GB, and then started bisecting to try to track down the
>> > problem.  For each test, I first filled ARC by tarring
>> > /usr/ports/distfiles to /dev/null.  The commit range I was searching
>> > was r329844 to r331716; I narrowed it to r329844 to r329904.  With
>> > r329904 and newer, ARC is totally unresponsive to memory pressure and
>> > the machine pages heavily.  I see ARC sizes of 28-29 GB and 30 GB of
>> > wired RAM, so there is not much left over for getting useful work
>> > done.  Active and free memory each hover under 1 GB.  Looking at the
>> > commit logs over this range, the most likely culprit is:
>> > 
>> > r329882 | jeff | 2018-02-23 14:51:51 -0800 (Fri, 23 Feb 2018) | 13 lines
>> > 
>> > Add a generic Proportional Integral Derivative (PID) controller algorithm and
>> > use it to regulate page daemon output.
>> > 
>> > This provides much smoother and more responsive page daemon output, anticipating
>> > demand and avoiding pageout stalls by increasing the number of pages to match
>> > the workload.  This is a reimplementation of work done by myself and mlaier at
>> > Isilon.
>> > 
>> > 
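As background, a PID controller drives its output from three terms on
the error between a setpoint (here the free page target) and the
measured value: the current error, its accumulated history, and its
rate of change.  Below is a minimal, integer-only sketch of the general
algorithm, not the kernel's actual implementation; every name in it is
illustrative.

/*
 * Textbook discrete PID step: output = error/Kp + integral/Ki +
 * derivative/Kd, with the gains expressed as divisors so everything
 * stays in integer arithmetic, as kernel code would.
 */
struct pid_sketch {
	int	setpoint;	/* e.g. the free page target */
	int	integral;	/* accumulated error */
	int	prev_error;	/* error from the previous interval */
	int	kp, ki, kd;	/* gains as divisors (all nonzero) */
};

static int
pid_sketch_output(struct pid_sketch *p, int measured)
{
	int derivative, error, output;

	error = p->setpoint - measured;
	p->integral += error;
	derivative = error - p->prev_error;
	p->prev_error = error;
	output = error / p->kp + p->integral / p->ki + derivative / p->kd;
	return (output > 0 ? output : 0);	/* pages to reclaim */
}

The integral term is what lets the controller "anticipate demand": a
sustained shortage keeps growing the per-interval reclaim count until
the setpoint is reached, instead of oscillating around a fixed wakeup
threshold.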
>> > It is quite possible that the recent fixes to the PID controller will
>> > fix the problem.  Not that r329844 was trouble-free ... I left tar
>> > running over lunchtime to fill ARC, and while I was away the OOM
>> > killer nuked top, tar, ntpd, both of my ssh sessions into the machine,
>> > and multiple instances of getty.  I was able to log in again and
>> > successfully run poudriere, and ARC did respond to the memory
>> > pressure, cranking itself down to about 5 GB by the end of the run.
>> > I did not see the same problem with tar when I repeated the test with
>> > r329904.
>> 
>> I just tried r331966 and see no improvement.  No OOM process kills
>> during the tar run to fill ARC, but with ARC filled, the machine is
>> thrashing itself at the start of the poudriere run while trying to build
>> ports-mgmt/pkg (39 minutes so far).  ARC appears to be unresponsive to
>> memory demand.  I've seen no decrease in ARC size or wired memory since
>> starting poudriere.
> 
> Re-reading the ARC reclaim code, I see a couple of issues which might be
> at the root of the behaviour you're seeing.
> 
> 1. zfs_arc_free_target is too low now. It is initialized to the page
>    daemon wakeup threshold, which is slightly above v_free_min. With the
>    PID controller, the page daemon uses a setpoint of v_free_target.
>    Moreover, it now wakes up regularly rather than having wakeups be
>    synchronized by a mutex, so it will respond quickly if the free page
>    count dips below v_free_target. The free page count will dip below
>    zfs_arc_free_target only in the face of sudden and extreme memory
>    pressure now, so the FMR_LOTSFREE case probably isn't getting
>    exercised. Try initializing zfs_arc_free_target to v_free_target.

Changing zfs_arc_free_target definitely helps.  My previous poudriere
run failed when poudriere timed out the ports-mgmt/pkg build after two
hours.  After changing this setting, poudriere seems to be running
properly: ten minutes into the run, ARC has dropped from 29 GB to
26 GB, and I'm not seeing processes in the swread state.
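
For anyone wanting to try the same thing: zfs_arc_free_target is
exposed as the vfs.zfs.arc_free_target sysctl, so it can be raised at
runtime to the value reported by vm.v_free_target.  In the source, the
change Mark suggests would land where arc.c seeds the default.  The
sketch below is from memory, so the exact variable spelling may differ
in the tree; the point is only which value seeds zfs_arc_free_target.

/*
 * Sketch of the suggested change in arc.c (FreeBSD ZFS).  The
 * v_free_target spelling here is an assumption, not a verified diff.
 */
static void
arc_free_target_init(void *unused __unused)
{
	/*
	 * Was the pagedaemon wakeup threshold, just above v_free_min.
	 * Seeding from v_free_target instead makes the ARC start
	 * giving up memory at the same point the PID-controlled
	 * pagedaemon starts reclaiming toward its setpoint.
	 */
	zfs_arc_free_target = vm_cnt.v_free_target;
}
SYSINIT(arc_free_target_init, SI_SUB_KTHREAD_PAGE, SI_ORDER_ANY,
    arc_free_target_init, NULL);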

> 2. In the inactive queue scan, we used to compute the shortage after
>    running uma_reclaim() and the lowmem handlers (which includes a
>    synchronous call to arc_lowmem()). Now it's computed before, so we're
>    not taking into account the pages that get freed by the ARC and UMA.
>    The following rather hacky patch may help. I note that the lowmem
>    logic is now somewhat broken when multiple NUMA domains are
>    configured, however, since it fires only when domain 0 has a free
>    page shortage.

I will try this next.

> Index: sys/vm/vm_pageout.c
> ===================================================================
> --- sys/vm/vm_pageout.c	(revision 331933)
> +++ sys/vm/vm_pageout.c	(working copy)
> @@ -1114,25 +1114,6 @@
>  	boolean_t queue_locked;
>  
>  	/*
> -	 * If we need to reclaim memory ask kernel caches to return
> -	 * some.  We rate limit to avoid thrashing.
> -	 */
> -	if (vmd == VM_DOMAIN(0) && pass > 0 &&
> -	    (time_uptime - lowmem_uptime) >= lowmem_period) {
> -		/*
> -		 * Decrease registered cache sizes.
> -		 */
> -		SDT_PROBE0(vm, , , vm__lowmem_scan);
> -		EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
> -		/*
> -		 * We do this explicitly after the caches have been
> -		 * drained above.
> -		 */
> -		uma_reclaim();
> -		lowmem_uptime = time_uptime;
> -	}
> -
> -	/*
>  	 * The addl_page_shortage is the number of temporarily
>  	 * stuck pages in the inactive queue.  In other words, the
>  	 * number of pages from the inactive count that should be
> @@ -1824,6 +1805,26 @@
>  		atomic_store_int(&vmd->vmd_pageout_wanted, 1);
>  
>  		/*
> +		 * If we need to reclaim memory ask kernel caches to return
> +		 * some.  We rate limit to avoid thrashing.
> +		 */
> +		if (vmd == VM_DOMAIN(0) &&
> +		    vmd->vmd_free_count < vmd->vmd_free_target &&
> +		    (time_uptime - lowmem_uptime) >= lowmem_period) {
> +			/*
> +			 * Decrease registered cache sizes.
> +			 */
> +			SDT_PROBE0(vm, , , vm__lowmem_scan);
> +			EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
> +			/*
> +			 * We do this explicitly after the caches have been
> +			 * drained above.
> +			 */
> +			uma_reclaim();
> +			lowmem_uptime = time_uptime;
> +		}
> +
> +		/*
>  		 * Use the controller to calculate how many pages to free in
>  		 * this interval.
>  		 */
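
As for the multi-domain caveat above: the moved block still fires only
for VM_DOMAIN(0).  One possible direction, given as an untested sketch
using only the vmd_free_count/vmd_free_target fields that already
appear in the patch plus vm_ndomains, would be to check every domain
for a shortage:

/*
 * Untested sketch, not a proposed commit: let the lowmem handlers
 * fire when any domain is below its free target, instead of keying
 * off domain 0 alone.
 */
static bool
vm_pageout_lowmem_needed(void)
{
	struct vm_domain *vmd;
	int i;

	for (i = 0; i < vm_ndomains; i++) {
		vmd = VM_DOMAIN(i);
		if (vmd->vmd_free_count < vmd->vmd_free_target)
			return (true);
	}
	return (false);
}

The lowmem_uptime check would still rate-limit the handlers, but since
each domain runs its own pagedaemon thread, some serialization around
lowmem_uptime would be needed so the handlers aren't invoked from
several threads at once.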