Re: Strange ARC/Swap/CPU on yesterday's -CURRENT

From: O. Hartmann <ohartmann_at_walstatt.org>
Date: Sun, 11 Mar 2018 00:47:10 +0100
Am Wed, 7 Mar 2018 14:39:13 +0400
Roman Bogorodskiy <novel_at_FreeBSD.org> schrieb:

>   Danilo G. Baio wrote:
> 
> > On Tue, Mar 06, 2018 at 01:36:45PM -0600, Larry Rosenman wrote:  
> > > On Tue, Mar 06, 2018 at 10:16:36AM -0800, Rodney W. Grimes wrote:  
> > > > > On Tue, Mar 06, 2018 at 08:40:10AM -0800, Rodney W. Grimes wrote:  
> > > > > > > On Mon, 5 Mar 2018 14:39-0600, Larry Rosenman wrote:
> > > > > > >   
> > > > > > > > Upgraded to:
> > > > > > > > 
> > > > > > > > FreeBSD borg.lerctr.org 12.0-CURRENT FreeBSD 12.0-CURRENT #11 r330385:
> > > > > > > > Sun Mar  4 12:48:52 CST 2018
> > > > > > > > root_at_borg.lerctr.org:/usr/obj/usr/src/amd64.amd64/sys/VT-LER  amd64
> > > > > > > > +1200060 1200060
> > > > > > > > 
> > > > > > > > Yesterday, and I'm seeing really strange slowness, ARC use, and SWAP use
> > > > > > > > and swapping.
> > > > > > > > 
> > > > > > > > See http://www.lerctr.org/~ler/FreeBSD/Swapuse.png  
> > > > > > > 
> > > > > > > I see these symptoms on stable/11. One of my servers has 32 GiB of 
> > > > > > > RAM. After a reboot all is well. ARC starts to fill up, and I still 
> > > > > > > have more than half of the memory available for user processes.
> > > > > > > 
> > > > > > > After running the periodic jobs at night, the amount of wired memory 
> > > > > > > goes sky high. /etc/periodic/weekly/310.locate is a particular nasty 
> > > > > > > one.  
> > > > > > 
> > > > > > I would like to find out if this is the same person I have
> > > > > > reporting this problem from another source, or if this is
> > > > > > a confirmation of a bug I was helping someone else with.
> > > > > > 
> > > > > > Have you been in contact with Michael Dexter about this
> > > > > > issue, or any other forum/mailing list/etc?    
> > > > > Just IRC/Slack, with no response.  
> > > > > > 
> > > > > > If not then we have at least 2 reports of this unbound
> > > > > > wired memory growth, if so hopefully someone here can
> > > > > > take you further in the debug than we have been able
> > > > > > to get.  
> > > > > What can I provide?  The system is still in this state as the full backup is
> > > > > slow.  
> > > > 
> > > > One place to look is to see if this is the recently fixed:
> > > > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=222288
> > > > g_bio leak.
> > > > 
> > > > vmstat -z | egrep 'ITEM|g_bio|UMA'
> > > > 
> > > > would be a good first look
> > > >   
> > > borg.lerctr.org /home/ler $ vmstat -z | egrep 'ITEM|g_bio|UMA'
> > > ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
> > > UMA Kegs:               280,      0,     346,       5,     560,   0,   0
> > > UMA Zones:             1928,      0,     363,       1,     577,   0,   0
> > > UMA Slabs:              112,      0,25384098,  977762,102033225,   0,   0
> > > UMA Hash:               256,      0,      59,      16,     105,   0,   0
> > > g_bio:                  384,      0,      33,    1627,542482056,   0,   0
> > > borg.lerctr.org /home/ler $  
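The vmstat output above can be checked mechanically: for the g_bio leak fixed in PR 222288, the telltale sign is a USED count that keeps growing without bound between samples (here g_bio sits at 33, so that leak is not the culprit). A minimal sketch of such a check; the inline sample is copied from the output above, and the one-million threshold is an arbitrary assumption for illustration:

```shell
# Sketch: flag UMA zones whose USED count looks unbounded in `vmstat -z`
# output. The sample and the 1,000,000 threshold are illustrative only.
sample='ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
UMA Slabs:              112,      0,25384098,  977762,102033225,   0,   0
g_bio:                  384,      0,      33,    1627,542482056,   0,   0'

result=$(printf '%s\n' "$sample" | awk -F',' 'NR > 1 {
    name = $1; sub(/:.*/, "", name)   # zone name is everything before the colon
    used = $3 + 0                     # third comma-separated field is USED
    if (used > 1000000)
        printf "%s USED=%d\n", name, used
}')
echo "$result"
```

On a live system one would pipe `vmstat -z` directly instead of the canned sample and compare two runs taken a few minutes apart.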
> > > > > > > Limiting the ARC to, say, 16 GiB, has no effect on the high amount of 
> > > > > > > wired memory. After a few more days, the kernel consumes virtually all 
> > > > > > > memory, forcing processes in and out of the swap device.  
> > > > > > 
> > > > > > Our experience as well.
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > Thanks,
> > > > > > Rod Grimes
> > > > > > rgrimes_at_freebsd.org  
> > > > > Larry Rosenman                     http://www.lerctr.org/~ler  
> > > > 
> > > > -- 
> > > > Rod Grimes                                                 rgrimes_at_freebsd.org  
> > > 
> > > -- 
> > > Larry Rosenman                     http://www.lerctr.org/~ler
> > > Phone: +1 214-642-9640                 E-Mail: ler_at_lerctr.org
> > > US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106  
> > 
> > 
> > Hi.
> > 
> > I noticed this behavior as well and changed vfs.zfs.arc_max for a smaller size.
> > 
> > For me it started when I upgraded to 1200058; on this box I'm only using
> > poudriere for build tests.  
> 
> I've noticed that as well.
> 
> I have 16G of RAM and two disks: the first one is UFS with the system
> installation, and the second one is ZFS, which I use to store media and
> data files and for poudriere.
> 
> I don't recall the exact date, but it started fairly recently. The system
> would swap like crazy to the point where I could not even ssh to it, and
> could hardly log in through a tty: it might take 10-15 minutes for a typed
> command to show up in the shell.
> 
> I've updated loader.conf to have the following:
> 
> vfs.zfs.arc_max="4G"
> vfs.zfs.prefetch_disable="1"
> 
> It fixed the problem, but introduced a new one. When I'm building stuff
> with poudriere with ccache enabled, it takes hours to build even small
> projects like curl or gnutls.
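Sizing arc_max is a trade-off: too small and ZFS rereads everything from disk (as the slow ccache builds here suggest), too large and the ARC crowds out user memory. A back-of-the-envelope sketch of the arithmetic behind the 4G setting above, assuming 16G of RAM and a one-quarter ratio (my assumption for illustration, not an official recommendation):

```shell
# Sketch: derive a conservative vfs.zfs.arc_max for loader.conf.
# The 1/4-of-RAM ratio is an assumed rule of thumb, not official guidance.
ram_bytes=$((16 * 1024 * 1024 * 1024))        # 16 GiB of physical RAM
arc_max=$((ram_bytes / 4))                    # cap the ARC at a quarter of RAM
arc_max_g=$((arc_max / 1024 / 1024 / 1024))   # back to GiB for the tunable
echo "vfs.zfs.arc_max=\"${arc_max_g}G\""      # matches the 4G setting above
```

The live ARC size can then be compared against the cap with `sysctl kstat.zfs.misc.arcstats.size` after a reboot.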
> 
> For example, current build:
> 
> [10i386-default] [2018-03-07_07h44m45s] [parallel_build:] Queued: 3  Built: 1  Failed: 0  Skipped: 0  Ignored: 0  Tobuild: 2  Time: 06:48:35
> [02]: security/gnutls | gnutls-3.5.18             build           (06:47:51)
> 
> Almost 7 hours already and still going!
> 
> gstat output looks like this:
> 
> dT: 1.002s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     0      0      0      0    0.0      0      0    0.0    0.0  da0
>     0      1      0      0    0.0      1    128    0.7    0.1  ada0
>     1    106    106    439   64.6      0      0    0.0   98.8  ada1
>     0      1      0      0    0.0      1    128    0.7    0.1  ada0s1
>     0      0      0      0    0.0      0      0    0.0    0.0  ada0s1a
>     0      0      0      0    0.0      0      0    0.0    0.0  ada0s1b
>     0      1      0      0    0.0      1    128    0.7    0.1  ada0s1d
> 
> ada0 here is the UFS drive, and ada1 is ZFS.
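The numbers in that gstat sample already tell the story: 439 kBps across 106 reads/s works out to roughly 4 KiB per read, i.e. small random I/O saturating ada1 at ~99% busy, which fits an ARC too small to hold the ccache working set with prefetch disabled. A quick arithmetic check:

```shell
# Arithmetic from the gstat sample above: average bytes per read on ada1.
kbps=439; rps=106                     # read throughput (kBps) and reads/s
avg_read=$((kbps * 1024 / rps))       # integer average bytes per read
echo "avg read size: ${avg_read} bytes"   # ~4 KiB: small, seek-bound reads
```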
> 
> > Regards.
> > -- 
> > Danilo G. Baio (dbaio)  
> 
> 
> 
> Roman Bogorodskiy


This is from an APU (PC Engines): no ZFS, UFS on a small mSATA device; the APU works as a
firewall, router, and PBX:

last pid:  9665;  load averages:  0.13,  0.13,  0.11    up 3+06:53:55  00:26:26
19 processes:  1 running, 18 sleeping
CPU:  0.3% user,  0.0% nice,  0.2% system,  0.0% interrupt, 99.5% idle
Mem: 27M Active, 6200K Inact, 83M Laundry, 185M Wired, 128K Buf, 675M Free
Swap: 7808M Total, 2856K Used, 7805M Free
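When wired memory climbs like this, `vmstat -z` shows which UMA zones hold it: each zone's kernel footprint is roughly SIZE * (USED + FREE). A sketch that sums this over a sample (the two-line sample is taken from the output earlier in this thread; the figure is an estimate, since it ignores per-slab overhead):

```shell
# Sketch: estimate kernel memory held by UMA zones, as SIZE * (USED + FREE)
# summed over `vmstat -z` lines. Sample data from earlier in this thread.
sample='ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
UMA Slabs:              112,      0,25384098,  977762,102033225,   0,   0
g_bio:                  384,      0,      33,    1627,542482056,   0,   0'

total=$(printf '%s\n' "$sample" | awk -F',' 'NR > 1 {
    size = $1; sub(/.*: */, "", size)   # SIZE is the number after the zone name
    used = $3 + 0; free = $4 + 0        # USED and FREE columns
    sum += size * (used + free)         # approximate bytes held by the zone
}
END { printf "%.0f", sum }')
echo "$total bytes in UMA zones"
```

On a live system, `vmstat -z | sort -t, -k3 -rn | head` gives a quick ranking of the biggest zones.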
[...]

The APU is running CURRENT (FreeBSD 12.0-CURRENT #42 r330608: Wed Mar  7 16:55:59 CET
2018 amd64). Usually the APU never(!) uses swap; for the past couple of days it has been
swapping like hell and I have had to reboot it fairly often.

Another box (16 GB RAM, ZFS, poudriere, the packaging box) is unresponsive right now:
after hours of building packages, I tried to copy the repository from one location on
the same ZFS volume to another - usually this task takes a couple of minutes for ~2200
ports. This time it has taken 2 1/2 hours and the box got stuck; Ctrl-T on the console
delivers:
load: 0.00  cmd: make 91199 [pfault] 7239.56r 0.03u 0.04s 0% 740k

No response from the box anymore.


The problem of heavy swapping and slow performance is not just a matter of the past few
days; it has been present for at least one and a half weeks, probably longer. Since I
build ports fairly often, the time taken on that specific box has increased from 2 to 3
days for all ~2200 ports. The system has 16 GB of RAM and an IvyBridge 4-core XEON at
3.4 GHz, if that information matters. The box consumes swap really fast.

Today is the first time the machine became unresponsive (no ssh, no console login so
far). I need to cold-start it. The OS is CURRENT as well.

Regards,

O. Hartmann


-- 
O. Hartmann

I object to the use or transfer of my data for advertising purposes or for
market or opinion research (§ 28 Abs. 4 BDSG).

Received on Sat Mar 10 2018 - 22:50:08 UTC

This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:15 UTC