Re: FreeBSD 11.x grinds to a halt after about 48h of uptime

From: Ulrich Spörlein <uqs_at_FreeBSD.org>
Date: Mon, 24 Oct 2016 19:43:27 +0200

On Sat, 2016-10-15 at 09:36:27 -0700, Kevin Oberman wrote:
> On Sat, Oct 15, 2016 at 9:26 AM, Hans Petter Selasky <hps_at_selasky.org>
> wrote:
> 
> > On 10/15/16 18:18, Ulrich Spörlein wrote:
> >
> >> Hey all, while 11.x is -STABLE now, this happens to my machine ever
> >> since I upgraded it to 11-CURRENT years ago. I have no idea when this
> >> started, actually, but what always happens is this:
> >>
> >> - System and X11 are up and running, I keep it running over night as I'm
> >> too lazy to reboot and restart everything.
> >> - There's a bunch of xterms, Chrome, Clementine-Player and some other
> >> programs running
> >> - Coming back to the machine the next day (or the day after) it will
> >> exit the screensaver just fine and then either I can use it for a couple
> >> of seconds before it freezes, or it's pretty much dead already. The
> >> mouse cursor still moves for a bit, but then also freezes (so is this a
> >> GPU problem??)
> >>
> >> Now what I currently see on the screen is a clock widget stuck at 18:04
> >> but conky itself has last updated at 18:00:18 ...
> >>
> >> This time I had some SSH sessions from another machine to see some more
> >> useful things. There was nothing in various logs under /var/log (I also
> >> can't run dmesg anymore ...)
> >> I had top(1) running in a loop, this is the last output:
> >>
> >> last pid: 25633;  load averages:  0.27,  0.39,  0.36  up 1+23:03:28    18:00:12
> >> 202 processes: 2 running, 188 sleeping, 11 zombie, 1 waiting
> >>
> >> Mem: 8873M Active, 1783M Inact, 5072M Wired, 567M Buf, 132M Free
> >> ARC: 1844M Total, 469M MFU, 268M MRU, 16K Anon, 96M Header, 1012M Other
> >> Swap: 4096M Total, 2395M Used, 1701M Free, 58% Inuse
> >>
> >>
> >>   PID USERNAME      THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
> >>    11 root            8 155 ki31     0K   128K CPU0    0 364.6H 772.95% idle
> >>  3122 uqs            15  28    0  7113M  5861M uwait   0  94:44  13.96% chrome
> >>  2887 uqs            28  22    0  1394M   237M select  2 172:53   6.98% chrome
> >>  2890 uqs            11  21    0  1034M   178M select  5 231:21   1.95% chrome
> >>  1062 root            9  21    0   440M 47220K select  0  67:09   0.98% Xorg
> >>  3002 uqs            15  25    5  1159M   172M uwait   2  19:09   0.00% chrome
> >>  3139 uqs            17  25    5  1163M   156M uwait   2  16:15   0.00% chrome
> >>  3001 uqs            18  25    5  1639M   575M uwait   0  16:05   0.00% chrome
> >>    12 root           24 -64    -     0K   384K WAIT   -1  10:53   0.00% intr
> >>  3129 uqs            12  20    0  2820M  1746M uwait   6   8:36   0.00% chrome
> >>  2822 uqs             9  20    0   217M 81300K select  0   5:10   0.00% conky
> >>  3174 root            1  20    0 21532K  3188K select  0   4:20   0.00% systat
> >>  3130 uqs            16  20    0  1058M   131M uwait   4   3:03   0.00% chrome
> >>  2998 uqs            16  20    0  1110M   123M uwait   2   2:53   0.00% chrome
> >>  3165 uqs            10  20    0  1209M   215M uwait   6   2:52   0.00% chrome
> >>  3142 uqs            11  25    5  1344M   195M uwait   2   2:46   0.00% chrome
> >>  2876 uqs            19  20    0   580M 37164K select  3   2:42   0.00% clementine-player
> >>    20 root            2 -16    -     0K    32K psleep  6   2:25   0.00% pagedaemon
> >>
> >> I also had systat -vm running and it continued to update its screen ...
> >> for a short while, this is the last update before SSH died:
> >>
> >>
> >>    Mem usage:  0k%Phy  5%Kmem
> >> Mem: KB    REAL            VIRTUAL                      VN PAGER   SWAP PAGER
> >>         Tot   Share      Tot    Share    Free           in   out     in   out
> >> Act  11051k   67868 71051992   255448   61840  count
> >> All  11051k   67924 71058776   262100          pages
> >> Proc:                                                            Interrupts
> >>   r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt   224 total
> >>      25     730  11   724  109  404  101   13             cow       2 ehci0 16
> >>                                                           zfod      3 ehci1 23
> >>  0.0%Sys   0.1%Intr  0.0%User  0.0%Nice 99.9%Idle         ozfod    16 cpu0:timer
> >> |    |    |    |    |    |    |    |    |    |           %ozfod       xhci0 264
> >>                                                           daefr     3 em0 265
> >>                                         50 dtbuf          prcfr    94 hdac1 266
> >> Namei     Name-cache   Dir-cache    349167 desvn          totfr       ahci0 270
> >>    Calls    hits   %    hits   %    349155 numvn          react     5 cpu1:timer
> >>      121     121 100                253501 frevn          pdwak     1 cpu2:timer
> >>                                                           pdpgs    29 cpu7:timer
> >> Disks   md0  ada0  ada1 pass0 pass1 pass2                 intrn    12 cpu3:timer
> >> KB/t   0.00  0.00  0.00  0.00  0.00  0.00         5318892 wire     41 cpu6:timer
> >> tps       0     0     0     0     0     0         9261404 act      12 cpu5:timer
> >> MB/s   0.00  0.00  0.00  0.00  0.00  0.00         1598184 inact     6 cpu4:timer
> >> %busy     0     0     0     0     0     0                 cache       vgapci0
> >>                                                     61840 free
> >>                                                    712304 buf
> >>
> >>
> >> Why do I have a Chrome tab using about 6G? What other sort of debugging
> >> output can be helpful to get to the bottom of this? The machine still
> >> responds to pings just fine, TCP connections get set up but the SSH
> >> handshake never completes.
> >>
> >> This always happens between 30-50h and is super annoying and has been
> >> going on for >1year. Help?
> >>
> >> Note, I cut the power to the monitor overnight to save electricity, can
> >> this mess up something in the Radeon card or X server? What combinations
> >> would be most useful to try next?
> >>
> >>
> > Hi,
> >
> > Sounds like a memory leak. Can you track the memory use over time?
> >
> > Did you look at the output from:
> >
> > vmstat -m ?
> >
> > --HPS
> 
> 
> I have noted significant memory leakage in chromium for some time. If I
> leave it running overnight, my system is essentially frozen. If I terminate
> the chromium process, it slowly comes back to life. I always keep a gkrellm
> session on-screen where the memory and swap utilization is continuously
> displayed and that clearly shows resources declining.

That is not what is happening to my system though, it actually
deadlocks. There's no way to recover from it, it seems.

So I killed Chromium overnight each day, and I'm at this:

% top -Sbores
last pid: 44526;  load averages:  0.10,  0.11,  0.56  up 7+09:53:30    19:33:25
156 processes: 2 running, 153 sleeping, 1 waiting

Mem: 315M Active, 550M Inact, 5671M Wired, 515M Buf, 9324M Free
ARC: 1852M Total, 541M MFU, 196M MRU, 16K Anon, 93M Header, 1022M Other
Swap: 4096M Total, 2186M Used, 1910M Free, 53% Inuse

  PID USERNAME      THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 2755 uqs            10  20    0  1697M   311M select  1  47:23   0.00% conky
 2736 uqs            32  20    0   699M   116M select  7  94:29   0.00% clementine-player
 3000 uqs            12  20    0  1126M 69380K select  5   9:48   0.00% digikam
  960 root            9  20    0   448M 59076K select  0 250:22   0.00% Xorg
72608 uqs             8  20    0   939M 55432K uwait   5   0:01   0.00% chrome
72599 uqs             9  52    0   929M 55116K uwait   0   0:00   0.00% chrome
 2567 root            1  20    0 89948K 42964K select  1   1:51   0.00% bsnmpd
70476 uqs             1  20    0 93656K 25712K select  2   0:05   0.00% xterm
 2730 uqs             5  20    0   208M 14988K select  1   0:22   0.00% clock-applet
  880 root            1  20    0 22628K 12500K select  3   0:20   0.00% ntpd
 2726 uqs             4  20    0   206M 12456K select  6   0:09   0.00% mateweather-applet
44352 uqs             1  20    0 75224K 12348K select  4   0:00   0.00% xterm
43049 uqs             1  20    0 75224K 11792K select  5   0:00   0.00% xterm
 3074 uqs             2  20    0   308M  9692K select  1   0:02   0.00% kdeinit4
 2671 uqs             1  20    0   144M  9488K select  1   0:13   0.00% openbox
 3072 uqs             1  20    0   210M  8284K select  3   0:00   0.00% kdeinit4
 2724 uqs             4  20    0   154M  8256K select  2   0:19   0.00% wnck-applet
 2701 uqs             5  20    0   177M  8144K select  2   0:01   0.00% mate-panel

7d running, pretty good. But look closer, the system is doing pretty
much nothing but did swap out 2G. What?
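
As a rough cross-check of the two top(1) summaries in this thread, the sketch below parses the "Mem:", "ARC:" and "Swap:" lines and compares them. The input strings are copied from the outputs above; the helper names (parse_summary, report) are made up for illustration, and leaving "Buf" out of the sum rests on the assumption that it overlaps the other categories rather than adding to them.

```python
import re

# Summary lines copied verbatim from the two top(1) outputs in this thread.
BEFORE_HANG = """\
Mem: 8873M Active, 1783M Inact, 5072M Wired, 567M Buf, 132M Free
ARC: 1844M Total, 469M MFU, 268M MRU, 16K Anon, 96M Header, 1012M Other
Swap: 4096M Total, 2395M Used, 1701M Free, 58% Inuse
"""
AFTER_7_DAYS = """\
Mem: 315M Active, 550M Inact, 5671M Wired, 515M Buf, 9324M Free
ARC: 1852M Total, 541M MFU, 196M MRU, 16K Anon, 93M Header, 1022M Other
Swap: 4096M Total, 2186M Used, 1910M Free, 53% Inuse
"""

UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}
# Matches e.g. "8873M Active"; percentage fields such as "58% Inuse" are skipped.
FIELD = re.compile(r"(\d+)([KMG])\s+(\w+)")


def parse_summary(text):
    """Return {"Mem": {"Active": 8873, ...}, "ARC": {...}, "Swap": {...}} in MiB."""
    result = {}
    for line in text.splitlines():
        section, _, rest = line.partition(":")
        result[section.strip()] = {
            name: int(value) * UNIT_TO_MIB[unit]
            for value, unit, name in FIELD.findall(rest)
        }
    return result


def report(label, text):
    """Print and return (memory accounted for, wired memory outside the ARC)."""
    summary = parse_summary(text)
    mem, arc, swap = summary["Mem"], summary["ARC"], summary["Swap"]
    # Assumption: Buf overlaps the other categories, so it is not added here.
    accounted = mem["Active"] + mem["Inact"] + mem["Wired"] + mem["Free"]
    wired_not_arc = mem["Wired"] - arc["Total"]
    print(f"{label}: accounted={accounted:.0f}M "
          f"wired-minus-ARC={wired_not_arc:.0f}M swap-used={swap['Used']:.0f}M")
    return accounted, wired_not_arc


if __name__ == "__main__":
    report("before hang", BEFORE_HANG)
    report("after 7 days", AFTER_7_DAYS)
```

Both snapshots add up to 15860M, so the categories at least account for the same amount of RAM. Wired memory outside the ARC comes out at 3228M before the hang and 3819M after seven days; whether that difference means anything would need more than two data points.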

> Try closing your chromium at night and see if that fixes the problem.

It's better, but I'm not sure it's a real fix. I've now turned off
"hardware acceleration" in Chromium, though chrome://gpu didn't really
inspire confidence that it was actually using any h/w accel at all.

> If you have never tried gkrellm (sysutils/gkrellm2), it is the best
> system monitor I have found, though it pulls in a lot of dependencies. It
> also can run as a server with remote systems displaying the data. Handy to
> monitor servers.

I had a cacti setup that would also monitor my workstation (through an
OpenVPN tunnel), but that has bit-rotted and Apache only gives me 500s
on that cacti URL and nothing in the logs, oh well ...
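
Short of reviving a full monitoring setup, a small local logger could record the memory counters over time, which is roughly what was asked for earlier in the thread. The following is only a sketch under assumptions: the sysctl names are the ones I would expect on FreeBSD and should be checked with sysctl(8) on the machine in question, and the function names, log path and interval are hypothetical.

```python
import os
import subprocess
import time

# Assumed FreeBSD sysctl names; verify locally before relying on them.
COUNTERS = {
    "free": "vm.stats.vm.v_free_count",
    "active": "vm.stats.vm.v_active_count",
    "inactive": "vm.stats.vm.v_inactive_count",
    "wired": "vm.stats.vm.v_wire_count",
}
PAGE_SIZE = "hw.pagesize"


def read_sysctl(name):
    """Read one integer value by shelling out to `sysctl -n`."""
    out = subprocess.run(["sysctl", "-n", name], check=True,
                         capture_output=True, text=True)
    return int(out.stdout.strip())


def sample(read=read_sysctl):
    """Return the page counters converted to MiB, fetched through `read`."""
    page = read(PAGE_SIZE)
    return {label: read(name) * page // (1024 * 1024)
            for label, name in COUNTERS.items()}


def format_line(timestamp, values):
    """Build one tab-separated log line for a sample."""
    fields = "\t".join(f"{key}={value}M" for key, value in values.items())
    return f"{timestamp}\t{fields}"


def log_forever(path, interval=60):
    """Append a sample every `interval` seconds; fsync so the last lines
    have a chance of reaching the disk before a hard hang."""
    with open(path, "a") as log:
        while True:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            log.write(format_line(stamp, sample()) + "\n")
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)


if __name__ == "__main__":
    log_forever("/var/tmp/memlog.tsv")  # hypothetical path
```

The `read` parameter exists so the sampling logic can be exercised with canned values on a machine without these sysctls. Writing to local disk rather than over the network is deliberate, since the SSH sessions were the first thing to die.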

Hooking up a serial console and testing whether DDB works is probably
the next best step to take ...

Cheers,
Uli