FreeBSD 11.x grinds to a halt after about 48h of uptime

From: Ulrich Spörlein <uqs_at_FreeBSD.org>
Date: Sat, 15 Oct 2016 18:18:48 +0200
Hey all, while 11.x is -STABLE now, this has been happening to my
machine ever since I upgraded it to 11-CURRENT years ago. I have no idea
when exactly it started, but what always happens is this:

- System and X11 are up and running; I keep the machine running overnight
as I'm too lazy to reboot and restart everything.
- There's a bunch of xterms, Chrome, Clementine-Player and some other
programs running.
- Coming back to the machine the next day (or the day after), it will
exit the screensaver just fine and then either I can use it for a couple
of seconds before it freezes, or it's pretty much dead already. The
mouse cursor still moves for a bit, but then also freezes (so is this a
GPU problem??)

Now what I currently see on the screen is a clock widget stuck at 18:04
but conky itself has last updated at 18:00:18 ...

This time I had some SSH sessions from another machine to see some more
useful things. There was nothing in various logs under /var/log (I also
can't run dmesg anymore ...)
I had top(1) running in a loop, this is the last output:

last pid: 25633;  load averages:  0.27,  0.39,  0.36  up 1+23:03:28    18:00:12
202 processes: 2 running, 188 sleeping, 11 zombie, 1 waiting

Mem: 8873M Active, 1783M Inact, 5072M Wired, 567M Buf, 132M Free
ARC: 1844M Total, 469M MFU, 268M MRU, 16K Anon, 96M Header, 1012M Other
Swap: 4096M Total, 2395M Used, 1701M Free, 58% Inuse


  PID USERNAME      THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root            8 155 ki31     0K   128K CPU0    0 364.6H 772.95% idle
 3122 uqs            15  28    0  7113M  5861M uwait   0  94:44  13.96% chrome
 2887 uqs            28  22    0  1394M   237M select  2 172:53   6.98% chrome
 2890 uqs            11  21    0  1034M   178M select  5 231:21   1.95% chrome
 1062 root            9  21    0   440M 47220K select  0  67:09   0.98% Xorg
 3002 uqs            15  25    5  1159M   172M uwait   2  19:09   0.00% chrome
 3139 uqs            17  25    5  1163M   156M uwait   2  16:15   0.00% chrome
 3001 uqs            18  25    5  1639M   575M uwait   0  16:05   0.00% chrome
   12 root           24 -64    -     0K   384K WAIT   -1  10:53   0.00% intr
 3129 uqs            12  20    0  2820M  1746M uwait   6   8:36   0.00% chrome
 2822 uqs             9  20    0   217M 81300K select  0   5:10   0.00% conky
 3174 root            1  20    0 21532K  3188K select  0   4:20   0.00% systat
 3130 uqs            16  20    0  1058M   131M uwait   4   3:03   0.00% chrome
 2998 uqs            16  20    0  1110M   123M uwait   2   2:53   0.00% chrome
 3165 uqs            10  20    0  1209M   215M uwait   6   2:52   0.00% chrome
 3142 uqs            11  25    5  1344M   195M uwait   2   2:46   0.00% chrome
 2876 uqs            19  20    0   580M 37164K select  3   2:42   0.00% clementine-player
   20 root            2 -16    -     0K    32K psleep  6   2:25   0.00% pagedaemon
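
To compare snapshots like the one above over time, the "Mem:" and
"Swap:" summary lines can be pulled apart mechanically. This is only a
rough sketch: the helper names parse_top_summary and swap_used_percent
are made up for illustration, and the parser assumes the summary fields
only ever use K, M or G suffixes, as they do in this output.

```python
import re

# Multipliers to convert top(1) size suffixes to MiB (assumed: K, M, G only).
_UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}

# Matches fields such as "8873M Active"; skips entries like "58% Inuse".
_FIELD = re.compile(r"(\d+)([KMG])\s+(\w+)")


def parse_top_summary(line):
    """Parse a top(1) summary line ('Mem: ...', 'Swap: ...') into {label: MiB}."""
    _, _, rest = line.partition(":")
    return {
        label: int(num) * _UNIT_TO_MIB[unit]
        for num, unit, label in _FIELD.findall(rest)
    }


def swap_used_percent(swap):
    """Return swap usage in percent from a parsed 'Swap:' line."""
    return 100 * swap["Used"] / swap["Total"]


mem = parse_top_summary(
    "Mem: 8873M Active, 1783M Inact, 5072M Wired, 567M Buf, 132M Free"
)
swap = parse_top_summary(
    "Swap: 4096M Total, 2395M Used, 1701M Free, 58% Inuse"
)
print(mem["Free"])                       # MiB reported as free
print(f"{swap_used_percent(swap):.0f}%")  # should agree with top's "Inuse"
```

For this snapshot the computed swap usage rounds to the 58% that top
reports, with only 132M free.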

I also had systat -vm running and it continued to update its screen ...
for a short while, this is the last update before SSH died:


   Mem usage:  0k%Phy  5%Kmem
Mem: KB    REAL            VIRTUAL                      VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act  11051k   67868 71051992   255448   61840  count    
All  11051k   67924 71058776   262100          pages  
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt   224 total
     25     730  11   724  109  404  101   13             cow       2 ehci0 16
                                                          zfod      3 ehci1 23
 0.0%Sys   0.1%Intr  0.0%User  0.0%Nice 99.9%Idle         ozfod    16 cpu0:timer
|    |    |    |    |    |    |    |    |    |           %ozfod       xhci0 264
                                                          daefr     3 em0 265
                                        50 dtbuf          prcfr    94 hdac1 266
Namei     Name-cache   Dir-cache    349167 desvn          totfr       ahci0 270
   Calls    hits   %    hits   %    349155 numvn          react     5 cpu1:timer
     121     121 100                253501 frevn          pdwak     1 cpu2:timer
                                                          pdpgs    29 cpu7:timer
Disks   md0  ada0  ada1 pass0 pass1 pass2                 intrn    12 cpu3:timer
KB/t   0.00  0.00  0.00  0.00  0.00  0.00         5318892 wire     41 cpu6:timer
tps       0     0     0     0     0     0         9261404 act      12 cpu5:timer
MB/s   0.00  0.00  0.00  0.00  0.00  0.00         1598184 inact     6 cpu4:timer
%busy     0     0     0     0     0     0                 cache       vgapci0
                                                    61840 free
                                                   712304 buf


Why do I have a Chrome tab using about 6G? What other sort of debugging
output can be helpful to get to the bottom of this? The machine still
responds to pings just fine, TCP connections get set up but the SSH
handshake never completes.
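
Since the SSH sessions die together with the machine, one thing that
might help is writing each sample to local disk and forcing it out
before taking the next one. A minimal sketch, assuming a hypothetical
helper append_sample and a log path of my choosing; whether fsync still
completes once the box starts to hang is itself an open question.

```python
import os
import time


def append_sample(path, text):
    """Append one timestamped sample to path and force it to stable storage."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"=== {stamp} ===\n{text.rstrip()}\n")
        fh.flush()             # push Python's buffer to the kernel
        os.fsync(fh.fileno())  # ask the kernel to write the data to disk
```

The text argument could be the captured output of whatever is being
sampled (for example a batch-mode top run), called once per interval
from a loop.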

This always happens after between 30 and 50h of uptime, is super annoying
and has been going on for more than a year. Help?

Note: I cut the power to the monitor overnight to save electricity. Can
this mess up something in the Radeon card or X server? What combinations
would be most useful to try next?

Uli
Received on Sat Oct 15 2016 - 14:18:53 UTC
