11.x deadlocking during pfault (was Re: FreeBSD 11.x grinds to a halt after about 48h of uptime)

From: Ulrich Spörlein <uqs_at_FreeBSD.org> Date: Wed, 26 Oct 2016 18:43:43 +0200 · This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:08 UTC

On Mon, 2016-10-24 at 19:43:27 +0200, Ulrich Spörlein wrote:
> On Sat, 2016-10-15 at 09:36:27 -0700, Kevin Oberman wrote:
> > On Sat, Oct 15, 2016 at 9:26 AM, Hans Petter Selasky <hps_at_selasky.org>
> > wrote:
> > 
> > > On 10/15/16 18:18, Ulrich Spörlein wrote:
> > >
> > >> Hey all, while 11.x is -STABLE now, this happens to my machine ever
> > >> since I upgraded it to 11-CURRENT years ago. I have no idea when this
> > >> started, actually, but what always happens is this:
> > >>
> > >> - System and X11 is up and running, I keep it running over night as I'm
> > >> too lazy to reboot and restart everthing.
> > >> - There's a bunch of xterms, Chrome, Clementine-Player and some other
> > >> programs running
> > >> - Coming back to the machine the next day (or the day after) it will
> > >> exit the screensaver just fine and then either I can use it for a couple
> > >> of seconds before it freezes, or it's pretty much dead already. The
> > >> mouse cursor still moves for a bit, but the also freezes (so it this a
> > >> GPU problem??)
> > >>
> > >> Now what I currently see on the screen is a clock widget stuck at 18:04
> > >> but conky itself has last updated at 18:00:18 ...
> > >>
> > >> This time I had some SSH sessions from another machine to see some more
> > >> useful things. There was nothing in various logs under /var/log (I also
> > >> can't run dmesg anymore ...)
> > >> I had top(1) running in a loop, this is the last output:
> > >>
> > >> last pid: 25633;  load averages:  0.27,  0.39,  0.36  up 1+23:03:28
> > >> 18:00:12
> > >> 202 processes: 2 running, 188 sleeping, 11 zombie, 1 waiting
> > >>
> > >> Mem: 8873M Active, 1783M Inact, 5072M Wired, 567M Buf, 132M Free
> > >> ARC: 1844M Total, 469M MFU, 268M MRU, 16K Anon, 96M Header, 1012M Other
> > >> Swap: 4096M Total, 2395M Used, 1701M Free, 58% Inuse
> > >>
> > >>
> > >>   PID USERNAME      THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU
> > >> COMMAND
> > >>    11 root            8 155 ki31     0K   128K CPU0    0 364.6H 772.95%
> > >> idle
> > >>              3122 uqs            15  28    0  7113M  5861M uwait   0
> > >> 94:44  13.96% chrome
> > >>                            2887 uqs            28  22    0  1394M   237M
> > >> select  2 172:53   6.98% chrome
> > >>                                        2890 uqs            11  21    0
> > >> 1034M   178M select  5 231:21   1.95% chrome
> > >>                                                    1062 root            9
> > >> 21    0   440M 47220K select  0  67:09   0.98% Xorg
> > >>                                                              3002 uqs
> > >>       15  25    5  1159M   172M uwait   2  19:09   0.00% chrome
> > >>  3139 uqs            17  25    5  1163M   156M uwait   2  16:15   0.00%
> > >> chrome
> > >>  3001 uqs            18  25    5  1639M   575M uwait   0  16:05   0.00%
> > >> chrome
> > >>    12 root           24 -64    -     0K   384K WAIT   -1  10:53   0.00%
> > >> intr
> > >>  3129 uqs            12  20    0  2820M  1746M uwait   6   8:36   0.00%
> > >> chrome
> > >>  2822 uqs             9  20    0   217M 81300K select  0   5:10   0.00%
> > >> conky
> > >>  3174 root            1  20    0 21532K  3188K select  0   4:20   0.00%
> > >> systat
> > >>  3130 uqs            16  20    0  1058M   131M uwait   4   3:03   0.00%
> > >> chrome
> > >>  2998 uqs            16  20    0  1110M   123M uwait   2   2:53   0.00%
> > >> chrome
> > >>  3165 uqs            10  20    0  1209M   215M uwait   6   2:52   0.00%
> > >> chrome
> > >>  3142 uqs            11  25    5  1344M   195M uwait   2   2:46   0.00%
> > >> chrome
> > >>  2876 uqs            19  20    0   580M 37164K select  3   2:42   0.00%
> > >> clementine-player
> > >>    20 root            2 -16    -     0K    32K psleep  6   2:25   0.00%
> > >> pagedaemon
> > >>
> > >> I also had systat -vm running and it continued to update its screen ...
> > >> for a short while, this is the last update before SSH died:
> > >>
> > >>
> > >>    Mem usage:  0k%Phy  5%Kmem
> > >> Mem: KB    REAL            VIRTUAL                      VN PAGER   SWAP
> > >> PAGER
> > >>         Tot   Share      Tot    Share    Free           in   out     in
> > >>  out
> > >> Act  11051k   67868 71051992   255448   61840  count
> > >> All  11051k   67924 71058776   262100          pages
> > >> Proc:
> > >> Interrupts
> > >>   r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt   224
> > >> total
> > >>      25     730  11   724  109  404  101   13             cow       2
> > >> ehci0 16
> > >>                                                           zfod      3
> > >> ehci1 23
> > >>  0.0%Sys   0.1%Intr  0.0%User  0.0%Nice 99.9%Idle         ozfod    16
> > >> cpu0:timer
> > >> |    |    |    |    |    |    |    |    |    |           %ozfod
> > >>  xhci0 264
> > >>                                                           daefr     3 em0
> > >> 265
> > >>                                         50 dtbuf          prcfr    94
> > >> hdac1 266
> > >> Namei     Name-cache   Dir-cache    349167 desvn          totfr
> > >>  ahci0 270
> > >>    Calls    hits   %    hits   %    349155 numvn          react     5
> > >> cpu1:timer
> > >>      121     121 100                253501 frevn          pdwak     1
> > >> cpu2:timer
> > >>                                                           pdpgs    29
> > >> cpu7:timer
> > >> Disks   md0  ada0  ada1 pass0 pass1 pass2                 intrn    12
> > >> cpu3:timer
> > >> KB/t   0.00  0.00  0.00  0.00  0.00  0.00         5318892 wire     41
> > >> cpu6:timer
> > >> tps       0     0     0     0     0     0         9261404 act      12
> > >> cpu5:timer
> > >> MB/s   0.00  0.00  0.00  0.00  0.00  0.00         1598184 inact     6
> > >> cpu4:timer
> > >> %busy     0     0     0     0     0     0                 cache
> > >>  vgapci0
> > >>                                                     61840 free
> > >>                                                    712304 buf
> > >>
> > >>
> > >> Why do I have a Chrome tab using about 6G? What other sort of debugging
> > >> output can be helpful to get to the bottom of this? The machine still
> > >> responds to pings just fine, TCP connections get set up but the SSH
> > >> handshake never completes.
> > >>
> > >> This always happens between 30-50h and is super annoying and has been
> > >> going on for >1year. Help?
> > >>
> > >> Note, I cut the power to the monitor overnight to save electricity, can
> > >> this mess up something in the Radeon card or X server? What combinations
> > >> would be most useful to try next?
> > >>
> > >>
> > > Hi,
> > >
> > > Sounds like a memory leak. Can you track the memory use over time?
> > >
> > > Did you look at the output from:
> > >
> > > vmstat -m ?
> > >
> > > --HPS
> > 
> > 
> > I have noted significant  memory leakage in chromium for some time. If I
> > leave it running overnight, my system is essentially frozen. If I terminate
> > the chromium process, it slowly comes back to life. I always keep a gkrellm
> > session on-screen where the memory and swap utilization is continuously
> > displayed and that clearly shows resources declining.
> 
> That is not what is happening to my system though, it actually
> deadlocks. There's no way to recover from it, it seems.
> 
> So I killed Chromium overnight each day, and I'm at this:
> 
> % top -Sbores
> last pid: 44526;  load averages:  0.10,  0.11,  0.56  up 7+09:53:30    19:33:25
> 156 processes: 2 running, 153 sleeping, 1 waiting
> 
> Mem: 315M Active, 550M Inact, 5671M Wired, 515M Buf, 9324M Free
> ARC: 1852M Total, 541M MFU, 196M MRU, 16K Anon, 93M Header, 1022M Other
> Swap: 4096M Total, 2186M Used, 1910M Free, 53% Inuse
> 
> 
>   PID USERNAME      THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
>  2755 uqs            10  20    0  1697M   311M select  1  47:23   0.00% conky
>  2736 uqs            32  20    0   699M   116M select  7  94:29   0.00% clementine-player
>  3000 uqs            12  20    0  1126M 69380K select  5   9:48   0.00% digikam
>   960 root            9  20    0   448M 59076K select  0 250:22   0.00% Xorg
> 72608 uqs             8  20    0   939M 55432K uwait   5   0:01   0.00% chrome
> 72599 uqs             9  52    0   929M 55116K uwait   0   0:00   0.00% chrome
>  2567 root            1  20    0 89948K 42964K select  1   1:51   0.00% bsnmpd
> 70476 uqs             1  20    0 93656K 25712K select  2   0:05   0.00% xterm
>  2730 uqs             5  20    0   208M 14988K select  1   0:22   0.00% clock-applet
>   880 root            1  20    0 22628K 12500K select  3   0:20   0.00% ntpd
>  2726 uqs             4  20    0   206M 12456K select  6   0:09   0.00% mateweather-applet
> 44352 uqs             1  20    0 75224K 12348K select  4   0:00   0.00% xterm
> 43049 uqs             1  20    0 75224K 11792K select  5   0:00   0.00% xterm
>  3074 uqs             2  20    0   308M  9692K select  1   0:02   0.00% kdeinit4
>  2671 uqs             1  20    0   144M  9488K select  1   0:13   0.00% openbox
>  3072 uqs             1  20    0   210M  8284K select  3   0:00   0.00% kdeinit4
>  2724 uqs             4  20    0   154M  8256K select  2   0:19   0.00% wnck-applet
>  2701 uqs             5  20    0   177M  8144K select  2   0:01   0.00% mate-panel
> 
> 
> 7d running, pretty good. But look closer, the system is doing pretty
> much nothing but did swap out 2G. What?
> 
> > Try closing your chromium at night and see if that fixes the problem.
> 
> It's better, but I'm not sure it's a real fix. I've now turned off
> "hardware acceleration" in Chromium, though chrome://gpu didn't real
> inspire confidence that it was actually using any h/w accel at all.
> 
> 
> > If you have never tried gkrellm (sysutils/gkrellm2), it is a the best
> > system monitor I have found. though pulls in a lot of dependencies. It also
> > can run as a server with remote systems displaying the data. Handy to
> > monitor servers.
> 
> I had a cacti-setup that would also monitor my workstation (through a
> OpenVPN tunnel), but that has bit-rotted and Apache only gives me 500s
> on that cacti URL and nothing in the logs, oh well ...)
> 
> Hooking up a serial console and testing whether DDB works is probably
> the next best step to take ...

Sigh, I forgot to shut down Chrome overnight, and my system is
deadlocked again. I can still switch virtual desktops in X11 and a
running xterm accepts keyboard input, but it is 18:35 now and I see that
my X11 clock has last updated at 17:18 and conky is stuck at 17:14:11

my top-in-a-loop is stuck here and no longer loops:

last pid: 73731;  load averages:  0.23,  0.24,  0.23  up 9+07:34:15    17:14:10
160 processes: 3 running, 146 sleeping, 11 zombie

Mem: 9302M Active, 763M Inact, 5682M Wired, 752M Buf, 113M Free
ARC: 1731M Total, 549M MFU, 129M MRU, 16K Anon, 91M Header, 963M Other
Swap:

  PID USERNAME      THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
  960 root            9  21    0   451M 55796K CPU7    7 299:15   2.98% Xorg
53884 uqs            27  22    0   904M   250M CPU4    4 105:05   2.98% chrome
54081 uqs            15  21    0  3601M  2527M uwait   6  38:11   0.98% chrome
 2736 uqs            33  20    0   697M 84620K select  5  95:38   0.00% clementine-player
 2755 uqs            10  20    0  2111M   721M select  0  60:47   0.00% conky
78943 uqs             1  20    0 21532K  3328K select  7  13:12   0.00% systat
54048 uqs            15  25    5  1093M   159M uwait   4   9:51   0.00% chrome
 3000 uqs            12  20    0  1126M 53248K select  7   9:50   0.00% digikam
53999 uqs            19  25    5  1525M   514M uwait   5   6:28   0.00% chrome
 2703 uqs             1  20    0   169M  5948K select  0   5:40   0.00% mate-volume-control
 2707 uqs             1  20    0 60240K  2516K select  7   5:21   0.00% autocutsel
54077 uqs            15  20    0  1803M   821M uwait   6   3:51   0.00% chrome
78869 uqs             1  20    0 93480K  6636K select  7   3:38   0.00% sshd
38396 uqs             1  20    0 75224K  3896K select  0   3:18   0.00% xterm
 2567 root            1  20    0 89948K 43892K select  6   2:25   0.00% bsnmpd
  876 root            9  52    0 30120K  3468K uwait   7   2:20   0.00% nscd
  883 root            1  20    0 10444K  1708K select  0   2:16   0.00% powerd
 2586 haldaemon       2  20    0 56620K  3920K select  1   2:04   0.00% hald

systat -vm:

   10 users    Load  0.21  0.23  0.23                  Oct 26 17:14
   Mem usage:  0k%Phy  5%Kmem
Mem: KB    REAL            VIRTUAL                      VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act  10545k   51036 67451248   249344   61912  count   120     3
All  10550k   55504 67480212   275720          pages   807     3
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt    149 ioflt   845 total
      6     781  12  9327 6923  24k  318   18 6649     20 cow       2 ehci0 16
                                                     5064 zfod      3 ehci1 23
 4.7%Sys   0.0%Intr  1.7%User  0.0%Nice 93.6%Idle      16 ozfod   106 cpu0:timer
|    |    |    |    |    |    |    |    |    |           %ozfod       xhci0 264
==>                                                   927 daefr    45 em0 265
                                        51 dtbuf      545 prcfr    94 hdac1 266
Namei     Name-cache   Dir-cache    349167 desvn     1787 totfr   175 ahci0 270
   Calls    hits   %    hits   %    337762 numvn          react    50 cpu1:timer
     935     901  96                170078 frevn          pdwak    48 cpu3:timer
                                                  1435985 pdpgs    66 cpu2:timer
Disks   md0  ada0  ada1 pass0 pass1 pass2                 intrn    77 cpu7:timer
KB/t   0.00 22.59  0.00  0.00  0.00  0.00         5819100 wire     75 cpu4:timer
tps      17   168     0     0     0     0         9723008 act      51 cpu6:timer
MB/s   0.00  3.70  0.00  0.00  0.00  0.00          636300 inact    53 cpu5:timer
%busy     0     3     0     0     0     0                 cache       vgapci0
                                                    61912 free
                                                   770732 buf

I'm unable to scroll back the vmstat -m output in my tmux pane (running on a
different system, this is super strange), so all I can show is this:

     inodedep    16 12293K       -  2770872  512
    bmsafemap     9    49K       -  2136243  256,8192
       newblk    28 24582K       -  4111401  256
     indirdep     4     1K       -   865983  128,16384
     freefrag     6     1K       -   682027  128
     freeblks     3     1K       -  1318985  256
     freefile     3     1K       -  1272832  64
       diradd     2     1K       -  1306741  128
        mkdir     0     0K       -     5120  128
       dirrem     0     0K       -  1305358  128
    newdirblk     0     0K       -     2610  64
     freework     9     1K       -  1566489  64,128
        sbdep     0     0K       -    45661  64
     savedino     0     0K       -   186280  256
      softdep     6     3K       -        6  512
  ufs_dirhash  1533   767K       -   109513  16,32,64,128,256,512,1024,2048,4096
    ufs_quota     1  2048K       -        1
    ufs_mount    18    55K       -       18  512,2048,4096,8192
    vm_pgdata     1  2048K       -        2  128
      UMAHash    23  5385K       -      108  512,1024,2048,4096,8192,16384,32768,65536
      memdesc     1     4K       -        1  4096
     atkbddev     2     1K       -        2  64
      entropy     1     1K       -   894489  32,4096
       apmdev     1     1K       -        1  128
   madt_table     0     0K       -        1  4096
     SCSI ENC    25   100K       -   120744  16,64,256,2048,32768
      io_apic     1     2K       -        1  2048
          MCA    18     3K       -       18  64,128
          msi    16     2K       -       16  128
     nexusdev     5     1K       -        5  16
         hdaa     5    54K       -        5  2048,16384,32768
         hdac     1     1K       -        1  1024
        hdacc     1     1K       -        1  32
        linux    29     2K       -       29  64
      solaris 295750 228696K       - 140520323  16,32,64,128,256,512,1024,2048,4096,8192,32768
   kstat_data     6     1K       -        6  64
     eli data    22     4K       -  6218901  64,256,512,1024,2048,4096,8192,16384,32768,65536
         ksem     1     8K       -        1  8192
  nullfs_hash     1  2048K       -        1
  nullfs_node     9     1K       -       41  64
 nullfs_mount     9     1K       -        9  32
  fdesc_mount     1     1K       -        1  16
     gem_name    46    14K       -      122  32,4096
   drm_global     2     1K       -        2  128,256
      drm_dma     1     1K       -        1  32
    drm_sarea     1     1K       -        1  16
   drm_driver    91  2278K       -      125  16,32,64,128,256,512,1024,2048,4096,8192,32768
    drm_minor     2     1K       -        2  128
    drm_files     1     1K       -        3  512
drm_ctxbitmap     1     4K       -        1  4096
     drm_sman    41     6K       -       42  128
  drm_hashtab     3  4129K       -        4  128,32768
      drm_kms    69    19K       -      163  16,32,64,128,256,512,2048,4096,8192
   drm_vblank     7     1K       -        7  32,256
       ttm_pd    16    17K       -       18  16,128,2048,65536
     ttm_rman     2     1K       -        2  256
     ttm_zone     2     1K       -        2  64
  ttm_poolmgr     1     1K       -        1  512

Now what?

The xterm I have running locally with a stuck top is showing the top 3 chrome
processes in pfault state and it has "Swap: 11M In" in the header, so clearly
11.x is prone to deadlock during page faults and or swapping. It has last
updated on 17:14:13 (compare to the other top at 17:14:10 not showing pfault
state yet).