Re: CURRENT slow and shaky network stability

From: Don Lewis <truckman_at_FreeBSD.org>
Date: Mon, 28 Mar 2016 14:52:09 -0700 (PDT)
On 28 Mar, O. Hartmann wrote:
> On Sat, 26 Mar 2016 14:26:45 -0700 (PDT)
> Don Lewis <truckman_at_FreeBSD.org> wrote:
> 
>> On 26 Mar, Michael Butler wrote:
>> > -current is not great for interactive use at all. The strategy of
>> > pre-emptively dropping idle processes to swap is hurting .. big time.
>> > 
>> > Compare inactive memory to swap in this example ..
>> > 
>> > 110 processes: 1 running, 108 sleeping, 1 zombie
>> > CPU:  1.2% user,  0.0% nice,  4.3% system,  0.0% interrupt, 94.5% idle
>> > Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free
>> > Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse
>> > 
>> >   PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
>> >  1819 imb              1  28    0   213M 11284K select  1 147:44   5.97% gkrellm
>> > 59238 imb             43  20    0   980M   424M select  0  10:07   1.92% firefox
>> > 
>> >  .. it shouldn't start randomly swapping out processes because they're
>> > used infrequently when there's more than enough RAM to spare ..  
>> 
>> I don't know what changed, and probably something can use some tweaking,
>> but paging out idle processes isn't always the wrong thing to do.  For
>> instance if I'm using poudriere to build a bunch of packages and its
>> heavy use of tmpfs is pushing the machine into many GB of swap usage, I
>> don't want interactive use like:
>> 	vi foo.c
>> 	cc foo.c
>> 	vi foo.c
>> to suffer because vi and cc have to be read in from a busy hard drive
>> each time while unused console getty and idle sshd processes in a bunch
>> of jails are still hanging on to memory even though they haven't
>> executed any instructions since shortly after the machine was booted
>> weeks ago.
>> 
>> > It also shows up when trying to reboot .. on all of my gear, 90 seconds
>> > of "fail-safe" time-out is no longer enough when a good proportion of
>> > daemons have been dropped onto swap and must be brought back in to flush
>> > their data segments :-(  
>> 
>> That's a different and known problem.  See:
>> <https://svnweb.freebsd.org/base/releng/10.3/bin/csh/config_p.h?revision=297204&view=markup>
> 
> CURRENT has been rendered unusable and faulty. Updating ports for poudriere ends in this
> error (broken pipe) on the remote console:
> 
>  [~] poudriere ports -u -p head
> [00:00:00] ====>> Updating portstree "head"
> [00:00:00] ====>> Updating the ports tree... done
> root@gate [~] Fssh_packet_write_wait: Connection to 192.168.250.111 port 22: Broken pipe
> 
> 
> Although the machine is not under load, several processes get idled/paged out over
> time and never recover; the connection is then lost and the whole thing is unusable :-(

I'm definitely not seeing that here.  This is getting close to the end
of a big poudriere run:

last pid: 82549;  load averages: 20.05, 20.72, 23.51    up 5+12:34:14  12:51:55
144 processes: 20 running, 109 sleeping, 15 stopped
CPU: 85.3% user,  0.0% nice, 14.7% system,  0.0% interrupt,  0.0% idle
Mem: 1082M Active, 19G Inact, 9718M Wired, 249M Buf, 1095M Free
ARC: 3841M Total, 2039M MFU, 642M MRU, 3395K Anon, 111M Header, 1044M Other
Swap: 40G Total, 9691M Used, 31G Free, 23% Inuse, 196K In

At the moment, openoffice-4, openoffice-devel, libreoffice, and chromium
are all being built and are using tmpfs for "wrkdir data localbase", so
there are many GB of data in tmpfs, which is the reason for the high
inact and swap usage.  I just hit the return key in an idle (for a
couple of hours) terminal window containing an ssh login session to the
same machine.  I got a fresh command prompt essentially instantaneously.
It couldn't have taken more than a couple hundred milliseconds to wake
up and page in the idle sshd and shell processes on the build server.
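
As a rough cross-check of the swap figures in the top output above, the
"Inuse" percentage can be recomputed from the Total and Used fields.  This
is only a minimal Python sketch, not anything top itself does; the helper
names (parse_size, parse_summary_line, swap_in_use_percent) are made up
for illustration, and it assumes the size suffixes are powers of 1024.

```python
import re

# Multipliers for the size suffixes seen in the top summary lines.
# Assumption: each suffix is a power of 1024.
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size such as '9691M' or '40G' to bytes."""
    match = re.fullmatch(r"(\d+)([BKMGT])", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def parse_summary_line(line):
    """Turn 'Swap: 40G Total, 9691M Used, ...' into {'Total': bytes, ...}."""
    _, _, rest = line.partition(":")
    fields = {}
    for item in rest.split(","):
        parts = item.split()
        # Keep only '<size> <label>' pairs; entries like '23% Inuse' are skipped.
        if len(parts) == 2 and re.fullmatch(r"\d+[BKMGT]", parts[0]):
            fields[parts[1]] = parse_size(parts[0])
    return fields


def swap_in_use_percent(swap_line):
    """Percentage of swap in use, from the Used and Total fields."""
    fields = parse_summary_line(swap_line)
    return 100.0 * fields["Used"] / fields["Total"]


if __name__ == "__main__":
    line = "Swap: 40G Total, 9691M Used, 31G Free, 23% Inuse, 196K In"
    print(f"{swap_in_use_percent(line):.1f}% of swap in use")
```

For the line from the poudriere run this works out to about 23.7%, and
for the 4096M/917M example quoted earlier to about 22.4%, which is in
line with the 23% and 22% that top displayed.  The inputs are already
rounded by top, so the result is approximate.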

[a couple hours later, after poudriere is done and all tmpfs is gone]

last pid: 66089;  load averages:  0.13,  1.59,  4.61    up 5+14:14:33  14:32:14
71 processes:  1 running, 55 sleeping, 15 stopped
CPU:  3.1% user,  0.0% nice,  0.0% system,  0.0% interrupt, 96.9% idle
Mem: 58M Active, 85M Inact, 12G Wired, 249M Buf, 19G Free
ARC: 6249M Total, 2792M MFU, 2246M MRU, 16K Anon, 133M Header, 1078M Other
Swap: 40G Total, 81M Used, 40G Free

[after tracking down and exiting all of those stopped processes]

last pid: 66103;  load averages:  0.20,  0.99,  3.80    up 5+14:17:18  14:34:59
56 processes:  1 running, 55 sleeping
CPU:  0.0% user,  0.0% nice,  0.1% system,  0.1% interrupt, 99.9% idle
Mem: 57M Active, 88M Inact, 12G Wired, 249M Buf, 19G Free
ARC: 6251M Total, 2793M MFU, 2247M MRU, 16K Anon, 133M Header, 1078M Other
Swap: 40G Total, 63M Used, 40G Free

The biggest chunk of the 63 MB of swap appears to be nginx.  Its
process size is 29 MB, but it has zero resident.  It hasn't executed any
code since it was first started when I booted the system several days
ago.  Other consumers appear to be getty and sshd and syslogd in various
untouched jails.
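
One way to look for such processes is to scan ps output for entries with
a non-zero virtual size and zero resident memory.  The sketch below is
an illustration only: it assumes the text came from something like
"ps -axo pid,rss,vsz,comm" (sizes in kilobytes), the function name
swapped_out is hypothetical, and the SAMPLE rows are invented, not taken
from the machine described here.  A resident size of zero is only a hint
that a process has been paged out, not proof of how much swap it holds.

```python
# Hypothetical sample in the layout of "ps -axo pid,rss,vsz,comm".
# These rows are made up for illustration.
SAMPLE = """\
  PID   RSS    VSZ COMMAND
  612     0  29696 nginx
  701     0  10484 getty
  845  2104  57812 sshd
"""


def swapped_out(ps_output):
    """Return (pid, vsz_kb, command) for rows with RSS == 0, largest first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, rss, vsz, command = line.split(None, 3)
        # Ignore rows with no address space at all.
        if int(rss) == 0 and int(vsz) > 0:
            rows.append((int(pid), int(vsz), command.strip()))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for pid, vsz_kb, command in swapped_out(SAMPLE):
        print(f"{pid:6d} {vsz_kb / 1024:7.1f} MB  {command}")
```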


I've seen reports that r296137 and r297267 show the ssh problem, but
this machine is in the middle with r297204 and I don't see it.

As mentioned previously, I'm not running Xorg and a bunch of bloated
X11 clients on this machine.  Those make fat targets for having RAM
taken from them, which would probably make my interactive experience
less pleasant, but that should still not affect ssh.

On my FreeBSD 10 machine, which has only 8 GB of RAM, my experience is
that Firefox gets pretty bloated after a while.  It's currently at 2.6
GB (with 2.8 GB of swap in use, since I've got some other RAM hogs
running as well) and I'm not seeing any problems.  When it gets up into
the 4-5 GB range, things can start to get pretty laggy, but even then I
don't see problems with ssh.  The biggest problem with Firefox seems to
be JavaScript, which seems to leak memory like a sieve.  Making heavy
use of the NoScript extension is the only way to keep Firefox usable.

The only thing I can think of is that this is triggered by something in
the machine configuration or the specific hardware.  I'm running a
GENERIC kernel and the only non-standard modification to /usr/src is the
dummynet AQM patchset.  The latter should have no effect since I'm not
using ipfw on this machine.

If I get a chance, I'll try booting my FreeBSD 11 machine with less RAM to
see if that is a trigger.
Received on Mon Mar 28 2016 - 19:52:19 UTC
