In message <20160405092712.131ee52c_at_freyja.zeit4.iv.bundesimmobilien.de>, "O. Hartmann" writes:
> On Mon, 04 Apr 2016 23:46:08 -0700
> Cy Schubert <Cy.Schubert_at_komquats.com> wrote:
>
> > In message <20160405082047.670d7241_at_freyja.zeit4.iv.bundesimmobilien.de>, "O. Hartmann" writes:
> > > On Sat, 02 Apr 2016 16:14:57 -0700
> > > Cy Schubert <Cy.Schubert_at_komquats.com> wrote:
> > >
> > > > In message <20160402231955.41b05526.ohartman_at_zedat.fu-berlin.de>, "O. Hartmann" writes:
> > > > > On Sat, 2 Apr 2016 11:39:10 +0200
> > > > > "O. Hartmann" <ohartman_at_zedat.fu-berlin.de> wrote:
> > > > >
> > > > > > On Sat, 2 Apr 2016 10:55:03 +0200
> > > > > > "O. Hartmann" <ohartman_at_zedat.fu-berlin.de> wrote:
> > > > > >
> > > > > > > On Sat, 02 Apr 2016 01:07:55 -0700
> > > > > > > Cy Schubert <Cy.Schubert_at_komquats.com> wrote:
> > > > > > >
> > > > > > > > In message <56F6C6B0.6010103_at_protected-networks.net>, Michael Butler writes:
> > > > > > > > > -current is not great for interactive use at all. The strategy of
> > > > > > > > > pre-emptively dropping idle processes to swap is hurting .. big time.
> > > > > > > >
> > > > > > > > FreeBSD doesn't "preemptively" or arbitrarily push pages out to disk. LRU
> > > > > > > > doesn't do this.
> > > > > > > >
> > > > > > > > > Compare inactive memory to swap in this example ..
> > > > > > > > >
> > > > > > > > > 110 processes: 1 running, 108 sleeping, 1 zombie
> > > > > > > > > CPU:  1.2% user,  0.0% nice,  4.3% system,  0.0% interrupt, 94.5% idle
> > > > > > > > > Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free
> > > > > > > > > Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse
> > > > > > > >
> > > > > > > > To analyze this you need to capture vmstat output. You'll see the free
> > > > > > > > pool dip below a threshold and pages go out to disk in response. If you
> > > > > > > > have daemons with small working sets, pages that are not part of the
> > > > > > > > working sets for daemons or applications will eventually be paged out.
> > > > > > > > This is not a bad thing. In your example above, the 281 MB of UFS buffers
> > > > > > > > are more active than the 917 MB paged out. If it's paged out and never
> > > > > > > > used again, then it doesn't hurt. However, the 281 MB of buffers saves
> > > > > > > > you I/O. The inactive pages are part of your free pool that were active
> > > > > > > > at one time but now are not. They may be reclaimed, and if they are,
> > > > > > > > you've just saved more I/O.
> > > > > > > >
> > > > > > > > Top is a poor tool to analyze memory use. Vmstat is the better tool to
> > > > > > > > help understand memory use. Inactive memory isn't a bad thing per se.
> > > > > > > > Monitor page outs, scan rate and page reclaims.
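For what it's worth, a minimal way to capture those numbers (only a sketch,
assuming nothing beyond the stock base-system tools) is something like:

  # sample VM activity every 5 seconds and keep a copy for later analysis;
  # watch the fre, po (page-out) and sr (scan rate) columns
  vmstat 5 | tee /var/tmp/vmstat.log

  # cumulative counters since boot; run it twice a few minutes apart and
  # compare the paged-out / reactivated / page daemon lines
  vmstat -s

  # the same counters via sysctl, if you prefer to diff two snapshots
  sysctl vm.stats.vm

The log file name is arbitrary; the point is to have a record of the scan
rate and page-out columns at the moment a session drops, rather than
eyeballing top.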
> > > > > > >
> > > > > > > I give up! Tried to check via ssh/vmstat what is going on. Last lines
> > > > > > > before the broken pipe:
> > > > > > >
> > > > > > > [...]
> > > > > > > procs    memory      page                      disks     faults        cpu
> > > > > > >  r b w   avm   fre   flt re pi po     fr   sr ad0 ad1   in     sy     cs us sy id
> > > > > > > 22 0 22 5.8G  1.0G 46319  0  0  0  55721 1297   0   4  219  23907   5400 95  5  0
> > > > > > > 22 0 22 5.4G  1.3G 51733  0  0  0  72436 1162   0   0  108  40869   3459 93  7  0
> > > > > > > 15 0 22  12G  1.2G 54400  0 27  0  52188 1160   0  42  148  52192   4366 91  9  0
> > > > > > > 14 0 22  12G  1.0G 44954  0 37  0  37550 1179   0  39  141  86209   4368 88 12  0
> > > > > > > 26 0 22  12G  1.1G 60258  0 81  0  69459 1119   0  27  123 779569 704359 87 13  0
> > > > > > > 29 3 22  13G  774M 50576  0 68  0  32204 1304   0   2  102 507337 484861 93  7  0
> > > > > > > 27 0 22  13G  937M 47477  0 48  0  59458 1264   3   2  112  68131  44407 95  5  0
> > > > > > > 36 0 22  13G  829M 83164  0  2  0  82575 1225   1   0  126  99366  38060 89 11  0
> > > > > > > 35 0 22 6.2G  1.1G 98803  0 13  0 121375 1217   2   8  112  99371   4999 85 15  0
> > > > > > > 34 0 22  13G  723M 54436  0 20  0  36952 1276   0  17  153  29142   4431 95  5  0
> > > > > > > Fssh_packet_write_wait: Connection to 192.168.0.1 port 22: Broken pipe
> > > > > > >
> > > > > > > This makes this crap system completely unusable. The server (FreeBSD
> > > > > > > 11.0-CURRENT #20 r297503: Sat Apr 2 09:02:41 CEST 2016 amd64) in question
> > > > > > > was doing a poudriere bulk job. I cannot even determine which terminal
> > > > > > > goes down first - another one, idle for much longer than the one showing
> > > > > > > the "vmstat 5" output, is still alive!
> > > > > > >
> > > > > > > I consider this a serious bug, and whatever happened since this "fancy"
> > > > > > > update is no benefit. :-(
> > > > > >
> > > > > > By the way - it might be of interest and some hint.
> > > > > >
> > > > > > One of my boxes is acting as server and gateway. It utilises NAT and IPFW.
> > > > > > When it is under high load, as it was today, passing the network flow from
> > > > > > the ISP to the clients on the inside is sometimes extremely slow. I do not
> > > > > > consider this the reason for the collapsing ssh sessions, since the
> > > > > > incident also happens under no load, but in the overall view of the
> > > > > > problem this could be a hint - I hope.
> > > > >
> > > > > I just checked on one box that "broke the pipe" very quickly after I
> > > > > started poudriere, while it had done well for a couple of hours before the
> > > > > pipe broke. It seems to be load-dependent when the ssh session gets
> > > > > wrecked, but more importantly, after the long-haul poudriere run I
> > > > > rebooted the box, tried again, and got the mentioned broken pipe a couple
> > > > > of minutes after poudriere ran.
> > > > > Then I left the box alone for several hours, logged in again and checked
> > > > > the swap. Although there had been no load or other pressure for hours, 31%
> > > > > of swap was still in use (the box has 16 GB of RAM and is propelled by a
> > > > > XEON E3-1245 V2).
> > > >
> > > > 31%! Is it *actively* paging, or was the 31% previously paged out and no
> > > > paging is *currently* being experienced? 31% of how much swap space in
> > > > total?
> > > >
> > > > Also, what does ps aumx or ps aumxww say? Pipe it to head -40 or similar.
> > >
> > > On FreeBSD 11.0-CURRENT #4 r297573: Tue Apr 5 07:01:19 CEST 2016 amd64, local
> > > network, no NAT. The ssh session got stuck in the middle of administering,
> > > after leaving the console/ssh session alone for a couple of minutes:
> > >
> > > root      2064  0.0  0.1 91416 8492  -  Is  07:18  0:00.03 sshd: hartmann [priv] (sshd)
> > > hartmann  2108  0.0  0.1 91416 8664  -  I   07:18  0:07.33 sshd: hartmann_at_pts/0 (sshd)
> > > root     72961  0.0  0.1 91416 8496  -  Is  08:11  0:00.03 sshd: hartmann [priv] (sshd)
> > > hartmann 72970  0.0  0.1 91416 8564  -  S   08:11  0:00.02 sshd: hartmann_at_pts/1 (sshd)
> > >
> > > The situation is worse, and I consider this a serious bug.
> >
> > There's not a lot to go on here. Do you have physical access to the machine
> > to pop into DDB and take a look? You did say you're using a lot of swap -
> > IIRC 30%. You didn't answer how much 30% was of. Without more data I can't
> > help you. At best I can take wild guesses, but that won't help you. Try to
> > answer the questions I asked last week and we can go further. Until then
> > all we can do is wildly guess.
>
> Hello Cy, sorry for the lack of information.
>
> The machine in question is not accessible at this very moment. The box has 16
> GB of physical RAM, 32 GB of swap (on SSD) and a 4-core/8-thread CPU (I think

So that's 10 GB of swap used. Hmmm. Memory leak? What? We need to investigate
this avenue.

4-core/8-thread: a total of 8 hardware threads. It had 22-29 active processes
in the run queue (based on your vmstat output). It would appear your box is
simply overloaded.

> that is also important due to the allocation of arbitrary memory). The
> problem I described arose when using poudriere. The box uses 6 builders, but
> each builder can, as I understand, spawn several instances of jobs for
> compiling/linking etc.

I'm thinking you have too much loaded onto the box (22-29 active processes in
the run queue and 10 GB of swap used). I'm currently running two poudriere
builds on two separate machines, each dual-core with no hyperthreading (AMD X2
5200+ and 5000+), with three builders, each single-threaded: load average of
about 3-4.

> But - that box is only a placeholder for the weirdness that is going on
> (despite the fact that it is using NAT, since it is attached to a DSL line).
>
> In contrast, the system I face today at work is not(!) behind NAT and doesn't
> have the "toy" network attachment. The box I'm accessing now has 16 GB of
> physical RAM and two sockets, each populated with an oldish 4-core XEON 5XXX
> from the Core2Duo age (no SMT). That box does not run poudriere, only
> postgresql and some other services.

What are the performance stats there? CPU, swap, free memory (memory used),
scan rate, reclaim rate, and also load average?
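Roughly, something like the following would cover that - just a sketch using
base-system commands, captured while the problem is actually happening; use
whatever equivalents you prefer:

  uptime                  # load average
  swapinfo                # swap devices and how much is currently in use
  vmstat 5                # free memory, page-outs (po) and scan rate (sr)
  vmstat -s               # cumulative paging / reclaim counters since boot
  iostat -x 5             # per-disk activity, to rule out a sick disk
  ps aumxww | head -40    # the biggest memory consumers, as asked earlier

A couple of samples taken a few minutes apart are far more useful than a
single snapshot.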
Do you use ZFS, UFS or both? (I have both, and if I use UFS then swap gets
used because the UFS buffer cache displaces infrequently used pages, though
heavy use of the ZFS cache will push some VM out to swap too. Not a big deal,
as I'm using memory I've paid for rather than letting it sit idle. I wouldn't
be this cheap at $JOB though.)

> In February, I was able to push the other box in question (behind NAT, as a
> remark) to its limits using poudriere with 8 builders. The network became
> slow, since the box also acts as gateway, but it never failed, broke or
> dropped the ssh session due to "broken pipe". Without changing the config,
> except the base system's sources for CURRENT, for about two or at most three
> weeks now I have been getting these weird drops. And this is why I started
> "whining" - there is also a drop in performance when compiling world, which
> lengthens the compile time by ~ 5 - 10 minutes on the NATed box.
>
> I'm fully aware of being on CURRENT, but I think it is quite reasonable to
> report such weird things happening now. I did not receive any word about
> dramatic changes that could trigger such behaviour. And as I understand the
> thread we are in here, a change has been made that results in more
> aggressive swapping of inactive processes.
>
> I tried to stop all services on the boxes (postgresql, icinga2, http etc.) to
> check whether those could force the kernel to swap a process. But the loss of

Swapping isn't really performed until memory is critical. (Swapping is:
swapping whole processes out to disk.) Paging, OTOH, is what you're seeing.
That's classic LRU. Scan rate is the determining factor with LRU (or
unreferenced interval count, if you will -- the amount of time since the page
was last referenced).

> the ssh connection, and the very strange behaviour of the ssh connection
> becoming unresponsive, is erratic. That means it sometimes comes very fast,
> after seconds of not touching the xterm/ssh to the remote box, and sometimes
> it takes up to 30 minutes, even under load. So there is probably a problem
> with me understanding this new feature ...

Which new feature?

Erratic and unresponsive sessions could be anything. Your stats look way off,
indicating the one box is overloaded CPU-wise and memory-wise. Erratic
behavior could be due to a number of other factors. What about the other box?
CPU, memory stats, swap... What kind of workload runs on it? And what kind of
NICs do the two boxes have?

It's probably buried in another email, but do you use UFS, ZFS or both? IIRC
only UFS. (UFS and ZFS don't mix. They're like oil and water. The UFS buffer
cache and the ZFS ARC compete for memory. Throw applications into the mix and,
obviously, minor to moderate paging will result.)

I'll cut to the chase. Without precise information, the areas of possible
investigation are too much load (CPU and memory), a possible memory leak (?),
or a possible NIC or network issue causing the disconnects. Also look at
similarities between the two systems -- better to examine both and discover
the similarities here than through a narrow lens.

-- 
Cheers,
Cy Schubert <Cy.Schubert_at_komquats.com> or <Cy.Schubert_at_cschubert.com>
FreeBSD UNIX:  <cy_at_FreeBSD.org>   Web:  http://www.FreeBSD.org

	The need of the many outweighs the greed of the few.