On 28.03.21 at 16:39, Stefan Esser wrote:

> After a period of high load, my now idle system needs 4 to 10 seconds to
> run any trivial command - even after 20 minutes of no load ...
>
> I have run some Monte Carlo simulations for a few hours, with initially 35
> processes running in parallel for some 10 seconds each.
>
> The load decreased over time, since some parameter sets were faster to
> process. All in all, some 63,000 processes ran within about 3 hours.
>
> When the system became idle, interactive performance was very bad. Running
> any trivial command (e.g. uptime) takes some 5 to 10 seconds. Since I need
> this system working, I plan to reboot it later today, but will keep it in
> this state for some more time to see whether it persists or whether the
> system recovers from it.
>
> Any ideas what might cause such a system state?

It seems that Mateusz Guzik was right to mention performance issues when the
system is very low on vnodes. (Thanks!)

I have been able to reproduce the issue and have checked the vnode stats:

kern.maxvnodes: 620370
kern.minvnodes: 155092
vm.stats.vm.v_vnodepgsout: 6890171
vm.stats.vm.v_vnodepgsin: 18475530
vm.stats.vm.v_vnodeout: 228516
vm.stats.vm.v_vnodein: 1592444
vfs.wantfreevnodes: 155092
vfs.freevnodes: 47                        <----- obviously too low ...
vfs.vnodes_created: 19554702
vfs.numvnodes: 621284
vfs.cache.debug.vnodes_cel_3_failures: 0
vfs.cache.stats.heldvnodes: 6412

The freevnodes value stayed in this region over several minutes, with typical
program start times (e.g. for "uptime") in the region of 10 to 15 seconds.
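The starvation condition in the numbers above (vfs.freevnodes collapsed to 47 while vfs.wantfreevnodes is 155092) can be spotted mechanically. A minimal sketch, using the values quoted in this message; the parsing assumes the usual "name: value" layout of `sysctl` output, and the 1% threshold is an arbitrary illustration, not a kernel constant:

```python
# Flag vnode starvation from sysctl-style "name: value" lines.
# SAMPLE holds the values quoted in the message above.
SAMPLE = """\
kern.maxvnodes: 620370
kern.minvnodes: 155092
vfs.wantfreevnodes: 155092
vfs.freevnodes: 47
vfs.numvnodes: 621284
"""

def parse_sysctl(text):
    """Parse 'name: value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(":")
        stats[name.strip()] = int(value.strip())
    return stats

def vnode_starved(stats, threshold=0.01):
    """True if free vnodes fall below `threshold` of the wanted count.

    The 1% threshold is only for illustration, not a kernel value.
    """
    return stats["vfs.freevnodes"] < threshold * stats["vfs.wantfreevnodes"]

stats = parse_sysctl(SAMPLE)
print(vnode_starved(stats))  # 47 is far below 1% of 155092, so True
```

On a live FreeBSD system the same dictionary could be fed from the output of `sysctl vfs kern.maxvnodes kern.minvnodes` instead of the hard-coded sample.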
After raising maxvnodes to 2,000,000 from 600,000, system performance is
restored and I get:

kern.maxvnodes: 2000000
kern.minvnodes: 500000
vm.stats.vm.v_vnodepgsout: 7875198
vm.stats.vm.v_vnodepgsin: 20788679
vm.stats.vm.v_vnodeout: 261179
vm.stats.vm.v_vnodein: 1817599
vfs.wantfreevnodes: 500000
vfs.freevnodes: 205988                    <----- still a lot lower than wantfreevnodes
vfs.vnodes_created: 19956502
vfs.numvnodes: 912880
vfs.cache.debug.vnodes_cel_3_failures: 0
vfs.cache.stats.heldvnodes: 20702

I do not know why the performance impact is so high - there are a few free
vnodes (more than required for the shared libraries needed to start e.g. the
uptime program). Most probably each attempt to get a vnode triggers a
clean-up attempt that runs for a significant time, but has no chance of
actually getting near the goal of 155k or 500k free vnodes.

Anyway, kern.maxvnodes can be changed at run time, so this is easy to fix.

It seems that no message is logged to report this situation. A rate-limited
hint to raise the limit would help other affected users.

Regards, Stefan
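The two sysctl dumps also suggest how the related limits are derived: in both cases kern.minvnodes and vfs.wantfreevnodes equal one quarter of kern.maxvnodes (620370 / 4 rounds down to 155092; 2000000 / 4 = 500000). A small sketch of that relationship; note the quarter ratio is inferred purely from the posted numbers, not confirmed against the kernel source:

```python
def derived_wantfreevnodes(maxvnodes):
    """Hypothetical helper: wantfreevnodes as one quarter of maxvnodes.

    The 1/4 ratio is inferred from the two sysctl dumps quoted above,
    not taken from the FreeBSD kernel source.
    """
    return maxvnodes // 4

# Before tuning: kern.maxvnodes 620370 -> vfs.wantfreevnodes 155092
print(derived_wantfreevnodes(620370))
# After tuning: kern.maxvnodes 2000000 -> vfs.wantfreevnodes 500000
print(derived_wantfreevnodes(2000000))
```

The runtime change itself is a single `sysctl kern.maxvnodes=2000000`; adding the same line to /etc/sysctl.conf makes it persist across reboots.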
This archive was generated by hypermail 2.4.0 : Wed May 19 2021 - 11:41:27 UTC