Re: [SOLVED] Re: Strange behavior after running under high load

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Sun, 4 Apr 2021 20:52:57 +0300
On Sun, Apr 04, 2021 at 08:45:41AM -0600, Warner Losh wrote:
> On Sun, Apr 4, 2021, 5:51 AM Mateusz Guzik <mjguzik_at_gmail.com> wrote:
> 
> > On 4/3/21, Poul-Henning Kamp <phk_at_phk.freebsd.dk> wrote:
> > > --------
> > > Mateusz Guzik writes:
> > >
> > >> It is high because of this:
> > >>     msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk", hz);
> > >>
> > >> i.e. it literally sleeps for 1 second.
> > >
> > > Before the line looked like that, it slept on "lbolt", aka the
> > > "lightning bolt", which was woken once a second.
> > >
> > > The calculations which come up with those "constants" have always
> > > been utterly bogus math, not quite "square-root of shoe-size
> > > times sun-angle in Patagonia", but close.
> > >
> > > The original heuristic came from university environments with tons of
> > > students doing assignments and nethack behind VT102 terminals, on
> > > filesystems where files only seldom grew past 100KB, so it made sense
> > > to scale the number of vnodes to the amount of RAM in the system, because
> > > that also scaled the size of the buffer-cache.
> > >
> > > With a merged VM buffer-cache, whatever validity that heuristic had
> > > was lost, and we tweaked the bogomath in various ways until it
> > > seemed to mostly work, trusting the users for whom it did not to
> > > tweak things themselves.
> > >
> > > Please don't tweak the Finagle Constants again.
> > >
> > > Rip all that crap out and come up with something fundamentally better.
> > >
> >
> > Some level of pacing is probably useful to control total memory use --
> > there can be A LOT of memory tied up in the mere fact that a vnode is
> > fully cached. IMO the thing to do is to come up with some watermarks,
> > to be revisited every 1-2 years, and to change the behavior when they
> > get exceeded -- try to whack some stuff, but in the face of trouble
> > just go ahead and allocate instead of sleeping for a second. Should
> > the load spike sort itself out, vnlru will slowly get things down to
> > the watermark. If the watermark is too low, maybe it can autotune.
> > The bottom line is that even with the current idea of limiting the
> > preferred total vnode count, the corner-case behavior can be
> > drastically better, suffering SOME perf loss from recycling vnodes,
> > but not sleeping for a second for every single one.
> >
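For illustration only, the watermark scheme just quoted might look like
the following minimal sketch in plain C; every name in it is invented,
and it is not the actual sys/kern/vfs_subr.c code:

    #include <stdbool.h>

    static unsigned long vn_numvnodes;          /* current vnode count */
    static unsigned long vn_watermark = 100000; /* revisited every 1-2 years */

    /* Placeholder: try to whack a cached vnode, if one can be reclaimed. */
    static bool
    vn_try_recycle(void)
    {
        return (false);
    }

    static void
    vn_alloc_paced(void)
    {
        if (vn_numvnodes >= vn_watermark) {
            /*
             * Over the watermark: attempt reclamation, but on failure
             * allocate anyway instead of sleeping for a second; vnlru
             * slowly brings the count back down once the load spike
             * sorts itself out.
             */
            (void)vn_try_recycle();
        }
        vn_numvnodes++;     /* the actual vnode allocation would go here */
    }

The point of the sketch is the control flow: reclamation is attempted,
but allocation never blocks for a full second per vnode.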
> 
> I'd suggest that going directly to a PID controller would be better
> than watermarks: it would give a smoother response than high/low
> watermarks would. While you'd still need some level to keep things at,
> the laundry code has shown that the precise value of that level is
> less critical than it is with watermarks.
But what would you provide as the input to the PID controller, and what
would the targets be?
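
For reference, a generic PID step looks like the sketch below (gains and
names invented); the open question above is exactly what to supply for
'measured' and 'target':

    /* Generic PID controller step; all names and gains are invented. */
    struct pid {
        double kp, ki, kd;  /* tuning gains */
        double integral;    /* accumulated error */
        double prev_error;  /* previous error, for the derivative term */
    };

    /*
     * One control step: 'measured' could be the current vnode count (or
     * the KVA consumed by vnodes), 'target' the desired level, and the
     * output could be read as how many vnodes to reclaim this interval.
     */
    static double
    pid_step(struct pid *p, double target, double measured, double dt)
    {
        double error = measured - target;
        double deriv = (error - p->prev_error) / dt;

        p->integral += error * dt;
        p->prev_error = error;
        return (p->kp * error + p->ki * p->integral + p->kd * deriv);
    }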

The main reason for the (almost) hard cap on the number of vnodes is not
that an excessive number of vnodes is harmful by itself.  Each allocated
vnode typically implies the existence of several second-order allocations
that accumulate into significant KVA usage:
- the filesystem inode
- the vm object
- namecache entries
There are usually even more, third-order, allocations; for instance, a UFS
inode carries a pointer to the in-RAM copy of the dinode, and possibly an
EA area.  And of course, a vnode names the pages in the page cache owned
by the corresponding file, i.e. the number of allocated vnodes regulates
the amount of work for the pagedaemon.
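
To make that accounting concrete, a back-of-the-envelope footprint per
cached vnode might be written as follows; every size is an invented
placeholder, not a measured kernel number:

    /* All sizes are invented placeholders, not measured numbers. */
    #define VNODE_SZ    512 /* struct vnode itself */
    #define INODE_SZ    256 /* second-order: filesystem inode */
    #define VMOBJ_SZ    256 /* second-order: vm object */
    #define NC_SZ       128 /* second-order: namecache entries, amortized */
    #define DINODE_SZ   256 /* third-order: in-RAM dinode copy (UFS) */

    /* KVA tied up per cached vnode, before the page cache pages it names. */
    #define VNODE_FOOTPRINT \
        (VNODE_SZ + INODE_SZ + VMOBJ_SZ + NC_SZ + DINODE_SZ)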

We are currently trying to put a rational limit on the total number of
vnodes, estimating both the KVA and the physical memory consumed by them.
If you remove that limit, you need to ensure that we do not create an OOM
situation, either for KVA or for physical memory, just by creating too
many vnodes; otherwise the system cannot get out of it.
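
Schematically, a limit derived from both resources could look like the
sketch below; the fractions are invented, and the point is taking the
minimum so that neither KVA nor physical memory can be exhausted by
vnodes alone:

    /* Fractions and the per-vnode cost are invented; illustration only. */
    static unsigned long
    vnode_cap(unsigned long kva_bytes, unsigned long phys_bytes,
        unsigned long per_vnode_bytes)
    {
        unsigned long by_kva = kva_bytes / 16 / per_vnode_bytes;
        unsigned long by_phys = phys_bytes / 64 / per_vnode_bytes;

        /* Honor the tighter of the two limits. */
        return (by_kva < by_phys ? by_kva : by_phys);
    }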

So there are some combinations of machine configuration (RAM) and load
where the default settings are arguably too low.  Raising the limits
needs to account for the indirect resource usage of each vnode.

I do not know how to write a feedback formula that takes into account all
the consequences of a vnode's existence, especially since the effects
also depend on the underlying filesystem and on the patterns of VM paging.
In this sense ZFS is probably the simplest case, because its caching
subsystem is autonomous, while UFS and NFS are tightly integrated with
the VM.

> 
> Warner
> 
> > I think the notion of 'struct vnode' being a separately allocated
> > object is not very useful and it comes with complexity (and happens to
> > suffer from several bugs).
> >
> > That said, the easiest and safest thing to do in the meantime is to
> > bump the limit. Perhaps the sleep can be whacked as it is, which would
> > largely sort it out.
> >
> > --
> > Mateusz Guzik <mjguzik gmail.com>