On Sun, Apr 04, 2021 at 07:01:44PM +0000, Poul-Henning Kamp wrote:
> --------
> Konstantin Belousov writes:
>
> > But what would you provide as the input for the PID controller, and
> > what would be the targets?
>
> Viewing this purely as a vnode-related issue is wrong; this is about
> memory allocation in general.
>
> We may or may not want a PID regulator, but putting it on counts of
> vnodes would not improve things, precisely, as you point out, because
> the amount of memory a vnode ties up has enormous variance.

Yes,

> We should focus on the end goal: To ensure "sufficient" memory can
> always be allocated for any purpose "without major delay".

and no.

> Architecturally there are three major problems:
>
> A) While each subsystem generally has a good idea about memory that
>    can be released "without major delay", the information does not
>    trickle up through a summarizing NUMA-aware tree.
>
> B) We lack a nuanced call-back to tell the subsystems to release
>    some of their memory "without major delay".

The delay in the wall-clock sense is not what drives the issue.  We
cannot expect any I/O to proceed while we are low on memory, in the
sense that allocators cannot satisfy requests right now.  More and
more, our I/O subsystem requires allocating memory to make any
progress with I/O.  This is already quite bad with GEOM, although some
hacks keep it from being too conspicuous.  It is very bad with ZFS,
where swap on zvols causes deadlocks almost immediately.

> C) We have never attempted to enlist userland, where jemalloc often
>    hangs on to a lot of unused VM pages.

Userland does not add to this problem, because the pagedaemon
typically has enough processing power to convert user-allocated pages
into usable clean or free pages.  Of course, if there is no swap and
dirty anonymous pages cannot be laundered, the problem accumulates,
but normally the operating system does not have an issue with user
pages.

> As far as vnodes go:
>
> It used to be that "without major delay" meant "without disk-I/O",
> which again led to the "dirty buffers/VM pages" heuristic.
>
> With microsecond SSD backing store, that heuristic is not only
> invalid, it is downright harmful in many cases.
>
> GEOM maintains estimates of per-provider latency, and VM+VFS should
> use that to schedule write-back so that more of it happens outside
> rush-hour, in order to increase the amount of memory which can be
> released "without major delay".
>
> Today that happens largely as a side effect of the periodic syncer,
> which does a really bad job at it, because it still expects VAX-era
> hardware performance and workloads.

I/O latency is not the deciding factor there.  We must avoid
situations where instantiating a vnode stalls waiting for KVA to
appear; similarly, we must avoid a system state where vnode allocation
has consumed so much kmem that other allocations stall.  Quite
indicative is that we do not shrink the vnode list on low-memory
events, and vnlru does not account for memory pressure either.  The
problem is that it is not clear how to express the relation between a
safe allocator state and our desire to cache file system data, which
is bound to vnode identity.
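To make point (B) concrete, here is a minimal sketch of what a
subsystem sees today through the stock vm_lowmem eventhandler, next
to a hypothetical, more nuanced interface.  Only the vm_lowmem
registration and the VM_LOW_* flags match the existing API; the
mysubsys_* names and the subsys_reclaim_t type are invented for
illustration.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/eventhandler.h>

    /*
     * Today's interface: a coarse, fire-and-forget notification.
     * The flags distinguish at most a kmem shortage from a page
     * shortage; the subsystem learns neither how much memory is
     * wanted nor from which NUMA domain.
     */
    static void
    mysubsys_lowmem(void *arg __unused, int flags)
    {
            /* mysubsys_trim_caches() is hypothetical. */
            mysubsys_trim_caches((flags & VM_LOW_KMEM) != 0);
    }
    EVENTHANDLER_DEFINE(vm_lowmem, mysubsys_lowmem, NULL,
        EVENTHANDLER_PRI_ANY);

    /*
     * Hypothetical alternative: ask for up to "target" bytes from
     * one domain, and let the subsystem report what it actually
     * released "without major delay", so the caller can aggregate
     * the answers up a NUMA-aware tree (point A) and iterate.
     */
    typedef size_t  subsys_reclaim_t(int domain, size_t target);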
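On the last point, a sketch, with loudly hypothetical names, of the
hook that is missing today: shrinking the vnode list on low-memory
events.  Only desiredvnodes (the backing of kern.maxvnodes) and the
vm_lowmem registration exist in the tree; vnlru_set_target() and
kva_pressure() merely name the missing pieces.

    #include <sys/param.h>
    #include <sys/eventhandler.h>

    /*
     * Hypothetical: on a vm_lowmem event, lower the vnode target as
     * a function of how squeezed kmem/KVA currently is, instead of
     * keeping a fixed fraction of kern.maxvnodes, then wake vnlru to
     * recycle down to the new target.
     */
    static void
    vnlru_lowmem(void *arg __unused, int flags __unused)
    {
            u_long target;

            /* kva_pressure(): 0 (idle) .. 100 (about to stall). */
            target = desiredvnodes -
                desiredvnodes * kva_pressure() / 100;
            vnlru_set_target(target);   /* also wakes vnlru */
    }
    EVENTHANDLER_DEFINE(vm_lowmem, vnlru_lowmem, NULL,
        EVENTHANDLER_PRI_ANY);

Even this sketch dodges the hard question above: what the safe
allocator state is that kva_pressure() would have to measure.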