Re: Cleanup and untangling of kernel VM initialization

From: Konstantin Belousov <kostikbel_at_gmail.com> Date: Fri, 8 Mar 2013 11:16:34 +0200

On Thu, Mar 07, 2013 at 06:03:51PM +0100, Andre Oppermann wrote:
> On 01.02.2013 18:09, Alan Cox wrote:
> > On 02/01/2013 07:25, Andre Oppermann wrote:
> >>   Rebase auto-sizing of limits on the available KVM/kmem_map instead of
> >>   physical memory.  Depending on the kernel and architecture configuration
> >>   these two can be very different.
> >>
> >> Comments and reviews appreciated.
> >>
> >
> > I would really like to see the issues with the current auto-sizing code
> > addressed before any of the stylistic changes or en-masse conversions to
> > SYSINIT()s are considered.  In particular, can we please start with the
> > patch that moves the pipe_map initialization?  After that, I think that
> > we should revisit tunable_mbinit() and "maxmbufmem".
> 
> OK.  I'm trying to describe and explain the big picture for myself and
> other interested observers.  The following text and explanations are going
> to be verbose and sometimes redundant.  If something is incorrect or incomplete
> please yell, I'm not an expert in all these parts and may easily have missed
> some subtle aspects.
> 
> The kernel_map serves as the container of the entire available kernel VM
> address space, including the kernel text, data and bss itself, as well as
> other bootstrapped and pre-VM allocated structures.
> 
> The kernel_map should cover a reasonably large amount of address space to be
> able to serve the various kernel subsystems' demands for memory allocation.
> The cpu architecture's address range (32 or 64 bits) puts a hard ceiling on
> the total size of the kernel_map.  Depending on the architecture the kernel_map
> covers a specific part of the total addressable range.
> 
>   * VM_MIN_KERNEL_ADDRESS
>   *   [KERNBASE]
>   *   kernel_map    [actually mapped KVM range, direct allocations]
>   *   kernel text, data, bss
>   *   bootstrap and statically allocated structures  [pmap]
>   *   virtual_avail  [start of useable KVM]
>   *       kmem_map   [submap for (most) UMA zones and kernel malloc]
>   *       exec_map   [submap for temporary mapping during process exec()]
>   *       pipe_map   [submap for temporary buffering of data between piped processes]
>   *       clean_map  [submap for buffer_map and pager_map]
>   *         buffer_map [submap for BIO buffers]
>   *         pager_map  [submap for temporary pager IO holding]
>   *       memguard_map [submap for debugging of UMA and kernel malloc]
>   *       ...        [kernel_map direct allocations, free and unused space]
>   *   kernel_map     [end of kernel_map]
>   *   ...
>   *   virtual_end    [end of possible KVM]
>   * VM_MAX_KERNEL_ADDRESS
> 
> Some kernel_map's submaps are special by being non-pageable and
> by pre-allocating the necessary pmap structures to avoid page
> faults. The pre-allocation consumes physical memory. Thus a submap's
> pre-allocation should not be larger than a reasonably small fraction
> of available physical memory to leave enough space for other kernel
> and userspace memory demands.
Preallocation is done to ensure that calls to functions like pmap_qenter()
always succeed and never have to sleep in order to do so.

> 
> The pseudo-code for a dynamic calculation of a submap size would look like this:
> 
>   submap.size = min(physmem.size / pmap.prealloc_max_fraction / pmap.size_per_page *
>       page_size, kernel_map.free_size)
> 
> The pmap.prealloc_max_fraction is the largest fraction of physical
> memory we allow the pre-allocated pmap structures of a single submap
> to occupy.
>
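As a rough illustration, the pseudo-code above could be written out in
userland C as below. This is only a sketch: submap_size() and its parameter
names are hypothetical and not existing kernel interfaces, all sizes are
assumed to be in bytes, and pmap_bytes_per_page stands in for whatever the
per-page pmap overhead is on a given platform.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not an existing kernel function.  All sizes are in
 * bytes.  pmap_bytes_per_page is the assumed pmap overhead needed to map
 * one page.  Overflow of the final multiplication is not handled here.
 */
static uint64_t
submap_size(uint64_t physmem_size, uint64_t prealloc_max_fraction,
    uint64_t pmap_bytes_per_page, uint64_t page_size,
    uint64_t kernel_map_free_size)
{
	uint64_t pmap_budget, pages, size;

	assert(prealloc_max_fraction > 0);
	assert(pmap_bytes_per_page > 0);

	/* Physical memory the pre-allocated pmap structures may occupy. */
	pmap_budget = physmem_size / prealloc_max_fraction;
	/* Number of pages that this budget is able to map. */
	pages = pmap_budget / pmap_bytes_per_page;
	/* Amount of KVA covered by that many pages. */
	size = pages * page_size;

	/* Never ask for more than the kernel_map still has free. */
	return (size < kernel_map_free_size ? size : kernel_map_free_size);
}
```

With 1GB of physical memory, a fraction of 64, 8 bytes of pmap overhead per
page and 4KB pages, the physical memory term works out to 8GB of KVA, so on
most configurations the kernel_map.free_size term would be the effective
limit.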
> Separate submaps are usually used to segregate certain types of memory
> usage and to have individual limits applied to them:
>
>   kmem_map: tries to be as large as possible. It serves the bulk of
>   all dynamically allocated kernel memory usage. It is the memory
>   pool used by UMA and kernel malloc. Almost all kernel structures
>   come from here: process, thread and file descriptors, mbufs and
>   mbuf clusters, network connection control blocks, sockets, etc...
>   It is not pageable. Calculation: is currently only partially done
>   dynamically and the MD parts can specify particular min, max limits
>   and scaling factors. It can likely be generalized, with only very
>   special platforms requiring additional limits.
>
>   exec_map: is used as temporary storage to set up a process's address
>   space and related items. It is very small and by default contains
>   only 16 pages. Calculation: (exec_map_entries * round_page(PATH_MAX
>   + ARG_MAX)).
>
>   pipe_map: is used to move piped data between processes. It is
>   pageable memory. Calculation: min(physmem.size, kernel_map.size) /
>   64.
>
>   clean_map: overarching submap to contain the buffer_map and
>   pager_map. Likely no longer necessary and a leftover from earlier
>   incarnations of the kernel VM.
>
>   buffer_map: is used for BIO structures to perform IO between the
>   kernel VM and storage media (disk). Not pageable. Calculation:
>   min(physmem.size, kernel_map.size) / 4 up to 64MB and 1/10
>   thereafter.
>
>   pager_map: is used for pager IO to storage media (disk). Not
>   pageable. Calculation: MAXPHYS * min(max(nbuf/4, 16), 256).
It is more versatile than that. The space is used for pbufs, and pbufs
currently also serve physio, clustering, and aio.

>
>   memguard_map: is a special debugging submap substituting parts of 
>   kmem_map. Normally not used.
>
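The exec_map, pipe_map and pager_map calculations listed above can be
sketched in userland C as follows. The function names are hypothetical and
the constants are passed in as parameters instead of being taken from kernel
headers, so this only mirrors the arithmetic as stated in the list and is
not the actual kernel code. The buffer_map calculation is left out because
its two-tier rule is not spelled out precisely enough here.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t
min_u64(uint64_t a, uint64_t b)
{
	return (a < b ? a : b);
}

static uint64_t
max_u64(uint64_t a, uint64_t b)
{
	return (a > b ? a : b);
}

/* Round x up to a multiple of page_size, which must be a power of two. */
static uint64_t
round_page_sz(uint64_t x, uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return ((x + page_size - 1) & ~(page_size - 1));
}

/* exec_map: exec_map_entries * round_page(PATH_MAX + ARG_MAX) */
static uint64_t
exec_map_size(uint64_t entries, uint64_t path_max, uint64_t arg_max,
    uint64_t page_size)
{
	return (entries * round_page_sz(path_max + arg_max, page_size));
}

/* pipe_map: min(physmem.size, kernel_map.size) / 64 */
static uint64_t
pipe_map_size(uint64_t physmem_size, uint64_t kernel_map_size)
{
	return (min_u64(physmem_size, kernel_map_size) / 64);
}

/* pager_map: MAXPHYS * min(max(nbuf / 4, 16), 256) */
static uint64_t
pager_map_size(uint64_t maxphys, uint64_t nbuf)
{
	return (maxphys * min_u64(max_u64(nbuf / 4, 16), 256));
}
```

Assuming 16 entries, PATH_MAX of 1024, ARG_MAX of 262144 and 4KB pages,
exec_map_size() gives 16 * 266240 bytes, or a little over 4MB of KVA.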
> There is some competition between these maps for physical memory. One
> has to be careful to find a total balance among them wrt. static and
> dynamic physical memory use.
They mostly compete for KVA, not for physical memory.

>
> Within the submaps, especially the kmem_map, we have a number of
> dynamic UMA suballocators where we have to put a ceiling on their
> total memory usage to prevent them from consuming all physical *and/or*
> kmem_map virtual memory. This is done with UMA zone limits.
Note that architectures with a direct map do not use kmem_map for
small allocations. uma_small_alloc() uses the direct map for the VA
of the new page. kmem_map is needed when an allocation is multi-page
sized, to provide the contiguous virtual mapping.

>
> No externally exploitable single UMA zone should be able to consume
> all available physical memory. This applies for example to the
> number of processes, file descriptors, sockets, mbufs and mbuf
> clusters. These need to be limited to a reasonable amount of available
> physical memory that still permits heavy work-loads. However there is going
> to be overcommit among them and not all of them can be at their limit
> at the same time. Probably none of these UMA zones should be allowed
> to occupy more than 1/2 of all available physical memory. Often
> individual UMA zone limits have to be put into context and related to
> other concurrent UMA zones. This usually means a reduced UMA zone limit
> for a particular zone. Balancing this takes a slight amount of voodoo
> magic and knowledge of common extreme work-loads to align. On the
> other hand, for most of those zones allocations are permitted to fail,
> rendering an attempt at connection establishment unsuccessful. It can
> be retried later.
>
> Generic pseudo-code: UMA zone limit = min(kmem_map.size, physmem.size)
> / 4 (or other appropriate fraction).
>
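The generic UMA zone limit above can be written out as a small userland C
sketch. Both helpers are hypothetical names and not part of UMA; the divisor
is a parameter because the text leaves the fraction open, and the conversion
to an item count assumes a fixed item size and ignores slab overhead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: upper bound, in bytes, on what a single zone may
 * consume, taken as min(kmem_map.size, physmem.size) / divisor.
 */
static uint64_t
zone_limit_bytes(uint64_t kmem_map_size, uint64_t physmem_size,
    uint64_t divisor)
{
	uint64_t base;

	assert(divisor > 0);
	base = kmem_map_size < physmem_size ? kmem_map_size : physmem_size;
	return (base / divisor);
}

/*
 * Hypothetical helper: turn a byte limit into an item count for a zone
 * with fixed-size items.  Per-slab overhead is ignored.
 */
static uint64_t
zone_limit_items(uint64_t limit_bytes, uint64_t item_size)
{
	assert(item_size > 0);
	return (limit_bytes / item_size);
}
```

With a 1GB kmem_map, 4GB of physical memory and a divisor of 4, the limit is
256MB, which for 256-byte items is 1048576 items. A count like that is the
kind of value one would hand to uma_zone_set_max().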
> It could be that some of the kernel_map submaps are no longer
> necessary and their purpose could simply be emulated by using an
> appropriately limited UMA zone. For example the exec_map is very small
> and only used for the exec arguments. Putting this into pageable
> memory isn't very useful anymore.
I disagree. Having the strings copied on execve() pageable is good:
the default size of around 260KB max for the strings is quite a
load on the allocator.

>
> Also the interesting construct of the clean_map containing only
> the buffer_map and pager_map doesn't seem necessary anymore and is
> probably a remnant of an earlier incarnation of the VM.
>
> Comments, discussion and additional input welcome.
>
> -- Andre