On Thu, Mar 07, 2013 at 06:03:51PM +0100, Andre Oppermann wrote:
> On 01.02.2013 18:09, Alan Cox wrote:
> > On 02/01/2013 07:25, Andre Oppermann wrote:
> >> Rebase auto-sizing of limits on the available KVM/kmem_map instead of
> >> physical memory. Depending on the kernel and architecture configuration
> >> these two can be very different.
> >>
> >> Comments and reviews appreciated.
> >>
> >
> > I would really like to see the issues with the current auto-sizing code
> > addressed before any of the stylistic changes or en-masse conversions to
> > SYSINIT()s are considered. In particular, can we please start with the
> > patch that moves the pipe_map initialization? After that, I think that
> > we should revisit tunable_mbinit() and "maxmbufmem".
>
> OK. I'm trying to describe and explain the big picture for myself and
> other interested observers. The following text and explanations are going
> to be verbose and sometimes redundant. If something is incorrect or
> incomplete please yell, I'm not an expert in all these parts and may
> easily have missed some subtle aspects.
>
> The kernel_map serves as the container of the entire available kernel VM
> address space, including the kernel text, data and bss itself, as well as
> other bootstrapped and pre-VM allocated structures.
>
> The kernel_map should cover a reasonably large amount of address space to
> be able to serve the various kernel subsystems' demands in memory
> allocation. The cpu architecture's address range (32 or 64 bits) puts a
> hard ceiling on the total size of the kernel_map. Depending on the
> architecture the kernel_map covers a special range in the total
> addressable address range.
>
> * VM_MIN_KERNEL_ADDRESS
> * [KERNBASE]
> * kernel_map [actually mapped KVM range, direct allocations]
> * kernel text, data, bss
> * bootstrap and statically allocated structures [pmap]
> * virtual_avail [start of useable KVM]
> * kmem_map [submap for (most) UMA zones and kernel malloc]
> * exec_map [submap for temporary mapping during process exec()]
> * pipe_map [submap for temporary buffering of data between piped processes]
> * clean_map [submap for buffer_map and pager_map]
> * buffer_map [submap for BIO buffers]
> * pager_map [submap for temporary pager IO holding]
> * memguard_map [submap for debugging of UMA and kernel malloc]
> * ... [kernel_map direct allocations, free and unused space]
> * kernel_map [end of kernel_map]
> * ...
> * virtual_end [end of possible KVM]
> * VM_MAX_KERNEL_ADDRESS
>
> Some kernel_map's submaps are special by being non-pageable and
> by pre-allocating the necessary pmap structures to avoid page
> faults. The pre-allocation consumes physical memory. Thus a submap's
> pre-allocation should not be larger than a reasonably small fraction
> of available physical memory to leave enough space for other kernel
> and userspace memory demands.

Preallocation is done to ensure that calls to functions like
pmap_qenter() always succeed and never have to sleep.

>
> The pseudo-code for a dynamic calculation of a submap size would look
> like this:
>
>   submap.size = min(physmem.size / pmap.prealloc_max_fraction /
>                     pmap.size_per_page * page_size,
>                     kernel_map.free_size)
>
> The pmap.prealloc_max_fraction is the largest fraction of physical
> memory we allow the pre-allocated pmap structures of a single submap
> to occupy.
>
> Separate submaps are usually used to segregate certain types of memory
> usage and to have individual limits applied to them:
>
> kmem_map: tries to be as large as possible. It serves the bulk of
> all dynamically allocated kernel memory usage. It is the memory
> pool used by UMA and kernel malloc.
> Almost all kernel structures
> come from here: process-, thread-, file descriptors, mbuf's and
> mbuf clusters, network connection control blocks, sockets, etc...
> It is not pageable. Calculation: is currently only partially done
> dynamically and the MD parts can specify particular min, max limits
> and scaling factors. It can likely be generalized, with only very
> special platforms requiring additional limits.
>
> exec_map: is used as temporary storage to set up a process's address
> space and related items. It is very small and by default contains
> only 16 pages. Calculation: (exec_map_entries * round_page(PATH_MAX
> + ARG_MAX)).
>
> pipe_map: is used to move piped data between processes. It is
> pageable memory. Calculation: min(physmem.size, kernel_map.size) /
> 64.
>
> clean_map: overarching submap to contain the buffer_map and
> pager_map. Likely no longer necessary and a leftover from earlier
> incarnations of the kernel VM.
>
> buffer_map: is used for BIO structures to perform IO between the
> kernel VM and storage media (disk). Not pageable. Calculation:
> min(physmem.size, kernel_map.size) / 4 up to 64MB and 1/10
> thereafter.
>
> pager_map: is used for pager IO to a storage media (disk). Not
> pageable. Calculation: MAXPHYS * min(max(nbuf/4, 16), 256).

It is more versatile than that. The space is used for pbufs, and pbufs
currently also serve physio, clustering, and aio needs.

>
> memguard_map: is a special debugging submap substituting parts of
> kmem_map. Normally not used.
>
> There is some competition between these maps for physical memory. One
> has to be careful to find a total balance among them wrt. static and
> dynamic physical memory use.

They mostly compete for KVA, not for physical memory.

>
> Within the submaps, especially the kmem_map, we have a number of
> dynamic UMA suballocators where we have to put a ceiling on their
> total memory usage to prevent them from consuming all physical *and/or*
> kmem_map virtual memory.
> This is done with UMA zone limits.

Note that architectures with a direct map do not use kmem_map for
small allocations. uma_small_alloc() uses the direct map for the VA of
the new page. kmem_map is needed when the allocation is multi-page
sized, to provide a contiguous virtual mapping.

>
> No externally exploitable single UMA zone should be able to consume
> all available physical memory. This applies for example to the
> number of processes, file descriptors, sockets, mbufs and mbuf
> clusters. These need to be limited to a reasonable and heavy work-load
> permitting amount of available physical memory. However there is going
> to be overcommit among them and not all of them can be at their limit
> at the same time. Probably none of these UMA zones should be allowed
> to occupy more than 1/2 of all available physical memory. Often
> individual UMA zone limits have to be put into context and related to
> other concurrent UMA zones. This usually means a reduced UMA zone limit
> for a particular zone. Balancing this takes a slight amount of voodoo
> magic and knowledge of common extreme work-loads to align. On the
> other hand, for most of those zones allocations are permitted to fail,
> rendering an attempt at connection establishment unsuccessful. It can
> be retried later.
>
> Generic pseudo-code: UMA zone limit = min(kmem_map.size, physmem.size)
> / 4 (or other appropriate fraction).
>
> It could be that some of the kernel_map submaps are no longer
> necessary and their purpose could simply be emulated by using an
> appropriately limited UMA zone. For example the exec_map is very small
> and only used for the exec arguments. Putting this into pageable
> memory isn't very useful anymore.

I disagree. Having the strings copied on execve() be pageable is good;
the default maximum of around 260KB for the strings is quite a load on
the allocator.
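
To make the 260KB figure concrete, here is a rough sketch of the
exec_map calculation quoted earlier, exec_map_entries *
round_page(PATH_MAX + ARG_MAX). The constants are assumptions that the
thread does not state (4 KiB pages, PATH_MAX of 1024, ARG_MAX of
256 KiB), and the SKETCH_* and sketch_* names are made up for
illustration; this is not the kernel's code.

```c
#include <stdint.h>

/*
 * Assumed values, not taken from the thread: 4 KiB pages,
 * PATH_MAX = 1024 and ARG_MAX = 256 KiB.
 */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PATH_MAX  1024u
#define SKETCH_ARG_MAX   (256u * 1024u)

/* Round a byte count up to the next page boundary. */
static uint64_t
sketch_round_page(uint64_t bytes)
{
	return ((bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE *
	    SKETCH_PAGE_SIZE);
}

/* Bytes reserved for the path and argument strings of one exec(). */
static uint64_t
exec_entry_size(void)
{
	return (sketch_round_page(SKETCH_PATH_MAX + SKETCH_ARG_MAX));
}

/* exec_map size = exec_map_entries * round_page(PATH_MAX + ARG_MAX). */
static uint64_t
exec_map_size(uint64_t exec_map_entries)
{
	return (exec_map_entries * exec_entry_size());
}
```

Under these assumptions one entry rounds up to 65 pages, i.e. 260 KiB,
and a map with 16 entries would come to 4160 KiB.
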
>
> Also the interesting construct of the clean_map containing only
> the buffer_map and pager_map doesn't seem necessary anymore and is
> probably a remnant of an earlier incarnation of the VM.
>
> Comments, discussion and additional input welcome.
>
> --
> Andre
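
For readers who prefer code to prose, the sizing formulas quoted in
this message could be sketched roughly as below. This is only an
illustration of the pseudo-code, not the kernel's implementation: the
function and parameter names are invented, all sizes are in bytes, and
the buffer_map rule is left out because "1/4 up to 64MB and 1/10
thereafter" can be read in more than one way.

```c
#include <stdint.h>

static uint64_t
min_u64(uint64_t a, uint64_t b)
{
	return (a < b ? a : b);
}

static uint64_t
max_u64(uint64_t a, uint64_t b)
{
	return (a > b ? a : b);
}

/*
 * submap.size = min(physmem.size / pmap.prealloc_max_fraction /
 *                   pmap.size_per_page * page_size,
 *                   kernel_map.free_size)
 */
static uint64_t
submap_size(uint64_t physmem, uint64_t prealloc_max_fraction,
    uint64_t pmap_size_per_page, uint64_t page_size, uint64_t kva_free)
{
	/* Physical memory the pre-allocated pmap structures may occupy. */
	uint64_t pmap_budget = physmem / prealloc_max_fraction;
	/* Address space that this much pmap memory is able to map. */
	uint64_t mappable = pmap_budget / pmap_size_per_page * page_size;

	return (min_u64(mappable, kva_free));
}

/* pipe_map: min(physmem.size, kernel_map.size) / 64 */
static uint64_t
pipe_map_size(uint64_t physmem, uint64_t kva_size)
{
	return (min_u64(physmem, kva_size) / 64);
}

/* pager_map: MAXPHYS * min(max(nbuf / 4, 16), 256) */
static uint64_t
pager_map_size(uint64_t maxphys, uint64_t nbuf)
{
	return (maxphys * min_u64(max_u64(nbuf / 4, 16), 256));
}

/* UMA zone limit: min(kmem_map.size, physmem.size) / fraction */
static uint64_t
uma_zone_limit(uint64_t kmem_size, uint64_t physmem, uint64_t fraction)
{
	return (min_u64(kmem_size, physmem) / fraction);
}
```

As an example with made-up inputs: for 4 GiB of physical memory, a
prealloc fraction of 64, 8 bytes of pmap per page and 4 KiB pages, the
physical-memory bound works out to 32 GiB of mappable space, so on a
machine with 1 GiB of free KVA it is the KVA term that limits the
submap.
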