=== Introduction ===

Those of you who watch the cvs commit logs may have noticed several rather significant changes to malloc over the past few weeks. I took a fresh look at malloc after letting the details fade over the past year, and saw some significant opportunities for improvement. I appreciate the FreeBSD community's tolerance of these changes, which were perhaps of questionable wisdom given that we are now in code slush. =)

Since these changes obsolete several aspects of the paper I presented last year at BSDCan, it seems like a good idea to provide a public summary of the changes and their rationale. They focus on two main issues: 1) mapped page reduction during low memory usage phases of application execution, and 2) fragmentation reduction.

=== 1) Mapped page reduction ===

First, a bit of relevant background. jemalloc breaks memory into 1MB chunks that are aligned at multiples of the chunk size. This makes it possible to find the metadata for small and large user objects in constant time. (For huge objects, it takes lg(n) time, where n is the number of huge allocations.) The point is that chunks are a fundamental aspect of what makes jemalloc tick.

However, there is a downside to managing memory in chunks: unless we use madvise(2), every page of a chunk that is touched remains physically backed until the entire chunk can be unmapped. Thus, even a single in-use page can cause retention of up to 1MB of memory. This is a bit of a catch-22: madvise(2) is too expensive to enable by default, but if we really need it, then we really should be using it for all applications, since paging is a system-wide issue. There is no simple way out of this conundrum, which is why I have been working to reduce the number of chunks that remain mapped for programs that typically operate well below their peak memory usage.

In order to reduce the number of mapped pages, we need some way to preferentially pack data into certain chunks so that others can drain. There is a simple allocation policy that has been demonstrated in practice to work very well: prefer low addresses. jemalloc already did this where convenient (a lesson learned from phkmalloc), but the recent changes push this policy throughout the allocator, nearly to its logical extreme.

Each arena manages a set of chunks that are carved into runs, and these runs back small and large allocations. jemalloc now uses the lowest fit for run allocation. Equal-sized small allocations are grouped together into runs, since it is possible to fit multiple small objects in a single run. This means that such a run may have very low utilization, so we try to keep runs as full as possible in order to reduce the total number of runs in use for any given size class.

Previously, jemalloc used an overly complicated run promotion/demotion system that was intended to place several consecutive allocation requests in the same run, reduce the overhead of switching between runs, and keep runs moderately full. However, experiments demonstrated that run switching overhead has at most a small impact on allocator performance, and furthermore that the promotion/demotion algorithms were not very effective at decreasing run switching anyway. That being the case, I experimented with increasingly expensive run switching policies. Finally, I tried always switching to the lowest run with any unused space, which is the ideal layout policy given my goals. This change had the intended effect of tending to let high chunks drain.
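To make the "prefer low addresses" policy concrete, here is a small, self-contained sketch of lowest-fit run selection. It is only an illustration under simplified assumptions: the structure and names are made up, and the real allocator keeps its free runs in red-black trees rather than scanning a flat array.

/*
 * Toy illustration of "lowest fit" run selection: among all free runs
 * large enough to satisfy the request, pick the one with the lowest
 * address, so that allocations pack toward low memory and high chunks
 * can eventually drain.  This is NOT the jemalloc implementation;
 * the names and the flat array are made up for illustration.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct free_run {
    uintptr_t addr;     /* base address of the free run */
    size_t    npages;   /* size of the free run, in pages */
};

/* Return the index of the lowest-addressed run with >= npages, or -1. */
static int
lowest_fit(const struct free_run *runs, int nruns, size_t npages)
{
    int best = -1;

    for (int i = 0; i < nruns; i++) {
        if (runs[i].npages < npages)
            continue;
        if (best == -1 || runs[i].addr < runs[best].addr)
            best = i;
    }
    return (best);
}

int
main(void)
{
    struct free_run runs[] = {
        { 0x500000, 4 },    /* high chunk, big enough */
        { 0x100000, 2 },    /* lowest chunk, but too small */
        { 0x200000, 8 },    /* low chunk, big enough */
    };
    int i = lowest_fit(runs, 3, 3);

    /* Selects the run at 0x200000: the lowest address that fits 3 pages. */
    if (i != -1)
        printf("selected run at 0x%zx (%zu pages)\n",
            (size_t)runs[i].addr, runs[i].npages);
    return (0);
}

The point is simply that, among all free runs large enough for the request, the one at the lowest address wins, so allocations pile up at the bottom of the address space and the upper chunks are left to empty out.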
Much to my surprise, though, there was no significant impact on runtime. I had put this last experiment off for a long time, because my intuition was that it would be far too expensive to add so many red-black tree modifications. I think there are two factors that explain why I was wrong: 1) the highest run switching rate I saw for any application was ~1% of the total allocation rate (though carefully crafted code could push that up to nearly 50%), and 2) cache locality improvements tend to help a bit. My initial expectation was that such a layout policy would *decrease* locality, since we have the potential to spread consecutive allocations across a large number of independent runs. However, in the steady state of a long-running application we have a hard time dealing with this problem anyway, and by packing objects as low as possible in memory, we actually tend to do better than if we were to use non-full runs in MRU order.

Anyway, the reason I dedicate so much space to this issue is that it still makes me uneasy, but I have been unable to find any application for which the layout policy causes significant performance degradation. If you think you may have found such a case, you can gain some insight by looking at the statistics printed when MALLOC_OPTIONS=P is set. Compare the 'reruns' column (the number of run switches) to the 'requests' column in order to determine what percentage of requests suffer the overhead of run switching. If you see very high run switch rates for a particular application (say, >5%), please tell me so that I can run further experiments.

=== 2) Fragmentation reduction ===

We want to fit user objects in memory as tightly as possible, and jemalloc was doing a substandard job of this in a couple of regards. First, it used a binary buddy scheme to manage runs within chunks, meaning that runs were constrained to be 2^n pages, for 0 < n < 20. This caused significant internal fragmentation for, say, 20kB objects (with 4kB pages, such a request had to be rounded up to an 8-page, 32kB run). To improve the situation, I switched to a straightforward extent-based system. This was initially intended only as a fix for the problem just described, but it turned out to also be useful for runs that back small allocations.

Small allocation runs can now be sized more precisely, such that the run header overhead stays just below some threshold. This allows the bitmaps that track which small objects are in use to be as short as possible while still keeping metadata overhead under control. Previously, all run headers for small object runs were the same size, regardless of how many bits were actually needed in the header bitmap to track the run's objects. This was clearly wasteful. My memory is fuzzy regarding the original rationale for fixed-size run headers, but I think it had to do with the complexity of calculating the run header size, the number of regions in a run, and the run size all at the same time. Indeed, this proved to be tricky, but well worth the effort.
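To show why those three quantities have to be computed together, here is a rough, self-contained sketch of the kind of fixed-point calculation involved. The constants, structure, and names are purely illustrative assumptions, not jemalloc's actual values or code.

/*
 * Rough sketch of sizing a small-object run: the number of regions
 * depends on how much of the run the header consumes, but the header
 * (which contains a bitmap with one bit per region) itself depends on
 * the number of regions, so the two are computed together.  The run is
 * then grown until header overhead drops below a threshold.  All
 * constants here are hypothetical, for illustration only.
 */
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE        4096u
#define FIXED_HDR_SIZE   64u    /* hypothetical non-bitmap header bytes */
#define MAX_HDR_OVERHEAD 0.04   /* keep the header below ~4% of the run */

struct run_layout {
    size_t run_size;    /* total run size, in bytes */
    size_t nregs;       /* regions (objects) per run */
    size_t hdr_size;    /* header including bitmap, in bytes */
};

static struct run_layout
run_size_calc(size_t reg_size)
{
    struct run_layout l;

    /* Grow the run one page at a time until overhead is acceptable. */
    for (l.run_size = PAGE_SIZE; ; l.run_size += PAGE_SIZE) {
        /*
         * Fixed-point step: guess nregs assuming no bitmap, then
         * shrink it until the regions plus their bitmap actually fit.
         */
        l.nregs = (l.run_size - FIXED_HDR_SIZE) / reg_size;
        for (;;) {
            size_t bitmap = (l.nregs + 7) / 8;

            l.hdr_size = FIXED_HDR_SIZE + bitmap;
            if (l.hdr_size + l.nregs * reg_size <= l.run_size)
                break;
            l.nregs--;
        }
        if ((double)l.hdr_size / l.run_size <= MAX_HDR_OVERHEAD)
            return (l);
    }
}

int
main(void)
{
    size_t sizes[] = { 16, 64, 512 };

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        struct run_layout l = run_size_calc(sizes[i]);

        printf("reg %4zu: run %6zu bytes, %4zu regions, %3zu header bytes\n",
            sizes[i], l.run_size, l.nregs, l.hdr_size);
    }
    return (0);
}

The inner loop is the part that makes this fiddly: adding a region grows the bitmap, which can in turn push a region back out of the run, so neither quantity can be fixed before the other.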
=== Summary ===

With these changes in place, all of my experiments indicate substantial improvements in memory usage, as well as overall speed improvements. We still have several months before the 7.0 release to verify that I didn't miss anything important. Even so, I tested these changes more thoroughly than any previous changes (excepting Kris Kennaway's testing before jemalloc entered cvs), so I think the risk of major problems is reasonably low. Other than bug fixes, I do not intend to make any more changes before 7.0.

I have developed some novel algorithms for essentially eliminating thread contention on SMP systems, but it is too late in the development cycle to introduce such changes (not to mention that I lack the hardware to evaluate the algorithms).

Thanks again for your patience and support. Please let me know if I can be of help in diagnosing suspected malloc issues.

Jason