Re: New libc malloc patch

From: Jason Evans <jasone_at_canonware.com>
Date: Sun, 11 Dec 2005 18:27:14 -0800
Johan,

Thank you for your code review!  Specific responses follow.

On Dec 11, 2005, at 5:48 PM, Johan Bucht wrote:
>
> Just a few comments, not sure if all of them are valid since I don't
> have a recent enough system to apply the patch and test it, so my
> comments are just based on reading the patch. I might have missed
> something hidden among all the differing lines. Can you put up a
> patched malloc.c somewhere?

http://www.canonware.com/~jasone/jemalloc/malloc.c

> * Fork safety functions
> Nice to have for all allocators, and something I missed having. I
> would like to have them regardless of whether your malloc becomes
> standard or not.

I think that the implementation is currently fork-safe.  The threads  
libraries call _malloc_prefork() and _malloc_postfork() in order to  
avoid locking issues.
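
For what it's worth, the shape of those hooks is roughly the following.
This is only a sketch; the lock names and types are illustrative
(pthread mutexes standing in for the real locks), not the actual
malloc.c identifiers:

#include <pthread.h>

/*
 * Sketch of the fork-safety hooks.  Every malloc lock is acquired
 * before fork(2) and released afterward, so the child never inherits a
 * lock stuck in the locked state.
 */
#define NARENAS_MAX	16		/* illustrative bound */

static pthread_mutex_t	chunks_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t	arena_mtx[NARENAS_MAX];	/* initialized at arena creation (elided) */
static unsigned		narenas;

void
_malloc_prefork(void)
{
	unsigned i;

	/* Acquire every malloc lock before fork(). */
	pthread_mutex_lock(&chunks_mtx);
	for (i = 0; i < narenas; i++)
		pthread_mutex_lock(&arena_mtx[i]);
}

void
_malloc_postfork(void)
{
	unsigned i;

	/* Release the locks in both parent and child after fork(). */
	for (i = 0; i < narenas; i++)
		pthread_mutex_unlock(&arena_mtx[i]);
	pthread_mutex_unlock(&chunks_mtx);
}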

> * magazines, per-arena caches
> If I didn't miss something, you don't seem to have them; pardon if I
> missed them in my quick reading. They can really help scalability and
> are something that is implemented in the Solaris libumem allocator
> with great success. I implemented them, and they helped bring
> scalability up quite a bit on the system I tested my allocator on, as
> well as bringing allocation latency down.

Arenas do use caches for objects below a size threshold (currently  
~1024 bytes on 32-bit systems and ~2048 bytes on 64-bit systems).   
The caching mechanism is quite different from magazines, but it is
still very effective.
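
To give a feel for the split (a sketch only; the threshold constant and
the function names here are made up, not taken from the patch):

#include <stddef.h>

/*
 * Sketch: requests at or below the cache threshold (~1024 bytes on
 * 32-bit systems, ~2048 bytes on 64-bit systems) are served from the
 * arena's per-size-class caches; larger arena requests skip the cache.
 */
typedef struct arena_s arena_t;

#ifdef __LP64__
static const size_t	cache_max = 2048;
#else
static const size_t	cache_max = 1024;
#endif

void	*arena_cache_malloc(arena_t *arena, size_t size);  /* cached classes */
void	*arena_large_malloc(arena_t *arena, size_t size);  /* uncached path */

void *
arena_malloc(arena_t *arena, size_t size)
{
	if (size <= cache_max)
		return (arena_cache_malloc(arena, size));
	return (arena_large_malloc(arena, size));
}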

> * Saw you used a tree to look up the allocation sizes at free; this
> differs from how I and Solaris libumem do it. Instead I went for
> prepending an 8/16 byte tag containing the size of the allocation for
> lockless size lookup.

Actually, my implementation prepends a 4 byte tag that contains  
allocated/free flags, and the size of the allocation.  The trees are  
only used to look up the sizes of huge allocations (> 8 MB on 32-bit  
platforms, and > 256 MB on 64-bit platforms).  Huge allocations start  
at chunk boundaries, so prepending a tag would require an entire  
extra page (and would cause potentially serious external  
fragmentation on 32-bit systems).
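
The tag layout is along these lines (field and flag names here are
mine, for illustration; the real structure in malloc.c differs in
detail):

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the region tag: a 4 byte word precedes each region,
 * packing the region size together with allocated/free flags, so
 * free() can recover the size without a tree lookup.  Sizes are
 * multiples of the quantum, which leaves the low bits free for flags.
 * Huge, chunk-aligned allocations carry no tag and are tracked in a
 * tree keyed by address instead.
 */
typedef struct {
	uint32_t	bits;		/* size | flags */
} region_sep_t;

#define REG_ALLOCATED	((uint32_t)0x1)
#define REG_FLAG_MASK	((uint32_t)0x3)
#define REG_SIZE_MASK	(~REG_FLAG_MASK)

static inline size_t
region_size(const region_sep_t *sep)
{
	return ((size_t)(sep->bits & REG_SIZE_MASK));
}

static inline int
region_allocated(const region_sep_t *sep)
{
	return ((sep->bits & REG_ALLOCATED) != 0);
}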

> * Bucket sizes
> If I didn't miss something, you are using power-of-two sizes. It might
> be better to use smaller size classes to avoid internal fragmentation,
> as well as to improve locking granularity.

My implementation uses quantum-sized buckets (not power of two size  
classes).  The quantum size is either 8 or 16 bytes, depending on the  
machine architecture.
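
Concretely, the rounding works like this (a minimal sketch, not the
actual code; QUANTUM is shown as 16 but is 8 on some architectures):

#include <stddef.h>

#define QUANTUM		16
#define QUANTUM_MASK	((size_t)(QUANTUM - 1))

/* Round a request up to the nearest multiple of the quantum. */
static inline size_t
quantum_ceil(size_t size)
{
	return ((size + QUANTUM_MASK) & ~QUANTUM_MASK);
}

/* Bin 0 covers (0, QUANTUM], bin 1 (QUANTUM, 2*QUANTUM], and so on. */
static inline unsigned
quantum_bin(size_t size)
{
	return ((unsigned)(quantum_ceil(size ? size : 1) / QUANTUM) - 1);
}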

> * Locking granularity
> You use a single lock for chunk allocation instead of a lock per
> bucket size or something similar. This means that you limit access
> unless threads allocate from their private arenas. Since you hash to
> get an arena, there might be multiple threads accessing the same arena
> at the same time, introducing contention. It might be better to have a
> lock per allocation size to somewhat improve the situation.

Using a lock per allocation size requires that slower algorithms be  
used in some places (and it certainly complicates certain other  
operations).  This didn't seem like a worthwhile tradeoff to me.  By  
making the code faster and simpler, less time is spent in the malloc  
code.  Unless an app does nothing but malloc/free, I don't expect  
this to be an issue.  Even microbenchmarks that do nothing but malloc/ 
free don't appear to suffer.

> * Locking primitive
> The biggest issue, as David Xu pointed out, is probably the locking
> primitives. The SPINLOCK use has a limit in the threading library and
> makes it hard to have a lot of mutexes. I ended up using a wrapper
> around the umtx_lock function to get recursive mutexes, and it would
> probably be better to extend the umtx functions to handle recursion.
> This would probably also be appreciated by other malloc
> implementations. It might be interesting to implement some of the
> ideas from the Linux futex implementation to help umtx.

I have been contemplating creating a separate spinlock API that  
doesn't require the threads library to track the spinlocks across  
fork.  This would (if I understand correctly) remove the current  
static spinlock limitations.

As for supporting recursive spinlocks, I doubt that the overhead  
would be acceptable in general.  If I could get rid of the need for  
the one recursive lock in malloc.c, I certainly would. =)

> * Big allocations
> It might be preferable to use mmap/munmap directly for big allocations
> (bigger than a few pages) to avoid the overhead of trying to be smart.
> These big allocations should be pretty few and shouldn't be pessimized
> much by this. It also means some memory savings, since you can return
> memory from big allocations directly to the system.

I don't really see the approach I'm taking as "trying to be smart",  
so much as "making it simple by using the same algorithm for almost  
everything".  Many other malloc implementations I've examined use  
three or more allocation approaches, depending on the allocation  
size.  jemalloc uses only two, and huge allocations really are huge,  
such that they rarely if ever happen.  In practice, I don't think  
there is a real memory savings to be had.  Besides, madvise() calls  
in strategic locations would achieve approximately the same result.
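
To be concrete, the kind of call I have in mind is simply this (a
sketch; where exactly such calls belong is a separate question):

#include <sys/types.h>
#include <sys/mman.h>

/*
 * Tell the VM system that the pages backing an unused range can be
 * reclaimed, without giving up the mapping itself.  MADV_FREE is the
 * FreeBSD flavor; MADV_DONTNEED is the rough equivalent elsewhere.
 */
static void
pages_discard(void *addr, size_t len)
{
	madvise(addr, len, MADV_FREE);
}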

> * Ownership
> Didn't look that closely for it, so I might have missed it. Basically,
> what happens if you free memory allocated in another arena: do you
> return the memory to the arena it was allocated from, or to the
> current thread's arena? Not returning it to the original arena leads
> to cache line sharing. Returning it to the original adds some
> overhead, and it might be argued that applications doing this might
> not be considered scalable and should be fixed instead.

Allocated objects can be associated with their parent arenas in  
constant time (by masking bits in the object addresses).  This makes  
it possible to always free memory back to the same arena it came from.
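
In code form, the lookup is on the order of the following (a sketch;
chunk_t, its arena field, and the chunk size value are stand-ins for
the real definitions):

#include <stdint.h>

typedef struct arena_s arena_t;

typedef struct {
	arena_t	*arena;		/* arena that owns this chunk */
	/* ... rest of the chunk header ... */
} chunk_t;

#define CHUNK_SIZE	((uintptr_t)1 << 21)	/* example value only */

/* Chunks are chunk-aligned, so masking low bits finds the header. */
#define CHUNK_ADDR2BASE(p)						\
	((chunk_t *)((uintptr_t)(p) & ~(CHUNK_SIZE - 1)))

static arena_t *
owning_arena(const void *ptr)
{
	/* Constant time: no lock, no tree lookup. */
	return (CHUNK_ADDR2BASE(ptr)->arena);
}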

> * Standalone version
> Would be nice to have a standalone version to test the allocator on
> Solaris and Linux to compare against their allocators. Would be nice
> for all the people with access only to SMP machines running another
> OS. I tested my allocator on both Solaris 9 and FreeBSD 5/6, and it
> was helpful both in testing scalability and impact on small machines
> and in finding bugs in the implementation. I tested my allocator
> against Solaris libc, Hoard, and libumem on a 4-cpu Solaris 9 machine
> (4x UltraSparc IIIi, 1600MHz, 16384MB RAM) at the university, and
> tested my allocator vs the FreeBSD libc on my 400MHz laptop with too
> much beer and an almost non-working CPU fan. Beer as in standing in
> 10cm of beer in my backpack =).
> Not sure how much of the implementation is FreeBSD dependent
> (RB-trees?).

I actually have a separate more portable implementation of jemalloc  
that is part of a programming language runtime library (jemalloc is  
simply a FreeBSD-ized stand-alone version of that allocator).  I've  
tested it on OS X, Linux, FreeBSD, and (IIRC) Solaris.  On Linux I've  
compared against ptmalloc2, dlmalloc, and Hoard.  Unfortunately, that
version of the allocator has to make some performance sacrifices  
regarding locking, since it doesn't have access to the pthreads  
internals.  As such, it requires a lot of care to interpret  
performance results, so I haven't felt it to be worthwhile providing  
that version here.

> * Focus - Single thread vs multithread
> What is the focus of the allocator, i.e. how much memory is it worth
> wasting to get a good, scalable implementation?  This is the reason I
> think Solaris still keeps a serial allocator in libc and makes libumem
> accessible through LD_PRELOAD and friends.

jemalloc only creates two arenas for single-threaded applications  
(one for internal memory management, and one for user requests), so  
it doesn't really sacrifice much in the case of single-threaded  
programs -- a few pages maybe.  In the case of multi-threaded  
programs, it potentially sacrifices memory compactness in order to  
reduce contention and cache line sharing.

> * Comment
> Might be wise to fix the comment about always using mmap since brk can
> be used.

Thanks, done.

Jason