I have a rather strong objection to this proposal (read: if this change goes in, I'm going to have to go through the effort of ripping it out locally...).

There is a problem right now, localized to i386 and any other arch with 32-bit pointers: address space is simply too scarce. Your decision to use mmap as the exclusive source of malloc buckets is admirable for its modernity, but it cannot stand unless someone steps up to change the way mmap and brk interact within the kernel. The trouble arises from the need to set MAXDSIZ and its effect on where the mmap region starts, which, I might add, is also where the shared library loader is placed. That boundary effectively (and explicitly) limits how large a contiguous region can be allocated with brk. By switching the system malloc to mmap exclusively, you give the sysadmin a strong incentive to push that brk/mmap boundary down. That wouldn't be a problem, except that it shoots dozens of alternative C malloc implementations in the foot, not to mention the memory allocator routines used in obscure languages such as Modula-3 and Haskell that rely on brk-derived buckets. This isn't playing very nicely!

I looked into the issues and limitations of phkmalloc several months ago and concluded that simply adopting ptmalloc2 (the Linux malloc) was the better approach: notably, it is willing to draw from both brk and mmap, and it also implements per-thread arenas.

There is also cause for concern about your "cache-line" business. On the face of it, the scheduler does not do a good job of pinning threads to individual CPUs; threads already bounce from CPU to CPU and thrash (really thrash) each CPU cache along the way. Second, you've forgotten that there is a layer of indirection between your address space and the cache: the mapping of logical pages (what you can see in userspace) to physical pages (the addresses that actually matter to the cache). I don't recall off-hand whether the L1 cache on i386 is virtually tagged, but I am certain that the L2 and L3 caches tag physical addresses, not virtual addresses. This means your careful cache-line-based address selection will only work out if it is done in the VM code path: remember, the assignment of physical pages to the virtual addresses that come back from mmap can be delayed arbitrarily far into the future, until the program actually touches that memory. Furthermore, the answer may vary depending on the architecture or even the processor revision.
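
To put some numbers behind the brk/mmap point: a quick-and-dirty program along these lines (untested, written from memory; the exact figures depend entirely on how MAXDSIZ is set and where the first mapping lands) will show you how little headroom brk actually has on a 32-bit box:

/*
 * Rough illustration only: print the current break and the address of a
 * fresh mmap()ed region.  On i386 the gap between the two is bounded by
 * MAXDSIZ, which is what limits how far brk-based allocators can grow.
 */
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	void *brk_now = sbrk(0);
	void *map = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);

	if (map == MAP_FAILED) {
		perror("mmap");
		return (1);
	}
	printf("current break: %p\n", brk_now);
	printf("mmap region:   %p\n", map);
	printf("room left for brk: roughly %ld MB\n",
	    (long)(((char *)map - (char *)brk_now) / (1024L * 1024L)));
	return (0);
}

Run it once with the default MAXDSIZ and once with the limit pushed up, and you will see exactly the tension I'm describing.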
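
And so there's no confusion about what I mean by per-thread arenas, the idea is roughly this (a toy sketch of the concept, not ptmalloc2's actual code; all of the names here are made up):

/*
 * Toy sketch of per-thread arenas.  Each arena has its own lock (and, in a
 * real allocator, its own free lists and chunks), and a thread hashes
 * itself to an arena so that concurrent malloc/free traffic mostly avoids
 * contending on a single global lock.
 */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NARENAS 16

struct arena {
	pthread_mutex_t	lock;
	/* per-arena free lists / chunk headers would live here */
};

static struct arena arenas[NARENAS];

void
arenas_init(void)			/* call once at startup */
{
	int i;

	for (i = 0; i < NARENAS; i++)
		pthread_mutex_init(&arenas[i].lock, NULL);
}

struct arena *
arena_for_thread(void)
{
	/* A real allocator would cache this in thread-specific data. */
	return (&arenas[(uintptr_t)pthread_self() % NARENAS]);
}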
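
As for the delayed mapping, you don't have to take my word for it: something along these lines (again a rough, untested sketch) uses mincore(2) to show that the pages mmap hands back have no physical backing until the program touches them, which means the physical placement that matters to a physically-tagged cache is decided long after malloc has done its careful address arithmetic:

/*
 * Demonstrates demand paging: mincore() reports freshly mmap()ed anonymous
 * pages as non-resident, and only reports them resident after we write to
 * them and force the kernel to assign physical frames.
 */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	size_t len = 4 * getpagesize();
	char vec[4];
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return (1);
	}
	mincore(p, len, vec);
	printf("resident before touch: %d %d %d %d\n",
	    vec[0] & 1, vec[1] & 1, vec[2] & 1, vec[3] & 1);

	memset(p, 0, len);		/* fault the pages in */

	mincore(p, len, vec);
	printf("resident after touch:  %d %d %d %d\n",
	    vec[0] & 1, vec[1] & 1, vec[2] & 1, vec[3] & 1);
	return (0);
}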
-Jon

On Mon, 28 Nov 2005, Jason Evans wrote:

> There is a patch that contains a new libc malloc implementation at:
>
> http://www.canonware.com/~jasone/jemalloc/jemalloc_20051127a.diff
>
> This implementation is very different from the current libc malloc.
> Probably the most important difference is that this one is designed
> with threads and SMP in mind.
>
> The patch has been tested for stability quite a bit already, thanks
> mainly to Kris Kennaway. However, any help with performance testing
> would be greatly appreciated. Specifically, I'd like to know how
> well this malloc holds up to threaded workloads on SMP systems. If
> you have an application that relies on threads, please let me know
> how performance is affected.
>
> Naturally, if you notice horrible performance or ridiculous resident
> memory usage, that's a bad thing and I'd like to hear about it.
>
> Thanks,
> Jason
>
> === Important notes:
>
> * You need to do a full buildworld/installworld in order for the
> patch to work correctly, due to various integration issues with the
> threads libraries and rtld.
>
> * The virtual memory size of processes, as reported in the SIZE field
> by top, will appear astronomical for almost all processes (32+ MB).
> This is expected; it is merely an artifact of using large mmap()ed
> regions rather than sbrk().
>
> * In keeping with the default option settings for CURRENT, the A and
> J flags are enabled by default. When conducting performance tests,
> specify MALLOC_OPTIONS="aj".
>
> _______________________________________________
> freebsd-current_at_freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe_at_freebsd.org"