Some jemalloc questions...

Fri Apr 11 09:20:09 PDT 2014

On Apr 11, 2014, at 12:43 AM, Paul Pedriana <ppedriana at gmail.com> wrote:
> I've been reading through the jemalloc source in order to understand how it works, and I have a few questions.
> 
> What are bitmap levels (and groups)? The bitmap functionality seems to be a little different from a basic 1 dimensional array of bits and has some concept of levels and groups. But I don't understand what that's for and why it exists.

The bitmap levels implement a tree-like structure that allows O(lg N) searches for the lowest unset bit (see bitmap_sfu()).  Older versions of jemalloc used some heuristics to amortize the cost of searches, but some realistic allocation/deallocation patterns thwarted the heuristics.

> Is the reason for having malloc_printf, etc. to have a printf that won't call malloc internally?

As you say, if printf() were to allocate internally, that would cause problems, though in practice printf() tends not to allocate much for performance reasons.  The primary driver for malloc_printf() is that on FreeBSD, jemalloc is used as an integral part of libc, and if jemalloc were to use printf(), it would add run-time dependencies on all of the printf() machinery, including a bunch of floating point code.  Older versions of jemalloc used malloc_write() for everything that couldn't be omitted, and malloc_printf() (which was a thin wrapper around printf()) for e.g. statistics printing.  For the 3.0.0 release I implemented a stand-alone malloc_printf() so that it could be used everywhere in jemalloc.

> I see #define LG_TINY_MIN 3 . Does this mean that these allocations are minimally 8 byte aligned? I ask because I'm under the impression that some platforms (e.g. Apple) require 16 byte minimum allocator alignment.

My vague recollection of what the C standard says indicates that all x86_64 systems should require at least 16-byte alignment, because otherwise 128-bit vector code is a pain to write.  That's not what we get on Linux though, and now that we're seeing vector sizes on the order of 256 and 512 bits, maybe it's reasonable to provide default alignment smaller than that needed by the largest fundamental types.  jemalloc effectively implements 16-byte alignment, because 8-byte allocations are considered "tiny", and only tiny allocations have weaker alignment than the quantum (see LG_QUANTUM), which is 16 for most of the platforms jemalloc supports.  There's never any call to access an 8-byte allocation with an instruction that requires 16-byte alignment (unless compiler builtins like memcpy() make alignment assumptions), so this works out fine in practice.

> What is prof promote/demote?

jemalloc implements heap profiling, which under normal circumstances samples a representative subset of all allocations, and any small allocations (less than 1 page) are "promoted" to 1 page, so that profiling metadata can be stored in the chunk map.  However, the memory cost of promoting small objects becomes very high as the sample rate increases.  jemalloc currently implements an alternative bookkeeping strategy for small allocations when using a high sampling rate; metadata are stored in the page run header, just after the bitmap that tracks allocated/deallocated status.  As it happens, I'm ripping that alternative strategy out today, as part of some optimization and refactoring work for the 4.0.0 release.  Although it's possible to semi-efficiently track all allocations and perform comprehensive leak detection, in practice Valgrind is a reasonable alternative for this use case, because the performance overhead of capturing backtraces for every allocation is substantial enough that it isn't realistic to use jemalloc in this mode unless willing to accept a slowdown on the order of what Valgrind causes anyway.  Backtracing is expensive in the absence of frame pointers, which I didn't have a full appreciation for until after I'd already implemented the non-promoting bookkeeping alternative.

> What is the defined behavior of mb_write? Similarly what's the expected memory consistency behavior of the atomic functions (e.g. atomic_add_uint64)?

mb_write() implements a typical write barrier, which can be thought of as a barrier across which write operations cannot be moved.  It's used in jemalloc to perform lock-free multi-phase commit.  The atomic_*() functions have semantics equivalent to what can be built on top of compare-exchange, and they are used for updating/reading simple counters.

Jason
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://jemalloc.net/mailman/jemalloc-discuss/attachments/20140411/9f1fbe77/attachment.html>