jemalloc 3 performance vs. mozjemalloc

Daniel Micay danielmicay at gmail.com
Tue Feb 3 15:25:11 PST 2015


On 03/02/15 05:51 PM, Mike Hommey wrote:
> Hi,
> 
> I've been tracking a startup time regression in Firefox for Android when
> we tried to switch from mozjemalloc (memory refresher: it's derived from
> jemalloc 0.9) to mostly current jemalloc dev.
> 
> It turned out to be https://github.com/jemalloc/jemalloc/pull/192 but in
> the process I found a few interesting things that I thought were worth
> mentioning:
> 
> - Several changesets between 3.6 and current dev made the number of
>   instructions as reported by perf stat on GNU/Linux x86-64 increase
>   significantly, on a ~200k alloc/dealloc testcase that does nothing
>   else[1]:
>   - 5460aa6f6676c7f253bfcb75c028dfd38cae8aaf made the count go from
>   69M to 76M.
>   - 6ef80d68f092caf3b3802a73b8d716057b41864c from 76M to 81.5M
>   - 4dcf04bfc03b9e9eb50015a8fc8735de28c23090 from 81.5M to 85M
>   - 155bfa7da18cab0d21d87aa2dce4554166836f5d from 85M to 88M
>   I didn't investigate further because it was a red herring as far as
>   the regression I was tracking was concerned.
> 
> - The average number of mutex locks per alloc/dealloc is close to 1 with
>   mozjemalloc (1.001), but 1.13 with jemalloc 3 (same testcase as above).
>   Fortunately, contention is likely lower (I measured it to be lower, but
>   the instrumentation had so much overhead that it may have skewed the
>   results), but pthread_mutex_lock/unlock are not free as far as
>   instruction count is concerned.
> 
> Cheers,
> 
> Mike

You can speed up locking/unlocking by ~10-20% by dropping in a lighter
mutex implementation. Here's a simple C11 implementation based on
Drepper's futex paper, for example:

https://github.com/thestinger/allocator/blob/master/mutex.h
https://github.com/thestinger/allocator/blob/master/mutex.c
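
For illustration, the core of that kind of lock is small. Here's a rough
sketch in the same spirit (Linux-only, C11 atomics; three states as in
Drepper's paper: 0 = unlocked, 1 = locked, 2 = locked with waiters). It
is only an illustration of the technique, not a copy of the mutex.c
linked above:

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef atomic_int mutex_t;   /* 0 = unlocked, 1 = locked, 2 = waiters */
#define MUTEX_INITIALIZER 0

static long sys_futex(atomic_int *uaddr, int op, int val) {
    /* the kernel treats the atomic_int as a plain 32-bit futex word */
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void mutex_lock(mutex_t *m) {
    int c = 0;
    /* fast path: uncontended 0 -> 1 transition */
    if (atomic_compare_exchange_strong_explicit(m, &c, 1,
            memory_order_acquire, memory_order_relaxed))
        return;
    /* slow path: advertise contention (2), then sleep until we manage
     * to swap in 2 while the lock was free (previous value 0) */
    if (c != 2)
        c = atomic_exchange_explicit(m, 2, memory_order_acquire);
    while (c != 0) {
        sys_futex(m, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange_explicit(m, 2, memory_order_acquire);
    }
}

static void mutex_unlock(mutex_t *m) {
    /* only pay for the futex syscall if a waiter may have recorded
     * itself by setting the state to 2 */
    if (atomic_exchange_explicit(m, 0, memory_order_release) == 2)
        sys_futex(m, FUTEX_WAKE_PRIVATE, 1);
}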

It would be easy enough to add (adaptive) spinning to lock/unlock just
like the glibc adaptive mutex that's currently used by jemalloc.
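
A bounded spin in the slow path could look roughly like the sketch
below. SPIN_LIMIT is an arbitrary illustrative constant; an adaptive
mutex like glibc's tunes the bound based on how much spinning previous
acquisitions needed:

/* Same helpers as in the previous sketch. */
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef atomic_int mutex_t;

static long sys_futex(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

#define SPIN_LIMIT 100   /* arbitrary bound for illustration */

static void mutex_lock_spinning(mutex_t *m) {
    int c = 0;
    /* fast path, as before */
    if (atomic_compare_exchange_strong_explicit(m, &c, 1,
            memory_order_acquire, memory_order_relaxed))
        return;
    /* spin on plain loads, retrying the CAS only when the lock looks
     * free, before paying for a futex wait */
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (atomic_load_explicit(m, memory_order_relaxed) == 0) {
            c = 0;
            if (atomic_compare_exchange_weak_explicit(m, &c, 1,
                    memory_order_acquire, memory_order_relaxed))
                return;
        }
    }
    /* contended slow path, identical to the previous sketch */
    if (c != 2)
        c = atomic_exchange_explicit(m, 2, memory_order_acquire);
    while (c != 0) {
        sys_futex(m, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange_explicit(m, 2, memory_order_acquire);
    }
}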

Implementing good load balancing across arenas would greatly reduce the
benefit of fine-grained locking. The best approach I've come up with is
the following:

* 1 arena per core, rather than 4 arenas per core
* assign the initial threads via round-robin, until each arena is used
* when there are no unused arenas, switch to sched_getcpu()
* store the thread ID of the last thread to allocate in the arena

The algorithm for picking an arena for allocating:

if thread.last_arena.last_allocator == thread.id and
   trylock(thread.last_arena) succeeds:
    pass
else:
    thread.last_arena = pick_arena_with_sched_getcpu()
    lock(thread.last_arena)
thread.last_arena.last_allocator = thread.id

This results in significantly better load balancing than jemalloc has at
the moment while using 1/4 as many arenas.
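
In C, that scheme might look something like the sketch below. All of the
names here (arena_t, arenas[], narenas, the per-thread id and last_arena)
are made-up scaffolding for illustration rather than jemalloc internals,
and the initial round-robin assignment of threads to arenas is omitted:

#define _GNU_SOURCE     /* for sched_getcpu() */
#include <sched.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct arena_s {
    pthread_mutex_t lock;
    atomic_uint last_allocator;   /* id of last thread to allocate here */
    /* ... allocator state ... */
} arena_t;

extern arena_t arenas[];          /* one arena per core */
extern unsigned narenas;

static atomic_uint thread_id_counter = 1;
static __thread unsigned thread_id;     /* 0 means "not assigned yet" */
static __thread arena_t *last_arena;    /* arena this thread last used */

/* Pick an arena for an allocation and return it with its lock held. */
static arena_t *arena_choose_and_lock(void) {
    arena_t *a = last_arena;

    if (thread_id == 0)
        thread_id = atomic_fetch_add(&thread_id_counter, 1);

    /* Fast path: if no other thread allocated from our previous arena
     * since we did, and its lock is free right now, keep using it. */
    if (a != NULL &&
        atomic_load_explicit(&a->last_allocator,
                             memory_order_relaxed) == thread_id &&
        pthread_mutex_trylock(&a->lock) == 0) {
        /* keep using a */
    } else {
        /* Otherwise fall back to the arena of the CPU we're running on. */
        a = &arenas[(unsigned)sched_getcpu() % narenas];
        pthread_mutex_lock(&a->lock);
    }

    atomic_store_explicit(&a->last_allocator, thread_id,
                          memory_order_relaxed);
    last_arena = a;
    return a;   /* caller allocates from a, then unlocks a->lock */
}

The unlocked read of last_allocator is only a hint, so a relaxed atomic
is enough; the worst case is a wasted trylock or an unnecessary fall
back to sched_getcpu().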
