jemalloc 3 performance vs. mozjemalloc
danielmicay at gmail.com
Tue Feb 3 15:25:11 PST 2015
On 03/02/15 05:51 PM, Mike Hommey wrote:
> I've been tracking a startup time regression in Firefox for Android when
> we tried to switch from mozjemalloc (memory refresher: it's derived from
> jemalloc 0.9) to mostly current jemalloc dev.
> It turned out to be https://github.com/jemalloc/jemalloc/pull/192 but in
> the process I found a few interesting things that I thought are worth
> mentioning:
> - Several changesets between 3.6 and current dev made the number of
> instructions as reported by perf stat on GNU/Linux x86-64 increase
> significantly, on a ~200k alloc/dealloc testcase that does nothing
> else:
>   - 5460aa6f6676c7f253bfcb75c028dfd38cae8aaf made the count go from
>     69M to 76M.
>   - 6ef80d68f092caf3b3802a73b8d716057b41864c from 76M to 81.5M
>   - 4dcf04bfc03b9e9eb50015a8fc8735de28c23090 from 81.5M to 85M
>   - 155bfa7da18cab0d21d87aa2dce4554166836f5d from 85M to 88M
>   I didn't investigate further because it was a red herring as far as
>   the regression I was tracking was concerned.
> - The average number of mutex locks per alloc/dealloc is close to 1 with
> mozjemalloc (1.001), but 1.13 with jemalloc 3 (same testcase as above).
> Fortunately, contention is likely lower (I measured it to be lower, but
> the instrumentation had so much overhead that it may have skewed the
> results), but pthread_mutex_lock/unlock are not free as far as
> instruction count is concerned.
You can speed up locking/unlocking by ~10-20% by dropping in a lighter
mutex implementation. Here's a simple C11 implementation based on
Drepper's futex paper, for example:
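
The code referenced above is not preserved in this archive copy. As a
stand-in, the following is a minimal sketch of the three-state mutex
described in Drepper's "Futexes Are Tricky", written with C11 atomics.
It assumes Linux (the futex system call) and that atomic_int has the
same layout as int. The futex_mutex type and the futex_mutex_* function
names are hypothetical and are not part of jemalloc or glibc.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* state: 0 = unlocked, 1 = locked, 2 = locked and there may be waiters */
typedef struct {
    atomic_int state;
} futex_mutex;

/* Thin wrapper around the raw futex system call (Linux only). */
static long futex(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void futex_mutex_init(futex_mutex *m) {
    atomic_init(&m->state, 0);
}

/* Returns 1 if the lock was acquired, 0 otherwise. Never blocks. */
static int futex_mutex_trylock(futex_mutex *m) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &m->state, &expected, 1, memory_order_acquire, memory_order_relaxed);
}

static void futex_mutex_lock(futex_mutex *m) {
    int c = 0;
    /* Fast path: uncontended 0 -> 1 transition, no system call. */
    if (atomic_compare_exchange_strong_explicit(
            &m->state, &c, 1, memory_order_acquire, memory_order_relaxed)) {
        return;
    }
    /* c now holds the observed state (1 or 2). Mark the lock contended. */
    if (c != 2) {
        c = atomic_exchange_explicit(&m->state, 2, memory_order_acquire);
    }
    while (c != 0) {
        /* Sleep only while the state is still 2; spurious wakeups are
         * handled by retrying the exchange. */
        futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange_explicit(&m->state, 2, memory_order_acquire);
    }
}

static void futex_mutex_unlock(futex_mutex *m) {
    /* If the old state was 1 there are no waiters and we are done. */
    if (atomic_fetch_sub_explicit(&m->state, 1, memory_order_release) != 1) {
        atomic_store_explicit(&m->state, 0, memory_order_release);
        futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
    }
}
```

The uncontended lock and unlock paths are each a single atomic
operation with no system call, which is where the savings over a
general-purpose pthread mutex would come from.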
It would be easy enough to add (adaptive) spinning to lock/unlock just
like the glibc adaptive mutex that's currently used by jemalloc.
Implementing good load balancing for arenas would greatly reduce the
benefits of fine-grained locking. The best approach that I've come up
with is the following:
* 1 arena per core, rather than 4 arenas per core
* assign the initial threads via round-robin, until each arena is used
* when there are no unused arenas, switch to sched_getcpu()
* store the thread ID of the last thread to allocate in the arena
The algorithm for picking an arena for allocating:
if thread.last_arena.last_allocator == thread.id && trylock() != fail
This results in significantly better load balancing than jemalloc has at
the moment while using 1/4 as many arenas.
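
The arena selection scheme above could be sketched in C roughly as
follows. Only the fast-path condition survives in this archive copy, so
the fallback branch here (round-robin while unused arenas remain, then
sched_getcpu(), then a blocking lock) is an assumption reconstructed
from the bullet list. All names (arena_t, arenas_init,
arena_choose_locked, thread_id) are hypothetical and are not jemalloc's
API. The thread ID is taken to be the address of a thread-local
variable, and pthread mutexes are used only to keep the example short.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Declared explicitly in case _GNU_SOURCE was not set before the first
 * system header was included (glibc extension). */
int sched_getcpu(void);

#define MAX_ARENAS 64

typedef struct {
    pthread_mutex_t lock;
    _Atomic uintptr_t last_allocator; /* ID of the last thread to allocate */
} arena_t;

static arena_t arenas[MAX_ARENAS];
static unsigned narenas;                     /* intended: one per core */
static atomic_uint next_unused;              /* round-robin cursor */
static _Thread_local arena_t *last_arena;    /* arena this thread used last */
static _Thread_local char tls_anchor;        /* its address is the thread ID */

static uintptr_t thread_id(void) {
    return (uintptr_t)&tls_anchor;
}

static void arenas_init(unsigned ncores) {
    narenas = ncores == 0 ? 1 : (ncores > MAX_ARENAS ? MAX_ARENAS : ncores);
    for (unsigned i = 0; i < narenas; i++) {
        pthread_mutex_init(&arenas[i].lock, NULL);
        atomic_init(&arenas[i].last_allocator, 0);
    }
    atomic_store(&next_unused, 0);
}

/* Returns an arena with its lock held; the caller unlocks it. */
static arena_t *arena_choose_locked(void) {
    uintptr_t self = thread_id();
    arena_t *a = last_arena;

    /* Fast path: nobody else has used our arena since we did, and the
     * lock is free. */
    if (a != NULL &&
        atomic_load_explicit(&a->last_allocator, memory_order_relaxed) == self &&
        pthread_mutex_trylock(&a->lock) == 0) {
        return a;
    }

    arena_t *picked = NULL;
    /* New thread: hand out unused arenas round-robin while any remain. */
    if (a == NULL && atomic_load(&next_unused) < narenas) {
        unsigned i = atomic_fetch_add(&next_unused, 1);
        if (i < narenas) {
            picked = &arenas[i];
        }
    }
    /* Otherwise (assumed fallback) use the arena of the current CPU. */
    if (picked == NULL) {
        int cpu = sched_getcpu();
        picked = &arenas[cpu < 0 ? 0 : (unsigned)cpu % narenas];
    }

    pthread_mutex_lock(&picked->lock);
    atomic_store_explicit(&picked->last_allocator, self, memory_order_relaxed);
    last_arena = picked;
    return picked;
}
```

A caller would run arenas_init() once with the number of cores (for
example from sysconf(_SC_NPROCESSORS_ONLN)), then wrap each allocation
in arena_choose_locked() and pthread_mutex_unlock(&arena->lock).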