jemalloc 3 performance vs. mozjemalloc

Mike Hommey mh at
Wed Feb 4 01:51:53 PST 2015

On Tue, Feb 03, 2015 at 07:38:49PM -0800, Matthew Hall wrote:
> On Wed, Feb 04, 2015 at 11:54:32AM +0900, Mike Hommey wrote:
> > *sigh* and sadly, this doesn't fix it all :(
> Can you compile all the tests with debug symbols, and use perf top et. al. to 
> stack-sample the top-executing functions inside the test-case processes? 

I wish it were that easy. We're talking about Android and ARM. This
means not much control over the kernel used, which may or may not have
perf counters enabled, and may or may not have everything necessary for
perf to return something interesting. For instance, I can't get perf to
save a profile it can read back without segfaulting when attaching it
to an existing process, for whatever reason. It works fine when perf is
starting the process, but starting an android process with perf means
having the zygote wrap the process, which in itself changes the runtime
conditions. And even trying that didn't work, even after disabling
selinux because it was preventing perf from reading some /proc files.

This also means at most 1 sample per millisecond, when we're talking
about 200k+ alloc/dealloc happening in the time frame of 3 seconds,
where those alloc/dealloc take something between 100ms and 500ms (hard to
tell precisely, but taken out of context in the separate testcase, they
do take around 180ms with jemalloc3 and 120ms with mozjemalloc, but
measuring in context with clock_gettime gives bigger numbers (and using
clock_gettime adds a huge overhead)).

The problem here is that very low-level things are involved, and taking
a test-case out of context changes the very conditions of those low-level
things. As I wrote in the start of this thread, I did see things out of
context that may or may not matter, and it turns out issue #192 is the
one thing that made the most difference because of its impact on page
faults. All the other things I did, like tweaking the config (enabling
tcache, adjusting lg_dirty_mult, ...), returning to 3.6 instead of dev,
etc. didn't do much.

With that being said, it /seems/ I'm getting close to mozjemalloc results
with #192 + 3.6, and it seems I'm getting better results than
mozjemalloc with #192 + tcache on dev (while tcache alone on dev didn't
make a big difference). *But* those results are also to be taken with a
grain of salt because I don't have results for all kinds of devices and
I've had very different response to changes on different devices (yeah,
one more benefit of low-level things being involved, depending on cpu,
memory, etc. results can be completely different; that's also why
mutexes and page faults are tempting candidates)


More information about the jemalloc-discuss mailing list