jemalloc 3 performance vs. mozjemalloc

Tue Feb 3 16:19:00 PST 2015

On Feb 3, 2015, at 2:51 PM, Mike Hommey <mh at glandium.org> wrote:
> I've been tracking a startup time regression in Firefox for Android when
> we tried to switch from mozjemalloc (memory refresher: it's derived from
> jemalloc 0.9) to mostly current jemalloc dev.
> 
> It turned out to be https://github.com/jemalloc/jemalloc/pull/192

I intentionally removed the functionality #192 adds back (in e3d13060c8a04f08764b16b003169eb205fa09eb), but apparently forgot to update the documentation.  Do you have an understanding of why it's hurting performance so much?  I originally implemented that additional threshold because dirty page purging happened at per-chunk granularity, and I didn't want to spend a bunch of time iterating over chunks with very little purgeable memory.  Now that an LRU is used to purge page runs (see Qinfan Wu's patches in July 2014), that is certainly no longer an issue.  The only way I can think of this change hurting Firefox startup time is if there are a bunch of large memory usage fluctuations.
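
For anyone following along, here is a rough sketch of the kind of check under discussion.  It is not jemalloc's actual code: the function and parameter names are hypothetical, and it assumes the additional threshold acts as a floor of one chunk's worth of dirty pages on top of the active:dirty ratio.  Passing 0 for the floor models the behavior with the extra threshold removed.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical illustration only, not jemalloc's implementation.
 * Assumption: purging starts once the dirty page count exceeds the larger of
 *   (a) nactive >> lg_dirty_mult  (the active:dirty ratio limit), and
 *   (b) min_dirty_pages           (e.g. one chunk's worth of pages).
 */
static bool
should_purge(size_t nactive, size_t ndirty, unsigned lg_dirty_mult,
    size_t min_dirty_pages)
{
	size_t threshold = nactive >> lg_dirty_mult;

	/* Apply the floor so small arenas do not purge on every free. */
	if (threshold < min_dirty_pages)
		threshold = min_dirty_pages;
	return (ndirty > threshold);
}
```

With a small arena (say 800 active pages, 200 dirty, lg_dirty_mult of 3), the ratio alone gives a limit of 100 pages and purging triggers; with a 512 page floor it does not.  That is where large usage fluctuations could translate into extra purging work.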

> - Several changesets between 3.6 and current dev made the number of
>  instructions as reported by perf stat on GNU/Linux x86-64 increase
>  significantly, on a ~200k alloc/dealloc testcase that does nothing
>  else[1]:
>  - 5460aa6f6676c7f253bfcb75c028dfd38cae8aaf made the count go from
>  69M to 76M.

This is on ARM?  I can't think of a reason this would happen other than register pressure (which didn't appear to be an issue on x64), or a failure to inline despite all the *ALWAYS_INLINE* macros in the fast path.

>  - 6ef80d68f092caf3b3802a73b8d716057b41864c from 76M to 81.5M

This is strictly related to heap profiling, so it should have no impact on your test.  Perhaps it's related to binary layout randomness?

>  - 4dcf04bfc03b9e9eb50015a8fc8735de28c23090 from 81.5M to 85M

Not surprising in the context of high lock acquisition rates (which is a problem in itself).

>  - 155bfa7da18cab0d21d87aa2dce4554166836f5d from 85M to 88M

This might cause more unused memory coalescing due to lower fragmentation.

>  I didn't investigate further because it was a red herring as far as
>  the regression I was tracking was concerned.

Did you also collect elapsed times when you ran the tests?  I ran some heavy stress tests a few months ago and measured a substantial throughput increase for dev versus 3.6, so I'm curious if the instruction count increase you measured exists despite speedups.
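
Something along these lines would be enough to get wall-clock numbers next to the perf stat instruction counts.  This is only a sketch: I don't have your testcase, so the loop below is a stand-in (a plain malloc/free pair per iteration), and the function names are made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() */
#endif
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Difference between two timespecs, in seconds. */
static double
seconds_between(struct timespec start, struct timespec end)
{
	return ((double)(end.tv_sec - start.tv_sec) +
	    (double)(end.tv_nsec - start.tv_nsec) / 1e9);
}

/*
 * Stand-in workload (an assumption, not the real testcase): time
 * `iterations` malloc/free pairs of `size` bytes.  Returns elapsed
 * seconds, or -1.0 on a bad argument, allocation or clock failure.
 */
static double
time_alloc_dealloc(size_t iterations, size_t size)
{
	struct timespec start, end;
	size_t i;

	if (size == 0 || clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return (-1.0);
	for (i = 0; i < iterations; i++) {
		char *p = malloc(size);

		if (p == NULL)
			return (-1.0);
		/* Volatile store keeps the pair from being optimized away. */
		*(volatile char *)p = (char)i;
		free(p);
	}
	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return (-1.0);
	return (seconds_between(start, end));
}
```

Calling time_alloc_dealloc(200000, 64) under each allocator and comparing the results alongside the instruction counts would show whether the two metrics actually move together.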

> - The average number of mutex lock per alloc/dealloc is close to 1 with
>  mozjemalloc (1.001), but 1.13 with jemalloc 3 (same testcase as above).
>  Fortunately, contention is likely lower (I measured it to be lower, but
>  the instrumentation had so much overhead that it may have skewed the
>  results), but pthread_mutex_lock/unlock are not free as far as
>  instruction count is concerned.

This especially surprises me, and I really want to figure out what's going on.

Is there any chance you can make your test case available so I can dig in further?

Thanks,
Jason