jemalloc 3 performance vs. mozjemalloc

Tue Feb 3 16:40:55 PST 2015

On Tue, Feb 03, 2015 at 04:19:00PM -0800, Jason Evans wrote:
> On Feb 3, 2015, at 2:51 PM, Mike Hommey <mh at glandium.org> wrote:
> > I've been tracking a startup time regression in Firefox for Android
> > when we tried to switch from mozjemalloc (memory refresher: it's
> > derived from jemalloc 0.9) to mostly current jemalloc dev.
> > 
> > It turned out to be https://github.com/jemalloc/jemalloc/pull/192
> 
> I intentionally removed the functionality #192 adds back (in
> e3d13060c8a04f08764b16b003169eb205fa09eb), but apparently forgot to
> update the documentation.  Do you have an understanding of why it's
> hurting performance so much?

My understanding is that the huge increase in page faults is making the
difference. On Firefox startup we go from 50k page faults to 35k with
that patch. I can surely double check whether it's really the page
faults, or if it's actually the madvising itself that causes the
regression. Or both.
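
For reference, one way to look at the page-fault side in isolation is to
read the process's fault counters from getrusage() before and after a
workload. This is a minimal sketch assuming a POSIX system. The names
page_faults and faults_for_workload, and the allocate/touch/free loop, are
made up for illustration; this is not the Firefox startup workload.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Total (minor + major) page faults taken by this process so far,
 * or -1 if getrusage() fails. */
static long page_faults(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/* Hypothetical workload: allocate, touch, and free `count` blocks of
 * `size` bytes. Returns the number of page faults incurred while doing
 * so, or -1 on error. */
static long faults_for_workload(size_t count, size_t size)
{
    long before = page_faults();
    if (before < 0)
        return -1;

    for (size_t i = 0; i < count; i++) {
        char *p = malloc(size);
        if (p == NULL)
            return -1;
        memset(p, 1, size);   /* touch the pages so they are really mapped */
        free(p);
    }

    long after = page_faults();
    if (after < 0)
        return -1;
    return after - before;
}
```

Comparing the returned count between two allocator builds only covers the
fault count. It does not separate out the cost of the madvise calls
themselves.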

> I originally implemented that additional
> threshold because dirty page purging happened on a per chunk
> granularity, and I didn't want to spend a bunch of time iterating over
> chunks with very little purgeable memory.  Now that an LRU is used to
> purge page runs (see Qinfan Wu's patches in July 2014), that is
> certainly no longer an issue.  The only way I can think of this change
> hurting Firefox startup time is if there are a bunch of large memory
> usage fluctuations.
> 
> > - Several changesets between 3.6 and current dev made the number of
> > instructions as reported by perf stat on GNU/Linux x86-64 increase
> > significantly, on a ~200k alloc/dealloc testcase that does nothing
> > else[1]: - 5460aa6f6676c7f253bfcb75c028dfd38cae8aaf made the count
> > go from 69M to 76M.
> 
> This is on ARM?

non-Android x86-64, as written above.

> I can't think of a reason this would happen other than register
> pressure (which didn't appear to be an issue on x64), or a failure to
> inline despite all the *ALWAYS_INLINE* macros in the fast path.
> 
> >  - 6ef80d68f092caf3b3802a73b8d716057b41864c from 76M to 81.5M
> 
> This is strictly related to heap profiling, so it should have no
> impact on your test.  Perhaps it's related to binary layout
> randomness?
> 
> >  - 4dcf04bfc03b9e9eb50015a8fc8735de28c23090 from 81.5M to 85M
> 
> Not surprising in the context of high lock acquisition rates (which is
> a problem in itself).
> 
> >  - 155bfa7da18cab0d21d87aa2dce4554166836f5d from 85M to 88M
> 
> This might cause more unused memory coalescing due to lower
> fragmentation.
> 
> >  I didn't investigate further because it was a red herring as far as
> >  the regression I was tracking was concerned.
> 
> Did you also collect elapsed times when you ran the tests?  I ran some
> heavy stress tests a few months ago and measured a substantial
> throughput increase for dev versus 3.6, so I'm curious if the
> instruction count increase you measured exists despite speedups.

Elapsed times didn't seem to vary much, but that's x86-64. ARM would
likely be much more affected by this (and in fact, 3.6 *does* fare
better than current -dev on ARM).

> > - The average number of mutex locks per alloc/dealloc is close to 1
> > with mozjemalloc (1.001), but 1.13 with jemalloc 3 (same testcase as
> > above).  Fortunately, contention is likely lower (I measured it to
> > be lower, but the instrumentation had so much overhead that it may
> > have skewed the results), but pthread_mutex_lock/unlock are not free
> > as far as instruction count is concerned.
> 
> This especially surprises me, and I really want to figure out what's
> going on.
> 
> Is there any chance you can make your test case available so I can dig
> in further?

https://gist.githubusercontent.com/glandium/a42d0265e324688cafc4/raw/gistfile1.c

Yes, that's a big source file. I did that to eliminate other overheads
(we have a tool to replay a log we get out of Firefox, but the tool
itself had overhead that I needed to eliminate during investigation).

Mike