jemalloc 3 performance vs. mozjemalloc

Mon Feb 9 20:32:10 PST 2015

On Wed, Feb 04, 2015 at 10:15:37AM -0800, Jason Evans wrote:
> On Feb 3, 2015, at 4:40 PM, Mike Hommey <mh at glandium.org> wrote:
> > On Tue, Feb 03, 2015 at 04:19:00PM -0800, Jason Evans wrote:
> >> On Feb 3, 2015, at 2:51 PM, Mike Hommey <mh at glandium.org> wrote:
> >>> I've been tracking a startup time regression in Firefox for
> >>> Android when we tried to switch from mozjemalloc (memory
> >>> refresher: it's derived from jemalloc 0.9) to mostly current
> >>> jemalloc dev.
> >>> 
> >>> It turned out to be https://github.com/jemalloc/jemalloc/pull/192
> >> 
> >> I intentionally removed the functionality #192 adds back (in
> >> e3d13060c8a04f08764b16b003169eb205fa09eb), but apparently forgot to
> >> update the documentation.  Do you have an understanding of why it's
> >> hurting performance so much?
> > 
> > My understanding is that the huge increase in page faults is making
> > the difference. On Firefox startup we go from 50k page faults to 35k
> > with that patch. I can surely double check whether it's really the
> > page faults, or if it's actually the madvising itself that causes
> > the regression. Or both.
> > 
> >> Is there any chance you can make your test case available so I can
> >> dig in further?
> > 
> > https://gist.githubusercontent.com/glandium/a42d0265e324688cafc4/raw/gistfile1.c
> 
> I added some logging and determined that ~90% of the dirty page
> purging is happening in the first 2% of the allocation trace.  This
> appears to be almost entirely due to repeated 32 KiB
> allocation/deallocation.

So, interestingly, the repeated allocation/deallocation of 32 KiB
buffers turns out to be a bug that was meant to have been fixed, but
wasn't. Fixing it still leaves a big difference in the number of page
faults (though smaller than before), and the dirty page purging
threshold patch now seems to have less impact than it did...

I haven't analyzed these builds further yet, so I can't really tell much
more at the moment.

> I still have vague plans to add time-based hysteresis mechanisms so
> that #192 isn't necessary, but until then, #192 it is.

Sadly, #192 also makes the RSS footprint bigger when using more than one
arena. With 4 cores, and therefore 16 arenas (the default is 4 per
core), and default 4MB chunks, that's 64MB of memory that won't be
purged. It's not a problem for us because we use 1MB chunks and 1 arena,
but I can see this being a problem with the default settings.
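
To make the arithmetic above concrete, here is a rough sketch of the
estimate. It assumes, as the figures above imply, that each arena can
hold on to at most one chunk's worth of unpurged dirty pages. The
function and parameter names are made up for illustration and are not
part of jemalloc.

```python
# Rough upper bound on memory left unpurged across all arenas.
# Assumption: each arena keeps at most one chunk's worth of dirty pages.
# All names here are hypothetical; none of this is jemalloc API.

MIB = 1024 * 1024


def unpurged_upper_bound(ncpus, chunk_bytes, arenas_per_cpu=4, narenas=None):
    """Return the estimated worst-case unpurged memory, in bytes.

    If narenas is given it overrides the ncpus * arenas_per_cpu default.
    """
    if narenas is None:
        narenas = ncpus * arenas_per_cpu
    return narenas * chunk_bytes


# Default settings on a 4-core machine: 16 arenas, 4 MiB chunks.
default_bytes = unpurged_upper_bound(ncpus=4, chunk_bytes=4 * MIB)

# Firefox-like settings: a single arena, 1 MiB chunks.
firefox_bytes = unpurged_upper_bound(ncpus=4, chunk_bytes=1 * MIB, narenas=1)

print(default_bytes // MIB, "MiB with default settings")
print(firefox_bytes // MIB, "MiB with 1 arena and 1 MiB chunks")
```

The bound grows linearly with the core count, so a machine with more
cores and default settings would see proportionally more.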

FWIW, I also tried removing all the bin mutexes and making the bins use
the arena mutex instead, and, counter-intuitively, it made things
faster. Not by a very significant margin, but it's interesting to note
that the synchronization overhead of n locks can make things slower
than 1 lock with more contention.

IOW, I'm still searching for what's wrong :(

Mike