jemalloc 3 performance vs. mozjemalloc
danielmicay at gmail.com
Mon Feb 9 21:16:45 PST 2015
On 09/02/15 11:32 PM, Mike Hommey wrote:
> On Wed, Feb 04, 2015 at 10:15:37AM -0800, Jason Evans wrote:
>> On Feb 3, 2015, at 4:40 PM, Mike Hommey <mh at glandium.org> wrote:
>>> On Tue, Feb 03, 2015 at 04:19:00PM -0800, Jason Evans wrote:
>>>> On Feb 3, 2015, at 2:51 PM, Mike Hommey <mh at glandium.org> wrote:
>>>>> I've been tracking a startup time regression in Firefox for
>>>>> Android when we tried to switch from mozjemalloc (memory
>>>>> refresher: it's derived from jemalloc 0.9) to mostly current
>>>>> jemalloc dev.
>>>>> It turned out to be https://github.com/jemalloc/jemalloc/pull/192
>>>> I intentionally removed the functionality #192 adds back (in
>>>> e3d13060c8a04f08764b16b003169eb205fa09eb), but apparently forgot to
>>>> update the documentation. Do you have an understanding of why it's
>>>> hurting performance so much?
>>> My understanding is that the huge increase in page faults is making
>>> the difference. On Firefox startup we go from 50k page faults to 35k
>>> with that patch. I can surely double check whether it's really the
>>> page faults, or if it's actually the madvising itself that causes
>>> the regression. Or both.
>>>> Is there any chance you can make your test case available so I can
>>>> dig in further?
>> I added some logging and determined that ~90% of the dirty page
>> purging is happening in the first 2% of the allocation trace. This
>> appears to be almost entirely due to repeated 32 KiB
> So, interestingly, this appears to be a bug that was intended to have
> been fixed, but wasn't (the repeated allocation/deallocation of 32 KiB
> buffers). Fixing that still leaves a big difference in the number of
> page faults (though lower than before), but now the dirty page purging
> threshold patch seems to have less impact than it did...
I think jemalloc now uses FIFO ordering for purging, and that may not
play well with the preference for reusing low addresses under some
allocation patterns.
FWIW, if you have large spans of memory that you know you are going to
use, you can greatly reduce the cost of page faults by pre-faulting with
MADV_WILLNEED instead of the regular lazy commit.
> I haven't analyzed these builds further yet, so I can't really tell much
> more at the moment.
>> I still have vague plans to add time-based hysteresis mechanisms so
>> that #192 isn't necessary, but until then, #192 it is.
> Sadly, #192 also makes the RSS footprint bigger when using more than one
> arena. With 4 cores, so 16 arenas, and default 4MB chunks, that's 64MB
> of memory that won't be purged. It's not a problem for us because we use
> 1MB chunks and 1 arena, but I can see this being a problem with the
> default settings.
This would be a lot less bad with the per-core arena design, since it
would be an n_cores multiplier instead of n_cores * 4.
> FWIW, I also tried to remove all the bin mutexes, and make them all use
> the arena mutex, and, counter-intuitively, it made things faster. Not by
> a very significant margin, though, but it's interesting to note that the
> synchronization overheads of n locks can make things slower than 1
> lock with more contention.
It's not really surprising. A LOCK prefix on x86 is extremely expensive,
so even uncontended locks have an enormous cost, as do many of the
atomic operations. An atomic_load() with acquire semantics or an
atomic_store() with release semantics is fast since no LOCK is needed.