jemalloc 3 performance vs. mozjemalloc

Mon Feb 9 22:33:47 PST 2015

Perhaps I am tone deaf. "extremely expensive" sounds, to my ear, worse than
the reality that I've experienced.  But one persons "extreme" might be
another's "moderate".  Rather than use that kind of subjective adjective,
I've tried to give numbers, and it sounds like we are basically in
agreement about the costs.

To me, what is interesting about these numbers is that to overcome the lock
overhead, a thread cache can be small.  (There I go with a subjective
adjective again...)  For example, if the locked version is 5x as expensive
as the unlocked, then a thread cache that holds 5 objects will reduce the
effective cost to 2x the unlocked cache, in the worst case.  A thread cache
that holds 20 objects reduces the worst-case overhead to 25%.  In some
allocators, the thread cache is much larger (perhaps thousands of objects)
because it is trying to overcome the 200ns contended-lock overhead instead
of the 5ns uncontended-lock overhead.  On modern x86 processors, it isn't
always necessary to take heroic measures to avoid locks 99.9% of the time.
It's sometimes good enough to avoid locks in something like 80% of the
cases.

One way to make such a thread cache work is to maintain a doubly-linked
list so that when the lock *is* acquired (e.g., when the thread cache fills
up on a free()), all 5 objects in the thread cache can be moved to the
per-core cache with a constant number of memory references.  One also needs
a little hysteresis: so one might use two doubly-linked lists in the
per-thread cache, and an array of doubly-linked lists (each containing 5
objects) in the per-core cache.  When calling free(), if both doubly-linked
lists are full, then move one of them to the per-core cache.  When calling
malloc(), if both doubly-linked lists are empty, then get a collection of 5
objects from the per-core cache.  Similarly, when the per-core cache gets
too full, a group of 5 objects can be moved from a per-core cache to a
global cache using only O(1) operations, and with a little hysteresis, that
operation will seldom occur.

I'm curious.  How big is the thread cache in jemalloc?

-Bradley

On Tue, Feb 10, 2015 at 1:11 AM, Daniel Micay <danielmicay at gmail.com> wrote:

> On 10/02/15 12:53 AM, Bradley C. Kuszmaul wrote:
> > Lock instructions on modern x86 processors aren't really that
> > expensive.  What is expensive is lock contention.  When I've measured
> > something code that does this in a bunch of concurrent threads:
> >   1. acquire_lock()
> >   2. do_something_really_small_on_thread_local_data()
> >   3. release_lock()
> >
> > It costs about 1ns to do step 2 with no locks.
> > It costs about 5ns to acquire the lock if the lock is thread-local, and
> > thus not actually contended.
> > It costs about 100ns-200ns if the lock is actually contended.
> >
> > I've found that these measurements have changed the way I write
> > lock-based code.  For example, I like per-core data structures that need
> > a lock, because the per-core lock is almost always uncontended.  (The
> > difference between per-core and per-thread shows up only when a thread
> > is preempted.)
> >
> > -Badley
>
> A lock prefix *is* very expensive in this context. The cost of locking
> and unlocking is where up to 50% of the time is spent in a fast memory
> allocator without thread caching, *without* contention. It's why thread
> caching results in a huge performance win even when it's only being
> filled and flushed with no reuse. For example, making an intrusive list
> with one million nodes and then freeing the entire thing is ~2x faster
> with a thread cache on top (with a fast O(1) slab allocator at least).
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://jemalloc.net/mailman/jemalloc-discuss/attachments/20150210/46ffea03/attachment-0001.html>