jemalloc 3 performance vs. mozjemalloc

Daniel Micay danielmicay at
Mon Feb 9 22:11:58 PST 2015

On 10/02/15 12:53 AM, Bradley C. Kuszmaul wrote:
> Lock instructions on modern x86 processors aren't really that
> expensive.  What is expensive is lock contention.  When I've measured
> code that does this in a bunch of concurrent threads:
>   1. acquire_lock()
>   2. do_something_really_small_on_thread_local_data()
>   3. release_lock()
> It costs about 1ns to do step 2 with no locks.
> It costs about 5ns to acquire the lock if the lock is thread-local, and
> thus not actually contended.
> It costs about 100ns-200ns if the lock is actually contended.
> I've found that these measurements have changed the way I write
> lock-based code.  For example, I like per-core data structures that need
> a lock, because the per-core lock is almost always uncontended.  (The
> difference between per-core and per-thread shows up only when a thread
> is preempted.)
> -Bradley

A lock prefix *is* very expensive in this context. Locking and unlocking account for up to 50% of the time spent in a fast memory allocator without thread caching, even *without* contention. That's why thread caching is a huge performance win even when the cache is only being filled and flushed with no reuse. For example, building an intrusive list with one million nodes and then freeing the entire thing is ~2x faster with a thread cache on top (with a fast O(1) slab allocator underneath, at least).


More information about the jemalloc-discuss mailing list