jemalloc 3 performance vs. mozjemalloc

Bradley C. Kuszmaul bradley at
Mon Feb 9 21:53:57 PST 2015

Lock instructions on modern x86 processors aren't really that expensive.
What is expensive is lock contention.  When I've measured code that does
the following in a bunch of concurrent threads:
  1. acquire_lock()
  2. do_something_really_small_on_thread_local_data()
  3. release_lock()

It costs about 1ns to do step 2 with no locks.
It costs about 5ns to acquire the lock if the lock is thread-local, and
thus not actually contended.
It costs about 100ns-200ns if the lock is actually contended.

These measurements have changed the way I write lock-based code.  For
example, when a data structure needs a lock, I like to make it per-core,
because a per-core lock is almost always uncontended.  (The difference
between per-core and per-thread shows up only when a thread is preempted.)


On Tue, Feb 10, 2015 at 12:16 AM, Daniel Micay <danielmicay at> wrote:

> > FWIW, I also tried to remove all the bin mutexes and make them all use
> > the arena mutex, and, counter-intuitively, it made things faster.  Not
> > by a significant margin, but it's interesting to note that the
> > synchronization overhead of n locks can make things slower than one
> > lock with more contention.
> It's not really surprising. A LOCK prefix on x86 is extremely expensive,
> so locks without contention still have an enormous cost, as do many of
> the atomic operations. An atomic_load() with acquire semantics or an
> atomic_store() with release semantics is fast, since no LOCK is needed.
> _______________________________________________
> jemalloc-discuss mailing list
> jemalloc-discuss at
