jemalloc 3 performance vs. mozjemalloc
Bradley C. Kuszmaul
bradley at mit.edu
Mon Feb 9 21:53:57 PST 2015
Lock instructions on modern x86 processors aren't really that expensive.
What is expensive is lock contention. When I've measured code
that does this in a bunch of concurrent threads:
It costs about 1ns to do step 2 with no locks.
It costs about 5ns to acquire the lock if the lock is thread-local, and
thus not actually contended.
It costs about 100ns-200ns if the lock is actually contended.
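For anyone who wants to reproduce numbers of this kind, here is a minimal sketch of such a measurement using pthreads. It is not the code I measured with: the names (run_bench, worker_main, struct slot, the MODE_* constants) are made up for illustration, the 64-byte cache-line size is an assumption, and the "work" is just a counter increment. The timing also includes thread start-up, so use a large iteration count.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical modes: no lock, one lock per thread, one lock for everyone. */
enum mode { MODE_NOLOCK, MODE_PRIVATE_LOCK, MODE_SHARED_LOCK };

/* One slot per thread, aligned to 64 bytes (assumed cache-line size) so
 * that private slots do not share a cache line. */
struct slot {
    _Alignas(64) pthread_mutex_t lock;
    volatile uint64_t counter;
};

struct worker {
    pthread_t tid;
    struct slot *slot;
    enum mode mode;
    uint64_t iters;
};

void *worker_main(void *arg) {
    struct worker *w = arg;
    for (uint64_t i = 0; i < w->iters; i++) {
        if (w->mode != MODE_NOLOCK) pthread_mutex_lock(&w->slot->lock);
        w->slot->counter++;                 /* the small piece of work */
        if (w->mode != MODE_NOLOCK) pthread_mutex_unlock(&w->slot->lock);
    }
    return NULL;
}

/* Returns wall-clock nanoseconds per loop iteration, or -1.0 on allocation
 * failure. If total_out is non-NULL it receives the sum of all counters. */
double run_bench(int nthreads, uint64_t iters, enum mode mode,
                 uint64_t *total_out) {
    assert(nthreads > 0 && iters > 0);
    struct slot *slots = aligned_alloc(64, (size_t)nthreads * sizeof *slots);
    struct worker *workers = calloc((size_t)nthreads, sizeof *workers);
    if (!slots || !workers) { free(slots); free(workers); return -1.0; }
    for (int i = 0; i < nthreads; i++) {
        pthread_mutex_init(&slots[i].lock, NULL);
        slots[i].counter = 0;
    }

    struct timespec t0, t1;
    int started = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++) {
        /* Shared mode: every thread hammers slot 0. Otherwise: own slot. */
        workers[i].slot = (mode == MODE_SHARED_LOCK) ? &slots[0] : &slots[i];
        workers[i].mode = mode;
        workers[i].iters = iters;
        if (pthread_create(&workers[i].tid, NULL, worker_main, &workers[i]))
            break;
        started++;
    }
    for (int i = 0; i < started; i++) pthread_join(workers[i].tid, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    uint64_t total = 0;
    for (int i = 0; i < nthreads; i++) {
        total += slots[i].counter;
        pthread_mutex_destroy(&slots[i].lock);
    }
    if (total_out) *total_out = total;
    free(slots);
    free(workers);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Build with something like cc -O2 -pthread and compare run_bench(4, 10000000, MODE_PRIVATE_LOCK, NULL) against the same call with MODE_SHARED_LOCK and MODE_NOLOCK. The absolute numbers will depend on the machine and the mutex implementation.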
I've found that these measurements have changed the way I write lock-based
code. For example, I like per-core data structures that need a lock,
because the per-core lock is almost always uncontended. (The difference
between per-core and per-thread shows up only when a thread is preempted.)
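A rough sketch of what I mean by a per-core structure that still needs a lock, here a sharded counter. This assumes Linux with glibc (sched_getcpu() is not portable), and NSHARDS, struct shard, sharded_add, sharded_read, and adder_thread are names invented for this example. The lock stays because a thread can be preempted or migrated between reading its CPU number and touching the shard, but in the common case nobody else is holding it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

#define NSHARDS 64  /* assumed upper bound on cores; larger CPU ids wrap */

/* One shard per core, aligned to an assumed 64-byte cache line. */
struct shard {
    _Alignas(64) pthread_mutex_t lock;
    uint64_t count;
};

struct shard shards[NSHARDS];

void shards_init(void) {
    for (int i = 0; i < NSHARDS; i++) {
        pthread_mutex_init(&shards[i].lock, NULL);
        shards[i].count = 0;
    }
}

/* Add to the shard of the core we are (probably) running on. The CPU id is
 * only a hint, so the shard's lock is still taken; it is rarely contended. */
void sharded_add(uint64_t delta) {
    int cpu = sched_getcpu();
    if (cpu < 0) cpu = 0;                   /* fall back to shard 0 on error */
    struct shard *s = &shards[(unsigned)cpu % NSHARDS];
    pthread_mutex_lock(&s->lock);
    s->count += delta;
    pthread_mutex_unlock(&s->lock);
}

/* Slow path: visit every shard and sum. */
uint64_t sharded_read(void) {
    uint64_t total = 0;
    for (int i = 0; i < NSHARDS; i++) {
        pthread_mutex_lock(&shards[i].lock);
        total += shards[i].count;
        pthread_mutex_unlock(&shards[i].lock);
    }
    return total;
}

/* Thread start routine for trying it out: arg points to the number of
 * increments this thread should perform. */
void *adder_thread(void *arg) {
    uint64_t n = *(uint64_t *)arg;
    for (uint64_t i = 0; i < n; i++) sharded_add(1);
    return NULL;
}
```

The trade-off is the usual one: updates are cheap and mostly uncontended, while reading the total has to walk all the shards.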
On Tue, Feb 10, 2015 at 12:16 AM, Daniel Micay <danielmicay at gmail.com> wrote:
> > FWIW, I also tried to remove all the bin mutexes, and make them all use
> > the arena mutex, and, counter-intuitively, it made things faster. Not by
> > a very significant margin, though, but it's interesting to note that the
> > synchronization overheads of n locks can make things slower than 1
> > lock with more contention.
> It's not really surprising. A LOCK prefix on x86 is extremely expensive
> so locks without contention still have an enormous cost, as do many of
> the atomic operations. An atomic_load() with acquire semantics or an
> atomic_store() with release semantics are fast since no LOCK is needed.