jemalloc 3 performance vs. mozjemalloc

Mon Feb 9 22:00:42 PST 2015

On Tue, Feb 10, 2015 at 12:53:57AM -0500, Bradley C. Kuszmaul wrote:
> Lock instructions on modern x86 processors aren't really that expensive.
> What is expensive is lock contention.  When I've measured code
> that does this in a bunch of concurrent threads:
>   1. acquire_lock()
>   2. do_something_really_small_on_thread_local_data()
>   3. release_lock()
> 
> It costs about 1ns to do step 2 with no locks.
> It costs about 5ns to acquire the lock if the lock is thread-local, and
> thus not actually contended.
> It costs about 100ns-200ns if the lock is actually contended.
> 
> I've found that these measurements have changed the way I write lock-based
> code.  For example, I like per-core data structures that need a lock,
> because the per-core lock is almost always uncontended.  (The difference
> between per-core and per-thread shows up only when a thread is preempted.)

... except I'm talking about ARM, and ARM has very different performance
properties.
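
For anyone who wants to reproduce the shape of the measurement quoted
above, here is a rough sketch in Python. The names run and worker are
made up for illustration and do not come from jemalloc or mozjemalloc.
Interpreter overhead (and the GIL in CPython) will swamp the
nanosecond-scale figures, so this only outlines the experiment: one
shared lock versus one lock per thread around a tiny update. Comparable
numbers on ARM or x86 would need native code.

```python
import threading
import time


def run(n_threads, iters, shared):
    """Time n_threads threads that each do `iters` locked increments.

    shared=True  -> every thread takes the same lock (contended case).
    shared=False -> every thread takes its own lock (uncontended case).
    Returns (elapsed_seconds, counters).
    """
    one_lock = threading.Lock()
    locks = [one_lock if shared else threading.Lock()
             for _ in range(n_threads)]
    counters = [0] * n_threads

    def worker(i):
        lock = locks[i]
        for _ in range(iters):
            with lock:            # steps 1 and 3: acquire / release
                counters[i] += 1  # step 2: tiny update to this thread's slot

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, counters


if __name__ == "__main__":
    n_threads, iters = 4, 50_000
    for shared in (False, True):
        elapsed, _ = run(n_threads, iters, shared)
        label = "shared lock" if shared else "per-thread locks"
        per_op_ns = elapsed / (n_threads * iters) * 1e9
        print(f"{label}: {per_op_ns:.0f} ns per locked update")
```

Swapping the per-thread locks for per-core ones would need thread
pinning, which this sketch does not attempt.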

Mike