<div dir="ltr">Perhaps I am tone deaf. "extremely expensive" sounds, to my ear, worse than the reality that I've experienced. But one persons "extreme" might be another's "moderate". Rather than use that kind of subjective adjective, I've tried to give numbers, and it sounds like we are basically in agreement about the costs.<div><br></div><div>To me, what is interesting about these numbers is that to overcome the lock overhead, a thread cache can be small. (There I go with a subjective adjective again...) For example, if the locked version is 5x as expensive as the unlocked, then a thread cache that holds 5 objects will reduce the effective cost to 2x the unlocked cache, in the worst case. A thread cache that holds 20 objects reduces the worst-case overhead to 25%. In some allocators, the thread cache is much larger (perhaps thousands of objects) because it is trying to overcome the 200ns contended-lock overhead instead of the 5ns uncontended-lock overhead. On modern x86 processors, it isn't always necessary to take heroic measures to avoid locks 99.9% of the time. It's sometimes good enough to avoid locks in something like 80% of the cases.<div><br></div><div>One way to make such a thread cache work is to maintain a doubly-linked list so that when the lock *is* acquired (e.g., when the thread cache fills up on a free()), all 5 objects in the thread cache can be moved to the per-core cache with a constant number of memory references. One also needs a little hysteresis: so one might use two doubly-linked lists in the per-thread cache, and an array of doubly-linked lists (each containing 5 objects) in the per-core cache. When calling free(), if both doubly-linked lists are full, then move one of them to the per-core cache. When calling malloc(), if both doubly-linked lists are empty, then get a collection of 5 objects from the per-core cache. Similarly, when the per-core cache gets too full, a group of 5 objects can be moved from a per-core cache to a global cache using only O(1) operations, and with a little hysteresis, that operation will seldom occur.</div><div><br></div><div>I'm curious. How big is the thread cache in jemalloc?</div><div><br></div><div>-Bradley</div><div><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Feb 10, 2015 at 1:11 AM, Daniel Micay <span dir="ltr"><<a href="mailto:danielmicay@gmail.com" target="_blank">danielmicay@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 10/02/15 12:53 AM, Bradley C. Kuszmaul wrote:<br>
I'm curious. How big is the thread cache in jemalloc?

-Bradley

On Tue, Feb 10, 2015 at 1:11 AM, Daniel Micay <danielmicay@gmail.com> wrote:

> On 10/02/15 12:53 AM, Bradley C. Kuszmaul wrote:
>> Lock instructions on modern x86 processors aren't really that
>> expensive. What is expensive is lock contention. When I've measured
>> code that does this in a bunch of concurrent threads:
>> 1. acquire_lock()
>> 2. do_something_really_small_on_thread_local_data()
>> 3. release_lock()
>>
>> It costs about 1ns to do step 2 with no locks.
>> It costs about 5ns to acquire the lock if the lock is thread-local, and
>> thus not actually contended.
>> It costs about 100ns-200ns if the lock is actually contended.
>>
>> I've found that these measurements have changed the way I write
>> lock-based code. For example, I like per-core data structures that need
>> a lock, because the per-core lock is almost always uncontended. (The
>> difference between per-core and per-thread shows up only when a thread
>> is preempted.)
>>
>> -Bradley
>
> A lock prefix *is* very expensive in this context. The cost of locking
> and unlocking is where up to 50% of the time is spent in a fast memory
> allocator without thread caching, *without* contention. It's why thread
> caching results in a huge performance win even when it's only being
> filled and flushed with no reuse. For example, making an intrusive list
> with one million nodes and then freeing the entire thing is ~2x faster
> with a thread cache on top (with a fast O(1) slab allocator at least).