jemalloc 3 performance vs. mozjemalloc
danielmicay at gmail.com
Mon Feb 9 22:42:28 PST 2015
On 10/02/15 01:33 AM, Bradley C. Kuszmaul wrote:
> Perhaps I am tone deaf. "extremely expensive" sounds, to my ear, worse
> than the reality that I've experienced. But one person's "extreme" might
> be another's "moderate". Rather than use that kind of subjective
> adjective, I've tried to give numbers, and it sounds like we are
> basically in agreement about the costs.
> To me, what is interesting about these numbers is that to overcome the
> lock overhead, a thread cache can be small. (There I go with a
> subjective adjective again...) For example, if the locked version is 5x
> as expensive as the unlocked, then a thread cache that holds 5 objects
> will reduce the effective cost to 2x the unlocked cost, in the worst
> case. A thread cache that holds 20 objects reduces the worst-case
> overhead to 25%. In some allocators, the thread cache is much larger
> (perhaps thousands of objects) because it is trying to overcome the
> 200ns contended-lock overhead instead of the 5ns uncontended-lock
> overhead. On modern x86 processors, it isn't always necessary to take
> heroic measures to avoid locks 99.9% of the time. It's sometimes good
> enough to avoid locks in something like 80% of the cases.
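[That arithmetic can be made concrete with a small model: each batch of k
cached operations pays for one lock acquisition on refill or flush. The
5-unit lock cost below is the illustrative figure from the example above,
not a measured number:]

```c
/* Amortized-cost model for a thread cache holding k objects: every k
 * operations, one refill/flush must take the lock. lock_cost is the
 * price of that acquisition in units of one unlocked operation
 * (5 in the example above; illustrative, not measured). */
static double amortized(int k, double lock_cost) {
    return (k * 1.0 + lock_cost) / k; /* = 1 + lock_cost / k */
}

/* amortized(5, 5.0)  == 2.00  -> 2x the unlocked cost
 * amortized(20, 5.0) == 1.25  -> 25% worst-case overhead */
```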
> One way to make such a thread cache work is to maintain a doubly-linked
> list so that when the lock *is* acquired (e.g., when the thread cache
> fills up on a free()), all 5 objects in the thread cache can be moved to
> the per-core cache with a constant number of memory references. One
> also needs a little hysteresis: one might use two doubly-linked lists
> in the per-thread cache, and an array of doubly-linked lists (each
> containing 5 objects) in the per-core cache. When calling free(), if
> both doubly-linked lists are full, then move one of them to the per-core
> cache. When calling malloc(), if both doubly-linked lists are empty,
> then get a collection of 5 objects from the per-core cache. Similarly,
> when the per-core cache gets too full, a group of 5 objects can be moved
> from a per-core cache to a global cache using only O(1) operations, and
> with a little hysteresis, that operation will seldom occur.
> I'm curious. How big is the thread cache in jemalloc?
Yes, it seems we're in total agreement. By "extremely expensive" I meant
expensive relative to normal instructions without a LOCK prefix. Slab
allocation for small size classes means you can have O(1) operations all
the way down to chunk allocation, so you don't really need anything other
than a tiny thread cache.
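[For concreteness, the two-list scheme Bradley describes could be sketched
along these lines. This is a single-threaded illustration with invented
names, not jemalloc's actual implementation; the per-core lock, the global
cache, and the slow paths are all elided:]

```c
#include <stddef.h>

#define BATCH 5   /* objects per list, matching the example above */
#define SLOTS 16  /* per-core capacity in batches (arbitrary here)  */

/* Intrusive doubly-linked list node embedded in each free object. */
typedef struct node { struct node *prev, *next; } node_t;

/* Circular list with a sentinel head; moving the whole list is O(1). */
typedef struct blist { node_t head; int count; } blist_t;

static void blist_init(blist_t *l) {
    l->head.prev = l->head.next = &l->head;
    l->count = 0;
}

static void blist_push(blist_t *l, node_t *n) {
    n->next = l->head.next;
    n->prev = &l->head;
    l->head.next->prev = n;
    l->head.next = n;
    l->count++;
}

static node_t *blist_pop(blist_t *l) {
    node_t *n = l->head.next;  /* caller checks count > 0 */
    l->head.next = n->next;
    n->next->prev = &l->head;
    l->count--;
    return n;
}

/* Splice every node of src onto dst with a constant number of pointer
 * updates: the O(1) batch transfer described above. */
static void blist_splice_all(blist_t *dst, blist_t *src) {
    if (src->count == 0) return;
    node_t *first = src->head.next, *last = src->head.prev;
    last->next = dst->head.next;
    dst->head.next->prev = last;
    dst->head.next = first;
    first->prev = &dst->head;
    dst->count += src->count;
    blist_init(src);
}

/* Per-core cache: an array of full batches (the lock is omitted in this
 * single-threaded sketch; overflow to a global cache is also omitted). */
typedef struct core_cache { blist_t slot[SLOTS]; int nslots; } core_cache_t;

/* Per-thread cache: two lists, providing the hysteresis described above. */
typedef struct thread_cache { blist_t a, b; } thread_cache_t;

static void tcache_free(thread_cache_t *t, core_cache_t *c, node_t *obj) {
    if (t->a.count < BATCH) { blist_push(&t->a, obj); return; }
    if (t->b.count < BATCH) { blist_push(&t->b, obj); return; }
    /* Both lists full: take the per-core lock once, hand over one whole
     * list in O(1), then free locally again. */
    blist_splice_all(&c->slot[c->nslots++], &t->a);
    blist_push(&t->a, obj);
}

static node_t *tcache_alloc(thread_cache_t *t, core_cache_t *c) {
    if (t->a.count > 0) return blist_pop(&t->a);
    if (t->b.count > 0) return blist_pop(&t->b);
    /* Both lists empty: take the lock once and pull a full batch. */
    if (c->nslots == 0) return NULL; /* would fall through to a slower path */
    blist_splice_all(&t->a, &c->slot[--c->nslots]);
    return blist_pop(&t->a);
}
```

[With BATCH = 5, the lock is touched at most once per 5 frees or mallocs
in the worst case, which is exactly the amortization the example relies on.]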
More information about the jemalloc-discuss mailing list