jemalloc 3 performance vs. mozjemalloc

Mon Feb 9 22:42:28 PST 2015

On 10/02/15 01:33 AM, Bradley C. Kuszmaul wrote:
> Perhaps I am tone deaf. "extremely expensive" sounds, to my ear, worse
> than the reality that I've experienced.  But one persons "extreme" might
> be another's "moderate".  Rather than use that kind of subjective
> adjective, I've tried to give numbers, and it sounds like we are
> basically in agreement about the costs.
> 
> To me, what is interesting about these numbers is that to overcome the
> lock overhead, a thread cache can be small.  (There I go with a
> subjective adjective again...)  For example, if the locked version is 5x
> as expensive as the unlocked, then a thread cache that holds 5 objects
> will reduce the effective cost to 2x the unlocked cache, in the worst
> case.  A thread cache that holds 20 objects reduces the worst-case
> overhead to 25%.  In some allocators, the thread cache is much larger
> (perhaps thousands of objects) because it is trying to overcome the
> 200ns contended-lock overhead instead of the 5ns uncontended-lock
> overhead.  On modern x86 processors, it isn't always necessary to take
> heroic measures to avoid locks 99.9% of the time.  It's sometimes good
> enough to avoid locks in something like 80% of the cases.
> 
> One way to make such a thread cache work is to maintain a doubly-linked
> list so that when the lock *is* acquired (e.g., when the thread cache
> fills up on a free()), all 5 objects in the thread cache can be moved to
> the per-core cache with a constant number of memory references.  One
> also needs a little hysteresis: so one might use two doubly-linked lists
> in the per-thread cache, and an array of doubly-linked lists (each
> containing 5 objects) in the per-core cache.  When calling free(), if
> both doubly-linked lists are full, then move one of them to the per-core
> cache.  When calling malloc(), if both doubly-linked lists are empty,
> then get a collection of 5 objects from the per-core cache.  Similarly,
> when the per-core cache gets too full, a group of 5 objects can be moved
> from a per-core cache to a global cache using only O(1) operations, and
> with a little hysteresis, that operation will seldom occur.
> 
> I'm curious.  How big is the thread cache in jemalloc?
> 
> -Bradley

Yes, it seems we're in total agreement. By extremely expensive I meant
relative to normal instructions without a LOCK prefix. Slab allocation
for small size classes means you can have O(1) operations all of the way
down to chunk allocation so you don't really need anything other than a
tiny thread cache.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: OpenPGP digital signature
URL: <http://jemalloc.net/mailman/jemalloc-discuss/attachments/20150210/6784a35e/attachment.sig>