New Allocator features

Tue May 10 21:44:24 PDT 2011

On 05/10/2011 09:58 AM, Terrell Magee wrote:
> I have been using jemalloc for the past year with good results.  Our
> product has a couple of new requirements that I have not come across before.
>
> 1) The first is the ability to completely segregate control data from
> allocated data.  This makes the allocator more like a resource manager
> in that you allocate and manage the resource but have no access to it.
> The data structure design of jemalloc lends itself fairly well to
> meeting this requirement.

jemalloc takes advantage of constant-time address masking operations to 
find metadata associated with allocated objects, so although it mostly 
segregates metadata and data, completely abstracting the two away from 
each other would take quite a bit of refactoring, not to mention that it 
would probably slow down the allocator.  Backing up a bit, I'm trying to 
imagine an allocator with this constraint, and it's immediately clear 
that some standard features, like zero-filled memory via calloc(), would 
not be possible, since the allocator could not write to the memory it 
hands out.  What do you need such strong separation for?
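
To make the address-masking point concrete, here is a minimal sketch in C. 
It assumes chunks are aligned to a power-of-two size (4 MiB here, chosen 
only for illustration) with a small header at the start of each chunk.  The 
names chunk_header, chunk_of, and offset_in_chunk are hypothetical and are 
not jemalloc's actual identifiers.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: chunks are 4 MiB and aligned to their size. */
#define CHUNK_SIZE ((uintptr_t)1 << 22)
#define CHUNK_MASK (CHUNK_SIZE - 1)

/* Hypothetical per-chunk metadata stored at the chunk's base address. */
typedef struct chunk_header {
    unsigned arena_index; /* which arena owns this chunk */
} chunk_header;

/* Constant-time lookup: clear the low bits to find the owning chunk. */
static inline chunk_header *chunk_of(const void *ptr) {
    return (chunk_header *)((uintptr_t)ptr & ~CHUNK_MASK);
}

/* Byte offset of ptr within its chunk (the bits that chunk_of clears). */
static inline size_t offset_in_chunk(const void *ptr) {
    return (size_t)((uintptr_t)ptr & CHUNK_MASK);
}
```

With a layout like this, the metadata is reachable from any object pointer 
with a single mask.  Moving the metadata somewhere the mask cannot reach 
would need a different (and probably slower) lookup structure.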

> 2) The second requirement is focused on multi-node NUMA systems.
> Consider the requirement to allocate memory on a specified node.  A
> buffer is returned to the application on the requested node but the
> application then migrates the data to another node for processing.  When
> the app frees the buffer, it is returned to the allocator’s free pool.
> The problem is the memory is now located on the wrong node.  Subsequent
> allocations can result in misplaced data and performance anomalies.
> This behavior is true for any allocator; libc malloc(3), etc.

If you disable thread caches in jemalloc, and you're careful about your 
thread-->arena associations, you will be able to avoid the problem. 
That is, memory is freed back to the arena from which it came, so as 
long as each arena is only ever used for allocation on a single node 
(and the allocated pages are touched before objects are passed to other 
threads), all of that arena's memory will be locally attached.
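
The "touch the pages first" step could look roughly like the sketch below. 
It assumes the kernel places a page on the node of the CPU that first 
writes to it (the usual default policy on Linux, but an assumption here), 
and that the calling thread is already running on the intended node.  The 
helper name touch_pages is hypothetical.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Write one byte per page so that each page is faulted in by the calling
 * thread.  This overwrites data, so call it before the buffer is filled.
 * Returns the number of pages touched.
 */
static size_t touch_pages(void *buf, size_t len) {
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = (ps > 0) ? (size_t)ps : 4096; /* fallback is an assumption */
    volatile unsigned char *p = (volatile unsigned char *)buf;
    size_t touched = 0;

    for (size_t off = 0; off < len; off += page) {
        p[off] = 0; /* the write triggers the page fault */
        touched++;
    }
    return touched;
}
```

The allocating thread would call touch_pages() on a fresh buffer before 
handing it to a thread on another node.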

As an aside, I'm planning to experiment with using sched_getcpu(3) on 
Linux to choose the arena to allocate from.  Right now there are 4*ncpus 
arenas by default, and threads are uniformly distributed among the 
arenas in order to reduce lock contention.  With sched_getcpu(3) we 
should be able to effectively use only 1*ncpus arenas.  Furthermore, the 
application won't have to mess around with thread-->arena associations; 
instead it can set thread-->CPU affinity as desired, and jemalloc will 
automatically work well.
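
A rough sketch of the idea, not what jemalloc currently does: map the 
current CPU to an arena index.  sched_getcpu(3) is the real glibc call; the 
function names pick_arena_for_cpu and pick_arena, and the fallback to 
arena 0, are assumptions made for illustration.

```c
#define _GNU_SOURCE /* required for sched_getcpu() in <sched.h> */
#include <sched.h>

/* Map a CPU number to an arena index; fall back to arena 0 on bad input. */
static unsigned pick_arena_for_cpu(int cpu, unsigned narenas) {
    if (cpu < 0 || narenas == 0) {
        return 0; /* sched_getcpu() returns -1 on error */
    }
    return (unsigned)cpu % narenas;
}

/* Choose an arena based on the CPU the calling thread is running on now. */
static unsigned pick_arena(unsigned narenas) {
    return pick_arena_for_cpu(sched_getcpu(), narenas);
}
```

With narenas equal to ncpus, each CPU maps to its own arena.  A thread can 
migrate between the sched_getcpu() call and the allocation itself, so the 
result is only a hint unless the thread's CPU affinity is pinned.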

Cheers,
Jason