New Allocator features

Thu May 12 06:33:49 PDT 2011

The system architecture starts out as a basic NUMA memory system.
However, not only is access latency non-uniform, but not every node is
directly accessible.  One of the nodes contains a coprocessor that
participates in the global virtual address space of a process, but the
host processor must use special APIs to access memory on the coprocessor
node.  So while it is technically possible to access the allocated buffer
data, doing so for management purposes is slow enough to be undesirable.

-----Original Message-----
From: Jason Evans [mailto:jasone at canonware.com]
Sent: Tuesday, May 10, 2011 11:44 PM
To: Terrell Magee
Cc: jemalloc-discuss at canonware.com
Subject: Re: New Allocator features

On 05/10/2011 09:58 AM, Terrell Magee wrote:
> I have been using jemalloc for the past year with good results.  Our
> product has a couple of new requirements that I have not come across
> before.
>
> 1) The first is the ability to completely segregate control data from
> allocated data.  This makes the allocator more like a resource manager
> in that you allocate and manage the resource but have no access to it.
> The data structure design of jemalloc lends itself fairly well to
> meeting this requirement.

jemalloc takes advantage of constant-time address masking operations to
find metadata associated with allocated objects, so although it mostly
segregates metadata and data, completely abstracting the two away from
each other would take quite a bit of refactoring, not to mention that it
would probably slow down the allocator.  Backing up a bit, I'm trying to
imagine an allocator with this constraint, and it's immediately clear that
some standard features, like zero-filled memory via calloc(), would not be
possible.  What do you need such strong separation for?
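
To illustrate the address-masking point, here is a minimal sketch in C.
It is not jemalloc's actual code: the names chunk_header_t, chunk_of and
chunk_alloc, and the 4 MiB chunk size, are hypothetical.  It assumes only
that objects are carved out of power-of-two-sized, naturally aligned
chunks whose header lives at the chunk base, which is why the metadata
ends up in the same address range as the data it describes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical chunk size: a power of two, so that masking works. */
#define CHUNK_SIZE ((size_t)1 << 22)
#define CHUNK_MASK (CHUNK_SIZE - 1)

/* Hypothetical metadata stored at the base of every chunk. */
typedef struct chunk_header {
    unsigned arena_index;   /* arena that owns this chunk */
} chunk_header_t;

/* Constant-time lookup: clear the low bits of any interior pointer. */
static inline chunk_header_t *chunk_of(const void *ptr)
{
    return (chunk_header_t *)((uintptr_t)ptr & ~(uintptr_t)CHUNK_MASK);
}

/* Allocate one chunk aligned to its own size and stamp its header. */
static chunk_header_t *chunk_alloc(unsigned arena_index)
{
    chunk_header_t *chunk = aligned_alloc(CHUNK_SIZE, CHUNK_SIZE);
    if (chunk != NULL)
        chunk->arena_index = arena_index;
    return chunk;
}
```

Under these assumptions the lookup needs no side table, but it does
require the allocator to read memory adjacent to the application's data,
which is the coupling a strict control/data split would have to remove.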

> 2) The second requirement is focused on multi-node NUMA systems.
> Consider the requirement to allocate memory on a specified node.  A
> buffer is returned to the application on the requested node but the
> application then migrates the data to another node for processing.
> When the app frees the buffer, it is returned to the allocator's free
> pool.  The problem is the memory is now located on the wrong node.
> Subsequent allocations can result in misplaced data and performance
> anomalies.  This behavior is true for any allocator; libc malloc(3),
> etc.

If you disable thread caches in jemalloc, and you're careful about your
thread-->arena associations, you will be able to avoid the problem.
That is, memory is freed back to the arena from which it came, so as long
as each arena is only ever used for allocation on a single node (and the
allocated pages are touched before objects are passed to other threads),
all of that arena's memory will be locally attached.
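
The "freed back to the arena from which it came" rule can be sketched
with a toy model in C.  This is not jemalloc's implementation: the
toy_arena_t and toy_obj_t types, the fixed object size, and the per-object
owner pointer are all assumptions made for illustration (jemalloc finds
the owning arena differently).  The point is only that free() consults
the object's origin, not the thread or node that happens to call it.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_OBJ_SIZE 64   /* hypothetical fixed payload size */

typedef struct toy_arena toy_arena_t;

/* Hypothetical per-object header recording the owning arena. */
typedef struct toy_obj {
    toy_arena_t    *owner;
    struct toy_obj *next;
} toy_obj_t;

struct toy_arena {
    int        node;        /* NUMA node this arena is dedicated to */
    toy_obj_t *free_list;   /* objects previously freed to this arena */
    size_t     nfree;
};

/* Reuse a freed object from this arena if possible, else get a new one. */
static void *toy_alloc(toy_arena_t *arena)
{
    toy_obj_t *obj = arena->free_list;
    if (obj != NULL) {
        arena->free_list = obj->next;
        arena->nfree--;
    } else {
        obj = malloc(sizeof(*obj) + TOY_OBJ_SIZE);
        if (obj == NULL)
            return NULL;
        obj->owner = arena;
    }
    obj->next = NULL;
    return obj + 1;   /* payload starts right after the header */
}

/* Return the object to its owning arena, whoever the caller is. */
static void toy_free(void *ptr)
{
    toy_obj_t *obj = (toy_obj_t *)ptr - 1;
    toy_arena_t *arena = obj->owner;
    obj->next = arena->free_list;
    arena->free_list = obj;
    arena->nfree++;
}

/* Node on which the object's memory was originally placed. */
static int toy_node_of(const void *ptr)
{
    return ((const toy_obj_t *)ptr - 1)->owner->node;
}
```

In this model a buffer allocated from the node 0 arena always reappears
on that arena's free list, so a later allocation from the node 0 arena
cannot be handed memory that belongs to another node.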

As an aside, I'm planning to experiment with using sched_getcpu(3) on
Linux to choose the arena to allocate from.  Right now there are 4*ncpus
arenas by default, and threads are uniformly distributed among the arenas
in order to reduce lock contention.  With sched_getcpu(3) we should be
able to effectively use only 1*ncpus arenas.  Furthermore, the application
won't have to mess around with thread-->arena associations; instead it can
set thread-->CPU affinity as desired, and jemalloc will automatically work
well.
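
A rough sketch of the CPU-based arena choice, in C.  sched_getcpu(3) is
the real glibc call; everything else (the function names, the modulo
mapping, the fallback to arena 0) is an assumption for illustration and
not how jemalloc does or will do it.

```c
#define _GNU_SOURCE   /* needed for sched_getcpu() in <sched.h> */
#include <sched.h>

/* Map a CPU number onto one of narenas arenas (hypothetical policy). */
static unsigned arena_index_for_cpu(int cpu, unsigned narenas)
{
    if (cpu < 0 || narenas == 0)
        return 0;   /* sched_getcpu() failed, or no arenas configured */
    return (unsigned)cpu % narenas;
}

/* Pick the arena for the CPU the calling thread is running on now. */
static unsigned choose_arena(unsigned narenas)
{
    return arena_index_for_cpu(sched_getcpu(), narenas);
}
```

With narenas equal to the CPU count this gives one arena per CPU.  A
thread can still migrate between the sched_getcpu() call and the
allocation itself, so the result is a hint unless the thread's CPU
affinity is pinned.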

Cheers,
Jason