Suitability for use in a reference counting / garbage collecting hybrid

Fri Feb 20 18:06:23 PST 2015

I've been looking at improving the performance of my code in R, and
thus started looking at how it handles memory allocation.   R is
single threaded, and currently uses a combination of internal slabs,
system malloc, and a crufty generational garbage collector.

When things go badly, allocating and deallocating memory can dominate
the execution time.   I'm fairly certain that jemalloc could do a much
better job.  A pie-in-the-sky option that has some backing within the
R community is to improve this by finding a good malloc that can be a
starting point for our needs.

The goal would likely be to have a simple reference counting
implementation that easily overflows (1, 2, 3, many, such that many -
1 == many) and to immediately release memory that's known to be
available for reuse, such as when a variable with a single reference
is assigned a new value.  But importantly, the goal would not be to
reference count everything accurately.  Variables would also be
allowed to just leave scope, with the plan that the garbage collector
will periodically handle them.

What sort of data does jemalloc keep centrally about previous allocations?
Is this information kept in a way that makes it efficient to iterate
through them?
Does such an iteration require accessing the memory which is being pointed to?
Would there be a good place to add a 2-bit reference count to the
existing record?
What's jemalloc's approach to returning memory to the system
especially for large allocations.
The hope would be for it to be possible, but not immediate.  That is,
we want to avoid the case where freeing a multi-GB allocation triggers
a munmap(), but then we immediately need to create another one.

Thanks for answers to any of these!

Nathan Kurz
nate at verse.com

(To be clear, I am not part of the core R community, just an
interested bystander who's been looking at the code.)