Managing pinned memory

Thu May 8 09:42:12 PDT 2014

On May 8, 2014, at 12:04 PM, Jason Evans <jasone at canonware.com> wrote:

> On May 8, 2014, at 9:00 AM, D'Alessandro, Luke K <ldalessa at indiana.edu> wrote:
>> I’m in the market for a good concurrent allocator to manage a memory region corresponding to pinned network memory for a multithreaded, distributed HPC application. Basically, I want to do RDMA to objects that are frequently malloced and freed. The pinning operation is expensive, so it is important to amortize it over many uses. I’ve written a simple thread-local caching allocator that pins contiguous blocks when they’re first allocated and then reuses the space through TLS free lists, but I don’t really have the resources to implement this in a robust way.
>> 
>> Is there any natural way to do this in jemalloc at this time? My gut feeling is that there isn’t: explicitly specifying an arena breaks its caching, and there’s no obvious way to register a callback that runs on internal block allocation and freeing (where I could pin/unpin the underlying memory).
>> 
>> If jemalloc doesn’t really support this use case, does anyone know of an efficient, scalable, robust allocator that does?
> 
> This pending change may be relevant to your needs:
> 
> 	https://github.com/jemalloc/jemalloc/pull/80
> 
> I’m imagining that you would implement a custom chunk allocator that pins entire chunks, and then specifically use that arena for allocations that you require to be pinned.  This approach has some shortcomings, but perhaps they don’t matter to your specific application.

Thanks Jason,

This patch appears to address half the battle, though I’m not 100% sure how to implement the chunk allocator without calling back into jemalloc recursively. I guess that either I use mmap() directly or jemalloc.h already has a way to get “raw” memory. chunk_alloc_core doesn’t look like a name that’s going to be exposed, though, so maybe mmap() is the way to go. Not a big deal.
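
For concreteness, here is a rough sketch of what I have in mind for the mmap() route. The callback signatures are my assumption based on reading the pending patch and may not match what is finally merged. pin_region() and unpin_region() are hypothetical stand-ins for the network layer's registration calls (here they only count bytes). The sketch also assumes that size and alignment are multiples of the page size and that alignment is a power of two.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical pin/unpin hooks; real code would register the range with
 * the NIC. Here they only track how many bytes are "pinned". */
static size_t pinned_bytes;
static bool pin_region(void *addr, size_t size)
{
    (void)addr;
    pinned_bytes += size;
    return true;
}
static void unpin_region(void *addr, size_t size)
{
    (void)addr;
    pinned_bytes -= size;
}

/* Assumed callback shape. Gets memory straight from mmap(), so it never
 * re-enters the allocator. */
static void *pinned_chunk_alloc(size_t size, size_t alignment, bool *zero,
                                unsigned arena_ind)
{
    (void)arena_ind;
    /* Over-allocate so an aligned span of `size` bytes must fit inside. */
    size_t span = size + alignment;
    if (span < size)
        return NULL;                       /* size_t overflow */
    unsigned char *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t base = ((uintptr_t)raw + alignment - 1)
                     & ~((uintptr_t)alignment - 1);
    size_t lead = base - (uintptr_t)raw;
    size_t trail = span - lead - size;
    /* Trim the unaligned head and the unused tail. */
    if (lead != 0)
        munmap(raw, lead);
    if (trail != 0)
        munmap((unsigned char *)base + size, trail);

    if (!pin_region((void *)base, size)) {
        munmap((void *)base, size);
        return NULL;
    }
    *zero = true;   /* fresh anonymous mappings are zero-filled */
    return (void *)base;
}

/* Assumed convention: returning false means the chunk was released. */
static bool pinned_chunk_dalloc(void *chunk, size_t size, unsigned arena_ind)
{
    (void)arena_ind;
    unpin_region(chunk, size);
    munmap(chunk, size);
    return false;
}
```

The point of pinning at chunk granularity is that the expensive registration happens once per chunk rather than once per object.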

Based on the proposed patch, it looks like MALLOCX_ARENA(a) /will/ also take effect for huge regions, in both huge_palloc() and huge_dalloc(), which is exactly what I need.

The caching is an issue. Certain applications have threads that churn through this memory at nearly the rate they make function calls, and they (hopefully) use it with high temporal locality. It’s also sometimes used for inter-thread communication, in addition to its role in distributed communication. It may be enough, though, to use one arena per thread for pinned memory, although that might cause problems for deallocation if frees often come from a remote thread. As always, there’s no way to know whether this will work without trying.
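
To illustrate the remote-deallocation concern, here is a minimal sketch of one common way to handle it, not something jemalloc or the patch provides. Each block records the cache that owns it. A free from the owner goes onto a private list, while a free from any other thread is pushed onto the owner's atomic list, which the owner drains the next time its private list runs dry. All names (cache_t, block_t, cache_alloc, cache_free, BLOCK_SIZE) are hypothetical, and malloc() stands in for carving a block out of a pinned chunk.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 256   /* hypothetical fixed payload size */

struct cache;

typedef struct block {
    struct cache *owner;   /* cache this block was originally carved from */
    struct block *next;    /* free-list link, valid only while free */
    unsigned char payload[BLOCK_SIZE];
} block_t;

typedef struct cache {
    block_t *local_free;              /* touched only by the owning thread */
    _Atomic(block_t *) remote_free;   /* pushed to by other threads */
} cache_t;

/* Owner only: reuse a free block, draining remote frees when needed. */
static void *cache_alloc(cache_t *self)
{
    if (self->local_free == NULL) {
        /* Take the whole remote list in one step. */
        self->local_free = atomic_exchange(&self->remote_free,
                                           (block_t *)NULL);
    }
    block_t *b = self->local_free;
    if (b != NULL) {
        self->local_free = b->next;
    } else {
        b = malloc(sizeof *b);   /* stand-in for a pinned-chunk carve */
        if (b == NULL)
            return NULL;
        b->owner = self;
    }
    return b->payload;
}

/* Any thread: `self` is the caller's own cache, p came from cache_alloc(). */
static void cache_free(cache_t *self, void *p)
{
    block_t *b = (block_t *)((unsigned char *)p - offsetof(block_t, payload));
    cache_t *owner = b->owner;
    if (owner == self) {
        /* Local free: no synchronization needed. */
        b->next = self->local_free;
        self->local_free = b;
    } else {
        /* Remote free: lock-free push onto the owner's remote list. */
        block_t *head = atomic_load(&owner->remote_free);
        do {
            b->next = head;
        } while (!atomic_compare_exchange_weak(&owner->remote_free,
                                               &head, b));
    }
}
```

Because other threads only ever push and the owner always takes the whole list at once, the remote list avoids the usual ABA problem of lock-free stacks. The cost is that remotely freed memory is not reusable until its owner allocates again, which is the part I would need to measure.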

Do you have any sense of the likelihood that this patch will be accepted going forward?

Thanks,
Luke