Memory allocation/release hooks

Thu Oct 15 11:26:39 PDT 2015

Hi Luke,

Please see below

>> 
>> (a) Allocation of memory that can be shared transparently between processes on the same node. For this purpose we would like to mmap memory with MAP_SHARED. This is very useful for implementing Remote Memory Access (RMA) operations in MPI-3 one-sided [2] and OpenSHMEM [3] communication libraries. It allows a remote process to map user-allocated memory and lets the library provide RMA operations through memcpy().
> 
> I’m not sure about this, but I expect that you just need to install a set of custom chunk hooks to manage this. You can read about the chunk_hooks_t [here](http://www.canonware.com/download/jemalloc/jemalloc-latest/doc/jemalloc.html).

This should do the trick. Our initial thought was to replace jemalloc's pages_map() / pages_unmap() with our own mmap-based versions, but it seems that chunk_hooks provide a more elegant way to achieve the same result.
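
A minimal sketch of what such an allocation hook pair might look like, for illustration only. The function names shared_chunk_alloc() and shared_chunk_dalloc() are hypothetical, and their parameter lists are assumed to mirror the chunk_alloc_t / chunk_dalloc_t types described in the jemalloc 4.x manual (please check the manual before relying on them). The sketch uses MAP_SHARED | MAP_ANONYMOUS, which only shares memory with forked children; sharing with unrelated processes on the node would need a named object (for example shm_open() or a file) instead. It also assumes that alignment is a power of two no smaller than the page size and that size is a multiple of the page size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical hook with the shape assumed for jemalloc 4.x chunk_alloc_t.
 * Returns an aligned, zero-filled, MAP_SHARED region or NULL on failure. */
void *shared_chunk_alloc(void *new_addr, size_t size, size_t alignment,
                         bool *zero, bool *commit, unsigned arena_ind)
{
    (void)arena_ind;
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* The allocator may request a specific address; this sketch declines. */
    if (new_addr != NULL)
        return NULL;

    /* Over-allocate so that an aligned chunk is guaranteed to fit. */
    size_t span = size + alignment;
    if (span < size)
        return NULL; /* size_t overflow */

    char *base = mmap(NULL, span, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Round up to the requested alignment, then trim both ends. */
    uintptr_t aligned = ((uintptr_t)base + alignment - 1)
                        & ~((uintptr_t)alignment - 1);
    char *chunk = (char *)aligned;
    size_t lead = (size_t)(chunk - base);
    size_t trail = span - lead - size;
    if (lead != 0)
        munmap(base, lead);
    if (trail != 0)
        munmap(chunk + size, trail);

    /* Fresh anonymous mappings are zero-filled and usable immediately. */
    *zero = true;
    *commit = true;
    return chunk;
}

/* Hypothetical hook with the shape assumed for chunk_dalloc_t.
 * Returns false on success, following the convention in the manual. */
bool shared_chunk_dalloc(void *chunk, size_t size, bool committed,
                         unsigned arena_ind)
{
    (void)committed;
    (void)arena_ind;
    return munmap(chunk, size) != 0;
}
```

In a real setup these two functions would be placed in a chunk_hooks_t and installed on a dedicated arena, as described in the manual linked above.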

> 
>> (b) Implementation of memory de-allocation hooks for RDMA hardware (InfiniBand, RoCE, iWARP, etc.). As an optimization, we implement a lazy memory de-registration (memory unpinning) policy, and we use the hook to notify the communication library about memory release events. On such an event, we clean up our registration cache and de-register (unpin) the memory on the hardware.
> 
> We have been using jemalloc for some time to manage, among other things, registered memory regions in HPX-5 (https://hpx.crest.iu.edu/) for Verbs and uGNI. If you already have a mechanism which manages keys, then you can simply install a set of chunk hooks that can perform the registration/deregistration as necessary. We have found this to work quite well for our purposes.
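
To make the lazy de-registration idea concrete, here is a rough sketch of a registration cache that is invalidated from a deallocation hook. Everything in it is hypothetical: nic_register() and nic_deregister() are stand-ins for the real pin/unpin calls of the network layer, the fixed-size table is only for brevity, and registered_chunk_dalloc() is assumed to have the shape of jemalloc 4.x chunk_dalloc_t (returning false on success). It is neither the HPX-5 nor the UCX implementation.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define REG_CACHE_SLOTS 64

/* One cached registration covering the byte range [start, end). */
struct reg_entry {
    uintptr_t start;
    uintptr_t end;
    bool used;
};

static struct reg_entry reg_cache[REG_CACHE_SLOTS];
int live_registrations = 0; /* counts regions currently "pinned" */

/* Hypothetical stand-ins for the hardware pin/unpin calls. */
static bool nic_register(uintptr_t start, uintptr_t end)
{
    (void)start; (void)end;
    live_registrations++;
    return true;
}

static void nic_deregister(uintptr_t start, uintptr_t end)
{
    (void)start; (void)end;
    live_registrations--;
}

/* Called on the communication path: reuse a cached registration that
 * covers the buffer, or register the buffer and remember it. */
bool reg_cache_get(void *addr, size_t len)
{
    assert(len > 0);
    uintptr_t start = (uintptr_t)addr, end = start + len;
    struct reg_entry *free_slot = NULL;

    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        struct reg_entry *e = &reg_cache[i];
        if (!e->used) {
            if (free_slot == NULL)
                free_slot = e;
            continue;
        }
        if (e->start <= start && end <= e->end)
            return true; /* cache hit, nothing to pin */
    }
    if (free_slot == NULL || !nic_register(start, end))
        return false;
    *free_slot = (struct reg_entry){ start, end, true };
    return true;
}

/* Unpin and forget every cached registration overlapping the range. */
void reg_cache_invalidate(void *chunk, size_t size)
{
    uintptr_t start = (uintptr_t)chunk, end = start + size;
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        struct reg_entry *e = &reg_cache[i];
        if (e->used && e->start < end && start < e->end) {
            nic_deregister(e->start, e->end);
            e->used = false;
        }
    }
}

/* Deallocation hook: de-registration happens only here, when the
 * allocator actually gives the chunk back to the system. */
bool registered_chunk_dalloc(void *chunk, size_t size, bool committed,
                             unsigned arena_ind)
{
    (void)committed;
    (void)arena_ind;
    reg_cache_invalidate(chunk, size);
    return munmap(chunk, size) != 0;
}
```

The point of the design is that ordinary free() calls never touch the hardware; the cost of unpinning is paid only when a whole chunk leaves the process.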

How do you load jemalloc? Do you use LD_PRELOAD, or is the user expected to allocate the memory explicitly through the HPX runtime?

> 
> [Here are our hooks](https://gitlab.crest.iu.edu/extreme/hpx/blob/v1.3.0/libhpx/network/pwc/jemalloc_registered.c). There is a bit of abstraction in there, but it’s basically straightforward. We only deal with chunk allocation and deallocation since we can’t really do anything interesting on commit/decommit due to the network registration (and we’re normally using hugetlbfs anyway).
> 
> In order to actually use the arenas that manage registered memory each pthread will call [this](https://gitlab.crest.iu.edu/extreme/hpx/blob/v1.3.0/libhpx/memory/jemalloc.c#L41) at startup, and registered allocation explicitly uses the caches created there. You need to be careful to ensure that jemalloc correctly keeps memory spaces disjoint by explicitly managing caches.
> 
> We also have a global heap that is implemented in a similar fashion, except that we’re implementing mmap() there to get chunk sized bits of a much larger segment of memory that we registered.
> 
> Obviously this won’t be exactly what you need, but it should serve as an example of chunk hook replacement for RDMA memory and can almost certainly be used as a basis for what you want to do. You may be able to simply decorate jemalloc’s existing chunk allocator with the registration calls that you need, rather than replacing its implementation entirely like we do (we customize mmap() to get huge pages from hugetlbfs when available, which adds to the complexity here).

Well, we actually implement very similar functionality: https://github.com/openucx/ucx/blob/master/src/uct/api/uct.h#L707
We support huge pages, a verbs allocator, and xpmem, and our goal is pretty much the same: to enable efficient zero-copy protocols for the user.

Thanks,
Pasha