jemalloc Suitable for embedded environments

Mon May 11 16:49:10 PDT 2015

On May 11, 2015, at 1:19 PM, Mayank Kumar (mayankum) <mayankum at cisco.com> wrote:
> -our processes use setrlimit to limit their virtual memory usage. Do you think jemalloc could in some way overshoot that limit, or do something funky that is not tracked through setrlimit (like not going through brk/mmap/mremap)?  Please excuse my limited understanding here.

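jemalloc only uses mmap() and sbrk() to map memory on Unix-like systems.  Both are subject to the address space limit that setrlimit() enforces (RLIMIT_AS), so jemalloc has no way to map memory outside of that accounting.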
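
As a rough illustration of why that matters for setrlimit, the sketch below lowers the soft RLIMIT_AS and then attempts an anonymous mmap() that is larger than the limit.  It assumes Linux semantics and a 64-bit process, uses Python's resource and mmap modules as a stand-in, and does not involve jemalloc itself; the function name mmap_succeeds_under_limit is hypothetical.

```python
import mmap
import resource

GIB = 1 << 30


def mmap_succeeds_under_limit(limit_bytes, request_bytes):
    """Try an anonymous mmap of request_bytes with the soft RLIMIT_AS
    temporarily set to limit_bytes. Returns True if the mapping succeeded.

    Hypothetical helper for illustration; assumes Linux RLIMIT_AS semantics.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    new_soft = limit_bytes
    if hard != resource.RLIM_INFINITY:
        # The soft limit may never exceed the hard limit.
        new_soft = min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    try:
        # fileno -1 requests an anonymous mapping, i.e. a plain mmap() call.
        region = mmap.mmap(-1, request_bytes)
    except (OSError, MemoryError):
        # The kernel refused the mapping (ENOMEM) because of the limit.
        return False
    else:
        region.close()
        return True
    finally:
        # Always restore the original soft limit.
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))


if __name__ == "__main__":
    # A 4 GiB request cannot fit under a 1 GiB address space limit.
    print(mmap_succeeds_under_limit(1 * GIB, 4 * GIB))
```

An allocator that obtains its memory through mmap() or sbrk() sees the same refusal, which it then reports to the caller as an allocation failure.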

> -someone pointed me to this link: http://locklessinc.com/benchmarks_allocator.shtml
> It says the following:
> 
> <quote>
> Jemalloc allocator
> 
> This is a very good allocator when there is a large amount of contention, performing similarly to the Lockless memory allocator as the number of threads grows larger than the number of processors. However, when the number of allocating threads is smaller than the total number of cpus, it isn't quite as fast. The disadvantage of the jemalloc allocator is its memory usage. It uses power-of-two sized bins, which leads to a greatly increased memory footprint compared to other allocators. This can affect real-world performance due to excess cache and TLB misses.
> </quote>
> 
> Do you think this is still true?  This might be an old link, or it may just be my limited understanding. Of course they are selling here..., but I just wanted your opinion. In our case, though, the number of allocating threads will always be larger than the number of cores.

The above was a combination of incorrect/incomplete information and microbenchmark-based overgeneralization even at the time it was written ~4 years ago.  Specific issues:

- MP-scalable malloc implementations *avoid* contention in order to perform well.  The t-test1 microbenchmark as run did not induce appreciable contention in jemalloc.

- jemalloc's typically low memory usage has been a distinguishing quality since 2006.  To claim otherwise based on one microbenchmark is unjustifiable.

- jemalloc has at various times used power-of-two-*spaced* bins for limited size ranges, e.g. 1024..2048..4096..8192 and 4MiB..8MiB, but it has never done so universally.  I suspect the author misread my BSDCan paper (http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf), and mistook the binary buddy page management system for size classes.  However, the binary buddy page management system was replaced long before jemalloc 1.0.0.
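
To make the size class point concrete, the sketch below compares the internal waste of pure power-of-two spacing against a finer spacing.  The finer scheme (four equal steps per doubling) and all function names are made up for illustration; this is not jemalloc's actual size class table.

```python
def pow2_class(size):
    """Smallest power of two >= size (pure power-of-two-spaced classes)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return 1 << (size - 1).bit_length()


def subdivided_class(size, steps=4):
    """Smallest class >= size when each doubling is split into `steps`
    equal steps. Hypothetical scheme, chosen only for illustration."""
    upper = pow2_class(size)
    lower = upper // 2
    if lower < steps:
        # Range too small to subdivide; fall back to the power of two.
        return upper
    step = lower // steps
    # Classes in (lower, upper] are lower + step, lower + 2*step, ..., upper.
    k = -(-(size - lower) // step)  # ceiling division
    return lower + k * step


def internal_waste(size, class_fn):
    """Fraction of the chosen size class left unused by a request."""
    cls = class_fn(size)
    return (cls - size) / cls


def worst_waste(class_fn, lo, hi):
    """Largest internal waste over all request sizes in [lo, hi]."""
    return max(internal_waste(s, class_fn) for s in range(lo, hi + 1))


if __name__ == "__main__":
    for name, fn in (("power-of-two", pow2_class),
                     ("subdivided", subdivided_class)):
        print(name, fn(1025), round(worst_waste(fn, 1025, 2048), 3))
```

Under pure power-of-two spacing a 1025-byte request rounds up to 2048 bytes, wasting just under half of the slot, whereas the hypothetical finer spacing rounds it to 1280 bytes.  This is why the distinction between universal power-of-two bins and power-of-two spacing over limited ranges matters for memory usage.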

On the bright side, the benchmarks report actual performance results for a version of jemalloc.  A previous version of that page erroneously reported glibc results, and an interim update categorically blamed jemalloc for a memalign() call with questionable alignment and the crashes that resulted from it.

Note that the Lockless malloc implementation has since been open-sourced, so you can conduct your own tests and see how well it works for your use case.

Jason