Regression: jemalloc >= 4.0 use munmap() even when configured with --disable-munmap

Fri Apr 22 22:41:04 PDT 2016

On Apr 22, 2016, at 10:22 PM, Daniel Mewes <daniel at rethinkdb.com> wrote:
> The reason for the failing `munmap` appears to be that we hit the kernel's `max_map_count` limit.
> 
> I can reproduce the issue very quickly by reducing the limit through `echo 16000 > /proc/sys/vm/max_map_count`, and it disappears in our tests when increasing it to something like `echo 131060 > /proc/sys/vm/max_map_count`. The default value is 65530 I believe.
> 
> We used to see this behavior in jemalloc 2.x, but didn't see it in 3.x anymore. It now re-appeared somewhere between 3.6 and 4.1.

Version 4 switched to per arena management of huge allocations, and along with that completely independent trees of cached chunks.  For many workloads this means increased virtual memory usage, since cached chunks can't migrate among arenas.  I have plans to reduce the impact somewhat by decreasing the number of arenas by 4X, but the independence of arenas' mappings has numerous advantages that I plan to leverage more over time.

> Do you think the allocator should handle reaching the map_count limit and somehow deal with it gracefully (if that's even possible)? Or should we just advise our users to raise the kernel limit, or alternatively try to change RethinkDB's allocation patterns to avoid hitting it?

I'm surprised you're hitting this, because the normal mode of operation is for jemalloc's chunk allocation to get almost all contiguous mappings, which means very few distinct kernel VM map entries.  Is it possible that RethinkDB is routinely calling mmap() and interspersing mappings that are not a multiple of the chunk size?  One would hope that the kernel could densely pack such small mappings in the existing gaps between jemalloc's chunks, but unfortunately Linux uses fragile heuristics to find available virtual memory (the exact problem that --disable-munmap works around).

To your question about making jemalloc gracefully deal with munmap() failure, it seems likely that mmap() is in imminent danger of failing under these conditions, so there's not much that can be done.  In fact, jemalloc only aborts if the abort option is set to true (the default for debug builds), so the error message jemalloc is printing probably doesn't directly correspond to a crash.

As a workaround, you could substantially increase the chunk size (e.g. MALLOC_CONF=lg_chunk:30), but better would be to diagnose and address whatever is causing the terrible VM map fragmentation.

Thanks,
Jason