Regression: jemalloc >= 4.0 uses munmap() even when configured with --disable-munmap

Bradley C. Kuszmaul bradley at mit.edu
Sat Apr 23 03:18:19 PDT 2016


If you are interested in making sure that jemalloc doesn't run into kernel
mapping limits, even in the face of an application that's doing additional
mappings and mprotects, then you can reduce the number of kernel mappings
used by jemalloc.  Increasing the chunk size is a blunt instrument, but
with a little work on the jemalloc internals one can do a little better.
One possible way to avoid this kind of mmap failure is to use a little
more virtual memory: for example, if you allocate multiple chunks with a
single mmap(), you can guarantee that fewer kernel mappings are used.

To be concrete, suppose that for the first 4MiB chunk you mmap 4MiB, but
for the 2nd one you mmap 8MiB (and save the extra chunks somewhere for
later).  The 3rd chunk would come out of the saved extra chunks, and the
4th chunk would call for a 16MiB mmap.  You would be guaranteed to use at
most 2X the virtual memory, and you would have at most log_2 N kernel
mappings for N chunks.
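
Here is a minimal sketch of that doubling scheme as a standalone C wrapper.
The chunk_alloc() name, the global reserve bookkeeping, and the fixed 4MiB
chunk size are assumptions for illustration only, not jemalloc's actual
chunk layer:

    #define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
    #include <stddef.h>
    #include <sys/mman.h>

    #define CHUNK_SIZE ((size_t)4 << 20)  /* 4MiB chunks, purely illustrative */

    static char  *reserve;        /* unused chunks left over from the last mmap */
    static size_t reserve_len;    /* bytes remaining in that reserve */
    static size_t next_map_len = CHUNK_SIZE;  /* size of the next mmap; doubles */

    /* Hand out one chunk, calling mmap() only when the reserve is empty. */
    static void *
    chunk_alloc(void)
    {
        if (reserve_len == 0) {
            void *p = mmap(NULL, next_map_len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                return NULL;
            reserve = p;
            reserve_len = next_map_len;
            next_map_len *= 2;    /* each new mapping is twice the previous one */
        }
        void *chunk = reserve;
        reserve += CHUNK_SIZE;
        reserve_len -= CHUNK_SIZE;
        return chunk;
    }

With this, the 1st chunk costs a 4MiB mapping, the 2nd an 8MiB mapping, the
3rd comes out of the reserve, the 4th costs a 16MiB mapping, and so on,
which is exactly the doubling pattern described above.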

If you don't want to waste 2X the virtual memory, you can reduce the
overhead by performing several mappings of a given size before doubling.
For example, if you were to mmap(4MiB) twice, then mmap(8MiB) twice, then
mmap(16MiB) twice, and so forth, then the VM space used would be at most
1.5X.  If you wanted to get the VM-space overhead down to 10% you could do
10 mappings of a given size before doubling the mapping size.  In general,
if you perform K mappings of a given size before doubling the mapping size,
the VM space overhead is at most (1+K)/K, and the number of kernel mappings
used by jemalloc is O(K log N) for N chunks.
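
To put rough numbers on that: with 4MiB chunks and K = 10, serving 64GiB
worth of chunks (16384 of them) takes mappings at 11 doubling sizes (4MiB,
8MiB, ..., 4GiB), or at most about 10 * 11 = 110 kernel mappings, versus
16384 if every chunk were its own mmap(); at any point the unused reserve
is at most the one most recent mapping, i.e. roughly 10% of what is already
in use, matching the (1+K)/K bound.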

By using other formulas for the amount of overallocation, you can further
reduce the VM overhead at the expense of having more mappings.  For example,
you could make the $i$th mmap call allocate $i$ chunks.  This results in an
unused reserve proportional to the square root of the number of chunks
allocated so far, so at the beginning the fractional overhead is large (a
full factor of two), but after running for a while it becomes small.  The
number of kernel mappings would be bounded by O(\sqrt{N}) in this case.
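
(The arithmetic behind that bound: after $m$ such mmap calls you have mapped
$1 + 2 + \cdots + m = m(m+1)/2$ chunks, so serving $N$ chunks takes
$m \approx \sqrt{2N}$ calls, and the unused reserve is at most the $m$
chunks of the most recent call, which as a fraction of $N$ shrinks like
$1/\sqrt{N}$.)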

This approach doesn't stop the application from running out of mappings,
but it does avoid implicating jemalloc.

-Bradley

On Sat, Apr 23, 2016 at 1:41 AM, Jason Evans <jasone at canonware.com> wrote:

> On Apr 22, 2016, at 10:22 PM, Daniel Mewes <daniel at rethinkdb.com> wrote:
> > The reason for the failing `munmap` appears to be that we hit the
> kernel's `max_map_count` limit.
> >
> > I can reproduce the issue very quickly by reducing the limit through
> `echo 16000 > /proc/sys/vm/max_map_count`, and it disappears in our tests
> when increasing it to something like `echo 131060 >
> /proc/sys/vm/max_map_count`. The default value is 65530 I believe.
> >
> > We used to see this behavior in jemalloc 2.x, but didn't see it in 3.x
> anymore. It now re-appeared somewhere between 3.6 and 4.1.
>
> Version 4 switched to per-arena management of huge allocations, and along
> with that completely independent trees of cached chunks.  For many
> workloads this means increased virtual memory usage, since cached chunks
> can't migrate among arenas.  I have plans to reduce the impact somewhat by
> decreasing the number of arenas by 4X, but the independence of arenas'
> mappings has numerous advantages that I plan to leverage more over time.
>
> > Do you think the allocator should handle reaching the map_count limit
> and somehow deal with it gracefully (if that's even possible)? Or should we
> just advise our users to raise the kernel limit, or alternatively try to
> change RethinkDB's allocation patterns to avoid hitting it?
>
> I'm surprised you're hitting this, because the normal mode of operation is
> for jemalloc's chunk allocation to get almost all contiguous mappings,
> which means very few distinct kernel VM map entries.  Is it possible that
> RethinkDB is routinely calling mmap() and interspersing mappings that are
> not a multiple of the chunk size?  One would hope that the kernel could
> densely pack such small mappings in the existing gaps between jemalloc's
> chunks, but unfortunately Linux uses fragile heuristics to find available
> virtual memory (the exact problem that --disable-munmap works around).
>
> To your question about making jemalloc gracefully deal with munmap()
> failure, it seems likely that mmap() is in imminent danger of failing under
> these conditions, so there's not much that can be done.  In fact, jemalloc
> only aborts if the abort option is set to true (the default for debug
> builds), so the error message jemalloc is printing probably doesn't
> directly correspond to a crash.
>
> As a workaround, you could substantially increase the chunk size (e.g.
> MALLOC_CONF=lg_chunk:30), but better would be to diagnose and address
> whatever is causing the terrible VM map fragmentation.
>
> Thanks,
> Jason
> _______________________________________________
> jemalloc-discuss mailing list
> jemalloc-discuss at canonware.com
> http://www.canonware.com/mailman/listinfo/jemalloc-discuss
>