Rounding up huge allocations to page boundaries instead of chunks

Jason Evans jasone at canonware.com
Fri Sep 5 14:40:08 PDT 2014


On Aug 21, 2014, at 2:52 PM, Guilherme Goncalves <ggp at mozilla.com> wrote:
> As part of our effort to move to jemalloc3 on Firefox, it would be interesting to upstream the
> changes introduced into mozjemalloc in bug 683597 [1]. Basically, we've observed that, at least
> on Linux and OSX, the operating system will commit pages lazily when they're written to (as opposed
> to when they're mapped by jemalloc). This distorts the allocation stats for huge allocations, as
> they are rounded up to chunk boundaries.
> 
> For a concrete example, a huge allocation of size 1 chunk + 1 byte will cause jemalloc to map 2
> chunks, but the application will only ever physically use 1 chunk + 1 page. I haven't found any
> stats on jemalloc3 that reflect this smaller memory footprint; as far as I can see, all of the
> available stats.* metrics report multiples of the chunk size. There was some previous discussion
> about this on this list a few years ago, but it didn't seem to move forward at the time [2].
> 
> Would you be interested in upstreaming such a change? I took a shot at adapting the old patch on that
> bug to the current jemalloc3 repository [3], and it doesn't look like this would introduce too much
> bookkeeping. It did seem to break some assumptions in other API functions (see the FIXME note on
> huge_salloc), so it may be easier to just introduce a new statistic instead of tweaking the existing
> size field in chunks. Thoughts?
> 
> 1- https://bugzilla.mozilla.org/show_bug.cgi?id=683597
> 2- http://jemalloc.net/mailman/jemalloc-discuss/2012-April/000221.html
> 3- https://github.com/guilherme-pg/jemalloc/commit/9ca3ca5f92053f3e605f7b470ade6e53e8fa5160

The main reason for the current approach to huge allocation size classes is that even if jemalloc avoids allocating virtual memory for the trailing unneeded space, every chunk must start at a chunk alignment boundary, so the resulting virtual memory holes are unusable by jemalloc.  In principle these holes could be useful to some auxiliary allocator in applications that use mmap() directly, but that's not a common use case.  Furthermore, these virtual memory holes cause map fragmentation in the kernel-level virtual memory data structures, and such holes are especially harmful on Linux, which uses linear map scan algorithms in some critical paths.  We have strong pressure to actually map full chunks, so historically I held the opinion that if we're mapping the virtual memory, we might as well make it available to the application.
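
To make the numbers in the quoted example concrete: a request of one chunk plus one byte is rounded up to two chunks of mapped virtual memory, while lazy commit means only one chunk plus one page is ever physically touched.  Here is a minimal sketch of that arithmetic (the 4MiB chunk size, 4KiB page size, and round_up() helper are assumptions for illustration only, not jemalloc's actual configuration or API):

#include <stdio.h>
#include <stddef.h>

/* Illustrative constants; actual chunk and page sizes are configurable,
 * so treat these as assumptions for the example. */
#define CHUNK_SIZE ((size_t)4 << 20)   /* 4 MiB */
#define PAGE_SIZE_ ((size_t)4 << 10)   /* 4 KiB */

/* Round size up to a multiple of align (align must be a power of two). */
static size_t
round_up(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}

int
main(void)
{
	size_t request = CHUNK_SIZE + 1;                /* 1 chunk + 1 byte */
	size_t mapped  = round_up(request, CHUNK_SIZE); /* virtual memory jemalloc maps */
	size_t touched = round_up(request, PAGE_SIZE_); /* pages the OS lazily commits */

	printf("request: %zu bytes\n", request);
	printf("mapped:  %zu bytes (%zu chunks)\n", mapped, mapped / CHUNK_SIZE);
	printf("touched: %zu bytes (%zu pages)\n", touched, touched / PAGE_SIZE_);
	return 0;
}

Running this prints 2 chunks (8MiB) mapped versus 1025 pages (4MiB + 4KiB) touched, which is exactly the gap the current stats don't reflect.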

That said, I laid some groundwork for unifying size classes (https://github.com/jemalloc/jemalloc/issues/77) this spring:

	https://github.com/jemalloc/jemalloc/commit/d04047cc29bbc9d1f87a9346d1601e3dd87b6ca0

The practical impact on your use case is that we'd go from having

	[4MiB, 8MiB, ..., (4n)MiB]
to
	[4MiB, 5MiB, 6MiB, 7MiB],
	[8MiB, 10MiB, 12MiB, 14MiB],
	[...],
	[(4m)MiB, (4m+1)MiB, (4m+2)MiB, (4m+3)MiB]

The implementation for the 4MiB..14MiB size classes will in effect need to manipulate the chunk metadata in the same way as your patch does.
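
For illustration, rounding a request up to a class under such a scheme (4 classes per doubling) might look like the following minimal sketch; lg_floor() and size_class_ceil() are hypothetical helpers for this example, not jemalloc's actual size-class machinery, and the sketch only handles sizes well above the group granularity:

#include <stdio.h>
#include <stddef.h>

#define LG_SIZE_CLASS_GROUP 2 /* 2^2 = 4 classes per doubling */

/* Floor of log2(x); x must be nonzero. */
static unsigned
lg_floor(size_t x)
{
	unsigned lg = 0;
	while (x >>= 1)
		lg++;
	return lg;
}

/* Round size up to the nearest size class: classes within a group are
 * spaced 1/4 of the group's base size apart. */
static size_t
size_class_ceil(size_t size)
{
	unsigned lg_group = lg_floor(size - 1);
	size_t spacing = (size_t)1 << (lg_group - LG_SIZE_CLASS_GROUP);
	return (size + spacing - 1) & ~(spacing - 1);
}

int
main(void)
{
	const size_t MiB = (size_t)1 << 20;
	size_t requests[] = { 4 * MiB + 1, 6 * MiB, 7 * MiB + 1, 9 * MiB };
	for (size_t i = 0; i < sizeof(requests) / sizeof(requests[0]); i++) {
		printf("%zu bytes -> %zu MiB class\n",
		    requests[i], size_class_ceil(requests[i]) / MiB);
	}
	return 0;
}

For the sample requests this prints the 5MiB, 6MiB, 8MiB, and 10MiB classes, matching the lists above.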

Will this sufficiently address your accounting concerns?  There's the potential to over-report active memory by nearly 1.25X in the worst case (e.g., a request just over 4MiB falls into the 5MiB class), but that's a lot better than nearly 2X as things currently are.

Thanks,
Jason

