1kb and 2kb allocation "waste"

Jason Evans jasone at canonware.com
Fri Apr 20 17:23:59 PDT 2012


On Apr 20, 2012, at 4:55 AM, Justin Lebar wrote:
> It turns out that 1kb and 2kb allocations are responsible for upwards
> of 1/3 of all wasted space in Firefox, where by "wasted space", I
> mean: Open a bunch of tabs, close them, and call "waste" any heap
> memory which is committed but not part of an active allocation.
> (We've been calling this "external fragmentation", but that doesn't
> quite match up with the common definition, as I understand it.)

Do 1 and 2 KiB allocations make up substantially less than 1/3 of the memory devoted to small allocations?

> My idea is that we can address this by adding a new allocation method,
> between small and large.  These "medium" allocations are allocated
> inside a chunk and don't have a run header, like large allocations.
> But whereas one large allocation gets one entry in the chunk metadata
> table, two or four medium allocations will share one entry in the
> chunk metadata table.  It looks like we have plenty of bits available
> to express "{first,second,third,fourth} quarter of this page is an
> active 1kb alloc" and "{first,second} half of this page is an active
> 2kb alloc".

There was a period of time when jemalloc had a related strategy.  In 2009 I added medium size classes that went up to 32 KiB, mainly in order to close the 4 KiB .. 8 KiB internal fragmentation gap.  In order to mitigate the increased RSS due to external fragmentation (the same problem you're describing for 1 and 2 KiB objects), I added extra bookkeeping that made it possible to call madvise() for completely unused interspersed pages.  Since these size classes were all >= 4 KiB, even a single free region guaranteed that at least one page could be madvise()d away.  Unfortunately, I had to remove the medium size classes when I restructured dirty page purging to happen without any locks held; it wouldn't have been possible to safely madvise() pages within active medium runs without some locking.
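For concreteness, the purging primitive looked roughly like this (a minimal sketch, not jemalloc's actual code; it assumes an anonymous private mapping and hard-codes 4 KiB pages):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t pagesize = 4096;                   /* assumed 4 KiB pages */
        char *run = mmap(NULL, 4 * pagesize, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (run == MAP_FAILED)
            return 1;
        memset(run, 0xa5, 4 * pagesize);          /* dirty all four pages */
        /* Page 2 is completely free: tell the kernel it may reclaim the
         * physical frame.  The virtual mapping stays valid and refaults
         * as zero-filled memory on the next touch. */
        madvise(run + 2 * pagesize, pagesize, MADV_DONTNEED);
        printf("page 2 rereads as %d after purge\n", run[2 * pagesize]);
        return 0;
    }

The key property is that madvise() drops the physical backing without unmapping anything, so live neighboring regions in the same run are unaffected.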

As you noted, your suggestion of using the chunk map to store all metadata would work okay for 1 and 2 KiB size classes, but it wouldn't work out very well for the intermediate size classes that newer versions have.
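For reference, here is a minimal sketch of the kind of encoding you're proposing.  The flag names and bit positions are invented for illustration; they are not jemalloc's actual chunk map layout:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical flags packed into one per-page chunk-map entry. */
    #define MAP_MEDIUM      ((uint32_t)1 << 8)    /* page holds medium regions */
    #define MAP_QUARTER(i)  ((uint32_t)1 << (i))  /* 1 KiB quarter i is live   */

    static uint32_t map_alloc_quarter(uint32_t mapbits, unsigned i) {
        return mapbits | MAP_MEDIUM | MAP_QUARTER(i);
    }

    static bool map_quarter_live(uint32_t mapbits, unsigned i) {
        return (mapbits & MAP_MEDIUM) != 0 && (mapbits & MAP_QUARTER(i)) != 0;
    }

    int main(void) {
        uint32_t mapbits = 0;                      /* one entry == one page   */
        mapbits = map_alloc_quarter(mapbits, 0);   /* 1 KiB alloc, quarter 0  */
        mapbits = map_alloc_quarter(mapbits, 3);   /* 1 KiB alloc, quarter 3  */
        printf("quarter 1 live? %d\n", map_quarter_live(mapbits, 1)); /* 0 */
        printf("quarter 3 live? %d\n", map_quarter_live(mapbits, 3)); /* 1 */
        return 0;
    }

Two or four such flags per entry handle 2 KiB and 1 KiB regions on 4 KiB pages, but the scheme stops scaling once regions get much smaller than a page.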

It would be awesome if small runs didn't need an embedded metadata header and the metadata could all live in the chunk header.  However, there are a few things that get in the way of a clean solution.  First, the worst-case requirement for the bitmaps that track which regions are allocated would be 1/64 of arena memory (one allocation bit per region at the smallest 8-byte size class works out to one bit per 64 bits of memory), which is rather a lot.  Second, chunk map overhead would go up somewhat, because every page would need an arena_run_t.  Third, heap profiling currently works reasonably well for very high sample rates, and that functionality would have to be sacrificed.  I've toyed with this general direction many times, but I've never been able to convince myself that the tradeoffs are compelling.  I've also tried to come up with solutions to the high bitmap overhead, but it always ends up boiling down to a suballocator problem that is a liability under worst-case conditions.
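The 1/64 back-of-the-envelope, assuming an 8-byte minimum size class:

    #include <stdio.h>

    int main(void) {
        const unsigned min_region = 8;    /* bytes in the smallest size class */
        const unsigned bits_per_byte = 8;
        /* One bitmap byte covers bits_per_byte regions of min_region bytes
         * each, i.e. 64 bytes of arena memory per byte of bitmap. */
        printf("worst-case bitmap overhead: 1/%u of arena memory\n",
               min_region * bits_per_byte);
        return 0;
    }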

> This is a solution which likely makes sense only for systems with 4kb
> pages, and it looks like jemalloc2 tries hard to be agnostic about
> things like page size.  I don't know how much this speaks against the
> solution, partially because I have no idea how these 64kb pages come
> into play.  Perhaps I'm misunderstanding what LG_PAGE==16 is meant to
> mean.

Yes, there are some systems that use 64 KiB pages by default, so any solutions need to be pretty pagesize-agnostic.
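To make the pagesize dependence concrete (LG_PAGE is jemalloc's compile-time log2 of the page size): a handful of per-page chunk map bits is enough to track 1 KiB quarters on 4 KiB pages, but a 64 KiB page holds 64 such regions, which no longer fits in spare map bits:

    #include <stdio.h>

    int main(void) {
        /* Compare 4 KiB (LG_PAGE == 12) and 64 KiB (LG_PAGE == 16) pages. */
        for (unsigned lg_page = 12; lg_page <= 16; lg_page += 4) {
            unsigned page = 1u << lg_page;
            printf("LG_PAGE=%u: %u bytes/page -> %u 1 KiB regions per page\n",
                   lg_page, page, page / 1024);
        }
        return 0;
    }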

Today has been a bunch of interleaved fire drills for me, so I'd better send this email before it gets lost forever.  I'm interested to hear your thoughts on the design space.  It's tempting to do an experiment that removes run headers, but the back-of-the-envelope calculations I did earlier weren't encouraging…

Jason

