jemalloc tuning help
Jason Evans
jasone at canonware.com
Thu Nov 14 17:20:45 PST 2013
On Nov 14, 2013, at 1:17 PM, Nikhil Bhatia <nbhatia at vmware.com> wrote:
> I am observing a huge gap between the "total allocated" &
> "active" counts in the jemalloc stats. The "active" & "mapped"
> correctly point to the RSS and VIRT counters in top. Below
> is a snippet of the stats output.
>
> How should I infer this gap? Is this the fragmentation caused
> by the chunk metadata & unused dirty pages?
The gap is due to external fragmentation of small-object page runs. I computed per-size-class fragmentation and the overall blame for the fragmented memory:
bin  size  regs  pgs  allocated  curruns  utilization  % of small  frag memory  % of blame
0 8 501 1 50937728 40745 31% 1% 112368232 4%
1 16 252 1 77020144 21604 88% 2% 10087184 0%
2 32 126 1 429852096 231731 46% 12% 504487296 20%
3 48 84 1 774254160 344983 56% 22% 616717296 24%
4 64 63 1 270561344 102283 66% 8% 141843712 6%
5 80 50 1 526179760 163248 81% 15% 126812240 5%
6 96 84 2 66918048 20469 41% 2% 98143968 4%
7 112 72 2 141823360 31895 55% 4% 115377920 4%
8 128 63 2 117911808 22666 65% 3% 64866816 3%
9 160 51 2 104119200 22748 56% 3% 81504480 3%
10 192 63 3 178081344 20630 71% 5% 71459136 3%
11 224 72 4 65155104 5327 76% 2% 20758752 1%
12 256 63 4 48990208 7009 43% 1% 64050944 2%
13 320 63 5 99602240 10444 47% 3% 110948800 4%
14 384 63 6 22376448 1897 49% 1% 23515776 1%
15 448 63 7 19032384 2290 29% 1% 45600576 2%
16 512 63 8 83511808 4852 53% 2% 72994304 3%
17 640 51 8 40183040 2979 41% 1% 57051520 2%
18 768 47 9 17687040 747 66% 1% 9276672 0%
19 896 45 10 17929856 730 61% 1% 11503744 0%
20 1024 63 16 226070528 4142 85% 6% 41138176 2%
21 1280 51 16 24062720 786 47% 1% 27247360 1%
22 1536 42 16 9480192 326 45% 0% 11550720 0%
23 1792 38 17 3695104 223 24% 0% 11490304 0%
24 2048 65 33 42412032 565 56% 1% 32800768 1%
25 2560 52 33 27392000 760 27% 1% 73779200 3%
26 3072 43 33 1959936 65 23% 0% 6626304 0%
27 3584 39 35 24493056 235 75% 1% 8354304 0%
utilization = allocated / (size * regs * curruns)
% of small  = allocated / total allocated
frag memory = (size * regs * curruns) - allocated
% of blame  = frag memory / total frag memory
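For concreteness, here is a minimal sketch of how those derived columns can be recomputed from the raw per-bin numbers (size, regs, allocated, curruns). The struct and function names are mine, purely for illustration; they are not part of jemalloc's API.

#include <stdio.h>
#include <stddef.h>

/* Raw per-bin numbers, as printed by the jemalloc stats dump above. */
struct bin_stats {
	size_t size;      /* region size for this bin (bytes) */
	size_t regs;      /* regions per run */
	size_t allocated; /* bytes currently allocated from this bin */
	size_t curruns;   /* runs currently backing this bin */
};

static void
report(const struct bin_stats *bins, size_t nbins)
{
	size_t total_alloc = 0, total_frag = 0;
	size_t i;

	/* First pass: totals needed for the percentage columns. */
	for (i = 0; i < nbins; i++) {
		size_t cap = bins[i].size * bins[i].regs * bins[i].curruns;
		total_alloc += bins[i].allocated;
		total_frag += cap - bins[i].allocated;
	}

	/* Second pass: utilization, % of small, frag memory, % of blame. */
	for (i = 0; i < nbins; i++) {
		size_t cap = bins[i].size * bins[i].regs * bins[i].curruns;
		size_t frag = cap - bins[i].allocated;

		printf("size %5zu: util %3.0f%%  %% of small %3.0f%%  "
		    "frag %12zu  %% of blame %3.0f%%\n",
		    bins[i].size,
		    100.0 * (double)bins[i].allocated / (double)cap,
		    100.0 * (double)bins[i].allocated / (double)total_alloc,
		    frag,
		    100.0 * (double)frag / (double)total_frag);
	}
}

int
main(void)
{
	/* Bins 2 and 3 from the table above; the percentage columns here are
	 * relative to just these two bins, not the full table. */
	struct bin_stats bins[] = {
		{ 32, 126, 429852096, 231731 },
		{ 48,  84, 774254160, 344983 },
	};

	report(bins, sizeof(bins) / sizeof(bins[0]));
	return (0);
}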
In order for fragmentation to be that bad, your application has to have a steady-state memory usage that is well below its peak usage. In absolute terms, 32- and 48-byte allocations are to blame for nearly half the total fragmentation, and they have utilization (1 - fragmentation) of 46% and 56%, respectively.
The core of the problem is that short-lived and long-lived object allocations are being interleaved even during near-peak memory usage, and when the short-lived objects are freed, the long-lived objects keep entire page runs active, even if almost all neighboring regions have been freed. jemalloc is robust with regard to multiple grow/shrink cycles, in that its layout policies keep fragmentation from increasing from cycle to cycle, but it can do very little about the external fragmentation that exists during the low-usage time periods. If the application accumulates long-lived objects (i.e. each peak is higher than the previous), then the layout policies tend to cause accumulation of long-lived objects in low memory, and fragmentation in high memory is proportionally small. Presumably that's not how your application behaves though.
You can potentially mitigate the problem by reducing the number of arenas (this only helps if per-thread memory-usage spikes are uncorrelated). Another possibility is to segregate short- and long-lived objects into different arenas, but this requires reliable (and ideally stable) knowledge of object lifetimes; in practice such segregation is usually very difficult to maintain. If you choose to go in this direction, take a look at the "arenas.extend" mallctl (for creating an arena to hold the long-lived objects) and the ALLOCM_ARENA(a) macro argument to the [r]allocm() functions.
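Roughly, the pattern looks like this (a minimal sketch using the 3.x experimental API; depending on how jemalloc was configured, the symbols may carry a je_ prefix):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <jemalloc/jemalloc.h>

int
main(void)
{
	unsigned arena_ind;
	size_t sz = sizeof(arena_ind);
	void *p, *q;

	/* Create a dedicated arena for long-lived objects; the new arena's
	 * index comes back through the oldp/oldlenp arguments. */
	if (mallctl("arenas.extend", &arena_ind, &sz, NULL, 0) != 0) {
		fprintf(stderr, "arenas.extend failed\n");
		return (1);
	}

	/* Long-lived object: allocate it explicitly from that arena. */
	if (allocm(&p, NULL, 4096, ALLOCM_ARENA(arena_ind)) != ALLOCM_SUCCESS) {
		fprintf(stderr, "allocm failed\n");
		return (1);
	}
	memset(p, 0, 4096);

	/* Short-lived objects keep going through plain malloc()/free(),
	 * i.e. the calling thread's usual arena. */
	q = malloc(64);
	free(q);

	dallocm(p, 0);
	return (0);
}

The point is simply that the long-lived allocations end up packed into runs that the short-lived allocations never touch, so freeing the short-lived objects can actually give whole runs back.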
> I am purging unused
> dirty pages a bit more aggressively than default (lg_dirty_mult: 5).
> Should I consider being more aggressive?
Dirty page purging isn't related to this problem.
> Secondly, I am using 1 arena per CPU core but my application creates
> lots of transient threads making small allocations. Should I consider
> using more arenas to mitigate performance bottlenecks incurred due to
> blocking on per-arena locks?
In general, the more arenas you have, the worse fragmentation is likely to be. Use the smallest number of arenas that doesn't unacceptably degrade throughput.
> Finally, looking at the jemalloc stats how should I go about
> configuring the tcache? My application has a high thread churn &
> each thread performs lots of short-lived small allocations. Should
> I consider decreasing lg_tcache_max to 4K?
This probably won't have much effect one way or the other, but setting lg_tcache_max to 12 will potentially reduce memory overhead, so go for it if application throughput doesn't degrade unacceptably as a side effect.
It's worth mentioning that the tcache is a cause of fragmentation, because it thwarts jemalloc's layout policy of always choosing the lowest available region. Fragmentation may go down substantially if you completely disable the tcache, though the potential throughput degradation may be unacceptable.
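For reference, all of the knobs mentioned above (narenas, lg_dirty_mult, lg_tcache_max, tcache) can be set without rebuilding jemalloc, either via the MALLOC_CONF environment variable or via an application-provided malloc_conf symbol. A sketch; the particular values are only illustrative, not a recommendation:

/* jemalloc reads this application-provided string at startup; the same
 * options can instead be supplied at run time via the MALLOC_CONF
 * environment variable, e.g.
 *   MALLOC_CONF="narenas:4,lg_dirty_mult:5,lg_tcache_max:12"
 */
const char *malloc_conf = "narenas:4,lg_dirty_mult:5,lg_tcache_max:12";

To measure the tcache's contribution to fragmentation, add tcache:false to the option string and compare the stats output before and after.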
Jason