jemalloc tuning help
Jason Evans
jasone at canonware.com
Thu Nov 14 17:20:45 PST 2013
On Nov 14, 2013, at 1:17 PM, Nikhil Bhatia <nbhatia at vmware.com> wrote:
> I am observing a huge gap between the "total allocated" &
> "active" counts in the jemalloc stats. The "active" & "mapped"
> correctly point to the RSS and VIRT counters in top. Below
> is a snippet of the stats output.
>
> How should I infer this gap? Is this the fragmentation caused
> by the chunk metadata & unused dirty pages?
The gap is due to external fragmentation of small-object page runs. I computed per-size-class fragmentation and the overall blame for the fragmented memory:
bin  size  regs  pgs  allocated  curruns  utilization  % of small  frag memory  % of blame
0 8 501 1 50937728 40745 31% 1% 112368232 4%
1 16 252 1 77020144 21604 88% 2% 10087184 0%
2 32 126 1 429852096 231731 46% 12% 504487296 20%
3 48 84 1 774254160 344983 56% 22% 616717296 24%
4 64 63 1 270561344 102283 66% 8% 141843712 6%
5 80 50 1 526179760 163248 81% 15% 126812240 5%
6 96 84 2 66918048 20469 41% 2% 98143968 4%
7 112 72 2 141823360 31895 55% 4% 115377920 4%
8 128 63 2 117911808 22666 65% 3% 64866816 3%
9 160 51 2 104119200 22748 56% 3% 81504480 3%
10 192 63 3 178081344 20630 71% 5% 71459136 3%
11 224 72 4 65155104 5327 76% 2% 20758752 1%
12 256 63 4 48990208 7009 43% 1% 64050944 2%
13 320 63 5 99602240 10444 47% 3% 110948800 4%
14 384 63 6 22376448 1897 49% 1% 23515776 1%
15 448 63 7 19032384 2290 29% 1% 45600576 2%
16 512 63 8 83511808 4852 53% 2% 72994304 3%
17 640 51 8 40183040 2979 41% 1% 57051520 2%
18 768 47 9 17687040 747 66% 1% 9276672 0%
19 896 45 10 17929856 730 61% 1% 11503744 0%
20 1024 63 16 226070528 4142 85% 6% 41138176 2%
21 1280 51 16 24062720 786 47% 1% 27247360 1%
22 1536 42 16 9480192 326 45% 0% 11550720 0%
23 1792 38 17 3695104 223 24% 0% 11490304 0%
24 2048 65 33 42412032 565 56% 1% 32800768 1%
25 2560 52 33 27392000 760 27% 1% 73779200 3%
26 3072 43 33 1959936 65 23% 0% 6626304 0%
27 3584 39 35 24493056 235 75% 1% 8354304 0%
utilization = allocated / (size * regs * curruns)
% of small  = allocated / total allocated
frag memory = (size * regs * curruns) - allocated
% of blame  = frag memory / total frag memory
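For concreteness, here is a minimal sketch of how those derived columns can be recomputed from the raw per-bin numbers (size, regs, allocated, curruns). The struct and function names are mine, purely for illustration; they are not part of jemalloc's API.

#include <stdio.h>
#include <stddef.h>

/* Raw per-bin numbers, as printed by the jemalloc stats dump above. */
struct bin_stats {
	size_t size;      /* region size for this bin (bytes) */
	size_t regs;      /* regions per run */
	size_t allocated; /* bytes currently allocated from this bin */
	size_t curruns;   /* runs currently backing this bin */
};

static void
report(const struct bin_stats *bins, size_t nbins)
{
	size_t total_alloc = 0, total_frag = 0;
	size_t i;

	/* First pass: totals needed for the percentage columns. */
	for (i = 0; i < nbins; i++) {
		size_t cap = bins[i].size * bins[i].regs * bins[i].curruns;
		total_alloc += bins[i].allocated;
		total_frag += cap - bins[i].allocated;
	}

	/* Second pass: utilization, % of small, frag memory, % of blame. */
	for (i = 0; i < nbins; i++) {
		size_t cap = bins[i].size * bins[i].regs * bins[i].curruns;
		size_t frag = cap - bins[i].allocated;

		printf("size %5zu: util %3.0f%%  %% of small %3.0f%%  "
		    "frag %12zu  %% of blame %3.0f%%\n",
		    bins[i].size,
		    100.0 * (double)bins[i].allocated / (double)cap,
		    100.0 * (double)bins[i].allocated / (double)total_alloc,
		    frag,
		    100.0 * (double)frag / (double)total_frag);
	}
}

int
main(void)
{
	/* Bins 2 and 3 from the table above; the percentage columns here are
	 * relative to just these two bins, not the full table. */
	struct bin_stats bins[] = {
		{ 32, 126, 429852096, 231731 },
		{ 48,  84, 774254160, 344983 },
	};

	report(bins, sizeof(bins) / sizeof(bins[0]));
	return (0);
}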
In order for fragmentation to be that bad, your application has to have a steady-state memory usage that is well below its peak usage. In absolute terms, 32- and 48-byte allocations are to blame for nearly half the total fragmentation, and they have utilization (1 - fragmentation) of 46% and 56%, respectively.
The core of the problem is that short-lived and long-lived object allocations are being interleaved even during near-peak memory usage, and when the short-lived objects are freed, the long-lived objects keep entire page runs active, even if almost all neighboring regions have been freed. jemalloc is robust with regard to multiple grow/shrink cycles, in that its layout policies keep fragmentation from increasing from cycle to cycle, but it can do very little about the external fragmentation that exists during the low-usage time periods. If the application accumulates long-lived objects (i.e. each peak is higher than the previous), then the layout policies tend to cause accumulation of long-lived objects in low memory, and fragmentation in high memory is proportionally small. Presumably that's not how your application behaves though.
You can potentially mitigate the problem by reducing the number of arenas (this only helps if per-thread memory-usage spikes are uncorrelated). Another possibility is to segregate short- and long-lived objects into different arenas, but this requires reliable (and ideally stable) knowledge of object lifetimes; in practice such segregation is usually very difficult to maintain. If you choose to go in this direction, take a look at the "arenas.extend" mallctl (for creating an arena to hold the long-lived objects) and the ALLOCM_ARENA(a) macro argument to the [r]allocm() functions.
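Roughly, the pattern looks like this (a minimal sketch using the 3.x experimental API; depending on how jemalloc was configured, the symbols may carry a je_ prefix):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <jemalloc/jemalloc.h>

int
main(void)
{
	unsigned arena_ind;
	size_t sz = sizeof(arena_ind);
	void *p, *q;

	/* Create a dedicated arena for long-lived objects; the new arena's
	 * index comes back through the oldp/oldlenp arguments. */
	if (mallctl("arenas.extend", &arena_ind, &sz, NULL, 0) != 0) {
		fprintf(stderr, "arenas.extend failed\n");
		return (1);
	}

	/* Long-lived object: allocate it explicitly from that arena. */
	if (allocm(&p, NULL, 4096, ALLOCM_ARENA(arena_ind)) != ALLOCM_SUCCESS) {
		fprintf(stderr, "allocm failed\n");
		return (1);
	}
	memset(p, 0, 4096);

	/* Short-lived objects keep going through plain malloc()/free(),
	 * i.e. the calling thread's usual arena. */
	q = malloc(64);
	free(q);

	dallocm(p, 0);
	return (0);
}

The point is simply that the long-lived allocations end up packed into runs that the short-lived allocations never touch, so freeing the short-lived objects can actually give whole runs back.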
> I am purging unused
> dirty pages a bit more aggressively than default (lg_dirty_mult: 5).
> Should I consider being more aggressive?
Dirty page purging isn't related to this problem.
> Secondly, I am using 1 arena per CPU core but my application creates
> lots of transient threads making small allocations. Should I consider
> using more arenas to mitigate performance bottlenecks incurred due to
> blocking on per-arena locks?
In general, the more arenas you have, the worse fragmentation is likely to be. Use the smallest number of arenas that doesn't unacceptably degrade throughput.
> Finally, looking at the jemalloc stats how should I go about
> configuring the tcache? My application has a high thread churn &
> each thread performs lots of short-lived small allocations. Should
> I consider decreasing lg_tcache_max to 4K?
This probably won't have much effect one way or the other, but setting lg_tcache_max to 12 will potentially reduce memory overhead, so go for it if application throughput doesn't degrade unacceptably as a side effect.
It's worth mentioning that the tcache is a cause of fragmentation, because it thwarts jemalloc's layout policy of always choosing the lowest available region. Fragmentation may go down substantially if you completely disable the tcache, though the potential throughput degradation may be unacceptable.
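For reference, all of the knobs mentioned above (narenas, lg_dirty_mult, lg_tcache_max, tcache) can be set without rebuilding jemalloc, either via the MALLOC_CONF environment variable or via an application-provided malloc_conf symbol. A sketch; the particular values are only illustrative, not a recommendation:

/* jemalloc reads this application-provided string at startup; the same
 * options can instead be supplied at run time via the MALLOC_CONF
 * environment variable, e.g.
 *   MALLOC_CONF="narenas:4,lg_dirty_mult:5,lg_tcache_max:12"
 */
const char *malloc_conf = "narenas:4,lg_dirty_mult:5,lg_tcache_max:12";

To measure the tcache's contribution to fragmentation, add tcache:false to the option string and compare the stats output before and after.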
Jason