Profiling memory allocations in run-time in production

Thu Jan 16 06:10:29 PST 2014

On Wed, Jan 15, 2014 at 8:39 PM, Jason Evans <jasone at canonware.com> wrote:
> On Jan 15, 2014, at 1:09 AM, Evgeniy Ivanov <i at eivanov.com> wrote:
>> On Tue, Jan 14, 2014 at 10:22 PM, Jason Evans <jasone at canonware.com> wrote:
>>> On Dec 22, 2013, at 11:41 PM, Evgeniy Ivanov <i at eivanov.com> wrote:
>>>> I need to profile my application running in production. Is it
>>>> safe performance-wise to build jemalloc with "--enable-prof", start
>>>> the application with profiling disabled, and enable it for a short
>>>> time (probably via a mallctl() call) when I need it? I'm mostly
>>>> interested in stacks, i.e. opt.prof_accum. Or are there better
>>>> alternatives on Linux? I've tried perf, but it just counts stacks
>>>> and ignores the amount of memory allocated. There is also stap, but
>>>> I haven't tried it yet.
>>>
>>> Yes, you can use jemalloc's heap profiling as you describe, with essentially no performance impact while heap profiling is inactive.  You may even be able to leave heap profiling active all the time with little performance impact, depending on how heavily your application uses malloc.  At Facebook we leave heap profiling active all the time for a wide variety of server applications; there are only a couple of exceptions I'm aware of for which the performance impact is unacceptable (heavy malloc use, ~2% slowdown when heap profiling is active).
>>
>> What settings were you using, and what was measured, when you got
>> the 2% slowdown?
>
> My vague recollection is that the app was heavily multi-threaded, and spent about 10% of its total time in malloc.  Therefore a 2% overall slowdown corresponded to a ~20% slowdown in jemalloc itself.  Note that size class distribution matters to heap profiling performance because there are two sources of overhead (counter maintenance and backtracing), but I don’t remember what the distribution looked like.  We were using a version of libunwind that had a backtrace caching mechanism built in (it was never accepted upstream, and libunwind’s current caching mechanism cannot safely be used by malloc).
>
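The arithmetic behind that estimate can be sketched as follows. This is only a back-of-the-envelope model, assuming profiling slows down nothing but the time spent inside malloc; the function names are made up for illustration and are not part of jemalloc.

```python
def overall_slowdown(malloc_time_fraction, malloc_slowdown):
    """Overall slowdown if only the malloc share of run time gets slower."""
    return malloc_time_fraction * malloc_slowdown


def implied_malloc_slowdown(malloc_time_fraction, overall):
    """Slowdown inside malloc implied by an observed overall slowdown."""
    return overall / malloc_time_fraction


# Figures from the message above: ~10% of time in malloc, ~2% overall slowdown.
print(f"{implied_malloc_slowdown(0.10, 0.02):.0%}")  # malloc itself: 20%
print(f"{overall_slowdown(0.10, 0.20):.0%}")         # whole application: 2%
```

The same model suggests why applications that spend little time in malloc barely notice active profiling.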
>> In our latency test I got the following results:
>> normal jemalloc: 99% <= 87 usec (Avg: 65 usec)
>> inactive profiling: 99% <= 88 usec (Avg: 66 usec)
>>
>> MALLOC_CONF="prof:true,prof_active:true,lg_prof_sample:19,prof_accum:true,prof_prefix:jeprof.out"
>
> We usually use prof_accum:false, mainly because complicated call graphs can cause a huge number of retained backtraces, but otherwise your settings match.
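
To make the settings string above easier to read, here is a small sketch that splits it into options and converts lg_prof_sample into bytes. The parser is a hypothetical helper written for this note, not jemalloc's own parsing code; it relies only on the documented meaning of lg_prof_sample as the base-2 logarithm of the average number of bytes allocated between samples.

```python
def parse_malloc_conf(conf):
    """Split a MALLOC_CONF-style string into an {option: value} dict."""
    options = {}
    for item in conf.split(","):
        if not item:
            continue
        # Split on the first ':' only, so values may contain further colons.
        key, _, value = item.partition(":")
        options[key.strip()] = value.strip()
    return options


conf = ("prof:true,prof_active:true,lg_prof_sample:19,"
        "prof_accum:true,prof_prefix:jeprof.out")
options = parse_malloc_conf(conf)

# Average number of allocated bytes between two sampled allocations.
sample_interval_bytes = 1 << int(options["lg_prof_sample"])
print(sample_interval_bytes // 1024, "KiB")  # 512 KiB
```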

Stacks are our primary point of interest. Using DTrace, we trace each
malloc but skip those that request less than 16 KB. DTrace overhead
on Solaris is just 13%. I'm not sure whether allocation statistics
would be useful for us.

>> prof-libgcc: 99% <= 125 usec (Avg: 70 usec)
>> prof-libunwind: 99% <= 146 usec (Avg: 76 usec)
>>
>> So on average the slowdown is 6% for libgcc and 15% for libunwind. But
>> for the distribution (99% < X) the slowdown is 42% or 65%, depending on
>> the library, which is a huge difference. For 64 KB the numbers are
>> dramatic: a 154% (99% < X) performance loss.
>>
>> Am I missing something in the configuration?
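
The percentages quoted above can be reproduced from the reported latencies. The sketch below assumes the baseline is the "inactive profiling" row (66 usec average, 88 usec at the 99th percentile); the message does not say so explicitly, but that baseline matches the quoted figures. The helper name is illustrative only.

```python
def slowdown_pct(baseline_usec, measured_usec):
    """Relative slowdown of a measurement against a baseline, in percent."""
    return (measured_usec - baseline_usec) / baseline_usec * 100.0


# Assumed baseline: the "inactive profiling" row from the results above.
baseline = {"p99": 88, "avg": 66}
runs = {
    "prof-libgcc": {"p99": 125, "avg": 70},
    "prof-libunwind": {"p99": 146, "avg": 76},
}

results = {
    name: {metric: slowdown_pct(baseline[metric], value)
           for metric, value in metrics.items()}
    for name, metrics in runs.items()
}

for name, metrics in results.items():
    print(f"{name}: avg +{metrics['avg']:.1f}%, p99 +{metrics['p99']:.1f}%")
```

The libunwind 99th-percentile figure comes out just under 66%, which the message truncates to 65%.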
>
> If your application is spending ~10-30% of its time in malloc, then your numbers sound reasonable.  You may find that a lower sampling rate (e.g. lg_prof_sample:21) drops backtracing overhead enough that performance is acceptable.  I’ve experimented in the past with lower sampling rates, and for long-running applications I’ve found that the heap profiles are still totally usable, because total allocation volume is high.
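
A rough way to see why a lower sampling rate can still give usable profiles: the expected number of sampled allocations is the total allocation volume divided by the average sample interval, 2^lg_prof_sample bytes. The 1 TiB volume below is a made-up figure for a long-running process, not a measurement from this thread, and the sampling is probabilistic, so these are averages only.

```python
def expected_samples(total_allocated_bytes, lg_prof_sample):
    """Average number of sampled allocations for a given allocation volume."""
    return total_allocated_bytes / (1 << lg_prof_sample)


total_allocated = 1 << 40  # hypothetical: 1 TiB allocated over the process lifetime

at_19 = expected_samples(total_allocated, 19)
at_21 = expected_samples(total_allocated, 21)

print(int(at_19))     # expected samples with lg_prof_sample:19
print(int(at_21))     # expected samples with lg_prof_sample:21
print(at_19 / at_21)  # raising lg_prof_sample by 2 means 4x fewer backtraces
```

Even at lg_prof_sample:21 this hypothetical process would still record hundreds of thousands of backtraces.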

I tested this on one of our workloads. With lg_prof_sample:20 we get
an 18% slowdown, and with lg_prof_sample:21 it is just 4.5%, which is
absolutely acceptable.

Jason, thanks a lot for your answers! jemalloc is a really awesome and
powerful thing!

-- 
Cheers,
Evgeniy