jemalloc performance for very small allocations
garg_rajat at hotmail.com
Tue May 7 16:14:58 PDT 2013
Thanks for the explanation/comments. Increasing TCACHE_NSLOTS_SMALL_MAX (which under the covers is limited to 1002 and 504, as you say) does help quite a bit in this case, as the profiles I ran show:
default setting ==>
   Excl.      Incl.     Name
   User CPU   User CPU
   390.275    481.712   jepvt_arena_dalloc_bin_locked
   217.540    769.057   je_free
   201.956    488.113   jepvt_arena_tcache_fill_small
   199.592    283.012   arena_bin_malloc_hard
    75.292    712.736   je_malloc
callee for je_free (most of the runtime is in jepvt_tcache_bin_flush_small)
new setting ==>
   Excl.      Incl.     Name
   User CPU   User CPU
   234.378    234.730   je_free
   154.511    307.229   jepvt_arena_malloc
    78.713    385.964   je_malloc
    57.772     70.838   jepvt_arena_dalloc_bin_locked
callee info for jepvt_arena_malloc:
So we get quite a bit of runtime improvement, but for our case the runtime in je_malloc and je_free is still quite high. Given the above profile, do you think modifying bin_info_run_size_calc() to use a larger initial min_run_size would help?
thanks a lot,
Subject: Re: jemalloc performance for very small allocations
From: jasone at canonware.com
Date: Tue, 30 Apr 2013 22:52:24 -0700
CC: jemalloc-discuss at canonware.com
To: garg_rajat at hotmail.com
On Apr 26, 2013, at 9:39 AM, Rajat Garg <garg_rajat at hotmail.com> wrote:
I am using jemalloc 3.1.0, compiled with gcc 4.1.2 on CentOS release 5.6, for a compute-intensive multithreaded application where the bulk of allocations fall in the 8-byte and 16-byte bins (stats below). We are seeing very high runtime in tcache_alloc_small_hard(), arena_tcache_fill_small() and arena_bin_malloc_hard(). The 8- and 16-byte allocations happen in very large numbers. We probably want the allocations to come from tcache_alloc_easy() so that they do not hit the locks and take less runtime. The allocations are mostly temporary in nature -- especially the 8-byte ones -- so some number of 8-byte allocations are done, then they are freed, and then the alloc/free process is repeated.
The runs use 8 threads (so 48 arenas; all other jemalloc settings are default) and are on an Intel Xeon server with 12 cores (no hyperthreading and no other users, so no contention; the peak memory in the application is ~6.5GB and the server has 128GB of memory, so no memory shortage/swapping etc.); the jemalloc output at the end of the run is given below. As we can see, pretty much only the 8- and 16-byte bins are overloaded.
Only 8 of the 48 arenas are being used in your test case. That said, the extra 40 are lazily initialized, so there's no need to change settings.
A run with TCACHE_NSLOTS_SMALL_MAX=10000, LG_TCACHE_MAXCLASS_DEFAULT=10, changing small bins to hold only up to 224 bytes (NBINS=15), and hardcoding tcache_bin_info[i].ncached_max = TCACHE_NSLOTS_SMALL_MAX in the tcache.c file improves runtime by about 16% (for overall application runtime) but increases peak memory from 6.5GB to 6.7GB. We want the runtime to improve by at least another 10%, preferably without any increase in peak memory (so even 6.5GB to 6.7GB is not desirable).
Any suggestions on what changes to jemalloc settings to try?
I don't think the TCACHE_NSLOTS_SMALL_MAX setting is having as much effect for your test case as you might expect. Even with the setting of 10000, tcache is limited to 1002 and 504 regions for 8- and 16-byte regions, respectively. If you want to further increase the tcache region count, you will need to either increase the logical page size, or modify bin_info_run_size_calc() to use a larger initial min_run_size.
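
One way to read the 1002 and 504 figures, offered here as an assumption rather than something stated in this thread: the per-bin tcache cap is the smaller of TCACHE_NSLOTS_SMALL_MAX and twice the number of regions in one run of that bin. Under that reading, a run would hold 501 8-byte regions and 252 16-byte regions (values inferred by halving the quoted caps, not taken from jemalloc output), which is why a larger page size or a larger min_run_size would raise the cap. The helper name tcache_cap_sketch below is hypothetical and is not a jemalloc function.

```c
#include <stddef.h>

/*
 * Hypothetical helper, NOT part of jemalloc.
 * Assumption: the per-bin tcache cap is min(2 * nregs, nslots_small_max),
 * where nregs is the number of regions in one run of the bin.
 */
size_t
tcache_cap_sketch(size_t nregs, size_t nslots_small_max)
{
	size_t twice_run = nregs * 2;

	/* The compile-time maximum only matters when it is the smaller value. */
	return (twice_run <= nslots_small_max) ? twice_run : nslots_small_max;
}
```

With the inferred run sizes, tcache_cap_sketch(501, 10000) gives 1002 and tcache_cap_sketch(252, 10000) gives 504, matching the limits quoted above; raising the second argument further changes nothing until nregs itself grows.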