Tue Aug 5 16:10:59 PDT 2014

On Aug 5, 2014, at 10:35 AM, gholley at CeBiTec.Uni-Bielefeld.DE wrote:
> I’m currently working on a data structure allowing the storage of a
> dynamic set of short DNA sequences plus annotations.
> Here are a few details: the data structure is written in C, tests are
> currently run on 64-bit Ubuntu 14.04, everything is single-threaded, and
> Valgrind reports no memory leaks in the program that manipulates the data
> structure.
> 
> I’ve started to use Jemalloc in an attempt to reduce memory fragmentation
> (by using one arena, disabling the thread caching system, and using a
> high ratio of dirty pages). On small data sets (30 million insertions),
> results are very good compared to Glibc: about 150MB less with tuned
> Jemalloc.
> 
> Now, I’ve started tests with much bigger data sets (3 to 10 billion
> insertions) and I realized that Jemalloc is using more memory than Glibc.
> I generated a data set of 200 million entries, which I tried to insert
> into the data structure; when memory usage reached 1GB, I stopped the
> program and recorded the number of entries inserted.
> With Jemalloc, regardless of the tuning parameters (1 or 4 arenas,
> tcache activated or not, lg_dirty = 3 or 8 or 16, lg_chunk = 14 or 22 or
> 30), the number of entries inserted varies between 120 million and 172
> million. With standard Glibc, by contrast, I’m able to insert 187 million
> entries.
> And on billions of entries, Glibc (I don’t have precise numbers
> unfortunately) uses a few gigabytes less than Jemalloc.
> 
> So I would like to know if there is an explanation for this and if I can
> do something to make Jemalloc at least as efficient as Glibc is on my
> tests? Maybe I’m not using Jemalloc correctly?

There are a few possible issues, mainly related to fragmentation, but I can't make many specific guesses because I don't know what the allocation/deallocation patterns are in your application.  It sounds like your application just does a bunch of allocation, with very little interspersed deallocation, in which case I'm surprised by your results unless you happen to be allocating lots of objects that are barely larger than the nearest size class boundaries (e.g. 17 bytes).  Have you taken a close look at the output of malloc_stats_print()?
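
To illustrate the size class point, here is a minimal sketch of how rounding a request up to the next size class turns into wasted memory. The table below is an assumption for illustration only (a run of small classes spaced 16 bytes apart); the real classes depend on the jemalloc version and build, and the names demo_classes, demo_round_up and demo_wasted_bytes are hypothetical helpers, not part of any allocator API.

```c
#include <stddef.h>

/* Hypothetical size-class table, for illustration only.
 * Real size classes depend on the allocator version and configuration. */
const size_t demo_classes[] = {8, 16, 32, 48, 64, 80, 96, 112, 128};
const size_t demo_nclasses = sizeof(demo_classes) / sizeof(demo_classes[0]);

/* Return the smallest class in the table that can hold `request` bytes,
 * or 0 if the request is larger than every class in this small table. */
size_t demo_round_up(size_t request)
{
    for (size_t i = 0; i < demo_nclasses; i++) {
        if (demo_classes[i] >= request) {
            return demo_classes[i];
        }
    }
    return 0;
}

/* Bytes lost to rounding when allocating `count` objects of `request`
 * bytes each. Returns 0 when the request does not fit the table. */
unsigned long long demo_wasted_bytes(size_t request, unsigned long long count)
{
    size_t cls = demo_round_up(request);
    if (cls == 0) {
        return 0;
    }
    return (unsigned long long)(cls - request) * count;
}
```

Under this assumed table, a 17-byte request lands in the 32-byte class, so each object wastes 15 bytes, while a 16-byte request wastes nothing. The per-class counts printed by malloc_stats_print() are the place to check whether an application's real requests fall into this pattern.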

Jason