jemalloc coring in je_bitmap_set

Tue Mar 31 05:49:10 PDT 2015

Lately I've got reports of one of our server applications coring in je_bitmap_set. 

There is a common pattern with the stack traces from all the cores.  Starting at the call to malloc the stack trace is always the same (shown below). The frames leading up to the call to malloc don't show any common pattern, plus there is nothing immediately obvious in the application code that calls malloc. We are running a standard jemalloc 3.6.0 on a 64-bit Redhat 6.4

This issue cannot be reproduced in our test setup, so I'm looking for ideas on how to root cause the issue from the core file.  

Here are details on my analysis to date

#0  0x00002b997a4a65df in je_bitmap_set (bit=18446744073709551555, 
    binfo=0x2b997a6d1540 <je_arena_bin_info+768>, bitmap=0x2b999e660010)
    at include/jemalloc/internal/bitmap.h:101
#1  je_bitmap_sfu (binfo=0x2b997a6d1540 <je_arena_bin_info+768>, 
    bitmap=0x2b999e660010) at include/jemalloc/internal/bitmap.h:140
#2  arena_run_reg_alloc (bin_info=0x2b997a6d1518 <je_arena_bin_info+728>, 
    run=0x2b999e660000) at src/arena.c:291
#3  je_arena_tcache_fill_small (arena=0x2b997b07f700, tbin=0x2b99a5406108, 
    binind=7, prof_accumbytes=<optimized out>) at src/arena.c:1479
#4  0x00002b997a4c55af in je_tcache_alloc_small_hard (tcache=<optimized out>, 
    tbin=0x2b99a5406108, binind=<optimized out>) at src/tcache.c:72
#5  0x00002b997a49a3fd in je_tcache_alloc_small (size=<optimized out>, 
    tcache=0x2b99a5406000, zero=false)
    at include/jemalloc/internal/tcache.h:303
#6  je_arena_malloc (try_tcache=true, zero=false, size=<optimized out>, 
    arena=0x0) at include/jemalloc/internal/arena.h:957
#7  je_imalloct (arena=0x0, try_tcache=true, size=<optimized out>)
    at include/jemalloc/internal/jemalloc_internal.h:771
#8  je_imalloc (size=<optimized out>)
    at include/jemalloc/internal/jemalloc_internal.h:780
#9  malloc (size=<optimized out>) at src/jemalloc.c:929

The core is triggered at line 101 of bitmap_set below

88        JEMALLOC_INLINE void
89        bitmap_set(bitmap_t *bitmap, const bitmap_info_t *binfo, size_t bit)
90        {
91                size_t goff;
92                bitmap_t *gp;
93                bitmap_t g;
94        
95                assert(bit < binfo->nbits);
96                assert(bitmap_get(bitmap, binfo, bit) == false);
97                goff = bit >> LG_BITMAP_GROUP_NBITS;
98                gp = &bitmap[goff];
99                g = *gp;
100              assert(g & (1LU << (bit & BITMAP_GROUP_NBITS_MASK)));
101              g ^= 1LU << (bit & BITMAP_GROUP_NBITS_MASK);

In all the cores the value of "bit" is 0xffffffffffff. That then results in "g" being an invalid pointer. For example, this is what I see in one of the cores

    (gdb) p bitmap
    $15 = (bitmap_t *) 0x2b999e660010
    (gdb) p gp
    $16 = (bitmap_t *) 0x20002b999e660008
    (gdb) p g
    Cannot access memory at address 0x20002b999e660008

Here is an analysis of how "bit" got set to 0xffffffffffff in bitmap_sfu from the same core file

121	/* sfu: set first unset. */
122	JEMALLOC_INLINE size_t
123	bitmap_sfu(bitmap_t *bitmap, const bitmap_info_t *binfo)
124	{
125		size_t bit;
126		bitmap_t g;
127		unsigned i;
128	
129		assert(bitmap_full(bitmap, binfo) == false);
130	
131		i = binfo->nlevels - 1;
132		g = bitmap[binfo->levels[i].group_offset];
133		bit = ffsl(g) - 1;
134		while (i > 0) {
135			i--;
136			g = bitmap[binfo->levels[i].group_offset + bit];
137			bit = (bit << LG_BITMAP_GROUP_NBITS) + (ffsl(g) - 1);
138		}
139	
140		bitmap_set(bitmap, binfo, bit);
141		return (bit);
142	}

(gdb) p binfo->nlevels
$3 = 2
(gdb) p binfo->levels[1].group_offset
$4 = 2
(gdb) p bitmap[2]
$5 = 1

(gdb) hexdump memory bitmap bitmap+16
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000010  01 00 00 00 00 00 00 00  50 79 eb 08 74 2b 00 00  |........Py..t+..|
00000020  b0 79 eb 08 74 2b 00 00  00 00 00 00 01 00 00 00  |.y..t+..........|
00000030  02 00 00 00 01 00 00 00  01 00 00 00 01 00 00 00  |................|
00000040  b0 02 ea 08 74 2b 00 00  40 18 ea 08 74 2b 00 00  |....t+.. at ...t+..|
00000050  b0 12 ea 08 74 2b 00 00  00 00 00 00 00 00 00 00  |....t+..........|
00000060  00 00 00 00 09 00 00 00  e5 ff ff ff 12 00 00 00  |................|
00000070  02 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|

Above means that "i" will be 1 at line 131, "g" will be 1 at line 132 and "bit" will be 0 at line 133.

The while loop will run once (because "I" is 1), so "g" is set to 1 at line 136, then bit get set to 0xffffffffffff at line 137.

That's as far as I've got.

Until the start of this week I didn't know the jemalloc code at all, so I'm not sure what my analysis infers about the root cause for the core. Any suggestions about what to check next would be most welcome.

cheers
Paul