Frequent segfaults in 2.2.5

Rogier 'DocWilco' Mulhuijzen rogier+jemalloc at fastly.com
Tue Jun 4 00:14:13 PDT 2013


We added --enable-fill and junk:true to the mix and ran headlong into a
wall of libnuma init code. It looks like memory allocated with glibc's
malloc is being freed with jemalloc's free. But even backporting the glibc
malloc hooks code from 3.x didn't help.
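
For reference, we enabled the junk filling roughly like this (assuming the
stock non-prefixed build, where the option-string symbol is plain
malloc_conf; setting MALLOC_CONF="junk:true" in the environment does the
same thing):

/*
 * Assumes a non-prefixed jemalloc built with --enable-fill.  With junk
 * filling on, new allocations are filled with 0xa5 and freed memory with
 * 0x5a, so stale reads and writes stand out in the fill pattern.
 */
const char *malloc_conf = "junk:true";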

We'll try 3.4 in the morning, see where that takes us.

Thanks for the pointers so far. :)

Cheers,

      Doc


On Mon, Jun 3, 2013 at 9:11 PM, Jason Evans <jasone at canonware.com> wrote:

> On Jun 3, 2013, at 8:19 PM, Rogier 'DocWilco' Mulhuijzen <
> rogier+jemalloc at fastly.com> wrote:
>
> We're currently using jemalloc 2.2.5 statically linked into a private fork
> of Varnish with a very high rate of malloc/calloc/free, and we're seeing
> segfaults fairly frequently (one a day across a group of 6 hosts).
>
> We had the same segfaults with 2.2.3, and upgrading to 2.2.5 seems not to
> have helped.
>
> (Also, we tried upgrading to 3.3.1 and things just got worse; enabling
> debugging made them worse still. Under time pressure, we dropped back down
> to 2.2.5.)
>
> I should mention that I backported the mmap strategy from 3.3.1 into
> 2.2.5, to prevent VM fragmentation, which was causing us to run into
> vm.max_map_count.
>
> So, to the meat of the problem! (We saw these in both 2.2.3 without the
> mmap strategy backported, and 2.2.5 with mmap strategy backported.)
>
> Unfortunately, we don't have core files (we're running with 153G resident
> and a 4075G virtual process size on one of the hosts I'm looking at right
> now), so the internal Varnish (libgcc-based) backtrace is all we have:
>
> *0x483894*: arena_tcache_fill_small+1a4
> 0x4916b9: tcache_alloc_small_hard+19
> 0x4841bf: arena_malloc+1bf
> 0x47b498: calloc+218
>
> Looking that up:
>
> # addr2line -e /usr/sbin/varnishd -i 0x483894
> /varnish-cache/lib/libjemalloc/include/jemalloc/internal/bitmap.h:101
> /varnish-cache/lib/libjemalloc/include/jemalloc/internal/bitmap.h:140
> /varnish-cache/lib/libjemalloc/src/arena.c:264
> /varnish-cache/lib/libjemalloc/src/arena.c:1395
>
> Which looks like:
>
> 97 goff = bit >> LG_BITMAP_GROUP_NBITS;
> 98 gp = &bitmap[goff];
> 99 g = *gp;
> 100 assert(g & (1LU << (bit & BITMAP_GROUP_NBITS_MASK)));
> *101* g ^= 1LU << (bit & BITMAP_GROUP_NBITS_MASK);
> 102 *gp = g;
>
> Which makes no sense at first, since there's no deref being done there,
> but a disassembly (thanks Devon) shows:
>
>   483883:       48 c1 ef 06             shr    $0x6,%rdi
>   483887:       83 e1 3f                and    $0x3f,%ecx
>   48388a:       4c 8d 04 fa             lea    (%rdx,%rdi,8),%r8
>   48388e:       49 d3 e1                shl    %cl,%r9
>   483891:       4c 89 c9                mov    %r9,%rcx
>   *483894*:       49 33 08                xor    (%r8),%rcx
>
> The optimizer got rid of g and just does the xor straight on *gp. So gp is
> an illegal address. According to our segfault handler, it's NULL.
>
> For gp to be NULL, bitmap needs to be NULL and goff needs to be zero, since
> gp is just &bitmap[goff]. And bitmap being NULL seems all but impossible due
> to:
>
> if ((run = bin->runcur) != NULL && run->nfree > 0)
>  ptr = arena_run_reg_alloc(run, &arena_bin_info[binind]);
>
> bitmap is computed as an offset from run, so both the offset and the run
> would need to be zero (or perfectly matched to cancel each other out, which
> is just as unlikely).
>
> bin->runcur and bin->bitmap_offset both being NULL seems _very_ unlikely.
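> 
> To make the arithmetic explicit, here's a toy restatement of the pointer
> math (standalone C, not the actual jemalloc code):
> 
> #include <stddef.h>
> #include <stdint.h>
> 
> /*
>  * Toy restatement: bitmap is derived from the run address plus the bin's
>  * bitmap_offset, and gp is derived from bitmap plus goff.  For gp to come
>  * out NULL, every term has to be zero (or cancel out exactly).
>  */
> static uint64_t *
> group_ptr(void *run, size_t bitmap_offset, size_t bit)
> {
>         uint64_t *bitmap = (uint64_t *)((uintptr_t)run + bitmap_offset);
>         size_t goff = bit >> 6;     /* LG_BITMAP_GROUP_NBITS == 6 here */
> 
>         return &bitmap[goff];       /* the gp whose deref faults above */
> }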
>
> And that's about as far as we've gotten.
>
> Help?
>
>
> This sort of crash can happen as a distant result of a double-free.  For
> example:
>
> 1) free(p)
> 2) free(p) corrupts run counters, causing the run to be deallocated.
> 3) free(q) causes the deallocated run to be placed back in service, but
> with definite corruption.
> 4) malloc(…) tries to allocate from run, but run metadata are in a bad
> state.
>
> My suggestion is to enable assertions (something like: CFLAGS=-O3
> ./configure --enable-debug --disable-tcache), disable tcache (which can
> keep double free bugs from exercising the assertions), and look for the
> source of corruption.  I'll be surprised if the problem deviates
> substantially from the above, but if it does, then my next bet will be a
> buffer overflow corrupting run metadata.
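> 
> In its most distilled form, the sequence in (1)-(4) is just something like
> the following (obviously not your Varnish code); with --enable-debug and
> tcache disabled, the assertions should abort on the second free() instead
> of letting a much later allocation crash:
> 
> #include <stdlib.h>
> 
> int
> main(void)
> {
>         void *p = malloc(64);
>         void *q;
> 
>         free(p);
>         free(p);        /* double free; a debug build should abort here */
>         q = malloc(64); /* in a non-debug build, a later allocation from
>                            the same bin can hit the corrupted run instead */
>         return (q == NULL);
> }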
>
> Jason
>