Help with a segfault

Tue Oct 7 10:51:34 PDT 2014

On Oct 6, 2014, at 10:59 PM, Marcin Zalewski <marcin.zalewski at gmail.com> wrote:
> I am using the jemalloc from 4dcf04bfc03b9e9eb50015a8fc8735de28c23090 on a Cray system. We use jemalloc for all allocations, and I get a strange issue with Cray's hugepages implementation. When I do not use the Cray hugepages module, my code runs fine. However, when I load hugepages64M, I get the following segmentation fault:
> 
> Program received signal SIGSEGV, Segmentation fault.
> je_chunk_alloc_default (size=2048, alignment=0, zero=0x7fffffffa96f, 
>     arena_ind=0) at chunk.c:254
> 254             return (chunk_alloc_core(size, alignment, false, zero,
> (gdb) bt
> #0  je_chunk_alloc_default (size=2048, alignment=0, zero=0x7fffffffa96f, 
>     arena_ind=0) at chunk.c:254
> #1  0x000000002001586f in je_huge_palloc (tsd=0x2aaab02092d0, 
>     arena=<optimized out>, size=size@entry=2048, alignment=0, 
>     zero=zero@entry=true) at huge.c:50
> #2  0x0000000020015908 in je_huge_malloc (tsd=<optimized out>, 
>     arena=<optimized out>, size=size@entry=2048, zero=zero@entry=true)
>     at huge.c:19
> #3  0x0000000020018c90 in je_icalloct (arena=<optimized out>, 
>     try_tcache=<optimized out>, size=2048, tsd=<optimized out>)
>     at ../../../contrib/jemalloc/include/jemalloc/internal/jemalloc_internal.h:662
> #4  imallocx_flags (arena=<optimized out>, try_tcache=<optimized out>, 
>     zero=true, alignment=0, usize=2048, tsd=<optimized out>) at jemalloc.c:1450
> #5  imallocx_no_prof (usize=<synthetic pointer>, flags=<optimized out>, 
>     size=<optimized out>, tsd=<optimized out>) at jemalloc.c:1531
> #6  libxxx_mallocx (size=<optimized out>, flags=<optimized out>)
>     at jemalloc.c:1550
> #7  0x00002aaaaf6b9445 in register_printf_type () from /lib64/libc.so.6
> #8  0x00002aaaabf019c0 in register_printf_flt128 ()
>     at ../../../cray-gcc-4.9.0/libquadmath/printf/quadmath-printf.c:390
> #9  0x00002aaaabf09de6 in __do_global_ctors_aux ()
>    from /opt/gcc/4.9.0/snos/lib64/libquadmath.so.0
> #10 0x00002aaaabee51fb in _init ()
>    from /opt/gcc/4.9.0/snos/lib64/libquadmath.so.0
> #11 0x00007fffffffaaf8 in ?? ()
> #12 0x00002aaaaaab91b8 in call_init () from /lib64/ld-linux-x86-64.so.2
> #13 0x00002aaaaaab92e7 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
> #14 0x00002aaaaaaabb3a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
> #15 0x0000000000000001 in ?? ()
> #16 0x00007fffffffb209 in ?? ()
> #17 0x0000000000000000 in ?? ()
> 
> I know that this is not very much info to go on, but I wonder if it rings a bell for someone immediately. As far as I can understand, the Cray hugepages module silently changes all the pages to hugepages of a chosen size:
> 
> http://www.nersc.gov/users/computational-systems/hopper/programming/tuning-options/
> 
> What could be an obvious reason to cause the segmentation fault on that line? The line in question is this:
> 
>         return (chunk_alloc_core(size, alignment, false, zero,
>             arenas[arena_ind]->dss_prec));
> 
> It seems that "arenas" is not properly initialized, but only with hugepages.

I've been staring at this for a while, but can't come up with any conclusive picture of what's going on.  Part of the problem is that there are two frames missing from the backtrace, and the reported function arguments are clearly fictional.  Here's what the first several frames of the backtrace should look like:

	je_chunk_alloc_default(...)
>>>	je_chunk_alloc_arena(...)
>>>	je_arena_chunk_alloc_huge(...)
	je_huge_palloc(...)
	je_huge_malloc(...)
	je_icalloct(...)

The calls are being made through function pointers, so I don't think it's possible for inlining to explain the omissions.

The mystery is how arenas[arena_ind] could possibly be NULL, given that arena_chunk_alloc_huge() is reading arena->ind in order to pass the arena index to chunk_alloc_arena().  In fact it's unsafe to read arenas[arena_ind] because the arenas.extend mallctl can write to the arenas pointer, but in order for that to be causing this crash, there would need to be another thread creating a new arena, and it would be a small race window.  (I'll fix the bug though!)
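
A minimal sketch of the kind of race described above, using toy stand-ins rather than jemalloc's real data structures. Every name here (toy_arena_t, toy_arenas, toy_arenas_lock, toy_arenas_extend, toy_dss_prec_get) is hypothetical. The extend operation replaces the array pointer and frees the old array, so a reader that dereferences toy_arenas[ind] without the lock could be looking at freed memory. The reader below takes the lock and copies the field out before releasing it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an arena; NOT jemalloc's arena_t. */
typedef struct {
    unsigned ind;
    int dss_prec;
} toy_arena_t;

static toy_arena_t **toy_arenas;   /* replaced wholesale on every extend */
static unsigned toy_narenas;
static pthread_mutex_t toy_arenas_lock = PTHREAD_MUTEX_INITIALIZER;

/* Append one arena. Returns the new index, or -1 on allocation failure. */
static int toy_arenas_extend(void) {
    toy_arena_t *a = calloc(1, sizeof(*a));
    if (a == NULL)
        return -1;

    pthread_mutex_lock(&toy_arenas_lock);
    toy_arena_t **grown = malloc((toy_narenas + 1) * sizeof(*grown));
    if (grown == NULL) {
        pthread_mutex_unlock(&toy_arenas_lock);
        free(a);
        return -1;
    }
    if (toy_narenas > 0)
        memcpy(grown, toy_arenas, toy_narenas * sizeof(*grown));
    a->ind = toy_narenas;
    grown[toy_narenas] = a;
    free(toy_arenas);    /* an unlocked reader could still hold this pointer */
    toy_arenas = grown;
    toy_narenas++;
    pthread_mutex_unlock(&toy_arenas_lock);
    return (int)a->ind;
}

/* Locked read: copies dss_prec into *out and returns 1, or returns 0 if
 * the index is out of range or the slot is empty. */
static int toy_dss_prec_get(unsigned ind, int *out) {
    int ok = 0;

    assert(out != NULL);
    pthread_mutex_lock(&toy_arenas_lock);
    if (ind < toy_narenas && toy_arenas[ind] != NULL) {
        *out = toy_arenas[ind]->dss_prec;
        ok = 1;
    }
    pthread_mutex_unlock(&toy_arenas_lock);
    return ok;
}
```

With this shape, a caller that already holds the arena pointer could also read the field from that pointer directly and skip the array lookup entirely; which of the two a real fix would use is not something this sketch settles.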

One random observation is that this crash is happening very early during execution, due to a library initializer running before entry into main().  It appears though that jemalloc has successfully bootstrapped itself by the time of the crash; otherwise malloc_init() would have failed in mallocx().
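
For reference, a small sketch of an allocation made from an initializer that runs before main(), which is the situation in frames #8 to #10 of the backtrace. It assumes GCC or Clang (the constructor attribute is a compiler extension), and the names early_alloc_probe, ctor_ran and ctor_alloc_ok are made up for illustration. Linking something like this into the failing program might help show whether any early allocation fails or only the one made by libquadmath.

```c
#include <stdlib.h>

static int ctor_ran;        /* set to 1 once the initializer has executed */
static int ctor_alloc_ok;   /* set to 1 if the early allocation succeeded */

/* Runs before main(), like a shared library's global constructors. */
__attribute__((constructor))
static void early_alloc_probe(void) {
    void *p = calloc(1, 2048);   /* same request size as in the backtrace */

    ctor_ran = 1;
    ctor_alloc_ok = (p != NULL);
    free(p);
}
```

Checking the two flags at the top of main() shows whether the initializer ran and whether its allocation succeeded.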

Is register_printf_type() really calling mallocx()?  I'd expect it to call malloc() or calloc(), unless jemalloc is pretty deeply integrated.

Are you able to reproduce this crash with a debug build of jemalloc (hopefully with a more accurate backtrace)?  I'm concerned that this could be a bug in jemalloc, but I can't find a code path that could cause this.  In the absence of additional evidence, my first guess is that huge pages are somehow causing a different initialization order that exposes a bug in jemalloc, but it's possible that huge pages are erroneously causing the arenas array to be zeroed after initialization, perhaps due to treating an madvise() on any sub-range as a request to discard the entire huge page.
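
The madvise() hypothesis can be probed outside jemalloc. The sketch below is an assumption-laden test, not a diagnosis: it maps a few anonymous pages, fills them with a pattern, calls madvise(MADV_DONTNEED) on a single interior page, and checks whether the neighbouring pages still hold the pattern. The function name subrange_discard_is_precise is hypothetical, the page size comes from sysconf() (which may not reflect the huge page size on the Cray system), and whether this mapping is actually backed by huge pages there is unknown.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if discarding one interior page leaves the other pages intact,
 * 0 if data outside the discarded page was lost, -1 if setup failed
 * (including madvise() rejecting the request). */
static int subrange_discard_is_precise(size_t npages) {
    assert(npages >= 3);

    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;
    size_t page = (size_t)ps;
    size_t len = npages * page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xA5, len);                       /* touch and mark every page */

    if (madvise(p + page, page, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    int intact = 1;
    for (size_t i = 0; i < len; i++) {
        if (i >= page && i < 2 * page)
            continue;                           /* skip the discarded page */
        if (p[i] != 0xA5) {
            intact = 0;                         /* a neighbour was clobbered */
            break;
        }
    }
    munmap(p, len);
    return intact;
}
```

Running it once with the hugepages module loaded and once without, with npages large enough to span the region of interest, would show whether the two environments disagree. A return of 0 only under hugepages64M would be consistent with the whole-huge-page discard guess; a return of -1 would mean the probe itself could not run as written.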

Thanks,
Jason