hint on debugging what seems to be a deadlock

Ricardo Nabinger Sanchez rnsanchez at wait4.org
Thu Aug 9 08:36:01 PDT 2012


We are running jemalloc-3.0.0 on a busy server (glibc-2.15-r2 on Gentoo,
kernel 3.2.25), and our application frequently ends up in the backtrace
pasted below.  We'd appreciate tips on where to start looking for problems.

(gdb) bt
#0  0x00007f1e2b3ee304 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x00007f1e2b3e9789 in _L_lock_534 () from /lib/libpthread.so.0
#2  0x00007f1e2b3e959e in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x00007f1e2b606e4d in malloc_mutex_lock (arena=0x7f1e2ac7f8c0, chunk=0x7df49fc00000, run=0x7df49ff8b000, bin=0x7f1e2ac7fa48)
    at include/jemalloc/internal/mutex.h:77
#4  arena_dalloc_bin_run (arena=0x7f1e2ac7f8c0, chunk=0x7df49fc00000, run=0x7df49ff8b000, bin=0x7f1e2ac7fa48) at src/arena.c:1520
#5  0x00007f1e2b60782a in arena_dalloc_bin_locked (arena=0x7f1e2ac7f8c0, chunk=0x7df49fc00000, ptr=<value optimized out>, 
    mapelm=<value optimized out>) at src/arena.c:1593
#6  0x00007f1e2b61fa57 in tcache_bin_flush_small (tbin=0x7df48dc06048, binind=1, rem=35, tcache=0x7df48dc06000) at src/tcache.c:128
#7  0x00007f1e2b61fdc5 in tcache_event_hard (tcache=0x7df48dc06000) at src/tcache.c:39
#8  0x00007f1e2b600f18 in tcache_event (ptr=<value optimized out>) at include/jemalloc/internal/tcache.h:271
#9  tcache_dalloc_large (ptr=<value optimized out>) at include/jemalloc/internal/tcache.h:435
#10 arena_dalloc (ptr=<value optimized out>) at include/jemalloc/internal/arena.h:966
#11 idalloc (ptr=<value optimized out>) at include/jemalloc/internal/jemalloc_internal.h:840
#12 iqalloc (ptr=<value optimized out>) at include/jemalloc/internal/jemalloc_internal.h:852
#13 free (ptr=<value optimized out>) at src/jemalloc.c:1212

When this happens, most of our threads get stuck in what seems to be
a deadlock:

(gdb) i thr
  28 Thread 0x7f1e2c30c700 (LWP 30862)  0x00007f1e2b1256a9 in syscall () from /lib/libc.so.6
  27 Thread 0x7df457fff700 (LWP 30870)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  26 Thread 0x7df4577fe700 (LWP 30871)  0x00007f1e2b0f64dd in nanosleep () from /lib/libc.so.6
  25 Thread 0x7df456ffd700 (LWP 30872)  0x00007f1e2b3ef03d in nanosleep () from /lib/libpthread.so.0
  24 Thread 0x7df4567fc700 (LWP 30873)  0x00007f1e2b3eeafd in accept () from /lib/libpthread.so.0
  23 Thread 0x7df455ffb700 (LWP 30874)  0x00007f1e2b1256a9 in syscall () from /lib/libc.so.6
  22 Thread 0x7df455f7a700 (LWP 30875)  0x00007f1e2b1256a9 in syscall () from /lib/libc.so.6
  21 Thread 0x7df455ef9700 (LWP 30876)  0x00007f1e2b1256a9 in syscall () from /lib/libc.so.6
  20 Thread 0x7df455e78700 (LWP 30877)  0x00007f1e2b1256a9 in syscall () from /lib/libc.so.6
  19 Thread 0x7df455df7700 (LWP 30878)  0x00007f1e2b1256a9 in syscall () from /lib/libc.so.6
  18 Thread 0x7df455d76700 (LWP 30879)  0x00007f1e2b3ee304 in __lll_lock_wait () from /lib/libpthread.so.0
  17 Thread 0x7df455cf5700 (LWP 30880)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  16 Thread 0x7df455c74700 (LWP 30881)  0x00007f1e2b1256a9 in syscall () from /lib/libc.so.6
  15 Thread 0x7df455bf3700 (LWP 30882)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  14 Thread 0x7df455b72700 (LWP 30883)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  13 Thread 0x7df455af1700 (LWP 30884)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  12 Thread 0x7df455a70700 (LWP 30885)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  11 Thread 0x7df4559ef700 (LWP 30886)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  10 Thread 0x7df45596e700 (LWP 30887)  0x00007f1e2b3ee304 in __lll_lock_wait () from /lib/libpthread.so.0
  9 Thread 0x7df4558ed700 (LWP 30888)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  8 Thread 0x7df45586c700 (LWP 30889)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  7 Thread 0x7df4557eb700 (LWP 30890)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  6 Thread 0x7df45576a700 (LWP 30891)  0x00007f1e2b3ee304 in __lll_lock_wait () from /lib/libpthread.so.0
  5 Thread 0x7df4556e9700 (LWP 30892)  0x00007f1e2b1256a9 in syscall () from /lib/libc.so.6
  4 Thread 0x7df455668700 (LWP 30893)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  3 Thread 0x7df4555e7700 (LWP 30894)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  2 Thread 0x7df455566700 (LWP 30895)  0x00007f1e2b3ef03d in nanosleep () from /lib/libpthread.so.0
* 1 Thread 0x7f1e2c30d740 (LWP 30861)  0x00007f1e2b1256a9 in syscall () from /lib/libc.so.6

All threads in pthread_once() or __lll_lock_wait() are stuck and do not
respond to anything, so we have to send SIGKILL to the application.  We have
*no* reason to suspect jemalloc itself, but we cannot confirm this with other
allocators because they simply couldn't handle the load the last time we tried.
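
To make the picture easier to read, the "i thr" listing can be grouped by the
function each thread is stopped in.  What follows is only a minimal sketch in
Python: the names ROW, tally_threads and SAMPLE are made up for illustration,
SAMPLE holds just a few rows copied from the listing above, and the regular
expression assumes the row layout gdb printed in this session.

```python
import re
from collections import defaultdict

# Matches one row of the "info threads" output shown above, e.g.
#   18 Thread 0x7df455d76700 (LWP 30879)  0x... in __lll_lock_wait () from ...
# The leading "*" marks gdb's currently selected thread and is optional.
ROW = re.compile(
    r"^\*?\s*(\d+)\s+Thread\s+0x[0-9a-f]+\s+\(LWP\s+(\d+)\)"
    r"\s+0x[0-9a-f]+\s+in\s+(\S+)\s+\("
)


def tally_threads(listing):
    """Group gdb thread numbers by the function each thread is stopped in."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        m = ROW.match(line.strip())
        if m:
            thread_no, _lwp, func = m.groups()
            groups[func].append(int(thread_no))
    return dict(groups)


# A few rows copied from the listing above (hypothetical variable name).
SAMPLE = """\
  19 Thread 0x7df455df7700 (LWP 30878)  0x00007f1e2b1256a9 in syscall () from /lib/libc.so.6
  18 Thread 0x7df455d76700 (LWP 30879)  0x00007f1e2b3ee304 in __lll_lock_wait () from /lib/libpthread.so.0
  17 Thread 0x7df455cf5700 (LWP 30880)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
  15 Thread 0x7df455bf3700 (LWP 30882)  0x00007f1e2b3ecd3b in pthread_once () from /lib/libpthread.so.0
* 1 Thread 0x7f1e2c30d740 (LWP 30861)  0x00007f1e2b1256a9 in syscall ()
"""

if __name__ == "__main__":
    for func, threads in sorted(tally_threads(SAMPLE).items()):
        print(f"{func:20s} {len(threads):3d}  threads {threads}")
```

Counted over the full listing above, 12 threads sit in pthread_once() and 3 in
__lll_lock_wait(), so 15 of the 28 threads are in the stuck group.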

Our application no longer uses pthread locks/mutexes itself, so the pressure
on them is much lower now.

Perhaps I will be able to provide more info on this, if I can get my SSH
connection to the server back up.

Thank you for your attention.


Ricardo Nabinger Sanchez           http://rnsanchez.wait4.org/
  "Left to themselves, things tend to go from bad to worse."

More information about the jemalloc-discuss mailing list