hint on debugging what seems to be a deadlock

Thu Aug 9 11:48:10 PDT 2012

On Aug 9, 2012, at 8:36 AM, Ricardo Nabinger Sanchez wrote:
> While using jemalloc-3.0.0 on a busy server, glibc-2.15 (-r2, Gentoo),
> kernel 3.2.25, our application is frequently hitting this backtrace
> pasted below.  We'd appreciate tips on where to start looking for problems.
> 
> (gdb) bt
> […]
> 
> When this happens, most of our threads get stuck on what seems to be
> a deadlock among certain threads:
> 
> (gdb) info threads
> […]
> 
> 
> All threads on pthread_once() or __lll_lock_wait() are stuck and unresponsive
> to anything, and it requires us to fire a -KILL at the application.  We have
> *no* reason to suspect jemalloc itself, but we cannot confirm using other
> libraries because they simply couldn't handle the load when we last tried.
> 
> Our application is not using pthread locks/mutexes anymore, so the pressure
> on them is much lower now.
> 
> Perhaps I will be able to provide more info on this, if I can get my SSH
> connection to the server back up.

My experience with such deadlocks has been that either there's a bootstrapping issue due to jemalloc calling some library API that recursively allocates, or the application is trying to allocate inside a signal handler when a signal was raised due to memory corruption.  As far as I know, jemalloc's bootstrapping issues have been worked out, but I've been wrong about that a few times before. =/  In any case, bootstrapping is unlikely to be the issue here, given how far into execution your application has already gotten (assuming the pthread_once() call isn't initiated by jemalloc itself).
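
To illustrate the signal-handler case: if a handler for SIGSEGV or SIGABRT calls malloc(), printf(), or anything else that allocates, it can block forever on an allocator mutex that the interrupted thread already holds. Below is a minimal sketch of a handler that stays clear of the allocator by using only async-signal-safe calls (write() and _exit()). The function names are hypothetical and are not taken from your application or from jemalloc; treat it as one possible shape, not a drop-in fix.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: writes "fatal signal NN\n" into buf without any heap
 * allocation or stdio. Returns the number of bytes written (at most cap). */
static size_t format_fatal_signal(int signo, char *buf, size_t cap)
{
    static const char prefix[] = "fatal signal ";
    char digits[12];
    size_t n = 0, d = 0;
    unsigned int v = (signo < 0) ? 0u : (unsigned int)signo;

    /* Convert the signal number to decimal, least significant digit first. */
    do {
        digits[d++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0 && d < sizeof(digits));

    for (size_t i = 0; i < sizeof(prefix) - 1 && n < cap; i++)
        buf[n++] = prefix[i];
    while (d > 0 && n < cap)
        buf[n++] = digits[--d];
    if (n < cap)
        buf[n++] = '\n';
    return n;
}

/* Handler restricted to async-signal-safe calls: write(2) and _exit(2).
 * It never calls malloc(), free(), or stdio, so it cannot re-enter the
 * allocator while the interrupted thread holds an allocator lock. */
static void fatal_signal_handler(int signo)
{
    char msg[64];
    size_t len = format_fatal_signal(signo, msg, sizeof(msg));

    if (write(STDERR_FILENO, msg, len) < 0) {
        /* Nothing safe to do about a failed write here. */
    }
    _exit(128 + signo);
}

/* Installs the handler for SIGSEGV and SIGABRT; returns 0 on success. */
static int install_fatal_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = fatal_signal_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGABRT, &sa, NULL);
}
```

Calling install_fatal_handlers() once, early in main(), would be enough for this sketch. If your existing handlers do more than this (logging through stdio, building strings, walking data structures), that is where I would look first.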

Take a look at the gdb output from the following command and make sure you don't see any signs of recursive allocation.

	(gdb) thread apply all bt

If nothing pops out at you, I don't have any other immediate suggestions.  If you find that not all the mutex-related backtraces are the same as pasted above, there may be a deadlock bug in jemalloc that I can't spot from just one backtrace; feel free to send full gdb backtrace output my way if it's non-trivial to interpret.

Thanks,
Jason