calling a function via PLT and jemalloc realloc changes function first argument (XMM0)

David Abdurachmanov david.abdurachmanov at
Sat Nov 9 08:30:34 PST 2013


I am having problems with jemalloc 3.4.1 (we currently use 2.2.2 in production). I found that with jemalloc 3.4.1 the first argument of a function can be changed if that argument is passed in the XMM0 register. Compiled with GCC 4.8.1 (also tested with 4.8.2). There are no problems on Scientific Linux 6 (RHEL6-based), but it fails on Scientific Linux 5 (RHEL5-based). All of this is because _dl_lookup_symbol_x calls _realloc_ on Scientific Linux 5.

This probably makes jemalloc 3.4.1 and the whole 3.X.Y series not recommended for RHEL5 and RHEL5-based distributions.

Original email below.

- - - - - - -

My initial investigations were done on slc6_amd64_gcc481 and the release is available for slc5_amd64_gcc481.

Most of the workflows will fail on this [slc5_amd64_gcc481] architecture, while on slc6_amd64_gcc481 all workflows pass.

If you are interested in the cause and in the calling conventions, continue reading.

Most workflows fail with:

----- Begin Fatal Exception 08-Nov-2013 14:19:25 CET-----------------------
An exception of category 'InvalidIntervalError' occurred while
  [0] Processing run: 208307 lumi: 1 event: 643482
  [1] Running path 'reconstruction_step'
  [2] Calling event method for module TrackIPProducer/'impactParameterTagInfos'
Exception Message:
Upper boundary below lower boundary in histogram integral.
----- End Fatal Exception -------------------------------------------------

Code triggering exception (CondFormats/PhysicsToolsObjects/interface/Histogram.icc):

244 template<typename Value_t, typename Axis_t>
245 Value_t Histogram<Value_t, Axis_t>::integral(Axis_t hBound, Axis_t lBound,
246                                              int mode) const
247 {
248         if (hBound < lBound)
249                 throw cms::Exception("InvalidIntervalError")
250                         << "Upper boundary below lower boundary in "
251                         << "histogram integral." << std::endl;

The problem by example (description below):

Dump of assembler code for function PhysicsTools::Calibration::Histogram<float, float>::normalizedIntegral(float, float, int) const:
  0x00002aaabc67ceb0 <+0>:     push   %rbx
  0x00002aaabc67ceb1 <+1>:     mov    %rdi,%rbx
  0x00002aaabc67ceb4 <+4>:     sub    $0x10,%rsp
  0x00002aaabc67ceb8 <+8>:     callq  0x2aaabc6331e0 <_ZNK12PhysicsTools11Calibration9HistogramIffE8integralEffi at plt>
  0x00002aaabc67cebd <+13>:    mov    %rbx,%rdi
  0x00002aaabc67cec0 <+16>:    movss  %xmm0,0xc(%rsp)
  0x00002aaabc67cec6 <+22>:    callq  0x2aaabc632c80 <_ZNK12PhysicsTools11Calibration9HistogramIffE13normalizationEv at plt>
  0x00002aaabc67cecb <+27>:    movss  0xc(%rsp),%xmm1
  0x00002aaabc67ced1 <+33>:    add    $0x10,%rsp
  0x00002aaabc67ced5 <+37>:    divss  %xmm0,%xmm1
  0x00002aaabc67ced9 <+41>:    pop    %rbx
  0x00002aaabc67ceda <+42>:    movaps %xmm1,%xmm0
  0x00002aaabc67cedd <+45>:    retq   
End of assembler dump.
this = 0x2aab170a9ff0
hBound = 57.6329994
lBound = 0
mode = 1

Breakpoint 1, PhysicsTools::Calibration::Histogram<float, float>::integral (this=0x2aab170a9ff0, hBound=-2.23135843e-10, lBound=0, mode=1)
   at /build/davidlt/CMSSW_7_0_0_pre8_jemalloc341/src/CondFormats/PhysicsToolsObjects/interface/Histogram.icc:245
245     Value_t Histogram<Value_t, Axis_t>::integral(Axis_t hBound, Axis_t lBound,
1: x/i $pc
=> 0x2aaabc67cbdc <PhysicsTools::Calibration::Histogram<float, float>::integral(float, float, int) const>:      push   %r14
this = 0x2aab170a9ff0
hBound = -2.23135843e-10
lBound = 0
mode = 1


_normalizedIntegral_ calls _integral_ with IDENTICAL arguments, yet by the time we reach the body of _integral_ our _hBound_ has changed to a different value.

We call _integral_ via the PLT, so the symbol first has to be resolved (/lib64/). The value is modified between these two functions, while the symbol is being resolved.

That happens in _dl_lookup_symbol_x (/lib64/): on SLC5 it calls _realloc_, while on SLC6 the library calls _malloc_. This change in the dynamic linker/loader is the reason why it works fine under SLC6.
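A toy model of lazy binding may help to picture where the resolver sits. This is only a sketch in plain C: the names slot, resolve_then_call, toy_integral and call_via_slot are hypothetical, and the real PLT/GOT mechanism is implemented in assembly inside the dynamic linker, not like this. The point it illustrates is that on the first call the resolver runs between caller and callee, so the arguments have to survive whatever the resolver does.

```c
/* Signature modelled on Histogram<float, float>::integral (without "this"). */
typedef float (*integral_fn)(float hBound, float lBound, int mode);

static float resolve_then_call(float hBound, float lBound, int mode);

/* Hypothetical stand-in for a GOT slot: it starts out pointing at the resolver. */
static integral_fn slot = resolve_then_call;
static int resolver_runs = 0;

/* Hypothetical target function; the real integral() does much more. */
static float toy_integral(float hBound, float lBound, int mode)
{
    (void)mode;
    return hBound - lBound;
}

/* Runs on the first call only: "resolves" the symbol, patches the slot,
 * then forwards the ORIGINAL arguments to the real target. Anything the
 * resolver calls in between (for example realloc) must not disturb them. */
static float resolve_then_call(float hBound, float lBound, int mode)
{
    resolver_runs++;
    slot = toy_integral;
    return slot(hBound, lBound, mode);
}

/* What the caller does: an indirect call through the slot. */
static float call_via_slot(float hBound, float lBound, int mode)
{
    return slot(hBound, lBound, mode);
}
```

In C the compiler keeps the arguments safe across the resolver automatically. A hand-written assembly trampoline has to do that saving itself, and that is the step where a float argument held in xmm0 can be lost.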

_hBound_ is stored in $xmm0.v4_float[0]. It so happens that for this line of _realloc_ in jemalloc (src/jemalloc.c):

1244     ta->allocated += usize;

the compiler generates SSE-based code (using $xmm0):

  0x00002aaaad381666 <+630>:   mov    %r12,0x28(%rsp)
  0x00002aaaad38166b <+635>:   movq   0x28(%rsp),%xmm0
  0x00002aaaad381671 <+641>:   movhps 0x20(%rsp),%xmm0
  0x00002aaaad381676 <+646>:   paddq  (%rax),%xmm0
  0x00002aaaad38167a <+650>:   movdqa %xmm0,(%rax)
  0x00002aaaad38167e <+654>:   add    $0x38,%rsp 

These few instructions are enough to modify the value of _hBound_.

Old value = 57.6329994
New value = 6.72623263e-44
0x00002aaaad381671 in realloc (ptr=<optimized out>, size=<optimized out>) at src/jemalloc.c:1244
1244	src/jemalloc.c: No such file or directory.
1: x/i $pc
=> 0x2aaaad381671 <realloc+641>:	movhps 0x20(%rsp),%xmm0
Watchpoint 7: $xmm0.v4_float[0]

Old value = 6.72623263e-44
New value = -2.22548424e-10
0x00002aaaad38167a in realloc (ptr=<optimized out>, size=<optimized out>) at src/jemalloc.c:1244
1244	in src/jemalloc.c
1: x/i $pc
=> 0x2aaaad38167a <realloc+650>:	movdqa %xmm0,(%rax)
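The instruction sequence can be mimicked with SSE2 intrinsics to show why the caller suddenly sees a nonsense float. This is a sketch under assumptions: tsd_counters_t is a hypothetical stand-in for jemalloc's per-thread counters, and the intrinsics only approximate what GCC emitted. The first value gdb reports, 6.72623263e-44, is what the integer 48 looks like when its bits are read as a float (48 x 2^-149), which would be consistent with the low lane of %xmm0 holding a small byte count.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, x86-64 only */
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two per-thread counters. */
typedef struct {
    uint64_t allocated;
    uint64_t deallocated;
} tsd_counters_t;

/* Build one 128-bit value from both increments, as movq + movhps do:
 * _mm_set_epi64x(high, low), so usize lands in the low lane. */
static __m128i pack_increments(uint64_t usize, uint64_t old_size)
{
    return _mm_set_epi64x((long long)old_size, (long long)usize);
}

/* Add both increments to both counters with one 64-bit-lane add (paddq)
 * and store the result back; returns what is left in the register. */
static __m128i bump_counters(tsd_counters_t *c, __m128i inc)
{
    __m128i sum = _mm_add_epi64(_mm_loadu_si128((const __m128i *)c), inc);
    _mm_storeu_si128((__m128i *)c, sum);
    return sum;
}

/* What a caller sees if it still expects a float in the lowest 32 bits. */
static float low_lane_as_float(__m128i v)
{
    return _mm_cvtss_f32(_mm_castsi128_ps(v));
}

/* Raw bit pattern of a float, for inspection. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&u, &f, sizeof u);
    return u;
}
```

With usize = 48, low_lane_as_float(pack_increments(48, 32)) is a tiny denormal whose bit pattern is exactly 48. After bump_counters the low lane holds the updated counter instead, which matches the second watchpoint hit showing yet another unrelated float.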

Consider "Calling conventions for different C++ compilers and operating systems" (I assume it applies to C as well, as the two are compatible).

For 64-bit Linux the callee-saved registers are RBX, RBP and R12-R15. jemalloc's _realloc_ handles all of these correctly:

Dump of assembler code for function realloc:
  0x00002aaaad3803f0 <+0>:     push   %r15
  0x00002aaaad3803f2 <+2>:     push   %r14
  0x00002aaaad3803f4 <+4>:     push   %r13
  0x00002aaaad3803f6 <+6>:     push   %r12
  0x00002aaaad3803f8 <+8>:     push   %rbp
  0x00002aaaad3803f9 <+9>:     mov    %rsi,%rbp
  0x00002aaaad3803fc <+12>:    push   %rbx

But all other registers are scratch registers.

The "System V Application Binary Interface AMD64 Architecture Processor Supplement" (October 7, 2013), section 3.2.1, says the same:

Registers %rbp, %rbx and %r12 through %r15 "belong" to the calling function and the called function is required to preserve their values. In other words, a called function must preserve these registers' values for its caller. Remaining registers "belong" to the called function. If a calling function wants to preserve such a register value across a function call, it must save the value in its local stack frame.

Simply put, according to this, the /lib64/ dynamic linker/loader (_dl_lookup_symbol_x) had to take action to preserve the value of the xmm0 register before calling _realloc_.

You cannot compile jemalloc without SSE:

include/jemalloc/internal/prof.h:349:40: error: SSE register return with SSE disabled

If we cannot stop jemalloc from using SSE registers, how can we work around the problem?

1240   if (config_stats && ret != NULL) {
1241     thread_allocated_t *ta;
1242     assert(usize == isalloc(ret, config_prof));
1243     ta = thread_allocated_tsd_get();
1244     ta->allocated += usize;
1245     ta->deallocated += old_size;
1246   }

In _realloc_, line 1244 is wrapped in an if that tests config_stats. Compiling jemalloc with the --disable-stats option disables statistics collection, so this block is no longer executed, and it should also slightly increase performance.
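The effect of such a compile-time switch can be sketched as follows. The macro SKETCH_STATS, the type toy_counters_t and the function account_realloc are hypothetical names for illustration only; the sketch assumes that config_stats is a constant fixed at build time, so that with statistics disabled the compiler is free to drop the counter update, and with it the SSE code, altogether.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build-time switch: define SKETCH_STATS to enable counters. */
static const bool config_stats =
#ifdef SKETCH_STATS
    true
#else
    false
#endif
    ;

/* Hypothetical stand-in for the per-thread counters. */
typedef struct {
    uint64_t allocated;
    uint64_t deallocated;
} toy_counters_t;

/* Same shape as the block in realloc: when config_stats is a constant
 * false, the condition can never hold and the body is dead code. */
static void account_realloc(toy_counters_t *c, const void *ret,
                            uint64_t usize, uint64_t old_size)
{
    if (config_stats && ret != NULL) {
        c->allocated += usize;
        c->deallocated += old_size;
    }
}
```

Built without SKETCH_STATS, a call to account_realloc leaves both counters untouched.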

It's a bit worrisome that arguments can change in between function calls.


More information about the jemalloc-discuss mailing list