calling a function via PLT and jemalloc realloc changes function first argument (XMM0)

David Abdurachmanov david.abdurachmanov at gmail.com
Sat Nov 9 09:33:19 PST 2013


Some thoughts from a colleague:

When dl_lookup_symbol gets called, it doesn't really know what it's getting into--it doesn't know what processor model the stack frames around it were built with, it doesn't necessarily even know what registers the processor has. So there's a rule that ld.so/rtld et al. shouldn't touch the xmm registers or any of the other registers beyond the base set.  There's a glibc test, tst-xmmymm.sh, which checks this.  There's also a recent bug report, https://sourceware.org/bugzilla/show_bug.cgi?id=15627 , that gcc-4.8 can vectorize such things as memset (with -O3 or -ftree-vectorize), so rtld needed to add its own assembler version of memset that doesn't touch the SSE registers.  It looks like this is basically the same thing; presumably tst-xmmymm.sh would fail if we ran it against ld.so linked with jemalloc built with gcc 4.8.

glibc contains tst-xmmymm.sh

Initial commit message states:

    Make sure no code in ld.so uses xmm/ymm registers on x86-64.

    This patch introduces a test to make sure no function modifies the
    xmm/ymm registers.  With the exception of the auditing functions.

Looks like jemalloc breaks the rule by using SSE registers.

david

On Nov 9, 2013, at 5:30 PM, David Abdurachmanov wrote:

> Hi,
> 
> I am having problems with jemalloc 3.4.1 (we currently use 2.2.2 in production). I found that with jemalloc 3.4.1 a function's first argument gets changed if it is passed in the XMM0 register. Compiled with GCC 4.8.1 (also tested with 4.8.2). There are no problems on Scientific Linux 6 (RHEL6-based), but it fails on Scientific Linux 5 (RHEL5-based). All of this is because _dl_lookup_symbol_x calls _realloc_ on Scientific Linux 5.
> 
> This probably makes jemalloc 3.4.1 and the whole 3.X.Y series not recommended for RHEL5 and RHEL5-based distributions.
> 
> Original email below.
> 
> - - - - - - -
> 
> My initial investigations were done on slc6_amd64_gcc481 and the release is available for slc5_amd64_gcc481.
> 
> Most of the workflows will fail on this [slc5_amd64_gcc481] architecture, while on slc6_amd64_gcc481 all workflows pass.
> 
> If you are interested in the cause and the calling conventions, continue reading.
> 
> Most workflows fail with:
> 
> ----- Begin Fatal Exception 08-Nov-2013 14:19:25 CET-----------------------
> An exception of category 'InvalidIntervalError' occurred while
>   [0] Processing run: 208307 lumi: 1 event: 643482
>   [1] Running path 'reconstruction_step'
>   [2] Calling event method for module TrackIPProducer/'impactParameterTagInfos'
> Exception Message:
> Upper boundary below lower boundary in histogram integral.
> ----- End Fatal Exception -------------------------------------------------
> 
> Code triggering exception (CondFormats/PhysicsToolsObjects/interface/Histogram.icc):
> 
> 244 template<typename Value_t, typename Axis_t>
> 245 Value_t Histogram<Value_t, Axis_t>::integral(Axis_t hBound, Axis_t lBound,
> 246                                              int mode) const
> 247 {
> 248         if (hBound < lBound)
> 249                 throw cms::Exception("InvalidIntervalError")
> 250                         << "Upper boundary below lower boundary in "
> 251                         << "histogram integral." << std::endl;
> 
> The problem by example (description below):
> 
> Dump of assembler code for function PhysicsTools::Calibration::Histogram<float, float>::normalizedIntegral(float, float, int) const:
>   0x00002aaabc67ceb0 <+0>:     push   %rbx
>   0x00002aaabc67ceb1 <+1>:     mov    %rdi,%rbx
>   0x00002aaabc67ceb4 <+4>:     sub    $0x10,%rsp
>   0x00002aaabc67ceb8 <+8>:     callq  0x2aaabc6331e0 <_ZNK12PhysicsTools11Calibration9HistogramIffE8integralEffi@plt>
>   0x00002aaabc67cebd <+13>:    mov    %rbx,%rdi
>   0x00002aaabc67cec0 <+16>:    movss  %xmm0,0xc(%rsp)
>   0x00002aaabc67cec6 <+22>:    callq  0x2aaabc632c80 <_ZNK12PhysicsTools11Calibration9HistogramIffE13normalizationEv@plt>
>   0x00002aaabc67cecb <+27>:    movss  0xc(%rsp),%xmm1
>   0x00002aaabc67ced1 <+33>:    add    $0x10,%rsp
>   0x00002aaabc67ced5 <+37>:    divss  %xmm0,%xmm1
>   0x00002aaabc67ced9 <+41>:    pop    %rbx
>   0x00002aaabc67ceda <+42>:    movaps %xmm1,%xmm0
>   0x00002aaabc67cedd <+45>:    retq   
> End of assembler dump.
> this = 0x2aab170a9ff0
> hBound = 57.6329994
> lBound = 0
> mode = 1
> 
> Breakpoint 1, PhysicsTools::Calibration::Histogram<float, float>::integral (this=0x2aab170a9ff0, hBound=-2.23135843e-10, lBound=0, mode=1)
>    at /build/davidlt/CMSSW_7_0_0_pre8_jemalloc341/src/CondFormats/PhysicsToolsObjects/interface/Histogram.icc:245
> 245     Value_t Histogram<Value_t, Axis_t>::integral(Axis_t hBound, Axis_t lBound,
> 1: x/i $pc
> => 0x2aaabc67cbdc <PhysicsTools::Calibration::Histogram<float, float>::integral(float, float, int) const>:      push   %r14
> this = 0x2aab170a9ff0
> hBound = -2.23135843e-10
> lBound = 0
> mode = 1
> 
> KA-BOOM! 
> 
> _normalizedIntegral_ calls _integral_ with IDENTICAL arguments, yet by the time we reach the body of _integral_, _hBound_ has changed to a different value.
> 
> We call _integral_ via the PLT, so the dynamic linker (/lib64/ld-linux-x86-64.so.2) first has to resolve the symbol. The value is modified between the two functions, while the symbol is being resolved.
> 
> That happens in _dl_lookup_symbol_x (/lib64/ld-linux-x86-64.so.2): on SLC5 it calls _realloc_, while on SLC6 it calls _malloc_. This change in the dynamic linker/loader is the reason why it works fine under SLC6.
> 
> _hBound_ is stored in $xmm0.v4_float[0]. The clobbering happens in _realloc_ (jemalloc), at this line of src/jemalloc.c:
> 
> 1244     ta->allocated += usize;
> 
> For line 1244 the compiler generates SSE-based code (using $xmm0).
> 
>   0x00002aaaad381666 <+630>:   mov    %r12,0x28(%rsp)
>   0x00002aaaad38166b <+635>:   movq   0x28(%rsp),%xmm0
>   0x00002aaaad381671 <+641>:   movhps 0x20(%rsp),%xmm0
>   0x00002aaaad381676 <+646>:   paddq  (%rax),%xmm0
>   0x00002aaaad38167a <+650>:   movdqa %xmm0,(%rax)
>   0x00002aaaad38167e <+654>:   add    $0x38,%rsp 
> 
> Just these few instructions modify the value of _hBound_.
> 
> Old value = 57.6329994
> New value = 6.72623263e-44
> 0x00002aaaad381671 in realloc (ptr=<optimized out>, size=<optimized out>) at src/jemalloc.c:1244
> 1244	src/jemalloc.c: No such file or directory.
> 1: x/i $pc
> => 0x2aaaad381671 <realloc+641>:	movhps 0x20(%rsp),%xmm0
> Continuing.
> Watchpoint 7: $xmm0.v4_float[0]
> 
> Old value = 6.72623263e-44
> New value = -2.22548424e-10
> 0x00002aaaad38167a in realloc (ptr=<optimized out>, size=<optimized out>) at src/jemalloc.c:1244
> 1244	in src/jemalloc.c
> 1: x/i $pc
> => 0x2aaaad38167a <realloc+650>:	movdqa %xmm0,(%rax)
> Continuing.
> 
> See "Calling conventions for different C++ compilers and operating systems" (I assume it also applies to C, as the two are compatible).
> 
> 64-bit Linux. Callee-saved registers: RBX, RBP, R12-R15. All fine in jemalloc _realloc_:
> 
> Dump of assembler code for function realloc:
>   0x00002aaaad3803f0 <+0>:     push   %r15
>   0x00002aaaad3803f2 <+2>:     push   %r14
>   0x00002aaaad3803f4 <+4>:     push   %r13
>   0x00002aaaad3803f6 <+6>:     push   %r12
>   0x00002aaaad3803f8 <+8>:     push   %rbp
>   0x00002aaaad3803f9 <+9>:     mov    %rsi,%rbp
>   0x00002aaaad3803fc <+12>:    push   %rbx
> 
> But all other registers are scratch registers.
> 
> Also see "System V Application Binary Interface AMD64 Architecture Processor Supplement" (October 7, 2013) [section 3.2.1]:
> 
> Registers %rbp, %rbx and %r12 through %r15 "belong" to the calling function and the called function is required to preserve their values. In other words, a called function must preserve these registers' values for its caller. Remaining registers "belong" to the called function. If a calling function wants to preserve such a register value across a function call, it must save the value in its local stack frame.
> 
> Simply put, according to this, the dynamic linker/loader /lib64/ld-linux-x86-64.so.2 (_dl_lookup_symbol_x) should have protected the xmm0 register value before calling _realloc_.
> 
> You cannot compile jemalloc without SSE:
> 
> include/jemalloc/internal/prof.h:349:40: error: SSE register return with SSE disabled
> 
> If we cannot stop jemalloc from using SSE registers, how can we work around the problem?
> 
> 1240   if (config_stats && ret != NULL) {
> 1241     thread_allocated_t *ta;
> 1242     assert(usize == isalloc(ret, config_prof));
> 1243     ta = thread_allocated_tsd_get();
> 1244     ta->allocated += usize;
> 1245     ta->deallocated += old_size;
> 1246   }
> 
> In _realloc_, line 1244 is wrapped in an if on config_stats. Compiling jemalloc with the --disable-stats option disables statistics collection, which should also slightly improve performance.
> 
> It's a bit worrisome that arguments can change in between function calls.
> 
> david




More information about the jemalloc-discuss mailing list