How can I get an intrinsic for the exp() function in x64 code?

I have the following code and am expecting the intrinsic version of the exp() function to be used. Unfortunately, it is not in an x64 build, making it slower than a similar Win32 (i.e., 32-bit build):

#include "stdafx.h"
#include <cmath>
#include <intrin.h>
#include <iostream>

int main()
{
  const int NUM_ITERATIONS=10000000;
  double expNum=0.00001;
  double result=0.0;

  for (double i=0;i<NUM_ITERATIONS;++i)
  {
    result+=exp(expNum); // <-- The code of interest is here
    expNum+=0.00001;
  }

  // To prevent the above from getting optimized out...
  std::cout << result << '\n';
}

I am using the following switches for my build:

/Zi /nologo /W3 /WX-
/Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" 
/D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- 
/EHsc /GS /Gy /arch:SSE2 /fp:fast /Zc:wchar_t /Zc:forScope 
/Yu"StdAfx.h" /Fp"x64\Release\exp.pch" /FAcs /Fa"x64\Release\" 
/Fo"x64\Release\" /Fd"x64\Release\vc100.pdb" /Gd /errorReport:queue 

As you can see, I do have /Oi, /O2 and /fp:fast as required per the MSDN article on intrinsics. Yet, despite my efforts a call to the standard library is made, making exp() perform slower on x64 builds.

Here is the generated assembly:

  for (double i=0;i<NUM_ITERATIONS;++i)
000000013F911030  movsd      xmm10,mmword ptr [__real@3ff0000000000000 (13F912248h)]  
000000013F911039  movapd     xmm8,xmm6  
000000013F91103E  movapd     xmm7,xmm9  
000000013F911043  movaps     xmmword ptr [rsp+20h],xmm11  
000000013F911049  movsd      xmm11,mmword ptr [__real@416312d000000000 (13F912240h)]  
  {
    result+=exp(expNum);
000000013F911052  movapd     xmm0,xmm7  
000000013F911056  call       exp (13F911A98h) // ***** exp lib call is here *****
000000013F91105B  addsd      xmm8,xmm10  
    expNum+=0.00001;
000000013F911060  addsd      xmm7,xmm9  
000000013F911065  comisd     xmm8,xmm11  
000000013F91106A  addsd      xmm6,xmm0  
000000013F91106E  jb         main+52h (13F911052h)  
  }

As you can see in the assembly above, there is a call out to the exp() function. Now, let's look at the code generated for that for loop with a 32-bit build:

  for (double i=0;i<NUM_ITERATIONS;++i)
00101031  xorps       xmm1,xmm1  
00101034  rdtsc  
00101036  push        ebx  
00101037  push        esi  
00101038  movsd       mmword ptr [esp+1Ch],xmm0  
0010103E  movsd       xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]  
00101046  push        edi  
00101047  mov         ebx,eax  
00101049  mov         dword ptr [esp+3Ch],edx  
0010104D  movsd       mmword ptr [esp+28h],xmm0  
00101053  movsd       mmword ptr [esp+30h],xmm1  
00101059  lea         esp,[esp]  
  {
    result+=exp(expNum);
00101060  call        __libm_sse2_exp (101EC0h) // <--- Quite different from 64-bit
00101065  addsd       xmm0,mmword ptr [esp+20h]  
0010106B  movsd       xmm1,mmword ptr [esp+30h]  
00101071  addsd       xmm1,mmword ptr [__real@3ff0000000000000 (102180h)]  
00101079  movsd       xmm2,mmword ptr [__real@416312d000000000 (102178h)]  
00101081  comisd      xmm2,xmm1  
00101085  movsd       mmword ptr [esp+20h],xmm0  
    expNum+=0.00001;
0010108B  movsd       xmm0,mmword ptr [esp+28h]  
00101091  addsd       xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]  
00101099  movsd       mmword ptr [esp+28h],xmm0  
0010109F  movsd       mmword ptr [esp+30h],xmm1  
001010A5  ja          wmain+40h (101060h)  
  }

Much more code there, yet it's faster. A timing test I did on a 3.3 GHz Nehalem-EP host produced the following results:

32-bit:

For loop body average exec time: 34.849229 cycles / 10.560373 ns

64-bit:

For loop body average exec time: 45.845323 cycles / 13.892522 ns

Very odd behavior, indeed. Why is it happening?

Update:

I have created a Microsoft Connect bug report. Feel free to upvote it to get an authoritative answer from Microsoft itself on the use of floating point intrinsics, especially in x64 code.

Answers


On x64, floating point arithmetic is performed using SSE. This does not have a built-in operation for exp() and so a call to the standard library is inevitable unless you write your own inline manually-vectorized __m128d exp(__m128d) (Fastest Implementation of Exponential Function Using SSE).

I imagine that the MSDN article you are referring to was written with 32 bit code that uses 8087 FP in mind.


I think the only reason that Microsoft provides an intrinsic version of 32-bit SSE2 exp() is the standard calling conventions. The 32-bit calling conventions require the operand to be pushed on the main stack, and the result to be returned in the top register of the FPU stack. If you have SSE2 code generation enabled, then the return value is likely to be popped from the FPU stack into memory, then loaded from that location into an SSE2 register for whatever maths you want to do on the result. Clearly, it is faster to pass the operand in an SSE2 register and return the result in an SSE2 register. This is what __libm_sse2_exp() does. In 64-bit code, the standard calling convention passes the operand and returns the result in SSE2 registers anyway, so there is no advantage in having an intrinsic version.

The reason for the performance difference between 32-bit SSE2 and 64-bit implementations of exp() is that Microsoft uses different algorithms in the two implementations. I've no idea why they do this, and they produce different results (different by 1ulp) for some operands.


EDIT I'd like to add to this discussion the link to AMD's x64 instruction set manuals and Intel's reference.

At an initial inspection, there should be a way to use F2XM1 to compute the exponential. However, it's in the x87 instruction set, hidden in x64 mode.

There's hope in using MMX/x87 explicitly, as described in a posting on VirtualDub discussion boards. And, this is how to actually write asm in VC++.


Need Your Help

Using multiple markers in Log4j 2 to filter logging

java logging log4j2

I'm trying to set up my log4j logger to use multiple Marker's when filtering what to log. I'm not entirely sure if I'm even on the right track, but basically I want is to be able to log a number of

Looking for examples of "real" uses of continuations

ruby language-agnostic scheme smalltalk continuations

I'm trying to grasp the concept of continuations and I found several small teaching examples like this one from the Wikipedia article: