How can I get an intrinsic for the exp() function in x64 code?

I have the following code and am expecting the intrinsic version of the exp() function to be used. Unfortunately, it is not in an x64 build, making it slower than a similar Win32 (i.e., 32-bit build):

#include "stdafx.h"
#include <cmath>
#include <intrin.h>
#include <iostream>

int main()
{
  const int NUM_ITERATIONS=10000000;
  double expNum=0.00001;
  double result=0.0;

  for (double i=0;i<NUM_ITERATIONS;++i)
  {
    result+=exp(expNum); // <-- The code of interest is here
    expNum+=0.00001;
  }

  // To prevent the above from getting optimized out...
  std::cout << result << '\n';
}

I am using the following switches for my build:

/Zi /nologo /W3 /WX-
/Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" 
/D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- 
/EHsc /GS /Gy /arch:SSE2 /fp:fast /Zc:wchar_t /Zc:forScope 
/Yu"StdAfx.h" /Fp"x64\Release\exp.pch" /FAcs /Fa"x64\Release\" 
/Fo"x64\Release\" /Fd"x64\Release\vc100.pdb" /Gd /errorReport:queue 

As you can see, I do have /Oi, /O2 and /fp:fast as required per the MSDN article on intrinsics. Yet, despite my efforts a call to the standard library is made, making exp() perform slower on x64 builds.

Here is the generated assembly:

  for (double i=0;i<NUM_ITERATIONS;++i)
000000013F911030  movsd      xmm10,mmword ptr [__real@3ff0000000000000 (13F912248h)]  
000000013F911039  movapd     xmm8,xmm6  
000000013F91103E  movapd     xmm7,xmm9  
000000013F911043  movaps     xmmword ptr [rsp+20h],xmm11  
000000013F911049  movsd      xmm11,mmword ptr [__real@416312d000000000 (13F912240h)]  
  {
    result+=exp(expNum);
000000013F911052  movapd     xmm0,xmm7  
000000013F911056  call       exp (13F911A98h) // ***** exp lib call is here *****
000000013F91105B  addsd      xmm8,xmm10  
    expNum+=0.00001;
000000013F911060  addsd      xmm7,xmm9  
000000013F911065  comisd     xmm8,xmm11  
000000013F91106A  addsd      xmm6,xmm0  
000000013F91106E  jb         main+52h (13F911052h)  
  }

As you can see in the assembly above, there is a call out to the exp() function. Now, let's look at the code generated for that for loop with a 32-bit build:

  for (double i=0;i<NUM_ITERATIONS;++i)
00101031  xorps       xmm1,xmm1  
00101034  rdtsc  
00101036  push        ebx  
00101037  push        esi  
00101038  movsd       mmword ptr [esp+1Ch],xmm0  
0010103E  movsd       xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]  
00101046  push        edi  
00101047  mov         ebx,eax  
00101049  mov         dword ptr [esp+3Ch],edx  
0010104D  movsd       mmword ptr [esp+28h],xmm0  
00101053  movsd       mmword ptr [esp+30h],xmm1  
00101059  lea         esp,[esp]  
  {
    result+=exp(expNum);
00101060  call        __libm_sse2_exp (101EC0h) // <--- Quite different from 64-bit
00101065  addsd       xmm0,mmword ptr [esp+20h]  
0010106B  movsd       xmm1,mmword ptr [esp+30h]  
00101071  addsd       xmm1,mmword ptr [__real@3ff0000000000000 (102180h)]  
00101079  movsd       xmm2,mmword ptr [__real@416312d000000000 (102178h)]  
00101081  comisd      xmm2,xmm1  
00101085  movsd       mmword ptr [esp+20h],xmm0  
    expNum+=0.00001;
0010108B  movsd       xmm0,mmword ptr [esp+28h]  
00101091  addsd       xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]  
00101099  movsd       mmword ptr [esp+28h],xmm0  
0010109F  movsd       mmword ptr [esp+30h],xmm1  
001010A5  ja          wmain+40h (101060h)  
  }

Much more code there, yet it's faster. A timing test I did on a 3.3 GHz Nehalem-EP host produced the following results:

32-bit:

For loop body average exec time: 34.849229 cycles / 10.560373 ns

64-bit:

For loop body average exec time: 45.845323 cycles / 13.892522 ns

Very odd behavior, indeed. Why is it happening?

Update:

I have created a Microsoft Connect bug report. Feel free to upvote it to get an authoritative answer from Microsoft itself on the use of floating point intrinsics, especially in x64 code.

Answers


On x64, floating point arithmetic is performed using SSE. This does not have a built-in operation for exp() and so a call to the standard library is inevitable. I imagine that the MSDN article you are referring to was written with 32 bit code that uses 8087 FP in mind.


Need Your Help

Site rendering incorrectly when uploaded to FTP

html css ftp

I have a strange problem, that when I upload my website to server it gets rendered all incorrectly - positioning, script etc. problems. The same site works perfectly on the same browser when run lo...

NullPointerException with Fragment Interface Listener

android interface nullpointerexception fragment

I'm sorry, I'm sure that this will be simple but I just cannot see where I'm going wrong here.

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.