Information in this document is provided in
connection with Intel products. No license, express or
implied, by estoppel or otherwise, to any intellectual
property rights is granted by this document. Except as
provided in Intel's Terms and Conditions of Sale for such
products, Intel assumes no liability whatsoever, and
Intel disclaims any express or implied warranty, relating
to sale and/or use of Intel products including liability
or warranties relating to fitness for a particular
purpose, merchantability, or infringement of any patent,
copyright or other intellectual property right. Intel
products are not intended for use in medical, life
saving, or life sustaining applications. Intel may make
changes to specifications and product descriptions at any
time, without notice. Copyright (c) Intel Corporation 1997. *Third-party brands and names are the property of their respective owners.
The Intel C/C++ Compiler Plug-in version 2.4 can easily be integrated into the Microsoft Developer Studio environment and allows users to use Pentium® Pro and Pentium® II processor inline assembly instructions that are not currently supported by the latest version of Microsoft Visual C++, version 5.0. The Intel C/C++ Compiler Plug-in is fully compatible with the Microsoft Visual C++ 4.x or later compilers in the following areas: command line switches, inline assembly format, object module, library and DLL formats, and debug and C++ symbol formats.
The Intel C/C++ Compiler Plug-in provides additional optimizations that are not currently available with the Microsoft Visual C++ Compiler. For example, the compiler provides a rounding control option which optimizes floating point to integer conversions. The compiler also supports the Pentium Pro and Pentium II processor specific instructions. This application note discusses these and other features of the Intel C/C++ Compiler Plug-in that are useful in optimizing applications. The first few sections discuss how to install and set up the compiler, and the next sections offer optimization techniques and an analysis of their performance.
The Intel C/C++ Compiler Plug-in installs directly into the Microsoft Visual C++ version 4.x or 5.0 environment. The installation program installs the Intel Compiler Selection Tool and makes the compiler accessible from the Developer Studio tools menu.
To compile a program using the Intel C/C++ Compiler Plug-in, choose the Select Compiler option located in the Tools menu. A window will appear (Figure 2.3) which allows the user to toggle between the Intel C/C++ Compiler Plug-in and the Microsoft Visual C++ Compiler.
Figure 2.3. Select Compiler Window
When switching between the two compilers, select the Rebuild All option under the Build menu to ensure that the application is rebuilt with the newly selected compiler.
Table 3.1 lists some useful optimization switches with a brief description of each. The full listing of the available switches can be found in the "Intel C/C++ Compiler Plug-in User's Guide for Win32 Systems", which comes with the compiler. The following sections of this application note cover how to use some of the optimization switches listed below and the performance gain that can be obtained when using them.
Optimization Switch | Description of When to Use the Optimization Switch
-GB or -G3 | Used by default. Use this compiler switch when the application needs to run on a wide range of Intel processors.
-G4 | Use to optimize code exclusively for the Intel486 processor.
-G5 | Use to optimize code exclusively for the Pentium processor.
-G6 | Use to optimize code exclusively for the Pentium Pro and Pentium II processors.
-Qxi | Allows the use and generation of Pentium Pro specific instructions.
-Qmem | Use for memory optimizations to improve cache accesses and reduce memory accesses.
-Qprec | Use to improve the floating-point precision.
-Qrcd | Use to improve floating-point to integer conversions, by disabling the floating point rounding control.
Table 3.1 Some Useful Optimization Switches
Using the Intel C/C++ Compiler Plug-in allows Pentium Pro and Pentium II processor specific instructions to be used. These instructions (CMOVcc, FCMOVcc, FCOMI, RDPMC, and UD2) are not currently supported within Microsoft Visual C++ version 5.0. The CMOVcc and FCMOVcc instructions are very powerful instructions on the Pentium Pro and Pentium II processors because they can improve the performance of applications which contain a lot of conditional branches. Using the CMOVcc and FCMOVcc instructions decreases the number of branches in the application, which should improve its overall performance.
The CMOVcc and FCMOVcc instructions are conditional move instructions. These instructions check the state of one or more of the status flags and perform a move operation if the flags are in a specified state. Using the CMOVcc instruction is beneficial because it does not require a branch which could be mispredicted. For example:
    CMP EAX,EBX      ;compare and set flags
    CMOVGE EAX,EBX   ;if eax >= ebx then set eax=ebx otherwise no change
The following code would need to be used if the CMOVcc instruction is not supported:
    CMP EAX,EBX      ;compare the value in eax with the value in ebx
    JL NOTGE         ;if !(eax >= ebx) jump over move instruction
    MOV EAX,EBX      ;set eax = ebx because eax >= ebx
NOTGE:
    ...              ;additional code
Code Example 3.2. Comparing code for CMOVcc
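The same comparison can be expressed in C. The sketch below is illustrative only: the function names are hypothetical, and whether a given compiler turns either form into a CMOVcc instruction depends on the compiler and the switches used. The first function mirrors the CMP/JL/MOV sequence, and the second computes the same result without a branch by building a mask from the comparison.

```c
/* Branching form: mirrors the CMP / JL / MOV sequence.
   Hypothetical name, for illustration only. */
int select_branch(int a, int b)
{
    if (a >= b)
        a = b;      /* taken only when a >= b */
    return a;
}

/* Branch-free form: same result as CMP / CMOVGE.
   The mask is all ones when a >= b and zero otherwise. */
int select_branchless(int a, int b)
{
    int mask = -(a >= b);
    return (b & mask) | (a & ~mask);
}
```

Both functions return b when a >= b and a otherwise, so they can be substituted for each other when checking the result of the assembly sequences above.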
The following sections discuss how the Intel C/C++ Compiler Plug-in supports the Pentium Pro and Pentium II processor specific instructions. Examples are given using the CMOVcc instruction, but the examples can easily be applied to the other specific instructions.
The Intel C/C++ Compiler Plug-in allows Pentium Pro and Pentium II processor specific instructions to be used as inline assembly instructions. This is not currently supported by the Microsoft Visual C++ version 4.x or 5.0 compilers. The Pentium Pro and Pentium II processor specific instructions can be used as inline assembly instructions simply by using the _asm directive (Code Example 3.2.1).
int func(int x, int y)
{
    int I;
    _asm {
        mov eax,x
        mov ebx,y
        cmp eax,ebx      ;set the flags that cmovge tests
        cmovge eax,ebx   ;if x >= y then eax = y
        mov I,eax
    }
    return I;
}
Code Example 3.2.1. Using CMOVcc with _asm Directive
The Intel C/C++ Compiler Plug-in not only allows the user to use the Pentium Pro and Pentium II processor specific instructions as inline assembly instructions, but also generates these instructions itself to optimize C code. This optimization occurs when certain compiler switches are set. These switches notify the compiler that it should generate assembly code specifically for the Pentium Pro and Pentium II processors. The compiler switches that need to be set in order to generate Pentium Pro and Pentium II processor specific assembly instructions are -G6 and -Qxi. These switches can be set by selecting Project from the menu and choosing the Settings option. Select the C/C++ tab and type the specified settings into the project options box. Figure 3.2.2 shows the Project Settings dialog box with the specified project settings.
Figure 3.2.2 Project Settings using the Pentium Pro Processor's Optimization Switches
The Maximize Speed optimization option must also be set to produce the fastest code.
To view the assembly output generated by the compiler, select Project from the menu and select the Settings option. A project settings dialog box will appear. Select the C/C++ tab. Under the category drop down menu select Listing Files and specify the Listing File Type and the Listing File Name. Figure 3.2.3 shows the project settings dialog box with a listing file type of Assembly-Only Listing and an output file location to place the assembly code listing.
Figure 3.2.3 Project Settings for Creating an Assembly Listing File
To show the benefit of using the -G6 and -Qxi Pentium Pro processor optimizations, a simple example is provided. The next sections discuss the assembly code generated by both the Intel C/C++ Compiler Plug-in and the Microsoft Visual C++ 5.0 Compiler. The performance analysis is based on the assembly code listing produced by each compiler and on cycle counts measured with the RDTSC and CPUID instructions. The RDTSC instruction reads the current cycle count, and the CPUID instruction is used to serialize instruction execution so that earlier instructions complete before the count is read. The CPUID instruction is necessary because both the Pentium Pro and Pentium II processors execute instructions out of order. The code used to demonstrate the generation of the CMOVcc instruction is given below:
int time_left[32];

//Loops 32 times storing the time left. The algorithm used is as follows
//If(time_to_waste < t)
//then set time_left = time_left - time_to_waste
//else set time_left = time_left - t
void timeloop(int time_to_waste)
{
    for(int i=0; i<32; i++)
    {
        int t = time_left[i];
        time_left[i] -= time_to_waste < t ? time_to_waste : t;
    }
}
Code Example 3.2.3. Generation of CMOVcc
Code listing 3.2.3.1 describes the assembly code generated by the Microsoft Visual C++ Compiler version 5.0.
    push esi                                   ;Store the current value of esi
    mov esi, DWORD PTR _time_to_waste$[esp]    ;esi = time_to_waste
    mov ecx, OFFSET FLAT:?time_left@@3PAHA     ;ecx = time_left
$L169:
    mov eax, DWORD PTR [ecx]                   ;eax = t = time_left[i]
    cmp esi, eax                               ;compare time_to_waste to t
    mov edx, esi                               ;edx = time_to_waste
    jl SHORT $L180                             ;if time_to_waste < t jump
    mov edx, eax                               ;if time_to_waste !< t set edx = t
$L180:
    sub eax, edx                               ;eax = time_left[i]-(either t or time_to_waste)
    mov DWORD PTR [ecx], eax                   ;store value in array time_left[i]
    add ecx, 4                                 ;increment to next array value
    ;Have we reached the end of the array
    cmp ecx, OFFSET FLAT:?time_left@@3PAHA+128
    jl SHORT $L169                             ;keep looping until the end of the array is reached
    pop esi                                    ;restore esi value
    ret 0                                      ;return
Code Example 3.2.3.1 Assembly Code Generated from the Visual C++ Compiler
Code listing 3.2.3.2 describes the assembly code generated by the Intel C/C++ Compiler Plug-in. Notice the use of the cmovle instruction.
    push ebx                                   ;store the current value of ebx
    mov ecx, DWORD PTR [esp+8]                 ;ecx = time_to_waste
    mov edx, -128                              ;edx contains value of i for array index
_B1_3:
    mov eax, DWORD PTR time_left[edx+128]      ;eax = t = time_left[i]
    cmp eax, ecx                               ;compare t with time_to_waste to set flags
    mov ebx, ecx                               ;ebx = time_to_waste
    cmovle ebx, eax                            ;if(t <= time_to_waste) set ebx = t
    sub eax, ebx                               ;time_left[i] - (either t or time_to_waste)
    mov DWORD PTR time_left[edx+128], eax      ;write the result to the array
    add edx, 4                                 ;increment to the next array value
    jnz _B1_3                                  ;keep looping until the entire array is traversed
    pop ebx                                    ;restore the value of ebx
    ret
Code Example 3.2.3.2 Assembly Code Generated from the Intel C/C++ Compiler Plug-in
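Besides the conditional move, the Intel listing saves a compare instruction by counting a negative offset up toward zero, so the ADD that advances the offset also sets the flag tested by JNZ. The sketch below expresses that loop shape in C next to the straightforward loop. It is an illustration under assumptions: the function names and the array-pointer parameter are hypothetical, and a compiler is free to generate different code for either form.

```c
#define N 32

/* Straightforward loop, equivalent to timeloop() but taking the
   array as a parameter (an assumption made for this sketch). */
void timeloop_plain(int *time_left, int time_to_waste)
{
    for (int i = 0; i < N; i++) {
        int t = time_left[i];
        time_left[i] -= time_to_waste < t ? time_to_waste : t;
    }
}

/* Same work, indexed from -N up to 0 relative to the end of the
   array, so the loop ends when the index reaches zero. */
void timeloop_negidx(int *time_left, int time_to_waste)
{
    int *end = time_left + N;   /* one past the last element */
    for (int i = -N; i != 0; i++) {
        int t = end[i];
        end[i] -= time_to_waste < t ? time_to_waste : t;
    }
}
```

Both functions leave the array in the same state; the second only changes how the loop counter is tested.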
The performance analysis of the assembly instructions generated by each compiler is provided in Table 3.2.4. The cycle counts were obtained by using the RDTSC and CPUID instructions on a 266 MHz Pentium II processor.
COMPILER USED | NUMBER OF ASSEMBLY LANGUAGE INSTRUCTIONS | TOTAL NUMBER OF CYCLES
Microsoft Visual C++ 5.0 Compiler | 15 Instructions | 266 Cycles
Intel C/C++ Compiler Plug-in | 13 Instructions | 247 Cycles
Table 3.2.4. Generation of CMOVcc Performance Comparison
The improvement is only 7% for this simple example, but if an application contains a large number of jumps and branches, this optimization could significantly improve the overall performance of the application.
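The measurement pattern behind these cycle counts is to time an empty measurement first (the base) and subtract it from the timed run. The sketch below shows that pattern in portable C, with the standard clock() function standing in for RDTSC as an assumption; clock() is far coarser than a cycle counter, so the function names and the repeat count are hypothetical and the numbers it returns are not comparable to the table above.

```c
#include <time.h>

int time_left[32];

void timeloop(int time_to_waste)
{
    for (int i = 0; i < 32; i++) {
        int t = time_left[i];
        time_left[i] -= time_to_waste < t ? time_to_waste : t;
    }
}

/* Cost of two back-to-back clock() reads: the measurement overhead. */
long measure_base(void)
{
    clock_t start = clock();
    clock_t stop = clock();
    return (long)(stop - start);
}

/* Clock ticks spent in `repeats` calls to timeloop(), with the
   measurement overhead subtracted and clamped at zero. */
long measure_timeloop(int time_to_waste, int repeats)
{
    long base = measure_base();
    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        timeloop(time_to_waste);
    clock_t stop = clock();
    long elapsed = (long)(stop - start) - base;
    return elapsed < 0 ? 0 : elapsed;
}
```

A call such as measure_timeloop(32, 1) corresponds to the single timeloop(32) call timed in this application note; a larger repeat count is usually needed before clock() registers any ticks at all.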
The Intel C/C++ Compiler Plug-in provides an optimization switch to improve the performance of floating point to integer conversions. Graphics applications which use floating point data as input into their rendering operations can benefit from this type of optimization. The rendering operations usually take floating point data as inputs and a conversion then needs to be made from floating point to integer. Any speed up in the conversion provides a benefit to the application.
The compiler switch that improves the floating point to integer conversions is the rounding control option, -Qrcd. The switch optimizes the conversion by eliminating the changes in rounding mode that generally take place during floating point to integer conversions. In the C language, floating point values must be truncated when they are converted to integers. The default rounding mode for the system is round-to-nearest. Therefore, in order to truncate a floating point value, the rounding mode must be switched, and it must then be switched back to the default, round-to-nearest, after the truncation takes place. Switching rounding modes adds overhead to each conversion. By using the rounding control option on the Intel C/C++ Compiler Plug-in, the overhead associated with changing rounding modes is eliminated. The -Qrcd option does not affect floating point calculations. However, since the rounding mode changes are eliminated, the integer conversions do not conform to the C semantics.
To use the floating point rounding control optimization, the -Qrcd switch must be set in the project settings. Select Project from the menu and choose Settings. When the dialog box appears, select the C/C++ tab and type the specified settings into the project options box. An example is provided below using the -Qrcd setting as a project option.
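The difference between the two conversion behaviors can be seen with a small portable sketch. The first function is the standard C cast, which truncates toward zero. The second emulates in plain C what a conversion gives when the rounding mode is left at the default round-to-nearest (ties go to the even integer), which is the assumption made here about conversions compiled with -Qrcd. Both function names are hypothetical, and the emulation assumes the value fits comfortably in an int.

```c
/* Standard C conversion: truncates toward zero. */
int to_int_truncate(float x)
{
    return (int)x;
}

/* Emulation of a conversion performed under round-to-nearest,
   ties to even. Written in plain C for illustration; it does not
   change the processor rounding mode. */
int to_int_nearest(float x)
{
    int t = (int)x;               /* truncated value */
    float frac = x - (float)t;    /* fractional part, same sign as x */

    if (frac > 0.5f || (frac == 0.5f && (t & 1)))
        return t + 1;
    if (frac < -0.5f || (frac == -0.5f && (t & 1)))
        return t - 1;
    return t;
}
```

For a value such as 1.4 both functions return 1, so the example used in this application note gives the same answer either way. For 1.6 the cast returns 1 while round-to-nearest returns 2, which is the kind of difference to check for before relying on this optimization.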
Code Example 3.3.1 Project Options using the Rounding Control Option
To show the benefit of using the -Qrcd rounding control option for floating point conversions, a simple example is provided. The next sections discuss the assembly code generated by both the Intel C/C++ Compiler Plug-in and the Microsoft Visual C++ 5.0 Compiler.
int a = 5;
float b = 1.4;

void foo()
{
    a = b; //floating point to integer conversion
}
Code Example 3.3.2. Floating Point to Integer Conversion
Code example 3.3.3 describes the assembly code generated by the Visual C++ 5.0 Compiler.
?a@@3HA DD 01H DUP (?)           ; a
?b@@3MA DD 01H DUP (?)           ; b

    fld DWORD PTR ?b@@3MA        ; loads the floating point value b
    call __ftol                  ; calls function to convert value to integer
    mov DWORD PTR ?a@@3HA, eax   ; sets a = b
Code Example 3.3.3 Assembly Code Generated from the Visual C++ Compiler
Code example 3.3.3.1 describes the assembly code generated by the Intel C/C++ Compiler Plug-in using the -Qrcd optimization option.
    fld DWORD PTR ?b@@3MA        ;loads the floating point value b
    fistp QWORD PTR [esp+8]      ;converts value to an integer
    mov eax, DWORD PTR [esp+8]   ;stores value to eax
    mov DWORD PTR ?a@@3HA, eax   ;sets a = b
Code Example 3.3.3.1 Assembly Code Generated from the Intel C/C++ Compiler Plug-in
The performance analysis of the assembly instructions generated by each compiler is provided in Table 3.3.4. The cycle counts were obtained by using the RDTSC and CPUID instructions on a 266 MHz Pentium II processor.
COMPILER USED | NUMBER OF ASSEMBLY LANGUAGE INSTRUCTIONS GENERATED FOR THE INTEGER CONVERSION | TOTAL NUMBER OF CYCLES FOR THE CONVERSION
Microsoft Visual C++ 5.0 Compiler | 3 Instructions | 135 Cycles
Intel C/C++ Compiler Plug-in | 4 Instructions | 25 Cycles
Table 3.3.4 Integer Conversion Performance Comparison
The floating point to integer conversion takes 81% fewer cycles with the Intel C/C++ Compiler Plug-in. This could be a substantial improvement to an overall application if these conversions occur frequently throughout the application.
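The percentages quoted in this application note follow directly from the cycle counts in the two performance tables. The helper below is a small sketch of that arithmetic; the function name is hypothetical, and the result is truncated to a whole percent.

```c
/* Reduction in cycle count as a whole percentage of the original
   count, truncated toward zero. Hypothetical helper for illustration. */
int percent_improvement(int cycles_before, int cycles_after)
{
    return (cycles_before - cycles_after) * 100 / cycles_before;
}
```

With the measured values, percent_improvement(266, 247) gives 7 for the CMOVcc example and percent_improvement(135, 25) gives 81 for the integer conversion example.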
Additional information regarding the features of the Intel C/C++ Compiler Plug-in can be found in the "Intel C/C++ Compiler Plug-in User's Guide for Win32 Systems". This document is provided with the compiler. Additional information is also available on the following web site: http:\\support.intel.com/oem_developer/msl/ic
#include <stdio.h>
#include <stdlib.h>

#define CPUID _asm _emit 0fh _asm _emit 0a2h
#define RDTSC _asm _emit 0fh _asm _emit 031h

int time_left[32];
int cyc;
int base;

void timeloop(int time_to_waste)
{
    for(int i=0; i<32; i++)
    {
        int t = time_left[i];
        time_left[i] -= time_to_waste < t ? time_to_waste: t;
    }
}

void main()
{
    //Base is used to calculate the time it takes to execute the CPUID and RDTSC inst.
    base = 0;
    //Cyc will contain the amount of cycles taken to execute the timeloop function
    cyc = 0;

    _asm //computes the base time of the RDTSC and CPUID calls
    {
        CPUID
        RDTSC
        mov cyc,eax
        CPUID
        RDTSC
        sub eax, cyc
        mov base,eax
    }

    cyc = 0; //initializes the cycles to zero
    _asm
    {
        CPUID //computes the starting time
        RDTSC
        mov cyc,eax
    }

    timeloop(32); //calls the timeloop function

    _asm
    {
        CPUID //computes the ending time and total cycles
        RDTSC
        sub eax, cyc
        mov cyc, eax
    }

    //prints the number of cycles the timeloop function took
    printf("Base: %d\n",base);
    printf("Number of cycles: %d\n",(cyc-base));
}
#include <stdio.h>
#include <stdlib.h>

#define CPUID _asm _emit 0fh _asm _emit 0a2h
#define RDTSC _asm _emit 0fh _asm _emit 031h

int cyc;
int base;
int a = 5;     //initialize the integer value
float b = 1.4; //initialize the floating point value

void foo()
{
    a = b; //floating point to integer conversion
}

void main()
{
    //Base is used to calculate the time it takes to execute the CPUID and RDTSC inst.
    base = 0;
    //Cyc will contain the amount of cycles taken to execute the foo function
    cyc = 0;

    _asm //computes the base time of the RDTSC and CPUID calls
    {
        CPUID
        RDTSC
        mov cyc,eax
        CPUID
        RDTSC
        sub eax, cyc
        mov base,eax
    }

    cyc = 0; //initializes the cycles to zero
    _asm
    {
        CPUID //computes the starting time
        RDTSC
        mov cyc,eax
    }

    foo(); //call the foo function

    _asm
    {
        CPUID //computes the ending time and total cycles
        RDTSC
        sub eax, cyc
        mov cyc, eax
    }

    //prints the number of cycles for the float to integer conversion
    printf("Base: %d\n",base);
    printf("Number of cycles: %d\n",(cyc-base));
}
© 1998 Intel Corporation