Optimizing compiler. Interprocedural optimizations

Software & Services Group, Developer Products Division. Copyright © 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.



Interprocedural optimization

How can good programming style be combined with the speed requirements of an application?

Good programming style assumes:
- modularity
- readability and code reuse
- encapsulation of implementation details

Modular source code complicates the task of optimization.

All the optimizations discussed previously work at the procedure level:
- optimizations work effectively with local variables
- every function call is a "black box" with unknown properties
- the properties of procedure parameters are unknown
- the properties of global variables are unknown

To solve these problems, the program has to be analyzed as a whole.


Some basic problems of procedure-level optimizations

1.) Scalar optimizations:
According to the iterative algorithm for data flow analysis,

$$\mathrm{reaches}(b) = \bigcup_{p \in \mathrm{pred}(b)} \Bigl( \mathrm{defsout}(p) \cup \bigl( \mathrm{reaches}(p) \cap \neg\,\mathrm{killed}(p) \bigr) \Bigr)$$

When basic block p calls an unknown function, all local and global variables that the language rules allow this function to change have to be put into killed(p). To perform high-quality optimizations, the compiler needs to know which objects can actually be changed inside the function.

2.) Loop optimizations:
For high-quality loop optimizations the compiler needs to:
- determine which objects cannot address the same memory
- determine the properties of the functions called within loops (they do not change the iteration variables, do not terminate the program, etc.)
- estimate the number of loop iterations

3.) Loop vectorization:
For successful loop vectorization, information on the memory alignment of objects can be very useful.

Obtaining such information requires analyzing the entire program. This interprocedural information improves many classic intraprocedural optimizations.


One-pass and multi-pass compilation

In computer programming, a one-pass compiler is a compiler that processes the source code of each compilation unit only once. It doesn't look back at previously processed code. A multi-pass compiler traverses the source code or its internal representation several times.

To gather information about the properties of a function, the compiler needs to analyze every function and its connections with other functions, because each of them can contain calls to any function, including itself (recursion). This requires analyzing a call graph.

A call graph represents calling relationships between subroutines in a computer program. Each node represents a procedure, and each edge (f, g) indicates that procedure f calls procedure g.

A call graph may be static (calculated at compile time) or dynamic. A static call graph contains all possible ways control can be passed. A dynamic call graph is obtained during program execution and can be useful for performance analysis.

One of the main tasks of interprocedural analysis is constructing the call graph and determining function properties. For example, global data flow analysis is intended to compute the objects that can be modified within a function.

A call graph can be complete or incomplete. If an application uses utilities from an external library, the graph will be incomplete and full analysis cannot be performed.

10/17/10

[Figure: Two-pass and single-pass compilation scheme (-Qipo/-Qip). Boxes in the diagram: source files; FE (C++/C or Fortran); internal representation; profiler; scalar optimizations; loop optimizations; code generation; object files; temporary files or object files with IR; interprocedural optimizations; scalar optimizations; loop optimizations; code generation; executable file or library.]


One-pass and multi-pass compilation actions

When one-pass compilation is used, the compiler performs the following steps: parsing and creation of the internal representation, profile analysis, scalar and loop optimizations, and code generation. An object file is generated for each source file, and a linker builds the application from these files.

With multi-pass compilation, during the first pass the compiler parses the sources, creates the internal representation, performs some scalar optimizations, and saves the internal representation in the object files. These files contain the packed internal representation of the corresponding source files, which makes interprocedural analysis and optimization possible. During this analysis the call graph is built and additional function properties are collected. The next step is the interprocedural optimizations, which work on parts of the call graph. Finally, the compiler performs scalar and loop optimizations, and one or several final object files are generated.


Main compiler options for interprocedural optimizations

/Qipo[n]
enables interprocedural optimization between files. This is also called multifile interprocedural optimization (multifile IPO) or Whole Program Optimization (WPO).

/Qipo-c
generate a multi-file object file (ipo_out.obj)

/Qipo-S
generate a multi-file assembly file (ipo_out.asm)

/Qipo-jobs<n>
specify the number of jobs to be executed simultaneously during the IPO link phase

There is also a partial interprocedural analysis that works at single-file scope. In this case a partial call graph is built, and the interprocedural optimizations are performed according to the information obtained from analyzing this graph.

/Qip[-]
enable (DEFAULT)/disable single-file IP optimization within files


Mod/Ref Analysis

test.c:

    #include <stdio.h>
    extern void unknown(int *a);
    int main()
    {
        int a, b, c;

        a = 5;
        c = a;
        unknown(&a);
        if (a == 5)
            printf("a==5\n");
        b = a;
        printf("%d %d %d\n", a, b, c);
        return(1);
    }

Let's consider a simple example consisting of two files. Function main calls function "unknown", which is located in another file. Local variable "a" is assigned a constant value before this call. If this constant can be propagated across the function call, the following if statement can be simplified and the check can be deleted.

unknown.c:

    #include <stdio.h>
    void unknown(int *a)
    {
        printf("a=%d\n", *a);
    }

Interprocedural analysis collects MOD and REF sets for each routine. The MOD/REF sets contain the objects that can be modified or referenced during the execution of the routine. These sets can be used for scalar optimizations.


We can inspect the generated assembly files to determine whether the check if(a==5) was deleted.

    icl test.c unknown.c -S

In this case the check is still present. Let's inspect the test.asm file:

            call      _unknown                                      ;9.1
    …
    .B1.2:                          ; Preds .B1.8
            mov       edi, DWORD PTR [a.302.0.1]                    ;10.4
            cmp       edi, 5                                        ;10.7
            jne       .B1.4         ; Prob 0%                       ;10.7

    icl -Ob0 test.c unknown.c -Qipo-S

With -Qipo the check was eliminated. -Ob0 is needed to prevent inlining of unknown.

            call      _unknown.                                     ;9.1
    …
    .B1.2:                          ; Preds .B1.7
            push      OFFSET FLAT: ??_C@_05A@a?$DN?$DN5?6?$AA@      ;11.3
            call      _printf                                       ;11.3


Alias analysis

Alias analysis is used to determine whether a storage location may be accessed in more than one way. Two pointers are said to be aliased if they point to the same location.

- explicit aliasing: different objects refer to the same memory according to programming language rules (union in C/C++, EQUIVALENCE in Fortran)
- parameter aliasing: a formal argument can be aliased with another formal argument or with objects from the global scope
- pointer analysis: pointers can be aliased if the sets of objects that can be referenced by these pointers have common elements

Alias analysis is important for finding loop dependences.


Alias analysis example

    #include <stdio.h>
    int p1=1, p2=2;
    int *a, *b;

    void init(int **a, int **b)
    {
        *a = &p1;
        *b = &p1;   // <= a and b point to p1
    }

    int main()
    {
        int i, ar[100];
        init(&a, &b);
        printf("*a= %d *b=%d\n", *a, *b);

        for (i=0; i<100; i++) {
            ar[i] = i*(*a)*(*a);
            *b += 1;   /* *a is changed through *b */
        }
        printf("ar[50]= %d p2=%d\n", ar[50], p2);
    }

A dependence may appear if two pointers (a and b) reference the same memory location. In this case loop optimizations that reorder or hoist these memory accesses are prohibited.


Other aspects of the interprocedural analysis

Interprocedural analysis is used:
- to determine function attributes. For example, attributes such as "no_side_effect" and "always_return" are used to simplify some kinds of analysis and optimization.
- to define attributes of variables. For example, if a variable does not have the attribute "address was taken", then it cannot be updated through pointers, which simplifies many optimizations. Whole program analysis is required to handle global variables.
- for data promotion. Each variable has a scope. IPA allows this scope to be reduced according to the real usage.
- to remove unused global variables.
- to remove dead code. The call graph can contain subgraphs that are not connected to the program entry. Such subgraphs can be safely removed from the final generated executable.
- to provide information about argument alignment. If the actual arguments of a function are always aligned, then vectorization of the procedure can be improved.


Interprocedural optimizations

An interprocedural optimization is a program transformation that involves more than one procedure in a program. In other words, it is an optimization based on the results of interprocedural analysis.

Constant propagation is performed on the basis of the interprocedural value propagation graph. As a result of this optimization, some formal arguments can be replaced with the corresponding constant value.

Simple example: if all calls of function f(x,y,z) pass the same constant value for actual argument x, then formal argument x can be replaced with this constant inside the function body.

Constant result propagation: if a procedure always returns the same constant value, then this value can be propagated to the calling function.


Interprocedural constant propagation example

test.c:

    #include <stdio.h>
    extern void known(int variant, int *var);
    int main()
    {
        int var;
        int ttt;
        var = 2;
        ttt = 3;
        known(var, &ttt);
        printf("ttt=%i\n", ttt);
    }

known.c:

    void known(int var, int *ttt)
    {
        if (var > 0)
            (*ttt)++;
        else
            (*ttt)--;
    }

    icc -Ob0 test.c known.c -fast -ipo-S

    …
    known:
    # parameter 1: %edi
    # parameter 2: %rsi
    ..B2.1:                         # Preds ..B2.0
    ..___tag_value_known.8:                                         #1.30
            addl      $1, (%rsi)                                    #3.3
            ret                                                     #6.1
            .align    16,0x90

IPO constant propagation simplifies the body of routine known: the test of var is gone and only the increment remains.


test2.c:

    #include <stdio.h>

    int fcall(int x)
    {
        if (x > 3)
            printf("x>3");
        else
            printf("x<=3");
        return x+1;
    }

    int main()
    {
        int x, y;
        x = 2;
        y = fcall(x);
        x = 1;
        y = fcall(x);
    }

It is easy to see that in this program the formal argument "x" of function fcall can only take the values 2 or 1. The if condition inside fcall resolves identically for both values. Let's check whether interprocedural optimization performs constant propagation in this case.

    icl test2.c -Ob0 -O3 -Qipo-S

??


Inlining

Inlining, or inline expansion, is a compiler optimization that replaces a function call site with the body of the callee.

Inlining removes the cost of the function call from the execution time, eliminates branches, and keeps the executed code close together in memory. It improves instruction cache performance by improving the locality of reference. Inlining also makes it possible to perform intraprocedural optimizations on the inlined function body. In most cases the larger scope enables better scheduling and register allocation.

The disadvantage of inlining is the increase in application size. Compile time and compiler resource usage also increase as a result.

Inlining heuristics try to choose the best candidates for inlining to get the most performance without exceeding the allowed increase in code size.

A programmer can recommend that a function be inlined by using the inline attribute. For example:

    inline int exforsys(int x1)
    {
        return 5*x1;
    }


test_vec.f90:

          REAL A(100)
          INTEGER I
          DO I = 1,100
            A(I) = I
          END DO

          DO I = 1,100
            CALL AADD(A,I,1.0)
          END DO

          PRINT *, A(100)
          END

          SUBROUTINE AADD(ARRAY,EL,AD)
          REAL :: ARRAY(*)
          INTEGER EL
          REAL AD
          ARRAY(EL)=ARRAY(EL)+AD
          RETURN
          END

Inlining makes it possible to perform intraprocedural optimizations on the inlined function body. Here, inlining subroutine AADD allows the loop containing the call to be vectorized.

    ifort -Ob0 test_vec.f90 -Qvec_report3
    …
    ..\test_vec.f90(10): (col. 2) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

    ifort test_vec.f90 -Qvec_report3
    …
    C:\users\aanufrie\students\ipo\5\test_vec.f90(8): (col. 2) remark: LOOP WAS VECTORIZED.


Inlining directives

#pragma inline[recursive]

#pragma forceinline[recursive]

#pragma noinline

- inline recommends inlining the routine
- forceinline demands that the routine be inlined
- noinline demands that the routine not be inlined
- recursive demands inlining of all routines that are called by the marked call

Fortran directives

cDEC$ ATTRIBUTES INLINE :: procedure

cDEC$ ATTRIBUTES NOINLINE :: procedure

cDEC$ ATTRIBUTES FORCEINLINE :: procedure


Compiler options

/Ob<n>

control inline expansion:

n=0 disable inlining

n=1 inline functions declared with __inline, and perform C++ inlining

n=2 inline any function, at the compiler's discretion

/Qinline-min-size:<n>

set size limit for inlining small routines

/Qinline-min-size-

no size limit for inlining small routines

/Qinline-max-size:<n>

set size limit for inlining large routines

/Qinline-max-size-

no size limit for inlining large routines

/Qinline-max-total-size:<n>

maximum increase in size for inline function expansion

/Qinline-max-total-size-

no size limit for inline function expansion


/Qinline-max-per-routine:<n>

maximum number of inline instances in any function

/Qinline-max-per-routine-

no maximum number of inline instances in any function

/Qinline-max-per-compile:<n>

maximum number of inline instances in the current compilation

/Qinline-max-per-compile-

no maximum number of inline instances in the current compilation

/Qinline-factor:<n>

set inlining upper limits by n percentage

/Qinline-factor-

do not set inlining upper limits

/Qinline-forceinline

treat inline routines as forceinline

/Qinline-dllimport

allow(DEFAULT)/disallow functions declared __declspec(dllimport) to be inlined

/Qinline-calloc directs the compiler to inline calloc() calls as malloc()/memset()


Procedure cloning

Cloning is the specialization of a function for a specific class of call sites.

Sometimes specific characteristics of the dummy arguments make special optimizations of a procedure possible. In this case it is possible to create a specialized procedure and to replace the call of the initial procedure with a call of the new one wherever the actual arguments have these characteristics.

The trivial case is a call of a procedure with a constant argument. For example, if there are several calls of some procedure f of the form f(x,y,TRUE) and several calls of the form f(x,y,FALSE), then it is sometimes profitable to create procedures f_TRUE(x,y) and f_FALSE(x,y) and to replace the initial calls with calls of the new procedures.


Partial inlining

Partial inlining is an efficient form of inlining that inlines only part of the callee function.

Before:

    while (q) {
        process_elem(q);
        q = q->next;
    }

    void process_elem(plist p)
    {
        if (p->type != 2) return;
        …
    }

After partial inlining:

    while (q) {
        if (q->type == 2) process_elem(q);
        q = q->next;
    }

    void process_elem(plist p)
    {
        …
    }


Data transformations

A data transformation is an interprocedural optimization that changes the structure of user data to provide better cache locality during execution.

The following types of data transformation are widely known:
- permutation of structure fields
- structure splitting

Permutation of structure fields can improve cache locality if the fields that are used together during calculation are located close to each other. In this case fewer cache lines are read from memory.

Structure splitting leaves hot (frequently used) fields in the main structure and moves the other fields to a separate "frozen" (cold) section. After this optimization the hot data needs less memory and fits into the cache better.

The compiler needs to prove the correctness of such a transformation. In many cases whole program analysis is needed.


Structure splitting and field reordering example

struct.h:

    #ifndef PERF
    typedef struct {
        double x;
        char title[40];
        double y;
        char title2[22];
        double z;
    } VecR;
    #else
    typedef struct {
        char title[40];
        char title2[22];
    } ColdFields;
    typedef struct {
        double x;
        double y;
        double z;
        ColdFields *cold;
    } VecR;
    #endif

struct.c:

    #include <stdio.h>
    #include <stdlib.h>
    #include "struct.h"
    int main()
    {
        int i, k;
        VecR *array = malloc(10000*sizeof(VecR));
    #ifdef PERF
        for (i=0; i<10000; i++)
            array[i].cold = (ColdFields*)malloc(sizeof(ColdFields));
    #endif
        for (i=0; i<10000; i++) {
            array[i].x = 1.0;
            array[i].y = 2.0;
            array[i].z = 0.0;
        }
        for (k=1; k<10000; k++) {
            for (i=k; i<9999; i++) {
                array[i].x = array[i-1].y+1.0;
                array[i].y = array[i+1].x+array[i+1].y;
                array[i].z = (array[i-1].y - array[i-1].x)/array[i-1].y;
            }
        }

        printf("%f \n", array[100].z);
    #ifdef PERF
        for (i=0; i<10000; i++)
            free(array[i].cold);
    #endif
        free(array);
    }


Result of test execution

    icc struct.c -fast -o a.out
    icc struct.c -fast -DPERF -o b.out
    time ./a.out
    real 0m0.808s
    time ./b.out
    real 0m0.566s


Pointer chasing

Data access through several pointers is one of the most common problems in C++ code. If the program data doesn't fit in the cache subsystem, then every pointer dereference can cause a significant stall in the calculation.

This problem can also be caused by an inappropriate data transformation.

    class family_info   { public: int members; … };

    class personal_info { public: family_info *f; … };

    class Employers     { public: personal_info *p; … };

    all_members += employer->p->f->members;


Devirtualization for C++ virtual methods

C++ is an object-oriented language with a high level of abstraction and the ability to select the class method to execute depending on the type of the object at run time. In this case pointers to the different class methods are located in a special table, and a call of a virtual function is expensive. Sometimes a call through the virtual method table can be replaced with a call of a specific method.

Class hierarchy: A => B => C. All derived classes override virtual int foo().

    int process(class A *a)
    {
        return (a->foo());
    }


Devirtualization example

test.cpp:

    #include <stdio.h>
    class A {
        virtual int foo() { return 1; };
        friend int process(class A *a);
    };
    class B: public A {
        virtual int foo() { return 2; };
        friend int process(class A *a);
    };
    int process(class A *a)
    {
        return(a->foo());
    }
    int main()
    {
        B* pB = new B;
        int result2 = process(pB);
    }

Class A is never instantiated in this source (only an object of class B is created), so it is possible to perform devirtualization.

    icl test.cpp -S

            mov       eax, DWORD PTR [ebx]
            mov       ecx, ebx
            call      DWORD PTR [eax]            ; call through table

    icl test.cpp -Qipo-S -Ob0 -Qipo

            call      ?process.@@YAHPAVA@@@Z


Thank you!
