Compiler Optimization Techniques (CP 7031), Dr. K. Thirunadana Sikamani


DESCRIPTION

Instruction Level Parallelism, from the Compiler Optimization Techniques course (CP 7031) at Anna University, by K. Thirunadana Sikamani. Prepared from Compilers by Aho and Ullman.

TRANSCRIPT

Page 1:

Compiler Optimization Techniques

CP 7031

Dr. K. Thirunadana Sikamani

Page 2:

Principal Sources of Optimization

The elimination of unnecessary instructions in object code, or the replacement of one sequence of instructions by a faster sequence of instructions that does the same thing, is usually called "code improvement" or "code optimization".

Redundancy
Semantics-preserving transformations
Global common subexpressions
Copy propagation
Dead-code elimination
Code motion

Page 3:

The speed of a program run on a processor with instruction-level parallelism depends on:

1. The potential parallelism in the program.
2. The available parallelism on the processor.
3. Our ability to extract parallelism from the original sequential program.
4. Our ability to find the best parallel schedule given scheduling constraints.

Page 4:

Processor Architecture

Page 5:

1. Instruction pipelines and branch delays
2. Pipelined execution
3. Multiple instruction issue - VLIW (Very Long Instruction Word)

Page 6:

Code Scheduling Constraints

Page 7:

1. Control-dependence constraints
2. Data-dependence constraints
3. Resource constraints

Page 8:

Control-Dependence Constraints

All the operations executed in the original program must be executed in the optimized one.

Page 9:

Data-Dependence Constraints

The operations in the optimized program must produce the same results as the corresponding ones in the original program.

Page 10:

Resource Constraints

The schedule must not oversubscribe the resources of the machine.

Page 11:

Data Dependence

True dependence - read after write (RAW)
Antidependence - write after read (WAR)
Output dependence - write after write (WAW)

Page 12:

Classify the dependences among the following statements:

1. a = b
2. c = d
3. b = c
4. d = a
5. c = d
6. a = b

Check the dependences for the following pairs: 1 and 4 (true dependence on a), 3 and 5 (antidependence on c), 1 and 6 (output dependence on a).
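The same exercise can be written as compilable C; this is a minimal sketch (the initial values are arbitrary) with the slide's three dependence pairs noted in comments.

#include <stdio.h>

int main(void) {
    int a = 0, b = 1, c = 2, d = 3;

    a = b;   /* s1: writes a, reads b                                */
    c = d;   /* s2: writes c, reads d                                */
    b = c;   /* s3: writes b after s1 read it                        */
    d = a;   /* s4: true dependence on s1 (s1 writes a, s4 reads a)  */
    c = d;   /* s5: antidependence with s3 (s3 reads c, s5 writes c) */
    a = b;   /* s6: output dependence with s1 (both write a)         */

    printf("%d %d %d %d\n", a, b, c, d);
    return 0;
}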

Page 13:

Give register-level machine code providing maximum parallelism for the expression ((u+v) + (w+x)) + (y+z); also give the solution with minimal register usage.

Minimal register usage (sequential):

LD r1, u
LD r2, v
ADD r1, r1, r2
LD r2, w
LD r3, x
ADD r2, r2, r3
ADD r1, r1, r2
LD r2, y
LD r3, z
ADD r2, r2, r3
ADD r1, r1, r2

Maximum parallelism, implemented in 4 clocks:

Clock 1: LD r1, u; LD r2, v; LD r3, w; LD r4, x; LD r5, y; LD r6, z
Clock 2: ADD r1, r1, r2; ADD r3, r3, r4; ADD r5, r5, r6
Clock 3: ADD r1, r1, r3
Clock 4: ADD r1, r1, r5

Page 14:

Finding Dependences among Memory Accesses

1. Array data-dependence analysis, e.g.:

for (i = 0; i < n; i++)
    A[2*i] = A[2*i + 1];

2. Pointer-alias analysis: two pointers are aliased if they refer to the same object.

3. Interprocedural analysis: determines whether the same variable is passed as two or more different arguments in a language that passes parameters by reference.
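As a compilable sketch of the contrast (the array name and size are illustrative): the first loop below is the slide's example, whose writes touch only even indices and whose reads touch only odd indices, so its iterations are independent; the second loop carries a true dependence from each iteration to the next, so it cannot run in parallel.

#include <stdio.h>
#define N 8

int main(void) {
    int A[2 * N + 2];
    for (int i = 0; i < 2 * N + 2; i++) A[i] = i;

    /* independent: A[2i] (even) never overlaps A[2i+1] (odd) */
    for (int i = 0; i < N; i++)
        A[2 * i] = A[2 * i + 1];

    /* loop-carried true dependence: iteration i reads what i-1 wrote */
    for (int i = 0; i < N; i++)
        A[i + 1] = A[i] + 1;

    for (int i = 0; i < 2 * N + 2; i++) printf("%d ", A[i]);
    printf("\n");
    return 0;
}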

Page 15:

Tradeoff between Register Usage and Parallelism

E.g., machine-independent intermediate-representation code:

LD t1, a
ST b, t1
LD t2, c
ST d, t2

The code above copies the values of a and c to b and d. If all the memory locations are distinct, the two copies can proceed in parallel. If, instead, t1 and t2 are assigned the same register to minimize register usage, the copies must be serialized.

Page 16:

Tradeoff between Register Usage and Parallelism

[Figure: syntax tree for (a+b) + c + (d+e)]

Machine code (minimal registers):

LD r1, a
LD r2, b
ADD r1, r1, r2
LD r2, c
ADD r1, r1, r2
LD r2, d
LD r3, e
ADD r2, r2, r3
ADD r1, r1, r2

Parallel evaluation of the expression:

Step 1: r1 = a; r2 = b; r3 = c; r4 = d; r5 = e
Step 2: r6 = r1 + r2; r7 = r4 + r5
Step 3: r8 = r6 + r3
Step 4: r9 = r8 + r7

Page 17:

Phase Ordering between Register Allocation and Code Scheduling

If registers are allocated before scheduling, the resulting code tends to have many storage dependences that limit code scheduling.

The other way around, the schedule created may require so many registers that register spilling becomes necessary.

Spilling - storing the contents of a register in a memory location, so the register can be used for some other purpose.

Which ordering is better depends on the characteristics of the program, e.g., numeric, non-numeric, etc.

Page 18:

Control Dependence

if (c) s1; else s2;   /* s1 and s2 are control-dependent on c */

while (c) s;          /* s is control-dependent on c */

if (a > t)
    b = a * a;
d = a + c;            /* no control dependence on the test */

Page 19:

Speculative Execution Support

Prefetching - bringing data from memory to cache before it is used.

Poison bits - support speculative loads from memory into the register file. Each register is augmented with a poison bit, which is set when an illegal memory location is accessed, so that an exception is raised at a later use of the value.

Page 20:

Predicated Execution

Predicated instructions were invented to reduce the number of branches in a program.

A predicated instruction is like a normal instruction but has an extra predicate operand to guard its execution.

E.g., CMOVZ R2, R3, R1 has the semantics of moving the contents of R3 to R2 if R1 is zero.

if (a == 0) b = c + d; can be implemented as

ADD R3, R4, R5    /* a, b, c, d are allotted R1, R2, R4, R5 */
CMOVZ R2, R3, R1
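The same if-conversion can be sketched at the source level in C (the variable values are illustrative; whether the compiler actually emits a conditional move depends on the target). Computing the sum unconditionally is safe here because c + d has no side effects.

#include <stdio.h>

int main(void) {
    int a = 0, b = 9, c = 3, d = 4;

    /* branchy form: if (a == 0) b = c + d;          */
    int t = c + d;           /* ADD: compute unconditionally      */
    b = (a == 0) ? t : b;    /* CMOVZ: select without a branch    */

    printf("b = %d\n", b);   /* prints b = 7 */
    return 0;
}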

Page 21:

Basic Machine Model

Many machines can be represented as M = <R, T>.

T - a set of operation types, such as loads, stores, and arithmetic operations.

R - a vector R = [r1, r2, ...] of hardware resources, where ri is the number of units available of the ith kind of resource.

Resources - memory-access units, ALUs, floating-point functional units.

Page 22:

Basic Machine Model

Each operation has a set of input operands, a set of output operands, and a resource requirement.

RTt - the resource-reservation table for operation type t.

RTt[i, j] - the number of units of the jth resource used by an operation of type t, i clocks after it is issued.
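A minimal sketch of how M = <R, T> and the resource-reservation tables might be represented in C; the resource kinds, the table depth, and the sample numbers are assumptions for illustration, not a fixed machine description.

#include <stdio.h>

#define N_RES 2          /* resource kinds: 0 = memory unit, 1 = ALU */
#define TABLE_DEPTH 2    /* clocks an operation can occupy resources */

/* rt[i][j]: units of resource j used i clocks after issue (RTt[i,j]) */
typedef struct {
    const char *name;
    int rt[TABLE_DEPTH][N_RES];
} OpType;

int main(void) {
    int R[N_RES] = {1, 1};                    /* resource vector: one of each */
    OpType load = {"LD",  {{1, 0}, {1, 0}}};  /* sample: occupies MEM 2 clocks */
    OpType add  = {"ADD", {{0, 1}, {0, 0}}};  /* sample: occupies ALU 1 clock  */
    OpType ops[] = {load, add};

    for (int t = 0; t < 2; t++)
        for (int i = 0; i < TABLE_DEPTH; i++)
            for (int j = 0; j < N_RES; j++)
                printf("RT_%s[%d][%d] = %d (limit %d)\n",
                       ops[t].name, i, j, ops[t].rt[i][j], R[j]);
    return 0;
}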

Page 23:

Basic-Block Scheduling

Page 24:

Data-Dependence Graphs

Graph G = (N, E)

N - a set of nodes representing the operations in the machine instructions.

E - a set of directed edges representing the data-dependence constraints among the operations.

1. Each operation n in N has a resource-reservation table RTn, whose value is simply the resource-reservation table associated with the operation type of n.

2. Each edge e in E is labeled with a delay de, indicating that the destination node must be issued no earlier than de clocks after the source node is issued.

Page 25:

Data-Dependence Graph (example)

i1: LD R2, 0(R1)
i2: ST 4(R1), R2
i3: ADD R3, R3, R2
i4: ADD R3, R3, R4
i5: LD R3, 8(R1)
i6: ST 0(R7), R7
i7: ST 12(R1), R3

[Figure: the data-dependence graph over i1-i7; the edges out of the loads i1 and i5 carry delay 2, the remaining edges delay 1.]

Notes: 1. A load operation takes 2 clock cycles. 2. R1 is a stack pointer, with offsets ranging from 0 to 12.

Page 26:

List Scheduling of Basic Blocks

This involves visiting each node of the data-dependence graph in "prioritized topological order".

Machine-resource vector R = [r1, r2, r3, ...], where ri is the number of units available of the ith kind of resource.

G = (N, E) is the data-dependence graph, and RTn is the resource-reservation table of node n.

An edge e = n1 -> n2 with delay de indicates that n2 may issue only de clocks after n1.

Page 27:

List Scheduling Algorithm

RT = an empty reservation table;
for (each n in N in prioritized topological order) {
    s = max over edges e = p -> n in E of (S(p) + de);
        /* the earliest time this instruction could begin,
           given when its predecessors started */
    while (there exists i such that RT[s + i] + RTn[i] > R)
        s = s + 1;
        /* delay the instruction further until the needed
           resources are available */
    S(n) = s;
    for (all i)
        RT[s + i] = RT[s + i] + RTn[i];
}
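A runnable sketch of the algorithm in C, under simplifying assumptions: one-clock reservation tables, two resource kinds, and a small hard-coded dependence graph (two loads feeding an ADD that feeds a store) already given in prioritized topological order. The graph and latencies are illustrative, not the Page 25 example.

#include <stdio.h>

#define N_NODES 4
#define N_RES   2        /* 0 = memory unit, 1 = ALU */
#define MAX_CLK 32

typedef struct { int src, dst, delay; } Edge;

int main(void) {
    int R[N_RES] = {1, 1};                  /* machine-resource vector */
    const char *name[N_NODES] = {"LD a", "LD b", "ADD", "ST"};
    int use[N_NODES][N_RES] = {             /* one-clock reservation tables */
        {1, 0}, {1, 0}, {0, 1}, {1, 0}
    };
    Edge E[] = {{0, 2, 2}, {1, 2, 2}, {2, 3, 1}};  /* loads: 2-clock latency */
    int nE = 3;
    int RT[MAX_CLK][N_RES] = {{0}};
    int S[N_NODES];

    /* nodes 0..3 are assumed to be in prioritized topological order */
    for (int n = 0; n < N_NODES; n++) {
        int s = 0;
        for (int k = 0; k < nE; k++)        /* s = max over preds of S(p)+de */
            if (E[k].dst == n && S[E[k].src] + E[k].delay > s)
                s = S[E[k].src] + E[k].delay;
        for (;;) {                          /* delay until resources available */
            int fits = 1;
            for (int j = 0; j < N_RES; j++)
                if (RT[s][j] + use[n][j] > R[j]) fits = 0;
            if (fits) break;
            s++;
        }
        S[n] = s;                           /* commit: reserve the resources */
        for (int j = 0; j < N_RES; j++)
            RT[s][j] += use[n][j];
    }
    for (int n = 0; n < N_NODES; n++)
        printf("%-5s issues at clock %d\n", name[n], S[n]);
    return 0;
}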

Page 28:

Prioritized Topological Order

Possible prioritized orderings:

1) Critical path - the longest path through the data-dependence graph. The height of a node is the length of the longest path in the graph originating from that node (see the sketch after this list).

2) The length of the schedule is also constrained by the resources available. The critical resource is the one with the largest ratio of uses to the number of units of that resource available. Operations using more critical resources may be given higher priority.

3) Source ordering - the operation that shows up earlier in the source program should be scheduled first.
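The height priority can be computed in one backward pass over a topologically ordered graph; a minimal sketch in C (the graph and delays are the same illustrative chain as in the list-scheduling sketch above), using height(n) = max over successors s of (delay(n -> s) + height(s)).

#include <stdio.h>

#define N_NODES 4
typedef struct { int src, dst, delay; } Edge;

int main(void) {
    /* illustrative chain: two loads feed an ADD that feeds a ST */
    Edge E[] = {{0, 2, 2}, {1, 2, 2}, {2, 3, 1}};
    int nE = 3, height[N_NODES] = {0};

    /* nodes 0..3 are topologically ordered, so scan them backward */
    for (int n = N_NODES - 1; n >= 0; n--)
        for (int k = 0; k < nE; k++)
            if (E[k].src == n && E[k].delay + height[E[k].dst] > height[n])
                height[n] = E[k].delay + height[E[k].dst];

    for (int n = 0; n < N_NODES; n++)
        printf("height(n%d) = %d\n", n, height[n]);  /* 3 3 1 0 */
    return 0;
}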

Page 29:

Result of applying list scheduling to the data-dependence graph example on Page 25, using height as the priority function:

ALU               Memory
                  LD R3, 8(R1)
                  LD R2, 0(R1)
ADD R3, R3, R4    (2-clock load delay)
ADD R3, R3, R2    ST 4(R1), R2
                  ST 12(R1), R3
                  ST 0(R7), R7

Page 30:

Global Code Scheduling

Strategies that consider more than one basic block at a time are referred to as global scheduling.

Conditions (the schedule must abide by the control and data dependences):

1. All instructions in the original program are executed in the optimized one, and
2. while the optimized program may execute extra instructions speculatively, these instructions must not have any unwanted side effects.

Page 31:

Basic Block

A basic block is a sequence of instructions in which control enters through the first instruction and leaves via the last instruction, with no halt or possibility of branching in between (the flow is linear).

Page 32:

Primitive Code Motion

Source program:

if (a == 0) goto L
c = b
L: e = d + d

Page 33:

Locally Scheduled Machine Code

B1:
    LD R6, 0(R1)
    nop
    BEQZ R6, L

B2:
    LD R7, 0(R2)
    nop
    ST 0(R3), R7

B3 (L:):
    LD R8, 0(R4)
    nop
    ADD R8, R8, R8
    ST 0(R5), R8

Page 34:

Globally Scheduled Machine Code

B1:
    LD R6, 0(R1); LD R8, 0(R4)
    LD R7, 0(R2)
    ADD R8, R8, R8; BEQZ R6, L

B3' (fall-through path):
    ST 0(R5), R8; ST 0(R3), R7

B3 (L:):
    ST 0(R5), R8

Here the loads and the ADD are executed speculatively in B1, and the store of e is duplicated on both paths (B3' is the duplicate).

Page 35:

Upward Code Motion

Moves an operation from block src up a control-flow path to block dst, provided the move does not violate any data dependences and it makes the paths through dst and src run faster.

Case 1: src does not postdominate dst. In this case there exists a path that passes through dst but does not reach src, so the code motion is illegal unless the operation moved has no unwanted side effects.

Page 36: Instruction Level Parallelism Compiler optimization Techniques Anna University,K.Thirunadana Sikamani

Contd…Case 2: If dst does not dominate srcIn this case there exists a path that reaches src without first going through dst.

We need to move copies of the moved operation along such pathsConstraints:

1.The operands of the operation must hold the same values as in the original.

2.The result does not overwrite a value that is still needed , and

3. It itself is not subsequently overwritten before reaching src.

8/25/2014 Compiler OptimizationTechniques - unit II 36

Page 37:

Downward Code Motion

Moves an operation from block src down a control-flow path to block dst.

Case 1: src does not dominate dst - there exists a path to dst that does not pass through src.

Case 2: dst does not postdominate src - there exists a path through src that does not pass through dst.

Page 38:

E.g.:

if (x == 0) a = b; else a = c;
d = a;

Memory mapping: x: 0(R5), b: 0(R6), c: 0(R7), a: 0(R8), d: 0(R9)

B1 (x == 0):
    LD R1, x
    nop
    BEQZ R1, L

B2 (a = c):
    LD R3, c
    nop
    ST a, R3

B3 (L:, a = b):
    LD R2, b
    nop
    ST a, R2

B4 (d = a):
    LD R4, a
    nop
    ST d, R4

(B2 and B3 both flow into B4.)

Page 39:

The same example after global scheduling:

Memory mapping: x: 0(R5), b: 0(R6), c: 0(R7), a: 0(R8), d: 0(R9)

B1:
    LD R1, 0(R5); LD R3, 0(R7); LD R2, 0(R6)
    ST 0(R8), R3
    BEQZ R1, L    /* or: CMOVZ 0(R8), R2, R1 */

B2:
    ST 0(R8), R2

B4 (L:):
    LD R4, 0(R8)
    nop
    ST 0(R9), R4

Here a = c is executed speculatively in B1 before the branch; the conditional store of b into a can alternatively be expressed with the predicated CMOVZ.

Page 40:

Updating Data Dependences

Code motions can change the data-dependence relations between operations, so data dependences must be updated after each code motion.

Example: two branches assign x = 1 and x = 2, and x is not live before the code motion. Either assignment may be moved up above the branch, but once one has been moved, the other no longer can be.
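A small C illustration of the point (the values are hypothetical): hoisting one assignment above the branch is safe precisely because x is dead on entry, but once it has moved, the remaining write has an output dependence on it and must stay below the branch; hoisting it too would make one path produce the wrong value.

#include <stdio.h>

int f(int c) {
    int x;
    if (c) x = 1; else x = 2;  /* original: exactly one write reaches the use */
    return x;
}

int g(int c) {
    int x = 1;                 /* "x = 1" hoisted: legal, x was dead on entry */
    if (!c) x = 2;             /* this write must stay below the test;        */
                               /* hoisting it too would leave x = 2 always    */
    return x;
}

int main(void) {
    printf("%d %d %d %d\n", f(0), f(1), g(0), g(1));  /* 2 1 2 1 */
    return 0;
}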

Page 41:

Global Scheduling Algorithms - Region-Based Scheduling

The two easiest forms of code motion:

1. Moving operations up to control-equivalent basic blocks, and
2. moving operations speculatively up one branch to a dominating predecessor.

Assignment: the region-based scheduling algorithm.

Page 42:

Loop Unrolling

Unrolling creates more instructions in the loop body, permitting global scheduling algorithms to find more parallelism.

for (i = 0; i < N; i++) {
    S(i);
}

can be unrolled to (with a cleanup loop for the leftover iterations, as sketched after this slide):

for (i = 0; i + 4 <= N; i += 4) {
    S(i);
    S(i+1);
    S(i+2);
    S(i+3);
}

repeat
    S;
until C;

can be unrolled to:

repeat {
    S;
    if (C) break;
    S;
    if (C) break;
    S;
} until C;
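A compilable version of the for-loop unrolling, including the residual loop that handles the iterations left over when N is not a multiple of 4 (S is a stand-in body and N an illustrative size):

#include <stdio.h>

#define N 10
static int total;

static void S(int i) { total += i; }   /* stand-in loop body */

int main(void) {
    int i;
    /* unrolled by 4: each trip executes four consecutive iterations */
    for (i = 0; i + 4 <= N; i += 4) {
        S(i);
        S(i + 1);
        S(i + 2);
        S(i + 3);
    }
    for (; i < N; i++)   /* cleanup loop for the remaining N mod 4 iterations */
        S(i);

    printf("total = %d\n", total);     /* 0+1+...+9 = 45 */
    return 0;
}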

Page 43:

Neighborhood Compaction

Examine each pair of basic blocks that are executed one after the other, and check whether any operation can be moved up or down between them to improve the execution time of those blocks.

If such a pair is found, check whether the instruction to be moved needs to be duplicated along other paths.

Page 44:

Advanced Code Motion Techniques

Adding new basic blocks along the control-flow edges originating from blocks with more than one predecessor.

Moving instructions out of a basic block so that the block can be eliminated completely.

The code to be executed in each basic block is scheduled once and for all as each block is visited, because the algorithm only moves operations up to dominating blocks.

Implementing downward code motion is harder in an algorithm that visits basic blocks in topological order; we move all operations that (i) can be moved and (ii) cannot be executed in their native block.

Page 45:

Interaction with Dynamic Schedulers

A dynamic scheduler can create new schedules according to run-time conditions.

High-latency instructions are issued early.

Data-prefetch instructions help the dynamic scheduler by making data available in advance.

Data-dependent operations are put in the correct order to ensure program correctness. For best performance the compiler should assign long delays to dependences that are likely to occur and short ones to those that are not.

Branch misprediction must be avoided.

Page 46:

Software Pipelining

Page 47:

Software Pipelining

Numerical applications often have loops whose iterations are completely independent of one another.

Such loops with many iterations have enough parallelism to saturate all the resources in a processor; it is up to the scheduler to take full advantage of the available parallelism.

Software pipelining schedules an entire loop at a time, taking full advantage of the parallelism across iterations.

Page 48:

Machine Model

The machine can issue in a single clock: one load, one store, one arithmetic operation, and one branch operation.

The machine has a loop-back operation

BL R, L

which decrements register R and, unless the result is 0, branches to location L (i.e., do { ... } while (--R != 0)).

Page 49:

Machine Model

Memory operations have an auto-increment addressing mode, denoted by ++ after the register. The register is automatically incremented to point to the next consecutive address after each access.

The arithmetic operations are fully pipelined; they can be initiated every clock, but their results are not available until 2 clocks later. All other instructions have a single-clock latency.

Page 50:

Typical Do-All Loop

for (i = 0; i < n; i++)
    D[i] = A[i] * B[i] + c;

// R1, R2, R3 = &A, &B, &D
// R4 = c
// R10 = n - 1

Locally scheduled code:

L:  LD R5, 0(R1++)
    LD R6, 0(R2++)
    MUL R7, R5, R6
    nop
    ADD R8, R7, R4
    nop
    ST 0(R3++), R8
    BL R10, L

Page 51:

Five unrolled iterations of: for (i = 0; i < n; i++) D[i] = A[i] * B[i] + c;

Clock   j=1   j=2   j=3   j=4   j=5
 1      LD
 2      LD
 3      MUL   LD
 4            LD
 5            MUL   LD
 6      ADD         LD
 7                  MUL   LD
 8      ST    ADD         LD
 9                        MUL   LD
10            ST    ADD         LD
11                              MUL
12                  ST    ADD
13
14                        ST    ADD
15
16                              ST

Page 52:

Software-Pipelined Code

Clock     j=1   j=2   j=3   j=4
 1        LD
 2        LD
 3        MUL   LD
 4              LD
 5              MUL   LD
 6        ADD         LD
 7   L:               MUL   LD
 8        ST    ADD         LD    BL L
 9                          MUL
10              ST    ADD
11
12                    ST    ADD
13
14                          ST

Page 53:

A new iteration can be started on the pipeline every 2 clocks.

When the first iteration proceeds to its third stage, the second iteration starts to execute.

By clock 7 the pipeline is fully filled with the first four iterations.

In the steady state, four consecutive iterations are executing at the same time.

The sequence of instructions in clocks 1 through 6 is called the prolog; clocks 7 and 8 are the steady state; and clocks 9 through 14 are the epilog.
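The same prolog / steady-state / epilog structure can be sketched at the source level; a minimal two-stage C version of the do-all loop (stage 1 is the multiply, stage 2 the add-and-store), with names taken from the slide example and illustrative data. Iteration i's stage 1 overlaps iteration i-1's stage 2.

#include <stdio.h>

#define N 6

int main(void) {
    int A[N] = {1, 2, 3, 4, 5, 6}, B[N] = {6, 5, 4, 3, 2, 1};
    int D[N], c = 10;

    int m = A[0] * B[0];                 /* prolog: stage 1 of iteration 0 */
    for (int i = 1; i < N; i++) {
        int m_next = A[i] * B[i];        /* steady state: stage 1 of i ... */
        D[i - 1] = m + c;                /* ... overlaps stage 2 of i - 1  */
        m = m_next;
    }
    D[N - 1] = m + c;                    /* epilog: stage 2 of the last one */

    for (int i = 0; i < N; i++) printf("%d ", D[i]);
    printf("\n");
    return 0;
}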