CMPUT680 - Winter 2006

CMPUT 329 - Computer Organization and Architecture II


Topic I: Superblock and Hyperblock Formation

José Nelson Amaral
http://www.cs.ualberta.ca/~amaral/courses/680

Instruction Level Parallelism Optimizations

The objective of an optimizer is to reduce the number and complexity of the instructions executed by the processor.

Superscalar or Very Long Instruction Word (VLIW) processors can reduce the execution time even when the number of instructions executed moderately increases, as long as the dependence height is reduced.

Speculative and Predicated Execution

Speculative Execution: execution of an instruction before knowing that its execution is required.

Predicated Execution: architecture-supported conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate of the instruction.

Superblock: structure used to implement compiler-controlled speculative execution.

If-conversion: compiler algorithm that converts conditional branches into predicate-defining instructions to allow the use of predication.

Trace Scheduling (Fisher, 1981)

Some optimization and scheduling decisions may decrease the execution time for one control path while increasing the execution time for another path.

Thus decisions should favor more frequently executed paths to improve overall performance.

Trace scheduling divides a procedure into a set of frequently executed traces (paths).

Trace Scheduling

There may be conditional branches from the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).

These control-flow transitions are ignored during trace scheduling.

After scheduling, bookkeeping is required to ensure the correct execution of off-trace code.

Bookkeeping for Trace Scheduling

[Figure: a trace containing Instr 1, Instr 2, Instr 3, Instr 4, and Instr 5, with a side entrance.]

What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?

What bookkeeping is required when Instr 5 moves above the side entrance in the trace?

Superblocks

A superblock is a trace without side entrances, i.e., control can only enter from the top, but it can leave at one or more exit points.

The formation of superblocks creates additional optimization opportunities because constraints associated with infrequently executed paths of control are ignored (thus these constraints do not inhibit optimizations that favor frequently executed paths).

Superblock Formation (Example)

Is this a superblock?

No, a superblock cannot have side entrances, and this set of nodes has two side entrances into node F. How do we convert it into a superblock?

Tail duplication is the duplication of the basic blocks that appear after a side entrance, in order to eliminate side entrances and transform a trace into a superblock.

[Figure: control flow graph after tail duplication, with the duplicated node F’ (weight 10).]

Common Subexpression Elimination in Superblocks

Original Code:

opA: mul r1,r2,3
opC: mul r3,r2,3
opB: add r2,r2,199

Code After Superblock Formation:

opA: mul r1,r2,3
opC: mul r3,r2,3
opB: add r2,r2,199
opC’: mul r3,r2,3

Code After Common Subexpression Elimination:

opA: mul r1,r2,3
opC: mov r3,r1
opB: add r2,r2,199
opC’: mul r3,r2,3

Operation Migration in Superblocks

Original Code:

… mov r0,r1
… mov r0,r2
… mov r0,r3
…
add r1,r1,4
add r2,r2,4
add r3,r3,4

After Operation Migration:

…
add r1,r1,4
add r2,r2,4
add r3,r3,4
mov r0,r1
mov r0,r2
mov r0,r3

Global Variable Migration in Superblock

Original Program Segment:

OpA: ld_i r4, x, r0
OpB: add r4, r4, r1
OpC: st_i x, r0, r4
OpD: add r1, r1, 1
OpE: add r0, r0, 1

OpA and OpC access MEM[r0+x].

After Variable Migration:

OpC: st_i x, r0, r4
OpC’: st_i x, r0, r4
OpE: add r0, r0, 1
OpA: ld_i r4, x, r0
OpB: add r4, r4, r1
OpD: add r1, r1, 1

Superblock Enlarging Optimizations

By enlarging a superblock, we can provide the scheduler with more independent instructions to choose from for each cycle.

Superblock enlarging optimizations:
- Branch target expansion
- Loop unrolling
- Loop peeling

Branch Target Expansion

Idea: To expand the superblock with the target of a likely taken branch.

blt r1, r2, L3

beq r3, r4, L5

jump L4

L2:
L3:

[Figure: control flow graph with edge weights 20 and 100.]

blt r1, r2, L3

beq r3, r4, L5

jump L4

Superblock Loops

A superblock loop is a superblock that has a frequently taken backedge from its last node to its first node.

We will study the extension of some common loop optimizations to superblocks.

Dependence Removing Optimizations

The goal is to eliminate data dependences between instructions within frequently executed superblocks.

Dependence removing optimizations include:
- Register renaming
- Accumulator variable expansion
- Induction variable expansion
- Search variable expansion
- Operation combining
- Strength reduction
- Tree height reduction

Instruction Latencies for Examples

Function        Latency
Int ALU         1
Int multiply    3
Int divide      10
Branch          1
Memory load     2
Memory store    1
FP ALU          3
FP conversion   3
FP multiply     3
FP divide       10

Register Renaming Example

Original Loop:

For (j=0; j<n; j++) { C(j) = A(j)+B(j) }

Assembly Code:

L1: ld_f f2, A, r1 (a)
    ld_f f3, B, r1 (b)
    add_f f4, f2, f3 (c)
    st_f C, r1, f4 (d)
    add r1, r1, 4 (e)
    blt r1, r5, L1 (f)

For all the examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus, for the code above, we obtain the following schedule.

[Figure: code schedule (instructions versus cycles) for the original loop.]

7 cycles / 1 iteration
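The 7-cycle figure can be reproduced with a small as-soon-as-possible scheduler. This is a minimal sketch, not the tool used for the slides: the dependence lists are read off the assembly code by hand, and it assumes that an instruction overwriting a register may issue in the same cycle as the last reader of that register (the names `LOOP` and `asap_schedule` are hypothetical).

```python
# Latencies taken from the "Instruction Latencies for Examples" table.
LATENCY = {"ld_f": 2, "add_f": 3, "st_f": 1, "add": 1, "blt": 1}

# (label, opcode, flow_deps, anti_deps)
# flow_deps: producers whose result this instruction reads.
# anti_deps: earlier readers of a register that this instruction overwrites.
LOOP = [
    ("a", "ld_f",  [],         []),
    ("b", "ld_f",  [],         []),
    ("c", "add_f", ["a", "b"], []),
    ("d", "st_f",  ["c"],      []),
    ("e", "add",   [],         ["a", "b", "d"]),  # overwrites r1
    ("f", "blt",   ["e"],      []),
]

def asap_schedule(ops):
    """Issue every instruction as early as its dependences allow
    (infinite resources, no register renaming)."""
    issue = {}
    opcode_of = {}
    for label, opcode, flow, anti in ops:
        opcode_of[label] = opcode
        start = 0
        for producer in flow:
            # Wait until the producer's result is available.
            start = max(start, issue[producer] + LATENCY[opcode_of[producer]])
        for reader in anti:
            # Assumption: may issue in the same cycle as the reader.
            start = max(start, issue[reader])
        issue[label] = start
    length = max(issue[l] + LATENCY[opcode_of[l]] for l in issue)
    return issue, length

issue, length = asap_schedule(LOOP)
print(issue, length)
```

Under these assumptions the store (d) and the index update (e) both issue in cycle 5 and the branch (f) in cycle 6, so one iteration takes 7 cycles.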

Loop Unrolling

After Loop Unrolling:

L1: ld_f f2, A, r1 (a)
    ld_f f3, B, r1 (b)
    add_f f4, f2, f3 (c)
    st_f C, r1, f4 (d)
    add r1, r1, 4 (e)
    ld_f f2, A, r1 (f)
    ld_f f3, B, r1 (g)
    add_f f4, f2, f3 (h)
    st_f C, r1, f4 (i)
    add r1, r1, 4 (j)
    ld_f f2, A, r1 (k)
    ld_f f3, B, r1 (l)
    add_f f4, f2, f3 (m)
    st_f C, r1, f4 (n)
    add r1, r1, 4 (o)
    blt r1, r5, L1 (p)

[Figure: code schedule (instructions versus cycles) for the unrolled loop.]

19 cycles / 3 iterations = 6.3 cycles / iteration

Register Renaming

[Figure: code and code schedule after loop unrolling and register renaming.]

Accumulator Variable Expansion

An accumulator variable accumulates a sum or product in each iteration of a loop.

Accumulator variable expansion eliminates redefinitions of an accumulator variable within an unrolled loop by creating k temporary accumulators (k is the number of accumulation instructions). The values of all temporary accumulators must be summed at the exit points of the loop where the accumulator is live.
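The idea can be sketched at the source level. The following is an illustrative sketch only (the function names and the `init` parameter are hypothetical, not from the slides): a dot product unrolled k times keeps k independent partial sums, so the accumulation instructions of one unrolled iteration no longer depend on each other, and the partial sums are combined at the loop exit.

```python
def dot_single(a, b, init=0):
    """Reference version: one accumulator, every add depends on the previous one."""
    acc = init
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_expanded(a, b, init=0, k=3):
    """Unrolled k times with k temporary accumulators."""
    # The first accumulator starts from the original value, the others from 0.
    accs = [init] + [0] * (k - 1)
    n = len(a)
    main = n - n % k
    for i in range(0, main, k):
        for u in range(k):
            # Each update touches a different accumulator, so the k
            # updates of one unrolled iteration are independent.
            accs[u] += a[i + u] * b[i + u]
    for i in range(main, n):
        # Remainder iterations when n is not a multiple of k.
        accs[0] += a[i] * b[i]
    # Sum the temporaries at the loop exit.
    return sum(accs)

a = list(range(10))
b = list(range(10, 20))
print(dot_single(a, b, init=5), dot_expanded(a, b, init=5, k=3))
```

With integer data both versions agree exactly. With floating-point data the expanded version adds the terms in a different order, so the result may differ in the last bits.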

Accumulator Expansion Example

Original Loop:

For (k=0; k<n; k++) { C(i,j) = C(i,j) + A(i,k) * B(k,j) }

Assembly Code:

    ld_f f1, C, r2 (-)
L1: ld_f f3, A, r4 (a)
    ld_f f5, B, r6 (b)
    mul_f f7, f3, f5 (c)
    add_f f1, f1, f7 (d)
    add r4, r4, 4 (e)
    add r6, r6, r8 (f)
    blt r4, r9, L1 (g)
    st_f C, r2, f1 (-)

[Figure: code schedule (instructions versus cycles) for the original loop.]

After Unrolling and Renaming:

    ld_f f1, C, r2 (-)
L1: ld_f f31, A, r41 (a)
    ld_f f51, B, r61 (b)
    mul_f f71, f31, f51 (c)
    add_f f1, f1, f71 (d)
    add r42, r41, 4 (e)
    add r62, r61, r8 (f)
    ld_f f32, A, r42 (g)
    ld_f f52, B, r62 (h)
    mul_f f72, f32, f52 (i)
    add_f f1, f1, f72 (j)
    add r43, r42, 4 (k)
    add r63, r62, r8 (l)
    ld_f f33, A, r43 (m)
    ld_f f53, B, r63 (n)
    mul_f f73, f33, f53 (o)
    add_f f1, f1, f73 (p)
    add r41, r43, 4 (q)
    add r61, r63, r8 (r)
    blt r41, r9, L1 (s)
    st_f C, r2, f1 (-)

[Figure: code schedule (instructions versus cycles) after unrolling and renaming.]

Accumulator Expansion

    ld_f f11, C, r2 (-)
    mov_f f12, 0 (-)
    mov_f f13, 0 (-)
L1: ld_f f31, A, r41 (a)
    ld_f f51, B, r61 (b)
    mul_f f71, f31, f51 (c)
    add_f f11, f11, f71 (d)
    add r42, r41, 4 (e)
    add r62, r61, r8 (f)
    ld_f f32, A, r42 (g)
    ld_f f52, B, r62 (h)
    mul_f f72, f32, f52 (i)
    add_f f12, f12, f72 (j)
    add r43, r42, 4 (k)
    add r63, r62, r8 (l)
    ld_f f33, A, r43 (m)
    ld_f f53, B, r63 (n)
    mul_f f73, f33, f53 (o)
    add_f f13, f13, f73 (p)
    add r41, r43, 4 (q)
    add r61, r63, r8 (r)
    blt r41, r9, L1 (s)
    add_f f11, f11, f12 (-)
    add_f f11, f11, f13 (-)
    st_f C, r2, f11 (-)

[Figure: code schedule (instructions versus cycles) after accumulator expansion.]

Induction Variable Expansion

An induction variable is used to index through loop iterations and through regular data structures, such as arrays.

Induction variable expansion eliminates dependences between definitions of induction variables and their uses in unrolled loops.
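A source-level sketch of the transformation, under the assumption that the trip count is a multiple of the unroll factor (the function names are hypothetical): instead of one index that is bumped after every copy of the loop body, each copy gets its own index, and all of them advance by the combined step once per unrolled iteration.

```python
def indices_chained(j0, step, iterations):
    """One induction variable: each index depends on the previous update."""
    out = []
    j = j0
    for _ in range(iterations):
        out.append(j)
        j = j + step
    return out

def indices_expanded(j0, step, iterations, unroll=3):
    """One induction variable per unrolled copy of the body.

    Assumes iterations is a multiple of unroll.
    """
    # Set up before the loop: j0, j0 + step, j0 + 2*step, ...
    js = [j0 + u * step for u in range(unroll)]
    big_step = unroll * step
    out = []
    for _ in range(iterations // unroll):
        # The copies of the body use js[0], js[1], ... with no dependence
        # between them.
        out.extend(js)
        # Each variable is advanced independently by unroll * step.
        js = [j + big_step for j in js]
    return out

print(indices_chained(4, 2, 9))
print(indices_expanded(4, 2, 9))
```

Both functions visit the same indices in the same order; only the dependence chain between index updates changes.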

Induction Variable Expansion Example

Original Loop:

For (i=0; i<n; i++) { C(j) = A(j) * B(j); j = j + K }

Assembly Code:

L1: ld_f f3, A, r2 (a)
    ld_f f4, B, r2 (b)
    mul_f f5, f3, f4 (c)
    st_f C, r2, f5 (d)
    add r2, r2, r7 (e)
    add r1, r1, 1 (f)
    blt r1, r6, L1 (g)

[Figure: code schedule (instructions versus cycles) for the original loop.]

After Unrolling and Renaming:

L1: ld_f f31, A, r21 (a)
    ld_f f41, B, r21 (b)
    mul_f f51, f31, f41 (c)
    st_f C, r21, f51 (d)
    add r22, r21, r7 (e)
    ld_f f32, A, r22 (f)
    ld_f f42, B, r22 (g)
    mul_f f52, f32, f42 (h)
    st_f C, r22, f52 (i)
    add r23, r22, r7 (j)
    ld_f f33, A, r23 (k)
    ld_f f43, B, r23 (l)
    mul_f f53, f33, f43 (m)
    st_f C, r23, f53 (n)
    add r21, r23, r7 (o)
    add r1, r1, 3 (p)
    blt r1, r6, L1 (q)

[Figure: code schedule (instructions versus cycles) after unrolling and renaming.]

8 cycles / 3 iterations = 2.7 cycles / iteration

Induction Variable Expansion

    mov r21, r2 (-)
    add r22, r21, r7 (-)
    add r23, r22, r7 (-)
    mul r71, r7, 3 (-)
L1: ld_f f31, A, r21 (a)
    ld_f f41, B, r21 (b)
    mul_f f51, f31, f41 (c)
    st_f C, r21, f51 (d)
    ld_f f32, A, r22 (f)
    ld_f f42, B, r22 (g)
    mul_f f52, f32, f42 (h)
    st_f C, r22, f52 (i)
    ld_f f33, A, r23 (k)
    ld_f f43, B, r23 (l)
    mul_f f53, f33, f43 (m)
    st_f C, r23, f53 (n)
    add r21, r21, r71 (e)
    add r22, r22, r71 (j)
    add r23, r23, r71 (o)
    add r1, r1, 3 (p)
    blt r1, r6, L1 (q)

[Figure: code schedule (instructions versus cycles) after induction variable expansion.]

6 cycles / 3 iterations = 2 cycles / iteration

Search Variable Expansion

A search variable is a single value (e.g., a minimum or a maximum) computed for a collection of data.

Search variable expansion eliminates dependences between definitions of search variables and their uses in unrolled loops.

Each search variable is expanded into k temporary independent variables. At the exit of the loop the value of the original search variable is obtained by comparing the values of the temporary search variables.
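As an illustration, here is a minimal source-level sketch for a maximum search (the function name is hypothetical, and seeding every temporary with the first element is an assumption of this sketch):

```python
def max_expanded(values, k=3):
    """Maximum of a non-empty list using k temporary search variables."""
    if not values:
        raise ValueError("values must not be empty")
    # Every temporary starts from the first element, which is a valid
    # lower bound for the final maximum.
    temps = [values[0]] * k
    n = len(values)
    main = n - n % k
    for i in range(0, main, k):
        for u in range(k):
            # Copy u of the unrolled body only touches temps[u], so the
            # k compare-and-update steps are independent of each other.
            if values[i + u] > temps[u]:
                temps[u] = values[i + u]
    for i in range(main, n):
        # Remainder iterations when n is not a multiple of k.
        if values[i] > temps[0]:
            temps[0] = values[i]
    # At the loop exit, compare the temporaries to get the final value.
    result = temps[0]
    for t in temps[1:]:
        if t > result:
            result = t
    return result

print(max_expanded([3, 9, 1, 7, 12, 5, 8]))
```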

Superblock Scheduling

Superblock scheduling is a two-step process:

Step 1: Build the dependence graph.
Step 2: List scheduling using the dependence graph, instruction latencies, and resource constraints of the processor.

List Scheduling

List scheduling employs heuristics to choose, among all ready nodes, the combination of nodes that should be scheduled in the current cycle.

A node is ready if:
(i) all its parents in the dependence graph have been scheduled;
(ii) the result produced by each parent is available; and
(iii) the resources required by the node are available.
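The readiness conditions can be turned into a small cycle-by-cycle scheduler. This is a minimal sketch under stated assumptions: the only resource constraint is an issue width, the priority heuristic is the latency-weighted height of a node in the dependence graph (one common choice, not necessarily the one used in the course), every latency is at least 1, and `ops` is given in an order where parents precede children. All names are hypothetical.

```python
def list_schedule(ops, deps, latency, issue_width):
    """Return a dict mapping each operation to its issue cycle.

    ops:      operation names, parents before children
    deps:     dict op -> list of parents in the dependence graph
    latency:  dict op -> latency in cycles (at least 1)
    """
    children = {o: [] for o in ops}
    for o in ops:
        for p in deps.get(o, []):
            children[p].append(o)
    # Priority: longest latency-weighted path from the node to the end.
    height = {}
    for o in reversed(ops):
        height[o] = latency[o] + max((height[c] for c in children[o]), default=0)

    issue = {}
    cycle = 0
    while len(issue) < len(ops):
        # (i) all parents scheduled and (ii) their results available.
        ready = [
            o for o in ops
            if o not in issue
            and all(p in issue and issue[p] + latency[p] <= cycle
                    for p in deps.get(o, []))
        ]
        ready.sort(key=lambda o: -height[o])
        # (iii) resources: at most issue_width operations per cycle.
        for o in ready[:issue_width]:
            issue[o] = cycle
        cycle += 1
    return issue

ops = ["a", "b", "c", "d"]
deps = {"c": ["a", "b"], "d": ["c"]}
latency = {"a": 2, "b": 2, "c": 3, "d": 1}
print(list_schedule(ops, deps, latency, issue_width=2))
print(list_schedule(ops, deps, latency, issue_width=1))
```

With an issue width of 2 both loads issue in cycle 0; with an issue width of 1 the second load is delayed by one cycle, and so is everything that depends on it.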

Speculative Execution in Superblocks

To produce an efficient schedule, the compiler must be able to move instructions above and below branches.

R: x ← y+z
…
S: bnz r1...

LIVE-OUT(BR) is the set of variables that may be used before being redefined when the branch BR is taken.

In the example, LIVE-OUT(S) is the set of variables that are live at point P.

If we want to move instruction R below the branch instruction S, two situations might occur:

1) x ∈ LIVE-OUT(S)
2) x ∉ LIVE-OUT(S)

What is the code that the compiler should produce for each situation?

1) x ∈ LIVE-OUT(S): insert a copy of instruction R (R’) in the branch target, basic block B2.
2) x ∉ LIVE-OUT(S): no compensation code is required.

…
S: bnz r1
…
R: x ← y+z

R’: x ← y+z (in the branch target, case 1 only)

Upward code motion is more commonly used to reduce the critical path of a superblock (e.g., moving a load instruction upward to hide the load latency).

There are two major restrictions on moving an instruction J from below to above a branch BR:

Restriction 1: The destination of J is not in LIVE-OUT(BR).
Restriction 2: J will never cause an exception that may terminate program execution when BR is taken.

Restriction 1 is usually removed by register renaming. By renaming the destination register of instruction J, we ensure that it is not in LIVE-OUT(BR).

There are two extreme interpretations of restriction 2.

Restricted Speculation Model: fully enforce restriction 2.

Therefore only instructions that cannot cause exceptions are candidates for speculative execution (e.g., memory load, memory store, integer divide, and all floating point instructions cannot be speculated).

General Speculation Model: completely ignore restriction 2.

Requires that the processor provide non-excepting or silent versions of all potentially excepting instructions in the instruction set architecture. If an exception occurs for a silent instruction, it is simply ignored, and garbage is written in the destination.
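The two restrictions and the two speculation models can be summarized as a legality check. This is a simplified sketch with hypothetical names: it only looks at a register destination and an instruction class, and the set of potentially excepting classes is the one listed for the restricted model above.

```python
# Instruction classes that may cause an exception, as listed for the
# restricted speculation model.
POTENTIALLY_EXCEPTING = {"load", "store", "int_divide", "fp"}

def can_move_above_branch(dest, op_class, live_out, model):
    """Decide whether an instruction may be hoisted above a branch.

    dest:     destination register of the instruction
    op_class: instruction class, e.g. "int_alu" or "load"
    live_out: LIVE-OUT set of the branch
    model:    "restricted" or "general"
    """
    # Restriction 1: the destination must not be live when the branch is
    # taken (renaming the destination is the usual way to satisfy this).
    if dest in live_out:
        return False
    if model == "restricted":
        # Restriction 2 fully enforced.
        return op_class not in POTENTIALLY_EXCEPTING
    if model == "general":
        # Restriction 2 ignored; relies on silent instruction versions.
        return True
    raise ValueError("unknown speculation model: " + model)

print(can_move_above_branch("r4", "load", {"r1", "r2"}, "restricted"))
print(can_move_above_branch("r4", "load", {"r1", "r2"}, "general"))
```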

Example for Speculative Execution

C code segment:

avg = 0;
weight = 0;
count = 0;
while(prt != NULL) {
    count = count + 1;
    if(prt->wt > 0) weight = weight - prt->wt;
    else weight = weight + prt->wt;
    prt = prt -> next;
}
if(count != 0) avg = weight/count;

Assembly code segment:

(i1)      ld_i r1, prt, 0
(i2)      mov r7, 0 // avg
(i3)      mov r2, 0 // count
(i4)      mov r3, 0 // weight
(i5)      beq r1, 0, L3
(i6)  L0: add r2, r2, 1
(i7)      ld_i r4, r1, 0 // prt->wt
(i8)      bge r4, 0, L1
(i9)      sub r3, r3, r4
(i10)     jmp L2
(i11) L1: add r3, r3, r4
(i12) L2: ld_i r1, r1, 4
(i13)     bne r1, 0, L0
(i14) L3: beq r2, 0, L4
(i15)     div r7, r3, r2
(i16)     st_i avg, 0, r7
(i17) L4:

[Figure: Trace Selection for the Loop.]

[Figure: control flow graph after superblock formation and branch target expansion.]

           ld_i r1, prt, 0
           mov r7, 0 // avg
           mov r2, 0 // count
           mov r3, 0 // weight
           beq r1, 0, L3
(i6)   L0: add r2, r2, 1
(i7)       ld_i r4, r1, 0 // prt->wt
(i8)       bge r4, 0, LA
(i11)      add r3, r3, r4
(i12)      ld_i r1, r1, 4 // prt->next
(i13)      bne r1, 0, L0
(i9)   LA: sub r3, r3, r4
(i12’)     ld_i r1, r1, 4 // prt->next
(i13’)     bne r1, 0, L0
(i14)  L3: beq r2, 0, L4
(i15)      div r7, r3, r2
(i16)      st_i avg, 0, r7
(i17)  L4:

(I1)  L0: add r2, r2, 1
(I2)      ld_i r4, r1, 0 // prt->wt
(I3)      blt r4, 0, L1

(I4)      add r3, r3, r4
(I5)      ld_i r5, r1, 4 // prt->next
(I6)      beq r5, 0, L3

(I7)      add r2, r2, 1
(I8)      ld_i r6, r5, 0 // prt->wt
(I9)      blt r6, 0, L1’

(I10)     add r3, r3, r6
(I11)     ld_i r1, r5, 4 // prt->next
(I12)     bne r1, 0, L0

      L3: beq r2, 0, L4
          div r7, r3, r2
          st_i avg, 0, r7
      L4:

      L1’: mov r1, r5
          mov r4, r6

      L1: sub r32, r3, r4
          ld_i r1, r1, 4
          bne r1, 0, L0

Hyperblocks

Suggested Reading

Scott A. Mahlke’s Ph.D. Thesis, chap. 7.

Hyperblock

A hyperblock is a collection of connected basic blocks in which control may only enter through the first block (entry block).

Control flow may leave from any number of blocks in the hyperblock.

Before scheduling, all control flow between basic blocks within a hyperblock is removed via if-conversion.

Hyperblock Formation

A five-step procedure is used to form hyperblocks:

1. region identification

2. loop backedge coalescing

3. block selection

4. tail duplication

5. if-conversion

Running Example: wc

Mahlke uses the inner loop of wc, the program for Linux that counts the number of characters, words, and lines in a file, as a running example.

The source code

   linect = wordct = charct = token = 0;
   for ( ; ; ) {
A:     if (--(fp)->cnt < 0)
C:         c = filbuf(fp);
       else
B:         c = *(fp)->ptr++;
D:     if (c == EOF) break;
E:     charct++;
       if ((' ' < c) &&
F:         (c < 0177)) {
H:         if (!token) {
K:             wordct++;
               token++;
           }
           continue;
       }
G:     if (c == '\n')
I:         linect++;
J:     else if ((c != ' ') &&
L:              (c != '\t'))
           continue;
M:     token = 0;
   }

The Assembly Code

LA: ld_i r98, r3, 0
    add r27, r98, -1
    st_i r3, 0, r27
    blt r98, 1, LC
LB: ld_i r30, r3, 4
    add r29, r30, 1
    st_i r3, 4, r29
    ld_c r4, r30, 0
LD: beq r4, -1, EXIT
LE: ld_i r33, r73, 0
    add r32, r33, 1
    st_i r73, 0, r32
    bge 32, r4, LG
LF: bge r4, 127, LG
LH: bne 0, r2, LA
LK: ld_i r36, r72, 0
    add r35, r36, 1
    st_i r72, 0, r35
    add r2, r2, 1
    jmp LA
LG: beq r4, r10, LI
LJ: bne r4, 32, LL
LM: mov r2, 0
    jmp LA
LI: ld_i r39, r71, 0
    add r38, r39, 1
    st_i r71, 0, r38
    jmp LM
LL: bne r4, 9, LA
    jmp LM
LC: mov Parm0, r3
    jsr filbuf
    mov r4, Ret0
    jmp LD

Control Flow Graph

[Figure: control flow graph of the wc inner loop, annotated with execution counts.]

Statistics of the Example

wc is formed by small basic blocks with a large percentage of branches.

It contains 13 basic blocks and 34 instructions:

14 branches: 8 conditional, 5 unconditional, 1 subroutine call

Step 1: Region Identification

A region is a group of basic blocks with a single entry block that dominates all the blocks in the region.

Regions are used because they provide easy-to-compute outer boundaries for hyperblocks.

A basic block can only reside in a single region.

A second constraint imposed on region formation is that regions may not contain internal cycles (this constraint is relaxed later).

In wc, the entire control flow graph forms a region.

Step 2: Backedge Coalescing

If-conversion can only remove non-loop branches.

Thus we need to coalesce all backedges into a single backedge. This allows the control logic that chooses which backedge is taken to be eliminated by if-conversion.

To coalesce the backedges, we introduce a new node that will be the origin of the new single backedge. Then we retarget all existing backedges to this new node.
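On an edge-list representation of the control flow graph the transformation is short. The sketch below uses hypothetical names, and the four backedges of the wc loop are read off the assembly listing (the branches and jumps to LA in blocks H, K, M, and L); the new node is called N.

```python
def coalesce_backedges(edges, backedges, header, new_node):
    """Retarget every backedge to new_node and add one backedge new_node -> header.

    edges:     list of (source, destination) pairs
    backedges: the subset of edges that are loop backedges
    """
    back = set(backedges)
    result = []
    for src, dst in edges:
        if (src, dst) in back:
            # Existing backedges now end at the new node.
            result.append((src, new_node))
        else:
            result.append((src, dst))
    # The new node is the origin of the single remaining backedge.
    result.append((new_node, header))
    return result

# Part of the wc inner loop: blocks H, K, M, and L branch back to A.
edges = [("H", "K"), ("H", "A"), ("K", "A"), ("J", "L"),
         ("L", "M"), ("L", "A"), ("M", "A")]
backedges = [("H", "A"), ("K", "A"), ("L", "A"), ("M", "A")]
print(coalesce_backedges(edges, backedges, header="A", new_node="N"))
```

After the transformation the only edge that enters A from inside the loop is N -> A.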

[Figure: CFG Before Backedge Coalescing.]

[Figure: CFG After Backedge Coalescing.]

Step 3: Block Selection

Two conflicting goals:

(1) More blocks can potentially improve performance by eliminating branches among the blocks included.

(2) Too many blocks may result in performance loss due to over-saturation of processor resources or increased dependence height.

Enumerating Execution Paths

An execution path is a path of control flow from the entry block to an exit block in the region.

Mahlke assigns a priority to each execution path. This priority indicates the path's relative importance.

Paths are included in the hyperblock from the highest to the lowest priority based on the available resources.

Mahlke also estimates the available resources and the resource use of each path.

Path Priority Function

The path priority function combines four elements:

(1) path execution frequency;
(2) number of instructions in the path;
(3) path dependence height;
(4) hazard conditions on the path.

Intuition: include paths with fewer instructions, with lower dependence height, that have few hazard conditions, and that are executed very often.

Hazard conditions include procedure calls and unresolvable memory stores.

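\[ \mathit{priority}_i = \mathit{probability}_i \times \mathit{hazard}_i \times (\mathit{dep\_ratio}_i + \mathit{op\_ratio}_i + K) \]

\[ \mathit{dep\_ratio}_i = 1 - \frac{\mathit{dep\_height}_i}{\max_{1 \le j \le N} \mathit{dep\_height}_j} \]

\[ \mathit{op\_ratio}_i = 1 - \frac{\mathit{num\_ops}_i}{\max_{1 \le j \le N} \mathit{num\_ops}_j} \]

Mahlke uses a hazard multiplier of 0.25 for all paths containing a subroutine call or an unresolvable memory reference, and 1.0 for all other paths.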


The constant K makes the path with the largest dependence height and the most operations have a non-zero priority. Mahlke used K=0.1.
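The priority function can be evaluated as follows. This is a sketch of the function as described on these slides, with hypothetical field names; the two example paths and their numbers are invented for illustration only.

```python
def path_priorities(paths, K=0.1):
    """Compute the priority of each path.

    Each path is a dict with the keys "probability", "dep_height",
    "num_ops", and "has_hazard".
    """
    max_dep = max(p["dep_height"] for p in paths)
    max_ops = max(p["num_ops"] for p in paths)
    priorities = []
    for p in paths:
        # Both ratios are 0 for the tallest / longest path and grow
        # toward 1 for short paths.
        dep_ratio = 1.0 - p["dep_height"] / max_dep
        op_ratio = 1.0 - p["num_ops"] / max_ops
        # Hazard multiplier: 0.25 for paths with a subroutine call or an
        # unresolvable memory reference, 1.0 otherwise.
        hazard = 0.25 if p["has_hazard"] else 1.0
        priorities.append(p["probability"] * hazard * (dep_ratio + op_ratio + K))
    return priorities

# Invented numbers, for illustration only.
paths = [
    {"probability": 0.8, "dep_height": 4, "num_ops": 10, "has_hazard": False},
    {"probability": 0.2, "dep_height": 8, "num_ops": 20, "has_hazard": True},
]
print(path_priorities(paths))
```

The second path has the largest dependence height and the most operations, so both of its ratios are 0 and only K keeps its priority above zero.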

Block Selection Algorithm

ISSUE_WIDTH = 1 to 8 /* as specified in the machine description file */
RES_MULTIPLIER = 2
MAX_DEP_GROWTH = 3
MIN_PATH_PRIORITY_RATIO = 0.10

block_selection(region) {
    enumerate all paths in the region
    calculate priority of each path
    sort paths from highest to lowest priority
    /* Initialization of loop variables */
    avail_resources = ISSUE_WIDTH * dep_height_1 * RES_MULTIPLIER
    used_resources = 0
    last_priority = 0.0
    selected_paths = ∅
    for (i = 1 to num_paths) {
        /* Check if there are enough resources available to include the path */
        if ((num_ops_i + used_resources) > avail_resources) {
            continue
        }
        /* Prevent paths with large relative dependence heights from being included */
        if (dep_height_i > (dep_height_1 * MAX_DEP_GROWTH)) {
            continue
        }
        /* Do not include paths with a small relative priority to that of the last included path */
        if (priority_i < (last_priority * MIN_PATH_PRIORITY_RATIO)) {
            continue
        }
        /* Include the path in the hyperblock */
        selected_paths = selected_paths ∪ path_i
        used_resources = used_resources + num_ops_i
        last_priority = priority_i
    }
    selected_blocks = all blocks contained within selected_paths
    return selected_blocks
}
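A direct transcription of the pseudocode into runnable form is sketched below. It assumes that path enumeration and priority calculation have already been done, so each path arrives as a dict with hypothetical keys "blocks", "priority", "num_ops", and "dep_height"; the example paths are invented.

```python
def block_selection(paths, issue_width, res_multiplier=2,
                    max_dep_growth=3, min_path_priority_ratio=0.10):
    """Return the set of blocks selected for the hyperblock."""
    ordered = sorted(paths, key=lambda p: p["priority"], reverse=True)
    if not ordered:
        return set()
    # dep_height_1 is the dependence height of the highest-priority path.
    dep_height_1 = ordered[0]["dep_height"]
    avail_resources = issue_width * dep_height_1 * res_multiplier
    used_resources = 0
    last_priority = 0.0
    selected_paths = []
    for p in ordered:
        # Not enough resources left for this path.
        if p["num_ops"] + used_resources > avail_resources:
            continue
        # Dependence height too large relative to the best path.
        if p["dep_height"] > dep_height_1 * max_dep_growth:
            continue
        # Priority too small relative to the last included path.
        if p["priority"] < last_priority * min_path_priority_ratio:
            continue
        selected_paths.append(p)
        used_resources += p["num_ops"]
        last_priority = p["priority"]
    return {b for p in selected_paths for b in p["blocks"]}

# Invented paths, for illustration only.
paths = [
    {"blocks": "ABD", "priority": 0.90, "num_ops": 6, "dep_height": 3},
    {"blocks": "ACD", "priority": 0.50, "num_ops": 6, "dep_height": 4},
    {"blocks": "AF",  "priority": 0.40, "num_ops": 2, "dep_height": 10},
    {"blocks": "AE",  "priority": 0.01, "num_ops": 2, "dep_height": 2},
]
print(sorted(block_selection(paths, issue_width=4)))
```

In this example the third path is rejected because of its dependence height (10 > 3 * 3) and the fourth because its priority is less than 10% of the last included one.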

Block Selection

[Figure: control flow graph of the example with the enumerated execution paths.]

 1. A-B-D-E-F-H-N
 2. A-B-D-E-F-H-K-N
 3. A-B-D-E-G-J-M-N
 4. A-B-D-E-G-J-L-M-N
 5. A-B-D-E-G-I-M-N
 6. A-B-D-E-G-J-L-N
 7. A-B-D
 8. A-C-D-E-F-H-N
 9. A-C-D-E-F-H-K-N
10. A-C-D-E-G-J-M-N
11. A-C-D-E-G-J-L-M-N
12. A-C-D-E-G-I-M-N
13. A-C-D-E-G-J-L-N
14. A-C-D
15. A-B-D-E-F-G-I-M-N
16. A-B-D-E-F-G-J-M-N
17. A-B-D-E-F-G-J-L-M-N
18. A-B-D-E-F-G-J-L-N
19. A-C-D-E-F-G-I-M-N
20. A-C-D-E-F-G-J-M-N
21. A-C-D-E-F-G-J-L-M-N
22. A-C-D-E-F-G-J-L-N


Path Selection

Some paths that are not selected by the block selection algorithm are also included in the hyperblock because all their blocks belong to selected paths.

An alternative procedure could have eliminated these paths from the path set before the selection.

But the cost of such elimination would be higher than maintaining these extra paths in the set.


Step 4: Tail Duplication

To convert the set of selected blocks into a hyperblock (with a single entry block), control flow from non-selected blocks (side entry points) must be eliminated.

The tail duplication algorithm first marks all blocks that have side entry points.

Then the algorithm marks all blocks that can be reached from marked blocks.

All marked blocks form the tails that must be duplicated.
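The marking phase can be sketched as a small worklist algorithm. The names are hypothetical, and the example graph is an assumption read off the wc assembly listing after backedge coalescing, with every block except C selected.

```python
def blocks_to_duplicate(succ, selected, entry):
    """Return the selected blocks that form the tails to be duplicated.

    succ:     dict block -> list of successor blocks in the region
    selected: set of blocks chosen for the hyperblock
    entry:    entry block of the hyperblock
    """
    marked = set()
    # Step 1: mark selected blocks entered from a non-selected block.
    for src, dsts in succ.items():
        if src in selected:
            continue
        for dst in dsts:
            if dst in selected and dst != entry:
                marked.add(dst)
    # Step 2: mark every selected block reachable from a marked block.
    worklist = list(marked)
    while worklist:
        block = worklist.pop()
        for dst in succ.get(block, []):
            if dst in selected and dst != entry and dst not in marked:
                marked.add(dst)
                worklist.append(dst)
    return marked

# wc inner loop after backedge coalescing (N is the new backedge node).
succ = {
    "A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "EXIT"],
    "E": ["F", "G"], "F": ["H", "G"], "G": ["I", "J"], "H": ["K", "N"],
    "K": ["N"], "I": ["M"], "J": ["M", "L"], "L": ["M", "N"],
    "M": ["N"], "N": ["A"],
}
selected = set("ABDEFGHIJKLMN")
print(sorted(blocks_to_duplicate(succ, selected, entry="A")))
```

Because the only side entry is the edge from C into D, everything from D to N is marked for duplication.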

[Figure: control flow graph of the example during tail duplication, showing duplicated blocks such as I’ and J’.]

Anatomy of a Predicate Computation Operation

p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)

This instruction assigns values to Pout1 and Pout2.

The value assigned depends on:

- The result of the comparison
- The value of Pin
- The type of Pout1 and Pout2

<cmp> = eq | ne | gt

<type> = U | /U | OR | /OR | AND | /AND

Example:

pge p4(OR), p2(/U), r4, 127 (p1)

cmp = ge, Pin = p1, Pout1 = p4, Pout2 = p2, src1 = r4, src2 = 127

U or /U

Always write into the destination register:

if type = U then
    if Pin = 0 then Pout = 0
    elseif src1 <cmp> src2 then Pout = 1
    else Pout = 0

if type = /U then
    if Pin = 0 then Pout = 0
    elseif src1 <cmp> src2 then Pout = 0
    else Pout = 1

OR or /OR

Used when the execution of a block is enabled by one of multiple conditions.

OR type predicates must be initialized to 0 before their use.

OR: write into the destination register only if Pin = 1 and <cmp> is true:

if type = OR and Pin = 1 and src1 <cmp> src2 then Pout = 1

/OR: write into the destination register only if Pin = 1 and <cmp> is false:

if type = /OR and Pin = 1 and src1 !<cmp> src2 then Pout = 1

AND or /AND

Used when the execution of a block requires several conditions to be true.

AND type predicates are often initialized to 1.

if type = AND and Pin = 1 and src1 !<cmp> src2 then Pout = 0

if type = /AND and Pin = 1 and src1 <cmp> src2 then Pout = 0

Predicate Comparison Truth Table

• Pin predicates the entire predicate computation instruction.
• Notice that for an unconditional type, the value 0 is written in Pout even when Pin is 0.

Pin  Comparison  U  /U  OR  /OR  AND  /AND
0    0           0  0   -   -    -    -
0    1           0  0   -   -    -    -
1    0           0  1   -   1    0    -
1    1           1  0   1   -    -    0

Example:

pge p4(OR), p2(/U), r4, 127 (p1)

p1  Comparison  p4(OR)  p2(/U)
0   0           -       0
0   1           -       0
1   0           -       1
1   1           1       0
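The truth table can be expressed as one small function per destination predicate. This is an illustrative sketch (the function name and the string encoding of the types are hypothetical); a "-" entry in the table corresponds to returning the old value unchanged.

```python
def pdefine(ptype, pin, cmp_result, old):
    """New value of one destination predicate of a predicate define.

    ptype:      "U", "/U", "OR", "/OR", "AND", or "/AND"
    pin:        value of the input predicate (0 or 1)
    cmp_result: result of src1 <cmp> src2 (0 or 1)
    old:        current value of the destination predicate
    """
    # Unconditional types always write, even when Pin is 0.
    if ptype == "U":
        return 1 if (pin and cmp_result) else 0
    if ptype == "/U":
        return 1 if (pin and not cmp_result) else 0
    # All other types leave the destination untouched when Pin is 0.
    if not pin:
        return old
    if ptype == "OR":
        return 1 if cmp_result else old
    if ptype == "/OR":
        return 1 if not cmp_result else old
    if ptype == "AND":
        return 0 if not cmp_result else old
    if ptype == "/AND":
        return 0 if cmp_result else old
    raise ValueError("unknown predicate type: " + ptype)

# pge p4(OR), p2(/U), r4, 127 (p1) with p1 = 1, r4 = 65, and p4 cleared.
p1, r4, p4, p2 = 1, 65, 0, 0
cmp_result = 1 if r4 >= 127 else 0
p4 = pdefine("OR", p1, cmp_result, p4)
p2 = pdefine("/U", p1, cmp_result, p2)
print(p4, p2)
```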

Predicate Types

Unconditional predicates are used for control dependence sets that have a single edge.

OR-type predicates are used for predicates with multiple edges in their control dependence sets. (OR-type predicates must be cleared before entering the hyperblock.)

Step 5: If-conversion

For graph drawing, Mahlke uses the convention that the left edge out of a basic block is the true condition and the right one is the false.

In this control flow graph the control dependencies on blocks I and J are:

I: brG
J: /brG

[Figure: control flow graph after tail duplication, with the duplicated blocks D’-N’.]

Block  Control Dependences  Predicate Assignment
A      none                 null
B      none                 null
D      none                 null
E      none                 null
F      brE                  p1 (U)
G      /brE, /brF           p4 (OR)
H      brF                  p2 (U)
I      brG                  p7 (U)
J      /brG                 p5 (U)
K      brH                  p3 (U)
L      /brJ                 p8 (U)
M      brI, brJ, brL        p6 (OR)
N      none                 null


Step 5: If-conversion (example)

[Figure: control flow graph of the example as the blocks are if-converted one at a time.]

Inner Loop After If-conversion

pclr p4, p6
ld_i r98, r3, 0
add r27, r98, -1
st_i r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_i r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_i r33, r73, 0
add r32, r33, 1
st_i r73, 0, r32
pge p4(OR), p1(/U), 32, r4
pge p4(OR), p2(/U), r4, 127 (p1)
peq p3(U),-,0,r2 (p2)
peq p6(OR), p5(/U), r4, r10 (p4)
peq p7(U), -, r4, r10 (p4)
peq p6(OR), p8(/U), r4, 32 (p5)
ld_i r36, r72, 0 (p3)
add r35, r36, 1 (p3)
st_i r72, 0, r35 (p3)
add r2, r2, 1 (p3)
ld_i r39, r71, 0 (p7)
add r38, r39, 1 (p7)
st_i r71, 0, r38 (p7)
peq p6(OR), -, r4, 9 (p8)
mov r2, 0 (p6)
jmp loop

Predicate Hierarchy Graph

The Predicate Hierarchy Graph (PHG) is a directed acyclic graph representing the Boolean equations used to compute all the predicates in a hyperblock.

There are two types of nodes in the PHG: predicate nodes and condition nodes.

Two PHG nodes x and y are connected if the value specified by x is used to directly compute the value of y.

The PHG is used to derive relationships among predicates.

Example of PHG Construction

The predicate define instructions of the hyperblock, annotated with the condition nodes they contribute to the PHG:

pge p4(OR), p1(/U), 32, r4 [c1, /c1]
pge p4(OR), p2(/U), r4, 127 (p1) [c2, /c2]
peq p3(U),-,0,r2 (p2) [c3]
peq p6(OR), p5(/U), r4, r10 (p4) [c4, /c4]
peq p7(U), -, r4, r10 (p4) [c4]
peq p6(OR), p8(/U), r4, 32 (p5) [c5, /c5]
peq p6(OR), -, r4, 9 (p8) [c6]

[Figure: the PHG built incrementally as each of the instructions above is processed.]

Purpose of PHG

The PHG is used to allow the compiler to derive relations among the predicates. Mahlke identifies three predicate relations:

Ancestor: pi is an ancestor of pj if all conditions used to compute pj are derived from pi. The compiler can be sure that pj may be true only when pi is also true.

Control Path: There is a control path between pi and pj if there is at least one set of conditions under which both pj and pi are true. The compiler knows that pi and pj may be true at the same time.

Implies: pi implies pj if the conditions that make pi true guarantee that pj will also be true.

Imply Relationshippclr p4, p6ld_I r98, r3, 0add r27, r98, -1st_I r3, 0, r27blt r98, 1, LC ld_i r30, r3, 4add r29, r30, 1st_I r3, 4, r29ld_c r4, r30, 0beq r4, -1, EXITld_I r33, r73, 0add r32, r33, 1st_I r73, 0, r32pge p4(OR), p1(/U), 32, r4 [c1, /c1]pge p4(OR), p2(/U), r4, 127 (p1) [c2, /c2]peq p3(U),-,0,r2 (p2) [c3]peq p6(OR), p5(/U), r4, r10 (p4) [c4, /c4]peq p7(U), -, r4, r10 (p4) [c4]peq p6(OR), p8(/U), r4, 32 (p5) [c5, /c5]ld_I r36, r72, 0 (p3)add r35, r36, 1 (p3)st_I r72, 0, r35 (p3)add r2, r2, 1 (p3)ld_I r39, r71, 0 (p7)add r38, r39, 1 (p7)st_I r71, 0, r38 (p7)peq p6(OR), -, r4, 9 (p8) [c6]mov r2, 0 (p6)jmp loop

c1 /c1

c2 /c2

c5 /c5

c4 c4 /c4

p7 implies p6

Ancestor Relationshippclr p4, p6ld_I r98, r3, 0add r27, r98, -1st_I r3, 0, r27blt r98, 1, LC ld_i r30, r3, 4add r29, r30, 1st_I r3, 4, r29ld_c r4, r30, 0beq r4, -1, EXITld_I r33, r73, 0add r32, r33, 1st_I r73, 0, r32pge p4(OR), p1(/U), 32, r4 [c1, /c1]pge p4(OR), p2(/U), r4, 127 (p1) [c2, /c2]peq p3(U),-,0,r2 (p2) [c3]peq p6(OR), p5(/U), r4, r10 (p4) [c4, /c4]peq p7(U), -, r4, r10 (p4) [c4]peq p6(OR), p8(/U), r4, 32 (p5) [c5, /c5]ld_I r36, r72, 0 (p3)add r35, r36, 1 (p3)st_I r72, 0, r35 (p3)add r2, r2, 1 (p3)ld_I r39, r71, 0 (p7)add r38, r39, 1 (p7)st_I r71, 0, r38 (p7)peq p6(OR), -, r4, 9 (p8) [c6]mov r2, 0 (p6)jmp loop


Which predicate nodes are ancestors of p5?

T, p4, and p5

Control Path Relationship


Which predicate nodes are in the same control path as p5?

T, p1, p4, p5, p6, p8
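The three relations can be checked by brute force on the example listing. The sketch below is only an illustration, not Mahlke's PHG algorithm: it assumes the conditions c1 to c6 are independent Booleans, models each predicate as a Boolean function of those conditions (read off the predicate define instructions), and uses "pj is true only when pi is true" as a semantic stand-in for the graph-based ancestor relation. The function names are hypothetical.

```python
from itertools import product

NAMES = ["T", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]

def predicates(c1, c2, c3, c4, c5, c6):
    """Predicate values computed by the define instructions of the listing."""
    p1 = not c1                      # p1(/U) under T, condition /c1
    p2 = p1 and not c2               # p2(/U) under p1, condition /c2
    p4 = c1 or (p1 and c2)           # p4(OR): c1 under T, or c2 under p1
    p3 = p2 and c3                   # p3(U) under p2, condition c3
    p5 = p4 and not c4               # p5(/U) under p4, condition /c4
    p7 = p4 and c4                   # p7(U) under p4, condition c4
    p8 = p5 and not c5               # p8(/U) under p5, condition /c5
    # p6(OR): c4 under p4, or c5 under p5, or c6 under p8
    p6 = (p4 and c4) or (p5 and c5) or (p8 and c6)
    return dict(zip(NAMES, [True, p1, p2, p3, p4, p5, p6, p7, p8]))

# One row per assignment of the six conditions (assumed independent).
ROWS = [predicates(*bits) for bits in product([False, True], repeat=6)]

def implies(a, b):
    """a implies b: b is true in every row where a is true."""
    return all(row[b] for row in ROWS if row[a])

def control_path(a, b):
    """a and b share a control path: some row makes both true."""
    return any(row[a] and row[b] for row in ROWS)

def ancestors(p):
    """Semantic stand-in for the ancestor relation: p is true only when q is."""
    return [q for q in NAMES if implies(p, q)]

def same_control_path(p):
    return [q for q in NAMES if control_path(p, q)]

print(implies("p7", "p6"))
print(ancestors("p5"))
print(same_control_path("p5"))
```

Under these assumptions the three calls agree with the answers on the slides: p7 implies p6, the ancestors of p5 are T, p4, and p5, and the predicates on the same control path as p5 are T, p1, p4, p5, p6, and p8.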

Classical/ILP Optimizations in Predicated Code

Example: Copy Propagation

A: mov r1, r2 (p1)
B: add r2, r3, r4 (p2)
C: ld_i r5, r1, 0 (p3)

Is the copy propagation from instruction A to instruction C legal?

It depends on what we know about the relationship between p1, p2, and p3. If it is possible that p1 is false and p3 is true, the propagation would be wrong! It would also be wrong if B could redefine r2 before C executes, that is, if p2 and p3 could both be true.

Example: Copy Propagation

For instance, if we know that:
(1) p1 is an ancestor of both p2 and p3, and
(2) p2 and p3 are mutually exclusive,
then we can do the copy propagation safely.

[Figure: PHG fragment with edges labeled cm and /cm]
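The legality argument can be checked with a small simulation. The sketch below is an illustration under stated assumptions: the register values are arbitrary, the load in C is modeled as a function of its base register, and `run` and `propagation_is_safe` are hypothetical helper names. It executes the three instructions with and without the propagation for every predicate combination that a given relation allows.

```python
from itertools import product

def run(p1, p2, p3, propagate):
    """Execute A, B, C under the given predicates and return the final r5."""
    regs = {"r1": 10, "r2": 20, "r3": 3, "r4": 4, "r5": 0}  # arbitrary values
    if p1:                                   # A: mov r1, r2 (p1)
        regs["r1"] = regs["r2"]
    if p2:                                   # B: add r2, r3, r4 (p2)
        regs["r2"] = regs["r3"] + regs["r4"]
    if p3:                                   # C: ld_i r5, r1, 0 (p3)
        base = regs["r2"] if propagate else regs["r1"]
        regs["r5"] = 1000 + base             # stand-in for the loaded value
    return regs["r5"]

def propagation_is_safe(allowed):
    """True if both versions agree on every allowed predicate combination."""
    return all(
        run(p1, p2, p3, False) == run(p1, p2, p3, True)
        for p1, p2, p3 in product([False, True], repeat=3)
        if allowed(p1, p2, p3)
    )

def ancestor_and_exclusive(p1, p2, p3):
    # p1 ancestor of p2 and p3 (each true only if p1 is), p2 and p3 exclusive
    return (p1 or not p2) and (p1 or not p3) and not (p2 and p3)

def ancestor_only(p1, p2, p3):
    return (p1 or not p2) and (p1 or not p3)

print(propagation_is_safe(ancestor_and_exclusive))
print(propagation_is_safe(ancestor_only))
print(propagation_is_safe(lambda p1, p2, p3: True))
```

Only the first relation makes the propagation safe. With the ancestor relation alone, B may overwrite r2 between A and C; with no information, C may execute when A did not.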

Example: Instruction Scheduling

A: ld_i r1, r2, r3 (p2)
B: add r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul r6, r1, r7 (p3)

What are the data dependencies in the code above? It depends on what we know about the relationship between p2 and p3.


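For instance, if we know that p2 and p3 are mutually exclusive, we have this DDG: A → B and C → D (flow dependences on r1), with no edges between the two pairs.

But if p2 implies p3, then we have this DDG: A → B and C → D as before, plus an output dependence A → C and an anti dependence B → C on r1.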
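The two dependence graphs can be derived mechanically. The sketch below is an illustration, not the analysis used in the original compiler: it enumerates the values of p2 and p3 that a relation allows, records the register dependences that arise among the instructions that actually execute, and takes the union. The table `CODE` and the function `dependences` are hypothetical names.

```python
from itertools import product

# (name, destination register, source registers, guarding predicate)
CODE = [
    ("A", "r1", ["r2", "r3"], "p2"),   # ld_i r1, r2, r3 (p2)
    ("B", "r4", ["r1"], "p2"),         # add  r4, r1, 4  (p2)
    ("C", "r1", ["r5"], "p3"),         # ld_i r1, r5, 0  (p3)
    ("D", "r6", ["r1", "r7"], "p3"),   # mul  r6, r1, r7 (p3)
]

def dependences(allowed):
    """Union of register dependences over all allowed (p2, p3) values."""
    edges = set()
    for p2, p3 in product([False, True], repeat=2):
        if not allowed(p2, p3):
            continue
        active = {"p2": p2, "p3": p3}
        last_writer = {}   # register -> instruction that last wrote it
        readers = {}       # register -> readers since the last write
        for name, dest, srcs, pred in CODE:
            if not active[pred]:
                continue   # nullified instruction creates no dependence
            for reg in srcs:
                if reg in last_writer:
                    edges.add((last_writer[reg], name, "flow"))
                readers.setdefault(reg, []).append(name)
            for reader in readers.get(dest, []):
                if reader != name:
                    edges.add((reader, name, "anti"))
            if dest in last_writer:
                edges.add((last_writer[dest], name, "output"))
            last_writer[dest] = name
            readers[dest] = []
    return edges

exclusive = dependences(lambda p2, p3: not (p2 and p3))
p2_implies_p3 = dependences(lambda p2, p3: p3 or not p2)
print(sorted(exclusive))
print(sorted(p2_implies_p3))
```

With mutual exclusion only the two flow edges remain, so the pairs A, B and C, D can be scheduled independently. When p2 implies p3, C must stay after both A and B.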

Predicate-Specific Optimizations

- Predicate Promotion
- Branch Combining
- Predicate Loop Peeling

Predicate Promotion

The idea is to speculate the execution of instructions by replacing their predicate with a less constrained predecessor predicate.

Because the ancestor predicate is computed with fewer conditions, the execution of the promoted instruction is speculative.

The advantage of predicate promotion is the reduction of the dependence chain in a hyperblock.

Conditions for Simple Predicate Promotion

The predicate of an instruction op(x) can be promoted to its predecessor predicate if all the following conditions are true:
1. op(x) is predicated
2. op(x) has a destination register
3. op(x) has a speculative version
4. there is a unique op(y) lexically before op(x) such that dest(y) = pred(x)
5. dest(x) is not live at op(y)
6. for any op(j) such that there is a path op(j)…op(y), dest(x) ≠ dest(j)
7. it is profitable to promote op(x)

Example of Predicate Promotion (qsort)

Original code:

1 LA: ld_i r20, r24, r101
2 ld_i r23, r2, r102
3 pge p126(U), p127(U), r20, r23
4 LB: ld_i r6, r123, 0 (p126)
5 add r123, r123, 8 (p126)
6 add r9, r9, 1 (p126)
7 add r101, r101, 8 (p126)
8 LC: ld_i r6, r124, 8 (p127)
9 add r124, r124, 8 (p127)
10 add r124, r124, 8 (p127)
11 add r102, r102, 8 (p127)
12 LD: st_i r114, 0, r23
13 st_i r114, 4, r6
14 add r7, r7, 1
15 add r114, r114, 8
16 bge r9, r3, EXIT
17 LE: blt r8, r1, LA

After predicate promotion of the two loads (instructions 4 and 8):

1 LA: ld_i r20, r24, r101
2 ld_i r23, r2, r102
3 pge p126(U), p127(U), r20, r23
4 LB: ld_i r6, r123, 0
5 add r123, r123, 8 (p126)
6 add r9, r9, 1 (p126)
7 add r101, r101, 8 (p126)
8 LC: ld_i r60, r124, 8
8a mov r6, r60 (p127)
9 add r124, r124, 8 (p127)
10 add r124, r124, 8 (p127)
11 add r102, r102, 8 (p127)
12 LD: st_i r114, 0, r23
13 st_i r114, 4, r6
14 add r7, r7, 1
15 add r114, r114, 8
16 bge r9, r3, EXIT
17 LE: blt r8, r1, LA
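The renaming of r6 to r60 in instruction 8, together with the predicated copy 8a, is what keeps the second promotion correct: when the load at LC is promoted, r6 already holds the value loaded at LB, so condition 5 (dest(x) is not live) does not hold for r6. The sketch below models only instructions 4, 8, and 8a. It assumes that p127 is the complement of p126, and the memory contents, addresses, and function names are hypothetical.

```python
# Hypothetical memory contents: address -> value
MEM = {100: 11, 208: 22}
R123, R124 = 100, 200    # hypothetical base addresses

def original(p126):
    """Instructions 4 and 8 with their original predicates."""
    p127 = not p126                  # assumption: complementary predicates
    r6 = None
    if p126:
        r6 = MEM[R123 + 0]           # 4: ld_i r6, r123, 0 (p126)
    if p127:
        r6 = MEM[R124 + 8]           # 8: ld_i r6, r124, 8 (p127)
    return r6

def promoted(p126):
    """Both loads promoted; the second one renamed, as in the example."""
    p127 = not p126
    r6 = MEM[R123 + 0]               # 4: ld_i r6, r123, 0 (speculative)
    r60 = MEM[R124 + 8]              # 8: ld_i r60, r124, 8 (speculative)
    if p127:
        r6 = r60                     # 8a: mov r6, r60 (p127)
    return r6

def promoted_without_renaming(p126):
    """Incorrect variant: the second load overwrites the live value in r6."""
    r6 = MEM[R123 + 0]               # 4: speculative load
    r6 = MEM[R124 + 8]               # 8: speculative load, same destination
    return r6

for p126 in (True, False):
    print(p126, original(p126), promoted(p126), promoted_without_renaming(p126))
```

The renamed version matches the original for both outcomes of the comparison, while the variant without renaming delivers the wrong value in r6 whenever p126 is true.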

Branch Combining

Problem: too many infrequently executed branches in a hyperblock

Example: a loop in grep

1 A: bge r1, r5, EXIT1
2 ld_c r3, r1, 0
3 beq r3, 10, EXIT2
4 beq r3, 0, EXIT3
5 bge r2, r6, EXIT4
6 st_c r2, 0, r3
7 add r1, r1, 1
8 add r2, r2, 1
9 jmp A

Branch Combining

Solution: replace a group of exit branches by a corresponding group of predicate define instructions.

All predicate definitions write into the same predicate register using the OR-type semantics.

The resultant predicate will be set to 1 if any of the exit branches were to be taken.

Because not exiting the hyperblock is the most common case, the predicate will usually be false.

Branch Combining

Original loop, unrolled twice:

1 A: bge r1, r5, EXIT1
2 ld_c r3, r1, -1
3 beq r3, 10, EXIT2
4 beq r3, 0, EXIT3
5 bge r2, r6, EXIT4
6 st_c r2, -1, r3
7 bge r1, r7, EXIT5
8 ld_c r4, r1, 0
9 beq r4, 10, EXIT6
10 beq r4, 0, EXIT7
11 bge r2, r8, EXIT8
12 st_c r2, 0, r4
13 add r1, r1, 2
14 add r2, r2, 2
15 jmp A

After branch combining:

0 A: pclr p1
1 pge p1(OR), r1, r5
2 ld_c r3, r1, -1
3 peq p1(OR), r3, 10
4 peq p1(OR), r3, 0
5 pge p1(OR), r2, r6
7 pge p1(OR), r1, r7
8 ld_c r4, r1, 0
9 peq p1(OR), r4, 10
10 peq p1(OR), r4, 0
11 pge p1(OR), r2, r8
16 jmp Decode (p1)
6' st_c r2, -1, r3
12 st_c r2, 0, r4
13 add r1, r1, 2
14 add r2, r2, 2
15 jmp A

Decode:
1 bge r1, r5, EXIT1
3 beq r3, 10, EXIT2
4 beq r3, 0, EXIT3
5 bge r2, r6, EXIT4
6 st_c r2, -1, r3
7 bge r1, r7, EXIT5
9 beq r4, 10, EXIT6
10 beq r4, 0, EXIT7
11 jmp EXIT8
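The equivalence of the two versions can be checked on one pass through the unrolled loop body. The sketch below is an illustration under assumptions: memory is modeled as Python dictionaries (`src` for the bytes read by ld_c, `dst` for the bytes written by st_c), a load from an unmapped address silently returns 0 to stand in for a non-faulting speculative load, and the function names are hypothetical. Each function returns the exit taken (or "LOOP") together with the stores performed.

```python
def original(r1, r2, r5, r6, r7, r8, src):
    """One pass through the unrolled loop with its eight exit branches."""
    dst = {}
    if r1 >= r5: return "EXIT1", dst
    r3 = src.get(r1 - 1, 0)              # 2: ld_c r3, r1, -1
    if r3 == 10: return "EXIT2", dst
    if r3 == 0: return "EXIT3", dst
    if r2 >= r6: return "EXIT4", dst
    dst[r2 - 1] = r3                     # 6: st_c r2, -1, r3
    if r1 >= r7: return "EXIT5", dst
    r4 = src.get(r1, 0)                  # 8: ld_c r4, r1, 0
    if r4 == 10: return "EXIT6", dst
    if r4 == 0: return "EXIT7", dst
    if r2 >= r8: return "EXIT8", dst
    dst[r2] = r4                         # 12: st_c r2, 0, r4
    return "LOOP", dst

def combined(r1, r2, r5, r6, r7, r8, src):
    """Same pass after branch combining: one OR-type predicate, one exit."""
    dst = {}
    p1 = False                           # 0: pclr p1
    p1 |= r1 >= r5                       # OR-type defines accumulate into p1
    r3 = src.get(r1 - 1, 0)              # speculative load
    p1 |= r3 == 10
    p1 |= r3 == 0
    p1 |= r2 >= r6
    p1 |= r1 >= r7
    r4 = src.get(r1, 0)                  # speculative load
    p1 |= r4 == 10
    p1 |= r4 == 0
    p1 |= r2 >= r8
    if p1:                               # 16: jmp Decode (p1)
        if r1 >= r5: return "EXIT1", dst
        if r3 == 10: return "EXIT2", dst
        if r3 == 0: return "EXIT3", dst
        if r2 >= r6: return "EXIT4", dst
        dst[r2 - 1] = r3                 # store replicated in the decode block
        if r1 >= r7: return "EXIT5", dst
        if r4 == 10: return "EXIT6", dst
        if r4 == 0: return "EXIT7", dst
        return "EXIT8", dst              # only remaining possibility
    dst[r2 - 1] = r3                     # 6': store moved below the exit
    dst[r2] = r4
    return "LOOP", dst

print(original(1, 1, 9, 9, 9, 9, {0: 65, 1: 66}))
print(combined(1, 1, 9, 9, 9, 9, {0: 65, 1: 10}))
```

The store numbered 6 cannot be speculated, so it appears twice in the combined version: once below the combined exit (6') and once in the decode block at its original position relative to the exit branches.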

Instructions Between Combined Branches

Instructions between combined branches are speculated.

For instructions that are between combined branches but cannot be speculated, the following must be done:

(1) move the instructions below the combined exit branch in the hyperblock.

(2) replicate these instructions in their original position with respect to the exit branches in the decode block.

Backend Compilation with Hyperblocks

[Figure: backend compilation phases. Boxes: Register Allocation; Instruction Scheduling; Classical Optim.; ILP/Predicate-Specific Optimizations; Hyperblock/Superblock Formation; Classical Optim.; Lcode generation. Other labels: CFG Generator; Equation Solver; predicate relations; dataflow information; predicate aware.]
