cmput680 - winter 2006
Post on 05-Jan-2016
CMPUT 329 - Computer Organization and Architecture II
CMPUT680 - Winter 2006
Topic I: Superblock and Hyperblock Formation
José Nelson Amaral
http://www.cs.ualberta.ca/~amaral/courses/680
Instruction Level Parallelism Optimizations
The objective of an optimizer is to reduce the number and complexity of the instructions executed by the processor.
Superscalar or Very Long Instruction Word (VLIW) processors can reduce the execution time even when the number of instructions executed increases moderately, as long as the dependence height is reduced.
Speculative and Predicated Execution
Speculative Execution: execution of an instruction before knowing that its execution is required.
Predicated Execution: architecture-supported conditional execution of an instruction based on the value of a Boolean source operand, referred to as the predicate of the instruction.
Superblock: structure used to implement compiler-controlled speculative execution.
If-conversion: compiler algorithm that converts conditional branches into predicate-defining instructions to allow the use of predication.
Trace Scheduling (Fisher, 1981)
Some optimization and scheduling decisions may decrease the execution time for one control path while increasing the execution time for another path.
Thus decisions should favor the more frequently executed paths to improve overall performance.
Trace scheduling divides a procedure into a set of frequently executed traces (paths).
Trace Scheduling
There may be conditional branches out of the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).
These control-flow transitions are ignored during trace scheduling.
After scheduling, bookkeeping is required to ensure the correct execution of off-trace code.
Bookkeeping for Trace Scheduling
Original trace: Instr 1, Instr 2, Instr 3, Instr 4, Instr 5
Scheduled trace: Instr 2, Instr 3, Instr 4, Instr 1, Instr 5
What bookkeeping is required when Instr 1 is moved below the side entrance in the trace?
Additional code shown in the figure after bookkeeping: Instr 3, Instr 4
Bookkeeping for Trace Scheduling
Original trace: Instr 1, Instr 2, Instr 3, Instr 4, Instr 5
Scheduled trace: Instr 1, Instr 5, Instr 2, Instr 3, Instr 4
What bookkeeping is required when Instr 5 moves above the side entrance in the trace?
Additional code shown in the figure after bookkeeping: Instr 5
Superblocks
A superblock is a trace without side entrances, i.e., control can only enter from the top, but it can leave at one or more exit points.
The formation of superblocks creates additional optimization opportunities because constraints associated with infrequently executed paths of control are ignored (thus these constraints do not inhibit optimizations that favor frequently executed paths).
Superblock Formation (Example)
[Figure: weighted control flow graph entered from Y and exiting to Z, drawn twice on the slide. Block labels with execution counts, as extracted: D100, C10, B90, E90, D0, F100. Edge weights as extracted: 1, 90, 10, 90, 0, 0, 90, 10, 99, 1.]
Superblock Formation (Example)
[Figure: the same weighted control flow graph, with a set of nodes ending in F selected as a trace.]
Is this a superblock?
No, a superblock cannot have side entrances, and this set of nodes has two side entrances into node F. How do we convert it into a superblock?
Superblock Formation (Example)
[Figure: the control flow graph after tail duplication. Block F now has execution count 90, with outgoing edge weights 89.1 and 0.9. Its duplicate F’ has execution count 10, with outgoing edge weights 9.9 and 0.1.]
Tail duplication is the duplication of the basic blocks that appear after a side entrance, in order to eliminate side entrances and transform a trace into a superblock.
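A minimal sketch of tail duplication, assuming the control flow graph is a successor map and the trace is given as a list of block names. The graph below is a hypothetical one loosely modeled on the slide (block names, edges, and the `tail_duplicate` helper are illustrative assumptions, not taken from the course material).

```python
def tail_duplicate(succ, trace):
    """Return a new successor map in which `trace` has no side entrances.

    succ  : dict mapping block name -> list of successor block names
    trace : list of block names forming the selected trace
    """
    succ = {node: list(targets) for node, targets in succ.items()}
    originals = list(succ)
    # On-trace predecessor of every trace block except the head.
    trace_pred = {b: a for a, b in zip(trace, trace[1:])}

    # Find the first trace block (after the head) entered from off the trace.
    first = None
    for index, block in enumerate(trace[1:], start=1):
        preds = [p for p in originals if block in succ[p]]
        if any(p != trace_pred[block] for p in preds):
            first = index
            break
    if first is None:
        return succ  # no side entrances: already a superblock

    # Duplicate every block from the first side entrance to the trace end.
    tail = trace[first:]
    copy = {block: block + "'" for block in tail}
    for block in tail:
        succ[copy[block]] = [copy.get(s, s) for s in succ[block]]

    # Retarget side entrances to the duplicated tail.
    for p in originals:
        succ[p] = [copy[s] if s in copy and trace_pred[s] != p else s
                   for s in succ[p]]
    return succ


# Hypothetical graph: trace A-B-E-F with side entrances into F from C and D.
cfg = {
    "A": ["B", "C"], "B": ["E", "D"], "C": ["F"], "D": ["F"],
    "E": ["F"], "F": ["A", "Z"], "Z": [],
}
result = tail_duplicate(cfg, ["A", "B", "E", "F"])
```

In this sketch the off-trace blocks C and D end up branching to the copy F', so the only way into the original F is from E, and A-B-E-F becomes a superblock.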
Common Subexpression Elimination in Superblocks
Original Code (edge weights shown in the figure: 1, 1)
opA: mul r1,r2,3
opC: mul r3,r2,3
opB: add r2,r2,199
Code After Superblock Formation (edge weight shown in the figure: 1)
opA: mul r1,r2,3
opC: mul r3,r2,3
opB: add r2,r2,199
opC’: mul r3,r2,3
Code After Common Subexpression Elimination (edge weight shown in the figure: 1)
opA: mul r1,r2,3
opC: mov r3,r1
opB: add r2,r2,199
opC’: mul r3,r2,3
Operation Migration in Superblocks
Original Code (X, Y, and Z are labels in the figure)
…
mov r0,r1
…
mov r0,r2
…
mov r0,r3
…
add r1,r1,4
add r2,r2,4
add r3,r3,4
After Operation Migration (X, Y, and Z are labels in the figure)
…
…
…
…
add r1,r1,4
add r2,r2,4
add r3,r3,4
mov r0,r1
mov r0,r2
mov r0,r3
Global Variable Migration in Superblock Loops
Original Program Segment
OpA: ld_i r4, x, r0
OpB: add r4, r4, r1
OpC: st_i x, r0, r4
OpD: add r1, r1, 1
OpE: add r0, r0, 1
[Figure: the loop is annotated with weight 100. A state table shows MEM[r0+x] with contents 0, 10, 20, 30, and the registers r4 (empty), r1 = 1, and r0 = 1.]
[Animation frames, slides 17 to 27: the same program segment is repeated while the figure steps through its execution, updating the displayed values of MEM[r0+x], r4, r1, and r0 after each operation.]
Global Variable Migration in Superblock Loops
Original Program Segment
OpA: ld_i r4, x, r0
OpB: add r4, r4, r1
OpC: st_i x, r0, r4
OpD: add r1, r1, 1
OpE: add r0, r0, 1
After Variable Migration (operations in the order they appear in the figure; loop weight 100)
OpC: st_i x, r0, r4
OpC’: st_i x, r0, r4
OpE: add r0, r0, 1
OpA: ld_i r4, x, r0
OpB: add r4, r4, r1
OpD: add r1, r1, 1
Superblock Enlarging Optimizations
By enlarging a superblock, we can provide the scheduler with more independent instructions to choose from in each cycle.
Superblock enlarging optimizations:
Branch target expansion
Loop unrolling
Loop peeling
Branch Target Expansion
Idea: expand the superblock with the target of a likely taken branch.
Before (weights shown in the figure: 20, 100)
blt r1, r2, L3
beq r3, r4, L5
L1:
jump L4
L2:
L3:
After (weight shown in the figure: 20)
blt r1, r2, L3
beq r3, r4, L5
L1:
jump L4
L2:
Superblock Loops
A superblock loop is a superblock that has a frequently taken backedge from its last node to its first node.
We will study the extension of some common loop optimizations to superblocks.
Dependence Removing Optimizations
The goal is to eliminate data dependences between instructions within frequently executed superblocks.
Dependence removing optimizations include:
Register renaming
Accumulator variable expansion
Induction variable expansion
Search variable expansion
Operation combining
Strength reduction
Tree height reduction
Instruction Latencies for Examples
Function        Latency
Int ALU         1
Int multiply    3
Int divide      10
Branch          1
Memory load     2
Memory store    1
FP ALU          3
FP conversion   3
FP multiply     3
FP divide       10
Register Renaming Example
Original Loop
for (j=0; j<n; j++) { C(j) = A(j)+B(j) }
Assembly Code
L1: ld_f f2, A, r1 (a)
    ld_f f3, B, r1 (b)
    add_f f4, f2, f3 (c)
    st_f C, r1, f4 (d)
    add r1, r1, 4 (e)
    blt r1, r5, L1 (f)
For all the examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus for the code above, we obtain the following schedule.
Register Renaming Example
[Figure: Code Schedule for the original loop, instructions against cycles. a and b occupy two cycles each, c three cycles, and d, e, and f one cycle each.]
7 cycles / 1 iteration
Register Renaming Example
After Loop Unrolling
L1: ld_f f2, A, r1 (a)
    ld_f f3, B, r1 (b)
    add_f f4, f2, f3 (c)
    st_f C, r1, f4 (d)
    add r1, r1, 4 (e)
    ld_f f2, A, r1 (f)
    ld_f f3, B, r1 (g)
    add_f f4, f2, f3 (h)
    st_f C, r1, f4 (i)
    add r1, r1, 4 (j)
    ld_f f2, A, r1 (k)
    ld_f f3, B, r1 (l)
    add_f f4, f2, f3 (m)
    st_f C, r1, f4 (n)
    add r1, r1, 4 (o)
    blt r1, r5, L1 (p)
Loop Unrolling
[Figure: Code Schedule for the unrolled loop, instructions a to p against cycles. The three copies of the loop body execute one after the other.]
19 cycles / 3 iterations = 6.3 cycles / iteration
Register Renaming
After Register Renaming (the last digit of each renamed register is a copy number: f21 is the first copy of f2, r12 the second copy of r1)
L1: ld_f f21, A, r11 (a)
    ld_f f31, B, r11 (b)
    add_f f41, f21, f31 (c)
    st_f C, r11, f41 (d)
    add r12, r11, 4 (e)
    ld_f f22, A, r12 (f)
    ld_f f32, B, r12 (g)
    add_f f42, f22, f32 (h)
    st_f C, r12, f42 (i)
    add r13, r12, 4 (j)
    ld_f f23, A, r13 (k)
    ld_f f33, B, r13 (l)
    add_f f43, f23, f33 (m)
    st_f C, r13, f43 (n)
    add r11, r13, 4 (o)
    blt r11, r5, L1 (p)
Loop Unrolling and Register Renaming
[Figure: Code Schedule after register renaming, instructions a to p against cycles. The three copies of the loop body now overlap.]
8 cycles / 3 iterations = 2.7 cycles / iteration
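The cycle counts quoted for this example (7 cycles for the original loop, 19 for the unrolled loop, 8 after renaming) can be reproduced with a small model. This is a sketch under stated assumptions, not the scheduler used in the course: resources are infinite, each instruction issues as early as its register dependences allow, an anti-dependent write may issue in the same cycle as the earlier read, an output-dependent write issues at least one cycle after the earlier write, and the arrays A, B, and C are assumed not to alias. The operand encoding and the helper names are illustrative.

```python
LATENCY = {"load": 2, "fp_alu": 3, "store": 1, "int_alu": 1, "branch": 1}


def schedule_length(instrs):
    """Length in cycles of an as-early-as-possible schedule.

    instrs: list of (kind, dest_registers, source_registers) in program order.
    """
    issue = []          # issue cycle of each instruction
    last_writer = {}    # register -> index of the most recent writer
    readers = {}        # register -> readers since the most recent write
    for j, (kind, dests, srcs) in enumerate(instrs):
        t = 0
        for reg in srcs:                      # flow dependence
            if reg in last_writer:
                i = last_writer[reg]
                t = max(t, issue[i] + LATENCY[instrs[i][0]])
        for reg in dests:
            for i in readers.get(reg, []):    # anti dependence
                t = max(t, issue[i])
            if reg in last_writer:            # output dependence (assumed)
                t = max(t, issue[last_writer[reg]] + 1)
        issue.append(t)
        for reg in srcs:
            readers.setdefault(reg, []).append(j)
        for reg in dests:
            last_writer[reg] = j
            readers[reg] = []
    return max(t + LATENCY[kind] for t, (kind, _, _) in zip(issue, instrs))


def iteration(f2, f3, f4, r_in, r_out):
    """One copy of the loop body C(j) = A(j) + B(j)."""
    return [
        ("load", [f2], [r_in]),
        ("load", [f3], [r_in]),
        ("fp_alu", [f4], [f2, f3]),
        ("store", [], [r_in, f4]),
        ("int_alu", [r_out], [r_in]),
    ]


def branch(reg):
    return [("branch", [], [reg, "r5"])]


original = iteration("f2", "f3", "f4", "r1", "r1") + branch("r1")
unrolled = iteration("f2", "f3", "f4", "r1", "r1") * 3 + branch("r1")
renamed = (iteration("f21", "f31", "f41", "r11", "r12")
           + iteration("f22", "f32", "f42", "r12", "r13")
           + iteration("f23", "f33", "f43", "r13", "r11")
           + branch("r11"))

lengths = [schedule_length(p) for p in (original, unrolled, renamed)]
```

Without renaming, every reuse of f2, f3, f4, and r1 forces the next copy of the body to wait for the previous one; with renaming only the true dependences through the address register remain.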
Accumulator Variable Expansion
An accumulator variable accumulates a sum or product in each iteration of a loop.
Accumulator variable expansion eliminates redefinitions of an accumulator variable within an unrolled loop by creating k temporary accumulators (k is the number of accumulation instructions). The values of all temporary accumulators must be summed at the exit points of the loop where the accumulator is live.
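A minimal source-level sketch of the idea with k = 3, assuming integer data so that reassociating the additions cannot change the result (with floating point the expanded sum may round differently). The function names are illustrative.

```python
def dot_sequential(a, b):
    """Single accumulator: every addition depends on the previous one."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_expanded(a, b):
    """Loop unrolled three times with three temporary accumulators."""
    n = len(a)
    acc1 = acc2 = acc3 = 0
    i = 0
    while i + 3 <= n:
        acc1 += a[i] * b[i]          # the three chains are independent
        acc2 += a[i + 1] * b[i + 1]
        acc3 += a[i + 2] * b[i + 2]
        i += 3
    total = acc1 + acc2 + acc3       # combine the temporaries at loop exit
    while i < n:                     # leftover iterations
        total += a[i] * b[i]
        i += 1
    return total


a = [1, 2, 3, 4, 5, 6, 7]
b = [7, 6, 5, 4, 3, 2, 1]
```

The three accumulation chains can be scheduled in parallel, at the price of the extra additions after the loop.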
Accumulator Expansion Example
Original Loop
for (k=0; k<n; k++) { C(i,j) = C(i,j) + A(i,k) * B(k,j) }
Assembly Code
    ld_f f1, C, r2 (-)
L1: ld_f f3, A, r4 (a)
    ld_f f5, B, r6 (b)
    mul_f f7, f3, f5 (c)
    add_f f1, f1, f7 (d)
    add r4, r4, 4 (e)
    add r6, r6, r8 (f)
    blt r4, r9, L1 (g)
    st_f C, r2, f1 (-)
For all examples we assume a superscalar processor with infinite resources and no register renaming hardware. Thus for the code above, we obtain the following schedule.
Accumulator Expansion Example
[Figure: Code Schedule for the original loop, instructions a to g against cycles. a and b occupy two cycles each, c and d three cycles each, and e, f, and g one cycle each.]
8 cycles / 1 iteration
Loop Unrolling and Register Renaming
After Unrolling and Renaming
    ld_f f1, C, r2 (-)
L1: ld_f f31, A, r41 (a)
    ld_f f51, B, r61 (b)
    mul_f f71, f31, f51 (c)
    add_f f1, f1, f71 (d)
    add r42, r41, 4 (e)
    add r62, r61, r8 (f)
    ld_f f32, A, r42 (g)
    ld_f f52, B, r62 (h)
    mul_f f72, f32, f52 (i)
    add_f f1, f1, f72 (j)
    add r43, r42, 4 (k)
    add r63, r62, r8 (l)
    ld_f f33, A, r43 (m)
    ld_f f53, B, r63 (n)
    mul_f f73, f33, f53 (o)
    add_f f1, f1, f73 (p)
    add r41, r43, 4 (q)
    add r61, r63, r8 (r)
    blt r41, r9, L1 (s)
    st_f C, r2, f1 (-)
Loop Unrolling and Register Renaming
[Figure: Code Schedule after unrolling and renaming, instructions a to s against cycles. The accumulations d, j, and p into f1 still execute one after the other.]
14 cycles / 3 iterations = 4.7 cycles / iteration
Accumulator Expansion
    ld_f f11, C, r2 (-)
    mov_f f12, 0 (-)
    mov_f f13, 0 (-)
L1: ld_f f31, A, r41 (a)
    ld_f f51, B, r61 (b)
    mul_f f71, f31, f51 (c)
    add_f f11, f11, f71 (d)
    add r42, r41, 4 (e)
    add r62, r61, r8 (f)
    ld_f f32, A, r42 (g)
    ld_f f52, B, r62 (h)
    mul_f f72, f32, f52 (i)
    add_f f12, f12, f72 (j)
    add r43, r42, 4 (k)
    add r63, r62, r8 (l)
    ld_f f33, A, r43 (m)
    ld_f f53, B, r63 (n)
    mul_f f73, f33, f53 (o)
    add_f f13, f13, f73 (p)
    add r41, r43, 4 (q)
    add r61, r63, r8 (r)
    blt r41, r9, L1 (s)
    add_f f11, f11, f12 (-)
    add_f f11, f11, f13 (-)
    st_f C, r2, f11 (-)
[Figure: Code Schedule after accumulator expansion, instructions a to s against cycles. The accumulations d, j, and p now overlap.]
10 cycles / 3 iterations = 3.3 cycles / iteration
Induction Variable Expansion
An induction variable is used to index through loop iterations and through regular data structures, such as arrays.
Induction variable expansion eliminates dependences between definitions of induction variables and their uses in unrolled loops.
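A minimal source-level sketch, assuming a loop unrolled three times whose index advances by a stride `k` per original iteration. Both helpers are illustrative and return the sequence of indices the loop body would use.

```python
def indices_chained(j, k, n):
    """Unrolled loop with one induction variable: each index depends on
    the previous one."""
    used = []
    for _ in range(n // 3):
        j1 = j
        j2 = j1 + k
        j3 = j2 + k
        j = j3 + k
        used += [j1, j2, j3]
    return used


def indices_expanded(j, k, n):
    """Three independent induction variables, each advanced by 3*k."""
    j1, j2, j3 = j, j + k, j + 2 * k   # initialized before the loop
    step = 3 * k
    used = []
    for _ in range(n // 3):
        used += [j1, j2, j3]
        j1 += step                     # the three updates are independent
        j2 += step
        j3 += step
    return used
```

After expansion no index computation waits for another within an iteration; the cost is two extra registers and the initialization code before the loop.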
Induction Variable Expansion Example
Original Loop
for (i=0; i<n; i++) { C(j) = A(j) * B(j); j = j + K; }
Assembly Code
L1: ld_f f3, A, r2 (a)
    ld_f f4, B, r2 (b)
    mul_f f5, f3, f4 (c)
    st_f C, r2, f5 (d)
    add r2, r2, r7 (e)
    add r1, r1, 1 (f)
    blt r1, r6, L1 (g)
[Figure: Code Schedule for the original loop, instructions a to g against cycles. a and b occupy two cycles each, c three cycles, and d, e, f, and g one cycle each.]
6 cycles / 1 iteration
Loop Unrolling and Register Renaming
After Unrolling and Renaming
L1: ld_f f31, A, r21 (a)
    ld_f f41, B, r21 (b)
    mul_f f51, f31, f41 (c)
    st_f C, r21, f51 (d)
    add r22, r21, r7 (e)
    ld_f f32, A, r22 (f)
    ld_f f42, B, r22 (g)
    mul_f f52, f32, f42 (h)
    st_f C, r22, f52 (i)
    add r23, r22, r7 (j)
    ld_f f33, A, r23 (k)
    ld_f f43, B, r23 (l)
    mul_f f53, f33, f43 (m)
    st_f C, r23, f53 (n)
    add r21, r23, r7 (o)
    add r1, r1, 3 (p)
    blt r1, r6, L1 (q)
Loop Unrolling and Register Renaming
[Figure: Code Schedule after unrolling and renaming, instructions a to q against cycles. Each copy of the body waits for the index update of the previous copy.]
8 cycles / 3 iterations = 2.7 cycles / iteration
Induction Variable Expansion
    mov r21, r2 (-)
    add r22, r21, r7 (-)
    add r23, r22, r7 (-)
    mul r71, r7, 3 (-)
L1: ld_f f31, A, r21 (a)
    ld_f f41, B, r21 (b)
    mul_f f51, f31, f41 (c)
    st_f C, r21, f51 (d)
    ld_f f32, A, r22 (f)
    ld_f f42, B, r22 (g)
    mul_f f52, f32, f42 (h)
    st_f C, r22, f52 (i)
    ld_f f33, A, r23 (k)
    ld_f f43, B, r23 (l)
    mul_f f53, f33, f43 (m)
    st_f C, r23, f53 (n)
    add r21, r21, r71 (e)
    add r22, r22, r71 (j)
    add r23, r23, r71 (o)
    add r1, r1, 3 (p)
    blt r1, r6, L1 (q)
[Figure: Code Schedule after induction variable expansion, instructions a to q against cycles. The three copies of the body start in the same cycle.]
6 cycles / 3 iterations = 2 cycles / iteration
Search Variable Expansion
A search variable is a single value (e.g., a minimum or a maximum) computed for a collection of data.
Search variable expansion eliminates dependences between definitions of search variables and their uses in unrolled loops.
Each search variable is expanded into k temporary independent variables. At the exit of the loop the value of the original search variable is obtained by comparing the values of the temporary search variables.
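A minimal source-level sketch for a maximum search with k = 3, assuming a non-empty input list. The function name is illustrative.

```python
def max_expanded(values):
    """Maximum of a non-empty list, using three temporary search variables."""
    n = len(values)
    m1 = m2 = m3 = values[0]       # every temporary starts at a real element
    i = 0
    while i + 3 <= n:
        if values[i] > m1:         # the three compare-and-update chains
            m1 = values[i]         # do not depend on each other
        if values[i + 1] > m2:
            m2 = values[i + 1]
        if values[i + 2] > m3:
            m3 = values[i + 2]
        i += 3
    best = max(m1, m2, m3)         # compare the temporaries at loop exit
    while i < n:                   # leftover iterations
        if values[i] > best:
            best = values[i]
        i += 1
    return best


data = [3, 9, -2, 7, 11, 0, 5]
```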
Superblock Scheduling
Superblock scheduling is a two-step process:
Step 1: Build the dependence graph.
Step 2: List scheduling using the dependence graph, instruction latencies, and resource constraints of the processor.
List Scheduling
List scheduling employs heuristics to choose, among all ready nodes, the combination of nodes that should be scheduled in the current cycle.
A node is ready if:
(i) all its parents in the dependence graph have been scheduled;
(ii) the result produced by each parent is available; and
(iii) the resources required by the node are available.
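A minimal sketch of a cycle-by-cycle list scheduler that applies the three readiness conditions. The priority heuristic (longest latency-weighted path to a leaf), the single `issue_width` resource limit, and the small dependence graph are assumptions chosen for illustration; the graph is hypothetical and is expected to be acyclic.

```python
def list_schedule(latency, parents, issue_width):
    """Return {node: issue cycle}.

    latency     : dict node -> latency in cycles
    parents     : dict node -> list of nodes it depends on
    issue_width : maximum number of nodes issued per cycle
    """
    children = {n: [] for n in latency}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)

    height = {}

    def path_height(n):
        # Longest latency-weighted path from n to a leaf (priority heuristic).
        if n not in height:
            height[n] = latency[n] + max(
                (path_height(c) for c in children[n]), default=0)
        return height[n]

    issue = {}
    cycle = 0
    while len(issue) < len(latency):
        ready = [
            n for n in latency
            if n not in issue
            # (i) parents scheduled and (ii) their results available
            and all(p in issue and issue[p] + latency[p] <= cycle
                    for p in parents.get(n, []))
        ]
        ready.sort(key=lambda n: (-path_height(n), n))
        for n in ready[:issue_width]:   # (iii) resources available
            issue[n] = cycle
        cycle += 1
    return issue


latency = {"a": 2, "b": 2, "c": 3, "d": 1, "e": 1, "f": 1}
parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": [], "f": ["e"]}
narrow = list_schedule(latency, parents, issue_width=1)
wide = list_schedule(latency, parents, issue_width=2)
```

With one issue slot the loads a and b are serialized and d issues in cycle 6; with two slots they issue together and d issues in cycle 5.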
Speculative Execution in Superblocks
To produce an efficient schedule, the compiler must be able to move instructions above and below branches.
LIVE-OUT(BR) is the set of variables that may be used before being redefined when the branch BR is taken.
[Figure: superblock SB1 contains R: x ← y+z, followed later by the branch S: bnz r1. The branch leads to point P in basic block B2.]
In the example, LIVE-OUT(S) is the set of variables that are live at point P.
Speculative Execution in Superblocks
If we want to move instruction R below the branch instruction S, two situations might occur:
1) x ∈ LIVE-OUT(S)
2) x ∉ LIVE-OUT(S)
What is the code that the compiler should produce for each situation?
1) x ∈ LIVE-OUT(S): insert a copy of instruction R in the branch target.
2) x ∉ LIVE-OUT(S): no compensation code is required.
[Figure: superblock SB1 contains R: x ← y+z, followed later by the branch S: bnz r1. The branch leads to point P in basic block B2.]
Speculative Execution in Superblocks
1) x ∈ LIVE-OUT(S): must introduce R’ in basic block B2.
SB1:
…
S: bnz r1
…
R: x ← y+z
B2:
R’: x ← y+z
...
P
2) x ∉ LIVE-OUT(S): no compensation code is required.
SB1:
…
S: bnz r1
…
R: x ← y+z
B2:
...
P
Speculative Execution in Superblocks
Upward code motion is more commonly used to reduce the critical path of a superblock (e.g., moving a load instruction upward to hide the load latency).
There are two major restrictions on moving an instruction J from below to above a branch BR:
Restriction 1: The destination of J is not in LIVE-OUT(BR).
Restriction 2: J will never cause an exception that may terminate program execution when BR is taken.
Speculative Execution in Superblocks
Restriction 1 is usually removed by register renaming. By renaming the destination register of instruction J, we ensure that it is not in LIVE-OUT(BR).
There are two extreme interpretations of restriction 2.
Restricted Speculation Model: fully enforce restriction 2.
Therefore only instructions that cannot cause exceptions are candidates for speculative execution (e.g., memory loads, memory stores, integer divides, and all floating point instructions cannot be speculated).
Speculative Execution in Superblocks
General Speculation Model: completely ignore restriction 2.
Requires that the processor provide non-excepting or silent versions of all potentially excepting instructions in the instruction set architecture. If an exception occurs for a silent instruction, it is simply ignored, and garbage is written in the destination.
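The two restrictions and the two speculation models can be summarized as a legality check. This is a sketch only: the function name, the string values "restricted" and "general", and the register names in the examples are illustrative assumptions.

```python
def can_hoist_above_branch(dest, may_except, live_out_of_branch, model):
    """Decide whether an instruction may move from below to above a branch.

    dest               : destination register of the instruction
    may_except         : True if the instruction can raise an exception
    live_out_of_branch : set of registers in LIVE-OUT(BR)
    model              : "restricted" or "general" speculation
    """
    if dest in live_out_of_branch:
        return False        # restriction 1 (renaming the destination helps)
    if model == "restricted" and may_except:
        return False        # restriction 2, fully enforced
    return True             # the general model ignores restriction 2


live_out = {"r3"}
load_restricted = can_hoist_above_branch("r4", True, live_out, "restricted")
load_general = can_hoist_above_branch("r4", True, live_out, "general")
add_restricted = can_hoist_above_branch("r9", False, live_out, "restricted")
clobber = can_hoist_above_branch("r3", False, live_out, "general")
```

Under the general model the hoisted load would be emitted in its silent form, which is what makes ignoring restriction 2 safe.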
Example for Speculative Execution
C code segment
avg = 0;
weight = 0;
count = 0;
while (prt != NULL) {
    count = count + 1;
    if (prt->wt < 0)
        weight = weight - prt->wt;
    else
        weight = weight + prt->wt;
    prt = prt->next;
}
if (count != 0)
    avg = weight/count;
Assembly code segment
(i1)      ld_i r1, prt, 0
(i2)      mov r7, 0 // avg
(i3)      mov r2, 0 // count
(i4)      mov r3, 0 // weight
(i5)      beq r1, 0, L3
(i6)  L0: add r2, r2, 1
(i7)      ld_i r4, r1, 0 // prt->wt
(i8)      bge r4, 0, L1
(i9)      sub r3, r3, r4
(i10)     jmp L2
(i11) L1: add r3, r3, r4
(i12) L2: ld_i r1, r1, 4
(i13)     bne r1, 0, L0
(i14) L3: beq r2, 0, L4
(i15)     div r7, r3, r2
(i16)     st_i avg, 0, r7
(i17) L4:
Example for Speculative Execution
[Figure: Trace Selection for the Loop. Basic blocks BB2 (i6, i7, i8), BB3 (i9, i10), BB4 (i11), and BB5 (i12, i13). Edge weights as extracted: 10, 10, 90, 90, 99, 1, 1.]
Example for Speculative Execution
[Figure, left: Trace Selection for the Loop, with basic blocks BB2 (i6, i7, i8), BB3 (i9, i10), BB4 (i11), and BB5 (i12, i13).]
[Figure, right: After superblock formation and branch target expansion. Superblock SB1 contains BB2, BB4, and BB5 (i6, i7, i8, i11, i12, i13). Superblock SB2 contains BB3’ (i9, i12’, i13’). Edge weights as extracted: 10, 90, 90, 99(1/10), 1(9/10), 1, 1(1/10), 99(1/10).]
Example for Speculative Execution
[Figure: After superblock formation and branch target expansion, with superblocks SB1 (BB2, BB4, BB5) and SB2 (BB3’).]
Assembly code segment
           ld_i r1, prt, 0
           mov r7, 0 // avg
           mov r2, 0 // count
           mov r3, 0 // weight
           beq r1, 0, L3
(i6)   L0: add r2, r2, 1
(i7)       ld_i r4, r1, 0 // prt->wt
(i8)       bge r4, 0, LA
(i11)      add r3, r3, r4
(i12)      ld_i r1, r1, 4 // prt->next
(i13)      bne r1, 0, L0
(i9)   LA: sub r3, r3, r4
(i12’)     ld_i r1, r1, 4 // prt->next
(i13’)     bne r1, 0, L0
(i14)  L3: beq r2, 0, L4
(i15)      div r7, r3, r2
(i16)      st_i avg, 0, r7
(i17)  L4:
Example for Speculative Execution
           ld_i r1, prt, 0
           mov r7, 0 // avg
           mov r2, 0 // count
           mov r3, 0 // weight
           beq r1, 0, L3
(I1)   L0: add r2, r2, 1
(I2)       ld_i r4, r1, 0 // prt->wt
(I3)       blt r4, 0, L1
(I4)       add r3, r3, r4
(I5)       ld_i r5, r1, 4 // prt->next
(I6)       beq r5, 0, L3
(I7)       add r2, r2, 1
(I8)       ld_i r6, r5, 0 // prt->wt
(I9)       blt r6, 0, L1’
(I10)      add r3, r3, r6
(I11)      ld_i r1, r5, 4 // prt->next
(I12)      bne r1, 0, L0
       L3: beq r2, 0, L4
           div r7, r3, r2
           st_i avg, 0, r7
       L4:
      L1’: mov r1, r5
           mov r4, r6
       L1: sub r32, r3, r4
           ld_i r1, r1, 4
           bne r1, 0, L0
Hyperblocks
Suggested Reading
Scott A. Mahlke’s Ph.D. Thesis, chap. 7.
Hyperblock
A hyperblock is a collection of connected basic blocks in which control may only enter through the first block (entry block).
Control flow may leave from any number of blocks in the hyperblock.
Before scheduling, all control flow between basic blocks within a hyperblock is removed via if-conversion.
Hyperblock Formation
A five-step procedure is used to form hyperblocks:
1. region identification
2. loop backedge coalescing
3. block selection
4. tail duplication
5. if-conversion
Running Example: wc
Mahlke uses the inner loop of wc, the Linux program that counts the number of characters, words, and lines in a file, as a running example.
The source code
   linect = wordct = charct = token = 0;
   for ( ; ; ) {
A:     if (--(fp)->cnt < 0)
C:         c = filbuf(fp);
       else
B:         c = *(fp)->ptr++;
D:     if (c == EOF) break;
E:     charct++;
       if ((' ' < c) &&
F:         (c < 0177)) {
H:         if (!token) {
K:             wordct++;
               token++;
           }
           continue;
       }
G:     if (c == '\n')
I:         linect++;
J:     else if ((c != ' ') &&
L:              (c != '\t'))
           continue;
M:     token = 0;
   }
The Assembly Code
LA: ld_i r98, r3, 0
    add r27, r98, -1
    st_i r3, 0, r27
    blt r98, 1, LC
LB: ld_i r30, r3, 4
    add r29, r30, 1
    st_i r3, 4, r29
    ld_c r4, r30, 0
LD: beq r4, -1, EXIT
LE: ld_i r33, r73, 0
    add r32, r33, 1
    st_i r73, 0, r32
    bge 32, r4, LG
LF: bge r4, 127, LG
LH: bne 0, r2, LA
LK: ld_i r36, r72, 0
    add r35, r36, 1
    st_i r72, 0, r35
    add r2, r2, 1
    jmp LA
LG: beq r4, r10, LI
LJ: bne r4, 32, LL
LM: mov r2, 0
    jmp LA
LI: ld_i r39, r71, 0
    add r38, r39, 1
    st_i r71, 0, r38
    jmp LM
LL: bne r4, 9, LA
    jmp LM
LC: mov Parm0, r3
    jsr filbuf
    mov r4, Ret0
    jmp LD
Control Flow Graph
[Figure: weighted control flow graph of the wc loop with basic blocks A to M and an EXIT node. Edges are annotated with execution counts ranging from 1 to 105K.]
Statistics of the Example
wc is formed by small basic blocks with a large percentage of branches.
It contains 13 basic blocks and 34 instructions.
14 branches:
8 conditional
5 unconditional
1 subroutine call
Step 1: Region Identification
A region is a group of basic blocks with a single entry block that dominates all the blocks in the region.
Regions are used because they provide easy-to-compute outer boundaries for hyperblocks.
A basic block can only reside in a single region.
A second constraint imposed on region formation is that regions may not contain internal cycles (this constraint is relaxed later).
In wc, the entire control flow graph forms a region.
Step 2: Backedge Coalescing
If-conversion can only remove non-loop branches.
Thus we need to coalesce all backedges into a single backedge. This allows the control logic that chooses which backedge is taken to be eliminated by if-conversion.
To coalesce the backedges, we introduce a new node that will be the origin of the new single backedge. Then we retarget all existing backedges to this new node.
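A minimal sketch of backedge coalescing on an edge list, assuming the loop header and the set of blocks inside the loop are already known. The helper name, the `ENTRY` node, and the abbreviated edge list are illustrative; the four backedges into A correspond to the branches to LA in the wc assembly code.

```python
def coalesce_backedges(edges, header, loop_blocks, new_node="N"):
    """Retarget every backedge to `new_node` and add one backedge from it.

    edges       : list of (source, target) pairs
    header      : loop header block
    loop_blocks : set of blocks inside the loop
    """
    result = []
    for src, dst in edges:
        if dst == header and src in loop_blocks:
            result.append((src, new_node))    # former backedge
        else:
            result.append((src, dst))         # all other edges are kept
    result.append((new_node, header))         # the single remaining backedge
    return result


# Abbreviated wc loop: only the edges that matter for the example.
edges = [
    ("ENTRY", "A"), ("A", "B"), ("B", "D"), ("D", "E"),
    ("H", "A"), ("K", "A"), ("M", "A"), ("L", "A"),
]
loop = {"A", "B", "D", "E", "H", "K", "M", "L"}
coalesced = coalesce_backedges(edges, "A", loop)
backedges = [(s, d) for s, d in coalesced if d == "A" and s != "ENTRY"]
```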
CFG Before Backedge Coalescing
[Figure: the weighted control flow graph of the wc loop, with basic blocks A to M and an EXIT node, before backedge coalescing.]
CFG After Backedge Coalescing
[Figure: the weighted control flow graph of the wc loop after backedge coalescing. A new node N (weight 105K) collects the former backedges and is the origin of the single backedge.]
Step 3: Block Selection
Two conflicting goals:
(1) More blocks can potentially improve performance by eliminating branches among the blocks included.
(2) Too many blocks may result in performance loss due to over-saturation of processor resources or increased dependence height.
Enumerating Execution Paths
An execution path is a path of control flow from the entry block to an exit block in the region.
Mahlke assigns a priority to each execution path. This priority indicates the relative importance of the path.
Paths are included in the hyperblock from the highest to the lowest priority, based on the available resources.
Mahlke also estimates the available resources and the resource use of each path.
Path Priority Function
The path priority function combines four elements:
(1) path execution frequency;
(2) number of instructions in the path;
(3) path dependence height;
(4) hazard conditions on the path.
Intuition: include paths with fewer instructions, with lower dependence height, that have few hazard conditions, and that are executed very often.
Hazard conditions include procedure calls and unresolvable memory stores.
Path Priority Function
\[
\mathit{dep\_ratio}_i = 1.0 - \frac{\mathit{dep\_height}_i}{\max_{1 \le j \le N} \mathit{dep\_height}_j}
\]
\[
\mathit{op\_ratio}_i = 1.0 - \frac{\mathit{num\_ops}_i}{\max_{1 \le j \le N} \mathit{num\_ops}_j}
\]
\[
\mathit{priority}_i = (\mathit{probability}_i \times \mathit{hazard}_i) \times (\mathit{dep\_ratio}_i + \mathit{op\_ratio}_i + K)
\]
Mahlke uses a hazard multiplier of 0.25 for all paths containing a subroutine call or an unresolvable memory reference, and 1.0 for all other paths.
The constant K ensures that even the path with the largest dependence height and the most operations has a non-zero priority. Mahlke used K = 0.1.
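As a rough illustration of the priority function described on the preceding slides, the sketch below computes the priority of each path from its probability, dependence height, operation count, and hazard flag. The dictionary keys and the function name `path_priorities` are assumptions made for this example; only the constants 0.25, 1.0, and K = 0.1 come from the slides.

```python
def path_priorities(paths, k=0.1):
    """Compute one priority per path.

    Each path is a dict with the (hypothetical) keys
    probability, dep_height, num_ops, and has_hazard.
    """
    max_dep = max(p["dep_height"] for p in paths)
    max_ops = max(p["num_ops"] for p in paths)
    priorities = []
    for p in paths:
        dep_ratio = 1.0 - p["dep_height"] / max_dep
        op_ratio = 1.0 - p["num_ops"] / max_ops
        # Hazard multiplier from the slides: 0.25 for paths with a call
        # or an unresolvable memory reference, 1.0 otherwise.
        hazard = 0.25 if p["has_hazard"] else 1.0
        priorities.append(p["probability"] * hazard * (dep_ratio + op_ratio + k))
    return priorities


example_paths = [
    {"probability": 0.5, "dep_height": 4, "num_ops": 10, "has_hazard": False},
    {"probability": 0.5, "dep_height": 8, "num_ops": 20, "has_hazard": True},
]
example_priorities = path_priorities(example_paths)
```

In this toy input the second path has both the largest dependence height and the most operations, so both of its ratios are zero and only K keeps its priority above zero.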
Block Selection Algorithm
ISSUE_WIDTH = 1 to 8   /* as specified in the machine description file */
RES_MULTIPLIER = 2
MAX_DEP_GROWTH = 3
MIN_PATH_PRIORITY_RATIO = 0.10

block_selection(region) {
    enumerate all paths in the region
    calculate priority of each path
    sort paths from highest to lowest priority
    /* Initialization of loop variables */
    avail_resources = ISSUE_WIDTH × dep_height_1 × RES_MULTIPLIER
    used_resources = 0
    last_priority = 0.0
    selected_paths = ∅
    for (i = 1 to num_paths) {
        /* Check if there are enough resources available to include the path */
        if ((num_ops_i + used_resources) > avail_resources) { continue }
        /* Prevent paths with large relative dependence heights from being included */
        if (dep_height_i > (dep_height_1 × MAX_DEP_GROWTH)) { continue }
        /* Do not include paths with a small relative priority to that of the last included path */
        if (priority_i < (last_priority × MIN_PATH_PRIORITY_RATIO)) { continue }
        /* Include the path in the hyperblock */
        selected_paths = selected_paths ∪ path_i
        used_resources = used_resources + num_ops_i
        last_priority = priority_i
    }
    selected_blocks = all blocks contained within selected_paths
    return selected_blocks
}
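The selection loop can also be written as a small runnable sketch. It assumes the paths have already been enumerated and given priorities; the dictionary keys, the choice ISSUE_WIDTH = 4, and the toy paths are hypothetical values chosen only for illustration.

```python
ISSUE_WIDTH = 4  # hypothetical; the slides allow 1 to 8
RES_MULTIPLIER = 2
MAX_DEP_GROWTH = 3
MIN_PATH_PRIORITY_RATIO = 0.10


def block_selection(paths):
    """Return the set of blocks on the selected paths.

    Each path is a dict with the (assumed) keys
    blocks, priority, num_ops, and dep_height.
    """
    ordered = sorted(paths, key=lambda p: p["priority"], reverse=True)
    if not ordered:
        return set()
    first_dep = ordered[0]["dep_height"]
    avail_resources = ISSUE_WIDTH * first_dep * RES_MULTIPLIER
    used_resources = 0
    last_priority = 0.0
    selected = []
    for p in ordered:
        # Not enough resources left for this path.
        if p["num_ops"] + used_resources > avail_resources:
            continue
        # Dependence height too large relative to the best path.
        if p["dep_height"] > first_dep * MAX_DEP_GROWTH:
            continue
        # Priority too small relative to the last included path.
        if p["priority"] < last_priority * MIN_PATH_PRIORITY_RATIO:
            continue
        selected.append(p)
        used_resources += p["num_ops"]
        last_priority = p["priority"]
    return {block for p in selected for block in p["blocks"]}


toy_paths = [
    {"blocks": "ABD", "priority": 0.90, "num_ops": 10, "dep_height": 5},
    {"blocks": "ACD", "priority": 0.50, "num_ops": 20, "dep_height": 6},
    {"blocks": "AED", "priority": 0.40, "num_ops": 15, "dep_height": 5},
    {"blocks": "AFD", "priority": 0.01, "num_ops": 5, "dep_height": 5},
    {"blocks": "AGD", "priority": 0.30, "num_ops": 5, "dep_height": 20},
]
toy_selection = block_selection(toy_paths)
```

With these numbers the third path is rejected for lack of resources, the fourth for its low relative priority, and the fifth for its dependence height, so each of the three tests in the loop is exercised once.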
Block Selection
[Figure: CFG after backedge coalescing (blocks A to N and EXIT, with edge execution counts), shown next to the enumerated execution paths.]
Enumerated execution paths:
 1. A-B-D-E-F-H-N
 2. A-B-D-E-F-H-K-N
 3. A-B-D-E-G-J-M-N
 4. A-B-D-E-G-J-L-M-N
 5. A-B-D-E-G-I-M-N
 6. A-B-D-E-G-J-L-N
 7. A-B-D
 8. A-C-D-E-F-H-N
 9. A-C-D-E-F-H-K-N
10. A-C-D-E-G-J-M-N
11. A-C-D-E-G-J-L-M-N
12. A-C-D-E-G-I-M-N
13. A-C-D-E-G-J-L-N
14. A-C-D
15. A-B-D-E-F-G-I-M-N
16. A-B-D-E-F-G-J-M-N
17. A-B-D-E-F-G-J-L-M-N
18. A-B-D-E-F-G-J-L-N
19. A-C-D-E-F-G-I-M-N
20. A-C-D-E-F-G-J-M-N
21. A-C-D-E-F-G-J-L-M-N
22. A-C-D-E-F-G-J-L-N
Path Selection
Some paths that are not selected by the block selection algorithm are nevertheless included in the hyperblock because all of their blocks belong to selected paths.
An alternative procedure could have eliminated these paths from the path set before the selection.
But the cost of such elimination would be higher than the cost of maintaining these extra paths in the set.
Step 4: Tail Duplication
To convert the set of selected blocks into a hyperblock (with a single entry block), control flow from non-selected blocks (side entry points) must be eliminated.
The tail duplication algorithm first marks all blocks that have side entry points.
Then the algorithm marks all blocks that can be reached from marked blocks.
All marked blocks form the tails that must be duplicated.
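The two marking passes can be sketched as follows. This is an illustrative reading of the description above, not Mahlke's code: the CFG is assumed to be a successor dictionary, reachability is followed only through selected blocks and stops at the entry block, and the edge list is transcribed from the enumerated paths of the running example with block C assumed to be the only non-selected block.

```python
def blocks_to_duplicate(succ, selected, entry):
    """Return the selected blocks that form the tail to duplicate."""
    marked = set()
    # Pass 1: mark selected blocks with a side entry, i.e. a predecessor
    # that is not part of the selection.
    for block, targets in succ.items():
        if block in selected:
            continue
        for target in targets:
            if target in selected and target != entry:
                marked.add(target)
    # Pass 2: mark every selected block reachable from a marked block.
    worklist = list(marked)
    while worklist:
        block = worklist.pop()
        for target in succ.get(block, []):
            if target in selected and target != entry and target not in marked:
                marked.add(target)
                worklist.append(target)
    return marked


# Edges read off the enumerated paths of the example (N -> A is the backedge).
example_cfg = {
    "A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "EXIT"],
    "E": ["F", "G"], "F": ["H", "G"], "G": ["I", "J"], "H": ["K", "N"],
    "I": ["M"], "J": ["L", "M"], "K": ["N"], "L": ["M", "N"],
    "M": ["N"], "N": ["A"],
}
example_selected = set("ABDEFGHIJKLMN")  # assumption: only C is left out
example_tail = blocks_to_duplicate(example_cfg, example_selected, "A")
```

Under these assumptions the only side entry is the edge from C into D, and the marked tail is D through N, which matches the duplicated blocks D' to N' that appear in the later figures.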
Tail Duplication
[Figure: CFG of the region after backedge coalescing (blocks A to N and EXIT, with edge execution counts), used to illustrate tail duplication.]
Tail Duplication
[Figure: CFG after tail duplication. Blocks D through N have been copied as D', E', F', G', H', I', J', K', L', M', and N', and the copies carry their own edge execution counts.]
Anatomy of a Predicate Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
This instruction assigns values to Pout1 and Pout2.
The values assigned depend on:
- the result of the comparison;
- the value of Pin;
- the types of Pout1 and Pout2.
Anatomy of a Predicate Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<cmp> = eq | ne | gt | ge | ...
<type> = U | /U | OR | /OR | AND | /AND
Example: pge p4(OR), p2(/U), r4, 127 (p1)
cmp = ge, Pin = p1, Pout1 = p4, Pout2 = p2, src1 = r4, src2 = 127
Anatomy of a Predicate Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND
U or /U: always write into the destination register.
if type = U then
    if Pin = 0 then Pout = 0
    else if src1 <cmp> src2 then Pout = 1
    else Pout = 0
if type = /U then
    if Pin = 0 then Pout = 0
    else if src1 <cmp> src2 then Pout = 0
    else Pout = 1
Anatomy of a Predicate Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND
OR or /OR: write into the destination register only if Pin = 1 and the comparison has the required outcome (true for OR, false for /OR).
if type = OR and Pin = 1 and src1 <cmp> src2 then Pout = 1
if type = /OR and Pin = 1 and src1 !<cmp> src2 then Pout = 1
Used when the execution of a block is enabled by any one of multiple conditions.
OR-type predicates must be initialized to 0 before their use.
Anatomy of a Predicate Computation Operation
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
<type> = U | /U | OR | /OR | AND | /AND
AND or /AND: write into the destination register only if Pin = 1 and the comparison has the required outcome (false for AND, true for /AND).
if type = AND and Pin = 1 and src1 !<cmp> src2 then Pout = 0
if type = /AND and Pin = 1 and src1 <cmp> src2 then Pout = 0
Used when the execution of a block requires several conditions to be true.
AND-type predicates are often initialized to 1.
Predicate Comparison Truth Table
p<cmp> Pout1(type), Pout2(type), src1, src2 (Pin)
Value written to Pout ("-" means that Pout is not written):
Pin  Comparison |  U   /U   OR   /OR   AND   /AND
 0       0      |  0    0    -    -     -     -
 0       1      |  0    0    -    -     -     -
 1       0      |  0    1    -    1     0     -
 1       1      |  1    0    1    -     -     0
• Pin predicates the entire predicate computation instruction.
• Notice that for an unconditional type, the value 0 is written into Pout even when Pin is 0.
Predicate Comparison Truth Table
Example: pge p4(OR), p2(/U), r4, 127 (p1)
p1  Comparison |  p4(OR)   p2(/U)
 0      0      |    -        0
 0      1      |    -        0
 1      0      |    -        1
 1      1      |    1        0
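The per-type semantics and the truth table above can be checked with a short executable model. This is a sketch written for these notes, not part of any real toolchain; the function name `predicate_define` and the convention of passing the old destination value (returned when the destination is not written) are assumptions.

```python
def predicate_define(ptype, pin, cmp_result, old):
    """Model one destination of a predicate define instruction.

    ptype is one of "U", "/U", "OR", "/OR", "AND", "/AND".
    `old` is the previous value of the destination predicate; it is
    returned unchanged whenever the table shows "-" (not written).
    """
    if ptype in ("U", "/U"):
        # Unconditional types always write, even when Pin is 0.
        if not pin:
            return 0
        outcome = cmp_result if ptype == "U" else not cmp_result
        return 1 if outcome else 0
    if not pin:
        return old
    if ptype == "OR":
        return 1 if cmp_result else old
    if ptype == "/OR":
        return 1 if not cmp_result else old
    if ptype == "AND":
        return 0 if not cmp_result else old
    if ptype == "/AND":
        return 0 if cmp_result else old
    raise ValueError("unknown predicate type: " + ptype)


# pge p4(OR), p2(/U), r4, 127 (p1) with p1 = 1, r4 = 200, and p4 cleared.
r4 = 200
p4 = predicate_define("OR", 1, r4 >= 127, 0)
p2 = predicate_define("/U", 1, r4 >= 127, 0)
```

With r4 = 200 the comparison is true, so the OR-type destination p4 is set and the complemented unconditional destination p2 is cleared, as in the last row of the example table.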
Predicate Types
Unconditional predicates are used for control dependence sets that have a single edge.
OR-type predicates are used for predicates with multiple edges in their control dependence sets. (OR-type predicates must be cleared before entering the hyperblock.)
Step 5: If-conversion
For graph drawing, Mahlke uses the convention that the left edge out of a basic block is the true condition and the right edge is the false condition.
[Figure: block G with successors I (left edge) and J (right edge).]
In this control flow graph the control dependences of blocks I and J are:
I: brG
J: /brG
Step 5: If-conversion
[Figure: CFG after tail duplication, with the duplicated tail collapsed into a single node D'-N'.]
Block   Control dependences   Predicate assignment
A       none                  null
B       none                  null
D       none                  null
E       none                  null
F       brE                   p1 (U)
G       /brE, /brF            p4 (OR)
H       brF                   p2 (U)
I       brG                   p7 (U)
J       /brG                  p5 (U)
K       brH                   p3 (U)
L       /brJ                  p8 (U)
M       brI, brJ, brL         p6 (OR)
N       none                  null
Step 5: If-conversion (example)
[Figure: CFG of the hyperblock region, with the duplicated tail collapsed into a single node D'-N'.]
Code before if-conversion:
LA: ld_i r98, r3, 0
    add r27, r98, -1
    st_i r3, 0, r27
    blt r98, 1, LC
LB: ld_i r30, r3, 4
    add r29, r30, 1
    st_i r3, 4, r29
    ld_c r4, r30, 0
LD: beq r4, -1, EXIT
LE: ld_i r33, r73, 0
    add r32, r33, 1
    st_i r73, 0, r32
    bge 32, r4, LG
LF: bge r4, 127, LG
LH: bne 0, r2, LA
LK: ld_i r36, r72, 0
    add r35, r36, 1
    st_i r72, 0, r35
    add r2, r2, 1
    jmp LA
LG: beq r4, r10, LI
LJ: bne r4, 32, LL
LM: mov r2, 0
    jmp LA
LI: ld_i r39, r71, 0
    add r38, r39, 1
    st_i r71, 0, r38
    jmp LM
LL: bne r4, 9, LA
    jmp LM
LC: mov Parm0, r3
    jsr filbuf
    mov r4, Ret0
    jmp LD
Step 5: If-conversion (example)
Code after if-conversion (first part):
pclr p4, p6
ld_i r98, r3, 0
add r27, r98, -1
st_i r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_i r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_i r33, r73, 0
add r32, r33, 1
st_i r73, 0, r32
pge p4(OR), p1(/U), 32, r4
pge p4(OR), p2(/U), r4, 127 (p1)
peq p3(U), -, 0, r2 (p2)
peq p6(OR), p5(/U), r4, r10 (p4)
peq p7(U), -, r4, r10 (p4)
...
Inner Loop After If-conversion
pclr p4, p6
ld_i r98, r3, 0
add r27, r98, -1
st_i r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_i r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_i r33, r73, 0
add r32, r33, 1
st_i r73, 0, r32
pge p4(OR), p1(/U), 32, r4
pge p4(OR), p2(/U), r4, 127 (p1)
peq p3(U), -, 0, r2 (p2)
peq p6(OR), p5(/U), r4, r10 (p4)
peq p7(U), -, r4, r10 (p4)
peq p6(OR), p8(/U), r4, 32 (p5)
ld_i r36, r72, 0 (p3)
add r35, r36, 1 (p3)
st_i r72, 0, r35 (p3)
add r2, r2, 1 (p3)
ld_i r39, r71, 0 (p7)
add r38, r39, 1 (p7)
st_i r71, 0, r38 (p7)
peq p6(OR), -, r4, 9 (p8)
mov r2, 0 (p6)
jmp loop
Predicate Hierarchy Graph
The Predicate Hierarchy Graph (PHG) is a directed acyclic graph representing the Boolean equations used to compute all the predicates in a hyperblock.
There are two types of nodes in the PHG: predicate nodes and condition nodes.
Two PHG nodes x and y are connected if the value specified by x is used to directly compute the value of y.
The PHG is used to derive relationships among predicates.
Example of PHG Construction
Each predicate define instruction is annotated with the condition nodes it introduces:
pclr p4, p6
ld_i r98, r3, 0
add r27, r98, -1
st_i r3, 0, r27
blt r98, 1, LC
ld_i r30, r3, 4
add r29, r30, 1
st_i r3, 4, r29
ld_c r4, r30, 0
beq r4, -1, EXIT
ld_i r33, r73, 0
add r32, r33, 1
st_i r73, 0, r32
pge p4(OR), p1(/U), 32, r4 [c1, /c1]
pge p4(OR), p2(/U), r4, 127 (p1) [c2, /c2]
peq p3(U), -, 0, r2 (p2) [c3]
peq p6(OR), p5(/U), r4, r10 (p4) [c4, /c4]
peq p7(U), -, r4, r10 (p4) [c4]
peq p6(OR), p8(/U), r4, 32 (p5) [c5, /c5]
ld_i r36, r72, 0 (p3)
add r35, r36, 1 (p3)
st_i r72, 0, r35 (p3)
add r2, r2, 1 (p3)
ld_i r39, r71, 0 (p7)
add r38, r39, 1 (p7)
st_i r71, 0, r38 (p7)
peq p6(OR), -, r4, 9 (p8) [c6]
mov r2, 0 (p6)
jmp loop
[Figure: the PHG initially contains only the root node T.]
Example of PHG Construction
pge p4(OR), p1(/U), 32, r4 [c1, /c1]
[Figure: PHG after this instruction. T feeds the condition nodes c1 and /c1; c1 feeds p4 and /c1 feeds p1.]
Example of PHG Construction
pge p4(OR), p2(/U), r4, 127 (p1) [c2, /c2]
[Figure: PHG after this instruction. p1 feeds the new condition nodes c2 and /c2; c2 feeds p4 and /c2 feeds p2.]
Example of PHG Construction
peq p3(U), -, 0, r2 (p2) [c3]
[Figure: PHG after this instruction. p2 feeds the new condition node c3, which feeds p3.]
Example of PHG Construction
peq p6(OR), p5(/U), r4, r10 (p4) [c4, /c4]
[Figure: PHG after this instruction. p4 feeds the new condition nodes c4 and /c4; c4 feeds p6 and /c4 feeds p5.]
Example of PHG Construction
peq p7(U), -, r4, r10 (p4) [c4]
[Figure: PHG after this instruction. p4 feeds a second c4 condition node, which feeds p7.]
Example of PHG Construction
peq p6(OR), p8(/U), r4, 32 (p5) [c5, /c5]
[Figure: PHG after this instruction. p5 feeds the new condition nodes c5 and /c5; c5 feeds p6 and /c5 feeds p8.]
Example of PHG Construction
peq p6(OR), -, r4, 9 (p8) [c6]
[Figure: PHG after this instruction. p8 feeds the new condition node c6, which feeds p6.]
Example of PHG Construction
[Figure: completed PHG. T feeds c1 and /c1; c1 feeds p4 and /c1 feeds p1; p1 feeds c2 and /c2; c2 feeds p4 and /c2 feeds p2; p2 feeds c3, which feeds p3; p4 feeds two c4 nodes (one feeding p6, one feeding p7) and /c4, which feeds p5; p5 feeds c5 and /c5; c5 feeds p6 and /c5 feeds p8; p8 feeds c6, which feeds p6.]
Purpose of PHG
The PHG allows the compiler to derive relations among the predicates. Mahlke identifies three predicate relations:
Ancestor: pi is an ancestor of pj if all conditions used to compute pj are derived from pi. The compiler can be sure that pj may be true only when pi is also true.
Control Path: there is a control path between pi and pj if there is at least one set of conditions under which both pi and pj are true. The compiler knows that pi and pj may be true at the same time.
Implies: pi implies pj if the conditions that make pi true guarantee that pj will also be true.
Imply Relationship
[Figure: the completed PHG from the construction example.]
Example: p7 implies p6.
Ancestor Relationship
[Figure: the completed PHG from the construction example.]
Which predicate nodes are ancestors of p5?
T, p4, and p5
Control Path Relationship
[Figure: the completed PHG from the construction example.]
Which predicate nodes are in the same control path as p5?
T, p1, p4, p5, p6, p8
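The implies and control-path answers for the running example can be cross-checked by brute force. The sketch below is not how the PHG derives relations; it simply writes each predicate of the example as a Boolean equation over the conditions c1 to c6 and enumerates all assignments. Treating the six conditions as independent Booleans is an assumption made for this illustration, and the function names are hypothetical. The ancestor relation is structural and is not computed here.

```python
from itertools import product

CONDITIONS = ["c1", "c2", "c3", "c4", "c5", "c6"]


def evaluate(c):
    """Evaluate every predicate of the example for one condition assignment."""
    p = {"T": True}
    p["p1"] = not c["c1"]                          # p1(/U) under T
    p["p4"] = c["c1"] or (p["p1"] and c["c2"])     # p4(OR), cleared by pclr
    p["p2"] = p["p1"] and not c["c2"]              # p2(/U) under p1
    p["p3"] = p["p2"] and c["c3"]                  # p3(U) under p2
    p["p5"] = p["p4"] and not c["c4"]              # p5(/U) under p4
    p["p7"] = p["p4"] and c["c4"]                  # p7(U) under p4
    p["p8"] = p["p5"] and not c["c5"]              # p8(/U) under p5
    p["p6"] = ((p["p4"] and c["c4"]) or (p["p5"] and c["c5"])
               or (p["p8"] and c["c6"]))           # p6(OR), cleared by pclr
    return p


ROWS = [evaluate(dict(zip(CONDITIONS, bits)))
        for bits in product([False, True], repeat=len(CONDITIONS))]


def implies(pi, pj):
    """True if pj holds in every assignment in which pi holds."""
    return all(row[pj] for row in ROWS if row[pi])


def control_path(pi, pj):
    """True if some assignment makes pi and pj true together."""
    return any(row[pi] and row[pj] for row in ROWS)


same_path_as_p5 = {name for name in ROWS[0] if control_path("p5", name)}
```

Under these assumptions `implies("p7", "p6")` holds, and `same_path_as_p5` contains T, p1, p4, p5, p6, and p8, which agrees with the answers given on the slides.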
Classical/ILP Optimizations in Predicated Code
Example: Copy Propagation
A: mov r1, r2 (p1)
B: add r2, r3, r4 (p2)
C: ld_i r5, r1, 0 (p3)
Is the copy propagation from instruction A to instruction C legal?
A: mov r1, r2 (p1)
B: add r2, r3, r4 (p2)
C: ld_i r5, r2, 0 (p3)
It depends on what we know about the relationship among p1, p2, and p3. If it is possible that p1 is false and p3 is true, the propagation would be wrong!
Classical/ILP Optimizations in Predicated Code
Example: Copy Propagation
A: mov r1, r2 (p1)
B: add r2, r3, r4 (p2)
C: ld_i r5, r1, 0 (p3)
For instance, if we know that:
(1) p1 is an ancestor of both p2 and p3, and
(2) p2 and p3 are mutually exclusive,
then we can do the copy propagation safely: by (1), C executes only when A has executed, and by (2), B cannot redefine r2 between A and C when C executes.
[Figure: PHG fragment in which p1 leads to pk, and pk feeds p2 and p3 through the complementary conditions cm and /cm.]
Classical/ILP Optimizations in Predicated Code
Example: Instruction Scheduling
A: ld_i r1, r2, r3 (p2)
B: add r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul r6, r1, r7 (p3)
What are the data dependences in the code above? It depends on what we know about the relationship between p2 and p3.
Classical/ILP Optimizations in Predicated Code
Example: Instruction Scheduling
A: ld_i r1, r2, r3 (p2)
B: add r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul r6, r1, r7 (p3)
For instance, if we know that p2 and p3 are mutually exclusive, we have this DDG:
[Figure: PHG fragment in which pk feeds p2 and p3 through the complementary conditions cm and /cm, and a DDG with nodes A, B, C, and D in which the only dependences are A to B and C to D.]
Classical/ILP Optimizations in Predicated Code
Example: Instruction Scheduling
A: ld_i r1, r2, r3 (p2)
B: add r4, r1, 4 (p2)
C: ld_i r1, r5, 0 (p3)
D: mul r6, r1, r7 (p3)
But if p2 implies p3, then we have this DDG:
[Figure: PHG fragment relating pk, p2, and p3, and a DDG with nodes A, B, C, and D in which the pair (A, B) is no longer independent of the pair (C, D), because C redefines r1 whenever A and B execute.]
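The effect of predicate relations on the dependence graph can be illustrated with a small, deliberately conservative sketch. It is not the scheduler's actual algorithm: dependences are tested pairwise on register names only, a pair is skipped solely when its two predicates are known to be mutually exclusive, and the tuple layout and the name `build_ddg` are assumptions for this example.

```python
def build_ddg(instrs, mutually_exclusive):
    """Return a set of (from, to, kind) register dependence edges.

    instrs: list of (name, dest, srcs, pred) tuples in program order.
    mutually_exclusive: iterable of predicate pairs that can never be
    true at the same time.
    """
    exclusive = {frozenset(pair) for pair in mutually_exclusive}
    edges = set()
    for i, (n1, d1, s1, p1) in enumerate(instrs):
        for n2, d2, s2, p2 in instrs[i + 1:]:
            # Instructions guarded by mutually exclusive predicates never
            # both execute, so no dependence is needed between them.
            if frozenset((p1, p2)) in exclusive:
                continue
            if d1 in s2:
                edges.add((n1, n2, "flow"))
            if d2 in s1:
                edges.add((n1, n2, "anti"))
            if d1 == d2:
                edges.add((n1, n2, "output"))
    return edges


code = [
    ("A", "r1", ("r2", "r3"), "p2"),  # ld_i r1, r2, r3 (p2)
    ("B", "r4", ("r1",), "p2"),       # add  r4, r1, 4  (p2)
    ("C", "r1", ("r5",), "p3"),       # ld_i r1, r5, 0  (p3)
    ("D", "r6", ("r1", "r7"), "p3"),  # mul  r6, r1, r7 (p3)
]
ddg_exclusive = build_ddg(code, [("p2", "p3")])
ddg_unknown = build_ddg(code, [])
```

When p2 and p3 are declared mutually exclusive, only the flow dependences A to B and C to D remain. Without that knowledge the sketch also reports an output dependence from A to C and an anti dependence from B to C on r1, which constrain the schedule.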
Predicate-Specific Optimizations
- Predicate Promotion
- Branch Combining
- Predicate Loop Peeling
Predicate Promotion
The idea is to speculate the execution of instructions by replacing their predicate with a less constrained predecessor (ancestor) predicate.
Because the ancestor predicate is computed with fewer conditions, the execution of the promoted instruction is speculative.
The advantage of predicate promotion is the reduction of the dependence chain in a hyperblock.
Conditions for Simple Predicate Promotion
The predicate of an instruction op(x) can be promoted to its predecessor predicate if all the following conditions are true:
1. op(x) is predicated
2. op(x) has a destination register
3. op(x) has a speculative version
4. there is a unique op(y) lexically before op(x) such that dest(y) = pred(x)
5. dest(x) is not live at op(y)
6. for any op(j) such that there is a path op(j)…op(y), dest(x) ≠ dest(j)
7. it is profitable to promote op(x)
Example of Predicate Promotion (qsort)
Before promotion:
 1  LA: ld_i r20, r24, r101
 2      ld_i r23, r2, r102
 3      pge p126(U), p127(/U), r20, r23
 4  LB: ld_i r6, r123, 0 (p126)
 5      add r123, r123, 8 (p126)
 6      add r9, r9, 1 (p126)
 7      add r101, r101, 8 (p126)
 8  LC: ld_i r6, r124, 8 (p127)
 9      add r124, r124, 8 (p127)
10      add r124, r124, 8 (p127)
11      add r102, r102, 8 (p127)
12  LD: st_i r114, 0, r23
13      st_i r114, 4, r6
14      add r7, r7, 1
15      add r114, r114, 8
16      bge r9, r3, EXIT
17  LE: blt r8, r1, LA
After promotion (the loads at lines 4 and 8 no longer carry a predicate; line 8 writes the new register r60, and the new line 8a copies it into r6 under p127):
 1  LA: ld_i r20, r24, r101
 2      ld_i r23, r2, r102
 3      pge p126(U), p127(/U), r20, r23
 4  LB: ld_i r6, r123, 0
 5      add r123, r123, 8 (p126)
 6      add r9, r9, 1 (p126)
 7      add r101, r101, 8 (p126)
 8  LC: ld_i r60, r124, 8
 8a     mov r6, r60 (p127)
 9      add r124, r124, 8 (p127)
10      add r124, r124, 8 (p127)
11      add r102, r102, 8 (p127)
12  LD: st_i r114, 0, r23
13      st_i r114, 4, r6
14      add r7, r7, 1
15      add r114, r114, 8
16      bge r9, r3, EXIT
17  LE: blt r8, r1, LA
Branch Combining
Problem: too many infrequently executed branches in a hyperblock.
Example: a loop in grep
 1  A: bge r1, r5, EXIT1
 2     ld_c r3, r1, 0
 3     beq r3, 10, EXIT2
 4     beq r3, 0, EXIT3
 5     bge r2, r6, EXIT4
 6     st_c r2, 0, r3
 7     add r1, r1, 1
 8     add r2, r2, 1
 9     jmp A
[Figure: the branches are annotated with execution counts (14, 4035, 0, and 0 in the extracted text).]
Branch Combining
Solution: replace a group of exit branches with a corresponding group of predicate define instructions.
All predicate definitions write into the same predicate register using OR-type semantics.
The resulting predicate will be set to 1 if any of the exit branches would have been taken.
Because not exiting the hyperblock is the most common case, the predicate will usually be false.
Branch Combining
Original code:
 1  A: bge r1, r5, EXIT1
 2     ld_c r3, r1, -1
 3     beq r3, 10, EXIT2
 4     beq r3, 0, EXIT3
 5     bge r2, r6, EXIT4
 6     st_c r2, -1, r3
 7     bge r1, r7, EXIT5
 8     ld_c r4, r1, 0
 9     beq r4, 10, EXIT6
10     beq r4, 0, EXIT7
11     bge r2, r8, EXIT8
12     st_c r2, 0, r4
13     add r1, r1, 2
14     add r2, r2, 2
15     jmp A
After branch combining:
 0  A: pclr p1
 1     pge p1(OR), r1, r5
 2     ld_c r3, r1, -1
 3     peq p1(OR), r3, 10
 4     peq p1(OR), r3, 0
 5     pge p1(OR), r2, r6
 7     pge p1(OR), r1, r7
 8     ld_c r4, r1, 0
 9     peq p1(OR), r4, 10
10     peq p1(OR), r4, 0
11     pge p1(OR), r2, r8
16     jmp Decode (p1)
 6'    st_c r2, -1, r3
12     st_c r2, 0, r4
13     add r1, r1, 2
14     add r2, r2, 2
15     jmp A
Decode block:
 1  Decode: bge r1, r5, EXIT1
 3     beq r3, 10, EXIT2
 4     beq r3, 0, EXIT3
 5     bge r2, r6, EXIT4
 6     st_c r2, -1, r3
 7     bge r1, r7, EXIT5
 9     beq r4, 10, EXIT6
10     beq r4, 0, EXIT7
11     jmp EXIT8
Instructions Between Combined Branches
Instructions between combined branches are speculated.
For instructions that are between combined branches but cannot be speculated, the following must be done:
(1) move the instructions below the combined exit branch in the hyperblock.
(2) replicate these instructions in their original position with respect to the exit branches in the decode block.
Backend Compilation with Hyperblocks
[Figure: backend compilation flow with boxes labeled Lcode generation, Classical Optim., Hyperblock/Superblock Formation, ILP/Predicate-Specific Optimizations, Classical Optim., Instruction Scheduling, and Register Allocation. Predicate-aware support modules (PHG, CFG Generator, Equation Solver) supply predicate relations and dataflow information.]