lecture 13: trace scheduling, conditional execution...
TRANSCRIPT
![Page 1: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/1.jpg)
RHK.S96 1
Lecture 13: Trace Scheduling, Conditional Execution,
Speculation, Limits of ILP
Professor Randy H. KatzComputer Science 252
Spring 1996
![Page 2: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/2.jpg)
RHK.S96 2
Review: Getting CPI < 1Multiple Instructions/Cycle
• Two variations:– Superscalar: varying no. instructions/cycle (1 to 8),
scheduled by compiler or by HW (Tomasulo)» IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100
– Very Long Instruction Words (VLIW): fixed number of instructions (16) scheduled by the compiler
» Joint HP/Intel agreement in 1998?
![Page 3: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/3.jpg)
RHK.S96 3
Loop Unrolling in SuperScalar
Integer instruction FP instruction Clock cycleLoop: LD F0,0(R1) 1
LD F6,-8(R1) 2LD F10,-16(R1) ADDD F4,F0,F2 3LD F14,-24(R1) ADDD F8,F6,F2 4LD F18,-32(R1) ADDD F12,F10,F2 5SD 0(R1),F4 ADDD F16,F14,F2 6SD -8(R1),F8 ADDD F20,F18,F2 7SD -16(R1),F12 8SD -24(R1),F16 9SUBI R1,R1,#40 10BNEZ R1,LOOP 11SD -32(R1),F20 12
Unrolled 5 times to avoid delays (+1 due to SS)12 clocks, or 2.4 clocks per iteration
![Page 4: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/4.jpg)
RHK.S96 4
Loop Unrolling in VLIW
Memory Memory FP FP Int. op/ Clockreference 1 reference 2 operation 1 op. 2 branchLD F0,0(R1) LD F6,-8(R1) 1LD F10,-16(R1) LD F14,-24(R1) 2LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD F8,F6,F2 3LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4
ADDD F20,F18,F2 ADDD F24,F22,F2 5
SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 6SD -16(R1),F12 SD -24(R1),F16 7SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,#48 8SD -0(R1),F28 BNEZ R1,LOOP 9
Unrolled 7 times to avoid delays 7 results in 9 clocks, or 1.3 clocks per iteration Need more registers in VLIW
![Page 5: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/5.jpg)
RHK.S96 5
Limits to Multi-Issue Machines
• Inherent limitations of ILP– 1 branch in 5: How to keep a 5-way VLIW busy?– Latencies of units: many operations must be scheduled– Need about Pipeline Depth x No. Functional Units of independent
operations to keep machines busy
• Difficulties in building HW– Duplicate FUs to get parallel execution– Increase ports to Register File
» VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg
– Increase ports to memory– Decoding SS and impact on clock rate, pipeline depth
![Page 6: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/6.jpg)
RHK.S96 6
Limits to Multi-Issue Machines
• Limitations specific to either SS or VLIW implementation
– Decode issue in SS– VLIW code size: unroll loops + wasted fields in VLIW– VLIW lock step => 1 hazard & all instructions stall– VLIW & binary compatibility is practical weakness
![Page 7: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/7.jpg)
RHK.S96 7
Software Pipelining ExampleBefore: Unrolled 3 times 1 LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12 10 SUBI R1,R1,#24 11 BNEZ R1,LOOP
After: Software Pipelined 1 SD 0(R1),F4 ; Stores M[i] 2 ADDD F4,F0,F2 ; Adds to M[i-1] 3 LD F0,-16(R1);Loads M[i-2] 4 SUBI R1,R1,#8 5 BNEZ R1,LOOP
• Symbolic Loop Unrolling– Less code space– Fill & drain pipe only once vs. each iteration in loop unrolling
![Page 8: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/8.jpg)
RHK.S96 8
Review: Summary• Branch Prediction
– Branch History Table: 2 bits for loop accuracy– Correlation: Recently executed branches correlated with
next branch– Branch Target Buffer: include branch address &
prediction
• SuperScalar and VLIW– CPI < 1– Dynamic issue vs. Static issue– More instructions issue at same time, larger the penalty
of hazards
• SW Pipelining– Symbolic Loop Unrolling to get most from pipeline with
little code expansion, little overhead
![Page 9: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/9.jpg)
RHK.S96 9
Trace Scheduling
• Parallelism across IF branches vs. LOOP branches• Two steps:
– Trace Selection» Find likely sequence of basic blocks (trace) of (statically
predicted) long sequence of straight-line code– Trace Compaction
» Squeeze trace into few VLIW instructions» Need bookkeeping code in case prediction is wrong
![Page 10: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/10.jpg)
RHK.S96 10
HW support for More ILP
• Avoid branch prediction by turning branches into conditionally executed instructions:
if (x) then A = B op C else NOP– If false, then neither store result or cause exception– Expanded ISA of Alpha, MIPS, PowerPC, SPARC have
conditional move; PA-RISC can annul any following instr.
• Drawbacks to conditional instructions– Still takes a clock even if “annulled”– Stall if condition evaluated late– Complex conditions reduce effectiveness;
condition becomes known late in pipeline
![Page 11: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/11.jpg)
RHK.S96 11
HW support for More ILP
• Speculation: allow an instruction to issue that is dependent on branch predicted to be taken without any consequences (including exceptions) if branch is not actually taken (“HW undo”)
• Often try to combine with dynamic scheduling• Tomasulo: separate speculative bypassing of
results from real bypassing of results– When instruction no longer speculative, write results
(instruction commit)– execute out-of-order but commit in order
![Page 12: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/12.jpg)
RHK.S96 12
HW support for More ILP
• Need HW buffer for results of uncommitted instructions: reorder buffer
– Reorder buffer can be operand source
– Once operand commits, result is found in register
– 3 fields: instr. type, destination, value– Use reorder buffer number instead
of reservation station– Instructionscommit in order– As a result, its easy to undo
speculated instructions on mispredicted branches or on exceptions
ReorderBuffer
FP Regs
FPOp
Queue
FP Adder FP Adder
Res Stations Res Stations
Figure 4.34, page 311
![Page 13: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/13.jpg)
RHK.S96 13
Four Steps of Speculative Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue If reservation station or reorder buffer slot free, issue instr &
send operands & reorder buffer no. for destination.
2.Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch
CDB for result; when both in reservation station, execute
3.Write result—finish execution (WB) Write on Common Data Bus to all awaiting FUs & reorder
buffer; mark reservation station available.
4.Commit—update register with reorder result When instr. at head of reorder buffer & result present, update
register with result (or store to memory) and remove instr from reorder buffer.
![Page 14: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/14.jpg)
RHK.S96 14
Limits to ILP• Conflicting studies of amount of parallelism
available in late 1980s and early 1990s. Different assumptions about:
– Benchmarks (vectorized Fortran FP vs. integer C programs)– Hardware sophistication– Compiler sophistication
![Page 15: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/15.jpg)
RHK.S96 15
Limits to ILP
Initial HW Model here; MIPS compilers1. Register renaming–infinite virtual registers and all WAW & WAR hazards are avoided2. Branch prediction–perfect; no mispredictions 3. Jump prediction–all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available4. Memory-address alias analysis–addresses are known & a store can be moved before a load provided addresses not equal
1 cycle latency for all instructions
![Page 16: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/16.jpg)
RHK.S96 16
Upper Limit to ILP(Figure 4.38, page 319)
Programs
Instr
ucti
on I
ssues p
er
cycle
0
20
40
60
80
100
120
140
160
gcc espresso l i fpppp doducd tomcatv
54.862.6
17.9
75.2
118.7
150.1
![Page 17: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/17.jpg)
RHK.S96 17
Program
Inst
ructi
on iss
ues
per
cycle
0
10
20
30
40
50
60
gcc espresso l i fpppp doducd tomcatv
35
41
16
61
5860
9
1210
48
15
67 6
46
13
45
6 6 7
45
14
45
2 2 2
29
4
19
46
Perfect Selective predictor Standard 2-bit Static None
More Realistic HW: Branch ImpactFigure 4.40, Page 323
Change from Infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycle
ProfileBHT (512)Pick Cor. or BHTPerfect
![Page 18: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/18.jpg)
RHK.S96 18
Program
Inst
ructi
on iss
ues
per
cycle
0
10
20
30
40
50
60
gcc espresso l i fpppp doducd tomcatv
11
15
12
29
54
10
15
12
49
16
10
1312
35
15
44
9 10 11
20
11
28
5 5 6 5 57
4 45
45 5
59
45
Infinite 256 128 64 32 None
More Realistic HW: Register ImpactFigure 4.42, Page 325
Change 2000 instr window, 64 instr issue, 8K 2 level Prediction
64 None256Infinite 32128
![Page 19: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/19.jpg)
RHK.S96 19
Program
Inst
ructi
on iss
ues
per
cycle
0
5
10
15
20
25
30
35
40
45
50
gcc espresso l i fpppp doducd tomcatv
10
15
12
49
16
45
7 79
49
16
45 4 4
6 53
53 3 4 4
45
Perfect Global/stack Perfect Inspection None
More Realistic HW: Alias ImpactFigure 4.44, Page 328
Change 2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers
NoneGlobal/Stack perf;heap conflicts
Perfect Inspec.Assem.
![Page 20: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/20.jpg)
RHK.S96 20
Program
Inst
ructi
on iss
ues
per
cycle
0
10
20
30
40
50
60
gcc expresso l i fpppp doducd tomcatv
10
15
12
52
17
56
10
15
12
47
16
10
1311
35
15
34
910 11
22
12
8 8 9
14
9
14
6 6 68
79
4 4 4 5 46
3 2 3 3 3 3
45
22
Infinite 256 128 64 32 16 8 4
Realistic HW for ‘9X: Window Impact(Figure 4.48, Page 332)
Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window
64 16256Infinite 32128 8 4
![Page 21: Lecture 13: Trace Scheduling, Conditional Execution ...bnrg.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture13.pdf · RHK.S96 1 Lecture 13: Trace Scheduling, Conditional Execution,](https://reader031.vdocuments.net/reader031/viewer/2022030507/5ab645a07f8b9a1a048da601/html5/thumbnails/21.jpg)
RHK.S96 21
• 8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe) vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)
Braniac vs. Speed Demon
Benchmark
SPECM
ark
s
0
100
200
300
400
500
600
700
800
900
espre
sso li
eqnt
ott
com
pre
ss sc gcc
spic
e
dodu
c
mdljd
p2
wav
e5
tom
catv
ora
alv
inn
ear
mdljsp
2
swm
25
6
su2
cor
hydro
2d
nasa
fpppp