Instruction-Level Parallelism and Its Exploitation (Part I and II)
ECE 154B
Dmitri Strukov
Introduction
• Pipelining became a universal technique in 1985
– Overlaps execution of instructions
– Exploits “Instruction-Level Parallelism”
• Beyond this, there are two main approaches:
– Hardware-based dynamic approaches
• Used in server and desktop processors
• Not used as extensively in PMP processors
– Compiler-based static approaches
• Not as successful outside of scientific applications
Instruction-Level Parallelism
• When exploiting instruction-level parallelism, the goal is to minimize CPI
– Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
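The CPI breakdown above can be sketched as a one-line sum. The stall values below are hypothetical numbers chosen only for illustration; they do not come from the slides.

```python
def pipeline_cpi(ideal, structural, data_hazard, control):
    """Pipeline CPI as the sum of the ideal CPI and the per-instruction stall terms."""
    return ideal + structural + data_hazard + control

# Hypothetical stall cycles per instruction (assumed values, not measurements).
cpi = pipeline_cpi(ideal=1.0, structural=0.05, data_hazard=0.30, control=0.15)
print(f"Pipeline CPI = {cpi:.2f}")
```

Every ILP technique in the rest of the lecture attacks one of the three stall terms.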
ILP techniques
Data Dependence
• How much parallelism exists is a property of the code
• Exploiting as much of it as possible is the goal of ILP techniques
• Challenges: dependencies and hazards
– Data dependency
– Name dependency
– Control dependency
• A dependency can result in a data hazard
Data Dependence
• Data dependency
– Instruction j is data dependent on instruction i if either
• Instruction i produces a result that may be used by instruction j, or
• Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
• Dependent instructions cannot be executed simultaneously
• Pipeline organization determines whether a dependence is detected and whether it causes a stall
• Data dependence conveys:
– Possibility of a hazard
– Order in which results must be calculated
– Upper bound on exploitable instruction-level parallelism
• Dependencies that flow through memory locations are difficult to detect
Name Dependence
• Two instructions use the same name, but there is no flow of information between them
– Not a true data dependence, but a problem when reordering instructions
– Antidependence: instruction j writes a register or memory location that instruction i reads
• Initial ordering (i before j) must be preserved
– Output dependence: instruction i and instruction j write the same register or memory location
• Ordering must be preserved
• To resolve, use renaming techniques
Data Hazards
• Data hazards
– Read after write (RAW)
• Occurs with a true data dependency
• add 3 1 2 ; nand 4 3 5
– Write after write (WAW)
• Output dependency
• Occurs in pipelines that write in more than one pipe stage or with out-of-order execution
• add 3 1 2 ; nand 3 5 6
– Write after read (WAR)
• Antidependency
• add 1 2 3 ; nand 2 5 6
• Can occur with reordering
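A minimal sketch of how the three hazard types can be told apart for a pair of instructions. It assumes the operand order used in the slide examples is `op dest src1 src2` (inferred from the examples, not stated on the slide); the function name `hazards` is hypothetical.

```python
def hazards(first, second):
    """Classify hazards between two instructions given as (op, dest, src1, src2).

    `first` precedes `second` in program order.
    """
    _, d1, *s1 = first
    _, d2, *s2 = second
    found = set()
    if d1 in s2:      # second reads what first writes
        found.add("RAW")
    if d1 == d2:      # both write the same register
        found.add("WAW")
    if d2 in s1:      # second writes what first reads
        found.add("WAR")
    return found

# The three example pairs from the slide.
raw = hazards(("add", 3, 1, 2), ("nand", 4, 3, 5))
waw = hazards(("add", 3, 1, 2), ("nand", 3, 5, 6))
war = hazards(("add", 1, 2, 3), ("nand", 2, 5, 6))
print(raw, waw, war)
```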
Compiler Techniques for Exposing ILP
• Pipeline scheduling
– Separate a dependent instruction from its source instruction by the pipeline latency of the source instruction
• Example:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
A More Realistic Pipeline
Pipeline Stalls
Loop: L.D F0,0(R1)
stall
ADD.D F4,F0,F2
stall
stall
S.D F4,0(R1)
DADDUI R1,R1,#-8
stall (assume integer load latency is 1)
BNE R1,R2,Loop
Pipeline Scheduling
Scheduled code:
Loop: L.D F0,0(R1)
DADDUI R1,R1,#-8
ADD.D F4,F0,F2
stall
stall
S.D F4,8(R1)
BNE R1,R2,Loop
Loop unrolling
• Parallelism within a basic block is limited
– Typical size of a basic block = 3-6 instructions
– Must optimize across branches
• Loop-level parallelism
– Unroll loop statically or dynamically
– Use SIMD (vector processors and GPUs)
Loop Unrolling
• Loop unrolling
– Unroll by a factor of 4 (assume # elements is divisible by 4)
– Eliminate unnecessary instructions
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1) ;drop DADDUI & BNE
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1) ;drop DADDUI & BNE
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1) ;drop DADDUI & BNE
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
Note: compare the number of live registers with the original loop (the unrolled loop needs more)
Loop Unrolling/Pipeline Scheduling
• Pipeline schedule the unrolled loop:
Loop: L.D F0,0(R1)
L.D F6,-8(R1)
L.D F10,-16(R1)
L.D F14,-24(R1)
ADD.D F4,F0,F2
ADD.D F8,F6,F2
ADD.D F12,F10,F2
ADD.D F16,F14,F2
S.D F4,0(R1)
S.D F8,-8(R1)
DADDUI R1,R1,#-32
S.D F12,16(R1)
S.D F16,8(R1)
BNE R1,R2,Loop
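The payoff of the three listings can be checked with simple arithmetic. The sketch below assumes one instruction issues per cycle and that each listing shows every stall; the instruction and stall counts are read directly off the listings above, and the helper name is hypothetical.

```python
def cycles_per_element(instructions, stalls, elements):
    """Cycles per array element for one pass through a loop body."""
    return (instructions + stalls) / elements

# Original loop: 5 instructions + 4 stalls, 1 element per iteration.
original = cycles_per_element(5, 4, 1)
# Scheduled loop: 5 instructions + 2 stalls, 1 element per iteration.
scheduled = cycles_per_element(5, 2, 1)
# Unrolled and scheduled loop: 14 instructions, no stalls shown, 4 elements.
unrolled_scheduled = cycles_per_element(14, 0, 4)

print(original, scheduled, unrolled_scheduled)
```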
Strip Mining
• Unknown number of loop iterations?
– Number of iterations = n
– Goal: make k copies of the loop body
– Generate a pair of loops:
• First executes n mod k times, using the original loop body
• Second executes n / k times, using the body unrolled k times
– This technique is called “strip mining”
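A minimal sketch of strip mining applied to the running example `x[i] = x[i] + s`, with k = 4. The function name and the choice of Python are for illustration only; a compiler would emit the equivalent pair of loops in machine code.

```python
K = 4  # unroll factor (k on the slide)

def add_scalar_strip_mined(x, s):
    """Add s to every element of x using a cleanup loop plus a loop unrolled K times."""
    n = len(x)
    i = 0
    # First loop: n mod K iterations of the original one-element body.
    for _ in range(n % K):
        x[i] += s
        i += 1
    # Second loop: n // K iterations, each holding K copies of the body.
    for _ in range(n // K):
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += K
    return x

result = add_scalar_strip_mined(list(range(10)), 1)
print(result)
```

With n = 10 the first loop runs twice and the second loop runs twice, covering all ten elements.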
Challenges to Loop Unrolling
• Limited benefit for large k (Amdahl's law)
• Code size
• Register pressure
Out-of-Order Execution: Scoreboarding
Out-of-Order Execution
• Variable latencies make out-of-order execution desirable
• How do we prevent WAR and WAW hazards?
• How do we deal with variable latency?
– Forwarding for RAW hazards is harder
Instruction 1  2  3  4  5  6  7  8  9  10 11 12 13 14
add 3 2 1   IF ID EX M  WB
mul 6 5 4      IF ID E1 E2 E3 E4 E5 E6 E7 M  WB
div 8 7 6         IF ID x  x  x  x  x  x  E1 E2 E3 E4 …
add 7 2 1            IF ID EX M  WB
sub 8 2 1               IF ID EX M  WB
Scoreboard: a Bookkeeping Technique
• Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
• Scoreboards date to the CDC 6600 in 1963
• Instructions execute whenever they are not dependent on previous instructions and there are no hazards
• CDC 6600: in-order issue, out-of-order execution, out-of-order commit (or completion)
– No forwarding!
– Imprecise interrupt/exception model
Scoreboard Architecture (CDC 6600)
[Figure: registers and functional units (FP Mult, FP Mult, FP Divide, FP Add, Integer), memory, and fetch/issue logic, all connected to the SCOREBOARD]
Scoreboard Implications
• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
– Stall writeback until all prior uses of the destination register have been read
– Read registers only during the Read Operands stage
• Solution for WAW:
– Detect the hazard and stall issue of the new instruction until the other instruction completes
• Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies between instructions that have already issued
• Scoreboard replaces ID, EX, M and WB with 4 stages
Four Stages of Scoreboard Control
• Issue: decode instructions & check for structural hazards (ID1)
– Instructions issued in program order (for hazard checking)
– Don’t issue if structural hazard
– Don’t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)
• Read operands: wait until no data hazards, then read operands (ID2)
– All real dependencies (RAW hazards) are resolved in this stage, since we wait for instructions to write back data
– No forwarding of data in this model!
Four Stages of Scoreboard Control
• Execution: operate on operands (EX)
– The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. This includes memory ops.
• Write result: finish execution (WB)
– Stall until no WAR hazards with previous instructions
– Example:
DIV 1 2 3
ADD 3 4 5
NAND 4 7 6
– The CDC 6600 scoreboard would stall NAND until ADD reads its operands
Three Parts of the Scoreboard
• Instruction status: Which state the instruction is in
• Functional unit status: Indicates the state of the functional unit (FU); nine fields for each functional unit
– Busy: Indicates whether the unit is busy or not
– Op: Operation to perform in the unit (e.g., + or –)
– Fi: Destination register number
– Fj,Fk: Source-register numbers
– Qj,Qk: Functional units producing source registers Fj, Fk
– Rj,Rk: Flags indicating when Fj, Fk are ready
• Register result status
– Indicates which functional unit will write each register, if one exists
– Blank when no pending instructions will write that register
Scoreboard Example
Instruction status:
                              Read  Exec  Write
Instruction    j    k   Issue Oper  Comp  Result
LD     F6      34+  R2
LD     F2      45+  R3
MULTD  F0      F2   F4
SUBD   F8      F6   F2
DIVD   F10     F0   F6
ADDD   F6      F8   F2
Functional unit status:
                        dest S1  S2  FU  FU  Fj? Fk?
Time Name     Busy Op   Fi   Fj  Fk  Qj  Qk  Rj  Rk
     Integer  No
     Mult1    No
     Mult2    No
     Add      No
     Divide   No
Register result status:
Clock     F0  F2  F4  F6  F8  F10 F12 ... F30
      FU
Detailed Scoreboard Pipeline Control
(FU = functional unit used by the instruction; D = destination register; S1, S2 = source registers; ← = assignment)
• Issue
– Wait until: not Busy(FU) and not Result(D)
– Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
• Read operands
– Wait until: Rj and Rk
– Bookkeeping: Rj ← No; Rk ← No
• Execution complete
– Wait until: functional unit done
• Write result
– Wait until: ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
– Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
Scoreboard Example: Cycle 1
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU: F6=Integer
Scoreboard Example: Cycle 2
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU: F6=Integer
• Issue 2nd LD?
Scoreboard Example: Cycle 3
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 No
Mult1 No
Mult2 No
Add No
Divide No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU: F6=Integer
• Issue MULT?
Scoreboard Example: Cycle 4
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU
Scoreboard Example: Cycle 5
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU: F2=Integer
Scoreboard Example: Cycle 6
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6
MULTD F0 F2 F4 6
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add No
Divide No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU: F0=Mult1, F2=Integer
Scoreboard Example: Cycle 7
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 No
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU: F0=Mult1, F2=Integer, F8=Add
• Read multiply operands?
Scoreboard Example: Cycle 8
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU: F0=Mult1, F8=Add, F10=Divide
Scoreboard Example: Cycle 9
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
10 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU: F0=Mult1, F8=Add, F10=Divide
• Read operands for MULT & SUB? Issue ADDD?
Note: the Time column shows the remaining execution cycles
Scoreboard Example: Cycle 10
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
9 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
1 Add Yes Sub F8 F6 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
10 FU: F0=Mult1, F8=Add, F10=Divide
Scoreboard Example: Cycle 11
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
8 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
0 Add Yes Sub F8 F6 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU: F0=Mult1, F8=Add, F10=Divide
Scoreboard Example: Cycle 12
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
7 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU: F0=Mult1, F10=Divide
• Read operands for DIVD?
Scoreboard Example: Cycle 13
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
6 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU: F0=Mult1, F6=Add, F10=Divide
Scoreboard Example: Cycle 14
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
5 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
2 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU: F0=Mult1, F6=Add, F10=Divide
Scoreboard Example: Cycle 15
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
4 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
1 Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU: F0=Mult1, F6=Add, F10=Divide
Scoreboard Example: Cycle 16
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
3 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
0 Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU: F0=Mult1, F6=Add, F10=Divide
Scoreboard Example: Cycle 17
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
2 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
17 FU: F0=Mult1, F6=Add, F10=Divide
• Why not write result of ADD???
WAR Hazard!
Scoreboard Example: Cycle 18
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
1 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
18 FU: F0=Mult1, F6=Add, F10=Divide
Scoreboard Example: Cycle 19
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
0 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
19 FU: F0=Mult1, F6=Add, F10=Divide
Scoreboard Example: Cycle 20
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Yes Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU: F6=Add, F10=Divide
Scoreboard Example: Cycle 21
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Yes Yes
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU: F6=Add, F10=Divide
• WAR Hazard is now gone...
Scoreboard Example: Cycle 22
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16 22
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
39 Divide Yes Div F10 F0 F6 No No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU: F10=Divide
…
Scoreboard Example: Cycle 61
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61
ADDD F6 F8 F2 13 14 16 22
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide Yes Div F10 F0 F6 No No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU: F10=Divide
Scoreboard Example: Cycle 62
Instruction status: Read Exec Write
Instruction j k Issue Oper Comp Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61 62
ADDD F6 F8 F2 13 14 16 22
Functional unit status: dest S1 S2 FU FU Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU
• In-order issue; out-of-order execute & commit
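The cycle-by-cycle walkthrough above can be reproduced with a small timestamp-based model. This is a simplified sketch, not the CDC 6600 control logic: instead of the Busy/Q/R fields it records the cycle in which each instruction finished each stage and applies the same wait conditions (structural and WAW at issue, RAW at read operands, WAR at write result). The execution latencies (load 1, add/sub 2, multiply 10, divide 40) are inferred from the example tables, and all names are hypothetical.

```python
FU_COUNT = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}

# (op, dest, sources, functional-unit type, execution latency in cycles)
PROGRAM = [
    ("LD",    "F6",  ("R2",),      "Integer", 1),
    ("LD",    "F2",  ("R3",),      "Integer", 1),
    ("MULTD", "F0",  ("F2", "F4"), "Mult",   10),
    ("SUBD",  "F8",  ("F6", "F2"), "Add",     2),
    ("DIVD",  "F10", ("F0", "F6"), "Divide", 40),
    ("ADDD",  "F6",  ("F8", "F2"), "Add",     2),
]

def simulate(program, fu_count):
    n = len(program)
    issue, read, done, write = ([None] * n for _ in range(4))
    cycle = 0
    nxt = 0  # next instruction to issue (issue is in program order)

    def before(t):
        """True if the event stamped t happened in an earlier cycle."""
        return t is not None and t < cycle

    while any(w is None for w in write):
        cycle += 1
        for i, (_, dest, srcs, _, lat) in enumerate(program):
            if issue[i] is None or write[i] is not None:
                continue
            if read[i] is None:
                # Read operands: every earlier producer of a source must have written (RAW).
                producers = [j for j in range(i) if program[j][1] in srcs]
                if before(issue[i]) and all(before(write[j]) for j in producers):
                    read[i] = cycle
                    done[i] = cycle + lat  # last execution cycle
            elif before(done[i]):
                # Write result: every earlier reader of dest must have read it (WAR).
                readers = [j for j in range(i) if dest in program[j][2]]
                if all(before(read[j]) for j in readers):
                    write[i] = cycle
        if nxt < n:
            _, dest, _, fu, _ = program[nxt]
            active = [j for j in range(nxt) if not before(write[j])]
            fu_busy = sum(program[j][3] == fu for j in active) >= fu_count[fu]
            waw = any(program[j][1] == dest for j in active)
            if not fu_busy and not waw:  # structural and WAW checks
                issue[nxt] = cycle
                nxt += 1
    return {"issue": issue, "read": read, "done": done, "write": write}

times = simulate(PROGRAM, FU_COUNT)
for name, row in times.items():
    print(f"{name:5}", row)
```

Under these assumptions the model gives the same stage times as the final table, including ADDD waiting until cycle 22 to write F6 because DIVD does not read F6 until cycle 21.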
CDC 6600 Scoreboard
• Historical context:
– Speedup of 1.7 from compiler; 2.5 by hand
– BUT slow memory (no cache) limits benefit
• Limitations of 6600 scoreboard:
– No forwarding hardware
– Limited to instructions in basic block (small window)
– Small number of functional units (structural hazards), especially integer/load store units
– Do not issue on structural hazards
– Wait for WAR hazards
– Prevent WAW hazards
• Precise interrupts?
Instruction-Level Parallelism and Its Exploitation
(Part II)
ECE 154B
Dmitri Strukov
ILP techniques
[Table of ILP techniques, annotated by coverage: last week, this week, next week, not covered]
Scoreboard Technique Review
• Allows out-of-order execution by processing several instructions simultaneously
– In-order issue / out-of-order execution and completion
– Pipe stages: issue, read registers, execute, write back
• Bookkeeping to resolve RAW, WAW, and WAR hazards
– Resolve RAW at the “read registers” stage
– Stall issue for WAW, stall completion (WB) for WAR
Example for the write-result check (time runs downward):
Instr A @t:   r2 ← r4/r5 (not completed)
Instr B @t+1: r6 ← r3*r2 (not completed)
Instr C @t+2: r3 ← r7+r8 (completed)
Scoreboard fields: Fi Fj Fk Qj Qk Rj Rk
[Animation frames: the scoreboard pipeline control table and the A/B/C example are repeated, with field values filled in on the slide: r3, r2; r2; r3; B; No]
Another Dynamic Algorithm: Tomasulo’s Algorithm
• For IBM 360/91, about 3 years after CDC 6600 (1966)
• Goal: high performance without special compilers
• Differences between IBM 360 & CDC 6600 ISA
– IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
– IBM has 4 FP registers vs. 8 in CDC 6600
– IBM has memory-register ops
• Small number of floating-point registers prevented interesting compiler scheduling of operations
– This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
• Why study? The descendants of this have flourished!
– Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …
Tomasulo vs. Scoreboard
• Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard
– FU buffers called “reservation stations”; have pending operands
• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
– Avoids WAR, WAW hazards
– More reservation stations than registers, so can do optimizations compilers can’t
• Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
• Loads and stores treated as FUs with RSs as well
Big picture on ILP for now
1) We are trying to improve performance by exploiting ILP (overlapping the execution of instructions as much as possible)
2) So far, we issue one instruction per cycle and overlap the execution of many by pipelining
- We will broaden this later to allow issue of multiple instructions per cycle (superscalar processors)
3) The problem with pipelining is hazards (WAW, RAW, WAR) and the resulting stalls, which are especially painful with relatively large memory latencies (less of a problem earlier, more of a problem now, a consequence of the memory wall) and FP latencies (a problem earlier, less of a problem now)
4) Reordering instructions (without affecting program correctness) can align (overlap) instructions better to reduce hazards
- Scoreboard allows reordering but deals somewhat efficiently only with RAW, not with WAW and WAR
- WAW and WAR handling is improved in Tomasulo via register renaming (essentially storing the register value instead of the register label in the scoreboard). It is also more efficient due to forwarding (common data bus)
- However, both scoreboard and Tomasulo are limited in reordering given the typically small number of instructions in basic blocks (i.e., they cannot reorder across branches)
Tomasulo Organization
+ register status
Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Value of source operands
– Store buffers have V fields, results to be stored
Qj, Qk: Reservation stations producing source registers (value to be written)
– No ready flags as in Scoreboard; Qj,Qk=0 => ready
– Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy
Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
– If reservation station free (no structural hazard), control issues instr & sends operands (renames registers)
2. Execute: operate on operands (EX)
– When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
– Write on Common Data Bus to all awaiting units; mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast
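The issue and write-result steps above can be sketched with two small functions. This is a minimal illustration of renaming and the “come from” broadcast, not the IBM 360/91 design: the dictionaries, function names, and register values are all hypothetical, timing is ignored, and reservation stations are not freed after broadcasting.

```python
def issue(op, dest, src1, src2, rs_name, stations, reg_status, reg_file):
    """Issue: copy operand values that are ready, record producer tags for the rest."""
    entry = {"Op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for v, q, src in (("Vj", "Qj", src1), ("Vk", "Qk", src2)):
        if src in reg_status:            # a reservation station will produce src
            entry[q] = reg_status[src]
        else:                            # value is already in the register file
            entry[v] = reg_file[src]
    stations[rs_name] = entry
    reg_status[dest] = rs_name           # renaming: dest now refers to this station

def broadcast(tag, value, stations, reg_status, reg_file):
    """Write result: every waiter that expects `tag` captures the value."""
    for entry in stations.values():
        for v, q in (("Vj", "Qj"), ("Vk", "Qk")):
            if entry[q] == tag:
                entry[v], entry[q] = value, None
    for reg in [r for r, t in reg_status.items() if t == tag]:
        reg_file[reg] = value
        del reg_status[reg]

# Assumed register contents; F2 is still being loaded by Load2.
reg_file = {"F0": 0.0, "F2": 0.0, "F4": 4.0, "F6": 6.0, "F8": 8.0, "F10": 0.0}
reg_status = {"F2": "Load2"}
stations = {}

issue("MULTD", "F0", "F2", "F4", "Mult1", stations, reg_status, reg_file)
issue("DIVD", "F10", "F0", "F6", "Mult2", stations, reg_status, reg_file)
issue("ADDD", "F6", "F8", "F2", "Add2", stations, reg_status, reg_file)
broadcast("Load2", 2.5, stations, reg_status, reg_file)   # load of F2 completes
broadcast("Add2", 10.5, stations, reg_status, reg_file)   # ADDD overwrites F6
print(stations["Mult2"], reg_file["F6"])
```

DIVD copied the old value of F6 into Vk when it issued, so ADDD can write F6 before DIVD executes without causing a WAR hazard.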
Tomasulo Example
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 Load1 No
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU
Tomasulo Example Cycle 1
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU: F6=Load1
Tomasulo Example Cycle 2
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU: F2=Load2, F6=Load1
Note: Unlike the 6600, can have multiple loads outstanding (this was not an inherent limitation of scoreboarding)
Tomasulo Example Cycle 3
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU: F0=Mult1, F2=Load2, F6=Load1
• Note: register names are removed (“renamed”) in Reservation Stations; MULT issued here, unlike in the scoreboard
• Load1 completing; what is waiting for Load1?
Tomasulo Example Cycle 4
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 Yes SUBD M(A1) Load2
Add2 No
Add3 No
Mult1 Yes MULTD R(F4) Load2
Mult2 No
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU: F0=Mult1, F2=Load2, F6=M(A1), F8=Add1
• Load2 completing; what is waiting for Load2?
Tomasulo Example Cycle 5
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
2 Add1 Yes SUBD M(A1) M(A2)
Add2 No
Add3 No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU: F0=Mult1, F2=M(A2), F6=M(A1), F8=Add1, F10=Mult2
Tomasulo Example Cycle 6
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU: F0=Mult1, F2=M(A2), F6=Add2, F8=Add1, F10=Mult2
• Issue ADDD here?
Tomasulo Example Cycle 7
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD M(A2) Add1
Add3 No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU: F0=Mult1, F2=M(A2), F6=Add2, F8=Add1, F10=Mult2
• Add1 completing; what is waiting for it?
Tomasulo Example Cycle 8
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
2 Add2 Yes ADDD (M-M) M(A2)
Add3 No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU: F0=Mult1, F2=M(A2), F6=Add2, F8=(M-M), F10=Mult2
Tomasulo Example Cycle 9
Instruction status: Exec Write
Instruction j k Issue Comp Result Busy Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
1 Add2 Yes ADDD (M-M) M(A2)
Add3 No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock 9, FU or value per register: F0 Mult1, F2 M(A2), F6 Add2, F8 (M-M), F10 Mult2
Tomasulo Example Cycle 10
Instruction status:
Instruction j k | Issue, Exec Comp, Write Result | Load buffer: Busy, Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
0 Add2 Yes ADDD (M-M) M(A2)
Add3 No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock 10, FU or value per register: F0 Mult1, F2 M(A2), F6 Add2, F8 (M-M), F10 Mult2
• Add2 completing; what is waiting for it?
Tomasulo Example Cycle 11
Instruction status:
Instruction j k | Issue, Exec Comp, Write Result | Load buffer: Busy, Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock 11, FU or value per register: F0 Mult1, F2 M(A2), F6 (M-M+M), F8 (M-M), F10 Mult2
• Write result of ADDD here?
• All quick instructions complete in this cycle!
Tomasulo Example Cycle 12
Instruction status:
Instruction j k | Issue, Exec Comp, Write Result | Load buffer: Busy, Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock 12, FU or value per register: F0 Mult1, F2 M(A2), F6 (M-M+M), F8 (M-M), F10 Mult2
Tomasulo Example Cycle 13
Instruction status:
Instruction j k | Issue, Exec Comp, Write Result | Load buffer: Busy, Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock 13, FU or value per register: F0 Mult1, F2 M(A2), F6 (M-M+M), F8 (M-M), F10 Mult2
Tomasulo Example Cycle 14
Instruction status:
Instruction j k | Issue, Exec Comp, Write Result | Load buffer: Busy, Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock 14, FU or value per register: F0 Mult1, F2 M(A2), F6 (M-M+M), F8 (M-M), F10 Mult2
Tomasulo Example Cycle 15
Instruction status:
Instruction j k | Issue, Exec Comp, Write Result | Load buffer: Busy, Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:
Clock 15, FU or value per register: F0 Mult1, F2 M(A2), F6 (M-M+M), F8 (M-M), F10 Mult2
Tomasulo Example Cycle 16
Instruction status:
Instruction j k | Issue, Exec Comp, Write Result | Load buffer: Busy, Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock 16, FU or value per register: F0 M*F4, F2 M(A2), F6 (M-M+M), F8 (M-M), F10 Mult2
Faster than light computation (skip a couple of cycles)
Tomasulo Example Cycle 55
Instruction status:
Instruction j k | Issue, Exec Comp, Write Result | Load buffer: Busy, Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock 55, FU or value per register: F0 M*F4, F2 M(A2), F6 (M-M+M), F8 (M-M), F10 Mult2
Tomasulo Example Cycle 56
Instruction status:
Instruction j k | Issue, Exec Comp, Write Result | Load buffer: Busy, Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5 56
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock 56, FU or value per register: F0 M*F4, F2 M(A2), F6 (M-M+M), F8 (M-M), F10 Mult2
• Mult2 is completing; what is waiting for it?
Tomasulo Example Cycle 57
Instruction status:
Instruction j k | Issue, Exec Comp, Write Result | Load buffer: Busy, Address
LD F6 34+ R2 1 3 4 Load1 No
LD F2 45+ R3 2 4 5 Load2 No
MULTD F0 F2 F4 3 15 16 Load3 No
SUBD F8 F6 F2 4 7 8
DIVD F10 F0 F6 5 56 57
ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RS
Time Name Busy Op Vj Vk Qj Qk
Add1 No
Add2 No
Add3 No
Mult1 No
Mult2 Yes DIVD M*F4 M(A1)
Register result status:
Clock 57, FU or value per register: F0 M*F4, F2 M(A2), F6 (M-M+M), F8 (M-M), F10 Result
• Once again: In-order issue, out-of-order execution and completion
Compare to Scoreboard Cycle 62
Instruction status:
Instruction j k | Scoreboard: Issue, Read Oper, Exec Comp, Write Result | Tomasulo: Issue, Exec Comp, Write Result
LD F6 34+ R2 | 1, 2, 3, 4 | 1, 3, 4
LD F2 45+ R3 | 5, 6, 7, 8 | 2, 4, 5
MULTD F0 F2 F4 | 6, 9, 19, 20 | 3, 15, 16
SUBD F8 F6 F2 | 7, 9, 11, 12 | 4, 7, 8
DIVD F10 F0 F6 | 8, 21, 61, 62 | 5, 56, 57
ADDD F6 F8 F2 | 13, 14, 16, 22 | 6, 10, 11
• Why does it take longer on the scoreboard/6600?
– WAR and WAW hazards are not handled efficiently
– Structural hazards
– Lack of forwarding
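The Tomasulo columns of the comparison above can be reproduced with a deliberately simplified timing model. The sketch below is only an illustration: the latencies are inferred from the example tables (2 cycles for loads and add/subtract, 10 for multiply, 40 for divide), it assumes one issue per cycle with a reservation station always free and no CDB conflicts, and the names `LATENCY`, `PROGRAM`, and `tomasulo_timing` are made up for this example.

```python
# Latencies inferred from the example tables (cycles of execution per operation).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

# (op, destination register, source FP registers) in program order.
PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def tomasulo_timing(program, latency):
    """Return (issue, exec_complete, write_result) for each instruction.

    Assumptions: one issue per cycle, a reservation station is always
    free, and no two results compete for the CDB in the same cycle.
    """
    producer = {}  # register -> index of the latest issued writer
    times = []
    for index, (op, dest, sources) in enumerate(program):
        issue = index + 1
        # Only true (RAW) dependences delay execution: each source waits
        # for the write-result cycle of the instruction it was renamed to.
        operand_ready = [times[producer[r]][2] for r in sources if r in producer]
        start_after = max([issue] + operand_ready)
        exec_complete = start_after + latency[op]
        times.append((issue, exec_complete, exec_complete + 1))
        producer[dest] = index  # later readers of dest wait on this instruction
    return times


for (op, dest, _), t in zip(PROGRAM, tomasulo_timing(PROGRAM, LATENCY)):
    print(op, dest, t)
```

Because DIVD records LD F6 as the producer of its F6 operand at issue time, the later ADDD that overwrites F6 does not delay it, which is how renaming removes the WAR stall seen on the scoreboard.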
Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC 6600)
Tomasulo (IBM 360/91) | Scoreboard (CDC 6600)
Pipelined functional units | Multiple functional units
(6 load, 3 store, 3 +, 2 ×/÷) | (1 load/store, 1 +, 2 ×, 1 ÷)
Window size: ≤ 14 instructions | ≤ 5 instructions
No issue on structural hazard | Same
WAR: renaming avoids | Stall completion
WAW: renaming avoids | Stall issue
Broadcast results from FU | Write/read registers
Control: reservation stations | Central scoreboard
Tomasulo Drawbacks
• Complexity: delays of 360/91, MIPS 10000, IBM 620?
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
– Each CDB must go to multiple functional units → high capacitance, high wiring density
– Number of functional units that can complete per cycle limited to one! (Multiple CDBs → more FU logic for parallel associative stores)
• Non-precise interrupts!
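The associative match behind every CDB broadcast can be sketched as a loop over all waiting reservation stations. This is a minimal illustration, not a hardware description: the field names Vj, Vk, Qj, Qk follow the example tables, while the `broadcast` function and the dictionary layout are assumptions made for this sketch. In hardware all comparisons happen in parallel, which is where the wiring and capacitance cost comes from.

```python
def broadcast(stations, tag, value):
    """Deliver one CDB result to every reservation station waiting on `tag`."""
    for rs in stations.values():
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None  # first operand captured
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None  # second operand captured


# Waiting stations as they appear at cycle 7 of the example.
stations = {
    "Add2": {"op": "ADDD", "Vj": None, "Vk": "M(A2)", "Qj": "Add1", "Qk": None},
    "Mult2": {"op": "DIVD", "Vj": None, "Vk": "M(A1)", "Qj": "Mult1", "Qk": None},
}

# Add1 (SUBD) writes its result in cycle 8; only Add2 was waiting for it.
broadcast(stations, "Add1", "(M-M)")
print(stations["Add2"])
print(stations["Mult2"])
```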
Summary on Tomasulo
• Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands
– Prevents registers from becoming a bottleneck
– Avoids the WAR and WAW hazards of the scoreboard
– Allows loop unrolling in HW (when branches can be resolved quickly)
• Helps with cache misses as well
• Lasting contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
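To make the renaming point concrete, here is a small sketch that renames the six-instruction example. It is an illustration under stated assumptions: Tomasulo uses reservation-station names as tags, whereas this sketch hands out a fresh hypothetical tag (`T0`, `T1`, ...) per instruction, and the `rename` function is invented for the example.

```python
def rename(program):
    """Give every destination a fresh tag; sources read the latest tag.

    After renaming, the only dependences left are true (RAW) ones,
    because no two instructions write the same name.
    """
    latest = {}  # architectural register -> tag of its most recent writer
    renamed = []
    for n, (op, dest, sources) in enumerate(program):
        new_sources = [latest.get(r, r) for r in sources]  # unwritten regs keep their name
        tag = f"T{n}"
        latest[dest] = tag
        renamed.append((op, tag, new_sources))
    return renamed


program = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]

for row in rename(program):
    print(row)
```

DIVD ends up reading `T0` (the first load) while ADDD writes `T5`, so the WAR hazard on F6 between them disappears.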
Summary on ILP for now
1) We are trying to improve performance by exploiting ILP (overlapping the execution of instructions as much as possible)
2) So far, we issue one instruction per cycle and overlap the execution of many by pipelining
- We will broaden this later to allow issue of multiple instructions per cycle (superscalar processors)
3) The problem with pipelining is hazards (WAW, RAW, WAR) and the resulting stalls, which are especially painful with relatively large memory latencies (less of a problem earlier, more of a problem now, as a consequence of the memory wall phenomenon) and FP latencies (a problem earlier, less of a problem now)
4) Reordering instructions (without affecting program correctness) can overlap instructions better and so reduce hazards
- Scoreboard allows reordering but deals reasonably efficiently only with RAW, not with WAW and WAR
- WAW and WAR are improved in Tomasulo via register renaming (essentially storing the register value rather than the register label used in the scoreboard). Tomasulo is also more efficient due to forwarding (common data bus)
- However, both scoreboard and Tomasulo are limited in reordering given the typically small number of instructions in basic blocks (i.e. they cannot reorder across branches)
5) Solution: execute instructions speculatively to have more instructions available for reordering
- Need a mechanism to recover from wrong execution.
Exploiting More ILP
• Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP (will look at branch prediction implementations next week)
• One way to overcome control dependencies is with speculation
– Make a guess and execute the program as if the guess is correct
– Need mechanisms to handle the case when the speculation is incorrect
• Can do some speculation in the compiler
– Reordered / duplicated instructions around branches
Hardware-Based Speculation
• Extends the idea of dynamic scheduling with three key ideas:
1. Dynamic branch prediction
2. Speculation to allow the execution of instructions before control dependencies are resolved
3. Dynamic scheduling to deal with scheduling different combinations of basic blocks
• What we saw earlier was within a basic block
• Modern processors started using speculation around the introduction of the PowerPC 603 and Intel Pentium II, and extend Tomasulo’s approach to support speculation
Speculating with Tomasulo
• Separate execution from completion
– Allow instructions to execute speculatively but do not let instructions update registers or memory until they are no longer speculative
• Instruction Commit
– After an instruction is no longer speculative it is allowed to make register and memory updates
• Allow instructions to execute and complete out of order but force them to commit in order
• Add a hardware buffer, called the reorder buffer (ROB), with registers to hold the result of an instruction between completion and commit
– Acts as a FIFO queue in the order issued
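The "complete out of order, commit in order" rule can be sketched with a FIFO queue. This is a minimal model under stated assumptions: only register-writing instructions are handled, and the class name `ReorderBuffer` and its methods are invented for the example rather than taken from the slides.

```python
from collections import deque


class ReorderBuffer:
    """FIFO of in-flight instructions, oldest at the left."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()
        self.next_tag = 1

    def issue(self, dest):
        """Allocate a slot in program order; return its tag, or None if full."""
        if len(self.entries) == self.size:
            return None  # structural hazard: issue must stall
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "ready": False})
        return tag

    def write_result(self, tag, value):
        """Record a finished result; this may happen in any order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["ready"] = value, True

    def commit(self, regs):
        """Retire ready entries from the head only, updating registers in order."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            regs[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed


rob, regs = ReorderBuffer(size=4), {}
older = rob.issue("F0")
younger = rob.issue("F10")
rob.write_result(younger, 2.0)   # the younger instruction finishes first
print(rob.commit(regs))          # nothing commits: the head is not ready
rob.write_result(older, 1.0)
print(rob.commit(regs), regs)    # both commit, oldest first
```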
Original Tomasulo Architecture
Tomasulo and Reorder Buffer
• Sits between Execution and Register File
• Source of operands
• In this case integrated with the store buffer
• Reservation stations use ROB slot as a tag
• Instructions commit at head of ROB FIFO queue
– Easy to undo speculated instructions on mispredicted branches or on exceptions
ROB Data Structure
• Instruction Type Field
– Indicates whether the instruction is a branch, store, or register operation
• Destination Field
– Register number for loads and ALU ops, or memory address for stores
• Value Field
– Holds the value of the instruction result until the instruction commits
• Ready Field
– Indicates whether the instruction has completed execution and the value is ready
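The four fields above map directly onto a small record type. The sketch below is one possible encoding; the class name `ROBEntry`, the field names, and the string values used for the instruction type are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ROBEntry:
    kind: str                       # instruction type: "branch", "store", or "register"
    dest: Union[str, int, None]     # register name, or memory address for a store
    value: Optional[float] = None   # result, held here until commit
    ready: bool = False             # True once execution has completed


# A load into F0 is issued, then its result arrives later.
entry = ROBEntry(kind="register", dest="F0")
print(entry)
entry.value, entry.ready = 3.5, True
print(entry)
```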
Instruction Execution
1. Issue: Get an instruction from the Instruction Queue
If a reservation station and the ROB both have a free slot (no structural hazard), issue the instruction to the reservation station and the ROB, and send operands to the reservation station if they are available in the register file or the ROB. The allocated ROB slot number is sent to the reservation station to use as a tag when placing data on the CDB.
2. Execution: Operate on operands (EX)
When both operands are ready, execute; if not ready, watch the CDB for the result
3. Write result: Finish execution (WB)
Write on the CDB to all awaiting units and to the ROB using the tag; mark the reservation station available
4. Commit: Update register or memory with the ROB result
When an instruction reaches the head of the ROB and its result is present, update the register with the result or store to memory, and remove the instruction from the ROB
If an incorrectly predicted branch reaches the head of the ROB, flush the ROB and restart at the correct successor of the branch
Blue text = Change from Tomasulo
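The commit step, including the flush on a mispredicted branch, can be sketched as follows. This is a simplified model under stated assumptions: the ROB is a plain list with the oldest entry first, at most one instruction commits per call, and the entry keys (`kind`, `ready`, `mispredicted`, `target`) and the function name `commit_step` are hypothetical.

```python
def commit_step(rob, regs, memory):
    """Commit at most one instruction from the head of the ROB.

    Returns the restart target if a mispredicted branch was at the
    head (after flushing the ROB), otherwise None.
    """
    if not rob or not rob[0]["ready"]:
        return None  # head has not finished executing yet
    head = rob.pop(0)
    if head["kind"] == "branch":
        if head["mispredicted"]:
            rob.clear()  # every younger entry is speculative: squash it
            return head["target"]
    elif head["kind"] == "store":
        memory[head["dest"]] = head["value"]  # memory is updated only at commit
    else:
        regs[head["dest"]] = head["value"]    # register is updated only at commit
    return None


regs, memory = {}, {}
rob = [
    {"kind": "register", "dest": "F2", "value": 2.0, "ready": True},
    {"kind": "branch", "ready": True, "mispredicted": True, "target": "L"},
    {"kind": "register", "dest": "F4", "value": 9.0, "ready": True},
]
print(commit_step(rob, regs, memory), regs)  # commits the write to F2
print(commit_step(rob, regs, memory), rob)   # branch flushes the younger entry
```

The write to F4 had already produced a value, but because it sits behind the mispredicted branch it never reaches the register file.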
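Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer: FP op queue, registers, reservation stations, FP adders, FP multipliers, paths to and from memory; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state; I = issued, E = executing, W = result written, C = committed):
ROB1 | F0 | - | LD F0, 10(R2) | I
ROB2 to ROB7: empty
Load buffer (dest | address): 1 | 10+R2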
Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB1 | F0 | - | LD F0, 10(R2) | E
ROB2 to ROB7: empty
Load buffer (dest | address): 1 | 10+R2
Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB2 | F10 | - | ADDD F10,F4,F0 | I
ROB1 | F0 | - | LD F0, 10(R2) | E
ROB3 to ROB7: empty
Reservation stations, FP adders (dest | op | operands): 2 | ADDD | R(F4), 1
Load buffer (dest | address): 1 | 10+R2
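Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB3 | F2 | - | MULD F2,F10,F6 | I
ROB2 | F10 | - | ADDD F10,F4,F0 | I
ROB1 | F0 | - | LD F0, 10(R2) | E
ROB4 to ROB7: empty
Reservation stations, FP adders (dest | op | operands): 2 | ADDD | R(F4), 1
Reservation stations, FP multipliers (dest | op | operands): 3 | MULD | 2, R(F6)
Load buffer (dest | address): 1 | 10+R2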
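Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7: empty
ROB6 | F0 | - | ADDD F0,F4,F6 | I
ROB5 | F4 | - | LD F4,0(R3) | E
ROB4 | -- | - | BNE F0, 0, L | I
ROB3 | F2 | - | MULD F2,F10,F6 | I
ROB2 | F10 | - | ADDD F10,F4,F0 | I
ROB1 | F0 | - | LD F0, 10(R2) | E
Reservation stations, FP adders (dest | op | operands): 2 | ADDD | R(F4), 1; 6 | ADDD | 5, R(F6)
Reservation stations, FP multipliers (dest | op | operands): 3 | MULD | 2, R(F6)
Load buffer (dest | address): 1 | 10+R2; 5 | 0+R3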
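Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | ROB5 | ST F4, 0(R3) | I
ROB6 | F0 | - | ADDD F0,F4,F6 | I
ROB5 | F4 | - | LD F4,0(R3) | E
ROB4 | -- | - | BNE F0, 0, L | I
ROB3 | F2 | - | MULD F2,F10,F6 | I
ROB2 | F10 | - | ADDD F10,F4,F0 | I
ROB1 | F0 | - | LD F0, 10(R2) | E
Reservation stations, FP adders (dest | op | operands): 2 | ADDD | R(F4), 1; 6 | ADDD | 5, R(F6)
Reservation stations, FP multipliers (dest | op | operands): 3 | MULD | 2, R(F6)
Load buffer (dest | address): 1 | 10+R2; 5 | 0+R3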
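Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | V1 | ST F4, 0(R3) | W
ROB6 | F0 | - | ADDD F0,F4,F6 | I
ROB5 | F4 | V1 | LD F4,0(R3) | W
ROB4 | -- | - | BNE F0, 0, L | I
ROB3 | F2 | - | MULD F2,F10,F6 | I
ROB2 | F10 | - | ADDD F10,F4,F0 | I
ROB1 | F0 | - | LD F0, 10(R2) | E
Reservation stations, FP adders (dest | op | operands): 2 | ADDD | R(F4), 1; 6 | ADDD | V1, R(F6)
Reservation stations, FP multipliers (dest | op | operands): 3 | MULD | 2, R(F6)
Load buffer (dest | address): 1 | 10+R2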
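Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | V1 | ST F4, 0(R3) | W
ROB6 | F0 | - | ADDD F0,F4,F6 | E
ROB5 | F4 | V1 | LD F4,0(R3) | W
ROB4 | -- | - | BNE F0, 0, L | I
ROB3 | F2 | - | MULD F2,F10,F6 | I
ROB2 | F10 | - | ADDD F10,F4,F0 | I
ROB1 | F0 | - | LD F0, 10(R2) | E
Reservation stations, FP adders (dest | op | operands): 2 | ADDD | R(F4), 1; 6 | ADDD | V1, R(F6)
Reservation stations, FP multipliers (dest | op | operands): 3 | MULD | 2, R(F6)
Load buffer (dest | address): 1 | 10+R2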
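Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | V1 | ST F4, 0(R3) | W
ROB6 | F0 | V2 | ADDD F0,F4,F6 | W
ROB5 | F4 | V1 | LD F4,0(R3) | W
ROB4 | -- | - | BNE F0, 0, L | I
ROB3 | F2 | - | MULD F2,F10,F6 | I
ROB2 | F10 | - | ADDD F10,F4,F0 | I
ROB1 | F0 | - | LD F0, 10(R2) | E
Reservation stations, FP adders (dest | op | operands): 2 | ADDD | R(F4), 1
Reservation stations, FP multipliers (dest | op | operands): 3 | MULD | 2, R(F6)
Load buffer (dest | address): 1 | 10+R2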
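Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | V1 | ST F4, 0(R3) | W
ROB6 | F0 | V2 | ADDD F0,F4,F6 | W
ROB5 | F4 | V1 | LD F4,0(R3) | W
ROB4 | -- | - | BNE F0, 0, L | I
ROB3 | F2 | - | MULD F2,F10,F6 | I
ROB2 | F10 | - | ADDD F10,F4,F0 | I
ROB1 | F0 | V3 | LD F0, 10(R2) | W
Reservation stations, FP adders (dest | op | operands): 2 | ADDD | R(F4), V3
Reservation stations, FP multipliers (dest | op | operands): 3 | MULD | 2, R(F6)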
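Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | V1 | ST F4, 0(R3) | W
ROB6 | F0 | V2 | ADDD F0,F4,F6 | W
ROB5 | F4 | V1 | LD F4,0(R3) | W
ROB4 | -- | - | BNE F0, 0, L | E
ROB3 | F2 | - | MULD F2,F10,F6 | I
ROB2 | F10 | - | ADDD F10,F4,F0 | E
ROB1 | F0 | V3 | LD F0, 10(R2) | C
Reservation stations, FP adders (dest | op | operands): 2 | ADDD | R(F4), V3
Reservation stations, FP multipliers (dest | op | operands): 3 | MULD | 2, R(F6)
Registers: F0=V3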
Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | V1 | ST F4, 0(R3) | W
ROB6 | F0 | V2 | ADDD F0,F4,F6 | W
ROB5 | F4 | V1 | LD F4,0(R3) | W
ROB4 | -- | - | BNE F0, 0, L | W
ROB3 | F2 | - | MULD F2,F10,F6 | I
ROB2 | F10 | V4 | ADDD F10,F4,F0 | W
ROB1 | F0 | V3 | LD F0, 10(R2) | C
Reservation stations, FP multipliers (dest | op | operands): 3 | MULD | V4, R(F6)
Registers: F0=V3
Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | V1 | ST F4, 0(R3) | W
ROB6 | F0 | V2 | ADDD F0,F4,F6 | W
ROB5 | F4 | V1 | LD F4,0(R3) | W
ROB4 | -- | - | BNE F0, 0, L | W
ROB3 | F2 | - | MULD F2,F10,F6 | E
ROB2 | F10 | V4 | ADDD F10,F4,F0 | C
ROB1 | F0 | V3 | LD F0, 10(R2) | C
Reservation stations, FP multipliers (dest | op | operands): 3 | MULD | V4, R(F6)
Registers: F0=V3, F10=V4
Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | V1 | ST F4, 0(R3) | W
ROB6 | F0 | V2 | ADDD F0,F4,F6 | W
ROB5 | F4 | V1 | LD F4,0(R3) | W
ROB4 | -- | - | BNE F0, 0, L | W
ROB3 | F2 | V5 | MULD F2,F10,F6 | W
ROB2 | F10 | V4 | ADDD F10,F4,F0 | C
ROB1 | F0 | V3 | LD F0, 10(R2) | C
Registers: F0=V3, F10=V4
Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | V1 | ST F4, 0(R3) | W
ROB6 | F0 | V2 | ADDD F0,F4,F6 | W
ROB5 | F4 | V1 | LD F4,0(R3) | W
ROB4 | -- | - | BNE F0, 0, L | W
ROB3 | F2 | V5 | MULD F2,F10,F6 | C
ROB2 | F10 | V4 | ADDD F10,F4,F0 | C
ROB1 | F0 | V3 | LD F0, 10(R2) | C
Registers: F0=V3, F10=V4, F2=V5
Tomasulo With ROB
[Figure: Tomasulo datapath with reorder buffer; ROB1 is the oldest entry]
Reorder buffer (entry | dest | value | instruction | state):
ROB7 | [R3] | V1 | ST F4, 0(R3) | W
ROB6 | F0 | V2 | ADDD F0,F4,F6 | W
ROB5 | F4 | V1 | LD F4,0(R3) | W
ROB4 | -- | - | BNE F0, 0, L | C
ROB3 | F2 | V5 | MULD F2,F10,F6 | C
ROB2 | F10 | V4 | ADDD F10,F4,F0 | C
ROB1 | F0 | V3 | LD F0, 10(R2) | C
Registers: F0=V3, F10=V4, F2=V5
Avoiding Memory Hazards
• A store only updates memory when it reaches the head of the ROB
– Otherwise WAW and WAR hazards are possible
– By waiting to reach the head, memory is updated in order and no earlier loads or stores can still be pending
• If a load accesses a memory location written to by an earlier store, it cannot perform the memory access until the store has written the data
– Prevents RAW hazard through memory
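The load rule above can be sketched as a check against the older stores still sitting in the ROB. This is only an illustration: the function name `load_may_access_memory` and the entry layout are hypothetical, and treating a store whose address is not yet known (`None`) as a conflict is a conservative assumption added here, not something the slide states.

```python
def load_may_access_memory(rob, load_index):
    """Return True if the load at rob[load_index] may read memory now.

    `rob` is ordered oldest first. A store leaves the ROB only when it
    commits, i.e. when it has written memory, so any store still ahead
    of the load has not yet written its data.
    """
    address = rob[load_index]["addr"]
    for entry in rob[:load_index]:
        if entry["kind"] != "store":
            continue
        # Assumption: an unknown store address is treated as a possible match.
        if entry["addr"] is None or entry["addr"] == address:
            return False  # RAW hazard through memory: wait for the store
    return True


rob = [
    {"kind": "store", "addr": 100},
    {"kind": "load", "addr": 100},
    {"kind": "load", "addr": 200},
]
print(load_may_access_memory(rob, 1))  # same address as the pending store
print(load_may_access_memory(rob, 2))  # different address
```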
Reorder Buffer Implementation
• In practice
– Try to recover as early as possible after a branch is mispredicted rather than wait until the branch reaches the head
– Performance in speculative processors is more sensitive to branch prediction
• Higher cost of misprediction
• Exceptions (will look into that next week)
– Don’t recognize the exception until it is ready to commit
– Could try to handle exceptions as they arise and earlier branches are resolved, but this is more challenging
Acknowledgements
Some of the slides contain material developed and copyrighted by Sally A. McKee and K. Mock (Cornell University) and instructor material for the textbook