ECE4750/CS4420 Computer Architecture
L9: Tomasulo’s Algorithm
Edward Suh
Computer Systems Laboratory
Announcements
Lab2 grade
• Will be out on Friday
• Re-grade request within a week
HW2 due Sunday
Class schedule next week
• Tuesday: no class (fall break)
• Thursday: no class, evening prelim 7:30-9:30
ECE4750/CS4420 — Computer Architecture, Fall 2008
Overview
Scoreboard review
Limitations of Scoreboard
• WAR & WAW hazards
Tomasulo’s algorithm
Example
Reading: Chapter 2.4 & 2.5
Scoreboard Review
Step 1: issue
• if FU available (structural), and
• if no earlier instruction writes to same destination (WAW), then
• send instruction to FU
Step 2: read operands (a.k.a. dispatch)
• if no operand pending update (RAW), then
• instruct FU to read operands and start execution
Step 3: execution
• inform scoreboard at completion
Step 4: write result (a.k.a. retire)
• if WAR hazard possible, stall at WB stage
Parts of Scoreboard
Instruction Status – which of the four steps each instruction is in
FU Status – is FU available?
• Busy – FU is busy
• Op – operation to be performed
• Fi, Fj, Fk – Destination and source registers
• Qj, Qk – FUs producing Fj, Fk
• Rj, Rk – Operand-ready flags; reset after operands are read
Register Status – is a register (Reg) up-to-date?
• Result[Reg] – which FU will write Reg
Scoreboard Details
Issue
• Wait until there is no structural hazard (not Busy[FU]) and no WAW hazard (not Result[D])
• Busy[FU] = yes; Op[FU] = op; Fi[FU] = D; Fj[FU] = S1; Fk[FU] = S2; Qj[FU] = Result[S1]; Qk[FU] = Result[S2]; Rj[FU] = not Qj[FU]; Rk[FU] = not Qk[FU]; Result[D] = FU
Read operands
• Wait until there is no RAW hazard: both Rj and Rk are yes
• Rj = No; Rk = No; Qj = 0; Qk = 0
Execution
Write result
• Wait until there is no WAR hazard: for every other FU f, no source (Fj[f], Fk[f]) that is still to be read from the register file (Rj[f], Rk[f] == yes) matches the register about to be overwritten (Fi[FU])
• For every FU f: Rj[f] = yes if Qj[f] == FU; Rk[f] = yes if Qk[f] == FU; Result[Fi[FU]] = 0; Busy[FU] = No
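The bookkeeping above can be sketched in a few lines of Python. This is only an illustrative model, not the CDC 6600 logic: the class names `FUStatus` and `Scoreboard` and the method names are assumptions made for this sketch, `None` stands in for the slide's "0", and execution latency is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUStatus:
    """One row of the FU status table (field names follow the slide)."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # FU producing Fj (None = no pending producer)
    qk: Optional[str] = None   # FU producing Fk
    rj: bool = False           # Fj is ready and has not been read yet
    rk: bool = False           # Fk is ready and has not been read yet


class Scoreboard:
    def __init__(self, fu_names):
        self.fu: Dict[str, FUStatus] = {n: FUStatus() for n in fu_names}
        self.result: Dict[str, str] = {}   # register -> FU that will write it

    def can_issue(self, name, dest):
        # No structural hazard (FU free) and no WAW hazard (no pending writer).
        return not self.fu[name].busy and dest not in self.result

    def issue(self, name, op, dest, s1, s2):
        assert self.can_issue(name, dest)
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[name] = FUStatus(True, op, dest, s1, s2, qj, qk,
                                 qj is None, qk is None)
        self.result[dest] = name

    def can_read(self, name):
        # No RAW hazard: both operands are ready.
        return self.fu[name].rj and self.fu[name].rk

    def read_operands(self, name):
        assert self.can_read(name)
        f = self.fu[name]
        f.rj = f.rk = False
        f.qj = f.qk = None

    def can_write(self, name):
        # WAR check: no other busy FU still has to read the old value of Fi.
        dest = self.fu[name].fi
        for other, f in self.fu.items():
            if other == name or not f.busy:
                continue
            if (f.fj == dest and f.rj) or (f.fk == dest and f.rk):
                return False
        return True

    def write_result(self, name):
        assert self.can_write(name)
        for f in self.fu.values():
            if f.qj == name:
                f.rj = True
            if f.qk == name:
                f.rk = True
        del self.result[self.fu[name].fi]
        self.fu[name] = FUStatus()
```

With fdiv f10,f0,f6 issued but still waiting for f0, a later fadd f6,f8,f2 can issue and read its operands, but `can_write` keeps returning False for it until fdiv has read f6. That is the WAR stall from Step 4.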
Scoreboard: Example
Instruction Status
Instruction         I    R    E    W
fld1 f6,34($2)
fld2 f2,45($3)
fmul f0,f2,f4
fsub f8,f6,f2
fdiv f10,f0,f6
fadd f6,f8,f2
FU Status
FU     Busy  Op    Fi    Fj    Fk    Qj    Qk    Rj    Rk
Int
Mul1
Mul2
Add
Div
Register Result Status
      F0      F2      F4      F6      F8      F10     F12     …       F30
FU
Clock: 0
Latencies: fadd – 2 cycles, fmul – 10 cycles, fdiv – 40 cycles, fld – 1 cycle (cache hit)
Limitations: an Example
     Instruction           Latency
1    LD    F2, 34(R2)      1
2    LD    F4, 45(R3)      long
3    MULTD F6, F4, F2      3
4    SUBD  F8, F2, F2      1
5    DIVD  F4, F2, F8      4
6    ADDD  F10, F6, F4     1
[Figure: dependence graph over instructions 1-6]
In-order: 1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order: 1 (2,1)
Limitations in ISA
Which features of an ISA limit the number of instructions in the pipeline?
Instruction-level Parallelism via Renaming
     Instruction           Latency
1    LD    F2, 34(R2)      1
2    LD    F4, 45(R3)      long
3    MULTD F6, F4, F2      3
4    SUBD  F8, F2, F2      1
5    DIVD  F4’, F2, F8     4
6    ADDD  F10, F6, F4’    1
[Figure: dependence graph over instructions 1-6, with one edge crossed out (X)]
In-order: 1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order: 1 (2,1)
Any name dependence can be eliminated by renaming. (Renaming requires additional storage.)
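The claim can be checked mechanically on the six-instruction example. The sketch below is illustrative only: `PROGRAM`, `name_dependences`, `rename`, and the fresh names T1..T6 are assumptions made for this example. It renames every destination, which is more aggressive than the slide's single rename of F4 to F4’, and it reports every WAR/WAW register match without checking for intervening writes.

```python
from typing import List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]   # (opcode, destination, sources)

PROGRAM: List[Instr] = [
    ("LD",    "F2",  ("R2",)),
    ("LD",    "F4",  ("R3",)),
    ("MULTD", "F6",  ("F4", "F2")),
    ("SUBD",  "F8",  ("F2", "F2")),
    ("DIVD",  "F4",  ("F2", "F8")),
    ("ADDD",  "F10", ("F6", "F4")),
]


def name_dependences(prog):
    """Return (kind, earlier, later, register) for each WAR/WAW pair (1-based)."""
    found = []
    for j, (_, dest_j, _) in enumerate(prog):
        for i in range(j):
            _, dest_i, srcs_i = prog[i]
            if dest_j == dest_i:
                found.append(("WAW", i + 1, j + 1, dest_j))
            if dest_j in srcs_i:
                found.append(("WAR", i + 1, j + 1, dest_j))
    return found


def rename(prog):
    """Give every write a fresh name; each source reads the latest name."""
    current = {}          # architectural register -> most recent fresh name
    renamed = []
    for n, (op, dest, srcs) in enumerate(prog, start=1):
        new_srcs = tuple(current.get(s, s) for s in srcs)
        current[dest] = f"T{n}"
        renamed.append((op, current[dest], new_srcs))
    return renamed


print(name_dependences(PROGRAM))          # name dependences on F4
print(name_dependences(rename(PROGRAM)))  # none left after renaming
```

Before renaming, the only name dependences are on F4 (instruction 5 against instructions 2 and 3). After renaming, the list is empty while every true (RAW) dependence is preserved through the renamed sources.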
Dynamic Scheduling by Tomasulo’s Algorithm
Developed for IBM 360/91 three years after CDC 6600
Goal was high performance without compiler help
• only four floating-point registers
• wanted portability of code
Innovations over scoreboard
• control and buffers distributed: “reservation stations”
• source operands point to reservation stations
– renaming, eliminates WAR, WAW hazards
– Common Data Bus (CDB) broadcasts results
Original IBM 360/91 used reg-mem ISA, but we’ll use MIPS ISA instead
Tomasulo-based FPU
Three Steps to Instruction Execution
Step 1: issue
• if reservation station available (structural), then
• rename operands, send instruction to reservation station
• read operands from a register file
Step 2: execution
• if operand(s) not available, monitor CDB (snoop)
• inform control logic at completion
Step 3: write result
• broadcast result via CDB
• if no WAW hazard, update register
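Step 3 can be sketched as a single broadcast function. This is a minimal illustration rather than the IBM 360/91 design: the names `Station` and `broadcast` are assumptions, and "no WAW hazard" is modeled as "the register status still names the broadcasting station".

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Station:
    op: str
    vj: Optional[float] = None   # captured source values
    vk: Optional[float] = None
    qj: Optional[str] = None     # tags of the stations still producing them
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.qj is None and self.qk is None


def broadcast(tag: str, value: float,
              stations: Dict[str, Station],
              qi: Dict[str, str],
              regs: Dict[str, float]) -> None:
    """Put (tag, value) on the CDB."""
    # Every waiting station snoops the bus and captures a matching operand.
    for rs in stations.values():
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    # A register is written only if this station is still its latest producer;
    # if a younger instruction has renamed it, the stale write is dropped.
    for reg, producer in list(qi.items()):
        if producer == tag:
            regs[reg] = value
            del qi[reg]
```

For instance, if f6 has already been renamed to Add2 when Load1 broadcasts, a station waiting on Load1 still captures the loaded value, but the register file entry for f6 is left for Add2 to write.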
Parts of Tomasulo
Instruction Status – which of the three steps each instruction is in
Reservation Station Status – is the reservation station available?
• Busy – reservation station is busy
• Op – operation to be performed
• Address – effective address (if load/store)
• Vj, Vk – Source values (not registers!)
• Qj, Qk – reservation stations producing Vj, Vk for this instruction
Register Status – which reservation station will update each register
• Qi – reservation station producing the most recent value for the register
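These fields map directly onto a small record, and the issue step is where they get filled in. The sketch below is hypothetical: `ReservationStation` and `issue` are illustrative names, and values are kept symbolic (the string "f4" stands for "the current contents of f4").

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReservationStation:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None   # source value (symbolic in this sketch)
    vk: Optional[str] = None
    qj: Optional[str] = None   # station that will produce Vj; None if Vj is valid
    qk: Optional[str] = None
    a: Optional[str] = None    # effective address, loads and stores only


def issue(stations: Dict[str, ReservationStation], qi: Dict[str, str],
          name: str, op: str, dest: str, src_j: str, src_k: str) -> bool:
    """Issue one FP operation into station `name`; False on a structural hazard."""
    if stations[name].busy:
        return False
    rs = ReservationStation(busy=True, op=op)
    # Rename each source: point at the producing station if a write is pending,
    # otherwise copy the value out of the register file right away.
    if src_j in qi:
        rs.qj = qi[src_j]
    else:
        rs.vj = src_j
    if src_k in qi:
        rs.qk = qi[src_k]
    else:
        rs.vk = src_k
    stations[name] = rs
    qi[dest] = name            # later readers of dest now wait on this station
    return True
```

Issuing fmul f0,f2,f4 while f2 is still owed by Load2 leaves Mul1 with Qj = Load2 and Vk = f4, and sets Qi[f0] = Mul1. Because operands are captured or tagged at issue, a later writer of the same register only has to overwrite Qi, and no stall is needed.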
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f6,34($2)
fld2 f2,45($3)
fmul f0,f2,f4
fsub f8,f6,f2
fdiv f10,f0,f6
fadd f6,f8,f2
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
      Load1
      Load2
      Add1
      Add2
      Add3
      Mul1
      Mul2
Register Result Status
      F0      F2      F4      F6      F8      F10     F12     …       F30
Qi
Clock
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f6,34($2)      1
fld2 f2,45($3)
fmul f0,f2,f4
fsub f8,f6,f2
fdiv f10,f0,f6
fadd f6,f8,f2
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
      Load1  Y     ld    $2                          34
      Load2
      Add1
      Add2
      Add3
      Mul1
      Mul2
Register Result Status
      F0      F2      F4      F6      F8      F10     F12     …       F30
Qi                            Load1
Clock: 1
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f6,34($2)      1
fld2 f2,45($3)      2
fmul f0,f2,f4
fsub f8,f6,f2
fdiv f10,f0,f6
fadd f6,f8,f2
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
1     Load1  Y     ld    $2                          $2+34
      Load2  Y     ld    $3                          45
      Add1
      Add2
      Add3
      Mul1
      Mul2
Register Result Status
      F0      F2      F4      F6      F8      F10     F12     …       F30
Qi            Load2           Load1
Clock: 2
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f6,34($2)      1    3
fld2 f2,45($3)      2
fmul f0,f2,f4       3
fsub f8,f6,f2
fdiv f10,f0,f6
fadd f6,f8,f2
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
0     Load1  Y     ld    $2                          $2+34
1     Load2  Y     ld    $3                          $3+45
      Add1
      Add2
      Add3
      Mul1   Y     fmul         f4     Load2
      Mul2
Register Result Status (also values)
      F0      F2      F4      F6      F8      F10     F12     …       F30
Qi    Mul1    Load2           Load1
Clock: 3
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f6,34($2)      1    3    4
fld2 f2,45($3)      2    4
fmul f0,f2,f4       3
fsub f8,f6,f2       4
fdiv f10,f0,f6
fadd f6,f8,f2
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
      Load1  -     -     -                           -
0     Load2  Y     ld    $3                          $3+45
      Add1   Y     fsub  M1                   Load2
      Add2
      Add3
      Mul1   Y     fmul         f4     Load2
      Mul2
Register Result Status (also values)
      F0      F2      F4      F6      F8      F10     F12     …       F30
Qi    Mul1    Load2           M1      Add1
Clock: 4
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f6,34($2)      1    3    4
fld2 f2,45($3)      2    4    5
fmul f0,f2,f4       3
fsub f8,f6,f2       4
fdiv f10,f0,f6      5
fadd f6,f8,f2
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
      Load1
      Load2  -     -     -                           -
2     Add1   Y     fsub  M1     M2                   -
      Add2
      Add3
10    Mul1   Y     fmul  M2     f4                   -
      Mul2   Y     fdiv         M1     Mul1
Register Result Status (also values)
      F0      F2      F4      F6      F8      F10     F12     …       F30
Qi    Mul1    M2              M1      Add1    Mul2
Clock: 5
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f6,34($2)      1    3    4
fld2 f2,45($3)      2    4    5
fmul f0,f2,f4       3
fsub f8,f6,f2       4    7    8
fdiv f10,f0,f6      5
fadd f6,f8,f2       6
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
      Load1
      Load2
      Add1   -     -     -      -
2     Add2   Y     fadd  M1-M2  M2                   -
      Add3
7     Mul1   Y     fmul  M2     f4
      Mul2   Y     fdiv         M1     Mul1
Register Result Status (also values)
      F0      F2      F4      F6      F8      F10     F12     …       F30
Qi    Mul1    M2              Add2    M1-M2   Mul2
Clock: 8
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f6,34($2)      1    3    4
fld2 f2,45($3)      2    4    5
fmul f0,f2,f4       3
fsub f8,f6,f2       4    7    8
fdiv f10,f0,f6      5
fadd f6,f8,f2       6    10   11
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
      Load1
      Load2
      Add1
      Add2   -     -     -      -
      Add3
4     Mul1   Y     fmul  M2     f4
      Mul2   Y     fdiv         M1     Mul1
Register Result Status (also values)
      F0      F2      F4      F6      F8      F10     F12     …       F30
Qi    Mul1    M2              M1      M1-M2   Mul2
Clock: 11
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f6,34($2)      1    3    4
fld2 f2,45($3)      2    4    5
fmul f0,f2,f4       3    15   16
fsub f8,f6,f2       4    7    8
fdiv f10,f0,f6      5
fadd f6,f8,f2       6    10   11
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
      Load1
      Load2
      Add1
      Add2
      Add3
      Mul1   -     -     -      -
40    Mul2   Y     fdiv  M2xf4  M1
Register Result Status (also values)
      F0      F2      F4      F6      F8      F10     F12     …       F30
Qi    M2xf4   M2              M1      M1-M2   Mul2
Clock: 16
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f6,34($2)      1    3    4
fld2 f2,45($3)      2    4    5
fmul f0,f2,f4       3    15   16
fsub f8,f6,f2       4    7    8
fdiv f10,f0,f6      5    56   57
fadd f6,f8,f2       6    10   11
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
      Load1
      Load2
      Add1
      Add2
      Add3
      Mul1
      Mul2   -     -     -      -
Register Result Status (also values)
      F0      F2      F4      F6      F8      F10     F12     …       F30
Qi    M2xf4   M2              M1      M1-M2   M2xf4/M1
Clock: 57
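The I/E/W columns of this example can be reproduced with a very small timing model. The sketch rests on assumptions that are not stated on the slides: one instruction issues per cycle, a reservation station is always free, the CDB is never contended, execution starts the cycle after the last operand is broadcast, and a load occupies two execution cycles (address calculation plus cache access, inferred from the table). fsub is assumed to share fadd's 2-cycle latency. The names `LATENCY`, `PROGRAM`, and `schedule` are illustrative.

```python
LATENCY = {"fld": 2, "fadd": 2, "fsub": 2, "fmul": 10, "fdiv": 40}

# (opcode, destination, floating-point sources); integer base registers
# ($2, $3) are treated as always ready and therefore omitted.
PROGRAM = [
    ("fld",  "f6",  ()),
    ("fld",  "f2",  ()),
    ("fmul", "f0",  ("f2", "f4")),
    ("fsub", "f8",  ("f6", "f2")),
    ("fdiv", "f10", ("f0", "f6")),
    ("fadd", "f6",  ("f8", "f2")),
]


def schedule(program):
    """Return (issue, execute-complete, write) cycles for each instruction."""
    written_at = {}   # register -> cycle its most recently issued producer writes
    rows = []
    for issue, (op, dest, srcs) in enumerate(program, start=1):
        # Sources are bound to their producers at issue time (renaming), so a
        # later writer of the same register cannot affect this instruction.
        operands_ready = max([written_at.get(s, 0) for s in srcs], default=0)
        start = max(issue, operands_ready) + 1
        done = start + LATENCY[op] - 1
        write = done + 1
        written_at[dest] = write
        rows.append((issue, done, write))
    return rows


for (op, dest, _), row in zip(PROGRAM, schedule(PROGRAM)):
    print(op, dest, row)
```

Under these assumptions the model gives the same numbers as the final table. In particular, fadd writes f6 at cycle 11, long before fdiv finishes at 56. That is safe because fdiv captured the old f6 value when it issued.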
Tomasulo: Loop Example
Renaming is a powerful tool across loop iterations
• Scoreboard unable to process multiple iterations simultaneously
Hardware loop unrolling
• process several loop iterations simultaneously
• make it transparent to the compiler
L: fld  f0,0($1)
   fmul f4,f0,f2
   fsd  f4,0($1)
   subi $1,$1,8
   bne  $1,$0,L
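The way renaming unrolls this loop in hardware can be sketched by issuing the floating-point instructions of several iterations and recording which station each one waits on. This is illustrative only: `issue_loop` is a hypothetical name, a free Load/Mul/Store station is assumed for every iteration, and the integer subi/bne are assumed to be handled elsewhere with the branch predicted taken.

```python
def issue_loop(iterations):
    """Issue fld/fmul/fsd for each iteration; return (op, station, waits_on)."""
    qi = {}       # FP register -> station that will produce its next value
    trace = []

    def source(reg):
        # A pending producer's tag if there is one, else the register itself.
        return qi.get(reg, reg)

    for it in range(1, iterations + 1):
        load, mul, store = f"Load{it}", f"Mul{it}", f"Store{it}"
        trace.append(("fld", load, ()))
        qi["f0"] = load                    # f0 now means "result of this load"
        trace.append(("fmul", mul, (source("f0"), source("f2"))))
        qi["f4"] = mul                     # f4 now means "result of this fmul"
        trace.append(("fsd", store, (source("f4"),)))
    return trace


for entry in issue_loop(2):
    print(entry)
```

Each iteration's fmul waits on its own load station and each fsd on its own multiplier station, so the two iterations share the names f0 and f4 without depending on each other.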
Tomasulo: Example
Instruction Status
Instruction         I    E    W
fld1 f0,0($1)
fmul1 f4,f0,f2
fsd1 f4,0($1)
fld2 f0,0($1)
fmul2 f4,f0,f2
fsd2 f4,0($1)
Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk     A
      Load1
      Load2
      Add1
      Add2
      Add3
      Mul1
      Mul2
      Store1
      Store2
Register Result Status
      F0      F2      F4
Qi
Clock         $1