Tomasulo’s Algorithm
CIS 662 – Computer Architecture – Fall 2004 - Class 12 – 10/14/04

Tomasulo’s Algorithm

There are only three stages that an instruction goes through:
Issue – get the next instruction from the FIFO instruction queue. If there is an empty reservation station, transfer the instruction there along with the operand values or the names of the reservation stations (tags) that will produce the operand values. If there is no empty reservation station, stall on a structural hazard.
Execute – when all operands are available, start execution. Loads need only the effective address. Stores also need the data to be stored. No instruction can start executing before all prior branches have been evaluated.
Write result – write the result on the CDB (common data bus), and from there into the registers and any pending reservation stations, or into memory.

Tomasulo’s Algorithm

Each reservation station has seven fields:
Op – operation to perform
Qj, Qk – tags of the reservation stations that will produce the operands (0 indicates the operand is ready)
Vj, Vk – operand values
A – immediate field and, later, the effective address of a load/store instruction
Busy – this reservation station and its functional unit are occupied

The register file has one extra field per register:
Qi – tag of the reservation station computing the result
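
The fields above and the Issue / Write result stages can be sketched in a few lines of Python. This is only an illustrative sketch, not the hardware algorithm in full: the class and function names (ReservationStation, issue, ready, write_result) are made up for this example, None stands in for tag 0 ("operand is ready"), and loads, stores and branches are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None  # None plays the role of tag 0: operand is ready
    qk: Optional[str] = None
    a: Optional[int] = None   # unused here; holds immediate / effective address


def issue(rs, op, dest, src1, src2, regs, qi):
    """Issue: copy ready operand values, or the tags of their producers."""
    if rs.busy:
        return False  # structural hazard: the instruction must stall
    rs.busy, rs.op = True, op
    rs.qj, rs.vj = (qi[src1], None) if qi.get(src1) else (None, regs[src1])
    rs.qk, rs.vk = (qi[src2], None) if qi.get(src2) else (None, regs[src2])
    qi[dest] = rs.name  # the destination register now waits on this station
    return True


def ready(rs):
    """Execute may start once both operands are available."""
    return rs.busy and rs.qj is None and rs.qk is None


def write_result(tag, value, stations, regs, qi):
    """Write result: broadcast (tag, value) on the CDB."""
    for reg, producer in list(qi.items()):
        if producer == tag:
            regs[reg] = value
            qi[reg] = None
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
        if rs.name == tag:
            rs.busy = False  # the producing station is released
```

A dependent instruction issued right after its producer picks up the producer's tag in Qj and becomes ready only after the broadcast.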

Time = 1
First load is issued

Instruction status (each instruction goes through Issue, Execute, Write result):
L.D   F6, 34(R2)
L.D   F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F2, F6
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Reservation stations (Load1, Load2, Add1, Add2, Add3, Mult1, Mult2; busy entries only):
Load1: Op = Load, Vj = Regs[R2], A = 34

Register result status (Qi): F6 = Load1

Time = 2
First load calculates address
Second load is issued

Reservation stations (busy entries only):
Load1: Op = Load, A = Regs[R2]+34
Load2: Op = Load, Vj = Regs[R3], A = 45

Register result status (Qi): F2 = Load2, F6 = Load1

Time = 3
Mult is issued
First load reads from memory
Second load calculates address

Reservation stations (busy entries only):
Load1: Op = Load, A = Regs[R2]+34
Load2: Op = Load, A = Regs[R3]+45
Mult1: Op = Mult, Vk = Regs[F4], Qj = Load2

Register result status (Qi): F0 = Mult1, F2 = Load2, F6 = Load1

Time = 4
Sub is issued
First load writes result
Second load reads from memory
Mul is stalled

Reservation stations (busy entries only; Load1 is released after writing Mem[34+Regs[R2]]):
Load2: Op = Load, A = Regs[R3]+45
Add1:  Op = Sub,  Vk = Mem[34+Regs[R2]], Qj = Load2
Mult1: Op = Mult, Vk = Regs[F4], Qj = Load2

Register result status (Qi): F0 = Mult1, F2 = Load2, F8 = Add1

Time = 5
Second load writes result
Div is issued
Mult is stalled
Sub is stalled

Reservation stations (busy entries only; Load2 is released after writing Mem[45+Regs[R3]]):
Add1:  Op = Sub,  Vj = Mem[45+Regs[R3]], Vk = Mem[34+Regs[R2]]
Mult1: Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
Mult2: Op = Div,  Vk = Mem[34+Regs[R2]], Qj = Mult1

Register result status (Qi): F0 = Mult1, F8 = Add1, F10 = Mult2

Time = 6
Add is issued
Sub is executed (1 out of 2)
Mult is executed (1 out of 10)
Div is stalled

Reservation stations (busy entries only):
Add1:  Op = Sub,  Vj = Mem[45+Regs[R3]], Vk = Mem[34+Regs[R2]]
Add2:  Op = Add,  Vk = Mem[45+Regs[R3]], Qj = Add1
Mult1: Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
Mult2: Op = Div,  Vk = Mem[34+Regs[R2]], Qj = Mult1

Register result status (Qi): F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

Time = 7
Sub is executed (2 out of 2)
Mult is executed (2 out of 10)
Add is stalled
Div is stalled

Reservation stations (busy entries only):
Add1:  Op = Sub,  Vj = Mem[45+Regs[R3]], Vk = Mem[34+Regs[R2]]
Add2:  Op = Add,  Vk = Mem[45+Regs[R3]], Qj = Add1
Mult1: Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
Mult2: Op = Div,  Vk = Mem[34+Regs[R2]], Qj = Mult1

Register result status (Qi): F0 = Mult1, F6 = Add2, F8 = Add1, F10 = Mult2

Time = 8
Sub writes result: X = Mem[45+Regs[R3]] - Mem[34+Regs[R2]]
Mult is executed (3 out of 10)
Add is stalled
Div is stalled

Reservation stations (busy entries only; Add1 is released after writing X):
Add2:  Op = Add,  Vj = X, Vk = Mem[45+Regs[R3]]
Mult1: Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
Mult2: Op = Div,  Vk = Mem[34+Regs[R2]], Qj = Mult1

Register result status (Qi): F0 = Mult1, F6 = Add2, F10 = Mult2

Time = 9
Add is executed (1 out of 2)
Mult is executed (4 out of 10)
Div is stalled

Reservation stations (busy entries only):
Add2:  Op = Add,  Vj = X, Vk = Mem[45+Regs[R3]]
Mult1: Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
Mult2: Op = Div,  Vk = Mem[34+Regs[R2]], Qj = Mult1

Register result status (Qi): F0 = Mult1, F6 = Add2, F10 = Mult2

Time = 10
Add is executed (2 out of 2)
Mult is executed (5 out of 10)
Div is stalled

Reservation stations (busy entries only):
Add2:  Op = Add,  Vj = X, Vk = Mem[45+Regs[R3]]
Mult1: Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
Mult2: Op = Div,  Vk = Mem[34+Regs[R2]], Qj = Mult1

Register result status (Qi): F0 = Mult1, F6 = Add2, F10 = Mult2

Time = 11
Add writes result
Mult is executed (6 out of 10)
Div is stalled

Reservation stations (busy entries only; Add2 is released after writing its result):
Mult1: Op = Mult, Vj = Mem[45+Regs[R3]], Vk = Regs[F4]
Mult2: Op = Div,  Vk = Mem[34+Regs[R2]], Qj = Mult1

Register result status (Qi): F0 = Mult1, F10 = Mult2

Time = 16
Mult writes result: Y = Mem[45+Regs[R3]]*Regs[F4]
Div is stalled

Reservation stations (busy entries only; Mult1 is released after writing Y):
Mult2: Op = Div, Vj = Y, Vk = Mem[34+Regs[R2]]

Register result status (Qi): F10 = Mult2

Time = 17
Div is executed (1 out of 40)

Reservation stations (busy entries only):
Mult2: Op = Div, Vj = Y, Vk = Mem[34+Regs[R2]]

Register result status (Qi): F10 = Mult2

Time = 57
Div writes result

Reservation stations: Mult2 (Op = Div, Vj = Y, Vk = Mem[34+Regs[R2]]) is released after writing its result; no station remains busy.

Register result status (Qi): no register is waiting on a reservation station.
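
The write-result times in this walkthrough (4, 5, 16, 8, 57, 11) can be reproduced with a small calculation. The sketch below is a simplification, not a full simulator, and rests on assumptions read off the slides: one instruction issues per cycle, execution starts in the cycle after both issue and the last operand broadcast, the result is written in the cycle after execution ends, and latencies are 2 cycles for loads and add/sub, 10 for multiply and 40 for divide. Contention for reservation stations and the CDB is ignored, and the names PROGRAM and schedule are made up for this example.

```python
# (instruction, destination, source registers, execution latency in cycles)
PROGRAM = [
    ("L.D F6, 34(R2)",    "F6",  [],           2),
    ("L.D F2, 45(R3)",    "F2",  [],           2),
    ("MUL.D F0, F2, F4",  "F0",  ["F2", "F4"], 10),
    ("SUB.D F8, F2, F6",  "F8",  ["F2", "F6"], 2),
    ("DIV.D F10, F0, F6", "F10", ["F0", "F6"], 40),
    ("ADD.D F6, F8, F2",  "F6",  ["F8", "F2"], 2),
]


def schedule(program):
    """Return (instruction, issue, first execute cycle, write cycle) rows."""
    produced_at = {}  # register -> cycle in which its latest producer writes
    rows = []
    for issue_cycle, (text, dest, sources, latency) in enumerate(program, start=1):
        # Registers never written by an earlier instruction are ready (cycle 0).
        operands_ready = max([produced_at.get(s, 0) for s in sources], default=0)
        start = max(issue_cycle, operands_ready) + 1
        write = start + latency
        rows.append((text, issue_cycle, start, write))
        # Later readers of dest wait on this instruction (register renaming).
        produced_at[dest] = write
    return rows


for row in schedule(PROGRAM):
    print(row)
```

Because producers are looked up in program order, DIV.D reads the F6 written by the first load, not the F6 written later by ADD.D, which is how renaming removes the WAR hazard on F6.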

Tomasulo’s Algorithm and Loop Unrolling

Consider a loop:
LOOP: L.D    F0, 0(R1)
      MUL.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, LOOP

We will assume that the branch is always predicted as taken, and issue instructions from two loop iterations.
Assume none of the load/store or FP operations have completed.
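
Instruction status (each instruction goes through Issue, Execute, Write result):
Iteration 1: L.D F0, 0(R1); MUL.D F4, F0, F2; S.D F4, 0(R1)
Iteration 2: L.D F0, -8(R1); MUL.D F4, F0, F2; S.D F4, -8(R1)

Reservation stations (Load1, Load2, Add1, Add2, Add3, Mult1, Mult2, Store1, Store2; busy entries only):
Load1:  Op = Load,  A = Regs[R1]+0
Load2:  Op = Load,  A = Regs[R1]-8
Mult1:  Op = Mult,  Vk = Regs[F2], Qj = Load1
Mult2:  Op = Mult,  Vk = Regs[F2], Qj = Load2
Store1: Op = Store, A = Regs[R1]+0, waiting on Mult1 for the data to store
Store2: Op = Store, A = Regs[R1]-8, waiting on Mult2 for the data to store

Register result status (Qi): F0 = Load2, F4 = Mult2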

Dynamic Memory Disambiguation

The order of loads and stores that access the same memory location must be preserved.
Since they access memory locations, we can examine the order only after we calculate the effective address.
Effective address calculation is performed in order.
The address of a load is examined against the A fields of all store buffers.
The address of a store is examined against the A fields of all load and store buffers.
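
The two address checks can be written down directly. This is a minimal sketch under the assumption that the buffers hold only earlier, still-active memory operations and that their A fields already contain effective addresses; the function names are hypothetical.

```python
def load_may_proceed(load_addr, store_buffer_addrs):
    """A load must wait if any active store buffer holds the same address."""
    return load_addr not in store_buffer_addrs


def store_may_proceed(store_addr, load_buffer_addrs, store_buffer_addrs):
    """A store must wait if any active load or store buffer holds the same address."""
    return (store_addr not in load_buffer_addrs
            and store_addr not in store_buffer_addrs)
```

A load to address 100 can go ahead while stores to 200 and 300 are pending, but not while a store to 100 is pending.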

Dynamic Hardware Branch Prediction

Predict the outcome of a branch.
Change the prediction after observing a few iterations.
To achieve good effectiveness we must:
Have an accurate prediction technique
Have a low cost for misprediction

Local Prediction: Branch Prediction Buffer

A table indexed by the low bits of the branch instruction address.
It contains a bit indicating whether the branch was recently taken or not.
If it turns out we have been wrong, the bit is inverted.

(Figure: 4 bits of the branch address index a table of 1-bit entries.)

1-bit Branch Prediction Buffer

Problem – even the simplest branches are mispredicted twice

      LD    R1, #5
Loop: LD    R2, 0(R5)
      ADD   R2, R2, R4
      STORE R2, 0(R5)
      ADD   R5, R5, #4
      SUB   R1, R1, #1
      BNEZ  R1, Loop

First time: prediction = 0 but the branch is taken; change prediction to 1 (miss)
Time 2, 3, 4: prediction = 1 and the branch is taken
Time 5: prediction = 1 but the branch is not taken; change prediction to 0 (miss)
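
The two misses can be checked with a tiny simulation of a single 1-bit entry. This is only a sketch: the function name is made up, and the branch outcomes (taken four times, then not taken) are those of the five-iteration loop above.

```python
def one_bit_misses(outcomes, prediction=False):
    """Count mispredictions of a single 1-bit predictor entry.

    prediction is the stored bit: True = predict taken, False = predict not taken.
    Returns (number of misses, final prediction bit).
    """
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
            prediction = taken  # a wrong prediction inverts the bit
    return misses, prediction


# BNEZ R1, Loop with R1 starting at 5: taken 4 times, then falls through.
loop_outcomes = [True, True, True, True, False]
print(one_bit_misses(loop_outcomes))
```

Since the bit ends at "not taken", every later execution of the whole loop starts from the same state and misses twice again.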

2-bit Branch Prediction Buffer

To amend this we will use 2 bits; we must miss twice before we change our prediction.

(State diagram: four states. States 11 and 10 predict taken, states 01 and 00 predict not taken. The arcs between the states are labeled Taken or Not taken.)

2-bit Branch Prediction Buffer

First time we encounter this loop:

      LD    R1, #5
Loop: LD    R2, 0(R5)
      ADD   R2, R2, R4
      STORE R2, 0(R5)
      ADD   R5, R5, #4
      SUB   R1, R1, #1
      BNEZ  R1, Loop

First time: prediction = 00 (not taken), the branch is taken; change prediction to 01 (miss)
Time 2: prediction = 01 (not taken), the branch is taken; change prediction to 11 (miss)
Time 3, 4: prediction = 11 (taken), the branch is taken
Time 5: prediction = 11 (taken), the branch is not taken; change prediction to 10 (miss)
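
The trace can be replayed with a small state machine. The transitions 00→01, 01→11 and 11→10 come from the trace itself; the remaining arcs in the table below are an assumption (the usual textbook 2-bit scheme), and the names TRANSITIONS and run_two_bit are made up for this sketch.

```python
# state -> (next state if taken, next state if not taken)
# Only 00->01, 01->11 and 11->10 are confirmed by the trace; the rest is assumed.
TRANSITIONS = {
    "00": ("01", "00"),
    "01": ("11", "00"),
    "10": ("11", "00"),
    "11": ("11", "10"),
}


def predicts_taken(state):
    """States 11 and 10 predict taken; 01 and 00 predict not taken."""
    return state[0] == "1"


def run_two_bit(outcomes, state="00"):
    """Return (number of misses, final state) for a sequence of branch outcomes."""
    misses = 0
    for taken in outcomes:
        if predicts_taken(state) != taken:
            misses += 1
        state = TRANSITIONS[state][0 if taken else 1]
    return misses, state


loop_outcomes = [True, True, True, True, False]
first = run_two_bit(loop_outcomes)             # first encounter, starting at 00
second = run_two_bit(loop_outcomes, first[1])  # next encounter, starting at 10
print(first, second)
```

Under these assumptions the first encounter misses three times, but every later encounter starts in state 10 and misses only on the final, not-taken iteration.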

n-bit Branch Prediction Buffer

We can generalize this technique to n-bit prediction buffers.
When the counter is ≥ 2^(n-1), the branch is predicted as taken.
Those predictors are not much more accurate than 2-bit predictors.

(State diagram for n = 3: states 111, 110, …, 100 predict taken; states 011, …, 001, 000 predict not taken. The arcs between the states are labeled Taken or Not taken.)
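
A minimal sketch of an n-bit entry follows. It assumes a plain saturating counter (increment on taken, decrement on not taken), which may differ in detail from the arcs in the slide's diagram; the class name NBitCounter is made up. The prediction rule, counter ≥ 2^(n-1), is the one stated above.

```python
class NBitCounter:
    """One n-bit prediction entry, modeled as a saturating counter (assumed)."""

    def __init__(self, n, value=0):
        self.n = n
        self.max_value = 2 ** n - 1
        self.value = value

    def predict_taken(self):
        # Predict taken when the counter is in the upper half of its range.
        return self.value >= 2 ** (self.n - 1)

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


counter = NBitCounter(3)
for _ in range(4):
    counter.update(True)
print(counter.value, counter.predict_taken())
```

With n = 3 the counter needs four taken outcomes from 000 before it first predicts taken, which illustrates why wider counters react more slowly without gaining much accuracy.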

Correlating (Global) Branch Predictors

Assign two prediction bits, one if the previous branch was not taken, the other if it was taken.

b1: if (d==0) d=1;
b2: if (d==1)

b1: BNEZ   R1, L1
    DADDUI R1, R0, #1
L1: DSUBUI R3, R1, #1
b2: BNEZ   R3, L2
    …….
L2:

If b1 is taken (d ≠ 0), b2 is taken unless d == 1
If b1 is not taken, b2 is not taken

Prediction entry 0/0:
First bit: what to do if the one previous branch was not taken
Second bit: what to do if the one previous branch was taken

Correlating Branch Predictors

Assign two prediction bits, one if the previous branch was not taken, the other if it was taken.

b1: BNEZ   R1, L1
    DADDUI R1, R0, #1
L1: DSUBUI R3, R1, #1
b2: BNEZ   R3, L2
    …….
L2:

R1=?  b1 prediction  b1 action  New b1 prediction  b2 prediction  b2 action  New b2 prediction
2     NT/NT          T (m)      T/NT               NT/NT          T (m)      NT/T
0     T/NT           NT         T/NT               NT/T           NT         NT/T
2     T/NT           T          T/NT               NT/T           T          NT/T
0     T/NT           NT         T/NT               NT/T           NT         NT/T

(m = misprediction)

This is a (1,1) predictor: it uses the outcome of 1 previous branch to make the prediction with a 1-bit predictor.
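
The table can be replayed with a short simulation of the (1,1) scheme. This is a sketch under stated assumptions: all prediction bits start as not taken, the (nonexistent) branch before the first b1 is treated as not taken, and the function name simulate_1_1 is made up.

```python
def simulate_1_1(d_values):
    """Replay b1/b2 through a (1,1) predictor; return (misses, final table).

    Each branch keeps two 1-bit predictions: index 0 is used when the previous
    branch was not taken, index 1 when it was taken. False means not taken.
    """
    table = {"b1": [False, False], "b2": [False, False]}
    last_taken = False  # assumed initial history: previous branch not taken
    misses = []
    for step, d in enumerate(d_values):
        b1_taken = d != 0   # b1: BNEZ R1, L1 is taken when d != 0
        if not b1_taken:
            d = 1           # fall-through path sets d = 1
        b2_taken = d != 1   # b2: BNEZ R3, L2 with R3 = d - 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            slot = int(last_taken)
            if table[name][slot] != taken:
                misses.append((step, name))
                table[name][slot] = taken
            last_taken = taken
    return misses, table


misses, table = simulate_1_1([2, 0, 2, 0])
print(misses, table)
```

Under these assumptions the only misses are b1 and b2 on the first pass, after which the entries settle at T/NT for b1 and NT/T for b2.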

Correlating Branch Predictors (m,n)

Observe the behavior of m previous branches, use an n-bit predictor.

(1,1): entry 0/0
One bit indicating what to do if one previous branch was not taken
One bit indicating what to do if one previous branch was taken

(m,1): entry 0/0/0/…/0
One bit indicating what to do if m previous branches were not taken
One bit indicating what to do if m previous branches were taken

(m,n): entry 0111/0011/0001/…/1110
n bits indicating what to do if m previous branches were not taken
n bits indicating what to do if m previous branches were taken

Correlating Branch Predictors (m,n)

2^m combinations, n bits each.

(Figure: 4 bits of the branch address select a row of the table; m bits indicating the outcome of the m previous branches select one of the n-bit entries in that row.)

Correlating Branch Predictors (m,n)

How many bits do we need for an (m,n) predictor?
2^m combinations, n bits each; suppose we use the last t bits of the branch address to select the prediction:
2^m * n * 2^t bits
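
The bit count is easy to evaluate for concrete sizes. The helper below is just the formula above written as a function; the name predictor_bits and the sample sizes are made up for illustration.

```python
def predictor_bits(m, n, t):
    """Total storage of an (m,n) predictor indexed by t address bits.

    2**m history combinations, n bits each, for each of 2**t table rows.
    """
    return (2 ** m) * n * (2 ** t)


# A plain 2-bit buffer is the (0,2) case: no history bits are used.
print(predictor_bits(0, 2, 12))
# A (2,2) predictor with a quarter as many rows needs the same storage.
print(predictor_bits(2, 2, 10))
```

Each extra history bit doubles the storage, so at a fixed budget it has to be traded against one address bit.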
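
Tournament Predictors

Combine one global and one local predictor with a selector.

(State diagram: a four-state selector with two "Use predictor 1" states and two "Use predictor 2" states. Each arc is labeled a/b, where a = 1 if the first predictor was right and b = 1 if the second predictor was right; for example, 1/0 means the first predictor was right and the second predictor was wrong. Arcs labeled 0/0 and 1/1 stay in the same state, arcs labeled 1/0 move toward "Use predictor 1", and arcs labeled 0/1 move toward "Use predictor 2".)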
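
The selector can be sketched as a 2-bit saturating counter. This is an illustrative reading of the diagram, not a definitive implementation: the class name Selector and the state numbering (0 and 1 use predictor 1, 2 and 3 use predictor 2) are assumptions.

```python
class Selector:
    """Four-state selector choosing between two predictors (assumed encoding)."""

    def __init__(self):
        self.state = 0  # 0, 1: use predictor 1; 2, 3: use predictor 2

    def choose(self):
        return 1 if self.state < 2 else 2

    def update(self, p1_right, p2_right):
        # Move only when exactly one predictor was right (labels 1/0 and 0/1).
        if p1_right and not p2_right:
            self.state = max(self.state - 1, 0)
        elif p2_right and not p1_right:
            self.state = min(self.state + 1, 3)
        # Labels 0/0 and 1/1: both agree in correctness, state is unchanged.


sel = Selector()
sel.update(False, True)
sel.update(False, True)
print(sel.choose())
```

As with the 2-bit branch counter, predictor 2 has to beat predictor 1 twice before the selector switches to it from the strong "Use predictor 1" state.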