processor architecture - uni-potsdam.de superscalarity into the instruction pipeline ... (see table...
TRANSCRIPT
Processor Architecture
Advanced Dynamic Scheduling Techniques
M. Schölzel
Content
Tomasulo with speculative execution
Introducing superscalarity into the instruction pipeline
Multithreading
Control Flow Dependencies
Let b be a conditional branch at address a with branch target z.
An operation c is control flow dependent on b if the execution of c depends on the outcome of the branch b.
Otherwise, c is not control flow dependent on b.
Examples:
[Figure: control flow graphs with a branch at address a, its fall-through successor a+1, and its branch target z; captions as follows]
− c is not control flow dependent on b
− c is control flow dependent on b1 and b2
− c is not control flow dependent on b
− What about d?
Scheduling restrictions imposed by control flow dependencies
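… for control flow dependent operations: they cannot be moved before the branch (otherwise c is executed speculatively)
… for operations that are not control flow dependent: they cannot be moved behind the branch (otherwise c may not be executed)
[Figure: program order (b, then c) next to the speculative execution of c (c before b); program order (c, then b) next to the order (b, then c), in which c may not be executed]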
Performance Problem due to Control Hazards
Problem:
− The branch target of a branch operation is only known after its execution
− Long pipeline stalls are required in processors with deep pipelines
Solution:
− Branch prediction helps, but is limited
• Tomasulo supports speculative fetch and issue, but not speculative execution of operations
[Figure: a branch operation b is on its way from the instruction queue to the reservation stations; the address in the PC for the next instruction fetch from memory is not known]
Drawbacks of Speculative Execution
What happens if an operation is executed speculatively and the speculation was wrong?
− May affect the data flow
− May affect the exception behavior
[Figure: control flow graph of a program with branch b, a speculatively executed block c, and the block that has to be executed; in the program executed after dynamic scheduling, c comes before b]
Example
Affected data flow: c is executed speculatively before b
  a: add r1 <- r2,r3
  b: branch
  c: sub r1 <- r4,r5 (the other successor of b does not write to r1)
     mul r0 <- r1,r6
− The mul-operation now receives the value in r1 from the sub-operation instead of from the add-operation
Affected exception behavior: c is executed speculatively before b
  b: if r5 = 0 then x else y
  x: no division executed
  y: c: div r1 <- r4,r5
− Division by 0 becomes possible
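The affected data flow can be reproduced with a few lines of Python. This is only a sketch: the register values and the function name run are hypothetical, and the branch b is modelled by a plain boolean.

```python
def run(take_c_path, speculate):
    """Execute a; b; [c]; mul, optionally hoisting c above the branch b."""
    # Hypothetical register contents, chosen only for illustration.
    r = {"r1": 0, "r2": 2, "r3": 3, "r4": 10, "r5": 4, "r6": 5}
    r["r1"] = r["r2"] + r["r3"]          # a: add r1 <- r2,r3
    if speculate or take_c_path:         # c runs regardless of b if hoisted
        r["r1"] = r["r4"] - r["r5"]      # c: sub r1 <- r4,r5
    return r["r1"] * r["r6"]             # mul r0 <- r1,r6


correct = run(take_c_path=False, speculate=False)   # mul reads r1 from add
wrong = run(take_c_path=False, speculate=True)      # mul reads r1 from sub
```

With these values the two runs return different products, which is the data-flow change described above.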
Solution
Divide the WB-phase of the Tomasulo algorithm into two phases:
− Forwarding results from the EUs to the RSs (WB-phase)
− Writing results into architectural registers/memory (commit-phase)
Implemented by:
− A reorder buffer that buffers the results from the WB-phase
− Committing the buffered results from the reorder buffer in order
By this:
− Speculative results can be used without modifying architectural registers/memory locations
Architecture for Tomasulo with speculative Execution
[Figure: block diagram. Components: program memory with PC, instruction queue, memory unit and execution units Execute 1 … Execute m (each with its reservation stations RS), reorder buffer, and architecture registers Reg 0 … Reg r. Buses: operation bus, operand bus A, operand bus B, one EU-bus per unit, and the result bus.]
Reorder Buffer (ROB)
Implemented as a queue:
− When an operation is issued, an entry is reserved
− During WB, the result is written back to the reserved entry
− Commit is done in order and writes the results back to the architectural registers
• Speculatively executed operations are committed after preceding branches have been committed
ROB-entries now have the meaning of virtual registers
[Figure: ROB with entries 1 … n, a busy flag per entry, and the pointers first and last. A demultiplexer writes values from the result bus into the entry that was reserved by the RS; a multiplexer reads entries out to the architectural registers, to the issue-phase (bypass), and to the RSs (bypass).]
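The queue behaviour described above can be sketched in Python. This is a minimal model under stated assumptions, not the slide's hardware: the class and method names are hypothetical, and the dst field (destination register stored in the entry) is an illustrative simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    res: Optional[int] = None   # result value (meaningful only if valid == 1)
    dst: Optional[str] = None   # hypothetical field: destination register
    valid: int = 0              # 1 once the result has been written back
    busy: int = 0               # 1 while the entry is reserved


class ReorderBuffer:
    def __init__(self, n: int):
        self.entries = [RobEntry() for _ in range(n + 1)]  # index 0 unused
        self.n = n
        self.first = 1   # oldest reserved entry (next to commit)
        self.last = 1    # next entry to reserve

    def reserve(self, dst: str) -> Optional[int]:
        """Issue: reserve the entry at `last`; None means the ROB is full."""
        if self.entries[self.last].busy:
            return None
        idx = self.last
        self.entries[idx] = RobEntry(dst=dst, busy=1)
        self.last = idx % self.n + 1
        return idx

    def write_back(self, idx: int, value: int) -> None:
        """WB: results may arrive in any order."""
        self.entries[idx].res = value
        self.entries[idx].valid = 1

    def commit(self, regs: dict) -> bool:
        """Commit: only the oldest entry may update architectural state."""
        e = self.entries[self.first]
        if not (e.busy and e.valid):
            return False
        regs[e.dst] = e.res
        self.entries[self.first] = RobEntry()
        self.first = self.first % self.n + 1
        return True


rob = ReorderBuffer(3)
regs = {"r0": 0, "r1": 0}
a = rob.reserve("r0")        # older operation
b = rob.reserve("r1")        # younger operation
rob.write_back(b, 7)         # younger result arrives first
early = rob.commit(regs)     # blocked: the oldest entry is not valid yet
rob.write_back(a, 3)
rob.commit(regs)
rob.commit(regs)
```

The younger result is available in the ROB early, but it reaches the register file only after the older operation has committed.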
Fields of the ROB
Structure of a ROB-entry: fields res, addr, type, valid, busy (written as [res,addr,type,valid,busy] in the examples)
The meaning of the fields depends on the operation type (branch operation, memory operation, ALU operation):
− res
• Branch operation: computed target address (will be stored in the PC)
• Memory operation: value to be stored in the memory
• ALU operation: result of the operation
− addr
• Branch operation: c = speculation was correct, w = speculation was wrong
• Memory operation: address at which the res-value should be stored
• ALU operation: unused (-)
− type: 3 = branch operation, 2 = memory operation, 1 = ALU operation
− busy: entry is reserved
− valid: 0 = result has not been computed yet, 1 = result was computed and is available in the res-field
Reservation Station Fields
An RS has the same functionality as in ordinary Tomasulo:
− Buffers operations
− Buffers operands
But ROB-entries (virtual registers) are used to determine the operand source
Fields of a reservation station entry:
− opc: operation to be executed (e.g. add, sub, mul, …)
− Qj = x, if ROB-entry x will store the value for operand A, otherwise 0
− Qk = x, if ROB-entry x will store the value for operand B, otherwise 0
− Vj: value for operand A
− Vk: value for operand B
− busy: RS is occupied/free
− type: type of operation (see table on the previous slide)
− rob: reserved ROB-entry
− stat: status in the pipeline (RO, EX, WB)
− misc: miscellaneous
[Figure: reservation station with several entries; a demultiplexer fills the entries from the operation bus and the operand buses A and B, a multiplexer puts the selected entry onto the EU-bus]
Register File Extensions
Mapping of architecture registers to virtual registers (ROB-entries)
Architecture register n stores the ROB-entry of the latest operation that is computing the value for n (register renaming)
[Figure: register file Reg 0 … Reg r, each register extended by a rob-field, connected to operand bus A, operand bus B, and the result bus]
Example (rob-fields): Reg 0: 5, Reg 1: 0, Reg 2: 1, …, Reg r: 0
− ROB-entry 5 contains the result of the latest operation with destination register 0
− Register 1 is not computed by any operation in the pipeline
− ROB-entry 1 contains the result of the latest operation with destination register 2
Overview Pipeline Phases
Issue
− Schedule an operation from the instruction queue to an RS
− Read operand values or rename registers (solves WAR- and WAW-hazards)
− Reserve a ROB-entry
− Issue is in-order
Execute
− Wait for the operands to be ready
− Execute the operation as soon as the operands are ready and an EU is available (solves RAW-hazards)
− Execute is out-of-order
Write-Back
− Write the result through the result bus into the reserved ROB-entry
− WB is out-of-order
Commit
− Write results from the ROB in order into the destination registers/memory
− Commit is in-order
Overview Pipeline Phases (Issue)
Issue an operation from the instruction queue to an RS, if:
− an RS is free and
− the ROB is not full
Otherwise: stall the issue stage
Allocate an RS- and a ROB-entry
Read the operands, if
− present in the register file, or
− present in the ROB (a ROB-entry corresponds to a virtual register)
[Figure: architecture diagram; Op A moves from the instruction queue into an RS, and a place is reserved for Op A in the reorder buffer]
Overview Pipeline Phases (Execute)
The operation waits in the RS for its operands and for a free EU
The operation is executed as soon as all operands are available and an EU is free
The RS can store the state of the operation during execution
[Figure: architecture diagram; Op A is in an RS and in its execution unit, and its place in the reorder buffer is still reserved]
Overview Pipeline Phases (Write-Back)
Write the result into the reserved ROB-entry
The ROB-entry ID has been stored in the rob-field of the RS
The result is forwarded to all waiting RSs through the result bus (the value is identified by its ROB ID)
Free the RS
[Figure: architecture diagram; the result of Op A is written into the place reserved for Op A in the reorder buffer]
Overview Pipeline Phases (Commit)
Write the result from the first entry in the ROB into the corresponding destination register
Free the ROB-entry
[Figure: architecture diagram; the result moves from the place reserved for Op A in the reorder buffer into the register file]
Issue-Phase – Details (for ALU-operations)
For the operation that will be issued, let:
− opc … operation type (add, sub, mul, …)
− src1, src2 … source registers
− dst … destination register
The operation can be issued, if
− there exists an x with RS[x].busy = 0 and
− ROB[last].busy = 0
Update after issue
− if Reg[src1].rob = 0 then // determine value of left operand
    RS[x].Qj := 0; RS[x].Vj := Reg[src1] // read left operand from the register file
  else // left operand is still under computation or in the ROB
    if ROB[Reg[src1].rob].valid = 1 then
      RS[x].Qj := 0; RS[x].Vj := ROB[Reg[src1].rob].res // read operand from the ROB
    else
      RS[x].Qj := Reg[src1].rob // wait for operand in the RS
    fi
  fi
− if Reg[src2].rob = 0 then // the same for the right operand …
  …
− RS[x].busy := 1; RS[x].rob := last; RS[x].opc := opc; RS[x].type := 1; RS[x].status := RO
− ROB[last].busy := 1; ROB[last].valid := 0; ROB[last].type := 1; Reg[dst].rob := last // reserve the ROB-entry and rename dst (see Example 1)
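The issue rule for ALU operations can be sketched in Python as follows. The dictionary layout and the function names read_operand and issue_alu are assumptions made for illustration; the initial state mirrors the register and ROB contents used in Example 1.

```python
def read_operand(src, regs, rob):
    """Return (Q, V) for one source register, following the rule above."""
    tag = regs[src]["rob"]
    if tag == 0:                       # value is in the register file
        return 0, regs[src]["value"]
    if rob[tag]["valid"] == 1:         # value is already in the ROB
        return 0, rob[tag]["res"]
    return tag, None                   # wait for ROB-entry `tag`


def issue_alu(opc, dst, src1, src2, regs, rob, last):
    """Fill an RS entry and reserve ROB-entry `last` for an ALU operation."""
    # Sources are read before dst is renamed, so "add r1 <- r1, r0" works.
    qj, vj = read_operand(src1, regs, rob)
    qk, vk = read_operand(src2, regs, rob)
    rob[last] = {"res": None, "type": 1, "valid": 0, "busy": 1}
    regs[dst]["rob"] = last            # rename the destination register
    return {"opc": opc, "Qj": qj, "Qk": qk, "Vj": vj, "Vk": vk,
            "busy": 1, "type": 1, "rob": last, "status": "RO"}


# r2 is renamed to ROB-entry 4, whose result (56) is already valid.
regs = {"r0": {"value": 5, "rob": 0}, "r1": {"value": 14, "rob": 0},
        "r2": {"value": 89, "rob": 4}, "r3": {"value": 17, "rob": 0}}
rob = {4: {"res": 56, "type": 1, "valid": 1, "busy": 1}}
rs = issue_alu("add", "r0", "r1", "r2", regs, rob, last=5)
```

The left operand comes from the register file, the right operand from the ROB, and r0 is renamed to ROB-entry 5.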
Issue-Phase – Details (Example 1)
Situation:
− Op A can be issued
− The value for r1 is taken from the register file
− The value for r2 is taken from the ROB
Update (srcy stands for src1 or src2, Qj/k and Vj/k for the matching RS fields):
  if Reg[srcy].rob = 0 then
    RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy]
  else if ROB[Reg[srcy].rob].valid = 1 then
    RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res
  else
    RS[x].Qj/k := Reg[srcy].rob
  fi
State:
− Instruction queue: add r0 <- r1, r2 // Op A; sub r3 <- r0, r1 // Op B; …
− RS: empty
− ROB [res,addr,type,valid,busy]: 4: [56,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 89/4, R3: 17/0
Issue-Phase – Details (Example 1)
Situation:
− Op A was issued and ROB-entry 5 was allocated
Update after issue (srcy stands for src1 or src2, Qj/k and Vj/k for the matching RS fields):
  if Reg[srcy].rob = 0 then
    RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy]
  else if ROB[Reg[srcy].rob].valid = 1 then
    RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res
  else
    RS[x].Qj/k := Reg[srcy].rob
  fi
State:
− Instruction queue: sub r3 <- r0, r1 // Op B; …
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Execute 1: [add,0,0,14,56,-,1,5,RO,1]
− ROB [res,addr,type,valid,busy]: 4: [56,-,1,1,1]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/5, R1: 14/0, R2: 89/4, R3: 17/0
Issue-Phase – Details (Example 2)
Situation:
− Issue of Op A
− The value of r1 is read from the register file
− The value of r2 is still being computed by the ld-operation (ROB-entry 4)
Update after issue (srcy stands for src1 or src2, Qj/k and Vj/k for the matching RS fields):
  if Reg[srcy].rob = 0 then
    RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy]
  else if ROB[Reg[srcy].rob].valid = 1 then
    RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res
  else
    RS[x].Qj/k := Reg[srcy].rob
  fi
State:
− Instruction queue: add r0 <- r1, r2 // Op A; sub r3 <- r0, r1 // Op B; …
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,100,-,-,2,4,RO,1]
− ROB [res,addr,type,valid,busy]: 4: [-,-,1,0,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 89/4, R3: 17/0
Issue-Phase – Details (Example 2)
Situation:
− Op A was issued
− It uses ROB-entry 5
− It has to wait for the value of its right operand (Qk = 4)
Update after issue (srcy stands for src1 or src2, Qj/k and Vj/k for the matching RS fields):
  if Reg[srcy].rob = 0 then
    RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy]
  else if ROB[Reg[srcy].rob].valid = 1 then
    RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res
  else
    RS[x].Qj/k := Reg[srcy].rob
  fi
State:
− Instruction queue: sub r3 <- r0, r1 // Op B; …
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,100,-,-,2,4,RO,1]; Execute 1: [add,0,4,14,-,-,1,5,RO,1]
− ROB [res,addr,type,valid,busy]: 4: [-,-,1,0,1]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 89/4, R3: 17/0
Execute – Details
Executing an operation from an RS is possible, if
− RS[x].status = RO and
− RS[x].Qj = 0 and RS[x].Qk = 0
Update after start of execution:
− Perform the computation with RS[x].Vj and RS[x].Vk
− RS[x].status := EX
Update after end of execution:
− RS[x].Vj := res // store the result temporarily in the reservation station
− RS[x].status := WB
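A small Python sketch of the execute rule, assuming a dictionary per RS entry and ignoring execution latency (start and end of execution are collapsed into one call). The names OPS, can_start, and execute are hypothetical.

```python
import operator

# Hypothetical opcode table for the sketch.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def can_start(rs):
    """An operation may start once it is in state RO and has both operands."""
    return rs["status"] == "RO" and rs["Qj"] == 0 and rs["Qk"] == 0


def execute(rs):
    """Model start and end of execution in one step (latency is ignored)."""
    if not can_start(rs):
        return False
    rs["status"] = "EX"
    res = OPS[rs["opc"]](rs["Vj"], rs["Vk"])
    rs["Vj"] = res             # result is kept in Vj until write-back
    rs["status"] = "WB"
    return True


ready = {"opc": "add", "Qj": 0, "Qk": 0, "Vj": 14, "Vk": 56, "status": "RO"}
waiting = {"opc": "add", "Qj": 0, "Qk": 4, "Vj": 14, "Vk": None, "status": "RO"}
started = execute(ready)       # both operands present
blocked = execute(waiting)     # still waits for ROB-entry 4
```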
Execute-Phase – Details (Example 3)
Both operands are ready:
− RS[x].status = RO and
− RS[x].Qj = 0 and RS[x].Qk = 0
State:
− Instruction queue: add r0 <- r1, r2 // Op A; sub r3 <- r0, r1 // Op B; …
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,100,-,-,2,4,RO,1]
− ROB [res,addr,type,valid,busy]: 4: [-,-,1,0,1]
− Register file (rob-fields): Reg 0: 0, Reg 1: 0, Reg 2: 4, Reg r: 0
Execute-Phase – Details (Example 3)
The operation is executed:
− RS[x].status = EX
State:
− Instruction queue: add r0 <- r1, r2 // Op A; sub r3 <- r0, r1 // Op B; …
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,100,-,-,2,4,EX,1]
− ROB [res,addr,type,valid,busy]: 4: [-,-,1,0,1]
− Register file (rob-fields): Reg 0: 0, Reg 1: 0, Reg 2: 4, Reg r: 0
Execute-Phase – Details (Example 3)
The result is computed:
− The result is stored temporarily in the field Vj
− RS[x].status = WB
The result is ready for WB
State:
− Instruction queue: add r0 <- r1, r2 // Op A; sub r3 <- r0, r1 // Op B; …
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,89,-,-,2,4,WB,1]
− ROB [res,addr,type,valid,busy]: 4: [-,-,1,0,1]
− Register file (rob-fields): Reg 0: 0, Reg 1: 0, Reg 2: 4, Reg r: 0
Write-Back – Details (ALU-Operation)
Write-back of the result from RS x is possible, if
− RS[x].status = WB and
− the result bus is available
Update after WB:
− ROB[RS[x].rob].res := RS[x].Vj // write result to the allocated ROB-entry
− RS[x].busy := 0 // free the RS
− ROB[RS[x].rob].valid := 1 // declare the ROB-entry as valid
− for all reservation stations y ≠ x: // forwarding of the result
    if RS[y].Qj = RS[x].rob then RS[y].Vj := RS[x].Vj; RS[y].Qj := 0
    if RS[y].Qk = RS[x].rob then RS[y].Vk := RS[x].Vj; RS[y].Qk := 0
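The write-back step, including the forwarding to waiting reservation stations, might look like this in Python. The data layout and the function name write_back are assumptions for illustration; the forwarded value is the result held in Vj of the finishing station, identified on the bus by its ROB tag.

```python
def write_back(x, stations, rob):
    """Write the result of station x to its ROB-entry and forward it."""
    src = stations[x]
    tag, value = src["rob"], src["Vj"]
    rob[tag]["res"] = value
    rob[tag]["valid"] = 1
    src["busy"] = 0                          # free the RS
    for y, rs in stations.items():           # forwarding via the result bus
        if y == x or not rs["busy"]:
            continue
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, 0
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, 0


stations = {
    1: {"opc": "ld", "Qj": 0, "Qk": 0, "Vj": 20, "Vk": None,
        "busy": 1, "rob": 4, "status": "WB"},
    2: {"opc": "add", "Qj": 0, "Qk": 4, "Vj": 14, "Vk": None,
        "busy": 1, "rob": 5, "status": "RO"},
}
rob = {4: {"res": None, "valid": 0}, 5: {"res": None, "valid": 0}}
write_back(1, stations, rob)
```

After the call, ROB-entry 4 is valid and the waiting add-operation has received its right operand.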
Write-Back-Phase – Details (Example 4)
Situation:
− The result of the ld-operation is written back
− The result bus contains:
• ROB-entry ID, e.g. 4
• ROB-value, e.g. 89
− The add-operation waits for its right-hand operand
State:
− Instruction queue: sub r3 <- r0, r1 // Op B; …
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,20,-,-,2,4,WB,1]; Execute 1: [add,0,4,14,-,-,1,5,RO,1]
− Result bus: (4,89)
− ROB [res,addr,type,valid,busy]: 4: [-,-,1,0,1]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/5, R1: 14/0, R2: 89/4, R3: 17/0
Write-Back-Phase – Details (Example 4)
Situation:
− The result was stored in ROB-entry 4
− The RS containing the add-operation has also stored the result
− The RS of the ld-operation was freed
State:
− Instruction queue: sub r3 <- r0, r1 // Op B; …
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Execute 1: [add,0,0,14,20,-,1,5,RO,1]
− ROB [res,addr,type,valid,busy]: 4: [20,-,1,1,1]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/5, R1: 14/0, R2: 89/4, R3: 17/0
Commit– Details (ALU-Operation)
It must be checked:
− ROB[first].valid = 1 (first is the head of the ROB)
Update by commit:
− for all architectural registers r with Reg[r].rob = first do
    Reg[r] := ROB[first].res; Reg[r].rob := 0
− Free ROB-entry first
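A Python sketch of the commit step for ALU operations. The function name commit and the wrap-around of the head pointer are assumptions; the values are chosen to resemble Example 5.

```python
def commit(rob, regs, first):
    """Commit ROB-entry `first` if valid; return the new head pointer."""
    entry = rob[first]
    if entry["valid"] != 1:
        return first                          # oldest result is not ready
    for reg in regs.values():
        if reg["rob"] == first:               # register still renamed to it
            reg["value"] = entry["res"]
            reg["rob"] = 0
    rob[first] = {"res": None, "valid": 0, "busy": 0}    # free the entry
    return first % len(rob) + 1               # assumed circular ROB


rob = {i: {"res": None, "valid": 0, "busy": 0} for i in range(1, 6)}
rob[4] = {"res": 20, "valid": 1, "busy": 1}
rob[5] = {"res": None, "valid": 0, "busy": 1}
regs = {"r0": {"value": 5, "rob": 5}, "r2": {"value": 89, "rob": 4}}
first = commit(rob, regs, 4)          # entry 4 commits, r2 receives 20
stalled = commit(rob, regs, first)    # entry 5 is not valid yet
```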
Commit-Phase – Details (Example 5)
Situation:
− Let first = 4 be the head of the ROB
− R2 waits for the result from ROB-entry 4
State:
− Instruction queue: sub r3 <- r0, r1 // Op B; …
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Execute 1: [add,0,0,56,89,-,1,5,RO,1]
− Value shown on the bus: (4,89)
− ROB [res,addr,type,valid,busy]: 4: [20,-,1,1,1]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/5, R1: 14/0, R2: 89/4, R3: 17/0
Commit-Phase – Details (Example 5)
Situation:
− R2 has received the result from the ROB
State:
− Instruction queue: sub r3 <- r0, r1 // Op B; …
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Execute 1: [add,0,0,56,89,-,1,5,RO,1]
− Value shown on the bus: (4,89)
− ROB [res,addr,type,valid,busy]: 4: [20,-,1,0,0]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/5, R1: 14/0, R2: 20/0, R3: 17/0
Executing Branch Operations
Issue:
− Vk-field: stores the branch target z
− misc-field: remembers the address a of the branch operation
− misc-field: also remembers which address ('z' or 'a') was predicted
Execute:
− The computed target address is stored in the Vk-field of the RS:
• Vk := z, if the branch is taken
• Vk := a+1, if the branch is not taken
− The misc-field stores whether or not the prediction was correct ('c' = correct; 'w' = wrong)
Write-Back:
− The res-field of the ROB receives the branch target (Vk-field of the RS)
− The addr-field receives the value of the misc-field from the RS: 'c' or 'w'
Commit:
− If addr-field = c, nothing must be done (operations were fetched from the correct address)
− If addr-field = w, then copy the res-field into the PC and flush the whole pipeline:
• All subsequent ROB-entries
• All RS-entries
• Instruction queue
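The commit rule for branches can be sketched as follows. The state dictionary and the function name commit_branch are hypothetical. Resetting the rob-fields of the registers is an added assumption that the slides do not mention: after the flush, every remaining renaming would point to a discarded ROB-entry.

```python
def commit_branch(entry, state):
    """Commit a branch ROB-entry: res = target address, addr = 'c' or 'w'."""
    if entry["addr"] == "c":
        return False                     # fetch already followed the right path
    state["pc"] = entry["res"]           # redirect fetch to the correct target
    state["rob"].clear()                 # all subsequent ROB-entries
    state["rs"].clear()                  # all RS-entries
    state["queue"].clear()               # instruction queue
    for reg in state["regs"].values():
        reg["rob"] = 0                   # assumption: renamings are dropped too
    return True


state = {
    "pc": 115,
    "rob": {3: {"res": 19}, 4: {"res": -14}},
    "rs": {},
    "queue": ["Op C", "Op D"],
    "regs": {"r1": {"value": 14, "rob": 3}, "r3": {"value": 0, "rob": 4}},
}
flushed = commit_branch({"res": 123, "addr": "w", "type": 3}, state)
```

The speculative results 19 and -14 never reach the register file, so r1 keeps its architectural value.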
Branch – Details (Example 6 – correct prediction)
Situation:
− The branch operation was issued to RS2
− The branch depends on the ld-operation
− Op A, Op B, … will be executed speculatively
State:
− Program: 111: ld r2 <- (200); 112: bz r2, #123; 113: add r1 <- r1, r0 // Op A; 114: sub r3 <- r0, r1 // Op B; 115: …
− Instruction queue: Op A, Op B, Op C
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,-,200,-,2,1,EX,1]; Execute 1: [bz,1,0,-,123,112a,3,2,RO,1]
− ROB [res,addr,type,valid,busy]: 1: [-,-,2,0,1]; 2: [-,-,3,0,1]; 3: [-,-,-,0,0]; 4: [-,-,-,0,0]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 89/1, R3: -/0
Branch – Details (Example 6 – correct prediction)
Situation:
− Op A is executed speculatively
− Op B waits for the result of Op A
State:
− Program: 111: ld r2 <- (200); 112: bz r2, #123; 113: add r1 <- r1, r0 // Op A; 114: sub r3 <- r0, r1 // Op B; 115: …
− Instruction queue: Op C, Op D, Op E
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,-,200,-,2,1,EX,1]; Execute 1: [bz,1,0,-,123,112a,3,2,RO,1]; further RS: Op A, Op B (fields not shown)
− ROB [res,addr,type,valid,busy]: 1: [-,-,2,0,1]; 2: [-,-,3,0,1]; 3: [-,-,1,0,1]; 4: [-,-,1,0,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 89/1, R3: -/4
Branch – Details (Example 6 – correct prediction)
Situation:
− Op A wrote its result to the ROB, but not to R1
− Op B is executed speculatively
State:
− Program: 111: ld r2 <- (200); 112: bz r2, #123; 113: add r1 <- r1, r0 // Op A; 114: sub r3 <- r0, r1 // Op B; 115: …
− Instruction queue: Op C, Op D, Op E
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,-,200,-,2,1,EX,1]; Execute 1: [bz,1,0,-,123,112a,3,2,RO,1]; further RS: Op B (fields not shown)
− ROB [res,addr,type,valid,busy]: 1: [-,-,2,0,1]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-,-,1,0,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 89/1, R3: -/4
Branch – Details (Example 6 – correct prediction)
Situation:
− Op B wrote its result to the ROB
− ld wrote its result to the ROB
− bz can be executed
State:
− Program: 111: ld r2 <- (200); 112: bz r2, #123; 113: add r1 <- r1, r0 // Op A; 114: sub r3 <- r0, r1 // Op B; 115: …
− Instruction queue: Op C, Op D, Op E
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Execute 1: [bz,0,0,6,123,112a,3,2,RO,1]
− ROB [res,addr,type,valid,busy]: 1: [6,-,2,1,1]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 89/1, R3: -/4
Branch – Details (Example 6 – correct prediction)
Situation:
− bz is executed: the branch is not taken
− The commit for the ld-operation is done
State:
− Program: 111: ld r2 <- (200); 112: bz r2, #123; 113: add r1 <- r1, r0 // Op A; 114: sub r3 <- r0, r1 // Op B; 115: …
− Instruction queue: Op C, Op D, Op E
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Execute 1: [bz,0,0,6,123,112a,3,2,EX,1]
− ROB [res,addr,type,valid,busy]: 1: [-,-,-,0,0]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 6/0, R3: -/4
Branch – Details (Example 6 – correct prediction)
Situation:
− bz was executed
− The WB for bz was done
− The prediction was correct (ROB[2].addr := c)
State:
− Program: 111: ld r2 <- (200); 112: bz r2, #123; 113: add r1 <- r1, r0 // Op A; 114: sub r3 <- r0, r1 // Op B; 115: …
− Instruction queue: Op C, Op D, Op E
− RS: empty
− ROB [res,addr,type,valid,busy]: 1: [-,-,-,0,0]; 2: [113,c,3,1,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 6/0, R3: -/4
Branch – Details (Example 6 – correct prediction)
Situation:
− The commit of the branch operation does not require any action, because the prediction was correct
− Now the commit can be done for the speculatively executed operations A and B
State:
− Program: 111: ld r2 <- (200); 112: bz r2, #123; 113: add r1 <- r1, r0 // Op A; 114: sub r3 <- r0, r1 // Op B; 115: …
− Instruction queue: Op C, Op D, Op E
− RS: empty
− ROB [res,addr,type,valid,busy]: 1: [-,-,-,0,0]; 2: [-,-,-,0,0]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 6/0, R3: -/4
Branch – Details (Example 7 – wrong prediction)
Situation:
− Same situation as in Example 6
− But the ld-operation has stored 0 in R2
− I.e., the branch is taken
State:
− Program: 111: ld r2 <- (200); 112: bz r2, #123; 113: add r1 <- r1, r0 // Op A; 114: sub r3 <- r0, r1 // Op B; 115: …
− Instruction queue: Op C, Op D, Op E, Op F
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Execute 1: [bz,0,0,0,123,112a,3,2,EX,1]
− ROB [res,addr,type,valid,busy]: 1: [-,-,-,0,0]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 0/0, R3: -/4
Branch – Details (Example 7 – wrong prediction)
Situation:
− The bz-operation was executed
− The prediction was wrong (ROB[2].addr := w)
− The correct target can be found in the res-field
State:
− Program: 111: ld r2 <- (200); 112: bz r2, #123; 113: add r1 <- r1, r0 // Op A; 114: sub r3 <- r0, r1 // Op B; 115: …
− Instruction queue and RS (speculative): Op C, Op D, Op E, Op F, Op G
− ROB [res,addr,type,valid,busy]: 1: [-,-,-,0,0]; 2: [123,w,3,1,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 6/0, R3: -/4
Branch – Details (Example 7 – wrong prediction)
Situation:
− The commit of the branch moves the correct address into the PC
− The pipeline is flushed
State:
− Program: 111: ld r2 <- (200); 112: bz r2, #123; 113: add r1 <- r1, r0 // Op A; 114: sub r3 <- r0, r1 // Op B; 115: …
− PC: 123
− Operations being flushed: Op C, Op D, Op E, Op F, Op G
− ROB [res,addr,type,valid,busy]: 1: [-,-,-,0,0]; 2: [-,-,-,0,0]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 6/0, R3: -/4
Executing Memory Operations
For the out-of-order execution of memory operations the following holds:
− The ordering of load-operations relative to each other does not matter
− The ordering of load- and store-operations as well as of store- and store-operations must be maintained
Example:
  ld r2 <- (r0)
  ld r1 <- (r4)
  st r4 -> (r0)
  ld r5 <- (r6)
  st r7 -> (r8)
Strategy:
− Writing to memory takes place during the commit-phase (in-order)
− Reading from memory takes place during the execute-phase (out-of-order)
• But only if the valid-field of all preceding write-operations in the ROB is 1
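The load rule of this strategy can be sketched in Python. The list rob_order (oldest entry first), the op field, and the function name load_value are hypothetical; forwarding from the youngest matching store in the ROB is one plausible reading of the strategy, and otherwise the load reads memory.

```python
def load_value(rob_order, load_idx, addr, memory):
    """Return the loaded value, or None if the load has to wait.

    rob_order lists ROB-entries from oldest to youngest; load_idx is the
    position of the load in that list.
    """
    older_stores = [e for e in rob_order[:load_idx] if e["op"] == "st"]
    if any(e["valid"] == 0 for e in older_stores):
        return None                        # a preceding store is not ready
    for e in reversed(older_stores):       # youngest matching store wins
        if e["addr"] == addr:
            return e["res"]                # forward the value from the ROB
    return memory[addr]


memory = {5: 99, 14: 20}                   # hypothetical memory contents
rob_order = [
    {"op": "st", "res": None, "addr": None, "valid": 0},   # st r3 -> (r0)
    {"op": "ld", "res": None, "addr": None, "valid": 0},   # ld r3 <- (r1)
]
waiting = load_value(rob_order, 1, 14, memory)
rob_order[0].update(res=17, addr=5, valid=1)               # store finished WB
from_memory = load_value(rob_order, 1, 14, memory)
from_rob = load_value(rob_order, 1, 5, memory)
```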
Example (store-Operation)
Issue-phase:
− Issue the first st-operation
Execute-phase:
− Execution of the store-operation can start if both source operands are available
− Execution has no effect
− Rather, the WB of the st-operation starts immediately
State:
− Program: st r3 -> (r0); ld r3 <- (r1); st r1 -> (r2)
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [st,0,0,17,5,-,2,1,RO,1]
− ROB [res,addr,type,valid,busy]: 1: [-,-,2,0,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 20/0, R3: 17/0
Example (store-Operation)
Updates during the WB of the st-operation:
− ROB[x].res := RS[y].Vj
− ROB[x].addr := RS[y].Vk
State:
− Program: st r3 -> (r0); ld r3 <- (r1); st r1 -> (r2)
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [st,0,0,17,5,-,2,1,WB,1]
− ROB [res,addr,type,valid,busy]: 1: [-,-,2,0,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 20/0, R3: 17/0
Example (store-Operation)
Commit for the st-operation:
− MEM[ROB[first].addr] := ROB[first].res
State:
− Program: st r3 -> (r0); ld r3 <- (r1); st r1 -> (r2)
− RS: empty
− ROB [res,addr,type,valid,busy]: 1: [17,5,2,1,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 20/0, R3: 17/0
Example (load-Operation)
Suppose the first st-operation was issued and waits for execution
Then the ld-operation was issued, and its source operands are available
State:
− Program: st r3 -> (r0); ld r3 <- (r1); st r1 -> (r2)
− Instruction queue: Op C
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [st,5,0,-,5,-,2,1,RO,1]; [ld,0,0,14,-,-,2,2,RO,1]
− ROB [res,addr,type,valid,busy]: 1: [-,-,2,0,1]; 2: [-,-,2,0,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 20/0, R3: 17/2
Example (load-Operation)
Situation:
− The ld-operation is not executed, because the valid-bit of the first st-operation is 0
State:
− Program: st r3 -> (r0); ld r3 <- (r1); st r1 -> (r2)
− Instruction queue: Op C
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [st,5,0,-,5,-,2,1,RO,1]; [ld,0,0,14,-,-,2,2,RO,1]
− ROB [res,addr,type,valid,busy]: 1: [-,-,2,0,1]; 2: [-,-,2,0,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 20/0, R3: 17/2
Example (load-Operation)
Situation:
− Now the ld-operation can be executed (see the valid-bit of the first st-operation)
− The ld-operation can read its value either from memory or from the ROB (if the addr-field of a preceding st-operation matches the Vj-field of the ld-operation)
State:
− Program: st r3 -> (r0); ld r3 <- (r1); st r1 -> (r2)
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,14,-,-,2,2,RO,1]
− ROB [res,addr,type,valid,busy]: 1: [17,5,2,1,1]; 2: [-,-,2,0,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 20/0, R3: 17/2
Example (load-Operation)
The WB for the ld-operation is complete
The commit-phase for ld-operations is the same as for ALU-operations
State:
− Program: st r3 -> (r0); ld r3 <- (r1); st r1 -> (r2)
− RS [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]: Memory Unit: [ld,0,0,14,-,-,2,2,RO,1]
− ROB [res,addr,type,valid,busy]: 1: [17,5,2,1,1]; 2: [20,-,2,1,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 20/0, R3: 17/2
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 0)
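Program:
  Loop: ld r0 <- (r1)
        mul r4 <- r0, r2
        add r3 <- r3, r4
        add r1 <- r1, 1
        bne loop, r1, 200
        …
[Figure: pipeline state in cycle 0 with memory unit, Execute 1, Execute 2, reservation stations RS 1 to RS 6, and a ROB with entries 1 to 5]
− Register file (value/rob): R0: 0/0, R1: 100/0, R2: 13/0, R3: 0/0, R4: 0/0
− PC: 0
− Operation labels shown: ld r0,(r1); mul r4,r0,r2; add r3,r3,r4
− ROB: empty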
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 1)
[Figure: pipeline state in cycle 1 with memory unit, Execute 1, Execute 2, reservation stations RS 1 to RS 6, and a ROB with entries 1 to 5]
− Register file (value/rob): R0: 0/1, R1: 100/0, R2: 13/0, R3: 0/0, R4: 0/0
− PC: 0
− Operation labels shown: ld r0,(100); mul r4,r0,r2; add r3,r3,r4; add r1,r1,1
− ROB labels shown: ld r0,(100)
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 2)
[Figure: pipeline state in cycle 2 with memory unit, Execute 1, Execute 2, reservation stations RS 1 to RS 6, and a ROB with entries 1 to 5]
− Register file (value/rob): R0: 0/1, R1: 100/0, R2: 13/0, R3: 0/0, R4: 0/2
− PC: 0
− Operation labels shown: ld r0,(100); mul r4,rob1,13; add r3,r3,r4; add r1,r1,1; bne loop,r1,200
− ROB labels shown: ld r0,(100); mul r4
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 3)
[Figure: pipeline state in cycle 3 with memory unit, Execute 1, Execute 2, reservation stations RS 1 to RS 6, and a ROB with entries 1 to 5]
− Register file (value/rob): R0: 0/1, R1: 100/0, R2: 13/0, R3: 0/3, R4: 0/2
− PC: 0
− Operation labels shown: ld r0,(100); mul r4,rob1,13; add r3,0,rob2; add r1,r1,1; bne loop,r1,200; ld r0,(r1)
− ROB labels shown: ld r0,(100); mul r4; add r3
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 4)
[Figure: pipeline state in cycle 4 with memory unit, Execute 1, Execute 2, reservation stations RS 1 to RS 6, and a ROB with entries 1 to 5]
− Register file (value/rob): R0: 0/1, R1: 100/4, R2: 13/0, R3: 0/3, R4: 0/2
− PC: 0
− Operation labels shown: ld r0,(100); mul r4,20,13; add r3,0,rob2; add r1,100,1; bne loop,r1,200; ld r0,(r1); mul r4,r0,r2
− ROB labels shown: 20; mul r4; add r3; add r1
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 5)
[Figure: pipeline state in cycle 5 with memory unit, Execute 1, Execute 2, EU-bus, reservation stations RS 1 to RS 6, and a ROB with entries 1 to 5]
− Register file (value/rob): R0: 20/0, R1: 100/4, R2: 13/0, R3: 0/3, R4: 0/2
− PC: 5
− Operation labels shown: mul r4,20,13; add r3,0,rob2; add r1,100,1; bne loop,rob4,200; ld r0,(r1); mul r4,r0,r2; add r3,r3,r4
− ROB labels shown: mul r4; add r3; add r1; bne
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 6)
[Figure: pipeline state in cycle 6 with memory unit, Execute 1, Execute 2, reservation stations RS 1 to RS 6, and a ROB with entries 1 to 5]
− Register file (value/rob): R0: 20/1, R1: 100/4, R2: 13/0, R3: 0/3, R4: 0/2
− PC: 5
− Operation labels shown: mul r4,20,13; add r3,0,260; add r1,100,1; bne loop,101,200; ld r0,(101); mul r4,r0,r2; add r3,r3,r4; add r1,r1,1
− ROB labels shown: 260; add r3; 101; ld r0; bne
Operations are fetched and issued speculatively
Operations from different loop iterations are in the pipeline
The ld-operation is no longer dependent on the branch operation
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 7)
[Figure: pipeline state in cycle 7 with memory unit, Execute 1, Execute 2, reservation stations RS 1 to RS 6, and a ROB with entries 1 to 5]
− Register file (value/rob): R0: 20/1, R1: 100/4, R2: 13/0, R3: 0/3, R4: 260/2
− PC: 5
− Operation labels shown: add r3,0,260; bne loop,101,200; ld r0,(101); mul r4,rob1,13; add r3,r3,r4; add r1,r1,1; bne loop,r1,200
− ROB labels shown: add r3; 101; ld r0; mul r4; bne
Now speculative execution is possible
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 8)
[Figure: pipeline state in cycle 8 with memory unit, Execute 1, Execute 2, reservation stations RS 1 to RS 6, and a ROB with entries 1 to 5]
− Register file (value/rob): R0: 20/1, R1: 100/4, R2: 13/0, R3: 0/3, R4: 260/2
− PC: 5
− Operation labels shown: add r3,0,260; bne loop,101,200; ld r0,(101); mul r4,rob1,13; add r3,r3,r4; add r1,r1,1; bne loop,r1,200
− ROB labels shown: 260; 101; ld r0; mul r4; bne c
Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 9)
[Figure: pipeline state in cycle 9 with memory unit, Execute 1, Execute 2, reservation stations RS 1 to RS 6, and a ROB with entries 1 to 5]
− Register file (value/rob): R0: 20/1, R1: 100/4, R2: 13/0, R3: 260/3, R4: 260/2
− PC: 5
− Operation labels shown: ld r0,(101); mul r4,8,13; add r3,260,rob2; add r1,101,1; add r1,r1,1; bne loop,r1,200; ld r0,(r1)
− ROB labels shown: 101; 8; mul r4; bne c; add r3
Summary
We have seen the Tomasulo algorithm with speculation
Importance of the reorder buffer
Execution of
− ALU-operations
− Branch-operations
− Memory-operations
But: the issue- and commit-phase are limited to processing a single operation per clock cycle
Content
Tomasulo with speculative execution
Introducing superscalarity into the instruction pipeline
Multithreading
Superscalar Instruction Pipeline
So far: only the data path is superscalar
− Parallel execution of operations in the data path, but CPI < 1 is not possible
Required: superscalar
− Fetch-,
− Issue-,
− WB-,
− Commit-phase
[Figure: architecture with program memory, PC, memory unit, Execute 1 … Execute m with their RSs, ROB, and register file (value/rob): R0: 5/0, R1: 14/0, R2: 20/0, R3: 17/2]
Superscalar Fetch-Phase
Fetch:
− Fetch n operations simultaneously from the code cache/memory
− Requires wider buses
[Figure: cache/memory fills the instruction queue; operation buses 1 … n lead to the RSs; the register file drives the operand buses A1 … An and B1 … Bn]
Superscalar Issue-Phase
Issue:
− Issue the first n operations from the instruction queue (n operation buses required)
− n operand buses for the left operand required (A)
− n operand buses for the right operand required (B)
Checking for a free RS and a free ROB-entry must be done simultaneously for up to n operations!
[Figure: cache/memory fills the instruction queue; operation buses 1 … n lead to the RSs; the register file drives the operand buses A1 … An and B1 … Bn]
Implementing simultaneous checking in issue-phase
[Figure: issue logic]
− For a single operation: the issue logic takes the old ROB status and the old RS status and produces the new state for the ROB, the new state for the RS, and the RF control for operand buses A and B
− For two operations: the issue logic for the 1st op and the issue logic for the 2nd op both take the old ROB status and the old RS status; their outputs are combined into the new state for the ROB and the new state for the RS, together with the RF control for operand buses A1 and B1 and for operand buses A2 and B2
− …
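Functionally, the combined issue check decides how many of the first n operations can be issued in the same cycle. The Python sketch below models that decision sequentially, whereas the hardware evaluates it in parallel; the function name issue_group, the unit names, and the resource counters are hypothetical.

```python
def issue_group(ops, free_rs, free_rob, width):
    """Return the prefix of `ops` that can be issued in one cycle.

    free_rs maps a unit name to its number of free reservation stations,
    free_rob is the number of free ROB-entries. Issue stays in order, so
    the first operation that finds no resources blocks all later ones.
    """
    rs_left = dict(free_rs)
    issued = []
    for unit, name in ops[:width]:
        if free_rob - len(issued) == 0 or rs_left.get(unit, 0) == 0:
            break
        rs_left[unit] -= 1        # the 2nd check sees the effect of the 1st
        issued.append(name)
    return issued


ops = [("alu", "add"), ("alu", "sub"), ("mem", "ld")]
both = issue_group(ops, {"alu": 2, "mem": 1}, free_rob=4, width=2)
one = issue_group(ops, {"alu": 1, "mem": 1}, free_rob=4, width=2)
none = issue_group(ops, {"alu": 2, "mem": 1}, free_rob=0, width=2)
```

The second operation depends on the resources left over by the first one, which is why the logic for the 2nd op cannot simply be a copy of the logic for the 1st op.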
Superscalar WB-Phase
Every EU has its own result bus Ei
− All EUs may write simultaneously to the ROB
This also makes the bypass to the reservation stations more complex
[Figure: memory unit and Execute 1 … Execute m drive the result buses E0, E1, …, Em into the ROB; bypasses lead to the RSs. The operand fields of an RS entry (opc Qj Qk Vj Vk busy type rob stat misc) are fed from A1 … An, B1 … Bn, E0 … Em, and R1 … Rk]
Superscalar Commit-Phase
For up to n ROB-entries starting at the head:
− Check if the valid-bit is set to 1
− Then write their results to the register file
The register file needs n write-ports
[Figure: memory unit and Execute 1 … Execute m drive the result buses E0, E1, …, Em into the ROB; the ROB writes into the register file; bypasses lead to the RSs, whose operand fields are fed from A1 … An, B1 … Bn, E0 … Em, and R1 … Rk]
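A Python sketch of the superscalar commit. The name commit_group and the dst field are hypothetical. The sketch assumes that commit stops at the first entry whose valid-bit is 0, because commit has to stay in order.

```python
def commit_group(rob_order, regs, width):
    """Commit up to `width` oldest entries; stop at the first invalid one."""
    committed = 0
    while committed < width and rob_order and rob_order[0]["valid"] == 1:
        entry = rob_order.pop(0)             # head of the ROB
        regs[entry["dst"]] = entry["res"]    # one register write-port each
        committed += 1
    return committed


regs = {"r0": 0, "r1": 0, "r3": 0}
rob_order = [
    {"dst": "r0", "res": 70, "valid": 1},
    {"dst": "r3", "res": 56, "valid": 1},
    {"dst": "r1", "res": None, "valid": 0},   # blocks everything behind it
]
n = commit_group(rob_order, regs, width=4)
```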
Example PowerPC
Source: PowerPC™ e500 Core Family Reference Manual
Limitations for ILP
Memory bandwidth limits the number of simultaneously fetched operations (typically 4 to 6 operations)
HW overhead and delay for:
− Control logic of the issue phase
− Bypasses for reservation stations
− Number of read/write ports in the register file
Branches
− Possible solution: branch prediction
Available parallelism in the application
− Possible solution: multithreading
Motivation for Multithreading
True dependencies prevent the EUs from being used in parallel (horizontal performance loss)
Operations with a very long delay during execution create vertical performance loss
− E.g. a memory access of an operation A in a Pentium 4 (3-way superscalar) can take 380 clock cycles (cache miss)
− I.e. 1140 operations would have to bypass Op A in order to fully utilize the EUs during this time
− But: the reorder buffer has only 126 entries, so it is full after only 41 cycles
− Hence, 339 execution cycles are wasted
Solution multithreading: execute multiple threads that share the same execution units but have no dependencies on each other
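[Figure: EU usage over time. Horizontal loss: operations OP 1 … OP n with true dependencies leave EU slots unused in each cycle. Vertical loss: after OP A is issued, the following operations fill the ROB within only 41 cycles; the WB of A takes place after 380 cycles]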
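The numbers on this slide can be re-derived with a few lines of arithmetic. The constant names are hypothetical; the value of 41 cycles until the ROB is full is taken from the slide's figure as given and is not derived here.

```python
ISSUE_WIDTH = 3             # Pentium 4 is treated as 3-way superscalar on the slide
MISS_LATENCY = 380          # clock cycles for the memory access of Op A
CYCLES_UNTIL_ROB_FULL = 41  # taken from the slide's figure, not derived here

# Operations needed to keep all issue slots busy while A is outstanding
ops_needed = ISSUE_WIDTH * MISS_LATENCY

# Cycles in which nothing new can be issued because the ROB is full
wasted_cycles = MISS_LATENCY - CYCLES_UNTIL_ROB_FULL

# Share of the miss latency that can be covered with useful work
covered_percent = round(100 * CYCLES_UNTIL_ROB_FULL / MISS_LATENCY, 1)

print(ops_needed, wasted_cycles, covered_percent)  # 1140 339 10.8
```

Only about a tenth of the miss latency is hidden by out-of-order execution within a single thread in this example.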
Process vs. Thread
Each process has its own context
− Address space (code, data, heap, stack)
− TLB
− …
Switching between processes (context switch) takes tens of thousands of clock cycles
− OS is involved
Threads of the same process share the same context
Switching between two threads only requires changing the values in the architectural registers
[Figure: Process 1 and Process 2 each have their own code section, data section, heap, and stack; Thread 1 and Thread 2 in Process 2 share the code section, data section, and heap, and each has its own stack (Stack 1, Stack 2)]
Multithreading
Multithreading: a fixed number of n threads share the same execution units
Hardware supports fast switching between the n threads:
− n copies of some resources, e.g. architectural registers (including PC)
− Fixed partitioning of some resources, e.g. RS (or limited sharing)
− Shared usage of some resources, e.g. EUs
[Figure: pipeline with two thread contexts (PC 1/PC 2, RF 1/RF 2, ROB 1/ROB 2); instruction queues holding OP A … OP F; shared program memory, RS, Memory Unit, Execute 1 … Execute m, and EU bus]
Multithreading
Types of Multithreading:
− No MT
− Coarse-grained MT
− Fine-grained MT
− Simultaneous MT
Coarse Grained Multithreading
A single thread runs for many clock cycles before the hardware switches to another thread
Hardware switches between threads only if
− a long-running operation is detected, e.g. a cache miss, or
− a fixed time slice has passed
A processor with n-way MT appears to the operating system like n processors
− OS schedules n threads of the same process to these processors
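The switching rule can be written down as a small decision function. This is a sketch under assumptions: the function name `next_thread`, its parameters, and the round-robin choice of the next thread are hypothetical; the slide only names the two switch conditions.

```python
def next_thread(current, n_threads, cache_miss, cycles_in_slice, slice_length):
    """Return (thread for the next cycle, cycles it has used of its time slice).

    A switch happens only if a long-running operation (here: a cache miss)
    is detected or if the fixed time slice has passed.
    """
    slice_over = cycles_in_slice + 1 >= slice_length
    if cache_miss or slice_over:
        # assumption: the next thread is picked round robin
        return (current + 1) % n_threads, 0
    return current, cycles_in_slice + 1
```

For example, `next_thread(0, 2, True, 5, 100)` returns `(1, 0)`: the cache miss hands the pipeline front end to the second thread and restarts the slice counter.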
Example
Two threads are scheduled to the processor
Reservation stations and EUs are shared resources
Hardware switches between both PCs and IQs (e.g. by multiplexers)
Fetched operations are tagged with the thread number
Situation: Thread 1 is running
[Figure: pipeline with PC 1/PC 2, RF 1/RF 2, ROB 1/ROB 2 and shared RS and EUs; instruction queue of thread 1 holds OP D.1, OP E.1, OP F.1; instruction queue of thread 2 holds OP A.2, OP B.2, OP C.2; OP A.1, OP B.1, OP C.1 are in flight; active thread: 1]
Example
Memory operation D.1 of thread 1 was issued
Thread 1 is still running…
[Figure: OP D.1 has left instruction queue 1 and is in flight together with OP A.1, OP B.1, OP C.1; OP G.1 has been fetched into instruction queue 1; instruction queue 2 still holds OP A.2, OP B.2, OP C.2; active thread: 1]
Example
Memory operation is executed and a cache miss is detected
Processor has switched to thread 2
− Another PC is used
− Another instruction queue is used
[Figure: operations of thread 1 up to OP E.1 are in flight, with OP D.1 in the Memory Unit; instruction queue 2 holds OP A.2, OP B.2, OP C.2; active thread: 2]
Example
Issued operations of thread 1 are further processed
But issue now takes place from instruction queue 2
[Figure: OP A.2 is issued from instruction queue 2 and OP D.2 is fetched; operations of thread 1 remain in flight; active thread: 2]
[Figure, next step: OP A.2 and OP B.2 are in flight and OP E.2 is fetched; operations of thread 1 remain in flight; active thread: 2]
Example
Operations of thread 1 are further processed but not committed, while operations from thread 2 are processed simultaneously
If operation E.1 has a true dependency on D.1, then it blocks its reservation station for operations from thread 2
Balancing of shared resources is important
[Figure: OP D.1 and OP E.1 of thread 1 still occupy reservation stations while OP A.2, OP B.2, OP C.2 of thread 2 are in flight and OP F.2 is fetched; active thread: 2]
Coarse Grained MT - Limitations
Does not help to overcome the problem of horizontal performance loss (a single thread may not have enough ILP)
− Only right after switching between threads are operations of both threads processed simultaneously
Switching between threads may have a negative impact on the cache hit rate of each thread and thus on performance
Fine-Grained Multithreading
Processor switches to another thread in every clock cycle
− E.g. in a round-robin manner
This helps to overcome horizontal performance loss
A single instruction queue and a single reorder buffer are sufficient (shared)
Operations must be tagged with the corresponding thread number
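The round-robin fetch into a shared, tagged instruction queue can be sketched as follows. The function name `fine_grained_fetch` and the representation of a program as a list of operation names are assumptions made for illustration; one operation is fetched per cycle to keep the sketch small.

```python
def fine_grained_fetch(programs, cycles):
    """Fetch one operation per cycle, switching the thread in every cycle.

    programs holds one list of operation names per thread.
    Returns the shared instruction queue as (operation, thread number) pairs.
    """
    next_op = [0] * len(programs)  # one PC per thread
    queue = []                     # shared instruction queue
    for cycle in range(cycles):
        t = cycle % len(programs)  # round robin over the threads
        if next_op[t] < len(programs[t]):
            queue.append((programs[t][next_op[t]], t + 1))  # tag with thread number
            next_op[t] += 1
    return queue
```

For two threads with the operations A, B each, four cycles yield the queue content A.1, A.2, B.1, B.2.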
Example
[Figure sequence (10 steps): the active thread alternates between 1 and 2 in every clock cycle; operations are fetched via PC 1 and PC 2 in the order OP A.1, OP A.2, OP B.1, OP B.2, OP C.1, OP C.2, OP D.1, OP D.2 into the shared instruction queue and then pass through the RS, the EUs, and the shared ROB; RF 1 and RF 2 remain separate]
Fine-Grained Multithreading - Limitations
Vertical performance loss cannot be avoided
− A long-running operation prevents other operations of the same thread from being executed
− Due to the shared IQ and ROB, the other thread is also blocked after a while
  • Improvement: stop fetching for a blocked thread
Performance of a single thread is reduced (even if the second thread is blocked and provides no operations), because its issue takes place only in every second cycle
MT reduces cache performance
Simultaneous Multithreading
Mixing coarse- and fine-grained MT
In every clock cycle, operations from n threads are fetched and issued (Intel calls this Hyper-Threading)
Operations must be tagged with the corresponding thread number
Solves the problem of having either horizontal or vertical performance loss:
− If both threads are not blocked, the available ILP is utilized and horizontal performance loss is avoided
− If one thread is blocked, the other thread still uses the resources and vertical performance loss is avoided (but not horizontal loss)
Even if one thread is blocked, the other one can run at full speed (issue in every clock cycle)
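How the issue slots of one cycle are filled can be sketched like this. The function name `smt_issue`, the `blocked` flags, and the policy of handing out slots alternately to the threads that are not blocked are assumptions; real SMT processors use more elaborate selection policies.

```python
def smt_issue(queues, blocked, width):
    """Fill up to `width` issue slots in one cycle from per-thread queues.

    queues holds one list of operation names per thread (head at index 0),
    blocked holds one flag per thread.
    Returns the issued operations as (operation, thread number) pairs.
    """
    ready = [t for t in range(len(queues)) if not blocked[t] and queues[t]]
    issued = []
    while len(issued) < width and ready:
        for t in list(ready):  # assumption: slots are handed out alternately
            if len(issued) == width:
                break
            issued.append((queues[t].pop(0), t + 1))
            if not queues[t]:
                ready.remove(t)
    return issued
```

With two unblocked threads and two slots, each thread gets one slot; if thread 1 is blocked, thread 2 receives both slots in that cycle, which corresponds to running at full speed.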
Example
Fetch and issue take place simultaneously for both threads
Each thread has its own IQ, RF, PC, and ROB
Reservation stations are partitioned
[Figure: SMT pipeline with PC 1/PC 2, instruction queue 1/instruction queue 2, RF 1/RF 2, ROB 1/ROB 2; each RS is marked as used for thread 1 or used for thread 2; the instruction queues hold OP A … OP F]
Example
Both threads are executed...
Avoids horizontal performance loss
[Figure: OP A and OP B have been issued to the reservation stations of their threads; OP G and OP H have been fetched into the instruction queues]
[Figure, next step: OP C and OP D have been issued as well; OP A and OP B are being processed; OP I and OP J have been fetched into the instruction queues]
Example
Both threads are executed...
Avoids horizontal performance loss
But now the long-running operation E is issued
[Figure: OP E and OP F have been issued; OP K and OP L have been fetched into the instruction queues; each RS is marked as used for thread 1 or used for thread 2]
Example
Assume G has a true dependency on E
[Figure: the results of A, B, C, and D (Res A … Res D) are available in the ROBs; OP E and OP F are executing; OP G and OP H have been issued; OP M and OP N have been fetched]
Example
Assume G has a true dependency on E, and so does I
Then, thread 1 is now blocked…
[Figure: OP E is still executing; OP G and OP I wait in the reservation stations used for thread 1; operations of thread 2 continue to be fetched and issued (OP O, OP P in the instruction queue)]
Example
… but thread 2 can continue
[Figure: OP E is still executing and the operations of thread 1 keep waiting; thread 2 fetches and issues further operations (up to OP Q)]
Summary - Multithreading
Allows filling the pipeline with operations from different threads
− No data dependencies between operations from different threads
− Allows for higher resource utilization
Coarse-grained MT suffers from horizontal performance loss
Fine-grained MT suffers from vertical performance loss
SMT solves these problems
− Improvement: balancing between partitioned resources
All MT approaches have an impact on cache performance
Fine-grained MT in particular can also be used in statically scheduled processor pipelines to avoid hazards
− In a pipeline with n pipeline stages, operations from n threads are issued
− No data/control hazards occur because the operations in the pipeline have no dependencies on each other
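The last point can be checked with a small model of a statically scheduled pipeline. The function names are hypothetical, and the model assumes that exactly one operation enters the pipeline per cycle, taken round robin from n threads for a pipeline with n stages.

```python
def pipeline_occupancy(n_stages, cycles):
    """Return, for every cycle, which thread occupies each pipeline stage.

    In cycle c an operation of thread (c mod n_stages) enters stage 0 and
    all other operations advance by one stage. None marks an empty stage.
    """
    stages = [None] * n_stages
    history = []
    for cycle in range(cycles):
        stages = [cycle % n_stages] + stages[:-1]
        history.append(list(stages))
    return history


def has_same_thread_twice(stages):
    """True if two operations of the same thread are in the pipeline at once."""
    threads = [t for t in stages if t is not None]
    return len(threads) != len(set(threads))
```

In this model an operation has left the pipeline before the next operation of the same thread enters it, so two operations of one thread never overlap and no forwarding or stalling between them is needed.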