processor architecture - uni-potsdam.de superscalarity into the instruction pipeline ... (see table...


Processor Architecture

Advanced Dynamic Scheduling Techniques

M. Schölzel


Content

Tomasulo with speculative execution

Introducing superscalarity into the instruction pipeline

Multithreading


Control Flow Dependencies

Let b be a conditional branch at address a with branch target z.

An operation c is control flow dependent on b if the execution of c depends on the outcome of b.

Otherwise, c is not control flow dependent on b.

Examples (control flow graphs with a branch b at address a, fall-through address a+1, and branch target z; figures not reproduced):

− c is not control flow dependent on b
− c is control flow dependent on b1 and b2
− c is not control flow dependent on b
− What about d?

Scheduling restrictions imposed by control flow dependencies

… for control flow dependent operations:

− cannot be moved before the branch (this would be a speculative execution of c)

… for not control flow dependent operations:

− cannot be moved behind a branch (c may then not be executed)

[Figures: program order compared with the reordered program, once for speculative execution of c and once for the case in which c may not be executed.]

Performance Problem due to Control Hazards

Problem:
− Branch target of a branch operation is only known after its execution
− Long pipeline stalls are required in processors with deep pipelines

Solution:
− Branch prediction helps, but is limited
  • Tomasulo supports speculative fetch and issue, but not speculative execution of operations

[Figure: Memory, PC, and Instruction Queue in front of the reservation stations; while the branch operation b is in the instruction queue, the address for the next instruction fetch is not known.]

Drawbacks of Speculative Execution

What happens if an operation is executed speculatively and the speculation was wrong?
− May affect the data flow
− May affect the exception behavior

[Figure: control flow graph of a program with branch b, a speculatively executed block c, and the block to be executed, next to the executed program after dynamic scheduling.]

Example

Affected Data Flow
− c is executed speculatively before b
− mul-operation now receives the value in r1 from the sub-operation instead of from the add-operation

  a: add r1 <- r2,r3
  b: (branch)
  c: sub r1 <- r4,r5      (the other path does not write in r1)
     mul r0 <- r1,r6

Affected Exception Behavior
− c is executed speculatively before b
− Division by 0 possible

  b: if r5 = 0 then x else y
  x: (no division executed)
  y: div r1 <- r4,r5      (c)

Solution

Divide the WB-phase of the Tomasulo algorithm into two phases:
− Forwarding results from EU to RSs (WB-Phase)
− Writing results into architectural registers/memory (Commit-Phase)

Implemented by:
− Reorder-Buffer for buffering results from the WB-phase
− Committing buffered results from the Reorder-Buffer in-order

By this:
− Usage of speculative results is possible without modifying architectural registers/memory locations

Architecture for Tomasulo with speculative Execution

[Block diagram; components shown: Program Memory, PC, Instruction Queue, Operation Bus, Operand Bus A, Operand Bus B, reservation stations (RS) in front of the Memory Unit and of Execute 1 … Execute m, EU-Buses, Result Bus, Reorder Buffer, and the Architecture Registers Reg 0 … Reg r.]

Reorder Buffer (ROB)

Implemented as a queue:
− When issuing an operation, an entry is reserved
− During WB, the result is written back to the reserved entry
− Commit is done in-order and writes results back to the architectural registers
  • speculatively executed operations are committed after preceding branches have been committed

ROB-entries now have the meaning of virtual registers

[Figure: ROB as a queue of entries 1 … n, each with a busy flag, delimited by the pointers first and last. A DeMux writes values from the Result Bus into the reserved entry named by the RS; a Mux leads to the architectural registers; a bypass leads to the RSs and to the issue-phase.]
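The queue behaviour described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class and method names (`ReorderBuffer`, `reserve`, `write_back`, `commit`) are hypothetical, entries are indexed from 0 (the slides number ROB-entries from 1 and use 0 for "no entry"), and only the res, valid, and busy fields are modelled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RobEntry:
    res: Optional[int] = None  # buffered result
    valid: bool = False        # result has been written back
    busy: bool = False         # entry is reserved


class ReorderBuffer:
    """Circular queue; `first` is the oldest entry, `last` the next free one."""

    def __init__(self, size: int) -> None:
        self.entries = [RobEntry() for _ in range(size)]
        self.first = 0
        self.last = 0

    def reserve(self) -> Optional[int]:
        """Issue: reserve the entry at `last`; None means the ROB is full."""
        if self.entries[self.last].busy:
            return None
        idx = self.last
        self.entries[idx] = RobEntry(busy=True)
        self.last = (self.last + 1) % len(self.entries)
        return idx

    def write_back(self, idx: int, value: int) -> None:
        """WB (out-of-order): store the result in the reserved entry."""
        self.entries[idx].res = value
        self.entries[idx].valid = True

    def commit(self) -> Optional[Tuple[int, int]]:
        """Commit (in-order): release the oldest entry if its result is valid."""
        entry = self.entries[self.first]
        if not (entry.busy and entry.valid):
            return None
        out = (self.first, entry.res)
        self.entries[self.first] = RobEntry()
        self.first = (self.first + 1) % len(self.entries)
        return out
```

Note that a younger entry whose result is already valid still has to wait in `commit` until all older entries have been committed; this is what keeps speculative results out of the architectural state.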


Fields of the ROB

Structure of a ROB-entry

Meaning of the fields depends on the operation type

Operation types:
− Branch operation (type = 3): res = computed target address (will be stored in the PC); addr = c (speculation was correct) or w (speculation was wrong)
− Memory operation (type = 2): res = value to be stored in the memory; addr = address at which the res-value should be stored
− ALU operation (type = 1): res = result of the operation; addr = unused (-)

For all operation types:
− busy: entry is reserved
− valid: 0 = result has not been computed yet; 1 = result was computed and is available in the res-field

In the examples, a ROB-entry is written as [res, addr, type, valid, busy].

Reservation Station Fields

RS has the same functionality as in ordinary Tomasulo:
− Buffers operations
− Buffers operands

But ROB-entries are used for determining the operand source (virtual register)

Fields of a reservation station:
− opc: operation to be executed (e.g. add, sub, mul, …)
− Qj = x, if ROB-entry x will store the value for operand A, otherwise 0
− Qk = x, if ROB-entry x will store the value for operand B, otherwise 0
− Vj: value for operand A
− Vk: value for operand B
− misc: miscellaneous
− type: type of operation (see table in previous slide)
− rob: reserved ROB-entry
− stat: status in pipeline (RO, EX, WB)
− busy: RS is occupied/free

In the examples, a reservation station is written as [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy].

[Figure: reservation stations connected to the Operation Bus and the Operand Buses A and B through a DeMux, and to the EU-Bus through a Mux.]

Register File Extensions

Mapping of architecture registers to virtual registers (ROB-entries)

Architecture register n stores the ROB-entry of the latest operation that is computing the value for n (register renaming)

Example (rob-fields of the register file):
− Reg 0: rob = 5 (ROB-entry 5 contains the result of the latest operation with destination register 0)
− Reg 1: rob = 0 (register 1 is not computed by any operation in the pipeline)
− Reg 2: rob = 1 (ROB-entry 1 contains the result of the latest operation with destination register 2)
− Reg r: rob = 0

[Figure: register file Reg 0 … Reg r, each register extended by a rob-field, connected to the Result Bus and the Operand Buses A and B.]

Overview Pipeline Phases

Issue
− Schedule operation from instruction queue to RS
− Read operand values or rename registers (solving WAR- and WAW-Hazards)
− Reserve ROB-entry
− Issue is in-order

Execute
− Wait for operands to be ready
− Execute operation as soon as operands are ready and EU is available (solve RAW-Hazard)
− Execute is out-of-order

Write-Back
− Write result through result bus into reserved ROB-entry
− WB is out-of-order

Commit
− Write results from ROB in order into destination registers/memory
− Commit is in-order

Overview Pipeline Phases (Issue)

Issue operation from instruction queue to RS, if:
− RS is free and
− ROB is not full

Otherwise: Stall issue stage

Allocate RS- and ROB-entry (the ROB-entry corresponds to a virtual register)

Read operands, if
− present in register file, or
− present in ROB

[Figure: architecture diagram (Program Memory, PC, Memory Unit, Execute 1 … Execute m with their RSs, register file, Reorder Buffer); Op A is placed in an RS and a slot in the Reorder Buffer is reserved for Op A.]

Overview Pipeline Phases (Execute)

Operation is waiting in RS for operands and free EU

Execute operation as soon as all operands are available and EU is free

RS can store state of operation during execution

[Figure: architecture diagram; Op A in its RS and in the execution unit; a slot in the Reorder Buffer is reserved for Op A.]

Overview Pipeline Phases (Write-Back)

Write result into reserved ROB-entry

ROB-entry ID has been stored in the rob-field of the RS

Result is forwarded to all waiting RS through the result bus (value is identified by its ROB ID)

Free RS

[Figure: architecture diagram; the result of Op A is stored in the slot reserved for Op A in the Reorder Buffer.]

Overview Pipeline Phases (Commit)

Write results from the first entry in the ROB into the corresponding destination register

Free ROB-entry

[Figure: architecture diagram; the result in the slot reserved for Op A is written from the Reorder Buffer into the register file.]

Issue-Phase – Details (for ALU-operations)

For the operation that will be issued, let:
− opc … operation type (add, sub, mul, …)
− src1, src2 … source registers
− dst … destination register

Operation can be issued, if
− there exists an x, where RS[x].busy = 0 and
− ROB[last].busy = 0

Update after issue
− if Reg[src1].rob = 0 then                                  // determine value of left operand
    RS[x].Qj := 0; RS[x].Vj := Reg[src1]                     // read left operand from the register file
  else                                                       // left operand is still under computation or in ROB
    if ROB[Reg[src1].rob].valid = 1 then
      RS[x].Qj := 0; RS[x].Vj := ROB[Reg[src1].rob].res      // read operand from ROB
    else
      RS[x].Qj := Reg[src1].rob                              // wait for operand in RS
    fi
  fi
− if Reg[src2].rob = 0 then …                                // the same for the right operand (Qk, Vk)
− RS[x].busy := 1; RS[x].rob := last; RS[x].opc := opc; RS[x].type := 1; RS[x].status := RO
− ROB[last].busy := 1; ROB[last].valid := 0; ROB[last].type := 1; Reg[dst].rob := last   // reserve ROB-entry and rename dst (cf. Example 1)
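The issue rules above can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the lecture's implementation: reservation stations, ROB-entries, and registers are modelled as plain dicts, the function names `read_operand` and `issue_alu` are hypothetical, and ROB-entries are numbered from 1 so that rob = 0 means "value is in the register file", as in the slides. Advancing the `last` pointer is left to the caller.

```python
def read_operand(reg, rob, src):
    """Return (Q, V) for source register `src`: Q = 0 means V holds the value."""
    tag = reg[src]["rob"]
    if tag == 0:                    # value is in the register file
        return 0, reg[src]["val"]
    if rob[tag]["valid"] == 1:      # value is already available in the ROB
        return 0, rob[tag]["res"]
    return tag, None                # wait for ROB-entry `tag`


def issue_alu(rs, rob, reg, last, opc, dst, src1, src2):
    """Issue an ALU operation; return the index of the RS used, or None on a stall."""
    x = next((i for i, s in enumerate(rs) if s["busy"] == 0), None)
    if x is None or rob[last]["busy"] == 1:
        return None                 # no free RS or ROB full: stall issue stage
    # Operands are read before dst is renamed, so "add r1 <- r1, r0" still
    # sees the previous producer of r1.
    qj, vj = read_operand(reg, rob, src1)
    qk, vk = read_operand(reg, rob, src2)
    rs[x] = dict(opc=opc, Qj=qj, Qk=qk, Vj=vj, Vk=vk,
                 busy=1, type=1, rob=last, stat="RO")
    rob[last] = dict(res=None, addr=None, type=1, valid=0, busy=1)
    reg[dst]["rob"] = last          # register renaming
    return x
```

With the register and ROB contents of Example 1 (R1 = 14 in the register file, R2 produced by the valid ROB-entry 4 holding 56), issuing `add r0 <- r1, r2` with `last = 5` fills the station with Vj = 14, Vk = 56 and sets the rob-field of R0 to 5.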


Issue-Phase – Details (Example 1)

Situation:
− Op A can be issued
− Value for r1 is taken from the register file
− Value for r2 is taken from the ROB

Update (for y = 1, 2; Qj/Vj belong to src1, Qk/Vk to src2)
− if Reg[srcy].rob = 0 then
    RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy]
  else if ROB[Reg[srcy].rob].valid = 1 then
    RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res
  else
    RS[x].Qj/k := Reg[srcy].rob
  fi

State (ROB-entry = [res, addr, type, valid, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− Instruction queue: Op A is next to be issued
− RS: all free
− ROB: 4: [56,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 89/4, R3: 17/0

Issue-Phase – Details (Example 1)

Situation:
− Op A was issued and ROB-entry 5 was allocated

Update after issue (for y = 1, 2; Qj/Vj belong to src1, Qk/Vk to src2)
− if Reg[srcy].rob = 0 then
    RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy]
  else if ROB[Reg[srcy].rob].valid = 1 then
    RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res
  else
    RS[x].Qj/k := Reg[srcy].rob
  fi

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− RS (Execute 1): [add,0,0,14,56,-,1,5,RO,1]
− ROB: 4: [56,-,1,1,1]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/5, R1: 14/0, R2: 89/4, R3: 17/0

Issue-Phase – Details (Example 2)

Situation:
− Issue of Op A
− Value of r1 is read from the register file
− Value of r2 is still being computed (ld-operation that reserved ROB-entry 4)

Update after issue (for y = 1, 2; Qj/Vj belong to src1, Qk/Vk to src2):
− if Reg[srcy].rob = 0 then
    RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy]
  else if ROB[Reg[srcy].rob].valid = 1 then
    RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res
  else
    RS[x].Qj/k := Reg[srcy].rob
  fi

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− Instruction queue: Op A is next to be issued
− RS (Memory Unit): [ld,0,0,100,-,-,2,4,RO,1]
− ROB: 4: [-,-,1,0,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 89/4, R3: 17/0

Issue-Phase – Details (Example 2)

Situation:
− Op A was issued
− Uses ROB-entry 5
− Has to wait for the value from r

Update after issue (for y = 1, 2; Qj/Vj belong to src1, Qk/Vk to src2):
− if Reg[srcy].rob = 0 then
    RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy]
  else if ROB[Reg[srcy].rob].valid = 1 then
    RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res
  else
    RS[x].Qj/k := Reg[srcy].rob
  fi

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− RS (Memory Unit): [ld,0,0,100,-,-,2,4,RO,1]
− RS (Execute 1): [add,0,4,14,-,-,1,5,RO,1]
− ROB: 4: [-,-,1,0,1]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 89/4, R3: 17/0

Execute – Details

Executing an operation from an RS is possible, if
− RS[x].status = RO and
− RS[x].Qj = 0 and RS[x].Qk = 0

Update after start of execution:
− Perform computation with RS[x].Vj and RS[x].Vk
− RS[x].status := EX

Update after end of execution:
− RS[x].Vj := res   // Store result temporarily in the reservation station
− RS[x].status := WB
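The three execute steps above might look as follows in Python. The sketch assumes a reservation station modelled as a dict and a small, hypothetical operation table `OPS`; the function names are illustrative and the availability of the execution unit is not modelled.

```python
import operator

# Hypothetical operation table for the sketch (only three ALU operations).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def can_execute(s):
    """Both operands are ready and the operation has not been started yet."""
    return s["busy"] == 1 and s["stat"] == "RO" and s["Qj"] == 0 and s["Qk"] == 0


def start_execute(s):
    """Start of execution: the station keeps the state of the operation."""
    s["stat"] = "EX"


def finish_execute(s):
    """End of execution: the result is stored temporarily in Vj."""
    s["Vj"] = OPS[s["opc"]](s["Vj"], s["Vk"])
    s["stat"] = "WB"
```

A station that still has a non-zero Qj or Qk is simply skipped by `can_execute`, which is how RAW-hazards are resolved: the operation waits until the missing operand has been forwarded.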


Execute-Phase – Details (Example 3)

Both operands are ready:
− RS[x].status = RO and
− RS[x].Qj = 0 and RS[x].Qk = 0

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− Instruction queue: Op A
− RS (Memory Unit): [ld,0,0,100,-,-,2,4,RO,1]
− ROB: 4: [-,-,1,0,1]
− Register file (rob-fields): Reg 0: 0, Reg 1: 0, Reg 2: 4, Reg r: 0

Execute-Phase – Details (Example 3)

Operation is executed:
− RS[x].status = EX

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− Instruction queue: Op A
− RS (Memory Unit): [ld,0,0,100,-,-,2,4,EX,1]
− ROB: 4: [-,-,1,0,1]
− Register file (rob-fields): Reg 0: 0, Reg 1: 0, Reg 2: 4, Reg r: 0

Execute-Phase – Details (Example 3)

Result is computed:
− Result will be stored temporarily in the field Vj
− RS[x].status = WB

Result is ready for WB

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− Instruction queue: Op A
− RS (Memory Unit): [ld,0,0,89,-,-,2,4,WB,1]
− ROB: 4: [-,-,1,0,1]
− Register file (rob-fields): Reg 0: 0, Reg 1: 0, Reg 2: 4, Reg r: 0

Write-Back – Details (ALU-Operation)

Write-Back of the result res from RS x is possible, if
− RS[x].status = WB and
− Result bus is available

Update after WB:
− ROB[RS[x].rob].res := RS[x].Vj     // Write result to allocated ROB-entry
− ROB[RS[x].rob].valid := 1          // Declare ROB-entry as valid
− RS[x].busy := 0                    // free RS
− for all reservation stations y ≠ x:   // Forwarding of the result
    if RS[y].Qj = RS[x].rob then RS[y].Vj := RS[x].Vj; RS[y].Qj := 0
    if RS[y].Qk = RS[x].rob then RS[y].Vk := RS[x].Vj; RS[y].Qk := 0
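A minimal Python sketch of the write-back update, assuming dict-based reservation stations and ROB-entries and a result bus that is always available; the function name `write_back` is hypothetical. The pair (ROB-entry ID, value) plays the role of the data on the result bus.

```python
def write_back(rs, rob, x):
    """Write the result of RS x to its ROB-entry and forward it to waiting RSs."""
    s = rs[x]
    if s["busy"] != 1 or s["stat"] != "WB":
        return False
    tag, value = s["rob"], s["Vj"]   # (ROB-entry ID, result) on the result bus
    rob[tag]["res"] = value          # write result to the allocated ROB-entry
    rob[tag]["valid"] = 1            # declare ROB-entry as valid
    s["busy"] = 0                    # free the RS
    for y, t in enumerate(rs):       # forwarding to all other occupied RSs
        if y == x or t["busy"] != 1:
            continue
        if t["Qj"] == tag:
            t["Vj"], t["Qj"] = value, 0
        if t["Qk"] == tag:
            t["Vk"], t["Qk"] = value, 0
    return True
```

Waiting stations compare their Qj/Qk tags against the ROB-entry ID, not against a register number, which is why the ROB-entries act as virtual registers.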


Write-Back-Phase – Details (Example 4)

Situation:
− Result of the ld-operation is written back
− Result bus contains:
  • ROB-entry ID, e.g. 4
  • ROB-value, e.g. 89
− add-operation waits for the right-hand operand

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− Instruction queue: Op B
− RS (Memory Unit): [ld,0,0,20,-,-,2,4,WB,1]
− RS (Execute 1): [add,0,4,14,-,-,1,5,RO,1]
− Result bus: (4,89)
− ROB: 4: [-,-,1,0,1]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/5, R1: 14/0, R2: 89/4, R3: 17/0

Write-Back-Phase – Details (Example 4)

Situation:
− Result was stored in ROB-entry 4
− RS containing the add-operation has also stored the result
− RS of the ld-operation was freed

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− Instruction queue: Op B
− RS (Execute 1): [add,0,0,14,20,-,1,5,RO,1]
− ROB: 4: [20,-,1,1,1]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/5, R1: 14/0, R2: 89/4, R3: 17/0

Commit– Details (ALU-Operation)

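It must be checked:
− ROB[head].valid = 1   (head = first entry of the ROB)

Update by commit:
− for all architectural registers r with Reg[r].rob = head do
    Reg[r] := ROB[head].res; Reg[r].rob := 0
− ROB[head].valid := 0; ROB[head].busy := 0   // free ROB-entry (cf. Example 5)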
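The commit step for an ALU operation can be sketched in Python as below. The dict-based data layout and the name `commit_alu` are assumptions made for illustration; advancing the head pointer to the next ROB-entry is left to the caller.

```python
def commit_alu(rob, reg, head):
    """Commit the ROB-entry at `head`; return False if its result is not valid yet."""
    e = rob[head]
    if not (e["busy"] == 1 and e["valid"] == 1):
        return False
    for r in reg.values():
        # Only registers still renamed to this entry receive the value; a
        # register renamed again by a younger operation keeps its newer tag.
        if r["rob"] == head:
            r["val"] = e["res"]
            r["rob"] = 0
    e["valid"] = 0                   # free the ROB-entry
    e["busy"] = 0
    return True
```

With the state of Example 5 (head = 4, ROB-entry 4 holding 20, R2 renamed to entry 4), the call sets R2 to 20, clears its rob-field, and frees entry 4.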


Commit-Phase – Details (Example 5)

Situation:
− Let head = 4 be the ROB-head
− R2 waits for the result from ROB-entry 4

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− Instruction queue: Op B
− RS (Execute 1): [add,0,0,56,89,-,1,5,RO,1]
− Result bus: (4,89)
− ROB: 4: [20,-,1,1,1]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/5, R1: 14/0, R2: 89/4, R3: 17/0

Commit-Phase – Details (Example 5)

Situation:
− R2 has received the result from the ROB

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: add r0 <- r1, r2 // Op A
           sub r3 <- r0, r1 // Op B
           …
− Instruction queue: Op B
− RS (Execute 1): [add,0,0,56,89,-,1,5,RO,1]
− Result bus: (4,89)
− ROB: 4: [20,-,1,0,0]; 5: [-,-,1,0,1]
− Register file (value/rob): R0: 5/5, R1: 14/0, R2: 20/0, R3: 17/0

Executing Branch Operations

Issue:
− Vk-field: stores branch target z
− misc-field: remembers address a of the branch operation
− misc-field: also remembers which address ('z' or 'a') was predicted

Execute:
− Computed target address is stored in the Vk-field of the RS:
  • Vk := z, if branch is taken
  • Vk := a+1, if branch is not taken
− misc-field stores whether or not the prediction was correct ('c' = correct; 'w' = wrong)

Write-Back:
− res-field of the ROB-entry receives the branch target (Vk-field of the RS)
− addr-field receives the value of the misc-field from the RS: 'c' or 'w'

Commit:
− If addr-field = c, nothing must be done (operations were fetched from the correct address)
− If addr-field = w, then copy the res-field into the PC and flush the whole pipeline:
  • All subsequent ROB-entries
  • All RS-entries
  • Instruction queue
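The commit rule for branches can be sketched in Python as follows. This is a sketch under assumptions: the data layout is dict-based, the name `commit_branch` is hypothetical, and clearing the rob-fields of the register file on a flush is an addition of this sketch (the slides do not mention it) so that no register keeps pointing to a flushed ROB-entry.

```python
def commit_branch(rob, rs, reg, queue, head, pc):
    """Commit the branch at the ROB-head and return the PC to continue with."""
    e = rob[head]
    if not (e["type"] == 3 and e["valid"] == 1):
        raise ValueError("ROB-head is not a completed branch operation")
    if e["addr"] == "w":             # prediction was wrong
        pc = e["res"]                # correct target is in the res-field
        # The branch is the oldest entry, so every other entry is younger
        # and was fetched from the wrong path: flush them all.
        for entry in rob.values():
            entry["valid"] = 0
            entry["busy"] = 0
        for s in rs:
            s["busy"] = 0
        for r in reg.values():       # assumption of this sketch, see lead-in
            r["rob"] = 0
        queue.clear()
    else:                            # prediction was correct: just free the entry
        e["valid"] = 0
        e["busy"] = 0
    return pc
```

Because speculative results are only buffered in the ROB, discarding the younger entries is enough to undo the speculation; the architectural registers and memory were never modified by them.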


Branch – Details (Example 6 – correct prediction)

Situation:
− Branch-operation was issued to RS2
− Branch depends on ld-operation
− Op A, Op B, … will be executed speculatively

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: 111: ld r2 <- (200)
           112: bz r2, #123
           113: add r1 <- r1, r0 // Op A
           114: sub r3 <- r0, r1 // Op B
           115: …
− Instruction queue: Op A, Op B, Op C
− RS (Memory Unit): [ld,0,0,-,200,-,2,1,EX,1] (ld in execution)
− RS (Execute 1): [bz,1,0,-,123,112a,3,2,RO,1]
− ROB: 1: [-,-,2,0,1]; 2: [-,-,3,0,1]; 3: [-,-,-,0,0]; 4: [-,-,-,0,0]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 89/1, R3: -/0

Branch – Details (Example 6 – correct prediction)

Situation:
− Op A is executed speculatively
− Op B waits for the result of Op A

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: 111: ld r2 <- (200)
           112: bz r2, #123
           113: add r1 <- r1, r0 // Op A
           114: sub r3 <- r0, r1 // Op B
           115: …
− Instruction queue: Op C, Op D, Op E
− RS (Memory Unit): [ld,0,0,-,200,-,2,1,EX,1] (ld in execution)
− RS (Execute 1): [bz,1,0,-,123,112a,3,2,RO,1]
− Further RSs: Op A and Op B (Op A in execution)
− ROB: 1: [-,-,2,0,1]; 2: [-,-,3,0,1]; 3: [-,-,1,0,1]; 4: [-,-,1,0,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 89/1, R3: -/4

Branch – Details (Example 6 – correct prediction)

Situation:
− Op A wrote its result to the ROB, but not to R1
− Op B is executed speculatively

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: 111: ld r2 <- (200)
           112: bz r2, #123
           113: add r1 <- r1, r0 // Op A
           114: sub r3 <- r0, r1 // Op B
           115: …
− Instruction queue: Op C, Op D, Op E
− RS (Memory Unit): [ld,0,0,-,200,-,2,1,EX,1] (ld in execution)
− RS (Execute 1): [bz,1,0,-,123,112a,3,2,RO,1]
− Further RS: Op B (in execution)
− ROB: 1: [-,-,2,0,1]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-,-,1,0,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 89/1, R3: -/4

Branch – Details (Example 6 – correct prediction)

Situation:
− Op B wrote its result to the ROB
− ld wrote its result to the ROB
− bz can be executed

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: 111: ld r2 <- (200)
           112: bz r2, #123
           113: add r1 <- r1, r0 // Op A
           114: sub r3 <- r0, r1 // Op B
           115: …
− Instruction queue: Op C, Op D, Op E
− RS (Execute 1): [bz,0,0,6,123,112a,3,2,RO,1]
− ROB: 1: [6,-,2,1,1]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 89/1, R3: -/4

Branch – Details (Example 6 – correct prediction)

Situation:
− bz will be executed: Branch is not taken
− Commit for ld-operation is done

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: 111: ld r2 <- (200)
           112: bz r2, #123
           113: add r1 <- r1, r0 // Op A
           114: sub r3 <- r0, r1 // Op B
           115: …
− Instruction queue: Op C, Op D, Op E
− RS (Execute 1): [bz,0,0,6,123,112a,3,2,EX,1] (bz in execution)
− ROB: 1: [-,-,-,0,0]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 6/0, R3: -/4

Branch – Details (Example 6 – correct prediction)

Situation:
− bz was executed
− WB for bz was done
− Prediction was correct (ROB[2].addr := c)

State (ROB-entry = [res, addr, type, valid, busy]):
− Program: 111: ld r2 <- (200)
           112: bz r2, #123
           113: add r1 <- r1, r0 // Op A
           114: sub r3 <- r0, r1 // Op B
           115: …
− Instruction queue: Op C, Op D, Op E
− RS: all free
− ROB: 1: [-,-,-,0,0]; 2: [113,c,3,1,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 6/0, R3: -/4

Branch – Details (Example 6 – correct prediction)

Situation:
− Commit of the branch operation does not require any action, because the prediction was correct
− Now Commit can be done for the speculatively executed operations A and B

State (ROB-entry = [res, addr, type, valid, busy]):
− Program: 111: ld r2 <- (200)
           112: bz r2, #123
           113: add r1 <- r1, r0 // Op A
           114: sub r3 <- r0, r1 // Op B
           115: …
− Instruction queue: Op C, Op D, Op E
− RS: all free
− ROB: 1: [-,-,-,0,0]; 2: [-,-,-,0,0]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 6/0, R3: -/4

Branch – Details (Example 7 – wrong prediction)

Situation:
− Same situation as in Example 6
− But the ld-operation has stored 0 in R2
− I.e., the branch is taken

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: 111: ld r2 <- (200)
           112: bz r2, #123
           113: add r1 <- r1, r0 // Op A
           114: sub r3 <- r0, r1 // Op B
           115: …
− Instruction queue: Op C, Op D, Op E, Op F
− RS (Execute 1): [bz,0,0,0,123,112a,3,2,EX,1] (bz in execution)
− ROB: 1: [-,-,-,0,0]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 0/0, R3: -/4

Branch – Details (Example 7 – wrong prediction)

Situation:
− bz-operation was executed
− Prediction was wrong (ROB[2].addr := w)
− Correct target can be found in the res-field

State (ROB-entry = [res, addr, type, valid, busy]):
− Program: 111: ld r2 <- (200)
           112: bz r2, #123
           113: add r1 <- r1, r0 // Op A
           114: sub r3 <- r0, r1 // Op B
           115: …
− Speculatively fetched operations Op C … Op G are in the instruction queue and in RSs
− ROB: 1: [-,-,-,0,0]; 2: [123,w,3,1,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 6/0, R3: -/4

Branch – Details (Example 7 – wrong prediction)

Situation:
− Commit of the branch moves the correct address into the PC
− Flushing the pipeline

State as shown on the slide (ROB-entry = [res, addr, type, valid, busy]):
− Program: 111: ld r2 <- (200)
           112: bz r2, #123
           113: add r1 <- r1, r0 // Op A
           114: sub r3 <- r0, r1 // Op B
           115: …
− PC: 123
− Op C … Op G (instruction queue and RSs) are flushed
− ROB: 1: [-,-,-,0,0]; 2: [-,-,-,0,0]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− Register file (value/rob): R0: 5/0, R1: 14/3, R2: 6/0, R3: -/4

Executing Memory Operations

For out-of-order execution of memory operations holds:
− Ordering of load-operations among each other does not matter
− Ordering of load- and store-operations as well as of store- and store-operations must be maintained

Example:

  ld r2 <- (r0)
  ld r1 <- (r4)
  st r4 -> (r0)
  ld r5 <- (r6)
  st r7 -> (r8)

Strategy:
− Writing to memory takes place during commit-phase (in-order)
− Reading from memory takes place during execute-phase (out-of-order)
  • But only if the valid-field of all preceding write-operations in the ROB is 1
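The load rule of this strategy can be sketched in Python. Several things here are assumptions of the sketch rather than part of the slides: the ROB is a dict of entries plus a list `rob_order` with the entry IDs from oldest to youngest, an extra flag `is_store` distinguishes stores from loads (both have type = 2 in the ROB), and the function names are hypothetical. Reading the value from a matching older store in the ROB follows the last example slide.

```python
def preceding_stores(rob_order, rob, load_tag):
    """ROB-entries of all store-operations that are older than the load."""
    stores = []
    for tag in rob_order:            # oldest entry first
        if tag == load_tag:
            break
        e = rob[tag]
        if e["busy"] == 1 and e["type"] == 2 and e.get("is_store"):
            stores.append(e)
    return stores


def load_may_execute(rob_order, rob, load_tag):
    """A load may read only if all preceding stores have completed WB."""
    return all(e["valid"] == 1 for e in preceding_stores(rob_order, rob, load_tag))


def load_value(mem, rob_order, rob, load_tag, addr):
    """Read from memory, or from the ROB if an older store targets `addr`."""
    value = mem.get(addr)
    for e in preceding_stores(rob_order, rob, load_tag):
        if e["addr"] == addr:        # the youngest matching store wins
            value = e["res"]
    return value
```

Waiting for the valid-field is what makes the address comparison possible: before write-back, the addr-field of an older store is not yet filled in, so the load cannot know whether it reads the same location.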


Example (store-Operation)

Issue-Phase:
− Issue the first st-operation

Execute-Phase:
− Execution of the store-operation can start if both source operands are available
− Execution has no effect
− Rather, WB of the st-operation starts immediately

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: st r3 -> (r0)
           ld r3 <- (r1)
           st r1 -> (r2)
− RS (Memory Unit): [st,0,0,17,5,-,2,1,RO,1]
− ROB: 1: [-,-,2,0,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 20/0, R3: 17/0

Example (store-Operation)

Updates during WB of the st-operation
− ROB[x].res := RS[y].Vj
− ROB[x].addr := RS[y].Vk

State (ROB-entry = [res, addr, type, valid, busy]; RS = [opc, Qj, Qk, Vj, Vk, misc, type, rob, stat, busy]):
− Program: st r3 -> (r0)
           ld r3 <- (r1)
           st r1 -> (r2)
− RS (Memory Unit): [st,0,0,17,5,-,2,1,WB,1]
− ROB: 1: [-,-,2,0,1]
− Register file (value/rob): R0: 5/0, R1: 14/0, R2: 20/0, R3: 17/0

Example (store-Operation)

Commit for the st-operation:
− MEM[ROB[first].addr] := ROB[first].res
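The WB and commit rules for a store can be written down as a small sketch. The dictionary layout of the ROB and the freeing of the entry on commit (`busy = 0`) are assumptions for illustration; the values 17 and 5 are the operand values from the slide example.

```python
def store_writeback(rob, x, rs_entry):
    """WB of a store: copy value and address from the RS into ROB entry x."""
    rob[x]["res"] = rs_entry["Vj"]    # value to be stored
    rob[x]["addr"] = rs_entry["Vk"]   # target address
    rob[x]["valid"] = 1

def store_commit(rob, first, mem):
    """Commit of the store at the ROB head: only now is memory modified."""
    entry = rob[first]
    if entry["valid"] != 1:
        return False                  # head not ready, nothing is committed
    mem[entry["addr"]] = entry["res"]
    entry["busy"] = 0                 # assumption: the entry is freed on commit
    return True

rob = {1: {"res": None, "addr": None, "valid": 0, "busy": 1}}
mem = {}
store_writeback(rob, 1, {"Vj": 17, "Vk": 5})   # st r3 -> (r0), r3 = 17, r0 = 5
store_commit(rob, 1, mem)
print(mem)   # {5: 17}
```

Because memory is only touched in `store_commit`, a store that was issued speculatively can still be discarded without side effects.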

[Figure: same pipeline. Program: st r3 -> (r0); ld r3 <- (r1); st r1 -> (r2). The reservation stations are empty. ROB entry 1: [17,5,2,1,1] (res = 17, addr = 5, valid = 1). Registers: R0 = 5, R1 = 14, R2 = 20, R3 = 17.]

Page 50

Example (load-Operation)

Suppose first st-operation was issued and waits for execution

Then ld-operation was issued, and its source operands are available

[Figure: same pipeline. Program: st r3 -> (r0); ld r3 <- (r1); st r1 -> (r2). RS entries in the Memory Unit: [st,5,0,-,5,-,2,1,RO,1] and [ld,0,0,14,-,-,2,2,RO,1]. ROB entries 1 and 2: [-,-,2,0,1] each. Registers: R0 = 5, R1 = 14, R2 = 20, R3 = 17; R3 is tagged with ROB entry 2.]

Page 51

Example (load-Operation)

Situation:
− The ld-operation is not executed, because the valid bit of the first st-operation is 0

[Figure: same pipeline. RS entries in the Memory Unit: [st,5,0,-,5,-,2,1,RO,1] and [ld,0,0,14,-,-,2,2,RO,1]. ROB entries 1 and 2: [-,-,2,0,1] each. Registers: R0 = 5, R1 = 14, R2 = 20, R3 = 17; R3 is tagged with ROB entry 2.]

Page 52

Example (load-Operation)

Situation:
− Now the ld-operation can be executed (see the valid bit of the first st-operation)
− The ld-operation can read its value either from memory or from the ROB (if the addr field of a preceding st-operation matches the Vj field of the ld-operation)
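A possible way to express the "memory or ROB" choice is sketched below. The list-based ROB, the `kind` field and the choice of the youngest matching store are assumptions of this example; it also assumes that all preceding stores are already valid, as required before a load may execute. The memory content `mem[14] = 20` is taken from the result shown in the following figure.

```python
def load_value(rob, load_index, addr, mem):
    """Return the value a load at ROB position load_index reads from addr."""
    # Search preceding stores from youngest to oldest (store-to-load forwarding)
    for entry in reversed(rob[:load_index]):
        if entry["kind"] == "st" and entry["addr"] == addr:
            return entry["res"]
    return mem[addr]                  # no matching store: read from memory

rob = [
    {"kind": "st", "addr": 5, "res": 17},      # st r3 -> (r0), not yet committed
    {"kind": "ld", "addr": None, "res": None}, # ld r3 <- (r1)
]
mem = {5: 0, 14: 20}
print(load_value(rob, 1, 14, mem))   # 20, read from memory
print(load_value(rob, 1, 5, mem))    # 17, forwarded from the ROB
```

The second call shows why the check is needed: memory still holds the old value at address 5 until the store commits.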

[Figure: same pipeline. RS entry in the Memory Unit: [ld,0,0,14,-,-,2,2,RO,1]. ROB entry 1: [17,5,2,1,1]; ROB entry 2: [-,-,2,0,1]. Registers: R0 = 5, R1 = 14, R2 = 20, R3 = 17; R3 is tagged with ROB entry 2.]

Page 53

Example (load-Operation)

WB for the ld-operation is complete

The commit phase for ld-operations is the same as for ALU-operations

[Figure: same pipeline. RS entry in the Memory Unit: [ld,0,0,14,-,-,2,2,RO,1]. ROB entry 1: [17,5,2,1,1]; ROB entry 2: [20,-,2,1,1]. Registers: R0 = 5, R1 = 14, R2 = 20, R3 = 17; R3 is tagged with ROB entry 2.]

Page 54

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 0)

[Figure: pipeline state in cycle 0 with Memory Unit, Execute 1, Execute 2, reservation stations RS 1 to RS 6, ROB entries 1 to 5 and register file. Registers: R0 = 0, R1 = 100, R2 = 13, R3 = 0, R4 = 0. Operations shown: ld r0,(r1); mul r4,r0,r2; add r3,r3,r4. The ROB is empty.]

Program:
Loop: ld r0 <- (r1)
      mul r4 <- r0, r2
      add r3 <- r3, r4
      add r1 <- r1, 1
      bne loop, r1, 200
      …

Page 55

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 1)

[Figure: pipeline state in cycle 1. Registers: R0 = 0 (tagged with ROB entry 1), R1 = 100, R2 = 13, R3 = 0, R4 = 0. Operations shown: ld r0,(100); mul r4,r0,r2; add r3,r3,r4; add r1,r1,1. ROB contents shown: ld r0,(100).]

Page 56

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 2)

[Figure: pipeline state in cycle 2. Registers: R0 = 0 (tagged with ROB entry 1), R1 = 100, R2 = 13, R3 = 0, R4 = 0 (tagged with ROB entry 2). Operations shown: ld r0,(100); mul r4,rob1,13; add r3,r3,r4; add r1,r1,1; bne loop,r1,200. ROB contents shown: ld r0,(100); mul r4.]

Page 57

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 3)

[Figure: pipeline state in cycle 3. Registers: R0 = 0, R1 = 100, R2 = 13, R3 = 0 (tagged with ROB entry 3), R4 = 0 (tagged with ROB entry 2). Operations shown: ld r0,(100); mul r4,rob1,13; add r3,0,rob2; add r1,r1,1; bne loop,r1,200; ld r0,(r1). ROB contents shown: ld r0,(100); mul r4; add r3.]

Page 58

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 4)

[Figure: pipeline state in cycle 4. Registers: R0 = 0, R1 = 100 (tagged with ROB entry 4), R2 = 13, R3 = 0 (tagged with ROB entry 3), R4 = 0 (tagged with ROB entry 2). Operations shown: ld r0,(100); mul r4,20,13; add r3,0,rob2; add r1,100,1; bne loop,r1,200; ld r0,(r1); mul r4,r0,r2. ROB contents shown: 20 (the loaded value); mul r4; add r3; add r1.]

Page 59

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 5)

[Figure: pipeline state in cycle 5, with the EU bus. Registers: R0 = 20, R1 = 100 (tagged with ROB entry 4), R2 = 13, R3 = 0 (tagged with ROB entry 3), R4 = 0 (tagged with ROB entry 2). Operations shown: mul r4,20,13; add r3,0,rob2; add r1,100,1; bne loop,rob4,200; ld r0,(r1); mul r4,r0,r2; add r3,r3,r4. ROB contents shown: mul r4; add r3; add r1; bne.]

Page 60

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 6)

[Figure: pipeline state in cycle 6. Registers: R0 = 20 (tagged with ROB entry 1), R1 = 100 (tagged with ROB entry 4), R2 = 13, R3 = 0 (tagged with ROB entry 3), R4 = 0 (tagged with ROB entry 2). Operations shown: add r3,0,260; add r1,100,1; bne loop,101,200; ld r0,(101); mul r4,20,13; mul r4,r0,r2; add r3,r3,r4; add r1,r1,1. ROB contents shown: 260; add r3; 101; ld r0; bne.]

Operations are fetched and issued speculatively

Operations from different loop iterations are in the pipeline

The ld-operation is no longer dependent on the branch operation
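The operand forms seen in the figures of this example (such as mul r4,rob1,13 and add r3,0,rob2) result from renaming source registers to ROB tags during issue. The following is a minimal sketch of that step; the function `issue`, the list-based ROB that is never recycled and the string tags "robN" are assumptions made for illustration.

```python
def issue(op, dest, srcs, regs, reg_tag, rob):
    """Allocate a ROB entry for op and return its renamed source operands."""
    operands = []
    for s in srcs:
        if isinstance(s, int):                 # immediate operand
            operands.append(s)
        elif reg_tag[s] == 0:                  # register file holds the value
            operands.append(regs[s])
        else:
            producer = rob[reg_tag[s] - 1]     # ROB entries are numbered from 1
            if producer["valid"]:
                operands.append(producer["res"])    # value already in the ROB
            else:
                operands.append(f"rob{reg_tag[s]}") # wait for this ROB entry
    rob.append({"op": op, "dest": dest, "res": None, "valid": False})
    if dest is not None:
        reg_tag[dest] = len(rob)               # later readers wait for this entry
    return operands

regs = {"r0": 0, "r1": 100, "r2": 13, "r3": 0, "r4": 0}
reg_tag = {r: 0 for r in regs}                 # 0 means: no pending writer
rob = []
print(issue("ld", "r0", ["r1"], regs, reg_tag, rob))         # [100]
print(issue("mul", "r4", ["r0", "r2"], regs, reg_tag, rob))  # ['rob1', 13]
print(issue("add", "r3", ["r3", "r4"], regs, reg_tag, rob))  # [0, 'rob2']
```

The source registers are read before the destination tag is updated, so an operation like add r3 <- r3, r4 still sees the old r3.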

Page 61

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 7)

[Figure: pipeline state in cycle 7. Registers: R0 = 20 (tagged with ROB entry 1), R1 = 100 (tagged with ROB entry 4), R2 = 13, R3 = 0 (tagged with ROB entry 3), R4 = 260 (tagged with ROB entry 2). Operations shown: bne loop,101,200; add r3,0,260; ld r0,(101); mul r4,rob1,13; add r3,r3,r4; add r1,r1,1; bne loop,r1,200. ROB contents shown: add r3; 101; ld r0,(101); mul r4; bne.]

Now speculative execution is possible

Page 62

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 8)

[Figure: pipeline state in cycle 8. Registers: R0 = 20 (tagged with ROB entry 1), R1 = 100 (tagged with ROB entry 4), R2 = 13, R3 = 0 (tagged with ROB entry 3), R4 = 260 (tagged with ROB entry 2). Operations shown: bne loop,101,200; add r3,0,260; ld r0,(101); mul r4,rob1,13; add r3,r3,r4; add r1,r1,1; bne loop,r1,200. ROB contents shown: 260; 101; ld r0,(101); mul r4; bne.]

Page 63

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 9)

[Figure: pipeline state in cycle 9. Registers: R0 = 20 (tagged with ROB entry 1), R1 = 100 (tagged with ROB entry 4), R2 = 13, R3 = 260 (tagged with ROB entry 3), R4 = 260 (tagged with ROB entry 2). Operations shown: ld r0,(101); mul r4,8,13; add r3,260,rob2; add r1,101,1; bne loop,r1,200; ld r0,(r1). ROB contents shown: 101; 8 (the loaded value); mul r4; bne; add r3.]

Page 64

Summary

We have seen the Tomasulo algorithm with speculation

Importance of the reorder buffer

Execution of
− ALU-operations
− Branch-operations
− Memory-operations

But: issue and commit phase are limited to processing a single operation per clock cycle

Page 65

Content

Tomasulo with speculative execution

Introducing superscalarity into the instruction pipeline

Multithreading

Page 66

Superscalar Instruction Pipeline

So far: only the data path is superscalar

Parallel execution of operations in the data path, but CPI < 1 is not possible

Required: superscalar
− Fetch phase
− Issue phase
− WB phase
− Commit phase

[Figure: pipeline with program memory, PC, Memory Unit, Execute 1 to Execute m with their reservation stations, ROB and register file (R0 = 5, R1 = 14, R2 = 20, R3 = 17).]

Page 67

Superscalar Fetch-Phase

Fetch:
− Fetching n operations simultaneously from the code cache/memory
− Requires wider buses

[Figure: Cache/Memory feeds the instruction queue with n operations at a time. Operation buses 1 to n and operand buses A1 to An and B1 to Bn connect the instruction queue and the register file to the reservation stations.]

Page 68

Superscalar Issue-Phase

Issue:
− Issue the first n operations from the instruction queue (n operation buses required)
− n operand buses required for the left operand (A)
− n operand buses required for the right operand (B)

Checking for a free RS and a free ROB entry must be done simultaneously for up to n operations!
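The resource check for a group of up to n operations can be sketched as below. The sequential loop only models the result of the check; hardware would compute it in parallel. The in-order rule (a blocked operation stops the group), the unit names and the function `issue_group` are assumptions of this example.

```python
def issue_group(queue, free_rs, free_rob, n):
    """Return how many operations at the queue head can issue in this cycle.

    queue:    unit name required by each waiting operation, oldest first
    free_rs:  free reservation stations per unit
    free_rob: number of free ROB entries
    """
    free_rs = dict(free_rs)            # work on a copy of the old RS status
    issued = 0
    for unit in queue[:n]:
        if free_rob == 0 or free_rs.get(unit, 0) == 0:
            break                      # assumption: issue is in-order
        free_rs[unit] -= 1             # later operations see this RS as taken
        free_rob -= 1
        issued += 1
    return issued

# Two of three operations issue: the second ALU operation finds no free RS
print(issue_group(["mem", "alu", "alu"], {"mem": 1, "alu": 1}, 5, 3))   # 2
```

The point of the example is that the check for the second operation depends on what the first operation has already claimed, which is what makes the simultaneous check expensive.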

[Figure: Cache/Memory feeds the instruction queue with n operations at a time. Operation buses 1 to n and operand buses A1 to An and B1 to Bn connect the instruction queue and the register file to the reservation stations.]

Page 69

Implementing simultaneous checking in issue-phase

[Figure: issue logic. For a single operation: the issue logic (1 operation) takes the old ROB status and the old RS status and produces the new state for ROB, the new state for RS and the RF control for operand buses A and B. For two operations: the issue logic for the 1st operation and the issue logic for the 2nd operation both take the old ROB status and the old RS status; their outputs are combined into the new state for ROB and the new state for RS, and they produce the RF control for operand buses A1 and B1 and for operand buses A2 and B2.]

Page 70

Superscalar WB-Phase

Every EU has its own result bus Ei
− All EUs may write simultaneously to the ROB

This also makes the bypass for the reservation stations more complex

[Figure: Memory Unit and Execute 1 to Execute m drive the result buses E0 to Em into the ROB. Bypasses lead to the reservation stations, whose inputs are A1 to An, B1 to Bn, E0 to Em and R1 to Rk. RS fields: opc Qj Qk Vj Vk busy type rob stat misc.]

Page 71

Superscalar Commit-Phase

For up to n ROB entries starting at the head:
− Check if the valid bit is set to 1
− Then write their results to the register file

The register file needs n write ports
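A minimal sketch of such a commit phase is shown below. It assumes in-order commit, so the first entry that is not valid stops the group; the list-based ROB and the field names are chosen for illustration only, and clearing the register tags is left out.

```python
def commit(rob, regfile, n):
    """Commit up to n valid entries from the ROB head; return how many."""
    committed = 0
    while committed < n and rob and rob[0]["valid"]:
        entry = rob.pop(0)                       # remove the head entry
        if entry["dest"] is not None:            # e.g. branches write no register
            regfile[entry["dest"]] = entry["res"]
        committed += 1
    return committed

rob = [
    {"dest": "r0", "res": 20, "valid": True},
    {"dest": "r4", "res": 260, "valid": True},
    {"dest": "r3", "res": None, "valid": False},   # not finished yet
    {"dest": "r1", "res": 101, "valid": True},     # must wait behind r3
]
regfile = {"r0": 0, "r1": 100, "r3": 0, "r4": 0}
print(commit(rob, regfile, 4))   # 2
print(regfile)                   # {'r0': 20, 'r1': 100, 'r3': 0, 'r4': 260}
```

Each committed entry needs its own write port in the same cycle, which is where the requirement of n write ports comes from.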

[Figure: Memory Unit and Execute 1 to Execute m drive the result buses E0 to Em into the ROB, which writes to the register file. Bypasses lead to the reservation stations, whose inputs are A1 to An, B1 to Bn, E0 to Em and R1 to Rk. RS fields: opc Qj Qk Vj Vk busy type rob stat misc.]

Page 72

Example PowerPC

Source: PowerPC™ e500 Core Family Reference Manual

Page 73

Limitations for ILP

Memory bandwidth limits the number of simultaneously fetched operations (typically 4 to 6 operations)

HW overhead and delay for:
− Control logic of the issue phase
− Bypasses for reservation stations
− Number of read/write ports in the register file

Branches
− Possible solution: branch prediction

Available parallelism in the application
− Possible solution: multithreading

Page 74

Content

Tomasulo with speculative execution

Introducing superscalarity into the instruction pipeline

Multithreading

Page 75

Motivation for Multithreading

True dependencies prevent the EUs from being used in parallel (horizontal performance loss)

Operations with a very long delay during execution create vertical performance loss
− E.g. a memory access of an operation A in a Pentium 4 (3-way superscalar) can take 380 clock cycles (cache miss)
− I.e. 1140 operations have to bypass Op A in order to fully utilize the EUs during this time
− But: the reorder buffer has only 126 entries
− Hence, 339 execution cycles are wasted
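The numbers above can be reproduced with a few lines of arithmetic. The model behind them is an assumption of this sketch: Op A itself occupies one ROB entry, the remaining entries are filled at 3 operations per cycle, and only full cycles are counted.

```python
def wasted_cycles(issue_width, miss_latency, rob_entries):
    """Return (operations needed, busy cycles, wasted cycles) during one miss."""
    ops_needed = issue_width * miss_latency       # to keep all EUs busy
    ops_behind_a = rob_entries - 1                # Op A holds one ROB entry
    busy_cycles = ops_behind_a // issue_width     # full cycles until the ROB is full
    return ops_needed, busy_cycles, miss_latency - busy_cycles

print(wasted_cycles(issue_width=3, miss_latency=380, rob_entries=126))
# (1140, 41, 339)
```

With these assumptions the ROB is full after 41 cycles, and the remaining 339 cycles of the miss pass without any new operation being issued.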

Solution: multithreading. Execute multiple threads that share the same execution units, but have no dependencies on each other

[Figure: EU usage over time. Left: operations OP 1 to OP n with dependencies leave execution units unused in each cycle. Right: after the long-running OP A has been issued, the operations behind it (OP 4, OP 5, OP 6, ...) fill the remaining ROB entries (124, 123, 122, ...) in only 41 cycles; WB of A happens after 380 cycles.]

Page 76

Process vs. Thread

Each process has its own context
− address space (code, data, heap, stack)
− TLB
− …

Switching between processes takes tens of thousands of clock cycles (context switch)

The OS is involved

Threads share the same context

Switching between two threads only requires changing the values in the architectural registers

[Figure: Process 1 and Process 2 each have their own code section, data section, heap and stack. Thread 1 and Thread 2 in Process 2 share the code section, data section and heap, and each thread has its own stack (Stack 1, Stack 2).]

Page 77

Multithreading

Multithreading: A fixed number of n threads can share the same execution units

Hardware supports fast switching between n threads:
− n copies of some resources, e.g. architectural registers (including PC)
− fixed partitioning of some resources, e.g. RS (or limited sharing)
− shared usage of some resources, e.g. EUs

[Figure: pipeline for two threads with program memory, PC 1 and PC 2, two instruction queues (OP A to OP F), RF 1 and RF 2, ROB 1 and ROB 2, and shared reservation stations, Memory Unit, Execute 1 to Execute m and EU bus.]

Page 78

Multithreading

Types of multithreading:
− no MT
− Coarse-grained MT
− Fine-grained MT
− Simultaneous MT

Page 79

Coarse Grained Multithreading

A single thread runs for many clock cycles before the hardware switches to another thread

Hardware switches between threads only if
− a long-running operation is detected, e.g. a cache miss, or
− a fixed time slice has passed

A processor with n-way MT appears to an operating system like n processors
− The OS schedules n threads of the same process to these processors
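The switching rule can be sketched as a small cycle-by-cycle model. The function name, the per-cycle miss flags and the round-robin choice of the next thread are assumptions of this example.

```python
def coarse_grained_schedule(miss_per_cycle, n_threads, time_slice):
    """Return the id of the thread that runs in each cycle.

    miss_per_cycle: True in cycles where the running thread hits a cache miss
    """
    current, run_length, trace = 0, 0, []
    for miss in miss_per_cycle:
        trace.append(current)
        run_length += 1
        if miss or run_length == time_slice:    # the two switch conditions
            current = (current + 1) % n_threads # assumption: round-robin order
            run_length = 0
    return trace

misses = [False, False, True, False, False, False, False, False]
print(coarse_grained_schedule(misses, n_threads=2, time_slice=4))
# [0, 0, 0, 1, 1, 1, 1, 0]
```

Thread 0 is switched out by the miss in cycle 2, and thread 1 is switched out after its time slice of 4 cycles has passed.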

Page 80

Example

Two threads are scheduled to the processor

Reservation stations and EUs are shared resources

Hardware switches between both PCs and IQs (e.g. by multiplexers)

Fetched operations are tagged with the thread number

Situation: thread 1 is running

[Figure: pipeline for two threads with program memory, PC 1 and PC 2, instruction queues for thread 1 and thread 2, RF 1 and RF 2, ROB 1 and ROB 2, and shared reservation stations and EUs; the thread selector shows 1. Operations shown: OP A.1 to OP F.1 of thread 1 and OP A.2 to OP C.2 of thread 2.]

Page 81

Example

Memory operation D.1 of thread 1 was issued

Thread 1 is still running…

[Figure: same two-thread pipeline; the thread selector shows 1. Operations shown: OP A.1 to OP G.1 of thread 1, with OP D.1 issued, and OP A.2 to OP C.2 of thread 2.]

Page 82

Example

Memory operation is executed and cache miss is detected

Processor has switched to thread 2
− another PC is used
− another instruction queue is used

[Figure: same two-thread pipeline; the thread selector shows 2. Operations shown: OP A.1 to OP H.1 of thread 1 and OP A.2 to OP C.2 of thread 2.]

Page 83

Example

Issued operations of thread 1 are further processed

But issue now takes place from instruction queue 2

[Figure: same two-thread pipeline; the thread selector shows 2. Operations shown: OP A.1 to OP H.1 of thread 1 and OP A.2 to OP D.2 of thread 2, with OP A.2 issued.]

Page 84

Example

Issued operations of thread 1 are further processed

But issue now takes place from instruction queue 2

[Figure: same two-thread pipeline; the thread selector shows 2. Operations shown: OP A.1 to OP H.1 of thread 1 and OP A.2 to OP E.2 of thread 2, with OP A.2 and OP B.2 issued.]

Page 85

Example

Operations of thread 1 are processed further, but not committed, while operations from thread 2 are processed simultaneously

If operation E.1 has a true dependency on D.1, then it blocks the reservation station for operations from thread 2

Balancing between shared resources is important

[Figure: same two-thread pipeline; the thread selector shows 2. Operations shown: OP B.1 to OP H.1 of thread 1 and OP A.2 to OP F.2 of thread 2.]

Page 86

Coarse Grained MT - Limitations

Does not help to overcome the problem of horizontal performance loss (a single thread may not have enough ILP)
− Only right after switching between threads are operations of both threads processed simultaneously

Switching between threads may have a negative impact on the cache hit rate of each thread and thus affects performance negatively

Page 87

Fine-Grained Multithreading

Processor switches in every clock cycle to another thread
− E.g. in a round-robin manner

This helps to overcome horizontal performance loss

A single instruction queue and a single reorder buffer are sufficient (shared)

Operations must be tagged with the corresponding thread number
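Round-robin thread selection can be sketched as follows. Skipping blocked threads is an optional refinement and not part of plain round robin; the function names and the `blocked` flags are assumptions of this example.

```python
def next_thread(last, blocked):
    """Pick the next non-blocked thread after `last` in round-robin order."""
    n = len(blocked)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if not blocked[candidate]:
            return candidate
    return None                        # every thread is blocked

def fetch_order(cycles, blocked):
    """Thread chosen for fetch in each cycle, starting with thread 0."""
    order, last = [], len(blocked) - 1
    for _ in range(cycles):
        pick = next_thread(last, blocked)
        order.append(pick)
        if pick is not None:
            last = pick
    return order

print(fetch_order(4, [False, False]))          # [0, 1, 0, 1]
print(fetch_order(3, [False, True, False]))    # [0, 2, 0]
```

With two runnable threads, each thread gets a fetch slot in every second cycle, which matches the alternating 1, 2, 1, 2 pattern of the thread selector in the example figures.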

Page 88

Example

[Figure: fine-grained MT pipeline with program memory, PC 1 and PC 2, one shared instruction queue, one shared ROB, RF 1 and RF 2, and shared reservation stations and EUs; the thread selector shows 1. Operations shown: OP A.1.]

Page 89

Example

[Figure: same fine-grained MT pipeline; the thread selector shows 2. Operations shown: OP A.1, OP A.2.]

Page 90

Example

[Figure: same fine-grained MT pipeline; the thread selector shows 1. Operations shown: OP A.1, OP A.2, OP B.1.]

Page 91

Example

[Figure: same fine-grained MT pipeline; the thread selector shows 2. Operations shown: OP A.1, OP A.2, OP B.1, OP B.2; OP A.1 has been issued.]

Page 92

Example

[Figure: same fine-grained MT pipeline; the thread selector shows 1. Operations shown: OP A.1, OP A.2, OP B.1, OP B.2, OP C.1; OP A.1 and OP A.2 have been issued.]

Page 93

Example

[Figure: same fine-grained MT pipeline; the thread selector shows 2. Operations shown: OP A.1, OP A.2, OP B.1, OP B.2, OP C.1, OP C.2; OP A.1, OP A.2 and OP B.1 have been issued.]

Page 94

Example

[Figure: same fine-grained MT pipeline; the thread selector shows 1. Operations shown: OP A.1, OP A.2, OP B.1, OP B.2, OP C.1, OP C.2, OP D.1; OP A.1, OP B.1 and OP B.2 appear a second time further down the pipeline.]

Page 95

Example

[Figure: same fine-grained MT pipeline; the thread selector shows 2. Operations shown: OP A.1, OP A.2, OP B.1, OP B.2, OP C.1, OP C.2, OP D.1, OP D.2; OP B.1, OP B.2 and OP C.1 appear a second time further down the pipeline.]

Page 96

Example

[Figure: same fine-grained MT pipeline; the thread selector shows 1. Operations shown: OP A.2, OP B.1, OP B.2, OP C.1, OP C.2, OP D.1, OP D.2; OP A.1 is no longer shown.]

Page 97

Fine-Grained Multithreading - Limitations

Vertical performance loss cannot be avoided
− A long-running operation prevents other operations of the same thread from being executed
− Due to the shared IQ and ROB, the other thread is also blocked after a while
  • Improvement: stop fetching for a blocked thread

Performance of a single thread is reduced (even if there are no operations from a second blocked thread), because issue takes place only in every second cycle

MT reduces cache performance

Page 98

Simultaneous Multithreading

Mixing coarse- and fine-grained MT

In every clock cycle, operations from n threads are fetched and issued (Intel calls this Hyper-Threading)

Operations must be tagged with the corresponding thread number

Solves the problem of having either horizontal or vertical performance loss:
− If neither thread is blocked, then the available ILP is utilized, and horizontal performance loss is avoided
− If one thread is blocked, then the other thread still uses the resources, and vertical performance loss is avoided (but not the horizontal one)

Even if one thread is blocked, the other one can run at full speed (issue in every clock cycle)
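How issue slots might be shared between threads in one cycle can be illustrated with a small model. The slot-by-slot round-robin policy and the function `smt_issue` are assumptions of this sketch and not a description of a particular processor.

```python
def smt_issue(ready, width):
    """Distribute `width` issue slots over threads for one cycle.

    ready: number of ready operations per thread in this cycle
    Returns the number of operations issued per thread.
    """
    issued = [0] * len(ready)
    slots = width
    progress = True
    while slots > 0 and progress:
        progress = False
        for t in range(len(ready)):    # hand out one slot per thread per round
            if slots > 0 and issued[t] < ready[t]:
                issued[t] += 1
                slots -= 1
                progress = True
    return issued

print(smt_issue([2, 1], width=4))   # [2, 1]
print(smt_issue([0, 3], width=4))   # [0, 3]
print(smt_issue([3, 3], width=4))   # [2, 2]
```

In the first call, thread 1 fills a slot that thread 0 alone would leave empty (less horizontal loss). In the second call, thread 0 is blocked and thread 1 keeps issuing (no vertical loss, although one slot stays unused).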

Page 99

Example

Fetch and issue takes place simultaneously for both threads

Each thread has its own IQ, RF, PC, ROB

Reservation Stations are partitioned

[Figure: SMT pipeline with program memory, PC 1 and PC 2, instruction queue 1 and instruction queue 2, RF 1 and RF 2, ROB 1 and ROB 2, and shared EUs; the reservation stations are marked as used for thread 1 or used for thread 2. Operations shown: OP A to OP F.]

Page 100

Example

Both threads are executed...

Avoids horizontal performance loss

[Figure: same SMT pipeline. Operations shown: OP A to OP H; OP A and OP B have been issued.]

Page 101

Example

Both threads are executed...

Avoids horizontal performance loss

[Figure: same SMT pipeline. Operations shown: OP A to OP J; OP A and OP B appear a second time further down the pipeline.]

Page 102

Example

Both threads are executed...

Avoids horizontal performance loss

But now the long-running operation E is issued

[Figure: same SMT pipeline. Operations shown: OP A to OP L; OP C and OP D appear a second time further down the pipeline.]

Page 103

Example

Assume G is true dependent on E

[Figure: same SMT pipeline. Operations shown: OP E to OP N; results shown: Res A, Res B, Res C, Res D.]

Page 104

Example

Assume G is true dependent on E; and I, too

Then, thread 1 is now blocked…

[Figure: same SMT pipeline. Operations shown: OP E and OP G to OP P; results shown: Res A, Res B, Res C, Res F.]

Page 105

Example

… but thread 2 can continue

[Figure: same SMT pipeline. Operations shown: OP E and OP G to OP Q; results shown: Res A, Res B, Res C.]

Page 106

Summary - Multithreading

Allows filling the pipeline with operations from different threads
− no data dependencies between operations from different threads
− allows for higher resource utilization

Coarse-grained MT suffers from horizontal performance loss

Fine-grained MT suffers from vertical performance loss

SMT solves these problems
− Improvement: balancing between partitioned resources

All MT approaches have an impact on cache performance

In particular, fine-grained MT can also be used in statically scheduled processor pipelines to avoid hazards
− In a pipeline with n pipeline stages, operations from n threads are issued
− No data or control hazards occur, because the operations in the pipeline have no dependencies
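The last point can be checked with a small occupancy model. It assumes that one operation is issued per cycle, that thread (c mod n) issues in cycle c and that each operation advances one stage per cycle; the function name is hypothetical.

```python
def pipeline_occupancy(n_stages, cycles):
    """Thread id held by each pipeline stage in each cycle (None = empty)."""
    rows = []
    for c in range(cycles):
        # Stage s holds the operation that was issued s cycles earlier
        row = [(c - s) % n_stages if c - s >= 0 else None
               for s in range(n_stages)]
        rows.append(row)
    return rows

for row in pipeline_occupancy(n_stages=4, cycles=6):
    print(row)
```

Once the pipeline is full, every row contains n different thread ids. An operation has therefore left the pipeline before the next operation of the same thread enters it, so no forwarding or stall logic is needed between operations in flight.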