processor architecture - uni-potsdam.de superscalarity into the instruction pipeline ... (see table...

Processor Architecture

Advanced Dynamic Scheduling Techniques

M. Schölzel

Content

Tomasulo with speculative execution

Introducing superscalarity into the instruction pipeline

Multithreading

Control Flow Dependencies

Let b be a conditional branch at address a with branch target z.

An operation c is control flow dependent on b if the execution of c depends on the outcome of the branch b.

Otherwise c is not control flow dependent on b.

Examples:

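[Figure: three example control flow graphs, each with a branch at address a, the fall-through address a+1, and the branch target z]

Example 1: c is not control flow dependent on b.

Example 2: c is control flow dependent on b1 and b2.

Example 3: c is not control flow dependent on b. What about d?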

Scheduling restrictions imposed by control flow dependencies

… for control flow dependent operations:
− cannot be moved before the branch (this would be a speculative execution of c)

… for operations that are not control flow dependent:
− cannot be moved behind a branch (c may then not be executed)

[Figure: two pairs of schedules of b and c, each showing the program order and the schedule after moving c across b]

Performance Problem due to Control Hazards

Problem:
− The branch target of an operation is only known after execution
− Long pipeline stalls are required in processors with deep pipelines

Solution:
− Branch prediction helps, but is limited
  • Tomasulo supports speculative fetch and issue, but not speculative execution of operations

[Figure: memory and PC feed the instruction queue, which leads to the reservation stations; the queue holds a branch operation b, so the address for the next instruction fetch is not known]

Drawbacks of Speculative Execution

What happens if an operation is executed speculatively and the speculation was wrong?
− May affect the data flow
− May affect the exception behavior

[Figure: control flow graph of a program with branch b and block c, marking the block to be executed and the speculatively executed block, next to the program as executed after dynamic scheduling]

Example

Affected data flow: c is executed speculatively before b
− a: add r1 <- r2,r3
− c: sub r1 <- r4,r5
− The other successor of b does not write to r1
− Consumer: mul r0 <- r1,r6
− The mul-operation now receives the value in r1 from the sub-operation instead of from the add-operation

Affected exception behavior: c is executed speculatively before b
− b: if r5 = 0 then x else y
− c: div r1 <- r4,r5
− x: no division is executed
− Division by 0 becomes possible

Solution

Divide the WB-phase of the Tomasulo algorithm into two phases:
− Forwarding results from the EUs to the RSs (WB-phase)
− Writing results into architectural registers/memory (Commit-phase)

Implemented by:
− A reorder buffer for buffering results from the WB-phase
− Committing buffered results from the reorder buffer in order

By this:
− Speculative results can be used without modifying architectural registers/memory locations

Architecture for Tomasulo with speculative Execution

[Figure: block diagram with program memory and PC, instruction queue, reservation stations (RS) in front of the memory unit and the execution units Execute 1 … Execute m, architecture registers Reg 0 … Reg r, and the reorder buffer, connected by the operation bus, operand buses A and B, the EU-buses, and the result bus]

Reorder Buffer (ROB)

Implemented as a queue:
− When issuing an operation, an entry is reserved
− During WB, the result is written back to the reserved entry
− Commit is done in order and writes results back to the architectural registers
  • Speculatively executed operations are committed after preceding branches have been committed

ROB-entries now have the meaning of virtual registers

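[Figure: ROB as a queue of entries 1 … n with first and last pointers and a busy flag; a demultiplexer writes values from the result bus into the entry reserved by the RS, a multiplexer forwards the first entry to the architectural registers, and bypasses lead to the RS and to the issue phase]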

Fields of the ROB

Structure of a ROB-entry: res, addr, type, valid, busy

The meaning of the fields depends on the operation type.

Operation types:
− Branch operation
− Memory operation
− ALU operation

Field meanings:
− res
  • Branch operation: computed target address (will be stored in the PC)
  • Memory operation: value to be stored in the memory
  • ALU operation: result of the operation
− addr
  • Branch operation: c = speculation was correct, w = speculation was wrong
  • Memory operation: address at which the res-value should be stored
  • ALU operation: -
− type: 3 = branch operation, 2 = memory operation, 1 = ALU operation
− busy: entry is reserved
− valid: 0 = result has not been computed yet, 1 = result was computed and is available in the res-field
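The queue behavior of the ROB can be illustrated with a small Python sketch. It is only an illustration under assumptions: the class and method names (`RobEntry`, `ReorderBuffer`, `reserve`, `write_back`, `commit`) are hypothetical, entries are numbered 1..n so that 0 can mean "no ROB-entry", and the field names follow the slide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RobEntry:
    res: Optional[int] = None      # result, store value, or branch target
    addr: Optional[object] = None  # store address, or 'c'/'w' for branches
    type: Optional[int] = None     # 1 = ALU, 2 = memory, 3 = branch
    valid: int = 0                 # 1 once the result is available in res
    busy: int = 0                  # 1 while the entry is reserved


class ReorderBuffer:
    """Circular queue of ROB-entries numbered 1..n (hypothetical model)."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.entries = [RobEntry() for _ in range(n + 1)]  # index 0 unused
        self.first = 1  # oldest entry, next to commit
        self.last = 1   # next entry to reserve

    def _next(self, i: int) -> int:
        return i % self.n + 1

    def reserve(self, op_type: int) -> Optional[int]:
        """Issue: reserve the entry at 'last'; None if the ROB is full."""
        if self.entries[self.last].busy:
            return None
        idx = self.last
        self.entries[idx] = RobEntry(type=op_type, busy=1)
        self.last = self._next(idx)
        return idx

    def write_back(self, idx: int, res: int, addr: object = None) -> None:
        """WB: store the result in the reserved entry (may be out of order)."""
        entry = self.entries[idx]
        entry.res, entry.addr, entry.valid = res, addr, 1

    def commit(self) -> Optional[Tuple[int, RobEntry]]:
        """Commit: release the oldest entry, but only if its result is valid."""
        entry = self.entries[self.first]
        if not (entry.busy and entry.valid):
            return None
        idx = self.first
        self.entries[idx] = RobEntry()
        self.first = self._next(idx)
        return idx, entry
```

Write-back may happen in any order, while `commit` only ever releases the oldest entry, which is what keeps the architectural state in program order.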

Reservation Station Fields

The RS has the same functionality as in ordinary Tomasulo:
− Buffers operations
− Buffers operands

But ROB-entries are used for determining the operand source (virtual register).

Fields of a reservation station:
− opc: operation to be executed (e.g. add, sub, mul, …)
− Qj = x, if ROB-entry x will store the value for operand A, otherwise 0
− Qk = x, if ROB-entry x will store the value for operand B, otherwise 0
− Vj: value for operand A
− Vk: value for operand B
− misc: miscellaneous
− type: type of operation (see the table of ROB fields)
− rob: reserved ROB-entry
− stat: status in the pipeline (RO, EX, WB)
− busy: RS is occupied/free

[Figure: reservation station connected to the operation bus, operand buses A and B, and the EU-bus]

Register File Extensions

Mapping of architecture registers to virtual registers (ROB-entries)

Each architecture register n also stores the ROB-entry of the latest operation that is computing the value for n (register renaming).

Example: the rob-fields of Reg 0, Reg 1, Reg 2, …, Reg r are 5, 0, 1, …, 0
− ROB-entry 5 contains the result of the latest operation with destination register 0
− Register 1 is not computed by any operation in the pipeline
− ROB-entry 1 contains the result of the latest operation with destination register 2

[Figure: register file with one rob-field per register, connected to operand buses A and B and to the result bus]

Overview Pipeline Phases

Issue
− Schedule an operation from the instruction queue to an RS
− Read operand values or rename registers (solving WAR- and WAW-hazards)
− Reserve a ROB-entry
− Issue is in-order

Execute
− Wait for operands to be ready
− Execute the operation as soon as operands are ready and an EU is available (solving RAW-hazards)
− Execute is out-of-order

Write-Back
− Write the result through the result bus into the reserved ROB-entry
− WB is out-of-order

Commit
− Write results from the ROB in order into destination registers/memory
− Commit is in-order

Overview Pipeline Phases (Issue)

Issue an operation from the instruction queue to an RS, if:
− an RS is free and
− the ROB is not full

Otherwise: stall the issue stage

Allocate an RS- and a ROB-entry (the ROB-entry corresponds to a virtual register)

Read operands, if
− present in the register file, or
− present in the ROB

[Figure: architecture diagram; Op A moves from the instruction queue into an RS, and a ROB-entry is reserved for Op A]

Overview Pipeline Phases (Execute)

The operation waits in its RS for operands and a free EU

The operation is executed as soon as all operands are available and an EU is free

The RS can store the state of the operation during execution

[Figure: architecture diagram; Op A moves from its RS into the execution unit]


Overview Pipeline Phases (Write-Back)

Write the result into the reserved ROB-entry

The ROB-entry ID has been stored in the rob-field of the RS

The result is forwarded to all waiting RS through the result bus (the value is identified by its ROB ID)

Free the RS

[Figure: architecture diagram; the result of Op A is written into the ROB-entry reserved for Op A]

Overview Pipeline Phases (Commit)

Write the result from the first entry in the ROB into the corresponding destination register

Free the ROB-entry

[Figure: architecture diagram; the result moves from the ROB into the register file]

Issue-Phase – Details (for ALU-operations)

For the operation that will be issued, let:
− opc … operation type (add, sub, mul, …)
− src1, src2 … source registers
− dst … destination register

The operation can be issued, if
− there exists an x with RS[x].busy = 0 and
− ROB[last].busy = 0

Update after issue
− if Reg[src1].rob = 0 then                 // determine value of left operand
    RS[x].Qj := 0; RS[x].Vj := Reg[src1]     // read left operand from the register file
  else                                      // left operand is still under computation or in the ROB
    if ROB[Reg[src1].rob].valid = 1 then
      RS[x].Qj := 0; RS[x].Vj := ROB[Reg[src1].rob].res   // read operand from the ROB
    else
      RS[x].Qj := Reg[src1].rob             // wait for operand in the RS
    fi
  fi
− if Reg[src2].rob = 0 then …               // the same for the right operand
− RS[x].busy := 1; RS[x].rob := last
  RS[x].opc := opc; RS[x].type := 1; RS[x].status := RO
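The issue rules for ALU operations can be sketched in Python as follows. This is a hedged sketch, not the lecture's implementation: the function names `read_operand` and `issue_alu` and the dictionary layout are hypothetical. Renaming the destination register and marking the ROB-entry busy are not part of the pseudocode; they are taken from the state shown in the example figures.

```python
def read_operand(src, reg_val, reg_rob, rob):
    """Return (Q, V) for one source register, following the issue rules."""
    tag = reg_rob[src]
    if tag == 0:                    # value is in the register file
        return 0, reg_val[src]
    if rob[tag]["valid"] == 1:      # value is already in the ROB
        return 0, rob[tag]["res"]
    return tag, None                # wait for ROB-entry 'tag'


def issue_alu(opc, dst, src1, src2, reg_val, reg_rob, rob, last):
    """Fill a free RS for an ALU operation and reserve ROB-entry 'last'."""
    # Operands are read before dst is renamed (matters if dst is also a source).
    qj, vj = read_operand(src1, reg_val, reg_rob, rob)
    qk, vk = read_operand(src2, reg_val, reg_rob, rob)
    rs = {"opc": opc, "Qj": qj, "Qk": qk, "Vj": vj, "Vk": vk,
          "type": 1, "rob": last, "status": "RO", "busy": 1}
    # Assumption taken from the example figures: reserve the ROB-entry
    # and rename the destination register to it.
    rob[last] = {"res": None, "addr": None, "type": 1, "valid": 0, "busy": 1}
    reg_rob[dst] = last
    return rs
```

With the register and ROB contents of Example 1, issuing `add r0 <- r1, r2` yields Vj = 14 from the register file and Vk = 56 from ROB-entry 4.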

Issue-Phase – Details (Example 1)

Situation:
− Op A can be issued
− The value for r1 is taken from the register file
− The value for r2 is taken from the ROB

Update:
− if Reg[srcy].rob = 0 then
    RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy]
  else if ROB[Reg[srcy].rob].valid = 1 then
    RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res
  else
    RS[x].Qj/k := Reg[srcy].rob
  fi

Program:
  add r0 <- r1, r2   // Op A
  sub r3 <- r0, r1   // Op B
  …

State shown in the figure:
− Instruction queue: Op A
− Registers (value, rob): R0: 5, 0; R1: 14, 0; R2: 89, 4; R3: 17, 0
− ROB-entry 4: [56,-,1,1,1]
− ROB-entry 5: [-,-,-,0,0]


Situation:
− Op A was issued and ROB-entry 5 was allocated

The update after issue follows the same rule as in the previous step.

State shown in the figure:
− RS fields: [opc,Qj,Qk,Vj,Vk,misc,type,rob,stat,busy]
− RS of Execute 1: [add,0,0,14,56,-,1,5,RO,1]
− Registers (value, rob): R0: 5, 5; R1: 14, 0; R2: 89, 4; R3: 17, 0
− ROB-entry 4: [56,-,1,1,1]
− ROB-entry 5: [-,-,1,0,1]


Situation:
− Issue of Op A
− The value of r1 is read from the register file
− The value of r2 is still being computed by the ld-operation (ROB-entry 4)

Update after issue:
− if Reg[srcy].rob = 0 then
    RS[x].Qj/k := 0; RS[x].Vj/k := Reg[srcy]
  else if ROB[Reg[srcy].rob].valid = 1 then
    RS[x].Qj/k := 0; RS[x].Vj/k := ROB[Reg[srcy].rob].res
  else
    RS[x].Qj/k := Reg[srcy].rob
  fi

State shown in the figure:
− Instruction queue: Op A
− RS of the memory unit: [ld,0,0,100,-,-,2,4,RO,1]
− Registers (value, rob): R0: 5, 0; R1: 14, 0; R2: 89, 4; R3: 17, 0
− ROB-entry 4: [-,-,1,0,1]
− ROB-entry 5: [-,-,-,0,0]


Situation:
− Op A was issued
− Uses ROB-entry 5
− Has to wait for the value from r

The update after issue follows the same rule as in the previous step.

State shown in the figure:
− RS of the memory unit: [ld,0,0,100,-,-,2,4,RO,1]
− RS of Execute 1: [add,0,4,14,-,-,1,5,RO,1]
− Registers (value, rob): R0: 5, 0; R1: 14, 0; R2: 89, 4; R3: 17, 0
− ROB-entry 4: [-,-,1,0,1]
− ROB-entry 5: [-,-,1,0,1]

Execute – Details

Executing an operation from an RS is possible, if
− RS[x].status = RO and
− RS[x].Qj = 0 and RS[x].Qk = 0

Update after start of execution:
− Perform the computation with RS[x].Vj and RS[x].Vk
− RS[x].status := EX

Update after end of execution:
− RS[x].Vj := res      // store the result temporarily in the reservation station
− RS[x].status := WB
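A minimal Python sketch of these execute rules, assuming a reservation station is modeled as a dictionary and the execution unit finishes within one call. The names `can_execute`, `execute`, and the `OPS` table are hypothetical.

```python
import operator

# Hypothetical table of ALU operations supported by the execution unit.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def can_execute(rs):
    """An RS may start execution once it is in state RO and both Q-fields are 0."""
    return (rs["busy"] == 1 and rs["status"] == "RO"
            and rs["Qj"] == 0 and rs["Qk"] == 0)


def execute(rs):
    """Run the operation; the result is kept in Vj until write-back."""
    if not can_execute(rs):
        return False
    rs["status"] = "EX"
    res = OPS[rs["opc"]](rs["Vj"], rs["Vk"])
    rs["Vj"] = res          # store the result temporarily in the RS
    rs["status"] = "WB"
    return True
```

An RS whose Qk still names a ROB-entry stays in state RO until the write-back of that entry clears the field.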

Execute-Phase – Details (Example 3)

Both operands are ready:
− RS[x].status = RO and
− RS[x].Qj = 0 and RS[x].Qk = 0

State shown in the figure:
− Instruction queue: Op A
− RS of the memory unit: [ld,0,0,100,-,-,2,4,RO,1]
− rob-fields of Reg 0, Reg 1, Reg 2, …, Reg r: 0, 0, 4, …, 0
− ROB-entry 4: [-,-,1,0,1]




The operation is executed:
− RS[x].status = EX

State shown in the figure:
− Instruction queue: Op A
− RS of the memory unit: [ld,0,0,100,-,-,2,4,EX,1]
− rob-fields of Reg 0, Reg 1, Reg 2, …, Reg r: 0, 0, 4, …, 0
− ROB-entry 4: [-,-,1,0,1]




The result is computed:
− The result is stored temporarily in the field Vj
− RS[x].status = WB

The result is ready for WB

State shown in the figure:
− Instruction queue: Op A
− RS of the memory unit: [ld,0,0,89,-,-,2,4,WB,1]
− rob-fields of Reg 0, Reg 1, Reg 2, …, Reg r: 0, 0, 4, …, 0
− ROB-entry 4: [-,-,1,0,1]



Write-Back – Details (ALU-Operation)

Write-back of the result res from RS x is possible, if
− RS[x].status = WB and
− the result bus is available

Update after WB:
− ROB[RS[x].rob].res := RS[x].Vj      // write result to the allocated ROB-entry
− RS[x].busy := 0                     // free the RS
− ROB[RS[x].rob].valid := 1           // declare the ROB-entry as valid
− for all reservation stations y ≠ x:  // forwarding of the result
    if RS[y].Qj = RS[x].rob then RS[y].Vj := RS[x].Vj; RS[y].Qj := 0
    if RS[y].Qk = RS[x].rob then RS[y].Vk := RS[x].Vj; RS[y].Qk := 0
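The write-back step, including forwarding over the result bus, might look as follows in Python. This is a sketch under assumptions: reservation stations and ROB-entries are plain dictionaries, the result bus is taken to be free, and the function name `write_back` is hypothetical.

```python
def write_back(x, stations, rob):
    """Write the result of RS x into its ROB-entry and forward it to waiting RS."""
    src = stations[x]
    if src["busy"] != 1 or src["status"] != "WB":
        return False
    tag, res = src["rob"], src["Vj"]   # the result was parked in Vj after execute
    rob[tag]["res"] = res              # write result to the allocated ROB-entry
    rob[tag]["valid"] = 1              # declare the ROB-entry as valid
    src["busy"] = 0                    # free the RS
    for y, rs in enumerate(stations):  # forwarding: match on the ROB-entry ID
        if y == x or rs["busy"] != 1:
            continue
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = res, 0
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = res, 0
    return True
```

Waiting stations compare their Q-fields against the ROB-entry ID on the bus, not against a register number, which is why renaming to ROB-entries is sufficient to find the right producer.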

Write-Back-Phase – Details (Example 4)

Situation:
− The result of the ld-operation is written back
− The result bus contains:
  • ROB-entry ID, e.g. 4
  • ROB-value, e.g. 89
− The add-operation waits for the right-hand operand

State shown in the figure:
− Instruction queue: Op B
− RS of the memory unit: [ld,0,0,20,-,-,2,4,WB,1]
− RS of Execute 1: [add,0,4,14,-,-,1,5,RO,1]
− Result bus: (4,89)
− Registers (value, rob): R0: 5, 5; R1: 14, 0; R2: 89, 4; R3: 17, 0
− ROB-entry 4: [-,-,1,0,1]
− ROB-entry 5: [-,-,1,0,1]

Write-Back-Phase – Details (Example 4)

Situation:
− The result was stored in ROB-entry 4
− The RS containing the add-operation has also stored the result
− The RS of the ld-operation was freed

State shown in the figure:
− Instruction queue: Op B
− RS of Execute 1: [add,0,0,14,20,-,1,5,RO,1]
− Registers (value, rob): R0: 5, 5; R1: 14, 0; R2: 89, 4; R3: 17, 0
− ROB-entry 4: [20,-,1,1,1]
− ROB-entry 5: [-,-,1,0,1]

Commit– Details (ALU-Operation)

It must be checked:
− ROB[first].valid = 1

Update by commit (head denotes the first ROB-entry):
− for all architectural registers r with Reg[r].rob = head do
    Reg[r] := ROB[head].res
    Reg[r].rob := 0
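The commit rule for ALU operations can be sketched as below. The dictionary model and the name `commit_alu` are hypothetical; clearing the valid and busy flags to free the entry is an assumption taken from the state shown in Example 5, not from the pseudocode.

```python
def commit_alu(rob, head, reg_val, reg_rob):
    """Commit the ROB-entry at 'head' if its result is valid."""
    entry = rob[head]
    if entry["valid"] != 1:
        return False                    # oldest operation has not finished yet
    for r in reg_rob:
        if reg_rob[r] == head:          # r is still renamed to this entry
            reg_val[r] = entry["res"]
            reg_rob[r] = 0
    # Assumption from the example figure: the entry is freed after commit.
    entry["valid"] = 0
    entry["busy"] = 0
    return True
```

A register that has been renamed again by a younger operation no longer points to `head`, so the loop leaves it untouched and the younger result wins later.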

Commit-Phase – Details (Example 5)

Situation:
− Let head = 4 be the ROB-head
− R2 waits for the result from ROB-entry 4

State shown in the figure:
− Instruction queue: Op B
− RS of Execute 1: [add,0,0,56,89,-,1,5,RO,1]
− Result bus: (4,89)
− Registers (value, rob): R0: 5, 5; R1: 14, 0; R2: 89, 4; R3: 17, 0
− ROB-entry 4: [20,-,1,1,1]
− ROB-entry 5: [-,-,1,0,1]

Commit-Phase – Details (Example 5)

Situation:
− R2 has received the result from the ROB

State shown in the figure:
− Instruction queue: Op B
− RS of Execute 1: [add,0,0,56,89,-,1,5,RO,1]
− Result bus: (4,89)
− Registers (value, rob): R0: 5, 5; R1: 14, 0; R2: 20, 0; R3: 17, 0
− ROB-entry 4: [20,-,1,0,0]
− ROB-entry 5: [-,-,1,0,1]

Executing Branch Operations

Issue:
− Vk-field: stores the branch target z
− misc-field: remembers the address a of the branch operation
− misc-field: also remembers which address ('z' or 'a') was predicted

Execute:
− The computed target address is stored in the Vk-field of the RS:
  • Vk := z, if the branch is taken
  • Vk := a+1, if the branch is not taken
− The misc-field stores whether or not the prediction was correct ('c' = correct; 'w' = wrong)

Write-Back:
− The res-field of the ROB-entry receives the branch target (Vk-field of the RS)
− The addr-field receives the value of the misc-field from the RS: 'c' or 'w'

Commit:
− If addr-field = c, nothing must be done (operations were fetched from the correct address)
− If addr-field = w, then copy the res-field into the PC and flush the whole pipeline:
  • All subsequent ROB-entries
  • All RS-entries
  • The instruction queue
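The execute and commit steps of a branch can be sketched in Python as follows. Everything here is illustrative: `execute_bz` assumes a branch-if-zero operation like the bz used in the later examples, `commit_branch` models the pipeline as a dictionary of lists, and both function names are hypothetical.

```python
def execute_bz(cond_value, a, z, predicted):
    """Execute a branch-if-zero at address a with target z.

    Returns (next address, 'c' or 'w'), i.e. the values that end up in the
    res-field and addr-field of the ROB-entry.
    """
    target = z if cond_value == 0 else a + 1
    return target, ("c" if target == predicted else "w")


def commit_branch(rob_entry, pipeline):
    """Commit a branch; on a wrong prediction, redirect the PC and flush."""
    if rob_entry["addr"] == "c":
        return False                      # nothing to do
    pipeline["pc"] = rob_entry["res"]     # correct target address
    pipeline["rob"].clear()               # all subsequent ROB-entries
    pipeline["rs"].clear()                # all RS-entries
    pipeline["iq"].clear()                # instruction queue
    return True
```

Because the flush happens at commit, none of the discarded operations has written an architectural register or memory, so no further repair is needed.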

Branch – Details (Example 6 – correct prediction)

Situation:
− The branch operation was issued to RS2
− The branch depends on the ld-operation
− Op A, Op B, … will be executed speculatively

Program:
  111: ld r2 <- (200)
  112: bz r2, #123
  113: add r1 <- r1, r0   // Op A
  114: sub r3 <- r0, r1   // Op B
  115: …

State shown in the figure:
− Instruction queue: Op A, Op B, Op C
− RS of the memory unit: [ld,0,0,-,200,-,2,1,EX,1]
− RS of Execute 1: [bz,1,0,-,123,112a,3,2,RO,1]
− Register values: R0: 5, R1: 14, R2: 89; rob-fields of R0 to R3: 0, 0, 1, 0
− ROB: 1: [-,-,2,0,1]; 2: [-,-,3,0,1]; 3: [-,-,-,0,0]; 4: [-,-,-,0,0]; 5: [-,-,-,0,0]


Situation:
− Op A is executed speculatively
− Op B waits for the result of Op A

State shown in the figure:
− Instruction queue: Op C, Op D, Op E
− RS of the memory unit: [ld,0,0,-,200,-,2,1,EX,1]
− RS of Execute 1: [bz,1,0,-,123,112a,3,2,RO,1]
− Register values: R0: 5, R1: 14, R2: 89; rob-fields of R0 to R3: 0, 3, 1, 4
− ROB: 1: [-,-,2,0,1]; 2: [-,-,3,0,1]; 3: [-,-,1,0,1]; 4: [-,-,1,0,1]; 5: [-,-,-,0,0]



Situation:
− Op A wrote its result to the ROB, but not to R1
− Op B is executed speculatively

State shown in the figure:
− Instruction queue: Op C, Op D, Op E
− RS of the memory unit: [ld,0,0,-,200,-,2,1,EX,1]
− RS of Execute 1: [bz,1,0,-,123,112a,3,2,RO,1]
− Register values: R0: 5, R1: 14, R2: 89; rob-fields of R0 to R3: 0, 3, 1, 4
− ROB: 1: [-,-,2,0,1]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-,-,1,0,1]; 5: [-,-,-,0,0]



Situation:
− Op B wrote its result to the ROB
− The ld-operation wrote its result to the ROB
− bz can be executed

State shown in the figure:
− Instruction queue: Op C, Op D, Op E
− RS of Execute 1: [bz,0,0,6,123,112a,3,2,RO,1]
− Register values: R0: 5, R1: 14, R2: 89; rob-fields of R0 to R3: 0, 3, 1, 4
− ROB: 1: [6,-,2,1,1]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]



Situation:
− bz is executed: the branch is not taken
− Commit for the ld-operation is done

State shown in the figure:
− Instruction queue: Op C, Op D, Op E
− RS of Execute 1: [bz,0,0,6,123,112a,3,2,EX,1]
− Register values: R0: 5, R1: 14, R2: 6; rob-fields of R0 to R3: 0, 3, 0, 4
− ROB: 1: [-,-,-,0,0]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]



Situation:
− bz was executed
− WB for bz was done
− The prediction was correct (ROB[2].addr := c)

State shown in the figure:
− Instruction queue: Op C, Op D, Op E
− Register values: R0: 5, R1: 14, R2: 6; rob-fields of R0 to R3: 0, 3, 0, 4
− ROB: 1: [-,-,-,0,0]; 2: [113,c,3,1,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]



Situation:
− Commit of the branch operation does not require any action, because the prediction was correct
− Now commit can be done for the speculatively executed operations A and B

State shown in the figure:
− Instruction queue: Op C, Op D, Op E
− Register values: R0: 5, R1: 14, R2: 6; rob-fields of R0 to R3: 0, 3, 0, 4
− ROB: 1: [-,-,-,0,0]; 2: [-,-,-,0,0]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]


Branch – Details (Example 7 – wrong prediction)

Situation:
− Same situation as in Example 6
− But the ld-operation has stored 0 in R2
− I.e., the branch is taken

State shown in the figure:
− Instruction queue: Op C, Op D, Op E, Op F
− RS of Execute 1: [bz,0,0,0,123,112a,3,2,EX,1]
− Register values: R0: 5, R1: 14, R2: 0; rob-fields of R0 to R3: 0, 3, 0, 4
− ROB: 1: [-,-,-,0,0]; 2: [-,-,3,0,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]



Situation:
− The bz-operation was executed
− The prediction was wrong (ROB[2].addr := w)
− The correct target can be found in the res-field

State shown in the figure:
− ROB: 1: [-,-,-,0,0]; 2: [123,w,3,1,1]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− rob-fields of R0 to R3: 0, 3, 0, 4



Situation:
− Commit of the branch moves the correct address into the PC
− The pipeline is flushed

State shown in the figure:
− PC: 123
− ROB: 1: [-,-,-,0,0]; 2: [-,-,-,0,0]; 3: [19,-,1,1,1]; 4: [-14,-,1,1,1]; 5: [-,-,-,0,0]
− rob-fields of R0 to R3: 0, 3, 0, 4


Executing Memory Operations

For out-of-order execution of memory operations the following holds:
− The ordering of load-operations among each other does not matter
− The ordering of load- and store-operations as well as of store- and store-operations must be maintained

Example:
  ld r2 <- (r0)
  ld r1 <- (r4)
  st r4 -> (r0)
  ld r5 <- (r6)
  st r7 -> (r8)

Strategy:
− Writing to memory takes place during the commit-phase (in-order)
− Reading from memory takes place during the execute-phase (out-of-order)
  • But only if the valid-field of all preceding write-operations in the ROB is 1
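The strategy for loads can be sketched in Python as follows. It is a sketch under assumptions: `older_stores` is a hypothetical list holding the ROB-entries of all store operations that precede the load in program order (oldest first), and the function name `load_source` is made up. Reading the value from the ROB when a preceding store has a matching address follows the later load example.

```python
def load_source(older_stores, load_addr):
    """Decide where a load may take its value from.

    Returns ('wait', None), ('rob', value) or ('memory', None).
    """
    # A load may only execute once every preceding store has valid = 1,
    # i.e. its value and address are known in the ROB.
    if any(st["valid"] == 0 for st in older_stores):
        return "wait", None
    # The youngest preceding store to the same address supplies the value.
    for st in reversed(older_stores):
        if st["addr"] == load_addr:
            return "rob", st["res"]
    return "memory", None
```

Stores themselves need no such check, because they only touch memory at commit, which is in-order anyway.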

Example (store-Operation)

Issue-phase:
− Issue the first st-operation

Execute-phase:
− Execution of the store-operation can start if both source operands are available
− Execution has no effect
− Rather, WB of the st-operation starts immediately

Program:
  st r3 -> (r0)
  ld r3 <- (r1)
  st r1 -> (r2)

State shown in the figure:
− RS of the memory unit: [st,0,0,17,5,-,2,1,RO,1]
− Registers (value, rob): R0: 5, 0; R1: 14, 0; R2: 20, 0; R3: 17, 0
− ROB-entry 1: [-,-,2,0,1]


Updates during WB of the st-operation:
− ROB[x].res := RS[y].Vj
− ROB[x].addr := RS[y].Vk

State shown in the figure:
− RS of the memory unit: [st,0,0,17,5,-,2,1,WB,1]
− Registers (value, rob): R0: 5, 0; R1: 14, 0; R2: 20, 0; R3: 17, 0
− ROB-entry 1: [-,-,2,0,1]


Commit for the st-operation:
− MEM[ROB[first].addr] := ROB[first].res

State shown in the figure:
− Registers (value, rob): R0: 5, 0; R1: 14, 0; R2: 20, 0; R3: 17, 0
− ROB-entry 1: [17,5,2,1,1]

Example (load-Operation)

Suppose the first st-operation was issued and waits for execution

Then the ld-operation was issued, and its source operands are available

State shown in the figure:
− Instruction queue: Op C
− RS of the memory unit: [st,5,0,-,5,-,2,1,RO,1] and [ld,0,0,14,-,-,2,2,RO,1]
− Registers (value, rob): R0: 5, 0; R1: 14, 0; R2: 20, 0; R3: 17, 2
− ROB: 1: [-,-,2,0,1]; 2: [-,-,2,0,1]


Situation:
− The ld-operation is not executed, because the valid-bit of the first st-operation is 0

State shown in the figure:
− Instruction queue: Op C
− RS of the memory unit: [st,5,0,-,5,-,2,1,RO,1] and [ld,0,0,14,-,-,2,2,RO,1]
− Registers (value, rob): R0: 5, 0; R1: 14, 0; R2: 20, 0; R3: 17, 2
− ROB: 1: [-,-,2,0,1]; 2: [-,-,2,0,1]


Situation:
− Now the ld-operation can be executed (see the valid-bit of the first st-operation)
− The ld-operation can read its value either from memory or from the ROB (if the addr-field of a preceding st-operation matches the Vj-field of the ld-operation)

State shown in the figure:
− RS of the memory unit: [ld,0,0,14,-,-,2,2,RO,1]
− Registers (value, rob): R0: 5, 0; R1: 14, 0; R2: 20, 0; R3: 17, 2
− ROB: 1: [17,5,2,1,1]; 2: [-,-,2,0,1]


WB for the ld-operation is complete

The commit-phase for ld-operations is the same as for ALU-operations

State shown in the figure:
− Registers (value, rob): R0: 5, 0; R1: 14, 0; R2: 20, 0; R3: 17, 2
− ROB: 1: [17,5,2,1,1]; 2: [20,-,2,1,1]

Tomasulo with Speculation (Example 9: Loop Iteration, Cycle 0)

[Figure sequence: cycle-by-cycle state of the pipeline while executing the loop below. The machine has a memory unit with RS 1 and RS 2, Execute 1 with RS 3 and RS 4, Execute 2 with RS 5 and RS 6, a five-entry ROB, registers R0 to R4 with their rob-fields, and the PC]

Program:
  loop: ld  r0 <- (r1)
        mul r4 <- r0, r2
        add r3 <- r3, r4
        add r1 <- r1, 1
        bne loop, r1, 200
        …

Initial register values: R0: 0, R1: 100, R2: 13, R3: 0, R4: 0

Observations annotated in the sequence:
− Operations are fetched and issued speculatively
− Operations from different loop iterations are in the pipeline
− The ld-operation is no longer dependent on the branch operation
− Now speculative execution is possible

Summary

We have seen the Tomasulo algorithm with speculation

Importance of the reorder buffer

Execution of
− ALU-operations
− Branch-operations
− Memory-operations

But: the issue- and commit-phases are limited to processing a single operation per clock cycle

Content



Multithreading

Superscalar Instruction Pipeline

So far: only the data path is superscalar

Parallel execution of operations in the data path, but CPI < 1 is not possible

Required: superscalar
− Fetch-,
− Issue-,
− WB-,
− Commit-phase

[Figure: architecture diagram of the Tomasulo pipeline with program memory, PC, RS, memory unit, Execute 1 … Execute m, register file, and ROB]

Superscalar Fetch-Phase

Fetch:
− Fetching n operations simultaneously from code cache/memory
− Requires wider buses

[Figure: cache/memory feeds the instruction queue; operation buses 1 … n and operand buses A1 … An and B1 … Bn connect the instruction queue and the register file to the RS]

Superscalar Issue-Phase

Issue:
− Issue the first n operations from the instruction queue (n operation buses required)
− n operand buses for the left operand required (A)
− n operand buses for the right operand required (B)

Checking for a free RS and a free ROB-entry must be done simultaneously for up to n operations!

[Figure: cache/memory feeds the instruction queue; operation buses 1 … n and operand buses A1 … An and B1 … Bn connect the instruction queue and the register file to the RS]

Implementing simultaneous checking in issue-phase

[Figure: issue logic. For a single operation: the issue logic takes the old ROB status and old RS status and produces the new state for the ROB, the new state for the RS, and the RF control for operand buses A and B. For two operations: the issue logic for the first and for the second operation both take the old ROB status and old RS status; their outputs are combined into the new state for the ROB and the RS, with RF control for operand buses A1 and B1 and for A2 and B2]

Superscalar WB-Phase

Every EU has its own result bus Ei
− All EUs may write simultaneously to the ROB

This also makes the bypass for the reservation stations more complex

[Figure: memory unit and Execute 1 … Execute m drive the result buses E0 … Em into the ROB; the bypass in front of each RS selects among A1 … An, B1 … Bn, E0 … Em and R1 … Rk]

Superscalar Commit-Phase

For up to n ROB-entries starting at the head:
− check if the valid-bit is set to 1
− then write their results to the register file

The register file needs n write-ports

[Figure: memory unit and Execute 1 … Execute m drive the result buses E0 … Em into the ROB, which writes into the register file; the bypass in front of each RS selects among A1 … An, B1 … Bn, E0 … Em and R1 … Rk]
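A superscalar commit stage can be sketched as a loop that retires entries from the head of the ROB until it reaches either the limit n or the first entry without a valid result. The sketch below is an assumption-laden illustration: the ROB is modeled as a `collections.deque` with the oldest entry on the left, and `commit_up_to` is a hypothetical name.

```python
from collections import deque


def commit_up_to(rob_queue, n):
    """Commit at most n entries from the head of the ROB in one cycle.

    Stops at the first entry whose valid-bit is 0, so commit stays in-order.
    """
    committed = []
    while rob_queue and len(committed) < n and rob_queue[0]["valid"] == 1:
        committed.append(rob_queue.popleft())
    return committed
```

Each committed entry needs its own write-port in the same cycle, which is where the requirement of n write-ports comes from.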

Example PowerPC

Source: PowerPC™ e500 Core Family Reference Manual

Limitations for ILP

Memory bandwidth limits the number of simultaneously fetched operations (typically 4 to 6 operations)

HW overhead and delay for:
− Control logic of the issue-phase
− Bypasses for the reservation stations
− Number of read-/write-ports in the register file

Branches
− Possible solution: branch prediction

Available parallelism in the application
− Possible solution: multithreading

Content



Multithreading

Motivation for Multithreading

True dependencies prevent the EUs from being used in parallel (horizontal performance loss)

Operations with a very long delay during execution create vertical performance loss
− E.g. a memory access of an operation A in a Pentium 4 (3-way superscalar) can take 380 clock cycles (cache miss)
− I.e. 1140 operations have to bypass Op A in order to utilize the EUs fully during this time
− But: the reorder buffer has only 126 entries
− Hence, 339 execution cycles are wasted

Solution multithreading: execute multiple threads that share the same execution units, but have no dependencies

[Figure: EU usage over time. Left: operations OP 1 … OP n with true dependencies use only some of the EUs per cycle. Right: after Op A, the following operations keep the EUs busy for only 41 cycles; the WB of A happens after 380 cycles]
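The numbers in this example can be reproduced with a short calculation. The sketch below is one possible reading, stated as an assumption: Op A itself occupies one of the 126 ROB-entries, and the remaining entries are drained at 3 operations per cycle, rounded down. The function name `vertical_loss` is hypothetical.

```python
def vertical_loss(latency, width, rob_entries):
    """Return (operations needed, busy cycles, wasted cycles) for one long-latency op."""
    needed_ops = latency * width        # operations needed to keep all EUs busy
    other_ops = rob_entries - 1         # assumption: Op A holds one ROB-entry
    busy_cycles = other_ops // width    # cycles until the ROB is full
    return needed_ops, busy_cycles, latency - busy_cycles


print(vertical_loss(380, 3, 126))
```

Under these assumptions the call returns 1140 needed operations, 41 busy cycles, and 339 wasted cycles, matching the figures on the slide.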

Process vs. Thread

Each process has its own context
− address space (code, data, heap, stack)
− TLB
− …

Switching between processes takes tens of thousands of clock cycles (context switch)

The OS is involved

Threads share the same context

Switching between two threads only requires changing the values in the architectural registers

[Figure: memory layout. Process 1 and Process 2 each have their own code section, data section, heap, and stack. Thread 1 and Thread 2 in Process 2 share the code section, data section, and heap, but use separate stacks (Stack 1 and Stack 2)]

Multithreading

Multithreading: a fixed number of n threads can share the same execution units

Hardware supports fast switching between n threads:
− n copies of some resources, e.g. architectural registers (including the PC)
− fixed partitioning of some resources, e.g. RS (or limited sharing)
− shared usage of some resources, e.g. EUs

[Figure: architecture diagram with two instruction queues, PC 1 and PC 2, RF 1 and RF 2, ROB 1 and ROB 2, and shared RS, memory unit, and execution units Execute 1 … Execute m]

Multithreading

Types of Multithreading:

no MT Coarse Grained MT Fine Grained MT Simultaneous MT

Coarse Grained Multithreading

A single thread runs for many clock cycles before the hardware switches to another thread

Hardware switches between threads only if
− a long-running operation is detected, e.g. a cache miss, or
− a fixed time slice has passed

A processor with n-way MT appears to an operating system like n processors
− The OS schedules n threads of the same process to these processors

Example

Two threads are scheduled to the processor

Reservation stations and EUs are shared resources

Hardware switches between both PCs and IQs (e.g. by multiplexers)

Fetched operations are tagged with the thread number

Situation: Thread 1 is running

[Figure: architecture diagram with one instruction queue, PC, RF, and ROB per thread and shared RS and EUs; operations OP A.1, OP B.1, OP C.1 of thread 1 are in the pipeline, the operations of thread 2 (OP A.2, OP B.2, OP C.2) wait in their instruction queue]

Example

Memory operation D.1 of thread 1 was issued

Thread 1 is still running…

[Figure: same architecture diagram; OP D.1 has entered the pipeline behind OP A.1, OP B.1, and OP C.1, while the operations of thread 2 still wait in their instruction queue]

Example

Memory operation is executed and cache miss is detected

Processor has switched to thread 2
− another PC is used
− another instruction queue is used

[Figure: pipeline state after the switch; the thread selector shows 2; operations of thread 1 (OP A.1 to OP D.1) are still in the pipeline; instruction queue of thread 2 holds OP A.2, OP B.2, OP C.2]

Example

Issued operations of thread 1 are further processed

But issue now takes place from instruction queue 2

[Figure: pipeline state; the thread selector shows 2; OP A.2 has been issued from instruction queue 2 while operations of thread 1 remain in the pipeline]

Example

Issued operations of thread 1 are further processed

But issue now takes place from instruction queue 2

[Figure: pipeline state one step later; the thread selector shows 2; OP A.2 and OP B.2 have been issued while operations of thread 1 remain in the pipeline]

Example

Operations of thread 1 are processed further but not committed, while operations from thread 2 are processed simultaneously

If operation E.1 has a true dependency on D.1, it occupies its reservation station and blocks it for operations from thread 2

Balancing the use of shared resources between the threads is important

[Figure: pipeline state; the thread selector shows 2; operations of both threads (e.g. OP E.1, OP A.2, OP B.2, OP C.2) occupy the shared reservation stations]
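The balancing problem can be illustrated with a small model of a shared pool of reservation stations. The per-thread limit used here is only one possible balancing policy and is an assumption for illustration; the slides do not prescribe a mechanism.

```python
class ReservationStations:
    """Shared RS pool with an optional per-thread cap (hypothetical policy)."""

    def __init__(self, capacity, per_thread_limit=None):
        self.capacity = capacity
        self.per_thread_limit = per_thread_limit
        self.entries = []  # thread numbers of the operations holding an entry

    def try_allocate(self, thread):
        """Return True if an operation of `thread` obtains an entry."""
        if len(self.entries) >= self.capacity:
            return False
        if (self.per_thread_limit is not None
                and self.entries.count(thread) >= self.per_thread_limit):
            return False
        self.entries.append(thread)
        return True


# Fully shared: waiting operations of thread 1 can fill every entry.
shared = ReservationStations(capacity=4)
thread1_shared = [shared.try_allocate(1) for _ in range(4)]
thread2_shared = shared.try_allocate(2)

# Capped: thread 1 may hold at most two entries, so thread 2 still gets one.
capped = ReservationStations(capacity=4, per_thread_limit=2)
thread1_capped = [capped.try_allocate(1) for _ in range(4)]
thread2_capped = capped.try_allocate(2)
```

In the fully shared pool, operations of thread 1 that wait for D.1 leave no entry for thread 2; with the cap, thread 2 can still issue.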

Coarse Grained MT - Limitations

Does not help to overcome the problem of horizontal performance loss (a single thread may not have enough ILP)
− Only right after switching between threads are operations of both threads processed simultaneously

Switching between threads may reduce the cache hit rate of each thread, which degrades performance
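To make the two kinds of loss concrete, the following sketch counts unused issue slots. It assumes the usual meaning of the terms: horizontal loss is the number of unused slots in cycles where at least one operation is issued, vertical loss is the number of slots in cycles where nothing is issued. The function name and the example schedule are hypothetical.

```python
def slot_losses(issued_per_cycle, issue_width):
    """Return (horizontal, vertical) loss measured in unused issue slots."""
    horizontal = sum(issue_width - n for n in issued_per_cycle if n > 0)
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    return horizontal, vertical


# Hypothetical single-thread schedule on a 4-wide machine:
# two cycles are lost completely, e.g. while waiting for a cache miss.
losses = slot_losses([2, 1, 0, 0, 3, 4], issue_width=4)
```

Coarse-grained MT can fill the two empty cycles with another thread, but the partially filled cycles stay partially filled.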

Fine-Grained Multithreading

Processor switches to another thread in every clock cycle
− E.g. in a round-robin manner

This helps to overcome horizontal performance loss

A single instruction queue and a single reorder buffer are sufficient (shared)

Operations must be tagged with the corresponding Thread number
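A minimal sketch of round-robin fetch into one shared instruction queue, assuming one fetched operation per cycle. The function name and the representation of a program as a list of operation names are assumptions for illustration.

```python
from itertools import cycle


def fine_grained_fetch(programs, cycles):
    """programs maps a thread number to its list of operation names.
    Returns the shared instruction queue with entries tagged 'name.thread'."""
    pcs = {t: 0 for t in programs}          # one PC per thread
    queue = []                              # single shared instruction queue
    order = cycle(sorted(programs))         # strict round robin over threads
    for _ in range(cycles):
        t = next(order)
        if pcs[t] < len(programs[t]):       # the slot stays empty otherwise
            queue.append(f"{programs[t][pcs[t]]}.{t}")
            pcs[t] += 1
    return queue


both = fine_grained_fetch({1: ["A", "B", "C"], 2: ["A", "B", "C"]}, cycles=6)
alone = fine_grained_fetch({1: ["A", "B", "C"], 2: []}, cycles=4)
```

With two active threads the queue interleaves A.1, A.2, B.1, B.2, and so on. With strict round robin and only one active thread, every second fetch slot stays empty.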

Example

[Figure sequence, nine steps: a single shared instruction queue and a single shared ROB, separate RF 1 / RF 2 and PCs per thread, RS in front of the memory unit and Execute 1 … Execute m. The thread selector alternates 1, 2, 1, 2, …; operations enter the pipeline in the order OP A.1, OP A.2, OP B.1, OP B.2, OP C.1, OP C.2, OP D.1, OP D.2]

Fine-Grained Multithreading - Limitations

Vertical performance loss cannot be avoided
− A long-running operation prevents other operations of the same thread from being executed
− Due to the shared IQ and ROB, the other thread is also blocked after a while
• Improvement: stop fetching for a blocked thread

The performance of a single thread is reduced (even if there are no operations from a second, blocked thread), because the thread can issue only in every second cycle

MT reduces cache performance

Simultaneous Multithreading

Mixing Coarse- and Fine-Grained MT

In every clock cycle, operations from n threads are fetched and issued (Intel calls this Hyper-Threading)

Operations must be tagged with the corresponding Thread number

Solves the problem of having either horizontal or vertical performance loss:
− If neither thread is blocked, the available ILP is utilized and horizontal performance loss is avoided
− If one thread is blocked, the other thread still uses the resources and vertical performance loss is avoided (but not horizontal loss)

Even if one thread is blocked, the other one can run at full speed (issue in every clock cycle)
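The two cases above can be sketched with a toy issue stage that fills the slots of one cycle from all threads that have ready operations. The function name, the alternating selection order, and the issue width of 4 are assumptions for illustration.

```python
def smt_issue(ready, issue_width):
    """ready maps a thread number to its ready operations (empty if blocked).
    Returns the (thread, operation) pairs issued in one clock cycle."""
    pending = {t: list(ops) for t, ops in ready.items()}
    slots = []
    # Take operations from the threads in turn until slots or work run out.
    while len(slots) < issue_width and any(pending.values()):
        for t in sorted(pending):
            if pending[t] and len(slots) < issue_width:
                slots.append((t, pending[t].pop(0)))
    return slots


# Neither thread is blocked: all four slots are used.
both_ready = smt_issue({1: ["A", "B"], 2: ["C", "D"]}, issue_width=4)

# Thread 1 is blocked: thread 2 keeps issuing, but two slots stay empty.
one_blocked = smt_issue({1: [], 2: ["C", "D"]}, issue_width=4)
```

The second call shows the remark in parentheses: the cycle is not lost completely, but the slots that thread 2 cannot fill on its own remain unused.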

Example

Fetch and issue take place simultaneously for both threads

Each thread has its own IQ, RF, PC, ROB

Reservation Stations are partitioned


[Figure: SMT pipeline with instruction queue 1 and instruction queue 2 (OP A to OP F waiting), PC 2, RF 1 / RF 2, ROB 1 / ROB 2, memory unit, Execute 1 … Execute m; the legend marks RS entries used for thread 1 and for thread 2]

Example

Both threads are executed...

Avoids horizontal performance loss


[Figure: pipeline state; OP A and OP B have been issued, the instruction queues hold OP C to OP H; RS entries are marked as used for thread 1 or for thread 2]

Example




[Figure: pipeline state one step later; the instruction queues have advanced to OP I and OP J, earlier operations are in the RS and ROBs]

Example



But now the long-running operation E is issued

[Figure: pipeline state; OP E has entered the pipeline, the instruction queues have advanced to OP K and OP L]

Example

Assume G has a true dependency on E


[Figure: pipeline state; results Res A to Res D are available, OP E to OP H are in the pipeline, the instruction queues have advanced to OP M and OP N]

Example

Assume G has a true dependency on E, and I does too

Then, thread 1 is now blocked…


[Figure: pipeline state; OP E is still executing, OP G and OP H are waiting, the instruction queues have advanced to OP O and OP P]

Example

… but thread 2 can continue


[Figure: pipeline state; OP E is still executing and OP G, OP H are still waiting, while the other instruction queue has advanced to OP Q]

Summary - Multithreading

Allows filling the pipeline with operations from different threads
− no data dependencies between operations from different threads
− allows for higher resource utilization

Coarse-grained MT suffers from horizontal performance loss

Fine-grained MT suffers from vertical performance loss

SMT solves these problems

− Improvement: Balancing between partitioned resources

All MT approaches have an impact on cache performance

In particular, fine-grained MT can also be used in statically scheduled processor pipelines to avoid hazards
− In a pipeline with n pipeline stages, operations from n threads are issued
− No data or control hazards occur because the operations in the pipeline have no dependencies
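The last point can be checked with a small occupancy model. It assumes one operation is issued per cycle in strict round robin and that an operation issued in cycle c is in stage s in cycle c + s; the function name and the stage numbering are hypothetical.

```python
def pipeline_occupancy(num_stages, num_threads, cycles):
    """For every cycle, list the thread whose operation is in each stage
    (None while the pipeline is still filling)."""
    snapshots = []
    for c in range(cycles):
        stages = []
        for s in range(num_stages):
            issue_cycle = c - s  # cycle in which this stage's op was issued
            stages.append(issue_cycle % num_threads if issue_cycle >= 0 else None)
        snapshots.append(stages)
    return snapshots


def same_thread_twice(snapshot):
    """True if two stages hold operations of the same thread."""
    live = [t for t in snapshot if t is not None]
    return len(live) != len(set(live))


# n = 5 stages and 5 threads: no thread ever has two operations in flight.
hazard_free = not any(same_thread_twice(s) for s in pipeline_occupancy(5, 5, 20))

# Only 2 threads in 5 stages: operations of one thread overlap again.
overlap = pipeline_occupancy(5, 2, 10)[4]
```

With as many threads as stages, an operation has left the pipeline before the next operation of the same thread enters it, so no forwarding or stalling between them is needed in this model.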
