chap.6: enhancing performance with pipelining

82
Chap.6: Enhancing Performance with Pipelining Jen-Chang Liu, Spring 2006 Parts of the slides are duplicated from inst.eecs.berkeley.edu/~cs61c

Upload: brasen

Post on 21-Mar-2016

56 views

Category:

Documents


0 download

DESCRIPTION

Chap.6: Enhancing Performance with Pipelining. Jen-Chang Liu, Spring 2006. Parts of the slides are duplicated from inst.eecs.berkeley.edu/~cs61c. Review Datapath (1/3). Datapath is the hardware that performs operations necessary to execute programs. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Chap.6: Enhancing Performance with Pipelining

Chap.6: Enhancing Performance with Pipelining

Jen-Chang Liu, Spring 2006

Parts of the slides are duplicated frominst.eecs.berkeley.edu/~cs61c

Page 2: Chap.6: Enhancing Performance with Pipelining

Review Datapath (1/3)Datapath is the hardware that performs operations necessary to execute programs.Control instructs datapath on what to do next.Datapath needs:

access to storage (general purpose registers and memory) computational ability (ALU) helper hardware (local registers and PC)

Page 3: Chap.6: Enhancing Performance with Pipelining

Review Datapath (2/3)Five stages of datapath (executing an instruction):

1. Instruction Fetch (Increment PC)2. Instruction Decode (Read Registers)3. ALU (Computation)4. Memory Access5. Write to Registers

ALL instructions must go through ALL five stages.

Page 4: Chap.6: Enhancing Performance with Pipelining

Review Datapath (3/3)P

C

inst

ruct

ion

mem

ory

+4

rtrsrd

regi

ster

s

ALU

Dat

am

emor

y

imm

1. InstructionFetch

2. Decode/ Register

Read

3. Execute 4. Memory5. WriteBack

Page 5: Chap.6: Enhancing Performance with Pipelining

Review Single cycle datapath

Multi-cycle datapath

Instructionfetch

Data/registerread

Instructionexecution

Memory/registerread/write

Registerwrite

Instructionfetch

Data/registerread

Instructionexecution

Memory/registerread/write

Registerwrite

Page 6: Chap.6: Enhancing Performance with Pipelining

Outline Overview A pipeline datapath Pipelined control Data hazards

Forwarding Stalls

Branch hazards Superscalar and dynamic pipelining

Page 7: Chap.6: Enhancing Performance with Pipelining

Time76 PM 8 9 10 11 12 1 2 AM

A

B

C

D

Time76 PM 8 9 10 11 12 1 2 AM

A

B

C

D

Taskorder

Taskorder

What’s pipelining?Time

76 PM 8 9 10 11 12 1 2 AM

A

B

C

D

Time76 PM 8 9 10 11 12 1 2 AM

A

B

C

D

Taskorder

Taskorder

洗衣 乾衣折疊收藏non-pipelined

pipelining

Use different resources at the same time

Page 8: Chap.6: Enhancing Performance with Pipelining

Pipelining Definition: an implementation

technique in which multiple instructions are overlapped in execution

How to achieve pipelining? An instruction is divided into steps

(stages) We have separate resources for each

stage

Page 9: Chap.6: Enhancing Performance with Pipelining

Instructionfetch Reg ALU Data

access Reg

8 nsInstruction

fetch Reg ALU Dataaccess Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

Single-cycle implementation

Pipelined implementation

Instructionfetch Reg ALU Data

access Reg

8 nsInstruction

fetch Reg ALU Dataaccess Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

Page 10: Chap.6: Enhancing Performance with Pipelining

What does pipelining improve?

Pipelining improves performance by Increasing the instruction throughput 單位時間完成的指令數目增加 Decreasing the execution time of an individual instruction 單一指令執行時間不變

Ideal conditions for savingTime between inst. pipelined =

Time between inst. non-pipelined

Number of pipe stages

Page 11: Chap.6: Enhancing Performance with Pipelining

Design instruction set for pipelining

MIPS has been designed for pipelining1. Instructions are of the same length

Easier to fetch and decode Ex. 80x86 inst.s ranges from 1 to 17 bytes

2. A few instruction formats, with the source register fields located in the same place

可以在決定是甚麼指令前讀暫存器

Page 12: Chap.6: Enhancing Performance with Pipelining

Design instruction set for pipelining (cont.)

3. Memory operands only appear for loads and stores

Calculate memory address in execution stage

4. Operands are aligned in memory One memory access for data transfer

Instructionfetch

Instruction Decoding,

Data/registerread

Instructionexecution

Memory/registerread/write

Registerwrite

0 1 2 3

Aligned

NotAligned

Page 13: Chap.6: Enhancing Performance with Pipelining

Pipeline hazards Hazards: the situations in pipelining

when the next instruction cannot execute in the following clock cycle使下一個指令無法在下一個 cycle 執行

Instructionfetch Reg ALU Data

access Reg

8 nsInstruction

fetch Reg ALU Dataaccess Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

危險

甚麼時候下個指令無法執行?

Page 14: Chap.6: Enhancing Performance with Pipelining

Instructionfetch Reg ALU Data

access Reg

8 nsInstruction

fetch Reg ALU Dataaccess Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

1. Structural hazards The hardware cannot support the

combination of instructions that we want to execute in the same clock cycle

Instructionfetch Reg ALU Data

access Reg

8 nsInstruction

fetch Reg ALU Dataaccess Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

硬體結構問題

2 memory access at the sameclock?

Page 15: Chap.6: Enhancing Performance with Pipelining

Structural Hazard #1: Single Memory

Solution:Second memory

infeasible and inefficient to createTwo Level 1 Caches

Cache: a temporary smaller [of usually most recently used] copy of memory

have both an L1 Instruction Cache and an L1 Data Cache

need more complex hardware to control when both caches miss

Page 16: Chap.6: Enhancing Performance with Pipelining

Instructionfetch Reg ALU Data

access Reg

8 nsInstruction

fetch Reg ALU Dataaccess Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

1. Structural hazard #2: single register

Instructionfetch Reg ALU Data

access Reg

8 nsInstruction

fetch Reg ALU Dataaccess Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

Can’t read and write to registers simultaneously?

Page 17: Chap.6: Enhancing Performance with Pipelining

Structural Hazard #2: Registers

Fact: Register access is VERY fast: takes less than half the time of ALU stage

Solution: introduce conventionalways Write to Registers during first half of each clock cycle

always Read from Registers during second half of each clock cycle

Result: can perform Read and Write during same clock cycle

Page 18: Chap.6: Enhancing Performance with Pipelining

2. Control hazards The need to make a decision based on

the results of one instruction while others are executing Ex. Branch instruction

下一個指令需等待前面指令的執行結果

add $4, $5, $6 beq $1, $2, 40 lw $3, 300($0)40: or $7, $8, $9

? …無法判斷下一個指令是那個…

Page 19: Chap.6: Enhancing Performance with Pipelining

Solution to control hazards #1: stall 拖延 stall (bubble): 暫停下一個指令執行

Instructionfetch Reg ALU Data

access Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)4 ns

Instructionfetch Reg ALU Data

access Reg2ns

Instructionfetch Reg ALU Data

access Reg

2ns

2 4 6 8 10 12 14 16Programexecutionorder(in instructions)

nop: no operationnop: no operationInstruction

fetch Reg ALU Dataaccess Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)4 ns

Instructionfetch Reg ALU Data

access Reg2ns

Instructionfetch Reg ALU Data

access Reg

2ns

2 4 6 8 10 12 14 16Programexecutionorder(in instructions)

completion of branchFilling two nops after branch is inefficient!

Page 20: Chap.6: Enhancing Performance with Pipelining

Solution to control hazards #1: stall

Stall (bubble): 暫停下一個指令執行Instruction

fetch Reg ALU Dataaccess Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)4 ns

Instructionfetch Reg ALU Data

access Reg2ns

Instructionfetch Reg ALU Data

access Reg

2ns

2 4 6 8 10 12 14 16Programexecutionorder(in instructions)

假設 branch 的位址計算,比較都可用另外的 hardware 在此 stage 完成

Page 21: Chap.6: Enhancing Performance with Pipelining

Solution to control hazards #2:

delayed branchadd $4, $5, $6 beq $1, $2, 40 lw $3, 300($0)

40: or $7, $8, $9? …

Instructionfetch Reg ALU Data

access Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

Instructionfetch Reg ALU Data

access Reg2 ns

Instructionfetch Reg ALU Data

access Reg

2 ns

2 4 6 8 10 12 14

2 ns

(Delayed branch slot)

Programexecutionorder(in instructions)

Page 22: Chap.6: Enhancing Performance with Pipelining

Redefine branchesOld definition: if we take the branch, none of the instructions after the branch get executed by accident

New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

The term “Delayed Branch” meanswe always execute inst after branch

Solution to control hazards #2:

delayed branch (cont.)

Page 23: Chap.6: Enhancing Performance with Pipelining

Notes on Branch-Delay SlotWorst-Case Scenario: can always put a no-op in the branch-delay slot

Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program

re-ordering instructions is a common method of speeding up programs

compiler must be very smart in order to find instructions to do this

usually can find such an instruction at least 50% of the time

Solution to control hazards #2:

delayed branch (cont.)

Page 24: Chap.6: Enhancing Performance with Pipelining

Solution to control hazards #3: predict 預測 預測下一個可能執行的指令 .

Ex. 猜 branch always not takenInstruction

fetch Reg ALU Dataaccess Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

Instructionfetch Reg ALU Data

access Reg2 ns

Instructionfetch Reg ALU Data

access Reg2 ns

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

beq $1, $2, 40

add $4, $5 ,$6

or $7, $8, $9

Instructionfetch Reg ALU Data

access Reg

2 4 6 8 10 12 14

2 4 6 8 10 12 14

Instructionfetch Reg ALU Data

access Reg

2 ns

4 ns

bubble bubble bubble bubble bubble

Programexecutionorder(in instructions)

(not taken沒有跳走 )

(taken)

Instructionfetch Reg ALU Data

access Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

Instructionfetch Reg ALU Data

access Reg2 ns

Instructionfetch Reg ALU Data

access Reg2 ns

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

beq $1, $2, 40

add $4, $5 ,$6

or $7, $8, $9

Instructionfetch Reg ALU Data

access Reg

2 4 6 8 10 12 14

2 4 6 8 10 12 14

Instructionfetch Reg ALU Data

access Reg

2 ns

4 ns

bubble bubble bubble bubble bubble

Programexecutionorder(in instructions)

Page 25: Chap.6: Enhancing Performance with Pipelining

Solution to control hazards: predict (cont.)

Example: easy to predict

Dynamic prediction hardware Record the history of branch results 90% accuracy

Loop: … … beq $1, $2, Loop

Page 26: Chap.6: Enhancing Performance with Pipelining

3. Data hazard An instruction depends on the results of

a previous instruction still in the pipeline

Solution 1: complier (assembler)

下一個指令需要的資料需等待前一個指令執行結果Example:

add $s0, $t0, $t1sub $t2, $s0, $t3

Page 27: Chap.6: Enhancing Performance with Pipelining

Standard notation for steps in MIPS

Time2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEM

Instructionfetch

Instruction Decoding,

Data/registerread

Instructionexecution

Memory/registerread/write

Registerwrite

讀出 寫入

Page 28: Chap.6: Enhancing Performance with Pipelining

Solution to data hazards: forwarding

Forwarding: Getting the missing item early from the internal resourceadd $s0, $t0, $t1sub $t2, $s0, $t3 等待指令執行完畢?

add $s0, $t0, $t1

sub $t2, $s0, $t3

Programexecutionorder(in instructions)

IF ID WBEX

IF ID MEMEX

Time2 4 6 8 10

MEM

WBMEM

$t0+$t1

Page 29: Chap.6: Enhancing Performance with Pipelining

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3

Programexecutionorder(in instructions)

IF ID WBMEMEX

IF ID WBMEMEX

bubble bubble bubble bubble bubble

Another example of forwarding

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3

Programexecutionorder(in instructions)

IF ID WBMEMEX

IF ID WBMEMEX

bubble bubble bubble bubble bubble

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3

Programexecutionorder(in instructions)

IF ID WBMEMEX

IF ID WBMEMEX

bubble bubble bubble bubble bubble

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3

Programexecutionorder(in instructions)

IF ID WBMEMEX

IF ID WBMEMEX

bubble bubble bubble bubble bubble

?

Mem[20($t1)]

Page 30: Chap.6: Enhancing Performance with Pipelining

Example: Reordering codes

lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)Swap 0($t1) and 4($t1)

What’s wrong with pipelining?Time

2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEMTime

2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEM

lw $t2, 4($t1)sw $t2, 0($t1)

Page 31: Chap.6: Enhancing Performance with Pipelining

Example: Reordering codes

Time2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEM

Time2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEM

lw $t2, 4($t1)

sw $t2, 0($t1)

lw $t0, 0($t1)lw $t2, 4($t1) sw $t0, 4($t1)sw $t2, 0($t1)

Time2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEMsw $t0, 4($t1)

Page 32: Chap.6: Enhancing Performance with Pipelining

Peer Instruction

A. Thanks to pipelining, I have reduced the time it took me to wash my shirt.B. Longer pipelines are always a win (since less work per stage & a faster clock).C. We can rely on compilers to help us avoid data hazards by reordering instrs.

ABC1: FFF2: FFT3: FTF4: FTT5: TFF6: TFT7: TTF8: TTT

Page 33: Chap.6: Enhancing Performance with Pipelining

Outline Overview A pipeline datapath: How to build? Pipelined control Data hazards

Forwarding Stalls

Branch hazards Superscalar and dynamic pipelining

Page 34: Chap.6: Enhancing Performance with Pipelining

Multiple instructions execute using pipelining

IM Reg DM RegALU

IM Reg DM RegALU

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7

Time (in clock cycles)

lw $2, 200($0)

lw $3, 300($0)

Programexecutionorder(in instructions)

lw $1, 100($0) IM Reg DM RegALU

How to combine these multiple datapaths? Shared units during one clock cycle add buffer (registers) to hold data (as we do in multi-cycle)

Page 35: Chap.6: Enhancing Performance with Pipelining

Divide single-cycle datapath into stages

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Instruction

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

ReaddataAddress

Datamemory

1

ALUresult

Mux

ALUZero

IF: Instruction fetch ID: Instruction decode/register file read

EX: Execute/address calculation

MEM: Memory access WB: Write back

1

2

Data flow

Page 36: Chap.6: Enhancing Performance with Pipelining

Pipelined datapath with pipeline registers (Fig 6.11)

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Datamemory

Address

64bits 128b 97b 64b

Page 37: Chap.6: Enhancing Performance with Pipelining

Example: (1) instruction fetch

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction fetchlw

Address

Datamemory

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Instruction decodelw

Address

Datamemory

Page 38: Chap.6: Enhancing Performance with Pipelining

Example: (2) instruction decode

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction fetchlw

Address

Datamemory

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Instruction decodelw

Address

Datamemory

Page 39: Chap.6: Enhancing Performance with Pipelining

Example: (3) execute

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Executionlw

Address

Datamemory

Page 40: Chap.6: Enhancing Performance with Pipelining

Example: (4) memory access

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

ReaddataData

memory1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Memorylw

Address

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writedata

ReaddataData

memory

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Write backlw

Writeregister

Address

97108/Patterson Figure 06.15

Page 41: Chap.6: Enhancing Performance with Pipelining

Example: (5) write back

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

ReaddataData

memory1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Memorylw

Address

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writedata

ReaddataData

memory

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Write backlw

Writeregister

Address

97108/Patterson Figure 06.15

?

寫回暫存器的號碼對嗎?

Page 42: Chap.6: Enhancing Performance with Pipelining

Preserve the Destination Register number

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0

Address

Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

Datamemory

1

ALUresult

Mux

ALUZero

ID/EX

add 5-bit buffer

Page 43: Chap.6: Enhancing Performance with Pipelining

Trace another pipelining Trace helps you understand how pipelining works !!! Single-clock-cycle pipeline diagram example: lw $10, 20($1) sub $11, $2, $3

Page 44: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction decodelw $10, 20($1)

Instruction fetchsub $11, $2, $3

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction fetchlw $10, 20($1)

Address

Datamemory

Address

Datamemory

Clock 1

Clock 2

Page 45: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction decodelw $10, 20($1)

Instruction fetchsub $11, $2, $3

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Instruction fetchlw $10, 20($1)

Address

Datamemory

Address

Datamemory

Clock 1

Clock 2

Page 46: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Address

4

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

3216Sign

extend

Writeregister

Writedata

Memorylw $10, 20($1)

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Executionsub $11, $2, $3

Instructionmemory

Address

4

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Executionlw $10, 20($1)

Instruction decodesub $11, $2, $3

3216Sign

extend

Address

Datamemory

Datamemory

Address

Clock 3

Clock 4

Page 47: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Address

4

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

3216Sign

extend

Writeregister

Writedata

Memorylw $10, 20($1)

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Executionsub $11, $2, $3

Instructionmemory

Address

4

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Executionlw $10, 20($1)

Instruction decodesub $11, $2, $3

3216Sign

extend

Address

Datamemory

Datamemory

Address

Clock 3

Clock 4

Page 48: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Address

4

32

0

Add Addresult

1

ALUresult

Zero

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEMID/EX MEM/WB

Write backMux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Mux

ALUReaddata

Writeregister

Writedata

lw $10, 20($1)

Instructionmemory

Address

4

32

0

Add Addresult

1

ALUresult

Zero

Shiftleft 2

Instr

uct

ion

IF/ID EX/MEMID/EX MEM/WB

Write backMux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Mux

ALUReaddata

Writeregister

Writedata

sub $11, $2, $3

Memory

sub $11, $2, $3

Address

Datamemory

Address

Datamemory

Clock 6

Clock 5

Page 49: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Address

4

32

0

Add Addresult

1

ALUresult

Zero

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEMID/EX MEM/WB

Write backMux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Mux

ALUReaddata

Writeregister

Writedata

lw $10, 20($1)

Instructionmemory

Address

4

32

0

Add Addresult

1

ALUresult

Zero

Shiftleft 2

Instr

uct

ion

IF/ID EX/MEMID/EX MEM/WB

Write backMux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Mux

ALUReaddata

Writeregister

Writedata

sub $11, $2, $3

Memory

sub $11, $2, $3

Address

Datamemory

Address

Datamemory

Clock 6

Clock 5

Page 50: Chap.6: Enhancing Performance with Pipelining

Multiple-clock-cycle pipeline diagram

Page 51: Chap.6: Enhancing Performance with Pipelining

Outline Overview A pipeline datapath Pipelined control Data hazards

Forwarding Stalls

Branch hazards Superscalar and dynamic pipelining

Page 52: Chap.6: Enhancing Performance with Pipelining

Label the control lines (Fig 6.22)

PC

Instructionmemory

Address

Inst

ruct

ion

Instruction[20– 16]

MemtoReg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0Registers

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1Write

data

Read

data Mux

1

ALUcontrol

RegWrite

MemRead

Instruction[15– 11]

6

IF/ID ID/EX EX/MEM MEM/WB

MemWrite

Address

Datamemory

PCSrc

Zero

Add Addresult

Shiftleft 2

ALUresult

ALUZero

Add

0

1

Mux

0

1

Mux

written during each clock cycle

Page 53: Chap.6: Enhancing Performance with Pipelining

Divide the control lines into groups

Fig. 6.25

How to set control lines at each stage? Extend the pipeline registers to include control information 將控制訊號儲存在 pipeline registers

Page 54: Chap.6: Enhancing Performance with Pipelining

Propagation of control (Fig.6.26)

Control

EX

M

WB

M

WB

WB

IF/ID ID/EX EX/MEM MEM/WB

Instruction

Page 55: Chap.6: Enhancing Performance with Pipelining

PC

Instructionmemory

Inst

ruct

ion

Add

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Mem

Writ

e

AddressData

memory

Address

Fig. 6.27

Page 56: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

Instruction[15– 0]

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

MEM/WB

IF: lw $10, 20($1)

000

00

0000

000

00

000

0

00

00

0

00

Mux

0

1

Add

PC

0

Datamemory

Address

Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

Mux

0

1

Add Addresult

Writeregister

Writedata

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Reg

Writ

e

ALU

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

MEM/WB

IF: sub $11, $2, $3

010

11

0001

000

00

000

0

00

00

0

00

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

lwControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

X

10

20

X

1

Instruction[20– 16]

Instruction[15– 0] Sign

extend

Instruction[15– 11]

20

$X

$1

10

X

Me

mW

rite

MemRead

Me

mW

rite

Datamemory

Address

Address

Address

Clock 2

Clock 1

Old bookFig 6.31

Page 57: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

Instruction[15– 0]

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

MEM/WB

IF: lw $10, 20($1)

000

00

0000

000

00

000

0

00

00

0

00

Mux

0

1

Add

PC

0

Datamemory

Address

Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

Mux

0

1

Add Addresult

Writeregister

Writedata

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Reg

Writ

e

ALU

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

MEM/WB

IF: sub $11, $2, $3

010

11

0001

000

00

000

0

00

00

0

00

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

lwControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

X

10

20

X

1

Instruction[20– 16]

Instruction[15– 0] Sign

extend

Instruction[15– 11]

20

$X

$1

10

X

Mem

Writ

e

MemRead

Mem

Writ

e

Datamemory

Address

Address

Address

Clock 2

Clock 1

Page 58: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Address

Instruction[20– 16]

Mem

toR

eg

Branch

ALUSrc

4

Instruction[15– 0]

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

MEM/WB

IF: and $12, $4, $5

000

10

1100

010

11

000

1

00

00

0

00

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

Writeregister

Writedata 1

ALUresult

ALUcontrol

Shiftleft 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

MEM/WB

IF: or $13, $6, $7

000

10

1100

000

10

101

0

11

10

0

00

Mux

0

1

Add

PC

0Writedata

Mux

1

andControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

12

X

X

5

4

Instruction[20– 16]

Instruction[15– 0]

Instruction[15– 11]

X

$5

$4

X

12

Me

mW

rite

MemRead

Me

mW

rite

sub

11

X

X

3

2

X

$3

$2

X

11

$1

20

10

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$3

$2

11

Mux

Mux

ALUAddress Read

dataData

memory

10

WB

Zero

Zero

Signextend

Signextend

Datamemory

Address

Clock 3

Clock 4

20($1)

Page 59: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Address

Instruction[20– 16]

Mem

toR

eg

Branch

ALUSrc

4

Instruction[15– 0]

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

MEM/WB

IF: and $12, $4, $5

000

10

1100

010

11

000

1

00

00

0

00

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

Writeregister

Writedata 1

ALUresult

ALUcontrol

Shiftleft 2

Reg

Writ

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

MEM/WB

IF: or $13, $6, $7

000

10

1100

000

10

101

0

11

10

0

00

Mux

0

1

Add

PC

0Writedata

Mux

1

andControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

12

X

X

5

4

Instruction[20– 16]

Instruction[15– 0]

Instruction[15– 11]

X

$5

$4

X

12

Mem

Writ

e

MemRead

Mem

Writ

e

sub

11

X

X

3

2

X

$3

$2

X

11

$1

20

10

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$3

$2

11

Mux

Mux

ALUAddress Read

dataData

memory

10

WB

Zero

Zero

Signextend

Signextend

Datamemory

Address

Clock 3

Clock 4

20($1)

Page 60: Chap.6: Enhancing Performance with Pipelining

Instructionmemory

Address

Instruction[20– 16]

Branch

ALUSrc

4

Instruction[15– 0]

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .

MEM/WB

IF: add $14, $8, $9

000

10

1100

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

ALUcontrol

Shiftleft 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .

MEM/WB

IF: after<1>

000

10

1100

000

10

101

0

10

00

0

10

Mux

0

1

Add

PC

0Writedata

Mux

1

addControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

14

X

X

9

8

Instruction[20– 16]

Instruction[15– 0]

Instruction[15– 11]

X

$9

$8

X

14

Me

mW

rite

MemRead

Me

mW

rite

or

13

X

X

7

6

X

$7

$6

X

13

$4

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$7

$6

13

Mux

Mux

ALUReaddata

12

WB

11 10

10$5

12

WB

Mem

toR

eg

11

11

11

Writeregister

Writedata

Zero

Zero

Datamemory

Address

Datamemory

Address

Signextend

Signextend

Clock 5

Clock 6

20($1)

Page 61: Chap.6: Enhancing Performance with Pipelining

Outline Overview A pipeline datapath Pipelined control Data hazards

Forwarding Stalls

Branch hazards Superscalar and dynamic pipelining

Page 62: Chap.6: Enhancing Performance with Pipelining

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecutionorder(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

Example: data hazards

Both r/w is oK!

?

?

Page 63: Chap.6: Enhancing Performance with Pipelining

Solution 1: software Insert independent instruction or nop (no operation)

sub $2, $1, $3nopnopand $12, $2, $6or $13, $6, $2add $14, $2, $2sw $15, 100($2)

Page 64: Chap.6: Enhancing Performance with Pipelining

Hardware approach: forwarding

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

1

2

EX/MEM MEM/WB 1EX hazard2MEM hazard

Page 65: Chap.6: Enhancing Performance with Pipelining

Registers

Mux M

ux

ALU

ID/EX MEM/WB

Datamemory

Mux

Forwardingunit

EX/MEM

b. With forwarding

ForwardB

Rd EX/MEM.RegisterRd

MEM/WB.RegisterRd

RtRtRs

ForwardA

Mux

ALU

ID/EX MEM/WB

Datamemory

EX/MEM

a. No forwarding

Registers

Mux

Page 66: Chap.6: Enhancing Performance with Pipelining

Control for forwarding Set ForwardA and ForwardB

EX hazard:If ( EX/MEM.RegWriteand (EX/MEM.RegisterRd <> 0)and (EX/MEM.RegisterRd = ID/EX.RegisterRs))ForwardA = 10

// 排除寫入 $0// 前一指令有寫入暫存器動作

// 前一指令有寫入暫存器號碼與現在欲使用暫存器相同

If ( EX/MEM.RegWriteand (EX/MEM.RegisterRd <> 0)and (EX/MEM.RegisterRd = ID/EX.RegisterRt))ForwardB = 10

// 排除寫入 $0// 前一指令有寫入暫存器動作

// 前一指令有寫入暫存器號碼與現在欲使用暫存器相同

Page 67: Chap.6: Enhancing Performance with Pipelining

Outline Overview A pipeline datapath Pipelined control Data hazards

Forwarding Stalls

Branch hazards Superscalar and dynamic pipelining

Page 68: Chap.6: Enhancing Performance with Pipelining

Unsolved data hazards

Reg

IM

Reg

Reg

IM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

?

Page 69: Chap.6: Enhancing Performance with Pipelining

Stall

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble

Page 70: Chap.6: Enhancing Performance with Pipelining

Hazard detection unit When to stall?

If ( ID/EX.MemRead and (( ID/EX.RegisterRt = IF/ID.RegisterRs) or ( ID/EX.RegisterRt = IF/ID.RegisterRt) )Stall the pipeline

// 前一指令是 load

Page 71: Chap.6: Enhancing Performance with Pipelining

Outline Overview A pipeline datapath Pipelined control Data hazards

Forwarding Stalls

Branch hazards Advanced pipelining

Page 72: Chap.6: Enhancing Performance with Pipelining

Advanced technology Single-cycle implementation multi-cycle implementation Pipelining (instruction-level

parallelism)* How to build fast processors ?1. Increase depth of the pipeline: Superpipelining

2. Replicate the internal components of the computer : multiple issue

Dynamic multiple issue (hardware): superscalarStatic multiple issue (compiler): VLIW

How to schedule?

Page 73: Chap.6: Enhancing Performance with Pipelining

Superpipelining (longer pipelines) Ideal improvement of pipelining

Time between inst. pipelined = Time between inst. nonpipelinedNumber of pipe stages

* Divide instruction up to 8 or more stages* How to balance the length of each cycle?=> Clock cycle depends on the step of longest time

Instructionfetch Reg ALU Data

access Reg

8 nsInstruction

fetch Reg ALU Dataaccess Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 nsInstruction

fetch Reg ALU Dataaccess Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

Page 74: Chap.6: Enhancing Performance with Pipelining

Multiple-issue pipeline Two problems

How to package instructions into issue slots

Static issue processor: compiler Dynamic issue processor: at runtime by

hardware Dealing with data and control hazards

Static issue processor: compiler Dynamic issue processor: at runtime by

hardware

Page 75: Chap.6: Enhancing Performance with Pipelining

Static multiple issueTime

2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEM

Time2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEM

Issue packet

• The mix of instructions can be initiated in a given clock is usually restricted Issue packet can be taken as a single instruction allowing several operations Very long instruction word (VLIW)

Page 76: Chap.6: Enhancing Performance with Pipelining

Ex. Two-issue MIPS 2 instructions are issued per clock

cycle One for integer ALU or branch The other for load or store

* One slot can be nop

Page 77: Chap.6: Enhancing Performance with Pipelining

A static two-issue datapath

PC Instructionmemory

4

RegistersMux

Mux

ALU

Mux

Datamemory

Mux

40000040

Signextend Sign

extend

ALU Address

Writedata

1

2

avoidStructural hazard

64-bit

Page 78: Chap.6: Enhancing Performance with Pipelining

Superscalar: Dynamic pipeline scheduling

Dynamic pipeline scheduling Hardware support for reordering the

order of instruction execution so as to avoid stalls

Ex. Unsolved pipeline stalllw $t0, 20($s2)addu $t1, $t0, $t2sub $s4, $s4, $t3slti $t5, $s4, 20Stall (memory is slow)

Page 79: Chap.6: Enhancing Performance with Pipelining

Dynamic pipeline scheduling

Commitunit

Instruction fetchand decode unit

In-order issue

In-order commit

Load/Store

FloatingpointIntegerInteger …Functional

unitsOut-of-order execute

Reservationstation

Reservationstation

Reservationstation

Reservationstation

1

2

3

Buffers tohold operandsand operation

Reorder buffer,decide when it’s safeto put the result back

In the orderof programexecution

Page 80: Chap.6: Enhancing Performance with Pipelining

dynamic pipelining Static pipelining Advantage of dynamic pipelining

Hide memory latency Avoid the stalls that compiler could not

schedule Speculative execute instructions

(combined with branch prediction) while waiting for hazards to be solved

Page 81: Chap.6: Enhancing Performance with Pipelining

Fallacies and pitfalls Pipelining is easy Pipelining ideas can be implemented

independent of technology No. of transistors on-chip increase,

additional hardware makes sense Ex. Extra hardware in dynamic pipelining

Increase the depth of pipelining always increase performance

Page 82: Chap.6: Enhancing Performance with Pipelining

Depth of pipelining

Data hazards: large percentage of cycles become stalls

Control hazards: slower branches Pipeline register overhead

1 2 4 8 16

Pipeline depth

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Rel

ativ

e pe

rform

ance

Fp pipelining, 19863 factors: