computer organization and architecture (at70. 01) comp. sc. and inf

77
Architecture (AT70.01) Architecture (AT70.01) Comp. Sc. and Inf. Comp. Sc. and Inf. Mgmt. Mgmt. Asian Institute of Asian Institute of Technology Technology Instructor: Dr. Sumanta Guha Slide Sources: Patterson & Hennessy COD book site (copyright Morgan Kaufmann) adapted and supplemented

Upload: delphia-oliver

Post on 06-Jan-2018

234 views

Category:

Documents


1 download

DESCRIPTION

COD Ch. 6 Enhancing Performance with Pipelining

TRANSCRIPT

Page 1: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Computer Organization Computer Organization and Architecture (AT70.01)and Architecture (AT70.01)Comp. Sc. and Inf. Mgmt.Comp. Sc. and Inf. Mgmt.Asian Institute of TechnologyAsian Institute of TechnologyInstructor: Dr. Sumanta GuhaSlide Sources: Patterson &Hennessy COD book site(copyright Morgan Kaufmann)adapted and supplemented

Page 2: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

COD Ch. 6Enhancing Performance with Pipelining

Page 3: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelining Start work ASAP!! Do not waste time!

Time76 PM 8 9 10 11 12 1 2 AM

A

B

C

D

Time76 PM 8 9 10 11 12 1 2 AM

A

B

C

D

Taskorder

Taskorder

Assume 30 min. each task – wash, dry, fold, store – and that separate tasks use separate hardware and so can be overlapped

Pipelined

Not pipelined

Page 4: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined vs. Single-Cycle Instruction Execution: the Plan

Instructionfetch Reg ALU Data

access Reg

8 ns Instructionfetch Reg ALU Data

access Reg

8 ns Instructionfetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 ns Instructionfetch Reg ALU Data

access Reg

2 ns Instructionfetch Reg ALU Data

access Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

Single-cycle

Pipelined

Assume 2 ns for memory access, ALU operation; 1 ns for register access:therefore, single cycle clock 8 ns; pipelined clock cycle 2 ns.

Page 5: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelining: Keep in Mind Pipelining does not reduce latency of a single

task, it increases throughput of entire workload Pipeline rate limited by longest stage

potential speedup = number pipe stages unbalanced lengths of pipe stages reduces speedup

Time to fill pipeline and time to drain it – when there is slack in the pipeline – reduces speedup

Page 6: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Example Problem Problem: for the laundry fill in the following table when

1. the stage lengths are 30, 30, 30 30 min., resp.2. the stage lengths are 20, 20, 60, 20 min., resp.

Come up with a formula for pipeline speed-up!

Person Unpipelined Pipeline 1 Ratio unpipelined Pipeline 2 Ratio unpiplelined finish time finish time to pipeline 1 finish time to pipeline 2 1 2 3 4

n

Page 7: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelining MIPS What makes it easy with MIPS?

all instructions are same length so fetch and decode stages are similar for all instructions

just a few instruction formats simplifies instruction decode and makes it possible in one

stage memory operands appear only in load/stores

so memory access can be deferred to exactly one later stage

operands are aligned in memory one data transfer instruction requires one memory access

stage

Page 8: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelining MIPS What makes it hard?

structural hazards: different instructions, at different stages, in the pipeline want to use the same hardware resource

control hazards: succeeding instruction, to put into pipeline, depends on the outcome of a previous branch instruction, already in pipeline

data hazards: an instruction in the pipeline requires data to be computed by a previous instruction still in the pipeline

Before actually building the pipelined datapath and control we first briefly examine these potential hazards individually…

Page 9: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Structural Hazards Structural hazard: inadequate hardware to simultaneously

support all instructions in the pipeline in the same clock cycle E.g., suppose single – not separate – instruction and data

memory in pipeline below with one read port then a structural hazard between first and fourth lw instructions

MIPS was designed to be pipelined: structural hazards are easy to avoid!

2 4 6 8 10 12 14

Instructionfetch Reg ALU Data

access Reg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 ns Instructionfetch Reg ALU Data

access Reg

2 ns Instructionfetch Reg ALU Data

access Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

Pipelined

Instructionfetch Reg ALU Data

access Reg2 nslw $4, 400($0)

Hazard if single memory

Page 10: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Control Hazards Control hazard: need to make a decision based on the

result of a previous instruction still executing in pipeline Solution 1 Stall the pipeline

Instructionfetch Reg ALU Data

access Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

4 ns

Instructionfetch Reg ALU Data

access Reg2ns

Instructionfetch Reg ALU Data

access Reg

2ns

2 4 6 8 10 12 14 16Programexecutionorder(in instructions)

Pipeline stall

bubble

Note that branch outcome iscomputed in ID stage withadded hardware (later…)

Page 11: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Control Hazards Solution 2 Predict branch outcome

e.g., predict branch-not-taken :

Instructionfetch Reg ALU Data

access Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

Instructionfetch Reg ALU Data

access Reg2 ns

Instructionfetch Reg ALU Data

access Reg2 ns

Programexecutionorder(in instructions)

Instructionfetch Reg ALU Data

access Reg

Time

beq $1, $2, 40

add $4, $5 ,$6

or $7, $8, $9

Instructionfetch Reg ALU Data

access Reg

2 4 6 8 10 12 14

2 4 6 8 10 12 14

Instructionfetch Reg ALU Data

access Reg

2 ns

4 ns

bubble bubble bubble bubble bubble

Programexecutionorder(in instructions)

Prediction success

Prediction failure: undo (=flush) lw

Page 12: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Control Hazards Solution 3 Delayed branch: always execute the

sequentially next statement with the branch executing after one instruction delay – compiler’s job to find a statement that can be put in the slot that is independent of branch outcome

MIPS does this – but it is an option in SPIM (Simulator -> Settings)

Instructionfetch Reg ALU Data

access Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

Instructionfetch Reg ALU Data

access Reg2 ns

Instructionfetch Reg ALU Data

access Reg

2 ns

2 4 6 8 1 0 12 14

2 ns

(d elayed branch slot)

Programexecutionorder(in instructions)

Delayed branch beq is followed by add that isindependent of branch outcome

Page 13: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Data Hazards Data hazard: instruction needs data from the result of a

previous instruction still executing in pipeline Solution Forward data if possible…

Time2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEM

add $s0, $t0, $t1

sub $t2, $s0, $t3

Programexecutionorder(in instructions)

IF ID WBEX

IF ID MEMEX

Time2 4 6 8 10

MEM

WBMEM

Instruction pipeline diagram:shade indicates use – left=write, right=read

Without forwarding – blue line –data has to go back in time;with forwarding – red line – data is available in time

Page 14: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Data Hazards Forwarding may not be enough

e.g., if an R-type instruction following a load uses the result of the load – called load-use data hazard

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3

Programexecutionorder(in instructions)

IF ID WBMEMEX

IF ID WBMEMEX

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3

Programexecutionorder(in instructions)

IF ID WBMEMEX

IF ID WBMEMEX

bubble bubble bubble bubble bubble

With a one-stage stall, forwardingcan get the data to the sub instruction in time

Without a stall it is impossibleto provide input to the subinstruction in time

Page 15: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Reordering Code to Avoid Pipeline Stall (Software Solution)

Example:lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)

Reordered code:lw $t0, 0($t1)lw $t2, 4($t1)sw $t0, 4($t1)sw $t2, 0($t1)

Data hazard

Interchanged

Page 16: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined Datapath We now move to actually building a pipelined datapath First recall the 5 steps in instruction execution

1. Instruction Fetch & PC Increment (IF)2. Instruction Decode and Register Read (ID)3. Execution or calculate address (EX)4. Memory access (MEM)5. Write result into register (WB)

Review: single-cycle processor all 5 steps done in a single clock cycle dedicated hardware required for each step

What happens if we break the execution into multiple cycles, but keep the extra hardware?

Page 17: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Review - Single-Cycle Datapath “Steps”

5 516

RD1

RD2

RN1 RN2 WN

WDRegister File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I32

MUX

<<2RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

IFInstruction Fetch

IDInstruction Decode

EXExecute/ Address Calc.

MEMMemory Access

WBWrite Back

Zero

Page 18: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined Datapath – Key Idea What happens if we break the execution into multiple cycles, but keep

the extra hardware? Answer: We may be able to start executing a new instruction at each

clock cycle - pipelining …but we shall need extra registers to hold data

between cycles – pipeline registers

Page 19: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined Datapath

IF/ID

Pipeline registers

5 516

RD1

RD2

RN1 RN2 WN

WDRegister File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I32

MUX

<<2RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

ID/EX EX/MEM MEM/WB

Zero

64 bits97 bits 64 bits

128 bits

wide enough to hold data coming in

Page 20: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined Datapath

IF/ID

Pipeline registers

5 516

RD1

RD2

RN1 RN2 WN

WDRegister File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I32

MUX

<<2RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

ID/EX EX/MEM MEM/WB

Zero

64 bits97 bits 64 bits

128 bits

wide enough to hold data coming in

Only data flowing right to left may cause hazard…, why?

Page 21: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Bug in the Datapath

5 516

RD1

RD2

RN1 RN2 WN

WDRegister File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I32

MUX

<<2RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

Write register number comes from another later instruction!

IF/ID ID/EX EX/MEM MEM/WB

Page 22: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Corrected Datapath

5

RD1

RD2

RN1

RN2

WNWD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RDInstruction

Memory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

EX/MEM MEM/WB

Zero

ID/EXIF/ID

64 bits 133 bits102 bits 69 bits

Destination register number is also passed through ID/EX, EX/MEMand MEM/WB registers, which are now wider by 5 bits

Page 23: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined Example Consider the following instruction sequence: lw $t0, 10($t1) sw $t3, 20($t4) add $t5, $t6, $t7 sub $t8, $t9, $t10

Page 24: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Single-Clock-Cycle Diagram:Clock Cycle 1LW

5

RD1

RD2

RN1

RN2

WNWD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RDInstruction

Memory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

Page 25: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

5

RD1

RD2

RN1

RN2

WNWD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RDInstruction

Memory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

Single-Clock-Cycle Diagram: Clock Cycle 2LWSW

Page 26: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

5

RD1

RD2

RN1

RN2

WNWD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RDInstruction

Memory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

Single-Clock-Cycle Diagram: Clock Cycle 3 LWSWADD

Page 27: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

5

RD1

RD2

RN1

RN2

WNWD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RDInstruction

Memory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

Single-Clock-Cycle Diagram: Clock Cycle 4 LWSWADDSUB

Page 28: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Single-Clock-Cycle Diagram: Clock Cycle 5

5

RD1

RD2

RN1

RN2

WNWD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RDInstruction

Memory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

LWSWADDSUB

Page 29: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Single-Clock-Cycle Diagram: Clock Cycle 6

5

RD1

RD2

RN1

RN2

WNWD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RDInstruction

Memory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

SWADDSUB

Page 30: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Single-Clock-Cycle Diagram: Clock Cycle 7

5

RD1

RD2

RN1

RN2

WNWD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RDInstruction

Memory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

ADDSUB

Page 31: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Single-Clock-Cycle Diagram: Clock Cycle 8

5

RD1

RD2

RN1

RN2

WNWD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RDInstruction

Memory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

IF/ID ID/EX EX/MEM MEM/WB

Zero

SUB

Page 32: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Alternative View – Multiple-Clock-Cycle Diagram

IM REG ALU DM REGlw $t0, 10($t1)

sw $t3, 20($t4)

add $t5, $t6, $t7

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7

IM REG ALU DM REG

IM REG ALU DM REG

sub $t8, $t9, $t10 IM REG ALU DM REG

CC 8

Time axis

Page 33: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Notes One significant difference in the execution of an R-type

instruction between multicycle and pipelined implementations: register write-back for the R-type instruction is the 5th (the last

write-back) pipeline stage vs. the 4th stage for the multicycle implementation. Why?

think of structural hazards when writing to the register file… Worth repeating: the essential difference between the pipeline

and multicycle implementations is the insertion of pipeline registers to decouple the 5 stages

The CPI of an ideal pipeline (no stalls) is 1. Why? The RaVi Architecture Visualization Project of Dortmund U. has

pipeline simulations – see link in our Additional Resources page As we develop control for the pipeline keep in mind that the

text does not consider jump – should not be too hard to implement!

Page 34: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Recall Single-Cycle Control – the Datapath

PC

Instructionmemory

Readaddress

Instruction[31– 0]

Instruction [20 16]

Instruction [25 21]

Add

Instruction [5 0]

MemtoRegALUOpMemWrite

RegWrite

MemReadBranchRegDst

ALUSrc

Instruction [31 26]

4

16 32Instruction [15 0]

0

0Mux

0

1

Control

Add ALUresult

Mux

0

1

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

PCSrc

Datamemory

Writedata

Readdata

Mux

1

Instruction [15 11]

ALUcontrol

Shiftleft 2

ALUAddress

Page 35: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Recall Single-Cycle – ALU Control

Instruction AluOp Instruction Funct Field Desired ALU control

opcode operation ALU action inputLW 00 load word xxxxxx add 010SW 00 store word xxxxxx add 010Branch eq 01 branch eq xxxxxx subtract 110R-type 10 add 100000 add 010R-type 10 subtract 100010 subtract 110R-type 10 AND 100100 and 000R-type 10 OR 100101 or 001R-type 10 set on less 101010 set on less 111

Truth table for ALU control bits

ALUOp Funct field OperationALUOp1 ALUOp0 F5 F4 F3 F2 F1 F0

0 0 X X X X X X 0100 1 X X X X X X 1101 X X X 0 0 0 0 0101 X X X 0 0 1 0 1101 X X X 0 1 0 0 0001 X X X 0 1 0 1 0011 X X X 1 0 1 0 111

Page 36: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Recall Single-Cycle – Control Signals

Signal Name Effect when deasserted Effect when asserted

RegDst The register destination number for the The register destination number for the Write register comes from the rt field (bits 20-16) Write register comes from the rd field (bits 15-11)RegWrite None The register on the Write register input is written

with the value on the Write data inputAlLUSrc The second ALU operand comes from the The second ALU operand is the sign-extended, second register file output (Read data 2) lower 16 bits of the instructionPCSrc The PC is replaced by the output of the adder The PC is replaced by the output of the adder

that computes the value of PC + 4 that computes the branch targetMemRead None Data memory contents designated by the address

input are put on the first Read data outputMemWrite None Data memory contents designated by the address

input are replaced by the value of the Write data inputMemtoReg The value fed to the register Write data input The value fed to the register Write data input comes from the ALU comes from the data memory

Instruction RegDst ALUSrcMemto-

RegReg

WriteMem Read

Mem Write Branch ALUOp1 ALUp0

R-format 1 0 0 1 0 0 0 1 0lw 0 1 1 1 1 0 0 0 0sw X 1 X 0 0 1 0 0 0beq X 0 X 0 0 0 1 0 1

Effect of control bits

Deter-miningcontrolbits

Page 37: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipeline Control Initial design – motivated by single-cycle datapath control

– use the same control signals Observe:

No separate write signal for the PC as it is written every cycle No separate write signals for the pipeline registers as they

are written every cycle No separate read signal for instruction memory as it is read

every clock cycle No separate read signal for register file as it is read every

clock cycle Need to set control signals during each pipeline stage Since control signals are associated with components

active during a single pipeline stage, can group control lines into five groups according to pipeline stage

Will bemodifiedby hazarddetectionunit!!

Page 38: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined Datapath with Control I

PC

Instructionmemory

Address

Inst

ruct

ion

Instruction[20– 16]

MemtoReg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0Registers

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1Write

data

Read

data Mux

1

ALUcontrol

RegWrite

MemRead

Instruction[15– 11]

6

IF/ID ID/EX EX/MEM MEM/WB

MemWrite

Address

Datamemory

PCSrc

Zero

Add Addresult

Shiftleft 2

ALUresult

ALUZero

Add

0

1

Mux

0

1

Mux

Same controlsignals as thesingle-cycledatapath

Page 39: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

There are five stages in the pipeline instruction fetch / PC increment instruction decode / register fetch execution / address calculation memory access write back

Pipeline Control Signals

Execution/Address Calculation stage control lines

Memory access stage control lines

Write-back stage control

lines

InstructionReg Dst

ALU Op1

ALU Op0

ALU Src Branch

Mem Read

Mem Write

Reg write

Mem to Reg

R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X

Nothing to control as instruction memoryread and PC write are always enabled

Page 40: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pass control signals along just like the data – extend each pipeline register to hold needed control bits for succeeding stages

Note: The 6-bit funct field of the instruction required in the EX stage to generate ALU control can be retrieved as the 6 least significant bits of the immediate field which is sign-extended and passed from the IF/ID register to the ID/EX register

Pipeline Control Implementation

Control

EX

M

WB

M

WB

WB

IF/ID ID/EX EX/MEM MEM/WB

Instruction

Page 41: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined Datapath with Control II

PC

Instructionmemory

Inst

ruct

ion

Add

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Mem

Writ

e

AddressData

memory

Address

Control signalsemanate from the controlportions of the pipeline registers

Page 42: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

PipelinedExecutionandControl

Instruction sequence:

Instructionmemory

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

Instruction[15– 0]

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

MEM/WB

IF: lw $10, 20($1)

000

00

0000

000

00

000

0

00

00

0

00

Mux

0

1

Add

PC

0

Datamemory

Address

Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

Mux

0

1

Add Addresult

Writeregister

Writedata

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Reg

Writ

e

ALU

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

MEM/WB

IF: sub $11, $2, $3

010

11

0001

000

00

000

0

00

00

0

00

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

lwControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

X

10

20

X

1

Instruction[20– 16]

Instruction[15– 0] Sign

extend

Instruction[15– 11]

20

$X

$1

10

X

Mem

Writ

e

MemRead

Mem

Writ

e

Datamemory

Address

Address

Address

Clock 2

Clock 1

lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9

Label “before<i>” meansi th instruction beforelw

Clock cycle 1

Clock cycle 2

Page 43: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

PipelinedExecutionandControl

Instruction sequence:

Instructionmemory

Address

Instruction[20– 16]

Mem

toR

eg

Branch

ALUSrc

4

Instruction[15– 0]

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

MEM/WB

IF: and $12, $4, $5

000

10

1100

010

11

000

1

00

00

0

00

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

Writeregister

Writedata 1

ALUresult

ALUcontrol

Shiftleft 2

Reg

Writ

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

MEM/WB

IF: or $13, $6, $7

000

10

1100

000

10

101

0

11

10

0

00

Mux

0

1

Add

PC

0Writedata

Mux

1

andControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

12

X

X

5

4

Instruction[20– 16]

Instruction[15– 0]

Instruction[15– 11]

X

$5

$4

X

12

Mem

Writ

e

MemRead

Mem

Writ

e

sub

11

X

X

3

2

X

$3

$2

X

11

$1

20

10

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$3

$2

11

Mux

Mux

ALUAddress Read

dataData

memory

10

WB

Zero

Zero

Signextend

Signextend

Datamemory

Address

Clock 3

Clock 4

lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9

Clock cycle 3

Clock cycle 4

Page 44: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

PipelinedExecutionandControl

Instruction sequence:

Instructionmemory

Address

Instruction[20– 16]

Branch

ALUSrc

4

Instruction[15– 0]

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .

MEM/WB

IF: add $14, $8, $9

000

10

1100

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

ALUcontrol

Shiftleft 2

Reg

Writ

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .

MEM/WB

IF: after<1>

000

10

1100

000

10

101

0

10

00

0

10

Mux

0

1

Add

PC

0Writedata

Mux

1

addControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

14

X

X

9

8

Instruction[20– 16]

Instruction[15– 0]

Instruction[15– 11]

X

$9

$8

X

14

Mem

Writ

e

MemRead

Mem

Writ

e

or

13

X

X

7

6

X

$7

$6

X

13

$4

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$7

$6

13

Mux

Mux

ALUReaddata

12

WB

11 10

10$5

12

WB

Mem

toR

eg

11

11

11

Writeregister

Writedata

Zero

Zero

Datamemory

Address

Datamemory

Address

Signextend

Signextend

Clock 5

Clock 6

lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9

Clock cycle 5

Clock cycle 6

Label “after<i>” meansi th instruction after add

Page 45: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

PipelinedExecutionandControl

Instruction sequence:

Instructionmemory

Address

Instruction[20– 16]

Branch

ALUSrc

4

Instruction[15– 0]

0

1

Add Addresult

RegistersWriteregister

Writedata

ALUresult

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

Signextend

EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<1> EX: add $14, . . . MEM: or $13, . . . WB: and $12, . . .

MEM/WB

IF: after<2>

000

00

0000

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Reg

Writ

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<2> EX: after<1> MEM: add $14, . . . WB: or $13, . . .

MEM/WB

IF: after<3>

000

00

0000

000

00

000

0

10

00

0

10

Mux

0

1

Add

PC

0Writedata

Mux

1

Control

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Instruction[20– 16]

Instruction[15– 0] Sign

extend

Instruction[15– 11]

Mem

Writ

e

MemRead

Mem

Writ

e

$8

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

Mux

Mux

ALUReaddata

14

WB

13 12

12$9

14

WB

Mem

toR

eg

10

13

13

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2 Zero

Datamemory

Address

Datamemory

Address

Clock 7

Clock 8

lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9

Clock cycle 8

Clock cycle 7

Page 46: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined Execution and Control

Instruction sequence:

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Reg

Writ

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<3> EX: after<2> MEM: after<1> WB: add $14, . . .

MEM/WB

IF: after<4>

000

00

0000

000

00

000

0

00

00

0

10

Mux

0

1

Add

PC

0Writedata

Mux

1

Control

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Instruction[20– 16]

Instruction[15– 0] Sign

extend

Instruction[15– 11]

MemRead

Mem

Writ

e

Mux

Mux

ALUReaddata

WB

14

14

Writeregister

Writedata

Datamemory

Address

Clock 9

lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9

Clock cycle 9

Page 47: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Revisiting Hazards So far our datapath and control have ignored

hazards We shall revisit data hazards and control hazards

and enhance our datapath and control to handle them in hardware…

Page 48: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Problem with starting an instruction before previous are finished: data dependencies that go backward in time – called data hazards

Data Hazards and Forwarding

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecutionorder(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

$2 = 10 before sub;$2 = -20 after sub

Page 49: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Have compiler guarantee never any data hazards! by rearranging instructions to insert independent instructions

between instructions that would otherwise have a data hazard between them,

or, if such rearrangement is not possible, insert nops

Such compiler solutions may not always be possible, and nops slow the machine down

Software Solution

sub $2, $1, $3 lw $10, 40($3) slt $5, $6, $7

and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

sub $2, $1, $3 nop nop

and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

or

MIPS: nop = “no operation” = 00…0 (32bits) = sll $0, $0, 0

Page 50: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Hardware Solution: Forwarding Idea: use intermediate data, do not wait for result to

be finally written to the destination register. Two steps:

1. Detect data hazard2. Forward intermediate data to resolve hazard

Page 51: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined Datapath with Control II (as before)

PC

Instructionmemory

Inst

ruct

ion

Add

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Mem

Writ

e

AddressData

memory

Address

Control signalsemanate from the controlportions of the pipeline registers

Page 52: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Hazard Detection Hazard conditions:1a. EX/MEM.RegisterRd = ID/EX.RegisterRs1b. EX/MEM.RegisterRd = ID/EX.RegisterRt2a. MEM/WB.RegisterRd = ID/EX.RegisterRs2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

Eg., in the earlier example, first hazard between sub $2, $1, $3 and and $12, $2, $5 is detected when the and is in EX stage and the sub is in MEM stage because

EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)

Whether to forward also depends on: if the later instruction is going to write a register – if not, no need to

forward, even if there is register number match as in conditions above if the destination register of the later instruction is $0 – in which case there is no need to forward value ($0 is always 0 and never

overwritten)

Page 53: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Data Forwarding Plan:

allow inputs to the ALU not just from ID/EX, but also later pipeline registers, and

use multiplexors and control signals to choose appropriate inputs to ALU

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

Dependencies between pipelines move forward in time

Page 54: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

ForwardingHardware

Registers

Mux M

ux

ALU

ID/EX MEM/WB

Datamemory

Mux

Forwardingunit

EX/MEM

b. With forwarding

ForwardB

RdEX/MEM.RegisterRd

MEM/WB.RegisterRd

RtRtRs

ForwardA

Mux

ALU

ID/EX MEM/WB

Datamemory

EX/MEM

a. No forwarding

Registers

Mux

Datapath before adding forwarding hardware

Datapath after adding forwarding hardware

Page 55: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Forwarding Hardware: Multiplexor Control

Mux control Source ExplanationForwardA = 00 ID/EX The first ALU operand comes from the register fileForwardA = 10 EX/MEM The first ALU operand is forwarded from prior ALU

resultForwardA = 01 MEM/WB The first ALU operand is forwarded from data

memory or an earlier ALU resultForwardB = 00 ID/EX The second ALU operand comes from the register

fileForwardB = 10 EX/MEM The second ALU operand is forwarded from prior

ALU resultForwardB = 01 MEM/WB The second ALU operand is forwarded from data

memory or an earlier ALU resultDepending on the selection in the rightmost multiplexor

(see datapath with control diagram)

Page 56: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Data Hazard: Detection and Forwarding

Forwarding unit determines multiplexor control according to the following rules:

1. EX hazard if ( EX/MEM.RegWrite // if there is a write… and ( EX/MEM.RegisterRd 0 ) // to a non-$0

register… and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) ) // which matches,

then… ForwardA = 10

if ( EX/MEM.RegWrite // if there is a write… and ( EX/MEM.RegisterRd 0 ) // to a non-$0

register… and ( EX/MEM.RegisterRd = ID/EX.RegisterRt ) ) // which matches,

then… ForwardB = 10

Page 57: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Data Hazard: Detection and Forwarding

2. MEM hazard if ( MEM/WB.RegWrite // if there is a write… and ( MEM/WB.RegisterRd 0 ) // to a non-$0 register… and ( EX/MEM.RegisterRd ID/EX.RegisterRs ) // and not already a register match // with earlier pipeline register… and ( MEM/WB.RegisterRd = ID/EX.RegisterRs ) ) // but match with later pipeline register, then… ForwardA = 01

if ( MEM/WB.RegWrite // if there is a write… and ( MEM/WB.RegisterRd 0 ) // to a non-$0 register… and ( EX/MEM.RegisterRd ID/EX.RegisterRt ) // and not already a register match // with earlier pipeline register… and ( MEM/WB.RegisterRd = ID/EX.RegisterRt ) ) // but match with later pipeline register, then… ForwardB = 01

This check is necessary, e.g., for sequences such as add $1, $1, $2; add $1, $1, $3; add $1, $1, $4; (array summing?), where an earlier pipeline (EX/MEM) register has more recent data

Page 58: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Forwarding Hardware with Control

PC Instructionmemory

Registers

Mux

Mux

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Datamemory

Mux

Forwardingunit

IF/ID

Inst

ruct

ion

Mux

RdEX/MEM.RegisterRd

MEM/WB.RegisterRd

Rt

Rt

Rs

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

Datapath with forwarding hardware and control wires – certain details,e.g., branching hardware, are omitted to simplify the drawingNote: so far we have only handled forwarding to R-type instructions…!

Called forwarding unit, not hazard detection unit, because once data is forwarded there is no hazard!

Page 59: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

ForwardingPC Instruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5 sub $2, $1, $3

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

10 10

$2

$5

52

4

$1

$3

31

2

Control

ALU

PC Instructionmemory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

or $4, $4, $2 and $4, $2, $5

ID/EX

sub $2, . . .

EX/MEM

before<1>

MEM/WB

add $9, $4, $2

Clock 4

4

6

10 10

$4

$2

62

4

$2

$5

52

4

Control

ALU

10

2

WB

M

WB

sub $2, $1, $3and $4, $2, $5or $4, $4, $2add $9, $4, $2

Execution example:

Clock cycle 3

Clock cycle 4

Page 60: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Forwarding

Execution example (cont.):

PC Instructionmemory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

add $9, $4, $2 or $4, $4, $2

ID/EX

and $4, . . .

EX/MEM

sub $2, . . .

MEM/WB

after<1>

Clock 5

4

2

10 10

$4

$2

24

9

$4

$2

4

2

24

Control

ALU

10

WB

2

1

PC Instructionmemory

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

after<1>after<2> add $9, $4, $2 or $4, . . .

EX/MEM

and $4, . . .

MEM/WB

Clock 6

10

$4

$2

24

9

ALU

10

4

4

WB

4

1

Registers

Inst

ruct

ion

IF/ID

ID/EX

4

Controlsub $2, $1, $3and $4, $2, $5or $4, $4, $2add $9, $4, $2

Clock cycle 5

Clock cycle 6

Page 61: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register

therefore, we need a hazard detection unit to stall the pipeline after the load instruction

Data Hazards and Stalls

Reg

IM

Reg

Reg

IM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

lw $2, 20($1)and $4, $2, $5or $8, $2, $6add $9, $4, $2Slt $1, $6, $7

As even a pipelinedependency goesbackward in timeforwarding will notsolve the hazard

Page 62: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Pipelined Datapath with Control II (as before)

PC

Instructionmemory

Inst

ruct

ion

Add

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

Reg

Writ

e

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Mem

Writ

e

AddressData

memory

Address

Control signalsemanate from the controlportions of the pipeline registers

Page 63: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Hazard Detection Logic to Stall

Hazard detection unit implements the following check if to stall

if ( ID/EX.MemRead // if the instruction in the EX stage is a load…

and ( ( ID/EX.RegisterRt = IF/ID.RegisterRs ) // and the destination register

or ( ID/EX.RegisterRt = IF/ID.RegisterRt ) ) ) // matches either source register

// of the instruction in the ID stage, then…

stall the pipeline

Page 64: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Mechanics of Stalling If the check to stall verifies, then the pipeline needs to

stall only 1 clock cycle after the load as after that the forwarding unit can resolve the dependency

What the hardware does to stall the pipeline 1 cycle: does not let the IF/ID register change (disable write!) – this

will cause the instruction in the ID stage to repeat, i.e., stall therefore, the instruction, just behind, in the IF stage must

be stalled as well – so hardware does not let the PC change (disable write!) – this will cause the instruction in the IF stage to repeat, i.e., stall

changes all the EX, MEM and WB control fields in the ID/EX pipeline register to 0, so effectively the instruction just behind the load becomes a nop – a bubble is said to have been inserted into the pipeline

note that we cannot turn that instruction into an nop by 0ing all the bits in the instruction itself – recall nop = 00…0 (32 bits) – because it has already been decoded and control signals generated

Page 65: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Hazard Detection Unit

PC Instructionmemory

Registers

Mux

Mux

Mux

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

0

Mux

IF/ID

Inst

ruct

ion

ID/EX.MemRead

IF/ID

Writ

e

PC

Writ

e

ID/EX.RegisterRt

IF/ID.RegisterRd

IF/ID.RegisterRtIF/ID.RegisterRtIF/ID.RegisterRs

RtRs

Rd

Rt EX/MEM.RegisterRd

MEM/WB.RegisterRd

Datapath with forwarding hardware, the hazard detection unit and controls wires – certain details, e.g., branching hardware are omitted to simplify the drawing

Page 66: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Stalling Resolves a Hazard Same instruction sequence as before for which

forwarding by itself could not resolve the hazard:

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble

lw $2, 20($1)and $4, $2, $5or $8, $2, $6add $9, $4, $2Slt $1, $6, $7

Hazard detection unit inserts a 1-cycle bubble in the pipeline, afterwhich all pipeline register dependencies go forward so then the forwarding unit can handle them and there are no more hazards

Page 67: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Stalling Execution example:

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

lw $2, 20($1)

PC Instructionmemory

Registers

Mux

Mux

Mux

EX

M

WB

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

2

500 11

$2

$5

52

4

$1

$X

X1

2

Control

ALU

M

WB

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

ID/EX.MemRead

ID/EX.MemRead

M

WB

$1

$X

X

1

2

before<3>

PC Instructionmemory

Registers

Mux

Mux

Mux

EX WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

ID/EX

EX/MEM

MEM/WB

and $4, $2, $5 lw $2, 20($1) before<1> before<2>

Clock 2

1

1

X

X11

Control

ALU

M

WB

lw $2, 20($1)and $4, $2, $5or $4, $4, $2add $9, $4, $2

Clock cycle 2

Clock cycle 3

Page 68: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Stalling Execution example (cont.):

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

$2

$5

52

2

2

4

WB

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

eID/EX.RegisterRt

PC Instructionmemory

Registers

Mux

Mux

Mux

EX

M

WB

Datamemory

Mux

Inst

ruct

ion

IF/ID

and $4, $2, $5 bubble

ID/EX

lw $2, . . .

EX/MEM

before<1>

MEM/WB

Clock 4

2

2

5

510

11

00

$2

$5

52

4

Control

ALU

M

WB

bubble lw $2, . . .

PC Instructionmemory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

EX/MEM

MEM/WB

add $9, $4, $2

Clock 5

2

210 10

11

$4

$2

2

4

4

4

2

4

$2

$5

5

2

4

Control

ALU

0

WB

ID/EX.MemRead

ID/EX.MemRead

or $4, $4, $2

or $4, $4, $2

lw $2, 20($1)and $4, $2, $5or $4, $4, $2add $9, $4, $2

Clock cycle 4

Clock cycle 5

Page 69: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Stalling Execution example (cont.):

Registers

Inst

ruct

ion

ID/EX

4

Control

PC Instructionmemory

PC Instructionmemory

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

e

IF/ID

Writ

e

PC

Writ

e

ID/EX.RegisterRt

bubble

Registers

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Inst

ruct

ion

IF/ID

add $9, $4, $2

ID/EX

and $4, . . .

EX/MEM

MEM/WB

Clock 6

4

4

2

210 10

$4

$2

24

49

$2

2

Control

ALU

10

WB0

add $9, $4, $2 or $4, . . . and $4, . . .after<2> after<1>

after<1>

Clock 7

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Forwardingunit

EX/MEM

MEM/WB

10 10

$4

$4

$2

24

4

9

ALU

10

WB

44

4

1

Hazarddetection

unit

0

Mux

ID/EX.RegisterRt

or $4, $4, $2

ID/EX.MemRead

ID/EX.MemRead

Mux

IF/ID

lw $2, 20($1)and $4, $2, $5or $4, $4, $2add $9, $4, $2

Clock cycle 6

Clock cycle 7

Page 70: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Problem with branches in the pipeline we have so far is that the branch decision is not made till the MEM stage – so what instructions, if at all, should we insert into the pipeline following the branch instructions?

Possible solution: stall the pipeline till branch decision is known not efficient, slow the pipeline significantly!

Another solution: predict the branch outcome e.g., always predict branch-not-taken – continue with next sequential

instructions if the prediction is wrong have to flush the pipeline behind the branch – discard

instructions already fetched or decoded – and continue execution at the branch target

Control (or Branch) Hazards

Page 71: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Predicting Branch-not-taken:Misprediction delay

Reg

Reg

CC 1

Time (in clock cycles)

40 beq $1, $3, 7

Programexecutionorder(in instructions)

IM Reg

IM DM

IM DM

IM DM

DM

DM Reg

Reg Reg

Reg

Reg

RegIM

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9

Reg

The outcome of branch taken (prediction wrong) is decided only when beq is in the MEM stage, so the following three sequential instructionsalready in the pipeline have to be flushed and execution resumes at lw

Page 72: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Optimizing the Pipeline to Reduce Branch Delay

Move the branch decision from the MEM stage (as in our current pipeline) earlier to the ID stage

calculating the branch target address involves moving the branch adder from the MEM stage to the ID stage – inputs to this adder, the PC value and the immediate fields are already available in the IF/ID pipeline register

calculating the branch decision is efficiently done, e.g., for equality test, by XORing respective bits and then ORing all the results and inverting, rather than using the ALU to subtract and then test for zero (when there is a carry delay)

with the more efficient equality test we can put it in the ID stage without significantly lengthening this stage – remember an objective of pipeline design is to keep pipeline stages balanced

we must correspondingly make additions to the forwarding and hazard detection units to forward to or stall the branch at the ID stage in case the branch decision depends on an earlier result

Page 73: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Flushing on Misprediction Same strategy as for stalling on load-use data

hazard… Zero out all the control values (or the instruction itself)

in pipeline registers for the instructions following the branch that are already in the pipeline – effectively turning them into nops – so they are flushed

in the optimized pipeline, with branch decision made in the ID stage, we have to flush only one instruction in the IF stage – the branch delay penalty is then only one clock cycle

Page 74: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Optimized Datapath for Branch

PC Instructionmemory

4

Registers

Mux

Mux

Mux

ALU

EX

M

WB

M

WB

WB

ID/EX

0

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

Signextend

Control

Mux

=

Shiftleft 2

Mux

Branch decision is moved from the MEM stage to the ID stage – simplified drawing not showing enhancements to the forwarding and hazard detection units

IF.Flush control zeros out the instruction in the IF/ID pipeline register (which follows the branch)

Page 75: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

PipelinedBranch

Execution example:

PC Instructionmemory

4

Registers

Signextend

Mux

Mux

Control

EX

M

WB

M

WB

WB

Mux

Hazarddetection

unit

Forwardingunit

Mux

IF.Flush

IF/ID

and $12, $2, $5 beq $1, $3, 7 sub $10, $4, $8

MEM/WB

EX/MEM

ID/EX

Clock 3

72 44

48 44

28

7

$1

$3

10

48

72

72

0

Mux

0

$4

$8

ALU Datamemory

bubble (nop)lw $4, 50($7)

Clock 4

Mux

Shiftleft 2

before<1>

beq $1, $3, 7 sub $10, . . . before<1>

before<2>

=

PC Instructionmemory

4

Registers

Signextend

Mux

Mux

Control

EX

M

WB

M

WB

WB

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

MEM/WB

EX/MEM

ID/EX

76 72

76 72

$1

$3

10

76

ALU Datamemory

Mux

Shiftleft 2

=

36 sub $10, $4, $840 beq $1, $3, 744 and $12 $2, $548 or $13 $2, $652 add $14, $4, $256 slt $15, $6, $7…72 lw $4, 50($7)

Clock cycle 4

Clock cycle 3

Optimized pipeline withonly one bubble as a resultof the taken branch

Page 76: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Simple Example: Comparing Performance Compare performance for single-cycle, multicycle, and pipelined

datapaths using the gcc instruction mix assume 2 ns for memory access, 2 ns for ALU operation,

1 ns for register read or write assume gcc instruction mix 23% loads, 13% stores, 19%

branches, 2% jumps, 43% ALU for pipelined execution assume

50% of the loads are followed immediately by an instruction that uses the result of the load

25% of branches are mispredicted branch delay on misprediction is 1 clock cycle jumps always incur 1 clock cycle delay so their average

time is 2 clock cycles

Page 77: Computer Organization and Architecture (AT70. 01) Comp. Sc. and Inf

Simple Example: Comparing Performance Single-cycle (p. 373): average instruction time 8 ns Multicycle (p. 397): average instruction time 8.04 ns Pipelined:

loads use 1 cc (clock cycle) when no load-use dependency and 2 cc when there is dependency – given 50% of loads are followed by dependency the average cc per load is 1.5

stores use 1 cc each branches use 1 cc when predicted correctly and 2 cc when

not – given 25% misprediction average cc per branch is 1.25 jumps use 2 cc each ALU instructions use 1 cc each therefore, average CPI is1.5 23% + 1 13% + 1.25 19% + 2 2% + 1 43% =

1.18 therefore, average instruction time is 1.18 2 = 2.36 ns