computer organization and architecture (at70. 01) comp. sc. and inf
DESCRIPTION
COD Ch. 6 Enhancing Performance with PipeliningTRANSCRIPT
Computer Organization Computer Organization and Architecture (AT70.01)and Architecture (AT70.01)Comp. Sc. and Inf. Mgmt.Comp. Sc. and Inf. Mgmt.Asian Institute of TechnologyAsian Institute of TechnologyInstructor: Dr. Sumanta GuhaSlide Sources: Patterson &Hennessy COD book site(copyright Morgan Kaufmann)adapted and supplemented
COD Ch. 6Enhancing Performance with Pipelining
Pipelining Start work ASAP!! Do not waste time!
Time76 PM 8 9 10 11 12 1 2 AM
A
B
C
D
Time76 PM 8 9 10 11 12 1 2 AM
A
B
C
D
Taskorder
Taskorder
Assume 30 min. each task – wash, dry, fold, store – and that separate tasks use separate hardware and so can be overlapped
Pipelined
Not pipelined
Pipelined vs. Single-Cycle Instruction Execution: the Plan
Instructionfetch Reg ALU Data
access Reg
8 ns Instructionfetch Reg ALU Data
access Reg
8 ns Instructionfetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 ns Instructionfetch Reg ALU Data
access Reg
2 ns Instructionfetch Reg ALU Data
access Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
Single-cycle
Pipelined
Assume 2 ns for memory access, ALU operation; 1 ns for register access:therefore, single cycle clock 8 ns; pipelined clock cycle 2 ns.
Pipelining: Keep in Mind Pipelining does not reduce latency of a single
task, it increases throughput of entire workload Pipeline rate limited by longest stage
potential speedup = number pipe stages unbalanced lengths of pipe stages reduces speedup
Time to fill pipeline and time to drain it – when there is slack in the pipeline – reduces speedup
Example Problem Problem: for the laundry fill in the following table when
1. the stage lengths are 30, 30, 30 30 min., resp.2. the stage lengths are 20, 20, 60, 20 min., resp.
Come up with a formula for pipeline speed-up!
Person Unpipelined Pipeline 1 Ratio unpipelined Pipeline 2 Ratio unpiplelined finish time finish time to pipeline 1 finish time to pipeline 2 1 2 3 4
n
Pipelining MIPS What makes it easy with MIPS?
all instructions are same length so fetch and decode stages are similar for all instructions
just a few instruction formats simplifies instruction decode and makes it possible in one
stage memory operands appear only in load/stores
so memory access can be deferred to exactly one later stage
operands are aligned in memory one data transfer instruction requires one memory access
stage
Pipelining MIPS What makes it hard?
structural hazards: different instructions, at different stages, in the pipeline want to use the same hardware resource
control hazards: succeeding instruction, to put into pipeline, depends on the outcome of a previous branch instruction, already in pipeline
data hazards: an instruction in the pipeline requires data to be computed by a previous instruction still in the pipeline
Before actually building the pipelined datapath and control we first briefly examine these potential hazards individually…
Structural Hazards Structural hazard: inadequate hardware to simultaneously
support all instructions in the pipeline in the same clock cycle E.g., suppose single – not separate – instruction and data
memory in pipeline below with one read port then a structural hazard between first and fourth lw instructions
MIPS was designed to be pipelined: structural hazards are easy to avoid!
2 4 6 8 10 12 14
Instructionfetch Reg ALU Data
access Reg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 ns Instructionfetch Reg ALU Data
access Reg
2 ns Instructionfetch Reg ALU Data
access Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
Pipelined
Instructionfetch Reg ALU Data
access Reg2 nslw $4, 400($0)
Hazard if single memory
Control Hazards Control hazard: need to make a decision based on the
result of a previous instruction still executing in pipeline Solution 1 Stall the pipeline
Instructionfetch Reg ALU Data
access Reg
Time
beq $1, $2, 40
add $4, $5, $6
lw $3, 300($0)
4 ns
Instructionfetch Reg ALU Data
access Reg2ns
Instructionfetch Reg ALU Data
access Reg
2ns
2 4 6 8 10 12 14 16Programexecutionorder(in instructions)
Pipeline stall
bubble
Note that branch outcome iscomputed in ID stage withadded hardware (later…)
Control Hazards Solution 2 Predict branch outcome
e.g., predict branch-not-taken :
Instructionfetch Reg ALU Data
access Reg
Time
beq $1, $2, 40
add $4, $5, $6
lw $3, 300($0)
Instructionfetch Reg ALU Data
access Reg2 ns
Instructionfetch Reg ALU Data
access Reg2 ns
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
beq $1, $2, 40
add $4, $5 ,$6
or $7, $8, $9
Instructionfetch Reg ALU Data
access Reg
2 4 6 8 10 12 14
2 4 6 8 10 12 14
Instructionfetch Reg ALU Data
access Reg
2 ns
4 ns
bubble bubble bubble bubble bubble
Programexecutionorder(in instructions)
Prediction success
Prediction failure: undo (=flush) lw
Control Hazards Solution 3 Delayed branch: always execute the
sequentially next statement with the branch executing after one instruction delay – compiler’s job to find a statement that can be put in the slot that is independent of branch outcome
MIPS does this – but it is an option in SPIM (Simulator -> Settings)
Instructionfetch Reg ALU Data
access Reg
Time
beq $1, $2, 40
add $4, $5, $6
lw $3, 300($0)
Instructionfetch Reg ALU Data
access Reg2 ns
Instructionfetch Reg ALU Data
access Reg
2 ns
2 4 6 8 1 0 12 14
2 ns
(d elayed branch slot)
Programexecutionorder(in instructions)
Delayed branch beq is followed by add that isindependent of branch outcome
Data Hazards Data hazard: instruction needs data from the result of a
previous instruction still executing in pipeline Solution Forward data if possible…
Time2 4 6 8 10
add $s0, $t0, $t1 IF ID WBEX MEM
add $s0, $t0, $t1
sub $t2, $s0, $t3
Programexecutionorder(in instructions)
IF ID WBEX
IF ID MEMEX
Time2 4 6 8 10
MEM
WBMEM
Instruction pipeline diagram:shade indicates use – left=write, right=read
Without forwarding – blue line –data has to go back in time;with forwarding – red line – data is available in time
Data Hazards Forwarding may not be enough
e.g., if an R-type instruction following a load uses the result of the load – called load-use data hazard
Time2 4 6 8 10 12 14
lw $s0, 20($t1)
sub $t2, $s0, $t3
Programexecutionorder(in instructions)
IF ID WBMEMEX
IF ID WBMEMEX
Time2 4 6 8 10 12 14
lw $s0, 20($t1)
sub $t2, $s0, $t3
Programexecutionorder(in instructions)
IF ID WBMEMEX
IF ID WBMEMEX
bubble bubble bubble bubble bubble
With a one-stage stall, forwardingcan get the data to the sub instruction in time
Without a stall it is impossibleto provide input to the subinstruction in time
Reordering Code to Avoid Pipeline Stall (Software Solution)
Example:lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)
Reordered code:lw $t0, 0($t1)lw $t2, 4($t1)sw $t0, 4($t1)sw $t2, 0($t1)
Data hazard
Interchanged
Pipelined Datapath We now move to actually building a pipelined datapath First recall the 5 steps in instruction execution
1. Instruction Fetch & PC Increment (IF)2. Instruction Decode and Register Read (ID)3. Execution or calculate address (EX)4. Memory access (MEM)5. Write result into register (WB)
Review: single-cycle processor all 5 steps done in a single clock cycle dedicated hardware required for each step
What happens if we break the execution into multiple cycles, but keep the extra hardware?
Review - Single-Cycle Datapath “Steps”
5 516
RD1
RD2
RN1 RN2 WN
WDRegister File ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
5
Instruction I32
MUX
<<2RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
32
IFInstruction Fetch
IDInstruction Decode
EXExecute/ Address Calc.
MEMMemory Access
WBWrite Back
Zero
Pipelined Datapath – Key Idea What happens if we break the execution into multiple cycles, but keep
the extra hardware? Answer: We may be able to start executing a new instruction at each
clock cycle - pipelining …but we shall need extra registers to hold data
between cycles – pipeline registers
Pipelined Datapath
IF/ID
Pipeline registers
5 516
RD1
RD2
RN1 RN2 WN
WDRegister File ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
5
Instruction I32
MUX
<<2RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
32
ID/EX EX/MEM MEM/WB
Zero
64 bits97 bits 64 bits
128 bits
wide enough to hold data coming in
Pipelined Datapath
IF/ID
Pipeline registers
5 516
RD1
RD2
RN1 RN2 WN
WDRegister File ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
5
Instruction I32
MUX
<<2RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
32
ID/EX EX/MEM MEM/WB
Zero
64 bits97 bits 64 bits
128 bits
wide enough to hold data coming in
Only data flowing right to left may cause hazard…, why?
Bug in the Datapath
5 516
RD1
RD2
RN1 RN2 WN
WDRegister File ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
5
Instruction I32
MUX
<<2RD
InstructionMemory
ADDR
PC
4
ADD
ADD
MUX
32
Write register number comes from another later instruction!
IF/ID ID/EX EX/MEM MEM/WB
Corrected Datapath
5
RD1
RD2
RN1
RN2
WNWD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RDInstruction
Memory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
EX/MEM MEM/WB
Zero
ID/EXIF/ID
64 bits 133 bits102 bits 69 bits
Destination register number is also passed through ID/EX, EX/MEMand MEM/WB registers, which are now wider by 5 bits
Pipelined Example Consider the following instruction sequence: lw $t0, 10($t1) sw $t3, 20($t4) add $t5, $t6, $t7 sub $t8, $t9, $t10
Single-Clock-Cycle Diagram:Clock Cycle 1LW
5
RD1
RD2
RN1
RN2
WNWD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RDInstruction
Memory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
5
RD1
RD2
RN1
RN2
WNWD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RDInstruction
Memory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
Single-Clock-Cycle Diagram: Clock Cycle 2LWSW
5
RD1
RD2
RN1
RN2
WNWD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RDInstruction
Memory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
Single-Clock-Cycle Diagram: Clock Cycle 3 LWSWADD
5
RD1
RD2
RN1
RN2
WNWD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RDInstruction
Memory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
Single-Clock-Cycle Diagram: Clock Cycle 4 LWSWADDSUB
Single-Clock-Cycle Diagram: Clock Cycle 5
5
RD1
RD2
RN1
RN2
WNWD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RDInstruction
Memory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
LWSWADDSUB
Single-Clock-Cycle Diagram: Clock Cycle 6
5
RD1
RD2
RN1
RN2
WNWD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RDInstruction
Memory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
SWADDSUB
Single-Clock-Cycle Diagram: Clock Cycle 7
5
RD1
RD2
RN1
RN2
WNWD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RDInstruction
Memory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
ADDSUB
Single-Clock-Cycle Diagram: Clock Cycle 8
5
RD1
RD2
RN1
RN2
WNWD
RegisterFile
ALU
EXTND
16 32
RD
WD
DataMemory
ADDR
32
MUX
<<2
RDInstruction
Memory
ADDR
PC
4
ADD
ADD
MUX
5
5
5
IF/ID ID/EX EX/MEM MEM/WB
Zero
SUB
Alternative View – Multiple-Clock-Cycle Diagram
IM REG ALU DM REGlw $t0, 10($t1)
sw $t3, 20($t4)
add $t5, $t6, $t7
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7
IM REG ALU DM REG
IM REG ALU DM REG
sub $t8, $t9, $t10 IM REG ALU DM REG
CC 8
Time axis
Notes One significant difference in the execution of an R-type
instruction between multicycle and pipelined implementations: register write-back for the R-type instruction is the 5th (the last
write-back) pipeline stage vs. the 4th stage for the multicycle implementation. Why?
think of structural hazards when writing to the register file… Worth repeating: the essential difference between the pipeline
and multicycle implementations is the insertion of pipeline registers to decouple the 5 stages
The CPI of an ideal pipeline (no stalls) is 1. Why? The RaVi Architecture Visualization Project of Dortmund U. has
pipeline simulations – see link in our Additional Resources page As we develop control for the pipeline keep in mind that the
text does not consider jump – should not be too hard to implement!
Recall Single-Cycle Control – the Datapath
PC
Instructionmemory
Readaddress
Instruction[31– 0]
Instruction [20 16]
Instruction [25 21]
Add
Instruction [5 0]
MemtoRegALUOpMemWrite
RegWrite
MemReadBranchRegDst
ALUSrc
Instruction [31 26]
4
16 32Instruction [15 0]
0
0Mux
0
1
Control
Add ALUresult
Mux
0
1
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
PCSrc
Datamemory
Writedata
Readdata
Mux
1
Instruction [15 11]
ALUcontrol
Shiftleft 2
ALUAddress
Recall Single-Cycle – ALU Control
Instruction AluOp Instruction Funct Field Desired ALU control
opcode operation ALU action inputLW 00 load word xxxxxx add 010SW 00 store word xxxxxx add 010Branch eq 01 branch eq xxxxxx subtract 110R-type 10 add 100000 add 010R-type 10 subtract 100010 subtract 110R-type 10 AND 100100 and 000R-type 10 OR 100101 or 001R-type 10 set on less 101010 set on less 111
Truth table for ALU control bits
ALUOp Funct field OperationALUOp1 ALUOp0 F5 F4 F3 F2 F1 F0
0 0 X X X X X X 0100 1 X X X X X X 1101 X X X 0 0 0 0 0101 X X X 0 0 1 0 1101 X X X 0 1 0 0 0001 X X X 0 1 0 1 0011 X X X 1 0 1 0 111
Recall Single-Cycle – Control Signals
Signal Name Effect when deasserted Effect when asserted
RegDst The register destination number for the The register destination number for the Write register comes from the rt field (bits 20-16) Write register comes from the rd field (bits 15-11)RegWrite None The register on the Write register input is written
with the value on the Write data inputAlLUSrc The second ALU operand comes from the The second ALU operand is the sign-extended, second register file output (Read data 2) lower 16 bits of the instructionPCSrc The PC is replaced by the output of the adder The PC is replaced by the output of the adder
that computes the value of PC + 4 that computes the branch targetMemRead None Data memory contents designated by the address
input are put on the first Read data outputMemWrite None Data memory contents designated by the address
input are replaced by the value of the Write data inputMemtoReg The value fed to the register Write data input The value fed to the register Write data input comes from the ALU comes from the data memory
Instruction RegDst ALUSrcMemto-
RegReg
WriteMem Read
Mem Write Branch ALUOp1 ALUp0
R-format 1 0 0 1 0 0 0 1 0lw 0 1 1 1 1 0 0 0 0sw X 1 X 0 0 1 0 0 0beq X 0 X 0 0 0 1 0 1
Effect of control bits
Deter-miningcontrolbits
Pipeline Control Initial design – motivated by single-cycle datapath control
– use the same control signals Observe:
No separate write signal for the PC as it is written every cycle No separate write signals for the pipeline registers as they
are written every cycle No separate read signal for instruction memory as it is read
every clock cycle No separate read signal for register file as it is read every
clock cycle Need to set control signals during each pipeline stage Since control signals are associated with components
active during a single pipeline stage, can group control lines into five groups according to pipeline stage
Will bemodifiedby hazarddetectionunit!!
Pipelined Datapath with Control I
PC
Instructionmemory
Address
Inst
ruct
ion
Instruction[20– 16]
MemtoReg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0Registers
Writeregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1Write
data
Read
data Mux
1
ALUcontrol
RegWrite
MemRead
Instruction[15– 11]
6
IF/ID ID/EX EX/MEM MEM/WB
MemWrite
Address
Datamemory
PCSrc
Zero
Add Addresult
Shiftleft 2
ALUresult
ALUZero
Add
0
1
Mux
0
1
Mux
Same controlsignals as thesingle-cycledatapath
There are five stages in the pipeline instruction fetch / PC increment instruction decode / register fetch execution / address calculation memory access write back
Pipeline Control Signals
Execution/Address Calculation stage control lines
Memory access stage control lines
Write-back stage control
lines
InstructionReg Dst
ALU Op1
ALU Op0
ALU Src Branch
Mem Read
Mem Write
Reg write
Mem to Reg
R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X
Nothing to control as instruction memoryread and PC write are always enabled
Pass control signals along just like the data – extend each pipeline register to hold needed control bits for succeeding stages
Note: The 6-bit funct field of the instruction required in the EX stage to generate ALU control can be retrieved as the 6 least significant bits of the immediate field which is sign-extended and passed from the IF/ID register to the ID/EX register
Pipeline Control Implementation
Control
EX
M
WB
M
WB
WB
IF/ID ID/EX EX/MEM MEM/WB
Instruction
Pipelined Datapath with Control II
PC
Instructionmemory
Inst
ruct
ion
Add
Instruction[20– 16]
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
Writedata
Readdata
Mux
1
ALUcontrol
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
6
EX
M
WB
M
WB
WBIF/ID
PCSrc
ID/EX
EX/MEM
MEM/WB
Mux
0
1
Mem
Writ
e
AddressData
memory
Address
Control signalsemanate from the controlportions of the pipeline registers
PipelinedExecutionandControl
Instruction sequence:
Instructionmemory
Instruction[20– 16]
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
Instruction[15– 0]
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
ALUcontrol
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
EX
M
WB
M
WB
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: before<1> EX: before<2> MEM: before<3> WB: before<4>
MEM/WB
IF: lw $10, 20($1)
000
00
0000
000
00
000
0
00
00
0
00
Mux
0
1
Add
PC
0
Datamemory
Address
Writedata
Readdata
Mux
1
WB
EX
M
Instructionmemory
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
0
Mux
0
1
Add Addresult
Writeregister
Writedata
Mux
1
ALUresult
Zero
ALUcontrol
Shiftleft 2
Reg
Writ
e
ALU
M
WB
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>
MEM/WB
IF: sub $11, $2, $3
010
11
0001
000
00
000
0
00
00
0
00
Mux
0
1
Add
PC
0Writedata
Readdata
Mux
1
lwControl
Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
X
10
20
X
1
Instruction[20– 16]
Instruction[15– 0] Sign
extend
Instruction[15– 11]
20
$X
$1
10
X
Mem
Writ
e
MemRead
Mem
Writ
e
Datamemory
Address
Address
Address
Clock 2
Clock 1
lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9
Label “before<i>” meansi th instruction beforelw
Clock cycle 1
Clock cycle 2
PipelinedExecutionandControl
Instruction sequence:
Instructionmemory
Address
Instruction[20– 16]
Mem
toR
eg
Branch
ALUSrc
4
Instruction[15– 0]
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
ALUresult
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
EX
M
WB
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>
MEM/WB
IF: and $12, $4, $5
000
10
1100
010
11
000
1
00
00
0
00
Mux
0
1
Add
PC
0Writedata
Readdata
Mux
1
WB
EX
M
Instructionmemory
Address
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
0
0
1
Add Addresult
Writeregister
Writedata 1
ALUresult
ALUcontrol
Shiftleft 2
Reg
Writ
e
M
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>
MEM/WB
IF: or $13, $6, $7
000
10
1100
000
10
101
0
11
10
0
00
Mux
0
1
Add
PC
0Writedata
Mux
1
andControl
Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
12
X
X
5
4
Instruction[20– 16]
Instruction[15– 0]
Instruction[15– 11]
X
$5
$4
X
12
Mem
Writ
e
MemRead
Mem
Writ
e
sub
11
X
X
3
2
X
$3
$2
X
11
$1
20
10
Mux
0
Mux
1
ALUOp
RegDst
ALUcontrol
M
WB
$3
$2
11
Mux
Mux
ALUAddress Read
dataData
memory
10
WB
Zero
Zero
Signextend
Signextend
Datamemory
Address
Clock 3
Clock 4
lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9
Clock cycle 3
Clock cycle 4
PipelinedExecutionandControl
Instruction sequence:
Instructionmemory
Address
Instruction[20– 16]
Branch
ALUSrc
4
Instruction[15– 0]
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
ALUresult
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
EX
M
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .
MEM/WB
IF: add $14, $8, $9
000
10
1100
000
10
101
0
10
00
0
Mux
0
1
Add
PC
0Writedata
Readdata
Mux
1
WB
EX
M
Instructionmemory
Address
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
0
0
1
Add Addresult
1
ALUresult
ALUcontrol
Shiftleft 2
Reg
Writ
e
M
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .
MEM/WB
IF: after<1>
000
10
1100
000
10
101
0
10
00
0
10
Mux
0
1
Add
PC
0Writedata
Mux
1
addControl
Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
14
X
X
9
8
Instruction[20– 16]
Instruction[15– 0]
Instruction[15– 11]
X
$9
$8
X
14
Mem
Writ
e
MemRead
Mem
Writ
e
or
13
X
X
7
6
X
$7
$6
X
13
$4
Mux
0
Mux
1
ALUOp
RegDst
ALUcontrol
M
WB
$7
$6
13
Mux
Mux
ALUReaddata
12
WB
11 10
10$5
12
WB
Mem
toR
eg
11
11
11
Writeregister
Writedata
Zero
Zero
Datamemory
Address
Datamemory
Address
Signextend
Signextend
Clock 5
Clock 6
lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9
Clock cycle 5
Clock cycle 6
Label “after<i>” meansi th instruction after add
PipelinedExecutionandControl
Instruction sequence:
Instructionmemory
Address
Instruction[20– 16]
Branch
ALUSrc
4
Instruction[15– 0]
0
1
Add Addresult
RegistersWriteregister
Writedata
ALUresult
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
Signextend
EX
M
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: after<1> EX: add $14, . . . MEM: or $13, . . . WB: and $12, . . .
MEM/WB
IF: after<2>
000
00
0000
000
10
101
0
10
00
0
Mux
0
1
Add
PC
0Writedata
Readdata
Mux
1
WB
EX
M
Instructionmemory
Address
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
0
0
1
Add Addresult
1
ALUresult
Zero
ALUcontrol
Shiftleft 2
Reg
Writ
e
M
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: after<2> EX: after<1> MEM: add $14, . . . WB: or $13, . . .
MEM/WB
IF: after<3>
000
00
0000
000
00
000
0
10
00
0
10
Mux
0
1
Add
PC
0Writedata
Mux
1
Control
Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Instruction[20– 16]
Instruction[15– 0] Sign
extend
Instruction[15– 11]
Mem
Writ
e
MemRead
Mem
Writ
e
$8
Mux
0
Mux
1
ALUOp
RegDst
ALUcontrol
M
WB
Mux
Mux
ALUReaddata
14
WB
13 12
12$9
14
WB
Mem
toR
eg
10
13
13
Writeregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2 Zero
Datamemory
Address
Datamemory
Address
Clock 7
Clock 8
lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9
Clock cycle 8
Clock cycle 7
Pipelined Execution and Control
Instruction sequence:
WB
EX
M
Instructionmemory
Address
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
0
0
1
Add Addresult
1
ALUresult
Zero
ALUcontrol
Shiftleft 2
Reg
Writ
e
M
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: after<3> EX: after<2> MEM: after<1> WB: add $14, . . .
MEM/WB
IF: after<4>
000
00
0000
000
00
000
0
00
00
0
10
Mux
0
1
Add
PC
0Writedata
Mux
1
Control
Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Instruction[20– 16]
Instruction[15– 0] Sign
extend
Instruction[15– 11]
MemRead
Mem
Writ
e
Mux
Mux
ALUReaddata
WB
14
14
Writeregister
Writedata
Datamemory
Address
Clock 9
lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9
Clock cycle 9
Revisiting Hazards So far our datapath and control have ignored
hazards We shall revisit data hazards and control hazards
and enhance our datapath and control to handle them in hardware…
Problem with starting an instruction before previous are finished: data dependencies that go backward in time – called data hazards
Data Hazards and Forwarding
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecutionorder(in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2:
DM Reg
Reg
Reg
Reg
DM
sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)
$2 = 10 before sub;$2 = -20 after sub
Have compiler guarantee never any data hazards! by rearranging instructions to insert independent instructions
between instructions that would otherwise have a data hazard between them,
or, if such rearrangement is not possible, insert nops
Such compiler solutions may not always be possible, and nops slow the machine down
Software Solution
sub $2, $1, $3 lw $10, 40($3) slt $5, $6, $7
and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)
sub $2, $1, $3 nop nop
and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)
or
MIPS: nop = “no operation” = 00…0 (32bits) = sll $0, $0, 0
Hardware Solution: Forwarding Idea: use intermediate data, do not wait for result to
be finally written to the destination register. Two steps:
1. Detect data hazard2. Forward intermediate data to resolve hazard
Pipelined Datapath with Control II (as before)
PC
Instructionmemory
Inst
ruct
ion
Add
Instruction[20– 16]
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
Writedata
Readdata
Mux
1
ALUcontrol
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
6
EX
M
WB
M
WB
WBIF/ID
PCSrc
ID/EX
EX/MEM
MEM/WB
Mux
0
1
Mem
Writ
e
AddressData
memory
Address
Control signalsemanate from the controlportions of the pipeline registers
Hazard Detection Hazard conditions:1a. EX/MEM.RegisterRd = ID/EX.RegisterRs1b. EX/MEM.RegisterRd = ID/EX.RegisterRt2a. MEM/WB.RegisterRd = ID/EX.RegisterRs2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
Eg., in the earlier example, first hazard between sub $2, $1, $3 and and $12, $2, $5 is detected when the and is in EX stage and the sub is in MEM stage because
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)
Whether to forward also depends on: if the later instruction is going to write a register – if not, no need to
forward, even if there is register number match as in conditions above if the destination register of the later instruction is $0 – in which case there is no need to forward value ($0 is always 0 and never
overwritten)
Data Forwarding Plan:
allow inputs to the ALU not just from ID/EX, but also later pipeline registers, and
use multiplexors and control signals to choose appropriate inputs to ALU
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecution order(in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2 :
DM Reg
Reg
Reg
Reg
X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :
DM
sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)
Dependencies between pipelines move forward in time
ForwardingHardware
Registers
Mux M
ux
ALU
ID/EX MEM/WB
Datamemory
Mux
Forwardingunit
EX/MEM
b. With forwarding
ForwardB
RdEX/MEM.RegisterRd
MEM/WB.RegisterRd
RtRtRs
ForwardA
Mux
ALU
ID/EX MEM/WB
Datamemory
EX/MEM
a. No forwarding
Registers
Mux
Datapath before adding forwarding hardware
Datapath after adding forwarding hardware
Forwarding Hardware: Multiplexor Control
Mux control Source ExplanationForwardA = 00 ID/EX The first ALU operand comes from the register fileForwardA = 10 EX/MEM The first ALU operand is forwarded from prior ALU
resultForwardA = 01 MEM/WB The first ALU operand is forwarded from data
memory or an earlier ALU resultForwardB = 00 ID/EX The second ALU operand comes from the register
fileForwardB = 10 EX/MEM The second ALU operand is forwarded from prior
ALU resultForwardB = 01 MEM/WB The second ALU operand is forwarded from data
memory or an earlier ALU resultDepending on the selection in the rightmost multiplexor
(see datapath with control diagram)
Data Hazard: Detection and Forwarding
Forwarding unit determines multiplexor control according to the following rules:
1. EX hazard if ( EX/MEM.RegWrite // if there is a write… and ( EX/MEM.RegisterRd 0 ) // to a non-$0
register… and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) ) // which matches,
then… ForwardA = 10
if ( EX/MEM.RegWrite // if there is a write… and ( EX/MEM.RegisterRd 0 ) // to a non-$0
register… and ( EX/MEM.RegisterRd = ID/EX.RegisterRt ) ) // which matches,
then… ForwardB = 10
Data Hazard: Detection and Forwarding
2. MEM hazard if ( MEM/WB.RegWrite // if there is a write… and ( MEM/WB.RegisterRd 0 ) // to a non-$0 register… and ( EX/MEM.RegisterRd ID/EX.RegisterRs ) // and not already a register match // with earlier pipeline register… and ( MEM/WB.RegisterRd = ID/EX.RegisterRs ) ) // but match with later pipeline register, then… ForwardA = 01
if ( MEM/WB.RegWrite // if there is a write… and ( MEM/WB.RegisterRd 0 ) // to a non-$0 register… and ( EX/MEM.RegisterRd ID/EX.RegisterRt ) // and not already a register match // with earlier pipeline register… and ( MEM/WB.RegisterRd = ID/EX.RegisterRt ) ) // but match with later pipeline register, then… ForwardB = 01
This check is necessary, e.g., for sequences such as add $1, $1, $2; add $1, $1, $3; add $1, $1, $4; (array summing?), where an earlier pipeline (EX/MEM) register has more recent data
Forwarding Hardware with Control
PC Instructionmemory
Registers
Mux
Mux
Control
ALU
EX
M
WB
M
WB
WB
ID/EX
EX/MEM
MEM/WB
Datamemory
Mux
Forwardingunit
IF/ID
Inst
ruct
ion
Mux
RdEX/MEM.RegisterRd
MEM/WB.RegisterRd
Rt
Rt
Rs
IF/ID.RegisterRd
IF/ID.RegisterRt
IF/ID.RegisterRt
IF/ID.RegisterRs
Datapath with forwarding hardware and control wires – certain details,e.g., branching hardware, are omitted to simplify the drawingNote: so far we have only handled forwarding to R-type instructions…!
Called forwarding unit, not hazard detection unit, because once data is forwarded there is no hazard!
ForwardingPC Instruction
memory
Registers
Mux
Mux
Mux
EX
M
WB
WB
Datamemory
Mux
Forwardingunit
Inst
ruct
ion
IF/ID
and $4, $2, $5 sub $2, $1, $3
ID/EX
before<1>
EX/MEM
before<2>
MEM/WB
or $4, $4, $2
Clock 3
2
5
10 10
$2
$5
52
4
$1
$3
31
2
Control
ALU
PC Instructionmemory
Registers
Mux
Mux
Mux
EX
M
WB
M
WB
Datamemory
Mux
Forwardingunit
Inst
ruct
ion
IF/ID
or $4, $4, $2 and $4, $2, $5
ID/EX
sub $2, . . .
EX/MEM
before<1>
MEM/WB
add $9, $4, $2
Clock 4
4
6
10 10
$4
$2
62
4
$2
$5
52
4
Control
ALU
10
2
WB
M
WB
sub $2, $1, $3and $4, $2, $5or $4, $4, $2add $9, $4, $2
Execution example:
Clock cycle 3
Clock cycle 4
Forwarding
Execution example (cont.):
PC Instructionmemory
Registers
Mux
Mux
Mux
EX
M
WB
M
WB
Datamemory
Mux
Forwardingunit
Inst
ruct
ion
IF/ID
add $9, $4, $2 or $4, $4, $2
ID/EX
and $4, . . .
EX/MEM
sub $2, . . .
MEM/WB
after<1>
Clock 5
4
2
10 10
$4
$2
24
9
$4
$2
4
2
24
Control
ALU
10
WB
2
1
PC Instructionmemory
Mux
Mux
Mux
EX
M
WB
M
WB
Datamemory
Mux
Forwardingunit
after<1>after<2> add $9, $4, $2 or $4, . . .
EX/MEM
and $4, . . .
MEM/WB
Clock 6
10
$4
$2
24
9
ALU
10
4
4
WB
4
1
Registers
Inst
ruct
ion
IF/ID
ID/EX
4
Controlsub $2, $1, $3and $4, $2, $5or $4, $4, $2add $9, $4, $2
Clock cycle 5
Clock cycle 6
Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register
therefore, we need a hazard detection unit to stall the pipeline after the load instruction
Data Hazards and Stalls
Reg
IM
Reg
Reg
IM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
lw $2, 20($1)
Programexecutionorder(in instructions)
and $4, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
DM Reg
Reg
Reg
DM
lw $2, 20($1)and $4, $2, $5or $8, $2, $6add $9, $4, $2Slt $1, $6, $7
As even a pipelinedependency goesbackward in timeforwarding will notsolve the hazard
Pipelined Datapath with Control II (as before)
PC
Instructionmemory
Inst
ruct
ion
Add
Instruction[20– 16]
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
Writedata
Readdata
Mux
1
ALUcontrol
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
6
EX
M
WB
M
WB
WBIF/ID
PCSrc
ID/EX
EX/MEM
MEM/WB
Mux
0
1
Mem
Writ
e
AddressData
memory
Address
Control signalsemanate from the controlportions of the pipeline registers
Hazard Detection Logic to Stall
Hazard detection unit implements the following check if to stall
if ( ID/EX.MemRead // if the instruction in the EX stage is a load…
and ( ( ID/EX.RegisterRt = IF/ID.RegisterRs ) // and the destination register
or ( ID/EX.RegisterRt = IF/ID.RegisterRt ) ) ) // matches either source register
// of the instruction in the ID stage, then…
stall the pipeline
Mechanics of Stalling If the check to stall verifies, then the pipeline needs to
stall only 1 clock cycle after the load as after that the forwarding unit can resolve the dependency
What the hardware does to stall the pipeline 1 cycle: does not let the IF/ID register change (disable write!) – this
will cause the instruction in the ID stage to repeat, i.e., stall therefore, the instruction, just behind, in the IF stage must
be stalled as well – so hardware does not let the PC change (disable write!) – this will cause the instruction in the IF stage to repeat, i.e., stall
changes all the EX, MEM and WB control fields in the ID/EX pipeline register to 0, so effectively the instruction just behind the load becomes a nop – a bubble is said to have been inserted into the pipeline
note that we cannot turn that instruction into an nop by 0ing all the bits in the instruction itself – recall nop = 00…0 (32 bits) – because it has already been decoded and control signals generated
Hazard Detection Unit
PC Instructionmemory
Registers
Mux
Mux
Mux
Control
ALU
EX
M
WB
M
WB
WB
ID/EX
EX/MEM
MEM/WB
Datamemory
Mux
Hazarddetection
unit
Forwardingunit
0
Mux
IF/ID
Inst
ruct
ion
ID/EX.MemRead
IF/ID
Writ
e
PC
Writ
e
ID/EX.RegisterRt
IF/ID.RegisterRd
IF/ID.RegisterRtIF/ID.RegisterRtIF/ID.RegisterRs
RtRs
Rd
Rt EX/MEM.RegisterRd
MEM/WB.RegisterRd
Datapath with forwarding hardware, the hazard detection unit and controls wires – certain details, e.g., branching hardware are omitted to simplify the drawing
Stalling Resolves a Hazard Same instruction sequence as before for which
forwarding by itself could not resolve the hazard:
lw $2, 20($1)
Programexecutionorder(in instructions)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Reg
IM
Reg
Reg
IM DM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)
IM Reg DM RegIM
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9 CC 10
DM Reg
RegReg
Reg
bubble
lw $2, 20($1)and $4, $2, $5or $8, $2, $6add $9, $4, $2Slt $1, $6, $7
Hazard detection unit inserts a 1-cycle bubble in the pipeline, afterwhich all pipeline register dependencies go forward so then the forwarding unit can handle them and there are no more hazards
Stalling Execution example:
Hazarddetection
unit
0
MuxIF
/IDW
rite
PC
Writ
e
ID/EX.RegisterRt
lw $2, 20($1)
PC Instructionmemory
Registers
Mux
Mux
Mux
EX
M
WB
WB
Datamemory
Mux
Forwardingunit
Inst
ruct
ion
IF/ID
and $4, $2, $5
ID/EX
before<1>
EX/MEM
before<2>
MEM/WB
or $4, $4, $2
Clock 3
2
5
2
500 11
$2
$5
52
4
$1
$X
X1
2
Control
ALU
M
WB
Hazarddetection
unit
0
MuxIF
/IDW
rite
PC
Writ
e
ID/EX.RegisterRt
ID/EX.MemRead
ID/EX.MemRead
M
WB
$1
$X
X
1
2
before<3>
PC Instructionmemory
Registers
Mux
Mux
Mux
EX WB
Datamemory
Mux
Forwardingunit
Inst
ruct
ion
IF/ID
ID/EX
EX/MEM
MEM/WB
and $4, $2, $5 lw $2, 20($1) before<1> before<2>
Clock 2
1
1
X
X11
Control
ALU
M
WB
lw $2, 20($1)and $4, $2, $5or $4, $4, $2add $9, $4, $2
Clock cycle 2
Clock cycle 3
Stalling Execution example (cont.):
Hazarddetection
unit
0
MuxIF
/IDW
rite
PC
Writ
e
ID/EX.RegisterRt
$2
$5
52
2
2
4
WB
Hazarddetection
unit
0
MuxIF
/IDW
rite
PC
Writ
eID/EX.RegisterRt
PC Instructionmemory
Registers
Mux
Mux
Mux
EX
M
WB
Datamemory
Mux
Inst
ruct
ion
IF/ID
and $4, $2, $5 bubble
ID/EX
lw $2, . . .
EX/MEM
before<1>
MEM/WB
Clock 4
2
2
5
510
11
00
$2
$5
52
4
Control
ALU
M
WB
bubble lw $2, . . .
PC Instructionmemory
Registers
Mux
Mux
Mux
EX
M
WB
M
WB
Datamemory
Mux
Forwardingunit
Forwardingunit
Inst
ruct
ion
IF/ID
and $4, $2, $5
ID/EX
EX/MEM
MEM/WB
add $9, $4, $2
Clock 5
2
210 10
11
$4
$2
2
4
4
4
2
4
$2
$5
5
2
4
Control
ALU
0
WB
ID/EX.MemRead
ID/EX.MemRead
or $4, $4, $2
or $4, $4, $2
lw $2, 20($1)and $4, $2, $5or $4, $4, $2add $9, $4, $2
Clock cycle 4
Clock cycle 5
Stalling Execution example (cont.):
Registers
Inst
ruct
ion
ID/EX
4
Control
PC Instructionmemory
PC Instructionmemory
Hazarddetection
unit
0
MuxIF
/IDW
rite
PC
Writ
e
IF/ID
Writ
e
PC
Writ
e
ID/EX.RegisterRt
bubble
Registers
Mux
Mux
EX
M
WB
M
WB
Datamemory
Mux
Inst
ruct
ion
IF/ID
add $9, $4, $2
ID/EX
and $4, . . .
EX/MEM
MEM/WB
Clock 6
4
4
2
210 10
$4
$2
24
49
$2
2
Control
ALU
10
WB0
add $9, $4, $2 or $4, . . . and $4, . . .after<2> after<1>
after<1>
Clock 7
Mux
Mux
Mux
EX
M
WB
M
WB
Datamemory
Mux
Forwardingunit
Forwardingunit
EX/MEM
MEM/WB
10 10
$4
$4
$2
24
4
9
ALU
10
WB
44
4
1
Hazarddetection
unit
0
Mux
ID/EX.RegisterRt
or $4, $4, $2
ID/EX.MemRead
ID/EX.MemRead
Mux
IF/ID
lw $2, 20($1)and $4, $2, $5or $4, $4, $2add $9, $4, $2
Clock cycle 6
Clock cycle 7
Problem with branches in the pipeline we have so far is that the branch decision is not made till the MEM stage – so what instructions, if at all, should we insert into the pipeline following the branch instructions?
Possible solution: stall the pipeline till branch decision is known not efficient, slow the pipeline significantly!
Another solution: predict the branch outcome e.g., always predict branch-not-taken – continue with next sequential
instructions if the prediction is wrong have to flush the pipeline behind the branch – discard
instructions already fetched or decoded – and continue execution at the branch target
Control (or Branch) Hazards
Predicting Branch-not-taken:Misprediction delay
Reg
Reg
CC 1
Time (in clock cycles)
40 beq $1, $3, 7
Programexecutionorder(in instructions)
IM Reg
IM DM
IM DM
IM DM
DM
DM Reg
Reg Reg
Reg
Reg
RegIM
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
Reg
The outcome of branch taken (prediction wrong) is decided only when beq is in the MEM stage, so the following three sequential instructionsalready in the pipeline have to be flushed and execution resumes at lw
Optimizing the Pipeline to Reduce Branch Delay
Move the branch decision from the MEM stage (as in our current pipeline) earlier to the ID stage
calculating the branch target address involves moving the branch adder from the MEM stage to the ID stage – inputs to this adder, the PC value and the immediate fields are already available in the IF/ID pipeline register
calculating the branch decision is efficiently done, e.g., for equality test, by XORing respective bits and then ORing all the results and inverting, rather than using the ALU to subtract and then test for zero (when there is a carry delay)
with the more efficient equality test we can put it in the ID stage without significantly lengthening this stage – remember an objective of pipeline design is to keep pipeline stages balanced
we must correspondingly make additions to the forwarding and hazard detection units to forward to or stall the branch at the ID stage in case the branch decision depends on an earlier result
Flushing on Misprediction Same strategy as for stalling on load-use data
hazard… Zero out all the control values (or the instruction itself)
in pipeline registers for the instructions following the branch that are already in the pipeline – effectively turning them into nops – so they are flushed
in the optimized pipeline, with branch decision made in the ID stage, we have to flush only one instruction in the IF stage – the branch delay penalty is then only one clock cycle
Optimized Datapath for Branch
PC Instructionmemory
4
Registers
Mux
Mux
Mux
ALU
EX
M
WB
M
WB
WB
ID/EX
0
EX/MEM
MEM/WB
Datamemory
Mux
Hazarddetection
unit
Forwardingunit
IF.Flush
IF/ID
Signextend
Control
Mux
=
Shiftleft 2
Mux
Branch decision is moved from the MEM stage to the ID stage – simplified drawing not showing enhancements to the forwarding and hazard detection units
IF.Flush control zeros out the instruction in the IF/ID pipeline register (which follows the branch)
PipelinedBranch
Execution example:
PC Instructionmemory
4
Registers
Signextend
Mux
Mux
Control
EX
M
WB
M
WB
WB
Mux
Hazarddetection
unit
Forwardingunit
Mux
IF.Flush
IF/ID
and $12, $2, $5 beq $1, $3, 7 sub $10, $4, $8
MEM/WB
EX/MEM
ID/EX
Clock 3
72 44
48 44
28
7
$1
$3
10
48
72
72
0
Mux
0
$4
$8
ALU Datamemory
bubble (nop)lw $4, 50($7)
Clock 4
Mux
Shiftleft 2
before<1>
beq $1, $3, 7 sub $10, . . . before<1>
before<2>
=
PC Instructionmemory
4
Registers
Signextend
Mux
Mux
Control
EX
M
WB
M
WB
WB
Mux
Hazarddetection
unit
Forwardingunit
IF.Flush
IF/ID
MEM/WB
EX/MEM
ID/EX
76 72
76 72
$1
$3
10
76
ALU Datamemory
Mux
Shiftleft 2
=
36 sub $10, $4, $840 beq $1, $3, 744 and $12 $2, $548 or $13 $2, $652 add $14, $4, $256 slt $15, $6, $7…72 lw $4, 50($7)
Clock cycle 4
Clock cycle 3
Optimized pipeline withonly one bubble as a resultof the taken branch
Simple Example: Comparing Performance Compare performance for single-cycle, multicycle, and pipelined
datapaths using the gcc instruction mix assume 2 ns for memory access, 2 ns for ALU operation,
1 ns for register read or write assume gcc instruction mix 23% loads, 13% stores, 19%
branches, 2% jumps, 43% ALU for pipelined execution assume
50% of the loads are followed immediately by an instruction that uses the result of the load
25% of branches are mispredicted branch delay on misprediction is 1 clock cycle jumps always incur 1 clock cycle delay so their average
time is 2 clock cycles
Simple Example: Comparing Performance Single-cycle (p. 373): average instruction time 8 ns Multicycle (p. 397): average instruction time 8.04 ns Pipelined:
loads use 1 cc (clock cycle) when no load-use dependency and 2 cc when there is dependency – given 50% of loads are followed by dependency the average cc per load is 1.5
stores use 1 cc each branches use 1 cc when predicted correctly and 2 cc when
not – given 25% misprediction average cc per branch is 1.25 jumps use 2 cc each ALU instructions use 1 cc each therefore, average CPI is1.5 23% + 1 13% + 1.25 19% + 2 2% + 1 43% =
1.18 therefore, average instruction time is 1.18 2 = 2.36 ns