![Page 2: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/2.jpg)
2
Overview and Learning Outcomes
• Deepen the understanding on how modern
processors work
• Learn how pipelining can improve processors
performance and efficiency
• Being aware of the problems arising from using
pipelined processors
• Understanding instruction dependencies
![Page 3: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/3.jpg)
3
The Fetch-Execute Cycle
• As explained in
COMP15111 Instruction
execution is a simple
repetitive cycle:CPU Memory
LDR R0, x
LDR R1, y
ADD R2, R1, R0
STR R2, Z
…
PC
Fetch Instruction
Execute Instruction
![Page 4: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/4.jpg)
4
Fetch-Execute Detail
The two parts of the cycle can be further subdivided
• Fetch
– Get instruction from memory (IF)
– Decode instruction & select registers (ID)
• Execute
– Perform operation or calculate address (EX)
– Access an operand in data memory (MEM)
– Write result to a register (WB)
We have designed the ‘worst case’ data path
– It works for all instructions
![Page 5: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/5.jpg)
5
Processor Detail
Register
Bank
Data
Cache
PCInstr.
CacheALU
IF ID EX MEM WB
Instruction Instruction Execute Access Write
Fetch Decode Instruction Memory Back
MU
X
Write in R2Do nothingAdd R0 & R1Select registers (R0, R1)ADD R2,R1,R0
Write in R0Get value from [x]Compute address xSelect register (PC)LDR R0, xCycle i
Cycle i+1
![Page 6: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/6.jpg)
6
Cycles of Operation
• Most logic circuits are driven by a clock
• In its simplest form one instruction would take one
clock cycle (single-cycle processor)
• This is assuming that getting the instruction and
accessing data memory can each be done in a 1/5th
of a cycle (i.e. a cache hit)
• For this part we will assume a perfect cache
replacement strategy
![Page 7: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/7.jpg)
7
Logic to do this
• Each stage will do its work and pass to the next
• Each block is only doing useful work once every 1/5th
of a cycle
Fe
tch Lo
gic
De
cod
e Lo
gic
Exe
c Log
ic
Me
mLo
gic
Write
Log
ic
Inst Cache Data Cache
![Page 8: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/8.jpg)
8
Application Execution
WBMEMEXIDIFADD
WBMEMEXIDIFLDR
WBMEMEXIDIFLDR
321
Clock Cycle
• Can we do it any better?
– Increase utilization
– Accelerate execution
![Page 9: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/9.jpg)
9
Insert Buffers Between Stages
• Instead of direct connection between stages – use
extra buffers to hold state
• Clock buffers once per cycle
Fe
tch Lo
gic
De
cod
e Lo
gic
Exe
c Log
ic
Me
mLo
gic
Write
Log
icInst Cache Data Cache
clock
Instru
ction
Re
g.
![Page 10: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/10.jpg)
10
In a pipeline processor
• Just like a car production line
• We still can execute one instruction every cycle
• But now clock frequency is increased by 5x
• 5x faster execution!
WBMEMEXIDIFADD
WBMEMEXIDIFLDR
WBMEMEXIDIFLDR
7654321
Clock Cycle
![Page 11: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/11.jpg)
11
Benefits of Pipelining
LDR IF ID EX MEM WB
LDR IF ID EX MEM WB
ADD IF ID EX MEM WB
1 2 3 4 5 6 7
LDR IF ID EX MEM WB
LDR IF ID EX MEM WB
ADD IF ID EX MEM WB
Clock Cycle
Without Pipelining
Clock Cycle
With Pipelining
1 2 3
![Page 12: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/12.jpg)
12
Why 5 Stages ?
• Simply because early pipelined processors
determined that dividing into these 5 stages of
roughly equal complexity was appropriate
• Some recent processors have used more than 30
pipeline stages
• We will consider 5 for simplicity at the moment
![Page 13: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/13.jpg)
13
Imagine we have a non-pipelined processor running at 10MHz and want to run a program with 1000 instructions.
a) How much time would it take to execute the program?
Pipelining Example
![Page 14: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/14.jpg)
14
Pipelining Example
Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in:
b) A 10-stage pipeline?c) A 100-stage pipeline?
![Page 15: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/15.jpg)
15
Pipelining Example
Looking at those results, it seems clear that increasing pipeline should increase the execution speed of a processor. Why do you think that processor designers (see Intel, below) have not only stopped increasing pipeline length but, in fact, reduced it?
Pentium III – Coppermine (1999) – 10-stage pipelinePentium IV – NetBurst (2000) – 20-stage pipelinePentium Prescott (2004) – 31-stage pipelineCore i7 9xx – Bloomfield (2008) – 24-stage pipelineCore i7 5Yxx – Broadwell (2014) – 19-stage pipeline
![Page 16: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/16.jpg)
16
Limits to Pipeline Scalability
• Higher freq. => more power consumed
• More stages
– more extra hardware
– more complex design (forwarding?)
– more difficult to split into uniform size chunks
– loading time of the registers limits cycle period
• Hazards (control and data)
– A longer datapath means higher probability of hazards
occurring and worse penalties when they happen
![Page 17: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/17.jpg)
Control Hazards
![Page 18: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/18.jpg)
18
The Control Transfer Problem
• Instructions are normally fetched sequentially (i.e.
just incrementing the PC)
• What if we fetch a branch?
– We only know it is a branch when we decode it in the
second stage of the pipeline
– By that time we are already fetching the next instruction in
serial order
![Page 19: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/19.jpg)
19
A Pipeline ‘Bubble’
Inst 1
Inst 2
Inst 3
B n
Inst 5
Inst 6
…
Inst n
We must mark Inst 5 as unwanted and
ignore it as it goes down the pipeline.
But we have wasted a cycle
1 2 3 4 5 6 7 8
Inst 3 IF ID EX MEM WB
B n IF ID EX MEM WB
Inst 5 IF ID EX MEM WB
..
Inst n IF ID EX MEM WB
We know it is a branch here.
Inst 5 is already fetched
![Page 20: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/20.jpg)
20
Conditional Branches
• It gets worse!
• Suppose we have a conditional branch
• It is possible that we might not be able to determine
the branch outcome until the execute (3rd) stage
• We would then have 2 ‘bubbles’
• We can often avoid this by reading registers during
the decode stage.
![Page 21: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/21.jpg)
21
Conditional Branches
Inst 1
Inst 2
Inst 3
BEQ n
Inst 5
Inst 6
…
Inst n
If condition is true, we must mark Inst 5 & 6 as
unwanted and ignore them as they go down the
pipeline. 2 wasted cycles now
1 2 3 4 5 6 7 8 9
Inst 3 IF ID EX MEM WB
BEQ n IF ID EX MEM WB
Inst 5 IF ID EX MEM WB
Inst 6 IF ID EX MEM WB
..
Inst n IF ID EX MEM WB
We do not know whether we have to branch
until EX. Inst 5 & 6 are already fetched
![Page 22: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/22.jpg)
22
Deeper Pipelines
• ‘Bubbles’ due to branches are called Control Hazards
• They occur when it takes one or more pipeline stages
to detect the branch
• The more stages, the less each does
• More likely to take multiple stages
• Longer pipelines usually suffer more degradation
from control hazards
![Page 23: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/23.jpg)
23
Branch Prediction
• In most programs many branch instructions are
executed many times
– E.g. loops, functions
• What if, when a branch is executed
– We take note of its address
– We take note of the target address
– We use this info the next time the branch is fetched
![Page 24: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/24.jpg)
24
Branch Target Buffer
• We could do this with some sort of (small) cache
• As we fetch the branch we check the BTB
• If a valid entry in BTB, we use its target to fetch next
instruction (rather than the PC)
Address Data
Branch Address Target Address
![Page 25: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/25.jpg)
25
Branch Target Buffer
• For unconditional branches we always get it right
• For conditional branches it depends on the probability of repeating the target
– E.g. a ‘for’ loop which jumps back many times we will get it right most of the time (only first and last time will mispredict)
• But it is only a prediction, if we get it wrong we pay a penalty (bubbles)
![Page 26: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/26.jpg)
26
Outline Implementation
Fetch
Stage
PC
Inst
Cache
Branch
Target
Buffer
valid
inc
![Page 27: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/27.jpg)
27
Other Branch Prediction
• BTB is simple to understand
– But expensive to implement
– And it just uses the last branch to predict
• In practice, prediction accuracy depends on
– More history (several previous branches)
– Context (how did we get to this branch)
• Real-world branch predictors are more complex and
vital to performance for deep pipelines
![Page 28: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/28.jpg)
28
Benefits of Branch Prediction
1 2 3 4 5 6 7 8 9
Inst a IF ID EX MEM WB
BEQ n IF ID EX MEM WB
Inst c IF ID EX MEM WB
Inst d IF ID EX MEM WB
..
Inst n IF ID EX MEM WB
1 2 3 4 5 6 7 8 9
Inst a IF ID EX MEM WB
BEQ n IF ID EX MEM WB
Inst c
Inst d
..
Inst n IF ID EX MEM WB
With Branch Prediction
Without Branch PredictionThe comparison is not
done until 3rd stage, so
2 instructions have
been issued and need
to be eliminated from
the pipeline and we
have wasted 2 cycles
If we predict that next
instruction will be ‘n’
then we pay no penalty
![Page 29: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/29.jpg)
29
Consider a simple program with two nested loops as the following:while (true) {
for (i=0; i<x; i++) {do_stuff
}}
With the following assumptions:do_stuff has 20 instructions that can be executed ideally in the pipeline.
The overhead for control hazards is 3-cycles, regardless of the branch
being static or conditional.
Each of the two loops can be translated into a single branch instruction.
Calculate the instructions-per-cycle that can be achieved for different values of x (2, 4, 10, 100):a) Without branch prediction.
b) With a simple branch prediction policy - do the same as the last time.
Control Hazard Example
![Page 30: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/30.jpg)
30
Control Hazard Example
![Page 31: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/31.jpg)
Data Hazards
![Page 32: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/32.jpg)
32
COMP25212
Data Hazards
• Pipeline can cause other problems
• Consider the following instructions:
ADD R1,R2,R3
MUL R0,R1,R1
• The ADD produces a value in R1
• The MUL uses R1 as input
• There is a data dependency between them
![Page 33: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/33.jpg)
33
COMP25212
Instructions in the Pipeline
Re
giste
r Ba
nk
Da
ta
Ca
che
PC
Instru
ction
Ca
che
MU
X
ALU
IF ID EX MEM WB
ADD R1,R2,R3MUL R0,R1,R1
The new value of R1 has not been updated in the register bank
MUL would be reading an outdated value!!
![Page 34: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/34.jpg)
34
COMP25212
The Data is not Ready
• At the end of the ID cycle, MUL instruction should
have selected value in R1 to put into buffer at input
to EX stage
• But the correct value for R1 from ADD instruction is
being put into the buffer at the output of EX stage at
this time
• It will not get to the input of WB until one cycle later
– then probably another cycle to write into register
bank
![Page 35: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/35.jpg)
35
Dealing with data dependencies (I)
• Detect dependencies in HW and hold instructions in
ID stage until data is ready, i.e. pipeline stall
– Bubbles and wasted cycles again
WB
8
MEMEX--IDIFMUL R0,R1,R1
WBMEMEXIDIFADD R1,R2,R3
7654321
Clock Cycle
data is produced here
R1 is written back hereMUL can not proceed until here
![Page 36: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/36.jpg)
36
Dealing with data dependencies (II)
• Use the compiler to try and reorder instructions
– Only works if we can find something useful to do –
otherwise insert NOPs – waste
WBMEMEXIDIFInstr A / NOP
WBMEMEXIDIFADD R1,R2,R3
WB
8
MEMEXIDIFMUL R0,R1,R1
WBMEMEXIDIFInstr B / NOP
7654321
Clock Cycle
![Page 37: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/37.jpg)
37
Forwarding
Re
giste
r Ba
nk
Da
ta
Ca
che
PC
Instru
ction
Ca
che
MU
X
ALU
ADD R1,R2,R3MUL R0,R1,R1
• We can add extra data paths for specific cases
– The output of EX feeds back into the input of EX
– Sends the data to next instruction
• Control becomes more complex
Result
From ADD
![Page 38: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/38.jpg)
38
Why did it Occur?
• Due to the design of our pipeline
• In this case, the result we want is ready one stage ahead (EX)
of where it was needed (ID)
– why wait until it goes down the pipeline?
• But, …what if we have the sequence
LDR R1,[R2,R3]
MUL R0,R1,R1
• LDR = load R1 from memory address R2+R3
– Now the result we want will be ready after MEM stage
![Page 39: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/39.jpg)
39
Pipeline Sequence for LDR
• Fetch
• Decode and read registers (R2 & R3)
• Execute – add R2+R3 to form address
• Memory access, read from address [R2+R3]
• Now we can write the value into register R1
![Page 40: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/40.jpg)
40
More Forwarding
Register B
ank
Data
Cache
PC
Instruction C
ache
MU
X
ALU
MUL R0,R1,R1
LDR R1,[R2,R3]
Waste
d cycle
• MUL has to wait until LDR finishes MEM
• We need to add extra paths from MEM to EX
• Control becomes even more complex
![Page 41: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/41.jpg)
41
Forwarding Example
1 2 3 4 5 6 7 8 9 10 11 12
LDR r1,A IF ID EX MEM WB
LDR r2,B IF ID EX MEM WB
ADD r3,r1,r2 IF ID Stall Stall EX MEM WB
ST r3 C IF Stall Stall ID Stall Stall EX MEM WB
1 2 3 4 5 6 7 8 9
LDR r1,A IF ID EX MEM WB
LDR r2,B IF ID EX MEM WB
ADD r3,r1,r2 IF ID Stall EX MEM WB
ST r3 C IF Stall ID EX MEM WB
Clock Cycle
With Forwarding
Without Forwarding
![Page 42: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/42.jpg)
42
Deeper Pipelines
• As mentioned previously we can go to longer pipelines
– Do less per pipeline stage
– Each step takes less time
– So clock frequency increases
– But greater penalty for hazards
– More complex control
• A trade-off between many aspects needs to be made
– Frequency, power, area,
![Page 43: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/43.jpg)
43
Consider the following program which implements R = A^2 + B^2
LD r1, AMUL r2, r1, r1 -- A^2LD r3, BMUL r4, r3, r3 -- B^2ADD r5, r2, r4 -- A^2 + B^2ST r5, R
a) Draw its dependency diagram
Data Hazard Example
![Page 44: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/44.jpg)
44
b) Simulate its execution in a basic 5-stage pipeline without forwarding.
c) Simulate the execution in a 5-stage pipeline with forwarding.
Data Hazard Example
654321 121110987 181716151413
ST r5, R
ADD r5, r2, r4
MUL r4, r3, r3
LD r3, B
MUL r2, r1, r1
LD r1, A
242322212019654321 121110987 181716151413
ST r5, R
ADD r5, r2, r4
MUL r4, r3, r3
LD r3, B
MUL r2, r1, r1
LD r1, A
242322212019
654321 121110987 181716151413
ST r5, R
ADD r5, r2, r4
MUL r4, r3, r3
LD r3, B
MUL r2, r1, r1
LD r1, A
242322212019654321 121110987 181716151413
ST r5, R
ADD r5, r2, r4
MUL r4, r3, r3
LD r3, B
MUL r2, r1, r1
LD r1, A
242322212019
![Page 45: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/45.jpg)
45
Where Next?
• Despite these difficulties it is possible to build
processors which approach 1 instruction per cycle
(IPC)
• Given that the computational model is one of serial
instruction execution can we do any better than
this?
![Page 46: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/46.jpg)
Instruction Level Parallelism
![Page 47: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/47.jpg)
47
Instruction Level Parallelism (ILP)
• Suppose we have an expression of the form x =
(a+b) * (c-d)
• Assuming a,b,c & d are in registers, this might turn
into
ADD R0, R2, R3
SUB R1, R4, R5
MUL R0, R0, R1
STR R0, x
![Page 48: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/48.jpg)
48
ILP (cont)
• The MUL has a dependence on the ADD and the SUB, and the STR has a dependence on the MUL
• However, the ADD and SUB are independent
• In theory, we could execute them in parallel, even out of order
ADD R0, R2, R3
SUB R1, R4, R5
MUL R0, R0, R1
STR R0, x
![Page 49: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/49.jpg)
49
The Dependency Grapha.k.a. Data Flow graph
• We can see this more clearly if we plot it as a
dependency graph (or data flow)
ADD SUB
MUL
R2 R3 R4 R5
x
As long as R2, R3,R4 & R5 are available,We can execute theADD & SUB in parallel
![Page 50: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/50.jpg)
50
Amount of ILP?
• This is obviously a very simple example
• However, real programs often have quite a few
independent instructions which could be executed in
parallel
• Exact number is clearly program dependent but
analysis has shown that 4 is common (in parts of the
program anyway)
![Page 51: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/51.jpg)
51
How to Exploit?
• We need to fetch multiple instructions per cycle –
wider instruction fetch
• Need to decode multiple instructions per cycle
• But must use common registers – they are logically
the same registers
• Need multiple ALUs for execution
• But also access common data cache
![Page 52: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/52.jpg)
52
Dual Issue Pipeline
• Two instructions can now execute in parallel
• (Potentially) double the execution rate
• Called a ‘Superscalar’ architecture (2-way)
Re
giste
r Ba
nk
Da
ta
Ca
chePC
Instru
ction
Ca
che
MU
X
ALU
I1 I2
ALU
MU
X
![Page 53: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/53.jpg)
53
Register & Cache Access
• Note the access rate to both registers & cache will be
doubled
• To cope with this we may need a dual ported
register bank & dual ported caches
• This can be done either by duplicating access
circuitry or even duplicating whole register & cache
structure
![Page 54: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/54.jpg)
54
Selecting Instructions
• To get the doubled performance out of this
structure, we need to have independent instructions
• We can have a ‘dispatch unit’ in the fetch stage
which uses hardware to examine the instruction
dependencies and only issue two in parallel if they
are independent
![Page 55: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/55.jpg)
55
Instruction dependencies
• If we had
ADD R1,R1,R0
MUL R2,R1,R1
ADD R3,R4,R5
MUL R6,R3,R3
• Issued in pairs as above
• We would not be able to issue any in parallel
because of dependencies
ADD
MUL
MUL
ADD
![Page 56: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/56.jpg)
56
Instruction reorder
• If we examine dependencies and reorder
ADD R1,R1,R0 ADD R1,R1,R0
MUL R2,R1,R1 ADD R3,R4,R5
ADD R3,R4,R5 MUL R2,R1,R1
MUL R6,R3,R3 MUL R6,R3,R3
• We can now execute pairs in parallel (assuming
appropriate forwarding logic)
ADD
MUL MUL
ADD
![Page 57: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/57.jpg)
57
Example of 2-way SuperscalarDependency Graph
ADD
R2 R3
MUL
x
STR
y
STR
MUL
R3 R4
ADD R0, R2, R3
SUB R1, R4, R5
MUL R0, R0, R1
MUL R2, R3, R4
STR R0, x
STR R2, y
SUB
R4 R5
ADD R0, R2, R3
SUB R1, R4, R5
MUL R0, R0, R1
STR R0, x
MUL R2, R3, R4
STR R2, y
![Page 58: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/58.jpg)
58
2-way Superscalar Execution
WBMEMEXIDIFSTR R2, y
WBMEMEXIDIFSTR R0, x
WBMEMEXIDIFMUL R2, R3, R4
WBMEMEXIDIFMUL R0, R0, R1
WBMEMEXIDIFSUB R1, R4, R5
WBMEMEXIDIFADD R0, R2, R3
7654321
Superscalar Processor
WBMEMEXIDIFSTR R2, y
WBMEMEXIDIFSTR R0, x
WBMEMEXIDIFMUL R2, R3, R4
WBMEMEXIDIFMUL R0, R0, R1
WBMEMEXIDIFSUB R1, R4, R5
WBMEMEXIDIFADD R0, R2, R3
10987654321
Scalar Processor
![Page 59: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/59.jpg)
59
Limits of ILP
• Modern processors are up to 4 way superscalar (but
rarely achieve 4x speed)
• Not much beyond this
– Hardware complexity
– Limited amounts of ILP in real programs
• Limited ILP not surprising, conventional programs
are written assuming a serial execution model
![Page 60: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/60.jpg)
60
Consider the following program which implements R = A^2 + B^2 + C^2 + D^2
LD r1, AMUL r2, r1, r1 -- A^2LD r3, BMUL r4, r3, r3 -- B^2ADD r11, r2, r4 -- A^2 + B^2LD r5, CMUL r6, r5, r5 -- C^2LD r7, DMUL r8, r7, r7 -- D^2ADD r12, r6, r8 -- C^2 + D^2ADD r21, r11, r12 -- A^2 + B^2 + C^2 + D^2ST r21, R
The current code is not really suitable for a superscalar pipeline because of its low instruction-level parallelism
Superscalar Example
![Page 61: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/61.jpg)
61
Superscalar Example
a) Reorder the instructions to exploit superscalar execution. Assume all kinds of forwarding are implemented.
![Page 62: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/62.jpg)
Reordering Instructions
![Page 63: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/63.jpg)
63
Compiler Optimisation
• Reordering can be done by the compiler
• If compiler can not manage to reorder the instructions, we still need hardware to avoid issuing conflicts (stall)
• But if we could rely on the compiler, we could get rid of expensive checking logic
• This is the principle of VLIW (Very Long Instruction Word)[1]
• Compiler must add NOPs if necessary
[1] You can find an introduction to VLIW architectures at:
https://web.archive.org/web/20110929113559/http://www.nxp.com/acrobat_download2/other/vliw-wp.pdf
![Page 64: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/64.jpg)
64
Compiler limitations
• There are arguments against relying on the compiler
– Legacy binaries – optimum code tied to a particular
hardware configuration
– ‘Code Bloat’ in VLIW – useless NOPs
• Instead, we can rely on hardware to re-order
instructions if necessary
– Out-of-order processors
– Complex but effective
![Page 65: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/65.jpg)
65
Out of Order Processors
• An instruction buffer needs to be added to store all
issued instructions
• An dynamic scheduler is in charge of sending non-
conflicted instructions to execute
• Memory and register accesses need to be delayed
until all older instructions are finished to comply
with application semantics
![Page 66: Pipelining - University of Manchestersyllabus.cs.manchester.ac.uk/ugt/2016/COMP25212/lect/pipe.pdf · Cache PC Instr. Cache ALU IF ID EX MEM WB Instruction Instruction Execute Access](https://reader034.vdocuments.net/reader034/viewer/2022052613/5f15ddca52a7b61ab216be50/html5/thumbnails/66.jpg)
66
Out of Order Execution
Re
giste
r Ba
nk
PC
Instr.
Ca
che
ALUIn
structio
n
Bu
ffer
Dispatch
Schedule
Da
ta
Ca
che
Me
mo
ry
Qu
eu
e
Re
giste
r
Qu
eu
e
Delay
Delay
• What changes in an out-of-order processor
– Instruction Dispatching and Scheduling
– Memory and register accesses deferred