CSCI2510 Computer Organization
Lecture 12: Pipelining
Ming-Chang YANG
Reading: Chap. 8 (5th Ed.)
Why Do We Need Pipelining?
• Real-life example: four loads of laundry need to be washed, dried, and folded.
– Washing: 30 min
– Drying: 40 min
– Folding: 20 min
• Without pipeline:
– 360 min in total
• With pipeline:
– 210 min in total!
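The two totals can be checked with a short calculation. This is a minimal sketch, assuming all loads are identical and each stage handles one load at a time; the function names are illustrative and not part of the lecture.

```python
def sequential_minutes(loads, stage_minutes):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(loads, stage_minutes):
    """Loads overlap; after the first load, the slowest stage sets the pace."""
    bottleneck = max(stage_minutes)
    return sum(stage_minutes) + (loads - 1) * bottleneck


stages = [30, 40, 20]  # washing, drying, folding (minutes)
print(sequential_minutes(4, stages))  # 360
print(pipelined_minutes(4, stages))   # 210
```

The pipelined total is limited by the 40-minute dryer, which is why the stages of a pipeline should take roughly equal time.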
CSCI2510 Lec12: Pipelining 2
https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html
Outline
• Sequential Execution vs Pipelining
• Pipeline Stall: Hazard
– Data Hazard
– Instruction Hazard
– Structural Hazard
• Superscalar Operation
Sequential Execution
• The processor fetches and executes instructions, one
after the other.
– Fi: Fetch steps for instruction Ii
– Ei: Execute steps for instruction Ii
• Execution of a program consists of a sequence of fetch and execute steps:
Time →  F1 E1 (I1), F2 E2 (I2), F3 E3 (I3)
• How to improve the speed of execution?
– Use faster technologies to build the CPU and memory ($$$).
– Arrange the hardware to do multiple operations at a time ($).
Separate HW & Interstage Buffer
• Consider a computer having two separate hardware
units:
– One hardware unit is for fetching instructions.
– The other hardware unit is for executing instructions.
• Interstage buffer: holds the fetched instruction.
– The execution unit executes the deposited instruction.
– The fetch unit fetches the next instruction at the same time.
Figure: Instruction fetch unit → Interstage buffer → Execution unit.
Basic Idea of Instruction Pipelining (1/2)
• Assume the computer is controlled by a clock.
– The fetch step and the execute step of any instruction can each be completed in one clock cycle.
• The fetch and execute units form a two-stage pipeline:
– Both units are kept busy all the time.
– An interstage buffer is needed to hold the instruction.
Figure: two-stage pipeline timing (clock cycles 1 to 4).
I1: F1 (cycle 1), E1 (cycle 2)
I2: F2 (cycle 2), E2 (cycle 3)
I3: F3 (cycle 3), E3 (cycle 4)
Basic Idea of Instruction Pipelining (2/2)
• Parallelism is increased by overlapping the fetch and execute steps.
– If execution is sustained for a long time, the completion rate of a two-stage pipeline is twice that of sequential execution.
• Is more better? How about a 4-stage pipeline?
– F: Fetch instruction from memory
– D: Decode instruction and fetch source operands
– E: Execute instruction
– W: Write the result
4-Stage Pipeline (1/2)
Hardware: F (Fetch instruction) → B1 → D (Decode instruction) → B2 → E (Execute operation) → B3 → W (Write results), where B1, B2, and B3 are interstage buffers.
Figure: 4-stage pipeline timing (clock cycles 1 to 7).
I1: F1 (cycle 1), D1 (2), E1 (3), W1 (4)
I2: F2 (cycle 2), D2 (3), E2 (4), W2 (5)
I3: F3 (cycle 3), D3 (4), E3 (5), W3 (6)
I4: F4 (cycle 4), D4 (5), E4 (6), W4 (7)
Class Exercise 12.1
• During clock cycle 4, what information is held by each of the three interstage buffers (i.e., B1, B2, and B3)?
Figure: the 4-stage pipeline (F → B1 → D → B2 → E → B3 → W) executing I1 to I4 over clock cycles 1 to 7.
4-Stage Pipeline (2/2)
• The four hardware units perform their tasks simultaneously without interfering with one another.
– The required information is passed from one unit to the next through an interstage buffer.
• Each stage should take roughly the same time, because the clock period must fit the slowest stage.
– Why? A unit that completes its task early is idle for the remainder of the clock period.
• Question: What is the ideal speedup of an N-stage pipeline compared to the sequential execution?
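One way to explore this question numerically is sketched below. It assumes that every stage takes exactly one clock cycle, that sequential execution spends one cycle per stage per instruction, and that no stalls occur; the function names are illustrative.

```python
def sequential_cycles(n_instructions, n_stages):
    """Without pipelining, each instruction uses all stages one after another."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction needs n_stages cycles; each later one completes one cycle after."""
    return n_stages + (n_instructions - 1)


def speedup(n_instructions, n_stages):
    return sequential_cycles(n_instructions, n_stages) / pipelined_cycles(n_instructions, n_stages)


print(pipelined_cycles(4, 4))  # 7, as in the 4-stage timing diagram
print(speedup(4, 4))           # 16 / 7, about 2.29
print(speedup(1_000_000, 4))   # approaches 4 for long instruction streams
```

Under these assumptions the speedup approaches the number of stages as the instruction stream grows, while short streams fall well below it because of the time needed to fill the pipeline.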
Reality: Pipeline may Stall
• If a pipeline stage requires more than one cycle, the other stages have to wait (the pipeline is stalled).
– E.g., E2 requires three cycles to complete.
Figure: pipeline timing when E2 takes three cycles (clock cycles 1 to 9).
I1: F1 (cycle 1), D1 (2), E1 (3), W1 (4)
I2: F2 (cycle 2), D2 (3), E2 (4 to 6), W2 (7)
I3: F3 (cycle 3), D3 (4), E3 (7), W3 (8)
I4: F4 (cycle 4), D4 (7), E4 (8), W4 (9)
I5: F5 (cycle 7), D5 (8), E5 (9)
In cycles 5 and 6, the Write, Decode, and Fetch units must wait and do nothing.
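The stalled timing can be reproduced with a small scheduling sketch. It assumes a simple blocking model that is not stated in the slides: each stage holds one instruction, and an instruction stays in its stage until the next stage is free. The names `STAGES` and `schedule` are illustrative.

```python
STAGES = ["F", "D", "E", "W"]


def schedule(durations):
    """durations[i][s] is the number of cycles instruction i needs in stage s.

    Returns starts[i][s], the first clock cycle (1-based) that instruction i
    spends in stage s.
    """
    starts = []
    for i, dur in enumerate(durations):
        row = []
        for s in range(len(STAGES)):
            # Earliest cycle after the instruction's own previous stage is done.
            ready = 1 if s == 0 else row[s - 1] + dur[s - 1]
            if i == 0:
                free = 1
            elif s + 1 < len(STAGES):
                # The stage is free once the predecessor has moved to the next stage.
                free = starts[i - 1][s + 1]
            else:
                # Last stage: free once the predecessor has finished it.
                free = starts[i - 1][s] + durations[i - 1][s]
            row.append(max(ready, free))
        starts.append(row)
    return starts


one = [1, 1, 1, 1]
timing = schedule([one, [1, 1, 3, 1], one, one, one])  # E2 takes three cycles
for number, row in enumerate(timing, start=1):
    print(f"I{number}:", dict(zip(STAGES, row)))
```

With E2 set to three cycles, the sketch places E3 in cycle 7 and F5 in cycle 7, so the Fetch, Decode, and Write units have nothing new to do in cycles 5 and 6.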
Stall & Hazard
• Hazard: any condition that causes the pipeline to stall.
• Another example: a cache miss occurs in F2.
Figure: Instruction execution steps in successive clock cycles.
I1: F1 (cycle 1), D1 (2), E1 (3), W1 (4)
I2: F2 (cycles 2 to 5), D2 (6), E2 (7), W2 (8)
I3: F3 (cycle 6), D3 (7), E3 (8), W3 (9)
Figure: Statuses of processor stages in successive clock cycles.
Clock cycle: 1, 2, 3, 4, 5, 6, 7, 8, 9
F: Fetch: F1, F2, F2, F2, F2, F3
D: Decode: -, D1, idle, idle, idle, D2, D3
E: Execute: -, -, E1, idle, idle, idle, E2, E3
W: Write: -, -, -, W1, idle, idle, idle, W2, W3
Types of Hazards
• Data Hazard
– Either the source or the destination operands of an
instruction are not available when required.
• Instruction Hazard
– A delay in the availability of an instruction (this may
be a result of a miss in the cache).
• Structural Hazard
– Two instructions require the use of a given
hardware resource at the same time.
Data Hazard
• A data hazard is a situation in which the pipeline is stalled because the operands are delayed.
• Example:
I1: A = 3 * A;
I2: B = 4 + A;
– Dependent operations must be performed sequentially to ensure data consistency.
Figure: data hazard timing (clock cycles 1 to 9); D means decode and fetch source operands.
I1 (Mul): F1 (cycle 1), D1 (2), E1 (3), W1 (4)
I2 (Add): F2 (cycle 2), D2 and D2-A (cycles 3 to 5), E2 (6), W2 (7)
I3: F3 (cycle 3), D3 (6), E3 (7), W3 (8)
I4: F4 (cycle 4), D4 (7), E4 (8), W4 (9)
The pipeline is stalled for two cycles.
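The dependency in the Mul/Add example can be detected mechanically. The sketch below checks only the read-after-write case, where the second instruction reads what the first one writes; the `Instr` class and its fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str        # operand written by the instruction
    sources: tuple   # operands read by the instruction


def has_data_hazard(first, second):
    """True if `second` reads the operand that `first` writes (read-after-write)."""
    return first.dest in second.sources


i1 = Instr(dest="A", sources=("A",))  # I1: A = 3 * A
i2 = Instr(dest="B", sources=("A",))  # I2: B = 4 + A
print(has_data_hazard(i1, i2))  # True: I2 needs the new value of A
```

A hardware interlock or a compiler could run this kind of check on adjacent instructions before deciding whether a delay is needed.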
Class Exercise 12.2
• Please specify whether we will encounter data hazards for each of the following instruction pairs.
(a) I1: A = 5 * C;
    I2: B = 20 + C;
(b) I1: C = A * B;
    I2: E = C + D;
Software Solution to Data Hazard
• The compiler detects the dependency and introduces a two-cycle delay by inserting NOP (no-operation) instructions.
– Advantage: simpler hardware, lower cost
– Disadvantage: larger code size, less flexibility, and reduced performance
I1: A = 3 * A;
NOP
NOP
I2: B = 4 + A;
Figure: timing of I1 (Mul), two NOPs, and I2 (Add) over clock cycles 1 to 9.
Question: Do we really avoid stalling the pipeline?
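A compiler pass of this kind might look like the sketch below. It is only an illustration under stated assumptions: instructions are modeled as hypothetical `(dest, sources)` tuples, only read-after-write dependencies are considered, and the required gap between producer and consumer is the two-cycle delay from the slide.

```python
def insert_nops(program, delay=2):
    """Insert "NOP" entries so that at least `delay` slots separate a producer from its consumer.

    program: list of (dest, sources) tuples in program order.
    """
    out = []
    for dest, sources in program:
        # Look back over the last `delay` slots for an instruction that writes a source.
        for distance, prev in enumerate(reversed(out[-delay:]), start=1):
            if prev != "NOP" and prev[0] in sources:
                out.extend(["NOP"] * (delay - distance + 1))
                break
        out.append((dest, sources))
    return out


program = [("A", ("A",)),   # I1: A = 3 * A
           ("B", ("A",))]   # I2: B = 4 + A
print(insert_nops(program))
# [('A', ('A',)), 'NOP', 'NOP', ('B', ('A',))]
```

The output is longer than the input program, which illustrates the code-size cost listed as a disadvantage.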
Hardware Solution to Data Hazard (1/2)
• The data hazard arises because I2 is waiting for data to be written to register A.
• In fact, the result of I1 is already available at the output of the ALU.
• The delay is reduced if the result can be forwarded to E2.
I1: A = 3 * A;
I2: B = 4 + A;
Figure: the stalled timing of I1 (Mul), I2 (Add), I3, and I4 over clock cycles 1 to 9 (D: decode and fetch source operands), annotated "Result of I1 is available here!".
Hardware Solution to Data Hazard (2/2)
• Operand Forwarding: By introducing the forwarding
path, the execution of I2 can proceed without stalling.
– Disadvantage: Additional hardware cost
Figure (a): Datapath (3 buses). Labels: register file, Source 1, Source 2, Destination, SRC1, SRC2, ALU, RSLT, Port A, Port B.
Figure (b): Source and result registers. SRC1, SRC2 → E: Execute (ALU) → RSLT → W: Write (register file).
Instruction Hazard
• Recall: The purpose of the instruction fetch unit is to
supply the execution units with instructions.
– F: Fetch instruction from memory
– D: Decode instruction and fetch source operands
– E: Execute instruction
– W: Write the result
• Instruction hazard: the pipeline stalls because instructions are delayed. Two causes:
1) Cache miss
2) Branch instruction (both unconditional and conditional)
Figure: F (Fetch instruction) → B1 → D (Decode instruction) → B2 → E (Execute operation) → B3 → W (Write results).
Instruction Hazard: Cache Miss
• The effect of a cache miss on the pipelined operation
is as follows:
– I1 is fetched from the cache in cycle 1.
– The fetch operation F2 for I2 results in a cache miss.
• The instruction fetch unit must suspend any further fetch requests until
F2 is completed.
Figure: effect of a cache miss in F2 (clock cycles 1 to 9). F2 takes several cycles, so F3 is postponed until F2 completes, and D2, E2, W2 and D3, E3, W3 are delayed accordingly.
Instruction Hazard: Unconditional Branch
• Branches may also cause the pipeline to stall.
– Branch penalty: the time lost because of a branch instruction.
– The branch penalty can be reduced by computing the branch address earlier, in the Decode stage rather than the Execute stage.
• However, this still results in a one-cycle branch penalty.
Figure: branch address computed in Execute stage (branch penalty: 2 clock cycles).
I1: F1 (cycle 1), D1 (2), E1 (3), W1 (4)
I2 (Branch to Ik): F2 (cycle 2), D2 (3), E2 (4)
I3: F3 (cycle 3), D3 (4), then discarded (X)
I4: F4 (cycle 4), then discarded (X)
Ik: Fk (cycle 5), Dk (6), Ek (7), Wk (8)
Ik+1: Fk+1 (cycle 6), Dk+1 (7), Ek+1 (8)
I3 and I4 must be discarded.
Figure: branch address computed in Decode stage (branch penalty: 1 clock cycle).
I1: F1 (cycle 1), D1 (2), E1 (3), W1 (4)
I2 (Branch to Ik): F2 (cycle 2), D2 (3)
I3: F3 (cycle 3), then discarded (X)
Ik: Fk (cycle 4), Dk (5), Ek (6), Wk (7)
Ik+1: Fk+1 (cycle 5), Dk+1 (6), Ek+1 (7)
Only I3 is discarded.
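The cost of the branch penalty can be estimated with a simple count. This is a rough sketch, assuming a 4-stage pipeline with one cycle per stage, no instruction queue, and no stalls other than taken branches; `total_cycles` is an illustrative name.

```python
def total_cycles(n_instructions, n_taken_branches, branch_penalty, n_stages=4):
    """Ideal pipelined cycle count plus the cycles lost to each taken branch."""
    return n_stages + (n_instructions - 1) + n_taken_branches * branch_penalty


# Three executed instructions (I1, the branch I2, and the target Ik), one taken branch.
print(total_cycles(3, 1, branch_penalty=2))  # 8: address computed in Execute
print(total_cycles(3, 1, branch_penalty=1))  # 7: address computed in Decode
```

These counts correspond to Ik completing its Write step in cycle 8 when the address is computed in the Execute stage and in cycle 7 when it is computed in the Decode stage.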
Solution to Instruction Hazard
• Instruction Queue: The interstage buffer between
Fetch and Decode units can keep multiple instructions.
– Fetch unit gets and deposits one instruction at a time.
– Decode unit consumes one instruction at a time.
Figure: F (Fetch instruction) → Instruction queue → D (Decode instruction) → E (Execute operation) → W (Write results), with interstage buffers between D, E, and W.
Example: Without Instruction Queue
• F4, F5, F6, Fk, and Fk+1 are delayed.
• I1, I2, I3, I4, and Ik cannot complete in successive cycles.
Figure: timing of I1 to I6, Ik, and Ik+1 without an instruction queue (clock cycles 1 to 12). Annotations:
– Instruction 1 takes 3 Execute cycles (F1 D1 E1 E1 E1 W1), i.e., a 2-cycle stall.
– Instruction 4 is delayed, since there is no instruction queue.
– Instruction 5 (Branch to Ik) is a branch.
– Instruction 6 is discarded.
Example: With Instruction Queue
• I6 is still discarded, but the instruction queue avoids delaying F4, F5, F6, Fk, and Fk+1 as long as the queue is not empty.
• I1, I2, I3, I4, and Ik can complete in successive cycles.
Figure: timing of I1 to I6, Ik, and Ik+1 with an instruction queue (clock cycles 1 to 10); the fetch unit keeps fetching during the stall.
Queue length in cycles 1 to 10: 1, 1, 1, 1, 2, 3, 2, 1, 1, 1
Annotations:
– Instruction 1 takes 3 Execute cycles (F1 D1 E1 E1 E1 W1), i.e., a 2-cycle stall.
– The queue length rises to 3 before cycle 6.
– Instruction 5 (Branch to Ik) is a branch.
– Instruction 6 is discarded after the branch is taken.
– The queue length drops to 1 before cycle 8.
Without vs With Instruction Queue
• With an instruction queue, the branch instruction does not increase the overall execution time (if the queue is not empty).
– Instructions can still complete in successive clock cycles.
• The branch address is computed in parallel with other instructions, so no cycles are lost due to the branch.
– This is called branch folding.
• An instruction queue can also hide the effect of a cache miss (if the queue is not empty).
Class Exercise 12.3
• Please show how an instruction queue can hide the effect of a cache miss (three cycles) caused by F4.
Without instruction queue (clock cycles 1 to 12):
I1: F1 D1 E1 E1 E1 W1
I2: F2 D2 E2 W2
I3: F3 D3 E3 W3
I4: (to be completed)
With instruction queue (clock cycles 1 to 12):
I1: F1 D1 E1 E1 E1 W1
I2: F2 D2 E2 W2
I3: F3 D3 E3 W3
I4: (to be completed)
Queue length: 1, 1, 1, (to be completed)
Instruction Hazard: Conditional Branch
• Branch folding does not work for conditional branches.
• Conditional branches may result in an added hazard.
– The condition depends on the result of a preceding instruction.
• Example:
LOOP  Shift_left  R1
      Decrement   R2
      Branch=0    LOOP
NEXT  Add         R1,R3
– R2 is used as the branch condition.
– We need to wait for R2 to determine whether to take the conditional branch.
Figure: timing of I1 (Shift), I2 (Decrement), I3 (Branch if R2 = 0), and Ik (LOOP) over clock cycles 1 to 10. D3 must wait for R2 (D3-R2). Annotation: "All intermediate instructions must be discarded."
Solution 1) Delayed Branch (1/2)
• The location following a branch instruction is called a branch delay slot.
• Delayed branching can minimize the penalty by
– placing useful instructions in branch delay slots, and
– internally reordering the instructions.
(a) Original program loop
LOOP  Shift_left  R1
      Decrement   R2
      Branch=0    LOOP
NEXT  Add         R1,R3
(b) Internally reordered instructions (actual program logic NOT affected)
LOOP  Decrement   R2
      Branch=0    LOOP
      Shift_left  R1      (branch delay slot)
NEXT  Add         R1,R3
Solution 1) Delayed Branch (2/2)
• Delayed branching can minimize the branch penalty.
Figure: delayed-branch timing over clock cycles 1 to 10. The instruction sequence is Decrement, Branch=0?, Shift (delay slot), Decrement (branch is taken), Branch=0?, Shift (delay slot), and NEXT: Add (branch not taken). Annotations: Daddr (get branch address) and ALU result forwarding.
Solution 2) Branch Prediction (1/2)
• Attempt to predict whether a conditional branch will be taken.
– Delayed branching can be applied together with prediction.
• Branch prediction:
– If we get it right: no lost cycles.
• Registers and memory cannot be updated until we know we got it right.
– If we get it wrong: just cancel the instructions.
– Branch prediction can be dynamic or static.
Figure: timing of I1 (Compare), I2 (Branch>0), I3 (branch delay slot), I4, and Ik over clock cycles 1 to 6. I2 is decoded and predicted in the same step (D2 / P2). With an incorrect prediction, I3 and I4 are cancelled (X) before Ik is fetched; with a correct prediction, Ik is fetched without lost cycles.
Solution 2) Branch Prediction (2/2)
• Static Branch Prediction
– The same choice is used every time the conditional branch
is encountered.
– For example, a branch instruction at the end of a loop
causes a branch to the start of the loop for every pass
through the loop except the last one.
• In this case, it is helpful to assume that the branch will be taken.
– A flexible approach is to have the compiler decide.
• Dynamic Branch Prediction
– The choice is influenced by the past behavior.
– For example, a simple prediction is to use the result of the
most recent execution of the branch instruction.
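The two schemes can be compared on a loop-closing branch. The sketch below is a toy model, not a description of real predictor hardware: the static predictor always guesses "taken", the dynamic predictor repeats the most recent outcome (guessing "taken" when there is no history, which is an assumption), and the outcome trace is made up for illustration.

```python
def static_taken(history):
    """Static prediction: always guess that the branch is taken."""
    return True


def last_outcome(history):
    """Dynamic prediction: repeat the most recent outcome (guess taken if none yet)."""
    return history[-1] if history else True


def count_mispredictions(predict, outcomes):
    misses = 0
    history = []
    for actual in outcomes:
        if predict(history) != actual:
            misses += 1
        history.append(actual)
    return misses


# A loop-closing branch: taken four times, then not taken; the loop runs twice.
outcomes = [True] * 4 + [False] + [True] * 4 + [False]
print(count_mispredictions(static_taken, outcomes))  # 2
print(count_mispredictions(last_outcome, outcomes))  # 3
```

On this trace the last-outcome predictor misses once on each loop exit and once more when the loop is re-entered, so the static choice does slightly better here; other traces can favor the dynamic scheme.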
Structural Hazard
• A structural hazard is the situation in which two instructions require the use of the same hardware resource at the same time.
• The most common case involves access to memory.
– Case 1: One instruction accesses memory during the Execute or Write stage while another is being fetched.
– Solution 1: Many processors use separate instruction and data caches to avoid this delay.
– Case 2: Two instructions require access to the register file at the same time.
– Solution 2: Give the register file more input/output ports.
• In general, structural hazards can be avoided by providing sufficient hardware resources ($$$).
Class Exercise 12.4
• What is the cause of the following structural hazard?
Superscalar Operation (1/2)
• Superscalar: execute multiple instructions at the same time via multiple processing units (i.e., we can execute more than one instruction per cycle).
Figure: F (Instruction fetch unit) → Instruction queue → Decode / Dispatch unit → Floating-point unit and Integer unit → W (Write results). Two instructions are fetched at a time and two are decoded at a time.
Figure: superscalar timing (clock cycles 1 to 6).
I1 (FracAdd): F1 (cycle 1), D1 (2), E1A (3), E1B (4), E1C (5), W1 (6)
I2 (Add): F2 (cycle 1), D2 (2), E2 (3), W2 (4)
Superscalar Operation (2/2)
• Superscalar operation may result in out-of-order execution and cause data consistency issues.
– In our previous example, I1 and I2 are dispatched in the same order as they appear in the program.
– However, their execution is completed out of order.
– To guarantee a consistent state when out-of-order execution occurs, the results of instructions must be written strictly in program order.
• Out-of-order execution is also a common technique for making use of otherwise idle instruction cycles by reordering instructions.
– E.g., delayed branching reorders the instructions to minimize the branch penalty.
Out-of-Order Execution
R1 ← mem[r0]   /* Instruction 1 */
R2 ← R1 + R2   /* Instruction 2 */
R5 ← R5 + 1    /* Instruction 3 */
R6 ← R6 - R3   /* Instruction 4 */
• Instruction 1 results in a cache miss, and a cache miss can stall the entire processor for 20-30 cycles.
• Instruction 2 cannot be executed since it needs R1.
• In the instruction queue, look ahead and find instructions 3 and 4 to execute first (reordering).
R1 ← mem[r0]   /* Instruction 1 */
R5 ← R5 + 1    /* Instruction 3 */
R6 ← R6 - R3   /* Instruction 4 */
R2 ← R1 + R2   /* Instruction 2 */
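The look-ahead step can be sketched as a dependency check over the queued instructions. This is a simplified, conservative model rather than how any particular processor does it: instructions are hypothetical `(name, dest, sources)` tuples, `unavailable` lists registers whose values are still pending (here R1, being loaded by Instruction 1), and an instruction is held back if it reads a pending value or would overwrite a register that a held-back instruction still uses.

```python
def issue_order(instructions, unavailable):
    """Return instruction names: those ready to run first, then those that must wait.

    instructions: list of (name, dest, sources) in program order.
    unavailable: registers whose values are not yet available.
    """
    ready, waiting = [], []
    pending = set(unavailable)   # registers whose new values are delayed
    blocked_reads = set()        # registers still to be read by waiting instructions
    for name, dest, sources in instructions:
        depends = pending & set(sources)                     # needs a delayed value
        clobbers = dest in blocked_reads or dest in pending  # would overwrite too early
        if depends or clobbers:
            waiting.append(name)
            pending.add(dest)
            blocked_reads.update(sources)
        else:
            ready.append(name)
    return ready + waiting


queued = [
    ("I2", "R2", ("R1", "R2")),  # R2 <- R1 + R2
    ("I3", "R5", ("R5",)),       # R5 <- R5 + 1
    ("I4", "R6", ("R6", "R3")),  # R6 <- R6 - R3
]
print(["I1"] + issue_order(queued, unavailable={"R1"}))
# ['I1', 'I3', 'I4', 'I2']
```

The resulting order matches the reordered listing: instructions 3 and 4 run while the load is outstanding, and instruction 2 follows once R1 arrives.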
Recall: With Instruction Queue
• I6 is still discarded, but the instruction queue avoids delaying F4, F5, F6, Fk, and Fk+1 as long as the queue is not empty.
• I1, I2, I3, I4, and Ik can complete in successive cycles.
Figure: the timing diagram of the instruction queue example (I1 to I6, Ik, and Ik+1 over clock cycles 1 to 10; queue length 1, 1, 1, 1, 2, 3, 2, 1, 1, 1), with the added annotation "Decode two instructions at a time".
Summary
• Sequential Execution vs Pipelining
• Pipeline Stall: Hazard
– Data Hazard
– Instruction Hazard
– Structural Hazard
• Superscalar Operation