Chapter 8 - Pipelining
TRANSCRIPT
-
7/29/2019 Chapter 8 - Pipelining
1/38
Pipelining
-
Overview
Pipelining is widely used in modern
processors.
Pipelining improves system performance in terms of throughput.
Pipelined organization requires sophisticated
compilation techniques.
-
Basic Concepts
-
Making the Execution of Programs Faster
Use faster circuit technology to build the
processor and the main memory.
Arrange the hardware so that more than one operation can be performed at the same time.
In the latter way, the number of operations
performed per second is increased.
-
Use the Idea of Pipelining in a Computer
Figure 8.1. Basic idea of instruction pipelining:
(a) Sequential execution of instructions I1, I2, I3 (fetch Fi, then execute Ei).
(b) Hardware organization: instruction fetch unit, interstage buffer B1, execution unit.
(c) Pipelined execution over clock cycles 1-4: while I1 executes (E1), I2 is fetched (F2).
This is a two-stage (Fetch + Execution) pipeline.
-
Use the Idea of Pipelining in a Computer
Figure 8.2 (textbook page 457). A 4-stage pipeline:
(a) Instruction execution divided into four steps:
F : Fetch instruction
D : Decode instruction and fetch operands
E : Execute operation
W : Write results
(b) Hardware organization, with interstage buffers between the stages.
This is a Fetch + Decode + Execute + Write pipeline.
-
Role of Cache Memory
Each pipeline stage is expected to complete in one
clock cycle.
The clock period should be long enough for the slowest pipeline stage to complete; faster stages must wait for the slowest one.
Since main memory is very slow compared to execution, if each instruction had to be fetched from main memory, the pipeline would be almost useless.
Fortunately, we have cache memory.
-
Pipeline Performance
The potential increase in performance
resulting from pipelining is proportional to the
number of pipeline stages.
However, this increase is achieved only if all pipeline stages require the same time to complete and there is no interruption throughout program execution. Unfortunately, this is not true.
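The ideal speedup claim above can be made concrete with a rough model (a sketch, not from the textbook): with equal stage delays and no stalls, an n-stage pipeline approaches an n-times throughput improvement as the instruction count grows.

```python
# Ideal pipeline speedup model: k instructions on an n-stage pipeline
# versus a non-pipelined processor with the same total latency.
# Assumes equal stage delays and no stalls.

def sequential_time(k, n, stage_delay=1.0):
    return k * n * stage_delay          # each instruction takes all n stages

def pipelined_time(k, n, stage_delay=1.0):
    return (n + k - 1) * stage_delay    # fill the pipe, then one result per cycle

k, n = 1000, 4
speedup = sequential_time(k, n) / pipelined_time(k, n)
print(round(speedup, 2))  # approaches n = 4 as k grows
```

For k = 1000 and n = 4 the speedup is already close to 4; the limit is exactly n.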
-
Pipeline Performance
Figure 8.3. An instruction taking more than one clock cycle:
instruction I2 needs three cycles in the Execute stage (E2), so the instructions that follow it are delayed.
-
Pipeline Performance
The previous pipeline is said to have been stalled for two clock cycles.
Any condition that causes a pipeline to stall is called a hazard.
Data hazard: any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline, so some operation has to be delayed and the pipeline stalls.
Instruction (control) hazard: a delay in the availability of an instruction causes the pipeline to stall.
Structural hazard: the situation when two instructions require the use of a given hardware resource at the same time.
-
Pipeline Performance
Figure 8.4. Pipeline stall caused by a cache miss in F2:
(a) Instruction execution steps in successive clock cycles.
(b) Function performed by each processor stage in successive clock cycles.
Stages: F (Fetch), D (Decode), E (Execute), W (Write).
The idle periods are the stalls (bubbles) produced by this instruction hazard.
-
Pipeline Performance
Figure 8.5. Effect of a Load instruction on pipeline timing:
the instruction Load X(R1), R2 requires an extra cycle for the memory access, delaying the instructions that follow it. This is an example of a structural hazard.
-
Pipeline Performance
Again, pipelining does not result in individual
instructions being executed faster; rather, it is the
throughput that increases.
Throughput is measured by the rate at which instruction execution is completed.
Pipeline stall causes degradation in pipeline
performance.
We need to identify all hazards that may cause the
pipeline to stall and to find ways to minimize their
impact.
-
Quiz
Four instructions; I2 takes two clock cycles to execute. Please draw the figure for a 4-stage pipeline, and work out the total number of cycles needed for the four instructions to complete.
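The quiz can be checked with a small cycle-count sketch (my own model, not from the textbook): without stalls, n instructions take (stages + n - 1) cycles, and each extra Execute cycle stalls the pipeline by one more cycle.

```python
# Total clock cycles for a 4-stage pipeline (F, D, E, W) when some
# instructions need extra Execute cycles. Assumes in-order issue and
# that every extra E cycle stalls the instructions behind it.

def total_cycles(n_instructions, execute_cycles):
    """execute_cycles[i] = number of E cycles instruction i needs."""
    stages = 4                                   # F, D, E, W
    base = stages + n_instructions - 1           # stall-free completion time
    extra = sum(c - 1 for c in execute_cycles)   # one stall per extra E cycle
    return base + extra

# Quiz: four instructions, I2 takes two Execute cycles.
print(total_cycles(4, [1, 2, 1, 1]))  # 8 cycles
```

Without the extra E cycle the four instructions would finish in 7 cycles; the one-cycle stall from I2 makes it 8.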
-
Data Hazards
-
Data Hazards
We must ensure that the results obtained when instructions are executed in a pipelined processor are identical to those obtained when the same instructions are executed sequentially.
Hazard occurs:
A ← 3 + A
B ← 4 × A
No hazard:
A ← 5 × C
B ← 20 + C
When two operations depend on each other, they must be
executed sequentially in the correct order. Another example:
Mul R2, R3, R4
Add R5, R4, R6
-
Data Hazards
Figure 8.6. Pipeline stalled by data dependency between D2 and W1:
I1 (Mul R2, R3, R4) writes R4 in W1, so I2 (Add R5, R4, R6) cannot complete its operand-fetch step D2 until W1 finishes.
-
Operand Forwarding
Instead of reading from the register file, the second instruction can get the data directly from the output of the ALU once the previous instruction has completed.
A special arrangement is needed to forward the output of the ALU back to the input of the ALU.
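The forwarding decision can be sketched in a few lines (the function and variable names here are my own illustration, not the textbook's hardware): if the previous instruction's destination register matches the register being read, take the value from the ALU output instead of the register file.

```python
# Operand forwarding sketch: bypass the register file when the value
# being read was just produced by the previous instruction's ALU operation.

def read_operand(reg, regfile, prev_dest, alu_output):
    if reg == prev_dest:          # RAW dependence on the previous instruction
        return alu_output         # forward straight from the ALU output
    return regfile[reg]           # otherwise read the register file as usual

regfile = {"R3": 5, "R4": 0}      # R4 has not yet been written back
# Suppose Mul R2, R3, R4 just produced 15 at the ALU output;
# the following Add reads R4 and gets the forwarded value:
print(read_operand("R4", regfile, prev_dest="R4", alu_output=15))  # 15
```

In hardware this is a multiplexer at the ALU input, selected by comparing register numbers in the interstage buffers.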
-
-
Handling Data Hazards in Software
Let the compiler detect and handle the hazard:
I1: Mul R2, R3, R4
    NOP
    NOP
I2: Add R5, R4, R6
The compiler can reorder the instructions to perform some useful work during the NOP slots.
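A compiler pass of this kind can be sketched as follows (the tuple format and the fixed two-slot gap are simplifying assumptions of mine, not the textbook's): scan the program and pad with NOPs whenever an instruction reads a register written too recently.

```python
# Compiler-side hazard handling sketch: insert NOPs between an
# instruction and a later one that reads its destination register
# too soon. Instruction format: (opcode, src1, src2, dest).

def insert_nops(program, gap=2):
    out = []
    for instr in program:
        for back in range(1, gap + 1):       # look back up to `gap` slots
            if len(out) >= back:
                prev = out[-back]
                if prev[0] != "NOP" and prev[3] in (instr[1], instr[2]):
                    # pad so the dependent pair is `gap` slots apart
                    out.extend([("NOP",)] * (gap - back + 1))
                    break
        out.append(instr)
    return out

prog = [("Mul", "R2", "R3", "R4"), ("Add", "R5", "R4", "R6")]
print([i[0] for i in insert_nops(prog)])  # ['Mul', 'NOP', 'NOP', 'Add']
```

A real compiler would then try to fill those NOP slots with independent instructions moved from elsewhere.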
-
Side Effects
The previous example is explicit and easily detected.
Sometimes an instruction changes the contents of a register other than the one named as the destination.
When a location other than one explicitly named in an instruction as a destination operand is affected, the instruction is said to have a side effect. (Example?)
Example: condition code flags:
Add R1, R3
AddWithCarry R2, R4
Instructions designed for execution on pipelined hardware should have few side effects.
-
Instruction Hazards
-
Time lost as a result of a branch instruction is referred to as the branch penalty.
Here the branch penalty is one clock cycle.
For a longer pipeline, the branch penalty may be higher.
Reducing the branch penalty requires the branch address to be computed earlier in the pipeline.
-
The instruction fetch unit should have dedicated hardware to identify branch instructions and compute the branch target address as quickly as possible after the instruction is fetched.
So this task is performed in the decode stage.
Explanation of Figures 8.8, 8.9(a), and 8.9(b).
-
Unconditional Branches
Figure 8.8. An idle cycle caused by a branch instruction.
Figure 8.9. Branch timing (the execution unit is idle while the branch target is fetched).
-
Branch Timing
- Branch penalty
- Reducing the penalty
-
Instruction Queue and Prefetching
Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b:
F : Fetch instruction
D : Dispatch/Decode unit
E : Execute instruction
W : Write results
The instruction fetch unit places fetched instructions in an instruction queue ahead of the dispatch/decode unit.
-
To reduce the effect of cache misses and stalls, processors employ a sophisticated fetch unit.
Fetched instructions are placed in an instruction queue before they are needed.
The instruction queue can store several instructions.
The dispatch unit takes instructions from the front of the queue and sends them to the execution unit.
Explanation: page 468, 2nd paragraph (exam must).
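The fetch/dispatch arrangement above can be sketched as a simple bounded queue (the structure and names below are my own illustration, not the textbook's hardware): the fetch unit keeps filling the queue even while the execution unit is stalled, and the dispatch unit always takes the oldest instruction from the front.

```python
# Instruction queue sketch: fetch unit appends at the back,
# dispatch unit removes from the front, in program order.

from collections import deque

queue = deque(maxlen=8)               # the queue holds several instructions

def fetch(instr):                     # fetch unit fills the back of the queue
    if len(queue) < queue.maxlen:
        queue.append(instr)
        return True
    return False                      # queue full: fetch unit must wait

def dispatch():                       # dispatch unit takes from the front
    return queue.popleft() if queue else None

for i in ["I1", "I2", "I3"]:
    fetch(i)
print(dispatch())  # I1 (oldest instruction is dispatched first)
```

Because fetching and dispatching are decoupled, a stall on the execute side does not stop the fetch unit until the queue fills up.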
-
A branch instruction may be executed concurrently with other instructions. This technique is called branch folding.
Branch folding can occur only if, at the time a branch instruction is encountered, at least one instruction other than the branch is available in the queue.
-
Conditional Branches
A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction.
The decision to branch cannot be made until the execution of that instruction has been completed.
Branch instructions represent about 20% of the dynamic instruction count of most programs.
-
Branch instructions can be handled in several ways to reduce the branch penalty:
Delayed branch
Branch prediction
Dynamic branch prediction
-
DELAYED BRANCH
The location following a branch instruction is called the branch delay slot.
There may be more than one branch delay slot, depending on the time needed to execute a branch instruction.
A technique called delayed branching can minimize the penalty incurred as a result of branch instructions.
-
-
Arrange the instructions so that a useful instruction is placed in the delay slot and is executed whether or not the branch is taken.
If no useful instruction can be placed in the delay slot, the slot is filled with NOP instructions.
The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions.
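The reordering step can be sketched as a small scheduling pass (a simplified model of mine with an invented instruction format, assuming a single delay slot): move an earlier instruction that the branch does not depend on into the slot, otherwise fall back to a NOP.

```python
# Delayed-branch scheduling sketch: fill the branch delay slot with a
# useful earlier instruction if one can safely be moved past the branch.

def fill_delay_slot(before_branch, branch):
    for i, instr in enumerate(before_branch):
        if instr["dest"] not in branch["uses"]:   # branch does not read it
            moved = before_branch[:i] + before_branch[i + 1:]
            return moved + [branch, instr]        # instr now fills the slot
    # no safe instruction found: the slot is filled with a NOP
    return before_branch + [branch, {"op": "NOP", "dest": None}]

code = [{"op": "Add", "dest": "R2"}, {"op": "Shift", "dest": "R3"}]
branch = {"op": "Branch", "uses": {"R2"}}
print([i["op"] for i in fill_delay_slot(code, branch)])
# ['Add', 'Branch', 'Shift'] -- Shift executes in the delay slot
```

The Add cannot move because the branch condition depends on R2, but the Shift is independent and does useful work whether or not the branch is taken.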
-
BRANCH PREDICTION
In the simplest form of branch prediction, assume that the branch will not take place and continue to fetch instructions in sequential address order.
Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution sequence.
Hence care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed.
-
Branch prediction is classified into two types:
1. Static branch prediction
The prediction is the same every time a given branch instruction is executed.
2. Dynamic branch prediction
The prediction decision may change depending on execution history.
-
2-state algorithm
-
4-state algorithm
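The 4-state scheme is commonly implemented as a 2-bit saturating counter; the sketch below shows the standard form of that algorithm (the class name and state encoding are my own choices): states 0-1 predict "not taken", states 2-3 predict "taken", and each outcome moves the state one step, so a single mispredict does not flip a strongly held prediction.

```python
# 4-state (2-bit saturating counter) branch predictor sketch.
# State 0: strongly not taken, 1: weakly not taken,
# state 2: weakly taken,      3: strongly taken.

class TwoBitPredictor:
    def __init__(self):
        self.state = 1                      # start weakly "not taken"

    def predict(self):
        return self.state >= 2              # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at 3
        else:
            self.state = max(0, self.state - 1)   # saturate at 0

p = TwoBitPredictor()
for outcome in [True, True, False, True]:   # e.g. a loop branch
    print(p.predict(), end=" ")
    p.update(outcome)
# prints: False True True True
```

Note the third prediction: even though the previous outcome was "not taken", the counter had only dropped to the weakly-taken state, so the predictor still says "taken". A 2-state (1-bit) predictor would have flipped immediately.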