Computing Systems Pipelining: enhancing performance


Page 1: Computing Systems Pipelining: enhancing performance

[email protected] 1

Computing Systems

Pipelining: enhancing performance

Page 2: Computing Systems Pipelining: enhancing performance

2

Pipelining

A technique in which multiple instructions are overlapped in execution: the instructions' steps can be carried out in parallel

Texec = 2400 ps (nonpipelined) vs. Texec = 1400 ps (pipelined)
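The two execution times above can be reproduced with a short calculation. This is a sketch using the classic textbook numbers (an assumption, since the slide figure is not shown): three instructions, an 800 ps single-cycle implementation, and a five-stage pipeline clocked at 200 ps, the latency of the slowest stage.

```python
def pipelined_total_time(n_instr, n_stages, cycle_ps):
    # The first instruction takes n_stages cycles to flow through the
    # pipeline; after that, one instruction completes every cycle.
    return (n_stages + n_instr - 1) * cycle_ps

single_cycle = 3 * 800                       # 2400 ps for 3 instructions
pipelined = pipelined_total_time(3, 5, 200)  # (5 + 3 - 1) * 200 = 1400 ps
```

Note the pipelined speedup here is only 2400/1400, not 5x: with so few instructions the fill time of the pipeline dominates.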

Page 3: Computing Systems Pipelining: enhancing performance

3

Pipelining

Improve performance by increasing instruction throughput, as opposed to decreasing the execution time (= latency) of an individual instruction
- increasing throughput decreases the total time to complete the work

Ideal speedup is the number of stages in the pipeline. Do we achieve this?
- stages may be imperfectly balanced
- pipelining involves some overhead

Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipe stages
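As a numeric check of this formula (a sketch; the 800 ps and 200 ps figures are assumed example values consistent with the timing on slide 2): with five stages the ideal time between pipelined instructions would be 160 ps, but imbalanced stages force the clock to fit the slowest stage.

```python
def ideal_pipelined_time(nonpipelined_ps, n_stages):
    # Ideal case only: perfectly balanced stages, no pipelining overhead.
    return nonpipelined_ps / n_stages

ideal = ideal_pipelined_time(800, 5)  # 160 ps between instructions
# In practice the clock must fit the slowest stage, e.g. 200 ps,
# so the achieved speedup is 800 / 200 = 4x, below the ideal 5x.
achieved_speedup = 800 / 200
```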

Page 4: Computing Systems Pipelining: enhancing performance

4

Pipelining

What makes it easy (designing instruction sets for pipelining)?
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores

What makes it hard?
- sometimes the next instruction cannot be started in the next cycle (hazards)
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction

We'll build a simple pipeline and look at these issues. Instructions supported: lw, sw, add, sub, and, or, slt, beq

Page 5: Computing Systems Pipelining: enhancing performance

5

Basic idea

Basic idea: take a single-cycle datapath and separate it into pieces


Stylized Datapath; the drawing leaves out some details

Page 6: Computing Systems Pipelining: enhancing performance

6

Pipelined datapath

There is a bug! Can you find it? What instructions can we execute to manifest the bug?

Instructions and data move from left to right (with two exceptions)

Page 7: Computing Systems Pipelining: enhancing performance

7

Corrected datapath

For the load instruction we need to preserve the destination register number until the data is read from the MEM/WB pipeline register

Page 8: Computing Systems Pipelining: enhancing performance

8

Graphically representing pipelines

Pipelining can be difficult to understand:
- every clock cycle, many instructions are simultaneously executing in a single datapath

To aid understanding, there are two basic styles of pipeline figures:
- multiple-clock-cycle pipeline diagrams
- single-clock-cycle pipeline diagrams

These can help with answering questions like:
- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?
- and they help in understanding datapaths

We highlight the right half of registers or memory when they are being read, and the left half when they are being written.

Page 9: Computing Systems Pipelining: enhancing performance

9

Multiple-clock cycle diagrams: graphical view

Page 10: Computing Systems Pipelining: enhancing performance

10

Multiple-clock cycle diagrams: traditional view

Page 11: Computing Systems Pipelining: enhancing performance

11

Single-clock-cycle diagrams: the pipeline at a particular time instant

Page 12: Computing Systems Pipelining: enhancing performance

12

Pipeline operation

One operation begins in every cycle; also, one operation completes in each cycle
Each instruction takes 5 cycles
- k cycles in general, where k is the depth of the pipeline
In one clock cycle, several instructions are active
- different stages are executing different instructions
- when a stage is not used, no control needs to be applied

Issue: how do we generate the control signals?
- we need to set the control values for each pipeline stage for each instruction
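The overlap described above can be sketched with a tiny helper that answers questions like "what is instruction i doing during cycle c?" (an illustrative sketch; the 0-based instruction index and 1-based cycle numbering are my own conventions, and it assumes one instruction issued per cycle with no stalls):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based, issued one
    per cycle, no stalls) during the given 1-based clock cycle."""
    s = cycle - instr_index - 1
    return STAGES[s] if 0 <= s < len(STAGES) else None
```

For example, `stage_of(1, 4)` returns "EX": the second instruction is using the ALU during cycle 4.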

Page 13: Computing Systems Pipelining: enhancing performance

13

Pipeline Control

Note: we moved the position of the destination register

Page 14: Computing Systems Pipelining: enhancing performance

14

Pipeline Control

We have 5 stages. What needs to be controlled in each stage?

Instruction Fetch and PC Increment
- the control signals to read IM and write the PC are always asserted, so there is nothing special to control in this pipeline stage

Instruction Decode / Register Fetch
- the same thing happens at every clock cycle, so there are no optional control lines to set

Execution / Address Calculation
- control lines set in this stage: RegDst, ALUOp, and ALUSrc

Memory Access
- control lines set in this stage: Branch, MemRead, and MemWrite

Write Back
- control lines set in this stage: MemtoReg and RegWrite

Page 15: Computing Systems Pipelining: enhancing performance

15

Pipeline Control

Since the control signals are needed from the execution stage on:
- we can generate the control signals during the instruction decode stage, and
- pass them along in the pipeline registers, just like the data

The nine control lines, grouped by the stage in which they are used:
- Execution/Address Calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc
- Memory access stage control lines: Branch, MemRead, MemWrite
- Write-back stage control lines: RegWrite, MemtoReg

Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0       0       0        0         1         0
lw           0       0       0       1       0       1        0         1         1
sw           X       0       0       1       0       0        1         0         X
beq          X       0       1       0       1       0        0         0         X

We have nine control lines.

Page 16: Computing Systems Pipelining: enhancing performance

16

Pipeline datapath with control

Page 17: Computing Systems Pipelining: enhancing performance

17

Dependencies

Problem with starting the next instruction before the first is finished:

- dependencies that "go backward in time" are data hazards

Page 18: Computing Systems Pipelining: enhancing performance

18

Software solution

Have the compiler guarantee no hazards. Where do we insert the "nops"?

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Problem: this really slows us down!

Two nops needed here (between the sub and the and)!!

Page 19: Computing Systems Pipelining: enhancing performance

19

Hardware solution: Forwarding

Use temporary results; don't wait for them to be written back
- ALU forwarding (EX hazard)
- read/write to the same register (MEM hazard)

What if this $2 were $13?

Page 20: Computing Systems Pipelining: enhancing performance

20

Forwarding logic

Forwarding from the EX/MEM register:

if (EX/MEM.RegWrite                            // instruction writes to a register
    and (EX/MEM.RegisterRd != 0)               // not if the destination is $zero
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 10

if (EX/MEM.RegWrite                            // instruction writes to a register
    and (EX/MEM.RegisterRd != 0)               // not if the destination is $zero
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 10

Page 21: Computing Systems Pipelining: enhancing performance

21

Forwarding logic

Forwarding from the MEM/WB register:

if (MEM/WB.RegWrite                            // instruction writes to a register
    and (MEM/WB.RegisterRd != 0)               // not if the destination is $zero
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
  ForwardA = 01

if (MEM/WB.RegWrite                            // instruction writes to a register
    and (MEM/WB.RegisterRd != 0)               // not if the destination is $zero
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
  ForwardB = 01

Almost true!!! There is a bug!!!

Page 22: Computing Systems Pipelining: enhancing performance

22

Forwarding logic

Let's consider a sequence of instructions that all read and write the same register.

According to the previous policy, since MEM/WB.RegisterRd = ID/EX.RegisterRs, we "should" forward from MEM/WB.

But this time the more recent result is in the EX/MEM register, not in MEM/WB.

Thus, we have to forward from the EX/MEM register (fortunately, we already know how to do that!!)

Page 23: Computing Systems Pipelining: enhancing performance

23

Forwarding logic

Forwarding from the MEM/WB register (corrected version):

if (MEM/WB.RegWrite                            // instruction writes to a register
    and (MEM/WB.RegisterRd != 0)               // not if the destination is $zero
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)
    and (EX/MEM.RegisterRd != ID/EX.RegisterRs))
  ForwardA = 01

if (MEM/WB.RegWrite                            // instruction writes to a register
    and (MEM/WB.RegisterRd != 0)               // not if the destination is $zero
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)
    and (EX/MEM.RegisterRd != ID/EX.RegisterRt))
  ForwardB = 01

Make sure the latest value is not in EX/MEM
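The two forwarding cases, including the correction, can be combined into a single sketch of the forwarding unit's decision (Python used purely for illustration; the dictionary field names mirror the pseudocode above):

```python
def forward_controls(ex_mem, mem_wb, rs, rt):
    """Return (ForwardA, ForwardB) mux selects:
    '00' = register file, '10' = EX/MEM, '01' = MEM/WB."""
    fa, fb = "00", "00"
    # EX hazard: the most recent result sits in the EX/MEM register.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0:
        if ex_mem["Rd"] == rs:
            fa = "10"
        if ex_mem["Rd"] == rt:
            fb = "10"
    # MEM hazard: forward from MEM/WB only when EX/MEM does not hold
    # a more recent value for the same register (the corrected check).
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0:
        if mem_wb["Rd"] == rs and ex_mem["Rd"] != rs:
            fa = "01"
        if mem_wb["Rd"] == rt and ex_mem["Rd"] != rt:
            fb = "01"
    return fa, fb
```

For the back-to-back sequence writing $1 three times, both pipeline registers hold $1 but the unit picks EX/MEM, the newer value: `forward_controls({"RegWrite": True, "Rd": 1}, {"RegWrite": True, "Rd": 1}, 1, 4)` yields `("10", "00")`.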

Page 24: Computing Systems Pipelining: enhancing performance

24

Forwarding unit

The main idea (some details not shown)


Page 25: Computing Systems Pipelining: enhancing performance

25

Forwarding unit

Mux control    Source   Comment
ForwardA = 00  ID/EX    The first ALU operand comes from the register file
ForwardA = 10  EX/MEM   The first ALU operand is forwarded from the prior ALU result
ForwardA = 01  MEM/WB   The first ALU operand is forwarded from data memory or an earlier ALU result
ForwardB = 00  ID/EX    The second ALU operand comes from the register file
ForwardB = 10  EX/MEM   The second ALU operand is forwarded from the prior ALU result
ForwardB = 01  MEM/WB   The second ALU operand is forwarded from data memory or an earlier ALU result

Page 26: Computing Systems Pipelining: enhancing performance

26

Can’t always forward!!!

A load word instruction can still cause a hazard:
- an instruction tries to read a register right after a load instruction that writes to the same register

Thus, we need a hazard detection unit to "stall" the instruction that follows the load

This hazard cannot be solved by forwarding; we must stall (insert a nop)

Page 27: Computing Systems Pipelining: enhancing performance

27

Stall logic

Hazard detection unit:

if (ID/EX.MemRead
    and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
      or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
  stall the pipeline

We can stall by letting an instruction that won't do anything go forward:
- deassert the control lines (this way the instruction has no effect and acts like a bubble in the pipeline)
- prevent the following instructions from being fetched
  - this is accomplished simply by preventing the PC register and the IF/ID register from changing

Note:
- the only instruction that reads data memory is load
- the destination of the load instruction is in the Rt field
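The stall condition above is a single boolean test. A minimal sketch (Python for illustration; the parameter names follow the pseudocode fields):

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    # Stall when the instruction in EX is a load (MemRead asserted) and
    # the instruction in ID reads the register that the load will write.
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)
```

For example, `lw $2, 20($1)` followed immediately by `and $4, $2, $5` gives `load_use_stall(True, 2, 2, 5)`, which is True: one bubble is inserted.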

Page 28: Computing Systems Pipelining: enhancing performance

28

Pipeline with hazard detection unit (some details not shown)

Page 29: Computing Systems Pipelining: enhancing performance

29

Branch Hazards (= control hazards)

When we decide to branch, other instructions are already in the pipeline!

Page 30: Computing Systems Pipelining: enhancing performance

30

Solutions to branch hazard

Branch stalling (software)
- easy but inefficient

Static branch prediction: assume "branch not taken"
- we need to add hardware for flushing instructions if we are wrong
- we must discard the instructions in the IF, ID, and EX stages (change the control values to 0)

Reducing the branch delay penalty
- move the branch decision earlier (to the ID stage)
- compare the two registers read in the ID stage (comparison for equality requires few extra gates)
- still need to flush the instruction in the IF/ID register: clearing the register transforms the fetched instruction into a nop

Make the hazard into a feature: delayed branch slot
- always execute the instruction following the branch

Page 31: Computing Systems Pipelining: enhancing performance

31

Branch detection in the ID stage: the branch target computation has been moved ahead

Page 32: Computing Systems Pipelining: enhancing performance

32

Delayed branch (MIPS)

A "branch delay slot" that the compiler tries to fill with a useful instruction (making the one-cycle delay part of the ISA)

(Figure: choices for filling the delay slot. Scheduling an instruction from before the branch is the best solution; scheduling from the branch target helps when the branch is mostly taken.)

Page 33: Computing Systems Pipelining: enhancing performance

33

Branches

If the branch is taken, we may have a penalty of one cycle
- for our simple design, this is reasonable
- with deeper pipelines, the penalty increases and static branch prediction drastically hurts performance

Solution: dynamic branch prediction (keep track of branch history)
- branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
- modern processors predict correctly 95% of the time!

Example: a loop branch that is taken 9 times in a row, then is not taken. Assume a 1-bit predictor.
- we will fail the prediction the first and last time: prediction accuracy is 80%
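The 80% figure can be checked by simulating a 1-bit predictor, which simply remembers the last outcome (a sketch; the function and variable names are my own). In steady state the predictor enters each loop execution remembering "not taken" from the previous exit, so it mispredicts the first and the last of the 10 branches.

```python
def one_bit_mispredicts(outcomes, start_taken):
    """Count mispredictions of a 1-bit predictor over a sequence of
    branch outcomes (True = taken)."""
    pred, wrong = start_taken, 0
    for taken in outcomes:
        wrong += pred != taken
        pred = taken  # 1-bit state: remember only the last outcome
    return wrong

loop = [True] * 9 + [False]   # taken 9 times in a row, then not taken
# start_taken=False models steady state: the previous loop exit left
# the predictor at "not taken"
wrong = one_bit_mispredicts(loop, start_taken=False)   # 2 mispredictions
accuracy = 1 - wrong / len(loop)                       # 8/10 = 80%
```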

Page 34: Computing Systems Pipelining: enhancing performance

34

Improving performance

Try to avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
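Why the reordering helps can be sketched by counting load-use stalls in a toy model (an illustrative sketch; it assumes forwarding handles everything except a load immediately followed by a consumer of its result). Swapping the two stores means no instruction needs a loaded value in the very next cycle.

```python
def load_use_stalls(prog):
    """prog: list of (op, dest, sources) tuples. With full forwarding,
    the only remaining stall is a load immediately followed by an
    instruction that reads the loaded register."""
    return sum(1 for prev, cur in zip(prog, prog[1:])
               if prev[0] == "lw" and prev[1] in cur[2])

original = [("lw", "$t0", ["$t1"]),
            ("lw", "$t2", ["$t1"]),
            ("sw", None,  ["$t2", "$t1"]),   # needs $t2 right after its load
            ("sw", None,  ["$t0", "$t1"])]
# swap the two stores: the lw $t2 result is no longer needed immediately
reordered = [original[0], original[1], original[3], original[2]]
```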

Dynamic Pipeline Scheduling
- hardware is organized differently and chooses which instructions to execute next
- will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
- speculates on branches and keeps the pipeline full (may need to roll back if a prediction is incorrect)

Trying to exploit instruction-level parallelism

Page 35: Computing Systems Pipelining: enhancing performance

35

Dynamic scheduled pipeline

Page 36: Computing Systems Pipelining: enhancing performance

36

Advanced Pipelining

Trying to exploit instruction-level parallelism:
- increase the depth of the pipeline (overlap more instructions)
- replicate internal functional units to start more than one instruction each cycle (multiple issue)
  - static multiple issue (decision at compile time)
  - dynamic multiple issue (decision at execution time)
- loop unrolling to expose more ILP (better scheduling)

"Superscalar" processors
- DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
- all modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes")

VLIW (very long instruction word): static multiple issue (relies more on compiler technology)

Page 37: Computing Systems Pipelining: enhancing performance

37

Summary

Pipelined processors divide execution into multiple steps
Pipelining improves instruction throughput, not the inherent execution time (latency) of instructions
However, pipeline hazards reduce performance
- structural, data, and control hazards
Structural hazards are resolved by duplicating resources
Data forwarding helps resolve data hazards
- but not all hazards can be resolved (load followed by R-type)
- some data hazards require nop insertion (bubbles)
Control hazard delay penalty can be reduced by branch prediction
- assume not taken, delayed slots, dynamic prediction

Page 38: Computing Systems Pipelining: enhancing performance

38

Concluding Remarks

Pipelined processors are not easy to design
Technology affects implementation
Instruction set design affects performance and design difficulty
More stages do not necessarily lead to higher performance
Pipelining and multiple issue both attempt to exploit ILP