11 - 1dt085 l10 pipeline2 - uppsala university · 2011-11-23 · compiler can insert nops clock...

1

Pipeline Implementation

2

Don’t forget…

  You need to register for the exam at least 14 days before   http://tenta.angstrom.uu.se/tenta/

  The IT office can not help you if you miss the deadline   If you do miss the deadline:

  You are not guaranteed a spot at the exam   You won’t be anonymous

(although I never look at the names on exams when I’m grading so I’m not sure how this matters.)

  There have been so many bugs and problems with this anonymous system that your grades and exams will be delayed by a few days.   This won’t actually matter since I won’t be able to grade your exams until the second

week in January anyway.

3

Today’s Menu

  More pipelining   Data hazards   Structural hazards   Control hazards   Branch prediction – just a taste…   Examples

4

This Data Hazard, Revisited

  In this particular case…   The R10 value has not been computed or written back to the register file when the later instruction wants to use it as an input

Double pumping reg file doesn’t help here; later instruction needs R10 2 clock cycles before it’s been computed & stored back. Oops…

Iget Rget ALU op Mput Rput

Iget Rget ALU op Mput Rput

10 W

10 R

5

Coping with Data Hazards

  What do you do?   Sometimes the dumb-sounding answer is right

  Hypothesis:   It is BAD when certain instructions “overlap” in time in certain patterns in our 5 stage

MIPS pipeline

  Proposed solution   Don’t let them overlap like this…?   Right - that is one solution

  Mechanics   Don’t let the instruction flow thru the pipe   In particular, don’t let it WRITE any bits anywhere in the pipe hardware that represents

REAL CPU state (e.g., register file, memory)   Name for this operation: PIPELINE STALL

6

Coping with Data Hazards: Example

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

ADD R10, R11, R12 ADD R12, R10, R11 ADD R11, R10, R12

REG IM DM ALU Reg

Program Execution

Time

REG IM DM ALU Reg

Clock Cycle 8

REG IM DM ALU Reg

10 W

10 R

10 R

7

Solution 1 : Stall

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Program Execution

Time Clock Cycle 8

IM REG ALU bubble bubble


DM

REG IM ALU

10 W

10 R

Empty slots in the pipe are called bubbles; no real instruction work gets saved here

10 R

8

Mechanically: How Do We Stall?

  Add extra hardware to detect stall situations   Watches the instruction field bits   Looks for “read versus write” conflicts in particular pipe stages   Basically, a bunch of careful “case logic”

  Add extra hardware to push bubbles thru pipe   Actually, relatively easy   Can just let the instruction you want to stall GO FORWARD thru the pipe…   …but, TURN OFF the bits that allow any results to get written into the machine state   So, the instruction “executes” (it does the work), but doesn’t “save”

“If an instruction executes in the middle of a forest, but no registers are around to save the results…did it really execute?” (No.)
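The "case logic" can be sketched in a few lines. This is only an illustration, assuming the 5-stage pipe from these slides, no forwarding, and a double-pumped register file (write in the first half of a cycle, read in the second). The function names are made up for this sketch.

```python
# Stage offsets relative to the cycle in which an instruction is fetched.
IF, ID, EX, MEM, WB = range(5)

def raw_hazard(producer_dest, consumer_srcs):
    """True when the consumer reads the register the producer writes.

    Register 0 is excluded because on MIPS it is hardwired to zero.
    """
    return producer_dest != 0 and producer_dest in consumer_srcs

def stalls_without_forwarding(distance):
    """Bubbles needed between a producer and a dependent consumer.

    distance: how many instructions later the consumer issues (1 = next).
    With a double-pumped register file the consumer's ID stage may share
    a cycle with the producer's WB stage, but must not come before it.
    """
    if distance < 1:
        raise ValueError("consumer must follow the producer")
    producer_wb = WB                # producer fetched in cycle 0
    consumer_id = distance + ID     # consumer fetched `distance` cycles later
    return max(0, producer_wb - consumer_id)

# ADD R10, R11, R12 followed directly by ADD R12, R10, R11
if raw_hazard(10, (10, 11)):
    print(stalls_without_forwarding(1))   # prints 2
```

A consumer three instructions behind the producer needs no bubbles at all, which is the double-pumping case.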

9

No Dependence Between #1 and #4

REG IM DM ALU

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU

Program Execution

Clock Cycle 8

SUB R2, R1, R3 AND R12, R2, R5 OR R13, R6, R2 ADD R14, R2, R2

REG IM DM ALU

REG IM DM ALU

REG

REG

REG

REG

2 W

2 R

In this case, double pumped reg file makes it ok…

10

REG IM DM ALU Reg

How Else Could We Stall the Pipeline?   Compiler can insert nops

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Program Execution

Time Clock Cycle 8

ADD R10, R11, R12 nop nop ADD R12, R10, R11

IM DM ALU Reg

ALU

REG

IM DM Reg REG

On MIPS R0 = R0+R0 will do it-- saves no

state

11

Or, The Hardware Can Simulate NOPS

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Program Execution

Time Clock Cycle 8

IM REG ALU

ADD R10, R11, R12 stall stall ADD R12, R10, R11 DM

IM

IM

bubble bubble bubble

bubble bubble

bubble

bubble bubble

Reg

12

Hardware Trick Can Fix a Dependence

  If the result you need does not exist AT ALL yet…   …well, you are outta luck; sorry

  But, what if the result exists, but is not stored back yet?   Then, maybe we can help   Instead of stalling until the result is stored back in its “natural” home…   …grab the result “on the fly” from “inside” the pipe, and send it to the other instruction

(another pipe stage) that wants to use it

  Generic name: forwarding   Instead of waiting to store the result, we forward it immediately (more or less) to the

instruction that wants it   Mechanically, we add busses to the datapath to move these values around, and these

busses always “point backwards” in the datapath, from later stages to earlier stages
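A minimal sketch of the selection logic that drives one forwarding MUX, assuming the textbook rule that the newest matching result wins. `PipeReg` and `forward_select` are hypothetical names; real control logic would be gates, not Python.

```python
from dataclasses import dataclass

@dataclass
class PipeReg:
    """The fields of a pipeline register that the forwarding logic inspects."""
    reg_write: bool   # will this instruction write the register file?
    rd: int           # destination register number

def forward_select(src, ex_mem, mem_wb):
    """Choose where one ALU operand comes from.

    Returns "EX/MEM", "MEM/WB", or "REGFILE". The EX/MEM register holds the
    newer result, so it is checked first. Register 0 is never forwarded.
    """
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "EX/MEM"
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "MEM/WB"
    return "REGFILE"

# ADD R10, R11, R12 is one stage ahead of ADD R12, R10, R11
producer = PipeReg(reg_write=True, rd=10)
idle = PipeReg(reg_write=False, rd=0)
print(forward_select(10, producer, idle))   # prints EX/MEM
```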

13

Reducing Data Hazards: Forwarding   Data may be already computed - just not in the Register File

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Program Execution

Time Clock Cycle 8

ADD R10, R11, R12 ADD R12, R10, R11 REG IM DM ALU Reg

R10

R10

Moving this R10 value requires forwarding busses & logic

10 W

14 Forwarding bus from MEM

Additions to the Datapath for Forwarding

Read Reg 1 Read Reg 2 Write Reg Write Data

Read Data 1 Read Data 2

Register File

ALU

16 32

Read data

Data Memory (RAM)

M

U X

M U X

Zero

Instruction Memory (RAM)

PC

Adder 4

Current PC

ADDER

<< 2

M U X

Sign extend

IF/ID ID/EX EX/MEM MEM/WB

15

Forwarding Continued

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Program Execution

Time Clock Cycle 8


REG IM DM ALU Reg

REG IM DM ALU Reg

R10

R10

R10

R10

16 Forwarding bus from WB

More Additions to the Datapath



Register File

ALU

16 32

Read data

Data Memory (RAM)

M

U X

M U X

Zero


PC

Adder 4

Current PC

ADDER << 2

M U X

Sign extend


17

Forwarding Doesn’t Always Work

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Program Execution

Time Clock Cycle 8

LW R10, 0x00(R4) ADD R12, R10, R11 REG IM DM ALU Reg

R10

R10

ALU needs R10 at beginning of clock cycle,

but R10 value not ready till end of cycle

18

Loads and Stores Require a Load Delay Slot

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Program Execution

Time Clock Cycle 8

LW R10, 0x00(R4)

nop ADD R12, R10, R11 REG IM DM ALU Reg

IM DM ALU Reg

R10

R10

REG

Gives us 1 cycle of delay we need to get R10 from mem,

then to ALU

19

Pipelines: Idiosyncrasies

  Different architects deal with these “special cases” in different ways, sometimes yielding very different solutions   MIPS lets software compiler writers “see” this necessary delay slot   Phrasing in architecture: feature “exposed to the compiler”

  What this “exposing” means   The compiler knows the slot is required   The compiler has to deal with it--the hardware won’t deal with it   Your book says this was a sensible tradeoff at the time this MIPS architecture was built

(i.e., when throwing 1M gates at a messy pipeline problem was not possible)   MIPS = “Microprocessor without interlocked pipeline stages”   Today everyone knows this was a bad idea. Much easier to fix in hardware (transistors

are free) than to force everyone to recompile everything.

  Alternative   Hardware in pipeline detects this hazard, stalls appropriately   This sort of hardware has a name: pipeline interlock hardware 20
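As a rough sketch of what interlock hardware checks for the load case: stall the instruction in decode when the instruction just ahead of it is a load whose destination it reads. The parameter names are assumptions made for this example, loosely following the pipeline register names on the datapath slides.

```python
def load_use_stall(id_ex_is_load, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID must wait one cycle for a load in EX.

    id_ex_rt is the load's destination register (a MIPS load writes rt);
    if_id_rs and if_id_rt are the source registers of the instruction
    currently being decoded.
    """
    return bool(id_ex_is_load and id_ex_rt != 0
                and id_ex_rt in (if_id_rs, if_id_rt))

# LW R10, 0x00(R4) is in EX while ADD R12, R10, R11 is in ID
print(load_use_stall(True, 10, 10, 11))   # prints True
```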

Example of Forwarding and Load Delay

  Rewrite the code assuming a machine without forwarding (by inserting nops or stalls).

ADD R4, R5, R2 LW R15, 0(R4) SW R15, 4(R2)

  Rewrite the code assuming forwarding

21

Example of Forwarding and Load Delay

  Why forwarding?

ADD R4, R5, R2

LW R15, 0(R4)

SW R15, 4(R2)

  Why load delay? ADD R4, R5, R2

LW R15, 0(R4)

SW R15, 4(R2)

22

Solution Template Program Execution

Time

ADD R4, R5, R2 LW R15, 0(R4) SW R15, 4(R2)

IM DM ALU REG REG

23

Solution (w/out forwarding) Program Execution

Time

ADD R4, R5, R2 LW R15, 0(R4) SW R15, 4(R2)

IM DM ALU REG REG

IM DM ALU REG REG bubble bubble

IM DM ALU REG bubble bubble

R4

R15

R15

R4

24

Solution (w/forwarding) Program Execution

Time

ADD R4, R5, R2 LW R15, 0(R4) SW R15, 4(R2)

IM DM ALU REG REG

IM DM ALU REG REG

IM DM ALU REG bubble REG

R15

R15

Forwarding

R4

R4

Forwarding
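The two solutions can be checked with a small scheduling sketch. It assumes an in-order 5-stage pipe, a double-pumped register file, and (as drawn above) that a store needs its data register by the ALU stage. `schedule` and the tuple layout are invented for this illustration.

```python
def schedule(program, forwarding):
    """Return (total_cycles, stalls_per_instruction).

    program: list of (op, dest, sources); dest is None if nothing is written.
    Without forwarding, a consumer's ID stage waits for the producer's WB.
    With forwarding, a consumer's EX stage waits until the cycle after the
    value exists: after EX for ALU ops, after MEM for loads.
    """
    avail = {}       # register -> earliest cycle a consumer's ID may occupy
    prev_id = 1      # makes the first instruction decode in cycle 2
    stalls = []
    last_wb = 0
    for op, dest, sources in program:
        natural_id = prev_id + 1
        id_cycle = max([natural_id] + [avail.get(r, 0) for r in sources])
        stalls.append(id_cycle - natural_id)
        ex, mem, wb = id_cycle + 1, id_cycle + 2, id_cycle + 3
        if dest is not None:
            if not forwarding:
                avail[dest] = wb     # read in the same cycle as the write
            elif op == "LW":
                avail[dest] = mem    # consumer EX is the cycle after MEM
            else:
                avail[dest] = ex     # consumer EX is the cycle after EX
        prev_id, last_wb = id_cycle, wb
    return last_wb, stalls

program = [("ADD", 4, (5, 2)), ("LW", 15, (4,)), ("SW", None, (15, 2))]
print(schedule(program, forwarding=False))   # prints (11, [0, 2, 2])
print(schedule(program, forwarding=True))    # prints (8, [0, 0, 1])
```

The one remaining bubble with forwarding is the load delay: the value of R15 only exists after the LW's memory stage.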

25

Taxonomy of Hazards

  Data Hazards are just one type of hazard that can occur in a machine. There are actually 3 basic types of hazards

  Data hazards   Instruction depends on result of prior computation which is not ready yet

  Structural hazards   HW cannot support a combination of instructions

  Control hazards   pipelining of branches and other instructions which change the PC

26

Taxonomy of Hazards


  OK, we did these. Stall, double pump, and forward, to fix



28

Structural Hazards--Pipe Stage Contention

  Structural hazards   Occurs when two or more instructions want to use the same hardware resource in the

same cycle   Causes bubble (stall) in pipelined machines   Overcome by replicating hardware resources   Examples

  Multiple accesses to the register file   Branch adder and ALU   Multiple accesses to memory

29

ADDER #2

Structural Hazard Example 1   Without adder #2, both the address computation and the arithmetic computation would require access to the ALU in the same cycle

beq r1,r2, offset ; if r1 == r2, then PC <-- PC + offset



Register File

ALU

16 32

Read data

Data Memory (RAM)

M

U X

M U X

Zero


PC

Adder 4

Current PC

ADDER << 2

M U X

Sign extend


ADDER #1

30

Structural Hazard Example 2

REG IM DM ALU Reg

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

LW R2, 0x10(R4) SUB R5,R6,R7 ADD R10,R11,R12 ADD R12, R10, R11

REG IM DM ALU Reg

REG IM DM ALU Reg

Program Execution

Time Clock Cycle 8

REG IM DM ALU Reg

  Two instructions need access to memory in Clock Cycle 4.   This is a big reason for having separate I memory (for instructions) and D memory

(for data value)
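A quick way to see the conflict, as a sketch only: assume one instruction is fetched per cycle and a single memory serves both instruction fetch and data access. The helper name is hypothetical.

```python
MEMORY_OPS = {"LW", "SW"}

def memory_port_conflicts(ops):
    """Pairs (i, j) where instruction i's MEM stage and instruction j's
    fetch fall in the same cycle.

    Instruction i (0-based) is fetched in cycle i + 1, so its MEM stage is
    in cycle i + 4, which is exactly when instruction i + 3 is fetched.
    """
    conflicts = []
    for i, op in enumerate(ops):
        j = i + 3
        if op in MEMORY_OPS and j < len(ops):
            conflicts.append((i, j))
    return conflicts

# LW, SUB, ADD, ADD from the slide: LW's data access meets the 4th fetch
print(memory_port_conflicts(["LW", "SUB", "ADD", "ADD"]))   # prints [(0, 3)]
```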

31

Structural Example 2 (con’t)

REG IM DM ALU Reg

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

LW R2, 0x10(R4) SUB R5,R6,R7 ADD R10,R11,R12

Stall ADD R12, R10, R11

REG IM DM ALU Reg

REG IM DM ALU Reg

Program Execution

Time Clock Cycle 8

  Two instructions need access to memory in Clock Cycle 4.   We’d need to stall to fix this as it…

REG IM DM ALU

bubble bubble bubble bubble bubble

32

Taxonomy of Hazards




  OK, maybe add extra hardware resources; may still have to stall


33

Example code

Address Instruction
36 NOP
40 ADD R30,R30,R30
44 BEQ R1, R3, 24 <- this branches to address 72
48 AND R12, R2, R5
52 OR R13, R6, R2
56 ADD R14, R2, R2
60 ...
...
72 LW R4, 50(R7)
76 ...

Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...

Control Hazards - Branches

We execute all these if R1 != R3

We execute just these if R1 == R3

34

Recall: Basic Pipelined Datapath



Register File

ALU

16 32

Read data

Data Memory (RAM)

M

U X

M U X

Zero


PC

Adder 4

Current PC

ADDER << 2

M U X

Sign extend


35

Branch Hazards

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Clock Cycle 8

44 BEQ R1, R3, 24 48 AND R12, R2, R5 52 OR R13, R6, R2 56 ADD R14, R2, R2 60 or 72 (depending on branch)

REG IM DM ALU Reg

REG IM DM ALU Reg

REG IM DM ALU Reg

Clock Cycle 9

IM DM Reg ALU


REG

36

Always Stalling Hurts the No-branch case

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Clock Cycle 8

44 BEQ R1, R3, 24 stall stall stall 48 AND R12, R2, R5

IM

IM

IM

Clock Cycle 9

REG IM DM Reg ALU

bubble bubble bubble bubble



Flow of instructions if branch is not taken: 36, 40, 44, 48, ...

37

Solution: Assume Branch Not Taken

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Clock Cycle 8

44 BEQ R1, R3, 24 48 AND R12, R2, R5 52 OR R13, R6, R2 56 ADD R14, R2, R2 60 (we assume branch NOT taken)

REG IM DM ALU Reg

REG IM DM ALU Reg

REG IM DM ALU Reg

Clock Cycle 9

IM DM Reg ALU


REG

38

…i.e., what if we guessed wrong on the branch?

Address Instruction
36 NOP
40 ADD R30,R30,R30
44 BEQ R1, R3, 24 <- this branches to address 72
48 AND R12, R2, R5
52 OR R13, R6, R2
56 ADD R14, R2, R2
60 ...
...
72 LW R4, 50(R7)
76 ...


Uh Oh: What If Branch Was Taken…?

We already started some of these since we assumed NO branch taken

But a few clock cycles later, we figure out these are the right instructions to execute next

39

What Happens When the Branch IS Taken

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Clock Cycle 8

44 BEQ R1, R3, 24 48 AND R12, R2, R5 52 OR R13, R6, R2 56 ADD R14, R2, R2 72 LW R4, 50(R7)

REG IM DM ALU Reg

REG IM DM ALU Reg

REG IM DM ALU Reg

Clock Cycle 9

REG IM DM Reg ALU

Flow of instructions if branch is taken: 36, 40, 44, 72, ...

These 3 incorrect to execute--kill

them

40

Common Side-Effect in Pipelines

  Sometimes, you just have to guess what will execute   Often, we can do it right, and this saves cycles   But, occasionally, we are wrong

  Consequences   We mistakenly start executing the wrong instructions   To repair this, must make sure that they DO NOT really execute   In particular, must ensure they do not incorrectly corrupt machine state

  Terminology is appealingly vivid   We “kill” them -- bland but accurate   We “squash” them -- think of “bug-under-boot” images

41

Common Side-Effect in Pipelines

  About squashing instructions:   Do it because you have to, to avoid getting wrong answer   Do it because we insist on “sequential execution semantics” which means “program

behaves like the instructions execute sequentially, in order” no matter what weird goop happens in the pipe

  Aside: terminology   We say the machine executes the instruction sequentially   Also say “instructions are RETIRED in sequential order”   Image is: instruction is born (fetched), grows up (regfile access, ALU ops), then finally

finishes and commits correct machine state.   This “finally finishes” is “retiring” the instruction   In deep pipelines and complex machines, even knowing WHEN your instruction retires

takes a lot of complex logic

  Consequence   Think about restructuring pipe to MINIMIZE number of instructions squashed

42

Better if we can do it sooner, here

Move the Branch Computation Forward



Register File

ALU

16 32

Read data

Data Memory (RAM)

M

U X

M U X

Zero


PC

Adder 4

Current PC

ADDER << 2

M U X

Sign extend


Too late, adds extra cycle, 1 more inst to squash

43

Move the Branch Computation Further Forward



Register File

ALU

16 32

Read data

Data Memory (RAM)

M

U X

M U X

Zero


PC

Adder 4

Current PC

ADDER << 2

M U X

Sign extend

IF/ID ID/EX EX/MEM MEM/WB ADDER

Compare

Compare Controls MUX Select

Even better if we can do it sooner, here; need to change hardware a little to do it

44

Result: New & Improved MIPS Datapath   Need just 1 extra cycle after the BEQ branch to know the right address   On MIPS, it’s called the branch delay slot

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Clock Cycle 8

44 BEQ R1, R3, 24 48 AND R12, R2, R5 72 LW R4, 50(R7) REG IM DM ALU Reg

IM DM ALU Reg

Clock Cycle 9

REG

45

The Branch Problem   Branch is detected and handled in cycle 2   Allows branch destination to start in cycle 3   But what about the instruction fetched in cycle 2? (ADD here…)

  MIPS uses a “branch delay slot”   Other architectures stall and lose performance

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Clock Cycle 8

0x00 BEQ 0x20
0x04 ADD
0x24 SUB

REG IM DM ALU Reg

REG IM DM ALU Reg

46

Pipeline Idiosyncrasies Revisited

  Good news   Just 1 cycle to figure out what the right branch address is   So, not 2 or 3 cycles of potential NOP or stall

  Strange news   OK, it’s always 1 cycle, and we always have to wait   SO--on MIPS, this instruction always executes, no matter what   This deviates from the “atomic instruction principle”

  Hence the name: branch delay slot   The instruction cycle after the branch is used for address calc, 1 cycle delay necessary   SO…we regard this as a free instruction cycle, and we just DO IT

  Consequence   You (or your compiler) will need to adjust your code to put some useful work in that “slot”, since just putting in a NOP is wasteful

47

Rewriting the Code for a Branch Delay Slot

Without Branch Delay Slot          With Branch Delay Slot
Address Instruction                Address Instruction
36      NOP                        36      NOP
40      ADD R30,R30,R30            40      BEQ R1, R3, 28
44      BEQ R1, R3, 24             44      ADD R30, R30, R30
48      AND R12, R2, R5            48      AND R12, R2, R5
52      OR R13, R6, R2             52      OR R13, R6, R2
56      ADD R14, R2, R2            56      ADD R14, R2, R2
60      ...                        60      ...
64      ...                        64      ...
68      ...                        68      ...
72      LW R4, 50(R7)              72      LW R4, 50(R7)
76      ...                        76      ...

  Flow of instructions if branch is taken: 36, 40, 44, 72, ...   Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
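Why does the offset change from 24 to 28 when the BEQ moves up one slot? A tiny sketch of the target arithmetic, taking the offset as a byte count the way these slides write it (a real MIPS encoding stores a word offset and shifts it left by 2, which is the << 2 box in the datapath). The function name is made up here.

```python
def branch_target(branch_address, offset_bytes):
    """Target of a taken branch: address of the next instruction plus offset."""
    return branch_address + 4 + offset_bytes

print(branch_target(44, 24))   # prints 72 (BEQ at 44, no delay slot)
print(branch_target(40, 28))   # prints 72 (BEQ moved up to 40)
```

The BEQ sits 4 bytes earlier in the rewritten code, so its offset must grow by 4 to reach the same LW at address 72.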

48

Recall: Problems w/ Branch Delay Slots in Pipes



Register File

ALU

16 32

Read data

Data Memory (RAM)

M

U X

M U X

Zero


PC

Adder 4

Current PC

ADDER << 2

M U X

Sign extend

IF/ID ID/EX EX/MEM MEM/WB ADDER

Compare

Compare Controls MUX Select

If we left these branch target address calcs here (deep in the pipe), created many bubbles …moved here

49

Datapath with Branch Logic



Register File

ALU

16 32

Read data

Data Memory (RAM)

M

U X

M U X

Zero


PC

Adder 4

Current PC

M U X

Sign extend

IF/ID ID/EX MEM/WB ADDER

Compare

Compare Controls MUX Select

<< 2

EX/MEM

50

The Branch Delay Slot

  In retrospect, probably a mistake   This solution only works for some pipelines

  Deeper pipelines mean higher performance   They also mean more cycles between when you fetch an instruction and when you know

for sure the address of the next instruction   # of branch delay slots would have to grow

  Breaks the atomic instruction principle   Compilers don’t always find a way to fill the slot

  Forget about it for a minute…   Let’s try to fix this problem in a better way

  What if we could predict the future?

51

Branch Prediction: A better solution?   Assume branch not taken; so just start AND instruction

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Clock Cycle 8

44 BEQ R1, R3, 24 48 AND R1, R2, R5 60 or 72 (depending on outcome of branch)

REG IM DM ALU Reg

REG IM DM ALU Reg

  This is a form of Branch Prediction

52

Predict Branch Not Taken   Instead of a branch delay slot or stalling,

we just assume that the branch will not happen   If you’re right, great!   If you’re wrong, cancel the instructions that should not have executed

  Example:   Assume “not taken” when the branch is not taken

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg 44 BEQ R1, R3, 24 48 AND R1, R2, R5 52 SUB R2, R3, R4

REG IM DM ALU Reg

REG IM DM ALU Reg

53

Branch Misprediction   Example:

  Assume “not taken” when the branch is taken   Cancel instruction 48 (AND) because it should not have issued

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg 44 BEQ R1, R3, 24 48 AND R1, R2, R5 72 SUB R8, R9, R2

REG IM DM ALU Reg

REG IM DM ALU Reg

54

How Can We Do Better? Branch Prediction

  There are many different schemes   Assume taken   Assume not taken   1-bit Branch Prediction   2-bit Branch Prediction   N-bit Branch Prediction   Table-based Branch Prediction   …..

  Assume taken or not taken is called static branch prediction   Using hardware to dynamically predict is called dynamic branch

prediction

55

Static Prediction Problems

  Some branches: 99% of the time they are taken   Example: “are we finished with the loop? If not, start at the top again…”

  Some branches: 99% of the time they are not taken   Example: “is this an error? If so branch to error routine”

  Compilers have trouble re-writing to make static behavior the right behavior

56

Dynamic Branch Prediction

  Dynamic branch prediction uses the previous outcome of a branch to determine future outcomes   Does this make sense?

  Yes, sometimes. Consider the following code fragment

for (k = 0; k < 100000000; k++){ /* do something */ }

  But what about

if (a == b) { /* do something */ } else /* do something else */

Most of the time, the “last” branch decision is same as next branch decision

Hmmm… Not so clear here, eh?

57

1-bit Branch Prediction   Hardware has a table of single bits

  Each entry in the table corresponds to a branch in the program   If a bit is set, the branch is predicted taken   If the bit is not set (0), the branch is predicted not taken

  How do the branch table bits get set?   The hardware determines the real outcome of a branch and uses that outcome (history) to set (or unset) a bit

  How do the branch instructions get mapped to entries in the table?

  Magic… for now. (Lots of custom logic, basically…)

Branch Prediction Table

L1 ADD R1, R2, R3 SUB R3, R4, R2 BEQ R1, R3, L1

L2 LUI R2, 0x1234 BNE R3,R4, L2 J L1

0 1 2 3 4 5

0 1 1 0 0 0

58

1-bit Branch Prediction (cont.)

Branch Prediction Table at start of program

0 1 2 3 4 5


L2 LUI R2, 0x1234 BNE R3,R4, L2 J L1

0 1 1 0 0 0

Flow of instructions
ADD R1, R2, R3
SUB R3, R4, R2
BEQ R1, R3, L1    ; table predicts not taken
LUI R2,0x1234     ; if the branch was taken, then squash LUI
                  ; let’s assume the branch was taken
                  ; then the hardware will update the table
                  ; next, we fetch the destination of the branch
ADD R1, R2, R3
SUB R3, R4, R2
BEQ R1, R3, L1    ; table predicts taken
ADD R1, R2, R3

Branch Predict Table after first branch resolved

0 1 2 3 4 5


L2 LUI R2, 0x1234 BNE R3,R4, L2 J L1

1 1 1 0 0 0

59

Problem with one-bit predictors

  Consider this:

for (j = 0; j < 100000000; j++) {
    for (k = 0; k < 10; k++) {
        /* do something */
    }
}

  How often do we mispredict the k-loop branch?
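One way to answer this is to simulate it. The sketch below assumes the k-loop branch sits at the bottom of the loop (taken 9 times, then not taken once per outer iteration), that the bit starts at "not taken", and it uses a much smaller outer count to keep the run short. The function names are made up for this example.

```python
def one_bit_mispredictions(outcomes, predict_taken=False):
    """Count misses of a 1-bit predictor that remembers the last outcome."""
    misses = 0
    for taken in outcomes:
        if taken != predict_taken:
            misses += 1
        predict_taken = taken      # the bit always records the latest outcome
    return misses

def inner_loop_outcomes(outer, inner):
    """Outcomes of the loop-closing branch: taken inner-1 times, then exit."""
    return ([True] * (inner - 1) + [False]) * outer

outcomes = inner_loop_outcomes(outer=1000, inner=10)
print(one_bit_mispredictions(outcomes), "of", len(outcomes))   # prints 2000 of 10000
```

Under these assumptions the predictor misses twice per trip through the inner loop: once on the exit, and once more when the loop is re-entered.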

60

2-bit Branch Prediction

  Table has 2 bits instead of one bit   Creates a history--more “memory” of behavior saved

  Use the entries to determine the outcome of a branch as follows

Taken

Not Taken

Not Taken

Not Taken

Not Taken

Taken

Taken

Taken

Predict taken 11

Predict taken 10

Predict not taken 00

Predict not taken 01

61

2-bit Branch Prediction: Mechanics

  If you’re right--you’re right   Don’t change your prediction if things are going OK with it…

Predict taken

Predict not taken

Taken

Not Taken

11

00

62


  Oops--first wrong, mis-predict.   Remember that this was first--it’s a new state   BUT--don’t change your prediction about the branch direction

Predict taken Predict taken

Predict not taken Predict not taken

Taken

Not Taken

Not Taken

Taken

11 10

00 01

Oops!

Oops!

63


  If we’re lucky, that mispredict was a fluke…   The way we WERE predicting it before was OK,

that one previous branch was just wrong. NEXT one, we’re right again.



Taken

Not Taken

Not Taken

Taken 11 10

00 01

64


  Nope--we’re really wrong.   Now, the branch really wants to go the other way, twice in a row.   So--Alter our prediction.



Taken

Not Taken

Not Taken Taken

11 10

00 01

Not Taken

Taken

65

2-bit Branch Prediction: Example

  Consider a few branch predictions in sequence

Predict not taken

Taken

Not Taken

Taken

Taken

Predict taken 11

Predict taken 10

00 Predict not taken

01

1. We stay here as long as “not taken” is correct

2. Oops…first mispredict

3. Oops…2nd mispredict; let’s change our prediction for the future

4. Stay here as long as “taken” is now right…
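A sketch of a 2-bit predictor, written as a saturating counter where states 00 and 01 predict not taken and 10 and 11 predict taken. This is one common variant and an assumption here: the state diagram on the slides may jump straight to 11 on the second mispredict instead of passing through 10.

```python
def two_bit_mispredictions(outcomes, state=0):
    """Count misses of a 2-bit saturating counter (0..3; 2 and 3 predict taken)."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at the ends.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

# The sequence walked through above: right, right, oops, oops, right
print(two_bit_mispredictions([False, False, True, True, True]))   # prints 2

# A 10-iteration inner loop run 1000 times (branch taken 9 times, then exits)
nested = ([True] * 9 + [False]) * 1000
print(two_bit_mispredictions(nested))   # prints 1002
```

After warming up, the counter only misses on the loop exit, so a single odd outcome no longer flips the prediction.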

66

Generalization: 3-bit Prediction

T

NT

NT NT

T

T

NT

T

NT

T

NT

T

NT

T

NT

T

3-bit Prediction

T

NT

NT NT

T

T

NT

T

2-bit Prediction

67

Generalization: N-bit Prediction

  Saturating N-bit counter

  See whether a majority of the last 2N-1 branches were taken or not taken

T

NT NT

T

T

NT

T

NT

T

NT

T

NT

T

NT

T

NT NT

T

NT

T …

68

How well does this work?

  Really good for loop behavior!

[Figure: Frequency of Mispredictions (axis 0% to 20%) for the SPEC89 benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li, using a predictor with 4096 entries and 2 bits per entry]

69

When does this break?

  Doesn’t deal with data-dependent branches (not much we can do here)

  Doesn’t deal with correlated behavior:

L1: bne $s1, $0, L2     # B1
    ...
L2: bne $s2, $0, L3     # B2
    ...
L3: bne $s1, $s2, L4    # B3
    ...
L4: ...

  Note that if B1 not-taken, and B2 not-taken, then B3 is not-taken   There is a lot of correlated behavior like this in real programs

70

Small Example:

L1: bne $s1, $0, L2     # B1: if (d==0)
    addi $s1, $0, 1     # d = 1
L2: subi $s2, $s1, 1
    bne $s2, $0, L3     # B2: if (d == 1)
    ...
L3:

  If B1 is not taken, B2 will not be taken

  How does a standard 1-bit predictor work with this?   Assume $s1 alternates between 2 and 0

71

Small Example:

  We ALWAYS mispredict!!!!

$s1 = ?  B1 predict  B1 action  New B1 predict  B2 predict  B2 action  New B2 predict
2        NT          T          T               NT          T          T
0        T           NT         NT              T           NT         NT
2        NT          T          T               NT          T          T
0        T           NT         NT              T           NT         NT

72

Correlating Branch Predictors

  Idea: keep 2 (or more) predictors   One is used/updated if last branch was taken (T)   One is used/updated if last branch was not taken (NT)   Each predictor could be N bits (we’ll assume one bit)

Prediction Bits  Prediction if last branch not taken  Prediction if last branch taken
NT/NT            Not taken                            Not taken
NT/T             Not taken                            Taken
T/NT             Taken                                Not taken
T/T              Taken                                Taken

73

Previous Example

  Initialized to NT/NT   Only one misprediction of B2!

$s1 = ?  B1 predict  B1 action  New B1 predict  B2 predict  B2 action  New B2 predict
2        NT/NT       T          T/NT            NT/NT       T          NT/T
0        T/NT        NT         T/NT            NT/T        NT         NT/T
2        T/NT        T          T/NT            NT/T        T          NT/T
0        T/NT        NT         T/NT            NT/T        NT         NT/T

Use different predictor for B2 based on whether B1 was taken or not.

Update predictor based on B1 action
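The two tables can be reproduced with a short simulation. This is a sketch under stated assumptions: every prediction bit starts at not taken, the "last branch" history also starts at not taken, and $s1 alternates 2, 0, 2, 0. The function names are invented for this example.

```python
def branch_outcomes(d):
    """Outcomes (B1, B2) of the two branches for one pass with $s1 = d."""
    b1 = d != 0              # bne $s1, $0, L2
    if not b1:
        d = 1                # addi $s1, $0, 1
    b2 = (d - 1) != 0        # subi $s2, $s1, 1 ; bne $s2, $0, L3
    return b1, b2

def count_mispredictions(values, correlating):
    """Total misses of B1 and B2 over the given sequence of $s1 values."""
    # Per branch: two 1-bit predictions, indexed by the outcome of the most
    # recently executed branch (False = not taken, True = taken).
    table = {"B1": {False: False, True: False},
             "B2": {False: False, True: False}}
    last = False
    misses = 0
    for d in values:
        for name, taken in zip(("B1", "B2"), branch_outcomes(d)):
            key = last if correlating else False   # plain 1-bit uses one slot
            if table[name][key] != taken:
                misses += 1
            table[name][key] = taken
            last = taken
    return misses

values = [2, 0, 2, 0]
print(count_mispredictions(values, correlating=False))   # prints 8
print(count_mispredictions(values, correlating=True))    # prints 2
```

The plain 1-bit predictor misses all 8 branches, while the correlating version misses only the first B1 and the first B2.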

74

Performance of Correlating Branch Predictors

[Figure: Frequency of Mispredictions (axis 0% to 20%) for the SPEC89 benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li, comparing a 2-bit 4096-entry predictor with a 2-bit 2-level correlating 1024-entry predictor]

75

How to keep the branch prediction data

  Keep a table of addresses of branch instructions with the current state of the branch predictor for that branch

  A valid field to indicate that this address is a branch   Check in the BTB when you fetch instruction   Update bits when you know whether the branch is taken or not

= = ? = = ?

Current PC

= = ? = = ?

Branch Prediction State Valid

Tag bits

76

Branch Targets

  So we now predict whether we are going to take the branch or not   Doesn’t help if we don’t know where a taken branch goes   MIPS: branch delay slot (BDS)

  solves both the prediction and target address problem   End of ID stage, we know whether we are taking a branch and where we are going

  Without the BDS, or with greater level of pipelining, this doesn’t work

  Fortunately, conditional branches, when taken, go the same place every time!   Use history   Keep a cache

  Branch Target Buffer (BTB): Table for branch target

77

Branch Target Buffers

  A table for branches that are predicted as taken   Don’t have to compute branch targets for not-taken branches

  Easy to add to the structure that stores the state of the predictor   Useful for jumps (we know it is always taken, but we don’t know

where)
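A toy model of the lookup, to make the idea concrete. Everything about the layout here is an assumption for illustration: a direct-mapped table indexed by low PC bits, a tag compare for the rest, and `None` standing in for a cleared valid bit.

```python
class BranchTargetBuffer:
    """Direct-mapped table of (tag, target) for branches predicted taken."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries     # None plays the role of valid = 0

    def _index_and_tag(self, pc):
        word = pc >> 2                    # instructions are 4 bytes wide
        return word % self.entries, word // self.entries

    def predict_next_pc(self, pc):
        """Checked at fetch: a hit gives the target, a miss gives PC + 4."""
        index, tag = self._index_and_tag(pc)
        entry = self.table[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return pc + 4

    def update(self, pc, taken, target):
        """Called once the branch outcome is known."""
        index, tag = self._index_and_tag(pc)
        if taken:
            self.table[index] = (tag, target)
        elif self.table[index] is not None and self.table[index][0] == tag:
            self.table[index] = None      # stop predicting this branch taken

btb = BranchTargetBuffer()
print(btb.predict_next_pc(44))            # prints 48 (nothing known yet)
btb.update(44, taken=True, target=72)
print(btb.predict_next_pc(44))            # prints 72
```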

= = ? = = ?

Current PC

= = ? = = ?

Next PC 78

What Makes Pipelines Hard to Implement?

  Detecting and resolving hazards

  Instruction Set Architecture   Very complex multicycle instructions are difficult to pipeline   Example:

  stringMov from 0x1234, to 0x4000, 0x1000 bytes

  Exceptions and Interrupts

80

Exceptions and Interrupts

  Exceptions are exceptional events that disrupt the normal flow of a program

  Terminology varies between different machines   Examples of Interrupts

  User hitting the keyboard   Disk drive asking for attention   Arrival of a network packet

  Examples of Exceptions   Divide by zero   Overflow   Page fault

81

Exception Flow

  When an exception (or interrupt) occurs, control is transferred to the OS

User Process

Event exception

Exception processing by exception handler

Exception return (optional)

Operating System

82

Flow of Instructions During Exception

  Example: Add instruction overflows in clock cycle 3

Clock Cycle 1

Clock Cycle 2

Clock Cycle 3

Clock Cycle 4

Clock Cycle 5

Clock Cycle 6

Clock Cycle 7

REG IM DM ALU Reg

Clock Cycle 8

ADD (user program)   LW (user program)   SUB (user program)   SW (OS)

REG IM DM ALU Reg

Clock Cycle 9

REG IM DM ALU Reg

REG IM DM ALU Reg

83

Characterizing Exceptions and Interrupts

  Synchronous vs. asynchronous events   Synchronous events occur at the same place every time a program executes

  Asynchronous events are caused by external devices such as a keyboard, disk drive or mouse

  User requested vs. coerced   If a user asks for it, it is user requested

  Coerced are hardware events not under user control

  User maskable vs. nonmaskable   Can a user disable an exception from being detected?

  Within vs. between instructions   Does the event prevent the current instruction from completing?

  Resume vs. terminate   Can the event be handled (corrected) or must the program be terminated?

84

Types of Exceptions

Exception                 Syn/Asynch  User request?  User maskable?  Within?  Resume?
I/O device                asynch      coerced        nonmaskable     between  resume
invoke OS                 synch       user req.      nonmaskable     between  resume
tracing instr. execution  synch       user req.      user maskable   between  resume
breakpoint                synch       user req.      user maskable   between  resume
int overflow              synch       coerced        user maskable   within   resume
fp overflow               synch       coerced        user maskable   within   resume
page fault                synch       coerced        nonmaskable     within   resume
misaligned mem access     synch       coerced        user maskable   within   resume
mem-prot violation        synch       coerced        nonmaskable     within   term.
undef. instr              synch       coerced        nonmaskable     within   term.
hardware malf.            asynch      coerced        nonmaskable     within   term.

85

Stopping and Restarting Execution   Exception occurs while many instructions are in flight

  Ex: a page fault on a load instruction will occur in stage 4 of the MIPS pipe   Pipeline must be safely shut down when exception occurs and then restarted at the

offending instruction

  How to handle this? This is done by:   Force a trap instruction into the pipeline   Until the trap is taken, turn off all writes for the faulting instruction and any instruction

that issued after the faulting instruction   This prevents instructions from changing the state of the machine

  When the trap is taken, invoking the OS, the OS saves the PC of the offending instruction

  The OS fixes the exception (if possible) and then restarts the machine   Restarting usually means setting PC <-- offending instruction address   Replays instruction(s)

86

Precise vs. Imprecise Exceptions

  If the pipeline can be stopped so that the instructions issued before the faulting instruction complete, and those issued after it can be restarted from scratch, then the pipeline is said to implement precise exceptions
    Gives the illusion that the machine executes one instruction at a time
    Difficult to do when some instructions take multiple cycles to complete
  Some instructions may complete before an exception is detected
    Example

      Multiply r1, r2, r3    ; multiply takes 10 cycles
      Add r10, r11, r12      ; takes 5 cycles

    Add will complete before multiply is done. If multiply overflows, then an exception will be raised AFTER the add has updated the value in R10. This is an imprecise exception.
  Some machines implement both modes: imprecise and precise exceptions
    Special software instructions to guarantee precise exceptions
    Machine runs slower when one needs precise exceptions
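
To see why the add finishes first, here is a small sketch under a simplifying assumption that is not stated on the slide: each instruction issues one cycle after the previous one and completes a fixed number of cycles after it issues. The function name `completion_cycles` is hypothetical.

```python
# Assumed model: instruction k issues in cycle k (1-based) and completes
# `latency` cycles later, counting the issue cycle itself.
def completion_cycles(program):
    """Return {name: clock cycle in which the instruction completes}."""
    done = {}
    for issue_cycle, (name, latency) in enumerate(program, start=1):
        done[name] = issue_cycle + latency - 1
    return done


program = [("multiply", 10), ("add", 5)]  # latencies from the example
done = completion_cycles(program)
add_finishes_first = done["add"] < done["multiply"]
print(done)                # {'multiply': 10, 'add': 6}
print(add_finishes_first)  # True
```

Under this model the add has already written R10 in cycle 6, four cycles before a multiply overflow could be detected.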

87

Exceptions and the MIPS Architecture

  Which stage can exceptions occur in?

Stage  Problem exceptions occurring
IF     page fault on instruction fetch; misaligned memory access; memory-protection violation
ID     undefined or illegal opcode
EX     arithmetic exception
MEM    page fault on data fetch; misaligned memory access; memory-protection violation
WB     none

88

Multiple Exceptions

  Multiple exceptions can happen in the same cycle
  Example
    In Clock Cycle 4, LW can have a data page fault while the ADD has an arithmetic exception
    Handled by servicing the page fault and then restarting the LW instruction
    The ADD’s arithmetic exception will occur again because the ADD instruction is restarted after the exception is handled

[Figure: pipeline diagram over clock cycles 1-9 showing LW followed by ADD, each passing through IM, REG, ALU, DM, Reg]

89

Multiple Exceptions (cont.)

  Multiple exceptions can be difficult to manage
    Can occur out-of-order
  Example
    ADD causes an exception in the instruction fetch stage while LW causes an exception in the memory access stage
    ADD’s exception is detected earlier in time, even though LW comes first in program order
    If we implement precise exceptions, LW exception must be handled first
    This is done by having hardware post exceptions by order of instruction

[Figure: pipeline diagram over clock cycles 1-9 showing LW followed by ADD, each passing through IM, REG, ALU, DM, Reg]
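
Posting exceptions by order of instruction can be sketched as picking the fault of the oldest instruction rather than the fault detected first in time. This is only an illustration: the `Fault` record and `first_to_handle` are hypothetical names, and the cycle numbers assume LW is fetched in cycle 1 and ADD in cycle 2.

```python
from typing import NamedTuple


class Fault(NamedTuple):
    instr_index: int  # position in program order (0 = oldest)
    instr: str
    stage: str
    cycle: int        # clock cycle in which the fault is detected


def first_to_handle(faults):
    """Pick the fault belonging to the oldest instruction in program order."""
    return min(faults, key=lambda f: f.instr_index)


faults = [
    Fault(1, "ADD", "IF", 2),   # detected first in time
    Fault(0, "LW", "MEM", 4),   # detected later, but LW is older
]
earliest_in_time = min(faults, key=lambda f: f.cycle)
handled_first = first_to_handle(faults)
print(earliest_in_time.instr)  # ADD
print(handled_first.instr)     # LW
```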

90

About Exceptions

  One of the single messiest parts of designing a modern CPU
    It isn’t pretty, it’s easy to get wrong
    It’s often not too elegant
    It usually takes huge wads of special logic
    It causes architects to age prematurely
  Further complicated by modern CPU mechanisms
    Deep pipes
    Superscalar -- lots of instructions in flight in parallel
    Out-of-order execution -- time order of exceptions != program order of the instructions on which the exceptions happened
    Maintaining illusion of “sequential instruction execution” gets really complicated.

91

Performance of Pipelined Systems

  Stalls due to data and branch hazards make performance less than one instruction per cycle
  Compiler is critical in determining overall performance
    Compiler generates code that avoids stalls
  Example

    lw  R15, 0x00(R2)
    add R14, R15, R15
    lw  R16, 0x04(R2)

  Might become:

    lw  R15, 0x00(R2)
    lw  R16, 0x04(R2)
    add R14, R15, R15
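
The benefit of the reordering can be checked with a rough sketch. It assumes a 5-stage pipeline with forwarding, where an instruction that reads a register loaded by the instruction immediately before it stalls for one cycle. The tuple encoding `(op, dest, sources)` and the function name `load_use_stalls` are hypothetical.

```python
def load_use_stalls(program):
    """Count one-cycle stalls: an instruction that reads the register
    written by the load immediately before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


original = [
    ("lw", "R15", ["R2"]),
    ("add", "R14", ["R15", "R15"]),
    ("lw", "R16", ["R2"]),
]
# The compiler moves the independent second load between the first
# load and the add that uses its result.
reordered = [original[0], original[2], original[1]]

print(load_use_stalls(original))   # 1
print(load_use_stalls(reordered))  # 0
```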

92

Data Dependencies

  Identify all of the true data dependencies in the following code fragment. Don’t assume any implementation information (e.g., forwarding).

    add R2, R5, R4
    add R4, R2, R5
    lw  R5, 100(R2)
    add R3, R5, R4
    sw  R3, 101(R2)
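
One way to work through this exercise is to track, for each register, the most recent instruction that wrote it, and record a true (read-after-write) dependency whenever a later instruction reads it. The sketch below is an illustration only; the tuple encoding and the name `true_dependencies` are hypothetical, and instructions are numbered from 1.

```python
def true_dependencies(program):
    """Return (producer, consumer, register) triples for RAW dependencies.

    Each register read is linked to the most recent earlier instruction
    that wrote that register."""
    last_writer = {}
    deps = []
    for i, (op, dest, srcs) in enumerate(program, start=1):
        for reg in dict.fromkeys(srcs):  # de-duplicate, keep order
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None:
            last_writer[dest] = i
    return deps


program = [
    ("add", "R2", ["R5", "R4"]),
    ("add", "R4", ["R2", "R5"]),
    ("lw", "R5", ["R2"]),
    ("add", "R3", ["R5", "R4"]),
    ("sw", None, ["R3", "R2"]),  # a store writes memory, not a register
]
for producer, consumer, reg in true_dependencies(program):
    print(f"instruction {consumer} depends on instruction {producer} via {reg}")
```

For this fragment the sketch reports six true dependencies: 1 to 2, 1 to 3, and 1 to 5 via R2, 2 to 4 via R4, 3 to 4 via R5, and 4 to 5 via R3.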


94

Branch Delay Slot

  Modify the following code to make use of a branch delay slot (assume a MIPS 5-stage pipeline w/bypass).

    Loop: add R3, R3, R4
          lw  R2, 100(R3)
          beq R3, R4, Loop

95

Branch Delay Slot

  Modify the following code to make use of a branch delay slot (assume a MIPS 5-stage pipeline w/bypass).

    Loop: add R3, R3, R4
          beq R3, R4, Loop
          lw  R2, 100(R3)      ; lw now fills the branch delay slot

96

Bypass Paths

  Add the necessary bypass path for the following code fragment

    add R3, R2, R1
    sub R5, R2, R3

[Figure: datapath with PC, register file, sign extend (16 to 32 bits), two MUXes, ALU with Zero output, and data memory (RAM) with Read data output]
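
The condition that triggers a bypass can be sketched as a register-name comparison: if the previous instruction's destination matches one of the current instruction's source operands, that operand has to come from the previous ALU result instead of the register file. This is a simplified illustration, not the control logic of the datapath in the figure, and `forward_from_alu_output` is a hypothetical name.

```python
def forward_from_alu_output(producer_dest, consumer_srcs):
    """For each source operand of the consumer, report whether it must be
    taken from the previous instruction's ALU result (bypassed) rather
    than read from the register file."""
    return [src == producer_dest for src in consumer_srcs]


# add R3, R2, R1  followed by  sub R5, R2, R3
bypass = forward_from_alu_output("R3", ["R2", "R3"])
print(bypass)  # [False, True]
```

Here only the second operand of `sub` (R3) needs the bypass, so the path runs from the ALU output back to the ALU's second input.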




98

Code Performance

  How many cycles will the following code fragment take? Assume a 5-stage MIPS pipeline with forward (bypass) paths

    add R5, R5, R7
    lw  R6, 100(R7)
    sub R7, R6, R8

99

Code Performance

  How many cycles will the following code fragment take? Assume a 5-stage MIPS pipeline with forward (bypass) paths

    add R5, R5, R7
    lw  R6, 100(R7)
    sub R7, R6, R8

[Figure: pipeline diagram, program execution vs. time over clock cycles 1-8, for ADD, LW, and SUB; SUB has one STALL cycle while it waits for LW to load R6]
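
The cycle count can be checked with the usual pipeline fill formula: the first instruction needs one cycle per stage, each later instruction adds one cycle, and each stall adds one more. The function name `total_cycles` is hypothetical, and the single stall assumes the load-use case shown in the figure.

```python
def total_cycles(num_instructions, num_stages=5, stall_cycles=0):
    """Cycles to run a straight-line fragment on an ideal in-order
    pipeline: fill time + one cycle per extra instruction + stalls."""
    return num_stages + (num_instructions - 1) + stall_cycles


# add, lw, sub: sub uses R6 right after lw loads it, so one stall.
print(total_cycles(3, stall_cycles=1))  # 8
print(total_cycles(3))                  # 7 if there were no stall
```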

100

Machine Performance

  Given the following information, what is the clock cycle time (in nanoseconds) of the machine and how many nanoseconds does it take for an instruction to complete?

# of pipeline stages   8
Critical Path          15 ns

101

Machine Performance

  Given the following information, what is the clock cycle time (in nanoseconds) of the machine and how many nanoseconds does it take for an instruction to complete?

# of pipeline stages   8
Critical Path          15 ns

Recall that the critical path is the longest path through a pipeline stage. For a pipelined machine, the critical path defines the cycle time. Therefore, the clock cycle time is 15 ns. One instruction takes 8 stages * 15 ns = 120 ns.
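
The same arithmetic as a short sketch; `pipeline_timing` is a hypothetical helper name, and it assumes every stage takes exactly one clock cycle.

```python
def pipeline_timing(num_stages, critical_path_ns):
    """Return (clock cycle time, single-instruction latency) in ns.

    The slowest stage sets the clock, and an instruction passes
    through every stage once."""
    cycle_time_ns = critical_path_ns
    latency_ns = num_stages * cycle_time_ns
    return cycle_time_ns, latency_ns


cycle_time_ns, latency_ns = pipeline_timing(num_stages=8, critical_path_ns=15)
print(cycle_time_ns)  # 15
print(latency_ns)     # 120
```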

102

Machine Performance (2)

  Using the machine specified from the previous problem
    What is the minimum (best) CPI (assume a MIPS 5-stage pipeline)?
    What is the machine’s CPI if 20% of all instructions are loads and 5% of the instructions following a load depend on the result of the load (assume all other instructions have no dependencies)?

103

Machine Performance (2)

  Using the machine specified from the previous problem
    What is the minimum (best) CPI (assume a MIPS 5-stage pipeline)?
    What is the machine’s CPI if 20% of all instructions are loads and 5% of the instructions following a load depend on the result of the load (assume all other instructions have no dependencies)?

The best CPI is 1.0
20% * 5% = 1% <-- 1% of the instructions stall for one cycle
The CPI is: 99% * 1.0 cycles + 1% * 2.0 cycles = 1.01 cycles per instruction
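
The CPI calculation generalizes to other instruction mixes. The sketch below uses a hypothetical helper, `cpi_with_load_stalls`, and assumes that load-use stalls are the only source of extra cycles.

```python
def cpi_with_load_stalls(load_fraction, dependent_fraction,
                         base_cpi=1.0, stall_cycles=1):
    """Average CPI when a fraction of instructions follow a load and
    depend on it, each paying `stall_cycles` extra cycles."""
    stall_fraction = load_fraction * dependent_fraction
    return base_cpi + stall_fraction * stall_cycles


cpi = cpi_with_load_stalls(load_fraction=0.20, dependent_fraction=0.05)
print(round(cpi, 4))  # 1.01
```

Writing the result as base CPI plus stall cycles per instruction is equivalent to the weighted average 99% * 1.0 + 1% * 2.0.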

104

Solution (w/out forwarding)

    ADD R4, R5, R2
    LW  R15, 0(R4)
    SW  R15, 4(R2)

[Figure: pipeline diagram, program execution vs. time; LW waits two bubbles for R4 from ADD, and SW waits two bubbles for R15 from LW]

105

Solution (w/forwarding)

    ADD R4, R5, R2
    LW  R15, 0(R4)
    SW  R15, 4(R2)

[Figure: pipeline diagram, program execution vs. time; R4 is forwarded from ADD to LW with no bubble, and R15 is forwarded from LW to SW after one bubble]
