Pipeline Complications
CS510 Computer Architectures, Lecture 8: Advanced Pipeline


Page 1: Lecture 8: Advanced Pipeline

Page 2: Interrupts

Interrupts: 5 instructions executing in a 5-stage pipeline
– How to stop the pipeline?
– How to restart the pipeline?
– Who caused the interrupt?

Stage  Exceptional conditions
IF     Page fault on instruction fetch; unaligned memory access; memory-protection violation
ID     Undefined or illegal opcode
EX     Arithmetic interrupt
MEM    Page fault on data fetch; unaligned memory access; memory-protection violation

Page 3: Simultaneous Exceptions in More Than One Pipe Stage

• Simultaneous exceptions in more than one pipeline stage, e.g. LD followed by ADD
  – LD with a data page (DM) fault in the MEM stage
  – ADD with an instruction page (IM) fault in the IF stage
  – The ADD fault will happen BEFORE the LD fault
• Solution #1
  – Keep an interrupt status vector per instruction
  – Defer the check until the last stage, and kill the machine state update if there is an exception
  – This delays updating the machine state until late in the pipeline, possibly until the completion of an instruction!
• Solution #2
  – Interrupt ASAP
  – Restart everything that is incomplete

Page 4: Simultaneous Exceptions

Complex Addressing Modes and Instructions
• Addressing modes: auto-increment changes a register during instruction execution (the register is written in the EX stage instead of the WB stage)
  – Interrupts? Need to restore the register state
  – Adds WAR and WAW hazards, since the register write now happens in EX, no longer in WB
• Memory-memory move instructions
  – Must be able to handle multiple page faults
  – Long-lived instructions: partial state save on interrupt

• Condition Codes

Page 5: Extending the DLX to Handle Multi-cycle Operations

[Figure: DLX pipeline with four parallel execution units between ID and MEM: an integer unit (EX), an FP/integer multiplier, an FP/integer divider, and an FP adder. Stages: IF, ID, EX (one of the four units), MEM, WB.]

Page 6: Multicycle Operations

[Figure: IF and ID feed four execution paths, which rejoin at MEM and WB:
– integer unit: EX (1 stage)
– FP/integer multiply: M1 M2 M3 M4 M5 M6 M7 (7 stages)
– FP adder: A1 A2 A3 A4 (4 stages)
– FP/integer divider: DIV (not pipelined)]

Page 7: Latency and Initiation Interval

Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses that result.

Initiation interval: the number of cycles that must elapse between issuing two operations of the same type.

Functional unit  Latency  Initiation interval
Integer ALU      0        1
Load             1        1
FP add           3        1
FP multiply      6        1
FP divide        14       15

* FP LD and SD are the same as integer loads and stores, since there is a 64-bit path to memory.

Example (latency spans from result available to data needed):

MULTD  IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
ADDD   IF ID A1 A2 A3 A4 MEM WB
LD*    IF ID EX MEM WB
SD*    IF ID EX MEM WB

Page 8: Floating Point Operations

Floating point operations have long execution times. Also, a pipelined FP execution unit may initiate new instructions without waiting for the full latency. Reality: the MIPS R4000.

FP instruction   Latency  Initiation interval (MIPS R4000)
Add, Subtract    4        3
Multiply         8        4
Divide           36       35
Square root      112      111
Negate           2        1
Absolute value   2        1
FP compare       3        2

Latency: cycles before using the result. Initiation interval: cycles before issuing another instruction of the same type.

Page 9: Complications Due to FP Operations

Divide and square root take 10x to 30x longer than add
– Exceptions?
– Adds WAR and WAW hazards, since the pipelines are no longer the same length

Page 10: Summary of Pipelining Basics

• Hazards limit performance

– Structural: need more HW resources

– Data: need forwarding, compiler scheduling

– Control: early evaluation of PC, delayed branch, prediction

• Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency

• Interrupts and the FP instruction set make pipelining harder

• Compilers reduce cost of data and control hazards

– Load delay slots

– Branch delay slots

– Branch prediction

Page 11: Case Study: MIPS R4000 and Introduction to Advanced Pipelining

Page 12: Case Study: MIPS R4000 (100 MHz to 200 MHz)

8-stage pipeline:

IF - First half of instruction fetch: PC selection; initiation of instruction cache access
IS - Second half of instruction fetch: access to instruction cache
RF - Instruction decode, register fetch, hazard checking; also instruction cache hit detection (tag check)
EX - Execution: effective address calculation; ALU operation; branch target computation and condition evaluation
DF - First half of data cache access
DS - Second half of data cache access
TC - Tag check for data cache hit
WB - Write back for loads and register-register operations

• Cache miss exception: tens of cycles of delay
• What is the impact on load delay? Why?

Page 13: The Pipeline Structure of the R4000

[Figure: R4000 pipeline stages IF IS RF EX DF DS TC WB mapped onto instruction memory, register file, ALU, and data memory. The instruction is available after IS; the instruction tag check overlaps RF; load data is available after DS, with the data cache tag check in TC.]

Page 14: Case Study: MIPS R4000 - LOAD Latency

2-cycle load latency:

LD R1,X        IF IS RF EX    DF    DS TC WB
ADD R3,R1,R2      IF IS RF    stall stall EX DF ...

The load data becomes available, with forwarding, at the end of DS; the ADD needs R1 at the start of its EX stage, so an immediately following ADD stalls 2 cycles.

Page 15: Case Study: MIPS R4000 - LOAD Followed by ALU Instructions

2-cycle load latency with forwarding circuit:

LW R1       IF IS RF EX    DF    DS    TC WB
ADD R2,R1      IF IS RF    stall stall EX DF ...
SUB R3,R1         IF IS    stall stall RF EX ...
OR R4,R1             IF    stall stall IS RF ...

Forwarding from the end of DS supplies R1 to the ADD's EX stage after 2 stall cycles; the SUB and OR then proceed without further stalls.


Page 17: Case Study: MIPS R4000 - Branch Latency

Branch conditions are evaluated during the EX stage, so the branch target address is available only after EX. The R4000 uses a predict-NOT-TAKEN strategy:
– NOT TAKEN: one-cycle delay slot
– TAKEN: one-cycle delay slot followed by two stall cycles, i.e. a 3-cycle branch latency

NOT TAKEN:
Br instr       IF IS RF EX DF DS TC WB
Delay slot        IF IS RF EX DF DS TC ...
Br instr +2          IF IS RF EX DF DS ...
Br instr +3             IF IS RF EX DF ...
Br instr +4                IF IS RF EX ...

TAKEN:
Br instr         IF IS RF EX DF DS TC WB
Delay slot          IF IS RF EX DF DS TC ...
Stall
Stall
Br target instr              IF IS RF EX ...

The branch target address is available only after the EX stage, so a taken branch costs the delay slot plus 2 stall cycles.

Page 18: Extending DLX to Handle Floating Point Operations

[Figure: IF and ID feed four parallel execution units, which rejoin at MEM and WB: the integer unit (EX), the FP/integer multiplier, the FP adder, and the FP divider.]

Page 19: MIPS R4000 FP Unit

• FP adder, FP multiplier, FP divider

• Last step of FP Multiplier/Divider uses FP Adder HW

• 8 kinds of stages in the FP units:

Stage  Functional unit  Description
A      FP adder         Mantissa ADD stage
D      FP divider       Divide pipeline stage
E      FP multiplier    Exception test stage
M      FP multiplier    First stage of multiplier
N      FP multiplier    Second stage of multiplier
R      FP adder         Rounding stage
S      FP adder         Operand shift stage
U                       Unpack FP numbers

Page 20: MIPS R4000 FP Pipe Stages

FP instr        1  2    3          4    5    6    7    8
Add, Subtract   U  S+A  A+R        R+S
Multiply        U  E+M  M          M    M    N    N+A  R
Divide          U  A    D^28       D+A  D+R, D+A, D+R, A, R
Square root     U  E    (A+R)^108  A    R
Negate          U  S
Absolute value  U  S
FP compare      U  A    R

(D^28 means 28 cycles in the D stage; (A+R)^108 means 108 iterations of A+R.)

Stages:
A  Mantissa ADD stage          D  Divide pipeline stage
E  Exception test stage        M  First stage of multiplier
N  Second stage of multiplier  R  Rounding stage
S  Operand shift stage         U  Unpack FP numbers

Page 21: MIPS R4000 FP Pipe Stages

[Figure: issue/stall diagram over clock cycles 0-12. Adds issued well before or after a multiply issue without stalling (U S+A A+R R+S). The multiply (U E+M M M M N N+A R) uses the shared FP-adder hardware in its N+A and R cycles, so:
– an ADD issued 4 cycles after the multiply will stall 2 cycles
– an ADD issued 5 cycles after the multiply will stall 1 cycle]

Page 22: R4000 Performance

Not an ideal pipeline CPI of 1:
– Load stalls: 1 or 2 clock cycles
– Branch stalls: 2 cycles for taken branches, plus unfilled branch slots
– FP result stalls: RAW data hazards (latency)
– FP structural stalls: not enough FP hardware (parallelism)

[Chart: pipeline CPI (0 to 4.5) broken into base CPI, load stalls, branch stalls, FP result stalls, and FP structural stalls, for the integer programs eqntott, espresso, gcc, and li, and the floating-point programs doduc, nasa7, ora, spice2g6, su2cor, and tomcatv.]


Page 24: Advanced Pipeline and Instruction Level Parallelism

Page 25: Advanced Pipelining and Instruction Level Parallelism

• gcc: 17% of instructions are control transfers
  – about 5 instructions + 1 branch per basic block
  – must look beyond a single basic block to get more instruction-level parallelism
• Loop-level parallelism is one opportunity, in SW and HW

[Figure: a block of code is a straight-line sequence: it begins at a branch target, may contain any non-branch instructions, and ends with a branch instruction.]

Page 26: Advanced Pipelining and Instruction Level Parallelism

Technique                                  Reduces
Loop unrolling                             Control stalls
Basic pipeline scheduling                  RAW stalls
Dynamic scheduling with scoreboarding      RAW stalls
Dynamic scheduling with register renaming  WAR and WAW stalls
Dynamic branch prediction                  Control stalls
Issuing multiple instructions per cycle    Ideal CPI
Compiler dependence analysis               Ideal CPI and data stalls
Software pipelining and trace scheduling   Ideal CPI and data stalls
Speculation                                All data and control stalls
Dynamic memory disambiguation              RAW stalls involving memory

Page 27: Basic Pipeline Scheduling and Loop Unrolling

FP unit latencies:

Instruction producing result  Instruction using result  Latency (clock cycles)
FP ALU op                     Another FP ALU op         3
FP ALU op                     Store double              2
Load double*                  FP ALU op                 1
Load double*                  Store double              0

* Same as an integer load, since there is a 64-bit data path from/to memory.

Assume the units are fully pipelined or replicated, so there are no structural hazards and an instruction can issue on every clock cycle.

for (i = 1; i <= 1000; i++)
    x[i] = x[i] + s;

Page 28: FP Loop Hazards

Loop: LD   F0,0(R1)   ;R1 is the pointer to a vector
      ADDD F4,F0,F2   ;F2 contains a scalar value
      SD   0(R1),F4   ;store back result
      SUBI R1,R1,#8   ;decrement pointer by 8 bytes (DW)
      BNEZ R1,Loop    ;branch if R1 != zero
      NOP             ;delayed branch slot

Where are the stalls?

Instruction producing result  Instruction using result  Latency (clock cycles)
FP ALU op                     Another FP ALU op         3
FP ALU op                     Store double              2
Load double                   FP ALU op                 1
Load double                   Store double              0
Integer op                    Integer op                0

Page 29: FP Loop Showing Stalls

1 Loop: LD F0,0(R1) ;F0=vector element

2 stall

3 ADDD F4,F0,F2 ;add scalar in F2

4 stall

5 stall

6 SD 0(R1),F4 ;store result

7 SUBI R1,R1,8 ;decrement pointer 8B (DW)

8 stall

9 BNEZ R1,Loop ;branch R1!=zero

10 stall ;delayed branch slot

Rewrite the code to minimize stalls?

Page 30: Reducing Stalls

1 Loop: LD F0,0(R1)

2 stall

3 ADDD F4,F0,F2

4 stall

5 stall

6 SD 0(R1),F4

7 SUBI R1,R1,#8

8 stall

9 BNEZ R1,Loop

10 stall

For the Load-ALU latency (the stall in cycle 2): consider moving SUBI into the load delay slot. Reading R1 by LD is done before SUBI writes R1, so yes, we can.

For the ALU-ALU latency (the stalls in cycles 4-5): there is only one instruction left to move, i.e. BNEZ. When we move it up, the SD instruction fills the delayed branch slot. Since SD then executes after SUBI has decremented R1, we need to change its immediate value from 0 to 8.

Page 31: Revised FP Loop to Minimize Stalls

1 Loop: LD F0,0(R1)

2 SUBI R1,R1,#8

3 ADDD F4,F0,F2

4 stall

5 BNEZ R1,Loop ;delayed branch

6 SD 8(R1),F4 ;altered when move past SUBI

Instruction producing result  Instruction using result  Latency (clock cycles)
FP ALU op                     Another FP ALU op         3
FP ALU op                     Store double              2
Load double                   FP ALU op                 1

Unroll the loop 4 times to make the code faster.

Page 32: Unroll Loop 4 Times

 1 Loop: LD   F0,0(R1)
 2       ADDD F4,F0,F2
 3       SD   0(R1),F4     ;drop SUBI & BNEZ
 4       LD   F6,-8(R1)
 5       ADDD F8,F6,F2
 6       SD   -8(R1),F8    ;drop SUBI & BNEZ
 7       LD   F10,-16(R1)
 8       ADDD F12,F10,F2
 9       SD   -16(R1),F12  ;drop SUBI & BNEZ
10       LD   F14,-24(R1)
11       ADDD F16,F14,F2
12       SD   -24(R1),F16
13       SUBI R1,R1,#32    ;alter to 4*8
14       BNEZ R1,Loop
15       NOP

15 + 4 x (1* + 2+) + 1^ = 28 clock cycles, or 7 per iteration.
1*: LD-to-ADDD stall, 1 cycle
2+: ADDD-to-SD stall, 2 cycles
1^: data dependence on R1

Rewrite the loop to minimize the stalls.

Page 33: Unrolled Loop to Minimize Stalls

 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SUBI R1,R1,#32
12       SD   16(R1),F12  ;16-32 = -16
13       BNEZ R1,LOOP
14       SD   8(R1),F16   ;8-32 = -24

14 clock cycles, or 3.5 per iteration.

Assumptions:
– OK to move SD past SUBI even though SUBI changes R1:

  SUBI  IF ID EX MEM WB
  SD       IF ID EX  MEM WB
  BNEZ        IF ID  EX  MEM WB

– OK to move loads before stores (must get the right data)
– When is it safe for the compiler to make such changes?

Page 34: Compiler Perspectives on Code Movement

• Definitions: the compiler is concerned with dependences in the program; whether a dependence causes a HW hazard depends on the given pipeline
• Data dependence (RAW if a hazard for HW): instruction j is data dependent on instruction i if either
  – instruction i produces a result used by instruction j, or
  – instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i

• Easy to determine for registers (fixed names)
• Hard for memory:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?

Page 35: Compiler Perspectives on Code Movement

• Name dependence: two instructions use the same name (register or memory location) but do not exchange data
• Two kinds of name dependence (instruction i precedes instruction j):
  – Antidependence (WAR if a hazard for HW)
    • Instruction j writes a register or memory location that instruction i reads, and instruction i is executed first
  – Output dependence (WAW if a hazard for HW)
    • Instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved

Page 36: Compiler Perspectives on Code Movement

• Again, hard for memory accesses:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn't change, then

    0(R1) != -8(R1) != -16(R1) != -24(R1)

  There were then no dependences between some loads and stores, so they could be moved past each other.


Page 37: Compiler Perspectives on Code Movement

• Control Dependence

• Example

if p1 {S1;};

if p2 {S2;}

S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.

Page 38: Compiler Perspectives on Code Movement

• Two (obvious) constraints on control dependencies:

– An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.

– An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch.

• Control dependences may be relaxed in some systems to get parallelism; we get the same effect if we preserve the order of exceptions and the data flow

Page 39: When Safe to Unroll Loop?

• Example: when a loop is unrolled, where are the data dependences? (A, B, C are distinct, non-overlapping arrays)

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

1. S2 uses the value A[i+1], computed by S1 in the same iteration.

2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

This is a loop-carried dependence between iterations
• Implies that the iterations are dependent and can't be executed in parallel
• Not the case for our earlier example; each iteration there was distinct

Page 40: When Safe to Unroll Loop?

• Example: where are the data dependences? (A, B, C, D distinct and non-overlapping)

The following looks like it has a loop-carried dependence:

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];  /* S2 */
}

However, we can rewrite it to be free of loop-carried dependences:

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];

Page 41: Summary

• Instruction Level Parallelism in SW or HW

• Loop level parallelism is easiest to see

• SW parallelism: dependences are defined for a program; they become hazards if the HW cannot resolve them

• SW dependencies/compiler sophistication determine if compiler can unroll loops