instructioninstruction--level parallelismlevel … parallelismlevel parallelism and its ......

InstructionInstruction--Level ParallelismLevel Parallelismand its Exploitation: PART 1and its Exploitation: PART 1

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

Project and Case StudiesProject and Case StudiesProject and Case StudiesProject and Case Studies

• Project: Lecture Nov 9Project: Lecture Nov 9• Case studies

• Presentation of pipeline case studies on Nov 12p p• This week: Division into groups (5-6 in each)• Selection of case studies:

MIPS R10000• MIPS R10000• Intel Pentium 4• AMD Opteron• Sun Rock (UltraSparc)• IBM Power 6

LecturesLectures1. Introduction2. Instruction-level Parallelism, part 13. Instruction-level Parallelism, part 2, p4. Memory Hierarchies5. Multiprocessors and Thread-Level Parallelismp6. System Aspects and Virtualization7. Summary and Reviewy

Bottlenecks in Simple PipelinesBottlenecks in Simple Pipelines

ID/EX EX/MEM MEM/WB

4

P

IF/ID ID/EX EX/MEM MEM/WB

Datamemory

PC

Three classes of dependences that limit parallelism: Data hazards (lost cycles due to dependences) C t l h d (l t l d t b h ) Control hazards (lost cycles due to branches) Structural hazards (lost cycles due to lack of resources)

Unlocking Instruction-Level Parallelism: gA First Set of Techniques

• Improve parallel execution in basic pipeline to p p p pavoid stalls:– Static scheduling of instructions (compiler)– Dynamic branch prediction (run-time)

E l it ll li i Add• Exploit parallelism in programsdynamically (run-time)

Add

Add

Addfor (i=1000; i>0; i=i-1)x[i] = x[i] + 10.0;

Add

Use many execution unitsto run things in parallel

InstructionInstruction--Level Parallelism Level Parallelism –– Basic Basic Concepts (Ch 2.1)Concepts (Ch 2.1)

Two instructions must be independent in order to execute pin parallelThree classes of dependences that limit parallelism: Data dependences N d d Name dependences Control dependences

Dependences are properties of the program Can lead to hazards which are properties of the

i li i tipipeline organization

Data DependencesData DependencesData DependencesData DependencesAn instruction j is data dependent on

instruction i if:instruction i if:– instruction i produces a result used by instr. j, or– instruction j is data dependent on instruction kinstruction j is data dependent on instruction k

and instr. k is data dependent on instr. i

Example: [Notation: OP Rx Ry Rz Rx <- Ry OP Rz]Example: [Notation: OP Rx, Ry, Rz Rx <- Ry OP Rz]LD F0,0(R1)ADDD F4,F0,F2SD 0(R1) F4SD 0(R1),F4

Easy to detect dependences for registers – trickier for memory locations (memory ambiguity/alias problem)memory locations (memory ambiguity/alias problem)

Name DependencesName DependencesName DependencesName Dependences

Two instructions use same name (register orTwo instructions use same name (register or memory address) but don’t exchange data

Anti dependence (WAR if hazard in pipeline) Anti dependence (WAR if hazard in pipeline)Instruction j writes to a register or memory location that instruction ireads from and instruction i is executed first

Output dependence (WAW if hazard in pipeline) Output dependence (WAW if hazard in pipeline)Instruction i and instruction j write to the same register or memory location; ordering between instructions must be preserved

Name dependences are not fundamental and can sometimes be eliminated through hardware or software techniques (renaming)software techniques (renaming)

Control DependencesControl DependencesControl DependencesControl DependencesExample:

if Test1 then { S1 }if Test2 then { S2 }• S1 is control dependent on Test1• S1 is control dependent on Test1• S2 is control dependent on Test2;

but not on Test1

We can’t move an instruction that is dependent on a branch before the branch instructionWe can’t move an instruction that is not controlWe can t move an instruction that is not control dependent on a branch after the branch instr.

Compiler Techniques to Expose ILP: Compiler Techniques to Expose ILP: p q pp q pExample (Ch 2.2)Example (Ch 2.2)

for (i=1000; i>0; i=i-1)x[i] = x[i] + 10.0;

Iterations are independent => parallel executionloop: LD F0, 0(R1) ; F0 = array elementRAW

ADDD F4, F0, F2 ; Add scalar constantSD 0(R1), F4 ; Save resultSUBI R1, R1, #8 ; decrement array ptr.

RAW

BNEZ R1, loop ; reiterate if R1 != 0NOP ; delayed branch

C li i t ll lti i h it ti ?Can we eliminate all penalties in each iteration?

Static SchedulingStatic Scheduling withinwithin each Loop Iterationeach Loop IterationStatic Scheduling Static Scheduling withinwithin each Loop Iterationeach Loop Iteration

loop: LD F0 0(R1) l LD F0 0(R1)loop: LD F0, 0(R1)stallADDD F4, F0, F2stall

loop: LD F0, 0(R1) stallADDD F4, F0, F2SUBI R1, R1, #8

stallSD 0(R1), F4SUBI R1, R1, #8BNEZ R1 loop

BNEZ R1, loopSD 8(R1), F4

BNEZ R1, loopNOP

Original loop: Statically scheduled loop:Four stall cycles One stall cycle

Can we do better by scheduling across iterations?

Loop UnrollingLoop Unrollingp gp gloop: LD F0, 0(R1)

ADDD F4, F0, F2SD 0(R1), F4 ; drop SUBI & BNEZLD F6 8(R1) ; adjust displacement

RAW

RAW

LD F6, -8(R1) ; adjust displacementADDD F8, F6, F2SD -8(R1), F8 ; drop SUBI & BNEZLD F10, -16(R1) ; adjust displacement

RAW

RAW

ADDD F12, F10, F2SD -16(R1), F12 ; drop SUBI & BNEZLD F14, -24(R1) ; adjust displacementADDD F16 F14 F2

RAW

RAW

RAW ADDD F16, F14, F2SD -24(R1), F16SUBI R1, R1, #32 ; alter to 4*8BNEZ R1, loop

RAW

RAW

NOP

Registers must be renamed to avoid WAR hazards A larger chunk of sequential code simplifies scheduling A larger chunk of sequential code simplifies scheduling

Statically Scheduled Unrolled LoopStatically Scheduled Unrolled LoopStatically Scheduled Unrolled LoopStatically Scheduled Unrolled Looploop: LD F0, 0(R1)

Important steps: Hoist loads

LD F6, -8(R1)LD F10, -16(R1)LD F14, -24(R1)ADDD F4 F0 F2

Push stores down Note: the displacement of the store instr must beADDD F4, F0, F2

ADDD F8, F6, F2ADDD F12, F10, F2ADDD F16, F14, F2

of the store instr. must be changedEffects of loop unrolling:

SD 0(R1), F4SD -8(R1), F8SD -16(R1), F12SUBI R1 R1 #32

Provides a larger seq. instr. window Simplifies for static andSUBI R1, R1, #32

BNEZ R1, loopSD 8(R1), F16

All penalties are eliminated CPI=1!

Simplifies for static and dynamic methods to extract ILP But makes code biggerAll penalties are eliminated. CPI=1! But makes code bigger

Dynamic Methods to Improve ILPDynamic Methods to Improve ILP

• Not all potential hazards can be resolved byNot all potential hazards can be resolved by static scheduling at compile time

• For static scheduling to work, it is necessary for g ythe compiler to know enough about the processor implementation to predict hazards

Therefore, we will also explore dynamic techniques for hazard resolutionhazard resolution

General Processor OrganizationGeneral Processor Organization

Get

Memoryaccess

Fetchinstruction

Getoperands

&Issue

Integer& Logic

Updatestate

Floatingpoint

Major bottlenecksControl hazards, memory performance => Fetch bottleneckData hazards, structural hazards, control hazards => Issue bottleneck

Fetch BottleneckFetch Bottleneck

• Control hazardsControl hazards– Dynamic branch prediction: Predict outcome of

branches and jumps– Branch target buffers– Issue (and execute) beyond branches– Do not update state until prediction verified– Do not update state until prediction verified

• Memory bottleneck– Memory performance improvement (memoryMemory performance improvement (memory

hierarchy)– Prefetch, multiple fetch

Dynamic Branch Prediction (Ch 2 3)Dynamic Branch Prediction (Ch 2 3)Dynamic Branch Prediction (Ch. 2.3)Dynamic Branch Prediction (Ch. 2.3)

Branches limit performance because:p• Branch penalties are high• Prevent a lot of ILP from being exploited

Solution: Dynamic branch prediction to predict the outcome of conditional branches.

Benefits: Reduce time to determine branch condition Reduce time to calculate the branch target address

Branch History TableBranch History TableA simple branch prediction scheme

IF ID EX MEM WB

PC

100

branches predicted as taken01010

The idea: Use last branch outcome as prediction– branch-prediction buffer is indexed by bits from branch-

i t ti PC linstruction PC values– If prediction is wrong, then invert prediction

P bl l t i di ti iProblem: a loop causes two mispredictions in a row

A TwoA Two bit Prediction Schemebit Prediction SchemeA TwoA Two--bit Prediction Schemebit Prediction Scheme

– Requires prediction to be wrong twice in order to change prediction => better performance

– Performance:Performance:• 0%-18% miss prediction frequency for SPEC92• Integer programs have higher miss frequency than floating

point (FP) programspoint (FP) programs

Correlating Branch PredictorsCorrelating Branch Predictors

• Correlating predictorsCorrelating predictors ((m,n) predictor)

• Takes into consideration multiple branches and their correlationP f b tt th th• Performs better than the 2-bit predictor

Issue BottleneckIssue Bottleneck• RAW hazards

– Dynamic scheduling (out-of-order execution)• WAR & WAW hazards

– Remove name dependencies (register renaming)• Structural hazards

– Dynamic scheduling (out-of-order execution)– Memory performance improvement (memory hierarchy, prefetch,

non blocking load/store buffers)non-blocking, load/store buffers)– Multiple and pipelined functional units

• Control hazardsSpeculative execution– Speculative execution

• Single issue– Issue multiple instructions per cycle (superscalar, VLIW)

Dynamic Instruction Scheduling (Ch 2 4)Dynamic Instruction Scheduling (Ch 2 4)Dynamic Instruction Scheduling (Ch. 2.4)Dynamic Instruction Scheduling (Ch. 2.4)

Key idea: Allow subsequent independent instructions to y q pproceedDIVD F0,F2,F4 ; takes long timeADDD F10 F0 F8 ; stalls waiting for F0ADDD F10,F0,F8 ; stalls waiting for F0SUBD F12,F8,F13 ; Let this instr. bypass the ADDD

• Enables out-of-order execution => out-of-order completionInstr. gets

IF ID M WBEXstuck here

Two historical schemes used in recent machines:Scoreboard dates back to CDC 6600 in 1963Tomasulo’s algorithm in IBM 360/91 in 1967Tomasulo s algorithm in IBM 360/91 in 1967

Tomasulo’s Algorithm: Hardware Tomasulo’s Algorithm: Hardware ggOrganization (Ch. 2.5)Organization (Ch. 2.5)

Note: Tomasulo’s algorithm is gof course applicable also for other types oftypes of instructions than floating point.

Basic IdeasBasic Ideas

• Decouple issue from operand fetch –Decouple issue from operand fetch Prevents stall due to RAW hazards

• Register renaming: Translate result register g g greferences to instruction (functional unit) references – Prevents WAR and WAW hazards

E l i t S d T t WAW/WARExample – registers S and T prevent WAW/WARDIV.D F0,F2,F4ADD.D S,F0,F8

DIV.D F0,F2,F4ADD.D F6,F0,F8

S.D S,0(R1)SUB.D T,F10,F8MUL.D F6,F10,T

S.D F6,0(R1)SUB.D F8,F10,F8MUL.D F6,F10,F8

Three Stages ofThree Stages of Tomasulo’sTomasulo’s AlgorithmAlgorithmThree Stages of Three Stages of Tomasulo’sTomasulo’s Algorithm Algorithm

1. Issue—get instruction from FP Op Queueg p Issue if no structural hazard for a reservation station

2. Execution—operate on operands (EX) Execute when both operands are available;

if not ready, watch Common Data Bus (CDB) for result

3. Write result—finish execution (WB) Write on CDB to all awaiting functional units;

mark reservation station available

– Normal bus: data + destinationC D t B d t– Common Data Bus: data + source

Tomasulo example cycle 0Tomasulo example, cycle 0Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 BusyAddress6 3 yLD F2 45+ R3 Load1 NoMULTDF0 F2 F4 Load2 NoSUBD F8 F6 F2 Load3 NoDIVD F10 F0 F6ADDD F6 F8 F2

Functional unit status src 1 src 2 RS for j RS for kTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result statusF0 F2 F4 F6 F8 F10 F30F0 F2 F4 F6 F8 F10 ... F30

FUClock: 0

Tomasulo example cycle 1Tomasulo example, cycle 1Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 Time Busy Address6 3 e yLD F2 45+ R3 Load1 Yes R2+32MULTDF0 F2 F4 Load2 NoSUBD F8 F6 F2 Load3 NoDIVD F10 F0 F6ADDD F6 F8 F2




FU Load1Clock: 1

Tomasulo example cycle 2Tomasulo example, cycle 2Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 Time Busy Address6 3 e yLD F2 45+ R3 2 1 Load1 Yes R2+32MULTDF0 F2 F4 Load2 Yes R3+45SUBD F8 F6 F2 Load3 NoDIVD F10 F0 F6ADDD F6 F8 F2




FU Load2 Load1Clock: 2

Tomasulo example cycle 3Tomasulo example, cycle 3Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 Time Busy Address6 3 3 e yLD F2 45+ R3 2 0 Load1 Yes R2+32MULTDF0 F2 F4 3 1 Load2 Yes R3+45SUBD F8 F6 F2 Load3 NoDIVD F10 F0 F6ADDD F6 F8 F2


Add1 NoAdd2 NoAdd3 NoMult1 Yes Mult F4 Load2Mult2 No


FU Mult1 Load2 Load1Clock: 3

Tomasulo example cycle 4Tomasulo example, cycle 4Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 Load1 NoMULTDF0 F2 F4 3 0 Load2 Yes R3+45SUBD F8 F6 F2 4 Load3 NoDIVD F10 F0 F6ADDD F6 F8 F2


Add1 Yes Sub M(R2+34) Load2Add2 NoAdd3 NoMult1 Yes Mult F4 Load2Mult2 No


FU Mult1 Load2 - Add1Clock: 4

Tomasulo example cycle 5Tomasulo example, cycle 5Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 5 Load1 NoMULTDF0 F2 F4 3 Load2 NoSUBD F8 F6 F2 4 Load3 NoDIVD F10 F0 F6 5ADDD F6 F8 F2


2 Add1 Yes Sub M(R2+34)M(R3+45) -Add2 NoAdd3 No

10 Mult1 Yes Mult M(R3+45) F4 -Mult2 Yes Div F6 Mult1


FU Mult1 - Add1 Mult2Clock: 5

Tomasulo example cycle 6Tomasulo example, cycle 6Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 5 Load1 NoMULTDF0 F2 F4 3 Load2 NoSUBD F8 F6 F2 4 Load3 NoDIVD F10 F0 F6 5ADDD F6 F8 F2 6


1 Add1 Yes Sub M(R2+34)M(R3+45)Add2 Yes Add F2 Add1Add3 No

9 Mult1 Yes Mult M(R3+45) F4Mult2 Yes Div F6 Mult1


FU Mult1 Add2 Add1 Mult2Clock: 6

Tomasulo example cycle 7Tomasulo example, cycle 7Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 5 Load1 NoMULTDF0 F2 F4 3 Load2 NoSUBD F8 F6 F2 4 7 Load3 NoDIVD F10 F0 F6 5ADDD F6 F8 F2 6


0 Add1 Yes Sub M(R2+34)M(R3+45)Add2 Yes Add F2 Add1Add3 No



FU Mult1 Add2 Add1 Mult2Clock: 7

Tomasulo example cycle 8Tomasulo example, cycle 8Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 5 Load1 NoMULTDF0 F2 F4 3 Load2 NoSUBD F8 F6 F2 4 7 8 Load3 NoDIVD F10 F0 F6 5ADDD F6 F8 F2 6


Add1 No2 Add2 Yes Add F6-F2 F2 -

Add3 No7 Mult1 Yes Mult M(R3+45) F4

Mult2 Yes Div F6 Mult1


FU Mult1 Add2 - Mult2Clock: 8

Tomasulo example cycle 10Tomasulo example, cycle 10Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 5 Load1 NoMULTDF0 F2 F4 3 Load2 NoSUBD F8 F6 F2 4 7 8 Load3 NoDIVD F10 F0 F6 5ADDD F6 F8 F2 6 10


Add1 No0 Add2 Yes Add F6-F2 F2

Add3 No5 Mult1 Yes Mult M(R3+45) F4

Mult2 Yes Div F6 Mult1


FU Mult1 Add2 Mult2Clock: 10

Tomasulo example cycle 11Tomasulo example, cycle 11Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 5 Load1 NoMULTDF0 F2 F4 3 Load2 NoSUBD F8 F6 F2 4 7 8 Load3 NoDIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 No



FU Mult1 - Mult2Clock: 11

Tomasulo example cycle 15Tomasulo example, cycle 15Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 5 Load1 NoMULTDF0 F2 F4 3 15 Load2 NoSUBD F8 F6 F2 4 7 8 Load3 NoDIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 No



FU Mult1 Mult2Clock: 15

Tomasulo example cycle 16Tomasulo example, cycle 16Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 5 Load1 NoMULTDF0 F2 F4 3 15 16 Load2 NoSUBD F8 F6 F2 4 7 8 Load3 NoDIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 NoMult1 No

40 Mult2 Yes Div F0 F6 -


FU - Mult2Clock: 15

Tomasulo example cycle 56Tomasulo example, cycle 56Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 5 Load1 NoMULTDF0 F2 F4 3 15 16 Load2 NoSUBD F8 F6 F2 4 7 8 Load3 NoDIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 NoMult1 No

0 Mult2 Yes Div F0 F6


FU Mult2Clock: 56

Tomasulo example cycle 57Tomasulo example, cycle 57Instruction status Exec. WriteInstruction j k Issue compl. result Load buffersLD F6 34+ R2 1 3 4 Time Busy Address6 3 3 e yLD F2 45+ R3 2 4 5 Load1 NoMULTDF0 F2 F4 3 15 16 Load2 NoSUBD F8 F6 F2 4 7 8 Load3 NoDIVD F10 F0 F6 5 56 57ADDD F6 F8 F2 6 10 11




FU -Clock: 57

SummarySummary• Data, name, and control dependences• Hazards and static instruction order prevent ILP exploitation• Static methods

– Let compiler reorder instructions to remove hazards• Dynamic methods• Dynamic methods

– Modify processor organization to remove hazards and dynamically reorder instructions

– Fetch bottleneck: Dynamic branch predictionI b ttl k T l ’ l ith– Issue bottleneck: Tomasulo’s algorithm

• Next– Speculative execution– Multiple issuep– Register renaming– Increased instruction fetch bandwidth

instructioninstruction--level parallelismlevel … parallelismlevel parallelism and its ......

Documents