chapter 3: instruction level parallelism (ilp)

Instruction Level Parallelism (ILP)

ILP: The simultaneous execution of multiple instructions from aprogram. While pipelining is a form of ILP, the generalapplication of ILP goes much further into more aggressivetechniques to achieve parallel execution of the instructions inthe instruction stream.

1 / 26

Exploiting ILP

Two basic approaches:

I rely on hardware to discover and exploit parallelismdynamically, and

I rely on software to restructure programs to staticallyfacilitate parallelism.

These techniques are complimentary. They can be and areused in concert to improve performance.

2 / 26

Pgm Challenges and Opportunities to ILP

Basic Block: a straight-line code sequence with a single entrypoint and a single exit point.

Remember, the average branch frequency is 15%–25%; thusthere are only 3–6 instructions on average between a pair ofbranches.

Loop level parallelism: the opportunity to use multiple iterationsof a loop to expose additional parallelism. Example techniquesto exploit loop level parallelism: vectorization, data parallel, loopunrolling.

3 / 26

Dependencies and Hazards

3 types of dependencies:

I data dependencies (or true data dependencies),

I name dependencies, and

I control dependencies.

Dependencies are artifacts of programs; hazards are artifactsof pipeline organization. Not all dependencies become hazardsin the pipeline. That is, dependencies may turn into hazardswithin the pipeline depending on the architecture of the pipeline.

4 / 26

Data Dependencies

An instruction j is data dependent on instruction i if:

I instruction i produces a result that may be used byinstruction j; or

I instruction j is data dependent on instruction k, andinstruction k is data dependent on instruction i.

5 / 26

Name Dependencies (not true dependencies)

Occurring when two instructions use the same register or memorylocation, but there is no flow in data between the instructionsassociated with that name.

There are 2 types of name dependencies between an instruction ithat precedes instruction j in the execution order:

I antidependence: when instruction j writes a register or memorylocation that instruction i reads.

I output dependence: when instruction i and instruction j write thesame register or memory location.

Because these are not true dependencies, the instructions i and jcould potentially be executed in parallel if these dependencies aresome how removed (by using distinct register or memory locations).

6 / 26

Data Hazards

A hazard exists whenever there is a name or data dependencebetween two instructions and they are close enough that theiroverlapped execution would violate the program’s order ofdependency.

Possible data hazards:

I RAW (read after write)

I WAW (write after write)

I WAR (write after read)

RAR (read after read) is not a hazard.

7 / 26

Control Dependencies

Dependency of instructions to the sequential flow of executionand preserves branch (or any flow altering operation) behaviorof the program.

In general, two constraints are imposed by controldependencies:

I An instruction that is control dependent on a branch cannotbe moved before the branch so that its execution is nolonger controlled by the branch.

I An instruction that is not control dependent on a branchcannot be moved after the branch so that the execution iscontrolled by the branch.

8 / 26

Relaxed Adherence to Dependencies

Strictly enforcing dependency relations is not entirelynecessary if we can preserve the correctness of the program.Two properties critical to program correctness (and normallypreserved by maintaining both data and control dependency)are:

I exception behavior: reordering instructions must not alterthe exception behavior of the program — specifically thevisible exceptions (transparent exceptions such as pagefaults are ok to affect).

I data flow: data sources must be preserved to feed theinstructions consuming that data.

Liveness: data that is needed is called live; data that is nolonger used is called dead.

9 / 26

Compiler Techniques for Exposing ILP

I loop unrolling

I instruction scheduling

10 / 26

Advanced Branch Prediction

I Correlating branch predictors (or two-level predictors):make use of outcome of most recent branches to makeprediction.

I Tournament predictors: run multiple predictors and run atournament between them; use the most successful.

11 / 26

Measured misprediction rates: SPEC89

6%

7%

8%

5%

4%

3%

2%

1%

0%

Conditio

nal bra

nch m

ispre

dic

tion r

ate

Total predictor size

Local 2-bit predictors

Correlating predictors

Tournament predictors

5124804484163843523202882562241921601289664320

12 / 26

End section 3.1-3.3

Lecture break; continue section 3.4-3.9 next class

13 / 26

Dynamic Scheduling

Designing the hardware so that it can dynamically rearrangeinstruction execution to reduce stalls while maintaining dataflow and exception behavior.

Two techniques:

I Scoreboarding (discussed in Appendix C), centralizedscoreboard

I Tomasulo’s Algorithm (discussed in Chapter 3), distributedreservation stations

14 / 26

Dynamic Scheduling

Instructions are issued to the pipeline in-order but executed andcompleted out-of-order.

out-of-order execution leading to the possibility of out-of-ordercompletion.

out-of-order execution introduces the possibility of WAR andWAW hazards which do not occur in statically scheduledpipelines.

out-of-order completion creates major complications inexception handling

15 / 26

Key Components of Tomasulo’s Algorithm

Register renaming: to minimize WAW and WAR hazards (caused byname dependencies).

Reservation station: buffering operands for instructions waiting toexecute. Operands are fetched as soon as they are available (fromthe reservation station/instruction serving the operand). The registerspecifiers of issued instructions are renamed to the reservationstation (to achieve register renaming).

Thus, hazard detection and execution control are distributed (thetargeted functional unit and reservation station determine when).

Common data bus/result bus/etc: carries results past the reservationstations (where they are captured) and back to the register file.Sometimes multiple buses are used.

16 / 26

Hardware Speculation

Execute (completely) the instructions that are predicted tooccur after a branch w/o knowing the branch outcome.Speculate that the instructions are to be executed.

instruction commit: when the results (operandwrites/exceptions) of the speculated instruction are made.

Key to speculation is to allow out-of-order instruction executebut force them to commit in order. Generally achieved by areorder buffer (ROB) which holds completed instructions andretires them in order. Thus ROB holds/buffers register/memoryoperands that are also used by functional units for sourceoperands.

17 / 26

Multiple-Issue Processors

Attempting to reduce the CPI below one.

1. Statically scheduled superscalar processors

2. VLIW (very long instruction word) processors

3. Dynamically scheduled superscalar processors

superscalar: scheduling multiple instructions for execution atthe same time.

18 / 26

Advanced Instruction Delivery & Speculation

I branch-target buffer or branch-target cache: predict thedestination PC address for a branch instruction; can alsostore the target instruction (instead of, or in addition to thedestination address). branch folding: overwrite thebranch instr with the destination instruction.

I Return address prediction.

I Integrated instruction fetch units. Autonomously executingunits to fetch instructions.

I Comments on speculation: how much? through multiplebranches? energy costs?

I Value prediction: not yet feasible; not sure why it’s covered.

19 / 26

Branch-Target Buffer

Look up Predicted PC

Number of

entries

in branch-

target

buffer

No: instruction is

not predicted to be

branch; proceed normally=

Yes: then instruction is branch and predicted

PC should be used as the next PC

Branch

predicted

taken or

untaken

PC of instruction to fetch

20 / 26

End section 3.4-3.9

Lecture break; continue section 3.10-3.16 next class

21 / 26

The Limits of ILP

How much ILP is available? Is there a limit?

Consider the ideal processor:

1. Infinite number of registers for register renaming.

2. Perfect branch prediction

3. Perfect jump prediction

4. Perfect memory address analysis

5. All memory accesses occur in one cycle

Effectively removing all control dependencies and all but true datadependencies. That is, all instructions can be scheduled as early astheir data dependency allows.

22 / 26

Unlimited Issue/Single Cycle Execution

0 20 40 60 80 100 120

Instruction issues per cycle

gcc

espresso

li

SP

EC

benchm

ark

s

fpppp

doduc

tomcatv

55

63

18

75

119

150

140 160

The Power7 currently issues up to six instructions per clock.

23 / 26

Limited Issue Window Size/Single Cycle Execution

I Up to 64 instructionsissued/cycle

I Tournament predictor;best available in 2011(not a primarybottleneck)

I Perfect memoryaddress analysis

I 64 int/64 FP extraregisters available forrenaming

I Single cycle execution

101010

89

1515

13

810

111212

119

14

2235

5247

912

151617

5645

3422

14

gcc

espresso

li

fpppp

Benchm

ark

sdoduc

tomcatv

0 10 20

Instruction issues per cycle

30 40 50 60

Infinite

256

128

64

32

Window size

24 / 26

Hardware/Software Speculation

Professor rambles on about this topic for a few moments.

25 / 26

Multithreading

I Fine-grained multithreading: switch threads at each clockcycle. Examples: Sun Niagra processor (8 threads), NvidiaGPUs.

I Course-grained multithreading: switch threads at majorstalls such as L2/L3 misses. Examples: no commercialventures.

I Simultaneous multithreading: process/dispatch multiplethreads simultaneously to the common functional units.Examples: Intel (hyperthreading, two threads), IBMPower7 (four threads).

26 / 26

chapter 3: instruction level parallelism (ilp)

Documents