Chapter 3: Instruction Level Parallelism (ILP)
Instruction Level Parallelism (ILP)
ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further, into more aggressive techniques to achieve parallel execution of the instructions in the instruction stream.
Exploiting ILP
Two basic approaches:
- Rely on hardware to discover and exploit parallelism dynamically, and
- Rely on software to restructure programs to statically facilitate parallelism.
These techniques are complementary. They can be, and are, used in concert to improve performance.
Program Challenges and Opportunities for ILP
Basic Block: a straight-line code sequence with a single entry point and a single exit point.
Remember, the average branch frequency is 15%–25%; thus there are only 3–6 instructions on average between a pair of branches.
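The 3–6 figure follows from simple arithmetic on the branch frequency. A minimal sketch, assuming branches are spread evenly through the instruction stream (the function name is hypothetical, chosen only for illustration):

```python
def instructions_between_branches(branch_frequency):
    """Average number of non-branch instructions between two consecutive branches."""
    # With a branch frequency f, one instruction in every 1/f is a branch,
    # so 1/f - 1 non-branch instructions sit between a pair of branches.
    return 1.0 / branch_frequency - 1.0


for f in (0.15, 0.25):
    n = instructions_between_branches(f)
    print(f"branch frequency {f:.0%}: about {n:.1f} instructions between branches")
```

At 25% this gives 3 instructions and at 15% a little under 6, which is the range quoted on the slide.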
Loop level parallelism: the opportunity to use multiple iterations of a loop to expose additional parallelism. Example techniques to exploit loop level parallelism: vectorization, data-parallel execution, loop unrolling.
Dependencies and Hazards
Three types of dependencies:
- data dependencies (or true data dependencies),
- name dependencies, and
- control dependencies.
Dependencies are artifacts of programs; hazards are artifacts of pipeline organization. Not all dependencies become hazards in the pipeline. That is, dependencies may turn into hazards within the pipeline depending on the architecture of the pipeline.
Data Dependencies
An instruction j is data dependent on instruction i if:
- instruction i produces a result that may be used by instruction j; or
- instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
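The two-part definition above (a direct producer/consumer link, plus transitivity) can be sketched in a few lines. This is an illustration only: the `(dest, sources)` tuple encoding and the function name are assumptions made for the example, and memory locations are ignored.

```python
def data_dependences(program):
    """program: list of (dest, sources) tuples in program order.

    Returns (direct, closure): sets of (i, j) pairs meaning
    "instruction j is data dependent on instruction i".
    """
    direct = set()
    last_writer = {}  # register name -> index of the latest instruction writing it
    for j, (dest, sources) in enumerate(program):
        for src in sources:
            if src in last_writer:
                direct.add((last_writer[src], j))  # i produces a value that j reads
        last_writer[dest] = j

    # Second rule: j depends on k and k depends on i  =>  j depends on i.
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        for (i, k) in list(closure):
            for (k2, j) in list(closure):
                if k == k2 and (i, j) not in closure:
                    closure.add((i, j))
                    changed = True
    return direct, closure


program = [
    ("r1", ["r0"]),        # 0: r1 = load r0
    ("r2", ["r1", "r3"]),  # 1: r2 = r1 + r3
    ("r4", ["r2"]),        # 2: r4 = r2 * r2
    ("r5", ["r6", "r7"]),  # 3: r5 = r6 + r7 (independent of the others)
]
direct, closure = data_dependences(program)
print(sorted(direct))   # direct producer/consumer pairs
print(sorted(closure))  # also contains the transitive pair (0, 2)
```

Instruction 3 appears in no pair, so it is free to execute in parallel with the chain 0, 1, 2.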
Name Dependencies (not true dependencies)
A name dependence occurs when two instructions use the same register or memory location, but there is no flow of data between the instructions associated with that name.
There are two types of name dependencies between an instruction i that precedes instruction j in the execution order:
- antidependence: when instruction j writes a register or memory location that instruction i reads.
- output dependence: when instruction i and instruction j write the same register or memory location.
Because these are not true dependencies, the instructions i and j could potentially be executed in parallel if these dependencies are somehow removed (by using distinct registers or memory locations).
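Removing name dependencies by "using distinct registers" is what renaming does. A minimal sketch, assuming an unlimited supply of fresh names `p0, p1, ...` and the same hypothetical `(dest, sources)` encoding of instructions; it is not a model of any particular hardware renamer.

```python
def rename(program):
    """Give every written value a fresh register name.

    program: list of (dest, sources) over architectural names such as "r1".
    Returns the same instructions over fresh names "p0", "p1", ...
    """
    mapping = {}  # architectural name -> name currently holding its latest value
    renamed = []
    for count, (dest, sources) in enumerate(program):
        # Sources read whatever name holds the most recent value; registers that
        # were never written inside this fragment keep their original name.
        new_sources = [mapping.get(s, s) for s in sources]
        new_dest = f"p{count}"
        mapping[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed


program = [
    ("r1", ["r2", "r3"]),  # 0
    ("r2", ["r4", "r5"]),  # 1: antidependence with 0 (writes r2, which 0 reads)
    ("r1", ["r6", "r7"]),  # 2: output dependence with 0 (both write r1)
    ("r8", ["r1", "r2"]),  # 3: true dependence on 2 and 1
]
for instr in rename(program):
    print(instr)
```

After renaming, instructions 0, 1, and 2 no longer share any written name, so only the true dependencies of instruction 3 on instructions 1 and 2 remain.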
Data Hazards
A hazard exists whenever there is a name or data dependence between two instructions and they are close enough that overlapping their execution would change the order of access to the operand involved in the dependence.
Possible data hazards:
- RAW (read after write)
- WAW (write after write)
- WAR (write after read)
RAR (read after read) is not a hazard.
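The three hazard types can be told apart by comparing the registers an earlier and a later instruction read and write. A small sketch, again using a hypothetical `(dest, sources)` encoding; it reports potential hazards only, since whether one actually arises depends on the pipeline.

```python
def classify(first, second):
    """Return the potential data hazards between two instructions.

    first precedes second in program order; each is (dest, sources).
    """
    dest1, sources1 = first
    dest2, sources2 = second
    kinds = set()
    if dest1 in sources2:
        kinds.add("RAW")  # second reads what first writes (true dependence)
    if dest2 in sources1:
        kinds.add("WAR")  # second writes what first reads (antidependence)
    if dest1 == dest2:
        kinds.add("WAW")  # both write the same location (output dependence)
    return kinds


print(classify(("r1", ["r2"]), ("r3", ["r1"])))  # RAW
print(classify(("r1", ["r2"]), ("r2", ["r4"])))  # WAR
print(classify(("r1", ["r2"]), ("r1", ["r4"])))  # WAW
print(classify(("r1", ["r2"]), ("r3", ["r2"])))  # empty set: read after read
```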
Control Dependencies
A control dependence ties an instruction to the sequential flow of execution: it orders the instruction with respect to a branch (or any other flow-altering operation) so that the branch behavior of the program is preserved.
In general, two constraints are imposed by control dependencies:
- An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
- An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.
Relaxed Adherence to Dependencies
Strictly enforcing dependency relations is not entirely necessary if we can preserve the correctness of the program. Two properties critical to program correctness (and normally preserved by maintaining both data and control dependencies) are:
- exception behavior: reordering instructions must not alter the exception behavior of the program, specifically the visible exceptions (transparent exceptions such as page faults may be affected).
- data flow: data sources must be preserved to feed the instructions consuming that data.
Liveness: data that is still needed is called live; data that is no longer used is called dead.
Compiler Techniques for Exposing ILP
- loop unrolling
- instruction scheduling
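Loop unrolling is easiest to see on a small loop. The sketch below is only an illustration of the transformation, written by hand in Python; a compiler would apply it to machine-level code, and the unroll factor of 4 and both function names are arbitrary choices for the example.

```python
def add_scalar(x, s):
    """Original loop: one element and one loop test per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + s


def add_scalar_unrolled(x, s):
    """Same computation, unrolled by a factor of 4."""
    n = len(x)
    i = 0
    # Main body: four independent updates per iteration, so the loop overhead
    # (index update and branch) is paid once per four elements, and the four
    # updates can be scheduled to overlap.
    while i + 4 <= n:
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    # Cleanup loop for the leftover elements when n is not a multiple of 4.
    while i < n:
        x[i] = x[i] + s
        i += 1


a = list(range(10))
b = list(range(10))
add_scalar(a, 5)
add_scalar_unrolled(b, 5)
print(a == b)
```

The unrolled body is also where instruction scheduling gets its freedom: the four updates touch different elements, so they carry no dependencies on one another.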
Advanced Branch Prediction
- Correlating branch predictors (or two-level predictors): make use of the outcomes of the most recent branches to make a prediction.
- Tournament predictors: run multiple predictors and run a tournament between them; use the most successful.
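To see why branch history helps, the following sketch compares a single 2-bit saturating counter with a correlating predictor that uses the last outcome to choose between two such counters. It is a toy model of one branch under assumed initial state (all counters start at "strongly not taken"), not a description of any real predictor's table organization.

```python
def two_bit_mispredictions(outcomes):
    """One 2-bit saturating counter: values 0-1 predict not taken, 2-3 predict taken."""
    counter = 0
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


def correlating_mispredictions(outcomes, history_bits=1):
    """Recent-outcome history selects which 2-bit counter makes the prediction."""
    counters = [0] * (1 << history_bits)
    mask = (1 << history_bits) - 1
    history = 0
    misses = 0
    for taken in outcomes:
        if (counters[history] >= 2) != taken:
            misses += 1
        c = counters[history]
        counters[history] = min(c + 1, 3) if taken else max(c - 1, 0)
        history = ((history << 1) | int(taken)) & mask  # shift in the newest outcome
    return misses


# A branch that strictly alternates: taken, not taken, taken, ...
pattern = [True, False] * 50
print(two_bit_mispredictions(pattern))      # 50
print(correlating_mispredictions(pattern))  # 2
```

On the alternating pattern the lone counter never settles and mispredicts every taken branch, while the history-indexed counters learn "after not taken comes taken" within two iterations.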
Measured misprediction rates: SPEC89
[Figure: conditional branch misprediction rate (y-axis, 0%–8%) versus total predictor size (x-axis, 0–512) for local 2-bit predictors, correlating predictors, and tournament predictors.]
End section 3.1-3.3
Lecture break; continue section 3.4-3.9 next class
Dynamic Scheduling
Designing the hardware so that it can dynamically rearrange instruction execution to reduce stalls while maintaining data flow and exception behavior.
Two techniques:
- Scoreboarding (discussed in Appendix C), centralized scoreboard
- Tomasulo’s Algorithm (discussed in Chapter 3), distributed reservation stations
Dynamic Scheduling
Instructions are issued to the pipeline in order but executed and completed out of order.
Out-of-order execution leads to the possibility of out-of-order completion.
Out-of-order execution introduces the possibility of WAR and WAW hazards, which do not occur in a simple statically scheduled in-order pipeline.
Out-of-order completion creates major complications in exception handling.
Key Components of Tomasulo’s Algorithm
Register renaming: to minimize WAW and WAR hazards (caused by name dependencies).
Reservation station: buffers operands for instructions waiting to execute. Operands are fetched as soon as they are available (from the reservation station/instruction serving the operand). The register specifiers of issued instructions are renamed to the reservation station (to achieve register renaming).
Thus, hazard detection and execution control are distributed (the targeted functional unit and reservation station determine when an instruction can begin execution).
Common data bus/result bus/etc.: carries results past the reservation stations (where they are captured) and back to the register file. Sometimes multiple buses are used.
Hardware Speculation
Execute (completely) the instructions that are predicted to occur after a branch without knowing the branch outcome. Speculate that the instructions are to be executed.
Instruction commit: the point at which the results (operand writes/exceptions) of the speculated instruction are made permanent.
The key to speculation is to allow instructions to execute out of order but force them to commit in order. This is generally achieved by a reorder buffer (ROB), which holds completed instructions and retires them in order. Thus the ROB holds/buffers register/memory operands that are also used by functional units as source operands.
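The "execute out of order, commit in order" rule can be sketched with a queue. The class below is a deliberately small model: it tracks only register results, omits branch misprediction recovery and exceptions, and its method names (`issue`, `complete`, `commit`) are assumptions made for this example.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: results arrive in any order, registers update in program order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.registers = {}     # committed (architectural) register state

    def issue(self, tag, dest):
        """Allocate an entry at the tail, in program order."""
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})

    def complete(self, tag, value):
        """Record a result; this may happen out of program order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(tag)

    def commit(self):
        """Retire finished instructions from the head only; stop at the first unfinished one."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed


rob = ReorderBuffer()
rob.issue("i0", "r1")
rob.issue("i1", "r2")
rob.issue("i2", "r3")
rob.complete("i2", 30)  # the youngest instruction finishes first
print(rob.commit())     # [] because i0 has not finished
rob.complete("i0", 10)
print(rob.commit())     # ['i0']
rob.complete("i1", 20)
print(rob.commit())     # ['i1', 'i2']
```

Until an entry reaches the head and commits, its value lives only in the buffer, which is why a mispredicted path can be discarded without touching the register state.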
Multiple-Issue Processors
Attempting to reduce the CPI below one.
1. Statically scheduled superscalar processors
2. VLIW (very long instruction word) processors
3. Dynamically scheduled superscalar processors
Superscalar: scheduling multiple instructions for execution at the same time.
Advanced Instruction Delivery & Speculation
- Branch-target buffer or branch-target cache: predicts the destination PC address for a branch instruction; can also store the target instruction (instead of, or in addition to, the destination address). Branch folding: overwrite the branch instruction with the destination instruction.
- Return address prediction.
- Integrated instruction fetch units. Autonomously executing units to fetch instructions.
- Comments on speculation: how much? through multiple branches? energy costs?
- Value prediction: not yet feasible; not sure why it’s covered.
Branch-Target Buffer
[Figure: branch-target buffer. The PC of the instruction to fetch is looked up against the PCs stored in the buffer (one per entry), each paired with a predicted PC and, optionally, whether the branch is predicted taken or untaken. No match: the instruction is not predicted to be a branch; proceed normally. Match: the instruction is a branch and the predicted PC should be used as the next PC.]
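The lookup described in the figure can be sketched as a small direct-mapped table. Several details here are assumptions for illustration only: 4-byte instructions, a table of 16 entries, and the policy of keeping only taken branches in the buffer.

```python
class BranchTargetBuffer:
    """Toy direct-mapped branch-target buffer."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries  # each entry: (branch_pc, predicted_target)

    def _index(self, pc):
        return (pc // 4) % self.num_entries  # assumes 4-byte instructions

    def predict_next_pc(self, pc):
        """Look up the fetch PC; a match means 'branch, use the predicted PC'."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return pc + 4  # no match: not predicted to be a branch, fetch sequentially

    def update(self, pc, taken, target):
        """After the branch resolves, insert taken branches and drop untaken ones."""
        index = self._index(pc)
        if taken:
            self.entries[index] = (pc, target)
        elif self.entries[index] is not None and self.entries[index][0] == pc:
            self.entries[index] = None


btb = BranchTargetBuffer()
print(hex(btb.predict_next_pc(0x100)))  # 0x104: nothing known about this PC yet
btb.update(0x100, taken=True, target=0x200)
print(hex(btb.predict_next_pc(0x100)))  # 0x200: predicted target is used as the next PC
```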
End section 3.4-3.9
Lecture break; continue section 3.10-3.16 next class
The Limits of ILP
How much ILP is available? Is there a limit?
Consider the ideal processor:
1. Infinite number of registers for register renaming.
2. Perfect branch prediction
3. Perfect jump prediction
4. Perfect memory address analysis
5. All memory accesses occur in one cycle
These assumptions effectively remove all control dependencies and all but true data dependencies. That is, all instructions can be scheduled as early as their data dependencies allow.
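Under these ideal assumptions, the available ILP of a code fragment is just its instruction count divided by the length of its longest chain of true data dependencies. A minimal sketch, assuming unit latency for every instruction, a non-empty program, and a hypothetical `(dest, sources)` encoding:

```python
def ideal_ilp(program):
    """Schedule each instruction as early as its true data dependencies allow.

    program: non-empty list of (dest, sources) in program order; every
    instruction takes one cycle. Returns (cycles, instructions_per_cycle).
    """
    ready = {}  # register -> cycle at which its latest value becomes available
    total_cycles = 0
    for dest, sources in program:
        # Values never written in this fragment are available at cycle 0.
        start = max((ready.get(s, 0) for s in sources), default=0)
        finish = start + 1
        # Overwriting the entry models perfect renaming: later readers see the
        # newest value, and WAR/WAW conflicts never delay anything.
        ready[dest] = finish
        total_cycles = max(total_cycles, finish)
    return total_cycles, len(program) / total_cycles


program = [
    ("r1", ["r0"]),
    ("r2", ["r0"]),
    ("r3", ["r1", "r2"]),
    ("r4", ["r0"]),
    ("r5", ["r3", "r4"]),
    ("r6", ["r0"]),
]
print(ideal_ilp(program))  # (3, 2.0): six instructions, longest dependence chain of three
```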
Unlimited Issue/Single Cycle Execution
[Figure: instruction issues per cycle (x-axis, 0–160) for the SPEC benchmarks: gcc 55, espresso 63, li 18, fpppp 75, doduc 119, tomcatv 150.]
The Power7 currently issues up to six instructions per clock.
Limited Issue Window Size/Single Cycle Execution
- Up to 64 instructions issued/cycle
- Tournament predictor; best available in 2011 (not a primary bottleneck)
- Perfect memory address analysis
- 64 integer and 64 FP extra registers available for renaming
- Single cycle execution
[Figure: instruction issues per cycle (x-axis, 0–60) for the benchmarks gcc, espresso, li, fpppp, doduc, and tomcatv, with one bar per window size: infinite, 256, 128, 64, and 32.]
Hardware/Software Speculation
Professor rambles on about this topic for a few moments.
Multithreading
- Fine-grained multithreading: switch threads at each clock cycle. Examples: Sun Niagara processor (8 threads), Nvidia GPUs.
- Coarse-grained multithreading: switch threads at major stalls such as L2/L3 misses. Examples: no commercial ventures.
- Simultaneous multithreading: process/dispatch multiple threads simultaneously to the common functional units. Examples: Intel (Hyper-Threading, two threads), IBM Power7 (four threads).
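The difference between the first two policies is only in when the processor switches threads. The sketch below shows the resulting issue order for each policy. It is a toy: one instruction issues per cycle, the string "miss" is a made-up marker for an instruction that causes a major stall, and stall durations are not modeled.

```python
def fine_grained(threads):
    """Switch threads every cycle, round-robin over threads that still have work."""
    positions = [0] * len(threads)
    remaining = sum(len(t) for t in threads)
    order = []
    current = 0
    while remaining:
        if positions[current] < len(threads[current]):
            order.append((current, threads[current][positions[current]]))
            positions[current] += 1
            remaining -= 1
        current = (current + 1) % len(threads)  # always move on to the next thread
    return order


def coarse_grained(threads):
    """Stay on one thread until it hits a 'miss' (or runs out of work), then switch."""
    positions = [0] * len(threads)
    remaining = sum(len(t) for t in threads)
    order = []
    current = 0
    while remaining:
        if positions[current] < len(threads[current]):
            instr = threads[current][positions[current]]
            order.append((current, instr))
            positions[current] += 1
            remaining -= 1
            if instr != "miss" and positions[current] < len(threads[current]):
                continue  # no major stall: keep issuing from the same thread
        current = (current + 1) % len(threads)
    return order


threads = [["a0", "miss", "a2"], ["b0", "b1"]]
print(fine_grained(threads))
print(coarse_grained(threads))
```

Simultaneous multithreading is not modeled here, since it needs a multiple-issue machine in which instructions from several threads share the issue slots of a single cycle.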