TRANSCRIPT
Pipeline Complications CS510 Computer Architectures Lecture 8 - 1
Lecture 8
Advanced Pipeline
Interrupts
• Interrupts: 5 instructions executing in a 5-stage pipeline
– How to stop the pipeline?
– How to restart the pipeline?
– Who caused the interrupt?
Stage   Exceptional conditions
IF      Page fault on instruction fetch; unaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; unaligned memory access; memory-protection violation
Simultaneous Exceptions in More Than One Pipe Stage
• Simultaneous exceptions in more than one pipeline stage, e.g. LD followed by ADD
– LD with data page (DM) fault in MEM stage
– ADD with instruction page (IM) fault in IF stage
– The ADD fault will happen BEFORE the LD fault, even though LD comes first in program order
• Solution #1
– Interrupt status vector per instruction
– Defer the check until the last stage, and kill the machine state update if there is an exception
– Delays updating the machine state until late in the pipeline, possibly at the completion of an instruction!
• Solution #2
– Interrupt ASAP
– Restart everything that is incomplete
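Solution #1 can be sketched as a small simulation. This is only an illustration under simplifying assumptions: a 5-stage pipeline with no stalls, one instruction entering IF per cycle, and hypothetical names (`first_exception_in_program_order`, `faults`). It shows that the ADD fault is detected first in time, while the fault actually taken at the last stage is the LD fault.

```python
# Sketch of Solution #1: every instruction carries an exception status vector
# down the pipeline, and the vector is only examined in the last stage (WB).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def first_exception_in_program_order(program, faults):
    """program: instruction names in program order.
    faults: dict mapping (instruction, stage) -> exception description.
    Returns (detected, handled): faults in the order they occur in time,
    and the one fault that is taken at the commit point."""
    status = {instr: [] for instr in program}    # per-instruction status vector
    detected = []
    handled = None
    last_cycle = len(program) + len(STAGES) - 1
    for cycle in range(last_cycle):
        for index, instr in enumerate(program):
            stage_index = cycle - index           # instruction i enters IF in cycle i
            if not 0 <= stage_index < len(STAGES):
                continue
            stage = STAGES[stage_index]
            if (instr, stage) in faults:
                # Record the fault, but do not act on it yet.
                status[instr].append(faults[(instr, stage)])
                detected.append((cycle, instr, stage))
            if stage == "WB" and status[instr] and handled is None:
                # Deferred check: the oldest faulting instruction wins.
                handled = (instr, status[instr][0])
    return detected, handled

detected, handled = first_exception_in_program_order(
    ["LD", "ADD"],
    {("LD", "MEM"): "data page fault", ("ADD", "IF"): "instruction page fault"},
)
print(detected)   # faults in time order
print(handled)    # fault taken in program order
```

A real pipeline would also flush the younger instructions once the fault is taken; the sketch only records which fault wins.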
Simultaneous Exceptions: Complex Addressing Modes and Instructions
• Address modes: auto-increment causes a register change during instruction execution; the register write happens in the EX stage instead of the WB stage
– Interrupts? Need to restore register state
– Adds WAR and WAW hazards, since a register is written in EX, no longer only in WB
• Memory-memory move instructions
– Must be able to handle multiple page faults
– Long-lived instructions: partial state save on interrupt
• Condition codes
Extending the DLX to Handle Multi-cycle Operations
[Figure: IF and ID feed four parallel execution units (EX integer unit, EX FP/integer multiply, EX FP adder, EX FP/integer divider), which rejoin at MEM and WB.]
Multicycle Operations
[Figure: IF and ID, then one of four paths: EX (integer unit); M1 M2 M3 M4 M5 M6 M7 (FP/integer multiply); A1 A2 A3 A4 (FP adder); DIV (FP/integer divider); then MEM and WB.]
Latency and Initiation Interval
Latency: number of intervening cycles between an instruction that produces a result and an instruction that uses the result
Initiation interval: number of cycles that must elapse between issuing two operations of a given type
Functional unit   Latency   Initiation interval
Integer ALU       0         1
Load              1         1
FP add            3         1
FP mul            6         1
FP div            14        15
Example
MULTD   IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
ADDD    IF ID A1 A2 A3 A4 MEM WB
LD*     IF ID EX MEM WB
SD*     IF ID EX MEM WB
(The original figure marks the stage where data is needed and the stage where the result is available.)
* FP LD and ST are the same as integer LD and ST, because there is a 64-bit path to memory.
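The two definitions can be turned into a short calculation. The sketch below copies the table values from this slide; the function and key names are hypothetical, and it assumes the issue cycle of the earlier instruction is known and that nothing else stalls.

```python
# (latency, initiation interval) per functional unit, copied from the slide.
UNITS = {
    "int_alu": (0, 1),
    "load":    (1, 1),
    "fp_add":  (3, 1),
    "fp_mul":  (6, 1),
    "fp_div":  (14, 15),
}

def earliest_dependent_issue(producer_unit, producer_issue):
    """First cycle in which an instruction using the result can issue.
    Latency counts the intervening cycles, so a latency of 0 means the
    very next cycle."""
    latency, _ = UNITS[producer_unit]
    return producer_issue + latency + 1

def earliest_same_unit_issue(unit, previous_issue):
    """First cycle in which another operation of the same type can issue."""
    _, interval = UNITS[unit]
    return previous_issue + interval

print(earliest_dependent_issue("fp_mul", 0))   # consumer of a multiply issued in cycle 0
print(earliest_same_unit_issue("fp_div", 0))   # second divide after one issued in cycle 0
```

With these numbers only the divider has an initiation interval greater than 1, so back-to-back divides are the only same-type operations that wait on each other.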
Floating Point Operations
Floating point: long execution time. Also, a pipelined FP execution unit may initiate new instructions without waiting the full latency.
Reality: MIPS R4000
FP instruction   Latency (cycles before using result)   Initiation interval (cycles before issuing instr of the same type)
Add, Subtract    4                                      3
Multiply         8                                      4
Divide           36                                     35
Square root      112                                    111
Negate           2                                      1
Absolute value   2                                      1
FP compare       3                                      2
Complications Due to FP Operations
• Divide and Square Root take 10X to 30X longer than Add
– Exceptions?
– Adds WAR and WAW hazards, since the pipelines are no longer the same length
Summary of Pipelining Basics
• Hazards limit performance
– Structural: need more HW resources
– Data: need forwarding, compiler scheduling
– Control: early evaluation of PC, delayed branch, prediction
• Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency
• Interrupts, FP, and instruction set complications make pipelining harder
• Compilers reduce cost of data and control hazards
– Load delay slots
– Branch delay slots
– Branch prediction
Case Study: MIPS R4000 and Introduction to Advanced Pipelining
Case Study: MIPS R4000 (100 MHz to 200 MHz)
8-stage pipeline:
IF   First half of instruction fetch: PC selection, initiation of instruction cache access
IS   Second half of instruction fetch: access to instruction cache
RF   Instruction decode, register fetch, hazard checking; also instruction cache hit detection (tag check)
EX   Execution: effective address calculation, ALU operation, branch target computation and condition evaluation
DF   First half of access to data cache
DS   Second half of access to data cache
TC   Tag check for data cache hit
WB   Write back for loads and register-register operations
• Cache miss exception
– 10s of cycles delay
• What is the impact on load delay? Why?
The Pipeline Structure of the R4000
[Figure: the eight stages IF IS RF EX DF DS TC WB laid over the datapath (instruction memory, register file, ALU, data memory, register file), with markers for "instruction is available", "load data available", and "tag check".]
Case Study: MIPS R4000 LOAD Latency
2-cycle load latency
[Figure: pipeline diagram of LD R1, X followed by ADD R3, R1, R2 through the stages IF IS RF EX DF DS TC WB, marking where the load data becomes available with forwarding and where the ADD needs it in EX: 2 stall cycles.]
Case Study: MIPS R4000 LOAD Followed by ALU Instructions
2-cycle load latency with forwarding circuit
Clock cycle     1   2   3   4   5      6      7   8
LW  R1          IF  IS  RF  EX  DF     DS     TC  WB
ADD R2, R1          IF  IS  RF  stall  stall  EX  DF ...
SUB R3, R1              IF  IS  stall  stall  RF  EX ...
OR  R4, R1                  IF  stall  stall  IS  RF ...
Forwarding: the loaded value is forwarded to the EX stage of ADD in cycle 7.
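The stall count in this diagram follows from the stage positions alone. The sketch below assumes, as the diagram shows, that the loaded value can be forwarded into an EX stage in the cycle after the load's DS stage, and it looks at one dependent instruction in isolation; `load_use_stalls` is a hypothetical helper name.

```python
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def load_use_stalls(distance):
    """Stall cycles for an ALU instruction that uses a load result and is
    fetched `distance` instructions after the load (1 = the next one).
    The load is assumed to enter IF in cycle 0."""
    if distance < 1:
        raise ValueError("the consumer must follow the load")
    # First cycle in which the forwarded value can feed an EX stage.
    data_ready = R4000_STAGES.index("DS") + 1
    # Cycle in which the consumer would reach EX without any stall.
    wants_ex = distance + R4000_STAGES.index("EX")
    return max(0, data_ready - wants_ex)

for distance in (1, 2, 3):
    print(distance, load_use_stalls(distance))
```

Under these assumptions a consumer directly after the load stalls 2 cycles, one slot later it stalls 1 cycle, and two independent instructions in between hide the load delay completely.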
Case Study: MIPS R4000 Branch Latency
R4000 uses the predict NOT TAKEN strategy (conditions evaluated during the EX phase)
– NOT TAKEN: one-cycle delay slot
– TAKEN: one-cycle delay slot followed by two stalls, a 3-cycle latency
NOT TAKEN
Clock cycle       1   2   3   4   5   6   7   8
Br                IF  IS  RF  EX  DF  DS  TC  WB
Delay slot            IF  IS  RF  EX  DF  DS  TC ...
Br instr + 2              IF  IS  RF  EX  DF  DS ...
Br instr + 3                  IF  IS  RF  EX  DF ...
Br instr + 4                      IF  IS  RF  EX ...
TAKEN: delay slot plus 2 stall cycles
Clock cycle       1   2   3      4      5   6   7   8
Br                IF  IS  RF     EX     DF  DS  TC  WB
Delay slot            IF  IS     RF     EX  DF  DS  TC ...
Stall                     stall
Stall                            stall
Br target instr                         IF  IS  RF  EX ...
Branch target address is available after the EX stage
Extending DLX to Handle Floating Point Operations
[Figure: IF and ID feed four parallel units, the integer unit (EX), the FP/integer multiplier, the FP adder, and the FP divider, which rejoin at MEM and WB.]
MIPS R4000 FP Unit
• FP Adder, FP Multiplier, FP Divider
• Last step of FP Multiplier/Divider uses FP Adder HW
• 8 kinds of stages in FP units:
Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers
MIPS R4000 FP Pipe Stages
FP instr         1   2     3     4     5   6   7     8   ...
Add, Subtract    U   S+A   A+R   R+S
Multiply         U   E+M   M     M     M   N   N+A   R
Divide           U   A     D (x28) ... D+A, D+R, D+A, D+R, A, R
Square root      U   E     (A+R) (x108) ... A   R
Negate           U   S
Absolute value   U   S
FP compare       U   A     R
Stages:
M  First stage of multiplier    N  Second stage of multiplier
R  Rounding stage               A  Mantissa ADD stage
S  Operand shift stage          D  Divide pipeline stage
U  Unpack FP numbers            E  Exception test stage
MIPS R4000 FP Pipe Stages
Each Add row is an alternative issue time relative to the Multiply issued in cycle 0.
                          clock cycle
Operation  Issue/stall    0   1    2    3    4    5    6    7    8    9    10   11  12
Multiply   Issue          U   M    M    M    M    N    N+A  R
Add        Issue              U    S+A  A+R  R+S
Add        Issue                   U    S+A  A+R  R+S
Add        Issue                        U    S+A  A+R  R+S
Add        Stall                             U    S+A  A+R  R+S
Add        Stall                                  U    S+A  A+R  R+S
Add        Issue                                       U    S+A  A+R  R+S
Add        Issue                                            U    S+A  A+R  R+S
The conflict is on the adder stage A, which the Multiply uses in its N+A stage.
ADD issued 4 cycles after Multiply will stall 2 cycles.
ADD issued 5 cycles after Multiply will stall 1 cycle.
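The stall pattern on this slide can be reproduced by checking stage usage cycle by cycle. The sketch below is a simplified model, not the R4000's actual interlock logic: it takes the stage sequences as printed on this slide, treats every stage letter as one shared resource, and considers a single Add against a single Multiply. The names `resources` and `stalls_for_add` are hypothetical.

```python
# Stage sequences as they appear in the timing diagram on this slide.
MULTIPLY = ["U", "M", "M", "M", "M", "N", "N+A", "R"]
ADD = ["U", "S+A", "A+R", "R+S"]

def resources(pattern, issue_cycle):
    """Map each clock cycle to the set of stage resources the operation holds."""
    return {issue_cycle + offset: set(stage.split("+"))
            for offset, stage in enumerate(pattern)}

def stalls_for_add(issue_cycle, multiply_issue=0):
    """Number of cycles an Add must be delayed to avoid sharing a stage
    resource with the Multiply in any cycle."""
    busy = resources(MULTIPLY, multiply_issue)
    stalls = 0
    while True:
        wanted = resources(ADD, issue_cycle + stalls)
        conflict = any(units & busy.get(cycle, set())
                       for cycle, units in wanted.items())
        if not conflict:
            return stalls
        stalls += 1

for n in range(1, 8):
    print(n, stalls_for_add(n))
```

The loop always terminates because an Add issued after the Multiply has left the pipeline cannot conflict with it.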
R4000 Performance
Not an ideal pipeline CPI of 1:
– Load stalls: (1 or 2 clock cycles)
– Branch stalls: (2 cycles for taken br. + unfilled branch slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: Not enough FP hardware (parallelism)
[Figure: stacked bar chart of pipeline CPI (axis 0 to 4.5) per benchmark, split into base, load stalls, branch stalls, FP result stalls, and FP structural stalls. Integer programs: eqntott, espresso, gcc, li. Floating point programs: doduc, nasa7, ora, spice2g6, su2cor, tomcatv.]
Advanced Pipeline and Instruction Level Parallelism
Advanced Pipelining and Instruction Level Parallelism
• gcc: 17% control transfer
– 5 instructions + 1 branch
– Must go beyond a single block to get more instruction level parallelism
• Loop level parallelism is one opportunity, SW and HW
[Figure: a block of code, running from a branch target (or any instruction following a branch instruction) up to the next branch instruction.]
Advanced Pipelining and Instruction Level Parallelism
Technique                                   Reduces
Loop unrolling                              Control stalls
Basic pipeline scheduling                   RAW stalls
Dynamic scheduling with scoreboarding       RAW stalls
Dynamic scheduling with register renaming   WAR and WAW stalls
Dynamic branch prediction                   Control stalls
Issuing multiple instructions per cycle     Ideal CPI
Compiler dependence analysis                Ideal CPI and data stalls
Software pipelining and trace scheduling    Ideal CPI and data stalls
Speculation                                 All data and control stalls
Dynamic memory disambiguation               RAW stalls involving memory
Basic Pipeline Scheduling and Loop Unrolling
FP unit latencies
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double*                   FP ALU op                  1
Load double*                   Store double               0
* Same as integer load, since there is a 64-bit data path from/to memory.
Fully pipelined or replicated: no structural hazards, issue on every clock cycle
for (i = 1; i <= 1000; i++)
    x[i] = x[i] + s;
FP Loop Hazards
Loop: LD    F0,0(R1)    ;R1 is the pointer to a vector
      ADDD  F4,F0,F2    ;F2 contains a scalar value
      SD    0(R1),F4    ;store back result
      SUBI  R1,R1,8     ;decrement pointer 8B (DW)
      BNEZ  R1,Loop     ;branch R1!=zero
      NOP               ;delayed branch slot
Where are the stalls?
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0
FP Loop Showing Stalls
 1 Loop: LD    F0,0(R1)    ;F0=vector element
 2       stall
 3       ADDD  F4,F0,F2    ;add scalar in F2
 4       stall
 5       stall
 6       SD    0(R1),F4    ;store result
 7       SUBI  R1,R1,8     ;decrement pointer 8B (DW)
 8       stall
 9       BNEZ  R1,Loop     ;branch R1!=zero
10       stall             ;delayed branch slot
Rewrite code to minimize stalls?
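The cycle numbers in this listing can be recomputed from the latency table with a simple in-order issue model. The sketch below is an illustration with hypothetical names; the one-cycle latency from an integer op to a dependent branch is an assumption inferred from the stall shown before BNEZ, since the table itself does not list it.

```python
# Intervening stall cycles between a producer kind and a consumer kind.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "branch"): 1,   # assumption, inferred from the stall before BNEZ
}

def issue_cycles(program, latency=LATENCY):
    """program: list of (text, kind, dest, sources) in program order.
    Issues at most one instruction per cycle, in order, and delays an
    instruction until its producers' latencies have elapsed."""
    last_writer = {}   # register -> (issue cycle, kind) of its latest producer
    schedule = []
    cycle = 0
    for text, kind, dest, sources in program:
        cycle += 1
        for reg in sources:
            if reg in last_writer:
                produced_at, producer_kind = last_writer[reg]
                wait = latency.get((producer_kind, kind), 0)
                cycle = max(cycle, produced_at + wait + 1)
        if dest is not None:
            last_writer[dest] = (cycle, kind)
        schedule.append((cycle, text))
    return schedule

LOOP = [
    ("LD F0,0(R1)",   "load",   "F0", ["R1"]),
    ("ADDD F4,F0,F2", "fpalu",  "F4", ["F0", "F2"]),
    ("SD 0(R1),F4",   "store",  None, ["R1", "F4"]),
    ("SUBI R1,R1,8",  "int",    "R1", ["R1"]),
    ("BNEZ R1,Loop",  "branch", None, ["R1"]),
    ("NOP",           "nop",    None, []),     # delayed branch slot
]

for cycle, text in issue_cycles(LOOP):
    print(cycle, text)
```

The model only tracks register dependences within one iteration, which is enough to reproduce the 10-cycle count of the listing.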
Reducing Stalls
 1 Loop: LD    F0,0(R1)
 2       stall
 3       ADDD  F4,F0,F2
 4       stall
 5       stall
 6       SD    0(R1),F4
 7       SUBI  R1,R1,#8
 8       stall
 9       BNEZ  R1,Loop
10       stall
For Load-ALU latency (stall in cycle 2): consider moving SUBI into this load delay slot. LD reads R1 before SUBI writes R1, so yes, we can. When we do this, we need to change the immediate value in SD from 0 to 8.
For ALU-ALU latency (stalls in cycles 4 and 5): there is only one instruction left to move, i.e., BNEZ. When we do that, the SD instruction fills the delayed branch slot.
Revised FP Loop to Minimize Stalls
1 Loop: LD F0,0(R1)
2 SUBI R1,R1,#8
3 ADDD F4,F0,F2
4 stall
5 BNEZ R1,Loop ;delayed branch
6 SD 8(R1),F4 ;altered when move past SUBI
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Unroll loop 4 times to make the code faster
Unroll Loop 4 Times
 1 Loop: LD    F0,0(R1)
 2       ADDD  F4,F0,F2
 3       SD    0(R1),F4       ;drop SUBI & BNEZ
 4       LD    F6,-8(R1)
 5       ADDD  F8,F6,F2
 6       SD    -8(R1),F8      ;drop SUBI & BNEZ
 7       LD    F10,-16(R1)
 8       ADDD  F12,F10,F2
 9       SD    -16(R1),F12    ;drop SUBI & BNEZ
10       LD    F14,-24(R1)
11       ADDD  F16,F14,F2
12       SD    -24(R1),F16
13       SUBI  R1,R1,#32      ;alter to 4*8
14       BNEZ  R1,Loop
15       NOP
15 + 4 x (1* + 2+) + 1^ = 28 clock cycles, or 7 per iteration
1*: LD to ADDD stall, 1 cycle
2+: ADDD to SD stall, 2 cycles
1^: data dependency on R1
Rewrite loop to minimize the stalls
Unrolled Loop to Minimize Stalls
 1 Loop: LD    F0,0(R1)
 2       LD    F6,-8(R1)
 3       LD    F10,-16(R1)
 4       LD    F14,-24(R1)
 5       ADDD  F4,F0,F2
 6       ADDD  F8,F6,F2
 7       ADDD  F12,F10,F2
 8       ADDD  F16,F14,F2
 9       SD    0(R1),F4
10       SD    -8(R1),F8
11       SUBI  R1,R1,#32
12       SD    16(R1),F12     ; 16-32 = -16
13       BNEZ  R1,Loop
14       SD    8(R1),F16      ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Assumptions
- OK to move SD past SUBI even though SUBI changes R1 (the SD offset is adjusted to compensate)
      SUBI  IF ID EX MEM WB
      SD       IF ID EX MEM WB
      BNEZ        IF ID EX MEM WB
- OK to move loads before stores (still get the right data)
- When is it safe for the compiler to make such changes?
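Whether the reordering preserves the result can be checked by running both versions on the same data. The sketch below is a behavioural model only: memory is a Python dict keyed by byte address, R1 starts at the address of the last element, and the element count is assumed to be a multiple of 4 (as with the 1000-element loop). The function names are hypothetical.

```python
def original_loop(memory, r1, s):
    """One element per iteration: x = x + s, then step the pointer back 8 bytes."""
    memory = dict(memory)
    while r1 != 0:
        memory[r1] = memory[r1] + s        # LD, ADDD, SD 0(R1)
        r1 -= 8                            # SUBI R1,R1,#8
    return memory

def unrolled_scheduled_loop(memory, r1, s):
    """Four elements per iteration, in the instruction order of the slide."""
    memory = dict(memory)
    while r1 != 0:
        f0, f6 = memory[r1], memory[r1 - 8]             # loads hoisted to the top
        f10, f14 = memory[r1 - 16], memory[r1 - 24]
        f4, f8, f12, f16 = f0 + s, f6 + s, f10 + s, f14 + s
        memory[r1] = f4                                 # SD 0(R1)
        memory[r1 - 8] = f8                             # SD -8(R1)
        r1 -= 32                                        # SUBI R1,R1,#32
        memory[r1 + 16] = f12                           # SD 16(R1): 16 - 32 = -16
        memory[r1 + 8] = f16                            # SD 8(R1):   8 - 32 = -24
    return memory

elements = 8
x = {8 * i: float(i) for i in range(1, elements + 1)}
assert original_loop(x, 8 * elements, 2.5) == unrolled_scheduled_loop(x, 8 * elements, 2.5)
```

The two stores that follow the pointer update use offsets 16 and 8 so that they still address the third and fourth elements of the group, which is the adjustment noted in the comments of the listing.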
Compiler Perspectives on Code Movement
• Definitions: the compiler is concerned about dependencies in the program; whether a dependence causes a HW hazard depends on the given pipeline
• Data dependencies (RAW if a hazard for HW): instruction j is data dependent on instruction i if either
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
• Easy to determine for registers (fixed names)
• Hard for memory:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
Compiler Perspectives on Code Movement
• Name dependence: two instructions use the same name (register or memory location) but they do not exchange data
• Two kinds of name dependence (instruction i precedes instruction j)
– Antidependence (WAR if a hazard for HW)
• Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first
– Output dependence (WAW if a hazard for HW)
• Instruction i and instruction j write the same register or memory location; ordering between the instructions must be preserved.
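For registers, the three kinds of dependence can be detected by comparing destination and source names. The sketch below is a minimal illustration with a hypothetical `classify_dependences` helper; it checks only direct dependences between two instructions and ignores memory locations and transitive chains.

```python
def classify_dependences(first, second):
    """first, second: (destination, sources) pairs of register names,
    with `first` earlier in program order. A destination of None means
    the instruction writes no register (e.g. a store)."""
    dest_i, src_i = first
    dest_j, src_j = second
    found = []
    if dest_i is not None and dest_i in src_j:
        found.append("data (RAW)")       # j reads what i produces
    if dest_j is not None and dest_j in src_i:
        found.append("anti (WAR)")       # j overwrites what i reads
    if dest_i is not None and dest_i == dest_j:
        found.append("output (WAW)")     # both write the same name
    return found

addd = ("F4", ("F0", "F2"))      # ADDD F4,F0,F2
sd = (None, ("R1", "F4"))        # SD 0(R1),F4
ld = ("F0", ("R1",))             # LD F0,0(R1)
subi = ("R1", ("R1",))           # SUBI R1,R1,8

print(classify_dependences(addd, sd))    # ADDD then SD
print(classify_dependences(ld, subi))    # LD then SUBI
```

The LD/SUBI pair is the antidependence that the rescheduled loop has to respect: SUBI may move up next to LD, but not above it.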
Pipeline Complications CS510 Computer Architectures Lecture 8 - 36
• Again Hard for Memory Accesses
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?• Our example required compiler to know that if R1 doesn’t change
then:
0(R1) ¹ -8(R1) ¹ -16(R1) ¹ -24(R1) 1
There were no dependencies between some loads and stores, so they could be moved by each other.
Compiler Perspectives Compiler Perspectives on Code Movementon Code Movement
Compiler Perspectives Compiler Perspectives on Code Movementon Code Movement
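The memory questions on this slide amount to a conservative overlap test. The sketch below is an assumption-laden illustration: `may_overlap` is a hypothetical helper, accesses are 8 bytes wide (double words), and `same_base` must be passed as False whenever the two accesses use different base registers or the base register may have changed between them.

```python
def may_overlap(offset_a, offset_b, same_base, width=8):
    """Conservative answer to "can these two accesses touch the same bytes?"
    for accesses of the form offset_a(Rx) and offset_b(Ry)."""
    if not same_base:
        # Register values are unknown at compile time: assume a possible alias.
        return True
    # Same unchanged base register: only the offsets decide.
    return abs(offset_a - offset_b) < width

print(may_overlap(0, -8, same_base=True))      # 0(R1) vs -8(R1)
print(may_overlap(100, 20, same_base=False))   # 100(R4) vs 20(R6)
```

A compiler that can prove more about the register values, for example through loop induction analysis, can answer False in more cases than this sketch does.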
Compiler Perspectives on Code Movement
• Control Dependence
• Example
if p1 {S1;};
if p2 {S2;}
S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.
Compiler Perspectives on Code Movement
• Two (obvious) constraints on control dependencies:
– An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
– An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch.
• Control dependencies may be relaxed in some systems to get parallelism; the effect is the same as long as the order of exceptions and the data flow are preserved
When Safe to Unroll Loop?
• Example: when a loop is unrolled, where are the data dependencies? (A, B, C distinct, non-overlapping)
for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}
1. S2 uses the value A[i+1], computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
This is a loop-carried dependence between iterations
• Implies that iterations are dependent, and can't be executed in parallel
• Not the case for our example; each iteration was distinct
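One way to see a loop-carried dependence is to run the iterations in a different order and compare results. The sketch below uses small arrays and hypothetical function names; reversing the iteration order stands in for "iterations executed independently of each other".

```python
def run_recurrence(order, a, c):
    """S1 from the slide, A[i+1] = A[i] + C[i], executed in the given order."""
    a = list(a)
    for i in order:
        a[i + 1] = a[i] + c[i]
    return a

def run_independent(order, x, s):
    """The earlier example, x[i] = x[i] + s, executed in the given order."""
    x = list(x)
    for i in order:
        x[i] = x[i] + s
    return x

forward = [1, 2, 3, 4]
backward = list(reversed(forward))

a = [0, 1, 0, 0, 0, 0]
c = [0, 1, 1, 1, 1, 1]
print(run_recurrence(forward, a, c))    # each iteration sees the previous result
print(run_recurrence(backward, a, c))   # differs: the dependence was violated

x = [0, 1, 2, 3, 4]
print(run_independent(forward, x, 10) == run_independent(backward, x, 10))
```

The recurrence gives different answers in the two orders, while the independent loop does not, which is why only the latter can have its iterations overlapped freely.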
When Safe to Unroll Loop?
• Example: where are the data dependencies? (A, B, C, D distinct and non-overlapping)
The following loop looks like it has a loop-carried dependence, because S1 reads B[i], which S2 wrote in the previous iteration:
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
However, we can rewrite it as follows so that it is free of loop-carried dependences:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
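The equivalence of the two versions can be checked by running both on the same input. The sketch below transcribes the two loops into Python with 1-based indexing kept by leaving index 0 unused; the test data and function names are arbitrary choices for illustration.

```python
def original(a, b, c, d, n=100):
    a, b = list(a), list(b)
    for i in range(1, n + 1):
        a[i] = a[i] + b[i]            # S1: reads B[i] written by the previous iteration
        b[i + 1] = c[i] + d[i]        # S2
    return a, b

def rewritten(a, b, c, d, n=100):
    a, b = list(a), list(b)
    a[1] = a[1] + b[1]                # peeled first S1
    for i in range(1, n):
        b[i + 1] = c[i] + d[i]        # S2 of iteration i
        a[i + 1] = a[i + 1] + b[i + 1]    # S1 of iteration i+1, same loop body
    b[n + 1] = c[n] + d[n]            # peeled last S2
    return a, b

size = 102                            # indices 0..101, index 0 unused
a = [i for i in range(size)]
b = [2 * i for i in range(size)]
c = [3 * i + 1 for i in range(size)]
d = [5 * i + 2 for i in range(size)]
print(original(a, b, c, d) == rewritten(a, b, c, d))
```

In the rewritten body the value of B[i+1] is produced and consumed inside the same iteration, so no value crosses an iteration boundary.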
Summary
• Instruction Level Parallelism in SW or HW
• Loop level parallelism is easiest to see
• SW dependencies are defined for a program; they become hazards if the HW cannot resolve them
• SW dependencies/compiler sophistication determine if compiler can unroll loops