Advanced Computer Architecture 5MD00 / 5Z033
Exploiting ILP with SW approaches
Henk Corporaal
www.ics.ele.tue.nl/~heco
TU Eindhoven, December 2010


Page 1: Advanced Computer Architecture 5MD00 / 5Z033 Exploiting ILP with SW approaches

Page 2:

04/22/23 ACA H.Corporaal 2

Topics
• Static branch prediction and speculation
• Basic compiler techniques
• Multiple issue architectures
• Advanced compiler support techniques
  – Loop-level parallelism
  – Software pipelining
• Hardware support for compile-time scheduling

Page 3:

Previously we discussed dynamic branch prediction.

This does not help the compiler!

We need static branch prediction.

Page 4:

Static Branch Prediction and Speculation
• Static branch prediction is useful for code scheduling
• Example:

        ld   r1,0(r2)
        sub  r1,r1,r3   # hazard
        beqz r1,L
        or   r4,r5,r6
        addu r10,r4,r3
    L:  addu r7,r8,r9

• If the branch is taken most of the time, and since r7 is not needed on the fall-through path, we could move addu r7,r8,r9 directly after the ld
• If the branch is not taken most of the time, and assuming r4 is not needed on the taken path, we could move or r4,r5,r6 after the ld

Page 5:

Static Branch Prediction Methods
• Always predict taken
  – Average misprediction rate for SPEC: 34% (9%–59%)
• Backward branches predicted taken, forward branches not taken
  – In SPEC, most forward branches are taken, so always predicting taken is better
• Profiling
  – Run the program and profile all branches. If a branch is taken (not taken) most of the time, it is predicted taken (not taken)
  – The behavior of a branch is often biased towards taken or not taken
  – Average misprediction rate for SPECint: 15% (11%–22%), SPECfp: 9% (5%–15%)
• Can we do better? Yes: use control flow restructuring to exploit correlation
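The profiling approach above can be sketched in a few lines: predict each branch's majority direction as observed in a profile run, then count how often that static prediction misses. The outcome traces below are made-up illustrative data, not SPEC measurements.

```python
# Sketch of profile-guided static branch prediction (illustrative data,
# not the SPEC numbers from the slide).
def profile_predict(profiles):
    """For each branch, predict its majority direction from a profile run
    and return the resulting misprediction rate on that same profile."""
    mispredicted = total = 0
    for outcomes in profiles:              # outcomes: True = taken
        taken = sum(outcomes)
        predict_taken = taken * 2 >= len(outcomes)  # majority direction
        misses = len(outcomes) - taken if predict_taken else taken
        mispredicted += misses
        total += len(outcomes)
    return mispredicted / total

# A strongly biased branch and a weakly biased one:
profiles = [
    [True] * 9 + [False],       # 90% taken -> predict taken, 10% miss
    [True] * 6 + [False] * 4,   # 60% taken -> predict taken, 40% miss
]
print(profile_predict(profiles))  # 0.25: 5 misses out of 20 executions
```

As the slide notes, the more biased a branch is, the better this works; the weakly biased second branch dominates the miss count.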

Page 6:

Static exploitation of correlation

[Figure: two control-flow graphs with blocks a, b, c, d, e, f, g. Block a ends in a branch (bez t,b,c) and block d ends in a branch (bez t,e,f). If the direction of the branch in block d is correlated with the branch in block a, control flow restructuring duplicates d into d' so that each copy can be given its own static prediction.]

Page 7:

Basic compiler techniques

• Dependencies limit ILP (Instruction-Level Parallelism)
  – We cannot always find sufficient independent operations to fill all the delay slots
  – May result in pipeline stalls
• Scheduling to avoid stalls
• Loop unrolling: create more exploitable parallelism

Page 8:

Dependencies Limit ILP: Example

C loop:

    for (i=1; i<=1000; i++)
        x[i] = x[i] + s;

MIPS assembly code (R1 = &x[1]; R2 = &x[1000]+8; F2 = s):

    Loop: L.D   F0,0(R1)     ; F0 = x[i]
          ADD.D F4,F0,F2     ; F4 = x[i]+s
          S.D   0(R1),F4     ; x[i] = F4
          ADDI  R1,R1,8      ; R1 = &x[i+1]
          BNE   R1,R2,Loop   ; branch if R1 != &x[1000]+8

Page 9:

Schedule this on a MIPS Pipeline
• FP operations are mostly multicycle
• The pipeline must be stalled if an instruction uses the result of a not-yet-finished multicycle operation
• We'll assume the following latencies

    Producing instruction   Consuming instruction   Latency (clock cycles)
    FP ALU op               FP ALU op               3
    FP ALU op               Store double            2
    Load double             FP ALU op               1
    Load double             Store double            0

Page 10:

Where to Insert Stalls?
• How would this loop be executed on the MIPS FP pipeline?

    Loop: L.D   F0,0(R1)
          ADD.D F4,F0,F2
          S.D   F4,0(R1)
          ADDI  R1,R1,8
          BNE   R1,R2,Loop

• Which true (flow) dependences are there? Note the inter-iteration dependence (through R1).

Page 11:

Where to Insert Stalls
• How would this loop be executed on the MIPS FP pipeline?
• 10 cycles per iteration

    Loop: L.D   F0,0(R1)
          stall
          ADD.D F4,F0,F2
          stall
          stall
          S.D   0(R1),F4
          ADDI  R1,R1,8
          stall
          BNE   R1,R2,Loop
          stall
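The 10-cycle count can be reproduced with a small dependence-driven issue model. The FP latencies come from the table two slides back; the integer-to-branch latency of 1 and the single branch delay slot are assumptions made here to match the slide's schedule, not entries in that table.

```python
# Sketch: reproduce the 10-cycle schedule with a simple issue model.
# Extra stall cycles between a producer and a dependent consumer; the
# int->branch entry and BRANCH_DELAY are assumptions matching the
# slide's pipeline, not the FP latency table itself.
EXTRA = {("load", "fp_alu"): 1, ("fp_alu", "store"): 2,
         ("load", "store"): 0, ("fp_alu", "fp_alu"): 3,
         ("int", "branch"): 1}
BRANCH_DELAY = 1

# (name, kind, destination register, source registers)
loop = [("L.D",   "load",   "F0", []),
        ("ADD.D", "fp_alu", "F4", ["F0"]),
        ("S.D",   "store",  None, ["F4"]),
        ("ADDI",  "int",    "R1", ["R1"]),
        ("BNE",   "branch", None, ["R1"])]

def cycles_per_iteration(instrs):
    issue, writer = 0, {}   # writer: reg -> (issue cycle, producer kind)
    for name, kind, dst, srcs in instrs:
        earliest = issue + 1            # in-order, one instruction/cycle
        for reg in srcs:
            if reg in writer:
                w_cycle, w_kind = writer[reg]
                earliest = max(earliest,
                               w_cycle + 1 + EXTRA.get((w_kind, kind), 0))
        issue = earliest
        if dst:
            writer[dst] = (issue, kind)
    return issue + BRANCH_DELAY

print(cycles_per_iteration(loop))  # 10
```

The model issues L.D in cycle 1, ADD.D in 3, S.D in 6, ADDI in 7, BNE in 9, plus one delay-slot stall: exactly the schedule shown above.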

Page 12:

Code Scheduling to Avoid Stalls
• Can we reorder the instructions to avoid stalls?
• Execution time reduced from 10 to 6 cycles per iteration

    Loop: L.D   F0,0(R1)
          ADDI  R1,R1,8
          ADD.D F4,F0,F2
          stall
          BNE   R1,R2,Loop
          S.D   -8(R1),F4    ; watch out! offset adjusted because R1 was already incremented

• But only 3 instructions perform useful work; the rest is loop overhead. How to avoid this?

Page 13:

Loop Unrolling: increasing ILP
At source level:

    for (i=1; i<=1000; i++)
        x[i] = x[i] + s;

becomes

    for (i=1; i<=1000; i=i+4) {
        x[i]   = x[i]   + s;
        x[i+1] = x[i+1] + s;
        x[i+2] = x[i+2] + s;
        x[i+3] = x[i+3] + s;
    }

MIPS code after scheduling:

    Loop: L.D   F0,0(R1)
          L.D   F6,8(R1)
          L.D   F10,16(R1)
          L.D   F14,24(R1)
          ADD.D F4,F0,F2
          ADD.D F8,F6,F2
          ADD.D F12,F10,F2
          ADD.D F16,F14,F2
          S.D   0(R1),F4
          S.D   8(R1),F8
          ADDI  R1,R1,32
          S.D   -16(R1),F12
          BNE   R1,R2,Loop
          S.D   -8(R1),F16

• Any drawbacks?
  – loop unrolling increases code size
  – more registers are needed
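The source-level transformation can be checked directly: the unrolled loop computes the same result, as long as the trip count is a multiple of the unroll factor (here 1000 is divisible by 4, so no cleanup loop is needed). A minimal sketch:

```python
def add_s(x, s):
    # original loop: x[i] = x[i] + s for every element
    for i in range(len(x)):
        x[i] = x[i] + s

def add_s_unrolled(x, s):
    # unrolled by 4; assumes len(x) is a multiple of 4 (as 1000 is),
    # so no cleanup loop is needed
    for i in range(0, len(x), 4):
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s

a = list(range(8))
b = list(range(8))
add_s(a, 5.0)
add_s_unrolled(b, 5.0)
print(a == b)  # True
```

When the trip count is not a multiple of the unroll factor, a compiler emits a short remainder loop for the leftover iterations.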

Page 14:

Multiple issue architectures
How to get CPI < 1?
• Superscalar: multiple instructions issued per cycle
  – Statically scheduled
  – Dynamically scheduled (see previous lecture)
• VLIW?
  – single instruction issue, but multiple operations per instruction
• SIMD / Vector?
  – single instruction issue, single operation, but multiple data sets per operation
• Multi-processor?

Page 15:

Instruction Parallel (ILP) Processors
The name ILP is used for:
• Multiple-Issue Processors
  – Superscalar: varying no. of instructions/cycle (0 to 8), scheduled by HW (dynamic issue capability)
    • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4, etc.
  – VLIW (very long instruction word): fixed number of instructions (4-16) scheduled by the compiler (static issue capability)
    • Intel Architecture-64 (IA-64, Itanium), TriMedia, TI C6x
• (Super-)pipelined processors
• The anticipated success of multiple issue led to the Instructions Per Cycle (IPC) metric instead of CPI

Page 16:

Vector processors
• Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
  – Multimedia instructions are being added to many processors
• Different implementations:
  – real SIMD
    • e.g. 320 separate 32-bit ALUs + RFs
  – (multiple) subword units
    • divide a single ALU into sub-ALUs
  – deeply pipelined units
    • aiming at very high frequency, with forwarding between units

Page 17:

Simple In-order Superscalar
• In-order superscalar 2-issue processor: 1 Integer & 1 FP
  – Used in the first Pentium processor (also in Larrabee, but that was canceled!)
  – Fetch 64 bits/clock cycle; Int on left, FP on right
  – Can only issue the 2nd instruction if the 1st instruction issues
  – More ports needed on the FP register file to execute an FP load & FP op in parallel

    Type              Pipe stages
    Int. instruction  IF ID EX MEM WB
    FP instruction    IF ID EX MEM WB
    Int. instruction     IF ID EX MEM WB
    FP instruction       IF ID EX MEM WB
    Int. instruction        IF ID EX MEM WB
    FP instruction          IF ID EX MEM WB

• A 1-cycle load delay impacts the next 3 instructions!

Page 18:

Dynamic trace for unrolled code

    for (i=1; i<=1000; i++)
        a[i] = a[i] + s;

    Integer instruction    FP instruction     Cycle
    L: LD   F0,0(R1)                          1
       LD   F6,8(R1)                          2
       LD   F10,16(R1)     ADDD F4,F0,F2      3
       LD   F14,24(R1)     ADDD F8,F6,F2      4
       LD   F18,32(R1)     ADDD F12,F10,F2    5
       SD   0(R1),F4       ADDD F16,F14,F2    6
       SD   8(R1),F8       ADDD F20,F18,F2    7
       SD   16(R1),F12                        8
       ADDI R1,R1,40                          9
       SD   -16(R1),F16                       10
       BNE  R1,R2,L                           11
       SD   -8(R1),F20                        12

Load: 1 cycle latency; ALU op: 2 cycles latency

• 2.4 cycles per element (12 cycles / 5 elements) vs. 3.5 for the ordinary MIPS pipeline
• Int and FP instructions are not perfectly balanced

Page 19:

Superscalar – Multi-issue Issues
• While the Integer/FP split is simple for the HW, we only get an IPC of 2 for programs with:
  – exactly 50% FP operations AND no hazards
• More complex decode and issue:
  – Even a 2-issue superscalar must examine 2 opcodes and 6 register specifiers, and decide if 1 or 2 instructions can issue (N-issue needs ~O(N^2) comparisons)
  – Register file complexity: a 2-issue superscalar needs 4 read and 2 write ports per cycle
  – Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:

        add r1, r2, r3      add p11, p4,  p7
        sub r4, r1, r2      sub p22, p11, p4
        lw  r1, 4(r4)       lw  p23, 4(p22)
        add r5, r1, r2      add p12, p23, p4

    Imagine doing this transformation in a single cycle!
  – Result buses: need to complete multiple instructions/cycle
    • Need multiple buses with associated matching logic at every reservation station
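The renaming transformation above can be sketched as a register alias table (RAT) plus a free list of physical registers. The physical-register names and the free-list order below are chosen to reproduce the slide's right-hand column; a real renamer does all four lookups and allocations in parallel in one cycle.

```python
# Sketch of register renaming for the 4-way issue example above.
# RAT = register alias table: architectural -> physical register.
def rename(instrs, rat, free_list):
    """instrs: list of (opcode, dest, sources) with architectural names."""
    renamed = []
    free = list(free_list)
    for op, dst, srcs in instrs:
        psrcs = [rat[s] for s in srcs]   # read sources via the RAT first
        pdst = free.pop(0)               # allocate a fresh physical reg
        rat[dst] = pdst                  # note: r1 is renamed twice here!
        renamed.append((op, pdst, psrcs))
    return renamed

instrs = [("add", "r1", ["r2", "r3"]),
          ("sub", "r4", ["r1", "r2"]),
          ("lw",  "r1", ["r4"]),         # lw r1, 4(r4)
          ("add", "r5", ["r1", "r2"])]
rat = {"r2": "p4", "r3": "p7"}           # initial mappings, as on the slide
out = rename(instrs, rat, ["p11", "p22", "p23", "p12"])
print(out)
# [('add','p11',['p4','p7']), ('sub','p22',['p11','p4']),
#  ('lw','p23',['p22']), ('add','p12',['p23','p4'])]
```

Note how the second write to r1 (the lw) gets a new physical register p23, so the later add reads p23 while the earlier sub still correctly reads p11.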

Page 20:

VLIW Processors
• Superscalar HW is too difficult to build => let the compiler find independent instructions and pack them in one Very Long Instruction Word (VLIW)
• Example: VLIW processor with 2 ld/st units, two FP units, one integer/branch unit, no branch delay

    Ld/st 1          Ld/st 2          FP 1             FP 2             Int
    LD F0,0(R1)      LD F6,8(R1)
    LD F10,16(R1)    LD F14,24(R1)
    LD F18,32(R1)    LD F22,40(R1)    ADDD F4,F0,F2    ADDD F8,F6,F2
    LD F26,48(R1)                     ADDD F12,F10,F2  ADDD F16,F14,F2
                                      ADDD F20,F18,F2  ADDD F24,F22,F2
    SD 0(R1),F4      SD 8(R1),F8      ADDD F28,F26,F2
    SD 16(R1),F12    SD 24(R1),F16
    SD 32(R1),F20    SD 40(R1),F24                                      ADDI R1,R1,56
    SD -8(R1),F28                                                       BNE R1,R2,L

Page 21:

Superscalar versus VLIW
VLIW advantages:
• Much simpler to build. Potentially faster
VLIW disadvantages and proposed solutions:
• Binary code incompatibility
  – Object code translation or emulation
  – Less strict approach (EPIC, IA-64, Itanium)
• Increase in code size; unfilled slots are wasted bits
  – Use clever encodings, only one immediate field
  – Compress instructions in memory and decode them when they are fetched, or when put in the L1 cache
• Lockstep operation: if the operation in one instruction slot stalls, the entire processor is stalled
  – Less strict approach

Page 22:

Use compressed instructions

[Figure: Memory -> L1 Instruction Cache -> CPU, with instructions stored compressed in memory. Decompression can happen either between memory and the L1 cache, or between the L1 cache and the CPU.]

Q: What are the pros and cons of each choice?

Page 23:

Advanced compiler support techniques

• Loop-level parallelism
• Software pipelining
• Global scheduling (across basic blocks)

Page 24:

Detecting Loop-Level Parallelism
• Loop-carried dependence: a statement executed in a certain iteration depends on a statement executed in an earlier iteration
• If there is no loop-carried dependence, the iterations can be executed in parallel

    for (i=1; i<=100; i++) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

[Figure: dependence graph with nodes S1 and S2; S1 has a loop-carried self-dependence through A, and S2 depends on S1 within the iteration and on itself through B.]

A loop is parallel iff the corresponding dependence graph does not contain a cycle.
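The effect of the loop-carried dependence in S1 can be demonstrated directly: running all iterations "in parallel" (every S1 reading the old values of A) gives a different result than the sequential loop. A small sketch with an illustrative array size:

```python
# Demo: the A-recurrence in S1 is loop-carried, so a naive parallel
# execution (all iterations reading the OLD A) differs from the
# sequential result. Small n for illustration.
def sequential(A, C, n):
    A = A[:]
    for i in range(1, n + 1):
        A[i + 1] = A[i] + C[i]     # S1: reads A[i] written last iteration
    return A

def parallel_attempt(A, C, n):
    old = A[:]                     # every iteration reads the old A
    A = A[:]
    for i in range(1, n + 1):
        A[i + 1] = old[i] + C[i]
    return A

A = [0, 1, 0, 0, 0]
C = [1, 1, 1, 1, 1]
print(sequential(A, C, 3))        # [0, 1, 2, 3, 4]
print(parallel_attempt(A, C, 3))  # [0, 1, 2, 1, 1] -- differs: not parallel
```

The recurrence chains each iteration to the previous one, which is exactly the cycle in the dependence graph.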

Page 25:

Finding Dependences
• Is there a dependence in the following loop?

    for (i=1; i<=100; i++)
        A[2*i+3] = A[2*i] + 5.0;

• Affine expression: an expression of the form a*i + b (a, b constants, i the loop index variable)
• Does the following equation have a solution?

    a*i + b = c*j + d

• GCD test: if there is a solution, then GCD(a,c) must divide d-b

Note: because the GCD test does not take the loop bounds into account, there are cases where the GCD test says "yes, there is a solution" while in reality there isn't.
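Applied to the loop above, the write index is 2*i+3 (a=2, b=3) and the read index is 2*j (c=2, d=0). A minimal sketch of the test:

```python
# Sketch of the GCD dependence test from the slide.
from math import gcd

def gcd_test(a, b, c, d):
    """Conservative test: a solution to a*i + b == c*j + d can only
    exist if gcd(a, c) divides d - b (loop bounds are ignored)."""
    return (d - b) % gcd(a, c) == 0

# A[2*i+3] = A[2*i] + 5.0  ->  a=2, b=3 (write), c=2, d=0 (read)
print(gcd_test(2, 3, 2, 0))  # False: gcd(2,2)=2 does not divide -3
```

So the writes (odd indices) and reads (even indices) never touch the same element, and the iterations are independent. As the note says, a True answer would only mean a dependence is *possible*.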

Page 26:

Software Pipelining
• We have already seen loop unrolling
• Software pipelining is a related technique that consumes less code space. It interleaves instructions from different iterations
  – instructions in one iteration are often dependent on each other

[Figure: overlapping iterations 0, 1, 2; a software-pipelined iteration picks instructions from several original iterations, so the steady-state kernel executes independent instructions.]

Page 27:

Simple Software Pipelining Example

    L:  l.d   f0,0(r1)     # load M[i]
        add.d f4,f0,f2     # compute M[i]
        s.d   f4,0(r1)     # store M[i]
        addi  r1,r1,-8     # i = i-1
        bne   r1,r2,L

• Software pipelined loop:

    L:  s.d   f4,16(r1)    # store M[i]
        add.d f4,f0,f2     # compute M[i-1]
        l.d   f0,0(r1)     # load M[i-2]
        addi  r1,r1,-8
        bne   r1,r2,L

• Need hardware to avoid the WAR hazards
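The kernel above can be sketched in scalar form: f0 holds the most recently loaded element and f4 the most recently computed one, and each kernel step stores iteration i-2, computes i-1, and loads i. The prologue fills the pipeline and the epilogue drains it (indices run upward here for readability, where the slide's code runs downward).

```python
# Scalar sketch of the software-pipelined loop above: M[i] = M[i] + s.
def sw_pipelined(M, s):
    M = M[:]
    n = len(M)
    assert n >= 2
    # prologue: load M[0], compute it, load M[1]
    f4 = M[0] + s           # computed value for iteration 0
    f0 = M[1]               # loaded value for iteration 1
    for i in range(2, n):   # steady-state kernel
        M[i - 2] = f4       # store iteration i-2
        f4 = f0 + s         # compute iteration i-1
        f0 = M[i]           # load iteration i
    # epilogue: drain the pipeline
    M[n - 2] = f4
    M[n - 1] = f0 + s
    return M

print(sw_pipelined([1.0, 2.0, 3.0, 4.0], 10.0))  # [11.0, 12.0, 13.0, 14.0]
```

Inside one kernel step the three operations are independent, which is the whole point: they can be issued in parallel without waiting for each other.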

Page 28:

Global code scheduling
• Loop unrolling and software pipelining work well when there are no control statements (if statements) in the loop body -> the loop is a single basic block
• Global code scheduling: scheduling/moving code across branches, giving a larger scheduling scope
• When can the assignments to B and C be moved before the test?

[Figure: a basic block computing A[i]=A[i]+B[i], followed by a test A[i]=0?; the true branch assigns B[i]= and the false branch assigns C[i]=, joining afterwards.]

Page 29:

Which scheduling scope?

Trace Superblock Decision Tree Hyperblock/region

Page 30:

Comparing scheduling scopes

                              Trace  Sup.block  Hyp.block  Dec.Tree  Region
    Multiple exc. paths       No     No         Yes        Yes       Yes
    Side-entries allowed      Yes    No         No         No        No
    Join points allowed       Yes    No         Yes        No        Yes
    Code motion down joins    Yes    No         No         No        No
    Must be if-convertible    No     No         Yes        No        No
    Tail dup. before sched.   No     Yes        No         Yes       No

Page 31:

Scheduling scope creation (1)

Partitioning a CFG into scheduling scopes:

[Figure: a CFG with blocks A, B, C, D, E, F, G. A trace selects the most likely path through the CFG but still has side entries at join points; a superblock removes the side entries by tail duplication (copies D', E', G' of the blocks after the join), so the scope has a single entry.]

Page 32:

Trace Scheduling
• Find the most likely sequence of basic blocks that will be executed consecutively (trace selection)
• Optimize the trace as much as possible (trace compaction)
  – move operations as early as possible in the trace
  – pack the operations in as few VLIW instructions as possible
  – additional bookkeeping code may be necessary on exit points of the trace

Page 33:

Scheduling scope creation (2)

Partitioning a CFG into scheduling scopes:

[Figure: the same CFG with blocks A, B, C, D, E, F, G. A hyperblock/region keeps the join points inside the scope; a decision tree removes them by tail duplication (copies E', F', D', G', G''), turning the scope into a tree of basic blocks.]

Page 34:

Code movement (upwards) within regions

[Figure: moving an instruction I (e.g. an add) upwards from a source block to a destination block. If the destination block does not dominate the source block, a copy of I is needed on the other paths into the source block, and intermediate blocks must be checked for off-liveness of I's destination register.]

Page 35:

Hardware support for compile-time scheduling
• Predication (discussed already)
  – see also the Itanium example
• Deferred exceptions
• Speculative loads

Page 36:

Predicated Instructions (discussed before)
• Avoid branch prediction by turning branches into conditional or predicated instructions: if the condition is false, neither store the result nor cause an exception
  – The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction
  – IA-64/Itanium: conditional execution of any instruction
• Examples:

    if (R1==0) R2 = R3;        CMOVZ  R2,R3,R1

    if (R1 < R2)               SLT    R9,R1,R2
        R3 = R1;               CMOVNZ R3,R1,R9
    else                       CMOVZ  R3,R2,R9
        R3 = R2;
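The semantics of the two conditional moves can be sketched as pure functions: the move takes effect only when the condition register is (non)zero, so no branch is needed and a squashed move has no side effects.

```python
# Sketch of conditional-move semantics (CMOVZ / CMOVNZ as on the slide).
def cmovz(dest, src, cond):
    """dest = src if cond == 0, else dest unchanged."""
    return src if cond == 0 else dest

def cmovnz(dest, src, cond):
    """dest = src if cond != 0, else dest unchanged."""
    return src if cond != 0 else dest

# if (R1==0) R2 = R3;
R1, R2, R3 = 0, 10, 99
R2 = cmovz(R2, R3, R1)
print(R2)  # 99

# if (R1 < R2) R3 = R1; else R3 = R2;   (branchless minimum)
R1, R2 = 3, 7
R9 = 1 if R1 < R2 else 0        # SLT    R9,R1,R2
R3 = cmovnz(0, R1, R9)          # CMOVNZ R3,R1,R9  (taken when R1 < R2)
R3 = cmovz(R3, R2, R9)          # CMOVZ  R3,R2,R9  (taken otherwise)
print(R3)  # 3
```

Exactly one of the two moves in the if/else example takes effect, so R3 ends up holding min(R1, R2) without any branch.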

Page 37:

Deferred Exceptions

    if (A==0)
        A = B;
    else
        A = A+4;

Straightforward code:

        ld   r1,0(r3)    # load A
        bnez r1,L1       # test A
        ld   r1,0(r2)    # then part; load B
        j    L2
    L1: addi r1,r1,4     # else part; inc A
    L2: st   r1,0(r3)    # store A

• How to optimize when the then-part is usually selected? Move the load of B above the test (speculate):

        ld   r1,0(r3)    # load A
        ld   r9,0(r2)    # speculative load B
        beqz r1,L3       # test A
        addi r9,r1,4     # else part
    L3: st   r9,0(r3)    # store A

• But what if the speculative load generates a page fault?
• What if it generates an "index-out-of-bounds" exception?

Page 38:

HW supporting Speculative Loads
• Speculative load (sld): does not generate exceptions
• Speculation check instruction (speck): checks for an exception; the exception occurs when this instruction is executed

        ld    r1,0(r3)   # load A
        sld   r9,0(r2)   # speculative load of B
        bnez  r1,L1      # test A
        speck 0(r2)      # perform exception check
        j     L2
    L1: addi  r9,r1,4    # else part
    L2: st    r9,0(r3)   # store A

Page 39:

How further?
Burton Smith, Microsoft, 2005