advanced computer architecture 5md00 / 5z033 ilp architectures

55
Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures Henk Corporaal www.ics.ele.tue.nl/ ~heco/courses/aca [email protected] TUEindhoven 2007

Upload: hafwen

Post on 25-Feb-2016

42 views

Category:

Documents


0 download

DESCRIPTION

Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures. Henk Corporaal www.ics.ele.tue.nl/~heco/courses/aca [email protected] TUEindhoven 2007. Topics. Introduction Hazards Data dependences Control dependences Branch prediction Dependences limit ILP: scheduling - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

Advanced Computer Architecture5MD00 / 5Z033

ILP architectures

Henk Corporaalwww.ics.ele.tue.nl/~heco/courses/aca

[email protected]

2007

Page 2: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 2

Topics• Introduction• Hazards

– Data dependences– Control dependences

• Branch prediction• Dependences limit ILP: scheduling• Out-Of-Order execution: Hardware speculation• Multiple issue• How much ILP is there?

Page 3: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 3

IntroductionILP = Instruction level parallelism• multiple operations (or instructions) can be executed in

parallel

Needed:• Sufficient resources• Parallel scheduling

– Hardware solution– Software solution

• Application should contain ILP

Page 4: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 4

Hazards• Three types of hazards

– Structural– Data dependence– Control dependence

• Hazards cause scheduling problems

Page 5: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 5

Data dependences• RaW read after write• WaR write after read• WaW write after write

Page 6: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 6

Control Dependences

C input code:

CFG: 1 sub t1, a, b bgz t1, 2, 3

4 mul y,a,b …………..

3 rem r, b, a goto 4

2 rem r, a, b goto 4

if (a > b) { r = a % b; } else { r = b % a; }y = a*b;

How real are control dependences?

Page 7: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 7

Branch Prediction

Page 8: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 8

Branch PredictionMotivation

• High branch penalties in pipelined processors:• With on average one out of five instructions being

a branch, the maximum ILP is five• Situation even worse for multiple-issue processors,

because we need to provide an instruction stream of n instructions per cycle.

• Idea: predict the outcome of branches based on their history and execute instructions speculatively

Page 9: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 9

5 Branch Prediction Schemes• 1-bit Branch Prediction Buffer• 2-bit Branch Prediction Buffer• Correlating Branch Prediction Buffer• Branch Target Buffer• Return Address Predictors

+ A way to get rid of those malicious branches

Page 10: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 10

1-bit Branch Prediction Buffer• 1-bit branch prediction buffer or branch history table:

• Buffer is like a cache without tags• Does not help for simple MIPS pipeline because target address calculations in same

stage as branch condition calculation

10…..10 101 00

01010110

PC

BHT

Page 11: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 11

Branch Prediction Buffer: 1 bit prediction

Problems:• Aliasing: lower K bits of different branch instructions could be the same

– Soln: Use tags (the buffer becomes a tag); however very expensive

• Loops are predicted wrong twice

– Soln: Use n-bit saturation counter prediction

* taken if counter 2 (n-1)

* not-taken if counter < 2 (n-1)

– A 2 bit saturating counter predicts a loop wrong only once

Branch address 2 K entries(K bits)

prediction bit

Page 12: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 12

• Solution: 2-bit scheme where prediction is changed only if mispredicted twice

• Can be implemented as a saturating counter:

2-bit Branch Prediction Buffer

T

T

NT

Predict Taken

Predict Not Taken

Predict Taken

Predict Not TakenT

NT

T

NT

NT

Page 13: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 13

Correlating Branches• Fragment from SPEC92 benchmark eqntott:

if (aa==2) aa = 0;if (bb==2) bb=0;if (aa!=bb){..}

subi R3,R1,#2b1: bnez R3,L1

add R1,R0,R0L1: subi R3,R2,#2b2: bnez R3,L2

add R2,R0,R0L2: sub R3,R1,R2b3: beqz R3,L3

Page 14: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 14

Correlating Branch Predictor

Idea: behavior of current branch is related to taken/not taken history of recently executed branches

– Then behavior of recent branches selects between, say, 4 predictions of next branch, updating just that prediction

• (2,2) predictor: 2-bit global, 2-bit local

• (k,n) predictor uses behavior of last k branches to choose from 2k predictors, each of which is n-bit predictor

4 bits from branch address

2-bits per branch local predictors

Prediction

2-bit global branch history

(01 = not taken, then taken)

shiftregister

Page 15: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 15

Branch Correlation Using Branch HistoryTwo schemes (a, k, m, n)

• PA: Per address history, a > 0• GA: Global history, a = 0

n-bit saturating Up/Down Counter Prediction

Table size (usually n = 2): #bits = k * 2a + 2k * 2m *n

Variant: Gshare (Scott McFarling’93): GA which takes logic OR of PC address bits and branch history bits

Branch Address0 1 2k-1

0

1

2m-1

Branch History Table

a k

m

Pattern History Table

Page 16: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 16

Accuracy (taking the best combination of parameters):

Predictor Size (bytes)64 128

Bra

nch

Pred

ictio

n A

ccur

acy

(%)

256 1K 2K 4K 8K 16K 32K 64K

89

91

95

969798

929394

PA(10, 6, 4, 2)

GA(0,11,5,2)

BimodalGAsPAs

Page 17: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 17

0%1%

5%6% 6%

11%

4%

6%5%

1%

0%

2%

4%

6%

8%

10%

12%

14%

16%

18%

20%

nasa7 matrix300 tomcatv doducd spice fpppp gcc espresso eqntott li

Freq

uenc

y of

Mis

pred

ictio

ns

4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2)

Accuracy of Different Branch Predictors

4096 Entries 2-bit BHT Unlimited Entries 2-bit BHT 1024 Entries (2,2) BHT

0%

Mis

pred

ictio

ns R

ate

18%

Page 18: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 18

BHT Accuracy• Mispredict because either:

– Wrong guess for that branch– Got branch history of wrong branch when index the

table• 4096 entry table: misprediction rates vary from

1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%

• For SPEC92, 4096 about as good as infinite table

• Real programs + OS more like gcc

Page 19: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 19

Branch Target Buffer• Branch condition is not enough !!• Branch Target Buffer (BTB): Tag and Target address

Tag branch PC PC if taken

=? Branchprediction(often in separatetable)

Yes: instruction is branch. Use predicted PC as next PC if branch predicted taken.No: instruction is not a

branch. Proceed normally

10…..10 101 00PC

Page 20: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 20

Instruction Fetch Stage

Not shown: hardware needed when prediction was wrong

InstructionMemoryP

C

Inst

ruct

ion

regi

ster

4

BTB

found & takentarget address

Page 21: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 21

Special Case: Return Addresses

• Register indirect branches: hard to predict target address– MIPS instruction: jr r31 ; PC = r31– useful for

• implementing switch/case statements• FORTRAN computed GOTOs• procedure return (mainly)

• SPEC89: 85% such branches for procedure return• Since stack discipline for procedures, save return

address in small buffer that acts like a stack: 8 to 16 entries has very high hit rate

Page 22: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 22

Dynamic Branch Prediction Summary

• Prediction important part of scalar execution• Branch History Table: 2 bits for loop accuracy• Correlation: Recently executed branches

correlated with next branch– Either different branches– Or different executions of same branch

• Branch Target Buffer: include branch target address (& prediction)

• Return address stack for prediction of indirect jumps

Page 23: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 23

Or: Avoid branches !

Page 24: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 24

• Avoid branch prediction by turning branches into conditional or predicated instructions:

• If false, then neither store result nor cause exception– Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional

move; PA-RISC can annul any following instr.– IA-64/Itanium: conditional execution of any instruction

• Examples:if (R1==0) R2 = R3; CMOVZ R2,R3,R1

if (R1 < R2) SLT R9,R1,R2 R3 = R1; CMOVNZ R3,R1,R9else CMOVZ R3,R2,R9 R3 = R2;

Predicated Instructions

Page 25: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 25

Dynamic Scheduling

Page 26: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 26

Dynamic Scheduling Principle• What we examined so far is static scheduling

– Compiler reorders instructions so as to avoid hazards and reduce stalls• Dynamic scheduling: hardware rearranges instruction execution to reduce stalls• Example:

DIV.D F0,F2,F4 ; takes 24 cycles and; is not pipeline

ADD.D F10,F0,F8

SUB.D F12,F8,F14

• Key idea: Allow instructions behind stall to proceed• Book describes Tomasulo algorithm, but we describe general idea

This instruction cannot continueeven though it does not dependon anything

Page 27: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 27

Advantages ofDynamic Scheduling

• Handles cases when dependences unknown at compile time – e.g., because they may involve a memory reference

• It simplifies the compiler • Allows code compiled for one or no pipeline to

run efficiently on a different pipeline • Hardware speculation, a technique with

significant performance advantages, that builds on dynamic scheduling

Page 28: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 28

Superscalar ConceptInstructionMemory

InstructionCache

Decoder

BranchUnit ALU-1 ALU-2 Logic &

ShiftLoadUnit

StoreUnit

ReorderBuffer Register

File

DataCache

DataMemory

Reservation Stations

Address

DataData

Instruction

Page 29: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 29

Superscalar Issues• How to fetch multiple instructions in time (across basic block

boundaries) ?• Predicting branches• Non-blocking memory system• Tune #resources(FUs, ports, entries, etc.)• Handling dependencies• How to support precise interrupts?• How to recover from mis-predicted branch path?

• For the latter two issues we need to look at sequential look-ahead and architectural state – Ref: Johnson 91

Page 30: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 30

Example of Superscalar Processor Execution

• Superscalar processor organization:– simple pipeline: IF, EX, WB– fetches 2 instructions each cycle– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier– Instruction window (buffer between IF and EX stage) is of size 2– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7L.D F6,32(R2)L.D F2,48(R3)MUL.D F0,F2,F4SUB.D F8,F2,F6DIV.D F10,F0,F6ADD.D F6,F8,F2MUL.D F12,F2,F4

Page 31: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 31

Example of Superscalar Processor Execution

• Superscalar processor organization:– simple pipeline: IF, EX, WB– fetches 2 instructions each cycle– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier– Instruction window (buffer between IF and EX stage) is of size 2– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7L.D F6,32(R2) IFL.D F2,48(R3) IFMUL.D F0,F2,F4SUB.D F8,F2,F6DIV.D F10,F0,F6ADD.D F6,F8,F2MUL.D F12,F2,F4

Page 32: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 32

Example of Superscalar Processor Execution

• Superscalar processor organization:– simple pipeline: IF, EX, WB– fetches 2 instructions each cycle– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier– Instruction window (buffer between IF and EX stage) is of size 2– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7L.D F6,32(R2) IF EXL.D F2,48(R3) IF EXMUL.D F0,F2,F4 IFSUB.D F8,F2,F6 IFDIV.D F10,F0,F6ADD.D F6,F8,F2MUL.D F12,F2,F4

Page 33: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 33

Example of Superscalar Processor Execution

• Superscalar processor organization:– simple pipeline: IF, EX, WB– fetches 2 instructions each cycle– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier– Instruction window (buffer between IF and EX stage) is of size 2– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7L.D F6,32(R2) IF EX WBL.D F2,48(R3) IF EX WBMUL.D F0,F2,F4 IF EXSUB.D F8,F2,F6 IF EXDIV.D F10,F0,F6 IFADD.D F6,F8,F2 IFMUL.D F12,F2,F4

Page 34: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 34

Example of Superscalar Processor Execution

• Superscalar processor organization:– simple pipeline: IF, EX, WB– fetches 2 instructions each cycle– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier– Instruction window (buffer between IF and EX stage) is of size 2– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7L.D F6,32(R2) IF EX WBL.D F2,48(R3) IF EX WBMUL.D F0,F2,F4 IF EX EXSUB.D F8,F2,F6 IF EX EXDIV.D F10,F0,F6 IFADD.D F6,F8,F2 IFMUL.D F12,F2,F4

stall becauseof data dep.

cannot be fetched because window full

Page 35: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 35

Example of Superscalar Processor Execution

• Superscalar processor organization:– simple pipeline: IF, EX, WB– fetches 2 instructions each cycle– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier– Instruction window (buffer between IF and EX stage) is of size 2– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7L.D F6,32(R2) IF EX WBL.D F2,48(R3) IF EX WBMUL.D F0,F2,F4 IF EX EX EXSUB.D F8,F2,F6 IF EX EX WBDIV.D F10,F0,F6 IFADD.D F6,F8,F2 IF EXMUL.D F12,F2,F4 IF

Page 36: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 36

Example of Superscalar Processor Execution

• Superscalar processor organization:– simple pipeline: IF, EX, WB– fetches 2 instructions each cycle– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier– Instruction window (buffer between IF and EX stage) is of size 2– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7L.D F6,32(R2) IF EX WBL.D F2,48(R3) IF EX WBMUL.D F0,F2,F4 IF EX EX EX EXSUB.D F8,F2,F6 IF EX EX WBDIV.D F10,F0,F6 IFADD.D F6,F8,F2 IF EX EXMUL.D F12,F2,F4 IF

cannot execute structural hazard

Page 37: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 37

Example of Superscalar Processor Execution

• Superscalar processor organization:– simple pipeline: IF, EX, WB– fetches 2 instructions each cycle– 2 ld/st units, dual-ported memory; 2 FP adders; 1 FP multiplier– Instruction window (buffer between IF and EX stage) is of size 2– FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc

Cycle 1 2 3 4 5 6 7L.D F6,32(R2) IF EX WBL.D F2,48(R3) IF EX WBMUL.D F0,F2,F4 IF EX EX EX EX WBSUB.D F8,F2,F6 IF EX EX WBDIV.D F10,F0,F6 IF EXADD.D F6,F8,F2 IF EX EX WBMUL.D F12,F2,F4 IF ?

Page 38: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 38

Register Renaming• A technique to eliminate anti- and output

dependencies• Can be implemented

– by the compiler• advantage: low cost• disadvantage: “old” codes perform poorly

– in hardware• advantage: binary compatibility• disadvantage: extra hardware needed

• We describe general idea

Page 39: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 39

Register Renaming– there’s a physical register file larger than logical register file– mapping table associates logical registers with physical register– when an instruction is decoded

• its physical source registers are obtained from mapping table• its physical destination register is obtained from a free list• mapping table is updated

add r3,r3,4

R8

R7

R5

R1

R9

R2 R6

before:

mapping table:

free list:

r0

r1

r2

r3

r4

add R2,R1,4

R8

R7

R5

R2

R9

R6

after:

mapping table:

free list:

r0

r1

r2

r3

r4

Page 40: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 40

Eliminating False Dependencies• How register renaming eliminates false

dependencies:

• Before:• addi r1, r2, 1• addi r2, r0, 0• addi r1, r0, 1

• After (free list: R7, R8, R9)• addi R7, R5, 1• addi R8, R0, 0• addi R9, R0, 1

Page 41: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 41

Limitations of Multiple-Issue Processors• Available ILP is limited (we’re not programming with

parallelism in mind)• Hardware cost

– adding more functional units is easy– more memory ports and register ports needed– dependency check needs O(n2) comparisons

• Limitations of VLIW processors– Loop unrolling increases code size– Unfilled slots waste bits– Cache miss stalls pipeline

• Research topic: scheduling loads– Binary incompatibility (not EPIC)

Page 42: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 42

Measuring available ILP: How?

• Using existing compiler• Using trace analysis

– Track all the real data dependencies (RaWs) of instructions from issue window• register dependence• memory dependence

– Check for correct branch prediction• if prediction correct continue• if wrong, flush schedule and start in next cycle

Page 43: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 43

Trace analysis

Program

For i := 0..2

A[i] := i;

S := X+3;

Compiled code

set r1,0

set r2,3

set r3,&A

Loop: st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

add r1,r5,3

Trace

set r1,0

set r2,3

set r3,&A

st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

add r1,r5,3How parallel can this code be executed?

Page 44: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 44

Trace analysis

Parallel Traceset r1,0 set r2,3 set r3,&A

st r1,0(r3) add r1,r1,1 add r3,r3,4

st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop

st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop

brne r1,r2,Loop

add r1,r5,3

Max ILP =

Speedup = Lparallel / Lserial = 16 / 6 = 2.7

Is this the maximum?

Page 45: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 45

Ideal ProcessorAssumptions for ideal/perfect processor:

1. Register renaming – infinite number of virtual registers => all register WAW & WAR hazards avoided2. Branch and Jump prediction – Perfect => all program instructions available for execution3. Memory-address alias analysis – addresses are known. A store can be moved before a load provided addresses not equal

Also: – unlimited number of instructions issued/cycle (unlimited resources), and– unlimited instruction window– perfect caches– 1 cycle latency for all instructions (FP *,/)

Programs were compiled using MIPS compiler with maximum optimization level

Page 46: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 46

Upper Limit to ILP: Ideal Processor

Programs

Inst

ruct

ion

Issu

es p

er c

ycle

0

20

40

60

80

100

120

140

160

gcc espresso li fpppp doducd tomcatv

54.862.6

17.9

75.2

118.7

150.1

Integer: 18 - 60 FP: 75 - 150

IPC

Page 47: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 47

35

41

16

6158

60

9

1210

48

15

6 7 6

46

13

45

6 6 7

45

14

45

2 2 2

29

4

19

46

0

10

20

30

40

50

60

gcc espresso li fpppp doducd tomcatvProgram

Inst

ruct

ion

issu

es p

er c

ycle

Perfect Selective predictor Standard 2-bit Static None

Window Size and Branch Impact• Change from infinite window to examine 2000

and issue at most 64 instructions per cycle FP: 15 - 45

Integer: 6 – 12

IPC

Perfect Tournament BHT(512) Profile No prediction

Page 48: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 48

11

1512

29

54

10

1512

49

16

1013 12

35

15

44

9 10 11

20

11

28

5 5 6 5 5 74 4 5

4 5 5

59

45

0

10

20

30

40

50

60

70

gcc espresso li fpppp doducd tomcatvProgram

Inst

ruct

ion

issu

es p

er c

ycle

Infinite 256 128 64 32 None

Impact of Limited Renaming Registers• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level

predictor (slightly better than tournament predictor)

Integer: 5 - 15 FP: 11 - 45

IP

C

Infinite 256 128 64 32

Page 49: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 49

Program

Inst

ruct

ion

issu

es p

er c

ycle

0

5

10

15

20

25

30

35

40

45

50

gcc espresso li fpppp doducd tomcatv

10

15

12

49

16

45

7 79

49

16

45 4 4

6 53

53 3 4 4

45

Perfect Global/stack Perfect Inspection None

Memory Address Alias Impact• Changes: 2000 instr. window, 64 instr. issue, 8K

2-level predictor, 256 renaming registers

FP: 4 - 45(Fortran,no heap)

Integer: 4 - 9IPC

Perfect Global/stack perfect Inspection None

Page 50: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 50

Program

Inst

ruct

ion

issu

es p

er c

ycle

0

10

20

30

40

50

60

gcc expresso li fpppp doducd tomcatv

10

15

12

52

17

56

10

15

12

47

16

10

1311

35

15

34

910 11

22

12

8 8 9

14

9

14

6 6 68

79

4 4 4 5 46

3 2 3 3 3 3

45

22

Infinite 256 128 64 32 16 8 4

Window Size Impact• Assumptions: Perfect disambiguation, 1K Selective predictor, 16

entry return stack, 64 renaming registers, issue as many as window

Integer: 6 - 12

FP: 8 - 45

IPC

Page 51: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 51

How to Exceed ILP Limits of this Study?• WAR and WAW hazards through memory: eliminated

WAW and WAR hazards through register renaming, but not for memory operands

• Unnecessary dependences – (compiler did not unroll loops so iteration variable

dependence)• Overcoming the data flow limit: value prediction,

predicting values and speculating on prediction– Address value prediction and speculation predicts addresses

and speculates by reordering loads and stores. Could provide better aliasing analysis

Page 52: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 52

Workstation Microprocessors 3/2001

Source: Microprocessor Report, www.MPRonline.com

• Max issue: 4 instructions (many CPUs)Max rename registers: 128 (Pentium 4) Max BHT: 4K x 9 (Alpha 21264B), 16Kx2 (Ultra III)Max Window Size (OOO): 126 intructions (Pent. 4)Max Pipeline: 22/24 stages (Pentium 4)

Page 53: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 53

SPEC 2000 Performance 3/2001 Source: Microprocessor Report, www.MPRonline.com

1.6X

3.8X

1.2X

1.7X

1.5X

Page 54: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 54

Conclusions• 1985-2002: >1000X performance (55% / y)

• Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore’s Law to get 1.55X/year– Caches, (Super)Pipelining, Superscalar, Branch

Prediction, Out-of-order execution, Trace cache

• After 2002 slowdown (about 20%/y)

Page 55: Advanced Computer Architecture 5MD00 / 5Z033 ILP architectures

04/22/23 ACA H.Corporaal 55

Conclusions (cont'd)• ILP limits: To make performance progress in future

need to have explicit parallelism from programmer vs. implicit parallelism of ILP exploited by compiler/HW?

• Other problem:– Processor-memory performance gap– VLSI scaling problems (wiring)– Energy / leakage problems

• However: other forms of parallelism come to rescue