cs184b: computer architecture (abstractions and...


Page 1: CS184b: Computer Architecture (Abstractions and Optimizations)courses.cms.caltech.edu/cs184/spring2003/lectures/Day5_2up.pdf · ILP 2 Caltech CS184 Spring2003 -- DeHon 2 Today •


Caltech CS184 Spring2003 -- DeHon1

CS184b: Computer Architecture (Abstractions and Optimizations)

Day 5: April 14, 2003

ILP 2


Today

• ILP Limits
• Practical Issues
  – Finite size issues
• Cost Scaling
• Ultrascalar


Limit Studies

• Goal: understand how far you can go
  – in this case, how much ILP we can find
• Remove current/artificial limits
  – do full renaming, arbitrary look ahead
  – perfect control prediction, memory disambiguation
• Careful with assumptions
  – can still be pessimistic
  – is there another way to do it?
  – another way around the limitation?


Available ILP

[Hennessy and Patterson 4.38e2/3.35e3]


What do we achieve today?

• Pentium … < 1 instruction/cycle retired
  – But low cycle time
  – Time = CPI × Instructions × CycleTime
• Not seen attempts to issue more than 4 instructions/cycle
  – Much less sustain retire of more than 4
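
The performance equation above can be made concrete with a small calculation. This is only a sketch: the two machines and all of their numbers are invented for illustration and are not measurements of any real processor, and the function name `exec_time` is hypothetical.

```python
def exec_time(cpi, instructions, cycle_time_ns):
    """Time = CPI x Instructions x CycleTime, returned in nanoseconds."""
    return cpi * instructions * cycle_time_ns

# Two hypothetical machines running the same 1e9-instruction program.
# A: retires fewer than 1 instruction/cycle (CPI > 1) but has a short cycle.
# B: retires 2 instructions/cycle (CPI = 0.5) but has a longer cycle.
N = 1_000_000_000
time_a = exec_time(cpi=1.25, instructions=N, cycle_time_ns=0.5)
time_b = exec_time(cpi=0.5, instructions=N, cycle_time_ns=2.0)

print(f"A: {time_a / 1e9:.3f} s")
print(f"B: {time_b / 1e9:.3f} s")
```

With these made-up numbers the machine with the worse CPI still finishes first, which is the point of the "but low cycle time" bullet.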


Limit Effects


Superscalar

[Figure: superscalar pipeline. Blocks: IF, Decode, Queue, EX (ALU, MPY, LD/ST), RF, RUU. Labeled parameters: Fetch Width, Window Size, Physical Registers, Issue Width, Number/Types of Functional Units.]


Window Size (unlimited issue)

[Hennessy and Patterson 4.39e2/3.36e3]

There’s quite a bit of non-local parallelism.


Window Size (Issue limited)

[64-issues Hennessy and Patterson 4.47e2/3.45e3]


Operation Organization

• Consider tree-structured calculation
  – freedom in ordering
  – consider:
    • post-order traversal
    • by levels from leaves
  – where is parallelism?
  – storage cost?
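
The two orderings can be compared with a small model. This sketch assumes a complete binary reduction tree with 2**depth leaves, leaves loaded on demand in the post-order case, and one result kept live per finished subtree; the function names are hypothetical.

```python
def postorder_peak_storage(depth):
    """Peak number of live values when a complete binary tree with
    2**depth leaves is reduced in post-order, one operation at a time."""
    live = 0
    peak = 0

    def visit(d):
        nonlocal live, peak
        if d == 0:
            live += 1          # load one leaf
        else:
            visit(d - 1)       # left subtree leaves one live result
            visit(d - 1)       # right subtree leaves one more
            live -= 1          # two operands consumed, one result produced
        peak = max(peak, live)

    visit(depth)
    return peak


def level_order_profile(depth):
    """Independent operations available at each step when the same tree
    is evaluated by levels, starting from the leaves."""
    return [2 ** (depth - level) for level in range(1, depth + 1)]


depth = 4
print("post-order peak storage:", postorder_peak_storage(depth))
print("level-order ops per step:", level_order_profile(depth))
print("level-order values held at the start:", 2 ** depth)
```

In this model post-order needs storage that grows only with tree depth but exposes one operation at a time, while evaluation by levels exposes half the leaves' worth of parallel operations at the cost of holding a whole level.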


Window Size

• How many instructions forward do we look?
  – Only look at next = in-order issue

[Johnson Fig. 3.9 (32 issue window?)]


Branch Prediction

[Hennessy & Patterson Fig 3.38/e3]


Window Cost?

• No one before you in the window writes a value you need

• Rsrc_i ≠ Rdst_{i-1}; Rsrc_i ≠ Rdst_{i-2}; …

• O(WS^2) comparisons
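
The check above can be written out directly to see where the quadratic count comes from. This is a minimal sketch, assuming two source registers per instruction and that no instruction in the window has completed yet; the tuple layout and the name `ready_to_issue` are hypothetical.

```python
def ready_to_issue(window):
    """window: list of (dst, src1, src2) register names in program order.

    An instruction is ready if no earlier instruction in the window writes
    one of its sources. Returns (ready flags, comparisons performed)."""
    ready = []
    comparisons = 0
    for i, (_, src1, src2) in enumerate(window):
        ok = True
        for j in range(i):                  # every earlier instruction
            dst_j = window[j][0]
            for src in (src1, src2):
                comparisons += 1            # one Rsrc_i vs Rdst_j compare
                if src == dst_j:
                    ok = False
        ready.append(ok)
    return ready, comparisons


window = [
    ("r1", "r2", "r3"),
    ("r4", "r1", "r5"),   # needs r1 from instruction 0
    ("r6", "r7", "r8"),
    ("r9", "r4", "r6"),   # needs r4 and r6 from instructions 1 and 2
]
flags, count = ready_to_issue(window)
print(flags, count)
```

Under these assumptions the comparison count is WS × (WS − 1), so doubling the window roughly quadruples the comparator count.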


Cost?

• Anecdotal [Farrell, Fischer JSSC v33n5]
  – DEC 20-instruction queue
  – 4 instruction issue
  – (80 physical registers)
  – 10mm^2 in 0.35µm (300Mλ^2+)
• Compare:
  – 300 4-LUTs (w/ interconnect)
  – MIPS-X 32b CPU w/ 1KB memory = 68Mλ^2
  – 600 MHz = 1.6ns
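
The λ^2 figure can be checked with a short conversion. This sketch assumes the common convention that λ is half the minimum feature size, which the slide does not state; the function name is hypothetical.

```python
def area_in_lambda_sq(area_mm2, feature_um):
    """Convert a silicon area to units of lambda^2.

    Assumption (not from the slide): lambda = feature size / 2."""
    lam_um = feature_um / 2.0
    area_um2 = area_mm2 * 1e6          # 1 mm^2 = 1e6 um^2
    return area_um2 / (lam_um ** 2)


queue = area_in_lambda_sq(10.0, 0.35)
print(f"issue queue: {queue / 1e6:.0f} M lambda^2")
print(f"ratio to the 68 M lambda^2 MIPS-X figure: {queue / 68e6:.1f}x")
```

Under that assumption the result lands a little above 300Mλ^2, consistent with the "300Mλ^2+" on the slide.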


Costs?

• Both DEC and “Quantifying” (also DEC)
  – appear to use a scoreboarded scheme to avoid
  – accept not issue until result computed?
• “Quantifying” suggests:
  – wakeup time ∝ IW^2 × WS^2
    • but assuming quadratic wire delay in length
    • (never buffer wire)
  – but WS = F(IW)
  – certainly faster than linear time
  – A ∝ IW × WS


Registers

• How many virtual registers needed?

[Hennessy and Patterson 4.43e2/3.41e3]


Register Costs?

• First Order
  – area linear in number of registers
  – delay linear in number of registers
• Bank RF
  – maybe sublinear delay
  – at least square root of number of registers
    • wire delay sqrt of area


RF and IW interaction

• Larger Issue (Decode)
  – want to read/retire more registers per cycle
  – RF ports = 3 IW [Op Rdst Rsrc1,Rsrc2]
  – A ∝ ports × number
  – …and number of registers = F(IW)
  – A ∝ IW × F(IW)
• RF grows faster than linear
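
The scaling argument can be tried with a concrete F(IW). This sketch uses the slide's model A ∝ ports × number of registers; the choice of F(IW) as 16 registers per issue slot is purely hypothetical, as are the function names.

```python
def rf_area(iw, regs_for_iw, ports_per_instr=3):
    """Relative register-file area using A ∝ ports × number of registers,
    with ports = 3 × IW (one destination and two sources per instruction)."""
    ports = ports_per_instr * iw
    return ports * regs_for_iw(iw)


def linear_regs(iw):
    # Hypothetical F(IW): 16 physical registers per issue slot.
    return 16 * iw


for iw in (1, 2, 4, 8):
    print(f"IW={iw}: relative RF area {rf_area(iw, linear_regs)}")
```

Even with F(IW) only linear, the area in this model grows as IW^2, which is the "faster than linear" conclusion.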


Bypass: Control

• Control comparison
  – every functional input (2 IW)
  – get input from
    • every pipestage (d) from issue produce to wb
    • for every result producer (>IW)
• Total comparisons: d × IW^2
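
The comparison count can be tallied explicitly. This sketch assumes exactly IW result producers per pipestage (the slide says at least IW) and keeps the factor of 2 for the two inputs per functional unit, which the slide's total drops as a constant; the function name is hypothetical.

```python
def bypass_comparisons(iw, d, inputs_per_fu=2, producers_per_stage=None):
    """Count bypass-control comparisons: every functional-unit input is
    compared against every result producer in each of d pipestages."""
    if producers_per_stage is None:
        producers_per_stage = iw       # assumption: exactly IW producers
    fu_inputs = inputs_per_fu * iw
    return fu_inputs * d * producers_per_stage


for iw in (2, 4, 8):
    print(f"IW={iw}, d=3: {bypass_comparisons(iw, d=3)} comparisons")
```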


Bypass: Interconnect

• Linear layout
  – bypass spans functional units and RF
  – physical RF grows with IW
    • read/write ports
    • more physical registers to support IW
  – FU bypass muxes grow with IW
• Consequently
  – width grows with IW
  – cycle grows with IW?


Bypass: Interconnect

• “Quantifying”
  – quadratic wire delay
  – (but asymptotically, we can buffer)
  – largest delay component calculated
    • (>1ns for IW=8) [180nm]
    • IW=8 about 5-6 times IW=4


Aliasing

• Do memory operations depend on one another?
• E.g.
    A[j+3]=x*x+y;
    Z=A[i-2]+A[i+2]
• Is A[i-2], A[i+2] another name for A[j+3]?
• E.g.
    *a++;
    *b+=3;
    *a++;
    d=*c+3;
• Are these operations all independent?
• Or do some name the same memory location?
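
For the array example, whether the reads alias the write depends on run-time values of i and j. The sketch below just evaluates the index expressions for given values; it is an illustration of the question, not a disambiguation mechanism from the lecture, and the function name is hypothetical.

```python
def aliasing_reads(i, j):
    """The write goes to A[j+3]; the reads are A[i-2] and A[i+2].

    Returns the read expressions that name the same element as the write
    for these particular values of i and j."""
    written = j + 3
    reads = {"A[i-2]": i - 2, "A[i+2]": i + 2}
    return [name for name, index in reads.items() if index == written]


print(aliasing_reads(i=5, j=0))   # i-2 == 3 == j+3
print(aliasing_reads(i=1, j=0))   # i+2 == 3 == j+3
print(aliasing_reads(i=0, j=0))   # neither read touches A[3]
```

Because the answer changes with i and j, the hardware (or compiler) cannot reorder the read past the write without either knowing the addresses or speculating.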


Aliasing

[Hennessy & Patterson Fig 3.43/e3]


…And now for something Completely Different


Different Solution

• These assume Number of Regs > IW
• If IW>R, different approach…
• From Henry, Kuszmaul, et al.
  – ARVLSI’99
  – SPAA’99
  – ISCA’00


Consider Machine

• Each FU has a full RF
• Build network between FUs
  – use network to connect produce/consume
  – use register names to configure interconnect
• Signal data ready along network


Ultrascalar: concept model


Ultrascalar concept

• Linear delay
• O(1) register cost / FU
• Complete renaming at each FU
  – different set of registers
  – so when we say complete RF at each FU, that’s only the logical registers


Ultrascalar: cyclic prefix


Parallel Prefix

• Basic idea is one we saw with adders
• An FU will either
  – produce a register (generate)
  – or transmit a register (propagate)
  – can do tree combining
    • pair of FUs will either both propagate or will generate
    • compute function by pair in one stage
    • recurse to next stage
    • get log-depth tree network connecting producer and consumer
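
The generate/propagate idea can be sketched for a single register. Assumptions: each FU either writes the register (a value) or passes it through (`None`), the window is treated as a plain non-cyclic sequence, and a Hillis-Steele style log-depth scan stands in for the tree network on the slide for brevity. All names are hypothetical.

```python
def combine(older, younger):
    """Associative operator: the younger FU's value wins if it generates,
    otherwise the older value propagates through."""
    return younger if younger is not None else older


def sequential_inputs(initial, writes):
    """Reference: the register value seen at the input of each FU."""
    seen, current = [], initial
    for w in writes:
        seen.append(current)
        current = combine(current, w)
    return seen


def prefix_inputs(initial, writes):
    """Same values computed with a log-depth inclusive scan."""
    vals = [initial] + list(writes)     # slot 0 holds the committed value
    n = len(vals)
    stride, stages = 1, 0
    while stride < n:
        vals = [combine(vals[k - stride], vals[k]) if k >= stride else vals[k]
                for k in range(n)]
        stride *= 2
        stages += 1
    # vals[k] now combines the initial value with writes[0..k-1],
    # which is exactly what FU k sees on its input.
    return vals[:-1], stages


writes = [None, 7, None, None, 3, None, None, None]   # 8 FUs, two generate
by_scan, stages = prefix_inputs(10, writes)
print(by_scan, "in", stages, "stages")
print(by_scan == sequential_inputs(10, writes))
```

Because `combine` is associative, the pairwise results can be grouped in any order, which is what allows the log-depth network in place of a linear chain.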


Ultrascalar: cyclic prefix


Cyclic Prefix

• Gets delay down to log(WS)
  – w/ linear layout, delay still linear
• Issue into, retire from Window in order
  – serves
    • rename
    • shared RF
    • issue
    • bypass
    • reorder


Ultrascalar: layout

[Figure annotations:]

Register paths not growing. (p=0 tree!) Wide, but constant width.

If memory width < √n: area goes as n, wire goes as √n.


Ultrascalar: asymptotics

• Assume M(n) < O(√n)
  – Area ~ n × R^2
  – Delay ~ (√n) × R
• Claim can do
  – Area ~ n × R
  – Delay ~ √(n × R)
• If memory grows faster, will dominate interconnect growth, hence area and delay
  – get extra term for memory growth (like Rent’s Rule)
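
The two sets of growth rates can be compared numerically. This sketch drops all constant factors and reads n as the number of FUs (window slots) and R as the number of logical registers, which is an interpretation of the slides rather than something they define; the function names are hypothetical.

```python
import math


def ultrascalar_basic(n, r):
    """(area, delay) up to constants: Area ~ n * R^2, Delay ~ sqrt(n) * R."""
    return n * r ** 2, math.sqrt(n) * r


def ultrascalar_claimed(n, r):
    """(area, delay) up to constants: Area ~ n * R, Delay ~ sqrt(n * R)."""
    return n * r, math.sqrt(n * r)


n, r = 64, 16
print("basic:  ", ultrascalar_basic(n, r))
print("claimed:", ultrascalar_claimed(n, r))
```

Only the ratios between rows are meaningful here, since the constants are unknown.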


UltraScalar:

• 0.25 µm
• 128-window, 32 logical regs
• 64b ops ?
• 8 instruction fetch
• delays <2ns [0.25µm]
  – commit, wakeup, schedule
  – wire delay dominates logic
• area ~2Gλ^2 (not including datapath)


Solution for:

• Object/binary compatibility is paramount
• Performance is King
• Recompilation not an option
• Cost (area, energy) is no object


Friday

• …an alternative way to exploit ILP
• rely on compiler and feedback

• [reminder: no lecture Wednesday]


(Semi?) Big Ideas

• Good to look at
  – Extremes (what can this possibly do?)
  – Sensitivity (how important is this to…)
• Balance
• Size Matters
• Interconnect delay dominates
• As parameters grow
  – watch tradeoffs
  – widely different solutions prevail at different points in space (different asymptotes)