
1

VLSI Design Challenges for Gigascale Integration

Shekhar Borkar
Intel Corp.
October 25, 2005

2

Outline

Technology scaling challenges
Circuit and design solutions
Microarchitecture advances
Multi-everywhere
Summary

3

Goal: 10 TIPS by 2015

[Chart: MIPS vs. year, 1970-2020, log scale from 0.01 to 10,000,000 - marking the 8086, 286, 386, 486, Pentium, Pentium Pro, and Pentium 4 architectures]

How do you get there?

4

Technology Scaling

[Diagram: planar MOSFET before and after scaling - gate, source, drain, body, junction depth Xj, oxide thickness Tox, effective channel length Leff]

Dimensions scale down by 30%
Doubles transistor density
Oxide thickness scales down
Faster transistor, higher performance
Vdd & Vt scaling: lower active power

Scaling will continue, but with challenges!
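A quick arithmetic check of the density claim (a minimal sketch; the ~0.7x shrink factor is the slide's, the rest is just geometry):

```python
# A ~30% shrink multiplies every linear dimension by ~0.7 per generation.
linear_shrink = 0.7

area_scale = linear_shrink ** 2      # transistor footprint -> ~0.49x
density_gain = 1 / area_scale        # so transistor density ~doubles (~2.04x)

print(f"area per transistor: {area_scale:.2f}x")
print(f"transistor density:  {density_gain:.2f}x")
```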

5

Technology Outlook

High Volume Manufacturing:  2004  2006  2008  2010  2012  2014  2016  2018
Technology Node (nm):         90    65    45    32    22    16    11     8
Integration Capacity (BT):     2     4     8    16    32    64   128   256
Delay = CV/I scaling:        0.7  ~0.7  >0.7  -> delay scaling will slow down
Energy/Logic Op scaling:   >0.35  >0.5  >0.5  -> energy scaling will slow down
Bulk Planar CMOS:           High Probability -> Low Probability
Alternate, 3G etc.:         Low Probability -> High Probability
Variability:                Medium -> High -> Very High
ILD (K):                      ~3    <3  -> reduces slowly towards 2-2.5
RC Delay:                      1     1     1     1     1     1     1     1
Metal Layers:                6-7   7-8   8-9  -> 0.5 to 1 layer per generation

6

The Leakage(s)…

[Charts: sub-threshold leakage Ioff (nA/um) vs. temperature (30-130 C) for 0.25 um, 90 nm, and 45 nm technologies; SD leakage power (W) vs. technology node (0.25u to 45nm) under 1.5X and 2X transistor-growth assumptions; gate leakage power (W) vs. technology node, rising further during burn-in at 1.4X Vdd]

[Image: 90 nm MOS transistor cross-section - 50 nm Si channel, 1.2 nm SiO2 gate oxide]

7

Must Fit in Power Envelope

[Chart: power (W) and power density (W/cm2), 0-1400, for a 10 mm die vs. technology node (90 nm to 16 nm), split into active power, SD leakage, and SiO2 gate leakage]

Technology, Circuits, and Architecture to constrain the power

8

Solutions

Move away from frequency alone to deliver performance
More on-die memory
Multi-everywhere
– Multi-threading
– Chip-level multi-processing
Throughput-oriented designs
Valued performance by higher level of integration
– Monolithic & polylithic

9

Leakage Solutions

[Diagrams: planar transistor with SiGe source/drain and 1.2 nm SiO2 gate on a silicon substrate; tri-gate transistor; gate stack with 3.0 nm high-k dielectric and gate electrode]

For a few generations, then what?

10

Active Power Reduction

Multiple Supply Voltages
[Diagram: slow paths run from a low supply voltage, fast paths from a high supply voltage]

Throughput Oriented Designs
Baseline logic block (Vdd):             Freq = 1,   Vdd = 1,   Throughput = 1, Power = 1,    Area = 1, Power Density = 1
Two parallel logic blocks (Vdd/2 each): Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Power Density = 0.125
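The Vdd/2 numbers can be sanity-checked with the usual dynamic-power relation P ~ C·Vdd^2·f; a minimal sketch, assuming (as the slide does) that frequency scales roughly linearly with supply voltage:

```python
# Throughput-oriented design: replace one fast logic block with two slow ones.
def dynamic_power(vdd, freq, cap=1.0):
    """Dynamic power ~ C * Vdd^2 * f (normalized units)."""
    return cap * vdd**2 * freq

baseline = dynamic_power(vdd=1.0, freq=1.0)       # power = 1, throughput = 1
slow_block = dynamic_power(vdd=0.5, freq=0.5)     # power = 0.125 per block

blocks = 2                                         # duplicate the block: area = 2
total_power = blocks * slow_block                  # 0.25; throughput = 2 * 0.5 = 1
power_density = total_power / blocks               # 0.125 (power spread over 2x area)

print(total_power, power_density)                  # 0.25 0.125
```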

11

Leakage Control

Body Bias (Vbp, Vbn; -Ve / +Ve bias): 2-10X leakage reduction
Sleep Transistor (gates the supply to the logic block): 2-1000X reduction
Stack Effect (equal loading): 5-10X reduction

12

Optimum Frequency

Maximum performance with
• Optimum pipeline depth
• Optimum frequency

[Charts: (1) process technology - sub-threshold leakage increases exponentially with relative frequency; (2) power efficiency vs. relative pipeline depth has an optimum; (3) performance vs. relative frequency (pipelining) shows diminishing returns]

13

Memory Latency

[Diagram: CPU -> small cache (~few clocks) -> large memory (50-100 ns)]

[Chart: memory latency in clocks vs. core frequency (100 MHz to 10 GHz, log-log), assuming 50 ns memory latency]

Cache miss hurts performance
Worse at higher frequency
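The latency-in-clocks curve is just a fixed DRAM access time multiplied by the core frequency; a small worked example using the slide's assumed 50 ns latency:

```python
# A fixed 50 ns memory latency costs more core cycles as frequency rises.
MEM_LATENCY_NS = 50

for freq_ghz in (0.5, 1.0, 2.0, 3.0):
    miss_cost_clocks = MEM_LATENCY_NS * freq_ghz   # ns * GHz = clock cycles
    print(f"{freq_ghz:.1f} GHz -> {miss_cost_clocks:.0f} clocks per miss")
# 0.5 GHz -> 25, 1.0 GHz -> 50, 2.0 GHz -> 100, 3.0 GHz -> 150
```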

14

Increase on-die Memory

Large on-die memory provides:
1. Increased data bandwidth & reduced latency
2. Hence, higher performance for much lower power

[Chart: power density (W/cm2, log scale 1-100) - logic vs. memory]

[Chart: cache as % of total die area (0-100%) vs. technology (1u to 65nm) for the 486, Pentium, Pentium III, Pentium 4, and Pentium M]

15

Multi-threading

[Chart: relative performance vs. cache hit rate (100%, 98%, 96%) at 1, 2, and 3 GHz]

Single Thread (ST): execute, then wait for memory (hardware sits idle)
Multi-Threading (MT1, MT2, MT3): while one thread waits for memory, the others execute -> full HW utilization

Multi-threading improves performance without impacting thermals & power delivery
Thermals & power delivery are designed for full HW utilization
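A toy utilization model illustrates the point; the parameters below (one memory reference per instruction, base CPI of 1, 50 ns memory latency, idealized thread switching) are illustrative assumptions, not figures from the talk:

```python
# Toy model: hardware threads fill the stall cycles left by cache misses.
def relative_perf(hit_rate, freq_ghz, threads=1, mem_ns=50, base_cpi=1.0):
    miss_penalty = mem_ns * freq_ghz                    # clocks per miss
    stall_cpi = (1 - hit_rate) * miss_penalty           # avg stall clocks per instruction
    hidden = min(stall_cpi, (threads - 1) * base_cpi)   # work supplied by other threads
    return freq_ghz / (base_cpi + stall_cpi - hidden)   # relative instructions per ns

for hit in (1.00, 0.98, 0.96):
    st = relative_perf(hit, freq_ghz=3.0, threads=1)
    mt = relative_perf(hit, freq_ghz=3.0, threads=4)
    print(f"hit rate {hit:.0%}: 1 thread {st:.2f}, 4 threads {mt:.2f}")
```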

16

Single Core Power/Performance

[Bar charts: relative increase (X) in area and performance, and in power and MIPS/W, for pipelined, superscalar, OOO-speculative, and deep-pipeline microarchitectures]

Moore's Law: more transistors for advanced architectures
Delivers higher peak performance
But… lower power efficiency

17

Chip Multi-Processing

[Chart: relative performance (1-3.5) vs. die area and power (1-4X), multi-core vs. single core]

[Diagram: four cores (C1-C4) sharing a cache]

• Multi-core, each core multi-threaded
• Shared cache and front side bus
• Each core has different Vdd & Freq
• Core hopping to spread hot spots
• Lower junction temperature

18

Dual Core

Rule of thumb:
Voltage  Frequency  Power  Performance
1%       1%         3%     0.66%

Single core (core + cache):    Voltage = 1,    Freq = 1,    Area = 1, Power = 1, Perf = 1
Dual core (2 cores + cache):   Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8

In the same process technology…
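Applying the rule of thumb to the dual-core column gives a rough consistency check (a sketch: the P ~ V^2·f assumption and the rounding of total power to ~1 are mine, the 0.66%-per-1%-frequency sensitivity is the slide's):

```python
# Dual-core rule of thumb: drop Vdd and frequency by 15%, double the cores.
dv = 0.15

core_power = (1 - dv) ** 3        # P ~ V^2 * f  -> ~0.61 per core
core_perf  = 1 - 0.66 * dv        # 0.66% perf per 1% freq -> ~0.90 per core

dual_power = 2 * core_power       # ~1.2x; the slide rounds this to ~1
                                  # (e.g. the cache is shared, not duplicated)
dual_perf  = 2 * core_perf        # ~1.8x, matching the slide

print(f"dual-core power ~{dual_power:.2f}, performance ~{dual_perf:.2f}")
```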

19

Multi-Core

[Diagram: one large core with cache vs. four small cores (C1-C4) with shared cache - each small core: Power = 1/4, Performance = 1/2]

Multi-Core:
Power efficient
Better power and thermal management
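A minimal sketch of the trade shown here, assuming the slide's per-core figures (small core: 1/4 the power, 1/2 the performance) and a workload that parallelizes across all cores:

```python
# One large core vs. four quarter-power small cores in the same power budget.
configs = {
    "1 large core":  {"cores": 1, "power_per_core": 1.00, "perf_per_core": 1.0},
    "4 small cores": {"cores": 4, "power_per_core": 0.25, "perf_per_core": 0.5},
}

for name, c in configs.items():
    power = c["cores"] * c["power_per_core"]
    throughput = c["cores"] * c["perf_per_core"]
    print(f"{name}: power {power:.2f}, throughput {throughput:.2f}")
# 1 large core : power 1.00, throughput 1.00
# 4 small cores: power 1.00, throughput 2.00
```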

20

Special Purpose Hardware

TCP/IP Offload Engine
[Die photo: TCB, exec core, PLL, OOO/ROB, ROM, CAM, CLB, input sequencer, send buffer - 2.23 mm x 3.54 mm, 260K transistors]

[Chart: MIPS (log scale) vs. year (1995-2015) - general-purpose MIPS at 75 W vs. TCP/IP offload engine (TOE) MIPS at ~2 W]

Opportunities: network processing engines, MPEG encode/decode engines, speech engines

Special purpose HW provides the best MIPS/Watt

21

Performance Scaling

Amdahl's Law: parallel speedup = 1 / (Serial% + (1 - Serial%)/N)

[Chart: performance vs. number of cores (0-30) for two serial fractions]

Serial% = 6.7%: at N = 16 cores, Perf = 8 (N/2)
Serial% = 20%: at N = 6 cores, Perf = 3 (N/2)

Parallel software is key to multi-core success
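The formula on the slide reproduces both annotated points directly:

```python
# Amdahl's Law as stated on the slide: speedup = 1 / (s + (1 - s) / N),
# where s is the serial fraction and N the number of cores.
def amdahl_speedup(serial_fraction, n_cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

print(amdahl_speedup(0.067, 16))   # ~8.0 -> 16 cores deliver only ~8x
print(amdahl_speedup(0.20, 6))     # ~3.0 -> 6 cores deliver only ~3x
```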

22

From Multi to Many…

[Chart: system performance (0-30) for throughput (TPT), one-app, two-app, four-app, and eight-app workloads on large, medium, and small cores]

[Chart: relative single-core performance - large = 1, medium = 0.5, small = 0.3]

13 mm die, 100 W, 48 MB cache, 4B transistors in 22 nm: 12 large cores, 24 medium cores, or 144 small cores

23

Future Multi-core Platform

[Diagram: many general-purpose (GP) cores and special-purpose (SP) hardware blocks tied together by an interconnect fabric]

Heterogeneous Multi-Core Platform

24

The New Era of Computing

[Chart: MIPS vs. year (1970-2010, log scale), annotated by era]

Era of Pipelined Architecture: 8086, 286, 386, 486
Era of Instruction-Level Parallelism: super scalar; speculative, OOO
Era of Thread & Processor-Level Parallelism: multi-threaded; multi-threaded, multi-core; special purpose HW

Multi-everywhere: MT, CMP

25

Summary

Business as usual is not an option
– Performance at any cost is history
Must make a Right Hand Turn (RHT)
– Move away from frequency alone
Future architectures and designs
– More memory (larger caches)
– Multi-threading
– Multi-processing
– Special purpose hardware
– Valued performance with higher integration