
  • Vlsi_03_2005.ppt/April_2005 1

    VLSI Trends in Microarchitecture
    Past, present and future

    TAU University, January 24, 2006

    Uri Weiser

  • Vlsi_03_2005.ppt/April_2005 2

    Agenda

    Microarchitecture – VLSI – Trends

    – Past and present:
      » Pipeline, superpipeline
      » Out of Order
      » Branch prediction
      » Caches
      » Trace cache
      » Threads and Chip Multiprocessing
    – Future:
      » Asymmetric
      » Accelerators

  • Vlsi_03_2005.ppt/April_2005 3

    “[In the beginning] we had little idea of what we had started. ...I remember... saying, ‘Okay, we’ve done integrated circuit. What do we do next?’”

    Gordon E. Moore

  • Vlsi_03_2005.ppt/April_2005 4

    Trends in VLSI

    Sources: Shekhar Borkar, Uri Weiser

  • Vlsi_03_2005.ppt/April_2005 5

    Process Technology

    Process generations: 1.5µ, 1.0µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ

    Processors: Intel386™ DX, Intel486™ DX, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4

    Technology trend

    Trends

  • Vlsi_03_2005.ppt/April_2005 6

    Performance History

    [Chart: frequency (MHz, log scale) vs. process generation, 1.0µ to 0.18µ, split into Freq (Process) and Freq (uArch), for the i486, Pentium® proc and Pentium® II & III proc]

    1.0µ-0.18µ, 1989-2001: frequency increased 61X
    1. 18.3X due to process technology
    2. Additional 3.3X due to uArch

    [Chart: relative performance (log scale) vs. process generation, 1.0µ to 0.18µ, split into perf due to frequency and perf due to uArchitecture, for the i486, Pentium® proc, Pentium® II & III proc and Pentium® 4 proc]

    Performance increased ~100X
    1. 14X due to process tech
    2. Additional 7X due to uArch & design

    Trends

  • Vlsi_03_2005.ppt/April_2005 7

    Process Technology: Minimum Feature Size

    [Chart: feature size (microns, log scale, 0.01-10) vs. year ('68-'08), Intel data and the SIA roadmap]

    Source: Intel, SIA Technology Roadmap

    Trends

  • Vlsi_03_2005.ppt/April_2005 8

    Transistors on a Chip

    [Chart: transistors per chip (millions, log scale) vs. year (1970-2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, P6 and Pentium® 4 proc: 2X growth in 1.96 years]

    Transistors on a chip doubled every two years

    Trends

  • Vlsi_03_2005.ppt/April_2005 9

    Die Size Growth

    [Chart: die side (mm) = Area^1/2 (log scale) vs. year (1970-2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc and P6]

    Die size grows? Is it saturated?

    Trends

  • Vlsi_03_2005.ppt/April_2005 10

    Frequency

    [Chart: frequency (MHz, log scale, 0.1-10000) vs. year (1970-2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc and P6: doubles every 2 years]

    Lead microprocessor frequency doubles every 2 years

    Trends

  • Vlsi_03_2005.ppt/April_2005 11

    Frequency of Operation

    [Chart: frequency (MHz, log scale, 1-1000) vs. year (1970-2000) for Intel, PowerPC, Alpha and other processors: 8080, 8085, 8086, 8088, 80286, 386DX-16/33, 486DX-25/33/50, 486DX2-66, Pentium 66/90, Pentium Pro 225/300]

    Trends

  • Vlsi_03_2005.ppt/April_2005 12

    Frequency of Operation (cont.)

    [Chart: frequency (MHz, 0-400) vs. year (1992-1998) for Intel (486DX2, Pentium 66/90/100/120/133, Pentium Pro 150/180/225/300), PowerPC (601-100, 604-133/150, 620-200) and others (Alpha 21064A-275, 21164-300/333, 21164A-417, MIPS R4400-200)]

    Trends

  • Vlsi_03_2005.ppt/April_2005 13

    Brainiacs and Speed demons

    Source: ISCA 95, p. 174

    [Chart: SPECint92/MHz (0.5-2.0) vs. frequency (50-400 MHz) for Alpha (21064, 21164), x86 (Pentium, Pentium Pro) and PowerPC parts, with iso-performance curves at SPECint92 = 100, 200, 300, 400; "Brainiacs" rank high on SPECint92/MHz (labelled points around 1.6), "Speed demons" rank high on MHz (labelled points around 0.57-0.6 SPECint92/MHz)]

    Trends

  • Vlsi_03_2005.ppt/April_2005 14

    Trends of Future Processors

    [Chart: SPECint/100MHz (0-8) vs. frequency (0-2500 MHz) for the PII, PIII, P IV and P M, with iso-performance curves at SPECint = 4, 8, 12, 18, 27, 40, 60, 90]

    Trends

  • Vlsi_03_2005.ppt/April_2005 15

    [Chart: power density (Watts/cm², log scale, 1-1000) vs. process generation (1.5µ-0.07µ) for the i386, i486, Pentium®, Pentium® Pro, Pentium® II and Pentium® III processors, heading toward the power density of a hot plate, rocket nozzle, nuclear reactor and the Sun's surface]

    Power density continues to get worse

    Trends

  • Vlsi_03_2005.ppt/April_2005 16

    On Die Cache Memory

    [Diagram: die plots at 0.18µ, 0.13µ and 0.10µ showing core logic vs. cache area; chart: cache % of full chip area (0-100%) vs. process generation (0.7µ-0.10µ) for Pentium, Pentium Pro, Pentium II, Pentium III and Pentium III & 4]

    Larger % of die area will be memory

    Trends

  • Vlsi_03_2005.ppt/April_2005 17

    Process trend – the theory (cont)
    Performance driven era vs. Power aware era

    Process generation p vs. p-1, @ the same die area

    "Old" processes (generations n..n+3, ~1µ down to 0.35µ) - performance driven era (no power constraints):
    – 1.4X frequency, 0.9X voltage, 0.7X capacitance/transistor, 1X area, 2X transistors
    – Power increase/generation: 1.6X

    "New" processes (generations m..m+3, ~0.25µ down to 0.09µ) - power aware era (performance within the power envelope):
    – 1.4X frequency, 0.75X voltage, 0.7X capacitance/transistor, 1X area, 2X transistors
    – Power increase/generation: 1.1X

    Reference: Eric Sprangle
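The two power numbers can be sanity-checked with the classic dynamic-power relation P ∝ N·C·V²·f; a minimal sketch (illustrative only, leakage ignored):

```python
def power_ratio(freq, voltage, cap_per_transistor, transistors):
    """Dynamic power ratio of process generation p vs. p-1: P ~ N * C * V^2 * f."""
    return transistors * cap_per_transistor * voltage ** 2 * freq

# Performance driven era ("old" processes): ~1.6X power per generation
print(round(power_ratio(1.4, 0.9, 0.7, 2), 2))   # -> 1.59

# Power aware era ("new" processes): ~1.1X power per generation
print(round(power_ratio(1.4, 0.75, 0.7, 2), 2))  # -> 1.1
```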

    Trends

  • Vlsi_03_2005.ppt/April_2005 18

    Processor roadmap trend – real life (cont)
    Extension of Pollack's Rule (Micro32, 1999)

    Processor generation k vs. k-1, compacted @ the same process technology

    [Chart: growth (X, 0-4) in power and in performance per processor generation, across technology generations 1.5µ-0.18µ; annotated perf/power delta ratios: 1 : >3 and 3 : 1]
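Pollack's Rule says integer performance grows roughly as the square root of the complexity (area/power) spent on a core within one process generation; a minimal sketch with illustrative area ratios (not the chart's data):

```python
import math

def pollack_performance(area_ratio):
    """Pollack's Rule: performance grows roughly as sqrt(area/complexity)
    when a bigger microarchitecture is built on the same process."""
    return math.sqrt(area_ratio)

for area in (2.0, 3.0):
    # power tracks area roughly linearly, performance only as its square root
    print(f"{area:.0f}X area/power -> ~{pollack_performance(area):.2f}X performance")
```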

    Trends

  • Vlsi_03_2005.ppt/April_2005 19

    Microarchitecture

  • Vlsi_03_2005.ppt/April_2005 20

    The Generic Processor

    [Diagram: instruction supply -> execution engine <-> data supply]

    Sophisticated organization to "service" instructions

    Instruction supply
    – Instruction cache
    – Branch prediction
    – Instruction decoder
    – ...

    Execution engine
    – Instruction scheduler
    – Register files
    – Execution units
    – ...

    Data supply
    – Data cache
    – TLBs
    – ...

    Goal - maximum throughput – balanced design

    Microarchitecture

  • Vlsi_03_2005.ppt/April_2005 21

    “The Core” - A Block Diagram

    [Block diagram: a scheduler and register file feeding n execution ports - AGU+MEM, ALU1, ALU2, FPU - with a data cache and D-TLB on the memory path; labelled paths: instruction flow, read sources, write destination, write bypass, write back]

    Microarchitecture

  • Vlsi_03_2005.ppt/April_2005 22

    Parallelism Evolution

    [Diagram: instruction streams flowing through processor elements (PE) for each configuration: basic configuration, pipeline, VLIW, superscalar in-order, superscalar out-of-order]

    Parallelism

  • Vlsi_03_2005.ppt/April_2005 23

    Pipeline
    Break the work into smaller pieces

    Increased throughput
    – Increased # of completed instructions per cycle and reduced cycle time
    – Number of stages varies
      » Small: 4-5 (Pentium), "superpipeline" ~14 (Pentium Pro), "ultra-pipeline" ~25 (P4)

    Calls for good balancing among the stages

    [Diagram: instructions I1, I2, I3 flowing through a 4-stage pipeline over cycles 0-12; F: Fetch, D: Decode, E: Execute, W: Write Back]

    Unpipelined: 1/4 IPC = 4 CPI; pipelined: 1 IPC = 1 CPI
    IPC = Instructions Per Cycle, CPI = Cycles Per Instruction

    Examples: Intel 486, NS 32532
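The 1-IPC limit follows from the ideal pipeline timing formula, cycles = stages + instructions - 1; a minimal sketch (stall-free, which real pipelines are not):

```python
def pipelined_cpi(stages, instructions):
    """Ideal (stall-free) pipeline: total cycles = stages + instructions - 1."""
    return (stages + instructions - 1) / instructions

print(round(pipelined_cpi(4, 4), 2))     # short burst: CPI = 1.75, still far from 1
print(round(pipelined_cpi(4, 1000), 2))  # long stream: CPI -> 1.0 (i.e. 1 IPC)
```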

    Parallelism

  • Vlsi_03_2005.ppt/April_2005 24

    Pipeline Stalls

    But there are "stalls" in the pipeline
    – "Data hazards": data flow dependency (instruction output/input)
      » Solved by: bypasses, renaming
    – "Control hazards": control flow dependencies
      » Solved by: branch prediction
    – "Structural hazards": limited resources
    – Other (cache misses, long latency instructions, page faults, ...)

    [Diagram: pipeline timing over cycles 0-12 showing a data flow stall (w/o bypass), a control flow stall and an address generation interlock; F: Fetch, D: Decode, E: Execute, W: Write Back]
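A standard way to account for these stalls (a sketch with made-up rates and penalties, not measurements from the slide) is effective CPI = base CPI + Σ(stalls per instruction × penalty):

```python
def effective_cpi(base_cpi, stall_events):
    """stall_events: list of (events per instruction, penalty cycles per event)."""
    return base_cpi + sum(rate * penalty for rate, penalty in stall_events)

# e.g. 5% load-use stalls costing 1 cycle, 4% mispredicted branches costing 3 cycles
print(round(effective_cpi(1.0, [(0.05, 1), (0.04, 3)]), 2))  # -> 1.17
```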

    Parallelism

  • Vlsi_03_2005.ppt/April_2005 25

    Super Scalar
    Performs more in a single cycle

    Ideally, can multiply the throughput
    – But stalls occur more frequently

    [Diagram: a two-wide pipeline (F D E W) completing two instructions per cycle over cycles 0-12, with a stall]

    2 IPC = 1/2 CPI

    Examples: Intel Pentium® proc., Alpha 21164

    Parallelism

  • Vlsi_03_2005.ppt/April_2005 26

    Super Pipeline
    Split into shorter stages - allows higher frequency

    Ideally, can (again) multiply the throughput, but
    – Stall penalties do not scale (e.g., control flow stalls, cache misses)
    – Clock setup/hold reduces net cycle time - each instruction takes longer!

    In the example above: 2X the stages, but the performance gain is less than 2X
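To make that concrete (a sketch with invented numbers, not figures from the deck): add a fixed per-stage latch/skew overhead and a stall penalty that grows with depth, and 2X the stages returns well under 2X the performance:

```python
def superpipeline_speedup(stages_ratio, logic_delay, stage_overhead,
                          stall_rate, stall_penalty_stages):
    """Relative throughput of a pipeline with stages_ratio times more stages."""
    base_cycle = logic_delay + stage_overhead
    new_cycle = logic_delay / stages_ratio + stage_overhead   # shorter logic, same overhead
    freq_gain = base_cycle / new_cycle
    # stall penalty (in cycles) grows with the number of stages flushed / waited on
    base_cpi = 1 + stall_rate * stall_penalty_stages
    new_cpi = 1 + stall_rate * stall_penalty_stages * stages_ratio
    return freq_gain * base_cpi / new_cpi

# 2X the stages, 10% per-stage overhead, 4% stalls costing 5 stages originally
print(round(superpipeline_speedup(2, 1.0, 0.1, 0.04, 5), 2))  # -> ~1.57X, not 2X
```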

  • Vlsi_03_2005.ppt/April_2005 27

    Out Of Order Execution

    In Order Execution: instructions are processed in their program order
    – Limitation to potential parallelism

    OOO: instructions are executed based on "data flow" rather than program order

    Before (src -> dest):
    (1) load (r10), r21
    (2) mov r21, r31   (2 depends on 1)
    (3) load a, r11
    (4) mov r11, r22   (4 depends on 3)
    (5) mov r22, r23   (5 depends on 4)

    After:
    (1) load (r10), r21;   (3) load a, r11;
    (2) mov r21, r31;      (4) mov r11, r22;
                           (5) mov r22, r23;

    Usually highly superscalar

    [Diagram: cycle-by-cycle pipeline timing (F, D, E, W) of instructions 1-5 for in-order vs. out-of-order processing, assuming unlimited resources and a 2-cycle load latency]

    Examples: Intel Pentium® II/III/4, Compaq Alpha 21264
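A toy sketch of "data flow rather than program order" (illustrative only, not any real scheduler): every cycle, issue whichever instructions have all of their sources ready.

```python
import math

# Toy OOO issue sketch: each instruction is (name, source regs, destination reg).
program = [
    ("1: load (r10),r21", [], "r21"),
    ("2: mov r21,r31",    ["r21"], "r31"),
    ("3: load a,r11",     [], "r11"),
    ("4: mov r11,r22",    ["r11"], "r22"),
    ("5: mov r22,r23",    ["r22"], "r23"),
]

LOAD_LATENCY = 2
# Registers written inside the window are "not ready" until their producer issues.
ready_at = {dst: math.inf for _, _, dst in program}
issued = set()
cycle = 0
while len(issued) < len(program):
    for name, srcs, dst in program:
        # data-flow order: issue as soon as all sources are available
        if name not in issued and all(ready_at.get(s, 0) <= cycle for s in srcs):
            ready_at[dst] = cycle + (LOAD_LATENCY if "load" in name else 1)
            issued.add(name)
            print(f"cycle {cycle}: issue {name}")
    cycle += 1
```

Run it and instructions 1 and 3 issue together, then 2 and 4, then 5, matching the reordered sequence above.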

    OOO

  • Vlsi_03_2005.ppt/April_2005 28

    Out Of Order (cont.)

    Advantages
    – Helps exploit Instruction Level Parallelism (ILP)
    – Helps cover latencies (e.g., cache miss, divide)
    – Artificially increases the register file size (i.e., number of registers)
    – Superior/complementary to the compiler scheduler
      » Dynamic instruction window
      » Makes use of more registers than the architectural registers

    Complex microarchitecture
    – Complex scheduler, involving also
      » Large instruction window
      » Speculative execution
    – Requires a reordering back-end mechanism (retirement) for:
      » Precise interrupt resolution
      » Misprediction/speculation recovery
      » Memory ordering

    OOO

  • Vlsi_03_2005.ppt/April_2005 29

    Speculation

  • Vlsi_03_2005.ppt/April_2005 30

    Branch Prediction
    Goal - ensure enough instruction supply by correct prefetching

    In the past - the prefetcher assumed fall-through
    – Loses on unconditional branches (e.g., call)
    – Loses on frequently taken branches (e.g., loops)

    Branch prediction
    – Predicts whether a branch is taken/not taken
    – Predicts the branch target address

    Misprediction cost varies (higher with increased pipeline length)
    Typical branch prediction rates: ~90%-96%
    – 4%-10% misprediction
    – 10-25 branches between mispredictions
    – 50-125 instructions between mispredictions

    Misprediction cost increases with
    – Pipeline depth
    – Machine width
      » e.g., 3 wide x 10 stages = 30 instructions flushed!
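One simple direction predictor consistent with hit rates in this range is a table of 2-bit saturating counters; a minimal sketch (illustrative, not the scheme of any particular processor named here):

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self, table_bits=10):
        self.counters = [1] * (1 << table_bits)   # start weakly not-taken
        self.mask = (1 << table_bits) - 1

    def predict(self, branch_ip):
        return self.counters[branch_ip & self.mask] >= 2

    def update(self, branch_ip, taken):
        i = branch_ip & self.mask
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)

# A loop branch taken 9 times then not taken: only the first and last iterations miss.
bp, misses = TwoBitPredictor(), 0
for taken in [True] * 9 + [False]:
    misses += bp.predict(0x400) != taken
    bp.update(0x400, taken)
print(misses)  # -> 2
```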

    Speculation - BP

  • Vlsi_03_2005.ppt/April_2005 31

    Target Array + Direction Prediction
    Target and direction are predicted separately; the tag may be partial

    [Diagram: the branch IP indexes a target array of (tag, predicted target) entries - a hit indicates a branch and supplies the predicted target address - while a separate direction predictor (for conditional branches only) supplies the predicted direction (taken/not-taken)]
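A minimal sketch of such a target array lookup, with made-up sizes and a partial tag (not the organization of any specific core):

```python
class TargetArray:
    """Direct-mapped target array (BTB): partial tag + predicted target per entry."""
    def __init__(self, index_bits=9, tag_bits=8):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)   # each entry: (partial_tag, target)
        self.index_mask = (1 << index_bits) - 1
        self.tag_mask = (1 << tag_bits) - 1

    def _split(self, branch_ip):
        return branch_ip & self.index_mask, (branch_ip >> self.index_bits) & self.tag_mask

    def lookup(self, branch_ip):
        idx, tag = self._split(branch_ip)
        entry = self.entries[idx]
        if entry and entry[0] == tag:    # hit: a known branch, supply its predicted target
            return entry[1]
        return None                      # miss: assume fall-through / not a branch

    def update(self, branch_ip, taken_target):
        idx, tag = self._split(branch_ip)
        self.entries[idx] = (tag, taken_target)

ta = TargetArray()
ta.update(0x401200, 0x401000)
print(hex(ta.lookup(0x401200)))  # -> 0x401000
```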

    Speculation - BP

  • Vlsi_03_2005.ppt/April_2005 32

    Speculative Execution
    Execution of instructions from a predicted (yet unsure) path; eventually, the path may turn out to be wrong

    Advantages:
    – Ensures instruction supply
    – Allows a large scheduling window (for out of order)

    Issues:
    – Misprediction cost
    – Misprediction recovery

    Speculation - execution

  • Vlsi_03_2005.ppt/April_2005 33

    Cache - Motivation & Principle
    Memory consumption is growing about 2X every 2 years
    – Typical size: (Y2000) 64M-128M, (Y2002) 128M-256M

    CPU speed grows faster than memory and buses
    – The CPU/bus ratio grew from 1:1 to 6:1, and is still growing

      486: 25-66 MHz CPU / 33 MHz bus
      Pentium: 66-233 MHz / 66 MHz
      P-II: 200-450 MHz / 66-100 MHz
      P-III: 0.5-1.33 GHz / 133-200 MHz
      P4: 1.4-2.4 GHz / 400 MHz

    – Memory: DRAM: 60-100ns ("10-16MHz"), Cost:
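The usual way to quantify how a cache bridges this gap (a sketch with illustrative numbers, not figures from the slide) is average memory access time = hit time + miss rate × miss penalty:

```python
def amat(hit_time_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time, in CPU cycles."""
    return hit_time_cycles + miss_rate * miss_penalty_cycles

# e.g. 1-cycle L1 hit, 5% miss rate, ~100-cycle DRAM access at a 6:1 core/bus ratio
print(round(amat(1, 0.05, 100), 2))  # -> 6.0 cycles on average
print(round(amat(1, 0.00, 100), 2))  # perfect cache: 1.0
```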

  • Vlsi_03_2005.ppt/April_2005 34

    The Generic Processor

    [Diagram: instruction supply - instruction cache, BTB, instruction fetch/decode, rename, and a trace/decoded cache - feeding the scheduler, execution engine and data supply]

    Speculation – Trace Cache

  • Vlsi_03_2005.ppt/April_2005 35

    Fetch bandwidth example

    [Diagram: control flow graph with instruction blocks A, B, C, D, E; the dynamic instruction stream interleaves these blocks over time (e.g., ... A C B B C A ...)]

    Speculation – Trace Cache

  • Vlsi_03_2005.ppt/April_2005 36

    Trace Cache Concept

    • Hold in the "instruction" cache the dynamic stream of the executed instructions

    => The trace cache acts as a "branch predictor" + wide instruction supplier

    Speculation – Trace Cache

  • Vlsi_03_2005.ppt/April_2005 37

    Trace Cache Overview

    [Diagram: in build mode, the fetch address goes to the I-cache and decoder, and the fill unit assembles the decoded blocks (A B C) into a trace stored in the trace cache; in stream mode, the trace cache supplies the trace (A B C) directly to the execution core]

    Speculation – Trace Cache

  • Vlsi_03_2005.ppt/April_2005 38

    Trace cache line

    • Tag: identifies the starting address of the trace
    • N instructions (potentially decoded)
    • Next address: next fetch address
    • Path info: branch flags (T, NT), number of branches, does the trace end with a branch?, ...

    [Line layout: PC Tag | N Instructions | Next address | path info]
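A minimal sketch of such a line as a data structure (field names are illustrative, not taken from any specific design):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TraceCacheLine:
    tag: int                      # starting address of the trace
    instructions: List[str]       # up to N (possibly decoded) instructions
    next_address: int             # where to fetch after this trace
    branch_flags: List[bool] = field(default_factory=list)  # path info: T / NT per branch
    ends_with_branch: bool = False

    def matches(self, fetch_ip: int, predicted_path: List[bool]) -> bool:
        """Hit only if both the start address and the predicted path match."""
        return self.tag == fetch_ip and predicted_path[:len(self.branch_flags)] == self.branch_flags

line = TraceCacheLine(tag=0x1000, instructions=["A0", "A1", "B0", "C0"],
                      next_address=0x1040, branch_flags=[True, True])
print(line.matches(0x1000, [True, True, False]))  # -> True
```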

    Speculation – Trace Cache

  • Vlsi_03_2005.ppt/April_2005 39

    Threads

  • Vlsi_03_2005.ppt/April_2005 40

    Scalar Execution

    Dependencies reduce throughput/utilization


    Threads

  • Vlsi_03_2005.ppt/April_2005 41

    Superscalar Execution

    Generally increases throughput, but decreases utilization


    Threads

  • Vlsi_03_2005.ppt/April_2005 42

    Predication

    Generally increases utilization, increases throughput less (much of the utilization is thrown away)


    Threads

  • Vlsi_03_2005.ppt/April_2005 43

    CMP – Chip Multi-Processor

    Low utilization / higher throughput

    Threads

  • Vlsi_03_2005.ppt/April_2005 44

    Blocked Multithreading

    May increase utilization and throughput, but must switch when the current thread goes into a low-utilization/throughput section (e.g., an L2 cache miss)
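A rough model of the benefit (invented numbers, not data from the deck): a single thread stalls for the full L2 miss penalty, while blocked multithreading switches to another ready thread and pays only the switch cost:

```python
def blocked_mt_utilization(miss_rate, miss_penalty, switch_cost, num_threads):
    """Fraction of cycles doing useful work, per instruction slot."""
    if num_threads == 1:
        return 1 / (1 + miss_rate * miss_penalty)   # stall on every L2 miss
    # assume enough ready threads to hide the whole miss: pay only the switch cost
    return 1 / (1 + miss_rate * switch_cost)

print(round(blocked_mt_utilization(0.01, 200, 0, 1), 2))   # single thread: ~0.33
print(round(blocked_mt_utilization(0.01, 200, 5, 4), 2))   # blocked MT:    ~0.95
```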

    Threads

  • Vlsi_03_2005.ppt/April_2005 45

    Fine Grained Multithreading

    Increases utilization/throughput by reducing impact of dependences


    Threads

  • Vlsi_03_2005.ppt/April_2005 46

    Simultaneous Multithreading


    Increases utilization/throughput

    Threads

  • Vlsi_03_2005.ppt/April_2005 47

    Future

  • Vlsi_03_2005.ppt/April_2005 48

    Analog Circuit Paradigm

    GBWP = Gain-Bandwidth Product = constant @ a given technology

    [Chart: gain vs. frequency for two operating points, (Gain1, BW1) and (Gain2, BW2)]

    e.g. Gain1*BW1 = Gain2*BW2
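A one-liner makes the trade-off concrete (the GBWP value is an arbitrary assumption): fixing the gain-bandwidth product, more gain means proportionally less bandwidth:

```python
GBWP = 100e6  # assumed gain-bandwidth product of the technology, in Hz

for gain in (1, 10, 100):
    print(f"gain {gain:>3} -> bandwidth {GBWP / gain / 1e6:.0f} MHz")
```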

    Future

  • Vlsi_03_2005.ppt/April_2005 49

    Analog Circuit Paradigm (cont.)

    [Chart: gain vs. frequency - a family of operating points from (high gain, BW1) to (low gain, BWn) along the constant-GBWP curve]

    Future

  • Vlsi_03_2005.ppt/April_2005 50

    "Theory"
    The analog Gain-Bandwidth Product (GBWP) is constant for a specific technology; this is also true for other "environments"...
    – A computer structure can excel in performance for a specific application set, but not for all applications (also true for benchmarks)
    – A person can excel in several areas, but not in all...

    Examples: benchmarks, the applications in the coming foils, people...

    Future

  • Vlsi_03_2005.ppt/April_2005 51

    Tuning for Applications

    [Chart: performance vs. "applications" (Apps1 ... Appsn) - each design peaks on a different application set]

    Future

  • Vlsi_03_2005.ppt/April_2005 52

    Provide Specialized "Efficient" MIPS

    Find a way to support the new performance requirements via an efficient "mechanism"
    A tailored solution (to a specific application set) can provide "efficient" MIPS via INTEGRATION - how?

    Future

  • Vlsi_03_2005.ppt/April_2005 53

    The Need
    The environment

    These days mark the PC's 20th birthday
    – 835 million PCs sold 1981-2001
    – 138 million PCs in year 2001 (IDC), 10X the number of cars and 1.5X the televisions sold annually
    – 2.2 billion e-mails a day, 10X the first class mail
    – 400 million online users (200 million in Sep 99)
    – CPU performance improved ~8000X !!!

    What will be the need for performance in the coming 20 years?
    What will be the technology progress in the coming 20 years? 10 years? 5 years?

    Statistics courtesy of Gartner Dataquest, U.S. News & World Report, Jupiter Internet Population Model, and NUA Internet Surveys

    Future

  • Vlsi_03_2005.ppt/April_2005 54

    Windows XP examples that need excessive performance:
    - Movie Maker video indexing
    - Video smoothing

    Example 1: Movie Maker Video Indexing
    – 320x240, 30 fps: 4X slower than real time on Centrino™ @ 1.6GHz
    – 1980x1080, 30 fps: ~100X over Centrino™ @ 1.6GHz

    Future

  • Vlsi_03_2005.ppt/April_2005 55

    Example 2: Emulation of video smoothing / video enhancement
    (5 FPS, jerky, vs. 30 FPS, enhanced)

    Video smoothing
    – 352x240 pixels: CPU usage 70% of Centrino™ @ 1.6GHz
    – 1980x1080, 30 fps: ~21X over Centrino™ @ 1.6GHz

    Future

  • Vlsi_03_2005.ppt/April_2005 56

    The Need

    Future

  • Vlsi_03_2005.ppt/April_2005 57

    The need: Build a Panorama

    M. Brown and D. G. Lowe, "Recognising Panoramas," ICCV 2003. Performance: >30 min on a P4 @ 3GHz

    Simplified capabilities in Microsoft Digital Image Suite 10 ($129.95)

    Future

  • Vlsi_03_2005.ppt/April_2005 58

    Future

  • Vlsi_03_2005.ppt/April_2005 59

    CPU Usage - an example (measured on an IBM X20)
    Streaming vs. general purpose

    General purpose usage (Excel and Outlook, ~5 min trace) --> bursty need for MIPS
    Streaming processing (6 low-resolution videos, ~5 min trace) --> continuous need for MIPS

    [Charts: CPU utilization over time, with average lines, for the two workloads]

    Future

  • Vlsi_03_2005.ppt/April_2005 60

    Process trend – the theory (cont)
    Performance driven era vs. Power aware era

    Process generation p vs. p-1, @ the same die area

    "Old" processes (generations n..n+3, ~1µ down to 0.35µ) - performance driven era (no power constraints):
    – 1.4X frequency, 0.9X voltage, 0.7X capacitance/transistor, 1X area, 2X transistors, leakage 30%
    – Power increase/generation: 1.6X

    "New" processes (generations m..m+3, ~0.25µ down to 0.09µ) - power aware era (performance within the power envelope):
    – 1.4X frequency, 0.75X voltage, 0.7X capacitance/transistor, 1X area, 2X transistors, leakage

  • Vlsi_03_2005.ppt/April_2005 61

    Processor roadmap trend – real life (cont)
    Extension of Pollack's Rule (Micro32, 1999)

    Processor generation k vs. k-1, compacted @ the same process technology

    [Chart: growth (X, 0-4) in power and in performance per processor generation, across technology generations 1.5µ-0.18µ; annotated perf/power delta ratios: 1 : >3 and 3 : 1]

    Future

  • Vlsi_03_2005.ppt/April_2005 62

    Solution 1: CMP (Chip Multi-Processor)

    [Chart: performance vs. power - a conventional single-processor design gains ~1% performance for 3% more power, while adding cores (2, 3, 4-way CMP*) gives a better performance/power curve; penalty: MP (multi-processing)]

    *CMP = symmetric General Purpose (GP) cores
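A toy comparison of the two curves (purely illustrative numbers, not the data behind the slide's chart): pushing a single core harder buys ~1% performance per 3% power, while an extra symmetric core adds near-linear performance for roughly linear power, minus a multi-processing penalty:

```python
def single_core_scaling(extra_power_pct, perf_per_power=1/3):
    """Conventional design: roughly 1% performance for every 3% of extra power."""
    return 1 + extra_power_pct * perf_per_power / 100

def cmp_scaling(cores, mp_efficiency=0.8):
    """Symmetric CMP: power ~linear in cores, performance discounted by an MP penalty."""
    return 1 + (cores - 1) * mp_efficiency, float(cores)

print(round(single_core_scaling(100), 2))  # 2X the power on one core -> ~1.33X performance
perf, power = cmp_scaling(2)
print(perf, power)                         # 2 cores: ~1.8X performance for ~2X power
```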

    Future

  • Vlsi_03_2005.ppt/April_2005 63

    Solution 2: ACCMP (Asymmetric Cluster CMP)

    [Chart: performance vs. power - CMP gives ~1% performance for 1% in power, while ACCMP (adding specialized clusters) gives >3% performance for 1% in power; penalty: specialized MIPS]

    Future

  • Vlsi_03_2005.ppt/April_2005 64

    ACCMP
    What is the ACCMP?
    – On-die asymmetric clusters of cores
    – Efficient specialized MIPS clusters with >3-4X performance/power over GP cores
    – Compatible ISA?

    Penalties
    – Multi-processing (tasks or threads)
    – Specialized MIPS

    ACCMP is a solution that enables us to continue (for a while) Moore's performance law within the power envelope

    Future

  • Vlsi_03_2005.ppt/April_2005 65

    ACCMP

    [Block diagram: a host cluster of general purpose (GP) host cores, a Specialized MIPS A cluster and a Specialized MIPS B cluster sharing an L2 $ over an interconnect, with an external bus]

    Future

  • Vlsi_03_2005.ppt/April_2005 66

    Future - Processors

    • Application needs
    • Specialized MIPS
    • Detached from the CPU core
    • Different engines
    • Mixture of programmable and fixed function
    • ?
