
  • Vlsi_03_2005.ppt/April_2005 1

    VLSI Trends in Microarchitecture
    Past, present and future

    TAU University, January 24, 2006

    Uri Weiser

  • Vlsi_03_2005.ppt/April_2005 2

    Agenda

    Microarchitecture – VLSI – Trends

    – Past and present:
      » Pipeline, superpipeline
      » Out of Order
      » Branch prediction
      » Caches
      » Trace cache
      » Threads and Chip Multiprocessing
    – Future:
      » Asymmetric
      » Accelerators

  • Vlsi_03_2005.ppt/April_2005 3

    “[In the beginning] we had little idea of what we had started. ...I remember... saying, ‘Okay, we’ve done integrated circuit. What do we do next?’”

    Gordon E. Moore

  • Vlsi_03_2005.ppt/April_2005 4

    Trends in VLSI

    Sources: Shekhar Borkar, Uri Weiser

  • Vlsi_03_2005.ppt/April_2005 5

    Process Technology

    Process generations: 1.5µ, 1.0µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ

    Processors: Intel386™ DX, Intel486™ DX, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4

    Technology trend

    Trends

  • Vlsi_03_2005.ppt/April_2005 6

    Performance History

    [Chart: frequency (MHz, log scale) vs. process generation, 1.0µ to 0.18µ, split into Freq (Process) and Freq (uArch), for the i486, Pentium® proc and Pentium® II & III proc]

    1.0µ-0.18µ, 1989-2001: frequency increased 61X
    1. 18.3X due to process technology
    2. Additional 3.3X due to uArch

    [Chart: relative performance (log scale) vs. process generation, 1.0µ to 0.18µ, split into perf due to frequency and perf due to uArchitecture, for the i486, Pentium® proc, Pentium® II & III proc and Pentium® 4 proc]

    Performance increased ~100X
    1. 14X due to process tech
    2. Additional 7X due to uArch & design

    Trends

  • Vlsi_03_2005.ppt/April_2005 7

    Process Technology: Minimum Feature Size

    [Chart: feature size (microns, log scale, 0.01-10) vs. year ('68-'08), Intel data and the SIA roadmap]

    Source: Intel, SIA Technology Roadmap

    Trends

  • Vlsi_03_2005.ppt/April_2005 8

    Transistors on a Chip

    [Chart: transistors per chip (millions, log scale) vs. year (1970-2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc, P6 and Pentium® 4 proc: 2X growth in 1.96 years]

    Transistors on a chip doubled every two years

    Trends

  • Vlsi_03_2005.ppt/April_2005 9

    Die Size Growth

    [Chart: die side (mm) = Area^1/2 (log scale) vs. year (1970-2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc and P6]

    Die size grows? Is it saturated?

    Trends

  • Vlsi_03_2005.ppt/April_2005 10

    Frequency

    [Chart: frequency (MHz, log scale, 0.1-10000) vs. year (1970-2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® proc and P6: doubles every 2 years]

    Lead microprocessor frequency doubles every 2 years

    Trends

  • Vlsi_03_2005.ppt/April_2005 11

    Frequency of Operation

    [Chart: frequency (MHz, log scale, 1-1000) vs. year (1970-2000) for Intel, PowerPC, Alpha and other processors: 8080, 8085, 8086, 8088, 80286, 386DX-16/33, 486DX-25/33/50, 486DX2-66, Pentium 66/90, Pentium Pro 225/300]

    Trends

  • Vlsi_03_2005.ppt/April_2005 12

    Frequency of Operation (cont.)

    [Chart: frequency (MHz, 0-400) vs. year (1992-1998) for Intel (486DX2, Pentium 66/90/100/120/133, Pentium Pro 150/180/225/300), PowerPC (601-100, 604-133/150, 620-200) and others (Alpha 21064A-275, 21164-300/333, 21164A-417, MIPS R4400-200)]

    Trends

  • Vlsi_03_2005.ppt/April_2005 13

    Brainiacs and Speed demons

    Source: ISCA 95, p. 174

    [Chart: SPECint92/MHz (0.5-2.0) vs. frequency (50-400 MHz) for Alpha (21064, 21164), x86 (Pentium, Pentium Pro) and PowerPC parts, with iso-performance curves at SPECint92 = 100, 200, 300, 400; "Brainiacs" rank high on SPECint92/MHz (labelled points around 1.6), "Speed demons" rank high on MHz (labelled points around 0.57-0.6 SPECint92/MHz)]

    Trends

  • Vlsi_03_2005.ppt/April_2005 14

    Trends of Future Processors

    [Chart: SPECint/100MHz (0-8) vs. frequency (0-2500 MHz) for the PII, PIII, P IV and P M, with iso-performance curves at SPECint = 4, 8, 12, 18, 27, 40, 60, 90]

    Trends

  • Vlsi_03_2005.ppt/April_2005 15

    [Chart: power density (Watts/cm², log scale, 1-1000) vs. process generation (1.5µ-0.07µ) for the i386, i486, Pentium®, Pentium® Pro, Pentium® II and Pentium® III processors, heading toward the power density of a hot plate, rocket nozzle, nuclear reactor and the Sun's surface]

    Power density continues to get worse

    Trends

  • Vlsi_03_2005.ppt/April_2005 16

    On Die Cache Memory

    [Diagram: die plots at 0.18µ, 0.13µ and 0.10µ showing core logic vs. cache area; chart: cache % of full chip area (0-100%) vs. process generation (0.7µ-0.10µ) for Pentium, Pentium Pro, Pentium II, Pentium III and Pentium III & 4]

    Larger % of die area will be memory

    Trends

  • Vlsi_03_2005.ppt/April_2005 17

    Process trend – the theory (cont)
    Performance driven era vs. Power aware era

    Process generation p vs. p-1, @ the same die area

    "Old" processes (generations n..n+3, ~1µ down to 0.35µ) - performance driven era (no power constraints):
    – 1.4X frequency, 0.9X voltage, 0.7X capacitance/transistor, 1X area, 2X transistors
    – Power increase/generation: 1.6X

    "New" processes (generations m..m+3, ~0.25µ down to 0.09µ) - power aware era (performance within the power envelope):
    – 1.4X frequency, 0.75X voltage, 0.7X capacitance/transistor, 1X area, 2X transistors
    – Power increase/generation: 1.1X

    Reference: Eric Sprangle
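The two power numbers can be sanity-checked with the classic dynamic-power relation P ∝ N·C·V²·f; a minimal sketch (illustrative only, leakage ignored):

```python
def power_ratio(freq, voltage, cap_per_transistor, transistors):
    """Dynamic power ratio of process generation p vs. p-1: P ~ N * C * V^2 * f."""
    return transistors * cap_per_transistor * voltage ** 2 * freq

# Performance driven era ("old" processes): ~1.6X power per generation
print(round(power_ratio(1.4, 0.9, 0.7, 2), 2))   # -> 1.59

# Power aware era ("new" processes): ~1.1X power per generation
print(round(power_ratio(1.4, 0.75, 0.7, 2), 2))  # -> 1.1
```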

    Trends

  • Vlsi_03_2005.ppt/April_2005 18

    Processor roadmap trend – real life (cont)
    Extension of Pollack's Rule (Micro32, 1999)

    Processor generation k vs. k-1, compacted @ the same process technology

    [Chart: growth (X, 0-4) in power and in performance per processor generation, across technology generations 1.5µ-0.18µ; annotated perf/power delta ratios: 1 : >3 and 3 : 1]
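Pollack's Rule says integer performance grows roughly as the square root of the complexity (area/power) spent on a core within one process generation; a minimal sketch with illustrative area ratios (not the chart's data):

```python
import math

def pollack_performance(area_ratio):
    """Pollack's Rule: performance grows roughly as sqrt(area/complexity)
    when a bigger microarchitecture is built on the same process."""
    return math.sqrt(area_ratio)

for area in (2.0, 3.0):
    # power tracks area roughly linearly, performance only as its square root
    print(f"{area:.0f}X area/power -> ~{pollack_performance(area):.2f}X performance")
```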

    Trends

  • Vlsi_03_2005.ppt/April_2005 19

    Microarchitecture

  • Vlsi_03_2005.ppt/April_2005 20

    The Generic Processor

    [Diagram: instruction supply -> execution engine <-> data supply]

    Sophisticated organization to "service" instructions

    Instruction supply
    – Instruction cache
    – Branch prediction
    – Instruction decoder
    – ...

    Execution engine
    – Instruction scheduler
    – Register files
    – Execution units
    – ...

    Data supply
    – Data cache
    – TLBs
    – ...

    Goal - maximum throughput – balanced design

    Microarchitecture

  • Vlsi_03_2005.ppt/April_2005 21

    “The Core” - A Block Diagram

    [Block diagram: a scheduler and register file feeding n execution ports - AGU+MEM, ALU1, ALU2, FPU - with a data cache and D-TLB on the memory path; labelled paths: instruction flow, read sources, write destination, write bypass, write back]

    Microarchitecture

  • Vlsi_03_2005.ppt/April_2005 22

    Parallelism Evolution

    [Diagram: instruction streams flowing through processor elements (PE) for each configuration: basic configuration, pipeline, VLIW, superscalar in-order, superscalar out-of-order]

    Parallelism

  • Vlsi_03_2005.ppt/April_2005 23

    Pipeline
    Break the work into smaller pieces

    Increased throughput
    – Increased # of completed instructions per cycle and reduced cycle time
    – Number of stages varies
      » Small: 4-5 (Pentium), "superpipeline" ~14 (Pentium Pro), "ultra-pipeline" ~25 (P4)

    Calls for good balancing among the stages

    [Diagram: instructions I1, I2, I3 flowing through a 4-stage pipeline over cycles 0-12; F: Fetch, D: Decode, E: Execute, W: Write Back]

    Unpipelined: 1/4 IPC = 4 CPI; pipelined: 1 IPC = 1 CPI
    IPC = Instructions Per Cycle, CPI = Cycles Per Instruction

    Examples: Intel 486, NS 32532
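The 1-IPC limit follows from the ideal pipeline timing formula, cycles = stages + instructions - 1; a minimal sketch (stall-free, which real pipelines are not):

```python
def pipelined_cpi(stages, instructions):
    """Ideal (stall-free) pipeline: total cycles = stages + instructions - 1."""
    return (stages + instructions - 1) / instructions

print(round(pipelined_cpi(4, 4), 2))     # short burst: CPI = 1.75, still far from 1
print(round(pipelined_cpi(4, 1000), 2))  # long stream: CPI -> 1.0 (i.e. 1 IPC)
```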

    Parallelism

  • Vlsi_03_2005.ppt/April_2005 24

    Pipeline Stalls

    But there are "stalls" in the pipeline
    – "Data hazards": data flow dependency (instruction output/input)
      » Solved by: bypasses, renaming
    – "Control hazards": control flow dependencies
      » Solved by: branch prediction
    – "Structural hazards": limited resources
    – Other (cache misses, long latency instructions, page faults, ...)

    [Diagram: pipeline timing over cycles 0-12 showing a data flow stall (w/o bypass), a control flow stall and an address generation interlock; F: Fetch, D: Decode, E: Execute, W: Write Back]
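A standard way to account for these stalls (a sketch with made-up rates and penalties, not measurements from the slide) is effective CPI = base CPI + Σ(stalls per instruction × penalty):

```python
def effective_cpi(base_cpi, stall_events):
    """stall_events: list of (events per instruction, penalty cycles per event)."""
    return base_cpi + sum(rate * penalty for rate, penalty in stall_events)

# e.g. 5% load-use stalls costing 1 cycle, 4% mispredicted branches costing 3 cycles
print(round(effective_cpi(1.0, [(0.05, 1), (0.04, 3)]), 2))  # -> 1.17
```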

    Parallelism

  • Vlsi_03_2005.ppt/April_2005 25

    Super Scalar
    Performs more in a single cycle

    Ideally, can multiply the throughput
    – But stalls occur more frequently

    [Diagram: a two-wide pipeline (F D E W) completing two instructions per cycle over cycles 0-12, with a stall]

    2 IPC = 1/2 CPI

    Examples: Intel Pentium® proc., Alpha 21164

    Parallelism

  • Vlsi_03_2005.ppt/April_2005 26

    Super Pipeline
    Split into shorter stages - allows higher frequency

    Ideally, can (again) multiply the throughput, but
    – Stall penalties do not scale (e.g., control flow stalls, cache misses)
    – Clock setup/hold reduces net cycle time - each instruction takes longer!

    In the example above: 2X the stages, but the performance gain is less than 2X
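To make that concrete (a sketch with invented numbers, not figures from the deck): add a fixed per-stage latch/skew overhead and a stall penalty that grows with depth, and 2X the stages returns well under 2X the performance:

```python
def superpipeline_speedup(stages_ratio, logic_delay, stage_overhead,
                          stall_rate, stall_penalty_stages):
    """Relative throughput of a pipeline with stages_ratio times more stages."""
    base_cycle = logic_delay + stage_overhead
    new_cycle = logic_delay / stages_ratio + stage_overhead   # shorter logic, same overhead
    freq_gain = base_cycle / new_cycle
    # stall penalty (in cycles) grows with the number of stages flushed / waited on
    base_cpi = 1 + stall_rate * stall_penalty_stages
    new_cpi = 1 + stall_rate * stall_penalty_stages * stages_ratio
    return freq_gain * base_cpi / new_cpi

# 2X the stages, 10% per-stage overhead, 4% stalls costing 5 stages originally
print(round(superpipeline_speedup(2, 1.0, 0.1, 0.04, 5), 2))  # -> ~1.57X, not 2X
```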

  • Vlsi_03_2005.ppt/April_2005 27

    Out Of Order Execution

    In Order Execution: instructions are processed in their program order
    – Limitation to potential parallelism

    OOO: instructions are executed based on "data flow" rather than program order

    Before (src -> dest):
    (1) load (r10), r21
    (2) mov r21, r31   (2 depends on 1)
    (3) load a, r11
    (4) mov r11, r22   (4 depends on 3)
    (5) mov r22, r23   (5 depends on 4)

    After:
    (1) load (r10), r21;   (3) load a, r11;
    (2) mov r21, r31;      (4) mov r11, r22;
                           (5) mov r22, r23;

    Usually highly superscalar

    [Diagram: cycle-by-cycle pipeline timing (F, D, E, W) of instructions 1-5 for in-order vs. out-of-order processing, assuming unlimited resources and a 2-cycle load latency]

    Examples: Intel Pentium® II/III/4, Compaq Alpha 21264
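A toy sketch of "data flow rather than program order" (illustrative only, not any real scheduler): every cycle, issue whichever instructions have all of their sources ready.

```python
import math

# Toy OOO issue sketch: each instruction is (name, source regs, destination reg).
program = [
    ("1: load (r10),r21", [], "r21"),
    ("2: mov r21,r31",    ["r21"], "r31"),
    ("3: load a,r11",     [], "r11"),
    ("4: mov r11,r22",    ["r11"], "r22"),
    ("5: mov r22,r23",    ["r22"], "r23"),
]

LOAD_LATENCY = 2
# Registers written inside the window are "not ready" until their producer issues.
ready_at = {dst: math.inf for _, _, dst in program}
issued = set()
cycle = 0
while len(issued) < len(program):
    for name, srcs, dst in program:
        # data-flow order: issue as soon as all sources are available
        if name not in issued and all(ready_at.get(s, 0) <= cycle for s in srcs):
            ready_at[dst] = cycle + (LOAD_LATENCY if "load" in name else 1)
            issued.add(name)
            print(f"cycle {cycle}: issue {name}")
    cycle += 1
```

Run it and instructions 1 and 3 issue together, then 2 and 4, then 5, matching the reordered sequence above.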

    OOO

  • Vlsi_03_2005.ppt/April_2005 28

    Out Of Order (cont.)

    Advantages
    – Helps exploit Instruction Level Parallelism (ILP)
    – Helps cover latencies (e.g., cache miss, divide)
    – Artificially increases the register file size (i.e., number of registers)
    – Superior/complementary to the compiler scheduler
      » Dynamic instruction window
      » Makes use of more registers than the architectural registers

    Complex microarchitecture
    – Complex scheduler, involving also
      » Large instruction window
      » Speculative execution
    – Requires a reordering back-end mechanism (retirement) for:
      » Precise interrupt resolution
      » Misprediction/speculation recovery
      » Memory ordering

    OOO

  • Vlsi_03_2005.ppt/April_2005 29

    Speculation

  • Vlsi_03_2005.ppt/April_2005 30

    Branch Prediction
    Goal - ensure enough instruction supply by correct prefetching

    In the past - the prefetcher assumed fall-through
    – Loses on unconditional branches (e.g., call)
    – Loses on frequently taken branches (e.g., loops)

    Branch prediction
    – Predicts whether a branch is taken/not taken
    – Predicts the branch target address

    Misprediction cost varies (higher with increased pipeline length)
    Typical branch prediction rates: ~90%-96%
    – 4%-10% misprediction
    – 10-25 branches between mispredictions
    – 50-125 instructions between mispredictions

    Misprediction cost increases with
    – Pipeline depth
    – Machine width
      » e.g., 3 wide x 10 stages = 30 instructions flushed!
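One simple direction predictor consistent with hit rates in this range is a table of 2-bit saturating counters; a minimal sketch (illustrative, not the scheme of any particular processor named here):

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self, table_bits=10):
        self.counters = [1] * (1 << table_bits)   # start weakly not-taken
        self.mask = (1 << table_bits) - 1

    def predict(self, branch_ip):
        return self.counters[branch_ip & self.mask] >= 2

    def update(self, branch_ip, taken):
        i = branch_ip & self.mask
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)

# A loop branch taken 9 times then not taken: only the first and last iterations miss.
bp, misses = TwoBitPredictor(), 0
for taken in [True] * 9 + [False]:
    misses += bp.predict(0x400) != taken
    bp.update(0x400, taken)
print(misses)  # -> 2
```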

    Speculation - BP

  • Vlsi_03_2005.ppt/April_2005 31

    Target Array + Direction Prediction
    Target and direction are predicted separately; the tag may be partial

    [Diagram: the branch IP indexes a target array of (tag, predicted target) entries - a hit indicates a branch and supplies the predicted target address - while a separate direction predictor (for conditional branches only) supplies the predicted direction (taken/not-taken)]
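A minimal sketch of such a target array lookup, with made-up sizes and a partial tag (not the organization of any specific core):

```python
class TargetArray:
    """Direct-mapped target array (BTB): partial tag + predicted target per entry."""
    def __init__(self, index_bits=9, tag_bits=8):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)   # each entry: (partial_tag, target)
        self.index_mask = (1 << index_bits) - 1
        self.tag_mask = (1 << tag_bits) - 1

    def _split(self, branch_ip):
        return branch_ip & self.index_mask, (branch_ip >> self.index_bits) & self.tag_mask

    def lookup(self, branch_ip):
        idx, tag = self._split(branch_ip)
        entry = self.entries[idx]
        if entry and entry[0] == tag:    # hit: a known branch, supply its predicted target
            return entry[1]
        return None                      # miss: assume fall-through / not a branch

    def update(self, branch_ip, taken_target):
        idx, tag = self._split(branch_ip)
        self.entries[idx] = (tag, taken_target)

ta = TargetArray()
ta.update(0x401200, 0x401000)
print(hex(ta.lookup(0x401200)))  # -> 0x401000
```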

    Speculation - BP

  • Vlsi_03_2005.ppt/April_2005 32

    Speculative Execution
    Execution of instructions from a predicted (yet unsure) path; eventually, the path may turn out to be wrong

    Advantages:
    – Ensures instruction supply
    – Allows a large scheduling window (for out of order)

    Issues:
    – Misprediction cost
    – Misprediction recovery

    Speculation - execution

  • Vlsi_03_2005.ppt/April_2005 33

    Cache - Motivation & Principle
    Memory consumption is growing about 2X every 2 years
    – Typical size: (Y2000) 64M-128M, (Y2002) 128M-256M

    CPU speed grows faster than memory and buses
    – The CPU/bus ratio grew from 1:1 to 6:1, and is still growing

      486: 25-66 MHz CPU / 33 MHz bus
      Pentium: 66-233 MHz / 66 MHz
      P-II: 200-450 MHz / 66-100 MHz
      P-III: 0.5-1.33 GHz / 133-200 MHz
      P4: 1.4-2.4 GHz / 400 MHz

    – Memory: DRAM: 60-100ns ("10-16MHz"), Cost:
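The usual way to quantify how a cache bridges this gap (a sketch with illustrative numbers, not figures from the slide) is average memory access time = hit time + miss rate × miss penalty:

```python
def amat(hit_time_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time, in CPU cycles."""
    return hit_time_cycles + miss_rate * miss_penalty_cycles

# e.g. 1-cycle L1 hit, 5% miss rate, ~100-cycle DRAM access at a 6:1 core/bus ratio
print(round(amat(1, 0.05, 100), 2))  # -> 6.0 cycles on average
print(round(amat(1, 0.00, 100), 2))  # perfect cache: 1.0
```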

  • Vlsi_03_2005.ppt/April_2005 34

    The Generic Processor

    [Diagram: instruction supply - instruction cache, BTB, instruction fetch/decode, rename, and a trace/decoded cache - feeding the scheduler, execution engine and data supply]

    Speculation – Trace Cache

  • Vlsi_03_2005.ppt/April_2005 35

    Fetch bandwidth example

    [Diagram: control flow graph with instruction blocks A, B, C, D, E; the dynamic instruction stream interleaves these blocks over time (e.g., ... A C B B C A ...)]

    Speculation – Trace Cache

  • Vlsi_03_2005.ppt/April_2005 36

    Trace Cache Concept

    • Hold in the "instruction" cache the dynamic stream of the executed instructions

    => The trace cache acts as a "branch predictor" + wide instruction supplier

    Speculation – Trace Cache

  • Vlsi_03_2005.ppt/April_2005 37

    Trace Cache Overview

    [Diagram: in build mode, the fetch address goes to the I-cache and decoder, and the fill unit assembles the decoded blocks (A B C) into a trace stored in the trace cache; in stream mode, the trace cache supplies the trace (A B C) directly to the execution core]

    Speculation – Trace Cache

  • Vlsi_03_2005.ppt/April_2005 38

    Trace cache line

    • Tag: identifies the starting address of the trace
    • N instructions (potentially decoded)
    • Next address: next fetch address
    • Path info: branch flags (T, NT), number of branches, does the trace end with a branch?, ...

    [Line layout: PC Tag | N Instructions | Next address | path info]
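A minimal sketch of such a line as a data structure (field names are illustrative, not taken from any specific design):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TraceCacheLine:
    tag: int                      # starting address of the trace
    instructions: List[str]       # up to N (possibly decoded) instructions
    next_address: int             # where to fetch after this trace
    branch_flags: List[bool] = field(default_factory=list)  # path info: T / NT per branch
    ends_with_branch: bool = False

    def matches(self, fetch_ip: int, predicted_path: List[bool]) -> bool:
        """Hit only if both the start address and the predicted path match."""
        return self.tag == fetch_ip and predicted_path[:len(self.branch_flags)] == self.branch_flags

line = TraceCacheLine(tag=0x1000, instructions=["A0", "A1", "B0", "C0"],
                      next_address=0x1040, branch_flags=[True, True])
print(line.matches(0x1000, [True, True, False]))  # -> True
```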

    Speculation – Trace Cache

  • Vlsi_03_2005.ppt/April_2005 39

    Threads

  • Vlsi_03_2005.ppt/April_2005 40

    Scalar Execution

    Dependencies reduce throughput/utilization


    Threads

  • Vlsi_03_2005.ppt/April_2005 41

    Superscalar Execution

    Generally increases throughput, but decreases utilization


    Threads

  • Vlsi_03_2005.ppt/April_2005 42

    Predication

    Generally increases utilization, increases throughput less (much of the utilization is thrown away)


    Threads

  • Vlsi_03_2005.ppt/April_2005 43

    CMP – Chip Multi-Processor

    Low utilization / higher throughput

    Threads

  • Vlsi_03_2005.ppt/April_2005 44

    Blocked Multithreading

    May increase utilization and throughput, but must switch when the current thread goes into a low-utilization/throughput section (e.g., an L2 cache miss)
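A rough model of the benefit (invented numbers, not data from the deck): a single thread stalls for the full L2 miss penalty, while blocked multithreading switches to another ready thread and pays only the switch cost:

```python
def blocked_mt_utilization(miss_rate, miss_penalty, switch_cost, num_threads):
    """Fraction of cycles doing useful work, per instruction slot."""
    if num_threads == 1:
        return 1 / (1 + miss_rate * miss_penalty)   # stall on every L2 miss
    # assume enough ready threads to hide the whole miss: pay only the switch cost
    return 1 / (1 + miss_rate * switch_cost)

print(round(blocked_mt_utilization(0.01, 200, 0, 1), 2))   # single thread: ~0.33
print(round(blocked_mt_utilization(0.01, 200, 5, 4), 2))   # blocked MT:    ~0.95
```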

    Threads

  • Vlsi_03_2005.ppt/April_2005 45

    Fine Grained Multithreading

    Increases utilization/throughput by reducing impact of dependences


    Threads

  • Vlsi_03_2005.ppt/April_2005 46

    Simultaneous Multithreading


    Increases utilization/throughput

    Threads

  • Vlsi_03_2005.ppt/April_2005 47

    Future

  • Vlsi_03_2005.ppt/April_2005 48

    Analog Circuit Paradigm

    GBWP = Gain-Bandwidth Product = constant @ a given technology

    [Chart: gain vs. frequency for two operating points, (Gain1, BW1) and (Gain2, BW2)]

    e.g. Gain1*BW1 = Gain2*BW2
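A one-liner makes the trade-off concrete (the GBWP value is an arbitrary assumption): fixing the gain-bandwidth product, more gain means proportionally less bandwidth:

```python
GBWP = 100e6  # assumed gain-bandwidth product of the technology, in Hz

for gain in (1, 10, 100):
    print(f"gain {gain:>3} -> bandwidth {GBWP / gain / 1e6:.0f} MHz")
```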

    Future

  • Vlsi_03_2005.ppt/April_2005 49

    Analog Circuit Paradigm (cont.)

    [Chart: gain vs. frequency - a family of operating points from (high gain, BW1) to (low gain, BWn) along the constant-GBWP curve]

    Future

  • Vlsi_03_2005.ppt/April_2005 50

    "Theory"
    The analog Gain-Bandwidth Product (GBWP) is constant for a specific technology; this is also true for other "environments"...
    – A computer structure can excel in performance for a specific application set, but not for all applications (also true for benchmarks)
    – A person can excel in several areas, but not in all...

    Examples: benchmarks, the applications in the coming foils, people...

    Future

  • Vlsi_03_2005.ppt/April_2005 51

    Tuning for Applications

    [Chart: performance vs. "applications" (Apps1 ... Appsn) - each design peaks on a different application set]

    Future

  • Vlsi_03_2005.ppt/April_2005 52

    Provide Specialized "Efficient" MIPS

    Find a way to support the new performance requirements via an efficient "mechanism"
    A tailored solution (to a specific application set) can provide "efficient" MIPS via INTEGRATION - how?

    Future

  • Vlsi_03_2005.ppt/April_2005 53

    The Need
    The environment

    These days mark the PC's 20th birthday
    – 835 million PCs sold 1981-2001
    – 138 million PCs in year 2001 (IDC), 10X the number of cars and 1.5X the televisions sold annually
    – 2.2 billion e-mails a day, 10X the first class mail
    – 400 million online users (200 million in Sep 99)
    – CPU performance improved ~8000X !!!

    What will be the need for performance in the coming 20 years?
    What will be the technology progress in the coming 20 years? 10 years? 5 years?

    Statistics courtesy of Gartner Dataquest, U.S. News & World Report, Jupiter Internet Population Model, and NUA Internet Surveys

    Future

  • Vlsi_03_2005.ppt/April_2005 54

    Windows XP examples that need excessive performance:
    - Movie Maker video indexing
    - Video smoothing

    Example 1: Movie Maker Video Indexing
    – 320x240, 30 fps: 4X slower than real time on Centrino™ @ 1.6GHz
    – 1980x1080, 30 fps: ~100X over Centrino™ @ 1.6GHz

    Future

  • Vlsi_03_2005.ppt/April_2005 55

    Example 2: Emulation of video smoothing / video enhancement
    (5 FPS, jerky, vs. 30 FPS, enhanced)

    Video smoothing
    – 352x240 pixels: CPU usage 70% of Centrino™ @ 1.6GHz
    – 1980x1080, 30 fps: ~21X over Centrino™ @ 1.6GHz

    Future

  • Vlsi_03_2005.ppt/April_2005 56

    The Need

    Future

  • Vlsi_03_2005.ppt/April_2005 57

    The need: Build a Panorama

    M. Brown and D. G. Lowe, "Recognising Panoramas," ICCV 2003. Performance: >30 min on a P4 @ 3GHz

    Simplified capabilities in Microsoft Digital Image Suite 10 ($129.95)

    Future

  • Vlsi_03_2005.ppt/April_2005 58

    Future

  • Vlsi_03_2005.ppt/April_2005 59

    CPU Usage - an example (measured on an IBM X20)
    Streaming vs. general purpose

    General purpose usage (Excel and Outlook, ~5 min trace) --> bursty need for MIPS
    Streaming processing (6 low-resolution videos, ~5 min trace) --> continuous need for MIPS

    [Charts: CPU utilization over time, with average lines, for the two workloads]

    Future

  • Vlsi_03_2005.ppt/April_2005 60

    Process trend – the theory (cont)
    Performance driven era vs. Power aware era

    Process generation p vs. p-1, @ the same die area

    "Old" processes (generations n..n+3, ~1µ down to 0.35µ) - performance driven era (no power constraints):
    – 1.4X frequency, 0.9X voltage, 0.7X capacitance/transistor, 1X area, 2X transistors, leakage 30%
    – Power increase/generation: 1.6X

    "New" processes (generations m..m+3, ~0.25µ down to 0.09µ) - power aware era (performance within the power envelope):
    – 1.4X frequency, 0.75X voltage, 0.7X capacitance/transistor, 1X area, 2X transistors, leakage

  • Vlsi_03_2005.ppt/April_2005 61

    Processor roadmap trend – real life (cont)
    Extension of Pollack's Rule (Micro32, 1999)

    Processor generation k vs. k-1, compacted @ the same process technology

    [Chart: growth (X, 0-4) in power and in performance per processor generation, across technology generations 1.5µ-0.18µ; annotated perf/power delta ratios: 1 : >3 and 3 : 1]

    Future

  • Vlsi_03_2005.ppt/April_2005 62

    Solution 1: CMP (Chip Multi-Processor)

    [Chart: performance vs. power - a conventional single-processor design gains ~1% performance for 3% more power, while adding cores (2, 3, 4-way CMP*) gives a better performance/power curve; penalty: MP (multi-processing)]

    *CMP = symmetric General Purpose (GP) cores
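A toy comparison of the two curves (purely illustrative numbers, not the data behind the slide's chart): pushing a single core harder buys ~1% performance per 3% power, while an extra symmetric core adds near-linear performance for roughly linear power, minus a multi-processing penalty:

```python
def single_core_scaling(extra_power_pct, perf_per_power=1/3):
    """Conventional design: roughly 1% performance for every 3% of extra power."""
    return 1 + extra_power_pct * perf_per_power / 100

def cmp_scaling(cores, mp_efficiency=0.8):
    """Symmetric CMP: power ~linear in cores, performance discounted by an MP penalty."""
    return 1 + (cores - 1) * mp_efficiency, float(cores)

print(round(single_core_scaling(100), 2))  # 2X the power on one core -> ~1.33X performance
perf, power = cmp_scaling(2)
print(perf, power)                         # 2 cores: ~1.8X performance for ~2X power
```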

    Future

  • Vlsi_03_2005.ppt/April_2005 63

    Solution 2: ACCMP (Asymmetric Cluster CMP)

    [Chart: performance vs. power - CMP gives ~1% performance for 1% in power, while ACCMP (adding specialized clusters) gives >3% performance for 1% in power; penalty: specialized MIPS]

    Future

  • Vlsi_03_2005.ppt/April_2005 64

    ACCMP
    What is the ACCMP?
    – On-die asymmetric clusters of cores
    – Efficient specialized MIPS clusters with >3-4X performance/power over GP cores
    – Compatible ISA?

    Penalties
    – Multi-processing (tasks or threads)
    – Specialized MIPS

    ACCMP is a solution that enables us to continue (for a while) Moore's performance law within the power envelope

    Future

  • Vlsi_03_2005.ppt/April_2005 65

    ACCMP

    [Block diagram: a host cluster of general purpose (GP) host cores, a Specialized MIPS A cluster and a Specialized MIPS B cluster sharing an L2 $ over an interconnect, with an external bus]

    Future

  • Vlsi_03_2005.ppt/April_2005 66

    Future - Processors

    • Application needs
    • Specialized MIPS
    • Detached from the CPU core
    • Different engines
    • Mixture of programmable and fixed function
    • ?
