
The Engine of SOC Design

Application-Specific Supercomputing: New Building Blocks Enable New Systems Efficiency

Chris Rowen, President and CEO, Tensilica, Inc.

2 © 2006 Tensilica Inc.

Outline

• Fundamental Shift in Silicon Scaling
• A Quick History of Processor Design
• A New Method: Automatic Processor Generation
• Tradeoffs in Efficiency versus Generality
• Example of Large Processor Array for Embedded
• An Architecture for Very Large Scale Compute
• Conclusions: Essentials of the New HPC
  • The Power Problem is Real
  • Work on Compute Efficiency – GFLOPS per Watt
  • Work on Communications Efficiency – Useful Messages/Data per Watt

3 © 2006 Tensilica Inc.

Inflexion Point in Clock Frequency

[Chart: projected clock-frequency scaling, 2003-2013, performance normalized to 2003 = 1.0 (log scale). The transistor-constrained clock grows about 21% per year; the power-constrained clock grows only about 4% per year.]

Based on International Technology Roadmap for Semiconductors 12/03

4 © 2006 Tensilica Inc.

High-End Processors hit MHz ceiling

Clock < 4GHz

Basic Implications:

1. Get performance from better architecture instead of more MHz

2. Use multiple processors

5 © 2006 Tensilica Inc.

The history of the microprocessor

[Diagram: Relative Impact of Processor Features – successive microarchitecture features plotted by single-thread integer performance gained versus processor area and power added. Starting from a basic micro-controller (PC, instruction memory, instruction decode, register, ALU, data memory), the accumulated features are: micro-code (microcode memory, µPC, µI decode), a general register file, wider data widths (4, 8, 16, 32 bits, ...), a pipelined load/store architecture, cache and memory protection (I/D TLBs, tags, cache-miss engines, memory management), floating point (FP register file, FP ALU, FP multiplier, FP decode), branch prediction (branch memory, PC predict), static or dynamic superscalar issue (ALU0-ALU3), SIMD multimedia ALUs, out-of-order execution (rename register files, result queues, out-of-order completion engine), simultaneous multi-threading (per-thread PCs and rename register files, thread select), and cache-coherent symmetric multi-processing (added cache state, non-blocking cache-miss engine).]

6 © 2006 Tensilica Inc.

Intel’s Own Assessment

[Chart: relative performance and efficiency by processor generation, normalized to i486 (1989) = 1.0, for the Pentium (1993), Pentium 3 (1999), and Pentium 4 (2000). Series: Performance (SPECint), Die Size, Power, Power Efficiency (Perf/Power), Transistor Speed.]

Power and Area increase more rapidly than Performance

[Chart: Energy per Instruction (nJ) versus Relative Scalar Performance, log-log, for Pentium 1993, Pentium Pro 1995, Pentium 4 2001, Pentium 4 2005, Pentium-M 2003, Pentium-M 2005, Core Duo 2006, and a "Goal" point.]

7 © 2006 Tensilica Inc.

The New World: Automatic Processor Generation

[Diagram: the Tensilica Processor Generator takes a processor configuration and produces both an application-optimized processor implementation (RTL) and tailored SW tools: compiler, debugger, simulators, OS ports. The generated core combines the base CPU with blocks such as OCD, timer, FPU, extended registers, cache, and application-specific datapaths.]

Processor configuration:
1. Select from menu
2. Automatic instruction discovery (XPRES Compiler)
3. Explicit instruction description (TIE)

8 © 2006 Tensilica Inc.

Build Almost Any Processor

[Block diagram of the Xtensa processor, distinguishing base ISA features, configurable functions, optional functions, and user-defined (TIE) features, which are optional and configurable. Instruction side: instruction fetch / PC, extended instruction align/decode/dispatch, instruction decode/dispatch, instruction RAM, instruction ROM, instruction cache, and instruction MMU with I TLB. Execution: base ALU, base register file, optional floating point, Vectra DSP, MAC16 DSP and MUL 16/32, plus user-defined execution units, register files, and data load/store units. Data side: data load/store unit, data RAMs, data ROMs, data cache, data MMU with D TLB, and write buffer. Control and debug: processor controls, interrupt control and interrupts, timers, exception support and exception-handling registers, instruction and data address watch registers, on-chip debug, trace port, and JTAG tap control. Interfaces: Xtensa processor interface (PIF) control, Xtensa local memory interface, and user-defined queues and wires to the external interface.]

9 © 2006 Tensilica Inc.

Covering Breadth of Processor Demands

[Chart: data-processing efficiency needs across the computing complex. Hard-wired logic, DSPs, and general-purpose CPUs each cover only part of the range; Xtensa configurable processors span the full breadth of processor demands.]

10 © 2006 Tensilica Inc.

Simple Recipe

Highly efficient baseline core + complete instruction set extensibility + highly efficient interconnect support

Baseline core:
• 90nm: 0.1mm², <40μW per MHz
• Energy efficient: <25mW @ 450MHz
• 65nm: ~50K MIPS per watt (see the sanity check below)

Comms:
• High-bandwidth local memory system: 128-256b per cycle
• Nearest-neighbor data-streaming queues (instruction or memory mapped)
• Efficient master or slave bus interface supporting DMA or direct fetch of memory

Full double-precision IEEE floating point ISA:
• 2-way SIMD, 3-way superscalar
• Alternatives: 4-way superscalar, 4-way SIMD, true sequential vectors
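A quick sanity check on these figures, as a minimal sketch: the 90nm numbers (40 μW/MHz, 450 MHz, <25 mW) come from this slide, while the 65nm per-MHz power and the MIPS-per-MHz ratio used below are assumptions, chosen only to show how a figure around 50K MIPS/W could arise.

```c
/* Sanity check on the baseline-core figures above. The 90nm numbers
 * (40 uW/MHz, 450 MHz) are from the slide; the 65nm per-MHz power and the
 * MIPS-per-MHz ratio are illustrative ASSUMPTIONS, not quoted values. */
#include <stdio.h>

int main(void) {
    double uw_per_mhz_90 = 40.0;   /* from the slide                    */
    double mhz           = 450.0;  /* from the slide                    */
    double uw_per_mhz_65 = 20.0;   /* ASSUMPTION for 65nm               */
    double mips_per_mhz  = 1.0;    /* ASSUMPTION                        */

    printf("90nm core power at 450 MHz: %.0f mW (slide: <25 mW)\n",
           uw_per_mhz_90 * mhz / 1000.0);
    printf("65nm efficiency: %.0fK MIPS/W (slide: ~50K MIPS/W)\n",
           mips_per_mhz * 1e6 / uw_per_mhz_65 / 1000.0);
    return 0;
}
```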

11 © 2006 Tensilica Inc.

Benefit of Processor Generation

[Charts: EEMBC certified benchmark scores per MHz – ConsumerMarks, TeleMarks, NetMarks, and OAMarks (Office Automation) – comparing MIPS32b (NEC VR4122), ARM1136 (Freescale i.MX31), MIPS64b (NEC VR5000), ARM1020E, MIPS64 20Kc, Xtensa out-of-box, Xtensa auto-optimized, and Xtensa optimized cores. The conventional cores and the out-of-box Xtensa cluster together (roughly 0.04-0.52 ConsumerMarks/MHz, 0.011-0.03 TeleMarks/MHz, 0.01-0.03 NetMarks/MHz, 0.6-1.0 OAMarks/MHz), while the optimized Xtensa reaches about 2.0 ConsumerMarks/MHz, 0.47 TeleMarks/MHz, 0.123 NetMarks/MHz, and 4.2 OAMarks/MHz.]

Source: EEMBC Certified Benchmarks

12 © 2006 Tensilica Inc.

A Breakthrough in Low Power

[Chart: performance (normalized to ARM1136 @ 333MHz = 1.0) versus core power (mW, 0-200) for ARM and MIPS processors, baseline Tensilica Xtensa processors, and domain-optimized Tensilica processors.]

25-50x lower power at same performance

Performance on EEMBC benchmarks, aggregate for Consumer, Telecom, Office, Network; based on ARM1136J-S (Freescale i.MX31), ARM1026EJ-S, Tensilica Diamond 570T, T1050 and T1030, MIPS 20K, NEC VR5000. MIPS M4K, MIPS 4Ke, MIPS 4Ks, MIPS 24K, ARM 968E-S, ARM 966E-S, ARM926EJ-S, ARM7TDMI-S scaled by ratio of Dhrystone MIPS within architecture family. All power figures from vendor websites, 2/23/2006.

13 © 2006 Tensilica Inc.

Processor Flexibility and Efficiency

[Chart: performance or efficiency (spanning 100-1000x) versus flexibility or generality. Hard-wired logic is the most efficient and least flexible; a traditional processor is the most flexible and least efficient. Application-specific, domain-specific, and general-purpose Xtensa configurations span the space between.]

• Instruction set extensions vary from general-purpose to highly task-specific
• Configurable processors break the Efficiency vs. Flexibility tradeoff

14 © 2006 Tensilica Inc.

True Multi-Processor System-on-Chip
192 Xtensas per chip in Cisco CRS-1 Terabit Router

• 192 Xtensa network processing cores per Silicon Packet Processor
• 16 clusters per chip, 12 processors per cluster
• Complete 32b processor
• Up to 400,000 processors per system

15 © 2006 Tensilica Inc.

Example Parallel Architectures
CRS-1: Massive general-purpose throughput

Routing task runs to completion on each processor:
• IPv4 Unicast
• MPLS – 3 Labels
• Link Bundling (v4)
• Load Balancing L3 (v4)
• 1 Policer Check
• Marking
• TE/FRR
• Sampled Netflow
• WRED
• ACL
• IPv4 Multicast
• IPv6 Unicast
• Per-prefix accounting
• GRE/L2TPv3 Tunneling
• RPF check (loose/strict) v4
• Load Balancing L3 (v6)
• Link Bundling (v6)
• Congestion Control

Chip: 18mm x 18mm, IBM 0.13μm, 18M gates, 8Mbit SRAMs; 50,000 general-purpose MIPS; 175 Gb/s memory bandwidth; 96Gb/s + 96Gb/s interfaces

Programmability also means:
• Ability to juggle feature ordering
• Support for heterogeneous mixes of feature chains
• Rapid introduction of new features

16 © 2006 Tensilica Inc.

Basic Processor Efficiency: The Usual List of Suspects

Per-IC double-precision floating-point efficiency (the arithmetic is re-run in the sketch below):
• Intel Itanium2: 4 DP operations per cycle per processor, 1.4 GHz, 1 processor per IC → 5.6 aggregate DP GFLOPS, ~130 W, 0.06 IC GFLOPS/Watt
• Intel Xeon 5100 (Woodcrest): 4 DP ops/cycle, 3.0 GHz, 2 processors per IC → 24 GFLOPS, ~65 W, 0.4 GFLOPS/Watt
• AMD Opteron (K8L): 4 DP ops/cycle, 3.0 GHz, 2 processors per IC → 24 GFLOPS, ~80 W, 0.3 GFLOPS/Watt
• IBM BlueGene/L (PowerPC 440 ASIC): 4 DP ops/cycle, 0.7 GHz, 2 processors per IC → 5.6 GFLOPS, ~10 W, 0.6 GFLOPS/Watt
• IBM/Sony/Toshiba Cell: 0.6 DP ops/cycle per SPE, 3.2 GHz, 8 SPEs per IC → 15 GFLOPS, ~30 W, 0.5 GFLOPS/Watt
• Xtensa-based SIMD/LIW Scientific Engine: 4 DP ops/cycle, 0.65 GHz, 32 processors per IC → 83 GFLOPS, ~12 W, 7 GFLOPS/Watt

DP FP pipelines in FPGA: 15.9 GFLOPS @ 25W (Xilinx Virtex-4 LX200): 0.63 GFLOPS/W

Source: vendor websites, www.geek.com, www.answers.com
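The table above reduces to one formula: aggregate DP GFLOPS = DP ops/cycle per processor × GHz × processors per IC, and efficiency = GFLOPS / watts. The sketch below simply re-runs that arithmetic on the slide's own inputs; it introduces no new data.

```c
/* Minimal sketch of the arithmetic behind the table above:
 * aggregate DP GFLOPS = DP ops/cycle/processor x GHz x processors per IC,
 * efficiency = GFLOPS / watts. Power figures are the approximate IC powers
 * quoted on the slide, not measured values. */
#include <stdio.h>

struct chip {
    const char *name;
    double dp_ops_per_cycle;   /* per processor (per SPE for Cell) */
    double ghz;
    double processors_per_ic;
    double watts;              /* approximate IC power */
};

int main(void) {
    struct chip chips[] = {
        {"Intel Itanium2",            4.0, 1.4,   1, 130},
        {"Intel Xeon 5100 Woodcrest", 4.0, 3.0,   2,  65},
        {"AMD Opteron K8L",           4.0, 3.0,   2,  80},
        {"IBM BlueGene/L (PPC 440)",  4.0, 0.7,   2,  10},
        {"IBM/Sony/Toshiba Cell",     0.6, 3.2,   8,  30},
        {"Xtensa SIMD/LIW engine",    4.0, 0.65, 32,  12},
    };
    for (size_t i = 0; i < sizeof chips / sizeof chips[0]; i++) {
        double gflops = chips[i].dp_ops_per_cycle * chips[i].ghz
                      * chips[i].processors_per_ic;
        printf("%-28s %5.1f DP GFLOPS  %4.2f GFLOPS/W\n",
               chips[i].name, gflops, gflops / chips[i].watts);
    }
    return 0;
}
```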

17 © 2006 Tensilica Inc.

Example Parallel Architectures
Optimized Scientific Compute Processor

• Optimized for general-purpose double-precision computation on well-structured local data
• Three-issue "FLIX" VLIW (64b instruction word):
  1. 128b load-store (with update), FP convert, general integer ops
  2. General integer op
  3. 2-way SIMD FP op (IEEE add, sub, mul, mul-add, mul-sub) – illustrated in the sketch below
• Free intermixing of 16b, 24b and 64b FLIX instructions
• 8-stage pipe with zero-overhead loops
• 32-entry windowed address register file
• 16-entry SIMD FP register file (16 x 2 x 64b)
• 32KB I cache + 8KB D cache + 64-128KB data RAM
• Local instruction and data cache plus large local data RAM (64-128KB), dual-ported between processor and closely-coupled DMA engine
• Core: 220K gates = 1mm², <100mW @ 650MHz

[Instruction-set summary: SIMD FP ops (VADD, VSUB, VMUL, VMADD, VMSUB, RADD, SWAP, vector compares, VMOV, VMOVEQZ, VMOVGEZ, VMOVLTZ, VMOVNEZ, VMOVT, VMOVF); load/store and convert ops (LDI[U], LVI[U], LVX[U], SD{H,L}I[U], SVI[U], SVX[U], TRUNC{H,L}, CEIL{H,L}, ROUND{H,L}, FLOOR{H,L}, FLOAT{H,L}, UFLOAT{H,L}, VMOV, VNEG, VABS); 27 integer ALU ops plus 45 integer ALU and load/store ops.]
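As a purely illustrative example of the kind of code this issue model targets, the loop below is a plain double-precision multiply-accumulate; a vectorizing compiler for such a core could map each pair of iterations to one FLIX bundle combining a 128b load with update, loop bookkeeping in the integer slot, and a 2-way SIMD mul-add. The function and array names are hypothetical, not from the slide.

```c
/* Hypothetical example of code the vectorizing compiler could map onto the
 * 2-way SIMD double-precision mul-add slot: each pair of iterations becomes
 * one VMADD plus one 128b (2 x 64b) load/store, with the loop itself handled
 * by the zero-overhead loop hardware. Names and sizes are illustrative only. */
#include <stddef.h>

void daxpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    /* y[i] += a * x[i]; n assumed even so iterations pair cleanly into
     * 2-wide SIMD operations on 128b data. */
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```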

18 © 2006 Tensilica Inc.

Example Parallel Architectures
Local Interconnect Structure

[Diagram: each CPU core has an instruction cache, data cache, data RAM, and a closely-coupled DMA engine. Four cores share a "group of 4" bus, and each core has dedicated message paths to its North, West, East, and South neighbor CPUs. A 9x9 crossbar connects the processor groups to XDR memory channels (XDR XIO) and external DRAMs, an optional on-chip eDRAM buffer, PCIe, and a central broadcast, boot and debug processor. A usage sketch of the neighbor message paths follows.]
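To show how the North/West/East/South message paths would typically be exercised, here is a conceptual halo-exchange sketch for a 2D domain-decomposed code. The queue_send/queue_recv primitives are placeholders for whatever streaming-queue or DMA interface the hardware actually exposes; they are assumptions, not a documented API.

```c
/* Conceptual sketch of the nearest-neighbor traffic pattern the N/S/E/W
 * message paths are meant to carry: a 2D domain-decomposed code exchanging
 * one-cell halo rows/columns with its four neighbors each timestep. */
#include <stddef.h>

enum dir { NORTH, SOUTH, EAST, WEST };

/* Stubs standing in for the real streaming-queue or DMA primitives (not
 * specified on the slide); a real implementation would move 'count' doubles
 * over the link toward 'd'. */
static void queue_send(enum dir d, const double *buf, size_t count)
{ (void)d; (void)buf; (void)count; }
static void queue_recv(enum dir d, double *buf, size_t count)
{ (void)d; (void)buf; (void)count; }

#define NX 64
#define NY 64

/* grid is the local (NY+2) x (NX+2) tile including one-cell halos. */
void halo_exchange(double grid[NY + 2][NX + 2])
{
    double col_out[NY], col_in[NY];

    /* North/South: exchange the first and last interior rows. */
    queue_send(NORTH, &grid[1][1], NX);
    queue_send(SOUTH, &grid[NY][1], NX);
    queue_recv(NORTH, &grid[0][1], NX);
    queue_recv(SOUTH, &grid[NY + 1][1], NX);

    /* East/West: gather interior edge columns into contiguous buffers. */
    for (size_t j = 0; j < NY; j++) col_out[j] = grid[j + 1][NX];
    queue_send(EAST, col_out, NY);
    queue_recv(EAST, col_in, NY);
    for (size_t j = 0; j < NY; j++) grid[j + 1][NX + 1] = col_in[j];

    for (size_t j = 0; j < NY; j++) col_out[j] = grid[j + 1][1];
    queue_send(WEST, col_out, NY);
    queue_recv(WEST, col_in, NY);
    for (size_t j = 0; j < NY; j++) grid[j + 1][0] = col_in[j];
}
```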

19 © 2006 Tensilica Inc.

Example Parallel Architectures
Petascale Climate Modeling System

Technical/Economic Challenges:
1. Variance in data-reference/communication patterns across codes
2. Potential for extreme scalability via large-scale processing arrays
3. Size, cost and maintenance strongly correlated to system power dissipation
4. General-purpose CPUs optimized for integer applications – unimpressive performance per $, per watt

Parallel Climate Modeling (Lenny Oliker and Michael Wehner of Lawrence Berkeley Lab): speculation on a much more parallel climate model:
• 1.5km grid for Earth
• 20,000,000 domains
• 500 MFLOPS/domain
• 500 MB/s per domain
• Complex algorithms require general-purpose programmability in double-precision floating point
• 2D communications mesh @ ~20MB/s per domain (totals worked out in the sketch below)

System Architecture Approach:
• Highly suitable for distributed array computation
• Two design challenges:
  • Total memory bandwidth: 5-10 petabytes/s – many parallel local DRAM channels
  • Power: GFLOPS/W is the best predictor of system cost and size
• Best off-the-shelf processor (IBM Cell) is about 0.5 DP GFLOPS/W
• Domain-specific processor approach offers significant potential advantage (>10 DP GFLOPS/W)
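Multiplying out the per-domain figures quoted above gives the system-level totals; this minimal sketch uses only numbers from this slide.

```c
/* Multiplying out the per-domain figures quoted above (1.5km grid,
 * 20,000,000 domains) into system-level totals. All inputs are from the
 * slide; only the arithmetic is added here. */
#include <stdio.h>

int main(void) {
    double domains      = 20e6;
    double mflops_each  = 500;     /* MFLOPS per domain           */
    double mem_bw_each  = 500e6;   /* bytes/s per domain          */
    double mesh_bw_each = 20e6;    /* bytes/s per domain, 2D mesh */

    printf("Compute:   %.1f PFLOPS\n", domains * mflops_each * 1e6 / 1e15);
    printf("Memory BW: %.1f PB/s\n",   domains * mem_bw_each / 1e15);
    printf("Mesh BW:   %.1f PB/s\n",   domains * mesh_bw_each / 1e15);
    return 0;
}
/* Prints roughly 10 PFLOPS of compute, 10 PB/s of aggregate memory
 * bandwidth (consistent with the 5-10 PB/s cited), and 0.4 PB/s of
 * nearest-neighbor mesh traffic. */
```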

20 © 2006 Tensilica Inc.

Example Parallel Architectures
10 PetaFLOPS System Concept: 3.8M processors

System: ~150 m², <5 MWatts, ~$100M
• 120 racks @ ~25KW (power + comms)
• 32 boards per rack
• 32 chip + memory clusters per board (2.7 TFLOPS @ 1000W)

VLIW CPU:
• 128b load-store + 2 DP MUL/ADD + integer op / DMA per cycle
• Synthesizable at 650MHz in commodity 65nm
• 1mm² core, ~3mm² with instruction cache, data cache, data RAM and DMA interface; 0.25mW/MHz
• Double-precision SIMD FP: 4 ops/cycle (2.7 GFLOPS)
• Vectorizing compiler, cycle-accurate simulator, debugger GUI
• 8-channel DMA for streaming from on/off-chip DRAM
• Nearest-neighbor 2D communications grid

[Diagram: processor-array chip. Each of the 32 CPUs has a 32KB instruction cache, 64-128KB data RAM/cache, a 2x128b datapath, and an 8-channel DMA engine, arranged on a nearest-neighbor grid around an optional 8MB embedded DRAM. External DRAM interfaces on all four sides of the chip connect 8 DRAMs per processor chip at 50 GB/s; a master processor and comm link control complete the chip. 32 processors per 65nm chip: 83 GFLOPS @ 12W. Roll-up arithmetic below.]
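Rolling up the per-core and per-chip figures on this slide reproduces the headline numbers. The sketch below uses only quantities quoted above (4 DP ops/cycle at 650 MHz, 32 CPUs per chip, 32 chips per board, 32 boards per rack, 120 racks at ~25 KW), so small rounding differences from the quoted 3.8M processors and 10 PetaFLOPS are expected.

```c
/* Rolling up the per-core figures quoted above into the system totals on
 * this slide. Inputs are taken from the slide; the aggregation is plain
 * arithmetic, so small rounding differences from the quoted 3.8M processors
 * and 10 PFLOPS are expected. */
#include <stdio.h>

int main(void) {
    double gflops_per_cpu  = 4 * 0.65;   /* 4 DP ops/cycle at 650 MHz   */
    int    cpus_per_chip   = 32;
    int    chips_per_board = 32;
    int    boards_per_rack = 32;
    int    racks           = 120;
    double watts_per_rack  = 25e3;       /* ~25 KW incl. power + comms  */

    long   cpus   = (long)cpus_per_chip * chips_per_board * boards_per_rack * racks;
    double pflops = gflops_per_cpu * cpus / 1e6;

    printf("Processors: %.1f M\n", cpus / 1e6);                   /* ~3.9 M     */
    printf("Peak:       %.1f PFLOPS\n", pflops);                  /* ~10 PFLOPS */
    printf("Power:      %.1f MW\n", racks * watts_per_rack / 1e6);/* ~3 MW      */
    return 0;
}
```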

21 © 2006 Tensilica Inc.

Conclusions

Essentials of the New HPC
• The practical and economic limitations of power are real: figure out how to live with 5MW
• The cost of building and operating the system is now much greater than the cost of design: build systems around particular application domains or communications paradigms
• Demonstrated potential to improve compute and local communications energy efficiency by 1-2 orders of magnitude
• Next challenge: global communications efficiency:
  • Develop algorithms with reduced global comms needs
  • Develop ultra-low-energy global comms mechanisms