hardware platforms for embedded computing

66
Hardware platforms for Embedded computing

Upload: skule

Post on 23-Feb-2016

63 views

Category:

Documents


0 download

DESCRIPTION

Hardware platforms for Embedded computing. poor design techniques. The energy/flexibility conflict - Intrinsic Power Efficiency -. Operations/Watt [MOPS/mW]. Ambient Intelligence. 10. DSP-ASIPs. hardwired muxed ASIC. 1. Processors. Reconfigurable Computing. µPs. 0.1. 0.01. Technology. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Hardware platforms for Embedded computing

Hardware platforms for Embedded computing

Page 2: Hardware platforms for Embedded computing

The energy/flexibility conflict- Intrinsic Power Efficiency -

Technology

[H. de Man, Keynote, DATE‘02;T. Claasen, ISSCC99]

Operations/Watt[MOPS/mW]

ProcessorsReconfigurable Computing

hardwired muxed ASIC1

0.1

0.01

0.13µ

Necessary to optimize HW/SW; otherwise the prize for software flexibility cannot be paid!

Ambient Intelligence

0.07µ

DSP-ASIPsµPs

10

0.25µ0.5µ1.0µ

poor design techniques

Page 3: Hardware platforms for Embedded computing

Architectural Choices

P

Prog Mem

MACUnit

AddrGenP

Prog Mem

P

Prog Mem

SatelliteProcessorDedicated

Logic

Satellite

Processor

SatelliteProcessor

GeneralPurpose

P

Software

DirectMapped

Hardware

HardwareReconfigurable

Processor

ProgrammableDSP

Flex

ibili

tyFl

exib

ility

1/Efficiency (power, speed)1/Efficiency (power, speed)

Page 4: Hardware platforms for Embedded computing

The Processor Design Space

Cost

Perf

orm

ance

Microprocessors

Performance iseverything& Software rules

Embeddedprocessors

Microcontrollers

Cost is everything

Application specific architecturesfor performance

Page 5: Hardware platforms for Embedded computing

Area of processor cores = Cost

Nintendo processor Cellular phones

Page 6: Hardware platforms for Embedded computing

Another figure of meritComputation per unit area

Nintendo processor Cellular phones???

Page 7: Hardware platforms for Embedded computing

Embedded vs. general-purpose processors

Embedded processors may be optimized for a category of applications. Customization may be narrow or broad.

We may judge embedded processors using different metrics: Code size. Memory system performance. Preditability.

Page 8: Hardware platforms for Embedded computing

Microcontrollers

CPU ROM RAM

I/O

A single chip

Subsystems:Timers, Counters, AnalogInterfaces, I/O interfaces

Memory

Page 9: Hardware platforms for Embedded computing

Microcontroller Architectures

CPUProgram + Data

Address Bus

Data Bus

Memory

Von NeumannArchitecture

CPUProgram

Address Bus

Data Bus

HarvardArchitecture

Memory

Data

Address Bus

Fetch Bus

0

0

0

2n

Page 10: Hardware platforms for Embedded computing

MCS-51 “Family” of Microcontollers

8051 introduced by Intel in late 1970s Now produced by many companies in

many variations The most pupular microcontroller – about

40% of market share 8-bit microcontroller

Page 11: Hardware platforms for Embedded computing

“Original” 8051 Microcontroller

Oscillator and timing

4096 Bytes Program Memory

128 Bytes Data

Memory

Two 16 Bit Timer/Event

Counters

8051 CPU

64 K Byte Bus Expansion

Control

Programmable I/O

Programmable Serial Port Full Duplex UART

Synchronous Shifter

Internal data bus

External interrupts

subsystem interrupts

Control Parallel portsAddress Data BusI/O pins

Serial InputSerial Output

Page 12: Hardware platforms for Embedded computing

Microcontrollers- MHS 80C51 as an example -• 8-bit CPU optimised for control applications• Extensive Boolean processing capabilities• 64 k Program Memory address space• 64 k Data Memory address space• 4 k bytes of on chip Program Memory• 128 bytes of on chip data RAM• 32 bi-directional and individually addressable I/O lines• Two 16-bit timers/counters• Full duplex UART• 6 sources/5-vector interrupt structure with 2 priority levels• On chip clock oscillators• Very popular CPU with many different variations

Features for Embedded System

s

Page 13: Hardware platforms for Embedded computing

RISC processors RISC generally means

highly-pipelinable, one instruction per cycle.

Pipelines of embedded RISC processors have grown over time: ARM7 has 3-stage

pipeline. ARM9 has 5-stage

pipeline. ARM11 has eight-stage

pipeline.

ARM11 pipeline [ARM05].

Page 14: Hardware platforms for Embedded computing

RISC processor families ARM: ARM7 is relatively simple, no memory

management; ARM11 has memory management, other features.

MIPS: MIPS32 4K has 5-stage pipeline; 4KE family has DSP extension; 4KS is designed for security.

PowerPC: 400 series includes several embedded processors; MPD7410 is two-issue machine; 970FX has 16-stage pipeline.

Page 15: Hardware platforms for Embedded computing

DSP Applications

Audio applications MPEG Audio Portable audio Digital cameras Wireless Cellular

telephones Base station

Networking Cable modems ADSL VDSL

Page 16: Hardware platforms for Embedded computing

Another Look at DSP Applications

High-end Wireless Base Station - TMS320C6000 Cable modem gateways

Mid-end Cellular phone - TMS320C540 Fax/ voice server

Low end Storage products - TMS320C27 Digital camera - TMS320C5000 Portable phones Wireless headsets Consumer audio Automobiles, toasters, thermostats, ...

Incr

easi

ngC

ost

Increasingvolum

e

Page 17: Hardware platforms for Embedded computing

DSP vs. General Purpose MPU

The “MIPS/MFLOPS” of DSPs is speed of Multiply-Accumulate (MAC).

DSP are judged by whether they can keep the multipliers busy 100% of the time.

The "SPEC" of DSPs is 4 algorithms: Inifinite Impule Response (IIR) filters Finite Impule Response (FIR) filters FFT, and convolvers

In DSPs, algorithms are king! Binary compatability not an issue

Software is not (yet) king in DSPs. People still write in assembly language for a product to

minimize the die area for ROM in the DSP chip.

Page 18: Hardware platforms for Embedded computing

Architectural Features of DSPs

Data path configured for DSP Fixed-point arithmetic MAC- Multiply-accumulate

Multiple memory banks and buses - Harvard Architecture Multiple data memories

Specialized addressing modes Bit-reversed addressing Circular buffers

Specialized instruction set and execution control Zero-overhead loops Support for MAC

Specialized peripherals for DSP THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE

DESIGN!!!

Page 19: Hardware platforms for Embedded computing

Application: y[j] = i=0

x[j-i]*a[i]

i: 0i n-1: yi[j] = yi-1[j] + x[j-i]*a[i]

Domain-oriented architectures

Architecture: Example: Data path ADSP210x

n-1

- Parallelism - Dedicated registers

MR

MFMX MY

*+,-

AR

AFAX AY

+,-,..

DP

yi-1[j]

x[j-i]

x[j-i]*a[i]

a[i]

Address generation unit (AGU)

Address- registersA0, A1, A2 ..i+1, j-i+1

ax

MR:=0; A1:=1; A2:=n-2; MX:=x[n-1]; MY:=a[0];for ( j:=1 to n) {MR:=MR+MX*MY; MY:=a[A1]; MX:=x[A2]; A1++; A2--}

Page 20: Hardware platforms for Embedded computing

DSP - Features (1) • Multiply/accumulate (MAC) and zero-overhead loop

(ZOL) instructions (as shown)• Heterogeneous registers (as shown)• Separate address generation units (AGUs)

(as in ADSP 210x)

Page 21: Hardware platforms for Embedded computing

DSP - Features (2) • Modulo

addressing: Am++ Am:=(Am+1) mod n(implements ring or circular buffer in memory)

..x[n-2]x[n-1]x[0]x[1]..

Memory, t=t1

..x[n-3]x[n-2]x[n-1]x[n]x[1]

Memory, t2=t1+1

sliding windowt2x

t1

t

Page 22: Hardware platforms for Embedded computing

Multiple memory banks or memories

MR

MFMX MY

*+,-

AR

AFAX AY

+,-,..

DP

Address generation unit (AGU)

Address- registersA0, A1, A2 ..

Simplifies parallel fetches

Page 23: Hardware platforms for Embedded computing

Very long instruction word (VLIW) processorsKey idea: detection of possible parallelism to be done by compiler, not by hardware at run-time (inefficient).

VLIW: parallel operations (instructions) encoded in one long word (instruction packet), each instruction controlling one functional unit. E.g.:

Page 24: Hardware platforms for Embedded computing

The Texas InstrumentsTMS 320C6xx as an example

31 0

0Instr. A

31 0

0Instr. D

31 0

1Instr. F

31 0

0Instr. G

31 0

1Instr. E

31 0

1Instr. C

31 0

1Instr. B

Cycle Instruction

1 A2 B C D3 E F G

Instructions B, C and D use disjoint functional units, cross paths and other data path resources. The same is also true for E, F and G.

Bit in each instruction encodes end of parallel execution

Parallel execution cannot span several packets.

Page 25: Hardware platforms for Embedded computing

Partitioned register files

register file A register file B

L1 S1 M1 D1 D2 M2 S2 L2

Data bus

Address bus

Data path A Data path B

• Many memory ports are required to supply enough operands per cycle.

• Memories with many ports are expensive. Registers are partitioned into (typically 2) sets, e.g. for TI

C60x:

Page 26: Hardware platforms for Embedded computing

Instruction types are mapped tofunctional unit types

There are 4 functional unit (FU) types: M: Memory Unit I: Integer Unit F: Floating-Point Unit B: Branch Unit

Instruction types corresponding FU type,except type A (mapping to either I or M-functional units).

Page 27: Hardware platforms for Embedded computing

Large # of delay slots,a problem of VLIW processors

The execution of many instructions has been started before it is realized that a branch was required.Nullifying those instructions would waste compute power Executing those instructions is declared a feature, not a bug. How to fill all „delay slots“ with useful instructions? Avoid branches wherever possible.

add sub and or

sub mult xor div

ld st mv beq

Page 28: Hardware platforms for Embedded computing

Predicated execution:Implementing IF-statements „branch-free“

Conditional Instruction „[c] I“ consists of:• condition c• instruction I

c = true => I executedc = false => NOP

Page 29: Hardware platforms for Embedded computing

Predicated execution:Implementing IF-statements „branch-free“: TI C6x

if (c){ a = x + y; b = x + z;}else{ a = x - y; b = x - z;}

Conditional branch

[c] B L1 NOP 5 B L2 NOP 4 SUB x,y,a || SUB x,z,bL1: ADD x,y,a || ADD x,z,bL2:

Predicated execution

[c] ADD x,y,a|| [c] ADD x,z,b|| [!c] SUB x,y,a|| [!c] SUB x,z,b

max. 12 cycles 1 cycle

Page 30: Hardware platforms for Embedded computing

Roadmap continues: 906545 nm “Traditional” Bus-based SoCs fit in one tile !!

Architecture Evolution

Communication demand is staggering, but unevenly distributed, because of architectural heterogeneity

I/0

I/0

PE

PE PE PE

SRAM SRAM

DRAM

I/O

I/OPERIPHERALS

3D stacked m

ain mem

ory

PE

LocalMemory

hierarchy

CPU

i/o

Page 31: Hardware platforms for Embedded computing

Multicores Are Here!

1985 199019801970 1975 1995 2000

4004

8008

80868080 286 386 486 Pentium P2 P3P4Itanium

Itanium 2

2005 20??

# of

cor

es

1

2

4

8

16

32

64

128

256

512

Athlon

Raw

Power4 Opteron

Power6

Niagara

YonahPExtreme

Tanglewood

Cell

IntelTflops

Xbox360

CaviumOcteon

RazaXLR

PA-8800

CiscoCSR-1

PicochipPC102

Boardcom 1480 Opteron 4PXeon MP

AmbricAM2045

[Amarasinghe06]

Page 32: Hardware platforms for Embedded computing

MPSoC – 2005 ITRS roadmap

2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 20200

200

400

600

800

1000

60

50

40

30

20

10

0

1200

Num

ber

of P

roce

ssin

g En

gine

s

Logi

c, M

emor

y Si

ze (N

orm

aliz

ed to

200

5)

Number of Processing Engines(Right Axis)

Total Logic Size(Normalized to 2005, Left Axis)

Total Memory Size(Normalized to 2005, Left Axis)

16 23 32 46 63 79101

133 161212

268

348

424

526

669

878

[Martin06]

Page 33: Hardware platforms for Embedded computing

Power is the Challenge!

0

200

400

600

800

1000

1200

1400

90nm 65nm 45nm 32nm 22nm 16nm

Pow

er (W

), Po

wer

Den

sity

(W/c

m2 )SiO2 LkgSD LkgActive

10 mm Die

Technology, Circuits, and Architecture to constrain the power

Page 34: Hardware platforms for Embedded computing

Near Term Solutions Move away from Frequency alone to

deliver performance More on-die memory Multi-everywhere

Multi-threading Chip level multi-processing

Throughput oriented designs Performance by higher level of

integration

Page 35: Hardware platforms for Embedded computing

Architecture Techniques

0%

25%

50%

75%

100%

1u 0.5u 0.25u 0.13u 65nm

Cach

e %

of T

otal

Are

a

486 Pentium®

Pentium® III

Pentium® 4

Pentium® M

Increase on-die Memory

ST Wait for Mem

MT1 Wait for MemMT2 Wait

MT3

Single ThreadSingle Thread

Multi-ThreadingMulti-Threading

Full HW Utilization

Multi-threading

Improved performance, no impact on thermals & power delivery

C1 C2

C3 C4

Cache

Chip Multi-processing

LargeCore 1

1.5

2

2.5

3

3.5

1 2 3 4Die Area, Power

Rel

ativ

e Pe

rfor

man

ce

Multi Core

Single Core

Page 36: Hardware platforms for Embedded computing

Multi-Core

C1 C2

C3 C4

Cache

Large Core

Cache

1

2

3

4

1

2 SmallCore 1 1

1

2

3

4

1

2

3

4

Power

PerformancePower = 1/4

Performance = 1/2

Multi-Core:Power efficient

Better power and thermal management

Page 37: Hardware platforms for Embedded computing

Embedded vs. General Purpose

Embedded Applications Asymmetric Multi-Processing

Differentiated Processors Specific tasks known early

Mapped to dedicated processors Configurable and extensible

processors: performance, power efficiency

Communication Coherent memory Shared local memories HW FIFOS, other direct connections

Dataflow programming models Classical example – Smart mobile –

RISC + DSP + Media processors

Server Applications Symmetric Multi-Processing

Homogeneous cores General tasks known late

Tasks run on any core High-performance, high-speed

microprocessors Communication

large coherent memory space on multi-core die or bus

SMT programming models (Simultaneous Multi-Threading)

Examples: large server chips (eg Sun Niagara 8x4 threads), scientific multi-processors

Page 38: Hardware platforms for Embedded computing

MPSoC architectures

Page 39: Hardware platforms for Embedded computing

Example system platforms

Generic Automotive Wireless Multimedia

Page 40: Hardware platforms for Embedded computing

PC-based platform

Basic hardware components: CPU; memory; timers; DMA; minimal I/O devices.

Basic software: BIOS.

Page 41: Hardware platforms for Embedded computing

PC-style hardware architecture

CPU

system bus

memory

DMAcontroller

timers

businterface

brid

ge

high-speed bus

low-speed bus

I/O

I/O

Page 42: Hardware platforms for Embedded computing

Strong ARM StrongARM system includes:

CPU chip (3.686 MHz clock) system control module (32.768 kHz

clock). Real-time clock; operating system timer general-purpose I/O; interrupt controller; power manager controller; reset controller.

Page 43: Hardware platforms for Embedded computing

Pros and cons

Plentiful hardware options. Simple programming semantics. Good software development

environments. Performance-limited.

Page 44: Hardware platforms for Embedded computing

TI Open Wireless Multimedia Applications Platform Dual-processor shared memory system:

GPPOS

DSPmanager

General-purposeprocessor

DSP

DSPOS

DSPtask

& I/Octrl

bridge

Memctrl

external memory

http://www.ti.com/sc/docs/apps/wireless/omap/overview.htm

Page 45: Hardware platforms for Embedded computing

TI OMAP™ Hardware platform

I-MMU D-MMU

I-Cache

RISC Core

MMU

I-Cache Internal RAM/ROM

DSP Core+

Appl Coprocessors

DMA

Memory & Traffic Controller

ProgramMemory SDRAM

PeripheralsLCD Controller, Interrupt Handlers, Timers, GPIO, UARTs, ...

ARM9 core 16KB I-

cache 8KB D-cache 2-way set

associative 150 MHz

C55x DSP core

16KB I-cache 8KB RAM set 2-way set

associative 200 MHz

D-Cache

Page 46: Hardware platforms for Embedded computing

OMAPI Standard (ST/TI)

Goal: standardize the interfaces between application processor and peripheral devices in a mobile product

Provide standard services (APIs) in the OS that can be used by application developers

Page 47: Hardware platforms for Embedded computing

STMicro Nomadik platformMain Core

Memory System HW Accelerators I/Os

Page 48: Hardware platforms for Embedded computing

Nomadik SW platform

Compliant with OMAPI standard

Page 49: Hardware platforms for Embedded computing

Scalable VLIW Media Processor:• 100 to 300+ MHz• 32-bit or 64-bit

Nexperia™

System Buses• 32-128 bit

General-purpose Scalable RISC Processor• 50 to 300+ MHz• 32-bit or 64-bit

Library of DeviceIP Blocks• Image coprocessors• DSPs• UART• 1394• USB…and more

TM-xxxxTM-xxxxD$D$

I$I$

TriMedia CPUTriMedia CPU

DEVICE IP BLOCKDEVICE IP BLOCK

DEVICE IP BLOCKDEVICE IP BLOCK

DEVICE IP BLOCKDEVICE IP BLOCK

.. .. ..

DVP SYSTEM SILICON

PI B

US

SDRAM

MMI

DVP

MEM

ORY

BU

S

DEVICE IP BLOCK

PRxxxxD$

I$

MIPS CPU

DEVICE IP BLOCK. . .

DEVICE IP BLOCK

PI B

US

TriMedia™MIPS™

Philips Digital Video Nexperia Platform

Page 50: Hardware platforms for Embedded computing

MiddlewareJavaTV, TVPAK, OpenTV, MHP/Java, proprietary ...

Applications

Nexperia HardwareNexperia Hardware

Streaming andStreaming and Platform SoftwarePlatform Software K

erne

l: pS

OS

, Win

-CE

, Ja

vaO

S

Nexperia-DVP SoftwareNexperia™ -DVP Software Architecture

Supports multiple OSs and middleware software

Abstracts platform functionality via consistent APIs

Nexperia™-DVP Streaming Software

Encapsulates implementation of streaming media components (hardware and software)

Nexperia™ Platform Software OS independent device

drivers for on-chip and off-chip devices

Page 51: Hardware platforms for Embedded computing

Infineon Automotive Platform

TC1166

Applications High Performance drives / servo drives, Industrial control RoboticsFeatures 32-bit super-scalar TriCoreTM V1.3

CPU, 4 stage pipelineFully integrated DSP capabilitiesSingle precision floating point unit (FPU)80 MHz at full industrial temperature range

32-bit peripheral control processor with single cycle instruction (PCP2)

Memories1.5 MByte embedded progr. flash with ECC32 KByte data flash - EEPROM emulation56 KBSRAM, 8 KB I$, 16 KB Imem

8-channel DMA controller Interrupt system with 2 x 255 hardware

priority arbitration levels serviced by CPU and PCP2 Coprocessor

Triple bus structure: 64-bit local memory buses to internal flash and data memory, 32-bit system peripheral bus, 32-bit remote peripheral bus

Page 52: Hardware platforms for Embedded computing

HW layer

SW Platform layer(> 60% of total SW)

Application Platform layer(10% of total SW)

Controllers Library

OSEKRTOS

OSEKCOM

I/O drivers & handlers(> 20 configurable modules)

Application Programming Interface

Boot Loader

Sys. Config.

Transport

KWP 2000

CCP

ApplicationSpecificSoftware

Speedom

eterTachom

eterW

ater temp.

Speedom

eterTachom

eterO

dometer

---------------

ApplicationLibraries

Nec78k HC12HC08 H8S26 MB90

SW Platform Reuse> 70%

of total SW

CustomerLibraries

MOSAIC SW Architecture & Components for Automotive Dashboard and Body Control

Page 53: Hardware platforms for Embedded computing

Special Purpose processor

Stream processorGraphic processorNetwork processor

Dynamically Reconfigurable Processors

FPGA 、 Reconfigurable systems

Dedicated hardware

ProgrammableHardware

DSP

General purposeCPU

ConfigurableProcessor

Tile Processor

HomogeneousChip-multiprocessor

Specialinstructions

MultipleCores

HeterogeneousMultiprocessor

Multiple Cores

High performance forwide application field

High performance for narrow application fieldArchitecture trends

Page 54: Hardware platforms for Embedded computing

Task Specific (configurable) Processors

HDL GENERATOR

Silicon

RTL synthesis

Silicon

µcode

Processor modelD

D

Applications

SysC specs

ISADP

Courtesy:Target

Compilers T

RWTH AACHEN Lisatek(CoWare);IMEC Target Compiler T, ARM OptimoDEPHILIPS Siliconhive; TENSILICA, PicoChip…

INSTRUCTION SET SIMULATOR

HDLModel Break

Step

RETARGETABLE

COMPILER

Machinecode

MACDAPACSACH Y,1NEGLAR AR3,#X…

Page 55: Hardware platforms for Embedded computing

Multi-issue instruction

L operations packed in one long instruction

M copies of storage and function

SIMD operation

Parallelism at Three Levels in Extensible Instructions

Parallelism: L x M x NExample: 3 x 4 x 3 = 36 ops/cycle

op

op

N dependent operations

implemented as single

fused operation

const

register and constant inputs

reg

Fused operation

reg reg reg

op

Three forms of instruction-set parallelism:• Very Long Instruction Word (VLIW)• Single Instruction Multiple Data (SIMD) aka “vectors”• Fused operations aka “complex operations”

Page 56: Hardware platforms for Embedded computing

addi addi

l 8 u i

sub

abs

add

l 8 u i

Example:SAD (sum of absolute differences)

short total = 0;char *p1, *p2;for i =1,m for j =1,n total + = abs(*p1++ - *p2++)

Original C Code

SLOT 2

SLOT 1

SLOT 0

Sample Software Pipelined ScheduleVector + Fusion + FL I X Configuration

loop j =1, n / 8 by 2: liu9x8[j]; liu9x8[j]; fusion[j-2] liu9x8[j+1]; liu9x8[j+1]; fusion[j-1]

N O YES

Vectorize?2

abs9 x 8

cvt9_16

add16 x 8

sub9 x 8

l iu 9 x 8l iu 9 x 8

48

fusion

fusion

Page 57: Hardware platforms for Embedded computing

Dynamically Reconfigurable Processors

Reconfigurable systems → Previous lesson Flexible but It takes 10’s milliseconds for dynamic

reconfiguration. Dynamically Reconfigurable Processors

Improves area efficiency by changing hardware structure. IPs used in various SoCs. History

Reconfigurable Co-processor Garp(1997), CHIMAERA(2000) Multicontext reconfigurable devices WASMII(1992),Time-multiplexing

FPGA(1997), PipeRench(1998), DRL(1998) Functional-level synthesis

Various commercial products are available since 2000 IPFlex DAPDNA-2, NEC electronics DRP-1, PACT Xpp, Elixent DFabrix

SONY’s VME(Virtual Mobile Engine) is embedded in Network Workman and PSP

Recently, many Japanese vendors start to develop commercial products

Fujitsu Hitachi Lucent Sanyo Toshiba  ( Mep+D-Fabrix)

Page 58: Hardware platforms for Embedded computing

What is Configurable Computing?Spatially-programmed connection of Spatially-programmed connection of processing elementsprocessing elements

“Hardware” customized to specifics of problem.

Direct map of problem specific dataflow, control.

Circuits “adapted” as problem requirements change.

Page 59: Hardware platforms for Embedded computing

Spatial vs. Temporal Computing

Spatial Temporal

Page 60: Hardware platforms for Embedded computing

Processor vs. FPGA Area

Page 61: Hardware platforms for Embedded computing

Processing Element Specialized for media/stream processing Coarse grain ⇔ Fine grain: LUT of FPGAs Components

ALU Shifter + Mask unit Multiplexers Registers

Operations and interconnection between components are changeable

No instruction fetch mechanism : A part of large datapath

Page 62: Hardware platforms for Embedded computing

Reconfigurable HW (DSP fabric)

Target signal processing and arithmetic intensive applications

Reconfigurable array of simple DSP core (CNode)

Low power architecture Hierarchical clock gating Distributed leakage control (fine grain power gating)

Programmable DMA engine

Reconfigurable at run time, multi task

Page 63: Hardware platforms for Embedded computing

Mapping Flow

• Alus execute a cyclic micro-sequence

• Data exchanges through hierarchical clustered interconnect

• Configuration step is sequence loading and interconnect programming

Data in Data out

ILP + software pipelining

Procedure(In,Out,inout)

Constant A,b,c,…;

Begin

X=a-in[0];

……..

End;

Behavioral code

Data in Data out

Data in Data out

Data in

Data out

Partitioning/static scheduling

DFG

Coarse grained configuration

MUX

Clusters Level0

Mux level 2

N0_i

N0_o

N2_o N2_i

N1_i N1_o

Level 1

Page 64: Hardware platforms for Embedded computing

Mapping Flow 3D optimization problem

(place/route/schedule)

Traditional scheduling techniques for VLIW or clustered VLIW don’t apply The solution don’t take into account the spatial

dimension of the problem

Traditional P&R used in FPGA don't apply neither because they don't consider the time dimension

Page 65: Hardware platforms for Embedded computing

Putting it all together

Constant SoC Die Size Slow evolution of peripherals (area decrease) GP CPU sub-system complexity 2x each node (constant

area), Embedded Memory capacity 2x at each node (constant area) Loosely coupled DSP sub-system complexity increase by

30% at each node (30% area decrease)

2004 2006 2008 2010 2012 Technology Node (nm) 90 65 45 32 22 Loosely coupled Sub-Systems 2 4 6 8 12 General Purpose CPU Single Multiple Hardware Accelerator Hardwired Reconfigurable

Page 66: Hardware platforms for Embedded computing

Interconnect

4MB Multi-port Embedded

Memory HostCore 2

L1L2

Peripherals& analog

What can fit in 45mm² in 45nmL1

DSP

HW

DMA

L1

DSP

HW

DMA

L1

DSP

HW

DMA

L1

DSP

HW

DMA

L1

DSP

HW

DMA

L1

DSP

HW

DMA

Programmable Multimedia Accelerator

ImagingH/W192 CNode

(40 GOPS)

HostCore 1

L1

VideoH/W