
The last lesson: Recent Embedded Architectures

Hideharu Amano

Embedded processors

• Cost/power-centric; performance is targeted at specific applications

• RISC processors
• Shrunk instruction sets are provided
– ARM (ARM)
– MIPS (MIPS)
– SH (Hitachi/Renesas)

• Works at 60 MHz to 800 MHz depending on the application

→ Performance was sufficient until the 1990s

MOPS (Million Operations Per Second) for various embedded applications

[Figure: required MOPS on a logarithmic axis from 10 to 10000. Applications shown: MPEG2/4 decoding, MPEG2/4 encoding, JPEG encoding/decoding, MP3 encoding/decoding, Dolby encoding/decoding, 100K-word identification, 5K-sentence translation, 3-dimensional image generation, 2-dimensional image generation, VoIP modem, CDMA modem. Categories: image processing, voice/music, graphics, communication.]

The performance of a simple RISC processor is not enough.

Performance enhancement techniques for CPU

• Instruction-level parallel processing
– Superscalar
– VLIW (Very Long Instruction Word)
– SMT (Simultaneous Multithreading)
• Superscalar
– Dynamic scheduling of instructions
– Using high clock frequency
– Sophisticated branch prediction
• Thread-level parallel processing
– MIMD (Multiple Instruction streams, Multiple Data streams)
– SIMD (Single Instruction stream, Multiple Data streams)
– Chip-multiprocessors

Efficient for every CPU and, of course, useful for embedded CPUs, but they increase cost and power consumption.

Embedded CPU + Hardware Accelerator

[Figure: an embedded CPU, a hardware accelerator, RAM, and I/O blocks connected by an on-chip bus or on-chip network.]

A hardware accelerator is suitable for achieving high performance in a specific application.

Various types of architectures exist for embedded processing.

Amdahl’s Law

• $$\text{Total SpeedUp} = \frac{1}{(1 - \text{ratio of acceleration}) + \dfrac{\text{ratio of acceleration}}{\text{SpeedUp of acceleration}}}$$
• Example: 100 times acceleration.
• If the ratio of acceleration is 50%, the total speedup is only about 1.98 times.
• Fortunately, the ratio is large in media processing.
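Amdahl's law is easy to check numerically. The following is a minimal Python sketch; the function name `total_speedup` and its parameter names are illustrative and do not come from the lesson.

```python
def total_speedup(ratio, accel_speedup):
    """Amdahl's law: overall speedup when a fraction `ratio` of the
    work is accelerated by a factor `accel_speedup`."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if accel_speedup <= 0:
        raise ValueError("accel_speedup must be positive")
    return 1.0 / ((1.0 - ratio) + ratio / accel_speedup)


# A 100x accelerator that can be applied to 50% of the work.
half = total_speedup(0.5, 100)
print(round(half, 2))  # about 1.98

# Even an infinitely fast accelerator cannot exceed 1 / (1 - ratio).
limit = 1.0 / (1.0 - 0.5)
print(half < limit)
```

The non-accelerated part bounds the result, which is why a large accelerated ratio matters more than a very fast accelerator.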

Various embedded architectures

[Figure: map of embedded architectures, ranging from "high performance for a wide application field" to "high performance for a narrow application field". Entries: general-purpose CPU, DSP, configurable processor, homogeneous chip-multiprocessor, heterogeneous multiprocessor, tile processor, dynamically reconfigurable processors, special-purpose processors (stream processor, graphics processor, network processor), programmable hardware (FPGA, reconfigurable systems), dedicated hardware. Labels on the map: "special instructions", "multiple cores".]

Hardware/Software Co-design

[Figure: system design flow. Specification analysis → system spec. → hardware/software division → hardware spec. and software spec. → hardware functional synthesis, interface generation, program generation → hardware design, interface design, program → co-verification.]

High-level design cost can be reduced. Recently, low-level design cost has increased.

Configurable Processor / Integrated Platform

• Configurable processor
– Hardware accelerators and special-purpose processors can be combined as special instructions.
  • ARC (ARC)
  • Xtensa (Tensilica)
  • MeP (Toshiba)
  • Triton (Poseidon Design Systems)
– Various types of interconnection are possible.
– Integrated software environment

• Integrated platform → standard components
– UniPhier (Matsushita)

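Configurable Processor MeP

[Figure: MeP structure. Modules MM1, MM2, ..., MMn are connected by a global bus. Blocks shown: 32-bit processor core (configuration: optional instructions, memory size, interrupt, debugging, ...), MeP core extension (extended instructions, UCI, DSP, VLIW, ...), hardware engine, bus interface, local bus.]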

Multi-Core/Multiprocessor

• Heterogeneous processors
– Special-purpose processors for each application
– High performance/cost
– Different programming for different processors → complicated bugs!

• Homogeneous processors
– Multiple general-purpose processors
– Programming environments for servers can be introduced.
  • Parallel OS, parallel compilers
– Dynamic voltage control/dynamic frequency control → necessary performance with optimized power

• Each processor executes its own task ⇔ different from tile processors
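The idea behind dynamic voltage/frequency control can be sketched with the common approximation that dynamic power scales as C · V² · f. The operating points below are hypothetical values chosen for illustration, and the function names are not from the lesson.

```python
# Hypothetical operating points: (frequency in MHz, supply voltage in V).
OPERATING_POINTS = [(100, 0.9), (200, 1.0), (400, 1.2)]


def dynamic_power(freq_mhz, voltage, capacitance=1.0):
    """Relative dynamic power using the C * V^2 * f approximation."""
    return capacitance * voltage ** 2 * freq_mhz


def pick_operating_point(required_mhz):
    """Return the lowest-power point whose frequency meets the requirement."""
    candidates = [p for p in OPERATING_POINTS if p[0] >= required_mhz]
    if not candidates:
        raise ValueError("no operating point is fast enough")
    return min(candidates, key=lambda p: dynamic_power(*p))


light = pick_operating_point(150)  # a light workload
heavy = pick_operating_point(300)  # a heavy workload
saving = 1 - dynamic_power(*light) / dynamic_power(*heavy)
print(light, heavy, round(saving, 2))
```

Because voltage enters squared, running a light task at a lower voltage and frequency saves more power than lowering frequency alone.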

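NEC MP211

[Figure: block diagram of the NEC MP211. Processors: three ARM926 cores (PE0, PE1, PE2) and an SPX-K602 DSP. Accelerators: 3D accelerator, rotator, image accelerator, security accelerator. Memory: on-chip SRAM (640KB), instruction RAM, SDRAM controller, SRAM interface, with external FLASH and DDR SDRAM. Other blocks: DMAC, USB OTG, camera/DTV interface, LCD interface, async bridges 0 and 1, APB bridges 0 and 1, bus interface, scheduler, IIC, UART, TIM0 to TIM3, WDT, memory card, PCM, PLL, OSC, PMU, INTC, GPIO, SIO, SMU, uWIRE, with an external camera and LCD.]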

Cell (IBM/SONY/Toshiba)

[Figure: block diagram of Cell. PPE: CPU core (IBM Power) with PXU, L1 cache (32KB+32KB), and L2 cache (512KB). Eight SPEs (Synergistic Processing Elements, SIMD cores), each with an SXU, a local store (LS, labeled "512KB Local Store"), and DMA. EIB: 2+2 ring bus. MIC to external DRAM, BIC to Flex I/O.]

NUMA machines that share a single address space

MPCore (ARM+NEC)

[Figure: block diagram of MPCore. Four CPU/VFP cores, each with L1 memory, a CPU interface, a timer, and a watchdog. Other blocks: interrupt distributor (IRQ to each core, private FIQ lines), Snoop Control Unit (SCU) with duplicated L1 tags, coherence control bus, private peripheral bus, L2 cache, private AXI read/write 64-bit bus.]

Tile Processor/Processor Array

• Each PE has its own PC and fetches instructions from its own instruction memory. → Classified as a NORMA machine
• However, it is close to the dynamically reconfigurable processors shown later.
– A single task is executed with all PEs ⇔ multiprocessors
– Heterogeneous PEs
– A lot of homogeneous PEs
– Program is embedded.
– Simple interconnection network
– The concept of context switching
– The target is image processing and media processing.
– MIT RAW
– Quicksilver’s ACM
– MorphTech’s rDSP
– PicoChip’s PC101

MIT’s RAW

[Figure: a RAW tile. Computing processor (8 stages, 32-bit, single issue, in order) with a 4-stage pipelined FPU, a 96KB I-cache, and a 32KB D-cache; communication processor with 8 32-bit channels.]

On-chip NORMA system for embedded applications

ACM (Quicksilver)

[Figure: adaptive nodes, domain nodes, and programmable nodes grouped into Level 1, Level 2, and Level 3 clusters and connected by a matrix interconnect network.]

Dynamically Reconfigurable Processors

• Reconfigurable systems → previous lesson
– Flexible, but dynamic reconfiguration takes tens of milliseconds.

• Dynamically reconfigurable processors
– Improve area efficiency by changing the hardware structure.
– Used as IP in various SoCs.
– History
  • Reconfigurable co-processors: Garp (1997), CHIMAERA (2000)
  • Multicontext reconfigurable devices: WASMII (1992), Time-multiplexing FPGA (1997), PipeRench (1998), DRL (1998)
  • Functional-level synthesis
– Various commercial products have been available since 2000.
  • IPFlex DAPDNA-2, NEC Electronics DRP-1, PACT Xpp, Elixent D-Fabrix
– SONY’s VME (Virtual Mobile Engine) is embedded in the Network Walkman and PSP.
– Recently, many Japanese vendors have started to develop commercial products.
  • Fujitsu
  • Hitachi
  • Lucent
  • Sanyo
  • Toshiba (MeP + D-Fabrix)

Processing Element

• Specialized for media/stream processing: coarse grain ⇔ fine grain (LUTs of FPGAs)
• Components
– ALU
– Shifter + mask unit
– Multiplexers
– Registers
• Operations and interconnections between components are changeable.
• No instruction fetch mechanism: each PE is a part of a large datapath.

Chameleon CS2112 DPU

[Figure: DPU datapath controlled by an instruction. Two routing MUXes, each followed by a register & mask unit, feed an operation unit (OP); registers and a barrel shifter are also shown. OP: operations in C or Verilog. Data width: 32 bit / 16 bit.]

SIMD arrays and pipelines are formed with multiple DPUs.

Dynamic reconfiguration

• Compared with FPGAs, coarse-grain PEs are area-efficient for media/stream processing.
→ However, the flexible part requires semiconductor area: not comparable with hardware accelerators
• But it is flexible!
→ Dynamic reconfiguration: by changing the hardware structure, the same semiconductor area can be used for multiple tasks.

Instructions/configuration data delivery

[Figure: instructions/configuration data are delivered from on-chip memory to the PEs.]

• Takes tens of microseconds
• PACT Xpp
• Elixent’s D-Fabrix

Multiple tasks can be switched → high area efficiency

Xpp (PACT Informationstechnologie)

[Figure: Xpp structure. Four PACs with I/O blocks, each with a CM, under one SCM. PAC: Processing Array Cluster, CM: Configuration Manager, SCM: Supervising CM. PAE types: RAM, ALU, I/O.]

• Xpp64 (8x8 PAC) is available.
• Configuration requires hundreds of clock cycles.
• 24-bit data, 40 MHz clock

Elixent D-Fabrix

[Figure: D-Fabrix processing array. An array of 4-bit ALUs and registers (R) connected through RAM-based switch boxes, with RAM blocks (8-bit address, 8-bit data).]

Stretch S5 engine

[Figure: Stretch S5 engine. MMU, instruction cache, data cache, instruction unit, load/store unit; FP unit (FR, FPU), integer unit (AR, ALU), extension unit (WR, ISEF).]

Multicontext reconfiguration

[Figure: n SRAM slots (contexts 1, 2, ..., n) are selected through a multiplexer to configure the logic cells between input data and output data. A context pointer selects the context memory entry applied to the PEs and switches.]

Multiple sets of configuration data can be switched within a clock cycle.

Context memory is built into the PEs/switches.

Fujitsu’s MPLD using ROMs (1990), WASMII using RAM (1992), Xilinx’s proposal (1997), NEC’s DRL (1998), Chameleon CS2112 (2000)
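The context pointer mechanism can be illustrated with a toy model. This is a sketch only: the class `MulticontextPE` and its methods are hypothetical, and a real device stores full PE and switch configurations rather than a single operation per context.

```python
import operator


class MulticontextPE:
    """Toy PE whose operation is selected by a context pointer."""

    def __init__(self, contexts):
        # Context memory: one configuration (here, one operation) per slot.
        self.context_memory = list(contexts)
        self.context_pointer = 0

    def switch_context(self, index):
        # Switching only moves the pointer; no configuration data is reloaded.
        if not 0 <= index < len(self.context_memory):
            raise IndexError("no such context")
        self.context_pointer = index

    def execute(self, a, b):
        return self.context_memory[self.context_pointer](a, b)


pe = MulticontextPE([operator.add, operator.sub, operator.mul])
results = []
for ctx in (0, 1, 2):
    pe.switch_context(ctx)
    results.append(pe.execute(6, 3))
print(results)  # [9, 3, 18]
```

The same "hardware" computes three different functions, which is the source of the area efficiency discussed in this lesson.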

Double buffering using multicontext devices

• Tasks are switched without overhead.

[Figure: timeline. While Task N is executing, the configuration data of Task N+1 is loaded; while Task N+1 is executing, the configuration data of Task N+2 is loaded.]
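The benefit of double buffering can be estimated with a small timing model. The sketch below assumes a device with exactly two contexts and made-up load and execution times; the function names are illustrative.

```python
def serial_time(tasks):
    """Total time when each task's configuration is loaded, then executed."""
    return sum(load + run for load, run in tasks)


def double_buffered_time(tasks):
    """Total time with two contexts: the next task's configuration is
    loaded into the idle context while the current task executes."""
    load_end = []
    run_end = []
    for i, (load, run) in enumerate(tasks):
        prev_load_end = load_end[i - 1] if i >= 1 else 0
        # The context slot being loaded was last used by task i-2.
        slot_free = run_end[i - 2] if i >= 2 else 0
        load_end.append(max(prev_load_end, slot_free) + load)
        prev_run_end = run_end[i - 1] if i >= 1 else 0
        run_end.append(max(load_end[i], prev_run_end) + run)
    return run_end[-1] if run_end else 0


# Three tasks, each with load time 2 and execution time 5 (arbitrary units).
tasks = [(2, 5), (2, 5), (2, 5)]
print(serial_time(tasks), double_buffered_time(tasks))  # 21 17
```

As long as loading is shorter than execution, only the first load is visible in the total time.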

IPFlex’s DAPDNA-2

[Figure: block diagram of DAPDNA-2. DAP (RISC), DMA controller, interrupt controller, timer, SROM IF, GPIO, UART, serial IF, DDR SDRAM IF (64-bit, 166 MHz), PCI IF (32-bit, 66 MHz), BSU, DNA load buffer, DNA store buffer, DNA direct I/O (async. in and async. out), and the DNA matrix of 368 heterogeneous PEs (ALU, memory, delay).]

Time-multiplexing execution of a single task

[Figure: the target hardware is divided and mapped onto a reconfigurable device in a time-multiplexed manner.]

If the performance simply becomes 1/n, the performance/area is not increased.

Even in dedicated hardware, not everything can be done in a single clock cycle. In this example, the target hardware takes 4 clock cycles, while the dynamically reconfigurable processor requires 8 clock cycles → the performance/area is improved.
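The performance/area argument can be made concrete with a small calculation. The cycle counts (4 and 8) come from the slide; the area figures are assumptions made only for illustration.

```python
def performance_per_area(cycles, area):
    """Performance (1 / cycles) divided by area, in arbitrary units."""
    return (1.0 / cycles) / area


# From the slide: dedicated hardware needs 4 cycles, the DRP needs 8.
# Assumed for illustration: the DRP occupies 1/4 of the dedicated area
# because it reuses the same PEs across contexts.
dedicated = performance_per_area(cycles=4, area=1.0)
drp = performance_per_area(cycles=8, area=0.25)
gain = drp / dedicated
print(gain)  # 2.0

# Break-even case: if halving performance only halved the area,
# performance/area would not improve.
break_even = performance_per_area(cycles=8, area=0.5) / dedicated
print(break_even)  # 1.0
```

Time multiplexing pays off only when the area shrinks by more than the performance drops.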

NEC Electronics’ DRP (Dynamically Reconfigurable Processor)

• Multicontext reconfiguration
– 16 contexts
– Controlled by an FSM (Finite State Machine)
– Background loading of configuration data
• 8x8 PEs + distributed memory modules → a tile
• DRP-1 consists of 8 tiles → 512 PEs
• 8-bit data width
• State transitions and configuration are controlled within each tile.
• A single task is executed with multiple contexts.

DRP-1 / DRP Tile Structure

[Figure: DRP-1 is built from tiles with VMEMs and HMEMs. Each tile contains an 8x8 array of PEs (64 PEs), a State Transition Controller, 16 VMEMs with VMEM controllers, and 8 HMEMs. VMEM: 2-port memory, 8 bit × 256 entries. HMEM: 1-port memory, 8 bit × 8192 entries.]

Context control with an FSM

[Figure: state transition diagram with states 0 to 5, from data input to data output.]

1. Context switching
2. Parallel processing in a context
3. Sequential execution in a context

The DRP compiler automatically generates the diagram from a C-like language, BDL.
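FSM-driven context switching can be sketched as a table walk. The transition table and context names below are hypothetical; in the real flow the DRP compiler derives the state diagram from BDL source.

```python
# Hypothetical state transition table: state -> (context to run, next state).
TRANSITIONS = {
    0: ("load_inputs", 1),
    1: ("butterfly", 2),
    2: ("multiply_accumulate", 3),
    3: ("store_outputs", None),  # None marks the final state
}


def run_fsm(transitions, start=0, max_steps=100):
    """Walk the FSM and return the sequence of contexts that were activated."""
    trace = []
    state = start
    for _ in range(max_steps):
        if state is None:
            break
        context, state = transitions[state]
        trace.append(context)
    return trace


trace = run_fsm(TRANSITIONS)
print(trace)
```

Each state activates one context, so a single task is spread over several contexts in time.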

IMEC ADRES

[Figure: an array of functional units (FU), most with a local register file (RF), together with a shared RF, instruction fetch, instruction dispatch, instruction decode, and a data cache. The structure is shown in two views: the VLIW view and the reconfigurable array view.]

Rapport Kilocore

[Figure: stripes of PEs joined by interconnects, with a configuration controller, an input controller, and an output controller. Fabric: 16 PEs × 16 PEs. Bus widths shown: 128 bits, 128 bits, 672 bits, 32 bits.]

Hitachi’s FE-GA

[Figure: computational cell array of 24 ALU cells and 8 MLT cells, connected through a crossbar switch to 10 load/store cells (LS) and 10 local memories (MEM). Other blocks: configuration manager, sequence manager, bus interface, interrupt/DMA request, I/O port.]

Product          Vendor           Configuration      Data width (bits)  PE
Xpp-64           PACT             Delivery           24                 Homogeneous
D-Fabrix         Elixent          Delivery           4                  Homogeneous
S5 engine        Stretch          Delivery           4/8                Heterogeneous
PCA-2            NTT              Delivery           9                  Homogeneous
CS2112           Chameleon        Multicontext (8)   16/32              Heterogeneous
DAPDNA-2         IPFlex           Multicontext (4)   32                 Heterogeneous
DRP-1            NEC Electronics  Multicontext (16)  8                  Homogeneous
Kilocore         Rapport          Multicontext       8                  Homogeneous
ADRES            IMEC             Multicontext (32)  16                 Homogeneous
FE-GA            Hitachi          Multicontext (4)   16                 Heterogeneous
Cluster machine  Fujitsu          Multicontext       16                 Heterogeneous


[Figure: architectures plotted by number of nodes against number of gates per node, with cost and time-multiplexing as further annotations. Node granularities: 4/5-input LUT, 8-bit ALU/registers, 32-bit ALU/registers. Architectures shown: FPGA, VLIW, chip-multiprocessor, ACM, DAPDNA-2, DRP-1, Kilocore, DRL, CS2112, rDSP, PC101, PARS, ADRES, simple RISC, superscalar. The dynamically reconfigurable processors are grouped together.]

C-level design (DRP)

• Behavioral Description Language (BDL): C-like
– Bit width, pragma
– Pointers are limited.
• Functional synthesis: FSM and datapath are generated.
– Synthesis tools for ASICs can be used.
• Mapping: FSM → STC, datapath → PE array
• Place & routing
• Configuration data generation

[Figure: design flow. C source code → high-level synthesis → FSM and datapath → technology mapper → place & router → code generation → object code.]

BDL code example

mem(0:16) d0[8], d1[8], d2[8], d3[8], d4[8], d5[8], d6[8], d7[8];
void row() {
    ter(0:16) SUMT0, SUMT1, SUMT2, SUMT3;
    reg(0:16) SUB0, SUB1, SUB2, SUB3;
    ter(0:16) z0, z1, z2, z3, z4, z5, z6, z7;
    reg(0:8) i=0;
    $
    for(; i < 8; i++) {
        d0[i], d1[i], d2[i], d3[i], d4[i], d5[i], d6[i], d7[i];
        $
        SUMT0 = d0[i] + d7[i]; SUB0 = d0[i] - d7[i];
        SUMT1 = d1[i] + d6[i]; SUB1 = d1[i] - d6[i];
        . . . . .
        z0 = A * SUMT0 + A * SUMT1 + A * SUMT2 + A * SUMT3;
        z2 = B * SUMT0 + C * SUMT1 - C * SUMT2 - B * SUMT3;
        . . . . .
        $
        z1 = D * SUB0 + E * SUB1 + F * SUB2 + G * SUB3;
        z3 = E * SUB0 - G * SUB1 - D * SUB2 - F * SUB3;
        . . . . .
        $
    }
}

Notes on the example:
• mem(0:16): 16-bit memory, allocated to VMEM
• ter and reg: terminals and registers
• $: delimiter for the state/context
• The line "d0[i], d1[i], ..., d7[i];" is a memory access that gives the addresses.
• Terminals must be used in the assigned state/context.
• Registers can be used in the next states/contexts.
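For readers unfamiliar with BDL, the visible arithmetic of the example can be rendered in plain Python. This is a sketch only: the coefficients A, B, and C are not given on the slide and are passed in as parameters, the butterfly pattern for the elided SUMT2/SUMT3 and SUB2/SUB3 lines is assumed to follow the two visible lines, and the function name `row_even_terms` is hypothetical.

```python
def row_even_terms(d, A, B, C):
    """Compute z0 and z2 for one 8-sample row, following the visible
    lines of the BDL example. `d` holds d0[i]..d7[i] for a fixed i."""
    if len(d) != 8:
        raise ValueError("expected 8 samples")
    # Butterfly stage: SUMTk = dk + d(7-k), SUBk = dk - d(7-k).
    sumt = [d[k] + d[7 - k] for k in range(4)]
    sub = [d[k] - d[7 - k] for k in range(4)]
    z0 = A * sumt[0] + A * sumt[1] + A * sumt[2] + A * sumt[3]
    z2 = B * sumt[0] + C * sumt[1] - C * sumt[2] - B * sumt[3]
    return z0, z2, sub


# Placeholder coefficient values, chosen only to exercise the code.
z0, z2, sub = row_even_terms([1, 2, 3, 4, 5, 6, 7, 8], A=1, B=2, C=1)
print(z0, z2, sub)
```

In the BDL version, the butterfly stage and the multiply-accumulate stages sit in different states/contexts separated by `$`, so the same PEs are reused for each stage.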

Various embedded architectures

[Figure: the same map of embedded architectures as earlier in the lesson (general-purpose CPU, DSP, configurable processor, homogeneous chip-multiprocessor, heterogeneous multiprocessor, tile processor, special-purpose processors, FPGA and reconfigurable systems, programmable hardware, dedicated hardware), now annotated with "Now going major", "Next going major", and "Going major?".]

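Glossary

• This time, most of the terms have already appeared in earlier lessons, and almost all of them are used as-is in Japanese.
• Tile Processor (Japanese: タイルプロセッサ)
• Dynamically reconfigurable processor (Japanese: 動的再構成可能プロセッサ, or リコンフィギャラブルプロセッサ)
• FSM (Finite State Machine) (Japanese: 有限状態マシン)
• Multicontext (Japanese: マルチコンテキスト型, possibly also written マルチコンテクスト)
• Functional Synthesis (Japanese: 機能合成)
• Time-multiplexed Execution (Japanese: 時分割多重実行)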

Exercise

• Assume that the dynamically reconfigurable processor executes 1000 times faster than the host processor.

• Compute the total performance improvement when it can be used for 90% of the total task.
