A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems

Andrea Cappelli

F. Campi, R. Guerrieri, A. Lodi, M. Toma, A. La Rosa, L. Lavagno, C. Passerone, R. Canegallo

Nice, France, April 22, 2003

Outline

Motivations

XiRisc: a VLIW Processor

PiCoGA: A Pipelined Configurable Gate Array

Software Development Environment

Results & Measurements

Conclusions

Motivations

Increased on-chip transistor density

Increased integration costs

Strong limitations in power supply

Severe power consumption constraints

Increased algorithmic complexity

Quest for performance and flexibility

[Figure: millions of transistors per chip versus technology (nm), 1997 to 2007]

[Figure: algorithm complexity, Moore's law and battery capacity, 1997 to 2009]

Embedded Systems Algorithms Analysis

90% of computational complexity is concentrated in small kernels covering small parts of the overall code

Many algorithms show relevant instruction-level parallelism: performance is improved by multiple parallel data paths

Operand granularity is typically different from 32 bits: a traditional ALU is power-inefficient

Significant improvements can be obtained by extending embedded processors with application-specific function units

Reconfigurable computing to achieve maximum flexibility

Existing Architectures

Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically remapped depending on the algorithm being performed

1: Coprocessor model

2: Function unit model

XiRisc: a VLIW Processor

Extended instruction set RISC architecture

32-bit load/store RISC architecture (5-stage pipeline)

VLIW elaboration: concurrent fetch and execution of two 32-bit instructions per cycle

Set of specialized function units implementing DSP-specific operations

Function unit approach: the reconfigurable device fits in a classical RISC pipeline, with low communication overhead, and exploits very high resource parallelism

Architecture

Duplicated instruction decode logic (2 symmetrical data channels)

Duplicated commonly used function units (ALU and shifter)

All other function units are shared (DSP operations, memory handler)

A tightly coupled pipelined configurable gate array

Dynamic Instruction Set Extension

pGA-load: specific operation to transfer data from a configuration cache to the PiCoGA

pGA-load fields: region specification, configuration specification

pGA-op: 32-bit and 64-bit operations to launch the execution inside the PiCoGA (data exchange through the register file)

32-bit pGA-op fields: operation specification, Source 1, Source 2, Dest 1, Dest 2

64-bit pGA-op fields: operation specification, Source 1, Source 2, Source 3, Source 4, Dest 1, Dest 2
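The data exchange through the register file described above can be modelled in a few lines of C. This is a minimal sketch under stated assumptions, not the XiRisc encoding: the register-file size `NUM_REGS`, the callback type `pga_fn`, the struct `pga_op64`, the function `pga_execute` and the example operation `sum_xor` are all hypothetical names introduced for illustration. Bit-level field widths are not modelled because the slides do not give them.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 32  /* assumed register-file size; not stated in the slides */

/* Hypothetical stand-in for a configured PiCoGA operation:
 * up to four 32-bit inputs, up to two 32-bit outputs. */
typedef void (*pga_fn)(const uint32_t in[4], uint32_t out[2]);

/* Hypothetical decoded view of the 64-bit pGA-op (register indices only). */
typedef struct {
    pga_fn  op;      /* plays the role of the "operation specification" field */
    uint8_t src[4];  /* Source 1..4 */
    uint8_t dst[2];  /* Dest 1..2 */
} pga_op64;

/* Read the sources from the register file, run the operation,
 * then write both results back to the register file. */
static void pga_execute(uint32_t regs[NUM_REGS], const pga_op64 *insn)
{
    uint32_t in[4];
    uint32_t out[2] = {0, 0};

    for (int i = 0; i < 4; i++) {
        assert(insn->src[i] < NUM_REGS);
        in[i] = regs[insn->src[i]];
    }
    insn->op(in, out);
    for (int i = 0; i < 2; i++) {
        assert(insn->dst[i] < NUM_REGS);
        regs[insn->dst[i]] = out[i];
    }
}

/* Illustrative operation only: sum and XOR of the four inputs. */
static void sum_xor(const uint32_t in[4], uint32_t out[2])
{
    out[0] = in[0] + in[1] + in[2] + in[3];
    out[1] = in[0] ^ in[1] ^ in[2] ^ in[3];
}
```

A caller would fill a `pga_op64` with an operation and six register indices and pass it to `pga_execute`; the 32-bit form corresponds to using only the first two sources.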

PiCoGA: a Pipelined Configurable Gate Array

Embedded function unit for dynamic extension of the instruction set

Two-dimensional array of LUT-based reconfigurable logic cells

Each row implements a possible stage of a customized pipeline, independent of and concurrent with the processor

Up to 4x32-bit input data and up to 2x32-bit output data from/to the register file

DFG-based (data-flow graph) elaboration

Row elaboration is activated by an embedded control unit

Execution enable signal for each pipeline stage

PiCoGA operation latency depends on the operation performed
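Because each row acts as a pipeline stage, the cost of issuing the same pGA-op repeatedly can be estimated with the usual pipeline formula. The sketch below is only a back-of-the-envelope model: it assumes one row is traversed per cycle and that a new operation can be issued every `issue_interval` cycles, neither of which is stated in the slides. The function name `pga_pipeline_cycles` and its parameters are hypothetical.

```c
#include <assert.h>

/* Cycles needed to complete n back-to-back issues of one pGA-op that
 * occupies `rows` pipeline stages.
 * Modelling assumptions (not from the slides): one row per cycle, and a
 * new issue is accepted every `issue_interval` cycles. */
static unsigned long pga_pipeline_cycles(unsigned rows,
                                         unsigned long n,
                                         unsigned issue_interval)
{
    assert(rows > 0 && issue_interval > 0);
    if (n == 0) {
        return 0;  /* nothing issued, nothing to wait for */
    }
    /* The first result appears after `rows` cycles; each further result
     * follows one issue interval later. */
    return rows + (n - 1) * issue_interval;
}
```

Under these assumptions a 5-row operation issued 100 times with an interval of 1 takes 104 cycles instead of 500, which is the benefit of pipelining the rows.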

PiCoGA Configuration

Goal: to reduce cache misses due to PiCoGA configuration

Multi-context programming (4 cache layers/planes inside the array)

Dedicated configuration cache with a high-bandwidth bus to the PiCoGA (192 bits)

Partial run-time reconfiguration (a region is configured while another one is active)

Configuration is completely concurrent with processor elaboration

[Figure: configuration cache connected to the PiCoGA, which holds four configuration layers (Layer 1 to Layer 4)]

The Software Development Environment

[Figure: development flow with the stages Initial C code, Profiling, Computation kernel extraction, PiCoGA mapping, pGA-op, Latency information, Assembler-level scheduler, Executable code]

Software Simulation

Goals: check the correctness of the algorithm and evaluate performance

In the source code, a pGA-op is described using a pragma directive:

    #pragma pGA shift_add 0x12 5 c a b
    c = ( a << 2 ) + b;
    #pragma end

    /**************************************/
    /*     Shift_add mapped on PiCoGA     */
    /**************************************/
    #if defined(PiCoGA)
    ...
    asm("pGA-op 0x12 ...");
    ...
    /**************************************/
    /*   Emulation function _shift_add    */
    /**************************************/
    #else
    void _shift_add()
    {
        ...
        c = ( a << 2 ) + b;
    }
    #endif
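The emulation path of the shift_add example can be written as a small self-contained function. This is a sketch rather than the code the toolchain generates: the name `shift_add`, the explicit arguments and the `uint32_t` operand type are assumptions made here so that the function can be compiled and tested on a host machine, whereas the slide version is a `void` function whose argument passing is elided. The `PiCoGA` macro is the one used in the slide.

```c
#include <assert.h>
#include <stdint.h>

#if !defined(PiCoGA)
/* Host-side emulation of the shift_add pGA-op: c = (a << 2) + b.
 * Unsigned 32-bit operands are an assumption; they make the shift and
 * the wrap-around on overflow well defined in C. */
static uint32_t shift_add(uint32_t a, uint32_t b)
{
    return (a << 2) + b;
}
#endif
```

For example, `shift_add(3, 1)` shifts 3 left by two bits to get 12 and adds 1. When `PiCoGA` is defined, the function is omitted and the call site would instead use the pGA-op, whose operand syntax is not shown in the slides.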

Software Simulation

Two special instructions are defined to support emulation:

    ...
    topga ...
    jal _shift_add
    fmpga ...
    ...

topga saves the current state and passes arguments to the emulation function. The clock cycle count is halted while the function runs

fmpga copies the emulation function result(s) and restores registers; the cycle count is incremented by the latency value of the pGA-op

Overall performance is evaluated by counting elaboration cycles
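The cycle accounting around topga and fmpga can be illustrated with a toy counter. The sketch below models only the counting rule stated above (halt on topga, add the pGA-op latency on fmpga). The type `sim_state` and the functions `sim_tick`, `sim_topga` and `sim_fmpga` are hypothetical names; state saving, argument passing and any cycle cost of the two instructions themselves are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulator state: only what the counting rule needs. */
typedef struct {
    uint64_t cycles;  /* elaboration cycles counted so far */
    int      halted;  /* nonzero while an emulation function is running */
} sim_state;

/* Called once per simulated instruction; counts only when not halted. */
static void sim_tick(sim_state *s)
{
    if (!s->halted) {
        s->cycles++;
    }
}

/* topga: stop counting before entering the emulation function. */
static void sim_topga(sim_state *s)
{
    s->halted = 1;
}

/* fmpga: resume counting and charge the latency of the pGA-op. */
static void sim_fmpga(sim_state *s, unsigned latency)
{
    assert(s->halted);
    s->halted = 0;
    s->cycles += latency;
}
```

With this rule, an emulation function that takes 40 host instructions still costs only the declared pGA-op latency (for instance 5 cycles) in the final count, so the total reflects the hardware rather than the emulation.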

Results and Measurements

Speed-ups for several signal processing cores:

    Kernel               Speed-up
    DES                  13.5x
    CRC                  4.3x
    Median Filter        7.7x
    Motion Estimation    12.4x
    Motion Prediction    4.5x
    Turbo Codes          12x

75% of energy consumption for a VLIW architecture is due to accesses to instruction and data memory

Strong reduction of accesses to instruction memory

[Figure: normalized energy histogram comparing Only VLIW with VLIW + PiCoGA for DES, CRC, Median Filter and Motion Prediction; labelled values 0.15 and 0.22]

Conclusions

XiRisc: VLIW RISC architecture enhanced by a run-time reconfigurable function unit

PiCoGA: pipelined, run-time configurable, row-oriented array of LUT-based cells

Specific software development toolchain

Speed-ups range from 4.3x to 13.5x

Up to 93% energy consumption reduction
