
ARCES University of Bologna

Reconfigurable Architectures

Andrea Lodi


SoC trends

• Increasing mask cost (~ 3M$)
• Increasing design complexity
• Increasing design time (~ 3M$)

• Rapidly changing communication standards
• Low-power design in wireless environment
• Increasing algorithmic complexity

requirements


Product life cycle

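[Figure: sales versus time across the product life cycle (growth, maturity, decrease). One curve shows time-to-market met, the other time-to-market failed; the gap between them is marked LOSS.]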


Trends in wireless systems

• Increased on-chip Transistor density

• Increased design complexity

[Figure: millions of transistors per chip (0 to 400) versus year (1997 to 2009), with the corresponding technology node in nm.]

• Increased Algorithmic complexity

• Low battery capacity growth

[Figure: algorithm complexity, Moore's law, and battery capacity versus year (1997 to 2009).]

• Demand for reusability and flexibility

• Demand for high performance and energy efficiency


Digital architecture design space


Parallelism in computation

• Thread level parallelism

• Instruction level parallelism (ILP)

• Pipeline (loop level)

• Fine-grain parallelism (bit/byte-level)


Instruction level parallelism

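[Figure: data-flow graph over the inputs a, b, c, d, e and the constant 3, built from +, * and - operators, shown next to its ASIC implementation with one operator per graph node.]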


Spatial vs. Temporal Computing

Spatial (ASIC): Ax^2 + Bx + C

Temporal (Processor): (Ax + B)x + C
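The two forms of the polynomial can be sketched in C. This is only an illustration: the function names and the 32-bit integer type are assumptions, not part of the slides. The first function mirrors the spatial form, where each operator could be its own hardware unit; the second mirrors the temporal form, where one ALU is reused step by step.

```c
#include <stdint.h>

/* Spatial form: A*x^2 + B*x + C.
   x*x and B*x do not depend on each other, so dedicated hardware
   could compute them at the same time (3 multipliers, 2 adders). */
int32_t poly_spatial(int32_t a, int32_t b, int32_t c, int32_t x) {
    int32_t x2  = x * x;    /* multiplier 1 */
    int32_t bx  = b * x;    /* multiplier 2, independent of multiplier 1 */
    int32_t ax2 = a * x2;   /* multiplier 3 */
    return ax2 + bx + c;    /* two adders */
}

/* Temporal form (Horner): (A*x + B)*x + C.
   Each step depends on the previous one, so a single ALU can be
   reused over four sequential operations. */
int32_t poly_temporal(int32_t a, int32_t b, int32_t c, int32_t x) {
    int32_t t = a * x;      /* step 1 */
    t = t + b;              /* step 2 */
    t = t * x;              /* step 3 */
    t = t + c;              /* step 4 */
    return t;
}
```

Both functions return the same value for the same arguments; they differ only in how much of the work could proceed in parallel.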


Superscalar/VLIW processors

• FU limitations
• Register file size limitation
• Crossbar inefficiency


Byte-level parallelism in processors

• MMX technology: 57 new instructions
• Byte and half word parallel computation
• SIMD execution model
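Byte-parallel computation can be imitated in plain C by packing four bytes into one 32-bit word. The sketch below is a software emulation of the idea, not MMX code; the function name `add_bytes` is hypothetical. It adds the four byte lanes independently, with each lane wrapping modulo 256 and no carry leaking into the neighbouring lane.

```c
#include <stdint.h>

/* Add four packed bytes lane by lane (each lane wraps modulo 256).
   Step 1 adds the low 7 bits of every lane, so a carry can reach at
   most bit 7 of its own lane. Step 2 folds in the top bit of each
   lane with XOR, which discards the carry out of the lane. */
uint32_t add_bytes(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    uint32_t msb = (a ^ b) & 0x80808080u;
    return low ^ msb;
}
```

For example, `add_bytes(0x01FF10F0, 0x01010110)` gives `0x02001100`: the lanes `0xFF + 0x01` and `0xF0 + 0x10` both wrap to `0x00` without disturbing their neighbours.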


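Bit-level parallelism

#define WIDTH 32  /* word width in bits (assumed; not given on the slide) */

/* Reverse the bit order of v: bit 0 becomes bit WIDTH-1, and so on. */
unsigned reverse(unsigned v) {
    unsigned r = 0;
    int x;
    for (x = 0; x < WIDTH; x++) {
        r = (r << 1) | (v & 1);  /* make room, then append the low bit of v */
        v = v >> 1;
    }
    return r;
}

[Figure: bit reversal in hardware, input v to output r.]

/* Count the bits set in v. */
int popcount(unsigned v) {
    int r = 0;
    while (v) {
        if (v & 1) r++;
        v = v >> 1;
    }
    return r;
}

[Figure: popcount in hardware as a tree of adders, input v to output r.]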
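The adder tree drawn for popcount can be mimicked in software by summing neighbouring bit fields in parallel inside one word. The sketch below assumes a 32-bit input; the name `popcount_tree` is hypothetical. Each statement corresponds to one level of the tree.

```c
#include <stdint.h>

/* Popcount as a five-level adder tree on a 32-bit word.
   Every line adds pairs of adjacent fields, doubling the field
   width each time: 1-bit -> 2-bit -> 4-bit -> 8-bit -> 16-bit -> 32-bit. */
unsigned popcount_tree(uint32_t v) {
    v = (v & 0x55555555u) + ((v >> 1)  & 0x55555555u);  /* 16 two-bit sums  */
    v = (v & 0x33333333u) + ((v >> 2)  & 0x33333333u);  /* 8 four-bit sums  */
    v = (v & 0x0F0F0F0Fu) + ((v >> 4)  & 0x0F0F0F0Fu);  /* 4 eight-bit sums */
    v = (v & 0x00FF00FFu) + ((v >> 8)  & 0x00FF00FFu);  /* 2 sixteen-bit sums */
    v = (v & 0x0000FFFFu) + (v >> 16);                  /* final sum */
    return (unsigned)v;
}
```

The sequential loop needs up to 32 iterations, while the tree needs five levels regardless of the input. In hardware each level is a row of small adders working at the same time.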


Pipeline parallelism

for (j = 0; j < MAX; j++)
    b[j] = popcount(a[j]);

[Figure: the popcount adder tree with pipeline registers inserted (legend: = register), input v to output r.]


FPGA

An FPGA (Field-Programmable Gate Array) is composed of 2 elements:

• Array of CLBs (configurable logic blocks), composed of:
  – 1 or a few small LUTs (4:1 or 3:1)
  – Control logic: mux controlled by configuration bits
  – Dedicated computational logic (carry chain …)

• Configurable routing network connecting CLBs, composed of:
  – Different length wires
  – Connection blocks connecting CLBs to the routing network
  – Switch blocks connecting routing wires

The LUT contents and the configuration bits that program the CLBs and the routing network together form the FPGA configuration, which determines the function implemented.
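A LUT is a small truth table whose contents are configuration bits. The sketch below models a 4-input, 1-output LUT in C under that assumption; the type `lut4_t` and both function names are hypothetical and describe no particular vendor's device. Changing the 16 stored bits changes the logic function without changing the circuit.

```c
#include <stdint.h>

/* A 4-input LUT: 16 configuration bits, one per input combination. */
typedef struct {
    uint16_t truth_table;
} lut4_t;

/* Evaluate the LUT: the four inputs form an index into the table. */
unsigned lut4_eval(lut4_t lut, unsigned a, unsigned b, unsigned c, unsigned d) {
    unsigned index = (a & 1u) | ((b & 1u) << 1) | ((c & 1u) << 2) | ((d & 1u) << 3);
    return (lut.truth_table >> index) & 1u;
}

/* "Configure" the LUT as a 4-input XOR by filling in its truth table. */
lut4_t lut4_make_xor(void) {
    lut4_t lut = { 0 };
    unsigned i;
    for (i = 0; i < 16; i++) {
        unsigned parity = (i ^ (i >> 1) ^ (i >> 2) ^ (i >> 3)) & 1u;
        lut.truth_table |= (uint16_t)(parity << i);
    }
    return lut;
}
```

Loading the table with `0x8000` instead would turn the same LUT into a 4-input AND, since only the all-ones input combination (index 15) then reads a 1.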


Configurable logic block


Xilinx Clb

• Xilinx CLB, 4000 series:
  – 11 input bits, 4 output bits
  – 3 LUTs
  – Carry logic
  – 2 output registers


Configurable routing network


Example


Density Comparison


FPGA vs. Processor

FPGA (computing in space)
• Parallel execution
• Configurable in 10^2-10^3 cycles
• Fine-grained data
• Application-specific operators
• Large area (switches, SRAM)
• Entire applications don't fit
• Slow synthesis, P&R tools

Processor (computing in time)
• Sequential execution
• Programmable every cycle
• Fixed-size operands
• Basic operators (ALU)
• Compact
• Handles complex control flow
• Fast compilers


Reconfigurable processors

But:

• 90% of execution time is spent in computational kernels:
  – FPGAs: 10-100x speed-up over processors
  – FPGAs: 10-100x denser than processors (bit-ops/λ²s)

• Reconfigurable processor: RISC + FPGA


Reconfigurable processor architecture

• Hybrid architectures:
  – RISC processor
  – FPGA


Computational models

• RC Array: IO Processor/Interface logic

• Attached processor
  – PipeRench, T-Recs

• ISA Extension
  – Function unit:
      • PRISC, OneChip, Chimaera
  – Coprocessor:
      • Garp, NAPA, Molen


IO Processor/Interface Logic

• Logic used in place of
  – ASIC environment customization
  – external FPGA/PLD devices

• Looks like an IO peripheral to the processor

• Example
  – protocol handling
  – stream computation
      • compression, encryption
  – peripherals
  – sensors, actuators

• Case for:
  – Always have some system adaptation to do
  – Modern chips have capacity to hold processor + glue logic
  – Reduces part count
  – Glue logic varies
  – Many protocols, services
  – Only need a few at a time


Example: Interface/Peripherals

• Triscend E5


Instruction Set Extension

• Instruction bandwidth
  – Processor can only describe a small number of basic computations in a cycle
      • I bits → 2^I operations
  – This is a small fraction of the operations one could do, even in terms of w×w→w ops
      • w·2^(2^(2w)) operations
  – Processor could have to issue w·2^(2^(2w) - I) operations just to describe some computations
  – An a priori selected base set of functions could be very bad for some applications


Instruction Set Extension

• Idea:
  – provide a way to augment the processor's instruction set
  – with operations needed by a particular application


Architectural Models for I.S.A extension

PLEIADES (Zhang et al., 2000)

CPU surrounded by a collection of application-specific custom computing devices
• High performance
• Overdesigned for most applications
• Difficult to program

XTENSA (Tensilica Inc., 2002)

RISC CPU featuring application-specific function units optionally inserted in the processor pipeline
• Good performance
• Easy to program
• Configured at mask level


Dynamic ISA Extension models

Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically re-mapped depending on the performed algorithm

1: Coprocessor model
2: Function unit model


Coprocessor model: Garp

• Explicit instructions moving data to and from the array
• High communication overhead (long latency array operations)
• Processor stalled each time the array is active
• Array performs at TASK level (very coarse grain)
• 10-20x on stream, feed-forward operations
• 2-3x when data dependencies limit pipelining

Callahan, Hauser, Wawrzynek, 2000


Function unit model: Prisc

Razdan, Smith 1994

• Array fits in the RISC pipeline
• No communication overhead
• Some degree of parallelism between function units
• Gate array performs combinatorial instructions ONLY (very fine grain)
• Low speed-up figures (2x/3x)


Function Unit Model: pros

• No communication overhead:
  – Strict synergy between FPGA and other function units
  – FPGA can be used frequently even for small functions
  – Small reconfigurable array area

• Flow control handled by the core
• Memory access handled by the core
• Easy instruction set extension
• Configuration streams compiled from C


EXTENDIBLE INSTRUCTION SET RISC ARCHITECTURE

32-bit load/store RISC architecture (5-stage pipeline)

• Concurrent fetch and execution of two 32-bit instructions per cycle
• Fully bypassed, to minimize pipeline stalls (average of 10/20% for most computational cores)
• DSP-oriented reconfigurable functional unit (PiCoGA)
• Fully configurable at execution time
• Elaboration and configuration controlled by asm instructions inserted in C source code
• PiCoGA used as a programmable data-path with independent pipeline structure

Set of specialized functional units:
• Multiply/MAC unit
• Branch/decrement unit
• ALU featuring "MMX" byte-wide concurrent operations

[Figure labels: VLIW elaboration; embedded reconfigurable device for dynamic ISA extension.]


XiRisc Architecture


Dynamic Instruction Set Extension


    ...
    pgaload ...
    ...
    pgaop $3,$4,$5
    ...
    Add $8, $3

[Figure: register file and configuration memory.]


PiCoGA Architecture

PiCoGA (Pipelined Configurable Gate Array): embedded datapath for dynamic ISA extension
• Dynamically reconfigurable
• Structured in rows activated in data-flow fashion by the PiCoGA control unit
• Can hold a state
• pGA-op latency depends on the specific mapped function
• Functionality is determined from the DFG extracted from C code

[Figure: PiCoGA block diagram showing the PiCoGA control unit, the processor interface, and a PicoRow (synchronous element).]


Pico-cell Description

• 4x32-bit input data from Reg File
• 2x32-bit output data to Reg File

[Figure: Pico-cell (RLC) block diagram. Blocks: input logic, two 16x2 LUTs, carry chain, output logic and registers with enable (EN), input and output connect blocks, switch block, loop-back. Signals: PiCoGA control unit signals, configuration bus, 12 global lines to/from the Reg File.]

Computing on PiCoGA

[Figure: a data flow graph, with data in and data out, mapped onto the PiCoGA as Pga_op1 and Pga_op2.]


Multi-context Array

• Four configuration planes are available, one of them executing
• Plane switch takes just 1 clock cycle
• While a plane is executing another may be reconfigured → no reconfiguration time overhead

[Figure: configuration cache holding Func. 1, Func. 2, Func. 3, Func. 4, ..., Func. n, feeding the PiCoGA.]


Architecture Flexibility

• Parallelism to exploit? (Ex: Turbo Decod., Motion Est.)
• Bit-level operations? (Ex: DES, Reed-Solomon)
• MAC intensive? (Ex: FFT, Scalar product)
• Memory intensive? (Ex: DCT, Motion Est.)

[Figure: decision flowchart with Yes/No branches on the questions above, leading to one of two outcomes.]

• Speed-up from pGA (5x – 100x)
• Speed-up from DSP instructions and VLIW (1.5x – 2x)

Improvements for a large number of Data & Signal Processing algorithms


Programming XiRisc: Restrictions

• Fixed-point algorithms
• Variable size specification at the bit level

Not supported yet:
• Dynamic memory allocation
• Math library
• Operating System


XiRisc Compilation Flow

[Figure: compilation flow. Blocks: File.c, C compiler, profiler, PiCoGA configurator, configuration bit stream, PiCoGA op, configuration library, software simulation.]


Example: Motion Estimation

Sum of Absolute Difference (SAD):
• High instruction-level and inter-iteration parallelism


Data Flow Graph

• Pixel-pixel absolute difference: Abs(p1[i] - p2[i]), where p1[i] and p2[i] are pixels
• Absolute-difference sum tree


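SAD

[Figure: SAD8 data path. Operands arrive from the register file, pass through the absolute-difference units AD1, AD2, AD3, AD4, are summed (sum of absolute difference), and the result is written back to the register file.]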
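As a reference point for the data flow graph, the sum of absolute differences can be written as a plain C loop. This is a minimal sketch of the kernel itself, not the PiCoGA mapping; the function name `sad` and the 8-bit pixel type are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two blocks of n pixels.
   Each |p1[i] - p2[i]| is independent of the others, which is the
   inter-iteration parallelism the array exploits; only the final
   accumulation (the sum tree) combines them. */
uint32_t sad(const uint8_t *p1, const uint8_t *p2, size_t n) {
    uint32_t sum = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        int d = (int)p1[i] - (int)p2[i];      /* pixel-pixel difference */
        sum += (uint32_t)(d < 0 ? -d : d);    /* absolute value, accumulated */
    }
    return sum;
}
```

On a sequential processor this takes one pass per pixel. A spatial implementation can instead compute several absolute differences at once and add them in a tree.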


Griffy Compiler

[Figure: Griffy compiler flow. Blocks: high-level C compiler, DFG-based description, mapping, place & route, configuration bits, emulation function with latency and issue delay.]


Performance evaluation

• Emulation function
• Latency and Issue-Delay back-annotation
• Profiling


Motion Estimation: Results

Motion estimation:
• 16 SAD operations in parallel
• PiCoGA occupation: ~100%
• Speed-up: 7x (with respect to standard XiRisc)

MPEG preliminary result:
• H.261 standard QCIF (176x144): 10 frames/sec


Reed-Solomon Encoder: Results

Encoder RS(15,9): 4-bit symbols
• PiCoGA occupation: ~25%
• Speed-up: 37x
• Throughput: 70.6 Mb/sec

Encoder RS(255,239), widely used: 8-bit symbols
• PiCoGA occupation: ~60%
• Speed-up: 135x
• Throughput: 187.1 Mb/sec


Speed-up and Power Consumption

Algorithm           Energy consumption reduction   Speed-up
                    (vs. std. XiRisc)              (vs. std. XiRisc)
DES encryption      89%                            13.5x
Turbo decoder       75%                            11.7x
Motion prediction   46%                            4.5x
Median filter       60%                            7.7x
CRC                 49%                            4.3x
