ARCES University of Bologna Reconfigurable Architectures Andrea Lodi

Post on 20-Dec-2015


Page 1: ARCES University of Bologna Reconfigurable Architectures Andrea Lodi

ARCES University of Bologna

Reconfigurable Architectures

Andrea Lodi

Page 2: SoC trends

• Increasing mask cost (~ 3M$)
• Increasing design complexity
• Increasing design time (~ 3M$)

Requirements:
• Rapidly changing communication standards
• Low-power design in wireless environment
• Increasing algorithmic complexity

Page 3: Product life cycle

[Figure: sales vs. time across the Growth, Maturity and Decrease phases, with curves for "time-to-market met" and "time-to-market failed" and a region labeled LOSS.]

Page 4: Trends in wireless systems

• Increased on-chip transistor density
• Increased design complexity

[Figure: millions of transistors per chip (0 to 400) and technology (nm), 1997 to 2009.]

• Increased algorithmic complexity
• Low battery capacity growth

[Figure: algorithm complexity, Moore's law and battery capacity, 1997 to 2009.]

• Demand for reusability and flexibility
• Demand for high performance and energy efficiency

Page 5: Digital architecture design space

Page 6: Parallelism in computation

• Thread level parallelism

• Instruction level parallelism (ILP)

• Pipeline (loop level)

• Fine-grain parallelism (bit/byte-level)

Page 7: Instruction level parallelism

[Figure: data-flow graph of +, * and - operators over inputs a, b, c, d, e and the constant 3, shown alongside its ASIC implementation.]

Page 8: Spatial vs. Temporal Computing

Spatial (ASIC): Ax^2 + Bx + C
Temporal (Processor): (Ax + B)x + C
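
The two forms on this slide compute the same polynomial. A minimal C sketch of the contrast, assuming 32-bit integer coefficients (the function names are illustrative, not from the slides): the first form exposes three independent products that an ASIC could compute side by side, while the second (Horner's rule) reuses one multiply-add step after another, as a processor would.

```c
#include <stdint.h>

/* Spatial form: A*x*x + B*x + C.
   x*x and B*x do not depend on each other, so dedicated hardware
   can compute them in parallel before the final additions. */
int32_t poly_spatial(int32_t a, int32_t b, int32_t c, int32_t x)
{
    int32_t xx = x * x;      /* multiplier 1 */
    int32_t bx = b * x;      /* multiplier 2, independent of xx */
    int32_t axx = a * xx;    /* multiplier 3 */
    return axx + bx + c;
}

/* Temporal form (Horner's rule): (A*x + B)*x + C.
   Each step needs the previous result, so a single ALU can be
   reused over successive cycles. */
int32_t poly_temporal(int32_t a, int32_t b, int32_t c, int32_t x)
{
    int32_t t = a * x + b;   /* step 1 */
    return t * x + c;        /* step 2 */
}
```

Both functions return the same value for any inputs that do not overflow; only the shape of the computation differs.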

Page 9: Superscalar/VLIW processors

• FU limitations
• Register file size limitation
• Crossbar inefficiency

Page 10: Byte-level parallelism in processors

• MMX technology: 57 new instructions
• Byte and half-word parallel computation
• SIMD execution model
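
To make the byte-parallel SIMD idea concrete, here is a small portable C sketch that adds four packed bytes held in one 32-bit word. It does not use MMX instructions or intrinsics; it is a software imitation ("SIMD within a register"), and the function name is an assumption made for this example.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes of a and b, each lane wrapping modulo 256.
   The low 7 bits of every lane are added in one 32-bit addition; the top
   bit of each lane is handled separately so that a carry never crosses
   into the neighbouring lane. */
uint32_t add_bytes_packed(uint32_t a, uint32_t b)
{
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    uint32_t top = (a ^ b) & 0x80808080u;
    return low7 ^ top;
}
```

A hardware SIMD unit gets the same effect by cutting the carry chain at lane boundaries, which is why one instruction can operate on all bytes at once.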

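Page 11: Bit-level parallelism

/* Reverse the WIDTH low-order bits of v.
   Unsigned types keep the shifts well defined. */
unsigned reverse(unsigned v) {
    unsigned x, r = 0;
    for (x = 0; x < WIDTH; x++) {
        r = (r << 1) | (v & 1);   /* make room, then append the next bit of v */
        v = v >> 1;
    }
    return r;
}

/* Count the set bits of v. */
unsigned popcount(unsigned v) {
    unsigned r = 0;
    while (v) {
        if (v & 1) r++;
        v = v >> 1;
    }
    return r;
}

[Figure: hardware versions of both functions, from input v to output r; popcount is drawn as a tree of adders.]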
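
The adder tree in the figure can be imitated in software. The sketch below is a well-known parallel bit-count technique, shown here as an illustration rather than as the slide author's code; it assumes a 32-bit input, and the function name is made up for the example. Each line is one level of the tree: it adds neighbouring fields of 1, 2, 4, 8 and then 16 bits.

```c
#include <stdint.h>

/* Tree-style population count for a 32-bit word.
   Level k adds adjacent fields of width 2^(k-1) bits, so five levels
   reduce 32 one-bit values to a single count. */
unsigned popcount_tree(uint32_t v)
{
    v = (v & 0x55555555u) + ((v >> 1) & 0x55555555u);   /* 16 sums of 2 bits  */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);   /* 8 sums of 4 bits   */
    v = (v & 0x0F0F0F0Fu) + ((v >> 4) & 0x0F0F0F0Fu);   /* 4 sums of 8 bits   */
    v = (v & 0x00FF00FFu) + ((v >> 8) & 0x00FF00FFu);   /* 2 sums of 16 bits  */
    v = (v & 0x0000FFFFu) + (v >> 16);                  /* final 32-bit sum   */
    return (unsigned)v;
}
```

The loop version takes up to 32 iterations; the tree has a fixed depth of five additions, which is what dedicated logic exploits.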

Page 12: Pipeline parallelism

for (j = 0; j < MAX; j++)
    b[j] = popcount(a[j]);

[Figure: the popcount adder tree from v to r, with registers between stages.]
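
A cycle-level software model can show what the pipeline registers buy. The sketch below assumes a hypothetical three-stage split of the adder tree (the slide does not state the number of stages) and invented function names. It moves every in-flight value one stage forward per simulated clock cycle and returns the cycle count.

```c
#include <stddef.h>
#include <stdint.h>

#define STAGES 3   /* assumed pipeline depth, chosen for illustration */

/* Combinational logic of one pipeline stage of the adder tree. */
static uint32_t stage(int s, uint32_t v)
{
    switch (s) {
    case 0:
        v = (v & 0x55555555u) + ((v >> 1) & 0x55555555u);
        return (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    case 1:
        v = (v & 0x0F0F0F0Fu) + ((v >> 4) & 0x0F0F0F0Fu);
        return (v & 0x00FF00FFu) + ((v >> 8) & 0x00FF00FFu);
    default:
        return (v & 0x0000FFFFu) + (v >> 16);
    }
}

/* Compute b[j] = popcount(a[j]) for n inputs and return the number of
   simulated clock cycles. reg[s] models the register after stage s. */
size_t popcount_pipelined(const uint32_t *a, uint32_t *b, size_t n)
{
    uint32_t reg[STAGES] = {0};
    int valid[STAGES] = {0};
    size_t issued = 0, retired = 0, cycles = 0;

    while (retired < n) {
        /* Update from the last stage backwards so that each register
           is read before it is overwritten, as on a clock edge. */
        for (int s = STAGES - 1; s > 0; s--) {
            reg[s] = stage(s, reg[s - 1]);
            valid[s] = valid[s - 1];
        }
        if (issued < n) {
            reg[0] = stage(0, a[issued++]);   /* a new input enters every cycle */
            valid[0] = 1;
        } else {
            valid[0] = 0;
        }
        cycles++;
        if (valid[STAGES - 1])
            b[retired++] = reg[STAGES - 1];
    }
    return cycles;
}
```

In this model the first result appears after STAGES cycles and one more result follows on every later cycle, so n inputs take n + STAGES - 1 cycles instead of n * STAGES.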

Page 13: FPGA

An FPGA (Field-Programmable Gate Array) is composed of 2 elements:
• Array of CLBs (configurable logic blocks), each composed of:
– One or a few small LUTs (4:1 or 3:1)
– Control logic: muxes controlled by configuration bits
– Dedicated computational logic (carry chain, …)
• Configurable routing network connecting the CLBs, composed of:
– Wires of different lengths
– Connection blocks connecting CLBs to the routing network
– Switch blocks connecting routing wires

The LUT contents, together with the configuration bits that program the CLBs and the routing network, make up the FPGA configuration, which determines the function implemented.
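
A 4:1 LUT is simply a 16-entry truth table addressed by its four inputs, so its behaviour fits in a few lines of C. This is an illustrative model only; the function and macro names are assumptions, not part of any vendor tool.

```c
#include <stdint.h>

/* Example configurations: bit i of the word is the LUT output for
   input pattern i (in3 is the most significant index bit). */
#define LUT4_AND    0x8000u   /* 1 only when all four inputs are 1 */
#define LUT4_PARITY 0x6996u   /* XOR of the four inputs */

/* Evaluate a 4-input, 1-output LUT. The 16 configuration bits decide
   which boolean function the same hardware implements. */
unsigned lut4_eval(uint16_t config, unsigned in3, unsigned in2,
                   unsigned in1, unsigned in0)
{
    unsigned index = ((in3 & 1u) << 3) | ((in2 & 1u) << 2) |
                     ((in1 & 1u) << 1) | (in0 & 1u);
    return (config >> index) & 1u;
}
```

Loading a different 16-bit word changes the function without changing the circuit, which is the sense in which the configuration "determines the function implemented".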

Page 14: Configurable logic block

Page 15: Xilinx CLB

• Xilinx 4000 series CLB:
– 11 input bits, 4 output bits
– 3 LUTs
– Carry logic
– 2 output registers

Page 16: Configurable routing network

Page 17: Example

Page 18: Density Comparison

Page 19: FPGA vs. Processor

FPGA (computing in space)
• Parallel execution
• Configurable in 10^2-10^3 cycles
• Fine-grained data
• Application-specific operators
• Large area (switches, SRAM)
• Entire applications don't fit
• Slow synthesis and P&R tools

Processor (computing in time)
• Sequential execution
• Programmable every cycle
• Fixed-size operands
• Basic operators (ALU)
• Compact
• Handles complex control flow
• Fast compilers

Page 20: Reconfigurable processors

But:
• 90% of execution time is spent in computational kernels:
– FPGAs give 10-100x speed-up over processors
– FPGAs are 10-100x denser than processors (bit-ops/λ²s)
• Reconfigurable processor: RISC + FPGA
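
The slide's figures can be combined with Amdahl's law to estimate the gain for a whole application. The sketch below assumes that only the kernels are accelerated and that the remaining 10% of the time is unchanged; that assumption, and the function name, are not from the slides.

```c
/* Amdahl's law: overall speed-up when a fraction of the execution time
   (kernel_fraction, between 0 and 1) is accelerated by kernel_speedup
   and the rest runs at the original speed. */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

With 90% of the time in kernels, a 10x kernel speed-up gives about 5.3x overall and a 100x kernel speed-up gives about 9.2x, so under this assumption the code left on the processor bounds the total gain at 10x.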

Page 21: Reconfigurable processor architecture

• Hybrid architectures:
– RISC processor
– FPGA

Page 22: Computational models

• RC Array: IO Processor/Interface logic
• Attached processor
– PipeRench, T-Recs
• ISA Extension
– Function unit: PRISC, OneChip, Chimaera
– Coprocessor: Garp, NAPA, Molen

Page 23: IO Processor/Interface Logic

• Logic used in place of:
– ASIC environment customization
– External FPGA/PLD devices
• Looks like an IO peripheral to the processor
• Examples:
– Protocol handling
– Stream computation (compression, encryption)
– Peripherals
– Sensors, actuators
• Case for:
– There is always some system adaptation to do
– Modern chips have the capacity to hold processor + glue logic
– Reduces part count
– Glue logic varies
– Many protocols and services
– Only a few are needed at a time

Page 24: Example: Interface/Peripherals

• Triscend E5

Page 25: Instruction Set Extension

• Instruction bandwidth:
– A processor can only describe a small number of basic computations in a cycle
   • I bits → 2^I operations
– This is a small fraction of the operations one could do, even in terms of w×w→w ops
   • w·2^(2^(2w)) operations
– The processor could have to issue w·2^(2^(2w) − I) operations just to describe some computations
– An a priori selected base set of functions could be very bad for some applications
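
A small worked case shows how fast the gap grows. It reads the slide's counts as 2^I and w·2^(2^(2w)); this reading is an assumption, because the superscripts are flattened in this transcript. The values w = 2 and I = 8 are hypothetical and chosen only to keep the numbers readable.

```latex
\begin{align*}
\text{operations an $I$-bit instruction can name} &= 2^{I} = 2^{8} = 256 \\
\text{possible } w \times w \to w \text{ operations} &= w \cdot 2^{2^{2w}} = 2 \cdot 2^{16} = 131072 \\
\text{ratio} &= w \cdot 2^{\,2^{2w} - I} = 2 \cdot 2^{8} = 512
\end{align*}
```

Even for 2-bit operands the instruction set names only 1 in 512 of the candidate operations, and the exponent 2^(2w) makes the gap far larger at realistic word widths.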

Page 26: Instruction Set Extension

• Idea:
– Provide a way to augment the processor's instruction set
– with operations needed by a particular application

Page 27: Architectural Models for ISA extension

PLEIADES (Zhang et al., 2000)
CPU surrounded by a collection of application-specific custom computing devices
• High performance
• Overdesigned for most applications
• Difficult to program

XTENSA (Tensilica Inc., 2002)
RISC CPU featuring application-specific function units optionally inserted in the processor pipeline
• Good performance
• Easy to program
• Configured at mask level

Page 28: Dynamic ISA Extension models

Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically re-mapped depending on the performed algorithm

1: Coprocessor model
2: Function unit model

Page 29: Coprocessor model: Garp

• Explicit instructions moving data to and from the array
• High communication overhead (long latency array operations)
• Processor stalled each time the array is active
• Array performs at TASK level (very coarse grain)
• 10-20x on stream, feed-forward operations
• 2-3x when data dependencies limit pipelining

Callahan, Hauser, Wawrzynek, 2000

Page 30: Function unit model: PRISC

• Array fits in the RISC pipeline
• No communication overhead
• Some degree of parallelism between function units
• Gate array performs combinatorial instructions ONLY (very fine grain)
• Low speed-up figures (2x/3x)

Razdan, Smith, 1994

Page 31: Function Unit Model: pros

• No communication overhead:
– Strict synergy between FPGA and other function units
– FPGA can be used frequently even for small functions
– Small reconfigurable array area
• Flow control handled by the core
• Memory access handled by the core
• Easy instruction set extension
• Configuration streams compiled from C

Page 32: Extendible Instruction Set RISC Architecture

32-bit load/store RISC architecture (5-stage pipeline)

• Concurrent fetch and execution of two 32-bit instructions per cycle
• Fully bypassed, to minimize pipeline stalls (average of 10/20% for most computational cores)
• DSP-oriented reconfigurable functional unit (PiCoGA)
• Fully configurable at execution time
• Elaboration and configuration controlled by asm instructions inserted in C source code
• PiCoGA used as a programmable data-path with independent pipeline structure

• Multiply/MAC unit
• Branch/Decrement unit
• ALU featuring "MMX" byte-wide concurrent operations

[Figure callouts: Embedded reconfigurable device for dynamic ISA extension; VLIW elaboration; Set of specialized functional units.]

Page 33: XiRisc Architecture

Page 34: Dynamic Instruction Set Extension

Page 35: Dynamic Instruction Set Extension

…
pgaload …
…
pgaop $3,$4,$5
…
Add $8, $3

[Figure: Register File and Configuration Memory.]

Page 36: PiCoGA Architecture

PiCoGA (Pipelined Configurable Gate Array): embedded datapath for dynamic ISA extension
• Dynamically reconfigurable
• Structured in rows activated in data-flow fashion by the PiCoGA control unit
• Can hold a state
• pGA-op latency depends on the specific mapped function
• Functionality is determined from the DFG extracted from C code

[Figure: PiCoGA array with the PiCoGA Control Unit, the Processor Interface and a PicoRow (synchronous element).]

Page 37: Pico-cell Description

• 4x32-bit input data from Reg File
• 2x32-bit output data to Reg File

[Figure: RLC with input logic, two LUT 16x2 blocks, carry chain, output logic and registers, enable (EN), PiCoGA control unit signals, configuration bus and loop-back; surrounded by input connect blocks, output connect blocks and a switch block, with 12 global lines to/from the Reg File.]

Page 38: Computing on PiCoGA

[Figure: a Data Flow Graph mapped onto the PiCoGA as Pga_op1 and Pga_op2, with Data in, Data out and the PiCoGA Control Unit.]

Page 39: Multi-context Array

• Four configuration planes are available, one of them executing
• Plane switch takes just 1 clock cycle
• While a plane is executing another may be reconfigured → No reconfiguration time overhead

[Figure: Configuration Cache holding Func. 1, Func. 2, Func. 3, Func. 4, … Func. n, feeding the PiCoGA.]

Page 40: Architecture Flexibility

Decision flow (each question has Yes/No branches):
• Parallelism to exploit? (Ex: Turbo Decod., Motion Est.)
• Bit-level operations? (Ex: DES, Reed-Solomon)
• MAC intensive? (Ex: FFT, Scalar product)
• Memory intensive? (Ex: DCT, Motion Est.)

Outcomes:
• Speed-up from pGA (5x – 100x)
• Speed-up from DSP instructions and VLIW (1.5x – 2x)

Improvements for a large number of Data & Signal Processing algorithms

Page 41: Programming XiRisc: Restrictions

• Fixed-point algorithms
• Variable size specification at the bit level

Not supported yet:
• Dynamic memory allocation
• Math library
• Operating System

Page 42: XiRisc Compilation Flow

[Figure: flow diagram with blocks File.c, C Compiler, Profiler, PiCoGA Configurator, Configuration Bit stream, PiCoGA op, Configuration Library and Software Simulation.]

Page 43: Example: Motion Estimation

• Sum of Absolute Difference (SAD)
• High instruction-level and inter-iteration parallelism

Page 44: Data Flow Graph

• Pixel-pixel absolute difference: Abs(p1[i] – p2[i])
• p1[i], p2[i]: pixels

[Figure: absolute difference operators feeding a sum tree.]
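
As a software reference for the data flow graph, a plain C version of the sum of absolute differences might look like the sketch below. It assumes 8-bit pixels and a caller-supplied block length; the function name is illustrative and this is not the PiCoGA mapping itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two blocks of n 8-bit pixels.
   Each iteration is independent of the others except for the final
   accumulation, which is what lets hardware compute many absolute
   differences at once and combine them in a sum tree. */
unsigned sad(const uint8_t *p1, const uint8_t *p2, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)p1[i] - (int)p2[i];
        sum += (unsigned)(d < 0 ? -d : d);
    }
    return sum;
}
```

In motion estimation this value is computed for many candidate block positions, and the position with the smallest SAD is taken as the best match.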

Page 45: SAD

[Figure: Sum of Absolute Difference datapath with inputs from the Register File, blocks AD1, AD2, AD3, AD4 and SAD8, and writeback to the Register File.]

Page 46: Griffy Compiler

[Figure: flow diagram with blocks High-Level C Compiler, DFG-based description, Mapping, Place & Route, Configuration Bits, and Emulation Function with Latency and Issue Delay.]

Page 47: Performance evaluation

• Emulation function
• Latency and Issue-Delay back-annotation
• Profiling
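
One way to see how back-annotated latency and issue delay could feed a profiler is a simple cycle estimate. The model below is an assumption for illustration (a standard pipelined-unit formula, not taken from the XiRisc tools), and the function name is hypothetical: a new operation may start every issue_delay cycles, and each result is ready latency cycles after its start.

```c
/* Estimated cycles for n back-to-back operations on a pipelined unit.
   The first result arrives after `latency` cycles; every further
   operation adds `issue_delay` cycles. Returns 0 when n is 0. */
unsigned long pga_cycles(unsigned long n, unsigned latency,
                         unsigned issue_delay)
{
    if (n == 0)
        return 0;
    return (unsigned long)latency + (n - 1) * (unsigned long)issue_delay;
}
```

For example, with a latency of 5 cycles and an issue delay of 2 cycles, four consecutive operations would be charged 11 cycles under this model.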

Page 48: Motion Estimation: Results

Motion estimation:
• 16 SAD operations in parallel
• PiCoGA occupation: ~100%
• Speed-up: 7x (with respect to standard XiRisc)

MPEG preliminary result:
• H.261 standard QCIF (176x144): 10 frames/sec

Page 49: Reed-Solomon Encoder: Results

Encoder RS(15,9): 4-bit symbols
• PiCoGA occupation: ~25%
• Speed-up: 37x
• Throughput: 70.6 Mb/sec

Encoder RS(255,239), widely used: 8-bit symbols
• PiCoGA occupation: ~60%
• Speed-up: 135x
• Throughput: 187.1 Mb/sec

Page 50: Speed-up and Power Consumption

Algorithm           Energy consumption reduction    Speed-up
                    (vs. std. XiRisc)               (vs. std. XiRisc)
DES encryption      89%                             13.5x
Turbo decoder       75%                             11.7x
Motion prediction   46%                             4.5x
Median filter       60%                             7.7x
CRC                 49%                             4.3x
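
The two columns can be combined into a rough average-power comparison. The sketch below assumes that energy equals average power times execution time for the same workload; that relation, the function name and the array name are assumptions of this example, while the numbers are copied from the table.

```c
/* One table row: fractional energy reduction and speed-up,
   both relative to the standard XiRisc. */
struct result {
    const char *algorithm;
    double energy_reduction;
    double speedup;
};

const struct result xirisc_results[5] = {
    {"DES encryption",    0.89, 13.5},
    {"Turbo decoder",     0.75, 11.7},
    {"Motion prediction", 0.46,  4.5},
    {"Median filter",     0.60,  7.7},
    {"CRC",               0.49,  4.3},
};

/* Ratio of average power (accelerated / standard), derived from
   E = P * T: P'/P = (E'/E) * (T/T') = (1 - reduction) * speedup. */
double avg_power_ratio(double energy_reduction, double speedup)
{
    return (1.0 - energy_reduction) * speedup;
}
```

Under this assumption DES encryption draws about 1.5 times the average power while running, yet finishes 13.5 times sooner, which is how the total energy still drops by 89%.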