algorithm architecture co for dsp applications · tukey fft algorithm and several other algorithms...

56
AlgorithmArchitecture CoDesign for DSP Applications for DSP Applications Pramod Meher Institute for Infocomm Research Singapore Singapore

Upload: others

Post on 30-Apr-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

Algorithm‐Architecture Co‐Design for DSP Applicationsfor DSP Applications

Pramod MeherInstitute for Infocomm Research

SingaporeSingapore

Page 2: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

outline

scaling trends and design issues need of algorithm‐architecture co‐design need of algorithm architecture co design design for catching up the speed design for dealing with power design for interconnect minimization design for matching computation with I/O design for demand‐driven solution conclusions

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 2

Page 3: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

scaling trends and design issuesscaling trends and design issues

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 3

Page 4: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

real-time processing requirement

D/ AEMBEDDED PROCESSOR

sample inA/ D

sample outSPEAKER LISTENER

The system must  have enough speed performance to process the samples as fast as they arrive at. Slower processing will degrade  the quality of reception.

Computational demand per second = N SS= no. of samples arriving at the processor/ secondN= computation per sample required by the algorithm

Audio and telecommunication systems have sampling rate 104 to y p g106 samples/sec, and video applications need 107 to 108samples/second. MPEG motion estimation requires several GOPs/sec.

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010

GOPs/sec. 

4

Page 5: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

design constraints and trade-offs

I/O, size, cost andreconfigurabilityreconfigurability,time to market

power-consumptionspeed-

performance

mutually conflicting, improvement of one worsens some others mutually conflicting, improvement of one worsens some others for the best possible solution, one has to evaluate the relative 

importance of the constraints for the given application need to design the algorithms and architectures accordingly

5Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010

Page 6: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

technology scaling trend

160

180Y (nm)

100

120

140

160

TECH

NOLO

GY

40

60

80

100

T

0

20

2000 2002 2004 2006 2008 2010 2012 2014 2016 2018 2020 2022 2024

YEAR OF PRODUCTION

SOURCE ITRS

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 6

Page 7: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

MOSFET intrinsic delay

0 8

0.9C DELAY

 (ps)

0 5

0.6

0.7

0.8

SFET

 INTR

INSI

0 2

0.3

0.4

0.5

MOS

0

0.1

0.2

2009 2011 2013 2015 2017 2019 2021 2023

YEAR OF PRODUCTION

SOURCE ITRS

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 7

Page 8: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

maximum power dissipation

18040000

Transistor Density (millions per sq.cm)Maximum Allowable Power (W) (with heat sink)

160

170

180

25000

30000

35000

40000

ABLE POWER

 

ENSITY

140

150

10000

15000

20000

25000

MUM ALLOWA

RANSISTOR D

120

130

0

5000

10000

2009 2011 2013 2015 2017 2019 2021

MAX

IMTR

YEAR OF PRODUCTION

SOURCE ITRS

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 8

Page 9: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

resistivity scaling

Cu ResistivityCu Resistivity 

SOURCE: ITRS

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 9

Page 10: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

interconnect delay

16

Interconnect RC Delay per m of Cu (assumes no scattering and an effective of 2.2 -cm)

10

12

14

16

NDS

2

4

6

8

PICO

SECO

N

0

2

2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020YEAR OF PRODUCTION

The RC delay increases quadratically with technology, reaching above12ns for a 1 mm Cu wire by 2020.

SOURCE ITRS 

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 10

y

Page 11: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

interconnect delay vs propagation delay

2

interconnect delay per 100 nm propagation delay

1.5

CONDS

0.5

1

PICO

SEC

02009 2011 2013 2015 2017 2019 2021

YEAR OF PRODUCTION

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 11

Page 12: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

I/O limitations

900040000

Transistor Density Maximum No. of Pins

7000

8000

25000

30000

35000

OF I/O PINS

NSITY

6000

7000

10000

15000

20000

25000

UM NUMBE

RANSISTOR DE

4000

5000

0

5000

10000

2009 2011 2013 2015 2017 2019 2021

MAX

IMU

TR

YEAR OF PRODUCTIONYEAR OF PRODUCTION

SOURCE ITRS

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 12

Page 13: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

analysis of scaling trend

increasing transistor density, bigger chips, and more logic functionalities in a chip

faster transistors and lower critical path limitation on maximum allowable power very high interconnect delay could be a show‐stopper and interconnect power could be very high

maximum number of I/O pins not even doubled in maximum number of I/O pins not even doubled in coming ten years while transistor density increased over thirty times during the same period

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 13

Page 14: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

need of algorithm- architecture co-design

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 14

Page 15: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

core operations in DSP

Filtering: Finite Impulse Response (FIR), Infinite Impulse Response (IIR) filtering, and Adaptive filtering

Matrix operations: convolution, correlation and i l i li i i d imatrix multiplication, inner‐product computation

Discrete transforms: DFT, DCT, DST, DHT, CZT,  and DWT etcDWT etc..

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 15

Page 16: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

typical behavior of DSP operations

computation‐intensive

repetitive multiply‐accumulate computations repetitive multiply accumulate computations

trigonometric symmetries

computational symmetries and redundancies computational symmetries and redundancies

multiplication with constant coefficients

inner product with a constant vector inner‐product with a constant vector

several implementation platforms

d di t d hit t ft i d dedicated architectures are often required

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 16

Page 17: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

key design considerations

processing time : Fast enough to process the input/ sample values as fast as they arrive at. Throughput and l h ld h h f llatency should match the requirement of real‐time processing.

power consumption: Key design issue more demanding power consumption: Key design issue, more demanding in portable and embedded systems.

interconnects: Should be minimized for speed as well as power

I/O : Number of I/Os should be minimized to reduce size and costand cost.

area: It could be traded for speed or power provided that the cost allows.

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 17

Page 18: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

scope of algorithm-architecture co-design

− sequential/parallel computationl li d/di ib d i

ALGORITHMIC FEATURES

DESIGN GOALS

− localized/distributed computation− in‐place/not‐in‐place computation− recursive/non‐recursive computation− compute/communication bound

throughput and latency, power‐dissipation, 

I/O, interconnects, and 

− compute/communication‐bound

chip‐area, etc.

− processing unit design− arithmetic circuit design

ARCHITECTURAL FEATURES

arithmetic circuit design−memory organization and placement− register organization− I/O and communication

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010

/

18

Page 19: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

objectives of co-design

design for catching with the speed

design for dealing with power design for dealing with power

design for interconnect minimization

design for matching computation with I/O design for matching computation with I/O

design for demand‐driven solution

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 19

Page 20: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

catching up the speed

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 20

Page 21: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

performance metrics: key design techniques

computational bandwidth and throughput rateperformance metrics computational bandwidth and throughput rate

absolute and relative latencies

execution time execution time

minimization of computation and mappingdesign techniques minimization of computation and mapping computation to hardware architecture

co‐design for parallel processing and pipelining co design for parallel processing and pipelining 

combination of parallel processing and pipelining

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 21

Page 22: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

minimization of computations

use of trigonometric symmetry: the famous Cooley–Tukey FFT algorithm and several other algorithms for di t F i d th di t t fdiscrete Fourier and other discrete transforms [Cooley 1995]. 

polynomial operations and interpolation and polynomial operations and interpolation and minimization of redundant operations: Toom‐Cook method and Agarwal‐Cooley algorithm of digital 

l i [ l 19 ]convolution [Agarwal 1977]. 

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 22

Page 23: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

design for parallel processing

multirate FIR filtering approach:  Toom‐Cook algorithm [Cook 1966] :

H ( ) + H ( )

H0(z)

12

2

+ + +-

+ +2

H0(z) + H1(z)

H1(z) Z -1

Z -12

2

+ + +-

Z -1

block filter [Clark 1981] and transform domain adaptive filter

(Filter structure based on a two‐point convolution algorithm.                         H0 and H1 : the half‐length filters of even and odd coefficient of filter H.)

block filter [Clark 1981] and transform domain adaptive filter [Narayan 1983] :  to process a block of input in a cycle. 

decomposition of computation [DA 2006, Conv 2008]: 

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010

p p [ , ]derives independent tasks and maps to parallel architecture

23

Page 24: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

block formulation: parallel processing

Consider an FIR filter of order N= 3 : 

This can be expressed in a block form [Lin 1996] p [ ]

Here the block size  L= 3 . Could similarly be done for longer block sizes. The outputs are available are k-th cycle  

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 24

Page 25: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

design for pipelining

recursive algorithm formulation:  mapping to systolic and systolic‐like architecture for pipelining and reuse of data [DHT 1993][DHT 1993] .

conversion to cyclic convolution form: DFT and other transforms could be converted to cyclic convolution formtransforms could be converted to cyclic convolution form, and cyclic convolution could be mapped to pipelined structures [DFT 2006, DCT 2007, DST 2007].

cut‐set retiming: graphical formulation to transform algorithm into pipeline form

latency, register complexity, and communication bottleneck  on one hand and granularity of pipelining on the other hand.

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 25

Page 26: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

convolutional representation of transforms

N-point sinusoidal transforms e.g, DFT/DCT/ DHT are given by [DFT 2006, DXT 2006, DHT  2007, DST 2007]

where the transform kernel is defined as

for prime values of N, the N x N kernel matrix is transformed to an (N-1)-point cyclic convolution.

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 26

Page 27: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

conversion to cyclic convolution: example

N-point DFT:

For N = 5

4- point cyclic convolution

can be used for many other transforms and for other prime lengths N.

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 27

Page 28: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

pipelined convolution: design example

implementation of 4‐point cyclic convolution [DHT 2007]

u(0) u(1) u(2) u(3)u(0) u(1) u(2) u(3)

+ ++

U(0)U(1),U(2),U(3),

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 28

Page 29: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

dealing with power

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 29

Page 30: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

components of power dissipation

CMOS power dissipation1. Dynamic power dissipation :  2... ddD VCfP y p p

: average switching activity,  f : operating frequency, C: total load capacitance of CMOS circuit, Vdd : supply voltage

2 Leakage power: ILEAK: leakage currentIVP

ddD f

2. Leakage power:                                 , ILEAK: leakage current

3. Short‐circuit power dissipation

Power dissipation across the conductors

LEAKddLEAK IVP .

Power dissipation across the conductors1. power dissipation across clock network2. power dissipation in interconnects

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 30

Page 31: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

components of dynamic power dissipation

dynamic power interconnect power

15%

21%

51%34%

35%

16%

28%

Interconnect Gate Diffusion global signal local signalglobal clock local clock

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 31

Page 32: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

voltage scaling for dynamic power reduction

C : charging capacitance

2.. DDD VCfP dynamic power:                       could be reduced by reducing VDD

CCHARGE : charging capacitance along the critical path.VTH : threshold voltage,

2CHARGE

)( THDD

DDPD VVk

VCT

propagation delay:

Reduction of supply voltage leads to increase in propagation delay and  hence the decrease in maximum usable frequency.  But, throughput could still be maintained if we use parallel

Propagation delay could still be maintained in spite of reducing 

But, throughput  could still be maintained if we use parallel architecture.

Propagation delay could still be maintained in spite of reducingthe supply voltage if CCHARGE is reduced proportionately by pipelining [Parhi 1999]

Institute for Infocomm Research, Singapore12/17/2010 32

Page 33: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

leakage powers

standby leakage currents : when the circuit is in idle mode i.e., no computation takes placeactive leakage currents: when the circuit is in use

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 33

[Nose 2002]

Page 34: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

total power minimization: design example

[Wang 2005] 

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 34

Page 35: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

interconnect minimization

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 35

Page 36: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

recursive formulation

DFT of an N-point sequence {x(n)} is given by  

It can be represented in a recursive form [DHT 1993, DFT 2D 2005]p [ , ]

Other sinusoidal transforms could also be expressed in this form.

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 36

Page 37: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

recursive formulation: design example

Systolic design of DFT for N=5 

0 0 0 0 05; If

;0,0:Initialize

count

Rcount

x(0)

0

x(4) W 45

direction

0

W 35

0

W 25

0

W 15

0

W 15

l;1

in..;inout

countcountXWRR

XX

x(0)

x(1)

x(2)

x(3)

rojection d

W 45

W 4

W 35

W 3

W 25

W 2

W 15

W 1

W 15

W 1 W RXin Xoutend.

;0 ;else

countRY

x(3)

x(2)x(2)

x(1)

p W 45

W 45

W 35

W 35

W 25

W 25

W 15

W 15

W 15

W 15

W, RXin Xout

Y

x(4)

X(4) X(3) X(2) X(1) X(0)

PE-1 PE-2 PE-3 PE-4 PE-5

X(4) X(3) X(2) X(1) X(0)

x(0) W 45 W 3

5 W 25 W 1

5 W 15

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 37

X(4) X(3) X(2) X(1) X(0)X(4) X(3) X(2) X(1) X(0)

Page 38: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

3-D designs: interconnect minimization

Decompose the computation to multiple stages [DHT 2D 1993, DCT 1996].

Map the computation of each part to a different layer

The output of one layer is used as input of one of the  adjacent layers

The inner layers should preferably have less computation to perform to reduce power dissipation in in inner layers

3 D architectures may help to reduce the buffer 3‐D architectures may help to reduce the buffer space for intermediate data.

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 38

Page 39: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

3-D architecture: design examplesupper layer of 3 #5-point DFT unitsComputation of 15‐point DFT [DFT 2D 2005]. 

Use the prime‐factor decomposition to compute the 15‐point DFT in 2 stages.

Map the input into a 5 x 3 array [x(i, j)].

In stage‐1 compute 5 number of 3‐point DFT of the rows of [x(i, j)] to obtain a 5 x 3 intermediate result [y(i, j)].

In stage‐2 compute 3 number of 5‐point DFT of columns of [y(i, j)] to obtain the DFT.

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 39

col-1 col-2 col-3

Page 40: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

3-D architecture: design examples

Computation of 2‐D DFT: Example 3x3 DFTX(0,0) X(0,1) X(0,2) X(1,0) X(1,1) X(1,2) X(2,0) X(2,1) X(2,2)

V (0,2) V (1,2) V (2,2)

x(1,2)

x(0,2)

0 0 0

V (2,1)V (1,1)V (0,1)x(0,1)

x(2,2)

0 0 0

V (0 0) V (2 0)

x(2,1)

x(1,1)

V (0,0) V (1,0) V (2,0)

x(2,0)

x(1,0)

x(0,0)

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 40

0 0 0

0 0 0 0 0 0 0 0 0

Page 41: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

3-D architecture: design examples

Computation of 2‐D DFT: Example 3x3 DFT 

)21(X)10()2,0(

XX

di i i liik i1) h(fPE1) h(

)0,2()1,2()2,2(

XXX

)0,1()1,1()2,1(

XXX

)0,0()1,0(

XX

XX

ZWDWC

dinitializeik

inout

iN

kN

;:performs PEeach period cycleevery During

;0;;

:asisrow1)th (ofPE1)th (

1

03

03

WD

WC

0

3

13

WD

WC

0

3

23

WD

WC

)2,2(),2,1(),2,0( xxx

COUNTCOUNTZDZ

YZYZCXZ

inout

in

inout

1.

;;.

;

22

2

11

13

03

WD

WC

1

3

23

WD

WC

1

3

13

WD

WC

)1,2(),1,1(),1,0( xxx

ZZZ

ThenCOUNTIfCOUNTCOUNT

;0;

3;1

1

12

23

03

WD

WC

2

3

23

WD

WC

2

3

13

WD

WC

)0,2(),0,1(),0,0( xxxEndif

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 41

0 0 0

Page 42: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

matching computation with I/O

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 42

Page 43: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

decomposition of computation

Matrix vector product Y = CX: Let C be of size 4 x 4. Yand X be of size 3 x 1. [DA 2006, FIR 2008][ , ]

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 43

Page 44: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

decomposition by graph partitioning

Computation of larger size problem by small array and less I/Os

b0 a05 a15

c0 c1

a25

c2

a35

c3

a45

c4

a55

c5

a04

a03 a13

a14b1

b2 a23

a24

a33

a34

a43

a44

a53

a54

DG partitioned into 3 blocks

a01 a11

a02 a12b3

b4 a21

a22

a31

a32

a41

a42

a51

a52

DG partitioned into 3 blocks

DG for vector-matrix multiplication c=Ab. d 6 l

a10a00b5 a20 a30 a40 a50

0 0 0 0 00

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010

A is a 6 x 6 matrix and b is 6-point column vector. DG partitioned into 2 blocks

44

Page 45: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

graph partitioning: example

Two methods: (i) Locally Sequential Globally Parallel (LSGP)   (ii) Locally Parallel Globally Sequential (LPGS) methods

DG partitioned into 3 blocks DG partitioned into 2 blocksPE PE PE

LSGP: One block of nodes are  LPGS: Each block of nodes aremapped to one PE. 3 PEs of the array are used for three blocks of the DG. All the blocks are processed concurrently

LPGS: Each block of nodes are mapped to the whole array. 3 columns of node in a block are mapped to 3PEs of the array. Bl k d ti ll

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010

processed concurrently. Blocks are processed sequentially.

45

Page 46: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

design for demand-driven solution

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 46

Page 47: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

iterative circular optimization

system level design, optimization, and  trade‐off 

algorithm design for minimization of interconnect and algorithm design for minimization of interconnect and matching with I/O, and modify for speed and power. 

mapping algorithm to architecture: exploring 3‐D mapping algorithm to architecture: exploring 3 D design, local communication, parallel, pipeline, or a combination

mapping architectures to circuits:  selection of arithmetic circuits [FIR 2010]

l h d b circular iterative approach: top‐down to bottom‐up followed by bottom‐up  by demand‐driven trade‐off [DWT 2010]

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 47

[ ]

Page 48: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

system-level design and optimization

HW/SW partitioning  synchronous /asynchronous  partitioning:  globally y / y p g g yasynchronous and locally synchronous design 

data representation: Selection of bit‐width, sign‐it d 2’ l t t ti dmagnitude or 2’s complement representation, and 

data encoding power management schemes: dynamic voltage power management schemes: dynamic voltagescaling, adaptive clocking, and clock gating

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 48

Page 49: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

conclusions

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 49

Page 50: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

conclusions

Over the years, the application space of DSP has become wide and diverse.

The complexity of DSP algorithms also has grown high.  Dedicated architectures are needed for real‐time 

i l i f S f i li iimplementation of DSP functionalities. Algorithms for dedicated architectures are tailored to 

meet a target set of design metrics and trade‐offs.meet a target set of design metrics and trade offs. Typical behaviors of the DSP algorithms are utilized to 

perform reformulation of algorithm for mapping that to a hardware architecture.

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 50

Page 51: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

conclusions

We reviewed the scaling trends to explore the key design challenges

Algorithms and architectures need to be designed to have just enough speedS l d h ld b d d f Surplus speed should be traded for power

Algorithm and architecture should minimizes the interconnect length and match computation with I/Ointerconnect length and match computation with I/O bandwidth

Co‐design should ultimately satisfy the constraints and demands

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 51

Page 52: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

references

[Agarwal 1977] R. Agarwal and J. W. Cooley, New algorithms for digital convolution, IEEE Trans  ASSP, Oct 1977, pp. 392‐410.[Clark 1981] G. Clark, S. Mitra, and S. Parker, Block implementation of adaptive [ ] , , , p pdigital filters, Processing, IEEE Trans  ASSP, June 1981, pp.744‐752.[Conv 2008] P. K. Meher, ‘Parallel and Pipelined Architectures for Cyclic Convolution by Block Circulant Formulation using Low‐Complexity Short‐Length Al i h ’ I Ci i & S f Vid h l 1422Algorithms,’ IEEE Trans. on Circuits & Systems for Video Technology, pp. 1422‐1431, October 2008.[Cooley 1995]  J. W. Cooley and J. W. Tukey, An Algorithm for the Machine Calculation of Complex Fourier Series Mathematics of Computation vol 19 noCalculation of Complex Fourier Series, Mathematics of Computation, vol. 19, no. 90, April 1965, pp. 297‐301.[Cook 1966] S. A. Cook., On the minimum computation time of functions. Harvard Thesis, 1966.[DA 2006] P. K. Meher, ‘Hardware‐Efficient Systolization of DA‐based Calculation of Finite Digital Convolution,’ IEEE Trans Circuits & Systems‐II, pp.707‐711, Aug 2006.

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 52

Page 53: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

references

[DCT 1996] S. S. Nayak and P. K. Meher, ‘A 3‐Dimensional Systolic Architecture for Parallel VLSI Implementation of the Discrete Cosine Transform’, IEE Proceeding ~ Circuits, Devices and Systems, vol.143, no.5, pp.255‐258, Oct 1996. g y pp[DCT 2006] P. K. Meher, ‘Systolic Designs for DCT using a Low‐Complexity Concurrent Convolutional Formulation,’ IEEE Trans Circuits & Systems for Video Technology, September 2006, pp.1041‐1050.[DFT 2D 2005] P. K. Meher, ‘Design of a Fully‐Pipelined Systolic Array for Flexible Transposition‐Free VLSI of 2‐D DFT’, IEEE Trans Circuits and Systems‐II,  Feb 2005.[DFT 2006] P. K. Meher, ‘Efficient Systolic Implementation of DFT using a Low‐Complexity Convolution like Formulation ’ IEEE Trans Circuits & SystemsComplexity Convolution‐like Formulation,  IEEE Trans Circuits & Systems‐II, vol.53, no.8, pp.702‐706, August 2006.[DHT 1993] P. K. Meher, J. K. Satapathy and G. Panda, ‘Efficient Systolic Solution for a New Prime‐Factor Discrete Hartley Transform Algorithm’, IEE Proceedings ~ y g , gCircuits, Devices and Systems, vol. 140, no.2, pp. 135‐139, April 1993.[DHT 2D 1993] P. K. Meher and G. Panda, ‘Novel Recursive Algorithm and Highly Compact Semi‐Systolic Architecture for High‐Throughput Computation of 2‐D 

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010

DHT’, IEE Electronics Letters, pp. 883‐885, May 1993.

53

Page 54: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

references

[DHT 2007] P. K. Meher, J. C. Patra and M. N. S. Swamy, ‘High‐Throughput Memory‐Based Architecture for DHT Using a New Convolutional Formulation,’ IEEE Trans. on Circuits & Systems‐II, vol.54, no.7, pp.606‐610, July 2007.y , , , pp , y[DST 2007] P. K. Meher and M. N. S. Swamy, ‘New Systolic Algorithm and Array Architecture for Prime‐Length Discrete Sine Transform,’ IEEE Trans Circuits & Systems‐II, vol.54, no.2, pp.262‐266, March 2007.[DXT 2006] P. K. Meher, ‘Unified Systolic‐Like Architecture for DCT and DST Using Distributed Arithmetic,’ IEEE Trans Circuits & Systems‐I, vol.53, no.12, pp.2656‐2663, December 2006.[DWT 2010] B K M h t d P K M h ‘P ll l d Pi li A hit t[DWT 2010] B. K. Mohanty and P. K. Meher, ‘Parallel and Pipeline Architectures for High‐Throughput Computation of Multilevel 3‐D DWT,’IEEE Trans. on Circuits & Systems for Video Technology, pp.1200‐1209, September 2010.[FIR 2008] P. K. Meher, S. Chandrasekaran, and A. Amira, ‘FPGA Realization of FIR[FIR 2008] P. K. Meher, S. Chandrasekaran, and A. Amira,  FPGA Realization of FIR Filters by Efficient and Flexible Systolization Using Distributed Arithmetic,’ IEEE Trans Signal Processing, pp. 3009‐3017, July 2008. 

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 54

Page 55: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

references

[FIR 2010] P. K. Meher, ‘New Approach to Look‐up‐Table Design and Memory‐Based Realization of FIR Digital Filter,’ IEEE Trans Circuits & Systems‐I, pp.592‐603, March 2010.,[Lin  1996] Ing‐Song Lin and S. K. Mitra, ‘Overlapped block digital filtering,’ Circuits and Systems II August 1996, pp. 586 – 596.[Narayan 1983] S.S. Narayan, A. Peterson, M. Narasimha, Transform domain LMS algorithm , IEEE Trans ASSP, 1983 , pp. 609 – 615. [Nose 2002] K.Nose, M.Hirabayashi, H.Kawaguchi, S.Lee, and T.Sakurai, VTH‐hopping scheme to reduce subthreshold leakage for low‐power processors, IEEE J l f S lid St t Ci it 2002 413 419Journal of Solid‐State Circuits, 2002, pp. 413 – 419.[Parhi 1999] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: John Wiley & Sons, Inc, 1999.[Wang 2005] A Wang A Chandrakasan A 180 mV subthreshold FFT processor[Wang 2005] A. Wang, A. Chandrakasan, A 180‐mV subthreshold FFT processor using a minimum  energy design methodology, IEEE Journal of Solid‐State Circuits, 2005, pp. 310‐319

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 55

Page 56: Algorithm Architecture Co for DSP Applications · Tukey FFT algorithm and several other algorithms for di tdiscrete FiFourier and other di tdiscrete tftransforms [Cooley 1995]. polynomial

Thank You !

Pramod Meher, Institute for Infocomm Research, Singapore17/12/2010 56