Post on 17-Apr-2020


VLSI Signal Processing
Lecture 1: Pipelining & Retiming (Review version)
ADSP Lecture1 - Pipelining & Retiming (cwliu@twins.ee.nctu.edu.tw), twins.ee.nctu.edu.tw/courses/vsp_11_summer/lecture...


DSP Representations
• DSP algorithms are non-terminating
• Iteration period: the time for one iteration, i.e. one execution of the loop within the DSP algorithm
• Critical path: the longest path without delay elements
• The critical path sets a bound on the clock frequency, i.e. $T_{clk} \ge T_{critical}$
• Sampling rate: number of samples processed per second
• Latency: time difference between an output and the corresponding input (combinational logic: measured in gate delays; sequential logic: measured in clock cycles)
• The clock rate (or frequency) of the DSP system is not the same as its sampling rate

$y[n] = h_0 x[n] + h_1 x[n-1] + h_2 x[n-2] + h_3 x[n-3]$


Critical Path – fundamental limit
• 4 possible path types
– Input node to delay element
– Input node to output node
– Delay element to delay element
– Delay element to output node
• Critical path: the path with the longest computation time among all paths without delay elements
• The clock period is bounded below by the computation time of the critical path
• Example: 5-tap FIR filter with $T_A = 4$ ns and $T_M = 10$ ns: critical path = $T_M + 4T_A$ = 26 ns
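The 26 ns figure can be checked with a short sketch. It assumes the direct-form structure in which one multiplier feeds a linear chain of N-1 adders; the function name and the MHz conversion are illustrative and not part of the slides.

```python
def fir_direct_form_critical_path(num_taps, t_add, t_mult):
    """Critical path of a direct-form FIR, assuming one multiplier
    followed by a linear chain of (num_taps - 1) adders."""
    if num_taps < 1:
        raise ValueError("num_taps must be at least 1")
    return t_mult + (num_taps - 1) * t_add


# Slide example: 5 taps, T_A = 4 ns, T_M = 10 ns.
t_critical_ns = fir_direct_form_critical_path(num_taps=5, t_add=4.0, t_mult=10.0)

# The clock period cannot be shorter than the critical path,
# so 1 / T_critical is an upper bound on the clock frequency.
max_clock_mhz = 1e3 / t_critical_ns

print(t_critical_ns, "ns")
print(round(max_clock_mhz, 2), "MHz")
```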


Iteration Period
• Iteration: execution of all computations of an algorithm once
• Iteration period: the time required for execution of one iteration
• Iteration rate: the number of iterations executed per second
• Sample rate (throughput): the number of samples processed per second

[Figure: example with iteration period $T_M + T_A$; execution order A0→B0 ⇒ A1→B1 ⇒ A2 …]


Loop Bound
• Loop: a directed path that begins and ends at the same node
• Loop bound of loop j: $T_j / W_j$, where $T_j$ is the loop computation time and $W_j$ is the number of delays in the loop
Examples:
– assume $T_A + T_M = 6$ ns; a, b, c, a is a loop with loop bound = 6 ns, so $T_s \ge T_M + T_A$
– assume $T_A + T_M = 6$ ns; for y(n) = a·y(n-2) + x(n) the loop bound = 3 ns, so $T_s \ge (T_M + T_A)/2$
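A minimal sketch of the two loop-bound examples. The split of the 6 ns into T_A = 2 ns and T_M = 4 ns is an assumption made only so the code has concrete node times; the slides give the sum alone.

```python
from fractions import Fraction


def loop_bound(node_times, num_delays):
    """Loop bound T_j / W_j: total computation time of the nodes in the
    loop divided by the number of delays in the loop."""
    if num_delays < 1:
        raise ValueError("a loop in a computable DFG has at least one delay")
    return Fraction(sum(node_times), num_delays)


# Assumed split of T_A + T_M = 6 ns (hypothetical values).
T_A, T_M = 2, 4

one_delay_loop = loop_bound([T_A, T_M], num_delays=1)   # loop with a single delay
two_delay_loop = loop_bound([T_A, T_M], num_delays=2)   # y(n) = a*y(n-2) + x(n)

print(one_delay_loop, two_delay_loop)
```

Using `Fraction` keeps bounds such as 5/3 exact when loops are compared.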


Iteration Bound – fundamental limit
• Critical loop: the loop with the maximum loop bound
• Iteration bound: the loop bound of the critical loop
• It is not possible to achieve an iteration period lower than the iteration bound, even with infinite processing power
• Definition: $T_\infty = \max_j \{ T_j / W_j \}$, taken over all loops j
• Example


Architecture Transformation: Decimation
[Figure: decimation filter built from h(n), z^-1 delays and ↓M down-samplers, shown in several equivalent forms]


Architecture Transformation: Interpolation
[Figure: interpolation filter built from ↑L up-samplers, h(n) and z^-1 delays, including the transposed form]


Data-Flow Graph (DFG)
• A DFG captures the data-driven property of the DSP algorithm
• Nodes represent computations, functions, or tasks
• Directed edges represent data flow, each with a non-negative number of delays
• A node can execute whenever all its input data are available
• Each edge describes a precedence constraint between two nodes in the DFG
– Intra-iteration precedence constraint: if the edge has no delay
– Inter-iteration precedence constraint: if the edge has one or more delays

$y[n] = h_0 x[n] + h_1 x[n-1] + h_2 x[n-2] + h_3 x[n-3]$


Algorithm for Computing Iteration Bound
Longest Path Matrix (LPM) Algorithm
• d: number of delays; $d_i$: the i-th delay element
• Construct matrices $L^{(m)}$, $1 \le m \le d$, where each entry $\ell_{i,j}^{(m)}$ is the longest computation time among all paths from $d_i$ to $d_j$ that pass through exactly m-1 delays
• $\ell_{i,j}^{(m)} = -1$ if no such path exists
• $L^{(m+1)}$ can be obtained from $L^{(1)}$ and $L^{(m)}$ recursively: $\ell_{i,j}^{(m+1)} = \max_{k \in K} \{ \ell_{i,k}^{(1)} + \ell_{k,j}^{(m)} \}$, where K is the set of k with $\ell_{i,k}^{(1)} \ne -1$ and $\ell_{k,j}^{(m)} \ne -1$; if K is empty, $\ell_{i,j}^{(m+1)} = -1$
• Note that the diagonal element $\ell_{i,i}^{(m)}$ is the longest computation time of all loops with m delays that contain $d_i$
• The iteration bound is then $T_\infty = \max \{ \ell_{i,i}^{(m)} / m \}$ for $1 \le i, m \le d$
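The recursion can be sketched directly on nested lists. The matrix `L1` below is an assumed illustrative input for a DFG with four delays; it is not taken from the slides, and the function names are hypothetical.

```python
from fractions import Fraction

NO_PATH = -1  # marker used on the slide for "no such path"


def next_lpm(l1, lm):
    """Compute L(m+1) from L(1) and L(m): for each (i, j), take the maximum
    of l1[i][k] + lm[k][j] over all k where both entries are valid paths."""
    d = len(l1)
    nxt = [[NO_PATH] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            candidates = [l1[i][k] + lm[k][j] for k in range(d)
                          if l1[i][k] != NO_PATH and lm[k][j] != NO_PATH]
            if candidates:
                nxt[i][j] = max(candidates)
    return nxt


def iteration_bound(l1):
    """Return max over i, m of l_ii(m) / m, or None if no loop is found."""
    d = len(l1)
    best = None
    lm = l1
    for m in range(1, d + 1):
        for i in range(d):
            if lm[i][i] != NO_PATH:
                ratio = Fraction(lm[i][i], m)
                best = ratio if best is None else max(best, ratio)
        lm = next_lpm(l1, lm)
    return best


# Assumed example input (hypothetical DFG with d = 4 delays).
L1 = [[-1, 0, -1, -1],
      [4, -1, 0, -1],
      [5, -1, -1, 0],
      [5, -1, -1, -1]]

print(iteration_bound(L1))
```

For this input the diagonal entries of L(2) and L(4) dominate, giving a bound of 2 time units.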


Pipelining & Parallel Processing
• Pipelining
– Uses pipelining latches to reduce the critical path
– Either increases the clock speed or sample speed, or reduces the power consumption at the same speed
• Parallel processing
– Uses replicated hardware to process multiple samples in parallel in a clock period
– The effective sampling speed is increased by the level of parallelism
– Can also be used to reduce power consumption

[Figure labels: "continue sending"; pipelining: Tsample = TCLK; parallel processing: Tsample ≠ TCLK]


Basic Ideas
• Parallel processing (rows are processors, time runs left to right)
P1: a1 a2 a3 a4
P2: b1 b2 b3 b4
P3: c1 c2 c3 c4
P4: d1 d2 d3 d4
• Pipelined processing (rows are processors, time runs left to right)
P1: a1 b1 c1 d1
P2: a2 b2 c2 d2
P3: a3 b3 c3 d3
P4: a4 b4 c4 d4
Interleaving: how?


Pipelining Latches
Direct Form 4-tap FIR
1. $T_{critical} = T_M + (N-1) T_A$, where N is the number of taps
2. Clock speed is limited by the critical path

2-level pipelined architecture (feed-forward pipelining latches)
1. $T_{critical} = T_M + T_A$
2. Pipelining does not affect functionality but introduces latency

How to insert pipelining latches? Retiming
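The claim that pipelining preserves functionality while adding latency can be illustrated with a small simulation. This is a sketch under assumptions: the placement of the feed-forward cutset (after the first two taps) is chosen for illustration and need not match the figure, and all registers are assumed to reset to 0.

```python
def fir_direct(h, x):
    """Reference 4-tap FIR: y[n] = sum_k h[k] * x[n-k], with x[n] = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_two_stage(h, x):
    """The same filter with one feed-forward cutset of latches.
    Stage 1 forms h0*x[n] + h1*x[n-1]; stage 2 adds the remaining taps."""
    h0, h1, h2, h3 = h
    x1 = x2 = x3 = 0            # delay line holding x[n-1], x[n-2], x[n-3]
    w_latch = 0                 # latch on the stage-1 partial sum
    x2_latch = x3_latch = 0     # latches on the delayed samples crossing the cutset
    y = []
    for xn in x:
        # Stage 2 only sees values latched on the previous clock edge.
        y.append(w_latch + h2 * x2_latch + h3 * x3_latch)
        # Clock edge: latches capture this cycle's stage-1 values.
        w_latch = h0 * xn + h1 * x1
        x2_latch, x3_latch = x2, x3
        x1, x2, x3 = xn, x1, x2
    return y


h = [1, 2, 3, 4]
x = [1, 0, -1, 2, 5, 3]
reference = fir_direct(h, x)
pipelined = fir_two_stage(h, x)
print(reference)
print(pipelined)
```

The pipelined output is the reference output delayed by one sample, which is the extra latency introduced by the latches.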


Retiming
• Retiming moves existing delays around such that it
– Does not alter the latency of the system
– Changes (reduces) the critical path of the system
– Reduces the number of registers
• Pipelining is equivalent to introducing many delays at the input followed by retiming

[Figure: retiming a node, with -1D on one side and +1D on the other]


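Cutset Retiming
[Figure: a cutset separates the graph into sub-graphs G1 and G2, with +kD on the edges crossing in one direction and -kD on the edges crossing in the other. Example: a DFG with nodes A to F and edge delays D and 2D, before and after cutset retiming]
Add delays to edges going one way and remove them from edges going the other way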
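A minimal sketch of the rule, with the graph stored as a dictionary from edges to delay counts. The three-node graph and the convention that edges from G1 into G2 gain delays are assumptions for illustration, not the example in the figure.

```python
def cutset_retime(edges, g2_nodes, k):
    """Add k delays to every edge going from G1 into G2 and remove k delays
    from every edge going from G2 into G1. Edges inside G1 or G2 are untouched.

    edges: dict mapping (u, v) -> number of delays on the edge u -> v.
    g2_nodes: set of nodes in G2; all other nodes form G1.
    """
    retimed = {}
    for (u, v), w in edges.items():
        if u not in g2_nodes and v in g2_nodes:
            w += k
        elif u in g2_nodes and v not in g2_nodes:
            w -= k
        if w < 0:
            raise ValueError(f"edge {u}->{v} would have a negative delay count")
        retimed[(u, v)] = w
    return retimed


# Hypothetical loop A -> B -> C -> A carrying 1, 0 and 2 delays.
edges = {("A", "B"): 1, ("B", "C"): 0, ("C", "A"): 2}
retimed = cutset_retime(edges, g2_nodes={"C"}, k=1)
print(retimed)
```

Here one delay moves from the output edge of C to its input edge, and the total number of delays in the loop stays at 3.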


Cutset Retiming
[Figure: cutset between G1 and G2, with +kD on the edges in one direction and -kD on the edges in the other, before and after retiming]
• Special case: a node cutset, i.e. G2 = { Vi }; k is then the retiming value of the node
• Retiming preserves functionality


Example – Node Retiming
The cutset is drawn around one specified node


Node Retiming
• The cutset can be selected such that one of the disjoint sub-graphs contains only one node
• Retiming value r(V): defined as the number of delays moved from each output edge of the node V to each of its input edges
• Feasibility constraints: for each directed edge U→V of the retimed SDFG, the number of delay elements must be non-negative:
$w_r(e) = w(e) + r(V) - r(U) \ge 0 \iff r(U) - r(V) \le w(e)$
where $w_r(e)$ and $w(e)$ are the numbers of delay elements on the edge U→V after and before retiming, respectively
• For any SDFG, there is a retiming design space defined by the system of inequalities (each edge contributes an inequality)
• Solving systems of inequalities
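The retiming equation and the feasibility check translate into a few lines. The four-node graph and the retiming values below are assumptions chosen for illustration; they do not come from the slides.

```python
def retime(edges, r):
    """Apply w_r(e) = w(e) + r(V) - r(U) to every edge e = U -> V."""
    return {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}


def is_feasible(edges, r):
    """A retiming is feasible when no retimed edge has a negative delay count."""
    return all(w >= 0 for w in retime(edges, r).values())


# Hypothetical SDFG: (U, V) -> number of delays on the edge.
edges = {(1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0, (2, 1): 1}

good_r = {1: 0, 2: 1, 3: 0, 4: 0}   # moves one delay from the output of node 2 to its inputs
bad_r = {1: 0, 2: 0, 3: 1, 4: 0}    # would need a negative delay on edge 3 -> 2

print(retime(edges, good_r), is_feasible(edges, good_r))
print(is_feasible(edges, bad_r))
```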


Node Retiming with Formation

Result


Properties of Retiming
• P.1: The weight of a retimed path p = V0, V1, …, Vk is given by $w_r(p) = w(p) + r(V_k) - r(V_0)$
• P.2: Retiming does not change the number of delays in a cycle (special case of P.1 with Vk = V0)
• P.3: Retiming does not alter the iteration bound of a DFG, because the number of delays in a cycle does not change
• P.4: Adding a constant value j to the retiming value of each node does not change the mapping from G to Gr

Goal: use retiming to find the minimum critical path


Retiming for Min Clock Period (1/3)
• $\Phi(G)$: the minimum feasible clock period, defined as $\Phi(G) = \max\{ t(p) : w(p) = 0 \}$
• In the retiming design space (defined by the system of inequalities), find a retimed SDFG with the minimum $\Phi(G)$
• Definitions:
– W(U,V): the minimum number of registers on any path from node U to node V: $W(U,V) = \min\{ w(p) : U \xrightarrow{p} V \}$
– D(U,V): the maximum computation time among all paths from U to V with weight W(U,V): $D(U,V) = \max\{ t(p) : U \xrightarrow{p} V \text{ and } w(p) = W(U,V) \}$
Here w(p) denotes the number of delays on the path p and t(p) denotes the computation time of the path p


Retiming for Min Clock Period (2/3)
Step 1: Compute W(U,V) and D(U,V)
• Let $M = t_{max} \cdot n$, where $t_{max}$ is the maximum computation time of the nodes in G and n is the number of nodes in G
• Form a new graph G', which is the same as G except that the edge weights are replaced by $w'(e) = M w(e) - t(U)$ for all edges e from U to V
• Solve the all-pairs shortest-path problem on G' (e.g. with the Floyd-Warshall algorithm). Let $S'_{UV}$ be the length of the shortest path from U to V
• If U = V, then W(U,V) = 0 and D(U,V) = t(U); otherwise $W(U,V) = \lceil S'_{UV} / M \rceil$ and $D(U,V) = M \cdot W(U,V) - S'_{UV} + t(V)$
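Step 1 can be sketched as follows, assuming integer computation times and a graph given as dictionaries. The four-node example (two 1-unit nodes and two 2-unit nodes) is an assumption for illustration; the function name is hypothetical.

```python
def compute_w_d(t, edges):
    """Compute W(U,V) and D(U,V) by the procedure above.

    t: dict node -> integer computation time.
    edges: dict (U, V) -> number of delays on the edge U -> V.
    Returns two dicts keyed by (U, V); unreachable pairs are omitted.
    """
    nodes = list(t)
    big_m = max(t.values()) * len(nodes)          # M = t_max * n
    inf = float("inf")

    # Edge weights of G': w'(e) = M * w(e) - t(U).
    s = {u: {v: inf for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        s[u][v] = min(s[u][v], big_m * w - t[u])

    # Floyd-Warshall all-pairs shortest paths on G'.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if s[i][k] + s[k][j] < s[i][j]:
                    s[i][j] = s[i][k] + s[k][j]

    W, D = {}, {}
    for u in nodes:
        for v in nodes:
            if u == v:
                W[u, v], D[u, v] = 0, t[u]
            elif s[u][v] < inf:
                W[u, v] = -(-s[u][v] // big_m)    # integer ceiling of S'/M
                D[u, v] = big_m * W[u, v] - s[u][v] + t[v]
    return W, D


# Hypothetical SDFG: nodes 1, 2 take 1 time unit; nodes 3, 4 take 2.
t = {1: 1, 2: 1, 3: 2, 4: 2}
edges = {(1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0, (2, 1): 1}
W, D = compute_w_d(t, edges)
print(W[1, 2], D[1, 2])
print(W[3, 4], D[3, 4])
```

Scaling the delay count by M makes the delay count dominate the path length, so a single shortest-path run minimizes the number of registers first and, among those paths, maximizes the computation time.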


Retiming for Min Clock Period (3/3)
Step 2: Given a clock period c, find a feasible solution such that $\Phi(G) \le c$
• The retimed design space is now defined by
– feasibility constraints: $r(U) - r(V) \le w(e)$ for every edge e from U to V
– critical path constraints: $r(U) - r(V) \le W(U,V) - 1$ for all nodes U and V in G such that $D(U,V) > c$
i.e. (if $D(U,V) > c$, then $r(U) - r(V) \le W(U,V) - 1$) $\iff$ (if $W(U,V) + r(V) - r(U) < 1$, then $D(U,V) \le c$)
• Note that feasibility alone already gives $r(U) - r(V) \le W(U,V)$
• If the solution set of the system of inequalities is not empty ($\ne \phi$), try a smaller c
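One common way to solve such a system of difference constraints is a Bellman-Ford pass over a constraint graph; the slides do not name a solver, so this choice is an assumption. The graph and its W and D tables below are assumed example values for a four-node SDFG, worked out by hand for this sketch.

```python
def solve_retiming(nodes, edges, W, D, c):
    """Return retiming values r meeting clock period c, or None if the
    system of inequalities has no solution."""
    # Each inequality r(U) - r(V) <= b becomes a constraint-graph edge V -> U of length b.
    constraints = [(v, u, w) for (u, v), w in edges.items()]              # feasibility
    constraints += [(v, u, W[u, v] - 1) for (u, v) in D if D[u, v] > c]   # critical path

    # Bellman-Ford from a virtual source joined to every node by a 0-length edge.
    r = {v: 0 for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for src, dst, b in constraints:
            if r[src] + b < r[dst]:
                r[dst] = r[src] + b
                changed = True
        if not changed:
            return r
    return None   # still relaxing after n passes: negative cycle, infeasible


# Assumed example SDFG and its W, D tables (rows and columns ordered as nodes).
nodes = [1, 2, 3, 4]
edges = {(1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0, (2, 1): 1}
W_rows = [[0, 1, 1, 2], [1, 0, 2, 3], [1, 0, 0, 3], [1, 0, 2, 0]]
D_rows = [[1, 4, 3, 3], [2, 1, 4, 4], [4, 3, 2, 6], [4, 3, 6, 2]]
W = {(u, v): W_rows[i][j] for i, u in enumerate(nodes) for j, v in enumerate(nodes)}
D = {(u, v): D_rows[i][j] for i, u in enumerate(nodes) for j, v in enumerate(nodes)}

r2 = solve_retiming(nodes, edges, W, D, c=2)
r1 = solve_retiming(nodes, edges, W, D, c=1)
print(r2)
print(r1)
```

In this example c = 2 is achievable, while c = 1 is rejected because a single 2-unit node already exceeds it. Any constant may be added to all returned values without changing the retimed graph (property P.4).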


Retiming for Register Minimization
• Find a feasible solution with the minimum number of registers in the design space constrained by feasibility and critical-path constraints (the refined retimed space with minimum clock period)
• Multiple fan-out problem
[Figure: node U fanning out to V1, V2, V3 through edges with 3D, D and 7D; the same fan-out drawn with registers D, 2D, 4D]
• Modeled as an integer linear programming (ILP) problem: minimize $\sum_U q(U)$ while satisfying
– longest fan-out constraints: $w(e) + r(V) - r(U) \le q(U)$ for every edge e from U to V
– feasibility constraints: $r(U) - r(V) \le w(e)$
– critical path constraints: $r(U) - r(V) \le W(U,V) - 1$ for all nodes U and V in G such that $D(U,V) > c$
Note that solving the ILP model is NP-hard; a gadget model can be found in the literature.


K Slow-down and Retiming
• K-slow transformation: replace each D by KD [Figure: 2-slow transformation of a loop containing one delay D]
1. K-1 null operations must be inserted
2. Hardware utilization is reduced (e.g. to 50% for K = 2)
• Retiming the 2-slow version: Tclk = 1 t.u., Titer = 2 t.u.
• Hardware can be fully utilized if two independent operations are available
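A minimal sketch of the K-slow transformation and of the null operations it requires. The two-node loop and the function names are hypothetical, and using 0 as the null value is an assumption.

```python
def k_slow(edges, k):
    """K-slow transformation: every edge with w delays gets k * w delays."""
    return {edge: k * w for edge, w in edges.items()}


def interleave_nulls(samples, k, null=0):
    """Insert k - 1 null operations after each input sample."""
    stream = []
    for sample in samples:
        stream.append(sample)
        stream.extend([null] * (k - 1))
    return stream


# Hypothetical loop A -> B -> A with a single delay.
edges = {("A", "B"): 0, ("B", "A"): 1}
slow_edges = k_slow(edges, 2)
stream = interleave_nulls(["x0", "x1", "x2"], 2)
print(slow_edges)
print(stream)
```

Replacing the nulls with samples of a second independent stream is what restores full hardware utilization.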


Cutset Retiming: 2-slow Lattice Filter
• $T_{critical} = 2T_M + (N-1) T_A$, where N is the number of taps
• 2-slow: zeros are inserted to preserve behavior: x0 0 x1 0 x2 0 x3 0 …
• $T_{sample} = 2\,T_{critical}$


Pipelining & Parallel Processing
• M-level pipelining
– The critical path is reduced to 1/M of the original
• L-level parallel processing
– The critical path of the block processing system remains unchanged
– L samples are processed in 1 (not L) clock cycle, so the iteration period is $T_{iteration} = T_{sample} = T_{clk}/L$, where $T_{clk} \ge T_{critical}$
• Combining M-level pipelining with L-level parallel processing gives
– $T_{iteration} = T_{sample} = T_{clk}/(LM)$
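The combined effect is simple arithmetic, sketched here under two idealizing assumptions: M-level pipelining cuts the critical path to exactly 1/M, and the clock runs at the critical path. The 24 ns critical path is a hypothetical value.

```python
def sample_period(t_critical, pipeline_levels=1, parallel_levels=1):
    """Ideal sample period for M-level pipelining combined with
    L-level parallel processing."""
    t_clk = t_critical / pipeline_levels      # pipelining shortens the clock period
    return t_clk / parallel_levels            # L samples are produced per clock cycle


original = sample_period(24.0)
pipelined = sample_period(24.0, pipeline_levels=2)
combined = sample_period(24.0, pipeline_levels=2, parallel_levels=3)
print(original, pipelined, combined)
```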


CMOS Power Consumption
• (Dynamic) power dissipation: $P = C_{total} \cdot V_0^2 \cdot f$
– $C_{total}$: the total capacitance of the CMOS circuit
• Propagation delay: $T_{pd} = \dfrac{C_{charge} \cdot V_0}{k\,(V_0 - V_t)^2}$
– $C_{charge}$: the capacitance to be charged or discharged in a single clock cycle (along the critical path)
– $V_0$: the supply voltage; $V_t$: the threshold voltage
– k: technology parameter


Pipelining Processing

Propagation delay of the original filter and the pipelined filter


Pipelining for Low Power
– propagation delay of the original architecture: $T_{non\text{-}pipelined} = \dfrac{C_{charge} \cdot V_0}{k\,(V_0 - V_t)^2}$
– propagation delay of the M-level pipelined architecture at the reduced supply voltage $\beta V_0$: $T_{pipelined} = \dfrac{(C_{charge}/M) \cdot \beta V_0}{k\,(\beta V_0 - V_t)^2}$
– setting the above two equations equal, the following quadratic equation can be obtained to solve for $\beta$: $M\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2$
Then, $P_{pipelined} = C_{total} \cdot (\beta V_0)^2 \cdot f = \beta^2 \cdot P_{non\text{-}pipelined}$
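The quadratic can be solved in closed form. The sketch below expands it to $M V_0^2 \beta^2 - (2 M V_0 V_t + (V_0 - V_t)^2)\,\beta + M V_t^2 = 0$ and keeps the larger root, which is the one with $\beta V_0 > V_t$. The supply voltage, threshold voltage, and pipelining level are assumed example values; the function names are illustrative.

```python
import math


def propagation_delay(c_charge, v_supply, v_t, k):
    """T_pd = C_charge * V / (k * (V - V_t)^2)."""
    return c_charge * v_supply / (k * (v_supply - v_t) ** 2)


def supply_scaling_factor(levels, v0, v_t):
    """Solve levels * (beta*v0 - v_t)^2 = beta * (v0 - v_t)^2 for the root
    with beta * v0 > v_t (the larger root of the quadratic)."""
    a = levels * v0 ** 2
    b = -(2 * levels * v0 * v_t + (v0 - v_t) ** 2)
    c = levels * v_t ** 2
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


# Assumed example values: V0 = 5 V, Vt = 0.6 V, M = 3 pipeline levels.
V0, VT, M = 5.0, 0.6, 3
beta = supply_scaling_factor(M, V0, VT)
power_ratio = beta ** 2   # P_pipelined / P_non-pipelined

# The pipelined stage (C_charge / M at beta * V0) should match the original delay.
t_original = propagation_delay(1.0, V0, VT, k=1.0)
t_pipelined = propagation_delay(1.0 / M, beta * V0, VT, k=1.0)

print(round(beta, 4), round(power_ratio, 4))
```

The parallel-processing equation has the same form with L in place of M, so the same solver applies there.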


Parallel Processing
Propagation delay of the original filter and the parallel filter


Parallel Processing for Low Power
– propagation delay of the original architecture: $T_{non\text{-}parallel} = \dfrac{C_{charge} \cdot V_0}{k\,(V_0 - V_t)^2}$
– propagation delay of the L-parallel architecture at the reduced supply voltage $\beta V_0$: $T_{parallel} = \dfrac{C_{charge} \cdot \beta V_0}{k\,(\beta V_0 - V_t)^2} = L \cdot T_{non\text{-}parallel}$
– setting these two propagation delays equal, the following quadratic equation can be obtained to solve for $\beta$: $L\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2$
Then, $P_{parallel} = (L \cdot C_{total}) \cdot (\beta V_0)^2 \cdot \dfrac{f}{L} = \beta^2 \cdot P_{non\text{-}parallel}$