Constraint Directed CAD Tool For Automatic Latency-optimal Implementation of
FPGA-based Systolic Arrays
Greg Nash
Reconfigurable Technology: FPGAs and Reconfigurable Processors for
Computing and Communications IV:
SPIE ITCom, Boston, MA, July 29, 2002
Outline
• Introduction to CAD tool, SPADE (symbolic parallel algorithm development environment)
• Design examples: matrix Lyapunov equation, discrete Fourier transform (DFT)
• Isolating useful designs (Lyapunov)
– Alignment of variables in space-time
– Non-optimal solutions
– “Low” bandwidth designs
– “Regular” designs
• Finding optimal solutions (DFT)
– Minimum latency
– Maximum throughput
Systolic Array: Matrix Multiply
[Figure: Space-Time Mapping of the variables c, d and e; projecting along the time axis gives the Systolic Array]
$c[i,j] = \sum_{k=1}^{N} d[i,k]\, e[k,j] \quad \text{for } 1 \le i,j,k \le N$
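The space-time view of matrix multiplication can be sketched in Python. The schedule used here (the processing element at position (i, j) handles index k at time step t = i + j + k, zero-based) is a common textbook mapping chosen for illustration; it is an assumption, not necessarily the mapping drawn on the slide. The name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(d, e):
    """Simulate c = d * e on an N x N grid of processing elements (PEs).

    Assumed mapping (illustrative only): index point (i, j, k) is executed
    by the PE at spatial position (i, j) at time step t = i + j + k.
    """
    n = len(d)
    c = [[0] * n for _ in range(n)]
    last_step = 3 * (n - 1)          # largest value of i + j + k
    for t in range(last_step + 1):   # one iteration per clock step
        for i in range(n):
            for j in range(n):
                k = t - i - j        # index handled by PE (i, j) at step t
                if 0 <= k < n:
                    c[i][j] += d[i][k] * e[k][j]
    latency = last_step + 1          # number of clock steps used
    return c, latency


d = [[1, 2], [3, 4]]
e = [[5, 6], [7, 8]]
c, latency = systolic_matmul(d, e)
print(c, latency)
```

Under this assumed schedule each PE performs at most one multiply-accumulate per step, and the whole product takes 3N - 2 steps.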
Parallel Processing With Systolic Arrays
• Algorithms
– Linear algebra – graph theory – computational geometry
– String matching – sorting/searching – dynamic programming
– Discrete mathematics – number-theoretic algorithms
• Applications (real-time/embedded processing)
– Communications – seismic analysis – signal/image processing
– Adaptive processing – arithmetic arrays
• Architecture
– Simple processing elements – local interconnects – synchronous
– Fine-grained – pipelined – small local memory
– Local control – regular arrays
• Hardware
– FPGA/PLD chips – programmable connections
– Reconfigurable boards – ASICs
Altera Stratix FPGA: DFT Mapping
Systolic DFT Array
SPADE Operation
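Mathematical Algorithm → Input Code → Transformation Search → Simulator, Graphical Outputs
Transformation search: (i,k) → (S,T), one transformation per variable
$\begin{bmatrix} S \\ T \end{bmatrix}_y = M_y \begin{bmatrix} i \\ k \end{bmatrix}, \quad \begin{bmatrix} S \\ T \end{bmatrix}_h = M_h \begin{bmatrix} i \\ k \end{bmatrix}, \quad \begin{bmatrix} S \\ T \end{bmatrix}_x = M_x \begin{bmatrix} i \\ k \end{bmatrix}$
S = spatial coordinates, T = temporal coordinates, M = transformation solution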
A = LU
$u_{12} = a_{12}/l_{11}; \quad u_{13} = a_{13}/l_{11}$
$u_{ij} = \left( a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj} \right) / l_{ii} \quad \text{for } j > i,\ i > 1,\ j \le 3$
$l_{11} = a_{11}; \quad l_{21} = a_{21}; \quad l_{31} = a_{31}$
$l_{ij} = a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj} \quad \text{for } i \ge j,\ j > 1,\ i \le 3$
for i to N do
  for j to N do
    if j=1 and i>=1 and i<=N then
      l[i,j]:=a[i,j];
    elif i=1 and j>1 and j<=N then
      u[i,j]:=a[i,j]/l[i,i];
    fi;
    if i>=j and j>1 and i<=N then
      l[i,j]:=a[i,j]-add(l[i,k]*u[k,j],k=1..j-1)
    fi;
    if j>i and i>1 and j<=N then
      u[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..i-1))/l[i,i]
    fi;
  od
od;
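For readers who want to check the LU recurrences numerically, here is a minimal Python sketch of the same computation. It assumes zero-based indices, no pivoting, and a unit diagonal for U (the recurrences never set or read the diagonal of U, so this is an interpretation). The names `crout_lu` and `matmul` are hypothetical helpers, not part of SPADE.

```python
def crout_lu(a):
    """Factor a = l * u following the recurrences, with u unit upper triangular."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    # Assumption: U has ones on its diagonal.
    u = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i >= j:
                # Lower-triangular entry; for j == 0 the sum is empty, so l = a.
                l[i][j] = a[i][j] - sum(l[i][k] * u[k][j] for k in range(j))
            else:
                # Strictly upper entry; for i == 0 this reduces to a / l.
                u[i][j] = (a[i][j] - sum(l[i][k] * u[k][j] for k in range(i))) / l[i][i]
    return l, u


def matmul(p, q):
    """Plain dense matrix product, used only to check the factorization."""
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
l, u = crout_lu(a)
print(matmul(l, u))  # should reproduce a up to rounding
```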
Algorithm Domain
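• Multiple statements of the general form
$x(A_x I + a_x)$ depends on $y(B_y I + b_y)$ for all index points $I$ in the algorithm space
– e.g., $x(2i, j+1)$ corresponds to $A_x = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$, $I = \begin{bmatrix} i \\ j \end{bmatrix}$, $a_x = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$
– Where $A_x$, $B_y$ are integer matrices and $a_x$, $b_y$ are integer vectors, S is the dimension of the algorithm space, and the dependencies include commutative and associative operators such as min, max and summation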
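The affine indexing form can be illustrated with a few lines of Python. This is only a sketch of how an index point I is mapped to the array element A*I + a; the helper name `affine_index` is hypothetical.

```python
def affine_index(A, a, I):
    """Return A*I + a for an integer matrix A, offset vector a and index point I."""
    return tuple(sum(A[r][c] * I[c] for c in range(len(I))) + a[r]
                 for r in range(len(A)))


A_x = [[2, 0], [0, 1]]
a_x = [0, 1]
# Element x(2i, j+1) referenced at the index point (i, j) = (3, 5).
print(affine_index(A_x, a_x, (3, 5)))  # (6, 6)
```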
SPADE Functionality
• Scheduling
• Reindexing
• Localization
• Allocation
• Constraint introduction
• Solutions
– Primary objective function: latency
– Secondary objective functions: area, regularity, bandwidth
• Automatic operation
“Time-alignment” Constraint
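Matrix-matrix multiplication:
$c[i,j] = \sum_{k=1}^{N} d[i,k]\, e[k,j] \quad \text{for } 1 \le i,j,k \le N$
[Figure: Space-Time Mapping and Systolic Array (N=6) for the variables c, d and e, annotated with the time-axis vector $u_t$ and the products $u_t \cdot n_c$, $u_t \cdot n_d$ and $u_t \cdot n_e$, each compared with 0]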
Lyapunov Matrix Equation Example
• Abstract problem: find X given A (lower triangular) and B (upper triangular)
$AX + XB = C$
• Convert to mathematical expression
$x(i,j) = \dfrac{c(i,j) - \sum_{k=1}^{i-1} a(i,k)\, x(k,j) - \sum_{l=1}^{j-1} b(l,j)\, x(i,l)}{a(i,i) + b(j,j)}$
• Non-uniform recurrence equation in Maple language
for i to N do
  for j to N do
    x[i,j] := (c[i,j]-add(a[i,k]*x[k,j],k=1..i-1)-add(b[l,j]*x[i,l],l=1..j-1))/(a[i,i]+b[j,j]);
  od;
od;
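The Lyapunov recurrence can be checked with a short sequential Python sketch. It uses zero-based indices and assumes a(i,i) + b(j,j) is never zero; `solve_lyapunov` and `residual` are hypothetical helper names, and the sample matrices are made up for the check.

```python
def solve_lyapunov(a, b, c):
    """Solve AX + XB = C for X, with A lower triangular and B upper triangular."""
    n = len(c)
    x = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = c[i][j]
            acc -= sum(a[i][k] * x[k][j] for k in range(i))   # rows above
            acc -= sum(b[l][j] * x[i][l] for l in range(j))   # columns to the left
            x[i][j] = acc / (a[i][i] + b[j][j])
    return x


def residual(a, b, c, x):
    """Largest absolute entry of AX + XB - C."""
    n = len(c)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            ax = sum(a[i][k] * x[k][j] for k in range(n))
            xb = sum(x[i][l] * b[l][j] for l in range(n))
            worst = max(worst, abs(ax + xb - c[i][j]))
    return worst


a = [[2.0, 0.0], [1.0, 3.0]]   # lower triangular
b = [[1.0, 4.0], [0.0, 5.0]]   # upper triangular
c = [[3.0, 1.0], [2.0, 7.0]]
x = solve_lyapunov(a, b, c)
print(residual(a, b, c, x))    # close to zero
```

Each x[i][j] depends only on entries above it in its column and to its left in its row, which is the dependency structure the space-time mapping has to respect.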
Non-latency Optimal Solutions
• Two minimum area, latency optimal designs (L=4N-3) found
• Four smaller area, non-optimal designs (L=4N-2) found
Space-Time View (N=6)
Minimum Bandwidth Secondary Objective Function
• Minimum area secondary objective function, x, a, and b time aligned
– 2 unique designs found
– 8 unique data flow paths
– 5 different directions
– Some PEs experience 6 different flows of data
• Minimum bandwidth secondary objective function
– Single unique minimum area design found
– Variable x placed in “center” of array
[Figures: array designs for N=6]
Maximum Regularity Secondary Objective Function
• Desire simple orthogonal interconnection network topology with minimum number of interconnections
• Avoid time aligned variables (introduces O(N) memory per PE)
• Preference for “close” dependency relations between variables
• Four unique solutions found
• Reject
[Figure: three candidate designs (N=6), each showing the variables x, a and b]
1D DFT Design Example
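• Mathematical derivation (base-4 form)
$Z[k] = \sum_{n=1}^{N} X[n]\, e^{-j(2\pi/N)(k-1)(n-1)}, \quad k = 1, 2, \ldots, N$
$Z = CX$
Base-4 Transformation:
$Y = W_M \bullet (C_{M1} X)$
$Z = C_{M2} Y^t$
where $\bullet$ is element-by-element multiplication and
$Z = \begin{bmatrix} Z_1 & \cdots & Z_{N/4} \\ Z_{1+N/4} & \cdots & Z_{N/2} \\ Z_{1+N/2} & \cdots & Z_{3N/4} \\ Z_{1+3N/4} & \cdots & Z_N \end{bmatrix}, \quad X = \begin{bmatrix} X_1 & \cdots & X_{N/4} \\ X_{1+N/4} & \cdots & X_{N/2} \\ X_{1+N/2} & \cdots & X_{3N/4} \\ X_{1+3N/4} & \cdots & X_N \end{bmatrix}$
• SPADE input code
for j to N/4 do
  for k to N/4 do
    Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k],i=1..4);
  od;
  for k to 4 do
    Z[k,j] := add(CM2[k,i]*Y[j,i],i=1..N/4);
  od
od;
• Desired constraints
– Minimize number of multipliers (time-align Y)
– Time-align X, Z at array boundary
– Keep coefficient matrices CM1 and CM2 internal to the array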
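The base-4 form can be sanity-checked against a direct DFT in Python. The slides do not list the entries of CM1, CM2 and WM, so the definitions below are an assumption obtained by substituting n-1 = a*N/4 + b and k-1 = p*N/4 + q into the DFT sum: CM1[q][a] = (-j)^(q*a), WM[q][b] = exp(-j*2*pi*q*b/N), CM2[p][b] = (-j)^(p*b). This particular factorization also assumes N is a multiple of 16. The function names are hypothetical.

```python
import cmath


def base4_dft(x):
    """Compute the DFT of x through Y = WM .* (CM1 X), Z = CM2 Y^t (assumed entries)."""
    n = len(x)
    if n % 16 != 0:
        raise ValueError("this sketch assumes the length is a multiple of 16")
    m = n // 4
    pow_neg_j = [1, -1j, -1, 1j]              # (-j)**e for e mod 4
    # X as a 4 x N/4 matrix, filled row by row.
    X = [[x[a * m + b] for b in range(m)] for a in range(4)]
    # Y[q][b] = WM[q][b] * sum_a CM1[q][a] * X[a][b]; the only true multiplies.
    Y = [[cmath.exp(-2j * cmath.pi * q * b / n)
          * sum(pow_neg_j[(q * a) % 4] * X[a][b] for a in range(4))
          for b in range(m)] for q in range(m)]
    # Z[p][q] = sum_b CM2[p][b] * Y[q][b]
    Z = [[sum(pow_neg_j[(p * b) % 4] * Y[q][b] for b in range(m))
          for q in range(m)] for p in range(4)]
    return [Z[p][q] for p in range(4) for q in range(m)]  # row-by-row read-out


def direct_dft(x):
    """Reference O(N^2) DFT straight from the definition."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


samples = [complex(i % 5, (i * i) % 3) for i in range(16)]
error = max(abs(u - v) for u, v in zip(base4_dft(samples), direct_dft(samples)))
print(error)  # close to zero
```

Multiplying by an element of {1, -1, -i, i} only swaps or negates real and imaginary parts, so in hardware the CM1 and CM2 products reduce to additions.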
Base-4 vs. Previous Systolic Designs
• CM1 and CM2 contain only elements from the set {1,-1,-i,i}
– CM1 X and CM2 Y^t only involve complex additions
• Twiddle factor matrix WM is of dimension N/4
– 4x fewer complex multiplies with 2x more complex adders (previous designs require one complex multiply/add per transform point)
• Takes advantage of reduced arithmetic with radix-4 butterfly, but transform length not limited to N = r^m
$Y = W_M \bullet (C_{M1} X)$
$Z = C_{M2} Y^t$
1D DFT Systolic Design Result
• Maximum regularity secondary objective function
• Latency = 3N/4+7
• 16 designs found
• Very irregular space-time mappings
[Figure: Systolic Array and Space-Time Views (N=64), showing the variables X, Y, Z, CM1 and CM2]
DFT: Constraints Relaxed
• Requires either
– X/CM2 time aligned, Z/CM1 internal
– Z/CM1 time aligned, X/CM2 internal
• Minimum area secondary objective designs for 1D DFT
– Latency = N/2 + 8
– Six unique designs
– Block processing time = N/4 + 6
– Structure moderately irregular
[Figure: Space-Time View (N=64), showing the variables X, Y, Z, CM1, CM2, IM1 and IM2]
1D DFT: Throughput Vs. Latency
• High computational efficiencies inside space-time variable mappings are necessary to achieve the best latencies
• High computational efficiency in entire space-time volume is necessary for high throughputs
• Designs need to be “stackable” in time
Latency and Throughput Optimal Designs
• Maximum regularity setting
• Two structurally different designs
– X/CM2 time aligned, Z/CM1 internal
– Z/CM1 time aligned, X/CM2 internal
• Latency = N/2 + 8
• Throughput = N/4 +1
• Very regular structure
Systolic Array (N=64)
Space-time view, two DFT iterations (N=64)
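A small Python sketch shows what "stacking" designs in time means for a stream of transforms. It assumes that the quoted throughput figure, N/4 + 1, is the number of cycles between the starts of successive DFTs, so that K back-to-back transforms finish after latency + (K - 1) * block time cycles. That reading is an assumption, and `total_cycles` is a hypothetical name.

```python
def total_cycles(n, num_transforms):
    """Cycles to finish a stream of back-to-back N-point DFTs (assumed model)."""
    latency = n // 2 + 8        # from the slide: Latency = N/2 + 8
    block_time = n // 4 + 1     # from the slide: Throughput = N/4 + 1
    return latency + (num_transforms - 1) * block_time


print(total_cycles(64, 1))  # 40
print(total_cycles(64, 2))  # 57
```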
2D NxN DFT Design
• N 1D “row” DFTs followed by N “column” DFTs
• 1D DFT computation by factoring N = n1 * n2 and doing a 2D n1 x n2 DFT
• Uses both of the two optimal systolic designs
– X/CM2 time aligned, Z/CM1 internal
– Z/CM1 time aligned, X/CM2 internal
Parameter                              Prior Designs      Base-4 Array
Multipliers                            N^2                N/4
Adders                                 N^2                2N
Block Processing Time                  2N+1               N^2/2 + 6N + 18
Latency                                4N                 N^2/2 + 25N/4 + 19
~area x throughput^-1 = (M+A/4) CPD    5N^3/2 + 5N^2/4    3N^3/8 + 9N^2/2 + 27N/2
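The last row of the table can be reproduced from the rows above it. The Python sketch below assumes that M and A are the multiplier and adder counts and that CPD is the block processing time in cycles; the slide does not define these symbols, but this reading gives exactly the closed forms in the table. The function names are hypothetical.

```python
from fractions import Fraction


def area_time(multipliers, adders, block_time):
    """Approximate area x (1/throughput) = (M + A/4) * CPD, in exact arithmetic."""
    return (multipliers + Fraction(adders, 4)) * block_time


def prior_designs(n):
    # N^2 multipliers, N^2 adders, block processing time 2N + 1.
    return area_time(n * n, n * n, 2 * n + 1)


def base4_array(n):
    # N/4 multipliers, 2N adders, block processing time N^2/2 + 6N + 18.
    return area_time(Fraction(n, 4), 2 * n, Fraction(n * n, 2) + 6 * n + 18)


for n in (16, 64):
    print(n, prior_designs(n), base4_array(n))
```

For N = 16 this gives 10560 for the prior designs and 2904 for the base-4 array.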
Systolic vs. “Pipelined” 16x16 DFT
† S. Yu and E. Swartzlander, “A Pipelined Architecture for the Multidimensional DFT,” IEEE Trans. Signal Processing, Vol. 49, No. 9, Sept. 2001.
Type Mult Add Registers ROM RAM tcycle Data/cycle
Systolic 4 32 80 16 256 mult 1.6
Pipelined† 6 32 292 24 - mult 4
More Information
• “Automatic Generation of Systolic Array Designs For Reconfigurable Computing,” Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA '02), International Multiconference in Computer Science, Las Vegas, Nevada, June 24, 2002.
– General description of SPADE
– Faddeev algorithm (Find CX+D, given AX=B, X is unknown)
• “Hardware Efficient Base-4 Systolic Architecture for Computing the Discrete Fourier Transform,” 2002 IEEE Workshop on Signal Processing Systems, San Diego CA, October 16-18.
– Details of base-4 DFT designs
– Mapping to FPGAs
• www.centar.net (papers and extended viewgraphs)