ISSS 2001, Montréal 1
ISSS’01
S. Derrien, S. Rajopadhye, S. Sur-Kolay*
IRISA, France; *ISI Calcutta
Combined Instruction and Loop Level Parallelism for Regular Array Synthesis on FPGAs
Outline
Context and motivation
Space-time transformations
Transformation flow
Experimental validation
Conclusion
High-performance IP cores
High-level specifications
  Matlab, C, C++ or a specific language (Alpha)
  Targeting nested loops
  Core must be formally correct
Hard/soft co-generation
  Hardware RTL module (VHDL)
  Simple driver API (C)
Regular processor arrays
  High data throughput, specialized datapath
  Well suited for VLSI/FPGA
Targeting FPGAs
Poor clock speed
  Typical clock speed is 1/10 of ASIC speed
  Very design dependent
  Good at low-precision arithmetic (8 bits)
  Really bad for complex operations (floats)
But high performance
  Optimized designs can compete with ASICs
  Performance gain due to parallelism
  Pipelining comes for free (lots of DFFs)
Processor Array Synthesis
For i := 1 to 3
  For j := 1 to 3
    For k := 1 to 3
      C[i,j] := C[i,j] + A[i,k] * B[k,j];
    End for;
  End for;
End for;
Iteration domain extracted from loop bounds
Data dependence vector between iterations
Iteration domain is projected on the processor grid
Iterations are scheduled on their associated PE
Matrix multiplication example
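As a minimal sketch of the steps listed on this slide, the Python below enumerates the iteration domain of the 3x3x3 loop nest, checks a schedule against the dependence vectors, and projects the iterations onto a PE grid. The schedule t = i + j + k, the projection along k, and the uniform dependence vectors are assumptions chosen for illustration; the slides do not state which ones were used, and helper names such as `pe_of` are hypothetical.

```python
from itertools import product

N = 3  # loop bounds 1..3, as in the example loop nest

# Iteration domain extracted from the loop bounds: all (i, j, k) points.
domain = list(product(range(1, N + 1), repeat=3))

# Assumed uniform dependence vectors between iterations:
# C accumulates along k, A[i,k] is reused along j, B[k,j] along i.
dependences = {"C": (0, 0, 1), "A": (0, 1, 0), "B": (1, 0, 0)}

SCHEDULE = (1, 1, 1)  # assumed linear schedule: t = i + j + k


def time_of(point):
    """Cycle at which an iteration (or a dependence vector) is scheduled."""
    return sum(s * x for s, x in zip(SCHEDULE, point))


def pe_of(point):
    """Assumed projection along k: iteration (i, j, k) runs on PE (i, j)."""
    i, j, _k = point
    return (i, j)


# Every dependence must be crossed forward in time by at least one cycle.
causal = all(time_of(d) >= 1 for d in dependences.values())

# No PE may execute two iterations in the same cycle.
slots = {(pe_of(p), time_of(p)) for p in domain}
conflict_free = len(slots) == len(domain)

pes = sorted({pe_of(p) for p in domain})
print(len(domain), "iterations on", len(pes), "PEs;",
      "causal:", causal, "conflict-free:", conflict_free)
```

With these assumptions the 27 iterations map onto a 3x3 grid of PEs, each PE executing its three k-iterations in consecutive cycles.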
PE Architecture
[Figure: PE architecture showing the datapath blocks]
Temporal registers act as local memory
Combinational datapath connected to registers
Unidirectional flow and pipelined connections
N classes of registers (N = loop dimension)
One critical path for each register class
Operating frequency set by worst critical path
Spatio-temporal registers must be disambiguated
Spatial registers serve as interconnect between PEs
Conclusion
Simplistic schedule inside a PE (no ILP)
Complex loop bodies induce poor performance
  Floating-point matrix multiplication operating at 12 MHz
  2D SOR on 16 bits operating at 40 MHz
The PE architecture is not suited to FPGAs!
Proposed solution: allow pipelined datapaths by altering the PE architecture through simple space-time transformations.
Retiming
[Figure: two register placements over the same LUT network, one with Tc = 1 logic level and one with Tc = 2 logic levels]
Move registers to minimize clock period
Handled by most FPGA RTL synthesis tools
Efficient only if there are enough registers to move
We just need to add registers in the PE!
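To illustrate why retiming depends on the number of available registers, here is a small sketch on a deliberately simplified model: a feed-forward chain of unit-delay LUTs bounded by registers at both ends. The model and the function names are assumptions for illustration only; real retiming tools work on arbitrary netlists.

```python
from math import ceil


def clock_period(n_luts, reg_after):
    """Clock period in logic levels for a chain of n_luts LUTs.

    reg_after lists the LUT indices (1-based) that are followed by a
    register; the chain input and output are assumed to be registered.
    """
    cuts = [0] + sorted(reg_after) + [n_luts]
    return max(b - a for a, b in zip(cuts, cuts[1:]))


def best_period(n_luts, n_regs):
    """Best period reachable by moving n_regs registers along the chain."""
    return ceil(n_luts / (n_regs + 1))


# One register placed badly, then moved to the middle of a 4-LUT chain.
print(clock_period(4, [3]), "->", clock_period(4, [2]))
# With three registers, every LUT can get its own pipeline stage.
print(best_period(4, 1), best_period(4, 3))
```

In this model a single register can at best halve the period of a 4-LUT chain, whereas three registers bring it down to one logic level, which is the reason for adding registers to the PE before retiming.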
Serialization (1/2)
Regroup PEs into clusters
Iterations in a cluster executed sequentially
Throughput is reduced by a factor equal to the cluster size
Local memory is duplicated
[Figure: original PE array before clustering, and the array after clustering]
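The clustering step can be sketched as follows. The sketch assumes block clustering (adjacent PEs grouped together), a 4x4 array and factors of 2 along each axis; none of these values come from the slides, and `cluster_pes` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import product
from math import prod


def cluster_pes(shape, factors):
    """Group the PEs of an array of the given shape into clusters.

    factors[d] is the serialization factor along axis d; the PEs that
    share a cluster are executed sequentially on one physical PE.
    """
    clusters = defaultdict(list)
    for pe in product(*(range(n) for n in shape)):
        key = tuple(c // f for c, f in zip(pe, factors))
        clusters[key].append(pe)
    return dict(clusters)


factors = (2, 2)                      # assumed serialization factors
clusters = cluster_pes((4, 4), factors)
slowdown = prod(factors)              # iterations serialized per physical PE

print(len(clusters), "physical PEs, throughput divided by", slowdown)
print("cluster (0, 0) executes", clusters[(0, 0)])
```

Each physical PE now holds the local memory of every original PE in its cluster, which is why local memory is duplicated by the cluster size.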
Serialization (2/2)
[Figure: PE datapaths illustrating serialization]
Decomposed along each spatial dimension
Serialization impacts the PE according to simple transformation rules
Loop-level parallelism is traded for instruction-level parallelism
Temporal registers are duplicated by the serialization factor of axis i
Feedback loops are created for all spatial paths along the i-th axis
Skewing
[Figure: PE datapaths illustrating skewing]
Skewing by a factor of 2 along the vertical PE axis
Affects latency, but not throughput.
Adds temporal registers along spatial axis
Skewing can be used before and after serialization
Cannot reduce original temporal critical path
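A rough timing model may help to see why skewing changes latency but not throughput. The sketch assumes a single column of four PEs that all start at cycle 0 before skewing and that fire once per cycle; both assumptions, and the helper names, are illustrative rather than taken from the slides.

```python
def skew_start_times(n_rows, factor):
    """First active cycle of each PE row after skewing by `factor`.

    Skewing inserts `factor` registers on every link along the axis,
    so row y starts factor * y cycles later than row 0.
    """
    return [factor * row for row in range(n_rows)]


def fire_cycles(start, n_iterations):
    """Cycles at which one PE executes its successive iterations."""
    return [start + n for n in range(n_iterations)]


before = skew_start_times(4, 0)
after = skew_start_times(4, 2)      # skewing by factor 2, as on the slide

added_latency = after[-1] - before[-1]
last_row = fire_cycles(after[-1], 3)
interval = last_row[1] - last_row[0]  # cycles between iterations on one PE

print("start times:", before, "->", after)
print("added latency:", added_latency, "cycles; interval per PE:", interval)
```

The last row starts later, so results come out later, but every PE still executes one iteration per cycle.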
Problem formulation
Find the optimal set of transformation parameters.
Minimize the number of registers
Preserve loop-level parallelism
[Figure: PE datapath with its critical paths]
Tc = 86 ns requires dj = 6 stages to obtain Tc = 15 ns
Tc = 70 ns requires di = 5 stages to obtain Tc = 15 ns
Tc = 60 ns requires dt = 4 stages to obtain Tc = 15 ns
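The stage counts quoted above are consistent with dividing each critical path by the 15 ns target and rounding up. The short check below assumes that retiming can split each path into stages of equal delay, which is an idealization; the dictionary keys simply reuse the names dj, di and dt from the slide.

```python
from math import ceil

TARGET_NS = 15  # target clock period from the slide

# Critical path of each register class, in ns, as quoted on the slide.
critical_paths_ns = {"dj": 86, "di": 70, "dt": 60}

# Minimum number of pipeline stages, assuming perfectly balanced stages.
stages = {name: ceil(tc / TARGET_NS) for name, tc in critical_paths_ns.items()}

print(stages)  # {'dj': 6, 'di': 5, 'dt': 4}
```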
Proposed heuristic
1. Assume the serialization factor of each axis i is given (partitioning step)
2. Sort the PE space axes in ascending order of Tc
3. For each PE axis i do
  i. Pre-serialization skewing along axis i
  ii. Serialization along axis i
4. Determine all the skewing parameters
5. For each PE axis i do
  i. Post-serialization skewing along axis i
Transformation example
1. Pre-skew along axis y by a factor of 1.
2. Serialization along axis y by a factor of 2.
3. Pre-skew along axis x by a factor of 2.
4. Serialization along axis x by a factor of 2.
5. Post-skew along axis y by a factor of 1.
6. Apply retiming.
[Figure: PE datapath after successive transformation steps]
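The ordering of the six steps in this example can be reproduced with a small planning sketch. It only sequences the transformations: the skewing and serialization factors are taken as inputs, since the slides do not say how they are computed, and the Tc values used to order the axes are hypothetical numbers chosen so that y comes before x. `plan_transformations` is an illustrative name, not a function from the authors' tool.

```python
def plan_transformations(axes_tc, pre_skew, serialization, post_skew):
    """Return the ordered list of transformations to apply.

    axes_tc maps each PE axis to its critical path; axes are processed
    in ascending order of Tc. Factors of 0 (skew) or 1 (serialization)
    mean the step is skipped.
    """
    order = sorted(axes_tc, key=axes_tc.get)
    steps = []
    for axis in order:
        if pre_skew.get(axis, 0):
            steps.append(("pre-skew", axis, pre_skew[axis]))
        if serialization.get(axis, 1) > 1:
            steps.append(("serialize", axis, serialization[axis]))
    for axis in order:
        if post_skew.get(axis, 0):
            steps.append(("post-skew", axis, post_skew[axis]))
    steps.append(("retime", None, None))
    return steps


steps = plan_transformations(
    axes_tc={"x": 2.0, "y": 1.0},        # hypothetical Tc values
    pre_skew={"y": 1, "x": 2},
    serialization={"y": 2, "x": 2},
    post_skew={"y": 1},
)
for number, step in enumerate(steps, start=1):
    print(number, step)
```

With the factors of the example, the plan lists the pre-skew and serialization of y, then those of x, then the post-skew of y, and finally retiming.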
Experimental validation
Chosen benchmarks
  Matrix multiplication (8 bits, 16 bits and floats)
  Adaptive filter (DLMS) (8 bits, 16 bits and floats)
  String matching (DNA, protein)
Performance metrics
  Ape: PE area usage
  fpe: PE operating frequency
  Raw performance = Npe x fpe
  Npe approximated by 1/Ape
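The metric can be written down directly. In the sketch below the 25% area overhead and the 5x frequency ratio are made-up inputs used only to show how the two effects combine; they are not measurements from the talk.

```python
def raw_performance(area_per_pe, freq_mhz):
    """Npe * fpe, with Npe approximated by 1 / Ape."""
    return freq_mhz / area_per_pe


def relative_gain(area_overhead, freq_ratio):
    """Raw performance of the transformed PE relative to the original.

    area_overhead is the fractional area increase (0.25 means +25%),
    freq_ratio is the new frequency divided by the old one.
    """
    return freq_ratio / (1.0 + area_overhead)


# Hypothetical PE: 25% larger after transformation, clocked 5x faster.
print(relative_gain(0.25, 5.0))
```

Because Npe shrinks as Ape grows, a frequency gain translates into a raw performance gain only to the extent that it outweighs the area overhead.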
Area overhead
Area overhead decreases as combinational datapath area cost grows
Frequency improvement
Speed improvement of up to one order of magnitude (for floats)
Raw performance
Speed improvement of up to one order of magnitude (for floats)
Conclusion
Extract very fine-grain ILP from the datapath as a whole
Simple space-time transformations, but they yield impressive results
Preserve circuit correctness and control logic regularity and simplicity
Performance benefits are limited by the lack of place-and-route-aware retiming tools