Loop Dissevering: A Technique for Temporally Partitioning Loops in
Dynamically Reconfigurable Computing Platforms
10th Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April 22, 2003
17th Annual Int’l Parallel & Distributed Processing Symposium (IPDPS 2003)
João M. P. Cardoso
University of Algarve, Faro / INESC-ID, Lisboa, Portugal
Motivation

  for (int i = 0; i < 8; i++)
    for (int j = 0; j < 8; j++)
      CosTrans[j + 8*i] = CosBlock[i + 8*j];

  for (int i = 0; i < 8; i++)
    for (int j = 0; j < 8; j++) {
      TempBlock[i + j*8] = 0;
      for (int k = 0; k < 8; k++)
        TempBlock[i + j*8] += InIm[i + k*8] * CosTrans[k + j*8];
    }

How to map sets of computational structures requiring more resources than are available?
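Packaged as functions, the two loop nests above can be exercised as a self-contained C sketch (the function names transpose8 and mult8 are illustrative additions; the array names follow the slide):

```c
#include <assert.h>

#define N 8

/* First loop nest from the slide: CosTrans = transpose of CosBlock. */
static void transpose8(const int CosBlock[N * N], int CosTrans[N * N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            CosTrans[j + N * i] = CosBlock[i + N * j];
}

/* Second loop nest: TempBlock = InIm * CosTrans, with the slide's
 * linearized i + j*8 indexing. */
static void mult8(const int InIm[N * N], const int CosTrans[N * N],
                  int TempBlock[N * N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            TempBlock[i + j * N] = 0;
            for (int k = 0; k < N; k++)
                TempBlock[i + j * N] += InIm[i + k * N] * CosTrans[k + j * N];
        }
}
```

Multiplying by the identity matrix leaves InIm unchanged, which makes the indexing easy to check.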
Motivation
How to map sets of computational structures requiring more resources than are available? Answer: Temporal Partitioning.
Motivation
Other motivations for partitioning computations in time:
- each design is simpler
- may lead to better performance! amortize some configuration time by overlapping execution stages
- use of smaller reconfigurable arrays to implement complex applications
For more info, see Cardoso and Weinhardt, DATE 2003.
Motivation
Computational structures for each loop or set of nested loops are implemented in a single partition. But what to do with a loop requiring more resources than are available?
Outline
Motivation
Configure-Execute Paradigm (execution stages)
Target Architecture
PACT XPP Architecture
XPP Configuration Flow
XPP-VC Compilation Flow
Temporal Partitioning of Loops
Experimental Results
Conclusions & Future Work
Configure-Execute Paradigm (Execution Stages)
Execution stages: Fetch (f), Configure (c), Compute (comp).
[Timeline figure comparing four cases: the whole program in a single configuration; two configurations without on-chip context planes and without partial reconfiguration; two configurations with partial reconfiguration; and two configurations with on-chip context planes, where fetching and configuring the next configuration (f2, c2) can overlap the current computation (comp1).]
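The benefit of overlapping execution stages, as in the timeline above, can be made concrete with a toy timing model (the function names and the latencies used in the usage note are invented for illustration; they are not XPP figures):

```c
#include <assert.h>

/* Two configurations executed strictly in sequence:
 * f1 c1 comp1 f2 c2 comp2. */
static long sequential_time(long f1, long c1, long comp1,
                            long f2, long c2, long comp2) {
    return f1 + c1 + comp1 + f2 + c2 + comp2;
}

/* With partial reconfiguration or on-chip context planes, f2 and c2 can be
 * hidden behind comp1; only the part of f2 + c2 that exceeds comp1 stays
 * on the critical path. */
static long overlapped_time(long f1, long c1, long comp1,
                            long f2, long c2, long comp2) {
    long hidden = f2 + c2;
    long exposed = hidden > comp1 ? hidden - comp1 : 0;
    return f1 + c1 + comp1 + exposed + comp2;
}
```

With stage times (10, 20, 100) per configuration, the sequential schedule takes 260 units while the overlapped one takes 230, since f2 + c2 = 30 is fully hidden behind comp1 = 100; the saving vanishes when comp1 is short, which is the amortization point made on the slide.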
PACT XPP Architecture (briefly)
X × Y coarse-grained array:
- Processing elements (PEs) compute typical ALU operations
- Two columns of SRAMs (Ms)
- I/O ports for data streaming
[Figure: PE array flanked by memory (M) columns.]
PACT XPP Architecture (briefly)
- Ready/acknowledge protocol for each programmable interconnection
- Flow of data (pre-foundry parameterized bit-widths)
- Flow of events (1-bit lines)
PACT XPP Architecture (briefly)
Dynamically reconfigurable:
- On-chip Configuration Cache (CC) and Configuration Manager (CM)
- Partial reconfiguration (only the used resources are configured)
[Figure: the CM fetches configurations from the CC and configures the PE array; CMPort0/CMPort1 carry requests from the array.]
XPP Configuration Flow
Uses three stages to execute each configuration: Fetch (f), Configure (c), Compute (comp).
The array may request the next configuration; the Configuration Manager accepts requests and proceeds without intervention from an external host, e.g.:

  c0; if (CMPort0) then c1; if (CMPort1) then c2;

[Figure: the CM fetches and configures c0, c1, c2 from the Configuration Cache; the array raises requests on CMPort0/CMPort1.]
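The CM program above can be mimicked in C by modeling each configuration as a function that optionally raises a request port, and the Configuration Manager as a dispatch loop (all names and the trace mechanism are illustrative):

```c
#include <assert.h>

enum { NO_REQ = -1, CMPORT0 = 0, CMPORT1 = 1 };

/* Record of which configurations executed, in order. */
static int trace[8];
static int trace_len = 0;

/* Each "configuration" computes (here: just logs itself) and returns the
 * CM port it raises, or NO_REQ if it requests nothing further. */
static int c2(void) { trace[trace_len++] = 2; return NO_REQ;  }
static int c1(void) { trace[trace_len++] = 1; return CMPORT1; }
static int c0(void) { trace[trace_len++] = 0; return CMPORT0; }

/* Configuration-manager loop: execute c0, then serve array requests
 * without any host intervention, as on the slide. */
static void cm_run(void) {
    int req = c0();
    while (req != NO_REQ) {
        if (req == CMPORT0) req = c1();  /* if (CMPort0) then c1 */
        else                req = c2();  /* if (CMPort1) then c2 */
    }
}
```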
XPP-VC Compilation Flow
- TempPart: temporal partitioning; generates reconfiguration statements, which are executed by the Configuration Manager
- MODGen ("Module Generation", with pipelining): maps a C subset to NML (PACT's proprietary structural language with reconfiguration primitives)
[Flow: C program → Preprocessing + Dependence Analysis → TempPart (Temporal Partitioning) with Control Code Generation (Reconfiguration) → MODGen → NML file → xmap → XPP binary code]
For more info, see Cardoso and Weinhardt, FPL 2002.
Temporal Partitioning
- One partition for each node in the Hierarchical Task Graph (HTG) at the TOP level
- Merge adjacent nodes if the combination of both can be mapped to the XPP device and if the merge does not degrade overall performance
- If an HTG node is too large, create a separate partition for each node of the inner HTG and call the algorithm recursively
[Figure: example HTG from start to end with nodes Loop 1 to Loop 4 and array variables x, coef, tmp, y.]
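The merge step of the algorithm above can be sketched in C as a greedy pass that packs adjacent nodes of one HTG level into partitions while they fit the device (node sizes and the capacity are illustrative; the performance check and the recursion into inner HTGs are omitted):

```c
#include <assert.h>

/* part_of[i] receives the partition index of node i; returns the number
 * of partitions created.  Adjacent nodes are merged into the current
 * partition while their combined size still fits the device capacity. */
static int partition_level(const int size[], int n, int capacity,
                           int part_of[]) {
    int parts = 0;
    int used = capacity + 1;  /* forces a new partition at i = 0 */
    for (int i = 0; i < n; i++) {
        if (used + size[i] > capacity) { parts++; used = 0; }
        used += size[i];
        part_of[i] = parts - 1;
    }
    return parts;
}
```

Four nodes of size 30 on a device of capacity 64 end up in two partitions of two nodes each; a node larger than the capacity gets a partition of its own, which is where the real algorithm would recurse.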
Temporal Partitioning of Loops
What to do when loops in the program cannot be mapped due to the lack of enough resources?
- Software/reconfigware approach: the control of the loop stays in software; the inner code sections migrate to reconfigware, each one mapped to a single configuration
- Loop Distribution: transforms a loop into two or more loops, each with the same iteration-space traversal as the original loop; the inner statements of the original loop are split among the loops
- Loop Dissevering: transforms a loop into a set of configurations; the cyclic behavior is implemented by the configuration flow
Temporal Partitioning of Loops
Example to be transformed by Loop Distribution or Loop Dissevering:

  …
  for (nx = 0; nx < X_DIM_BLK; nx++)
    for (ny = 0; ny < Y_DIM_BLK; ny++) {
      for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {        // Inner Loop 1
          tmp = 0;
          for (k = 0; k < N; k++)
            tmp += X[i+ny*N][k+nx*N] * CosBlock[j][k];
          TempBlock[i][j] = tmp;
        }
      // to be partitioned here
      for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {        // Inner Loop 2
          tmp = 0;
          for (k = 0; k < N; k++)
            tmp += TempBlock[k][j] * CosBlock[i][k];
          Y[i+ny*N][j+nx*N] = tmp;
        }
    }
  …
Loop Distribution

  …
  for (nx = 0; nx < X_DIM_BLK; nx++)
    for (ny = 0; ny < Y_DIM_BLK; ny++)
      for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {        // Inner Loop 1
          tmp = 0;
          for (k = 0; k < N; k++)
            tmp += X[i+ny*N][k+nx*N] * CosBlock[j][k];
          TempBlock[i+ny*N][j+nx*N] = tmp;
        }
  for (nx = 0; nx < X_DIM_BLK; nx++)
    for (ny = 0; ny < Y_DIM_BLK; ny++)
      for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {        // Inner Loop 2
          tmp = 0;
          for (k = 0; k < N; k++)
            tmp += TempBlock[k+ny*N][j+nx*N] * CosBlock[i][k];
          Y[i+ny*N][j+nx*N] = tmp;
        }
  …

[Figure: each distributed loop mapped to its own configuration: begin → Conf. 1 → Conf. 2 → end.]
Loop Distribution
- Cannot be applied to all loops: no breaking of cycles in the dependence graph of the original loop
- Uses auxiliary array variables: for each loop-independent flow dependence of a scalar variable (known as scalar expansion), and for each control dependence at the place where we want to partition the loop
- Requires the expansion of some arrays
- But it preserves the software-pipelining potential, and may improve parallelization, the cache hit/miss ratio, etc.
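A minimal illustration of the scalar expansion mentioned above: the scalar tmp carries a loop-independent flow dependence across the split point, so the distributed version stores it into an auxiliary array (the loop bodies and names here are invented for illustration, simpler than the DCT example):

```c
#include <assert.h>

#define M 8

/* Original loop: tmp flows from statement S1 to statement S2 within each
 * iteration, so the two statements cannot simply be put in separate loops. */
static void original_loop(const int a[M], int out[M]) {
    for (int i = 0; i < M; i++) {
        int tmp = a[i] * 2;  /* S1 */
        out[i]  = tmp + 1;   /* S2 */
    }
}

/* Distributed version: two loops with the same iteration space; tmp is
 * expanded into the auxiliary array tmp_arr so that S2 can run in a
 * separate loop (i.e. a separate configuration). */
static void distributed_loop(const int a[M], int out[M]) {
    int tmp_arr[M];
    for (int i = 0; i < M; i++)
        tmp_arr[i] = a[i] * 2;    /* S1 */
    for (int i = 0; i < M; i++)
        out[i] = tmp_arr[i] + 1;  /* S2 */
}
```

Both versions compute the same result; the cost of the transformation is exactly the extra storage for tmp_arr, which is the expansion overhead the slide refers to.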
Loop Dissevering

  …
          nx = 0; write nx;
  L1:     read nx;
          if (nx >= X_DIM_BLK) goto Finish;
          ny = 0; write ny;
  L3:     read ny; read nx;
          if (ny >= Y_DIM_BLK) goto L4;
          for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {    // Inner Loop 1
              …
              TempBlock[i][j] = tmp;
            }
          read ny; read nx;
          for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {    // Inner Loop 2
              …
              Y[i+ny*N][j+nx*N] = tmp;
            }
          ny++; write ny;
          goto L3;
  L4:     nx++; write nx;
          goto L1;
  Finish: …

[Figure: configuration flow begin → Conf. 1 → Conf. 2 → Conf. 3 → Conf. 4 → Conf. 5 → end; the cyclic behavior of the loop nest is implemented by the configuration flow.]
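The dissevered loop above can be mimicked in software by turning each piece into a "configuration" function and letting a dispatch loop play the Configuration Manager; the communicated scalars (nx, ny) are the only state passed between configurations (all names are illustrative, and the inner loops are reduced to a simple counter):

```c
#include <assert.h>

#define X_DIM_BLK 2
#define Y_DIM_BLK 3

/* Scalars communicated between configurations (the "write x / read x"
 * pairs on the slide), plus a counter standing in for the loop body. */
static int nx, ny, body_count;

/* Each configuration returns the index of the next configuration to load,
 * or -1 to finish; the loop's cyclic behavior lives entirely in this
 * configuration flow, never inside a single configuration. */
static int conf_init_x(void) { nx = 0; return 1; }                /* nx = 0 */
static int conf_test_x(void) { return nx < X_DIM_BLK ? 2 : -1; }  /* L1     */
static int conf_init_y(void) { ny = 0; return 3; }                /* ny = 0 */
static int conf_test_y(void) {                                    /* L3, L4 */
    if (ny < Y_DIM_BLK) return 4;
    nx++; return 1;
}
static int conf_body(void)   { body_count++; ny++; return 3; }    /* loops  */

static int (*confs[])(void) =
    { conf_init_x, conf_test_x, conf_init_y, conf_test_y, conf_body };

/* Dispatch loop standing in for the Configuration Manager; returns how
 * often the loop body executed. */
static int dissevered_run(void) {
    body_count = 0;
    int next = 0;
    while (next != -1)
        next = confs[next]();
    return body_count;
}
```

The body executes X_DIM_BLK × Y_DIM_BLK times, exactly as the original nested loop would, without any array or scalar expansion.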
Loop Dissevering
- Applicable to every loop
- Relies only on a configuration manager to execute complex loops
- May relieve the host microprocessor to execute other tasks
- No array or scalar expansion (only scalar communication)
But, besides its use to furnish feasible mappings, is it worth applying? Does it lead to efficient solutions (in terms of performance)? What are the improvements if the architecture can switch between configurations in a few clock cycles?
Experimental Results
Compared architectures (both with runtime support for partial reconfiguration):
- ARCH-A: word-grained partial reconfiguration
- ARCH-B: context planes, with switching between contexts in a few clock cycles
[Timeline figure: execution stages for ARCH-A and ARCH-B.]
Experimental Results
Benchmarks:

  Benchmark  Description                             #LoC  #loops  #loops after loop dist.
  DCT88      Discrete Cosine Transform on an image     80       8                       10
  BPIC       Binary pattern image coding              151       8                       10
  Life       Conway's game of life algorithm          118      10                        -
Experimental Results (resource savings)
Using loop dissevering, only 44% (DCT), 66% (BPIC), and 85% (Life) of the resources are used, compared to the implementations without loop dissevering.

  Benchmark  w/o loop dissevering    w/ loop dissevering    Ratio (#PEs)
             #configs  #PEs          #configs  #PEs
  DCT               1   123                 5  54/132              0.44
  BPIC              1   148                 5  97/189              0.66
  Life              4   144/304             6  123/416             0.85
Experimental Results (speedups)
- Architecture A (ARCH-A): word-grained partial reconfiguration
- Architecture B (ARCH-B): context planes
[Chart: DCT speedups (0.0 to 3.5) on a 6x6 array, for loop dissevering (w/ ldis), loop dissevering plus unrolling (w/ ldis+unr), loop distribution (w/ ld), and loop distribution plus unrolling (w/ ld+unr).]
Experimental Results (speedups)
Life, applying loop dissevering: the benefits of ARCH-B become negligible when the partitions "in the loop" compute for long times.
[Chart: ARCH-A vs. ARCH-B speedups (about 1.2 to 2.4) for 8x8_2 through 32x32_8.]
Conclusions
- Temporal Partitioning + Loop Dissevering guarantees the mapping of theoretically unlimited computational structures
- Loop Dissevering and Loop Distribution may lead to performance enhancements and savings of resources
- Loop Dissevering is applicable to every loop, but performance-efficient implementations may require fast reconfiguration; the resultant performance may decrease:
  - when innermost loops are partitioned (no more potential for loop pipelining)
  - when each active partition computes for short times (which does not amortize the reconfiguration time)
Future Work
- Further study of the impact of Loop Dissevering and Loop Distribution: to understand the impact of the number of context planes, configuration cache size, etc., and to evaluate loop partitioning when mapping to FPGAs
- Automatic implementation of Loop Distribution
- Methods to decide between Loop Dissevering and Loop Distribution
Acknowledgments (in the paper)
Part of this work was done while the author was with PACT XPP Technologies, Inc., Munich, Germany.
We gratefully acknowledge the support of all the members of PACT XPP Technologies, Inc., especially the help of Daniel Bretz, Armin Strobl, and Frank May regarding the XDS tools. A special thanks to Markus Weinhardt for the fruitful discussions about loop dissevering and the XPP-VC compiler.