University of Erlangen-Nuremberg, Frank Hannig
Embedded Systems Week 2010, Scottsdale, USA, October 2010
Retargetable Mapping of Loop Programs on Coarse-grained Reconfigurable Arrays
Frank Hannig
Overview
• Motivation
• Architecture
• Mapping methodology
• Current and future work
Motivation
System-on-a-Chip (SoC)
[Figure: SoC block diagram with embedded processor, memory, I/O interface, and accelerators (one of them a CGRA accelerator) attached to a communication network]
Parallel Architectures’ Trade-offs
[Figure: parallel architectures plotted by performance versus flexibility]
• The reconfiguration data of coarse-grained architectures is one to two orders of magnitude smaller than that of fine-grained architectures
Architecture
Tightly-coupled processor arrays
• Coarse-grained and weakly-programmable
• Highly parameterizable architecture template
• Reconfigurable interconnect
[Figure: processor array with global controller (Global Ctrl.)]
Multicast Reconfiguration Scheme
Design flow, based on the PARO HLS tool
Front End
• Algorithm (PAULA)
• High-Level Transformations: Localization, Loop Perfectization, Output Normal Form, Loop Unrolling, Partitioning, Expression Splitting, Affine Transformations, ...
• Space-Time Mapping: Allocation, Scheduling, Resource Binding
• Architecture Model
Back End (WPPA)
• Code Generation: VLIW Code for each PE, Configuration of Interconnect, Code of Controller
• WPPA Configuration
• Simulation
Back End (FPGA)
• Hardware Synthesis: Processor Element, Controller, Processor Array, I/O Interface
• HDL Generation
• Hardware Description (VHDL)
• Test Bench Generation
• Simulation
Why not start from C code?
• Most existing high-level synthesis tools start from C/C++ code
• Limitation: The semantics of the input language (statement order, loop order) define the execution order ⇒ limited parallelism
Example:
int s = 0;
for (int i = 0; i <= 7; i++) { s += a[i]; }
• Tools offer only a few high-level transformations to parallelize code
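To see why the fixed statement order limits parallelism, the sketch below contrasts the sequential accumulation of the C example with a pairwise (tree) reduction, which is legal because integer addition is associative. It is an illustration in Python only, not the output of any HLS tool; the function names `sequential_sum` and `tree_sum` are hypothetical.

```python
def sequential_sum(a):
    """Mirror the C loop: each addition depends on the previous one."""
    s = 0
    steps = 0
    for value in a:
        s += value
        steps += 1  # one dependent addition per element
    return s, steps


def tree_sum(a):
    """Pairwise reduction: additions within one level are independent."""
    level = list(a)
    depth = 0
    while len(level) > 1:
        # All pairs of this level could be added in parallel.
        nxt = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an odd element is passed to the next level
        level = nxt
        depth += 1
    return (level[0] if level else 0), depth


a = [3, 1, 4, 1, 5, 9, 2, 6]
print(sequential_sum(a))  # sum and length of the dependent addition chain
print(tree_sum(a))        # same sum and number of parallel levels
```

For eight elements the sequential loop forms a chain of eight dependent additions, while the tree needs three levels. A tool bound to C semantics has to discover this reordering itself, whereas a reduction operator states it directly.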
Design entry: PAULA language
Key features:
• Functional programming language
• Full static single assignment (SSA) form, also for multidimensional arrays
• Powerful expressions for the specification of polyhedral and lattice iteration domains
• Convenient usage of reductions like ∑
• Architectural modeling capabilities
PAULA and intermediate representation
Extended reduced dependence graph (RDG):
[Figure: extended RDG of the example]
Example, 2-D Gauss window filter:
...
par (x >= 0 and x < 1280 and y >= 0 and y < 1024)
{ w[0,0]=1; w[0,1]=2; w[0,2]=1;
  w[1,0]=2; w[1,1]=4; w[1,2]=2;
  w[2,0]=1; w[2,1]=2; w[2,2]=1;
  h[x,y]=SUM[i>=0 and i<=2 and j>=0 and j<=2]
         (pic_in[x+i,y+j] * w[i,j]);
  pic_out[x,y]=h[x,y] >> 4; // divided by 16
}
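As a cross-check of the PAULA example, the following plain-Python sketch computes the same 3×3 weighted sum and the shift by 4. It is a sequential reference model, not PARO output. The function name `gauss_filter` and the convention that the input image carries a two-pixel margin (so that `pic_in[x+i][y+j]` stays in range) are assumptions made here.

```python
# Filter weights from the PAULA example; they sum to 16.
W = [[1, 2, 1],
     [2, 4, 2],
     [1, 2, 1]]


def gauss_filter(pic_in, width, height):
    """Return pic_out[x][y] for 0 <= x < width, 0 <= y < height.

    pic_in must provide at least (width + 2) x (height + 2) entries.
    """
    pic_out = [[0] * height for _ in range(width)]
    for x in range(width):          # iterations of the par domain
        for y in range(height):
            h = 0
            for i in range(3):      # the SUM reduction over i and j
                for j in range(3):
                    h += pic_in[x + i][y + j] * W[i][j]
            pic_out[x][y] = h >> 4  # divide by 16
    return pic_out


# A constant image stays constant because the weights sum to 16.
flat = [[7] * 6 for _ in range(6)]
print(gauss_filter(flat, 4, 4))
```

The nested `for` loops fix an execution order that the PAULA `par` construct deliberately leaves open: every `(x, y)` iteration is independent and could run on a different processing element.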
Localization
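• Example of one-dimensional localization
Before (every iteration reads a[0]):
for (i=0; i <= N; i++)
{ b[i] = a[0]; }
After (the value is passed from iteration i-1 to iteration i):
for (i=0; i <= N; i++)
{ if (i > 0) a[i] = a[i-1];
  b[i] = a[i];
}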
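The equivalence of the two loops in the one-dimensional example can be checked with a small Python sketch. It is an illustration only: the function names are hypothetical, and the localized version uses a separate propagation array `a_loc` (an assumption made here so that every element is written exactly once) instead of overwriting `a`.

```python
def broadcast_version(a, n):
    """Original loop: every iteration reads the same element a[0]."""
    return [a[0] for _ in range(n + 1)]


def localized_version(a, n):
    """Localized loop: a[0] is passed on from iteration i-1 to iteration i."""
    a_loc = [None] * (n + 1)
    b = [None] * (n + 1)
    for i in range(n + 1):
        # Only a nearest-neighbor dependence (i-1 -> i) remains.
        a_loc[i] = a[0] if i == 0 else a_loc[i - 1]
        b[i] = a_loc[i]
    return b


a = [42, 5, 6, 7]
print(broadcast_version(a, 3) == localized_version(a, 3))
```

The localized form replaces a global read by a short, regular dependence, which is what allows the data to travel over local links between neighboring processing elements.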
• Example of two-dimensional localization
[Figure: two-dimensional localization; legend entries x, y, z]
Space mapping / partitioning
• LSGP (local sequential, global parallel) partitioning
[Figure: dependence graph and its mapping onto the architecture]
Main advantage:
Place and route come for free, since the space mapping directly defines the placement.
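A minimal sketch of the index mapping behind LSGP (local sequential, global parallel) partitioning, for a one-dimensional iteration space: the space is cut into as many tiles as there are processing elements (PEs), each tile is assigned to one PE, and the iterations inside a tile run sequentially on that PE. The function name `lsgp_map` and the tile-size rule (ceiling division) are assumptions for illustration, not the PARO implementation.

```python
def lsgp_map(num_iterations, num_pes):
    """Map iteration i to (pe, local_step) under LSGP partitioning."""
    tile = -(-num_iterations // num_pes)  # ceiling division: iterations per PE
    mapping = {}
    for i in range(num_iterations):
        pe = i // tile          # which tile, and therefore which PE (placement)
        local_step = i % tile   # sequential position inside the tile
        mapping[i] = (pe, local_step)
    return mapping


m = lsgp_map(8, 4)
for i in sorted(m):
    print(i, m[i])
```

Because `pe` follows directly from the iteration index, no separate placement step is needed, which is the advantage stated on the slide.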
Space mapping / partitioning
• LPGS (local parallel, global sequential) partitioning
[Figure: dependence graph and its mapping onto the architecture]
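LPGS (local parallel, global sequential) partitioning chooses the tile size to match the array size: the iterations of one tile run in parallel on different processing elements (PEs), and the tiles are processed one after another. The sketch below shows this index mapping for a one-dimensional iteration space. The function name `lpgs_map` is hypothetical, and the code is an illustration rather than the PARO implementation.

```python
def lpgs_map(num_iterations, num_pes):
    """Map iteration i to (pe, tile_step) under LPGS partitioning."""
    mapping = {}
    for i in range(num_iterations):
        pe = i % num_pes          # position inside the tile selects the PE
        tile_step = i // num_pes  # tiles are executed one after another
        mapping[i] = (pe, tile_step)
    return mapping


m = lpgs_map(8, 4)
for i in sorted(m):
    print(i, m[i])
```

In this scheme neighboring iterations land on neighboring PEs, so intermediate data between consecutive tiles has to be buffered, typically outside the PEs.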
Space mapping / partitioning
• Hierarchical partitioning
[Figure: dependence graph and architecture with local memory; partitioning levels labeled LS and GS]
– Balancing of: communication, I/O, and different levels of (local) memory
– Adaptation of bandwidth, computational power, and memory constraints
Scheduling
• Features:
– Resource constraints (number of functional units)
– Module selection (different binding possibilities of operations)
– Functional and software pipelining
– Some parts can be executed concurrently, others have to be serialized
– Exact approach based on mixed integer linear programming (MIP)
• Goal, simultaneous optimization of:
– Local schedule (execution order within PEs) and
– Global schedule (execution between all PEs)
Example, Median Filter
• PAULA program of horizontal median filter
• Partitioned algorithm (4 stripes / processors)
[Figure: partitioned iteration space; recoverable labels: m=3, m=4]
Code Generation, Median Filter
• Substitution of median-function by explicit comparisons
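One way such a substitution can look is sketched below for a window of three values. The window size and the names `median3` and `horizontal_median` are assumptions, since the transcript does not preserve the generated code. Each conditional expression corresponds to one comparison with a select, which maps onto simple PE instructions more directly than a generic median call.

```python
def median3(a, b, c):
    """Median of three values using only pairwise comparisons."""
    lo = a if a < b else b          # min(a, b)
    hi = b if a < b else a          # max(a, b)
    mid = hi if hi < c else c       # min(max(a, b), c)
    return lo if lo > mid else mid  # max(min(a, b), mid)


def horizontal_median(row):
    """Apply median3 to every window of three neighbors in one image row."""
    return [median3(row[x], row[x + 1], row[x + 2])
            for x in range(len(row) - 2)]


print(horizontal_median([9, 1, 5, 3, 7]))
```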
Code Generation, Median Filter
• VLIW code fragment for one processor
• Zero loop overhead: Run control flow completely in parallel with data flow
Generated by Global Controller
Current and future work
• Currently:
– Static mapping, compilation for one fixed architecture allocation (no. of PEs, etc.)
– Static partitioning and scheduling
– If other resource constraints are given or different requirements are needed, the program has to be compiled again
• Idea: Symbolic loop parallelization and mapping
– Symbolic partitioning
– Symbolic scheduling
– Symbolic control generation
⇒
* Replace at run-time only the symbols in the configuration data according to the available resources
* Algorithms can be adapted quickly at run-time without recompilation
* Self-adaptation enables fast reaction to QoS parameters, system load, failures, etc.
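The idea of replacing only symbols at run-time can be illustrated with a toy configuration record in which partitioning parameters are kept as expressions in symbols and are evaluated once the available resources are known. This is a sketch under strong assumptions: the names `SYMBOLIC_CONFIG` and `instantiate`, the symbols `N` (problem size) and `P` (number of available PEs), and the representation of parameters as Python functions are all hypothetical and do not describe the actual configuration format.

```python
# Parameters are stored as functions of the symbols N (problem size)
# and P (number of available PEs) instead of fixed numbers.
SYMBOLIC_CONFIG = {
    "tile_size": lambda n, p: -(-n // p),  # ceil(N / P)
    "num_tiles": lambda n, p: p,
}


def instantiate(config, n, p):
    """Substitute concrete values for the symbols; nothing is recompiled."""
    return {name: expr(n, p) for name, expr in config.items()}


print(instantiate(SYMBOLIC_CONFIG, 1024, 4))
print(instantiate(SYMBOLIC_CONFIG, 1024, 8))
```

The same symbolic record serves both a 4-PE and an 8-PE allocation; only the substitution step runs when the resources change.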
Questions?
Retargetable Mapping of Loop Programs on Coarse-grained Reconfigurable Arrays
Frank Hannig
Hardware/Software Co-Design
Department of Computer Science
University of Erlangen-Nuremberg
Am Weichselgarten 3
Erlangen, Germany
Phone: +49 9131 85-25153
Fax: +49 9131 85-25149
Email: [email protected]
URL: http://www12.cs.fau.de
Acknowledgements
Hritam Dutta, Dmitrij Kissler, Alexey Kupriyanov, Vahid Lari, Holger Ruckdeschel, Jürgen Teich
This work was partially supported by the German Research Foundation (DFG) in projects under contracts TE 163/3-1, TE 163/3-2, and SFB TRR 89.