Hardware and Interface Synthesis of FPGA Blocks using Parallelizing Code Transformations
Sumit Gupta, Manev Luthra, Nikil Dutt, Alex Nicolau, Rajesh Gupta
Center for Embedded Computer Systems
University of California, Irvine and San Diego
http://www.cecs.uci.edu/~spark
Supported by Semiconductor Research Corporation
Copyright Sumit Gupta 2003
Overview of a HW/SW Co-design Flow
[Figure: HW/SW co-design flow. Blocks shown: Application Specs, HW/SW Partitioning, Hardware Behavioral Description, Software Behavioral Description, Software Compiler, High Level Synthesis, Interface Synthesis, Logic Synthesis and P&R, Processor Core, Memory, I/O. Labels on the flow: C, RTL VHDL, ASIC, FPGA.]
Outline
- Motivation and Background
- Our Approach to Parallelizing High-Level Synthesis
- Our Approach to Interface Synthesis using Memory Mapping
- Experimental Results
  - HLS: Multimedia and Image Processing Applications
  - Interface Synthesis: Case Study of MPEG-1 block
- Conclusions and Future Work
High Level Synthesis
Transform behavioral descriptions to RTL/gate level: from C to CDFG to Architecture.
x = a + b;
c = a < b;
if (c) then d = e - f;
else g = h + i;
j = d x g;
l = e + x;
[Figure: CDFG for the code above. x = a + b and c = a < b feed an If Node on c, with d = e - f on the T branch and g = h + i on the F branch, followed by j = d x g and l = e + x. The CDFG maps to an architecture with Control, ALU, Memory, and Data path.]
Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions
Problem #2: Poor/No controllability of the HLS results
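As a rough illustration of the C-to-CDFG step, the sketch below records data and control dependences for the six statements on this slide and assigns each one an earliest control step (ASAP). It is a minimal sketch, not SPARK's algorithm: the `OPS` table, the `asap_steps` helper, and the assumptions of single-cycle operations and unlimited resources are all hypothetical.

```python
# Hypothetical encoding of the slide's statements: target -> (operand names, guard).
# The guard is the condition variable an operation is control dependent on, or None.
OPS = {
    "x": (("a", "b"), None),
    "c": (("a", "b"), None),
    "d": (("e", "f"), "c"),   # executes on the T branch of the If Node
    "g": (("h", "i"), "c"),   # executes on the F branch of the If Node
    "j": (("d", "g"), None),
    "l": (("e", "x"), None),
}


def asap_steps(ops):
    """Earliest control step per operation, assuming every operation takes one
    cycle and resources are unlimited (assumptions made for this sketch)."""
    steps = {}

    def step_of(name):
        if name not in ops:          # primary input such as a, b, e, f, h, i
            return -1
        if name not in steps:
            operands, guard = ops[name]
            preds = list(operands) + ([guard] if guard else [])
            steps[name] = 1 + max(step_of(p) for p in preds)
        return steps[name]

    for name in ops:
        step_of(name)
    return steps


steps = asap_steps(OPS)
print(steps)
```

Under these assumptions the schedule needs three control steps: d and g wait for the condition c, and j waits for both d and g.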
High-level Synthesis
- Well-researched area: from early 1980's
- Renewed interest due to new system level design methodologies
- Large number of synthesis optimizations have been proposed
  - Either operation level: algebraic transformations on DSP codes
  - or logic level: Don't Care based control optimizations
  - In contrast, compiler transformations operate at both operation level (fine-grain) and source level (coarse-grain)
- Parallelizing Compiler Transformations
  - Different optimization objectives and cost models than HLS
- Our aim: Develop Synthesis and Parallelizing Compiler Transformations that are "useful" for HLS
  - Beyond scheduling results: in Circuit Area and Delay
  - For large designs with complex control flow (nested conditionals/loops)
Our Approach: Parallelizing HLS (PHLS)
[Figure: PHLS flow. Labels: C Input, Original CDFG, Source-Level Compiler Transformations, Optimized CDFG, Scheduling & Binding, Scheduling Compiler & Dynamic Transformations, VHDL Output.]
- Optimizing Compiler and Parallelizing Compiler transformations applied at Source-level (Pre-synthesis) and during Scheduling
  - Source-level code refinement using Pre-synthesis transformations
  - Code Restructuring by Speculative Code Motions
  - Operation replication to improve concurrency
  - Dynamic transformations: exploit new opportunities during scheduling
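To make the idea of a speculative code motion concrete, the sketch below shows a small function before and after both branch computations are hoisted above the condition. The function names and the arithmetic are hypothetical and chosen only for illustration. In hardware, the hoisted operations could be scheduled in parallel with the comparison, at the cost of computing one result that is then discarded.

```python
def original(a, b, e, f):
    """Branch bodies execute only after the condition is known."""
    if a < b:
        r = e - f
    else:
        r = e + f
    return r * 2


def speculated(a, b, e, f):
    """Both branch results are computed speculatively; the condition
    only selects which one is committed."""
    t_true = e - f      # hoisted out of the true branch
    t_false = e + f     # hoisted out of the false branch
    r = t_true if a < b else t_false
    return r * 2


# The transformation must preserve behavior for every input.
inputs = [(a, b, e, f) for a in range(3) for b in range(3)
          for e in (-1, 4) for f in (0, 5)]
equivalent = all(original(*v) == speculated(*v) for v in inputs)
print(equivalent)
```

The equivalence check is the point of the example: a speculative motion is only legal when the hoisted operations have no side effects that would be visible on the path where their result is unused.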
SPARK High Level Synthesis Framework
SPARK Parallelizing HLS Framework
- C input and Synthesizable RTL VHDL output
- Tool-box of Transformations and Heuristics
  - Each of these can be developed independently of the other
- Script based control over transformations & heuristics
- Hierarchical Intermediate Representation (HTGs)
  - Retains structural information about design (conditional blocks, loops)
  - Enables efficient and structured application of transformations
- Complete HLS tool: Does Binding, Control Synthesis and Backend VHDL generation
  - Interconnect Minimizing Resource Binding
- Enables Graphical Visualization of Design description and intermediate results
- 100,000+ lines of C++ code
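One way to picture a hierarchical representation that retains conditional blocks and loops is the small data structure below. It is an illustration under stated assumptions, not SPARK's C++ classes: the `BasicBlock`, `IfNode`, and `LoopNode` names and the `count_basic_blocks` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BasicBlock:
    ops: List[str] = field(default_factory=list)


@dataclass
class IfNode:
    cond: BasicBlock
    true_branch: List["Node"]
    false_branch: List["Node"]


@dataclass
class LoopNode:
    body: List["Node"]


Node = Union[BasicBlock, IfNode, LoopNode]


def count_basic_blocks(nodes: List[Node]) -> int:
    """Count non-empty basic blocks by walking the hierarchy."""
    total = 0
    for n in nodes:
        if isinstance(n, BasicBlock):
            total += 1 if n.ops else 0
        elif isinstance(n, IfNode):
            total += count_basic_blocks([n.cond])
            total += count_basic_blocks(n.true_branch)
            total += count_basic_blocks(n.false_branch)
        else:  # LoopNode
            total += count_basic_blocks(n.body)
    return total


# Hypothetical design: one conditional followed by a straight-line block.
design = [
    IfNode(
        cond=BasicBlock(["x = a + b", "c = a < b"]),
        true_branch=[BasicBlock(["d = e - f"])],
        false_branch=[BasicBlock(["g = h + i"])],
    ),
    BasicBlock(["j = d * g", "l = e + x"]),
]
print(count_basic_blocks(design))
```

Because each conditional and loop stays a single node, a transformation can treat a whole nested region as one unit instead of rediscovering its boundaries in a flat graph.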
Interface Synthesis
- "Glue" logic to make the hardware and software talk to each other
- Data must be transferred or shared between the hardware and software
- Our approach to Interface Synthesis: Map common data to a shared memory
  - Our focus is on optimizing the number and size of memory banks used for mapping
- Target architecture: Platform with FPGA and Processor core on same FPGA fabric
  - Take advantage of the Memory Blocks available on such FPGA platforms for implementing the shared memory
The Interface Hardware
[Figure: System architecture illustrating the interface HW. Labels: Processor Core, CPU Bus, Main Memory, Peripheral (two instances), Custom Logic, and the Interface Hardware made up of Interface Logic and Shared Memory.]
Data exchanged between HW and SW resides in the shared memory.
Target HW Interface Architecture
[Figure: target interface architecture. Labels: FPGA Fabric, Processor Core, CPU Bus, Bus Interface, Memory Controller, Control Logic, Mapped Local Memory, Appl. Kernel (two instances), Multiple Ports, Single Port.]
An Example of Memory Mapping
[Figure, first panel: variables Var 1, Var 2, ..., Var N connected to functional units FU 1 ... FU M through an N:2M Mux. Caption: Inefficient use of the FPGA fabric by implementing independent registers.]
[Figure, second panel: the variables Var 1 ... Var N mapped into Local Memory banks Mem 1, Mem 2, ..., Mem K, connected to FU 1 ... FU M through a K:2M Mux. Caption: Efficient use of FPGA resources by utilizing its abundant memory blocks.]
Mapping Designs with Control Flow
- Mapping variables to memory banks is a well-known technique
- What's new here: Use of
  - Scheduling info from High Level Synthesis tool
  - Mutual exclusivity info in designs with control flow
- Data access patterns resolved only during run time
  - So how do we perform memory mapping at compile time?
[Figure: control flow graph with an If Node (T and F branches) and basic blocks BB 1, BB 2, and BB 3, annotated with the variables a, b, c accessed in cycles C0 and C1. Note on the figure: Condition resolved at run time.]
Access Pattern: {(a, b), (a, c)} or {(a, b, c), (a, b)}
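Although the taken branch is only known at run time, the set of possible access patterns can be enumerated at compile time from the schedule. The sketch below does this for a hypothetical two-cycle schedule with one condition. The `ACCESSES` table and the `access_patterns` helper are assumptions made for illustration and do not reproduce the figure on this slide.

```python
from collections import defaultdict

# Hypothetical scheduled accesses: (cycle, guard, variable).
# guard is True/False for accesses inside a branch, None for unconditional ones.
ACCESSES = [
    (0, True, "a"), (0, True, "b"),
    (0, False, "a"), (0, False, "c"),
    (1, None, "a"), (1, None, "b"),
]


def access_patterns(accesses):
    """For each outcome of the condition, list the set of variables
    accessed in each cycle."""
    cycles = sorted({cycle for cycle, _, _ in accesses})
    patterns = {}
    for outcome in (True, False):
        per_cycle = defaultdict(set)
        for cycle, guard, var in accesses:
            if guard is None or guard == outcome:
                per_cycle[cycle].add(var)
        patterns[outcome] = [per_cycle[c] for c in cycles]
    return patterns


patterns = access_patterns(ACCESSES)
print(patterns)
```

A mapping that is valid for every enumerated pattern is valid whichever way the condition resolves.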
Our Approach to Memory Mapping
- We use
  - Scheduling information from SPARK Synthesis tool
    - Cycles in which variables are accessed (read/write)
  - Mutual exclusivity information from the control flow graph
    - The conditions under which variables are accessed
- Capture this information using Multi-color Conflict Graphs
- Then create a Variable to Memory Bank Mapping that minimizes the number of memory instances
- Details available in ICCD 2003 paper
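The sketch below shows how scheduling and mutual exclusivity information can reduce the number of memory banks. It uses an ordinary conflict graph with greedy coloring as a simpler stand-in, not the multi-color conflict graph formulation, which is described in the ICCD 2003 paper and is not reproduced here. It assumes single-port banks, so two variables conflict when they may be accessed in the same cycle on the same control path. The `ACCESSES` table and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical scheduled accesses: (cycle, guard, variable).
# guard is True/False for accesses inside a branch, None for unconditional ones.
ACCESSES = [
    (0, True, "a"), (0, True, "b"),
    (0, False, "c"), (0, False, "d"),
    (1, None, "a"), (1, None, "c"),
]


def mutually_exclusive(g1, g2):
    """Two accesses can never happen together if they sit on opposite branches."""
    return g1 is not None and g2 is not None and g1 != g2


def conflict_graph(accesses):
    """Edge between two variables that may be accessed in the same cycle."""
    graph = {var: set() for _, _, var in accesses}
    for (c1, g1, v1), (c2, g2, v2) in combinations(accesses, 2):
        if v1 != v2 and c1 == c2 and not mutually_exclusive(g1, g2):
            graph[v1].add(v2)
            graph[v2].add(v1)
    return graph


def map_to_banks(graph):
    """Greedy coloring: each color is one single-port memory bank."""
    bank_of = {}
    for var in sorted(graph, key=lambda v: (-len(graph[v]), v)):
        used = {bank_of[n] for n in graph[var] if n in bank_of}
        bank_of[var] = next(b for b in range(len(graph)) if b not in used)
    return bank_of


graph = conflict_graph(ACCESSES)
bank_of = map_to_banks(graph)
print(bank_of)
```

In this example the four variables fit in two banks. If mutual exclusivity were ignored, a, b, c, and d would all conflict pairwise in cycle 0 and four banks would be needed. Greedy coloring is a heuristic and does not guarantee the minimum number of banks in general.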
Experiments for SPARK HLS Tool
- Results presented here for
  - Pre-synthesis transformations
  - Speculative Code Motions
  - Dynamic CSE
- We used SPARK to synthesize designs derived from several industrial designs
  - MPEG-1, MPEG-2, GIMP Image Processing software
  - Case Study of Intel Instruction Length Decoder
- Scheduling Results
  - Number of States in FSM
  - Cycles on Longest Path through Design
- VHDL: Logic Synthesis
  - Critical Path Length (ns)
  - Unit Area
Target Applications
Design             # of Ifs   # of Loops   # Non-Empty Basic Blocks   # of Operations
MPEG-1 pred1       4          2            17                         123
MPEG-1 pred2       11         6            45                         287
MPEG-2 dp_frame    18         4            61                         260
GIMP tiler         11         2            35                         150
Scheduling & Logic Synthesis Results
[Figure: two bar charts, MPEG-1 Pred1 Function and MPEG-1 Pred2 Function. Each shows Longest Path (l cycles), Critical Path (c ns), Total Delay (c*l), and Unit Area on a vertical axis from 0 to 1.2, for four configurations: Non-speculative CMs: Within BBs & Across Hier Blocks; + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Percentage annotations on the charts: 42%, 10%, 36%, 36%, 8%, 39%.]
- Overall: 63-66 % improvement in Delay
- Almost constant Area
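The Total Delay metric in the charts is the product of the critical path length c (in ns) and the number of cycles l on the longest path. The sketch below works through that arithmetic. The numbers are hypothetical, chosen for illustration, and are not read from the charts; the helper names are also assumptions.

```python
def total_delay_ns(critical_path_ns, longest_path_cycles):
    """Total delay = clock period (critical path, ns) x cycles on the longest path."""
    return critical_path_ns * longest_path_cycles


def improvement_pct(baseline, optimized):
    """Percentage reduction of optimized relative to baseline."""
    return 100.0 * (baseline - optimized) / baseline


# Hypothetical numbers, for illustration only.
base = total_delay_ns(critical_path_ns=10.0, longest_path_cycles=100)
opt = total_delay_ns(critical_path_ns=12.0, longest_path_cycles=50)
print(base, opt, improvement_pct(base, opt))
```

With these made-up values, the critical path grows by 20 % while the cycle count halves, so total delay still drops by 40 %. This is why the product c*l, rather than either factor alone, is the figure of merit.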
Scheduling & Logic Synthesis Results
[Figure: two bar charts, MPEG-2 DpFrame Function and GIMP Tiler Function. Each shows Longest Path (l cycles), Critical Path (c ns), Total Delay (c*l), and Unit Area on a vertical axis from 0 to 1.2, for four configurations: Non-speculative CMs: Within BBs & Across Hier Blocks; + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Percentage annotations on the charts: 14%, 20%, 1%, 33%, 41%, 52%.]
- Overall: 48-76 % improvement in Delay
- Almost constant Area
Interface Synthesis: Case Study
- Target Platform: Altera's Nios development board @ 33.33 MHz
  - Soft core of the Nios embedded processor (no caches)
  - Library of peripheral IPs like timer, UART, SRAM controller
  - Altera FPGA with 8320 Logic Cells
- Tools
  - The SPARK parallelizing high level synthesis framework for high level synthesis
  - Leonardo Spectrum and Quartus II for logic synthesis and P&R respectively
  - The Nios specific GNU toolkit for compiling C code
Example: MPEG-1 Prediction Block
Design             # of Ifs   # of Loops   # Non-Empty Basic Blocks   # of Operations
MPEG-1 pred1       4          2            17                         123
MPEG-1 pred2       11         6            45                         287
- MPEG-1 pred1 used 1 out of 3 kernels mapped to HW
- MPEG-1 pred2 used the remaining 2 kernels mapped to HW
- HW/SW interface consisted of 181 words of data
Conclusions
- Parallelizing code transformations enable a new range of HLS transformations
  - Provide the needed improvement in quality of HLS results
  - Possible to be competitive against manually designed circuits
  - Can enable productivity improvements in microelectronic design
- Built a C-to-VHDL synthesis system with a range of code transformations
  - Platform for applying Coarse and Fine-grain Optimizations
  - Tool-box approach where transformations and heuristics can be developed
  - Script-based synthesis enables designer to control HLS process
  - Performance improvements of 60-70 % across a number of designs
- Presented an interface synthesis approach targeted at FPGA based platforms
  - Based on a memory mapping algorithm that utilizes scheduling information for synthesizing designs with control flow
SPARK Release
- Available for download: http://www.cecs.uci.edu/~spark
- User Manual
  - Running the tool
  - Customizing the synthesis scripts
- Tutorial
  - Synthesis of Portion of Motion Compensation algorithm in MPEG-1 player (along with logic synthesis scripts)
Thank You