Exploring new implementation tools for GIDEL PROCSTAR platform
(PART I - PROC_HILs)
Elad Hadar, Omer Norkin
Supervisor: Mike Sumszyk
Winter 2010/11, single semester project. Date: 30/5/11
Technion – Israel Institute of Technology, Faculty of Electrical Engineering, High Speed Digital System Lab (HS DSL)
What is PROC_HILs?
• PROC_HILs is a Hardware-In-the-Loop acceleration tool for running Simulink designs on FPGAs.
• It automatically translates Simulink designs into FPGA code (compatible with the PROC board installed on the target PC) and runs it under Simulink.
Why do we need PROC_HILs?
• Dramatically improves simulation speed, with a dedicated accelerator for Simulink designs.
• Enables building a design visually and downloading it directly, with minimal effort, into the PROC board.
• Enables concurrent engineering at an early stage.
• Cuts development cycle time (and costs).
• Improves design reliability.
Project motivation
Implementing a video analysis design on the GIDEL PROCSTAR III platform, enabling usage and exploration of a new development platform (PART I – PROC_HILs).
Proper usage of development tools throughout all stages of implementation, from algorithm to hardware.
How it works
• PROC_HILs enables the user to download a Simulink design into the PROC board and run it.
• The design runs on the on-board FPGAs, communicating with Simulink in real time.
• The generation process is fully automatic.
How it works – Main stages
1. Simulink design.
2. HDL code is generated, synthesized and compiled into an .rbf file (FPGA binary file) compatible with the specific PROC board.
3. A new Simulink design file is generated. It contains a single HIL block with all the inputs and outputs that were present in the original design, connected to all the sources and sinks.
4. The design runs on the hardware fully synchronized with Simulink, receiving the signals from the simulation sources and outputting the results into the sinks.
Hardware and development environment
• Main development stages were carried out on a GiDEL PROCe III (Altera Stratix III) board (1 FPGA).
• GiDEL PROC_HILs (Version 2.1.2)
• Altera DSPBuilder blockset for Simulink (Version 10.1)
• ProcWizard (Version 8.8)
• Quartus II (Version 10.1)
• Matlab (Version 2009a)
• Additional development was carried out on a GiDEL PROCStar III (Altera Stratix III) board (4 FPGAs).
Simulink Design - NLD
• NLD is a hardware implementation of the Non-Linear Diffusion algorithm for video images.
• It enables local smoothing of the picture while preserving edges.
• The Simulink design in this project is based on a previous project (performed in the Technion HS DSL by Tsion Bublil & Yony Dekell).
• The original project was implemented on a PROCStar II (Altera Stratix II) board (4 FPGAs), using the SynplifyDSP blockset library for Simulink.
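To illustrate what "local smoothing while preserving edges" means, here is a minimal one-dimensional sketch in Python. It uses a generic Perona-Malik-style scheme with an exponential edge-stopping function as a stand-in; it is not the NLD design from the slides, and the function name and the parameters `dt` and `k` are assumptions chosen for illustration.

```python
import math

def nonlinear_diffusion_1d(signal, iterations=50, dt=0.2, k=5.0):
    """Smooth a 1-D signal; differences much larger than k barely diffuse."""
    u = [float(v) for v in signal]
    for _ in range(iterations):
        nxt = u[:]
        for i in range(len(u)):
            # Differences to the neighbours (zero flux at the two ends).
            right = u[i + 1] - u[i] if i + 1 < len(u) else 0.0
            left = u[i - 1] - u[i] if i > 0 else 0.0
            # Edge-stopping weights: close to 1 for small steps, close to 0 at edges.
            g_right = math.exp(-(right / k) ** 2)
            g_left = math.exp(-(left / k) ** 2)
            nxt[i] = u[i] + dt * (g_right * right + g_left * left)
        u = nxt
    return u

# A noisy step: small ripples on both sides of one large edge.
noisy_step = [0, 1, 0, 1, 0, 1, 0, 1, 20, 21, 20, 21, 20, 21, 20, 21]
smoothed = nonlinear_diffusion_1d(noisy_step)
print([round(v, 2) for v in smoothed])
```

The ripples on each side flatten out, while the large step between the two halves stays in place because its weight is nearly zero.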
Simulink Design - NLD
Design guidelines
• All I/Os must be placed on the top level of the design.
• Simulink sources must be configured to the same clock that toggles the input port they feed.
• All signals from the workspace blocks feeding input blocks, and all frame output blocks, must use the same frame size (as seen in the previous slide).
• The design must obey the rules of the table in the PROC_HILs User Guide V2.1.2, p. 49.
Simulink Design - NLD
[Figure: subsystem producing Rx, Ry and R256 from the input R, built from two Pipelined Adder (subtract) blocks and two Delay blocks (z^-1 and z^-256).]
Simulink Design - NLD
[Figure: subsystems producing g11 and g12 from Rx, Ry and beta, built from a Multiply Add block (y = a0 × b0 + a1 × b1), a Multiplier block and a Constant.]
Simulink Design - NLD
[Figure: gm_out subsystem. Inputs: g11, g12, g22. Blocks: Multiplier1 to Multiplier5, Pipelined Adder (subtract), Square Root (d(43:0), q(21:0)), Divider (a = b × q + r). Constants: beta = 1048576, beta1 = 9.5367e-007, beta2 = 1048576, beta3 = 0.00097656. Output: gm_out.]
[Figure: delay stage. The six inputs g11, g12, g22, Ry, R256 and Rx pass through six Delay blocks of z^-206 each to the outputs Out1 to Out6.]
Timing considerations
• Determining the clock rate
  – The video processing algorithm has to process 15 iterations of 256 by 256 pixels per frame, at a reasonable rate of 15 frames per second:
    clock rate = 256^2 × 15 × 15 = 14,745,600 ≈ 15 [MHz]
• A long logical path prevents meeting the clock rate demands and causes compilation to fail.
  – The Altera DSPBuilder Advanced blockset supports automatic pipelining (not used in this project).
  – The Altera DSPBuilder blockset supports user pipelining, either through the internal pipeline setting of a block (determined by the user) or by inserting Delays along the logical path. This method requires careful attention from the designer, who must guarantee full synchronization of the logical paths by design.
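The clock-rate figure can be rechecked with a few lines of Python. This is a small sketch that assumes one pixel is processed per clock cycle; the function name is hypothetical.

```python
def required_clock_hz(width, height, iterations, frames_per_sec):
    """Pixel operations per second, assuming one pixel per clock cycle."""
    return width * height * iterations * frames_per_sec

rate_hz = required_clock_hz(256, 256, 15, 15)
print(rate_hz)  # 14745600, rounded up to 15 MHz in the slides

# Period of the rounded 15 MHz clock, in seconds.
period_sec = 1.0 / 15e6
print(period_sec)
```

The period of a 15 MHz clock is about 6.67e-8 sec, which agrees with the Clock block value (6.666666666666667E-8 sec) shown in the design diagrams.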
DSPBuilder internal blocks pipelining
[Figure: the gm_out subsystem (inputs g11, g12, g22; Multiplier1 to Multiplier5, Pipelined Adder, Square Root, Divider; constants beta, beta1, beta2, beta3) shown with the internal pipelining of the DSPBuilder blocks.]
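User pipelining only works if every path that meets at a block carries the same latency. The sketch below shows the bookkeeping in Python: given the pipeline latency of each parallel path, it returns the extra Delay each path needs. The path names and latencies are hypothetical; the slides show six z^-206 Delay blocks that may serve this purpose, but they do not state how that value was derived.

```python
def balancing_delays(path_latencies):
    """Return the extra delay (in clock cycles) each path needs so that
    all paths arrive with the latency of the slowest one."""
    longest = max(path_latencies.values())
    return {name: longest - latency for name, latency in path_latencies.items()}

# Hypothetical example: one deeply pipelined path and two bypass paths.
example = {"pipelined_path": 206, "bypass_a": 0, "bypass_b": 0}
print(balancing_delays(example))
```

Each bypass path gets a delay equal to the latency of the pipelined path, so all signals stay aligned sample by sample.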
Simulink Design - NLD
[Figure: belt_r subsystem. Inputs: gm05, g11, g12, g22, Rx, Ry, R, dt. Blocks: Multiplier1 to Multiplier8, Pipelined Adder to Pipelined Adder3, dpx and dpy sub-blocks, min and max blocks, Delay (z^-256) and Delay1 (z^-255). Output: belt_r.]
Compilation and synthesis flow
• The performance of the completed design is first validated in the Simulink environment.
• Fully automatic compilation and synthesis starts by activating the GiDEL HIL Generation Tool block.
• A preliminary compatibility test starts by pressing the prompt GUI button.
  – Checks that the design rules are met.
  – Does not check hardware fitting and feasibility.
[Figure: GiDEL HIL Generation Tool]
Compilation and synthesis (cont.)
• The “GO” button issues a full compilation and synthesis of the design.
• The generation flow can be adjusted by selecting “Advanced Mode”, which controls the enabling/disabling of the different flow stages.
Compilation and synthesis (cont.)
• Generation ends with a new Simulink design file.
• PROC_HILs does not fully report the feasibility and hardware consumption of the design.
  – Quartus files exist only while the generation process is active and are then automatically deleted.
  – Solution: during generation, extract the Quartus top design and compile it independently with Quartus.
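One way to extract the Quartus files before they are deleted is to copy the working directory repeatedly while generation is running. The Python sketch below assumes the files live in a directory the user can locate; the slides do not say where PROC_HILs keeps them, so all paths and the function name are hypothetical.

```python
import shutil
import time
from pathlib import Path

def snapshot_while_generating(work_dir, backup_dir, polls=1, interval_s=0.0):
    """Copy work_dir into backup_dir `polls` times, overwriting older copies.
    Returns the relative paths of all files present in the backup."""
    work_dir, backup_dir = Path(work_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    for _ in range(polls):
        if work_dir.is_dir():
            # dirs_exist_ok lets each poll refresh the same backup folder.
            shutil.copytree(work_dir, backup_dir, dirs_exist_ok=True)
        time.sleep(interval_s)
    return sorted(p.relative_to(backup_dir).as_posix()
                  for p in backup_dir.rglob("*") if p.is_file())

# Hypothetical usage while the generation tool is running:
# snapshot_while_generating("path/to/generation_dir", "quartus_backup",
#                           polls=600, interval_s=1.0)
```

Files copied while the tool is still writing them may be incomplete, so the last snapshot taken before deletion is the one to open in Quartus.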
[Figure: generated Simulink design <your_design_name>_HIL. A single hw_loop_6b_HIL_HW_block with Signal From Workspace sources (vecR, dt, beta), Convert blocks (cvrt_inp, cvrt_inp3r, cvrt_inp4r, cvrt_outp) and a To Workspace1 sink (prob_belt). Clock: 6.666666666666667E-8 sec.]
Quartus compilation example
• NLD hardware consumption: [figure]
NLD output example
• Original image: [figure]
• Smoothed image (3 iterations): [figure]
Performances
Run time: Simulink simulation and hardware simulation

Vector length    Simulink simulation [sec]    Hardware simulation [sec]
15,000           15.38741                     9.876949
75,000           37.052384                    9.860184
150,000          64.449291                    10.134778
750,000          296.137446                   11.885162
1,500,000        583.806265                   14.166178
7,500,000        2902.047122                  32.085269
15,000,000       5792.6473                    54.825545

• Calculated warm-up time:
  – Simulation overhead: 9.9712 [sec]
  – Hardware overhead: 9.60422 [sec]
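The run times from the Performances slide can be turned into speed-up figures with a short Python sketch. Subtracting the reported warm-up times before dividing is one plausible way to discount the overhead; the slides do not state how their own ratio was computed, so the corrected figure here need not match it exactly.

```python
VECTOR_LENGTHS = [15_000, 75_000, 150_000, 750_000,
                  1_500_000, 7_500_000, 15_000_000]
SIMULINK_SEC = [15.38741, 37.052384, 64.449291, 296.137446,
                583.806265, 2902.047122, 5792.6473]
HARDWARE_SEC = [9.876949, 9.860184, 10.134778, 11.885162,
                14.166178, 32.085269, 54.825545]
SIM_OVERHEAD_SEC = 9.9712    # reported simulation warm-up time
HW_OVERHEAD_SEC = 9.60422    # reported hardware warm-up time

def raw_speedups():
    """Simulink run time divided by hardware run time, per vector length."""
    return [s / h for s, h in zip(SIMULINK_SEC, HARDWARE_SEC)]

def corrected_speedup(sim_sec, hw_sec):
    """Same ratio after subtracting the reported warm-up times (assumption)."""
    return (sim_sec - SIM_OVERHEAD_SEC) / (hw_sec - HW_OVERHEAD_SEC)

for length, ratio in zip(VECTOR_LENGTHS, raw_speedups()):
    print(f"{length:>12,d}  {ratio:8.2f}")
print(corrected_speedup(SIMULINK_SEC[-1], HARDWARE_SEC[-1]))
```

The raw ratio grows with the vector length because the fixed warm-up time of roughly ten seconds dominates the short runs.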
Performances
[Figure: Run time ratio, simulation and hardware. X axis: vector length (0 to 14,000,000). Y axis: ratio (0 to 120).]
• Ratio: run time simulation / run time hardware
• With the overhead removed, time ratio: 128.645454
Eli’s comment: All Simulations were made on:
Applications
• Implementing NLD as part of real-time video capture/view streaming.
• Webcam envelope:
  – Resizing the image (256x256)
  – Performing “log” on the resized image
  – Spreading the image into vector form
  – Reshaping to matrix form
  – Performing “power” on the processed image
• Frame rate: 15 [frames/sec]
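The envelope steps can be sketched in plain Python as follows. The sketch assumes that “power” is the inverse of a natural “log” (that is, `exp`), which the slides do not confirm; resizing is left out, and `process_vector` is a hypothetical placeholder for the NLD block.

```python
import math

def to_vector(matrix):
    """Spread a matrix into a single row-major vector."""
    return [value for row in matrix for value in row]

def to_matrix(vector, width):
    """Reshape a row-major vector back into rows of `width` values."""
    return [vector[i:i + width] for i in range(0, len(vector), width)]

def envelope(image, process_vector):
    """log -> vector -> processing -> matrix -> exp (pixels must be > 0)."""
    width = len(image[0])
    logged = [[math.log(pixel) for pixel in row] for row in image]
    processed = process_vector(to_vector(logged))
    reshaped = to_matrix(processed, width)
    return [[math.exp(value) for value in row] for row in reshaped]

# With an identity "processing" step the image comes back unchanged
# (up to floating-point rounding).
tiny_image = [[1.0, 2.0], [3.0, 4.0]]
print(envelope(tiny_image, lambda vec: vec))
```

The vector form is needed because the hardware interface works on vectors rather than matrices.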
Applications (cont.)
• The NLD algorithm hardware block is inserted into the webcam envelope.
• The hardware dramatically decreases the frame rate, although it is designed to support the desired frame rate.
  – Operating frequency is 15 MHz.
• Conclusion: the Simulink/hardware interface overhead is too high to allow proper streaming in real-time applications.
• Insufficient frame rate: 0.077 [frames/sec]
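A quick Python calculation puts the measured frame rate next to the target. Only the two frame rates come from the slides; the variable names are illustrative.

```python
TARGET_FPS = 15.0      # desired frame rate
MEASURED_FPS = 0.077   # frame rate measured with the hardware block

seconds_per_frame = 1.0 / MEASURED_FPS
shortfall_factor = TARGET_FPS / MEASURED_FPS

print(f"{seconds_per_frame:.1f} s per frame")
print(f"{shortfall_factor:.0f}x slower than the target")
```

Each frame takes roughly 13 seconds, which is of the same order as the hardware warm-up time reported on the Performances slide.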
Hardware Loop
• A possible way to take advantage of PROC_HILs is to use a hardware loop.
[Figure: hardware loop design. Blocks: Signal From Workspace sources (vecR, dt, beta), three GiDEL Frame Input blocks (Input i[32]:[32], Input3 i[16]:[16], Input4 i[16]:[16]), Multiplexer (sel(0:0)), two If Statement blocks (a<b, a>b), Counter (q(23:0), mod 1048576), Constant (65536), Constant1 (1), Belt1d (In1, In2, In3, Out1), FIFO (d, rreq, wreq, q, full, empty, usdw(15:0)), GiDEL Frame Output (Output o[16]:[16]), To Workspace1 (prob_belt), GiDEL HIL Generation Tool, Clock (6.666666666666667E-8 sec).]
Pipeline levels: 256
FIFO size: 256x256 - 256
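The numbers on this slide are consistent with a loop whose total latency is exactly one frame: 256 pipeline levels plus a FIFO of 256x256 - 256 entries give 65536 cycles, and 16 passes of 65536 cycles give the counter modulus 1048576. The Python sketch below models that reading with small sizes. It is an interpretation, not the authors' design: the function name is hypothetical and `step` is a per-sample placeholder, whereas a real NLD step also needs neighbouring pixels.

```python
from collections import deque

def run_hardware_loop(frame, iterations, pipeline_levels, step):
    """Circulate one frame `iterations` times through a processing pipeline
    followed by a FIFO whose combined latency equals the frame length."""
    n = len(frame)
    if iterations < 1 or not 0 < pipeline_levels < n:
        raise ValueError("need iterations >= 1 and 0 < pipeline_levels < len(frame)")
    pipeline = deque([None] * pipeline_levels)   # latency of the processing block
    fifo = deque([None] * (n - pipeline_levels)) # pads the loop to one frame
    result = []
    total_cycles = n * (iterations + 1)          # one load pass + the iterations
    for cycle in range(total_cycles):
        feedback = fifo.popleft()
        # Multiplexer: external input during the first pass, feedback afterwards.
        sample = frame[cycle] if cycle < n else feedback
        if cycle >= n * iterations:
            result.append(feedback)              # final pass: collect the output
        pipeline.append(step(sample))
        fifo.append(pipeline.popleft())
    return result, total_cycles

out, cycles = run_hardware_loop(list(range(8)), iterations=3,
                                pipeline_levels=2, step=lambda v: v + 1)
print(out, cycles)
```

With a frame of 256 × 256 samples and 15 iterations the same formula gives (15 + 1) × 65536 = 1048576 cycles. Keeping the data circulating on the board avoids crossing the Simulink/hardware interface once per iteration.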
Hardware Loop (cont.)
• Multiple attempts to generate the full hardware loop (HL) design failed to fit within the hardware limits of the PROCe III board.
• The same design was implemented on a PROCStar III board, with no problems reported in the generation flow.
• Problem encountered: while the Simulink simulation showed reasonable results, the hardware simulation showed different results (efforts to find the origin and fix it were stopped due to the project’s time constraints).
Problems encountered
• Strict software compatibility demands
  – Only one combination of the involved software versions works together (Matlab, PROC_HILs, Altera DSPBuilder, Quartus, ProcWizard).
• Moderately complex algorithms do not fit the common boards when using PROC_HILs and the Altera DSPBuilder blockset.
• The Altera DSPBuilder blockset offers a poor variety of blocks and lacks common operations (log, exp, power, root, not, min/max…).
• For effective usage, one should use the Altera DSPBuilder Advanced blockset, but it requires the Simulink Fixed-Point license.
Problems encountered (cont.)
• Demands data flow as vectors and does not support matrices.
• Inconsistency between simulation and hardware performance.
• Inconvenient existing blocks
  – Square Root: accepts and returns only whole numbers.
  – Divider: returns the result only as a whole-number quotient and a remainder.
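A common way around whole-number-only blocks is to scale the operand by a power of two before the operation and scale the result back afterwards. The Python sketch below shows the idea. The constants in the design diagrams (1048576 = 2^20, 0.00097656 ≈ 2^-10, 9.5367e-007 ≈ 2^-20) suggest the design does something similar, but that is an inference, and the function names are hypothetical.

```python
import math

FRAC_BITS = 20  # scale factor 2**20 = 1048576

def scaled_sqrt(x, frac_bits=FRAC_BITS):
    """sqrt via an integer-only square root: sqrt(x * 2**f) = sqrt(x) * 2**(f/2)."""
    scaled = int(x * (1 << frac_bits))        # whole-number input
    root = math.isqrt(scaled)                 # whole-number output
    return root / (1 << (frac_bits // 2))     # undo the scaling (times 2**-10)

def scaled_divide(a, b, frac_bits=FRAC_BITS):
    """Fractional quotient of two integers using only quotient and remainder."""
    quotient, _remainder = divmod(a << frac_bits, b)
    return quotient / (1 << frac_bits)        # undo the scaling (times 2**-20)

print(scaled_sqrt(2.0))
print(scaled_divide(1, 3))
```

The precision is set by the number of fractional bits: 20 bits of scaling leave 10 fractional bits after the square root and 20 after the division.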
Conclusions
1. Allows easy design and implementation of algorithms in the Simulink environment.
   • Direct hardware burn.
   • Direct generation of HDL code that matches the target board.
   • Fast HW simulation using the Simulink/Matlab interface.
2. Extremely efficient for resource-consuming processing algorithms.
3. Not suited for streaming-data (real-time) designs.
Future project plan (PART II)
Main goals/phases:
1) Learning PROC API and PROC MegaFIFO.
2) Defining and building an integrated DSPBuilder design combining PROC API video streaming functions, data channels and PROC MegaFIFO memories.
Motivation: learning and practicing an effective debug methodology using PROC API.
GIDEL PROC_API enables real-time configuration and querying of the board.
Video stream diagram (PART II)
[Figure: video stream diagram with an RX FIFO and a TX FIFO, PROC MegaFIFO and PROC API.]
Time table to final presentation (weeks 1 to 10)
Tasks:
• Learning PROC API, PROC MegaFIFO.
• Building a simple design combining DSPBuilder and the ProcWizard using PROC API.
• Defining an integrated design combining PROC API video streaming functions and data channels, PROC MegaFIFO memories and a DSPBuilder design.
• Verification and writing the project’s book.