Exploring new implementation tools for GIDEL PROCSTAR platform
(PART I - PROC_HILs)

Elad Hadar, Omer Norkin
Supervisor: Mike Sumszyk
Winter 2010/11, Single semester project. Date: 30/5/11

Technion – Israel Institute of Technology
Faculty of Electrical Engineering
High Speed Digital System Lab (HS DSL)



What is PROC_HILs?
• PROC_HILs is a Hardware-In-the-Loop acceleration tool for running Simulink designs on FPGAs.
• It automatically translates Simulink designs into FPGA code (compatible with the PROC board installed on the target PC) and runs them under Simulink.

Why do we need PROC_HILs?
• Dramatically improves simulation speed, with a dedicated accelerator for Simulink designs.
• Enables building a design visually and downloading it directly, with minimal effort, into the PROC board.
• Enables concurrent engineering at an early stage.
• Cuts development cycle time (and costs).
• Improves design reliability.

Project motivation
• Implementing a video analysis design on the GIDEL PROCSTAR III platform, enabling use and exploration of a new development platform (PART I – PROC_HILs).
• Proper usage of development tools throughout all stages of implementation, from algorithm to hardware.

How it works
• PROC_HILs enables the user to download a Simulink design into the PROC board and run it.
• The design runs on the on-board FPGAs, communicating with Simulink in real time.
• The generation process is fully automatic.

How it works – Main stages
1. Simulink design.
2. HDL code is generated, synthesized and compiled into an .rbf file (FPGA binary file) compatible with the specific PROC board.
3. A new Simulink design file is generated: a single HIL block with all the inputs and outputs that were present in the original design, connected to all the sources and sinks.
4. The design runs on the hardware, fully synchronized with Simulink, receiving the signals from the simulation sources and writing the results into the sinks.

Hardware and Development environment
• Main development stages were carried out on a GiDEL PROCe III (Altera Stratix III) board (1 FPGA)
• GiDEL PROC_HILs (Version 2.1.2)
• Altera's DSPBuilder blockset for Simulink (Version 10.1)
• ProcWizard (Version 8.8)
• Quartus II (Version 10.1)
• Matlab (Version 2009a)
• Additional development was carried out on a GiDEL PROCStar III (Altera Stratix III) board (4 FPGAs)

Simulink Design - NLD
• NLD is a hardware implementation of the Non Linear Diffusion algorithm for video images.
• It enables local smoothing of the picture while preserving edges.
• The Simulink design in this project is based on a previous project (performed in the Technion HS DSL by Tsion Bublil & Yony Dekell).
• The original project was implemented on a PROCStar II (Altera Stratix II) board (4 FPGAs), using the SynplifyDSP blockset library for Simulink.


Simulink Design - NLD

Design guidelines
• All I/Os must be placed on the top level of the design.
• Simulink sources must be configured to the same clock that toggles the input port they feed.
• All signals from the workspace blocks feeding input blocks, and all frame output blocks, must use the same frame size (as seen in the previous slide).
• The design must obey the rules in the following table:

* PROC_HILs User Guide V2.1.2 p. 49

Simulink Design - NLD

[Figure: two Simulink subsystems. Ry: inputs R and R256, a z^-1 Delay and a Pipelined Adder (a - b). Rx: input R, a z^-256 Delay and a Pipelined Adder (a - b).]
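The z^-1 and z^-256 Delay blocks in this diagram suggest that neighbouring pixels are reached by delaying the pixel stream and subtracting. The Python sketch below illustrates that idea only; it assumes the frame is streamed row by row (row-major order, which the slide does not state), and the function name `delayed_difference` and the 4-pixel-wide test frame are hypothetical.

```python
from collections import deque

def delayed_difference(stream, delay):
    """Return x[n] - x[n - delay] for each sample, treating missing history as 0.

    Mimics a Delay block (z^-delay) feeding the 'b' input of an a - b adder.
    """
    history = deque([0] * delay, maxlen=delay)  # delay line, oldest sample first
    out = []
    for x in stream:
        out.append(x - history[0])  # history[0] is the sample from `delay` steps ago
        history.append(x)           # pushes x in and drops the oldest sample
    return out

# Hypothetical 3x4 frame streamed row by row (width 4 instead of 256).
WIDTH = 4
frame = [
    [1, 2, 4, 7],
    [2, 3, 5, 8],
    [4, 5, 7, 10],
]
stream = [p for row in frame for p in row]

horizontal = delayed_difference(stream, 1)      # difference with the previous pixel
vertical = delayed_difference(stream, WIDTH)    # difference with the pixel one row up
```

With a row-major stream, a delay equal to the row width lines each pixel up with the one directly above it, which is why a 256-sample delay would fit a 256-pixel-wide frame. The first samples of each row (and the whole first row for the vertical case) see data from outside the neighbourhood and need separate border handling.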

Simulink Design - NLD

[Figure: two Simulink subsystems. g11: a Multiply Add block (y = a0 X b0 + a1 X b1; inputs a0, b0, a1, b1 of type [13]:[13], output y of type [27]:[26]) with a Constant (1) and inputs Ry and beta. g12: Multiplier2 (inputs a, b and result r of type [13]:[13]) with inputs Rx and Ry.]

Simulink Design - NLD

[Figure: Simulink subsystem producing gm_out from inputs g11, g12 and g22, built from Multiplier1 to Multiplier5, a Pipelined Adder (a - b), a Divider (a = b X q + r), a Square Root block (d(43:0), q(21:0)) and the gain constants beta = 1048576, beta1 = 9.5367e-007, beta2 = 1048576, beta3 = 0.00097656.]

[Figure: delay-alignment subsystem. Inputs g11, g12, g22, Ry, R256 and Rx pass through z^-206 Delay blocks to outputs Out1 to Out6.]

Timing considerations
• Determining clock rate
– The video processing algorithm has to process 15 iterations over 256 by 256 pixels per frame, at a reasonable rate of 15 frames per second.
• A long logical path prevents meeting the clock rate demands, and compilation fails.
– The Altera DSPBuilder Advanced blockset supports automatic pipelining (not used in this project).
– The Altera DSPBuilder blockset supports user pipelining, either through the internal pipeline setting of a block (determined by the user) or by inserting Delays along the logical path. This method requires careful attention from the designer, who must ensure by design that the logical paths stay fully synchronized.

$256^2 \times 15 \times 15 = 14{,}745{,}600 \;\Rightarrow\; \text{clock rate} \approx 15\,[\text{MHz}]$
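The clock-rate estimate can be checked with a few lines of Python. This is a sketch that assumes one pixel is processed per clock cycle, which is the reading implied by the slide's arithmetic; the function name is illustrative.

```python
def required_clock_hz(width, height, iterations, frames_per_second):
    """Samples per second when one pixel is processed per clock cycle."""
    return width * height * iterations * frames_per_second

rate_hz = required_clock_hz(256, 256, 15, 15)
period_s = 1 / 15e6  # period of the rounded 15 MHz clock

print(rate_hz)   # 14745600
print(period_s)  # roughly 66.7 ns
```

Rounding up to 15 MHz gives a clock period of about 66.7 ns, which agrees with the 6.666666666666667E-8 sec Clock block shown in the generated designs.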

DSPBuilder internal blocks pipelining

[Figure: the gm_out subsystem (inputs g11, g12, g22; Multiplier1 to Multiplier5, Pipelined Adder, Divider with a = b X q + r, Square Root with d(43:0) and q(21:0), gain constants beta, beta1, beta2, beta3), shown as an example of pipelining inside DSPBuilder blocks.]

Simulink Design - NLD

[Figure: Simulink subsystem producing belt_r from inputs gm05, g11, g12, g22, Rx, Ry, R and dt. Blocks: Multiplier1 to Multiplier4 ([13]:[13] inputs, [15]:[15] result), Multiplier5 to Multiplier8 ([16]:[16]), Pipelined Adder and Pipelined Adder1 (a - b), Pipelined Adder2 and Pipelined Adder3 (a + b), dpx, dpy, min, max, Delay (z^-256) and Delay1 (z^-255).]

Compilation and synthesis flow
• Validate the performance of the completed design in the Simulink environment.
• A fully automatic compilation and synthesis starts by activating the GiDEL HIL Generation Tool block.
• A preliminary compatibility test starts by pressing the prompt GUI button.
– Checks that the design rules are met.
– Does not check hardware fitting and feasibility.

[Figure: GiDEL HIL Generation Tool block]

Compilation and synthesis (cont.)
• The “GO” button issues a full compilation and synthesis of the design.
• The generation flow can be adjusted by selecting the “Advanced Mode”, which controls the enabling/disabling of different flow stages.

Compilation and synthesis (cont.)
• Generation ends with a new Simulink design file, <your_design_name>_HIL.
• PROC_HILs does not fully elaborate the feasibility and hardware consumption of the design.
– Quartus files are generated only while the generation process is active and are then automatically deleted.
– Solution: during generation, extract the Quartus top design and compile it independently with Quartus.

[Figure: generated HIL design. A single HIL block (hw_loop_6b_HIL_HW_block) fed by Signal From Workspace sources (vecR, dt, beta) through Convert blocks (cvrt_inp, cvrt_inp3r, cvrt_inp4r), with its output passing through Convert block cvrt_outp to To Workspace1 (prob_belt). Clock: 6.666666666666667E-8 sec.]

Quartus compilation example
• NLD hardware consumption:

[Figure: Quartus compilation report for the NLD design]

NLD output example
• Original image:
• Smoothed image (3 iterations):

Performances

Run time - Simulation & Hardware

Vector length    Simulink simulation [sec]    Hardware simulation [sec]
15,000           15.38741                     9.876949
75,000           37.052384                    9.860184
150,000          64.449291                    10.134778
750,000          296.137446                   11.885162
1,500,000        583.806265                   14.166178
7,500,000        2902.047122                  32.085269
15,000,000       5792.6473                    54.825545

• Calculated warm-up time:
– Simulation overhead: 9.9712 [sec]
– Hardware overhead: 9.60422 [sec]

Performances

[Figure: Run time ratio - simulation & hardware, plotted against vector length (0 to 14,000,000); ratio axis from 0 to 120.]

$\text{Ratio} = \dfrac{\text{Run time}_{\text{Simulation}}}{\text{Run time}_{\text{Hardware}}}$

• Reduced overhead, time ratio: 128.645454

Eli’s comment: All Simulations were made on:
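The slide does not state how the reduced-overhead ratio was derived. A minimal sketch, assuming it is (simulation time minus simulation overhead) divided by (hardware time minus hardware overhead), using the run times and overheads quoted on the performance slides; the function names are illustrative.

```python
# Run times in seconds, copied from the "Run time - Simulation & Hardware" chart.
vector_lengths = [15_000, 75_000, 150_000, 750_000, 1_500_000, 7_500_000, 15_000_000]
simulink_s = [15.38741, 37.052384, 64.449291, 296.137446, 583.806265, 2902.047122, 5792.6473]
hardware_s = [9.876949, 9.860184, 10.134778, 11.885162, 14.166178, 32.085269, 54.825545]

# Warm-up overheads quoted on the slide, in seconds.
SIM_OVERHEAD_S = 9.9712
HW_OVERHEAD_S = 9.60422

def raw_ratio(i):
    """Plain run-time ratio for the i-th vector length."""
    return simulink_s[i] / hardware_s[i]

def reduced_ratio(i):
    """Ratio after subtracting the fixed warm-up overhead from both run times."""
    return (simulink_s[i] - SIM_OVERHEAD_S) / (hardware_s[i] - HW_OVERHEAD_S)

for i, n in enumerate(vector_lengths):
    print(f"{n:>10,d}  raw {raw_ratio(i):8.2f}  reduced {reduced_ratio(i):8.2f}")
```

Under this assumption the 7,500,000-sample point gives a reduced ratio of about 128.6, close to the figure quoted on the slide. For short vectors the hardware run time is almost entirely overhead, so the reduced ratio there is very sensitive to the overhead estimate.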

Applications
• Implementing NLD as part of real-time video capture/view streaming.
• Webcam envelope:
– Resizing the image (256x256)
– Performing “log” on the resized image
– Spreading the image to vector form
– Reshaping to matrix form
– Performing “power” on the processed image

$\text{Frame rate} = 15\,[\text{frames/sec}]$

Applications (cont.)
• The NLD algorithm hardware block is inserted into the webcam envelope.
• The hardware block dramatically decreases the frame rate, although it is designed to sustain the desired frame rate.
– Operating frequency is 15 MHz.
• Conclusion: the Simulink/hardware interface overhead is too high to allow proper streaming in real-time applications.

$\text{Insufficient frame rate} = 0.077\,[\text{frames/sec}]$

Hardware Loop
• A possible way to take advantage of PROC_HILs is to use a hardware loop.

[Figure: hardware-loop Simulink design. Signal From Workspace sources (vecR, dt, beta) feed three GiDEL Frame Input blocks (Input [32]:[32], Input3 and Input4 [16]:[16]); a Multiplexer (sel(0:0)), two If Statement blocks (a<b, a>b), a Counter (q(23:0), mod 1048576), Constants (65536 and 1) and a FIFO (d, rreq, wreq, q, full, empty, usdw(15:0)) close the loop around the Belt1d subsystem (In1, In2, In3, Out1); the result leaves through a GiDEL Frame Output block (Output [16]:[16]) to To Workspace1 (prob_belt). Also shown: the GiDEL HIL Generation Tool block and a Clock of 6.666666666666667E-8 sec.]

Pipeline levels: 256

FIFO Size: 256X256-(256)
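The two annotations can be read as a latency budget: if the loop has to return each pixel exactly one frame later, the FIFO depth plus the pipeline depth should equal one 256 x 256 frame. This reading is an assumption, not something the slide states. The Python sketch below checks the arithmetic and simulates the alignment on a small hypothetical frame; `loop_is_aligned` is an illustrative name.

```python
from collections import deque

FRAME_PIXELS = 256 * 256                      # one frame streamed as a vector
PIPELINE_LEVELS = 256                         # pipeline latency quoted on the slide
FIFO_DEPTH = FRAME_PIXELS - PIPELINE_LEVELS   # "256X256-(256)" on the slide

def loop_is_aligned(frame_pixels, fifo_depth, pipeline_levels, cycles):
    """Model the loop as a single delay line of FIFO + pipeline stages.

    Returns True if every sample leaving the loop carries the same pixel
    index as the sample entering the loop on that cycle.
    """
    loop = deque(maxlen=fifo_depth + pipeline_levels)
    for t in range(cycles):
        entering = t % frame_pixels
        if len(loop) == loop.maxlen and loop[0] != entering:
            return False          # returning pixel is out of step with the input
        loop.append(entering)     # oldest entry is dropped once the line is full
    return True

print(FIFO_DEPTH)                        # 65280
print(loop_is_aligned(16, 12, 4, 100))   # True: 12 + 4 equals the frame size
print(loop_is_aligned(16, 16, 4, 100))   # False: total latency exceeds one frame
```

The second call shows why the FIFO is sized as the frame minus the pipeline depth: a FIFO holding a whole frame would add the pipeline latency on top and shift every pixel on each pass.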

Hardware Loop (cont.)
• Multiple attempts at the full hardware-loop (HL) design failed to fit within the hardware limits of the PROCe III board.
• The same design was implemented on a PROCStar III board, with no problems reported in the generation flow.
• Problem encountered: while the Simulink simulation showed reasonable results, the hardware simulation showed different results (efforts to find the origin and fix it were stopped due to the project's time constraints).

Problems encountered
• Strict software compatibility demands
– Only one combination of the software versions involved is compatible (Matlab, PROC_HILs, Altera DSPBuilder, Quartus, ProcWizard).
• Moderately sized algorithms do not fit the common boards when using PROC_HILs and the Altera DSPBuilder blockset.
• The Altera DSPBuilder blockset offers a poor variety of blocks and lacks common operations (log, exp, power, root, not, min/max…).
• For effective usage, one should use the Altera DSPBuilder Advanced blockset, but it requires the Simulink Fixed Point license.

Problems encountered (cont.)
• Demands data flow as vectors and does not support matrices.
• Inconsistency between simulation and hardware results.
• Inconvenient existing blocks
– Square Root: accepts and returns only whole numbers.
– Divider: returns only a whole-number quotient and a remainder.
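One common way to work around whole-number-only blocks is fixed-point pre-scaling. Whether the original design did exactly this is an assumption, although the gain constants 1048576 (2^20) and 0.00097656 (about 2^-10) that appear around the Square Root block in the NLD diagrams are at least consistent with it. In the Python sketch below, `math.isqrt` and `divmod` stand in for the Square Root and Divider blocks; the function names and the choice of 10 fractional bits are illustrative.

```python
import math

def sqrt_fixed_point(x, frac_bits=10):
    """Approximate sqrt(x) using an integer-only square root.

    Scaling the input by 2**(2*frac_bits) before the integer square root
    leaves frac_bits fractional bits once the result is scaled back down.
    """
    scaled = int(x * (1 << (2 * frac_bits)))   # e.g. multiply by 2**20
    root = math.isqrt(scaled)                  # whole number in, whole number out
    return root / (1 << frac_bits)             # e.g. multiply by 2**-10

def divide_fixed_point(a, b, frac_bits=10):
    """Recover a fractional quotient from quotient q and remainder r (a = b*q + r)."""
    q, r = divmod(a, b)
    frac = (r << frac_bits) // b               # remainder scaled to frac_bits bits
    return q + frac / (1 << frac_bits)

print(sqrt_fixed_point(2))       # 1.4140625
print(divide_fixed_point(7, 2))  # 3.5
```

The precision is set by `frac_bits`: with 10 fractional bits the square root of 2 comes out as 1.4140625 rather than 1.41421..., and every extra fractional bit costs two more bits at the square-root input.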

Conclusions
1. Allows easy design and implementation of algorithms in the Simulink environment.
• Direct hardware burn.
• Direct generation of HDL code that matches the target board.
• Fast HW simulation using the Simulink/Matlab interface.
2. Extremely efficient for resource-intensive processing algorithms.
3. Not suited to streaming-data (real-time) designs.

Future project plan (PART II)

Main goals/phases:
1) Learning PROC API and PROC MegaFIFO.
2) Defining and building an integrated DSPBuilder design combining PROC API video streaming functions, data channels and PROC MegaFIFO memories.

Motivation: learning and practicing an effective debug methodology using PROC API.

GIDEL PROC_API enables real-time configuration and querying of the board.

Video stream diagram (PART II)

[Figure: video stream diagram showing an RX-FIFO and a TX-FIFO, PROC MegaFIFO, and PROC API.]

Time table to final presentation

[Table: tasks scheduled over Week 1 to Week 10]

Tasks:
• Learning PROC API and PROC MegaFIFO.
• Building a simple design combining DSPBuilder and the ProcWizard using PROC API.
• Defining an integrated design combining PROC API video streaming functions and data channels, PROC MegaFIFO memories and a DSPBuilder design.
• Verification and writing the project's book.