implementation of image processing kernels on src and sgi reconfigurable computers esam el-araby 1,...

24
Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1 , Mohamed Taher 1 , Tarek El-Ghazawi 1 , and Kris Gaj 2 1 The George Washington University, 2 George Mason University {esam, mtaher, tarek}@gwu.edu, [email protected]

Upload: preston-wilcox

Post on 13-Jan-2016

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers

Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers

Esam El-Araby1, Mohamed Taher1, Tarek El-Ghazawi1, and Kris Gaj2

1The George Washington University,2George Mason University

{esam, mtaher, tarek}@gwu.edu, [email protected]

Esam El-Araby1, Mohamed Taher1, Tarek El-Ghazawi1, and Kris Gaj2

1The George Washington University,2George Mason University

{esam, mtaher, tarek}@gwu.edu, [email protected]

Page 2: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 2 1008 / MAPLD2005

Introduction

What are Reconfigurable Computers (RCs)? RCs are computing systems based

on the close system-level integration of one or more general-purpose processors and one or more Field Programmable Gate Array (FPGA) chips

Benefits of RCs A trade-off between traditional hardware and software Hardware-like performance with software-like flexibility Hardware can be modified on-the-fly The programming model is aimed at shielding programmers from the details

of the hardware description Orders of magnitude performance improvements over traditional systems

Page 3: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 3 1008 / MAPLD2005

Introduction (cnt’d) Status of RCs

An important research subject due to the recent fast growth of FPGAs technology

Evolved from: “glue logic” between components to Accelerator boards to Stand-alone general-purpose RCs to Parallel reconfigurable computers

However, there exist multiple challenges that must be resolved

Page 4: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 4 1008 / MAPLD2005

Challenges Performance

I/O Bandwidth Significant Configuration Latency

Some systems spend 25% to 98% of their execution time performing reconfiguration

Need for Efficient OS and Run-Time Reconfiguration Management Reconfiguration methods in current systems are not fully dynamic

Ease of Use Compilers/Languages

HDLs (VHDL and Verilog) are hard to use by application scientists HLLs and simple interfaces

Debuggers

Page 5: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 5 1008 / MAPLD2005

Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier

Up to 256 input and 256 output ports with two tiers of switch

Common Memory (CM) has controller with DMA capability

Controller can perform other functions such as scatter/gather

Up to 8 GB DDR SDRAM supported per CM node

SRC Architecture(Hi-BarTM Based Systems)

Storage Area Storage Area Network Network

Local Area Local Area Network Network

Wide Area Wide Area Network Network DiskDisk

Customers’ Existing NetworksCustomers’ Existing Networks

PCI-XPCI-XPCI-XPCI-X

MAPMAP®®

SRC-6SRC-6

MAPMAP

PP

MemoryMemory

SNAPSNAP™™

PP

MemoryMemory

SNAPSNAP

Gig EthernetGig Ethernetetc.etc.

Common Common MemoryMemory

ChainingChainingGPIOGPIO

Common Common MemoryMemory

SRC Hi-Bar SwitchSRC Hi-Bar Switch

Source: [SRC]

Page 6: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 6 1008 / MAPLD2005

SRC Programming Environment

Objectfiles

Application sources

MAP CompilerP Compiler

Logic synthesis

Place & Route

Linker.bin files

.edf files

.o files .o files

Applicationexecutable

Configurationbitstreams

HDLsources.c or .f files .vhd or .v files

Objectfiles

Application sourcesUser

Macro sources

MAP CompilerP Compiler

Logic synthesis

Place & Route

Linker

.edf files

.bin files

. files

.o files .o files

Applicationexecutable

Configurationbitstreams

HDL

.c or .f files .vhd or .v files

.v files

Page 7: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 7 1008 / MAPLD2005

SRC Application Simulation Process

HLLHLLsourcesource

CFGCFGDFGDFG VerilogVerilogGeneratorGenerator

CompilerCompiler““Front-end”Front-end”

SynthesisSynthesis

Place Place and and

RouteRoute

logic.binlogic.bin

MacroMacroDefinitionDefinition

VerilogVerilog

DFGDFGBehavioralBehavioralSimulationSimulation

User Chip User Chip LevelLevel

SimulationSimulation

MacroMacroEmulationEmulation

C CodeC Code(Info File)(Info File)

MacroMacroVerilogVerilogCodeCode

OptionalOptional

MacroMacroVerilog CodeVerilog Code

Source: [SRC]

Page 8: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 8 1008 / MAPLD2005

Steps to Final Logic

DFG Simulation

Verifies memory allocation

Verifies data movement

Uses real run time environment

Emulates the CM & OBM relationships

Simulates User Logic

HLL SourceHLL Source

DFG SimulationDFG Simulation

Source: [SRC]

Page 9: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 9 1008 / MAPLD2005

Steps to Final Logic

DFG Simulation

User Logic Simulation

Test user developed macros

The application becomes the “test bench”

Push the generated logic one step closer to the actual hardware implementation

Requires “logic designer mentality” for debugging

Gives full visibility into the logic

HLL SourceHLL Source

DFG SimulationDFG Simulation

UL SimulationUL Simulation

Source: [SRC]

Page 10: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 10 1008 / MAPLD2005

Steps to Final Logic

DFG Emulation User Logic Simulation

MAP Hardware Execution

Full execution using ComList and User Logic on MAP

MAP ExecutionMAP Execution

HLL SourceHLL Source

DFG SimulationDFG Simulation

UL SimulationUL Simulation

Source: [SRC]

Page 11: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 11 1008 / MAPLD2005

SGI Systems(System Architecture)

C

C

C

C

C

C

C

C

C

C

C

V

RR

RR

IOIO

IO IORASC RASC

RASCRASC

NUMAlink system interconnect

General-purpose compute nodes

Peer-attached general purpose I/O

Integrated graphics/visualization

Reconfigurable Application Specific Computing

R

C

IO

RASC

V

Page 12: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 12 1008 / MAPLD2005

RASC Architecture

Page 13: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 13 1008 / MAPLD2005

RASC Architecture (cnt’d)

Page 14: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 14 1008 / MAPLD2005

Design Flow (HDLs)

IA-32 Linux

Machine

Design iterations

Design Entry(Verilog, VHDL)

Design Synthesis(Synplify Pro,

Amplify)

Design Implementation

(ISE)

Design Verification

Behavioral Simulation(VCS, Modelsim)

Static Timing Analysis(ISE Timing Analyzer)

.v, .vhd.v, .vhd

.edf

.ncd, .pcf

.bin

MetadataProcessing

(Python)

.v, .vhd

.cfg

Altix Device Programming(RASC Abstraction Layer,

Device Manager, Device Driver)

Real-time Verification

(gdb)

.c

Page 15: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 15 1008 / MAPLD2005

Design Flow (HLLs)

IA-32 Linux

Machine

RTL Generation and Integration with Core Services

Design Synthesis(Synplify Pro,

Amplify)

Design Verification

Behavioral Simulation(VCS, Modelsim)

Static Timing Analysis(ISE Timing Analyzer)

.v, .vhd

.v, .vhd

.edf

.ncd, .pcf

.bin

MetadataProcessing

(Python)

.v, .vhd

.cfg

Altix Device Programming(RASC Abstraction Layer,

Device Manager, Device Driver)

Real-time Verification

(gdb)

.c

Design Implementation(ISE)

HLL Design Entry(Handel-C, Impulse C, Mitrion C, Viva)

Page 16: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 16 1008 / MAPLD2005

Application Programming Interface

Rasclib: Resource allocation in conjunction with the RASC Device

Manager Data movement to/from the COP via DMA engines Algorithm control (start, stop, single step, stepN) Automatic scaling across multiple devices Interfaces necessary for debugging

Page 17: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 17 1008 / MAPLD2005

Abstraction Layer: Algorithm API

and deep scaling.

The Abstraction Layer’s algorithm API mirrors the COP API with a few additions that enable wide scaling,

Output Data

Application

COP

COP

COP

Input DataAlgorithm

COP

Input Data Output DataAlgorithm

Application COP

Page 18: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 18 1008 / MAPLD2005

Based on Open Source Gnu Debugger (GDB)

Uses extensions to current command set

Can debug host application and FPGA

Provides notification when FPGA starts or stops

Supplies information on FPGA characteristics

Can “single-step” or “run N steps” of the algorithm

Dumps data regarding the set of “registers” that are visible when the FPGA is active

RASC Debugging

Page 19: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 19 1008 / MAPLD2005

Applications of DWT

Pattern recognition

Feature extraction Metallurgy: characterization of rough surfaces

Trend detection: Finance: exploring variation of stock prices

Perfect reconstruction Communications: wireless channel signals

De-noising noisy data

FBI fingerprint compression

Detecting self-similarity in a time series

Video compression – JPEG 2000

Hyperspectral Dimension Reduction

Image Registration

Page 20: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 20 1008 / MAPLD2005

The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two resulting in two "column-decimated" images L and H

Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two

This decomposition results into four images, LL, LH, HL and HH

The LL image is taken as the new input to perform the next level of decomposition

Multi-Resolution DWT Decomposition (Mallat Algorithm)

Page 21: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 21 1008 / MAPLD2005

DWT Implementation (Top-Level)

data_in_H

data_in_L

data_out_H

data_out_LD1

D2

Q1

Q2

clkrst

rstclk

D Q

en

clkrst

out_enQ

Page 22: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 22 1008 / MAPLD2005

FIR Module(Transposed Form)

Page 23: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 23 1008 / MAPLD2005

DWT End-to-End Throughput (SRC-6 & SGI-RASC vs. P4)

Filter TypeSRC-6

(MB/sec)

SGI RASC

(MB/sec)P4

(MB/sec)

Daub1(Haar) 199 130 12

Daub2 199 130 9.98

Daub3 199 130 8.73

Image Size = 512 X 512 pixels

Page 24: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, and Kris Gaj 2

El-Araby 24 1008 / MAPLD2005

Conclusions DWT is implemented on both SRC-6 and SGI-RASC

systems Similarities and differences are analyzed with regard to:

System hardware architecture Ease of programming

Programming model Development time Hardware/software libraries

Performance The speed-up vs. microprocessor is reported

Primary bottlenecks limiting the performance of both systems are recognized

The capability to share and port applications between the SRC and SGI systems is explored