
Real Time Object Recognition in Videos with a Parallel Algorithm

Beatrix Azzopardi

Supervisor(s): Dr George Azzopardi

Faculty of ICT

University of Malta

May 2016

Submitted in partial fulfillment of the requirements for the degree of B.Sc. Computing Science (Hons.)


Abstract:

COmbination of Shifted Filter REsponses (COSFIRE) is a rotation, reflection and scale invariant method for object recognition. Already having exhibited a high degree of accuracy in several applications, its integration into real-time systems has not yet been investigated. This is due to the lack of an implementation which executes at the required frame rate of 30 frames/second.

A number of existing methods for object recognition have been accelerated using technologies ranging from hardware implementations to multicore CPU solutions. COSFIRE is an intuitive and flexible neuron-inspired method that is not application specific. It is also easily trainable due to its ability to derive an invariant filter from a single training image, or prototype. It thus merits inclusion in their number.

CUDA has been used to accelerate the execution of various algorithms through graphics processing units. Although it is one of several APIs providing access to this massive parallelism, CUDA has been selected due to its abstraction of low-level hardware considerations whilst still exposing sufficient resources for more intensive optimization.

Hence, a CUDA implementation of the non-invariant application of a COSFIRE filter to input images has been developed. This implementation displays both data and task parallelism. Data parallelism is made possible through the use of the frequency domain for the execution of the two filtering operations. Task parallelism is actualised using CUDA streams to facilitate concurrent processing of the COSFIRE filter tuples. This venture into task parallelism also expands the horizon for future work investigating the scalability of the implementation, both to more complex COSFIRE filters and to the execution of multiple COSFIRE filters.

Whilst the target execution time has not yet been achieved, a 9.5x speedup has been observed when compared to the MATLAB implementation of COSFIRE. Thus, this project is the first stepping stone towards a real-time implementation of COSFIRE. The work presented here is in anticipation of further speedup through more aggressive optimisation and more advanced techniques. The three topics central to this project - object recognition, COSFIRE in particular, and Graphics Processing Unit technology - are surveyed, followed by the documentation of the design and implementation of COSFIRE using CUDA. An evaluation in terms of execution times, achieved speedup and the quality of the implementation as indicated by application profiling is also presented. This work is then brought into further context by the discussion of related work and the conclusion's details of potential future work.


Contents

1 Introduction
   1.1 Problem Definition
   1.2 Motivation
   1.3 Approach
   1.4 Scope
   1.5 Aims and Objectives
   1.6 Report Layout

2 Background
   2.1 Object Detection and Recognition
   2.2 COSFIRE
   2.3 Graphics Processing Units

3 Design and Specification

4 Implementation

5 Testing and Evaluation

6 Related Work

7 Conclusion


List of Figures

1 A generic system containing both a host (CPU) and a device (GPU).
2 A generic multiprocessor found within an (NVIDIA) GPU.
3 The CUDA threading model.
4 Default stream usage.
5 Multiple stream usage.
6 Current COSFIRE Implementation.
7 Proposed revision of COSFIRE Implementation.

List of Tables

1 A comparison of the execution times in seconds.
2 A comparison of the positions of the maxima.
3 A comparison of the values of the maxima.


Glossary

COSFIRE - Combination of Shifted Filter Responses

S-COSFIRE - Shape Combination of Shifted Filter Responses

B-COSFIRE - Bar Combination of Shifted Filter Responses

V-COSFIRE - Vertex Combination of Shifted Filter Responses

CUFFT - CUDA Fast Fourier Transform

CUDA - Compute Unified Device Architecture

OpenCL - Open Computing Language

GPU - Graphics Processing Unit

CPU - Central Processing Unit

FPGA - Field Programmable Gate Array

SLI - Scalable Link Interface

ASIC - Application Specific Integrated Circuit

fps - frames per second

SSE - Streaming SIMD Extensions

SURF - Speeded-Up Robust Features

SIFT - Scale Invariant Feature Transform

FFT - Fast Fourier Transform

HOG - Histogram of Oriented Gradients

FIND - Feature Interaction Descriptor

Host - The CPU and its directly related components

Device - The GPU and its related components

SIMD - Single Instruction Multiple Data


SIMT - Single Instruction Multiple Thread

MIMD - Multiple Instruction Multiple Data

DoG - Difference of Gaussians


Real Time Object Recognition in Videos with a Parallel Algorithm

Beatrix Azzopardi∗

Supervised by: Dr George Azzopardi

May 2016

Abstract: The following report documents the development of an implementation of the COSFIRE object recognition algorithm using the CUDA C/C++ extension. The purpose of this effort was to decrease the execution time per image to the commonly quoted real-time frame rate of 30 fps. Whilst this is yet elusive for COSFIRE, progress has been made as a result of this project. This is evidenced by the almost 10x speedup achieved through the exploitation of both task-level and data-level parallelism. This report begins by introducing the reader to the context of object recognition and of parallelism/acceleration technologies. This is followed by an overview of object detection and recognition, an exposition of COSFIRE and an introduction to graphics processing units from a CUDA perspective. The design and implementation are also detailed and discussed in the subsequent chapters. Particular emphasis is placed on the usage of the frequency domain for the execution of image filtering operations. The results obtained through the CUDA implementation are then presented. This is followed by a discussion of related work in the field of accelerating the execution of various object recognition methods. The conclusion summarises the results and presents proposals for future work furthering the ultimate aim of real-time execution rates for COSFIRE.

1 Introduction

1.1 Problem Definition

The purpose of this project is to prove that COSFIRE is a viable candidate algorithm for incorporation into real-time systems that require a versatile object, pattern and feature detection and recognition component, by demonstrating that it can execute accurately within 0.05 s or less (an execution rate prevalent in related work). This would extend its applicability to a multitude of domains beyond offline operation. This array of applications has already been demonstrated in several previous works [AP14, AP13a, AP13c, AP13b, AP12, ASVP15].

∗Submitted in partial fulfillment of the requirements for the degree of B.Sc. Computing Science (Hons.).


1.2 Motivation

As automation becomes more commonplace, the tasks undertaken are becoming increasingly complex. The variety of possible inputs reflects the mass of stimuli that the human brain processes in fractions of a second. Similar processing demands are now being placed on machines. Object recognition is the problem of locating and identifying objects in a purely digital environment. The parallelisation of COSFIRE seeks to continue its tradition of emulating the human brain by mimicking the concurrent processing of data.

Graphical representation is an intuitive way to interface between a computer and its human user [Mac94]. The COSFIRE filter configuration attempts to simulate the human visual learning process. A human being can recognise an object despite variations in its surroundings, given that the fundamental set of defining points remains constant (a simplified version of human image-based learning). COSFIRE is similar in that, given a context-free instance of the object to study, it can then identify the same object irrespective of context and geometric transformations [AP13c]. It is also resilient to geometric transformations such as scaling, rotation and reflection, and to variance in contrast [AP13c].

A real-time COSFIRE implementation would be applicable in the development of dynamic real-time applications. Its current applications are in medical imaging [AP13a], handwritten digit recognition and traffic sign detection [AP13c]. Real-time operation is the execution of a system under timing constraints; failure to meet said constraints is catastrophic in the case of a hard real-time system and inconvenient in the case of a soft real-time system. Offline systems would also benefit from the reduced processing time that the proposed implementation would provide.

1.3 Approach

There are several ways in which an algorithm may be implemented in parallel, for example using a framework such as OpenCL or CUDA. The algorithm may be designed to execute as a distributed application across several communicating Central Processing Units, forming a cluster. Alternatively, it may be divided up into threads that execute on the cores of a single CPU [And00], amongst various other techniques. Whilst each approach has its merits, GPU computing has been selected because its locality is comparable to that of a CPU whilst it also offers a massively parallel architecture with the potential for scalability.


1.4 Scope

Whilst the aim of this work is to accelerate COSFIRE, it is not within the bounds of this work to optimise the algorithm itself or to ensure portability. The purpose of this work is to demonstrate the suitability of COSFIRE as an algorithmic component in real-time systems. This shall be done by hierarchically considering both task parallelism and data parallelism. The configuration phase and the derivation of the tuples and their invariant counterparts shall be omitted from the implementation. Invariance during the application of the COSFIRE filter has also been omitted due to the selected application for evaluation (traffic sign detection), which, as per its description in [AP13c], does not require invariance given the set of evaluation images. Following the implementation of the algorithm, the performance will be gauged against a sequential equivalent, such as the existing implementations of COSFIRE using OpenCV [?] or MATLAB [Azz]. The time and space complexities of the respective algorithms shall also be taken into consideration along with accuracy and frame rate.

1.5 Aims and Objectives

The aim of this project is to optimise the implementation of the COSFIRE algorithm through parallelization to a maximum execution time of 0.05 seconds for images of a resolution of 270 by 360 pixels. The selection of this resolution is due to its use in technologies such as surveillance cameras [Hal], whilst the target execution time of 0.05 seconds has been derived from the prevalence of systems performing at an average rate of 30 fps in the literature, both in implementation-specific papers and in survey papers such as [PBKE12]. This translates to an approximate execution time of 0.03 to 0.04 seconds per frame, rounded up to 0.05 seconds to introduce a margin accounting for elements beyond the scope of this project, such as external libraries or the decoding of input images.
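The frame-period arithmetic behind this target is simply the reciprocal of the frame rate:

1/30 fps ≈ 0.033 s per frame, rounded up to the 0.05 s target to leave the stated margin.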

1.6 Report Layout

The Background offers a description of object recognition, COSFIRE, and the selected model for parallelization. The Design and Specification provides an overview of the operations required by COSFIRE. The Implementation furthers this discussion by highlighting the salient aspects of the CUDA implementation. The Testing and Evaluation details the analysis of the implementation. The Related Work surveys existing work in the area of accelerated object recognition. The Conclusion discusses the results and future work.


2 Background

2.1 Object Detection and Recognition

Object detection and recognition is one of the fundamental areas in computer vision. It enables a variety of applications. However, object detection and recognition is rife with obstacles, especially when considered under real-time constraints. Methods must be robust: they must be tolerant to variations in illumination, geometric transformations, occlusion, etc. [Por06].

There exists a variety of techniques for object recognition, such as Hough Forests, which implement a probabilistic voting system for recognition [GL09]. COSFIRE differs from this approach in that an operator may be derived from a single prototype, in addition to its use of the geometric mean as opposed to the additive nature of the Hough transform [GL09]. Other techniques are DeCAF [DJV+13], which is based on convolutional neural networks, and scan-window based object recognition [ZN08]. The latter is based on Histogram of Oriented Gradients detectors and Support Vector Machine classifiers, and achieves an execution time of approximately 0.073 s per image of resolution 384 by 288 pixels [ZN08] once accelerated using fragment shader programs. There has also been research carried out into the derivation of heterogeneous computer architectures from neuroscience [HC14], which in turn has been utilised to implement the HMAX model for use in robotics [HC14]. The artificial bee colony algorithm (ABC) has also been used to detect edges in grayscale images [YB13], a characteristic of objects which is of paramount importance in the COSFIRE approach.

A combination of HOG and FIND descriptors is used in [MN11] in order to detect pedestrians. A visual vocabulary processor incorporated into a System on a Chip is presented in [CSH+12] which utilises the Scale Invariant Feature Transform, following the logic that the use of a hardware implementation will produce accelerated execution whilst reducing power consumption. There are also alternative approaches such as that presented in [CW13], where some class of feature is used with one-class SVMs, or the proposed use of sparselet models for multi-class object recognition in [SGZ+15].

SIFT has been the subject of extensive research, such as the vector matching processor on an SoC presented in [LKO+10]. A hardware approach was also pursued in [HHKC12]. SIFT has also been implemented using OpenMP for multi-core architectures [ZCZX08]. A similar line of investigation (SIFT and the Harris features) is taken in [PSK+12], where invasive computing is utilised in conjunction with MPSoC technology. SIFT has also been implemented (along with KLT) using OpenGL and CG in [SFPG06a], achieving a reported frame rate of 10 Hz for feature extraction on images of size 640 by 480 pixels.


Research has also been carried out into the design and implementation of a dedicated processor which is utilised to implement the CAVAM, which in [OKP+13] uses SIFT.

A review of the various options for the acceleration of object recognition is provided in the introductions of [HOT+10, OMVEC13]. These discuss the trade-off between ease of development and performance on various platforms, ranging from Application Specific Integrated Circuits to Field Programmable Gate Arrays and Digital Signal Processors.

2.2 COSFIRE

COSFIRE is a trainable filter which may be used for versatile keypoint detection and pattern recognition of local contours [AP13c]. It was inspired by "the function of a specific type of shape-selective neuron in area V4 of visual cortex" [AP13a] (which is seen in the context of the full visual system in [APZ+12]). Its applications include traffic sign detection and recognition, the detection of vascular bifurcations [AP13c, AP12, AP13a] and vessel delineation [ASVP15], handwritten digit recognition, handwritten letter and keyword spotting [AP14], and object spotting in complex scenes [AP14].

A COSFIRE filter or operator is first derived from an image intended for training the filter. This input is referred to as the prototype [ASVP15, AP14]. COSFIRE entails that a bank of N distinct Gabor filters be established. These filters are utilised to detect edges or lines, but may be replaced by some filter type that is sensitive to a different type of feature [AP13c]. Each of these filters is applied to the prototype in turn, producing a filter response each. The radius with respect to the centre of the prototype is also varied to detect different features [AP12]. Hence, there is a set of responses for each radius.

The strongest responses along each radius are selected, noting the wavelength, orientation and position of the response with respect to the centre of the filter (which may be visualised as that of the prototype). Each set of these four values forms a tuple of the form [λ, θ, ρ, φ]. Therefore, a COSFIRE operator may be described as a set of tuples which describe the dominant orientations present in the prototype and their mutual spatial arrangement [AP12, AP13a].
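For illustration, such a tuple and operator might be represented in CUDA C/C++ as follows; this is a sketch, and the names are hypothetical rather than those of the implementation described later.

```cuda
// Hypothetical representation of a single COSFIRE tuple [lambda, theta, rho, phi].
struct CosfireTuple {
    float lambda;  // wavelength of the strongest Gabor response
    float theta;   // orientation of that response (radians)
    float rho;     // radius of the response position from the filter centre
    float phi;     // polar angle of the response position (radians)
};

// A COSFIRE operator is then a set of tuples plus configuration parameters.
struct CosfireOperator {
    CosfireTuple* tuples;  // array of numTuples entries
    int numTuples;
    float sigma0, alpha;   // blur parameters (see Equation 1 later)
    float t1;              // thresholding fraction
};
```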

The configuration is an automatic process [AP12]. The resulting COSFIRE operator is then applied to any input images, which shall be referred to as the application phase. The result of the application of a COSFIRE operator is the weighted geometric mean of the blurred and shifted Gabor responses of the input image to the Gabor filters encoded in the tuples that constitute the COSFIRE operator.


The detailed algorithm for the application of a COSFIRE filter is outlined below:

1. Let N be the number of tuples constituting the filter. Apply each of the N Gabor filters to the image, producing N outputs.

2. Apply the appropriate Gaussian blur to each of the Gabor responses. The purpose of this step is to introduce tolerance with respect to the relative positions of the dominant orientations (or characteristic, depending on the nature of the filter employed in the configuration and step 1) [AP14].

3. Shift each response towards the centre of the filter. The shift is determined by the position at which the particular response was returned in the prototype during the configuration phase. The shifting also ensures that, should an instance of the prototype be present, the COSFIRE filter shall return a response in the centre of the instance of the prototype [AP14].

4. Evaluate the weighted geometric mean of all of the blurred and shifted responses. This step ensures that all component parts of the prototype are present. The magnitude of their combined response also indicates the likelihood of the response being a true positive.

Whilst many object recognition methods have been accelerated, the addition of COSFIRE to their number fills a void in the literature. Algorithms such as Viola-Jones are often presented in an application-specific manner, such as the use of a variation on the Viola-Jones algorithm for traffic sign detection in [ALT08]. Others utilise descriptors such as SIFT which, whilst displaying invariance to some geometric transformations (reported in [KPC+09] to be scale, rotation [LKO+10] and translation, with [HMS+07] also reporting a degree of tolerance to variations in lighting conditions), require higher dimensional encoding, thus increasing both their computational and algorithmic complexities.

The keypoint descriptor ORB [RRKB11] combines the FAST detector and the BRIEF feature descriptor, and is reported to exhibit rotational invariance and a noise immunity superior to SIFT and SURF. Scale invariance is stated in [RRKB11] to be implemented through a scale pyramid. Similarly sharing its origins in the ventral stream of the mammalian visual pathway with COSFIRE, the NEOVUS architecture and system implements both object recognition methods and classification in a real-time manner using a heterogeneous environment consisting of an FPGA and a laptop computer.


The COSFIRE approach, on the other hand, combines the responses of simpler filters into a single response that communicates the absence or presence of a more complex structure [AP13c, AP14]. Likewise, COSFIRE filters may be combined to prove or disprove the presence of an even more complex object [AP14], such as the segmentation of a shoe prototype in [AP14]. The use of the geometric mean in place of an additive operator when combining the individual responses adds a robustness against false positives triggered by one or more components of the prototype being present whilst others are absent [AP13c], increasing its immunity to noise [ASVP15]. COSFIRE is reported in [AP12] to differ from feature descriptors such as SIFT in this respect. However, this might cause a reduction in the accuracy of the filter when processing images where the object that would otherwise be detected is occluded. Despite this drawback, COSFIRE filters may be rendered invariant to rotation and scale, as well as reflection and contrast inversion, through additional computation applied to the tuples derived during the configuration phase [AP13c, AP12, AP14, AP13a]. Operating in any of the above invariant modes requires that the original COSFIRE operator and any filters derived through the above methods all be applied to the input and the maximum of their responses returned as the overall COSFIRE response. The manner in which this invariance is achieved is one of the advantages of COSFIRE, as highlighted in [AP14], as no additional training examples are required to generate the invariant filter.

2.3 Graphics Processing Units

The traditional process executed by GPUs is the synthesis of images as opposed to their decomposition. This inversion of roles [PBKE12], made possible by the advent of programmable GPUs, is explored in [FM08] and [PBKE12], which discuss the role GPUs have come to play in computer vision. The exposure of the considerable number of logic units in an application-neutral format has allowed the programming of graphics processing units to evolve from the mapping of non-graphics tasks to textures, vertices and shader programs [OHL+08] to a programming experience that differs little from traditional C programming.

In addition, any speed-up gained through parallelization using a GPU has a high probability of increasing as the technology advances [FM08, ABD+09], permitting increased amounts of parallelism to be exploited through increased resources [FM08]. The effectiveness of the CUDA programming model as applied to various classes of applications is analysed in [CBM+08].

The importance of these developments is brought into perspective when considering their adherence to the more recent trend towards increasing the core count as opposed to the development of higher-speed cores [OHL+08].


The growing trend towards the use of massively parallel architectures stems from the ceiling currently present on processing speeds and transistor density [PBKE12, ABD+09, OHL+08]. This has led to a change in paradigm [ABD+09].

The application of GPUs to problems other than those related to graphics has led to the use of the term GPU computing [NVI09]. Such computing has become prevalent throughout a variety of fields [WCN10, RG15, CSB12, QMN09]. CUDA in particular is used in [SSLMW09, SR10, LR09, MWMA09]. An analysis of general-purpose applications implemented using CUDA is offered in [CBM+08].

CUDA code requires CUDA-capable hardware. OpenCL allows for the programming of heterogeneous computing environments in a portable manner [KWm12], an obstacle for non-OpenCL implementations as recognised in [SFPG06b]. Such portability is, however, at the expense of programming convenience. A disadvantage of OpenCL is that, due to the broad spectrum of devices it caters for, not all features that it exposes are universally available. Hence, ensuring portability entails complexity. Implementations using OpenCL are thus more complex [KWm12], a trade-off which one might not consider justified for this proof-of-concept project.

Figure 1: A generic system containing both a host (CPU) and a device (GPU).

The GPU and its VRAM are referred to as the device. A GPU contains an amount of local memory and a number of multiprocessors. Each multiprocessor consists of a number of lockstep cores and miscellaneous computational and control components, along with a degree of internal private memory. In contrast to the construction of a CPU, the architectural emphasis of the multiprocessors within a GPU is placed on computation as opposed to memory hierarchy.

The following are the salient points of the CUDA programming model, as drawn from [Coo13, KWm12].


Figure 2: A generic multiprocessor found within an (NVIDIA) GPU.

CUDA endorses the basic principle of parallelism of distributing identical, independent tasks amongst two or more threads. However, in contrast to APIs such as OpenCL or POSIX threads, CUDA offers an additional layer of abstraction that allows for an easier mapping of multidimensional data to considerably large numbers of threads. As illustrated in Figure 3, a collection of threads all created as the result of a single kernel launch is referred to as a grid. The grid may be subdivided into blocks of threads. These blocks are allocated to the available multiprocessors on the device. A block may be allocated to no more than a single multiprocessor and is a non-preemptable unit [Coo13].
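As a minimal sketch of this mapping (not the project's actual kernel), a 2D image can be assigned one thread per pixel as follows:

```cuda
// One thread per pixel: each thread derives its own (x, y) coordinates
// from its block and thread indices within the grid.
__global__ void scalePixels(float* img, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)   // guard against threads in partial blocks
        img[y * width + x] *= factor;
}

// Launch with 16x16 blocks; the grid is rounded up to cover every pixel:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   scalePixels<<<grid, block>>>(d_img, width, height, 2.0f);
```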

Figure 3: The CUDA threading model.

Once a block is allocated to a multiprocessor, the threads are divided up into groups referred to as warps. Warps are groups of threads which are guaranteed to be executed simultaneously. All threads in one or more warps are hosted on the multiprocessor through the register file. This is what eliminates the overhead of context switching otherwise incurred on CPUs.


A context switch from one warp to another is the changing of the pointers directing the processors (each hosting a thread) to locations in the register file. This is done in order to mask the latency incurred by overhead-intensive actions such as global memory accesses or some computations. The programming model only guarantees that at least one warp will execute at any given moment (barring all warps being stalled). It is possible for more than one warp to execute simultaneously on the same Streaming Multiprocessor, depending on the number of instruction-issuing units present [Coo13].

The general structure of any implementation using the CUDA programming model is as follows: establish the data to be processed in RAM (host memory); allocate space for the data in the GPU memory space and transfer it to said memory space; invoke one or more kernels, which may or may not have host-side code interspersed between them; and copy back the results from the GPU memory space into RAM once the device-side processing has been completed.
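In skeleton form, assuming a placeholder kernel named process (error handling omitted), this pattern looks as follows:

```cuda
#include <cuda_runtime.h>

__global__ void process(float* d, int n) {      // placeholder computation
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= d[i];
}

void runOnDevice(const float* h_in, float* h_out, int n) {
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));                               // allocate device memory
    cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
    process<<<(n + 255) / 256, 256>>>(d_buf, n);                         // kernel invocation(s)
    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_buf);
}
```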

The CUDA programming model is compared to CPU vector processing instructions in [CBM+08], drawing parallels between their scalar natures. CUDA is stated in [CBM+08] to have the added advantage that no explicit management of vectorization is required. One of the drawbacks of the CUDA programming model, as highlighted in [CBM+08], is the absence of convenient inter-thread communication methods that go beyond the scope of a single block. Another, albeit hardware-induced, disadvantage discussed in [CBM+08] is the bottleneck that exists when transferring data between host and device in either direction. It subsequently advises that relocation of the computation to the GPU is preferable to the overhead incurred should said computation remain the jurisdiction of the CPU, thus incurring a data transfer between the two distinct memory spaces. Similarly, divergence/branching in warps increases execution time [CBM+08, Coo13].

GPUs have evolved from strictly Single Instruction Multiple Thread execution models [NVI09] to architectures with an increasing number of Multiple Instruction Multiple Data characteristics. An example is recursion, as first featured in [NVI10]. Such advances are expanding the horizons of the parallelism which GPU computing may utilise, such as the possibility of executing more than one kernel simultaneously on a single device, allowing the execution of distinct kernels to overlap. CUDA exposes this capability through streams, which may be utilised for both data- as well as task-level parallelism. A stream may be described as a queue of grids (Figure 5). Grids allocated to the same stream are executed in their order of invocation. The number of simultaneously executing grids is determined by the resources required by each grid at runtime.


Figure 4: Default stream usage. The default stream receives kernel calls and queues them in the order of issue. Therefore, a kernel invocation cannot be executed prior to the completion of all others before it in the queue. This queuing behaviour is transparent to the user unless explicit or implicit synchronization is employed.

Figure 5: Multiple stream usage. Once named streams are created, their identifier may be used to specify in which stream a given kernel is to execute. When a kernel is invoked with a named stream specified, the kernel call is queued in the specified stream.
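A sketch of the multiple-stream usage of Figure 5 is given below; gaborFilter and blur are illustrative kernel names, and grid, block, NUM_TUPLES, d_image and d_resp are assumed to be defined elsewhere.

```cuda
// Issue each tuple's work to its own stream so that independent kernels may
// overlap on the device; within a stream, order of issue is preserved.
cudaStream_t streams[NUM_TUPLES];
for (int i = 0; i < NUM_TUPLES; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < NUM_TUPLES; ++i) {
    // The fourth launch parameter names the stream (the third is the
    // dynamic shared-memory size, zero here).
    gaborFilter<<<grid, block, 0, streams[i]>>>(d_image, d_resp[i], i);
    blur<<<grid, block, 0, streams[i]>>>(d_resp[i], i);  // queued behind gaborFilter in stream i
}

cudaDeviceSynchronize();                  // wait for all streams to drain
for (int i = 0; i < NUM_TUPLES; ++i)
    cudaStreamDestroy(streams[i]);
```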

3 Design and Specification

A COSFIRE operator is composed of a set of tuples and a set of parameters defined during the configuration process. For the purposes of this project, the tuples and parameters defining the COSFIRE operator shall be hard-coded for the detection of a pedestrian traffic sign [Azz].

It is necessary to examine the result of the application of the COSFIRE operator in order to identify the maximum response. The result of the application is visually similar to the superimposition of all of the blurred Gabor responses. As the superimposition is executed using a product, sections of some responses are suppressed to a value of zero. It may occur that multiple areas of an image contain features satisfying the criterion of each filter, resulting in multiple positions having a non-negligible magnitude.


Such magnitudes should be considerably less than the largest magnitude present. Hence, the position at which this maximum occurs is the response most likely to be a true positive.

The application of the COSFIRE operator entails:

1. Use the normalized grayscale input image as the input for each Gabor filter, storing the result of each filter individually. Spatial two-dimensional Gabor filters are defined in [Mov02] as the product of a complex sinusoid referred to as the carrier and a two-dimensional Gaussian function referred to as the envelope.

Each Gabor filter is defined by three values: λi, the wavelength of the filter; θi, its orientation; and ψi, the phase offset. λi and θi together identify the thickness (where the thickness is equivalent to λi/2 [AP13c]) and orientation of the line/edge which triggers the filter response [AP13c]. The value of the standard deviation associated with a Gabor filter is used to determine the number of elements in the impulse response along a single dimension (the number of rows or columns) [PK97]. This standard deviation has been hard-coded to a universal value obtained during the configuration of the COSFIRE operator [Azz].

2. Identify the maximum positive value in each of the original Gabor responses. Utilise a fraction t1 of this maximum positive value and threshold the responses as follows: wherever the value of an element is less than t1 of the maximum positive value, set the element to zero. This step eliminates the negative values and truncates positive values less than the threshold to zero (a kernel sketch of this step is given after this list).

3. Blur the thresholded Gabor responses using the appropriate Gaussian filter. The appropriate Gaussian filter is determined by extrapolating its standard deviation σi from the tuple associated with the thresholded Gabor response. The value of σi is derived using Equation 1 as given in [AP13c]:

σi = σ0 + (α × ρi) (1)

where σ0 and α are values found at configuration and ρi is the radius component of the polar coordinates included in the tuple in question.


The size of the Gaussian impulse response along a single dimension is evaluated using the logic in Equation 2:

elementsi = ⌈6σi⌉ + 1 if ⌈6σi⌉ is even; ⌈6σi⌉ otherwise (2)

which ensures that the total number of elements in the impulse response is always odd.

4. Shift the blurred thresholded Gabor responses by ρi in the opposite direction of the angle φi. The values of ρi and φi are found in the blurred thresholded Gabor response's associated tuple. These are the polar coordinates indicating the position of the response with respect to the centre of the prototype [AP13c]. The effect is that all lines and/or edges appear visually to be superimposed over the same central point. This is essential for the next step.

5. The shifted, blurred, thresholded Gabor responses are then collated into a single response using multiplication. This geometric mean is a pixel-wise function used to superimpose the blurred and shifted responses into a single response to be returned by the COSFIRE filter. It is completed by taking the nth root of the collated response, where n is the number of tuples in the COSFIRE operator [AP12].

The phase offset for all Gabor filters in this implementation has been hard-coded to π/2. Hence, the sinusoid used to generate the Gabor filter has been offset from the origin by 90 degrees. The fraction t1, σ0 and α are COSFIRE operator parameters.
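As referenced in step 2 above, a minimal kernel sketch of the thresholding step is given here; the names are hypothetical, and maxVal is assumed to have been computed beforehand (e.g. by a parallel max-reduction over the response).

```cuda
// Sketch of step 2: truncate to zero every element below t1 * maxVal.
// Since t1 * maxVal >= 0, this also eliminates all negative values.
__global__ void thresholdResponse(float* resp, int n, float t1, float maxVal) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && resp[i] < t1 * maxVal)
        resp[i] = 0.0f;
}
```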

All steps of the application of the COSFIRE filter are to be mapped to GPU execution, similarly to [WRC08], where all feature descriptor processing steps are executed by the GPU. This has been done in order to minimise any latency incurred by the utilisation of the Peripheral Component Interconnect Express connection [WRC08]. Hence, the CPU acts as a controller, as done in [MWMA09]. This is a different approach to that taken in [KD10] - which accelerated face detection using the Viola-Jones algorithm, also using CUDA - where auxiliary preparatory steps were handled by the CPU in a more collaborative CPU-GPU system.

The approach that will be taken with COSFIRE is to exploit the independence of the tuples and execute these in parallel, whilst also accelerating the operations required by each tuple. In contrast to [MCC+10], there will be individual steps that will be executing concurrently. This has been made possible through the independence of the tuples during each step of the application of a COSFIRE operator [AP13a] and through the application of the reduction pattern as presented in [KWm12] and used in [PR09, YpSzXm11]. The reduction pattern has been used to parallelize the evaluation of the geometric mean.


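A pixel-wise sketch of this combination step follows; the names are hypothetical and the weights are omitted for brevity (the project applies the reduction pattern here, whereas the sketch shows only the simpler per-pixel form). Each thread multiplies the co-located values of the shifted, blurred responses and takes the nth root.

```cuda
#include <math.h>

// Sketch of the final step: geometric mean across numTuples responses.
// 'resp' is an array of device pointers, one response image per tuple,
// each holding n pixels.
__global__ void geometricMean(float** resp, int numTuples, int n, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float product = 1.0f;
    for (int t = 0; t < numTuples; ++t)
        product *= resp[t][i];                 // a zero anywhere suppresses the output
    out[i] = powf(product, 1.0f / numTuples);  // nth root completes the mean
}
```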

In addition to the higher-level design and parallelism discussed above, the approach taken in [MWMA09], which was to optimize each step in the hybrid local-search genetic algorithm for execution on the GPU, shall also be duly considered. This shall also be done for the four steps taken in applying a COSFIRE operator. The Gaussian filtering operation executed prior to the extraction of the SIFT features in the hardware implementation of object recognition in [HHKC12] was identified in the same work to be a bottleneck. There are several options for the implementation of Gaussian filtering: convolution in the spatial domain; two convolutions using the one-dimensional impulse responses which, when multiplied, are equivalent to the two-dimensional impulse response previously mentioned (this is referred to as a separable filter); and multiplication in the frequency domain, a property of which is that multiplication in one domain is equivalent to convolution in the other.

Both of the options for the application of the Gaussian filter in the spatial domain require that the filtering be executed out-of-place, thus doubling the space complexity. They also require that multiple elements in memory be accessed for each computation. As the dimensions of the Gaussian filter are derived from its standard deviation, which in turn is determined by the tuple under consideration, the separable methodology would be preferable to the traditional two-dimensional approach. However, separability is not a property common to all filters, in particular the Gabor filter.

Despite the objections in [CCIN14] against the use of the frequency domain for the execution of Gabor filtering, the frequency domain will still be employed here. Use of the frequency domain for Gabor filtering is stated in [CCIN14] to preclude the processing of ROIs and non-rectangular areas in isolation. This is a moot point with respect to COSFIRE, as neither technique is employed. The second objection presented in [CCIN14] is that the use of the frequency domain increases the memory requirements when variable image sizes are expected, as a padded copy of the filter must be generated for each potential image size. Such variability is beyond the scope of this implementation. Thirdly, the results in [Pod07a, Pod07b] indicate that larger spatial domain filters perform poorly when executed on a GPU. Hence, the departure from the spatial domain, which is reportedly a closer model to the mammalian visual system [CCIN14], is justified.


4 Implementation

Three important principles of parallel design on many-core architectures are highlighted in [ABC+06]. Firstly, remote accesses should not be made unnecessarily and should by extension be minimized. Secondly, load balancing should be given due consideration. Thirdly, the use of synchronization should also be monitored. These three principles also apply to programming using GPUs. A transfer of data in either direction between host and device is equivalent to remote access, being a high-latency operation. Load balancing may be compared to the measure of occupancy, which in itself minimizes the amount of idle time a streaming multiprocessor experiences. It may also be compared to the distribution of data-parallel or task-parallel operations across multiple CUDA streams, thus reducing the idle time on a device-wide scale. Synchronization is consistently considered an overhead-intensive operation, which may lead to idle time in itself as threads await others. Such synchronization may be explicit, such as a device-wide synchronization call, or within a warp, where half of the threads within the warp are forced to be idle as they await the other half to complete execution of a branch in a conditional control flow (warp divergence).

Several existing libraries and code sources have been utilised; specifically, the NVIDIA cuFFT library, the NVIDIA CUDA SDK, and OpenCV. Both NVIDIA cuFFT and the SDK are available as part of the CUDA 7.5 toolkit [NVIa]. OpenCV [Ope] has been used to generate the Gaussian and Gabor impulse responses. The CUDA SDK is the source of the padding functions and the cuFFT methodology employed in this implementation.

The new proposed pipeline for the application of a COSFIRE operator bears some resemblance to that presented in [PR09]. As with fastHOG [PR09], the new COSFIRE pipeline executes padding on the GPU. The difference is that whilst the former utilises implicit padding through strategic copying of data, the padding utilised in this work is sourced from the CUDA SDK [NVIa]. It has been modified to be compatible with the cuFFT library data types.

Similar to the approach taken in [SSLMW09], each kernel in this implementation represents the sequential work to be done on a single element. As observed in [Har09], the use of too many threads causes saturation that results in delays due to the queuing of threads as they await access to resources. This problem is partially mitigated when considering concurrent grid launches in the implementation of COSFIRE by creating only as many streams as the device can handle.

One of the techniques essential to this implementation of COSFIRE is the use of streams. Each tuple composing the COSFIRE operator is treated as both task- and data-parallel, thus utilising two different levels of parallelism as done in [Har09] and [SR10], albeit through distinct streams as opposed to the use of multiple GPUs [Har09].


The similarity with the approach taken in [SR10] is that multiple pixels within a single image or response may be processed concurrently, both within the same block and between different blocks, and that pixels from memory spaces corresponding to different tuples may also be processed in parallel.

The maximum number of concurrent grid launches - which correlates to the number of streams that could theoretically be simultaneously active - is verified through a start-up runtime call which returns the compute capability of the device.
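Such a query may be issued through the runtime API, as in the sketch below; cudaGetDeviceProperties also reports whether concurrent kernel execution is supported at all.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Concurrent kernels supported: %s\n",
           prop.concurrentKernels ? "yes" : "no");
    return 0;
}
```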

Despite developments in GPU architectures to accelerate matrix operations through the use of shared memory [NVI10], such an approach is not feasible for the CUDA implementation of COSFIRE. Firstly, the impulse responses used both when blurring and for edge detection vary in their dimensions. This complicates the amount of shared memory to be allocated to each kernel. The traditional implementation of convolution utilising the spatial domain is exemplified in [Pod07b], which refers to one of the samples in the CUDA SDK. Secondly, larger kernels have demonstrated diminished performance when utilising shared memory [Pod07a]. Hence, the approach proposed in [Pod07a] has been used, which allows the implementation of efficient convolution to be generalized to larger non-separable filters [Pod07a].

In addition to the observations and calculations presented in [Pod07a], which conclude that the Fast Fourier Transform approach becomes advantageous as the image and/or impulse response dimensions increase, this method is suitable for this implementation for a number of specific reasons. Although Gaussian impulse responses are separable, Gabor impulse responses are not.

The cuFFT implementation of the FFT has a computational complexity of O(n log n) [NVIb]. Separable convolution is O(p(m + n)), where p is the number of pixels in the image and m and n are the dimensions of the impulse response. Element-wise multiplication has a linear complexity of O(n).

The pipeline for the processing of a single tuple where only the Gabor filtering is executed in the frequency domain has a computational complexity of

O(n log n) + O(n) + O(n log n) + O(n(x + y)) (3)

A pipeline where both the Gabor and blurring filtering operations are executed in the frequency domain has a computational complexity of



O(n log n) + O(n) + O(n) + O(n log n) (4)

Whilst the two complexities both reduce to approximately linear complexity, the implementation of the latter approach produces cleaner and more compact code. The difference between the two complexities becomes apparent when considering impulse responses with greater numbers of elements. The implementation of the latter pipeline described above is presented diagrammatically in Figure 6.

The FFT-based two-dimensional approach to convolution entails the spatial reorganization of the filter values, in addition to the padding of both the reorganized filter and the input image, before the convolution may be computed through multiplication in the frequency domain [Pod07a]. In order to prevent the default behaviour of cyclic convolution of the FFT, the padding of the image utilises a clamp-to-border approach [Pod07a]. These functions, along with the auxiliary functions for the calculation of the dimensions of the new memory spaces from the CUDA SDK, have been modified whilst retaining the features that round up these padded dimensions to sizes that attain the best possible precision and performance with respect to the cuFFT library [Pod07a].
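In outline, one FFT-based filtering operation with cuFFT follows the pattern sketched below. This is a minimal sketch assuming already-padded single-precision inputs and hypothetical buffer names, not the project's exact code; note the explicit 1/(H·W) scaling, since cuFFT leaves its transforms unnormalised.

```cuda
#include <cufft.h>

// Pointwise complex multiplication: the frequency-domain equivalent of
// spatial convolution. 'scale' folds in the 1/(H*W) normalisation.
__global__ void complexMulScale(cufftComplex* a, const cufftComplex* b,
                                int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    cufftComplex r;
    r.x = (a[i].x * b[i].x - a[i].y * b[i].y) * scale;
    r.y = (a[i].x * b[i].y + a[i].y * b[i].x) * scale;
    a[i] = r;
}

// Host-side outline for one filtering operation on a padded H x W image.
// d_img and d_filt hold the padded image and the padded, spatially
// reorganised filter; d_imgF and d_filtF are their frequency-domain buffers.
void fftConvolve(cufftReal* d_img, cufftReal* d_filt,
                 cufftComplex* d_imgF, cufftComplex* d_filtF,
                 cufftReal* d_out, int H, int W) {
    int nFreq = H * (W / 2 + 1);          // size of the R2C output
    cufftHandle fwd, inv;
    cufftPlan2d(&fwd, H, W, CUFFT_R2C);
    cufftPlan2d(&inv, H, W, CUFFT_C2R);
    cufftExecR2C(fwd, d_img,  d_imgF);    // image to the frequency domain
    cufftExecR2C(fwd, d_filt, d_filtF);   // filter to the frequency domain
    complexMulScale<<<(nFreq + 255) / 256, 256>>>(d_imgF, d_filtF, nFreq,
                                                  1.0f / (H * W));
    cufftExecC2R(inv, d_imgF, d_out);     // back to the spatial domain
    cufftDestroy(fwd);
    cufftDestroy(inv);
}
```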

5 Testing and Evaluation

The following section shall evaluate the implementation on two fronts: the first shall be to assess the results of each stage by comparing these with those produced by the existing MATLAB implementation. This shall determine the merit of the work done from the perspective of maintaining correctness. The second shall be to evaluate the performance of each step in the algorithm using the NSight Visual Profiler, as available with the CUDA 7.5 toolkit, and timings.

An image-by-image comparison between the results acquired using the existing MATLAB implementation and those obtained using the implementation presented in this report has been included in the appendix. These confirm that the pedestrian traffic sign is correctly detected in all sixteen test images. Table 1 compares the time taken for each image in the respective implementations. The performance of two versions of the accelerated implementation shall be compared along with the existing MATLAB implementation [Azz]. The first employs data parallelism but not task parallelism. This comparison has been carried out in order to determine whether the additional effort necessary to manage multiple streams, both explicitly in the implementation and in terms of any implicit overhead incurred at driver and scheduling levels, is justified by any performance gains.


Figure 6: Current COSFIRE Implementation. Every Gabor and Gaussian filter is generated and transformed into the frequency domain. The shifts are converted from polar to Cartesian coordinates. The normalized input image is transformed into the frequency domain and multiplied with the FFT of each Gabor filter. The results are then thresholded by a value local to each response and blurred. Any negative values are then set to zero. The responses are then shifted accordingly and their geometric mean taken.

In order to assess the performance of the implementation independently of external factors such as the loading of the image, the time taken to read in the image shall be ignored. Secondly, the steps necessary to set up the Gabor and Gaussian filters will also be excluded from this evaluation.
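Device-side timings of this kind can be gathered with the standard CUDA event mechanism; in the sketch below, runPipeline is a placeholder for the COSFIRE application steps under test.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

void runPipeline();   // placeholder for the kernels and cuFFT calls under test

void timePipeline() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    runPipeline();                            // work being measured
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);               // block until 'stop' has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("Execution time: %.3f s\n", ms / 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```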

The average execution time using the MATLAB implementation is 2.662 seconds. The average execution time for the same set of images using the CUDA implementation which executes the tuples in a sequential manner (without streams) is 0.533 seconds. The average execution time when using the streamed CUDA implementation that processes tuples concurrently is 0.279 seconds.

Hence, a 4.994x speedup is achieved by the sequential-tuple CUDA implementation with respect to the MATLAB implementation. A 9.541x speedup is obtained using the concurrent-tuple CUDA implementation with respect to the MATLAB implementation. The speedup gained between the two CUDA implementations is 1.91x.
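These figures follow directly from the averages above, as speedup = T_MATLAB / T_CUDA (or the ratio of the two CUDA times): 2.662/0.533 ≈ 4.99x, 2.662/0.279 ≈ 9.54x and 0.533/0.279 ≈ 1.91x.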

The theoretical speedup cannot be extrapolated and compared to the observed speedup due to differences between the MATLAB implementation and both of the CUDA implementations.


Table 1: A comparison of the execution times in seconds.

Image            MATLAB     CUDA without streams   CUDA with streams
pedestrian_001   2.799399   0.531                  0.297
pedestrian_002   2.638611   0.516                  0.282
pedestrian_003   2.627206   0.547                  0.266
pedestrian_004   2.640140   0.532                  0.265
pedestrian_005   2.714690   0.516                  0.281
pedestrian_006   2.663668   0.515                  0.281
pedestrian_007   2.632859   0.531                  0.281
pedestrian_008   2.675649   0.515                  0.281
pedestrian_009   2.683719   0.547                  0.282
pedestrian_010   2.642838   0.531                  0.282
pedestrian_011   2.629064   0.531                  0.281
pedestrian_012   2.642524   0.547                  0.282
pedestrian_013   2.634648   0.547                  0.281
pedestrian_014   2.631263   0.546                  0.282
pedestrian_015   2.657334   0.532                  0.281
pedestrian_016   2.674898   0.547                  0.266

The former applies an optimization where only unique Gabor filters are applied to the input image. These are then stored in a hash table and copied for as many unique blurring operations as there are for each Gabor response. These in turn are also hashed and copied once again for each unique shifting operation applicable to each of the blurred Gabor responses [Azz]. This optimization has not been implemented here; the current implementation considers the worst-case scenario where each tuple in the COSFIRE filter requires a unique Gabor filter that is not common to any of the other tuples.

The profiler output has been included in the appendix. The conclusions drawn from the profiler output are summarised here. The default-stream CUDA implementation displays a densely populated queue of kernel calls. The element-wise multiplication process comprises the majority of the computation component of both the streamed and non-streamed versions of the implementation. The next most significant kernels are revealed to be those responsible for the thresholding of the responses and the correction of the upscaling incurred by the use of the cuFFT library implementation of the FFT and IFFT (also for both implementations). A driving observation in favour of the streamed version of the implementation is the profiler-calculated compute utilization percentage of 37.3%; the streamed version increased this metric to 40.6%.

One particular concern is the identification of periods of idle time in the streams of the streamed version of the application.


If the kernel or function invocations are synchronous, thus causing the host to block until the requested task has been run to completion, the purpose of utilising named CUDA streams is rendered void. This is not the case for the bespoke kernels. Whether or not the cuFFT functions are synchronous or asynchronous has yet to be verified. The profiler output for the streamed version of the CUDA implementation exhibits something of the desired parallelism; however, there appear to be instances where there is a delay in issuing the kernel invocations to the streams. This might be symptomatic of some synchronous portion of device code that is preventing control from returning immediately to the host. This is a hypothesis that requires further investigation in future work.

The most commonly encountered kernel-specific issue reported by the profiler is that the kernels exhibit a high level of instruction and memory latency. This might be remedied in future work through the amalgamation of kernels where possible to improve the ratio of memory accesses to computational actions. An issue reported by the profiler only for the streamed version of the implementation is that the amount of data handled in single memory copy operations is not sufficiently large to efficiently utilise the available memory bandwidth. Further insights might be elicited from the profiling process if the analysis were to be repeated using code modified to avail itself of the more advanced profiling features exposed by the API. The implementations should also be evaluated on different hardware to identify any device-specific caveats.

6 Related Work

One of the motivations for the acceleration of the execution of COSFIRE through GPU technology is the prevalence of GPU implementations of alternative object recognition methods in the literature, which through such acceleration achieve state-of-the-art performance. The typical speedup expected when accelerating computer vision algorithms using GPUs is stated in [FM08] as ranging between 2x and 10x and above, given intensive optimisation for the latter. One example is the acceleration of the HOG feature computation using CUDA as presented in [YpSzXm11], where a substantial speed-up of 10x was reportedly achieved through a direct mapping of threads to pixels. The execution time to compute the HOG features for a single image of 1201 by 960 pixels reported in [YpSzXm11] is 0.125 seconds. Similarly, speed-ups for both SIFT and SURF descriptor computation are reported in [CMS11], where GPU implementations were developed for both algorithms.


Parallelization through the massive Single Instruction Multiple Data capabilities of GPUs maps in an immediately apparent fashion to biologically inspired object recognition methods. This is due to the "highly parallel nature of the brain" [WRC08], in particular the visual cortex [WRC08]. To this end, substantial speed-ups have been observed through the acceleration of object recognition algorithms through GPU technology, such as those reported in [KD10] and [PR09]. A study of the acceleration of object recognition and detection along with a selection of closely associated algorithms using CUDA is presented in [Har09], placing its emphasis on the Viola-Jones object detection algorithm as applied to images ranging from 300 by 300 pixels to 2900 by 1600 pixels, with varying degrees of speedup achieved.

Both KLT feature tracking and SIFT feature extraction have been implemented with the aim of accelerating their execution to meet real-time constraints using GPU technology in [SFPG06b]. These implementations are reported to have achieved execution rates of tracking 10000 features at 30 Hz on a resolution of 1024 by 768 pixels and extracting 800 features at 10 Hz at a resolution of 640 by 480 pixels respectively. The acceleration of SIFT is also the subject of [HMS+07].

Another example of such acceleration is the biologically inspired classifier and descriptor pair presented in [WRC08], which is reported to have achieved an execution time of 270 ms for feature extraction and object classification with a classifier composed of over 250000 feature descriptors [WRC08]. It is similar to COSFIRE in that it identifies the Gabor filters that may best be used to describe the input. It differs from COSFIRE in its method of Gabor filter identification: it optimizes the filter parameters, as opposed to COSFIRE's selection of the filters from a bank of statically defined Gabor filters. As is the case with COSFIRE, the features presented in [WRC08] may also be rendered invariant to rotation, transformation and changes in scale through further computation.

The GPU-implemented detector presented in [CBLN09] is applied to 640 by 480 pixel resolution grayscale images, reporting an execution time for object recognition of under 5 seconds. CUDA is used in [PR09] to accelerate the HOG-based sliding window algorithm to real-time execution, claiming an execution time of 261 milliseconds for grayscale images of resolution 1280 by 960, with the expectation of an execution rate of 10 fps using multiple GPUs. CUDA is also used in conjunction with OpenMP and SSE in [KPC+09] to produce a hybrid software solution implementing SIFT and a Parallel-Descriptor, where the different concurrency technologies are used to implement separate components of the system. The same system reportedly achieved a feature extraction rate of 1100 features at an average frame rate of 15 Hz on autonomous mobile robots [KPC+09].

21

Page 29: Real Time Object Recognition in Videos with a Parallel Algorithm

of 15 Hz on autonomous mobile robots [KPC+09].

Purely hardware implementations of object recognition and tracking, such as the CAVIAR vision system [SGOL+09], illustrate the performance advantage that hardware implementations hold over software implementations. This emulation of the brain, whilst efficient, entails considerable hardware requirements [SGOL+09] that reduce such systems' applicability. Albeit scalable through additional hardware units [SGOL+09], this scalability is subject to the limitations mentioned previously when considered beyond environments with high resource availability. Networks on a Chip (NoCs) have also been applied, as in the approach taken in [KLK+09], which combined SIMD and MIMD on a configurable processor in conjunction with a cellular neural network to achieve approximately real-time object recognition at 22 fps. Component parts of object recognition techniques, such as the Gabor filters used for feature extraction, have also been accelerated using hardware implementations based on FPGAs [CCIN14]. Hardware implementations are often concerned with power consumption, as with the BONE-V vision processors developed in [OKH+12], which are used within an NoC set-up. Another instance of a processor tailored to image processing is [TSN+12], in which case the processor was subsequently integrated into an SoC.

Impressive performance has been reported for FPGA implementations [CED12, KT11, NSS+11, JvDS11, BWRZ13]; an example is a traffic sign recognition system that achieves an execution rate of 60 fps at a resolution of 1280 by 720 pixels [ADGMA13]. In spite of this considerable body of work, which one would intuitively expect to execute faster than software solutions, there exists work that proves the contrary. One example is the GPU implementation of the Viola-Jones face detection algorithm in [HOT+10], which achieved approximately FPGA-level performance (at the time of writing) using a multiple-GPU set-up, performing at 15.2 fps on 4 Tesla GPUs [HOT+10].

Such work has been superseded in more recent years, as exemplified by [OMVEC13], in which the biologically inspired Hierarchical Model and X (HMAX) model achieved a performance rate of 190 images per second [OMVEC13]. Whilst reported in [OMVEC13] to be scale and transformation invariant with conditional rotational invariance, this particular FPGA implementation targets images of a resolution of 128 by 128 pixels [OMVEC13].

Hybrid approaches combining both FPGA and GPU technology also exist, such as the pedestrian detection system presented in [BKDB10], illustrating the point raised in [FM08] that some algorithms "benefit from being split between the CPU and GPU" [FM08]. In this particular system the GPU is used to classify the features extracted by the FPGA, the two implementing an SVM and HOG respectively. In contrast with this approach, an FPGA and a number of DSP units are used to implement the SVM in [KT12], which exhibited a frame rate that varied with the application of the system. That system is reported to have achieved a rate of approximately 10 fps, translating into an execution time of 100 ms per frame, which is marked in [BKDB10] as a reason for outsourcing further tasks to the GPU in addition to its then-current load of the SVM. This is further motivation for confining computer vision implementations to the GPU, rather than splitting them across multiple heterogeneous software components. Another method of constructing a hybrid system is to utilise both the GPU and the CPU (device and host), as done in [MN11] to detect both pedestrians and vehicles using HOG and FIND descriptors; this is one of the few examples of assistive road technology not confined to the detection of a single class of object or road user. Its CUDA implementation is reported to achieve an execution rate of 23.8 fps (0.042 s) on 640 by 480 images [MN11], making its performance target similar to the aim of this work.

Real-time object recognition has also been achieved through modifications to existing algorithms, as exemplified by the SURFTrac algorithm presented in [TCGP09], which bypasses the computation of the SURF descriptor; such an approach will not be investigated in this work. There also exist algorithms such as ORB [RRKB11] which have been reported to perform well on moderately large images (640 by 480) on mobile phones [RRKB11]. Similar resolutions are tackled in [BKDB10], whose pedestrian detection system works with images of a resolution of 800 by 600 pixels. The most readily available processing unit, the CPU, has also been utilised to exploit parallelism, as may be seen in the real-time traffic sign detection system presented in [ALT08]; its use of a multi-core processor to accelerate the Viola-Jones algorithm illustrates that data and task parallelism need not be exploited on massively parallel architectures in order to achieve real-time performance. There has also been research into object recognition on mobile devices [LJC12], which achieved a real-time frame rate on 240 by 320 images using SVMs and CS-LBP.

Object recognition has also been extended to three-dimensional objects, as may be seen in [DUNI10], which bears some similarity to the work done in [GL09] in that it utilises a voting scheme "similar to the Generalized Hough Transform" [GL09]. CUDA in particular has been utilised to implement object pose estimation systems, as seen in [PGBP10].

Such a use follows naturally from the convenience of CUDA's high-level interface, which allows the direct mapping of data layouts of up to three dimensions. Similarly, object recognition techniques have also been used to derive methods for particular problems, such as human pose recognition [SFC+11]. Another example of the acceleration of three-dimensional object recognition and pose estimation is [CMS11], which employs a hybrid approach utilising both the GPU and the CPU for the processing in the MOPED system it presents. The SHOT three-dimensional object descriptor has also been implemented using CUDA in [HN15], where it was observed that the mutual independence of the keypoints facilitated the parallelisation of their processing.
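The three-dimensional mapping referred to above can be sketched as follows; this is a generic illustration of CUDA's thread hierarchy, not code from [PGBP10], and the per-voxel operation is a placeholder. CUDA grids and blocks may themselves have up to three dimensions, so a volume maps directly onto the thread hierarchy:

```cuda
// Sketch: one thread per voxel of an nx x ny x nz volume, using the
// three-dimensional grid and block layout that CUDA exposes directly.
__global__ void scaleVoxels(float* vol, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < nx && y < ny && z < nz)
        vol[(z * ny + y) * nx + x] *= 2.0f;  // placeholder per-voxel operation
}

// dim3 block(8, 8, 8);
// dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);
// scaleVoxels<<<grid, block>>>(d_vol, nx, ny, nz);
```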

[PGBP10] is an example of an algorithm designed with its target platform (in its case, the GPU) clearly selected and defined. Whilst this allows ample opportunity for optimisation at a low level of design, it does not necessarily entail portability, in terms of either execution or performance. COSFIRE has been designed independently of any particular platform, and its implementation may therefore require reorganisation and/or the addition of steps to the algorithm. This is discussed further in the Design and Specification section.

In addition, COSFIRE's algorithm is not tailored to any one particular class of object. A number of systems exchange general-purpose applicability for improved execution rates. An example of such a dedicated system is [MCC+10], a road-sign detector derived from particle swarm optimisation; road sign detection is one of the applications in which COSFIRE has been employed with success [?]. The system presented in [MCC+10], implemented using CUDA, achieved a full-frame performance rate (at a resolution of 750 by 480 pixels) of at least 20 fps, with an average execution time of 0.05 s [MCC+10].

7 Conclusion

This work has illustrated that an execution time of 0.279 seconds is possible for COSFIRE when implemented using CUDA in such a way that both data parallelism and task parallelism are utilised. Whilst this falls short of the target execution time of 0.05 seconds, further speedup may be obtainable through more aggressive optimisation, particularly through a more sophisticated and efficient filtering structure.

This would be the first of several potential avenues for future work building on the results presented here. A diagram illustrating an overview of this proposed revision is provided in Figure 7. Similarly, it is stated in [AP13c] that, should two or more tuples share the same values of λ, θ, and ρ, the same blurred Gabor response may be used to derive the result of each tuple by shifting it appropriately, thus extending a similar optimisation to the blurring phase [Azz].
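A host-side sketch of how this optimisation might be structured is given below. It assumes a cache keyed on (λ, θ, ρ); computeBlurredGabor and shiftResponse are hypothetical helpers standing in for the blurring and shifting steps, not functions from the present implementation:

```cuda
#include <map>
#include <tuple>

struct CosfireTuple { double lambda, theta, rho, phi; };

// Hypothetical helpers standing in for the blurring and shifting steps.
double* computeBlurredGabor(const CosfireTuple& t, int width, int height);
void    shiftResponse(const double* d_resp, double rho, double phi);

// Cache of blurred Gabor responses (device pointers), keyed on the
// (lambda, theta, rho) values that determine the blurred response.
using Key = std::tuple<double, double, double>;
static std::map<Key, double*> blurredCache;

double* blurredResponse(const CosfireTuple& t, int width, int height)
{
    Key k{t.lambda, t.theta, t.rho};
    auto it = blurredCache.find(k);
    if (it != blurredCache.end())
        return it->second;                          // reuse the cached response
    double* d_resp = computeBlurredGabor(t, width, height);
    blurredCache.emplace(k, d_resp);
    return d_resp;
}

// Each tuple then derives its own result by shifting (a copy of) the
// shared blurred response according to its polar offset, e.g.:
//   shiftResponse(blurredResponse(t, w, h), t.rho, t.phi);
```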

It is highlighted in [AP13c] that the COSFIRE approach permits the use of filters other than Gabor filters. By operating in the frequency domain, the computational complexity of the filtering operations should in theory remain constant irrespective of the type of filter employed; the limiting factor is that it must be possible to take the FFT of the filter in question. This implementation also exploits the multiplicative property of the frequency domain to execute the filtering as a data-parallel, element-wise multiplication that is efficient in terms of memory usage owing to the absence of shared data. This approach has been favoured in this implementation because, in a spatial-domain implementation, the amount of shared data is directly proportional to the standard deviation of the Gaussian impulse response. Whilst the Gaussian impulse responses utilised by the COSFIRE operator are relatively small, their size is also directly proportional to the radius of the filter as determined by the size of the prototype. Larger impulse responses (filter kernels) would therefore entail greater shared memory requirements, which in turn reduce the number of blocks that may be hosted on a single multiprocessor within the GPU.
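The structure of this frequency-domain filtering can be sketched with cuFFT [NVIb] as below. This is a simplified illustration under the assumption of same-size, pre-padded operands, not the project's exact code; cuFFT's real-to-complex transform stores only roughly half the spectrum, and its transforms are unnormalised, hence the 1/(width·height) factor folded into the point-wise multiplication:

```cuda
#include <cufft.h>

// Point-wise complex multiplication in the frequency domain, with the
// 1/N normalisation required because cuFFT transforms are unnormalised.
__global__ void pointwiseMultiply(cufftDoubleComplex* img,
                                  const cufftDoubleComplex* filt,
                                  int n, double scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftDoubleComplex a = img[i], b = filt[i];
        img[i].x = (a.x * b.x - a.y * b.y) * scale;
        img[i].y = (a.x * b.y + a.y * b.x) * scale;
    }
}

// Filter a width x height image with a filter already transformed into
// the frequency domain (d_filterF), overwriting d_image with the result.
void fftFilter(double* d_image, cufftDoubleComplex* d_scratch,
               const cufftDoubleComplex* d_filterF, int width, int height)
{
    cufftHandle fwd, inv;
    cufftPlan2d(&fwd, height, width, CUFFT_D2Z);
    cufftPlan2d(&inv, height, width, CUFFT_Z2D);

    cufftExecD2Z(fwd, d_image, d_scratch);

    int nFreq = height * (width / 2 + 1);       // D2Z keeps ~half the spectrum
    double scale = 1.0 / (double)(width * height);
    pointwiseMultiply<<<(nFreq + 255) / 256, 256>>>(d_scratch, d_filterF,
                                                    nFreq, scale);

    cufftExecZ2D(inv, d_scratch, d_image);
    cufftDestroy(fwd);
    cufftDestroy(inv);
}
```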

The results obtained using these implementations refer to an operator applied only in the mode defined by its prototype, and not in invariant mode; the implementation of the invariant mode may be the subject of future work. As previously stated, further analysis of these implementations is also required. In addition, further testing using larger and/or alternative datasets of both images and video could be undertaken, along with analysis across the various GPU micro-architectures currently available, in conjunction with further tuning of parameters such as the kernel launch grid specifications.

Another aspect that merits consideration is the reorganisation of data, as mentioned in [KPC+09]. Whereas [KPC+09] considers reorganisation from the perspective of alignment for compatibility with SIMD instructions, reorganisation might also prove effective in maximising memory bandwidth utilisation within the GPU; the present implementation uses a mapping which, whilst straightforward, might not yield the optimal pattern of memory access. Further optimisation of the processing on the GPU at runtime might be possible using texture memory, in a manner similar to that seen in [HMS+07] or [MN11].
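As an illustration of why the access pattern matters (a generic example, not the project's kernels): with row-major storage, indexing by y·width + x lets the consecutive threads of a warp touch consecutive addresses, which the hardware coalesces into few memory transactions, whereas transposed indexing produces strided, uncoalesced accesses:

```cuda
// Coalesced: threads with consecutive x touch consecutive addresses.
__global__ void copyCoalesced(const double* in, double* out,
                              int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x];
}

// Uncoalesced: threads with consecutive x are a full column apart in
// memory, so each warp generates many separate memory transactions.
__global__ void copyStrided(const double* in, double* out,
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[x * height + y] = in[x * height + y];
}
```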

Alternative implementations of the constituent operations of the COSFIRE algorithm could also be the subject of future work, such as an investigation of the method in [KLK+09], where Gabor and Gaussian filtering is implemented using cellular neural networks.

Figure 7: Proposed revision of the COSFIRE implementation.

References

[ABC+06] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, and Samuel Webb Williams. The landscape of parallel computing research: A view from Berkeley, 2006.

[ABD+09] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, and John Wawrzynek. A view of the parallel computing landscape. Communications of the ACM, 52(10):56–67, 2009.

[ADGMA13] N. Aguirre-Dobernack, H. Guzman-Miranda, and M. A. Aguirre. Implementation of a machine vision system for real-time traffic sign recognition on FPGA. In Industrial Electronics Society, IECON 2013 - 39th Annual Conference of the IEEE, pages 2285–2290, 2013.

[ALT08] R. Ach, N. Luth, and A. Techmer. Real-time detection of traffic signs on a multi-core processor. In Intelligent Vehicles Symposium, 2008 IEEE, pages 307–312, June 2008.

[And00] Greg R. Andrews. Foundations of multithreaded, parallel, and distributed programming. Addison-Wesley Longman Publishing Co., Inc., 2000.

[AP12] George Azzopardi and Nicolai Petkov. Detection of retinal vascular bifurcations by rotation-, scale- and reflection-invariant COSFIRE filters. In Computer-Based Medical Systems (CBMS), 2012 25th International Symposium on, pages 1–4. IEEE, 2012.

[AP13a] George Azzopardi and Nicolai Petkov. Automatic detection of vascular bifurcations in segmented retinal images using trainable COSFIRE filters. Pattern Recognition Letters, 34(8):922–933, 2013.

[AP13b] George Azzopardi and Nicolai Petkov. Automatic detection of vascular bifurcations in segmented retinal images using trainable COSFIRE filters. Pattern Recognition Letters, 34(8):922–933, 2013.

[AP13c] George Azzopardi and Nicolai Petkov. Trainable COSFIRE filters for keypoint detection and pattern recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(2):490–503, 2013.

[AP14] George Azzopardi and Nicolai Petkov. Ventral-stream-like shape representation: from pixel intensity values to trainable object-selective COSFIRE models. Frontiers in Computational Neuroscience, 8, 2014.

[APZ+12] M. Aguilar, M. A. Peot, Jiangying Zhou, S. Simons, Yuwei Liao, N. Metwalli, and M. B. Anderson. Real-time unconstrained object recognition: A processing pipeline based on the mammalian visual system. Pulse, IEEE, 3(2):53–56, 2012.

[ASVP15] George Azzopardi, Nicola Strisciuglio, Mario Vento, and Nicolai Petkov. Trainable COSFIRE filters for vessel delineation with application to retinal images. Medical Image Analysis, 19(1):46–57, 2015.

[Azz] George Azzopardi. Trainable COSFIRE filters for keypoint detection and pattern recognition. http://www.mathworks.com/matlabcentral/fileexchange/37395-trainable-cosfire-filters-for-keypoint-detection-and-pattern-recognition.

[BKDB10] S. Bauer, S. Kohler, K. Doll, and U. Brunsmann. FPGA-GPU architecture for kernel SVM pedestrian detection. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 61–68, 2010.

[BWRZ13] R. Brzoza-Woch, A. Ruta, and K. Zieliński. Remotely reconfigurable hardware–software platform with web service interface for automated video surveillance. Journal of Systems Architecture, 59(7):376–388, 2013.

[CBLN09] Adam Coates, Paul Baumstarck, Quoc Le, and Andrew Y. Ng. Scalable learning for object detection with GPU hardware. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 4287–4293. IEEE, 2009.

[CBM+08] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. A performance study of general-purpose applications on graphics processors using CUDA. Journal of Parallel and Distributed Computing, 68(10):1370–1380, 2008.

[CCIN14] YongCheol Peter Cho, Nandhini Chandramoorthy, Kevin M. Irick, and Vijaykrishnan Narayanan. Accelerating multiresolution Gabor feature extraction for real time vision applications. Journal of Signal Processing Systems, 76(2):149–168, 2014.

[CED12] Tam P. Cao, Darrell Elton, and Guang Deng. Fast buffering for FPGA implementation of vision-based object recognition systems. Journal of Real-Time Image Processing, 7(3):173–183, 2012.

[CMS11] Alvaro Collet, Manuel Martinez, and Siddhartha S. Srinivasa. The MOPED framework: Object recognition and pose estimation for manipulation. The International Journal of Robotics Research, 2011.

[Coo13] Shane Cook. CUDA programming: a developer's guide to parallel computing with GPUs. Morgan Kaufmann, Elsevier, 2013.

[CSB12] Gabriele Capannini, Fabrizio Silvestri, and Ranieri Baraglia. Sorting on GPUs for large scale datasets: A thorough comparison. Information Processing & Management, 48(5):903–917, 2012.

[CSH+12] Tse-Wei Chen, Yu-Chi Su, Keng-Yen Huang, Yi-Min Tsai, Shao-Yi Chien, and Liang-Gee Chen. Visual vocabulary processor based on binary tree architecture for real-time object recognition in full-HD resolution. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 20(12):2329–2332, 2012.

[CW13] B. Cyganek and M. Wozniak. A framework for image analysis and object recognition in industrial applications with the ensemble of classifiers. In Emerging Technologies & Factory Automation (ETFA), 2013 IEEE 18th Conference on, pages 1–4, 2013.

[DJV+13] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.

[DUNI10] B. Drost, Markus Ulrich, N. Navab, and S. Ilic. Model globally, match locally: Efficient and robust 3D object recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 998–1005, 2010.

[FM08] J. Fung and S. Mann. Using graphics devices in reverse: GPU-based image processing and computer vision. In Multimedia and Expo, 2008 IEEE International Conference on, pages 9–12, 2008.

[GL09] J. Gall and V. Lempitsky. Class-specific Hough forests for object detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1022–1029, 2009.

[Hal] Mike Haldas. CCTV resolution.

[Har09] Jesse Patrick Harvey. GPU acceleration of object classification algorithms using NVIDIA CUDA. 2009.

[HC14] Andreas Holzbach and Gordon Cheng. A neuron-inspired computational architecture for spatiotemporal visual processing. Biological Cybernetics, 108(3):249–259, 2014.

[HHKC12] Feng-Cheng Huang, Shi-Yu Huang, Ji-Wei Ker, and Yung-Chang Chen. High-performance SIFT hardware accelerator for real-time image feature extraction. Circuits and Systems for Video Technology, IEEE Transactions on, 22(3):340–351, 2012.

[HMS+07] S. Heymann, K. Müller, A. Smolic, B. Froehlich, and T. Wiegand. SIFT implementation and optimization for general-purpose GPU. 2007.

[HN15] Linjia Hu and Saeid Nooshabadi. G-SHOT: GPU accelerated 3D local descriptor for surface matching. Journal of Visual Communication and Image Representation, 30:343–349, 2015.

[HOT+10] D. Hefenbrock, J. Oberg, Nhat Thanh, R. Kastner, and S. B. Baden. Accelerating Viola-Jones face detection to FPGA-level using GPUs. In Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, pages 11–18, 2010.

[JvDS11] Harald Jordan, Walter van Dyck, and Rene Smodič. A co-processed contour tracing algorithm for a smart camera. Journal of Real-Time Image Processing, 6(1):23–31, 2011.

[KD10] Jiangang Kong and Yangdong Deng. GPU accelerated face detection. In Intelligent Control and Information Processing (ICICIP), 2010 International Conference on, pages 584–588, 2010.

[KLK+09] Kwanho Kim, Seungjin Lee, Joo-Young Kim, Minsu Kim, and Hoi-Jun Yoo. A configurable heterogeneous multicore architecture with cellular neural network for real-time object recognition. Circuits and Systems for Video Technology, IEEE Transactions on, 19(11):1612–1622, 2009.

[KPC+09] Junchul Kim, Eunsoo Park, Xuenan Cui, Hakil Kim, and W. A. Gruver. A fast feature extraction in object recognition using parallel processing on CPU and GPU. In Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on, pages 3842–3847, 2009.

[KT11] C. Kyrkou and T. Theocharides. A flexible parallel hardware architecture for AdaBoost-based real-time object detection. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 19(6):1034–1047, 2011.

[KT12] C. Kyrkou and T. Theocharides. A parallel hardware architecture for real-time object detection with support vector machines. Computers, IEEE Transactions on, 61(6):831–842, 2012.

[KWm12] David B. Kirk and Wen-mei W. Hwu. Programming massively parallel processors: a hands-on approach. Newnes, 2012.

[LJC12] A. Lameira, R. Jesus, and N. Correia. Local object detection and recognition in mobile devices. In Systems, Signals and Image Processing (IWSSIP), 2012 19th International Conference on, pages 193–196, 2012.

[LKO+10] Seungjin Lee, Joonsoo Kwon, Jinwook Oh, Junyoung Park, and Hoi-Jun Yoo. A 92mW 76.8GOPS vector matching processor with parallel Huffman decoder and query re-ordering buffer for real-time object recognition. In Solid State Circuits Conference (A-SSCC), 2010 IEEE Asian, pages 1–4, 2010.

[LR09] L. Ligowski and W. Rudnicki. An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–8, 2009.

[Mac94] C. Machover. Four decades of computer graphics. Computer Graphics and Applications, IEEE, 14(6):14–19, 1994.

[MCC+10] Luca Mussi, Stefano Cagnoni, Elena Cardarelli, Fabio Daolio, Paolo Medici, and PierPaolo Porta. GPU implementation of a road sign detector based on particle swarm optimization. Evolutionary Intelligence, 3(3-4):155–169, 2010.

[MN11] T. Machida and T. Naito. GPU & CPU cooperative accelerated pedestrian and vehicle detection. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 506–513, 2011.

[Mov02] Javier R. Movellan. Tutorial on Gabor filters. Open Source Document, 2002.

[MWMA09] Asim Munawar, Mohamed Wahib, Masaharu Munetomo, and Kiyoshi Akama. Hybrid of genetic algorithm and local search to solve MAX-SAT problem using NVIDIA CUDA framework. Genetic Programming and Evolvable Machines, 10(4):391–415, 2009.

[NSS+11] Federico Nava, Donatella Sciuto, Marco Domenico Santambrogio, Stefan Herbrechtsmeier, Mario Porrmann, Ulf Witkowski, and Ulrich Rueckert. Applying dynamic reconfiguration in the mobile robotics domain: A case study on computer vision algorithms. ACM Trans. Reconfigurable Technol. Syst., 4(3):29:1–29:22, August 2011.

[NVIa] NVIDIA. CUDA Toolkit. https://developer.nvidia.com/cuda-downloads.

[NVIb] NVIDIA. cuFFT. http://docs.nvidia.com/cuda/cufft/#axzz47O9KtK6P.

[NVI09] NVIDIA Corporation. Whitepaper: NVIDIA's next generation CUDA compute architecture: Fermi. 2009.

[NVI10] NVIDIA Corporation. Whitepaper: NVIDIA GF100, world's fastest GPU delivering great gaming performance with true geometric realism. 2010.

[OHL+08] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.

[OKH+12] Jinwook Oh, Gyeonghoon Kim, Injoon Hong, Junyoung Park, Seungjin Lee, Joo-Young Kim, Jeong-Ho Woo, and Hoi-Jun Yoo. Low-power, real-time object-recognition processors for mobile vision systems. Micro, IEEE, 32(6):38–50, 2012.

[OKP+13] Jinwook Oh, Gyeonghoon Kim, Junyoung Park, Injoon Hong, Seungjin Lee, Joo-Young Kim, Jeong-Ho Woo, and Hoi-Jun Yoo. A 320 mW 342 GOPS real-time dynamic object recognition processor for HD 720p video streams. Solid-State Circuits, IEEE Journal of, 48(1):33–45, 2013.

[OMVEC13] G. Orchard, J. G. Martin, R. J. Vogelstein, and R. Etienne-Cummings. Fast neuromimetic object recognition using FPGA outperforms GPU implementations. Neural Networks and Learning Systems, IEEE Transactions on, 24(8):1239–1252, 2013.

[Ope] OpenCV. OpenCV. http://opencv.org/.

[PBKE12] Kari Pulli, Anatoly Baksheev, Kirill Kornyakov, and Victor Eruhimov. Real-time computer vision with OpenCV. Communications of the ACM, 55(6):61–69, 2012.

[PGBP10] InKyu Park, Marcel Germann, Michael D. Breitenstein, and Hanspeter Pfister. Fast and automatic object pose estimation for range images on the GPU. Machine Vision and Applications, 21(5):749–766, 2010.

[PK97] Nikolai Petkov and Peter Kruizinga. Computational models of visual neurons specialised in the detection of periodic and aperiodic oriented visual stimuli: bar and grating cells. Biological Cybernetics, 76(2):83–96, 1997.

[Pod07a] Victor Podlozhnyuk. FFT-based 2D convolution. NVIDIA white paper, page 32, 2007.

[Pod07b] Victor Podlozhnyuk. Image convolution with CUDA. NVIDIA Corporation white paper, June, 2097(3), 2007.

[Por06] Fatih Porikli. Achieving real-time object detection and tracking under extreme conditions. Journal of Real-Time Image Processing, 1(1):33–40, 2006.

[PR09] Victor Prisacariu and Ian Reid. fastHOG - a real-time GPU implementation of HOG. Department of Engineering Science, 2310, 2009.

[PSK+12] J. Paul, W. Stechele, M. Krohnert, T. Asfour, and R. Dillmann. Invasive computing for robotic vision. In Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific, pages 207–212, 2012.

[QMN09] Deyuan Qiu, Stefan May, and Andreas Nüchter. GPU-accelerated nearest neighbor search for 3D registration, pages 194–203. Computer Vision Systems. Springer, 2009.

[RG15] I. Z. Reguly and M. B. Giles. Finite element algorithms and data structures on graphical processing units. International Journal of Parallel Programming, 43(2):203–239, 2015.

[RRKB11] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564–2571, 2011.

[SFC+11] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1297–1304, 2011.

[SFPG06a] Sudipta N. Sinha, Jan-Michael Frahm, Marc Pollefeys, and Yakup Genc. GPU-based video feature tracking and matching. In EDGE, Workshop on Edge Computing Using New Commodity Architectures, volume 278, page 4321, 2006.

[SFPG06b] Sudipta N. Sinha, Jan-Michael Frahm, Marc Pollefeys, and Yakup Genc. GPU-based video feature tracking and matching. In EDGE, Workshop on Edge Computing Using New Commodity Architectures, volume 278, page 4321, 2006.

[SGOL+09] R. Serrano-Gotarredona, M. Oster, P. Lichtsteiner, A. Linares-Barranco, R. Paz-Vicente, F. Gomez-Rodriguez, L. Camunas-Mesa, R. Berner, M. Rivas-Perez, T. Delbruck, Shih-Chii Liu, R. Douglas, P. Hafliger, G. Jimenez-Moreno, A. C. Ballcels, T. Serrano-Gotarredona, A. J. Acosta-Jimenez, and B. Linares-Barranco. CAVIAR: A 45k neuron, 5M synapse, 12G connects/s AER hardware sensory–processing–learning–actuating system for high-speed visual object recognition and tracking. Neural Networks, IEEE Transactions on, 20(9):1417–1438, 2009.

[SGZ+15] Hyun Oh Song, R. Girshick, S. Zickler, C. Geyer, P. Felzenszwalb, and T. Darrell. Generalized sparselet models for real-time multiclass object recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 37(5):1001–1012, 2015.

[SR10] T. R. P. Siriwardena and D. N. Ranasinghe. Accelerating global sequence alignment using CUDA compatible multi-core GPU. In Information and Automation for Sustainability (ICIAFs), 2010 5th International Conference on, pages 201–206, 2010.

[SSLMW09] Haixiang Shi, B. Schmidt, Weiguo Liu, and W. Muller-Wittig. Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–8, 2009.

[TCGP09] Duy-Nguyen Ta, Wei-Chao Chen, N. Gelfand, and K. Pulli. SURFTrac: Efficient tracking and continuous object recognition using local feature descriptors. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2937–2944, 2009.

[TSN+12] Y. Tanabe, M. Sumiyoshi, M. Nishiyama, I. Yamazaki, S. Fujii, K. Kimura, T. Aoyama, M. Banno, H. Hayashi, and T. Miyamori. A 464GOPS 620GOPS/W heterogeneous multi-core SoC for image-recognition applications. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, pages 222–223, 2012.

[WCN10] Fan Wu, Chung-Han Chen, and H. Narang. An efficient acceleration of symmetric key cryptography using general purpose graphics processing unit. In Emerging Security Information Systems and Technologies (SECURWARE), 2010 Fourth International Conference on, pages 228–233, 2010.

[WRC08] K. Woodbeck, G. Roth, and Huiqiong Chen. Visual cortex on the GPU: Biologically inspired classifier and feature descriptor for rapid recognition. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–8, 2008.

[YB13] E. Yigitbasi and N. Baykan. Edge detection using artificial bee colony algorithm (ABC). Int. J. Inf. Electron. Eng., 3(6):634–638, 2013.

[YpSzXm11] Chen Yan-ping, Li Shao-zi, and Lin Xian-ming. Fast HOG feature computation based on CUDA. In Computer Science and Automation Engineering (CSAE), 2011 IEEE International Conference on, volume 4, pages 748–751, 2011.

[ZCZX08] Qi Zhang, Yurong Chen, Yimin Zhang, and Yinlong Xu. SIFT implementation and optimization for multi-core systems. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–8, April 2008.

[ZN08] Li Zhang and R. Nevatia. Efficient scan-window based object detection using GPGPU. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–7, 2008.

Appendices

Comparison of MATLAB and CUDA Implementation Results

The following table compares the x and y coordinates of the maximum value, which represents the location of the pedestrian traffic sign within the image. As may be seen below, individual components of the coordinates vary by no more than a single pixel. This may be due to differences in how the implementations round the shift quantity to an integer value. The position at which the traffic sign is detected may therefore be said to be consistent across the implementations.
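One plausible mechanism for such off-by-one differences, offered as a hedged illustration rather than a confirmed diagnosis: MATLAB's round() rounds halfway cases away from zero, whereas C's rint() (in the default rounding mode) and CUDA's round-to-nearest intrinsics round halfway cases to the nearest even integer, so a shift quantity ending in .5 lands on different integers:

```cuda
#include <cmath>
#include <cstdio>

int main()
{
    double shift = 70.5;  // hypothetical fractional shift quantity
    // MATLAB-style rounding: halfway cases go away from zero.
    printf("round: %d\n", (int)std::round(shift));  // prints 71
    // rint() in the default rounding mode: halfway cases go to even.
    printf("rint:  %d\n", (int)std::rint(shift));   // prints 70
    return 0;
}
```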

The original coordinates were obtained from the output of the MATLAB implementation, whilst the coordinates in the remaining columns were obtained from the non-streamed and streamed CUDA implementations respectively (NS denotes non-streamed). The image number in the first column is the suffix appended to 'pedestrian_' to form the corresponding image name.

Table 2: A comparison of the positions of the maxima.

Image No.  Original X  NS X  Streamed X  Original Y  NS Y  Streamed Y
001        70          71    71          245         245   245
002        64          64    64          282         281   281
003        57          58    58          257         257   257
004        59          60    60          266         265   265
005        61          61    61          289         289   289
006        80          80    80          280         280   280
007        77          77    77          271         270   270
008        69          69    69          276         276   276
009        69          69    69          262         262   262
010        63          64    64          262         262   262
011        76          76    76          283         283   283
012        29          30    30          222         222   222
013        61          62    62          244         244   244
014        78          78    78          258         258   258
015        84          84    84          233         233   233
016        68          69    68          278         278   278

The values at the points listed in the previous table are tabulated below for comparison. There is a substantial difference between the values obtained using the two CUDA implementations and those acquired from the output of the existing MATLAB implementation. This may be due to the scaling performed in the CUDA implementations, both through the use of the FFT and through the scaling applied as part of the geometric mean to prevent values from being truncated to zero by the double-precision data type.
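For context on the FFT-related part of this scaling (a general property of cuFFT rather than a claim about the exact factors at play here): cuFFT transforms are unnormalised, so a forward followed by an inverse 2D transform multiplies every sample by width·height. Unless the reciprocal factor is applied, as sketched below, the output values sit on a different scale; a uniform positive scaling preserves the relative ordering of the responses, and hence the position of the maximum, which is consistent with the tables above.

```cuda
// Undo the width*height factor introduced by an unnormalised
// forward-plus-inverse cuFFT transform pair.
__global__ void normalise(double* data, int n, double invN)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= invN;   // invN = 1.0 / (width * height)
}
```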

Table 3: A comparison of the values of the maxima.

Image Name      Original Maximum  Non-Streamed Maximum  Streamed Maximum
pedestrian_001  0.07894813        0.001406473           0.001891758
pedestrian_002  0.087941377       0.001373211           0.001790502
pedestrian_003  0.07361524        0.001255828           0.001671857
pedestrian_004  0.090321376       0.001492714           0.002038221
pedestrian_005  0.068051945       0.001031039           0.001269836
pedestrian_006  0.079880015       0.001209435           0.001528052
pedestrian_007  0.060314041       0.000951517           0.001177675
pedestrian_008  0.098023692       0.001671446           0.002365485
pedestrian_009  0.089268479       0.001360287           0.001839803
pedestrian_010  0.105406726       0.001541581           0.00213989
pedestrian_011  0.086563368       0.001361394           0.001804847
pedestrian_012  0.071153954       0.001453064           0.001983714
pedestrian_013  0.087038578       0.001364283           0.001857092
pedestrian_014  0.080976226       0.001297458           0.001759806
pedestrian_015  0.089224119       0.001305467           0.001733329
pedestrian_016  0.075825626       0.001388505           0.001898157

[Figures: qualitative comparison of the detection results for pedestrian_001 through pedestrian_016. Each figure shows (a) the MATLAB result, (b) the COSFIRE CUDA result without streams, and (c) the COSFIRE CUDA result with streams.]

Visual Profiler Screenshots and Analysis

[Screenshots: Profile Results 1 through 19, captured using the Visual Profiler.]

GPU Specification

The specification of the GPU used during this FYP.
