

MEVBench: A Mobile Computer Vision Benchmarking Suite

Jason Clemons, Haishan Zhu, Silvio Savarese, and Todd Austin
University of Michigan

Electrical Engineering and Computer Science
Ann Arbor, Michigan

{jclemons, francisz, silvio, austin}@umich.edu

Abstract

The growth in mobile vision applications, coupled with the performance limitations of mobile platforms, has led to a growing need to understand computer vision applications. Computationally intensive mobile vision applications, such as augmented reality or object recognition, place significant performance and power demands on existing embedded platforms, often leading to degraded application quality. With a better understanding of this growing application space, it will be possible to more effectively optimize future embedded platforms. In this work, we introduce and evaluate a custom benchmark suite for mobile embedded vision applications named MEVBench. MEVBench provides a wide range of mobile vision applications such as face detection, feature classification, object tracking and feature extraction. To better understand mobile vision processing characteristics at the architectural level, we analyze single and multithreaded implementations of many algorithms to evaluate performance, scalability, and memory characteristics. We provide insights into the major areas where architecture can improve the performance of these applications in embedded systems.

1 Introduction

Computer vision brings together multiple fields such as machine learning, image processing, probability, and artificial intelligence to allow computer systems to analyze and act upon scenes. Computer vision algorithms continue to enter the mainstream, from the projection of special effects in Hollywood movie scenes to information overlays on navigation systems. Many applications allow new user experiences, such as augmented reality games on portable gaming systems, which allow users to experience 3D objects rendered in real-world scenes, as shown in Figure 1. These applications are expanding quickly into the mobile robotics realm, where devices such as 3D cameras are being used for navigation. The mobile computing space continues to see large growth in vision applications as mobile devices such as tablets and smartphones gain more capable imaging devices. This, coupled with the proliferation of smartphones and tablets, is leading to mobile computer vision becoming a key application domain in embedded computing.

Figure 1. Augmented Reality. The figure shows an example of augmented reality available on mobile platforms. The left image shows the original scene. In the right image a red cube frame has been rendered in proper perspective as though attached to the marker. Current mobile computing devices are capable of rendering detailed objects into the scene.

Figure 2. Vision Overview. The figure shows a typical vision pipeline. Features are extracted from an image and reasoned about based on prior knowledge. Feature reasoning is used in concert with scene reasoning to produce contextual knowledge about the scene. In this example the face is found in the image.

Despite this meteoric rise in mobile vision applications, there are limited benchmarks and analyses of the various computation kernels that mobile computer vision applications use. An overview of a typical computer vision pipeline can be seen in Figure 2. The pipeline begins with processing of the scene to locate features, or distinctive image characteristics such as corners, edges, or high contrast areas. These features usually have a signature computed, called the feature descriptor, that is used to reference the feature and provide comparisons. The features are then used to drive scene reasoning. In many cases they are matched to a known-object database to determine object semantic information and scene context. This information is then refined to determine contextual knowledge about the scene, such as the presence of an object or a person.
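
To make the pipeline concrete, the following is a minimal sketch using OpenCV's 2.x-era features2d API; the choice of ORB features, a brute-force matcher, and the image file names are illustrative assumptions rather than MEVBench's actual code.

// Sketch of the Figure 2 pipeline: extract features, compute descriptors,
// and match them against a known object. ORB/BFMatcher are stand-ins.
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <vector>

int main() {
    cv::Mat scene  = cv::imread("scene.png", 0);   // query image (grayscale)
    cv::Mat object = cv::imread("object.png", 0);  // trained-object image

    cv::ORB orb;                                   // feature extraction stage
    std::vector<cv::KeyPoint> kpScene, kpObject;
    cv::Mat descScene, descObject;
    orb(scene, cv::Mat(), kpScene, descScene);
    orb(object, cv::Mat(), kpObject, descObject);

    cv::BFMatcher matcher(cv::NORM_HAMMING);       // feature reasoning: match against the database
    std::vector<cv::DMatch> matches;
    matcher.match(descScene, descObject, matches);

    // Scene reasoning (geometric checks, voting, pose estimation) would follow here.
    return 0;
}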

In this work, we analyze many key vision algorithms that are particularly apt for the mobile computing space. We examine their architectural characteristics, including memory demands, execution time, and microarchitectural performance. Our key contributions in this work are:

• We assemble a mobile computer vision benchmark suite, composed of key applications and kernels for the mobile space. We draw on existing vision benchmarks and computer vision libraries to provide an initial collection that broadly samples the application domain.

• We perform a detailed microarchitectural analysis of the MEVBench benchmarks, indicating their hotspots, performance, memory characteristics and scalability.

• To better assess the efficiency of computer vision benchmarks on potential hardware platforms, we develop two new performance metrics that measure control complexity and 2D spatial locality.

• Finally, we present insights into how future embedded architectures can better serve the mobile computer vision application domain.

2 Previous Work

The computer vision community continues to work toward solving many open problems but focuses primarily on optimizing algorithm accuracy. For example, vision efforts include the Middlebury data set for disparity/depth generation [16], the Pascal Visual Object Classes Challenge for classification of various objects [11], and the Daimler Occluded Pedestrian benchmark [10], all of which focus on recognition accuracy and present the core of their results using precision versus recall curves. While accuracy is important for the general problem set, the mobile space introduces additional important constraints on computation capabilities and power usage. While there has been some investigation into the mobile vision space in recent years [20], it is just now becoming a focus.

Previous vision benchmark efforts have either looked at computer vision components as a part of a larger benchmark or provided benchmarks with basic timing analysis. In this work we focus specifically on mobile computer vision applications and provide detailed, architecture-level information on their computational demands.

VisBench contains a face recognition benchmark for the development and analysis of visual computing [14]. It groups graphics and computer vision together. However, computer vision and graphics are distinctly different problems. In graphics, there is a full 3D model that must be presented to users, typically in 2D. Computer vision, on the other hand, takes a set of information, such as 2D images, and attempts to reconstruct the information that was lost in the transformation from 3D to 2D. While some components are similar, studying computer vision algorithms (and in particular mobile vision algorithms) will best serve our broader purpose of designing more efficient mobile vision computing platforms.

In PARSEC, bodytrack is the only computer vision benchmark [2]. While including this benchmark in PARSEC aids in the development and optimization of general purpose processors, there are now many more important vision applications, such as augmented reality, that have become common in the realm of computer vision. Thus we provide a benchmark analysis focused only on computer vision.

There are embedded benchmarks; however, they do not target the growing field of embedded computer vision. MiBench provides an embedded benchmark suite but does not contain any direct computer vision benchmarks [13]. The Embedded Microprocessor Benchmark Consortium provides benchmark suites for embedded computing, such as CoreMark [8] and MultiBench [9], but none are targeted at the embedded computer vision space.

OpenCV is an open source computer vision library [3]. It provides many low level vision and machine learning algorithms for use in computer vision application development. It is widely used and has been optimized for various platforms such as ARM and x86. It is capable of providing a vision framework to develop a wide range of applications, but to date it has not been thoroughly benchmarked. We used the OpenCV framework to develop our custom benchmarks.

SD-VBS is a benchmark suite that provides single-threaded versions of many important computer vision applications [33]. It provides basic implementations of the algorithms for use in benchmarking. We incorporate a number of the SD-VBS benchmarks into our collection of applications, but we also broaden the effort to include a number of full-scale computer vision applications that are apt for the mobile space, such as augmented reality and SURF feature extraction. In addition, we include parallelized versions of key vision kernels, as exploitation of explicit parallelism will likely be a critical factor in the design of successful mobile vision computing platforms. To our knowledge, MEVBench is the first mobile computer vision benchmark suite.


Table 1. Benchmarks in MEVBench.

Benchmark                 Input Type        Multithreaded

Feature Extraction
  SIFT                    Image             Yes
  SIFT (SD-VBS)           Image             No
  SURF                    Image             Yes
  HoG                     Image             Yes
  FAST and BRIEF          Image             Yes

Feature Classification
  SVM                     Feature Vectors   Yes
  SVM (SD-VBS)            Feature Vectors   No
  Adaboost                Feature Vectors   Yes
  K Nearest Neighbor      Feature Vectors   Yes

Multi-image Processing
  Tracking (SD-VBS)       Image Sequence    No
  Disparity (SD-VBS)      Image Pairs       No
  Image Stitch (SD-VBS)   Image Set         No

Recognition Applications
  Object Recognition      Image             Yes
  Face Detection          Image             No
  Augmented Reality       Image             No

3 Benchmark Details

MEVBench is targeted at mobile embedded systems such as the ARM A9 and Intel Atom processors that are common in smartphones and tablets. These devices are gaining in popularity [7] and acquiring more capable cameras and mobile processors [25][27]. Mobile embedded systems differ from typical desktop systems in that they are more concerned with size, energy and power constraints. This typically leads to lower computational power along with fewer memory resources. MEVBench provides full applications, such as augmented reality, along with components of common vision algorithms such as SIFT feature extraction and SVM classification. Table 1 summarizes the MEVBench benchmarks. The algorithms are built using the OpenCV framework unless otherwise noted. For the OpenCV benchmarks, we used the framework and some of the functions OpenCV provides, but we assemble the applications using custom code. Furthermore, we developed a custom framework for multithreading vision benchmarks based on Pthreads. The included SD-VBS benchmarks are a subset of the SD-VBS benchmark suite, chosen because they are suited for the mobile vision space [33].

3.1 Feature Extraction

Features are key characteristics of a scene or data. Typical image features include corners, edges, intensity gradients, shapes and blobs. Feature extraction is the process of locating features within a scene and generating signatures to represent each feature. This is a key component of most computer vision applications, and the quality of a feature is based on its invariance to changes in the scene. A high quality feature is invariant to viewpoint, orientation, lighting and scale. Feature extraction quality is, in general, proportional to the algorithm's computational demand [5]. Our benchmark provides a wide variety of feature extraction algorithms to accommodate this characteristic vision workload, from the high quality and computationally intensive Scale Invariant Feature Transform (SIFT) to the low quality but efficient FAST corner detector.

3.1.1 Scale Invariant Feature Transform (SIFT)

SIFT is a common feature extraction algorithm that is used to localize features and generate robust feature descriptors. SIFT descriptors are invariant to scale, lighting, viewpoint and orientation of the given feature. It is commonly used in applications that involve specific instance recognition such as object recognition, tracking and localization, and panoramic image stitching.

SIFT is a robust feature detection and extraction algorithm. SIFT first creates an image pyramid using iterative Gaussian blurring [21]. A difference of Gaussians (DoG) pyramid is then formed by taking the difference between the pixel intensities of two adjacent images in the initial image pyramid. The DoG pyramid is searched for extrema pixel locations that are greater than all their neighbors in both the current image and the images at adjacent scales. If this is true, the point is a potential feature point. The localization of a possible feature is further refined using 3D curve fitting. The refined potential feature points are then filtered based on their resemblance to edges and their contrast. Once a point is located, the descriptor is formed using the gradients of the image in the region around the feature point. The 128-entry feature descriptor is then normalized to aid in illumination invariance. The SIFT algorithm, while computationally expensive, provides a high level of invariance to changes in illumination, scale or rotation.
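
As an illustration of the DoG stage described above, the following minimal sketch builds a Gaussian stack and differences adjacent levels; the base sigma, scale step, and level count are assumptions, and extrema search, refinement, and descriptor construction are omitted. It is not the MEVBench or SD-VBS implementation.

// Builds a difference-of-Gaussians stack: blur the base image at increasing
// scales, then subtract adjacent levels.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>
#include <cmath>

void buildDoGPyramid(const cv::Mat& gray, int levels,
                     std::vector<cv::Mat>& dog) {
    std::vector<cv::Mat> gauss(levels);
    double sigma = 1.6;                              // assumed base sigma
    gray.convertTo(gauss[0], CV_32F);
    for (int i = 1; i < levels; ++i) {
        double s = sigma * std::pow(1.4142, i);      // ~sqrt(2) scale step per level
        cv::GaussianBlur(gauss[0], gauss[i], cv::Size(), s, s);
    }
    dog.resize(levels - 1);
    for (int i = 0; i < levels - 1; ++i)
        dog[i] = gauss[i + 1] - gauss[i];            // difference of Gaussians
}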

MEVBench has two different implementations of SIFT. The first is the single-threaded version from the SD-VBS benchmark suite. This version is a self-contained implementation of the feature point localization phase of SIFT, which is commonly referenced as DoG localization. Furthermore, this version is optimized for code understandability. The second version is a multithreaded version of SIFT built using the OpenCV framework. This version of SIFT is based on the implementation provided by Vedaldi [32]. The MEVBench version can be scaled by number of threads and input size.

3.1.2 Speeded Up Robust Features (SURF)

SURF is a commonly used feature extraction alternative to SIFT. Its native form produces a smaller feature descriptor than SIFT and takes less computation time [1]. However, a comparison of the performance of the two algorithms gives mixed results based on the application being used [18]; for example, SIFT performs better under rotations while SURF is slightly more viewpoint invariant. Overall, SURF is more commonly used in embedded systems because of its relatively low computational complexity. Similar to SIFT, SURF is used in applications that involve specific instance recognition such as object recognition, tracking and localization, and panoramic image stitching.

SURF uses integral images to approximate image convolutions. This allows for fast computation of regional information once the integral images have been computed. Furthermore, it uses a second order derivative computation (Hessian matrix) and box filters to localize the feature point locations. The algorithm uses multiple sizes of the filters to find feature points at different scales. The locations are then filtered using non-maxima suppression, where only the strongest signal in an area is used. The feature descriptor is computed using gradient-like computations, specifically Haar wavelets, for an oriented region. The SURF descriptor is based on the SIFT descriptor [1]. MEVBench contains a multithreaded version of SURF. This version can be scaled using thread counts and input sizes.
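
The integral-image trick that makes SURF's box filters cheap can be sketched as follows; this is illustrative code rather than the OpenCV SURF implementation, and the filter size and position are arbitrary.

// Once the integral image is built, the sum over any rectangle costs four lookups.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>

double boxSum(const cv::Mat& integral, int x, int y, int w, int h) {
    // integral is (rows+1)x(cols+1), CV_64F, as produced by cv::integral
    return integral.at<double>(y + h, x + w) - integral.at<double>(y, x + w)
         - integral.at<double>(y + h, x)     + integral.at<double>(y, x);
}

int main() {
    cv::Mat img = cv::imread("frame.png", 0), ii;
    cv::integral(img, ii, CV_64F);            // one pass builds the integral image
    double s = boxSum(ii, 10, 20, 9, 9);      // 9x9 box filter response at (10, 20)
    (void)s;
    return 0;
}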

3.1.3 Histogram of Oriented Gradients (HoG)

HoG is commonly used for human or object feature detection [6]. HoG uses image gradients to describe features within an image. To locate features, the algorithm uses a sliding window technique where feature descriptors are computed at all possible locations within the image and compared against a database of possible feature descriptors. If the descriptors match, a possible object of interest has been found at the given location. HoG feature descriptors are computed using an array of cells. A histogram of the gradient directions for each cell is computed, and then the cells are grouped into blocks and normalized. The resulting histograms are concatenated together to form the descriptor. The HoG algorithm is slower than some lower quality feature extractors, but it provides good illumination invariance and a small level of rotational invariance. MEVBench contains a multithreaded version of HoG. This version can be scaled by thread count and input size.
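
A usage sketch with OpenCV's HOGDescriptor is shown below; the default 64x128 window, cell layout, and people detector are OpenCV defaults and are not necessarily the parameters used in the MEVBench HoG benchmark.

// Sliding-window HoG detection over all scales of an image.
#include <opencv2/core/core.hpp>
#include <opencv2/objdetect/objdetect.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("frame.png", 0);
    cv::HOGDescriptor hog;                      // 64x128 window, 8x8 cells, 16x16 blocks by default
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    std::vector<cv::Rect> people;
    hog.detectMultiScale(img, people);          // descriptors computed and classified at every window
    return 0;
}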

3.1.4 FAST Corner Detector (FAST) and Binary Robust Independent Elementary Features Descriptor (BRIEF)

FAST was originally developed for sensor fusion applications [28]. This algorithm is designed to quickly locate image corners, a process typically used to implement position tracking. Since FAST is primarily for detecting corners, we have coupled it with the BRIEF feature descriptor to complete the feature extraction. The FAST algorithm uses only pixel intensity, within the 16 nearest pixels, to locate corners. If the pixel contrast is high enough to form a proper corner, the pixel is considered a feature point. Once the detection phase is completed, the BRIEF feature descriptor is computed. The BRIEF feature descriptor compares the pixel intensities within a smoothed image patch to form bit vectors of the results [4]. These are then used to describe the patch where the feature point was found. It was found that as little as 16 bytes was enough for accurate matching [4]. MEVBench contains a multithreaded version of the FAST and BRIEF combination. This version can be scaled to multiple thread counts and input sizes.
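
The FAST/BRIEF pairing can be illustrated with OpenCV's 2.x-era classes as follows; the FAST threshold and the 32-byte descriptor length are assumptions, not the benchmark's settings.

// FAST locates corners; BRIEF builds short binary descriptors around them.
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("frame.png", 0);

    std::vector<cv::KeyPoint> keypoints;
    cv::FastFeatureDetector fast(20);                // FAST with an assumed intensity threshold
    fast.detect(img, keypoints);

    cv::BriefDescriptorExtractor brief(32);          // 32-byte BRIEF descriptors
    cv::Mat descriptors;
    brief.compute(img, keypoints, descriptors);      // bit-vector descriptor per keypoint
    return 0;
}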

3.2 Feature Classification

Once feature descriptors are extracted, there is typically a reasoning process that takes place based on these features. A common component of this reasoning phase is feature classification. Feature classification attempts to predict some information about a feature based on the descriptor. The classification will commonly use previous data to predict information about the new data. This operation is commonly implemented using machine learning techniques. MEVBench includes three different classification algorithms that are appropriate for use in embedded vision applications.

3.2.1 Support Vector Machine (SVM)

SVM is a supervised learning method used to classify feature vectors or descriptors [31]. The algorithm is trained with a set of feature vectors and their known classes. SVM treats each piece of data as a point in n-space, where n is the dimension of the feature vector. It then tries to find separating hyperplanes in the n-space between the various classes. The result of training is a set of vectors, called support vectors, that can be used to evaluate a new data point in the classification phase. The query vector is combined with the support vectors in various ways to classify the new point. MEVBench has two versions of SVM. The first is from SD-VBS and is a single-threaded version of SVM that includes both the training and classification phases [33]. The second version is a multithreaded version built using the OpenCV framework. This version implements a linear SVM kernel and can be scaled by both the number of threads and the number of input feature vectors.
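
A minimal sketch of linear-SVM training and classification with the OpenCV 2.x CvSVM API follows; the tiny synthetic training set is for illustration only, whereas the benchmark operates on extracted feature vectors.

// Train a linear SVM on two labeled vectors, then classify a query vector.
#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>

int main() {
    float trainData[2][4] = {{1, 2, 3, 4}, {4, 3, 2, 1}};   // two 4-dimensional training vectors
    float labels[2] = {1.0f, -1.0f};                        // their known classes
    cv::Mat trainMat(2, 4, CV_32F, trainData);
    cv::Mat labelMat(2, 1, CV_32F, labels);

    CvSVMParams params;
    params.svm_type = CvSVM::C_SVC;
    params.kernel_type = CvSVM::LINEAR;                     // linear kernel, as in the benchmark

    CvSVM svm;
    svm.train(trainMat, labelMat, cv::Mat(), cv::Mat(), params);

    float query[4] = {1, 2, 2, 5};
    cv::Mat queryMat(1, 4, CV_32F, query);
    float predicted = svm.predict(queryMat);                // inner products against support vectors
    (void)predicted;
    return 0;
}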

3.2.2 Adaboost

Adaboost is a supervised learning method based on decision trees [12]. The algorithm uses a group of weak-learner decision trees to increase the accuracy of classification. The training process of the classifier assigns a weight to each weak learner, and then the individual results are summed together to determine the final classification. The query vector is merely classified by each weak learner and the results aggregated together based on the weights. Since Adaboost is based on decision trees, the computational complexity is not typically high. MEVBench has a multithreaded version of Adaboost. The number of threads and the number of feature vectors can be varied. It implements Adaboost with decision trees with a max depth of three.

3.2.3 K-Nearest Neighbor (KNN)

K-Nearest Neighbor is a supervised learning method that classifies new feature vectors based on their similarity to the training set. The implementation used in this work is based on FLANN [24]. The K nearest points in the training set vote for the class of the query vector. The votes are weighted based on the similarity of the query vector to the vectors in the training set. FLANN is configured for approximate nearest neighbor search using kd-trees, thus eliminating the need to do a complete search of the training set. MEVBench provides a multithreaded K-nearest neighbor implementation based on FLANN, using approximate neighbor matching as this is more appropriate for resource-constrained mobile applications. The number of threads along with the number of vectors can be varied.
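
The kd-tree-based approximate search can be sketched with OpenCV's FLANN wrapper as follows; the number of trees, the number of checks, and the random data are assumptions for illustration.

// Build randomized kd-trees over training descriptors, then run an approximate k-NN query.
#include <opencv2/core/core.hpp>
#include <opencv2/flann/flann.hpp>

int main() {
    cv::Mat trainVecs(1000, 128, CV_32F);            // training descriptors (placeholder data)
    cv::randu(trainVecs, cv::Scalar(0), cv::Scalar(1));

    cv::flann::Index index(trainVecs,
                           cv::flann::KDTreeIndexParams(4));   // 4 randomized kd-trees

    cv::Mat query(1, 128, CV_32F);
    cv::randu(query, cv::Scalar(0), cv::Scalar(1));

    int k = 5;
    cv::Mat indices(1, k, CV_32S), dists(1, k, CV_32F);
    index.knnSearch(query, indices, dists, k,
                    cv::flann::SearchParams(32));    // approximate search, 32 leaf checks
    // The k neighbors' classes would then vote, weighted by distance, for the query's class.
    return 0;
}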

3.3 Multi-image Processing

Multi-image processing is commonly used in embedded vision applications. In these applications, multiple images or frames are used to garner information about the scene or items within the scene. For example, tracking can be used in robotics to follow an obstacle or another mobile object. Each of the benchmarks for multi-image processing enables other applications while being an application in its own right. The benchmarks here are a subset of the SD-VBS suite [33], selected based on their suitability for use in mobile vision applications.

3.3.1 Tracking

Feature tracking involves extracting feature motion from an image sequence. In this benchmark we use the Kanade-Lucas-Tomasi (KLT) tracking algorithm [22]. The features are extracted and their motions are estimated based on inter-frame correlations. The features used are those from Shi and Tomasi [29]. The algorithm estimates the motion based on gradient information for each feature as it moves from frame to frame. Thus there is feature extraction, matching, and motion estimation for each tracked feature. This is a single-threaded version of the application. The input sizes can be varied for this benchmark.
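
An equivalent pipeline can be sketched with OpenCV's Shi-Tomasi corner detector and pyramidal Lucas-Kanade tracker; the SD-VBS benchmark uses its own implementation, so the parameters below (corner count, quality, spacing) are only illustrative.

// Detect Shi-Tomasi corners in one frame and estimate their motion into the next frame.
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/video/tracking.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <vector>

int main() {
    cv::Mat prev = cv::imread("frame0.png", 0);
    cv::Mat next = cv::imread("frame1.png", 0);

    std::vector<cv::Point2f> prevPts, nextPts;
    cv::goodFeaturesToTrack(prev, prevPts, 200, 0.01, 10);   // Shi-Tomasi features

    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prev, next, prevPts, nextPts,
                             status, err);                   // per-feature motion estimate
    return 0;
}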

3.3.2 Disparity

Disparity is a measure of the shift in a point from one image to another. It is used in stereo imaging to estimate distance or depth. It can also be used with images from a single camera. The disparity of a point or object is inversely proportional to the depth of the object. Thus disparity is used to recover the 3D information lost when a scene is projected onto a 2D image plane. In this benchmark we use the algorithm from Marr and Poggio [23]. The algorithm uses patch-based region comparisons to match the pixels between two images. This is a single-threaded benchmark. The input sizes can be varied from small to large images for this benchmark.
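
As a toy illustration of patch-based disparity matching, the sketch below scans a fixed search range with a 5x5 sum-of-absolute-differences window; the window size and search range are assumptions, and the actual benchmark implements the more involved Marr-Poggio cooperative algorithm.

// Returns the disparity (horizontal shift) that best matches a 5x5 patch
// between a rectified left/right image pair. Assumes (x, y) is at least
// 2 pixels away from the image border and images are 8-bit grayscale.
#include <opencv2/core/core.hpp>
#include <cstdlib>
#include <climits>

int bestDisparity(const cv::Mat& left, const cv::Mat& right,
                  int x, int y, int maxDisp) {
    int best = 0, bestCost = INT_MAX;
    for (int d = 0; d <= maxDisp && x - d - 2 >= 0; ++d) {
        int cost = 0;
        for (int dy = -2; dy <= 2; ++dy)             // 5x5 sum of absolute differences
            for (int dx = -2; dx <= 2; ++dx)
                cost += std::abs(left.at<uchar>(y + dy, x + dx) -
                                 right.at<uchar>(y + dy, x - d + dx));
        if (cost < bestCost) { bestCost = cost; best = d; }
    }
    return best;                                     // larger disparity implies a closer object
}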

3.3.3 Image Stitching

Image stitching takes multiple images and merges them to form a single image. This operation requires matching regions of overlap between the images and aligning them accordingly, typically by combining feature detection and matching algorithms. The two images must also be blended together since the viewpoints may be slightly different. The image stitching benchmark is a single-threaded implementation of the algorithm from [30].

3.4 Recognition Applications

Recognition applications utilize vision kernels, such as feature extraction and classification, to analyze images or scenes. These will typically augment the feature processing with reasoning to extract information from a scene. For example, object detection has a geometric constraint that must be met before an object is considered found (e.g., a standing person must be upon a horizontal surface).

3.4.1 Object Recognition

Object recognition uses computer vision to evaluate when a trained object is present within a scene. It is common for object recognition to use feature extraction, feature classification and geometric constraints to recognize the trained object. The MEVBench benchmark is based on the technique described by Lowe [21]. This technique uses SIFT features and matches the query image's features to the trained object's features. Then the features are filtered using a geometric constraint on their location. A histogram binning technique is used to group and verify the location predicted by the features. The features are then used to compute the estimated pose of the object, and the location error for each feature is computed. If the total error is below a threshold, the object is considered located. The object recognition benchmark is built using the OpenCV framework, and it is multithreaded to support a variable number of processors.

3.4.2 Face Detection

Face detection is a common application for embedded computer vision. It involves locating a face within an image. The face detection technique used in this benchmark is based on the Viola-Jones method [34]. This method uses box filters to locate faces present in the image at variable scales. The face detection benchmark is a single-threaded implementation, built using OpenCV.
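
A usage sketch of the Viola-Jones detector through OpenCV's CascadeClassifier is shown below; the cascade file name is a stock OpenCV model assumed for illustration, not necessarily the model used by the benchmark.

// Load a trained Haar cascade and search for faces at multiple scales.
#include <opencv2/core/core.hpp>
#include <opencv2/objdetect/objdetect.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <vector>

int main() {
    cv::CascadeClassifier cascade("haarcascade_frontalface_alt.xml");
    cv::Mat img = cv::imread("frame.png", 0);

    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(img, faces, 1.1, 3);    // box-filter cascade applied at variable scales
    return 0;
}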

3.4.3 Augmented Reality

Augmented reality is an increasingly popular application on mobile devices for navigation and gaming. In augmented reality, a known marker is used to determine scale and provide a reference point within a scene. The MEVBench implementation uses a black and white marker that is located using a binary threshold-based segmentation of the image, similar to the technique presented by Kato et al. [19]. The marker is identified using a basic binary pattern on the marker face. The projection of a virtual cube into the scene, relative to the marker, is performed by estimating the marker's translation and rotation relative to the camera. This technique requires that the camera calibration be known and that the image be adjusted based on this calibration. The MEVBench implementation uses the OpenCV framework to implement this technique in a single-threaded application.
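
The pose-estimation step can be sketched as follows: given the four detected marker corners and the camera calibration, estimate the marker's rotation and translation and project virtual geometry into the image. The corner coordinates, calibration values, and the use of solvePnP here are illustrative assumptions; marker segmentation and identification are omitted.

// Estimate marker pose from four 2D-3D correspondences, then project a virtual point.
#include <opencv2/core/core.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <vector>

int main() {
    std::vector<cv::Point3f> markerCorners3d;        // marker corners in marker coordinates (unit square)
    markerCorners3d.push_back(cv::Point3f(0, 0, 0));
    markerCorners3d.push_back(cv::Point3f(1, 0, 0));
    markerCorners3d.push_back(cv::Point3f(1, 1, 0));
    markerCorners3d.push_back(cv::Point3f(0, 1, 0));

    std::vector<cv::Point2f> detectedCorners;        // would come from the segmentation stage
    detectedCorners.push_back(cv::Point2f(100, 100));
    detectedCorners.push_back(cv::Point2f(200, 110));
    detectedCorners.push_back(cv::Point2f(210, 210));
    detectedCorners.push_back(cv::Point2f(95, 205));

    cv::Mat K = (cv::Mat_<double>(3, 3) << 600, 0, 320,
                                           0, 600, 240,
                                           0, 0, 1);          // assumed camera calibration
    cv::Mat distCoeffs = cv::Mat::zeros(4, 1, CV_64F);

    cv::Mat rvec, tvec;
    cv::solvePnP(markerCorners3d, detectedCorners, K, distCoeffs, rvec, tvec);

    std::vector<cv::Point3f> cube;                   // virtual cube vertex relative to the marker
    cube.push_back(cv::Point3f(0, 0, -1));
    std::vector<cv::Point2f> projected;
    cv::projectPoints(cube, rvec, tvec, K, distCoeffs, projected);
    return 0;
}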

4 Benchmark Characterization

4.1 Experimental Setup

In order to evaluate the benchmarks, we employed a variety of physical and simulated systems. The embedded nature of the benchmark required that we look at cores commonly used in the mobile space as well as desktop systems. For the physical embedded target we utilized a 1 GHz dual-core ARM A9 with 1 GB of RAM. This class of processor is found in many smartphone and tablet SoCs such as the NVIDIA Tegra 2 [25]. For this device we ran the benchmarks on a TI OMAP4430, and we used clock_gettime() to measure application times on this platform. For the desktop-class physical system we used an Intel Core 2 Quad Q6600 processor, configured as described in Table 2. Application timing on the Intel-based desktop system employed the hardware timestamp counter (TSC) to capture execution cycle counts. The TSC is a register on Intel x86 architectures that holds a 64-bit counter incremented at a set rate. We modified the Linux kernel on this system to virtualize the TSC on a per-thread basis. As such, each thread has a copy of the TSC that it swaps in and out at context switches. This allows us to have cycle-accurate data on a per-thread basis on a standard Intel processor.
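
For reference, reading the TSC on x86 uses the rdtsc instruction, as in the sketch below; the per-thread virtualization described above requires the authors' kernel modification and is not shown.

// Read the 64-bit timestamp counter; rdtsc returns the value in EDX:EAX.
#include <stdint.h>

static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}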

We gathered detailed microarchitectural characteristics not available on most physical systems by employing an embedded platform simulator. The simulated embedded platform is a 1 GHz Intel Atom model with 2 GB of RAM, simulated using the MARSS x86 simulator [26]. For the simulated desktop target we used MARSS to simulate an Intel Core 2 class of processor in various configurations. Table 2 summarizes the experimental setup for the various processors. All benchmarks were compiled using the GNU g++ compiler, version 4.4.3, with maximum optimization. We also ran Intel VTune Amplifier XE 2011 to gather code hotspot information for the MEVBench benchmarks [17].

Table 2. Configurations for profiling MEVBench.

Feature            Configuration

Embedded Bare Metal
  Operating System   Linux 2.6.38-1208-omap4
  Processor          1 GHz dual-core ARM A9 (OMAP4430)
  Memory             1 GB low-power DDR2 RAM
  L1 Cache           32 KB I, 32 KB D, private, 4-way associative
  L2 Cache           1 MB shared

Desktop Bare Metal
  Operating System   Linux 2.6.32.24 custom kernel
  Processor          2.4 GHz Intel Core 2 Quad Q6600
  Memory             4 GB PC2-5300
  L1 Cache           128 KB I, 128 KB D, private, 8-way associative
  L2 Cache           2x4 MB shared, 16-way associative

Simulated Base Embedded Core
  Operating System   Linux 2.6.31.4
  Processor          1.0 GHz 32-bit x86 in MARSS
  Memory             2 GB
  L1 Cache           32 KB I private 8-way associative, 24 KB D private 6-way associative
  L2 Cache           512 KB shared, 8-way associative

Simulated Base Desktop Core
  Operating System   Linux 2.6.31.4
  Processor          1.0 GHz 64-bit x86 in MARSS
  Memory             2 GB
  L1 Cache           32 KB I, 32 KB D, private, 8-way associative
  L2 Cache           2 MB shared, 16-way associative

To assess the scalability of the algorithms, they were run with varied thread counts and input sizes. We used three different input sizes in this evaluation: small, medium and large. The small inputs for the benchmarks are based on images that are a standard Common Intermediate Format (CIF) size of 352x288 pixels. The medium inputs are the standard VGA size of 640x480 pixels. The large inputs are the full HD size of 1920x1080 pixels. All image data is in color PNG format. For classification, the small input size was 30 vectors, the medium input size was 116 vectors, and the large input size was 256 vectors. All of the vectors have 3780 entries. Vector sizes were chosen to align with the expected computation load from feature extraction.

4.2 Dynamic Code Hot Spot Analysis

We examined the most executed instructions in the single-threaded versions of the benchmarks within MEVBench using medium-sized inputs running on an Intel Core 2 Quad Q6600 with 4 GB RAM. We looked at the operations taking place at these various hot spots to determine possible software- or hardware-based optimizations.

4.2.1 SIFT

The SIFT benchmark based on OpenCV showed that gradient computation for descriptor building and feature point orientation accounted for 70% of the computation. Furthermore, the most executed computation component of this was a vector multiply. This vector multiply was part of the 2D gradient computation required to build the descriptor. The SIFT benchmark from SD-VBS, which contains only the feature localization, spends 67% of the time blurring the image, which involves a 2D convolution.

4.2.2 FAST and BRIEF

The FAST and BRIEF benchmark spent 15% of the time locating corners with the FAST feature detector. The majority of the FAST algorithm is spent on a compare operation used to detect the corners. BRIEF accounts for over 30% of the execution time. The primary operations in this portion of the benchmark are integral image computations and a smoothing computation. The integral image computations apply box filters, which are convolution operations. The smoothing operation also utilizes a convolution operation.

4.2.3 HoG

The HoG benchmark spent 20% of the time computing the integral images. The primary operation in this computation was a vector add. Also, there was a significant amount of time spent on a divide operation used for normalizing feature vectors.

4.2.4 SURF

The SURF benchmark had a hot spot in the edge detector used to localize features and compute the feature vectors. This was a vector instruction that accounted for 39.9% of the computation in the benchmark. Also, another 40% of the time was spent fetching image data from main memory. This is due to the nature of the image-region-based descriptors that SURF uses.

4.2.5 AdaBoost

In the AdaBoost benchmark, the primary computation took place in the prediction code. The majority of the runtime is spent traversing the decision trees and performing comparison operations. Thus the comparison operation is the largest component of this computation.

4.2.6 K-Nearest Neighbor

In the K-nearest neighbor benchmark, 40% of the computation was spent indexing the tree to find the nearest neighbors. The other 60% of the time was primarily taken up by a vector addition used to compute the classification based on the neighbors' classes.

4.2.7 SVM

In both the OpenCV and SD-VBS SVM benchmarks, over 60% of the computation time is spent on an inner product calculation. This involves multiplying two vectors element by element and summing the results. This is the predominant operation for training the SVM classifier as well.

4.2.8 Stitch

The stitch benchmark spent 53% of its execution performing non-maxima suppression. In this operation, the maximum feature response value within a region is used to filter out weaker feature responses. This operation requires many 2D spatial compares. The second highest fraction of computation for this benchmark was a convolution operation used for finding features, which took 33.3% of the computation time.

4.2.9 Disparity

The disparity benchmark spent 57% of its time computing the integral image. The primary operation in this phase was a vector add. No other single operation dominated the remainder of the computation.

4.2.10 Tracking

The tracking benchmark had a hot spot in the 2D gradient computation. This operation constitutes 56% of the computation. This was mainly a vector operation for performing a convolution.

4.2.11 Face Detection

The face detection benchmark spent 60% of the time evaluating the class of the object using the cascade classifier. This is a decision tree designed such that when a decision evaluates false, no other comparisons are made and the classifier returns false, but when the evaluation is true, additional compares are made until the final leaf node is reached.

4.2.12 Object Detection

The object detection benchmark combines feature extraction, classification, and a geometric check. The hot spot for this benchmark is the same as that of feature extraction. We found that feature extraction dominates the execution time, taking 69% of the time.


4.2.13 Augmented Reality

The augmented reality benchmark has two major hot spots that take a combined 28% of the computation time. The first is the location of the marker by tracing contours or edges. The second hot spot performs correction of the image based on the camera calibration data. The adjustment allows the system to accurately project the scene in 3D. Over 57% of the time in augmented reality is taken up with memory reads and writes.

To summarize our hot spot analysis, the results suggest that hotspots occur at vector instructions, alluding to a vector architecture being useful. There are also hotspots in complex or hard-to-predict control flow, such as the cascade in face detection. In those cases there is a need to deal with irregular branching patterns, which may be difficult for traditional vector machines. Among the operations performed at hotspots, the convolution operation is used in many benchmarks, showing that accelerating that operation may be helpful to embedded vision applications. There are also hotspots that involve comparison operations in which a single value is compared to many other values. Many benchmarks also require many memory accesses; thus an efficient embedded vision processor must have a streamlined memory system.

4.3 Computational Performance

We examined the runtime performance of MEVBench on various platforms such as an ARM A9 and Intel Core 2 Quad. Figure 3 shows the number of cycles for single-threaded runs of MEVBench on a physical Core 2 Quad as the input is scaled. The logarithmic component to the cycles is due to the nature of image data. If both the height and width are scaled, the amount of work is scaled as the product of those increases. For example, doubling the height and width increases the number of pixels to compute by a factor of 4. Thus moving to HD computation on embedded systems will require a significant increase in computational efficiency.

Figure 4 shows the instructions per cycle (IPC) for the simulated Core 2 and Atom cores. This figure shows the degree of instruction level parallelism the desktop cores can extract from the benchmarks when compared to the embedded Atom core. Given the power and area costs of extracting instruction level parallelism, the embedded processor will need to utilize more efficient computational resources to gain performance. Thus, instruction level parallelism may not be the driving performance enhancement in embedded platforms, which leaves thread-level and data-level parallelism to improve performance in this space.

Figure 3. Execution Time. The figure shows the number of cycles of execution for single-threaded versions of each benchmark. This gives an indication of the amount of overall computation contained within each benchmark. The experiments were run on a modified Linux kernel on the Core 2 Quad bare-metal configuration.

Figure 4. IPC for Varied Core Capability. The figure shows the IPC for each benchmark on both the simulated Atom and Core 2 cores. This marks the difference between desktop machines and embedded platforms in terms of throughput.

Figure 5 shows the effect of running the benchmarks on the physical Core 2 Quad when compiled with or without vector instructions. We chose this execution target because the vector width of the x86 SSE instructions is greater than that of the ARM SIMD engine. When the SIMD instructions are used, they are inserted automatically by the compiler. In some cases the inclusion of vector instructions hurt performance. This is due partly to the control complexity of the vision kernels, such as k-nearest neighbor where the kd-tree is searched. This suggests that an efficient embedded processor for computer vision will need to support vector instructions in some cases but disable them in others where they might hinder performance.

4.4 Memory System Performance

Embedded systems have limited memory when compared to their desktop counterparts. In order to design properly for the embedded vision space, these targets must efficiently serve the memory demands of vision applications.


Figure 5. Vector Instruction Impact. The figure shows the impact of using vector instructions on various versions of the benchmarks. This examines the amount of data parallelism present. The control complexity and the need to rely on the memory system hinder performance in some cases when vector instructions are activated.

A common limiting factor in low-cost systems is memory bandwidth. Vision algorithms rely a great deal on image data that has little temporal locality. Figure 6 examines the memory bandwidth used by each application, in terms of bytes of activity per instruction. This includes all memory activity through the buses, whether or not it is touched by the actual application; thus a cache miss in L1 will result in two data transfers, L2 to L1 and L1 to processor. This is because the entire memory system is using power and all activity contributes to this. Bytes accessed per instruction are calculated on a per-work-element basis. As embedded vision systems move toward real-time performance, the memory system will need to accommodate this amount of data movement per instruction. This metric is agnostic to the actual frame rate, and designers are free to calculate the amount of data per second based on this and the number of instructions per frame. The Atom core exhibits a higher number here because its smaller caches force it to access full memory pages more often. Furthermore, some applications access memory in a way that is not tailored to the traditional memory system.

Figure 6. Memory Activity Per Instruction. The figure shows the memory activity in bytes per instruction for the single-threaded versions of each benchmark on the simulated Core 2 and Atom cores. This evaluates the stress on the memory system per instruction of work. The benchmarks were run on a single image or set of vectors. A high overhead indicates an inefficient memory fetch or caching policy. The algorithms often fetch memory that is not needed, but it is still counted in the measured overhead. Furthermore, misses in upper level caches increase the numbers because the data needs to be transferred more than once.

Efficient memory is key to embedded performance in terms of energy and timing. In an embedded system, a cache miss can be more costly due to the slower memories. Thus we examined the L2 cache hit rate for various components of MEVBench. We found that the misses per instruction were very high. We also noted that many vision algorithms operate on images and thus can take advantage of 2D spatial locality. Thus, we created a cache controller that assumed all of memory contained images arranged into patches. Each patch was an 8-byte row by 8-byte column, and data was fetched into the cache based on this arrangement with an 8-way associativity. We found that this outperformed the standard cache in the Core 2 simulation, as seen in Figure 7. Thus, to increase performance, processors for the embedded vision space should take advantage of the 2D locality present within many of the applications. Some applications exhibit a cache performance decrease with the patch cache. This is due partially to how they access data in the most time-consuming portions of code. For example, Adaboost is a decision-tree-based algorithm that works with feature vectors; the patch cache prefetches the wrong data in this case. In SD-VBS SIFT, performance was reduced because the algorithm performs a 1D operation for a large amount of its computation, but the patch memory is designed for 2D data. The SD-VBS SVM has a similar issue due to the size of the vectors it uses. As such, 2D memory optimizations, while beneficial, should only be enabled for code with substantial 2D spatial locality.
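
The idea behind the patch-based organization can be sketched as an address mapping in which an 8x8-byte pixel block forms the unit of fetch, so a miss brings in a 2D neighborhood rather than a 1D row segment; the helper below is illustrative and not the simulated cache controller.

// Map an (x, y) pixel coordinate to a patch index; all pixels of one 8x8 patch
// share the index and would be fetched together on a miss.
#include <stdint.h>

static inline uint64_t patch_index(uint32_t x, uint32_t y, uint32_t width) {
    uint32_t patches_per_row = (width + 7) / 8;      // width is the image row length in bytes
    return (uint64_t)(y / 8) * patches_per_row + (x / 8);
}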

4.5 Multithreaded Performance

Figure 8 shows the performance in cycles of a dual-core ARM A9 as the number of threads is increased for the medium and small input sizes. The performance of HoG gets worse and plateaus as the thread count reaches 4; this is because the memory used by HoG is large and requires swapping as the number of threads increases. It should also be noted that FAST/BRIEF suffers an issue where, as the number of cores increases, coordinating the cores overtakes the execution of the feature extraction. This shows that for some algorithms, a lower number of high-performance cores may perform better than a large number of small cores.

Figure 7. Patch Memory. The figure shows the effect of using a patch-based cache on various benchmarks. The Y axis is the cache misses per instruction for a traditional cache divided by the misses per instruction for a patch-based cache. A value above one means the patch-based cache had higher performance, while a value below one shows the traditional cache had higher performance. This demonstrates how much more efficient the patch cache can be for certain vision applications. This was done for the single-threaded versions of the benchmarks on the simulated Core 2 core with the small input sizes. Some algorithms have a worse miss rate; this is due partially to how they access data in the most time-consuming portions of code. For example, Adaboost is a decision-tree-based algorithm that operates on feature vectors, and the patch cache prefetches the wrong data in this case.

We examined the regularity of branching to evaluate how well MEVBench algorithms might map to architectures where multiple cores perform the same instruction in lock step, such as GPGPUs. We looked at the top 30% of dynamic branches from various benchmarks and measured how often they changed targets. Figure 9 shows the branch divergence measure for the various benchmarks. AdaBoost and HoG have such high transition rates because they have high control complexity built into the algorithms. HoG performs a binning operation on the entire image as a first step to compute the feature descriptor; this binning operation is designed to decrease the amount of time and looping for future computation. Adaboost is a decision-tree-based classifier, thus as each feature is evaluated, branching in the tree is quite varied. Stitch has a portion where values are compared for non-maxima suppression; this major operation is executed many times, so it dominates and can lead to different control paths. The branch divergence characteristics suggest that some of the algorithms would experience significant stalls on a GPGPU.
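
One way such a divergence measure could be computed is sketched below, tracking how often each static branch changes direction between consecutive dynamic executions and reporting transition rates for the most executed branches; the data structures are illustrative and the paper's exact methodology may differ.

// Accumulate per-branch execution and transition counts from a dynamic trace.
#include <map>
#include <stdint.h>

struct BranchStats {
    uint64_t executions;
    uint64_t transitions;
    bool lastTaken;
    BranchStats() : executions(0), transitions(0), lastTaken(false) {}
};

void record(std::map<uint64_t, BranchStats>& stats, uint64_t pc, bool taken) {
    BranchStats& b = stats[pc];
    if (b.executions > 0 && taken != b.lastTaken)
        ++b.transitions;                       // direction changed since the previous execution
    b.lastTaken = taken;
    ++b.executions;
}
// Divergence for a branch is then transitions / executions, reported for the
// most frequently executed branches.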

Figure 10 shows the average IPC for multithreaded versions of the MEVBench benchmarks. FAST and BRIEF performance drops off quickly due to coordination taking more time and the threads waiting for each other to finish. SURF and SVM degrade because some cores do more work than others, forcing many to wait during coordination. The total IPC of the system is still always higher than that of the individual threads, however.

Figure 8. Performance vs Number of Threads. The figure shows the effect of using multithreaded versions of some of the benchmarks, specifically the feature extraction, classification, and object recognition benchmarks. Extraction and classification are core vision components, and object recognition is used to show the potential of multithreaded vision applications. The plots show the speedup, relative to single-threaded execution, on the 1 GHz dual-core ARM A9 for 1, 2, 4, and 8 threads. Plot (A) on the top is for the small input size, while plot (B) on the bottom is for the medium input size. The HoG trend is caused by the large memory demand of HoG, which incurs page faults in the memory system.

5 Architectural Implications

The analysis has shown that some of the MEVBench workloads would benefit from data parallelism while others may be hindered by it. Therefore, a key attribute of a mobile embedded processor for this space is the ability to extract data-level parallelism when present but still perform well on single-threaded applications. In some cases the algorithms are hindered by the use of multiple threads. This lends itself to a multicore with at least one powerful core to handle these single-threaded applications, plus additional (possibly simpler) cores for leveraging available explicit parallelism. Given the area, cost and power constraints, a heterogeneous multicore with lower-area, lower-power cores supporting a larger core when thread-level parallelism is available is a fair solution to this issue.

Figure 9. Branch Divergence. The figure shows the branch divergence present in the benchmarks, measured as how often the top 30% of the most executed branches change their target location. This result can be used to predict the complexity of algorithmic control flow and its amenability to lock-step architectures such as GPGPUs.

Figure 10. Average Multithreaded IPC. The figure shows the average IPC for each Core 2 core as the number of threads and cores is increased. This is a measure of how much the various cores affect each other's performance. The FAST performance drops off at 4 cores due to the amount of work coordinating the threads exceeding the work to actually perform the feature extraction.

There is a fair bit of control complexity in the various workloads. Thus the core needs the ability to efficiently handle diverging control flow. However, the hotspot analysis showed that many applications have hot spots at vector operations. For some benchmarks, control complexity rules out traditional vector machines and possibly GPGPU architectures that do not deal well with branch divergence. Thus, architectures that support specific vector instructions would be a better fit.

Finally, it was found that patch-based caching or memory accesses can take advantage of the ample 2D spatial locality present in many vision algorithms. This should decrease energy and execution time. The memory accesses per instruction data shows that inefficient memory management in the architecture can increase the memory bandwidth requirement. We also see that not all benchmarks see improvement from the patch memory; thus it is beneficial to allow multiple memory access modes to increase performance, decrease required memory bandwidth and decrease energy usage. We have also seen that there is a performance gap between the embedded and desktop systems. This will need to be overcome to enable more accurate embedded vision applications. As mobile vision systems push toward accurate real-time computation, the need for performance will continue to increase.

6 Conclusion

Mobile embedded vision applications are quickly increasing in number. However, the performance of mobile embedded processors has not kept pace. Currently there are limited benchmark suites for vision, and none (known to the authors) tailored to the mobile vision space. MEVBench addresses this gap and provides a wide range of embedded vision workloads to evaluate mobile embedded vision architectures. It provides single and multithreaded workloads that are characteristic of this burgeoning space. MEVBench can be obtained at http://www.eecs.umich.edu/MEVBench.

We evaluated the performance of MEVBench on various platforms, embedded and desktop, along with physical and simulated processors. We used this evaluation to gain insights into possible future mobile embedded vision architectures. We show that embedded vision architectures need to support heterogeneous computation for both data parallelism and single-threaded execution. We also show that exploitation of 2D spatial locality can improve performance. We introduced a new measure of control complexity, branch divergence, and showed that it can help guide architectural decisions.

Overall, we have delivered a much needed component for understanding mobile embedded vision computing, with the promise of one day achieving real-time, high quality performance with low energy usage. We plan to continue our work by exploring novel architectures for improving the efficiency of mobile vision computing. To support this effort, a power and energy evaluation methodology will be developed based on MEVBench. We will also be modifying the SIFT benchmark to use Hess's SIFT library [15].

Acknowledgment

The authors acknowledge the support of the Gigascale Systems Research Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program. We also acknowledge the National Science Foundation (NSF) for their support of this project. We would also like to thank Dr. Michael Taylor and his students for allowing the use of SD-VBS [33], Rob Hess for allowing us the use of his SIFT library [15], and Marius Muja et al. for allowing the use of their FLANN library [24].

References

[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 2008.

[2] C. Bienia and K. Li. PARSEC 2.0: A new benchmark suite for chip-multiprocessors. In Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June 2009.

[3] D. G. R. Bradski and A. Kaehler. Learning OpenCV. O'Reilly Media, Inc., 2008.

[4] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary Robust Independent Elementary Features. In European Conference on Computer Vision, September 2010.

[5] J. Clemons, A. Jones, R. Perricone, S. Savarese, and T. Austin. EFFEX: An Embedded Processor for Computer Vision-Based Feature Extraction. In Design Automation Conference, 2011.

[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005.

[7] M. Donovan. The 2010 Mobile Year in Review - U.S. http://www.comscore.com/Press_Events/Presentations_Whitepapers/2011/2010_Mobile_Year_in_Review_-_U.S, March 2011.

[8] EEMBC. CoreMark: An EEMBC Benchmark. http://www.coremark.org/home.php, 2011.

[9] EEMBC. MultiBench 1.0 Multicore Benchmark Software. http://www.eembc.org/benchmark/multi_sl.php, 2011.

[10] M. Enzweiler and D. M. Gavrila. Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:2179–2195, 2009.

[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.

[12] Y. Freund and R. E. Schapire. Experiments with a New Boosting Algorithm. In International Conference on Machine Learning. Morgan Kaufmann, 1996.

[13] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown. MiBench: A free, commercially representative embedded benchmark suite. In IEEE International Symposium on Workload Characterization, 2001.

[14] R. W. Heiland, M. P. Baker, and D. K. Tafti. VisBench: A framework for remote data visualization and analysis. In Proceedings of the International Conference on Computational Science - Part II, pages 718–727, London, UK, 2001. Springer-Verlag.

[15] R. Hess. SIFT feature detector for OpenCV, February 2009. http://web.engr.oregonstate.edu/~hess.

[16] H. Hirschmüller and D. Scharstein. Evaluation of cost functions for stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[17] Intel. Intel VTune Amplifier XE performance analyzer, 2011. http://www.intel.com/VTuneAmplifier.

[18] L. Juan and O. Gwun. A comparison of SIFT, PCA-SIFT and SURF. International Journal of Image Processing, 3(4):143–152, 2010.

[19] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, pages 85–, Washington, DC, USA, 1999. IEEE Computer Society.

[20] G. Klein and D. Murray. Parallel tracking and mapping on a camera phone. In IEEE and ACM International Symposium on Mixed and Augmented Reality, Orlando, October 2009.

[21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.

[22] B. Lucas and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. In International Joint Conference on Artificial Intelligence, pages 674–679, April 1981.

[23] D. Marr and T. Poggio. Cooperative Computation of Stereo Disparity. Science, 194(4262):283–287, October 1976.

[24] M. Muja and D. G. Lowe. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. In International Conference on Computer Vision Theory and Applications, pages 331–340, 2009.

[25] NVIDIA. Tegra 2, June 2011. http://www.nvidia.com/object/tegra-2.html.

[26] A. Patel, F. Afram, S. Chen, and K. Ghose. MARSS: A Full System Simulator for Multicore x86 CPUs. In Design Automation Conference, 2011.

[27] M. Rayfield. Tegra Roadmap Revealed: Next Chip To Be World's First Quad-Core Mobile Processor. http://blogs.nvidia.com/2011/02/tegra_roadmap_revealed_next_chip_worlds_first_quadcore_mobile_processor/.

[28] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In International Conference on Computer Vision, volume 2, October 2005.

[29] J. Shi and C. Tomasi. Good features to track. In 1994 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'94), pages 593–600, 1994.

[30] R. Szeliski. Image alignment and stitching: A tutorial. Foundations and Trends in Computer Graphics and Vision, 2:1–104, January 2006.

[31] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer New York Inc., New York, NY, USA, 1995.

[32] A. Vedaldi. Code: SIFT++. http://www.vlfeat.org/~vedaldi/code/siftpp.html, June 2011.

[33] S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, S. Belongie, and M. B. Taylor. SD-VBS: The San Diego Vision Benchmark Suite. In IEEE International Symposium on Workload Characterization, 2009.

[34] P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 2002.