
Massively Parallel Computing

MPC1011: Matching Local Self-Similarities on GPUs

M. Neumann and D. Ritter

Abstract

We present a parallel version of the self-similarity approach of [SI07]. The goal of this approach is to measure the similarity of different images and to find matches of one image within another. The approach exploits the fact that the internal layout of local self-similarities correlates across different images. These internal self-similarities are captured in a compact descriptor. The descriptors are computed densely throughout the images at different scales, and they account for a certain amount of local and global geometric distortion, which allows rough hand-sketches to be used to find real instances of the object in an image. We show the basic concepts, the parallelization approaches, object detection examples, and a comparison to a CPU version.

1. Introduction

Finding objects in images is needed in many applications, e.g. object recognition, object tracking, image-in-image search, and hand-sketch search. Existing methods usually use local or global image properties to capture scene information, which are then compared in order to determine the similarity between images. The assumption these approaches make is that there is an underlying visual property (color, intensity, edges, gradients, etc.) that can be compared between the images. However, this assumption does not always hold, because two images can show an instance of the same object without sharing the same visual properties. Therefore, a “local self-similarity descriptor” was introduced in [SI07]. This descriptor captures the internal geometric layout of local self-similarities and also accounts for some local affine deformations.

However, the creation of the local self-similarity descriptors and the matching of these descriptors are quite compute-intensive tasks. The increasing compute power of GPUs can therefore reduce the calculation time, because many parts of the descriptor calculation and the matching phase can be done in a parallel SIMD fashion. The resulting speed-up can be used to process higher-resolution images, or smaller images in real-time object detection applications.

2. Related Work

Image descriptors that take local or global image properties into account are well documented; see [MS05] or [XHE∗10] for a comparison of the most popular approaches.

How such a descriptor can be accelerated on the GPU is discussed in [WFG∗09].

The “local self-similarity descriptor” is introduced in [SI07]. In order to make the descriptor sparse, the authors used an approach introduced in [Hoy04]. Using sketches for image retrieval is discussed in [BBGI10]. For the matching phase, they used a modified version of the “ensemble matching” algorithm described in [BI07].

3. Overview of System

In this section we give an overview of the algorithmic steps needed to match two images. Figure 1 shows the different steps of our pipeline, which are described in detail in Section 4. The approach consists of three steps: extracting descriptors from the query image, matching the descriptors across images, and visualizing the results. We implemented the extraction and the matching phase in parallel on the GPU.

We differentiate between two kinds of input images: query images and database images. The query images are searched for in the database images. Since self-similarity may appear at various scales and in different region sizes, we extract self-similarity descriptors at multiple scales of the query image.

The first step is to transform the images to the LAB color space. The subsequent creation of the descriptors is based on [SI07]. We divide the pictures into a grid of 5x5-pixel cells, which we will refer to as patches. For each of these patches we calculate one local self-similarity descriptor, which contains the 80 final values. To calculate these descriptors, we use a larger surrounding image region (45x45 pixels) around the center of each patch and measure the similarity between the patch and its local environment. For this similarity we use the SSD (Sum of Squared Differences) between the patch and the surrounding patches in the image region, which results in 1681 values per grid element. These values are then transformed into a log-polar representation with 80 bins (20 angles, 4 radial intervals). In each bin we select the maximal value, which reduces the initial 1681 values to the 80 values of the descriptor. The descriptor values are then normalized to the range [0..1]. However, using all descriptors would give bad results in the matching phase because of non-informative descriptors. Therefore, we filter out non-informative descriptors and descriptors with high self-similarity. After this filtering, the descriptors are normalized again to the range [0..1].

© 2011 The Author(s)

In the matching phase we compare multiple query descriptors, extracted beforehand at multiple scales, with a single database image. The matching algorithm generates multiple likelihood maps that are five times smaller than the database image. To measure the similarity between descriptor values while taking their respective positions into account, we implemented a weighting function that is described in the next section.

Each likelihood map is visualized as a heat map. We also generate a combined heat map that merges all scales into one map.

4. Description of Algorithmic Steps

4.1. SSD

The creation of the local self-similarity descriptor is described in [SI07], where “local” means that the descriptor is calculated only from its local environment; it is therefore a measure of similarity within that environment. We divide the input picture into a grid of 5x5-pixel cells and define for each grid element a surrounding box of 45x45 pixels. It is necessary to choose an odd box size to get a true center point. For each of those 5x5 patches we then calculate one descriptor.

This section describes how we divide the calculations into blocks and threads on the GPU; this partitioning is also illustrated in Figure 2. To calculate one descriptor we use one block on the GPU, which means we have as many blocks as patches. Furthermore, each block consists of 41 threads, which all start at the top of the surrounding box. Each thread traverses the box from top to bottom and calculates an SSD between its current position and the center of the patch. Note that the SSD is not calculated between single points: these points only serve as the centers of 5x5 patches, and the SSD is calculated between those patches. Because each of the 41 threads traverses the box, every thread is responsible for 41 SSDs, so we get 1681 values per patch. Together these 1681 values form a “distance surface”.
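As a CPU reference for what one GPU block computes, the distance surface for a single patch can be sketched in NumPy as follows. This is a hypothetical single-channel sketch; the actual implementation works on LAB images (summing the SSD over the three channels) and assigns one block of 41 threads per patch:

```python
import numpy as np

def distance_surface(image, cy, cx, patch=5, region=45):
    """SSD 'distance surface' for the patch centred at (cy, cx):
    compare the centre 5x5 patch with every 5x5 patch whose centre
    lies inside the 45x45 surrounding box (41x41 = 1681 positions)."""
    half_p, half_r = patch // 2, region // 2
    centre = image[cy - half_p:cy + half_p + 1,
                   cx - half_p:cx + half_p + 1].astype(np.float64)
    steps = region - patch + 1          # 41 candidate positions per axis
    surf = np.empty((steps, steps))
    for i in range(steps):              # rows: on the GPU each thread traverses these
        for j in range(steps):          # columns: one GPU thread per column
            y = cy - half_r + half_p + i
            x = cx - half_r + half_p + j
            cand = image[y - half_p:y + half_p + 1,
                         x - half_p:x + half_p + 1].astype(np.float64)
            surf[i, j] = np.sum((centre - cand) ** 2)
    return surf

img = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)
surf = distance_surface(img, 32, 32)
print(surf.shape)    # (41, 41), i.e. the 1681 values per grid element
print(surf[20, 20])  # 0.0 -- the centre position compares the patch to itself
```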

Figure 1: Overview of the different algorithmic steps for matching a query with a database image. The upper part of the picture shows the process on the database image and the lower part shows the process on the query images.


Figure 2: This picture shows the partitioning in blocks and threads for the SSD calculation.

Figure 3: The descriptor creation process (source: [SI07]).

4.2. Log-Polar Transformation

The resulting “distance surface” from the SSD is now transformed into a so-called “correlation surface”. This is done by applying Equation 1 to every 5x5 patch q:

S_q(x, y) = exp( −SSD_q(x, y) / max(var_noise, var_auto(q)) )    (1)

Here var_noise is a constant and var_auto(q) is the maximal variance in a small region around the center of the patch q. After the transformation into the “correlation surface”, a mapping to log-polar coordinates is performed. This maps every element of the “correlation surface” to one bin of the log-polar representation. The log-polar representation has 80 bins (20 angles, 4 radial intervals). The resulting 80 descriptor values are the maxima of the bins. The descriptor values are normalized to the range [0..1]. Figure 3 shows this process.

Instead of applying Equation 1 to all elements of the “distance surface” and then finding the maximum per bin, we map the minima of the “distance surface” to the bins and calculate the costly exponential function only on the resulting 80 descriptor values. This works because the exponential in Equation 1 is strictly decreasing in the SSD, so the maximum of Equation 1 corresponds to the minimum of the “distance surface” values.
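The effect of this reordering can be checked with a small sketch: since exp(−x/denom) is strictly decreasing in x, exponentiating the per-bin minimum of the SSD values gives exactly the per-bin maximum of Equation 1. The numbers below are made up, and `denom` stands in for max(var_noise, var_auto(q)):

```python
import numpy as np

rng = np.random.default_rng(0)
bin_ssd = rng.uniform(0.0, 500.0, size=30)  # SSD values falling into one bin
denom = 100.0                               # stand-in for max(var_noise, var_auto(q))

# Naive: apply the exponential to every element, then take the maximum.
naive = np.max(np.exp(-bin_ssd / denom))

# Optimised (as in the text): take the bin minimum first, then apply
# the costly exponential once per bin.
fast = np.exp(-np.min(bin_ssd) / denom)

print(bool(np.isclose(naive, fast)))  # True
```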

In order to parallelize this step, a mapping mask is first computed once on the GPU. This mapping mask is used as a look-up table for the position of each “correlation surface” element in the log-polar bins. The mapping itself also runs in parallel. Because only the minimum value for each bin has to be stored in the final descriptor, we use the atomic minimum function. As the thread layout, we chose a fixed number of threads per block in order to stay flexible with respect to the size of the “correlation surface”. The grid size is determined by the number of patches n and the number of blocks needed per patch m, which results in n·m launched blocks. The variances needed for Equation 1 are then also calculated on the GPU. Finally, Equation 1 is applied to the per-bin minima of the distance surface in order to obtain the descriptor.

4.3. Filtering of Non-Informative Descriptors

In order to find the query image in the database image, the first step is to eliminate the non-informative descriptors. According to [SI07], there are two kinds of non-informative descriptors. First, there are descriptors that do not capture any self-similarity; these occur where the center patch is salient, so that no similarity can be found in its surroundings. The second kind contains a high self-similarity, which happens when the patch lies in a large homogeneous area.

To find those, our approach is to eliminate the descriptors whose summed values fall below a certain threshold (saliency) and those whose sum lies above a certain threshold (homogeneity).
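A minimal sketch of this filtering; the thresholds 20 and 70 are the ones reported in the results section, and each descriptor holds 80 values in [0..1]:

```python
import numpy as np

def keep_mask(descriptors, low=20.0, high=70.0):
    """Keep only descriptors whose summed values lie between `low`
    (below: salient centre, no self-similarity) and `high`
    (above: large homogeneous area, high self-similarity)."""
    sums = descriptors.sum(axis=1)
    return (sums >= low) & (sums <= high)

descs = np.vstack([
    np.full(80, 0.1),   # sum ~  8 -> salient, discarded
    np.full(80, 0.5),   # sum ~ 40 -> kept
    np.full(80, 0.95),  # sum ~ 76 -> homogeneous, discarded
])
print(keep_mask(descs))  # [False  True False]
```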

4.4. Matching

After creating the descriptors, we determine the similarity between them. To do this, we implemented a GPU device function that applies a sigmoid function to the L1 distance between the 80 values of a descriptor pair. The sigmoid function yields a similarity probability in the range [0..1] for the two descriptors.
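A sketch of such a similarity function in NumPy terms; the exact steepness and midpoint of the sigmoid are not reported in the paper, so the parameters here are illustrative assumptions:

```python
import numpy as np

def similarity(d1, d2, steepness=10.0, midpoint=0.25):
    """Similarity probability in [0..1] for two 80-value descriptors:
    a falling sigmoid applied to their (normalised) L1 distance.
    `steepness` and `midpoint` are assumed values, not the paper's."""
    l1 = np.abs(d1 - d2).mean()
    return 1.0 / (1.0 + np.exp(steepness * (l1 - midpoint)))

d = np.linspace(0.0, 1.0, 80)
print(similarity(d, d) > 0.9)        # identical descriptors -> high probability
print(similarity(d, 1.0 - d) < 0.1)  # very different descriptors -> low probability
```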

Our basic concept for matching a query image with a database image is to lay the descriptors of the query image over the descriptors of the database image. After building such an overlay, we measure its similarity. Because the query image is smaller than the database image, we shift the query image over the database image.

Our partitioning is similar to the one we used to calculate the SSDs. For each of these overlays we use one block. A block has as many threads as the query image has patches in the x-direction, and each thread traverses the overlay in the y-direction. In each traversal step, the thread calculates the similarity between the query descriptor and the underlying database descriptor and adds this probability to a per-thread variable sum_thread. After each thread has reached the height of the query image, a single thread adds up all sum_thread variables into a variable sum_final. The value of sum_final is then written into the result at all positions used in the overlay. Note that some descriptors are not used, and at those positions sum_final is not written. For the write-back into the result matrix we used an atomic maximum function. This is necessary because each descriptor position is included in several overlays; this way, only the overlays with the highest probabilities end up in the result matrix. Figure 4 shows this overlay-and-shift approach.
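The overlay-and-shift scheme can be sketched on the CPU as follows. This is a simplified sequential stand-in: the per-overlay double loop plays the role of the thread traversal, and `np.maximum` with `out=` stands in for the atomic write-back; the descriptor grids and the similarity function are made-up inputs:

```python
import numpy as np

def overlay_match(db, query, sim):
    """Slide the query descriptor grid over the database grid. For each
    offset, sum the pairwise similarities into one overlay score, then
    keep at every covered grid position the best score seen so far."""
    H, W = db.shape[:2]
    h, w = query.shape[:2]
    result = np.zeros((H, W))
    for dy in range(H - h + 1):
        for dx in range(W - w + 1):
            window = db[dy:dy + h, dx:dx + w]
            score = sim(query, window).sum()         # ensemble score of this overlay
            covered = result[dy:dy + h, dx:dx + w]
            np.maximum(covered, score, out=covered)  # 'atomic max' stand-in
    return result

rng = np.random.default_rng(1)
db = rng.random((8, 8, 80))             # 8x8 grid of 80-value descriptors
query = db[2:5, 3:6]                    # plant the query inside the database
l1_sim = lambda a, b: np.exp(-np.abs(a - b).mean(axis=-1))
res = overlay_match(db, query, l1_sim)
best = np.unravel_index(res.argmax(), res.shape)
print(int(best[0]), int(best[1]))       # 2 3 -- the planted offset
```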


Figure 4: This illustration shows how two ensembles of image descriptors are matched. First the descriptors are positioned on each other; the green and the blue descriptors form an overlay. Then a probability for this overlay is calculated and written into the result. After that, the overlay is shifted in the x- and y-direction. This is done in parallel for multiple blocks; in this example, 6 blocks with 3 threads each are needed.

Figure 5: Illustration of the neighborhood search with radius 1.

To account for local deformations, we integrated a neighborhood search, shown in Figure 5. When building an overlay, each thread compares query descriptors with the corresponding database descriptors within a specified radius. The similarity should decrease with increasing distance, so we again use a sigmoid function, and we integrate this approach by multiplying the probability of similar descriptor values with the probability of similar descriptor positions.
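A sketch of the neighborhood search for a single query descriptor; the position-weighting function and its parameters are assumptions for illustration (the paper uses a sigmoid but does not report its parameters):

```python
import numpy as np

def best_neighbour_sim(q_desc, db_grid, y, x, radius=2):
    """For one query descriptor aligned at grid cell (y, x), compare it
    against all database descriptors within `radius` cells; each value
    similarity is multiplied by a similarity that falls off with grid
    distance, and the best product is returned."""
    best = 0.0
    H, W = db_grid.shape[:2]
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W:
                value_sim = np.exp(-5.0 * np.abs(q_desc - db_grid[ny, nx]).mean())
                pos_sim = np.exp(-(dy * dy + dx * dx) / (radius * radius + 1.0))
                best = max(best, value_sim * pos_sim)
    return best

rng = np.random.default_rng(2)
db_grid = rng.random((6, 6, 80))
q = db_grid[3, 4].copy()
print(best_neighbour_sim(q, db_grid, 3, 4))        # exact match at distance 0 -> 1.0
print(best_neighbour_sim(q, db_grid, 2, 3) > 0.5)  # still found one cell away
```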

4.5. Visualization

For visualization we implemented a heat map, using a function that transfers each probability value into a heat-map color.

We generate a quadripartite output image: the first part is the original database image, the second the original query image, the third a visualization of the generated query descriptors, and the fourth the resulting heat map. This quadripartite view is very useful because one can easily see how adjusting parameters influences the result. Note that a quadripartite image is generated for every query image scale. From the created heat maps we also produce a combined heat map that represents their average.
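The paper does not specify the transfer function; a simple piecewise-linear blue-to-red mapping such as the following could serve as a stand-in:

```python
def heat_colour(p):
    """Map a probability p in [0..1] to an (R, G, B) heat-map colour:
    blue for cold (p = 0), green in the middle, red for hot (p = 1)."""
    r = int(255 * p)
    b = int(255 * (1 - p))
    g = int(255 * (1 - abs(2 * p - 1)))
    return (r, g, b)

print(heat_colour(0.0))  # (0, 0, 255)
print(heat_colour(0.5))  # (127, 255, 127)
print(heat_colour(1.0))  # (255, 0, 0)
```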

5. Results

As mentioned before, our output image is a collage of four images. In this section we present some of our results. We used the ETHZ shape dataset [dat11] for the query and database images in Figures 6, 7, 8, and 9. In the early stage of our implementation we used the pictures shown in Figure 10; this was a simplified testing environment because there the query image was extracted directly from the database image.

We also implemented a visualization of the used and the discarded descriptors: the used ones are white and the discarded ones are black. Figure 6 (e) shows a visualization of the database descriptors.

Figures 6, 7, 8, and 9 all show results of a more challenging apple-logo retrieval test. For the query image we used a simple black-and-white apple logo. The large homogeneously colored areas in the query do not provide enough self-similarity information. Descriptors whose summed values are below 20 or above 70 are discarded (the maximum is 80); the resulting descriptors therefore cover only the edges of the apple. For the neighborhood search we achieved the best results with radius 2, which means our matching algorithm has a tolerance of a 25x25-pixel square at each point (5 pixels for the center patch plus 2x5 pixels to the right, left, bottom, and top).

The algorithm only produces a good match if the pattern in the query is at a similar size as in the database image.

6. Discussion

To create a self-similarity descriptor it is necessary to have a surrounding box around a grid element. We chose a side length of 45 pixels for this square box. Because of that, an image must be at least 45x45 pixels for even one descriptor to be extracted. Note that the surrounding box causes a 20-pixel-wide border on the images in which no descriptors are calculated.

The maximum size of a query image, as well as of a database image, is restricted to 1280x1280 pixels. This is because NVIDIA GPUs only allow up to 65536 blocks (√65536 · 5 = 1280).
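The bound can be checked with a quick calculation (one block per 5x5 patch, at most 65536 blocks):

```python
import math

blocks = 65536                          # assumed maximum number of blocks
patches_per_axis = math.isqrt(blocks)   # 256 patches along each image axis
print(patches_per_axis * 5)             # 1280 -> maximum image side in pixels
```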

Furthermore, the query image needs to be smaller than the database image. If the query image had the same size as the database image, there would be only one overlay, and a match could not be detected when the pattern is located in different areas of the two input pictures.

We measured the performance of our parallel descriptor creation against a CPU version that is included in the OpenCV package [ope11]. For the comparison we calculated the descriptors at multiple scales, once with our implementation and once with the implementation in OpenCV. Figure 11 shows that our speedup increases with larger images, because more blocks are used as the image size increases. The maximum speedup is a factor of 32, at an image size of 1024x1024 pixels.

Figure 12 shows that our descriptor implementation scales very well across different GPUs: the GTX 480 with 480 cores is almost twice as fast as the GTX 285 with 240 cores.

We also measured the time for matching two images at different query scales; this is shown in Figure 13. An unintuitive observation is that the time first increases up to a query size of 400x400 pixels and then decreases again. This is because with a larger query size the number of blocks decreases while the number of threads increases.

Figure 11: Comparison of our descriptor implementation with the OpenCV implementation. The speedup of our version increases with larger images; the maximum speedup is a factor of 32.

Figure 12: This graph shows how the calculation time for the descriptors varies on different GPUs. On the GTX 480, which has twice as many cores as the GTX 285, the performance gain is around a factor of 2.

7. Conclusion and Future Work

We presented a parallel version of the “local self-similarity” descriptor introduced in [SI07]. By using the GPU we achieved a 32x speedup on a single GTX 480 in comparison to the OpenCV CPU implementation. This speedup can be used to process higher-resolution images or to perform object detection in real-time applications.

However, there are still a lot of improvements that can be made. In order to improve the matching results, the sparseness measure from [Hoy04] could be used. Also, there are a lot of parameters that can be tweaked in order to improve the results. Regarding performance, one could use more threads per thread block for the SSD calculation; thereby our descriptor algorithm could handle images larger than 1280x1280 pixels. The matching could be accelerated in a similar way.

Figure 13: Comparison for matching two images at different query scales. The database image size is fixed. The GTX 480 is about two times faster than the GTX 285.

(a)

(b)

(c)

(d)

(e) Database descriptor (f) Final map

Figure 6: Pictures (a)-(d) show the quadripartite output images. Such a quadripartite result picture is useful for adjusting parameters because one can easily see, for example, how a changed descriptor influences the heat map. Image (e) shows a visualization of the database descriptors, of which only the white ones are used. Image (f) shows a combined heat map of the heat maps from images (a)-(d).

(a)

(b)

(c)

(d)

Figure 7: These pictures show another apple-logo recognition. A match is only possible if the patterns in the query and the database image are almost the same size.

References

[BBGI10] BAGON S., BROSTOVSKI O., GALUN M., IRANI M.: Detecting and sketching the common. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010), pp. 33–40.

[BI07] BOIMAN O., IRANI M.: Detecting irregularities in images and in video. International Journal of Computer Vision 74, 1 (Jan. 2007), 17–31.

[dat11] ETHZ - Computer Vision Lab: Datasets. http://www.vision.ee.ethz.ch/datasets/index.en.html, 2011. [Online; accessed 21-March-2011].

[Hoy04] HOYER P. O.: Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5 (2004), 1457–1469.

[MS05] MIKOLAJCZYK K., SCHMID C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 10 (2005), 1615–1630.

[ope11] Welcome - OpenCV Wiki. http://opencv.willowgarage.com/wiki/, 2011. [Online; accessed 21-March-2011].

[SI07] SHECHTMAN E., IRANI M.: Matching local self-similarities across images and videos. In 2007 IEEE Conference on Computer Vision and Pattern Recognition (June 2007), pp. 1–8.

[WFG∗09] WANG Y., FENG Z., GUO H., HE C., YANG Y.: Scene recognition acceleration using CUDA and OpenMP. Computer, 3 (2009), 1422–1425.

[XHE∗10] XIAO J., HAYS J., EHINGER K. A., OLIVA A., TORRALBA A.: SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2010), pp. 3485–3492.


(a)

(b)

(c)

(d)

(e) Database descriptor (f) Final map

Figure 8: Our approach also works with quite small patterns in real-life pictures.


(a)

(b)

(c)

(d) Database descriptor (e) Final map

Figure 9: Here, pictures (a)-(c) are matched correctly; only in the final map is the head area slightly mismatched.


(a)

(b)

(c)

(d)

(e) Database descriptor (f) Final map

Figure 10: For this set of result pictures we used a cutout of the database image as the query image. We used this query image in the early implementation stage.
