improved spatial pyramid matching for image...

Improved Spatial Pyramid Matching

for Image Classification

Mohammad Shahiduzzaman, Dengsheng Zhang, and Guojun Lu

Gippsland School of IT, Monash University, Australia{Shahid.Zaman,Dengsheng.Zhang,Guojun.Lu}@monash.edu

Abstract. Spatial analysis of salient feature points has been shown to bepromising in image analysis and classification. In the past, spatial pyramidmatching makes use of both of salient feature points and spatial multires-olution blocks to match between images. However, it is shown that differ-ent images or blocks can still have similar features using spatial pyramidmatching. The analysis and matching will be more accurate in scale space.In this paper, we propose to do spatial pyramid matching in scale space.Specifically, pyramid match histograms are computed in multiple scales torefine the kernel for support vector machine classification. We show thatthe combination of salient point features, scale space and spatial pyramidmatching improves the original spatial pyramid matching significantly.

1 Introduction

Image classification has attracted large amount of research interest in the pastfew decades due to the ever increasing digital image data generated around theworld. Traditionally, images are represented and retrieved using low level fea-tures. Recently, machine learning tools have been widely used to classify imagesinto semantic categories. Now low level features can be used more efficiently thanever. Image classification is an important application in computer vision. Ourresearch goal is to improve methods for Image classification, more specificallynatural scene images or images with some spatial configurations. We want toclassify an image based on its semantic category of a scene like forest, road orbuilding etc. Our approach to whole image categorization employs to renownedtechniques namely Spatial Pyramid Matching (SPM) [1] and scale space theory.Our objective is to combine the power of these two methods.

In this paper, scene categorization is attempted by global image representationdeveloped from low level image properties. There is another approach for thistask that is to get idea of high level semantic attributes by segmentation ofobjects on the scene (like bed or car) and classify the scene accordingly. Webelieve scene classification can be done without extracting this high level objectcues. This is inspired by the publications of [2] where they proved that peoplecan recognize natural scenes while overlooking most of the details in it (i.e.the constituent objects). In another publication [3] it is also shown that globalinformation is as important as local information for scene classification by humansubjects.

R. Kimmel, R. Klette, and A. Sugimoto (Eds.): ACCV 2010, Part IV, LNCS 6495, pp. 449–459, 2011.� Springer-Verlag Berlin Heidelberg 2011

450 M. Shahiduzzaman, D. Zhang, and G. Lu

Scale is an important aspect of local feature finding in prominent cue detectionin images. The most prominent example of using scale space and characteristicsscale is the local invariant feature detector SIFT [4]. In SIFT the authors usedmaxima/minima of neighboring scale space to find the interest points or keypoints of an image. Scene features like sands in a beach or certain textures in thecurtain of a room would be more evident in bigger scales. Scale-space theory is aframework for multi-scale signal representation. It is a formal theory for handlingimage structures at different scales, by representing an image as a one-parameterfamily of smoothed images, the scale-space representation, parameterized by thesize of the smoothing kernel used for suppressing fine-scale structures [5].

In recent years the bag-of-features (BoF) model has been extremely popular inimage categorization. The method treats an image as a collection of unorderedappearance descriptors extracted from local patches. Then the patches or de-scriptors are quantized into discrete visual words of a codebook dictionary, andthen the image histograms are compared and classified according to the dictio-nary. The BoF approach discards the spatial order of local descriptors, whichseverely limits the descriptive power of the image representation. By overcomingthis problem, one particular extension of the BoF model, called spatial pyra-mid matching (SPM) [1], has made a remarkable success on a range of imageclassification benchmarks and was the major component of the state-of-the-artsystems, e.g., [6].

Our method is based on SPM. Similarly like SPM we have used the subdivideand disorder principle. The essence of this principle is to partition the imageinto smaller blocks and calculate orderless statistics of low level image features.Existing methods differs by the use of features (like pixel value, gradient orien-tation, and filter bank outputs) and the subdivision method (regular grid, quadtrees, and flexible image windows). SPM and as well as our method is indepen-dent in choice of features, anyone can plug any other type of features to get aclassification result. Authors of [7] offered an early insight into subdivide andprinciple by suggesting that locally orderless image play an important role in vi-sual perception. While SPM authors did not consider their Gaussian scale spaceof apertures, we integrated that idea into SPM. Importance of locally orderlessstatistics is also evident from few recent publications.

To summarize, our method provides a unified framework to combine the gainsfrom subdivide and disorder principle and scale space aperture with a choiceof low level features. It will enable to combine the locally orderless statisticsresults from multiple scales and different fixed hierarchy or rectangular windowsto achieve the scene classification task.

2 Related Methods

In this work we combine the power of multiresolution histogram with spatialpyramid matching. So our method consists of two concepts - multiresolutionor scale space analysis of image and spatial pyramid matching. In kernel basedlearning methods like support vector machine (SVM), we need to provide a

Improved Spatial Pyramid Matching for Image Classification 451

Fig. 1. Schematic illustration of Pyramid match kernel with two levels

kernel for learning and testing. There are many kernels, which varies in formula-tion. For example, histogram intersection kernel is a kernel matrix which isbuilt by histogram intersection. Essentially it provides a pair wise similarity mea-sure of the training and testing images. A pyramid match kernel (PMK) [1]works with an unordered image representation/features. The idea of the methodis to compute multiresolution histograms and finding the histogram intersectionat each resolution. In figure 1, for two different images X and Y, histogramsand the corresponding histogram intersections are computed at three resolutionlevels (0,1,2). The bin size is doubled in successive higher resolutions while thebin numbers are down sampled by 2. After that, all new histogram matchingin each resolution is weighted and summed up to form the histogram intersec-tion kernel. It has the limitation of discarding all spatial information. Let usconstruct a sequence of grids at resolutions 0,1,. . . ,L such that the grid at levell has 2l cells along each dimension. Number of matches (I l) at level l is givenby the histogram intersection function. Therefore, the number of new matchesfound at level l is given by I l − I l+1 for l = 0,1,. . . ,L-1. The weight associatedwith level l is set to 1

(2L−l) .Spatial pyramid matching (SPM) takes a different approach of performing

pyramid matching in the two-dimensional image space, and using traditionalclustering techniques in feature space. So in SPM the histogram computationis done at a single resolution and in multiple pyramid levels within the sameresolution, whereas in PMK it is done in multiresolution. PMK dont employ anyfeature clustering, directly map features in multiresolution histogram bins. Onthe other hand, SPM uses feature clustering during histogram computation tofind the representative feature sets. In SPM, all feature vectors are first quantizedinto M discrete types (i.e. the total number of histogram indices is M).

In figure 2, we are showing an example of constructing a three-level spatialpyramid. The image has three types of features, indicated by triangles, circlesand stars. At the top row, the image is subdivided at three different levels ofresolution. At the bottom row, the number of features that fall in each sub-region is counted. The spatial histograms are weighted according to pyramid


Fig. 2. Three-level spatial pyramid example

match kernel. During kernel computation, each type calculation comprised oftwo sets of two- dimensional vectors, Xm and Ym, representing the coordinatesof features of type m found in the respective images. The final kernel is then thesum of the separate channel kernels:

KL(X, Y ) =M∑

m=1

KL(Xm, Ym) (1)

This method reduces to a standard bag of features when it is a single level.Considering the fact that pyramid match kernel is simply a weighted sum ofhistogram intersections, and c × min(a, b) = min(ca, cb) for positive numbers,KL can be implemented as a single histogram intersection of long vectors formedby concatenating the appropriately weighted histograms of all channels at allresolutions. So essentially we are weighting the histograms before computing thehistogram intersection for convenience as the reverse would yield the same result.For L levels and M channels and S scales, the resulting vector has dimensionality:

(ML∑

l=1

4l) × S = M13(4L+1 − 1) × S (2)

Several experiments reported in results section use the settings of M = 200, L = 3and S = 3 resulting in (3×17000) -dimensional histogram intersections. However


these operations are efficient because the histogram vectors are extremely sparse,the computational complexity of the kernel is linear in the number of features.

One important aspect of the training and test images that we run the exper-iment only on gray level images; even if color images are available we convertedin to gray level images. We decide this from the finding of [9] that removingcolor information from images doesnt make the scene categorization tasks moreattention demanding.

3 Proposed Method: Multi-scale SPM

SPM uses a mechanism to combine local salient features and their spatial re-lationship so as to provide a robust feature matching. However, in many cases,different image and block can have similar histograms, this degrade the perfor-mance of SPM. This drawback can be overcome by analyzing images in scalespace, as confusions in previous case can be clarified at different scales. Forexample, in figure 3, images (a) and (b) are artificially generated images withalmost similar histograms, later they are Gaussian blurred and hence their his-tograms are also more discriminative than the original histograms. For a givenimage f(x,y), its linear (Gaussian) scale-space representation is a family of de-rived signals L(x,y;t) defined by the convolution of f(x,y) with the Gaussiankernel:

gtgtgt(x, y) =1

2πteee

−(x2+y2)2t Such that LLL(x, y; t) = (gtgtgt × fff)(x, y) (3)

Inspired by scale space theory we want to propose a multi-scale spatial pyramidmatching method. Key idea behind our method is the use of scale space to gain

(a) (b) (c) (d)

(e) (f) (g) (h)

Fig. 3. (a) and (c) are different images with almost similar image histograms (b) and(d). (e) and (g) are corresponding Gaussian blurred images and the previous smalldifference in histograms is now more prominent in higher scales(f and g).


Fig. 4. Block diagram of the proposed method

more discriminative power in classification. The major steps of our algorithmare (figure 4).

3.1 Feature Generation in Different Scales

First SIFT features are generated from all the images in different scales in aregular grid. Here a dense feature representation is used to avoid the problemssuperfluous data like clutter, occlusion etc. 128 bit SIFT descriptors are calcu-lated for all images in all scales in 8*8 regular grid settings and using a 16*16patch in the grid centers. These features are saved into files for use in later steps.

3.2 Calculate Dictionary

The features are clustered according to the parameter M which is the total num-ber of bins in of the computed histograms. It is often believed that increasing thenumber of M will increase the classification accuracy. But, in our experimentswe are getting comparable accuracy from M=200 setup compared to M=400 andM=600. Again the dictionary is built for all images in all scales. Dictionary iscalculated using K-means based clustering using all the extracted SIFT featuresin a specific scale. In figure 5 (left image), we are showing the correspondinghistogram of the values of a 200 sized dictionary. Separate dictionaries are cal-culated for separate scales. The dictionaries are calculated for using in histogramgeneration in later stages.


Fig. 5. Histogram plot of the calculated dictionary (left) and combined pyramidhistogram plot of all individual histograms in different levels (right)

3.3 Compile Pyramid Histogram

For all scales, the image is divided ranging from coarse to finer resolution andcompute histogram in each area and assign weight according to PMK. Match infiner resolution will be given more weight than match in coarse resolution. Afterthese steps now we have all the data required to build the pyramid histogram.With the different scale level histograms, we can just concatenate those forminga long histogram or compute inter-scale intersection/selection before formingthe concatenation. We are taking the first approach in our method. Though thiswill essentially increase the size of the long histogram by the scale factor, butthat wouldnt be a problem performance-wise. In this research our focus is onincreasing classification accuracy and leveraging performance on the currentlyavailable powerful hardware. In figure 5 (right image), one such combined pyra-mid histogram is shown. According to equation 2, size of the histogram is 34000for dictionary size 200, 3 pyramid levels and scale level 1.

3.4 Kernel Computation and SVM Classification

For SVM, we just need to build the histogram intersection kernel from the com-piled pyramid histograms. As we explained before, for the histogram intersectionkernel computation we just need to find the intersections of the long histogramconcatenation formed in the previous step. For training kernel intersection iscomputed between the same concatenated histograms and for training kernel itis between training histogram and testing histogram. A grey scale image map ofthe testing and training kernel is shown in figure 6. For training kernel, a whiteline is visible along the diagonal, as there will be a perfect match for correspond-ing training pairs. In testing kernel the matches are scattered as training andtesting sets are different. For SVM, we are using a modified version of libSVMlibrary [10] which implements the one vs. all classification. scales and differentfixed hierarchy or rectangular windows to achieve the scene classification task.


Fig. 6. Histogram intersection kernel as image for Training images (left) and testingimages (right)

4 Experimental Results

4.1 Test Dataset

We tested our method on scene category dataset [1], Caltech-101 [11] and Caltech-256 [12]. A brief statistical comparison of these three datasets is given in table 1.

4.2 Performance Metric

Two separate performance metric is used to measure the results combined ac-curacy and average of per class accuracy. Per class accuracy (P) is defined asthe ratio of correctly classified images in a class with respect total number ofimages in that particular class. If total number of image categories is N, thencombined accuracy and average of per class accuracy is defined as:

Average of per class accuracy =∑N

i=1 Pi

N(4)

Combined accuracy =Total number of correctly classified images × 100

Total number of images in the dataset(5)

Table 1. Statistical information of the image datasets used

Dataset No. of Total No. of Avg. image Max. no. of train/testcategories images size images used

Scene category 15 4485 300*250 100/restCaltech-101 102 9144 300*200 30/300Caltech-256 257 30607 351*300 60/300


Table 2. Accuracy results on different combination of parameters. Bold font meansits the best for a certain codebook size and pyramid level.

Codebook Pyramid Scale Combined Avg. of per classSize level level accuracy (%) accuracy (%)

200 3 1 81.47 ± 0.59 81.11 ± 0.68200 3 2 83.69 ± 0.50 83.31 ± 0.59200 3 3 83.45 ± 0.57 83.21 ± 0.61

200 2 1 79.88 ± 0.52 81.1 ± 0.30200 2 2 82.69 ± 0.67 82.25 ± 0.52200 2 3 82.78 ± 0.70 82.21 ± 0.75

400 3 1 81.95 ± 0.57 81.1 ± 0.60400 3 2 83.78 ± 0.64 83.48 ± 0.58400 3 3 83.71 ± 0.54 83.29 ± 0.70

400 2 1 80.28 ± 0.53 81.4 ± 0.50400 2 2 83.22 ± 0.44 82.75 ± 0.40400 2 3 83.10 ± 0.63 82.67 ± 0.78

Table 3. Our result compared to the original SPM for codebook size = 400, pyramidlevel = 3 and scale level = 2

SPM [1] Proposed method

Average of per class accuracy(%) 81.1 ± 0.60 83.48 ± 0.58Combined accuracy(%) 81.95 ± 0.57 83.78 ± 0.64

Table 4. Caltech-101 result for codebook size=400, pyramid level=3 and scale level=3



Table 5. Caltech-256 result for codebook size=400, pyramid level=3 and scale level=3



Table 2 is the extensive experiment donewith codebook size, pyramid level, scalelevel. Results are first grouped by codebook size and pyramid levels. The notablething here is that, scale level greater than one always produce better results thansingle level. Using the combined accuracy metric, we get our best result from code-book size 400, pyramid level 3 and scale level 2. Scale level 1 is basically the originalSPM. So for scale level 1, we use the results from [1]. But as the authors of [1] didn’treport the result of combined accuracy,we calculated it using our own implementa-tion of SPM. All results are obtained using a 2*64 bit Quad core processor with 48


Fig. 7. Per class accuracy for the result (average of per class accuracy) reported inTable 2

GB of RAM. All experiments are run for ten times with randomly selected trainingand testing images. The average of all the runs and standard deviation is reportedhere. Table 3 summarizes our best result compared to the original SPM. In figure7, we showed the per class accuracy for the best result reported in Table 4. Ourmethod outperforms SPM in eleven categories and provides comparable perfor-mance in the four categories. We tested whether the difference between two meth-ods reported in table 2 is statistically significant by the Matlab function ttest. Inthis case, ttest result indicated that the improvement obtained the by the proposedmethod is indeed statistically significant. The results on Caltech-101 and Caltech-256 are presented in table 4, 5 and it is in line with the results obtained from scenecategory dataset. On both of these databases, according to overall average accu-racy metric, proposed method is better than SPM by around 3% margin and usingthe average of per class accuracy metric, the margin is around 6%.

5 Conclusion and Future Scope

This paper presents an improvement to the spatial pyramid matching scheme.We provided a simple, intuitive and effective way to improve the SPM method.


To the best of our knowledge, this has not been done by previous researchers.The proposed extension is quite general and not limited to any specific featuredescriptors or classifiers and can be used as a surrogate module or new baselinefor SPM in image categorization systems.

The weight mechanism of the spatial pyramid matching (SPM) method is notsophisticated enough. It defines uniform and better weight level to the finer reso-lution blocks and punishes the coarse resolution blocks by assigning less weight.As a basic method this is okay, but consider a finer resolution block containingonly background or clutter, then assigning it more weight is only misleading cal-culation. So in the future, there is room for redesigning this weight mechanismto only assigning more weight to the corresponding blocks irrespective of scaleor spatial resolution.

References

1. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial PyramidMatching for Recognizing Natural Scene Categories. In: Proceedings of the IEEEConference on Computer Vision and Pattern Recognition, vol. 2, pp. 2196–2178(2006)

2. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representationof the spatial envelope. International Journal of Computer Vision 42(3), 145–175(2001)

3. Ogel, J., Schwaninger, A., Wallraven, C., Bulthoff, H.H.: Categorization of NaturalScenes: Local versus Global Information and the Role of Color. ACM Transactionson Applied Perception 4(3) (2007)

4. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. Interna-tional Journal of Computer Vision 60(3), 91–110 (2004)

5. Witkin, A.P.: Scale-space filtering. In: Proceedings of 8th International Joint Con-ference on Artificial Intelligence, pp. 1019–1022 (1983)

6. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., Zisserman, A.: ThePASCAL Visual Object Classes Challenge. In: VOC 2009 (2009),http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.

html7. Koenderink, J., Doorn, A.V.: The structure of locally orderless images. Interna-

tional Journal of Computer Vision 31(199), 159–1688. Grauman, K., Darrell, T.: The Pyramid Match Kernel: Discriminative Classifi-

cation with Sets of Image Features. In: Proceedings of the IEEE InternationalConference on Computer Vision, ICCV (2005)

9. Fei-fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene cat-egories. In: Proceedings of the IEEE Conference on Computer Vision and PatternRecognition (2005)

10. Chang C., Lin C.: LIBSVM: a library for support vector machines (2001),http://www.csie.ntu.edu.tw/~cjlin/libsvm

11. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from fewtraining examples: an incremental Bayesian approach tested on 101 object cat-egories. In: Proceedings of IEEE Workshop on Generative-Model Based Vision,CVPR (2004)

12. Griffin, G., Holub, A., Perona, P.: Caltech-256 Object Category Dataset. CaltechTechnical Report. Technical Report, Caltech (2007)

http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html

http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html

http://www.csie.ntu.edu.tw/~cjlin/libsvm

improved spatial pyramid matching for image...

Documents