tscene classification using pyramid histogram of

International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.4, August 2014

DOI:10.5121/ijcsa.2014.4402 15

Scene Classification Using Pyramid Histogram of

Multi-Scale Block Local Binary Pattern

Dipankar Das

Department of Information and Communication Engineering, University of Rajshahi,

Rajshahi-6205, Bangladesh.

ABSTRACT

Pyramid Histogram of Multi-scale Block Local Binary Pattern (PH-MBLBP) descriptor for recognizing

scene categories, is presented in this paper. We show that scene categorization, especially for indoor and

outdoor environments, requires its visual descriptor to process properties that are different from other

vision domains (e.g., SIFT descriptor used for object categorization). Our proposed PH-MBLBP satisfies

these properties and suits the scene categorization task. Since the proposed PH-MBLBP mainly encodes

micro- and macro-structures of image patterns, thus, it provides relatively more complete image descriptor

than the basic LBP operator. Moreover, our PH-MBLBP descriptor is more powerful texture descriptor

than the conventional operator and it can also be calculated extremely fast. Our experiments demonstrate

that PH-MBLBP outperforms the other descriptor such as SIFT.

KEYWORDS

Scene recognition, visual descriptor, local binary pattern, SIFT

1.INTRODUCTION

Humans are very proficient at perceiving natural scenes and understanding their contents.

However, scene classes, such as forest, building, office, mountain etc., are very ambiguous and

change a wide range of illumination and scale condition. Thus, in computer vision, it is very

difficult to classify such scenes categories into different classes. In the literature [1, 2], two basic

strategies can be found for classifying scene categories. The first approach uses low-level features

such as texture, colour, power spectrum, etc. This type of approaches have been proposed in [3,

4], where the scene is considered as an individual object. In [3, 4], the authors consider only a

small number of scene categories (city versus landscape, indoor versus outdoor, etc.). On the

other hand, the second approach as proposed in [5, 6, 7], uses an intermediate representation of

scene and then uses a classifier for classifying the scene into different categories. This strategy

has been shown to classify a larger number (up to 15) of scene categories. In this paper we

propose a classification algorithm based on spatial Pyramid Histogram of Multi-scale Block

Local Binary Pattern (PH-MBLBP) to represent the natural scene categories that can be used to

classify a large number of scene classes.

In recent years, several image representation techniques such as, probabilistic latent semantic

analysis (pLSA) model [8], part-based model [9], bag of visual words (BoW) model [10, 1], etc.

are used for scene categorization. Among these methods, the BoW model proposed by Anna

Bosch et al. [1] has shown an excellent performance due to its robustness to scale, translation and

rotation invariance. However, the BoW model is less discriminative for texture classification [11].


16

Moreover, natural images are rich in texture information, hence, how to classify the texture

images effectively is of challenge for scene classification problem. GIST descriptor, a holistic

representation of the spatial envelope, proposed by Oliva and Torralba [6], is an effective texture

descriptor. The proposed method achieved high accuracy in recognizing natural scene categories

such as mountain and forest. However, when classes of indoor environment are added, its

performance drops dramatically [12].

Local Binary Patterns (LBP) [13] is one of the best performing texture descriptor, and it has been

widely used in various applications such as scene classification, object recognition, and face

recognition. In the last few years, various extensions are made for the conventional LBP

descriptors [14, 15, 16]. Since basic LBP and its simple extension is very easy to implement and

produces very good performance on some selected object categories, thus it has been used by

several researchers for classifying scene classes and recognizing human face . In [16], Zhang et

al. propose basic LBP in Gabor transform domain. The combination of Gabor transform and LBP

descriptors improves the discriminative power of LBP descriptors. Recently, Wu et al. [12]

propose a Census transformation based approach which is equivalent to LBP code to recognize

scene categories. However, the census transformation based approach proposed in [12], uses only

gray level information of the image and is not invariant to rotation. Liao et al. [14] propose a

Dominant local binary pattern (DLBP) which is an extension of basic uniform LBP. Their main

target is to capture the dominating patterns in texture images by counting the co-occurrence

frequencies of all rotation invariant patterns defined in LBP groups. The DLBP proposed in [14]

is very useful in describing images with texture characteristic. The method also invariant in

curvature edges, complicated shapes, crossing boundaries and corners. However, most of the

previous representation methods have the common disadvantage of poorly capturing the intrinsic

multi-scale and multi-orientation geometric information of objects within images.

The basic LBP is proved as a successful and powerful local descriptor for micro-structures of

images. In the basic LBP operator, we replace the pixels values of the 3× 3-neighborhood of an

image by the thresholding of each pixel with respect to the center pixel and taking the result as a

“binary string” or a “decimal number”. However, the basic LBP operator has the following

disadvantages for classifying scene categories: (i) noisy image produces very worse result for

basic LBP operator because it is constructed in very small spatial support area, and (ii) the

dominant features of the scene (macro-structure) may be ignored in basic LBP operator because it

is calculated in the local 3×3 neighbourhood areas.

Thus, in this research, we propose an alternative representation technique, called spatial Pyramid

Histogram of Multi-scale Block LBP (PH-MBLBP). Our proposed approach overcomes some of

the limitations of basic LBP operator and successfully applies to scene categorization task. Our

main target of this research is to represent an image of a scene by its local feature in micro-

structure and macro-structure, and the spatial layout of these structures. In our approach the

micro-structure of a scene is represented by using the basic LBP and macro-structure is captured

by the average values of block sub-regions, instead of individual pixels. Then the spatial layout of

these micro- and macro-structure features is represented by tiling the image into regions at

multiple resolutions. Thus, the proposed PH-MBLBP descriptor presents several advantages: (1)

spatial pyramid histogram of MBLBP is used to capture geometric information by tiling the

image into regions at multiple resolutions, and (2) MBLBP perform better than the basic LBP for

the following reasons- (i) MBLBP encodes both micro- and micro-structures of scene image

patterns, and therefore it provides a more reliable image representation technique than the original

LBP operator; and (ii) we can efficiently compute the MBLBP using an integral image

representation techniques.


17

2. PYRAMID HISTOGRAM OF MBLBP

In this research, the scene image is represented by the spatial pyramid histogram of multi-scale

block local binary pattern (MBLBP). First, the approach computes the block local binary pattern

of the image at different scale. Then MBLBP is represented by the pyramid histogram.

2.1. Multi-Scale Block Local Binary Pattern (MBLBP)

The original local binary pattern (LBP) is a form of non-parametric local transform originally

developed for finding correspondence between local patches. The LBP maps the local

neighbourhood surrounding a pixel, P, to a “bit string” representing the set of neighbouring pixels

in a square window, as illustrated in Error! Reference source not found.. If the intensity

value of the pixel P is less than or equal to one of its neighbouring pixels, a bit `1' is set in the

corresponding position. Otherwise a bit `0' is set. In this intensity comparison, we generate 8-bit

for the corresponding 8-pixel positions. Finally, we put together the generated 8-bit in any order

to produce the LBP code. Here we arrange bits from left to right and from top to bottom. The

converted LBP value is a base-10 number in the range {0⋯⋯255}.

Figure 1: The basic LBP coding

Then to generate a texture descriptor of an image, we construct a histogram of the LBP code.

Error! Reference source not found. shows an illustration of the basic LBP coding technique.

Multi-scale LBP [17] is an extension to the basic LBP, with respect to neighbourhoods of

different sizes.

Figure 2: (a) The MBLBP operator with block size 3×3, (b) Thresholding binary value, (c) Binary code,

(d) Corresponding decimal value.

In this paper, we propose a distinctive rectangle features, called Multi-scale Block Local Binary

Pattern (MBLBP) feature. In MBLBP, we first calculate the average intensity values of

rectangular sub-regions as shown in Error! Reference source not found.(a). Then we

compare these average intensity values with respect to the center value in a similar way of the

basic LBP to generate the MBLBP code (Error! Reference source not found.(b)-(d)). Each

sub-region is a square block containing neighbouring pixels (or just one pixel, in this case it is

original LBP). Formally, given a pixel(� , �), the resulting MBLBP can be expressed in decimal

form as:

(1)

40 44 49

38 45 50

38 39 49

0 0 1

0 1

0 0 1

(0011100) 5610

P

(a) (b) (c) (d)

7 7 5 6 8 7 9 7 8

(00001010)2 1010

0 0 0

0 0

1 0 1


18

where i covers the 8 neighbours rectangular sub-regions, �(� � 1,⋯ ,8) are the average

intensities of those rectangular sub-regions, and is the average intensity of the center

rectangular sub-region. The function �(�) is defined as:

(2)

A more detailed description of the MBLBP operator is illustrated in Error! Reference source

not found.. The whole MBLBP is composed of 9 blocks. We take the size s of the block as a

parameter, and s×s denoting the scale of the block size of MBLBP operator (particularly, the

MBLBP with block size 1×1 is in fact the original LBP). In this research, the average pixel value

of each block is computed using the integral image as proposed in[18]. For this reason, we can

compute MBLBP code in a very fast way. Thus the complexity of the proposed MBLBP coding

takes a little more time than the basic LBP operator.

The MBLBP feature relies solely upon the set of block comparisons, and is therefore robust to

illumination changes, gamma variation, etc. The MBLBP retains global structures of an image

besides capturing the local structures. An original image and its MBLBP image with different

block sizes (pixel values are replaced by its LBP values) are shown in Error! Reference

source not found.(a)-(d).

Figure 3: (a) Original image, (b) MBLBP with block size 1×1 (i.e. original LBP),

(c) MBLBP with block size 3×3, (b) MBLBP with block size 9×9.

In MBLBP, the LBP values of neighbouring pixels are highly correlated and there exist some

direct and indirect constrains posed by the center pixels. The transitive property of such

constrains make them propagate to pixels that are far apart. The propagated constrains make LBP

values and their histograms implicitly contain information for describing global structures. As

(c) (d)

(a) (b)


19

shown in Fig. 3 (b)-(d), the LBP operation maps any 3×3 pixels into one of 28

cases each

corresponding to a special type of local structure of pixel intensities. Here the LBP value acts as

an index to these different local structures. We compute histograms of MBLBP values at different

pyramid levels on the scene image. Finally, these histograms are used as the visual descriptors of

the scene.

2.2. Spatial Pyramid Histogram of MBLBP

Our objective is to represent an image by its local small and large scale structures, and the spatial

layout of the structure. In this research, we first use the MBLBP with different size of windows to

represent local small and large scale structure. Then, the spatial layout of the features is

represented by dividing the image into regions at multiple resolutions. The idea is illustrated in

Error! Reference source not found.. The descriptor consists of a histogram of MBLBP

features over each image sub-region at each resolution level – a Pyramid of Histograms of

MBLBP (PH-MBLBP). The distance between two PH-MBLBP image descriptors then reflects

the extent to which the images contain similar structure and the extent to which the structure

correspond in their spatial layout.

Figure 4: Spatial Pyramid framework. The image is recursively split and histogram of MBLBP image is

then computed for each grid cell at each pyramid resolution level. (a) Histogram of MBLBP, (b) histogram

with 2-level spatial pyramid, and (c) histogram with 3-level spatial pyramid.

(a)

+

+ +

(b)

(c)


20

To construct the pyramid representation, we repetitively double the number of sub-divisions in

both axis directions. In each sub-division, we count the number of co-occurrence frequency. Our

representation is called pyramid representation because the number of co-occurrence frequency in

a sub-division at one level is simply the sum over those contained in the 4-sub-division it is

divided into at the next level.

If we denote a pyramid level by r, then it has 2r sub-divisions along each dimension. Therefore, if

our MBLBP histogram is represented by K bins, then the level 0 is represented by K feature

vectors, level 1 by 4K feature vectors, and so on. The dimension of the PH-MBLBP descriptor for

the entire image is given by:

�� ∑ 4��∈! (3)

were R represent the total number of pyramid level. For an illustration, let R = 1, and K = 100

bins, then the dimension of the PH-MBLBP descriptor will be a 500-vector.

2. TRAINING

In this research, the LIBLINEAR classifier [19] is learned with PH-MBLBP feature vector. Since

our database consists of large number of images with high dimensional features vector, thus we

use LIBLINEAR classifier as it is shown to be effective for large sparse data with large number

of instances [19]. Moreover, LIBLINEAR is an easy-to-use tool to deal with such data.

LIBLINEAR classifier inherits many important characteristics (such as open source, well written

documentation, simple to use) of the most widely used support vector machine (SVM) classifier

known as LIBSVM [20]. Another important feature of the LIBLINEAR classifier over LIBSVM

is that it takes relatively very small amount of time to learn a huge amount of examples. For

instances, in our experiment to learn more than 4000 examples, the LIBLINEAR takes several

seconds whereas the LIBSVM takes several minutes. In addition, LIBLINEAR classifier is

competitive with or in some cases it is faster than other state of the art linear classifiers such as

Pegasos [21].

2. EXPERIMENTS

In this section, we reported results on fifteen scene categories database [22]. Multi-class

classification was done with a LIBLINEAR classifier trained using the one-versus-all rule: a

classifier was learned to separate each class from the rest, and a test image was assigned the label

of the classifier with the highest response.

2.3. Database

Our experimental dataset contains 15-scene categories. The 15 class scene dataset was gradually

built. The initial 8 classes were collected by Oliva and Torralba [6], and then 5 categories were

added by Fei-Fei and Perona [5]; finally, 2 additional categories were introduced by Lazebnik et

al. [22]. The 15 scene categories are: (1) bedroom (216 images), (2) CAL-suburb (241 images),

(3) industrial (311 images), (4) kitchen (210 images), (5) livingroom (289 images), (6) MIT-cost

(360 images), (7) MIT-forest (328 images), (8) MIT-highway (260 images), (9) MIT-insidecity

(308 images), (10) MIT-mountain (374 images), (11) MIT-opencountry (410 images), (12) MIT-

street (292 images), (13) MIT-tallbuilding (356 images), (14) PAR-office (215 images), and (15)

store (315 images). The database contained a total of 4485 images and the average image size was

300×250 pixels. The majority of images of the database were collected from the Google image

search, personal photographs, and COREL image collection. In scene categorization, this is one


of the most widely used databases in literature so far.

because it contains a wide range of indoor and outdoor images

Figure 5: Retrieval

2.4. Experimental Results

In experimental result, we compared our proposed feature descriptor (PH

commonly used SIFT descriptor

methodology. We use the confusion table

success rates were calculated by the average value of the diagonal entries of the confusion table.

Figure 6: Recognition performance for 15

(1) Bedroom (2) CALsuburb

(6) MITcost (7) MITforest

(11)MITopencountry (l2) MITstreet

putational Sciences & Applications (IJCSA) Vol.4, No.4, August 2014

ed databases in literature so far. The database is popular for researchers

nge of indoor and outdoor images.

Figure 5: Retrieval of images from 15-category scene database.

compared our proposed feature descriptor (PH-MBLBP) with the most

commonly used SIFT descriptor [23]. The testing and training were performed in one

confusion table to measure the performance of the system. The

by the average value of the diagonal entries of the confusion table.

Figure 6: Recognition performance for 15-category scene database for SIFT and PH-MBLBP descriptors.

) CALsuburb (3) Industrial (4) Kitchen (5) Livingroom

) MITforest (8) MIThighway (9) MITinsidecity (10) MITmountain

MITstreet (13) MITtallbuilding (14) PARoffice (15) Store

putational Sciences & Applications (IJCSA) Vol.4, No.4, August 2014

21

The database is popular for researchers

MBLBP) with the most

. The testing and training were performed in one-versus-all

to measure the performance of the system. The overall

by the average value of the diagonal entries of the confusion table.

MBLBP descriptors.

Livingroom

) MITmountain

tore


22

2.6.1. PH-MBLBP Versus SIFT

We compared our MBLBP feature descriptor with the SIFT descriptor to evaluate the

effectiveness of the proposed descriptor.

Figure 7: Cross-category confusion matrix for PH-MBLBP descriptor.

Figure 8: Cross-category confusion matrix for SIFT descriptor.


23

We computed the dense feature for both descriptors with patch size of 10. The 1 versus all

recognition rate of MBLBP and SIFT descriptors are 82.5% and 80.08%, respectively. Figure 6 shows the recognition performance for 15-category scene database. We also computed the confusion matrix for both PH-MBLBP and SIFT descriptors. The cross-

category confusion matrices for both descriptors are shown in Error! Reference source not found. and Figure 8, respectively. Finally, we computed the ROC curve and the Area Under the

ROC Curve (AUC) for both descriptors. The ROC curve for PH-MBLBP and SIFT descriptors

are shown in Error! Reference source not found.9. The AUC accuracy for PH-MBLBP and

SIFT descriptor were 98.264% and 97.277%, respectively.

The above results and analyses indicate that our proposed approach PH-MBLBP performs better

than the SIFT descriptor for 15-category scene database.

Figure 9: The ROC curves for both SIFT and PH-MBLBP descriptors.

3.CONCLUSION

In this paper, we have proposed a pyramid histogram of multi-scale block local binary pattern

(PH-MBLBP) descriptor for scene classification. We use the LIBLINEAR classifier to classify

the scene categories. The proposed PH-MBLBP represents images as micro- and macro-

structures, and hence the representation is more robust than the basic LBP operator and other

conventional operator. The LIBLINEAR classifier successfully classifies 15-category scene

classes with high classification accuracy with the PH-MBLBP descriptor. We have shown that the

PH-MBLBP adapted to incorporate at multiple blocks has superior performance with the

conventional SIFT descriptor. The proposed PH-MBLBP performed by approximately 2.5%

higher accuracy than the SIFT descriptor on the same database.


24

REFERENCES [1] A. Bosch, A. Zisserman and X. Muñoz, (2008) "Scene classification using a hybrid

generative/discriminative approach," IEEE Transactions on Pattern Analysis and Machine Intelligence,

vol. 30, no. 4, pp. 712-727.

[2] A. Bosch, X. Muñoz and R. Marti, (2007) "A review: Which is the best way to organize/classify

images by content?," Image and Vision Computing, vol. 25, no. 6, pp. 778–791.

[3] M. Szummer and R. W. Picard, (1998) "Indoor-outdoor image classification," in In ICCV Workshop

on Content-based Access of Image and Video Databases, Bombay, India.

[4] A. Vailaya, A. Figueiredo, A. Jain and H. Zhang, (2001) "Image classification for content-based

indexing," IEEE Transactions on Image Processing, vol. 10, p. 117–129.

[5] L. Fei-Fei and P. Perona, (2005) "A bayesian hierarchical model for learning natural scene categories,"

in In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2,

pages 524–531, Washington, DC, USA, 2005., Washington, DC, USA.

[6] A. Oliva and A. Torralba, (2001) "Modeling the shape of the scene: a holistic representation of the

spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145-175.

[7] J. Vogel and B. Schiele, (2007) "Semantic modeling of natural scenes for content-based image

retrieval," International Journal of Computer Vision, vol. 72, no. 2, pp. 133-157.

[8] T. Hofmann, (2001) "Unsupervised learning by probabilistic latent semantic analysis," Machine

Learning, vol. 41, no. 2, pp. 177-196.

[9] R. Fergus, P. Perona and A. Zisserman, (2005) "A sparse object category model for efficient learning

and exhaustive recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR

05).

[10] D. Gokalp and S. Aksoy, (2007) "Scene classification using bag-of-regions representations," in IEEE

Conference on Computer Vision and Pattern Recognition (CVPR 07).

[11] X. Qian, X. Hua and L. K. P. Chen, (2011) "PLBP: An effective local binary patterns texture

descriptor with pyramid representation," Pattern Recognition, vol. 44, no. 10-11, pp. 2502-2515.

[12] J. Wu and J. M. Rehg, (2011) "CENTRIST: A Visual Descriptor for Scene Categorization," IEEE

Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1489-1501.

[13] T. Ojala, M. Pietikainen and D. Harwood, (1996) "A comparative study of texture measures with

classification based on feature distributions," Pattern Recognition, vol. 29, no. 1, pp. 51-59.

[14] S. Liao, M. Law and A. Chung, (2009) "Dominant Local Binary Patterns for Texture Classification,"

IEEE Transactions on Image Processing, vol. 18, no. 5, pp. 1107-1118.

[15] G. Zhao and M. Pietikainen, (2007) "Dynamic texture recognition using local binary patterns with an

application to facial expressions," IEEE Transactions on PAMI, vol. 29, no. 6, pp. 915-928.

[16] W. Zhang, S. Shan, W. Gao, X. Chen and H. Zhang, (2005) "Local Gabor Binary Pattern Histogram

Sequence (LGBPHS): A Novel Non-Statistical Model for Face Representation and Recognition," in

IEEE International Conference on Computer Vision (ICCV).

[17] T. Ojala, M. Pietikainen and M. Maenpaa, (2002) "Multiresolution gray-scale and rotation invariant

texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine

Intelligence, vol. 24, no. 7, pp. 971-987.

[18] P. Viola and M. Jones, (2001) " Robust real time object detection," in IEEE ICCV Workshop on

Statistical and Computational Theories of Vision, Vancouver, Canada.

[19] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang and C.-J. Lin, (2008) "LIBLINEAR: A Library for

Large Linear Classification," Journal of Machine Learning Research, vol. 9, pp. 1871-1874.

[20] C.-C. Chang and C.-J. Lin, (2001) "LIBSVM: a library for support vector machines," Software

available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[21] S. Shalev-Shwartz, Y. Singer and N. Srebro, (2007) "Pegasos: primal estimated sub-gradient solver for

SVM," in International Conference on Machine Learning, Corvallis, Oregon, USA.

[22] S. Lazebnik, C. Schmid and J. Ponce, (2006) "Beyond Bags of Features: Spatial Pyramid Matching for

Recognizing Natural Scene Categories," in IEEE Computer Society Conference on Computer Vision

and Pattern Recognition, New York, NY, USA.

[23] D. G. Lowe, (2004) "Distinctive Image Features from Scale-Invariant Keypoints," International Journal

of Computer Vision, vol. 60, no. 2, pp. 91-110.


25

Author

Dipankar Das received his B.Sc. and M.Sc. degree in Computer Science and

Technology from the University of Rajshahi, Rajshahi, Bangladesh in 1996 and1997,

respectively. He also received his PhD degree in Computer Vision from Saitama

University Japan in 2010. He was a Postdoctoral fellow in Robot Vision from October

2011 to March 2014 at the same university. He is currently working as an associate

professor of the Department of Information and Communication Engineering,

University of Rajshahi. His research interests include Object Recognition and Human

Computer Interaction.