
Department of Computer Science
George Mason University
Technical Report Series

4400 University Drive MS#4A5, Fairfax, VA 22030-4444 USA
http://cs.gmu.edu/
703-993-1530

Experiments in Building Recognition

Wei Zhang
[email protected]

Jana Košecká∗
[email protected]

Technical Report GMU-CS-TR-2004-3

Abstract

In this report we study the problem of building recognition. Given a database of building images, we are interested in classifying a novel view of a building by finding the closest match from the database. We examine the efficacy of local scale-invariant features and their associated descriptors as representations of building appearance and use a simple voting scheme in the recognition phase.

1 Introduction

Object recognition has been an active research area in computer vision for a long time. Various approaches have been proposed to recognize a variety of dissimilar objects coming from different classes. Buildings have several characteristics which make their recognition more challenging. They are highly structured, with parallelism and orthogonality being the prevailing relationships between lines. They typically contain repeating structures, such as windows and pillars (see Figure 1). In images of buildings, occlusions happen frequently, due to surrounding trees or self-occlusions caused by changes of viewpoint. We consider scenarios where the dominant structure in the image is due to buildings. This differs from related work on detecting and recognizing buildings in aerial images.

In this report we are interested in investigating the efficacy of local image descriptors as building representations, their detection repeatability as a function of viewpoint, as well as their discrimination capabilities. Given a database of building images and a query image with a novel view of a building, we want to find the closest

∗This work is supported by NSF IIS-0118732

Figure 1: Some examples with repeating structures within each image.

match from the database.

2 Related work

The majority of approaches for dealing with classes of objects which share common visual attributes, such as cars, faces or buildings, have focused mostly on the problems of detection and class classification. In [18] the authors worked on locating buildings in a given image or classifying images as building/no-building images in the context of a CBIR application [7]. The more challenging problem of detecting man-made structures in cluttered scenes has been addressed in [5], where the authors modelled the spatial dependencies in the image using Markov Random Fields.

A crucial part of an object recognition system is the object representation. Existing approaches can be broadly divided into two categories: appearance based and geometric techniques. Geometric techniques model objects using geometric information such as shape [21],


[2], [3], or coordinates of points [20]. The rather weak discrimination capabilities of purely geometric methods, as well as the difficulty of employing these techniques in the presence of clutter, make them mostly applicable in scenarios where detailed geometric models of objects (such as those provided by CAD models) are available.

Appearance based techniques represent objects by their appearance, and can be further divided into global and local appearance based methods. Principal Component Analysis (PCA) [19], [12] has traditionally been used in a global setting. The disadvantage of standard PCA techniques is that they cannot tolerate occlusions and clutter. In [6] the authors propose to handle the occlusion problems of PCA by considering only a subset of image windows. Local appearance based methods represent objects by a set of local image descriptors [16] associated with feature points. In [13] the authors introduce "eigen-windows", which are also based on eigen-space analysis.

Recently several feature detectors and their associated local descriptors have been proposed in the literature. Mikolajczyk and Schmid [11] present an affine invariant interest point detector. They initialize interest points based on a multi-scale Harris corner detector, and iteratively modify the position, scale and shape of those points. They use normalized Gaussian derivatives as the local descriptor, and obtain point-to-point correspondences based on the Mahalanobis distance between two descriptors. Rothganger et al. [14] extend this work to the multi-view case by exploiting multi-view geometric constraints between sets of matched interest points. A spin image is used to represent the normalized image patch, and normalized cross-correlation of the descriptors is used for matching. Lowe [10] introduced the so-called SIFT features, obtained by selecting stable features in scale space. We describe SIFT features in greater detail in the next section. Nelson and Selinger [17] employ a hierarchy of perceptual grouping processes. Their basic primitives are keyed context patches: square image regions extracted around prominent contour fragments, with the orientation and size of each region normalized by the contour.

To the best of our knowledge, the only work related to building recognition is by Bohm and Haala [4]. They use the overall shape of the building to recognize buildings. Their requirements go well beyond a single image, including a CAD model, a GPS receiver and a compass.

Approach. In this report we evaluate the performance of building recognition using the scale-invariant features introduced in [10]. The set of local descriptors extracted from each reference image represents the model of the corresponding building. In the recognition stage, each descriptor of a query image is matched against the database of models. The number of matches between the pair of query and model images is then used to determine the best match in the database. In the following sections we briefly describe the process of selecting the SIFT features, the distance metric used to compare the descriptors, and the final recognition stage. The experiments have been carried out on a database of 36 buildings.

Figure 2: Examples of SIFT features. The size of the circle is proportional to the scale of the feature.

3 Building representation

The buildings are represented by a set of scale invariant SIFT features and their associated descriptors. The SIFT features correspond to highly distinguishable image locations which can be detected efficiently and have been shown to be stable across wide variations of viewpoint and scale. Such image locations are detected by searching for peaks in the image D(x, y, σ), which is obtained by taking the difference of two neighboring images in the scale space:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ).    (1)

The image scale space L(x, y, σ) is first built by convolving the image with a Gaussian kernel with varying σ, such that at a particular σ, L(x, y, σ) = G(x, y, σ) ∗ I(x, y). This difference operation approximates the convolution of the image with a Laplacian of Gaussian, D(x, y, σ) ≈ σ²∇²G ∗ I(x, y), with additional scale normalization [8]. Candidate feature locations are obtained by searching for local maxima and minima of D(x, y, σ). In the second stage, detected peaks with low contrast or poor localization are discarded. The location of keypoints can be further refined by fitting a 3D quadratic function to the local point neighborhood to obtain sub-pixel accuracy [9, 1].
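For concreteness, the peak detection of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration using numpy/scipy, not Lowe's implementation: the parameter values (base scale 1.6, five scales, the contrast threshold) are illustrative assumptions, and the quadratic sub-pixel refinement and poor-localization test are omitted.

```python
# Minimal difference-of-Gaussian peak detection (Eq. 1); illustrative
# parameters, not Lowe's implementation.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_candidates(image, sigma0=1.6, k=2 ** 0.5, num_scales=5,
                   contrast_thresh=0.03):
    """Return (x, y, scale level) candidates where D(x, y, sigma) is a
    local extremum across space and scale with sufficient contrast."""
    image = image.astype(np.float64)
    # L(x, y, sigma) = G(x, y, sigma) * I(x, y) for a ladder of sigmas.
    L = np.stack([gaussian_filter(image, sigma0 * k ** i)
                  for i in range(num_scales)])
    # D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma).
    D = L[1:] - L[:-1]
    # A peak must be an extremum of its 3x3x3 scale-space neighborhood.
    extremum = (D == maximum_filter(D, size=3)) | (D == minimum_filter(D, size=3))
    strong = np.abs(D) > contrast_thresh          # discard low-contrast peaks
    s, y, x = np.nonzero(extremum & strong)
    return list(zip(x, y, s))
```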

3.1 Keypoint descriptor

The orientation of a keypoint is determined by peaks in a local orientation histogram, which is formed over a circular window around the keypoint. The window size is 3 times the scale of the keypoint. By assigning a dominant orientation to each keypoint, the keypoint descriptor can be represented relative to this orientation in order to achieve invariance to image rotation.

After assigning a scale and orientation to each keypoint, a descriptor is created characterizing the local image area around it. First, the gradient magnitude and orientation are computed at each pixel. The local neighborhood of each keypoint is then represented by 16 local orientation histograms, each belonging to one of the 4 × 4 sub-windows of the local keypoint neighborhood, with 8 orientation bins in each. This forms a 128 (16 × 8) element feature vector for each keypoint. Boundary effects between histograms are alleviated by distributing the gradient value of each pixel to adjacent histogram bins via linear interpolation. Although the descriptor is not fully affine invariant, it has been shown to be robust across a substantial range of affine distortions [10]. Examples of buildings and their associated scale invariant features are shown in Figure 2. Given a database of reference views of buildings, each view is characterized by a set of keypoints and their corresponding descriptors. The descriptors are normalized in order to tolerate moderate changes in image contrast. The normalized descriptors are saved in the model database.
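The descriptor construction can be sketched as follows. This is a simplified illustration rather than Lowe's implementation: it assumes a 16 × 16 patch already rotated to the dominant orientation and resampled at the keypoint's scale, and it drops the Gaussian weighting and the linear interpolation between adjacent bins described above.

```python
# Simplified 4x4x8 = 128-bin descriptor; omits Gaussian weighting and
# the interpolation between adjacent histogram bins.
import numpy as np

def keypoint_descriptor(patch):
    """patch: 16x16 grayscale window, already oriented and scale-normalized."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)                            # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)       # gradient orientation
    bins = ((ang / (2 * np.pi)) * 8).astype(int) % 8  # 8 orientation bins
    desc = np.zeros((4, 4, 8))
    for i in range(16):                               # accumulate one histogram
        for j in range(16):                           # per 4x4 sub-window
            desc[i // 4, j // 4, bins[i, j]] += mag[i, j]
    desc = desc.ravel()                               # 128-element vector
    # Normalize to unit length to tolerate moderate contrast changes.
    return desc / (np.linalg.norm(desc) + 1e-12)
```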

3.2 Image Matching

Given a query image and a model image, the matching between the two is based on two criteria. As proposed by Lowe, we first find, for each keypoint descriptor d ∈ ℝⁿ, its two closest neighbors among the model descriptors. The lower the ratio of the distances to the closest and second-closest descriptor, the better the match; if the ratio is below some threshold τr, the point corresponding to the descriptor is labelled as matched. In practice, we use the squared distance instead of the Euclidean distance, so a point is labelled as matched when

d²(f, m1st) / d²(f, m2nd) < τr²    (2)

where f is the descriptor to be matched and m1st, m2nd ∈ ℝⁿ are the closest and second closest descriptors from the model database. In our case n = 128. In [9] Lowe suggests using τr = 0.8. Generally, this measure is effective because correct matches need to have the closest neighbor significantly closer than the closest incorrect match. In our case this is not always true, because there are many repeating patterns in buildings, such as corners of windows. Consequently it is very likely that a keypoint has several close matches, with the distance ratio between the first and second neighbors being close to 1. In order to retain these matches we use an

additional criterion related to the actual distance between the descriptors: if the cosine of the angle between two descriptors is above a certain threshold τc, we also label them as matched. Given two descriptors a and b, their cosine similarity is defined as

cos(a, b) = aᵀb / (‖a‖₂ ‖b‖₂)    (3)

The cosine measure reflects the angle between the two vectors (descriptors). The squared Euclidean distance between two descriptors can be obtained directly from their cosine value: since the descriptors are normalized to unit length (‖a‖ = ‖b‖ = 1),

d²(a, b) = Σᵢ aᵢ² + Σᵢ bᵢ² − 2 Σᵢ aᵢbᵢ = ‖a‖² + ‖b‖² − 2 cos(a, b) = 2(1 − cos(a, b)).

Using the second criterion we can retrieve a larger number of correct matches which would otherwise be rejected by the first criterion. Since we do not want to introduce many false matches, the threshold τc is set rather high.
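The combination of the two criteria can be summarized in the following sketch, assuming unit-normalized descriptors so that d²(a, b) = 2(1 − cos(a, b)); the function name and array layout are illustrative, and the threshold values are those selected in Section 3.3.

```python
# Sketch of the combined matching of Section 3.2: Lowe's ratio test
# (Eq. 2) OR a cosine-similarity threshold (Eq. 3).
import numpy as np

def match_descriptors(query, model, tau_r=0.6, tau_c=0.97):
    """query: (nq, 128) and model: (nm, 128), rows unit-normalized.
    Returns a list of (query index, model index) matched pairs."""
    matches = []
    cos = query @ model.T               # cosine similarities (Eq. 3)
    d2 = 2.0 * (1.0 - cos)              # squared distances for unit vectors
    for i in range(query.shape[0]):
        order = np.argsort(d2[i])
        first, second = order[0], order[1]
        # Criterion 1: ratio of squared distances below tau_r^2 (Eq. 2).
        ratio_ok = d2[i, first] < tau_r ** 2 * d2[i, second]
        # Criterion 2: cosine above tau_c, retaining repeated-structure
        # matches whose distance ratio is close to 1.
        cosine_ok = cos[i, first] > tau_c
        if ratio_ok or cosine_ok:
            matches.append((i, first))
    return matches
```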

3.3 Threshold selection

The choice of thresholds can significantly affect the recognition rate. Ideally we would like to choose thresholds which guarantee a large number of correct classifications while keeping the number of false positive matches low. The choice of suitable thresholds is typically based on ROC (Receiver Operating Characteristic) analysis. The ROC curve captures the relationship between the sensitivity (True Positive Rate) and the specificity (True Negative Rate). Its y-axis is the True Positive Rate (TPR), and its x-axis is the False Positive Rate (FPR), which is exactly 1 − specificity. The TPR is defined as follows:

TPR = TP / P = TP / (TP + FN)    (4)

where TP represents the number of true positives (in our case the number of correct matches found given a threshold), P represents the total number of positives (correct matches), and FN represents the number of false negatives (i.e. the number of correct matches that are not found given a threshold). The FPR is defined as

FPR = FP / N = FP / (TN + FP)    (5)

where FP represents the number of false positives (in our case, the number of incorrect matches found given a threshold), N represents the total number of negatives (incorrect matches), and TN represents the number of true negatives (i.e. the number of incorrect matches that are not found given a threshold).


The values of the individual quantities are rather difficult to obtain in the context of our application. First, the numbers P and N are hard to determine. Given that the number of keypoints in each image is around 800, counting the number of correct matches is nontrivial even for one image pair. The situation becomes worse because the repeating structure causes a keypoint to "correctly" match to multiple positions. Furthermore, in order to study how the thresholds affect the performance, we need to identify FP and TP for match results returned by a range of thresholds, which in our application can be determined only tediously by hand. In order to determine appropriate values of the thresholds, we approximate the ROC curve as follows. We randomly choose one test image and match its keypoints to all the keypoints in the 36 models. For each model, a set of matched keypoints is returned. All the matched pairs except those from the correct model are false positives (FP). For matched pairs from the correct model, we still need to identify the true positives (TP), because they can include wrong matches. Once we know FP and TP for the chosen threshold, we need to obtain N and P. In fact, we are really interested in the ratio of N and P; once their ratio is set, the shape of the ROC curve does not change with the particular values of N and P. Consider P first: in order for the set of matched pairs to include all the true positives, the thresholds should be minimally strict, i.e. the similarity threshold should be 0, while the ratio threshold should be 1. Of course, false positives are also included in this set; our observation shows that around 2/3 of the matched pairs are true positives. Matched pairs produced from all the remaining 35 models are false positives. Letting the number of keypoints in the test image be Nq, we have N ≈ (35 + 1/3)Nq and P ≈ (2/3)Nq, so the ratio N/P is approximately 50. After all these parameters are set, we can draw the ROC curve. Of course, one test image is not enough to represent the variation of the whole dataset. In this experiment we use four test images to study the behavior of the thresholds, and use their average to draw the approximate ROC curve. See Figures 3 and 4 for the final result. The thresholds we choose are τc = 0.97 and τr = 0.6.
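The approximation procedure can be condensed into the following sketch. The helper count_matches is hypothetical and stands for the matching routine of Section 3.2 run at a given threshold; applying the observed 2/3 true-positive fraction uniformly across thresholds is a further simplification of what is described above.

```python
# Sketch of the approximate ROC construction; count_matches is a
# hypothetical stand-in for the Section 3.2 matcher at threshold tau.
def approximate_roc(query_desc, models, correct_id, taus, count_matches):
    nq = query_desc.shape[0]
    P = (2.0 / 3.0) * nq                 # estimated positives
    N = (35.0 + 1.0 / 3.0) * nq          # estimated negatives, N/P ~ 50
    points = []
    for tau in taus:
        tp = fp = 0.0
        for model_id, model_desc in models.items():
            m = count_matches(query_desc, model_desc, tau)
            if model_id == correct_id:
                tp += (2.0 / 3.0) * m    # ~2/3 of these are true positives
                fp += (1.0 / 3.0) * m
            else:
                fp += m                  # matches to wrong models
        points.append((fp / N, tp / P))  # one (FPR, TPR) point per tau
    return points
```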

3.4 Keypoint matching

Each keypoint is labeled as matched if either of the two criteria is met; in this way we get more correct matches. Figure 5 shows the benefit of combining Lowe's measure and the cosine measure.

Because of repeating patterns, a particular keypoint may not be matched to its exact position, as Figure 7 shows. For example, a window corner may be matched to a similar window corner at a completely different location. The match is correct in the sense of

Figure 3: Approximate ROC curve for τc (curves for τc = 0.97, 0.95, 0.93).

Figure 4: Approximate ROC curve for τr (curves for τr = 0.5, 0.6, 0.7).

local appearance matching, but the points are not in correct correspondence in the sense that they are not projections of the same 3D point. Fortunately, this is not a big problem for recognition purposes, because we only need to know whether the point comes from the building.

4 Building recognition

Building recognition is accomplished here by a simple voting scheme. Given a query image and the sets of keypoint pairs matched with all existing models in the database, the most likely model is the one with the largest number of matches.
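A sketch of the voting scheme follows; it assumes the match_descriptors routine sketched in Section 3.2, and the dictionary layout of the model database is illustrative.

```python
# Voting scheme: the model with the most descriptor matches wins.
def recognize(query_desc, models, match_descriptors):
    """models: dict mapping building id -> (n_i, 128) descriptor array.
    Returns model ids ranked by vote count (top 5, as in Tables 1 and 2)."""
    votes = {mid: len(match_descriptors(query_desc, desc))
             for mid, desc in models.items()}
    return sorted(votes, key=votes.get, reverse=True)[:5]
```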

Although enforcing geometric consistency between matched keypoints could further improve the recognition results, the complexity involved hinders us from pursuing it. Robust techniques like RANSAC are known to handle up to 50% outliers; in our case, the fraction of outliers is often well above 50%. Mismatches constitute part of the outliers; furthermore, repeating patterns cause many locally


Figure 5: (left) Match result using only Lowe's criterion (15 matches); (center) result using only the cosine measure (9 matches); (right) result using both measures (22 matches).

correct yet globally mismatched keypoints. In the extreme case, all the matched keypoints may belong to the latter category. Consequently, it is hard to define geometric consistency, yet recognition is still successful. The examples in Figure 7 show some extreme situations.

5 Experimental results

The image database used in the experiment consists of 36 buildings, with several images of each building. The images are taken from different viewpoints and at different scales. We use one image as the reference image and two other images for testing. All of the buildings are listed in Tables 3, 4 and 5.

In the experiment, 53 out of 72 test images are correctly recognized (listed as the first candidate); the rest are misclassified. However, 10 of the misclassified images have the correct answer listed in the top 5, so the results are still useful for further processing. The detailed results are listed in Table 1 and Table 2. For the test images which get completely wrong results (the correct answers are not shown in the top 5 list), possible reasons are either too large an out-of-plane rotation (e.g. image 25) or scale change (e.g. image 14). In addition, the lighting conditions may be rather different between reference and test images. We are currently in the process of extending our database and carrying out additional experiments. We plan to further investigate alternative choices of features, their descriptors, as well as

Figure 6: Keypoint match results to incorrect models.

Figure 7: Repeating structures cause mismatches, although in terms of local appearance the matches are correct.

means of modelling spatial relationships between the local keypoints.

6 Conclusion and future work

In this report we have presented initial experiments in building recognition using scale invariant features and their associated descriptors. We examined two different criteria for matching and demonstrated that their combination yields favorable results in the context of our domain. Currently we can only deal with a single building in the image and do not exploit spatial relationships between buildings. We are extending the database of features and developing an additional stage which would enable robust estimation of the pose of the camera with respect to the building.


References

[1] M. Brown and D. Lowe. Invariant features from interest point groups. In Proceedings of BMVC, Cardiff, Wales, pages 656–665, 2002.

[2] A. Dubinskiy and S. Zhu. A multiscale generative model for animate shapes and parts. In International Conference on Computer Vision, October 2003.

[3] B. Falcidieno and F. Giannini. Automatic recognition and representation of shape-based features in a geometric modeling system. Computer Vision, Graphics, and Image Processing, 48:93–123, 1989.

[4] J. Bohm and N. Haala. Automated appearance-based building detection in terrestrial images. In International Archives of Photogrammetry and Remote Sensing (IAPRS), pages 491–495, September 2002.

[5] S. Kumar and M. Hebert. Man-made structure detection in natural images using a causal multiscale random field. In CVPR 2003, pages 119–126, 2003.

[6] A. Leonardis and H. Bischof. Dealing with occlusions in the eigenspace approach. In 1996 Conference on Computer Vision and Pattern Recognition (CVPR '96), pages 453–458, June 1996.

[7] Y. Li and L. G. Shapiro. Consistent line clusters for building recognition in CBIR. In CVPR 2002, pages 952–956, August 2002.

[8] T. Lindeberg. Scale-space theory: A basic tool for analysing structures at different scales. Journal of Applied Statistics, 21(2), 1994.

[9] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, page to appear, 2004.

[10] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. of the International Conference on Computer Vision (ICCV), pages 1150–1157, 1999.

[11] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In European Conference on Computer Vision, pages 128–142, 2002.

[12] H. Murase and S. Nayar. Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision, 14:5–24, 1995.

[13] K. Ohba and K. Ikeuchi. Detectability, uniqueness, and reliability of eigen windows for stable verification of partially occluded objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:1043–1047, 1997.

[14] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. 3D object modeling and recognition using affine-invariant patches and multi-view spatial constraints. In CVPR 2003, pages 272–277, June 2003.

[15] R. Schettini. Multicolored object recognition and location. Pattern Recognition Letters, 15(11):1089–1097, November 1994.

[16] C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:530–535, 1997.

[17] A. Selinger and R. C. Nelson. A perceptual grouping hierarchy for appearance-based 3D object recognition. Computer Vision and Image Understanding, 76(1):83–92, 1999.

[18] A. Stassopoulou. Building detection using Bayesian networks. International Journal of Pattern Recognition and Artificial Intelligence, 83(5):705–740, 2000.

[19] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3:71–86, 1991.

[20] Z. Zhang, M. Lyons, M. Schuster, and S. Akamatsu. Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 454–459, April 1998.

[21] S. C. Zhu and A. Yuille. FORMS: A flexible object recognition and modeling system. Computer Vision, Graphics, and Image Processing, 20:187–212, 1996.


Image   First   Second   Third   Fourth   Fifth   Correct?
1       1       29       5       26       11      1
2       2       9        29      1        5       1
3       3       29       1       23       11      1
4       4       5        8       2        29      1
5       5       7        6       12       24      1
6       6       5        11      29       1       1
7       7       2        8       5        28      1
8       8       29       28      17       24      1
9       5       9        29      2        28      0
10      10      29       9       4        5       1
11      11      5        1       29       4       1
12      5       12       28      6        9       0
13      13      9        4       29       5       1
14      14      1        29      2        9       1
15      15      5        8       28       2       1
16      16      5        29      9        2       1
17      17      15       5       29       11      1
18      18      5        11      6        28      1
19      19      5        1       3        23      1
20      20      5        8       11       1       1
21      21      7        2       29       6       1
22      22      5        28      2        29      1
23      23      29       2       5        6       1
24      29      28       24      5        2       0
25      2       5        29      7        1       0
26      29      26       28      5        11      0
27      27      5        2       29       23      1
28      28      29       5       1        9       1
29      29      2        1       4        5       1
30      30      5        8       14       20      1
31      31      11       1       5        4       1
32      5       20       26      32       1       0
33      34      1        26      35       28      0
34      34      26       35      11       1       1
35      35      34       26      17       29      1
36      36      35       34      29       5       1

Table 1: Recognition results with the top 5 list for the first test image.

Image   First   Second   Third   Fourth   Fifth   Correct?
1       1       29       2       5        20      1
2       2       29       5       4        13      1
3       3       29       2       11       34      1
4       4       5        26      28       2       1
5       2       5        29      4        20      0
6       6       2        29      26       34      1
7       7       8        26      28       29      1
8       8       29       2       6        1       1
9       2       29       11      34       1       0
10      10      11       20      29       27      1
11      11      1        29      5        20      1
12      5       20       30      2        29      0
13      13      2        23      29       9       1
14      1       29       11      23       34      0
15      29      1        15      5        28      0
16      29      20       16      23       11      0
17      29      11       23      6        18      0
18      2       11       27      26       20      0
19      19      34       26      5        23      1
20      20      35       29      5        28      1
21      29      4        35      28       2       0
22      22      35       2       31       9       1
23      23      29       5       26       35      1
24      35      29       28      2        24      0
25      2       25       12      7        29      0
26      26      35       2       11       29      1
27      27      5        29      28       2       1
28      28      35       29      6        2       1
29      29      35       2       23       28      1
30      30      29       5       28       2       1
31      31      35       5       23       1       1
32      32      35       5       34       26      1
33      35      29       1       2        9       0
34      35      34       29      2        26      0
35      35      29       2       5        28      1
36      36      5        28      35       29      1

Table 2: Recognition results with the top 5 list for the second test image.


Table 3: Image data set (reference and test images for buildings 1–12).

Table 4: Image data set, continued (reference and test images for buildings 13–24).


Table 5: Image data set, continued (reference and test images for buildings 25–36).
