
Hierarchical Building Recognition

Anonymous

Abstract

We propose a novel and efficient method for the recognition of buildings and locations with dominant man-made structures. The method exploits two different representations of model views, which differ in their complexity and discriminative capability. At the coarse level, model views are represented by spatially tuned color histograms computed over the dominant orientation structures in the image. This representation enables fast and efficient retrieval of the closest match from the database. At the finer level, matching is accomplished using feature descriptors associated with local image regions. We discuss the probabilistic formulation of the matching as well as means of dealing with multiple matches due to repetitive structures. Extensive experiments on two building databases with varying degrees of occlusion, viewpoint and illumination change are presented.

1 Introduction

In this paper, we study the problem of building recognition. As an instance of the recognition problem, this domain is interesting: the class of buildings contains many similarities, while at the same time it calls for techniques capable of fine discrimination between different instances of the class. The problem is also of interest in the context of navigation applications, where the goal is to determine the pose of the camera with respect to the scene, given an image of the scene and a database of previously recorded images of the city or urban area. From the perspective of applications, the natural concerns are efficiency and scalability.

1.1 Related work

One of the central issues pertinent to the recognition problem is the choice of a suitable representation of the class and its scalability to a large number of exemplars. In the context of object recognition, both global and local image descriptors have been considered. Global image descriptors typically consider the entire image as a point in a high-dimensional space and model the changes in appearance as a function of viewpoint using subspace methods [11].

Given the subspace representation, the pose of the camera is typically obtained by a spline interpolation method, exploiting the continuity of the mapping between the object's appearance and the continuously changing viewpoint. Alternative global representations proposed in the past include responses to banks of filters [19] and multi-dimensional histograms [7]. In [17] the authors suggested improving the discriminative power of the plain color indexing technique by encoding spatial information, dividing the image into 5 partially overlapping regions. A multi-channel approach using histograms was explored in the context of general object recognition [10]. The disadvantage of this approach is that using multiple indices can make the combined index relatively large. Local feature based techniques have recently become very effective for a range of object recognition problems, since they fare favorably in the presence of large amounts of clutter and changes in viewpoint. Representatives of local feature based techniques include scale invariant features [9, 14] and their associated descriptors, which are invariant with respect to rotation or affine transformations. Several authors motivated by navigation applications have also studied the problem of building recognition. In [12] the authors proposed matching using point features after first aligning the views of the buildings to their canonical views in the database, followed by relative pose recovery between the views. The authors of [16] used matching of invariant regions, while in [6] line segments were matched under the assumption of planar motion. Coarse classification of locations was also the motivating application in the work of [18] on context-based place recognition.

1.2 Approach overview

We propose to tackle the building recognition problem with a two-stage hierarchical approach. The first stage is an efficient coarse classification scheme based on localized color histograms computed over the dominant orientation structures in the image. A variable number of candidates, with their associated confidence measures, is then subjected to the second, feature based matching stage. The main contributions of this paper are the localized color histogram descriptor used in the fast indexing scheme and the simplified probabilistic feature based matching method used in the final recognition stage. We carry out extensive experiments on two building databases with varying degrees of occlusion, viewpoint and illumination change, and discuss the applicability of our approach to other recognition problems.

2 Spatially tuned color histogram

In order to exploit the efficiency and compactness of histogram based representations while gaining the discriminative capability and robustness of local feature descriptors, we propose a representation of buildings which trades these characteristics favorably. The representation is motivated by the observation that man-made structures like buildings contain constrained geometric structure, such as parallel and orthogonal lines and planar surfaces. Parallel lines in the world intersect in the image plane at vanishing points. In urban environments the dominant directions are typically aligned with the three orthogonal axes of the world coordinate frame. Since these orientations typically belong to man-made structures, we propose to compute the color distribution of the pixels whose orientation complies with the main vanishing directions. Discriminative power is gained by weakly encoding spatial information, which is achieved by treating the histograms of the different dominant orientations separately. We outline the process in more detail in the following sections.

2.1 Dominant vanishing directions

The detection of dominant vanishing directions in the image, which are due to the presence of dominant man-made structures, is based on our earlier work [1]. The detection of line segments is followed by an efficient approach for simultaneous grouping of lines into dominant vanishing directions and estimation of vanishing points using the expectation maximization (EM) algorithm. The EM algorithm typically converges in a few iterations, thanks to an effective initialization stage based on peaks in the orientation histogram. In our experiments the number of EM iterations is set to 10, but we often observe good convergence after fewer than 5 iterations. For buildings which lack dominant orientations, the vanishing point estimation process terminates due to the lack of straight line support; in such cases the first recognition stage is bypassed and matching based on local descriptors is carried out. We have not encountered this situation in our experiments.
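The initialization and grouping just described can be sketched roughly as follows. This is a toy stand-in for the full EM of [1]: it finds peaks in an orientation histogram of line-segment angles and refines each group with a circular mean, without estimating actual vanishing points. All parameter values are illustrative.

```python
import numpy as np

def group_line_orientations(angles, n_bins=36, min_peak=3):
    """Toy sketch of the initialization stage: group line-segment angles
    (radians, mod pi) into dominant orientation clusters via histogram
    peaks plus one circular-mean refinement pass."""
    angles = np.mod(np.asarray(angles, dtype=float), np.pi)
    hist, edges = np.histogram(angles, bins=n_bins, range=(0.0, np.pi))
    # Peaks: bins that beat both circular neighbours and a support threshold.
    peaks = [i for i in range(n_bins)
             if hist[i] >= min_peak
             and hist[i] >= hist[(i - 1) % n_bins]
             and hist[i] >= hist[(i + 1) % n_bins]]
    centers = np.array([(edges[i] + edges[i + 1]) / 2.0 for i in peaks])
    if len(centers) == 0:
        return None, None  # no dominant structure: bypass the first stage
    # Assign each line to the nearest peak (circular distance, period pi)...
    d = np.abs(angles[:, None] - centers[None, :])
    d = np.minimum(d, np.pi - d)
    labels = np.argmin(d, axis=1)
    # ...and refine each group mean with a circular average (angle doubling).
    refined = np.array([
        (0.5 * np.arctan2(np.sin(2 * angles[labels == k]).sum(),
                          np.cos(2 * angles[labels == k]).sum())) % np.pi
        for k in range(len(centers))])
    return refined, labels
```

The full system goes further, alternating line-to-direction assignment with vanishing point re-estimation; the sketch only captures the histogram-peak initialization that makes the EM converge quickly.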

2.2 Pixel membership assignment

In the above step, we obtained global direction information from the line segments aligned with the dominant orientations. The EM process typically returns two or three vanishing points, which correspond to principal directions v_i, v_j and v_k in the world coordinate frame. Let us assume that there are 3 principal directions and that the image plane rotation of the camera does not exceed 45°. We can then identify the three principal directions as left (v_i), right (v_j) and vertical (v_k), respectively. Each image pixel can then be classified as either aligned with one of the directions or belonging to the outlier model. A pixel with sufficiently high gradient magnitude is classified as aligned with one of the directions if the difference between its local gradient direction and the direction v_i, v_j or v_k is less than some threshold, set to 30° in our experiments. Under this criterion, pixels belonging to background clutter (e.g. trees and grassland) are still selected, as the middle row of Figure 1 shows. Note, however, that those pixels are located in areas where the gradient direction changes frequently, so their neighboring pixels are unlikely to belong to the same group. Hence most of them can be eliminated by connected component analysis, removing small connected components. The building pixels survive this stage because they typically have consistent gradient direction. The final group membership assignments are shown in the bottom row of Figure 1, where most background bushes and trees have been removed. Note also that the memberships of individual pixels remain stable across different views, which enables a representation with high repeatability with respect to viewpoint change. We will next demonstrate that augmenting this spatial representation with color information yields a discriminative descriptor for the class.

Figure 1: Three views of the same building. Top: the original images. Middle: pixel membership assigned by the geometric constraints. Bottom: final pixel membership; deep blue marks the area classified as background, while red, light blue and yellow mark the three groups of pixels, respectively.
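A minimal sketch of the membership step, under simplifying assumptions: we treat each principal direction as a single image-plane angle (the real system compares against locally varying vanishing directions), threshold gradient orientation and magnitude as in the text, and discard small connected components with a plain BFS instead of a library call. All thresholds are illustrative.

```python
import numpy as np
from collections import deque

def assign_membership(grad_angle, grad_mag, dir_angles,
                      mag_thresh=0.1, ang_thresh=np.deg2rad(30), min_size=20):
    """Sketch: a pixel joins direction k if its gradient is strong and within
    ang_thresh of that direction (mod pi); tiny 4-connected components
    (background clutter) are then discarded. Label 0 = background/outlier,
    labels 1..K = the K principal directions."""
    H, W = grad_angle.shape
    labels = np.zeros((H, W), dtype=int)
    for k, a in enumerate(dir_angles, start=1):
        d = np.abs(np.mod(grad_angle - a, np.pi))
        d = np.minimum(d, np.pi - d)
        labels[(grad_mag >= mag_thresh) & (d <= ang_thresh) & (labels == 0)] = k
    # Remove small 4-connected components of each non-background label.
    seen = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            if labels[y, x] == 0 or seen[y, x]:
                continue
            comp, q, lab = [(y, x)], deque([(y, x)]), labels[y, x]
            seen[y, x] = True
            while q:
                cy, cx = q.popleft()
                for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                    if 0 <= ny < H and 0 <= nx < W and not seen[ny, nx] \
                            and labels[ny, nx] == lab:
                        seen[ny, nx] = True
                        comp.append((ny, nx))
                        q.append((ny, nx))
            if len(comp) < min_size:
                for cy, cx in comp:
                    labels[cy, cx] = 0  # clutter: too small to be building
    return labels
```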


Figure 2: Three views of another building with more background clutter and viewpoint change. Top: the original images. Bottom: final pixel membership; deep blue marks the area classified as background, while red, light blue and yellow mark the three groups of pixels, respectively.

2.3 Color histogram formation

Considering only the pixels which belong to dominant directions, we compute a color histogram for each group of pixels. First, the RGB values of each group are mapped to hue space. Unlike traditional color indexing techniques, where pixel colors are represented in 3D RGB space or 2D hue-saturation space, we adopt a 1D representation [3]. The representation considers only the hue value, computed as arctan2(Cb, Cr) with (Y, Cb, Cr) defined by:

    [ Y  ]   [ 0.2125   0.7154   0.0721 ] [ R ]
    [ Cb ] = [ 0.1150  -0.3850   0.5000 ] [ G ]
    [ Cr ]   [ 0.5000  -0.4540  -0.0460 ] [ B ]

Hue values are calculated by:

    H = arctan2(Cb, Cr) / π,    −1 ≤ H ≤ 1    (1)

The hue distribution of each group of pixels is quantized into 16 bins. To avoid boundary effects, which cause the histogram to change abruptly when some values shift smoothly from one bin to another, we use linear interpolation to assign a weight to each histogram bin according to the distance between the value and the bin's central value.
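The hue computation and soft binning above can be condensed into a short sketch. The matrix entries and the 16-bin linear interpolation follow the text; the circular wrap-around of the hue axis and the bin-center convention are our own assumptions.

```python
import numpy as np

# Y/Cb/Cr weights as given in the text; only the hue angle
# arctan2(Cb, Cr) matters for the descriptor.
M = np.array([[0.2125,  0.7154,  0.0721],
              [0.1150, -0.3850,  0.5000],
              [0.5000, -0.4540, -0.0460]])

def hue_histogram(rgb_pixels, n_bins=16):
    """Soft-binned 1-D hue histogram for one group of pixels (sketch).
    rgb_pixels: (N, 3) array in [0, 1]. Hue H = arctan2(Cb, Cr)/pi lies in
    [-1, 1]; each sample votes into its two nearest bins by linear
    interpolation, and the histogram is normalized to sum to 1."""
    ycc = rgb_pixels @ M.T
    h = np.arctan2(ycc[:, 1], ycc[:, 2]) / np.pi           # in [-1, 1]
    # Continuous bin coordinate; bin i is centered at (i + 0.5)*width - 1.
    pos = (h + 1.0) / 2.0 * n_bins - 0.5
    lo = np.floor(pos).astype(int)
    w_hi = pos - lo
    hist = np.zeros(n_bins)
    np.add.at(hist, np.mod(lo, n_bins), 1.0 - w_hi)        # wrap: hue is circular
    np.add.at(hist, np.mod(lo + 1, n_bins), w_hi)
    s = hist.sum()
    return hist / s if s > 0 else hist
```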

2.4 Building retrieval

Given a 16 × 3 = 48-dimensional descriptor representing each model image, building retrieval can proceed by comparing the descriptor of a test image to those of all models. There is one subtle issue: because of viewpoint change, pixels which belong to the left group may end up in the right group, and vice versa. We currently handle this issue by combining the histograms of the left and right groups into one, leaving two histograms and a 16 × 2 = 32-dimensional indexing vector per image. We may lose some discriminative power this way, but we already see considerable improvement over a single global histogram. A byproduct is a short indexing vector, which benefits both storage and comparison. In this stage, the two histograms are normalized individually.

To compare a test image to different models, different distance measures can be used. We tried the L1 distance and the χ² statistic. Although the χ² statistic is not a metric (the triangle inequality does not hold), we chose it because it brings about a 7% increase in recognition rate compared to the L1 distance. Given the descriptor h_t of a test image and h_p of a model, the χ² distance is defined as:

    χ²(h_t, h_p) = Σ_k (h_t(k) − h_p(k))² / (h_t(k) + h_p(k))    (2)

where k indexes the histogram bins (32 in our case). The compact size of the descriptors makes the comparison very fast, which is especially beneficial when dealing with a very large database. The first recognition stage returns a subset of models, which are then considered in the second stage. The number of models returned depends on the ambiguity of the top k results, quantified as:

    Am = χ²_1 / ( (1/(n−1)) Σ_{i=2..n} χ²_i )    (3)

where χ²_i is the i-th closest distance in the result. We set n to 5 in our experiments. The ambiguity measure is very low when the test image is easy to classify and close to 1 when it is hard to identify. The number of listed results N_r is then computed as N_r = ⌈N_m · Am²⌉, where N_m is the maximum list size we allow, set to 20. The actual size of the list is typically much smaller. When an object model has multiple views in the database, the smallest χ² distance among those views is used to compute the list size. We report the recognition rates obtained by this first recognition stage in Section 4.
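A sketch of the retrieval-side computations: the χ² comparison of Eq. 2 and the ambiguity-driven variable list size of Eq. 3, with the settings n = 5 and N_m = 20 quoted in the text. The handling of empty bins is our own assumption.

```python
import math
import numpy as np

def chi2(ht, hp):
    """Chi-square distance of Eq. 2; bins where both histograms are
    empty contribute nothing (an assumption, to avoid 0/0)."""
    ht, hp = np.asarray(ht, float), np.asarray(hp, float)
    denom = ht + hp
    mask = denom > 0
    return float((((ht - hp) ** 2)[mask] / denom[mask]).sum())

def candidate_list_size(dists, n=5, n_max=20):
    """Ambiguity (Eq. 3) over the n closest distances and variable result
    list size Nr = ceil(Nm * Am^2), with the paper's n = 5, Nm = 20."""
    d = np.sort(np.asarray(dists, float))[:n]
    am = d[0] / d[1:].mean()   # chi2 of best match vs mean of the next n-1
    return am, max(1, math.ceil(n_max * am * am))
```

With an unambiguous best match (small Am) the list collapses to a single candidate, matching the behaviour reported in Section 4.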

3 Local feature based refinement

The purpose of the second recognition stage is to further refine the results, enable finer matching, and establish correspondences for subsequent recovery of the relative pose between the test and model views. In this stage we exploit the SIFT features and their associated descriptors introduced by [9]. For each model image, the feature descriptors are extracted off-line and saved in the database along with the color indexing vectors. After extracting features from a test image, its descriptors are matched to those of the models selected in the first recognition stage.


Figure 3: a) matching result using only the nearest neighbor ratio criterion (11 matches); b) result using only the cosine measure (15 matches); c) result using both measures (24 matches).

In the original matching scheme suggested in [9], a pair of keypoints is considered a match if the distance ratio between the closest match and the second closest one is below some threshold τ_r. For buildings, which contain many repetitive structures, this criterion rejects many possible matches, because up to k nearest neighbors may have very close distances. One option for tackling this issue would be to perform clustering in the space of descriptors to capture this repeatability, as suggested in the context of texture analysis [8]. While this strategy would be suitable for recognition, it would not allow us to obtain actual correspondences. We instead add another criterion, which considers two features matched when the cosine of the angle between their descriptors is above some threshold τ_c. For each keypoint in the query image, only the nearest neighbor is kept if multiple candidates pass the cosine threshold. Using both the nearest neighbor ratio and the cosine threshold yields a larger number of matches, as demonstrated in Figure 3. Although the difference in the number of matches between a single criterion and both criteria is small, the missed matches are often crucial for models with a small number of features.
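The combined criterion might look as follows in code. The toy descriptors, the exact values of τ_r and τ_c, and the brute-force nearest-neighbor search are all illustrative; the original system applies this to SIFT descriptors.

```python
import numpy as np

def match_features(query, model, tau_r=0.8, tau_c=0.97):
    """Sketch of the combined criterion: a query descriptor matches its
    nearest model descriptor if EITHER the nearest/second-nearest distance
    ratio is below tau_r (Lowe's ratio test) OR the cosine similarity with
    the nearest neighbour exceeds tau_c, which keeps matches that the
    ratio test rejects on repetitive structure.
    query, model: (Nq, D) and (Nm, D) L2-normalised descriptor arrays."""
    matches = []
    for i, q in enumerate(query):
        d = np.linalg.norm(model - q, axis=1)   # brute-force distances
        order = np.argsort(d)
        j = order[0]                            # keep only the nearest neighbor
        ratio_ok = len(order) > 1 and d[order[0]] < tau_r * d[order[1]]
        cos_ok = float(q @ model[j]) > tau_c
        if ratio_ok or cos_ok:
            matches.append((i, int(j)))
    return matches
```

A descriptor that sits exactly between two repetitive-structure neighbors fails the ratio test, and is rescued only if its cosine similarity with the nearest one is high enough.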

Given k candidate models obtained in the first stage, we can refine the recognition by matching SIFT keypoints. Denoting the set of matches between candidate model image I_j and the test image Q by C(Q, I_j), the most likely model can be determined using a simple voting strategy: the best model is the one with the highest number of successfully matched points,

    j* = arg max_j |C(Q, I_j)|.

Figures 4 and 5 show the top 4 candidates returned by the first stage and their respective SIFT matches. Note that the correct models have many more successful matches than the other candidates.

Figure 4: The other three models listed in the top 3 by the coarse recognition stage have far fewer matches than the correct model.

Figure 5: Another example where the appearance based technique helps identify the correct model. The correct model has many more matches, although it was listed only as the third candidate by the coarse recognition stage.

3.1 Probabilistic formulation

In order to better understand the advantages and limitations of local feature based methods in our domain, we carried out an additional set of experiments involving only the local feature based methods used in the second recognition stage. Since this strategy may be useful in the absence of color information, we extended the database by an additional 68 building models captured as grayscale images, with 3-4 views per building and larger amounts of viewpoint variation and clutter. Using solely the voting strategy outlined above proved ineffective, due to the large variation in the number of features detected in different building models as well as the repetitive nature of man-made structures. To improve the recognition rate achieved by standard voting, we introduce a simple probabilistic model which takes into account both the number of features in the model and the quality of the attained matches, characterized by the similarity between descriptors. Several probabilistic formulations have been considered in the past in the context of object recognition. The existing models differ in their complexity and in the number of aspects they account for probabilistically (e.g. photometric and geometric aspects) [13, 4]. The probabilistic model we use accounts only for the quality of the matches, without explicitly considering spatial relationships between features, because repetitive patterns often cause features to be matched to different locations.


In order to quantify the match quality, we need to reconcile the matching criteria based on the distance threshold and the nearest neighbor ratio test. Keypoint pairs matched by the distance ratio test can have quite low similarity scores, which does not reflect the fact that they are potentially high quality matches. We reconcile this by learning the match quality for these matches in a controlled experiment, observing the portion of matched pairs which are correct correspondences; more details about this stage can be found in [2]. Another possibility would be to use the Mahalanobis distance between two descriptors, with the covariance learned in a controlled experiment.

In the probabilistic setting, the recognition problem can be formulated as computing the posterior probability of a building model given a query image Q, P(I_j|Q). The probability that building I_j appears in query image Q depends only on the set of matches {m_i} = C(Q, I_j) and can be written as P(I_j|{m_i}). Since spatial relationships between individual matches are not considered, the indexes of the keypoints are omitted; in the following we also omit the model index j for clarity. Instead of computing P(I|{m_i}) directly, we consider the probability that Q does not contain model I, denoted P(Ī|{m_i}) = 1 − P(I|{m_i}). As we will demonstrate shortly, this avoids the need to assign probabilities to unmatched features. Assuming that all matched pairs are independent, we obtain:

    P(Ī|Q) = P(Ī|{m_i}) = P({m_i}|Ī) P(Ī) / P({m_i})
           = [ P(Ī) Π_{i=1..n} P(m_i|Ī) ] / [ Π_{i=1..n} P(m_i) ]    (4)
           = [ P(Ī) Π_{i=1..n} (1 − P(m_i|I)) ] / [ Π_{i=1..n} (P(m_i|I) + P(m_i|Ī)) ]    (5)

where n is the number of matched pairs and P(Ī) follows from the prior P(I), which is assumed uniform for all models. The confidence measure obtained from the first recognition stage could be used in place of this prior.

Denoting

    α_i = (1 − P(m_i|I)) / (P(m_i|I) + P(m_i|Ī))

we can write:

    P(I|Q) = 1 − ( Π_{i=1..n} α_i ) P(Ī).    (6)

The term α_i naturally depends on the match quality, which is related to the distance between the two descriptors. Instead of evaluating P(I|Q) based on all detected features and assigning a small constant probability ε to features which were detected but not matched, we evaluate this term only for matched features. By excluding the contribution of unmatched features, we avoid a situation where they significantly skew the overall likelihood estimate: if there are m features which were not matched, they would contribute a factor of (1 − ε)^m to the overall likelihood and could depress the final likelihood even when a large number of correct matches was detected. α_i is also related to the number of features in each model: it will be large for a model with more features, because the probability that none of its features gets matched is lower. We choose approximately the same number of features for each model to eliminate this bias.
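Equation 6 then reduces to a one-liner over the matched features only; computing the product in log space avoids underflow when there are many matches. Passing the prior explicitly is our own packaging choice.

```python
import numpy as np

def posterior_score(alphas, prior=1.0):
    """P(I|Q) = 1 - prod(alpha_i) * P(not I), per Eq. 6, evaluated only
    over matched features and computed in log space for stability.
    alphas: per-match alpha values in (0, 1]; prior is P(not I),
    taken here as 1 under the uniform-prior simplification."""
    alphas = np.asarray(alphas, float)
    return 1.0 - prior * float(np.exp(np.log(alphas).sum()))
```

Note that with a constant α the score is monotone in the number of matches, which is the voting-equivalence observation made later in the text.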

Model Feature Selection Stage. In order to choose a subset of features so that the number of features is approximately the same in all models, we need to consider the quality of each feature, measured by its repeatability and distinctiveness. A feature which appears in multiple views of the same model is more repeatable and more likely to appear in a query view. On the other hand, a feature which appears in multiple models is less characteristic than those present only in the views of one model. By studying the local information content of each feature in the model views [5], we can estimate the posterior probability P(I_j|m_i) of all model features, directly estimating the conditional density using a Parzen window approach. The probability is higher when the feature is more repeatable and lower when the feature is less characteristic. For each model, we keep only those features with P(I_j|m_i) higher than a threshold τ_p, set so that each image contains at most 500 discriminative features. Models originally contain thousands of features; this procedure removes around 60% of the original feature set.
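One way this Parzen-window scoring could be realized, under loud assumptions: we compare a Gaussian Parzen density over descriptors from the same model's other views against one over descriptors from other models, with equal priors. The bandwidth h, the equal-prior choice, and both helper names are hypothetical, not taken from the paper.

```python
import numpy as np

def parzen_posterior(feat, same_model_feats, other_model_feats, h=0.2):
    """Illustrative sketch: P(model | feat) from Gaussian Parzen estimates
    of p(feat | same model) and p(feat | other models), equal priors.
    High when the feature repeats across the model's views (repeatable)
    and low when it also occurs in other models (not characteristic)."""
    def parzen(x, samples):
        x, samples = np.asarray(x, float), np.asarray(samples, float)
        d2 = ((samples - x) ** 2).sum(axis=1)
        return np.exp(-d2 / (2 * h * h)).mean()
    p_same = parzen(feat, same_model_feats)
    p_other = parzen(feat, other_model_feats)
    if p_same + p_other == 0:
        return 0.0
    return p_same / (p_same + p_other)

def select_features(scores, tau_p=0.5, max_keep=500):
    """Keep the best-scoring features above tau_p, capped at max_keep
    (the text keeps at most 500 discriminative features per image)."""
    idx = [i for i in np.argsort(scores)[::-1] if scores[i] > tau_p]
    return idx[:max_keep]
```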

Learning the parameter α. With the influence of the feature count removed, the relationship between α_i and the similarity score d_i between two descriptors can be represented by a function F: d_i → α_i. We learn this function in a supervised setting. Given a test image and a subset of model images (including the correct model), and fixing a particular similarity threshold τ_s, we obtain a large set of matched pairs between the test image and the model images; denote its cardinality by M. We then identify the set of correct matches among them, with cardinality N. N/M is used as the probability of a true correspondence at the given τ_s, and (M − N)/M is the α value we want for this τ_s. Repeating this for a number of different thresholds τ_s, we obtain a set of sample points along the curve relating the similarity score d_i to α_i. The following robust function

    F(d_i) = 2s³ / ( π (s² + (1 − d_i)²)² )    (7)

approximates the curve shape very well for s = 0.05. The parameter α_i is obtained by dividing F(d_i) by its maximum, to obtain values in the interval [0, 1]. With the mapping between α and d_i in place, the posterior probability of each model given a set of matches can be computed from Equation 6. If the mapping is set to a constant, the probabilistic formulation has the same effect as voting.
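Evaluating Eq. 7 and normalizing by its maximum, which is attained at d_i = 1 where F = 2/(πs), gives the α mapping directly. The interpretation of d_i as a similarity score in [0, 1] follows the text; everything else is a plain transcription of the formula.

```python
import numpy as np

def alpha_from_similarity(d, s=0.05):
    """Robust curve of Eq. 7, F(d) = 2 s^3 / (pi (s^2 + (1 - d)^2)^2),
    normalised by its maximum F(1) = 2/(pi s) so that the returned
    alpha values lie in [0, 1]."""
    d = np.asarray(d, float)
    f = 2 * s**3 / (np.pi * (s**2 + (1.0 - d)**2) ** 2)
    f_max = 2 * s**3 / (np.pi * s**4)   # value of F at d = 1
    return f / f_max
```

With s = 0.05 the curve is sharply peaked: α falls off quickly as the similarity moves away from 1.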

4 Experimental results

To test our approach, we downloaded the ZuBuD database [15], which contains 200 buildings of Zurich with 5 views each. Some of the images were taken with the camera rotated by 90°, so we pre-rotated them before the experiments, because a 90° rotation causes ambiguity; as mentioned before, our approach can currently tolerate a 45° rotation around the optical axis. To demonstrate the benefits of the spatially tuned histogram, we carried out a preliminary experiment comparing it with a few alternatives: a) using pixels from the detected straight lines only to form one color histogram; b) using all the pixels from the three groups to form one color histogram. The first views of the 200 buildings are chosen as models and the second views as test images. The results are summarized in Table 1. The first three columns of the table list the percentage of correctly classified buildings appearing in the top k list; the last column shows the average list size over all test images.

    pixels            1st     top 5   list    average size
    line pixels       65.5%   83.5%   88%     5.5
    single histogram  69%     89%     92%     5.0
    our approach      83%     93%     95%     5.1

Table 1: Recognition performance of the first recognition stage.

As the table shows, with one building view per model we obtain an 83% recognition rate, indicating fairly good discriminative power. We can also see the benefit of the variable top k list: while the top 5 list gives a 93% hit ratio, a variable list of average size 5.1 provides a 95% hit ratio.

For the second experiment we use the query images provided with the Zurich database and all 5 views per model. This setting brings a noticeable improvement: out of 115 query images, 90.3% are recognized correctly, and 96.5% have the correct model in the top 5 list. The remaining 4 images are very difficult to recognize. Three of them come from one building; they fail because of large lighting and viewpoint changes between the query and the model views. The 4th failure is due to a dramatic viewpoint change, which would be difficult to recognize even for a human. Figure 8 shows the two misclassified buildings. We can see that the 32-dimensional spatially tuned color histogram has very good discrimination capability. This first recognition stage shows a better recognition rate than the result reported in [16] and is comparable to the line matching technique described in [6]. The ambiguity measure obtained in this experiment is very small: for 64 query images only 1 candidate is selected, the maximum list size is 9, and the average is 2.2087.

Figure 6: Example of a correctly recognized test image. The query image and the top four results are listed from left to right. Some images are resized for display purposes.

Figure 7: Incorrect recognitions with the correct models in the list.

Besides being fast to compare, the indexing vector is also quite efficient to extract. Though it is currently implemented in Matlab, our experiments show that processing one image takes only 2 seconds on a 1.5 GHz notebook computer. If planar motion can be assumed, the process can be faster still, because the vanishing directions are then known a priori.

4.1 Local feature based recognition

We held two separate experiments at this stage. First, we performed the fine recognition on the candidates provided by the coarse recognition stage. As demonstrated in Figures 4 and 5, the matches of the correct models outnumber those of the other candidates by a large margin, so the benefits of the probabilistic matching did not manifest here. Our experiments show that all the incorrect coarse recognitions except the 4 failures were corrected. Our final recognition rate on the ZuBuD query images is 96%, which improves on the results reported on the same database [6] [16]. The second experiment was based on a database of 68 buildings with no color information available. The first view of each building is used as the model; the other two views are used as test images. Table 2 summarizes the results.

Figure 8: The two buildings which failed. The query images are on the left, with their corresponding five model views on the right. Top: one query view of the building which caused 3 failures. Bottom: the 4th failure.

test set (method)         1st     top 5
view 2 (voting)           95.5%   98.5%
view 2 (probabilistic)    98.5%   98.5%
view 3 (voting)           88.2%   98.5%
view 3 (probabilistic)    91.2%   98.5%

Table 2: Recognition performance of the SIFT feature based matching.

In the experiments with the second database, we see the benefit of the probabilistic model. The 6 images which would be misclassified by voting are classified correctly using the probabilistic formulation, with probability scores that separate the models more distinctly than the raw match counts do. Figure 9a-b shows one such example. An additional example in Figure 9c shows the intricacies of feature based matching in the presence of repetitive structures; in this case the building was correctly recognized. We are currently extending this work by incorporating a pose recovery stage that uses the matches obtained in the second stage.
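The difference between the two decision rules can be illustrated with a toy sketch. The per-match quality weights below are a hypothetical stand-in for the paper's probabilistic formulation; the match counts and final scores echo the Figure 9 example, where the wrong model wins on raw matches but loses on probability:

```python
import numpy as np

def vote_score(match_weights):
    # Plain voting: every tentative match contributes a unit vote.
    return len(match_weights)

def prob_score(match_weights):
    # Stand-in for a quality-weighted score: each match contributes
    # a weight in (0, 1] instead of 1, so many weak, ambiguous
    # matches no longer dominate a few strong, distinctive ones.
    return float(np.sum(match_weights))

# Toy example echoing Figure 9: the wrong model has more raw
# matches (25 vs 22), but its matches are of much lower quality.
wrong = np.full(25, 0.03 / 25)   # 25 weak matches, total weight 0.03
right = np.full(22, 0.12 / 22)   # 22 strong matches, total weight 0.12

print(vote_score(wrong), vote_score(right))   # voting favors the wrong model
print(prob_score(wrong), prob_score(right))   # weighting favors the correct one
```

Under voting the wrong model wins 25 to 22, while under the weighted score the correct model wins 0.12 to 0.03, mirroring the behavior reported for the 6 corrected images.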

5 Conclusion and future work

In this paper, we proposed a hierarchical scheme for building recognition. We first introduced a new spatially tuned color histogram representation of buildings, which is used in the first recognition stage. Our experiments show that it has very good discriminating capability, comparable to local feature based techniques, without the need for finding correspondences. When multiple views of the models are available, the selected candidates can be obtained with high accuracy and correct classification is achieved in the first stage. Due to its compact size this representation is very efficient and scales well to large databases. In the second stage we used local feature based matching to further refine the recognition results. In the context of the feature based matching scheme we proposed a new, simple probabilistic model accounting for feature quality and demonstrated its performance on a larger database. The second


Figure 9: a) The wrong model with 25 matches and b) the correct model with 22 matches: voting would produce the wrong result, but the probability of the wrong model is only 0.03, much less than the 0.12 of the correct one; c) correct classification using probabilistic voting.

recognition stage is quite self-contained and can stand on its own if color information is not available. The bias toward models with more features is resolved by a feature selection process, which also improves the efficiency of the second matching stage and makes the model database more compact. Both stages complement each other well and enable us to successfully recognize buildings despite large variations in viewpoint and the presence of background clutter. We are currently extending the experiments to incorporate a pose recovery stage using the matches obtained in the second stage.

References

[1] Anonymous.

[2] Anonymous.

[3] H. Aoki, B. Schiele, and A. Pentland. Recognizing personal location from video, 1998.

[4] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In Proceedings of ECCV, 1998.

[5] G. Fritz and L. Paletta. Object recognition using local information content. In ICPR'04, pages 15–18, 2004.

[6] T. Goedeme and T. Tuytelaars. Fast wide baseline matching for visual navigation. In CVPR'04, pages 24–29, 2004.

[7] H. Aoki, B. Schiele, and A. Pentland. Recognizing places using image sequences. In Conference on Perceptual User Interfaces, San Francisco, November 1998.

[8] S. Lazebnik, C. Schmid, and J. Ponce. Affine-invariant local descriptors and neighborhood statistics for texture recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 649–655, 2003.

[9] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, to appear, 2004.


[10] B. W. Mel. SEEMORE: Combining color, shape and texture histogramming in a neurally inspired approach to visual object recognition. Neural Computation, 9:777–804, 1997.

[11] S. Nayar, S. Nene, and H. Murase. Subspace methods for robot vision. IEEE Transactions on Robotics and Automation, 6(5):750–758, 1995.

[12] D. Robertson and R. Cipolla. An image-based system for urban navigation. In BMVC, 2004.

[13] C. Schmid. A structured probabilistic model for recognition. In Proceedings of CVPR, Kauai, Hawaii, pages 485–490, 1999.

[14] C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. Pattern Analysis and Machine Intelligence, 19:530–535, 1997.

[15] H. Shao, T. Svoboda, and L. Van Gool. ZuBuD: Zurich buildings database for image based recognition. Technical Report No. 260, Swiss Federal Institute of Technology, Switzerland, 2003.

[16] H. Shao, T. Svoboda, T. Tuytelaars, and L. Van Gool. HPAT indexing for fast object/scene recognition based on local appearance. In Image and Video Retrieval, Lecture Notes in Computer Science, pages 71–80, July 2003.

[17] M. Stricker and A. Dimai. Spectral covariance and fuzzy regions for image indexing. Machine Vision and Applications, 10:66–73, 1997.

[18] A. Torralba, K. Murphy, W. Freeman, and M. Rubin. Context-based vision system for place and object recognition. In International Conference on Computer Vision, 2003.

[19] A. Torralba and P. Sinha. Recognizing indoor scenes. MIT AI Memo, 2001.