
Learning Shape Hierarchies for Shape Detection

Xiaodong Fan
Dept. of Electrical and Computer Engineering
The Johns Hopkins University
[email protected]

Donald Geman
Dept. of Applied Mathematics and Statistics
The Johns Hopkins University
[email protected]

Abstract

One approach to detecting objects in greyscale scenes is to design the computational process itself rather than probability distributions (generative modeling) or decision boundaries (predictive learning). The general idea is to construct a tree-structured representation of the space of object instantiations as well as a binary classifier for each cell in this hierarchical scaffold. This is highly efficient due to pruning groups of hypotheses simultaneously at all stages of processing. The emphasis in existing work has been on design and modeling. Our approach is more data-driven and focused on learning. In particular, we induce the hierarchy of shapes directly from training data, employ boosting to build the classifiers utilizing cell-dependent "negative" examples, and extend and adapt the hierarchy as new samples are encountered. The resulting generality is illustrated by applying the very same learning and parsing algorithms to detecting both rigid and deformable shapes.

1. Introduction

Detecting instances from generic object classes is a challenging problem for many reasons. One is the extreme variation in presentation, due to both linear and nonlinear transformations; for example, the variation within the class of boats or handwritten threes is evidently enormous and difficult to capture, even with a very large number of examples. How does one organize the computation, both offline and online, in order to account for all presentations?

Most methods concentrate on learning and modeling, e.g., estimating decision boundaries directly from data or designing and then estimating class-conditional distributions of features. In classic pattern recognition tasks, such as character and face recognition, the best results often require very large training sets, and detection often proceeds by an exhaustive search over image positions and scales, in effect applying all learned classifiers or models to each corresponding subimage.

Relatively less attention has been paid to efficiency, whether in the sense of shared parts, small-sample learning or limiting online computation. There is, however, a growing body of work in which efficiency is addressed by hierarchical representations. Variations on this theme can be found in work on hierarchical matching [11, 21], feature-sharing and parts [2, 13, 19], cascades [18, 22], accelerated nearest-neighbor indexing [3, 10] and coarse-to-fine search [1, 9, 12, 14].

When the emphasis is on efficient scene parsing, the main idea, as employed explicitly by the authors in [1, 9, 11], is to simultaneously investigate entire groups of "similar" object instances at various levels of resolution so that multiple hypotheses can be jointly pruned. One begins with a tree-structured, hierarchical representation of all possible object instances. The groupings (cells in the hierarchy) can be quite heterogeneous with respect to both the class and the presentation of the objects. There is a classifier for each cell which is dedicated to simultaneously detecting the corresponding object instances. However, the number of classifiers to execute is not fixed in advance; rather, it is data-driven, because a classifier is executed if and only if all its ancestor classifiers are executed and return a positive answer. An object instance is detected if all the classifiers for all the cells which cover that instance respond positively. In this way, the dominating "clutter" hypothesis can be accepted as soon as possible and the examination of specific hypotheses can be delayed as long as possible.

We propose several ways to generalize this framework. In previous work the emphasis has been on design issues, both theoretical and algorithmic. For instance, statistical foundations were considered in [1] and [5], and the hierarchies in previous work were designed by hand, either entirely, as in [9], or partially, as in [1, 11]. Moreover, learning was neither adaptive nor sequential.

There are advantages to having a more data-driven procedure. We may wish to accommodate objects (e.g., deformable shapes such as handwritten digits) with high-dimensional "poses", or many object classes at once, in which case automatic hierarchy construction is almost unavoidable. This is addressed in §3 using an entropy-based criterion to recursively group training samples with similar presentations by agglomerative clustering. Moreover, in previous work, each cell classifier in the hierarchy is constructed independently of the samples likely to be encountered at execution, which is not consistent with the sequential, coarse-to-fine nature of scene parsing. In §4 we consider a specific, cell-dependent "alternative hypothesis" and employ boosting to train cell classifiers against those shapes we anticipate to be especially confusing if that classifier is evaluated. Finally, in §5, we extend the hierarchy by regarding the original leaf cells as the roots of attentional cascades [22], and adapt it to new samples using sequential learning. This, together with online competitions among conflicting explanations, allows us to obtain a significant reduction in false positives.

These ideas are illustrated by recognizing handwritten digits (10 classes), the printed characters on license plates (37 classes) and 3D objects (20 classes). As in previous implementations by various authors, detection is rapid and small-sample training is feasible. Our objective is not state-of-the-art error rates, although this is in fact achieved for plate interpretation. In particular, we do not use sophisticated features or cell classifiers; on the contrary, we put this issue in the background by using only linear classifiers with binary edge variables. Instead, our goal is to demonstrate the generality of a more data-driven approach by applying the same learning and parsing algorithms to deformable, rigid and 3D shapes. This critical assessment is continued in §7.

2. Hierarchical Framework

The methodology is designed to detect shapes from multiple, generic object classes in greyscale images. The output is a list of detected shapes, along with their class labels and some information about their specific presentation in the image, e.g., the geometric pose if there is a meaningful, low-dimensional representation, as with printed characters or frontal views of faces. Following the authors in [1] and elsewhere, indexing primes global interpretation in the sense that hierarchical search results in a list ("index") of candidate detections but does not account for context, such as expected relationships among detected shapes; this is the role of more intense computation driven by extracting a global interpretation from the index. The natural constraint on indexing is that all the true instances are represented in the index with high probability, at least in terms of correct class labels and approximate instantiation parameters. A key aspect of post-indexing, or interpretation, is "active classification" in the sense of pruning the index set using binary classifiers which are designed online to disambiguate among conflicting detections. This is computationally feasible when enough auxiliary information about the presentation is provided during indexing, since very specific hypotheses can then be examined. Our emphasis here is on efficient indexing and online pruning.

2.1. Efficient Representation

Let C be the set of object classes of interest, regarded as unambiguous semantic labels, and let Ω represent the set of all possible subimages containing shapes with labels in C. Given a subimage ω corresponding to an unlabeled shape, our goal is to efficiently determine whether or not ω ∈ Ω. Evidently this cannot be done by checking item by item. On the contrary, we would like to accomplish this in a highly efficient manner by grouping shapes with shared properties and checking entire subsets simultaneously. The ideal data structure would then be a tree-structured hierarchy H corresponding to a recursive partitioning of the entire space Ω. The subset or "cell" of Ω at each node t of the underlying tree graph T is denoted Ω_t. Near the root, the cells might be very heterogeneous with respect to both class and presentation. In contrast, the cells at the leaves of T would represent homogeneous subsets of shapes, in particular from the same class, but even more alike in the sense of similar presentation (e.g., scale, position, subclass type).

2.2. Efficient Search

Ideally there would be a binary classifier f_t ∈ {−1, 1} at each node t that returns 1 if ω ∈ Ω_t, and −1 otherwise. In this way, all the shapes in Ω_t can be accepted or eliminated together. The classifiers are executed breadth-first, coarse-to-fine. Start with the root classifier and proceed recursively. In general, the classifiers at the children of node t are performed if and only if f_s = 1 for every node s which is an ancestor of t (including t itself). For each surviving leaf node t, we add Ω_t to the index (list of detections). This strategy is efficient because most image areas do not enclose a shape of interest and hence, with high probability, the search along most branches stops quickly. Put differently, many possible labelings are simultaneously rejected without dealing with shape detail.
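As a minimal illustration of this search (a sketch of ours, not the authors' code; the Node fields and the function name parse are hypothetical), each node can carry its classifier f_t and its children, and the traversal prunes an entire subtree as soon as one classifier answers −1:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Node:
    """One cell Omega_t of the hierarchy H."""
    classifier: Callable[[Sequence[int]], int]      # f_t : x -> {-1, +1}
    children: List["Node"] = field(default_factory=list)
    label: object = None                             # set at leaves (class, rough pose)


def parse(root: Node, x: Sequence[int]) -> list:
    """Breadth-first, coarse-to-fine search: a node's children are examined
    only if the node itself (and hence every ancestor) returned +1."""
    index, frontier = [], [root]
    while frontier:
        nxt = []
        for node in frontier:
            if node.classifier(x) != 1:
                continue                             # prune the whole subtree at once
            if node.children:
                nxt.extend(node.children)
            else:
                index.append(node.label)             # surviving leaf -> candidate detection
        frontier = nxt
    return index
```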

3. Learning a Hierarchy from Data

When objects have a well-defined position, scale and orientation (e.g., printed characters or faces), it is natural to manually specify a recursive partitioning of the abstract space of geometric poses [1, 9] with similarity based on simple pose continuity. For multiple classes, a full hierarchical decomposition of the class/pose space can then be constructed by joining the pose decomposition with a clustering of class prototypes [1], thereby identifying the cells in H with both semantic and appearance labels. However, this does not readily extend to highly deformable shapes or 3D objects whose spatial presentation involves many degrees of freedom. At some level, it is difficult to stipulate the pose of an object with sufficient precision to perform hierarchical clustering analytically, i.e., based on pose continuity.


We therefore take a more data-driven approach. Rather than designing the hierarchy by hand and synthesizing training sets at each node, as in the work cited above, we learn H from the data. Of course we do not have the entire set Ω; all we have is a set of training samples. The cells we obtain can be regarded as sparse representations of the ideal cells Ω_t discussed above. More specifically, given a set L of training samples (segmented subimages containing examples of the shapes of interest together with class labels), we construct a recursive partitioning of L. Such a process automatically generates a positive training set L_t^+ ⊂ L representing the cell Ω_t at each node t.

The construction of H = H(L) from a training set L is bottom-up, based on recursive clustering. The basic idea is to measure the tightness or homogeneity of a cluster by the amount of uncertainty in the empirical distribution of the feature vector over the cluster. This leads in a natural way to a special case of a previously-used global cost function and to an entropy-based distance between two clusters of training samples. The recursive partitioning of L is performed in two phases. The first is within-class clustering, which partitions each object class into homogeneous subclasses; the results are then refined by iteratively moving samples from one subclass to another, thereby reducing the global cost. The second is between-class clustering, in which subclasses of all classes are recursively merged without regard to class labels. The final, root cluster is L itself.

3.1. An Entropy-Based Distance Measure

As usual, each element ω ∈ L is a pair (x, y), where x = (x_1, ..., x_d) is the feature vector computed from a shape ω and y ∈ C is the class label. We will write X and Y to indicate the corresponding random variables. Here, each component x_k of x is a binary variable indicating the presence of a certain categorical feature. The particular one we use is described in §4, but the methodology is independent of x.

Begin with the finest partitioning of L into disjoint clusters {L_i}_{i=1}^N. We use standard agglomerative clustering, recursively merging the two closest clusters. The same procedure is applied to both phases. Only the initialization changes: the L_i are singletons during the within-class phase and subclasses (i.e., the outputs of the first phase) during the between-class phase.

Rather than defining a metric on the observation space {0, 1}^d, which is not straightforward, we utilize a statistical measure of cluster homogeneity based on entropy and a corresponding measure for the whole partition. We employ a greedy minimization procedure in which each fusion in the agglomerative process achieves the minimal increment of partition entropy.

Let P_{L_i}(x) denote the empirical distribution of the feature vector based on L_i. For simplicity, we assume that the binary variables X_k are conditionally independent given L_i. This assumption is obviously not true in practice, especially for heterogeneous clusters, but it dramatically simplifies our computation and yields reasonably good partitions in our experiments.

Recalling that x_k(ω) ∈ {0, 1}, the relative frequency estimates of the marginals are

$$p_{i,k} = P_{L_i}(X_k = 1) = \frac{1}{|L_i|} \sum_{\omega \in L_i} x_k(\omega)$$

and the corresponding entropy of P_{L_i}(x), or cluster entropy of L_i, is

$$H(L_i) = -\sum_{x} P_{L_i}(x) \log P_{L_i}(x) = -\sum_{k=1}^{d} \bigl[\, p_{i,k} \log p_{i,k} + (1 - p_{i,k}) \log (1 - p_{i,k}) \,\bigr].$$

Finally, the entropy of the partition is the weighted average of the cluster entropies [6, 16]:

$$\sum_{i=1}^{N} \frac{|L_i|}{|L|}\, H(L_i). \qquad (1)$$

The cluster entropy has long been used to measure the compactness of a partition [6], here in the sense that a "tight" cluster should correspond to a concentrated probability distribution for X. This leads to a natural cost function for partitioning a given data set (see, e.g., [16]). However, in our case, rather than minimizing (1) over all possible partitions, we successively minimize the cost increment due to the recursive merging of clusters. More specifically, suppose we consider merging L_i and L_j. The difference between the entropy of the new and old partitions is then

$$\frac{|L_i \cup L_j|}{|L|}\, H(L_i \cup L_j) - \frac{|L_i|}{|L|}\, H(L_i) - \frac{|L_j|}{|L|}\, H(L_j), \qquad (2)$$

which is always non-negative due to the concavity of entropy and to expressing P_{L_i ∪ L_j} as a mixture of P_{L_i} and P_{L_j}. The pairwise distance between two clusters L_i and L_j is then defined to be

$$d(L_i, L_j) = |L_i \cup L_j|\, H(L_i \cup L_j) - |L_i|\, H(L_i) - |L_j|\, H(L_j), \qquad (3)$$

and agglomerative clustering proceeds by merging, at each stage, the two clusters for which (3) is minimized; equivalently, the incremental gain in (1) is minimized.
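A minimal sketch of this greedy step (our illustration, not the authors' code; the function names are ours) computes H(L_i) under the conditional independence assumption and, at each stage, merges the pair of clusters that minimizes the increment (3):

```python
import numpy as np


def cluster_entropy(cluster: np.ndarray) -> float:
    """H(L_i): sum of the Bernoulli entropies of the d marginals p_{i,k}."""
    p = np.clip(cluster.mean(axis=0), 1e-12, 1 - 1e-12)   # avoid log(0)
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)).sum())


def merge_cost(a: np.ndarray, b: np.ndarray) -> float:
    """d(L_i, L_j) of Eq. (3): the (unnormalized) increment of the partition entropy (1)."""
    u = np.vstack([a, b])
    return (len(u) * cluster_entropy(u)
            - len(a) * cluster_entropy(a) - len(b) * cluster_entropy(b))


def agglomerate(clusters):
    """Greedily merge the two clusters with minimal cost increment until one remains;
    the recorded merge order defines the tree H."""
    merges = []
    while len(clusters) > 1:
        costs = [(merge_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(costs)
        merges.append((i, j))
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```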

3.2. Clustering

At this point, the most straightforward approach would be to treat every training sample as an initial cluster. However, this would be computationally difficult and, more importantly, would not yield the types of subclasses which make hierarchical search effective.

We first partition each class separately. Fix a class y ∈ C. Initially, every training sample serves as an initial cluster (actually we augment this set by shifting each sample one pixel in each direction to get less degenerate probability estimates for the features). The process is terminated when the clusters become too heterogeneous. This occurs when the cluster entropy exceeds a threshold H_max, determined as a fraction (e.g., 0.3 in our experiments) of the cluster entropy for the entire set of examples in L with class y.

Furthermore, the global cost of the partition of class y can be reduced iteratively by an algorithm akin to k-means. First, estimate the likelihood of every training sample ω under the feature distribution of each subclass and assign ω to the cluster that maximizes the likelihood. Second, re-estimate the feature distributions for the affected subclasses. Now alternate between these two steps until the entropy of the partition stops decreasing. The following theorem shows the effectiveness of the process, which iteratively reduces the entropy of the partition as defined in (1); see Appendix A for the proof.

Theorem (Iterative entropy reduction). Given a partition P = {L_i}_{i=1}^N of L and any sample (x, y) ∈ L, construct a new partition P* = {L_i^*}_{i=1}^N of L by re-assigning (x, y) to the i*-th cluster, where i* = argmax_i P_{L_i}(x). Then H(P*) ≤ H(P).

This process is of course similar to the EM algorithm, but without explicitly assuming a mixture model for X or making soft assignments. In our case, we put each sample in a single cluster at each step and there is no averaging over clusters in computing sample likelihoods. Our method is also in the same spirit as the one in [7] for splitting nodes in building classification trees.
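A sketch of this refinement (illustrative only; the guard for empty clusters and the variable names are our additions) alternates hard re-assignment under the naive-Bayes marginals with re-estimation of those marginals:

```python
import numpy as np


def refine(X: np.ndarray, assign: np.ndarray, n_clusters: int, max_iter: int = 50) -> np.ndarray:
    """X: (n, d) binary feature matrix; assign: initial cluster index of each sample."""
    for _ in range(max_iter):
        # re-estimate the Bernoulli marginals p_{i,k} of every cluster
        p = np.vstack([
            np.clip(X[assign == i].mean(axis=0), 1e-6, 1 - 1e-6)
            if np.any(assign == i) else np.full(X.shape[1], 0.5)
            for i in range(n_clusters)])
        # hard re-assignment: each sample moves to its most likely cluster
        loglik = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T
        new_assign = loglik.argmax(axis=1)
        if np.array_equal(new_assign, assign):
            break                       # the partition entropy (1) has stopped decreasing
        assign = new_assign
    return assign
```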

Finally, we perform agglomerative clustering using the same optimization criterion, but starting with the aggregated collection of all subclasses from all classes in C. Therefore, the complexity depends only on the number of subclasses, which is much smaller than the number of training samples. However, this time we continue the recursive merging until all the training samples are grouped in a single cluster, the root of H.

4. Learning Cell Classifiers

Whereas construction of the hierarchy is bottom-up, the training of the cell classifiers f_t proceeds top-down. This is necessary since the alternative hypothesis H_t^alt for building f_t depends on the estimated performance of the cell classifiers "above" t, which serve as filters to extract a negative training set L_t^- from L. We can then use any standard machine learning technique to train a binary classifier f_t from L_t^+ and L_t^-.

A cell-dependent alternative: Given a shape ω of unknown label, we seek a binary classifier f_t to distinguish between H_t^null: ω ∈ Ω_t and the specific alternative H_t^alt = B_t ∩ (ω ∉ Ω_t), where B_t is the event that f_s(x) = 1 for every ancestor s of t up to the root. In effect, we are "boosting" f_t by training against those shapes we anticipate to be especially confusing for the current task.

The sets L_t^+ and L_t^- represent training samples belonging to H_t^null and H_t^alt, respectively. L_t^+ is automatically generated by the hierarchical clustering of the entire training set L introduced in §3; L_t^- contains all training samples that respond positively to all the ancestor classifiers and do not belong to Ω_t.

Features: We use the same binary edges as in [1]; there is then an X_k for each location, one of four orientations and two polarities. Moreover, following the authors of [1], the edges are "spread" (copied to neighboring locations) in order to gain stability to local geometric distortions. This technique also facilitates the grouping of shapes based on shared features.
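A rough sketch of such features (our own simplification, not the paper's implementation; the particular differences, threshold and spreading radius are assumptions):

```python
import numpy as np
from scipy.ndimage import binary_dilation


def spread_edge_features(img: np.ndarray, thresh: float = 8.0, radius: int = 1) -> np.ndarray:
    """Return an (8, H, W) binary array: 4 orientations x 2 polarities, spread."""
    img = img.astype(float)
    diffs = [
        np.diff(img, axis=1, append=img[:, -1:]),      # horizontal neighbour difference
        np.diff(img, axis=0, append=img[-1:, :]),      # vertical neighbour difference
        np.roll(img, (-1, -1), axis=(0, 1)) - img,     # one diagonal
        np.roll(img, (-1, 1), axis=(0, 1)) - img,      # the other diagonal
    ]
    neighbourhood = np.ones((2 * radius + 1, 2 * radius + 1))
    maps = []
    for d in diffs:
        for polarity in (+1, -1):
            edge = polarity * d > thresh               # binary edge map for this type
            # "spreading": copy each detected edge to its neighbours (dilation)
            maps.append(binary_dilation(edge, structure=neighbourhood).astype(np.uint8))
    return np.stack(maps)
```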

Linear Classifiers: For simplicity, we restrict ourselves to linear classifiers. The most straightforward approach is naive Bayes: assume the features are conditionally independent under each H_t, as in [9] and other references. By definition, this ignores important correlations among the feature components. Alternatively, we choose the AdaBoost procedure used in [22] to train face detectors. The effect is to select a small number of discriminating features, and execution is then rapid. Since our features are binary, the final classifier is linear in the selected features. Moreover, AdaBoost often has good generalization properties since it rapidly achieves large margins [20].

Details on the construction may be found in [22]. Essentially, at each round, the best feature is selected in the sense of minimizing a weighted error on the training sets, and samples are then re-weighted to over-dedicate the next classifier to currently mis-classified examples. After k rounds, one obtains a linear classifier f_t with k selected features, say x_{i_1}, ..., x_{i_k}, and k weights λ_1, ..., λ_k:

$$f_t(x) = \mathrm{sign}\Bigl( \sum_{j=1}^{k} \lambda_j x_{i_j} - \lambda_0 \Bigr), \qquad (4)$$

where λ_0 is a threshold for achieving a desired balance between type-I (false negative) and type-II (false positive) errors.

The number of features k is determined by estimating the sum of the type-I and type-II errors on the training set for varying k (1 ≤ k ≤ 200 in our experiments), and choosing the minimizing k. However, the AdaBoost procedure is designed to balance the errors, whereas our fundamental constraint is a negligible type-I error. Once the features are chosen, we attempt to satisfy this as follows: first, an ideal, maximal type-I error α (α = 0.001 in our experiments) is specified for the entire hierarchy. If H has depth D, then, assuming independent errors, the maximal type-I error α_t for each cell classifier f_t is simply set to α^{1/D}, enforced by lowering the threshold λ_0 in f_t until it is achieved, at least empirically (on the training set).
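The following sketch (ours, heavily simplified; the single-feature stump form, the quantile-based calibration and all parameter names are assumptions, not the authors' code) trains one such linear cell classifier by discrete AdaBoost over individual binary features and then sets λ_0 from the empirical type-I budget:

```python
import numpy as np


def train_cell_classifier(X_pos, X_neg, k=50, alpha_t=0.01):
    """AdaBoost over single binary features, then calibrate lambda_0 so that the
    empirical type-I error on the positives is at most alpha_t
    (the paper sets alpha_t = alpha**(1/D) for a hierarchy of depth D)."""
    X = np.vstack([X_pos, X_neg]).astype(float)
    y = np.hstack([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    w = np.full(len(y), 1.0 / len(y))
    coef = np.zeros(X.shape[1])
    for _ in range(k):
        # weighted error of the stump "say +1 iff x_j = 1", for every feature j
        err = w @ ((2 * X - 1) != y[:, None])
        err = np.minimum(err, 1 - err)                 # the flipped stump is also allowed
        j = int(err.argmin())
        e = float(np.clip(err[j], 1e-10, 0.5 - 1e-10))
        lam = 0.5 * np.log((1 - e) / e)
        pred = np.where(X[:, j] == 1, 1.0, -1.0)
        if float(w @ (pred != y)) > 0.5:               # the flipped stump was the better one
            lam = -lam
        coef[j] += lam                                 # constant offsets are absorbed by lambda_0
        w *= np.exp(-lam * y * pred)
        w /= w.sum()
    # lower lambda_0 until at most a fraction alpha_t of the positives score below it
    lam0 = float(np.quantile(X_pos @ coef, alpha_t))
    return coef, lam0                                  # f_t(x) = sign(coef @ x - lam0)
```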

5. Extensions and Updates

Up to this point, we have described an algorithm to learn a hierarchy of classifiers for fast coarse-to-fine shape indexing. The entire learning procedure consists of two steps: automatic hierarchy construction (§3) and cell-by-cell classifier training (§4). In this section, we discuss two adaptive mechanisms for achieving further gains, both in accuracy and in the efficiency of learning.

5.1. Extending the Hierarchy

Every branch from the root of H to a leaf cell Ω_t can be interpreted as an attentional cascade in the sense of [22], sequentially rejecting shapes and background clutter that increasingly "resemble" the targeted shapes in Ω_t. Naturally, one can continue this chain of rejectors if the false positive rate at the leaf is still substantial.

Therefore, for any leaf t for which the false positive error of f_t exceeds a threshold, we "extend" the hierarchy to depth D + 1 by adding a single new classifier emanating from t in order to prune detections before proceeding to the more explicit and expensive competitions described in §6. This process can be repeated several times. The alternative hypothesis is defined in the same fashion as above, and the same AdaBoost procedure is used to train the classifier. Indeed, the whole philosophy is based on delaying dedicated or intensive procedures as long as possible and focusing computational resources on the most ambiguous cases.

5.2. Updating the Hierarchy

We describe a sequential learning algorithm to update the hierarchy, both H and the cell classifiers, as new training samples are received. This is far more efficient than re-learning everything in batch mode.

First, each new sample (x, y) is assigned to the subclass of class y ∈ C for which the likelihood of x is maximized under the empirical feature distribution. The same iterative method described in §3.2 can be applied to refine the updated subclasses. Second, a subclass can be split into two if it is too heterogeneous relative to the same entropy criterion introduced in §3.2. We use a bipartition method for splitting: start with a random division of the original subclass into two sets and refine the partition by the iterative method described in §3.2. (Multiple random initializations can be used since iterative entropy minimization only achieves a local minimum.) Finally, all these changes can be simply propagated to the root, thereby updating the entire hierarchy to accommodate the new data. Of course, the between-class clustering (grouping of subclasses) is only sub-optimal. However, this method dramatically reduces the complexity of learning relative to starting over.

Once the hierarchy is updated, the cell classifiers are re-trained following the same procedure as described in §4.
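A small sketch of the first two update steps (illustrative only; the helper names are ours, and the random bipartition shown would be refined by the iterative method of §3.2):

```python
import numpy as np


def _cluster_entropy(S: np.ndarray) -> float:
    p = np.clip(S.mean(axis=0), 1e-12, 1 - 1e-12)
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)).sum())


def add_sample(subclasses: list, x: np.ndarray, h_max: float) -> list:
    """subclasses: binary feature matrices of one class; x: a new sample of that class."""
    # 1. assign x to the subclass under whose marginals it is most likely
    scores = []
    for S in subclasses:
        p = np.clip(S.mean(axis=0), 1e-6, 1 - 1e-6)
        scores.append(float(x @ np.log(p) + (1 - x) @ np.log(1 - p)))
    i = int(np.argmax(scores))
    subclasses[i] = np.vstack([subclasses[i], x])
    # 2. split the subclass if the entropy criterion of Sec. 3.2 is violated
    #    (random bipartition; to be refined by the iterative method of Sec. 3.2)
    if _cluster_entropy(subclasses[i]) > h_max:
        S = subclasses[i]
        half = np.random.permutation(len(S)) < len(S) // 2
        subclasses[i] = S[half]
        subclasses.append(S[~half])
    return subclasses
```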

6. From Indexing to InterpretationDuring indexing, i.e., the online collection of detectedshapes, the image lattice is scanned everyη pixels; that is,only the locations on a coarse sublattice are visited. Thevalue ofη depends on the degree of translational invarianceof the cell classifiers at the root of the hierarchy; for in-stance, in previous work the root classifiers were invariantto a distinguished point on the object within a8× 8 or even16×16 window. Around every scanned locationz, a subim-ageI0 of a fixed size (e.g.,30 × 50 in our experiments onthe license plates) centered atz is extracted and sent to thehierarchy to detect candidate shapes inI0.
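A sketch of this scanning loop (ours; the stride, window size and helper names are placeholders, with parse_fn and extract_features standing in for the search and feature sketches above):

```python
def index_image(image, root, parse_fn, extract_features, eta=5, h=30, w=50):
    """Visit a coarse sublattice of stride eta, cut a fixed-size subimage around
    each visited location and run the coarse-to-fine parse on it."""
    detections = []
    H, W = image.shape
    for r in range(h // 2, H - h // 2, eta):
        for c in range(w // 2, W - w // 2, eta):
            sub = image[r - h // 2: r + h // 2, c - w // 2: c + w // 2]
            x = extract_features(sub)           # e.g. spread binary edge features (Sec. 4)
            for label in parse_fn(root, x):     # coarse-to-fine search (Sec. 2.2)
                detections.append((r, c, label))
    return detections
```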

We assume that there is at most one shape in each subimage. Hence, multiple detections generated by indexing in a single subimage must be due to false positives, and are pruned in the interpretation stage. To this end, we perform pairwise competitions among the initial detections using a binary classifier for each pair of conflicting detections, i.e., each pair of subclasses which arise from different classes. These classifiers are trained online, a form of "active classification" which we argue is essential to efficient scene parsing. Indeed, it is almost impossible to anticipate every pair of conflicting interpretations that might be encountered online and to learn a corresponding classifier offline.

For pairwise competition, we use the naive Bayes classifier f_B(x). This is of the same form as (4) but applied to the whole feature vector x = (x_1, ..., x_d) with coefficients

$$\lambda_j = \log \frac{p_j (1 - q_j)}{(1 - p_j)\, q_j},$$

where p_j and q_j are the estimated probabilities of the event X_j = 1 under the two competing subclasses. Learning f_B(x) is extremely fast as no optimization is involved. In fact, the pieces are already in storage from the leaf classifiers, since the linear combination of features is simply

$$\sum_{j=1}^{d} \log \frac{p_j}{1 - p_j}\, x_j \;-\; \sum_{j=1}^{d} \log \frac{q_j}{1 - q_j}\, x_j \;=\; \phi_p - \phi_q.$$

We can pre-compute φ_p and φ_q for every reached subclass, and evaluation of f_B(x) is trivial. Finally, the threshold in f_B(x) is learned online from the examples in the two subclasses and is stored for future use. The selected subclass is simply the one with the highest winning percentage.
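A sketch of this competition (our notation; the midpoint threshold below merely stands in for the paper's online threshold choice):

```python
import numpy as np


def log_odds(p: np.ndarray) -> np.ndarray:
    """Per-subclass weights: phi_p(x) = sum_j log(p_j / (1 - p_j)) * x_j."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))


def pairwise_threshold(Xp, Xq, wp, wq) -> float:
    """Threshold for the score phi_p(x) - phi_q(x); here simply the midpoint of the
    two subclasses' mean scores (a stand-in for the paper's online choice)."""
    sp = Xp @ (wp - wq)
    sq = Xq @ (wp - wq)
    return float((sp.mean() + sq.mean()) / 2.0)


def decide(x, wp, wq, tau) -> str:
    """Winner of one pairwise competition between subclasses p and q."""
    return "p" if float(x @ (wp - wq)) > tau else "q"
```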

This type of pairwise competition extends to detected shapes at nearby subimages whose bounding boxes significantly overlap, since these shapes are essentially explaining the same image data. In addition, prior knowledge about the layout of shapes in the scene can be used to further refine the interpretations. For example, in reading license plates, one can exploit the knowledge that the characters appear on a straight line, and compare two chains of detections rather than individual pairs. However, this is beyond the scope of this paper and we refer the reader to [1] for an example of how this can be done.

7. Experiments

7.1. Detecting Handwritten Digits

In the first experiment, we detect and recognize isolated handwritten digits from the MNIST database [15]. There are 10 classes. Since we view this as shape detection rather than pure classification, we do not assume that each test image contains a digit; the same algorithm could be trained against cluttered backgrounds (only the training set would change), although this is only illustrated in the second experiment. Many algorithms have been tested on this benchmark database; see [15] for a list of comparisons. Our goal, however, is not the state-of-the-art error rate, nor is our method in any way dedicated to this particular problem.

Of course, handwritten digits are highly deformable objects. Hence some approaches use deformable template matching [2, 4]. However, these methods are often computationally inefficient and may encounter problems in generalizing to interpretations involving many object classes. Moreover, the method in [4], which is among the best on this database, uses 20000 prototypes; the error rate increases to 2.5% if only a few hundred are selected from them. Alternatively, one can use machine learning techniques to train classifiers that generalize well under significant within-class variations [15]. But most of these methods depend on a large training set, e.g., 60000 samples.

The tree hierarchy: We randomly select only 100 training samples per class. After within-class clustering, each class is divided on average into 15 subclasses with fairly common structure (see Figure 1). Between-class clustering leads to a full hierarchy H consisting of 13 levels and 158 leaves. The hierarchy is then extended at some leaves, as described in §5.1, yielding a maximum depth of 17. The cell classifiers obtained by AdaBoost are easy to interpret. At the upper levels near the root, the cells are very heterogeneous in class and appearance and the classifiers simply either check for a few common features (oriented edges in our case) or penalize features that rarely appear. At deeper levels, the cell classifiers are more like global templates. This is illustrated in Figure 2.

Accuracy: We send 10000 test samples down the hierarchy. On average, 6.3 subclasses are reached per image. On 22 images, no subclass from the true class is reached (0.2% type-I error). After the pairwise competition, the mis-classification rate is 2.52%.

Figure 1: Some subclasses of handwritten digits.

Figure 2: (a) A sample of shapes in a heterogeneous cell; (b) features with positive coefficients in the cell classifier, indicating common oriented edges on the shapes in the cell (not shown are the features with negative coefficients, representing edges not expected to appear on the shapes); (c), (d) the same for a leaf cell.

Moreover, we also experimented with different sizes of training sets; see Table 1. Our results are comparable to some methods reported in [15], but we use many fewer training samples.

Computation: On a Pentium IV 3.2GHz desktop, it takes less than a minute to recognize 10000 digits. To see the gain in online efficiency due to hierarchical testing, notice that on average we compute 7205 features per image using the hierarchy trained from 1000 samples. However, without the hierarchy, one needs to evaluate the cell classifiers for each one of the 158 subclasses (nodes at the bottom level), which results in 28126 features per image. Therefore, the coarse-to-fine search clearly reduces computation by pruning many subclasses efficiently.

Table 1: Results on the MNIST database. The result corresponding to 300 examples per class is obtained by updating the hierarchy learned from 100 examples per class, as described in §5.2.

number of samples/class    mis-classification rate
100                        2.52%
300                        2.01%

7.2. Reading License Plates

In the second experiment, we detect and recognize instances from multiple shape classes in a single image without pre-segmentation. Our objective is to identify the characters on a license plate from a photograph of the rear of a car, as shown in Figure 3. We omit the process of detecting and extracting the plates from the photograph as it is irrelevant to the main theme of the paper.

Figure 3: A typical photograph and samples of extracted plates.

Although printed characters are generated from fixed prototypes, they appear with different poses (positions, scales and orientations) on the plates. There are also other challenges due to variations in stroke width, variable illumination, background clutter and perspective projection. The coarse-to-fine procedure proposed in [1] offers an efficient detection mechanism, and we borrow the basic strategy, except, again, our method is entirely data-driven. In particular, rather than manually designing a pose decomposition, we learn the full hierarchy of object classes and poses directly from the training set without utilizing pose continuity. That is, hierarchy construction is completely driven by the criterion of grouping shapes with common appearances, as described in §3. As expected, nearby poses are grouped naturally in our within-class clustering process, so that characters can be localized at a desired resolution once a subclass is reached during indexing. In some cases, shapes from different classes are grouped before pose aggregation. For instance, it makes more sense to group vertical B's and 8's before clustering B's with different orientations. Figure 4 shows a sample of shapes from a heterogeneous cell containing multiple classes.

Figure 4: A sample of shapes from a heterogeneous cell.

Training set: There are 37 classes (26 letters, 10 digits and a special symbol). We start from 37 class prototypes. Combinations of scaling, rotation and translation are applied to synthesize training shapes with various poses. However, the pose of the samples is not considered during learning.

The tree hierarchy and indexing results: The learned hierarchy contains 18 levels and 684 leaves. The indexing process proceeds by scanning the plate image every 5 pixels, and applying the hierarchy to each subimage around the visited location. This generates roughly 30 initial detections per plate on the 380 test plate images; see Figure 5(a). Three characters are missed in total. Many of the false positives are due to background clutter and to detections that straddle two nearby characters. Some are caused by confusions among similar characters.

Final interpretation: We use a method similar to that in [1] to prune the index set down to the final string of characters. Most of the ambiguities can be resolved by comparing individually hypothesized characters (Figure 5(c)). In some cases, the competitions may involve strings of detections, as illustrated in Figure 5(d). As a final result, we achieve a 98.9% classification rate per symbol over the 380 plates.

Figure 5: Some results on reading license plates. (a) Result of indexing; (b) final interpretation; (c) pairwise competition of individual characters; (d) comparing strings of characters.

7.3. Recognizing 3D Objects

Finally, we conduct experiments on recognizing 3D objects. Our primary goal is to demonstrate the generality of our data-driven approach in dealing with 3D poses. More specifically, we use the COIL-20 database [17], which collects 20 objects (classes), each imaged from 72 equally-spaced viewing angles at intervals of 5 degrees. All images in the database are pre-processed by size normalization and histogram stretching. This allows us to concentrate on the 3D pose variations only.

We take 36 evenly spaced viewing angles from each object class for training and use the rest for testing. Figure 6 shows some of the subclasses obtained by within-class clustering, which groups objects with nearby viewing angles. Eventually, we extract 5-7 subclasses per class, and the full hierarchy over the 20 classes has 13 levels in total. On average, 2 subclasses are reached per image at the indexing stage with no missed detections, and the final misclassification rate is 0%. This result is encouraging, and suggests that the same learning algorithm can be used to recognize 3D objects, but it is far from demonstrating a working system on 3D objects due to the simplicity of the setting (e.g., absence of clutter) and to testing on views not significantly different from those seen in training.


Figure 6: Some subclasses of the COIL-20 objects.

8. Conclusions

Motivated by efficient shape detection, we have proposed a systematic approach to inducing a hierarchy of shapes and classifiers directly from data. The emphasis is on learning rather than design. The proposed framework is efficient offline because pose information is not assumed and the existing hierarchy can be sequentially updated as new training samples are provided. It is efficient online due to coarse-to-fine indexing followed by a form of "active classification" which is dedicated to the image-dependent detections and which uses cheap, binary classifiers constructed on the spot. We demonstrate the generality of our approach by applying the same learning and parsing algorithms to recognizing shapes with different characteristics, achieving good speed and accuracy with simple binary features, linear classifiers and small training sets.

References

[1] Y. Amit, D. Geman and X. Fan, "A coarse-to-fine strategy for multiclass shape detection," IEEE Trans. PAMI, 26(12): pp. 1606-1621, 2004.

[2] Y. Amit and A. Trouve, "POP: Patchwork of parts models for object recognition," Technical report, University of Chicago, 2004.

[3] J. Beis and D. Lowe, "Shape indexing using approximate nearest-neighbour search in high-dimensional spaces," Proc. CVPR, pp. 1000-1006, 1997.

[4] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. PAMI, 24(4): pp. 509-522, 2002.

[5] G. Blanchard and D. Geman, "Hierarchical testing designs for pattern recognition," Annals of Statistics, 2005.

[6] H. H. Bock, "Probabilistic aspects in cluster analysis," Conceptual and Numerical Analysis of Data, Springer-Verlag, Heidelberg, pp. 12-44.

[7] P. A. Chou, "Optimal partitioning for classification and regression trees," IEEE Trans. PAMI, 13(4): pp. 340-354, 1991.

[8] R. Fergus, P. Perona and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," Proc. CVPR, vol. II, pp. 264-271, 2003.

[9] F. Fleuret and D. Geman, "Coarse-to-fine face detection," Inter. Journal of Computer Vision, vol. 41, pp. 85-107, 2001.

[10] J. Friedman, J. Bentley, and R. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Trans. on Mathematical Software, 3(3): pp. 209-226, 1977.

[11] D. M. Gavrila, "Multi-feature hierarchical template matching using distance transforms," Proc. ICPR, 1998.

[12] S. Geman, K. Manbeck, and E. McClure, "Coarse-to-fine search and rank-sum statistics in object recognition," Technical report, Brown Univ., 1995.

[13] S. Geman, D. Potter, and Z. Chi, "Composition systems," Quarterly J. Applied Math, pp. 707-737, 2002.

[14] F. Jung, "Detecting new buildings from time-varying aerial stereo pairs," Technical report, IGN.

[15] Y. LeCun, "The MNIST handwritten digit database," http://yann.lecun.com/exdb/mnist.

[16] T. Li, S. Ma and M. Ogihara, "Entropy-based criterion in categorical clustering," Proc. ICML, 2004.

[17] S. A. Nene, S. K. Nayar and H. Murase, "Columbia Object Image Library (COIL-20)," Technical Report CUCS-005-96, February 1996.

[18] D. Socolinsky, J. Neuheisel, C. Priebe, J. De Vinney and D. Marchette, "Fast face detection with a boosted cccd classifier," Technical report, Johns Hopkins University, 2002.

[19] A. Torralba, K. P. Murphy and W. T. Freeman, "Sharing features: efficient boosting procedures for multiclass object detection," Proc. CVPR, pp. 762-769, 2004.

[20] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: a new explanation for the effectiveness of voting methods," Proc. ICML, 1997.

[21] S. Ullman, M. Vidal-Naquet, and E. Sali, "Visual features of intermediate complexity and their use in classification," Nature Neuroscience, 5(7), pp. 1-6, 2002.

[22] P. Viola and M. J. Jones, "Robust real-time face detection," Proc. ICCV, pp. II: 747, 2001.

A. Proof of the Theorem in §3.2

Proof. It is equivalent to proving that −|L| H(P*) ≥ −|L| H(P). To this end, notice that

$$
\begin{aligned}
-\sum_{i=1}^{N} |L_i^*|\, H(L_i^*)
&\overset{(a)}{=} \sum_{i=1}^{N} |L_i^*| \sum_{k=1}^{d} \sum_{j=0}^{1} P_{L_i^*}(X_k = j) \log P_{L_i^*}(X_k = j) \\
&\overset{(b)}{\geq} \sum_{i=1}^{N} |L_i^*| \sum_{k=1}^{d} \sum_{j=0}^{1} P_{L_i^*}(X_k = j) \log P_{L_i}(X_k = j) \\
&\overset{(c)}{=} \sum_{i=1}^{N} |L_i^*| \sum_{k=1}^{d} \left[ \frac{\sum_{\omega \in L_i^*} x_k(\omega)}{|L_i^*|} \log P_{L_i}(X_k = 1) + \frac{\sum_{\omega \in L_i^*} \bigl(1 - x_k(\omega)\bigr)}{|L_i^*|} \log P_{L_i}(X_k = 0) \right] \\
&\overset{(d)}{=} \sum_{i=1}^{N} \sum_{\omega \in L_i^*} \sum_{k=1}^{d} \log P_{L_i}\bigl(X_k = x_k(\omega)\bigr) \\
&\overset{(e)}{=} \sum_{i=1}^{N} \sum_{\omega \in L_i^*} \log P_{L_i}\bigl(X = x(\omega)\bigr) \\
&\overset{(f)}{\geq} \sum_{i=1}^{N} \sum_{\omega \in L_i} \log P_{L_i}\bigl(X = x(\omega)\bigr) \\
&\overset{(g)}{=} -\sum_{i=1}^{N} |L_i|\, H(L_i).
\end{aligned}
$$

Step (a) is based on the conditional independence of the Bernoulli random variables X_k; step (b) is deduced from the non-negativity of the KL distance between the distributions P_{L_i^*} and P_{L_i}; step (c) is due to our frequency estimates of the probabilities; step (d) follows from the observation that x_k(ω) ∈ {0, 1}, and from switching two summations; step (e) is based on the conditional independence; and step (f) comes from the definition of the new partition P*. Finally, the reasoning of step (g) is similar to the derivation from step (c) to step (e).
