Mixtures of Trees for Object Recognition

Sergey Ioffe    David Forsyth

Abstract

Efficient detection of objects in images is complicated by variations of object appearance due to intra-class object differences, articulation, lighting, occlusions, and aspect variations. To reduce the search required for detection, we employ the bottom-up approach where we find candidate image features and associate some of them with parts of the object model. We represent objects as collections of local features, and would like to allow any of them to be absent, with only a small subset sufficient for detection; furthermore, our model should allow efficient correspondence search. We propose a model, Mixture of Trees, that achieves these goals. With a mixture of trees, we can model the individual appearances of the features, relationships among them, and the aspect, and handle occlusions. Independences captured in the model make efficient inference possible. In our earlier work, we have shown that mixtures of trees can be used to model objects with a natural tree structure, in the context of human tracking. Now we show that a natural tree structure is not required, and use a mixture of trees for both frontal and view-invariant face detection. We also show that by modeling faces as collections of features we can establish an intrinsic coordinate frame for a face, and estimate the out-of-plane rotation of a face.

1. Introduction

One of the main difficulties in object recognition is being able to represent the variations in object appearance and to detect objects efficiently. Template-based approaches (e.g., to detect frontal views of faces [8, 11] and pedestrians [7]) are not general because they do not allow object parts to move with respect to each other. An alternative is to use a model that, instead of regarding an object as rigid, models local appearance of parts and the relationships among the parts. Such representations have been used extensively to represent people (e.g. [1, 3]) and have been applied to faces [10, 13, 14]. Detecting articulated objects requires a search of a very large configuration space, which, in the context of tracking, is often made possible by constraining the configuration of the object in one of the frames. However, if our object detector is to be entirely automatic, we need a method that allows us to explore the search space efficiently.

Among the ways to make the search efficient is the bottom-up approach, where the candidate object parts are first detected and then grouped into arrangements obeying the constraints imposed by the object model. Examples in face detection include [13] and [14], who model faces as flexible arrangements of local features. However, if many features are used to represent an object, and many candidate features of each type are found in the image, it is impractical to evaluate each feature arrangement, due to the overwhelming number of such arrangements. The correspondence search, where a part of the object model is associated with some of the candidate features, can be made more efficient by pruning arrangements of a few features before proceeding to bigger ones [5]. Alternatively, the model can be constrained to allow efficient search. One example of such a model is a tree, in which correspondence search can be performed efficiently with dynamic programming (e.g. [3, 4]).

Representing an object with a fixed number of features makes recognition vulnerable to occlusions, aspect variations, and failures of local feature detectors. Instead, we would like to model objects with a large number of features, only several of which may be enough for recognition. To avoid the combinatorial complexity of the correspondence search, we propose a novel model that uses a mixture of trees to represent the aspect (which features are present and which are not) as well as the relationships among the features; by capturing conditional independences among the features composing an object, mixtures of trees allow efficient inference using a Viterbi algorithm.

Some objects, such as human bodies, have a natural tree representation (with the torso as the root, for example), and we have shown [4] that mixtures of trees can be used to represent, detect and track such objects. However, our model is not limited to articulated objects, and, because we learn the tree structure automatically, can be used for objects without an intuitive tree representation. We illustrate this by applying our model to face detection. By using a large number of features, only a few of which are sufficient for detection, we can model the variations of appearance due to different individuals, facial expressions, lighting, and pose.

In section 2, we describe mixtures of trees, and show how to model faces with a mixture of trees in section 3. We use our model for frontal (section 4) and view-invariant (section 5) face detection. The feature arrangements representing faces carry implicit orientation information. We illustrate this in section 6, where we use the automatically extracted feature representation of faces to infer the angle of out-of-plane rotation.

2. Modeling with mixtures of trees

Let us suppose that an object is a collection of $K$ primitives $X_1, \ldots, X_K$, each of which can be treated as a vector representing its configuration (e.g., the position in the image).


Given an image, the local detectors will provide us with a finite set of possible configurations for each primitive $X_i$. These are candidate primitives; the objective is to build an assembly by choosing an element from each candidate set, so that the resulting set of primitives satisfies some global constraints.

The global constraints can be captured in a distribution $P(X_1, \ldots, X_K)$, which will be high when the assembly looks like the object of interest, and low when it doesn't. Assuming exactly one object present in the image, we can localize the object by finding the assembly maximizing the value of $P$. In general, this maximization requires a combinatorial correspondence search. However, if $P(X_1, \ldots, X_K)$ is represented with a tree, correspondence search is efficiently accomplished with a Viterbi algorithm. If there are $N$ candidate configurations for each of the $K$ primitives, then the search takes $O(KN^2)$ time, whereas for a general distribution $P$ the complexity would be $O(N^K)$.

2.1. Learning the tree model

In addition to making correspondence search efficient, the conditional independences captured in the tree model simplify learning, by reducing the number of parameters to be estimated, due to the factorized form of the distribution:

$P(X_1, \ldots, X_K) = P(X_{\mathrm{root}}) \prod_{i \neq \mathrm{root}} P(X_i \mid X_{\mathrm{pa}(i)})$,

where $X_{\mathrm{root}}$ is the node at the root of the tree, and $X_{\mathrm{pa}(i)}$ denotes the parent of the node $X_i$. Learning the model involves learning the structure (i.e., the tree edges) as well as the parameters of the prior $P(X_{\mathrm{root}})$ and conditionals $P(X_i \mid X_{\mathrm{pa}(i)})$. We learn the model by maximizing the log-likelihood of the training data, which can be shown to be equivalent to minimizing the entropy of the distribution, subject to the prior $P(X_{\mathrm{root}})$ and conditionals $P(X_i \mid X_{\mathrm{pa}(i)})$ being set to their MAP estimates. The entropy can be minimized efficiently [2, 12] by finding the minimum spanning tree in the directed graph whose edge weights are the appropriate conditional entropies.
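As an illustration of this structure-learning step (not the authors' code), the sketch below finds the entropy-minimizing tree by using the equivalence between minimizing $\sum_i H(X_i \mid X_{\mathrm{pa}(i)})$ and maximizing $\sum_i I(X_i; X_{\mathrm{pa}(i)})$, since the marginal entropies are fixed by the data; it assumes a precomputed $K \times K$ mutual-information matrix, and the names are illustrative.

```python
import numpy as np

def min_entropy_tree(mutual_info, root=0):
    """Maximum spanning tree over pairwise mutual information (Prim's
    algorithm), returned as a parent array with parent[root] == -1.

    Minimizing H(X_root) + sum_i H(X_i | X_pa(i)) is equivalent to
    maximizing sum_i I(X_i; X_pa(i)) over the tree edges (Chow & Liu [2]).
    """
    K = mutual_info.shape[0]
    parent = np.full(K, -1)
    in_tree = np.zeros(K, dtype=bool)
    in_tree[root] = True
    best_gain = mutual_info[root].copy()        # best link into the tree so far
    best_parent = np.full(K, root)
    for _ in range(K - 1):
        j = int(np.argmax(np.where(in_tree, -np.inf, best_gain)))
        in_tree[j] = True
        parent[j] = best_parent[j]
        better = mutual_info[j] > best_gain     # does j offer a better link?
        best_gain = np.where(better, mutual_info[j], best_gain)
        best_parent = np.where(better, j, best_parent)
    return parent
```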

2.2. Mixtures of trees

It is difficult to use a tree to model cases where some of the primitives constituting an object are missing, due to occlusions, variations in aspect, or failures of the local detectors. Mixtures of trees, introduced in [6], provide a solution. In particular, we can think of assemblies as being generated by a mixture model, whose class variable specifies what set $S$ of primitives will constitute an object, while conditional class distributions $P_S(\{X_i : i \in S\})$ generate the configurations of those primitives. The mixture distribution is

$P(\{X_i : i \in S\}) = \mu(S)\, P_S(\{X_i : i \in S\})$,

where $\mu(S)$ is the probability that a random view of an object consists of those primitives. This mixture has $2^K$ components, one for each possible subset $S$ of primitive types. Learning a mixture of trees involves estimating the mixture weights $\mu(S)$, as well as the structure and the model parameters for each of the component trees.

Figure 1: Using a generating tree to derive the structure for a mixture component. The dashed lines are the edges in the generating tree, which spans all of the nodes. The nodes of the mixture component are shaded, and its edges (shown as solid) are obtained by making a grandparent "adopt" a node if its parent is not present in this tree (i.e., is not shaded). Thus mixture components are encoded implicitly, which allows efficient representation, learning and inference for mixtures with a large number of components. The structure of the generating tree is learned by entropy minimization.

2.3. Mixtures of trees with shared structure

Explicitly representing $2^K$ mixture components is unacceptable if the number of object parts $K$ is large. Instead, we use a single generating tree which is used to generate the structures of all of the mixture components.

A generating tree is a directed tree $D$ whose nodes are $X_1, \ldots, X_K$, with $X_{\mathrm{root}}$ at the root. It provides the structure of the graphical model representing $\mu(S)$:

$\mu(S) = P([X_{\mathrm{root}}]) \prod_{i \neq \mathrm{root}} P([X_i] \mid [X_{\mathrm{pa}(i)}])$,

where $[X_i]$ denotes the event that $X_i$ is one of the primitives constituting a random view of the object, and the distributions are learned by counting occurrences of each primitive and pairs of primitives in the training data. For a subset $S$ of object part types, the mixture component $P_S$ contains all the edges $(X_j \to X_i)$ such that $X_j$ is an ancestor of $X_i$ in the generating tree, and none of the nodes on the path from $X_j$ to $X_i$ is in the set $\{X_l : l \in S\}$. This means that, if the parent of node $X_i$ is not present in a view of the object, then $X_i$ is "adopted" by its grandparent, or, if that one is absent as well, a great-grandparent, etc. If we assume that the root $X_{\mathrm{root}}$ is always a part of the object, then $P_S$ will be a tree, since $X_{\mathrm{root}}$ will ensure that the graphical model is connected. An example of obtaining the structure of a mixture component is shown in figure 1. We ensure connectivity by using a "dummy" feature as the root, representing the rough position of the assembly; candidate root features are added to test images at the nodes of a sparse grid.

The distribution $P_S$ is the product of the prior $P(X_{\mathrm{root}})$ and conditionals $P(X_i \mid X_j)$ corresponding to the edges of the tree representing $P_S$. We learn the structure of the generating tree $D$ that minimizes the entropy of the distribution. We are not aware of an efficient algorithm that produces the minimum; instead, we obtain a local minimum by iteratively applying entropy-reducing local changes (such as replacing a node's parent with another node) to $D$ until convergence.
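To make the "adoption" rule concrete, the following minimal sketch (illustrative only, with hypothetical names) derives the edges of a mixture component $P_S$ from the generating tree for a given set of visible parts, linking each visible node to its nearest visible ancestor.

```python
def component_edges(generating_parent, visible, root=0):
    """Edges of the mixture component P_S for a given set of visible parts.

    generating_parent[i] is the parent of node i in the generating tree
    (generating_parent[root] == -1).  Each visible non-root node is linked
    to its nearest visible ancestor: if its parent is absent it is
    "adopted" by the grandparent, then the great-grandparent, and so on.
    The root is assumed to be always present, so the result is a tree.
    """
    edges = []
    for i in sorted(visible):
        if i == root:
            continue
        a = generating_parent[i]
        while a != root and a not in visible:   # climb past absent ancestors
            a = generating_parent[a]
        edges.append((a, i))
    return edges

# Chain root -> 1 -> 2 -> 3 with part 2 absent: node 3 is adopted by node 1.
print(component_edges([-1, 0, 1, 2], visible={0, 1, 3}))  # [(0, 1), (1, 3)]
```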



Figure 2: Converting a mixture of trees into a graph with choice nodes, on which correspondence search is performed using dynamic programming. (a) A fragment of the generating tree for a person, containing the torso, right upper arm, and right lower arm. (b) The mixture of 4 trees that results if we require the torso to be always present. The triangle represents a choice node; only one of its children is selected, and the mixture weights are given by the model of the aspect. (c) The shared structure of the mixture components is captured using extra choice nodes. The empty nodes correspond to adding no extra segments; the mixture weight corresponding to each child of a choice node is derived from the prior for an aspect.

2.4. Grouping using mixtures of trees

To localize an object in an image, we find the assembly that maximizes the posterior $\Pr(\text{object} \mid \text{assembly})$ or, equivalently, the Bayes factor $P(\{X_i\}) / P_{\mathrm{bg}}(\{X_i\})$, where the numerator is the probability of a configuration in a random view of the object, and the denominator is the probability of seeing it in the background. We model the background as a Poisson process: $P_{\mathrm{bg}}(\{X_i : i \in S\}) = \prod_{i \in S} \lambda_i$, where $\lambda_i$ is the rate (or density) of the Poisson process according to which the primitives of type $X_i$ are distributed in the background. Because of the independent structure of $P_{\mathrm{bg}}$, the Bayes factor can be obtained by associating the term $\lambda_i^{-1}$ with each member of $X_i$'s candidate set, and multiplying those terms into the likelihood $P(\{X_i\})$.

We perform the correspondence search using a Viterbi algorithm on the tree $D$; at each node, we select not only the best primitives to choose from the children's candidate sets, but also the edges to be included in the tree (i.e., which parts constitute an object instance). This is equivalent to dynamic programming on a graph with choice nodes, illustrated in figure 2. This algorithm runs in $O(dKN^2)$ time, where $N$ is the number of primitives in each candidate set, $K$ is the number of object parts, and $d$ is the depth of the generating tree.
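The sketch below is a minimal illustration of the tree-structured correspondence search (a standard max-product/Viterbi pass), assuming precomputed log-scores for candidates and edges; it omits the choice-node construction that also selects which parts are present, and its names are not from the paper.

```python
import numpy as np

def best_assembly(parent, unary, pairwise, root=0):
    """Max-product (Viterbi) correspondence search on a tree-structured model.

    parent[i]        : index of part i's parent (parent[root] == -1)
    unary[i]         : (N_i,) array of log-scores for part i's candidates,
                       e.g. the appearance term plus the -log(lambda_i)
                       background weight entering the Bayes factor
    pairwise[(p, i)] : (N_p, N_i) array of log-scores for the edge p -> i,
                       e.g. the Gaussian over relative displacement
    Returns {part: chosen candidate index}.  Runs in O(K N^2) for K parts
    and N candidates per part; the choice-node construction that also
    selects which parts are present adds a factor of the tree depth.
    """
    K = len(parent)
    children = [[] for _ in range(K)]
    for i, p in enumerate(parent):
        if p >= 0:
            children[p].append(i)

    order = [root]                    # parents before children
    for i in order:
        order.extend(children[i])

    msg, back = {}, {}
    for i in reversed(order):         # upward (leaf-to-root) pass
        score = unary[i].copy()
        for c in children[i]:
            score = score + msg[c]
        if i != root:
            table = pairwise[(parent[i], i)] + score[None, :]
            msg[i] = table.max(axis=1)     # best candidate of i per parent candidate
            back[i] = table.argmax(axis=1)

    choice = {root: int(score.argmax())}   # after the loop, `score` belongs to the root
    for i in order[1:]:                    # downward pass follows back-pointers
        choice[i] = int(back[i][choice[parent[i]]])
    return choice
```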

3. Learning the model of a face

Representing a face as an assembly of local features allows us to model both the relative rigidity of the facial features and the flexibility of their arrangements. Other approaches modeling faces with local feature arrangements (e.g. [13]) usually rely on a specific, small set of features, because they cannot handle missing features and lack an efficient grouping mechanism. With a mixture of trees, we can address these issues. Because of the efficient inference on mixtures of trees, we can use a very large number of features, but require only a few to declare a detection. Therefore, we can handle occlusions and multiple aspects, and use the same model to represent all orientations of a face. However, the orientation is not discarded; instead, it is implicitly encoded by the mixture of trees. We can recover the pose by examining the feature arrangement obtained for a face image, and using the types of the features constituting an arrangement, as well as their geometric relationships, to estimate the orientation.

Figure 3: Cluster centers found by grouping subimages extracted from training face images with the modified K-Means algorithm. The clustering procedure learns both the average grey-level appearance of each feature and its average warped position (according to which each cluster is positioned in the figure).

3.1. Facial features

Each facial feature is represented as a small image patch; an assembly is a group of features satisfying some constraints imposed by the geometry of a face. For each feature type, the candidate features are image patches whose appearance is sufficiently similar to the "canonical" appearance of that feature.

To reduce the time it takes to find candidate features, each image, both training and test, is represented as a collection of small square image patches centered at interest points (found with the Harris operator, e.g. [9]). Local contrast normalization is applied to counter the variations in brightness and contrast due to different lighting.

Instead of manually specifying what features compose a face, we learn a set of features that are stable (i.e., present in a large number of face images, at roughly the same place relative to the face), distinct from other features, and distinct from the background. First, we cluster the image patches in training images using the K-Means algorithm, which we modified so that it learns not only the appearances but also the warped positions of the cluster centers; the warp for each training image is computed as an affine transformation that maps 3 "landmark points" on a face image to their canonical positions. The similarity between a patch and a cluster center is computed as the Euclidean distance between their pixel representations, subject to proximity between their warped positions. By clustering image patches, we convert each training image to an assembly; these assemblies are now used to learn the face model. Figure 3 shows the cluster centers obtained with our algorithm.
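A minimal sketch of the modified clustering step is given below; it is illustrative only, assuming precomputed contrast-normalized patch vectors and warped patch positions, and the relative weighting of appearance and position is a placeholder rather than the paper's choice.

```python
import numpy as np

def cluster_patches(appearance, position, n_clusters, pos_weight=1.0,
                    n_iter=20, seed=0):
    """K-Means over patches that learns an appearance and a warped position
    for every cluster center.

    appearance : (M, D) contrast-normalized patch vectors
    position   : (M, 2) patch positions after warping each image to the
                 canonical frame (via the 3-landmark affine warp)
    pos_weight : relative weight of the position term in the distance
                 (a placeholder; the paper only requires proximity)
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(appearance), n_clusters, replace=False)
    app_c, pos_c = appearance[idx].copy(), position[idx].copy()
    for _ in range(n_iter):
        dist = ((appearance[:, None, :] - app_c[None]) ** 2).sum(-1) \
             + pos_weight * ((position[:, None, :] - pos_c[None]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for k in range(n_clusters):
            members = assign == k
            if members.any():                 # update both halves of the center
                app_c[k] = appearance[members].mean(axis=0)
                pos_c[k] = position[members].mean(axis=0)
    return app_c, pos_c, assign
```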

Because pixels within a patch are not independent, we represent each image patch with its projections onto several (e.g., 15) dominant independent components (found by applying PCA to the sub-images of face images). We can capture dependences among these projections, conditional on the feature type, and still maintain the linear complexity of the model with a tree-structured model. All conditionals in this model are Gaussian, and the tree structure is learned by maximizing mutual information [2].
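The following sketch illustrates one way the projection step could be implemented with PCA on the training sub-images; the choice of 15 components follows the text, but the function and its name are illustrative rather than the authors' implementation.

```python
import numpy as np

def patch_projections(patches, n_components=15):
    """Represent each patch by its projections onto the dominant components
    of the training face sub-images.

    patches : (M, D) flattened, contrast-normalized patches
    Returns (M, n_components) coefficients plus the mean and basis used,
    so that test patches can be projected the same way.
    """
    mean = patches.mean(axis=0)
    centered = patches - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)   # PCA via SVD
    basis = vt[:n_components]
    return centered @ basis.T, (mean, basis)
```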

3.2. Modeling feature arrangements

To model faces with a mixture of trees, we need to learn the pairwise relationships first, and then use entropy minimization to obtain the structure of the generating tree. Conditional probability tables for feature visibility are learned by simple counting. The distributions of the relative positions have the form $P(X_i \mid X_j) = P(x_i, y_i \mid x_j, y_j) = P(x_i - x_j,\, y_i - y_j)$, where $(x_i - x_j,\, y_i - y_j)$ is the displacement between the two features. We use a Gaussian to represent $P(X_i \mid X_j)$.
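As an illustrative sketch (not the paper's code), the Gaussian over relative displacements for one tree edge could be fit and evaluated as follows, assuming training assemblies in which both features were observed.

```python
import numpy as np

def fit_displacement_gaussian(xy_i, xy_j):
    """Fit the Gaussian P(X_i | X_j) over the displacement between features
    i and j, from training assemblies in which both are present.

    xy_i, xy_j : (M, 2) feature positions in the training assemblies
    """
    d = xy_i - xy_j
    return d.mean(axis=0), np.cov(d, rowvar=False)

def log_displacement(xy_i, xy_j, mean, cov):
    """Log of the Gaussian displacement term for one candidate pair."""
    d = (xy_i - xy_j) - mean
    return -0.5 * (d @ np.linalg.inv(cov) @ d
                   + np.log(np.linalg.det(2.0 * np.pi * cov)))
```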

To be able to detect faces, we need to learn the model of the background as well as that of a face. We can incorporate the background model into the efficient inference mechanism if it obeys the same independences as those captured in the mixture of trees modeling a face. We choose the simplest such model, in which the features detected in the background are independent, and modeled with a Poisson process. The appearance of image patches in non-face images is modeled with a mixture of distributions of the same type as used to model facial features. This allows us to more accurately model the background image patches that look similar to facial features. The probability density of generating an assembly $\{X_i : i \in S\}$ in the background becomes

$P_{\mathrm{bg}}(\{X_i : i \in S\}) = \prod_{i \in S} \lambda_i\, p_0(X_i)$,

where $\lambda_i$ is the rate of the Poisson process we assume to be generating the candidate features of type $i$.

To find face-like assemblies of image patches, we compute, for each feature type $i$ and each image patch $W$, the probability $p_i(W)$ of seeing this patch in a random view of feature $X_i$, and the probability $p_0(W)$ of seeing the patch in a random view of a non-face. In maximizing the Bayes factor, we associate an extra multiplicative weight $p_i(W) / (\lambda_i\, p_0(W))$ with each feature $i$ and patch $W$. In practice, we will make $W$ a candidate for feature $i$ only if the ratio $p_i(W) / p_0(W)$ is sufficiently large, a condition that does not hold for most patch/feature pairs.
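A minimal sketch of this candidate-filtering step follows; the likelihood functions are assumed to be given as callables, and the names and threshold are illustrative rather than taken from the paper.

```python
def candidate_sets(patches, log_p_feature, log_p_background, log_rate,
                   min_log_ratio=0.0):
    """Build the per-feature candidate sets from detected image patches.

    log_p_feature[i](w) : log p_i(w), patch likelihood under feature i
    log_p_background(w) : log p_0(w), patch likelihood under the non-face model
    log_rate[i]         : log lambda_i, the background Poisson rate for feature i

    A patch becomes a candidate for feature i only when
    log p_i(w) - log p_0(w) exceeds a threshold; the stored weight
    log p_i(w) - log lambda_i - log p_0(w) is the extra term multiplied
    into the Bayes factor during correspondence search.
    """
    candidates = {i: [] for i in range(len(log_p_feature))}
    for w in patches:
        log_p0 = log_p_background(w)
        for i, log_pi in enumerate(log_p_feature):
            log_ratio = log_pi(w) - log_p0
            if log_ratio > min_log_ratio:       # most patch/feature pairs fail this
                candidates[i].append((w, log_ratio - log_rate[i]))
    return candidates
```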

4. Detecting frontal faces

We have used mixtures of trees to learn the model of frontal faces. The training data was kindly provided by Henry Schneiderman. The training background images are chosen at random from the Corel image database.

We tested our face finder on images from the MIT and CMU face databases: 117 photographs with 511 frontal faces. Faces were detected at a range of scales, spaced by a constant factor. If two assemblies' bounding boxes overlap by more than a small amount, we remove the one with the smaller posterior (see the sketch below). Table 1 shows the performance of our detector for some values of the threshold $\tau$ for the Bayes factor.

Table 1: Frontal face detection results. The database contained 117 images, with a total of 511 faces. We show the fraction of faces correctly detected, and the number of non-faces mistakenly detected, for different values of the threshold $\tau$ with which the posterior is compared.

    Threshold $\tau$   -10    -3     0     3     5    10
    Detection          90%   80%   76%   69%   66%   56%
    False alarms      1364   279   129    50    22     1

Figure 4 shows some examples of faces we detected in test images.

Figure 4: Examples of faces detected in the test data. The boxes indicate the bounding boxes of the feature assemblies representing the faces, and the dot shows the position of the root node in the mixture of trees, which indicates the "general position" of a face.

In figure 5, we show an example of our face detector applied to a large group photo (with $\tau$ set very low; this does not result in many false detections, since most of those are suppressed by the real faces).
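The overlap suppression mentioned above could be implemented as in the sketch below; the paper only states that the lower-posterior assembly is removed when bounding boxes overlap, so the intersection-over-union criterion and threshold here are assumptions.

```python
def suppress_overlaps(detections, max_overlap=0.2):
    """Keep only the higher-posterior detection when bounding boxes overlap.

    detections : list of (posterior, (x0, y0, x1, y1)) tuples
    Boxes whose intersection-over-union with an already-kept, stronger
    detection exceeds max_overlap are discarded.
    """
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / float(area(a) + area(b) - inter)

    kept = []
    for post, box in sorted(detections, reverse=True):   # strongest first
        if all(iou(box, b) <= max_overlap for _, b in kept):
            kept.append((post, box))
    return kept
```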


Figure 5: Faces found in a large group photo. The threshold $\tau$ is set very low; most false detections are suppressed by the correctly detected faces. Out of 93 faces in the image, 79 were correctly detected, 14 were missed, and 5 false alarms occurred. Most of the missed faces were not found because they were smaller than the size of the training faces.


Figure 6: Examples of the view-invariant face detector applied to faces and backgrounds. (a) Examples of face assemblies detected correctly. The squares show the features in the assembly, the circle corresponds to the root node of the mixture of trees, and the edges show the edges of the mixture component corresponding to the assembly. (b) False detections in background images. (c) An example of a missed detection. Even though a feature assembly is found corresponding to a face, its posterior is too low to declare a face detection.

5. View-invariant face detection

Out-of-plane face rotations suggest that a model that is able to represent aspect would perform well. Our data was kindly provided by M. Weber et al., authors of [13]. It contained faces of 22 subjects, photographed against a uniform background (which we synthetically replaced with random images from the Corel database) at 9 different angles spanning the entire range between the frontal view and the profile; 18 to 36 pictures of each person, for different poses and facial expressions, were included. We randomly chose 14 individuals and placed all of their images into the test set, using the photographs of the remaining 8 subjects for training.

Each face image was rescaled to be between 40 and 55 pixels in height, and each non-face image was an image taken from the Corel database. For each image, we decide whether or not it contains a face by comparing the posterior of the highest Bayes factor feature arrangement with a threshold. The error rate ranged between 4% and 8%. Our performance is better than the roughly 15% error rates reported in [13] for a single detector trained and tested on the entire range of rotations, which shows that a mixture of trees is able to represent the variations of the face appearance as it rotates. In figure 6, we show examples of correctly detected faces, as well as false detections reported in background images.

6. Pose estimation

Representing a face with a large number of features, allowing any features to be absent, and modeling the way in which the feature visibilities and configurations affect one another, allows our face-detection system to be view invariant. However, the orientation is not discarded, but is instead implicitly encoded in the model. By examining the feature arrangement found for a face image, we can estimate an intrinsic coordinate frame for the face. The types of the features constituting a face, as well as their geometric configuration, can be used to derive correspondences between different views of a face, or between an image and a 3D model of the head, and also carry implicit aspect information. For example, having found a feature corresponding to the left eye, we know that the view is not the right profile.

To determine the pose of a face, we learn, for each feature $X_i$ and each view direction $\theta$, the probability $\Pr([X_i] \mid \theta)$ that this feature is present in the view. Our dataset contains 9 different view directions, and the feature frequency information is captured in a $9 \times K$ table, where $K$ is the number of available features. The entries of the table are estimated from the training face assemblies.

Given an assembly $A = \{X_i : i \in S\}$, we can compute the probability that it has been generated by a particular view $\theta$: $\Pr(\theta \mid A) \propto \Pr(A \mid \theta)\,\Pr(\theta) \propto \prod_{i \in S} \Pr([X_i] \mid \theta)$ for a uniform prior $\Pr(\theta)$. As our estimate of the pose, we use the expected value $\hat{\theta} = \sum_\theta \theta\, \Pr(\theta \mid A)$. If $\theta^*$ is the correct view angle, the estimation error is given by $\hat{\theta} - \theta^*$, and the RMS error is $\sqrt{\frac{1}{T}\sum_{t=1}^{T} (\hat{\theta}_t - \theta^*_t)^2}$ for $T$ test images.
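A minimal sketch of this estimator is shown below (illustrative names, not the authors' code); it assumes the learned visibility table and the list of view angles are available.

```python
import numpy as np

def estimate_pose(present_features, visibility_table, angles):
    """Expected view angle given which feature types were found in a face.

    visibility_table[v, i] : Pr([X_i] | view v), the learned frequency of
                             feature i in view direction v (a 9 x K table)
    present_features       : indices of the features in the detected assembly
    angles                 : view angle associated with each table row
    """
    log_post = np.log(visibility_table[:, present_features]).sum(axis=1)
    post = np.exp(log_post - log_post.max())     # uniform prior; normalize stably
    post /= post.sum()
    return float(post @ np.asarray(angles, dtype=float))   # expected angle
```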

In our experiments, the RMS error was within one angle step: on average (and in fact for most test images), the estimated angle is within one step of the actual angle. This compares favorably with the much larger RMS error that would result from always reporting the average face angle.

7. Conclusions

Mixtures of trees allow us to represent objects as flexible collections of parts, where some parts can be missing, and we model the aspect, the geometric relationships among the parts, and the individual part appearances. Due to the conditional independences in the model, inference can be performed efficiently, in a bottom-up fashion, where candidate object parts are first detected and then grouped into arrangements using dynamic programming on the mixture of trees. In addition to being able to represent and track people, as we have shown in [4], mixtures of trees can model objects without an intuitive tree decomposition, and we have shown this by applying our model to frontal and view-invariant face detection.

Even though the results we have obtained for frontal face detection are slightly behind the state of the art, our model has the advantage that it can be used to do more than just detection. We can determine not only whether a face is present, but also the configuration of the face, i.e., which facial features are present and where they are with respect to each other. Our model can use a large number of features, with only a few needed for detection, and implicitly encodes the aspect; we have shown that we can recover the aspect information by examining the feature arrangements obtained for a face to estimate the out-of-plane rotation of a face. Future work includes using the feature representation of an object that mixtures of trees help us obtain for applications such as recognition of individuals, genders, or facial expressions (by comparing features of the same type in different faces, we do not have to rely on the two faces being in the same pose), matching features between two images of a face, or between an image and a 3D face model, and face tracking (we have already shown [4] that temporal coherence can be incorporated into our model).

Another important advantage of our model is that it is not tailored to a particular type of object: we have shown that it can be used for such diverse objects as human bodies and faces. One of the research directions is to use the model to represent other objects, or entire classes of objects. For example, just as we detect faces in an arbitrary view and then determine the pose, we could use a single mixture of trees to represent many kinds of animal, and use the resulting feature representation of an object in an image to determine what type of animal it is.

Acknowledgements

We thank Michael Jordan for suggesting mixtures of trees for object representation. This research was supported by an NSF Graduate Research Fellowship to SI and by the Digital Library grants IIS-9817353 and IIS-9979201.

References

[1] C. Bregler and J. Malik. Tracking people with twists and exponential maps. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 8-15, 1998.
[2] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462-467, 1968.
[3] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. In IEEE Conf. on Computer Vision and Pattern Recognition, 2000.
[4] S. Ioffe and D. Forsyth. Human tracking with mixtures of trees. In Int. Conf. on Computer Vision, 2001.
[5] S. Ioffe and D.A. Forsyth. Probabilistic methods for finding people. Int. J. Computer Vision, 2001.
[6] M. Meila and M.I. Jordan. Learning with mixtures of trees. Journal of Machine Learning Research, 1:1-48, 2000.
[7] M. Oren, C. Papageorgiou, P. Sinha, and E. Osuna. Pedestrian detection using wavelet templates. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 193-9, 1997.
[8] H.A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE T. Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.
[9] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. Int. J. Computer Vision, 37(2):151-72, 2000.
[10] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 746-51, 2000.
[11] K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection. PAMI, 20(1):39-51, 1998.
[12] R.E. Tarjan. Finding optimum branchings. Networks, 7(1):25-36, 1977.
[13] M. Weber, W. Einhauser, M. Welling, and P. Perona. Viewpoint-invariant learning and detection of human heads. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 20-7, 2000.
[14] L. Wiskott, J.-M. Fellous, N. Kuiger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. PAMI, 19(7):775-9, 1997.
