Mixtures of Trees for Object Recognition

Sergey Ioffe    David Forsyth

Abstract

Efficient detection of objects in images is complicated by variations of object appearance due to intra-class object differences, articulation, lighting, occlusions, and aspect variations. To reduce the search required for detection, we employ the bottom-up approach where we find candidate image features and associate some of them with parts of the object model. We represent objects as collections of local features, and would like to allow any of them to be absent, with only a small subset sufficient for detection; furthermore, our model should allow efficient correspondence search. We propose a model, Mixture of Trees, that achieves these goals. With a mixture of trees, we can model the individual appearances of the features, relationships among them, and the aspect, and handle occlusions. Independences captured in the model make efficient inference possible. In our earlier work, we have shown that mixtures of trees can be used to model objects with a natural tree structure, in the context of human tracking. Now we show that a natural tree structure is not required, and use a mixture of trees for both frontal and view-invariant face detection. We also show that by modeling faces as collections of features we can establish an intrinsic coordinate frame for a face, and estimate the out-of-plane rotation of a face.

1. Introduction

One of the main difficulties in object recognition is being able to represent the variations in object appearance and to detect objects efficiently. Template-based approaches (e.g., to detect frontal views of faces [8, 11] and pedestrians [7]) are not general because they do not allow object parts to move with respect to each other. An alternative is to use a model that, instead of regarding an object as rigid, models local appearance of parts and the relationships among the parts. Such representations have been used extensively to represent people (e.g. [1, 3]) and have been applied to faces [10, 13, 14]. Detecting articulated objects requires a search of a very large configuration space, which, in the context of tracking, is often made possible by constraining the configuration of the object in one of the frames. However, if our object detector is to be entirely automatic, we need a method that allows us to explore the search space efficiently.

Among the ways to make the search efficient is the bottom-up approach, where the candidate object parts are first detected and then grouped into arrangements obeying the constraints imposed by the object model. Examples in face detection include [13] and [14], who model faces as flexible arrangements of local features. However, if many features are used to represent an object, and many candidate features of each type are found in the image, it is impractical to evaluate each feature arrangement, due to the overwhelming number of such arrangements. The correspondence search, where a part of the object model is associated with some of the candidate features, can be made more efficient by pruning arrangements of a few features before proceeding to bigger ones [5]. Alternatively, the model can be constrained to allow efficient search. One example of such a model is a tree, in which correspondence search can be performed efficiently with dynamic programming (e.g. [3, 4]).

Representing an object with a fixed number of features makes recognition vulnerable to occlusions, aspect variations, and failures of local feature detectors. Instead, we would like to model objects with a large number of features, only several of which may be enough for recognition. To avoid the combinatorial complexity of the correspondence search, we propose a novel model that uses a mixture of trees to represent the aspect (which features are present and which are not) as well as the relationships among the features; by capturing conditional independences among the features composing an object, mixtures of trees allow efficient inference using a Viterbi algorithm.

Some objects, such as human bodies, have a natural tree representation (with the torso as the root, for example), and we have shown [4] that mixtures of trees can be used to represent, detect and track such objects. However, our model is not limited to articulated objects, and, because we learn the tree structure automatically, can be used for objects without an intuitive tree representation. We illustrate this by applying our model to face detection. By using a large number of features, only a few of which are sufficient for detection, we can model the variations of appearance due to different individuals, facial expressions, lighting, and pose.

In section 2, we describe mixtures of trees, and show how to model faces with a mixture of trees in section 3. We use our model for frontal (section 4) and view-invariant (section 5) face detection. The feature arrangements representing faces carry implicit orientation information. We illustrate this in section 6, where we use the automatically extracted feature representation of faces to infer the angle of out-of-plane rotation.

2. Modeling with mixtures of trees

Let us suppose that an object is a collection of $K$ primitives $X_1, \ldots, X_K$, each of which can be treated as a vector representing its configuration (e.g., the position in the image).


Given an image, the local detectors will provide us with a finite set of possible configurations for each primitive $X_i$. These are candidate primitives; the objective is to build an assembly by choosing an element from each candidate set, so that the resulting set of primitives satisfies some global constraints.

The global constraints can be captured in a distribution $P(X_1, \ldots, X_K)$, which will be high when the assembly looks like the object of interest, and low when it doesn't. Assuming exactly one object present in the image, we can localize the object by finding the assembly maximizing the value of $P$. In general, this maximization requires a combinatorial correspondence search. However, if $P(X_1, \ldots, X_K)$ is represented with a tree, correspondence search is efficiently accomplished with a Viterbi algorithm. If there are $N$ candidate configurations for each of the $K$ primitives, then the search takes $O(KN^2)$ time, whereas for a general distribution $P$ the complexity would be $O(N^K)$.

2.1. Learning the tree model

In addition to making correspondence search efficient, the conditional independences captured in the tree model simplify learning, by reducing the number of parameters to be estimated, due to the factorized form of the distribution:

$P(X_1, \ldots, X_K) = P(X_{\mathrm{root}}) \prod_{i \neq \mathrm{root}} P(X_i \mid X_{\mathrm{pa}(i)})$,

where $X_{\mathrm{root}}$ is the node at the root of the tree, and $X_{\mathrm{pa}(i)}$ denotes the parent of the node $X_i$. Learning the model involves learning the structure (i.e., the tree edges) as well as the parameters of the prior $P(X_{\mathrm{root}})$ and conditionals $P(X_i \mid X_{\mathrm{pa}(i)})$. We learn the model by maximizing the log-likelihood of the training data, which can be shown to be equivalent to minimizing the entropy of the distribution, subject to the prior $P(X_{\mathrm{root}})$ and conditionals $P(X_i \mid X_{\mathrm{pa}(i)})$ being set to their MAP estimates. The entropy can be minimized efficiently [2, 12] by finding the minimum spanning tree in the directed graph whose edge weights are the appropriate conditional entropies.
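As an illustration of this structure-learning step (not the authors' code), the sketch below finds the entropy-minimizing tree by using the equivalence between minimizing $\sum_i H(X_i \mid X_{\mathrm{pa}(i)})$ and maximizing $\sum_i I(X_i; X_{\mathrm{pa}(i)})$, since the marginal entropies are fixed by the data; it assumes a precomputed $K \times K$ mutual-information matrix, and the names are illustrative.

```python
import numpy as np

def min_entropy_tree(mutual_info, root=0):
    """Maximum spanning tree over pairwise mutual information (Prim's
    algorithm), returned as a parent array with parent[root] == -1.

    Minimizing H(X_root) + sum_i H(X_i | X_pa(i)) is equivalent to
    maximizing sum_i I(X_i; X_pa(i)) over the tree edges (Chow & Liu [2]).
    """
    K = mutual_info.shape[0]
    parent = np.full(K, -1)
    in_tree = np.zeros(K, dtype=bool)
    in_tree[root] = True
    best_gain = mutual_info[root].copy()        # best link into the tree so far
    best_parent = np.full(K, root)
    for _ in range(K - 1):
        j = int(np.argmax(np.where(in_tree, -np.inf, best_gain)))
        in_tree[j] = True
        parent[j] = best_parent[j]
        better = mutual_info[j] > best_gain     # does j offer a better link?
        best_gain = np.where(better, mutual_info[j], best_gain)
        best_parent = np.where(better, j, best_parent)
    return parent
```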

2.2. Mixtures of trees

It is difficult to use a tree to model cases where some of the primitives constituting an object are missing, due to occlusions, variations in aspect, or failures of the local detectors. Mixtures of trees, introduced in [6], provide a solution. In particular, we can think of assemblies as being generated by a mixture model, whose class variable specifies what set $S$ of primitives will constitute an object, while conditional class distributions $P_S(\{X_i : i \in S\})$ generate the configurations of those primitives. The mixture distribution is

$P(\{X_i : i \in S\}) = \mu(S)\, P_S(\{X_i : i \in S\})$,

where $\mu(S)$ is the probability that a random view of an object consists of those primitives. This mixture has $2^K$ components, one for each possible subset $S$ of primitive types. Learning a mixture of trees involves estimating the mixture weights $\mu(S)$, as well as the structure and the model parameters for each of the component trees.

Figure 1: Using a generating tree to derive the structure for a mixture component. The dashed lines are the edges in the generating tree, which spans all of the nodes. The nodes of the mixture component are shaded, and its edges (shown as solid) are obtained by making a grandparent "adopt" a node if its parent is not present in this tree (i.e., is not shaded). Thus mixture components are encoded implicitly, which allows efficient representation, learning and inference for mixtures with a large number of components. The structure of the generating tree is learned by entropy minimization.

2.3. Mixtures of trees with shared structure

Explicitly representing $2^K$ mixture components is unacceptable if the number of object parts $K$ is large. Instead, we use a single generating tree which is used to generate the structures of all of the mixture components.

A generating tree is a directed tree $D$ whose nodes are $X_1, \ldots, X_K$, with $X_{\mathrm{root}}$ at the root. It provides the structure of the graphical model representing $\mu(S)$:

$\mu(S) = P([X_{\mathrm{root}}]) \prod_{i \neq \mathrm{root}} P([X_i] \mid [X_{\mathrm{pa}(i)}])$,

where $[X_i]$ denotes the event that $X_i$ is one of the primitives constituting a random view of the object, and the distributions are learned by counting occurrences of each primitive and pairs of primitives in the training data. For a subset $S$ of object part types, the mixture component $P_S$ contains all the edges $(X_j \to X_i)$ such that $X_j$ is an ancestor of $X_i$ in the generating tree, and none of the nodes on the path from $X_j$ to $X_i$ is in the set $\{X_l : l \in S\}$. This means that, if the parent of node $X_i$ is not present in a view of the object, then $X_i$ is "adopted" by its grandparent, or, if that one is absent as well, a great-grandparent, etc. If we assume that the root $X_{\mathrm{root}}$ is always a part of the object, then $P_S$ will be a tree, since $X_{\mathrm{root}}$ will ensure that the graphical model is connected. An example of obtaining the structure of a mixture component is shown in figure 1. We ensure connectivity by using a "dummy" feature as the root, representing the rough position of the assembly; candidate root features are added to test images at the nodes of a sparse grid.

The distribution $P_S$ is the product of the prior $P(X_{\mathrm{root}})$ and conditionals $P(X_i \mid X_j)$ corresponding to the edges of the tree representing $P_S$. We learn the structure of the generating tree $D$ that minimizes the entropy of the distribution. We are not aware of an efficient algorithm that produces the minimum; instead, we obtain a local minimum by iteratively applying entropy-reducing local changes (such as replacing a node's parent with another node) to $D$ until convergence.
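To make the "adoption" rule concrete, the following minimal sketch (illustrative only, with hypothetical names) derives the edges of a mixture component $P_S$ from the generating tree for a given set of visible parts, linking each visible node to its nearest visible ancestor.

```python
def component_edges(generating_parent, visible, root=0):
    """Edges of the mixture component P_S for a given set of visible parts.

    generating_parent[i] is the parent of node i in the generating tree
    (generating_parent[root] == -1).  Each visible non-root node is linked
    to its nearest visible ancestor: if its parent is absent it is
    "adopted" by the grandparent, then the great-grandparent, and so on.
    The root is assumed to be always present, so the result is a tree.
    """
    edges = []
    for i in sorted(visible):
        if i == root:
            continue
        a = generating_parent[i]
        while a != root and a not in visible:   # climb past absent ancestors
            a = generating_parent[a]
        edges.append((a, i))
    return edges

# Chain root -> 1 -> 2 -> 3 with part 2 absent: node 3 is adopted by node 1.
print(component_edges([-1, 0, 1, 2], visible={0, 1, 3}))  # [(0, 1), (1, 3)]
```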



Figure 2: Converting a mixture of trees into a graph with choice nodes, on which correspondence search is performed using dynamic programming. (a) A fragment of the generating tree for a person, containing the torso, right upper arm, and right lower arm. (b) The mixture of 4 trees that results if we require the torso to be always present. The triangle represents a choice node; only one of its children is selected, and the mixture weights are given by the model of the aspect. (c) The shared structure of the mixture components is captured using extra choice nodes. The empty nodes correspond to adding no extra segments; the mixture weight corresponding to each child of a choice node is derived from the prior for an aspect.

2.4. Grouping using mixtures of trees

To localize an object in an image, we find the assembly that maximizes the posterior $\Pr(\text{object} \mid \text{assembly})$ or, equivalently, the Bayes factor $P(\{X_i\}) / P_{\mathrm{bg}}(\{X_i\})$, where the numerator is the probability of a configuration in a random view of the object, and the denominator is the probability of seeing it in the background. We model the background as a Poisson process: $P_{\mathrm{bg}}(\{X_i : i \in S\}) = \prod_{i \in S} \lambda_i$, where $\lambda_i$ is the rate (or density) of the Poisson process according to which the primitives of type $X_i$ are distributed in the background. Because of the independent structure of $P_{\mathrm{bg}}$, the Bayes factor can be obtained by associating the term $\lambda_i^{-1}$ with each member of $X_i$'s candidate set, and multiplying those terms into the likelihood $P(\{X_i\})$.

We perform the correspondence search using a Viterbi algorithm on the tree $D$; at each node, we select not only the best primitives to choose from the children's candidate sets, but also the edges to be included in the tree (i.e., which parts constitute an object instance). This is equivalent to dynamic programming on a graph with choice nodes, illustrated in figure 2. This algorithm runs in $O(dKN^2)$ time, where $N$ is the number of primitives in each candidate set, $K$ is the number of object parts, and $d$ is the depth of the generating tree.
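The sketch below is a minimal illustration of the tree-structured correspondence search (a standard max-product/Viterbi pass), assuming precomputed log-scores for candidates and edges; it omits the choice-node construction that also selects which parts are present, and its names are not from the paper.

```python
import numpy as np

def best_assembly(parent, unary, pairwise, root=0):
    """Max-product (Viterbi) correspondence search on a tree-structured model.

    parent[i]        : index of part i's parent (parent[root] == -1)
    unary[i]         : (N_i,) array of log-scores for part i's candidates,
                       e.g. the appearance term plus the -log(lambda_i)
                       background weight entering the Bayes factor
    pairwise[(p, i)] : (N_p, N_i) array of log-scores for the edge p -> i,
                       e.g. the Gaussian over relative displacement
    Returns {part: chosen candidate index}.  Runs in O(K N^2) for K parts
    and N candidates per part; the choice-node construction that also
    selects which parts are present adds a factor of the tree depth.
    """
    K = len(parent)
    children = [[] for _ in range(K)]
    for i, p in enumerate(parent):
        if p >= 0:
            children[p].append(i)

    order = [root]                    # parents before children
    for i in order:
        order.extend(children[i])

    msg, back = {}, {}
    for i in reversed(order):         # upward (leaf-to-root) pass
        score = unary[i].copy()
        for c in children[i]:
            score = score + msg[c]
        if i != root:
            table = pairwise[(parent[i], i)] + score[None, :]
            msg[i] = table.max(axis=1)     # best candidate of i per parent candidate
            back[i] = table.argmax(axis=1)

    choice = {root: int(score.argmax())}   # after the loop, `score` belongs to the root
    for i in order[1:]:                    # downward pass follows back-pointers
        choice[i] = int(back[i][choice[parent[i]]])
    return choice
```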

3. Learning the model of a face

Representing a face as an assembly of local features allows us to model both the relative rigidity of the facial features and the flexibility of their arrangements. Other approaches modeling faces with local feature arrangements (e.g. [13]) usually rely on a specific, small set of features, because they cannot handle missing features and lack an efficient grouping mechanism. With a mixture of trees, we can address these issues. Because of the efficient inference on mixtures of trees, we can use a very large number of features, but require only a few to declare a detection. Therefore, we can handle occlusions and multiple aspects, and use the same model to represent all orientations of a face. However, the orientation is not discarded; instead, it is implicitly encoded by the mixture of trees. We can recover the pose by examining the feature arrangement obtained for a face image, and using the types of the features constituting an arrangement, as well as their geometric relationships, to estimate the orientation.

Figure 3: Cluster centers found by grouping subimages extracted from training face images with the modified K-Means algorithm. The clustering procedure learns both the average grey-level appearance of each feature and its average warped position (according to which each cluster is positioned in the figure).

3.1. Facial features

Each facial feature is represented as a small image patch; an assembly is a group of features satisfying some constraints imposed by the geometry of a face. For each feature type, the candidate features are image patches whose appearance is sufficiently similar to the "canonical" appearance of that feature.

To reduce the time it takes to find candidate features, each image, both training and test, is represented as a collection of small square image patches centered at interest points (found with the Harris operator, e.g. [9]). Local contrast normalization is applied to counter the variations in brightness and contrast due to different lighting.

Instead of manually specifying what features compose a face, we learn a set of features that are stable (i.e., present in a large number of face images, at roughly the same place relative to the face), distinct from other features, and distinct from the background. First, we cluster the image patches in training images using the K-Means algorithm, which we modified so that it learns not only the appearances but also the warped positions of the cluster centers; the warp for each training image is computed as an affine transformation that maps 3 "landmark points" on a face image to their canonical positions. The similarity between a patch and a cluster center is computed as the Euclidean distance between their pixel representations, subject to proximity between their warped positions. By clustering image patches, we convert each training image to an assembly; these assemblies are now used to learn the face model. Figure 3 shows the cluster centers obtained with our algorithm.
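A minimal sketch of the modified clustering step is given below; it is illustrative only, assuming precomputed contrast-normalized patch vectors and warped patch positions, and the relative weighting of appearance and position is a placeholder rather than the paper's choice.

```python
import numpy as np

def cluster_patches(appearance, position, n_clusters, pos_weight=1.0,
                    n_iter=20, seed=0):
    """K-Means over patches that learns an appearance and a warped position
    for every cluster center.

    appearance : (M, D) contrast-normalized patch vectors
    position   : (M, 2) patch positions after warping each image to the
                 canonical frame (via the 3-landmark affine warp)
    pos_weight : relative weight of the position term in the distance
                 (a placeholder; the paper only requires proximity)
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(appearance), n_clusters, replace=False)
    app_c, pos_c = appearance[idx].copy(), position[idx].copy()
    for _ in range(n_iter):
        dist = ((appearance[:, None, :] - app_c[None]) ** 2).sum(-1) \
             + pos_weight * ((position[:, None, :] - pos_c[None]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for k in range(n_clusters):
            members = assign == k
            if members.any():                 # update both halves of the center
                app_c[k] = appearance[members].mean(axis=0)
                pos_c[k] = position[members].mean(axis=0)
    return app_c, pos_c, assign
```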

Because pixels within a patch are not independent, we represent each image patch with its projections onto several (e.g., 15) dominant independent components (found by applying PCA to the sub-images of face images). We can capture dependences among these projections, conditional on the feature type, and still maintain the linear complexity of the model with a tree-structured model. All conditionals in this model are Gaussian, and the tree structure is learned by maximizing mutual information [2].
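The following sketch illustrates one way the projection step could be implemented with PCA on the training sub-images; the choice of 15 components follows the text, but the function and its name are illustrative rather than the authors' implementation.

```python
import numpy as np

def patch_projections(patches, n_components=15):
    """Represent each patch by its projections onto the dominant components
    of the training face sub-images.

    patches : (M, D) flattened, contrast-normalized patches
    Returns (M, n_components) coefficients plus the mean and basis used,
    so that test patches can be projected the same way.
    """
    mean = patches.mean(axis=0)
    centered = patches - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)   # PCA via SVD
    basis = vt[:n_components]
    return centered @ basis.T, (mean, basis)
```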

3.2. Modeling feature arrangements

To model faces with a mixture of trees, we need to learn the pairwise relationships first, and then use entropy minimization to obtain the structure of the generating tree. Conditional probability tables for feature visibility are learned by simple counting. The distributions of the relative positions have the form $P(X_i \mid X_j) = P(x_i, y_i \mid x_j, y_j) = P(x_i - x_j,\, y_i - y_j)$, where $(x_i - x_j,\, y_i - y_j)$ is the displacement between the two features. We use a Gaussian to represent $P(X_i \mid X_j)$.
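As an illustrative sketch (not the paper's code), the Gaussian over relative displacements for one tree edge could be fit and evaluated as follows, assuming training assemblies in which both features were observed.

```python
import numpy as np

def fit_displacement_gaussian(xy_i, xy_j):
    """Fit the Gaussian P(X_i | X_j) over the displacement between features
    i and j, from training assemblies in which both are present.

    xy_i, xy_j : (M, 2) feature positions in the training assemblies
    """
    d = xy_i - xy_j
    return d.mean(axis=0), np.cov(d, rowvar=False)

def log_displacement(xy_i, xy_j, mean, cov):
    """Log of the Gaussian displacement term for one candidate pair."""
    d = (xy_i - xy_j) - mean
    return -0.5 * (d @ np.linalg.inv(cov) @ d
                   + np.log(np.linalg.det(2.0 * np.pi * cov)))
```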

To be able to detect faces, we need to learn the model of the background as well as that of a face. We can incorporate the background model into the efficient inference mechanism if it obeys the same independences as those captured in the mixture of trees modeling a face. We choose the simplest such model, in which the features detected in the background are independent, and modeled with a Poisson process. The appearance of image patches in non-face images is modeled with a mixture of distributions of the same type as used to model facial features. This allows us to more accurately model the background image patches that look similar to facial features. The probability density of generating an assembly $\{X_i : i \in S\}$ in the background becomes

$P_{\mathrm{bg}}(\{X_i : i \in S\}) = \prod_{i \in S} \lambda_i\, p_0(X_i)$,

where $\lambda_i$ is the rate of the Poisson process we assume to be generating the candidate features of type $i$.

To find face-like assemblies of image patches, we compute, for each feature type $i$ and each image patch $W$, the probability $p_i(W)$ of seeing this patch in a random view of feature $X_i$, and the probability $p_0(W)$ of seeing the patch in a random view of a non-face. In maximizing the Bayes factor, we associate an extra multiplicative weight $p_i(W) / (\lambda_i\, p_0(W))$ with each feature $i$ and patch $W$. In practice, we will make $W$ a candidate for feature $i$ only if the ratio $p_i(W) / p_0(W)$ is sufficiently large, a condition that does not hold for most patch/feature pairs.
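A minimal sketch of this candidate-filtering step follows; the likelihood functions are assumed to be given as callables, and the names and threshold are illustrative rather than taken from the paper.

```python
def candidate_sets(patches, log_p_feature, log_p_background, log_rate,
                   min_log_ratio=0.0):
    """Build the per-feature candidate sets from detected image patches.

    log_p_feature[i](w) : log p_i(w), patch likelihood under feature i
    log_p_background(w) : log p_0(w), patch likelihood under the non-face model
    log_rate[i]         : log lambda_i, the background Poisson rate for feature i

    A patch becomes a candidate for feature i only when
    log p_i(w) - log p_0(w) exceeds a threshold; the stored weight
    log p_i(w) - log lambda_i - log p_0(w) is the extra term multiplied
    into the Bayes factor during correspondence search.
    """
    candidates = {i: [] for i in range(len(log_p_feature))}
    for w in patches:
        log_p0 = log_p_background(w)
        for i, log_pi in enumerate(log_p_feature):
            log_ratio = log_pi(w) - log_p0
            if log_ratio > min_log_ratio:       # most patch/feature pairs fail this
                candidates[i].append((w, log_ratio - log_rate[i]))
    return candidates
```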

4. Detecting frontal faces

We have used mixtures of trees to learn the model of frontal faces. The training data was kindly provided by Henry Schneiderman. The training background images are chosen at random from the Corel image database.

We tested our face finder on images from the MIT and CMU face databases: 117 photographs with 511 frontal faces. Faces were detected at a range of scales, spaced by a constant factor. If two assemblies' bounding boxes overlap by more than a small amount, we remove the one with the smaller posterior (see the sketch below). Table 1 shows the performance of our detector for some values of the threshold $\tau$ for the Bayes factor.

Table 1: Frontal face detection results. The database contained 117 images, with a total of 511 faces. We show the fraction of faces correctly detected, and the number of non-faces mistakenly detected, for different values of the threshold $\tau$ with which the posterior is compared.

    Threshold $\tau$   -10    -3     0     3     5    10
    Detection          90%   80%   76%   69%   66%   56%
    False alarms      1364   279   129    50    22     1

Figure 4 shows some examples of faces we detected in test images.

Figure 4: Examples of faces detected in the test data. The boxes indicate the bounding boxes of the feature assemblies representing the faces, and the dot shows the position of the root node in the mixture of trees, which indicates the "general position" of a face.

In figure 5, we show an example of our face detector applied to a large group photo (with $\tau$ set very low; this does not result in many false detections, since most of those are suppressed by the real faces).
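The overlap suppression mentioned above could be implemented as in the sketch below; the paper only states that the lower-posterior assembly is removed when bounding boxes overlap, so the intersection-over-union criterion and threshold here are assumptions.

```python
def suppress_overlaps(detections, max_overlap=0.2):
    """Keep only the higher-posterior detection when bounding boxes overlap.

    detections : list of (posterior, (x0, y0, x1, y1)) tuples
    Boxes whose intersection-over-union with an already-kept, stronger
    detection exceeds max_overlap are discarded.
    """
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / float(area(a) + area(b) - inter)

    kept = []
    for post, box in sorted(detections, reverse=True):   # strongest first
        if all(iou(box, b) <= max_overlap for _, b in kept):
            kept.append((post, box))
    return kept
```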


Figure 5: Faces found in a large group photo. The threshold $\tau$ is set very low; most false detections are suppressed by the correctly detected faces. Out of 93 faces in the image, 79 were correctly detected, 14 were missed, and 5 false alarms occurred. Most of the missed faces were not found because they were smaller than the size of the training faces.


Figure 6: Examples of the view-invariant face detector applied to faces and backgrounds. (a) Examples of face assemblies detected correctly. The squares show the features in the assembly, the circle corresponds to the root node of the mixture of trees, and the edges show the edges of the mixture component corresponding to the assembly. (b) False detections in background images. (c) An example of a missed detection. Even though a feature assembly is found corresponding to a face, its posterior is too low to declare a face detection.

5. View-invariant face detection

Out-of-plane face rotations suggest that a model that is able to represent aspect would perform well. Our data was kindly provided by M. Weber et al., authors of [13]. It contained faces of 22 subjects, photographed against a uniform background (which we synthetically replaced with random images from the Corel database) at 9 different angles spanning the entire range between the frontal view and the profile; 18 to 36 pictures of each person, for different poses and facial expressions, were included. We randomly chose 14 individuals and placed all of their images into the test set, using the photographs of the remaining 8 subjects for training.

Each face image was rescaled to be between 40 and 55 pixels in height, and each non-face image was an image taken from the Corel database. For each image, we decide whether or not it contains a face by comparing the posterior of the highest Bayes factor feature arrangement with a threshold. The error rate ranged between 4% and 8%. Our performance is better than the roughly 15% error rates reported in [13] for a single detector trained and tested on the entire range of rotations, which shows that a mixture of trees is able to represent the variations of the face appearance as it rotates. In figure 6, we show examples of correctly detected faces, as well as false detections reported in background images.

6. Pose estimation

Representing a face with a large number of features, allowing any features to be absent, and modeling the way in which the feature visibilities and configurations affect one another, allows our face-detection system to be view invariant. However, the orientation is not discarded, but is instead implicitly encoded in the model. By examining the feature arrangement found for a face image, we can estimate an intrinsic coordinate frame for the face. The types of the features constituting a face, as well as their geometric configuration, can be used to derive correspondences between different views of a face, or between an image and a 3D model of the head, and also carry implicit aspect information. For example, having found a feature corresponding to the left eye, we know that the view is not the right profile.

To determine the pose of a face, we learn, for each feature $X_i$ and each view direction $\theta$, the probability $\Pr([X_i] \mid \theta)$ that this feature is present in the view. Our dataset contains 9 different view directions, and the feature frequency information is captured in a $9 \times K$ table, where $K$ is the number of available features. The entries of the table are estimated from the training face assemblies.

Given an assembly $A = \{X_i : i \in S\}$, we can compute the probability that it has been generated by a particular view $\theta$: $\Pr(\theta \mid A) \propto \Pr(A \mid \theta)\,\Pr(\theta) \propto \prod_{i \in S} \Pr([X_i] \mid \theta)$ for a uniform prior $\Pr(\theta)$. As our estimate of the pose, we use the expected value $\hat{\theta} = \sum_\theta \theta\, \Pr(\theta \mid A)$. If $\theta^*$ is the correct view angle, the estimation error is given by $\hat{\theta} - \theta^*$, and the RMS error is $\sqrt{\frac{1}{T}\sum_{t=1}^{T} (\hat{\theta}_t - \theta^*_t)^2}$ for $T$ test images.
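A minimal sketch of this estimator is shown below (illustrative names, not the authors' code); it assumes the learned visibility table and the list of view angles are available.

```python
import numpy as np

def estimate_pose(present_features, visibility_table, angles):
    """Expected view angle given which feature types were found in a face.

    visibility_table[v, i] : Pr([X_i] | view v), the learned frequency of
                             feature i in view direction v (a 9 x K table)
    present_features       : indices of the features in the detected assembly
    angles                 : view angle associated with each table row
    """
    log_post = np.log(visibility_table[:, present_features]).sum(axis=1)
    post = np.exp(log_post - log_post.max())     # uniform prior; normalize stably
    post /= post.sum()
    return float(post @ np.asarray(angles, dtype=float))   # expected angle
```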

In our experiments, the RMS error was within one angle step: on average (and in fact for most test images), the estimated angle is within one step of the actual angle. This compares favorably with the much larger RMS error that would result from always reporting the average face angle.

7. Conclusions

Mixtures of trees allow us to represent objects as flexible collections of parts, where some parts can be missing, and we model the aspect, the geometric relationships among the parts, and the individual part appearances. Due to the conditional independences in the model, inference can be performed efficiently, in a bottom-up fashion, where candidate object parts are first detected and then grouped into arrangements using dynamic programming on the mixture of trees. In addition to being able to represent and track people, as we have shown in [4], mixtures of trees can model objects without an intuitive tree decomposition, and we have shown this by applying our model to frontal and view-invariant face detection.

Even though the results we have obtained for frontal face detection are slightly behind the state of the art, our model has the advantage that it can be used to do more than just detection. We can determine not only whether a face is present, but also the configuration of the face, i.e., which facial features are present and where they are with respect to each other. Our model can use a large number of features, with only a few needed for detection, and implicitly encodes the aspect; we have shown that we can recover the aspect information by examining the feature arrangements obtained for a face to estimate the out-of-plane rotation of a face. Future work includes using the feature representation of an object that mixtures of trees help us obtain for applications such as recognition of individuals, genders, or facial expressions (by comparing features of the same type in different faces, we do not have to rely on the two faces being in the same pose), matching features between two images of a face, or between an image and a 3D face model, and face tracking (we have already shown [4] that temporal coherence can be incorporated into our model).

Another important advantage of our model is that it is not tailored to a particular type of object: we have shown that it can be used for such diverse objects as human bodies and faces. One of the research directions is to use the model to represent other objects, or entire classes of objects. For example, just as we detect faces in an arbitrary view and then determine the pose, we could use a single mixture of trees to represent many kinds of animal, and use the resulting feature representation of an object in an image to determine what type of animal it is.

Acknowledgements

We thank Michael Jordan for suggesting mixtures of trees for object representation. This research was supported by an NSF Graduate Research Fellowship to SI and by the Digital Library grants IIS-9817353 and IIS-9979201.

References

[1] C. Bregler and J. Malik. Tracking people with twists and exponential maps. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 8-15, 1998.
[2] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462-467, 1968.
[3] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. In IEEE Conf. on Computer Vision and Pattern Recognition, 2000.
[4] S. Ioffe and D. Forsyth. Human tracking with mixtures of trees. In Int. Conf. on Computer Vision, 2001.
[5] S. Ioffe and D.A. Forsyth. Probabilistic methods for finding people. Int. J. Computer Vision, 2001.
[6] M. Meila and M.I. Jordan. Learning with mixtures of trees. Journal of Machine Learning Research, 1:1-48, 2000.
[7] M. Oren, C. Papageorgiou, P. Sinha, and E. Osuna. Pedestrian detection using wavelet templates. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 193-9, 1997.
[8] H.A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE T. Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.
[9] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. Int. J. Computer Vision, 37(2):151-72, 2000.
[10] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 746-51, 2000.
[11] K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection. PAMI, 20(1):39-51, 1998.
[12] R.E. Tarjan. Finding optimum branchings. Networks, 7(1):25-36, 1977.
[13] M. Weber, W. Einhauser, M. Welling, and P. Perona. Viewpoint-invariant learning and detection of human heads. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 20-7, 2000.
[14] L. Wiskott, J.-M. Fellous, N. Kuiger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. PAMI, 19(7):775-9, 1997.
