
International Journal of Computer Vision 63(2), 113–140, 2005. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

Image Parsing: Unifying Segmentation, Detection, and Recognition

ZHUOWEN TU AND XIANGRONG CHEN
Departments of Statistics, University of California, Los Angeles, Los Angeles, CA 90095, USA

[email protected]

[email protected]

ALAN L. YUILLE
Departments of Statistics and Psychology, University of California, Los Angeles, Los Angeles, CA 90095, USA
[email protected]

SONG-CHUN ZHU
Departments of Statistics and Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
[email protected]

Received March 7, 2003; Accepted June 23, 2003

First online version published in February, 2005

Abstract. In this paper we present a Bayesian framework for parsing images into their constituent visual patterns. The parsing algorithm optimizes the posterior probability and outputs a scene representation as a "parsing graph", in a spirit similar to parsing sentences in speech and natural language. The algorithm constructs the parsing graph and re-configures it dynamically using a set of moves, which are mostly reversible Markov chain jumps. This computational framework integrates two popular inference approaches—generative (top-down) methods and discriminative (bottom-up) methods. The former formulates the posterior probability in terms of generative models for images defined by likelihood functions and priors. The latter computes discriminative probabilities based on a sequence (cascade) of bottom-up tests/filters. In our Markov chain algorithm design, the posterior probability, defined by the generative models, is the invariant (target) probability for the Markov chain, and the discriminative probabilities are used to construct proposal probabilities to drive the Markov chain. Intuitively, the bottom-up discriminative probabilities activate top-down generative models. In this paper, we focus on two types of visual patterns—generic visual patterns, such as texture and shading, and object patterns including human faces and text. These types of patterns compete and cooperate to explain the image, and so image parsing unifies image segmentation, object detection, and recognition (if we use generic visual patterns only, then image parsing corresponds to image segmentation (Tu and Zhu, 2002. IEEE Trans. PAMI, 24(5):657–673)). We illustrate our algorithm on natural images of complex city scenes and show examples where image segmentation can be improved by allowing object specific knowledge to disambiguate low-level segmentation cues, and conversely where object detection can be improved by using generic visual patterns to explain away shadows and occlusions.

Keywords: image parsing, image segmentation, object detection, object recognition, data driven Markov Chain Monte Carlo, AdaBoost


1. Introduction

1.1. Objectives of Image Parsing

We define image parsing to be the task of decomposing an image I into its constituent visual patterns. The output is represented by a hierarchical graph W—called the "parsing graph". The goal is to optimize the Bayesian posterior probability p(W | I). Figure 1 illustrates a typical example where a football scene is first divided into three parts at a coarse level: a person in the foreground, a sports field, and the spectators. These three parts are further decomposed into nine visual patterns in the second level: a face, three texture regions, some text, a point process (the band on the field), a curve process (the markings on the field), a color region, and a region for nearby people. In principle, we can continue decomposing these parts until we reach a resolution limit (e.g. there is not sufficient resolution to detect the blades of grass on the sports field). The parsing graph is similar in spirit to the parsing trees used in speech and natural language processing (Manning and Schutze, 2003) except that it can include horizontal connections (see the dashed curves in Fig. 1) for specifying spatial relationships and boundary sharing between different visual patterns.

Figure 1. Image parsing example. The parsing graph is hierarchical and combines generative models (downward arrows) with horizontal connections (dashed lines), which specify spatial relationships between the visual patterns. See Fig. 4 for a more abstract representation including variables for the node attributes.

As in natural language processing, the parsing graph is not fixed and depends on the input image(s). An image parsing algorithm must construct the parsing graph on the fly.¹ Our image parsing algorithm consists of a set of reversible Markov chain jumps (Green, 1995) with each type of jump corresponding to an operator for reconfiguring the parsing graph (i.e., creating or deleting nodes or changing the values of node attributes). These jumps combine to form an ergodic and reversible Markov chain in the space of possible parsing graphs. The Markov chain probability is guaranteed to converge to the invariant probability p(W | I) and the Markov chain will simulate fair samples from this probability.² Our approach is built on previous work on Data-Driven Markov Chain Monte Carlo (DDMCMC) for recognition (Zhu et al., 2000), segmentation (Tu and Zhu, 2002a), grouping (Tu and Zhu, 2002b) and graph partitioning (Barbu and Zhu, 2003, 2004).

Image parsing seeks a full generative explanation of the input image in terms of generative models, p(I | W) and p(W), for the diverse visual patterns which occur in natural images, see Fig. 1. This differs from standard approaches to computer vision tasks—such as segmentation, grouping, and recognition—which usually involve isolated vision modules which only explain different parts (or aspects) of the image.


Figure 2. Examples of image segmentation failure by an algorithm (Tu and Zhu, 2002a) which uses only generic visual patterns (i.e. only low-level visual cues). The results (b) show that low-level visual cues are not sufficient to obtain good intuitive segmentations. The limitations of using only generic visual patterns are also clear in the synthesized images (c), which are obtained by stochastic sampling from the generative models after the parameters have been estimated by DDMCMC. The right panels (d) show the segmentations obtained by human subjects who, by contrast to the algorithm, appear to use object specific knowledge when doing the segmentation (though they were not instructed to) (Martin et al., 2001). We conclude that achieving good segmentation on these types of images requires combining segmentation with object detection and recognition.

The image parsing approach enables these different modules to cooperate and compete to give a consistent interpretation of the entire image.

The integration of visual modules is of increasing importance as progress on the individual modules starts approaching performance ceilings. In particular, work on segmentation (Shi and Malik, 2000; Tu and Zhu, 2002; Fowlkes and Malik, 2004) and edge detection (Konishi and Coughlan, 2003; Bowyer et al., 2001) has reached performance levels where there seems little room for improvement when only low-level cues are used. For example, the segmentation failures in Fig. 2 can only be resolved by combining segmentation with object detection and recognition. Combining these cues is made easier because of recent successful work on the detection and recognition of objects (Lowe, 2003; Weber et al., 2000; Ponce et al., 2004; Belongie et al., 2002; Viola and Jones, 2001; Wu et al., 2004) and the classification of natural scenes (Barnard and Forsyth, 2001; Murphy et al., 2003) using, broadly speaking, discriminative methods based on local bottom-up tests.

But combining different visual modules requires a common framework which ensures consistency. Despite the effectiveness of discriminative methods for computing scene components, such as object labels and categories, they can also generate redundant and conflicting results. Mathematicians have argued (Blanchard and Geman, 2003) that discriminative methods must be followed by more sophisticated processes to (i) remove false alarms, (ii) amend missing objects by global context information, and (iii) reconcile conflicting (overlapping) explanations through model comparison. In this paper, we impose such processes by using generative models for the entire image.

As we will show, our image parsing algorithm is able to integrate discriminative and generative methods so as to take advantage of their complementary strengths. Moreover, we can couple modules such as segmentation and object detection by our choice of the set of visual patterns used to parse the image. In this paper, we focus on two types of patterns: generic visual patterns for low/middle level vision, such as texture and shading, and object patterns for high level vision, such as frontal human faces and text.

These two types of patterns illustrate different ways in which the parsing graph can be constructed (see Fig. 20 and the related discussion). The object patterns (face and text) have comparatively little variability, so they can often be effectively detected as a whole by bottom-up tests and their parts can be located subsequently. Thus their parsing sub-graphs can be constructed in a "decompositional" manner from whole to parts. By contrast, a generic texture region has arbitrary shape and its intensity pattern has high entropy. Detecting such a region by bottom-up tests would require an enormous number of tests to deal with all this variability, and so would be computationally impractical. Instead, the parsing subgraphs should be built by grouping small elements in a "compositional" manner (Bienenstock et al., 1997).


We illustrate our algorithm on natural images of complex city scenes and give examples where image segmentation can be improved by allowing object specific knowledge to disambiguate low-level cues, and conversely object detection can be improved by using generic visual patterns to explain away shadows and occlusions.

This paper is structured as follows. In Section 2, we give an overview of the image parsing framework and discuss its theoretical background. Then in Section 3, we describe the parsing graph and the generative models used for generic visual patterns, text, and faces. In Section 4 we give the control structure of the image parsing algorithm. Section 5 gives details of the components of the algorithm. Section 6 shows how we combine AdaBoost with other tests to get proposals for detecting objects including text and faces. In Section 7 we present experimental results. Section 8 addresses some open problems in further developing the image parser as a general inference engine. We summarize the paper in Section 9.

2. Overview of Image Parsing Framework

2.1. Bottom-Up and Top-Down Processing

A major element of our work is to integrate discriminative and generative methods for inference. In the recent computer vision literature, top-down and bottom-up procedures can be broadly categorized into two popular inference paradigms—generative methods for "top-down" and discriminative methods for "bottom-up", illustrated in Fig. 3. From this perspective, integrating generative and discriminative models is equivalent to combining bottom-up and top-down processing.³

The role of bottom-up and top-down processing in vision has often been discussed. There is growing experimental evidence (see Thorpe et al., 1996; Li et al., 2003) that humans can perform high level scene and object categorization tasks as fast as low level texture discrimination and other so-called pre-attentive vision tasks.

Figure 3. Comparison of two inference paradigms: top-down generative methods versus bottom-up discriminative methods. The generative method specifies how the image I can be synthesized from the scene representation W. By contrast, the discriminative methods are based on performing tests Tst_j(I) and are not guaranteed to yield consistent solutions; see crosses explained in the text.

This suggests that humans can detect both low and high level visual patterns at early stages of visual processing. It contrasts with traditional bottom-up feedforward architectures (Marr, 1982) which start with edge detection, followed by segmentation/grouping, before proceeding to object recognition and other high-level vision tasks. These experiments also relate to long-standing conjectures about the role of bottom-up/top-down loops in the visual cortical areas (Mumford, 1995; Ullman, 1995), visual routines and pathways (Ullman, 1984), the binding of visual cues (Treisman, 1986), and neural network models such as the Helmholtz machine (Dayan et al., 1995). But although combining bottom-up and top-down processing is clearly important, there has not yet been a rigorous mathematical framework for how to achieve it.

In this paper, we combine generative and discriminative approaches to design a DDMCMC algorithm which uses discriminative methods to perform rapid inference of the parameters of generative models. From a computer vision perspective, DDMCMC combines bottom-up processing, implemented by the discriminative models, together with top-down processing by the generative models. The rest of this section gives an overview of our approach.

2.2. Generative and Discriminative Methods

Generative methods specify how the image I is generated from the scene representation W ∈ Ω. They combine a prior p(W) and a likelihood function p(I | W) to give a joint posterior probability p(W | I). These can be expressed as probabilities on graphs, where the input image I is represented on the leaf nodes and W denotes the remaining nodes and node attributes of the graph.


The structure of the graph, and in particular the number of nodes, is unknown and must be estimated for each input image.

To perform inference using generative methods requires estimating W* = arg max p(W | I). This is often computationally demanding because there are usually no known efficient inference algorithms (certainly not for the class of p(W | I) studied in this paper).

In this paper, we will perform inference by stochastically sampling W from the posterior:

W ∼ p(W | I) ∝ p(I | W) p(W).  (1)

This enables us to estimate W* = arg max p(W | I). Stochastic sampling is attractive because it is a general technique that can be applied to any inference problem. Moreover, it generates samples that can be used to validate the model assumptions. But the dimension of the sample space Ω for image parsing is very high, and so standard sampling techniques are computationally expensive.

By contrast, discriminative methods are very fast to compute. They do not specify models for how the image is generated. Instead they give discriminative (conditional) probabilities q(w_j | Tst_j(I)) for components {w_j} of W based on a sequence of bottom-up tests Tst_j(I) performed on the image. The tests are based on local image features {F_{j,n}(I)} which can be computed from the image in a cascade manner (e.g. AdaBoost filters, see Section 6):

Tst_j(I) = (F_{j,1}(I), F_{j,2}(I), …, F_{j,n}(I)),  j = 1, 2, …, K.  (2)

The following theorem shows that the KL-divergence between the true marginal posterior p(w_j | I) and the optimal discriminative approximation q(w_j | Tst(I)) using test Tst(I) will decrease monotonically as new tests are added.⁴

Theorem 1. The information gained for a variable w by a new test Tst+(I) is the decrease of Kullback-Leibler divergence between p(w | I) and its best discriminative estimate q(w | Tst(I)), or the increase of mutual information between w and the tests:

E_I[KL(p(w | I) ‖ q(w | Tst(I)))] − E_I[KL(p(w | I) ‖ q(w | Tst(I), Tst+(I)))]
= MI(w; Tst, Tst+) − MI(w; Tst)
= E_{Tst,Tst+}[KL(q(w | Tst, Tst+) ‖ q(w | Tst))] ≥ 0,

where E_I is the expectation with respect to p(I), and E_{Tst,Tst+} is the expectation with respect to the probability on the test responses (Tst, Tst+) induced by p(I).

The decrease of the Kullback-Leibler divergence equals zero if and only if Tst(I) are sufficient statistics with respect to w.
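To make Theorem 1 concrete, the following toy computation (a hypothetical two-test example of our own, not from the paper) checks numerically that the expected KL divergence can only decrease as tests are added; the joint distribution and its weights are arbitrary assumptions.

```python
import itertools
import math

# Hypothetical joint distribution p(w, t1, t2) over binary variables,
# normalized below; the two tests are informative about w to different degrees.
joint = {}
for w, t1, t2 in itertools.product([0, 1], repeat=3):
    joint[(w, t1, t2)] = 1.0 + 0.8 * (w == t1) + 0.4 * (w == t2)
Z = sum(joint.values())
joint = {k: v / Z for k, v in joint.items()}

def cond_w(constraints):
    """p(w | tests), by marginalizing the toy joint under the given
    (test index, value) constraints; returns [p(w=0|...), p(w=1|...)]."""
    num = [0.0, 0.0]
    for (w, t1, t2), pr in joint.items():
        if all((t1, t2)[i] == v for i, v in constraints):
            num[w] += pr
    s = sum(num)
    return [n / s for n in num]

def expected_kl(test_subset):
    """E[KL(p(w | t1, t2) || q(w | subset of tests))], with (t1, t2)
    standing in for the full image information I."""
    kl = 0.0
    for (w, t1, t2), pr in joint.items():
        p_full = cond_w([(0, t1), (1, t2)])
        q_sub = cond_w([(i, (t1, t2)[i]) for i in test_subset])
        kl += pr * math.log(p_full[w] / q_sub[w])
    return kl

print(expected_kl([]))      # no tests: largest divergence
print(expected_kl([0]))     # test t1 only: smaller
print(expected_kl([0, 1]))  # both tests: zero (sufficient statistics here)
```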

In practice discriminative methods, particularly standard computer vision algorithms—see Section 4.1—will typically only use a small number of features for computational practicality. Also their discriminative probabilities q(w_j | Tst(I)) will often not be optimal. Fortunately the image parsing algorithm in this paper only requires the discriminative probabilities q(w_j | Tst(I)) to be rough approximations to p(w_j | I).

The difference between discriminative and generative models is illustrated in Fig. 3. Discriminative models are fast to compute and can be run in parallel because different components are computed independently (see arrows in Fig. 3). But the components {w_i} may not yield a consistent solution W and, moreover, W may not specify a consistent model for generating the observed image I. These inconsistencies are indicated by the crosses in Fig. 3. Generative models ensure consistency but require solving a difficult inference problem.

It is an open problem whether discriminative methods can be designed to infer the entire state W for the complicated generative models that we are dealing with. Recent work (Kumar and Hebert, 2003) is a step in this direction. But mathematicians (Blanchard and Geman, 2003) have argued that this will not be practical and that discriminative models will always require additional post-processing.

2.3. Markov Chain Kernels and Sub-Kernels

Formally, our DDMCMC image parsing algorithm simulates a Markov chain MC = ⟨Ω, ν, K⟩ with kernel K in space Ω and with probability ν for the starting state. An element W ∈ Ω is a parsing graph. We let the set of parsing graphs Ω be finite, as images have a finite number of pixels and grey levels.

We proceed by defining a set of moves for reconfiguring the graph. These include moves to: (i) create nodes, (ii) delete nodes, and (iii) change node attributes. We specify stochastic dynamics for these moves in terms of transition kernels.⁵


For each move we define a Markov chain sub-kernel by a transition matrix K_a(W′ | W : I) with a ∈ A being an index. This represents the probability that the system makes a transition from state W to state W′ when sub-kernel a is applied (i.e. Σ_{W′} K_a(W′ | W : I) = 1, ∀W). Kernels which alter the graph structure are grouped into reversible pairs. For example, the sub-kernel for node creation K_{a,r}(W′ | W : I) is paired with the sub-kernel for node deletion K_{a,l}(W′ | W : I). These can be combined into a paired sub-kernel K_a = ρ_{ar} K_{a,r}(W′ | W : I) + ρ_{al} K_{a,l}(W′ | W : I) (with ρ_{ar} + ρ_{al} = 1). This pairing ensures that K_a(W′ | W : I) = 0 if, and only if, K_a(W | W′ : I) = 0 for all states W, W′ ∈ Ω. The sub-kernels (after pairing) are constructed to obey the detailed balance condition:

p(W | I) K_a(W′ | W : I) = p(W′ | I) K_a(W | W′ : I).  (3)

The full transition kernel is expressed as:

K(W′ | W : I) = Σ_a ρ(a : I) K_a(W′ | W : I),  Σ_a ρ(a : I) = 1, ρ(a : I) > 0.  (4)

To implement this kernel, at each time step the algorithm selects the choice of move with probability ρ(a : I) for move a, and then uses kernel K_a(W′ | W : I) to select the transition from state W to state W′. Note that both probabilities ρ(a : I) and K_a(W′ | W : I) depend on the input image I. This distinguishes our DDMCMC methods from conventional MCMC computing (Liu, 2001; Bremaud, 1999).

The full kernel obeys detailed balance, Eq. (3), because all the sub-kernels do. It will also be ergodic, provided the set of moves is sufficient (i.e. so that we can transition between any two states W, W′ ∈ Ω using these moves). These two conditions ensure that p(W | I) is the invariant (target) probability of the Markov chain (Bremaud, 1999) in the finite space Ω.

Applying the kernel K_{a(t)} updates the Markov chain state probability µ_t(W) at step t to µ_{t+1}(W′) at t + 1⁶:

µ_{t+1}(W′) = Σ_W K_{a(t)}(W′ | W : I) µ_t(W).  (5)
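The sketch below illustrates Eqs. (3)–(5) on a hypothetical three-state space (the real space of parse graphs is vastly larger): Metropolis-Hastings sub-kernels built for a common target p are mixed as in Eq. (4), and iterating the update of Eq. (5) drives the state distribution to p. All numbers are illustrative assumptions.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])              # target posterior p(W | I), 3 states

def mh_kernel(Q):
    """Metropolis-Hastings sub-kernel for target p from a proposal matrix
    Q[from, to]; returned kernel is column-stochastic: K[to, from]."""
    n = len(p)
    K = np.zeros((n, n))
    for W in range(n):
        for Wp in range(n):
            if Wp == W:
                continue
            accept = min(1.0, (p[Wp] * Q[Wp, W]) / (p[W] * Q[W, Wp]))  # Eq. (9)
            K[Wp, W] = Q[W, Wp] * accept                               # Eq. (8)
        K[W, W] = 1.0 - K[:, W].sum()       # rejected mass stays at W
    return K

# two sub-kernels from different proposals, mixed as in Eq. (4)
Q1 = np.full((3, 3), 1.0 / 3.0)
Q2 = np.array([[0.2, 0.4, 0.4], [0.4, 0.2, 0.4], [0.4, 0.4, 0.2]])
K = 0.6 * mh_kernel(Q1) + 0.4 * mh_kernel(Q2)

mu = np.array([1.0, 0.0, 0.0])              # nu: all mass on one starting state
for _ in range(100):
    mu = K @ mu                             # Eq. (5)
print(mu)                                   # approaches p = [0.5, 0.3, 0.2]
```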

In summary, the DDMCMC image parser simulates a Markov chain MC with a unique invariant probability p(W | I). At time t, the Markov chain state (i.e. the parse graph) W follows a probability µ_t which is the product of the sub-kernels selected up to time t:

W ∼ µ_t(W) = ν(W_o) · [K_{a(1)} ∘ K_{a(2)} ∘ ⋯ ∘ K_{a(t)}](W_o, W) → p(W | I),  (6)

where a(t) indexes the sub-kernel selected at time t. As the time t increases, µ_t(W) approaches the posterior p(W | I) monotonically (Bremaud, 1999) at a geometric rate (Diaconis and Hanlon, 1992) independent of the starting configuration. The following convergence theorem is useful for image parsing because it helps quantify the effectiveness of the different sub-kernels.

Theorem 2. The Kullback-Leibler divergence between the posterior p(W | I) and the Markov chain state probability decreases monotonically when a sub-kernel K_{a(t)}, ∀ a(t) ∈ A, is applied:

KL(p(W | I) ‖ µ_t(W)) − KL(p(W | I) ‖ µ_{t+1}(W)) ≥ 0.  (7)

The decrease of KL-divergence is strictly positive and is equal to zero only after the Markov chain becomes stationary, i.e. µ = p.

[Proof] See Appendix A.

The theorem is related to the second law of thermodynamics (Cover and Thomas, 1991), and its proof makes use of the detailed balance condition, Eq. (3). This KL divergence gives a measure of the "power" of each sub-kernel K_{a(t)}, and so it suggests an efficient mechanism for selecting the sub-kernels at each time step, see Section 8. By contrast, classic convergence analysis (see Appendix B) shows that the convergence of the Markov chain is exponentially fast, but does not give measures of the power of sub-kernels.

2.4. DDMCMC and Proposal Probabilities

We now describe how to design the sub-kernels using proposal probabilities and discriminative models. This is at the heart of DDMCMC.

Each sub-kernel⁷ is designed to be of Metropolis-Hastings form (Metropolis et al., 1953; Hastings, 1970):

K_a(W′ | W : I) = Q_a(W′ | W : Tst_a(I)) min{1, [p(W′ | I) Q_a(W | W′ : Tst_a(I))] / [p(W | I) Q_a(W′ | W : Tst_a(I))]},  W′ ≠ W,  (8)


where a transition from W to W′ is proposed (stochastically) by the proposal probability Q_a(W′ | W : Tst_a(I)) and accepted (stochastically) with the acceptance probability:

α(W′ | W : I) = min{1, [p(W′ | I) Q_a(W | W′ : Tst_a(I))] / [p(W | I) Q_a(W′ | W : Tst_a(I))]}.  (9)

The Metropolis-Hastings form ensures that the sub-kernels obey detailed balance (after pairing) (Bremaud, 1999).
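In implementation terms, Eq. (9) reduces to the usual accept/reject test. A minimal sketch in log space for numerical stability (the function name and interface are ours, not the paper's):

```python
import math
import random

def mh_accept(log_p_new, log_p_old, log_q_forward, log_q_backward):
    """Acceptance test of Eq. (9) in log space.
    log_q_forward  = log Q_a(W' | W  : Tst_a(I)),
    log_q_backward = log Q_a(W  | W' : Tst_a(I))."""
    log_alpha = min(0.0, (log_p_new - log_p_old) + (log_q_backward - log_q_forward))
    return math.log(random.random()) < log_alpha
```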

The proposal probabilities Q_a(W′ | W : Tst_a(I)) will be built from discriminative probabilities using tests Tst_a(I) performed on the image. The design of the proposal probabilities is a trade-off. Ideally the proposals would be sampled from the posterior p(W′ | I), but this is impractical. Instead the trade-off requires: (i) it is possible to make large moves in Ω at each time step, (ii) the proposals should encourage moves to states with high posterior probability, and (iii) the proposals must be fast to compute.

More formally, we define the scope Ω_a(W) = {W′ ∈ Ω : K_a(W′ | W : I) > 0} to be the set of states which can be reached from W in one time step using sub-kernel a. We want the scope Ω_a(W) to be large so that we can make large moves in the space Ω at each time step (i.e. jump towards the solution and not crawl). The scope should also, if possible, include states W′ with high posterior p(W′ | I) (i.e. it is not enough for the scope to be large, it should also be in the right part of Ω).

The proposals Q_a(W′ | W : Tst_a(I)) should be chosen so as to approximate

Q_a(W′ | W : Tst_a(I)) ≈ p(W′ | I) / Σ_{W″ ∈ Ω_a(W)} p(W″ | I)  if W′ ∈ Ω_a(W),  and = 0 otherwise.  (10)

The proposals will be functions of the discriminative models for the components of W′ and of the generative models for the current state W (because it is computationally cheap to evaluate the generative models for the current state). The details of the model p(W | I) will determine the form of the proposals and how large we can make the scope while keeping the proposals easy to compute and able to approximate Eq. (10). See the detailed examples given in Section 5.

This description gives the bare bones of DDMCMC. We refer to Tu and Zhu (2002) for further details of these issues from an MCMC perspective.

In the discussion section, we describe strategies to improve DDMCMC. Preliminary theoretical results for the convergence of DDMCMC are encouraging for a special case (see Appendix C).

Finally, in Appendix D, we address the important practical issue of how to maintain detailed balance when there are multiple routes to transition between two states W and W′. We describe two ways to do this and the trade-offs involved.

3. Generative Models and Bayesian Formulation

This section describes the graph structure and the generative models used for our image parsing algorithm in this paper.

Figure 1 illustrates the general structure of a parsing graph. In this paper, we use the two-layer graph illustrated in Fig. 4. The top node ("root") of the graph represents the whole scene (with a label). It has K intermediate nodes for the visual patterns (face, text, texture, and shading). Each visual pattern has a number of pixels at the bottom ("leaves"). In this graph no horizontal connections are considered between the visual patterns except the constraint that they share boundaries and form a partition of the image lattice (see Tu and Zhu (2002) for an example of image parsing where horizontal connections are used, but without object patterns).

The number K of intermediate nodes is a random variable.

Figure 4. Abstract representation of the parsing graph used in this paper. The intermediate nodes represent the visual patterns. Their child nodes correspond to the pixels in the image.


Each node i = 1, …, K has a set of attributes (L_i, ζ_i, Θ_i) defined as follows. L_i is the shape descriptor and determines the region R_i = R(L_i) of the image pixels covered by the visual pattern of the intermediate node. Conceptually, the pixels within R_i are child nodes of the intermediate node i. (Regions may contain holes, in which case the shape descriptor will have internal and external boundaries.) The remaining attribute variables (ζ_i, Θ_i) specify the probability models p(I_{R(L_i)} | ζ_i, L_i, Θ_i) for generating the sub-image I_{R(L_i)} in region R(L_i). The variables ζ_i ∈ {1, …, 66} indicate the visual pattern type (3 types of generic visual patterns, 1 face pattern, and 62 text character patterns), and Θ_i denotes the model parameters for the corresponding visual pattern (details are given in the following subsections). The complete scene description can be summarized by:

W = (K, {(ζ_i, L_i, Θ_i) : i = 1, 2, …, K}).

The shape descriptors {L_i : i = 1, …, K} are required to be consistent so that each pixel in the image is a child of one, and only one, of the intermediate nodes. The shape descriptors must provide a partition of the image lattice Λ = {(m, n) : 1 ≤ m ≤ Height(I), 1 ≤ n ≤ Width(I)} and hence satisfy the condition

Λ = ⋃_{i=1}^{K} R(L_i),  R(L_i) ∩ R(L_j) = ∅, ∀ i ≠ j.

The generation process from the scene description W to I is governed by the likelihood function:

p(I | W) = ∏_{i=1}^{K} p(I_{R(L_i)} | ζ_i, L_i, Θ_i).

The prior probability p(W) is defined by

p(W) = p(K) ∏_{i=1}^{K} p(L_i) p(ζ_i | L_i) p(Θ_i | ζ_i).

In our Bayesian formulation, parsing the image corresponds to computing the W* that maximizes the a posteriori probability over Ω, the solution space of W:

W* = arg max_{W∈Ω} p(W | I) = arg max_{W∈Ω} p(I | W) p(W).  (11)

It remains to specify the prior p(W) and the likelihood function p(I | W). We set the prior terms p(K) and p(Θ_i | ζ_i) to be uniform probabilities. The term p(ζ_i | L_i) is used to penalize high model complexity and was estimated for the three generic visual patterns from training data in Tu and Zhu (2002).
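A minimal data-structure sketch of this representation and of evaluating Eq. (11) in log form; the field names, the stub shape prior, and the dispatch to family-specific likelihoods are our own illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class PatternNode:                 # one intermediate node (zeta_i, L_i, Theta_i)
    zeta: int                      # visual pattern type in {1, ..., 66}
    boundary: np.ndarray           # shape descriptor L_i (boundary pixel list)
    theta: Tuple                   # model parameters Theta_i

@dataclass
class ParseState:                  # W = (K, {(zeta_i, L_i, Theta_i)})
    nodes: List[PatternNode]

GAMMA, ALPHA, LAM = 1.0, 0.9, 1.0  # shape-prior constants of Eq. (12)

def log_prior_node(node: PatternNode, region_area: float) -> float:
    """log p(L_i) from Eq. (12); p(Theta_i | zeta_i) and p(K) are uniform,
    and the complexity term p(zeta_i | L_i) is omitted in this stub."""
    return -GAMMA * region_area**ALPHA - LAM * len(node.boundary)

def log_posterior(W: ParseState, image: np.ndarray,
                  loglik_region, region_area) -> float:
    """log p(W | I) = log p(I | W) + log p(W) + const (Eq. (11)).
    loglik_region(image, node) must implement the family-specific model
    p(I_R(L) | zeta, L, Theta) of Section 3.2; region_area(node) gives |R(L)|."""
    ll = sum(loglik_region(image, n) for n in W.nodes)
    lp = sum(log_prior_node(n, region_area(n)) for n in W.nodes)
    return ll + lp
```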

3.1. Shape Models

We use two types of shape descriptor in this paper. The first is used to define the shapes of generic visual patterns and faces. The second defines the shapes of text characters.

1. Shape Descriptors for Generic Visual Patterns and Faces

In this case, the shape descriptor represents the boundary⁸ of the image region by a list of pixels L_i = ∂R_i. The prior is defined by:

p(L_i) ∝ exp{−γ |R(L_i)|^α − λ|L_i|}.  (12)

In this paper, we set α = 0.9. For computational reasons, we use this prior for face shapes too, though more complicated priors (Cootes et al., 2001) can be applied.

2. Shape Descriptors for Text Characters

We model text characters by 62 deformable templates corresponding to the ten digits and the twenty-six letters in both upper and lower cases. These deformable templates are defined by 62 prototype characters and a set of deformations. The prototypes are represented by an outer boundary and, at most, two inner boundaries. Each boundary is modeled by a B-spline using twenty-five control points. The prototype characters are indexed by c_i ∈ {1, …, 62} and their control points are represented by a matrix TP(c_i).

We now define two types of deformations on the templates. One is a global affine transformation, and the other is a local elastic deformation. First we allow the letters to be deformed by an affine transform M_i. We put a prior p(M_i) to penalize severe rotation and distortion. This is obtained by decomposing M_i as:

M_i = [σ_x 0; 0 σ_y] · [cos θ  −sin θ; sin θ  cos θ] · [1 h; 0 1],

(rows separated by ";")


Figure 5. Random samples drawn from the shape descriptors for text characters.

where θ is the rotation angle, σ_x and σ_y denote scaling, and h is for shearing. The prior on M_i is

p(M_i) ∝ exp{−a|θ|² − b(σ_x/σ_y + σ_y/σ_x)² − c h²},

where a, b, c are parameters.
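A small sketch of composing this decomposition and evaluating its prior (the default parameter values are illustrative assumptions):

```python
import numpy as np

def affine_from_params(sx, sy, theta, h):
    """Compose M_i = scaling * rotation * shear, as in the decomposition above."""
    scale = np.array([[sx, 0.0], [0.0, sy]])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    shear = np.array([[1.0, h], [0.0, 1.0]])
    return scale @ rot @ shear

def log_prior_affine(sx, sy, theta, h, a=1.0, b=1.0, c=1.0):
    """log p(M_i) up to an additive constant: penalizes rotation,
    anisotropic scaling, and shearing."""
    return -a * theta**2 - b * (sx / sy + sy / sx)**2 - c * h**2
```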

Next, we allow local deformations by adjusting the positions of the B-spline control points. For a digit/letter c_i and affine transform M_i, the contour points of the template are given by G_TP(M_i, c_i) = U × M_s × M_i × TP(c_i). Similarly the contour points on the shape with control points S_i are given by G_S(M_i, c_i) = U × M_s × S_i (U and M_s are the B-spline matrices). We define a probability distribution p(S_i | M_i, c_i) for the elastic deformation S_i:

p(S_i | M_i, c_i) ∝ exp{−γ |R(L_i)|^α − D(G_S(M_i, c_i) ‖ G_TP(M_i, c_i))},

where D(G_S(M_i, c_i) ‖ G_TP(M_i, c_i)) is the overall distance between the template contour and the deformed contour (these deformations are small so the correspondence between points on the curves can be obtained by nearest neighbor matches; see Tu and Yuille (2004) for how we refine this). Figure 5 shows some samples drawn from the above model.

In summary, each deformable template is indexed by c_i ∈ {1, …, 62} and has a shape descriptor

L_i = (c_i, M_i, S_i).

The prior distribution on L_i is specified by:

p(L_i) = p(c_i) p(M_i) p(S_i | M_i, c_i).

Here p(c_i) is a uniform distribution on all the digits and letters (we do not place a prior distribution on text strings, though it is possible to do so (Klein and Manning, 2002)).

3.2. Generative Intensity Models

We use four families of generative intensity models for describing intensity patterns of (approximately) constant intensity, clutter/texture, shading, and faces. The first three are similar to those defined in Tu and Zhu (2002).

1. Constant Intensity Model, ζ = 1. This assumes that pixel intensities in a region R are independently and identically distributed (iid) according to a Gaussian distribution:

p_1(I_{R(L)} | ζ = 1, L, Θ) = ∏_{v∈R(L)} G(I_v − µ; σ²),  Θ = (µ, σ).

2. Clutter/Texture Model, ζ = 2. This is a non-parametric intensity histogram h(·) discretized to take G values (i.e. expressed as a vector (h_1, h_2, …, h_G)). Let n_j be the number of pixels in R(L) with intensity value j. Then:

p_2(I_{R(L)} | ζ = 2, L, Θ) = ∏_{v∈R(L)} h(I_v) = ∏_{j=1}^{G} h_j^{n_j},  Θ = (h_1, h_2, …, h_G).

3. Shading Model, ζ = 3 and ζ = 5, …, 66. This family of models is used to describe generic shading patterns and text characters.


Figure 6. Random samples drawn from the PCA face model.

We use a quadratic form

J(x, y; Θ) = ax² + bxy + cy² + dx + ey + f,

with parameters Θ = (a, b, c, d, e, f, σ). Therefore, the generative model for pixel (x, y) is

p_3(I_{R(L)} | ζ ∈ {3, 5, …, 66}, L, Θ) = ∏_{v∈R(L)} G(I_v − J_v; σ²),  Θ = (a, b, c, d, e, f, σ).

4. The PCA Face Model, ζ = 4. The generative model for faces is simpler and uses Principal Component Analysis (PCA) to obtain representations of the faces in terms of principal components {B_i} and covariance Σ. Lower level features, also modeled by PCA, can be added (Moghaddam and Pentland, 1997). Figure 6 shows some faces sampled from the PCA model. We also add other features such as the occlusion process, as described in Hallinan et al. (1999).

p_4(I_{R(L)} | ζ = 4, L, Θ) = G(I_{R(L)} − ∑_i λ_i B_i; Σ),  Θ = (λ_1, …, λ_n, Σ).
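The first three families reduce to simple log-likelihood computations. A sketch follows (the PCA face model is omitted since it needs trained components {B_i} and Σ; the interfaces are our own assumptions):

```python
import numpy as np

def loglik_constant(I_region, mu, sigma):
    """Model 1: iid Gaussian around a constant intensity mu."""
    r = I_region - mu
    return -0.5 * np.sum(r**2) / sigma**2 - r.size * np.log(sigma * np.sqrt(2 * np.pi))

def loglik_histogram(I_region, h):
    """Model 2: nonparametric histogram; h[g] is the probability of grey level g,
    so the log-likelihood is sum_j n_j log h_j."""
    counts = np.bincount(I_region.ravel().astype(int), minlength=len(h))
    return float(np.sum(counts * np.log(np.maximum(h, 1e-12))))

def loglik_shading(I_region, xy, params, sigma):
    """Model 3: quadratic shading surface J(x, y) = ax^2 + bxy + cy^2 + dx + ey + f,
    with iid Gaussian residuals; xy = (x, y) pixel coordinate arrays."""
    a, b, c, d, e, f = params
    x, y = xy
    J = a * x**2 + b * x * y + c * y**2 + d * x + e * y + f
    r = I_region - J
    return -0.5 * np.sum(r**2) / sigma**2 - r.size * np.log(sigma * np.sqrt(2 * np.pi))
```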

4. Overview of the Algorithm

This section gives the control structure of the image parsing algorithm based on the strategy described in Section 2; Fig. 8 shows the diagram. Our algorithm must construct the parse graph on the fly and estimate the scene interpretation W.

Figure 7 illustrates how the algorithm selects the Markov chain moves (dynamics or sub-kernels) to search through the space of possible parse graphs of the image by altering the graph structure (by deleting or adding nodes) and by changing the node attributes. An equivalent way of visualizing the algorithm is in terms of a search through the solution space Ω; see Tu and Zhu (2002a, 2002b) for more details of this viewpoint.

We first define the set of moves to reconfigure the graph. These are: (i) birth or death of face nodes, (ii) birth or death of text characters, (iii) splitting or merging of regions, (iv) switching node attributes (region type ζ_i and model parameters Θ_i), (v) boundary evolution (altering the shape descriptors L_i of nodes with adjacent regions). These moves are implemented by sub-kernels. The first four moves are reversible jumps (Green, 1995), and will be implemented by the Metropolis-Hastings equation (8). The fifth move, boundary evolution, is implemented by a stochastic partial differential equation.

The sub-kernels for these moves require proposal probabilities driven by elementary discriminative methods, which we review in the next subsection. The proposal probabilities are designed using the criteria in Section 2.4, and full details are given in Section 5.

The control structure of the algorithm is described in Section 4.2. The full transition kernel for the image parser is built by combining the sub-kernels, as described in Section 2.3 and Fig. 8. The algorithm proceeds (stochastically) by selecting a sub-kernel, selecting where in the graph to apply it, and then deciding whether or not to accept the operation.


Figure 7. Examples of Markov chain dynamics that change the graph structure or the node attributes of the graph, giving rise to different ways to parse the image.

Figure 8. Integrating generative (top-down) and discriminative (bottom-up) methods for image parsing. This diagram illustrates the main points of the image parser. The dynamics are implemented by an ergodic Markov chain K, whose invariant probability is the posterior p(W | I), and which is composed of reversible sub-kernels K_a for making different types of moves in the parse graph (e.g. giving birth to new nodes or merging nodes). At each time step the algorithm selects a sub-kernel stochastically. The selected sub-kernel proposes a specific move (e.g. to create or delete specific nodes) and this move is then evaluated and accepted stochastically, see Eq. (8). The proposals are based on both bottom-up (discriminative) and top-down (generative) processes, see Section 2.4. The bottom-up processes compute discriminative probabilities q(w_j | Tst_j(I)), j = 1, 2, 3, 4, from the input image I based on feature tests Tst_j(I). An additional sub-kernel for boundary evolution uses a stochastic partial differential equation and will be described later.



4.1. The Discriminative Methods

The discriminative methods give approximate posterior probabilities q(w_j | Tst_j(I)) for the elementary components w_j of W. For computational efficiency, these probabilities are based only on a small number of simple tests Tst_j(I).

We briefly overview and classify the discriminative methods used in our implementation. Section 5 shows how these discriminative methods are composed, see crosses in Fig. 8, to give proposals for making moves in the parsing graph.

1. Edge Cues. These cues are based on edge detectors (Canny, 1986; Bowyer et al., 2001; Konishi et al., 2003). They are used to give proposals for region boundaries (i.e. the shape descriptor attributes of the nodes). Specifically, we run the Canny detector at three scales followed by edge linking to give partitions of the image lattice. This gives a finite list of candidate partitions which are assigned weights, see Section 5.2.3 and Tu and Zhu (2002a). The discriminative probability is represented by this weighted list of particles. In principle, statistical edge detectors (Konishi et al., 2003) would be preferable to Canny because they give discriminative probabilities q(w_j | Tst_j(I)) learnt from training data.

2. Binarization Cues. These cues are computed using a variant of Niblack's algorithm (Niblack, 1986). They are used to propose boundaries for text characters (i.e. shape descriptors for text nodes), and will be used in conjunction with proposals for text detection. The binarization algorithm, and an example of its output, are given in Section 6. Like edge cues, the algorithm is run at different parameter settings and the discriminative probability is represented by a weighted list of particles indicating candidate boundary locations.

3. Face Region Cues. These cues are learnt by a variant of AdaBoost (Schapire, 2002; Viola and Jones, 2001) which outputs discriminative probabilities (Friedman et al., 1998), see Section 6. They propose the presence of faces in sub-regions of the image. These cues are combined with edge detection to propose the localization of faces in an image.

4. Text Region Cues. These cues are also learnt by a probabilistic version of AdaBoost, see Section 6. The algorithm is applied to image windows (at a range of scales). It outputs a discriminative probability for the presence of text in each window. Text region cues are combined with binarization to propose boundaries for text characters.

5. Shape Affinity Cues. These act on shape boundaries, produced by binarization, to propose text characters. They use shape context cues (Belongie et al., 2002) and informative features (Tu and Yuille, 2004) to propose matches between the shape boundaries and the deformable template models of text characters.

6. Region Affinity Cues. These are used to estimate whether two regions R_i, R_j are likely to have been generated by the same visual pattern family and model parameters. They use an affinity similarity measure (Shi and Malik, 2000) of the intensity properties I_{R_i}, I_{R_j}.

7. Model Parameter and Visual Pattern Family Cues. These are used to propose model parameters and visual pattern family identity. They are based on clustering algorithms, such as mean-shift (Comaniciu and Meer, 1999). The clustering algorithms depend on the model types and are described in Tu and Zhu (2002a).

In our current implementation, we conduct all the bottom-up tests Tst_j(I), j = 1, 2, …, K, at an early stage for all the discriminative models q_j(w_j | Tst_j(I)), and they are then combined to form composite tests Tst_a(I) for each sub-kernel K_a in Eqs. (8) and (9). It may be more efficient to perform these tests as required, see the discussion in Section 8.

4.2. Control Structure of the Algorithm

The control strategy used by our image parser is illustrated in Fig. 8. The image parser explores the space of parsing graphs by a Markov chain Monte Carlo sampling algorithm. This algorithm uses a transition kernel K which is composed of sub-kernels K_a corresponding to different ways to reconfigure the parsing graph. These sub-kernels come in reversible pairs⁹ (e.g., birth and death) and are designed so that the target probability distribution of the kernel is the generative posterior p(W | I). At each time step, a sub-kernel is selected stochastically. The sub-kernels use the Metropolis-Hastings sampling algorithm, see Eq. (8), which proceeds in two stages. First, it proposes


a reconfiguration of the graph by sampling from a proposal probability. Then it accepts or rejects this reconfiguration by sampling the acceptance probability.

To summarize, we outline the control strategy of the algorithm below. At each time step, it specifies (stochastically) which move to select (i.e. which sub-kernel), where to apply it in the graph, and whether to accept the move. The probability to select moves, ρ(a : I), was first set to be independent of I, but we got better performance by adapting it using discriminative cues to estimate the number of faces and text characters in the image (see details below). The choice of where to apply the move is specified (stochastically) by the sub-kernel. For some sub-kernels it is selected randomly and for others it is chosen based on a fitness factor (see details in Section 5), which measures how well the current model fits the image data. Some annealing is required to start the algorithm because of the limited scope of the moves in the current implementation (the need for annealing will be reduced if the compositional techniques described in Barbu and Zhu (2003) are used).

We improved the effectiveness of the algorithm by making the move selection adapt to the image (i.e. by making ρ(a : I) depend on I). In particular, we increased the probability of giving birth and death of faces and text, ρ(1) and ρ(2), if the bottom-up (AdaBoost) proposals suggested that there are many objects in the scene. For example, let N(I) be the number of proposals for faces or text above a threshold T_a. Then we modify the probabilities by ρ(a₁) → (ρ(a₁) + k g(N(I)))/Z, ρ(a₂) → (ρ(a₂) + k g(N(I)))/Z, ρ(a₃) → ρ(a₃)/Z, ρ(a₄) → ρ(a₄)/Z, where g(x) = x for x ≤ T_b and g(x) = T_b for x > T_b, and Z = 1 + 2k g(N(I)) is chosen to normalize the probabilities.
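As a sketch, this re-weighting amounts to the following (k, T_b, and the move indexing are illustrative assumptions):

```python
def adapt_move_probs(rho, n_proposals, k=0.1, T_b=10):
    """rho: base move probabilities summing to 1, with rho[0], rho[1] the
    face/text birth-death moves; n_proposals: N(I), the number of AdaBoost
    proposals above threshold T_a."""
    g = min(n_proposals, T_b)              # g(x) = x for x <= T_b, else T_b
    boosted = [rho[0] + k * g, rho[1] + k * g] + list(rho[2:])
    Z = sum(boosted)                       # equals 1 + 2k g(N(I))
    return [r / Z for r in boosted]
```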

The basic control strategy of the image parsing algorithm is as follows (a schematic sketch in code is given after the list):

1. Initialize W (e.g. by dividing the image into four regions), setting their shape descriptors, and assigning the remaining node attributes at random.

2. Set the temperature to be T_init.

3. Select the type a of move by sampling from a probability ρ(a), with ρ(1) = 0.2 for faces, ρ(2) = 0.2 for text, ρ(3) = 0.4 for splitting and merging, ρ(4) = 0.15 for switching region model (type or model parameters), and ρ(5) = 0.05 for boundary evolution. These probabilities were modified slightly and adaptively, as described above.

4. If the selected move is boundary evolution, then select adjacent regions (nodes) at random and apply stochastic steepest descent, see Section 5.1.

5. If a jump move is selected, then a new solution W′ is randomly sampled as follows:

– For the birth or death of a face, see Section 5.2.2, we propose to create or delete a face. This includes a proposal for where in the image to do this.

– For the birth or death of text, see Section 5.2.1, we propose to create a text character or delete an existing one. This includes a proposal for where to do this.

– For region splitting, see Section 5.2.3, a region (node) is randomly chosen biased by its fitness factor. There are proposals for where to split it and for the attributes of the resulting two nodes.

– For region merging, see Section 5.2.3, two neighboring regions (nodes) are selected based on a proposal probability. There are proposals for the attributes of the resulting node.

– For switching, see Section 5.2.4, a region is selected randomly according to its fitness factor and a new region type and/or model parameters are proposed.

• The full proposal probabilities, Q(W | W′ : I) and Q(W′ | W : I), are computed.

• The Metropolis-Hastings algorithm, Eq. (8), is applied to accept or reject the proposed move.

6. Reduce the temperature T = 1 + T_init × exp(−t × c‖R‖), where t is the current iteration step, c is a constant and ‖R‖ is the size of the image.

7. Repeat the above steps until the convergence criterion is satisfied (by reaching the maximum number of allowed steps or by lack of decrease of the negative log-posterior).
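Putting the steps together, the loop below is a schematic of this control strategy. It reuses the earlier sketches (mh_accept, adapt_move_probs); initialize_state, propose_jump, boundary_evolution_step, num_object_proposals, log_posterior_of, and converged are placeholders for the machinery of Sections 5 and 6, so this is an outline rather than a runnable parser.

```python
import math
import random

def parse_image(image, max_steps=10000, T_init=10.0, c=1e-4):
    W = initialize_state(image)                   # step 1: e.g. four regions
    rho = [0.2, 0.2, 0.4, 0.15, 0.05]             # faces, text, split/merge,
                                                  # switch, boundary evolution
    n_obj = num_object_proposals(image)           # N(I) from AdaBoost proposals
    for t in range(max_steps):
        T = 1 + T_init * math.exp(-t * c * image.size)   # step 6: annealing
        a = random.choices(range(5), weights=adapt_move_probs(rho, n_obj))[0]
        if a == 4:                                # step 4: boundary evolution
            W = boundary_evolution_step(W, image)
        else:                                     # step 5: jump moves
            W_new, log_q_fwd, log_q_bwd = propose_jump(a, W, image)
            # dividing the log-posterior by T targets the tempered posterior
            if mh_accept(log_posterior_of(W_new, image) / T,
                         log_posterior_of(W, image) / T,
                         log_q_fwd, log_q_bwd):
                W = W_new
        if converged(W):                          # step 7
            break
    return W
```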

5. The Markov Chain Kernels

This section gives a detailed discussion of the individual Markov chain kernels, their proposal probabilities, and their fitness factors.


5.1. Boundary Evolution

These moves evolve the positions of the region boundaries but preserve the graph structure. They are implemented by a stochastic partial differential equation (Langevin equation) driven by Brownian noise and can be derived from a Markov chain (Geman and Huang, 1986). The deterministic component of the PDE is obtained by performing steepest descent on the negative log-posterior, as derived in Zhu and Yuille (1996).

We illustrate the approach by deriving the deterministic component of the PDE for the evolution of the boundary between a letter T_j and a generic visual pattern region R_i. The boundary will be expressed in terms of the control points {S_m} of the shape descriptor of the letter. Let v denote a point on the boundary, i.e. v(s) = (x(s), y(s)) on Γ(s) = ∂R_i ∩ ∂R_j. The deterministic part of the evolution equation is obtained by taking the derivative of the negative log-posterior −log p(W | I) with respect to the control points.

More precisely, the relevant parts of the negative log-posterior, see Eqs. (3) and (11), are given by E(R_i) and E(T_j) where:

E(R_i) = ∫∫_{R_i} −log p(I(x, y) | θ_{ζ_i}) dx dy + γ|R_i|^α + λ|∂R_i|,

and

E(T_j) = ∫∫_{R(L_j)} −log p(I(x, y) | θ_{ζ_j}) dx dy + γ|R(L_j)|^α − log p(L_j).

Differentiating E(R_i) + E(T_j) with respect to the control points {S_m} yields the evolution PDE:

dS_m/dt = −δE(R_i)/δS_m − δE(T_j)/δS_m
  = ∫ [−δE(R_i)/δv − δE(T_j)/δv] (1/|J(s)|) ds
  = ∫ n(v) [ log( p(I(v); θ_{ζ_i}) / p(I(v); θ_{ζ_j}) ) + αγ (1/|D_j|^{1−α} − 1/|D_i|^{1−α}) − λκ + D(G_{S_j}(s) ‖ G_T(s)) ] (1/|J(s)|) ds,

where n(v) is the boundary normal, κ is the boundary curvature, and J(s) is the Jacobian matrix for the spline function. (Recall that α = 0.9 in the implementation.)

The log-likelihood ratio term log[p(I(v); θ_{ζ_i}) / p(I(v); θ_{ζ_j})] implements the competition between the letter and the generic region models for ownership of the boundary pixels.
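A discretized sketch of one Langevin update of the control points; the discretization, step size, and noise level are our assumptions, and the force array is the bracketed integrand above evaluated at boundary samples:

```python
import numpy as np

def evolve_control_points(S, force, B, jac_norm, step=0.1, noise=0.01):
    """One stochastic update of the B-spline control points S (m x 2).
    force: integrand of the PDE above at n boundary samples (n x 2);
    B: n x m B-spline basis matrix mapping control points to samples;
    jac_norm: |J(s)| at each sample (n,). The sum over samples discretizes
    the boundary integral; the Gaussian term is the Brownian driving noise."""
    drift = B.T @ (force / jac_norm[:, None])
    return S + step * drift + np.sqrt(2.0 * step * noise) * np.random.randn(*S.shape)
```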

5.2. Markov Chain Sub-Kernels

Changes in the graph structure are realized by Markov chain jumps implemented by four different sub-kernels.

5.2.1. Sub-Kernel I: Birth and Death of Text. This pair of jumps is used to create or delete text characters. We start with a parse graph W and transition into parse graph W′ by creating a character. Conversely, we transition from W′ back to W by deleting a character.

The proposals for creating and deleting text characters are designed to approximate the terms in Eq. (10). We obtain a list of candidate text character shapes by using AdaBoost to detect text regions followed by binarization to detect candidate text character boundaries within text regions (see Section 6.2). This list is represented by a set of particles which are weighted by their similarity to the deformable templates for text characters (see below):

S_{1r}(W) = {(z^{(µ)}_{1r}, ω^{(µ)}_{1r}) : µ = 1, 2, …, N_{1r}}.

Similarly, we specify another set of weighted particles for removing text characters:

S_{1l}(W′) = {(z^{(ν)}_{1l}, ω^{(ν)}_{1l}) : ν = 1, 2, …, N_{1l}}.

Here {z^{(µ)}_{1r}} and {z^{(ν)}_{1l}} represent the possible (discretized) shape positions and text character deformable templates for creating or removing text, and {ω^{(µ)}_{1r}} and {ω^{(ν)}_{1l}} are their corresponding weights. The particles are then used to compute the proposal probabilities:

Q_{1r}(W′ | W : I) = ω_{1r}(W′) / Σ_{µ=1}^{N_{1r}} ω^{(µ)}_{1r},   Q_{1l}(W | W′ : I) = ω_{1l}(W) / Σ_{ν=1}^{N_{1l}} ω^{(ν)}_{1l}.

The weights ω^{(µ)}_{1r} and ω^{(ν)}_{1l} for creating new text characters are specified by shape affinity measures, such as shape contexts (Belongie et al., 2002) and informative features (Tu and Yuille, 2004).


For deleting text characters we calculate ω^{(ν)}_{1l} directly from the likelihood and prior on the text character. Ideally these weights will approximate the ratios p(W′ | I)/p(W | I) and p(W | I)/p(W′ | I), respectively.
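Operationally, each such sub-kernel samples from its weighted particle set and reports the normalized weight as the proposal probability for the acceptance ratio. A small sketch (the interface is ours):

```python
import random

def sample_proposal(particles):
    """particles: list of (z, omega) pairs, e.g. S_1r(W) above. Samples a
    particle z with probability omega / sum(omegas) and returns (z, Q),
    where Q is its normalized weight, for use in Eq. (8)."""
    total = sum(w for _, w in particles)
    r = random.uniform(0.0, total)
    acc = 0.0
    for z, w in particles:
        acc += w
        if r <= acc:
            return z, w / total
    z, w = particles[-1]                   # guard against floating-point edge case
    return z, w / total
```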

5.2.2. Sub-Kernel II: Birth and Death of Face. The sub-kernel for the birth and death of faces is very similar to the sub-kernel for the birth and death of text. We use the AdaBoost method discussed in Section 6.2 to detect candidate faces. Candidate face boundaries are obtained directly from edge detection. The proposal probabilities are computed similarly to those for sub-kernel I.

5.2.3. Sub-Kernel III: Splitting and Merging Regions. This pair of jumps is used to create or delete nodes by splitting and merging regions (nodes). We start with a parse graph W and transition into parse graph W′ by splitting node i into nodes j and k. Conversely, we transition back to W by merging nodes j and k into node i. The selection of which region i to split is based on a robust function on p(I_{R_i} | ζ_i, L_i, Θ_i) (i.e. the worse the model for region R_i fits the data, the more likely we are to split it). For merging, we use a region affinity measure (Shi and Malik, 2000) and propose merges between regions which have high affinity.

Formally, we define W and W′:

W = (K, (ζ_k, L_k, Θ_k), W_−) ⇌ W′ = (K + 1, (ζ_i, L_i, Θ_i), (ζ_j, L_j, Θ_j), W_−),

where W_− denotes the attributes of the remaining K − 1 nodes in the graph.

We obtain proposals by seeking approximations to Eq. (10) as follows.

We first obtain three edge maps. These are given by Canny edge detectors (Canny, 1986) at different scales (see Tu and Zhu (2002a) for details). We use these edge maps to create a list of particles for splitting, S_{3r}(W). A list of particles for merging is denoted by S_{3l}(W′):

S_{3r}(W) = {(z^{(µ)}_{3r}, ω^{(µ)}_{3r}) : µ = 1, 2, …, N_{3r}},   S_{3l}(W′) = {(z^{(ν)}_{3l}, ω^{(ν)}_{3l}) : ν = 1, 2, …, N_{3l}},

where {z^{(µ)}_{3r}} and {z^{(ν)}_{3l}} represent the possible (discretized) positions for splitting and merging, and their weights {ω_{3r}}, {ω_{3l}} will be defined shortly. In other words, we can only split a region i into regions j and k along a contour z^{(µ)}_{3r} (i.e. z^{(µ)}_{3r} forms the new boundary).

Figure 9. The evolution of the region boundaries is implemented by stochastic partial differential equations which are driven by models competing for ownership of the regions.

Similarly we can only merge regions j and k into region i by deleting a boundary contour z^{(ν)}_{3l}. For example, Figure 11 shows 7 candidate sites for splitting W and 5 sites for merging W′.

We now define the weights {ω_{3r}}, {ω_{3l}}. These weights will be used to determine the probabilities for the splits and merges:

Q_{3r}(W′ | W : I) = ω_{3r}(W′) / Σ_{µ=1}^{N_{3r}} ω^{(µ)}_{3r},   Q_{3l}(W | W′ : I) = ω_{3l}(W) / Σ_{ν=1}^{N_{3l}} ω^{(ν)}_{3l}.

Again, we would like ω^{(µ)}_{3r} and ω^{(ν)}_{3l} to approximate the ratios p(W′ | I)/p(W | I) and p(W | I)/p(W′ | I) respectively. The ratio p(W′ | I)/p(W | I) is given by:

p(W′ | I)/p(W | I) = [p(I_{R_i} | ζ_i, L_i, Θ_i) p(I_{R_j} | ζ_j, L_j, Θ_j) / p(I_{R_k} | ζ_k, L_k, Θ_k)] · [p(ζ_i, L_i, Θ_i) p(ζ_j, L_j, Θ_j) / p(ζ_k, L_k, Θ_k)] · [p(K + 1) / p(K)].

This is expensive to compute, so we approximate $p(W' \mid I)/p(W \mid I)$ and $p(W \mid I)/p(W' \mid I)$ by:

$$\omega_{3r}^{(\mu)} = \frac{q(R_i, R_j)}{p(I_{R_k} \mid \zeta_k, L_k, \Theta_k)} \cdot \frac{[q(L_i)\, q(\zeta_i, \Theta_i)][q(L_j)\, q(\zeta_j, \Theta_j)]}{p(\zeta_k, L_k, \Theta_k)}, \tag{13}$$

$$\omega_{3l}^{(\nu)} = \frac{q(R_i, R_j)}{p(I_{R_i} \mid \zeta_i, L_i, \Theta_i)\, p(I_{R_j} \mid \zeta_j, L_j, \Theta_j)} \cdot \frac{q(L_k)\, q(\zeta_k, \Theta_k)}{p(\zeta_i, L_i, \Theta_i)\, p(\zeta_j, L_j, \Theta_j)}, \tag{14}$$


Figure 10. An example of the birth-death of text. State W consists of three generic regions and a character "T". Proposals are computed for 3 candidate characters, "E", "X", and "T", obtained by AdaBoost and binarization methods (see Section 6.2). One is selected, see arrow, which changes the state to W′. Conversely, there are 2 candidates in state W′ and the one selected, see arrow, returns the system to state W.

Figure 11. An example of the split-merge sub-kernel. State W consists of three regions and proposals are computed for 7 candidate splits. One is selected, see arrow, which changes the state to W′. Conversely, there are 5 candidate merges in state W′ and the one selected, see arrow, returns the system to state W.

where $q(R_i, R_j)$ is an affinity measure (Shi and Malik, 2000) of the similarity of the two regions $R_i$ and $R_j$ (it is a weighted sum of the intensity difference $|\bar{I}_i - \bar{I}_j|$ and the chi-squared difference between the intensity histograms), $q(L_i)$ is given by the priors on the shape descriptors, and $q(\zeta_i, \Theta_i)$ is obtained by clustering in parameter space (see Tu and Zhu (2002)).
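As a rough illustration of Eqs. (13) and (14), the sketch below assembles the two weights from pre-evaluated likelihood, prior, and affinity terms. The argument names are stand-ins for the paper's quantities (e.g. `lik_k` for $p(I_{R_k} \mid \zeta_k, L_k, \Theta_k)$); how those terms are themselves computed is outside this sketch.

```python
def split_weight(q_affinity, lik_k, q_shape_i, q_shape_j, prior_k):
    """Approximate omega_3r^(mu) of Eq. (13): the weight grows when the
    single model k explains its region's data poorly."""
    return (q_affinity / lik_k) * (q_shape_i * q_shape_j) / prior_k

def merge_weight(q_affinity, lik_i, lik_j, q_shape_k, prior_i, prior_j):
    """Approximate omega_3l^(nu) of Eq. (14), the reverse (merge) move."""
    return (q_affinity / (lik_i * lik_j)) * q_shape_k / (prior_i * prior_j)
```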

5.2.4. Sub-Kernel IV: Switching Node Attributes. These moves switch the attributes of a node i. This involves changing the region type $\zeta_i$ and the model parameters $\Theta_i$.

The move transitions between two states:

$$W = ((\zeta_i, L_i, \Theta_i), W_-) \rightleftharpoons W' = ((\zeta'_i, L'_i, \Theta'_i), W_-).$$

The proposal, see Eq. (10), should approximate:

$$\frac{p(W' \mid I)}{p(W \mid I)} = \frac{p(I_{R_i} \mid \zeta'_i, L'_i, \Theta'_i)\, p(\zeta'_i, L'_i, \Theta'_i)}{p(I_{R_i} \mid \zeta_i, L_i, \Theta_i)\, p(\zeta_i, L_i, \Theta_i)}.$$

We approximate this by a weight $\omega_4^{(\mu)}$ given by

$$\omega_4^{(\mu)} = \frac{q(L'_i)\, q(\zeta'_i, \Theta'_i)}{p(I_{R_i} \mid \zeta_i, L_i, \Theta_i)\, p(\zeta_i, L_i, \Theta_i)},$$


where $q(L'_i)$ and $q(\zeta'_i, \Theta'_i)$ are the same functions used in the split and merge moves. The proposal probability is the weight normalized over the candidate set, $Q_4(W' \mid W : I) = \omega_4(W') / \sum_{\mu=1}^{N_4} \omega_4^{(\mu)}$.

6. AdaBoost for Discriminative Probabilities for Face and Text

This section describes how we use AdaBoost techniques to compute discriminative probabilities for detecting faces and text (strings of letters). We also describe the binarization algorithm used to detect the boundaries of text characters.

6.1. Computing Discriminative Probabilities by AdaBoost

The standard AdaBoost algorithm, for example for distinguishing faces from non-faces (Viola and Jones, 2001), learns a binary-valued strong classifier $H_{\text{Ada}}$ by combining a set of $n$ binary-valued "weak classifiers" or feature tests $\text{Tst}_{\text{Ada}}(I) = (h_1(I), \ldots, h_n(I))$ using a set of weights $\alpha_{\text{Ada}} = (\alpha_1, \ldots, \alpha_n)$ (Freund and Schapire, 1996):

$$H_{\text{Ada}}(\text{Tst}_{\text{Ada}}(I)) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i h_i(I)\right) = \operatorname{sign}\langle \alpha_{\text{Ada}}, \text{Tst}_{\text{Ada}}(I) \rangle. \tag{15}$$
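In code, the strong classifier of Eq. (15) is just the sign of a weighted vote. A minimal sketch, assuming the weak classifiers are callables returning ±1:

```python
def strong_classifier(alphas, weak_tests, image):
    """Eq. (15): sign of the weighted sum of weak classifier outputs."""
    score = sum(a * h(image) for a, h in zip(alphas, weak_tests))
    return 1 if score >= 0 else -1
```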

The features are selected from a pre-designed dictionary $\Delta_{\text{Ada}}$. The selection of features and the tuning of weights are posed as a supervised learning problem. Given a set of labeled examples $\{(I_i, \ell_i): i = 1, 2, \ldots, M\}$ ($\ell_i = \pm 1$), AdaBoost learning can be formulated as greedily optimizing the following function (Schapire, 2002):

$$(\alpha^*_{\text{Ada}}, \text{Tst}^*_{\text{Ada}}) = \arg\min_{\text{Tst}_{\text{Ada}} \subset \Delta_{\text{Ada}}} \arg\min_{\alpha_{\text{Ada}}} \sum_{i=1}^{M} \exp\left(-\ell_i \langle \alpha_{\text{Ada}}, \text{Tst}_{\text{Ada}}(I_i) \rangle\right). \tag{16}$$
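A minimal sketch of this greedy optimization is the classic discrete AdaBoost loop (Freund and Schapire, 1996): each round picks the dictionary test with the smallest weighted error, sets its vote weight in closed form, and re-weights the examples, which amounts to coordinate-wise descent on the exponential loss of Eq. (16). The dictionary here is a plain list of callables; this is not the paper's actual training code.

```python
import math

def adaboost_train(samples, labels, dictionary, rounds):
    """Greedy minimization of the exponential loss in Eq. (16).
    `dictionary` is a list of weak tests h(I) -> {-1, +1}."""
    m = len(samples)
    w = [1.0 / m] * m                              # example weights
    alphas, tests = [], []
    for _ in range(rounds):
        # Select the weak test with the smallest weighted error.
        best_h, best_err = None, float("inf")
        for h in dictionary:
            err = sum(wi for wi, x, y in zip(w, samples, labels) if h(x) != y)
            if err < best_err:
                best_h, best_err = h, err
        best_err = min(max(best_err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - best_err) / best_err)
        alphas.append(alpha)
        tests.append(best_h)
        # Re-weight: misclassified examples gain weight.
        w = [wi * math.exp(-alpha * y * best_h(x))
             for wi, x, y in zip(w, samples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    return alphas, tests
```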

To obtain discriminative probabilities we use a theorem (Friedman et al., 1998) which states that the features and tests learnt by AdaBoost give (asymptotically) posterior probabilities for the object labels (e.g. face or non-face). The AdaBoost strong classifier can be rederived as the log posterior ratio test.

Theorem 3 (Friedman et al., 1998). With sufficient training samples M and features n, AdaBoost learning selects the weights $\alpha^*_{\text{Ada}}$ and tests $\text{Tst}^*_{\text{Ada}}$ to satisfy

$$q(\ell = +1 \mid I) = \frac{e^{\langle \alpha_{\text{Ada}}, \text{Tst}_{\text{Ada}}(I) \rangle}}{e^{\langle \alpha_{\text{Ada}}, \text{Tst}_{\text{Ada}}(I) \rangle} + e^{-\langle \alpha_{\text{Ada}}, \text{Tst}_{\text{Ada}}(I) \rangle}}.$$

Moreover, the strong classifier converges asymptotically to the posterior probability ratio test

$$H_{\text{Ada}}(\text{Tst}_{\text{Ada}}(I)) = \operatorname{sign}(\langle \alpha_{\text{Ada}}, \text{Tst}_{\text{Ada}}(I) \rangle) = \operatorname{sign}\left(\log \frac{q(\ell = +1 \mid I)}{q(\ell = -1 \mid I)}\right).$$
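Since $e^F/(e^F + e^{-F}) = 1/(1 + e^{-2F})$, Theorem 3 gives a one-line recipe for turning the AdaBoost score into an approximate posterior; a sketch:

```python
import math

def adaboost_posterior(alphas, weak_tests, image):
    """Approximate q(l = +1 | I) from the AdaBoost score
    F(I) = <alpha, Tst(I)>: a logistic function of 2F (Theorem 3)."""
    f = sum(a * h(image) for a, h in zip(alphas, weak_tests))
    return 1.0 / (1.0 + math.exp(-2.0 * f))
```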

In practice, the AdaBoost classifier is applied to windows in the image at different scales. Each window is evaluated as being face or non-face (or text versus non-text). For most images the posterior probabilities for faces or text are negligible for almost all parts of the image. So we use a cascade of tests (Viola and Jones, 2001; Wu et al., 2004) which enables us to rapidly reject many windows by setting their marginal probabilities to be zero.
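The control flow of such a cascade is simple: a window survives only if every stage accepts it, so most windows exit after the first cheap tests. A hedged sketch, with stage functions and thresholds as placeholders:

```python
def cascade_detect(window, stages):
    """Evaluate a window against a cascade of (score_fn, threshold)
    stages of increasing cost; reject on the first failed test."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False   # early rejection: marginal probability set to zero
    return True            # survived all stages: candidate face/text window
```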

Of course, AdaBoost will only converge to approximations to the true posterior probabilities $p(\ell \mid I)$ because only a limited number of tests can be used (and there is only a limited amount of training data).

Note that AdaBoost is only one way to learn a posterior probability, see Theorem 1. It has been found to be very effective for object patterns which have relatively rigid structures, such as faces and text (the shapes of letters are variable, but the patterns of a sequence are fairly structured (Chen and Yuille, 2004)).

6.2. AdaBoost Training

We used standard AdaBoost training methods (Freund and Schapire, 1996; Friedman et al., 1998) combined with the cascade approach using asymmetric weighting (Viola and Jones, 2001; Wu et al., 2004). The cascade enables the algorithm to rule out most of the image as face (or text) locations with a few tests, and allows computational resources to be concentrated on the more challenging parts of the images.

The AdaBoost for text was designed to detect text segments. Our test data was extracted by hand from 162 images of San Francisco, see Fig. 12, and contained 561 text images, some of which can be seen


Figure 12. Some scenes from which the training text patches are extracted.

Figure 13. Positive training examples for AdaBoost.

in Fig. 13. More than half of the images were taken by blind volunteers (which reduces bias). We divided each text image into several overlapping text segments with fixed width-to-height ratio 2:1 (typically containing between two and three letters). A total of 7,000 text segments were used as the positive training set. The negative examples were obtained by a bootstrap process similar to Drucker et al. (1993). First we selected negative examples by randomly sampling from windows in the image dataset. After training with these samples, we applied the AdaBoost algorithm at a range of scales to classify all windows in the training images. Those misclassified as text were then used as negative examples for the next stage of AdaBoost. The image regions most easily confused with text were vegetation, and repetitive structures such as railings or building facades. The features used for AdaBoost were image tests corresponding to the statistics of elementary filters. The features were chosen to detect properties of text segments that were relatively invariant to the shapes of the individual letters or digits. They included averaging the intensity within image windows, and statistics of the number of edges. We refer to Chen and Yuille (2004) for more details.
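The bootstrap can be summarized by the loop below. The `window_source`, `train_fn`, and `detect_fn` arguments are hypothetical stand-ins for dataset plumbing and for the AdaBoost trainer/detector; the paper does not spell out this interface.

```python
def bootstrap_negatives(train_fn, detect_fn, window_source, positives, rounds):
    """Bootstrap negative mining in the spirit of Drucker et al. (1993):
    whatever the current classifier wrongly calls text becomes a harder
    negative for the next round of training."""
    negatives = list(window_source(5000))     # initial random negatives
    classifier = None
    for _ in range(rounds):
        classifier = train_fn(positives, negatives)
        false_alarms = [win for win in window_source(50000)
                        if detect_fn(classifier, win)]
        negatives += false_alarms             # mined hard negatives
    return classifier
```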

Figure 14(a) shows some failure examples of AdaBoost for text detection. These correspond to situations such as heavy shading, blurred images, isolated digits, vertical commercial signs, and non-standard fonts. They were not included in the training examples shown in Fig. 13.


Figure 14. Some failure examples of text and faces that AdaBoost misses. The text image failures (a) are challenging and were not included in our positive training examples. The face failures (b) include heavy shading, occlusion and facial features.

The AdaBoost posteriors for faces were trained in a similar way. This time we used Haar basis functions (Viola and Jones, 2001) as elementary features. We used the FERET database (Phillips et al., 1998) for our positive examples, see Fig. 13(b), and by allowing small rotation and translation transformations we had 5,000 positive examples. We used the same strategy as described above for text to obtain negative examples. Failure examples are shown in Fig. 14.

In both cases, we evaluated the log posterior ratio test on testing datasets using a number of different thresholds (see Viola and Jones (2001)). In agreement with previous work on faces (Viola and Jones, 2001), AdaBoost gave very high performance with very few false positives and false negatives, see Table 1. But these low error rates are slightly misleading because of the enormous number of windows in each image, see Table 1.

Table 1. Performance of AdaBoost at different thresholds.

Object   False positive   False negative   Images   Subwindows
Face           65               26           162    355,960,040
Face          918               14           162    355,960,040
Face         7542                1           162    355,960,040
Text          118               27            35     20,183,316
Text         1879                5            35     20,183,316

A small false positive rate may imply a large number of false positives for any regular image. By varying the threshold, we can eliminate either the false positives or the false negatives, but not both at the same time. We illustrate this by showing the face regions and text regions proposed by AdaBoost in Fig. 15. If we attempt classification by applying a threshold, then we can only correctly detect all the faces and the text at the expense of false positives.

When AdaBoost is integrated with the generic region models in the image parser, the generic region proposals can remove false positives and find text that AdaBoost misses. For example, the '9' in the right panel of Fig. 15 is not detected because our AdaBoost algorithm was trained on text segments. Instead it is detected as a generic shading region and later recognized as a letter '9', see Fig. 17. Some false positive text and faces in Fig. 15 are removed in Figs. 17 and 19.

The AdaBoost algorithm for text needs to be supplemented with a binarization algorithm, described below, to determine text character locations. This is followed by applying shape contexts (Belongie et al., 2002) and informative features (Tu and Yuille, 2004) to the binarization results to make proposals for the presence of specific letters and digits.

In many cases, see Fig. 16, the results of binarization are so good that the letters and digits can be detected immediately (i.e. the proposals made by the binarization


Figure 15. The boxes show faces and text as detected by the AdaBoost log posterior ratio test with a fixed threshold. Observe the false positives due to vegetation, tree structure, and random image patterns. It is impossible to select a threshold which has no false positives and false negatives for this image. As shown in our experiments later, the generative models will remove the false positives and also recover the missing text.

Figure 16. Example of binarization on the detected text.

stage are automatically accepted). But this will not always be the case. We note that binarization gives far better results than alternatives such as edge detection (Canny, 1986).

The binarization algorithm is a variant of one proposed by Niblack (1986). We binarize the image intensity using adaptive thresholding based on an adaptive window size. Adaptive methods are needed because image windows containing text often have shading, shadow, and occlusion. Our binarization method determines the threshold $T_b(v)$ for each pixel $v$ by the intensity distribution of its local window $r(v)$ (centered on $v$):

$$T_b(v) = \mu(I_{r(v)}) + k \cdot \text{std}(I_{r(v)}),$$

where $\mu(I_{r(v)})$ and $\text{std}(I_{r(v)})$ are the intensity mean and standard deviation within the local window. The size of the local window is selected to be the smallest possible window whose intensity variance is above a fixed threshold. The parameter $k = \pm 0.2$, where the $\pm$ allows for cases where the foreground is brighter or darker than the background.
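A minimal NumPy rendering of this adaptive rule is given below; the window growth schedule and the variance threshold are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def niblack_threshold(img, v, min_half=3, max_half=15,
                      var_thresh=100.0, k=0.2):
    """Adaptive threshold T_b(v) = mean + k * std over the smallest
    window centered at v whose intensity variance exceeds var_thresh.
    Use k = -0.2 when the foreground is darker than the background."""
    r, c = v
    for half in range(min_half, max_half + 1):
        win = img[max(0, r - half):r + half + 1,
                  max(0, c - half):c + half + 1]
        if win.var() > var_thresh:
            break
    return win.mean() + k * win.std()

# A pixel is labeled foreground when its intensity exceeds T_b(v);
# sweeping k over {+0.2, -0.2} covers both polarities.
```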

7. Experiments

The image parsing algorithm is applied to a number of outdoor/indoor images. The speed on PCs (Pentium IV) is comparable to segmentation methods such as normalized cuts (Malik et al., 2001) or the DDMCMC algorithm in Tu and Zhu (2002a). It typically runs in 10–20 min. The main portion of the computing time is spent on segmenting the generic patterns and on boundary diffusion (Zhu and Yuille, 1996).

Figures 17, 18, and 19 show some challenging examples which have heavy clutter and shading effects. We present the results in two parts. One shows the segmentation boundaries for generic regions and objects, and the other shows the text and faces detected with


Figure 17. Results of segmentation and recognition on two images. The results are improved compared with the purely bottom-up (AdaBoost) results displayed in Fig. 15.

Figure 18. A close-up look at an image in Fig. 17. The dark glasses are explained by the generic shading model, and so the face model does not have to fit this part of the data. Otherwise the face model would have difficulty because it would try to fit the glasses to eyes. Standard AdaBoost only correctly classifies these faces at the expense of false positives, see Fig. 15. We show two examples of synthesized faces, one (Synthesis 1) with the dark glasses (modelled by shading regions) and the other (Synthesis 2) with the dark glasses removed (i.e. using the generative face model to sample parts of the face (e.g. eyes) obscured by the dark glasses).

text symbols to indicate text recognition, i.e. the letters are correctly read by the algorithm. Then we synthesize images sampled from the likelihood model $p(I \mid W^*)$, where $W^*$ is the parsing graph (the faces, text, region parameters and boundaries) obtained by the parsing algorithm. The synthesized images are used to visualize the parsing graph $W^*$, i.e. the image content that the computer "understands".

In the experiments, we observed that the face and text models improved the image segmentation results by comparison with our previous work (Tu and Zhu, 2002a) which only used generic region models. Conversely, the generic region models improve object detection by removing some false alarms and recovering objects which were not initially detected. We now discuss specific examples.

In Figure 15, we showed two images where the text and faces were detected purely bottom-up using AdaBoost. It was impossible to select a threshold so that our AdaBoost algorithm had no false positives or false negatives. To ensure no false negatives, apart from the '9', we had to lower the threshold and admit false positives due to vegetation and heavy shadows (e.g. the shadow in the sign "HEIGHTS OPTICAL").

The letter '9' was not detected at any threshold. This is because our AdaBoost algorithm was trained to detect text segments, and so did not respond to a single digit.

By comparison, Fig. 17 shows the image parsing results for these two images. We see that the false alarms proposed by AdaBoost are removed because they are better explained by the generic region


Figure 19. Results of segmentation and recognition on outdoor images. Observe the ability to detect faces and text at multiple scales.

models. The generic shading models help object detection by explaining away the heavy shading on the text "HEIGHTS OPTICAL" and the dark glasses on the woman, see Fig. 18. Moreover, the missing digit '9' is now correctly detected. The algorithm first detected it as a generic shading region and then reclassified it as a digit using the sub-kernel that switches node attributes.

The ability to synthesize the image from the parsing graph $W^*$ is an advantage of the Bayesian approach. The synthesis helps illustrate the successes, and sometimes the weaknesses, of the generative models. Moreover, the synthesized images show how much information about the image has been captured by the models. In Table 2, we give the number of variables used in our representation $W^*$ and show that they are roughly proportional to the jpeg bytes. Most of the variables in $W^*$ are used to represent points on the segmentation boundary, and at present they are counted independently. We could reduce the coding length of $W^*$ substantially by encoding the boundary points effectively, for example, using spatial proximity. Image encoding is not the goal of our current work, however, and more sophisticated generative models would be needed to synthesize very realistic images.

Table 2. The number of variables in $W^*$ for each image compared to the JPEG bytes.

Image       Stop     Soccer   Parking   Street   Westwood
jpg bytes   23,998   19,563   23,311    26,170   27,790
|W*|         4,886    3,971    5,013     6,346    9,687


Figure 20. Two mechanisms for constructing the parsing graph. See text for explanation.

8. Discussion

In this section, we describe two challenging technical problems for image parsing. Our current work addresses these issues.

1. Two mechanisms for constructing the parsing graph

In the introduction to this paper we stated that the parsing graph can be constructed in compositional and decompositional modes. The compositional mode proceeds by grouping small elements while the decompositional approach involves detecting an object as a whole and then locating its parts, see Fig. 20.

The compositional mode appears most effective for Fig. 20(a). Detecting the cheetah by bottom-up tests, such as those learnt by AdaBoost, seems difficult owing to the large variability of shape and photometric properties of cheetahs. By contrast, it is quite practical to use Swendsen-Wang Cuts (Barbu and Zhu, 2004) to segment the image and obtain the boundary of the cheetah using a bottom-up compositional approach and a parsing tree with multiple levels. The parsing graph is constructed starting with the pixels as leaves (there are 46,256 pixels in Fig. 20(a)). The next level of the graph is obtained using local image texture similarities to construct graph nodes (113 of them) corresponding to "atomic regions" of the image. Then the algorithm constructs nodes (4 of them) for "texture regions" at the next level by grouping the atomic regions (i.e. each atomic region node will be the child of a texture region node). At each level, we compute a discriminative (proposal) probability for how likely adjacent nodes (e.g. pixels or atomic regions) belong to the same object or pattern. We then apply a transition kernel implementing split and merge dynamics (using the proposals). We refer to Barbu and Zhu (2004) for more detailed discussion.

For objects with little variability, such as the faces shown in Fig. 20(b), we can use bottom-up proposals (e.g. AdaBoost) to activate a node that represents the entire face. The parsing graph can then be constructed downwards (i.e. in the decompositional mode) by expanding the face node to create child nodes for the parts of the face. These child nodes could, in turn, be expanded to grandchild nodes representing finer scale parts. The amount of node expansion can be made adaptive to depend on the resolution of the image. For example, the largest face in Fig. 20(b) is expanded into child nodes, but there is not sufficient resolution to expand the face nodes corresponding to the three smaller faces.

The major technical problem is to develop a mathematical criterion for deciding which mode is most effective for which types of objects and patterns. This would enable the algorithm to adapt its search strategy accordingly.

2. Optimal ordering strategy for tests and kernels

The control strategy of our current image parsing algorithm does not select the tests and sub-kernels in an optimal way. At each time step, the choice of sub-kernel is independent of the current state W (though the choice of where in the graph to apply the sub-kernel will depend on W).


Moreover, bottom-up tests are performed which arenever used by the algorithm.

It would be more efficient to have a control strategy which selects the sub-kernels and tests adaptively, provided the selection process requires low computational cost. We seek an optimal control strategy for selection which is effective for a large set of images and visual patterns. The selection criteria should favour those tests and sub-kernels which maximize the gain in information.

We propose the two information criteria that we described in Section 2.

The first is stated in Theorem 1. It measures the information gained for variable $w$ in the parsing graph by performing a new test $\text{Tst}_+$. The information gain is $\delta(w \| \text{Tst}_+) = KL(p(w \mid I) \| q(w \mid \text{Tst}_t(I))) - KL(p(w \mid I) \| q(w \mid \text{Tst}_t(I), F_+))$, where $\text{Tst}_t(I)$ denotes the previous tests and $KL$ is the Kullback-Leibler divergence.

The second is stated in Theorem 2. It measures the power of a sub-kernel $\mathcal{K}_a$ by the decrease of the KL-divergence, $\delta(\mathcal{K}_a) = KL(p \| \mu_t) - KL(p \| \mu_t \mathcal{K}_a)$. The amount of decrease $\delta_a$ gives a measure of the power of the sub-kernel $\mathcal{K}_a$ when informed by $\text{Tst}_t(I)$.

We also need to take into account the computational cost of the selection procedures. See Blanchard and Geman (2003) for a case study of how to optimally select tests while taking into account their computational costs.

9. Summary and Future Work

This paper introduces a computational framework for parsing images into basic visual patterns. We formulated the problem using Bayesian probability theory and designed a stochastic DDMCMC algorithm to perform inference. Our framework gives a rigorous way to combine segmentation with object detection and recognition. We give proof of concept by implementing a model whose visual patterns include generic regions (texture and shading) and objects (text and faces). Our approach enables these different visual patterns to compete and cooperate to explain the input images.

This paper also provides a way to integrate discriminative and generative methods of inference. Both methods are extensively used by the vision and machine learning communities and correspond to the distinction between bottom-up and top-down processing. Discriminative methods are typically fast but can give sub-optimal and inconsistent results, see Fig. 3. By contrast, generative methods are optimal (in the sense of Bayesian Decision Theory) but can be slow because they require extensive search. Our DDMCMC algorithm integrates both methods, as illustrated in Fig. 8, by using discriminative methods to propose generative solutions.

The goal of our algorithm is to construct a parse graph representing the image. The structure of the graph is not fixed and will depend on the input image. The algorithm proceeds by constructing Markov chain dynamics, implemented by sub-kernels, for different moves to configure the parsing graph, such as creating or deleting nodes, or altering node attributes. Our approach can be scaled up by adding new sub-kernels, corresponding to different vision models. This is similar in spirit to Ullman's concept of "visual routines" (Ullman, 1984). Overall, the ideas in this paper can be applied to any other inference problem that can be formulated as probabilistic inference on graphs.

Other work by our group deals with a related series of visual inference tasks using a similar framework. This includes image segmentation (Tu and Zhu, 2002), curve grouping (Tu and Zhu, 2002a), shape detection (Tu and Yuille, 2004), motion analysis (Barbu and Zhu, 2004), and 3D scene reconstruction (Han and Zhu, 2003). In the future, we plan to integrate these visual modules and develop a general purpose vision system.

Finally, we are working on ways to improve the speed of the image parsing algorithm, as discussed in Section 8. In particular, we expect the use of the Swendsen-Wang cut algorithms (Barbu and Zhu, 2003, 2004) to drastically accelerate the search. We anticipate that this, and other improvements, will reduce the running time of DDMCMC algorithms from 10–20 min (Tu and Zhu, 2002a) to well under a minute.

Appendix A: Proof of Theorem 2

Proof: For notational simplicity, we ignore the dependencies of the kernels and probabilities on the input image I.

Let $\mu_t(W_t)$ be the state probability at time step $t$. After applying a sub-kernel $\mathcal{K}_a(W_{t+1} \mid W_t)$, the state becomes $W_{t+1}$ with probability:

$$\mu_{t+1}(W_{t+1}) = \sum_{W_t} \mu_t(W_t)\, \mathcal{K}_a(W_{t+1} \mid W_t). \tag{17}$$


The joint probability $\mu(W_t, W_{t+1})$ for $W_t$ and $W_{t+1}$ can be expressed in two ways as:

$$\mu(W_t, W_{t+1}) = \mu_t(W_t)\, \mathcal{K}_a(W_{t+1} \mid W_t) = \mu_{t+1}(W_{t+1})\, p_{MC}(W_t \mid W_{t+1}), \tag{18}$$

where $p_{MC}(W_t \mid W_{t+1})$ is the "posterior" probability for state $W_t$ at time step $t$ conditioned on state $W_{t+1}$ at time step $t+1$.

This joint probability $\mu(W_t, W_{t+1})$ can be compared to the joint probability $p(W_t, W_{t+1})$ at equilibrium (i.e. when $\mu_t(W_t) = p(W_t)$). We can also express $p(W_t, W_{t+1})$ in two ways:

$$p(W_t, W_{t+1}) = p(W_t)\, \mathcal{K}_a(W_{t+1} \mid W_t) = p(W_{t+1})\, \mathcal{K}_a(W_t \mid W_{t+1}), \tag{19}$$

where the second equality is obtained using the detailed balance condition on $\mathcal{K}_a$.

We calculate the Kullback-Leibler (KL) divergence between the joint probabilities $p(W_t, W_{t+1})$ and $\mu(W_t, W_{t+1})$ in two ways, using the first and second equalities in Eqs. (18) and (19). We obtain two expressions for the KL divergence:

$$KL(p(W_t, W_{t+1}) \| \mu(W_t, W_{t+1})) = \sum_{W_{t+1}} \sum_{W_t} p(W_t, W_{t+1}) \log \frac{p(W_t, W_{t+1})}{\mu(W_t, W_{t+1})}.$$

Using the first equalities:

$$= \sum_{W_t} p(W_t) \sum_{W_{t+1}} \mathcal{K}_a(W_{t+1} \mid W_t) \log \frac{p(W_t)\, \mathcal{K}_a(W_{t+1} \mid W_t)}{\mu_t(W_t)\, \mathcal{K}_a(W_{t+1} \mid W_t)} = KL(p(W) \| \mu_t(W)). \tag{20}$$

Using the second equalities:

$$= \sum_{W_{t+1}} \sum_{W_t} \mathcal{K}_a(W_t \mid W_{t+1})\, p(W_{t+1}) \log \frac{p(W_{t+1})\, \mathcal{K}_a(W_t \mid W_{t+1})}{\mu_{t+1}(W_{t+1})\, p_{MC}(W_t \mid W_{t+1})}$$
$$= KL(p(W_{t+1}) \| \mu_{t+1}(W)) + E_{p(W_{t+1})}\left[KL(\mathcal{K}_a(W_t \mid W_{t+1}) \| p_{MC}(W_t \mid W_{t+1}))\right]. \tag{21}$$

We equate the two alternative expressions for the KL divergence, using Eqs. (20) and (21), and obtain

$$KL(p(W) \| \mu_t(W)) - KL(p(W) \| \mu_{t+1}(W)) = E_{p(W_{t+1})}\left[KL(\mathcal{K}_a(W_t \mid W_{t+1}) \| p_{MC}(W_t \mid W_{t+1}))\right].$$

This proves that the KL divergence decreases monotonically. The decrease is zero only when $\mathcal{K}_a(W_t \mid W_{t+1}) = p_{MC}(W_t \mid W_{t+1})$ (because $KL(p \| \mu) \ge 0$ with equality only when $p = \mu$). This occurs if and only if $\mu_t(W) = p(W)$ (using the definition of $p_{MC}(W_t \mid W_{t+1})$ given in Eq. (18), and the detailed balance conditions for $\mathcal{K}_a(W_t \mid W_{t+1})$).
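The monotone decrease is easy to check numerically on a toy chain. The sketch below builds a three-state Metropolis kernel, which satisfies detailed balance with respect to p, and prints $KL(p \| \mu_t)$ decreasing at every step, as Theorem 2 predicts; the target and initial distributions are arbitrary illustrative choices.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])        # target (invariant) probability
Q = np.full((3, 3), 1.0 / 3.0)       # symmetric proposal
K = np.zeros((3, 3))                 # Metropolis kernel (detailed balance)
for i in range(3):
    for j in range(3):
        if i != j:
            K[i, j] = Q[i, j] * min(1.0, p[j] / p[i])
    K[i, i] = 1.0 - K[i].sum()       # rejection mass stays on the diagonal

mu = np.array([0.90, 0.05, 0.05])    # initial state probability mu_0
for t in range(10):
    print(t, kl(p, mu))              # strictly decreasing, per Theorem 2
    mu = mu @ K                      # Eq. (17): mu_{t+1} = mu_t K
```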

Appendix B: Classic Markov Chain Convergence Result

Traditional Markov chain convergence analysis is concerned with the total variation between the invariant (target) probability $p(W \mid I)$ and the Markov chain state probability $\mu_t(W)$:

$$\|\mu_t - p\|_{TV} = \sum_{W} |p(W \mid I) - \mu_t(W)|. \tag{22}$$

In a finite state space $\Omega$, a classic result relates the TV-measure to the second largest eigenvalue modulus ($\lambda_{slem} \in [0, 1)$) of the kernel $\mathcal{K}$. The following theorem is adopted from Diaconis and Hanlon (1992).

Theorem 4 (Diaconis and Hanlon, 1992). For an irreducible kernel $\mathcal{K}$ on a finite space with invariant (target) probability $p(W \mid I)$ and initial state $W_0$, at step $t$:

$$\|\mu_t - p\|_{TV} \le \sqrt{\frac{1 - p(W_0 \mid I)}{4\, p(W_0 \mid I)}} \cdot \lambda_{slem}^t. \tag{23}$$

Appendix C: First-Hitting-Time Analysis

We include a second result which bounds the expected first hitting time $E[\tau(W)]$ for an arbitrary state $W$, where $\tau(W)$ is the number of time steps for a Markov chain MC to first visit a state $W \in \Omega$:

$$\tau(W) = \min\{t \ge 1: W_t = W\}.$$


Figure 21. There may be multiple routes between the two states W and W ′ and this creates an additional computational burden for computingthe acceptance rates.

The expected first hitting time has higher resolution than the TV-norm in Theorem 4 or the KL-divergence in Theorem 2, as it is concerned with individual states rather than the entire space $\Omega$. In particular, we are interested in the expected first hitting time for those states $W$ with high probability $p(W)$. The following theorem (Maciuca and Zhu, 2003) shows how an informed proposal $q$ can improve the expected first hitting time for a special class of Markov chains.

Theorem 5 (Maciuca and Zhu, 2003). Let $p(W)$, $W \in \Omega$, be the invariant (target) probability of a Markov chain MC, and let $Q(W, W') = q(W')$ be the proposal probability in the Metropolis-Hastings equation (Eq. (8)). Then

$$\frac{1}{\min\{p(W), q(W)\}} \le E[\tau(W)] \le \frac{1}{\min\{p(W), q(W)\}} \cdot \frac{1}{1 - \|p - q\|_{TV}}, \tag{24}$$

where $\|p - q\|_{TV} = \frac{1}{2} \sum_{x \in \Omega} |p(x) - q(x)| \le 1$.

In this Markov chain design, the proposal probability depends only on $W'$, not on $W$, and the resulting chain is called the Metropolized independence sampler (MIS). A similar result can also be obtained for the Metropolized Gibbs sampler (MGS). This result shows that we should choose the proposal probability $q$ so that it overlaps with $p$ and so that $q(W)$ is large for those states $W$ with high probability $p(W)$. By contrast, choosing $q$ to be the uniform probability, $q(W) = 1/|\Omega|\ \forall W \in \Omega$, will result in expected first hitting times greater than $|\Omega|$ for any state with $p(W) > 1/|\Omega|$.
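The bounds of Eq. (24) can be probed empirically. The sketch below simulates a Metropolized independence sampler on a three-state space and averages the first hitting time; the distributions are illustrative, and the random start is only a rough stand-in for the theorem's exact setting.

```python
import random

def mis_hitting_time(p, q, target, rng):
    """First hitting time of `target` for a Metropolized independence
    sampler: propose j ~ q (independent of the current state) and accept
    with probability min(1, p(j)q(i) / (p(i)q(j)))."""
    n = len(p)
    state = rng.randrange(n)
    t = 0
    while True:
        t += 1
        j = rng.choices(range(n), weights=q)[0]
        if rng.random() < min(1.0, (p[j] * q[state]) / (p[state] * q[j])):
            state = j
        if state == target:
            return t

p = [0.6, 0.3, 0.1]                  # target probability
q = [0.5, 0.3, 0.2]                  # informed proposal overlapping p
rng = random.Random(0)
mean_tau = sum(mis_hitting_time(p, q, 0, rng) for _ in range(2000)) / 2000.0
# Eq. (24) brackets E[tau]: here 1/min{p(W), q(W)} = 2 and
# 2 / (1 - ||p - q||_TV) = 2 / 0.9, since ||p - q||_TV = 0.1.
```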

Appendix D: Multiple Routes in the Sub-Kernels

This section addresses a practical issue about detailed balance which leads to a trade-off between computation and efficiency in the algorithm design. For notational simplicity, we write terms like $Q(W' \mid W : I)$ and $\mathcal{K}(W' \mid W : I)$ as $Q(W, W')$ and $\mathcal{K}(W, W')$ in this section.

In some situations, the sub-kernels can give multiple routes between states $W$ and $W'$. For example, one region can be split into the same two sub-regions following different bottom-up proposals. In this section, we show that detailed balance can be maintained either by using extra computation to integrate out the multiple routes, or by treating the alternative routes separately using different sub-kernels, but at the price of reduced efficiency.

Suppose there are $n$ pairs of routes $\{(q(i)\, Q_i(W, W'),\ q(i)\, Q_i(W', W)): i = 1, \ldots, n\}$ between two states $W$ and $W'$, such as the two pairs of moves in Fig. 21(a). We represent the transition kernel by

$$\mathcal{K}(W, W') = Q(W, W')\, \alpha(W, W'),$$

where

$$Q(W, W') = \sum_{i=1}^{n} q(i)\, Q_i(W, W'), \tag{25}$$

and

$$\alpha(W, W') = \min\left(1, \frac{Q(W', W)\, p(W')}{Q(W, W')\, p(W)}\right).$$

Every time a move from $W$ to $W'$ is considered we have to consider all possible routes, which can result in heavy computation.

Instead, we can treat each route as a sub-kernel as described in Section 2.3. Then the Markov chain satisfies detailed balance because each pair $\mathcal{K}_i(W, W')$ and $\mathcal{K}_i(W', W)$ is specified by Metropolis-Hastings (Eq. (8)) and hence obeys the detailed balance equation $p(W)\, \mathcal{K}_i(W, W') = p(W')\, \mathcal{K}_i(W', W)$, $i = 1, \ldots, n$.


This reduces the computational cost. But, as the following theorem shows, this reduction comes at the price of efficiency.

Theorem 6. The kernels $\mathcal{K}(W, W')$ and $\mathcal{K}(W', W)$ as computed with Eq. (25) satisfy

$$\mathcal{K}(W, W') \ge \sum_{i=1}^{n} q(i)\, \mathcal{K}_i(W, W'), \quad \text{and} \quad \mathcal{K}(W', W) \ge \sum_{i=1}^{n} q(i)\, \mathcal{K}_i(W', W), \qquad \forall\, W' \neq W.$$

This theorem says that the Markov chain which obeys detailed balance for each pair of moves is less effective than the one which combines all the routes. Markov chain design must balance the computational cost of evaluating all the proposals against the effectiveness of the Markov kernel. The situation shown in Fig. 21(b) can be treated as a special case of Fig. 21(a).

Acknowledgments

This work was supported by an NIH (NEI) grant RO1-EY 012691-04, an NSF SGER grant IIS-0240148, an NSF grant IIS-0244763 and an ONR grant N00014-02-1-0952. The authors thank the Smith-Kettlewell research institute for providing us with text training images. We thank Yingnian Wu for stimulating discussions on the Markov chain convergence theorems.

Notes

1. Unlike most graphical inference algorithms in the literature, which assume fixed graphs, such as belief propagation (Yedidia et al., 2001).
2. For many natural images the posterior probabilities P(W | I) are strongly peaked and so fair samples are close to the posterior maximum arg max P(W | I). So in this paper we do not distinguish between sampling and inference (optimization).
3. Recently the term "discriminative model" has been extended to cover almost any approximation to the posterior distribution P(W | I), e.g. Kumar and Hebert (2003). We will use "discriminative model" in its traditional sense of categorization.
4. The optimal approximation occurs when q(w | Tst(I)) equals the probability p(w | Tst(I)) induced by P(I | W)P(W).
5. We choose stochastic dynamics because the Markov chain probability is guaranteed to converge to the posterior P(W | I). The complexity of the problem means that deterministic algorithms for implementing these moves risk getting stuck in local minima.
6. Algorithms like belief propagation (Yedidia et al., 2001) can be derived as approximations to this update equation by using a Gibbs sampler and making independence assumptions (Yuille, 2004).
7. Except for one sub-kernel that evolves region boundaries.
8. The boundary can include an "internal boundary" if there is a hole inside the image region explained by a different visual pattern.
9. Except for the boundary evolution sub-kernel, which will be described separately, see Section 5.1.

References

Barbu, A. and Zhu, S.C. 2003. Graph partition by Swendsen-Wang cut. In Proc. of Int'l Conf. on Computer Vision, Nice, France.
Barbu, A. and Zhu, S.C. 2004. Multi-grid and multi-level Swendsen-Wang cuts for hierarchic graph partition. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, Washington DC.
Barnard, K. and Forsyth, D.A. 2001. Learning the semantics of words and pictures. ICCV.
Belongie, S., Malik, J., and Puzicha, J. 2002. Shape matching and object recognition using shape contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24:509–522.
Bienenstock, E., Geman, S., and Potter, D. 1997. Compositionality, MDL priors, and object recognition. NIPS.
Blanchard, G. and Geman, D. 2003. Hierarchical testing designs for pattern recognition. Technical report, Math. Science, Johns Hopkins University.

Bremaud, P. 1999. Markov Chains: Gibbs Fields, Monte Carlo Simulation and Queues. Springer (Chapter 6).
Bowyer, K.W., Kranenburg, C., and Dougherty, S. 2001. Edge detector evaluation using empirical ROC curves. Computer Vision and Image Understanding, 84(1):77–103.
Canny, J. 1986. A computational approach to edge detection. IEEE Trans. on PAMI, 8(6).
Chen, X. and Yuille, A.L. 2004. AdaBoost learning for detecting and reading text in city scenes. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, Washington DC.
Cootes, T.F., Edwards, G.J., and Taylor, C.J. 2001. Active appearance models. IEEE Trans. on PAMI, 23(6).
Comaniciu, D. and Meer, P. 1999. Mean shift analysis and applications. In Proc. of ICCV.
Cover, T.M. and Thomas, J.A. 1991. Elements of Information Theory. John Wiley and Sons, Inc.: NY, pp. 33–36.

Dayan, P., Hinton, G., Neal, R., and Zemel, R. 1995. The Helmholtz machine. Neural Computation, 7:889–904.
Diaconis, P. and Hanlon, P. 1992. Eigenanalysis for some examples of the Metropolis algorithm. Contemporary Mathematics, 138:99–117.
Drucker, H., Schapire, R., and Simard, P. 1993. Boosting performance in neural networks. Int'l J. Pattern Rec. and Artificial Intelligence, 7(4).
Fowlkes, C. and Malik, J. 2004. How much does globalization help segmentation? CVPR.
Freund, Y. and Schapire, R. 1996. Experiments with a new boosting algorithm. In Proc. of 13th Int'l Conference on Machine Learning.
Friedman, J., Hastie, T., and Tibshirani, R. 1998. Additive logistic regression: A statistical view of boosting. Dept. of Statistics, Stanford Univ., Technical Report.
Geman, S. and Huang, C.R. 1986. Diffusion for global optimization. SIAM J. on Control and Optimization, 24(5).
Green, P.J. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732.


Hallinan, P., Gordon, G., Yuille, A., Giblin, P., and Mumford, D. 1999. Two and Three Dimensional Patterns of the Face. A.K. Peters.
Han, F. and Zhu, S.C. 2003. Bayesian reconstruction of 3D shapes and scenes from a single image. In Proc. Int'l Workshop on High Level Knowledge in 3D Modeling and Motion, Nice, France.
Hastings, W.K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.
Klein, D. and Manning, C.D. 2002. A generative constituent-context model for improved grammar induction. In Proc. of 40th Annual Meeting of the Assoc. for Computational Linguistics.
Konishi, S., Coughlan, J.M., Yuille, A.L., and Zhu, S.C. 2003. Statistical edge detection: Learning and evaluating edge cues. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(1):57–74.
Kumar, S. and Hebert, M. 2003. Discriminative random fields. In Proc. of Int'l Conf. on Computer Vision, Nice, France.
Li, F.F., VanRullen, R., Koch, C., and Perona, P. 2003. Rapid natural scene categorization in the near absence of attention. In Proc. of National Academy of Sciences, 99(14).

Liu, J.S. 2001. Monte Carlo Strategies in Scientific Computing. Springer.
Lowe, D.G. 2003. Distinctive image features from scale-invariant keypoints. IJCV.
Maciuca, R. and Zhu, S.C. 2003. How do heuristics expedite Markov chain search? In Proc. of 3rd Workshop on Statistical and Computational Theory for Vision.
Malik, J., Belongie, S., Leung, T., and Shi, J. 2001. Contour and texture analysis for image segmentation. Int'l Journal of Computer Vision, 43(1).
Manning, C.D. and Schütze, H. 2003. Foundations of Statistical Natural Language Processing. MIT Press.
Marr, D. 1982. Vision. W.H. Freeman and Co., San Francisco.
Martin, D., Fowlkes, C., Tal, D., and Malik, J. 2001. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. of 8th Int'l Conference on Computer Vision.
Metropolis, N., Rosenbluth, M.N., Rosenbluth, A.W., Teller, A.H., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1092.

Moghaddam, B. and Pentland, A. 1997. Probabilistic visual learning for object representation. IEEE Trans. PAMI, 19(7).
Mumford, D.B. 1995. Neuronal architectures for pattern-theoretic problems. In Large-Scale Neuronal Theories of the Brain, C. Koch and J.L. Davis (Eds.). MIT Press: A Bradford Book.
Murphy, K., Torralba, A., and Freeman, W.T. 2003. Using the forest to see the trees: A graphical model relating features, objects and scenes. NIPS.
Niblack, W. 1986. An Introduction to Digital Image Processing. Prentice Hall, pp. 115–116.
Phillips, P.J., Wechsler, H., Huang, J., and Rauss, P. 1998. The FERET database and evaluation procedure for face recognition algorithms. Image and Vision Computing Journal, 16(5).
Ponce, J., Lazebnik, S., Rothganger, F., and Schmid, C. 2004. Toward true 3D object recognition. Reconnaissance de Formes et Intelligence Artificielle, Toulouse, FR.
Schapire, R.E. 2002. The boosting approach to machine learning: An overview. MSRI Workshop on Nonlinear Estimation and Classification.

Shi, J. and Malik, J. 2000. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8).
Thorpe, S., Fize, D., and Marlot, C. 1996. Speed of processing in the human visual system. Nature, 381(6582):520–522.
Treisman, A. 1986. Features and objects in visual processing. Scientific American.
Tu, Z. and Zhu, S.C. 2002. Image segmentation by data-driven Markov chain Monte Carlo. IEEE Trans. PAMI, 24(5):657–673.
Tu, Z.W. and Zhu, S.C. 2002. Parsing images into regions, curves and curve groups. Int'l Journal of Computer Vision (under review). A short version appeared in the Proc. of ECCV.
Tu, Z.W. and Yuille, A.L. 2004. Shape matching and recognition: Using generative models and informative features. In Proceedings European Conference on Computer Vision, ECCV'04, Prague.
Turk, M. and Pentland, A. 1991. Eigenfaces for recognition. J. of Cognitive Neurosciences, 3(1):71–86.
Ullman, S. 1984. Visual routines. Cognition, 18:97–159.
Ullman, S. 1995. Sequence seeking and counterstreams: A model for bidirectional information flow in the cortex. In Large-Scale Neuronal Theories of the Brain, C. Koch and J.L. Davis (Eds.). MIT Press: A Bradford Book.

Viola, P. and Jones, M. 2001. Fast and robust classification using asymmetric AdaBoost and a detector cascade. In Proc. of NIPS01.
Weber, M., Welling, M., and Perona, P. 2000. Towards automatic discovery of object categories. In Proc. of CVPR.
Wu, J., Rehg, J.M., and Mullin, M.D. 2004. Learning a rare event detection cascade by direct feature selection. NIPS.
Yedidia, J.S., Freeman, W.T., and Weiss, Y. 2001. Generalized belief propagation. In Advances in Neural Information Processing Systems, 13:689–695.
Yuille, A.L. 2004. Belief propagation and Gibbs sampling. Submitted to Neural Computation.
Zhu, S.C. and Yuille, A.L. 1996. Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Trans. PAMI, 18(9).
Zhu, S.C., Zhang, R., and Tu, Z.W. 2000. Integrating top-down/bottom-up for object recognition by data-driven Markov chain Monte Carlo. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, Hilton Head, SC.