nonparametric bayesian models for unsupervised …vised scene analysis. a first family of...

8
Nonparametric Bayesian Models for Unsupervised Scene Analysis and Reconstruction Dominik Joho, Gian Diego Tipaldi, Nikolas Engelhard, Cyrill Stachniss, Wolfram Burgard Department of Computer Science, University of Freiburg, Germany {joho, tipaldi, engelhar, stachnis, burgard}@informatik.uni-freiburg.de Abstract—Robots operating in domestic environments need to deal with a variety of different objects. Often, these objects are neither placed randomly, nor independently of each other. For example, objects on a breakfast table such as plates, knives, or bowls typically occur in recurrent configurations. In this paper, we propose a novel hierarchical generative model to reason about latent object constellations in a scene. The proposed model is a combination of Dirichlet processes and beta processes, which allow for a probabilistic treatment of the unknown dimensionality of the parameter space. We show how the model can be employed to address a set of different tasks in scene understanding ranging from unsupervised scene segmentation to completion of a par- tially specified scene. We describe how sampling in this model can be done using Markov chain Monte Carlo (MCMC) techniques and present an experimental evaluation with simulated as well as real-world data obtained with a Kinect camera. I. I NTRODUCTION Imagine a person laying a breakfast table and the person gets interrupted so that she cannot continue with the breakfast preparation. A service robot such as the one depicted in Fig. 1 should be able to proceed laying the table without receiving specific instructions. It faces a series of challenges: how to infer the total number of covers, how to infer which objects are missing on the table, and how should the missing parts be arranged. For this, the robot should not require any user- specific pre-programmed model but should ground its decision based on the breakfast tables it has seen in the past. In this paper, we address the problem of scene under- standing given a set of unlabeled examples and generating a plausible configuration from a partially specified scene. The key contribution of this paper is the definition of a novel hierarchical nonparametric Bayesian model to represent the scene structure in terms of object groups and their spatial configuration. We show how to infer the scene structure in an unsupervised fashion by using Markov chain Monte Carlo (MCMC) techniques to sample from the posterior distribution of the latent variables given the observations. In our model, each scene contains an unknown number of latent meta-objects. As illustrated in Fig. 2, a meta-object is a collection of parts. In the breakfast table example, a place cover can be seen as a latent meta-object of a certain type that, for example, generates the observable objects plate, knife, and mug. A meta-object of a different type might generate a cereal bowl and a spoon. It is important to note that not all instances of the same meta-object are identical. Instances differ in the sense that some parts may be missing and that the individual parts may not be arranged in the same way. Fig. 1. A scene typically contains several observable objects and the task is to infer the latent meta-objects where a meta-object is considered to be a constellation of observable objects. At a breakfast table, for example, the meta-objects might be the covers that consist of the observable objects plate, knife, fork, and cup. When specifying a generative model for our problem, we have the difficulty that the dimensionality of the model is part of the learning problem. This means, that besides learning the parameters of the model, like the pose of a meta-object, we additionally need to infer the number of involved meta- objects, parts, etc. The standard solution would be to follow the model selection approaches, for example, learning several models and then choosing the best one. Such a comparison is typically done by trading off the data likelihood with the model complexity as, for example, done for the Bayesian information criterion (BIC). The problem with this approach is the huge number of possible models, which renders this approach intractable in our case. To avoid this complexity, we follow another approach, motivated by recent developments in the field of hierarchical nonparametric Bayesian models based on the Dirichlet process and the beta process. These models are able to adjust their dimensionality according to the given data, thereby sidestep- ping the need to select among several finite-dimensional model alternatives. Based on a prior over scenes, learned upon observations of training examples, the model can be used for parsing new scenes or completing partially specified scenes by sampling the missing objects. As a motivating example, we consider learning the object constellations on a breakfast table as depicted in Fig. 1. Our model, however, is general and not restricted to this scenario.

Upload: others

Post on 10-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Nonparametric Bayesian Models for Unsupervised …vised scene analysis. A first family of approaches, and the most related to our model, employs nonparametric Bayesian models to infer

Nonparametric Bayesian Models forUnsupervised Scene Analysis and Reconstruction

Dominik Joho, Gian Diego Tipaldi, Nikolas Engelhard, Cyrill Stachniss, Wolfram BurgardDepartment of Computer Science, University of Freiburg, Germany{joho, tipaldi, engelhar, stachnis, burgard}@informatik.uni-freiburg.de

Abstract—Robots operating in domestic environments need todeal with a variety of different objects. Often, these objects areneither placed randomly, nor independently of each other. Forexample, objects on a breakfast table such as plates, knives, orbowls typically occur in recurrent configurations. In this paper,we propose a novel hierarchical generative model to reason aboutlatent object constellations in a scene. The proposed model isa combination of Dirichlet processes and beta processes, whichallow for a probabilistic treatment of the unknown dimensionalityof the parameter space. We show how the model can be employedto address a set of different tasks in scene understanding rangingfrom unsupervised scene segmentation to completion of a par-tially specified scene. We describe how sampling in this model canbe done using Markov chain Monte Carlo (MCMC) techniquesand present an experimental evaluation with simulated as wellas real-world data obtained with a Kinect camera.

I. INTRODUCTION

Imagine a person laying a breakfast table and the persongets interrupted so that she cannot continue with the breakfastpreparation. A service robot such as the one depicted in Fig. 1should be able to proceed laying the table without receivingspecific instructions. It faces a series of challenges: how toinfer the total number of covers, how to infer which objectsare missing on the table, and how should the missing partsbe arranged. For this, the robot should not require any user-specific pre-programmed model but should ground its decisionbased on the breakfast tables it has seen in the past.

In this paper, we address the problem of scene under-standing given a set of unlabeled examples and generatinga plausible configuration from a partially specified scene. Thekey contribution of this paper is the definition of a novelhierarchical nonparametric Bayesian model to represent thescene structure in terms of object groups and their spatialconfiguration. We show how to infer the scene structure inan unsupervised fashion by using Markov chain Monte Carlo(MCMC) techniques to sample from the posterior distributionof the latent variables given the observations.

In our model, each scene contains an unknown number oflatent meta-objects. As illustrated in Fig. 2, a meta-object isa collection of parts. In the breakfast table example, a placecover can be seen as a latent meta-object of a certain type that,for example, generates the observable objects plate, knife, andmug. A meta-object of a different type might generate a cerealbowl and a spoon. It is important to note that not all instancesof the same meta-object are identical. Instances differ in thesense that some parts may be missing and that the individualparts may not be arranged in the same way.

Fig. 1. A scene typically contains several observable objects and the taskis to infer the latent meta-objects where a meta-object is considered to bea constellation of observable objects. At a breakfast table, for example, themeta-objects might be the covers that consist of the observable objects plate,knife, fork, and cup.

When specifying a generative model for our problem, wehave the difficulty that the dimensionality of the model is partof the learning problem. This means, that besides learningthe parameters of the model, like the pose of a meta-object,we additionally need to infer the number of involved meta-objects, parts, etc. The standard solution would be to followthe model selection approaches, for example, learning severalmodels and then choosing the best one. Such a comparisonis typically done by trading off the data likelihood with themodel complexity as, for example, done for the Bayesianinformation criterion (BIC). The problem with this approachis the huge number of possible models, which renders thisapproach intractable in our case.

To avoid this complexity, we follow another approach,motivated by recent developments in the field of hierarchicalnonparametric Bayesian models based on the Dirichlet processand the beta process. These models are able to adjust theirdimensionality according to the given data, thereby sidestep-ping the need to select among several finite-dimensional modelalternatives. Based on a prior over scenes, learned uponobservations of training examples, the model can be used forparsing new scenes or completing partially specified scenes bysampling the missing objects.

As a motivating example, we consider learning the objectconstellations on a breakfast table as depicted in Fig. 1. Ourmodel, however, is general and not restricted to this scenario.

Page 2: Nonparametric Bayesian Models for Unsupervised …vised scene analysis. A first family of approaches, and the most related to our model, employs nonparametric Bayesian models to infer

Fig. 2. A meta-object is represented as a collection of parts. Each part hasGaussian distribution, a multinomial distribution over the observable objecttypes, and a binary activation probability.

II. RELATED WORK

In this section we describe the relevant works on unsuper-vised scene analysis. A first family of approaches, and themost related to our model, employs nonparametric Bayesianmodels to infer spatial relations. Sudderth et al. [15] introducedthe transformed Dirichlet process, a hierarchical model whichshares a set of stochastically transformed clusters amonggroups of data. The model has been applied to improveobject detection in visual scenes, given the ability of themodel to reason about the number of objects and their spatialconfiguration. Following his paper, Austerweil and Griffiths[2] presented the transformed Indian buffet process, wherethey model both the features representing an object and theirposition relative to the object center. Moreover, the set oftransformations that can be applied to a feature depends onthe objects context.

A complementary approach is the use of constellationmodels [3, 4, 11]. Those models explicitly consider partsand structure of objects and can be learnt efficiently and ina semisupervised manner. The more successful constellationmodel proposed is the star model [4], a sparse representationof the object consisting of a star topology configuration ofparts modeling the output of a variety of feature detectors. Themain limitations of those methods, however, lies in the fact thatthe number of objects and parts must be defined beforehandand thus cannot be trivially used for scene understanding andobject discovery. Ranganathan and Dellaert [11] used a 3Dconstellation model for representing indoor environments asobject constellations. Another related approach to constellationmodels is presented in [10], where a hierarchical rule-basedmodel is used to capture spatial relations. They also employeda star constellation model, and developed an expectationmaximization (EM)-style algorithm to infer the structure andthe labels of the objects and parts.

Another family of approaches relies on discriminative learn-ing and unsupervised model selection techniques. An approachto automatic discovery of object part representation has beendeveloped in [13]. In their work, the authors introduced alatent conditional random field (CRF) based on a flexibleassembly of parts. Individual part labels are modeled as hiddennodes and a modified version of the EM algorithm has been

developed for learning the pairwise structure of the underlyinggraphical model. Triebel et al. [19] presented a unsupervisedapproach to segment 3D range scan data and to discoverobjects by the frequency of their parts appearance. The datais first segmented using graph-based clustering, then eachsegment is treated as a potential object part. The authors thenused CRFs to represent both the part graph to model theinterdependence of parts with object labels, and a scene graphto smooth the object labels. Spinello et al. [14] proposed anunsupervised detection technique based on a voting scheme ofimage descriptors. They introduced the concept of latticelets:a minimal set of pairwise relations that can generalize thepatterns connectivity. Conditional random fields are then usedto couple low level detections with high level scene descrip-tion. Jiang et al. [8] used an undirected graphical model toinfer the best placement for multiple objects in a scene. Theirmodel considers several features of object configurations, suchas stability, stacking, and semantic preferences.

A different approach is the one of Fidler and Leonardis[5]. They proposed an approach to constructing a hierarchicalrepresentation of visual input. Their approach uses a bottomup strategy, where they learn the statistically most significantcompositions of simple features with respect to more complexhigher level ones. Parts are learned sequentially, layer afterlayer. Separate classification and grouping technique are usedfor the bottom and top layers to account for the numericaldifference (sensor data) and semantical ones (object category).

However, to the best of the our knowledge, no previousmodel is able to deal with a nested hierarchy of differentobjects and group types and being flexible enough to allowfor both scene understanding and generation.

III. GENERATIVE SCENE MODEL

In this section, we describe the proposed generative scenemodel. We assume that the reader is familiar with the basicsof Bayesian nonparametric models [6], especially with theChinese restaurant process (CRP) and the Dirichlet process(DP) [16], the (two-parameter) Indian buffet process (IBP) andthe beta process (BP) [7, 18], and the concepts of hierarchical[17] and nested [12] processes in this context.

In the following, we consider a scene as a collection ofobservable objects represented as labeled points in the 2Dplane. The 2D asssumption is due to our motivation to modeltable scenes. However, the model is not specifically gearedtowards 2D data and could in principle also be applied to3D data. We assume that each scene contains an unknownnumber of latent meta-objects, which is a collection of parts,and that we need to infer the number of latent meta-objectswhen learning the model. The observable objects of a scenecan be grouped into clusters and each cluster corresponds toan instance of a meta-object that generated the observableobjects of its cluster. Each part generates at most one observ-able object. A part consists of a Gaussian distribution thatconstrains the relative position of an observable object withrespect to a reference frame that is defined by the meta-object.Further, each part is associated with a multinomial distribution

Page 3: Nonparametric Bayesian Models for Unsupervised …vised scene analysis. A first family of approaches, and the most related to our model, employs nonparametric Bayesian models to infer

Scene 1

IBP

CRP

IBP

IBP

Scene 2

IBP IBP

IBP

Fig. 3. Basic structure of our model: The Indian buffet processes (IBP) on topare nested within the tables (represented as circles) of the Chinese restaurantprocess (CRP) below them. The blocks within the IBP frames represent therelevant part of the IBP matrix (see Fig. 4). Clusters in the scenes are meta-objects and objects (colored points) of the clusters need to be associated toentries in the IBPs shown on top. This is explicitly shown for one cluster inthe first scene. For visibility reasons, the rest of the associations are drawnas a single thick line. At the lowest level there are the IBPs that draw themeta-objects and noise objects for each scene.

over the type of the observable objects to be generated atthis relative location and, lastly, a part is associated with abinary activation probability that specifies how likely it is thatthis part generates the object when the meta-object is beinginstantiated in a scene. A prototypical meta-object is illustratedin Fig. 2. Please note, that we will use the terms “meta-object”and “cover” interchangeably.

A. Description of the Generative Process

Following the example given in the introduction, imaginethat our robot’s goal is to set a table for a typical familybreakfast. At the beginning, it enters a room with an emptybreakfast table and an infinite number of side tables, each hold-ing a prototypical cover. It estimates the area A of the breakfasttable surface and boldly decides that n ∼ Poisson(Aλ) coversare just right. The robot chooses one of the side tables andfinds a note with the address of a Chinese restaurant whereit can get the cover. Arriving there, it sees again infinitelymany tables each corresponding to a particular cover typeand each displaying a count of how often someone took acover from this table. It is fine with just about any cover typeand decides randomly based on the counts displayed at thetables, even considering a previously unvisited table. At thattable there is another note redirecting to an Indian restaurant.In this restaurant, it is being told that the cover needs to beassembled by choosing the parts that make up this cover. Therobot randomly selects the parts based on their popularity andeven considers to use a few parts no one has ever used before.For each chosen part its relative position is sampled fromthe part’s Gaussian distribution and then the robot samples

Fig. 4. A more detailed view on an IBP representing a meta-object type.This would correspond to one of the IBPs at the very top in Fig. 3. The rowsrepresent customers and the columns represent dishes (parts). The customersof this IBP are the meta-object instances associated to this meta-object typein any of the scenes. The objects of the instances must be associated to oneof the parts. They thereby update the posterior predictive distribution (shownat the bottom) over the objects of a new customer. Remember, that the actualpart parameters are integrated out due to the usage of conjugate priors.

the object from the part’s multinomial distribution over types(plate, knife, fork, etc.). Having assembled the cover this way,the robot returns to the breakfast table and puts it randomlyon it. The process is then repeated for the remaining n − 1covers. See Fig. 3 for an overview of the model structure.

More formally, we have a hierarchical model with a high-level Dirichlet process DPt, a low-level beta process BPc anda further independent beta process BPε. First, we draw Gtfrom the high-level DPt

Gt ∼ DPt(αt,BPp(cp, αp,Dir×NW)), (1)

where the base distribution of DPt is a beta process BPpmodeling the parts’ parameters. This is done only once andall scenes to be generated will make use of the same draw Gt.This draw describes the distribution over all possible covertypes and corresponds to the Chinese restaurant mentionedabove. The base distribution of the beta process is the priordistribution over the part parameters. The parameters are a 2DGaussian distribution over the relative location and a multi-nomial over the observable object types. The parameters aresampled independently and we use their conjugate priors in thebase distribution, i.e., a (symmetric) Dirichlet distribution Dirfor the multinomial and the normal-Wishart distribution [9]NW for the 2D Gaussian. The mass parameter αp of BPpis our prior over the number of activated parts of a singlecover instance and the concentration parameter cp influencesthe total number of instantiated parts across all instances of thesame type. Likewise, the parameter αt influences the expectednumber of cover types. Each scene s has its own meta-object(cover) IBP and the cover instances are determined by a

Page 4: Nonparametric Bayesian Models for Unsupervised …vised scene analysis. A first family of approaches, and the most related to our model, employs nonparametric Bayesian models to infer

single draw from the corresponding beta-Bernoulli process asfollows:

G(s)c ∼ BPc(1, |As|αc, Gt × U(As × [−π, π]))

(2)

{Gtj , Tj}j ∼ BeP(G(s)c ) (3)

{µk,Σk,γk}k ∼ BeP(Gtj ) for each j (4){x,ω} ∼ p(z | µk,Σk,γk, Tj) for each k (5)

In Eq. (2), the concentration parameter is irrelevant and arbi-trarily set to one. The mass parameter |As|αc is the expectednumber of cover instances in a scene. The base distribution ofBPc samples the parameters of a cover j: its type tj and itspose Tj . The type tj is drawn from the distribution over covertypes Gt from Eq. (1). The pose Tj is drawn from a uniformdistribution U over the pose space As × [−π, π] where As isthe table area. Each atom selected by the Bernoulli processin Eq. (3) corresponds to a cover instance and this selectionprocess corresponds to the side-table metaphor mentionedabove. The cover type tj basically references a draw Gtjfrom the nested beta process in Eq. (1) which models theparts of a cover type. Thus, in Eq. (4) we need another drawfrom a Bernoulli process to sample the activated parts for thiscover instance, which yields the Gaussians (µk,Σk) and themultinomials (γk) for each active part k. Finally, in Eq. (5)we draw the actual observable data from the data distributionas realizations from the multinomials and the (transformed)Gaussians, each yielding an object z = {x, ω} on the tablewith location x and type ω.

Each scene has an additional independent beta process BPε

G(s)ε ∼ BPε(1, αε,M × U(As)) (6)

{xi, ωi}i ∼ BeP(G(s)ε ) (7)

that directly samples objects (instead of covers / meta-objects)at random locations in the scene. Here, M is a multinomialover the observable object types and U [As] is the uniformdistribution over the table area As. This beta-Bernoulli processwill mainly serve as a “noise model” during MCMC inferenceto account for yet unexplained objects in the scene. Accord-ingly, we set the parameter αε to a rather low value to penalizescenes with many unexplained objects. Figs. 3 and 4 illustratethe overall structure of the model.

B. Posterior Inference in the Model

In this section, we describe how to sample from the posteriordistribution over the latent variables {C,a} given the obser-vations z. We use the following notation. The observations arethe objects of all scenes and a single object zi = {xi, ωi} hasa 2D location xi on the table and a discrete object type ωi.A meta-object instance j has parameters Cj = {Tj , tj ,dj},where Tj denotes the pose and tj denotes the meta-objecttype, which is an index to a table in the type CRP, and djdenotes part activations and associations to the observableobjects. If dj,k = 0 then part k of meta-object instance jis inactive, where k is an index to a dish in the correspondingpart IBP (which is nested in the CRP table tj). Otherwise, the

part is active and dj,k 6= 0 is a reference to the associatedobservable object, i.e., dj,k = i if it generated the observationzi. Next, for each scene we have the associations a to the noiseprocess, i.e., the list of objects currently not associated to anymeta-object instance. For ease of notation, we will use indexfunctions [·] in an intuitive way, for example, t[−j] denotesall type assignments except the type assignment tj of meta-object j, and T[tj ] denotes all poses of meta-objects that havethe same type tj as meta-object j, and z[d] are all observationsreferenced by the associations d, etc.

We employ Metropolis-Hastings (MH) steps to sample fromthe posterior [1], which allows for big steps in the state spaceby updating several strongly correlated variables at a time.In the starting state of the Markov chain, all objects areassigned to the noise IBP of their respective scene and thus areinterpreted as yet unexplained objects. The sampler is run fora fixed yet sufficiently high number of iterations to be sure thatthe Markov chain converged. We use several different typesof MH moves, which we will explain in the following afterdescribing the joint probability.

Joint Probability: p(C,a, z). The joint probability is

p(C,a, z) =

(S∏s=1

p(ns,ε)p(ns,m)

)p(t)

(nt∏t=1

p(d[t] | t)

)

p(T)

(nt∏t=1

Kt∏k=1

p(z[d[t,k]] | T[t],d[t,k], t)

). (8)

Here, ns,m is the number of meta-object instances and ns,εis the number of noise objects in scene s. Each dish param-eter of the noise IBP directly corresponds to the parameterszi = {xi, ωi} of the associated noise object, as we assumedthat there is no data distribution associated with these dishes.The base distribution for the dish parameters consists of inde-pendent and uniform priors over the table area and the objecttypes, and so each dish parameter has probability (nω |As|)−1,where nω is the number of observable object types and |As|is the area of the table in scene s. Thus, the probability of theobjects z[a[s]] associated to a scene’s noise IBP only dependson the number of noise objects and not on the particular type orposition of these objects. However, it would be straightforwardto use a non-uniform base distribution. Denoting the Poissondistribution with mean λ as Poi(· | λ) we thus have

p(ns,ε) = p(z[a[s]],a[s]) = ns,ε! Poi(ns,ε | αε)(nω |As|)−ns,ε .(9)

Next, p(ns,m) = ns,m! Poi(ns,m | αm |As|) is the priorprobability for having ns,m meta-objects in scene s. The dishparameters of a meta-object IBP are the meta-object param-eters Cj , consisting of the pose, type, and part activations.The probability for sampling a pose p(Tj) = (|As| 2π)−1

is uniform over the table surface and uniform in orientation,hence p(T) =

∏Ss=1(|As| 2π)−ns,m . Next, p(t) is the CRP

prior for the meta-object types of all meta-object instances.The factors p(d[t] | t) are the IBP priors for the part activationsfor all meta-object instances of a certain type t and there are ntdifferent types instantiated in total. During MCMC inference,

Page 5: Nonparametric Bayesian Models for Unsupervised …vised scene analysis. A first family of approaches, and the most related to our model, employs nonparametric Bayesian models to infer

we will exclusively need the conditional distribution for asingle meta-object j

p(tj ,dj | t[−j],d[−j]) = p(dj | d[−j,tj ], t)p(tj | t[−j]). (10)

Here, p(tj | t[−j]) is the CRP predictive distribution

p(tj = i | t[−j]) =

{ni

αc+∑i′ ni′

tj is an existing typeαc

αc+∑i′ ni′

tj is a new type,

(11)and ni is the number of meta-object instances of type i (notcounting instance j) and αc is the concentration parameterof the CRP. Further, p(dj | d[−j,tj ], t) is the predictivedistribution of the two-parameter IBP, which factors intoactivation probabilities for each of the existing parts and anadditional factor for the number of new parts. An existing partis a part that has been activated by at least one other meta-object instance of this type in any of the scenes. The activationprobability for an existing part k is

p(dj,k 6= 0 | d[−j,tj ], t) =nk

ntj + cp, (12)

where cp is the concentration parameter of the two-parameterIBP, nk is the number of meta-object instances that have partk activated in any of the scenes, and ntj is the total numberof meta-object instances of type tj in all scenes (the countsexclude the meta-object j itself). The probability for havingnK+ associations to new parts is

p(d[j,K+] | d[−j,tj ], t) = nK+ ! Poi

(nK+

∣∣∣∣ cpαpntj + cp

), (13)

where αp is the mass parameter of the two-parameter IBP, andd[j,K+] denotes the associations to new parts.

As stated in Eq. (8), the data likelihood p(z[d] | T,d, t) forthe objects associated to meta-objects factors into likelihoodsfor each individual part k. Further, it factors into a spatialcomponent and a component for the observable object type.As the meta-object poses T are given, we can transform theabsolute positions x[d[t,k]] of the objects associated to a certainpart k of meta-object type t into relative positions x̃[d[t,k]]

with respect to a common meta-object reference frame. Therelative positions are assumed to be sampled from the part’sGaussian distribution which in turn is sampled from a normal-Wishart distribution. As the Gaussian and the normal-Wishartdistribution form a conjugate pair, we can analytically integrateout the part’s Gaussian distribution which therefore does nothave to be explicitly represented. Hence, the joint likelihoodfor x̃[d[t,k]] of part k is computed as the marginal likelihoodunder a normal-Wishart prior. During MCMC inference, weonly need to work with the posterior predictive distribution

p(x̃[dj,k] | x̃[d[−j,tj ,k]]) = tν(x̃[dj,k] | µ,Σ) (14)

for a single relative position given the rest. This is a multivari-ate t-distribution tν with parameters µ,Σ depending both onx̃[d[−j,tj ,k]]

and the parameters of the normal-Wishart prior –for details see [9]. Similarly, the part’s multinomial distributionover the observable object types can be integrated out as it

forms a conjugate pair with the Dirichlet distribution. Theposterior predictive distribution for a single object type is

p(ω[dj,k] | ω[d[−j,tj ,k]]) =

nω + αω∑ω′(nω′ + αω′)

, (15)

where nω is the number of times an object of type ω has beenassociated to part k of this meta-object type tj , and αω isthe pseudo-count of the Dirichlet prior. When describing theMCMC moves, we will sometimes make use of the predictivelikelihood for all objects z[dj ] associated to a single meta-object instance j. This likelihood factors into the posteriorpredictive distributions of the individual parts and their spatialand object type components, as described above. In the fol-lowing, we will describe the various MCMC moves in detail.

Death (birth) move: (Tj , tj ,dj ,a)→ (a?). A death moveselects a meta-object j uniformly at random, adds all of itscurrently associated objects to the noise process and removesthe meta-object j from the model. The proposal probabilityfor this move is qd(C−j ,a

? | Cj ,C−j ,a) = (nm)−1 wherenm denotes the number of instantiated meta-objects in thisscene before the death move. To simplify notation, we willjust write qd(Cj) for the probability of deleting meta-objectj. The reverse proposal is the birth proposal qb(C?j ,C−j ,a

? |C−j ,a, z) that proposes new parameters C?j = {T ?j , t?j ,d

?j}

for an additional meta-object: the pose T ?j , the type t?j , and theassociations d?j . The new meta-object may reference any ofthe objects previously associated to the noise process and anynon-referenced noise objects remain associated to the noiseprocess. We will describe the details of the birth proposal indetail later on. To simplify notation, we will just write qb(Cj)for the birth proposal. Plugging in the model and proposaldistributions in the MH ratio and simplifying we arrive at theacceptance ratio of the death move

Rd =1

p(z[dj ] | z[d[−j,tj ]],T[tj ],d[tj ], t)p(tj ,dj | t[−j],d[−j])

1

p(Tj)

p(nm − 1)

p(nm)

p(nε + nj)

p(nε)

qb(Cj)

qd(Cj). (16)

The counts nj , nm, and nε refer to the state before the deathmove, and nm denotes the number of meta-objects in thisscene, nj are the number of objects currently associated tometa-object j, and nε is the number of noise objects. Theratio of a birth move is derived similarly.

Switch move: (Tj , tj ,dj ,a)→ (T ?j , t?j ,d

?j ,a

?). This moveis a combined death and birth move. It removes a meta-objectand then proposes a new meta-object using the birth proposal.Thus, the number of meta-objects remains the same but onemeta-object simultaneously changes its type tj , pose Tj , andpart associations dj . The death proposals cancel out and theacceptance ratio of this move is

Rs =p(z[d?j ] | z[d[−j,t?

j]], T

?j ,T[−j,t?j ],d

?j ,d[−j,t?j ], t

?j , t[−j])

p(z[dj ] | z[d[−j,tj ]], Tj ,T[−j,tj ],dj ,d[−j,tj ], tj , t[−j])

p(t?j ,d?j | t[−j],d[−j])p(T

?j )p(n?ε )qb(Cj)

p(tj ,dj | t[−j],d[−j])p(Tj)p(nε)qb(C?j ). (17)

Page 6: Nonparametric Bayesian Models for Unsupervised …vised scene analysis. A first family of approaches, and the most related to our model, employs nonparametric Bayesian models to infer

Shift move: (Tj)→ (T ?j ). This move disturbs the pose Tjof a meta-object by adding Gaussian noise to it while the typeand part associations remain unchanged. The acceptance ratiodepends only on the spatial posterior predictive distributionsof the objects associated to this meta-object. The proposalprobabilities cancel due to symmetry and the final ratio is

RT =p(x[dj ] | x[d[−j,tj ]]

, T ?j ,T[−j,tj ],d[tj ], t)p(T ?j )

p(x[dj ] | x[d[−j,tj ]], Tj ,T[−j,tj ],d[tj ], t)p(Tj)

. (18)

Association move: (dj ,a)→ (d?j ,a?). This move samples

the individual part associations of a single meta-object j. Inthe IBP metaphor, this corresponds to sampling the selecteddishes (activated parts) of this meta-object. We need to splitthis sampling into two stages: first, we use Gibbs sampling toobtain the association for each existing part and, second, weuse MH moves for sampling associations to new parts.

In the first stage, we proceed for each existing part asfollows. If a part is already associated with an object, weconsider this object to be temporarily re-associated to the noiseprocess (such that there are now nε noise objects). We then useGibbs sampling to obtain one of nε + 1 possible associations:either the part k is inactive (dj,k = 0) and not associatedwith any object, or it is active (dj,k 6= 0) and associated withone out of nε currently available noise objects (e.g. zi whendj,k = i). The probabilities for these cases are proportional to

p(dj,k = i) ∝p(dj,k = 0 | d[−j,tj ,k], t)p(nε) i = 0

p(zi | z[d[−j,tj ,k]],T[tj ], dj,k,d[−j,tj ,k], t)

p(dj,k 6= 0 | d[−j,tj ,k], t)p(nε − 1)i 6= 0

(19)

In the second stage, we resample the associations to newparts. For this, we use two complementary MH moves: onemove increases the number of new parts by one by assigninga noise object to a new part, while the other move decreasesthe number of new parts by one by assigning an associatedobject (of a new part) to the noise process. The acceptanceratio for removing object zi from a new part k of meta-objectj that currently has nK+ new parts is

R− =1

p(zi | Tj ,d[j,K+])

p(d?[j,K+] | d[−j,tj ], t)

p(d[j,K+] | d[−j,tj ], t)

p(nε + 1)

p(nε)

q+

q−(20)

The proposal q− = (nK+)−1 chooses one of the new parts tobe removed uniformly at random, while the reverse proposalq+ = (nε + 1)−1 chooses uniformly at random one of thennε + 1 noise objects to be associated to a new part. The MHmove that increases the number new parts is derived similarly.

Birth proposal. qb(Tj , tj ,dj) The birth, death, and switchmoves rely on a birth proposal that samples new meta-objectparameters Cj = {Tj , tj ,dj}. The general idea is to samplethe pose Tj and type tj in a first step. We then proceed tosample the associations dj given Tj and tj , i.e., the potentialassignment of noise objects to the parts of this meta-object.

Sampling Tj and tj is done in either of two modes: theobject mode or the matching mode. In object mode, we choose

an object uniformly at random and center the meta-object poseTj at the object’s location (with random orientation) and addGaussian noise to it. The type tj is sampled from the currentpredictive distribution of the type CRP. In contrast, the match-ing mode was inspired by bottom-up top-down approachesand aims to propose Tj and tj in a more efficient way byconsidering the currently instantiated meta-object types in themodel. However, in contrast to the object mode, it cannotpropose new meta-object types (new tables in the type CRP).It selects two objects and associates them to a suitable partpair. This suffices to define the pose Tj as the correspondingtransformation of these parts into the scene. In detail, we firstsample one of the objects zi at random and choose its nearestneighbor zj . We match this ordered pair 〈zi, zj〉 against allordered part pairs 〈ki′ , kj′〉 of the meta-object types to obtainthe matching probabilities with respect to the parts’ posteriorpredictive means, µi′ and µj′ , of the spatial distribution, andtheir posterior predictive distributions, Mi′ and Mj′ , over theobservable object types ωi and ωj

pm(〈zi, zj〉, 〈ki′ , kj′〉) =N (d∆ |0, σ2m)Mi′(ωi)Mj′(ωj).

(21)Here, d∆ = ‖xi − xj‖ − ‖µi′ − µj′‖ is the residual of theobjects’ relative distance w.r.t. the distance of the posteriormeans and σ2

m is a fixed constant. We then sample a part pair〈ki′ , kj′〉 proportionally to its matching probability to define,together with the objects zi and zj , the pose Tj . Finally, weadd Gaussian noise to Tj . The sampled part pair implicitlydefines the meta-object type tj .

After having sampled Tj and tj using either of these twomodes, the next step is to sample the associations dj . Forthis, we randomly choose, without repetition, a noise objectand use Gibbs sampling to obtain its association to either: (a)a yet unassociated part of the meta-object instance; (b) a newpart of the meta-object instance; or (c) be considered a noiseobject. The probabilities for (a) and (b) are proportional to thespatial and object type posterior predictive distributions of therespective parts, while (c) is based on the noise IBP’s basedistribution.

Besides sampling from the proposal distribution we alsoneed to be able to evaluate the likelihood qb(Cj) for samplinga given parameter set Cj . For this, we need to marginalizeover the latent variables of the proposal, e.g. the binary modevariable (object mode or matching mode), the chosen objectzi, and the chosen part pair of the matching mode.

IV. EXPERIMENTS

We tested our model on both synthetic data and real-worlddata acquired with a Kinect camera.

A. Synthetic Data

In the synthetic data experiment, table scenes were gener-ated automatically using a different, hand-crafted generativemodel. We generated 25 training scenes ranging from two tosix covers. The cover types represent two different breakfasttypes. The first type is composed by a cereal bowl and a spoon,while the second type is composed by a plate with a glass, a

Page 7: Nonparametric Bayesian Models for Unsupervised …vised scene analysis. A first family of approaches, and the most related to our model, employs nonparametric Bayesian models to infer

(a) Examples of parsed scenes of the synthetic data set. (b) Examples of parsed scenes of the real-world data set.

Fig. 5. The picture shows the inferred meta-objects with their part relationships. Each picture shows a different scene from a different MCMC run.

Fig. 6. Spatial posterior predictive distributions of several parts of a covertype and their activation probabilities and most likely observable object types.The cover type on the left was learned on synthetic data, while the one onthe right was learned on the real-world data set.

knife

plate cupcup knife

plate

spoon

bowl

Fig. 7. Segmentation (left) and classification (right) results on the pointcloud segments visualized in the Kinect image.

fork and a knife. The fork can be randomly placed at the rightor the left of the plates or being absent.

The latent parameters are inferred using all the generatedtraining scenes and setting the hyperparameters to αε = 0.5,cp = 0.25,αp = 2.5, αt = 5, and αc = 1. As a first test, wewanted to show that the model is able to segment the scenesin a consistent and meaningful way. The results of this testare shown in Fig. 5a. A set of different scenes are segmentedby using the learned model. We see that each scene has beensegmented with the same cover types and objects are correctlyclustered. Note that the color of the meta-object and of the partcan change in every run, since the ordering of types and meta-objects is not relevant in our model. What is important is thatthe topological and metrical configuration are respected.

A second test is to see if the model can infer the meta-objects in incomplete scenes and if it is able to complete them.To this end, we artificially eliminated objects from alreadygenerated scenes and segmented the altered scene again using

plate?

(a) Incomplete scene of the synthetic data set.

plate?

(b) Incomplete scene of the real data set.

Fig. 8. Inferred meta-objects in an incomplete scene of the synthetic data set(a) and the real data set (b). The color of the spheres indicate the meta-objecttype and the color of the lines indicate the part type. In both scenes one platehas been removed. The model is still able to correctly segment the incompletescenes and infer the missing objects. To enhance readability the pictures havebeen modified with human readable labels.

the learned model. We discovered that the approach was ableto infer the correct cover type even in the absence of themissing object. As can be seen in Fig. 8a, the approachcorrectly segmented and recognized the meta-objects and isaware of the missing plate (the red branch of the hierarchythat is not grounded on an existing object). The missing objectcan then be inferred by sampling from the part’s posteriorpredictive distribution over the location and type of the missingobject, as the one shown in Fig. 6 (left).

B. Real-world Data

We used a Microsoft Kinect depth camera to identify theobjects on the table by, first, segmenting the objects bysubtracting the table plane in combination with a color-basedsegmentation. Second, we detect the objects based on thesegmented pointcloud using a straight forward feature-based

Page 8: Nonparametric Bayesian Models for Unsupervised …vised scene analysis. A first family of approaches, and the most related to our model, employs nonparametric Bayesian models to infer

−1,000

0

1,000lo

g-pr

obab

ility

0

10

20

met

a-ob

ject

part

s

0 1,000 2,000 3,000 4,000 5,000 6,000

0

2

4

6

MCMC sampling iteration

met

a-ob

ject

type

s

Fig. 9. The plots depict the log-probability, the number of meta-object parts(summed over all meta-object types), and the number of meta-object typesas they evolve during MCMC sampling. The red lines corresponds to thesynthetic data set and the blue lines to the real-world data set.

object detection system with a cascade of one-vs-all classifiers.An example for the segmentation and object identification isshown in Fig. 7. Note that there are likely to be better detectionsystems, however, the task of detecting the objects on the tableis orthogonal to the scientific contribution of this paper.

The same set of tests performed on the synthetic datahave been performed also in this case, showing basically thesame results. In particular, Fig. 5b shows the segmentationresults, Fig. 8b shows a modified scene where a plate hasbeen removed and Fig. 6 (right) shows the posterior predictivedistribution for location and type, as for the synthetic case, thatcan be used to complete an incomplete scene.

To better illustrate the inference process, we plot the log-probability and the number of parts, and number of meta-object types in Fig. 9 as they evolve during MCMC sampling.

V. CONCLUSION

This paper presents a novel and fully probabilistic gener-ative model for unsupervised scene analysis. Our approachis able to model the spatial arrangement of objects. It main-tains a nonparametric prior over scenes, which is updatedby observing scenes, and can directly be applied for modelcompletion by inferring missing objects in an incompletescene. Our model applies a combination of Dirichlet processesand beta processes, allowing for a probabilistic treatment of thedimensionality of the model. In this way, we avoid a modelselection step, which is typically intractable for the modelsconsidered here. To evaluate our approach, we successfullyapplied it to the problem of adding missing objects in thecontext of the task of laying a table.

ACKNOWLEDGMENTS

This work has been supported by the Deutsche Forschungsgemein-schaft (DFG) under contract number SFB/TR 8 Spatial Cognition

(R6-[SpaceGuide]) and by the EC under contract number FP7-ICT-248258-First-MM.

REFERENCES

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan.An Introduction to MCMC for Machine Learning. MachineLearning, 50(1–2):5–43, 2003.

[2] J. L. Austerweil and T. L. Griffiths. Learning invariant featuresusing the Transformed Indian Buffet Process. In Advances inNeural Information Processing Systems (NIPS), 2010.

[3] R. Fergus, P. Perona, and A. Zisserman. Object class recognitionby unsupervised scale-invariant learning. In Proc. of the IEEEConf. on Computer Vision and Pattern Recognition (CVPR),2003.

[4] R. Fergus, P. Perona, and A. Zisserman. A Sparse Object Cate-gory Model for Efficient Learning and Exhaustive Recognition.In Proc. of the IEEE Conf. on Computer Vision and PatternRecognition (CVPR), 2005.

[5] S. Fidler and A. Leonardis. Towards Scalable Representationsof Object Categories: Learning a Hierarchy of Parts. In Proc.of the IEEE Conf. on Computer Vision and Pattern Recognition(CVPR), 2007.

[6] S. J. Gershman and D. M. Blei. A tutorial on Bayesiannonparametric models. Journal of Mathematical Psychology,56(1):1–12, 2012.

[7] T. L. Griffiths and Z. Ghahramani. Infinite latent featuremodels and the Indian buffet process. In Advances in NeuralInformation Processing Systems (NIPS). 2006.

[8] Y. Jiang, M. Lim, C. Zheng, and A. Saxena. Learning to placenew objects in a scene. Int. Journal of Robotics Research(IJRR), 2012. To appear.

[9] K. P. Murphy. Conjugate Bayesian analysis of the univariateGaussian: a tutorial. Self-published notes, 2007.

[10] D. Parikh, C. L. Zitnick, and T. Chen. Unsupervised Learning ofHierarchical Spatial Structures In Images. In Proc. of the IEEEConf. on Computer Vision and Pattern Recognition (CVPR),2009.

[11] A. Ranganathan and F. Dellaert. Semantic Modeling of Placesusing Objects. In Proc. of Robotics: Science and Systems (RSS),Atlanta, GA, USA, 2007.

[12] A. Rodrı́guez, D. B. Dunson, and A. E. Gelfand. The NestedDirichlet Process. Journal of the American Statistical Associa-tion, 103(483):1131–1154, 2008.

[13] P. Schnitzspan, S. Roth, and B. Schiele. Automatic Discovery ofMeaningful Object Parts with Latent CRFs. In Proc. of the IEEEConf. on Computer Vision and Pattern Recognition (CVPR),2010.

[14] L. Spinello, R. Triebel, D. Vasquez, K. O. Arras, and R. Sieg-wart. Exploiting Repetitive Object Patterns for Model Com-pression and Completion. In Proc. of the European Conf. onComputer Vision (ECCV), 2010.

[15] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky.Describing Visual Scenes Using Transformed Objects and Parts.Int. Journal of Computer Vision, 77(1–3):291–330, 2008.

[16] Y. W. Teh and M. I. Jordan. Hierarchical Bayesian Nonpara-metric Models with Applications. In Bayesian Nonparametrics:Principles and Practice. Cambridge University Press, 2010.

[17] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierar-chical Dirichlet Processes. Journal of the American StatisticalAssociation, 101(476):1566–1581, 2006.

[18] R. Thibaux and M. I. Jordan. Hierarchical Beta Processes andthe Indian Buffet Process. In Proc. of the Int. Conf. on ArtificialIntelligence and Statistics (AISTATS), 2007.

[19] R. Triebel, J. Shin, and R. Siegwart. Segmentation and Unsu-pervised Part-based Discovery of Repetitive Objects. In Proc.of Robotics: Science and Systems (RSS), 2010.