
Learning Deep Generative Spatial Models for Mobile Robots

Andrzej Pronobis, Rajesh P. N. Rao

Abstract: We propose a new probabilistic framework that allows mobile robots to autonomously learn deep generative models of their environments that span multiple levels of abstraction. Unlike traditional approaches that attempt to integrate separately engineered components for low-level features, geometry, and semantic representations, our approach leverages recent advances in sum-product networks (SPNs) and deep learning to learn a unified deep model of a robot’s spatial environment, from low-level representations to semantic interpretations. Our results, based on laser range finder data from a mobile robot, demonstrate that the proposed approach can learn the geometry of places and is a versatile platform for solving tasks ranging from semantic classification of places, uncertainty estimation and novelty detection to generation of place appearances based on semantic information and prediction of missing data in partial observations.

I. INTRODUCTION

The ability to represent knowledge about the environment is fundamental for a mobile robot. Spatial knowledge exists at multiple levels of abstraction, from data generated by the robot’s sensors (such as cameras or range sensors), through geometry and appearance, up to high-level semantic descriptions (such as categories of objects present in the room or functional room categories). The tasks of modeling knowledge at each of these levels of abstraction have received considerable attention in the robotics community, and many solutions have been proposed to the problems of feature design and learning, 2D and 3D mapping, object recognition, place classification, and semantic mapping. Furthermore, experiments have demonstrated that robotic systems can leverage spatial knowledge at all levels of abstraction to better perform real-world tasks in human environments [1].

Traditionally, robotic systems performing semantic mapping have utilized an assembly of independent components [2], which exchange information in a limited fashion. This includes engineered feature extractors and combinations of machine learning techniques, making integration with planning and decision making difficult. However, the recent deep learning revolution has demonstrated that replacing multiple independent representations with a single integrated model can lead to a drastic increase in performance [3], [4]. As a result, deep models have also been applied to the problem of place classification and semantic mapping [5], [6]. However, the problem was simply framed as one of classification, where sensory data is fed to a convolutional neural network in order to obtain semantic labels.

The authors are with Computer Science & Engineering, University of Washington, Seattle, WA, USA. A. Pronobis is also with Robotics, Perception and Learning Lab, KTH Royal Institute of Technology, Stockholm, Sweden. {pronobis,rao}@cs.washington.edu. This work was supported by the Swedish Research Council (VR) project SKAEENet. The help by Kaiyu Zheng and Kousuke Ariga is gratefully acknowledged.

In this work, our goal is not only to unify multiple levels of a representation into a single model, but also to demonstrate that the role of a spatial model can go beyond classification. To this end, we propose a deep probabilistic generative model of the geometry of local environments (places), which learns a joint distribution between a low-level representation of the local environment and its semantic interpretation. The model leverages Sum-Product Networks (SPNs), a recently proposed probabilistic deep architecture.

SPNs have been shown to provide state-of-the-art results in several domains [7], [8], [9]. However, to our knowledge, this work is the first to apply SPNs in robotics. Our place model consists of an SPN with a unique structure designed to hierarchically represent the geometry and semantics of a place from the perspective of a mobile robot. To this end, the network represents a place as a polar occupancy grid surrounding the robot, where the nearby objects are represented in more detail than objects further from the robot. On top of the occupancy data, we propose a unique SPN structure which combines prior knowledge about the data with random structure generation (as in random forests) for parts of the network modeling complex dependencies.

Our model is generative and probabilistic. This allows us to infer semantics of a place together with a probability representing uncertainty of the model. This provides additional information for classification, but also rich information to a potential planning or decision making subsystem. The uncertainty can further be used to perform novelty detection, for instance to detect that a place is of a previously unknown category. Finally, the fact that we use generative learning permits us to reason jointly about the geometry of the world and its semantics. This can be used to generate values of missing observations as well as prototypical appearances of places of various semantic categories.

Our goal in this paper is to present the potential of SPNs, and deep generative models in general, for spatial modeling in robotics. Therefore, we present results from several different experiments addressing several sub-problems. We use laser range data to capture the geometry of places and report results from semantic classification of places, uncertainty estimation, novelty detection, as well as generation of place appearances based on semantic information and prediction of missing values in partial observations. Although we use laser range data in our initial experiments, the proposed model can easily be extended to include 3D or visual information without changing the general model structure.

II. RELATED WORK

Representing semantic spatial knowledge is a broadly researched topic in the fields of computer vision and robotics, with many solutions employing vision as the sensor of choice [10], [11], [12], [13], [14], [15]. Images clearly carry rich information about semantics; however, they are also significantly affected by changing environment conditions. At the same time, robotics researchers have seen advantages of using range data that are much more robust to real-world settings and easier to process in real time. In this work, we focus on laser range data, as a way of introducing and evaluating a new spatial model, employing a recently proposed deep architecture.

arXiv:1610.02627v1 [cs.RO] 9 Oct 2016

Laser range data has been extensively used for place classification and extracting semantic descriptions. As a result, many traditional handcrafted representations have been proposed. Buschka et al. [16] contributed a simple method that incrementally divided grid maps of indoor environments into two classes of open spaces (rooms and corridors) using local sub-maps. Mozos et al. [17] applied AdaBoost to create a classifier based on a set of manually designed geometrical scan features to classify different places in indoor environments into rooms, corridors and doorways. A similar approach was taken by Topp et al. [18]. Further, Mozos et al. [19] added an HMM on top of the point-wise classifications to utilize information about the connectivity of space and included visual object detections as features of places. Tapus et al. [20] also combined omnidirectional vision with laser data to build descriptors, called fingerprints of places. Finally, Pronobis et al. [21] proposed a discriminative model based on SVMs that combined laser range data and visual cues for semantic place classification.

Deep learning and unsupervised feature learning techniques [22], after many successes in speech recognition, language understanding and computer vision [3], [23], entered the field of robotics with superior performance in object detection, recognition [24], [25] and robot grasping [26], [27], [4]. The latest work in visual place categorization proposes using a deep convolutional network complemented with a series of one-vs-all classifiers [5], [6]. In that work, the deep model is used exclusively for classification. In contrast, we propose a probabilistic generative model and demonstrate its usefulness in a wide range of robotics tasks.

Our spatial model builds on the new deep architecture of Sum-Product Networks (SPNs). SPNs have recently achieved promising performance in various applications, such as speech [8] and language modeling [28], human activity recognition [9], and image classification [7] and completion [29]. To the best of our knowledge, our work is the first that introduces SPNs to the domain of robotics. Specifically, we demonstrate how to design and successfully adapt SPNs to the task of real-time spatial modeling for mobile robots, showing the importance of generative properties of the model. Moreover, we present a new generic library named LibSPN that implements various SPN architectures, inference and learning algorithms, opening doors for using SPNs in a broader range of domains.

III. SUM-PRODUCT NETWORKS

Sum-product networks are a recently proposed probabilistic deep architecture with several appealing properties and solid theoretical foundations [29], [7], [30]. The key observation that led to the development of SPNs is that one of the primary limitations of probabilistic graphical models is the complexity of their partition function. This often requires employing complex approximate inference in the presence of non-convex likelihood functions. SPNs represent probability distributions with partition functions that are guaranteed to be tractable and involve a polynomial number of sum and product operations, permitting exact inference. While not all probability distributions can be represented by polynomial-sized SPNs, recent experiments in several domains demonstrate that the class of distributions modeled by SPNs is sufficient for many real-world problems.

Fig. 1: A simple naive Bayes mixture model with three components over two binary variables represented as an SPN. The bottom layer represents indicators for each of the two variables. Y1 represents a latent variable marginalized out by the top sum node.

SPNs represent a joint or conditional probability distribution and can be learned both generatively [29] and discriminatively [7]. They are a deep, hierarchical representation, capable of representing context-specific independence. As shown in Fig. 1 on a simple example of a naive Bayes distribution, the network is a generalized directed acyclic graph of alternating layers of weighted sum and product nodes. The sum nodes can be seen as representing mixture models, over components defined using product nodes, with weights of each sum representing priors. The latent variables of such mixtures can be made explicit and their values inferred. This technique is often used for classification models where the root sum is a mixture of sub-SPNs representing multiple classes. The bottom layers effectively define features reacting to certain values of indicators of the variables in the model.

Formally, following Poon & Domingos [29], we can define an SPN as follows:

Definition 1: An SPN over variables $X_1, \ldots, X_V$ is a rooted directed acyclic graph whose leaves are the indicators $(X_1^1, \ldots, X_1^I), \ldots, (X_V^1, \ldots, X_V^I)$ and whose internal nodes are sums and products. Each edge $(i, j)$ emanating from a sum node $i$ has a non-negative weight $w_{ij}$. The value of a product node is the product of the values of its children. The value of a sum node is $\sum_{j \in Ch(i)} w_{ij} v_j$, where $Ch(i)$ are the children of $i$ and $v_j$ is the value of node $j$. The value of an SPN $S[X_1, \ldots, X_V]$ is the value of its root.

Not all possible architectures consisting of sums and products will result in a valid distribution. While a less constraining condition on validity has been derived in [29], a simpler condition, which does not limit the power of the model in a substantial way [30], is to guarantee that the SPN is complete and decomposable [29].

Definition 2: A sum-product network is complete iff all children of the same sum node have the same scope.

Definition 3: A sum-product network is decomposable iff no variable appears in more than one child of a product node.

The scope of a node is defined as the set of variables that have indicators among the descendants of the node.
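As an illustration of Definitions 2 and 3, the following Python sketch computes scopes and checks both conditions on a small network. The nested-tuple node encoding and the function names are assumptions made for this example only and do not reflect any particular SPN library.

```python
# Node encoding (an assumption made for this sketch only):
#   ("leaf", var)         an indicator of variable `var`
#   ("sum", [children])   a sum node (weights omitted, they do not affect scope)
#   ("prod", [children])  a product node


def scope(node):
    """Set of variables that have indicators among the node's descendants."""
    kind, payload = node
    if kind == "leaf":
        return frozenset([payload])
    return frozenset().union(*(scope(child) for child in payload))


def is_complete_and_decomposable(node):
    """Check Definition 2 at every sum node and Definition 3 at every product node."""
    kind, payload = node
    if kind == "leaf":
        return True
    scopes = [scope(child) for child in payload]
    if kind == "sum":
        # Complete: all children of a sum share the same scope.
        ok = all(s == scopes[0] for s in scopes)
    else:
        # Decomposable: child scopes of a product are pairwise disjoint.
        ok = sum(len(s) for s in scopes) == len(frozenset().union(*scopes))
    return ok and all(is_complete_and_decomposable(child) for child in payload)


x1 = ("sum", [("leaf", "X1"), ("leaf", "X1")])   # two indicators of X1
x2 = ("sum", [("leaf", "X2"), ("leaf", "X2")])   # two indicators of X2
valid = ("sum", [("prod", [x1, x2]), ("prod", [x1, x2])])
not_decomposable = ("prod", [x1, x1])            # X1 appears in two children
not_complete = ("sum", [x1, x2])                 # children have different scopes
```

Here `valid` satisfies both conditions, while the other two networks each violate one of them.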

A valid SPN will compute the unnormalized probability of evidence expressed in terms of indicators. However, we can observe that, without loss of generality [30], the weights of each sum can be normalized, in which case the value of the SPN $S[X_1^1, \ldots, X_V^I]$ is equal to the normalized probability $P(X_1, \ldots, X_V)$ of the distribution modeled by the network.
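To make the role of normalized weights concrete, the sketch below evaluates a network shaped like the naive Bayes mixture of Fig. 1 (three components over two binary variables). All numerical weights are invented for illustration, and the function names are hypothetical. With normalized weights, the values of the network over all complete states sum to one, and setting all indicators of a variable to 1 marginalizes that variable out.

```python
from itertools import product

# Hypothetical parameters for a mixture of three naive Bayes components
# over two binary variables X1 and X2 (the numbers are illustrative only).
ROOT_WEIGHTS = [0.5, 0.3, 0.2]                 # priors of the latent variable Y1
P_X1 = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]    # P(X1 = k | component)
P_X2 = [[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]]    # P(X2 = k | component)


def spn_value(ind_x1, ind_x2):
    """Upward pass: indicators -> leaf sums -> product nodes -> root sum."""
    root = 0.0
    for w, px1, px2 in zip(ROOT_WEIGHTS, P_X1, P_X2):
        s1 = sum(p * i for p, i in zip(px1, ind_x1))   # sum node over X1
        s2 = sum(p * i for p, i in zip(px2, ind_x2))   # sum node over X2
        root += w * (s1 * s2)                          # product node, weighted by the root
    return root


def indicators(value):
    """Indicators of a binary variable; None means that no evidence is available."""
    return [1, 1] if value is None else [int(value == 0), int(value == 1)]


# Summing the network value over all complete states gives 1.
total = sum(spn_value(indicators(a), indicators(b))
            for a, b in product([0, 1], repeat=2))
# P(X1 = 0) with X2 marginalized out by setting both of its indicators to 1.
marginal_x1 = spn_value(indicators(0), indicators(None))
```

For these invented weights, `marginal_x1` equals 0.5·0.9 + 0.3·0.4 + 0.2·0.2 = 0.61.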

A. Generating SPN structure

The structure of the SPN determines the group of distributions that can be learned. Therefore, most previous works [7], [9], [28] relied on domain knowledge to design the appropriate structure. Furthermore, several structure learning algorithms were proposed [31], [32], which try to discover independencies between the random variables in the dataset and structure the SPN accordingly. In this work, we experiment with a different approach, originally hinted at in [29], which generates a random structure, as in random forests. That approach to SPN structure generation has not been previously evaluated and our experiments demonstrate that it can lead to very good performance.

The algorithm recursively generates multiple decompositions of a set of random variables into multiple subsets until each subset is a singleton. That approach is illustrated in Fig. 3. At each level, the set to be decomposed is represented by multiple mixtures (sum nodes), and each subset in each decomposition is modeled by multiple mixtures. Product nodes are used as an intermediate layer and act as features detecting particular combinations of mixtures representing the subsets. After the weights of the SPN are learned (potentially with an additional sparsity prior), the network can be pruned to remove children of sums associated with zero weights. Our experiments indicate that such an approach can accommodate a wide range of distributions.
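The recursive decomposition can be sketched in Python as follows. This is a simplified reading of the description above, not the LibSPN implementation: the function name, the parameter names, the nested-tuple node encoding, and the way subsets are drawn (a shuffle followed by an even split) are all assumptions of this example.

```python
import random


def generate(variables, num_decomps, num_subsets, num_mixtures, rng, stats):
    """Return `num_mixtures` sum nodes that all model the set `variables`.

    Nodes are nested tuples; `stats` counts the created sum and product nodes.
    """
    if len(variables) == 1:
        # Singleton: each mixture is a distribution over the variable's indicators.
        stats["sum"] += num_mixtures
        return [("sum", [("indicators", variables[0])]) for _ in range(num_mixtures)]

    products = []
    for _ in range(num_decomps):
        shuffled = variables[:]
        rng.shuffle(shuffled)
        k = min(num_subsets, len(shuffled))
        subsets = [shuffled[i::k] for i in range(k)]   # random, non-empty, disjoint
        child_mixtures = [
            generate(s, num_decomps, num_subsets, num_mixtures, rng, stats)
            for s in subsets
        ]
        # One product node per combination that picks one mixture for each subset.
        combos = [[]]
        for mixtures in child_mixtures:
            combos = [c + [m] for c in combos for m in mixtures]
        products.extend(("prod", c) for c in combos)

    stats["prod"] += len(products)
    stats["sum"] += num_mixtures
    # Every mixture over this set is connected to all product nodes below it.
    return [("sum", products) for _ in range(num_mixtures)]


stats = {"sum": 0, "prod": 0}
roots = generate(["X1", "X2", "X3", "X4"], 1, 2, 2, random.Random(0), stats)
```

For four variables, one decomposition per level, two subsets and two mixtures, this produces 14 sum nodes and 12 product nodes. The number of product nodes grows quickly with the number of subsets and mixtures, which is one reason pruning after learning is attractive.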

B. Inference and Learning

Inference in SPNs is accomplished by an upward pass through the network. Once the indicators are set to represent the evidence, the upward pass will yield the probability of the evidence as the value of the root node. Partial evidence (or missing data) can easily be expressed by setting all indicators for a variable to 1. Furthermore, since SPNs compute a network polynomial [33], derivatives computed over the network can be used to perform inference for modified evidence without recomputing the whole SPN. Finally, it can be shown [29] that MPE inference can be performed by replacing all sum nodes with max nodes, while retaining the weights. Then, the indicators of the variables for which the MPE state is inferred are all set to 1 and a standard upward pass is performed. A downward pass then follows which recursively selects the highest valued child of each sum (max) node, and all children of a product node. The indicators selected by this process indicate the MPE state of the variables.
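The two passes of MPE inference can be illustrated on a small naive Bayes mixture over two binary variables. The sketch below is specialized to that three-layer shape rather than a general SPN; the weights are invented and the function names are hypothetical.

```python
# Illustrative naive Bayes mixture over two binary variables (made-up weights).
ROOT_WEIGHTS = [0.5, 0.3, 0.2]
P_X1 = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
P_X2 = [[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]]


def max_child(weights, values):
    """Value and index of the best weighted child of a max (former sum) node."""
    scored = [w * v for w, v in zip(weights, values)]
    best = max(range(len(scored)), key=scored.__getitem__)
    return scored[best], best


def mpe(ind_x1, ind_x2):
    """Return (component, x1, x2) selected by the upward and downward passes."""
    # Upward pass with max nodes in place of sum nodes.
    leaves = [(max_child(px1, ind_x1), max_child(px2, ind_x2))
              for px1, px2 in zip(P_X1, P_X2)]
    products = [l1[0] * l2[0] for l1, l2 in leaves]
    _, component = max_child(ROOT_WEIGHTS, products)
    # Downward pass: follow the winning child of the root, then all children
    # of the chosen product node, and read off the selected indicators.
    (_, x1), (_, x2) = leaves[component]
    return component, x1, x2


no_evidence = mpe([1, 1], [1, 1])    # infer the MPE state of all variables
x1_observed = mpe([0, 1], [1, 1])    # evidence X1 = 1, infer the rest
```

With these weights, `no_evidence` is `(0, 0, 0)` and `x1_observed` is `(2, 1, 1)`: observing X1 = 1 shifts the selected mixture component from the first to the third.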

SPNs lend themselves to be learned generatively or discriminatively using Expectation Maximization (EM) or gradient descent. In this work, we employ hard EM, which was shown to work well for generative learning [29]. As is often the case for deep models, the gradient quickly diminishes as more layers are added. Hard EM overcomes this problem, permitting learning of SPNs with hundreds of layers. The hard EM learning procedure consists of MPE inference of the state of the latent variables in each sum (the E step) followed by an update of the weights according to the number of times each child of a sum node was selected during the downward pass (the M step). We achieved the best results by modifying the inference to use sums during the upward pass, while selecting the max valued child during the downward pass. Furthermore, when updating the weights, we performed additive smoothing, which corresponds to a Dirichlet prior. We terminated learning when the average log-likelihood did not improve beyond the threshold of 0.05.
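A minimal sketch of the M step and of the stopping rule is given below. The function names and the smoothing pseudo-count of 1.0 are assumptions; the value of the smoothing constant used in our experiments is not stated here.

```python
def update_weights(counts, smoothing=1.0):
    """M step for one sum node: normalized selection counts with additive smoothing.

    `counts[k]` is how many times child k was selected in the downward pass.
    `smoothing` is a hypothetical pseudo-count added to every child.
    """
    total = sum(counts) + smoothing * len(counts)
    return [(c + smoothing) / total for c in counts]


def converged(prev_avg_ll, curr_avg_ll, threshold=0.05):
    """Stop when the average log-likelihood improves by no more than `threshold`."""
    return curr_avg_ll - prev_avg_ll <= threshold


weights = update_weights([8, 0, 2])   # the second child was never selected
```

Smoothing keeps the weight of a never-selected child strictly positive (here 1/13), so that child can still be selected in later iterations.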

IV. THE SPATIAL MODEL

Our spatial model is designed to support spatial reasoning for the purpose of planning and executing actions on a mobile robot. Below, we first describe how observations of local environments are represented in our framework with this purpose in mind. Then, we provide details of the SPN-based generative model performing joint inference on the low-level observations, geometry of places and their semantic interpretations.

A. Representing Local Environments

To represent an observation of the local environment (a place), the model relies on local occupancy grids generated from laser range data. In realistic settings, a mobile robot almost always has access to more than one observation of a place. Therefore, we use Rao-Blackwellized particle filter grid mapping [34] to integrate the incoming laser range scans into local maps formed around the robot. This technique allows us to assemble more complete representations of places. During our experiments, the robot was driving through the environment with a roughly constant speed, while continuously gathering data and performing inference based on the current state of the local map. This does not generate complete local maps (especially at the time when the robot enters a new part of the environment for the first time), but does help to obtain a more complete observation.

Our goal is to model the geometry and semantics of a local environment only. We assume that a larger-scale space model will be built by integrating multiple models of places. Therefore, we constrain the observation to the parts of the environment that are visible from the robot (can be raytraced from the robot’s location). As a result, walls in the environment occlude the view and the local map will mostly contain objects present inside a single room. In practice, additional noise is almost always present, but is averaged out during learning of the model. In order to include a fuller appearance of the objects (e.g. corners of furniture), we add a small vicinity around every occupied cell visible from the robot to the place observation. Examples of such local representations can be seen in Fig. 2.

Fig. 2: Comparison of local environment observations used in our experiments, expressed as Cartesian and polar occupancy grids for places of different semantic categories: (a) Corridor, (b) Doorway, (c) Small Office, (d) Large Office.

As the next step, every local observation available to the robot is transformed into a robocentric polar occupancy grid representation (compare polar and Cartesian occupancy grids in Fig. 2). The resulting observation contains higher-resolution details closer to the robot and lower-resolution information further away. This focuses the attention of the model on the nearby objects. Higher resolution of information closer to the robot is important for understanding the semantics of the robot’s exact location (for instance when the robot is in a doorway). However, it also relates to how spatial information is used by a mobile robot while executing actions. It is in the vicinity of the robot that higher accuracy of spatial information is required. A similar principle is exploited by many navigation components, which use different resolution of information for local and global path planning. Additionally, such a representation corresponds to the way the robot perceives the world because of the limited resolution of its sensors. Our goal is to use a similar strategy when representing 3D and visual information in the future, by extending the polar representation to 3 dimensions. Finally, a high-resolution map of the complete environment can be largely recovered by integrating the polar observations over the path of the robot.

Fig. 3: The structure of the SPN implementing our spatial model. The bottom images illustrate a robot in an environment and a robocentric polar grid formed around the robot. The SPN is built on top of the variables representing the occupancy in the polar grid.

The polar grids in our experiments were built for a maximum radius of 5 m, with an angle step of 6.4 degrees and resolution decreasing with the distance from the robot.

B. Deep Generative Spatial Model

The proposed probabilistic spatial model is based on a generative SPN illustrated in Fig. 3. The model learns a distribution $P(Y, X_1, \ldots, X_C)$, where $Y$ represents the semantic category of a place, and $X_1, \ldots, X_C$ are input variables representing the occupancy in each cell of the polar occupancy grid. Each occupancy cell is represented by three indicators in the SPN, one for empty space, one for occupied space and one for unknown/occluded space. The sum nodes directly connected to the indicators can be seen as single-variable categorical distributions modeling each cell. We experimented with a different representation in which only two values (occupied/empty) were considered for each cell and both indicators were set to 1 for unknown space to represent lack of evidence. However, we did not observe a benefit in terms of classification performance.
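The three-indicator encoding can be sketched as follows. The integer codes for the cell states, the use of `None` for a cell without evidence, and the function names are assumptions of this example. Note the difference between the unknown/occluded state, which is an observed value with its own indicator, and missing evidence, which sets all three indicators to 1.

```python
EMPTY, OCCUPIED, UNKNOWN = 0, 1, 2   # assumed integer codes for the three cell states
MISSING = None                       # no evidence is available for this cell


def cell_indicators(state):
    """Three indicators per cell; all ones marginalizes the cell out."""
    if state is MISSING:
        return [1, 1, 1]
    return [int(state == k) for k in (EMPTY, OCCUPIED, UNKNOWN)]


def grid_indicators(cells):
    """Flatten a sequence of cell states into one indicator vector."""
    return [i for state in cells for i in cell_indicators(state)]


example = grid_indicators([EMPTY, UNKNOWN, MISSING])
```

Here `example` is `[1, 0, 0, 0, 0, 1, 1, 1, 1]`.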

The structure of the model is partially static and partially generated randomly according to the algorithm described in Sec. III-A. We begin by splitting the polar grid equally into 8 views (45 degrees each). For each view, we randomly generate an SPN by recursively building a hierarchy of decompositions of subsets of polar cells. Then, on top of the sub-SPNs representing each view, we randomly generate an SPN representing complete place geometries for each place class. The sub-SPNs for place classes are combined by a sum node forming the root of the network. The latent variable represented by that sum node becomes $Y$ and is set to the appropriate class label during supervised learning. Similarly, we infer its value when classifying observations.

Subdividing the representation into views allows us to use networks of different complexity for representing lower-level features and geometry of each view, as well as for modeling the top composition of views into a place class. In our experiments, when representing views, we recursively decomposed the set of polar cells using a single decomposition at each level, into two random cell subsets, and generated 4 mixtures to model each such subset. This procedure was repeated until each subset contained a single variable representing a single cell. To increase the discriminative power of each view representation, we used 8 sums at the top level of the view sub-SPN. All sums modeling views were then considered input to a randomly generated SPN structure representing place classes. To ensure that each class can be associated with a rich assortment of place geometries, we increased the complexity of the generated structure and at each level recursively decomposed the sets of mixtures representing views into 5 subsets in 4 different random ways.

Several straightforward modifications to the architecture can be considered. First, the weights of the sum nodes for each view could be shared, potentially simplifying the learning process. Second, the latent variables in the mixtures modeling each view can be accessed to explicitly reason about types of views discovered by the learning algorithm.

C. Types of Inference

As a generative model of a joint distribution between low-level observations and high-level semantic phenomena, our spatial model is capable of various types of inference.

First, the model can simply be used to classify observations into semantic categories, which corresponds to MPE inference of $Y$:

$$y^* = \arg\max_{y} P(y \mid x_1, \ldots, x_C) \tag{1}$$

Second, the probability of an observation can be used as a measure of novelty:

$$\sum_{y} P(y, x_1, \ldots, x_C) > \text{threshold} \tag{2}$$

Observations whose marginal probability does not exceed the threshold are treated as novel. We use this approach to successfully separate observations of classes known during the training process from observations gathered in unknown place classes.
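Both classification (1) and the novelty test (2) can be computed from the per-class joint log-probabilities $\log P(y, x_1, \ldots, x_C)$. The sketch below assumes those values are already available (for instance from one upward pass per class); the function names and the log-domain formulation are choices made for this example.

```python
import math


def log_marginal(joint_log_probs):
    """log sum_y P(y, x), computed stably from the per-class values log P(y, x)."""
    m = max(joint_log_probs)
    return m + math.log(sum(math.exp(v - m) for v in joint_log_probs))


def classify(joint_log_probs):
    """Eq. (1): argmax_y P(y | x) equals argmax_y P(y, x), since P(x) is shared."""
    return max(range(len(joint_log_probs)), key=joint_log_probs.__getitem__)


def is_novel(joint_log_probs, log_threshold):
    """Eq. (2): reject as novel when the marginal does not exceed the threshold."""
    return log_marginal(joint_log_probs) <= log_threshold


# Made-up log-probabilities for four classes, for illustration only.
scores = [-310.2, -305.0, -320.7, -330.1]
label = classify(scores)
```

Working in the log domain avoids numerical underflow, since the joint probability of many grid cells is typically a very small number.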

If, instead, we condition on the semantic information, we can perform MPE inference over the variables representing occupancy of polar grid cells:

$$x_1^*, \ldots, x_C^* = \arg\max_{x_1, \ldots, x_C} P(x_1, \ldots, x_C \mid y) \tag{3}$$

This leads to generation of most likely, prototypical examples for each class.

Finally, we can use partial evidence about the occupancy and infer the most likely state of a subset of polar grid cells for which evidence was not available:

$$x_J^*, \ldots, x_C^* = \arg\max_{x_J, \ldots, x_C} \sum_{y} P(y, x_1, \ldots, x_J, \ldots, x_C) \tag{4}$$

We use this technique to infer missing observations in our experiments.

V. EXPERIMENTS

We conducted four types of experiments corresponding to the types of inference described in Sec. IV-C. Below, we describe the experimental setup common to all experiments. We follow with specific details and presentation of results for each type of inference.

A. Experimental Setup

Our experiments were performed on laser range data from the COLD-Stockholm database [2]. The database contains multiple sequences of laser range data captured using a mobile robot navigating with roughly constant speed through four different floors of an office environment. On each floor, the robot navigates through multiple rooms of different semantic categories. Three of the room categories contain multiple instances of rooms, evenly distributed across floors. There are 9 different large offices, 8 different small offices, and 4 long corridors (1 per floor, the corridor features different appearances in different parts). We used these classes for training the model, and added another place class that represents a doorway and contains multiple examples of observations captured when the robot was navigating through a door. For each floor, we used two sequences captured during two independent runs through the environment.

The dataset features several other room classes: an elevator, a living room, a meeting room, a large meeting room, and a kitchen. However, only one or two room instances are available in each of these classes. Therefore, we decided to use these classes as examples of unknown semantic categories for testing novelty detection.

Fig. 4: Normalized confusion matrix for the task of semantic place classification.

Finally, to ensure variability between the training and testing sets, we split the samples from known classes four times, each time training the model on samples from three floors and leaving one floor out for testing. The presented results are averaged over the four splits.
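The leave-one-floor-out protocol can be sketched as follows. The sample layout (a dict with a `"floor"` key) and the function name are hypothetical and serve only to illustrate the splitting.

```python
def leave_one_floor_out(samples):
    """Yield (held_out_floor, train, test) for every floor present in `samples`.

    Each sample is a dict with at least a "floor" key (a hypothetical layout).
    """
    floors = sorted({s["floor"] for s in samples})
    for held_out in floors:
        train = [s for s in samples if s["floor"] != held_out]
        test = [s for s in samples if s["floor"] == held_out]
        yield held_out, train, test
```

With four floors this yields four splits, each training on three floors and testing on the remaining one.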

The experiments were conducted using our new library for learning, inference and structure generation of generic SPNs. SPNs are still a new architecture, and only a few domain-specific implementations exist at the time of writing. Our library offers a toolbox for structure generation, learning and inference and enables quick application of SPNs to various domains. It integrates with Google’s TensorFlow, a popular framework implementing several deep models. This leads to an efficient solution capable of utilizing multiple GPU and CPU devices, and also provides the ability to combine SPN models with other architectures (such as Convolutional Nets) in an intuitive way. The presented experiments are as much an evaluation of the spatial model we propose as they are of our new library.

B. Semantic Place Categorization

Our first experiment evaluated the model in the typical scenario of semantic place categorization. The model was trained on the four known classes and evaluated on observations collected in places belonging to the same classes, but on different floors. The normalized confusion matrix for the categorization problem is shown in Fig. 4.

We see that the model obtains high classification rates for all classes. Most of the confusion exists between the small and large office classes. Offices in the dataset often have complex geometry that varies between room instances within the same class. The classification rate averaged over all classes (giving equal importance to each class in the average) is 94%.
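The class-averaged rate is the mean of the per-class recognition rates, which can be read off the diagonal of a row-normalized confusion matrix. A minimal sketch, with a made-up two-class matrix that is unrelated to the results in Fig. 4:

```python
def class_averaged_rate(confusion):
    """Mean of per-class rates; rows are true classes, columns are predictions."""
    rates = [row[k] / sum(row) for k, row in enumerate(confusion)]
    return sum(rates) / len(rates)


# Made-up counts: 100 samples of class 0 and only 10 samples of class 1.
toy_confusion = [[90, 10],
                 [5, 5]]
toy_rate = class_averaged_rate(toy_confusion)
```

In this toy case the class-averaged rate is (0.9 + 0.5) / 2 = 0.7, whereas the plain fraction of correct samples would be 95/110. Averaging over classes prevents a small class, such as the doorway, from being dominated by the larger ones.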

C. Novelty Detection

In the second experiment, we aimed to evaluate the quality of the uncertainty measure produced by the model and its ability to recognize outliers that belong to classes not known during training. We used the same model trained in the previous experiment and evaluated the marginal probability over all observations in the test set, from classes known during training and from novel classes. We then thresholded the probability for various values of the threshold and observed what percentage of samples from known and unknown classes would be considered novel and rejected.

Fig. 5: Percentage of samples considered novel for various values of log likelihood threshold and samples from both known and unknown classes.

Fig. 5 illustrates how the percentage of samples rejected as novel changes with the threshold. We can observe that some novel samples are rejected already for very low thresholds, when none of the samples from known classes are rejected. With increasing threshold, the model starts to consider also some samples from the known classes as novel. Still, the samples from unknown classes are substantially more often rejected. For instance, when the model correctly identifies 80.1% of unknown class samples as novel, only 30.6% of samples from known classes are incorrectly rejected.

D. Generating Observations of Places

In this experiment, we conditioned on the semantic class variable and inferred the MPE state of the observation variables. Our aim was to inspect what the model considers the most likely configuration of observation variables for each class.

The generated MPE states of polar occupancy grids for each class are shown in Fig. 6. We can compare these plots to actual examples of places depicted in Fig. 2. We can see that each polar grid is very characteristic of the class from which it was generated. The corridor is an elongated structure with occupied cells on both sides and a hint of what is likely to be the view into an open room in the middle. The doorway is depicted as a narrow structure with empty space on both sides. Despite the fact that, as shown in Fig. 2, large variability exists between the instances of offices within the same category, the generated observations of small and large offices clearly differ in size as well as shape.

While a model trained on laser range data offers limited information about the appearance of the local environment, we can expect interesting predictions from a similar model trained on 3D data.

Fig. 6: Results of MPE inference over the observation variables conditioned on the value of the semantic class variable, for (a) Corridor, (b) Doorway, (c) Small Office, and (d) Large Office. The generated polar occupancy grids can be seen as prototypical appearances of places from each semantic category.

Fig. 7: Normalized confusion matrix for the problem of semantic classification of partial observations.

E. Predicting Missing Observations

Our final experiment further exploited the generative properties of our model to address the problem of filling missing values in partial observations of places. To this end, for each test sample in the dataset, we obscured a random 90 degree view. This was achieved by setting all indicators for the obscured polar cells to 1 to indicate missing evidence. Then, we performed semantic categorization as in the first experiment in Sec. V-B and MPE inference over the missing values similarly to the experiment in Sec. V-D.
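Obscuring a view can be sketched as masking a contiguous angular sector of the polar grid. The grid layout (`grid[a][r]` for angular bin `a` and radial bin `r`), the use of `None` for missing evidence, the choice of a random starting bin, and the function name are assumptions of this example rather than a description of our implementation.

```python
import random


def obscure_view(grid, view_deg=90.0, rng=random):
    """Return a copy of `grid` with a random angular sector replaced by None.

    `grid[a][r]` is the state of the cell in angular bin a and radial bin r.
    A None cell carries no evidence, so all of its indicators would be set to 1.
    """
    n = len(grid)
    width = round(n * view_deg / 360.0)        # number of angular bins to hide
    start = rng.randrange(n)                   # random orientation of the view
    hidden = {(start + k) % n for k in range(width)}
    return [[None] * len(col) if a in hidden else list(col)
            for a, col in enumerate(grid)]
```

For a grid with 56 angular bins, a 90 degree view hides 14 bins, that is, one quarter of the cells.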

Fig. 8 shows examples of polar grids filled with predicted occupancy information to replace the missing values. While the predictions of the model are often consistent with the original complete grid, the network does make mistakes. This typically happens when the obscured viewpoint removes critical information that would be used to determine the class of the place. For instance, in case of the bottom prediction for the doorway class, removing the viewpoint made the grid very similar to a small office. In case of the bottom prediction for the corridor class, the network completes the scan as if the corridor continued. However, in the environment in which the observation was captured, this was indeed the case and the original observation was partial even before it was obscured.

Overall, when averaged over all the test samples, the model correctly reconstructs 76.7% of obscured polar cells. To highlight how the missing data affects semantic categorization, we plot another confusion matrix for this experiment in Fig. 7. We can see that the recognition rates remain high despite the fact that 25% of each input occupancy grid is now obscured.
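The two quantities reported above could be computed as in the following sketch. It assumes that the reconstruction rate is the per-sample fraction of obscured cells whose predicted state matches the original, averaged over samples, and that the confusion matrix is normalized per true class (row). The function names and the tiny arrays are illustrative only and do not reproduce the reported numbers.

```python
import numpy as np


def reconstruction_accuracy(true_grids, pred_grids, hidden_masks):
    """Mean over samples of the fraction of hidden cells predicted correctly.

    All arguments have shape (samples, angles, radii); hidden_masks is boolean
    and must mark at least one hidden cell per sample.
    """
    true_grids = np.asarray(true_grids)
    pred_grids = np.asarray(pred_grids)
    hidden_masks = np.asarray(hidden_masks, dtype=bool)
    n = true_grids.shape[0]
    correct = ((true_grids == pred_grids) & hidden_masks).reshape(n, -1).sum(axis=1)
    total = hidden_masks.reshape(n, -1).sum(axis=1)
    return float(np.mean(correct / total))


def normalized_confusion(true_labels, pred_labels, num_classes):
    """Confusion matrix whose rows (true classes) each sum to 1."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    counts = np.zeros((num_classes, num_classes))
    np.add.at(counts, (true_labels, pred_labels), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of classes that never occur are left as zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


# Tiny made-up example: two 2x2 grids, one hidden row in each.
true_grids = np.array([[[0, 1], [1, 1]], [[1, 0], [0, 0]]])
pred_grids = np.array([[[0, 1], [0, 1]], [[1, 1], [0, 0]]])
hidden = np.array([[[False, False], [True, True]],
                   [[True, True], [False, False]]])
accuracy = reconstruction_accuracy(true_grids, pred_grids, hidden)

confusion = normalized_confusion([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], num_classes=3)
print(accuracy)
print(confusion)
```

When every sample hides the same number of cells, as with a fixed 90 degree sector, the per-sample average coincides with the fraction pooled over all obscured cells.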

F. Discussion

The experiments presented in this paper clearly demonstrate the potential of our generative spatial model. Our current implementation in LibSPN runs in real time during inference and is very efficient during learning. As a result, it should be possible to extend the model to include additional modalities and capture visual appearance as well as 3D structure of the environment.

The experiments with different inference types were all performed on the same model after a single training phase (separately for each dataset split). This demonstrates that our model spans not only multiple levels of abstraction of spatial knowledge, but also multiple tasks and applications. In particular, the model retains high capability to discriminate between semantic classes, while being trained generatively to represent a joint distribution over low-level observations.

Finally, we observe that the model can be successfully trained with a limited number of data samples. The training set in our experiments contained between 250 and 350 data samples per class, with the exception of the doorway, for which only 30–50 samples were available. Despite these limitations, the model obtains good results even in the presence of noise in the data and when tested on samples from previously unseen floors in the environment.

VI. CONCLUSIONS AND FUTURE WORK

This paper represents, to our knowledge, the first application of sum-product networks to mobile robotics. Our results demonstrate that these networks provide an efficient framework for learning deep probabilistic representations of robotic environments, spanning low-level features, geometry, and semantic representations. We have shown that SPNs can be used to solve a variety of important robotic tasks, from semantic classification of places and uncertainty estimation to novelty detection and generation of place appearances based on semantic information. While our results were based on laser range data, the approach is readily applicable to learning rich hierarchical representations from RGBD or 2D visual data. Our future efforts will explore such learning of robot environments as well as exploit the resulting deep representations for probabilistic reasoning and planning at multiple levels of abstraction.

(a) Corridor (b) Doorway (c) Small Office (d) Large Office

Fig. 8: Examples of successful and unsuccessful completions of place observations with missing data grouped by true semantic category. For each pair of polar grids, the complete grid is shown on the left, while the inferred grid is shown on the right. The 90 degree viewpoint that was obscured and filled with predictions is highlighted.

REFERENCES

[1] A. Aydemir, A. Pronobis, M. Gobelbecker, and P. Jensfelt, “Active visual object search in unknown environments using uncertain semantics,” T-RO, vol. 29, no. 4, 2013.

[2] A. Pronobis and P. Jensfelt, “Large-scale semantic mapping and reasoning with heterogeneous modalities,” in ICRA, 2012.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.

[4] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” in ISER, 2016.

[5] N. Sunderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with ConvNet landmarks: Viewpoint-Robust, Condition-Robust, Training-Free,” in RSS, 2015.

[6] N. Sunderhauf, F. Dayoub, S. McMahon, B. Talbot, R. Schulz, P. Corke, G. Wyeth, B. Upcroft, and M. Milford, “Place categorization and semantic mapping on a mobile robot,” in ICRA, 2016.

[7] R. Gens and P. Domingos, “Discriminative learning of sum-product networks,” in NIPS, 2012.

[8] R. Peharz, G. Kapeller, P. Mowlaee, and F. Pernkopf, “Modeling speech with sum-product networks: Application to bandwidth extension,” in ICASSP, 2014.

[9] M. Amer and S. Todorovic, “Sum product networks for activity recognition,” PAMI, 2015.

[10] A. Oliva and A. Torralba, “Building the gist of a scene: The role of global image features in recognition,” Brain Research, vol. 155, 2006.

[11] A. C. Murillo and J. Kosecka, “Experiments in place recognition using gist panoramas,” in ICCV Workshop on Omnidirectional Vision, 2009.

[12] M. Artac, M. Jogan, and A. Leonardis, “Mobile robot localization using an incremental eigenspace model,” in ICRA, 2002.

[13] I. Ulrich and I. Nourbakhsh, “Appearance-based place recognition for topological localization,” in ICRA, 2000.

[14] A. Ranganathan, “PLISS: Detecting and labeling places using online change-point detection,” in RSS, 2010.

[15] J. Kosecka, L. Zhou, P. Barber, and Z. Duric, “Qualitative image based localization in indoors environments,” in CVPR, 2003.

[16] P. Buschka and A. Saffiotti, “A virtual sensor for room detection,” in IROS, 2002.

[17] O. M. Mozos, C. Stachniss, and W. Burgard, “Supervised learning of places from range data using AdaBoost,” in ICRA, 2005.

[18] E. A. Topp and H. I. Christensen, “Topological modelling for human augmented mapping,” in IROS, 2006.

[19] O. M. Mozos, R. Triebel, P. Jensfelt, A. Rottmann, and W. Burgard, “Supervised semantic labeling of places using information extracted from sensor data,” RAS, vol. 55, no. 5, 2007.

[20] A. Tapus and R. Siegwart, “Incremental Robot Mapping with Fingerprints of Places,” in ICRA, 2005.

[21] A. Pronobis, O. M. Mozos, B. Caputo, and P. Jensfelt, “Multi-modal semantic place classification,” IJRR, vol. 29, no. 2-3, Feb. 2010.

[22] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” PAMI, vol. 35, no. 8, 2013.

[23] G. E. Hinton and S. Osindero, “A fast learning algorithm for deep belief nets,” NC, 2006.

[24] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view RGB-D object dataset,” in ICRA, 2011.

[25] A. Eitel, J. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust rgb-d object recognition,” in IROS, 2015.

[26] M. Madry, L. Bo, D. Kragic, and D. Fox, “ST-HMP: Unsupervised Spatio-Temporal Feature Learning for Tactile Data,” in ICRA, 2014.

[27] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” IJRR, vol. 34, no. 4-5, 2015.

[28] W.-C. Cheng, S. Kok, H. V. Pham, H. L. Chieu, and K. M. A. Chai, “Language modeling with Sum-Product networks,” in Interspeech, 2014.

[29] H. Poon and P. Domingos, “Sum-product networks: A new deep architecture,” in UAI, 2011.

[30] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos, “On theoretical properties of sum-product networks,” in AISTATS, 2015.

[31] R. Gens and P. Domingos, “Learning the structure of sum-product networks,” in ICML, 2013.

[32] A. Rooshenas and D. Lowd, “Learning Sum-Product networks with direct and indirect variable interactions,” in ICML, 2014.

[33] A. Darwiche, “A differential approach to inference in bayesian networks,” Journal of the ACM, vol. 50, no. 3, May 2003.

[34] G. Grisetti, C. Stachniss, and W. Burgard, “Improved techniques for grid mapping with rao-blackwellized particle filters,” IEEE ToR, vol. 23, no. 1, 2007.
