

Hierarchical Semantic Parsing for Object Pose Estimation in Densely Cluttered Scenes

Chi Li, Jonathan Bohren, Eric Carlson, Gregory D. Hager

Abstract— Densely cluttered scenes are composed of multiple objects which are in close contact and heavily occlude each other. Few existing 3D object recognition systems are capable of accurately predicting object poses in such scenarios. This is mainly due to the presence of objects with textureless surfaces and similar appearances, and to the difficulty of object instance segmentation. In this paper, we present a hierarchical semantic segmentation algorithm which partitions a densely cluttered scene into different object regions. A RANSAC-based registration method is subsequently applied to estimate 6-DoF object poses within each object class. Part of this algorithm includes a generalized pooling scheme used to construct robust and discriminative object representations from a convolutional architecture with multiple pooling domains. We also provide a new RGB-D dataset which serves as a benchmark for object pose estimation in densely cluttered scenes. This dataset contains five thousand scene frames and over twenty thousand labeled poses of ten common hand tools. We show that our method improves the performance of pose estimation on this new dataset compared with other state-of-the-art methods.

I. INTRODUCTION

Robust object pose estimation is necessary for any robot control system to interpret and reason about its immediate surroundings before interacting with them. In many robotic applications such as automated manufacturing, physical manipulation and daily indoor home services, the capabilities of the robot are limited most dramatically by the capabilities of its object recognition systems. As such, implementers are forced to either settle for limited capabilities or engineer the environment to better accommodate the perception algorithms.

The first challenge involves recognizing objects without constraining their geometry or surface properties. Various object recognition methods take advantage of invariant local feature descriptors such as SIFT [18] and CSHOT [28] to recognize a small set of objects with discriminative texture or 3D geometry [21], [1], [29], [26], [32]. Unfortunately, the performance of local surface matching degrades significantly for textureless objects, dense clutter, or large viewpoint discrepancies between the training and testing data. Other approaches match object templates to estimate the poses of textureless objects [13], [7], [27]. However, they are not robust to the occlusion caused by dense clutter and scale poorly for the recognition of a large number of objects. Thus, few existing algorithms can reliably estimate 6-DoF poses of

This work is supported by the National Science Foundation under Grant No. NRI-1227277. This work is also supported by the National Aeronautics and Space Administration under Grant No. NNX12AM45H.

The authors are with the Computational Interaction and Robotics Laboratory, Johns Hopkins University, USA.

Fig. 1: An example of a densely cluttered scene composed of objects with ambiguous appearances and in close contact.

object instances in densely cluttered scenes (one example is shown in Fig. 1).

Another challenge in 3D perception is to discriminate between objects with similar textures and shapes, such as the drills shown in Fig. 1. Most of the hand-crafted global features [13], [1] for RGB-D data are not capable of capturing visual details while preserving an invariant object representation. Recently, convolutional architectures [14], [17], [23], [6] have demonstrated substantial progress in fine-grained recognition for hundreds of object instances. However, it is non-trivial to adapt these large-scale methods for precise pose estimation because they are only optimized for object classification.

Our previous framework [16] introduces a synergistic integration of large-scale recognition architectures into rigid 6-DoF object pose estimation. An overview of this framework is shown in Fig. 2. We redesign the multi-domain pooling architecture [17] for semantic segmentation and subsequently apply RANSAC-based registration techniques to estimate object poses for each semantic class. However, the scene background needs to be removed for high segmentation accuracy, and the RGB-D feature reported in [17] is sensitive to changes in lighting conditions.

In this paper, we present three innovations to improve our previous semantic segmentation algorithm for object pose estimation. First, we introduce a generalized pooling method to construct pooled features from two new pooling domains in high-dimensional spaces: SIFT [18] and FPFH [22]. This improves the robustness of color pooling [17]


Fig. 2: Overview of the object pose estimation framework.

to illumination changes. Second, we design a generic region hierarchy based on 3D supervoxels [20] which contains object patterns at different spatial scales. Third, the multi-domain pooled features are efficiently propagated through the region hierarchy, and the semantic labels of all regions are blended for the semantic segmentation.

We quantitatively and qualitatively evaluate our pose estimation algorithm on a new dataset that contains complex configurations of ten common hand tools in densely cluttered scenes. More than twenty thousand labeled object poses are provided for five thousand testing frames. To our knowledge, this is the first dataset designed specifically for fine-grained object recognition and pose estimation in densely cluttered scenes.

The remainder of this paper is organized as follows. Sec. II provides a review of semantic segmentation and pose estimation methods. Sec. III introduces the new feature. Sec. IV presents the hierarchical semantic parsing algorithm, followed by the model registration in Sec. V. Experiments are presented in Sec. VI and we conclude the paper in Sec. VII.

II. RELATED WORK

Most object pose estimation algorithms generate object hypotheses by matching local surfaces. Hough voting [29], [27] and hand-crafted global descriptors [26], [32] have been applied to select valid object hypotheses. More principled approaches propose an optimization framework to search for the subset of hypotheses that explains the scene best [1], [2]. These methods handle occlusion well but work poorly for objects with textureless or similarly textured surfaces. On the other hand, template-based algorithms such as [13], [7] avoid the reliance on local discriminative features by matching global object templates with sampled sliding windows on RGB-D images. Unfortunately, these methods are sensitive to the partial occlusion caused by dense clutter. Moreover, they do not scale well in recognizing multiple object instances in terms of time and accuracy, particularly when the objects have similar appearances.

Semantic scene segmentation is another well-studied topic. One popular pipeline makes use of Conditional Random Fields (CRFs) to model the pairwise relationships between adjacent local patterns and optimize the overall labeling [31], [12]. However, most CRF implementations are limited to low-order graphs for efficiency, which prevents them from modeling patterns over large image areas. Another semantic scene parsing algorithm [10] classifies object proposals generated from a hierarchical scene segmentation tree. A 3-level semantic parsing hierarchy presented in [3] is composed of pixels, supervoxels, and whole instance segments. Unfortunately, these image region hierarchies depend on bottom-up grouping criteria based on strong assumptions such as convex object surfaces. This makes them unable to generalize to broader classes of objects. More importantly, these methods ignore partial object surfaces lying between local regions and global whole-object segments, which makes them sensitive to occlusion.

Recently, state-of-the-art image features learned by Convolutional Neural Networks (CNNs) [14] have been adapted for RGB-D large-scale object recognition [23], instance segmentation [24] and semantic segmentation [11]. Nevertheless, the spatial pooling domain used by CNNs is more sensitive to 3D rotations than some alternative domains such as color [17]. In this paper, we further explore this general approach by employing two more pooling domains induced by the SIFT [18] and FPFH [22] feature spaces, which yields a new representation for RGB-D data that is more robust to 3D rotations and illumination changes.

III. MULTI-DOMAIN FEATURE POOLING

A typical convolutional architecture involves two key steps: convolution of local filters over input signals and pooling of filter responses within predefined neighborhoods. While color-based pooling is capable of greatly reducing the variability of raw input signals caused by 3D rotation [17], it is still sensitive to changes in lighting conditions. In this work, we explore two further pooling spaces, SIFT [18] (gradient) and FPFH [22] (3D geometry), that are invariant to illumination variations and 3D rotation. Traditional pooling on the spatial domain cannot be directly applied to high-dimensional pooling domains like SIFT and FPFH due to the exponentially growing number of pooling bins. Therefore, we present a generalized pooling approach based on K-means and nearest neighbor search that works for arbitrary pooling domains.

A. Local Filters and Responses

We first briefly explain how to compute local filters and responses for RGB-D data. We construct the 3D local filters for pooling using the same method described in [17]. First, we use CSHOT [28], a rotationally invariant 3D feature, as the local descriptor of a 3D point $p_i$ in a point cloud $P = \{p_1, \cdots, p_n\}$. Next, two separate dictionaries $D_c$ and $D_d$ with $K$ filters (e.g. codewords) are learned for the L2-normalized color and depth components of CSHOT (denoted as $z_c$ and $z_d$). We do so by randomly sampling raw CSHOT features across different object classes as the training data and running hierarchical K-means. In turn, given a test point $p_i$, we compute its responses with respect to both CSHOT dictionaries by transforming each of its CSHOT components $z$ (e.g. $z_c$ or $z_d$) into a response vector $\mu = \{\mu_1, \cdots, \mu_j, \cdots, \mu_K\}$ via the soft-assignment encoder [30]:


$$\mu_j = \frac{\exp(\beta\, \hat{d}(z, d_j))}{\sum_{k=1}^{K} \exp(\beta\, \hat{d}(z, d_k))}, \quad \text{s.t.} \quad \hat{d}(z, d_k) = \begin{cases} d(z, d_k) & : d_k \in \mathcal{N}_k(z) \\ +\infty & : d_k \notin \mathcal{N}_k(z) \end{cases} \tag{1}$$

where $\mathcal{N}_k(z)$ denotes the $k$-nearest neighbors of a raw CSHOT component $z$ under the Euclidean distance $d(z, d_k)$ between $z$ and the $k$th codeword $d_k$, and $\beta$ is a smoothing parameter with a negative value. Finally, the local response $x_i$ for $p_i$ is constructed by concatenating the responses $\mu_c$ and $\mu_d$ for $z_c$ and $z_d$: $x_i = [\mu_c, \mu_d]$.
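As a concrete illustration, the following minimal Python/NumPy sketch (our rendering, not the authors' code) computes the soft-assignment response of Eq. 1 for one CSHOT component; the dictionary and the values of $\beta$ and $k$ are placeholders, and the nearest-neighbor masking implements the case distinction in $\hat{d}$.

```python
import numpy as np

def soft_assign(z, dictionary, beta=-10.0, k=20):
    """Soft-assignment encoding of one CSHOT component z (Eq. 1).

    z          -- raw CSHOT component, shape (d,)
    dictionary -- K codewords, shape (K, d)
    beta       -- negative smoothing parameter (value here is a placeholder)
    k          -- number of nearest codewords kept active
    """
    dist = np.linalg.norm(dictionary - z, axis=1)   # d(z, d_j) for every codeword
    # hat-d: distances outside the k-nearest neighbors become +inf.
    d_hat = np.full_like(dist, np.inf)
    nearest = np.argsort(dist)[:k]
    d_hat[nearest] = dist[nearest]
    # Since beta < 0, exp(beta * inf) evaluates to 0, so masked codewords vanish.
    resp = np.exp(beta * d_hat)
    return resp / resp.sum()
```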

B. Generalized Pooling

Consider a set of pooling domains $S = \{S^t\}$ and a point cloud $P = \{p_i\}$ with its corresponding local responses $X = \{x_i\}$ (CSHOT filter responses in our case). Our objective is to determine a method to pool $X$ in $S$. We define a pooling pair $\langle c_i^t, x_i \rangle$ for a 3D point $p_i$, where the pooling indicator $c_i^t \in S^t$ is used to direct the local response $x_i$ to a specific pooling region in $S^t$. For example, if $S^t$ denotes the SIFT feature space, $c_i^t$ is a SIFT feature descriptor of $p_i$. Next, we consider a set of region seeds $W^t = \{w_j^t \mid w_j^t \in S^t \wedge 1 \le j \le M\}$ as the representation of $M$ pooling regions $R^t = \{r_1^t, \cdots, r_M^t\}$ in $S^t$. The $i$th pooling region $r_i^t$ is defined as:

$$r_i^t = \{w \mid (w \in S^t) \wedge (i = \arg\min_j \|w - w_j^t\|)\} \tag{2}$$

In order to capture richer visual characteristics, we can deploy multiple seed sets $\{W_k^t\}$ in one pooling domain to build a set of pooling regions $\{R_k^t\}$ with different pooling granularities [17]. We note that spatial and color pooling are special cases of the above generic pooling scheme. In color pooling, for example, $c_i^t$ is the color value of $p_i$ and $W^t$ is the set of centers of gridded cells in color space. The pyramid structure over the spatial or color domain can also be constructed by setting $\{W_k^t\}$ as the centers of the pooling cells at each level.

Next, the sum-pooling operator is applied to sum over all local responses that fall into the same pooling region. Thus, the pooled representation $Y(R_k^t)$ for $R_k^t$ is a concatenation of L2-normalized pooled features from all $M$ pooling regions $R_k^t = \{r_{1,k}^t, \cdots, r_{M,k}^t\}$: $Y(R_k^t) = [Y(r_{1,k}^t), \cdots, Y(r_{M,k}^t)]$. The $i$th component $Y(r_{i,k}^t)$ is computed as follows:

$$Y(r_{i,k}^t) = \frac{\tilde{Y}(r_{i,k}^t)}{\|\tilde{Y}(r_{i,k}^t)\|_2}, \quad \text{s.t.} \quad \tilde{Y}(r_{i,k}^t) = \sum_{c_i^t \in r_{i,k}^t} x_i \tag{3}$$
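To make the pooling operator concrete, here is a minimal sketch of Eqs. 2-3 for a single seed set $W_k^t$ (an assumed NumPy rendering, not the paper's implementation); it uses a brute-force nearest-seed search where a real system would use an index over the seeds.

```python
import numpy as np

def generalized_sum_pool(indicators, responses, seeds):
    """Generalized pooling over one seed set W_k^t (Eqs. 2-3).

    indicators -- pooling indicators c_i^t in domain S^t, shape (n, d_t)
                  (e.g. SIFT or FPFH descriptors of the n points)
    responses  -- local filter responses x_i, shape (n, d_x)
    seeds      -- M region seeds w_j^t learned by K-means, shape (M, d_t)
    Returns the concatenation of per-region L2-normalized features, (M*d_x,).
    """
    M = seeds.shape[0]
    # Eq. 2: route each indicator to its nearest region seed (brute force).
    dists = np.linalg.norm(indicators[:, None, :] - seeds[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)                    # shape (n,)
    pooled = np.zeros((M, responses.shape[1]))
    for j in range(M):
        pooled[j] = responses[assign == j].sum(axis=0)   # sum-pooling
    # Eq. 3: L2-normalize each region's pooled vector before concatenation.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    pooled = np.where(norms > 0, pooled / norms, pooled)
    return pooled.ravel()
```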

Finally, we combine the pooled features in all pooling domains to form the final representation $Y$:

$$Y = [\,Y(R_1^{\mathrm{LAB}}), \cdots, Y(R_{N_{\mathrm{LAB}}}^{\mathrm{LAB}}),\; Y(R_1^{\mathrm{SIFT}}), \cdots, Y(R_{N_{\mathrm{SIFT}}}^{\mathrm{SIFT}}),\; Y(R_1^{\mathrm{FPFH}}), \cdots, Y(R_{N_{\mathrm{FPFH}}}^{\mathrm{FPFH}})\,] \tag{4}$$

where $N_{\mathrm{LAB}}$, $N_{\mathrm{SIFT}}$ and $N_{\mathrm{FPFH}}$ are the numbers of sets of pooling regions in the LAB, SIFT and FPFH domains, respectively.

Fig. 3: The flowchart of the hierarchical semantic parsing.

IV. HIERARCHICAL SEMANTIC PARSING

The pipeline of our hierarchical semantic parsing algorithm is shown in Fig. 3. We first present a hierarchy of region proposals which avoids relying on the region merging heuristics used in most scene segmentation techniques [10], [9]. This region hierarchy explores a larger set of partial object regions that span from local to global patterns. The multi-domain pooled features (Sec. III) are efficiently propagated through this generic region hierarchy (Sec. IV-A), and the semantic labels of regions at all scales are combined for robust semantic segmentation.

Our semantic segmentation algorithm is hierarchical along two dimensions. First, the semantic label for each supervoxel is evaluated and fused in various surrounding contexts based on the region hierarchy. Second, the semantic parsing is conducted in a hierarchical order, where the LAB-pooled feature is applied over the whole scene to extract foreground regions and the full multi-domain pooled feature (Sec. III) is subsequently extracted on the predicted foreground for the multi-class object classification. This two-stage process is motivated by the fact that LAB pooling is more computationally efficient than SIFT and FPFH pooling, since no additional local feature computation or nearest neighbor search among region seeds is required. We now describe each element of this algorithm in more detail.

A. Hierarchy of Multi-Scale Region Proposals

Consider an undirected graph $G = \{V, E\}$ whose vertexes and edges correspond to supervoxels and their adjacency connections returned by the 3D segmentation algorithm [20]. We define an order-$k$ region set $O^k = \{o_i^k\}$ in which each region $o_i^k$ is a connected subgraph of $G$ containing exactly $k$ vertexes (i.e. $k$ supervoxels that are path connected). As a result, $O^1$ contains all raw individual supervoxels and $O^2$ includes all connected pairs. By induction, we can construct any higher order set $O^k$ ($k > 1$) from $O^{k-1}$ by growing each region $o_i^{k-1} \in O^{k-1}$ with all connected supervoxel neighbors, one at a time, and removing all region duplicates during this process. We summarize this generic region growing algorithm in Algorithm 1. Finally, the generic region hierarchy $H$ is composed of all order sets $H = \{O^1, \cdots, O^N\}$ with


Algorithm 1 Generic Region Growing

Input: $O^{k-1}$ and graph $G = \{V, E\}$
Initialize $O^k = \emptyset$
for each region proposal $o_i^{k-1} \in O^{k-1}$ do
  for each supervoxel $v_j \in o_i^{k-1}$ do
    for each neighbor $v_k$ of $v_j$ s.t. $\langle v_j, v_k \rangle \in E$ do
      if $(v_k \notin o_i^{k-1}) \wedge (o_i^{k-1} \cup v_k \notin O^k)$ then
        $O^k = O^k \cup (o_i^{k-1} \cup v_k)$
      end if
    end for
  end for
end for
Output: $O^k$

$N$ orders. We note that region proposals in $H$ are unique and may share common supervoxels, which is different from other methods such as [10].
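Algorithm 1 translates almost directly into code. The sketch below assumes supervoxels are integer ids and that adjacency comes from the supervoxel graph of [20]; representing each region as a frozenset makes the duplicate check automatic.

```python
def grow_regions(prev_order, adjacency):
    """One step of Algorithm 1: build the order-k set O^k from O^(k-1).

    prev_order -- iterable of frozensets of supervoxel ids (regions in O^(k-1))
    adjacency  -- dict: supervoxel id -> set of neighboring supervoxel ids
    Returns the set of unique order-k regions (frozensets of k ids).
    """
    next_order = set()
    for region in prev_order:
        for v in region:
            for neighbor in adjacency[v]:
                if neighbor not in region:
                    # Adding frozensets to a set removes region duplicates.
                    next_order.add(region | {neighbor})
    return next_order

def build_hierarchy(supervoxel_ids, adjacency, max_order):
    """Region hierarchy H = {O^1, ..., O^N} over the supervoxel graph."""
    orders = [{frozenset([v]) for v in supervoxel_ids}]  # O^1
    for _ in range(max_order - 1):
        orders.append(grow_regions(orders[-1], adjacency))
    return orders
```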

B. Propagation of Multi-Domain Pooled Features

Consider two regions $o_p$ and $o_q$ from arbitrary order sets in the hierarchy $H$. We compute the pooled feature $Y_{p+q}(r_{i,k}^t)$ in the pooling region $r_{i,k}^t \in R_k^t$ for the combined region $o_p \cup o_q$ as follows:

$$Y_{p+q}(r_{i,k}^t) = \frac{\tilde{Y}_p(r_{i,k}^t) + \tilde{Y}_q(r_{i,k}^t)}{\|\tilde{Y}_p(r_{i,k}^t) + \tilde{Y}_q(r_{i,k}^t)\|_2} \tag{5}$$

where $\tilde{Y}_p(r_{i,k}^t)$ and $\tilde{Y}_q(r_{i,k}^t)$ are the unnormalized pooled features (defined in Eq. 3) for $o_p$ and $o_q$ respectively. Finally, the concatenation of propagated features from all region seeds and domains forms the representation $Y_{p+q}$ for $o_p \cup o_q$. Therefore, once we extract the pooled features for the order-1 set (raw individual supervoxels), Eq. 5 can be recursively applied to compute features for regions in higher order sets. Note that this feature propagation scheme does not work for spatially pooled features [14], [6], [23] because spatial pooling regions would need to be reconstructed for each new merged region.
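Since sum-pooling is additive, merging two regions reduces to one vector addition plus re-normalization. Below is a minimal sketch of Eq. 5, assuming the unnormalized per-region features from Eq. 3 are cached alongside the normalized ones.

```python
import numpy as np

def propagate_pooled(unnorm_p, unnorm_q):
    """Eq. 5: pooled feature of the merged region o_p U o_q.

    unnorm_p, unnorm_q -- unnormalized pooled features from Eq. 3,
                          shape (M, d_x) each, for the same seed set.
    Returns (merged unnormalized feature, merged L2-normalized feature).
    """
    merged = unnorm_p + unnorm_q          # sum-pooling is additive
    norms = np.linalg.norm(merged, axis=1, keepdims=True)
    normalized = np.where(norms > 0, merged / norms, merged)
    return merged, normalized
```

Caching the unnormalized features is what makes the recursion over the hierarchy cheap: each higher-order region costs only a vector addition and normalization, regardless of how many supervoxels it contains.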

C. Two-Stage Semantic Segmentation

We design a two-stage semantic parsing process as shown in Fig. 3. First, we build a shallow hierarchy $H_f$ starting from the order-1 set over a given scene and only use LAB-pooled features to select the supervoxels belonging to the foreground class. Subsequently, a deeper hierarchy $H_m$ is established over the foreground regions, and the more time-consuming SIFT and FPFH pooling are added along with the previously extracted LAB-pooled features to jointly classify the different object classes. We deploy a deep hierarchy for $H_m$ in order to mine discriminative features from large image regions for fine-grained object classification.

Separate Support Vector Machines (SVMs) are trained with respect to each order set $O^k$ in both $H_f$ and $H_m$. We randomly sample region proposals from $O^k$ in each object sample or background scene as the training data. All regions from object samples contribute as positive data to train the foreground/background SVM, and no background region is involved in the multi-class SVM training. At test time, each region proposal $o_i^k$ receives a vector of object class responses from the SVM for the $k$th order set. Finally, the class label of each supervoxel is inherited from the region proposal with the maximum SVM score among all region proposals which contain it.
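The final label blending step can be sketched as follows (hypothetical names; for brevity each region carries only its best class and score rather than the full SVM response vector).

```python
import numpy as np

def fuse_supervoxel_labels(regions, scores, num_supervoxels):
    """Each supervoxel inherits the class of the highest-scoring
    region proposal that contains it.

    regions -- list of frozensets of supervoxel ids (all orders, all sets)
    scores  -- list of (class_id, svm_score) tuples aligned with regions
    """
    best_score = np.full(num_supervoxels, -np.inf)
    label = np.full(num_supervoxels, -1, dtype=int)
    for region, (cls, score) in zip(regions, scores):
        for v in region:
            if score > best_score[v]:
                best_score[v], label[v] = score, cls
    return label
```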

V. POSE ESTIMATION USING OBJRECRANSAC

After semantic parsing, the original scene point cloud is partitioned into regions with homogeneous semantic labels. Similar to [16], we estimate the object poses for each segmented semantic class with an open-source RANSAC-based object recognition system, "ObjRecRANSAC" [19]1. To reduce the number of false positives returned from ObjRecRANSAC, a simple non-maximum suppression step is carried out to filter inaccurate pose estimates. Consider the function $Q(q, M, P)$ that indicates the confidence score of the pose $q$ for the model $M$ supported by the scene cloud $P$:

$$Q(q, M, P) = \sum_{v_i \in M} \mathbb{1}\left(\min_{p_i \in P} \|T(v_i, q) - p_i\|_2 < \delta_D\right) \tag{6}$$

where $v_i$ is a vertex in the object mesh $M$ and $T(v_i, q)$ transforms $v_i$ based on the pose $q$, and $\delta_D$ is a pre-set parameter. Both $M$ and $P$ can be represented by voxel grids, which makes the function $Q(q, M, P)$ highly efficient in practice. We reject the hypothesis with the lower score from any pair of hypotheses whose projected 2D intersection is more than 50% of its union. The remaining hypotheses are the final poses estimated by the algorithm for the scene.
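A sketch of the confidence score in Eq. 6 is shown below; it substitutes a k-d tree for the voxel-grid lookup described in the text, and the $\delta_D$ value is a placeholder.

```python
import numpy as np
from scipy.spatial import cKDTree

def pose_confidence(model_vertices, pose, scene_points, delta_d=0.01):
    """Eq. 6: count model vertices within delta_d of the scene cloud.

    model_vertices -- (V, 3) mesh vertices of model M
    pose           -- (4, 4) homogeneous transform q
    scene_points   -- (N, 3) scene cloud P
    delta_d        -- inlier distance threshold (value is an assumption)
    """
    # T(v_i, q): apply the rigid transform to every vertex.
    transformed = model_vertices @ pose[:3, :3].T + pose[:3, 3]
    # Nearest scene point per transformed vertex (k-d tree stands in for
    # the voxel-grid representation used in the paper).
    dists, _ = cKDTree(scene_points).query(transformed)
    return int((dists < delta_d).sum())
```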

VI. EXPERIMENTS

We test the multi-domain pooled feature introduced in Sec. III on the UW-RGBD Object Dataset [15] and evaluate our pose estimation algorithm on our new JHUScene-50 dataset. For efficiency, we downsample raw point clouds via octree binning with a leaf size of 0.003m. We set the radius for normal estimation, CSHOT and FPFH descriptors to 0.02m, and both descriptors are L2-normalized. The CSHOT dictionaries are the same as the ones used in [17]. We induce a sparse representation by setting the number of nearest neighbors to $k = 20$ for soft encoding (10% of the total size of the dictionary), which enables fast vector addition and multiplication in the feature propagation (Eq. 5).

We set $N_{\mathrm{LAB}} = 5$, $N_{\mathrm{SIFT}} = 1$ and $N_{\mathrm{FPFH}} = 1$ in Eq. 4. Randomly sampled SIFT or FPFH features from UW-RGBD are clustered via hierarchical K-means to learn 400 region seeds for each of the FPFH and SIFT pooling domains, respectively. For SIFT pooling, we run multiple SIFT keypoint detectors with $\sigma \in (0.7, 1.6)$ and no response threshold for edges and contrast regions2. Keypoints with valid CSHOT codes are pooled according to their SIFT descriptors.

For our two-stage semantic segmentation algorithm, we built a 3-order region hierarchy $H_f$ on the whole scene and

1 One reference CUDA-based implementation is available for academic use under an open-source license: http://github.com/tum-mvp/ObjRecRANSAC.git.

2 We use the non-free SIFT implementation in OpenCV.


Algorithm           Acc.    Algorithm    Acc.
RF [15]             73.1    LAB          89.1
CKM Desc. [4]       90.8    FPFH         74.5
Kernel Desc. [5]    93.0    SIFT         88.5
HMP-RGBD [6]        92.8    LAB+FPFH     93.6
CNN [23]            94.1    LAB+SIFT     94.7
MSMD-All [17]       94.1    ALL          95.7

TABLE I: Instance recognition accuracy (%) on UW-RGBD.

a 5-order region hierarchy $H_m$ on the extracted foreground regions3. We do not train an SVM for the order-1 set in $H_m$ because individual supervoxels are generally too local to be discriminative. For ObjRecRANSAC, object models are initialized separately for each class and the length of their oriented point pair features is set to 0.07m for all objects. Other parameters are kept the same as those reported in the original implementation [19].

A. UW-RGBD Object Dataset

We first examine the performance of the generalized multi-domain pooled representation alone for large-scale object instance classification on the UW-RGBD object benchmark [15]. The UW-RGBD dataset contains 300 textured and textureless object instance classes. Table I reports the recognition rates of our methods and comparative algorithms on the test set under the leave-sequence-out setting. Variants of the proposed method are marked in bold type and named according to the pooling domains that are actually deployed. The results show that the combined pooled features from all three domains (i.e. LAB, FPFH and SIFT) achieve the best result of 95.7%, which improves over the current state of the art [17], [23] by 1.6%. This improvement is attributed to the use of more diverse pooling domains compared with [17]. Moreover, our choice of features in the two-stage semantic parsing (Sec. IV-C) is validated by the following two results. First, the LAB pooling domain individually outperforms SIFT and FPFH and is more efficient in computation. Thus, pooling features in the single LAB domain achieves a good balance between efficiency and accuracy for foreground extraction in the first stage. Second, SIFT and FPFH pooling are able to capture useful RGB-D patterns complementary to the LAB-pooled feature while being less sensitive to illumination changes, which leads to better performance of fine-grained multi-class classification compared with single LAB pooling.

B. JHUScene-50 Dataset

Only a few benchmarks for object pose estimation have been presented in the literature. The LINEMOD dataset [13] contains thousands of RGB-D images, but only a single pose is provided for a textureless object under almost no occlusion. The UW-RGBD pose benchmark [15] only consists of 1-DoF labeled poses for segmented objects. [2] offers 50 challenging scene frames composed of multiple objects in close contact, but no color information is provided for the object models. In this paper, we contribute a new scene dataset

3 An N-order hierarchy indicates that the first N order sets are used.

called JHUScene-50 that is designed to test 6-DoF pose estimation algorithms for generic objects in densely cluttered environments.

JHUScene-50 contains 50 scenes where each scene has at least three object instances from ten typical hand tools (shown in Fig. 4). For object modeling, we place each of the ten hand tools on an electric turntable and capture 900 RGB-D partial views as the training data per object, under both fixed and randomly sampled view points4. We refer readers to [17] for more details of the object data collection above. We generate a full 3D mesh for each object following procedures similar to [25]. A video sequence of 100 frames is recorded per scene by freely moving an RGB-D camera5 in one of five indoor environments including office workspaces, robot manipulation platforms and large containers. Each of the 5 indoor contexts contains ten scenes with multiple object instances densely cluttered in various ways. In addition, we capture a video sequence with 600 frames for each of the five indoor environments without any object in it, as the training data for the background class.

Each video sequence contains at least 3 hand tool instances that form complex scene clutter, where some objects are in partial and even complete occlusion in some frames. In order to facilitate the annotation of object poses, we place artificial plane markers in the background to compute rigid transformations between the first frame and every other frame. We then manually label the 6-DoF poses of all object instances in the first frames and propagate them to the remaining frames. In JHUScene-50, there are 22520 labeled object poses in total. Fig. 4c shows an example of the labeled object poses. Furthermore, we generate the groundtruth of the semantic segmentation of a point cloud by finding the nearest vertex of each 3D point among all the annotated poses. If the distance to the nearest neighbor is less than 0.01m, the class label of the corresponding pose is assigned to that point. In turn, all points without any assigned label belong to the background class.
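The groundtruth transfer described above can be sketched as follows (our rendering with assumed names; each model's mesh vertices are pre-transformed by its annotated pose, and label 0 denotes background).

```python
import numpy as np
from scipy.spatial import cKDTree

def label_scene(scene_points, posed_models, dist_thresh=0.01):
    """Semantic groundtruth by nearest-vertex assignment.

    posed_models -- list of (class_id, (V, 3) vertices already transformed
                    by the annotated pose), with class_id > 0
    Points farther than dist_thresh from every model stay background (0).
    """
    labels = np.zeros(len(scene_points), dtype=int)
    best = np.full(len(scene_points), np.inf)
    for cls, verts in posed_models:
        d, _ = cKDTree(verts).query(scene_points)
        # Assign where this model is within threshold and closest so far.
        closer = d < np.minimum(best, dist_thresh)
        labels[closer] = cls
        best = np.minimum(best, d)
    return labels
```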

1) Semantic Segmentation: We assume that robots know the perception background in advance. Consequently, the foreground/background SVM classifiers for all order sets in $H_f$ are specifically trained for each environment from the associated background data. All multi-class SVMs for $H_m$ are shared across the five indoor contexts. The performance of our semantic parsing algorithm is measured by the average precision and recall accuracies over all frames6.

Table II reports the average precision and recall rates of the foreground and all ten objects in the five indoor environments. Three major observations follow. First, the recall rates of the foreground extraction are mostly above 90% across different environments, which means that most foreground regions are extracted for the subsequent multi-class classification. The low precision of the foreground is compensated by the scene-model matching check via

4 Fixed view points are at 30, 45 and 60 degrees above the horizon.
5 PrimeSense Carmine 1.08 depth sensor.
6 precision = |{predictions} ∩ {groundtruth}| / |{predictions}|, recall = |{predictions} ∩ {groundtruth}| / |{groundtruth}|.


(a) 10 Hand Tools. (b) Example Scene. (c) Pose Groundtruth.

Fig. 4: Figures from left to right are object examples, a scene and its pose groundtruth in JHUScene-50.

Scene     Foreground (Precision / Recall)              All Objects (Precision / Recall)
          O1            O1+O2         O1+O2+O3        O2            O2+O3         O2+O3+O4      O2+O3+O4+O5
Office    70.8 / 94.6   74.1 / 95.2   75.3 / 95.6     70.3 / 69.3   72.3 / 73.2   74.2 / 72.1   74.5 / 73.6
Labpod    64.9 / 90.0   70.6 / 90.4   71.8 / 91.7     60.7 / 63.9   64.3 / 68.4   65.8 / 70.4   66.7 / 71.3
Barrett   47.7 / 90.9   57.4 / 89.3   61.3 / 88.9     55.5 / 53.6   57.9 / 57.3   58.7 / 58.8   59.7 / 59.0
Box       63.8 / 89.5   71.3 / 90.5   71.8 / 92.2     65.1 / 62.6   66.1 / 66.2   66.9 / 65.6   67.8 / 68.8
Shelf     61.1 / 90.6   70.6 / 91.4   71.8 / 91.6     72.4 / 68.7   74.7 / 72.3   75.5 / 73.5   75.9 / 74.1
Overall   61.6 / 91.1   68.8 / 91.4   70.0 / 92.0     64.5 / 63.6   66.8 / 67.5   67.3 / 68.8   67.6 / 69.3

TABLE II: Precision and recall (%) of the foreground and all objects at different orders in $H_f$ and $H_m$ in 5 indoor environments.

ObjRecRANSAC in the pose estimation stage, as we show in Sec. VI-B.2. Second, for both the foreground and the objects, precision and recall rates increase monotonically in most of the environments7 as higher order sets in $H_f$ or $H_m$ are deployed. This result indicates the effectiveness of our hierarchical semantic segmentation algorithm. Third, the object recall rates are mostly lower than the foreground recall by 20%~30%. This is mainly due to the difficulty of correctly classifying ambiguous local and partial surfaces between objects with similar appearances. However, according to [19], ObjRecRANSAC is able to estimate the correct pose from a partial view as long as more than 20% of the object surface is visible. As we show later, the 69.3% recall of object regions actually supports the pose estimation well on our dataset. Some qualitative results of the semantic segmentation are illustrated in Fig. 5.

2) Pose Estimation: Our last experiment tests the overall performance of our method for instance pose estimation. Following the same metric used in [16], we deem an estimated pose correct if the transformed object mesh has more than 70% surface overlap with the groundtruth. This criterion does not measure the matching between surface textures, because the accurate prediction of 3D object occupancy is the dominant factor in most perception scenarios such as object manipulation. In addition, multiple optimal solutions may be valid for objects with symmetrical structures in shape and texture under certain partial views. Given a test frame, we calculate the precision/recall8 of all estimated poses as the measurement

7 Only the foreground recall slightly degrades in "Barrett".
8 Precision = |{true positives}| / |{all predicted poses}|. Recall = |{true positives}| / |{all groundtruth poses}|.

of pose estimation performance.

We report the average precision and recall rates of different pose estimation algorithms in Table III. We can see that pure local matching methods [29], [2], [19] barely work on JHUScene-50, while our method "Object Segmentation + ObjRecRANSAC" is able to retrieve 73.1% of the correct poses with 81.6% precision. Moreover, we analyze how semantic segmentation helps the model registration by running ObjRecRANSAC in four different scenarios: the raw scene (no semantics), extracted foreground regions, classified object regions and groundtruth segmentation. From Table III, the vanilla ObjRecRANSAC [19] (no semantics) performs much worse than the other three variants of our method, which receive semantic support to different degrees. The foreground-based ObjRecRANSAC falls behind the object-based ObjRecRANSAC by roughly 40% in precision and 30% in recall. Last, the groundtruth segmentation enables ObjRecRANSAC to achieve the highest precision and recall, both above 93% on average. In conclusion, the more correct semantic information is provided from the scene, the better the pose estimation performs. This supports the basic motivation for our semantics-based pose estimation framework. The high precision/recall (94.6%/93.1%) based on the groundtruth segmentation shows a promising path for improving the current semantic parsing method to approach the perception requirements of real robotic systems.

3) Error Analysis and System Runtime: We first report that our algorithm yields at least one correct object pose in 99.5% of all 5000 frames. That means that only once in 200 frames would a robot be unable to interact


Algorithm                                   Precision / Recall
                                            Office        Labpod        Barrett       Box           Shelf         Overall
SHOT + Hough Voting [29]                    0.02 / 0.91   0.01 / 0.41   0.04 / 0.53   0.02 / 0.66   0.07 / 1.30   0.32 / 0.76
Hypotheses Verification [2]                 3.2 / 2.1     2.8 / 2.1     1.7 / 1.1     3.6 / 0.55    7.3 / 4.5     3.7 / 2.1
Vanilla ObjRecRANSAC [19]                   1.4 / 4.3     1.3 / 4.7     1.8 / 7.2     0.7 / 2.6     2.4 / 10.3    1.5 / 5.8
Foreground Segmentation + ObjRecRANSAC      41.8 / 45.2   38.9 / 41.0   41.8 / 43.4   43.3 / 43.4   48.3 / 48.2   42.8 / 44.3
Object Segmentation + ObjRecRANSAC          84.8 / 78.3   78.9 / 69.4   78.4 / 67.0   80.4 / 70.4   85.9 / 80.3   81.7 / 72.7
Groundtruth Segmentation + ObjRecRANSAC     96.3 / 95.8   89.6 / 86.2   96.5 / 95.1   94.3 / 93.2   96.3 / 95.3   94.6 / 93.1

TABLE III: Precision and recall (%) of the estimated object poses by different algorithms in 5 indoor contexts.

with any object in a scene. Fig. 5 shows some qualitative results of our semantic segmentation and pose estimation algorithms; more results can be found in the supplementary video. We can see from subfigures (a) to (g) in Fig. 5 that ObjRecRANSAC is able to yield correct object poses given an imperfect semantic segmentation. One failure case is also shown in the bottom-right corner of Fig. 5: the silver hammer and the yellow mallet behind it are not detected due to their similar appearances to other objects as well as the background clutter. One way to improve the current semantic classifiers is to retrain the SVMs with the false positives and false negatives that the original classifiers return on the training set.

The current runtime of the whole system is around 5 seconds per scene on average9. The most time-consuming parts are the CSHOT and FPFH feature computations, which can be dramatically accelerated with GPU-based implementations. Additionally, the estimated poses from our algorithm could be used to initialize a locally-stable object tracking system to continuously estimate object poses in a dynamic scene.

VII. CONCLUSION

In this paper, we present a semantic segmentation algorithm that partitions a scene into different object regions in which object poses are estimated by a standard model registration method. The three main conclusions of this work are: (1) the combination of pooled features from the LAB, SIFT and FPFH domains is robust to 3D transformations as well as illumination variations and achieves state-of-the-art performance for instance recognition; (2) the semantic segmentation becomes more accurate by fusing the predicted labels of region proposals at more diverse scales; (3) the RANSAC-based pose estimation method (i.e. ObjRecRANSAC) is significantly enhanced by correctly segmented object regions.

For future work, we can develop better features and semantics fusion schemes to improve the semantic segmentation and, in turn, the overall performance of pose estimation. Moreover, we can exploit temporal constraints in a data sequence to speed up and stabilize the estimated poses from our current system.

9 All tests are performed on a desktop with an Intel Xeon E5-2690 CPU.

REFERENCES

[1] Aldoma, A., Tombari, F., Prankl, J., Richtsfeld, A., Di Stefano, L., Vincze, M.: Multimodal cue integration through Hypotheses Verification for RGB-D object recognition and 6DOF pose estimation. In ICRA, 2013.

[2] Aldoma, A., Tombari, F., Di Stefano, L., Vincze, M.: A Global Hypotheses Verification Method for 3D Object Recognition. In ECCV, 2012.

[3] Asif, U., Bennamoun, M., Sohel, F.: Efficient RGB-D object categorization using cascaded ensembles of randomized decision trees. In ICRA, 2015.

[4] Blum, M., Springenberg, J. T., Wülfing, J., Riedmiller, M.: A learned feature descriptor for object recognition in RGB-D data. In ICRA, 2012.

[5] Bo, L., Lai, K., Ren, X., Fox, D.: Object recognition with hierarchical kernel descriptors. In CVPR, 2011.

[6] Bo, L., Ren, X., Fox, D.: Unsupervised feature learning for RGB-D based object recognition. In ISER, 2013.

[7] Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6D object pose estimation using 3D object coordinates. In ECCV, 2014.

[8] Couprie, C., Farabet, C., Najman, L., LeCun, Y.: Indoor semantic segmentation using depth information. In ICLR, 2013.

[9] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[10] Gupta, S., Arbelaez, P., Malik, J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013.

[11] Gupta, S., Girshick, R., Arbelaez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In ECCV, 2014.

[12] Hermans, A., Floros, G., Leibe, B.: Dense 3D semantic mapping of indoor scenes from RGB-D images. In ICRA, 2014.

[13] Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In ACCV, 2012.

[14] Krizhevsky, A., Sutskever, I., Hinton, G. E.: ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[15] Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-view RGB-D object dataset. In ICRA, 2011.

[16] Li, C., Bohren, J., Hager, G. D.: Bridging the Robot Perception Gap With Mid-Level Vision. In ISRR, 2015.

[17] Li, C., Reiter, A., Hager, G. D.: Beyond Spatial Pooling: Fine-Grained Representation Learning in Multiple Domains. In CVPR, 2015.

[18] Lowe, D. G.: Distinctive image features from scale-invariant keypoints. IJCV, 2004.

[19] Papazov, C., Burschka, D.: An efficient RANSAC for 3D object recognition in noisy and occluded scenes. In ACCV, 2010.

[20] Papon, J., Abramov, A., Schoeler, M., Wörgötter, F.: Voxel cloud connectivity segmentation - supervoxels for point clouds. In CVPR, 2013.

[21] Pauwels, K., Ivan, V., Ros, E., Vijayakumar, S.: Real-time object pose recognition and tracking with an imprecisely calibrated moving RGB-D camera. In IROS, 2014.

[22] Rusu, R. B.: Semantic 3D object maps for everyday manipulation in human living environments. KI - Künstliche Intelligenz, 2010.


(a) Labpod 2 (b) Box 7 (c) Shelf 6 (d) Barrett 3

(e) Labpod 4 (f) Box 4 (g) Office 2 (h) Failure case

Fig. 5: Example results of the semantic segmentation and pose estimation are shown in the upper and lower parts of each subfigure, respectively. Each predicted semantic class and the associated estimated poses are highlighted by a unique color.

[23] Schwarz, M., Schulz, H., Behnke, S.: RGB-D Object Recognition and Pose Estimation based on Pre-trained Convolutional Neural Network Features. In ICRA, 2015.

[24] Silberman, N., Sontag, D., Fergus, R.: Instance Segmentation of Indoor Scenes using a Coverage Loss. In ECCV, 2014.

[25] Singh, A., Sha, J., Narayan, K. S., Achim, T., Abbeel, P.: BigBIRD: A large-scale 3D database of object instances. In ICRA, 2014.

[26] Tang, J., Miller, S., Singh, A., Abbeel, P.: A textured object recognition pipeline for color and depth image data. In ICRA, 2012.

[27] Tejani, A., Tang, D., Kouskouridas, R., Kim, T. K.: Latent-class Hough forests for 3D object detection and pose estimation. In ECCV, 2014.

[28] Tombari, F., Salti, S., Di Stefano, L.: A combined texture-shape descriptor for enhanced 3D feature matching. In ICIP, 2011.

[29] Tombari, F., Di Stefano, L.: Object recognition in 3D scenes with occlusions and clutter by Hough voting. In PSIVT, 2010.

[30] Van Gemert, J. C., Veenman, C. J., Smeulders, A. W., Geusebroek, J. M.: Visual word ambiguity. IEEE TPAMI, 2010.

[31] Wolf, D., Prankl, J., Vincze, M.: Fast semantic segmentation of 3D point clouds using a dense CRF with learned parameters. In ICRA, 2015.

[32] Xie, Z., Singh, A., Uang, J., Narayan, K. S., Abbeel, P.: Multimodal blending for high-accuracy instance recognition. In IROS, 2013.