dex-net 1.0: a cloud-based network of 3d objects for ... · dex-net 1.0 contains approximately 2.5...

Dex-Net 1.0: A Cloud-Based Network of 3D Objects for Robust GraspPlanning Using a Multi-Armed Bandit Model with Correlated Rewards

Jeffrey Mahler1, Florian T. Pokorny1, Brian Hou1, Melrose Roderick1, Michael Laskey1, Mathieu Aubry1,Kai Kohlhoff2, Torsten Kroger2, James Kuffner2, Ken Goldberg1

Abstract— This paper presents the Dexterity Network (Dex-Net) 1.0, a dataset of 3D object models and a sampling-basedplanning algorithm to explore how Cloud Robotics can beused for robust grasp planning. The algorithm uses a Multi-Armed Bandit model with correlated rewards to leverage priorgrasps and 3D object models in a growing dataset that currentlyincludes over 10,000 unique 3D object models and 2.5 millionparallel-jaw grasps. Each grasp includes an estimate of the prob-ability of force closure under uncertainty in object and gripperpose and friction. Dex-Net 1.0 uses Multi-View ConvolutionalNeural Networks (MV-CNNs), a new deep learning method for3D object classification, to provide a similarity metric betweenobjects, and the Google Cloud Platform to simultaneously runup to 1,500 virtual cores, reducing experiment runtime by up tothree orders of magnitude. Experiments suggest that correlatedbandit techniques can use a cloud-based network of objectmodels to significantly reduce the number of samples requiredfor robust grasp planning. We report on system sensitivityto variations in similarity metrics and in uncertainty in poseand friction. Code and updated information is available athttp://berkeleyautomation.github.io/dex-net/.

I. INTRODUCTION

Cloud-based Robotics and Automation systems exchangedata and perform computation via networks instead ofoperating in isolation with limited computation and memory.Potential advantages to using the Cloud include Big Data: ac-cess to updated libraries of images, maps, and object/productdata; and Parallel Computation: access to grid computingfor statistical analysis, machine learning, and planning [22].Scaling effects have recently been demonstrated in com-puter vision and speech recognition, where learning fromlarge datasets such as ImageNet has increased performancesignificantly [14], [24] over decades of previous research.Can analogous scaling effects emerge when datasets of 3Dobject models are applied to learning robust grasping andmanipulation policies for robots? This question is beingexplored by others [12], [18], [27], [32], [33], and in thispaper we present initial results using a new dataset of 3Dmodels and grasp planning algorithm.

The primary contribution of this paper is the Dex-Net1.0 algorithm for efficiently computing robust parallel-jawgrasps (pairs of contact points that define a grasp axis) withhigh probability of success based on a binary grasp qualitymetric with uncertainty due to imprecision in sensing and

1 University of California, Berkeley, USA;{jmahler, ftpokorny, brian.hou,laskeymd, goldberg}@berkeley.edu,melrose [email protected],[email protected]

2 Google Inc., Mountain View, USA; {kohlhoff, tkr,kuffner}@google.com

Uniform Allocation

Thompson Sampling

Dex-Net 1.0 (N=1,000)

Dex-Net 1.0 (N=10,000)

1.0

0.9

0.8

0.7

0.60 500 1000 1500 2000

Iteration

No

rmal

ized

Qu

alit

y

Query Object 1,000 10,000

Network Size (# Objects)

Increasin

g S

imilarity

1.0

0.9

0.8

0.7

0.60

Fig. 1: Average normalized grasp quality versus iteration for 25 trials forthe Dex-Net 1.0 Algorithm with 1,000 and 10,000 prior 3D objects fromDex-Net (bottom) and illustrations of five nearest neighbors in Dex-Net (top)for a spray bottle. We measure quality by the probability of force closureof the best grasp predicted by the algorithm on each iteration and comparewith Thompson sampling without priors [26] and uniform allocation [20],[43]. (Top) The spray bottle has no similar neighbors with 1,000 objects,but two other spray bottles are found by the MV-CNN in the 10,000 objectset. (Bottom) As a result, the Dex-Net 1.0 algorithm quickly converges tothe optimal grasp with 10,000 prior objects.

control. The algorithm speeds up robust grasp planning withMulti-Armed Bandits (MABs) [26] by learning from a largedataset of prior grasps and 3D object models using ContinuousCorrelated Beta Processes (CCBPs) [11], [31], an efficientmodel for estimating a belief distribution on predicted graspquality based on prior data. This paper also presents Dex-Net 1.0, a growing dataset of 10,000 3D object modelstypically found in warehouses and homes such as containers,tools, tableware, and toys, and an implemented cloud-basedalgorithm for efficiently finding grasps with high probabilityof force closure (PF ) under perturbations in sensing andcontrol.

Dex-Net 1.0 contains approximately 2.5 million parallel-

jaw grasps, as each object is labelled with up to 250 graspsand an estimate of PF for each under uncertainty in objectpose, gripper pose, and friction coefficient. To the best of ourknowledge, this is the largest object dataset used for graspingresearch to-date. We incorporate Multi-View ConvolutionalNeural Networks (MV-CNNs) [42], a state-of-the-art methodfor 3D shape classification, to efficiently retrieve similar 3Dobjects for predicting robust grasp quality with CCBPs.

We implemented the Dex-Net 1.0 algorithm on GoogleCompute Engine and store Dex-Net 1.0 on Google CloudStorage, with a system that can distribute grasp evaluations for3D objects across up to 1,500 instances at once. Experimentssuggest that using 10,000 prior object models from Dex-Netreduces the number of samples needed to plan parallel-jawgrasps by up to 2× on average over 45 objects when usingPF as a success metric.

II. RELATED WORK

Grasp planning considers the problem of finding grasps fora given object that achieve force closure or optimize a relatedquality metric, such as the epsilon quality [35], correlationwith human labels [2], [18], or success in physical trials.Often it is assumed that the object is known exactly and thatcontacts are placed exactly, and mechanical wrench spaceanalysis is applied. As computing quality metrics can betime consuming, a common grasp planning method is tostore a database of 3D objects labelled with grasps and theirquality and to transfer the stored grasps to similar objects atruntime [5], [41]. To study grasp planning at scale, Goldfederet al. [12], [13] developed the Columbia grasp database,a dataset of 1,814 distinct models and over 200,000 forceclosure grasps generated using the GraspIt! sampling-basedgrasp planner. Pokorny et al. [34] introduced Grasp ModuliSpaces, enabling joint grasp and shape interpolation, andanalyzed a set of 100 million sampled grasp configurations.

Robust grasp planning considers optimizing a grasp qualitymetric in the presence of bounded perturbations in propertiessuch as object shape, pose, or mechanical properties suchas friction, which are inevitable due to imprecision inperception and control. One way to treat perturbations isstatistical sampling. Since sampling in high dimensions canbe computationally demanding, recent research has studiedlabelling grasps in a database with metrics that are robust toimprecision in perception and control using probability offorce closure (PF ) [43] or expected Ferrari-Canny quality [23].Experiments by Weisz et al. [43] and Kim et al. [23] suggestthat the robust metrics are better correlated with success ona physical robot than deterministic wrench space metrics.Brook et al. [7] planned robust grasps for a database of 892point clouds and developed a model to predict grasp successon a physical robot based on correlations with grasps in thedatabase. Kehoe et al. [21] created a Cloud-based system totransfer grasps evaluated by PF on 100 objects in a databaseto a physical robot by indexing the objects with the GoogleGoggles object recognition engine.

Another line of research has focused on synthesizinggrasps using statistical models [5] learned from a database

of images [27] or point clouds [9], [15], [46] of objectsannotated with grasps from human demonstrators [15], [27]or physical execution [15]. Kappler et al. [18] created adatabase of over 700 object instances, each labelled with500 Barrett hand grasps and their associated quality fromhuman annotations and the results of simulations with theODE physics engine. The authors trained a deep neuralnetwork to predict grasp quality from heightmaps of thelocal object surface. In comparison, we predict PF givena known 3D object model from similar objects and graspsusing Continuous Correlated Beta Processes (CCBPs) andMulti-Armed Bandits (MABs).

Our work is also closely related to research on activelysampling grasps to build a statistical model of grasp qual-ity from fewer examples [10], [25], [36]. Montesano andLopes [31] used Continuous Correlated Beta Processes [11]to actively acquire grasp executions on a physical robotusing image filters to measure similarity. Pinto et al. [33]used importance sampling over grasp success probabilitiespredicted from images by a Convolutional Neural Network(CNN) to actively acquire over 700 hours of labels forsuccessful and unsuccessful 3 DOF crane grasps. Oberlinand Tellex [32] developed a budgeted MAB algorithm forplanning 3 DOF crane grasps using priors from the responsesof hand-designed depth image filters, but did not study theeffects of orders of magnitude of prior data on convergence.In this work we extend the MAB model of [26] from 2Dto 3D and study the scaling effects of using prior data fromDex-Net on robust grasp planning.

To use the prior information contained in Dex-Net, we alsodraw on research on 3D model similarity. One line of researchhas focused on shape geometry, such as characteristics ofharmonic functions on the shape [6], or CNNs trained ona voxel representation of shape [30], [45]. Another line ofresearch relies on the description of rendered views of a 3Dmodel [12]. One of the key difficulties of these methodsis comparing views from different objects, which may beoriented inconsistently. Su et al. [42] address this issue byusing CNNs trained for ImageNet classification as descriptorsfor the different views and aggregating them with a secondCNN that learns the invariance to orientation. Using thismethod, the authors improve state-of-the-art classificationaccuracy on ModelNet40 by 10%. We use max-pooling toaggregate views, similar to the average pooling of [1].

III. DEFINITIONS AND PROBLEM STATEMENT

One goal of Cloud Robotics is to pre-compute a large setof robust grasps for each object so that when the object isencountered, the set can be downloaded such that at least onegrasp is achievable in the presence of clutter and occlusions. Inthis paper we consider the sub-problem of efficiently planninga parallel-jaw grasp that maximizes the expected value of abinary quality metric, such as the probability of force closure(PF ), for a given 3D object model under perturbations inobject pose, gripper pose, and friction coefficient. The Dex-Net 1.0 algorithm can also produce a set of grasps in rankedorder of robustness. We assume the exact object shape is given

x v

xy

z

z z

c1 c2n1 n2

ρ2ρ1

Fig. 2: Illustration of grasp parameterization and contact model. (Left) Weparameterize parallel-jaw grasps by the centroid of the jaws x ∈ R3 andapproach direction, or direction along which the jaws close, v ∈ S2. Theparameters x and v are specified with respect to a coordinate frame at theobject center of mass z and oriented along the principal directions of theobject. (Right) The jaws are closed until contacting the object surface atlocations c1, c2 ∈ R3, at which the surface has normals n1,n2 ∈ S2. Thecontacts are used to compute the moment arms ρi = ci − z.

as a signed distance function (SDF) f : R3 → R [29], whichis zero on the object surface, positive outside the object, andnegative within. We assume the object is specified in unitsof meters with given center of mass z ∈ R3. Furthermore,we assume soft-finger point contacts with a Coulomb frictionmodel [47]. We also assume that the gripper jaws are alwaysopened to their maximal width w ∈ R before closing.

A. Grasp and Object ParameterizationThe grasp parameters are illustrated in Fig. 2. Let g =

(x,v) be a parallel-jaw grasp parameterized by the centroidof the jaws in 3D space x ∈ R3 and an approach axis v ∈ S2.We denote by O = (f, z) an object with SDF f and center ofmass z, and denote by S = {y ∈ R3

∣∣f(y) = 0} the surfaceof O. We specify all points with respect to a reference framecentered at the object center of mass z and oriented alongthe principal axes of S.

B. ObjectiveThe objective of the Dex-Net 1.0 algorithm is to find a

grasp g∗ that maximizes an expected binary grasp qualitymetric S(g) ∈ {0, 1} such as force closure [23], [26],[29], [43] subject to uncertainty in the state of the object,environment, or robot. We refer to the expected quality as theprobability of success, PS(g) = E[S(g)]. Since sampling inhigh-dimensional spaces can be computationally expensive,we attempt to solve for g∗ in as few samples T as possibleby maximizing over the sum of PS(gt) for grasps sampledat times t = 1, ..., T [26], [40]:

maximizeg1,..,gT∈G

∑Tt=1PS(gt). (III.1)

Past work has solved this objective by evaluating andranking a discrete set of K candidate grasps Γ = {g1, ...,gK}using Monte-Carlo integration [20], [43] or Multi-ArmedBandits (MAB) [26]. In this work, we extend the 2D MABmodel of [26] to leverage similarities between prior grasps and3D objects in Dex-Net to reduce the number of samples [16].

C. Quality Metric

In this work we evaluate our algorithm using the probabilityof force closure (PF ), or the ability to resist external forceand torques in arbitrary directions [29], as a quality metric.PF allows us to study the effects of large amounts of data onapproximate solutions of Equation III.1 because it is relativelyinexpensive to evaluate, and PF has also shown promise inphysical experiments [23], [43].

Let F ∈ {0, 1} denote the occurrence of force closure.For a grasp g on object O under uncertainty in object poseξ, gripper pose ν, and friction coefficient γ the probabilityof force closure PF (g,O) = P (F = 1 | g,O, ξ, ν, γ). Tocompute force closure for a grasp g ∈ G on object O ∈ Hgiven samples of object pose ξ, gripper pose ν, and frictioncoefficient γ, we first compute a set of possible contactwrenches W using a soft finger contact model [47]. ThenF = 1 if 0 is in the convex hull of W [43].

D. Sources of Uncertainty

For PF evaluation we assume Gaussian distributions onobject pose, gripper pose, and friction coefficient to modelerrors in registration, robot calibration, or classification ofmaterial properties. Let υ ∼ N (0,Συ) denote a zero-mean Gaussian on R6 and µξ ∈ SE(3) be the meanobject pose. We define the object pose random variableξ = exp (υ∧)µξ, where the ∧ operator maps from R6 tothe Lie algebra se(3) [3]. Let ν ∼ N (0,Σν) denote zero-mean Gaussian gripper pose uncertainty with mean µν ∈ G.Let γ ∼ N (µγ ,Σγ) denote a Gaussian distribution on thefriction coefficient with mean µγ ∈ R. We denote by ξ, ν,and γ samples of the random variables.

E. Contact Model

Given a grasp g on object surface f and samples ξ, ν,and γ, let ci ∈ R3 denote the 3D contact location be-tween the i-th jaw and surface as shown in Fig. 2. Letni = ∇f(ci)/‖∇f(ci)‖2 denote the surface normal andlet ti,1, ti,2 ∈ S2 be its tangent vectors. To computethe forces that each soft contact can apply to the objectfor friction coefficient γ, we discretize the friction coneat ci [35] into a set of l facets with vertices Fi ={ni + γ cos

(2πjl

)ti,1 + γ sin

(2πjl

)ti,2∣∣j = 1, ..., l

}. Thus

the set of wrenches that g can apply to O is W = {wi,j =(fi,j , fi,j × ρi)

∣∣i = 1, 2 and fi,j ∈ Fi} where ρi = (ci − z)is the moment arm at ci.

IV. DEXTERITY NETWORK

The Dexterity Network (Dex-Net) 1.0 dataset is a growingset that currently includes over 10,000 unique 3D objectmodels annotated with 2.5 million parallel-jaw grasps.

A. Data

Dex-Net 1.0 contains 13,252 3D mesh models: 8,987from the SHREC 2014 challenge dataset [28], 2,539 fromModelNet40 [45], 1,371 from 3DNet [44], 129 from the KITobject database∗ [19], 120 from BigBIRD∗ [38], 80 from theYale-CMU-Berkeley dataset∗ [8], and 26 from the Amazon

Picking Challenge∗ scans (∗ indicates laser-scanner data). Wepreprocess each mesh by removing unreferenced vertices,computing a reference frame with Principal ComponentAnalysis (PCA) on the mesh vertices, setting the mesh centerof mass z to the center of the mesh bounding box, andrescaling the synthetic meshes to fit the smallest dimensionof the bounding box within w = 0.1m. To resolve orientationambiguity in the reference frame, we orient the positive z-axistoward the side of the xy plane with more vertices. We alsoconvert each mesh to an SDF using SDFGen [4].

B. Grasp Sampling

Each 3D object Oi in Dex-Net is labelled with up to250 parallel-jaw grasps and their PF . We generate K graspsfor each object using a modification of the 2D algorithmpresented in Smith et al. [39] to concentrate samples ongrasps that are antipodal [29]. To sample a single grasp, wegenerate a contact point c1 by sampling uniformly from theobject surface S, sampling a direction v ∈ S2 uniformlyat random from the friction cone, and finding an antipodalcontact c2 on the line c1 + tv where t > 0. We add thegrasp gi,k = (0.5(c1 + c2),v) to the candidate set if thecontacts are antipodal [29]. We evaluated PF (gi,k) usingMonte-Carlo integration [20], [43] by sampling the objectpose, gripper pose, and friction random variables N = 500times and recording Zi,k, the number of samples for whichgi,k achieved force closure (F = 1).

C. Depthmap Gradient Features

To measure grasp similarity in the Dex-Net 1.0 algorithm,we embed each grasp g = (x,v) of object O in Dex-Netin a feature space based on a 2D map of the local surfaceorientation at the contacts, inspired by grasp heightmaps [15],[18]. We generate a depthmap di for contact ci by orthog-onally projecting the local object surface onto an m × mgrid centered at ci and oriented along the line to the objectcenter of mass, ai = z−ci. Since F only depends on ci andits surface normal, rotations of di about ai correspond tograsps of the equivalent quality. We therefore make each dirotation-invariant by orienting its axes along the eigenvectorsof a weighted covariance matrix of the 3D surface points thatgenerate di as described in [37]. Fig. 3 illustrates local surfacepatches extracted by this procedure. We finally take the x-and y-image gradients of di to form depthmap gradients∇di = (∇xdi,∇ydi), motivated by the dependence of Fon surface normals [35], and we store each in Dex-Net 1.0.

V. DEEP LEARNING FOR OBJECT SIMILARITY

We use Multi-View Convolutional Neural Networks (MV-CNNs) [42] to efficiently index prior 3D object and grasp datafrom Dex-Net by embedding each object in a vector spacewhere distance represents object similarity, as shown in Fig. 4.We first render every object on a white background in a totalof C = 50 virtual camera views oriented toward the objectcenter and spaced on a grid of angle increments δθ = 2π

5 andδϕ = 2π

5 on a viewing sphere with radii r = R, 2R, whereR is the maximum dimension of the object bounding box.

Depthmap

Depth (m)

c1 c2

c3

Depthmap

Depthmap

v1v2

v3

0.2 0.0 -0.20.1 -0.1

d1 d2

d3

Fig. 3: Illustration of three local surface depthmaps extracted on a teapot.Each depthmap is “rendered” along the grasp axis vi at contact ci andoriented by the directions of maximum variation in the depthmap. We usegradients of the depthmaps for similarity between grasps in Dex-Net.

Deep CNN

Deep CNN

Deep CNN

Max Pooling

fc7

response

fc7

response

PCA

Object O

ψ(O)

Rendered View

Rendered View

C1

C2

C3

Fig. 4: Illustration of our Multi-View Convolutional Neural Network (MV-CNN) deep learning method for embedding 3D object models in a Euclideanvector space to compute global shape similarity. We pass a set of 50 virtuallyrendered camera viewpoints discretized around a sphere through a deepConvolutional Neural Network (CNN) with the AlexNet [24] architecture.Finally, we take the maximum fc7 response across each of the 50 views foreach dimension and run PCA to reduce dimensionality.

Then we train a CNN with the architecture of AlexNet [24]to predict the 3D object class label for the rendered imageson a training set of models. We initialize the weights of thenetwork with the weights learned on ImageNet by Krizhevskyet al. [24] and optimize using Stochastic Gradient Descent(SGD). Next, we pass each of the C views of each objectthrough the optimized CNN and max-pool the output ofthe fc7 layer, the highest layer of the network before theclass label prediction. Finally, we use Principal ComponentAnalysis (PCA) to reduce the max-pooled output from 4,096dimensions to a 100 dimensional feature vector ψ(O).

Given the MV-CNN object representation, we measurethe dissimilarity between two objects Oi and Oj by theEuclidean distance ‖ψ(Oi)−ψ(Oj)‖2. For efficient lookupsof similar objects, Dex-Net contains a KD-Tree nearestneighbor query structure with the feature vectors of all priorobjects. In our implementation, we trained the MV-CNN usingthe Caffe library [17] on rendered images from a trainingset of approximately 6,000 3D models sampled from theSHREC 2014 [28] portion of Dex-Net, which has 171 uniquecategories, for 500,000 iterations of SGD. To validate theimplementation, we tested on the SHREC 2014 challengedataset and achieved a 1-NN accuracy of 86.7%, comparedto 86.8% achieved by the winner of SHREC 2014 [28].

VI. CORRELATED MULTI-ARMED BANDIT ALGORITHM

The Dex-Net 1.0 algorithm (see pseudocode below) opti-mizes the probability of success PS (Equation III.1) for abinary quality metric such as force closure over a discreteset of candidate grasps Γ on an object O using a BayesianMulti-Armed Bandit (MAB) model [26], [40] with correlatedrewards [16] and priors computed from Dex-Net 1.0. We firstgenerate the set of K candidate grasps using the antipodalgrasp sampling described in Section IV-B and treat the graspsas “arms” in the MAB model. Next, we predict PS for eachgrasp using the M most similar objects from the Dex-Net1.0 dataset and estimate a Bayesian posterior distributionon our prediction. Then, for iterations t = 1, ..., T we useThompson sampling [26], [32] to select a grasp gt,k ∈ Γ toevaluate, sample the quality S(gt,k), and update a posteriorbelief distribution on PS for each grasp. Finally, we rank Γby the q-lower confidence bound on PS for each grasp andstore the ranking in the database.

To illustrate convergence of the algorithm, we use forceclosure [47] as our binary quality metric. We plan to studyother quality metrics such as success on physical trials [25],[31] and alternate MAB methods based on upper confidencebounds [25], [32] or Gittins indices [26] in future work.

A. Model of Correlated Rewards

Let Sj = S(gj) ∈ {0, 1} be a random binary qualitymetric evaluated on grasp gj ∈ Γ. For example, Sj mightmodel force closure under uncertainty in object pose, gripperpose, or friction. Each Sj is a Bernoulli random variable withprobability of success θj = PS(gj).

We use Continuous Correlated Beta Processes(CCBPs) [11], [31] to model a joint posterior beliefdistribution over the θj for all grasps in Dex-Net, whichenables us to predict θj from prior grasp and object datain Dex-Net 1.0 using a closed-form posterior update. Thejoint distribution models pairwise correlations of θ betweengrasp-object pairs P = (g,O) (points in a Grasp ModuliSpace [34]) measured using a normalized kernel functionk(Pi,Pj) that approaches 1 as the arguments becomeincreasingly similar and approaches 0 as the argumentsbecome dissimilar.

Dex-Net 1.0 measures similarity using a set of featuremaps ϕm ∈ Rdm for m = 1, ..., 3, where dm is thedimension of the feature space for each. The first featuremap ϕ1(P) = (x,v, ‖ρ1‖2, ‖ρ2‖2) captures similarity in thegrasp parameters, where x ∈ R3 is the grasp center, v ∈ S2 isthe approach axis, and ρi ∈ R3 is the i-th moment arm. Thesecond feature map ϕ2(P) = (∇d1,∇d2) uses the depthmapgradients described in Section IV-C. Our third feature mapϕ3(P) = ψ(O) is the object similarity map described inSection V to capture global shape similarity.

Given the feature maps, we use the squared exponentialkernel

k(Yp,Yq) = exp

(−1

2

3∑m=1

‖ϕm(Pp)− ϕm(Pq)‖2Cm

).

where Cm ∈ Rdm×dm is the bandwidth for ϕm and ‖y‖Cm=

yTC−1m y. The bandwidths are set by maximizing the log-likelihood [11] of the true θ on a set of training data.

B. Predicting Grasp Quality Using Prior Data

Before evaluating any grasps in Γ, the Dex-Net 1.0algorithm predicts θj for each candidate grasp gj based on itskernel similarity to all grasps and objects from the Dex-Net1.0 dataset D. In particular, we estimate a Bayesian posteriordensity p(θj) by treating D as prior observations and usingthe closed form posterior update for CCBPs [11]:

p(θj | αj,0, βj,0) ∝ θαj,0−1j (1− θj)βj,0−1 (VI.1)

αj,0 = α0 +

|D|∑i=1

K∑k=1

k(Pj ,Pi,k)Zi,k (VI.2)

βj,0 = β0 +

|D|∑i=1

K∑k=1

k(Pj ,Pi,k)(N − Zi,k) (VI.3)

where α0 and β0 are prior parameters for the Beta distribu-tion [26], N is the number of times each grasp gi,k ∈ D wassampled to estimate θi, and Zi,k is the number of observedsuccesses for gi,k. Intuitively, the prior dataset contributesfractional observations of successes and failures for the graspcandidates Γ proportional to the kernel similarity. We estimatethe above sums using the M nearest neighbors to O in theobject similarity KD-Tree described in Section V.

C. Grasp Selection Policy

On iteration t we select the next grasp to sample gj ∈ Γusing Thompson Sampling. In Thompson Sampling we drawa sample θ` ∼ p(θ` | α`,t, β`,t) for each grasp g` ∈ Γ, thenchoose the grasp gj with the highest θj [26]. After observingSj , we update the belief for all grasps g` ∈ Γ by updating arunning count of the fractional successes and failures [11]:

α`,t = α`,t−1 + k(P`,Pj)Sj (VI.4)β`,t = β`,t−1 + k(P`,Pj)(1− Sj). (VI.5)

VII. EXPERIMENTS

We evaluate the performance of the Dex-Net 1.0 algorithmon robust grasp planning for varying sizes of prior data usedfrom Dex-Net using force closure as our binary quality metric,and we explore the sensitivity of the convergence rate to objectshape, the similarity kernel bandwidths, and uncertainty. Wecreated two training sets of 1,000, and 10,000 objects byuniformly sampling objects from Dex-Net. We uniformlysampled a set of 300 validation objects for selecting algorithmhyperparameters and selected a set of 45 test objects fromthe remaining objects. We ran the algorithm for T = 2, 000iterations with M = 10 nearest neighbors, α0 = β0 = 1 [26],and a lower confidence bound containing q = 75% of thebelief distribution. We used isotropic Gaussian uncertaintywith object and gripper translation variance σt = 0.005,object and gripper rotation variance σr = 0.1, and frictionvariance σγ = 0.1. For each experiment we compare theDex-Net algorithm to Thompson sampling without priors

1 Input: Object O, Number of Candidate Grasps K, Number ofNearest Neighbors M , Dex-Net 1.0 Dataset D, Feature mapsϕ, Maximum Iterations T , Prior beta shape α0, β0, LowerBound Confidence q, Quality Metric SResult: Estimate of the grasp with highest PF , g∗

// Generate candidate grasps and priors2 Γ = AntipodalGraspSample(O,K) ;3 A0 = ∅,B0 = ∅;4 for gk ∈ Γ do

// Equations VI.2 and VI.35 αk,0, βk,0 = ComputePriors(O,gk,D,M, ϕ, α0, β0);6 A0 = A0 ∪ {αk,0},B0 = B0 ∪ {βk,0};7 end// Run MAB to Evaluate Grasps

8 for t = 1, .., T do9 j = ThompsonSample(At−1,Bt−1);

10 Sj = SampleQuality(gj ,O, S);// Equations VI.4 and VI.5

11 At,Bt = UpdateBeta(j, Sj ,Γ);12 g∗

t =MaxLowerConfidence(q,At,Bt);13 end14 return g∗

T ;Dex-Net 1.0 Algorithm: Robust Grasp Planning UsingMulti-Armed Bandits with Correlated Rewards

(TS) [26], a state-of-the-art method for robust grasp planning,and uniform allocation (UA), a widely-used method for robustgrasp planning that selects the next grasp to evaluate uniformlyat random [20], [23], [43].

The inverse kernel bandwidths were selected by maxi-mizing the log-likelihood of the true PF under the CCBPmodel [11] on the validation set using a grid search overhyperparameters. The inverse bandwidths of the similaritykernel were Cg = diag(0, 0, 3×10−5, 3×10−5) for the graspparameter features, an isotropic Gaussian mask Cd with meanµd = 500.0 and σd = 0.33 for the differential depthmaps,and Cs = 106 ∗ I for the shape similarity features.

To scale experiments, we developed a Cloud-based systemon top of Google Cloud Platform. We used Google ComputeEngine (GCE) to construct the Dex-Net 1.0 dataset and todistribute subsets of objects to virtual machines for MABexperiments, and we used Google Cloud Storage to store Dex-Net. The system launched up to 1,500 GCE virtual instancesat once for experiments, reducing the runtime by an estimatedthree orders of magnitude to approximately 315 seconds perobject for both loading the dataset and running the Dex-Net1.0 algorithm. Each virtual instance ran Ubuntu 12.04 on asingle core with 3.75 GB of RAM.

A. Scaling of Average Convergence Rate

To examine the effects of orders of magnitude of prior dataon convergence to a grasp with high PF , we ran the Dex-Net 1.0 algorithm on the test objects with priors computedfrom 1,000 and 10,000 objects from Dex-Net. Fig. 5 showsthe normalized PF (the ratio of the PF for the best grasppredicted by the algorithm to the highest PF of the candidategrasps) versus iteration averaged over 25 trials for each of the45 test objects to facilitate comparison across objects. Theaverage runtime per iteration was 16 ms for UA, 17 ms forTS, and 22 ms for Dex-Net 1.0. The algorithm with 10,000

Uniform Allocation

Thompson Sampling

Dex-Net 1.0 (N=1,000)

Dex-Net 1.0 (N=10,000)

1.0

0.9

0.8

0.7

0.60 500 1000 1500 2000

Iteration

No

rmal

ized

Qu

alit

y

Fig. 5: Average normalized grasp quality versus iteration over 45 test objectsand 25 trials per object for the Dex-Net 1.0 algorithm with 1,000 and 10,000prior 3D objects from Dex-Net. We measure quality by the PF for thebest grasp predicted by the algorithm on each iteration and compare withThompson sampling without priors and uniform allocation. The algorithmconverges faster with 10,000 models, never dropping below approximately90% of the grasp with highest PF from a set of 250 candidate grasps.

objects takes approximately 2× fewer iterations to reach themaximum normalized PF value reached by TS, which isparticularly promising for binary success metrics that areexpensive to evaluate such as detailed physics simulations,human labels, or physical grasping trials. Furthermore, the10,000 object curve does not fall below approximately 90%of the best grasp in the set across all iterations, suggestingthat a grasp with high PF is found using prior data alone.The maximum standard error of the mean over all iterationswas 2 × 10−3 for UA, 2 × 10−3 for TS, and 1 × 10−3 forDex-Net 1.0 with 1,000 and 10,000 objects.

B. Sensitivity to Object Shape

To understand the behavior of the Dex-Net 1.0 algorithmon individual 3D objects, we examined performance on adrill and spray bottle from the test set, both uncommonobject categories in Dex-Net 1.0. Fig. 1 and Fig. 6 show thenormalized PF versus iteration averaged over 25 trials for2,000 iterations on the spray bottle and drill, respectively. Wesee that the spray bottle converges very quickly when usinga prior dataset of 10,000 objects, finding the optimal graspin the set in about 1,500 iterations. This convergence may beexplained by the two similar spray bottles retrieved by theMV-CNN from the 10,000 object dataset. Fig. 7 illustrates thegrasps predicted to have the highest PF on the spray bottleby the different algorithms after 100 iterations. On the otherhand, performance on the drill does not improve using either1,000 or 10,000 objects, perhaps because the closest modelin Dex-Net according to the similarity metric is a phone.

C. Sensitivity to Similarity and Uncertainty

We also studied the sensitivity of the Dex-Net algorithm tothe similarity kernel bandwidths described in Section VI-Band the levels of pose and friction uncertainty for the testobject. We varied the inverse bandwidths of the kernel for thegrasp parameters and depthmap gradients to the lower valuesCg = diag(0, 0, 15, 15), µd = 350.0, and σd = 3.0 as wellas the higher values Cg = diag(0, 0, 300, 300), µd = 750.0,and σd = 1.75. We also tested low uncertainty with variances

1.0

0.9

0.8

0.7

0.6500 1000 1500 20000

Uniform AllocationThompson SamplingDex-Net 1.0 (N=1,000)Dex-Net 1.0 (N=10,000)

1.0

0.9

0.8

0.7

0.6500 1000 1500 2000

Iteration

Norm

aliz

ed Q

ual

ity

00

Network Size (# Objects)Query Object 1,000 10,000

Increasin

g S

imilarity

Fig. 6: Failure object for the Dex-Net 1.0 algorithm. (Top) The drill, whichis relatively rare in the dataset, has no geometrically similar neighbors evenwith 10,000 objects. (Bottom) Plotted is the average normalized grasp qualityversus iteration over 25 trials for the Dex-Net 1.0 algorithm with 1,000 and10,000 prior 3D objects. The lack of similar objects leads to no significantperformance increase over Thompson sampling without priors.

PF = 0.60 PF = 0.67 PF = 0.78

Thompson Sampling Dex-Net 1.0 (N=1,000) Dex-Net 1.0 (N=10,000)

Fig. 7: Illustration of the grasps predicted to have the highest PF after only100 iterations by Thompson sampling without priors and the Dex-Net 1.0algorithm with 1,000 and 10,000 prior objects on the spray bottle. Thompsonsampling without priors chooses a grasp near the edge of the object, whilethe Dex-Net algorithm selects grasps closer to the object center-of-mass. Forreference, the highest quality grasp for the spray bottle was PF = 0.81.

(σt, σr, σγ) = (0.0025, 0.05, 0.05) and high uncertainty withvariances (σt, σr, σγ) = (0.01, 0.2, 0.2). Fig. 8 shows thenormalized PF versus iteration averaged over 25 trials for2,000 iterations on the 45 test objects. The results suggestthat a conservative setting of similarity kernel bandwidthis important for convergence and that the algorithm is notsensitive to uncertainty levels.

VIII. DISCUSSION AND FUTURE WORK

We presented Dexterity Network (Dex-Net) 1.0, a newdataset and associated algorithm to study the scaling effectsof Big Data and Cloud Computation on robust grasp planningwith binary grasp quality metrics. The algorithm uses a Multi-Armed Bandit model with correlated rewards to leverage priorgrasps and 3D object models. Experiments using the GoogleCloud Platform suggest that prior data can speed robust grasp

Uniform Allocation

Thompson Sampling

Dex-Net 1.0 (N=1,000)

Dex-Net 1.0 (N=10,000)

1.0

0.9

0.8

0.7

0.60 500 1000 1500 2000

Iteration

No

rmal

ized

Qual

ity

Uniform Allocation

Thompson Sampling

Dex-Net 1.0 (N=1,000)

Dex-Net 1.0 (N=10,000)

1.0

0.9

0.8

0.7

0.60 500 1000 1500 2000

Iteration

No

rmal

ized

Qual

ity

Uniform Allocation

Thompson Sampling

Dex-Net 1.0 (N=1,000)

Dex-Net 1.0 (N=10,000)

1.0

0.9

0.8

0.7

0.60 500 1000 1500 2000

Iteration

Norm

aliz

ed Q

ual

ity

Uniform Allocation

Thompson Sampling

Dex-Net 1.0 (N=1,000)

Dex-Net 1.0 (N=10,000)

1.0

0.9

0.8

0.7

0.60 500 1000 1500 2000

Iteration

Norm

aliz

ed Q

ual

ity

Overestimated Similarity Underestimated Similarity

Lower Uncertainty Higher Uncertainty

Fig. 8: Sensitivity to similiarity kernel (top) and pose and friction uncertainty(bottom) for the normalized grasp quality versus iteration averaged over 25trials per object for the Dex-Net algorithm with 1,000 and 10,000 prior 3Dobjects. (Top-left) Using a higher inverse bandwidth causes the algorithm tomeasure false similarities between grasps, leading to performance on parwith uniform allocation. (Top-right) A lower inverse bandwith decreases theconvergence rate, but on average the Dex-Net algorithm still selects a graspwithin approximately 85% of the grasp with highest PF for all iterations.(Bottom-left) Lower uncertainty increases the quality for all methods, (bottom-right) higher uncertainty decreases the quality for all methods, and the Dex-Net algorithm with 10,000 prior objects still converges approximately 2×faster than Thompson sampling without priors.

planning by a factor of 2 and that average grasp qualityincreases with the number of similar objects in the dataset.

In future work, we will extend the Dex-Net 1.0 algorithmto computing a set of grasps that adequately “cover” eachobject from a variety of accessibility conditions depending onpose and clutter. We also plan to study Deep Learning [24]to better estimate grasp quality from prior object and graspdata. One shortcoming of the current work is our evaluationwith a single quality metric. Thus, the next step is to performphysical experiments to evaluate the grasps found by Dex-Net1.0, and to refine grasps using the Dex-Net 1.0 algorithm withphysical success as a quality metric. We are experimentingwith several robots and plan to share a subset of 3D objectsand algorithms to facilitate comparison with related methodsand experimentation with different robots and objects.

IX. ACKNOWLEDGMENTS

This research was performed in UC Berkeley’s Automation Sciences Labunder the UC Berkeley Center for Information Technology in the Interest ofSociety (CITRIS) “People and Robots” Initiative: http://robotics.citris-uc.org.The authors were supported in part by the U.S. National Science Foundationunder NRI Award IIS-1227536, by grants from Google, UC Berkeley’sAlgorithms, Machines, and People Lab; the Knut and Alice WallenbergFoundation; the NSF-Graduate Research Fellowship; and the Departmentof Defense (DoD) through the National Defense Science & EngineeringGraduate Fellowship (NDSEG) Program. We thank our colleagues who wrotecode for the project, in particular Raul Puri, Sahaana Suri, Nikhil Sharma,and Josh Price, as well as our colleagues who gave feedback and suggestions,in particular Pieter Abbeel, Alyosha Efros, Animesh Garg, Kevin Jamieson,Sanjay Krishnan, Sergey Levine, Zoe McCarthy, Stephen McKinley, SachinPatil, Nan Tian, and Jur van den Berg.

REFERENCES

[1] M. Aubry and B. Russell, “Understanding deep features with computer-generated imagery,” arXiv preprint arXiv:1506.01151, 2015.

[2] R. Balasubramanian, L. Xu, P. D. Brook, J. R. Smith, and Y. Matsuoka,“Physical human interactive guidance: Identifying grasping principlesfrom human-planned grasps,” Robotics, IEEE Transactions on, vol. 28,no. 4, pp. 899–910, 2012.

[3] T. D. Barfoot and P. T. Furgale, “Associating uncertainty with three-dimensional poses for use in estimation problems,” Robotics, IEEETransactions on, vol. 30, no. 3, pp. 679–693, 2014.

[4] C. Batty, “Sdfgen,” https://github.com/christopherbatty/SDFGen.[5] J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven grasp

synthesisa survey,” Robotics, IEEE Transactions on, vol. 30, no. 2, pp.289–309, 2014.

[6] A. M. Bronstein, M. M. Bronstein, L. J. Guibas, and M. Ovsjanikov,“Shape google: Geometric words and expressions for invariant shaperetrieval,” ACM Transactions on Graphics (TOG), vol. 30, 2011.

[7] P. Brook, M. Ciocarlie, and K. Hsiao, “Collaborative grasp planningwith multiple object representations,” in Proc. IEEE Int. Conf. Roboticsand Automation (ICRA). IEEE, 2011, pp. 2851–2858.

[8] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, andA. M. Dollar, “Benchmarking in manipulation research: The ycbobject and model set and benchmarking protocols,” arXiv preprintarXiv:1502.03143, 2015.

[9] R. Detry, C. H. Ek, M. Madry, and D. Kragic, “Learning a dictionaryof prototypical grasp-predicting parts from grasping experience,” inRobotics and Automation (ICRA), 2013 IEEE International Conferenceon. IEEE, 2013, pp. 601–608.

[10] R. Detry, D. Kraft, O. Kroemer, L. Bodenhagen, J. Peters, N. Kruger,and J. Piater, “Learning grasp affordance densities,” Paladyn, Journalof Behavioral Robotics, vol. 2, no. 1, pp. 1–17, 2011.

[11] R. Goetschalckx, P. Poupart, and J. Hoey, “Continuous correlated betaprocesses,” in IJCAI Proceedings-International Joint Conference onArtificial Intelligence, vol. 22, no. 1. Citeseer, 2011, p. 1269.

[12] C. Goldfeder and P. K. Allen, “Data-driven grasping,” AutonomousRobots, vol. 31, no. 1, pp. 1–20, 2011.

[13] C. Goldfeder, M. Ciocarlie, H. Dang, and P. K. Allen, “The columbiagrasp database,” in Robotics and Automation, 2009. ICRA’09. IEEEInternational Conference on. IEEE, 2009, pp. 1710–1716.

[14] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen,R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al., “Deep-speech: Scaling up end-to-end speech recognition,” arXiv preprintarXiv:1412.5567, 2014.

[15] A. Herzog, P. Pastor, M. Kalakrishnan, L. Righetti, J. Bohg, T. Asfour,and S. Schaal, “Learning of grasp selection based on shape-templates,”Autonomous Robots, vol. 36, no. 1-2, pp. 51–65, 2014.

[16] M. W. Hoffman, B. Shahriari, and N. de Freitas, “Exploiting correlationand budget constraints in bayesian multi-armed bandit optimization,”arXiv preprint arXiv:1303.6746, 2013.

[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture forfast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.

[18] D. Kappler, J. Bohg, and S. Schaal, “Leveraging big data for graspplanning,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA),2015.

[19] A. Kasper, Z. Xue, and R. Dillmann, “The kit object models database:An object model database for object recognition, localization andmanipulation in service robotics,” The International Journal of RoboticsResearch, vol. 31, no. 8, pp. 927–934, 2012.

[20] B. Kehoe, D. Berenson, and K. Goldberg, “Toward cloud-based graspingwith uncertainty in shape: Estimating lower bounds on achieving forceclosure with zero-slip push grasps,” in Proc. IEEE Int. Conf. Roboticsand Automation (ICRA). IEEE, 2012, pp. 576–583.

[21] B. Kehoe, A. Matsukawa, S. Candido, J. Kuffner, and K. Goldberg,“Cloud-based robot grasping with the google object recognitionengine,” in Robotics and Automation (ICRA), 2013 IEEE InternationalConference on. IEEE, 2013, pp. 4263–4270.

[22] B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, “A survey of research oncloud robotics and automation,” Automation Science and Engineering,IEEE Transactions on, vol. 12, no. 2, pp. 398–409, 2015.

[23] J. Kim, K. Iwamoto, J. J. Kuffner, Y. Ota, and N. S. Pollard, “Physicallybased grasp quality evaluation under pose uncertainty,” IEEE Trans.Robotics, vol. 29, no. 6, pp. 1424–1439, 2013.

[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classificationwith deep convolutional neural networks,” in Advances in neuralinformation processing systems, 2012, pp. 1097–1105.

[25] O. Kroemer, R. Detry, J. Piater, and J. Peters, “Combining active

learning and reactive control for robot grasping,” Robotics andAutonomous Systems, vol. 58, no. 9, pp. 1105–1116, 2010.

[26] M. Laskey, J. Mahler, Z. McCarthy, F. Pokorny, S. Patil, J. van denBerg, D. Kragic, P. Abbeel, and K. Goldberg, “Multi-arm bandit modelsfor 2d sample based grasp planning with uncertainty.” in Proc. IEEEConf. on Automation Science and Engineering (CASE). IEEE, 2015.

[27] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting roboticgrasps,” The International Journal of Robotics Research, vol. 34, no.4-5, pp. 705–724, 2015.

[28] B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher,Q. Chen, N. K. Chowdhury, B. Fang, et al., “A comparison of 3dshape retrieval methods based on a large-scale benchmark supportingmultimodal queries,” Computer Vision and Image Understanding, vol.131, pp. 1–27, 2015.

[29] J. Mahler, S. Patil, B. Kehoe, J. van den Berg, M. Ciocarlie,P. Abbeel, and K. Goldberg, “Gp-gpis-opt: Grasp planning under shapeuncertainty using gaussian process implicit surfaces and sequentialconvex programming,” 2015.

[30] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neuralnetwork for real-time object recognition,” in Proc. IEEE/RSJ Int. Conf.on Intelligent Robots and Systems (IROS), 2015.

[31] L. Montesano and M. Lopes, “Active learning of visual descriptors forgrasping using non-parametric smoothed beta distributions,” Roboticsand Autonomous Systems, vol. 60, no. 3, pp. 452–462, 2012.

[32] J. Oberlin and S. Tellex, “Autonomously acquiring instance-basedobject models from experience,” in Int. S. Robotics Research (ISRR),2015.

[33] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to graspfrom 50k tries and 700 robot hours,” in Proc. IEEE Int. Conf. Roboticsand Automation (ICRA), 2016.

[34] F. T. Pokorny, K. Hang, and D. Kragic, “Grasp moduli spaces.” inRobotics: Science and Systems, 2013.

[35] F. T. Pokorny and D. Kragic, “Classical grasp quality evaluation: Newtheory and algorithms,” in IEEE/RSJ International Conference onIntelligent Robots and Systems (IROS), 2013.

[36] M. Salganicoff, L. H. Ungar, and R. Bajcsy, “Active learning forvision-based robot grasping,” Machine Learning, vol. 23, no. 2-3, pp.251–278, 1996.

[37] S. Salti, F. Tombari, and L. Di Stefano, “Shot: Unique signatures ofhistograms for surface and texture description,” Computer Vision andImage Understanding, vol. 125, pp. 251–264, 2014.

[38] A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel, “Bigbird:A large-scale 3d database of object instances,” in Proc. IEEE Int. Conf.Robotics and Automation (ICRA), 2014.

[39] G. Smith, E. Lee, K. Goldberg, K. Bohringer, and J. Craig, “Computingparallel-jaw grips,” in Proc. IEEE Int. Conf. Robotics and Automation(ICRA), 1999.

[40] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, “Gaussian processoptimization in the bandit setting: No regret and experimental design,”in Proc. International Conference on Machine Learning (ICML), 2010.

[41] T. Stouraitis, U. Hillenbrand, and M. A. Roa, “Functional powergrasps transferred through warping and replanning,” in Robotics andAutomation (ICRA), 2015 IEEE International Conference on. IEEE,2015, pp. 4933–4940.

[42] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-viewconvolutional neural networks for 3d shape recognition,” arXiv preprintarXiv:1505.00880, 2015.

[43] J. Weisz and P. K. Allen, “Pose error robust grasping from contactwrench space metrics,” in Robotics and Automation (ICRA), 2012 IEEEInternational Conference on. IEEE, 2012, pp. 557–562.

[44] W. Wohlkinger, A. Aldoma, R. B. Rusu, and M. Vincze, “3dnet: Large-scale object class recognition from cad models,” in Proc. IEEE Int.Conf. Robotics and Automation (ICRA), 2012.

[45] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao,“3d shapenets: A deep representation for volumetric shape modeling,”in CVPR, vol. 1, no. 2, 2015, p. 3.

[46] L. E. Zhang, M. Ciocarlie, and K. Hsiao, “Grasp evaluation withgraspable feature matching,” in RSS Workshop on Mobile Manipulation:Learning to Manipulate, 2011.

[47] Y. Zheng and W.-H. Qian, “Coping with the grasping uncertaintiesin force-closure analysis,” Int. J. Robotics Research (IJRR), vol. 24,no. 4, pp. 311–327, 2005.

dex-net 1.0: a cloud-based network of 3d objects for ... · dex-net 1.0 contains approximately 2.5...

Documents