Active Learning for Robot Exploration
Bayesian Optimization for Object Grasping
Jose Miguel Silva do Carmo Nogueira
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor: Prof. Alexandre Jose Malheiro Bernardino
Examination Committee
Chairperson: Prof. Joao Fernando Cardoso Silva Sequeira
Supervisor: Prof. Alexandre Jose Malheiro Bernardino
Members of the Committee: Prof. Joao Manuel de Freitas Xavier
May 2017
Luck at the game, luck at the game.
Jose Nogueira
Dedication
This work is for you, my great friend. For all the hours, for all the help. May I have been for
you everything that you were for me. I will never forget you.
Acknowledgments
Thanks to all my Vislab colleagues during my short time as researcher.
A great thank you to Lorenzo Jamone and Ruben Martinez-Cantin, my fellow research partners
and great teachers.
And to my mentor and true friend, Alexandre Bernardino, my greatest appreciation.
This work is not only my own. It belongs to all the giants on whose shoulders I stood.
Abstract
This thesis addresses Active Learning for Robot Exploration, which concerns learning an
unknown activity through experimentation. Specifically, we present the problem of Object Grasping
Optimization, where we wish to learn the optimal way to grasp an object. We discuss how we
evaluate grasp quality, how we use the information gathered during exploration to decide where to
grasp next, what inherent problems may affect the learning activity, and how we propose to solve
them.
The learning strategy employed here is Bayesian Optimization: a global optimization method,
where prior beliefs are used to form a stochastic model of an objective function we wish to learn. This
model is then used to determine where to sample next over the input space, given an active learning
criterion. Bayesian Optimization is an optimization technique for black-box functions, known as one
of the most successful trial-and-error techniques designed for sample efficiency, at the cost of extra
computation. It’s mainly used with functions that are expensive to evaluate (in terms of cost, energy,
time) and it has been called the intelligent brute-force algorithm.
The two main problems we address in this work are objective functions with varying smoothness
properties over the input space and the presence of input noise, both of which reduce Bayesian
Optimization's effectiveness. Both problems are noticeable aspects of Object Grasping, as well as of
many other applications.
These problems are tackled by using a heteroscedastic regression model with Bayesian Opti-
mization - Treed Gaussian Processes - and by formulating new learning criteria and a rule for
selecting the best candidate for the optimum, which we denote Unscented Bayesian Optimization.
A Treed Gaussian Process is a stochastic model which is, essentially, a composition of several
Gaussian Processes, each responsible for an exclusive partition of the input space with different
smoothness and noise parameters. Unscented Bayesian Optimization uses the unscented transform
to compute the expected value and variance of the stochastic model, in order to improve
optimization and active search in the presence of input noise modeled by a covariance matrix.
The results presented highlight how our methods outperform the classical Bayesian Optimization,
both in synthetic problems and in realistic robot grasp simulations.
Acronyms
BO Bayesian Optimization. 2–4, 6, 8, 9, 12, 21–23, 25, 27, 28, 30, 32–36, 38, 40, 42
BO-GP Bayesian Optimization using Gaussian Processes. 25, 28–31, 33, 35
BO-TGP Bayesian Optimization using Treed Gaussian Processes. 25, 28–33, 35, 42
GF Gramacy 2-D Exponential Function. 25, 26, 28–30
GM Mixture of 2D Gaussian distributions. 25, 26, 28–30, 34, 35, 40
GP Gaussian Process. 2, 4, 6, 8–10, 13, 15, 22, 27–33
GPs Gaussian Processes. 3, 6, 8, 9, 15, 16, 25, 27, 30
MCMC Monte Carlo Markov Chain. 13
RKHS 1D Reproducing Kernel Hilbert Space Function. 25–30, 33–35, 40
SP Stochastic Process. 2, 3, 9, 12, 18
TGP Treed Gaussian Process. 4, 15, 22, 27–33, 42
TGPs Treed Gaussian Processes. 15, 16, 27, 28, 30
UBO Unscented Bayesian Optimization. 4, 23, 34–36, 38, 40, 42
UO Unscented Outcome. 22
List of symbols
x∗ Optimization optimum.
xt+1 Next query.
X Input space.
d Number of input space dimensions.
f Objective function.
ε Observational noise.
y Observation. Query output.
D1:t Dataset. Sample until iteration t.
u Learning criterion function.
yt Observation at iteration t.
xt Query at iteration t.
εt Observational noise at iteration t.
GP Gaussian Process.
µ Mean.
k Kernel function. Covariance function.
kυ=5/2 Matern 5/2 covariance function.
σp Objective function Signal variance hyper-parameter.
σn Observational noise hyper-parameter.
σx Input space noise hyper-parameter.
li Kernel length on dimension i hyper-parameter.
θ Hyper-parameters.
θi Hyper-parameters i-th sample.
Θ Hyper-parameters’ samples.
µi(xq) Gaussian Process’ expected value for query.
σ2i (xq) Gaussian Process’ variance value for query.
Φ Normal distribution’s cumulative density function.
φ Normal distribution’s probability density function.
ξ Auxiliary exploitation/exploration parameter.
hi TGP binary test on feature i.
τ TGP binary test threshold.
wk Set of all TGP node weights, with order k.
x0, x(i)+, x(i)− Sigma points.
ω0, ω(i)+, ω(i)− Unscented weights.
Σx Input space noise covariance matrix.
δx, δy, δz, θx, θy, θz, s1 Simox application input search space variables.
xmci Monte Carlo input space i-th sample.
ymc Monte Carlo sample observation.
Contents
Acronyms viii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Report Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Related Work 5
2.1 Gaussian Process and Bayesian Optimization’s Fundamental Literature . . . . . . . . . 6
2.2 Learning Criterion Function Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Learning Criteria Functions Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Bayesian Optimization Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Simox - Robotics Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6 Grasping References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.7 Noise Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.8 Heteroscedasticity and varying smoothness properties . . . . . . . . . . . . . . . . . . . 9
2.9 Other honorable contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9.1 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9.2 Multi-Valued Objective Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Theoretical Concepts 11
3.1 Bayesian Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Gaussian Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Treed Gaussian Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.1 Tree Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.2 Hyper-parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Learning Criteria Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4.1 Expected Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.5 Unscented Bayesian Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5.1 Unscented transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5.2 Computing the unscented transformation . . . . . . . . . . . . . . . . . . . . . . 20
3.5.3 Unscented expected improvement . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5.4 Unscented optimal incumbent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Experiments 24
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Synthetic Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Robot Grasp Simulator - Simox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 Results BO-GP Vs. BO-TGP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4.2 Synthetic Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.3 Robot Grasp Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.3.A Simox Metric Signal profile . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.5 Results BO Vs. UBO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5.2 Synthetic Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5.3 Robot Grasp Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5 Conclusions 41
1 Introduction
1.1 Motivation
Active Learning - or Optimal Experimental Design - is a widely known technique in robotics for
online Machine Learning, where both prior data and samples acquired during task execution are used
to pursue a certain learning objective. These objectives may correspond to parameter optimization,
function estimation, among others.
In the present problem, active learning will be employed to determine an objective function -
a grasp metric which measures the quality of the grasp. This quality can be intuitively perceived
as the volume of the convex envelope originated by the points of contact on the object's surface and
their corresponding friction cones.
It must be noted that learning the best grasp configuration, i.e., the one which yields a high-valued
metric, may not be an easy task. To grasp an object one must consider gripper motion and object
reachability, all of which involve risk of collision, material wear and time as a resource. Furthermore,
in this setting the presence of controller noise and of different smoothness behaviors of the metric
along the object surface is very common: near an object's edge the returned grasp metric may be
highly variable depending on the real position of the grasping device, while performing a grasp near a
smooth surface will return a much less variable metric over multiple tries. These aspects make
learning such metrics non-trivial. Moreover, we wish not only to learn the configuration of the
end-effector¹ which leads to the highest-valued grasp metric, but also to take into account its
consistency given the presence of imprecision (noise) in the end-effector control.
In a broader perspective, we wish to learn about a specific metric or quantity whose behavior we
do not know beforehand. Here the knowledge acquired during the learning process takes form both
by decreasing uncertainty about the estimated metric over the input space and by determining where
this metric (or its estimate with low uncertainty) has high values.
The estimation of these objective functions will be done through a realization of a Stochastic
Process (SP). One of the advantages of such descriptors is that one can estimate both the expected
value and higher-order statistics at any point from only a limited set of sampled points. These
statistics give important information about where to explore next.
Bayesian Optimization (BO) typically uses such processes to describe the objective function
we wish to optimize². Its usefulness shows when we are trying to optimize functions which are
expensive to evaluate (the sample budget is a restriction) or when the optimization problem is
non-convex. BO methods have shown positive results in terms of the number of samples required to
approach function optima [1], [2], [3]. This feature derives from the ability to include prior belief when
choosing where to sample next and the ability to represent uncertainty over the estimated values.
A Gaussian Process (GP) is one commonly used SP in BO. In a GP the input domain is contin-
uous and every point in the domain is associated with a normally distributed random variable. More
specifically, any finite set of samples has a joint Gaussian distribution and any query point is also
normally distributed, with mean and variance values.

¹ Grasp manipulator with haptic sensors.
² Global maximization or minimization.
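This finite-marginal property can be illustrated numerically: the sketch below draws sample functions from a zero-mean GP prior over a grid of inputs. A squared-exponential kernel is assumed here purely for brevity (the covariance actually used in this work is the Matern 5/2, presented in Chapter 3).

```python
import numpy as np

def se_kernel(xa, xb, length=0.3, signal_var=1.0):
    """Squared-exponential covariance between two 1-D point sets."""
    d = xa[:, None] - xb[None, :]
    return signal_var * np.exp(-0.5 * (d / length) ** 2)

# Any finite set of inputs induces a joint Gaussian distribution:
# here, 50 grid points on [0, 1] with the kernel as covariance.
x = np.linspace(0.0, 1.0, 50)
K = se_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=3)
print(samples.shape)  # (3, 50): three function draws from the GP prior
```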
There exists a great variety of applications using Gaussian Processes (GPs) in active learning
and data estimation, ranging from bipedal walking learning [4], estimating sensor network data [5],
configuration of parameterized algorithms [6], parameter-based procedural animation design [7], path
planning [8], to the problem at hand: grasping optimization [9], [10], [11].
Despite the popularity of GPs and their use in BO, some of the GPs' limitations threaten applications
susceptible to noise over the input space, or where the objective function has different smoothness
behaviors over the input space: GPs do not model input noise in their regression model, and covari-
ance functions keep identical kernel lengths throughout the input space, i.e., the same smoothness
degree everywhere. These two aspects naturally hinder BO performance in such applications.
The learning problem at hand - grasp optimization - is an example where these kind of problems
may occur. This dissertation sets to expose and address these issues with state of the art solutions
for BO and evaluate their effectiveness.
1.2 Problem Statement
The general Bayesian Optimization problem is to find the global optimum (either maximum or
minimum value) of an unknown function f : Rd → R over a compact input domain X. In this case,
we wish to maximize f(.) which represents our grasping metric and x our end-effector parameterized
configuration:
x∗ = argmax_{x∈X⊂Rd} f(x), d ≥ 1    (1.1)
However, we do not inspect the values f(x) directly. Instead, we observe a value y which adds
observational noise to our metric (eq. 1.2), assumed to follow a normal distribution:
y(x) = f(x) + ε, ε ∼ N (0, σn(x)) (1.2)
Since we only know y(x), we must consider f estimation through a probabilistic distribution over
functions - a SP, which maximizes,
P (f |D1:t) ∝ P (D1:t|f)P (f) (1.3)
where D1:t = {x1:t, y1:t} are the observations made so far. Equation 1.3 is Bayes' Theorem
applied to this optimization, hence the name Bayesian Optimization.
Under appropriate conditions (see [3]), BO will converge to the optimal value x∗. One of these
conditions relates to how the optimization chooses the next sample xt+1, through a function known
as the learning criterion u : Rd → R. This function serves as a way to evaluate information gain in the
search for the global optimum and considers, implicitly or explicitly, a trade-off between exploration
(searching where the function estimate has high variance) and exploitation (searching where the
objective function has high expected values).
xt+1 = argmax_{x∈X⊂Rd} u(x)    (1.4)
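Equation 1.4 defines an inner optimization of its own. The sketch below approximates it by scoring random candidates over the compact domain X; this is a simplification for illustration only (the function and parameter names are assumptions, and Section 2.2 discusses DIRECT, the algorithm commonly used for this step).

```python
import numpy as np

def next_query(u, bounds, n_candidates=2000, rng=None):
    """Approximate x_{t+1} = argmax_x u(x) by scoring random candidates.

    u maps an (n, d) array of points to (n,) criterion values; bounds
    is a (d, 2) array with the [low, high] limits of the compact domain X.
    """
    rng = rng or np.random.default_rng()
    bounds = np.asarray(bounds, dtype=float)
    cand = rng.uniform(bounds[:, 0], bounds[:, 1],
                       size=(n_candidates, bounds.shape[0]))
    return cand[np.argmax(u(cand))]

# Toy learning criterion peaked at x = 0.7 on the 1-D domain [0, 1].
x_next = next_query(lambda x: -(x[:, 0] - 0.7) ** 2,
                    bounds=[[0.0, 1.0]], rng=np.random.default_rng(0))
print(x_next)  # a point very close to 0.7
```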
Typically, BO problems are considered under a fixed budget of samples available during the op-
timization process. The specified budget reflects the cost we assign to sampling one point; this cost
comes from material wear, risk of damage or collision, and computational cost. In fact, BO complexity
scales steeply with the number of samples taken: O(n³) [12].
To correctly evaluate the performance of this kind of optimization, one should weigh the difference
between the estimated global optimum and the real global optimum (if known) after using the bud-
get. Another standard procedure to evaluate performance is to plot the progression of the best sample
taken over the number of iterations and compare it with other learning methods. The latter will be used
throughout this thesis, since it is a standard procedure in the BO community.
As said in Section 1.1, BO will be used to optimize a grasp metric over a continuous space of
feasible parametric end-effector configurations. This metric may have different characteristics in terms
of smoothness, average value and observational noise, related to surface characteristics of the object
that are unknown beforehand. It is also important to take into account that mechanical imprecision
may exist, adding noise to the input space. The challenge at hand is to consider, implicitly or explicitly,
these three critical points, which influence BO performance in the grasping setting.
1.3 Report Outline
This dissertation is structured in four further sections.
Section 2 covers state-of-the-art articles and other published works considered in this disserta-
tion: Bayesian Optimization and the Stochastic Processes used within it, the grasp learning setting,
and noise and heteroscedasticity modeling.
Section 3 presents the theoretical basis for this work: the Bayesian Optimization algorithm, the
Gaussian Process model, learning criteria, the Treed Gaussian Process model and the Unscented
Bayesian Optimization variant.
Section 4 presents and discusses all results in detail. The section is divided into two main sets of
results: one for GP Vs. Treed Gaussian Process (TGP) and one for BO Vs. Unscented Bayesian
Optimization (UBO).
Section 5 concludes with this work's application spectrum and the relevant conclusions taken
from all the experiments carried out in this work.
2 Related Work
2.1 Gaussian Process and Bayesian Optimization's Fundamental Literature
Two of the most cited works related to BO and GPs are the MIT Press book by Carl
Rasmussen and Christopher Williams [13] and a publication by Eric Brochu, Vlad Cora and Nando
de Freitas [3].
Rasmussen and Williams [13] define GPs theoretically and practically. They address both regres-
sion and classification problems using GPs, remark on the importance of covariance functions - kernel
functions - and show how to estimate these kernels' hyper-parameters.
Brochu et al. [3] is a tutorial on BO using GPs. It states BO's requirements and restrictions, and
references GPs' regression equations and the most commonly used kernels. One of its important
contributions is the exposition of three learning criteria functions typically used in BO.
The first is the probability of improvement (PI) [14], a naive criterion that uses the mean and
variance outputs of the GP. The other two - the expected improvement (EI) [15] and the upper
confidence bound (UCB) [5] - are by far more robust and yield better results (see [5] for an
experimental comparison). The tutorial also shows applications where Bayesian Optimization was
applied and tested, such as Sony AIBO ERS-7 gait parameter learning, parameter learning for neural
network controllers, simulated driving task learning, and more.
2.2 Learning Criterion Function Optimization
Using a learning criterion function demands an optimization of its own, in order to obtain the next
point to sample, xt+1, as in equation 1.4. An algorithm often used [3, 6, 7, 9] to solve this optimization
problem is the DIRECT algorithm [16], which is applicable to real-valued Lipschitz-continuous
functions over bounded input domains.
2.3 Learning Criteria Functions Comparison
In Srinivas et al. [5], the authors present an alternative learning criterion function - the upper
confidence bound criterion - formalize its exposition, and compare it with other standard,
commonly used functions. With this function it is possible to determine bounds for the rate of
convergence of BO using Gaussian Processes. It also grants an explicit parameter that controls
the exploitation-exploration trade-off. The comparison between the different learning criteria used in this
dissertation may be examined in Table 2.1.
2.4 Bayesian Optimization Toolbox
In this thesis, we used and further implemented BO Algorithms with a Toolbox written in C++ by
Ruben Martinez-Cantin: BayesOpt [12]. This toolbox has a variety of stochastic surrogate models,
covariance functions, learning criteria functions and optimization algorithms. Also, it’s a state of the art
PI (probability of improvement)
- Positive aspects: simple, heuristic function.
- Negative aspects: purely exploitation oriented; no exploration term; needs the normal cumulative
distribution function; unknown rate of convergence.
- Experimental results [5]: the worst results of the three functions.

EI (expected improvement)
- Positive aspects: includes an implicit trade-off between exploitation and exploration; represents
the maximum likelihood estimator for the improvement metric.
- Negative aspects: the exploration/exploitation parameter does not express a linear control over
this adjustment; needs both the normal probability density and cumulative distribution functions;
unknown rate of convergence.
- Experimental results [5]: performance similar to UCB.

UCB (upper confidence bound)
- Positive aspects: includes an explicit trade-off between exploitation and exploration; linear
equation; known bounds for rates of convergence.
- Negative aspects: bounds for the rates of convergence are not trivially calculated; these bounds
are deeply connected with the trade-off parameter, and incorrect tuning of this parameter may
easily lead to local optimum convergence.
- Experimental results [5]: performance similar to EI, slightly better.
Table 2.1: Learning Criteria Functions’ Comparison
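As a concrete illustration of the EI criterion compared in Table 2.1, a minimal expected-improvement computation for maximization could be written as follows; ξ (`xi`) is the auxiliary exploitation/exploration parameter from the symbol list, and the function name itself is an assumption.

```python
import math

def expected_improvement(mu, sigma, y_best, xi=0.0):
    """EI at one query point, for maximization.

    mu, sigma: surrogate posterior mean and standard deviation;
    y_best: best observation so far; xi: exploration offset.
    """
    if sigma <= 0.0:
        return 0.0
    z = (mu - y_best - xi) / sigma
    # EI needs both the normal pdf and cdf, as the table notes.
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - y_best - xi) * cdf + sigma * pdf

print(expected_improvement(mu=1.0, sigma=0.5, y_best=0.8))  # about 0.315
```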
optimization toolbox and its performance, in terms of optimization gap and run-time, shows positive
results compared with other similar software. In this work, we augmented BayesOpt features and API
to incorporate new contributions.
2.5 Simox - Robotics Simulation Environment
In order to do experiments and collect data and results for the grasp learning experiments, we
used a simulator written in C++ by Nikolaus Vahrenkamp called Simox [17]. Simox provides a set of
three toolboxes that are associated to, respectively: model physics and visuals of objects and robot
identities; path planning for robot actuators; grasp quality assessment. This simulator is targeted for
grasping tasks and has a model for iCub, with remarkable hand representations.
2.6 Grasping References
Although this work does not directly address the choice of a grasping metric (since it uses
the one from the Simox toolbox), some other works on grasp quality metrics were considered
for this thesis.
Under the grasping optimization framework, Veiga [9] uses a Bimodal Wrench Space Analysis Met-
ric (BWSAM) to sample grasping quality over an object. This metric is particularly suited since it
handles both situations where a force-closure grip is performed (the object is correctly gripped) and
situations of non-force-closure (the grip device may or may not touch the object, and the object may fall if an
external force is applied). This work also considers a soft-touch contact approach¹ to target noise
reduction for the grasping metric. In terms of BO, Veiga uses regular Gaussian Process regression with a
Matern kernel, the expected improvement learning criterion, and no hyper-parameter optimization.
Similar to the previous approach, Dragiev et al. [10] use visual and haptic sensors to describe
the grasping metric as a potential field, which allows easy determination of the GP's prior using vision.
The potential field is subsequently used as a trajectory guidance metric for the grasping device, and it is
modeled through a Gaussian Process - GPISP. Although this work is important in terms of grasping,
its implementation does not consider BO.
Montesano and Lopes [11] considered the usage of Beta distributions to model the success of each
grasp. The input query is evaluated in terms of visual descriptors captured by a camera. The objective
function, in this case, is not a real-valued function: the grasping metric is evaluated in terms of a binary
outcome, success or failure. The relation between the input space and the outcome metric is
established using Beta Processes (BP), which are used in their Bayesian Optimization. The authors
define a set of active learning criteria, similar to PI and EI for regular GPs. The experiments
performed only compare performances between the developed learning criteria; no comparison is
made with existing solutions for grasping optimization. This way of grading a grasp might be an efficient
way to handle observational noise in this application.
As a final reference, Henriques [18] addresses end-effector calibration, parametrization and path
planning. This work is particularly important to this dissertation since it was developed for the iCub's
hand and defines hand-closure parametrization as a suitable learning input space for grasping tasks.
With a dataglove, Henriques collected data for a set of dexterous grasp types, in order to map the
motor actuators into a set of eigen-components, or grasp synergies. Mapping the actuators into the
synergy space reduces the dimensionality of the controlled variables. He also claims that this
description is less susceptible to calibration errors.
2.7 Noise Modeling
As previously said in Chapter 1, noise handling is an important consideration in grasping
applications (and in other real-world task applications). Describing observational noise with a single
hyper-parameter - as regular GPs already do - might be insufficient and lead to bad optimization.
Researchers have begun to address this issue, but only a few do so within the BO setting.
Tesch et al. [19] evaluate binary outcomes for stochastically modeled functions, just as Monte-
sano and Lopes [11] do. Their work is set in the BO paradigm, and performs a transformation
over a regular GP to obtain a new metric for binary classification. This metric is then used to define
a new learning criterion. Their work is tested on synthetic functions and later with a snake robot.
Experiments only compare results between different learning criteria, using either the regular GP or the
surrogate model with the specifically designed learning criterion. The authors also propose a new explo-
ration bound, called estimation bounds, that forbids exploration in certain areas, mainly those that have
already been heavily exploited. These implementations may perform well under noise conditions and
are yet to be tested in this setting.

¹ Considers a vicinity of points of contact around the collision point with the object.
McHutchon and Rasmussen [20] considered modeling noise in the input space, which is
subsequently included in the Gaussian Process - the Noisy Input GP, or NIGP. No Bayesian Optimization
is performed. This approach may model input noise well for grasping applications.
2.8 Heteroscedasticity and varying smoothness properties
Heteroscedasticity is a non-uniformity in the variability (in statistical terms, the variance) of a certain
measure. In Bayesian Optimization, the concept is specifically applied to refer to non-
stationarity of the objective function. In other words, such behavior is not correctly modeled with
fixed correlation parameters (which translate into fixed smoothness properties throughout the input
space). Some of the most recent approaches to modeling different smoothness properties in objective
functions use heteroscedastic regression models to address this issue.
Works from Le et al. [21], Kersting et al. [22], Kuindersma et al. [23] consider non-parametric
observational noise for SP modeling - this is called heteroscedastic noise. Le et al. [21] define jointly
the objective function and noise modeling with a series of equations - called Heteroscedastic GP
(HGP) regression. Regression is tested afterwards with synthetic data and supplied information from
different sources. This method outperforms GP in their experiments. It is also important to remark that
this method is later compared with McHutchon and Rasmussen [20], obtaining slightly lower results in
those experimental settings. Kersting et al. [22] propose an identical approach to the noise problem,
only this time both the objective function and the noise are modeled by two different GPs. Similar experiments
were performed, with similar results and conclusions: heteroscedastic models perform well
under variational noise conditions. Kuindersma et al. [23] use an almost identical approach to
Kersting et al. [22], only this time it is applied to BO rather than to a data-fitting problem. The authors
define a particular active learning criterion which will exploit the new description for the variational
noise and the expected value for the objective function. Their work is applied to pendulum control and
robot balance recovery.
2.9 Other honorable contributions
2.9.1 Curse of Dimensionality
One of the drawbacks of GP application is the curse of dimensionality: the BO rate of convergence
is highly influenced by the number of dimensions of the input space. Wang et al. [6] suggest a solution
to this problem. Their work assumes that only a subset of all dimensions has a predominant effect on
the objective function's output. They use a randomly sampled matrix as a linear transformation
between the higher-dimensional input space and a lower (effective) dimensional one. Under certain
conditions, their theorems state that if the higher-dimensional problem converges to the optimal
solution, then the lower-dimensional one also converges to the same optimum with high probability.
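The random-embedding idea can be sketched in a few lines; the matrix A below is the random linear map between the low effective space and the full input space (all names and the toy objective are illustrative, not taken from [6]).

```python
import numpy as np

# Random linear map from a low "effective" space R^d_eff into the full
# input space R^d_high; optimization would then run over z in R^d_eff.
rng = np.random.default_rng(1)
d_high, d_eff = 20, 2
A = rng.standard_normal((d_high, d_eff))

def f(x):
    """Toy 20-D objective that truly depends on only two coordinates."""
    return -(x[0] - 1.0) ** 2 - (x[3] + 0.5) ** 2

z = np.array([0.1, -0.2])  # a query in the low-dimensional space
x = A @ z                  # mapped into the original 20-D space
print(x.shape, f(x) <= 0.0)
```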
2.9.2 Multi-Valued Objective Functions
During the study of the state of the art, the usage of Multi-Valued Functions (functions that can
output different values for the same input) was considered. This subject is of great
importance in robotics (e.g., the inverse kinematics of serial robots), and it might eventually be applicable
to this study, since a GP cannot represent such functions. Damas and Santos-Victor
[24] tackle this issue. Unfortunately, their work only sets the problem in terms of regression, in which
all training data is sampled beforehand; additional work is required to include the selection of the
linear experts in an active learning sampling law. The proposed algorithm is compared with
two other Multi-Valued representation methods, showing better results.
3 Theoretical Concepts
3.1 Bayesian Optimization
Bayesian Optimization is an optimization technique used to find the global optimum of an objective
function which is either expensive to evaluate or does not have a closed-form expression. It makes use
of Bayes' Theorem to include prior belief about how we think the function behaves (smoothness,
parametric signal and noise modeling, sampled points) in estimating the target objective function. This
is done by using a probabilistic surrogate model, typically an SP: a distribution over the family of
functions P(f) to which the target function f(·) belongs.
BO also incorporates a decision-making process that takes all the information captured in the
surrogate model and selects, via a learning criterion, the next query point in order to maximize it. In
that way, BO can be understood as active learning applied to locating the optimum.
Algorithm 3.1 Bayesian Optimization
1: for t = 1, 2, ..., n do
2:    Update the SP with all the available dataset and prior information
3:    Find xt = argmax_x u(x|D1:t−1), where u(·) is the learning criterion
4:    Sample yt = f(xt) + εt; augment the dataset with the new observation {xt, yt}
5: end for
In line 2 of algorithm 3.1, one has the option to update (estimate) the hyper-parameters of the SP. These hyper-parameters determine our prior belief on the distribution of f() over the function space. By doing this estimation we actively adapt the hyper-parameters to the sampled data during the learning process.
3.2 Gaussian Process
A Gaussian Process is a surrogate model that describes the target function at a specific input query f(x) by its mean µ and covariance k. In this case, the target function is the grasping metric, which we wish to maximize by choosing an appropriate end-effector configuration.
f (x) ∼ GP (µ (x) , k (x,x′)) (3.1)
The covariance function determines smoothness properties of the objective function we are sam-
pling. It materializes how much the output f(x) is correlated over the input space. In this work we
used the Matern class covariance function (eq. 3.2, fig. 3.1), with υ = 5/2.
k_{ν=5/2}(x_j, x_{j′}) = σ_p² ∏_{i=1}^{d} (1 + √5 r/l_i + 5r²/(3l_i²)) exp(−√5 r/l_i)    (3.2)
σ_p and l_i are the hyper-parameters θ of the GP.
Figure 3.1: Matern class covariance function with υ = 5/2
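The product-form Matern kernel of eq. 3.2 can be transcribed almost directly. This is a sketch; in particular, interpreting r inside the product as the per-dimension distance r_i = |x_{j,i} − x_{j′,i}| is our assumption, since the equation writes a single r.

```python
import numpy as np

def matern52(xj, xk, sigma_p=1.0, length_scales=None):
    """Matern nu=5/2 ARD kernel of eq. 3.2 with per-dimension
    length-scales l_i and signal deviation sigma_p."""
    xj = np.atleast_1d(np.asarray(xj, float))
    xk = np.atleast_1d(np.asarray(xk, float))
    l = np.ones_like(xj) if length_scales is None else np.asarray(length_scales, float)
    r = np.abs(xj - xk)                                  # per-dimension distance
    term = (1 + np.sqrt(5) * r / l + 5 * r**2 / (3 * l**2)) * np.exp(-np.sqrt(5) * r / l)
    return sigma_p**2 * np.prod(term)
```

At zero distance the kernel returns the signal variance σ_p², and larger length-scales l_i make the correlation decay more slowly, i.e. the modeled function smoother along dimension i.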
The hyper-parameters are estimated (with the exception of σ_p, which is fixed) using Slice Sampling, a Markov Chain Monte Carlo (MCMC) algorithm [12, 25], in order to maximize their log-marginal-likelihood (eq. 3.3). In essence, it is a pseudo-random number sampler that approximates an arbitrary probability density function (in this case, the hyper-parameter likelihood function). The state of the MCMC chain (with m samples) results in a set of hyper-parameter particles Θ = {θ_i}_{i=1}^{m}, our estimates.
2 log p(y|x_{1:t}, θ) = −yᵀ(K_t^θ + σ_n²I)⁻¹y − log|K_t^θ + σ_n²I| − t log(2π)    (3.3)
Without loss of generality, we consider our prior belief of the objective function with zero mean and
covariance k.
[f; f_q] ∼ N( 0, [ K(X,X) + σ_n²I , k(X,x_q) ; k(x_q,X) , k(x_q,x_q) ] )    (3.4)
where K(X,X), k(X,x_q), k(x_q,X) and k(x_q,x_q) denote, respectively, the n × n, n × 1, 1 × n and 1 × 1 covariance matrices between pairs of samples. The subscript q marks a new query and σ_n parameterizes the observational noise. X denotes all query points in the current dataset.
To get the posterior distribution and estimate the target function value at a new query point x_q, with kernel k_i conditioned on the i-th hyper-parameter sample, k_i = k(·, ·|θ_i), we consider the joint Gaussian prior with the observed data points X, which gives:
f_q | x_q, X, y ∼ Σ_{i=1}^{m} N(µ_i(x_q), σ_i²(x_q))    (3.5)

where,

µ_i(x_q) = k_i(x_q, X) K_i(X,X)⁻¹ y
σ_i²(x_q) = k_i(x_q, x_q) − k_i(x_q, X) K_i(X,X)⁻¹ k_i(X, x_q)    (3.6)
Note that, because we use a sampling distribution of θ, the predictive distribution at any point xq
is a mixture of Gaussians.
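The prediction of eqs. 3.5-3.6 for a single hyper-parameter particle θ_i can be sketched as below (with m MCMC particles one would form the mixture of the m resulting Gaussians). Adding the σ_n² jitter on the diagonal follows eq. 3.4; the function names are ours, not the BayesOpt API.

```python
import numpy as np

def gp_predict(X, y, xq, kernel, sigma_n=1e-6):
    """Predictive mean and variance of eq. 3.6 for one kernel/
    hyper-parameter sample. `kernel(a, b)` is any covariance function."""
    K = np.array([[kernel(a, b) for b in X] for a in X]) + sigma_n**2 * np.eye(len(X))
    kq = np.array([kernel(xq, a) for a in X])
    mu = kq @ np.linalg.solve(K, np.asarray(y, float))
    var = kernel(xq, xq) - kq @ np.linalg.solve(K, kq)
    return mu, max(var, 0.0)   # clamp tiny negative values from round-off
```

With near-zero noise the posterior interpolates the observations (mean equal to the observed output, variance close to zero at sampled points), while the variance grows away from the data, which is exactly what the learning criteria of section 3.4 exploit.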
3.3 Treed Gaussian Process
A Treed Gaussian Process is distinguished from a GP in that it is a partially non-stationary regression model: it considers changes in the model's parametrization over the input space, as opposed to GPs. In other words, it can model objective functions with different smoothness behaviors over the input space. The main difference between this model and the ones from section 2.7 is that non-stationarity is due to the partition of the input space, where each partition has its own hyper-parameters, rather than having them vary continuously over the input space.
In Assel et al. [26], Treed Gaussian Processes (TGPs) were used to address the performance decrease of traditional Bayesian learning (using GPs) under heteroscedasticity. Here, the heteroscedasticity concept is applied to non-stationarity (or different smoothness behaviors). That work showed that using TGPs as surrogate functions yielded better results and faster convergence for their experiments.
A TGP can be described as a Decision Tree (fig. 3.2), where each leaf node corresponds to a single GP with a respective compact interval of the input space. This means that for any point of the input domain, there is only one GP that models the objective function in its corresponding compact interval, according to equations 3.6. The union of all leaves' intervals covers the whole input space and the intersection of any two of these intervals is empty. L denotes the set of all leaves of the TGP.
Figure 3.2: Treed Gaussian Process
To understand how to traverse the TGP, or in other words, to determine which leaf governs a specific point of the input space, one must understand how the non-leaf nodes work. In each non-leaf node a binary test is performed on x to determine to which child node x belongs. This test can be explicitly written as h_i(x) > τ, with i = 1..d and d the number of dimensions of x. i indexes a feature of x and τ is called the threshold. The function h_i is expressed by:
h_i(x) = x_i,  x_i = i-th component of x    (3.7)
Each one of these binary outcomes corresponds to exactly one of the child nodes.
3.3.1 Tree Construction
To construct the tree that models the current state of the learning process, we start with a tree composed of a single node that governs the whole input space. This tree is then split recursively until splitting is no longer viable.
We wish to split (if possible) every node into two children, resulting in an overall uncertainty reduction with respect to the original node, while also guaranteeing that the two new child nodes have a minimum number of samples. This last detail is crucial for hyper-parameter optimization, since with a low number of samples hyper-parameter estimation can be compromised.
The uncertainty of a node A is defined as:
U(A) = (1/|A|) Σ_{y_i ∈ A} (ȳ_A − y_i)²    (3.8)
where ȳ_A is the average of the outputs of the samples in A, |A| the number of samples and y_i the output of sample i.
Node A is split on feature i and threshold τ into two child nodes A′_{h,τ} and A″_{h,τ} if the split corresponds to an overall uncertainty reduction, if it does not violate the minimum number of samples per leaf and if it maximizes the following equation:

I(A, A′_{h,τ}, A″_{h,τ}) = U(A) − (|A′_{h,τ}|/|A|) U(A′_{h,τ}) − (|A″_{h,τ}|/|A|) U(A″_{h,τ})    (3.9)
Equations 3.7 and 3.9 imply that these splits occur at the sampled points. Therefore we have a finite and discrete set of features i and thresholds τ over which we wish to maximize eq. 3.9. It was also shown in Assel et al. [26] that this strategy maintains low variance in the vicinity of the splits (since the surrogate model's variance is lowest at the sampled queries). In this work we maximize the previous equation with a brute-force search.
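The brute-force search over eqs. 3.7-3.9 can be sketched as follows (function and parameter names are ours, not those of any particular library):

```python
import numpy as np

def node_uncertainty(y):
    """Within-node uncertainty U(A) of eq. 3.8."""
    y = np.asarray(y, float)
    return float(np.mean((y - y.mean()) ** 2))

def best_split(X, y, min_samples=2):
    """Brute-force search over features i and thresholds tau (taken at the
    sampled points, per eqs. 3.7 and 3.9) maximizing the uncertainty
    reduction I(A, A', A'')."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, d = X.shape
    best = (None, None, 0.0)                      # (feature, threshold, gain)
    for i in range(d):
        for tau in X[:, i]:
            left = X[:, i] > tau                  # binary test h_i(x) > tau
            if min_samples <= left.sum() <= n - min_samples:
                gain = (node_uncertainty(y)
                        - left.sum() / n * node_uncertainty(y[left])
                        - (~left).sum() / n * node_uncertainty(y[~left]))
                if gain > best[2]:
                    best = (i, float(tau), gain)
    return best
```

Each leaf of the growing tree would call this on its own samples and split whenever a positive gain exists under the minimum-samples constraint.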
3.3.2 Hyper-parameter estimation
As explained in Section 3.2, we use the log-marginal-likelihood to perform hyper-parameter opti-
mization for GPs.
For TGPs, an aggregation technique is used that allows the GP associated with a leaf j ∈ L to optimize its hyper-parameters with its own samples as well as samples from other leaves.
For the sake of notation, let y^(j) denote the data in node j and let y^(j\i) denote the data in node j excluding the data in node i. Let δ_j be the depth of node j, such that the root node has depth zero. Let ρ_j be the list of nodes in the path from node j to the root, and let ρ_{j,i} be the i-th element of the list ρ_j, such that ρ_{j,0} = j and ρ_{j,δ_j} = 0 (root).
We then consider the weighted marginal pseudo-likelihood decomposition (proposed in Assel et al.
[26]) as follows:
p(y|x_{1:t}, θ) ≈ p^{w_{j,0}}(y^(j)|x^(j), θ) × ∏_{i=1}^{|ρ_j|} p^{w_{j,i}}(y^{(ρ_{j,i}\ρ_{j,i−1})} | x^{(ρ_{j,i}\ρ_{j,i−1})}, θ)    (3.10)
Using eqs. 3.3 and 3.10, we obtain the weighted log-marginal-likelihood:
log p(y|x_{1:t}, θ) = w_{j,0} log p(y^(j)|x^(j), θ) + Σ_{i=1}^{|ρ_j|} w_{j,i} log p(y^{(ρ_{j,i}\ρ_{j,i−1})} | x^{(ρ_{j,i}\ρ_{j,i−1})}, θ)    (3.11)
One should think of the set (y^{(ρ_{j,i}\ρ_{j,i−1})}, x^{(ρ_{j,i}\ρ_{j,i−1})}) as all samples of the i-th order parent of node j except the samples of the (i−1)-th order parent of node j. For example, the 0-th order parent of node j is j itself, the 1-st order parent is its direct father node and the 2-nd order parent would be its grandfather node.
The purpose of using this aggregation technique for estimating the hyper-parameters is not only to make the leaves not totally independent from each other in terms of regression, but also to allow lower values for the minimum number of samples per leaf, so that hyper-parameter optimization is not compromised. One should notice, however, that by reinforcing this cross-effect we may also lose the opportunity to model the objective function more accurately.
As for the weights w_{j,i}, we propose a different approach from Assel et al. [26], in light of what was said in the previous paragraph. In their work they use a fixed formula to calculate the weights. We explore alterations to this formula and compare their performance. We consider the weights as:

w^(k)_{j,i} = (2 / (1 + δ_j − δ_i))^k    (3.12)
Higher values of k promote greater independence between leaves when estimating the hyper-parameters. In Assel et al. [26], k was equal to 1. We denote the set of all weights over j and i as w_k. We also denote, from now on, w_∞ as the case where each leaf only uses its own data for hyper-parameter estimation, i.e. w^(∞)_{j,0} = 1 and w^(∞)_{j,i} = 0, ∀i ≠ 0.
3.4 Learning Criteria Functions
The learning criterion functions u(·) considered were already referenced in Section 2.3. They are used to choose the next query x_q so as to gain the maximum information about the objective function we are trying to maximize, and to search where the global optimum is most likely to be located. Their usefulness comes from their sample-efficient results and their ability to optimize non-convex functions (as long as they are Lipschitz continuous¹). These criteria are expected to have distinct sampling behaviors over the optimization. It is also important to notice that these functions need a global maximization method in order to determine the next point to be sampled (eq. 1.4). For the remainder of this work, we will only consider the expected improvement criterion and its unscented variation (presented in section 3.5).
3.4.1 Expected Improvement
This criterion considers the expected value of the Improvement metric:

I(x) = max{0, f(x_{t+1}) − f(x⁺)}    (3.13)

where f(x_{t+1}) and f(x⁺) represent, respectively, the objective function value at the next query candidate and the maximum observation in D_{1:t}.
The probability of I can be calculated from the normal density function:

P(I(x)) = (1/(√(2π) σ(x))) exp(−((µ(x) − f(x⁺)) − I)² / (2σ²(x)))    (3.14)

where µ(x) and σ²(x) represent, respectively, the estimated mean and variance of f(x) given by the SP. In the BayesOpt library, using MCMC samples, these values are µ_{i=0} and σ²_{i=0} from equation 3.6.
The last equation yields²,

EI(x) = (µ(x) − f(x⁺)) Φ(Z) + σ(x) φ(Z)  if σ(x) > 0, and 0 otherwise,  with  Z = (µ(x) − f(x⁺)) / σ(x)    (3.15)
The EI criterion can be further improved to include an auxiliary parameter ξ, which tunes the exploitation-exploration trade-off (typically ξ = 0.01, scaled by the objective function signal variance σ_p [3]).

EI(x) = (µ(x) − f(x⁺) − ξ) Φ(Z_ξ) + σ(x) φ(Z_ξ)  if σ(x) > 0, and 0 otherwise,  with  Z_ξ = (µ(x) − f(x⁺) − ξ) / σ(x)    (3.16)
¹ A strong form of uniform continuity for functions.
² φ(·) and Φ(·) denote the PDF and CDF of the standard normal distribution, respectively.
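Equation 3.16 translates almost line by line into code. This sketch uses only the standard library; µ and σ would come from the surrogate's prediction at x.

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, y_best, xi=0.0):
    """EI criterion of eqs. 3.15-3.16: mu and sigma are the surrogate's
    predictive mean and standard deviation at x, y_best = f(x+), and
    xi >= 0 tunes the exploitation-exploration trade-off."""
    if sigma <= 0.0:
        return 0.0
    z = (mu - y_best - xi) / sigma
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)        # standard normal PDF
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))          # standard normal CDF
    return (mu - y_best - xi) * Phi + sigma * phi
```

Note how both terms matter: the first rewards candidates whose predicted mean already exceeds the incumbent (exploitation), while the second rewards high predictive uncertainty (exploration).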
3.5 Unscented Bayesian Optimization
In this dissertation, we consider the input noise during the decision process, to explore and select the regions that are safe, rather than modeling it during the function modeling stage [20]. That is, the regions that guarantee good results even if the experiment/trial is repeated several times in the same vicinity. This contribution is twofold: we present the unscented expected improvement and the unscented optimal incumbent. Both methods are based on the unscented transformation [27, 28], which was initially developed for tracking and filtering applications.
3.5.1 Unscented transformation
The unscented transformation is a method to propagate probability distributions through nonlinear transformations, trading off computational cost against accuracy. It is based on the principle that it is easier to approximate a probability distribution than to approximate an arbitrary nonlinear function. The unscented transformation uses a set of deterministically selected samples from the original distribution (called sigma points) and transforms them through the nonlinear function f(·). The transformed distribution is then computed from the weighted combination of the transformed sigma points.
One of the advantages of the unscented transformation is that the mean and covariance estimates of the new distribution are accurate to the third order of the Taylor series expansion of f(·), provided that the original distribution is a Gaussian prior, or up to the second order of the expansion for any other prior. Figure 3.3 highlights the differences between approximating the distribution using sigma points (UT) and using a standard first-order Taylor linearization (Lin.). The distribution from the UT is closer to the real distribution. Because the prior and posterior distributions are both Gaussians, the unscented transformation is a linearization method. However, because the linearization is based on the statistics of the distribution, it is often found in the literature as statistical linearization.
Another advantage of the unscented transformation is its computational cost. For a d-dimensional input space, the unscented transformation requires a set of 2d + 1 sigma points. Thus, the computational cost is negligible compared to other alternatives for distribution approximation, such as Monte Carlo, which requires a large number of samples, or numerical integration such as Gaussian quadrature, which has an exponential cost in d. Van der Merwe [29] proved that the unscented transformation is part of the more general family of sigma point filters, which achieve similar performance results. Other sigma point methods are the central difference filter (CDF) [30] and the divided difference filter (DDF) [31].
Figure 3.3: Propagation of a normal distribution through a nonlinear function. The first-order Taylor expansion (dotted) only uses information about the function at the mean point to compute the linear approximation, while the UT (dashed) approaches the function with a linear regression of several sigma points. The actual distribution is the solid one. (Adapted from [29])
3.5.2 Computing the unscented transformation
Assuming that the prior distribution is a Gaussian distribution x ∼ N(x̄, Σ_x), the 2d + 1 sigma points of the unscented transformation are computed with the following sampling strategy:

x⁰ = x̄
x^(i)+ = x̄ + (√((d + k) Σ_x))_i    ∀i = 1..d
x^(i)− = x̄ − (√((d + k) Σ_x))_i    ∀i = 1..d    (3.17)
where (√·)_i is the i-th row or column of the corresponding matrix square root. Here, k is a free parameter that can be used to tune the scale of the sigma points. Although it may break the positive-definiteness requirement, the original authors [28] recommended k = −3 or k = 1. To alleviate the potential numerical problems raised by negative k values and to increase the expressiveness of the methods, the authors later introduced the scaled unscented transform [32]. However, for our application such extra complexity is unnecessary.
For these sigma points, the weights are defined as:
ω0 =k
d+ k
ω(i)+ =
1
2(d+ k)∀i = 1..d
ω(i)− =
1
2(d+ k)∀i = 1..d
(3.18)
Then, the transformed distribution is computed as x′ ∼ N(x̄′, Σ′_x), where:

x̄′ = Σ_{i=0}^{2d} ω^(i) f(x^(i))    (3.19)
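Equations 3.17-3.19 can be sketched as below. Using the Cholesky factor as the matrix square root is a standard choice; the function name is ours.

```python
import numpy as np

def unscented_transform_mean(f, x_mean, cov, k=1.0):
    """Propagate N(x_mean, cov) through f: build the 2d+1 sigma points of
    eq. 3.17 with the weights of eq. 3.18, and return the weighted mean of
    eq. 3.19."""
    x_mean = np.atleast_1d(np.asarray(x_mean, float))
    d = len(x_mean)
    S = np.linalg.cholesky((d + k) * np.atleast_2d(cov))   # matrix square root
    pts = [x_mean] + [x_mean + S[:, i] for i in range(d)] \
                   + [x_mean - S[:, i] for i in range(d)]
    w = [k / (d + k)] + [1.0 / (2.0 * (d + k))] * (2 * d)
    return sum(wi * f(p) for wi, p in zip(w, pts))
```

Since the weights sum to one and the sigma points are placed symmetrically around the mean, the estimate is exact for linear f and, with a Gaussian prior, also for quadratic f, in line with the accuracy claims above.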
3.5.3 Unscented expected improvement
BO is about selecting the most interesting point at each iteration. This is done using criteria designed to select the point with the highest potential to become the optimum. However, all those methods assume that the observed value is exactly the outcome of the query plus some observation noise. They assume that the query is always deterministic. This is not true for the grasp optimization problem, where noise over the grip configuration controller (input space) may cause more pronounced estimation errors due to the absence of input noise modeling.
Instead, we are going to assume that the query is a probability distribution. Thus, instead of
analyzing the outcome of the criterion, we are going to analyze the resulting posterior distribution of
transforming the query distribution through the acquisition function. For the remainder of the section,
we assume that the input distribution corresponds to the input noise in each query point xq of the
BO process. That is, each query point is distributed according to an isotropic multivariate normal
distribution N (0, Iσx).
For the purpose of safe BO, we will use the expected value of the transformed distribution as the
acquisition function. In this case, we will use the expected improvement. Therefore, the unscented
expected improvement is computed as:
UEI(x) = Σ_{i=0}^{2d} ω^(i) EI(x^(i))    (3.20)
where x^(i) and ω^(i) are computed according to equations (3.17) and (3.18) respectively, using Σ′_x = Iσ_x. Note that we only compute the expected value of the transformed distribution, x̄′ = UEI(x). This value is enough to make a decision considering the risk due to the input noise. However, the value of Σ′_x also represents the output uncertainty and can be used as a meta-analysis tool, that is, as a risk on the estimation of the risk.
3.5.4 Unscented optimal incumbent
The unscented expected improvement can be used to drive the search procedure towards safe regions. However, because the target function is unknown by definition, the sampling procedure can still query good outcomes in unsafe areas. In grasping experiments, for example, one may still take samples (although with lower frequency than without the unscented expected improvement) with high-valued observations and high signal variance in their vicinities (unsafe areas for grasping).
Furthermore, in BO there is a final decision that is independent of the acquisition function employed. Once the optimization process is stopped after sampling N queries, we still need to decide which point is the best. Moreover, after every iteration, we may need to say which point is the incumbent, i.e. the best observation.
If the final decision about the incumbent is based on the greedy policy of selecting the sample with the best outcome x*, such that y_best = f(x*), we may select an unsafe query.
Instead, we propose to apply the unscented transformation also to the selection of the optimal incumbent x*, based on the function outcome f(). This would require evaluating f() at the 2d + 1 sigma points for each observation. However, the main idea of BO is to reduce the number of evaluations of f(). Therefore, we evaluate the sigma points at the GP prediction µ(). Thus, let us define the Unscented Outcome (UO) as:
UO(x) = Σ_{i=0}^{2d} ω^(i) Σ_{j=1}^{m} µ_j(x^(i))    (3.21)

where Σ_{j=1}^{m} µ_j(x^(i)) is the prediction of the GP or TGP according to equation (3.6), integrated over the kernel hyper-parameters and evaluated at the sigma points of equation (3.17).
Under these conditions, the incumbent of the optimal solution x∗ corresponds to:
x* = argmax_x UO(x)    (3.22)
In the BO literature, when f() represents a stochastic function with large output noise, it is common to return the expected value of the GP at the optimum query instead of the optimum observation (eq. 1.1). Note that our method is also valid under those conditions.
As an illustrative and motivational example for the unscented optimal incumbent, observe the function in Fig. 3.4. In this case, the maximum of the function is at x ≈ 0.87. However, this maximum is very risky, that is, small variations in x result in large deviations from the optimal outcome. On the other hand, the local maximum at x ≈ 0.07 is much safer. Even if there is noise in x, repeated queries will produce similar outcomes. In this case, if we assume input noise of σ_x = 0.05 and compute the unscented transformation of that noise through the function, we can see that the sigma points centered at the leftmost maximum have a higher unscented outcome than the sigma points centered at the global maximum. We conclude that the expected posterior value of the local smooth maximum is larger than the value at the global narrow maximum.
Figure 3.4: RKHS function as in https://github.com/iassael/bo-benchmark-rkhs
In summary, our method uses the unscented transformation to compute the decision functions in BO, assuming that each query is a probability distribution (due to the input noise) instead of a deterministic value. We found that, for BO, we need to consider the unscented version of the acquisition function, for which we propose the unscented expected improvement. Furthermore, we also need to take into consideration the decision to select the best observation or the potential optimum. In this case, we propose the unscented optimal incumbent as a robust selection method. Overall, we call this version of BO UBO.
4 Experiments
4.1 Overview
In this section we describe the methods used and the experiments carried out in this work. Results will be evaluated separately in two main subsections: one comparing Bayesian Optimization using Treed Gaussian Processes (BO-TGP) against Bayesian Optimization using Gaussian Processes (BO-GP) (or simply BO); the other comparing the benefits of Unscented Bayesian Optimization (UBO) with respect to classical Bayesian Optimization (BO). We opted to use GPs for the unscented experiments since they are the standard regression models in BO, which makes the comparison more standard for both contributions.
For both sets of experiments, we first illustrate each type of optimization with synthetic functions (of low input space dimensionality) that allow us to visually understand the proposed contributions. Then, we show the results of autonomous grasping exploration of daily life objects with a dexterous robot hand using realistic simulations [17].
In all BO experiments of this work, we have used an extended version of the BayesOpt software [12] with the proposed methods. For the kernel function, we used the standard choice of the Matern kernel with ν = 5/2. We used slice sampling for the kernel hyper-parameter optimization (as explained in section 3.2).
4.2 Synthetic Functions
The synthetic functions presented here are the 1D Reproducing Kernel Hilbert Space function (RKHS) [33], the Gramacy 2-D Exponential Function (GF) [34] and a specially designed Mixture of 2D Gaussian distributions (GM) (courtesy of Ruben Martinez-Cantin); see Fig. 4.1.
All three functions will be used for the BO-TGP Vs. BO-GP experiments. Results from these experiments carry important information since the three functions have different degrees of smoothness over their input space (which may influence standard BO effectiveness).
As for the UBO experiments, only the RKHS and GM functions will be used. We chose to do so since both have multiple local maxima, with one global maximum located at a narrow peak. The global maximum in both functions represents a region of high risk in the presence of significant input noise. This is not the case for the GF function, which was left out of that set of experiments.
4.3 Robot Grasp Simulator - Simox
We use the Simox simulation toolbox [17] as the simulation environment for the robot exploratory task. Simox simulates the iCub robot's hand grasping arbitrary objects.
Given an initial pose for the robot hand and a finger joint trajectory, the simulator runs until all fingers are in contact with the object surface and subsequently computes a grasp quality metric with the wrench space analysis from the Simox toolbox [17]. We use a representation of the iCub's left hand, which can move freely in space (Fig. 4.3), and a few static objects, shown in Fig. 4.2.
Figure 4.1: Synthetic Functions. (a) RKHS function (b) GM function (c) GF function
(a) Water bottle (b) Mug (c) Glass (d) Drill
Figure 4.2: Objects used in the simulations with corresponding initial robot hand configuration.
The robot hand is initially placed with the palm facing parallel to one of the facets, at a fixed distance from the object bounding box, and with the thumb aligned with one of the neighboring facets (Fig. 4.3). This setup uniquely defines the default pose of the hand with respect to the object. The learning goal is then to find the optimal grasp pose by choosing incremental translations and rotations (δx, δy, δz, θx, θy, θz) of the hand's pose with respect to its default pose.
Figure 4.3: The Simox robot grasp simulator. The iCub's left hand is used to perform grasping trials on arbitrary objects, in this case a glass. The red lines around the glass represent an Object-Oriented Bounding Box, whose facets are used to set up the initial hand configuration.
As for grasp closure primitives, we implement a parametrization similar to Henriques [18]. This parametrization maps hand closure synergies into the motor control space and then to the hand joints' values. We define distinct types of grasps, which are modeled with a principal component description. Each grasp closure is controlled by one parameter, associated with the most energetic component and closure motion, and a set of other components that control different finger position adjustments s_i. These low-energy components can be used as learning variables to optimize the grasp, in addition to the translation and rotation parameters.
Specifically, in the experiments we used a power grasp posture synergy with only one component for posture adjustments (s_1). The advantage of the BO methodology as a black-box optimization is that the system is agnostic to the selected parametrization, which can easily be replaced.
4.4 Results BO-GP Vs. BO-TGP
4.4.1 Motivation
Before presenting and reviewing this section's experimental results, we show a model regression example for the RKHS function to illustrate the practical differences between GPs and TGPs as surrogate models. See figure 4.4.

Figure 4.4: Model regression with GP vs TGP. (a) BO-GP (b) BO-TGP

In this example, we took 27 random samples of the RKHS function into a GP and a TGP, respectively. The main differences between the two surrogate models are that TGPs can better estimate (compared to GPs) the expected value of the objective function f() and show much less variance between samples. Note that the latter fact decreases the chances of taking samples where the variance would be high due to inadequate estimation over smooth areas (as can be seen in sub-figure 4.4(a)). Revisit section 3.4.
This happens as a result of the TGPs' capability to separate regions of the input space with different smoothness characteristics. In turn, better model regression improves the learning criterion's effectiveness by giving more accurate information about the objective function, which benefits BO overall.
4.4.2 Synthetic Functions
We have performed 100 runs of BO for all three functions (RKHS, GF, GM) and the optimization
procedure using GP and TGP (with different weights wk - see sub-subsection 3.3.2).
For RKHS, each run has 5 initial random samples and the optimization performs 45 iterations, with σ_p = 3.93, s_l = 8¹; for GF, each run has 20 initial samples and the optimization performs 40 iterations, with σ_p = 0.2², s_l = 15; for GM, each run has 30 initial samples and the optimization performs 110 iterations, with σ_p = 0.642, s_l = 15. For all functions we used σ_n = 10⁻⁶.

Results for these experiments are shown in figure 4.5 and table 4.1.

From the RKHS and GM results we can clearly see that BO-TGP finds the optimal value more frequently (and more quickly) than BO-GP. This happens since TGP can have different kernel bandwidths l_i

¹ s_l is the minimum number of samples per TGP leaf.
² It is common to randomly sample the objective function to estimate σ_p as the standard deviation of those samples. Since this may not be possible in all cases, we assume the worst case, where the objective function takes only two values, the global maximum and the global minimum. This yields σ_p = (f(x⁺) − f(x⁻))/2.
Figure 4.5: GP Vs. TGP results, synthetic functions: best sample observation y over the number of iterations. (a) RKHS (b) GF (c) GM. Black - BO-GP. Blue - BO-TGP and w1. Green - BO-TGP and w2. Red - BO-TGP and w3. Magenta - BO-TGP and w∞. Dashed red - first TGP node split. Shaded regions represent the standard deviation of the best observation at each iteration.
y(x*) ± σ(y(x*))

Function | GP              | TGP w1          | TGP w2           | TGP w3           | TGP w∞
RKHS     | 5.3279 ± 0.0642 | 5.3993 ± 0.0638 | 5.4472 ± 0.0621  | 5.4375 ± 0.0627  | 5.7305 ± 0.0128
GF       | 0.4081 ± 0.0093 | 0.4008 ± 0.0079 | 0.3974 ± 0.0108  | 0.3900 ± 0.0118  | 0.4106 ± 0.0074
GM       | 0.1228 ± 0.0011 | 0.1295 ± 0.0001 | 0.1297 ± 8.1e−04 | 0.1302 ± 8.3e−04 | 0.1309 ± 6.5e−04
Table 4.1: Results at the last iteration of the BO process (means and standard deviations over all runs).
over the input space and can, therefore, better estimate the expected value and variance of the objective function f(). This improves exploration with the given learning criterion and allows BO to use the sample budget more efficiently: BO may search for other possible local optima instead of wasting samples where the variance would otherwise be high (with low expected value), as with GPs.
As for the GF function, BO-TGP does not show significantly better results than BO-GP. While TGPs help BO by better estimating the objective function and improving the exploration of multiple local optima, the GF function only has one global/local maximum. GPs may give a worse estimation for GF but, by having higher overall process variance, they tend to make BO reach the single global optimum with a similar convergence rate.
As a final note, our GF results contradict the results of Assel et al. [26], which we could replicate using σ_n = 10⁻³ and σ_p = 1. Later on we confirmed that these were exactly the values used throughout their experiments. As we increase the parameter σ_n, the surrogate model's overall variance also increases (check equations 3.6 and 3.2). This leads to lower rates of convergence for BO in general (as can be seen by comparison with the results from Assel et al. [26]). However, this is less impactful for BO-TGP due to its ability to reduce the overall process variance (as explained in section 4.4.1), which makes BO-TGP show better results in their setting. This observation explains their results in comparison with the ones presented here.
4.4.3 Robot Grasp Simulations
We have performed 30 runs of BO for all the proposed objects (Water bottle, Mug, Glass, Drill) and the optimization procedure using GP and TGP (with different weights w_k - see subsection 3.3.2). The water bottle object was evaluated on two different facets (one from the side and one from the top), while all the other objects were only evaluated on one of the sideways facets.
Each run has 30 initial samples and the optimization performs 120 iterations, with σ_p = 0.35, σ_n = 10⁻⁴. The input search space is composed of (δx, δy, δz, θx, θy, θz, s_1).
Results for these experiments are shown in figure 4.6 and table 4.2.
From the results, we can observe that BO-TGP is generally better than BO-GP. The difference, this time, is that it is not clear which k to use.
As these experiments were performed on a 7-D learning optimization task, we cannot directly visualize explanations for these observations.
Aside from the experiment in figure 4.6(e) - Drill - higher values of k seem to show better results
Figure 4.6: GP Vs. TGP results, Simox simulation environment. (a) Water bottle - Sideways (b) Water bottle - Top (c) Mug - Sideways (d) Glass - Sideways (e) Drill - Sideways. Black - BO-GP. Blue - BO-TGP and w1. Green - BO-TGP and w2. Red - BO-TGP and w3. Magenta - BO-TGP and w∞. Dashed red - first TGP node split. Shaded regions represent the standard deviation of the best observation at each iteration.
y(x*) ± σ(y(x*))

Object                  | GP              | TGP w1          | TGP w2          | TGP w3          | TGP w∞
Water bottle - Sideways | 0.5867 ± 0.0339 | 0.6080 ± 0.0231 | 0.6008 ± 0.0239 | 0.5775 ± 0.0320 | 0.6001 ± 0.0222
Water bottle - Top      | 0.5200 ± 0.0346 | 0.5567 ± 0.0260 | 0.5546 ± 0.0194 | 0.5689 ± 0.0261 | 0.5761 ± 0.0255
Mug - Sideways          | 0.1704 ± 0.0138 | 0.1620 ± 0.0145 | 0.1678 ± 0.0180 | 0.1847 ± 0.0156 | 0.1778 ± 0.0135
Glass - Sideways        | 0.4526 ± 0.0263 | 0.4394 ± 0.0180 | 0.4444 ± 0.0267 | 0.4803 ± 0.0229 | 0.4584 ± 0.0260
Drill - Sideways        | 0.1328 ± 0.0095 | 0.1209 ± 0.0080 | 0.1227 ± 0.0087 | 0.1216 ± 0.0117 | 0.1239 ± 0.0096
Table 4.2: Results at the last iteration of the BO process (means and standard deviations over all runs).
overall, specifically k = 3 and k = ∞ (similar to the synthetic results). Still, it is not possible to conclude that BO-TGP improves progressively as k increases. Since k models the cross-correlation of data between leaves, different values of k may yield different BO-TGP performance depending on the characteristics of the objective function.
4.4.3.A Simox Metric Signal profile
Before performing the BO experiments, we observed that the Simox grasping metric exhibits very high signal variance within small vicinities of the input space (see figure 4.7).
Figure 4.7: Simox metric profile for the water bottle. The horizontal axis represents δx and the vertical axis represents the metric value. The default pose was chosen arbitrarily.
This may worsen BO performance and, in particular, affects BO-TGP more severely: since each TGP leaf holds far fewer points, hyper-parameter optimization becomes more susceptible to the data distribution.
As a final experiment for this subsection, we examined how BO-TGP and BO-GP are affected by the presence of observational noise, with the objective of simulating the grasping metric of figure 4.7.
The signal noise is defined as y = f(x) − |εy|, with εy ∼ N(0, σε).
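In code, this one-sided noise model is a one-liner (the helper name is ours; σε = 0.2 matches the noisier case of figure 4.8):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_observation(f, x, sigma_eps=0.2):
    """Observation model y = f(x) - |eps_y|, with eps_y ~ N(0, sigma_eps).
    The noise is one-sided: it can only lower the observed metric,
    mimicking the drops seen in the Simox grasp metric."""
    return f(x) - abs(rng.normal(0.0, sigma_eps))
```

Note that since E[|εy|] = σε·sqrt(2/π), this noise not only spreads the observations but also biases them downward.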
Figure 4.8: GP vs. TGP results in the Simox simulation environment. (a) εy = 0; (b) εy ∼ N(0, 0.2). Black: BO-GP. Blue: BO-TGP with w1. Green: BO-TGP with w2. Red: BO-TGP with w3. Magenta: BO-TGP with w∞. Dashed red: first TGP node split. Shaded regions represent the standard deviation of the best observation at each iteration.
Function        GP               TGP w1           TGP w2           TGP w3           TGP w∞
εy = 0          5.3279 ± 0.0642  5.3993 ± 0.0638  5.4472 ± 0.0621  5.4375 ± 0.0627  5.7305 ± 0.0128
εy ∼ N(0, 0.2)  5.2759 ± 0.0773  5.2074 ± 0.0781  5.2118 ± 0.0831  5.3287 ± 0.0674  5.2039 ± 0.0629
Table 4.3: Results y(x*) ± σ(y(x*)) at the last iteration of the BO process (means and standard deviations over all runs).
The results in table 4.3 and figure 4.8 show a tendency similar to the one observed in 4.4.3. All BO runs suffered in the presence of the noise εy, but BO-TGP's performance degrades far more than BO-GP's under observational noise (BO-GP was better in all but one case).
4.5 Results BO Vs. UBO
4.5.1 Motivation
Before presenting and reviewing all experimental results, we show a regression example on the RKHS function to illustrate the practical differences between the unscented expected improvement and the standard expected improvement (see figure 4.9).
Figure 4.9: RKHS posterior, σx = 0.01. (a) Expected improvement; (b) unscented expected improvement.
In this example, we fed 40 random samples of the RKHS function to a Gaussian process. Sub-figure (a) shows the expected value and variance of the objective function given by the GP, while sub-figure (b) shows the expected value and variance under the unscented criterion, given by:
U_{yq} = Σ_{i=1}^{m} ω^(i) y_q^(i),        U_{σ²q} = Σ_{i=1}^{m} ω^(i) σ²_q^(i)        (4.1)

where m is the number of sigma points, ω^(i) are the unscented weights, and y_q^(i) and σ²_q^(i) are the GP posterior mean and variance evaluated at the i-th sigma point.
In comparison, one can see that the unscented expected values are smoother for both the GP's mean and its variance. The most important feature, however, is that the expected mean value at the global maximum (which carries high risk) is now lower than the value at the local maximum near x ≈ 0.078. Thus, under the unscented expected improvement, the local maximum at x ≈ 0.078 is considered the learning optimum, and its vicinity is chosen more frequently than under the standard expected improvement criterion.
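The computation behind (4.1) can be sketched as follows: sigma points are placed around the query according to the input noise, and the GP posterior mean and variance are averaged with the unscented weights. The symmetric sigma-point set and the κ scaling follow the standard unscented transform [27, 28]; abstracting the GP posterior as two callables, mu and var, is a simplification of this sketch.

```python
import numpy as np

def sigma_points(xq, sigma_x, kappa=0.0):
    """Symmetric sigma-point set and weights for the input
    distribution N(xq, sigma_x^2 I) (standard unscented transform)."""
    n = len(xq)
    spread = np.sqrt(n + kappa) * sigma_x
    pts = np.vstack([xq, xq + spread * np.eye(n), xq - spread * np.eye(n)])
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    return pts, w

def unscented_outcome(mu, var, xq, sigma_x=0.01):
    """Eq. (4.1): unscented expected value and variance of the
    GP posterior at query xq under input noise sigma_x."""
    pts, w = sigma_points(np.atleast_1d(np.asarray(xq, dtype=float)), sigma_x)
    u_y = sum(wi * mu(p) for wi, p in zip(w, pts))
    u_s2 = sum(wi * var(p) for wi, p in zip(w, pts))
    return u_y, u_s2
```

For a linear posterior mean the unscented estimate is exact; for the smoothing effect seen in figure 4.9(b), the averaging over sigma points pulls down narrow, risky peaks.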
4.5.2 Synthetic Functions
To reproduce the effect of input noise, we queried the objective function with 100 Monte Carlo samples drawn from the input noise distribution at each iteration. From the outcomes of these samples we estimate the sample mean at the query xq, y_mean(x_i^mc), and the sample standard deviation of the optimum, y_std(x_i^mc).
We performed 100 runs of Bayesian optimization for both functions (RKHS and GM) and both optimization procedures (BO and UBO).
For RKHS, each run has 5 initial random samples and the optimization performs 45 iterations, with input noise σx = 0.01. For GM, each run has 30 initial samples and the optimization performs 90 iterations, with input noise σx = 0.1. All other optimization parameters are identical to those of the BO-TGP vs. BO-GP experiments.
In Fig. 4.10 and Fig. 4.11 we show the statistics over the different runs for the evaluation criteria
with respect to the number of iterations. The shaded region represents the 95% confidence interval.
Figure 4.10: RKHS results. (a) ymc(x*); (b) std(ymc(x*)).
Figure 4.11: GM results. (a) ymc(x*); (b) std(ymc(x*)).
For both functions, we can observe that UBO quickly surpasses the results of BO. As soon as the random exploration phase finishes and the optimization starts, UBO computes less risky solutions, as demonstrated by the higher expected return values and lower standard deviations. In table 4.4 we show the numeric results obtained at the last iteration. We also show values for the worst sample of the Monte Carlo runs³. The worst case for UBO is always more favorable than the worst case for BO, by a large margin.
4.5.3 Robot Grasp Simulations
We have performed 30 runs of the robotic grasp simulation for each object and each optimization criterion. The robot hand posture with respect to the objects is initialized as shown in Fig. 4.2. The input search space, in this case, was composed of (δx, δy).
Each run starts with 40 initial random samples and proceeds with 60 iterations of Bayesian optimization, for a total of 100 evaluations. In this case, we assume that the function is stochastic, due to small numerical errors in the simulation, with σy = 10−4. We also assume an input noise σx = 0.03 (note that the input space was normalized in advance to the unit hyper-cube [0, 1]^d). In each iteration we sample 20 times at the query point with input noise to compute the expected outcome. The results can be observed in figures 4.12, 4.13, 4.14 and 4.15. The plots appear noisier than those for the synthetic functions because, for the sake of computation time, fewer samples were taken at the query points, as each one required running the full grasp simulation. Note also that these samples are used only for evaluation purposes and would not be required for the optimization process.
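The normalization and the per-query noisy evaluation can be sketched as below. The helper names, the example box bounds, and the clipping of perturbed points back to [0, 1]^d are our assumptions (the text does not state how out-of-box perturbations are handled):

```python
import numpy as np

def to_unit(x, lo, hi):
    """Map a point from the box [lo, hi] to the unit hyper-cube [0, 1]^d."""
    x, lo, hi = (np.asarray(v, dtype=float) for v in (x, lo, hi))
    return (x - lo) / (hi - lo)

def from_unit(u, lo, hi):
    """Inverse map from the unit hyper-cube back to the original box."""
    u, lo, hi = (np.asarray(v, dtype=float) for v in (u, lo, hi))
    return lo + u * (hi - lo)

def noisy_query_points(xq, lo, hi, sigma_x=0.03, n=20, rng=None):
    """Sample n evaluation points around xq: the perturbation is applied
    in normalized coordinates (where sigma_x = 0.03 is defined), then
    mapped back to the original box. Clipping is our assumption."""
    if rng is None:
        rng = np.random.default_rng(0)
    u = to_unit(xq, lo, hi)
    us = np.clip(u + rng.normal(0.0, sigma_x, size=(n, u.size)), 0.0, 1.0)
    return np.array([from_unit(ui, lo, hi) for ui in us])
```

Defining σx in the normalized space keeps the noise magnitude comparable across dimensions with very different physical ranges (translations vs. rotations).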
Figure 4.12: Water bottle. Input space noise σx = 0.01. (a) ymc(x*); (b) std(ymc(x*)).
It can be observed that, for the water bottle and the glass, the UBO method has clear advantages over BO. As soon as the initial sampling phase finishes, UBO obtains higher mean values and lower standard deviations. For the drill, UBO eventually surpasses BO, but only at later iterations, which might imply that the unsafe optimum is difficult to find, but still exists.
Figure 4.13: Mug. Input space noise σx = 0.03. (a) ymc(x*); (b) std(ymc(x*)).
Figure 4.14: Glass. Input space noise σx = 0.03. (a) ymc(x*); (b) std(ymc(x*)).
Figure 4.15: Drill. Input space noise σx = 0.03. (a) ymc(x*); (b) std(ymc(x*)).
³ Worst cases are not shown graphically due to lack of space, but they are coherent with the evolution of the means.
Looking at the quantitative results in Table 4.4, we can see that, at the end of the optimization, UBO is better than BO in all criteria, except for the mean output value for the mug. For the mug, the 100 iterations are not
enough to obtain better mean values. We can see that the mug and drill objects are more challenging due to their lack of rotational symmetry. Since the optimization is done only over translation parameters, the method is missing exploration in the rotational degrees of freedom. Furthermore, in the mug's case, the chosen facet was the one containing the mug's handle. Learning a grasp in this setting is much harder than in the other cases since, for the same input space volume, the percentage of configurations that return a good metric is much smaller. For the water bottle and the glass, the rotational degrees of freedom matter less because these objects are rotationally symmetric.
In Fig. 4.16 we illustrate four grasps on the water bottle explored during the experiments. Two of the grasps are performed in a safe region, while the other two are explored in an unsafe region. Although the unsafe zone contains the observation with the highest value, it also carries a higher risk of yielding a low value in its vicinity.
As a final experiment, we ran UBO vs. BO for the glass object using the full set of learning variables from section 4.4.3, and obtained similar results (see figure 4.17).
Figure 4.16: Grasp safety. (a) Safe zone, y = 0.413; (b) safe zone, y = 0.418; (c) unsafe zone, y = 0.439; (d) unsafe zone, y = 0.377. In this example the best grasp is in an unsafe zone (c); however, a bad grasp lies in its vicinity (d). The unscented Bayesian optimization chooses lower-risk grasps in the safe zone, (a) and (b), where performance is robust to input noise.
Figure 4.17: Glass, 7 dimensions. Input space noise σx = 0.03. (a) ymc(x*); (b) std(ymc(x*)).
                     ymc(x*)          worst ymc(x*)    std(ymc(x*))
                     BO      UBO      BO      UBO      BO      UBO
Synthetic problems
RKHS                 4.863   4.934    2.881   4.657    0.554   0.065
GM                   0.080   0.093    0.023   0.053    0.027   0.014
Simulation - Simox
Bottle               0.550   0.567    0.390   0.430    0.077   0.065
Mug                  0.119   0.114    0.051   0.059    0.029   0.027
Glass                0.421   0.452    0.080   0.252    0.184   0.087
Drill                0.101   0.108    0.050   0.068    0.030   0.018
Table 4.4: Results at the last iteration of the BO process (means and standard deviations over all runs).
5 Conclusions
This dissertation presented novel Bayesian optimization methods for robust grasping: Treed Gaussian Processes as a surrogate model (evaluated against standard Gaussian Processes) and Unscented Bayesian Optimization (evaluated against standard Bayesian optimization). Both contributions address classical BO problems: heteroscedasticity and proper input noise modeling. These problems affect the performance of BO in many active learning settings. The specific application here was to find the optimal way to grasp an object (using the Simox simulator).
With the work developed during this dissertation, we have also contributed a paper to the IROS 2016 international conference [35], and we worked with Ruben Martinez-Cantin to augment his Bayesian optimization C++ toolbox with BO-TGP and UBO implementations (https://github.com/josemscnogueira/bayesopt/tree/tgp_cuei).
Results showed that both implementations outperform classic BO on the synthetic problems. BO-TGP achieves better results when the objective function presents multiple local optima and varying smoothness, while UBO finds much safer optima in the presence of input noise.
These results remained true for the Simox simulated experiments. However, we saw that observational noise greatly affects BO-TGP's performance. We verified that the Simox simulation environment indeed exhibits this type of noise, which explains why the results did not match expectations based on the synthetic ones. In the future, it would be interesting to include some form of high-frequency noise reduction in the Simox grasping environment (such as a soft-touch implementation).
Due to a lack of consistent results, we could not conclude that higher degrees of independence between TGP leaves' data are always beneficial. Despite this, our synthetic results (and some of the Simox experiments) suggest that lower data correlation between TGP leaves leads to the selection of higher-valued optima when the objective function presents varying smoothness characteristics.
The UBO results demonstrated that plain BO may lead to inconsistent final optima when input noise is present. We have shown that UBO, by explicitly considering input noise, leads to better average optima than BO.
We presented these two methods with the main purpose of performing robust and safe grasping of unknown objects through haptic exploration. The potential interest of both methods goes beyond grasping, or even robotics. Bayesian optimization is currently used in many fields: engineering, computer science, economics, simulation, experimental design, biology and artificial intelligence. In all of them, there are many situations where input noise, heteroscedasticity or uncertainty arise, and in which safe optimization is therefore fundamental.
Although this dissertation's work has reached an end, we would like to propose several objectives that would continue this line of research and address the problems left to solve:
• We have shown in 4.4.3.A that Simox's grasp metric has very high signal variance. This happens because the Simox simulator uses only one contact point per finger node. Bayesian optimization would benefit greatly if soft touch were implemented (more contact points per surface). It would also be worthwhile to explore better alternatives to Simox.
• Research on TGPs has yet to be completed. Tree construction uses a recursive algorithm which does not guarantee that the final tree minimizes overall node uncertainty (eq. 3.8). It also uses a brute-force search to select splitting points, which scales poorly with the number of samples and the input space's dimensionality. There is also no rule for selecting the minimum number of samples per leaf, nor for how this choice changes the learning process.
• TGP results should be compared with other heteroscedastic models, specifically ones that model the objective function and the hyper-parameters jointly.
• The UEI could be extended to use information about the sigma points' variance instead of only their mean values. It would also be worthwhile to explore other learning criteria that consider input noise and compare them with the UEI.
Bibliography
[1] J. Mockus. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 4(4):347–365, June 1994.
[2] D. Jones, M. Schonlau, and W. Welch. Efficient global optimization of expensive black-box func-
tions. Journal of Global Optimization, 13(4):455–492, December 1998.
[3] E. Brochu, V. Cora, and N. Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical report, December 2010. arXiv:1012.2599.
[4] D. Luo, Y. Wang, and X. Wu. Active online learning of the bipedal walking. Technical report,
Peking University, 2011.
[5] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. Technical report, June 2010. arXiv:0912.3995v4.
[6] Z. Wang, M. Zoghi, N. Freitas, D. Matheson, and F. Hutter. Bayesian optimization in a billion dimensions via random embeddings. Technical report, January 2013. arXiv:1301.1942v1.
[7] E. Brochu, T. Brochu, and N. Freitas. A Bayesian interactive optimization approach to procedural animation design. Technical report, University of British Columbia, 2010.
[8] R. Martinez-Cantin, N. Freitas, and J. Castellanos. Analysis of particle methods for simultaneous robot localization and mapping and a new algorithm: Marginal-SLAM. In 2007 IEEE International Conference on Robotics and Automation, pages 2415–2420, April 2007.
[9] F. Veiga. Robotic grasp optimization from contact force analysis. Master's dissertation in electrical and computer engineering, Instituto Superior Tecnico, April 2012.
[10] S. Dragiev, M. Toussaint, and M. Gienger. Uncertainty aware grasping and tactile exploration. In 2013 IEEE International Conference on Robotics and Automation (ICRA), pages 113–119, May 2013.
[11] L. Montesano and M. Lopes. Active learning of visual descriptors for grasping using non-
parametric smoothed beta distributions. Robotics and Autonomous Systems Journal, August
2011.
[12] Ruben Martinez-Cantin. BayesOpt: A Bayesian optimization library for nonlinear optimization,
experimental design and bandits. Journal of Machine Learning Research, 15:3735–3739, 2014.
[13] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press,
2006. ISBN 026218253X.
[14] H. Kushner. A new method of locating the maximum of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, March 1964.
[15] J. Mockus, V. Tiesis, and A. Zilinskas. Toward Global Optimization. Dixon, 2nd edition, 1978.
Chapter: The Application of Bayesian Methods for Seeking the Extremum.
[16] D. Finkel. DIRECT optimization algorithm user guide. Technical report, Center for Research in Scientific Computation, North Carolina State University, Raleigh, NC 27695-8205, March 2003.
[17] V. Vahrenkamp. Simox - a robotics toolbox for simulation, motion and grasp planning. http:
//simox.sourceforge.net/, 2013. Accessed: 2015-05-24.
[18] M. Henriques. Controlo e planeamento de maos roboticas antropomorficas utilizando sinergias (Control and planning of anthropomorphic robotic hands using synergies). Master's dissertation in mathematics and applications, Instituto Superior Tecnico, July 2013.
[19] M. Tesch, J. Schneider, and H. Choset. Expensive function optimization with stochastic binary
outcomes. Technical report, Robotics Institute, Carnegie Mellon University, 2013.
[20] A. McHutchon and C. Rasmussen. Gaussian process training with input noise. Technical report, Cambridge University, 2011.
[21] Q. Le, A. Smola, and S. Canu. Heteroscedastic Gaussian process regression. Technical report, Australian National University, 0200 ACT, Australia, 2005.
[22] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. Most likely heteroscedastic Gaussian process regression. Technical report, University of Freiburg, 2007.
[23] S. Kuindersma, R. Grupen, and A. Barto. Variational Bayesian optimization for runtime risk-sensitive control. Technical report, University of Massachusetts Amherst, 2012.
[24] B. Damas and J. Santos-Victor. Online learning of single and multi-valued functions with an
infinite mixture of linear experts. Neural Computation, 25(11), November 2013.
[25] Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical Bayesian optimization of machine
learning algorithms. In NIPS, pages 2960–2968, 2012.
[26] J. Assael, Z. Wang, B. Shahriari, and N. Freitas. Heteroscedastic treed Bayesian optimisation. Technical report, March 2015. arXiv:1410.7172v2.
[27] E. Wan and R. Van Der Merwe. The unscented kalman filter for nonlinear estimation. In Proceed-
ings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control
Symposium (Cat. No.00EX373). Institute of Electrical & Electronics Engineers (IEEE), 2000. doi:
10.1109/asspcc.2000.882463. URL http://dx.doi.org/10.1109/ASSPCC.2000.882463.
[28] S. Julier and J. Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the
IEEE, 92(3):401–422, March 2004. doi: 10.1109/jproc.2003.823141. URL http://dx.doi.org/
10.1109/JPROC.2003.823141.
[29] R. van der Merwe. Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-
Space Models. PhD thesis, OGI School of Science & Engineering, Oregon Health & Science
University, April 2004.
[30] K. Ito and K. Xiong. Gaussian filters for nonlinear filtering problems. IEEE Transactions on
Automatic Control, 45(5):910–927, 2000.
[31] M. Nørgaard, N.K. Poulsen, and O. Ravn. New developments in state estimation for nonlinear
systems. Automatica, 36(11):1627–1638, November 2000.
[32] S. Julier and J.K. Uhlmann. The scaled unscented transformation. In IEEE American Control
Conf., pages 4555–4559, Anchorage AK, USA, 8–10 May 2002.
[33] Z. Wang, J. Assael, and N. Freitas. RKHS 1D function for Bayesian optimization tasks. https://github.com/iassael/function-rkhs, 2014. Accessed: 2015-10-03.
[34] R. Gramacy and H. Lee. Bayesian treed Gaussian process models. Technical report, University of California, Santa Cruz, 2006. arXiv:0710.4536.
[35] Jose Nogueira, Ruben Martinez-Cantin, Alexandre Bernardino, and Lorenzo Jamone. Unscented Bayesian optimization for safe robot grasping. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, October 2016. doi: 10.1109/iros.2016.7759310. URL https://doi.org/10.1109/iros.2016.7759310.