Attribution-NonCommercial-NoDerivs 2.0 Korea

You are free to copy, distribute, transmit, display, and perform this work, provided you follow the conditions below:

When you reuse or distribute this work, you must clearly indicate the license terms applied to it. These conditions may be waived with permission from the copyright holder. Your rights under copyright law are not affected by the above. This is a human-readable summary of the Legal Code.

Disclaimer

Attribution. You must attribute the work to the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.
Master Thesis
Adaptive Bayesian Optimization
for Organic Material Screening
유기소재 스크리닝을 위한 적응적 베이지안 최적화
February 2016
Seoul National University
The Graduate School
Interdisciplinary Program in Neuroscience
Sangwoong Yoon
Adaptive Bayesian Optimization
for Organic Material Screening
유기소재 스크리닝을 위한 적응적 베이지안 최적화
Advisor: Byoung-Tak Zhang

Submitted as a thesis for the degree of Master of Science
December 2015

Seoul National University, The Graduate School
Interdisciplinary Program in Neuroscience
Sangwoong Yoon

Approving the Master of Science thesis of Sangwoong Yoon
December 2015

Chair: Gunhee Kim (Seal)
Vice Chair: Byoung-Tak Zhang (Seal)
Member: Sang-Hun Lee (Seal)
Abstract
Adaptive Bayesian Optimization for Organic Material Screening
Sangwoong Yoon
Interdisciplinary Program in Neuroscience
The Graduate School
Seoul National University
Bayesian optimization (BO) is an efficient black-box optimization method which
utilizes the power of statistical models built upon previously searched points.
The efficacy of BO largely depends on the choice of the statistical model, but
it is usually difficult to determine beforehand which model would yield the
best optimization performance for a given task. This thesis investigates a modified problem setting for BO in which multiple candidate surrogate functions are available, and evaluates two novel strategies based on multi-armed bandit algorithms. The proposed strategies attempt to discriminate among the candidate models, and are therefore referred to as adaptive BOs. The strategies are tested on optimization test-bed functions and on a chemical screening scheduling problem, where the issue of selecting a surrogate function becomes particularly salient. Surprisingly, it is discovered that the baseline strategy, which blends the candidate functions uniformly at random, achieves non-trivial performance. The results presented in this thesis show that relaxing BO to allow multiple surrogate functions yields interesting dynamics.
Keywords: Bayesian Optimization, Multi-Armed Bandit, Gaussian Process,
Chemoinformatics
Student Number: 2014-21320
Contents
Abstract i
Contents iv
List of Figures v
Chapter 1 Introduction 1
Chapter 2 Preliminaries 6
2.1 Bayesian Optimization . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Multi-Armed Bandit . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 3 Organic Material Screening: The Motivational Application 10
3.1 Beyond Structure-Property Relationship . . . . . . . . . . . . . 10
3.2 Dataset: Electronic Properties of Organic Molecules . . . . . . 11
Chapter 4 Bayesian Optimization with Multiple Surrogate Functions 13
4.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 Baseline: Random Arm . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Proposed Strategies . . . . . . . . . . . . . . . . . . . . . . . . 14
Chapter 5 Experiments 17
5.1 Benchmark Functions . . . . . . . . . . . . . . . . . . . . . . . 18
5.2 Screening over Organic Molecules . . . . . . . . . . . . . . . . . 18
Chapter 6 Discussion and Conclusion 23
6.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Bibliography 26
국문초록 30
List of Figures
Figure 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Figure 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Figure 5.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Figure 5.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Figure 5.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Figure 5.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 1
Introduction
Bayesian optimization (BO) is a powerful model-based optimization technique
which can handle difficult optimization problems where the gradient is unavail-
able and function evaluation is expensive. BO builds a statistical model, or a
surrogate function, upon previously acquired data points, generalizing their
information to unexplored areas of the data space. Then, BO calls the acquisition function to measure the acquisition priority of unseen data points. A point with the highest acquisition function value is chosen as the next search point, and its function value is queried from the (possibly noisy) oracle, augmenting the dataset. The whole task can be stated formally as follows.
Given {g_i(x)}_{i=1}^K, a(x), and D,
find x∗ = arg max_{x_i∈D} f(x_i),
where the g_i(x) are the prespecified candidate surrogate functions, and a(x) is the prespecified acquisition function. D, the domain of f, is either a continuous vector space or a set of finite points. The detailed procedure of BO is depicted in Algorithm 1.
In recent years, BO has attracted a great amount of interest from the ma-
chine learning community, and there have been advancements in terms of its
theory and application. Some notable theoretical achievements are the proven
bounds of popular heuristics [1, 2], information-theoretic acquisition functions
[3], and strategies for conditional input spaces [4]. Also, a great deal of effort has been devoted to BO in high-dimensional spaces [5]. Numerous suggested applications have demonstrated the effectiveness of BO. Such examples include
Algorithm 1 Bayesian Optimization
Input: g(x): the surrogate function,
 a(x): the acquisition function,
 f(x): the underlying function to be optimized,
 D: the function domain
Output: f(x+): the incumbent best function value
1: T ← {(x_{i1}, f(x_{i1}))}  ▷ Initialization
2: while termination condition do
3:   Update g(x) with T
4:   x_new ← arg max_{x′∈D} a(x′)
5:   Evaluate f(x_new)
6:   T ← T ∪ {(x_new, f(x_new))}
7: end while
adaptive robot gait control [6], reinforcement learning [7], and adaptive Monte Carlo [8]. One area that has attracted particular attention is automated machine learning, especially the tuning of hyperparameters in deep neural networks [9, 10]. A more thorough review of the progress and applications of BO can be found in recent surveys [7, 11].
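To make the loop concrete, Algorithm 1 can be sketched in a few lines of code. The sketch below runs over a finite domain; the 1-nearest-neighbour surrogate and the UCB-style acquisition are toy stand-ins chosen to keep the example self-contained, not the GP machinery described in Chapter 2:

```python
import numpy as np

def fit_nn_surrogate(X, y):
    """Toy stand-in for the surrogate g(x): 1-nearest-neighbour mean,
    with predictive uncertainty growing with distance to the data."""
    def predict(Xq):
        d = np.abs(Xq[:, None] - X[None, :])   # pairwise |x_q - x_i|
        mu = y[np.argmin(d, axis=1)]           # mean: value at the nearest point
        sigma = d.min(axis=1)                  # uncertainty: distance to the data
        return mu, sigma
    return predict

def ucb_acquisition(model, domain, beta=1.0):
    """UCB-style acquisition a(x) = mu(x) + beta * sigma(x)."""
    mu, sigma = model(domain)
    return mu + beta * sigma

def bayesian_optimization(f, domain, n_iter=20, seed=0):
    """Algorithm 1 over a finite domain: fit g, maximize a, query f, repeat."""
    rng = np.random.default_rng(seed)
    X = [domain[rng.integers(len(domain))]]    # initial search point
    y = [f(X[0])]
    for _ in range(n_iter):
        model = fit_nn_surrogate(np.array(X), np.array(y))   # update g with T
        x_new = domain[int(np.argmax(ucb_acquisition(model, domain)))]
        X.append(x_new)
        y.append(f(x_new))                     # augment T with (x_new, f(x_new))
    return max(y)                              # incumbent best f(x+)

domain = np.linspace(0.0, 1.0, 201)
best = bayesian_optimization(lambda x: -(x - 0.3) ** 2, domain)
```

Even with this crude surrogate, the search concentrates around the maximizer after a handful of evaluations, which is the behaviour a well-matched surrogate makes dramatically more efficient.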
In addition to its effectiveness, Bayesian optimization is an intriguing re-
search area, because in BO, modeling from data and decision making under
uncertainty are intertwined. BO, thus, possesses interesting relationships with
various neighboring machine learning fields. BO borrows all the tricks of statis-
tical modeling from supervised learning, in particular Gaussian Process (GP).
From the decision making perspective, BO can be viewed as a special kind
of bandit problem, which also faces the exploration-exploitation dilemma. Furthermore, BO and active learning share a similar iterative scheme in which each data point is labeled one at a time, although the objectives, and therefore the criteria for data selection, are different.
Although it has never been proven explicitly, the effectiveness of Bayesian optimization likely arises from the generalization ability of its surrogate function g(x). BO exploits information propagated from the previous search points to guide its search. The prediction from g(x) may not be perfect, yet it provides useful information for scheduling a more efficient search. The generalization power depends on the choice of the surrogate function, whose inductive bias may or may not align with that of the unknown underlying function. In fact, the choice of surrogate function in BO significantly affects the optimization performance, as empirically shown in Figure 1.1. Hence there is a need to discriminate among the available surrogate functions. In other words, Bayesian optimization has to be performed adaptively in order to hedge the risk of selecting a poorly performing surrogate function.
Model selection in supervised learning is usually performed by cross validation, but this is not directly applicable to BO. In BO scenarios, the amount of data is often too small to split into training and validation sets. Moreover, the data points are not independently sampled, which can cause a highly biased estimate of generalization performance (a similar problem has been reported in the active learning literature [12]). Therefore, a model selection method tailor-made for Bayesian optimization is needed.
It is most reasonable for the model selection in Bayesian optimization to
be done online. In other words, it should not require more than a single-pass
run of Bayesian optimization. Since each function evaluation is assumed to be prohibitively costly, repeating multiple runs of BO, each of which involves a number of function evaluations, is infeasible. The same assumption, however, justifies extra computation for model selection in between function evaluations.
From these motivations, this thesis proposes adaptive Bayesian optimization
strategies which can select the best performing surrogate function among a set
of putative surrogate functions. Identifying the best model and exploiting it can
naturally be formulated into a multi-armed bandit (MAB) problem, where each
surrogate function is a bandit arm. In this setting, on every round, one surrogate
function is probabilistically selected and the next queried data is decided by the
recommendation of the selected surrogate function. The proposed strategies, as well as a baseline strategy which turned out to perform unexpectedly well, are tested on global optimization benchmarks and the organic material screening task.
Another contribution of this thesis is to demonstrate a new application of
Bayesian optimization which explicitly requires the model selection. In this
thesis, BO is applied to schedule the computational screening over candidate
molecules to find molecules with the desired electronic property in a minimum
number of searches. In this task, a set of candidate molecules is given, and
the electronic property, e.g. atomization energy, or energy levels of Highest Oc-
cupied Molecular Orbital (HOMO) and Lowest Unoccupied Molecular Orbital
(LUMO), is evaluated via quantum mechanical simulation one molecule at a time. First, the functional form relating the molecular property to the molecular structure is not known in advance; fixing a type of surrogate function before the actual BO run can therefore be dangerous. Second, the quantum mechanical calculation of chemical properties is notorious for its computational burden. Even though various factors affect the computational cost, a simulation can generally take several hours to a few days [13, 14], preventing repeated runs for model selection.
There have been related works that share the idea of meta-decision making.
One class of research has explored the online selection or blending of data acquisition policies. [15] used the Hedge algorithm to adaptively select among acquisition functions, and [16, 12] used a contextual bandit algorithm to blend among active learning strategies. In active learning, the data acquisition policy is of prime importance because the model to be trained is specified beforehand. In BO, on the contrary, the differences between acquisition functions are not as clear-cut as in active learning. For example, both the Expected Improvement (EI) and Upper Confidence Bound (UCB) criteria are known to work well in practice, and it is not very clear exactly when one outperforms the other.
Despite this, it is reasonable to dynamically switch from an exploratory strategy
to an exploitative strategy as BO proceeds.
On the other hand, [17] focused on the very problem of dynamic kernel
selection, which this thesis also concentrates on. It clearly showed that the per-
formance of BO strongly depends on the choice of the kernel function. However,
their proposed strategies lack justifications such as a bandit formulation, and
the reported performances were only modest.
Figure 1.1 The performance of Bayesian optimization depends on the choice of the surrogate function, and also on the characteristics of the underlying function. (Regret versus number of data points on Hartmann-6D and Rastrigin 2D, for the kernels rbfard, rbf, poly3, linear, mat32ard, mat32, and expo.)
Chapter 2
Preliminaries
2.1 Bayesian Optimization
As depicted in Algorithm 1, in Bayesian optimization the underlying function (or objective function) f(x) is optimized using a model (or surrogate function) g(x) built upon known pairs of inputs x_i and their function values f(x_i). The acquisition function a(x) decides the next search point using information from the surrogate function. The surrogate function and the acquisition function are the key components of BO.
Surrogate Functions: Gaussian Processes
Gaussian Processes in machine learning are defined by a mean function and a covariance function: over any given set of data points in the domain, the function values follow a multivariate Gaussian distribution whose mean and covariance are determined by the mean function m(x) and the covariance function k(x, x′):
f_{1:N} ∼ N(m_{1:N}, K)

where f_{1:N} ≡ (f(x_1), ..., f(x_N))^⊤, m_{1:N} ≡ (m(x_1), ..., m(x_N))^⊤, and K_{ij} = k(x_i, x_j). The mean function is frequently set to the zero function, and the covariance function is set to a positive semi-definite kernel function.
GP can also be viewed as kernelized Bayesian linear regression [18], and therefore enjoys advantages from both kernel methods and the Bayesian treatment. GP is flexible due to the nonlinearity of the kernel function, yet robust due to the Bayesian treatment. Moreover, as a kernel-based algorithm, it can incorporate prior knowledge through the choice of kernel.
Inference on the predictive mean and variance at a test point x_∗ is done by conditioning the joint Gaussian density defined above:

E[f_∗ | f_{1:N}] = m(x_∗) + k_∗^⊤ K^{−1} (f_{1:N} − m_{1:N})
Var[f_∗ | f_{1:N}] = k(x_∗, x_∗) − k_∗^⊤ K^{−1} k_∗

where k_∗ = (k(x_1, x_∗), ..., k(x_N, x_∗))^⊤. It is highly desirable that the predictive
variance is analytically obtainable, since the value is essential in calculating
acquisition function values during BO.
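As an illustration, the predictive equations above can be implemented directly. This is a minimal sketch assuming a zero mean function and a squared-exponential kernel with unit variance; the small diagonal noise term is an added assumption for numerical stability:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential covariance k(x, x') with unit variance."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / lengthscale ** 2)

def gp_posterior(X, y, Xstar, noise=1e-6):
    """Predictive mean and variance of a zero-mean GP, following the
    conditioning equations above (m(x) = 0, so f_{1:N} - m_{1:N} = y)."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))   # K plus jitter for stability
    Ks = rbf_kernel(X, Xstar)                       # k_* stacked column-wise
    mean = Ks.T @ np.linalg.solve(K, y)             # k_*^T K^{-1} (f - m)
    # k(x_*, x_*) - k_*^T K^{-1} k_*, one value per test point
    var = 1.0 - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))
    return mean, var

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
mean, var = gp_posterior(X, y, np.array([0.0, 1.0, 2.0, 10.0]))
```

At the training inputs the posterior mean interpolates the data and the variance collapses toward zero, while far from the data the variance reverts to the prior value of 1, which is exactly the behaviour the acquisition function exploits.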
The main drawback of GP is its limited scalability. Since a matrix inversion is involved, GP scales cubically with the number of training data points. It is typically estimated that GP is not applicable to more than roughly 10,000 training data points unless special techniques are applied. However, because the cost of data acquisition is assumed to be high, the dataset size usually stays in the feasible regime for GP. Even so, work such as [19] has tried to improve the scalability of BO by replacing the GP with a neural network.
Acquisition Functions
An acquisition function in Bayesian Optimization decides which point in the
domain to explore next based on the current prediction of the surrogate function, and therefore plays a central role. To perform a successful optimization, the acquisition function needs to balance exploration (selecting a point whose predictive uncertainty is high) against exploitation (selecting a point whose predictive mean is high). An optimal balance might be computed by a dynamic programming-like procedure, but such a method is highly unlikely to be tractable. Instead, a few heuristic criteria, for example Expected Improvement (EI) and Gaussian Process Upper Confidence Bound (GP-UCB), are popularly used. The effectiveness of these criteria was first demonstrated empirically, and later proven theoretically ([2] for EI, and [1] for GP-UCB).
The Expected Improvement (EI) is a decision-theoretic criterion which selects the point with the highest expected utility. Here, utility is defined as the improvement over the best value found so far. Formally, the improvement I at a point x is

I(y, y_∗ | x) = y − y_∗ if y > y_∗, and 0 otherwise,

where y_∗ is the incumbent optimum. Its expectation can be calculated exactly by exploiting the Gaussianity of the predictive distribution. The EI acquisition function, a_EI(x) = E_y[I(y, y_∗ | x)], is therefore

a_EI(x) = ∫_{y_∗}^{∞} (y − y_∗) p(y | x) dy = (µ(x) − y_∗) Φ((µ(x) − y_∗)/σ(x)) + σ(x) φ((µ(x) − y_∗)/σ(x))
)where µ(x) and σ(x) are the predictive mean and standard deviation at x,
and φ(·) and Φ(·) are the probability density function and cumulative distribution function of the standard normal distribution. The equations are obtained assuming maximization, and can be adapted straightforwardly to the minimization setting.
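The closed form above translates directly into code. The following sketch evaluates EI at a single point given its predictive mean and standard deviation, computing Φ and φ via math.erf; when σ = 0 it falls back to the plain, clipped improvement:

```python
import math

def expected_improvement(mu, sigma, y_best):
    """a_EI(x) for a maximization problem, from the closed form above.

    mu, sigma -- predictive mean and standard deviation at x
    y_best    -- the incumbent optimum y_*
    """
    if sigma <= 0.0:
        # No predictive uncertainty: the expected improvement reduces to
        # the plain improvement, clipped at zero.
        return max(mu - y_best, 0.0)
    z = (mu - y_best) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (mu - y_best) * Phi + sigma * phi
```

Note that EI is always non-negative and increases with both the predictive mean and the predictive uncertainty, which is how it trades off exploitation against exploration.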
The Gaussian Process Upper Confidence Bound (GP-UCB) acquisition function is motivated by upper confidence bound based strategies for the multi-armed bandit problem, for instance [20]. In GP-UCB, the point with the highest (or, in a minimization setting, the lowest) confidence bound value is selected. In BO, the confidence bound can be obtained from the predictive distribution of the surrogate function:

a_GP-UCB(x) = µ(x) + β σ(x)

where β governs the exploration-exploitation tradeoff. β is often scheduled to decrease in order to shift from exploration to exploitation as BO proceeds.
As choosing the value of β or setting its decay schedule is non-trivial, the EI acquisition function, which has no free parameters, is used throughout the experiments in this thesis.
2.2 Multi-Armed Bandit
The Multi-Armed Bandit (MAB) is the simplest problem that exhibits the exploration-exploitation tradeoff and learning from evaluative feedback. In the original MAB, there are K slot machines with different reward levels, and the player plays one machine at a time. Then the reward corresponding to the played machine is given as feedback. Note that the reward from a single machine may differ from trial to trial, since the machine is for gambling. The objective of the game is to maximize the expected cumulative sum of rewards. The objective is often stated in terms of regret, the difference between the rewards received from the selected actions and those of the optimal actions that could have been chosen, in which case it should be minimized. In order to achieve the objective, the player must carefully balance exploitation (continuing to play the best rewarding machine) and exploration (trying unplayed machines in case there is a better machine among them). Due to its simple yet rich structure, extensive research has been done on the problem, and numerous variants have been proposed, although they are not the main focus of this thesis.
Besides the modern variants of the Multi-Armed Bandit, probably the most straightforward categorization of bandit problems is made according to whether or not the reward structure has a fixed statistical form. In the former, called the stochastic MAB, rewards are independently generated from a fixed distribution. The latter is called the adversarial MAB, and no particular structure can be assumed on its rewards other than the fact that they are bounded. The adversarial MAB is therefore a harder problem, which demands that the player constantly explore alternative arms. EXP3, a classical solution for the adversarial MAB, was presented in [21]. In the EXP3 strategy, the probability distribution over actions is mainly determined by the history of received rewards, but is additionally mixed with a uniform distribution, enforcing constant exploration. Another strategy, HEDGE [22], unlike EXP3, assumes that feedback from all arms is available.
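As a concrete reference, a minimal EXP3 implementation might look as follows. The sketch assumes rewards have already been scaled to [0, 1], and the class and parameter names are illustrative:

```python
import math
import random

class EXP3:
    """EXP3 for the adversarial multi-armed bandit: exponentially weighted
    arms mixed with uniform exploration at rate gamma.
    Rewards are assumed to be scaled to [0, 1]."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.K = n_arms
        self.gamma = gamma
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.K
                for w in self.weights]

    def select(self):
        return self.rng.choices(range(self.K), weights=self.probabilities())[0]

    def update(self, arm, reward):
        # Importance-weighted reward estimate: dividing by the selection
        # probability keeps the estimate unbiased for every arm.
        xhat = reward / self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * xhat / self.K)

bandit = EXP3(n_arms=2)
for _ in range(500):
    arm = bandit.select()
    bandit.update(arm, 1.0 if arm == 1 else 0.0)  # arm 1 is always better here
```

Because of the uniform mixing term, the selection probability of every arm stays at least γ/K, so the player never stops exploring, which is exactly the property the adversarial setting demands.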
Chapter 3
Organic Material Screening: The Motivational Application
3.1 Beyond Structure-Property Relationship
Chemical research has strong motivation for using machine learning techniques.
Many of its investigatory tools, such as wet experiments and quantum mechanics simulations, require a non-trivial amount of resources. In particular, finding a novel molecule with a desired property is one of the most arduous jobs in practice, because it usually involves experimental or computational screening over a candidate molecule set whose size often scales up to hundreds or thousands.
Machine learning researchers have contributed to alleviating the burden of the process of novel molecule discovery. The most popular approach is to replace experiments or simulations with a machine learning algorithm; in chemoinformatics this is termed quantitative structure-property/activity relationship (QSPR/QSAR) analysis. The approach aims to build a statistical model that can make predictions on molecular properties given the molecular structure. This endeavor has a long history and has advanced along with the improvement of artificial intelligence techniques. Algorithms including decision trees [23], multilayer perceptrons [24], and support vector machines [25, 26] have been applied to QSPR and have continuously made progress in terms of prediction accuracy. Deep
neural networks have also been applied to chemoinformatics, thanks to their
booming popularity. In [13], a multi-task deep neural network was trained to
predict electronic properties of molecules that are usually calculated by time-
consuming quantum mechanics simulations. Another multi-task deep neural
network trained in [27] predicts chemical/biological activities from molecular
descriptors and won the first prize in the Merck molecular activity competi-
tion.1
QSPR/QSAR approaches, however, have crucial limitations. First of all,
the model's predictions become reliable only after training on a large amount of data gathered from expensive experiments. For example, the neural network in [13] needed more than 5,000 quantum simulation results to achieve the reported performance. It would be preferable for an algorithm to function with a smaller amount of data. Furthermore, since the diversity of chemicals is massive, transferability between datasets can be very low. For instance, a model trained on candidate molecules for organic solar panels may perform poorly on drug candidate molecules, because the two sets of chemicals may have different characteristics. Moreover, even with recent improvements in performance, the prediction is still not error-free. It is risky to make decisions solely based on the model's prediction.
Computer-guided screening may be a more plausible scenario. It is hard for machine learning algorithms to completely replace conventional chemical investigation apparatuses, but machine learning algorithms can be used to make the sequence of chemical investigations more efficient. Such approaches are called (optimal) experimental design [28], and Bayesian optimization can be viewed as one of them. By recommending next search points which are most likely to have optimal properties, BO can guide the search over the chemical compound space.
3.2 Dataset: Electronic Properties of Organic Molecules
Among the diverse molecular properties of interest, this thesis demonstrates the utility of Bayesian optimization in the chemistry domain on the task of screening for molecules with a desired electronic property, for two reasons. First, molecular electronic properties have huge industrial importance, with applications such as photovoltaic cells and light-emitting devices. Second, building a unified automated system is more feasible. Unlike biochemical properties,
1 https://www.kaggle.com/c/MerckActivity
which are typically evaluated by wet experiments, electronic properties can be calculated in silico by quantum mechanics simulations. As the search guidance (Bayesian optimization) and the function evaluation (quantum simulation) can be performed on a single machine or cluster, the whole loop can be closed and automated.
Recently, there have been advancements in modeling the electronic properties of organic molecules, and a rich dataset has been made public. QM7b [13], the released dataset, contains 7,211 organic molecules consisting of up to 23 atoms of C, H, O, N, S, and Cl, with 14 property values per molecule. The list of properties includes the Highest Occupied Molecular Orbital (HOMO) energy level, the Lowest Unoccupied Molecular Orbital (LUMO) energy level, polarizability, and a few others. Some properties are duplicated, since they are estimated by different quantum simulation methods. Among the properties, we focus on the band gap (LUMO energy level − HOMO energy level), because it determines which wavelengths of light the molecule will mostly interact with. This is of crucial importance in applications where interaction with light is involved.
Chapter 4
Bayesian Optimization with Multiple Surrogate Functions
4.1 Problem Formulation
In this thesis, the typical setting of Bayesian optimization is modified to yield a novel BO setting which explicitly accounts for the effect of surrogate functions on the progress of optimization. In the proposed formulation, a set of candidate surrogate functions {g_i(x)} is given before the BO starts, yet it is unknown which surrogate function (or which blending strategy) would provide the best optimization performance. Assuming multiple possible choices of surrogate function imposes an additional layer of decision making on top of the conventional BO. The whole problem is still an optimization, and therefore the goal is to find the optimal point x∗ with the least number of function evaluations. The modified problem can be stated formally as follows; if the objective is minimization, arg max is replaced with arg min.
Given {g_i(x)}_{i=1}^K, a(x), and D,
find x∗ = arg max_{x_i∈D} f(x_i),
where gi(x)’s are the prespecified candidate surrogate functions, and a(x) is
the prespecified acquisition function. D, the domain of f , is either a continuous
vector space or a set of finite points.
To tackle the problem of BO with multiple candidate surrogate functions,
the multi-armed bandit framework is adopted in this thesis. Each candidate surrogate function is viewed as an arm in the MAB setting, and the queried value, i.e., the y value of the newly acquired data point, or a value derived from it, is considered the reward. Note that it is also possible to take approaches which have no connection to MAB, and [17] pursued such a direction. The following sections describe the approaches investigated in this thesis.
4.2 Baseline: Random Arm
One of the most naive approaches to taking advantage of information from multiple surrogate functions is to select the surrogate function to use uniformly at random on each round. By analogy to the MAB setting, this strategy will be referred to as RandomArm in the rest of the thesis.
4.3 Proposed Strategies
Two multi-armed bandit-based strategies are proposed in this thesis. Both strategies continuously evaluate their arms (their candidate surrogate functions) as Bayesian optimization proceeds, but they differ in which information they use for the evaluation. The simpler one, referred to as the partial information strategy or the EXP3-based adaptive Bayesian optimization, feeds the newly acquired function value as the reward to the EXP3 algorithm, and EXP3 updates the probabilities assigned to each surrogate function. Despite its clarity, the EXP3-based strategy is expected to suffer from poor scalability with the number of arms. In order to remedy this problem, the full information strategy, or the HEDGE-based adaptive Bayesian optimization, is also devised. It calculates feedback for all surrogate functions from the posterior mean value of each surrogate function. The HEDGE-based adaptive Bayesian optimization is largely inspired by [15]. The strategies are described in detail in Algorithm 2 and Algorithm 3.
Algorithm 2 EXP3-based Adaptive Bayesian Optimization
Input: {g_i(x)}_{i=1}^K: the set of candidate surrogate functions,
 a(x; g_{i_t}): the acquisition function calculated based on g_{i_t},
 f(x): the underlying function to be optimized,
 D: the function domain,
 γ ∈ (0, 1]
Output: f(x+): the incumbent best function value
1: T ← {(x_{i1}, f(x_{i1}))}  ▷ The initial search point
2: For i = 1, ..., K, c_i(1) = 0  ▷ Initialization of cumulative rewards
3: t = 1
4: while termination condition do
5:   Update {g_i(x)}_{i=1}^K with T
6:   For i = 1, ..., K, set w_i(t) = exp(γ c_i(t)/K)
7:   For i = 1, ..., K, set p_i(t) = (1 − γ) w_i(t) / Σ_{j=1}^K w_j(t) + γ/K
8:   Select i_t ∈ {1, ..., K} probabilistically according to p_1(t), ..., p_K(t)
9:   x_new ← arg max_{x′∈D} a(x′; g_{i_t})
10:  Evaluate f(x_new)
11:  Receive reward r_{i_t}(t) = f(x_new)  ▷ can be scaled to [0, 1]
12:  For j = 1, ..., K, set r̂_j(t) = r_{i_t}(t)/p_{i_t}(t) if j = i_t, and 0 otherwise
13:  Update c_j(t + 1) = c_j(t) + r̂_j(t) for j = 1, ..., K
14:  T ← T ∪ {(x_new, f(x_new))}
15:  t = t + 1
16: end while
Algorithm 3 HEDGE-based Adaptive Bayesian Optimization
Input: {g_i(x)}_{i=1}^K: the set of candidate surrogate functions,
 a(x; g_{i_t}): the acquisition function calculated based on g_{i_t},
 f(x): the underlying function to be optimized,
 D: the function domain,
 γ ∈ (0, 1]
Output: f(x+): the incumbent best function value
1: T ← {(x_{i1}, f(x_{i1}))}  ▷ The initial search point
2: For i = 1, ..., K, c_i(1) = 0  ▷ Initialization of cumulative rewards
3: t = 1
4: while termination condition do
5:   Update {g_i(x)}_{i=1}^K with T
6:   For i = 1, ..., K, set w_i(t) = exp(γ c_i(t))
7:   For i = 1, ..., K, set p_i(t) = w_i(t) / Σ_{j=1}^K w_j(t)
8:   Select i_t ∈ {1, ..., K} probabilistically according to p_1(t), ..., p_K(t)
9:   For k = 1, ..., K, x_k ← arg max_{x′∈D} a(x′; g_k); set x_new = x_{i_t}
10:  Evaluate f(x_new)
11:  T ← T ∪ {(x_new, f(x_new))}
12:  For j = 1, ..., K and k = 1, ..., K, evaluate µ_j(x_k)  ▷ The posterior means of the surrogate functions on every suggested point
13:  Receive rewards r_k(t) = Σ_{j=1}^K p_j(t) µ_j(x_k) for k = 1, ..., K
14:  Update c_k(t + 1) = c_k(t) + r_k(t) for k = 1, ..., K
15:  t = t + 1
16: end while
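A single round of the HEDGE-based update can be sketched as follows. The helper below is illustrative rather than the thesis implementation: it assumes the posterior means µ_j(x_k) of every surrogate at every arm's suggested point x_k have already been computed, and returns the arm to query along with the full-information rewards and the updated weights:

```python
import math
import random

def hedge_round(weights, posterior_means, gamma=0.1, rng=None):
    """One round of the HEDGE-based update (a sketch of Algorithm 3).

    weights         -- current weights w_i(t) of the K surrogate arms
    posterior_means -- posterior_means[j][k] = mu_j(x_k), the posterior mean
                       of surrogate j at the point x_k suggested by arm k
                       (assumed precomputed from the fitted surrogates)
    """
    rng = rng or random.Random(0)
    K = len(weights)
    total = sum(weights)
    p = [w / total for w in weights]                          # p_i(t)
    chosen = rng.choices(range(K), weights=p)[0]              # arm i_t to query
    # Full-information rewards: r_k(t) = sum_j p_j(t) * mu_j(x_k)
    rewards = [sum(p[j] * posterior_means[j][k] for j in range(K))
               for k in range(K)]
    # Multiplicative update, equivalent to w_i(t) = exp(gamma * c_i(t))
    new_weights = [w * math.exp(gamma * r) for w, r in zip(weights, rewards)]
    return chosen, rewards, new_weights

# Two arms; both surrogates rate arm 1's suggested point higher.
chosen, rewards, new_w = hedge_round([1.0, 1.0], [[0.1, 0.9], [0.2, 0.8]])
```

Unlike EXP3, every arm receives a reward each round, so the weights of all candidate surrogates are refreshed from a single function evaluation; this is what is meant by the full information strategy.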
Chapter 5
Experiments
The proposed strategies for BO with multiple candidate surrogate functions
are demonstrated on global optimization benchmark functions and the organic
material screening task. In the experiments, Gaussian Processes with different
kernel functions are used as the candidate surrogate functions. Hence the demonstrated task can be viewed as dynamic selection among kernels. However, it should be noted that the proposed problem setting and strategies are not restricted to kernel selection. For example, selection among multiple possible data representations can also be performed in the proposed setting.
The implementation of Gaussian Processes and their kernels is from GPy [29], a Python package for GP. Hyperparameters are handled by optimizing the marginal likelihood with broad log-normal hyperpriors. The Expected Improvement acquisition function is used for Bayesian optimization. All the following experiments assume that the domain is a finite set of elements given in advance. This follows the constraint of the motivating application, the screening of molecules, where each element x in the domain D corresponds to a molecule. To avoid sampling bias in this situation, the dataset D = {x_i} is resampled on every repeated experiment. For the benchmark functions, 1,000 data points are generated randomly in their specified domains, and for the quantum machine dataset, 2,000 molecules are randomly selected from the 7,211 molecules. Note that, in this case, the optimization of the acquisition function can be done by exhaustive search, and therefore heuristic optimization methods such as DiRect [30] are not needed.
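Because the domain is finite, maximizing the acquisition function reduces to a single arg-max over the candidate pool. The sketch below illustrates this with a zero-mean GP and the EI criterion; the kernel, lengthscale, and noise values are illustrative choices, not the GPy configuration used in the thesis:

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel with unit variance (illustrative choice)."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / lengthscale ** 2)

def ei_values(mu, sigma, y_best):
    """Vectorized EI over the pool (sigma == 0 falls back to max(mu - y*, 0))."""
    s = np.maximum(sigma, 1e-12)
    z = (mu - y_best) / s
    Phi = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return np.where(sigma > 0, (mu - y_best) * Phi + sigma * phi,
                    np.maximum(mu - y_best, 0.0))

def next_candidate(X_obs, y_obs, pool, noise=1e-6):
    """Exhaustive arg-max of EI over a finite candidate pool, as in the
    molecule screening setting: no continuous optimizer is required."""
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_obs, pool)
    mu = Ks.T @ np.linalg.solve(K, y_obs)                     # GP posterior mean
    var = 1.0 - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))
    sigma = np.sqrt(np.maximum(var, 0.0))                     # GP posterior std.
    return int(np.argmax(ei_values(mu, sigma, y_obs.max())))

X_obs = np.array([0.0, 1.0])                 # already-evaluated candidates
y_obs = -(X_obs - 0.7) ** 2                  # their observed property values
pool = np.linspace(0.0, 1.0, 101)            # the finite candidate pool
idx = next_candidate(X_obs, y_obs, pool)     # index of the next query
```

Since EI is essentially zero at already-evaluated candidates, the arg-max naturally lands on an unevaluated candidate in a promising, still-uncertain region of the pool.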
Every following figure is from the experiments 100 times repeated with the
subset resampling, unless specified otherwise. On every repetition, the start-
17
0 10 20 30 40 50Data points
0.0
0.5
1.0
1.5
2.0
2.5
Regr
et
Hartmann 6D
mat32linearEXP3HEDGErandom_arm
0 10 20 30 40 50Data points
0
5
10
15
20
25
30
35
40
Regr
et
Rastrigin 2D
rbflinearEXP3HEDGErandom_arm
Figure 5.1 The proposed strategies are tested on the bechmark functions giventwo candidate surrogate functions.
ing point of BO is shared across all strategies or kernels being compared, to
ensure a fair comparison. The performance is measured by the notion of regret
r, which is the difference between the global optimum f(x*) = max_{x∈D} f(x)
and the optimum acquired so far f(x+) = max_{x∈T} f(x), where T is the set of
points evaluated so far. It should be noted that the actual value of the regret
is unknown during BO, since the global optimum value is unknown.
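On a fully tabulated finite domain, the regret defined above can be computed directly; during a real run, only the second term is observable. A minimal sketch (array values and names are illustrative):

```python
import numpy as np

def simple_regret(f_all, evaluated_idx):
    """Regret r = f(x*) - f(x+): gap between the global optimum over the
    finite domain and the best value found among evaluated points. Only
    computable here because the whole table f_all is known."""
    return float(np.max(f_all) - np.max(f_all[evaluated_idx]))

f_all = np.array([1.0, 5.0, 3.0, 4.0])
r = simple_regret(f_all, np.array([0, 2]))   # best found so far is 3.0
```

Plotting r after each evaluation produces the regret curves shown in the figures of this chapter.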
5.1 Benchmark Functions
The proposed strategies are first tested on popular benchmarks for global
optimization. Among the many benchmark functions in [31], the Rastrigin (2D)
and Hartmann (6D) functions are chosen.
It is interesting that the number of candidate surrogate functions and their
combination (for example, two similarly performing surrogate functions, or one
good and one bad function) affect the performance of the strategies. Therefore,
varying conditions are tested, and the results are shown in Figures 5.1 through
5.4.
5.2 Screening over Organic Molecules
As stated in Section 3 of this thesis, Bayesian optimization may be an effective
solution for searching for a molecule with the most desirable electronic property
[Figure: arm-selection probability versus number of data points; both panels titled "Probabilities of Arms: Hartmann 6D, EXP3", one with arms mat32 and linear, the other with arms rbf and linear.]
Figure 5.2 The probability of selecting each surrogate function (averaged over multiple runs) is displayed.
[Figure: left panel, regret versus data points on Hartmann 6D with candidate kernels mat32, rbfard, expo, and linear and strategies EXP3, HEDGE, and random_arm; middle and right panels, arm-selection probabilities under EXP3 and HEDGE respectively.]
Figure 5.3 The performance of the proposed strategies given four candidate surrogate functions, and the averaged probabilities of selecting each candidate surrogate function.
[Figure: left panel, regret versus data points on Hartmann 6D with candidate kernels mat32, rbfard, expo, poly3, mat32ard, and linear and strategies EXP3, HEDGE, and random_arm; middle and right panels, arm-selection probabilities under EXP3 and HEDGE respectively.]
Figure 5.4 The performance of the proposed strategies given six candidate surrogate functions, and the averaged probabilities of selecting each candidate surrogate function.
[Figure: regret versus data points on Hartmann 6D; kernels rbfard, mat32ard, expo, and linear, compared with Randarm.]
Figure 5.5 The RandomArm strategy is applied with four similarly performing kernels, outperforming all of them.
[Figure: regret versus data points for QM7b band gap maximization. Left panel: kernels rbf, poly3, linear, mat32, expo, and a random baseline. Right panel ("3 Arms"): EXP3, HEDGE-MOD, mat32, rbf, expo, and Randarm.]
Figure 5.6 (Left) Bayesian optimization is applied to find a molecule with the maximum HOMO-LUMO bandgap. The choice of the surrogate function affects the regret curve. 'Random' refers to a non-BO search strategy that performs the search in a random order. (Right) The proposed strategies are tested given three popular candidate surrogate functions.
among a set of candidate molecules. When BO is applied to the chemical screen-
ing task, the molecular property of interest is the function value, and the
molecules are elements of the domain. The structure-property relationship is
then the function being optimized, and quantum simulations correspond to
evaluations of the function.
To show the feasibility of the approach, the QM7b dataset is used. Since the
dataset contains the results of quantum simulations, in the following exper-
iments the quantum simulations are not performed again; instead, the property
values are simply queried from the dataset. This saves time by avoiding
repeated execution of the same calculations, and therefore enables repeated
experiments. Among the 14 electronic properties provided in the dataset, the
HOMO-LUMO bandgap calculated from the GW simulation is chosen as the target
electronic property.
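This table-lookup setup, together with the randomly ordered screening baseline that appears in Figure 5.6, can be sketched as follows. The bandgap values below are synthetic stand-ins, not QM7b data: "evaluating" a molecule is a table query rather than a fresh quantum simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for 2,000 precomputed bandgap values; in the
# actual experiments the table holds QM7b GW HOMO-LUMO bandgaps.
bandgap = rng.normal(5.0, 1.5, size=2000)

order = rng.permutation(len(bandgap))          # random screening order
best_so_far = np.maximum.accumulate(bandgap[order])
regret_curve = bandgap.max() - best_so_far     # regret after each query
```

The resulting curve is non-increasing and reaches zero only once the global optimum happens to be queried, which is what makes the random baseline slow compared with BO.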
Standard Bayesian optimization is demonstrated on the QM7b dataset, and the
result is shown in the left panel of Figure 5.6. The figure verifies the effec-
tiveness of BO in the chemical screening task. For most of the surrogate
functions, BO yields lower regret than randomly ordered screening. However,
if the RBF kernel is chosen, it yields roughly the same regret as random
search. Hence, choosing an adequate surrogate function, or at least hedging
the risk of choosing a wrong surrogate function, is important in
this case.
It is intriguing that less popular kernels (the linear kernel and the third-
order polynomial kernel) achieved better performance. To consider a more re-
alistic scenario, the adaptive strategies are tested with the remaining three
kernels. The result is shown in the right panel of Figure 5.6. The proposed
methods did not show notable improvement over the RandomArm strategy. More
surprisingly, however, the regret of RandomArm is similar to or lower than that
of BO with the exponential kernel, which showed the lowest regret among the three.
Chapter 6
Discussion and Conclusion
6.1 Discussion
One of the most counter-intuitive observations is that pulling an inferior arm,
i.e., acquiring data points suggested by a suboptimal surrogate function, does
not always degrade the optimization performance. When a suboptimal function
is only modestly inferior, it often enables the blending strategies (even Ran-
domArm) to outperform the optimal arm. This phenomenon is most clearly
demonstrated in Figure 5.5, and also in the right panel of Figure 5.6. It can
be hypothesized that the blending strategies work like ensemble models in
supervised learning: merging information from multiple models is likely to
improve the prediction. This may also explain the unexpected performance of
RandomArm in Figure 5.1, where the regret curve of RandomArm is notably close
to that of the better surrogate function. Even suboptimal surrogate functions
are able to provide useful information for the optimization.
The approach taken in this thesis assumes that selecting the best-performing
surrogate function (or ruling out suboptimal surrogate functions) would yield
improved optimization performance. However, the observation addressed above
shows that this is not always the case. By sharing data points, the candidate
surrogate functions share information, which generates these intriguing
dynamics. This is clearly different from the scenario with multiple acquisition
functions investigated in [15].
In the experiments shown above, the MAB-based strategies did not always show
pronounced improvement over the RandomArm strategy. One possible reason
is that MAB-based strategies usually need a 'warm-up' period to make a sharp
decision. As shown in the figures, the probabilities of selecting an arm are
close to uniform at the beginning of BO and then evolve slowly. This is because
the MAB methods used in this thesis make decisions based on cumulative rewards,
and multiple rounds of playing the arms are needed to produce a sufficient
difference. BO typically runs in a 'small data' regime, so this 'slow decision'
characteristic of MAB is not a perfect fit.
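The slow evolution of the arm probabilities can be seen even in a generic EXP3 implementation. The sketch below is not the thesis's exact strategy; rewards are assumed to be pre-scaled to [0, 1], and gamma and the seed are illustrative choices.

```python
import numpy as np

class EXP3:
    """Minimal EXP3 sketch for K arms (rewards assumed in [0, 1])."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.w = np.ones(n_arms)
        self.gamma = gamma
        self.rng = np.random.default_rng(seed)

    def probs(self):
        k = len(self.w)
        return (1 - self.gamma) * self.w / self.w.sum() + self.gamma / k

    def select(self):
        return int(self.rng.choice(len(self.w), p=self.probs()))

    def update(self, arm, reward):
        # Importance weighting keeps the reward estimate unbiased even
        # though only the pulled arm's reward is observed.
        xhat = reward / self.probs()[arm]
        self.w[arm] *= np.exp(self.gamma * xhat / len(self.w))

bandit = EXP3(n_arms=2, gamma=0.2, seed=0)
arm = bandit.select()
bandit.update(arm, reward=1.0)
```

Since the weights change by a bounded multiplicative factor per round, many rounds are needed before the selection probabilities move far from uniform, which is exactly the warm-up behavior discussed above.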
[15] used the posterior predictive mean value as a reward, but this can be
problematic with multiple surrogate functions, because the predictive mean
values may scale differently. For example, during the experiments conducted
in this thesis, linear kernels tended to produce extreme posterior mean values,
which can be expected from their linearity. If used without caution, those
extreme posterior values may significantly skew the decision-making process.
One alternative is to exploit information expressed in the form of
probabilities, as is available in a different formulation of BO such as [3].
This could be an interesting future research direction.
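One simple guard against this scale mismatch, shown here purely as an illustration rather than a technique used in the thesis, is to rescale each batch of rewards to a common [0, 1] range before the bandit update:

```python
import numpy as np

def rescale_rewards(raw):
    """Min-max rescale rewards to [0, 1] so that a surrogate whose
    posterior means live on an extreme scale cannot dominate the bandit
    update through magnitude alone. Illustrative only."""
    raw = np.asarray(raw, dtype=float)
    lo, hi = raw.min(), raw.max()
    if hi == lo:                       # constant rewards: neutral value
        return np.full_like(raw, 0.5)
    return (raw - lo) / (hi - lo)
```

This also satisfies the [0, 1] reward range that EXP3-style analyses assume, though it discards absolute magnitude information.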
Exploiting the information structure behind a problem is key to a successful
solution of an MAB game (or other decision-making problems). Currently, the
puzzling dynamics among candidate surrogate functions are not properly
accounted for, which is probably the main reason why the experimented
strategies underperform the RandomArm strategy. In order to devise a strategy
that incorporates this specific information structure, a deeper understanding
of the model-reality discrepancy is needed. This may lead to research on a BO
version of computational learning theory.
6.2 Conclusion
The problem of performing Bayesian optimization in the multiple-candidate-
surrogate-function setting is investigated in this thesis. Two multi-armed
bandit-based strategies, one EXP3-based and one HEDGE-based, are proposed and
tested against the RandomArm strategy as a baseline. The superiority of the
proposed methods is not clear, since they outperform the baseline only
occasionally. However, the experimental results provided in this thesis
indicate interesting dynamics regarding surrogate functions and the performance
of BO. Performing BO with multiple surrogate functions leads to blending of
information from
the surrogate functions, often resulting in performance beyond expectation.
This effect is prominent even for the RandomArm strategy, suggesting that it
can be a useful hedging technique in situations where the risk of selecting a
wrong surrogate function is high. The organic material screening task is a
good example of such a situation. With BO, molecules with desired properties
can be found with a smaller number of property simulations, and blending
strategies can hedge the risk of selecting a wrong surrogate function as well
as improve the optimization performance. All of the results in this thesis are
closely related to an intriguing aspect of the model-based decision-making
problem, in which a model may deviate from reality, or multiple candidate
models may be available. Little is known about this, and hence revealing the
underlying mechanism of how models affect decision performance should be a
very exciting direction for future research.
Bibliography
[1] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, “Information-
theoretic regret bounds for Gaussian process optimization in the bandit
setting,” Information Theory, IEEE Transactions on, vol. 58, no. 5, pp. 3250–
3265, 2012.
[2] A. D. Bull, “Convergence rates of efficient global optimization algorithms,”
The Journal of Machine Learning Research, vol. 12, pp. 2879–2904, Nov.
2011.
[3] J. M. Hernandez-Lobato, M. W. Hoffman, and Z. Ghahramani, “Predictive
entropy search for efficient global optimization of black-box functions,” in
Advances in Neural Information Processing Systems, pp. 918–926, 2014.
[4] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl, “Algorithms for
hyper-parameter optimization,” in Advances in Neural Information Pro-
cessing Systems, pp. 2546–2554, 2011.
[5] K. Kandasamy, J. Schneider, and B. Poczos, “High dimensional Bayesian
optimisation and bandits via additive models,” in Proceedings of the 32nd
International Conference on Machine Learning (ICML-15) (D. Blei and
F. Bach, eds.), pp. 295–304, JMLR Workshop and Conference Proceedings,
2015.
[6] A. Cully, J. Clune, D. Tarapore, and J.-B. Mouret, “Robots that can adapt
like animals,” Nature, vol. 521, no. 7553, pp. 503–507, 2015.
[7] E. Brochu, V. M. Cora, and N. De Freitas, “A tutorial on Bayesian
optimization of expensive cost functions, with application to active
user modeling and hierarchical reinforcement learning,” arXiv preprint
arXiv:1012.2599, 2010.
[8] C. E. Rasmussen, “Gaussian processes to speed up hybrid Monte Carlo for
expensive Bayesian integrals,” Bayesian Statistics, vol. 7, pp. 651–659,
2008.
[9] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization
of machine learning algorithms,” in Advances in Neural Information
processing systems, pp. 2951–2959, 2012.
[10] K. Swersky, J. Snoek, and R. P. Adams, “Freeze-thaw Bayesian
optimization,” arXiv preprint arXiv:1406.3896, 2014.
[11] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Tak-
ing the human out of the loop: A review of Bayesian optimization,” tech.
rep., Universities of Harvard, Oxford, Toronto, and Google DeepMind,
2015.
[12] Y. Baram, R. El-Yaniv, and K. Luz, “Online choice of active learning
algorithms,” The Journal of Machine Learning Research, vol. 5, pp. 255–
291, 2004.
[13] G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen,
A. Tkatchenko, K.-R. Muller, and O. A. von Lilienfeld, “Machine learn-
ing of molecular electronic properties in chemical compound space,” New
Journal of Physics, vol. 15, no. 9, p. 095003, 2013.
[14] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe,
A. Tkatchenko, A. V. Lilienfeld, and K.-R. Muller, “Learning invariant
representations of molecules for atomization energy prediction,” in Ad-
vances in Neural Information Processing Systems, pp. 440–448, 2012.
[15] M. D. Hoffman, E. Brochu, and N. de Freitas, “Portfolio allocation for
bayesian optimization,” in Uncertainty in Artificial Intelligence, pp. 327–
336, 2011.
[16] W.-N. Hsu and H.-T. Lin, “Active learning by learning,” in Twenty-Ninth
AAAI Conference on Artificial Intelligence, 2015.
[17] I. Roman, R. Santana, A. Mendiburu, and J. A. Lozano, “Dynamic kernel
selection criteria for Bayesian optimization,” in BayesOpt 2014: NIPS
Workshop on Bayesian Optimization, 2014.
[18] C. E. Rasmussen, Gaussian processes for machine learning. MIT Press,
2006.
[19] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram,
M. Patwary, M. Prabhat, and R. Adams, “Scalable Bayesian optimization using
deep neural networks,” in Proceedings of the 32nd International Confer-
ence on Machine Learning (ICML-15), 2015.
[20] P. Auer, “Using confidence bounds for exploitation-exploration trade-offs,”
The Journal of Machine Learning Research, vol. 3, pp. 397–422, 2003.
[21] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochas-
tic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32,
no. 1, pp. 48–77, 2002.
[22] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling in a
rigged casino: The adversarial multi-armed bandit problem,” in Founda-
tions of Computer Science, 1995. Proceedings., 36th Annual Symposium
on, pp. 322–331, IEEE, 1995.
[23] D. M. Hawkins, S. S. Young, and A. Rusinko, “Analysis of a large
structure-activity data set using recursive partitioning,” Quantitative
Structure-Activity Relationships, vol. 16, no. 4, pp. 296–302, 1997.
[24] J. Devillers, Neural networks in QSAR and drug design. Academic Press,
1996.
[25] R. Burbidge, M. Trotter, B. Buxton, and S. Holden, “Drug design by ma-
chine learning: support vector machines for pharmaceutical data analysis,”
Computers & Chemistry, vol. 26, no. 1, pp. 5–14, 2001.
[26] U. Norinder, “Support vector machine models in drug design: applications
to drug transport processes and QSAR using simplex optimisations and variable
selection,” Neurocomputing, vol. 55, no. 1, pp. 337–346, 2003.
[27] G. E. Dahl, N. Jaitly, and R. Salakhutdinov, “Multi-task neural networks
for QSAR predictions,” arXiv preprint arXiv:1406.1231, 2014.
[28] K. Chaloner and I. Verdinelli, “Bayesian experimental design: A review,”
Statistical Science, pp. 273–304, 1995.
[29] The GPy authors, “GPy: A Gaussian process framework in Python.” http:
//github.com/SheffieldML/GPy, 2012–2015.
[30] D. R. Jones, C. D. Perttunen, and B. E. Stuckman, “Lipschitzian opti-
mization without the Lipschitz constant,” Journal of Optimization Theory
and Applications, vol. 79, no. 1, pp. 157–181, 1993.
[31] A.-R. Hedar, “Test functions for unconstrained global optimization.”
http://www-optima.amp.i.kyoto-u.ac.jp/member/student/hedar/
Hedar_files/TestGO_files/Page364.htm, 2012–2015.
국문초록 (Abstract in Korean)
Bayesian optimization is a model-based optimization technique that performs
optimization on top of a probabilistic model built from previously searched
points. The performance of Bayesian optimization depends heavily on which
probabilistic model is used, and in many cases it cannot be known in advance
which model will work best. This thesis presents a modified Bayesian
optimization problem in which multiple surrogate functions are available, and
experiments with two multi-armed bandit-based strategies for solving it. The
proposed strategies adaptively decide which of the candidate surrogate
functions to use. The strategies were tested on optimization benchmark
functions and on an organic molecule screening task, where the choice of
surrogate function is particularly important and the proposed strategies
therefore play a significant role. Surprisingly, the baseline strategy, which
selects each surrogate function uniformly at random, showed respectable
performance. These results suggest that relaxing the single-surrogate-function
condition in Bayesian optimization gives rise to interesting phenomena and is
a meaningful direction for future research.
Keywords: Bayesian optimization, Multi-Armed Bandit, Gaussian Processes,
Chemoinformatics
Student Number: 2014-21320