Bayesian Optimization (BO)
Javad Azimi
Fall 2010
http://web.engr.oregonstate.edu/~azimi/
Outline
• Formal Definition
• Application
• Bayesian Optimization Steps
– Surrogate Function (Gaussian Process)
– Acquisition Function
• PMAX
• IEMAX
• MPI
• MEI
• UCB
• GP-Hedge
Formal Definition
• Input: an unknown, expensive-to-evaluate function f : X → R over a bounded input space X.
• Goal: find the maximizer x* = argmax f(x), x ∈ X, using as few function evaluations as possible.
Fuel Cell Application
[Diagram: how a microbial fuel cell (MFC) works. Bacteria at the anode oxidize fuel (organic matter) into oxidation products (CO2), releasing electrons (e-) that travel to the cathode, where O2 and H+ form H2O.]
[SEM image: bacteria sp. on Ni nanoparticle-enhanced carbon fibers.]
• The nano-structure of the anode significantly impacts electricity production.
• We should optimize the anode nano-structure to maximize power by selecting a set of experiments.
Big Picture
• Since running an experiment is very expensive, we use BO.
• Select one experiment to run at a time, based on the results of the previous experiments.
Current Experiments → Our Current Model → Select Single Experiment → Run Experiment
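The select-run-update loop can be sketched in Python. Everything here is a stand-in: `run_experiment` is a made-up cheap objective (a real run would be a lab experiment), and a nearest-neighbor predictor with a distance-based uncertainty proxy plays the role of the surrogate described on the following slides:

```python
import numpy as np

def run_experiment(x):
    # Hypothetical expensive objective (stand-in for a real fuel-cell experiment).
    return -(x - 0.3) ** 2

# Candidate experiments and two initial observations.
candidates = np.linspace(0.0, 1.0, 101)
X, y = [0.0, 1.0], [run_experiment(0.0), run_experiment(1.0)]

for _ in range(5):  # budget: 5 sequential experiments
    # Surrogate stand-in: predict each candidate by its nearest observed value,
    # and use the distance to the nearest observation as an uncertainty proxy.
    dists = np.abs(candidates[:, None] - np.array(X)[None, :])
    mean = np.array(y)[dists.argmin(axis=1)]
    sigma = dists.min(axis=1)
    # Acquisition stand-in: mean plus an uncertainty bonus, then run the winner.
    x_next = candidates[(mean + sigma).argmax()]
    X.append(x_next)
    y.append(run_experiment(x_next))

best = X[int(np.argmax(y))]
```

The loop homes in on the region around the true maximizer (x = 0.3 here) with only a handful of evaluations; a real BO system replaces both stand-ins with a GP surrogate and one of the acquisition policies below.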
BO Main Steps
• Surrogate Function (Response Surface, Model)
– Makes a posterior over unobserved points based on the prior and the observed data.
– Its parameters might be based on the prior. Remember, it is a BAYESIAN approach.
• Acquisition Criterion (Function)
– Decides which sample should be selected next.
Surrogate Function
• Simulates the unknown function's distribution based on the prior.
– Deterministic (classical linear regression, …)
• There is a single deterministic prediction for each point x in the input space.
– Stochastic (Bayesian regression, Gaussian process, …)
• There is a distribution over the prediction for each point x in the input space (e.g., a normal distribution).
– Example
• Deterministic: f(x1) = y1, f(x2) = y2
• Stochastic: f(x1) ~ N(y1, 2), f(x2) ~ N(y2, 5)
Gaussian Process (GP)
• A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
– This is a consistency requirement, also called the marginalization property.
• Marginalization property: if (y1, y2) ~ N(μ, Σ), then y1 ~ N(μ1, Σ11); examining a larger set of variables does not change the distribution of the smaller set.
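The consistency requirement can be checked numerically: with a squared exponential kernel (used throughout the following slides) and an assumed unit signal variance, the covariance that a 3-point joint distribution assigns to its first two points is exactly the 2-point covariance computed from scratch:

```python
import numpy as np

def sq_exp_kernel(a, b, length=1.0):
    # Squared exponential covariance: k(x, x') = exp(-(x - x')^2 / (2 l^2)).
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * length ** 2))

x3 = np.array([0.0, 1.0, 2.0])   # three input points
K3 = sq_exp_kernel(x3, x3)       # 3x3 covariance of (f(0), f(1), f(2))

x2 = x3[:2]                      # drop the third point
K2 = sq_exp_kernel(x2, x2)       # 2x2 covariance of (f(0), f(1))

# Marginalization: the distribution of (f(0), f(1)) implied by the 3-point
# joint is exactly the top-left 2x2 block of K3 -- no recomputation needed.
consistent = np.allclose(K3[:2, :2], K2)
```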
Gaussian Process (GP)
• Formal prediction: for a test point z, given observations (X, y), a kernel k, and K = k(X, X),
μ(z) = k(z, X) K⁻¹ y
σ²(z) = k(z, z) − k(z, X) K⁻¹ k(X, z)
• Interesting points:
– The squared exponential kernel corresponds to Bayesian linear regression with an infinite number of basis functions.
– The variance is independent of the observed values y.
– The mean is a linear combination of the observed values.
– If the covariance function specifies the entries of the covariance matrix, the marginalization property is satisfied!
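A minimal NumPy sketch of these prediction equations, assuming a squared exponential kernel with unit signal variance and a small jitter term for numerical stability (the kernel length scale and the training data are arbitrary illustrations):

```python
import numpy as np

def sq_exp(a, b, length=0.5):
    # Squared exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)).
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * length ** 2))

def gp_predict(X, y, Z, noise=1e-8):
    """Posterior mean and variance at test points Z for noise-free observations (X, y)."""
    K = sq_exp(X, X) + noise * np.eye(len(X))   # jitter keeps K well-conditioned
    Kz = sq_exp(Z, X)
    mean = Kz @ np.linalg.solve(K, y)           # mu(z): linear in the observations y
    # sigma^2(z) = k(z, z) - k(z, X) K^-1 k(X, z); k(z, z) = 1 for this kernel.
    var = 1.0 - np.sum(Kz * np.linalg.solve(K, Kz.T).T, axis=1)
    return mean, var

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
mean, var = gp_predict(X, y, np.array([0.0, 1.5]))
```

At the training point z = 0 the GP interpolates exactly (mean ≈ sin(0), variance ≈ 0), while at z = 1.5, away from the data, the variance is substantial.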
Gaussian Process (GP)
• A Gaussian process is:
– An exact interpolating regression method.
• It predicts the training data perfectly (not true in classical regression).
– A natural generalization of linear regression.
• A nonlinear regression approach!
– Obtainable, in a simple case, from Bayesian regression.
• Identical results.
– A specification of a distribution over functions.
Gaussian Process: Distribution over Functions
[Figure: GP posterior showing the 95% confidence interval for each point x, and three sampled functions.]
Gaussian Process: GP vs. Bayesian Regression
• Bayesian regression:
– A distribution over the weights.
– The prior is defined over the weights.
• Gaussian process:
– A distribution over functions.
– The prior is defined over the function space.
• These are the same model, seen from different views.
Short Summary
• Given any unobserved point z, we can define a normal distribution over its predicted value such that:
– Its mean is a linear combination of the observed values.
– Its variance is related to its distance from the observed values (closer to the observed data, less variance).
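For a single observation the variance statement has a simple closed form: with a unit-variance squared exponential kernel and one noise-free observation at x0, the posterior variance at z is 1 − k(z, x0)², which grows monotonically with the distance |z − x0|. A tiny check on illustrative points:

```python
import numpy as np

# Posterior variance with one observation at x0 and kernel
# k(z, x0) = exp(-(z - x0)^2 / (2 l^2)):  var(z) = 1 - k(z, x0)^2.
x0, length = 0.0, 1.0
z = np.array([0.1, 0.5, 2.0])   # increasingly far from the observation
k = np.exp(-(z - x0) ** 2 / (2 * length ** 2))
var = 1.0 - k ** 2              # grows with distance from the observed data
```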
BO Main Steps
• Surrogate Function (Response Surface, Model)
– Makes a posterior over unobserved points based on the prior and the observed data.
– Its parameters might be based on the prior. Remember, it is a BAYESIAN approach.
• Acquisition Criterion (Function)
– Decides which sample should be selected next.
Bayesian Optimization: Acquisition Criterion
• Remember: we are looking for x* = argmax f(x), x ∈ X.
• Input:
– The set of observed data.
– A set of candidate points with their corresponding means and variances.
• Goal: decide which point should be selected next so that we reach the maximizer of the function faster.
• There are different acquisition criteria (acquisition functions, or policies).
Policies
• Maximum Mean (MM)
• Maximum Upper Interval (MUI)
• Maximum Probability of Improvement (MPI)
• Maximum Expected Improvement (MEI)
Policies: Maximum Mean (MM)
• Returns the point with the highest expected value: argmax μ(x).
• Advantage:
– If the model is stable and has been learned well, it performs very well.
• Disadvantage:
– There is a high chance of getting stuck in a local optimum (it only exploits).
• Can it converge to the global optimum eventually?
– No.
Policies: Maximum Upper Interval (MUI)
• Returns the point with the highest 95% upper confidence bound: argmax μ(x) + 1.96 σ(x).
• Advantage:
– Combines mean and variance (exploitation and exploration).
• Disadvantage:
– Dominated by the variance; it mainly explores the input space.
• Can it converge to the global optimum eventually?
– Yes, but it needs an almost infinite number of samples.
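The contrast between MM and MUI can be shown on made-up posterior statistics (the numbers below are illustrative, not from any real surrogate): MM picks the highest mean, while MUI is pulled toward the high-variance point:

```python
import numpy as np

# Hypothetical posterior means and standard deviations for three candidates.
mean = np.array([0.9, 0.5, 0.2])
std = np.array([0.05, 0.1, 0.6])

mm_pick = int(np.argmax(mean))                # Maximum Mean: pure exploitation
mui_pick = int(np.argmax(mean + 1.96 * std))  # 95% upper interval: mean + 1.96*sigma
```

Here MM selects candidate 0 (highest mean), but MUI selects candidate 2, whose large variance dominates its low mean; this is exactly the "mainly explores" behavior noted above.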
Policies: Maximum Probability of Improvement (MPI)
• Selects the sample with the highest probability of improving on the current best observation (ymax) by some margin m:
x_MPI = argmax Φ((μ(x) − (ymax + m)) / σ(x))
Policies: Maximum Probability of Improvement (MPI)
• Advantage:
– Considers the mean, the variance, and ymax in the policy (smarter than MUI).
• Disadvantage:
– The ad-hoc parameter m.
– A large value of m?
• Exploration.
– A small value of m?
• Exploitation.
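A sketch of MPI using the standard normal CDF, on illustrative numbers, showing how the margin m switches the policy between exploitation and exploration:

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def mpi(mean, std, y_max, m):
    # Probability that f(x) exceeds the incumbent y_max by margin m:
    # PI(x) = Phi((mu(x) - (y_max + m)) / sigma(x)).
    return np.array([normal_cdf((mu - (y_max + m)) / s) for mu, s in zip(mean, std)])

# Hypothetical candidates: one safe (high mean, low variance), one risky.
mean = np.array([0.9, 0.2])
std = np.array([0.05, 0.6])
y_max = 0.8

pick_small_m = int(np.argmax(mpi(mean, std, y_max, m=0.01)))  # exploits: candidate 0
pick_large_m = int(np.argmax(mpi(mean, std, y_max, m=1.0)))   # explores: candidate 1
```

With a tiny margin the safe candidate almost surely beats ymax, so MPI exploits; demanding a large improvement makes only the high-variance candidate plausible, so MPI explores.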
Policies: Maximum Expected Improvement (MEI)
• Maximum expected improvement: argmax E[max(0, f(x) − ymax)].
• Question: expectation over which variable?
– Over the margin m: the expected improvement equals the probability of improvement integrated over all margins m ≥ 0, which removes MPI's ad-hoc parameter.
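The expectation has a well-known closed form, EI = (μ − ymax) Φ(u) + σ φ(u) with u = (μ − ymax) / σ, which a quick Monte Carlo estimate can confirm (the μ, σ, ymax values below are arbitrary):

```python
import numpy as np
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, y_max):
    # Closed form of E[max(0, f(x) - y_max)] for f(x) ~ N(mu, sigma^2):
    # EI = (mu - y_max) * Phi(u) + sigma * phi(u), u = (mu - y_max) / sigma.
    u = (mu - y_max) / sigma
    Phi = 0.5 * (1.0 + erf(u / sqrt(2.0)))   # standard normal CDF
    phi = exp(-0.5 * u * u) / sqrt(2.0 * pi)  # standard normal PDF
    return (mu - y_max) * Phi + sigma * phi

# Monte Carlo check of the closed form on hypothetical numbers.
rng = np.random.default_rng(0)
mu, sigma, y_max = 0.5, 0.3, 0.6
samples = rng.normal(mu, sigma, 200_000)
mc = np.maximum(0.0, samples - y_max).mean()
exact = expected_improvement(mu, sigma, y_max)
```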
Policies: Upper Confidence Bound (UCB)
• Selects based on the variance and mean of each point: argmax μ(x) + κ σ(x).
– The selection of κ is left to the user.
– Recently, a principled approach to selecting this parameter has been proposed (GP-UCB).
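A sketch of the κ trade-off on illustrative numbers: κ = 0 reduces UCB to Maximum Mean, while a large κ lets the variance dominate, as with MUI:

```python
import numpy as np

# Hypothetical posterior statistics for three candidates.
mean = np.array([0.9, 0.5, 0.2])
std = np.array([0.05, 0.1, 0.6])

def ucb_pick(kappa):
    # UCB(x) = mu(x) + kappa * sigma(x); kappa trades exploitation for exploration.
    return int(np.argmax(mean + kappa * std))

greedy = ucb_pick(0.0)       # kappa = 0: identical to Maximum Mean
explorative = ucb_pick(3.0)  # large kappa: the high-variance candidate wins
```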
Summary
• We introduced several approaches, each with its own advantages and disadvantages:
– MM
– MUI
– MPI
– MEI
– GP-UCB
• Which one should be selected for an unknown model?
GP-Hedge
• GP-Hedge (2010).
• It selects one of the baseline policies based on theoretical results from the multi-armed bandit problem, although the objective is a bit different!
• The authors show that it can perform better than (or as well as) the best baseline policy in some frameworks.
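The Hedge idea behind it can be sketched abstractly: keep a gain for each baseline policy, pick a policy with probability proportional to exp(η · gain), and update every policy's gain by the reward of its nominated point (in GP-Hedge, the GP posterior mean at the nominee). The rewards below are noisy stand-ins, not real acquisition functions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal Hedge sketch over three hypothetical acquisition policies.
n_policies, eta = 3, 1.0
gains = np.zeros(n_policies)
true_quality = np.array([0.2, 0.5, 0.8])  # stand-in: policy 2 nominates the best points

chosen = []
for _ in range(200):
    # Select a policy with probability proportional to exp(eta * gain).
    p = np.exp(eta * gains)
    p /= p.sum()
    i = rng.choice(n_policies, p=p)
    chosen.append(i)
    # In GP-Hedge each policy's gain grows by the GP posterior mean at its
    # nominee; a noisy stand-in reward plays that role here.
    rewards = true_quality + 0.1 * rng.normal(size=n_policies)
    gains += rewards

best_policy = int(np.argmax(gains))
```

The selection probabilities quickly concentrate on the policy whose nominations earn the highest gains, which is how GP-Hedge tracks the best baseline without knowing it in advance.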
Future Works
• Method selection smarter than GP-Hedge, with theoretical analysis.
• Batch Bayesian optimization.
• Scheduling Bayesian optimization.