![Page 1: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/1.jpg)
Nonparametric Bayesian Methods
(Gaussian Processes)
[80240603 Advanced Machine Learning, Fall, 2012]
Jun Zhu
State Key Lab of Intelligent Technology & Systems
Tsinghua University
November 15, 2011
![Page 2: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/2.jpg)
Recap. of Nonparametric Bayesian
What should we expect from nonparametric Bayesian
methods?
Complexity of our model should be allowed to grow as we get
more data
Place a prior on an unbounded number of parameters
![Page 3: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/3.jpg)
Example: Classification
Data
Nonparametric Approach
Parametric Approach
Build model
Predict using model
![Page 4: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/4.jpg)
Example: Clustering
Data
Nonparametric Approach
Parametric Approach
Build model
![Page 5: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/5.jpg)
Example: Regression
Data
Nonparametric Approach
Parametric Approach
Build model
Predict using model
![Page 6: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/6.jpg)
A Nonparametric Bayesian Approach to
Clustering
We must again specify two things:
The likelihood function (how data is affected by the parameters):
Identical to the parametric case.
The prior (the prior distribution on the parameters):
The Dirichlet Process!
Exact posterior inference is still intractable. But we can
derive the Gibbs update equations!
![Page 7: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/7.jpg)
What is a Dirichlet Process?
[http://www.nature.com/nsmb/journal/v7/n6/fig_tab/nsb0600_443_F1.html]
![Page 8: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/8.jpg)
The DP, CRP and Stick-Breaking Process
Three birds on the same stone
Stick-breaking Process
(just the weights)
The CRP describes a
partition of the $\theta_i$ when
$G$ is marginalized out
![Page 9: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/9.jpg)
Inference for DP Mixtures – Gibbs sampler
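We introduce the indicators $z_i$ and use the CRP
representation.
Randomly initialize $Z$. Repeat:
Sample each $z_i$ from its conditional distribution given the other indicators, $X$, and the cluster parameters
Sample each cluster parameter $\theta_k$ based on $Z$ and $X$, only for occupied clusters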
This is the sampler we saw earlier, but now with some
theoretical basis.
![Page 10: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/10.jpg)
Today, we talk about Gaussian processes, a nonparametric
Bayesian method on function spaces
Outline
Gaussian process regression
Gaussian process classification
Hyper-parameters, covariance functions, and more
![Page 11: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/11.jpg)
Recap. of Gaussian Distribution
Multivariate Gaussian Marginal & Conditional
![Page 12: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/12.jpg)
A Prediction Task
Goal: learn a function from noisy observed data
Linear
Polynomial
…
![Page 13: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/13.jpg)
Bayesian Regression Methods
Noisy observations
Gaussian likelihood function for linear regression
Gaussian prior (conjugate)
Inference with Bayes’ rule
Posterior
Marginal likelihood
Prediction
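The quantities listed on this slide can be sketched numerically. The block below is a minimal sketch, assuming the standard model $y = Xw + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$ and a zero-mean prior $w \sim \mathcal{N}(0, \Sigma_p)$. The names `blr_posterior`, `blr_predict`, `sigma_n`, and `Sigma_p` are illustrative and do not come from the slides.

```python
import numpy as np

def blr_posterior(X, y, sigma_n, Sigma_p):
    """Posterior N(w_mean, w_cov) over the weights; X holds one input per row."""
    A = X.T @ X / sigma_n**2 + np.linalg.inv(Sigma_p)  # posterior precision
    w_cov = np.linalg.inv(A)
    w_mean = w_cov @ X.T @ y / sigma_n**2
    return w_mean, w_cov

def blr_predict(x_star, w_mean, w_cov):
    """Predictive mean and variance of the noise-free function value at x_star."""
    return x_star @ w_mean, x_star @ w_cov @ x_star

# Synthetic data (an assumption for illustration): bias term plus one feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
w_true = np.array([0.5, -2.0])
y = X @ w_true + 0.1 * rng.standard_normal(50)

w_mean, w_cov = blr_posterior(X, y, sigma_n=0.1, Sigma_p=np.eye(2))
f_mean, f_var = blr_predict(np.array([1.0, 0.3]), w_mean, w_cov)
```

Because the prior is conjugate, the posterior stays Gaussian and both functions are closed-form; no sampling or optimization is needed.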
![Page 14: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/14.jpg)
Connections to Ridge Regression
The MAP estimate is a ridge regression
which reduces to
Squared error Quadratic regularizer
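The reduction can be written out explicitly. This is a sketch under the common assumption of an isotropic prior $w \sim \mathcal{N}(0, \tau^2 I)$ and noise variance $\sigma_n^2$; both symbols are assumptions, since the slide's own equations did not survive extraction.

```latex
\begin{align*}
\hat{w}_{\mathrm{MAP}}
  &= \arg\max_{w}\; \log p(y \mid X, w) + \log p(w) \\
  &= \arg\min_{w}\; \frac{1}{2\sigma_n^2}\,\lVert y - Xw \rVert^2
                   + \frac{1}{2\tau^2}\,\lVert w \rVert^2 \\
  &= \arg\min_{w}\; \underbrace{\lVert y - Xw \rVert^2}_{\text{squared error}}
                   + \underbrace{\lambda\,\lVert w \rVert^2}_{\text{quadratic regularizer}},
  \qquad \lambda = \frac{\sigma_n^2}{\tau^2}
\end{align*}
```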
![Page 15: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/15.jpg)
Generalize to Function Space
The linear regression model can be too restrictive.
How can we rescue it?
… by projections (the kernel trick)
![Page 16: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/16.jpg)
Generalize to Function Space
A mapping function
Doing linear regression in the mapped space
… everything is similar, with $X$ substituted by $\Phi(X)$
![Page 17: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/17.jpg)
Example 1: fixed basis functions
Given a set of basis functions
E.g. 1:
E.g. 2:
![Page 18: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/18.jpg)
Example 2: adaptive basis functions
Neural networks to learn a parameterized mapping function
E.g., a two-layer feedforward neural network
[Figure by Neal]
![Page 19: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/19.jpg)
Example 2: adaptive basis functions
A Bayesian two-layer network with zero-mean Gaussian priors
The infinite limit corresponds to a Gaussian process [Neal, PhD thesis, 1995]
[MacKay, Gaussian Processes: A Replacement for Supervised Neural Networks?, 1997]
[Neal, Bayesian Learning for Neural Networks, 1995]
![Page 20: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/20.jpg)
Can GP Replace Neural Networks?
Have we thrown the baby out with the bath water?
Neural networks are intelligent models that discover
features and patterns
Gaussian processes are smoothing devices, not for feature
discovery
The limit of an infinite number of hidden units (width) may be a
bad limit
How about multiple layers (depth)?
In fact, now, it’s the spring of deep learning/feature
learning/representation learning
![Page 21: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/21.jpg)
Model Complexity Matters
A simple curve fitting task
![Page 22: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/22.jpg)
Model Complexity Matters
Order = 1
![Page 23: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/23.jpg)
Model Complexity Matters
Order = 2
![Page 24: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/24.jpg)
Model Complexity Matters
Order = 3
![Page 25: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/25.jpg)
Model Complexity Matters
Order = 9?
![Page 26: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/26.jpg)
Model Complexity Matters
Too simple models
![Page 27: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/27.jpg)
Model Complexity Matters
Too complicated models
Issues with model selection!!
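The curve-fitting sequence above can be reproduced with a short experiment. This is a minimal sketch, assuming a noisy $\sin(2\pi x)$ target with 10 training points; the data set, the noise level, and the helper `rmse` are assumptions chosen for illustration, not the data behind the slide figures.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0.0, 1.0, 200)
y_test = np.sin(2 * np.pi * x_test)  # noise-free target used for evaluation

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

train_rmse, test_rmse = {}, {}
for order in (1, 2, 3, 9):
    poly = Polynomial.fit(x_train, y_train, deg=order)  # least-squares fit
    train_rmse[order] = rmse(poly(x_train), y_train)
    test_rmse[order] = rmse(poly(x_test), y_test)
    print(f"order={order}: train RMSE={train_rmse[order]:.3f}, "
          f"test RMSE={test_rmse[order]:.3f}")
```

The training error can only shrink as the order grows, and the order-9 polynomial interpolates all 10 points. Comparing the two printed columns shows why training error alone cannot be used to select the order.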
![Page 28: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/28.jpg)
A Non-parametric Approach
A non-parametric approach
No explicit parameterization of the function
Put a prior over all possible functions
Higher probabilities are given to functions that are more likely,
e.g., those with good properties (smoothness, etc.)
Manage an uncountably infinite number of functions
Gaussian processes provide a sophisticated approach with computational tractability
![Page 29: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/29.jpg)
Random Function vs. Random Variable
A function is represented as an infinite vector with an index
set
For a particular point $x$, $f(x)$ is a random variable
![Page 30: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/30.jpg)
Gaussian Process
A Gaussian process (GP) is a generalization of a multivariate Gaussian distribution to infinitely many variables, and thus to functions
Def: A stochastic process is Gaussian if and only if for every finite set of indices $x_1, \dots, x_n$ in the index set,
$(f(x_1), \dots, f(x_n))$ is a vector-valued Gaussian random variable
A Gaussian distribution is fully specified by the mean vector and covariance matrix
A Gaussian process is fully specified by a mean function and covariance function
Mean function: $m(x) = \mathbb{E}[f(x)]$
Covariance function: $k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$
![Page 31: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/31.jpg)
Kolmogorov Consistency
A fundamental theorem guarantees that a suitably “consistent” collection of finite-dimensional distributions will define a stochastic process
aka the Kolmogorov extension theorem
Kolmogorov Consistency Conditions
Invariance under permutation of the indices
Marginalization
Both can be verified with the properties of the multivariate Gaussian
Andrey Nikolaevich Kolmogorov
Soviet Russian mathematician
[1903 – 1987]
![Page 32: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/32.jpg)
Compare to Dirichlet Process
DP is on a random probability measure P, i.e., a special type of function
Positive, and sums to one!
Kolmogorov consistency due to the properties of the Dirichlet
distribution
DP: discrete instances (measures) with probability one
Natural for mixture models
DP mixture is a limit case of the finite Dirichlet mixture model
GP: continuous instances (real-valued functions)
Consistency due to the properties of the Gaussian
Good for prediction functions, e.g., regression and classification
![Page 33: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/33.jpg)
Bayesian Linear Regression is a GP
Bayesian linear regression with mapping functions
The mean and covariance are
Therefore,
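The argument can be spelled out as follows. This is a sketch assuming a zero-mean Gaussian prior $w \sim \mathcal{N}(0, \Sigma_p)$ on the weights and a feature map $\phi$; the symbols are assumptions, since the slide's formulas are not recoverable.

```latex
\begin{align*}
f(x) &= \phi(x)^\top w, \qquad w \sim \mathcal{N}(0, \Sigma_p) \\
\mathbb{E}[f(x)] &= \phi(x)^\top\, \mathbb{E}[w] = 0 \\
\mathbb{E}[f(x)\, f(x')] &= \phi(x)^\top\, \mathbb{E}[w w^\top]\, \phi(x')
                          = \phi(x)^\top \Sigma_p\, \phi(x') \\
\text{Therefore,}\quad f &\sim \mathcal{GP}\bigl(0,\; k(x, x')\bigr),
  \qquad k(x, x') = \phi(x)^\top \Sigma_p\, \phi(x')
\end{align*}
```

Any finite collection $f(x_1), \dots, f(x_n)$ is a linear map of the Gaussian vector $w$, so it is jointly Gaussian, which is exactly the GP definition.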
![Page 34: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/34.jpg)
Draw Random Functions from a GP
Example:
For a finite subset
![Page 35: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/35.jpg)
Draw Samples from Multivariate Gaussian
Task: draw a set of samples from $\mathcal{N}(\mu, \Sigma)$
Drawing them directly is not straightforward
A procedure is as follows
Cholesky decomposition (aka “matrix square root”): $\Sigma = L L^\top$
Generate $u \sim \mathcal{N}(0, I)$
Compute $x = \mu + L u$
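The three-step procedure can be sketched in NumPy. The function name `sample_mvn` and the 2-D example values are assumptions made for illustration.

```python
import numpy as np

def sample_mvn(mu, Sigma, n_samples, rng):
    """Draw n_samples rows from N(mu, Sigma) using a Cholesky factor."""
    L = np.linalg.cholesky(Sigma)                  # Sigma = L @ L.T
    u = rng.standard_normal((len(mu), n_samples))  # u ~ N(0, I)
    return (mu[:, None] + L @ u).T                 # x = mu + L u

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
samples = sample_mvn(mu, Sigma, 20000, rng)
```

The samples have covariance $L\,\mathbb{E}[u u^\top] L^\top = L L^\top = \Sigma$. When $\Sigma$ is a GP covariance matrix evaluated on a dense grid, it is often numerically close to singular, and adding a small multiple of the identity before the factorization is a common workaround.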
![Page 36: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/36.jpg)
Prediction with Noise-free Observations
For noise-free observations, we know the true function value
The joint distribution of training output and test outputs
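Conditioning the joint Gaussian on the observed values gives the predictive distribution. The block below is a minimal sketch, assuming a zero-mean GP with a squared-exponential kernel; the kernel choice and the names `sq_exp_kernel` and `gp_predict_noise_free` are assumptions, not taken from the slides.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0):
    """Squared-exponential covariance between 1-D input arrays A and B."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict_noise_free(X, f, X_star):
    """Condition the zero-mean GP prior on exact observations f at inputs X."""
    K = sq_exp_kernel(X, X)
    K_s = sq_exp_kernel(X, X_star)            # shape (n, n_star)
    K_ss = sq_exp_kernel(X_star, X_star)
    mean = K_s.T @ np.linalg.solve(K, f)      # K_s^T K^{-1} f
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
f = np.sin(X)
mean_train, cov_train = gp_predict_noise_free(X, f, X)   # predict at the training inputs
mean_new, cov_new = gp_predict_noise_free(X, f, np.array([2.0]))
```

At the training inputs the predictive mean reproduces `f` and the predictive variance collapses to zero, because $K K^{-1} = I$; away from the data the variance is strictly positive.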
![Page 37: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/37.jpg)
Sequential Update for Matrix Inversion
We don’t need to invert every covariance matrix from scratch
Let $C_N$ be the covariance matrix when N data points are
given
For N+1 data points, we have
[MacKay, Gaussian Processes: A Replacement for Supervised Neural Networks?, 1997]
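The update follows from the standard partitioned-inverse identity. This is a hedged sketch in which the new matrix is written as $C_{N+1} = \begin{bmatrix} C_N & k \\ k^\top & \kappa \end{bmatrix}$; the symbols `k`, `kappa` and the function name `grow_inverse` are assumptions for illustration.

```python
import numpy as np

def grow_inverse(C_inv, k, kappa):
    """Inverse of [[C, k], [k^T, kappa]] given C_inv = C^{-1}, in O(N^2) time."""
    Ck = C_inv @ k
    mu = 1.0 / (kappa - k @ Ck)        # inverse of the scalar Schur complement
    m = -mu * Ck
    M = C_inv + np.outer(m, m) / mu
    return np.block([[M, m[:, None]],
                     [m[None, :], np.array([[mu]])]])

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
C_full = B @ B.T + 6 * np.eye(6)       # a symmetric positive-definite test matrix
C_N, k, kappa = C_full[:5, :5], C_full[:5, 5], C_full[5, 5]

C_next_inv = grow_inverse(np.linalg.inv(C_N), k, kappa)
```

Only matrix-vector products with the existing inverse are needed, so adding one data point costs $O(N^2)$ rather than the $O(N^3)$ of a fresh inversion.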
![Page 38: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/38.jpg)
Posterior GP
Samples from the prior and the posterior after observing “+”
The shaded region denotes twice the standard deviation at each input
Why is the variance at the training points zero?
![Page 39: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/39.jpg)
Prediction with Noisy Observations
For noisy observations, we don’t know true function values
The joint distribution of training output and test outputs
Is the variance at the training points zero?
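With noise, the training covariance becomes $K + \sigma_n^2 I$. The block below is a minimal sketch of the resulting predictor using a Cholesky factorization, assuming a zero-mean GP with a squared-exponential kernel; the kernel, the noise level, and the function names are assumptions.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict_noisy(X, y, X_star, noise_var):
    """Predictive mean and variance of the latent f at X_star, given noisy targets y."""
    K_y = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_y^{-1} y
    K_s = sq_exp_kernel(X, X_star)
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = np.diag(sq_exp_kernel(X_star, X_star)) - np.sum(v * v, axis=0)
    return mean, var

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 8)
y = np.sin(X) + 0.3 * rng.standard_normal(8)
mean_train, var_train = gp_predict_noisy(X, y, X, noise_var=0.09)
# Same inputs, different targets: the predictive variance is unchanged.
_, var_other = gp_predict_noisy(X, -y + 5.0, X, noise_var=0.09)
```

Here the variance at the training inputs is small but no longer zero, since the observations only constrain the latent function up to the noise. The variance depends on the inputs and the kernel only, never on the targets.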
![Page 40: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/40.jpg)
More Analysis
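Let $\mathbf{k}_* = \mathbf{k}(x_*)$ be the vector of covariances between the n training points and one test
point $x_*$. We have
The mean is a linear predictor (representer theorem)
A linear combination of the observations (a linear smoother)
A linear combination of n kernel functions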
![Page 41: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/41.jpg)
More Analysis
Observations
Although GP defines a joint Gaussian distribution over all of the
y variables, it suffices to consider the (n+1)-dimensional
distribution. See the graphical illustration.
Variance doesn’t depend on observed targets, but only on
inputs
![Page 42: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/42.jpg)
Graphical Model for GP
Square nodes are observed, round nodes are stochastic
All pairs of latent variables are connected
Predictions depend only on the corresponding single latent variable
Adding a triplet does not influence the distribution. This is guaranteed by the consistency of the GP
![Page 43: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/43.jpg)
Residual Modeling with GP
Explicit Basis Function:
residual modeling with GP
an example of semi-parametric model
if we assume a normal prior
we have
Similarly, we can derive the predictive mean and covariance
![Page 44: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/44.jpg)
Outline
Introduction
Gaussian Process Regression
Gaussian Process Classification
![Page 45: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/45.jpg)
Recap. of Probabilistic Classifiers
Naïve Bayes (generative models)
The prior over classes
The likelihood with a strict conditional independence
assumption on inputs
Bayes’ rule is used for posterior inference
Logistic regression (conditional/discriminative models)
Allow arbitrary structures in inputs
![Page 46: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/46.jpg)
Recap. of Probabilistic Classifiers
More on the discriminative methods (binary classification)
$\sigma(\cdot)$ is the response function (its inverse is a link function)
comparison
![Page 47: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/47.jpg)
Recap. of Probabilistic Classifiers
Maximum likelihood estimation (MLE)
The objective function is smooth and concave, with a unique
maximum
We can solve it using Newton’s method or the conjugate
gradient method
$\|w\|$ goes to infinity in the separable case
![Page 48: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/48.jpg)
Bayesian Logistic Regression
Place a prior over w
[Figure credit: Rasmussen & Williams, 2006]
![Page 49: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/49.jpg)
Gaussian Process Classification
Latent function f(x)
Observations are independent given the latent function
![Page 50: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/50.jpg)
Posterior Inference for Classification
Posterior (Non-Gaussian)
Latent value
Predictive distribution
![Page 51: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/51.jpg)
Laplace Approximation Methods
Approximating a hard distribution with a “nicer” one
Laplace approximation is a method using a Gaussian distribution as the approximation
What Gaussian distribution?
![Page 52: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/52.jpg)
Laplace Approximation Methods
Approximate integrals of the form $\int_a^b e^{M f(x)}\,dx$
assume $f(x)$ has a global maximum at $x_0$
then
since $e^{M f(x)}$ grows exponentially with M, it’s enough to
focus on $f(x)$ at $x_0$
As M increases, the integral is well approximated by a Gaussian integral
![Page 53: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/53.jpg)
Laplace Approximation Methods
An example:
a global maximum is
![Page 54: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/54.jpg)
Laplace Approximation Methods
Derivation by Taylor series expansion
assume that the high-order terms are negligible
since $x_0$ is a local maximum, $f'(x_0) = 0$
Then, take the first three terms of the Taylor series at $x_0$
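Written out, the derivation gives the familiar result below. This is a sketch that assumes $x_0$ lies in the interior of the integration range and $f''(x_0) < 0$; the integral form $\int_a^b e^{M f(x)}\,dx$ is the standard statement of Laplace's method and is assumed here because the slide's formula was lost.

```latex
\begin{align*}
f(x) &\approx f(x_0) + f'(x_0)\,(x - x_0) + \tfrac{1}{2} f''(x_0)\,(x - x_0)^2
      = f(x_0) - \tfrac{1}{2}\,\lvert f''(x_0) \rvert\,(x - x_0)^2 \\
\int_a^b e^{M f(x)}\,dx
     &\approx e^{M f(x_0)} \int_{-\infty}^{\infty}
        e^{-\frac{M}{2} \lvert f''(x_0) \rvert (x - x_0)^2}\,dx
      = \sqrt{\frac{2\pi}{M\,\lvert f''(x_0) \rvert}}\; e^{M f(x_0)}
\end{align*}
```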
![Page 55: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/55.jpg)
Application: approximate a hard dist.
Consider a single variable z with distribution $p(z) = f(z)/Z$
where the normalization constant $Z = \int f(z)\,dz$ is unknown
f(z) could be a scaled version of p(z)
Laplace approximation can be applied to find a Gaussian
approximation centered on the mode of p(z)
![Page 56: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/56.jpg)
Application: approximate a hard dist.
Doing Taylor expansion in the logarithm space
$z_0$ is the mode. We have
Then, the Taylor series of $\ln f(z)$ at $z_0$ is
Taking exponential, we have
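A one-dimensional example makes the recipe concrete. The block below is a minimal sketch using only the standard library, assuming the unnormalized density $f(z) = e^{-z^2/2}\,\sigma(20z + 4)$; this density and the constants `A_SLOPE` and `B_OFFSET` are chosen for illustration and are not taken from the slides.

```python
import math

A_SLOPE, B_OFFSET = 20.0, 4.0   # illustrative constants (assumptions)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_f(z):
    """Log of the unnormalized density f(z) = exp(-z^2 / 2) * sigmoid(20 z + 4)."""
    return -0.5 * z * z + math.log(sigmoid(A_SLOPE * z + B_OFFSET))

def grad_log_f(z):
    return -z + A_SLOPE * (1.0 - sigmoid(A_SLOPE * z + B_OFFSET))

def hess_log_f(z):
    s = sigmoid(A_SLOPE * z + B_OFFSET)
    return -1.0 - A_SLOPE**2 * s * (1.0 - s)

# Step 1: find the mode z0 with Newton's method on log f.
z0 = 0.0
for _ in range(50):
    step = grad_log_f(z0) / hess_log_f(z0)
    z0 -= step
    if abs(step) < 1e-12:
        break

# Step 2: the precision A is the negative second derivative of log f at the mode.
A = -hess_log_f(z0)

def q(z):
    """Gaussian approximation N(z | z0, 1/A)."""
    return math.sqrt(A / (2.0 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)

# Laplace estimate of the normalization constant Z = integral of f(z) dz.
Z_laplace = math.exp(log_f(z0)) * math.sqrt(2.0 * math.pi / A)
```

The approximation matches the curvature of $\ln f$ at the mode only, so it can be poor for skewed densities such as this one; plotting `q(z)` against `f(z) / Z` is a quick way to see the mismatch in the tails.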
![Page 57: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/57.jpg)
Application: generalize to multivariate
Task: approximate $p(\mathbf{z})$ defined over an M-dim space
Find a stationary point $\mathbf{z}_0$, where $\nabla f(\mathbf{z}_0) = 0$
Do Taylor series expansion in log-space at $\mathbf{z}_0$
where A is the M x M Hessian matrix
Take exponential and normalize
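In symbols, the multivariate steps read as follows. This sketch assumes the common convention in which $A$ denotes the negative Hessian of $\ln f$ at the mode, so that $A$ is positive definite when $\mathbf{z}_0$ is a local maximum.

```latex
\begin{align*}
\ln f(\mathbf{z}) &\approx \ln f(\mathbf{z}_0)
   - \tfrac{1}{2}\,(\mathbf{z} - \mathbf{z}_0)^\top A\,(\mathbf{z} - \mathbf{z}_0),
   \qquad A = -\nabla\nabla \ln f(\mathbf{z})\,\big|_{\mathbf{z} = \mathbf{z}_0} \\
q(\mathbf{z}) &= \frac{\lvert A \rvert^{1/2}}{(2\pi)^{M/2}}
   \exp\Bigl\{ -\tfrac{1}{2}\,(\mathbf{z} - \mathbf{z}_0)^\top A\,(\mathbf{z} - \mathbf{z}_0) \Bigr\}
   = \mathcal{N}\bigl(\mathbf{z} \mid \mathbf{z}_0,\, A^{-1}\bigr)
\end{align*}
```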
![Page 58: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/58.jpg)
Steps in Applying Laplace Approximation
Find the mode
Run a numerical optimization algorithm
Multimodal distributions lead to different Laplace
approximations depending on the mode considered
Evaluate the Hessian matrix A at that mode
![Page 59: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/59.jpg)
Approximate Gaussian Process
Using a Gaussian to approximate the posterior
Then, the latent function distribution
The Laplace method yields a nice Gaussian
![Page 60: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/60.jpg)
Laplace Approximation for GP
Computing the mode and Hessian matrix
The true posterior
normalization constant
Find the MAP estimate
Take the derivative
![Page 61: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/61.jpg)
Laplace Approximation for GP
The derivatives of the log posterior are
W is diagonal since data points are independent
Finding the mode
Existence of maximum
For logistic, we have
The Hessian is negative definite, so the objective is concave and
has a unique maximum
How about probit regression?
(homework)
![Page 62: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/62.jpg)
Laplace Approximation for GP
Logistic regression likelihood
How about negative examples?
Well-explained Region
![Page 63: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/63.jpg)
Laplace Approximation for GP
Probit regression likelihood
How about negative examples?
Well-explained Region
![Page 64: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/64.jpg)
Laplace Approximation for GP
The derivatives of the log posterior are
W is diagonal since data points are independent
Finding the mode
Existence of maximum
At the maximum, we have
No closed-form solution; numerical methods are needed
![Page 65: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/65.jpg)
Laplace Approximation for GP
The derivatives of the log posterior are
W is diagonal since data points are independent
Finding the mode
No closed-form solution; numerical methods are needed
The Gaussian approximation
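The mode-finding step can be sketched with Newton iterations. The block below is a minimal sketch for the logistic likelihood with labels in $\{-1, +1\}$, assuming a squared-exponential kernel and a tiny toy data set; the kernel, the data, and the function names are assumptions, and the plain matrix solves are written for readability rather than numerical robustness.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def laplace_gp_mode(K, y, max_iter=100, tol=1e-10):
    """Newton iterations for the mode of p(f | X, y); labels y are in {-1, +1}."""
    n = len(y)
    t = (y + 1) / 2.0                  # targets recoded to {0, 1}
    f = np.zeros(n)
    for _ in range(max_iter):
        pi = sigmoid(f)
        grad = t - pi                  # gradient of log p(y | f)
        W = np.diag(pi * (1.0 - pi))   # negative Hessian of log p(y | f), diagonal
        # Newton step: f_new = (K^{-1} + W)^{-1} (W f + grad)
        #                    = K (I + W K)^{-1} (W f + grad)
        f_new = K @ np.linalg.solve(np.eye(n) + W @ K, W @ f + grad)
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    pi = sigmoid(f)
    W = np.diag(pi * (1.0 - pi))
    cov = K @ np.linalg.inv(np.eye(n) + W @ K)   # (K^{-1} + W)^{-1}
    return f, cov

X = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([-1, -1, -1, 1, 1, 1])
K = sq_exp_kernel(X, X)
f_hat, cov_hat = laplace_gp_mode(K, y)   # q(f | X, y) = N(f_hat, cov_hat)
```

At convergence the mode satisfies the fixed-point condition $\hat{f} = K \nabla \log p(y \mid \hat{f})$, which is a convenient check on any implementation.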
![Page 66: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/66.jpg)
Laplace Approximation for GP
Laplace approximation
Predictions as GP predictive mean
Positive examples have positive coefficients for their kernels
Negative examples have negative coefficients for their kernels
Well-explained points don’t contribute strongly to predictions
Non-support vectors
![Page 67: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/67.jpg)
Laplace Approximation for GP
Laplace approximation
Predictions as GP predictive mean
Then, the response variable is predicted as (MAP prediction)
Alternative average prediction
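The difference between the two predictions can be sketched as follows, assuming a logistic link and a Gaussian approximation N(mean, var) for the latent value at the test point. Gauss-Hermite quadrature is one possible way to evaluate the one-dimensional average; it is a choice made here, not necessarily the one used in the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_prediction(mean):
    """Plug the latent predictive mean into the link function."""
    return sigmoid(mean)

def averaged_prediction(mean, var, n_points=32):
    """Average sigmoid(f) over f ~ N(mean, var) with Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    f = mean + np.sqrt(2.0 * var) * nodes     # change of variables for weight exp(-x^2)
    return np.sum(weights * sigmoid(f)) / np.sqrt(np.pi)
```

Both rules give the same class decision at threshold 1/2, but the averaged probability is pulled towards 1/2 when the latent variance is large.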
![Page 68: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/68.jpg)
Weakness of Laplace Approximation
Directly only applicable to real-valued variables
Based on Gaussian distribution
May be applicable to transformed variable
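If $0 \le \tau < \infty$, then consider the Laplace approximation of $\ln \tau$
Based purely on the distribution at a specific value of the variable
Expansion around a local maximum (the mode), so global properties may be missed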
![Page 69: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/69.jpg)
GPs for Multi-class Classification
Latent functions for n training points and for C classes
Using multiple independent GPs, one for each category
Using softmax function to get the class probability
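A minimal sketch of the softmax step, assuming the latent values are arranged as an n × C array with one column per class (this layout is an assumption made for the example):

```python
import numpy as np

def softmax(F):
    """Row-wise softmax: F[i, c] is the latent value for point i and class c."""
    shifted = F - F.max(axis=1, keepdims=True)   # subtract the row max for stability
    exp_F = np.exp(shifted)
    return exp_F / exp_F.sum(axis=1, keepdims=True)
```

Subtracting the row maximum leaves the probabilities unchanged and avoids overflow for large latent values.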
![Page 70: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/70.jpg)
Laplace Approximation for Multi-class GP
The log of the unnormalized posterior is
We have
Then, the mode is
Newton's method can be applied with the above Hessian
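The quantities named on this slide can be sketched in the notation of Rasmussen & Williams (2006), Section 3.5, which is assumed here: $\mathbf{f}$ stacks the latent values of all $C$ classes, $\mathbf{y}$ is the stacked 0/1 class-indicator vector, $\boldsymbol{\pi}$ the stacked softmax probabilities, and $\Pi$ the $Cn \times n$ matrix obtained by vertically stacking the diagonal matrices $\mathrm{diag}(\boldsymbol{\pi}^c)$.

```latex
\begin{align}
\Psi(\mathbf{f}) &= -\tfrac{1}{2}\mathbf{f}^\top K^{-1}\mathbf{f} + \mathbf{y}^\top\mathbf{f}
  - \sum_{i=1}^{n}\log\Big(\sum_{c=1}^{C}\exp f_i^c\Big)
  - \tfrac{1}{2}\log|K| - \tfrac{Cn}{2}\log 2\pi \\
\nabla\Psi(\mathbf{f}) &= -K^{-1}\mathbf{f} + \mathbf{y} - \boldsymbol{\pi} \\
\nabla\nabla\Psi(\mathbf{f}) &= -K^{-1} - W,
  \qquad W = \mathrm{diag}(\boldsymbol{\pi}) - \Pi\Pi^\top \\
\hat{\mathbf{f}} &= K(\mathbf{y} - \hat{\boldsymbol{\pi}}) \\
\mathbf{f}^{\text{new}} &= (K^{-1} + W)^{-1}\big(W\mathbf{f} + \mathbf{y} - \boldsymbol{\pi}\big)
\end{align}
```

Unlike the binary case, $W$ is no longer diagonal, because the softmax couples the $C$ latent values of each data point.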
![Page 71: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/71.jpg)
Laplace Approximation for Multi-class GP
Predictions with the Gaussian approximation
The predictive mean for class c is
which is Gaussian as both terms in the product are Gaussian
The mean and covariance are
![Page 72: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/72.jpg)
Covariance Functions
The only requirement on a covariance matrix is that it be positive semidefinite
Many covariance functions exist; their hyper-parameters influence the resulting functions
S: stationary; ND: non-degenerate. Degenerate covariance functions have finite rank
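Both properties can be checked numerically on a Gram matrix. This is a sketch using two kernels chosen for illustration: the squared exponential (non-degenerate) and the linear kernel x·x' (degenerate, with rank at most the input dimension). The helper names are hypothetical.

```python
import numpy as np

def gram_se(X, lengthscale=1.0):
    """Gram matrix of the squared exponential kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def gram_linear(X):
    """Gram matrix of the linear kernel k(x, x') = x . x'."""
    return X @ X.T

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                  # 20 points in 2 dimensions
eig_se = np.linalg.eigvalsh(gram_se(X))       # eigenvalues, up to round-off all >= 0
rank_linear = np.linalg.matrix_rank(gram_linear(X))
```

The linear-kernel Gram matrix of 20 points has rank 2 here, which is what "finite rank" means for a degenerate covariance function.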
![Page 73: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/73.jpg)
Covariance Functions
Squared Exponential Kernel
Infinitely differentiable
Equivalent to regression using infinitely many Gaussian-shaped basis functions placed everywhere, not just at the training points!
Gaussian-shaped basis functions
For the finite case, with a Gaussian prior on the weights, we have a GP with covariance function
For the infinite limit, we can show
# basis functions per unit interval.
![Page 74: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/74.jpg)
Covariance Functions
Squared Exponential Kernel
Proof: (a set of uniformly distributed basis functions)
Letting the integration interval go to infinity, we get
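The limit can be checked numerically. This is a sketch, assuming basis functions φ_c(x) = exp(-(x - c)² / (2ℓ²)) with centers c on a uniform grid and unit prior variance per unit interval, as in Rasmussen & Williams (2006), Section 4.2.1. Under that assumption the sum over centers approaches √π ℓ exp(-(x_p - x_q)² / (4ℓ²)), an SE kernel with length-scale √2 ℓ.

```python
import numpy as np

def finite_basis_kernel(xp, xq, centers, lengthscale, spacing):
    """Covariance induced by Gaussian bumps on a grid, scaled by the grid spacing."""
    phi_p = np.exp(-0.5 * (xp - centers) ** 2 / lengthscale**2)
    phi_q = np.exp(-0.5 * (xq - centers) ** 2 / lengthscale**2)
    return spacing * np.sum(phi_p * phi_q)

def limiting_kernel(xp, xq, lengthscale):
    """Closed form of the integral over all centers."""
    return np.sqrt(np.pi) * lengthscale * np.exp(-((xp - xq) ** 2) / (4.0 * lengthscale**2))

spacing = 0.01
centers = np.arange(-20.0, 20.0, spacing)     # dense grid covering a wide interval
approx = finite_basis_kernel(0.3, -0.5, centers, 1.0, spacing)
exact = limiting_kernel(0.3, -0.5, 1.0)
```

With a coarse or narrow grid the two values drift apart.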
![Page 75: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/75.jpg)
Using finitely many basis functions can be dangerous!
Missed components
Not full rank
![Page 76: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/76.jpg)
Adaptation of Hyperparameters
Characteristic lengthscale parameter
Roughly measures how far apart two inputs must be for their function values to become uncorrelated
Larger l gives smoother functions (i.e., simpler functions)
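The effect of the length-scale can be seen by drawing functions from the prior. This is a minimal sketch, assuming a unit-variance squared exponential kernel on 1-D inputs; the jitter term and the seed are implementation choices made here.

```python
import numpy as np

def se_correlation(distance, lengthscale):
    """Prior correlation between f(x) and f(x + distance) under a unit-variance SE kernel."""
    return np.exp(-0.5 * (distance / lengthscale) ** 2)

def sample_prior(x, lengthscale, seed=0, jitter=1e-6):
    """Draw one function from a zero-mean GP prior at the 1-D inputs x."""
    K = se_correlation(x[:, None] - x[None, :], lengthscale)
    L = np.linalg.cholesky(K + jitter * np.eye(len(x)))   # jitter keeps K numerically PD
    return L @ np.random.default_rng(seed).standard_normal(len(x))

x = np.linspace(0.0, 5.0, 50)
wiggly = sample_prior(x, lengthscale=0.3)
smooth = sample_prior(x, lengthscale=3.0)
```

At a separation equal to one length-scale the correlation has dropped to exp(-1/2) ≈ 0.61. The sample drawn with the longer length-scale changes much more slowly between neighboring inputs.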
![Page 77: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/77.jpg)
Adaptation of Hyperparameters
Squared exponential covariance function
Hyper-parameters
Possible choices of M
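As a sketch of what the choices of M can look like, assuming the form used in Rasmussen & Williams (2006), Section 5.1: k(x_p, x_q) = σ_f² exp(-½ (x_p - x_q)ᵀ M (x_p - x_q)), with M isotropic, diagonal (automatic relevance determination), or low-rank plus diagonal. The function names are illustrative.

```python
import numpy as np

def se_general(X1, X2, M, signal_var=1.0):
    """SE covariance with a general symmetric positive semidefinite matrix M."""
    diff = X1[:, None, :] - X2[None, :, :]
    quad = np.einsum('ijd,de,ije->ij', diff, M, diff)   # (x_p - x_q)^T M (x_p - x_q)
    return signal_var * np.exp(-0.5 * quad)

def M_isotropic(lengthscale, D):
    """One shared length-scale: M = l^{-2} I."""
    return np.eye(D) / lengthscale**2

def M_ard(lengthscales):
    """One length-scale per input dimension: M = diag(l)^{-2}."""
    return np.diag(1.0 / np.asarray(lengthscales, dtype=float) ** 2)

def M_factor(Lambda, lengthscales):
    """Low-rank plus diagonal: M = Lambda Lambda^T + diag(l)^{-2}."""
    return Lambda @ Lambda.T + M_ard(lengthscales)
```

With the diagonal choice, a very large length-scale on one dimension makes the covariance nearly independent of that input, which is how irrelevant inputs get switched off.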
![Page 78: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/78.jpg)
Marginal Likelihood for Model Selection
A Bayesian approach to model selection
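Let $\{M_i\}$ denote a family of models. Each $M_i$ is characterized by some parameters $\theta$
The marginal likelihood (evidence) is $p(\mathbf{y} \mid X, M_i) = \int p(\mathbf{y} \mid X, \theta, M_i)\, p(\theta \mid M_i)\, d\theta$, i.e., the likelihood times the prior, integrated over $\theta$
An automatic trade-off between data fit and model complexity (see next slide …)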
![Page 79: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/79.jpg)
Marginal Likelihood for Model Selection
Simple models account for a limited range of data sets; complex models account for a broad range of data sets.
For a particular data set y, the marginal likelihood prefers a model of intermediate complexity over ones that are too simple or too complex
[Figure; axis label: p(y | X, M_i)]
![Page 80: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/80.jpg)
Marginal Likelihood for GP
Marginal likelihood can be used to estimate the hyper-parameters for GP
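For GP regression, we have $\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\mathbf{y}^\top K_y^{-1}\mathbf{y} - \tfrac{1}{2}\log|K_y| - \tfrac{n}{2}\log 2\pi$, with $K_y = K + \sigma_n^2 I$
The first term measures data fit; the second penalizes model complexity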
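The two labeled terms can be computed separately. This is a sketch, assuming Gaussian observation noise with variance `noise_var` added to the diagonal of K, and using a Cholesky factorization as recommended in Rasmussen & Williams (2006). The function name and return layout are choices made for this example.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """Return (total, data_fit, complexity) for GP regression with Gaussian noise."""
    n = len(y)
    Ky = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # Ky^{-1} y
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))              # equals -0.5 * log|Ky|
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + constant, data_fit, complexity
```

Returning the terms separately makes it easy to see how a hyper-parameter change trades data fit against the complexity penalty.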
![Page 81: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/81.jpg)
Marginal Likelihood for GP
Marginal likelihood can be used to estimate the hyper-parameters for GP
For GP regression, we have
Then, we can use gradient-based optimization (e.g., gradient descent on the negative log marginal likelihood) to solve
For GP classification, we need Laplace approximation to compute the marginal likelihood.
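A minimal sketch of gradient-based hyper-parameter fitting for GP regression, assuming a 1-D squared exponential kernel where only the length-scale is optimized (in log space) while the signal and noise variances are held fixed. The gradient uses the identity ∂/∂θ log p(y | X, θ) = ½ tr((ααᵀ - K_y⁻¹) ∂K_y/∂θ) with α = K_y⁻¹ y. The step-halving rule is a simple safeguard added for this example.

```python
import numpy as np

def lml_and_grad(log_ell, x, y, signal_var=1.0, noise_var=0.1):
    """Log marginal likelihood and its derivative with respect to log(lengthscale)."""
    ell = np.exp(log_ell)
    sq = (x[:, None] - x[None, :]) ** 2
    Kf = signal_var * np.exp(-0.5 * sq / ell**2)
    Ky = Kf + noise_var * np.eye(len(x))
    Ky_inv = np.linalg.inv(Ky)
    alpha = Ky_inv @ y
    _, logdet = np.linalg.slogdet(Ky)
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * len(x) * np.log(2.0 * np.pi)
    dK = Kf * sq / ell**2                                  # dKy / d(log ell)
    grad = 0.5 * np.trace((np.outer(alpha, alpha) - Ky_inv) @ dK)
    return lml, grad

def fit_lengthscale(x, y, log_ell=0.0, step=0.1, n_iter=200):
    """Gradient ascent on the log marginal likelihood with step halving."""
    lml, grad = lml_and_grad(log_ell, x, y)
    for _ in range(n_iter):
        candidate = log_ell + step * grad
        new_lml, new_grad = lml_and_grad(candidate, x, y)
        if new_lml > lml:                                  # accept only improving steps
            log_ell, lml, grad = candidate, new_lml, new_grad
        else:
            step *= 0.5
    return np.exp(log_ell), lml
```

The objective is generally non-convex in the hyper-parameters, so the result can depend on the starting point.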
![Page 82: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/82.jpg)
Other Model Selection Methods
When the number of parameters is small, we can do
K-fold cross-validation (CV)
Leave-one-out cross-validation (LOO-CV)
Different selection methods usually lead to different results
[Figure panels: marginal likelihood estimation; LOO-CV]
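For GP regression, LOO-CV does not require refitting n times. A sketch, assuming Gaussian noise and the closed-form expressions from Rasmussen & Williams (2006), Section 5.4.2: the leave-one-out predictive mean is μ_i = y_i - [K_y⁻¹ y]_i / [K_y⁻¹]_ii and the variance is σ_i² = 1 / [K_y⁻¹]_ii, where K_y includes the noise term. The function names are illustrative.

```python
import numpy as np

def loo_predictions(Ky, y):
    """Leave-one-out predictive means and variances from a single inverse of Ky."""
    Ky_inv = np.linalg.inv(Ky)
    diag = np.diag(Ky_inv)
    mu = y - (Ky_inv @ y) / diag
    var = 1.0 / diag
    return mu, var

def loo_log_predictive(Ky, y):
    """Sum of the leave-one-out log predictive densities (higher is better)."""
    mu, var = loo_predictions(Ky, y)
    return np.sum(-0.5 * np.log(var) - 0.5 * (y - mu) ** 2 / var - 0.5 * np.log(2.0 * np.pi))
```

The score from `loo_log_predictive` can be maximized over hyper-parameters in the same way as the marginal likelihood, and the two criteria need not select the same values.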
![Page 83: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/83.jpg)
Hyperparameters of Covariance Function
Squared Exponential
Hyperparameters: the maximum allowable covariance and the length-scale parameter
– The posterior predictive mean functions for three different length-scales
– The green one is learned by maximizing the marginal likelihood
– A length-scale that is too short can fit the data almost exactly!
![Page 84: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/84.jpg)
Other Inference Methods
Markov Chain Monte Carlo methods
Expectation Propagation
Variational Approximation
![Page 85: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/85.jpg)
Other Issues
Multiple outputs
Noise models with correlations
Non-Gaussian likelihood
Mixture of GPs
Student’s t process
Latent variable models
…
![Page 86: Nonparametric Bayesian Methods (Gaussian Processes) › lab-datasets › course › 7.1... · A Nonparametric Bayesian Approach to Clustering We must again specify two things: The](https://reader033.vdocuments.net/reader033/viewer/2022060407/5f0fae337e708231d4455b91/html5/thumbnails/86.jpg)
References
Rasmussen & Williams. Gaussian Processes for Machine Learning, 2006.
The Gaussian Process website: http://www.gaussianprocess.org/