introduction to gaussian processesdanielpr/files/gptut-penn_0.pdfintroduction to gaussian processes...

Introduction to Gaussian Processes

Daniel Preotiuc-Pietro

Positive Psychology CenterUniversity of Pennsylvania

13 October 2014

with slides from Trevor Cohn, Neil Lawrence, Richard Turner

Gaussian Processes

Brings together several key ideas in one framework

� Bayesian� kernelised� non-parametric� non-linear� modelling uncertainty

Elegant and powerful framework, with growing popularity inmachine learning and application domains.

Gaussian Processes

State of the art for regression

� exact posterior inference� supports very complex non-linear functions� elegant model selection

Gaussian Processes

Now mature enough for use in NLP

� support for classification, ranking, etc� fancy kernels, e.g., text� sparse approximations for large scale inference� probabilistic formulation allows incorporation into larger

graphical models� models the prediction uncertainty so that it’s propagated

through pipelines of probabilistic components

Gaussian Processes

Regression and classification are fundamental techniques inNLP

Linear models:

� simple and fast to run� provide straightforward interpretability� but very limited in the type of relations they can model

Non-linear models (e.g. SVM):

� better results because they allow to model otherrelationships

� but high cost of hyperparameter optimisation� lack of interoperability with downstream processing

Gaussian Processes

The Gaussian distribution:

� is a distribution over vectors

The Gaussian process:

� is a distribution over functions

Gaussian Processes, prior

� the prior represents our prior beliefs over the functions weexpect to learn

� here, standard choices: 0 mean, 1 variance, ‘smooth’functions

Gaussian Process, posterior

� after observing data points, adjust the posteriordistribution over functions

� the functions must pass ’close’ to the observations,compute posterior mean and variance

Gaussian Processes

GP demo: http://www.tmpl.fi/gp/

Relationship to Linear regression

Linear regression:y = wx + ε (1)

where w are parameters we learn.

Gaussian process regression:

y = f (x) + ε (2)

where f is drawn from a Gaussian process prior.

� the w parameters are gone, hence the term‘non-parametric’ for GP’s

� f is not limited to a parametric form

Graphical model view

f ∼ GP(m, k)

y ∼ N( f (x), σ2)

� f : RD− > R is a latentfunction

� y is a noisy realisationof f (x)

� k is the covariancefunction or kernel

� m and σ2 is a learnedparameter

Formal definition

Definition

A Gaussian process is a collection of random variables, anyfinite number of which have a joint Gaussian distribution.

A GP is fully specified by mean m(x) and covariance k(x, x′)functions:

� m(x) = E[ f (x)]

� k(x, x′) = E[( f (x) −m(x))(( f (x′) −m(x′))]

Without loss of generality we can assume m(x) = 0.

Compared to a Gaussian distribution

Gaussian distribution:

� specified by a mean and covariance matrix: x ∼ N(μ,Σ)� the vector index is the position of the random variables x

Gaussian process:

� specified by a mean and covariance function: f ∼ GP(m, k)� the index is the argument x of f

The Gaussian Process is an infinite extension of multivariateGaussian distributions

Gaussian distribution

p(y|Σ) ∝ exp(−1

2yTΣ−1y)

Σ =

⎡⎣ 1 .7

.7 1

⎤⎦

y1

y 2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3



2yTΣ−1y)

Σ =

⎡⎣ 1 .6

.6 1

⎤⎦

y1

y 2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3



2yTΣ−1y)

Σ =

⎡⎣ 1 .4

.4 1

⎤⎦

y1

y 2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3



2yTΣ−1y)

Σ =

⎡⎣ 1 .1

.1 1

⎤⎦

y1

y 2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3



2yTΣ−1y)

Σ =

⎡⎣ 1 .0

.0 1

⎤⎦

y1

y 2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3



2yTΣ−1y)

Σ =

⎡⎣ 1 .7

.7 1

⎤⎦

y1

y 2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3


p(y2|y1,Σ) ∝ exp(−1

2(y2 − μ∗)Σ∗−1(y2 − μ∗)) 1

2

y1

y 2

−3 −2 −1 0 1 2 3−3

−2

−1

0

1

2

3

New visualisation

−2 0 2−3

−2

−1

0

1

2

3

y1

y 2

Σ =

⎡⎣ 1 .9

.9 1

⎤⎦

New visualisation

−2 0 2−3

−2

−1

0

1

2

3

y1

y 2

1 2−3

−2

−1

0

1

2

3

variable index

y

Σ =

⎡⎣ 1 .9

.9 1

⎤⎦

New visualisation

−2 0 2−3

−2

−1

0

1

2

3

y1

y 5

1 2 3 4 5−3

−2

−1

0

1

2

3

variable index

y

Σ =

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

1 .9 .8 .6 .4

.9 1 .9 .8 .6

.8 .9 1 .9 .8

.6 .8 .9 1 .9

.4 .6 .8 .9 1

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦

New visualisation

−2 0 2−3

−2

−1

0

1

2

3

y1

y 20

1 2 3 4 5 6 7 8 9 1011121314151617181920−3

−2

−1

0

1

2

3

variable index

y

Σ =

1 10 20

1

10

20

Regression using Gaussians

1 2 3 4 5 6 7 8 9 1011121314151617181920−3

−2

−1

0

1

2

3

variable index

y

Σ =

1 10 20

1

10

20

Regression using Gaussians

1 2 3 4 5 6 7 8 9 1011121314151617181920−3

−2

−1

0

1

2

3

variable index

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

Regression: probabilistic inference in function space

−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20


−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric)

K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

Parametric model

y(x) = f(x; θ) + σyε

ε ∼ N (0, 1)


−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y


p(y|θ) = N (0,Σ)

function estimatewith uncertaintyK(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

Parametric model


ε ∼ N (0, 1)


−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)


noise

K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

Parametric model


ε ∼ N (0, 1)


−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)


horizontal-scale

noise

K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

Parametric model


ε ∼ N (0, 1)


−3

−2

−1

0

1

2

3

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)


noise

Parametric model


ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)

1 10 20

1

10

20

horizontal-scalevertical-scale

Squared Exponential (SE) kernel

Usual choice for covariance is the Squared Exponential (SE)kernel (a.k.a. RBF, exponentiated quadratic, Gaussian):

k(x, x′) = σ2 exp(− (x − x′)2

2l2

)

The kernel’s parameters {l, σ} are called the GP’shyperparameters.

When l is very high, neighbouring points are very correlatedi.e. that feature is less relevant.

What effect do the hyper-parameters have?

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)

Non-parametric (∞-parametric) Parametric model


ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)


−3

−2

−1

0

1

2

3

x

y

Σ =

Σ(x1, x2) = K(x1, x2) + Iσ2y

p(y|θ) = N (0,Σ)



ε ∼ N (0, 1)K(x1, x2) = σ2 exp(− 1

2l2(x1 − x2)2

)K(x1, x2) = σ2 exp

(− 1

2l2(x1 − x2)2

)