

An Introduction to Bayesian Machine Learning

José Miguel Hernández-Lobato

Department of Engineering, Cambridge University

April 8, 2013


What is Machine Learning?

The design of computational systems that discover patterns in a collection of data instances in an automated manner.

The ultimate goal is to use the discovered patterns to make predictions on new data instances not seen before.

[Figure: images of handwritten digits labelled 0, 1 and 2, alongside new unlabelled digit images marked "?".]

Instead of manually encoding patterns in computer programs, we make computers learn these patterns without explicitly programming them.

Figure source [Hinton et al. 2006].


Model-based Machine Learning

We design a probabilistic model which explains how the data is generated.

An inference algorithm combines model and data to make predictions.

Probability theory is used to deal with uncertainty in the model or the data.

[Figure: labelled digit images (Data) and unlabelled digit images (New Data) feed into a probabilistic model and an inference algorithm, which outputs the prediction 0: 0.900, 1: 0.005, 2: 0.095.]


Basics of Probability Theory

The theory of probability can be derived using just two rules:

Sum rule:

p(x) = ∫ p(x, y) dy.

Product rule:

p(x, y) = p(y|x) p(x) = p(x|y) p(y).

They can be combined to obtain Bayes' rule:

p(y|x) = p(x|y) p(y) / p(x) = p(x|y) p(y) / ∫ p(x, y) dy.

Independence of X and Y: p(x, y) = p(x) p(y).

Conditional independence of X and Y given Z: p(x, y|z) = p(x|z) p(y|z).
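These rules are easy to check numerically; below is a minimal sketch (assuming NumPy, with a made-up joint table) that verifies the sum, product and Bayes' rules on a small discrete distribution.

```python
import numpy as np

# A made-up discrete joint p(x, y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# Sum rule: marginalize out one variable.
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(x, y) = p(y|x) p(x).
p_y_given_x = p_xy / p_x[:, None]
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Bayes' rule: p(x|y) = p(y|x) p(x) / p(y), which equals p(x, y) / p(y).
p_x_given_y = p_y_given_x * p_x[:, None] / p_y
assert np.allclose(p_x_given_y, p_xy / p_y)
```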


The Bayesian Framework

The probabilistic model M with parameters θ explains how the data D is generated by specifying the likelihood function p(D|θ, M).

Our initial uncertainty on θ is encoded in the prior distribution p(θ|M).

Bayes’ rule allows us to update our uncertainty on θ given D:

p(θ|D, M) = p(D|θ, M) p(θ|M) / p(D|M).

We can then generate probabilistic predictions for some quantity x of new data Dnew given D and M using

p(x|Dnew, D, M) = ∫ p(x|θ, Dnew, M) p(θ|D, M) dθ.

These predictions will be approximated using an inference algorithm.
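As a concrete instance of this recipe, here is a minimal sketch (assuming SciPy, with made-up coin-flip data): a Bernoulli likelihood, a conjugate Beta prior on θ, the posterior from Bayes' rule in closed form, and the predictive probability of the next flip.

```python
from scipy import stats

a, b = 1.0, 1.0                        # prior p(θ|M) = Beta(a, b)
data = [1, 0, 1, 1, 0, 1]              # D: made-up Bernoulli(θ) outcomes
heads = sum(data)
tails = len(data) - heads

# Conjugacy gives the posterior p(θ|D, M) in closed form.
posterior = stats.beta(a + heads, b + tails)

# Predictive p(x = 1|D, M) = ∫ θ p(θ|D, M) dθ, i.e. the posterior mean.
print(posterior.mean())                # (a + heads) / (a + b + len(data))
```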


Bayesian Model Comparison

Given a particular D, we can use the model evidence p(D|M) to reject both overly simple and overly complex models.

Figure source [Ghahramani 2012].
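A minimal sketch of this effect on coin-flip data (assuming SciPy; the data and both models are made up): the evidence of a fixed fair-coin model M0 is compared against a more flexible model M1 with a uniform Beta(1, 1) prior on θ.

```python
import numpy as np
from scipy.special import betaln

heads, tails = 9, 1                    # a made-up sequence of flips

# p(D|M0): θ fixed at 0.5, no free parameters.
log_ev_m0 = (heads + tails) * np.log(0.5)

# p(D|M1) = ∫ θ^h (1-θ)^t Beta(θ|1, 1) dθ = B(1+h, 1+t) / B(1, 1).
log_ev_m1 = betaln(1 + heads, 1 + tails) - betaln(1, 1)

# Bayes factor p(D|M1)/p(D|M0) ≈ 9.3: the data reject the too-simple M0.
print(np.exp(log_ev_m1 - log_ev_m0))
```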


Probabilistic Graphical Models

The Bayesian framework requires us to specify a high-dimensional distribution p(x1, ..., xk) on the data, model parameters and latent variables.

Working with fully flexible joint distributions is intractable!

We will work with structured distributions, in which the random variables interact directly with only a few others. These distributions will have many conditional independencies.

This structure will allow us to:
- Obtain a compact representation of the distribution.
- Use computationally efficient inference algorithms.

The framework of probabilistic graphical models allows us to represent and work with such structured distributions in an efficient manner.


Some Examples of Probabilistic Graphical Models

Bayesian Network

Graph: Season → Flu, Season → Hayfever; Flu and Hayfever → Congestion; Flu → Muscle-Pain.

Independencies: (F⊥H|S), (C⊥S|F,H), (M⊥H,C|F), (M⊥C|F), ...

Factorization: p(S, F, H, M, C) = p(S) p(F|S) p(H|S) p(C|F,H) p(M|F).

Markov Network

Graph: the cycle A − B − C − D − A.

Independencies: (A⊥C|B,D), (B⊥D|A,C).

Factorization: p(A, B, C, D) = (1/Z) φ1(A,B) φ2(B,C) φ3(C,D) φ4(A,D).

Figure source [Koller et al. 2009].


Bayesian Networks

A BN G is a DAG whose nodes are random variables X1, ..., Xd.

Let PA_G(Xi) be the parents of Xi in G.

The network is annotated with the conditional distributions p(Xi | PA_G(Xi)).

Conditional Independencies:

Let ND_G(Xi) be the variables in G which are non-descendants of Xi in G.

G encodes the conditional independencies (Xi ⊥ ND_G(Xi) | PA_G(Xi)), i = 1, ..., d.

Factorization:

G encodes the factorization p(x1, ..., xd) = ∏_{i=1}^{d} p(xi | pa_G(xi)).
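A minimal sketch of this factorization for the flu network of the previous slide, with made-up conditional probability tables: the joint over the 2^5 binary configurations is built purely from the local conditionals, and it sums to one.

```python
def bern(p_one, v):
    """Probability of binary value v when p(v = 1) = p_one."""
    return p_one if v == 1 else 1.0 - p_one

# Made-up CPTs for p(S), p(F|S), p(H|S), p(C|F,H), p(M|F).
P_S1 = 0.5
P_F1 = {1: 0.4, 0: 0.1}                       # p(F = 1 | S)
P_H1 = {1: 0.3, 0: 0.2}                       # p(H = 1 | S)
P_C1 = {(1, 1): 0.9, (1, 0): 0.8, (0, 1): 0.7, (0, 0): 0.05}  # p(C = 1 | F, H)
P_M1 = {1: 0.7, 0: 0.1}                       # p(M = 1 | F)

def joint(s, f, h, m, c):
    """p(S, F, H, M, C) = p(S) p(F|S) p(H|S) p(C|F,H) p(M|F)."""
    return (bern(P_S1, s) * bern(P_F1[s], f) * bern(P_H1[s], h)
            * bern(P_C1[(f, h)], c) * bern(P_M1[f], m))

total = sum(joint(s, f, h, m, c)
            for s in (0, 1) for f in (0, 1) for h in (0, 1)
            for m in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-12               # a valid joint distribution
```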


BN Examples: Naive Bayes

We have features X = (X1, ..., Xd) ∈ R^d and a label Y ∈ {1, ..., C}.

The figure shows the Naive Bayes graphical model with parameters θ1 and θ2 for a dataset D = {(xi, yi)}_{i=1}^{n}, using plate notation.

Shaded nodes are observed; unshaded nodes are not observed.

The joint distribution for D, θ1 and θ2 is

p(D, θ1, θ2) = [∏_{i=1}^{n} p(yi|θ2) ∏_{j=1}^{d} p(xi,j|yi, θ1)] p(θ1) p(θ2).

Figure source [Murphy 2012].
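A minimal sketch of how this model predicts (assuming NumPy, with made-up point estimates of θ1 and θ2 standing in for their posteriors): the class posterior is proportional to p(y|θ2) ∏_j p(xj|y, θ1).

```python
import numpy as np

theta2 = np.array([0.6, 0.4])            # p(y | θ2): prior over C = 2 classes
theta1 = np.array([[0.8, 0.1, 0.3],      # p(x_j = 1 | y = 0, θ1), d = 3 features
                   [0.2, 0.7, 0.9]])     # p(x_j = 1 | y = 1, θ1)

def class_posterior(x):
    """p(y|x) ∝ p(y) ∏_j p(x_j|y) for binary features x."""
    likelihood = np.prod(np.where(x == 1, theta1, 1.0 - theta1), axis=1)
    unnormalized = theta2 * likelihood
    return unnormalized / unnormalized.sum()

print(class_posterior(np.array([1, 0, 0])))   # most mass on class 0
```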


BN Examples: Hidden Markov Model

We have a temporal sequence of measurements D = {X1, ..., XT}.

Time dependence is explained by a hidden process Z = {Z1, ..., ZT}, which is modeled by a first-order Markov chain.

The data is a noisy observation of the hidden process.

Figure source [Murphy 2012].

The joint distribution for D, Z, θ1 and θ2 is

p(D, Z, θ1, θ2) = [∏_{t=1}^{T} p(xt|zt, θ1)] [∏_{t=1}^{T} p(zt|zt−1, θ2)] p(θ1) p(θ2) p(z0).
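A minimal sketch of this generative process (assuming NumPy; the transition, emission and initial distributions are made up): the hidden chain is sampled first, and each observation is a noisy readout of the current hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1],          # p(z_t | z_{t-1}, θ2): transition matrix
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],          # p(x_t | z_t, θ1): emission matrix
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])          # p(z_0)

T = 10
z_prev = rng.choice(2, p=pi)       # sample z_0
z, x = [], []
for t in range(T):
    z_t = rng.choice(2, p=A[z_prev])        # first-order Markov hidden process
    x.append(int(rng.choice(2, p=B[z_t])))  # noisy observation of z_t
    z.append(int(z_t))
    z_prev = z_t

print(z)
print(x)
```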


BN Examples: Matrix Factorization Model

We have observations xi,j from an n × d matrix X.

The entries of X are generated as a function of the entries of a low-rank matrix UVᵀ, where U is n × k, V is d × k and k ≪ min(n, d).

p(D, U, V, θ1, θ2, θ3) = [∏_{i=1}^{n} ∏_{j=1}^{d} p(xi,j|ui, vj, θ3)] [∏_{i=1}^{n} p(ui|θ1)] [∏_{j=1}^{d} p(vj|θ2)] p(θ1) p(θ2) p(θ3).
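A minimal sketch of this generative process (assuming NumPy, with made-up Gaussian choices for p(ui|θ1), p(vj|θ2) and the observation noise p(xi,j|ui, vj, θ3)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 50, 5                       # k ≪ min(n, d)

U = rng.normal(size=(n, k))                # rows u_i ~ p(u_i | θ1)
V = rng.normal(size=(d, k))                # rows v_j ~ p(v_j | θ2)
noise_std = 0.1                            # θ3

# x_ij ~ N(u_i · v_j, noise_std²): a noisy view of the low-rank matrix UVᵀ.
X = U @ V.T + noise_std * rng.normal(size=(n, d))

print(np.linalg.matrix_rank(U @ V.T))      # the noiseless part has rank k = 5
```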


D-separation

Conditional independence properties can be read directly from the graph.

We say that the sets of nodes A, B and C satisfy (A⊥B|C) when all of the possible paths from any node in A to any node in B are blocked.

A path will be blocked if it contains a node x with arrows meeting at x

1 - i) head-to-tail or ii) tail-to-tail, and x is in C, or
2 - head-to-head, and neither x nor any of its descendants is in C.

[Figure: two directed graphs over the nodes a, b, c, e, f, illustrating the two cases below.]

(a⊥b|c) does not follow from the graph.

(a⊥b|f) is implied by the graph.

Figure source [Bishop 2006].
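Both statements can be checked mechanically. A minimal sketch, assuming NetworkX ≥ 3.3 (which provides nx.is_d_separator) and assuming the figure's graph is a → e ← f → b with e → c:

```python
import networkx as nx

# Assumed reading of the figure: a → e, f → e, f → b, e → c.
g = nx.DiGraph([("a", "e"), ("f", "e"), ("f", "b"), ("e", "c")])

# Conditioning on c unblocks the head-to-head node e, so (a⊥b|c) fails.
print(nx.is_d_separator(g, {"a"}, {"b"}, {"c"}))   # False

# Conditioning on f blocks the tail-to-tail node f, so (a⊥b|f) holds.
print(nx.is_d_separator(g, {"a"}, {"b"}, {"f"}))   # True
```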


Bayesian Networks as Filters

p(x1, ..., xd) must satisfy the CIs implied by d-separation.

p(x1, ..., xd) must factorize as ∏_{i=1}^{d} p(xi | pa_G(xi)).

p(x1, ..., xd) must satisfy the CIs (Xi ⊥ ND_G(Xi) | PA_G(Xi)), i = 1, ..., d.

The three filters are the same!

[Figure: the set of all possible distributions, containing the subset of all distributions that factorize and satisfy the CIs.]

Figure source [Bishop 2006].


Markov Blanket

The Markov blanket of a node x is the set of nodes comprising the parents, children and co-parents of x.

It is the minimal set of nodes that isolates a node from the rest, that is, x is CI of any other node in the graph given its Markov blanket.

[Figure: a node x together with its Markov blanket.]

Figure source [Bishop 2006].


Markov Networks

A MN is an undirected graph G whose nodes are the random variables X1, ..., Xd.

It is annotated with the potential functions φ1(D1), ..., φk(Dk), where D1, ..., Dk are sets of variables, each forming a maximal clique of G, and φ1, ..., φk are positive functions.

Conditional Independencies:

G encodes the conditional independencies (A⊥B|C) for any sets of nodes A, B and C such that C separates A from B in G.

Factorization:

G encodes the factorization p(X1, ..., Xd) = Z^{−1} ∏_{i=1}^{k} φi(Di), where Z is a normalization constant.


Clique and Maximal Clique

Clique: a fully connected subset of nodes.

Maximal Clique: a clique in which we cannot include any more nodes without it ceasing to be a clique.

[Figure: an undirected graph over x1, x2, x3, x4.]

(x1, x2) is a clique but not a maximal clique.

(x2, x3, x4) is a maximal clique.

p(x1, ..., x4) = (1/Z) φ1(x1, x2, x3) φ2(x2, x3, x4).

Figure source [Bishop 2006].


MN Examples: Potts Model

Let x1, ..., xn ∈ {1, ..., C},

p(x1, ..., xn) = (1/Z) ∏_{i∼j} φij(xi, xj),

where

log φij(xi, xj) = β > 0 if xi = xj, and 0 otherwise.

[Figure: a grid-structured Markov network with neighbouring nodes xi and xj, and samples from the Potts model.]

Figure sources [Bishop 2006] and Erik Sudderth.
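A minimal sketch of sampling from this model (assuming NumPy; the grid size, β and the sweep count are made up): each Gibbs update conditions a pixel on its four neighbours, whose potentials add β to the logit of every agreeing label.

```python
import numpy as np

rng = np.random.default_rng(0)
C, side, beta = 2, 16, 1.0
x = rng.integers(0, C, size=(side, side))         # random initial labels

for sweep in range(50):
    for i in range(side):
        for j in range(side):
            # log φ_ij contributions from the neighbours of pixel (i, j).
            logits = np.zeros(C)
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < side and 0 <= nj < side:
                    logits[x[ni, nj]] += beta     # reward for matching a neighbour
            p = np.exp(logits - logits.max())
            x[i, j] = rng.choice(C, p=p / p.sum())
```

With β > 0 neighbouring pixels prefer the same label, so repeated sweeps produce the large single-coloured regions visible in the samples.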


From Directed Graphs to Undirected Graphs: Moralization

Let p(x1, ..., x4) = p(x1) p(x2) p(x3) p(x4|x1, x2, x3). How do we obtain the corresponding undirected model?

p(x4|x1, x2, x3) implies that x1, ..., x4 must all be in a maximal clique.

General method:
1 - Fully connect all the parents of any node.
2 - Eliminate edge directions.

[Figure: the directed graph over x1, x2, x3, x4 and its moralized undirected counterpart.]

Moralization adds the fewest extra links and so retains the maximum number of CIs.

Figure source [Bishop 2006].
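A minimal sketch of the two-step method (assuming NetworkX; the node names mirror the example above): marry all co-parents, then drop edge directions.

```python
import networkx as nx
from itertools import combinations

def moralize(dag: nx.DiGraph) -> nx.Graph:
    moral = dag.to_undirected()
    for node in dag.nodes:
        # Step 1: fully connect the parents of each node.
        for p, q in combinations(dag.predecessors(node), 2):
            moral.add_edge(p, q)
    return moral                          # Step 2: directions already dropped

bn = nx.DiGraph([("x1", "x4"), ("x2", "x4"), ("x3", "x4")])
print(sorted(moralize(bn).edges))         # x1-x2, x1-x3, x2-x3 appear as well
```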


CIs in Directed and Undirected Models

If all the CIs of p(x1, ..., xn) are reflected in G, and vice versa, then G is said to be a perfect map.

[Figure: a directed graph over A, B, C and an undirected graph over A, B, C, D, together with a Venn diagram of the sets of distributions P, D and U.]

Figure source [Bishop 2006].


Basic Distributions: Bernoulli

Distribution for x ∈ {0, 1} governed by µ ∈ [0, 1] such that µ = p(x = 1).

Bern(x|µ) = xµ + (1 − x)(1 − µ).

E(x) = µ.

Var(x) = µ(1 − µ).


Basic Distributions: Beta

Distribution for µ ∈ [0, 1], such as the probability of a binary event.

Beta(µ|a, b) = [Γ(a + b) / (Γ(a) Γ(b))] µ^{a−1} (1 − µ)^{b−1}.

E(µ) = a/(a + b).

Var(µ) = ab/((a + b)^2 (a + b + 1)).

[Figure: Beta pdfs over [0, 1] for (a, b) = (0.5, 0.5), (5, 1), (1, 3), (2, 2) and (2, 5).]


Basic Distributions: Multinomial

We draw with replacement n balls of k different categories from a bag.

Let xi denote the number of balls drawn and pi the probability, both of category i = 1, ..., k.

p(x1, ..., xk | n, p1, ..., pk) = [n! / (x1! ··· xk!)] p1^{x1} ··· pk^{xk} if ∑_{i=1}^{k} xi = n, and 0 otherwise.

E(xi) = n pi.

Var(xi) = n pi (1 − pi).

Cov(xi, xj) = −n pi pj for i ≠ j.


Basic Distributions: Dirichlet

Multivariate distribution over µ1, ..., µk ∈ [0, 1], where ∑_{i=1}^{k} µi = 1.

Parameterized in terms of α = (α1, ..., αk) with αi > 0 for i = 1, ..., k.

Dir(µ1, ..., µk | α) = [Γ(∑_{i=1}^{k} αi) / (Γ(α1) ··· Γ(αk))] ∏_{i=1}^{k} µi^{αi−1}.

E(µi) = αi / ∑_{j=1}^{k} αj.

Figure source [Murphy 2012].
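A minimal sketch (assuming NumPy; the concentration parameters are made up) that checks the mean formula by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])

samples = rng.dirichlet(alpha, size=100_000)   # each row sums to one
print(samples.mean(axis=0))                    # ≈ [0.2, 0.3, 0.5]
print(alpha / alpha.sum())                     # exact E(µ_i) = α_i / Σ_j α_j
```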


Basic Distributions: Multivariate Gaussian

p(x|µ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp{−(1/2) (x − µ)^T Σ^{−1} (x − µ)}.

E(x) = µ.

Cov(x) = Σ.

Figure source [Murphy 2012].


Basic Distributions: Wishart

Distribution for the precision matrix Λ = Σ^{−1} of a multivariate Gaussian.

W(Λ|W, ν) = B(W, ν) |Λ|^{(ν−D−1)/2} exp{−(1/2) Tr(W^{−1}Λ)},

where

B(W, ν) ≡ |W|^{−ν/2} (2^{νD/2} π^{D(D−1)/4} ∏_{i=1}^{D} Γ((ν + 1 − i)/2))^{−1}.

E(Λ) = νW.


Summary

With ML, computers learn patterns and then use them to make predictions. With ML we avoid manually encoding patterns in computer programs.

Model-based ML separates knowledge about the data generation process (model) from reasoning and prediction (inference algorithm).

The Bayesian framework allows us to do model-based ML using probability distributions which must be structured for tractability.

Probabilistic graphical models encode such structured distributions by specifying several CIs (factorizations) that they must satisfy.

Bayesian Networks and Markov Networks are two different types of graphical models which can express different types of CIs.


References

Hinton, G. E., Osindero, S. and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation 18, 2006, 1527-1554.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

Bishop, C. M. Model-based machine learning. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2013, 371.

Ghahramani, Z. Bayesian nonparametrics and the probabilistic approach to modelling. Philosophical Transactions of the Royal Society A, 2012.

Murphy, K. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
