Learning in Bayesian Networks
Learning Problem
Set of random variables X = {W, X, Y, Z, …}
Training set D = {x1, x2, …, xN}
Each observation specifies values of a subset of the variables:
x1 = {w1, x1, ?, z1, …}
x2 = {w2, x2, y2, z2, …}
x3 = {?, x3, y3, z3, …}
Goal
Predict joint distribution over some variables given other variables, e.g., P(W, Y | Z, X)
Classes Of Graphical Model Learning Problems
Network structure known, all variables observed (today and next class)
Network structure known, some missing data or latent variables (today and next class)
Network structure not known, all variables observed (going to skip: not too relevant for papers we'll read; see optional readings for more info)
Network structure not known, some missing data or latent variables (going to skip)
Learning CPDs When All Variables Are Observed And Network Structure Is Known
Trivial problem?
Network: X → Z ← Y

P(X)?    P(Y)?

CPT to estimate:
X Y | P(Z|X,Y)
0 0 | ?
0 1 | ?
1 0 | ?
1 1 | ?

Training Data:
X Y Z
0 0 1
0 1 1
0 1 0
1 1 1
1 1 1
1 0 0
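To see why, here is a minimal sketch (illustrative code, not from the slides) that fills in the table by maximum-likelihood counting, i.e., relative frequencies over the six training cases above:

```python
from collections import Counter

# Training data from the slide: each row is (x, y, z)
data = [(0, 0, 1), (0, 1, 1), (0, 1, 0), (1, 1, 1), (1, 1, 1), (1, 0, 0)]

# Count parent configurations (x, y) and, separately, the cases where z = 1
parent_counts = Counter((x, y) for x, y, _ in data)
z1_counts = Counter((x, y) for x, y, z in data if z == 1)

# Maximum-likelihood estimate of each CPT entry P(Z=1 | X=x, Y=y)
for x in (0, 1):
    for y in (0, 1):
        n = parent_counts[(x, y)]
        p = z1_counts[(x, y)] / n if n else float("nan")
        print(f"P(Z=1 | X={x}, Y={y}) = {p:.2f}")

# The priors on the root nodes follow from the same counting argument
print("P(X=1) =", sum(x for x, _, _ in data) / len(data))
print("P(Y=1) =", sum(y for _, y, _ in data) / len(data))
```

With only six cases, some entries are estimated from a single observation, which is one reason to prefer the Bayesian treatment developed in the rest of the lecture.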
Recasting Learning As Inference
We’ve already encountered probabilistic models that have latent (a.k.a. hidden, unobservable) variables that must be estimated from data.
E.g., Weiss model: the direction of motion
E.g., Gaussian mixture model: to which cluster does each data point belong?
Why not treat unknown entries in the conditional probability tables the same way?
Recasting Learning As Inference
Suppose you have a coin with an unknown bias, θ ≡ P(head).
You flip the coin multiple times and observe the outcome.
From observations, you can infer the bias of the coin
This is learning. This is inference.
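A minimal sketch of that equivalence: treat θ as a latent variable, condition on a set of simulated flips, and read its posterior off a discrete grid (the uniform prior, grid, and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.7                       # the unknown bias we want to recover
flips = rng.random(50) < true_theta    # simulated observations: True = head

# Treat theta itself as a random variable: discretize it and place a uniform prior on it
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / theta.size

# Likelihood of the observed flips under each candidate value of theta
n_heads = int(flips.sum())
n_tails = len(flips) - n_heads
likelihood = theta**n_heads * (1.0 - theta)**n_tails

# Bayes' rule: inferring theta from data is ordinary probabilistic inference
posterior = prior * likelihood
posterior /= posterior.sum()

print("posterior mean of theta:", float((theta * posterior).sum()))
```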
Treating Conditional Probabilities As Latent Variables
Graphical model probabilities (priors, conditional distributions) can also be cast as random variables
E.g., Gaussian mixture model
Remove the knowledge “built into” the links (conditional distributions) and the nodes (prior distributions).
Create new random variables to represent that knowledge.
Hierarchical Bayesian Inference
[Figure: graphical models in which observations x depend on latent variables z, with the parameters θ and hyperparameters λ themselves drawn as random variables in the network.]
General Approach: Learning Probabilities in a Bayes Net
If network structure Sʰ is known and there is no missing data…
We can express the joint distribution over variables X in terms of a model parameter vector θs.
Given a random sample D = {x1, x2, ..., xN}, compute the posterior distribution p(θs | D, Sʰ).
Given the posterior distribution, marginals and conditionals on nodes in the network can be determined.
Probabilistic formulation of all supervised and unsupervised learning problems.
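Concretely, the parameter posterior is obtained by Bayes' rule (standard form; the slide's own equation is assumed to match):

p(θs | D, Sʰ) = p(D | θs, Sʰ) p(θs | Sʰ) / p(D | Sʰ)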
Computing Parameter Posteriors
Given complete data (all X, Y observed) and no direct dependencies among parameters, the parameter posterior factorizes across the individual parameter sets.
Explanation: Given complete data, each set of parameters is disconnected from each other set of parameters in the graph.
[Figure: network X → Y augmented with parameter nodes θx (on X) and θy|x (on Y); with X and Y observed, the parameter nodes are d-separated.]
d-separation ⇒ parameter independence
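Written out (a standard statement of parameter independence, using the θx, θy|x notation from the figure):

p(θx, θy|x | D) = p(θx | D) p(θy|x | D)

More generally, with θij denoting the parameters of node i for parent configuration j (notation introduced below):

p(θs | D, Sʰ) = ∏i ∏j p(θij | D, Sʰ)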
Posterior Predictive Distribution
Given parameter posteriors p(θs | D, Sʰ)
What is the prediction of the next observation xN+1?
How can this be used for unsupervised and supervised learning?
What we talked about the past three classes
What we just discussed
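In its standard form (assumed to match the slide), the posterior predictive averages the network's prediction over the parameter posterior; the first factor is inference with known parameters (what we talked about the past three classes), the second is the parameter posterior we just discussed:

p(xN+1 | D, Sʰ) = ∫ p(xN+1 | θs, Sʰ) p(θs | D, Sʰ) dθs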
Prediction Directly From Data
In many cases, prediction can be made without explicitly computing posteriors over parameters. E.g., the coin toss example from an earlier class.
Posterior distribution is
Prediction of next coin outcome
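With a Beta(αh, αt) prior on θ and Nh heads, Nt tails observed (assumed notation), the standard Beta-Bernoulli results are:

p(θ | D) = Beta(θ | αh + Nh, αt + Nt)
P(xN+1 = heads | D) = (αh + Nh) / (αh + αt + Nh + Nt)

The prediction depends only on counts plus pseudo-counts, which is why no posterior over θ needs to be computed explicitly.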
Generalizing To Multinomial RVs In Bayes Net
Variable Xi is discrete, with ri possible values xi1, ..., xiri
i: index of multinomial RV
j: index over configurations of the parents of node i
k: index over values of node i
unrestricted distribution: one parameter per probability
[Figure: node Xi with parents Xa and Xb]
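In this notation the unrestricted CPD has one free parameter per table entry (a standard parameterization; the symbols are assumed, not taken from the slide):

θijk ≡ P(Xi = xik | pai = j),   with Σk θijk = 1 for every node i and parent configuration j
θij ≡ (θij1, …, θijri)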
Prediction Directly From Data: Multinomial Random Variables
Prior distribution is
Posterior distribution is
Posterior predictive distribution:
i: index over nodes
j: index over configurations of the parents of node i
k: index over values of node i
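A standard Dirichlet-multinomial reconstruction of the three quantities above, assuming hyperparameters αijk and counts Nijk (the number of cases in D with Xi = xik and the parents of node i in configuration j):

Prior:        p(θij | Sʰ) = Dir(θij | αij1, …, αijri)
Posterior:    p(θij | D, Sʰ) = Dir(θij | αij1 + Nij1, …, αijri + Nijri)
Predictive:   P(Xi = xik | pai = j, D, Sʰ) = (αijk + Nijk) / (αij + Nij),   where αij = Σk αijk and Nij = Σk Nijk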