Introduction to Graphical Models
Introduction to Graphical Models
Brookes Vision Lab Reading Group
Graphical Models
• To build a complex system using simpler parts
• The system should be consistent
• Parts are combined using probability
• Undirected – Markov random fields
• Directed – Bayesian networks
Overview
• Representation
• Inference
• Linear Gaussian models
• Approximate inference
• Learning
Representation
• The running example: a network over Cloudy (C), Sprinkler (S), Rain (R) and WetGrass (W)
• Causality: the sprinkler "causes" the wet grass
Conditional Independence
• A node is independent of its ancestors given its parents
• P(C,S,R,W) = P(C) P(S|C) P(R|C,S) P(W|C,S,R)   (chain rule)
•            = P(C) P(S|C) P(R|C) P(W|S,R)       (using the conditional independences)
• Space required for n binary nodes:
 – O(2^n) without factorization
 – O(n 2^k) with factorization, where k = maximum fan-in
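For the four-node sprinkler network above (binary variables, maximum fan-in k = 2), the full joint table has 2^4 = 16 entries (15 free parameters), whereas the factored form needs only 1 + 2 + 2 + 4 = 9 parameters: one for P(C), two each for P(S|C) and P(R|C), and four for P(W|S,R).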
Inference
• Pr(S=1 | W=1) = Pr(S=1, W=1) / Pr(W=1) = 0.2781 / 0.6471 = 0.430
• Pr(R=1 | W=1) = Pr(R=1, W=1) / Pr(W=1) = 0.4581 / 0.6471 = 0.708
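These numbers can be reproduced by brute-force enumeration of the joint. A minimal sketch, assuming the CPT values of Kevin Murphy's standard sprinkler example (the tables are not shown in this transcript, but these values yield exactly the figures above):

```python
# Inference by brute-force enumeration in the sprinkler network.
# CPT values below are assumed (Kevin Murphy's classic sprinkler example);
# they reproduce the numbers quoted on the slide.
from itertools import product

P_C = {1: 0.5, 0: 0.5}
P_S_given_C = {0: {1: 0.5, 0: 0.5}, 1: {1: 0.1, 0: 0.9}}              # P(S | C)
P_R_given_C = {0: {1: 0.2, 0: 0.8}, 1: {1: 0.8, 0: 0.2}}              # P(R | C)
P_W_given_SR = {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}  # P(W=1 | S, R)

def joint(c, s, r, w):
    """P(C=c, S=s, R=r, W=w) from the factored representation."""
    pw1 = P_W_given_SR[(s, r)]
    return P_C[c] * P_S_given_C[c][s] * P_R_given_C[c][r] * (pw1 if w else 1 - pw1)

def posterior(query, evidence):
    """P(query | evidence), by summing the joint over all assignments."""
    num = den = 0.0
    for c, s, r, w in product([0, 1], repeat=4):
        x = {'C': c, 'S': s, 'R': r, 'W': w}
        if all(x[k] == v for k, v in evidence.items()):
            p = joint(c, s, r, w)
            den += p
            if all(x[k] == v for k, v in query.items()):
                num += p
    return num / den

print(posterior({'S': 1}, {'W': 1}))   # ≈ 0.2781 / 0.6471 ≈ 0.430
print(posterior({'R': 1}, {'W': 1}))   # ≈ 0.4581 / 0.6471 ≈ 0.708
```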
Explaining Away
• S and R “compete” to explain W=1
• S and R become dependent once W is observed (conditionally dependent given W)
• Pr(S=1 | R=1, W=1) = 0.1945, compared with Pr(S=1 | W=1) = 0.430
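With the posterior() helper from the enumeration sketch above, the explaining-away effect is just the drop in the sprinkler posterior once rain is also observed:

```python
print(posterior({'S': 1}, {'W': 1}))           # ≈ 0.430
print(posterior({'S': 1}, {'R': 1, 'W': 1}))   # ≈ 0.1945: observing rain "explains away" the sprinkler
```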
Inference
Inference
• Variable elimination
• Choosing the optimal elimination ordering is NP-hard
• Greedy orderings work well in practice
• When computing several marginals, dynamic programming avoids redundant computation
• Sound familiar?
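A minimal sketch of one elimination step on the sprinkler network, reusing the CPT dictionaries assumed in the enumeration example above: C is summed out first into an intermediate factor over (S, R), so the full joint over all four variables is never built.

```python
# One step of variable elimination, reusing P_C, P_S_given_C, P_R_given_C,
# P_W_given_SR from the enumeration sketch above.
# Sum out C first into an intermediate factor over (S, R) ...
phi_SR = {(s, r): sum(P_C[c] * P_S_given_C[c][s] * P_R_given_C[c][r] for c in (0, 1))
          for s in (0, 1) for r in (0, 1)}

# ... then sum out S and R to obtain the marginal P(W=1).
p_w1 = sum(phi_SR[(s, r)] * P_W_given_SR[(s, r)] for s in (0, 1) for r in (0, 1))
print(p_w1)   # ≈ 0.6471, the denominator used on the earlier inference slide
```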
Bayes Balls for Conditional Independence
A Unifying (Re)View
Linear Gaussian Model (LGM) – the basic model, with two families of special cases:
• Continuous-state LGM: factor analysis (FA), SPCA, PCA, linear dynamical systems (LDS)
• Discrete-state LGM: mixture of Gaussians, vector quantization (VQ), hidden Markov models (HMM)
Basic Model
• The state of the system is a k-vector x (unobserved)
• The output of the system is a p-vector y (observed)
• Often k << p
• Basic model:
 – x_{t+1} = A x_t + w
 – y_t = C x_t + v
• A is the k x k transition matrix
• C is the p x k observation matrix
• w ~ N(0, Q), v ~ N(0, R) – zero mean w.l.o.g.
• The noise processes are essential
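A minimal simulation of this generative model; all dimensions and parameter values below are illustrative assumptions rather than values from the slides:

```python
# Simulate the basic linear Gaussian model:
#   x_{t+1} = A x_t + w,  w ~ N(0, Q)
#   y_t     = C x_t + v,  v ~ N(0, R)
import numpy as np

rng = np.random.default_rng(0)
k, p, T = 2, 4, 100                        # state dim, output dim, sequence length
A = np.array([[0.99, -0.1], [0.1, 0.99]])  # k x k transition matrix
C = rng.standard_normal((p, k))            # p x k observation matrix
Q = np.eye(k) * 0.01                       # state noise covariance
R = np.eye(p) * 0.1                        # observation noise covariance

x = np.zeros((T, k))
y = np.zeros((T, p))
x[0] = rng.multivariate_normal(np.zeros(k), np.eye(k))   # x_1 ~ N(mu_1, Q_1); here mu_1 = 0, Q_1 = I
y[0] = C @ x[0] + rng.multivariate_normal(np.zeros(p), R)
for t in range(T - 1):
    x[t + 1] = A @ x[t] + rng.multivariate_normal(np.zeros(k), Q)
    y[t + 1] = C @ x[t + 1] + rng.multivariate_normal(np.zeros(p), R)
```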
Degeneracy in Basic Model
• Structure in Q can be moved into A and C, so w.l.o.g. Q = I
• R cannot be restricted, since the y_t are observed
• Components of x can be reordered arbitrarily; the ordering is fixed by the norms of the columns of C
• x_1 ~ N(µ_1, Q_1)
• A and C are assumed to have rank k
• Q, R and Q_1 are assumed to be of full rank
Probability Computation
• P(x_{t+1} | x_t) = N(x_{t+1}; A x_t, Q)
• P(y_t | x_t) = N(y_t; C x_t, R)
• P({x_1,...,x_T}, {y_1,...,y_T}) = P(x_1) ∏_{t=1}^{T-1} P(x_{t+1} | x_t) ∏_{t=1}^{T} P(y_t | x_t)
• The negative log probability is therefore a sum of quadratic (Mahalanobis) terms
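A sketch of this computation on the sequence simulated above (it reuses x, y, A, C, Q, R, k, T from that snippet and assumes x_1 ~ N(0, I)):

```python
# Negative log probability of the simulated sequence, following the
# factorization P(x_1) * prod_t P(x_{t+1}|x_t) * prod_t P(y_t|x_t).
import numpy as np
from scipy.stats import multivariate_normal

nll = -multivariate_normal.logpdf(x[0], mean=np.zeros(k), cov=np.eye(k))
for t in range(T - 1):
    nll -= multivariate_normal.logpdf(x[t + 1], mean=A @ x[t], cov=Q)
for t in range(T):
    nll -= multivariate_normal.logpdf(y[t], mean=C @ x[t], cov=R)
print(nll)   # each term is a quadratic (Mahalanobis) penalty plus a constant
```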
Inference
• Given the model parameters {A, C, Q, R, µ_1, Q_1}
• Given observations y
• What can be inferred about the hidden states x?
• Total likelihood: P(y_1, ..., y_T)
• Filtering: P(x_t | y_1, ..., y_t)
• Smoothing: P(x_t | y_1, ..., y_T)
• Partial smoothing: P(x_t | y_1, ..., y_{t+t'})
• Partial prediction: P(x_t | y_1, ..., y_{t-t'})
• These quantities appear as intermediate values in the recursive methods for computing the total likelihood
Learning
• Unknown parameters Ө = {A, C, Q, R, µ_1, Q_1}
• Given observations y
• Log-likelihood L(Ө) = log P(y | Ө)
• F(Q, Ө) – the free energy, a lower bound on L(Ө)
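The formula on this slide is not reproduced in the transcript; the standard free-energy decomposition it refers to is

 F(Q, Ө) = E_Q[ log P(x, y | Ө) ] − E_Q[ log Q(x) ]
         = log P(y | Ө) − KL( Q(x) || P(x | y, Ө) ) ≤ log P(y | Ө) = L(Ө),

with equality exactly when Q is the posterior P(x | y, Ө).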
EM Algorithm
• Alternate between maximizing F(Q, Ө) with respect to Q (the E-step) and with respect to Ө (the M-step)
• F = L at the beginning of the M-step, because the E-step sets Q to the exact posterior
• The E-step does not change Ө
• Therefore the likelihood never decreases
Continuous-State LGM
• Static data modelling – no temporal dependence:
 – Factor analysis
 – SPCA
 – PCA
• Time-series modelling – time ordering of the data is crucial:
 – LDS (Kalman filter models)
Static Data Modelling
• A = 0, so x = w
• y = C x + v
• x_1 ~ N(0, Q)
• y ~ N(0, C Q C' + R)
• Degeneracy in the model – R must be restricted
• Learning: EM
• Inference
Factor Analysis
• Restrict R to be diagonal; Q = I
• x – the factors
• C – the factor loading matrix
• R – the uniquenesses
• Learning: EM, or quasi-Newton optimization
• Inference
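A quick illustration on made-up data; scikit-learn's FactorAnalysis fits exactly this model (diagonal R, Q = I), so the loadings and uniquenesses can be recovered directly:

```python
# Factor analysis on synthetic data: y = C x + v, with diagonal R and x ~ N(0, I).
# All dimensions and "true" parameters below are illustrative assumptions.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
k, p, n = 3, 10, 2000
C_true = rng.standard_normal((p, k))          # true factor loadings
R_true = rng.uniform(0.1, 0.5, size=p)        # true uniquenesses (diagonal of R)
X = rng.standard_normal((n, k))               # factors
Y = X @ C_true.T + rng.standard_normal((n, p)) * np.sqrt(R_true)

fa = FactorAnalysis(n_components=k).fit(Y)
print(fa.components_.shape)      # (k, p): estimated loadings (span the same subspace as C_true)
print(fa.noise_variance_[:3])    # estimated uniquenesses, close to R_true[:3]
factors = fa.transform(Y)        # inference: posterior mean of x for each observation
```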
SPCA
• R = εI
• ε – the global noise level
• The columns of C span the principal subspace
• Learning: EM algorithm
• Inference
PCA
• R = lim_{ε→0} εI
• Learning:
 – Diagonalize the sample covariance of the data
 – The leading k eigenvalues and eigenvectors define C
 – EM can find the leading eigenvectors without explicit diagonalization
• Inference:
 – The noise becomes infinitesimal
 – The posterior collapses to a single point
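A sketch of the closed-form learning rule on made-up data (sample covariance eigendecomposition):

```python
# PCA by diagonalizing the sample covariance: the top-k eigenvectors define C.
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 1000, 5, 2
Y = rng.standard_normal((n, 3)) @ rng.standard_normal((3, p)) \
    + 0.1 * rng.standard_normal((n, p))      # synthetic data with low-rank structure
Yc = Y - Y.mean(axis=0)                      # zero mean w.l.o.g.

cov = Yc.T @ Yc / n                          # sample covariance
evals, evecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
C = evecs[:, ::-1][:, :k]                    # leading k eigenvectors define C
x_hat = Yc @ C                               # inference: posterior collapses to the projection
```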
Linear Dynamical Systems
• Inference (filtering): Kalman filter
• Smoothing: RTS (Rauch-Tung-Striebel) recursions
• Learning: EM algorithm
 – C known: Shumway and Stoffer, 1982
 – All parameters unknown: Ghahramani and Hinton, 1995
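A minimal sketch of the Kalman filtering recursion (measurement update followed by prediction), using the same parameterization as the basic model above; the RTS smoother and the EM updates are omitted:

```python
# Kalman filter: the filtering recursion for x_{t+1} = A x_t + w, y_t = C x_t + v.
import numpy as np

def kalman_filter(Y, A, C, Q, R, mu1, Q1):
    """Return the filtered means E[x_t | y_1..y_t] for each t."""
    k = A.shape[0]
    mu, V = mu1, Q1                      # prior on x_1
    means = []
    for y in Y:
        # Measurement update: condition on y_t.
        S = C @ V @ C.T + R              # innovation covariance
        K = V @ C.T @ np.linalg.inv(S)   # Kalman gain
        mu = mu + K @ (y - C @ mu)
        V = (np.eye(k) - K @ C) @ V
        means.append(mu)
        # Time update: predict x_{t+1}.
        mu = A @ mu
        V = A @ V @ A.T + Q
    return np.array(means)
```

On the sequence simulated earlier, kalman_filter(y, A, C, Q, R, np.zeros(k), np.eye(k)) returns the filtered state estimates.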
Discrete-State LGM
• x_{t+1} = WTA[ A x_t + w ]
• y_t = C x_t + v
• x_1 = WTA[ N(µ_1, Q_1) ]
• WTA[·] is the winner-take-all nonlinearity: it returns a unit vector with a 1 at the largest component of its argument
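A sketch of this generative process, assuming WTA[·] picks out the largest coordinate as noted above; all parameter values are illustrative:

```python
# Simulate the discrete-state LGM: the hidden state is always a unit basis vector e_j.
import numpy as np

def wta(z):
    """Winner-take-all: unit vector with 1 at the largest coordinate of z."""
    e = np.zeros_like(z)
    e[np.argmax(z)] = 1.0
    return e

rng = np.random.default_rng(0)
k, p, T = 3, 2, 50
A = rng.random((k, k))                    # drives the state transition structure
C = rng.standard_normal((p, k))           # columns of C are the per-state output means
Q, R = np.eye(k) * 0.5, np.eye(p) * 0.1

x = wta(rng.multivariate_normal(np.zeros(k), np.eye(k)))  # x_1 = WTA[N(mu_1, Q_1)], mu_1 = 0 here
Y = []
for t in range(T):
    Y.append(C @ x + rng.multivariate_normal(np.zeros(p), R))  # y_t = C x_t + v
    x = wta(A @ x + rng.multivariate_normal(np.zeros(k), Q))   # x_{t+1} = WTA[A x_t + w]
```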
Discrete-State LGM
• Static data modelling:
 – Mixture of Gaussians
 – VQ
• Time-series modelling:
 – HMM
Static Data Modelling
• A = 0, so x = WTA[w], with w ~ N(µ, Q)
• y = C x + v
• π_j = P(x = e_j)
• A nonzero µ gives nonuniform π_j
• Given x = e_j, y ~ N(C_j, R), where C_j is the j-th column of C
Mixture of Gaussians
• Mixing coefficient of cluster j: π_j
• Means: the columns C_j
• Covariance: R
• Learning: EM (corresponds to maximum-likelihood competitive learning)
• Inference: posterior responsibilities of the clusters
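A quick illustration with scikit-learn's GaussianMixture on made-up data; covariance_type='tied' gives a single shared covariance R across clusters, as in the model above:

```python
# Mixture of Gaussians fitted by EM (illustrative synthetic data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(200, 2)) for m in (-3, 0, 3)])

gmm = GaussianMixture(n_components=3, covariance_type='tied').fit(X)
print(gmm.weights_)               # mixing coefficients pi_j
print(gmm.means_)                 # cluster means (the columns C_j)
resp = gmm.predict_proba(X[:5])   # inference: posterior responsibilities P(x = e_j | y)
```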
Vector Quantization
• The observation noise becomes infinitesimal
• The inference problem is solved by the 1-nearest-neighbour rule
• Euclidean distance when R is a scaled identity; Mahalanobis distance for general R
• The posterior collapses onto the closest cluster
• Learning with EM = the batch version of k-means
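The corresponding zero-noise limit, with scikit-learn's batch k-means on the same kind of made-up data:

```python
# Vector quantization as the zero-noise limit of the mixture model: batch k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(200, 2)) for m in (-3, 0, 3)])

km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.cluster_centers_)      # the columns C_j (cluster means / codebook)
codes = km.predict(X[:5])       # inference: hard 1-NN assignment to the closest cluster
```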
Time-series modelling
HMM
• Transition matrix T, with T_{i,j} = P(x_{t+1} = e_j | x_t = e_i)
• For every T, there exist corresponding A and Q
• Filtering: forward recursions
• Smoothing: forward-backward algorithm
• Learning: EM (called Baum-Welch re-estimation)
• MAP state sequences: the Viterbi algorithm
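A sketch of the forward (filtering) recursion, the discrete analogue of the Kalman filter above; T, pi, and the per-state emission density emission_pdf are assumed inputs (emission_pdf(y, j) could be a Gaussian density with mean C_j and covariance R, per the discrete-state LGM):

```python
# HMM filtering by the forward recursion (normalized at each step).
import numpy as np

def forward_filter(Y, T, pi, emission_pdf):
    """Return alpha_t(j) = P(x_t = e_j | y_1..y_t) for each t."""
    n_states = len(pi)
    alpha = pi * np.array([emission_pdf(Y[0], j) for j in range(n_states)])
    alpha /= alpha.sum()
    filtered = [alpha]
    for y in Y[1:]:
        pred = alpha @ T                                                   # predict: P(x_{t+1} = e_j | y_1..y_t)
        alpha = pred * np.array([emission_pdf(y, j) for j in range(n_states)])
        alpha /= alpha.sum()                                               # condition on y_{t+1}
        filtered.append(alpha)
    return np.array(filtered)
```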