
Page 1:

School of Computer Science

Probabilistic Graphical Models

Learning generalized linear models and tabular CPT of fully observed BN

Eric Xing, Lecture 7, September 30, 2009

Reading:

[Figure: four example directed graphical models over the nodes X1, X2, X3, X4]

Exponential family

For a numeric random variable X,

p(x | η) = h(x) exp{ η^T T(x) − A(η) } = (1/Z(η)) h(x) exp{ η^T T(x) }

is an exponential family distribution with natural (canonical) parameter η.

Function T(x) is a sufficient statistic.

Function A(η) = log Z(η) is the log normalizer.

Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
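As a concrete illustration (my own sketch, not part of the original slides), here is the Bernoulli case written in this form, using the standard identities T(x) = x, η = log(μ/(1−μ)) and A(η) = log(1 + e^η):

```python
import numpy as np

def bernoulli_exp_family(x, eta):
    """Bernoulli pmf via the exponential-family form
    p(x | eta) = h(x) exp{eta * T(x) - A(eta)}, with T(x) = x and h(x) = 1."""
    A = np.log1p(np.exp(eta))          # log normalizer A(eta) = log(1 + e^eta)
    return np.exp(eta * x - A)

mu = 0.3
eta = np.log(mu / (1.0 - mu))          # natural parameter for mean mu
print(bernoulli_exp_family(1, eta))    # ~0.3
print(bernoulli_exp_family(0, eta))    # ~0.7
```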

Page 2:

Recall: Linear Regression

Let us assume that the target variable and the inputs are related by the equation:

y_i = θ^T x_i + ε_i

where ε is an error term of unmodeled effects or random noise.

Now assume that ε follows a Gaussian N(0, σ); then we have:

p(y_i | x_i; θ) = (1 / (√(2π) σ)) exp( − (y_i − θ^T x_i)² / (2σ²) )

Recall: Logistic Regression (sigmoid classifier)

The conditional distribution: a Bernoulli

p(y | x) = μ(x)^y (1 − μ(x))^{1−y}

where μ is a logistic function

We can use the brute-force gradient method as in LR,

μ(x) = 1 / (1 + e^{−θ^T x})

But we can also apply generic results by observing that p(y|x) is an exponential family function, more specifically, a generalized linear model!
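To make the brute-force gradient option concrete, here is a small sketch (mine, on synthetic data, not code from the lecture) of gradient ascent on the Bernoulli log-likelihood with μ(x) = 1/(1 + e^{−θ^T x}):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # toy design matrix (hypothetical data)
p = sigmoid(X @ np.array([1.0, -2.0, 0.5]))
y = (rng.uniform(size=100) < p).astype(float)    # Bernoulli labels

theta = np.zeros(3)
rho = 0.1                                        # step size
for _ in range(200):
    mu = sigmoid(X @ theta)                      # mu_n = p(y_n = 1 | x_n)
    grad = X.T @ (y - mu)                        # gradient of the Bernoulli log-likelihood
    theta += rho * grad / len(y)
print(theta)                                     # roughly recovers [1.0, -2.0, 0.5]
```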

Page 3:

Example: Multivariate Gaussian Distribution

For a continuous vector random variable X ∈ R^k:

p(x | μ, Σ) = (1 / ((2π)^{k/2} |Σ|^{1/2})) exp{ −½ (x − μ)^T Σ^{-1} (x − μ) }
            = (1 / (2π)^{k/2}) exp{ −½ tr(Σ^{-1} x x^T) + μ^T Σ^{-1} x − ½ μ^T Σ^{-1} μ − ½ log|Σ| }

Exponential family representation:

Natural parameter:  η = [ Σ^{-1} μ ; vec(−½ Σ^{-1}) ],  i.e.  η1 = Σ^{-1} μ  and  η2 = −½ Σ^{-1}
Sufficient statistic:  T(x) = [ x ; vec(x x^T) ]
Log normalizer:  A(η) = ½ μ^T Σ^{-1} μ + ½ log|Σ| = −¼ tr(η2^{-1} η1 η1^T) − ½ log|−2 η2|
h(x) = (2π)^{-k/2}
Moment parameter:  μ = E[T(x)] = [ μ ; vec(Σ + μ μ^T) ]

Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positivity, the parameters are constrained and have lower degrees of freedom).
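A small numerical sketch (my own illustration, not from the slides) of the map from moment parameters (μ, Σ) to the natural parameters η1 = Σ^{-1}μ, η2 = −½Σ^{-1} described above, and of its inverse:

```python
import numpy as np

def gaussian_natural_params(mu, Sigma):
    """Natural parameters of a multivariate Gaussian: eta1 = Sigma^{-1} mu, eta2 = -1/2 Sigma^{-1}."""
    Lambda = np.linalg.inv(Sigma)      # precision matrix
    return Lambda @ mu, -0.5 * Lambda

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
eta1, eta2 = gaussian_natural_params(mu, Sigma)

# Invert the map to check the 1-to-1 correspondence with the moment parameters:
Sigma_back = np.linalg.inv(-2.0 * eta2)
mu_back = Sigma_back @ eta1
print(np.allclose(Sigma_back, Sigma), np.allclose(mu_back, mu))   # True True
```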

Example: Multinomial distribution

For a binary vector random variable x ~ multi(x | π):

p(x | π) = π1^{x1} π2^{x2} ⋯ πK^{xK} = exp{ Σ_k x_k ln π_k }
         = exp{ Σ_{k=1}^{K−1} x_k ln π_k + (1 − Σ_{k=1}^{K−1} x_k) ln(1 − Σ_{k=1}^{K−1} π_k) }
         = exp{ Σ_{k=1}^{K−1} x_k ln( π_k / (1 − Σ_{j=1}^{K−1} π_j) ) + ln(1 − Σ_{k=1}^{K−1} π_k) }

Exponential family representation:

η = [ ln(π_k / π_K) ; 0 ]
T(x) = [ x ]
A(η) = −ln(1 − Σ_{k=1}^{K−1} π_k) = ln( Σ_{k=1}^{K} e^{η_k} )
h(x) = 1
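A short sketch (my own, under the parameterization above with π_K as the reference class) checking that η_k = ln(π_k/π_K), that A(η) = ln Σ_k e^{η_k}, and that the softmax of η recovers π:

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])                 # K = 3 class probabilities
eta = np.log(pi / pi[-1])                      # eta_k = ln(pi_k / pi_K); note eta_K = 0
A = np.log(np.sum(np.exp(eta)))                # A(eta) = ln(sum_k e^{eta_k}) = -ln(pi_K)
pi_back = np.exp(eta - A)                      # softmax inverts the map
print(np.allclose(pi_back, pi))                # True
```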

Page 4:

Why exponential family?

Moment generating property:

dA/dη = d/dη log Z(η)
      = (1/Z(η)) dZ(η)/dη
      = ∫ T(x) h(x) exp{ η^T T(x) } dx / Z(η)
      = ∫ T(x) h(x) exp{ η^T T(x) − A(η) } dx
      = E[T(x)]

d²A/dη² = ∫ T(x) h(x) (exp{ η^T T(x) } / Z(η)) T(x) dx − ∫ T(x) h(x) exp{ η^T T(x) } dx · (1/Z(η)²) dZ(η)/dη
        = E[T²(x)] − E[T(x)]²
        = Var[T(x)]

Moment estimation

We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η). The qth derivative gives the qth centered moment:

dA(η)/dη = mean
d²A(η)/dη² = variance
…

When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
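To make the moment-generating property concrete, here is a small numerical check (my own sketch) for the Poisson case, where A(η) = e^η with η = log λ, so both dA/dη and d²A/dη² should equal λ:

```python
import numpy as np

lam = 2.5
eta = np.log(lam)                    # Poisson natural parameter eta = log(lambda)
A = lambda e: np.exp(e)              # log normalizer A(eta) = e^eta

h = 1e-5
dA = (A(eta + h) - A(eta - h)) / (2 * h)             # numerical first derivative
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # numerical second derivative
print(dA, d2A)                       # both approximately lambda = 2.5 (mean and variance)
```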

Page 5:

Moment vs. canonical parameters

The moment parameter μ can be derived from the natural (canonical) parameter:

dA(η)/dη = E[T(x)] ≝ μ

A(η) is convex, since

d²A(η)/dη² = Var[T(x)] > 0

[Figure: plot of the convex log normalizer A(η), with a particular value η* marked]

Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):

η ≝ ψ(μ)

A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by μ (the moment parameterization).

MLE for Exponential Family

For i.i.d. data, the log-likelihood is

ℓ(η; D) = log ∏_n h(x_n) exp{ η^T T(x_n) − A(η) }
        = Σ_n log h(x_n) + η^T ( Σ_n T(x_n) ) − N A(η)

Take derivatives and set to zero:

∂ℓ/∂η = Σ_n T(x_n) − N ∂A(η)/∂η = 0
⇒ ∂A(η)/∂η = (1/N) Σ_n T(x_n)
⇒ μ̂_MLE = (1/N) Σ_n T(x_n)

This amounts to moment matching.

We can infer the canonical parameters using  η̂_MLE = ψ(μ̂_MLE).
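A minimal sketch of moment matching (mine, not from the lecture) for the Poisson case: μ̂_MLE is the sample mean of T(x) = x, and the canonical parameter follows from η = ψ(μ) = log μ:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.0, size=10_000)   # i.i.d. Poisson data (synthetic)

mu_mle = x.mean()                       # moment matching: mu_MLE = (1/N) sum_n T(x_n)
eta_mle = np.log(mu_mle)                # canonical parameter via eta = psi(mu) = log(mu)
print(mu_mle, eta_mle)                  # ~4.0 and ~log(4.0)
```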

Page 6:

Sufficiency

For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).

We can throw away X for the purpose of inference w.r.t. θ .

Bayesian view:  p(θ | T(x), x) = p(θ | T(x))

Frequentist view:  p(x | T(x), θ) = p(x | T(x))

The Neyman factorization theorem

T(x) is sufficient for θ if

p(x, T(x), θ) = ψ1(T(x), θ) ψ2(x, T(x))
⇒ p(x | θ) = g(T(x), θ) h(x, T(x))

Examples

Gaussian:
  η = [ Σ^{-1} μ ; vec(−½ Σ^{-1}) ]
  T(x) = [ x ; vec(x x^T) ]
  A(η) = ½ μ^T Σ^{-1} μ + ½ log|Σ|
  h(x) = (2π)^{-k/2}
  ⇒ μ̂_MLE = (1/N) Σ_n T(x_n),  whose first component is  (1/N) Σ_n x_n

Multinomial:
  η = [ ln(π_k / π_K) ; 0 ]
  T(x) = [ x ]
  A(η) = −ln(1 − Σ_{k=1}^{K−1} π_k) = ln( Σ_{k=1}^{K} e^{η_k} )
  h(x) = 1
  ⇒ μ̂_MLE = (1/N) Σ_n x_n

Poisson:
  η = log λ
  T(x) = x
  A(η) = λ = e^{η}
  h(x) = 1 / x!
  ⇒ μ̂_MLE = (1/N) Σ_n x_n

Page 7:

Generalized Linear Models (GLIMs)

The graphical model:
  Linear regression
  Discriminative linear classification

Commonality: model E_p(Y) = μ = f(θ^T X)
  What is p()? The conditional distribution of Y.
  What is f()? The response function.

GLIM:
The observed input x is assumed to enter into the model via a linear combination of its elements, ξ = θ^T x.
The conditional mean μ is represented as a function f(ξ) of ξ, where f is known as the response function.
The observed output y is assumed to be characterized by an exponential family distribution with conditional mean μ.

GLIM, cont.

[Figure: GLIM pipeline — x →(θ)→ ξ →(f)→ μ →(ψ)→ η, with y ~ EXP(η)]

The choice of exponential family is constrained by the nature of the data Y:

p(y | η) = h(y) exp{ η^T y − A(η) }
⇒ p(y | η, φ) = h(y, φ) exp{ (η^T y − A(η)) / φ }

Example: y is a continuous vector → multivariate Gaussian
         y is a class label → Bernoulli or multinomial

The choice of the response function: f must satisfy some mild constraints (e.g., range in [0,1], positivity, …).

Canonical response function:  f = ψ^{-1}(·)

In this case θ^T x directly corresponds to the canonical parameter η.
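As an illustration of the canonical response function f = ψ^{-1} (a sketch using the standard canonical links, not code from the slides): identity for the Gaussian, the logistic function for the Bernoulli, and exp for the Poisson.

```python
import numpy as np

# Canonical response functions f = psi^{-1}: map eta = theta^T x to the mean mu.
canonical_response = {
    "gaussian":  lambda eta: eta,                         # identity link
    "bernoulli": lambda eta: 1.0 / (1.0 + np.exp(-eta)),  # logistic (inverse logit)
    "poisson":   lambda eta: np.exp(eta),                 # exponential (inverse log link)
}

eta = 0.7
for family, f in canonical_response.items():
    print(family, f(eta))
```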

Page 8:

MLE for GLIMs with natural response

Log-likelihood

ℓ(θ; D) = Σ_n log h(y_n) + Σ_n ( θ^T x_n y_n − A(η_n) )

Derivative of the log-likelihood:

dℓ/dθ = Σ_n ( x_n y_n − (dA(η_n)/dη_n)(dη_n/dθ) )
      = Σ_n ( y_n − μ_n ) x_n        (since dA/dη_n = μ_n and η_n = θ^T x_n)
      = X^T ( y − μ )

This is a fixed point function


Online learning for canonical GLIMs

Stochastic gradient ascent = least mean squares (LMS) algorithm:

∇_θ ℓ = X^T ( y − μ ),   because μ is a function of θ

θ^{t+1} = θ^t + ρ ( y_n − μ_n^t ) x_n

where μ_n^t = (θ^t)^T x_n and ρ is a step size.
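A minimal sketch of this stochastic update for the linear-Gaussian case, where μ_n = θ^T x_n (my own toy data; the same loop applies to any canonical GLIM once the response function is swapped in):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)   # toy targets

theta = np.zeros(3)
rho = 0.01                                     # step size
for t in range(5000):
    n = t % len(y)                             # cycle through the data
    mu_n = X[n] @ theta                        # canonical Gaussian GLIM: mu_n = theta^T x_n
    theta = theta + rho * (y[n] - mu_n) * X[n] # theta^{t+1} = theta^t + rho (y_n - mu_n) x_n
print(theta)                                   # close to [1.0, -2.0, 0.5]
```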

Batch learning for canonical GLIMs

The Hessian matrix:

H = d²ℓ / (dθ dθ^T) = d/dθ^T Σ_n ( y_n − μ_n ) x_n
  = − Σ_n x_n ( dμ_n / dθ^T )
  = − Σ_n x_n ( dμ_n / dη_n ) ( dη_n / dθ^T )
  = − Σ_n x_n ( dμ_n / dη_n ) x_n^T        (since η_n = θ^T x_n)
  = − X^T W X

where X = [ — x_n^T — ] is the design matrix (one row per data case), y = (y_1, …, y_N)^T, and

W = diag( dμ_1/dη_1, …, dμ_N/dη_N ),

which can be computed by calculating the second derivative of A(η_n).

Page 9:

Recall LMS

Cost function, in matrix form:

J(θ) = ½ Σ_{i=1}^{n} ( x_i^T θ − y_i )² = ½ ( Xθ − y )^T ( Xθ − y )

where X = [ — x_i^T — ] is the design matrix and y = (y_1, …, y_n)^T.

To minimize J(θ), take the derivative and set it to zero:

∇_θ J = ½ ∇_θ ( θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y )
      = ½ ( 2 X^T X θ − 2 X^T y )
      = X^T X θ − X^T y = 0

⇒ X^T X θ = X^T y     (the normal equations)

θ* = ( X^T X )^{-1} X^T y
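A one-line check of the normal equations on synthetic data (my own sketch); in practice one solves the linear system rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.05 * rng.normal(size=200)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)   # solves X^T X theta = X^T y
print(theta_star)                                # close to [2.0, 0.0, -1.0]
```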

Iteratively Reweighted Least Squares (IRLS)

Recall the Newton-Raphson method with cost function J:

θ^{t+1} = θ^t − H^{-1} ∇_θ J

We now have:

∇_θ ℓ = X^T ( y − μ )
H = − X^T W X

Now:

θ^{t+1} = θ^t − H^{-1} ∇_θ ℓ
        = ( X^T W^t X )^{-1} [ X^T W^t X θ^t + X^T ( y − μ^t ) ]
        = ( X^T W^t X )^{-1} X^T W^t z^t

where the adjusted response is

z^t = X θ^t + (W^t)^{-1} ( y − μ^t )

This can be understood as solving the following "iteratively reweighted least squares" problem:

θ^{t+1} = argmin_θ ( z − Xθ )^T W ( z − Xθ )

Page 10:

Example 1: logistic regression (sigmoid classifier)

The conditional distribution: a Bernoulli

p(y | x) = μ(x)^y (1 − μ(x))^{1−y}

where μ is a logistic function

p(y|x) is an exponential family function, with mean:

E[y | x] = μ(x) = 1 / (1 + e^{−η(x)})

and canonical response function

η = ξ = θ^T x

IRLS:

dμ/dη = μ (1 − μ)

W = diag( μ_1(1 − μ_1), …, μ_N(1 − μ_N) )
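Putting the pieces together, here is a compact IRLS sketch for logistic regression (my own illustration on synthetic data), following the update θ^{t+1} = (X^T W X)^{-1} X^T W z with W = diag(μ_n(1 − μ_n)):

```python
import numpy as np

def irls_logistic(X, y, iters=20):
    """IRLS / Newton-Raphson for logistic regression with the canonical (logit) link."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ theta
        mu = 1.0 / (1.0 + np.exp(-eta))               # mu_n = sigma(theta^T x_n)
        w = mu * (1.0 - mu)                           # W = diag(mu_n (1 - mu_n))
        z = eta + (y - mu) / np.maximum(w, 1e-10)     # adjusted response z^t
        WX = X * w[:, None]
        theta = np.linalg.solve(X.T @ WX, X.T @ (w * z))   # weighted least squares step
    return theta

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])   # intercept + 2 features
true_theta = np.array([-0.5, 1.5, -1.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)
print(irls_logistic(X, y))    # roughly recovers true_theta
```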

Logistic regression: practical issues

It is very common to use regularized maximum likelihood:

p(y = ±1 | x, θ) = σ( y θ^T x ) = 1 / (1 + e^{−y θ^T x}),   p(θ) = Normal(0, λ^{-1} I)

ℓ(θ) = Σ_n log σ( y_n θ^T x_n ) − (λ/2) θ^T θ

IRLS takes O(N d³) per iteration, where N = number of training cases and d = dimension of the input x.

Quasi-Newton methods, which approximate the Hessian, work faster.
Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
Stochastic gradient descent can also be used if N is large (cf. the perceptron rule):

∇_θ ℓ_n = ( 1 − σ( y_n θ^T x_n ) ) y_n x_n − λ θ
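A sketch of the stochastic-gradient variant with the L2 penalty (my own toy code, using ±1 labels and the per-sample gradient above; applying λθ at every sample is an assumption of this sketch rather than the exact batch objective):

```python
import numpy as np

def sgd_logistic_l2(X, y, lam=0.1, rho=0.05, epochs=20, seed=0):
    """Stochastic gradient ascent on the regularized log-likelihood, y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):
            margin = y[n] * (X[n] @ theta)
            # grad_n = (1 - sigma(y_n theta^T x_n)) y_n x_n - lam * theta
            grad_n = (1.0 - 1.0 / (1.0 + np.exp(-margin))) * y[n] * X[n] - lam * theta
            theta += rho * grad_n
    return theta
```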

Page 11:

Example 2: linear regression

The conditional distribution: a Gaussian

p(y | x, θ, Σ) = (1 / ((2π)^{k/2} |Σ|^{1/2})) exp{ −½ (y − μ(x))^T Σ^{-1} (y − μ(x)) }
              ⇒ h(y, x) exp{ η^T y − A(η) }     (after rescaling)

where μ is a linear function:  μ(x) = θ^T x = η(x)

p(y | x) is an exponential family function, with

mean:  E[y | x] = μ = θ^T x

and canonical response function:  η = ξ = θ^T x

IRLS:

dμ/dη = 1,   W = I

θ^{t+1} = ( X^T W^t X )^{-1} X^T W^t z^t
        = ( X^T X )^{-1} X^T ( X θ^t + y − μ^t )
        = θ^t + ( X^T X )^{-1} X^T ( y − μ^t )

As t → ∞:   θ = ( X^T X )^{-1} X^T y

Steepest descent → Normal equations

Simple GMs are the building blocks of complex BNs

Density estimation: parametric and nonparametric methods
Regression: linear, conditional mixture, nonparametric
Classification: generative and discriminative approach

[Figure: the corresponding simple GMs — (μ, σ) → X for density estimation, X → Y for regression, Q and X for classification]

Page 12:

An (incomplete) genealogy of graphical models

The structures of most GMs (e.g., all listed here) are not learned from data, but designed by humans.

But such designs are useful and indeed favored, because they put human knowledge to good use …

MLE for general BNs

If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

ℓ(θ; D) = log p(D | θ) = log ∏_n ( ∏_i p(x_{n,i} | x_{n,π_i}, θ_i) ) = Σ_i ( Σ_n log p(x_{n,i} | x_{n,π_i}, θ_i) )

[Figure: example BN with tabular CPDs whose entries are indexed by parent configurations such as X2 = 0/1 and X5 = 0/1]

Page 13:

Example: decomposable likelihood of a directed model

Consider the distribution defined by the directed acyclic GM:

p(x | θ) = p(x1 | θ1) p(x2 | x1, θ2) p(x3 | x1, θ3) p(x4 | x2, x3, θ4)

This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the four-node BN (X1 → X2, X1 → X3, X2 → X4, X3 → X4) and its four node-plus-parents subnetworks]

MLE for BNs with tabular CPDs

Assume each CPD is represented as a table (multinomial), where

θ_ijk ≝ p(X_i = j | X_{π_i} = k)

Note that in the case of multiple parents, X_{π_i} will have a composite state, and the CPD will be a high-dimensional table. The sufficient statistics are counts of family configurations:

n_ijk ≝ Σ_n x_{n,i}^j x_{n,π_i}^k

The log-likelihood is

ℓ(θ; D) = log ∏_{i,j,k} θ_ijk^{n_ijk} = Σ_{i,j,k} n_ijk log θ_ijk

Using a Lagrange multiplier to enforce Σ_j θ_ijk = 1, we get:

θ^ML_ijk = n_ijk / Σ_{j'} n_ij'k
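A small sketch (my own, on a made-up two-node network) of the counting estimate θ^ML_ijk = n_ijk / Σ_{j'} n_ij'k for one node with a single parent:

```python
import numpy as np

def tabular_cpd_mle(child, parent, n_child_states, n_parent_states):
    """MLE of a tabular CPD p(child = j | parent = k) from fully observed data."""
    counts = np.zeros((n_parent_states, n_child_states))
    for j, k in zip(child, parent):
        counts[k, j] += 1                                # n_ijk: family-configuration counts
    return counts / counts.sum(axis=1, keepdims=True)    # normalize over j for each parent state k

parent = np.array([0, 0, 1, 1, 1, 0, 1, 0])     # toy observations of the parent variable
child  = np.array([1, 0, 1, 1, 0, 1, 1, 0])     # toy observations of the child variable
print(tabular_cpd_mle(child, parent, n_child_states=2, n_parent_states=2))
```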

Page 14:

MLE and Kullback-Leibler divergence

KL divergence:

KL( q(x) ‖ p(x) ) = Σ_x q(x) log( q(x) / p(x) )

Empirical distribution:

p̃(x) ≝ (1/N) Σ_{n=1}^{N} δ(x, x_n)

where δ(x, x_n) is a Kronecker delta function.

Max_θ (MLE) ⇔ Min_θ (KL):

KL( p̃(x) ‖ p(x | θ) ) = Σ_x p̃(x) log( p̃(x) / p(x | θ) )
                      = Σ_x p̃(x) log p̃(x) − Σ_x p̃(x) log p(x | θ)
                      = C − (1/N) Σ_n log p(x_n | θ)
                      = C − (1/N) ℓ(θ; D)

(where C does not depend on θ)

Parameter sharing

Consider a time-invariant (stationary) 1st-order Markov model:  X1 → X2 → X3 → … → XT

Initial state probability vector:   π_k ≝ p(X_1^k = 1)

State transition probability matrix:   A_ij ≝ p(X_t^j = 1 | X_{t-1}^i = 1)

The joint:

p(X | θ) = p(x_1 | π) ∏_{t=2}^{T} p(x_t | x_{t-1})

The log-likelihood:

ℓ(θ; D) = Σ_n log p(x_{n,1} | π) + Σ_n Σ_{t=2}^{T} log p(x_{n,t} | x_{n,t-1}, A)

Again, we optimize each parameter separately:
π is a multinomial frequency vector, and we've seen it before.
What about A?

Page 15:

Learning a Markov chain transition matrix

A is a stochastic matrix:  Σ_j A_ij = 1.  Each row of A is a multinomial distribution.

So the MLE of A_ij is the fraction of transitions from i to j:

A^ML_ij = #(i → j) / #(i → ·) = ( Σ_n Σ_{t=2}^{T} x_{n,t-1}^i x_{n,t}^j ) / ( Σ_n Σ_{t=2}^{T} x_{n,t-1}^i ),   ∀ i

Application: if the states X_t represent words, this is called a bigram language model.

Sparse data problem: if i → j did not occur in the data, we will have A_ij = 0, and then any future sequence containing the word pair i → j will have zero probability. A standard hack: backoff smoothing or deleted interpolation,

Ã_{i→·} = λ η + (1 − λ) A^ML_{i→·}
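A sketch of the bigram-style count estimate and the interpolation smoothing above (my own illustration; taking η to be the uniform distribution as the backoff is an assumption of this sketch):

```python
import numpy as np

def transition_mle(seqs, n_states):
    """A_ij^ML = #(i -> j) / #(i -> .), counted over all sequences."""
    counts = np.zeros((n_states, n_states))
    for seq in seqs:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    return counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)

def smooth(A_ml, lam=0.1):
    """Deleted-interpolation style smoothing: A~ = lam * eta + (1 - lam) * A^ML."""
    eta = np.full_like(A_ml, 1.0 / A_ml.shape[1])   # uniform backoff distribution
    return lam * eta + (1.0 - lam) * A_ml

seqs = [[0, 1, 1, 2, 0, 1], [2, 2, 1, 0, 0, 1]]     # toy state sequences
A_ml = transition_mle(seqs, n_states=3)
print(smooth(A_ml))                                  # no zero entries remain
```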

Bayesian language model

Global and local parameter independence:

[Figure: graphical model in which the rows A_{i·}, A_{i'·} and the initial distribution π, with hyperparameters α and β, are parents of the chain X1 → X2 → … → XT]

The posterior of A_{i·} and A_{i'·} is factorized despite the v-structure on X_t, because X_{t-1} acts like a multiplexer.

Assign a Dirichlet prior β_i to each row of the transition matrix:

A^Bayes_ij ≝ E[A_ij | D, β_i] = ( #(i → j) + β_{ij} ) / ( #(i → ·) + |β_i| ) = λ_i β'_{ij} + (1 − λ_i) A^ML_ij,

where |β_i| = Σ_j β_{ij}, β'_{ij} = β_{ij} / |β_i|, and λ_i = |β_i| / ( #(i → ·) + |β_i| ).

We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.).
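A short sketch of the posterior-mean (Bayesian) estimate with a Dirichlet prior on each row (my own; a symmetric prior with one pseudocount per successor state is an assumption of this sketch):

```python
import numpy as np

def transition_bayes(counts, beta):
    """Posterior mean A_ij^Bayes = (#(i->j) + beta_ij) / (#(i->.) + |beta_i|)."""
    return (counts + beta) / (counts.sum(axis=1, keepdims=True) + beta.sum(axis=1, keepdims=True))

counts = np.array([[3., 1., 0.],
                   [0., 2., 2.],
                   [1., 0., 1.]])              # toy transition counts #(i -> j)
beta = np.ones_like(counts)                    # symmetric Dirichlet pseudocounts
print(transition_bayes(counts, beta))          # rows sum to 1, no zeros
```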

Page 16:

Example: HMM — two scenarios

Supervised learning: estimation when the "right answer" is known.

Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands.
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.

Unsupervised learning: estimation when the “right answer” is unknown

Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.

QUESTION: Update the parameters θ of the model to maximize P(x|θ) — maximum likelihood (ML) estimation.

Recall the definition of an HMM

[Figure: HMM with hidden states y1, y2, y3, …, yT and observations x1, x2, x3, …, xT]

Transition probabilities between any two states:

p(y_t^j = 1 | y_{t-1}^i = 1) = a_{i,j}

or

p(y_t | y_{t-1}^i = 1) ~ Multinomial(a_{i,1}, a_{i,2}, …, a_{i,M}),   ∀ i ∈ I.

Start probabilities:

p(y_1) ~ Multinomial(π_1, π_2, …, π_M).

Emission probabilities associated with each state:

p(x_t | y_t^i = 1) ~ Multinomial(b_{i,1}, b_{i,2}, …, b_{i,K}),   ∀ i ∈ I,

or in general:

p(x_t | y_t^i = 1) ~ f(· | θ_i),   ∀ i ∈ I.

Page 17:

Supervised ML estimation

Given x = x1…xN for which the true state path y = y1…yN is known,

Define:
A_ij = # times state transition i→j occurs in y
B_ik = # times state i in y emits k in x

We can show that the maximum likelihood parameters θ are:

a^ML_ij = #(i → j) / #(i → ·) = ( Σ_n Σ_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j ) / ( Σ_n Σ_{t=2}^{T} y_{n,t-1}^i ) = A_ij / Σ_{j'} A_{ij'}

b^ML_ik = #(i → k) / #(i → ·) = ( Σ_n Σ_{t=1}^{T} y_{n,t}^i x_{n,t}^k ) / ( Σ_n Σ_{t=1}^{T} y_{n,t}^i ) = B_ik / Σ_{k'} B_{ik'}

What if x is continuous? We can treat { (x_{n,t}, y_{n,t}) : t = 1..T, n = 1..N } as N×T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian …

Supervised ML estimation, ctd.

Intuition:

When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data.

Drawback:
Given little data, there may be overfitting: P(x|θ) is maximized, but θ is unreasonable; 0 probabilities are VERY BAD.

Example: given 10 casino rolls, we observe

x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F

Then: a_FF = 1; a_FL = 0; b_F1 = b_F3 = b_F6 = .2; b_F2 = .3; b_F4 = 0; b_F5 = .1
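The emission estimates here are just normalized counts; a quick sketch of mine that reproduces the b_F numbers from the ten rolls above:

```python
import numpy as np

rolls = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]         # observed x
states = ["F"] * 10                            # observed y (all Fair)

counts = np.zeros(6)
for x, y in zip(rolls, states):
    if y == "F":
        counts[x - 1] += 1                     # B_Fk: # times state F emits face k
b_F = counts / counts.sum()
print(b_F)    # [0.2, 0.3, 0.2, 0.0, 0.1, 0.2] -> matches b_F1..b_F6 above
```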

Page 18:

Pseudocounts

Solution for small training sets:

Add pseudocounts:
A_ij = # times state transition i→j occurs in y + R_ij
B_ik = # times state i in y emits k in x + S_ik

R_ij, S_ik are pseudocounts representing our prior belief.
Total pseudocounts: R_i = Σ_j R_ij, S_i = Σ_k S_ik — the "strength" of the prior belief, i.e., the total number of imaginary instances in the prior.

Larger total pseudocounts ⇒ stronger prior belief.

Small total pseudocounts: just to avoid 0 probabilities --- smoothing

This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts.

How to define parameter prior?

[Figure: example BN — Earthquake and Burglary → Alarm; Earthquake → Radio; Alarm → Call]

Factorization:

p(x) = ∏_{i=1}^{M} p(x_i | x_{π_i})

Local distributions defined by, e.g., multinomial parameters:

p(x_i = j | x_{π_i} = k) = θ_{ijk}

Assumptions (Geiger & Heckerman 97, 99):
Complete Model Equivalence
Global Parameter Independence
Local Parameter Independence
Likelihood and Prior Modularity

p(θ | G) = ?

Page 19:

Global & Local Parameter Independence

Global Parameter Independence — for every DAG model:

p(θ_m | G) = ∏_{i=1}^{M} p(θ_i | G)

[Figure: the Earthquake / Burglary / Radio / Alarm / Call network]

Local Parameter Independence — for every node:

p(θ_i | G) = ∏_{j=1}^{q_i} p(θ_{x_i | x_{π_i}^j} | G)

For example, θ_{Call | Alarm = YES} is independent of θ_{Call | Alarm = NO}.

Parameter Independence, Graphical View

[Figure: global and local parameter independence shown as a graphical model in which θ1 and θ2|1 are separate parent nodes of X1 and X2, shared across sample 1, sample 2, …, sample M]

Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!

Page 20:

Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97,99)

Discrete DAG models:

x_i | x_{π_i} ~ Multi(θ)

Dirichlet prior:

P(θ) = ( Γ(Σ_k α_k) / ∏_k Γ(α_k) ) ∏_k θ_k^{α_k − 1} = C(α) ∏_k θ_k^{α_k − 1}

Gaussian DAG models:

x_i | x_{π_i} ~ Normal(μ, Σ)

Normal prior:

p(μ | ν, Ψ) = (2π)^{-n/2} |Ψ|^{-1/2} exp{ −½ (μ − ν)^T Ψ^{-1} (μ − ν) }

Normal-Wishart prior:

p(μ | ν, α_μ, W) = Normal( ν, (α_μ W)^{-1} ),
p(W | α_w, T) = c(n, α_w) |T|^{α_w/2} |W|^{(α_w − n − 1)/2} exp{ −½ tr(T W) },   where W = Σ^{-1}.

Summary: Learning GMs

For exponential family distributions, MLE amounts to moment matching.

GLIM: with the natural response, Iteratively Reweighted Least Squares serves as a general learning algorithm.

For fully observed BNs, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored.