
Page 1:

School of Computer Science

Probabilistic Graphical Models

Learning generalized linear models and tabular CPT of fully observed BN

Eric Xing, Lecture 7, September 30, 2009

Reading:

[Figure: four example directed graphical models over the nodes X1, X2, X3, X4]

Exponential family

For a numeric random variable X,

p(x | η) = h(x) exp{ η^T T(x) − A(η) } = (1/Z(η)) h(x) exp{ η^T T(x) }

is an exponential family distribution with natural (canonical) parameter η.

Function T(x) is a sufficient statistic.

Function A(η) = log Z(η) is the log normalizer.

Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
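As a concrete illustration (my own sketch, not part of the original slides), here is the Bernoulli case written in this form, using the standard identities T(x) = x, η = log(μ/(1−μ)) and A(η) = log(1 + e^η):

```python
import numpy as np

def bernoulli_exp_family(x, eta):
    """Bernoulli pmf via the exponential-family form
    p(x | eta) = h(x) exp{eta * T(x) - A(eta)}, with T(x) = x and h(x) = 1."""
    A = np.log1p(np.exp(eta))          # log normalizer A(eta) = log(1 + e^eta)
    return np.exp(eta * x - A)

mu = 0.3
eta = np.log(mu / (1.0 - mu))          # natural parameter for mean mu
print(bernoulli_exp_family(1, eta))    # ~0.3
print(bernoulli_exp_family(0, eta))    # ~0.7
```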

Page 2:

Recall: Linear Regression

Let us assume that the target variable and the inputs are related by the equation:

y_i = θ^T x_i + ε_i

where ε is an error term of unmodeled effects or random noise.

Now assume that ε follows a Gaussian N(0, σ); then we have:

p(y_i | x_i; θ) = (1 / (√(2π) σ)) exp( − (y_i − θ^T x_i)² / (2σ²) )

Recall: Logistic Regression (sigmoid classifier)

The conditional distribution: a Bernoulli

p(y | x) = μ(x)^y (1 − μ(x))^{1−y}

where μ is a logistic function

We can use the brute-force gradient method as in LR,

μ(x) = 1 / (1 + e^{−θ^T x})

But we can also apply generic results by observing that p(y|x) is an exponential family function, more specifically, a generalized linear model!
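To make the brute-force gradient option concrete, here is a small sketch (mine, on synthetic data, not code from the lecture) of gradient ascent on the Bernoulli log-likelihood with μ(x) = 1/(1 + e^{−θ^T x}):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # toy design matrix (hypothetical data)
p = sigmoid(X @ np.array([1.0, -2.0, 0.5]))
y = (rng.uniform(size=100) < p).astype(float)    # Bernoulli labels

theta = np.zeros(3)
rho = 0.1                                        # step size
for _ in range(200):
    mu = sigmoid(X @ theta)                      # mu_n = p(y_n = 1 | x_n)
    grad = X.T @ (y - mu)                        # gradient of the Bernoulli log-likelihood
    theta += rho * grad / len(y)
print(theta)                                     # roughly recovers [1.0, -2.0, 0.5]
```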

Page 3:

Example: Multivariate Gaussian Distribution

For a continuous vector random variable X ∈ R^k:

p(x | μ, Σ) = (1 / ((2π)^{k/2} |Σ|^{1/2})) exp{ −½ (x − μ)^T Σ^{-1} (x − μ) }
            = (1 / (2π)^{k/2}) exp{ −½ tr(Σ^{-1} x x^T) + μ^T Σ^{-1} x − ½ μ^T Σ^{-1} μ − ½ log|Σ| }

Exponential family representation:

Natural parameter:  η = [ Σ^{-1} μ ; vec(−½ Σ^{-1}) ],  i.e.  η1 = Σ^{-1} μ  and  η2 = −½ Σ^{-1}
Sufficient statistic:  T(x) = [ x ; vec(x x^T) ]
Log normalizer:  A(η) = ½ μ^T Σ^{-1} μ + ½ log|Σ| = −¼ tr(η2^{-1} η1 η1^T) − ½ log|−2 η2|
h(x) = (2π)^{-k/2}
Moment parameter:  μ = E[T(x)] = [ μ ; vec(Σ + μ μ^T) ]

Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positivity, the parameters are constrained and have lower degrees of freedom).
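A small numerical sketch (my own illustration, not from the slides) of the map from moment parameters (μ, Σ) to the natural parameters η1 = Σ^{-1}μ, η2 = −½Σ^{-1} described above, and of its inverse:

```python
import numpy as np

def gaussian_natural_params(mu, Sigma):
    """Natural parameters of a multivariate Gaussian: eta1 = Sigma^{-1} mu, eta2 = -1/2 Sigma^{-1}."""
    Lambda = np.linalg.inv(Sigma)      # precision matrix
    return Lambda @ mu, -0.5 * Lambda

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
eta1, eta2 = gaussian_natural_params(mu, Sigma)

# Invert the map to check the 1-to-1 correspondence with the moment parameters:
Sigma_back = np.linalg.inv(-2.0 * eta2)
mu_back = Sigma_back @ eta1
print(np.allclose(Sigma_back, Sigma), np.allclose(mu_back, mu))   # True True
```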

Example: Multinomial distribution

For a binary vector random variable x ~ multi(x | π):

p(x | π) = π1^{x1} π2^{x2} ⋯ πK^{xK} = exp{ Σ_k x_k ln π_k }
         = exp{ Σ_{k=1}^{K−1} x_k ln π_k + (1 − Σ_{k=1}^{K−1} x_k) ln(1 − Σ_{k=1}^{K−1} π_k) }
         = exp{ Σ_{k=1}^{K−1} x_k ln( π_k / (1 − Σ_{j=1}^{K−1} π_j) ) + ln(1 − Σ_{k=1}^{K−1} π_k) }

Exponential family representation:

η = [ ln(π_k / π_K) ; 0 ]
T(x) = [ x ]
A(η) = −ln(1 − Σ_{k=1}^{K−1} π_k) = ln( Σ_{k=1}^{K} e^{η_k} )
h(x) = 1
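A short sketch (my own, under the parameterization above with π_K as the reference class) checking that η_k = ln(π_k/π_K), that A(η) = ln Σ_k e^{η_k}, and that the softmax of η recovers π:

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])                 # K = 3 class probabilities
eta = np.log(pi / pi[-1])                      # eta_k = ln(pi_k / pi_K); note eta_K = 0
A = np.log(np.sum(np.exp(eta)))                # A(eta) = ln(sum_k e^{eta_k}) = -ln(pi_K)
pi_back = np.exp(eta - A)                      # softmax inverts the map
print(np.allclose(pi_back, pi))                # True
```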

Page 4:

Why exponential family?

Moment generating property:

dA/dη = d/dη log Z(η)
      = (1/Z(η)) dZ(η)/dη
      = ∫ T(x) h(x) exp{ η^T T(x) } dx / Z(η)
      = ∫ T(x) h(x) exp{ η^T T(x) − A(η) } dx
      = E[T(x)]

d²A/dη² = ∫ T(x) h(x) (exp{ η^T T(x) } / Z(η)) T(x) dx − ∫ T(x) h(x) exp{ η^T T(x) } dx · (1/Z(η)²) dZ(η)/dη
        = E[T²(x)] − E[T(x)]²
        = Var[T(x)]

Moment estimation

We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η). The qth derivative gives the qth centered moment:

dA(η)/dη = mean
d²A(η)/dη² = variance
…

When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
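To make the moment-generating property concrete, here is a small numerical check (my own sketch) for the Poisson case, where A(η) = e^η with η = log λ, so both dA/dη and d²A/dη² should equal λ:

```python
import numpy as np

lam = 2.5
eta = np.log(lam)                    # Poisson natural parameter eta = log(lambda)
A = lambda e: np.exp(e)              # log normalizer A(eta) = e^eta

h = 1e-5
dA = (A(eta + h) - A(eta - h)) / (2 * h)             # numerical first derivative
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # numerical second derivative
print(dA, d2A)                       # both approximately lambda = 2.5 (mean and variance)
```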

Page 5:

Moment vs. canonical parameters

The moment parameter μ can be derived from the natural (canonical) parameter:

dA(η)/dη = E[T(x)] ≝ μ

A(η) is convex, since

d²A(η)/dη² = Var[T(x)] > 0

[Figure: plot of the convex log normalizer A(η), with a particular value η* marked]

Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):

η ≝ ψ(μ)

A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by μ (the moment parameterization).

MLE for Exponential Family

For i.i.d. data, the log-likelihood is

ℓ(η; D) = log ∏_n h(x_n) exp{ η^T T(x_n) − A(η) }
        = Σ_n log h(x_n) + η^T ( Σ_n T(x_n) ) − N A(η)

Take derivatives and set to zero:

∂ℓ/∂η = Σ_n T(x_n) − N ∂A(η)/∂η = 0
⇒ ∂A(η)/∂η = (1/N) Σ_n T(x_n)
⇒ μ̂_MLE = (1/N) Σ_n T(x_n)

This amounts to moment matching.

We can infer the canonical parameters using  η̂_MLE = ψ(μ̂_MLE).
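A minimal sketch of moment matching (mine, not from the lecture) for the Poisson case: μ̂_MLE is the sample mean of T(x) = x, and the canonical parameter follows from η = ψ(μ) = log μ:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.0, size=10_000)   # i.i.d. Poisson data (synthetic)

mu_mle = x.mean()                       # moment matching: mu_MLE = (1/N) sum_n T(x_n)
eta_mle = np.log(mu_mle)                # canonical parameter via eta = psi(mu) = log(mu)
print(mu_mle, eta_mle)                  # ~4.0 and ~log(4.0)
```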

Page 6:

Sufficiency

For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).

We can throw away X for the purpose of inference w.r.t. θ .

Bayesian view:  p(θ | T(x), x) = p(θ | T(x))

Frequentist view:  p(x | T(x), θ) = p(x | T(x))

The Neyman factorization theorem

T(x) is sufficient for θ if

p(x, T(x), θ) = ψ1(T(x), θ) ψ2(x, T(x))
⇒ p(x | θ) = g(T(x), θ) h(x, T(x))

Examples

Gaussian:
  η = [ Σ^{-1} μ ; vec(−½ Σ^{-1}) ]
  T(x) = [ x ; vec(x x^T) ]
  A(η) = ½ μ^T Σ^{-1} μ + ½ log|Σ|
  h(x) = (2π)^{-k/2}
  ⇒ μ̂_MLE = (1/N) Σ_n T(x_n),  whose first component is  (1/N) Σ_n x_n

Multinomial:
  η = [ ln(π_k / π_K) ; 0 ]
  T(x) = [ x ]
  A(η) = −ln(1 − Σ_{k=1}^{K−1} π_k) = ln( Σ_{k=1}^{K} e^{η_k} )
  h(x) = 1
  ⇒ μ̂_MLE = (1/N) Σ_n x_n

Poisson:
  η = log λ
  T(x) = x
  A(η) = λ = e^{η}
  h(x) = 1 / x!
  ⇒ μ̂_MLE = (1/N) Σ_n x_n

Page 7:

Generalized Linear Models (GLIMs)

The graphical model:
  Linear regression
  Discriminative linear classification

Commonality: model E_p(Y) = μ = f(θ^T X)
  What is p()? The conditional distribution of Y.
  What is f()? The response function.

GLIM:
The observed input x is assumed to enter into the model via a linear combination of its elements, ξ = θ^T x.
The conditional mean μ is represented as a function f(ξ) of ξ, where f is known as the response function.
The observed output y is assumed to be characterized by an exponential family distribution with conditional mean μ.

GLIM, cont.

[Figure: GLIM pipeline — x →(θ)→ ξ →(f)→ μ →(ψ)→ η, with y ~ EXP(η)]

The choice of exponential family is constrained by the nature of the data Y:

p(y | η) = h(y) exp{ η^T y − A(η) }
⇒ p(y | η, φ) = h(y, φ) exp{ (η^T y − A(η)) / φ }

Example: y is a continuous vector → multivariate Gaussian
         y is a class label → Bernoulli or multinomial

The choice of the response function: f must satisfy some mild constraints (e.g., range in [0,1], positivity, …).

Canonical response function:  f = ψ^{-1}(·)

In this case θ^T x directly corresponds to the canonical parameter η.
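As an illustration of the canonical response function f = ψ^{-1} (a sketch using the standard canonical links, not code from the slides): identity for the Gaussian, the logistic function for the Bernoulli, and exp for the Poisson.

```python
import numpy as np

# Canonical response functions f = psi^{-1}: map eta = theta^T x to the mean mu.
canonical_response = {
    "gaussian":  lambda eta: eta,                         # identity link
    "bernoulli": lambda eta: 1.0 / (1.0 + np.exp(-eta)),  # logistic (inverse logit)
    "poisson":   lambda eta: np.exp(eta),                 # exponential (inverse log link)
}

eta = 0.7
for family, f in canonical_response.items():
    print(family, f(eta))
```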

Page 8:

MLE for GLIMs with natural response

Log-likelihood

ℓ(θ; D) = Σ_n log h(y_n) + Σ_n ( θ^T x_n y_n − A(η_n) )

Derivative of the log-likelihood:

dℓ/dθ = Σ_n ( x_n y_n − (dA(η_n)/dη_n)(dη_n/dθ) )
      = Σ_n ( y_n − μ_n ) x_n        (since dA/dη_n = μ_n and η_n = θ^T x_n)
      = X^T ( y − μ )

This is a fixed point function


Online learning for canonical GLIMs

Stochastic gradient ascent = least mean squares (LMS) algorithm:

∇_θ ℓ = X^T ( y − μ ),   because μ is a function of θ

θ^{t+1} = θ^t + ρ ( y_n − μ_n^t ) x_n

where μ_n^t = (θ^t)^T x_n and ρ is a step size.
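A minimal sketch of this stochastic update for the linear-Gaussian case, where μ_n = θ^T x_n (my own toy data; the same loop applies to any canonical GLIM once the response function is swapped in):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)   # toy targets

theta = np.zeros(3)
rho = 0.01                                     # step size
for t in range(5000):
    n = t % len(y)                             # cycle through the data
    mu_n = X[n] @ theta                        # canonical Gaussian GLIM: mu_n = theta^T x_n
    theta = theta + rho * (y[n] - mu_n) * X[n] # theta^{t+1} = theta^t + rho (y_n - mu_n) x_n
print(theta)                                   # close to [1.0, -2.0, 0.5]
```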

Batch learning for canonical GLIMs

The Hessian matrix:

H = d²ℓ / (dθ dθ^T) = d/dθ^T Σ_n ( y_n − μ_n ) x_n
  = − Σ_n x_n ( dμ_n / dθ^T )
  = − Σ_n x_n ( dμ_n / dη_n ) ( dη_n / dθ^T )
  = − Σ_n x_n ( dμ_n / dη_n ) x_n^T        (since η_n = θ^T x_n)
  = − X^T W X

where X = [ — x_n^T — ] is the design matrix (one row per data case), y = (y_1, …, y_N)^T, and

W = diag( dμ_1/dη_1, …, dμ_N/dη_N ),

which can be computed by calculating the second derivative of A(η_n).

Page 9:

Recall LMS

Cost function, in matrix form:

J(θ) = ½ Σ_{i=1}^{n} ( x_i^T θ − y_i )² = ½ ( Xθ − y )^T ( Xθ − y )

where X = [ — x_i^T — ] is the design matrix and y = (y_1, …, y_n)^T.

To minimize J(θ), take the derivative and set it to zero:

∇_θ J = ½ ∇_θ ( θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y )
      = ½ ( 2 X^T X θ − 2 X^T y )
      = X^T X θ − X^T y = 0

⇒ X^T X θ = X^T y     (the normal equations)

θ* = ( X^T X )^{-1} X^T y
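A one-line check of the normal equations on synthetic data (my own sketch); in practice one solves the linear system rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.05 * rng.normal(size=200)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)   # solves X^T X theta = X^T y
print(theta_star)                                # close to [2.0, 0.0, -1.0]
```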

Iteratively Reweighted Least Squares (IRLS)

Recall the Newton-Raphson method with cost function J:

θ^{t+1} = θ^t − H^{-1} ∇_θ J

We now have:

∇_θ ℓ = X^T ( y − μ )
H = − X^T W X

Now:

θ^{t+1} = θ^t − H^{-1} ∇_θ ℓ
        = ( X^T W^t X )^{-1} [ X^T W^t X θ^t + X^T ( y − μ^t ) ]
        = ( X^T W^t X )^{-1} X^T W^t z^t

where the adjusted response is

z^t = X θ^t + (W^t)^{-1} ( y − μ^t )

This can be understood as solving the following "iteratively reweighted least squares" problem:

θ^{t+1} = argmin_θ ( z − Xθ )^T W ( z − Xθ )

Page 10:

Example 1: logistic regression (sigmoid classifier)

The conditional distribution: a Bernoulli

p(y | x) = μ(x)^y (1 − μ(x))^{1−y}

where μ is a logistic function

p(y|x) is an exponential family function, with mean:

E[y | x] = μ(x) = 1 / (1 + e^{−η(x)})

and canonical response function

η = ξ = θ^T x

IRLS:

dμ/dη = μ (1 − μ)

W = diag( μ_1(1 − μ_1), …, μ_N(1 − μ_N) )
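Putting the pieces together, here is a compact IRLS sketch for logistic regression (my own illustration on synthetic data), following the update θ^{t+1} = (X^T W X)^{-1} X^T W z with W = diag(μ_n(1 − μ_n)):

```python
import numpy as np

def irls_logistic(X, y, iters=20):
    """IRLS / Newton-Raphson for logistic regression with the canonical (logit) link."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ theta
        mu = 1.0 / (1.0 + np.exp(-eta))               # mu_n = sigma(theta^T x_n)
        w = mu * (1.0 - mu)                           # W = diag(mu_n (1 - mu_n))
        z = eta + (y - mu) / np.maximum(w, 1e-10)     # adjusted response z^t
        WX = X * w[:, None]
        theta = np.linalg.solve(X.T @ WX, X.T @ (w * z))   # weighted least squares step
    return theta

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])   # intercept + 2 features
true_theta = np.array([-0.5, 1.5, -1.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)
print(irls_logistic(X, y))    # roughly recovers true_theta
```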

Logistic regression: practical issues

It is very common to use regularized maximum likelihood:

p(y = ±1 | x, θ) = σ( y θ^T x ) = 1 / (1 + e^{−y θ^T x}),   p(θ) = Normal(0, λ^{-1} I)

ℓ(θ) = Σ_n log σ( y_n θ^T x_n ) − (λ/2) θ^T θ

IRLS takes O(N d³) per iteration, where N = number of training cases and d = dimension of the input x.

Quasi-Newton methods, which approximate the Hessian, work faster.
Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
Stochastic gradient descent can also be used if N is large (cf. the perceptron rule):

∇_θ ℓ_n = ( 1 − σ( y_n θ^T x_n ) ) y_n x_n − λ θ
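A sketch of the stochastic-gradient variant with the L2 penalty (my own toy code, using ±1 labels and the per-sample gradient above; applying λθ at every sample is an assumption of this sketch rather than the exact batch objective):

```python
import numpy as np

def sgd_logistic_l2(X, y, lam=0.1, rho=0.05, epochs=20, seed=0):
    """Stochastic gradient ascent on the regularized log-likelihood, y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):
            margin = y[n] * (X[n] @ theta)
            # grad_n = (1 - sigma(y_n theta^T x_n)) y_n x_n - lam * theta
            grad_n = (1.0 - 1.0 / (1.0 + np.exp(-margin))) * y[n] * X[n] - lam * theta
            theta += rho * grad_n
    return theta
```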

Page 11:

Example 2: linear regression

The conditional distribution: a Gaussian

p(y | x, θ, Σ) = (1 / ((2π)^{k/2} |Σ|^{1/2})) exp{ −½ (y − μ(x))^T Σ^{-1} (y − μ(x)) }
              ⇒ h(y, x) exp{ η^T y − A(η) }     (after rescaling)

where μ is a linear function:  μ(x) = θ^T x = η(x)

p(y | x) is an exponential family function, with

mean:  E[y | x] = μ = θ^T x

and canonical response function:  η = ξ = θ^T x

IRLS:

dμ/dη = 1,   W = I

θ^{t+1} = ( X^T W^t X )^{-1} X^T W^t z^t
        = ( X^T X )^{-1} X^T ( X θ^t + y − μ^t )
        = θ^t + ( X^T X )^{-1} X^T ( y − μ^t )

As t → ∞:   θ = ( X^T X )^{-1} X^T y

Steepest descent → Normal equations

Simple GMs are the building blocks of complex BNs

Density estimation: parametric and nonparametric methods
Regression: linear, conditional mixture, nonparametric
Classification: generative and discriminative approach

[Figure: the corresponding simple GMs — (μ, σ) → X for density estimation, X → Y for regression, Q and X for classification]

Page 12:

An (incomplete) genealogy of graphical models

The structures of most GMs (e.g., all listed here) are not learned from data, but designed by humans.

But such designs are useful and indeed favored, because they put human knowledge to good use …

MLE for general BNs

If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

ℓ(θ; D) = log p(D | θ) = log ∏_n ( ∏_i p(x_{n,i} | x_{n,π_i}, θ_i) ) = Σ_i ( Σ_n log p(x_{n,i} | x_{n,π_i}, θ_i) )

[Figure: example BN with tabular CPDs whose entries are indexed by parent configurations such as X2 = 0/1 and X5 = 0/1]

Page 13:

Example: decomposable likelihood of a directed model

Consider the distribution defined by the directed acyclic GM:

p(x | θ) = p(x1 | θ1) p(x2 | x1, θ2) p(x3 | x1, θ3) p(x4 | x2, x3, θ4)

This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the four-node BN (X1 → X2, X1 → X3, X2 → X4, X3 → X4) and its four node-plus-parents subnetworks]

MLE for BNs with tabular CPDs

Assume each CPD is represented as a table (multinomial), where

θ_ijk ≝ p(X_i = j | X_{π_i} = k)

Note that in the case of multiple parents, X_{π_i} will have a composite state, and the CPD will be a high-dimensional table. The sufficient statistics are counts of family configurations:

n_ijk ≝ Σ_n x_{n,i}^j x_{n,π_i}^k

The log-likelihood is

ℓ(θ; D) = log ∏_{i,j,k} θ_ijk^{n_ijk} = Σ_{i,j,k} n_ijk log θ_ijk

Using a Lagrange multiplier to enforce Σ_j θ_ijk = 1, we get:

θ^ML_ijk = n_ijk / Σ_{j'} n_ij'k
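A small sketch (my own, on a made-up two-node network) of the counting estimate θ^ML_ijk = n_ijk / Σ_{j'} n_ij'k for one node with a single parent:

```python
import numpy as np

def tabular_cpd_mle(child, parent, n_child_states, n_parent_states):
    """MLE of a tabular CPD p(child = j | parent = k) from fully observed data."""
    counts = np.zeros((n_parent_states, n_child_states))
    for j, k in zip(child, parent):
        counts[k, j] += 1                                # n_ijk: family-configuration counts
    return counts / counts.sum(axis=1, keepdims=True)    # normalize over j for each parent state k

parent = np.array([0, 0, 1, 1, 1, 0, 1, 0])     # toy observations of the parent variable
child  = np.array([1, 0, 1, 1, 0, 1, 1, 0])     # toy observations of the child variable
print(tabular_cpd_mle(child, parent, n_child_states=2, n_parent_states=2))
```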

Page 14:

MLE and Kullback-Leibler divergence

KL divergence:

KL( q(x) ‖ p(x) ) = Σ_x q(x) log( q(x) / p(x) )

Empirical distribution:

p̃(x) ≝ (1/N) Σ_{n=1}^{N} δ(x, x_n)

where δ(x, x_n) is a Kronecker delta function.

Max_θ (MLE) ⇔ Min_θ (KL):

KL( p̃(x) ‖ p(x | θ) ) = Σ_x p̃(x) log( p̃(x) / p(x | θ) )
                      = Σ_x p̃(x) log p̃(x) − Σ_x p̃(x) log p(x | θ)
                      = C − (1/N) Σ_n log p(x_n | θ)
                      = C − (1/N) ℓ(θ; D)

(where C does not depend on θ)

Parameter sharing

Consider a time-invariant (stationary) 1st-order Markov model:  X1 → X2 → X3 → … → XT

Initial state probability vector:   π_k ≝ p(X_1^k = 1)

State transition probability matrix:   A_ij ≝ p(X_t^j = 1 | X_{t-1}^i = 1)

The joint:

p(X | θ) = p(x_1 | π) ∏_{t=2}^{T} p(x_t | x_{t-1})

The log-likelihood:

ℓ(θ; D) = Σ_n log p(x_{n,1} | π) + Σ_n Σ_{t=2}^{T} log p(x_{n,t} | x_{n,t-1}, A)

Again, we optimize each parameter separately:
π is a multinomial frequency vector, and we've seen it before.
What about A?

Page 15:

Learning a Markov chain transition matrix

A is a stochastic matrix:  Σ_j A_ij = 1.  Each row of A is a multinomial distribution.

So the MLE of A_ij is the fraction of transitions from i to j:

A^ML_ij = #(i → j) / #(i → ·) = ( Σ_n Σ_{t=2}^{T} x_{n,t-1}^i x_{n,t}^j ) / ( Σ_n Σ_{t=2}^{T} x_{n,t-1}^i ),   ∀ i

Application: if the states X_t represent words, this is called a bigram language model.

Sparse data problem: if i → j did not occur in the data, we will have A_ij = 0, and then any future sequence containing the word pair i → j will have zero probability. A standard hack: backoff smoothing or deleted interpolation,

Ã_{i→·} = λ η + (1 − λ) A^ML_{i→·}
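A sketch of the bigram-style count estimate and the interpolation smoothing above (my own illustration; taking η to be the uniform distribution as the backoff is an assumption of this sketch):

```python
import numpy as np

def transition_mle(seqs, n_states):
    """A_ij^ML = #(i -> j) / #(i -> .), counted over all sequences."""
    counts = np.zeros((n_states, n_states))
    for seq in seqs:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    return counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)

def smooth(A_ml, lam=0.1):
    """Deleted-interpolation style smoothing: A~ = lam * eta + (1 - lam) * A^ML."""
    eta = np.full_like(A_ml, 1.0 / A_ml.shape[1])   # uniform backoff distribution
    return lam * eta + (1.0 - lam) * A_ml

seqs = [[0, 1, 1, 2, 0, 1], [2, 2, 1, 0, 0, 1]]     # toy state sequences
A_ml = transition_mle(seqs, n_states=3)
print(smooth(A_ml))                                  # no zero entries remain
```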

Bayesian language model

Global and local parameter independence:

[Figure: graphical model in which the rows A_{i·}, A_{i'·} and the initial distribution π, with hyperparameters α and β, are parents of the chain X1 → X2 → … → XT]

The posterior of A_{i·} and A_{i'·} is factorized despite the v-structure on X_t, because X_{t-1} acts like a multiplexer.

Assign a Dirichlet prior β_i to each row of the transition matrix:

A^Bayes_ij ≝ E[A_ij | D, β_i] = ( #(i → j) + β_{ij} ) / ( #(i → ·) + |β_i| ) = λ_i β'_{ij} + (1 − λ_i) A^ML_ij,

where |β_i| = Σ_j β_{ij}, β'_{ij} = β_{ij} / |β_i|, and λ_i = |β_i| / ( #(i → ·) + |β_i| ).

We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.).
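A short sketch of the posterior-mean (Bayesian) estimate with a Dirichlet prior on each row (my own; a symmetric prior with one pseudocount per successor state is an assumption of this sketch):

```python
import numpy as np

def transition_bayes(counts, beta):
    """Posterior mean A_ij^Bayes = (#(i->j) + beta_ij) / (#(i->.) + |beta_i|)."""
    return (counts + beta) / (counts.sum(axis=1, keepdims=True) + beta.sum(axis=1, keepdims=True))

counts = np.array([[3., 1., 0.],
                   [0., 2., 2.],
                   [1., 0., 1.]])              # toy transition counts #(i -> j)
beta = np.ones_like(counts)                    # symmetric Dirichlet pseudocounts
print(transition_bayes(counts, beta))          # rows sum to 1, no zeros
```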

Page 16:

Example: HMM — two scenarios

Supervised learning: estimation when the "right answer" is known.

Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands.
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.

Unsupervised learning: estimation when the “right answer” is unknown

Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.

QUESTION: Update the parameters θ of the model to maximize P(x|θ) — maximum likelihood (ML) estimation.

Recall the definition of an HMM

[Figure: HMM with hidden states y1, y2, y3, …, yT and observations x1, x2, x3, …, xT]

Transition probabilities between any two states:

p(y_t^j = 1 | y_{t-1}^i = 1) = a_{i,j}

or

p(y_t | y_{t-1}^i = 1) ~ Multinomial(a_{i,1}, a_{i,2}, …, a_{i,M}),   ∀ i ∈ I.

Start probabilities:

p(y_1) ~ Multinomial(π_1, π_2, …, π_M).

Emission probabilities associated with each state:

p(x_t | y_t^i = 1) ~ Multinomial(b_{i,1}, b_{i,2}, …, b_{i,K}),   ∀ i ∈ I,

or in general:

p(x_t | y_t^i = 1) ~ f(· | θ_i),   ∀ i ∈ I.

Page 17:

Supervised ML estimation

Given x = x1…xN for which the true state path y = y1…yN is known,

Define:
A_ij = # times state transition i→j occurs in y
B_ik = # times state i in y emits k in x

We can show that the maximum likelihood parameters θ are:

a^ML_ij = #(i → j) / #(i → ·) = ( Σ_n Σ_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j ) / ( Σ_n Σ_{t=2}^{T} y_{n,t-1}^i ) = A_ij / Σ_{j'} A_{ij'}

b^ML_ik = #(i → k) / #(i → ·) = ( Σ_n Σ_{t=1}^{T} y_{n,t}^i x_{n,t}^k ) / ( Σ_n Σ_{t=1}^{T} y_{n,t}^i ) = B_ik / Σ_{k'} B_{ik'}

What if x is continuous? We can treat { (x_{n,t}, y_{n,t}) : t = 1..T, n = 1..N } as N×T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian …

Supervised ML estimation, ctd.

Intuition:

When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data.

Drawback:
Given little data, there may be overfitting: P(x|θ) is maximized, but θ is unreasonable; 0 probabilities are VERY BAD.

Example: given 10 casino rolls, we observe

x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F

Then: a_FF = 1; a_FL = 0; b_F1 = b_F3 = b_F6 = .2; b_F2 = .3; b_F4 = 0; b_F5 = .1
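The emission estimates here are just normalized counts; a quick sketch of mine that reproduces the b_F numbers from the ten rolls above:

```python
import numpy as np

rolls = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]         # observed x
states = ["F"] * 10                            # observed y (all Fair)

counts = np.zeros(6)
for x, y in zip(rolls, states):
    if y == "F":
        counts[x - 1] += 1                     # B_Fk: # times state F emits face k
b_F = counts / counts.sum()
print(b_F)    # [0.2, 0.3, 0.2, 0.0, 0.1, 0.2] -> matches b_F1..b_F6 above
```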

Page 18:

Pseudocounts

Solution for small training sets:

Add pseudocounts:
A_ij = # times state transition i→j occurs in y + R_ij
B_ik = # times state i in y emits k in x + S_ik

R_ij, S_ik are pseudocounts representing our prior belief.
Total pseudocounts: R_i = Σ_j R_ij, S_i = Σ_k S_ik — the "strength" of the prior belief, i.e., the total number of imaginary instances in the prior.

Larger total pseudocounts ⇒ stronger prior belief.

Small total pseudocounts: just to avoid 0 probabilities --- smoothing

This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts.

How to define parameter prior?

[Figure: example BN — Earthquake and Burglary → Alarm; Earthquake → Radio; Alarm → Call]

Factorization:

p(x) = ∏_{i=1}^{M} p(x_i | x_{π_i})

Local distributions defined by, e.g., multinomial parameters:

p(x_i = j | x_{π_i} = k) = θ_{ijk}

Assumptions (Geiger & Heckerman 97, 99):
Complete Model Equivalence
Global Parameter Independence
Local Parameter Independence
Likelihood and Prior Modularity

p(θ | G) = ?

Page 19:

Global & Local Parameter Independence

Global Parameter Independence — for every DAG model:

p(θ_m | G) = ∏_{i=1}^{M} p(θ_i | G)

[Figure: the Earthquake / Burglary / Radio / Alarm / Call network]

Local Parameter Independence — for every node:

p(θ_i | G) = ∏_{j=1}^{q_i} p(θ_{x_i | x_{π_i}^j} | G)

For example, θ_{Call | Alarm = YES} is independent of θ_{Call | Alarm = NO}.

Parameter Independence, Graphical View

[Figure: global and local parameter independence shown as a graphical model in which θ1 and θ2|1 are separate parent nodes of X1 and X2, shared across sample 1, sample 2, …, sample M]

Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!

Page 20:

Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97,99)

Discrete DAG models:

x_i | x_{π_i} ~ Multi(θ)

Dirichlet prior:

P(θ) = ( Γ(Σ_k α_k) / ∏_k Γ(α_k) ) ∏_k θ_k^{α_k − 1} = C(α) ∏_k θ_k^{α_k − 1}

Gaussian DAG models:

x_i | x_{π_i} ~ Normal(μ, Σ)

Normal prior:

p(μ | ν, Ψ) = (2π)^{-n/2} |Ψ|^{-1/2} exp{ −½ (μ − ν)^T Ψ^{-1} (μ − ν) }

Normal-Wishart prior:

p(μ | ν, α_μ, W) = Normal( ν, (α_μ W)^{-1} ),
p(W | α_w, T) = c(n, α_w) |T|^{α_w/2} |W|^{(α_w − n − 1)/2} exp{ −½ tr(T W) },   where W = Σ^{-1}.

Summary: Learning GMs

For exponential family distributions, MLE amounts to moment matching.

GLIM: with the natural response, Iteratively Reweighted Least Squares serves as a general learning algorithm.

For fully observed BNs, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored.