School of Computer Science
Probabilistic Graphical Models
Learning generalized linear models and tabular CPT of fully observed BN
Eric Xing
Lecture 7, September 30, 2009
Reading:
[Figure: example Bayesian network structures over the nodes X1, X2, X3, X4]
Exponential family
For a numeric random variable X,
p(x|\eta) = h(x) \exp\{ \eta^T T(x) - A(\eta) \} = \frac{1}{Z(\eta)} h(x) \exp\{ \eta^T T(x) \}
is an exponential family distribution with natural (canonical) parameter \eta.
Function T(x) is a sufficient statistic.
Function A(\eta) = \log Z(\eta) is the log normalizer.
Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
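As a concrete illustration (our own example, not from the slides), the Bernoulli distribution with mean \mu fits this template with \eta = \log(\mu/(1-\mu)), T(x) = x, A(\eta) = \log(1 + e^\eta) and h(x) = 1; a minimal NumPy check:

```python
import numpy as np

def bernoulli_exp_family(x, eta):
    """p(x|eta) = h(x) exp{eta*T(x) - A(eta)} with T(x) = x, h(x) = 1,
    and log normalizer A(eta) = log(1 + exp(eta))."""
    A = np.log1p(np.exp(eta))            # log normalizer
    return np.exp(eta * x - A)

mu = 0.3                                 # moment (mean) parameter
eta = np.log(mu / (1 - mu))              # natural parameter
print(bernoulli_exp_family(1, eta))      # ~0.3
print(bernoulli_exp_family(0, eta))      # ~0.7
```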
Recall Linear Regression
Let us assume that the target variable and the inputs are related by the equation
y_i = \theta^T x_i + \epsilon_i
where \epsilon is an error term of unmodeled effects or random noise.
Now assume that \epsilon follows a Gaussian N(0, \sigma); then we have:
p(y_i | x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)
Recall: Logistic Regression (sigmoid classifier)
The conditional distribution is a Bernoulli:
p(y|x) = \mu(x)^y (1 - \mu(x))^{1-y}
where \mu is a logistic function:
\mu(x) = \frac{1}{1 + e^{-\theta^T x}}
We could use the brute-force gradient method as in LR, but we can also apply generic laws by observing that p(y|x) is an exponential family function, more specifically, a generalized linear model!
Example: Multivariate Gaussian Distribution
For a continuous vector random variable X \in R^k:
p(x|\mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\{ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \}
 = \frac{1}{(2\pi)^{k/2}} \exp\{ -\tfrac{1}{2} \mathrm{tr}(\Sigma^{-1} x x^T) + \mu^T \Sigma^{-1} x - \tfrac{1}{2} \mu^T \Sigma^{-1} \mu - \tfrac{1}{2} \log|\Sigma| \}
Exponential family representation
Natural parameter: \eta = [\Sigma^{-1}\mu ; \mathrm{vec}(-\tfrac{1}{2}\Sigma^{-1})], i.e. \eta_1 = \Sigma^{-1}\mu and \eta_2 = -\tfrac{1}{2}\Sigma^{-1}
Moment parameter: (\mu, \Sigma), with \mu = -\tfrac{1}{2}\eta_2^{-1}\eta_1
Sufficient statistic: T(x) = [x ; \mathrm{vec}(x x^T)]
Log normalizer: A(\eta) = \tfrac{1}{2}\mu^T \Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\eta_1^T \eta_2^{-1}\eta_1 - \tfrac{1}{2}\log|-2\eta_2|
h(x) = (2\pi)^{-k/2}
Note: a k-dimensional Gaussian is a (k+k^2)-parameter distribution with a (k+k^2)-element vector of sufficient statistics (but because of symmetry and positivity, the parameters are constrained and have a lower degree of freedom).
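As a sketch (our own code, not part of the lecture), the mapping between the moment parameters (\mu, \Sigma) and the natural parameters (\eta_1, \eta_2) of a multivariate Gaussian can be checked numerically:

```python
import numpy as np

def gaussian_natural_params(mu, Sigma):
    """Natural parameters of a multivariate Gaussian:
    eta1 = Sigma^{-1} mu,  eta2 = -1/2 Sigma^{-1}."""
    Sigma_inv = np.linalg.inv(Sigma)
    return Sigma_inv @ mu, -0.5 * Sigma_inv

def gaussian_moment_params(eta1, eta2):
    """Invert the mapping: Sigma = -1/2 eta2^{-1},  mu = Sigma eta1."""
    Sigma = -0.5 * np.linalg.inv(eta2)
    return Sigma @ eta1, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
eta1, eta2 = gaussian_natural_params(mu, Sigma)
mu_rec, Sigma_rec = gaussian_moment_params(eta1, eta2)
assert np.allclose(mu, mu_rec) and np.allclose(Sigma, Sigma_rec)
```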
Example: Multinomial distribution
For a binary vector random variable x ~ multi(x|\pi):
p(x|\pi) = \pi_1^{x_1} \pi_2^{x_2} \cdots \pi_K^{x_K} = \exp\{ \sum_k x_k \ln\pi_k \}
 = \exp\{ \sum_{k=1}^{K-1} x_k \ln\pi_k + (1 - \sum_{k=1}^{K-1} x_k) \ln(1 - \sum_{k=1}^{K-1} \pi_k) \}
 = \exp\{ \sum_{k=1}^{K-1} x_k \ln\frac{\pi_k}{1 - \sum_{k=1}^{K-1}\pi_k} + \ln(1 - \sum_{k=1}^{K-1}\pi_k) \}
Exponential family representation
\eta = [\ln(\pi_k/\pi_K) ; 0], where \pi_K = 1 - \sum_{k=1}^{K-1}\pi_k
T(x) = [x]
A(\eta) = -\ln(1 - \sum_{k=1}^{K-1}\pi_k) = \ln(\sum_{k=1}^{K} e^{\eta_k})
h(x) = 1
Why exponential family? Moment generating property
The first derivative of the log normalizer gives the mean of the sufficient statistic:
\frac{dA(\eta)}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta)
 = \int T(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx = E[T(x)]
The second derivative gives the variance:
\frac{d^2 A(\eta)}{d\eta^2} = \int T(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,T(x)^T dx - \int T(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx \cdot \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta)
 = E[T(x)T(x)^T] - E[T(x)]\,E[T(x)]^T = Var[T(x)]
Moment estimation
We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(\eta). The q-th derivative gives the q-th centered moment:
dA(\eta)/d\eta = mean
d^2 A(\eta)/d\eta^2 = variance
...
When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
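A quick numeric sanity check of this property (our own code, using the Poisson example where A(\eta) = e^\eta): finite-difference derivatives of A match the sampled moments.

```python
import numpy as np

# Poisson in exponential family form: eta = log(lambda), T(x) = x, A(eta) = exp(eta).
lam = 3.5
eta = np.log(lam)
A = np.exp                                   # log normalizer A(eta) = e^eta

eps = 1e-5
dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)            # finite-difference dA/deta
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2  # d^2A/deta^2

samples = np.random.default_rng(0).poisson(lam, size=200_000)
print(dA, samples.mean())   # both ~ lambda (first moment)
print(d2A, samples.var())   # both ~ lambda (variance)
```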
Moment vs canonical parameters
The moment parameter \mu can be derived from the natural (canonical) parameter:
\frac{dA(\eta)}{d\eta} = E[T(x)] \stackrel{def}{=} \mu
A(\eta) is convex since
\frac{d^2 A(\eta)}{d\eta^2} = Var[T(x)] > 0
[Figure: a convex log normalizer A(\eta) plotted against \eta, with its minimizer \eta^*]
Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):
\eta \stackrel{def}{=} \psi(\mu)
A distribution in the exponential family can be parameterized not only by \eta (the canonical parameterization) but also by \mu (the moment parameterization).
MLE for Exponential Family
For iid data, the log-likelihood is
\ell(\eta; D) = \log \prod_n h(x_n)\exp\{ \eta^T T(x_n) - A(\eta) \}
 = \sum_n \log h(x_n) + \eta^T\left( \sum_n T(x_n) \right) - N A(\eta)
Take derivatives and set to zero:
\frac{\partial\ell}{\partial\eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial\eta} = 0
\Rightarrow \frac{\partial A(\eta)}{\partial\eta} = \frac{1}{N}\sum_n T(x_n)
\Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)
This amounts to moment matching.
We can infer the canonical parameters using \hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE}).
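A small worked example of moment matching (our own sketch): for a univariate Gaussian, T(x) = (x, x^2), so the MLE simply matches the first two empirical moments.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)

# MLE for an exponential-family model = moment matching on T(x).
# For a univariate Gaussian, T(x) = (x, x^2), so we match E[x] and E[x^2].
m1 = x.mean()              # (1/N) sum_n T_1(x_n)
m2 = (x ** 2).mean()       # (1/N) sum_n T_2(x_n)

mu_mle = m1
var_mle = m2 - m1 ** 2     # equals the (biased) ML variance estimate
print(mu_mle, var_mle)     # ~2.0, ~2.25
```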
Sufficiency
For p(x|\theta), T(x) is sufficient for \theta if there is no information in X regarding \theta beyond that in T(x). We can throw away X for the purpose of inference w.r.t. \theta.
Bayesian view: p(\theta | T(x), x) = p(\theta | T(x))
Frequentist view: p(x | T(x), \theta) = p(x | T(x))
The Neyman factorization theorem
T(x) is sufficient for \theta if
p(x, T(x), \theta) = \psi_1(T(x), \theta)\,\psi_2(x, T(x))
\Rightarrow p(x|\theta) = g(T(x), \theta)\,h(x, T(x))
Examples
Gaussian:
\eta = [\Sigma^{-1}\mu ; \mathrm{vec}(-\tfrac{1}{2}\Sigma^{-1})]
T(x) = [x ; \mathrm{vec}(x x^T)]
A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|
h(x) = (2\pi)^{-k/2}
\Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n
Multinomial:
\eta = [\ln(\pi_k/\pi_K) ; 0]
T(x) = [x]
A(\eta) = -\ln(1 - \sum_{k=1}^{K-1}\pi_k) = \ln(\sum_{k=1}^{K} e^{\eta_k})
h(x) = 1
\Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n
Poisson:
\eta = \log\lambda
T(x) = x
A(\eta) = \lambda = e^{\eta}
h(x) = \frac{1}{x!}
\Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n
Generalized Linear Models (GLIMs)
The graphical model
Linear regression
Discriminative linear classification
Commonality: model E_p(Y) = \mu = f(\theta^T X)
What is p()? the conditional distribution of Y.
What is f()? the response function.
GLIM
The observed input x is assumed to enter into the model via a linear combination of its elements, \xi = \theta^T x.
The conditional mean \mu is represented as a function f(\xi) of \xi, where f is known as the response function.
The observed output y is assumed to be characterized by an exponential family distribution with conditional mean \mu.
GLIM, cont.
[Figure: GLIM pipeline x \to \xi = \theta^T x \to \mu = f(\xi) \to \eta = \psi(\mu) \to y \sim EXP]
p(y|\eta) = h(y)\exp\{ \eta^T y - A(\eta) \}
\Rightarrow p(y|\eta, \phi) = h(y, \phi)\exp\{ \tfrac{1}{\phi}(\eta^T y - A(\eta)) \}
The choice of exp family is constrained by the nature of the data Y.
Example: y is a continuous vector → multivariate Gaussian; y is a class label → Bernoulli or multinomial.
The choice of the response function
Following some mild constraints, e.g., [0,1], positivity, ...
Canonical response function: f = \psi^{-1}(\cdot)
In this case \theta^T x directly corresponds to the canonical parameter \eta.
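To make the pipeline concrete, here is a small sketch (our own code; the set of canonical response functions is an assumed summary table, and only the Gaussian and Bernoulli cases are worked out later in the lecture):

```python
import numpy as np

# Canonical response functions f = psi^{-1} for a few common GLIMs.
RESPONSES = {
    "gaussian":  lambda xi: xi,                          # identity: mu = theta^T x
    "bernoulli": lambda xi: 1.0 / (1.0 + np.exp(-xi)),   # logistic sigmoid
    "poisson":   lambda xi: np.exp(xi),                  # exponential
}

def glim_mean(theta, x, family):
    """E[y|x] = f(theta^T x) for a GLIM with a canonical link."""
    xi = x @ theta                     # linear combination of the inputs
    return RESPONSES[family](xi)

theta = np.array([0.5, -1.0])
x = np.array([[1.0, 2.0], [1.0, 0.0]])
print(glim_mean(theta, x, "bernoulli"))
```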
MLE for GLIMs with natural response
Log-likelihood:
\ell(\theta; D) = \sum_n \log h(y_n) + \sum_n \left( \theta^T x_n y_n - A(\eta_n) \right)
Derivative of the log-likelihood:
\frac{d\ell}{d\theta} = \sum_n \left( x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta} \right)
 = \sum_n (y_n - \mu_n) x_n
 = X^T(y - \mu)
This is a fixed point function, because \mu is a function of \theta.
Online learning for canonical GLIMs
Stochastic gradient ascent = least mean squares (LMS) algorithm:
\theta^{(t+1)} = \theta^{(t)} + \rho\,(y_n - \mu_n^{(t)})\,x_n
where \mu_n^{(t)} = (\theta^{(t)})^T x_n and \rho is a step size.
Batch learning for canonical GLIMs
The Hessian matrix:
H = \frac{d^2\ell}{d\theta\,d\theta^T} = -\sum_n \frac{d\mu_n}{d\theta^T} x_n
 = -\sum_n x_n \frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T}
 = -\sum_n x_n x_n^T \frac{d\mu_n}{d\eta_n}   (since \eta_n = \theta^T x_n)
 = -X^T W X
where X = [x_n^T] is the design matrix (rows x_1^T, ..., x_N^T), y = [y_1, ..., y_N]^T, and
W = \mathrm{diag}\left( \frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N} \right)
which can be computed by calculating the 2nd derivative of A(\eta_n).
Recall LMS
Cost function in matrix form:
J(\theta) = \tfrac{1}{2}\sum_{i=1}^n (x_i^T\theta - y_i)^2 = \tfrac{1}{2}(X\theta - y)^T(X\theta - y)
To minimize J(\theta), take the derivative and set it to zero:
\nabla_\theta J = \tfrac{1}{2}\nabla_\theta\left( \theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y \right)
 = \tfrac{1}{2}\left( 2X^T X\theta - 2X^T y \right)
 = X^T X\theta - X^T y = 0
\Rightarrow X^T X\theta = X^T y   (the normal equations)
\Rightarrow \theta^* = (X^T X)^{-1} X^T y
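A minimal NumPy sketch of solving the normal equations on synthetic data (our own example; np.linalg.solve is used instead of forming the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # design matrix with intercept
theta_true = np.array([0.5, 2.0, -1.0])
y = X @ theta_true + 0.05 * rng.normal(size=100)

# Normal equations: theta* = (X^T X)^{-1} X^T y.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)                                    # ~[0.5, 2.0, -1.0]
```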
Iteratively Reweighted Least Squares (IRLS)
Recall the Newton-Raphson method with cost function J:
\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta J
We now have
\nabla_\theta\ell = X^T(y - \mu)
H = -X^T W X
Now:
\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta\ell
 = \theta^{(t)} + (X^T W^{(t)} X)^{-1} X^T (y - \mu^{(t)})
 = (X^T W^{(t)} X)^{-1}\left[ X^T W^{(t)} X\theta^{(t)} + X^T(y - \mu^{(t)}) \right]
 = (X^T W^{(t)} X)^{-1} X^T W^{(t)} z^{(t)}
where the adjusted response is
z^{(t)} = X\theta^{(t)} + (W^{(t)})^{-1}(y - \mu^{(t)})
This can be understood as solving the following "iteratively reweighted least squares" problem:
\theta^{(t+1)} = \arg\min_\theta\, (z - X\theta)^T W (z - X\theta)
Example 1: logistic regression (sigmoid classifier)
The conditional distribution is a Bernoulli:
p(y|x) = \mu(x)^y (1 - \mu(x))^{1-y}
where \mu is a logistic function:
\mu(x) = \frac{1}{1 + e^{-\eta(x)}}
p(y|x) is an exponential family function, with mean
E[y|x] = \mu = \frac{1}{1 + e^{-\eta(x)}}
and canonical response function
\eta = \xi = \theta^T x
IRLS:
\frac{d\mu}{d\eta} = \mu(1 - \mu)
W = \mathrm{diag}\big( \mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N) \big)
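A sketch of the IRLS update above for logistic regression (our own code, under the assumption of 0/1 labels; the small clip on W is a numerical safeguard, not part of the derivation):

```python
import numpy as np

def irls_logistic(X, y, num_iters=20):
    """IRLS for logistic regression:
    theta^{t+1} = (X^T W X)^{-1} X^T W z, with W = diag(mu*(1-mu)) and
    adjusted response z = X theta + W^{-1} (y - mu)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ theta)))
        w = np.clip(mu * (1.0 - mu), 1e-10, None)   # diagonal of W
        z = X @ theta + (y - mu) / w                # adjusted response
        theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return theta

# Toy example with labels y in {0, 1}.
rng = np.random.default_rng(3)
X = np.c_[np.ones(500), rng.normal(size=(500, 2))]
p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, 1.0, -2.0]))))
y = (rng.uniform(size=500) < p).astype(float)
print(irls_logistic(X, y))
```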
Logistic regression: practical issues
It is very common to use regularized maximum likelihood:
p(y = \pm 1 \,|\, x, \theta) = \sigma(y\,\theta^T x) = \frac{1}{1 + e^{-y\theta^T x}}
p(\theta) = \mathrm{Normal}(0, \lambda^{-1} I)
\ell(\theta) = \sum_n \log\sigma(y_n\theta^T x_n) - \frac{\lambda}{2}\theta^T\theta
IRLS takes O(Nd^3) per iteration, where N = number of training cases and d = dimension of input x.
Quasi-Newton methods, which approximate the Hessian, work faster.
Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
Stochastic gradient descent can also be used if N is large, c.f. the perceptron rule:
\nabla_\theta\ell_n = \big(1 - \sigma(y_n\theta^T x_n)\big)\,y_n x_n - \lambda\theta
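A hedged sketch of the stochastic gradient variant (our own code; labels are taken in {-1, +1} as in the slide, and the step size, regularization strength, and toy data are illustrative):

```python
import numpy as np

def sgd_regularized_logistic(X, y, lam=0.1, rho=0.05, epochs=20, seed=0):
    """Stochastic gradient ascent on the regularized log-likelihood,
    using the per-example gradient from the slide:
    grad_n = (1 - sigma(y_n theta^T x_n)) y_n x_n - lam * theta."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):
            margin = y[n] * (X[n] @ theta)
            sigma = 1.0 / (1.0 + np.exp(-margin))
            theta += rho * ((1.0 - sigma) * y[n] * X[n] - lam * theta)
    return theta

rng = np.random.default_rng(5)
X = np.c_[np.ones(300), rng.normal(size=(300, 2))]
y = np.sign(X @ np.array([0.2, 1.5, -1.0]) + 0.3 * rng.normal(size=300))
print(sgd_regularized_logistic(X, y))
```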
Example 2: linear regression
The conditional distribution is a Gaussian:
p(y|x, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\{ -\tfrac{1}{2}(y - \mu(x))^T\Sigma^{-1}(y - \mu(x)) \}
 \Rightarrow (after rescaling) \; h(y)\exp\{ \eta(x)^T y - A(\eta) \}
where \mu is a linear function:
\mu(x) = \theta^T x = \eta(x)
p(y|x) is an exponential family function, with mean
E[y|x] = \mu = \theta^T x
and canonical response function
\eta = \xi = \theta^T x
IRLS:
\frac{d\mu}{d\eta} = 1, \quad W = I
\theta^{(t+1)} = (X^T W^{(t)} X)^{-1} X^T W^{(t)} z^{(t)} = \theta^{(t)} + (X^T X)^{-1} X^T (y - \mu^{(t)})   (steepest descent)
As t \to \infty: \theta = (X^T X)^{-1} X^T Y   (the normal equation)
Simple GMs are the building blocks of complex BNs
Density estimation: parametric and nonparametric methods   [Figure: single node X with parameters \mu, \sigma]
Regression: linear, conditional mixture, nonparametric   [Figure: X → Y]
Classification: generative and discriminative approach   [Figure: two-node models over Q and X]
An (incomplete) genealogy of graphical models
The structures of most GMs (e.g., all listed here) are not learned from data but designed by humans.
But such designs are useful and indeed favored, because human knowledge is thereby put to good use ...
MLE for general BNs
If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
\ell(\theta; D) = \log p(D|\theta) = \log\prod_n \prod_i p(x_{n,i} \,|\, x_{n,\pi_i}, \theta_i) = \sum_i\left( \sum_n \log p(x_{n,i} \,|\, x_{n,\pi_i}, \theta_i) \right)
[Figure: example network with observed configurations such as X2 = 0/1 and X5 = 0/1]
Example: decomposable likelihood of a directed model
Consider the distribution defined by the directed acyclic GM:
p(x|\theta) = p(x_1|\theta_1)\,p(x_2|x_1, \theta_2)\,p(x_3|x_1, \theta_3)\,p(x_4|x_2, x_3, \theta_4)
This is exactly like learning four separate small BNs, each of which consists of a node and its parents.
[Figure: the BN over X1, X2, X3, X4 split into four node-plus-parents fragments]
MLE for BNs with tabular CPDs
Assume each CPD is represented as a table (multinomial) where
\theta_{ijk} \stackrel{def}{=} p(X_i = j \,|\, X_{\pi_i} = k)
Note that in the case of multiple parents, X_{\pi_i} will have a composite state, and the CPD will be a high-dimensional table.
The sufficient statistics are counts of family configurations:
n_{ijk} \stackrel{def}{=} \sum_n x_{n,i}^j x_{n,\pi_i}^k
The log-likelihood is
\ell(\theta; D) = \log\prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk}\log\theta_{ijk}
Using a Lagrange multiplier to enforce \sum_j \theta_{ijk} = 1, we get:
\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}
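A minimal sketch of this counting recipe for a single node with one (possibly composite) parent (our own code; variable names and the toy data are illustrative):

```python
import numpy as np

def mle_tabular_cpt(child_vals, parent_vals, n_child, n_parent):
    """MLE of a tabular CPD theta[j, k] = p(X_i = j | X_pa = k) from fully
    observed data: normalized counts of family configurations n_ijk."""
    counts = np.zeros((n_child, n_parent))
    for j, k in zip(child_vals, parent_vals):
        counts[j, k] += 1                                  # n_ijk
    return counts / counts.sum(axis=0, keepdims=True)      # divide by sum_j' n_ij'k

# Toy example: parent X_pa in {0, 1}, child X_i in {0, 1, 2}.
rng = np.random.default_rng(4)
pa = rng.integers(0, 2, size=1000)
child = np.where(pa == 0,
                 rng.choice(3, 1000, p=[.7, .2, .1]),
                 rng.choice(3, 1000, p=[.1, .3, .6]))
print(mle_tabular_cpt(child, pa, n_child=3, n_parent=2))
```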
MLE and Kullback-Leibler divergence
KL divergence:
D(q(x)\,\|\,p(x)) = \sum_x q(x)\log\frac{q(x)}{p(x)}
Empirical distribution:
\tilde{p}(x) \stackrel{def}{=} \frac{1}{N}\sum_{n=1}^N \delta(x, x_n)
where \delta(x, x_n) is a Kronecker delta function.
D(\tilde{p}(x)\,\|\,p(x|\theta)) = \sum_x \tilde{p}(x)\log\frac{\tilde{p}(x)}{p(x|\theta)}
 = \sum_x \tilde{p}(x)\log\tilde{p}(x) - \sum_x \tilde{p}(x)\log p(x|\theta)
 = C - \frac{1}{N}\sum_n \log p(x_n|\theta)
 = C - \frac{1}{N}\ell(\theta; D)
Hence \max_\theta (MLE) \equiv \min_\theta (KL).
Parameter sharing
Consider a time-invariant (stationary) 1st-order Markov model.
[Figure: chain X1 → X2 → X3 → ... → XT]
Initial state probability vector:
\pi_k \stackrel{def}{=} p(X_1^k = 1)
State transition probability matrix:
A_{ij} \stackrel{def}{=} p(X_t^j = 1 \,|\, X_{t-1}^i = 1)
The joint:
p(X|\theta) = p(x_1|\pi)\prod_{t=2}^T p(X_t \,|\, X_{t-1})
The log-likelihood:
\ell(\theta; D) = \sum_n \log p(x_{n,1}|\pi) + \sum_n\sum_{t=2}^T \log p(x_{n,t} \,|\, x_{n,t-1}, A)
Again, we optimize each parameter separately.
\pi is a multinomial frequency vector, and we've seen it before.
What about A?
Learning a Markov chain transition matrix
A is a stochastic matrix: \sum_j A_{ij} = 1.
Each row of A is a multinomial distribution.
So the MLE of A_{ij} is the fraction of transitions from i to j:
A_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n\sum_{t=2}^T x_{n,t-1}^i x_{n,t}^j}{\sum_n\sum_{t=2}^T x_{n,t-1}^i}
Application: if the states X_t represent words, this is called a bigram language model.
Sparse data problem: if i → j did not occur in the data, we will have A_{ij} = 0, and then any future sequence containing the word pair i → j will have zero probability.
A standard hack: backoff smoothing or deleted interpolation
\tilde{A}_{i\to\cdot} = \lambda\eta + (1 - \lambda)A^{ML}_{i\to\cdot}
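A small sketch of a bigram model with interpolation smoothing (our own code; here the backoff distribution \eta is taken to be the unigram distribution, which is one common choice, and the toy sentences are illustrative):

```python
from collections import Counter, defaultdict

def bigram_model(sentences, lam=0.9):
    """MLE bigram probabilities A_ij = #(i->j)/#(i->.), interpolated with the
    unigram distribution to avoid zeros:
    A_tilde(i->j) = lam * A_ML(i->j) + (1 - lam) * p_unigram(j)."""
    bigram, unigram = defaultdict(Counter), Counter()
    for words in sentences:
        unigram.update(words)
        for prev, cur in zip(words, words[1:]):
            bigram[prev][cur] += 1
    total = sum(unigram.values())

    def prob(prev, cur):
        ml = bigram[prev][cur] / sum(bigram[prev].values()) if bigram[prev] else 0.0
        return lam * ml + (1 - lam) * unigram[cur] / total

    return prob

p = bigram_model([["the", "dog", "barks"], ["the", "cat", "sleeps"]])
print(p("the", "dog"), p("dog", "cat"))   # the second is nonzero thanks to smoothing
```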
Bayesian language model
Global and local parameter independence
[Figure: Markov chain X1 → X2 → ... → XT with parameters \pi, A_{i\cdot}, A_{i'\cdot} and Dirichlet hyperparameters \alpha, \beta]
The posterior of A_{i\cdot} and A_{i'\cdot} is factorized despite the v-structure on X_t, because X_{t-1} acts like a multiplexer.
Assign a Dirichlet prior \beta_i to each row of the transition matrix:
A_{ij}^{Bayes} \stackrel{def}{=} p(j \,|\, i, D, \beta_i) = \frac{\#(i\to j) + \beta_{i,j}}{\#(i\to\cdot) + |\beta_i|} = \lambda_i\beta'_{i,j} + (1 - \lambda_i)A_{ij}^{ML}, \quad where \; \lambda_i = \frac{|\beta_i|}{\#(i\to\cdot) + |\beta_i|}
We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.)
Example: HMM: two scenarios
Supervised learning: estimation when the "right answer" is known.
Examples:
GIVEN: a genomic region x = x1...x1,000,000 where we have good (experimental) annotations of the CpG islands.
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.
Unsupervised learning: estimation when the "right answer" is unknown.
Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.
QUESTION: Update the parameters \theta of the model to maximize P(x|\theta) --- maximum likelihood (ML) estimation.
Recall definition of HMM
[Figure: HMM with hidden states y1, y2, y3, ..., yT and observations x1, x2, x3, ..., xT]
Transition probabilities between any two states:
p(y_t^j = 1 \,|\, y_{t-1}^i = 1) = a_{i,j}
or p(y_t \,|\, y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \; \forall i \in I
Start probabilities:
p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)
Emission probabilities associated with each state:
p(x_t \,|\, y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \; \forall i \in I
or in general:
p(x_t \,|\, y_t^i = 1) \sim f(\cdot \,|\, \theta_i), \; \forall i \in I
Supervised ML estimation
Given x = x1...xN for which the true state path y = y1...yN is known,
Define:
A_{ij} = # times state transition i→j occurs in y
B_{ik} = # times state i in y emits k in x
We can show that the maximum likelihood parameters \theta are:
a_{ij}^{ML} = \frac{\#(i\to j)}{\#(i\to\cdot)} = \frac{\sum_n\sum_{t=2}^T y_{n,t-1}^i y_{n,t}^j}{\sum_n\sum_{t=2}^T y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}
b_{ik}^{ML} = \frac{\#(i\to k)}{\#(i\to\cdot)} = \frac{\sum_n\sum_{t=1}^T y_{n,t}^i x_{n,t}^k}{\sum_n\sum_{t=1}^T y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}
What if x is continuous? We can treat \{ (x_{n,t}, y_{n,t}) : t = 1..T, n = 1..N \} as N \times T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian ...
Supervised ML estimation, ctd.
Intuition: when we know the underlying states, the best estimate of \theta is the average frequency of transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting:
P(x|\theta) is maximized, but \theta is unreasonable.
0 probabilities --- VERY BAD.
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F
Then: a_FF = 1; a_FL = 0
b_F1 = b_F3 = .2; b_F2 = .3; b_F4 = 0; b_F5 = b_F6 = .1
Pseudocounts
Solution for small training sets: add pseudocounts
A_{ij} = # times state transition i→j occurs in y + R_{ij}
B_{ik} = # times state i in y emits k in x + S_{ik}
R_{ij}, S_{ik} are pseudocounts representing our prior belief.
Total pseudocounts: R_i = \sum_j R_{ij}, S_i = \sum_k S_{ik}
--- the "strength" of the prior belief,
--- the total number of imaginary instances in the prior.
Larger total pseudocounts ⇒ stronger prior belief.
Small total pseudocounts: just to avoid 0 probabilities --- smoothing.
This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts.
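A sketch of supervised HMM estimation by counting, with a uniform pseudocount added to every transition and emission cell (our own code; the encoding of the casino rolls as 0-based symbols is an illustrative choice):

```python
import numpy as np

def supervised_hmm_mle(state_seqs, obs_seqs, n_states, n_obs, pseudo=1.0):
    """Supervised ML estimation of HMM parameters by counting transitions and
    emissions along the known state paths, with pseudocounts R_ij = S_ik = pseudo."""
    A = np.full((n_states, n_states), pseudo)   # transition counts + R_ij
    B = np.full((n_states, n_obs), pseudo)      # emission counts + S_ik
    for y, x in zip(state_seqs, obs_seqs):
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1
    # Normalize each row to get a_ij and b_ik.
    return A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)

# The casino example: 10 rolls, all from the Fair state (state 0); faces 1..6 -> 0..5.
y = [[0] * 10]
x = [[1, 0, 4, 5, 0, 1, 2, 5, 1, 2]]
A, B = supervised_hmm_mle(y, x, n_states=2, n_obs=6, pseudo=1.0)
print(B[0])   # no zero emission probabilities, unlike the raw ML estimate
```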
How to define parameter prior?
[Figure: BN over Earthquake, Burglary, Radio, Alarm, Call]
Factorization:
p(x) = \prod_{i=1}^M p(x_i \,|\, x_{\pi_i})
Local distributions defined by, e.g., multinomial parameters:
p(x_i^k \,|\, x_{\pi_i}^j) = \theta_{x_i^k | x_{\pi_i}^j}
p(\theta | G)?
Assumptions (Geiger & Heckerman 97, 99):
Complete Model Equivalence
Global Parameter Independence
Local Parameter Independence
Likelihood and Prior Modularity
Global & Local Parameter Independence
Global Parameter Independence: for every DAG model,
p(\theta_m | G) = \prod_{i=1}^M p(\theta_i | G)
Local Parameter Independence: for every node,
p(\theta_i | G) = \prod_{j=1}^{q_i} p(\theta_{x_i^k | x_{\pi_i}^j} | G)
e.g., \theta_{Call | Alarm = YES} is independent of \theta_{Call | Alarm = NO}.
Parameter Independence, Graphical View
[Figure: samples 1, 2, ..., M of (X1, X2) sharing the parameters \theta_1 and \theta_{2|1}; global and local parameter independence shown by the absence of edges between parameters]
Provided all variables are observed in all cases, we can perform Bayesian updates of each parameter independently!
Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97, 99)
Discrete DAG models:
x_i \,|\, x_{\pi_i}^j \sim \mathrm{Multi}(\theta)
Dirichlet prior:
P(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)}\prod_k \theta_k^{\alpha_k - 1} = C(\alpha)\prod_k \theta_k^{\alpha_k - 1}
Gaussian DAG models:
x_i \,|\, x_{\pi_i}^j \sim \mathrm{Normal}(\mu, \Sigma)
Normal prior:
p(\mu \,|\, \nu, \Psi) = (2\pi)^{-n/2}|\Psi|^{-1/2}\exp\{ -\tfrac{1}{2}(\mu - \nu)'\Psi^{-1}(\mu - \nu) \}
Normal-Wishart prior:
p(\mu \,|\, \nu, \alpha_\mu, W) = \mathrm{Normal}(\nu, (\alpha_\mu W)^{-1})
p(W \,|\, \alpha_w, T) = c(n, \alpha_w)\,|T|^{\alpha_w/2}\,|W|^{(\alpha_w - n - 1)/2}\exp\{ -\tfrac{1}{2}\mathrm{tr}(T W) \}, \quad where \; W = \Sigma^{-1}
Summary: Learning GM
For exponential family distributions, MLE amounts to moment matching.
GLIM: natural response; Iteratively Reweighted Least Squares as a general algorithm.
For fully observed BNs, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored.