TRANSCRIPT
Generalized Linear Models +
Learning Fully Observed Bayes Nets

Matt Gormley, Lecture 5
January 27, 2016
School of Computer Science
10-708 Probabilistic Graphical Models
Readings: KF Chap. 17; Jordan Chap. 8; Jordan Chap. 9.1-9.2
(Several slides © Eric Xing @ CMU, 2005-2015)
Machine Learning

• Data: the data inspires the structures we want to predict; it also tells us what to optimize.
• Model: our model defines a score for each structure.
• Learning: learning tunes the parameters of the model.
• Inference: inference finds the {best structure, marginals, partition function} for a new observation.

Tying these together draws on domain knowledge, mathematical modeling, and (combinatorial) optimization: that is ML. (Inference is usually called as a subroutine in learning.)
[Figure: two different parse trees for the ambiguous sentence "Alice saw Bob on a hill with a telescope"]

[Figure: several different parse trees for the ambiguous sentence "time flies like an arrow"]
Today's Lecture

[Figure: a directed graphical model over X1, ..., X5, with edges X1 → X2; X2, X3 → X4; X3 → X5]

p(X1, X2, X3, X4, X5) = p(X5|X3) p(X4|X2, X3) p(X3) p(X2|X1) p(X1)
How do we define and learn these conditional and marginal distributions for a Bayes Net?
Today's Lecture
1. Exponential Family Distributions: a candidate for marginal distributions, p(Xi)
2. Generalized Linear Models: a convenient form for conditional distributions, p(Xj | Xi)
3. Learning Fully Observed Bayes Nets: easy thanks to decomposability
1. EXPONENTIAL FAMILY
A candidate for marginal distributions, p(Xi)
Why the Exponential Family?
1. Pitman-Koopman-Darmois theorem: it is the only family of distributions whose sufficient statistics do not grow with the size of the dataset.
2. It is the only family of distributions for which conjugate priors exist (see the Murphy textbook for a description).
3. It is the distribution that is closest to uniform (i.e., maximizes entropy), subject to moment-matching constraints.
4. It is key to Generalized Linear Models (next section).
5. It includes some of your favorite distributions!

Adapted from Murphy (2012) textbook
Whiteboard
• Definition of the multivariate exponential family
• Example 1: Categorical distribution
• Example 2: Dirichlet distribution
Exponential family, a basic building block
• For a numeric random variable X,

  p(x \mid \eta) = h(x) \exp\{\eta^T T(x) - A(\eta)\} = \frac{1}{Z(\eta)} h(x) \exp\{\eta^T T(x)\}

  is an exponential family distribution with natural (canonical) parameter η.
• The function T(x) is a sufficient statistic.
• The function A(η) = log Z(η) is the log normalizer.
• Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
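To make the definition concrete, here is a minimal sketch (my own example, not from the slides) that writes the Bernoulli distribution in exponential family form and checks it against the standard pmf; the names `bernoulli_expfam`, `mu`, and `eta` are illustrative choices.

```python
import numpy as np

# A minimal sketch: the Bernoulli written in exponential family form
# p(x|eta) = h(x) exp{eta * T(x) - A(eta)}, with T(x) = x, h(x) = 1,
# eta = log(mu / (1 - mu)), and A(eta) = log(1 + e^eta).

def bernoulli_expfam(x, eta):
    A = np.log1p(np.exp(eta))      # log normalizer A(eta)
    return np.exp(eta * x - A)     # h(x) = 1 for the Bernoulli

mu = 0.3
eta = np.log(mu / (1 - mu))        # natural parameter
for x in (0, 1):
    direct = mu**x * (1 - mu)**(1 - x)
    assert np.isclose(bernoulli_expfam(x, eta), direct)
print("Bernoulli matches its exponential family form.")
```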
Example: Multivariate Gaussian Distribution
• For a continuous vector random variable X ∈ R^k:

  p(x \mid \mu, \Sigma)
    = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}
    = \exp\left\{ -\tfrac{1}{2}\mathrm{tr}(\Sigma^{-1} x x^T) + \mu^T \Sigma^{-1} x - \tfrac{1}{2}\mu^T \Sigma^{-1} \mu - \tfrac{1}{2}\log|\Sigma| - \tfrac{k}{2}\log(2\pi) \right\}

• Exponential family representation:

  \eta = [\Sigma^{-1}\mu ;\ \mathrm{vec}(-\tfrac{1}{2}\Sigma^{-1})] = [\eta_1 ;\ \mathrm{vec}(\eta_2)]
  so the moment parameters are \mu = -\tfrac{1}{2}\eta_2^{-1}\eta_1 and \Sigma = -\tfrac{1}{2}\eta_2^{-1}
  T(x) = [x ;\ \mathrm{vec}(x x^T)]
  A(\eta) = \tfrac{1}{2}\mu^T \Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\eta_1^T \eta_2^{-1}\eta_1 - \tfrac{1}{2}\log|-2\eta_2|
  h(x) = (2\pi)^{-k/2}

• Note: a k-dimensional Gaussian is a (k + k^2)-parameter distribution with a (k + k^2)-element vector of sufficient statistics, but because of symmetry and positive definiteness the parameters are constrained and have fewer degrees of freedom.
Example: Multinomial distribution
• For a binary (one-hot) vector random variable x ~ multinomial(x | π):

  p(x \mid \pi) = \pi_1^{x_1} \pi_2^{x_2} \cdots \pi_K^{x_K} = \exp\left\{ \sum_k x_k \ln \pi_k \right\}
    = \exp\left\{ \sum_{k=1}^{K-1} x_k \ln \pi_k + \left(1 - \sum_{k=1}^{K-1} x_k\right) \ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) \right\}
    = \exp\left\{ \sum_{k=1}^{K-1} x_k \ln \frac{\pi_k}{1 - \sum_{j=1}^{K-1} \pi_j} + \ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) \right\}

• Exponential family representation:

  \eta = \left[ \ln(\pi_k / \pi_K) ;\ 0 \right]
  T(x) = [x]
  A(\eta) = -\ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) = \ln\left( \sum_{k=1}^{K} e^{\eta_k} \right)
  h(x) = 1
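A quick sketch (my own, not from the slides) of the multinomial representation above: A(η) is a log-sum-exp, and the softmax of η recovers the probabilities π. The array values are arbitrary illustrative choices.

```python
import numpy as np

# For the multinomial, A(eta) = log sum_k exp(eta_k), and the moment
# parameters (probabilities) come back via the softmax
# pi_k = exp(eta_k - A(eta)).

pi = np.array([0.5, 0.3, 0.2])
eta = np.log(pi / pi[-1])              # eta_K = 0 by construction
A = np.log(np.sum(np.exp(eta)))        # log normalizer A(eta)
pi_recovered = np.exp(eta - A)         # softmax inverts the map
assert np.allclose(pi_recovered, pi)
print("A(eta) =", A, "recovers pi:", pi_recovered)
```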
Cumulant Generating Property
• First cumulant (i.e., the mean):

  \frac{dA}{d\eta} = \frac{d}{d\eta} \log Z(\eta)
    = \frac{1}{Z(\eta)} \frac{d Z(\eta)}{d\eta}
    = \frac{1}{Z(\eta)} \frac{d}{d\eta} \int h(x) \exp\{\eta^T T(x)\}\, dx
    = \int T(x) \frac{h(x) \exp\{\eta^T T(x)\}}{Z(\eta)}\, dx
    = E[T(x)]

• Second cumulant (i.e., the variance, the second central moment):

  \frac{d^2 A}{d\eta^2} = \frac{d}{d\eta} \int T(x) \frac{h(x) \exp\{\eta^T T(x)\}}{Z(\eta)}\, dx
    = E[T^2(x)] - E[T(x)]^2 = Var[T(x)]
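A quick numerical sanity check of the cumulant property (my own example, with an arbitrary rate `lam`): for the Poisson, A(η) = e^η with η = log λ, so dA/dη should equal the mean λ. A finite difference confirms it.

```python
import numpy as np

# Verify dA/deta = E[T(x)] for the Poisson, where A(eta) = exp(eta)
# and eta = log(lambda), using a central finite difference.

lam = 3.5
eta = np.log(lam)
eps = 1e-6
A = lambda e: np.exp(e)                          # Poisson log normalizer
dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)   # numerical derivative
assert np.isclose(dA, lam, atol=1e-4)
print("dA/deta =", dA, "matches the mean", lam)
```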
Moment estimation
• We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η).
• The qth derivative gives the qth cumulant:

  \frac{dA}{d\eta} = \text{mean}, \qquad \frac{d^2 A}{d\eta^2} = \text{variance}, \qquad \ldots

• When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
Moment vs canonical parameters
• The moment parameter µ can be derived from the natural (canonical) parameter:

  \frac{dA(\eta)}{d\eta} = E[T(x)] \overset{\text{def}}{=} \mu

• A(η) is convex since

  \frac{d^2 A(\eta)}{d\eta^2} = Var[T(x)] > 0

• Hence we can invert the relationship and infer the canonical parameter from the moment parameter (the map is one-to-one):

  \eta \overset{\text{def}}{=} \psi(\mu)

• A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by µ (the moment parameterization).

[Figure: a convex log normalizer A(η); the slope of A at η* gives the corresponding moment parameter]
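As a minimal illustration of the invertible map η = ψ(µ) (my own example, not from the slides): for the Bernoulli, the forward map is µ = dA/dη = sigmoid(η), and ψ is the logit.

```python
import numpy as np

# For the Bernoulli: mu = sigmoid(eta) and eta = psi(mu) = logit(mu).
# The two maps compose to the identity, so the relationship is 1-to-1.

sigmoid = lambda e: 1.0 / (1.0 + np.exp(-e))
logit = lambda m: np.log(m / (1.0 - m))        # psi(mu)

eta = 0.7
mu = sigmoid(eta)                              # moment parameter from eta
assert np.isclose(logit(mu), eta)              # psi inverts the map
print("eta =", eta, "-> mu =", mu, "-> eta =", logit(mu))
```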
MLE for Exponential Family
• For iid data, the log-likelihood is

  \ell(\eta; D) = \log \prod_n h(x_n) \exp\{\eta^T T(x_n) - A(\eta)\}
    = \sum_n \log h(x_n) + \eta^T \left( \sum_n T(x_n) \right) - N A(\eta)

• Take the derivative and set it to zero:

  \frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N \frac{\partial A(\eta)}{\partial \eta} = 0
  \;\Rightarrow\; \frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N} \sum_n T(x_n)
  \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N} \sum_n T(x_n)

• This amounts to moment matching.
• We can infer the canonical parameters using \hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE}).
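A short sketch of moment matching in practice (my own synthetic data): for the Poisson, T(x) = x, so µ_MLE is just the sample mean, and the canonical parameter follows from η = ψ(µ) = log µ.

```python
import numpy as np

# MLE in an exponential family is moment matching: mu_MLE = mean of T(x_n).
# For the Poisson we then recover eta via psi(mu) = log(mu).

rng = np.random.default_rng(0)
data = rng.poisson(lam=4.0, size=10_000)   # synthetic data, true lambda = 4

mu_mle = data.mean()                       # (1/N) sum_n T(x_n)
eta_mle = np.log(mu_mle)                   # psi(mu) for the Poisson
print("mu_MLE =", mu_mle, " eta_MLE =", eta_mle,
      " (true eta =", np.log(4.0), ")")
```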
Examples
• Gaussian:

  \eta = [\Sigma^{-1}\mu ;\ \mathrm{vec}(-\tfrac{1}{2}\Sigma^{-1})]
  T(x) = [x ;\ \mathrm{vec}(x x^T)]
  A(\eta) = \tfrac{1}{2}\mu^T \Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|
  h(x) = (2\pi)^{-k/2}

  \Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n

• Multinomial:

  \eta = [\ln(\pi_k / \pi_K) ;\ 0]
  T(x) = [x]
  A(\eta) = -\ln\left(1 - \sum_{k=1}^{K-1}\pi_k\right) = \ln\left(\sum_{k=1}^{K} e^{\eta_k}\right)
  h(x) = 1

  \Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n

• Poisson:

  \eta = \log \lambda
  T(x) = x
  A(\eta) = \lambda = e^{\eta}
  h(x) = \frac{1}{x!}

  \Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n
Sufficiency
• For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).
  We can throw away X for the purpose of inference w.r.t. θ.
• Bayesian view: p(\theta \mid T(x), x) = p(\theta \mid T(x))
• Frequentist view: p(x \mid T(x), \theta) = p(x \mid T(x))
• The Neyman factorization theorem: T(x) is sufficient for θ if

  p(x, T(x), \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x))
  \;\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\, h(x, T(x))
Whiteboard
• Bayesian estimation of the exponential family
2. GENERALIZED LINEAR MODELS
A convenient form for conditional distributions, p(Xj | Xi)
Why Generalized Linear Models (GLIMs)?
1. Generalization of linear regression, logistic regression, probit regression, etc.
2. Provides a framework for creating new conditional distributions that come with some convenient properties.
3. Special case: GLIMs with canonical response functions are easy to train with MLE.
4. No free lunch: what about Bayesian estimation of GLIMs? Unfortunately, we have to turn to approximation techniques, since in general there is no closed form for the posterior.
Generalized Linear Models (GLIMs)
• The observed input x is assumed to enter into the model via a linear combination of its elements, ξ = θ^T x.
• The conditional mean µ is represented as a function f(ξ) of ξ, where f is known as the response function.
• The observed output y is assumed to be characterized by an exponential family distribution with conditional mean µ.

[Graphical model: a plate over n = 1, ..., N with Xn → Yn]
Whiteboard
• Constructive definition of GLIMs
• Definition of GLIMs with canonical response functions
Generalized Linear Models (GLIMs)
• The graphical model is the same for linear regression and discriminative linear classification.
• Commonality: we model E_p[Y] = \mu = f(\theta^T X).
  What is p()? The conditional distribution of Y.
  What is f()? The response function.

GLIM, cont.
• The choice of exponential family is constrained by the nature of the data Y.
  Example: if y is a continuous vector, use a multivariate Gaussian; if y is a class label, use a Bernoulli or multinomial.
• The choice of the response function: it must satisfy some mild constraints, e.g., mapping into [0, 1] or ensuring positivity.
• Canonical response function: f = \psi^{-1}(\cdot). In this case \theta^T x directly corresponds to the canonical parameter \eta:

  p(y \mid \eta) = h(y)\, \exp\{\eta(x)^T y - A(\eta)\}

  and, with a dispersion (scale) parameter \phi,

  p(y \mid \eta, \phi) = h(y, \phi)\, \exp\{(\eta(x)^T y - A(\eta)) / \phi\}

[Figure: the GLIM pipeline x →(θ)→ ξ →(f)→ µ →(ψ)→ η → EXP → y]
Example canonical response functions
[Table not recoverable from the transcript]
MLE for GLIMs with natural response
• Log-likelihood:

  \ell = \sum_n \log h(y_n) + \sum_n \left( \theta^T x_n y_n - A(\eta_n) \right)

• Derivative of the log-likelihood:

  \frac{d\ell}{d\theta} = \sum_n \left( x_n y_n - \frac{dA(\eta_n)}{d\eta_n} \frac{d\eta_n}{d\theta} \right)
    = \sum_n (y_n - \mu_n) x_n
    = X^T (y - \mu)

  This is a fixed-point equation, because µ is a function of θ.

• Online learning for canonical GLIMs: stochastic gradient ascent = the least mean squares (LMS) algorithm:

  \theta^{t+1} = \theta^t + \rho (y_n - \mu_n^t) x_n

  where \mu_n^t = (\theta^t)^T x_n and ρ is a step size.
Batch learning for canonical GLIMs
• The Hessian matrix:

  H = \frac{d^2 \ell}{d\theta\, d\theta^T} = \frac{d}{d\theta^T} \sum_n (y_n - \mu_n) x_n
    = -\sum_n x_n \frac{d\mu_n}{d\theta^T}
    = -\sum_n x_n \frac{d\mu_n}{d\eta_n} \frac{d\eta_n}{d\theta^T}
    = -\sum_n x_n \frac{d\mu_n}{d\eta_n} x_n^T \quad (\text{since } \eta_n = \theta^T x_n)
    = -X^T W X

  where X = [x_1^T; x_2^T; \ldots; x_N^T] is the design matrix (one row per example), y = [y_1; y_2; \ldots; y_N], and

  W = \mathrm{diag}\left( \frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N} \right),

  which can be computed by calculating the 2nd derivative of A(η_n).
Recall LMS
• Cost function in matrix form:

  J(\theta) = \frac{1}{2} \sum_{i=1}^{N} (x_i^T \theta - y_i)^2
    = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y})

  where X stacks the inputs x_i^T as rows and \vec{y} = [y_1; y_2; \ldots; y_N].

• To minimize J(θ), take the derivative and set it to zero:

  \nabla_\theta J = \frac{1}{2} \nabla_\theta \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)
    = X^T X \theta - X^T \vec{y} = 0

  \Rightarrow X^T X \theta = X^T \vec{y} \quad \text{(the normal equations)}

  \Rightarrow \theta^* = (X^T X)^{-1} X^T \vec{y}
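A small sketch of the normal equations on synthetic data of my own; note that in practice one solves the linear system rather than forming the inverse explicitly.

```python
import numpy as np

# Solve theta* = (X^T X)^{-1} X^T y via the normal equations, and check
# against numpy's least-squares solver.

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, 0.5, -2.0])
y = X @ theta_true + 0.05 * rng.normal(size=200)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
assert np.allclose(theta_star, np.linalg.lstsq(X, y, rcond=None)[0])
print("theta* =", theta_star)
```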
Iteratively Reweighted Least Squares (IRLS)
• Recall the Newton-Raphson method with cost function J:

  \theta^{t+1} = \theta^t - H^{-1} \nabla_\theta J

• We now have

  \nabla_\theta \ell = X^T (y - \mu), \qquad H = -X^T W X

• So:

  \theta^{t+1} = \theta^t - H^{-1} \nabla_\theta \ell \big|_{\theta^t}
    = (X^T W^t X)^{-1} \left[ X^T W^t X \theta^t + X^T (y - \mu^t) \right]
    = (X^T W^t X)^{-1} X^T W^t z^t

  where the adjusted response is

  z^t = X \theta^t + (W^t)^{-1} (y - \mu^t)

• This can be understood as solving the following iteratively reweighted least squares problem:

  \theta^{t+1} = \arg\min_\theta (z - X\theta)^T W (z - X\theta)
Example 1: logistic regression (sigmoid classifier)
• The conditional distribution is a Bernoulli:

  p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1-y}

  where µ is a logistic function: \mu(x) = \frac{1}{1 + e^{-\eta(x)}}

• p(y|x) is an exponential family distribution with
  mean: E[y \mid x] = \mu = \frac{1}{1 + e^{-\eta(x)}}
  and canonical response function: \eta = \xi = \theta^T x

• IRLS:

  \frac{d\mu}{d\eta} = \mu(1 - \mu), \qquad
  W = \mathrm{diag}\big( \mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N) \big)
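A compact sketch of IRLS for logistic regression (my own synthetic data; the fixed iteration count is an illustrative simplification, not a convergence criterion):

```python
import numpy as np

# IRLS: repeat theta <- (X^T W X)^{-1} X^T W z, with W = diag(mu*(1-mu))
# and adjusted response z = X theta + W^{-1}(y - mu).

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
theta_true = np.array([1.5, -2.0])
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(2)
for _ in range(10):                              # a few Newton steps suffice
    mu = 1 / (1 + np.exp(-X @ theta))            # current means
    w = mu * (1 - mu)                            # diagonal of W
    z = X @ theta + (y - mu) / w                 # adjusted response
    theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
print("IRLS estimate:", theta)
```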
Logistic regression: practical issues
• It is very common to use regularized maximum likelihood:

  p(y = \pm 1 \mid x, \theta) = \frac{1}{1 + e^{-y\,\theta^T x}} = \sigma(y\, \theta^T x)

  p(\theta) = \mathrm{Normal}(0, \lambda^{-1} I)

  \ell(\theta) = \sum_n \log \sigma(y_n \theta^T x_n) - \frac{\lambda}{2} \theta^T \theta

• IRLS takes O(N d^3) per iteration, where N is the number of training cases and d is the dimension of the input x.
• Quasi-Newton methods, which approximate the Hessian, work faster.
• Conjugate gradient takes O(N d) per iteration and usually works best in practice.
• Stochastic gradient descent can also be used if N is large; cf. the perceptron rule:

  \nabla \ell = \sum_n \left(1 - \sigma(y_n \theta^T x_n)\right) y_n x_n - \lambda \theta
Example 2: linear regression
• The conditional distribution is a Gaussian:

  p(y \mid x, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(y - \mu(x))^T \Sigma^{-1} (y - \mu(x)) \right\}
    \Rightarrow h(y, x)\, \exp\{ \eta(x)^T y - A(\eta) \}

  where µ is a linear function: \mu(x) = \theta^T x

• p(y|x) is an exponential family distribution with
  mean: E[y \mid x] = \mu = \theta^T x
  and canonical response function: \eta = \xi = \theta^T x

• IRLS: \frac{d\mu}{d\eta} = 1, so W = I and

  \theta^{t+1} = (X^T W^t X)^{-1} X^T W^t z^t = \theta^t + (X^T X)^{-1} X^T (y - \mu^t)

  The update has the form of a rescaled steepest-descent step, and as t → ∞ it converges to the normal equation solution:

  \theta = (X^T X)^{-1} X^T y
3. LEARNING FULLY OBSERVED BNs
Easy thanks to decomposability
Simple GMs are the building blocks of complex BNs

[Figure: three building blocks. Density estimation (parametric and nonparametric methods): a single node X with parameters µ, σ. Regression (linear, conditional mixture, nonparametric): X → Y. Classification (generative and discriminative approaches): Q → X or Q ← X.]
Decomposable likelihood of a BN
• Consider the distribution defined by the directed acyclic GM:

  p(x \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)

• This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the BN X1 → X2, X1 → X3, {X2, X3} → X4 decomposes into four small BNs: X1 alone; X1 → X2; X1 → X3; {X2, X3} → X4]
Learning Fully Observed BNs

[Figure: the five-node BN X1 → X2; X2, X3 → X4; X3 → X5]

\theta^* = \arg\max_\theta \log p(X_1, X_2, X_3, X_4, X_5)
  = \arg\max_\theta \log p(X_5 \mid X_3, \theta_5) + \log p(X_4 \mid X_2, X_3, \theta_4)
    + \log p(X_3 \mid \theta_3) + \log p(X_2 \mid X_1, \theta_2) + \log p(X_1 \mid \theta_1)

The objective decomposes, so each factor can be learned separately:

\theta_1^* = \arg\max_{\theta_1} \log p(X_1 \mid \theta_1)
\theta_2^* = \arg\max_{\theta_2} \log p(X_2 \mid X_1, \theta_2)
\theta_3^* = \arg\max_{\theta_3} \log p(X_3 \mid \theta_3)
\theta_4^* = \arg\max_{\theta_4} \log p(X_4 \mid X_2, X_3, \theta_4)
\theta_5^* = \arg\max_{\theta_5} \log p(X_5 \mid X_3, \theta_5)
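A toy sketch of decomposable MLE (hypothetical binary data of my own, for the single edge X1 → X2): each conditional probability table is fit independently by counting, exactly as the per-factor argmax problems above suggest.

```python
import numpy as np

# Fully observed discrete BN: each CPT's MLE is a (conditional) empirical
# frequency. Here we fit p(X1) and p(X2 | X1) by counting.

rng = np.random.default_rng(5)
x1 = (rng.uniform(size=10_000) < 0.7).astype(int)
x2 = (rng.uniform(size=10_000) < np.where(x1 == 1, 0.9, 0.2)).astype(int)

# theta_1: p(X1 = 1), by moment matching on the indicator
p_x1 = x1.mean()
# theta_2: p(X2 = 1 | X1 = v), one estimate per parent value
p_x2_given_x1 = np.array([x2[x1 == v].mean() for v in (0, 1)])

print("p(X1=1) =", p_x1)                       # approx 0.7
print("p(X2=1|X1=0,1) =", p_x2_given_x1)       # approx [0.2, 0.9]
```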
Summary
1. Exponential Family Distributions
   – A candidate for marginal distributions, p(Xi)
   – Examples: multinomial, Dirichlet, Gaussian, gamma, Poisson
   – MLE has a closed-form solution
   – Bayesian estimation is easy with conjugate priors
   – Sufficient statistics can be read off by inspection
2. Generalized Linear Models
   – A convenient form for conditional distributions, p(Xj | Xi)
   – Special case: GLIMs with canonical response
     • The output y follows an exponential family
     • The input x is introduced via a linear combination
   – MLE for GLIMs with canonical response by SGD
   – In general, Bayesian estimation relies on approximations
3. Learning Fully Observed Bayes Nets
   – Easy thanks to decomposability