
Page 1:

Generalized Linear Models +
Learning Fully Observed Bayes Nets

Matt Gormley
Lecture 5
January 27, 2016
School of Computer Science

Readings: KF Chap. 17; Jordan Chap. 8; Jordan Chap. 9.1–9.2

10-708 Probabilistic Graphical Models

Page 2:

Machine Learning

The data inspires the structures we want to predict; it also tells us what to optimize. Our model defines a score for each structure. Learning tunes the parameters of the model. Inference finds the {best structure, marginals, partition function} for a new observation. (Inference is usually called as a subroutine in learning.)

[Figure: these components mapped to Domain Knowledge, Mathematical Modeling, Optimization / Combinatorial Optimization, and ML]

Page 3:

Machine Learning

Data, Model, Objective, Learning, Inference. (Inference is usually called as a subroutine in learning.)

[Figure: the pipeline illustrated with example sentences ("Alice saw Bob on a hill with a telescope", "time flies like an arrow"), candidate structures for each sentence, and a graphical model over X1, …, X5]

Page 4:

Today’s Lecture

[Directed graphical model over X1, …, X5]

p(X1, X2, X3, X4, X5) = p(X5|X3) p(X4|X2, X3) p(X3) p(X2|X1) p(X1)


Page 6:

Today’s Lecture

[Directed graphical model over X1, …, X5]

p(X1, X2, X3, X4, X5) = p(X5|X3) p(X4|X2, X3) p(X3) p(X2|X1) p(X1)

How do we define and learn these conditional and marginal distributions for a Bayes Net?

Page 7:

Today’s Lecture

1. Exponential Family Distributions: a candidate for the marginal distributions, p(Xi)
2. Generalized Linear Models: a convenient form for the conditional distributions, p(Xj | Xi)
3. Learning Fully Observed Bayes Nets: easy thanks to decomposability

Page 8:

1. EXPONENTIAL FAMILY

A candidate for the marginal distributions, p(Xi)

Page 9:

Why the Exponential Family?

1. Pitman-Koopman-Darmois theorem: it is the only family of distributions whose sufficient statistics do not grow with the size of the dataset
2. It is the only family of distributions for which conjugate priors exist (see the Murphy textbook for a description)
3. It is the distribution that is closest to uniform (i.e., it maximizes entropy) subject to moment-matching constraints
4. It is key to Generalized Linear Models (next section)
5. It includes some of your favorite distributions!

Adapted from the Murphy (2012) textbook

Page 10:

Whiteboard

• Definition of the multivariate exponential family
• Example 1: Categorical distribution
• Example 2: Dirichlet distribution

Page 11:

Exponential family, a basic building block

• For a numeric random variable X,

p(x | η) = h(x) exp{ η^T T(x) − A(η) }
         = (1/Z(η)) h(x) exp{ η^T T(x) }

is an exponential family distribution with natural (canonical) parameter η.
• The function T(x) is a sufficient statistic.
• The function A(η) = log Z(η) is the log normalizer.
• Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...

© Eric Xing @ CMU, 2005-2015
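As a quick worked instance of this definition (a sketch of mine, not on the original slide), the Bernoulli distribution with mean µ has η = log(µ/(1−µ)), T(x) = x, A(η) = log(1 + e^η), and h(x) = 1. A short NumPy check that the two parameterizations agree:

import numpy as np

# Bernoulli(mu) in exponential family form: p(x|eta) = h(x) exp{eta*T(x) - A(eta)}
mu = 0.3
eta = np.log(mu / (1 - mu))   # natural parameter
A = np.log(1 + np.exp(eta))   # log normalizer A(eta) = log Z(eta)

for x in (0, 1):
    p_expfam = np.exp(eta * x - A)          # h(x) = 1
    p_direct = mu**x * (1 - mu)**(1 - x)    # standard parameterization
    print(x, p_expfam, p_direct)            # the two agree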

Page 12:

Example: Multivariate Gaussian Distribution

• For a continuous vector random variable X ∈ R^k:

p(x | µ, Σ) = 1/((2π)^{k/2} |Σ|^{1/2}) exp{ −(1/2)(x − µ)^T Σ^{-1} (x − µ) }
            = exp{ −(1/2) tr(Σ^{-1} xx^T) + µ^T Σ^{-1} x − (1/2) µ^T Σ^{-1} µ − (1/2) log|Σ| − (k/2) log(2π) }

• Exponential family representation:

η = [ Σ^{-1}µ ; −(1/2) vec(Σ^{-1}) ]
T(x) = [ x ; vec(xx^T) ]
A(η) = (1/2) µ^T Σ^{-1} µ + (1/2) log|Σ|
h(x) = (2π)^{-k/2}

with µ the moment parameter and η the natural parameter; the map can be inverted to recover (µ, Σ) from η.

• Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positive definiteness, the parameters are constrained and have lower degrees of freedom).

© Eric Xing @ CMU, 2005-2015
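To make the parameter maps concrete, here is a small NumPy sketch (mine, not from the slides) converting moment parameters (µ, Σ) to natural parameters and back:

import numpy as np

# Moment parameters of a 2-D Gaussian.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Natural parameters: eta1 = Sigma^{-1} mu, eta2 = -(1/2) Sigma^{-1}.
Lam = np.linalg.inv(Sigma)
eta1 = Lam @ mu
eta2 = -0.5 * Lam

# Inverting the map recovers the moment parameters.
Sigma_rec = np.linalg.inv(-2.0 * eta2)
mu_rec = Sigma_rec @ eta1
print(np.allclose(mu, mu_rec), np.allclose(Sigma, Sigma_rec))  # True True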

Page 13:

Example: Multinomial distribution

• For a binary vector random variable x ~ multi(x | π) with Σ_k x_k = 1:

p(x | π) = π_1^{x_1} π_2^{x_2} ⋯ π_K^{x_K} = exp{ Σ_k x_k ln π_k }
         = exp{ Σ_{k=1}^{K−1} x_k ln π_k + (1 − Σ_{k=1}^{K−1} x_k) ln(1 − Σ_{k=1}^{K−1} π_k) }
         = exp{ Σ_{k=1}^{K−1} x_k ln( π_k / (1 − Σ_{j=1}^{K−1} π_j) ) + ln(1 − Σ_{k=1}^{K−1} π_k) }

• Exponential family representation:

η = [ ln(π_k / π_K) ; 0 ]
T(x) = x
A(η) = −ln(π_K) = ln( Σ_{k=1}^{K} e^{η_k} )
h(x) = 1

© Eric Xing @ CMU, 2005-2015

Page 14:

Cumulant Generating Property

• First cumulant (aka the mean):

dA/dη = d/dη log Z(η) = (1/Z(η)) dZ(η)/dη
      = ∫ T(x) h(x) exp{ η^T T(x) } dx / Z(η)
      = E[T(x)]

• Second cumulant (aka the variance, or second central moment):

d²A/dη² = ∫ T(x) T(x) h(x) exp{ η^T T(x) } dx / Z(η) − ( ∫ T(x) h(x) exp{ η^T T(x) } dx ) (1/Z(η)²) dZ/dη
        = E[T²(x)] − E[T(x)]²
        = Var[T(x)]

© Eric Xing @ CMU, 2005-2015
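A quick numerical illustration (a sketch of mine, not from the slides): for the Poisson, A(η) = e^η, so both dA/dη and d²A/dη² equal λ, which matches the Poisson's mean and variance. A finite-difference check:

import numpy as np

lam = 3.0
eta = np.log(lam)           # natural parameter of a Poisson
A = lambda e: np.exp(e)     # log normalizer A(eta) = e^eta

h = 1e-5
dA = (A(eta + h) - A(eta - h)) / (2 * h)              # first cumulant: mean
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # second cumulant: variance
print(dA, d2A)  # both approximately lam = 3.0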

Page 15:

Moment estimation

• We can easily compute the moments of any exponential family distribution by taking derivatives of the log normalizer A(η):

dA/dη = mean
d²A/dη² = variance
...

• The qth derivative gives the qth cumulant (the first two are the mean and variance).
• When the sufficient statistic is a stacked vector, partial derivatives need to be considered.

© Eric Xing @ CMU, 2005-2015

Page 16:

Moment vs canonical parameters

• The moment parameter µ can be derived from the natural (canonical) parameter:

dA(η)/dη = E[T(x)] =: µ

• A(η) is convex since

d²A(η)/dη² = Var[T(x)] > 0

• Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):

η =: ψ(µ)

• A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by µ (the moment parameterization).

[Figure: the convex log normalizer A(η)]

© Eric Xing @ CMU, 2005-2015

Page 17:

MLE for Exponential Family

• For iid data, the log-likelihood is

ℓ(η; D) = log ∏_n h(x_n) exp{ η^T T(x_n) − A(η) }
        = Σ_n log h(x_n) + η^T ( Σ_n T(x_n) ) − N A(η)

• Take the derivative and set it to zero:

∂ℓ/∂η = Σ_n T(x_n) − N ∂A(η)/∂η = 0
⇒ ∂A(η)/∂η = (1/N) Σ_n T(x_n)
⇒ µ̂_MLE = (1/N) Σ_n T(x_n)

• This amounts to moment matching.
• We can infer the canonical parameters using η̂_MLE = ψ(µ̂_MLE).

© Eric Xing @ CMU, 2005-2015
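In code, MLE for an exponential family reduces to averaging sufficient statistics and then inverting the link ψ. A minimal sketch (mine, not from the slides) for Bernoulli and Poisson data:

import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: T(x) = x, so mu_MLE is the sample mean; eta_MLE = log(mu/(1-mu)).
x_bern = rng.binomial(1, 0.3, size=10000)
mu_hat = x_bern.mean()                      # moment matching
eta_hat = np.log(mu_hat / (1 - mu_hat))     # eta_MLE = psi(mu_MLE)

# Poisson: T(x) = x, so lambda_MLE is the sample mean; eta_MLE = log(lambda).
x_pois = rng.poisson(4.0, size=10000)
lam_hat = x_pois.mean()
print(mu_hat, eta_hat, lam_hat, np.log(lam_hat))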

Page 18:

Examples

• Gaussian:

η = [ Σ^{-1}µ ; −(1/2) vec(Σ^{-1}) ]
T(x) = [ x ; vec(xx^T) ]
A(η) = (1/2) µ^T Σ^{-1} µ + (1/2) log|Σ|
h(x) = (2π)^{-k/2}
⇒ µ̂_MLE = (1/N) Σ_n T_1(x_n) = (1/N) Σ_n x_n

• Multinomial:

η = [ ln(π_k / π_K) ; 0 ]
T(x) = x
A(η) = −ln(π_K) = ln( Σ_{k=1}^{K} e^{η_k} )
h(x) = 1
⇒ µ̂_MLE = (1/N) Σ_n x_n

• Poisson:

p(x | λ) = (λ^x / x!) e^{−λ}
η = log λ
T(x) = x
A(η) = λ = e^η
h(x) = 1/x!
⇒ µ̂_MLE = (1/N) Σ_n x_n

© Eric Xing @ CMU, 2005-2015

Page 19:

Sufficiency

• For p(x | θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).
• We can throw away X for the purpose of inference w.r.t. θ.

• Bayesian view: p(θ | T(x), x) = p(θ | T(x))
• Frequentist view: p(x | T(x), θ) = p(x | T(x))

• The Neyman factorization theorem: T(x) is sufficient for θ if

p(x, T(x), θ) = ψ_1(T(x), θ) ψ_2(x, T(x))
⇒ p(x | θ) = g(T(x), θ) h(x, T(x))

[Small graphical models relating θ, T(x), and X accompany each view.]

© Eric Xing @ CMU, 2005-2015

Page 20:

Whiteboard

• Bayesian estimation of the exponential family

Page 21:

2. GENERALIZED LINEAR MODELS

Convenient form for the conditional distributions, p(Xj | Xi)

Page 22:

Why Generalized Linear Models (GLIMs)?

1. They generalize linear regression, logistic regression, probit regression, etc.
2. They provide a framework for creating new conditional distributions that come with some convenient properties.
3. Special case: GLIMs with canonical response functions are easy to train with MLE.
4. No Free Lunch: what about Bayesian estimation of GLIMs? Unfortunately, we have to turn to approximation techniques since, in general, there isn't a closed form for the posterior.

Page 23:

Generalized Linear Models (GLIMs)

[Plate diagram: Xn → Yn, replicated N times]

• The observed input x is assumed to enter into the model via a linear combination of its elements, ξ = θ^T x.
• The conditional mean µ is represented as a function f(ξ) of ξ, where f is known as the response function.
• The observed output y is assumed to be characterized by an exponential family distribution with conditional mean µ.

© Eric Xing @ CMU, 2005-2015

Page 24:

Whiteboard

• Constructive definition of GLIMs
• Definition of GLIMs with canonical response functions

Page 25:

Generalized Linear Models (GLIMs)

[Plate diagram: Xn → Yn, replicated N times]

• The graphical model covers both linear regression and discriminative linear classification.
• Commonality: we model E_p(Y) = µ = f(θ^T X).
• What is p()? The conditional distribution of Y.
• What is f()? The response function.

© Eric Xing @ CMU, 2005-2015

Page 26:

GLIM, cont.

• The choice of exponential family is constrained by the nature of the data Y.
  Example: y is a continuous vector → multivariate Gaussian; y is a class label → Bernoulli or multinomial.
• The choice of the response function: it must obey some mild constraints, e.g., a range of [0,1] or positivity.
• Canonical response function: f = ψ^{-1}(·). In this case θ^T x directly corresponds to the canonical parameter η.

The model pipeline is x →(θ) ξ →(f) µ →(ψ) η, with y drawn from the exponential family:

p(y | η) = h(y) exp{ η(x)^T y − A(η) }

and, with a scale (dispersion) parameter φ,

p(y | η, φ) = h(y, φ) exp{ (η(x)^T y − A(η)) / φ }

© Eric Xing @ CMU, 2005-2015

Page 27:

Example canonical response functions

[Table of example canonical response functions not captured in this transcript]

© Eric Xing @ CMU, 2005-2015

Page 28:

MLE for GLIMs with natural response

• Log-likelihood:

ℓ = Σ_n log h(y_n) + Σ_n ( θ^T x_n y_n − A(η_n) )

• Derivative of the log-likelihood:

dℓ/dθ = Σ_n ( x_n y_n − (dA(η_n)/dη_n)(dη_n/dθ) )
      = Σ_n (y_n − µ_n) x_n
      = X^T (y − µ)

This is a fixed-point function because µ is a function of θ.

• Online learning for canonical GLIMs: stochastic gradient ascent = the least mean squares (LMS) algorithm:

θ^{t+1} = θ^t + ρ (y_n − µ_n^t) x_n

where µ_n^t = (θ^t)^T x_n and ρ is a step size.

© Eric Xing @ CMU, 2005-2015
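The update above is a one-liner per example. A minimal sketch (mine, not from the slides; the function name and response argument are illustrative) of the stochastic gradient ascent rule θ ← θ + ρ (y_n − µ_n) x_n, which covers linear regression (identity response) and logistic regression (sigmoid response):

import numpy as np

def sgd_glim(X, y, response, rho=0.01, epochs=20):
    """LMS-style rule for a canonical GLIM: theta += rho * (y_n - mu_n) * x_n."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            mu_n = response(theta @ x_n)        # mu = f(theta^T x)
            theta += rho * (y_n - mu_n) * x_n   # ascend the log-likelihood
    return theta

# Logistic regression: sigmoid is the canonical response for a Bernoulli output.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_theta = np.array([1.5, -2.0])
y = (rng.random(500) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)
print(sgd_glim(X, y, response=lambda z: 1 / (1 + np.exp(-z))))  # roughly recovers true_theta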

Page 29:

Batch learning for canonical GLIMs

• The Hessian matrix:

H = d²ℓ/(dθ dθ^T) = d/dθ^T Σ_n (y_n − µ_n) x_n
  = −Σ_n x_n (dµ_n/dθ^T)
  = −Σ_n x_n (dµ_n/dη_n)(dη_n/dθ^T)
  = −Σ_n x_n (dµ_n/dη_n) x_n^T      (since η_n = θ^T x_n)
  = −X^T W X

where X = [x_1^T ; x_2^T ; … ; x_N^T] is the design matrix, y = (y_1, y_2, …, y_N)^T, and

W = diag( dµ_1/dη_1, …, dµ_N/dη_N ),

which can be computed by taking the second derivative of A(η_n).

© Eric Xing @ CMU, 2005-2015

Page 30:

Recall LMS

• Cost function in matrix form:

J(θ) = (1/2) Σ_{i=1}^{N} (x_i^T θ − y_i)²
     = (1/2) (Xθ − y)^T (Xθ − y)

where X = [x_1^T ; x_2^T ; … ; x_N^T] and y = (y_1, y_2, …, y_N)^T.

• To minimize J(θ), take the derivative and set it to zero:

∇_θ J = (1/2) ∇_θ ( θ^T X^T X θ − θ^T X^T y − y^T X θ + y^T y )
      = (1/2) ( 2 X^T X θ − 2 X^T y )
      = X^T X θ − X^T y = 0

⇒ X^T X θ = X^T y      (the normal equations)

θ* = (X^T X)^{-1} X^T y

© Eric Xing @ CMU, 2005-2015
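In NumPy the normal equations are a single call (a sketch of mine, not from the slides); solving the linear system is preferred over forming the explicit inverse of X^T X:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

# Solve X^T X theta = X^T y directly instead of inverting X^T X.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)  # close to true_theta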

Page 31:

Iteratively Reweighted Least Squares (IRLS)

• Recall the Newton-Raphson method with cost function J:

θ^{t+1} = θ^t − H^{-1} ∇_θ J

• We now have

∇_θ ℓ = X^T (y − µ)
H = −X^T W X

• So the update is:

θ^{t+1} = θ^t + H^{-1} ∇_θ ℓ |_{θ^t}
        = (X^T W^t X)^{-1} [ (X^T W^t X) θ^t + X^T (y − µ^t) ]
        = (X^T W^t X)^{-1} X^T W^t z^t

• where the adjusted response is

z^t = X θ^t + (W^t)^{-1} (y − µ^t)

• This can be understood as solving the following iteratively reweighted least squares problem:

θ^{t+1} = argmin_θ (z − Xθ)^T W (z − Xθ)

(compare with ordinary least squares, θ* = (X^T X)^{-1} X^T y).

© Eric Xing @ CMU, 2005-2015

Page 32:

Example 1: logistic regression (sigmoid classifier)

• The conditional distribution is a Bernoulli:

p(y | x) = µ(x)^y (1 − µ(x))^{1−y}

where µ is a logistic function: µ(x) = 1 / (1 + e^{−η(x)})

• p(y | x) is an exponential family distribution, with mean

E[y | x] = µ = 1 / (1 + e^{−η(x)})

and canonical response function

η = ξ = θ^T x

• IRLS:

dµ/dη = µ(1 − µ)
W = diag( µ_1(1 − µ_1), …, µ_N(1 − µ_N) )

© Eric Xing @ CMU, 2005-2015
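Putting the last two slides together, here is a compact IRLS sketch for logistic regression (my own implementation of θ^{t+1} = (X^T W X)^{-1} X^T W z; the clipping of W and the tiny ridge term are numerical-stability assumptions of mine, not part of the slides):

import numpy as np

def irls_logistic(X, y, iters=20, ridge=1e-8):
    """IRLS: repeatedly solve the weighted least squares problem in theta."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-(X @ theta)))        # mean: sigmoid(theta^T x)
        w = np.clip(mu * (1 - mu), 1e-6, None)     # W = diag(mu_n (1 - mu_n))
        z = X @ theta + (y - mu) / w               # adjusted response
        A = X.T @ (w[:, None] * X) + ridge * np.eye(X.shape[1])
        theta = np.linalg.solve(A, X.T @ (w * z))  # weighted least squares step
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (rng.random(500) < 1 / (1 + np.exp(-X @ np.array([1.5, -2.0])))).astype(float)
print(irls_logistic(X, y))  # roughly [1.5, -2.0]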

Page 33:

Logistic regression: practical issues

• It is very common to use regularized maximum likelihood:

p(y = ±1 | x, θ) = σ(y θ^T x) = 1 / (1 + e^{−y θ^T x})
p(θ) ~ Normal(0, λ^{-1} I)
ℓ(θ) = Σ_n log σ(y_n θ^T x_n) − (λ/2) θ^T θ

∇ℓ(θ) = Σ_n (1 − σ(y_n θ^T x_n)) y_n x_n − λθ

• IRLS takes O(N d³) per iteration, where N is the number of training cases and d is the dimension of the input x.
• Quasi-Newton methods, which approximate the Hessian, work faster.
• Conjugate gradient takes O(N d) per iteration and usually works best in practice.
• Stochastic gradient descent can also be used if N is large (cf. the perceptron rule).

© Eric Xing @ CMU, 2005-2015

Page 34:

Example 2: linear regression

• The conditional distribution is a Gaussian:

p(y | x, θ, Σ) = 1/((2π)^{k/2} |Σ|^{1/2}) exp{ −(1/2)(y − µ(x))^T Σ^{-1} (y − µ(x)) }
             ⇒ h(x, y) exp{ η(x)^T y − A(η) }      (after rescaling)

where µ is a linear function: µ(x) = θ^T x = η(x)

• p(y | x) is an exponential family distribution, with mean

E[y | x] = µ = θ^T x

and canonical response function

η = ξ = θ^T x

• IRLS:

dµ/dη = 1, W = I

θ^{t+1} = (X^T W^t X)^{-1} X^T W^t z^t
        = θ^t + (X^T X)^{-1} X^T (y − µ^t)      (steepest descent)
t → ∞ ⇒ θ = (X^T X)^{-1} X^T y      (the normal equation)

© Eric Xing @ CMU, 2005-2015

Page 35:

Today’s Lecture

1. Exponential Family Distributions: a candidate for the marginal distributions, p(Xi)
2. Generalized Linear Models: a convenient form for the conditional distributions, p(Xj | Xi)
3. Learning Fully Observed Bayes Nets: easy thanks to decomposability

Page 36:

3. LEARNING FULLY OBSERVED BNS

Easy thanks to decomposability

Page 37:

Simple GMs are the building blocks of complex BNs

• Classification: generative and discriminative approaches
• Regression: linear, conditional mixture, nonparametric
• Density estimation: parametric and nonparametric methods

[Figure: a small GM for each task, e.g., Q → X (classification), X → Y (regression), (µ, σ) → X (density estimation)]

© Eric Xing @ CMU, 2005-2015

Page 38:

Decomposable likelihood of a BN

• Consider the distribution defined by the directed acyclic GM:

p(x | θ) = p(x1 | θ1) p(x2 | x1, θ2) p(x3 | x1, θ3) p(x4 | x2, x3, θ4)

• This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the BN over X1, …, X4 decomposed into four node-plus-parents subgraphs: X1; X1 → X2; X1 → X3; (X2, X3) → X4]

© Eric Xing @ CMU, 2005-2015

Page 39:

Learning Fully Observed BNs

[Directed graphical model over X1, …, X5]

θ* = argmax_θ log p(X1, X2, X3, X4, X5)
   = argmax_θ log p(X5|X3, θ5) + log p(X4|X2, X3, θ4)
              + log p(X3|θ3) + log p(X2|X1, θ2)
              + log p(X1|θ1)

The maximization decomposes into independent subproblems:

θ1* = argmax_{θ1} log p(X1|θ1)
θ2* = argmax_{θ2} log p(X2|X1, θ2)
θ3* = argmax_{θ3} log p(X3|θ3)
θ4* = argmax_{θ4} log p(X4|X2, X3, θ4)
θ5* = argmax_{θ5} log p(X5|X3, θ5)
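For discrete CPTs, this decomposition means each conditional distribution can be fit by simple counting. A minimal sketch (mine; binary variables and made-up parameter values, using the X1, …, X5 structure from the slide):

import numpy as np

rng = np.random.default_rng(0)
N = 10000
# Sample fully observed data from the structure X1->X2, (X2,X3)->X4, X3->X5.
X1 = rng.binomial(1, 0.6, N)
X2 = rng.binomial(1, np.where(X1 == 1, 0.7, 0.2))
X3 = rng.binomial(1, 0.5, N)
X4 = rng.binomial(1, 0.1 + 0.4 * X2 + 0.3 * X3)
X5 = rng.binomial(1, np.where(X3 == 1, 0.9, 0.3))

def cpt(child, *parents):
    """MLE of p(child = 1 | parents) by counting, one entry per parent setting."""
    if not parents:
        return child.mean()
    return {setting: child[np.all([p == s for p, s in zip(parents, setting)],
                                  axis=0)].mean()
            for setting in np.ndindex(*(2,) * len(parents))}

# Each theta_i is estimated independently, thanks to decomposability.
print(cpt(X1), cpt(X2, X1), cpt(X3), cpt(X4, X2, X3), cpt(X5, X3))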

Page 40:

Summary

1. Exponential Family Distributions
– A candidate for the marginal distributions, p(Xi)
– Examples: multinomial, Dirichlet, Gaussian, gamma, Poisson
– MLE has a closed-form solution
– Bayesian estimation is easy with conjugate priors
– Sufficient statistics can be read off by inspection

2. Generalized Linear Models
– A convenient form for the conditional distributions, p(Xj | Xi)
– Special case: GLIMs with canonical response
  • The output y follows an exponential family distribution
  • The input x is introduced via a linear combination
– MLE for GLIMs with canonical response by SGD
– In general, Bayesian estimation relies on approximations

3. Learning Fully Observed Bayes Nets
– Easy thanks to decomposability