TRANSCRIPT
Generalized Linear Models +
Learning Fully Observed Bayes Nets

Matt Gormley, Lecture 5
January 27, 2016
School of Computer Science
10-708 Probabilistic Graphical Models
Readings: KF Chap. 17; Jordan Chap. 8; Jordan Chap. 9.1-9.2
(Several slides © Eric Xing @ CMU, 2005-2015)
Machine Learning

• Data: the data inspires the structures we want to predict; it also tells us what to optimize.
• Model: our model defines a score for each structure.
• Learning: learning tunes the parameters of the model.
• Inference: inference finds the {best structure, marginals, partition function} for a new observation.

Tying these together draws on domain knowledge, mathematical modeling, and (combinatorial) optimization: that is ML. (Inference is usually called as a subroutine in learning.)
[Figure: two different parse trees for the ambiguous sentence "Alice saw Bob on a hill with a telescope"]

[Figure: several different parse trees for the ambiguous sentence "time flies like an arrow"]
Today's Lecture

[Figure: a directed graphical model over X1, ..., X5, with edges X1 → X2; X2, X3 → X4; X3 → X5]

p(X1, X2, X3, X4, X5) = p(X5|X3) p(X4|X2, X3) p(X3) p(X2|X1) p(X1)
How do we define and learn these conditional and marginal distributions for a Bayes Net?
Today's Lecture
1. Exponential Family Distributions: a candidate for marginal distributions, p(Xi)
2. Generalized Linear Models: a convenient form for conditional distributions, p(Xj | Xi)
3. Learning Fully Observed Bayes Nets: easy thanks to decomposability
1. EXPONENTIAL FAMILY
A candidate for marginal distributions, p(Xi)
Why the Exponential Family?
1. Pitman-Koopman-Darmois theorem: it is the only family of distributions whose sufficient statistics do not grow with the size of the dataset.
2. It is the only family of distributions for which conjugate priors exist (see the Murphy textbook for a description).
3. It is the distribution that is closest to uniform (i.e., maximizes entropy), subject to moment-matching constraints.
4. It is key to Generalized Linear Models (next section).
5. It includes some of your favorite distributions!

Adapted from Murphy (2012) textbook
Whiteboard
• Definition of the multivariate exponential family
• Example 1: Categorical distribution
• Example 2: Dirichlet distribution
Exponential family, a basic building block
• For a numeric random variable X,

  p(x \mid \eta) = h(x) \exp\{\eta^T T(x) - A(\eta)\} = \frac{1}{Z(\eta)} h(x) \exp\{\eta^T T(x)\}

  is an exponential family distribution with natural (canonical) parameter η.
• The function T(x) is a sufficient statistic.
• The function A(η) = log Z(η) is the log normalizer.
• Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
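To make the definition concrete, here is a minimal sketch (my own example, not from the slides) that writes the Bernoulli distribution in exponential family form and checks it against the standard pmf; the names `bernoulli_expfam`, `mu`, and `eta` are illustrative choices.

```python
import numpy as np

# A minimal sketch: the Bernoulli written in exponential family form
# p(x|eta) = h(x) exp{eta * T(x) - A(eta)}, with T(x) = x, h(x) = 1,
# eta = log(mu / (1 - mu)), and A(eta) = log(1 + e^eta).

def bernoulli_expfam(x, eta):
    A = np.log1p(np.exp(eta))      # log normalizer A(eta)
    return np.exp(eta * x - A)     # h(x) = 1 for the Bernoulli

mu = 0.3
eta = np.log(mu / (1 - mu))        # natural parameter
for x in (0, 1):
    direct = mu**x * (1 - mu)**(1 - x)
    assert np.isclose(bernoulli_expfam(x, eta), direct)
print("Bernoulli matches its exponential family form.")
```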
Example: Multivariate Gaussian Distribution
• For a continuous vector random variable X ∈ R^k:

  p(x \mid \mu, \Sigma)
    = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}
    = \exp\left\{ -\tfrac{1}{2}\mathrm{tr}(\Sigma^{-1} x x^T) + \mu^T \Sigma^{-1} x - \tfrac{1}{2}\mu^T \Sigma^{-1} \mu - \tfrac{1}{2}\log|\Sigma| - \tfrac{k}{2}\log(2\pi) \right\}

• Exponential family representation:

  \eta = [\Sigma^{-1}\mu ;\ \mathrm{vec}(-\tfrac{1}{2}\Sigma^{-1})] = [\eta_1 ;\ \mathrm{vec}(\eta_2)]
  so the moment parameters are \mu = -\tfrac{1}{2}\eta_2^{-1}\eta_1 and \Sigma = -\tfrac{1}{2}\eta_2^{-1}
  T(x) = [x ;\ \mathrm{vec}(x x^T)]
  A(\eta) = \tfrac{1}{2}\mu^T \Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\eta_1^T \eta_2^{-1}\eta_1 - \tfrac{1}{2}\log|-2\eta_2|
  h(x) = (2\pi)^{-k/2}

• Note: a k-dimensional Gaussian is a (k + k^2)-parameter distribution with a (k + k^2)-element vector of sufficient statistics, but because of symmetry and positive definiteness the parameters are constrained and have fewer degrees of freedom.
Example: Multinomial distribution
• For a binary (one-hot) vector random variable x ~ multinomial(x | π):

  p(x \mid \pi) = \pi_1^{x_1} \pi_2^{x_2} \cdots \pi_K^{x_K} = \exp\left\{ \sum_k x_k \ln \pi_k \right\}
    = \exp\left\{ \sum_{k=1}^{K-1} x_k \ln \pi_k + \left(1 - \sum_{k=1}^{K-1} x_k\right) \ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) \right\}
    = \exp\left\{ \sum_{k=1}^{K-1} x_k \ln \frac{\pi_k}{1 - \sum_{j=1}^{K-1} \pi_j} + \ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) \right\}

• Exponential family representation:

  \eta = \left[ \ln(\pi_k / \pi_K) ;\ 0 \right]
  T(x) = [x]
  A(\eta) = -\ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) = \ln\left( \sum_{k=1}^{K} e^{\eta_k} \right)
  h(x) = 1
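A quick sketch (my own, not from the slides) of the multinomial representation above: A(η) is a log-sum-exp, and the softmax of η recovers the probabilities π. The array values are arbitrary illustrative choices.

```python
import numpy as np

# For the multinomial, A(eta) = log sum_k exp(eta_k), and the moment
# parameters (probabilities) come back via the softmax
# pi_k = exp(eta_k - A(eta)).

pi = np.array([0.5, 0.3, 0.2])
eta = np.log(pi / pi[-1])              # eta_K = 0 by construction
A = np.log(np.sum(np.exp(eta)))        # log normalizer A(eta)
pi_recovered = np.exp(eta - A)         # softmax inverts the map
assert np.allclose(pi_recovered, pi)
print("A(eta) =", A, "recovers pi:", pi_recovered)
```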
Cumulant Generating Property
• First cumulant (i.e., the mean):

  \frac{dA}{d\eta} = \frac{d}{d\eta} \log Z(\eta)
    = \frac{1}{Z(\eta)} \frac{d Z(\eta)}{d\eta}
    = \frac{1}{Z(\eta)} \frac{d}{d\eta} \int h(x) \exp\{\eta^T T(x)\}\, dx
    = \int T(x) \frac{h(x) \exp\{\eta^T T(x)\}}{Z(\eta)}\, dx
    = E[T(x)]

• Second cumulant (i.e., the variance, the second central moment):

  \frac{d^2 A}{d\eta^2} = \frac{d}{d\eta} \int T(x) \frac{h(x) \exp\{\eta^T T(x)\}}{Z(\eta)}\, dx
    = E[T^2(x)] - E[T(x)]^2 = Var[T(x)]
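A quick numerical sanity check of the cumulant property (my own example, with an arbitrary rate `lam`): for the Poisson, A(η) = e^η with η = log λ, so dA/dη should equal the mean λ. A finite difference confirms it.

```python
import numpy as np

# Verify dA/deta = E[T(x)] for the Poisson, where A(eta) = exp(eta)
# and eta = log(lambda), using a central finite difference.

lam = 3.5
eta = np.log(lam)
eps = 1e-6
A = lambda e: np.exp(e)                          # Poisson log normalizer
dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)   # numerical derivative
assert np.isclose(dA, lam, atol=1e-4)
print("dA/deta =", dA, "matches the mean", lam)
```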
Moment estimation
• We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η).
• The qth derivative gives the qth cumulant:

  \frac{dA}{d\eta} = \text{mean}, \qquad \frac{d^2 A}{d\eta^2} = \text{variance}, \qquad \ldots

• When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
Moment vs canonical parameters
• The moment parameter µ can be derived from the natural (canonical) parameter:

  \frac{dA(\eta)}{d\eta} = E[T(x)] \overset{\text{def}}{=} \mu

• A(η) is convex since

  \frac{d^2 A(\eta)}{d\eta^2} = Var[T(x)] > 0

• Hence we can invert the relationship and infer the canonical parameter from the moment parameter (the map is one-to-one):

  \eta \overset{\text{def}}{=} \psi(\mu)

• A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by µ (the moment parameterization).

[Figure: a convex log normalizer A(η); the slope of A at η* gives the corresponding moment parameter]
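As a minimal illustration of the invertible map η = ψ(µ) (my own example, not from the slides): for the Bernoulli, the forward map is µ = dA/dη = sigmoid(η), and ψ is the logit.

```python
import numpy as np

# For the Bernoulli: mu = sigmoid(eta) and eta = psi(mu) = logit(mu).
# The two maps compose to the identity, so the relationship is 1-to-1.

sigmoid = lambda e: 1.0 / (1.0 + np.exp(-e))
logit = lambda m: np.log(m / (1.0 - m))        # psi(mu)

eta = 0.7
mu = sigmoid(eta)                              # moment parameter from eta
assert np.isclose(logit(mu), eta)              # psi inverts the map
print("eta =", eta, "-> mu =", mu, "-> eta =", logit(mu))
```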
MLE for Exponential Family
• For iid data, the log-likelihood is

  \ell(\eta; D) = \log \prod_n h(x_n) \exp\{\eta^T T(x_n) - A(\eta)\}
    = \sum_n \log h(x_n) + \eta^T \left( \sum_n T(x_n) \right) - N A(\eta)

• Take the derivative and set it to zero:

  \frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N \frac{\partial A(\eta)}{\partial \eta} = 0
  \;\Rightarrow\; \frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N} \sum_n T(x_n)
  \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N} \sum_n T(x_n)

• This amounts to moment matching.
• We can infer the canonical parameters using \hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE}).
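A short sketch of moment matching in practice (my own synthetic data): for the Poisson, T(x) = x, so µ_MLE is just the sample mean, and the canonical parameter follows from η = ψ(µ) = log µ.

```python
import numpy as np

# MLE in an exponential family is moment matching: mu_MLE = mean of T(x_n).
# For the Poisson we then recover eta via psi(mu) = log(mu).

rng = np.random.default_rng(0)
data = rng.poisson(lam=4.0, size=10_000)   # synthetic data, true lambda = 4

mu_mle = data.mean()                       # (1/N) sum_n T(x_n)
eta_mle = np.log(mu_mle)                   # psi(mu) for the Poisson
print("mu_MLE =", mu_mle, " eta_MLE =", eta_mle,
      " (true eta =", np.log(4.0), ")")
```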
Examples
• Gaussian:

  \eta = [\Sigma^{-1}\mu ;\ \mathrm{vec}(-\tfrac{1}{2}\Sigma^{-1})]
  T(x) = [x ;\ \mathrm{vec}(x x^T)]
  A(\eta) = \tfrac{1}{2}\mu^T \Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|
  h(x) = (2\pi)^{-k/2}

  \Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n

• Multinomial:

  \eta = [\ln(\pi_k / \pi_K) ;\ 0]
  T(x) = [x]
  A(\eta) = -\ln\left(1 - \sum_{k=1}^{K-1}\pi_k\right) = \ln\left(\sum_{k=1}^{K} e^{\eta_k}\right)
  h(x) = 1

  \Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n

• Poisson:

  \eta = \log \lambda
  T(x) = x
  A(\eta) = \lambda = e^{\eta}
  h(x) = \frac{1}{x!}

  \Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n
Sufficiency
• For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).
  We can throw away X for the purpose of inference w.r.t. θ.
• Bayesian view: p(\theta \mid T(x), x) = p(\theta \mid T(x))
• Frequentist view: p(x \mid T(x), \theta) = p(x \mid T(x))
• The Neyman factorization theorem: T(x) is sufficient for θ if

  p(x, T(x), \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x))
  \;\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\, h(x, T(x))
Whiteboard
• Bayesian estimation of the exponential family
2. GENERALIZED LINEAR MODELS
A convenient form for conditional distributions, p(Xj | Xi)
Why Generalized Linear Models (GLIMs)?
1. Generalization of linear regression, logistic regression, probit regression, etc.
2. Provides a framework for creating new conditional distributions that come with some convenient properties.
3. Special case: GLIMs with canonical response functions are easy to train with MLE.
4. No free lunch: what about Bayesian estimation of GLIMs? Unfortunately, we have to turn to approximation techniques, since in general there is no closed form for the posterior.
Generalized Linear Models (GLIMs)
• The observed input x is assumed to enter into the model via a linear combination of its elements, ξ = θ^T x.
• The conditional mean µ is represented as a function f(ξ) of ξ, where f is known as the response function.
• The observed output y is assumed to be characterized by an exponential family distribution with conditional mean µ.

[Graphical model: a plate over n = 1, ..., N with Xn → Yn]
Whiteboard
• Constructive definition of GLIMs
• Definition of GLIMs with canonical response functions
Generalized Linear Models (GLIMs)
• The graphical model is the same for linear regression and discriminative linear classification.
• Commonality: we model E_p[Y] = \mu = f(\theta^T X).
  What is p()? The conditional distribution of Y.
  What is f()? The response function.

GLIM, cont.
• The choice of exponential family is constrained by the nature of the data Y.
  Example: if y is a continuous vector, use a multivariate Gaussian; if y is a class label, use a Bernoulli or multinomial.
• The choice of the response function: it must satisfy some mild constraints, e.g., mapping into [0, 1] or ensuring positivity.
• Canonical response function: f = \psi^{-1}(\cdot). In this case \theta^T x directly corresponds to the canonical parameter \eta:

  p(y \mid \eta) = h(y)\, \exp\{\eta(x)^T y - A(\eta)\}

  and, with a dispersion (scale) parameter \phi,

  p(y \mid \eta, \phi) = h(y, \phi)\, \exp\{(\eta(x)^T y - A(\eta)) / \phi\}

[Figure: the GLIM pipeline x →(θ)→ ξ →(f)→ µ →(ψ)→ η → EXP → y]
Example canonical response functions
[Table not recoverable from the transcript]
MLE for GLIMs with natural response
• Log-likelihood:

  \ell = \sum_n \log h(y_n) + \sum_n \left( \theta^T x_n y_n - A(\eta_n) \right)

• Derivative of the log-likelihood:

  \frac{d\ell}{d\theta} = \sum_n \left( x_n y_n - \frac{dA(\eta_n)}{d\eta_n} \frac{d\eta_n}{d\theta} \right)
    = \sum_n (y_n - \mu_n) x_n
    = X^T (y - \mu)

  This is a fixed-point equation, because µ is a function of θ.

• Online learning for canonical GLIMs: stochastic gradient ascent = the least mean squares (LMS) algorithm:

  \theta^{t+1} = \theta^t + \rho (y_n - \mu_n^t) x_n

  where \mu_n^t = (\theta^t)^T x_n and ρ is a step size.
Batch learning for canonical GLIMs
• The Hessian matrix:

  H = \frac{d^2 \ell}{d\theta\, d\theta^T} = \frac{d}{d\theta^T} \sum_n (y_n - \mu_n) x_n
    = -\sum_n x_n \frac{d\mu_n}{d\theta^T}
    = -\sum_n x_n \frac{d\mu_n}{d\eta_n} \frac{d\eta_n}{d\theta^T}
    = -\sum_n x_n \frac{d\mu_n}{d\eta_n} x_n^T \quad (\text{since } \eta_n = \theta^T x_n)
    = -X^T W X

  where X = [x_1^T; x_2^T; \ldots; x_N^T] is the design matrix (one row per example), y = [y_1; y_2; \ldots; y_N], and

  W = \mathrm{diag}\left( \frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N} \right),

  which can be computed by calculating the 2nd derivative of A(η_n).
Recall LMS
• Cost function in matrix form:

  J(\theta) = \frac{1}{2} \sum_{i=1}^{N} (x_i^T \theta - y_i)^2
    = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y})

  where X stacks the inputs x_i^T as rows and \vec{y} = [y_1; y_2; \ldots; y_N].

• To minimize J(θ), take the derivative and set it to zero:

  \nabla_\theta J = \frac{1}{2} \nabla_\theta \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)
    = X^T X \theta - X^T \vec{y} = 0

  \Rightarrow X^T X \theta = X^T \vec{y} \quad \text{(the normal equations)}

  \Rightarrow \theta^* = (X^T X)^{-1} X^T \vec{y}
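A small sketch of the normal equations on synthetic data of my own; note that in practice one solves the linear system rather than forming the inverse explicitly.

```python
import numpy as np

# Solve theta* = (X^T X)^{-1} X^T y via the normal equations, and check
# against numpy's least-squares solver.

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, 0.5, -2.0])
y = X @ theta_true + 0.05 * rng.normal(size=200)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
assert np.allclose(theta_star, np.linalg.lstsq(X, y, rcond=None)[0])
print("theta* =", theta_star)
```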
Iteratively Reweighted Least Squares (IRLS)
• Recall the Newton-Raphson method with cost function J:

  \theta^{t+1} = \theta^t - H^{-1} \nabla_\theta J

• We now have

  \nabla_\theta \ell = X^T (y - \mu), \qquad H = -X^T W X

• So:

  \theta^{t+1} = \theta^t - H^{-1} \nabla_\theta \ell \big|_{\theta^t}
    = (X^T W^t X)^{-1} \left[ X^T W^t X \theta^t + X^T (y - \mu^t) \right]
    = (X^T W^t X)^{-1} X^T W^t z^t

  where the adjusted response is

  z^t = X \theta^t + (W^t)^{-1} (y - \mu^t)

• This can be understood as solving the following iteratively reweighted least squares problem:

  \theta^{t+1} = \arg\min_\theta (z - X\theta)^T W (z - X\theta)
Example 1: logistic regression (sigmoid classifier)
• The conditional distribution is a Bernoulli:

  p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1-y}

  where µ is a logistic function: \mu(x) = \frac{1}{1 + e^{-\eta(x)}}

• p(y|x) is an exponential family distribution with
  mean: E[y \mid x] = \mu = \frac{1}{1 + e^{-\eta(x)}}
  and canonical response function: \eta = \xi = \theta^T x

• IRLS:

  \frac{d\mu}{d\eta} = \mu(1 - \mu), \qquad
  W = \mathrm{diag}\big( \mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N) \big)
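A compact sketch of IRLS for logistic regression (my own synthetic data; the fixed iteration count is an illustrative simplification, not a convergence criterion):

```python
import numpy as np

# IRLS: repeat theta <- (X^T W X)^{-1} X^T W z, with W = diag(mu*(1-mu))
# and adjusted response z = X theta + W^{-1}(y - mu).

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
theta_true = np.array([1.5, -2.0])
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(2)
for _ in range(10):                              # a few Newton steps suffice
    mu = 1 / (1 + np.exp(-X @ theta))            # current means
    w = mu * (1 - mu)                            # diagonal of W
    z = X @ theta + (y - mu) / w                 # adjusted response
    theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
print("IRLS estimate:", theta)
```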
Logistic regression: practical issues
• It is very common to use regularized maximum likelihood:

  p(y = \pm 1 \mid x, \theta) = \frac{1}{1 + e^{-y\,\theta^T x}} = \sigma(y\, \theta^T x)

  p(\theta) = \mathrm{Normal}(0, \lambda^{-1} I)

  \ell(\theta) = \sum_n \log \sigma(y_n \theta^T x_n) - \frac{\lambda}{2} \theta^T \theta

• IRLS takes O(N d^3) per iteration, where N is the number of training cases and d is the dimension of the input x.
• Quasi-Newton methods, which approximate the Hessian, work faster.
• Conjugate gradient takes O(N d) per iteration and usually works best in practice.
• Stochastic gradient descent can also be used if N is large; cf. the perceptron rule:

  \nabla \ell = \sum_n \left(1 - \sigma(y_n \theta^T x_n)\right) y_n x_n - \lambda \theta
Example 2: linear regression
• The conditional distribution is a Gaussian:

  p(y \mid x, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(y - \mu(x))^T \Sigma^{-1} (y - \mu(x)) \right\}
    \Rightarrow h(y, x)\, \exp\{ \eta(x)^T y - A(\eta) \}

  where µ is a linear function: \mu(x) = \theta^T x

• p(y|x) is an exponential family distribution with
  mean: E[y \mid x] = \mu = \theta^T x
  and canonical response function: \eta = \xi = \theta^T x

• IRLS: \frac{d\mu}{d\eta} = 1, so W = I and

  \theta^{t+1} = (X^T W^t X)^{-1} X^T W^t z^t = \theta^t + (X^T X)^{-1} X^T (y - \mu^t)

  The update has the form of a rescaled steepest-descent step, and as t → ∞ it converges to the normal equation solution:

  \theta = (X^T X)^{-1} X^T y
3. LEARNING FULLY OBSERVED BNs
Easy thanks to decomposability
Simple GMs are the building blocks of complex BNs

[Figure: three building blocks. Density estimation (parametric and nonparametric methods): a single node X with parameters µ, σ. Regression (linear, conditional mixture, nonparametric): X → Y. Classification (generative and discriminative approaches): Q → X or Q ← X.]
Decomposable likelihood of a BN
• Consider the distribution defined by the directed acyclic GM:

  p(x \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)

• This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the BN X1 → X2, X1 → X3, {X2, X3} → X4 decomposes into four small BNs: X1 alone; X1 → X2; X1 → X3; {X2, X3} → X4]
Learning Fully Observed BNs

[Figure: the five-node BN X1 → X2; X2, X3 → X4; X3 → X5]

\theta^* = \arg\max_\theta \log p(X_1, X_2, X_3, X_4, X_5)
  = \arg\max_\theta \log p(X_5 \mid X_3, \theta_5) + \log p(X_4 \mid X_2, X_3, \theta_4)
    + \log p(X_3 \mid \theta_3) + \log p(X_2 \mid X_1, \theta_2) + \log p(X_1 \mid \theta_1)

The objective decomposes, so each factor can be learned separately:

\theta_1^* = \arg\max_{\theta_1} \log p(X_1 \mid \theta_1)
\theta_2^* = \arg\max_{\theta_2} \log p(X_2 \mid X_1, \theta_2)
\theta_3^* = \arg\max_{\theta_3} \log p(X_3 \mid \theta_3)
\theta_4^* = \arg\max_{\theta_4} \log p(X_4 \mid X_2, X_3, \theta_4)
\theta_5^* = \arg\max_{\theta_5} \log p(X_5 \mid X_3, \theta_5)
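A toy sketch of decomposable MLE (hypothetical binary data of my own, for the single edge X1 → X2): each conditional probability table is fit independently by counting, exactly as the per-factor argmax problems above suggest.

```python
import numpy as np

# Fully observed discrete BN: each CPT's MLE is a (conditional) empirical
# frequency. Here we fit p(X1) and p(X2 | X1) by counting.

rng = np.random.default_rng(5)
x1 = (rng.uniform(size=10_000) < 0.7).astype(int)
x2 = (rng.uniform(size=10_000) < np.where(x1 == 1, 0.9, 0.2)).astype(int)

# theta_1: p(X1 = 1), by moment matching on the indicator
p_x1 = x1.mean()
# theta_2: p(X2 = 1 | X1 = v), one estimate per parent value
p_x2_given_x1 = np.array([x2[x1 == v].mean() for v in (0, 1)])

print("p(X1=1) =", p_x1)                       # approx 0.7
print("p(X2=1|X1=0,1) =", p_x2_given_x1)       # approx [0.2, 0.9]
```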
Summary
1. Exponential Family Distributions
   – A candidate for marginal distributions, p(Xi)
   – Examples: multinomial, Dirichlet, Gaussian, gamma, Poisson
   – MLE has a closed-form solution
   – Bayesian estimation is easy with conjugate priors
   – Sufficient statistics can be read off by inspection
2. Generalized Linear Models
   – A convenient form for conditional distributions, p(Xj | Xi)
   – Special case: GLIMs with canonical response
     • The output y follows an exponential family
     • The input x is introduced via a linear combination
   – MLE for GLIMs with canonical response by SGD
   – In general, Bayesian estimation relies on approximations
3. Learning Fully Observed Bayes Nets
   – Easy thanks to decomposability