School of Computer Science
Probabilistic Graphical Models
Learning generalized linear models and tabular CPT of fully observed BN
Eric Xing
Lecture 7, September 30, 2009
Reading:
[Figure: example Bayesian network structures over the nodes X1, X2, X3, X4]
Exponential family
For a numeric random variable X,
p(x|\eta) = h(x) \exp\{ \eta^T T(x) - A(\eta) \} = \frac{1}{Z(\eta)} h(x) \exp\{ \eta^T T(x) \}
is an exponential family distribution with natural (canonical) parameter \eta.
Function T(x) is a sufficient statistic.
Function A(\eta) = \log Z(\eta) is the log normalizer.
Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
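As a concrete illustration (our own example, not from the slides), the Bernoulli distribution with mean \mu fits this template with \eta = \log(\mu/(1-\mu)), T(x) = x, A(\eta) = \log(1 + e^\eta) and h(x) = 1; a minimal NumPy check:

```python
import numpy as np

def bernoulli_exp_family(x, eta):
    """p(x|eta) = h(x) exp{eta*T(x) - A(eta)} with T(x) = x, h(x) = 1,
    and log normalizer A(eta) = log(1 + exp(eta))."""
    A = np.log1p(np.exp(eta))            # log normalizer
    return np.exp(eta * x - A)

mu = 0.3                                 # moment (mean) parameter
eta = np.log(mu / (1 - mu))              # natural parameter
print(bernoulli_exp_family(1, eta))      # ~0.3
print(bernoulli_exp_family(0, eta))      # ~0.7
```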
Recall Linear Regression
Let us assume that the target variable and the inputs are related by the equation
y_i = \theta^T x_i + \epsilon_i
where \epsilon is an error term of unmodeled effects or random noise.
Now assume that \epsilon follows a Gaussian N(0, \sigma); then we have:
p(y_i | x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)
Recall: Logistic Regression (sigmoid classifier)
The conditional distribution is a Bernoulli:
p(y|x) = \mu(x)^y (1 - \mu(x))^{1-y}
where \mu is a logistic function:
\mu(x) = \frac{1}{1 + e^{-\theta^T x}}
We could use the brute-force gradient method as in LR, but we can also apply generic laws by observing that p(y|x) is an exponential family function, more specifically, a generalized linear model!
Example: Multivariate Gaussian Distribution
For a continuous vector random variable X \in R^k:
p(x|\mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\{ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \}
 = \frac{1}{(2\pi)^{k/2}} \exp\{ -\tfrac{1}{2} \mathrm{tr}(\Sigma^{-1} x x^T) + \mu^T \Sigma^{-1} x - \tfrac{1}{2} \mu^T \Sigma^{-1} \mu - \tfrac{1}{2} \log|\Sigma| \}
Exponential family representation
Natural parameter: \eta = [\Sigma^{-1}\mu ; \mathrm{vec}(-\tfrac{1}{2}\Sigma^{-1})], i.e. \eta_1 = \Sigma^{-1}\mu and \eta_2 = -\tfrac{1}{2}\Sigma^{-1}
Moment parameter: (\mu, \Sigma), with \mu = -\tfrac{1}{2}\eta_2^{-1}\eta_1
Sufficient statistic: T(x) = [x ; \mathrm{vec}(x x^T)]
Log normalizer: A(\eta) = \tfrac{1}{2}\mu^T \Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\eta_1^T \eta_2^{-1}\eta_1 - \tfrac{1}{2}\log|-2\eta_2|
h(x) = (2\pi)^{-k/2}
Note: a k-dimensional Gaussian is a (k+k^2)-parameter distribution with a (k+k^2)-element vector of sufficient statistics (but because of symmetry and positivity, the parameters are constrained and have a lower degree of freedom).
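As a sketch (our own code, not part of the lecture), the mapping between the moment parameters (\mu, \Sigma) and the natural parameters (\eta_1, \eta_2) of a multivariate Gaussian can be checked numerically:

```python
import numpy as np

def gaussian_natural_params(mu, Sigma):
    """Natural parameters of a multivariate Gaussian:
    eta1 = Sigma^{-1} mu,  eta2 = -1/2 Sigma^{-1}."""
    Sigma_inv = np.linalg.inv(Sigma)
    return Sigma_inv @ mu, -0.5 * Sigma_inv

def gaussian_moment_params(eta1, eta2):
    """Invert the mapping: Sigma = -1/2 eta2^{-1},  mu = Sigma eta1."""
    Sigma = -0.5 * np.linalg.inv(eta2)
    return Sigma @ eta1, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
eta1, eta2 = gaussian_natural_params(mu, Sigma)
mu_rec, Sigma_rec = gaussian_moment_params(eta1, eta2)
assert np.allclose(mu, mu_rec) and np.allclose(Sigma, Sigma_rec)
```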
Example: Multinomial distribution
For a binary vector random variable x ~ multi(x|\pi):
p(x|\pi) = \pi_1^{x_1} \pi_2^{x_2} \cdots \pi_K^{x_K} = \exp\{ \sum_k x_k \ln\pi_k \}
 = \exp\{ \sum_{k=1}^{K-1} x_k \ln\pi_k + (1 - \sum_{k=1}^{K-1} x_k) \ln(1 - \sum_{k=1}^{K-1} \pi_k) \}
 = \exp\{ \sum_{k=1}^{K-1} x_k \ln\frac{\pi_k}{1 - \sum_{k=1}^{K-1}\pi_k} + \ln(1 - \sum_{k=1}^{K-1}\pi_k) \}
Exponential family representation
\eta = [\ln(\pi_k/\pi_K) ; 0], where \pi_K = 1 - \sum_{k=1}^{K-1}\pi_k
T(x) = [x]
A(\eta) = -\ln(1 - \sum_{k=1}^{K-1}\pi_k) = \ln(\sum_{k=1}^{K} e^{\eta_k})
h(x) = 1
Why exponential family? Moment generating property
The first derivative of the log normalizer gives the mean of the sufficient statistic:
\frac{dA(\eta)}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta)
 = \int T(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx = E[T(x)]
The second derivative gives the variance:
\frac{d^2 A(\eta)}{d\eta^2} = \int T(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,T(x)^T dx - \int T(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx \cdot \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta)
 = E[T(x)T(x)^T] - E[T(x)]\,E[T(x)]^T = Var[T(x)]
Moment estimation
We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(\eta). The q-th derivative gives the q-th centered moment:
dA(\eta)/d\eta = mean
d^2 A(\eta)/d\eta^2 = variance
...
When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
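A quick numeric sanity check of this property (our own code, using the Poisson example where A(\eta) = e^\eta): finite-difference derivatives of A match the sampled moments.

```python
import numpy as np

# Poisson in exponential family form: eta = log(lambda), T(x) = x, A(eta) = exp(eta).
lam = 3.5
eta = np.log(lam)
A = np.exp                                   # log normalizer A(eta) = e^eta

eps = 1e-5
dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)            # finite-difference dA/deta
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2  # d^2A/deta^2

samples = np.random.default_rng(0).poisson(lam, size=200_000)
print(dA, samples.mean())   # both ~ lambda (first moment)
print(d2A, samples.var())   # both ~ lambda (variance)
```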
Moment vs canonical parameters
The moment parameter \mu can be derived from the natural (canonical) parameter:
\frac{dA(\eta)}{d\eta} = E[T(x)] \stackrel{def}{=} \mu
A(\eta) is convex since
\frac{d^2 A(\eta)}{d\eta^2} = Var[T(x)] > 0
[Figure: a convex log normalizer A(\eta) plotted against \eta, with its minimizer \eta^*]
Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):
\eta \stackrel{def}{=} \psi(\mu)
A distribution in the exponential family can be parameterized not only by \eta (the canonical parameterization) but also by \mu (the moment parameterization).
MLE for Exponential Family
For iid data, the log-likelihood is
\ell(\eta; D) = \log \prod_n h(x_n)\exp\{ \eta^T T(x_n) - A(\eta) \}
 = \sum_n \log h(x_n) + \eta^T\left( \sum_n T(x_n) \right) - N A(\eta)
Take derivatives and set to zero:
\frac{\partial\ell}{\partial\eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial\eta} = 0
\Rightarrow \frac{\partial A(\eta)}{\partial\eta} = \frac{1}{N}\sum_n T(x_n)
\Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)
This amounts to moment matching.
We can infer the canonical parameters using \hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE}).
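A small worked example of moment matching (our own sketch): for a univariate Gaussian, T(x) = (x, x^2), so the MLE simply matches the first two empirical moments.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)

# MLE for an exponential-family model = moment matching on T(x).
# For a univariate Gaussian, T(x) = (x, x^2), so we match E[x] and E[x^2].
m1 = x.mean()              # (1/N) sum_n T_1(x_n)
m2 = (x ** 2).mean()       # (1/N) sum_n T_2(x_n)

mu_mle = m1
var_mle = m2 - m1 ** 2     # equals the (biased) ML variance estimate
print(mu_mle, var_mle)     # ~2.0, ~2.25
```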
Sufficiency
For p(x|\theta), T(x) is sufficient for \theta if there is no information in X regarding \theta beyond that in T(x). We can throw away X for the purpose of inference w.r.t. \theta.
Bayesian view: p(\theta | T(x), x) = p(\theta | T(x))
Frequentist view: p(x | T(x), \theta) = p(x | T(x))
The Neyman factorization theorem
T(x) is sufficient for \theta if
p(x, T(x), \theta) = \psi_1(T(x), \theta)\,\psi_2(x, T(x))
\Rightarrow p(x|\theta) = g(T(x), \theta)\,h(x, T(x))
Examples
Gaussian:
\eta = [\Sigma^{-1}\mu ; \mathrm{vec}(-\tfrac{1}{2}\Sigma^{-1})]
T(x) = [x ; \mathrm{vec}(x x^T)]
A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|
h(x) = (2\pi)^{-k/2}
\Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n
Multinomial:
\eta = [\ln(\pi_k/\pi_K) ; 0]
T(x) = [x]
A(\eta) = -\ln(1 - \sum_{k=1}^{K-1}\pi_k) = \ln(\sum_{k=1}^{K} e^{\eta_k})
h(x) = 1
\Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n
Poisson:
\eta = \log\lambda
T(x) = x
A(\eta) = \lambda = e^{\eta}
h(x) = \frac{1}{x!}
\Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n
Generalized Linear Models (GLIMs)
The graphical model
Linear regression
Discriminative linear classification
Commonality: model E_p(Y) = \mu = f(\theta^T X)
What is p()? the conditional distribution of Y.
What is f()? the response function.
GLIM
The observed input x is assumed to enter into the model via a linear combination of its elements, \xi = \theta^T x.
The conditional mean \mu is represented as a function f(\xi) of \xi, where f is known as the response function.
The observed output y is assumed to be characterized by an exponential family distribution with conditional mean \mu.
GLIM, cont.
[Figure: GLIM pipeline x \to \xi = \theta^T x \to \mu = f(\xi) \to \eta = \psi(\mu) \to y \sim EXP]
p(y|\eta) = h(y)\exp\{ \eta^T y - A(\eta) \}
\Rightarrow p(y|\eta, \phi) = h(y, \phi)\exp\{ \tfrac{1}{\phi}(\eta^T y - A(\eta)) \}
The choice of exp family is constrained by the nature of the data Y.
Example: y is a continuous vector → multivariate Gaussian; y is a class label → Bernoulli or multinomial.
The choice of the response function
Following some mild constraints, e.g., [0,1], positivity, ...
Canonical response function: f = \psi^{-1}(\cdot)
In this case \theta^T x directly corresponds to the canonical parameter \eta.
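To make the pipeline concrete, here is a small sketch (our own code; the set of canonical response functions is an assumed summary table, and only the Gaussian and Bernoulli cases are worked out later in the lecture):

```python
import numpy as np

# Canonical response functions f = psi^{-1} for a few common GLIMs.
RESPONSES = {
    "gaussian":  lambda xi: xi,                          # identity: mu = theta^T x
    "bernoulli": lambda xi: 1.0 / (1.0 + np.exp(-xi)),   # logistic sigmoid
    "poisson":   lambda xi: np.exp(xi),                  # exponential
}

def glim_mean(theta, x, family):
    """E[y|x] = f(theta^T x) for a GLIM with a canonical link."""
    xi = x @ theta                     # linear combination of the inputs
    return RESPONSES[family](xi)

theta = np.array([0.5, -1.0])
x = np.array([[1.0, 2.0], [1.0, 0.0]])
print(glim_mean(theta, x, "bernoulli"))
```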
MLE for GLIMs with natural response
Log-likelihood:
\ell(\theta; D) = \sum_n \log h(y_n) + \sum_n \left( \theta^T x_n y_n - A(\eta_n) \right)
Derivative of the log-likelihood:
\frac{d\ell}{d\theta} = \sum_n \left( x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta} \right)
 = \sum_n (y_n - \mu_n) x_n
 = X^T(y - \mu)
This is a fixed point function, because \mu is a function of \theta.
Online learning for canonical GLIMs
Stochastic gradient ascent = least mean squares (LMS) algorithm:
\theta^{(t+1)} = \theta^{(t)} + \rho\,(y_n - \mu_n^{(t)})\,x_n
where \mu_n^{(t)} = (\theta^{(t)})^T x_n and \rho is a step size.
Batch learning for canonical GLIMs
The Hessian matrix:
H = \frac{d^2\ell}{d\theta\,d\theta^T} = -\sum_n \frac{d\mu_n}{d\theta^T} x_n
 = -\sum_n x_n \frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T}
 = -\sum_n x_n x_n^T \frac{d\mu_n}{d\eta_n}   (since \eta_n = \theta^T x_n)
 = -X^T W X
where X = [x_n^T] is the design matrix (rows x_1^T, ..., x_N^T), y = [y_1, ..., y_N]^T, and
W = \mathrm{diag}\left( \frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N} \right)
which can be computed by calculating the 2nd derivative of A(\eta_n).
Recall LMS
Cost function in matrix form:
J(\theta) = \tfrac{1}{2}\sum_{i=1}^n (x_i^T\theta - y_i)^2 = \tfrac{1}{2}(X\theta - y)^T(X\theta - y)
To minimize J(\theta), take the derivative and set it to zero:
\nabla_\theta J = \tfrac{1}{2}\nabla_\theta\left( \theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y \right)
 = \tfrac{1}{2}\left( 2X^T X\theta - 2X^T y \right)
 = X^T X\theta - X^T y = 0
\Rightarrow X^T X\theta = X^T y   (the normal equations)
\Rightarrow \theta^* = (X^T X)^{-1} X^T y
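A minimal NumPy sketch of solving the normal equations on synthetic data (our own example; np.linalg.solve is used instead of forming the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # design matrix with intercept
theta_true = np.array([0.5, 2.0, -1.0])
y = X @ theta_true + 0.05 * rng.normal(size=100)

# Normal equations: theta* = (X^T X)^{-1} X^T y.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_star)                                    # ~[0.5, 2.0, -1.0]
```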
Iteratively Reweighted Least Squares (IRLS)
Recall the Newton-Raphson method with cost function J:
\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta J
We now have
\nabla_\theta\ell = X^T(y - \mu)
H = -X^T W X
Now:
\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta\ell
 = \theta^{(t)} + (X^T W^{(t)} X)^{-1} X^T (y - \mu^{(t)})
 = (X^T W^{(t)} X)^{-1}\left[ X^T W^{(t)} X\theta^{(t)} + X^T(y - \mu^{(t)}) \right]
 = (X^T W^{(t)} X)^{-1} X^T W^{(t)} z^{(t)}
where the adjusted response is
z^{(t)} = X\theta^{(t)} + (W^{(t)})^{-1}(y - \mu^{(t)})
This can be understood as solving the following "iteratively reweighted least squares" problem:
\theta^{(t+1)} = \arg\min_\theta\, (z - X\theta)^T W (z - X\theta)
Example 1: logistic regression (sigmoid classifier)
The conditional distribution is a Bernoulli:
p(y|x) = \mu(x)^y (1 - \mu(x))^{1-y}
where \mu is a logistic function:
\mu(x) = \frac{1}{1 + e^{-\eta(x)}}
p(y|x) is an exponential family function, with mean
E[y|x] = \mu = \frac{1}{1 + e^{-\eta(x)}}
and canonical response function
\eta = \xi = \theta^T x
IRLS:
\frac{d\mu}{d\eta} = \mu(1 - \mu)
W = \mathrm{diag}\big( \mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N) \big)
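A sketch of the IRLS update above for logistic regression (our own code, under the assumption of 0/1 labels; the small clip on W is a numerical safeguard, not part of the derivation):

```python
import numpy as np

def irls_logistic(X, y, num_iters=20):
    """IRLS for logistic regression:
    theta^{t+1} = (X^T W X)^{-1} X^T W z, with W = diag(mu*(1-mu)) and
    adjusted response z = X theta + W^{-1} (y - mu)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ theta)))
        w = np.clip(mu * (1.0 - mu), 1e-10, None)   # diagonal of W
        z = X @ theta + (y - mu) / w                # adjusted response
        theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return theta

# Toy example with labels y in {0, 1}.
rng = np.random.default_rng(3)
X = np.c_[np.ones(500), rng.normal(size=(500, 2))]
p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, 1.0, -2.0]))))
y = (rng.uniform(size=500) < p).astype(float)
print(irls_logistic(X, y))
```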
Logistic regression: practical issues
It is very common to use regularized maximum likelihood:
p(y = \pm 1 \,|\, x, \theta) = \sigma(y\,\theta^T x) = \frac{1}{1 + e^{-y\theta^T x}}
p(\theta) = \mathrm{Normal}(0, \lambda^{-1} I)
\ell(\theta) = \sum_n \log\sigma(y_n\theta^T x_n) - \frac{\lambda}{2}\theta^T\theta
IRLS takes O(Nd^3) per iteration, where N = number of training cases and d = dimension of input x.
Quasi-Newton methods, which approximate the Hessian, work faster.
Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
Stochastic gradient descent can also be used if N is large, c.f. the perceptron rule:
\nabla_\theta\ell_n = \big(1 - \sigma(y_n\theta^T x_n)\big)\,y_n x_n - \lambda\theta
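A hedged sketch of the stochastic gradient variant (our own code; labels are taken in {-1, +1} as in the slide, and the step size, regularization strength, and toy data are illustrative):

```python
import numpy as np

def sgd_regularized_logistic(X, y, lam=0.1, rho=0.05, epochs=20, seed=0):
    """Stochastic gradient ascent on the regularized log-likelihood,
    using the per-example gradient from the slide:
    grad_n = (1 - sigma(y_n theta^T x_n)) y_n x_n - lam * theta."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):
            margin = y[n] * (X[n] @ theta)
            sigma = 1.0 / (1.0 + np.exp(-margin))
            theta += rho * ((1.0 - sigma) * y[n] * X[n] - lam * theta)
    return theta

rng = np.random.default_rng(5)
X = np.c_[np.ones(300), rng.normal(size=(300, 2))]
y = np.sign(X @ np.array([0.2, 1.5, -1.0]) + 0.3 * rng.normal(size=300))
print(sgd_regularized_logistic(X, y))
```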
Example 2: linear regression
The conditional distribution is a Gaussian:
p(y|x, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\{ -\tfrac{1}{2}(y - \mu(x))^T\Sigma^{-1}(y - \mu(x)) \}
 \Rightarrow (after rescaling) \; h(y)\exp\{ \eta(x)^T y - A(\eta) \}
where \mu is a linear function:
\mu(x) = \theta^T x = \eta(x)
p(y|x) is an exponential family function, with mean
E[y|x] = \mu = \theta^T x
and canonical response function
\eta = \xi = \theta^T x
IRLS:
\frac{d\mu}{d\eta} = 1, \quad W = I
\theta^{(t+1)} = (X^T W^{(t)} X)^{-1} X^T W^{(t)} z^{(t)} = \theta^{(t)} + (X^T X)^{-1} X^T (y - \mu^{(t)})   (steepest descent)
As t \to \infty: \theta = (X^T X)^{-1} X^T Y   (the normal equation)
Simple GMs are the building blocks of complex BNs
Density estimation: parametric and nonparametric methods   [Figure: single node X with parameters \mu, \sigma]
Regression: linear, conditional mixture, nonparametric   [Figure: X → Y]
Classification: generative and discriminative approach   [Figure: two-node models over Q and X]
An (incomplete) genealogy of graphical models
The structures of most GMs (e.g., all listed here) are not learned from data but designed by humans.
But such designs are useful and indeed favored, because human knowledge is thereby put to good use ...
MLE for general BNs
If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
\ell(\theta; D) = \log p(D|\theta) = \log\prod_n \prod_i p(x_{n,i} \,|\, x_{n,\pi_i}, \theta_i) = \sum_i\left( \sum_n \log p(x_{n,i} \,|\, x_{n,\pi_i}, \theta_i) \right)
[Figure: example network with observed configurations such as X2 = 0/1 and X5 = 0/1]
Example: decomposable likelihood of a directed model
Consider the distribution defined by the directed acyclic GM:
p(x|\theta) = p(x_1|\theta_1)\,p(x_2|x_1, \theta_2)\,p(x_3|x_1, \theta_3)\,p(x_4|x_2, x_3, \theta_4)
This is exactly like learning four separate small BNs, each of which consists of a node and its parents.
[Figure: the BN over X1, X2, X3, X4 split into four node-plus-parents fragments]
MLE for BNs with tabular CPDs
Assume each CPD is represented as a table (multinomial) where
\theta_{ijk} \stackrel{def}{=} p(X_i = j \,|\, X_{\pi_i} = k)
Note that in the case of multiple parents, X_{\pi_i} will have a composite state, and the CPD will be a high-dimensional table.
The sufficient statistics are counts of family configurations:
n_{ijk} \stackrel{def}{=} \sum_n x_{n,i}^j x_{n,\pi_i}^k
The log-likelihood is
\ell(\theta; D) = \log\prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk}\log\theta_{ijk}
Using a Lagrange multiplier to enforce \sum_j \theta_{ijk} = 1, we get:
\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}
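A minimal sketch of this counting recipe for a single node with one (possibly composite) parent (our own code; variable names and the toy data are illustrative):

```python
import numpy as np

def mle_tabular_cpt(child_vals, parent_vals, n_child, n_parent):
    """MLE of a tabular CPD theta[j, k] = p(X_i = j | X_pa = k) from fully
    observed data: normalized counts of family configurations n_ijk."""
    counts = np.zeros((n_child, n_parent))
    for j, k in zip(child_vals, parent_vals):
        counts[j, k] += 1                                  # n_ijk
    return counts / counts.sum(axis=0, keepdims=True)      # divide by sum_j' n_ij'k

# Toy example: parent X_pa in {0, 1}, child X_i in {0, 1, 2}.
rng = np.random.default_rng(4)
pa = rng.integers(0, 2, size=1000)
child = np.where(pa == 0,
                 rng.choice(3, 1000, p=[.7, .2, .1]),
                 rng.choice(3, 1000, p=[.1, .3, .6]))
print(mle_tabular_cpt(child, pa, n_child=3, n_parent=2))
```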
MLE and Kullback-Leibler divergence
KL divergence:
D(q(x)\,\|\,p(x)) = \sum_x q(x)\log\frac{q(x)}{p(x)}
Empirical distribution:
\tilde{p}(x) \stackrel{def}{=} \frac{1}{N}\sum_{n=1}^N \delta(x, x_n)
where \delta(x, x_n) is a Kronecker delta function.
D(\tilde{p}(x)\,\|\,p(x|\theta)) = \sum_x \tilde{p}(x)\log\frac{\tilde{p}(x)}{p(x|\theta)}
 = \sum_x \tilde{p}(x)\log\tilde{p}(x) - \sum_x \tilde{p}(x)\log p(x|\theta)
 = C - \frac{1}{N}\sum_n \log p(x_n|\theta)
 = C - \frac{1}{N}\ell(\theta; D)
Hence \max_\theta (MLE) \equiv \min_\theta (KL).
Parameter sharing
Consider a time-invariant (stationary) 1st-order Markov model.
[Figure: chain X1 → X2 → X3 → ... → XT]
Initial state probability vector:
\pi_k \stackrel{def}{=} p(X_1^k = 1)
State transition probability matrix:
A_{ij} \stackrel{def}{=} p(X_t^j = 1 \,|\, X_{t-1}^i = 1)
The joint:
p(X|\theta) = p(x_1|\pi)\prod_{t=2}^T p(X_t \,|\, X_{t-1})
The log-likelihood:
\ell(\theta; D) = \sum_n \log p(x_{n,1}|\pi) + \sum_n\sum_{t=2}^T \log p(x_{n,t} \,|\, x_{n,t-1}, A)
Again, we optimize each parameter separately.
\pi is a multinomial frequency vector, and we've seen it before.
What about A?
Learning a Markov chain transition matrix
A is a stochastic matrix: \sum_j A_{ij} = 1.
Each row of A is a multinomial distribution.
So the MLE of A_{ij} is the fraction of transitions from i to j:
A_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n\sum_{t=2}^T x_{n,t-1}^i x_{n,t}^j}{\sum_n\sum_{t=2}^T x_{n,t-1}^i}
Application: if the states X_t represent words, this is called a bigram language model.
Sparse data problem: if i → j did not occur in the data, we will have A_{ij} = 0, and then any future sequence containing the word pair i → j will have zero probability.
A standard hack: backoff smoothing or deleted interpolation
\tilde{A}_{i\to\cdot} = \lambda\eta + (1 - \lambda)A^{ML}_{i\to\cdot}
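A small sketch of a bigram model with interpolation smoothing (our own code; here the backoff distribution \eta is taken to be the unigram distribution, which is one common choice, and the toy sentences are illustrative):

```python
from collections import Counter, defaultdict

def bigram_model(sentences, lam=0.9):
    """MLE bigram probabilities A_ij = #(i->j)/#(i->.), interpolated with the
    unigram distribution to avoid zeros:
    A_tilde(i->j) = lam * A_ML(i->j) + (1 - lam) * p_unigram(j)."""
    bigram, unigram = defaultdict(Counter), Counter()
    for words in sentences:
        unigram.update(words)
        for prev, cur in zip(words, words[1:]):
            bigram[prev][cur] += 1
    total = sum(unigram.values())

    def prob(prev, cur):
        ml = bigram[prev][cur] / sum(bigram[prev].values()) if bigram[prev] else 0.0
        return lam * ml + (1 - lam) * unigram[cur] / total

    return prob

p = bigram_model([["the", "dog", "barks"], ["the", "cat", "sleeps"]])
print(p("the", "dog"), p("dog", "cat"))   # the second is nonzero thanks to smoothing
```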
Bayesian language model
Global and local parameter independence
[Figure: Markov chain X1 → X2 → ... → XT with parameters \pi, A_{i\cdot}, A_{i'\cdot} and Dirichlet hyperparameters \alpha, \beta]
The posterior of A_{i\cdot} and A_{i'\cdot} is factorized despite the v-structure on X_t, because X_{t-1} acts like a multiplexer.
Assign a Dirichlet prior \beta_i to each row of the transition matrix:
A_{ij}^{Bayes} \stackrel{def}{=} p(j \,|\, i, D, \beta_i) = \frac{\#(i\to j) + \beta_{i,j}}{\#(i\to\cdot) + |\beta_i|} = \lambda_i\beta'_{i,j} + (1 - \lambda_i)A_{ij}^{ML}, \quad where \; \lambda_i = \frac{|\beta_i|}{\#(i\to\cdot) + |\beta_i|}
We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.)
Example: HMM: two scenarios
Supervised learning: estimation when the "right answer" is known.
Examples:
GIVEN: a genomic region x = x1...x1,000,000 where we have good (experimental) annotations of the CpG islands.
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.
Unsupervised learning: estimation when the "right answer" is unknown.
Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.
QUESTION: Update the parameters \theta of the model to maximize P(x|\theta) --- maximum likelihood (ML) estimation.
Recall definition of HMM
[Figure: HMM with hidden states y1, y2, y3, ..., yT and observations x1, x2, x3, ..., xT]
Transition probabilities between any two states:
p(y_t^j = 1 \,|\, y_{t-1}^i = 1) = a_{i,j}
or p(y_t \,|\, y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \; \forall i \in I
Start probabilities:
p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)
Emission probabilities associated with each state:
p(x_t \,|\, y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \; \forall i \in I
or in general:
p(x_t \,|\, y_t^i = 1) \sim f(\cdot \,|\, \theta_i), \; \forall i \in I
Supervised ML estimation
Given x = x1...xN for which the true state path y = y1...yN is known,
Define:
A_{ij} = # times state transition i→j occurs in y
B_{ik} = # times state i in y emits k in x
We can show that the maximum likelihood parameters \theta are:
a_{ij}^{ML} = \frac{\#(i\to j)}{\#(i\to\cdot)} = \frac{\sum_n\sum_{t=2}^T y_{n,t-1}^i y_{n,t}^j}{\sum_n\sum_{t=2}^T y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}
b_{ik}^{ML} = \frac{\#(i\to k)}{\#(i\to\cdot)} = \frac{\sum_n\sum_{t=1}^T y_{n,t}^i x_{n,t}^k}{\sum_n\sum_{t=1}^T y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}
What if x is continuous? We can treat \{ (x_{n,t}, y_{n,t}) : t = 1..T, n = 1..N \} as N \times T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian ...
Supervised ML estimation, ctd.
Intuition: when we know the underlying states, the best estimate of \theta is the average frequency of transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting:
P(x|\theta) is maximized, but \theta is unreasonable.
0 probabilities --- VERY BAD.
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F
Then: a_FF = 1; a_FL = 0
b_F1 = b_F3 = .2; b_F2 = .3; b_F4 = 0; b_F5 = b_F6 = .1
Pseudocounts
Solution for small training sets: add pseudocounts
A_{ij} = # times state transition i→j occurs in y + R_{ij}
B_{ik} = # times state i in y emits k in x + S_{ik}
R_{ij}, S_{ik} are pseudocounts representing our prior belief.
Total pseudocounts: R_i = \sum_j R_{ij}, S_i = \sum_k S_{ik}
--- the "strength" of the prior belief,
--- the total number of imaginary instances in the prior.
Larger total pseudocounts ⇒ stronger prior belief.
Small total pseudocounts: just to avoid 0 probabilities --- smoothing.
This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts.
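A sketch of supervised HMM estimation by counting, with a uniform pseudocount added to every transition and emission cell (our own code; the encoding of the casino rolls as 0-based symbols is an illustrative choice):

```python
import numpy as np

def supervised_hmm_mle(state_seqs, obs_seqs, n_states, n_obs, pseudo=1.0):
    """Supervised ML estimation of HMM parameters by counting transitions and
    emissions along the known state paths, with pseudocounts R_ij = S_ik = pseudo."""
    A = np.full((n_states, n_states), pseudo)   # transition counts + R_ij
    B = np.full((n_states, n_obs), pseudo)      # emission counts + S_ik
    for y, x in zip(state_seqs, obs_seqs):
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1
    # Normalize each row to get a_ij and b_ik.
    return A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)

# The casino example: 10 rolls, all from the Fair state (state 0); faces 1..6 -> 0..5.
y = [[0] * 10]
x = [[1, 0, 4, 5, 0, 1, 2, 5, 1, 2]]
A, B = supervised_hmm_mle(y, x, n_states=2, n_obs=6, pseudo=1.0)
print(B[0])   # no zero emission probabilities, unlike the raw ML estimate
```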
How to define parameter prior?
[Figure: BN over Earthquake, Burglary, Radio, Alarm, Call]
Factorization:
p(x) = \prod_{i=1}^M p(x_i \,|\, x_{\pi_i})
Local distributions defined by, e.g., multinomial parameters:
p(x_i^k \,|\, x_{\pi_i}^j) = \theta_{x_i^k | x_{\pi_i}^j}
p(\theta | G)?
Assumptions (Geiger & Heckerman 97, 99):
Complete Model Equivalence
Global Parameter Independence
Local Parameter Independence
Likelihood and Prior Modularity
Global & Local Parameter Independence
Global Parameter Independence: for every DAG model,
p(\theta_m | G) = \prod_{i=1}^M p(\theta_i | G)
Local Parameter Independence: for every node,
p(\theta_i | G) = \prod_{j=1}^{q_i} p(\theta_{x_i^k | x_{\pi_i}^j} | G)
e.g., \theta_{Call | Alarm = YES} is independent of \theta_{Call | Alarm = NO}.
Parameter Independence, Graphical View
[Figure: samples 1, 2, ..., M of (X1, X2) sharing the parameters \theta_1 and \theta_{2|1}; global and local parameter independence shown by the absence of edges between parameters]
Provided all variables are observed in all cases, we can perform Bayesian updates of each parameter independently!
Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97, 99)
Discrete DAG models:
x_i \,|\, x_{\pi_i}^j \sim \mathrm{Multi}(\theta)
Dirichlet prior:
P(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)}\prod_k \theta_k^{\alpha_k - 1} = C(\alpha)\prod_k \theta_k^{\alpha_k - 1}
Gaussian DAG models:
x_i \,|\, x_{\pi_i}^j \sim \mathrm{Normal}(\mu, \Sigma)
Normal prior:
p(\mu \,|\, \nu, \Psi) = (2\pi)^{-n/2}|\Psi|^{-1/2}\exp\{ -\tfrac{1}{2}(\mu - \nu)'\Psi^{-1}(\mu - \nu) \}
Normal-Wishart prior:
p(\mu \,|\, \nu, \alpha_\mu, W) = \mathrm{Normal}(\nu, (\alpha_\mu W)^{-1})
p(W \,|\, \alpha_w, T) = c(n, \alpha_w)\,|T|^{\alpha_w/2}\,|W|^{(\alpha_w - n - 1)/2}\exp\{ -\tfrac{1}{2}\mathrm{tr}(T W) \}, \quad where \; W = \Sigma^{-1}
Summary: Learning GM
For exponential family distributions, MLE amounts to moment matching.
GLIM: natural response; Iteratively Reweighted Least Squares as a general algorithm.
For fully observed BNs, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored.