Probabilistic Graphical Models Srihari
Learning with Approximate Inference
Sargur Srihari [email protected]
Topics
• Learning MN parameters using Gradient Ascent and Belief Propagation
  – Difficulty with exact methods
• Approximate methods
  – Approximate Inference
    • Belief Propagation
    • Sampling-based Learning
    • MAP-based Learning
• Ex: CRF for Protein Structure Prediction
Likelihood function for a Markov Network
• The log-linear form of an MN with parameters θ is

  P(X1,..,Xn; θ) = (1/Z(θ)) exp{ Σi=1..k θi fi(Di) }

  where θ = {θ1,..,θk} are k parameters, each associated with a feature fi defined over instances of Di, and the partition function is defined as

  Z(θ) = Σξ exp{ Σi θi fi(ξ) }

• We wish to find the θ that maximizes the log-likelihood

  ℓ(θ : D) = Σi θi ( Σm fi(ξ[m]) ) − M ln Z(θ)
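The log-likelihood above can be evaluated directly for a small model. The following brute-force sketch uses the lecture's later running example: the pairwise network A—B—C with clusters {A,B}, {B,C}, {C,A}, two shared features f00 and f11, and three data instances. Everything is enumeration over the 2^3 assignments, so it is illustrative only, not a scalable implementation.

```python
import itertools
import math

CLUSTERS = [(0, 1), (1, 2), (2, 0)]  # variable-index pairs (A,B), (B,C), (C,A)

def feature_counts(xi):
    """Pooled counts of f00 and f11 over all clusters for a full assignment xi."""
    f00 = sum(1 for i, j in CLUSTERS if xi[i] == 0 and xi[j] == 0)
    f11 = sum(1 for i, j in CLUSTERS if xi[i] == 1 and xi[j] == 1)
    return f00, f11

def log_Z(theta):
    """ln Z(θ) = ln Σ_ξ exp{Σ_i θ_i f_i(ξ)}, by explicit enumeration."""
    return math.log(sum(
        math.exp(theta[0] * f00 + theta[1] * f11)
        for f00, f11 in map(feature_counts, itertools.product([0, 1], repeat=3))))

def log_likelihood(theta, data):
    """ℓ(θ:D) = Σ_i θ_i (Σ_m f_i(ξ[m])) − M ln Z(θ)."""
    total00 = sum(feature_counts(xi)[0] for xi in data)
    total11 = sum(feature_counts(xi)[1] for xi in data)
    return theta[0] * total00 + theta[1] * total11 - len(data) * log_Z(theta)

data = [(0, 0, 0), (0, 1, 0), (1, 0, 0)]  # the three instances used later in the lecture
print(log_likelihood((0.0, 0.0), data))   # at θ = 0, ℓ = −M ln 8
```

At θ = 0 every assignment has weight 1, so Z = 2^3 = 8 and ℓ = −M ln 8, a quick sanity check on the implementation.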
Gradient-based Learning of Undirected PGMs

Goal: Determine the parameters θ of a Markov network over variables χ, given a data set D and k features fi, where

  Pθ(χ) = (1/Z(θ)) exp[ Σi=1..k θi fi(Di) ]

The gradient-ascent loop:
1. Initialize θ.
2. Run inference to compute Z(θ) and Eθ[fi]:
   Z(θ) = Σξ exp{ Σi θi fi(ξ) }
   Eθ[fi] = (1/Z(θ)) Σξ fi(ξ) exp{ Σj θj fj(ξ) }
3. Compute the gradient of the likelihood ℓ:
   ∂ℓ(θ;D)/∂θi = ED[fi(χ)] − Eθ[fi]
   ∇ℓ(θ) = [ ∂ℓ(θ)/∂θ1, .., ∂ℓ(θ)/∂θk ]
4. Update θ with step size η: θt+1 ← θt + η∇ℓ(θt;D).
5. If the optimum is reached (‖θt+1 − θt‖ ≤ δ), stop; otherwise return to step 2.

The method assumes that we are able to compute the partition function Z(θ), which is a sum over the unnormalized probabilities of all assignments ξ, and the two expectations ED[fi(χ)] and Eθ[fi]. Both Z(θ) and Eθ[fi] require inference to compute the probability of each ξ.
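The loop above can be sketched end-to-end on a tiny model. This uses the pairwise A—B—C network with shared features f00/f11 that appears later in the lecture, with inference done by exhaustive enumeration (feasible only for toy models). One assumption: a fourth instance (1,1,1) is added to the lecture's three data instances so that both features have positive counts and the ML optimum is finite; with a zero count, θ11 would diverge to −∞ and regularization would be needed.

```python
import itertools
import math

# Gradient ascent for a tiny log-linear MN: pairwise A—B—C with clusters
# {A,B}, {B,C}, {C,A} and two shared features f00, f11.
CLUSTERS = [(0, 1), (1, 2), (2, 0)]
ALL = list(itertools.product([0, 1], repeat=3))

def feats(xi):
    """Pooled counts of f00 and f11 over all clusters for a full assignment xi."""
    return (sum(1 for i, j in CLUSTERS if xi[i] == 0 and xi[j] == 0),
            sum(1 for i, j in CLUSTERS if xi[i] == 1 and xi[j] == 1))

def model_expectations(theta):
    """Inference step: Z(θ) and E_θ[f_i], here by exhaustive enumeration."""
    w = [math.exp(theta[0] * f0 + theta[1] * f1) for f0, f1 in map(feats, ALL)]
    Z = sum(w)
    e0 = sum(wi * feats(xi)[0] for wi, xi in zip(w, ALL)) / Z
    e1 = sum(wi * feats(xi)[1] for wi, xi in zip(w, ALL)) / Z
    return Z, (e0, e1)

def learn(data, eta=0.1, delta=1e-7, max_iters=20000):
    """θ_{t+1} ← θ_t + η∇ℓ with ∂ℓ/∂θ_i = E_D[f_i] − E_θ[f_i]; stop when ‖step‖ ≤ δ."""
    ed = [sum(feats(xi)[k] for xi in data) / len(data) for k in (0, 1)]
    theta = [0.0, 0.0]
    for _ in range(max_iters):
        _, e = model_expectations(theta)
        step = [eta * (ed[k] - e[k]) for k in (0, 1)]
        theta = [t + s for t, s in zip(theta, step)]
        if math.hypot(step[0], step[1]) <= delta:
            break
    return theta

theta = learn([(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)])
print(theta)  # at convergence the model moments match the empirical ones
```

At the optimum the stationarity condition Eθ[fi] = ED[fi] holds, which is the moment-matching property of maximum likelihood for log-linear models.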
Exact Inference: Belief Propagation
[Figure: (a) the Markov network over A, B, C, D; (b) a cluster graph; (c) the clique tree]

1. Gibbs Distribution:

  P(A,B,C,D) = (1/Z) φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)

  where

  Z = ΣA,B,C,D φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A) = 7,201,840

  Cliques: 1. {A,B,D}  2. {B,C,D}, with sepset {B,D}
[Tables from Figure 10.6 — the clique and sepset beliefs:]

β1(A,B,D):
  a0,b0,d0: 600,000    a0,b0,d1: 300,030
  a0,b1,d0: 5,000,500  a0,b1,d1: 1,000
  a1,b0,d0: 200        a1,b0,d1: 1,000,100
  a1,b1,d0: 100,010    a1,b1,d1: 200,000

β2(B,C,D):
  b0,c0,d0: 300,100    b0,c0,d1: 1,300,000
  b0,c1,d0: 300,100    b0,c1,d1: 130
  b1,c0,d0: 510        b1,c0,d1: 100,500
  b1,c1,d0: 5,100,000  b1,c1,d1: 100,500

μ1,2(B,D):
  b0,d0: 600,200       b0,d1: 1,300,130
  b1,d0: 5,100,510     b1,d1: 201,000
reparameterization; clique tree invariant

Figure 10.6: The clique and sepset beliefs for the Misconception example.

Using equation (10.9), the denominator can be rewritten so that each message δi→j appears exactly once in the numerator and exactly once in the denominator, so that all messages cancel. The remaining expression is simply the product of the initial potentials, i.e., the unnormalized measure. Thus, via equation (10.10), the clique and sepset beliefs provide a reparameterization of the unnormalized measure. This property is called the clique tree invariant, for reasons that become clear later in the chapter.

Another intuition for this result can be obtained from the following example.

Example 10.5: Consider a clique tree obtained from the Markov network A—B—C—D, with factors Φ. Our clique tree in this case would have three cliques C1 = {A,B}, C2 = {B,C}, and C3 = {C,D}. When the clique tree is calibrated, we have that β1(A,B) = PΦ(A,B) and β2(B,C) = PΦ(B,C). From the conditional independence properties of this network we have

  PΦ(A,B,C) = PΦ(A,B)·PΦ(C | B), and PΦ(C | B) = PΦ(B,C)/PΦ(B).

As β2(B,C) = PΦ(B,C), we can obtain PΦ(B) by marginalizing β2(B,C). Thus we have

  PΦ(A,B,C) = β1(A,B)·β2(B,C) / ΣC β2(B,C)
2. Clique Tree (triangulated): Clique/Cluster C1: {A,B,D} — Sepset S1,2 = {B,D} — Cluster C2: {B,C,D}

Initial Potentials (each ψ has every factor involving its arguments):

  ψ1(A,B,D) = φ1(A,B)·φ4(D,A)
  ψ2(B,C,D) = φ2(B,C)·φ3(C,D)

Factors → Unnormalized Distribution:

  P̃Φ(A,B,C,D) = φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)

Computing Clique Beliefs (βi) and Sepset Beliefs (μi,j):

  β1(A,B,D) = P̃Φ(A,B,D) = ΣC φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)
  e.g., β1(a0,b0,d0) = 300,000 + 300,000 = 600,000

  μ1,2(B,D) = ΣC1−S1,2 β1(C1) = ΣA β1(A,B,D)
  e.g., μ1,2(b0,d0) = 600,000 + 200 = 600,200

  β2(B,C,D) = P̃Φ(B,C,D) = ΣA φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)
  e.g., β2(b0,c0,d0) = 300,000 + 100 = 300,100

Verifying Inference — all the clique and sepset beliefs together reparameterize the unnormalized distribution:

  P̃Φ(χ) = ΠCi∈VT βi(Ci) / Π(i−j)∈ET μi,j(Si,j)

  e.g., P̃Φ(a1,b0,c1,d0) = β1(a1,b0,d0)·β2(b0,c1,d0) / μ1,2(b0,d0) = (200 · 300,100)/600,200 = 100
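The computation above can be checked numerically. In this sketch the four factor tables are the standard Misconception-example values from Koller & Friedman; the slide shows only derived numbers such as Z = 7,201,840, so the tables themselves are an assumption here (they do reproduce every quoted value).

```python
import itertools

phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}    # φ1(A,B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}  # φ2(B,C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}  # φ3(C,D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}  # φ4(D,A)

def p_tilde(a, b, c, d):
    """Unnormalized measure P̃_Φ(A,B,C,D) = φ1·φ2·φ3·φ4."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

V = [0, 1]
Z = sum(p_tilde(a, b, c, d) for a, b, c, d in itertools.product(V, repeat=4))

# Clique beliefs and the sepset belief, computed as marginals of P̃_Φ:
beta1 = {(a, b, d): sum(p_tilde(a, b, c, d) for c in V)
         for a, b, d in itertools.product(V, repeat=3)}   # β1(A,B,D)
beta2 = {(b, c, d): sum(p_tilde(a, b, c, d) for a in V)
         for b, c, d in itertools.product(V, repeat=3)}   # β2(B,C,D)
mu12 = {(b, d): sum(beta1[a, b, d] for a in V)
        for b, d in itertools.product(V, repeat=2)}       # μ1,2(B,D)

# Clique tree invariant: β1(A,B,D)·β2(B,C,D) = P̃_Φ(A,B,C,D)·μ1,2(B,D) for every
# assignment (written multiplicatively to stay in exact integer arithmetic).
for a, b, c, d in itertools.product(V, repeat=4):
    assert beta1[a, b, d] * beta2[b, c, d] == p_tilde(a, b, c, d) * mu12[b, d]

print(Z, beta1[0, 0, 0], mu12[0, 0], beta2[0, 0, 0])  # 7201840 600000 600200 300100
```

The invariant holds exactly here because P̃Φ factorizes into a term over {A,B,D} and a term over {B,C,D}, so A ⊥ C | {B,D}.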
Example of Computing ED[fi(χ)]

• χ = {A,B,C}; pairwise Markov network A—B—C
  – Variables are binary
  – Three clusters: C1 = {A,B}, C2 = {B,C}, C3 = {C,A}
• Log-linear model with two features, both shared over all clusters:
  – f00(x,y) = 1 if x=0, y=0, and 0 otherwise
  – f11(x,y) = 1 if x=1, y=1, and 0 otherwise

  Pθ(χ) = (1/Z(θ)) exp[ Σi=1..k θi fi(Di) ]

• Assume we have three data instances (A,B,C):
  – (0,0,0) → clusters AB, BC, and CA all have f00(x,y)=1
  – (0,1,0) → only cluster CA has f00(x,y)=1
  – (1,0,0) → only cluster BC has f00(x,y)=1
• Unnormalized empirical feature counts, pooled over all clusters:

  EP̂[f00] = (3 + 1 + 1)/3 = 5/3
  EP̂[f11] = (0 + 0 + 0)/3 = 0

• Can normalize using the counts over all features, i.e., (5/3) + 0
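The pooled-count calculation above is short enough to transcribe directly; the cluster pairs and instances below are exactly those listed in the slide.

```python
clusters = [("A", "B"), ("B", "C"), ("C", "A")]
data = [{"A": 0, "B": 0, "C": 0},   # (0,0,0): all three clusters fire f00
        {"A": 0, "B": 1, "C": 0},   # (0,1,0): only cluster CA fires f00
        {"A": 1, "B": 0, "C": 0}]   # (1,0,0): only cluster BC fires f00

def pooled(instance, x, y):
    """Number of clusters whose value pair equals (x, y) in this instance."""
    return sum(1 for u, v in clusters if instance[u] == x and instance[v] == y)

M = len(data)
e_f00 = sum(pooled(inst, 0, 0) for inst in data) / M  # (3 + 1 + 1)/3 = 5/3
e_f11 = sum(pooled(inst, 1, 1) for inst in data) / M  # (0 + 0 + 0)/3 = 0
print(e_f00, e_f11)
```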
Difficulty with Exact Methods
• Exact parameter estimation methods assume the ability to compute
  1. the partition function Z(θ), and
  2. the expectations EPθ[fi]
• In many applications the structure of the network does not allow exact computation of these terms
  – In image segmentation, for example, grid networks lead to exponential-size clusters for exact inference
  – (In the example above, the cluster graph AB—BC—CA is a clique tree with overlapping factors.)
Discussion of Approximate Methods
1. In this section: approximate inference
   – Decouples inference from parameter learning
     • Inference is a black box
   – But the approximation may interfere with learning
     • Non-convergence of inference can lead to oscillating estimates of the gradient, and hence no learning convergence
2. Next section: approximate objective functions
   – Whose optimization doesn't require much inference
   – Approximately optimizing the likelihood function can be reformulated as exactly optimizing an approximate objective
Approximate Inference
• Learning with approximate inference methods:
  1. Belief Propagation
  2. MAP-based Learning
Approximate Inference: Belief Propagation
• A popular approach to approximate inference is Belief Propagation and its variants
  – An algorithm from this family would be used for inference in the model resulting from the learning procedure
• We should train with the same inference algorithm that will be used for querying the model
  – A model trained with the same inference algorithm is better than a model trained with exact inference!
• BP is run within each iteration of gradient ascent to compute the expected feature counts EPθ[fi]
Difficulty with Belief Propagation
• In principle, BP is run in each gradient ascent iteration
  – to compute the EPθ[fi] used in the gradient computation
• Due to the family preservation property, each feature fi must be a subset of a cluster Ci in the cluster graph
  – Hence to compute EPθ[fi] we can use the BP marginals over Ci
• But BP often does not converge
  – The derived marginals often oscillate
    • The final results depend on where we stop
  – As a result the computed gradient is unstable
    • This hurts the convergence properties of gradient ascent
    • The problem is even more severe with line search
Convergent Alternatives to Belief Propagation
• We describe three methods:
  1. Pseudo-moment matching
     • Reformulate the task of learning with approximate inference as optimizing an alternative objective
  2. Maximum entropy approximation (CAMEL)
     • A general derivation that allows us to reformulate ML learning with BP as a unified optimization problem with an approximate objective
  3. MAP-based learning
     • Approximate the expected feature counts with their counts in the single MAP assignment of the current MN
Pseudo-moment Matching
• Begin with an analysis of the fixed points of learning
  – Converged BP beliefs must satisfy

    Eβi(Ci)[fCi] = ED[fCi]

  – or equivalently βi(ci) = P̂(ci)
  – The convergence point is a set of beliefs that match the data
• Define, for each sepset Si,j between Ci and Cj,

    φi ← βi / μi,j

  – e.g., for the clique tree C1: {A,B,D} — S1,2 = {B,D} — C2: {B,C,D}
  – We use the final set of potentials as the parameterization of the Markov network
• This provides a closed-form solution for both inference and learning
  – It cannot be used with parameter regularization, non-table factors, or CRFs
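The closed form above can be demonstrated on the two-clique tree C1={A,B,D} — [B,D] — C2={B,C,D}: set the beliefs directly to the empirical marginals, β1 = P̂(A,B,D), β2 = P̂(B,C,D), μ1,2 = P̂(B,D), and read off the potentials as φ1 ← β1 and φ2 ← β2/μ1,2. The data set below is made up purely for illustration.

```python
import itertools
from collections import Counter

data = [(0, 0, 0, 0), (0, 1, 1, 0), (1, 0, 1, 1),
        (0, 0, 1, 0), (1, 1, 0, 1), (0, 0, 0, 1)]  # hypothetical (a,b,c,d) samples
M = len(data)
V = [0, 1]

counts = Counter(data)
p_hat = {xi: counts[xi] / M for xi in itertools.product(V, repeat=4)}  # P̂(A,B,C,D)

beta1 = {(a, b, d): sum(p_hat[a, b, c, d] for c in V)
         for a, b, d in itertools.product(V, repeat=3)}   # β1 = P̂(A,B,D)
beta2 = {(b, c, d): sum(p_hat[a, b, c, d] for a in V)
         for b, c, d in itertools.product(V, repeat=3)}   # β2 = P̂(B,C,D)
mu12 = {(b, d): sum(beta1[a, b, d] for a in V)
        for b, d in itertools.product(V, repeat=2)}       # μ1,2 = P̂(B,D)

def model_p(a, b, c, d):
    """P(ξ) = φ1(A,B,D)·φ2(B,C,D) with φ1 = β1 and φ2 = β2/μ1,2."""
    return 0.0 if mu12[b, d] == 0 else beta1[a, b, d] * beta2[b, c, d] / mu12[b, d]

# The resulting Markov network reproduces the empirical clique marginals exactly:
for a, b, d in itertools.product(V, repeat=3):
    assert abs(sum(model_p(a, b, c, d) for c in V) - beta1[a, b, d]) < 1e-12
for b, c, d in itertools.product(V, repeat=3):
    assert abs(sum(model_p(a, b, c, d) for a in V) - beta2[b, c, d]) < 1e-12
print("clique marginals match the data")
```

Because the cluster graph here is a tree, moment matching is exact; no iterative learning is needed.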
BP and Maximum Entropy
• A more general derivation allows us to reformulate maximum likelihood learning with belief propagation as a unified optimization problem with an approximate objective
• This opens the door to the use of better approximation algorithms
• Start with the dual of the maximum-likelihood problem, Maximum-Entropy:
  Find Q(χ)
  Maximizing HQ(χ)
  Subject to EQ[fi] = ED[fi], i = 1,.., k
Tractable Max Entropy
• Assume a cluster graph U consisting of a set of clusters {Ci} connected by sepsets Si,j
• Rather than optimize the maximum-entropy objective over the space of distributions Q, we optimize over cluster beliefs, using the factored entropy approximation

  HQ(χ) ≈ ΣCi∈U Hβi(Ci) − Σ(Ci−Cj)∈U Hμi,j(Si,j)

• The approximation is exact when the cluster graph is a tree, and approximate otherwise
• Approx-Maximum-Entropy:
  Find Q(χ)
  Maximizing ΣCi∈U Hβi(Ci) − Σ(Ci−Cj)∈U Hμi,j(Si,j)
  Subject to EQ[fi] = ED[fi], i = 1,.., k
             Q ∈ Local(U)
• The method is known as CAMEL (Constrained Approximate Maximum Entropy Learning)
Example of Max Ent Learning
• Pairwise Markov network A—B—C
  – Variables are binary
  – Three clusters: C1 = {A,B}, C2 = {B,C}, C3 = {C,A}
  – Log-linear model with features
    • f00(x,y) = 1 if x=0, y=0, and 0 otherwise, for (x,y) an instance of Ci
    • f11(x,y) = 1 if x=1, y=1, and 0 otherwise
  – Three data instances (A,B,C): (0,0,0), (0,1,0), (1,0,0)
• Unnormalized feature counts, pooled over all clusters, are

  EP̂[f00] = (3+1+1)/3 = 5/3
  EP̂[f11] = (0+0+0)/3 = 0
CAMEL Optimization Problem
• The optimization problem takes the form of maximizing the factored entropy approximation subject to two types of constraints:
  – Type 1: matching the empirical feature expectations, EQ[fi] = ED[fi], i = 1,.., k
  – Type 2: marginal-consistency constraints from the cluster-graph approximation, Q ∈ Local(U)
CAMEL Solutions
• The CAMEL optimization is a constrained maximization problem with
  – linear constraints, and
  – a nonconcave objective
• There are several solution algorithms; one of them is to
  – introduce Lagrange multipliers for all constraints and optimize over the resulting new variables
Sampling-based Learning
• We wish to maximize the log-likelihood

  ℓ(θ : D) = Σi θi ( Σm fi(ξ[m]) ) − M ln Z(θ)

• In the sampling-based approach we use samples to estimate the partition function Z(θ), and use that estimate to perform gradient ascent
Expressing Z(θ) as an Expectation
• The partition function Z(θ) is a summation over an exponentially large space
  – One approach to approximating this summation is to reformulate it as an expectation with respect to a distribution Q(χ):

  Z(θ) = Σξ exp{ Σi θi fi(ξ) }
       = Σξ (Q(ξ)/Q(ξ)) exp{ Σi θi fi(ξ) }
       = EQ[ (1/Q(χ)) exp{ Σi θi fi(χ) } ]

  since EQ[g(χ)] = Σξ Q(ξ) g(ξ)

  – This is precisely the form of an importance sampling estimator
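This reformulation can be checked on the tiny A—B—C model used earlier: take Q to be uniform over the 2^3 assignments (so 1/Q(ξ) = 8) and average the unnormalized weights over samples. The θ values below are arbitrary illustration values.

```python
import itertools
import math
import random

random.seed(0)
CLUSTERS = [(0, 1), (1, 2), (2, 0)]
theta = (0.5, -0.3)  # (θ00, θ11), chosen arbitrarily for the demo

def weight(xi):
    """exp{Σ_i θ_i f_i(ξ)} with the pooled features f00, f11."""
    f00 = sum(1 for i, j in CLUSTERS if xi[i] == 0 and xi[j] == 0)
    f11 = sum(1 for i, j in CLUSTERS if xi[i] == 1 and xi[j] == 1)
    return math.exp(theta[0] * f00 + theta[1] * f11)

# Exact Z by enumeration (feasible only because the model is tiny):
Z_exact = sum(weight(xi) for xi in itertools.product([0, 1], repeat=3))

# Importance-sampling estimate: Z(θ) = E_Q[(1/Q(ξ)) exp{Σ_i θ_i f_i(ξ)}]
K = 200_000
Z_hat = sum(8 * weight(tuple(random.randint(0, 1) for _ in range(3)))
            for _ in range(K)) / K
print(Z_exact, Z_hat)  # the estimate should be close to the exact value
```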
Importance Sampling
• To determine E[f] under p(z), we draw samples {z(m)} from a simpler proposal distribution q(z):

  E[f] = ∫ f(z) p(z) dz
       = ∫ f(z) (p(z)/q(z)) q(z) dz
       ≈ (1/M) Σm=1..M (p(z(m))/q(z(m))) f(z(m))

• The samples are weighted by the ratios rm = p(z(m))/q(z(m))
  – Known as importance weights
  – They correct the bias introduced by sampling from the wrong distribution
• Compare sampling directly from p: Ep[f] ≈ (1/M) Σm=1..M f(z[m])
• Unlike rejection sampling, all of the samples are retained
Importance Sampling Estimator
• We can approximate the partition function by generating samples from Q and correcting appropriately via weights
• We can simplify this expression by choosing Q to be Pθ0 for some set of parameters θ0:

  Z(θ) = EPθ0[ Z(θ0) exp{ Σi θi fi(χ) } / exp{ Σi θi0 fi(χ) } ]
       = Z(θ0) EPθ0[ exp{ Σi (θi − θi0) fi(χ) } ]
Samples to Approximate ln Z(θ)
• If we can sample instances ξ1,..,ξK from Pθ0, we can approximate the log-partition function as

  ln Z(θ) ≈ ln( (1/K) Σk=1..K exp{ Σi (θi − θi0) fi(ξk) } ) + ln Z(θ0)

• We can plug this approximation of ln Z(θ) into the log-likelihood and optimize it:

  (1/M) ℓ(θ : D) = Σi θi ED[fi(di)] − ln Z(θ)

  where ED[fi(di)] is the empirical expectation of fi, i.e., its average in the data set
• Note that ln Z(θ0) is a constant
  – It can be ignored in the optimization, and the resulting expression is a simple function of θ which can be optimized using gradient ascent
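The ln Z(θ) approximation above can be sketched on the tiny A—B—C model. Samples from Pθ0 are drawn by exact categorical sampling over the 8 assignments here, standing in for MCMC in a real model; θ0 and θ are arbitrary nearby illustration values.

```python
import itertools
import math
import random

random.seed(0)
CLUSTERS = [(0, 1), (1, 2), (2, 0)]
ALL = list(itertools.product([0, 1], repeat=3))

def feats(xi):
    return (sum(1 for i, j in CLUSTERS if xi[i] == 0 and xi[j] == 0),
            sum(1 for i, j in CLUSTERS if xi[i] == 1 and xi[j] == 1))

def ln_Z(theta):
    return math.log(sum(math.exp(theta[0] * f0 + theta[1] * f1)
                        for f0, f1 in map(feats, ALL)))

theta0 = (0.2, 0.2)
theta = (0.4, 0.1)  # a nearby parameter vector

w0 = [math.exp(theta0[0] * f0 + theta0[1] * f1) for f0, f1 in map(feats, ALL)]
K = 100_000
samples = random.choices(ALL, weights=w0, k=K)  # ξ_1..ξ_K ~ P_{θ0}

# ln Z(θ) ≈ ln( (1/K) Σ_k exp{Σ_i (θ_i − θ_i0) f_i(ξ_k)} ) + ln Z(θ0)
avg = sum(math.exp((theta[0] - theta0[0]) * feats(xi)[0]
                   + (theta[1] - theta0[1]) * feats(xi)[1])
          for xi in samples) / K
ln_Z_hat = math.log(avg) + ln_Z(theta0)
print(ln_Z(theta), ln_Z_hat)  # close, since θ is near θ0
```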
Gradient Ascent + Sampling
• Gradient ascent over θ relative to the approximation

  ln Z(θ) ≈ ln( (1/K) Σk=1..K exp{ Σi (θi − θi0) fi(ξk) } ) + ln Z(θ0)

  is equivalent to using an importance sampling estimator directly to approximate the expected counts in the gradient

  ∂/∂θi (1/M) ℓ(θ : D) = ED[fi(χ)] − Eθ[fi]

• However, it is generally useful to view such methods as exactly optimizing an approximate objective, rather than approximately optimizing the exact likelihood
Quality of the Importance Sampling Estimator
• It depends on the difference between θ and θ0
• The greater the difference, the larger the variance of the importance weights
• Thus this type of approximation is reasonable only in a neighborhood surrounding θ0
How to Use This Approximation?
• Iterate between two steps:
  1. Use a sampling procedure to generate samples from the current parameter set θt
  2. Then use gradient ascent to find a θt+1 that improves the approximate log-likelihood based on those samples
• We can then regenerate samples and repeat the process
  – As the samples are regenerated from the new distribution, we can hope that they are generated from a distribution not too far from the one we are currently optimizing, maintaining a reasonable approximation
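The two-step iterate above can be sketched on the toy A—B—C model. The sampler is exact categorical sampling over the 8 assignments, standing in for MCMC in a real model; as before, a fourth data instance (1,1,1) is added to the lecture's three so that both features have positive counts and the optimum is finite.

```python
import itertools
import math
import random

random.seed(0)
CLUSTERS = [(0, 1), (1, 2), (2, 0)]
ALL = list(itertools.product([0, 1], repeat=3))

def feats(xi):
    return (sum(1 for i, j in CLUSTERS if xi[i] == 0 and xi[j] == 0),
            sum(1 for i, j in CLUSTERS if xi[i] == 1 and xi[j] == 1))

data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
ed = [sum(feats(xi)[k] for xi in data) / len(data) for k in (0, 1)]  # E_D[f_i]

theta, eta, K = [0.0, 0.0], 0.1, 2000
for t in range(300):
    # Step 1: generate samples from the current model P_{θt}
    w = [math.exp(theta[0] * f0 + theta[1] * f1) for f0, f1 in map(feats, ALL)]
    samples = random.choices(ALL, weights=w, k=K)
    # Step 2: gradient step with E_θ[f_i] estimated from the samples
    e = [sum(feats(xi)[k] for xi in samples) / K for k in (0, 1)]
    theta = [theta[k] + eta * (ed[k] - e[k]) for k in (0, 1)]

# Check against the exact expectations at the learned θ:
w = [math.exp(theta[0] * f0 + theta[1] * f1) for f0, f1 in map(feats, ALL)]
Z = sum(w)
e_exact = [sum(wi * feats(xi)[k] for wi, xi in zip(w, ALL)) / Z for k in (0, 1)]
print(theta, e_exact)  # e_exact should be close to ed = [1.25, 0.75]
```

Regenerating the samples at every outer step keeps the proposal distribution close to the distribution currently being optimized, which is exactly the neighborhood condition discussed on the previous slide.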
MAP-based Learning
• Another approach to inference in learning:
  – approximate the expected feature counts with the counts in the single MAP assignment of the current MN
• The approximate gradient at θ is

  ED[fi(χ)] − fi(ξMAP(θ))

  – where ξMAP(θ) = argmaxξ P(ξ|θ) is the MAP assignment given the current set of parameters θ
• This approach is also called Viterbi training
• It is equivalent to exact optimization of the approximate objective

  (1/M) ℓ(θ : D) − ln P(ξMAP(θ) | θ)
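A single Viterbi-training gradient evaluation can be sketched on the toy A—B—C model: the model expectation Eθ[fi] is replaced by fi(ξMAP(θ)), with the MAP assignment found by brute force here (a real model would use a MAP solver). The current parameter values below are arbitrary illustration values.

```python
import itertools

CLUSTERS = [(0, 1), (1, 2), (2, 0)]
ALL = list(itertools.product([0, 1], repeat=3))

def feats(xi):
    return (sum(1 for i, j in CLUSTERS if xi[i] == 0 and xi[j] == 0),
            sum(1 for i, j in CLUSTERS if xi[i] == 1 and xi[j] == 1))

def map_assignment(theta):
    """ξ_MAP(θ) = argmax_ξ Σ_i θ_i f_i(ξ); maximizing the unnormalized log-score
    is equivalent to maximizing P(ξ|θ), since ln Z(θ) does not depend on ξ."""
    return max(ALL, key=lambda xi: theta[0] * feats(xi)[0] + theta[1] * feats(xi)[1])

data = [(0, 0, 0), (0, 1, 0), (1, 0, 0)]
ed = [sum(feats(xi)[k] for xi in data) / len(data) for k in (0, 1)]  # [5/3, 0]

theta = [0.3, 0.1]                       # arbitrary current parameters
xi_map = map_assignment(theta)           # (0,0,0): score 3·0.3 beats all others
grad = [ed[k] - feats(xi_map)[k] for k in (0, 1)]
print(xi_map, grad)                      # approximate gradient E_D[f_i] − f_i(ξ_MAP)
```

Only a single MAP query per iteration is needed, which is what makes Viterbi training attractive when MAP inference is much cheaper than computing marginals.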
Ex: CRFs for Protein Structure
• Predict the 3-D structure of proteins
  – Proteins are chains of residues, each containing one of 20 possible amino acids
    • Amino acids are linked together into a common backbone structure onto which amino-acid-specific side chains are attached
  – Predict the side-chain conformations given the backbone
  – A full configuration consists of up to four angles, each of which takes on a continuous value
    • Discretize each angle into a small number (e.g., 3) of rotamers
• Address the optimization as MAP inference in a CRF