Probabilistic Graphical Models Srihari
Inference as Optimization
Sargur Srihari [email protected]
Topics in Inference as Optimization
• Overview
• Exact Inference revisited
• The Energy Functional
• Optimizing the Energy Functional
Exact and Approximate Inference • PGMs represent probability distributions PΦ(χ)
– Where χ is a set of variables and Φ is a set of factors
• Inference is the task of answering queries
– e.g., compute the conditional probability PΦ(Y|E=e), where Y, E ⊆ χ
• The problem of inference in PGMs is NP-hard
– Worst case is exponential
• Exact inference is often efficient using
– Variable Elimination or Clique Tree algorithms
• But complexity is exponential in the tree width of the network
• When the tree width is large, exact algorithms become infeasible
• This motivates approximate inference
Approximate Target Distribution
• We consider approximate inference methods where the approximation arises from constructing an approximation Q to the target distribution PΦ
• This approximation takes a simpler form that allows efficient inference
• The simpler approximating form exploits the factorization structure of the PGM
Principles of Approximate Algorithms
• Approximate inference methods share common conceptual principles:
1. Find a target class Q of “easy” distributions, and
2. Then search for an instance Q within that class that best approximates PΦ
3. Answer queries using inference on Q instead of PΦ
4. All methods optimize the same target function for measuring similarity between Q and PΦ
• This reformulates the inference problem as:
– optimizing an objective function over the class Q
Reformulated Inference Problem
• This problem is one of constrained optimization
– i.e., find the distribution Q that minimizes D(Q || PΦ)
– Such problems can be solved by a variety of different optimization techniques
• The technique most often used for PGMs is based on Lagrange multipliers
• Constrained optimization and the Lagrangian solution are discussed next
What is constrained optimization?
• Example: find the maximum entropy distribution over X with Val(X)={x1,..,xK}, where entropy is

H(X) = − Σ_{k=1}^K P(x^k) log P(x^k)

– Unconstrained Optimization
• Use a gradient method treating each P(x^k) as a parameter θk
• Compute the gradient of H(X) wrt the parameters:

∂H(X)/∂θk = − log(θk) − 1

• Setting the partial derivative to 0 gives log(θk) = −1, or θk = 1/e
• But these numbers do not add up to 1, and hence do not form a distribution
• Flaw in the analysis: we need the constraints Σk θk = 1 and θk ≥ 0
– Constrained Optimization
• Maximizing a function f under equality constraints:
• Find θ maximizing f(θ)
• subject to c1(θ) = 0, …, cm(θ) = 0
Method of Lagrange multipliers allows us to solve constrained optimization problems using tools for unconstrained optimization. The Lagrangian is

J(θ, λ) = f(θ) − Σ_{j=1}^m λj cj(θ)
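As a concrete illustration, here is a minimal Python sketch (not from the lecture) that solves the maximum-entropy problem of the previous slide via the Lagrangian's stationarity conditions, and shows that the unconstrained stationary point violates the sum-to-one constraint.

```python
import math

# Sketch (not from the lecture): the maximum-entropy problem of the slide,
# solved via the Lagrangian J(theta, lam) = H(theta) - lam * (sum_k theta_k - 1).
# Stationarity gives dJ/dtheta_k = -log(theta_k) - 1 - lam = 0, so every
# theta_k = exp(-1 - lam); the constraint sum_k theta_k = 1 then forces
# the uniform distribution theta_k = 1/K.

def entropy(theta):
    return -sum(t * math.log(t) for t in theta if t > 0)

def max_entropy(K):
    # Closed-form solution of the stationarity-plus-constraint equations.
    return [1.0 / K] * K

K = 4
theta_star = max_entropy(K)
print(entropy(theta_star))   # log 4 ≈ 1.3863

# The unconstrained stationary point theta_k = 1/e violates the constraint:
print(K * math.exp(-1))      # ≈ 1.4715, not 1
```

The uniform distribution is indeed the entropy maximizer: any perturbation away from it lowers the entropy.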
Lagrange leads to Message Passing • Method of Lagrange multipliers produces a set
of equations that characterize the optima of the objective
• It produces a set of fixed-point equations that define each variable in terms of others
• These fixed-point equations, derived from constrained optimization of the energy, can be viewed as passing messages over a graph
Categories of methods in this class
1. Message passing on cluster graphs
– Loopy belief propagation
• Optimizes approximate versions of the energy functional
2. Message passing on clique trees with approximate messages
– Called expectation propagation
• Maximizes the exact energy functional but with relaxed constraints on Q
3. Mean-field methods
– Originate in statistical physics
• Focus on Q that has a simple factorization
Examples of Clique Trees

[Figure: Bayesian Network 1 with its moralized graph, clique tree, and cluster graph; Bayesian Network 2 with its moralized graph, triangulation, and clique tree]
Calibrated Clique Tree
[Figure: (a)–(c) a Markov network over A, B, C, D arranged in a loop, a cluster graph, and a clique tree with sepset {B,D}]
1. Gibbs Distribution:

P(A,B,C,D) = (1/Z) φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)

where

Z = Σ_{A,B,C,D} φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A) = 7,201,840

Cliques: C1: {A,B,D}, C2: {B,C,D}; sepset S1,2: {B,D}
Figure 10.6 shows the clique and sepset beliefs for the Misconception example:

β1(A,B,D):
a0 b0 d0: 600,000
a0 b0 d1: 300,030
a0 b1 d0: 5,000,500
a0 b1 d1: 1,000
a1 b0 d0: 200
a1 b0 d1: 1,000,100
a1 b1 d0: 100,010
a1 b1 d1: 200,000

μ1,2(B,D):
b0 d0: 600,200
b0 d1: 1,300,130
b1 d0: 5,100,510
b1 d1: 201,000

β2(B,C,D):
b0 c0 d0: 300,100
b0 c0 d1: 1,300,000
b0 c1 d0: 300,100
b0 c1 d1: 130
b1 c0 d0: 510
b1 c0 d1: 100,500
b1 c1 d0: 5,100,000
b1 c1 d1: 100,500

Using equation (10.9), the denominator can be rewritten so that each message δi→j appears exactly once in the numerator and exactly once in the denominator, and all messages cancel. The remaining expression is simply the product of the initial potentials, Π_{i∈VT} ψi[Ci]. Thus, via equation (10.10), the clique and sepset beliefs provide a reparameterization of the unnormalized measure. This property is called the clique tree invariant.

Another intuition for this result can be obtained from Example 10.5: consider a clique tree obtained from the Markov network A–B–C–D with factors Φ. The clique tree in this case has three cliques C1 = {A,B}, C2 = {B,C}, and C3 = {C,D}. When the clique tree is calibrated, we have β1(A,B) = PΦ(A,B) and β2(B,C) = PΦ(B,C). From the conditional independence properties of this network we have

PΦ(A,B,C) = PΦ(A,B)·PΦ(C | B), and PΦ(C | B) = β2(B,C) / PΦ(B)

As β2(B,C) = PΦ(B,C), we can obtain PΦ(B) by marginalizing β2(B,C). Thus

PΦ(A,B,C) = β1(A,B)·β2(B,C) / μ1,2(B)
E.g., P̃Φ(a1,b0,c1,d0) = 100, and the measure induced by the calibrated tree is

β1(a1,b0,d0)·β2(b0,c1,d0) / μ1,2(b0,d0) = (200 · 300,100) / 600,200 = 100
2. Clique Tree (triangulated):

Beliefs:

β1(A,B,D) = P̃Φ(A,B,D) = Σ_C φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)
e.g., β1(a1,b0,d0) = 100 + 100 = 200

μ1,2(B,D) = Σ_{C1−S1,2} β1(C1) = Σ_A β1(A,B,D)
e.g., μ1,2(b0,d0) = 600,000 + 200 = 600,200

β2(B,C,D) = P̃Φ(B,C,D) = Σ_A φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)
e.g., β2(b0,c1,d0) = 300,000 + 100 = 300,100

Initial potentials (each factor assigned to a clique containing its scope):

ψ1(A,B,D) = φ1(A,B)·φ4(D,A)
ψ2(B,C,D) = φ2(B,C)·φ3(C,D)

so that P̃Φ(A,B,C,D) = ψ1(A,B,D)·ψ2(B,C,D) = φ1(A,B)·φ2(B,C)·φ3(C,D)·φ4(D,A)

Clique beliefs: β1(A,B,D), β2(B,C,D); sepset belief: μ1,2(B,D)

Measure induced by the calibrated tree T (an unnormalized measure):

Q_T = Π_i βi(Ci) / Π_{(i-j)} μi,j(Si,j)

where μi,j = Σ_{Ci−Si,j} βi(Ci) = Σ_{Cj−Si,j} βj(Cj)
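The worked example above can be checked numerically. In the sketch below (not lecture code), the factor values are those of the Misconception example in Koller & Friedman, on which this slide is based; they reproduce the slide's numbers (Z = 7,201,840; β1(a1,b0,d0) = 200; etc.) and verify the clique tree invariant.

```python
from itertools import product

# Factors of the Misconception example (Koller & Friedman); these values
# reproduce the numbers shown on the slide.
phi1 = {('a0','b0'): 30,  ('a0','b1'): 5,   ('a1','b0'): 1,   ('a1','b1'): 10}
phi2 = {('b0','c0'): 100, ('b0','c1'): 1,   ('b1','c0'): 1,   ('b1','c1'): 100}
phi3 = {('c0','d0'): 1,   ('c0','d1'): 100, ('c1','d0'): 100, ('c1','d1'): 1}
phi4 = {('d0','a0'): 100, ('d0','a1'): 1,   ('d1','a0'): 1,   ('d1','a1'): 100}

A, B, C, D = ['a0','a1'], ['b0','b1'], ['c0','c1'], ['d0','d1']

def p_tilde(a, b, c, d):
    """Unnormalized Gibbs measure P~_Phi(a, b, c, d)."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

Z = sum(p_tilde(a, b, c, d) for a, b, c, d in product(A, B, C, D))
print(Z)  # 7201840, as on the slide

# Calibrated beliefs for C1 = {A,B,D}, C2 = {B,C,D}, sepset S1,2 = {B,D}
beta1 = {(a,b,d): sum(p_tilde(a,b,c,d) for c in C) for a,b,d in product(A,B,D)}
beta2 = {(b,c,d): sum(p_tilde(a,b,c,d) for a in A) for b,c,d in product(B,C,D)}
mu12  = {(b,d): sum(beta1[a,b,d] for a in A) for b,d in product(B,D)}

print(beta1['a1','b0','d0'], beta2['b0','c1','d0'], mu12['b0','d0'])
# 200 300100 600200

# Clique tree invariant: the beliefs reparameterize the unnormalized measure
for a, b, c, d in product(A, B, C, D):
    q_t = beta1[a, b, d] * beta2[b, c, d] / mu12[b, d]
    assert abs(q_t - p_tilde(a, b, c, d)) < 1e-9
```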
Belief Propagation

[Figure: a simple network, a clique tree, and a cluster graph]

• Clique trees and cluster graphs are alternative structures for doing inference
• A cluster graph may contain loops
• Inference on a cluster graph is called Loopy Belief Propagation
• Clusters are smaller than in a clique tree
Exact Inference Revisited
• We have a factorized distribution of the form

PΦ(χ) = (1/Z) Π_{φ∈Φ} φ(Uφ)

– where Uφ = Scope(φ)
– Factors are:
• CPDs in a BN, or
• potentials in an MN
• We are interested in answering queries:
– about marginal probabilities of variables, and
– about the partition function
Cluster Graph Representation
• The end-product of Belief Propagation is a calibrated cluster tree
– A calibrated set of beliefs represents a distribution
• We view exact inference as searching over the set of distributions Q that are representable by the cluster tree to find a distribution Q* that matches PΦ
• A cluster graph U for factors Φ over χ is an undirected graph:
– each node i is associated with a subset Ci ⊆ χ
– each edge between a pair of clusters Ci and Cj is associated with a sepset Si,j ⊆ Ci ∩ Cj
• A tree T is a clique tree for a graph H if:
– each node in T corresponds to a clique in H, and each maximal clique in H is a node in T
– each sepset Si,j separates W<(i,j) and W<(j,i) in H
Distance between Q and PΦ
• We need to optimize the distance between Q and PΦ without answering hard queries about PΦ
• Relative entropy (or K-L divergence) allows us to exploit the structure of PΦ without performing reasoning with it
– Relative entropy of P1 and P2 is defined as

D(P1 || P2) = E_{P1}[ ln P1(χ) − ln P2(χ) ]

• It is always non-negative
• Equal to 0 if and only if P1 = P2
– We search for the distribution Q that minimizes D(Q || PΦ)
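A quick numeric sketch of the relative entropy for small discrete distributions (the probability values below are illustrative only, not from the lecture):

```python
import math

# Minimal sketch of relative entropy D(P1 || P2) for discrete distributions
# given as lists of probabilities (values below are illustrative only).
def kl(p1, p2):
    return sum(a * (math.log(a) - math.log(b)) for a, b in zip(p1, p2) if a > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl(p, q))   # ≈ 0.0253: non-negative
print(kl(p, p))   # 0.0: zero iff the distributions are equal
```

Note that relative entropy is not symmetric: kl(p, q) and kl(q, p) generally differ, which is why the direction D(Q || PΦ) matters in the optimization.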
Specifying the set Q
• We need to specify the objects to optimize over
• Suppose we are given:
– a clique tree structure T for PΦ,
– a set of beliefs Q = {βi : i ∈ VT} ∪ {μi,j : (i-j) ∈ ET}
where the Ci are clusters in T, βi denotes beliefs over Ci, and μi,j denotes beliefs over the sepset Si,j of edge (i-j) in T
• The set of beliefs in T defines a distribution Q by

Q(χ) = Π_{i∈VT} βi / Π_{(i-j)∈ET} μi,j

• We are now searching over a set of distributions Q
– that are representable by a set of beliefs Q over the cliques and sepsets in a particular clique tree structure T
• The beliefs correspond to marginals of Q: βi[ci] = Q(ci), μi,j[si,j] = Q(si,j)
Statement of Inference as Optimization
• Exact inference is one of maximizing −D(Q || PΦ) over the space of calibrated sets Q

Ctree-Optimize-KL:
• Find Q = {βi : i ∈ VT} ∪ {μi,j : (i-j) ∈ ET}
• Maximizing −D(Q || PΦ)
• Subject to

μi,j[si,j] = Σ_{Ci−Si,j} βi(ci)  ∀(i-j) ∈ ET, ∀si,j ∈ Val(Si,j)
Σ_{ci} βi(ci) = 1  ∀i ∈ VT

• Theorem: If T is an I-map of PΦ, then there is a unique solution to Ctree-Optimize-KL
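The two constraint families above can be checked mechanically. The sketch below (illustrative numbers and a hypothetical two-clique tree C1 = {X,Y}, C2 = {Y,Z}; not from the lecture) builds beliefs as marginals of a hand-picked joint Q and verifies both normalization and sepset consistency:

```python
from itertools import product

# Hypothetical joint Q(x, y, z) = Q(x) Q(y|x) Q(z|y) over binary variables.
px   = {0: 0.6, 1: 0.4}                                      # Q(x)
py_x = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}  # (x, y) -> Q(y|x)
pz_y = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}  # (y, z) -> Q(z|y)

def q(x, y, z):
    return px[x] * py_x[x, y] * pz_y[y, z]

# Beliefs taken as marginals of Q for cliques C1 = {X,Y} and C2 = {Y,Z}
beta1 = {(x, y): sum(q(x, y, z) for z in (0, 1)) for x, y in product((0, 1), (0, 1))}
beta2 = {(y, z): sum(q(x, y, z) for x in (0, 1)) for y, z in product((0, 1), (0, 1))}

# Normalization constraints: sum_{ci} beta_i(ci) = 1 for every clique
assert abs(sum(beta1.values()) - 1) < 1e-12
assert abs(sum(beta2.values()) - 1) < 1e-12

# Sepset consistency: mu_{1,2}(y) agrees whichever clique it is computed from
for y in (0, 1):
    mu_from_1 = sum(beta1[x, y] for x in (0, 1))
    mu_from_2 = sum(beta2[y, z] for z in (0, 1))
    assert abs(mu_from_1 - mu_from_2) < 1e-12
print("constraints satisfied")
```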
Possible approach
• Examine different configurations of beliefs that satisfy the marginal consistency constraints
– Select the configuration that maximizes the objective
– Such an exhaustive examination is impossible to perform
• Instead of searching over the space of all calibrated trees, we can search over a space of simpler distributions
– We will not find a distribution equivalent to PΦ but one that is reasonably close
The Energy Functional
• Directly evaluating D(Q || PΦ) is unwieldy
– because the summation over all of χ is infeasible in practice:

D(P1 || P2) = E_{P1}[ ln P1(χ) − ln P2(χ) ] = Σχ P1(χ)[ ln P1(χ) − ln P2(χ) ]

• Instead use an equivalent form, where F[P̃Φ, Q] is the energy functional
• Theorem:

D(Q || PΦ) = ln Z − F[P̃Φ, Q]

where

F[P̃Φ, Q] = EQ[ ln P̃Φ(χ) ] + HQ(χ) = Σ_{φ∈Φ} EQ[ ln φ ] + HQ(χ)

• Since the term ln Z does not depend on Q,
– minimizing the relative entropy D(Q || PΦ) is equivalent to maximizing the energy functional
• The energy functional has two terms:
– an energy term (expectations of the logs of the factors in Φ) and an entropy term
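The theorem can be verified numerically on a toy example. A sketch (single-factor Markov network with arbitrary illustrative values; not from the lecture):

```python
import math

# Check the theorem D(Q || P_Phi) = ln Z - F[P~_Phi, Q] on a toy Markov
# network with one factor phi over two binary variables (values are
# arbitrary, chosen only for illustration).
phi = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 3.0}
Z = sum(phi.values())
p = {xy: v / Z for xy, v in phi.items()}   # P_Phi = (1/Z) * phi

q = {xy: 0.25 for xy in phi}               # Q: the uniform distribution

# Relative entropy D(Q || P_Phi), computed directly
D = sum(q[xy] * math.log(q[xy] / p[xy]) for xy in q)

# Energy functional F[P~_Phi, Q] = E_Q[ln phi] + H_Q(chi)
energy_term  = sum(q[xy] * math.log(phi[xy]) for xy in q)
entropy_term = -sum(q[xy] * math.log(q[xy]) for xy in q)
F = energy_term + entropy_term

print(abs(D - (math.log(Z) - F)))   # ~0: the identity holds
print(math.log(Z) >= F)             # True: F lower-bounds ln Z
```

Because D ≥ 0, the same computation illustrates the lower bound ln Z ≥ F[P̃Φ, Q] discussed on the next slide.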
Optimizing the Energy Functional
• From here onward we pose the problem of finding a good Q as one of maximizing the energy functional
– Equivalently, minimizing the relative entropy
– Importantly, the energy functional involves expectations under Q
– By choosing a Q that allows efficient inference, we can evaluate and optimize the energy functional
• Moreover, the energy functional is a lower bound on the log partition function:

ln Z ≥ F[P̃Φ, Q]

– since D(Q || PΦ) ≥ 0
– Useful since the partition function is usually the hardest part of inference
• It plays an important role in learning
Strategies for optimizing the energy functional

• These methods are referred to as Variational Methods
• The term refers to a strategy in which we introduce new parameters that increase the degrees of freedom
• Each choice of these parameters gives a different approximation
• We attempt to optimize the variational parameters to get the best approximation
• Variational calculus: finding the optima of a functional
– e.g., the distribution that maximizes entropy
Further Topics in Variational Methods
• Exact Inference
• Propagation-Based Approximations
• Propagation with Approximate Messages
• Structured Variational Approximations