Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood
Post on 14-Aug-2015
TRANSCRIPT
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood

Kohei Hayashi (1,2), Shin-ichi Maeda (3), Ryohei Fujimaki (4)

(1) National Institute of Informatics
(2) Kawarabayashi Large Graph Project, ERATO, JST
(3) Kyoto University
(4) NEC Knowledge Discovery Laboratories

July 10, 2015

1 / 21
Introduction

Factorized asymptotic Bayesian inference (FAB)
• Recently developed approximate Bayesian method
✓ Accurate and tractable
✗ Limited to binary latent variable models (LVMs)

Our contributions:
• Extend FAB to general LVMs (e.g. PCA)
• Analyze theoretical properties that are unclear in the previous studies

2 / 21
1. Revisiting FAB
2. Generalization of FAB

3 / 21
Bayesian Inference for Binary LVMs

Binary LVM:

p(\underbrace{X}_{\text{data}}, \underbrace{Z}_{\text{LVs}}, \underbrace{\Pi}_{\text{params}} \mid \underbrace{K}_{\text{model}}) = \underbrace{p(\Pi)}_{\text{prior}} \, \underbrace{p(X, Z \mid \Pi, K)}_{\text{joint likelihood}}

Assumptions:
• X and Z are jointly i.i.d.:

p(X, Z \mid \Pi, K) = \prod_{n=1}^{N} p(x_n, z_n \mid \Pi, K)

• The prior doesn't depend on N: \ln p(\Pi) = O(1) ("flat" prior)

4 / 21
Goal: To obtain
• the marginal likelihood:

p(X \mid K) = \int p(X, Z, \Pi \mid K) \, dZ \, d\Pi

• the marginal posteriors:

p(Z \mid X, K) = \int p(X, Z, \Pi \mid K) \, d\Pi \,/\, p(X \mid K)
p(\Pi \mid X, K) = \int p(X, Z, \Pi \mid K) \, dZ \,/\, p(X \mid K)

Problem: The marginalizations are intractable

5 / 21
Key idea: Use
• the variational representation for \int dZ
• Laplace's method for \int d\Pi

Factorized information criterion (FIC):

\mathrm{FIC}(K) \equiv \max_q \Big\{ \mathbb{E}_q\big[\max_\Pi \ln p(X, Z \mid \Pi, K)\big] - \underbrace{\mathbb{E}_q\Big[\frac{D_\Pi}{2} \sum_k \ln \sum_n z_{nk}\Big]}_{\text{FIC penalty term}} + H(q) \Big\} + O(\ln N)

• q(Z): trial distribution
• H(q): entropy

6 / 21
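As a concrete illustration, the FIC penalty term can be evaluated directly from a latent indicator matrix. Below is a minimal NumPy sketch; the function name `fic_penalty` and the per-component parameter count `d_pi` are our own illustrative choices, not names from the paper:

```python
import numpy as np

def fic_penalty(z, d_pi):
    """FIC penalty term: (D_Pi / 2) * sum_k ln(sum_n z_nk).

    z    : (N, K) binary latent indicator matrix (or E_q[Z])
    d_pi : number of parameters per component (illustrative)
    """
    col_sums = z.sum(axis=0)               # sum_n z_nk for each component k
    return 0.5 * d_pi * np.sum(np.log(col_sums))

# Example: 90 points assigned evenly to 3 components (30 each)
z = np.repeat(np.eye(3), 30, axis=0)
print(fic_penalty(z, d_pi=4))              # 6 * ln(30) ≈ 20.41
```

Note that the penalty grows only logarithmically in the component sizes, matching the O(ln N) character of the criterion.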
Accuracy of FIC

✓ Asymptotically equivalent to the marginal likelihood

Theorem 3 of [Fujimaki+ 12a]: In mixture models, under mild conditions,

\mathrm{FIC}(K) = \ln p(X \mid K) + O(1) \approx \ln p(X \mid K)

Similar results are obtained for:
• HMMs [Fujimaki+ 12b]
• Latent feature models [KH+ 13]
• Mixture of experts [Eto+ 14]
• Factorial relational models [Liu+ yesterday]

7 / 21
Optimizing FIC

Computation of FIC is difficult:

\max_q \mathbb{E}_q\big[\max_\Pi \ln p(X, Z \mid \Pi, K)\big] - \frac{D_\Pi}{2} \sum_k \mathbb{E}_q\Big[\ln \sum_n z_{nk}\Big] + H(q)

\ge \max_{q \in \mathcal{Q}} \mathbb{E}_q\big[\max_\Pi \ln p(X, Z \mid \Pi, K)\big] - \frac{D_\Pi}{2} \sum_k \mathbb{E}_q\Big[\ln \sum_n z_{nk}\Big] + H(q)
    (mean-field approx., \mathcal{Q} \equiv \{q(Z) \mid q(Z) = \prod_n q(z_n)\})

\ge \max_{q \in \mathcal{Q}, \Pi} \mathbb{E}_q[\ln p(X, Z \mid \Pi, K)] - \frac{D_\Pi}{2} \sum_k \ln \sum_n \mathbb{E}_q[z_{nk}] + H(q)
    (Jensen's ineq.)

\equiv \underline{\mathrm{FIC}}(K)

8 / 21
Algorithm

Optimization problem:

\max_{q \in \mathcal{Q}, \Pi} \mathbb{E}_q[\ln p(X, Z \mid \Pi, K)] - \frac{D_\Pi}{2} \sum_k \ln \sum_n \mathbb{E}_q[z_{nk}] + H(q)

Can be solved by EM-like alternating updates:
1. Initialize q and \Pi
2. Update q (fix \Pi)
3. Update \Pi (fix q)
4. Repeat steps 2 and 3 until convergence

9 / 21
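The alternating updates above can be sketched for a one-dimensional Gaussian mixture. This is our own minimal illustration, not the paper's implementation: the function `fab_gmm_1d`, the exp(-d_pi / 2N_k) shrinkage form of the q-update, and the pruning threshold are all assumptions made for the sketch.

```python
import numpy as np

def fab_gmm_1d(x, K, d_pi=2.0, n_iter=50):
    """Sketch of FAB-style alternating updates for a 1-D Gaussian mixture.

    Illustrative only: the q-update is a standard E-step multiplied by a
    shrinkage factor exp(-d_pi / (2 * N_k)) coming from the FIC penalty,
    and components whose effective size N_k drops below a (hypothetical)
    threshold are pruned.
    """
    n = len(x)
    mu = np.linspace(x.min(), x.max(), K)      # deterministic init
    sigma = np.full(K, x.std() + 1e-8)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # --- q-update (E-step with FIC shrinkage), Pi fixed ---
        nk_prev = n * pi
        log_r = (np.log(pi) - d_pi / (2.0 * nk_prev)
                 - 0.5 * np.log(2.0 * np.pi * sigma**2)
                 - 0.5 * (x[:, None] - mu) ** 2 / sigma**2)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # --- model pruning: drop components with negligible mass ---
        keep = r.sum(axis=0) > 1.0
        r, mu, sigma = r[:, keep], mu[keep], sigma[keep]
        r /= r.sum(axis=1, keepdims=True)
        # --- Pi-update (M-step), q fixed ---
        nk = r.sum(axis=0)
        pi = nk / nk.sum()
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-8
    return mu, pi
```

Note that q is restricted here to the mean-field family \prod_n q(z_n), matching the class \mathcal{Q} on the previous slide.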
Model Pruning

The FAB algorithm eliminates irrelevant components automatically:

\mathbb{E}_q[\ln p(X, Z \mid \Pi, K)] - \underbrace{\frac{D_\Pi}{2} \sum_k \ln \sum_n \mathbb{E}_q[z_{nk}]}_{\text{penalty term}} + H(q)

[Plot of -\log(x), illustrating the penalty]

• The penalty term introduces group sparsity to Z

[Diagram: Z with K = 6 shrinks to K = 3 over successive updates]

10 / 21
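The pruning mechanism can be seen numerically: the penalty's contribution to the objective is -(D_\Pi/2) ln N_k per component, where N_k = \sum_n E_q[z_{nk}] is the component's effective size, and its gradient -(D_\Pi/2)/N_k grows in magnitude as N_k shrinks, so small components feel an ever stronger pull toward zero. A tiny sketch (the variable names are ours):

```python
import numpy as np

# Per-component contribution to the objective: -(D_Pi / 2) * ln(N_k).
# Its gradient w.r.t. N_k is -(D_Pi / 2) / N_k, so shrinking components
# are pushed ever harder toward N_k = 0 -- the group sparsity on Z.
d_pi = 2.0
for nk in [100.0, 10.0, 1.0, 0.1]:
    grad = -0.5 * d_pi / nk
    print(f"N_k = {nk:6.1f}  d(objective)/dN_k from penalty = {grad:+.3f}")
```

This is the -log(x) shape sketched on the slide: flat for large N_k, steep near zero.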
Summary of FIC/FAB

✓ Asymptotically equivalent to the marginal likelihood
  • Fits "Big Data" situations
✓ Performs parameter inference and model selection simultaneously
  • EM-like updates of q and \Pi
  • ARD-like model pruning
✓ Doesn't depend on the choice of p(\Pi)
  • More frequentist than Bayesian
✓ Works in many binary LVMs

11 / 21
Limitations of FIC/FAB

✗ Limited to binary LVMs
  • In real-valued Z, \sum_n z_{nk} can be negative
  • -\ln \sum_n z_{nk} may diverge
✗ Missing relations to EM and VB
  • Similar approaches, but which is better?
✗ Unclear legitimacy of optimizing FIC
  • e.g. tightness

12 / 21
1. Revisiting FAB
2. Generalization of FAB

13 / 21
Setting

• Now Z can take general values (e.g. Z \in \mathbb{R}^{N \times K})
• Consider separating the parameters: \Pi = \{\Theta, \Xi\}
  • \Theta: k-independent params
  • \Xi = \{\xi_k\}_{k=1}^K: k-dependent params (e.g. mixing coefficients)

14 / 21
Generalized FIC (gFIC)

Definition:

\mathrm{gFIC}(K) \equiv \mathbb{E}_{q^*}\Big[\max_\Pi \ln p(X, Z \mid \Pi, K) - \underbrace{\tfrac{1}{2} \ln |\mathcal{F}_\Xi|}_{\text{penalty}}\Big] + H(q^*) + O(\ln N)

• q^*(Z) \equiv p(Z \mid X, K): marginal posterior
• \mathcal{F}_\Xi: Hessian of -\ln p(X, Z \mid \Pi, K)/N (i.e. empirical Fisher information)
• In PCA, \mathcal{F}_\Xi = Z^\top Z

                   FIC                          gFIC
Applicable class   Binary LVMs                  General LVMs
Penalty term       -\sum_k \ln \sum_n z_{nk}    -\ln |\mathcal{F}_\Xi|
Regularization     Group sparsity               "Low-rank"

15 / 21
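For the PCA case mentioned above, the gFIC penalty reduces to a log-determinant of Z^\top Z. A sketch under our own naming (`gfic_penalty_pca` is a hypothetical helper); `slogdet` is used for numerical stability:

```python
import numpy as np

def gfic_penalty_pca(z):
    """gFIC penalty (1/2) * ln|F_Xi| for PCA, where F_Xi = Z^T Z.

    When a column of Z is (nearly) zero or duplicates another column,
    Z^T Z becomes (nearly) singular and ln|Z^T Z| plunges -- the
    "low-rank" regularization: such components are handled by moving
    to the reduced model of size rank(Z).
    """
    sign, logdet = np.linalg.slogdet(z.T @ z)
    return 0.5 * logdet if sign > 0 else -np.inf

rng = np.random.default_rng(0)
z = rng.standard_normal((50, 3))
z_dup = z.copy()
z_dup[:, 2] = z_dup[:, 1]              # degenerate: overlapping columns
print(gfic_penalty_pca(z), gfic_penalty_pca(z_dup))
```

The duplicated-column case collapses the penalty, signalling that one component is redundant.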
Generalized FAB (gFAB)

✓ Use the same technique as FAB:

\mathbb{E}_{q^*}\big[\max_\Pi \ln p(X, Z \mid \Pi, K)\big] - \frac{1}{2} \mathbb{E}_{q^*}[\ln |\mathcal{F}_\Xi|] + H(q^*)

\ge \max_{q \in \mathcal{Q}} \mathbb{E}_q\big[\max_\Pi \ln p(X, Z \mid \Pi, K)\big] - \frac{1}{2} \mathbb{E}_q[\ln |\mathcal{F}_\Xi|] + H(q)
    (mean-field approx.)

\ge \max_{q \in \mathcal{Q}, \Pi} \mathbb{E}_q[\ln p(X, Z \mid \Pi, K)] - \frac{1}{2} \ln \mathbb{E}_q[|\mathcal{F}_\Xi|] + H(q)
    (Jensen's ineq.)

\equiv \underline{\mathrm{gFIC}}(K)

• Able to solve by alternating updates of q and \Pi

16 / 21
Comparison with EM and VB

✓ gFAB asymptotically approximates \ln p(X \mid K) for all K, whereas EM and VB don't.

Theorem 2 & Corollary 5: Let K' be the "true" model of X. Then

\mathrm{gFIC}(K') \approx \ln p(X \mid K) for K > K'
\mathrm{gFIC}(K) \approx \ln p(X \mid K) for K \le K'

• K' can be obtained by model pruning

Proposition 10+:
✗ EM + O(\ln N) \approx \ln p(X \mid K) only for K \le K'
✗ VB \approx \ln p(X \mid K) only for K \le K'

17 / 21
Asymptotic Behavior of gFIC

✓ \underline{\mathrm{gFIC}}(K) \approx \mathrm{gFIC}(K) in some cases.

Proposition 6: q^* is asymptotically mutually independent.
✓ Justifies the mean-field approximation

Proposition 7: If q is not degenerate and \ln p(X, Z \mid \Pi, K) is smooth and concave w.r.t. \Pi, then

\mathbb{E}_q\big[\max_\Pi \ln p(X, Z \mid \Pi, K)\big] \xrightarrow{p} \max_\Pi \mathbb{E}_q[\ln p(X, Z \mid \Pi, K)].

✓ Justifies Jensen's inequality

18 / 21
Experiments: Bayesian PCA

[Figure: objective vs. K (K = 10, 20, 30) for N = 100, 500, 1000, 2000; methods: EM, BICEM, VB1, VB2, gFAB]

Task: model selection
• Choose K that maximizes the objective

Results:
✓ gFAB: Successfully obtains the true K = 10 while skipping K = 10, . . . , 29
✗ EM: Always overestimates K (as suggested in Prop. 10+)
✗ VB1: Selects the true K but needs to compute all K = 1, . . . , 30

19 / 21
Conclusion

Summary of this talk:
• FAB: Tractable Bayesian method for binary LVMs
• Proposed gFAB for general LVMs (e.g. PCA)
• Theoretical analysis showing the desirable properties of gFAB

At the poster session (right after this talk), we will explain more details such as:
• Full derivation of gFIC
• "High-level" mechanism of model pruning
• ...

20 / 21
Future work
• Potentially applicable to a wide class of LVMs
  • factor analysis, CCA, partial membership, linear dynamical systems, ...
• If you are interested, let's collaborate!

Thank you!
References
[Fujimaki+ 12a] Fujimaki, Ryohei and Morinaga, Satoshi. Factorized asymptotic Bayesian inference for mixture modeling. In AISTATS, 2012.
[Fujimaki+ 12b] Fujimaki, Ryohei and Hayashi, Kohei. Factorized asymptotic Bayesian hidden Markov model. In ICML, 2012.
[KH+ 13] Hayashi, Kohei and Fujimaki, Ryohei. Factorized asymptotic Bayesian inference for latent feature models. In NIPS, 2013.
[Eto+ 14] Eto, Riki, Fujimaki, Ryohei, Morinaga, Satoshi, and Tamano, Hiroshi. Fully-automatic Bayesian piecewise sparse linear models. In AISTATS, 2014.
[Liu+ yesterday] Liu, Chunchen, Feng, Lu, Fujimaki, Ryohei, and Muraoka, Yusuke. Scalable model selection for large-scale factorial relational models. In ICML, 2015.

21 / 21
Model Selection

Goal: Choose a good model
• A wrong model degrades the final output (e.g., wrong prediction, messy visualization)

[Diagram: data fit by a simple, an intermediate, and a complex model]

How can we select a good model?

22 / 21
Factorized Information Criterion (FIC)

Recently developed model selection framework
• Applicable to many discrete LVMs:
  • mixture models [Fujimaki+ 12]
  • hidden Markov models [Fujimaki+ 12]
  • latent feature models [Hayashi+ 13]
  • mixture of experts [Eto+ 14]
• Accurate if we have a lot of data samples
  • Fits "Big Data" settings
• Computationally efficient one-pass model selection
  • ARD-like "pruning"

23 / 21
Bayesian Model Selection

Choose K that maximizes the marginal likelihood:

p(X \mid K) = \int \underbrace{p(\Pi)}_{\text{prior}} \, \underbrace{p(X \mid \Pi, K)}_{\text{likelihood}} \, d\Pi.

The integral is intractable.
• Need an approximation of p(X \mid K): BIC

24 / 21
Derivation of BIC

Suppose p(X \mid \Pi, K) has the unique MLE \hat{\Pi} \Leftrightarrow p(X \mid \Pi, K) is regular.

Then Laplace's method yields:

\ln p(X \mid K) = \mathrm{BIC}(K) + O(1),
\mathrm{BIC}(K) = \ln p(X \mid \hat{\Pi}, K) - \frac{D_\Pi}{2} \ln N.

• Assuming \ln p(\Pi) = O(1)

25 / 21
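BIC as defined above is straightforward to evaluate once the maximized log-likelihood is known. A sketch for a single 1-D Gaussian (a regular model with D_\Pi = 2 parameters); the helper name `bic` is ours:

```python
import numpy as np

def bic(loglik_at_mle, d_pi, n):
    """BIC(K) = ln p(X | Pi_hat, K) - (D_Pi / 2) * ln N."""
    return loglik_at_mle - 0.5 * d_pi * np.log(n)

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=500)
var_hat = x.var()                      # MLE of the variance
# Maximized Gaussian log-likelihood: -(N/2) * (ln(2*pi*var_hat) + 1)
ll = -0.5 * len(x) * (np.log(2 * np.pi * var_hat) + 1.0)
print(bic(ll, d_pi=2, n=len(x)))
```

The (D_\Pi/2) ln N term is exactly the price of the D_\Pi free parameters under the flat-prior assumption on the slide.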
Why BIC Fails

In LVMs, p(X \mid \Pi, K) is NOT regular.
• p(X \mid \Pi, K) = \int p(X, Z \mid \Pi, K) \, dZ
• Mixing various Z's \Rightarrow p(X \mid \Pi, K) has many maxima \Rightarrow Laplace's method doesn't work.

Idea of gFIC: Apply Laplace's method separately for each Z

26 / 21
Derivation of gFIC

To separate p(X, Z \mid \Pi, K), use the variational bound:

\ln p(X \mid K) = \mathbb{E}_q\Big[\ln \int p(X, Z \mid \Pi, K) \, p(\Pi) \, d\Pi\Big] + H(q) + \mathrm{KL}(q \,\|\, p(Z \mid X, K))

• q(Z): a distribution of Z
• H(q): the entropy of q
• \mathbb{E}_q[f(Z)] = \int q(Z) f(Z) \, dZ
• Moves \int dZ to the outside of p(X, Z \mid \Pi, K).

If p(X, Z \mid \Pi, K) is regular \to Laplace's method works!
• The regularity depends on Z.
• Need case analysis for Z.

27 / 21
Case 1: p(X, Z \mid \Pi, K) Is Regular

Laplace's method yields:

\ln \int p(X, Z \mid \Pi, K) \, p(\Pi) \, d\Pi = \mathrm{BIC}(K) - \frac{1}{2} \ln |\mathcal{F}| + O(1)

• \mathcal{F}: the Hessian of -\ln p(X, Z \mid \hat{\Pi}, K)/N w.r.t. \Xi

Remark:
• BIC ignores \frac{1}{2} \ln |\mathcal{F}| as a constant.
• However, gFIC includes it: \mathcal{F} depends on Z and will change the solution of q.

28 / 21
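Laplace's method underlying this case can be sanity-checked on a conjugate example where the exact marginal likelihood has a closed form: x_n ~ N(\mu, 1) with prior \mu ~ N(0, s^2). Since the log-joint is exactly quadratic in \mu, the Laplace approximation is exact here; all variable names below are ours:

```python
import numpy as np

# Conjugate check of Laplace's method: x_n ~ N(mu, 1), mu ~ N(0, s2).
rng = np.random.default_rng(2)
n, s2 = 200, 100.0
x = rng.normal(1.5, 1.0, size=n)

def log_joint(mu):
    """ln p(X | mu) + ln p(mu)."""
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)
            - 0.5 * np.log(2 * np.pi * s2) - 0.5 * mu ** 2 / s2)

# Laplace: ln p(X) ~= log_joint(mu_hat) + (1/2) ln(2*pi) - (1/2) ln H,
# where H = -d^2 log_joint / d mu^2 = n + 1/s2 (constant in mu here).
mu_hat = x.sum() / (n + 1.0 / s2)          # mode of the log-joint
H = n + 1.0 / s2
laplace = log_joint(mu_hat) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H)

# Exact marginal likelihood from the conjugate Gaussian formula.
exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n * s2 + 1.0)
         - 0.5 * (np.sum(x ** 2) - x.sum() ** 2 * s2 / (n * s2 + 1.0)))
print(laplace, exact)   # the two agree up to floating-point error
```

In non-Gaussian regular models the agreement is only asymptotic, which is where the O(1) error terms on these slides come from.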
Case 2: p(X, Z \mid \Pi, K) Is Not Regular

In this case, Z is degenerate, i.e.
• z_{\cdot k} = 0 (zero column vectors)
• z_{\cdot k} = z_{\cdot l \ne k} (overlapping representation)

Eliminating such components gives a regular model with K' = \mathrm{rank}(Z) < K:

p(X, Z \mid \Pi, K) = p(X, \tilde{Z} \mid \tilde{\Pi}, K')

Now Laplace's method yields:*

\ln \int p(X, \tilde{Z} \mid \tilde{\Pi}, K') \, \tilde{p}(\tilde{\Pi}) \, d\tilde{\Pi} = \mathrm{BIC}(K') - \frac{1}{2} \ln |\tilde{\mathcal{F}}| + O(1)

* Assuming the support of p(\Pi) is compact or the whole space

29 / 21
Substituting the Laplace approximation into the variational bound, we obtain gFIC:

\ln p(X \mid K) = \mathrm{gFIC} + O(1),
\mathrm{gFIC} = \mathbb{E}_{q^*}\Big[\ln p(X, \tilde{Z} \mid \hat{\tilde{\Pi}}) - \frac{1}{2} \ln |\tilde{\mathcal{F}}|\Big] + H(q^*) - O(\ln N).

30 / 21
Model Pruning

Proposition 4:

q^*(Z) = p_K(Z)\,(1 + O(N^{-1})),

p_K(Z) \equiv \begin{cases} p(Z, X \mid \hat{\Pi}, K)\,|\mathcal{F}_{\hat{\Pi}}|^{-1/2} / C & K = \kappa(Z), \\ p_{\kappa(Z)}(\tilde{Z}) & K > \kappa(Z), \end{cases}

[Diagram: Z of width K reduced to width K']

31 / 21