
TRANSCRIPT

Page 1: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Approximating The Kullback-Leibler Divergence Between Gaussian Mixture Models
ICASSP 2007

John R. Hershey and Peder A. Olsen, IBM T. J. Watson Research Center

Speaker: 孝宗

Page 2: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Outline

• Introduction
• Kullback-Leibler Divergence
• Methods
  • Monte Carlo Sampling
  • The Unscented Transformation
  • Gaussian Approximations
  • The Product of Gaussian Approximation
  • The Matched Bound Approximation
  • The Variational Approximation
  • The Variational Upper Bound

Page 3: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Introduction

• Kullback-Leibler Divergence: relative entropy.
• The KLD between two PDFs $f$ and $g$ is
$$D(f\|g) = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx.$$
• Three properties:
  1. Self similarity: $D(f\|f) = 0$.
  2. Self identification: $D(f\|g) = 0$ only if $f = g$.
  3. Positivity: $D(f\|g) \ge 0$ for all $f$, $g$.

Page 4: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Introduction

• The KL divergence is used in many aspects of speech and image recognition, such as determining whether two acoustic models are similar [2], and measuring how confusable two words or HMMs are [3, 4, 5].
• For two Gaussians $f = \mathcal{N}(\mu_f, \Sigma_f)$ and $g = \mathcal{N}(\mu_g, \Sigma_g)$, the KLD has a closed-form expression (implemented in the sketch below):
$$D(f\|g) = \frac{1}{2}\left[\log\frac{|\Sigma_g|}{|\Sigma_f|} + \operatorname{Tr}\!\left(\Sigma_g^{-1}\Sigma_f\right) + (\mu_f - \mu_g)^\top \Sigma_g^{-1} (\mu_f - \mu_g) - d\right].$$
• Whereas for two GMMs no such closed-form expression exists.
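A minimal NumPy sketch of this closed-form expression. The function name `gauss_kl` and the array-based parameterization are mine, not from the paper; later sketches in this transcript reuse this helper.

```python
import numpy as np

def gauss_kl(mu_f, cov_f, mu_g, cov_g):
    """KL( N(mu_f, cov_f) || N(mu_g, cov_g) ) for full-covariance Gaussians."""
    d = mu_f.shape[0]
    _, logdet_f = np.linalg.slogdet(cov_f)
    _, logdet_g = np.linalg.slogdet(cov_g)
    cov_g_inv = np.linalg.inv(cov_g)
    diff = mu_f - mu_g
    return 0.5 * (logdet_g - logdet_f
                  + np.trace(cov_g_inv @ cov_f)
                  + diff @ cov_g_inv @ diff
                  - d)

# Quick check of the properties: identical Gaussians give 0 (self similarity),
# different Gaussians give a positive value.
mu, cov = np.zeros(3), np.eye(3)
print(gauss_kl(mu, cov, mu, cov))            # ~0.0
print(gauss_kl(mu, cov, mu + 1.0, 2 * cov))  # > 0
```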

Page 5: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Monte Carlo Sampling

• The idea is to draw samples $x_i$ from the pdf $f(x)$, so that $E_f\!\left[\log\frac{f(x)}{g(x)}\right] = D(f\|g)$.
• Using $n$ i.i.d. (independent and identically distributed) samples we have
$$D_{MC}(f\|g) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i)}{g(x_i)} \;\longrightarrow\; D(f\|g) \quad \text{as } n \to \infty.$$
• To draw a sample from a GMM we first draw a discrete component index according to the probabilities $\pi_a$, and then draw a continuous sample from the resulting Gaussian component (see the sketch below).
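A hedged sketch of the Monte Carlo estimate $D_{MC}(f\|g)$. A GMM is represented here as a `(weights, means, covs)` tuple of NumPy arrays; `gmm_logpdf`, `gmm_sample`, and `kl_monte_carlo` are illustrative names, not the authors' code.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_logpdf(x, weights, means, covs):
    """log f(x) for a GMM; x has shape (n, d)."""
    comp = np.stack([multivariate_normal.logpdf(x, mean=m, cov=c)
                     for m, c in zip(means, covs)])            # (K, n)
    return logsumexp(np.log(weights)[:, None] + comp, axis=0)  # (n,)

def gmm_sample(n, weights, means, covs, rng):
    """Draw n i.i.d. samples: pick a component index, then sample that Gaussian."""
    idx = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[i], covs[i]) for i in idx])

def kl_monte_carlo(f, g, n=100_000, seed=0):
    """D_MC(f||g) ~ (1/n) * sum_i log f(x_i)/g(x_i), with x_i drawn from f."""
    rng = np.random.default_rng(seed)
    x = gmm_sample(n, *f, rng)
    return np.mean(gmm_logpdf(x, *f) - gmm_logpdf(x, *g))
```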

Page 6: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Unscented Transformation

1. Choose sigma points.
2. Assign each sigma point a weight.
3. Apply the non-linear transform to each sigma point.
4. Compute a Gaussian from the weighted, transformed points.

(Illustration from http://ais.informatik.uni-freiburg.de/teaching/ws12/mapping/pdf/slam05-ukf.pdf)

Page 7: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Unscented Transformation
• An approach to estimate $E_{f_a}[h(x)]$ in such a way that the approximation is exact for all quadratic functions $h$.
• It is possible to pick $2d$ sigma points $x_{a,k}$ such that
$$\int f_a(x)\,h(x)\,dx \approx \frac{1}{2d}\sum_{k=1}^{2d} h(x_{a,k}).$$
• One possible choice of the sigma points is
$$x_{a,k} = \mu_a + \sqrt{d\,\lambda_{a,k}}\,e_{a,k}, \qquad x_{a,d+k} = \mu_a - \sqrt{d\,\lambda_{a,k}}\,e_{a,k}, \quad k = 1,\dots,d,$$
where $\lambda_{a,k}$ and $e_{a,k}$ are the eigenvalues and eigenvectors of $\Sigma_a$ (see the sketch below).
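A sketch of the resulting KL estimate under the eigenvector-based sigma points above, with the uniform $1/(2d)$ weighting from the slide. It reuses `gmm_logpdf` from the Monte Carlo sketch; `sigma_points` and `kl_unscented` are illustrative names.

```python
import numpy as np
# gmm_logpdf: GMM log-density helper from the Monte Carlo sketch above.

def sigma_points(mean, cov):
    """2d sigma points mu +/- sqrt(d * lambda_k) e_k from the eigendecomposition of cov."""
    d = mean.shape[0]
    lam, vec = np.linalg.eigh(cov)
    offsets = vec * np.sqrt(d * lam)          # column k is sqrt(d*lambda_k) * e_k
    return np.concatenate([mean + offsets.T, mean - offsets.T])   # (2d, d)

def kl_unscented(f, g):
    """Replace each expectation E_{f_a}[log f - log g] by an equally weighted
    average over that component's 2d sigma points, then mix with the weights pi_a."""
    weights, means, covs = f
    total = 0.0
    for pi_a, mu_a, cov_a in zip(weights, means, covs):
        pts = sigma_points(mu_a, cov_a)
        total += pi_a * np.mean(gmm_logpdf(pts, *f) - gmm_logpdf(pts, *g))
    return total
```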

Page 8: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Unscented Transformation

Page 9: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Gaussian Approximations

• Replace $f$ and $g$ with single Gaussians $\hat f$ and $\hat g$ that match the mean and covariance of each mixture:
$$\mu_{\hat f} = \sum_a \pi_a \mu_a, \qquad \Sigma_{\hat f} = \sum_a \pi_a\left(\Sigma_a + (\mu_a - \mu_{\hat f})(\mu_a - \mu_{\hat f})^\top\right),$$
and take $D_{gaussian}(f\|g) = D(\hat f\|\hat g)$ using the closed-form Gaussian KLD (see the sketch below).

• Another method:
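A sketch of the moment-matching step, reusing `gauss_kl` from the earlier sketch; `moment_match` and `kl_gaussian_approx` are illustrative names.

```python
import numpy as np
# gauss_kl: closed-form Gaussian KL from the earlier sketch.

def moment_match(weights, means, covs):
    """Collapse a GMM to a single Gaussian with the same mean and covariance."""
    mu = weights @ means                                   # overall mean
    diff = means - mu
    cov = (np.einsum('a,aij->ij', weights, covs)
           + np.einsum('a,ai,aj->ij', weights, diff, diff))
    return mu, cov

def kl_gaussian_approx(f, g):
    """D_gaussian(f||g): KL between the moment-matched Gaussians of f and g."""
    return gauss_kl(*moment_match(*f), *moment_match(*g))
```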

Page 10: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Product of Gaussian Approximation

• The likelihood is defined by $L_f(g) \triangleq E_{f(x)}[\log g(x)]$, so that $D(f\|g) = L_f(f) - L_f(g)$.
• The integral of a product of two Gaussian components has a closed form:
$$z_{ab} \triangleq \int f_a(x)\,g_b(x)\,dx = \mathcal{N}\!\left(\mu_a;\ \mu_b,\ \Sigma_a + \Sigma_b\right).$$

Page 11: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Product of Gaussian Approximation

• By Jensen's inequality (because $\log$ is a concave function),
$$E_{f_a}[\log g(x)] \le \log E_{f_a}[g(x)] = \log \sum_b \omega_b \int f_a(x)\,g_b(x)\,dx = \log \sum_b \omega_b z_{ab},$$
and likewise for $E_{f_a}[\log f(x)]$.

Page 12: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Product of Gaussian Approximation

• Closed-form solution:
$$D_{product}(f\|g) = \sum_a \pi_a \log \frac{\sum_{a'} \pi_{a'} \int f_a(x)\,f_{a'}(x)\,dx}{\sum_b \omega_b \int f_a(x)\,g_b(x)\,dx}.$$
• Satisfies self-similarity, self-identification, and positivity.
• Tends to greatly underestimate $D(f\|g)$, a consequence of Jensen's inequality (see the sketch below).
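A sketch of this approximation under the same `(weights, means, covs)` representation; `product_integral` implements the Gaussian product formula above, and `z_matrix` and `kl_product` are illustrative names.

```python
import numpy as np
from scipy.stats import multivariate_normal

def product_integral(mu_a, cov_a, mu_b, cov_b):
    """z_ab = int N(x; mu_a, cov_a) N(x; mu_b, cov_b) dx = N(mu_a; mu_b, cov_a + cov_b)."""
    return multivariate_normal.pdf(mu_a, mean=mu_b, cov=cov_a + cov_b)

def z_matrix(f, g):
    """All pairwise product integrals between the components of f and g; shape (A, B)."""
    _, mf, cf = f
    _, mg, cg = g
    return np.array([[product_integral(ma, ca, mb, cb) for mb, cb in zip(mg, cg)]
                     for ma, ca in zip(mf, cf)])

def kl_product(f, g):
    """D_product(f||g) = sum_a pi_a log( sum_a' pi_a' z_aa'(f,f) / sum_b w_b z_ab(f,g) )."""
    pi = f[0]
    num = z_matrix(f, f) @ pi      # sum_a' pi_a' * int f_a f_a'
    den = z_matrix(f, g) @ g[0]    # sum_b  w_b   * int f_a g_b
    return pi @ np.log(num / den)
```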

Page 13: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Matched Bound Approximation
• If $f$ and $g$ have the same number of components, the log-sum inequality gives an upper bound:
$$
\begin{aligned}
D(f\|g) &\le \int \sum_i \pi_i f_i \log\frac{\pi_i f_i}{\omega_i g_i}\,dx \\
&= \int \sum_i \pi_i f_i \log\frac{\pi_i}{\omega_i}\,dx + \int \sum_i \pi_i f_i \log\frac{f_i}{g_i}\,dx \\
&= \sum_i \pi_i \log\frac{\pi_i}{\omega_i} \int f_i\,dx + \sum_i \pi_i \int f_i \log\frac{f_i}{g_i}\,dx \\
&= D(\pi\|\omega) + \sum_i \pi_i\,D(f_i\|g_i),
\end{aligned}
$$
since each $\int f_i\,dx = 1$; here $D(\pi\|\omega) = \sum_i \pi_i \log(\pi_i/\omega_i)$ is the discrete KLD between the mixture weights. (The first step is the log-sum inequality.)

Page 14: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Matched Bound Approximation
• Goldberger's approximate formula: define a match function
$$m(a) = \arg\min_b \left( D(f_a\|g_b) - \log\omega_b \right)$$
and approximate
$$D_{goldberger}(f\|g) = \sum_a \pi_a \left( D(f_a\|g_{m(a)}) + \log\frac{\pi_a}{\omega_{m(a)}} \right).$$
• Satisfies self-similarity, self-identification, and positivity (see the sketch below).
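A sketch of Goldberger's matching rule as reconstructed above, reusing `gauss_kl` from the earlier sketch; `component_kl_matrix` and `kl_goldberger` are illustrative names.

```python
import numpy as np
# gauss_kl: closed-form Gaussian KL from the earlier sketch.

def component_kl_matrix(f, g):
    """D(f_a || g_b) for every pair of components; shape (A, B)."""
    _, mf, cf = f
    _, mg, cg = g
    return np.array([[gauss_kl(ma, ca, mb, cb) for mb, cb in zip(mg, cg)]
                     for ma, ca in zip(mf, cf)])

def kl_goldberger(f, g):
    """Match each f_a to the component g_b minimising D(f_a||g_b) - log w_b,
    then sum the matched terms weighted by pi_a."""
    pi, w = f[0], g[0]
    D = component_kl_matrix(f, g)
    m = np.argmin(D - np.log(w)[None, :], axis=1)          # m(a)
    return pi @ (D[np.arange(len(pi)), m] + np.log(pi / w[m]))
```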

Page 15: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Variational Approximation
• Define variational parameters $\phi_{b|a} \ge 0$ such that $\sum_b \phi_{b|a} = 1$. With $L_f(g) \triangleq E_{f(x)}[\log g(x)]$,
$$
L_f(g) = \sum_a \pi_a\, E_{f_a}\!\left[\log \sum_b \phi_{b|a}\,\frac{\omega_b\,g_b(x)}{\phi_{b|a}}\right]
\;\ge\; \sum_a \pi_a \sum_b \phi_{b|a}\, E_{f_a}\!\left[\log \frac{\omega_b\,g_b(x)}{\phi_{b|a}}\right]
\;\triangleq\; \mathcal{L}_f(g,\phi),
$$
a lower bound on $L_f(g)$ by Jensen's inequality.
• We get the best bound by maximizing $\mathcal{L}_f(g,\phi)$ with respect to $\phi$; the maximum is attained at
$$\hat\phi_{b|a} = \frac{\omega_b\,e^{-D(f_a\|g_b)}}{\sum_{b'} \omega_{b'}\,e^{-D(f_a\|g_{b'})}}.$$

Page 16: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Variational Approximation
• Bounding $L_f(f)$ in the same way with variational parameters $\psi_{a'|a}$, the best bound is attained at
$$\hat\psi_{a'|a} = \frac{\pi_{a'}\,e^{-D(f_a\|f_{a'})}}{\sum_{a''} \pi_{a''}\,e^{-D(f_a\|f_{a''})}}.$$
• Define
$$
D_{variational}(f\|g) \triangleq \mathcal{L}_f(f,\hat\psi) - \mathcal{L}_f(g,\hat\phi)
= \sum_a \pi_a \log \frac{\sum_{a'} \pi_{a'}\,e^{-D(f_a\|f_{a'})}}{\sum_b \omega_b\,e^{-D(f_a\|g_b)}}.
$$
• For equal numbers of components, if we restrict $\psi$ and $\phi$ to have only one non-zero element for a given $a$, the formula reduces exactly to the chain-rule upper bound given in equation (13).
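A sketch of $D_{variational}$ computed from the pairwise component divergences, reusing `component_kl_matrix` from the Goldberger sketch; `logsumexp` keeps the exponentials numerically stable.

```python
import numpy as np
from scipy.special import logsumexp
# component_kl_matrix: pairwise Gaussian KLs, as in the Goldberger sketch above.

def kl_variational(f, g):
    """D_variational(f||g) = sum_a pi_a * log( sum_a' pi_a' exp(-D(f_a||f_a'))
                                             / sum_b  w_b   exp(-D(f_a||g_b)) )."""
    pi, w = f[0], g[0]
    num = logsumexp(np.log(pi)[None, :] - component_kl_matrix(f, f), axis=1)
    den = logsumexp(np.log(w)[None, :]  - component_kl_matrix(f, g), axis=1)
    return pi @ (num - den)
```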

Page 17: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Variational Upper Bound
• Define variational parameters $\phi_{ab} \ge 0$ and $\psi_{ab} \ge 0$ satisfying the constraints $\sum_a \phi_{ab} = \omega_b$ and $\sum_b \psi_{ab} = \pi_a$.
• With this notation we use Jensen's inequality to obtain an upper bound on the KL divergence as follows:
$$D(f\|g) \;\le\; \sum_{a,b} \psi_{ab}\left(\log\frac{\psi_{ab}}{\phi_{ab}} + D(f_a\|g_b)\right).$$

Page 18: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

The Variational Upper Bound
• The bound is tightened by finding the variational parameters $\phi$ and $\psi$ that minimize it.
• The problem is convex in $\phi$ as well as in $\psi$, so we can fix one and optimize for the other, alternating until convergence.
• Since any zero in $\phi$ and $\psi$ is fixed under the iteration, we recommend starting with strictly positive values in every entry (see the sketch below).
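A sketch of one plausible alternating scheme. The bound itself is the expression above; the two closed-form coordinate updates are my reconstruction from the stated convexity (each is the standard Lagrangian solution of its sub-problem), not code quoted from the paper.

```python
import numpy as np
# component_kl_matrix: pairwise Gaussian KLs, as in the Goldberger sketch above.

def kl_variational_upper_bound(f, g, n_iter=20):
    """Minimise  sum_{a,b} psi_ab * ( log(psi_ab/phi_ab) + D(f_a||g_b) )
    subject to  sum_a phi_ab = w_b  and  sum_b psi_ab = pi_a,
    by alternating the two convex sub-problems."""
    pi, w = f[0], g[0]
    D = component_kl_matrix(f, g)                    # (A, B)
    # Strictly positive start: zeros are fixed points of the updates.
    phi = np.outer(pi, w)
    psi = np.outer(pi, w)
    for _ in range(n_iter):
        # Optimal psi for fixed phi: per row a, psi_ab ~ phi_ab * exp(-D_ab), summing to pi_a.
        t = phi * np.exp(-(D - D.min(axis=1, keepdims=True)))   # row shift for stability
        psi = pi[:, None] * t / t.sum(axis=1, keepdims=True)
        # Optimal phi for fixed psi: per column b, phi_ab ~ psi_ab, summing to w_b.
        phi = w[None, :] * psi / psi.sum(axis=0, keepdims=True)
    return np.sum(psi * (np.log(psi / phi) + D))
```

Monitoring the bound value across iterations is an easy way to confirm it decreases monotonically under these updates.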

Page 19: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Experiments

• The acoustic model consists of a total of 9,998 Gaussians belonging to 826 separate GMMs. The number of Gaussians per GMM varies from 1 to 76; 5 mixtures attained the lower bound of 1.
• The median number of Gaussians per GMM was 9.
• We used all combinations of these 826 GMMs to test the various approximations to the KL divergence.

Page 20: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Experiments
• Each of the methods was compared to the reference approximation: the Monte Carlo method with one million samples, denoted $D_{MC}(1M)$.

• The vertical axis represents the probability derived from a histogram of the deviations taken across all pairs of GMMs.

Page 21: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Experiments

Page 22: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Experiments

Page 23: Approximating The Kullback- Leibler Divergence Between Gaussian Mixture Models ICASSP 2007 John R. Hershey and Peder A. Olsen IBM T. J. Watson Research

Conclusion

• If accuracy is the primary concern, then Monte Carlo sampling is clearly best.
• When computation time is an issue, or when gradients need to be evaluated, the proposed methods may be useful.
• Finally, some of the more popular methods should be avoided, since better alternatives exist.