EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA)

Duc-Hieu Tran (tdh.net [at] gmail.com)
Nanyang Technological University
July 27, 2010



Outline

- The parameter estimation problem
- EM algorithm
- Probabilistic Latent Semantic Analysis
- Reference

The parameter estimation problem

Introduction

- Knowing the prior probabilities $P(\omega_i)$ and the class-conditional densities $p(x|\omega_i)$ yields the optimal classifier:
  - $P(\omega_j|x) \propto p(x|\omega_j)P(\omega_j)$
  - decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \neq i$
- In practice, $p(x|\omega_i)$ is unknown and must be estimated from training samples (e.g., by assuming $p(x|\omega_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$).
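
A minimal sketch of this decision rule, assuming two classes with known 1-D Gaussian class-conditional densities (all numbers are illustrative, not from the slides):

```python
# Bayes decision rule with assumed known priors and Gaussian class-conditionals.
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                    # P(w_i), assumed known
means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def classify(x):
    # posterior ∝ p(x | w_j) * P(w_j); decide the class maximizing it
    posteriors = norm.pdf(x, means, stds) * priors
    return np.argmax(posteriors)

print(classify(0.3))  # -> 0: x = 0.3 is far more likely under class 0
```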

Frequentist vs. Bayesian schools

Frequentist

- parameters are quantities whose values are fixed but unknown.
- the best estimate of their values is the one that maximizes the probability of obtaining the observed samples.

Bayesian

- parameters are random variables having some known prior distribution.
- observing the samples converts this prior into a posterior density, revising our opinion about the true values of the parameters.

Examples

- training samples: $S = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
- frequentist: maximum likelihood

  $$\max_\theta \prod_i p(y^{(i)} | x^{(i)}; \theta)$$

- Bayesian: $P(\theta)$ is a prior, e.g., $\theta \sim \mathcal{N}(0, I)$

  $$P(\theta|S) \propto \left( \prod_{i=1}^m P(y^{(i)} | x^{(i)}, \theta) \right) P(\theta)$$

  $$\theta_{\text{MAP}} = \arg\max_\theta P(\theta|S)$$
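
A minimal numeric sketch contrasting the two estimates, assuming a 1-D Gaussian likelihood with known variance $\sigma^2$ and prior $\mu \sim \mathcal{N}(0, 1)$, where both estimators have closed forms (variable names are illustrative):

```python
# ML vs. MAP estimation of a Gaussian mean mu, known variance sigma^2,
# prior mu ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(loc=1.5, scale=sigma, size=20)   # observed samples

mu_ml = x.mean()                                 # maximizes the likelihood
# MAP: log posterior = -sum((x_i - mu)^2)/(2 sigma^2) - mu^2/2 + const;
# setting its derivative to zero gives a shrunken estimate.
m = len(x)
mu_map = x.sum() / (m + sigma**2)

print(mu_ml, mu_map)  # the MAP estimate is pulled toward the prior mean 0
```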

EM algorithm

An estimation problem

- training set of $m$ independent samples: $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$
- goal: fit the parameters of a model $p(x, z)$ to the data
- the log-likelihood:

  $$\ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta) = \sum_{i=1}^m \log \sum_z p(x^{(i)}, z; \theta)$$

- explicitly maximizing $\ell(\theta)$ might be difficult.
- $z$ is a latent random variable; if $z^{(i)}$ were observed, then maximum likelihood estimation would be easy.
- strategy: repeatedly construct a lower bound on $\ell$ (E-step) and optimize that lower bound (M-step).

EM algorithm (1)

- digression: Jensen's inequality. For a convex function $f$, $E[f(X)] \geq f(E[X])$.
- for each $i$, let $Q_i$ be a distribution over $z$: $\sum_z Q_i(z) = 1$, $Q_i(z) \geq 0$

  $$\ell(\theta) = \sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \quad (1)$$

  applying Jensen's inequality to the concave function $\log$:

  $$\ell(\theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \quad (2)$$

More detail in the appendix.

EM algorithm (2)

- for any set of distributions $Q_i$, formula (2) gives a lower bound on $\ell(\theta)$
- how to choose $Q_i$?
- strategy: make the inequality hold with equality at our particular value of $\theta$.
- require:

  $$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$$

  where $c$ is a constant that does not depend on $z^{(i)}$
- choose: $Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$
- we know $\sum_z Q_i(z^{(i)}) = 1$, so

  $$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} | x^{(i)}; \theta)$$

EM algorithm (3)

- $Q_i$ is the posterior distribution of $z^{(i)}$ given $x^{(i)}$ and the parameters $\theta$

EM algorithm: repeat until convergence

- E-step: for each $i$,

  $$Q_i(z^{(i)}) := p(z^{(i)} | x^{(i)}; \theta)$$

- M-step:

  $$\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

The algorithm will converge, since $\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$.
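
A minimal sketch of these two steps for a two-component 1-D Gaussian mixture with fixed unit variances, where the latent $z^{(i)}$ is the unobserved component label (a classic instance, with illustrative parameters, not taken from the slides):

```python
# EM for a mixture of two 1-D Gaussians with unit variance.
# Q_i is the posterior over the component that generated x^(i).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 200)])

pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])  # initial parameters
for _ in range(50):
    # E-step: Q_i(z) = p(z | x_i; theta), an (m, 2) matrix of responsibilities
    q = pi * norm.pdf(x[:, None], mu, 1.0)
    q /= q.sum(axis=1, keepdims=True)
    # M-step: the lower bound is maximized in closed form
    pi = q.mean(axis=0)
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)

print(pi, mu)  # approaches mixing weights (1/3, 2/3) and means (-2, 3)
```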

EM algorithm (4)

Digression: coordinate ascent algorithm.

- $\max_\alpha W(\alpha_1, \ldots, \alpha_m)$
- loop until convergence: for $i \in 1, \ldots, m$:

  $$\alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \ldots, \hat{\alpha}_i, \ldots, \alpha_m)$$

EM algorithm as coordinate ascent:

- define

  $$J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

- $\ell(\theta) \geq J(Q, \theta)$
- EM can be viewed as coordinate ascent on $J$: the E-step maximizes it w.r.t. $Q$, the M-step w.r.t. $\theta$.
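
A toy sketch of coordinate ascent on a strictly concave quadratic, where each coordinate update has a closed form (the function $W$ here is made up purely for illustration):

```python
# Coordinate ascent on W(a, b) = -(a - 1)^2 - (b + 2)^2 - a*b.
# Each update maximizes W in one coordinate with the other held fixed.
a, b = 0.0, 0.0
for _ in range(100):
    a = (2 - b) / 2    # argmax_a W: dW/da = -2(a - 1) - b = 0
    b = -(4 + a) / 2   # argmax_b W: dW/db = -2(b + 2) - a = 0

print(a, b)  # converges to the global maximizer (8/3, -10/3)
```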

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis (1)

- set of documents $D = \{d_1, \ldots, d_N\}$;
  set of words $W = \{w_1, \ldots, w_M\}$;
  set of unobserved classes $Z = \{z_1, \ldots, z_K\}$
- conditional independence assumption:

  $$P(d_i, w_j | z_k) = P(d_i | z_k) P(w_j | z_k) \quad (3)$$

- so,

  $$P(w_j | d_i) = \sum_{k=1}^K P(z_k | d_i) P(w_j | z_k) \quad (4)$$

  $$P(d_i, w_j) = P(d_i) \sum_{k=1}^K P(w_j | z_k) P(z_k | d_i)$$

More detail in the appendix.

Probabilistic Latent Semantic Analysis (2)

- $n(d_i, w_j)$: number of occurrences of word $w_j$ in document $d_i$
- likelihood:

  $$L = \prod_{i=1}^N \prod_{j=1}^M [P(d_i, w_j)]^{n(d_i, w_j)} = \prod_{i=1}^N \prod_{j=1}^M \left[ P(d_i) \sum_{k=1}^K P(w_j | z_k) P(z_k | d_i) \right]^{n(d_i, w_j)}$$

- log-likelihood $\ell = \log L$:

  $$\ell = \sum_{i=1}^N \sum_{j=1}^M \left[ n(d_i, w_j) \log P(d_i) + n(d_i, w_j) \log \sum_{k=1}^K P(w_j | z_k) P(z_k | d_i) \right]$$

Probabilistic Latent Semantic Analysis (3)

- maximize $\ell$ w.r.t. $P(w_j | z_k)$ and $P(z_k | d_i)$
- equivalent to maximizing (the $\log P(d_i)$ term does not involve these parameters):

  $$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log \sum_{k=1}^K P(w_j | z_k) P(z_k | d_i)$$

  $$= \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log \sum_{k=1}^K Q_k(z_k) \frac{P(w_j | z_k) P(z_k | d_i)}{Q_k(z_k)}$$

  $$\geq \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K Q_k(z_k) \log \frac{P(w_j | z_k) P(z_k | d_i)}{Q_k(z_k)}$$

- choose

  $$Q_k(z_k) = \frac{P(w_j | z_k) P(z_k | d_i)}{\sum_{l=1}^K P(w_j | z_l) P(z_l | d_i)} = P(z_k | d_i, w_j)$$

More detail in the appendix.

Probabilistic Latent Semantic Analysis (4)

- equivalent to maximizing (w.r.t. $P(w_j | z_k)$, $P(z_k | d_i)$)

  $$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k | d_i, w_j) \log \frac{P(w_j | z_k) P(z_k | d_i)}{P(z_k | d_i, w_j)}$$

- equivalent to maximizing (the denominator $P(z_k | d_i, w_j)$ is held fixed from the E-step)

  $$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k | d_i, w_j) \log [P(w_j | z_k) P(z_k | d_i)]$$

Probabilistic Latent Semantic Analysis (5)

EM algorithm

- E-step: update

  $$P(z_k | d_i, w_j) = \frac{P(w_j | z_k) P(z_k | d_i)}{\sum_{l=1}^K P(w_j | z_l) P(z_l | d_i)}$$

- M-step: maximize, w.r.t. $P(w_j | z_k)$ and $P(z_k | d_i)$,

  $$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k | d_i, w_j) \log [P(w_j | z_k) P(z_k | d_i)]$$

  subject to

  $$\sum_{j=1}^M P(w_j | z_k) = 1, \quad k \in \{1, \ldots, K\}$$

  $$\sum_{k=1}^K P(z_k | d_i) = 1, \quad i \in \{1, \ldots, N\}$$

Probabilistic Latent Semantic Analysis (6)

Solution of the maximization problem in the M-step:

$$P(w_j | z_k) = \frac{\sum_{i=1}^N n(d_i, w_j) P(z_k | d_i, w_j)}{\sum_{m=1}^M \sum_{n=1}^N n(d_n, w_m) P(z_k | d_n, w_m)}$$

$$P(z_k | d_i) = \frac{\sum_{j=1}^M n(d_i, w_j) P(z_k | d_i, w_j)}{n(d_i)}$$

where $n(d_i) = \sum_{j=1}^M n(d_i, w_j)$.

More detail in the appendix.

Probabilistic Latent Semantic Analysis (7)

All together:

- E-step:

  $$P(z_k | d_i, w_j) = \frac{P(w_j | z_k) P(z_k | d_i)}{\sum_{l=1}^K P(w_j | z_l) P(z_l | d_i)}$$

- M-step:

  $$P(w_j | z_k) = \frac{\sum_{i=1}^N n(d_i, w_j) P(z_k | d_i, w_j)}{\sum_{m=1}^M \sum_{n=1}^N n(d_n, w_m) P(z_k | d_n, w_m)}$$

  $$P(z_k | d_i) = \frac{\sum_{j=1}^M n(d_i, w_j) P(z_k | d_i, w_j)}{n(d_i)}$$
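
These two steps map directly onto array operations. Below is a compact sketch (function and variable names are my own) on a hypothetical document-word count matrix `n` of shape (N, M); it keeps the posterior $P(z_k | d_i, w_j)$ as a dense (N, M, K) array, which assumes the corpus is small enough for that and that every row and column of `n` is nonzero:

```python
# pLSA EM on a count matrix n[i, j] = n(d_i, w_j).
# p_w_z[j, k] ~ P(w_j | z_k); p_z_d[i, k] ~ P(z_k | d_i).
import numpy as np

def plsa(n, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, M = n.shape
    p_w_z = rng.random((M, K)); p_w_z /= p_w_z.sum(axis=0)
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: P(z_k | d_i, w_j) for all i, j, k
        q = p_z_d[:, None, :] * p_w_z[None, :, :]        # (N, M, K)
        q /= q.sum(axis=2, keepdims=True)
        # M-step: re-estimate both conditionals from the weighted counts
        nq = n[:, :, None] * q                           # n(d_i,w_j) P(z|d_i,w_j)
        p_w_z = nq.sum(axis=0) / nq.sum(axis=(0, 1))     # normalize over words
        p_z_d = nq.sum(axis=1) / n.sum(axis=1, keepdims=True)  # over topics
    return p_w_z, p_z_d
```

Usage would look like `p_w_z, p_z_d = plsa(n, K=10)`; for a real corpus one would iterate over the nonzero entries of a sparse count matrix rather than materializing the full (N, M, K) posterior.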

Reference

- R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, Wiley-Interscience, 2001.
- T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, 2001, pp. 177-196.
- A. Ng, course notes for "Machine Learning (CS229)," Stanford University.

Appendix

Generative model for word/document co-occurrence

- select a document $d_i$ with probability $P(d_i)$
- pick a latent class $z_k$ with probability $P(z_k | d_i)$
- generate a word $w_j$ with probability $P(w_j | z_k)$

$$P(d_i, w_j) = \sum_{k=1}^K P(d_i, w_j | z_k) P(z_k) = \sum_{k=1}^K P(w_j | z_k) P(d_i | z_k) P(z_k)$$

$$= \sum_{k=1}^K P(w_j | z_k) P(z_k | d_i) P(d_i) = P(d_i) \sum_{k=1}^K P(w_j | z_k) P(z_k | d_i)$$

Since $P(d_i, w_j) = P(w_j | d_i) P(d_i)$, it follows that

$$P(w_j | d_i) = \sum_{k=1}^K P(z_k | d_i) P(w_j | z_k)$$
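
The three sampling steps translate directly into code; here is a small sketch with made-up parameters for 2 documents, 2 topics, and 3 words:

```python
# Sampling from the pLSA generative model: d ~ P(d), z ~ P(z|d), w ~ P(w|z).
import numpy as np

rng = np.random.default_rng(2)
p_d = np.array([0.5, 0.5])                    # P(d_i), 2 documents
p_z_d = np.array([[0.9, 0.1], [0.2, 0.8]])    # P(z_k | d_i), rows sum to 1
p_w_z = np.array([[0.7, 0.1],
                  [0.2, 0.1],
                  [0.1, 0.8]])                # P(w_j | z_k), columns sum to 1

def sample_pair():
    d = rng.choice(2, p=p_d)
    z = rng.choice(2, p=p_z_d[d])             # latent, never observed
    w = rng.choice(3, p=p_w_z[:, z])
    return d, w

print([sample_pair() for _ in range(5)])
```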

- Recall

  $$P(w_j | d_i) = \sum_{k=1}^K P(z_k | d_i) P(w_j | z_k)$$

  Since $\sum_{k=1}^K P(z_k | d_i) = 1$, $P(w_j | d_i)$ is a convex combination of the $P(w_j | z_k)$.
- In other words, each document is modelled as a mixture of topics.

Derivation of the posterior $P(z_k | d_i, w_j)$ used in the E-step:

$$P(z_k | d_i, w_j) = \frac{P(d_i, w_j | z_k) P(z_k)}{P(d_i, w_j)} \quad (5)$$

$$= \frac{P(w_j | z_k) P(d_i | z_k) P(z_k)}{P(d_i, w_j)} \quad (6)$$

$$= \frac{P(w_j | z_k) P(z_k | d_i)}{P(w_j | d_i)} \quad (7)$$

$$= \frac{P(w_j | z_k) P(z_k | d_i)}{\sum_{l=1}^K P(w_j | z_l) P(z_l | d_i)} \quad (8)$$

From (5) to (6) by the conditional independence assumption (3). From (6) to (7) since $P(d_i | z_k) P(z_k) = P(z_k | d_i) P(d_i)$ and $P(d_i, w_j) = P(w_j | d_i) P(d_i)$. From (7) to (8) by (4).

Derivation of the M-step solution. Introduce Lagrange multipliers $\tau_k$, $\rho_i$ for the normalization constraints:

$$H = \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k | d_i, w_j) \log [P(w_j | z_k) P(z_k | d_i)] + \sum_{k=1}^K \tau_k \left[ 1 - \sum_{j=1}^M P(w_j | z_k) \right] + \sum_{i=1}^N \rho_i \left[ 1 - \sum_{k=1}^K P(z_k | d_i) \right]$$

Setting the partial derivatives to zero:

$$\frac{\partial H}{\partial P(w_j | z_k)} = \frac{\sum_{i=1}^N P(z_k | d_i, w_j)\, n(d_i, w_j)}{P(w_j | z_k)} - \tau_k = 0$$

$$\frac{\partial H}{\partial P(z_k | d_i)} = \frac{\sum_{j=1}^M n(d_i, w_j)\, P(z_k | d_i, w_j)}{P(z_k | d_i)} - \rho_i = 0$$

From $\sum_{j=1}^M P(w_j | z_k) = 1$:

$$\tau_k = \sum_{j=1}^M \sum_{i=1}^N P(z_k | d_i, w_j)\, n(d_i, w_j)$$

From $\sum_{k=1}^K P(z_k | d_i, w_j) = 1$:

$$\rho_i = n(d_i)$$

Substituting $\tau_k$ and $\rho_i$ back into the stationarity conditions yields the M-step formulas for $P(w_j | z_k)$ and $P(z_k | d_i)$.

Applying Jensen's inequality

- $f(x) = \log(x)$ is a concave function, so the inequality direction is reversed: $E[f(X)] \leq f(E[X])$, i.e.,

$$f \left( E_{z^{(i)} \sim Q_i} \left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right] \right) \geq E_{z^{(i)} \sim Q_i} \left[ f \left( \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right) \right]$$

This is the step from (1) to (2).
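
A quick numeric sanity check of the concave case, $E[\log X] \leq \log E[X]$ (illustrative only):

```python
# Empirically verify E[log X] <= log E[X] for a positive random variable.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.5, 4.0, size=100_000)
print(np.log(x).mean() <= np.log(x.mean()))  # True
```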