
Page 1:

CS 572: Information Retrieval

Lecture 11: Topic Models

Acknowledgments: Some slides were adapted from Chris Manning and from Thomas Hofmann.


Page 2:

Plan for next few weeks

• Project 1: done (submit by Friday).

• Project 2: (topic) language models: TBA tomorrow

• Monday 2/22: no class. Watch LDA Google talk by David Blei: https://www.youtube.com/watch?v=7BMsuyBPx90

• Wednesday 2/24: guest lecture: Prof. Joyce Ho

• Monday 2/29: Semantics (conclusion); NLP for IR

• Wednesday 3/2: NLP for IR + guest lecture

• Wednesday 3/2: Midterm (take-home) assigned. Due by 5pm Thursday 3/3.


Page 3:

Recall: Term-document matrix

Today: Can we transform this matrix to identify the "meaning" or topic of the documents, and use that for retrieval, classification, etc.?

             Anthony and  Julius   The      Hamlet  Othello  Macbeth
             Cleopatra    Caesar   Tempest
anthony      5.25         3.18     0.0      0.0     0.0      0.35
brutus       1.21         6.10     0.0      1.0     0.0      0.0
caesar       8.59         2.54     0.0      1.51    0.25     0.0
calpurnia    0.0          1.54     0.0      0.0     0.0      0.0
cleopatra    2.85         0.0      0.0      0.0     0.0      0.0
mercy        1.51         0.0      1.90     0.12    5.25     0.88
worser       1.37         0.0      0.11     4.15    0.25     1.95
. . .

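As a concrete anchor for what follows, here is the matrix above as a numpy array, with a cosine-similarity helper. A minimal sketch; values are transcribed from the table and the names are illustrative:

```python
import numpy as np

# Rows: anthony, brutus, caesar, calpurnia, cleopatra, mercy, worser
# Columns: Anthony and Cleopatra, Julius Caesar, The Tempest,
#          Hamlet, Othello, Macbeth
C = np.array([
    [5.25, 3.18, 0.00, 0.00, 0.00, 0.35],
    [1.21, 6.10, 0.00, 1.00, 0.00, 0.00],
    [8.59, 2.54, 0.00, 1.51, 0.25, 0.00],
    [0.00, 1.54, 0.00, 0.00, 0.00, 0.00],
    [2.85, 0.00, 0.00, 0.00, 0.00, 0.00],
    [1.51, 0.00, 1.90, 0.12, 5.25, 0.88],
    [1.37, 0.00, 0.11, 4.15, 0.25, 1.95],
])

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Document similarity in raw term space compares columns:
print(cosine(C[:, 0], C[:, 1]))  # Anthony and Cleopatra vs. Julius Caesar
```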

Page 4:

Problems with Lexical Semantics

• Ambiguity and association in natural language

– Polysemy: Words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections).

– The word-based retrieval model is unable to discriminate between different meanings of the same word.


Page 5:

Problems with Lexical Semantics

– Synonymy: Different terms may have identical or similar meanings (weaker: words indicating the same topic).

– No associations between words are made in the matrix or vector space representation.


Page 6:

Polysemy and Context

• Document similarity on single word level: polysemy and context

(Figure: the ambiguous word "saturn" sits between two word clusters. Meaning 1, the planet: ring, jupiter, space, voyager, planet, … Meaning 2, the car company: car, dodge, ford, … A shared word makes a contribution to similarity if used in the 1st meaning, but not if in the 2nd.)


Page 7:

Solution: Topic Models

• Idea: model words in context (e.g., document)

• Examples:

– Topic models in science:

http://topics.cs.princeton.edu/Science/browser/

– Topic models in JavaScript (by David Mimno)

http://mimno.infosci.cornell.edu/jsLDA/


Page 8:

Application: Model Evolution of Topics


Page 9:

Progression of Topic Models

• Latent Semantic Analysis / Indexing (LSA / LSI)

• Probabilistic LSI (pLSI)

• Probabilistic LSI with Dirichlet priors (LDA)
– Mon 2/22: Google tech talk by David Blei

• Scalable topic models (SVD/NMF, Bayes MF)
– Wed 2/24: guest lecture by Prof. Joyce Ho

• Word2Vec, other extensions (Mon 2/29)


Page 10:

Latent Semantic Indexing (LSI)

• Perform a low-rank approximation of document-term matrix (typical rank 100-300)

• General idea

– Map documents (and terms) to a low-dimensional representation.

– Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).

– Compute document similarity based on the inner product in this latent semantic space.

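A minimal numpy sketch of this pipeline, assuming the term-document matrix C from the earlier slide (names illustrative, not from the slides):

```python
import numpy as np

def lsi(C, k):
    """Map the documents (columns of term-document matrix C) into a
    k-dimensional latent semantic space via a truncated SVD."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]
    doc_vectors = (np.diag(sk) @ Vt[:k, :]).T  # one k-dim row per document
    return Uk, sk, doc_vectors

def doc_similarity(doc_vectors, i, j):
    """Inner-product (cosine) similarity in the latent semantic space."""
    u, v = doc_vectors[i], doc_vectors[j]
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```

On the toy matrix above, lsi(C, 2) gives 2-dimensional document vectors whose similarities are hoped to reflect latent associations rather than exact term overlap.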

Page 11:

Goals of LSI

• Similar terms map to similar location in low dimensional space

• Noise reduction by dimension reduction


Page 12:

Latent Semantic Analysis

• Latent semantic space: illustrative example
(Figure courtesy of Susan Dumais; not captured in the transcript.)


Page 13:

Latent semantic indexing: Overview

Decompose the term-document matrix into a product of matrices, using the singular value decomposition (SVD): $C = U \Sigma V^T$ (where C = term-document matrix).

Then, use the SVD to compute a new, improved term-document matrix C′. Hope: get better similarity values out of C′ (compared to C). Using SVD for this purpose is called latent semantic indexing or LSI.

Page 14:

Singular Value Decomposition

For an M × N matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:

$A = U \Sigma V^T$, where U is M × M, Σ is M × N, and V is N × N.

The columns of U are orthogonal eigenvectors of AA^T.
The columns of V are orthogonal eigenvectors of A^TA.

$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ holds the singular values, $\sigma_i = \sqrt{\lambda_i}$.

The eigenvalues $\lambda_1, \ldots, \lambda_r$ of AA^T are also the eigenvalues of A^TA.

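These properties are easy to check numerically; a sketch with a random matrix (numpy only):

```python
import numpy as np

A = np.random.rand(5, 3)                  # any M x N matrix, here M=5, N=3
U, s, Vt = np.linalg.svd(A)               # A = U @ Sigma @ Vt

print(np.allclose(U.T @ U, np.eye(5)))    # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # columns of V are orthonormal

# Squared singular values = eigenvalues of A^T A
# (= the nonzero eigenvalues of A A^T):
print(np.allclose(np.sort(s**2), np.linalg.eigvalsh(A.T @ A)))
print(np.allclose(np.sort(s**2), np.linalg.eigvalsh(A @ A.T)[-3:]))
```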

Page 15:

Singular Value Decomposition

• Illustration of SVD dimensions and sparseness (figure not captured in the transcript)


Page 16:

SVD example

Let

$A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}$

with M = 3, N = 2. Its SVD, $A = U \Sigma V^T$ (U is M × M, Σ is M × N, V is N × N), is

$U = \begin{pmatrix} 1/\sqrt{6} & 1/\sqrt{2} & 1/\sqrt{3} \\ 1/\sqrt{6} & -1/\sqrt{2} & 1/\sqrt{3} \\ 2/\sqrt{6} & 0 & -1/\sqrt{3} \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad V^T = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$

Typically, the singular values are arranged in decreasing order.
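A quick numerical check of this worked example (numpy's svd returns the same factors up to column signs):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
U, s, Vt = np.linalg.svd(A)            # full SVD: U is 3x3, Vt is 2x2
print(s)                               # [1.732..., 1.0], i.e. [sqrt(3), 1]

Sigma = np.zeros((3, 2))               # Sigma has the shape of A
Sigma[:2, :2] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))  # True: the factorization reproduces A
```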

Page 17:

Low-rank Approximation

• SVD can be used to compute optimal low-rank approximations.

• Approximation problem: Find A_k of rank k such that

$A_k = \arg\min_{X :\, \mathrm{rank}(X) = k} \|A - X\|_F$  (Frobenius norm)

A_k and X are both M × N matrices. Typically, we want k << r.

Page 18:

Low-rank Approximation

• Solution via SVD: set the smallest r − k singular values to zero:

$A_k = U\, \mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\, V^T$

• In column notation, this is a sum of k rank-1 matrices:

$A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^T$

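Both forms in a short numpy sketch (random matrix; the two constructions agree):

```python
import numpy as np

A = np.random.rand(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

# Zero out the smallest r-k singular values:
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Column notation: sum of k rank-1 matrices sigma_i * u_i * v_i^T:
A_k_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))

print(np.allclose(A_k, A_k_sum))  # True
```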

Page 19:

Reduced SVD

• If we retain only k singular values, and set the rest to 0, then we don't need the matrix parts marked in red (in the figure).

• Then Σ is k×k, U is M×k, V^T is k×N, and A_k is M×N.

• This is referred to as the reduced SVD.
– It is the convenient (space-saving) and usual form for computational applications.

Page 20:

Approximation error

• How good (bad) is this approximation?

• It's the best possible, as measured by the Frobenius norm of the error (Eckart–Young):

$\min_{X :\, \mathrm{rank}(X) = k} \|A - X\|_F = \|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$

where the σ_i are ordered such that σ_i ≥ σ_{i+1}. (In the spectral norm, the error is exactly σ_{k+1}.)
– Suggests why the Frobenius error drops as k increases.
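A numerical check of both error norms (sketch):

```python
import numpy as np

A = np.random.rand(8, 5)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
for k in range(1, 5):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    fro = np.linalg.norm(A - A_k, "fro")
    spec = np.linalg.norm(A - A_k, 2)
    print(np.isclose(fro, np.sqrt(np.sum(s[k:] ** 2))),  # Frobenius error
          np.isclose(spec, s[k]))                        # spectral error = sigma_{k+1}
```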

Page 21:

SVD Low-rank approximation

• Whereas the term-doc matrix A may have M = 50,000, N = 10 million (and rank close to 50,000),

• we can construct an approximation A_100 with rank 100.

– Of all rank-100 matrices, it would have the lowest Frobenius error.

C. Eckart, G. Young. The approximation of a matrix by another of lower rank. Psychometrika, 1, 211–218, 1936.


Page 22:

Connection to Vector Space Model

• Intuition: Dimension reduction through LSI brings together “related” axes in the vector space.


Page 23:

Intuition from block matrices

(Figure: an m-terms × n-documents matrix whose non-zero entries form k diagonal blocks, Block 1 … Block k, with 0's everywhere off the blocks.)

What's the rank of this matrix?

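Sketch of the answer: each all-ones block contributes rank 1, so the whole matrix has rank k (here k = 3):

```python
import numpy as np

def block_diag(*blocks):
    """Place the given blocks on the diagonal, zeros elsewhere."""
    m = sum(b.shape[0] for b in blocks)
    n = sum(b.shape[1] for b in blocks)
    out = np.zeros((m, n))
    i = j = 0
    for b in blocks:
        out[i:i + b.shape[0], j:j + b.shape[1]] = b
        i, j = i + b.shape[0], j + b.shape[1]
    return out

M = block_diag(np.ones((3, 2)), np.ones((2, 3)), np.ones((4, 2)))
print(np.linalg.matrix_rank(M))  # 3: one dimension per topic block
```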

Page 24:

Intuition from block matrices

(Same block-diagonal figure: m terms × n documents.)

Vocabulary partitioned into k topics (clusters); each doc discusses only one topic.


Page 25:

Intuition from block matrices

(Same block-diagonal figure: m terms × n documents, non-zero entries only in the blocks.)

What's the best rank-k approximation to this matrix?


Page 26:

Intuition from block matrices

(Figure: the same k-block structure, but now with a few nonzero entries outside the blocks; e.g., the rows for car and automobile have complementary 0/1 entries against context terms such as wiper, tire, V6.)

Likely there's a good rank-k approximation to this matrix.


Page 27:

Assumption/Hope

(Figure: the documents cluster around Topic 1, Topic 2, Topic 3.)

Page 28:

Latent Semantic Indexing by SVD


Page 29:

Performing the maps

• Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.

• Claim: this is not only the mapping with the best (Frobenius-error) approximation to A, but it in fact improves retrieval.

• A query q is also mapped into this space, by

$q_k = \Sigma_k^{-1} U_k^T q$

– Note: the mapped query is NOT a sparse vector.
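A sketch of the query mapping, reusing Uk and sk from the lsi sketch earlier (names illustrative):

```python
import numpy as np

def map_query(q, Uk, sk):
    """Map a term-space query vector q into the k-dim LSI space:
    q_k = Sigma_k^{-1} U_k^T q.  The result is a dense k-vector."""
    return (Uk.T @ q) / sk

def rank_documents(q_k, doc_vectors):
    """Rank documents by cosine similarity to the mapped query."""
    sims = doc_vectors @ q_k / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims)
```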

Page 30:

Performing the maps

• A^T A gives the dot products of pairs of documents:

$A^T A \approx A_k^T A_k = (U_k \Sigma_k V_k^T)^T (U_k \Sigma_k V_k^T) = V_k \Sigma_k U_k^T U_k \Sigma_k V_k^T = (V_k \Sigma_k)(V_k \Sigma_k)^T$

• Since $V_k = A_k^T U_k \Sigma_k^{-1}$, we should transform query q to q_k as follows:

$q_k = q^T U_k \Sigma_k^{-1}$

(Sec. 18.4)
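The identity in the middle step is easy to verify numerically (sketch):

```python
import numpy as np

A = np.random.rand(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

A_k = Uk @ Sk @ Vk.T
lhs = A_k.T @ A_k              # document-document inner products under A_k
rhs = (Vk @ Sk) @ (Vk @ Sk).T  # the same values from V_k Sigma_k alone
print(np.allclose(lhs, rhs))   # True, since U_k^T U_k = I
```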

Page 31:

Empirical evidence

• Experiments on TREC 1/2/3 – Dumais

• Lanczos SVD code (available on netlib), due to Berry, was used in these experiments

– Running times of ~ one day on tens of thousands of docs [still an obstacle to use]

• Dimensions – various values 250-350 reported. Reducing k improves recall.

– (Under 200 reported unsatisfactory)

• Generally expect recall to improve – what about precision?


Pages 32–39: (result figures, not captured in the transcript)

Page 40:

Empirical evidence: Conclusion

• Precision at or above median TREC precision

– Top scorer on almost 20% of TREC topics

• Slightly better on average than straight vector spaces

• Effect of dimensionality:

Dimensions   Precision
250          0.367
300          0.371
346          0.374


Page 41:

Failure modes

• Negated phrases

– TREC topics sometimes negate certain query terms/phrases; automatic conversion of topics to queries loses the negation.

• Boolean queries

– As usual, the freetext/vector space syntax of LSI queries precludes (say) "Find any doc having to do with the following 5 companies".

• See Berry, Dumais for more (resources slide).


Page 42:

LSI has many other applications

• In many settings in pattern recognition and retrieval, we have a feature-object matrix.

– For text, the terms are features and the docs are objects. (Could also be opinions & users, as in recommender systems.)

– This matrix may be redundant in dimensionality.

– Can work with low-rank approximation.

– If entries are missing (e.g., users’ opinions), can recover if dimensionality is low.


Page 43:

Resources

• http://www.cs.utk.edu/~berry/lsi++/

• http://lsi.argreenhouse.com/lsi/LSIpapers.html

• Dumais (1993). LSI meets TREC: A status report.

• Dumais (1994). Latent Semantic Indexing (LSI) and TREC-2.

• Dumais (1995). Using LSI for information filtering: TREC-3 experiments.

• M. Berry, S. Dumais and G. O'Brien (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573–595.


Page 44:

Probabilistic View: Topic Language Models

(Figure: an information need gives rise to a query Q; each document d_1, d_2, …, d_n in the collection induces a document language model M_{d_1}, …, M_{d_n}; together with topic models M_{T_1}, …, M_{T_m} and a collection model M_C, retrieval scores the generation probabilities $P(Q \mid M_C, M_T, M_d)$ and $P(Q \mid M_C, M_T)$.)
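One plausible reading of this diagram as code: score a query by a mixture of document, topic, and collection language models. The mixture weights and function shape are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

def query_log_likelihood(query_terms, p_w_d, p_w_t, p_w_c,
                         lam_d=0.6, lam_t=0.3, lam_c=0.1):
    """log P(Q | M_C, M_T, M_d) under a term-independence assumption:
    each query word is drawn from a mixture of the document model M_d,
    a topic model M_T, and the collection model M_C.
    p_w_* map word -> probability; the lam_* weights are assumed values."""
    score = 0.0
    for w in query_terms:
        p = (lam_d * p_w_d.get(w, 0.0)
             + lam_t * p_w_t.get(w, 0.0)
             + lam_c * p_w_c.get(w, 0.0))
        score += np.log(p) if p > 0 else float("-inf")
    return score
```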

Page 45:

Latent Aspects: Example


Page 46:

(Latent aspects example, continued; figure not captured in the transcript.)

Page 47:

(probabilistic) LSI: pLSI


Page 48:

Aspect Model

• Generation process:
– Choose a doc d with prob P(d); there are N d's
– Choose a latent class z with (generated) prob P(z|d); there are K z's, and K << N
– Generate a word w with (generated) prob P(w|z)
– This creates the pair (d, w), without direct concern for z

• Joining the probabilities:

$P(d, w) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z)$

Remember: P(z|d) means "probability of z, given d". K is chosen in advance (how many topics are in the collection?).
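The joint probability in a small numpy sketch (random toy parameters; note it sums to 1 over all (d, w) pairs):

```python
import numpy as np

N, K, V = 4, 2, 5                                # docs, topics (K << N), words
rng = np.random.default_rng(0)
P_d = np.full(N, 1.0 / N)                        # P(d)
P_z_given_d = rng.dirichlet(np.ones(K), size=N)  # P(z|d), one row per doc
P_w_given_z = rng.dirichlet(np.ones(V), size=K)  # P(w|z), one row per topic

# P(d, w) = P(d) * sum_z P(z|d) P(w|z):
P_dw = P_d[:, None] * (P_z_given_d @ P_w_given_z)
print(P_dw.sum())  # 1.0: a proper joint distribution over (d, w)
```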

Page 49:

Aspect Model (2)

• Log-likelihood:

$L = \sum_{d} \sum_{w} n(d, w) \log P(d, w)$

– Maximize this to find P(d), P(z|d), P(w|z)

• Apply Bayes' theorem: end up with the symmetric parameterization

$P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)$

• What is modeled?
– Doc-specific word distributions, P(w|d), are based on a combination of specific classes/factors/aspects, P(w|z)
– Not just assigned to the nearest cluster

Page 50:

pLSI Learning


Page 51:

pLSI Generative Model


Page 52:

Approach: Expectation Maximization (EM)

• EM is a popular technique for maximum likelihood estimation

• Alternates between:

– E-step: calculate posterior probabilities of z based on the current parameter estimates

– M-step: update the parameter estimates based on the calculated probabilities
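A compact sketch of EM for the aspect model (dense numpy arrays, no optimization; real implementations exploit sparsity):

```python
import numpy as np

def plsi_em(n_dw, K, iters=100, seed=0):
    """EM for pLSI.  n_dw: (N, V) matrix of word counts n(d, w).
    Returns P(d), P(z|d), P(w|z)."""
    N, V = n_dw.shape
    rng = np.random.default_rng(seed)
    P_z_d = rng.dirichlet(np.ones(K), size=N)   # P(z|d)
    P_w_z = rng.dirichlet(np.ones(V), size=K)   # P(w|z)
    P_d = n_dw.sum(axis=1) / n_dw.sum()         # P(d) from the counts
    for _ in range(iters):
        # E-step: posterior P(z|d,w) proportional to P(z|d) P(w|z)
        post = P_z_d[:, None, :] * P_w_z.T[None, :, :]   # shape (N, V, K)
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w)
        weighted = n_dw[:, :, None] * post
        P_w_z = weighted.sum(axis=0).T
        P_w_z /= P_w_z.sum(axis=1, keepdims=True) + 1e-12
        P_z_d = weighted.sum(axis=1)
        P_z_d /= P_z_d.sum(axis=1, keepdims=True) + 1e-12
    return P_d, P_z_d, P_w_z
```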

Page 53:

Simple example

(Figure: 1-D data points x scattered along an axis from −4 to 5.)

OBJECTIVE: Fit a mixture-of-Gaussians model with C = 2 components.

Model: $P(x \mid \theta) = \sum_{c=1}^{C} \pi_c\, \mathcal{N}(x;\, \mu_c, \sigma_c^2)$

Parameters: $\theta = (\pi_c, \mu_c, \sigma_c)$; keep the mixing weights and variances fixed, i.e. only estimate the means $\mu_c$.

Page 54:

Likelihood function

• The likelihood is a function of the parameters θ, with the data fixed.

• A probability is a function of the random variable x, with θ fixed: different from the last plot.

Page 55:

Probabilistic model

• Imagine the model generating the data.

• Need to introduce a label, z, for each data point (which component c generated it).

• The label is called a latent variable; also called hidden, unobserved, or missing.

• This simplifies the problem: if we knew the labels, we could decouple the components and estimate the parameters separately for each one.

Page 56:

Intuition of EM

• E-step: Compute a distribution on the labels of the points, using the current parameters.

• M-step: Update the parameters using the current guess of the label distribution.

(Figure: alternating E and M steps progressively improve the fit.)
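The 1-D mixture example above as code; a sketch with fixed, equal variances and fixed 0.5/0.5 weights, so only the two means are estimated:

```python
import numpy as np

def em_two_gaussians(x, iters=50, sigma=1.0):
    """EM for a 2-component 1-D Gaussian mixture, estimating means only."""
    mu = np.array([x.min(), x.max()])  # crude initialization
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        # (equal weights and variances cancel in the normalization)
        dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: means become responsibility-weighted averages
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

x = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 1, 200)])
print(em_two_gaussians(x))  # approximately [-2, 3]
```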

Page 57:

EM for pLSI


Page 58:

pLSA for IR: T. Hofmann 2000

• MED – 1033 docs

• CRAN – 1400 docs

• CACM – 3204 docs

• CISI – 1460 docs

• Best results reported with K varying over 32, 48, 64, 80, 128

• pLSA* model takes the average across all models at different K values


Page 59:

Example of topics found from a collection of Science Magazine papers


Page 60:

Using Aspects for Query Expansion


Page 61:

Relevance Results

• Cosine similarity is the baseline.

• In LSI, the query vector q is projected into the reduced space before matching.

• In pLSI, documents and queries are represented by P(z|d) and P(z|q); in the EM iterations, only P(z|q) is adapted ("folding-in").
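A sketch of that folding-in step for a query, reusing P(w|z) from the plsi_em sketch (held fixed while only P(z|q) is re-estimated):

```python
import numpy as np

def fold_in_query(q_counts, P_w_z, iters=30, seed=0):
    """Estimate P(z|q) for an unseen query q by EM, keeping P(w|z) fixed.
    q_counts: length-V vector of query term counts."""
    K = P_w_z.shape[0]
    rng = np.random.default_rng(seed)
    P_z_q = rng.dirichlet(np.ones(K))
    for _ in range(iters):
        # E-step: P(z|q,w) proportional to P(z|q) P(w|z), shape (V, K)
        post = P_z_q[None, :] * P_w_z.T
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: only the query's topic mixture is updated
        P_z_q = (q_counts[:, None] * post).sum(axis=0)
        P_z_q /= P_z_q.sum()
    return P_z_q  # match against each document's P(z|d) to rank documents
```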

Page 62:

Precision-Recall results (4/4)


Page 63:


Experiment: pLSI w/ 128-factor decomposition


Page 64:

Extension: Document Priors

• Model the document “prior”

• LDA: an extension of pLSI that better models the document generation process [David Blei]

• Video: https://www.youtube.com/watch?v=7BMsuyBPx90

• Lecture slides: http://www.cs.columbia.edu/~blei/talks/Blei_MLSS_2012.pdf
