TRANSCRIPT
2010 © University of Michigan
Latent Semantic Indexing
SI650: Information Retrieval
Winter 2010
School of Information
University of Michigan
Problems with lexical semantics
• Polysemy – bar, bank, jaguar, hot – tends to reduce precision
• Synonymy – building/edifice, large/big, spicy/hot – tends to reduce recall
• Relatedness – doctor/patient/nurse/treatment
• Sparse matrix
• Need: dimensionality reduction
Problem in Retrieval
Query = “information retrieval”
Document 1 = “inverted index precision recall”
Document 2 = “welcome to ann arbor”
• Which one should we rank higher?
• Query vocabulary & document vocabulary mismatch!
• Smoothing won’t help here…
• If only we could represent documents/queries by topics!
Latent Semantic Indexing
• Motivation
  – Query vocabulary & doc vocabulary mismatch
  – Need to match/index based on concepts (or topics)
• Main idea:
  – Projects queries and documents into a space with “latent” semantic dimensions
  – Dimensionality reduction: the latent semantic space has fewer dimensions (semantic concepts)
  – Exploits co-occurrence: co-occurring terms are projected onto the same dimensions
Concept Space = Dimension Reduction
• The number of concepts (K) is always smaller than the number of words (N) or the number of documents (M).
• Suppose we represent a document as an N-dimensional vector, and the corpus as an M*N matrix…
• The goal is to reduce the dimension from N to K.
• But how can we do that?
Techniques for dimensionality reduction
• Based on matrix decomposition (goal: preserve clusters, explain away variance)
• A quick review of matrices
  – Vectors
  – Matrices
  – Matrix multiplication
[The slide shows a small worked matrix multiplication example.]
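The multiplication itself can be sketched in plain Python; `mat_mul` below is an illustrative helper (not from the slides), applied to made-up 2x2 matrices rather than the slide's example:

```python
# Minimal matrix multiplication sketch: C[i][j] = sum over t of A[i][t] * B[t][j].
# Values are illustrative only.

def mat_mul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = mat_mul(A, B)  # [[19, 22], [43, 50]]
```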
Eigenvectors and eigenvalues
• An eigenvector is an implicit “direction” for a matrix:

    A v = λ v

  where v (eigenvector) is non-zero, though λ (eigenvalue) can be any complex number in principle
• Computing eigenvalues (det = determinant):

    det(A − λI) = 0

  if A is square (N x N), this has r distinct solutions, where 1 <= r <= N
• For each λ found, you can find v by solving (A − λI) v = 0, or A v = λ v
Eigenvectors and eigenvalues
• Example:

    A = [ -1  3 ]
        [  2  0 ]

• det(A − λI) = (−1 − λ)(−λ) − 3·2 = 0
• Then: λ^2 + λ − 6 = 0; λ1 = 2; λ2 = −3
• For λ1 = 2:

    A − 2I = [ -3   3 ]      (A − 2I) [ x1 ] = [ 0 ]
             [  2  -2 ]               [ x2 ]   [ 0 ]

• Solutions: x1 = x2
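The computation above can be checked in a few lines of plain Python, solving the characteristic equation with the quadratic formula and verifying A v = λ v for one eigenvector:

```python
import math

# Solve the slide's characteristic equation lambda^2 + lambda - 6 = 0
# for A = [[-1, 3], [2, 0]], then check A v = lambda v.
a, b, c = 1, 1, -6
disc = math.sqrt(b * b - 4 * a * c)
l1 = (-b + disc) / (2 * a)        # 2.0
l2 = (-b - disc) / (2 * a)        # -3.0

A = [[-1, 3], [2, 0]]
v = [1, 1]                        # an eigenvector for lambda = 2 (x1 = x2)
Av = [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]   # [2, 2]
```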
Eigenvectors and eigenvalues
• Wait, that means there are many eigenvectors for the same eigenvalue…
• v = (x1, x2)^T with x1 = x2 corresponds to many vectors, e.g., (1, 1)^T, (2, 2)^T, (650, 650)^T…
• Not surprising:
• if v is an eigenvector of A (A v = λ v), then v’ = cv is also an eigenvector for any non-zero constant c, since A(cv) = λ(cv)
Matrix Decomposition
• If A is a square (N x N) matrix and it has N linearly independent eigenvectors, it can be decomposed as A = UΛU^-1
  where U: matrix of eigenvectors (one per column); Λ: diagonal matrix of eigenvalues
• AU = UΛ
• U^-1 A U = Λ
• A = UΛU^-1
Example
    A = [ 2  1 ]      eigenvalues: λ1 = 1, λ2 = 3
        [ 1  2 ]

    U = [ 1  1 ]      U^-1 = [ 1/2  -1/2 ]
        [-1  1 ]             [ 1/2   1/2 ]

    A = U Λ U^-1 = [ 1  1 ] [ 1  0 ] [ 1/2  -1/2 ]
                   [-1  1 ] [ 0  3 ] [ 1/2   1/2 ]
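A quick pure-Python check of this decomposition, with one consistent choice of eigenvectors (columns (1, −1) for λ = 1 and (1, 1) for λ = 3):

```python
# Verify A = U Lambda U^-1 for A = [[2, 1], [1, 2]] with eigenvalues 1 and 3.

def mat_mul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

U     = [[1.0, 1.0], [-1.0, 1.0]]    # eigenvector columns: (1, -1) and (1, 1)
L     = [[1.0, 0.0], [0.0, 3.0]]     # eigenvalues on the diagonal
U_inv = [[0.5, -0.5], [0.5, 0.5]]

A = mat_mul(mat_mul(U, L), U_inv)    # [[2.0, 1.0], [1.0, 2.0]]
```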
Example
    A = [ 3  0  0 ]      Eigenvalues are 3, 2, 0
        [ 0  2  0 ]      v1 = (1, 0, 0)^T
        [ 0  0  0 ]      v2 = (0, 1, 0)^T
                         v3 = (0, 0, 1)^T

    x = (2, 4, 6)^T = 2 v1 + 4 v2 + 6 v3

    A x = A (2 v1 + 4 v2 + 6 v3)
        = 2 A v1 + 4 A v2 + 6 A v3
        = 2 λ1 v1 + 4 λ2 v2 + 6 λ3 v3

x is an arbitrary vector, yet A x depends only on the eigenvalues and eigenvectors
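This can be confirmed numerically: applying A directly and expanding in the eigenbasis give the same result.

```python
# With A = diag(3, 2, 0) and eigenvectors v1, v2, v3 the standard basis,
# x = 2 v1 + 4 v2 + 6 v3 maps to A x = 2*3 v1 + 4*2 v2 + 6*0 v3.
A = [[3, 0, 0],
     [0, 2, 0],
     [0, 0, 0]]
x = [2, 4, 6]
Ax = [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]   # [6, 8, 0]

eigenvalues = [3, 2, 0]
expansion = [eigenvalues[i] * x[i] for i in range(3)]            # also [6, 8, 0]
```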
What about an arbitrary matrix?
• A: n x m matrix (n documents, m terms)
• A = USV^T (as opposed to A = UΛU^-1)
• U: n x n matrix
• V: m x m matrix
• S: n x m diagonal matrix; only values on the diagonal can be non-zero
• U U^T = I; V V^T = I
SVD: Singular Value Decomposition
• A = USV^T
• U is the matrix of orthogonal eigenvectors of AA^T
• V is the matrix of orthogonal eigenvectors of A^T A
• The diagonal components of S (the singular values) are the square roots of the eigenvalues of A^T A
• This decomposition exists for all matrices, dense or sparse
• If A has 5 columns and 3 rows, then U will be 3x3 and V will be 5x5
• In Matlab, use [U,S,V] = svd(A)
Term matrix normalization
Each document (column) of the term-document matrix A is normalized to unit Euclidean length, giving A(n). For example, a column with two non-zero entries of 1 becomes a column with entries 0.71 (= 1/sqrt(2)), and a column with three becomes 0.58 (= 1/sqrt(3)).

[The slide shows the matrices A and A(n) side by side, columns D1–D5.]
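The normalization step can be sketched in plain Python (`normalize_columns` is an illustrative helper, not from the slides):

```python
import math

# Scale each document (column) of a term-document matrix to unit
# Euclidean length, as in A -> A(n).

def normalize_columns(A):
    rows, cols = len(A), len(A[0])
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        length = math.sqrt(sum(A[i][j] ** 2 for i in range(rows)))
        if length > 0:
            for i in range(rows):
                out[i][j] = A[i][j] / length
    return out

# A column with two 1s maps to entries ~0.71 (= 1/sqrt(2)); three 1s -> ~0.58.
A = [[1, 1],
     [1, 1],
     [0, 1]]
An = normalize_columns(A)
```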
Example (Berry and Browne)
• T1: baby
• T2: child
• T3: guide
• T4: health
• T5: home
• T6: infant
• T7: proofing
• T8: safety
• T9: toddler

• D1: infant & toddler first aid
• D2: babies & children’s room (for your home)
• D3: child safety at home
• D4: your baby’s health and safety: from infant to toddler
• D5: baby proofing basics
• D6: your guide to easy rust proofing
• D7: beanie babies collector’s guide
Document term matrix
A (rows: terms T1–T9; columns: documents D1–D7):

              D1  D2  D3  D4  D5  D6  D7
    baby       0   1   0   1   1   0   1
    child      0   1   1   0   0   0   0
    guide      0   0   0   0   0   1   1
    health     0   0   0   1   0   0   0
    home       0   1   1   0   0   0   0
    infant     1   0   0   1   0   0   0
    proofing   0   0   0   0   1   1   0
    safety     0   0   1   1   0   0   0
    toddler    1   0   0   1   0   0   0

A(n) (each column normalized to unit length):

              D1    D2    D3    D4    D5    D6    D7
    baby      0     0.58  0     0.45  0.71  0     0.71
    child     0     0.58  0.58  0     0     0     0
    guide     0     0     0     0     0     0.71  0.71
    health    0     0     0     0.45  0     0     0
    home      0     0.58  0.58  0     0     0     0
    infant    0.71  0     0     0.45  0     0     0
    proofing  0     0     0     0     0.71  0.71  0
    safety    0     0     0.58  0.45  0     0     0
    toddler   0.71  0     0     0.45  0     0     0
Decomposition
u =

  -0.6976  -0.0945   0.0174  -0.6950   0.0000   0.0153   0.1442  -0.0000        0
  -0.2622   0.2946   0.4693   0.1968  -0.0000  -0.2467  -0.1571  -0.6356   0.3098
  -0.3519  -0.4495  -0.1026   0.4014   0.7071  -0.0065  -0.0493  -0.0000   0.0000
  -0.1127   0.1416  -0.1478  -0.0734   0.0000   0.4842  -0.8400   0.0000  -0.0000
  -0.2622   0.2946   0.4693   0.1968   0.0000  -0.2467  -0.1571   0.6356  -0.3098
  -0.1883   0.3756  -0.5035   0.1273  -0.0000  -0.2293   0.0339  -0.3098  -0.6356
  -0.3519  -0.4495  -0.1026   0.4014  -0.7071  -0.0065  -0.0493   0.0000  -0.0000
  -0.2112   0.3334   0.0962   0.2819  -0.0000   0.7338   0.4659  -0.0000   0.0000
  -0.1883   0.3756  -0.5035   0.1273  -0.0000  -0.2293   0.0339   0.3098   0.6356

v =

  -0.1687   0.4192  -0.5986   0.2261        0  -0.5720   0.2433
  -0.4472   0.2255   0.4641  -0.2187   0.0000  -0.4871  -0.4987
  -0.2692   0.4206   0.5024   0.4900  -0.0000   0.2450   0.4451
  -0.3970   0.4003  -0.3923  -0.1305        0   0.6124  -0.3690
  -0.4702  -0.3037  -0.0507  -0.2607  -0.7071   0.0110   0.3407
  -0.3153  -0.5018  -0.1220   0.7128  -0.0000  -0.0162  -0.3544
  -0.4702  -0.3037  -0.0507  -0.2607   0.7071   0.0110   0.3407
Decomposition
S =

  1.5849  0       0       0       0       0       0
  0       1.2721  0       0       0       0       0
  0       0       1.1946  0       0       0       0
  0       0       0       0.7996  0       0       0
  0       0       0       0       0.7100  0       0
  0       0       0       0       0       0.5692  0
  0       0       0       0       0       0       0.1977
  0       0       0       0       0       0       0
  0       0       0       0       0       0       0

[Figure: documents spread along the v1 axis]
What does this have to do with dimension reduction?
• Low-rank matrix approximation
• SVD: A[m*n] = U[m*m] S[m*n] V^T[n*n]
• Remember that S is a diagonal matrix of singular values
• If we only keep the largest r singular values…
• A ≈ U[m*r] S[r*r] (V[n*r])^T
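Truncation can be written as a sum of outer products; a minimal sketch on a toy diagonal matrix whose SVD is trivial (the helper and values are illustrative, not the deck's example):

```python
# Given an SVD A = U S V^T, keeping only the r largest singular values gives
# the best rank-r approximation: A_r = sum over i < r of s_i * u_i * v_i^T.
# Toy case: A = diag(3, 2), whose singular vectors are the standard basis.

def rank_r_approx(U_cols, s, V_cols, r):
    n, m = len(U_cols[0]), len(V_cols[0])
    A = [[0.0] * m for _ in range(n)]
    for i in range(r):                         # keep only r outer products
        for a in range(n):
            for b in range(m):
                A[a][b] += s[i] * U_cols[i][a] * V_cols[i][b]
    return A

U_cols = [[1.0, 0.0], [0.0, 1.0]]              # left singular vectors u_i
s      = [3.0, 2.0]                            # singular values, descending
V_cols = [[1.0, 0.0], [0.0, 1.0]]              # right singular vectors v_i

A1 = rank_r_approx(U_cols, s, V_cols, 1)       # [[3.0, 0.0], [0.0, 0.0]]
A2 = rank_r_approx(U_cols, s, V_cols, 2)       # [[3.0, 0.0], [0.0, 2.0]]
```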
Rank-4 approximation
s4 =

  1.5849  0       0       0       0  0  0
  0       1.2721  0       0       0  0  0
  0       0       1.1946  0       0  0  0
  0       0       0       0.7996  0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
Rank-4 approximation
u*s4*v' =

  -0.0019   0.5985  -0.0148   0.4552   0.7002   0.0102   0.7002
  -0.0728   0.4961   0.6282   0.0745   0.0121  -0.0133   0.0121
   0.0003  -0.0067   0.0052  -0.0013   0.3584   0.7065   0.3584
   0.1980   0.0514   0.0064   0.2199   0.0535  -0.0544   0.0535
  -0.0728   0.4961   0.6282   0.0745   0.0121  -0.0133   0.0121
   0.6337  -0.0602   0.0290   0.5324  -0.0008   0.0003  -0.0008
   0.0003  -0.0067   0.0052  -0.0013   0.3584   0.7065   0.3584
   0.2165   0.2494   0.4367   0.2282  -0.0360   0.0394  -0.0360
   0.6337  -0.0602   0.0290   0.5324  -0.0008   0.0003  -0.0008
Rank-4 approximation
u*s4: word vector representation of the concepts/topics
  -1.1056  -0.1203   0.0207  -0.5558   0  0  0
  -0.4155   0.3748   0.5606   0.1573   0  0  0
  -0.5576  -0.5719  -0.1226   0.3210   0  0  0
  -0.1786   0.1801  -0.1765  -0.0587   0  0  0
  -0.4155   0.3748   0.5606   0.1573   0  0  0
  -0.2984   0.4778  -0.6015   0.1018   0  0  0
  -0.5576  -0.5719  -0.1226   0.3210   0  0  0
  -0.3348   0.4241   0.1149   0.2255   0  0  0
  -0.2984   0.4778  -0.6015   0.1018   0  0  0
Rank-4 approximation
s4*v': new (concept/topic) representation of documents
  -0.2674  -0.7087  -0.4266  -0.6292  -0.7451  -0.4996  -0.7451
   0.5333   0.2869   0.5351   0.5092  -0.3863  -0.6384  -0.3863
  -0.7150   0.5544   0.6001  -0.4686  -0.0605  -0.1457  -0.0605
   0.1808  -0.1749   0.3918  -0.1043  -0.2085   0.5700  -0.2085
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
Rank-2 approximation
s2 =

  1.5849  0       0  0  0  0  0
  0       1.2721  0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
Rank-2 approximation
u*s2*v'
   0.1361   0.4673   0.2470   0.3908   0.5563   0.4089   0.5563
   0.2272   0.2703   0.2695   0.3150   0.0815  -0.0571   0.0815
  -0.1457   0.1204  -0.0904  -0.0075   0.4358   0.4628   0.4358
   0.1057   0.1205   0.1239   0.1430   0.0293  -0.0341   0.0293
   0.2272   0.2703   0.2695   0.3150   0.0815  -0.0571   0.0815
   0.2507   0.2412   0.2813   0.3097  -0.0048  -0.1457  -0.0048
  -0.1457   0.1204  -0.0904  -0.0075   0.4358   0.4628   0.4358
   0.2343   0.2454   0.2685   0.3027   0.0286  -0.1073   0.0286
   0.2507   0.2412   0.2813   0.3097  -0.0048  -0.1457  -0.0048
Rank-2 approximation
u*s2: word vector representation of the concepts/topics
  -1.1056  -0.1203   0  0  0  0  0
  -0.4155   0.3748   0  0  0  0  0
  -0.5576  -0.5719   0  0  0  0  0
  -0.1786   0.1801   0  0  0  0  0
  -0.4155   0.3748   0  0  0  0  0
  -0.2984   0.4778   0  0  0  0  0
  -0.5576  -0.5719   0  0  0  0  0
  -0.3348   0.4241   0  0  0  0  0
  -0.2984   0.4778   0  0  0  0  0
Rank-2 approximation
s2*v': new (concept/topic) representation of documents
  -0.2674  -0.7087  -0.4266  -0.6292  -0.7451  -0.4996  -0.7451
   0.5333   0.2869   0.5351   0.5092  -0.3863  -0.6384  -0.3863
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
Latent Semantic Indexing
A[n x m] ≈ U[n x r] Λ[r x r] (V[m x r])^T

• A: n x m matrix (n documents, m terms)
• U: n x r matrix (n documents, r concepts)
• Λ: r x r diagonal matrix (strength of each ‘concept’; r: rank of the matrix)
• V: m x r matrix (m terms, r concepts)
Latent semantic indexing (LSI)
• Dimensionality reduction = identification of hidden (latent) concepts
• Query matching in latent space
• LSI matches documents even if they don’t have words in common,
  – if they share frequently co-occurring terms
Example of LSI
             data  inf  retrieval  brain  lung
        CS  [ 1     1    1          0      0 ]
        CS  [ 2     2    2          0      0 ]
        CS  [ 1     1    1          0      0 ]
    A = CS  [ 5     5    5          0      0 ]
        MD  [ 0     0    0          2      2 ]
        MD  [ 0     0    0          3      3 ]
        MD  [ 0     0    0          1      1 ]

A = U Λ V^T:

        [ 0.18  0    ]
        [ 0.36  0    ]
        [ 0.18  0    ]   [ 9.64  0    ]   [ 0.58  0.58  0.58  0     0    ]
    A = [ 0.90  0    ] x [ 0     5.29 ] x [ 0     0     0     0.71  0.71 ]
        [ 0     0.53 ]
        [ 0     0.80 ]
        [ 0     0.27 ]

Columns of U: the CS-concept and the MD-concept; 9.64 = strength of the CS-concept.
Rows of V^T: term representation of each concept.
Dim. reduction: 2 concepts instead of 5 terms.

(Slide adapted from C. Faloutsos’s talk)
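The strength of the strongest concept can be recovered from this matrix without a full SVD, e.g. by power iteration on A^T A (a pure-Python sketch; `mat_vec` is an illustrative helper):

```python
import math

# Power iteration on A^T A recovers the largest singular value of the
# slide's 7x5 document-term matrix; the slide reports ~9.64 (CS-concept).
A = [[1, 1, 1, 0, 0],
     [2, 2, 2, 0, 0],
     [1, 1, 1, 0, 0],
     [5, 5, 5, 0, 0],
     [0, 0, 0, 2, 2],
     [0, 0, 0, 3, 3],
     [0, 0, 0, 1, 1]]

def mat_vec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

# A^T (terms x documents), so A^T A acts on the 5-dim term space
AT = [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]

v = [1.0] * 5
for _ in range(50):                          # v -> top right singular vector
    w = mat_vec(AT, mat_vec(A, v))           # w = A^T A v
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# sigma_1^2 = v^T (A^T A) v for the converged unit vector v
w = mat_vec(AT, mat_vec(A, v))
sigma1 = math.sqrt(sum(vi * wi for vi, wi in zip(v, w)))   # ~ 9.64
```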
How to Map Query/Doc to the Same Concept Space?
q_concept^T = q^T V
d_concept^T = d^T V

Example (terms: data, inf., retrieval, brain, lung):

                                 [ 0.58  0    ]
                                 [ 0.58  0    ]
    q^T V = [ 1  0  0  0  0 ]  x [ 0.58  0    ]  =  [ 0.58  0 ]
                                 [ 0     0.71 ]
                                 [ 0     0.71 ]

0.58 = similarity with the CS-concept.

    d^T = [ 0  1  1  0  0 ]  gives  d^T V = [ 1.16  0 ]

(Slide adapted from C. Faloutsos’s talk)
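The same folding, written out in plain Python (the helper name is illustrative):

```python
# Fold a query and a document into the 2-D concept space via x_concept^T = x^T V.
# Term order: data, inf., retrieval, brain, lung; concept order: CS, MD.
V = [[0.58, 0.00],
     [0.58, 0.00],
     [0.58, 0.00],
     [0.00, 0.71],
     [0.00, 0.71]]

def to_concept_space(x):
    return [sum(x[i] * V[i][k] for i in range(len(x))) for k in range(2)]

q = [1, 0, 0, 0, 0]           # query: "data"
d = [0, 1, 1, 0, 0]           # document: "inf. retrieval"
qc = to_concept_space(q)      # ~ [0.58, 0.0]
dc = to_concept_space(d)      # ~ [1.16, 0.0]
# q and d share no terms, yet both load only on the CS-concept dimension.
```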
Useful pointers
• http://lsa.colorado.edu
• http://lsi.research.telcordia.com
• http://www.cs.utk.edu/~lsi
Problem of LSI
• Concepts/topics are hard to interpret
• New document/query vectors could have negative values
• Lack of statistical interpretation
• Probabilistic latent semantic indexing…
General Idea of Probabilistic Topic Models
• Modeling a topic/subtopic/theme with a multinomial distribution (unigram LM)
• Modeling text data with a mixture model involving multinomial distributions
  – A document is “generated” by sampling words from some multinomial distribution
  – Each time, a word may be generated from a different distribution
  – Many variations of how these multinomial distributions are mixed
• Topic mining = fitting the probabilistic model to text
• Answer topic-related questions by computing various kinds of conditional probabilities based on the estimated model (e.g., p(time | topic), p(time | topic, location))
Document as a Sample of Mixed Topics
• Applications of topic models:
  – Summarize themes/aspects
  – Facilitate navigation/browsing
  – Retrieve documents
  – Segment documents
  – Many others
• How can we discover these topic word distributions?
Topic 1 (government 0.3, response 0.2, …)
Topic 2 (donate 0.1, relief 0.05, help 0.02, …)
…
Topic k (city 0.2, new 0.1, orleans 0.05, …)
Background B (is 0.05, the 0.04, a 0.03, …)

Example document, with bracketed spans drawn from different topics:
[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …
Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
• Mix k multinomial distributions to generate a document
• Each document has a potentially different set of mixing weights which captures the topic coverage
• When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model)
• We may add a background distribution to “attract” background words
PLSI (a.k.a. Aspect Model)
• Every document is a mixture of underlying (latent) K aspects (topics) with mixture weights p(z|d)
  – How is this related to LSI?
• Each aspect is represented by a distribution of words p(w|z)
• Estimate p(z|d) and p(w|z) using the EM algorithm

    p(d, w) = p(d) p(w|d)
    p(w|d) = Σ_{z∈Z} p(z|d) p(w|z)
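The mixture p(w|d) = Σ_z p(z|d) p(w|z) is easy to sketch with toy numbers (all values below are hypothetical, chosen so each topic is a proper multinomial):

```python
# Toy aspect model: p(w|d) = sum over z of p(z|d) * p(w|z), with two topics.
p_z_given_d = [0.7, 0.3]                       # mixing weights for one document
p_w_given_z = [
    {"data": 0.5, "retrieval": 0.5},           # topic z1
    {"brain": 0.6, "lung": 0.4},               # topic z2
]
vocab = ["data", "retrieval", "brain", "lung"]

def p_w_d(w):
    return sum(p_z_given_d[z] * p_w_given_z[z].get(w, 0.0) for z in range(2))

total = sum(p_w_d(w) for w in vocab)           # a proper distribution: sums to 1
```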
PLSI as a Mixture Model
Topic z1 (warning 0.3, system 0.2, …)
Topic z2 (aid 0.1, donation 0.05, support 0.02, …)
…
Topic zk (statistics 0.2, loss 0.1, dead 0.05, …)
Background B (is 0.05, the 0.04, a 0.03, …)

“Generating” word w in doc d in the collection:
  – with probability λ_B, draw w from the background distribution B
  – with probability 1 − λ_B, pick topic z_i with probability p(z_i|d) and draw w from p(w|z_i)

Parameters: λ_B = noise level (manually set); p(z|d) and p(w|z) are estimated with maximum likelihood

    p(d, w) = p(d) p(w|d)
    p(w|d) = Σ_{z∈Z} p(z|d) p(w|z)
Parameter Estimation using EM Algorithm
• We have the log-likelihood function from the PLSI model, which we want to maximize:

    L = Σ_{d∈D} Σ_{w∈W} c(w, d) log p(d, w)

• Maximize the likelihood using Expectation-Maximization
EM Steps
• E-Step
  – Expectation step: the expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
E Step
• p(z|d,w) is the probability that a word w occurring in a document d is explained by topic z:

    p(z|d, w) = p(z|d) p(w|z) / Σ_{z'∈Z} p(z'|d) p(w|z')
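For one (d, w) pair, the E-Step is just Bayes' rule over topics; a sketch with hypothetical current parameter values:

```python
# E-Step for one (d, w) pair with two topics: p(z|d,w) is proportional to
# p(z|d) * p(w|z), normalized over z. Parameter values are made up.
p_z_given_d = [0.6, 0.4]          # current p(z|d) for this document
p_w_given_z = [0.1, 0.3]          # current p(w|z) for this word

joint = [p_z_given_d[z] * p_w_given_z[z] for z in range(2)]   # [0.06, 0.12]
total = sum(joint)
posterior = [j / total for j in joint]                        # [1/3, 2/3]
```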
M Step
• All these equations use p(z|d,w) calculated in the E-Step:

    p(w|z) = Σ_{d∈D} c(w, d) p(z|d, w) / Σ_{w'∈W} Σ_{d∈D} c(w', d) p(z|d, w')

    p(z|d) = Σ_{w∈W} c(w, d) p(z|d, w) / Σ_{z'∈Z} Σ_{w∈W} c(w, d) p(z'|d, w)

• EM converges to a local maximum of the likelihood function
• We will see more when we talk about topic modeling
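Putting the E-Step and M-Step together gives a complete EM loop. Below is a minimal, self-contained sketch for the aspect model with two topics and no background distribution; the corpus, word counts c(w, d), and the `norm` helper are all made up for illustration:

```python
import math
import random

# Minimal PLSA-style EM on a toy corpus (illustrative only).
random.seed(0)

docs = ["d1", "d2"]
vocab = ["data", "retrieval", "brain", "lung"]
counts = {("d1", "data"): 4, ("d1", "retrieval"): 3,
          ("d2", "brain"): 5, ("d2", "lung"): 2}
K = 2

def norm(xs):
    s = sum(xs)
    return [x / s for x in xs]

# Random (normalized, strictly positive) initialization of p(z|d) and p(w|z)
p_z_d = {d: norm([random.random() + 0.1 for _ in range(K)]) for d in docs}
p_w_z = [dict(zip(vocab, norm([random.random() + 0.1 for _ in vocab])))
         for _ in range(K)]

def log_likelihood():
    return sum(c * math.log(sum(p_z_d[d][z] * p_w_z[z][w] for z in range(K)))
               for (d, w), c in counts.items())

prev = log_likelihood()
for _ in range(20):
    # E-Step: posterior p(z|d,w) for every observed (d, w) pair
    post = {dw: norm([p_z_d[dw[0]][z] * p_w_z[z][dw[1]] for z in range(K)])
            for dw in counts}
    # M-Step: re-estimate p(w|z) and p(z|d) from expected counts
    for z in range(K):
        for w in vocab:
            p_w_z[z][w] = sum(c * post[(d, w2)][z]
                              for (d, w2), c in counts.items() if w2 == w)
        total = sum(p_w_z[z].values())
        for w in vocab:
            p_w_z[z][w] /= total
    for d in docs:
        p_z_d[d] = norm([sum(c * post[(d2, w)][z]
                             for (d2, w), c in counts.items() if d2 == d)
                         for z in range(K)])
    cur = log_likelihood()
    assert cur >= prev - 1e-9        # EM never decreases the likelihood
    prev = cur
```

The assertion inside the loop is exactly the convergence guarantee stated above: each E/M pair can only increase (or leave unchanged) the log-likelihood, so the procedure climbs to a local maximum.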