TRANSCRIPT
2010 © University of Michigan
Latent Semantic Indexing
SI650: Information Retrieval
Winter 2010
School of Information
University of Michigan
Problems with lexical semantics
• Polysemy – bar, bank, jaguar, hot – tends to reduce precision
• Synonymy – building/edifice, large/big, spicy/hot – tends to reduce recall
• Relatedness – doctor/patient/nurse/treatment
• Sparse matrix
• Need: dimensionality reduction
Problem in Retrieval
Query = “information retrieval”
Document 1 = “inverted index precision recall”
Document 2 = “welcome to ann arbor”
• Which one should we rank higher?
• Query vocabulary & document vocabulary mismatch!
• Smoothing won’t help here…
• If only we could represent documents/queries by topics!
Latent Semantic Indexing
• Motivation
  – Query vocabulary & doc vocabulary mismatch
  – Need to match/index based on concepts (or topics)
• Main idea:
  – Projects queries and documents into a space with “latent” semantic dimensions
  – Dimensionality reduction: the latent semantic space has fewer dimensions (semantic concepts)
  – Exploits co-occurrence: co-occurring terms are projected onto the same dimensions
Concept Space = Dimension Reduction
• The number of concepts (K) is always smaller than the number of words (N) or the number of documents (M).
• Suppose we represent a document as an N-dimensional vector, and the corpus as an M*N matrix…
• The goal is to reduce the dimension from N to K.
• But how can we do that?
Techniques for dimensionality reduction
• Based on matrix decomposition (goal: preserve clusters, explain away variance)
• A quick review of matrices
  – Vectors
  – Matrices
  – Matrix multiplication
[The slide shows a small worked matrix multiplication example.]
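The multiplication itself can be sketched in plain Python; `mat_mul` below is an illustrative helper (not from the slides), applied to made-up 2x2 matrices rather than the slide's example:

```python
# Minimal matrix multiplication sketch: C[i][j] = sum over t of A[i][t] * B[t][j].
# Values are illustrative only.

def mat_mul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = mat_mul(A, B)  # [[19, 22], [43, 50]]
```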
Eigenvectors and eigenvalues
• An eigenvector is an implicit “direction” for a matrix:

    A v = λ v

  where v (eigenvector) is non-zero, though λ (eigenvalue) can be any complex number in principle
• Computing eigenvalues (det = determinant):

    det(A − λI) = 0

  if A is square (N x N), this has r distinct solutions, where 1 <= r <= N
• For each λ found, you can find v by solving (A − λI) v = 0, or A v = λ v
Eigenvectors and eigenvalues
• Example:

    A = [ -1  3 ]
        [  2  0 ]

• det(A − λI) = (−1 − λ)(−λ) − 3·2 = 0
• Then: λ^2 + λ − 6 = 0; λ1 = 2; λ2 = −3
• For λ1 = 2:

    A − 2I = [ -3   3 ]      (A − 2I) [ x1 ] = [ 0 ]
             [  2  -2 ]               [ x2 ]   [ 0 ]

• Solutions: x1 = x2
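The computation above can be checked in a few lines of plain Python, solving the characteristic equation with the quadratic formula and verifying A v = λ v for one eigenvector:

```python
import math

# Solve the slide's characteristic equation lambda^2 + lambda - 6 = 0
# for A = [[-1, 3], [2, 0]], then check A v = lambda v.
a, b, c = 1, 1, -6
disc = math.sqrt(b * b - 4 * a * c)
l1 = (-b + disc) / (2 * a)        # 2.0
l2 = (-b - disc) / (2 * a)        # -3.0

A = [[-1, 3], [2, 0]]
v = [1, 1]                        # an eigenvector for lambda = 2 (x1 = x2)
Av = [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]   # [2, 2]
```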
Eigenvectors and eigenvalues
• Wait, that means there are many eigenvectors for the same eigenvalue…
• v = (x1, x2)^T with x1 = x2 corresponds to many vectors, e.g., (1, 1)^T, (2, 2)^T, (650, 650)^T…
• Not surprising:
• if v is an eigenvector of A (A v = λ v), then v’ = cv is also an eigenvector for any non-zero constant c, since A(cv) = λ(cv)
Matrix Decomposition
• If A is a square (N x N) matrix and it has N linearly independent eigenvectors, it can be decomposed as A = UΛU^-1
  where U: matrix of eigenvectors (one per column); Λ: diagonal matrix of eigenvalues
• AU = UΛ
• U^-1 A U = Λ
• A = UΛU^-1
Example
    A = [ 2  1 ]      eigenvalues: λ1 = 1, λ2 = 3
        [ 1  2 ]

    U = [ 1  1 ]      U^-1 = [ 1/2  -1/2 ]
        [-1  1 ]             [ 1/2   1/2 ]

    A = U Λ U^-1 = [ 1  1 ] [ 1  0 ] [ 1/2  -1/2 ]
                   [-1  1 ] [ 0  3 ] [ 1/2   1/2 ]
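A quick pure-Python check of this decomposition, with one consistent choice of eigenvectors (columns (1, −1) for λ = 1 and (1, 1) for λ = 3):

```python
# Verify A = U Lambda U^-1 for A = [[2, 1], [1, 2]] with eigenvalues 1 and 3.

def mat_mul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

U     = [[1.0, 1.0], [-1.0, 1.0]]    # eigenvector columns: (1, -1) and (1, 1)
L     = [[1.0, 0.0], [0.0, 3.0]]     # eigenvalues on the diagonal
U_inv = [[0.5, -0.5], [0.5, 0.5]]

A = mat_mul(mat_mul(U, L), U_inv)    # [[2.0, 1.0], [1.0, 2.0]]
```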
Example
    A = [ 3  0  0 ]      Eigenvalues are 3, 2, 0
        [ 0  2  0 ]      v1 = (1, 0, 0)^T
        [ 0  0  0 ]      v2 = (0, 1, 0)^T
                         v3 = (0, 0, 1)^T

    x = (2, 4, 6)^T = 2 v1 + 4 v2 + 6 v3

    A x = A (2 v1 + 4 v2 + 6 v3)
        = 2 A v1 + 4 A v2 + 6 A v3
        = 2 λ1 v1 + 4 λ2 v2 + 6 λ3 v3

x is an arbitrary vector, yet A x depends only on the eigenvalues and eigenvectors
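This can be confirmed numerically: applying A directly and expanding in the eigenbasis give the same result.

```python
# With A = diag(3, 2, 0) and eigenvectors v1, v2, v3 the standard basis,
# x = 2 v1 + 4 v2 + 6 v3 maps to A x = 2*3 v1 + 4*2 v2 + 6*0 v3.
A = [[3, 0, 0],
     [0, 2, 0],
     [0, 0, 0]]
x = [2, 4, 6]
Ax = [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]   # [6, 8, 0]

eigenvalues = [3, 2, 0]
expansion = [eigenvalues[i] * x[i] for i in range(3)]            # also [6, 8, 0]
```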
What about an arbitrary matrix?
• A: n x m matrix (n documents, m terms)
• A = USV^T (as opposed to A = UΛU^-1)
• U: n x n matrix
• V: m x m matrix
• S: n x m diagonal matrix; only values on the diagonal can be non-zero
• U U^T = I; V V^T = I
SVD: Singular Value Decomposition
• A = USV^T
• U is the matrix of orthogonal eigenvectors of AA^T
• V is the matrix of orthogonal eigenvectors of A^T A
• The diagonal components of S (the singular values) are the square roots of the eigenvalues of A^T A
• This decomposition exists for all matrices, dense or sparse
• If A has 5 columns and 3 rows, then U will be 3x3 and V will be 5x5
• In Matlab, use [U,S,V] = svd(A)
Term matrix normalization
Each document (column) of the term-document matrix A is normalized to unit Euclidean length, giving A(n). For example, a column with two non-zero entries of 1 becomes a column with entries 0.71 (= 1/sqrt(2)), and a column with three becomes 0.58 (= 1/sqrt(3)).

[The slide shows the matrices A and A(n) side by side, columns D1–D5.]
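The normalization step can be sketched in plain Python (`normalize_columns` is an illustrative helper, not from the slides):

```python
import math

# Scale each document (column) of a term-document matrix to unit
# Euclidean length, as in A -> A(n).

def normalize_columns(A):
    rows, cols = len(A), len(A[0])
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        length = math.sqrt(sum(A[i][j] ** 2 for i in range(rows)))
        if length > 0:
            for i in range(rows):
                out[i][j] = A[i][j] / length
    return out

# A column with two 1s maps to entries ~0.71 (= 1/sqrt(2)); three 1s -> ~0.58.
A = [[1, 1],
     [1, 1],
     [0, 1]]
An = normalize_columns(A)
```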
Example (Berry and Browne)
• T1: baby
• T2: child
• T3: guide
• T4: health
• T5: home
• T6: infant
• T7: proofing
• T8: safety
• T9: toddler

• D1: infant & toddler first aid
• D2: babies & children’s room (for your home)
• D3: child safety at home
• D4: your baby’s health and safety: from infant to toddler
• D5: baby proofing basics
• D6: your guide to easy rust proofing
• D7: beanie babies collector’s guide
Document term matrix
A (rows: terms T1–T9; columns: documents D1–D7):

              D1  D2  D3  D4  D5  D6  D7
    baby       0   1   0   1   1   0   1
    child      0   1   1   0   0   0   0
    guide      0   0   0   0   0   1   1
    health     0   0   0   1   0   0   0
    home       0   1   1   0   0   0   0
    infant     1   0   0   1   0   0   0
    proofing   0   0   0   0   1   1   0
    safety     0   0   1   1   0   0   0
    toddler    1   0   0   1   0   0   0

A(n) (each column normalized to unit length):

              D1    D2    D3    D4    D5    D6    D7
    baby      0     0.58  0     0.45  0.71  0     0.71
    child     0     0.58  0.58  0     0     0     0
    guide     0     0     0     0     0     0.71  0.71
    health    0     0     0     0.45  0     0     0
    home      0     0.58  0.58  0     0     0     0
    infant    0.71  0     0     0.45  0     0     0
    proofing  0     0     0     0     0.71  0.71  0
    safety    0     0     0.58  0.45  0     0     0
    toddler   0.71  0     0     0.45  0     0     0
Decomposition
u =

  -0.6976  -0.0945   0.0174  -0.6950   0.0000   0.0153   0.1442  -0.0000        0
  -0.2622   0.2946   0.4693   0.1968  -0.0000  -0.2467  -0.1571  -0.6356   0.3098
  -0.3519  -0.4495  -0.1026   0.4014   0.7071  -0.0065  -0.0493  -0.0000   0.0000
  -0.1127   0.1416  -0.1478  -0.0734   0.0000   0.4842  -0.8400   0.0000  -0.0000
  -0.2622   0.2946   0.4693   0.1968   0.0000  -0.2467  -0.1571   0.6356  -0.3098
  -0.1883   0.3756  -0.5035   0.1273  -0.0000  -0.2293   0.0339  -0.3098  -0.6356
  -0.3519  -0.4495  -0.1026   0.4014  -0.7071  -0.0065  -0.0493   0.0000  -0.0000
  -0.2112   0.3334   0.0962   0.2819  -0.0000   0.7338   0.4659  -0.0000   0.0000
  -0.1883   0.3756  -0.5035   0.1273  -0.0000  -0.2293   0.0339   0.3098   0.6356

v =

  -0.1687   0.4192  -0.5986   0.2261        0  -0.5720   0.2433
  -0.4472   0.2255   0.4641  -0.2187   0.0000  -0.4871  -0.4987
  -0.2692   0.4206   0.5024   0.4900  -0.0000   0.2450   0.4451
  -0.3970   0.4003  -0.3923  -0.1305        0   0.6124  -0.3690
  -0.4702  -0.3037  -0.0507  -0.2607  -0.7071   0.0110   0.3407
  -0.3153  -0.5018  -0.1220   0.7128  -0.0000  -0.0162  -0.3544
  -0.4702  -0.3037  -0.0507  -0.2607   0.7071   0.0110   0.3407
Decomposition
S =

  1.5849  0       0       0       0       0       0
  0       1.2721  0       0       0       0       0
  0       0       1.1946  0       0       0       0
  0       0       0       0.7996  0       0       0
  0       0       0       0       0.7100  0       0
  0       0       0       0       0       0.5692  0
  0       0       0       0       0       0       0.1977
  0       0       0       0       0       0       0
  0       0       0       0       0       0       0

[Figure: documents spread along the v1 axis]
What does this have to do with dimension reduction?
• Low-rank matrix approximation
• SVD: A[m*n] = U[m*m] S[m*n] V^T[n*n]
• Remember that S is a diagonal matrix of singular values
• If we only keep the largest r singular values…
• A ≈ U[m*r] S[r*r] (V[n*r])^T
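Truncation can be written as a sum of outer products; a minimal sketch on a toy diagonal matrix whose SVD is trivial (the helper and values are illustrative, not the deck's example):

```python
# Given an SVD A = U S V^T, keeping only the r largest singular values gives
# the best rank-r approximation: A_r = sum over i < r of s_i * u_i * v_i^T.
# Toy case: A = diag(3, 2), whose singular vectors are the standard basis.

def rank_r_approx(U_cols, s, V_cols, r):
    n, m = len(U_cols[0]), len(V_cols[0])
    A = [[0.0] * m for _ in range(n)]
    for i in range(r):                         # keep only r outer products
        for a in range(n):
            for b in range(m):
                A[a][b] += s[i] * U_cols[i][a] * V_cols[i][b]
    return A

U_cols = [[1.0, 0.0], [0.0, 1.0]]              # left singular vectors u_i
s      = [3.0, 2.0]                            # singular values, descending
V_cols = [[1.0, 0.0], [0.0, 1.0]]              # right singular vectors v_i

A1 = rank_r_approx(U_cols, s, V_cols, 1)       # [[3.0, 0.0], [0.0, 0.0]]
A2 = rank_r_approx(U_cols, s, V_cols, 2)       # [[3.0, 0.0], [0.0, 2.0]]
```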
Rank-4 approximation
s4 =

  1.5849  0       0       0       0  0  0
  0       1.2721  0       0       0  0  0
  0       0       1.1946  0       0  0  0
  0       0       0       0.7996  0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
  0       0       0       0       0  0  0
Rank-4 approximation
u*s4*v' =

  -0.0019   0.5985  -0.0148   0.4552   0.7002   0.0102   0.7002
  -0.0728   0.4961   0.6282   0.0745   0.0121  -0.0133   0.0121
   0.0003  -0.0067   0.0052  -0.0013   0.3584   0.7065   0.3584
   0.1980   0.0514   0.0064   0.2199   0.0535  -0.0544   0.0535
  -0.0728   0.4961   0.6282   0.0745   0.0121  -0.0133   0.0121
   0.6337  -0.0602   0.0290   0.5324  -0.0008   0.0003  -0.0008
   0.0003  -0.0067   0.0052  -0.0013   0.3584   0.7065   0.3584
   0.2165   0.2494   0.4367   0.2282  -0.0360   0.0394  -0.0360
   0.6337  -0.0602   0.0290   0.5324  -0.0008   0.0003  -0.0008
Rank-4 approximation
u*s4: word vector representation of the concepts/topics
  -1.1056  -0.1203   0.0207  -0.5558   0  0  0
  -0.4155   0.3748   0.5606   0.1573   0  0  0
  -0.5576  -0.5719  -0.1226   0.3210   0  0  0
  -0.1786   0.1801  -0.1765  -0.0587   0  0  0
  -0.4155   0.3748   0.5606   0.1573   0  0  0
  -0.2984   0.4778  -0.6015   0.1018   0  0  0
  -0.5576  -0.5719  -0.1226   0.3210   0  0  0
  -0.3348   0.4241   0.1149   0.2255   0  0  0
  -0.2984   0.4778  -0.6015   0.1018   0  0  0
Rank-4 approximation
s4*v': new (concept/topic) representation of documents
  -0.2674  -0.7087  -0.4266  -0.6292  -0.7451  -0.4996  -0.7451
   0.5333   0.2869   0.5351   0.5092  -0.3863  -0.6384  -0.3863
  -0.7150   0.5544   0.6001  -0.4686  -0.0605  -0.1457  -0.0605
   0.1808  -0.1749   0.3918  -0.1043  -0.2085   0.5700  -0.2085
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
Rank-2 approximation
s2 =

  1.5849  0       0  0  0  0  0
  0       1.2721  0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
  0       0       0  0  0  0  0
Rank-2 approximation
u*s2*v'
   0.1361   0.4673   0.2470   0.3908   0.5563   0.4089   0.5563
   0.2272   0.2703   0.2695   0.3150   0.0815  -0.0571   0.0815
  -0.1457   0.1204  -0.0904  -0.0075   0.4358   0.4628   0.4358
   0.1057   0.1205   0.1239   0.1430   0.0293  -0.0341   0.0293
   0.2272   0.2703   0.2695   0.3150   0.0815  -0.0571   0.0815
   0.2507   0.2412   0.2813   0.3097  -0.0048  -0.1457  -0.0048
  -0.1457   0.1204  -0.0904  -0.0075   0.4358   0.4628   0.4358
   0.2343   0.2454   0.2685   0.3027   0.0286  -0.1073   0.0286
   0.2507   0.2412   0.2813   0.3097  -0.0048  -0.1457  -0.0048
Rank-2 approximation
u*s2: word vector representation of the concepts/topics
  -1.1056  -0.1203   0  0  0  0  0
  -0.4155   0.3748   0  0  0  0  0
  -0.5576  -0.5719   0  0  0  0  0
  -0.1786   0.1801   0  0  0  0  0
  -0.4155   0.3748   0  0  0  0  0
  -0.2984   0.4778   0  0  0  0  0
  -0.5576  -0.5719   0  0  0  0  0
  -0.3348   0.4241   0  0  0  0  0
  -0.2984   0.4778   0  0  0  0  0
Rank-2 approximation
s2*v': new (concept/topic) representation of documents
  -0.2674  -0.7087  -0.4266  -0.6292  -0.7451  -0.4996  -0.7451
   0.5333   0.2869   0.5351   0.5092  -0.3863  -0.6384  -0.3863
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
   0        0        0        0        0        0        0
Latent Semantic Indexing
A[n x m] ≈ U[n x r] Λ[r x r] (V[m x r])^T

• A: n x m matrix (n documents, m terms)
• U: n x r matrix (n documents, r concepts)
• Λ: r x r diagonal matrix (strength of each ‘concept’; r: rank of the matrix)
• V: m x r matrix (m terms, r concepts)
Latent semantic indexing (LSI)
• Dimensionality reduction = identification of hidden (latent) concepts
• Query matching in latent space
• LSI matches documents even if they don’t have words in common,
  – if they share frequently co-occurring terms
Example of LSI
             data  inf  retrieval  brain  lung
        CS  [ 1     1    1          0      0 ]
        CS  [ 2     2    2          0      0 ]
        CS  [ 1     1    1          0      0 ]
    A = CS  [ 5     5    5          0      0 ]
        MD  [ 0     0    0          2      2 ]
        MD  [ 0     0    0          3      3 ]
        MD  [ 0     0    0          1      1 ]

A = U Λ V^T:

        [ 0.18  0    ]
        [ 0.36  0    ]
        [ 0.18  0    ]   [ 9.64  0    ]   [ 0.58  0.58  0.58  0     0    ]
    A = [ 0.90  0    ] x [ 0     5.29 ] x [ 0     0     0     0.71  0.71 ]
        [ 0     0.53 ]
        [ 0     0.80 ]
        [ 0     0.27 ]

Columns of U: the CS-concept and the MD-concept; 9.64 = strength of the CS-concept.
Rows of V^T: term representation of each concept.
Dim. reduction: 2 concepts instead of 5 terms.

(Slide adapted from C. Faloutsos’s talk)
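The strength of the strongest concept can be recovered from this matrix without a full SVD, e.g. by power iteration on A^T A (a pure-Python sketch; `mat_vec` is an illustrative helper):

```python
import math

# Power iteration on A^T A recovers the largest singular value of the
# slide's 7x5 document-term matrix; the slide reports ~9.64 (CS-concept).
A = [[1, 1, 1, 0, 0],
     [2, 2, 2, 0, 0],
     [1, 1, 1, 0, 0],
     [5, 5, 5, 0, 0],
     [0, 0, 0, 2, 2],
     [0, 0, 0, 3, 3],
     [0, 0, 0, 1, 1]]

def mat_vec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

# A^T (terms x documents), so A^T A acts on the 5-dim term space
AT = [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]

v = [1.0] * 5
for _ in range(50):                          # v -> top right singular vector
    w = mat_vec(AT, mat_vec(A, v))           # w = A^T A v
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# sigma_1^2 = v^T (A^T A) v for the converged unit vector v
w = mat_vec(AT, mat_vec(A, v))
sigma1 = math.sqrt(sum(vi * wi for vi, wi in zip(v, w)))   # ~ 9.64
```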
How to Map Query/Doc to the Same Concept Space?
q_concept^T = q^T V
d_concept^T = d^T V

Example (terms: data, inf., retrieval, brain, lung):

                                 [ 0.58  0    ]
                                 [ 0.58  0    ]
    q^T V = [ 1  0  0  0  0 ]  x [ 0.58  0    ]  =  [ 0.58  0 ]
                                 [ 0     0.71 ]
                                 [ 0     0.71 ]

0.58 = similarity with the CS-concept.

    d^T = [ 0  1  1  0  0 ]  gives  d^T V = [ 1.16  0 ]

(Slide adapted from C. Faloutsos’s talk)
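The same folding, written out in plain Python (the helper name is illustrative):

```python
# Fold a query and a document into the 2-D concept space via x_concept^T = x^T V.
# Term order: data, inf., retrieval, brain, lung; concept order: CS, MD.
V = [[0.58, 0.00],
     [0.58, 0.00],
     [0.58, 0.00],
     [0.00, 0.71],
     [0.00, 0.71]]

def to_concept_space(x):
    return [sum(x[i] * V[i][k] for i in range(len(x))) for k in range(2)]

q = [1, 0, 0, 0, 0]           # query: "data"
d = [0, 1, 1, 0, 0]           # document: "inf. retrieval"
qc = to_concept_space(q)      # ~ [0.58, 0.0]
dc = to_concept_space(d)      # ~ [1.16, 0.0]
# q and d share no terms, yet both load only on the CS-concept dimension.
```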
Useful pointers
• http://lsa.colorado.edu
• http://lsi.research.telcordia.com
• http://www.cs.utk.edu/~lsi
Problem of LSI
• Concepts/topics are hard to interpret
• New document/query vectors could have negative values
• Lack of statistical interpretation
• Probabilistic latent semantic indexing…
General Idea of Probabilistic Topic Models
• Modeling a topic/subtopic/theme with a multinomial distribution (unigram LM)
• Modeling text data with a mixture model involving multinomial distributions
  – A document is “generated” by sampling words from some multinomial distribution
  – Each time, a word may be generated from a different distribution
  – Many variations of how these multinomial distributions are mixed
• Topic mining = fitting the probabilistic model to text
• Answer topic-related questions by computing various kinds of conditional probabilities based on the estimated model (e.g., p(time | topic), p(time | topic, location))
Document as a Sample of Mixed Topics
• Applications of topic models:
  – Summarize themes/aspects
  – Facilitate navigation/browsing
  – Retrieve documents
  – Segment documents
  – Many others
• How can we discover these topic word distributions?
Topic 1 (government 0.3, response 0.2, …)
Topic 2 (donate 0.1, relief 0.05, help 0.02, …)
…
Topic k (city 0.2, new 0.1, orleans 0.05, …)
Background B (is 0.05, the 0.04, a 0.03, …)

Example document, with bracketed spans drawn from different topics:
[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …
Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
• Mix k multinomial distributions to generate a document
• Each document has a potentially different set of mixing weights which captures the topic coverage
• When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model)
• We may add a background distribution to “attract” background words
PLSI (a.k.a. Aspect Model)
• Every document is a mixture of underlying (latent) K aspects (topics) with mixture weights p(z|d)
  – How is this related to LSI?
• Each aspect is represented by a distribution of words p(w|z)
• Estimate p(z|d) and p(w|z) using the EM algorithm

    p(d, w) = p(d) p(w|d)
    p(w|d) = Σ_{z∈Z} p(z|d) p(w|z)
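The mixture p(w|d) = Σ_z p(z|d) p(w|z) is easy to sketch with toy numbers (all values below are hypothetical, chosen so each topic is a proper multinomial):

```python
# Toy aspect model: p(w|d) = sum over z of p(z|d) * p(w|z), with two topics.
p_z_given_d = [0.7, 0.3]                       # mixing weights for one document
p_w_given_z = [
    {"data": 0.5, "retrieval": 0.5},           # topic z1
    {"brain": 0.6, "lung": 0.4},               # topic z2
]
vocab = ["data", "retrieval", "brain", "lung"]

def p_w_d(w):
    return sum(p_z_given_d[z] * p_w_given_z[z].get(w, 0.0) for z in range(2))

total = sum(p_w_d(w) for w in vocab)           # a proper distribution: sums to 1
```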
PLSI as a Mixture Model
Topic z1 (warning 0.3, system 0.2, …)
Topic z2 (aid 0.1, donation 0.05, support 0.02, …)
…
Topic zk (statistics 0.2, loss 0.1, dead 0.05, …)
Background B (is 0.05, the 0.04, a 0.03, …)

“Generating” word w in doc d in the collection:
  – with probability λ_B, draw w from the background distribution B
  – with probability 1 − λ_B, pick topic z_i with probability p(z_i|d) and draw w from p(w|z_i)

Parameters: λ_B = noise level (manually set); p(z|d) and p(w|z) are estimated with maximum likelihood

    p(d, w) = p(d) p(w|d)
    p(w|d) = Σ_{z∈Z} p(z|d) p(w|z)
Parameter Estimation using EM Algorithm
• We have the log-likelihood function from the PLSI model, which we want to maximize:

    L = Σ_{d∈D} Σ_{w∈W} c(w, d) log p(d, w)

• Maximize the likelihood using Expectation-Maximization
EM Steps
• E-Step
  – Expectation step: the expectation of the likelihood function is calculated with the current parameter values
• M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function
E Step
• p(z|d,w) is the probability that a word w occurring in a document d is explained by topic z:

    p(z|d, w) = p(z|d) p(w|z) / Σ_{z'∈Z} p(z'|d) p(w|z')
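For one (d, w) pair, the E-Step is just Bayes' rule over topics; a sketch with hypothetical current parameter values:

```python
# E-Step for one (d, w) pair with two topics: p(z|d,w) is proportional to
# p(z|d) * p(w|z), normalized over z. Parameter values are made up.
p_z_given_d = [0.6, 0.4]          # current p(z|d) for this document
p_w_given_z = [0.1, 0.3]          # current p(w|z) for this word

joint = [p_z_given_d[z] * p_w_given_z[z] for z in range(2)]   # [0.06, 0.12]
total = sum(joint)
posterior = [j / total for j in joint]                        # [1/3, 2/3]
```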
M Step
• All these equations use p(z|d,w) calculated in the E-Step:

    p(w|z) = Σ_{d∈D} c(w, d) p(z|d, w) / Σ_{w'∈W} Σ_{d∈D} c(w', d) p(z|d, w')

    p(z|d) = Σ_{w∈W} c(w, d) p(z|d, w) / Σ_{z'∈Z} Σ_{w∈W} c(w, d) p(z'|d, w)

• EM converges to a local maximum of the likelihood function
• We will see more when we talk about topic modeling
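Putting the E-Step and M-Step together gives a complete EM loop. Below is a minimal, self-contained sketch for the aspect model with two topics and no background distribution; the corpus, word counts c(w, d), and the `norm` helper are all made up for illustration:

```python
import math
import random

# Minimal PLSA-style EM on a toy corpus (illustrative only).
random.seed(0)

docs = ["d1", "d2"]
vocab = ["data", "retrieval", "brain", "lung"]
counts = {("d1", "data"): 4, ("d1", "retrieval"): 3,
          ("d2", "brain"): 5, ("d2", "lung"): 2}
K = 2

def norm(xs):
    s = sum(xs)
    return [x / s for x in xs]

# Random (normalized, strictly positive) initialization of p(z|d) and p(w|z)
p_z_d = {d: norm([random.random() + 0.1 for _ in range(K)]) for d in docs}
p_w_z = [dict(zip(vocab, norm([random.random() + 0.1 for _ in vocab])))
         for _ in range(K)]

def log_likelihood():
    return sum(c * math.log(sum(p_z_d[d][z] * p_w_z[z][w] for z in range(K)))
               for (d, w), c in counts.items())

prev = log_likelihood()
for _ in range(20):
    # E-Step: posterior p(z|d,w) for every observed (d, w) pair
    post = {dw: norm([p_z_d[dw[0]][z] * p_w_z[z][dw[1]] for z in range(K)])
            for dw in counts}
    # M-Step: re-estimate p(w|z) and p(z|d) from expected counts
    for z in range(K):
        for w in vocab:
            p_w_z[z][w] = sum(c * post[(d, w2)][z]
                              for (d, w2), c in counts.items() if w2 == w)
        total = sum(p_w_z[z].values())
        for w in vocab:
            p_w_z[z][w] /= total
    for d in docs:
        p_z_d[d] = norm([sum(c * post[(d2, w)][z]
                             for (d2, w), c in counts.items() if d2 == d)
                         for z in range(K)])
    cur = log_likelihood()
    assert cur >= prev - 1e-9        # EM never decreases the likelihood
    prev = cur
```

The assertion inside the loop is exactly the convergence guarantee stated above: each E/M pair can only increase (or leave unchanged) the log-likelihood, so the procedure climbs to a local maximum.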