Matrix Factorization, LSI, Probabilistic Latent Indexing
TRANSCRIPT
Latent variable models
• Matrix Factorization
• LSI
• Probabilistic Latent Indexing
Matrix factorization
• $X = [x_1, \dots, x_n]$, with $x_i \in \mathbb{R}^m$
• $X$ is an m x n matrix with columns the $x_i$'s
• Low rank approximation of X
  o Find factors U, V such that $X \approx UV$
  o With U an m x k matrix, V a k x n matrix, k < m, n
• Many different decompositions
  o e.g. Singular Value Decomposition, Non-negative Matrix Factorization, Tri-factorization, etc.

[Figure: X (m x n) ≈ U (m x k) × V (k x n)]
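As a quick illustration (not from the course), a minimal NumPy sketch of a rank-k approximation; the matrix sizes and the choice of truncated SVD as the factorization method are illustrative assumptions:

```python
# A minimal sketch of low-rank matrix factorization with NumPy.
# Sizes m, n and rank k are illustrative choices, not from the slides.
import numpy as np

m, n, k = 6, 8, 2
X = np.random.rand(m, n)               # m x n data matrix, columns are the x_i

# One way to obtain factors U (m x k) and V (k x n): truncated SVD
U_full, s, Vt_full = np.linalg.svd(X, full_matrices=False)
U = U_full[:, :k] * s[:k]              # m x k (singular values folded into U)
V = Vt_full[:k, :]                     # k x n

X_approx = U @ V                       # rank-k approximation of X
print(np.linalg.norm(X - X_approx))    # approximation error
```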
2 views of matrix factorization
• Decomposition in a vector basis
  o $X \approx UV$
  o $x_j \simeq \sum_{i=1}^{k} u_i v_{ij}$
  o Columns of U, the $u_i$, are basis vectors; the $v_{ij}$ are the coefficients of $x_j$ in this basis
[Figure: $x_j \simeq U v_j$ — original data $x_j$, basis vectors (dictionary) $u_1, u_2, u_3$, representation coefficients $v_j$]
2 views of matrix factorization
• Sum of rank 1 matrices
  o $X = \sum_{i=1}^{k} u_i v_i$, where $u_i$ is the i-th column of U and $v_i$ is the i-th row of V
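A short sketch (illustrative random factors, not course material) checking that the matrix-product view and the rank-1-sum view coincide:

```python
# Sketch: the factorization X = UV written as a sum of k rank-1 matrices.
import numpy as np

m, n, k = 6, 8, 2
U = np.random.rand(m, k)
V = np.random.rand(k, n)

X1 = U @ V                                               # matrix-product view
X2 = sum(np.outer(U[:, i], V[i, :]) for i in range(k))   # rank-1 sum view
print(np.allclose(X1, X2))                               # True: both views agree
```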
• Interpretation
  o If X is a term x document matrix
  o Terms and documents are represented in a common representation space of size k
  o Their similarity is measured by a dot product in this space
[Figure: term representations $u_i$ (rows of U) and document representations $v_j$ (columns of V) in the shared latent space]
Linear algebra review
• X an m x n matrix
  o The rank of X is the number of linearly independent rows or columns
  o $\mathrm{rank}(X) \le \min(m, n)$
• X a square m x m matrix
  o Eigenvector u, eigenvalue λ of X: $Xu = \lambda u$
  o The number of non-zero eigenvalues of X is at most rank(X)
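A minimal NumPy illustration of these definitions, on an arbitrary example matrix:

```python
# Sketch: eigenvectors/eigenvalues and rank (illustrative example matrix).
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(X)
u, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(X @ u, lam * u))     # X u = lambda u
print(np.linalg.matrix_rank(X))        # rank(X) <= min(m, n)
```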
• Diagonalisation of a real square matrix
  o X a real-valued m x m matrix of rank m
  o $X = U \Lambda U^{-1}$
  o The columns of U are the eigenvectors of X and Λ is a diagonal matrix whose entries are the eigenvalues of X in decreasing order
• Diagonalisation of a symmetric real-valued matrix
  o X a real-valued m x m symmetric matrix of rank m
  o $X = U \Lambda U^{T}$
  o The columns of U are orthogonal and unit-length normalized: $U^{T} = U^{-1}$
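A sketch of the symmetric case with NumPy's eigh, on an illustrative random symmetric matrix (note that eigh returns eigenvalues in increasing rather than decreasing order):

```python
# Sketch: diagonalization of a symmetric real matrix, X = U Lambda U^T,
# with orthonormal eigenvectors (U^T = U^{-1}); illustrative matrix.
import numpy as np

A = np.random.rand(4, 4)
X = A + A.T                                     # make a symmetric matrix
lam, U = np.linalg.eigh(X)                      # eigh: for symmetric matrices
print(np.allclose(X, U @ np.diag(lam) @ U.T))   # X = U Lambda U^T
print(np.allclose(U.T @ U, np.eye(4)))          # columns are orthonormal
```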
Singular value decomposition of a rectangular matrix
• X an m x n matrix of rank r
• SVD of X
  o $X = U \Sigma V^{T}$
  o Σ diagonal matrix of the singular values of X; the singular values are the square roots of the eigenvalues of $XX^{T}$
  o U matrix of eigenvectors of $XX^{T}$
  o V matrix of eigenvectors of $X^{T}X$
• If $\mathrm{rank}(X) = r$ then
  o only r eigenvalues are non-zero
  o $\mathrm{diag}(\Sigma) = (\sigma_1, \dots, \sigma_r)$
  o U, V are orthogonal
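An illustrative NumPy check of the SVD and its link with the eigenvalues of $X^{T}X$ (random matrix, not course material):

```python
# Sketch: SVD of a rectangular matrix and its link with the eigenvectors
# of X X^T and X^T X (illustrative random matrix).
import numpy as np

X = np.random.rand(5, 3)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X, U @ np.diag(s) @ Vt))      # X = U Sigma V^T

# Singular values are square roots of the eigenvalues of X^T X (or X X^T)
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]     # sorted in decreasing order
print(np.allclose(s, np.sqrt(np.maximum(eigvals, 0))))
```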
Latent Semantic Analysis
LSI
• Motivation
  o Exploit term co-occurrences in documents to obtain smaller representations and to address the synonymy and polysemy problems encountered in vector space models
  o Example: $d_1 = (\dots, \text{car}, \dots)$, $d_2 = (\dots, \text{automobile}, \dots)$
• Principle
  o Project queries and documents into a reduced-dimension space where co-occurring terms are "close"
LSI interpretation
• X is a term x document matrix
  o m terms in rows, n documents in columns
  o $x_{ij}$ could be 0/1 or tf-idf, for example
• Rows of $U_k$ encode the term projection on the latent factors
  o U is the matrix of eigenvectors of the term co-occurrence matrix $XX^{T}$
• Columns of $V_k^{T}$ encode the document projection on the latent factors
  o $V^{T}$ is the matrix of eigenvectors of the document co-occurrence matrix $X^{T}X$
• U and V are orthonormal
• Representation of a query or a document in the term space:
  o $\hat{q} = \Sigma_k^{-1} U_k^{T} q$
  o Terms that frequently co-occur are projected to the same "place"
• The same holds for the projection of terms into the document space, with V
• Similarity computation: e.g. $RSV(q, d) = \cos(\hat{q}, \hat{d})$
• Compared to the vector space model
  o Documents are represented in a dense, reduced-size space
  o Solves some synonymy problems
  o The convenience of the inverted index is lost
  o Cost of computing the SVD
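A minimal LSI sketch, assuming a toy term x document count matrix and a one-term query; all names and values below are illustrative:

```python
# Sketch of LSI: truncated SVD of a toy term x document matrix,
# query folding q_hat = Sigma_k^{-1} U_k^T q, then cosine ranking.
import numpy as np

X = np.array([[2, 0, 1, 0],      # term "car"        (toy counts)
              [0, 2, 0, 1],      # term "automobile"
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

docs = Vk.T                            # document coordinates in latent space
q = np.array([1.0, 0.0, 0.0, 0.0])     # query containing only the term "car"
q_hat = np.linalg.inv(Sk) @ Uk.T @ q   # fold the query into the latent space

# RSV(q, d) = cos(q_hat, d_hat) for every document d
cos = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(cos)
```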
Probabilistic Latent Semantic Analysis
Preliminaries: unigram model
• Generative model of a document
  o Select the document length
  o Pick a word w with probability p(w)
  o Continue until the end of the document
• $p(d) = \prod_i p(w_i \mid d)$
• Applications
  o Classification
  o Clustering
  o Ad-hoc retrieval (language models)
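A toy sketch of the unigram generation process; the vocabulary, the probabilities, and the fixed document length are illustrative assumptions:

```python
# Sketch of the unigram model: words drawn i.i.d., p(d) = prod_i p(w_i | d).
import numpy as np

vocab = ["w1", "w2", "w3"]
p_w = np.array([0.5, 0.25, 0.25])             # word probabilities

rng = np.random.default_rng(0)
length = 8                                     # chosen document length
doc = rng.choice(vocab, size=length, p=p_w)    # pick words independently
log_p_doc = np.log(p_w[[vocab.index(w) for w in doc]]).sum()
print(doc, log_p_doc)
```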
Preliminaries: unigram model – geometric interpretation
[Figure: a document d is a point on the word simplex, with coordinates P(w1|d) = 1/2, P(w2|d) = 1/4, P(w3|d) = 1/4]
Latent models for document generation
• Several factors influence the creation of a document (authors, topics, mood, etc.)
  o They are usually unknown
• Generative statistical models
  o Associate the factors with latent variables
  o Identifying (learning) the latent variables allows us to uncover (inference) complex latent structures
Probabilistic Latent Semantic Analysis - PLSA (Hofmann 99)
• Motivations
  o Several topics may be present in a document or in a document collection
  o Learn the topics from a training collection
• Applications
  o Identify the semantic content of documents, document relationships, trends, …
  o Segment documents, ad-hoc IR, …
PLSA
• The latent structure is a set of topics
  o Each document is generated as a set of words chosen from selected topics
  o A latent variable z (topic) is associated to each word occurrence in the document
• Generative process
  o Select a document d, P(d)
  o Iterate:
    – Choose a latent class z, P(z|d)
    – Generate a word w according to P(w|z)
  o Note: P(w|z) and P(z|d) are multinomial distributions over the V words and the T topics
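A sketch of this generative process with toy distributions (all the probability tables below are made-up illustrations):

```python
# Sketch of the PLSA generative process: select d with P(d), then for each
# token choose z ~ P(z|d) and w ~ P(w|z). All distributions are toy values.
import numpy as np

rng = np.random.default_rng(0)
P_d = np.array([0.5, 0.5])                    # 2 documents
P_z_given_d = np.array([[0.8, 0.2],           # P(z|d), rows = documents
                        [0.3, 0.7]])
P_w_given_z = np.array([[0.7, 0.2, 0.1],      # P(w|z), rows = topics
                        [0.1, 0.3, 0.6]])

d = rng.choice(2, p=P_d)                      # select a document
tokens = []
for _ in range(10):                           # iterate over word positions
    z = rng.choice(2, p=P_z_given_d[d])       # latent topic for this token
    w = rng.choice(3, p=P_w_given_z[z])       # word drawn from the topic
    tokens.append(w)
print(d, tokens)
```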
PLSA - Topic
• A topic is a distribution over words
• Remark
  o A topic is shared by several words
  o A word is associated to several topics

word         P(w|z)
machine      0.04
learning     0.01
information  0.09
retrieval    0.02
……           …….
PLSA as a graphical model
$P(d, w) = P(d)\, P(w \mid d)$, with $P(w \mid d) = \sum_z P(w \mid z)\, P(z \mid d)$

[Figure: plate diagram d → z → w with arrows labelled P(z|d) and P(w|z); the outer box (corpus level) is repeated D times, the inner box (document level) $N_d$ times. Boxes represent repeated sampling.]
PLSA model
• Hypotheses
  o The number of values of z is fixed a priori
  o Bag of words
  o Documents are independent
    – No specific distribution on the documents
  o Conditional independence: z being known, w and d are independent
• Learning
  o Maximum likelihood: p(Doc-collection)
  o EM algorithm and variants
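A compact EM sketch for PLSA on toy counts, following the standard updates (E-step: $P(z \mid d, w) \propto P(z \mid d)\,P(w \mid z)$; M-step: re-estimate the multinomials from expected counts); this is an illustrative implementation, not the course's code:

```python
# EM for PLSA on a toy term-document count matrix N (documents in rows here).
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(1, 6, size=(6, 10)).astype(float)   # n(d, w), all positive
D, W, T = N.shape[0], N.shape[1], 2                  # T topics, fixed a priori

P_z_d = rng.dirichlet(np.ones(T), size=D)            # P(z|d), D x T
P_w_z = rng.dirichlet(np.ones(W), size=T)            # P(w|z), T x W

for _ in range(50):
    # E-step: posterior P(z|d,w) ∝ P(z|d) P(w|z)
    post = P_z_d[:, :, None] * P_w_z[None, :, :]     # D x T x W
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate the multinomials from expected counts
    nz = N[:, None, :] * post                        # expected counts, D x T x W
    P_w_z = nz.sum(axis=0)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_d = nz.sum(axis=2)
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)
print(P_w_z.round(2))
```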
PLSA - geometric interpretation
• Each topic is a point on the word simplex
• Documents are constrained to lie on the topic simplex
• This creates a bottleneck in document representation
• $P(w \mid d) = \sum_z P(w \mid z)\, P(z \mid d)$

[Figure: word simplex over w1, w2, w3; topic1, topic2, topic3 span a topic simplex inside it; document d lies on the topic simplex]
Applications
• Thematic segmentation
• Creating document hierarchies
• IR: PLSA model
• Clustering and classification
• Image annotation
  o Learn and infer P(w|image)
• Collaborative filtering
• Note: many variants and extensions
  o e.g. Hierarchical PLSA (see Gaussier et al.)
Latent Dirichlet Allocation - LDA (Blei et al. 2003)
• LDA is also a topic model
  o Extends PLSA
• Motivations
  o Generalization over unseen documents
    – Defines a probabilistic model over documents, not present in PLSA
    – Allows generating (modeling) unseen documents
  o Overfitting
    – In PLSA, the number of parameters grows with the corpus size
    – LDA constrains the distribution of topics for each document and of words for each topic
LDA - model
• Similar to PLSA, with the addition of a prior distribution on the topic distribution
• Generative process, for a document:
  o Topic distribution: choose θ ~ Dirichlet(α), a distribution over topics
  o Words: for each word position
    – Choose a topic z ~ Multinomial(θ)
    – Choose a word w from p(w | z, Φ), a multinomial probability conditioned on topic z
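A sketch of this generative process; α, Φ, and all the sizes below are toy assumptions:

```python
# Sketch of LDA's generative process: theta ~ Dirichlet(alpha), then for
# each word position z ~ Multinomial(theta) and w ~ p(w | z, Phi).
import numpy as np

rng = np.random.default_rng(0)
T, W = 3, 5                                   # topics, vocabulary size
alpha = np.full(T, 0.1)                       # Dirichlet prior over topics
Phi = rng.dirichlet(np.ones(W), size=T)       # P(w|z) for each topic z

theta = rng.dirichlet(alpha)                  # per-document topic distribution
doc = []
for _ in range(12):                           # for each word position
    z = rng.choice(T, p=theta)                # topic ~ Multinomial(theta)
    w = rng.choice(W, p=Phi[z])               # word ~ p(w | z, Phi)
    doc.append(w)
print(theta.round(2), doc)
```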
LDA tagging (Blei et al 2003)
Finding topics in PNAS (Griffiths et al. 2004)
[Figure: correspondence between PNAS categories and LDA topics — mean $\theta_i$ value of the most significant topic i for each category]
Author-recipient topic model (McCallum et al. 2004)
• Learning from Enron data
• Identify
  o Topic
  o Author-recipient
Using LDA for ranking
• [Harvey et al., 2010]