Matrix Factorization, LSI, Probabilistic Latent Indexing
TRANSCRIPT
Latent variable models
• Matrix Factorization
• LSI
• Probabilistic Latent Indexing
Matrix factorization
• $X = [x_1, \dots, x_n]$, with $x_i \in \mathbb{R}^m$
• $X$ is an m x n matrix with columns the $x_i$'s
• Low rank approximation of X
  o Find factors U, V such that $X \approx UV$
  o With U an m x k matrix, V a k x n matrix, k < m, n
• Many different decompositions
  o e.g. Singular Value Decomposition, Non-negative Matrix Factorization, Tri-factorization, etc.

[Figure: X (m x n) ≈ U (m x k) × V (k x n)]
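As a quick illustration (not from the course), a minimal NumPy sketch of a rank-k approximation; the matrix sizes and the choice of truncated SVD as the factorization method are illustrative assumptions:

```python
# A minimal sketch of low-rank matrix factorization with NumPy.
# Sizes m, n and rank k are illustrative choices, not from the slides.
import numpy as np

m, n, k = 6, 8, 2
X = np.random.rand(m, n)               # m x n data matrix, columns are the x_i

# One way to obtain factors U (m x k) and V (k x n): truncated SVD
U_full, s, Vt_full = np.linalg.svd(X, full_matrices=False)
U = U_full[:, :k] * s[:k]              # m x k (singular values folded into U)
V = Vt_full[:k, :]                     # k x n

X_approx = U @ V                       # rank-k approximation of X
print(np.linalg.norm(X - X_approx))    # approximation error
```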
2 views of matrix factorization
• Decomposition in a vector basis
  o $X \approx UV$
  o $x_j \simeq \sum_{i=1}^{k} u_i v_{ij}$
  o Columns of U, the $u_i$, are basis vectors; the $v_{ij}$ are the coefficients of $x_j$ in this basis
[Figure: $x_j \simeq U v_j$ — original data $x_j$, basis vectors (dictionary) $u_1, u_2, u_3$, representation coefficients $v_j$]
2 views of matrix factorization
• Sum of rank 1 matrices
  o $X = \sum_{i=1}^{k} u_i v_i$, where $u_i$ is the i-th column of U and $v_i$ is the i-th row of V
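A short sketch (illustrative random factors, not course material) checking that the matrix-product view and the rank-1-sum view coincide:

```python
# Sketch: the factorization X = UV written as a sum of k rank-1 matrices.
import numpy as np

m, n, k = 6, 8, 2
U = np.random.rand(m, k)
V = np.random.rand(k, n)

X1 = U @ V                                               # matrix-product view
X2 = sum(np.outer(U[:, i], V[i, :]) for i in range(k))   # rank-1 sum view
print(np.allclose(X1, X2))                               # True: both views agree
```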
• Interpretation
  o If X is a term x document matrix
  o Terms and documents are represented in a common representation space of size k
  o Their similarity is measured by a dot product in this space
[Figure: term representations $u_i$ (rows of U) and document representations $v_j$ (columns of V) in the shared latent space]
Linear algebra review
• X an m x n matrix
  o The rank of X is the number of linearly independent rows or columns
  o $\mathrm{rank}(X) \le \min(m, n)$
• X a square m x m matrix
  o Eigenvector u, eigenvalue λ of X: $Xu = \lambda u$
  o The number of non-zero eigenvalues of X is at most rank(X)
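A minimal NumPy illustration of these definitions, on an arbitrary example matrix:

```python
# Sketch: eigenvectors/eigenvalues and rank (illustrative example matrix).
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(X)
u, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(X @ u, lam * u))     # X u = lambda u
print(np.linalg.matrix_rank(X))        # rank(X) <= min(m, n)
```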
• Diagonalisation of a real square matrix
  o X a real-valued m x m matrix of rank m
  o $X = U \Lambda U^{-1}$
  o The columns of U are the eigenvectors of X and Λ is a diagonal matrix whose entries are the eigenvalues of X in decreasing order
• Diagonalisation of a symmetric real-valued matrix
  o X a real-valued m x m symmetric matrix of rank m
  o $X = U \Lambda U^{T}$
  o The columns of U are orthogonal and unit-length normalized: $U^{T} = U^{-1}$
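A sketch of the symmetric case with NumPy's eigh, on an illustrative random symmetric matrix (note that eigh returns eigenvalues in increasing rather than decreasing order):

```python
# Sketch: diagonalization of a symmetric real matrix, X = U Lambda U^T,
# with orthonormal eigenvectors (U^T = U^{-1}); illustrative matrix.
import numpy as np

A = np.random.rand(4, 4)
X = A + A.T                                     # make a symmetric matrix
lam, U = np.linalg.eigh(X)                      # eigh: for symmetric matrices
print(np.allclose(X, U @ np.diag(lam) @ U.T))   # X = U Lambda U^T
print(np.allclose(U.T @ U, np.eye(4)))          # columns are orthonormal
```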
Singular value decomposition of a rectangular matrix
• X an m x n matrix of rank r
• SVD of X
  o $X = U \Sigma V^{T}$
  o Σ diagonal matrix of the singular values of X; the singular values are the square roots of the eigenvalues of $XX^{T}$
  o U matrix of eigenvectors of $XX^{T}$
  o V matrix of eigenvectors of $X^{T}X$
• If $\mathrm{rank}(X) = r$ then
  o only r eigenvalues are non-zero
  o $\mathrm{diag}(\Sigma) = (\sigma_1, \dots, \sigma_r)$
  o U, V are orthogonal
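An illustrative NumPy check of the SVD and its link with the eigenvalues of $X^{T}X$ (random matrix, not course material):

```python
# Sketch: SVD of a rectangular matrix and its link with the eigenvectors
# of X X^T and X^T X (illustrative random matrix).
import numpy as np

X = np.random.rand(5, 3)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X, U @ np.diag(s) @ Vt))      # X = U Sigma V^T

# Singular values are square roots of the eigenvalues of X^T X (or X X^T)
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]     # sorted in decreasing order
print(np.allclose(s, np.sqrt(np.maximum(eigvals, 0))))
```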
Latent Semantic Analysis
LSI
• Motivation
  o Exploit term co-occurrences in documents to obtain smaller representations and to address the synonymy and polysemy problems encountered in vector space models
  o Example: $d_1 = (\dots, \text{car}, \dots)$, $d_2 = (\dots, \text{automobile}, \dots)$
• Principle
  o Project queries and documents into a reduced-dimension space where co-occurring terms are "close"
LSI interpretation
• X is a term x document matrix
  o m terms in rows, n documents in columns
  o $x_{ij}$ could be 0/1 or tf-idf, for example
• Rows of $U_k$ encode the term projection on the latent factors
  o U is the matrix of eigenvectors of the term co-occurrence matrix $XX^{T}$
• Columns of $V_k^{T}$ encode the document projection on the latent factors
  o $V^{T}$ is the matrix of eigenvectors of the document co-occurrence matrix $X^{T}X$
• U and V are orthonormal
• Representation of a query or a document in the term space:
  o $\hat{q} = \Sigma_k^{-1} U_k^{T} q$
  o Terms that frequently co-occur are projected to the same "place"
• The same holds for the projection of terms into the document space, with V
• Similarity computation: e.g. $RSV(q, d) = \cos(\hat{q}, \hat{d})$
• Compared to the vector space model
  o Documents are represented in a dense, reduced-size space
  o Solves some synonymy problems
  o The convenience of the inverted index is lost
  o Cost of computing the SVD
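A minimal LSI sketch, assuming a toy term x document count matrix and a one-term query; all names and values below are illustrative:

```python
# Sketch of LSI: truncated SVD of a toy term x document matrix,
# query folding q_hat = Sigma_k^{-1} U_k^T q, then cosine ranking.
import numpy as np

X = np.array([[2, 0, 1, 0],      # term "car"        (toy counts)
              [0, 2, 0, 1],      # term "automobile"
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

docs = Vk.T                            # document coordinates in latent space
q = np.array([1.0, 0.0, 0.0, 0.0])     # query containing only the term "car"
q_hat = np.linalg.inv(Sk) @ Uk.T @ q   # fold the query into the latent space

# RSV(q, d) = cos(q_hat, d_hat) for every document d
cos = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(cos)
```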
Probabilistic Latent Semantic Analysis
Preliminaries: unigram model
• Generative model of a document
  o Select the document length
  o Pick a word w with probability p(w)
  o Continue until the end of the document
• $p(d) = \prod_i p(w_i \mid d)$
• Applications
  o Classification
  o Clustering
  o Ad-hoc retrieval (language models)
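A toy sketch of the unigram generation process; the vocabulary, the probabilities, and the fixed document length are illustrative assumptions:

```python
# Sketch of the unigram model: words drawn i.i.d., p(d) = prod_i p(w_i | d).
import numpy as np

vocab = ["w1", "w2", "w3"]
p_w = np.array([0.5, 0.25, 0.25])             # word probabilities

rng = np.random.default_rng(0)
length = 8                                     # chosen document length
doc = rng.choice(vocab, size=length, p=p_w)    # pick words independently
log_p_doc = np.log(p_w[[vocab.index(w) for w in doc]]).sum()
print(doc, log_p_doc)
```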
Preliminaries: unigram model – geometric interpretation
[Figure: a document d is a point on the word simplex, with coordinates P(w1|d) = 1/2, P(w2|d) = 1/4, P(w3|d) = 1/4]
Latent models for document generation
• Several factors influence the creation of a document (authors, topics, mood, etc.)
  o They are usually unknown
• Generative statistical models
  o Associate the factors with latent variables
  o Identifying (learning) the latent variables allows us to uncover (inference) complex latent structures
Probabilistic Latent Semantic Analysis - PLSA (Hofmann 99)
• Motivations
  o Several topics may be present in a document or in a document collection
  o Learn the topics from a training collection
• Applications
  o Identify the semantic content of documents, document relationships, trends, …
  o Segment documents, ad-hoc IR, …
PLSA
• The latent structure is a set of topics
  o Each document is generated as a set of words chosen from selected topics
  o A latent variable z (topic) is associated to each word occurrence in the document
• Generative process
  o Select a document d, P(d)
  o Iterate:
    – Choose a latent class z, P(z|d)
    – Generate a word w according to P(w|z)
  o Note: P(w|z) and P(z|d) are multinomial distributions over the V words and the T topics
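A sketch of this generative process with toy distributions (all the probability tables below are made-up illustrations):

```python
# Sketch of the PLSA generative process: select d with P(d), then for each
# token choose z ~ P(z|d) and w ~ P(w|z). All distributions are toy values.
import numpy as np

rng = np.random.default_rng(0)
P_d = np.array([0.5, 0.5])                    # 2 documents
P_z_given_d = np.array([[0.8, 0.2],           # P(z|d), rows = documents
                        [0.3, 0.7]])
P_w_given_z = np.array([[0.7, 0.2, 0.1],      # P(w|z), rows = topics
                        [0.1, 0.3, 0.6]])

d = rng.choice(2, p=P_d)                      # select a document
tokens = []
for _ in range(10):                           # iterate over word positions
    z = rng.choice(2, p=P_z_given_d[d])       # latent topic for this token
    w = rng.choice(3, p=P_w_given_z[z])       # word drawn from the topic
    tokens.append(w)
print(d, tokens)
```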
PLSA - Topic
• A topic is a distribution over words
• Remark
  o A topic is shared by several words
  o A word is associated to several topics

word         P(w|z)
machine      0.04
learning     0.01
information  0.09
retrieval    0.02
……           …….
PLSA as a graphical model
$P(d, w) = P(d)\, P(w \mid d)$, with $P(w \mid d) = \sum_z P(w \mid z)\, P(z \mid d)$

[Figure: plate diagram d → z → w with arrows labelled P(z|d) and P(w|z); the outer box (corpus level) is repeated D times, the inner box (document level) $N_d$ times. Boxes represent repeated sampling.]
PLSA model
• Hypotheses
  o The number of values of z is fixed a priori
  o Bag of words
  o Documents are independent
    – No specific distribution on the documents
  o Conditional independence: z being known, w and d are independent
• Learning
  o Maximum likelihood: p(Doc-collection)
  o EM algorithm and variants
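A compact EM sketch for PLSA on toy counts, following the standard updates (E-step: $P(z \mid d, w) \propto P(z \mid d)\,P(w \mid z)$; M-step: re-estimate the multinomials from expected counts); this is an illustrative implementation, not the course's code:

```python
# EM for PLSA on a toy term-document count matrix N (documents in rows here).
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(1, 6, size=(6, 10)).astype(float)   # n(d, w), all positive
D, W, T = N.shape[0], N.shape[1], 2                  # T topics, fixed a priori

P_z_d = rng.dirichlet(np.ones(T), size=D)            # P(z|d), D x T
P_w_z = rng.dirichlet(np.ones(W), size=T)            # P(w|z), T x W

for _ in range(50):
    # E-step: posterior P(z|d,w) ∝ P(z|d) P(w|z)
    post = P_z_d[:, :, None] * P_w_z[None, :, :]     # D x T x W
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate the multinomials from expected counts
    nz = N[:, None, :] * post                        # expected counts, D x T x W
    P_w_z = nz.sum(axis=0)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_d = nz.sum(axis=2)
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)
print(P_w_z.round(2))
```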
PLSA - geometric interpretation
• Each topic is a point on the word simplex
• Documents are constrained to lie on the topic simplex
• This creates a bottleneck in document representation
• $P(w \mid d) = \sum_z P(w \mid z)\, P(z \mid d)$

[Figure: word simplex over w1, w2, w3; topic1, topic2, topic3 span a topic simplex inside it; document d lies on the topic simplex]
Applications
• Thematic segmentation
• Creating document hierarchies
• IR: PLSA model
• Clustering and classification
• Image annotation
  o Learn and infer P(w|image)
• Collaborative filtering
• Note: many variants and extensions
  o e.g. Hierarchical PLSA (see Gaussier et al.)
Latent Dirichlet Allocation - LDA (Blei et al. 2003)
• LDA is also a topic model
  o Extends PLSA
• Motivations
  o Generalization over unseen documents
    – Defines a probabilistic model over documents, not present in PLSA
    – Allows generating (modeling) unseen documents
  o Overfitting
    – In PLSA, the number of parameters grows with the corpus size
    – LDA constrains the distribution of topics for each document and of words for each topic
LDA - model
• Similar to PLSA, with the addition of a prior distribution on the topic distribution
• Generative process, for a document:
  o Topic distribution: choose θ ~ Dirichlet(α), a distribution over topics
  o Words: for each word position
    – Choose a topic z ~ Multinomial(θ)
    – Choose a word w from p(w | z, Φ), a multinomial probability conditioned on topic z
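A sketch of this generative process; α, Φ, and all the sizes below are toy assumptions:

```python
# Sketch of LDA's generative process: theta ~ Dirichlet(alpha), then for
# each word position z ~ Multinomial(theta) and w ~ p(w | z, Phi).
import numpy as np

rng = np.random.default_rng(0)
T, W = 3, 5                                   # topics, vocabulary size
alpha = np.full(T, 0.1)                       # Dirichlet prior over topics
Phi = rng.dirichlet(np.ones(W), size=T)       # P(w|z) for each topic z

theta = rng.dirichlet(alpha)                  # per-document topic distribution
doc = []
for _ in range(12):                           # for each word position
    z = rng.choice(T, p=theta)                # topic ~ Multinomial(theta)
    w = rng.choice(W, p=Phi[z])               # word ~ p(w | z, Phi)
    doc.append(w)
print(theta.round(2), doc)
```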
LDA tagging (Blei et al 2003)
Finding topics in PNAS (Griffiths et al. 2004)
[Figure: correspondence between PNAS categories and LDA topics — mean $\theta_i$ value of the most significant topic i for each category]
Author-recipient topic model (McCallum et al. 2004)
• Learning from Enron data
• Identify
  o Topic
  o Author-recipient
Using LDA for ranking
• [Harvey et al., 2010]