Latent variable models: Matrix Factorization, LSI, Probabilistic Latent Indexing

Page 1

Latent variable models

• Matrix Factorization
• LSI
• Probabilistic Latent Indexing

Page 2

Matrix factorization

• $X = [x_1, \dots, x_n]$, $x_j \in \mathbb{R}^m$
• X is an m x n matrix whose columns are the $x_j$
• Low-rank approximation of X
  • Find factors U, V such that $X \approx UV$
  • With U an m x k matrix and V a k x n matrix, k < m, n

[Figure: X (m x n) ≈ U (m x k) × V (k x n)]

• Many different decompositions
  • e.g. Singular Value Decomposition, Non-Negative Matrix Factorization, Tri-factorization, etc.

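As a concrete illustration of the low-rank approximation above, here is a minimal NumPy sketch (the matrix size and the target rank k are made-up toy values, not taken from the slides) that builds a rank-k factorization X ≈ UV from the truncated SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 50))            # toy m x n data matrix (assumption)
k = 10                               # target rank, k < min(m, n)

# Thin SVD, then keep the k leading singular directions
U_full, s, Vt_full = np.linalg.svd(X, full_matrices=False)
U = U_full[:, :k] * s[:k]            # m x k factor (singular values folded into U)
V = Vt_full[:k, :]                   # k x n factor

X_approx = U @ V                     # rank-k approximation of X
print(np.linalg.norm(X - X_approx))  # Frobenius approximation error
```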

Page 3

2 views of matrix factorization

• Decomposition in a vector basis
  • $X \approx UV$
  • $x_j \simeq \sum_{i=1}^{k} u_i v_{ij}$
  • The columns of U, the $u_i$, are basis vectors; the $v_{ij}$ are the coefficients of $x_j$ in this basis

[Figure: original data X ≃ (basis vectors / dictionary U) × (representation V); a column $x_j$ is a combination of the basis vectors u1, u2, u3 with the coefficients taken from $v_j$]

Page 4

2 views of matrix factorization

• Sum of rank-1 matrices
  • $X = \sum_{i=1}^{k} u_i v_i$
  • Where $u_i$ is the i-th column of U and $v_i$ is the i-th row of V

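A minimal sketch (toy factor sizes, not from the slides) of this second view: the product UV equals the sum of the k rank-1 matrices obtained as outer products of U's columns with V's rows.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.random((5, 3))               # m x k factor
V = rng.random((3, 4))               # k x n factor

# Sum of the k rank-1 matrices u_i v_i (outer products)
rank1_sum = sum(np.outer(U[:, i], V[i, :]) for i in range(3))
print(np.allclose(U @ V, rank1_sum))  # True: UV = sum_i u_i v_i
```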

Page 5

• Interpretation
  • If X is a term x document matrix
  • Terms and documents are represented in a common representation space of size k
  • Their similarity is measured by a dot product in this space

[Figure: a column $x_j$ of the original data expressed from the term representations $u_i$ and the document representation $v_j$ in the shared latent space]

Page 6

Linear algebra review

• X an m x n matrix
  • The rank of X is the number of linearly independent rows or columns
  • $\mathrm{rank}(X) \le \min(m, n)$
• X a square m x m matrix
  • Eigenvector, eigenvalue of X
    • $u, \lambda$ such that $Xu = \lambda u$
  • The number of non-zero eigenvalues of X is at most rank(X)

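A minimal NumPy check of these facts on a toy matrix (the matrix itself is an assumption): the rank is at most min(m, n), and an eigenpair (u, λ) satisfies Xu = λu.

```python
import numpy as np

X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])      # toy square matrix (assumption)

print(np.linalg.matrix_rank(X))      # rank(X) <= min(m, n)
vals, vecs = np.linalg.eig(X)        # eigenvalues and eigenvectors of X
u, lam = vecs[:, 0], vals[0]
print(np.allclose(X @ u, lam * u))   # X u = lambda u
```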

Page 7

• Diagonalisation of a real square matrix
  • X a real-valued m x m matrix of rank m
  • $X = U \Lambda U^{-1}$
  • The columns of U are the eigenvectors of X and Λ is a diagonal matrix whose entries are the eigenvalues of X in decreasing order
• Diagonalisation of a symmetric real-valued matrix
  • X a real-valued m x m symmetric matrix of rank m
  • $X = U \Lambda U^{T}$
  • The columns of U are orthogonal and unit-length normalized
  • $U^{T} = U^{-1}$

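A minimal sketch of the symmetric case on a toy matrix (an assumption): X = UΛUᵀ with orthonormal U, so that Uᵀ = U⁻¹. Note that np.linalg.eigh returns the eigenvalues in increasing order, whereas the slide lists them in decreasing order.

```python
import numpy as np

A = np.random.default_rng(0).random((4, 4))
X = A + A.T                                    # toy symmetric matrix (assumption)

lam, U = np.linalg.eigh(X)                     # eigenvalues (ascending) and orthonormal U
print(np.allclose(U @ np.diag(lam) @ U.T, X))  # X = U Lambda U^T
print(np.allclose(U.T @ U, np.eye(4)))         # U^T = U^{-1}
```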

Page 8

Singular value decomposition of a rectangular matrix

• X an m x n matrix of rank r
• SVD of X
  • $X = U \Sigma V^{T}$
  • Σ is the diagonal matrix of the singular values of X
  • The singular values are the square roots of the eigenvalues of $XX^{T}$
  • U is the matrix of eigenvectors of $XX^{T}$
  • V is the matrix of eigenvectors of $X^{T}X$
• If $\mathrm{rank}(X) = r$ then
  • only r eigenvalues are non-zero
  • $\mathrm{Image}(X) = \mathrm{span}(u_1, \dots, u_r)$
  • U, V are orthogonal

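A minimal check of these relations on a toy matrix (an assumption): the singular values are the square roots of the leading eigenvalues of XXᵀ, and U and V have orthonormal columns.

```python
import numpy as np

X = np.random.default_rng(0).random((5, 3))    # toy m x n matrix (assumption), rank 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)

eig = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1][:3]   # leading eigenvalues of X X^T
print(np.allclose(s**2, eig))                  # singular values squared = eigenvalues of X X^T
print(np.allclose(U.T @ U, np.eye(3)),         # columns of U are orthonormal
      np.allclose(Vt @ Vt.T, np.eye(3)))       # columns of V are orthonormal
```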

Page 9

Latent Semantic Analysis


Page 10

LSI

• Motivation
  • Exploit term co-occurrences in documents to obtain smaller representations and to address the synonymy and polysemy problems encountered in vector space models
  • Example
    • $d = (\dots, \text{car}, \dots)$, $q = (\dots, \text{automobile}, \dots)$
• Principle
  • Project queries and documents into a reduced-dimension space in which co-occurring terms are "close"


Page 11

LSI interpretation

• X is a term x document matrix
  • m terms in rows, n documents in columns
  • $x_{ij}$ could be 0/1 or tf-idf, for example
  • The rows of $U\Sigma$ encode the projection of the terms on the latent factors
    • U is the matrix of eigenvectors of the term co-occurrence matrix $XX^{T}$
  • The columns of $\Sigma V^{T}$ encode the projection of the documents on the latent factors
    • V is the matrix of eigenvectors of the document co-occurrence matrix $X^{T}X$
  • U and V are orthonormal

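To make this interpretation concrete, here is a minimal LSI sketch on a toy term x document count matrix (the matrix, the vocabulary and k = 2 are assumptions), following the convention above: rows of U_kΣ_k as term coordinates, columns of Σ_kV_kᵀ as document coordinates, and dot products in that space as similarities.

```python
import numpy as np

# toy term x document counts (assumption); rows: car, automobile, engine, flower
X = np.array([[2., 0., 1., 0.],
              [0., 2., 1., 0.],
              [1., 1., 2., 0.],
              [0., 0., 0., 3.]])

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

term_repr = U_k * s_k                   # row i: term i in the k-dimensional latent space
doc_repr = s_k[:, None] * Vt_k          # column j: document j in the latent space

print((term_repr @ doc_repr).round(2))  # term-document similarities as dot products
```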

Page 12

• Representation of a query or a document in the term space:
  • $\hat{q} = \Sigma_k^{-1} U_k^{T} q$
  • Terms that frequently co-occur are projected to the same "place"
• The same applies to the projection of terms into the document space, using V
• Similarity computation: e.g. $RSV(q, d) = \cos(\hat{q}, \hat{d})$
• Compared to the vector space model
  • Documents are represented in a dense, reduced-size space
  • Solves some synonymy problems
  • The advantages of the inverted index are lost
  • Cost of computing the SVD

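A minimal, self-contained sketch (same toy term x document matrix as above, an assumption) of query folding and ranking: the query is projected with q̂ = Σ_k⁻¹ U_kᵀ q and documents are scored by the cosine RSV. Note how the query "car" also matches the document that only contains "automobile".

```python
import numpy as np

# same toy term x document matrix as in the previous sketch (assumption)
X = np.array([[2., 0., 1., 0.],
              [0., 2., 1., 0.],
              [1., 1., 2., 0.],
              [0., 0., 0., 3.]])
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

q = np.array([1., 0., 0., 0.])                 # query containing only the term "car"
q_hat = (U_k.T @ q) / s_k                      # q_hat = Sigma_k^{-1} U_k^T q

doc_coords = Vt_k.T                            # row j: document j in the latent space
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
scores = np.array([cos(q_hat, d) for d in doc_coords])
print(scores.round(2))                         # RSV = cos(q_hat, d_hat) for each document
```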

Page 13

Probabilistic Latent Semantic Analysis

Page 14

Preliminaries: unigram model

• Generative model of a document
  • Select the document length
  • Pick a word w with probability p(w)
  • Continue until the end of the document
• Applications
  • Classification
  • Clustering
  • Ad-hoc retrieval (language models)

$p(d) = \prod_i p(w_i|d)$

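A minimal sketch of the unigram model with made-up word probabilities: the probability of a document is the product of p(w_i|d) over its words, computed here in log space for stability.

```python
import numpy as np

p_w = {"information": 0.3, "retrieval": 0.2, "model": 0.4, "data": 0.1}  # toy p(w|d)
doc = ["information", "retrieval", "model", "model"]

log_p = sum(np.log(p_w[w]) for w in doc)   # log p(d) = sum_i log p(w_i|d)
print(np.exp(log_p))                       # p(d) = prod_i p(w_i|d)
```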

Page 15

Preliminaries - Unigram model - geometric interpretation


[Figure: word simplex over w1, w2, w3; a document d is a point on the simplex, e.g. p(w1|d) = 1/2, p(w2|d) = 1/4, p(w3|d) = 1/4]

Page 16

Latent models for document generation


• Several factors influence the creation of a document (authors, topics, mood, etc.)
  • They are usually unknown
• Generative statistical models
  • Associate the factors with latent variables
  • Identifying (learning) the latent variables allows us to uncover (inference) complex latent structures

Page 17

Probabilistic Latent Semantic Analysis - PLSA (Hofmann 99)


• Motivations
  • Several topics may be present in a document or in a document collection
  • Learn the topics from a training collection
  • Applications
    • Identify the semantic content of documents, document relationships, trends, …
    • Segment documents, ad-hoc IR, …

Page 18

PLSA


• The latent structure is a set of topics
  • Each document is generated as a set of words chosen from selected topics
  • A latent variable z (topic) is associated with each word occurrence in the document
• Generative process (see the code sketch below)
  • Select a document d, with probability P(d)
  • Iterate
    • Choose a latent class z, with probability P(z|d)
    • Generate a word w according to P(w|z)
  • Note: P(w|z) and P(z|d) are multinomial distributions over the V words and the T topics
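
Here is a minimal sketch of this generative process, with made-up values for P(d), P(z|d) and P(w|z) (two documents, two topics, a four-word vocabulary; all numbers are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["machine", "learning", "information", "retrieval"]

P_d = np.array([0.5, 0.5])                     # P(d) over two documents
P_z_d = np.array([[0.9, 0.1],                  # P(z|d), one row per document
                  [0.2, 0.8]])
P_w_z = np.array([[0.4, 0.4, 0.1, 0.1],        # P(w|z), one row per topic
                  [0.1, 0.1, 0.4, 0.4]])

d = rng.choice(2, p=P_d)                       # select a document d, P(d)
words = []
for _ in range(10):                            # iterate over word positions
    z = rng.choice(2, p=P_z_d[d])              # choose a latent class z, P(z|d)
    w = rng.choice(4, p=P_w_z[z])              # generate a word w according to P(w|z)
    words.append(vocab[w])
print(f"document {d}:", " ".join(words))
```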

Page 19

PLSA - Topic

• A topic is a distribution over words
• Remark
  • A topic is shared by several words
  • A word is associated with several topics

word          P(w|z)
machine       0.04
learning      0.01
information   0.09
retrieval     0.02
…             …



Page 20

PLSA as a graphical model


$P(d, w) = P(d)\,P(w|d)$
$P(w|d) = \sum_z P(w|z)\,P(z|d)$

Boxes represent repeated sampling

[Plate diagram: d → z → w, with P(z|d) at the document level and P(w|z) at the corpus level; plates over the D documents and the Nd words of each document]
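
A small numeric illustration of the mixture formula above (reusing the same made-up tables as the previous sketch): P(w|d) = Σ_z P(w|z) P(z|d) amounts to a matrix product of the two multinomial tables.

```python
import numpy as np

P_z_d = np.array([[0.9, 0.1],                  # P(z|d), one row per document
                  [0.2, 0.8]])
P_w_z = np.array([[0.4, 0.4, 0.1, 0.1],        # P(w|z), one row per topic
                  [0.1, 0.1, 0.4, 0.4]])

P_w_d = P_z_d @ P_w_z                          # entry (d, w) = sum_z P(z|d) P(w|z)
print(P_w_d)
print(P_w_d.sum(axis=1))                       # each row is a distribution and sums to 1
```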

Page 21

PLSA model


• Hypotheses
  • The number of values of z is fixed a priori
  • Bag of words
  • Documents are independent
    • No specific distribution over the documents
  • Conditional independence
    • z being known, w and d are independent
• Learning (see the EM sketch below)
  • Maximum likelihood: p(document collection)
  • EM algorithm and variants
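
A minimal EM sketch for PLSA on a toy term x document count matrix (the data, the random initialisation and the number of iterations are assumptions): the E-step computes the responsibilities P(z|d,w) ∝ P(w|z) P(z|d), and the M-step re-estimates the two multinomials from the expected counts.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(1, 6, size=(6, 12)).astype(float)  # toy counts n(d, w)
D, W = N.shape
T = 2                                               # number of topics, fixed a priori

P_z_d = rng.dirichlet(np.ones(T), size=D)           # P(z|d), shape (D, T)
P_w_z = rng.dirichlet(np.ones(W), size=T)           # P(w|z), shape (T, W)

for _ in range(50):
    # E-step: responsibilities P(z|d,w) proportional to P(w|z) P(z|d)
    joint = P_z_d[:, :, None] * P_w_z[None, :, :]   # shape (D, T, W)
    P_z_dw = joint / joint.sum(axis=1, keepdims=True)
    # M-step: re-estimate the multinomials from the expected counts n(d,w) P(z|d,w)
    expected = N[:, None, :] * P_z_dw               # shape (D, T, W)
    P_w_z = expected.sum(axis=0)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_d = expected.sum(axis=2)
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)

print(P_w_z.round(2))                               # learned topic-word distributions
```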

Page 22

PLSA - geometric interpretation

• Each topic is a point on the word simplex
• Documents are constrained to lie on the topic simplex
• This creates a bottleneck in the document representation

$P(w|d) = \sum_z P(w|z)\,P(z|d)$


[Figure: word simplex over w1, w2, w3; the topics topic1, topic2, topic3 span a topic simplex inside it, and document d lies on that topic simplex]

Page 23

Applications


• Thematic segmentation
• Creating document hierarchies
• IR: PLSA model
• Clustering and classification
• Image annotation
  • Learn and infer P(w|image)
• Collaborative filtering
• Note: many variants and extensions
  • E.g. Hierarchical PLSA (see Gaussier et al.)

Page 24

Latent Dirichlet Allocation - LDA (Blei et al. 2003)


• LDA is also a topic model
  • Extends PLSA
• Motivations
  • Generalization to unseen documents
    • Defines a probabilistic model over documents
      • Not present in PLSA
    • Allows generating (modeling) unseen documents
  • Overfitting
    • In PLSA, the number of parameters grows with the corpus size
    • LDA constrains the distribution of topics for each document and of words for each topic

Page 25

LDA - model


• Similar to PLSA, with the addition of a prior distribution on the topic distribution
• Generative process (see the code sketch below)
  • For a document
    • Topic distribution
      • Choose θ ~ Dirichlet(α), a distribution over topics
    • Words
      • For each document word w
        • Choose a topic z ~ Multinomial(θ)
        • Choose a word w from p(w | z, Φ), a multinomial probability conditioned on the topic z
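
A minimal sketch of this generative process (α, the topic-word table Φ and the vocabulary are made-up toy values): θ ~ Dirichlet(α), then for each word position z ~ Multinomial(θ) and w is drawn from Φ's row for topic z.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "dna", "model", "data"]
alpha = np.array([0.5, 0.5])                   # Dirichlet prior over two topics
phi = np.array([[0.45, 0.45, 0.05, 0.05],      # Phi: P(w|z) for topic 0
                [0.05, 0.05, 0.45, 0.45]])     #      P(w|z) for topic 1

theta = rng.dirichlet(alpha)                   # per-document topic distribution theta
words = []
for _ in range(12):                            # for each word position in the document
    z = rng.choice(2, p=theta)                 # choose a topic z ~ Multinomial(theta)
    w = rng.choice(4, p=phi[z])                # choose a word w from p(w | z, Phi)
    words.append(vocab[w])
print("theta:", theta.round(2), "|", " ".join(words))
```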

Page 26

LDA tagging (Blei et al. 2003)


Page 27

Finding topics in PNAS (Griffiths et al. 2004)


[Figure: PNAS categories matched to LDA topics; for each category, the mean θ_i value of its most significant topic i]

Page 28

Author-recipient topic model (McCallum et al. 2004)


Learning from Enron data

Identify:
• Topic
• Author-recipient

Page 29

Using LDA for ranking

• [Harvey et al., 2010]
