Latent Semantic Analysis Hongning Wang CS@UVa


TRANSCRIPT

Page 1:

Latent Semantic Analysis

Hongning Wang, CS@UVa

Page 2:


VS model in practice

• Document and query are represented by term vectors
  – Terms are not necessarily orthogonal to each other
    • Synonymy: car vs. automobile
    • Polysemy: fly (action vs. insect)

Page 3:


Choosing basis for VS model

• A concept space is preferred
  – The semantic gap will be bridged

[Figure: documents D1–D5 and a query plotted in a concept space with axes Sports, Education, and Finance]

Page 4:


How to build such a space

• Automatic term expansion
  – Construction of a thesaurus
    • WordNet
  – Clustering of words
• Word sense disambiguation
  – Dictionary-based
    • The relation between a pair of words should be similar in the text and in the dictionary's description
  – Explore word usage context

Page 5:


How to build such a space

• Latent Semantic Analysis
  – Assumption: there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval
  – In other words: the observed term-document association data is contaminated by random noise

Page 6:


How to build such a space

• Solution
  – Low-rank matrix approximation

[Figure: the observed term-document matrix is modeled as the *true* concept-document matrix plus random noise over the word selection in each document]

Page 7:


Latent Semantic Analysis (LSA)

• Low-rank approximation of the term-document matrix
  – Goal: remove noise in the observed term-document association data
  – Solution: find a matrix $\hat{C}$ of rank $k$ that is closest to the original matrix $C$ in terms of the Frobenius norm:
    $\hat{C} = \arg\min_{C':\, \mathrm{rank}(C') = k} \lVert C - C' \rVert_F$, where $\lVert A \rVert_F = \sqrt{\sum_i \sum_j a_{ij}^2}$

Page 8:


Basic concepts in linear algebra

• Symmetric matrix
  – $A = A^T$
• Rank of a matrix
  – Number of linearly independent rows (columns) in a matrix

Page 9:


Basic concepts in linear algebra

• Eigen system
  – For a square matrix $A \in \mathbb{R}^{n \times n}$
  – If $A\mathbf{x} = \lambda \mathbf{x}$, then $\mathbf{x}$ is called a right eigenvector of $A$ and $\lambda$ is the corresponding eigenvalue
• For a symmetric full-rank matrix $S$
  – We have its eigen-decomposition $S = Q \Lambda Q^T$
    • where the columns of $Q$ are the orthogonal and normalized eigenvectors of $S$, and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $S$

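A minimal NumPy sketch of this decomposition, using a made-up symmetric matrix S:

```python
import numpy as np

# A small symmetric matrix (made-up example)
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is specialized for symmetric matrices: it returns real eigenvalues
# (in ascending order) and orthonormal eigenvectors
eigvals, Q = np.linalg.eigh(S)
Lambda = np.diag(eigvals)

# Verify the eigen-decomposition S = Q Lambda Q^T
assert np.allclose(S, Q @ Lambda @ Q.T)
print(eigvals)  # [1. 3.]
```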

Page 10:


Basic concepts in linear algebra

• Singular value decomposition (SVD)
  – For a matrix $C \in \mathbb{R}^{M \times N}$ with rank $r$, we have $C = U \Sigma V^T$
    • where $U \in \mathbb{R}^{M \times M}$ and $V \in \mathbb{R}^{N \times N}$ are orthogonal matrices, and $\Sigma \in \mathbb{R}^{M \times N}$ is a diagonal matrix with entries $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$; the $\sigma_i^2$ are the eigenvalues of $CC^T$
  – We define $C_k = U \Sigma_k V^T$
    • where the $\sigma_i$ are placed in descending order, and $\Sigma_k$ keeps $\sigma_i$ for $i \le k$ and sets $\sigma_i = 0$ for $i > k$

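The same relationships in a short NumPy sketch (C is a made-up matrix; numpy.linalg.svd returns the singular values already in descending order):

```python
import numpy as np

# Made-up 3x2 matrix
C = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

U, sigma, Vt = np.linalg.svd(C)            # full SVD: U is 3x3, Vt is 2x2
Sigma = np.zeros(C.shape)
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)

assert np.allclose(C, U @ Sigma @ Vt)      # C = U Sigma V^T

# The squared singular values are the (nonzero) eigenvalues of C C^T
eigvals = np.linalg.eigvalsh(C @ C.T)[::-1]  # descending order
assert np.allclose(eigvals[:len(sigma)], sigma**2)
```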

Page 11:


Latent Semantic Analysis (LSA)

• Solve LSA by SVD
  – Procedure of LSA
    1. Perform SVD on the term-document adjacency matrix: $C = U \Sigma V^T$
    2. Construct $C_k = U \Sigma_k V^T$ by keeping only the $k$ largest singular values in $\Sigma$ non-zero
  – By the Eckart–Young theorem, this $C_k$ is exactly the rank-$k$ matrix closest to $C$ in Frobenius norm
  – This maps terms and documents to a lower-dimensional space

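A minimal sketch of this procedure (made-up matrix and $k$), assuming NumPy:

```python
import numpy as np

def lsa(C, k):
    """Rank-k approximation of a term-document matrix C via truncated SVD."""
    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    U_k, sigma_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]  # keep k largest
    C_k = U_k @ np.diag(sigma_k) @ Vt_k     # rank-k approximation of C
    return C_k, U_k, sigma_k, Vt_k

# Made-up 4-term x 3-document count matrix
C = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])

C_k, U_k, sigma_k, Vt_k = lsa(C, k=2)
print(np.linalg.norm(C - C_k))  # Frobenius-norm error of the approximation
```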

Page 12:


Latent Semantic Analysis (LSA)

• Another interpretation
  – $C$ is the term-document adjacency matrix
    • $[C^T C]_{ij}$: document-document similarity, obtained by counting how many terms co-occur in $d_i$ and $d_j$
  – Eigen-decomposition of the document-document similarity matrix: $C^T C = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T$
  – The new representation of document $d_j$ in this system (space) is then the $j$-th column of $\Sigma V^T$
  – In the lower-dimensional space, we only use the first $k$ elements of $\Sigma \mathbf{v}_j$ to represent $d_j$
  – The same analysis applies to $C C^T$, the term-term co-occurrence matrix, yielding term representations from $U \Sigma$

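To make the document-side reading concrete, a short NumPy sketch (made-up matrix) showing that inner products between the $k$-dimensional document vectors approximate the co-occurrence similarities in $C^T C$:

```python
import numpy as np

# Made-up 4-term x 3-document matrix
C = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])
k = 2

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
docs_k = np.diag(sigma[:k]) @ Vt[:k, :]  # columns = k-dim document vectors

# Inner products in the latent space approximate C^T C
print(docs_k.T @ docs_k)
print(C.T @ C)
```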

Page 13:


Geometric interpretation of LSA

• $[C_k]_{ij}$ measures the relatedness between term $w_i$ and document $d_j$ in the $k$-dimensional space
• Therefore
  – As $C_k = U \Sigma_k V^T$, we have $[C_k]_{ij} = \mathbf{u}_i^T \Sigma_k \mathbf{v}_j$
  – term $w_i$ is represented as $\Sigma_k^{1/2} \mathbf{u}_i$
  – document $d_j$ is represented as $\Sigma_k^{1/2} \mathbf{v}_j$
  – so that $[C_k]_{ij}$ is exactly the inner product of the two representations

Page 14:


Latent Semantic Analysis (LSA)

• Visualization

[Figure: terms plotted in the 2-D LSA space, forming an HCI cluster and a graph-theory cluster]

Page 15:


What are those dimensions in LSA

• Principal component analysis

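For context: PCA is the same decomposition applied after mean-centering the data, whereas LSA applies SVD to the raw (uncentered) term-document matrix. A minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # made-up data: 100 samples, 5 features

X_centered = X - X.mean(axis=0)        # PCA first centers each feature
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt[:2]                    # top-2 principal directions
projected = X_centered @ components.T  # 2-D PCA representation of the data
```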

Page 16:


Latent Semantic Analysis (LSA)

• What we have achieved via LSA
  – Terms/documents that are closely associated are placed near one another in this new space
  – Terms that do not occur in a document may still be close to it, if that is consistent with the major patterns of association in the data
  – A good choice of concept space for the VS model!

Page 17:


LSA for retrieval

• Project queries into the new document space: $\hat{q} = \Sigma_k^{-1} U_k^T q$
  – Treat the query as a pseudo-document represented by its term vector $q$
  – Rank documents by cosine similarity between the query and the documents in this lower-dimensional space

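A minimal end-to-end sketch of this (made-up matrix and query; the folding-in formula is the standard one above):

```python
import numpy as np

# Made-up 5-term x 4-document count matrix
C = np.array([[1., 0., 0., 1.],
              [1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])
k = 2

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(sigma[:k]), Vt[:k, :]

q = np.array([1., 1., 0., 0., 0.])     # query as a pseudo-document term vector
q_k = np.linalg.inv(S_k) @ U_k.T @ q   # fold the query into the latent space

# Cosine similarity against the k-dim document coordinates (columns of V_k^T)
scores = (Vt_k.T @ q_k) / (np.linalg.norm(Vt_k, axis=0) * np.linalg.norm(q_k))
print(np.argsort(-scores))             # document indices, best match first
```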

Page 18:


LSA for retrieval

q: “human computer interaction”

[Figure: the query q projected into the 2-D LSA space, landing near the HCI cluster and away from the graph-theory cluster]

Page 19:


Discussions

• Computationally expensive
  – Time complexity: a full SVD of an $M \times N$ matrix costs $O(\min(MN^2, M^2N))$
• Empirically helpful for recall but not for precision
  – Recall increases as $k$ decreases
  – Optimal choice of $k$ is an open question
• Difficult to handle a dynamic corpus
• Difficult to interpret the decomposition results
  – We will come back to this later!

Page 20:


LSA beyond text

• Collaborative filtering: apply the same low-rank approximation to the user-item rating matrix

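A minimal sketch of this idea (made-up ratings; zeros stand in for missing entries, which a production recommender would handle with proper matrix-completion methods rather than plain SVD):

```python
import numpy as np

# Made-up user-item rating matrix (rows: users, columns: items; 0 = unrated)
R = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [0., 1., 5., 4.],
              [1., 0., 4., 5.]])
k = 2

U, sigma, Vt = np.linalg.svd(R, full_matrices=False)
R_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]  # rank-k reconstruction

# Reconstructed values at the unrated positions serve as rating predictions
print(np.round(R_k, 2))
```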

Page 21:


LSA beyond text

• Eigenfaces: the same decomposition applied to a matrix of face images; the leading components are the "eigenfaces"


Page 22:


LSA beyond text

• A "cat" neuron from a deep neural network
  – One of the neurons in an artificial neural network, trained on still frames from unlabeled YouTube videos, learned to detect cats.

Page 23:


What you should know

• Assumption in LSA
• Interpretation of LSA
  – Low-rank matrix approximation
  – Eigen-decomposition of the co-occurrence matrices for documents and terms
• LSA for IR