
Page 1: Information Retrieval through Various Approximate Matrix Decompositions

Information Retrieval through Various Approximate Matrix Decompositions

Kathryn Linehan
Advisor: Dr. Dianne O'Leary

Page 2: Information Retrieval through Various Approximate Matrix Decompositions

Querying a Document Database

We want to return documents that are relevant to entered search terms.

Given data:
• Term-Document Matrix, A
  • Entry (i, j): importance of term i in document j
• Query Vector, q
  • Entry (i): importance of term i in the query

Page 3: Information Retrieval through Various Approximate Matrix Decompositions

Solutions

Literal Term Matching
• Compute score vector: s = qᵀA
• Return the highest scoring documents
• May not return relevant documents that do not contain the exact query terms

Latent Semantic Indexing (LSI)
• Same process as above, but use an approximation to A
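As a concrete illustration, literal term matching is a single matrix-vector product. The sketch below uses a toy term-document matrix and query invented for this example (the project's actual data is the term-document matrices described later):

```python
import numpy as np

# Hypothetical toy term-document matrix A (terms x documents) and query vector q.
A = np.array([
    [1.0, 0.0, 2.0],   # weights of term 0 in documents 0..2
    [0.0, 1.0, 1.0],   # term 1
    [3.0, 0.0, 0.0],   # term 2
])
q = np.array([1.0, 0.0, 1.0])  # query uses terms 0 and 2

# Literal term matching: score each document by s = q^T A.
s = q @ A

# Return documents in order of decreasing score.
ranking = np.argsort(-s)
print(s)        # [4. 0. 2.]
print(ranking)  # [0 2 1]
```

LSI runs the same scoring, but with A replaced by a low-rank approximation.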

Page 4: Information Retrieval through Various Approximate Matrix Decompositions

Term-Document Matrix Approximation

Standard approximation used in LSI: rank-k SVD

Project Goal: evaluate use of term-document matrix approximations other than rank-k SVD in LSI
• Nonnegative Matrix Factorization (NMF)
• CUR Decomposition

Page 5: Information Retrieval through Various Approximate Matrix Decompositions

Matrix Approximation Validation

Let Ã be an approximation to A. As the rank of Ã increases, we expect the relative error, ||A − Ã||_F / ||A||_F, to go to zero.

Matrix approximation can be applied to any matrix A.
• Preliminary test matrix A: 50 x 30 random sparse matrix
• Future test matrices: three large sparse term-document matrices
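This validation can be sketched for the baseline rank-k SVD approximation (a Python illustration with a stand-in random sparse matrix; the NumPy code and the sparsity level are assumptions, not the project's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 50 x 30 random sparse test matrix (roughly 30% nonzeros assumed).
A = rng.random((50, 30)) * (rng.random((50, 30)) < 0.3)

U, svals, Vt = np.linalg.svd(A, full_matrices=False)
normA = np.linalg.norm(A, 'fro')

def rel_error(k):
    # Rank-k SVD approximation and its relative Frobenius error ||A - Ak||_F / ||A||_F.
    Ak = U[:, :k] * svals[:k] @ Vt[:k, :]
    return np.linalg.norm(A - Ak, 'fro') / normA

errors = [rel_error(k) for k in (5, 10, 20, 30)]
# The error decreases as k grows and is numerically zero at full rank.
print(errors)
```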

Page 6: Information Retrieval through Various Approximate Matrix Decompositions

Nonnegative Matrix Factorization (NMF)

Term-document matrix is nonnegative.

A ≈ W * H,  where A is m x n, W is m x k, and H is k x n

• W and H are nonnegative
• rank(WH) ≤ k

Page 7: Information Retrieval through Various Approximate Matrix Decompositions

NMF

Multiplicative update algorithm of Lee and Seung found in [1]
• Find W, H to minimize ||A − WH||²_F
• Random initialization for W, H
• Convergence is not guaranteed, but in practice it is very common
• Slow due to matrix multiplications in each iteration
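The multiplicative updates can be sketched in a few lines (an illustrative Python version, not the project's code; the iteration count, the eps guard against division by zero, and the dense random test matrix are assumptions):

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates minimizing ||A - WH||_F^2.
    Random nonnegative initialization, as described in the text."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # Elementwise updates; ratios of nonnegative terms keep W, H nonnegative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

A = np.random.default_rng(1).random((50, 30))  # nonnegative test matrix
W, H = nmf_multiplicative(A, k=10)
rel_err = np.linalg.norm(A - W @ H, 'fro') / np.linalg.norm(A, 'fro')
print(rel_err)
```

Each iteration costs several dense matrix multiplications, which is the slowness noted above.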

Page 8: Information Retrieval through Various Approximate Matrix Decompositions

NMF Validation

A: 50 x 30 random sparse matrix. Average over 5 runs.

[Two plots: "NMF Validation: Relative Error" (relative error vs. k) and "NMF Validation: Run Time" (run time vs. k), for k = 5 to 30]

Page 9: Information Retrieval through Various Approximate Matrix Decompositions

CUR Decomposition

Term-document matrix is sparse.

A ≈ C * U * R,  where A is m x n, C is m x c, U is c x r, and R is r x n

• C (R) holds c (r) sampled and rescaled columns (rows) of A
• U is computed using C and R
• rank(CUR) ≤ k, where k is a rank parameter

Page 10: Information Retrieval through Various Approximate Matrix Decompositions

CUR Implementations

CUR algorithm in [2] by Drineas, Kannan, and Mahoney
• Linear time algorithm
• Modification: use ideas in [3] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
• Improvement: Compact Matrix Decomposition (CMD) in [5] by Sun, Xie, Zhang, and Faloutsos
• Other modifications: our ideas

Deterministic CUR code by G. W. Stewart

Page 11: Information Retrieval through Various Approximate Matrix Decompositions

Sampling

Column (row) norm sampling [2]
• Prob(col j) = ||A(:, j)||² / ||A||²_F (similar for row i)

Subspace sampling [3]
• Uses rank-k SVD of A for column probabilities
• Prob(col j) = ||V_{A,k}(j, :)||² / k
• Uses "economy size" SVD of C for row probabilities
• Prob(row i) = ||U_C(i, :)||² / c

Sampling without replacement
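Column norm sampling with rescaling can be sketched as follows (a Python illustration of forming C; the rescaling by 1/sqrt(c * p_j) follows the linear-time algorithm of [2], but the code itself is an assumption, not the project's implementation):

```python
import numpy as np

def sample_columns(A, c, seed=0):
    """Column norm sampling with replacement, then rescaling, as in [2].
    Returns C: c sampled columns of A, column j divided by sqrt(c * p_j)."""
    rng = np.random.default_rng(seed)
    col_norms = np.sum(A**2, axis=0)
    p = col_norms / col_norms.sum()            # Prob(col j) = ||A(:,j)||^2 / ||A||_F^2
    idx = rng.choice(A.shape[1], size=c, p=p)  # sampling with replacement
    return A[:, idx] / np.sqrt(c * p[idx])     # rescale each sampled column

A = np.random.default_rng(2).random((50, 30))
C = sample_columns(A, c=10)
print(C.shape)  # (50, 10)
```

The rescaling makes C Cᵀ an unbiased estimate of A Aᵀ; row sampling for R is analogous.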

Page 12: Information Retrieval through Various Approximate Matrix Decompositions

Sampling Comparison

A: 50 x 30 random sparse matrix. Average over 5 runs.
Legend: Sampling, U, Scaling (Scaling only for without-replacement sampling)

[Two plots: "CUR Validation: Relative Error" and "CUR Validation: Run Time" vs. k, for k = 5 to 30; curves labeled CN,L; S,L; w/o R,L,w/o Sc; w/o R,L,Sc]

Page 13: Information Retrieval through Various Approximate Matrix Decompositions

Computation of U

Linear algorithm U: approximately solves min_Û ||A − CÛ||_F, where U = ÛR [2]

Optimal U: solves min_U ||A − CUR||²_F
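The optimal U has a closed form via pseudoinverses, U = C⁺ A R⁺, a standard fact for the Frobenius-norm minimization above. The sketch below is illustrative rather than the project's code; for simplicity it picks C and R by plain uniform sampling rather than the schemes of the previous slides:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((50, 30))
C = A[:, rng.choice(30, size=10, replace=False)]  # 10 sampled columns
R = A[rng.choice(50, size=10, replace=False), :]  # 10 sampled rows

# Optimal U minimizes ||A - CUR||_F: U = pinv(C) @ A @ pinv(R).
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
err = np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro')
print(err)
```

The two pseudoinverses are what make the optimal U more expensive than the linear-time approximation.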

Page 14: Information Retrieval through Various Approximate Matrix Decompositions

U Comparison

A: 50 x 30 random sparse matrix. Average over 5 runs.
Legend: Sampling, U

[Two plots: "CUR Validation: Relative Error" and "CUR Validation: Run Time" vs. k, for k = 5 to 30; curves labeled CN,L; CN,O; S,L; S,O]

Page 15: Information Retrieval through Various Approximate Matrix Decompositions

Compact Matrix Decomposition (CMD) Improvement

Remove repeated columns (rows) in C (R). Decreases storage while still achieving the same relative error [5].

Algorithm       | [2]      | [2] with CMD
Runtime         | 0.008060 | 0.007153
Storage         | 880.5    | 550.5
Relative Error  | 0.820035 | 0.820035

A: 50 x 30 random sparse matrix, k = 15. Average over 10 runs.
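The CMD idea can be sketched with a hypothetical helper (not the implementation from [5]): keep one copy of each repeated sampled column, scaled by the square root of its multiplicity, so that C'C'ᵀ equals C Cᵀ while storing fewer columns.

```python
import numpy as np

def cmd_compress(C, idx):
    """CMD-style compression: one copy of each distinct sampled column,
    scaled by sqrt(multiplicity). idx holds the column indices that were
    sampled (with replacement) when C was formed."""
    unique, counts = np.unique(idx, return_counts=True)
    first_pos = [np.argmax(idx == u) for u in unique]  # first occurrence of each index
    return C[:, first_pos] * np.sqrt(counts)

# Toy example: indices 3 and 7 are sampled repeatedly.
idx = np.array([3, 7, 3, 1, 7, 7])
C = np.random.default_rng(0).random((5, idx.size))
for u in np.unique(idx):                 # make repeats actual duplicate columns,
    pos = np.where(idx == u)[0]          # as they would be after sampling
    C[:, pos] = C[:, [pos[0]]]

Cp = cmd_compress(C, idx)
print(Cp.shape)  # (5, 3): six sampled columns stored as three
```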

Page 16: Information Retrieval through Various Approximate Matrix Decompositions

Deterministic CUR

Code by G. W. Stewart

Uses an RRQR algorithm that does not store Q
• We only need the permutation vector
• Gives us the columns (rows) for C (R)

Uses optimal U

Page 17: Information Retrieval through Various Approximate Matrix Decompositions

CUR Comparison

A: 50 x 30 random sparse matrix. Average over 5 runs.
Legend: Sampling, U, Scaling (Scaling only for without-replacement sampling)

[Two plots: "CUR Validation: Relative Error" and "CUR Validation: Run Time" vs. k, for k = 5 to 30; curves labeled CN,L; CN,O; S,L; S,O; w/o R,L,w/o Sc; w/o R,L,Sc; D]

Page 18: Information Retrieval through Various Approximate Matrix Decompositions

Future Project Goals

• Finish investigation of CUR improvement
• Validate NMF and CUR using term-document matrices
• Investigate storage, computation time, and relative error of NMF and CUR
• Test performance of NMF and CUR in LSI
  • Use average precision and recall, where the average is taken over all queries in the data set

Page 19: Information Retrieval through Various Approximate Matrix Decompositions

Precision and Recall

Measurements of performance for document retrieval.

Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, RetRel = number of documents retrieved that are relevant.

Precision: P(Retrieved) = RetRel / Retrieved

Recall: R(Retrieved) = RetRel / Relevant
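These definitions translate directly into code (a minimal Python sketch; the document IDs are invented for the example):

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Precision and recall for a single query, per the definitions above."""
    retrieved = len(retrieved_ids)                        # Retrieved
    relevant = len(relevant_ids)                          # Relevant
    ret_rel = len(set(retrieved_ids) & set(relevant_ids)) # RetRel
    return ret_rel / retrieved, ret_rel / relevant

# Hypothetical query: 4 documents retrieved, 5 relevant overall, 3 in both sets.
p, r = precision_recall([1, 2, 3, 4], [2, 3, 4, 8, 9])
print(p, r)  # 0.75 0.6
```

Averaging these over all queries in a data set gives the evaluation measure proposed above.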

Page 20: Information Retrieval through Various Approximate Matrix Decompositions

Further Topics

Time-permitting investigations
• Parallel implementations of matrix approximations
• Testing performance of matrix approximations in forming a multidocument summary

Page 21: Information Retrieval through Various Approximate Matrix Decompositions

References

[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.

[2] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.

[3] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.

[4] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.

[5] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.