
Page 1: Information Retrieval through Various Approximate Matrix Decompositions


Information Retrieval through Various Approximate Matrix Decompositions

Kathryn Linehan

Advisor: Dr. Dianne O’Leary

Page 2: Information Retrieval through Various Approximate Matrix Decompositions


Information Retrieval

Extracting information from databases

We need an efficient way of searching large amounts of data

Example: web search engine

Page 3: Information Retrieval through Various Approximate Matrix Decompositions


Querying a Document Database

We want to return documents that are relevant to entered search terms

Given data:

• Term-Document Matrix, A
  • Entry (i, j): importance of term i in document j

• Query Vector, q
  • Entry (i): importance of term i in the query

Page 4: Information Retrieval through Various Approximate Matrix Decompositions

Term-Document Matrix

Entry (i, j): weight of term i in document j

Example (taken from [5]):

              Document
Term       1    2    3    4
Mark      15    0    0    0
Twain     15    0   20    0
Samuel     0   10    5    0
Clemens    0   20   10    0
Purple     0    0    0   20
Fairy      0    0    0   15

Page 5: Information Retrieval through Various Approximate Matrix Decompositions

Query Vector

Entry (i): weight of term i in the query

Example (taken from [5]): search for "Mark Twain"

Term       q
Mark       1
Twain      1
Samuel     0
Clemens    0
Purple     0
Fairy      0

Page 6: Information Retrieval through Various Approximate Matrix Decompositions

Document Scoring

The score of document j is entry j of A^T q: the inner product of column j of A with the query vector q.

Example (taken from [5]), with A and q from the previous slides:

scores = A^T q = (30, 0, 20, 0)  for Docs 1-4

Doc 1 and Doc 3 will be returned as relevant, but Doc 2 will not
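A minimal NumPy sketch of this scoring step, using the example matrix and query reconstructed above (variable names are ours):

```python
import numpy as np

# Term-document matrix from the example above.
# Rows (terms): Mark, Twain, Samuel, Clemens, Purple, Fairy; columns: Docs 1-4.
A = np.array([
    [15,  0,  0,  0],   # Mark
    [15,  0, 20,  0],   # Twain
    [ 0, 10,  5,  0],   # Samuel
    [ 0, 20, 10,  0],   # Clemens
    [ 0,  0,  0, 20],   # Purple
    [ 0,  0,  0, 15],   # Fairy
])

# Query vector for the search "Mark Twain".
q = np.array([1, 1, 0, 0, 0, 0])

scores = A.T @ q   # one score per document
print(scores)      # [30  0 20  0] -> Docs 1 and 3 returned, Doc 2 missed
```

Doc 2 is about Samuel Clemens, so it is relevant to "Mark Twain" but shares no terms with the query; that gap is what the approximations below aim to close.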

Page 7: Information Retrieval through Various Approximate Matrix Decompositions

Can we do better if we replace the matrix by an approximation?

Singular Value Decomposition (SVD):  A ≈ U Σ V^T

Nonnegative Matrix Factorization (NMF):  A ≈ W H

CUR Decomposition:  A ≈ C U R

Page 8: Information Retrieval through Various Approximate Matrix Decompositions

Nonnegative Matrix Factorization (NMF)

A    ≈    W  *  H
(m x n)   (m x k) (k x n)

• W and H are nonnegative

• rank(WH) ≤ k

Storage: k(m + n) entries

Page 9: Information Retrieval through Various Approximate Matrix Decompositions

NMF

Multiplicative update algorithm of Lee and Seung found in [1]

• Find W, H to minimize (1/2) ||A - WH||_F^2

• Random initialization for W, H

• Gradient descent method

• Slow due to matrix multiplications in each iteration
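A minimal NumPy sketch of the multiplicative updates, assuming the standard Lee-Seung form of the algorithm described in [1]; the function name, iteration count, and eps safeguard are ours:

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Multiplicative updates for min (1/2)*||A - W@H||_F^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))   # random nonnegative initialization
    H = rng.random((k, n))
    for _ in range(iters):
        # Each update multiplies elementwise by a nonnegative ratio,
        # so W and H stay nonnegative; eps guards against division by zero.
        # The matrix products here are what makes each iteration slow.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

The relative error plotted on the next slides would then be np.linalg.norm(A - W @ H, 'fro') / np.linalg.norm(A, 'fro').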

Page 10: Information Retrieval through Various Approximate Matrix Decompositions

NMF Validation

[Figure: NMF Validation: Relative Error. Relative error (Frobenius norm) vs. k = rank(WH), rank(SVD) ≤ k, for NMF and SVD. Left: A, a 5 x 3 random dense matrix, average over 5 runs. Right: B, a 500 x 200 random sparse matrix, average over 5 runs.]

Page 11: Information Retrieval through Various Approximate Matrix Decompositions

NMF Validation

[Figure: Relative Error vs. Iteration Number for NMF, with the SVD error shown for reference. B: 500 x 200 random sparse matrix. Rank(NMF) = 80.]

Page 12: Information Retrieval through Various Approximate Matrix Decompositions

CUR Decomposition

A    ≈    C  *  U  *  R
(m x n)   (m x c) (c x r) (r x n)

• C (R) holds c (r) sampled and rescaled columns (rows) of A

• U is computed using C and R

• rank(CUR) ≤ k, where k is a rank parameter

Storage: nz(C) + cr + nz(R) entries

Page 13: Information Retrieval through Various Approximate Matrix Decompositions

CUR Implementations

CUR algorithm in [3] by Drineas, Kannan, and Mahoney
• Linear time algorithm

• Improvement: Compact Matrix Decomposition (CMD) in [6] by Sun, Xie, Zhang, and Faloutsos

• Modification: use ideas in [4] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)

• Other modifications: our ideas

Deterministic CUR code by G. W. Stewart [2]

Page 14: Information Retrieval through Various Approximate Matrix Decompositions

Sampling

Column (row) norm sampling [3]

• Prob(col j) = ||A(:, j)||_2^2 / ||A||_F^2   (similar for row i)

Subspace sampling [4]

• Uses rank-k SVD of A for column probabilities

• Prob(col j) = ||V_{A,k}(j, :)||_2^2 / k

• Uses "economy size" SVD of C for row probabilities

• Prob(row i) = ||U_C(i, :)||_2^2 / c

Sampling without replacement
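A sketch of column norm sampling with the rescaling used in [3]; dividing each sampled column by sqrt(c * p_j) makes C @ C.T an unbiased estimate of A @ A.T. The helper name is ours:

```python
import numpy as np

def column_norm_sample(A, c, seed=0):
    """Sample c columns of A with probability ||A(:,j)||^2 / ||A||_F^2,
    rescaled so that C @ C.T estimates A @ A.T."""
    rng = np.random.default_rng(seed)
    p = np.sum(A * A, axis=0) / np.sum(A * A)       # Prob(col j)
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])             # rescale sampled columns
    return C, idx, p
```

Row sampling for R is the same computation applied to A.T.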

Page 15: Information Retrieval through Various Approximate Matrix Decompositions

Computation of U

Linear U [3]: approximately solves min over Û of ||A - C Û||_F

  Û = ((C^T C)_k)^+ C^T A

(in [3] the product C^T A is estimated from the sampled rows, which keeps the algorithm linear time)

Optimal U: solves min over U of ||A - C U R||_F^2

  U = (C^T C)^+ C^T A R^T (R R^T)^+
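In code, the optimal U reduces to a pair of pseudoinverses, since (C^T C)^+ C^T = pinv(C) and R^T (R R^T)^+ = pinv(R). A minimal sketch, a direct transcription of the formula above:

```python
import numpy as np

def optimal_U(A, C, R):
    """U minimizing ||A - C @ U @ R||_F, i.e. U = pinv(C) @ A @ pinv(R)."""
    return np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
```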

Page 16: Information Retrieval through Various Approximate Matrix Decompositions

Deterministic CUR

Code by G. W. Stewart [2]

Uses a rank-revealing QR (RRQR) algorithm that does not store Q
• We only need the permutation vector
• Gives us the columns (rows) for C (R)

Uses an optimal U

Page 17: Information Retrieval through Various Approximate Matrix Decompositions

Compact Matrix Decomposition (CMD) Improvement

Remove repeated columns (rows) in C (R)

Decreases storage while still achieving the same relative error [6]

A: 50 x 30 random sparse matrix, k = 15. Average over 10 runs.

                     [3]          [3] with CMD
Runtime           0.008060        0.007153
Storage           880.5           550.5
Relative Error    0.820035        0.820035
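Sampling with replacement can pick the same column of A several times. CMD stores one copy per distinct index and, per the construction in [6], scales it by the square root of its multiplicity, which leaves C @ C.T unchanged. A sketch building on the sampler above (helper name ours):

```python
import numpy as np

def cmd_dedup(A, idx, p, c):
    """Deduplicated C: one copy of each distinct sampled column, scaled by
    sqrt(multiplicity) on top of the usual 1/sqrt(c * p_j) rescaling."""
    unique, counts = np.unique(idx, return_counts=True)
    return A[:, unique] * np.sqrt(counts) / np.sqrt(c * p[unique])
```

With idx and p from column_norm_sample above, this is where the storage saving in the table comes from: duplicate sampled columns are never stored.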

Page 18: Information Retrieval through Various Approximate Matrix Decompositions

CUR: Sampling with Replacement Validation

[Figure: CUR Validation: Relative Error. Relative error (Frobenius norm) vs. c/r = number of columns/rows sampled, for column norm (CN) and subspace (S) sampling, each with linear (L) and optimal (O) U, against the SVD. Panels: A, a 5 x 3 random dense matrix, and B, a 500 x 200 random sparse matrix (two sampling scales). Average over 5 runs. Legend: Sampling, U.]

Page 19: Information Retrieval through Various Approximate Matrix Decompositions

Sampling without Replacement: Scaling vs. No Scaling

Invert the scaling factor applied to

• U = ((C^T C)_k)^+ C^T

Page 20: Information Retrieval through Various Approximate Matrix Decompositions

CUR: Sampling without Replacement Validation

[Figure: CUR Validation: Relative Error. Relative error (Frobenius norm) vs. k (rank(CUR) ≤ k, rank(SVD) ≤ k) for sampling without replacement with linear U, with and without scaling (w/o R,L,Sc and w/o R,L,w/o Sc), against the SVD. Left: A, a 5 x 3 random dense matrix, r = 2k, c = k. Right: B, a 500 x 200 random sparse matrix, c = 3k, r = k. Average over 5 runs. Legend: Sampling, U, Scaling.]

Page 21: Information Retrieval through Various Approximate Matrix Decompositions

CUR Comparison

[Figure: CUR Validation: Relative Error. Relative error (Frobenius norm) vs. k (rank(CUR) ≤ k, rank(SVD) ≤ k) for all CUR variants: column norm (CN) and subspace (S) sampling with linear (L) and optimal (O) U, sampling without replacement with and without scaling, and deterministic CUR (D), against the SVD. Left: r = c = 2k. Right: r = c = k. B: 500 x 200 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling.]

Page 22: Information Retrieval through Various Approximate Matrix Decompositions

Judging Success: Precision and Recall

Measurement of performance for document retrieval

• Average precision and recall, where the average is taken over all queries in the data set

• Let Retrieved = number of documents retrieved,
  Relevant = total number of relevant documents to the query,
  RetRel = number of documents retrieved that are relevant.

• Precision: P(Retrieved) = RetRel / Retrieved

• Recall: R(Retrieved) = RetRel / Relevant
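For a single query these are direct to compute; averaging them over all queries in the data set gives the curves on the next slides. A minimal sketch (names ours):

```python
import numpy as np

def precision_recall(scores, relevant, n_retrieved):
    """Precision and recall after retrieving the n_retrieved top-scoring
    documents. scores: one score per document; relevant: boolean mask."""
    retrieved = np.argsort(-scores)[:n_retrieved]      # top-scoring docs
    retrel = np.count_nonzero(relevant[retrieved])     # retrieved and relevant
    return retrel / n_retrieved, retrel / np.count_nonzero(relevant)
```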

Page 23: Information Retrieval through Various Approximate Matrix Decompositions

LSI Results

Term-document matrix size: 5831 x 1033. All matrix approximations are rank 100 approximations (CUR: r = c = k). Average query time is less than 10^-3 seconds for all matrix approximations.

[Figure: Average precision and average recall vs. number of documents retrieved (0 to 1033), for SVD, NMF, the CUR variants (cn,lin; cn,opt; sub,lin; sub,opt; w/oR,no; w/oR,yes; GWS), and LTM.]

Page 24: Information Retrieval through Various Approximate Matrix Decompositions

LSI Results

[Figure: Average precision and average recall vs. number of documents retrieved (5 to 50), for the same approximations as above.]

Term-document matrix size: 5831 x 1033. All matrix approximations are rank 100 approximations (CUR: r = c = k).

Page 25: Information Retrieval through Various Approximate Matrix Decompositions

Matrix Approximation Results

                 Rel. Error (F-norm)   Storage (nz)   Runtime (sec)
SVD                   0.8203              686500         22.5664
NMF                   0.8409              686400         23.0210
CUR: cn,lin           1.4151               17242          0.1741
CUR: cn,opt           0.9724               16358          0.2808
CUR: sub,lin          1.2093               16175         48.7651
CUR: sub,opt          0.9615               16108         49.0830
CUR: w/oR,no          0.9931               17932          0.3466
CUR: w/oR,yes         0.9957               17220          0.2734
CUR: GWS              0.9437               25020          2.2857
LTM                     --                 52003            --

Page 26: Information Retrieval through Various Approximate Matrix Decompositions

Conclusions

We may not be able to store an entire term-document matrix, and it may be too expensive to compute an SVD

We can achieve LSI results that are almost as good with cheaper approximations
• Less storage
• Less computation time

Page 27: Information Retrieval through Various Approximate Matrix Decompositions

Completed Project Goals

• Code/validate NMF and CUR
• Analyze relative error, runtime, and storage of NMF and CUR
• Improve CUR algorithm of [3]
• Analyze use of NMF and CUR in LSI

Page 28: Information Retrieval through Various Approximate Matrix Decompositions

References

[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.

[2] M. W. Berry, S. A. Pulatova, and G. W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. Technical Report UMIACS TR-2004-34 CMSC TR-4591, University of Maryland, May 2004.

[3] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.

[4] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.

[5] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.

[6] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.