Pairwise Document Similarity in Large Collections with MapReduce. Tamer Elsayed, Jimmy Lin, and Douglas W. Oard. Association for Computational Linguistics, 2008. May 15, 2014, Kyung-Bin Lim.


Page 1: Pairwise Document Similarity in Large Collections with  MapReduce

Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
Association for Computational Linguistics, 2008

May 15, 2014
Kyung-Bin Lim

Page 2: Pairwise Document Similarity in Large Collections with  MapReduce


Outline

Introduction Methodology Discussion Conclusion

Page 3: Pairwise Document Similarity in Large Collections with  MapReduce


Pairwise Similarity of Documents

PubMed – “More like this”
Similar blog posts
Google – Similar pages

Page 4: Pairwise Document Similarity in Large Collections with  MapReduce


Abstract Problem

Applications:
- Clustering
- “More-like-that” queries

[Figure: a collection of documents mapped to pairwise similarity scores; sample values shown include 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, and 0.74]

Page 5: Pairwise Document Similarity in Large Collections with  MapReduce


Outline

Introduction Methodology Results Conclusion

Page 6: Pairwise Document Similarity in Large Collections with  MapReduce


Trivial Solution

Load each vector O(N) times
O(N^2) dot products

Goal: a scalable and efficient solution for large collections
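The trivial solution can be sketched in a few lines of Python. This is only an illustration, assuming each document is a sparse vector stored as a `{term: weight}` dict and using raw term counts as weights; the names `dot` and `naive_pairwise` are hypothetical, and the three toy documents are the ones used in the inverted-index example of this deck.

```python
from itertools import combinations

def dot(u, v):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[t] for t, w in u.items() if t in v)

def naive_pairwise(docs):
    """Score every unordered pair of documents: N * (N - 1) / 2 dot products."""
    return {
        (a, b): dot(docs[a], docs[b])
        for a, b in combinations(sorted(docs), 2)
    }

# Toy collection (term counts used as weights; an assumption for illustration).
docs = {
    "d1": {"A": 2, "B": 1, "C": 1},
    "d2": {"B": 1, "D": 2},
    "d3": {"A": 1, "B": 2, "E": 1},
}
scores = naive_pairwise(docs)
print(scores)
```

Every vector takes part in N - 1 dot products, which is why this approach does not scale to large collections.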

Page 7: Pairwise Document Similarity in Large Collections with  MapReduce


Better Solution

Load weights for each term once
Each term contributes O(df_t^2) partial scores
Each term contributes only if it appears in both documents of a pair
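The points on this slide correspond to the usual inner-product form of document similarity. The block below is a sketch under assumed notation that is not in the slide text: w_{t,d} is the weight of term t in document d, V is the vocabulary, and df_t is the number of documents containing t.

```latex
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i} \, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i} \, w_{t,d_j},
\qquad
\#\{\text{pairs sharing } t\} = \binom{df_t}{2} = \frac{df_t \,(df_t - 1)}{2} = O(df_t^2)
```

The second sum holds because a term that is missing from either document has zero weight there, so its product vanishes.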

Page 8: Pairwise Document Similarity in Large Collections with  MapReduce


Better Solution

A term contributes to each pair that contains it

For example, if a term t1 appears in documents x, y, z, then t1 contributes to the pairs (x, y), (x, z), (y, z)

The list of documents that contain a particular term is an inverted index
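The t1 example can be reproduced with a short Python sketch. The helper name `pairs_for_term` is hypothetical; it simply enumerates the unordered pairs in one term's list of documents.

```python
from itertools import combinations

def pairs_for_term(postings):
    """All unordered document pairs that share a term, given its document list."""
    return list(combinations(postings, 2))

pairs = pairs_for_term(["x", "y", "z"])
print(pairs)  # [('x', 'y'), ('x', 'z'), ('y', 'z')]
```

A term found in df documents yields df * (df - 1) / 2 pairs, so very common terms account for most of the work.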

Page 9: Pairwise Document Similarity in Large Collections with  MapReduce


Algorithm

Page 10: Pairwise Document Similarity in Large Collections with  MapReduce


MapReduce Programming

Framework that supports distributed computing on clusters of computers

Introduced by Google in 2004
Map step
Reduce step
Combine step (optional)
Applications
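To make the map, combine, and reduce steps concrete, here is a single-process imitation in Python. It is a teaching sketch, not the Hadoop or Google API: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and word counting is used only as a familiar example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer, combiner=None):
    """Imitate map -> (optional combine) -> shuffle -> reduce in one process."""
    groups = defaultdict(list)  # shuffle: values grouped by key
    for record in records:
        local = defaultdict(list)  # output of one map task
        for key, value in mapper(record):
            local[key].append(value)
        for key, values in local.items():
            if combiner is not None:
                # The combiner pre-aggregates one map task's output locally.
                values = [combiner(key, values)]
            groups[key].extend(values)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def wc_map(line):
    """Emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    """Sum the counts collected for one word."""
    return sum(counts)

counts = run_mapreduce(["a b a", "b c"], wc_map, wc_reduce, combiner=wc_reduce)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

The combiner can reuse the reducer here because addition is associative and commutative; that is not true of every reduce function.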

Page 11: Pairwise Document Similarity in Large Collections with  MapReduce


MapReduce Model

Page 12: Pairwise Document Similarity in Large Collections with  MapReduce


Computation Decomposition

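Load weights for each term once
Each term contributes O(df_t^2) partial scores
Each term contributes only if it appears in both documents of a pair

map: compute each term's partial scores
reduce: sum the partial scores for each document pair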

Page 13: Pairwise Document Similarity in Large Collections with  MapReduce


MapReduce Jobs

(1) Inverted Index Computation

(2) Pairwise Similarity

Page 14: Pairwise Document Similarity in Large Collections with  MapReduce


Job1: Inverted Index

Input documents:
d1: A A B C
d2: B D D
d3: A B B E

map:
d1 -> (A,(d1,2)) (B,(d1,1)) (C,(d1,1))
d2 -> (B,(d2,1)) (D,(d2,2))
d3 -> (A,(d3,1)) (B,(d3,2)) (E,(d3,1))

shuffle, then reduce (one reducer call per term):
(A,[(d1,2), (d3,1)])
(B,[(d1,1), (d2,1), (d3,2)])
(C,[(d1,1)])
(D,[(d2,2)])
(E,[(d3,1)])
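The inverted-index job above can be imitated in plain Python. This is a single-process sketch, not Hadoop code; `index_map` and `build_index` are hypothetical names, and term frequency is used as the weight, as in the slide.

```python
from collections import Counter, defaultdict

def index_map(doc_id, text):
    """Map step: emit (term, (doc_id, tf)) for each distinct term in a document."""
    for term, tf in sorted(Counter(text.split()).items()):
        yield term, (doc_id, tf)

def build_index(docs):
    """Shuffle by term, then reduce by collecting each term's postings list."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term, posting in index_map(doc_id, docs[doc_id]):
            postings[term].append(posting)
    return dict(sorted(postings.items()))

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
index = build_index(docs)
for term, plist in index.items():
    print(term, plist)
```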

Page 15: Pairwise Document Similarity in Large Collections with  MapReduce


Job2: Pairwise Similarity

map (one mapper call per postings list):
(A,[(d1,2), (d3,1)]) -> ((d1,d3),2)
(B,[(d1,1), (d2,1), (d3,2)]) -> ((d1,d2),1) ((d1,d3),2) ((d2,d3),2)
(C,[(d1,1)]) -> no output
(D,[(d2,2)]) -> no output
(E,[(d3,1)]) -> no output

shuffle:
((d1,d2),[1])
((d1,d3),[2,2])
((d2,d3),[2])

reduce:
((d1,d2),1)
((d1,d3),4)
((d2,d3),2)
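The pairwise-similarity job can be sketched the same way. Again this is a single-process illustration with hypothetical names (`pairs_map`, `pairwise_similarity`); the index literal is the output of Job 1 from this deck, with term frequencies as weights.

```python
from collections import defaultdict
from itertools import combinations

def pairs_map(term, postings):
    """Map step: emit ((di, dj), wi * wj) for every pair in one postings list."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def pairwise_similarity(index):
    """Shuffle partial scores by document pair, then reduce by summing them."""
    partials = defaultdict(list)
    for term, postings in index.items():
        for pair, score in pairs_map(term, postings):
            partials[pair].append(score)
    return {pair: sum(scores) for pair, scores in sorted(partials.items())}

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
sims = pairwise_similarity(index)
print(sims)  # {('d1', 'd2'): 1, ('d1', 'd3'): 4, ('d2', 'd3'): 2}
```

Terms that occur in a single document (C, D, E) emit nothing, so they add no intermediate pairs.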

Page 16: Pairwise Document Similarity in Large Collections with  MapReduce


Implementation Issues

df-cut
- Drop common terms
- Intermediate tuples are dominated by very high df terms
- Implemented a 99% cut
- Efficiency vs. effectiveness
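A df-cut can be sketched as a filter on the inverted index. The code below assumes that a cut at fraction p keeps the p share of terms with the lowest document frequency and discards the rest (so a 99% cut drops the most frequent 1% of terms); that reading, the function names, and the toy index are assumptions for illustration.

```python
def apply_df_cut(index, keep_fraction=0.99):
    """Keep the keep_fraction of terms with the lowest df; drop the most common."""
    by_df = sorted(index, key=lambda term: (len(index[term]), term))
    n_keep = int(len(by_df) * keep_fraction)
    kept = set(by_df[:n_keep])
    return {term: plist for term, plist in index.items() if term in kept}

def intermediate_pairs(index):
    """Number of (pair, score) tuples the pairwise job would emit."""
    return sum(len(p) * (len(p) - 1) // 2 for p in index.values())

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
# With only five terms, an 80% cut is used so that exactly one term is dropped.
cut = apply_df_cut(index, keep_fraction=0.8)
print(sorted(cut), intermediate_pairs(index), intermediate_pairs(cut))
```

On this toy index the cut removes B, the highest-df term, and the number of intermediate pairs falls from 4 to 1. The price is that pairs related only through dropped terms lose their score, which is the efficiency versus effectiveness tradeoff.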

Page 17: Pairwise Document Similarity in Large Collections with  MapReduce


Outline

Introduction Methodology Results Conclusion

Page 18: Pairwise Document Similarity in Large Collections with  MapReduce


Experimental Setup

Hadoop 0.16.0
Cluster of 19 machines
- Each with two processors (single core)
AQUAINT-2 collection
- 2.5 GB of text
- 906k documents
Okapi BM25
Subsets of collection

Page 19: Pairwise Document Similarity in Large Collections with  MapReduce


Running Time of Pairwise Similarity Comparisons

[Figure: computation time (minutes, 0 to 140) versus corpus size (%, 0 to 100), with a linear fit of R^2 = 0.997]

Page 20: Pairwise Document Similarity in Large Collections with  MapReduce


Number of Intermediate Pairs

[Figure: intermediate pairs (billions, 0 to 9,000) versus corpus size (%, 0 to 100); series: df-cut at 99%, df-cut at 99.9%, df-cut at 99.99%, df-cut at 99.999%, and no df-cut]

Page 21: Pairwise Document Similarity in Large Collections with  MapReduce


Outline

Introduction Methodology Results Conclusion

Page 22: Pairwise Document Similarity in Large Collections with  MapReduce


Conclusion

Simple and efficient MapReduce solution
- About 2 hours for a ~million-document collection

Effective linear-time-scaling approximation
- A 99.9% df-cut achieves 98% relative accuracy
- The df-cut controls the efficiency vs. effectiveness tradeoff