Download - Detecting Near Duplicates for Web Crawling
![Page 1: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/1.jpg)
Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma
Presenter: Siyuan Hua
![Page 2: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/2.jpg)
Application and why
Algorithm
Google story
Q&A
![Page 3: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/3.jpg)
Web Documents
Files in a file system
E-mails
Domain-specific corpora
![Page 4: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/4.jpg)
Web Mirrors Clustering for “related documents” Data extraction Plagiarism Spam detection Duplicates in domain-specific corpora
![Page 5: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/5.jpg)
Simhash compute each document to a f bit value and each bit is relevant to a unique feature of the document
Properties of simhash value:◦ The fingerprint of a document is a “ hash” value of its
features◦ Similar documents have similar hash values
![Page 6: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/6.jpg)
Definition:◦ Given a collection of f-bit fingerprints and a query fingerprint F, identify
whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version there are set of query fingerprints instead of a single query fingerprint)
Simple Solution:◦ Linear search O(mn) time
Scale Problem: ◦ 1M query document against 8 billion( ) existing web pages in100
seconds. ◦ Simple solution require comparisons! (impossible in 100
seconds)
342
2034 22
![Page 7: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/7.jpg)
Oberservation:◦ Pre-compute all F’ such that Hamming distance between F’ and F is at
most k. Assume K=3 F’ and comparisons! Too much time!
◦ Pre-compute all F’ such that some existing fingerprint is at most Hamming distance k away from F’. Too much space!
416643
64
)2log(
3
64 34
![Page 8: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/8.jpg)
Their solution:◦ Initiation: They build t tables: . Associated with table Ti are two
quantities: an integer and a permutation over the f bit-positions.
◦ Given fingerprint F and an integer k, we probe these tables in parallel:
◦ Step 1: Identify all permuted fingerprints in Ti whose top bit-positions match the top bit-positions of (F).
◦ Step 2: For each of the permuted fingerprints identified in Step 1, check if it differs from (F) in at most k bit positions.
Example:◦ 64 bit fingerprint divided to 6 blocks can build 20 tables◦ Space: Reasonable! Time: Awesome!
tTTT ,...,, 21
iiP
i
iP
iP
i
GB6420 )8)2(log(20 34
![Page 9: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/9.jpg)
Exploration of Design Parameters:◦ (1) A small set of permutations to avoid blowup in space requirements◦ (2) Large values for various Pi to avoid checking too many fingerprints in
Step 2.
Tradeoff◦ Increasing the number of tables increases pi and hence reduces the
query time. Decreasing the number of tables reduces storage requirements, but reduces pi and hence increases the query time
![Page 10: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/10.jpg)
Story:◦ Assume that existing fingerprints are stored in file F and that the batch of
query fingerprints are stored in file Q. With 8B 64-bit fingerprints, file F will occupy 64GB
◦ They use GFS files which is broken into 64MB chunks. Each chunk is replicated at three (almost) randomly chosen machines in a cluster, each chunk is stored as a file in the local file system.
◦ F is divided to 64-MB chunk while Q keeps entirety.
◦ MapReduce computes all the duplications in parallel
![Page 11: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/11.jpg)
![Page 12: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/12.jpg)
![Page 13: Detecting Near Duplicates for Web Crawling](https://reader033.vdocuments.net/reader033/viewer/2022061514/56813cb3550346895da65f30/html5/thumbnails/13.jpg)