Deduplication

CSCI 572: Information Retrieval and Search Engines

Summer 2010


Outline

• What is Deduplication?
• Importance
• Challenges
• Approaches


What are web duplicates?

• The same page, referenced by different URLs
  – http://espn.go.com and http://www.espn.com

• What are the differences?
  – The URL host (virtual hosts), sometimes the protocol, sometimes the page name, etc. (a normalization sketch follows below)
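One common first step against syntactic URL variants is canonicalization before fetching. The sketch below is a minimal illustration, assuming a few ad-hoc rules (lowercased scheme and host, default ports dropped, a trailing index.html stripped); it is not a complete normalizer, and true host aliases such as espn.go.com versus www.espn.com generally require an alias table or content comparison on top of it.

    from urllib.parse import urlparse, urlunparse

    def canonicalize(url):
        """Map syntactic URL variants onto one canonical form."""
        p = urlparse(url)
        host = p.hostname or ""  # urlparse lowercases the host
        # Keep the port only when it is not the scheme's default,
        # so http://host:80/ and http://host/ collide.
        if p.port and (p.scheme, p.port) not in {("http", 80), ("https", 443)}:
            host = f"{host}:{p.port}"
        path = p.path or "/"
        if path.endswith("/index.html"):  # assumed rule, not universal
            path = path[:-len("index.html")]
        # Drop the fragment: it never reaches the server anyway.
        return urlunparse((p.scheme, host, path, "", p.query, ""))

    print(canonicalize("HTTP://WWW.ESPN.com:80/index.html"))
    # -> http://www.espn.com/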


What are web duplicates?

• Near-identical pages, referenced by the same URL
  – e.g., two Google searches for “search engines” run at different times

• What are the differences?
  – Each page is within some delta % similar to the other (where delta is a large number), but they may differ in, e.g., ads, counters, timestamps, etc. (a toy check follows below)
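To make the “within delta % similar” idea concrete, here is a toy check that uses Python’s difflib similarity ratio as a stand-in for a real near-duplicate measure; the 0.9 threshold is an arbitrary assumption for illustration.

    import difflib

    def near_duplicate(a, b, delta=0.9):
        """True if the two page bodies are at least delta similar."""
        return difflib.SequenceMatcher(None, a, b).ratio() >= delta

    page1 = "Results 1-10 of about 2,340,000 for search engines ... ad #1"
    page2 = "Results 1-10 of about 2,340,000 for search engines ... ad #2"
    print(near_duplicate(page1, page2))  # True: only the ad differs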


Why is it important to consider duplicates?

• In search engines, URLs tell the crawlers where to go and how to navigate the information space

• Ideally, given the web’s scale and complexity, we’ll give priority to crawling content that we haven’t already stored or seen before
  – Saves resources (on the crawler end, as well as the remote host)
  – Increases crawler politeness
  – Reduces the analysis that we’ll have to do later


Why is it important to consider duplicates?

• Identification of website mirrors (or copies of content) used to spread the load and bandwidth consumption
  – Sourceforge.net, CPAN, Apache, etc.

• If you identify a mirror, you can omit crawling many web pages and save crawler resources


“More Like This”

• Finding similar content to what you were looking for
  – As we discussed during the lecture on search engine architecture, much of the time in search engines is spent filtering through the results. Presenting similar documents can cut down on that filtering time.


XML

• XML documents often appear structurally very similar
  – What’s the difference between RSS and RDF and OWL and XSL and XSLT and any number of XML documents out there?

• With the ability to identify similarity and reduce duplication of XML, we could identify XML documents with similar structure (a sketch follows below)
  – RSS feeds that contain the same links
  – Differentiate RSS (crawl more often) from other, less frequently updated XML
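As one hedged interpretation of “similar structure,” the sketch below compares two XML documents by the set of root-to-element tag paths they contain, ignoring text content; the path-set representation is an assumption made for the example, not a technique prescribed by the lecture.

    import xml.etree.ElementTree as ET

    def tag_paths(xml_text):
        """Collect every root-to-element tag path, ignoring text content."""
        paths = set()
        def walk(elem, prefix):
            path = prefix + "/" + elem.tag
            paths.add(path)
            for child in elem:
                walk(child, path)
        walk(ET.fromstring(xml_text), "")
        return paths

    feed_a = "<rss><channel><title>A</title><item><link>x</link></item></channel></rss>"
    feed_b = "<rss><channel><title>B</title><item><link>y</link></item></channel></rss>"
    a, b = tag_paths(feed_a), tag_paths(feed_b)
    print(len(a & b) / len(a | b))  # 1.0: identical structure, different content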


Detect Plagiarism

• Determine web sites and reports that plagiarize one another

• Important for copyright laws and enforcement

• Determine similarity between source code

• Licensing issues
  – Open Source, other


Detection of SPAM

• Identifying malicious SPAM content
  – Adult sites
  – Pharmaceutical and prescription drug SPAM
  – Malware and phishing scams

• Need to ignore this content from a crawling perspective
  – Or to “flag” it and not include it in (general) search results


Challenges

• Scalability
  – Most approaches to detecting duplicates rely on training and analytical techniques that may be computationally expensive
  – The challenge is to perform the evaluation at low cost

• What to do with the duplicates?
  – The answer isn’t always to throw them out – they may be useful for study
  – The content may require indexing for later comparison in legal issues, or for “snapshotting” the web at a point in time, e.g., the Internet Archive


Challenges

• Structure versus semantics
  – Documents that are structurally dissimilar may contain the exact same content
    • Think of the use of <b> tags versus <i> tags for emphasis in HTML
    • Need to take this into account (see the sketch after this list)

• Online versus offline
  – Depends on the crawling strategy, but offline typically provides more precision at the cost of the inability to react dynamically
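One simple way to take markup differences into account is to compare the visible text rather than the raw HTML, so that <b> and <i> variants collapse to the same content. A minimal sketch using Python’s standard-library HTMLParser:

    from html.parser import HTMLParser

    class TextOnly(HTMLParser):
        """Accumulate only character data, discarding all tags."""
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)

    def visible_text(html):
        parser = TextOnly()
        parser.feed(html)
        return " ".join("".join(parser.chunks).split())

    print(visible_text("<p>A <b>fast</b> crawler</p>") ==
          visible_text("<p>A <i>fast</i> crawler</p>"))  # True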


Approaches for Deduplication

• SIMHASH and Hamming distance
  – Treat a web document as a set of features, constituting an n-dimensional vector; transform this vector into a small f-bit fingerprint, e.g., f = 64
  – Compare fingerprints and look for a difference of at most k bits
  – Manku et al., WWW 2007

• Syntactic similarity
  – Shingling
    • Treat a web document as contiguous subsequences of words
    • Compute its w-shingling
    • Broder et al., WWW 1997

(Both techniques are sketched below.)
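The sketch below illustrates both ideas at toy scale: w-shingling with Jaccard resemblance in the spirit of Broder et al., and a simhash-style fingerprint compared by Hamming distance in the spirit of Manku et al. The feature hashing (first 8 bytes of MD5) and the shingle width w = 4 are assumptions made for the example, not the papers’ exact choices; Manku et al. report that k = 3 differing bits works well for f = 64.

    import hashlib

    def shingles(text, w=4):
        """The set of w-word shingles (contiguous word subsequences)."""
        words = text.lower().split()
        return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

    def resemblance(a, b):
        """Jaccard overlap of two shingle sets."""
        return len(a & b) / len(a | b) if a | b else 1.0

    def simhash(features, f=64):
        """Fold per-feature hashes into one f-bit fingerprint."""
        v = [0] * f
        for feat in features:
            h = int.from_bytes(hashlib.md5(feat.encode()).digest()[:8], "big")
            for i in range(f):
                v[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(f) if v[i] > 0)

    def hamming(x, y):
        """Number of bit positions in which two fingerprints differ."""
        return bin(x ^ y).count("1")

    a = shingles("the quick brown fox jumps over the lazy dog today")
    b = shingles("the quick brown fox jumped over the lazy dog today")
    print(resemblance(a, b))                # shingle overlap in [0, 1]
    print(hamming(simhash(a), simhash(b)))  # near-duplicate if <= k bits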


Approaches for Deduplication

• Link structure similarity
  – Identify similarity in the linkages between web collections (a toy proxy is sketched below)
  – Cho et al.
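This is not Cho et al.’s algorithm; purely as a toy proxy for link-structure similarity, the sketch below scores two pages by the Jaccard overlap of their outgoing link sets, so heavily overlapping link sets hint at mirrored collections.

    def link_similarity(links_a, links_b):
        """Jaccard overlap of two pages' outgoing link sets."""
        a, b = set(links_a), set(links_b)
        return len(a & b) / len(a | b) if a | b else 1.0

    mirror1 = ["/about", "/download", "/docs", "http://apache.org"]
    mirror2 = ["/about", "/download", "/docs", "http://apache.org", "/local-note"]
    print(link_similarity(mirror1, mirror2))  # 0.8: likely mirrored structure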


Approaches for Deduplication

• Exploiting the structure and links between physical network hosts
  – Look at:
    • Language
    • Geographical connection
    • Continuations and proxies
  – Zipfian function
  – Bharat et al., ICDM 2001


Wrapup

• Need deduplication for conserving resources and ensuring the quality and accuracy of the resultant search indices
  – Can assist in other areas like plagiarism detection, SPAM detection, fraud detection, etc.

• Deduplication at web scale is difficult; we need efficient means to perform this computation online or offline

• Techniques look at page structure/content, page link structure, or physical web node structure