TRANSCRIPT
Deduplication
CSCI 572: Information Retrieval and Search Engines
Summer 2010
May-20-10 CS572-Summer2010 CAM-2
Outline
• What is Deduplication?
• Importance
• Challenges
• Approaches
What are web duplicates?
• The same page, referenced by different URLs
  – http://espn.go.com
  – http://www.espn.com
• What are the differences?
  – URL host (virtual hosts), sometimes protocol, sometimes page name, etc.
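One common way crawlers collapse such URL-level duplicates is canonicalization. Below is a minimal sketch, assuming a couple of hypothetical normalization rules (lowercasing, folding a `www.` alias, collapsing default page names); real crawlers apply many more rules, and resolving distinct hosts like espn.go.com vs. www.espn.com generally requires DNS or content comparison, not string rewriting alone:

```python
# Sketch of URL canonicalization with a few assumed rules; real crawlers
# use far richer rule sets plus DNS/content checks for virtual hosts.
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    scheme, host, path, query, _frag = urlsplit(url)
    scheme = scheme.lower() or "http"
    host = host.lower()
    if host.startswith("www."):                    # fold a common host alias
        host = host[4:]
    if path in ("", "/index.html", "/index.htm"):  # default page names
        path = "/"
    return urlunsplit((scheme, host, path, query, ""))

print(canonicalize("http://WWW.espn.com/index.html"))  # http://espn.com/
```

Two URLs that canonicalize to the same string can then be crawled once instead of twice.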
What are web duplicates?
• Near-identical pages, referenced by the same URL
  – e.g., two Google searches for “search engines”
• What are the differences?
  – One page is within some delta % similar to the other (where delta is large, e.g., 90%+), but they may differ in, e.g., ads, counters, timestamps, etc.
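The “within delta % similar” idea can be made concrete with a set-overlap measure. A toy sketch (Jaccard overlap of word sets; the page text and the 0.8 threshold are invented for illustration):

```python
def jaccard(a, b):
    """Resemblance of two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Two fetches of the "same" page, differing only in a visitor counter.
page1 = "breaking news today markets rally as tech stocks surge visitor counter 10041".split()
page2 = "breaking news today markets rally as tech stocks surge visitor counter 10042".split()

sim = jaccard(page1, page2)
# Flag as near-duplicates if sim exceeds a chosen delta, e.g. 0.8 here.
```

Here `sim` is 11/13 ≈ 0.85, so the two fetches would be treated as one page despite the differing counter.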
Why is it important to consider duplicates?
• In search engines, URLs tell the crawlers where to go and how to navigate the information space
• Ideally, given the web’s scale and complexity, we’ll give priority to crawling content that we haven’t already stored or seen before
  – Saves resources (on the crawler end, as well as the remote host)
  – Increases crawler politeness
  – Reduces the analysis that we’ll have to do later
Why is it important to consider duplicates?
• Identification of website mirrors (or copies of content) used to spread the load and bandwidth consumption
  – Sourceforge.net, CPAN, Apache, etc.
• If you identify a mirror, you can omit crawling many web pages and save crawler resources
“More Like This”
• Finding similar content to what you were looking for
  – As we discussed during the lecture on the search engine architecture, much of the time in search engines is spent filtering through the results. Presenting similar documents can cut down on that filtering time
XML
• XML documents often appear structurally very similar
  – What’s the difference between RSS and RDF and OWL and XSL and XSLT and any number of XML documents out there?
• With the ability to identify similarity and reduce duplication of XML, we could identify XML documents with similar structure
  – RSS feeds that contain the same links
  – Differentiate RSS (crawl more often) from other, less frequently updated XML
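One simple (assumed, not the only) structural signature for XML is the set of root-to-element tag paths: two documents with identical path sets share a structure even when their text content differs, as with two RSS feeds carrying different links:

```python
# Sketch: compare XML documents by their element tag paths — a simple
# assumed structural signature; real systems use richer tree measures.
import xml.etree.ElementTree as ET

def tag_paths(xml_text):
    """Collect the set of root-to-element tag paths in a document."""
    paths = set()
    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)
    walk(ET.fromstring(xml_text), "")
    return paths

rss_a = "<rss><channel><item><link>http://a</link></item></channel></rss>"
rss_b = "<rss><channel><item><link>http://b</link></item></channel></rss>"
assert tag_paths(rss_a) == tag_paths(rss_b)  # same structure, different content
```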
Detect Plagiarism
• Determine web sites and reports that plagiarize one another
• Important for copyright laws and enforcement
• Determine similarity between source code
• Licensing issues
  – Open Source, other
Detection of SPAM
• Identifying malicious SPAM content
  – Adult sites
  – Pharmaceutical and prescription drug SPAM
  – Malware and phishing scams
• Need to ignore this content from a crawling perspective
  – Or to “flag” it and not include it in (general) search results
Challenges
• Scalability
  – Most approaches to detecting duplicates rely on training and analytical approaches that may be computationally expensive
  – Challenge is to perform the evaluation at low cost
• What to do with the duplicates?
  – The answer isn’t always to throw them out – they may be useful for study
  – The content may require indexing for later comparison in legal issues, or for “snapshotting” the web at a point in time, i.e., the Internet Archive
Challenges
• Structure versus Semantics
  – Documents that are structurally dissimilar may contain the exact same content
    • Think of the use of <b> tags versus <i> tags for emphasis in HTML
    • Need to take this into account
• Online versus offline
  – Depends on crawling strategy, but offline typically can provide more precision at the cost of the inability to react dynamically
Approaches for Deduplication
• SIMHASH and Hamming Distance
  – Treat web documents as a set of features constituting an n-dimensional vector – transform this vector into an f-bit fingerprint of a small size, e.g., 64
  – Compare fingerprints and look for a difference in at most k bits
  – Manku et al., WWW 2007
• Syntactic similarity
  – Shingling
    • Treat web documents as continuous subsequences of words
    • Compute the w-shingling
    • Broder et al., WWW 1997
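The simhash idea above can be sketched compactly. This is a simplified version (unweighted features, MD5 as an assumed per-feature hash); Manku et al. use weighted features and a carefully tuned k:

```python
# Simplified simhash sketch: fold per-feature hashes into one f-bit
# fingerprint, then compare fingerprints by Hamming distance.
import hashlib

def simhash(tokens, f=64):
    v = [0] * f
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(f):
            v[i] += 1 if (h >> i) & 1 else -1   # vote per bit position
    return sum(1 << i for i in range(f) if v[i] > 0)

def hamming(x, y):
    """Number of bit positions where the two fingerprints differ."""
    return bin(x ^ y).count("1")

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumped over the lazy dog".split()
d = hamming(simhash(doc1), simhash(doc2))
# Declare near-duplicates when d <= k for a small k (Manku et al. use k = 3
# for 64-bit fingerprints).
```

Because similar documents vote the same way on most bit positions, their fingerprints differ in only a few bits, while unrelated documents differ in roughly half.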
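Shingling can likewise be sketched in a few lines. This follows Broder’s resemblance definition (Jaccard overlap of w-shingle sets); the example documents are the classic “a rose is a rose” illustration, and real systems estimate the overlap with sampled min-hashes rather than full sets:

```python
def shingles(words, w=4):
    """The set of contiguous w-word subsequences (w-shingles)."""
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Broder's resemblance: Jaccard overlap of the two shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

d1 = "a rose is a rose is a rose".split()
d2 = "a rose is a flower which is a rose".split()
r = resemblance(d1, d2)  # 1 shared shingle of 8 distinct -> 0.125
```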
Approaches for Deduplication
• Link structure similarity
  – Identify similarity in the linkages between web collections
  – Cho et al.
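As a toy illustration of the link-structure signal (a hypothetical sketch, not the cited algorithm): pages whose outgoing link sets overlap heavily are candidate duplicates or mirrors, even before comparing their text.

```python
# Hypothetical sketch: score two pages as candidate duplicates/mirrors by
# the overlap of their outgoing link sets (one simple link-structure signal).
def link_similarity(links_a, links_b):
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

mirror_a = ["/about", "/downloads", "/docs", "/news"]
mirror_b = ["/about", "/downloads", "/docs", "/contact"]
print(link_similarity(mirror_a, mirror_b))  # 0.6
```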
Approaches for Deduplication
• Exploiting the structure and links between physical network hosts
  – Look at:
    • Language
    • Geographical connection
    • Continuations and proxies
  – Zipfian function
  – Bharat et al., ICDM 2001
Wrapup
• Need deduplication for conserving resources and ensuring the quality and accuracy of the resultant search indices
  – Can assist in other areas like plagiarism, SPAM detection, fraud detection, etc.
• Deduplication at web scale is difficult; we need efficient means to perform this computation online or offline
• Techniques look at page structure/content, page link structure, or physical web node structure