TRANSCRIPT
Deduplication
CSCI 572: Information Retrieval and Search Engines
Summer 2010
May-20-10 CS572-Summer2010 CAM-2
Outline
• What is Deduplication?
• Importance
• Challenges
• Approaches
What are web duplicates?
• The same page, referenced by different URLs
  – http://espn.go.com
  – http://www.espn.com
• What are the differences?
  – URL host (virtual hosts), sometimes protocol, sometimes page name, etc.
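One common way crawlers collapse such URL-level duplicates is canonicalization. Below is a minimal sketch, assuming a couple of hypothetical normalization rules (lowercasing, folding a `www.` alias, collapsing default page names); real crawlers apply many more rules, and resolving distinct hosts like espn.go.com vs. www.espn.com generally requires DNS or content comparison, not string rewriting alone:

```python
# Sketch of URL canonicalization with a few assumed rules; real crawlers
# use far richer rule sets plus DNS/content checks for virtual hosts.
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    scheme, host, path, query, _frag = urlsplit(url)
    scheme = scheme.lower() or "http"
    host = host.lower()
    if host.startswith("www."):                    # fold a common host alias
        host = host[4:]
    if path in ("", "/index.html", "/index.htm"):  # default page names
        path = "/"
    return urlunsplit((scheme, host, path, query, ""))

print(canonicalize("http://WWW.espn.com/index.html"))  # http://espn.com/
```

Two URLs that canonicalize to the same string can then be crawled once instead of twice.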
What are web duplicates?
• Near-identical pages, referenced by the same URL
  – e.g., two Google searches for “search engines”
• What are the differences?
  – One page is within some delta % similar to the other (where delta is large, e.g., 90%+), but they may differ in, e.g., ads, counters, timestamps, etc.
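The “within delta % similar” idea can be made concrete with a set-overlap measure. A toy sketch (Jaccard overlap of word sets; the page text and the 0.8 threshold are invented for illustration):

```python
def jaccard(a, b):
    """Resemblance of two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Two fetches of the "same" page, differing only in a visitor counter.
page1 = "breaking news today markets rally as tech stocks surge visitor counter 10041".split()
page2 = "breaking news today markets rally as tech stocks surge visitor counter 10042".split()

sim = jaccard(page1, page2)
# Flag as near-duplicates if sim exceeds a chosen delta, e.g. 0.8 here.
```

Here `sim` is 11/13 ≈ 0.85, so the two fetches would be treated as one page despite the differing counter.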
Why is it important to consider duplicates?
• In search engines, URLs tell the crawlers where to go and how to navigate the information space
• Ideally, given the web’s scale and complexity, we’ll give priority to crawling content that we haven’t already stored or seen before
  – Saves resources (on the crawler end, as well as the remote host)
  – Increases crawler politeness
  – Reduces the analysis that we’ll have to do later
Why is it important to consider duplicates?
• Identification of website mirrors (or copies of content) used to spread the load and bandwidth consumption
  – Sourceforge.net, CPAN, Apache, etc.
• If you identify a mirror, you can omit crawling many web pages and save crawler resources
“More Like This”
• Finding similar content to what you were looking for
  – As we discussed during the lecture on the search engine architecture, much of the time in search engines is spent filtering through the results. Presenting similar documents can cut down on that filtering time
XML
• XML documents often appear structurally very similar
  – What’s the difference between RSS and RDF and OWL and XSL and XSLT and any number of XML documents out there?
• With the ability to identify similarity and reduce duplication of XML, we could identify XML documents with similar structure
  – RSS feeds that contain the same links
  – Differentiate RSS (crawl more often) from other, less frequently updated XML
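One simple (assumed, not the only) structural signature for XML is the set of root-to-element tag paths: two documents with identical path sets share a structure even when their text content differs, as with two RSS feeds carrying different links:

```python
# Sketch: compare XML documents by their element tag paths — a simple
# assumed structural signature; real systems use richer tree measures.
import xml.etree.ElementTree as ET

def tag_paths(xml_text):
    """Collect the set of root-to-element tag paths in a document."""
    paths = set()
    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)
    walk(ET.fromstring(xml_text), "")
    return paths

rss_a = "<rss><channel><item><link>http://a</link></item></channel></rss>"
rss_b = "<rss><channel><item><link>http://b</link></item></channel></rss>"
assert tag_paths(rss_a) == tag_paths(rss_b)  # same structure, different content
```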
Detect Plagiarism
• Determine web sites and reports that plagiarize one another
• Important for copyright laws and enforcement
• Determine similarity between source code
• Licensing issues
  – Open Source, other
Detection of SPAM
• Identifying malicious SPAM content
  – Adult sites
  – Pharmaceutical and prescription drug SPAM
  – Malware and phishing scams
• Need to ignore this content from a crawling perspective
  – Or to “flag” it and not include it in (general) search results
Challenges
• Scalability
  – Most approaches to detecting duplicates rely on training and analytical approaches that may be computationally expensive
  – Challenge is to perform the evaluation at low cost
• What to do with the duplicates?
  – The answer isn’t always to throw them out – they may be useful for study
  – The content may require indexing for later comparison in legal issues, or for “snapshotting” the web at a point in time, i.e., the Internet Archive
Challenges
• Structure versus Semantics
  – Documents that are structurally dissimilar may contain the exact same content
    • Think of the use of <b> tags versus <i> tags for emphasis in HTML
    • Need to take this into account
• Online versus offline
  – Depends on crawling strategy, but offline typically can provide more precision at the cost of the inability to react dynamically
Approaches for Deduplication
• SIMHASH and Hamming Distance
  – Treat web documents as a set of features constituting an n-dimensional vector – transform this vector into an f-bit fingerprint of a small size, e.g., 64
  – Compare fingerprints and look for a difference in at most k bits
  – Manku et al., WWW 2007
• Syntactic similarity
  – Shingling
    • Treat web documents as continuous subsequences of words
    • Compute the w-shingling
    • Broder et al., WWW 1997
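The simhash idea above can be sketched compactly. This is a simplified version (unweighted features, MD5 as an assumed per-feature hash); Manku et al. use weighted features and a carefully tuned k:

```python
# Simplified simhash sketch: fold per-feature hashes into one f-bit
# fingerprint, then compare fingerprints by Hamming distance.
import hashlib

def simhash(tokens, f=64):
    v = [0] * f
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(f):
            v[i] += 1 if (h >> i) & 1 else -1   # vote per bit position
    return sum(1 << i for i in range(f) if v[i] > 0)

def hamming(x, y):
    """Number of bit positions where the two fingerprints differ."""
    return bin(x ^ y).count("1")

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumped over the lazy dog".split()
d = hamming(simhash(doc1), simhash(doc2))
# Declare near-duplicates when d <= k for a small k (Manku et al. use k = 3
# for 64-bit fingerprints).
```

Because similar documents vote the same way on most bit positions, their fingerprints differ in only a few bits, while unrelated documents differ in roughly half.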
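Shingling can likewise be sketched in a few lines. This follows Broder’s resemblance definition (Jaccard overlap of w-shingle sets); the example documents are the classic “a rose is a rose” illustration, and real systems estimate the overlap with sampled min-hashes rather than full sets:

```python
def shingles(words, w=4):
    """The set of contiguous w-word subsequences (w-shingles)."""
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Broder's resemblance: Jaccard overlap of the two shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

d1 = "a rose is a rose is a rose".split()
d2 = "a rose is a flower which is a rose".split()
r = resemblance(d1, d2)  # 1 shared shingle of 8 distinct -> 0.125
```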
Approaches for Deduplication
• Link structure similarity
  – Identify similarity in the linkages between web collections
  – Cho et al.
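As a toy illustration of the link-structure signal (a hypothetical sketch, not the cited algorithm): pages whose outgoing link sets overlap heavily are candidate duplicates or mirrors, even before comparing their text.

```python
# Hypothetical sketch: score two pages as candidate duplicates/mirrors by
# the overlap of their outgoing link sets (one simple link-structure signal).
def link_similarity(links_a, links_b):
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

mirror_a = ["/about", "/downloads", "/docs", "/news"]
mirror_b = ["/about", "/downloads", "/docs", "/contact"]
print(link_similarity(mirror_a, mirror_b))  # 0.6
```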
Approaches for Deduplication
• Exploiting the structure and links between physical network hosts
  – Look at:
    • Language
    • Geographical connection
    • Continuations and proxies
  – Zipfian function
  – Bharat et al., ICDM 2001
Wrapup
• Need deduplication for conserving resources and ensuring the quality and accuracy of the resultant search indices
  – Can assist in other areas like plagiarism, SPAM detection, fraud detection, etc.
• Deduplication at web scale is difficult; we need efficient means to perform this computation online or offline
• Techniques look at page structure/content, page link structure, or physical web node structure