
Do Not Crawl in the DUST: Different URLs with Similar Text

ZIV BAR-YOSSEF and IDIT KEIDAR

Technion Israel Institute of Technology

and

URI SCHONFELD

University of California Los Angeles

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Clustering

General Terms: Algorithms

Additional Key Words and Phrases: search engines, crawling, URL normalization, duplicate detection, anti-aliasing

Authors' addresses: Z. Bar-Yossef, Dept. of Electrical Engineering, Technion, Haifa 32000, Israel, and Google Haifa Engineering Center, Israel ([email protected]); I. Keidar, Dept. of Electrical Engineering, Technion, Haifa 32000, Israel ([email protected]); U. Schonfeld, Dept. of Computer Science, University of California Los Angeles, CA 90095, USA ([email protected]).
Supported by the European Commission Marie Curie International Re-integration Grant and by Bank Hapoalim.
A conference version of this paper appeared at the 16th International World-Wide Web Conference, 2007.
Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2009 ACM 0000-0000/2009/0000-0111 $5.00

1. INTRODUCTION

The DUST problem. The web is abundant with DUST: Different URLs with Similar Text [Kelly and Mogul 2002; Douglis et al. 1997; McCown and Nelson 2006; Kim et al. 2006]. For example, the URLs http://google.com/news and http://news.google.com return similar content. Adding a trailing slash or /index.html to either returns the same result. Many web sites define links, redirections, or aliases, such as allowing the tilde symbol ˜ to replace a string like /people. A single web server often has multiple DNS names, and any can be typed in the URL. As the above examples illustrate, DUST is typically not random, but rather stems from some general rules, which we call DUST rules, such as “˜” → “/people”, or “/index.html” at the end of the URL can be omitted.

DUST rules are typically not universal. Many are artifacts of a particular web server implementation. For example, URLs of dynamically generated pages often include parameters; which parameters impact the page's content is up to the software that generates the pages. Some sites use their own conventions; for example, a forum site we studied allows accessing story number “num” both via the URL http://domain/story?id=num and via http://domain/story_num. Our study of the CNN web site has discovered that URLs of the form http://cnn.com/money/whatever get redirected to http://money.cnn.com/whatever. In this paper, we focus on mining DUST rules within a given web site. We are not aware of any previous work tackling this problem.

Standard techniques for avoiding DUST employ universal rules, such as adding http:// or removing a trailing slash, in order to obtain some level of canonization. Additional DUST is found by comparing document sketches.


However, this is conducted on a page-by-page basis, and all the pages must be fetched in order to employ this technique. By knowing DUST rules, one can reduce the overhead of this process. In particular, information about many redundant URLs can be represented succinctly in the form of a short list of rules. Once the rules are obtained, they can be used to avoid fetching duplicate pages altogether, including pages that were never crawled before. The rules can be obtained after crawling a small subset of a web site, or may be retained from a previous crawl of the same site. The latter is particularly useful in dynamic web sites, like blogs and news sites, where new pages are constantly added. Finally, the use of rules is robust to web site structure changes, since the rules can be validated anew before each crawl by fetching a small number of pages. For example, in a crawl of a small news site we examined, the number of URLs fetched would have been reduced by 26%.

Knowledge about DUST rules can be valuable for search engines for additional reasons: DUST rules allow for a canonical URL representation, thereby reducing overhead in indexing and caching [Kelly and Mogul 2002; Douglis et al. 1997], and increasing the accuracy of page metrics, like PageRank.

We focus on URLs with similar contents rather than identical ones, since different versions of the same document are not always identical; they tend to differ in insignificant ways, e.g., counters, dates, and advertisements. Likewise, some URL parameters impact only the way pages are displayed (fonts, image sizes, etc.) without altering their contents.

Detecting DUST from a URL list. Contrary to initial intuition, we show that it is possible to discover likely DUST rules without fetching a single web page. We present an algorithm, DustBuster, which discovers such likely rules from a list of URLs. Such a URL list can be obtained from many sources, including a previous crawl or web server logs.[1]

[1] Increasingly many web server logs are available nowadays to search engines via protocols like Google Sitemaps [Google Inc.].

The rules are then verified (or refuted) by sampling a small number of actual web pages. The fact that DustBuster's input is a list of URLs rather than a collection of web pages significantly reduces its running time and storage requirements.

At first glance, it is not clear that a URL list can provide reliable information regarding DUST, as it does not include actual page contents. We show, however, how to use a URL list to discover two types of DUST rules: substring substitutions, which are similar to the “replace” function in editors, and parameter substitutions. A substring substitution rule α → β replaces an occurrence of the string α in a URL by the string β. A parameter substitution rule replaces the value of a parameter in a URL by some default value. Thanks to the standard syntax of parameter usage in URLs, detecting parameter substitution rules is fairly straightforward. Most of our work therefore focuses on substring substitution rules.
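To make the two rule types concrete, the following Python sketch (our illustration, not DustBuster's code) applies a substring substitution rule and a parameter substitution rule to a single URL; replacing only the first occurrence follows the convention adopted in Section 5.4, and the example URLs are hypothetical.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative sketch only; not the paper's implementation.
def apply_substring_rule(url, alpha, beta):
    # Replace the first occurrence of alpha by beta (the convention of Section 5.4).
    return url.replace(alpha, beta, 1)

def apply_parameter_rule(url, param, default=None):
    # Replace the value of `param` by a default value, or omit the parameter when default is None.
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if default is None:
        pairs = [(k, v) for k, v in pairs if k != param]
    else:
        pairs = [(k, default if k == param else v) for k, v in pairs]
    return urlunsplit(parts._replace(query=urlencode(pairs)))

print(apply_substring_rule("http://www.site.com/index.html", "/index.html", ""))
# -> http://www.site.com
print(apply_parameter_rule("http://domain/story?id=42&print=1", "print"))
# -> http://domain/story?id=42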

DustBuster uses three heuristics, which together are very effective at detecting likely DUST rules and distinguishing them from invalid ones. The first heuristic leverages the observation that if a rule α → β is common in a web site, then we can expect to find in the URL list multiple examples of pages accessed both ways. For example, in the site where story?id= can be replaced by story_, we are likely to see many different URL pairs that differ only in this substring; we say that such a pair of URLs is an instance of the rule “story?id=” → “story_”. The set of all instances of a rule is called the rule's support. Our first attempt to uncover DUST is therefore to seek rules that have large support.

Nevertheless, some rules that have large support are not valid DUST rules, meaning their support includes many instances, URL pairs, whose associated documents are not similar. For example, in one site we found an invalid DUST rule, “movie-forum” → “politics-forum”, whose instances included pairs of URLs of the form http://movie-forum.com/story_〈num〉 and http://politics-forum.com/story_〈num〉. In this case the URLs were associated with two unrelated stories that happen to share the same “story id” number. Another example is the rule “1” → “2”, which emanates from instances like pic-1.jpg and pic-2.jpg, story_1 and story_2, and lect1 and lect2, none of which are DUST since the pairs of URLs are not associated with similar documents. Our second and third heuristics address the challenge of eliminating such invalid rules. The second heuristic is based on the observation that invalid rules tend to flock together. For example, in most instances of “1” → “2”, one could also replace the “1” by other digits. We therefore ignore rules that come in large groups.

Further eliminating invalid rules requires calculating the fraction of DUST in the support of each rule. How could this be done without inspecting page content? Our third heuristic uses cues from the URL list to guess which instances are likely to be DUST and which are not. In case the URL list is produced from a previous crawl, we typically have document sketches [Broder et al. 1997] available for each URL in the list. These sketches can be used to estimate the similarity between documents and thus to eliminate rules whose support does not contain sufficiently many DUST pairs.

In case the URL list is produced from web server logs, document sketches are not available. The only cue about the contents of URLs in these logs is the sizes of these contents. We thus use the size field from the log to filter out instances (URL pairs) that have “mismatching” sizes. The difficulty with size-based filtering is that the size of a dynamic page can vary dramatically, e.g., when many users comment on an interesting story or when a web page is personalized. To account for such variability, we compare the ranges of sizes seen in all accesses to each page. When the size ranges of two URLs do not overlap, they are unlikely to be DUST.

Having discovered likely DUST rules, another challenge that needs to be addressed is eliminating redundant ones. For example, the rule “http://site-name/story?id=” → “http://site-name/story_” will be discovered, along with many consisting of substrings thereof, e.g., “?id=” → “_”. However, without considering the content itself, it is not obvious which rule should be kept in such situations: the latter could be either valid in all cases, or invalid outside the context of the former. We are able to use support information from the URL list to remove many redundant likely DUST rules. We remove additional redundancies after performing some validations, and thus compile a succinct list of rules.

Canonization. Once the correct DUST rules are discovered, we exploit them for URL canonization. The problem of finding a canonical set of URLs for a given URL list is NP-hard, by reduction from the minimum dominating set problem. Despite this, we have devised an efficient canonization algorithm that typically succeeds in transforming URLs to a site-specific canonical form.

Experimental results. We experiment with DustBuster on four web sites with very different characteristics. Two of our experiments use web server logs, and two use crawl outputs. We find that DustBuster can discover rules very effectively from moderate-sized URL lists, with as little as 20,000 entries. Limited sampling is then used in order to validate or refute each rule.

Our experiments show that up to 90% of the top ten rules discovered by DustBuster prior to the validation phase are found to be valid, and in most sites 70% of the top 100 rules are valid. Furthermore, DUST rules discovered by DustBuster may account for 47% of the DUST in a web site, and using DustBuster can reduce a crawl by up to 26%.

Roadmap. The rest of this paper is organized as follows. Section 2 reviews related work. We formally define the DUST detection and canonization problems in Section 3. Section 4 presents the basic heuristics our algorithm uses. DustBuster and the canonization algorithm appear in Section 5. Section 6 presents experimental results. We end with some concluding remarks in Section 7.

2. RELATED WORK

2.1 Canonization rules

Global canonization rules. The most obvious and naive way DUST is dealt with today is through standard canonization. URLs have a very standard structure [Berners-Lee et al.]. The hostname may have many different aliases, and different hostnames may return the exact same site. For example, adding a “www” to the base hostname often returns the same content. Choosing one hostname to identify each site is a standard way to canonize a URL. Other standard canonization techniques include replacing a “//” with a single “/” and removing the index.html suffix. However, site-specific DUST rules cannot be detected using these simple rules.

Site-specific canonization rules. Another method for discovering DUST rules is to examine the web server configuration file and file system. In the configuration file, alias rules are defined. Each such rule allows a directory in the web server to be accessed using a different name, an alias. By parsing the web server configuration file [AC2 ] one could easily learn these rules. Further inspection of the file system may uncover symbolic links. These too have the exact same effect.

There are three main problems with using this technique. The main problem is that symbolic links and aliases are by no means the sole source of DUST rules. Other sources include parameters that do not affect the content, different dynamic files that produce the same content, and many others. Our technique, therefore, can discover a wider range of DUST rules, regardless of cause. The second problem is that each configuration file is different according to the type and version of the web server. The third and final problem is that once the files are parsed and processed, the rules discovered would have to be transferred to the search engine. Transferring these rules from the web server to the search engine may be possible in the future if new protocols are defined and adopted. The Google Sitemaps architecture [Google Inc. ] defines a protocol for web site managers to supply information about their sites to search engines. This type of protocol can be extended to enable the web server to send canonization rules to any party that wants such information. However, such an extension has not been adopted yet.


2.2 Detecting similarity between documents

The standard way of dealing with DUST is using document sketches [Brin et al. 1995; Garcia-Molina et al. 1996; Broder et al. 1997; Shivakumar and Garcia-Molina 1998; Di Iorio et al. 2003; Jain et al. 2005; Hoad and Zobel 2003; Monostori et al. 2002; Finkel et al. 2002; Zobel and Moffat 1998], which are short summaries used to determine similarities among documents. To compute such a sketch, however, one needs to fetch and inspect the whole document. Our approach cannot replace document sketches, since it does not find DUST across sites or DUST that does not stem from rules. However, it is desirable to use our approach to complement document sketches in order to reduce the overhead of collecting redundant data. Moreover, since document sketches do not give rules, they cannot be used for URL canonization, which is important, e.g., to improve the accuracy of page popularity metrics.

2.3 Detecting mirror sites

One common source of near-duplicate content is mirroring. A number of previous works have dealt with automatic detection of mirror sites on the web. There are two basic approaches to mirror detection. The first method is based on the content itself [Cho et al. 2000; Kelly and Mogul 2002]. The documents are downloaded and processed. Mirrors are detected by processing the documents to detect the similarity between hosts. We will call these techniques “bottom-up”.

The second method uses meta-information that may already be available, such as URL lists, the IP addresses associated with the host names, and other data. We will call these techniques “top-down” [Bharat and Broder 1999; Liang 2001; Bharat et al. 2000]. These techniques are closer to what we are doing in this paper.

In contrast to mirror detection, we deal with the complementary problem of detecting DUST within one site. Mirror detection may exploit syntactic analysis of URLs and limited sampling as we do. However, a major challenge that site-specific DUST detection must address is efficiently discovering prospective rules out of a daunting number of possibilities (all possible substring substitutions). In contrast, mirror detection focuses on comparing a given pair of sites, and only needs to determine whether they are mirrors.

2.4 Analysis of web server logs

Various commercial tools as well as papers are available on analyzing web server logs. We are not aware of any previous algorithm for automatically detecting DUST rules, nor of any algorithm for harvesting information for search engines from web server logs. Some companies (e.g., [WebLogExpert ; StatCounter ; Analog ]) have tools for analyzing web server logs, but their goals are very different from ours. This type of software usually provides such statistics as: popular keyword terms, entry pages (first page users hit), exit pages (pages users use to move to another site), information about visitor paths, and many more.

2.5 Mining association rules

Our problem may seem similar to mining association rules [Agrawal and Srikant 1994], yet the two problems differ substantially. Whereas the input of such mining algorithms consists of complete lists of items that belong together, our input includes individual items from different lists. The absence of complete lists renders techniques used therein inapplicable to our problem.

2.6 Abstract Rewrite System

One way to view our work is as producing an Abstract Rewrite System (ARS) [Bognar 1995] for URL canonization via DUST rules. For ease of readability, we have chosen not to adopt the ARS terminology in this paper.

3. PROBLEM DEFINITION

URLs. We view URLs as strings over an alphabet Σ of tokens. Tokens are either alphanumeric strings or non-alphanumeric characters. In addition, every URL is prepended with the special token ˆ and appended with the special token $ (ˆ and $ are not included in Σ).
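For concreteness, one possible tokenizer matching this definition (an illustrative sketch, not the paper's code):

import re

# Tokens are maximal alphanumeric strings or single non-alphanumeric characters;
# the special tokens ^ and $ are added around every URL.
def tokenize(url):
    tokens = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)
    return ["^"] + tokens + ["$"]

print(tokenize("http://www.site.com/index.html"))
# ['^', 'http', ':', '/', '/', 'www', '.', 'site', '.', 'com', '/', 'index', '.', 'html', '$']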

A URL u is valid if its domain name resolves to a valid IP address and its contents can be fetched by accessing the corresponding web server (the http return code is not in the 4xx or 5xx series). If u is valid, we denote by doc(u) the returned document.

DUST. Two valid URLs u_1, u_2 are called DUST if their corresponding documents, doc(u_1) and doc(u_2), are “similar”. To this end, any method of measuring the similarity between two documents can be used. For our implementation and experiments, we use the popular Jaccard similarity coefficient [Jaccard 1908], which can be estimated using shingles, or document sketches, due to Broder et al. [1997].

DUST rules. We seek general rules for detecting when two URLs are DUST. A DUST rule φ is a relation over the space of URLs. φ may be many-to-many. Every pair of URLs belonging to φ is called an instance of φ. The support of φ, denoted support(φ), is the collection of all its instances.

We discuss two types of DUST rules: substring substitutions and parameter substitutions. Parameter substitution rules either replace the value of a certain parameter appearing in the URL with a default value, or omit this parameter from the URL altogether. Thanks to the standard syntax of parameter usage in URLs, detecting parameter substitution rules is fairly straightforward. Most of our work therefore focuses on substring substitution rules.

Our algorithm focuses primarily on detecting substring substitution rules. A substring substitution rule α → β is specified by an ordered pair of strings (α, β) over the token alphabet Σ. (In addition, we allow these strings to simultaneously start with the token ˆ and/or to simultaneously end with the token $.) In Section 5.4 we will see that applying a substring substitution rule is simply done by replacing the first occurrence of the substring. However, at this point, instances of substring substitution rules are simply defined as follows:

DEFINITION 3.1 (INSTANCE OF A RULE). A pair u_1, u_2 of URLs is an instance of a substring substitution rule α → β if there exist strings p, s s.t. u_1 = pαs and u_2 = pβs.

For example, the pair of URLs http://www.site.com/index.html and http://www.site.com is an instance of the DUST rule “/index.html$” → “$”. This rule demonstrates that substring substitution rules can be used to remove “irrelevant” segments of the URL, segments that, if removed from a valid URL, result in another valid URL with similar associated content.
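A direct Python rendering of Definition 3.1 (illustrative only; it treats URLs as plain strings rather than token sequences):

# Check whether (u1, u2) is an instance of the rule alpha -> beta, i.e., whether
# there exist strings p, s such that u1 = p + alpha + s and u2 = p + beta + s.
def is_instance(u1, u2, alpha, beta):
    start = 0
    while True:
        i = u1.find(alpha, start)
        if i == -1:
            return False
        p, s = u1[:i], u1[i + len(alpha):]
        if p + beta + s == u2:
            return True
        start = i + 1

print(is_instance("http://www.site.com/index.html", "http://www.site.com",
                  "/index.html", ""))   # True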

The two types of rules we discuss in this paper are in no way complete. Indeed, it is easy to think of additional types of DUST rules that are not covered by the types of rules we present here. However, our experiments over real-world data show that the rules we explore are highly effective at uncovering DUST. It would be interesting to explore more types of DUST rules and compare their effectiveness in future work.

The DUST problem. Our goal is to detect DUST and eliminate redundancies in a collection of URLs belonging to a given web site S. This is solved by a combination of two algorithms, one that discovers DUST rules from a URL list, and another that uses them in order to transform URLs to their canonical form.

The input of the first algorithm is a list of URLs (typically from the same web site), and its output is a list of DUST rules corresponding to these URLs. The URL list is a list of records consisting of: (1) a URL; (2) the http return code; (3) the size of the returned document; and (4) the document's sketch. The last two fields are optional. This type of list can be obtained from web server logs or from a previous crawl. The URL list is a sample of the URLs that belong to the web site but is not required to be a random sample. For example, web server logs might be biased towards popular URLs.
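For illustration, a hypothetical in-memory layout of one URL-list record with the four fields above (the field names are our own, not prescribed by the paper):

from dataclasses import dataclass
from typing import Optional, FrozenSet

@dataclass
class UrlRecord:
    url: str                                  # the URL itself
    http_code: int                            # HTTP return code
    size: Optional[int] = None                # size of the returned document (optional)
    sketch: Optional[FrozenSet[int]] = None   # document sketch, e.g. a shingle set (optional)

record = UrlRecord("http://www.site.com/index.html", 200, size=10342)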

For a given web site S, we denote by U_S the set of URLs that belong to S. A DUST rule φ is said to be valid w.r.t. S if for each u_1 ∈ U_S and for each u_2 s.t. (u_1, u_2) is an instance of φ, u_2 ∈ U_S and (u_1, u_2) is DUST.

A DUST rule detection algorithm is given a list L of URLs from a web site S and outputs an ordered list of DUST rules. The algorithm may also fetch pages (which may or may not appear in the URL list). The ordering of rules represents the confidence of the algorithm in the validity of the rules.

Canonization. Let R be an ordered list of DUST rules that have been found to be valid w.r.t. some web site S. We would like to define what is a canonization of the URLs in U_S, using the rules in R. This definition is made somewhat more difficult by the fact that a DUST rule may map a URL to multiple URLs. For example, the URL http://a.b.a.com/ is mapped by the rule “a.” → “” to both http://a.b.com/ and http://b.a.com/. To this end, we define a standard way of applying each rule. For example, in the case of substring substitutions, we only replace the first occurrence of the string.

The rules in R naturally induce a labeled graph G_R on U_S: there is an edge from u_1 to u_2 labeled by φ if and only if φ(u_1) = u_2. Note that adjacent URLs in G_R correspond to similar documents. Further note that due to our rule validation process (see Section 5.4), R cannot contain both a rule and its inverse. Nevertheless, the graph G_R may still contain cycles.

For the purpose of canonization, we assume that document dissimilarity empirically respects at least a weak form of the triangle inequality, so that URLs that are connected by short paths in G_R are similar too. Thus, if G_R has a bounded diameter (as it does in the data sets we encountered), then every two URLs connected by a path are similar.


A canonization that maps every URL u to some URL that is reachable from u thus makes sense, because the original URL and its canonical form are guaranteed to be DUST.

A set of canonical URLs is a subset CU_S ⊆ U_S that is reachable from every URL in U_S. Equivalently, CU_S is a dominating set in the transitive closure of the reverse graph of G_R, the graph obtained by reversing the direction of all edges. A canonization is any mapping C : U_S → CU_S that maps every URL u ∈ U_S to some canonical URL C(u), which is reachable from u by a directed path. Our goal is to find a small set of canonical URLs and a corresponding canonization, which is efficiently computable.

Finding the minimum size set of canonical URLs is intractable, due to the NP-hardness of the minimum dominating set problem (cf. [Garey and Johnson 1979]). Fortunately, our empirical study indicates that for typical collections of DUST rules found in web sites, efficient canonization is possible. Thus, although we cannot design an algorithm that always obtains an optimal canonization, we will seek one that maps URLs to a small set of canonical URLs, and always terminates in polynomial time.
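As a toy illustration only (this is not the canonization algorithm of Section 5.4), one can picture such a mapping as repeatedly applying an ordered list of validated rules, each to the first occurrence in the URL, until no rule fires or a step bound guarantees termination; the rules below are hypothetical.

# Toy canonization sketch; NOT the algorithm of Section 5.4.
def canonize(url, rules, max_steps=20):
    for _ in range(max_steps):            # step bound guarantees polynomial termination
        changed = False
        for alpha, beta in rules:          # rules ordered by confidence
            if alpha in url:
                url = url.replace(alpha, beta, 1)
                changed = True
        if not changed:
            break
    return url

rules = [("/index.html", ""), ("http://www.", "http://")]   # hypothetical valid rules
print(canonize("http://www.site.com/index.html", rules))     # http://site.com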

Metrics. We use three measures to evaluate DUST detection and canonization. The first measure is precision: the fraction of valid rules among the rules reported by the DUST detection algorithm. The second, and most important, measure is the discovered redundancy: the amount of redundancy eliminated in a crawl. It is defined as the difference between the number of unique URLs in the crawl before and after canonization, divided by the former.

The third measure is coverage: given a large collection of URLs that includes DUST, what percentage of the duplicate URLs is detected. The number of duplicate URLs in a given URL list is defined as the difference between the number of unique URLs and the number of unique document sketches. Since we do not have access to the entire web site, we measure the achieved coverage within the URL list. We count the number of duplicate URLs in the list before and after canonization, and the difference between them divided by the former is the coverage.

One of the standard measures of information retrieval is recall. In our case, recall would measure what percent of all correct DUST rules is discovered. However, it is clearly impossible to construct a complete list of all valid rules to compare against. Therefore, recall is not directly measurable in our case, and is replaced by coverage.

The coverage measure is equivalent to recall over the set of duplicate URLs.
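For concreteness, a small sketch of this coverage computation over a URL list that carries document sketches (the (url, sketch) representation and the canonize function are our own illustration):

# Duplicate URLs in a list = number of unique URLs minus number of unique sketches.
def duplicates(pairs):
    urls = {u for u, _ in pairs}
    sketches = {s for _, s in pairs}
    return len(urls) - len(sketches)

def coverage(pairs, canonize):
    before = duplicates(pairs)
    after = duplicates([(canonize(u), s) for u, s in pairs])
    return (before - after) / before if before else 0.0

# Toy example: two DUST URLs share one sketch; stripping "index.html" removes the duplicate.
pairs = [("http://a.com/", "sk1"), ("http://a.com/index.html", "sk1"),
         ("http://a.com/b", "sk2")]
print(coverage(pairs, lambda u: u.replace("index.html", "", 1)))   # 1.0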

4. BASIC HEURISTICS

Our algorithm for extracting likely string substitution rules from the URL list uses three heuristics: the large support heuristic, the small buckets heuristic, and the similarity likeliness heuristic. Our empirical results provide evidence that these heuristics are effective on web sites of varying scopes and characteristics.

Large support heuristic.

Large Support Heuristic. The support of a valid DUST rule is large.

For example, if a rule “index.html$” → “$” is valid, we should expect many instances witnessing to this effect, e.g., www.site.com/d1/index.html and www.site.com/d1/, and www.site.com/d3/index.html and www.site.com/d3/. We would thus like to discover rules of large support. Note that valid rules of small support are not very interesting anyway, because the savings gained by applying them are negligible.

Finding the support of a rule on the web site requires knowing all the URLs associated with the site. Since the only data at our disposal is the URL list, which is unlikely to be complete, the best we can do is compute the support of rules in this URL list. That is, for each rule φ, we can find the number of instances (u_1, u_2) of φ for which both u_1 and u_2 appear in the URL list. We call these instances the support of φ in the URL list and denote them by support_L(φ). If the URL list is long enough, we expect this support to be representative of the overall support of the rule on the site.

Note that since |support_L(α → β)| = |support_L(β → α)| for every α and β, our algorithm cannot know whether both rules are valid or just one of them is. It therefore outputs the pair α, β instead. Finding which of the two directions is valid is left to the final phase of DustBuster.

Given a URL list L, how do we compute the size of the support of every possible rule? To this end, we introduce a new characterization of the support size. Consider a substring α of a URL u = pαs. We call the pair (p, s) the envelope of α in u. For example, if u = http://www.site.com/index.html and α = “index”, then the envelope of α in u is the pair of strings “http://www.site.com/” and “.html$”. By Definition 3.1, a pair of URLs (u_1, u_2) is an instance of a substitution rule α → β if and only if there exists at least one shared envelope (p, s) so that u_1 = pαs and u_2 = pβs.

For a string α, denote by E_L(α) the set of envelopes of α in URLs, where the URLs satisfy the following conditions: (1) these URLs appear in the URL list L; and (2) the URLs have α as a substring. If α occurs in a URL u several times, then u contributes as many envelopes to E_L(α) as the number of occurrences of α in u. The following theorem shows that under certain conditions, |E_L(α) ∩ E_L(β)| equals |support_L(α → β)|. As we shall see later, this gives rise to an efficient procedure for computing support size, since we can compute the envelope sets of each substring α separately, and then by join and sort operations find the pairs of substrings whose envelope sets have large intersections.
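The following simplified Python sketch illustrates this procedure; it works on characters rather than tokens and omits the overflow and similarity filters that DustBuster adds in Section 5.

from collections import defaultdict
from itertools import combinations

def support_sizes(urls, max_len=12):
    # bucket: envelope (prefix, suffix) -> set of substrings alpha with prefix+alpha+suffix in L
    buckets = defaultdict(set)
    for u in urls:
        u = "^" + u + "$"
        for i in range(len(u)):
            for j in range(i, min(i + max_len, len(u)) + 1):
                buckets[(u[:i], u[j:])].add(u[i:j])
    # |E_L(alpha) ∩ E_L(beta)| = number of envelopes whose bucket contains both alpha and beta
    support = defaultdict(int)
    for substrings in buckets.values():
        for a, b in combinations(sorted(substrings), 2):
            support[(a, b)] += 1
    return support

urls = ["http://site.com/d1/index.html", "http://site.com/d1/",
        "http://site.com/d3/index.html", "http://site.com/d3/"]
print(support_sizes(urls)[("", "index.html")])   # 2: one shared envelope per directory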

THEOREM 4.1. Let α ≠ β be two non-empty and non-semiperiodic strings. Then,

|support_L(α → β)| = |E_L(α) ∩ E_L(β)|.

A string α is semiperiodic if it can be written as α = γ^k γ′ for some string γ, where |α| > |γ|, k ≥ 1, γ^k is the string obtained by concatenating k copies of the string γ, and γ′ is a (possibly empty) prefix of γ [Gusfield 1997]. If α is not semiperiodic, it is non-semiperiodic. For example, the strings “/////” and “a.a.a” are semiperiodic, while the strings “a.a.b” and “%////” are not.

Unfortunately, the theorem does not hold for rules where one of the strings is either semiperiodic or empty. Rules where one of the strings is empty are actually very rare, since another rule with additional context would be detected instead. For example, instead of detecting the rule “/” → “” we would detect the rule “/$” → “$”, which would only remove the last slash. For the semiperiodic case, let us consider the following example. Let α be the semiperiodic string “a.a” and β = “a”. Let u_1 = http://a.a.a/ and let u_2 = http://a.a/. There are two ways in which we can substitute α with β in u_1 and obtain u_2. Similarly, let γ be “a.” and δ be the empty string. There are two ways in which we can substitute γ with δ in u_1 to obtain u_2. This means that the instance (u_1, u_2) will be associated with two envelopes in E_L(α) ∩ E_L(β) and with two envelopes in E_L(γ) ∩ E_L(δ) and not just one. Thus, when α or β are semiperiodic or empty, |E_L(α) ∩ E_L(β)| can overestimate the support size. On the other hand, such examples are quite rare, and in practice we expect a minimal gap between |E_L(α) ∩ E_L(β)| and the support size.
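A brute-force check of semiperiodicity following the definition above (illustrative sketch):

# alpha is semiperiodic iff alpha = gamma^k gamma' for some gamma with |gamma| < |alpha|,
# k >= 1, and gamma' a (possibly empty) prefix of gamma; equivalently, alpha has a
# period shorter than its length.
def is_semiperiodic(alpha):
    n = len(alpha)
    for g in range(1, n):                # candidate period length |gamma|
        if all(alpha[i] == alpha[i % g] for i in range(n)):
            return True
    return False

print(is_semiperiodic("/////"), is_semiperiodic("a.a.a"))   # True True
print(is_semiperiodic("a.a.b"), is_semiperiodic("%////"))   # False False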

PROOF OF THEOREM 4.1. To prove the identity, we will show a 1-1 mapping from support_L(α → β) onto E_L(α) ∩ E_L(β). Let (u_1, u_2) be any instance of the rule α → β that occurs in L. By Definition 3.1, there exists an envelope (p, s) so that u_1 = pαs and u_2 = pβs. Note that (p, s) ∈ E_L(α) ∩ E_L(β), hence we define our mapping as f_{α,β}(u_1, u_2) = (p, s). The main challenge is to prove that f_{α,β} is a well defined function; that is, f_{α,β} maps every instance (u_1, u_2) to a single pair (p, s). This is captured by the following lemma, whose proof is provided below:

LEMMA 4.2. Let α ≠ β be two distinct, non-empty, and non-semiperiodic strings. Then, there cannot be two distinct pairs (p_1, s_1) ≠ (p_2, s_2) s.t. p_1 α s_1 = p_2 α s_2 and p_1 β s_1 = p_2 β s_2.

We are left to show that f is 1-1 and onto. Take any two instances (u_1, u_2), (v_1, v_2) of (α, β), and suppose that f_{α,β}(u_1, u_2) = f_{α,β}(v_1, v_2) = (p, s). This means that u_1 = pαs = v_1 and u_2 = pβs = v_2. Hence, necessarily (u_1, u_2) = (v_1, v_2), implying f is 1-1. Take now any envelope (p, s) ∈ E_L(α) ∩ E_L(β). By definition, there exist URLs u_1, u_2 ∈ L so that u_1 = pαs and u_2 = pβs. By Definition 3.1, (u_1, u_2) is an instance of the rule α → β, and thus f_{α,β}(u_1, u_2) = (p, s).

In order to prove Lemma 4.2, we show the following basic property of semiperiodic strings:

LEMMA 4.3. Let α ≠ β be two distinct and non-empty strings. If β is both a suffix and a prefix of α, then α must be semiperiodic.

PROOF. Since β is both a prefix and a suffix of α and since α ≠ β, there exist two non-empty strings β_0 and β_2 s.t.

α = β_0 β = β β_2.

Let k = ⌊|α| / |β_0|⌋. Note that k ≥ 1, as |α| ≥ |β_0|. We will show by induction on k that α = β_0^k β'_0, where β'_0 is a possibly empty prefix of β_0.

The induction base is k = 1. In this case, |β_0| ≥ |α|/2, and as α = β_0 β, |β| ≤ |β_0|. Since β_0 β = β β_2, it follows that β is a prefix of β_0. Thus, define β'_0 = β and we have:

α = β_0 β = β_0 β'_0.

Assume now that the statement holds for ⌊|α| / |β_0|⌋ = k−1 ≥ 1 and let us show correctness for ⌊|α| / |β_0|⌋ = k. As α = β_0 β, then ⌊|β| / |β_0|⌋ = k−1 ≥ 1. Hence, |β| ≥ |β_0| and thus β_0 must be a prefix of β (recall that β_0 β = β β_2). Let us write β as:

β = β_0 β_1.

We thus have the following two representations of α:

α = β_0 β = β_0 β_0 β_1   and   α = β β_2 = β_0 β_1 β_2.

We conclude that:

β = β_1 β_2.

We thus found a string (β_1) which is both a prefix and a suffix of β. We therefore conclude from the induction hypothesis that

β = β_0^{k−1} β'_0,

for some prefix β'_0 of β_0. Hence,

α = β_0 β = β_0^k β'_0.

We can now prove Lemma 4.2:

PROOF OF LEMMA 4.2. Suppose, by contradiction, that there exist two different pairs (p_1, s_1) and (p_2, s_2) so that:

u_1 = p_1 α s_1 = p_2 α s_2   (1)

u_2 = p_1 β s_1 = p_2 β s_2.   (2)

These equations, together with the fact that (p_1, s_1) ≠ (p_2, s_2), imply that both p_1 ≠ p_2 and s_1 ≠ s_2. Thus, by Equations (1) and (2), one of p_1 and p_2 is a proper prefix of the other. Suppose, for example, that p_1 is a proper prefix of p_2 (the other case is identical). It follows that s_2 is a proper suffix of s_1.

Let us assume, without loss of generality, that |α| ≥ |β|. There are two cases to consider:

Case (i): Both p_1 α and p_1 β are prefixes of p_2. This implies that also α s_2 and β s_2 are suffixes of s_1.

Case (ii): At least one of p_1 α and p_1 β is not a prefix of p_2.

Case (i): In this case, p_2 can be expressed as:

p_2 = p_1 α p'_2 = p_1 β p''_2,

for some strings p'_2 and p''_2. Thus, since |α| ≥ |β|, β is a prefix of α. Similarly:

s_1 = s'_1 α s_2 = s''_1 β s_2.

Thus, β is also a suffix of α. According to Lemma 4.3, α is therefore semiperiodic. A contradiction.

Case (ii): If p_1 α is not a prefix of p_2, α can be written as α = α_0 α_1 = α_1 α_2, for some non-empty strings α_0, α_1, α_2, as shown in Figure 1. By Lemma 4.3, α is semiperiodic. A contradiction. The case that p_1 β is not a prefix of p_2 is handled in a similar manner.

Fig. 1. Breakdown of α in Lemma 4.2.


Small buckets heuristic. While most valid DUST rules have large support, the converse is not necessarily true: there can be rules with large support that are not valid. One class of such rules is substitutions among numbered items, e.g., (lect1.ps, lect2.ps), (lect1.ps, lect3.ps), and so on.

We would like to somehow filter out the rules with “misleading” support. The support for a rule α → β can be thought of as a collection of recommendations, where each envelope (p, s) ∈ E_L(α) ∩ E_L(β) represents a single recommendation. Consider an envelope (p, s) that is willing to give a recommendation to any rule, for example “ˆhttp://” → “ˆ”. Naturally, its recommendations lose their value. This type of support only leads to many invalid rules being considered. This is the intuitive motivation for the following heuristic to separate the valid DUST rules from invalid ones.

If an envelope (p, s) belongs to many envelope sets E_L(α_1), E_L(α_2), ..., E_L(α_k), then it contributes to the intersections E_L(α_i) ∩ E_L(α_j), for all 1 ≤ i ≠ j ≤ k. The substrings α_1, α_2, ..., α_k constitute what we call a bucket. That is, for a given envelope (p, s), bucket(p, s) is the set of all substrings α s.t. pαs ∈ L. An envelope pertaining to a large bucket supports many rules.

Small Buckets Heuristic. Much of the support of valid DUST substring substitution rules is likely to belong to small buckets.

Similarity likeliness heuristic. The above two heuristics use the URL strings alone to detect DUST. In order to raise the precision of the algorithm, we use a third heuristic that better captures the “similarity dimension”, by providing hints as to which instances are likely to be similar.

Similarity Likeliness Heuristic. The likely similar support of a valid DUST rule is large.

We show below that using cues from the URL list we can determine which URL pairs in the support of a rule are likely to have similar content, i.e., are likely similar, and which are not. The likely similar support, rather than the complete support, is used to determine whether a rule is valid or not. For example, in a forum web site we examined, the URL list included two sets of URLs, http://politics.domain/story_num and http://movies.domain/story_num, with different numbers. The support of the invalid rule “http://politics.domain” → “http://movies.domain” was large, yet since the corresponding stories were very different, the likely similar support of the rule was found to be small.

How do we use the URL list to estimate similarity between documents? The simplest case is that the URL list includes a document sketch, such as the shingles of Broder et al. [1997], for each URL. Such sketches are typically available when the URL list is the output of a previous crawl of the web site. When available, document sketches are used to indicate which URL pairs are likely similar.

When the URL list is taken from web server logs, document sketches are not available. In this case we use document sizes (document sizes are usually given by web server software). We determine two documents to be similar if their sizes “match”. Size matching, however, turns out to be quite intricate, because the same document may have very different sizes when inspected at different points of time or by different users. This is especially true when dealing with highly dynamic sites like forum or blogging web sites. Therefore, if two URLs have different “size” values in the URL list, we cannot immediately infer that these URLs are not DUST. Instead, for each unique URL, we track all its occurrences in the URL list, and keep the minimum and the maximum size values encountered. We denote the interval between these two numbers by I_u. A pair of URLs, u_1 and u_2, in the support is considered likely to be similar if the intervals I_{u_1} and I_{u_2} overlap. Note, however, that in the case of size matching it is more accurate to say that the URLs are unlikely to be similar if their intervals do not overlap. For this reason, the size matching heuristic is effective in filtering support but not as a measure of how similar two URLs are. Our experiments show that this heuristic is very effective in improving the precision of our algorithm, often increasing precision by a factor of two.
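A sketch of this size-interval test (the (url, size) log representation and function names are our own illustration):

def size_intervals(log_entries):
    # log_entries: iterable of (url, size) pairs taken from web server logs
    intervals = {}
    for url, size in log_entries:
        lo, hi = intervals.get(url, (size, size))
        intervals[url] = (min(lo, size), max(hi, size))
    return intervals

def likely_similar(u1, u2, intervals):
    lo1, hi1 = intervals[u1]
    lo2, hi2 = intervals[u2]
    return lo1 <= hi2 and lo2 <= hi1     # True iff the size intervals overlap

log = [("http://forum.domain/story_1", 10200), ("http://forum.domain/story_1", 11050),
       ("http://forum.domain/story?id=1", 10300), ("http://forum.domain/story_2", 55000)]
iv = size_intervals(log)
print(likely_similar("http://forum.domain/story_1", "http://forum.domain/story?id=1", iv))  # True
print(likely_similar("http://forum.domain/story_1", "http://forum.domain/story_2", iv))     # False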

5. DUSTBUSTER

In this section we describe DustBuster, our algorithm for discovering site-specific DUST rules. DustBuster has four phases. The first phase uses the URL list alone to generate a short list of likely DUST rules. The second phase removes redundancies from this list. The next phase generates likely parameter substitution rules. The last phase validates or refutes each of the rules in the list, by fetching a small sample of pages.

5.1 Detecting likely DUST rules

Our strategy for discovering likely DUST rules is the following: we compute the size of the support of each rule that has at least one instance in the URL list, and output the rules whose support exceeds some threshold MS. Based on Theorem 4.1, we compute the size of the support of a rule α → β as the size of the set E_L(α) ∩ E_L(β). That is roughly what our algorithm does, but with three reservations:

(1) Based on the small buckets heuristic, we avoid considering certain rules by ignoring large buckets in the computation of envelope set intersections. Buckets bigger than some threshold T are called overflowing, and all envelopes pertaining to them are denoted collectively by O and are not included in the envelope sets.

(2) Based on the similarity likeliness heuristic, we filter support by estimating the likelihood of two documents being similar. We eliminate rules by filtering out instances whose associated documents are unlikely to be similar in content. That is, for a given instance u_1 = pαs and u_2 = pβs, the envelope (p, s) is disqualified if u_1 and u_2 are found unlikely to be similar using the tests introduced in Section 4. These techniques are provided as a boolean function LikelySimilar, which returns false only if the documents of the two input URLs are unlikely to be similar. The set of all disqualified envelopes is then denoted D_{α,β}.

(3) In practice, substitutions of long substrings are rare. Hence, our algorithm considers substrings of length at most S tokens, for some given parameter S.

To conclude, our algorithm computes, for every two substrings α, β that appear in the URL list and whose length is at most S, the size of the set (E_L(α) ∩ E_L(β)) \ (O ∪ D_{α,β}).

1:  Function DetectLikelyRules(URLList L)
2:    create table ST (substring, prefix, suffix, sizerange/docsketch)
3:    create table IT (substring1, substring2)
4:    create table RT (substring1, substring2, supportsize)
5:    for each record r ∈ L do
6:      for ℓ = 0 to S do
7:        for each substring α of r.url of length ℓ do
8:          p := prefix of r.url preceding α
9:          s := suffix of r.url succeeding α
10:         add (α, p, s, r.sizerange/r.docsketch) to ST
11:   group tuples in ST into buckets by (prefix, suffix)
12:   for each bucket B do
13:     if (|B| = 1 OR |B| > T) continue
14:     for each pair of distinct tuples t1, t2 ∈ B do
15:       if (LikelySimilar(t1, t2))
16:         add (t1.substring, t2.substring) to IT
17:   group tuples in IT into rule supports by (substring1, substring2)
18:   for each rule support R do
19:     t := first tuple in R
20:     add tuple (t.substring1, t.substring2, |R|) to RT
21:   sort RT by supportsize
22:   return all rules in RT whose support size is ≥ MS

Fig. 2. Discovering likely DUST rules.

Our algorithm for discovering likely DUST rules is described in Figure 2. The algorithm gets as input the URL list L. We assume the URL list has been pre-processed so that: (1) only unique URLs have been kept; (2) all the URLs have been tokenized and include the preceding ˆ and succeeding $; (3) all records corresponding to errors (http return codes in the 4xx and 5xx series) have been filtered out; (4) for each URL, the corresponding document sketch or size range has been recorded.

The algorithm uses three tables: a substring table ST, an instance table IT, and a rule table RT. Their attributes are listed in Figure 2. In principle, the tables can be stored in any database structure; our implementation uses text files.


In lines 5–10, the algorithm scans the URL list, and records all substrings of lengths 0 to S of the URLs in the list. For each such substring α, a tuple is added to the substring table ST. This tuple consists of the substring α, as well as its envelope (p, s), and either the URL's document sketch or its size range. The substrings are then grouped into buckets by their envelopes (line 11). Our implementation does this by sorting the file holding the ST table by the second and third attributes. Note that two substrings α, β appear in the bucket of (p, s) if and only if (p, s) ∈ E_L(α) ∩ E_L(β).

In lines 12–16, the algorithm enumerates the envelopes found. An envelope (p, s) contributes 1 to the intersection of the envelope sets E_L(α) ∩ E_L(β), for every α, β that appear in its bucket. Thus, if the bucket has only a single entry, we know (p, s) does not contribute any instance to any rule, and thus can be tossed away. If the bucket is overflowing (its size exceeds T), then (p, s) is also ignored (line 13).

In lines 14–16, the algorithm enumerates all the pairs (α, β) of substrings that belong to the bucket of (p, s). If it seems likely that the documents associated with the URLs pαs and pβs are similar (through size or document sketch matching) (line 15), (α, β) is added to the instance table IT (line 16).

The number of times a pair (α, β) has been added to the instance table is exactly the size of the set (E_L(α) ∩ E_L(β)) \ (O ∪ D_{α,β}), which is our estimated support for the rules α → β and β → α. Hence, all that is left to do is compute these counts and sort the pairs by their count (lines 17–22). The algorithm's output is an ordered list of pairs, each pair representing two likely DUST rules (one in each direction). Only rules whose support is large enough (bigger than MS) are kept in the list.

Complexity analysis. Let n be the number of records in the URL list and let m be the average length (in tokens) of URLs in the URL list. We assume tokens are of constant length. The size of the URL list is then O(mn) bits. We use O(·) to suppress factors that are logarithmic in n, m, S, and T, which are typically negligible.

The computation has two major bottlenecks. The first is filling in the instance table IT (lines 12–16). The elements of the ST table, O(mnS) of them, are split into buckets of size O(T), which results in O(mnS/T) buckets in ST. The algorithm then enumerates all the buckets, and for each of them enumerates each pair of substrings in the bucket. This computation could possibly face a quadratic blowup. Yet, since overflowing buckets are ignored, this step takes only O((mnS/T) · T²) = O(mnST) time. The second bottleneck is sorting the URL list and the intermediate tables. Since all the intermediate tables are of size at most O(mnST), the sorting can be carried out in O(mnST) time and O(mnST) external storage space. By using an efficient external storage sort utility, we can keep the main memory complexity O(1) rather than linear. The algorithm does not fetch any pages.

5.2 Eliminating redundant rules

By design, the output of the above algorithm includes many overlapping pairs. For example, when running on a forum site, our algorithm finds the pair (“.co.il/story?id=”, “.co.il/story_”), as well as numerous pairs of substrings of these, such as (“story?id=”, “story_”). Note that every instance of the former pair is also an instance of the latter. We thus say that the former refines the latter. It is desirable to eliminate redundancies prior to attempting to validate the rules, in order to reduce the cost of validation. However, when one likely DUST rule refines another, it is not obvious which should be kept. In some cases, the broader rule is always true, and all the rules that refine it are redundant. In other cases, the broader rule is only valid in specific contexts identified by the refining ones.

In some cases, we can use information from the URL list in order to deduce that a pair is redundant. When two pairs have exactly the same support in the URL list, this gives a strong indication that the latter, seemingly more general rule, is valid only in the context specified by the former rule. We can thus eliminate the latter rule from the list.

We next discuss in more detail the notion of refinement and show how to use it to eliminate redundant rules.

DEFINITION 5.1 (REFINEMENT). A rule φ refines a rule ψ if support(φ) ⊆ support(ψ).

That is, φ refines ψ if every instance (u_1, u_2) of φ is also an instance of ψ. Testing refinement for substitution rules turns out to be easy, as captured in the following lemma:

LEMMA 5.2. A substitution rule α′ → β′ refines a substitution rule α → β if and only if there exists an envelope (γ, δ) s.t. α′ = γαδ and β′ = γβδ.

PROOF. We prove the implication in both directions. Assume, initially, that there exists an envelope (γ, δ) s.t. α′ = γαδ and β′ = γβδ. We need to show that in this case α′ → β′ refines α → β. Take, then, any instance (u_1, u_2) of α′ → β′. By Definition 3.1, there exists an envelope (p′, s′) s.t. u_1 = p′α′s′ and u_2 = p′β′s′. Hence, if we define p = p′γ and s = δs′, then we have that u_1 = pαs and u_2 = pβs. Using again Definition 3.1, we conclude that (u_1, u_2) is also an instance of α → β. This proves the first direction.


For the second direction, assume that α′ → β′ refines α → β. Assume that none of α, β, α′, β′ starts with ˆ or ends with $. (The extension to α, β, α′, β′ that can start with ˆ or end with $ is easy, but requires some technicalities that would harm the clarity of this proof.) Define u_1 = ˆα′$ and u_2 = ˆβ′$. By Definition 3.1, (u_1, u_2) is an instance of α′ → β′. Due to refinement, it is also an instance of α → β. Hence, there exist p, s s.t. u_1 = pαs and u_2 = pβs. Hence, pαs = ˆα′$ and pβs = ˆβ′$. Since α, β do not start with ˆ and do not end with $, then p must start with ˆ and s must end with $. Define then γ to be the string p excluding the leading ˆ, and define δ to be the string s excluding the trailing $. We thus have: α′ = γαδ and β′ = γβδ, as needed.

The characterization given by the above lemma immediately yields an efficient algorithm for deciding whether a substitution rule α′ → β′ refines a substitution rule α → β: we simply check that α is a substring of α′, replace α by β, and check whether the outcome is β′. If α has multiple occurrences in α′, we check all of them. Note that our algorithm's input is a list of pairs rather than rules, where each pair represents two rules. When considering two pairs (α, β) and (α′, β′), we check refinement in both directions.
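A direct Python sketch of this refinement test (illustrative; plain strings rather than token sequences):

# alpha' -> beta' refines alpha -> beta iff replacing some occurrence of alpha
# inside alpha' by beta yields beta' (Lemma 5.2: alpha' = gamma+alpha+delta and
# beta' = gamma+beta+delta).
def refines(alpha_p, beta_p, alpha, beta):
    start = 0
    while True:
        i = alpha_p.find(alpha, start)
        if i == -1:
            return False
        if alpha_p[:i] + beta + alpha_p[i + len(alpha):] == beta_p:
            return True
        start = i + 1

print(refines(".co.il/story?id=", ".co.il/story_", "story?id=", "story_"))   # True
print(refines("story?id=", "story_", ".co.il/story?id=", ".co.il/story_"))   # False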

Now, suppose a rule α′ → β′ was found to refine a rule α → β. Then, support(α′ → β′) ⊆ support(α → β), implying that also support_L(α′ → β′) ⊆ support_L(α → β). Hence, if |support_L(α′ → β′)| = |support_L(α → β)|, then support_L(α′ → β′) = support_L(α → β). If the URL list is sufficiently representative of the web site, this gives an indication that every instance of the refined rule α → β that occurs on the web site is also an instance of the refinement α′ → β′. We choose to keep only the refinement α′ → β′, because it gives the full context of the substitution.

One small obstacle to using the above approach is the following. In the first phase of our algorithm, we do not compute the exact size of the support |support_L(α → β)|, but rather calculate the quantity |(E_L(α) ∩ E_L(β)) \ (O ∪ D_{α,β})|. It is possible that α′ → β′ refines α → β and support_L(α′ → β′) = support_L(α → β), yet |(E_L(α′) ∩ E_L(β′)) \ (O ∪ D_{α′,β′})| < |(E_L(α) ∩ E_L(β)) \ (O ∪ D_{α,β})|.

How could this happen? Consider some envelope (p, s) ∈ E_L(α) ∩ E_L(β), and let (u_1, u_2) = (pαs, pβs) be the corresponding instance of α → β. Since support_L(α′ → β′) = support_L(α → β), (u_1, u_2) is also an instance of α′ → β′. Therefore, there exists some envelope (p′, s′) ∈ E_L(α′) ∩ E_L(β′) s.t. u_1 = p′α′s′ and u_2 = p′β′s′.

Since α is a substring of α′ and β is a substring of β′, p′ must be a prefix of p and s′ must be a suffix of s. This implies that any URL that contributes a substring γ to the bucket of (p, s) will also contribute a substring γ′ to the bucket of (p′, s′), unless γ′ exceeds the maximum substring length S. In principle, then, we should expect the bucket of (p′, s′) to be larger than the bucket of (p, s). If the maximum bucket size T happens to be exactly between the sizes of the two buckets, then (p′, s′) overflows while (p, s) does not. In this case, the first phase of DustBuster will account for (u_1, u_2) in the computation of the support of α → β but not in the computation of the support of α′ → β′, incurring a difference between the two.

In practice, if the supports are identical, the difference between the calculated support sizes should be small. We thus eliminate the refined rule, even if its calculated support size is slightly above the calculated support size of the refining rule. However, to increase the effectiveness of this phase, we run the first phase of the algorithm twice, once with a lower overflow threshold Tlow and once with a higher overflow threshold Thigh. While the support calculated using the lower threshold is more effective in filtering out invalid rules, the support calculated using the higher threshold is more effective in eliminating redundant rules.

The algorithm for eliminating refined rules from the list appears in Figure 3. The algorithm gets as input a list of pairs, representing likely rules, sorted by their calculated support size. It uses three tunable parameters: (1) the maximum relative deficiency, MRD; (2) the maximum absolute deficiency, MAD; and (3) the maximum window size, MW. MRD and MAD determine the maximum difference allowed between the calculated support sizes of the refining rule and the refined rule when we eliminate the refined rule. MW determines how far down the list we look for refinements.

The algorithm scans the list from top to bottom. For each rule R[i] that has not yet been eliminated, the algorithm scans a "window" of rules below R[i]. Suppose s is the calculated size of the support of R[i]. The window size is chosen so that (1) it never exceeds MW (line 4); and (2) the difference between s and the calculated support size of the lowest rule in the window is at most the maximum between MRD·s and MAD (line 5). Now, if R[i] refines a rule R[j] in the window, the refined rule R[j] is eliminated (line 7), while if some rule R[j] in the window refines R[i], R[i] is eliminated (line 9).

It is easy to verify that the running time of the algorithm is at most |R|·MW. In our experiments, this algorithm reduces the set of rules by over 90%.


 1: Function EliminateRedundancies(pairs list R)
 2: for i = 1 to |R| do
 3:   if (already eliminated R[i]) continue
 4:   for j = 1 to min(MW, |R| − i) do
 5:     if (R[i].size − R[i+j].size > max(MRD · R[i].size, MAD)) break
 6:     if (R[i] refines R[i+j])
 7:       eliminate R[i+j]
 8:     else if (R[i+j] refines R[i]) then
 9:       eliminate R[i]
10:       break
11: return R

Fig. 3. Eliminating redundant rules.
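As an illustration only, the pass of Figure 3 can be rendered in Python as follows; we assume each rule is a dict with keys "alpha", "beta", and "size" (its calculated support), that the list is sorted by decreasing size, and that refines() is the helper sketched earlier:

    def eliminate_redundancies(rules, MW, MRD, MAD):
        # rules: list of dicts sorted by decreasing calculated support size.
        eliminated = [False] * len(rules)
        for i in range(len(rules)):
            if eliminated[i]:
                continue
            for j in range(1, min(MW, len(rules) - 1 - i) + 1):
                # Stop once the support gap exceeds the allowed deficiency.
                if rules[i]["size"] - rules[i + j]["size"] > max(MRD * rules[i]["size"], MAD):
                    break
                if refines(rules[i]["alpha"], rules[i]["beta"],
                           rules[i + j]["alpha"], rules[i + j]["beta"]):
                    eliminated[i + j] = True   # R[i] refines R[i+j]: drop the refined rule
                elif refines(rules[i + j]["alpha"], rules[i + j]["beta"],
                             rules[i]["alpha"], rules[i]["beta"]):
                    eliminated[i] = True       # R[i+j] refines R[i]: drop R[i]
                    break
        return [r for r, gone in zip(rules, eliminated) if not gone]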

5.3 Parameter substitutions

Inline parameters in URLs typically comply with a standard format. In many sites, an inline parameter name is preceded by the "?" or "&" characters and followed by the "=" character and the parameter's value, which is followed by either the end of the URL or another "&" character. We can therefore employ a simple regular expression search on URLs in the URL list in order to detect popular parameters, along with multiple examples of values for each parameter. Having detected the parameters, we check for each one whether replacing its value with an arbitrary one is a valid DUST rule. To this end, we exploit the ST table computed by DustBuster (see Figure 2), after it has been sorted and divided into buckets. We seek buckets whose prefix attribute ends with the desired parameter name, and then compare the document sketches or size ranges of the relevant URLs pertaining to such buckets.

For each parameter p, we choose some value of the parameter, vp, and add two rules to the list of likely rules: the first replaces the value of the parameter p with vp; the second omits the parameter altogether. Due to the simplicity of this algorithm, its detailed presentation is omitted. The next section describes the validation phase, which drops the DUST rules that do not generate valid URLs or URLs with similar content.
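The following sketch shows one way such parameter detection could look in Python; the regular expression and helper names are ours, and the popularity filtering and the per-bucket similarity check described above are omitted for brevity:

    import re
    from collections import defaultdict

    # An inline parameter: preceded by "?" or "&", followed by "=" and a value
    # that runs until the next "&" or the end of the URL.
    PARAM_RE = re.compile(r"[?&]([^=&]+)=([^&]*)")

    def detect_parameters(url_list):
        """Map each parameter name to the set of values observed for it."""
        values = defaultdict(set)
        for url in url_list:
            for name, value in PARAM_RE.findall(url):
                values[name].add(value)
        return values

    def likely_parameter_rules(url_list):
        """For each parameter p with an example value vp, emit two likely rules:
        substitute the value of p with vp, and omit p altogether."""
        rules = []
        for p, vals in detect_parameters(url_list).items():
            vp = next(iter(vals))  # some observed value of p
            rules.append(("substitute-value", p, vp))
            rules.append(("omit-parameter", p))
        return rules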

5.4 Validating DUST rules

So far, the algorithm has generated likely rules from the URL list alone, without fetching even a single page from the web site. Fetching a small number of pages for validating or refuting these rules is necessary for two reasons. First, it can significantly improve the final precision of the algorithm. Second, the first two phases of DustBuster, which discover likely substring substitution rules, cannot distinguish between the two directions of a rule. The discovery of the pair (α,β) can represent both α → β and β → α. This does not mean that in reality both rules are valid or invalid simultaneously. It is often the case that only one of the directions is valid; for example, in many sites removing the substring index.html is always valid, whereas adding one is not. Only by attempting to fetch actual page contents can we tell which direction is valid, if any.

The validation phase of DustBuster therefore fetches a small sample of web pages from the web site in order to check the validity of the rules generated in the previous phases. The validation of a single rule is presented in Figure 4. The algorithm is given as input a likely rule R and a list of URLs from the web site, and decides whether the rule is valid. It uses two parameters: the validation count, N (how many samples to use in order to validate each rule), and the refutation threshold, ε (the minimum fraction of counterexamples to a rule required to declare the rule invalid).

REMARK. The application of R to u (line 6) may result in several different URLs. For example, there are several ways of replacing the string "people" with the string "users" in the URL http://people.domain.com/people, resulting in the URLs http://users.domain.com/people, http://people.domain.com/users, and http://users.domain.com/users. Our policy is to select one standard way of applying a rule. For example, in the case of substring substitutions, we simply replace the first occurrence of the substring.

In order to determine whether a rule is valid, the algorithm repeatedly chooses random URLs from the given test URL list until hitting a URL on which applying the rule results in a different URL (line 5). The algorithm then applies the rule to the random URL u, resulting in a new URL v. The algorithm then fetches u and v. Using document sketches, such as the shingling technique of Broder et al. [1997], the algorithm tests whether u and v are similar. If they are, the algorithm accounts for u as a positive example attesting to the validity of the rule.


 1: Function ValidateRule(R, L)
 2: positive := 0
 3: negative := 0
 4: while (positive < (1 − ε)N AND negative < εN) do
 5:   u := a random URL from L on which applying R results in a different URL
 6:   v := outcome of application of R to u
 7:   fetch u and v
 8:   if (fetch u failed) continue
 9:   if (fetch v failed OR DocSketch(u) ≠ DocSketch(v))
10:     negative := negative + 1
11:   else
12:     positive := positive + 1
13: if (negative ≥ εN)
14:   return FALSE
15: return TRUE

Fig. 4. Validating a single likely rule.

If v cannot be fetched, or u and v are not similar, the pair is accounted as a negative example (lines 9–12). The testing stops either when the number of negative examples surpasses the refutation threshold or when the number of positive examples is large enough to guarantee that the number of negative examples will not surpass the threshold.

One could ask why we declare a rule valid even if we find (a small number of) counterexamples to it. There are several reasons: (1) the document sketch comparison test sometimes makes mistakes, since it has an inherent false negative probability; (2) dynamic pages sometimes change significantly between successive probes (even if the probes are made at short intervals); and (3) the fetching of a URL may sometimes fail at some point in the middle, after part of the page has been fetched. By choosing a refutation threshold smaller than one, we can account for such situations.
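A Python rendering of this loop might look as follows; fetch(), doc_sketch(), and apply_rule() are hypothetical stand-ins for page fetching, document sketching (e.g., shingling), and rule application, and the default parameter values echo the settings reported in Section 6:

    import random

    def validate_rule(apply_rule, urls, fetch, doc_sketch, N=100, eps=0.05):
        positive, negative = 0, 0
        while positive < (1 - eps) * N and negative < eps * N:
            u = random.choice(urls)
            v = apply_rule(u)
            if v == u:
                continue            # rule does not change this URL; draw another sample
            page_u = fetch(u)
            if page_u is None:
                continue            # failure to fetch u is not counted either way
            page_v = fetch(v)
            if page_v is None or doc_sketch(page_u) != doc_sketch(page_v):
                negative += 1       # counterexample
            else:
                positive += 1       # supporting example
        return negative < eps * N   # valid iff the refutation threshold was not reached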

Each parameter substitution rule is validated using the code in Figure 4. The validation of substring substitutions is more complex, as it needs to address directions and refinements.

Figure 5 shows the algorithm for validating a list of likely DUST rules. Its input consists of a list of pairs representing likely substring transformations, (R[i].α, R[i].β), and a test URL list L.

For a pair of substrings (α,β), we use the notation α > β to denote that either |α| > |β|, or |α| = |β| and α succeeds β in the lexicographical order. In this case, we say that the rule α → β shrinks the URL. We give precedence to shrinking substitutions. Therefore, given a pair (α,β), if α > β, we first try to validate the rule α → β. If this rule is valid, we ignore the rule in the other direction since, even if that rule turns out to be valid as well, using it during canonization is only likely to create cycles, i.e., rules that can be applied an infinite number of times because they cancel out each other's changes. If the shrinking rule is invalid, though, we do attempt to validate the opposite direction, so as not to lose a valid rule. Whenever one of the directions of (α,β) is found to be valid, we remove from the list all pairs refining (α,β): once a broader rule is deemed valid, there is no longer a need for refinements thereof. By eliminating these rules prior to validating them, we reduce the number of pages we fetch. We assume that each pair in R is ordered so that R[i].α > R[i].β.
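The precedence test itself is a one-liner; an illustrative helper (not from the paper's code):

    def shrinks(alpha, beta):
        """True if alpha -> beta shrinks the URL, i.e., alpha > beta in the order above."""
        return (len(alpha), alpha) > (len(beta), beta)

    # Example: removing "index.html" is the shrinking direction.
    assert shrinks("index.html", "")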

The running time of the algorithm is at most O(|R|² + N|R|). Since the list is assumed to be rather short, this running time is manageable. The number of pages fetched is O(N|R|) in the worst case, but much smaller in practice, since we eliminate many redundant rules after validating rules they refine.

Application for URL canonization. Finally, we explain how the discovered DUST rules may be used for canonization of a URL list. Our canonization algorithm is described in Figure 6. The algorithm receives a URL u and a list of valid DUST rules, R. The idea behind this algorithm is very simple: in each iteration, each rule in R in turn is repeatedly applied to u up to MA times, until the rule does not change the URL; this process is repeated up to MA times, until there is an iteration in which u is unchanged (lines 7–8).

If a rule can be applied more than once (e.g., because the same substring appears multiple times in the URL), then each iteration in lines 5–6 applies it at the first place in the URL where it is applicable. As long as the number of occurrences of the replaced substring in the URL does not exceed MA, the algorithm replaces all of them.

We limit the number of iterations of the algorithm and the number of applications of a rule to the parameter MA, because otherwise the algorithm could enter an infinite loop (if the graph GR contains cycles).


 1: Function Validate(rules list R, test URL list L)
 2: create an empty list of rules LR
 3: for i = 1 to |R| do
 4:   for j = 1 to i − 1 do
 5:     if (R[j] was not eliminated AND R[i] refines R[j])
 6:       eliminate R[i] from the list
 7:       break
 8:   if (R[i] was eliminated)
 9:     continue
10:   if (ValidateRule(R[i].α → R[i].β, L))
11:     add R[i].α → R[i].β to LR
12:   else if (ValidateRule(R[i].β → R[i].α, L))
13:     add R[i].β → R[i].α to LR
14:   else
15:     eliminate R[i] from the list
16: return LR

Fig. 5. Validating likely rules.

 1: Function Canonize(URL u, rules list R)
 2: for k = 1 to MA do
 3:   prev := u
 4:   for i = 1 to |R| do
 5:     for j = 1 to MA do
 6:       u := a URL obtained by applying R[i] to u
 7:   if (prev = u)
 8:     break
 9: output u

Fig. 6. Canonization algorithm.
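An illustrative Python rendering of Figure 6, assuming the valid rules are given as (α, β) substring pairs; the function name and signature are ours:

    def canonize(url, rules, MA=10):
        for _ in range(MA):                      # up to MA passes over the rule list
            prev = url
            for alpha, beta in rules:
                for _ in range(MA):              # apply each rule up to MA times
                    # Replace the first occurrence of alpha (our standard way of
                    # applying a substring substitution); a non-matching rule
                    # leaves the URL unchanged.
                    url = url.replace(alpha, beta, 1)
            if prev == url:
                break                            # a full pass left the URL unchanged
        return url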

Since MA is a constant, chosen independently of the number of rules, the algorithm's running time is linear in the number of rules. Recall that the general canonization problem is hard, so we cannot expect this algorithm to always produce a minimum-size canonization. Nevertheless, our empirical study shows that the savings obtained using this algorithm are high.

We believe that the algorithm's common-case success stems from two features. First, our policy of choosing shrinking rules whenever possible typically eliminates cycles. Second, our elimination of refinements of valid rules leaves a small set of rules, most of which do not affect each other.

6. EXPERIMENTAL RESULTS

Experiment setup. We experiment with DustBuster on four web sites: a dynamic forum site2, an academic site (www.ee.technion.ac.il), a large news site (cnn.com), and a smaller news site (nydailynews.com). In the forum site, page contents are highly dynamic, as users continuously add comments. The site supports multiple domain names, and most of the site's pages are generated by the same software. The news sites are similar in their structure to many other news sites on the web. The large news site has a more complex structure, and it makes use of several sub-domains as well as URL redirections. Finally, the academic site is the most diverse: it includes both static pages and dynamic software-generated content. Moreover, individual pages and directories on the site are constructed and maintained by a large number of users (faculty members, lab managers, etc.).

In the academic and forum sites, we detect likely DUST rules from web server logs, whereas in the news sites, we detect likely DUST rules from a crawl log. Table I depicts the sizes of the logs used. In the crawl logs each URL appears once, while in the web server logs the same URL may appear multiple times. In the validation phase, we use random entries from additional logs, different from those used to detect the rules. The canonization algorithm is tested on yet another set of logs, different from the ones used to detect and validate the rules.

2 The webmaster who gave us access to the logs asked us not to specify the name of the site.


Web Site          Log Size   Unique URLs
Forum Site        38,816     15,608
Academic Site     344,266    17,742
Large News Site   11,883     11,883
Small News Site   9,456      9,456

Table I. Log sizes.

Parameter settings. The following DustBuster parameters were carefully chosen in all our experiments. Our empirical results suggest that these settings are robust across data sets, as they work in the 4 very different representative sites we experimented with. The maximum substring length, S, was set to 35 tokens. The maximum bucket size used for detecting DUST rules, Tlow, was set to 6, and the maximum bucket size used for eliminating redundant rules, Thigh, was set to 11. In the elimination of redundant rules, we allowed a relative deficiency, MRD, of up to 5%, and an absolute deficiency, MAD, of 1. The maximum window size, MW, was set to 1100 rules. The value of MS, the minimum support size, was set to 3. The algorithm uses a validation count, N, of 100 and a refutation threshold, ε, of 5%–10%. Finally, the canonization uses a maximum of 10 iterations. Shingling [Broder et al. 1997] is used in the validation phase to determine similarity between documents.
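Gathered in one place, these settings amount to a configuration block like the following (the key names are ours, chosen to mirror the notation above):

    DUSTBUSTER_PARAMS = {
        "S": 35,          # maximum substring length, in tokens
        "T_low": 6,       # max bucket size when detecting DUST rules
        "T_high": 11,     # max bucket size when eliminating redundant rules
        "MRD": 0.05,      # maximum relative deficiency (5%)
        "MAD": 1,         # maximum absolute deficiency
        "MW": 1100,       # maximum window size, in rules
        "MS": 3,          # minimum support size
        "N": 100,         # validation count
        "epsilon": 0.05,  # refutation threshold (5%-10% in the experiments)
        "MA": 10,         # maximum canonization iterations
    }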

Detecting likely DUST rules and eliminating redundant ones. DustBuster's first phase scans the log and detects a very long list of likely DUST rules. Subsequently, the redundancy elimination phase dramatically shortens this list. Table II shows the sizes of the lists before and after redundancy elimination. It can be seen that in all of our experiments, over 90% of the rules in the original list have been eliminated.

For example, in the largest log in the academic site, 26,899 likely rules were detected in the first phase, and only 2,041 (8%) remained after the second; in a smaller log, 10,848 rules were detected, of which only 354 (3%) were not eliminated. In the large news site, 12,144 rules were detected, of which 1,243 remained after the second phase. In the forum site, far fewer likely rules were detected; e.g., in one log 402 rules were found, of which 37 (9%) remained. We believe that the smaller number of rules is a result of the forum site being more uniformly structured than the academic one, as most of its pages are generated by the same web server software.

Web Site          Rules Detected   Rules Remaining after 2nd Phase
Forum Site        402              37 (9.2%)
Academic Site     26,899           2,041 (7.6%)
Large News Site   12,144           1,243 (9.76%)
Small News Site   4,220            96 (2.3%)

Table II. Rule elimination in second phase.

In Figure 7, we examine the precision level in the short list of likely rules produced at the end of these two phases in three of the sites. Recall that no page contents are fetched in these phases. As this list is ordered by likeliness, we examine the precision@k; that is, for each top k rules in this list, the curves show what percentage of them are later deemed valid (by DustBuster's validation phase) in at least one direction. We observe that when similarity-based filtering is used, DustBuster's detection phase achieves a very high precision rate even though it does not fetch even a single page. Figure 8 shows the results for four web server logs of the forum site. Out of the 40–50 detected rules, over 80% are indeed valid. In the academic site, over 60% of the 300–350 detected rules are valid, and of the top 100 detected rules, over 80% are valid. In the large news site, 74% of the top 200 rules are valid.

This high precision is achieved, to a large extent, thanks to the similarity-based filtering (size matching or shingle matching), as shown in Figures 7(a) and 7(b). The log includes invalid rules. For example, the forum site includes multiple domains, and the stories in each domain are different. Thus, although we find many pairs of the form http://domain1/story num and http://domain2/story num with the same story number, these URLs represent different stories. Similarly, the academic site has URL pairs of the form http://site/course1/lect-num.ppt and http://site/course2/lect-num.ppt, although the lectures are different. These URL pairs are instances of invalid rules, rules that are not detected thanks to size matching. Figure 7(a) illustrates the impact of size matching in the academic site. We see that when size matching is not employed, the precision drops by around 50%. Thus, size matching reduces the number of accesses needed for validation.


Fig. 7. Precision@k of likely DUST rules detected in DustBuster's first two phases, without fetching actual content (plots omitted; both panels show precision vs. top k rules). (a) Academic site, impact of size matching (curves with and without size matching). (b) Large news site, impact of shingle matching, 4 shingles used (curves with and without shingles-filtered support).

Nevertheless, size matching has its limitations: valid rules (such as “ps” → “pdf”) are missed, which is the price paid for the increased precision. Figure 7(b) shows similar results for the large news site. When we do not use shingles-filtered support, the precision at the top 200 drops to 40%. Shingles-based filtering reduces the list of likely rules by roughly 70%. Most of the filtered rules turned out to be indeed invalid.

Validation. We now study how many validations are needed in order to declare that a rule is valid; that is, we study what the parameter N in Figure 5 should be set to. To this end, we run DustBuster with values of N ranging from 0 to 100, and check what percentage of the rules found to be valid with each value of N are also found valid when N=100. The results from conducting this experiment on the likely DUST rules found in 4 logs from the forum site and 4 from the academic site are shown in Figure 9 (similar results were obtained for the other sites). In all these experiments, 100% precision is reached after 40 validations. Moreover, results obtained in different logs are consistent with each other.

In these graphs, we only consider rules that DustBuster attempts to validate. Since many valid rules are removed (in line 6 of Figure 5) after rules that they refine are deemed valid, the percentage of valid rules among those that DustBuster attempts to validate is much smaller than the percentage of valid rules in the original list.


Fig. 8. Forum site, 4 different logs: precision@k of likely DUST rules detected in DustBuster's first two phases, without fetching actual content (plot omitted; precision vs. top k rules, one curve per log).

Our aggressive elimination of redundant rules reduces the number of rules we need to validate. For example, on one of the logs in the forum site, the validation phase was initiated with 28 pairs representing 56 likely rules (in both directions). Of these, only 19 were checked, and the rest were removed because they or their counterparts in the opposite direction were deemed valid either directly or since they refined valid rules. We conclude that the number of actual pages that need to be fetched in order to validate the rules is very small.

At the end of the validation phase, DustBuster outputs a list of valid substring substitution rules without redundancies. Table III shows the number of valid rules detected on each of the sites. The list of 7 rules found using one of the logs in the forum site is depicted in Figure 10 below. These 7 rules, or refinements thereof, appear in the outputs produced using each of the studied logs. Some studied logs include 1–3 additional rules, which are insignificant (have very small support). Similar consistency is observed in the academic site outputs. We conclude that the most significant DUST rules can be adequately detected using a fairly small log with roughly 15,000 unique URLs.

Web Site          Valid Rules Detected
Forum Site        7
Academic Site     52
Large News Site   62
Small News Site   5

Table III. The number of rules found to be valid.

Coverage. We now turn our attention to coverage, or the percentage of duplicate URLs discovered by DustBuster, in the academic site. When multiple URLs have the same document sketch, all but one of them are considered duplicates. In order to study the coverage achieved by DustBuster, we use two different logs from the same site: a training log and a test log. We run DustBuster on the training log in order to learn DUST rules, and we then apply these rules to the test log. We count what fraction of the duplicates in the test log are covered by the detected DUST rules. We detect duplicates in the test log by fetching the contents of all of its URLs and computing their document sketches. Figure 11 classifies these duplicates. As the figure shows, 47.1% of the duplicates in the test log are eliminated by DustBuster's canonization algorithm using rules discovered on another log. The rest of the DUST can be divided among several categories: (1) duplicate images and icons; (2) replicated documents (e.g., papers co-authored by multiple faculty members, whose copies appear on each of their web pages); and (3) "soft errors", i.e., pages with no meaningful content, such as error message pages, empty search results pages, etc.


Fig. 9. Precision among rules that DustBuster attempted to validate vs. number of validations used (N) (plots omitted). (a) Forum site, 4 different logs. (b) Academic site, 4 different logs.

1: “.co.il/story” → “.co.il/story?id=”
2: “&LastView=&Close=” → “”
3: “.php3?” → “?”
4: “.il/story” → “.il/story.php3?id=”
5: “&NewOnly=1&tvqz=2” → “&NewOnly=1”
6: “.co.il/thread” → “.co.il/thread?rep=”
7: “http://www.../story” → “http://www.../story?id=”

Fig. 10. The valid rules detected in the forum site.

Savings in crawl size. The next measure we use to evaluate the effectiveness of the method is the discovered redundancy, i.e., the percentage of URLs we can avoid fetching in a crawl by using the DUST rules to canonize the URLs. To this end, we performed a full crawl of the academic site, and recorded in a list all the URLs fetched. We performed canonization on this list using DUST rules learned from the crawl, and counted the number of unique URLs before (Ub) and after (Ua) canonization. The discovered redundancy is then given by (Ub − Ua)/Ub.


Fig. 11. DUST classification, academic site.

We found this redundancy to be 18% (see Table IV), meaning that the crawl could have been reduced by that amount. In the two news sites, the DUST rules were learned from the crawl logs, and we measured the reduction that can be achieved in the next crawl. By setting a slightly more relaxed refutation threshold (ε = 10%), we obtained a reduction of 26% in the small news site and 6% in the large one. In the case of the forum site, we used four logs to detect DUST rules, and used these rules to reduce a fifth log. The reduction achieved in this case was 4.7%. In all these experiments, the training and testing were done on logs of similar size.
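Expressed as a function of the unique-URL counts, the metric is simply the following (a trivial helper, included only for reference):

    def discovered_redundancy(Ub, Ua):
        """Fraction of the crawl that canonization makes redundant."""
        return (Ub - Ua) / Ub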

Web Site                  Reduction Achieved
Academic Site             18%
Small News Site           26%
Large News Site           6%
Forum Site (using logs)   4.7%

Table IV. Reductions in crawl size.

7. CONCLUSIONS

We have introduced the problem of mining site-specific DUST rules. Knowing about such rules can be very useful for search engines: it can reduce crawling overhead by up to 26% and thus increase crawl efficiency. It can also reduce indexing overhead. Moreover, knowledge of DUST rules is essential for canonizing URL names, and canonical names are very important for statistical analysis of URL popularity based on PageRank or traffic. We presented DustBuster, an algorithm for mining DUST very effectively from a URL list. The URL list can be obtained either from a web server log or from a crawl of the site.

Acknowledgments. We thank Tal Cohen and the forum site team, and Greg Pendler and the http://ee.technion.ac.il admins, for providing us with access to web server logs and for technical assistance. We thank Israel Cidon, Yoram Moses, and Avigdor Gal for their insightful input. We thank all our reviewers, both from the WWW 2007 conference and the TWEB journal, for their detailed and constructive suggestions.

REFERENCES

Apache HTTP Server version 2.2 configuration files. http://httpd.apache.org/docs/2.2/configuring.html.
AGRAWAL, R. AND SRIKANT, R. 1994. Fast algorithms for mining association rules. In the Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). 487–499.
ANALOG. http://www.analog.cx/.
BERNERS-LEE, T., FIELDING, R., AND MASINTER, L. Uniform resource identifiers (URI): Generic syntax. http://www.ietf.org/rfc/rfc2396.txt.
BHARAT, K. AND BRODER, A. Z. 1999. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content. Computer Networks 31, 11–16, 1579–1590.
BHARAT, K., BRODER, A. Z., DEAN, J., AND HENZINGER, M. R. 2000. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science 51, 12, 1114–1122.
BOGNAR, M. 1995. A survey on abstract rewriting. Available online at www.di.ubi.pt/~desousa/1998-1999/logica/mb.ps.


BRIN, S., DAVIS, J., AND GARCIA-MOLINA, H. 1995. Copy Detection Mechanisms for Digital Documents. In the Proceedings of the 14th Special Interest Group on Management of Data (SIGMOD). 398–409.
BRODER, A. Z., GLASSMAN, S. C., AND MANASSE, M. S. 1997. Syntactic clustering of the web. In the Proceedings of the 6th International World Wide Web Conference (WWW). 1157–1166.
CHO, J., SHIVAKUMAR, N., AND GARCIA-MOLINA, H. 2000. Finding replicated web collections. In the Proceedings of the 19th Special Interest Group on Management of Data (SIGMOD). 355–366.
DI IORIO, E., DILIGENTI, M., GORI, M., MAGGINI, M., AND PUCCI, A. 2003. Detecting Near-replicas on the Web by Content and Hyperlink Analysis. In the Proceedings of the 11th International World Wide Web Conference (WWW).
DOUGLIS, F., FELDMAN, A., KRISHNAMURTHY, B., AND MOGUL, J. 1997. Rate of change and other metrics: a live study of the world wide web. In the Proceedings of the 1st USENIX Symposium on Internet Technologies and Systems (USITS).
FINKEL, R. A., ZASLAVSKY, A. B., MONOSTORI, K., AND SCHMIDT, H. W. 2002. Signature extraction for overlap detection in documents. In the Proceedings of the 25th Australasian Computer Science Conference (ACSC). 59–64.
GARCIA-MOLINA, H., GRAVANO, L., AND SHIVAKUMAR, N. 1996. dSCAM: Finding document copies across multiple databases. In the Proceedings of the 4th International Conference on Parallel and Distributed Information Systems (PDIS). 68–79.
GAREY, M. R. AND JOHNSON, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.
GOOGLE INC. Google sitemaps. http://sitemaps.google.com.
GUSFIELD, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.
HOAD, T. C. AND ZOBEL, J. 2003. Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology 54, 3, 203–215.
JACCARD, P. 1908. Nouvelles recherches sur la distribution florale. 44, 223–270.
JAIN, N., DAHLIN, M., AND TEWARI, R. 2005. Using bloom filters to refine web search results. In the Proceedings of the 7th International Workshop on the Web and Databases (WebDB). 25–30.
KELLY, T. AND MOGUL, J. C. 2002. Aliasing on the world wide web: prevalence and performance implications. In the Proceedings of the 11th International World Wide Web Conference (WWW). 281–292.
KIM, S. J., JEONG, H. S., AND LEE, S. H. 2006. Reliable evaluations of URL normalization. In the Proceedings of the 4th International Conference on Computational Science and Its Applications (ICCSA). 609–617.
LIANG, H. 2001. A URL-String-Based Algorithm for Finding WWW Mirror Host. M.S. thesis, Auburn University.
MCCOWN, F. AND NELSON, M. L. 2006. Evaluation of crawling policies for a web-repository crawler. In the Proceedings of the 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT). 157–168.
MONOSTORI, K., FINKEL, R. A., ZASLAVSKY, A. B., HODASZ, G., AND PATAKI, M. 2002. Comparison of overlap detection techniques. In the Proceedings of the 10th International Conference on Complex Systems (ICCS). 51–60.
SHIVAKUMAR, N. AND GARCIA-MOLINA, H. 1998. Finding Near-Replicas of Documents and Servers on the Web. In the Proceedings of the 1st International Workshop on the Web and Databases (WebDB). 204–212.
STATCOUNTER. http://www.statcounter.com/.
WEBLOG EXPERT. http://www.weblogexpert.com/.
ZOBEL, J. AND MOFFAT, A. 1998. Exploring the similarity space. SIGIR Forum 32, 1, 18–34.

Received July 2007; revised June 2008; accepted October 2008
