A Survey of Web Metrics

DEVANSHU DHYANI, WEE KEONG NG, AND SOURAV S. BHOWMICK

Nanyang Technological University

Authors' address: College of Engineering, School of Computing Engineering, Nanyang Technological University, Blk N4-2A-32, 50 Nanyang Avenue, Singapore 639789, Singapore; email: {assourav,awkng}@ntu.edu.sg.

The unabated growth and increasing significance of the World Wide Web has resulted in a flurry of research activity to improve its capacity for serving information more effectively. But at the heart of these efforts lie implicit assumptions about “quality” and “usefulness” of Web resources and services. This observation points towards measurements and models that quantify various attributes of web sites. The science of measuring all aspects of information, especially its storage and retrieval, or informetrics, has interested information scientists for decades before the existence of the Web. Is Web informetrics any different, or is it just an application of classical informetrics to a new medium? In this article, we examine this issue by classifying and discussing a wide-ranging set of Web metrics. We present the origins, measurement functions, formulations and comparisons of well-known Web metrics for quantifying Web graph properties, Web page significance, Web page similarity, search and retrieval, usage characterization and information theoretic properties. We also discuss how these metrics can be applied for improving Web information access and use.

Categories and Subject Descriptors: H.1.0 [Models and Principles]: General; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.7.0 [Document and Text Processing]: General

General Terms: Measurement

Additional Key Words and Phrases: Information theoretic, PageRank, quality metrics, Web graph, Web metrics, Web page similarity

1. INTRODUCTION

The importance of measuring attributes of known objects in precise quantitative terms has long been recognized as crucial for enhancing our understanding of our environment. This notion has been aptly summarized by Lord Kelvin:

“When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of science.”

One of the earliest attempts to make global measurements about the Web was undertaken by Bray [1996]. The study attempts to answer simple questions on attributes such as the size of the Web, its connectivity, visibility of sites and the distribution of formats. Since then, several directly observable metrics such as hit counts, click-through rates, access distributions and so on have become popular for quantifying the usage of web sites. However, many of these metrics tend to be simplistic about the phenomena that influence the attributes they observe. For instance, Pitkow [1997] points out the problems with hit metering as a reliable usage metric caused by proxy and client caches. Given the organic growth of the Web, we require new metrics that provide deeper insight on the Web as a whole and also on individual sites from different perspectives. Arguably, the most important motivation for deriving such metrics is the role they can play in improving the quality of information available on the Web.

To clarify the exact meaning of frequently used terms, we supply the following definition [Boyce et al. 1994]:

Measurement, in most general terms, can be regarded as the assignment of numbers to objects (or events or situations) in accord with some rule [measurement function]. The property of the objects that determines the assignment according to that rule is called magnitude, the measurable attribute; the number assigned to a particular object is called its measure, the amount or degree of its magnitude. It is to be noted that the rule defines both the magnitude and the measure.

In this article, we provide a survey of well-known metrics for the Web with regard to their magnitudes and measurement functions. Based on the attributes they measure, these are classified into the following categories:

—Web Graph Properties. The World Wide Web can be represented as a graph structure where Web pages comprise nodes and hyperlinks denote directed edges. Graph-based metrics quantify structural properties of the Web on both macroscopic and microscopic scales.

—Web Page Significance. Significance metrics formalize the notions of “quality” and “relevance” of Web pages with respect to information needs of users. Significance metrics are employed to rate candidate pages in response to a search query and have an impact on the quality of search and retrieval on the Web.

—Usage Characterization. Patterns and regularities in the way users browse Web resources can provide invaluable clues for improving the content, organization and presentation of Web sites. Usage characterization metrics measure user behavior for this purpose.

—Web Page Similarity. Similarity metrics quantify the extent of relatedness between web pages. There has been considerable investigation into what ought to be regarded as indicators of a relationship between pages. We survey metrics based on different concepts as well as those that aggregate various indicators.

—Web Page Search and Retrieval. These are metrics for evaluating and comparing the performance of Web search and retrieval services.

—Information Theoretic. Information theoretic metrics capture properties related to information needs, production and consumption. We consider the relationships between a number of regularities observed in information generation on the Web.

We find that some of these metrics originate from diverse areas such as classical informetrics, library science, information retrieval, sociology, hypertext and econometrics. Others, such as Web-page-quality metrics, are entirely specific to the Web. Figure 1 shows a taxonomy of the metrics we discuss here.

Metrics, especially those measuring phenomena, are invariably proposed in the context of techniques for improving the quality and usefulness of measurable objects; in this case, information on the Web. As such, we also provide some insight on the applicability of Web metrics. However, one must understand that their usefulness is limited by the models that explain the underlying phenomena and establish causal relationships. A study of these metrics is a starting point for developing these models, which can eventually aid Web content providers in enhancing web sites and predicting the consequences of changes in certain attributes.

Fig. 1. A taxonomy of Web metrics.

2. WEB GRAPH PROPERTIES

Web graph properties are measured by considering the Web or a portion of it, such as a web site, as a directed hypertext graph where nodes represent pages and edges represent hyperlinks; this structure is referred to as the Web graph. Web graph properties reflect the structural organization of the hypertext and hence determine the readability and ease of navigation. Poorly organized web sites often cause user disorientation, leading to the “lost in cyberspace” problem. These metrics can aid web site authoring and help create sites that are easier to traverse. Variations of this model may label the edges with weights denoting, for example, connection quality or number of hyperlinks. To perform analysis at a coarser granularity, nodes may be employed to model entire web sites and edges the total strength of connectivity amongst web sites. In the classification below, we first consider the simplest hypertext graph model to discuss three types of graph properties introduced by Botafogo et al. [1992]; namely, centrality, global measures and local measures. Then, we discuss random graph models based on the theory of random networks.

Before discussing metrics for hypertext graph properties, we introduce some preliminary terms. The hypertext graph of N nodes (Web pages) can be represented as an N × N distance matrix C (see footnote 1), where element C_ij is the number of links that have to be followed to reach node j starting from node i, or simply the distance of j from i. If nodes i and j are unconnected in the graph, C_ij is set to a suitable predefined constant K. Figure 2 shows an example hyperlink graph. The distance matrix for this graph is shown in Table I.

1 We differ slightly from Botafogo et al. [1992], where this matrix is referred to as the converted distance matrix.

Fig. 2. Hyperlink graph example.

Table I. Distance Matrix and Associated Centrality Metrics; K = 5

Nodes   a      b      c      d      e      OD    ROC
a       0      1      1      1      1      4     17
b       1      0      2      2      3      8     8.5
c       5      5      0      5      1      16    4.25
d       5      5      5      0      5      20    3.4
e       5      5      5      5      0      20    3.4
ID      16     16     13     13     10     68
RIC     4.25   4.25   5.23   5.23   6.8

2.1. Centrality

Centrality measures reflect the extent of connectedness of a node with respect to other nodes in the graph. They can be used to define hierarchies in the hypertext with the most central node as the root. The out distance (OD) of a node i is defined as the sum of distances to all other nodes; that is, the sum of all entries in row i of the distance matrix C. Similarly, the in distance (ID) of node i is the sum of distances from all other nodes; that is, the sum of all entries in column i. Formally,

\[
OD_i = \sum_j C_{ij}, \qquad ID_i = \sum_j C_{ji}.
\]

In order to make the above metrics independent of the size of the hypertext graph, they are normalized by the converted distance, or the sum of all pairwise distances between nodes, thereby yielding the relative out centrality (ROC) and the relative in centrality (RIC) measures, respectively. Therefore,

\[
ROC_i = \frac{\sum_i \sum_j C_{ij}}{\sum_j C_{ij}}, \qquad RIC_i = \frac{\sum_i \sum_j C_{ij}}{\sum_j C_{ji}}.
\]

The calculation of centrality metrics for the graph in Figure 2 is shown in Table I. A central node is one with high values of relative in- or out-centrality, suggesting that the node is close to other nodes in hyperspace. Identification of a root node (a central node with high relative out-centrality) is the first step towards constructing easily navigable hypertext hierarchies. The hypertext graph may then be converted to a crosslinked tree structure using a breadth-first spanning tree algorithm to distinguish between hierarchical and cross-reference links.
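To make the computation concrete, the following short Python sketch (ours, not from the paper) recomputes the centrality values of Table I directly from the distance matrix, with K = 5 standing in for unreachable pairs:

```python
# Centrality metrics from a distance matrix (values reproduce Table I).
K = 5                                  # distance assigned to unconnected pairs
nodes = ["a", "b", "c", "d", "e"]

# Distance matrix C from Table I: C[i][j] = distance from node i to node j.
C = [
    [0, 1, 1, 1, 1],
    [1, 0, 2, 2, 3],
    [K, K, 0, K, 1],
    [K, K, K, 0, K],
    [K, K, K, K, 0],
]

CD = sum(sum(row) for row in C)        # converted distance: sum of all entries

for i, name in enumerate(nodes):
    OD = sum(C[i])                                  # out distance: row sum
    ID = sum(C[k][i] for k in range(len(nodes)))    # in distance: column sum
    print(f"{name}: OD={OD} ID={ID} ROC={CD / OD:.2f} RIC={CD / ID:.2f}")
```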

2.2. Global Metrics

Global metrics are concerned with the hypertext as a whole and not individual nodes. They are defined in a hierarchically organized hypertext where the hierarchical and cross-referencing links are distinguished. Two global metrics discussed here are the compactness and stratum. The compactness metric indicates the extent of cross-referencing; a high compactness means that each node can easily reach other nodes in the hypertext. Compactness varies between 0 and 1; a completely disconnected graph has compactness 0 while a fully connected graph has compactness 1. For high readability and navigation, both extremes of compactness values should be avoided. More formally,

\[
Cp = \frac{Max - \sum_i \sum_j C_{ij}}{Max - Min},
\]

where Max and Min are, respectively, the maximum and minimum values of the centrality normalization factor, the converted distance. It can be shown that Max and Min correspond to (N^2 − N)K (for a disconnected graph) and (N^2 − N) (for a fully connected graph), respectively. We note that the compactness Cp = 0 for a disconnected graph, as the converted distance becomes Max, and Cp = 1 when the converted distance equals Min for a fully connected graph.

The stratum metric captures the linear ordering of the Web graph. The concept of stratum characterizes the linearity in the structure of a Web graph. Highly linear web sites, despite their simplicity in structure, are often tedious to browse. The higher the stratum, the more linear the Web graph in question. Stratum is defined in terms of a sociometric measure called prestige. The prestige of a node i is the difference between its status, the sum of distances to all other nodes (or the sum of row i of the distance matrix), and its contrastatus, the sum of finite distances from all other nodes (or the sum of column i of the distance matrix). The absolute prestige is the sum of absolute values of prestige for all nodes in the graph. The stratum of the hypertext is defined as the ratio of its absolute prestige to the linear absolute prestige (the prestige of a linear hypertext with an equal number of nodes). This normalization by linear absolute prestige renders the prestige value insensitive to the hypertext size. Formally,

\[
S = \frac{\sum_i \left| \sum_j C_{ij} - \sum_j C_{ji} \right|}{LAP},
\]

where the linear absolute prestige LAP is the following function of the number of nodes N:

\[
LAP =
\begin{cases}
\dfrac{N^3}{4}, & \text{if } N \text{ is even},\\[1ex]
\dfrac{N^3 - N}{4}, & \text{otherwise}.
\end{cases}
\]

Table II. Distance Matrix and Stratum Related Metrics

Nodes          a    b    c    d    e    Status   Absolute prestige
a              0    1    1    1    1    4        3
b              1    0    2    2    3    8        7
c              ∞    ∞    0    ∞    1    1        2
d              ∞    ∞    ∞    0    ∞    0        3
e              ∞    ∞    ∞    ∞    0    0        5
Contrastatus   1    1    3    3    5    13       20

Stratum metrics for the graph of Figure 2 are shown in Table II. The computation of stratum only considers finite distances; hence, we invalidate unconnected entries in the distance matrix (denoted ∞). The linear absolute prestige for a graph of 5 nodes from the above formula is 30. The stratum of the graph can be calculated by normalizing the sum of absolute prestige (Table II) of all nodes by the LAP. For the graph of Figure 2, this equals 0.67.
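A corresponding sketch (again ours, on the same example) computes compactness and stratum; ∞ marks unreachable pairs and is replaced by K = 5 only in the compactness calculation:

```python
import math

INF = math.inf
K, N = 5, 5

# Finite distances of Table II; INF marks unreachable pairs.
D = [
    [0, 1, 1, 1, 1],
    [1, 0, 2, 2, 3],
    [INF, INF, 0, INF, 1],
    [INF, INF, INF, 0, INF],
    [INF, INF, INF, INF, 0],
]

# Compactness: substitute K for INF, then normalize the converted distance.
converted = sum(K if d == INF else d for row in D for d in row)
Max, Min = (N * N - N) * K, (N * N - N)
Cp = (Max - converted) / (Max - Min)

# Stratum: prestige = status (finite row sum) minus contrastatus (finite column sum).
status = [sum(d for d in row if d != INF) for row in D]
contrastatus = [sum(D[i][j] for i in range(N) if D[i][j] != INF) for j in range(N)]
absolute_prestige = sum(abs(s - c) for s, c in zip(status, contrastatus))
LAP = N ** 3 / 4 if N % 2 == 0 else (N ** 3 - N) / 4
print(f"compactness = {Cp:.2f}, stratum = {absolute_prestige / LAP:.2f}")  # 0.40, 0.67
```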

We conclude the survey of global metrics by citing some measurements of degree distributions that are reported in Kleinberg et al. [1999] and Kumar et al. [1999] and confirmed in Broder et al. [2000]. In experiments conducted on a subgraph of the Web, the in-degree distribution (in-degree versus frequency) has been found to follow Lotka's law. That is, the probability that a node has in-degree i is proportional to 1/i^α, where α is approximately 2. A similar observation holds for the out-degree distribution. In Dhyani [2001], we describe our own experiments to confirm these findings and use the observation as the premise for ascertaining distributions of some well-known hyperlink-based metrics and, subsequently, we derive other important informetric laws related to Lotka's law.

Attempts have also been made to study the macroscopic structure of the WWW. In their experiments on a crawl of over 200 million pages, Broder et al. [2000] found that over 90% of the Web comprises a single connected component (see footnote 2) if the links are treated as undirected edges. Of these, a core of approximately 56 million pages forms a strongly connected component. The maximum distance between any two pages, or the diameter, of this core is only 28 as compared to a diameter of over 500 for the entire Web graph. The probability that a path exists between two randomly chosen pages was measured to be 24%. The average directed path length is 16.

2 A portion of the Web graph such that there exists a path between any pair of pages.

The significance of the above observations is two-fold. First, they become the starting point for modelling the graph structure of the Web. For example, Kleinberg et al. [1999] have explained the degree distributions by modeling the process of copying links while creating Web pages. These models can be of use in predicting the behavior of algorithms on the Web and discovering other structural properties not evident from direct observation. Secondly, knowledge of the structure of the Web and its graph properties can lead to improved quality of Web search as demonstrated by hyperlink metrics such as PageRank [Brin and Page 1998], Authorities/Hubs [Kleinberg 1998], and Hyper-information [Marchiori 1997].

2.3. Local Metrics

Local metrics measure characteristics of individual nodes in the hypertext graph. We discuss two local metrics; namely, depth and imbalance [Botafogo et al. 1992]. The depth of a node is just its distance from the root. It indicates the ease with which the node in question can be reached and consequently its importance to the reader. That is, the bigger the distance of a node from the root, the harder it is for the reader to reach this node, and consequently the less important this node will be in the hypertext. Nodes that are very deep inside the hypertext are unlikely to be read by the majority of the readers [Botafogo et al. 1992]. Note that an author may intentionally store a low relevance piece of information deep inside the hypertext. Consequently, the readers whose interest is not so strong can browse the hypertext without seeing the low relevance information, while more interested readers will be able to have a deeper understanding of the subject by probing deeper into the hypertext. Having access to a depth metric, web-site designers can locate deep nodes and verify that they were intentional.

The imbalance metric is based on the assumption that each node in the hypertext contains only one idea and the links emanating from a node are a further development of that idea (except cross-reference links). Consequently, we might want the hypertext to be a balanced tree. The imbalance metric identifies nodes that are at the root of imbalanced trees and enables the web-site designer to identify imbalanced nodes. Observe that imbalance in a hypertext does not necessarily indicate poor design of the hypertext [Botafogo et al. 1992]. The crux of the matter is that each topic should be fully developed. Thus, in a university department hypertext, there may be many more levels of information on academic staff and their research areas than on locations of copying machines. However, we should expect that each of the major areas in the hypertext is treated with equal importance. Although balance is not mandatory, too much imbalance might indicate bias of the designer or a poorly designed web site [Botafogo et al. 1992]. Note that, similar to depth, the imbalance metric can be used as feedback to the authors. If they decide that the imbalances are desired, then they can overlook this information.

In order to quantify imbalance, Botafogo et al. [1992] proposed two imbalance metrics: absolute depth imbalance and absolute child imbalance. Let T be a general rooted tree and let a_1, a_2, ..., a_n be the children of node a. Then, the depth vector D(a) [Botafogo et al. 1992] is defined as follows:

\[
D(a) =
\begin{cases}
[\,1 + Max(D(a_1)),\ 1 + Max(D(a_2)),\ \ldots,\ 1 + Max(D(a_n))\,], & \text{if } a \text{ has children},\\
[\,0\,], & \text{if } a \text{ has no child } (n = 0),
\end{cases}
\]

where Max(D(a_i)) indicates the value of the largest element in the vector D(a_i). The depth vector is represented within square brackets. Intuitively, this vector indicates the maximum distance one can go by following each of the children of node a. The absolute depth imbalance of a node a is defined as the standard deviation of the elements in vector D(a); that is, the standard deviation of the distances one can go by successively following each of the children of a.

Similarly, the child vector C(a) [Botafogo et al. 1992] is defined as follows:

\[
C(a) =
\begin{cases}
\{\,1 + \sum C(a_1),\ 1 + \sum C(a_2),\ \ldots,\ 1 + \sum C(a_n)\,\}, & \text{if } a \text{ has children},\\
\{\,0\,\}, & \text{if } a \text{ has no child } (n = 0),
\end{cases}
\]

where \(\sum C(a_i)\) is the sum of all elements in vector C(a_i). The child vector is represented inside braces { } and indicates the size (number of elements) of the subtrees rooted at a_1, a_2, ..., a_n.


The absolute child imbalance for node a is the standard deviation of the elements in vector C(a); that is, it is the standard deviation of the number of nodes in the subtrees rooted at the children of a.
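The following small sketch (ours; the tree is hypothetical, and we use the population standard deviation, which Botafogo et al. do not specify) computes depth and child vectors and the two imbalance metrics:

```python
import statistics

# A hypothetical rooted tree: node -> list of children.
tree = {
    "root": ["a", "b"],
    "a": ["a1", "a2", "a3"],
    "b": [],
    "a1": [], "a2": [], "a3": [],
}

def depth_vector(node):
    """[1 + max depth below each child], or [0] for a leaf."""
    children = tree[node]
    return [1 + max(depth_vector(c)) for c in children] if children else [0]

def child_vector(node):
    """[1 + subtree size of each child], or [0] for a leaf."""
    children = tree[node]
    return [1 + sum(child_vector(c)) for c in children] if children else [0]

def imbalance(vector):
    # Standard deviation of the vector's elements (0 for a single child).
    return statistics.pstdev(vector) if len(vector) > 1 else 0.0

for n in ("root", "a"):
    d, c = depth_vector(n), child_vector(n)
    print(n, "depth", d, imbalance(d), "child", c, imbalance(c))
```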

2.4. Random Graph Models

The theory of random networks is concerned with the structure and evolution of large, intricate networks depicting the elements of complex systems and the interactions between them. For example, living systems form huge genetic networks whose vertices are proteins and whose edges represent the chemical interactions between them. Similarly, a large network is formed by the nervous system, whose vertices are nerve cells or neurons connected by axons. In social science, similar networks can be perceived between individuals and organizations. We describe random networks here in the context of another obvious instance of a large and complex network—the WWW.

The earliest model of the topology of these networks, known as the random graph model, due to Erdos and Renyi, is described in Barabasi et al. [1999]. Suppose we have a fixed-size graph of N vertices and each pair of vertices is connected with a probability p. The probability P(k) that a vertex has k edges is assumed to follow the Poisson distribution such that

\[
P(k) = e^{-\lambda} \frac{\lambda^k}{k!},
\]

where the mean \(\lambda\) is defined as

\[
\lambda = N \binom{N-1}{k} p^k (1-p)^{N-k-1}.
\]

Several random networks such as the WWW exhibit what is known as the small-world phenomenon, whereby the average distance between any pair of nodes is usually a small number. Measurements by Albert et al. [1999] show that, on average, two randomly chosen documents on the Web are a mere 19 clicks away. The small-world model accounts for this observation by viewing the N vertices as a one-dimensional lattice where each vertex is connected to its two nearest and next-nearest neighbors. Each edge is reconnected to a vertex chosen at random with probability p. The long range connections generated by this process decrease the distances between vertices, leading to the small-world phenomenon. Both of these models predict that the probability distribution of vertex connectivity has an exponential cutoff.

Barabasi and Albert [1999] reported from their study of the topology of several large networks that, irrespective of the type of network, the connectivity distribution (or the probability P(k) that a vertex in the network interacts with k other vertices) decays as a power law, that is, P(k) ∼ k^{−γ}, where γ is a small constant usually between 2.1 and 4, depending on the network in question. This observation has been confirmed for the WWW by measurements of Web-page degree distributions [Kleinberg et al. 1999; Kumar et al. 1999]. The power law distribution suggests that the connectivity of large random networks is free of scale, an implication inconsistent with the traditional random network models outlined above.

The random network models outlined earlier do not incorporate two generic aspects of real networks. First, they assume that the number of vertices N remains fixed over time. In contrast, most real networks are constantly changing due to additions and deletions of nodes. Typically, the number of vertices N increases throughout the lifetime of the network. The number of pages in the WWW in particular is reportedly growing exponentially over time. Second, the random network models assume that the probability of an edge existing between two vertices is uniformly distributed. However, most real networks exhibit what is known as preferential connectivity, that is, newly added vertices are more likely to establish links with vertices having higher connectivity. In the WWW context, this manifests in the propensity of Web authors to include links to highly connected documents in their web pages. These ingredients form the basis of the scale-free model proposed by Barabasi and Albert [Albert and Barabasi 2000; Barabasi et al. 1999, 2000]:

—To incorporate the growth of the network, starting with a small number m_0 of vertices, we add, at each time step, a new vertex with m (< m_0) edges that link to m vertices already present in the network.

—To incorporate preferential attachment, we assume that the probability Π that a new vertex will be connected to vertex i is proportional to the relative connectivity of that vertex, that is, Π(k_i) = k_i / ∑_j k_j.

After t time steps, the above assumptions lead to a network with t + m_0 vertices and mt new edges. It can be shown that the network evolves to a scale-invariant state with the probability that a vertex has k edges following a power law with an exponent γ ≈ 3, thereby reproducing the observed behavior of the WWW and other random networks. Note that the value of γ is dependent on the exact form of the growth and preferential attachment functions defined above, and different values would be obtained if, for instance, the linear preferential attachment function were to be replaced by an exponential function.
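The growth process is easy to simulate. The sketch below (ours, a simplified rendering of the model just described; parameter values are arbitrary) grows a network by linear preferential attachment and prints the resulting degree histogram, whose tail should roughly follow P(k) ∼ k^−3:

```python
import random
from collections import Counter

def scale_free_graph(steps, m0=3, m=2, seed=42):
    random.seed(seed)
    degree = Counter({v: 0 for v in range(m0)})
    # One entry per edge endpoint, so sampling is proportional to degree;
    # the m0 seed vertices get one entry each so early arrivals can attach.
    targets = list(range(m0))
    for t in range(steps):
        new = m0 + t
        chosen = set()
        while len(chosen) < m:             # m distinct neighbors per new vertex
            chosen.add(random.choice(targets))
        for v in chosen:
            degree[new] += 1
            degree[v] += 1
            targets.extend([new, v])       # update the attachment weights
    return degree

degree = scale_free_graph(steps=5000)
histogram = Counter(degree.values())
for k in sorted(histogram)[:10]:
    print(k, histogram[k])
```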

3. WEB PAGE SIGNIFICANCE

Perhaps the most well-known Web metrics are significance metrics. The significance of a web page can be viewed from two perspectives—its relevance to a specific information need, such as a user query, and its absolute quality irrespective of particular user requirements. Relevance metrics relate to the similarity of Web pages with driving queries using a variety of models for performing the comparison. Quality metrics typically use link information to distinguish frequently referred pages from less visible ones. However, as we shall see, the quality metrics discussed here are more sophisticated than simple in-degree counts. The most obvious use of significance metrics is in Web search and retrieval, where the most relevant and high quality set of pages must be selected from a vast index in response to a user query. The introduction of quality metrics has been a recent development for public search engines, most of which relied earlier on purely textual comparisons of keyword queries with indexed pages for assigning relevance scores. Engines such as Google [Brin and Page 1998] use a combination of relevance and quality metrics in ranking the responses to user queries.

3.1. Relevance

Information retrieval techniques have been adapted to the Web for determining relevance of web pages to keyword queries. We present four algorithms for relevance ranking as discussed by Yuwono and Lee [1996]; namely, Boolean spread activation, most-cited, TFxIDF, and vector spread activation. The first two rely on hyperlink structure without considering term frequencies (to be explained later). The latter two are based on the vector space model, which represents documents and queries as vectors for calculating their similarity. Strictly speaking, relevance is a subjective notion as described in Yuwono and Lee [1996]:

“. . . a WWW page is considered relevant to a query if, by accessing the page, the user can find a resource (URL) containing information pertinent to the query, or the page itself is such a resource.”

As such, the relevance score metrics detailed below are means to identify web pages that are potentially useful in locating the information sought by the user. We first introduce the notation to be used in defining the relevance metrics.

M         Number of query words
Q_j       The jth query term, for 1 ≤ j ≤ M
N         Number of WWW pages in the index
P_i       The ith page or its ID
R_{i,q}   Relevance score of P_i with respect to query q
Li_{i,k}  Occurrence of an incoming link from P_k to P_i
Lo_{i,k}  Occurrence of an outgoing link from P_i to P_k
X_{i,j}   Occurrence of Q_j in P_i

3.1.1. Boolean Spread Activation. In the Boolean model, the relevance score is simply the number of query terms that appear in the document. Since only conjunctive queries can be ranked using this model, disjunctions and negations have to be transformed into conjunctions. Boolean spread activation extends this model by propagating the occurrence of a query word in a document to its neighboring documents (see footnote 3). Thus,

\[
R_{i,q} = \sum_{j=1}^{M} I_{i,j},
\]

where

\[
I_{i,j} =
\begin{cases}
c_1 & \text{if } X_{i,j} = 1,\\
c_2 & \text{if there exists } k \text{ such that } X_{k,j} = 1 \text{ and } Li_{i,k} + Lo_{i,k} > 0,\\
0 & \text{otherwise}.
\end{cases}
\]

3 Assuming that documents linked to one another have some semantic relationships.

The constant c_2 (c_2 < c_1) determines the (indirect) contribution from neighboring documents containing a query term. We may further enhance the Boolean spread activation model of Yuwono and Lee [1996] through two (alternative) recursive definitions of the contribution of each query term I_{i,j} as follows:

\[
(1)\quad I_{i,j} =
\begin{cases}
1 & \text{if } X_{i,j} = 1,\\
c\,I_{k,j} & \text{if there exists } k \text{ such that } 0 < c < 1,\ X_{k,j} = 1 \text{ and } Li_{i,k} + Lo_{i,k} > 0;
\end{cases}
\]

\[
(2)\quad I_{i,j} = X_{i,j} + c\,I_{k,j}; \qquad 0 < c < 1 \text{ and } Li_{i,k} + Lo_{i,k} > 0.
\]

Note that the above definition is recursive, as the rank of a page is a function of whether the search term appears in it (term X_{i,j}) or in a page that it is connected to (c I_{k,j}) through in- or outlinks. The further the page that contains the search term is from the given page, the lower its contribution to the score (due to the positive coefficient c < 1). Although the definition recursively states the rank based on the contribution of neighboring pages (which in turn use their neighbors), the computation can indeed be done iteratively (because it is a case of tail recursion). The choice of implementation is not a subject of discussion here.
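As an illustration of the nonrecursive form, the sketch below (our own toy corpus and constants) scores pages by Boolean spread activation: c1 for a query term appearing in the page itself, c2 for a term found only in a directly linked neighbor:

```python
def boolean_spread(pages, links, query, c1=1.0, c2=0.5):
    """pages: {page_id: set of terms}; links: set of (src, dst) pairs."""
    neighbors = {p: set() for p in pages}
    for src, dst in links:                  # in- and out-links both count
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    scores = {}
    for i in pages:
        r = 0.0
        for term in query:
            if term in pages[i]:
                r += c1                     # term occurs in the page itself
            elif any(term in pages[k] for k in neighbors[i]):
                r += c2                     # term occurs in a neighboring page
        scores[i] = r
    return scores

pages = {"p1": {"web", "metrics"}, "p2": {"graph"}, "p3": {"metrics"}}
links = {("p1", "p2"), ("p2", "p3")}
print(boolean_spread(pages, links, query=["web", "metrics", "graph"]))
```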

3.1.2. Most-Cited. Each page is assigned a score, which is the sum of the number of query words contained in other pages having a hyperlink referring to the page (citing). This algorithm assigns higher scores to referenced documents rather than referencing documents.

\[
R_{i,q} = \sum_{k=1,\, k \neq i}^{N} \left( Li_{i,k} \sum_{j=1}^{M} X_{k,j} \right).
\]

The most-cited relevance metric may be combined with the recursive definition of Boolean spread activation to overcome the above problem as follows:

\[
I_{i,j} = X_{i,j} + c\,(Li_{i,k} + Lo_{i,k})\,I_{k,j}; \qquad 0 < c < 1.
\]

We note two benefits of the (Li_{i,k} + Lo_{i,k}) coefficient. First, the new metric is unbiased with regard to citing and cited pages. Second, the contribution from neighboring pages is scaled by the degree of connectivity.

3.1.3. TFxIDF. Based on the vector space model, the relevance score of a document is the sum of weights of the query terms that appear in the document, normalized by the Euclidean vector length of the document. The weight of a term is a function of the word's occurrence frequency (also called the term frequency (TF)) in the document and the number of documents containing the word in the collection (the inverse document frequency (IDF)).

\[
R_{i,q} = \frac{\sum_{Q_j} \left( 0.5 + 0.5\,\frac{TF_{i,j}}{TF_{i,\max}} \right) IDF_j}
{\sqrt{\sum_{j \in P_i} \left( 0.5 + 0.5\,\frac{TF_{i,j}}{TF_{i,\max}} \right)^2 (IDF_j)^2}},
\]

where

TF_{i,j}     Term frequency of Q_j in P_i
TF_{i,max}   Maximum term frequency of a keyword in P_i
IDF_j        log( N / ∑_{i=1}^{N} X_{i,j} )

The weighting function (the product of TF and IDF as in Lee et al. [1997]) gives higher weights to terms that occur frequently in a small set of documents. A less expensive evaluation of the relevance score leaves out the denominator in the above (i.e., the normalization factor). The performance of several approximations of the relevance score by the vector-space model is considered in Lee et al. [1997].
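The following sketch (ours; the toy corpus is invented, and, following the prose above, only query terms that actually appear in a document contribute to its numerator) computes the normalized TFxIDF relevance score:

```python
import math
from collections import Counter

def tfxidf_scores(docs, query):
    """docs: {doc_id: list of terms}; query: list of terms."""
    N = len(docs)
    tf = {d: Counter(terms) for d, terms in docs.items()}

    def idf(term):
        df = sum(1 for d in docs if term in tf[d])
        return math.log(N / df) if df else 0.0

    scores = {}
    for d in docs:
        tf_max = max(tf[d].values())

        def weight(t):
            return (0.5 + 0.5 * tf[d][t] / tf_max) * idf(t)

        numerator = sum(weight(t) for t in query if t in tf[d])
        # Normalize by the Euclidean length of the document's weight vector.
        denominator = math.sqrt(sum(weight(t) ** 2 for t in tf[d]))
        scores[d] = numerator / denominator if denominator else 0.0
    return scores

docs = {
    "d1": "web metrics survey web".split(),
    "d2": "graph theory metrics".split(),
    "d3": "cooking recipes".split(),
}
print(tfxidf_scores(docs, query=["web", "metrics"]))
```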

3.1.4. Vector Spread Activation. The vector space model is often criticized for not taking into account hyperlink information, as was done in web page quality models [Brin and Page 1998; Kleinberg 1998; Marchiori 1997]. The vector spread activation method incorporates score propagation as done in Boolean spread activation. Each Web page is assigned a relevance score (according to the TFxIDF model) and the score of a page is propagated to those it references. That is, given the score of P_i as S_{i,q} and the link weight 0 < α < 1,

\[
R_{i,q} = S_{i,q} + \sum_{j=1,\, j \neq i}^{N} \alpha\, Li_{i,j}\, S_{j,q}.
\]
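A brief sketch of this propagation step, applied on top of precomputed TFxIDF scores (the page names and numbers are ours):

```python
def vector_spread(S, in_links, alpha=0.2):
    """S: {page: TFxIDF score}; in_links: {page: set of pages that link to it}."""
    return {
        i: S[i] + alpha * sum(S[j] for j in in_links.get(i, set()))
        for i in S
    }

S = {"p1": 0.9, "p2": 0.4, "p3": 0.1}
in_links = {"p1": {"p2"}, "p3": {"p1", "p2"}}
print(vector_spread(S, in_links))  # p3 gains alpha * (S[p1] + S[p2])
```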

However, experiments [Yuwono and Lee 1996] show that the vector spread activation model performs only marginally better than TFxIDF in terms of retrieval effectiveness measured by precision and recall (to be introduced later). Table III summarizes the average precision for each search algorithm for 56 test queries as highlighted in Yuwono and Lee [1996].

Table III. The Average Precision for Each Search Algorithm

Algorithm           Boolean spread activation   Most-cited   TFxIDF   Vector spread activation
Average precision   0.63                        0.58         0.75     0.76

Application of the above query relevance models in search services can be enhanced through relevance feedback [Lee et al. 1997; Yuwono et al. 1995]. If the user is able to identify some of the references as relevant, then certain terms from these documents can be used to reformulate the original query into a new one that may capture some concepts not explicitly specified earlier.

3.2. Quality

Recent work in Web search has demonstrated that the quality of a Web page is dependent on the hyperlink structure in which it is embedded. Link structure analysis is based on the notion that a link from a page p to a page q can be viewed as an endorsement of q by p, and as some form of positive judgment by p of q's content. Two important types of techniques in link-structure analysis are co-citation based schemes and random-walk-based schemes. The main idea behind co-citation based schemes is the notion that, when two pages p1 and p2 both point to some page q, it is reasonable to assume that p1 and p2 share a mutual topic of interest. Likewise, when p links to both q1 and q2, it is probable that q1 and q2 share some mutual topic. On the other hand, random-walk-based schemes model the Web (or part of it) as a graph where pages are nodes and links are edges, and apply some random walk model to the graph. Pages are then ranked by the probability of visiting them in the modeled random walk.

In this section, we discuss some of the metrics for link structure analysis. Each of these metrics is recursively defined for a Web page in terms of the measures of its neighboring pages and the degree of its hyperlink association with them. Quality metrics can be used in conjunction with relevance metrics to rank results of keyword searches. In addition, due to their independence from specific query contexts, they may be used generically for a number of purposes. We mention these applications in the context of individual quality metrics. Also, page quality measures do not rely on page contents, which makes them convenient to ascertain and, at the same time, makes sinister “spamdexing” schemes (see footnote 4) relatively more difficult to implement.

4 The judicious use of strategic keywords that makes pages highly visible to search engine users irrespective of the relevance of their contents.

3.2.1. Hyper-information Content. According to Marchiori [1997], the overall information of a Web object is not composed only of its static textual information, but also of hyper information, which is the measure of the potential information of a Web object with respect to the Web space. Roughly, it measures how much information one can obtain using that page with a browser, navigating starting from it. Suppose the functions I(p), T(p) and H(p) denote the overall information, textual information and hyper information, respectively, of p, and that they map Web pages to nonnegative real numbers with an upper bound of 1. Then,

\[
I(p) = T(p) + H(p).
\]

Then it can be shown that, given a page p that points to q, and a suitable fading factor f (0 < f < 1), we have

\[
I(p) = T(p) + f\,T(q).
\]

It can be shown that the contribution of a page q_n at a distance of n clicks from p to p's information I(p) is f^n T(q_n). To model the case when p contains several hyperlinks without violating the bounded property of I(p), the sequential selection of links b_1, b_2, ..., b_n embedded in p is interpreted as follows:

\[
I(p) = T(p) + f\,T(b_1) + \cdots + f^n\,T(b_n).
\]

Therefore, the sequential selection of links contained in p is interpreted in the same way as following consecutive links starting at p.
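A toy sketch of this interpretation (ours; the textual scores, links and fading factor are invented):

```python
def overall_information(page, T, out_links, fading=0.5, max_links=3):
    """T: {page: textual score in [0, 1]}; out_links: {page: ordered list of links}."""
    total = T.get(page, 0.0)
    factor = fading
    # Sequential links b1, b2, ... contribute f*T(b1) + f^2*T(b2) + ...,
    # i.e. they are treated like consecutive clicks starting at `page`.
    for target in out_links.get(page, [])[:max_links]:
        total += factor * T.get(target, 0.0)
        factor *= fading
    return total

T = {"p": 0.4, "q1": 0.8, "q2": 0.6}
out_links = {"p": ["q1", "q2"]}
print(overall_information("p", T, out_links))  # 0.4 + 0.5*0.8 + 0.25*0.6 = 0.95
```

The sketch only discounts the links listed directly on the page; a fuller implementation would also recurse into pages reachable in several clicks, each contributing f^n T(q_n).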

In order to validate the above notion of hyper information, Marchiori [1997] implemented the hyper information as a post-processor of the search engines available at the time of his work (Excite, HotBot, Lycos, WebCrawler and OpenText). The post-processor remotely queries the search engines, extracts the corresponding scores (the T(p) function), and calculates the hyper information and therefore the overall information by fixing the depth and fading factors in advance. Table IV [Marchiori 1997] shows the evaluation increment for each search engine with respect to its hyper version and the corresponding standard deviation. As can be seen, the small standard deviations are empirical evidence of the improvement of the quality of information provided by the hyper search engines over their nonhyper versions.

Table IV. Evaluation of Search Engines vs. their Hyper Versions

                       Excite   HotBot   Lycos   WebCrawler   OpenText   Average
Evaluation increment   +5.1     +15.1    +16.4   +14.3        +13.7      +12.9
Std deviation          2.2      4.1      3.6     1.6          3.0        2.9

3.2.2. Impact Factor. The impact factor, which originated from the field of bibliometrics, produces a quantitative estimate of the significance of a scientific journal. Given the inherent similarity of web sites to journals arising from the analogy between page hyperlinks and paper citations, the impact factor can also be used to measure the significance of web sites. The impact factor [Egghe and Rousseau 1990] of a journal is the ratio of all citations to a journal to the total number of source items (that contain the references) published over a given period of time. The number of citations to a journal, or its in-degree, is limited in depicting its standing. As pointed out in Egghe and Rousseau [1990], it does not contain any correction for the average length of individual papers. Second, citations from all journals are regarded as equally important. A more sophisticated citation-based impact factor than the normalized in-degree count, as proposed by Pinski and Narin, is discussed in Egghe and Rousseau [1990] and Kleinberg [1998]. We follow the more intuitive description in Kleinberg [1998].

The impact of a journal j is measured by its influence weight w_j. Modeling the collection of journals as a graph, where the nodes denoting journals are labeled by their influence weights and the directed edges by the connection strength between two nodes, the connection strength S_ij on the edge ⟨i, j⟩ is defined as the fraction of citations from journal i to journal j. Following the definitions above, the influence weight of journal j is the sum of influence weights of its citing journals scaled by their respective connection strengths with j. That is,

\[
w_j = \sum_i w_i S_{ij}.
\]

The nonzero, nonnegative solution w to the above system of equations (w = S^T w) is given by the principal eigenvector of S^T (see footnote 5).

5 Let M be an n × n matrix. An eigenvalue of M is a number λ with the property that, for some vector ω, we have Mω = λω. When the assumption that |λ_1(M)| > |λ_2(M)| holds, ω_1(M) is referred to as the principal eigenvector and all other ω_i(M) as nonprincipal eigenvectors.
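A small power-iteration sketch (ours; the journals and citation fractions are invented) for solving w = S^T w:

```python
def influence_weights(S, iterations=100):
    """S[i][j] is the fraction of citations from journal i that go to journal j."""
    n = len(S)
    w = [1.0 / n] * n
    for _ in range(iterations):
        w_new = [sum(w[i] * S[i][j] for i in range(n)) for j in range(n)]  # w <- S^T w
        norm = sum(w_new) or 1.0
        w = [x / norm for x in w_new]      # keep the weights normalized
    return w

S = [
    [0.0, 0.7, 0.3],   # journal 0 sends 70% of its citations to journal 1, 30% to 2
    [0.5, 0.0, 0.5],
    [0.2, 0.8, 0.0],
]
print(influence_weights(S))
```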

3.2.3. PageRank. The PageRank measure [Brin and Page 1998] extends other citation-based ranking measures, which merely count the citations to a page. Intuitively, a page has a high PageRank if there are many pages that point to it, or if there are some pages with high PageRank that point to it. Let N be the set of pages that point to a page p and C(p) be the number of links going out of a page p. Then, given a damping factor d, 0 ≤ d ≤ 1, the PageRank of p is defined as follows:

\[
R(p) = (1 - d) + d \sum_{q \in N} \frac{R(q)}{C(q)}.
\]

The PageRank may also be considered asthe probability that a random surfer [Brinand Page 1998] visits the page. A ran-dom surfer who is given a Web page atrandom, keeps clicking on links, withouthitting the “back” button but eventuallygets bored and starts from another ran-dom page. The probability that the ran-dom surfer visits a page is its PageRank.The damping factor d in R(p) is the prob-ability at each page the random surfer willget bored and request for another randompage. This ranking is used as one compo-nent of the Google search engine [Brin andPage 1998] to help determine how to or-der the pages returned by a Web searchquery. The score of a page with respect toa query in Google is obtained by combin-ing the position, font and capitalization in-formation stored in hitlists (the IR score)with the PageRank measure. User feed-back is used to evaluate search results andadjust the ranking functions. Cho et al.[1998] describe the use of PageRank forordering pages during a crawl so that themore important pages are visited first. Ithas also been used for evaluating the qual-ity of search engine indexes using randomwalks [Henzinger et al. 1999].
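The iteration itself is compact; the sketch below (ours, on a three-page toy graph) applies the formula exactly as given above:

```python
def pagerank(out_links, d=0.85, iterations=50):
    """out_links: {page: set of pages it links to}; every page must have out-links."""
    pages = list(out_links)
    R = {p: 1.0 for p in pages}
    in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}
    for _ in range(iterations):
        R = {
            p: (1 - d) + d * sum(R[q] / len(out_links[q]) for q in in_links[p])
            for p in pages
        }
    return R

out_links = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(pagerank(out_links))
```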

3.2.4. Mutual Reinforcement Approach. A method that treats hyperlinks as conferrals of authority on pages for locating relevant, authoritative WWW pages for a broad topic query is introduced by Kleinberg [1998]. He suggested that Web-page importance should depend on the search query being performed. This model is based on a mutually reinforcing relationship between authorities—pages that contain a lot of information about a topic—and hubs—pages that link to many related authorities. That is, each page should have a separate authority rating based on the links going to the page and a hub rating based on the links going from the page. Kleinberg proposed first using a text-based Web search engine to get a Root Set consisting of a short list of Web pages relevant to a given query. Second, the Root Set is augmented by pages that link to pages in the Root Set, and also pages that are linked from pages in the Root Set, to obtain a larger Base Set of Web pages. If N is the number of pages in the final Base Set, then the data of Kleinberg's algorithm consists of an N × N adjacency matrix A, where A_ij = 1 if there are one or more hypertext links from page i to page j; otherwise, A_ij = 0.

Formally, given a focused subgraph that contains a relatively small number of pages relevant to a broad topic (see footnote 6), the following rule is used to iteratively update the authority and hub weights (denoted x_p and y_p, respectively, and initialized to 1) of a page p:

\[
x_p = \sum_{q\,:\,q \rightarrow p} y_q \qquad \text{and, dually,} \qquad y_p = \sum_{q\,:\,p \rightarrow q} x_q.
\]

The weights are normalized after each iteration to prevent them from overflowing. At the end of an arbitrarily large number of iterations, the authority and hub weights converge to fixed values. Pages with weights above a certain threshold can then be declared as authorities and hubs, respectively. If we represent the focused subgraph as an adjacency matrix A, where A_ij = 1 if there exists a link from page i to page j and A_ij = 0 otherwise, it has been shown that the authority and hub vectors (x = {x_p} and y = {y_p}) converge to the principal eigenvectors of A^T A and A A^T, respectively [Kleinberg 1998].

6 The focused subgraph is constructed from a Root Set obtained from a search engine query.

Authority and hub weights can be used to enhance Web search by identifying a small set of high-quality pages on a broad topic [Chakrabarti et al. 1998a, 1998b]. Pages related to a given page p can be found by finding the top authorities and hubs among pages in the vicinity of p [Dean and Henzinger 1999]. The same algorithm has also been used for finding densely linked communities of hubs and authorities [Gibson et al. 1998].
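To make the mutual reinforcement concrete, here is a small sketch (ours) of the iteration with normalization after each round, on an invented four-page base set:

```python
import math

def hits(out_links, iterations=50):
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}

    def normalize(weights):
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        return {p: w / norm for p, w in weights.items()}

    for _ in range(iterations):
        auth = normalize({p: sum(hub[q] for q in in_links[p]) for p in pages})   # x_p
        hub = normalize({p: sum(auth[q] for q in out_links[p]) for p in pages})  # y_p
    return auth, hub

out_links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
print(hits(out_links))
```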

One of the limitations of Kleinberg's [1998] mutual reinforcement principle is that it is susceptible to the Tightly Knit Communities (TKC) effect. The TKC effect occurs when a community achieves high scores in link-analysis algorithms even as sites in the TKC are not authoritative on the topic, or pertain to just one aspect of the topic. A striking example of this phenomenon is provided by Cohn and Chang [2000]. They use Kleinberg's algorithm with the search term “jaguar,” and converge to a collection of sites about the city of Cincinnati! They found that the cause of this is a large number of online newspaper articles in the Cincinnati Enquirer which discuss the Jacksonville Jaguars football team, and all link to the same Cincinnati Enquirer service pages.

3.2.5. Rafiei and Mendelzon's Approach. Generalizations of both the PageRank and authorities/hubs models for determining the topics on which a page has a reputation are considered by Rafiei and Mendelzon [2000]. In the one-level influence propagation model of PageRank, a surfer performing a random walk may jump to a page chosen uniformly at random with probability d, or follow an outgoing link from the current page. Rafiei and Mendelzon [2000] introduce topic-specific surfing into this model and parameterize the step of the walk at which the rank is calculated. Given that N_t denotes the number of pages that address topic t, the probability that a page p will be visited in a random jump during the walk is d/N_t if p contains t, and zero otherwise. The probability that the surfer visits p after n steps, following a link from page q at step n − 1, is ((1 − d)/O(q)) R_{n−1}(q, t), where O(q) is the number of outgoing links in q and R_{n−1}(q, t) denotes the probability of visiting q for topic t at step n − 1. The stochastic matrix containing pairwise transition probabilities according to the above model can be shown to be aperiodic and irreducible, thereby converging to stationary state probabilities as n → ∞. In the two-level influence propagation model of authorities and hubs [Kleinberg 1998], outgoing links can be followed directly from the current page p, or indirectly through a random page q that has a link to p.

3.2.6. SALSA. Lempel and Moran [2000] propose the Stochastic Approach for Link Structure Analysis (SALSA). This approach is based upon the theory of Markov Chains and relies on the stochastic properties of random walks (see footnote 7) performed on a collection of sites. Like Kleinberg's algorithm, SALSA starts with a similarly constructed Base Set. It then performs a random walk by alternately (a) going uniformly to one of the pages that link to the current page, and (b) going uniformly to one of the pages linked to by the current page. The authority weights are defined to be the stationary distribution of the two-step chain doing first step (a) and then (b), while the hub weights are defined to be the stationary distribution of the two-step chain doing first step (b) and then (a).

7 According to [Rafiei and Mendelzon 2000], a random walk on a set of states S = {s_1, s_2, ..., s_n} corresponds to a sequence of states, one for each step of the walk. At each step, the walk switches to a new state or remains in the current state. A random walk is Markovian if the transition at each step is independent of the previous steps and only depends on the current state.

Formally, let B(i) = {k : k → i} denote the set of all nodes that point to i, and let F(i) = {k : i → k} denote the set of all nodes that we can reach from i by following a forward link. It can be shown that the Markov Chain for the authorities has transition probabilities

\[
P_a(i, j) = \sum_{k\,:\,k \in B(i) \cap B(j)} \frac{1}{|B(i)|} \cdot \frac{1}{|F(k)|}.
\]

Assume for the time being that the Markov Chain is irreducible, that is, the underlying graph structure consists of a single connected component. The authors prove that the stationary distribution a = (a_1, a_2, ..., a_N) of the Markov Chain satisfies a_i = |B(i)|/|B|, where B = ⋃_i B(i) is the set of all backward links. Similarly, the Markov Chain for the hubs has transition probabilities

\[
P_h(i, j) = \sum_{k\,:\,k \in F(i) \cap F(j)} \frac{1}{|F(i)|} \cdot \frac{1}{|B(k)|}.
\]

Lempel and Moran [2000] proved that the stationary distribution h = (h_1, h_2, ..., h_N) of the Markov Chain satisfies h_i = |F(i)|/|F|, where F = ⋃_i F(i) is the set of all forward links.

If the underlying graph of the Base Set consists of more than one component, then the SALSA algorithm selects a starting point uniformly at random and performs a random walk within the connected component that contains that node.
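For a single connected component, the stationary weights have the closed form quoted above, which the following sketch (ours; |B| and |F| are read as the total numbers of backward and forward links in the component) computes directly:

```python
def salsa_scores(links):
    """links: set of (src, dst) pairs within one connected component."""
    pages = {p for edge in links for p in edge}
    B = {p: {s for s, t in links if t == p} for p in pages}   # back-link sets B(i)
    F = {p: {t for s, t in links if s == p} for p in pages}   # forward-link sets F(i)
    total = len(links)
    auth = {p: len(B[p]) / total for p in pages}              # a_i = |B(i)| / |B|
    hub = {p: len(F[p]) / total for p in pages}               # h_i = |F(i)| / |F|
    return auth, hub

links = {("h1", "a1"), ("h1", "a2"), ("h2", "a1")}
auth, hub = salsa_scores(links)
print(auth)   # a1 takes 2/3 of the authority weight, a2 takes 1/3
print(hub)
```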

Observe that SALSA does not have the same mutually reinforcing structure that Kleinberg's algorithm does [Borodin et al. 2001]. Since a_i = |B(i)|/|B|, the relative authority of site i within a connected component is determined by local links, not by the structure of the component. Also, in the special case of a single component, SALSA can be viewed as a one-step truncated version of Kleinberg's algorithm [Borodin et al. 2001]. Furthermore, Kleinberg ranks the authorities based on the structure of the entire graph, and tends to favor the authorities of tightly knit communities. SALSA ranks the authorities based on their popularity in the immediate neighborhood, and favors various authorities from different communities. Specifically, in SALSA, the TKC effect is overcome through random walks on a bipartite Web graph for identifying authorities and hubs. It has been shown that the resulting Markov chains are ergodic and that high entries in the stationary distributions represent sites most frequently visited in the random walk. If the Web graph is weighted, the authority and hub vectors can be shown to have stationary distributions with scores proportional to the sum of weights on incoming and outgoing edges, respectively. This result suggests a simpler calculation of authority/hub weights than through the mutual reinforcement approach.

3.2.7. Approach of Borodin et al. Borodin et al. [2001] proposed a set of algorithms for hypertext link analysis. We highlight some of these algorithms here. The authors proposed a series of algorithms based on minor modifications of Kleinberg's algorithm to eliminate the previously mentioned errant behavior of Kleinberg's algorithm. They proposed an algorithm called the Hub-Averaging-Kleinberg Algorithm, which is a hybrid of the Kleinberg and SALSA algorithms, as it alternates between one step of each algorithm. It does the authority rating updates just like Kleinberg (giving each authority a rating equal to the sum of the hub ratings of all the pages that link to it). However, it does the hub rating updates by giving each hub a rating equal to the average of the authority ratings of all the pages that it links to. Consequently, a hub is better if it links to only good authorities, rather than linking to both good and bad authorities. Note that it shares the following behavior characteristics with the Kleinberg algorithm: if we consider a full bipartite graph, then the weights of the authorities increase exponentially fast for Hub-Averaging (the rate of increase is the square root of that of Kleinberg's algorithm). However, if one of the hubs points to a node outside the component, then the weights of the component drop. This prevents the Hub-Averaging algorithm from completely following the drifting behavior of Kleinberg's algorithm [Borodin et al. 2001]. Hub-Averaging and SALSA also share a common characteristic, as the Hub-Averaging algorithm tends to favor nodes with high in-degree. Namely, if we consider an isolated component of one authority with high in-degree, the authority weight of this node will increase exponentially faster [Borodin et al. 2001].

The authors also proposed two different algorithms, called Hub-Threshold and Authority-Threshold, that modify the “threshold” of Kleinberg's algorithm. The Hub-Threshold algorithm is based on the notion that a site should not be considered a good authority simply because many hubs with very poor hub weights point to it. When computing the authority weight of the ith page, the Hub-Threshold algorithm does not take into consideration all hubs that point to page i. It only considers those hubs whose hub weight is at least the average hub weight over all the hubs that point to page i, computed using the current hub weights of the nodes.

The Authority-Threshold algorithm, on the other hand, is based on the notion that a site should not be considered a good hub simply because it points to a number of “acceptable” authorities; rather, to be considered a good hub, it must point to some of the best authorities. When computing the hub weight of the ith page, the algorithm counts those authorities which are among the top K authorities, based on the current authority values. The value of K is passed as a parameter to the algorithm.

Finally, the authors also proposed two algorithms based on a Bayesian statistical approach, namely the Bayesian Algorithm and the Simplified Bayesian Algorithm, as opposed to the more common algebraic/graph-theoretic approach. The Simplified Bayesian Algorithm is basically a simplification of the Bayesian model in the Bayesian Algorithm. They experimentally verified that the Simplified Bayesian Algorithm is almost identical to the SALSA algorithm and has at least 80% overlap on all queries. This may be due to the fact that both algorithms place great importance on the in-degree of a node while determining its authority weight. On the other hand, the Bayesian Algorithm appears to resemble both the Kleinberg and the SALSA behavior, leaning more towards the first. It has higher intersection numbers with Kleinberg than with SALSA [Borodin et al. 2001].

3.2.8. PicASHOW. PicASHOW [Lempel and Soffer 2001] is a pictorial retrieval system that searches for images on the Web using hyperlink-structure analysis. PicASHOW applies co-citation-based approaches and PageRank-influenced methods. It requires no image analysis whatsoever and no creation of taxonomies for preclassification of the images on the Web. The justification for applying co-citation-based measures to images just as to Web pages is as follows: (1) images that are co-contained in pages are likely to be related to the same topic; (2) images that are contained in pages that are co-cited by a certain page are likely to be related to the same topic. Furthermore, in the spirit of PageRank, the authors assume that images contained in authoritative pages on topic t are good candidates to be quality images on that topic.

The topical collection of images from the Web is formally defined as a quadruple IC = (P, I, L, E), where P is a set of Web pages (many of which deal with a certain topic t), I is the set of images contained in P, L ⊆ P × P is the set of directed links that exist on the Web between the pages in P, and E ⊆ P × I is the relation "page p contains image i." A page p contains an image i if (a) when p is loaded in a Web browser, i is displayed, or (b) p points to i's image file (in some image file format such as .gif or .jpeg). Based on this definition of an image collection, the steps for finding authoritative images for a given query are as follows:

(1) The first step is to assemble a large topical collection of images for a given query on topic t. This is based on the notion that, by examining a large enough set of t-relevant pages, it is possible to identify high-quality t-images. This is achieved using Kleinberg's algorithm: for a query q on topic t, Kleinberg's algorithm is used to assemble a q-induced collection of Web pages by submitting q first to traditional search engines and then adding pages that point to, or are pointed to by, pages in the resultant set. This yields the page set P and the page-to-page link set L. The set of images I can then be defined by collecting the images contained in P.

(2) The next step focuses on identifying replicated images in the collection. This is based on the assumption that when a web-site creator encounters an image of his liking on a remote server, the usual course of action is to copy the image file to the local server, thus replicating the image. Note that this behavior differs from the corresponding behavior with respect to HTML pages: most of the time, authors do not copy a remote page to their local servers, but rather provide links from their site to the remote page. In PicASHOW, Lempel and Soffer [2001] download only the first 1024 bytes of each image and apply a double hash function to these bytes, so that each image is represented by a signature consisting of 32 bytes. Two images with the same signature are then considered identical.

(3) Next, noninformative images are filtered out of the collection. The authors use the following heuristics to reduce the number of noninformative images in the collection (a sketch of these filters follows this list): (a) banners and logos are noninformative and tend to be wide and short, so PicASHOW filters out images with an aspect ratio greater than some threshold value; (b) images stored in small files (less than 10 kilobytes) tend to be banners and are filtered out; even when such files are not banners, they are usually not quality topical images; (c) images whose file names contain the keywords "logo" or "banner" are filtered out; (d) additionally, cliparts, buttons, and spinning globes are filtered out based on the aspect-ratio and file-size heuristics. Note that this approach does not eliminate all noninformative images.

(4) Finally, the images are ranked using different schemes, such as an in-degree approach, a PageRank-influenced approach, and a co-citation-based scheme.
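
A minimal sketch of the filtering heuristics in step (3) is given below. The threshold values, the argument names, and the assumption that image metadata (width, height, file size, file name) is already available are all illustrative; the original paper does not prescribe this exact interface.

    def is_noninformative(width, height, size_bytes, filename,
                          max_aspect_ratio=4.0, min_size_bytes=10 * 1024):
        """Heuristically flag banners, logos, and other decorative images (sketch)."""
        # (a) Wide-and-short images beyond an (assumed) aspect-ratio threshold.
        aspect_ratio = width / float(height) if height else float("inf")
        if aspect_ratio > max_aspect_ratio:
            return True
        # (b) Small files (under 10 KB) tend to be banners, buttons, or cliparts.
        if size_bytes < min_size_bytes:
            return True
        # (c) Telltale keywords in the file name.
        name = filename.lower()
        if "logo" in name or "banner" in name:
            return True
        return False

    # Example: a 600x60, 8 KB image named "ad_banner.gif" is filtered out.
    print(is_noninformative(600, 60, 8 * 1024, "ad_banner.gif"))  # True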

Similarly, PicASHOW also finds image hubs. Just as hubs were defined as pages that link to many authoritative pages, image hubs are pages that are linked to many authoritative images. Specifically, pages that contain high-quality images are called image containers, while pages that point to good image containers are called image hubs. Thus, image hubs are removed from the authoritative images themselves, which are contained in the image containers. The co-citation-based image retrieval scheme is used once again to find both image containers and image hubs.

4. USAGE CHARACTERIZATION

In this section, we consider the problem of modeling and predicting web page accesses. We begin by relating web page access modeling to prediction for efficient information retrieval on the WWW and consider a preliminary statistical approach that relies on the distribution of interaccess times. Two other approaches from different domains, namely Markov processes and human memory, are also examined.

4.1. Access Prediction Problem

In the case of the WWW, access prediction has several advantages. It becomes a basis for improved quality of information access by prefetching documents that have a high likelihood of access. In the Web environment, prefetching can materialize as a client-side or server-side function, depending on whether prediction is performed using the access patterns of a particular user or those of the whole population. If access prediction is addressed within the framework of a more general modeling problem, there are added benefits from using the model in other contexts. Webmasters can apply such a model to study trends in page accesses recorded in server logs, to identify navigation patterns, to improve site organization, and to analyze the effects of changes to their web sites.

While the issue is important for improving information quality on the Web, it poses several challenges. The scale, extent, heterogeneity, and dynamism of the Web, even within a site, make several approaches to predicting accesses possible.

However, as we shall see, each has its own set of limitations and assumptions that must be kept in mind while applying it to a particular domain.

Let us first elucidate the general access prediction problem. Our basic premise is that page accesses should be predicted based on universally available information on past accesses, such as server access logs. Given a document repository and a history of past accesses, we would like to know which documents are more likely to be accessed within a certain interval and how frequently they are expected to be accessed. The information used for prediction, typically found in server logs, comprises the time and URL of an HTTP request. The identity of the client is necessary only if access prediction is personalized for the client. From this information about past accesses, several predictor variables can be determined, for example, the frequency of accesses within a time interval and interaccess times.

4.2. Statistical Prediction

An obvious prediction is the time until the next expected access to a document, say A. The duration can be derived from a distribution of time intervals between successive accesses. This kind of statistical prediction relates a predictor variable or a set of predictor variables to access probability for a large sample assumed to be representative of the entire population. Future accesses to a document can then be predicted from the probability distribution using current measurements of its predictor variable(s). A variant of this approach is to use separate distributions for individual documents, measured from past accesses.

Let us illustrate the above approach for temporal prediction using interaccess time. Suppose f(t) is the access density function denoting the probability that a document is accessed at time t after its last access, that is, its interaccess-time probability density. Intuitively, the probability that a document is accessed depends on the time since its last access and on how far into the future we are predicting. At any arbitrary point in time, the probability that a document A is accessed at a time T from now is given by

Pr{A is accessed at T} = f(δ + T),

where δ is the age, or the time since the last access to the document. The function f(t) has a cumulative distribution

F(t) = Σ_{t′=0}^{t} f(t′),

which denotes the probability that a document will be accessed within time t from now. Since f(t) is a probability density, F(∞) = 1, meaning that the document will certainly be accessed sometime in the future. If f(t) is represented as a continuous distribution, the instantaneous probability as δ, T → 0 is zero, which makes short-term or immediate prediction difficult. To find the discrete density f(t) from the access logs, we calculate the proportion of document accesses that occur t time units after the preceding access, for t ranging from zero to infinity. This approach assumes that all documents have identical interaccess-time distributions, that is, all accesses are treated the same irrespective of the documents they involve, and that the distributions are free from periodic changes in access patterns (such as weekends, when interaccess times are longer). The implication of the first assumption is that the prediction is not conditioned on the frequency of past accesses, since all documents in the observed repository are assumed equally likely to be accessed, giving rise to identical frequency distributions.

Since frequency distributions are more likely to vary between documents than not, it is clear that the above assumptions make this analysis suitable only on a per-document basis. However, the approach still holds, notwithstanding that the distribution F(t) is now specific to a particular document. To predict the probability of access within time T from now for a particular document A, we may use A's distribution function F_A(t) to obtain F_A(δ + T), where δ is the age at the current time. If the interaccess time distributions are similar but not identical, we could condition these distributions on their parameters and find distributions of these parameters across the documents.

Our use of a single predictor, the interaccess time, obtained from the age δ and the prediction interval T, does not imply that the technique is univariate. The use of multiple predictors, such as the frequency of accesses in a given previous interval, can easily be accommodated into a multidimensional plot of access probability. The method becomes complicated when several dimensions are involved. To alleviate this, we may derive a combined metric from the predictor variables, transforming the problem back to univariate prediction. However, this requires empirical determination of the correlation between predictors and, subsequently, a combination function.

Given the statistical principle, one might naturally be led to ask how the distribution F(t) (or its variant F_A(t)) can be used for actionable prediction. Recall that F(t) is a cumulative probability distribution. For a given document age, it tells us the probability that a document is accessed within a certain interval of time. If a single probability distribution is used, this probability is an indicator of overall document usage with respect to the time interval. If we use individual distributions F_A(t), it can be used to compare the relative usage of documents. The expected time to the next access T is given by the mean of the distribution:

E[T] = Σ_{t=0}^{∞} t · f(t).

The expected time T before the next access to a document, if it is known for all documents, can be used as a criterion for populating server-side caches.
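
The discrete density f(t), its cumulative distribution F(t), and the expected time to the next access can all be estimated directly from a list of access timestamps. The sketch below assumes per-document access times in some fixed time unit; the function names and the truncation of the distribution at the largest observed gap are assumptions made for illustration.

    from collections import Counter

    def interaccess_density(access_times):
        """Estimate the discrete interaccess-time density f(t) from access timestamps."""
        times = sorted(access_times)
        gaps = [b - a for a, b in zip(times, times[1:])]
        counts = Counter(gaps)
        total = len(gaps)
        return {t: c / total for t, c in counts.items()}

    def cumulative(f):
        """F(t) = sum of f(t') for t' <= t, evaluated at the observed gap values."""
        F, running = {}, 0.0
        for t in sorted(f):
            running += f[t]
            F[t] = running
        return F

    def expected_time_to_next_access(f):
        """E[T] = sum over t of t * f(t)."""
        return sum(t * p for t, p in f.items())

    # Hypothetical access times (in hours since some origin).
    f = interaccess_density([0, 2, 3, 7, 8, 10])
    print(cumulative(f))                    # {1: 0.4, 2: 0.8, 4: 1.0}
    print(expected_time_to_next_access(f))  # 2.0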

The temporal approach discussed above bases prediction on interaccess times. Equally, we may use a frequency-based alternative for predicting access. A frequency distribution denotes the probability of a certain number of accesses to a document or a sample of documents over a fixed time interval. Using a method analogous to that discussed earlier, we can answer the following questions for prediction over the next time interval:

— What is the probability that exactly N documents will be accessed?
— What is the probability that N or more documents will be accessed?
— How many documents are expected to be accessed?

This approach has the same drawbacks as discussed previously: it does not account for periodic changes in access rates, but rather aggregates them into a single distribution, and accesses to all documents are treated the same. Finally, both temporal and frequency prediction may be combined to ascertain the probability of a certain number of accesses during a given time period in the future.

4.3. Markov Models

Markov models assume web page accesses to be "memoryless," that is, access statistics are independent of events more than one interval ago. We discuss two Markov models for predicting web page accesses, one of our own making and another due to Sarukkai [2000]. Prior to explaining these models per se, we introduce the concept of Markov processes.

4.3.1. Markov Processes. A stochastic process can be thought of as a sequence of states {S_t; t = 1, 2, . . .}, where S_t represents the state of the process at discrete time t, with a certain transition probability p_ij between any two states i and j. A Markov process is a simple form of stochastic process where the probability of an outcome depends only on the immediately preceding outcome. Specifically, the probability of being in a certain state at time t depends entirely on the state of the process at time t − 1. To find the state probabilities at a time t, it is therefore sufficient to know the state probabilities at t − 1 and the one-step transition probabilities p_ij, defined as

p_ij = Pr{S_t = j | S_{t−1} = i}.

The one-step transition probabilities represent a transition matrix P = (p_ij). In a Markov process where the transition probabilities do not change with time, that is, a stationary Markov process, the probability of a transition in n steps from state i to state j, or Pr{S_t = j | S_{t−n} = i}, denoted by p^(n)_ij, is given by P^n_ij. Intuitively, this probability may be calculated as the summation of the transition probabilities over all possible n-step paths between i and j in a graph that is equivalent to the transition matrix P, with the transition probability along any path being the product of all successive one-step transition probabilities. This is stated precisely by the well-known Chapman–Kolmogorov identity [Ross 1983]:

p^(m+n)_ij = Σ_h p^(m)_ih p^(n)_hj,   (m, n) = 1, 2, . . .     (1)
           = P^(m+n)_ij.                                        (2)

Hence, the matrix of n-step transition probabilities is simply the nth power of the one-step transition matrix P. If the initial state probabilities π^0_i = Pr{S_0 = i} are known, we can use the n-step transition probabilities to find the state probabilities at time n as follows:

π^n_j = Σ_i Pr{S_n = j | S_0 = i} · Pr{S_0 = i}
      = Σ_i p^(n)_ij π^0_i.

Representing the state probabilities at time n as a state distribution vector Π^n = (π^n_i), the above equation can be written in vector notation as

Π^n = P^n · Π^0.     (3)

The vector of initial state probabilities Π^0 is known as the initial state distribution. Typically, as n → ∞, the initial state distribution becomes less relevant to the n-step transition probabilities. In fact, for large n, the rows of P^n become identical to each other and to the steady-state distribution Π^n. The steady-state probability π^n_i as n → ∞ can be interpreted as the fraction of time spent by the process in state i in the long run.
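
Under these definitions, the steady-state distribution can be approximated numerically by repeatedly applying the transition matrix to a state distribution until it stops changing. The sketch below uses plain Python lists; the convergence tolerance and the uniform initial distribution are assumptions.

    def steady_state(P, tol=1e-10, max_iter=10000):
        """Approximate the stationary distribution of a row-stochastic matrix P (sketch)."""
        k = len(P)
        pi = [1.0 / k] * k                  # assumed uniform initial distribution
        for _ in range(max_iter):
            # One application of Eq. (3), componentwise: pi'_j = sum_i pi_i * p_ij.
            new_pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
            if max(abs(a - b) for a, b in zip(pi, new_pi)) < tol:
                return new_pi
            pi = new_pi
        return pi

    # Example: a two-page site where page 0 links mostly to page 1 and vice versa.
    P = [[0.2, 0.8],
         [0.5, 0.5]]
    print(steady_state(P))   # approximately [0.385, 0.615]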


4.3.2. Markov Chain Prediction. Markov processes can be used to model individual browsing behavior. We propose a method [Dhyani 2001] for determining access probabilities of web pages within a site by modeling the browsing process as an ergodic Markov chain [Ross 1983]. A Markov chain is simply a sequence of state distribution vectors at successive time intervals, that is, 〈Π^0, Π^1, . . . , Π^n〉. A Markov chain is ergodic if it is possible to go from every state to every other state in one or more transitions. The approach relies on a-posteriori transition probabilities that can readily be ascertained from logs of user accesses maintained by Web servers. Note that our method can be considered a variant of PageRank. The difference between PageRank and our approach is that PageRank assumes a uniform distribution for following outlinks, while we use observed frequencies. Also, our model does not perform random jumps to arbitrary Web pages.

Let us represent a web site as a collection of K states, each representing a page. The evolution of a user's browsing process may then be described by a stochastic process with a random variable X_n (at time n = 1, 2, . . .) acquiring a value x_n from the state space. The process may be characterized as a Markov chain if the conditional probability of X_{n+1} depends only on the value of X_n and is independent of all previous values. Let us denote the nonnegative, normalized probability of transition from page i to page j as p_ij, so that, in terms of the random variable X_n,

p_ij = P(X_{n+1} = j | X_n = i).

The transition probabilities can be represented in a K × K transition matrix P. We can then derive the following:

— The ℓ-step transition probability p^(ℓ)_ij is the probability that a user navigates from page i to page j in ℓ steps.8 This probability is defined generally by the Chapman–Kolmogorov identity of Eq. (1).

— In ergodic Markov chains, the amount of time spent in a state i is proportional to the steady-state probability π_i. If user browsing patterns can be shown to be ergodic, then the pages that occupy the largest share of browsing time can be identified. Let Π^0 denote the initial state distribution vector of a Markov chain; the jth element of Π^0 is the probability that a user is initially at page j.9 The state distribution vector Π^n at time n can then be obtained from Eq. (3).

It can be shown that, for large n, each element of P^n approaches the steady-state value for its corresponding transition. We can thus say that, irrespective of the initial state distribution, the transition probabilities of an ergodic Markov chain will converge to a stationary distribution, provided such a distribution exists. The mean recurrence time of a state j is simply the inverse of the state probability, that is, 1/π_j.

We now briefly address the issue of obtaining the transition probabilities p_ij. It can be shown that the transition probability can be expressed in terms of the a-posteriori probability τ_ij, which can be observed from browsing phenomena at the site as follows:

τ_ij = (Number of accesses from i to j) / (Total number of accesses from i to all its neighbors).

Accesses originating from outside the web site can be modeled by adding another node (say ∗) collectively representing outside pages. The transition probabilities p_∗i denote the proportion of times users enter the web site at page i.
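
A minimal sketch of this estimation is shown below: it counts observed page-to-page transitions from a list of (from_page, to_page) pairs extracted from a server log and normalizes each row to obtain the τ values. The log format and the dictionary-of-dictionaries representation are assumptions for illustration.

    from collections import defaultdict

    def estimate_transition_probabilities(transitions):
        """Estimate tau_ij from observed (from_page, to_page) transitions."""
        counts = defaultdict(lambda: defaultdict(int))
        for src, dst in transitions:
            counts[src][dst] += 1
        probs = {}
        for src, dsts in counts.items():
            total = sum(dsts.values())
            probs[src] = {dst: c / total for dst, c in dsts.items()}
        return probs

    # Hypothetical transitions mined from a server log ('*' marks entries from outside the site).
    log = [("*", "home"), ("home", "products"), ("home", "about"),
           ("*", "home"), ("home", "products"), ("products", "home")]
    print(estimate_transition_probabilities(log))
    # approximately {'*': {'home': 1.0}, 'home': {'products': 0.67, 'about': 0.33}, 'products': {'home': 1.0}}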

8 A more Web-savvy term would be click distance.
9 In the Markov chain model for Web browsing, each new access can be seen as an advancement in the discrete time variable.

Observe that, in relation to actual browsing, our model implies that the transition to the next page depends only on the current page. This assumption correctly captures the situation whereby a user selects a link on the current page to go to the next page in the browsing sequence. However, the model does not account for external random jumps made by entering a fresh URL into the browser window, or for browser-based navigation through the "Forward" and "Back" buttons. Note that pushing the "Back" and "Forward" buttons implies that the next state in the surfing process depends not only on the current page but also on the browsing history. Browser-based buttons can be factored into the model by treating them as additional links. However, we use a simplistic assumption because the a-priori probability of using browser buttons depends on the navigability of the web site and is difficult to determine by tracking sample user sessions (due to caching, etc.). Our model can be useful in the following applications:

— Relative Web Page Popularity. Since the transition probability matrix is known a-priori, the long-term probability of visiting a page can be determined. This can be compared across pages to find out which pages are more likely to be accessed than others.

— Local Search Engine Ranking. The relative popularity measure discussed above will make localized search engines more effective because it uses observed surfing behavior to rank pages rather than generalized in/out-degrees. General-purpose measures do not have this characteristic. For example, PageRank assumes that outlinks are selected in accordance with a uniform distribution. However, hyperlinks often follow the 80/20 rule, whereby a small number of links dominate the outgoing paths from a page. Long-term page probabilities that account for browsing behavior within the domain will give more accurate rankings of useful pages.

— Smart Browsing. Since ℓ-step transition probabilities are known, we can predict which pages are likely to be requested over the next ℓ steps. The browser can then initiate access to them in advance to improve performance. It can also serve suggestions to the user on which pages might be of interest based on their browsing history.

A variation of Markov chains applied by Sarukkai [2000] predicts user accesses based on the sequence of previously followed links. Consider a stochastic matrix P whose elements represent page transition probabilities, and a sequence of vectors, one for each step in the link history of a user, denoted I^1, I^2, . . . , I^{t−1}. The ℓth element of vector I^k is set to 1 if the user visits page ℓ at time k, and to 0 otherwise. For appropriate values of constants a_1, a_2, . . . , a_k, the state probability vector S^t for predicting the next link is determined as follows:

S^t = Σ_{k=1}^{t−1} a_k I^{t−k} P^k.

The next page to be accessed is predicted as the one with the highest state probability in the vector S^t. The same approach can be used to generate tours by successively predicting the links of a path.
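
The weighted-history prediction can be sketched with a small linear-algebra routine. Below, numpy supplies the matrix powers; the choice of exponentially decaying weights a_k is an assumption for illustration, since the formulation only requires "appropriate" constants.

    import numpy as np

    def predict_next_page(P, history_vectors, weights):
        """Score candidate next pages as S^t = sum_k a_k * I^{t-k} P^k (sketch).

        P               : (n x n) row-stochastic transition matrix.
        history_vectors : [I^1, ..., I^{t-1}], indicator vectors of visited pages.
        weights         : [a_1, ..., a_{t-1}], combination constants.
        """
        n = P.shape[0]
        s = np.zeros(n)
        t = len(history_vectors) + 1
        for k, a_k in enumerate(weights, start=1):
            I_tk = history_vectors[t - k - 1]              # I^{t-k}
            s += a_k * I_tk @ np.linalg.matrix_power(P, k)
        return int(np.argmax(s)), s

    # Example: three pages; the user visited page 0, then page 1 (values are made up).
    P = np.array([[0.1, 0.8, 0.1],
                  [0.3, 0.2, 0.5],
                  [0.5, 0.4, 0.1]])
    history = [np.array([1.0, 0, 0]), np.array([0, 1.0, 0])]   # I^1, I^2
    weights = [0.7, 0.3]                                       # assumed decaying constants a_k
    print(predict_next_page(P, history, weights))              # page 2 is predicted next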

4.4. Human Memory Model

Psychology research has shown that the accuracy of recalling a particular item in memory depends on the number of times the item was seen, or frequency; the time since last access, or recency; and the gap between previous accesses, or spacing. In some ways, the human memory model is closer to Web-page access modeling. Recker and Pitkow [1996] have used the human memory model to predict document accesses in a multimedia repository based on the frequency and recency of past document accesses. We discuss their approach below.

4.4.1. Frequency Analysis. Frequency analysis examines the relationship between the frequency of document access during a particular time window and the probability of access during a subsequent marginal period called the pane. This relationship can be derived by observing access log data showing the times of individual accesses to Web documents. Let us consider the number of documents accessed x times during a fixed interval, say one week. If we know the proportion of these documents that are also accessed during the marginal period, then we can calculate the probability of future access to a document, given that it is accessed x times, as follows. Suppose w_x denotes the number of documents accessed x times during the window period and p_x (p_x < w_x) the number of those that are also accessed during the pane. Then, the probability that a document A is accessed during the pane, given that it is accessed x times during the window, is:

Pr{A is accessed in pane | A is accessed x times during window}
  = Pr{A is accessed in pane and A is accessed x times during window} / Pr{A is accessed x times during window}
  = p_x / w_x.

The above probability can be aggregated over successive windows for the given observation interval to obtain the conditional distribution of need probability, or the probability that an item will be required right now, versus the frequency of access in the previous window. Recker and Pitkow found that the distribution of need probability p shows a power-law relationship familiar from the human memory model; that is, as the frequency of document accesses x increases, the need probability increases according to the function

p = a x^b,

where a and b are constants.
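
The window/pane counting behind the need probability is easy to sketch. The function below assumes per-document access timestamps, a window interval, and an immediately following pane interval; the names and the interval representation are illustrative.

    from collections import defaultdict

    def need_probability_by_frequency(access_times, window, pane):
        """Estimate Pr{accessed in pane | accessed x times in window} for each frequency x.

        access_times : dict mapping document id -> list of access timestamps.
        window, pane : (start, end) intervals, with the pane following the window.
        """
        w_start, w_end = window
        p_start, p_end = pane
        w_count = defaultdict(int)   # w_x: documents accessed x times in the window
        p_count = defaultdict(int)   # p_x: of those, documents also accessed in the pane
        for doc, times in access_times.items():
            x = sum(w_start <= t < w_end for t in times)
            if x == 0:
                continue
            w_count[x] += 1
            if any(p_start <= t < p_end for t in times):
                p_count[x] += 1
        return {x: p_count[x] / w_count[x] for x in w_count}

    # Hypothetical log: timestamps in days; window = days 0-7, pane = days 7-8.
    log = {"a": [1, 2, 3, 7.5], "b": [1, 2, 3], "c": [4], "d": [5, 7.2]}
    print(need_probability_by_frequency(log, (0, 7), (7, 8)))   # {3: 0.5, 1: 0.5}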

4.4.2. Recency Analysis. Recency analysis proceeds in a similar manner, except that the probability of access during the pane is conditioned on how recently the document was last accessed. Instead of creating bins of similar frequency values observed over successive windows, the documents are aggregated into bins of similar recency values. That is, if x documents were accessed an interval t ago (the recency time) in the previous window, out of which x′ are accessed during the next pane, then the need probability is calculated as follows:

Pr{A is accessed in pane | A was accessed t units ago} = x′ / x.

The need-probability versus recency-time distribution shows an inverse power-law relationship, in agreement with the retention memory literature. In addition, recency analysis shows a much better fit on the available access data than frequency.

Recker and Pitkow [1996] also study the relationship between the access distributions and hyperlink structure. They found that a high correlation exists between the recency of access, the number of links, and document centrality10 in the graph structure. Specifically, documents that are less recently accessed have a lower mean number of links per document and lower measures of relative in- and out-degrees. Thus, recently accessed documents have higher overall interconnectivity. This analysis can be applied to optimal ordering for efficient retrieval: documents with high need probability can be positioned for faster access (caching strategies based on need probability) or for more convenient access (addition of appropriate hyperlinks).

10 See Section 2 for the definition of centrality and other hypertext graph measures.

There are several issues in Recker and Pitkow's [1996] approach that require careful consideration before such a model can be adopted for practical access prediction. First, the window and pane sizes are known to have a strong impact on the quality of prediction; no rule for an ideal setting of these parameters exists, and significant trial and error may be needed to identify the right values. Second, the model does not account for continuous changes in Web repositories, such as aging effects and the addition and deletion of documents. Finally, although the model considers multiple predictors, frequency and recency, the method of combining these parameters is not scalable. We must note that there is a difference between predictor and affecter variables. While frequency and recency of access are strong predictors, they cannot be established as factors affecting need probability, due to the absence of a causal relationship. More fundamental document properties, such as visibility (connectivity) and relevance to popular topics, are more likely to be the factors that determine need probability. However, for prediction purposes, we believe that all three factors considered above can be used.

4.5. Adaptive Web Sites

Usage characterization metrics attempt to model and measure user behavior from browsing patterns gathered typically from server log files. These metrics have facilitated adaptive web sites, that is, sites that can automatically improve their organization and presentation by learning from visitor access patterns [Perkowitz and Etzioni 1997, 1998, 1999].

A metric that helps find collections of pages that are visited together in sessions is the co-occurrence frequency [Perkowitz and Etzioni 1999]. For a sequence of page accesses (obtainable from the server access log), the conditional probability P(p|q) is the probability that a visitor will visit page p given that she has already visited q. The co-occurrence frequency between the pair 〈p, q〉 is then min(P(p|q), P(q|p)). Connected components in a graph whose edges are labeled with the co-occurrence frequencies represent clusters of pages that are likely to be visited together. The quality of such clusters is defined as the probability that a user who has visited one page in a cluster also visits another page in the same cluster. In a related work, Yan et al. [1996] represent user sessions as vectors where the weight of the ith page is the degree of interest in it, measured through actions such as the number of times the page is accessed or the time the user spends on it. Users can then be clustered on the basis of the similarity between their session vectors (measured as Euclidean distance or angle measure). As the user navigates a site, he is assigned to one or more categories based on the pages accessed so far. Pages in matching categories are included as suggestions on top of the HTML document returned to the user if they have not been accessed so far and are unlinked to the document. The same study finds that the distribution of time spent by a user on a page is roughly Zipfian. An analysis of user navigation patterns by Catledge and Pitkow [1995] reveals that the distribution of path lengths within a site is roughly negative linear, with the relationship between path length p and its frequency f being f = 0.24p.
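
A minimal sketch of the co-occurrence frequency computation over a set of sessions is shown below. Representing each session as a set of visited pages, and the use of simple counting dictionaries, are assumptions for illustration.

    from itertools import combinations
    from collections import defaultdict

    def co_occurrence_frequencies(sessions):
        """Compute min(P(p|q), P(q|p)) for every pair of pages seen in the sessions."""
        visits = defaultdict(int)        # number of sessions containing page p
        pair_visits = defaultdict(int)   # number of sessions containing both p and q
        for session in sessions:
            for p in session:
                visits[p] += 1
            for p, q in combinations(sorted(session), 2):
                pair_visits[(p, q)] += 1
        freqs = {}
        for (p, q), both in pair_visits.items():
            p_given_q = both / visits[q]    # estimate of P(p|q)
            q_given_p = both / visits[p]    # estimate of P(q|p)
            freqs[(p, q)] = min(p_given_q, q_given_p)
        return freqs

    sessions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a"}]
    print(co_occurrence_frequencies(sessions))
    # approximately {('a', 'b'): 0.67, ('a', 'c'): 0.33, ('b', 'c'): 0.67}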

4.6. Activation Spread Technique

Pirolli et al. [1996] have used an activation spread technique to identify pages related to a set of "source" pages on the basis of link topology, textual similarity, and usage paths. Conceptually, an activation is introduced at a starting set of web pages in the Web graph, whose edges are weighted by the criteria mentioned above. To elaborate further, the degree of relevance of Web pages to one another is conceived as similarities, or strengths of association, among Web pages. These strength-of-association relations are represented using a composite of three graphs. Each graph structure contains nodes representing Web pages, and directed arcs among nodes are labeled with values representing the strength of association among pages. These graphs represent the following:

— The Hypertext Link Topology of a Web Locality. This graph structure represents the hypertext link topology of a Web locality by using arcs labeled with unit strengths to connect one graph node to another when there exists a hypertext link between the corresponding pages.

— Inter-Page Text Similarity. This type of graph structure represents the inter-page text content similarity by labeling arcs connecting nodes with the computed text similarities between corresponding Web pages.

— Usage Paths. The last type of graph structure represents the flow of users through the locality by labeling the arcs between two nodes with the number of users that go from one page to another.

Each of these graphs is represented by a matrix in the spreading activation algorithm; that is, each row corresponds to a node representing a page, and similarly each column corresponds to a node representing a page. Conceptually, the activation flows through the graph, modulated by the arc strengths, and the topmost active nodes represent the pages most relevant to the source pages. Effectively, for a source page p, the asymptotic activity of page q in the network is proportional to P(q|p), the probability that q will be accessed by a user given that she has visited p.

5. WEB PAGE SIMILARITY

Web page similarity metrics measure the extent of relatedness between two or more Web pages. Similarity functions have mainly been described in the context of Web page clustering schemes. Clustering is a natural way of semantically organizing information and abstracting important attributes of a collection of entities. Clustering has certain obvious advantages in improving information quality on the WWW. Clusters of Web pages can provide more complete information on a topic than individual pages, especially in an exploratory environment where users are not aware of several pages of interest. Clusters partition the information space such that it becomes possible to treat them as singular units without regarding the details of their contents. We must note, however, that the extent to which these advantages accrue depends on the quality and relevance of clusters. While this is contingent on user needs, intrinsic evaluations can often be made to judge cluster quality. In our presentation of clustering methods, we discuss these quality metrics where applicable.

We classify similarity metrics into content-based, link-based, and usage-based metrics. Content-based similarity is measured by comparing the text of documents; pages with similar content may be considered topically related and designated to the same cluster. Link-based measures rely exclusively on the hyperlink structure of a Web graph to obtain related pages. Usage-based similarity is based on patterns of document access. The intent is to group pages or even users into meaningful clusters that can aid in better organization and accessibility of web sites.

5.1. Content-Based Similarity

Document resemblance measures in the Web context can use subsequence matching or word occurrence statistics. The first set of metrics, using subsequence matching, represents a document D as a set of fixed-size subsequences (or shingles) S(D). The resemblance and containment [Broder et al. 1997] of documents are then defined in terms of the overlap between their shingle sets. That is, given a pair of documents A and B, the resemblance (denoted r(A, B)) and the containment of A in B (denoted c(A, B)) are defined respectively as follows:

r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

c(A, B) = |S(A) ∩ S(B)| / |S(A)|.

Clearly, both measures vary between 0 and 1; if A = B, then r(A, B) = c(A, B) = 1. A scalable technique to cluster documents on the Web using their resemblance is described in [Broder et al. 1997]. The algorithm locates clusters by finding connected components in a graph whose edges denote the resemblance between document pairs.
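
These set-overlap measures are straightforward to compute once documents are shingled. The sketch below shingles at the word level with a window of w tokens; the shingle size and the whitespace tokenization are assumptions, since the definition only requires fixed-size subsequences.

    def shingles(text, w=4):
        """Return the set of w-word shingles S(D) of a document."""
        tokens = text.lower().split()
        return {tuple(tokens[i:i + w]) for i in range(max(len(tokens) - w + 1, 1))}

    def resemblance(a, b, w=4):
        sa, sb = shingles(a, w), shingles(b, w)
        return len(sa & sb) / len(sa | sb)

    def containment(a, b, w=4):
        sa, sb = shingles(a, w), shingles(b, w)
        return len(sa & sb) / len(sa)

    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumps over a sleeping dog"
    print(resemblance(doc1, doc2), containment(doc1, doc2))   # 0.33..., 0.5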


Content-based similarity between Web pages can also be approached using the vector space model, as in [Weiss et al. 1996]. Each document is represented as a term vector, where the weight of term t_i in document d_k is calculated as follows:

w_ki = (0.5 + 0.5(TF_{k,i}/TF_{k,max})) · wat_{k,i} / √( Σ_{t_i ∈ d_k} (0.5 + 0.5(TF_{k,i}/TF_{k,max}))^2 (wat_{k,i})^2 ),

where

TF_{k,i}    term frequency of t_i in d_k
TF_{k,max}  maximum term frequency of a keyword in d_k
wat_{k,i}   contribution to the weight from the term attribute.

Note that this definition of term weights departs from Section 3.1 by excluding the inverse document frequency (IDF) and introducing a new factor wat, which is configurable for categories of terms. The term-based similarity, denoted S^t_xy, between two documents d_x and d_y is the normalized dot product of their term vectors w_x and w_y, respectively:

S^t_xy = Σ_i w_xi · w_yi.

5.2. Link-Based Similarity

Link-based similarity metrics are derived from citation analysis. The notion of hyperlinks provides a mechanism for connection and traversal of the information space, as do citations in the scholarly enterprise. Co-citation analysis has already been applied in information science to identify the core sets of articles, authors, or journals in a particular field of study, as well as for clustering works by topical relatedness. Although the meaning and significance of citations differ considerably in the two environments, due to the unmediated nature of publishing on the Web, it is instructive to review metrics from citation analysis for possible adaptation. The application of co-citation analysis for topical clustering of WWW pages is described in Larson [1996] and Pitkow and Pirolli [1997]. We discuss here two types of citation-based similarity measures, namely co-citation strength and bibliographic coupling strength, together with their refinements and applications to Web page clustering.

To formalize the definition of citation-based metrics, we first develop a matrix notation for representing a network of scientific publications. A bibliography may be represented as a citation graph where papers are denoted by nodes and references by links from the citing document to the cited document. An equivalent matrix representation is called the citation matrix (C), where rows denote citing documents and columns cited documents. An element C_ij of the matrix has value one (or the number of references) if there exists a reference to paper j in paper i, and zero otherwise.

Two papers are said to be bibliographically coupled [Egghe and Rousseau 1990] if they have one or more items of reference in common. The bibliographic coupling strength of documents i and j is the number of references they have in common. In terms of the citation matrix, the coupling strength of i and j, denoted SB′_ij, is equal to the scalar product of their citation vectors C_i and C_j. That is,

SB′_ij = C_i · C_j^T = (C · C^T)_ij.

In order to compare the coupling strengths of different pairs of documents, the above metric may be desensitized to the number of references through the following normalization (U is the unit vector of the same dimension as C):

SB_ij = C_i · C_j^T / (C_i · U^T + C_j · U^T).
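
With numpy, both the raw and the normalized coupling strengths follow directly from the citation matrix. The sketch below uses a small made-up 0/1 citation matrix and guards against papers with no references, which is an added assumption.

    import numpy as np

    # Citation matrix C: C[i][j] = 1 if paper i cites paper j (hypothetical data).
    C = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 1],
                  [1, 0, 0, 1],
                  [0, 0, 0, 0]])

    # Raw bibliographic coupling strengths: SB'_ij = (C · C^T)_ij.
    SB_raw = C @ C.T

    # Normalized coupling: SB_ij = C_i · C_j^T / (C_i · U^T + C_j · U^T),
    # where C_i · U^T is simply the number of references made by paper i.
    refs = C.sum(axis=1)
    denom = refs[:, None] + refs[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        SB = np.where(denom > 0, SB_raw / denom, 0.0)

    print(SB_raw)
    print(np.round(SB, 3))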

An alternate notion is that of co-citation. Two papers are co-cited if there exists a third paper that refers to both of them. The co-citation strength is the frequency with which they are cited together. The relative co-citation strength is defined as follows:

SC_ij = C_i^T · (C_j^T)^T / (C_i^T · U^T + C_j^T · U^T).

Bibliographic coupling (and co-citation) has been applied for clustering documents. Two criteria discussed by Egghe and Rousseau [1990] that can be applied to Web pages are as follows:

(A) A set of papers constitutes a related group G_A(P_0) if each member of the group has at least one coupling unit in common with a fixed paper P_0; that is, the coupling strength between any paper P and P_0 is greater than zero. Then, G_A(P_0; n) denotes the subset of G_A(P_0) whose members are coupled to P_0 with strength n.

(B) A number of papers constitute a related group G_B if each member of the group has at least one coupling unit to every other member of the group.

A measure of co-citation-based cluster similarity is the Jaccard index, defined as:

S_j(i, j) = coc(i, j) / (cit(i) + cit(j) − coc(i, j)),

where coc(i, j) denotes the co-citation strength between documents i and j, given by C_i^T · (C_j^T)^T, and cit(i) = C_i^T · U^T and cit(j) = C_j^T · U^T are the total numbers of citations received by i and j, respectively. Note that Jaccard's index is very similar to the relative co-citation strength defined by us earlier. Another similarity function is given by Salton's cosine equation:

S_s(i, j) = coc(i, j) / √(cit(i) · cit(j)).

In most practical cases, it has been found that Salton's similarity strength value is twice that calculated by the Jaccard index.
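
Both indices can be computed from the columns of the citation matrix. The sketch below reuses a small made-up numpy citation matrix and guards against zero citation counts, which is an added assumption.

    import numpy as np

    # C[k][i] = 1 if paper k cites paper i (hypothetical data).
    C = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1]])

    def cocitation(C, i, j):
        """coc(i, j): number of papers citing both i and j (dot product of columns i and j)."""
        return float(C[:, i] @ C[:, j])

    def jaccard(C, i, j):
        cit_i, cit_j = C[:, i].sum(), C[:, j].sum()
        coc = cocitation(C, i, j)
        denom = cit_i + cit_j - coc
        return coc / denom if denom else 0.0

    def salton(C, i, j):
        cit_i, cit_j = C[:, i].sum(), C[:, j].sum()
        coc = cocitation(C, i, j)
        return coc / np.sqrt(cit_i * cit_j) if cit_i and cit_j else 0.0

    print(jaccard(C, 1, 2), salton(C, 1, 2))   # 0.5, approximately 0.67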

Amongst other approaches to co-citation-based clustering, Larson [1996] employs the multidimensional scaling technique to discover clusters from the raw co-citation matrix. Similarly, Pitkow and Pirolli [1997] apply a transitive cluster-growing method once the pairwise co-citation strengths have been determined. In another application, Dean and Henzinger [1999] locate pages related to a given page by looking for siblings with the highest co-citation strength. Finally, a method described by Egghe and Rousseau [1990] finds connected components in co-citation networks whose edges are labeled by the co-citation strengths of document pairs. The size and number of clusters (or cluster cohesiveness) can be controlled by removing edges with weights below a certain threshold co-citation frequency.

A generalization of citation-based similarity measures considers arbitrarily long citation paths rather than immediate neighbors. Weiss et al. [1996] introduce a weighted linear combination of three components as their hyperlink similarity function for clustering. For suitable values of the weights W_d, W_a, and W_s, the hyperlink similarity between two pages i and j is defined as:

S^l_ij = W_d S^d_ij + W_a S^a_ij + W_s S^s_ij.

We now discuss each of the similarity components S^d_ij, S^a_ij, and S^s_ij; a sketch of their computation follows this list. Let ℓ_ij denote the length of the shortest path from page i to j, and ℓ^k_ij that of the shortest path not traversing k.

— Direct Paths. A link between documents i and j establishes a semantic relation between the two documents. If these semantic relations are transitive, then a path between two nodes also implies a semantic relation. As the length of the shortest path between the two documents increases, the semantic relationship between them tends to weaken. Hence, the direct-path component relates the similarity between two pages i and j, denoted S^s_ij, as inversely proportional to the shortest path lengths between them. That is,

S^s_ij = 1 / (2^{ℓ_ij} + 2^{ℓ_ji}).

Observe that the denominator ensures that as the shortest paths increase in length, the similarity between the documents decreases.

— Common Ancestors. This component generalizes co-citation by including successively weakening contributions from distant ancestors that are common to i and j. That is, the similarity between two documents is proportional to the number of ancestors that the two documents have in common. Let A denote the set of all common ancestors of i and j; then

S^a_ij = Σ_{x∈A} 1 / (2^{ℓ^j_xi} + 2^{ℓ^i_xj}).

Because direct paths have already been considered in the similarity component S^s, only exclusive paths from the common ancestor to the nodes in question are involved in the S^a component. Observe that, as the shortest paths increase in length, the similarity decreases; also, the more common ancestors, the higher the similarity.

— Common Descendants. This component generalizes bibliographic coupling in the same way as S^a. That is, the similarity between two documents is also proportional to the number of descendants that the two documents have in common. Let D denote the set of all common descendants of i and j; then

S^d_ij = Σ_{x∈D} 1 / (2^{ℓ^j_ix} + 2^{ℓ^i_jx}).

The computation normalizes S^d_ij to lie between 0 and 1 before it is included in S^l_ij, in the same way as the normalization for S^a_ij.
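
A compact way to prototype these components is with networkx shortest-path lengths, as in the sketch below, which computes the direct-path and common-ancestor components for a small directed graph. The treatment of missing paths (they simply contribute nothing), the example graph, and the omission of the normalization step are assumptions made for brevity.

    import networkx as nx

    def spl(G, u, v):
        """Shortest path length from u to v, or None if no path exists."""
        try:
            return nx.shortest_path_length(G, u, v)
        except nx.NetworkXNoPath:
            return None

    def direct_path_similarity(G, i, j):
        """S^s_ij = 1 / (2^l_ij + 2^l_ji); missing paths contribute nothing (assumption)."""
        terms = [2 ** l for l in (spl(G, i, j), spl(G, j, i)) if l is not None]
        return 1.0 / sum(terms) if terms else 0.0

    def common_ancestor_similarity(G, i, j):
        """S^a_ij summed over common ancestors x, using paths avoiding j (to i) and i (to j)."""
        common = nx.ancestors(G, i) & nx.ancestors(G, j)
        total = 0.0
        for x in common:
            l_xi = spl(G.subgraph(set(G) - {j}), x, i)   # shortest path x -> i not traversing j
            l_xj = spl(G.subgraph(set(G) - {i}), x, j)   # shortest path x -> j not traversing i
            terms = [2 ** l for l in (l_xi, l_xj) if l is not None]
            if terms:
                total += 1.0 / sum(terms)
        return total

    G = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
    print(direct_path_similarity(G, "b", "c"))       # no path either way -> 0.0
    print(common_ancestor_similarity(G, "b", "c"))   # ancestor "a" contributes 1/(2^1 + 2^1)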

5.3. Usage-Based Similarity

Information obtained from the interaction of users with material on the Web can be of immense use in improving the quality of online content. We now discuss some of the approaches proposed for relating Web documents based on user accesses. Web sites that automatically improve their organization and presentation by learning from access patterns are addressed by Perkowitz and Etzioni [1997, 1998, 1999]. Sites may be adaptive through customization (modifying pages in real time to suit individual users, e.g., goal recognition) or optimization (altering the site itself to make navigation easier for all, e.g., link promotion).

5.3.1. Clustering using Server Logs. The optimization approach is introduced by Perkowitz and Etzioni [1998, 1999] as an algorithm that generates a candidate index page containing clusters of web pages on the same topic. Their method assumes that a user visits conceptually related pages during an interaction with the site. The index page synthesis problem can be stated as follows: given a web site and a visitor access log, create new index pages containing collections of links to related but currently unlinked pages. The PageGather cluster mining algorithm for generating the contents of the index page creates a small number of cohesive but possibly overlapping clusters through five steps (a sketch of the pipeline follows the list):

(1) Process access logs into visits.

(2) Compute the co-occurrence frequencies between pages and create a similarity matrix. The matrix entry for a pair of pages p_i and p_j is defined as the co-occurrence frequency given by min(P(p_i|p_j), P(p_j|p_i)). If two pages are already linked, the corresponding matrix cell is set to zero.

(3) Create a graph corresponding to the matrix and find the maximal cliques11 or connected components12 in the graph. Each clique or connected component then represents a set of pages that are likely to be visited together (and hence related, by the above assumption).

(4) Rank the clusters found and choose which to output. The clusters are sorted using the average co-occurrence frequency between all pairs of documents in a cluster.

(5) For each cluster, create a Web page consisting of links to the documents and present it to the Webmaster for evaluation.

11 A clique is a subgraph in which every pair of nodes has an edge between them; a maximal clique is one that is not a subset of a larger clique.
12 A connected component is a subgraph in which every pair of nodes has a path between them.
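
A minimal sketch of the PageGather pipeline (steps 2 to 4 above) is given below, using networkx connected components as the cluster-mining step. Session extraction from raw logs (step 1) is assumed to have been done already; the variable names and the ranking details are illustrative.

    import networkx as nx
    from itertools import combinations
    from collections import defaultdict

    def page_gather(sessions, existing_links, top_k=3):
        """Cluster pages that co-occur in visits and rank the clusters (sketch).

        sessions       : list of sets of pages visited together.
        existing_links : set of (p, q) pairs already linked on the site.
        """
        visits, pair_visits = defaultdict(int), defaultdict(int)
        for s in sessions:
            for p in s:
                visits[p] += 1
            for p, q in combinations(sorted(s), 2):
                pair_visits[(p, q)] += 1

        # Step 2: similarity entries min(P(p|q), P(q|p)), zeroed for already-linked pairs.
        G = nx.Graph()
        for (p, q), both in pair_visits.items():
            if (p, q) in existing_links or (q, p) in existing_links:
                continue
            G.add_edge(p, q, weight=min(both / visits[q], both / visits[p]))

        # Step 3: connected components as candidate clusters.
        clusters = [set(c) for c in nx.connected_components(G)]

        # Step 4: rank clusters by the average co-occurrence frequency over all their pairs.
        def avg_weight(cluster):
            pairs = list(combinations(sorted(cluster), 2))
            weights = [G[u][v]["weight"] for u, v in pairs if G.has_edge(u, v)]
            return sum(weights) / len(pairs) if pairs else 0.0

        return sorted(clusters, key=avg_weight, reverse=True)[:top_k]

    sessions = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"d", "e"}, {"a", "b"}]
    print(page_gather(sessions, existing_links={("a", "b")}))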

Experiments [Perkowitz and Etzioni 1998, 1999] show that PageGather outperforms other clustering algorithms such as K-means, HAC, and a-priori frequent set calculation in efficiency as well as quality. The quality of clusters is measured by approximating the following: if the user visits a page in a cluster, what is the likelihood that he will visit more pages from the same cluster? Suppose n(i) represents the number of pages in cluster i visited during a session; then the quality of cluster i is given by the probability P{n(i) ≥ 2 | n(i) ≥ 1}.

As we have seen, Perkowitz and Etzioni's approach relies on co-occurrence frequency to relate pages. However, does frequent co-occurrence necessarily indicate a semantic relationship? Certainly, this assumption disregards a certain amount of arbitrariness that prevails in the often serendipitous browsing behavior of users. In exploratory browsing, pages accessed in the same session may not be related. For instance, high co-occurrence could also be due to site structure, such as coercive hyperlinks, rather than a persistent interest on the user's behalf. The method discounts the nature of the relationship between pages grouped together, which might be a useful clue in obtaining more refined clusters.

Server logs can also help cluster users based on the similarity between the sets of pages they visit, as described by Yan et al. [1996]. Clustering users is a natural way of customization, whereby pages in a user's cluster that have not yet been explored can be suggested as navigational hints in the form of dynamically generated links. This type of dynamic hypertext configuration is performed as follows:

(1) Preprocessing. Suppose a web site has n HTML pages. Each user session13 is represented using an n-dimensional session vector whose ith element is the weight or degree of interest assigned to the ith page. The weight can be measured through actions such as the number of times the page is accessed or the time the user spends on it.

(2) Clustering. User session vectors that are close to each other in the n-dimensional vector space, as determined by Euclidean or angular distance, are clustered together.

(3) Dynamic Link Generation. As the user navigates a site, she is assigned to one or more categories based on the pages accessed so far, if their number exceeds a predefined threshold. Pages in matching categories are included as suggestions on top of the HTML document returned to the user if they have not been accessed so far and are unlinked to the document.

However, this approach towards customization assumes that:

— Pages visited in each session are semantically related.

— Users will be interested in all pages accessed by other users in the same cluster (during dynamic hyperlink generation).

As for the optimization technique discussed earlier, these assumptions are not entirely valid. Similarity between session vectors does not necessarily imply a relationship between user interests. Both techniques described above also ignore the effect of proxies and browser caches on the accuracy of individual session tracking: requests for cached pages are not logged by the server, and users behind a proxy or firewall cannot be distinguished and are assigned the same IP address, affecting the granularity of clusters. Finally, estimates such as the number of accesses and the time spent are not truly representative of a user's interest in a page.

13 A session is approximated as requests originating from the same host within 24 hours.

6. WEB SEARCH AND RETRIEVAL

In Section 3, we introduced significance metrics for ranking the responses of search engines to user queries. Here, we review metrics for evaluating and comparing the performance of Web search and retrieval services. We discuss them under two categories: effectiveness and search engine comparison.

6.1. Effectiveness

A large proportion of Web search engines employ information retrieval techniques, such as vector space model similarity, for keyword-based queries. Their retrieval effectiveness, or the quality of results returned in response to a query, is measured through two metrics, namely precision and recall. Let N denote the number of responses of a search to a keyword query Q, of which N′ are relevant to Q. If R relevant Web pages exist in all, then the effectiveness metrics are defined as follows [Hawking et al. 1999; Lee et al. 1997; Yuwono and Lee 1996]:

— Precision. The proportion of pages returned that are relevant. That is, Precision = N′/N.

— Recall. The proportion of relevant pages returned. That is, Recall = N′/R.

We assume here the definition of relevant pages stated in Section 3.1. Precision and recall usually exhibit a trade-off relationship that is depicted through precision-recall curves. Metrics that combine precision and recall are discussed in [Boyce et al. 1994]. One problem with measuring recall is that the total number of pages relevant to a query is difficult to ascertain, given the size of the Web.
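
Both metrics reduce to simple set arithmetic once relevance judgments are available. The sketch below treats relevance as a known set of page identifiers, which is itself an assumption; as noted above, the full relevant set is hard to obtain in practice.

    def precision_recall(returned, relevant):
        """Compute precision (N'/N) and recall (N'/R) for one query.

        returned : list of page ids returned by the engine (N = len(returned)).
        relevant : set of page ids judged relevant to the query (R = len(relevant)).
        """
        returned_set = set(returned)
        n_prime = len(returned_set & relevant)
        precision = n_prime / len(returned_set) if returned_set else 0.0
        recall = n_prime / len(relevant) if relevant else 0.0
        return precision, recall

    print(precision_recall(["p1", "p2", "p3", "p4"], {"p2", "p4", "p7"}))  # (0.5, 0.666...)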

Recently, Brewington and Cybenko [2000] proposed a metric for the currency of a search engine. This metric can be used to answer questions about how fast a search engine must reindex the Web to remain "current." They introduce the concept of the (α, β)-currency of a search engine with respect to a changing collection of Web pages. Informally, the search engine data for a given Web page is said to be β-current if the page has not changed between the last time it was indexed and β time units ago. Formally, the expected probability of a single page being β-current over all values of the observation time t_n is given as follows:

Pr(β-current | λ, T, β) = β/T + (1 − exp(−λ(T − β))) / (λT),

where λ is the change rate of each Web page [Brewington and Cybenko 2000] and T is an associated distribution of reindexing times (a periodic reindexing system will have a single constant T_o). Observe that β is the grace period allowing for unobserved changes to a Web page and relaxes the temporal aspect of what it means to be current; the smaller β is, the more current our information about the page is.
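
The β-current probability is a one-line formula; the small helper below evaluates it for a page with an assumed change rate and a periodic reindexing interval (the numeric values are made up).

    import math

    def beta_current_probability(lam, T, beta):
        """Pr(beta-current | lam, T, beta) = beta/T + (1 - exp(-lam*(T - beta))) / (lam*T)."""
        return beta / T + (1.0 - math.exp(-lam * (T - beta))) / (lam * T)

    # A page changing on average once every 10 days, reindexed every 7 days,
    # with a 1-day grace period (all values hypothetical).
    print(beta_current_probability(lam=1 / 10, T=7, beta=1))   # about 0.79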

A search engine for a collection of pages is then said to be (α, β)-current if a randomly chosen page in the collection has a search engine entry that is β-current with probability at least α. It can be shown that the expected probability α is as follows:

α = ∫_0^∞ [ (σ/δ) (t/δ)^{σ−1} exp(−(t/δ)^σ) ] × [ β/T_o + (1 − exp(−(1/t)(T_o − β))) / ((1/t) T_o) ] dt,

where δ and σ are the scale and shape parameters, respectively, of the Weibull distribution [Montgomery and Runger 1994]. Note that the above integral can be evaluated in closed form only when the Weibull shape parameter σ is 1; otherwise, numerical evaluation is required. Also, the integral gives an α for every pair (T_o, β).

6.2. Search Engine Comparison

The comparison of public-domain search services has been the subject of several studies. One approach for comparing multiple search services describes the following metrics, using a meta-search service such as MetaCrawler [Selberg and Etzioni 1995]:

— Coverage. Coverage measures the number of hits returned by each service on average, measured as a percentage of a preset maximum allowed hits per search engine. The uniqueness of coverage can be measured as the percentage of references not returned by any other engine. The absolute coverage of a public search engine, or the fraction of the Web it indexes, has been reported to be at most one-third of the Web [Lawrence and Giles 1998, 1999]. The same study analyzes the overlap between pairs of engines to estimate a lower bound on the size of the indexable Web of 320 million pages. Note that this estimate of the size of the indexable Web is only valid up to April 1998 [Lawrence and Giles 1998]. More recently, it has been estimated that the number of unique pages on the Internet is 2.1 billion and that the number of unique pages added per day is 7.3 million [Murray and Moore 2000]. These statistics are valid as of July 10, 2000.

— Relevance. Two relevance metrics are used. One metric is the proportion of hits from each engine that is followed by the user, or its precision. The other metric is the proportion of overall hits followed by the user per search engine (i.e., market share).

— Performance. Performance is measured by each service's average response time to a query.

Other metrics for evaluating search engine indexes are size, overlap, and quality. A standardized, statistical way of measuring relative search engine size and overlap using random queries (without privileged access) is described by Bharat and Broder [1998]. Suppose we have two procedures: one for sampling pages uniformly at random from the index of a particular search engine, and another for checking whether a particular page is indexed by a particular search engine. Based on approximations of these procedures through queries issued to the public interfaces, the overlap and relative sizes of two search engine indexes E1 and E2 can be estimated as follows:

— Overlap. The fraction of URLs sampled from E1 that are found in E2. This approximates the fraction

|E1 ∩ E2| / |E1|.

— Size Comparison. The size ratio of E1 and E2 is given by:

size(E1) / size(E2) = (|E1 ∩ E2| / |E2|) / (|E1 ∩ E2| / |E1|)
                    = (Fraction of URLs sampled from E2 found in E1) / (Fraction of URLs sampled from E1 found in E2).

Sampling is performed by randomly selecting a URL from the pages returned for a query composed using keywords from a preconstructed lexicon. To test whether the URL is indexed by a particular search engine (i.e., checking), a strong query14 (which uniquely identifies the page) is constructed, and the presence of the URL is tested in the set of pages returned. Note that the response to a strong query may contain multiple URLs due to mirroring, multiple aliases, and so on.

14 A conjunctive query of the k most significant keywords in the Web page. Significance is taken to be inversely proportional to frequency in the lexicon.
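
Given implementations of the sampling and checking procedures, the overlap and size-ratio estimates are simple ratios of counts. The sketch below assumes two callables, sample_urls(engine, n) and is_indexed(engine, url), standing in for the query-based procedures; both are hypothetical placeholders rather than real APIs.

    def estimate_overlap_and_size_ratio(sample_urls, is_indexed, e1, e2, n=100):
        """Estimate overlap(E1 -> E2) and size(E1)/size(E2) by sampling and checking.

        sample_urls(engine, n) -> n URLs sampled (approximately) uniformly from the index.
        is_indexed(engine, url) -> True if the engine's index contains the URL.
        Both callables are assumed to be supplied by the caller.
        """
        from_e1 = sample_urls(e1, n)
        from_e2 = sample_urls(e2, n)
        frac_e1_in_e2 = sum(is_indexed(e2, u) for u in from_e1) / len(from_e1)
        frac_e2_in_e1 = sum(is_indexed(e1, u) for u in from_e2) / len(from_e2)
        overlap_e1_e2 = frac_e1_in_e2                    # estimates |E1 ∩ E2| / |E1|
        size_ratio = (frac_e2_in_e1 / frac_e1_in_e2      # estimates size(E1) / size(E2)
                      if frac_e1_in_e2 else float("inf"))
        return overlap_e1_e2, size_ratio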

A simple methodology for approximating index quality using random walks is proposed by Henzinger et al. [1999]. The definition of quality is based on the PageRank measure defined in the Google search engine [Brin and Page 1998]. Suppose each page on the Web is assigned a weight w scaled by the sum of all page weights. Then, the quality w(S) and the average page quality a(S) of a search engine index S are defined as

w(S) = Σ_{p∈S} R(p)

a(S) = w(S) / |S|,

where R(p) is the PageRank15 of a page p as defined in Brin and Page [1998]. This approach approximates the quality of a search engine index S by independently selecting pages p_1, p_2, . . . , p_n and testing whether each page is in S. Let I[p_i ∈ S] be 1 if p_i is in S and 0 otherwise; then the estimate of w(S) is given by the fraction of pages in the sample sequence that are in the index. That is,

w̄(S) = (1/n) Σ_{i=1}^{n} I[p_i ∈ S].

Consequently, one needs to measure the fraction of pages in the sample sequence that are in the index S in order to approximate the quality of S. To achieve this, the authors pointed out that two components are required: first, a mechanism for selecting pages according to w; second, a method for testing whether a page is indexed by a search engine.

The authors adopted the approach used by Bharat and Broder [1998] to test whether a URL is indexed by a search engine. However, selecting pages according to a weight distribution w is significantly harder; in particular, it is difficult to select Web pages according to the PageRank distribution. To solve this problem, Henzinger et al. [1999] proposed a sampling algorithm that provides a quality measure close to that which would be obtained using the PageRank distribution. This approach is based on the assumption that one has the means to choose a page uniformly at random. In that case, one could perform a random walk with an equilibrium distribution corresponding to the PageRank measure: at each step, the walk would either jump to a random page with probability d or follow a random link with probability 1 − d. By executing this walk for a long period of time, one could generate a sample sequence whose pages have a distribution close to the PageRank distribution.

15 In Henzinger et al. [1999], the (1 − d) term in the definition of PageRank is normalized by the total number of pages on the Web, T.

As pointed out by the authors, there are two problems with the implementation of such a random walk. First, no method is known for choosing a Web page uniformly at random. Second, it is not clear how many steps one would have to perform in order to remove the bias of the initial state and thereby approximate the equilibrium distribution. To address the first problem, the authors adopted the following approach: the walk occasionally chooses a host uniformly at random from the set of hosts encountered on the walk thus far, and jumps to a page chosen uniformly at random from the set of pages discovered on that host thus far. Obviously, the equilibrium distribution of this approach does not match the PageRank distribution, since pages that have not yet been visited cannot be chosen, and pages on hosts with a small number of pages are more likely to be chosen than pages on hosts with a large number of pages. However, the authors showed experimentally [Henzinger et al. 1999] that this bias does not prevent the walk from approximating a quality metric that behaves similarly to PageRank.

For the second problem, one obviously has to start the random walk from some initial page, which introduces a bias towards pages close to the initial page. The authors experimentally justified that, as a substantial portion of the Web is highly connected with reasonably short paths between pairs of pages, randomly walking over a small subgraph of the Web suffices to handle the initial bias [Henzinger et al. 1999].
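A sketch of such an approximate walk under stated assumptions: outlinks(url) is a hypothetical helper that fetches a page and returns its outgoing links, and the jump probability d is set to a typical value rather than the one used in the original experiments.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse

def quality_biased_walk(seed_url, outlinks, steps=10000, d=0.15):
    """Approximate the walk of Henzinger et al. [1999]: with probability d jump to a random
    page on a previously visited host, otherwise follow a random outgoing link."""
    pages_by_host = defaultdict(set)
    sample = []
    current = seed_url
    for _ in range(steps):
        pages_by_host[urlparse(current).netloc].add(current)
        sample.append(current)
        links = outlinks(current)                 # assumed: fetch the page and parse its links
        if random.random() < d or not links:
            host = random.choice(list(pages_by_host))           # random host seen so far
            current = random.choice(list(pages_by_host[host]))  # random page on that host
        else:
            current = random.choice(links)                      # follow a random link
    return sample
```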

Henzinger et al. [1999] provided some experimental results based on their random walks. First, the results demonstrate that the random walk approach does capture the intuitive notion of quality: the weight distribution appears heavily skewed towards pages users would consider useful. Second, the results compare the measured quality of several search engine indexes. For instance, comparing the quality scores of the indexes in relation to their sizes, Alta Vista scored highest under the quality metric, while Excite did extremely well for a search engine of intermediate size. Similarly, comparing the average page quality of the indexes, the authors found that larger search engines sometimes have higher average page quality.

7. INFORMATION THEORETIC

The final category of Web metrics comprises metrics that measure properties related to information needs, production, and consumption. Here, we consider the desirability and survivability of Web documents, discussed by Pitkow and Pirolli [1997], and the rate of change of Web pages, discussed by Brewington and Cybenko [2000].

The desirability of a page is the probability that its information will be needed in a given time interval. Given that the frequency of page access on the WWW approximately follows a negative binomial distribution, information desirability appears to follow the Burrell Gamma-Poisson (BGP) model, which assumes that accesses to information are Poisson events whose desirability is modeled by the Poisson parameter. The logarithm of the need probability satisfies a negative linear relationship with the logarithm of the time since last access, as illustrated by the sketch below.
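A minimal sketch, using hypothetical recency data, of checking this log-log relationship by fitting a least-squares line to log(need probability) against log(time since last access); a negative slope is expected.

```python
import math

# Hypothetical recency buckets: (days since last access, observed re-access probability)
observations = [(1, 0.30), (7, 0.12), (30, 0.05), (90, 0.02), (365, 0.006)]

xs = [math.log(t) for t, _ in observations]
ys = [math.log(p) for _, p in observations]

# Ordinary least-squares fit of y = a + b * x
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x
print(f"slope = {b:.3f} (expected to be negative), intercept = {a:.3f}")
```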

Survival analysis models the probability that a particular item will be deleted at a particular time. The survival function defined over time t is the probability that a page survives at least up to time t. That is, for a survival time T, distribution function F(t), and corresponding density function f(t), the survival function is

$$S(t) = 1 - F(t) = P\{T > t\}.$$

The hazard rate is defined as the probability that a page will be deleted in the next unit of time, given that it has survived to time t:

$$\lambda(t) = \frac{f(t)}{S(t)}.$$
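As a rough illustration (the lifetimes below are hypothetical, not data from the cited studies), the empirical survival function and a crude hazard rate can be estimated as follows:

```python
def empirical_survival(lifetimes, t):
    """S(t): fraction of observed pages that survived beyond time t."""
    return sum(1 for x in lifetimes if x > t) / len(lifetimes)

def empirical_hazard(lifetimes, t, dt=1.0):
    """Approximate hazard: probability of deletion in (t, t + dt] given survival to t."""
    s_t = empirical_survival(lifetimes, t)
    if s_t == 0:
        return float("nan")
    return (s_t - empirical_survival(lifetimes, t + dt)) / (dt * s_t)

# Hypothetical observed page lifetimes in days
lifetimes = [3, 10, 14, 30, 45, 60, 90, 120, 200, 400]
print(empirical_survival(lifetimes, 30), empirical_hazard(lifetimes, 30))
```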

Brewington and Cybenko [2000] developed an exponential probabilistic model for the times between individual Web page changes, based on observational data on the rates of change for a large sample of Web pages. They observed pages at a rate of about 100,000 pages per day for a period of over seven months, recording how and when these pages changed. About one page in five is younger than eleven days, and half of the Web's content is younger than three months. In this context, the age of a Web page is defined as the difference between a downloaded page's last-modified timestamp and the time of download, while its lifetime is the time between changes. About one page in four is older than one year, and some pages are much older than that. Inspired by these findings, the authors model the changes in a single Web page as a renewal process. If g(t) is the age probability density and λ is the change rate, then it can be shown that

$$g(t) = \lambda \exp(-\lambda t).$$

Note that the lifetime probability density is related to the age probability density, since observing that "the age is t units" is the same as knowing that "the lifetime is no smaller than t units." Consequently, we can estimate a page's lifetime PDF, assuming an exponential distribution, using only page age observations, which can easily be obtained from the data.
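A minimal sketch under the exponential assumption: for an exponential density, the maximum-likelihood estimate of the change rate is the reciprocal of the mean observed age (the age values below are hypothetical).

```python
# Hypothetical observed page ages in days (download time minus Last-Modified timestamp)
ages = [2, 5, 9, 11, 30, 45, 80, 120, 200, 365]

lambda_hat = 1.0 / (sum(ages) / len(ages))   # MLE for the rate of an exponential density
mean_lifetime = 1.0 / lambda_hat             # expected time between changes under the model
print(f"estimated change rate = {lambda_hat:.4f} per day, mean lifetime = {mean_lifetime:.1f} days")
```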


The authors also modeled the age distribution for the entire Web using the joint distribution of the exponential growth rate of Web pages and the change rate λ. Using a distribution w(x) over the inverse change rate x (i.e., λ = 1/x) and an exponential growth rate parameter ε, it can be shown that the age probability density for the entire Web is

$$g(t) = \int_0^\infty \left(\varepsilon + \frac{1}{x}\right) \exp\!\left(-\left(\varepsilon + \frac{1}{x}\right) t\right) w(x)\, dx.$$

Note that the authors observed that the shape of the distribution w(x) in the above equation roughly follows a Weibull distribution [Montgomery and Runger 1994], which is given by

$$w(t) = \frac{\sigma}{\delta}\left(\frac{t}{\delta}\right)^{\sigma-1} \exp\!\left(-\left(\frac{t}{\delta}\right)^{\sigma}\right),$$

where δ is the scale parameter and σ is the shape parameter. Varying the shape parameter changes the distribution from a peaked form to an exponential to a unimodal form, while the scale parameter adjusts the mean of the distribution. Numerically, the authors found that the optimal values are ε = 0.00176, σ = 0.78, and δ = 651.1.
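A sketch of evaluating this Web-wide age density numerically, assuming the fitted values above and a simple midpoint-rule quadrature in place of the integral:

```python
import math

EPS, SIGMA, DELTA = 0.00176, 0.78, 651.1   # fitted values reported above

def weibull_pdf(x, sigma=SIGMA, delta=DELTA):
    """Weibull density with shape sigma and scale delta."""
    return (sigma / delta) * (x / delta) ** (sigma - 1) * math.exp(-((x / delta) ** sigma))

def age_density(t, eps=EPS, x_max=20000.0, steps=20000):
    """g(t): midpoint-rule approximation of the integral over the inverse change rate x."""
    h = x_max / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h                  # midpoints avoid the singularity at x = 0
        rate = eps + 1.0 / x
        total += rate * math.exp(-rate * t) * weibull_pdf(x) * h
    return total

print(age_density(30.0))   # density of page age at t = 30 days
```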

8. CONCLUSION

In this article, we have reviewed and classified some well-known Web metrics. Our approach has been to consider these metrics in the context of improving Web content while intuitively explaining their origins and formulations. This analysis is fundamental to modeling the phenomena that give rise to the measurements. To our knowledge, this is the first survey that incorporates an extensive treatment of such a wide range of metrics and measurement functions. Nevertheless, we do not claim that this survey is complete, and we acknowledge any omissions. We hope that this initiative will serve as a reference point for the further evolution of new metrics for characterizing and quantifying information on the Web and for developing the explanatory models associated with them.

REFERENCES

ALBERT, R. AND BARABASI, A. 2000. Topology of evolving networks: Local events and uncertainty. Phys. Rev. Lett. 84, 56–60.

ALBERT, R., JEONG, H., AND BARABASI, A. 1999. The diameter of the world wide web. Nature 401, 130–131.

BARABASI, A. AND ALBERT, R. 1999. Emergence of scaling in random networks. Science 286 (Oct.), 509–512.

BARABASI, A., ALBERT, R., AND JEONG, A. 1999. Mean-field theory for scale free random networks. Phys. A 272, 173–187.

BARABASI, A., ALBERT, R., AND JEONG, J. 2000. Scale-free characteristics of random networks: The topology of the world wide web. Phys. A 281, 69–77.

BHARAT, K. AND BRODER, A. 1998. A technique for measuring the relative size and overlap of public web search engines. In Proceedings of the 7th International World Wide Web Conference (Australia, Apr.).

BREWINGTON, B. AND CYBENKO, G. 2000. How dynamic is the web? In Proceedings of the 9th International World Wide Web Conference (The Netherlands).

BRODER, A., GLASSMAN, S., MANASSE, M., AND ZWEIG, G. 1997. Syntactic clustering of the web. In Proceedings of the 6th World Wide Web Conference.

BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A., AND WIENER, J. 2000. Graph structure of the web. In Proceedings of the 9th World Wide Web Conference.

BOYCE, B. R., MEADOW, C. T., AND KRAFT, D. H. 1994. Measurement in Information Science. Academic Press Inc., Orlando, Fla.

BRIN, S. AND PAGE, L. 1998. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th World Wide Web Conference.

BORODIN, A., ROBERTS, G., ROSENTHAL, J. S., AND TSAPARAS, P. 2001. Finding authorities and hubs from link structures on the world wide web. In Proceedings of the 10th International World Wide Web Conference (Hong Kong).

BOTAFOGO, R., RIVLIN, E., AND SHNEIDERMAN, B. 1992. Structural analysis of hypertexts: Identifying hierarchies and useful metrics. ACM Trans. Inf. Syst. 10, 2 (Apr.), 142–180.

BRAY, T. 1996. Measuring the web. In Proceedings of the 5th International World Wide Web Conference (Paris, France, May).

CATLEDGE, L. AND PITKOW, J. 1995. Characterizing browsing strategies in the world wide web. Comput. Netw. ISDN Syst. 27, 6.

CHAKRABARTI, S., DOM, B., GIBSON, D., KUMAR, R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1998a. Experiments in topic distillation. In Proceedings of the SIGIR Workshop on Hypertext IR.

CHAKRABARTI, S., DOM, B., RAGHAVAN, P., RAJAGOPALAN, S., GIBSON, D., AND KLEINBERG, J. 1998b. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the 7th World Wide Web Conference.

CHO, J., GARCIA-MOLINA, H., AND PAGE, L. 1998. Efficient crawling through URL ordering. In Proceedings of the 7th World Wide Web Conference.

COHN, D. AND CHANG, H. 2000. Learning to probabilistically identify authoritative documents. In Proceedings of the 17th International Conference on Machine Learning (Calif.).

DEAN, J. AND HENZINGER, M. 1999. Finding related pages in the world wide web. In Proceedings of the 8th World Wide Web Conference.

DHYANI, D. 2001. Measuring the web: Metrics, models and methods. Master's Dissertation, School of Computer Engineering, Nanyang Technological University, Singapore.

EGGHE, L. AND ROUSSEAU, R. 1990. Introduction to Informetrics. Elsevier Science Publishers, Amsterdam, The Netherlands.

GIBSON, D., KLEINBERG, J., AND RAGHAVAN, P. 1998. Inferring web communities from link topology. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia.

HAWKING, D., CRASWELL, N., THISLEWAITE, P., AND HARMAN, D. 1999. Results and challenges in web search evaluation. In Proceedings of the 8th World Wide Web Conference.

HENZINGER, M., HEYDON, A., MITZENMACHER, M., AND NAJORK, M. 1999. Measuring index quality using random walks on the web. In Proceedings of the 8th World Wide Web Conference.

KLEINBERG, J. 1998. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.

KLEINBERG, J., KUMAR, R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1999. The web as a graph: Measurements, models, and methods. In Proceedings of the 5th International Conference on Computing and Combinatorics (COCOON).

KUMAR, R., RAGHAVAN, P., RAJAGOPALAN, S., AND TOMKINS, A. 1999. Trawling the web for emerging cyber-communities. In Proceedings of the 8th World Wide Web Conference.

LARSON, R. 1996. Bibliometrics of the world wide web: An exploratory analysis of the intellectual structure of cyberspace. In Annual Meeting of the American Society of Information Science.

LAWRENCE, S. AND GILES, C. L. 1998. Searching the world wide web. Science 280 (Apr.).

LAWRENCE, S. AND GILES, C. L. 1999. Searching the web: General and scientific information access. IEEE Commun. 37, 1, 116–122.

LEE, D., CHUANG, H., AND SEAMONS, K. 1997. Effectiveness of document ranking and relevance feedback techniques. IEEE Softw. 14, 2 (Mar./Apr.), 67–75.

LEMPEL, R. AND MORAN, S. 2000. The stochastic approach for link structure analysis (SALSA) and the TKC effect. In Proceedings of the 9th World Wide Web Conference.

LEMPEL, R. AND SOFFER, A. 2001. PicASHOW: Pictorial authority search by hyperlinks on the web. In Proceedings of the 10th International World Wide Web Conference (Hong Kong).

MARCHIORI, M. 1997. The quest for correct information on the web: Hyper search engines. In Proceedings of the 6th World Wide Web Conference.

MONTGOMERY, D. C. AND RUNGER, G. C. 1994. Applied Statistics and Probability for Engineers. Wiley, New York.

MURRAY, B. H. AND MOORE, A. 2000. Sizing the internet. White paper (July). Available from http://www.cyveillance.com/web/us/downloads/Sizing the Internet.pdf.

PERKOWITZ, M. AND ETZIONI, O. 1997. Adaptive web sites: An AI challenge. In Proceedings of the 15th International Joint Conference on Artificial Intelligence.

PERKOWITZ, M. AND ETZIONI, O. 1998. Adaptive web sites: Automatically synthesizing web pages. In Proceedings of the 15th National Conference on Artificial Intelligence.

PERKOWITZ, M. AND ETZIONI, O. 1999. Towards adaptive web sites: Conceptual framework and case study. In Proceedings of the 8th World Wide Web Conference.

PIROLLI, P., PITKOW, J., AND RAO, R. 1996. Silk from a sow's ear: Extracting usable structures from the web. In Proceedings of the ACM-SIGCHI Conference on Human Factors in Computing.

PITKOW, J. 1997. In search of reliable usage data on the WWW. In Proceedings of the 6th World Wide Web Conference.

PITKOW, J. AND PIROLLI, P. 1997. Life, death and lawfulness on the electronic frontier. In Proceedings of the ACM-SIGCHI Conference on Human Factors in Computing.

RAFIEI, D. AND MENDELZON, A. 2000. What is this page known for? Computing web page reputations. In Proceedings of the 9th World Wide Web Conference.

RECKER, M. AND PITKOW, J. 1996. Predicting document access in large multimedia repositories. ACM Trans. Comput.-Hum. Inter. 3, 4.

ROSS, S. 1983. Stochastic Processes. Wiley, New York.

SELBERG, E. AND ETZIONI, O. 1995. Multi-service search and comparison using the MetaCrawler. In Proceedings of the 4th International World Wide Web Conference.

SARUKKAI, R. 2000. Link prediction and path analysis using Markov chains. In Proceedings of the 4th World Wide Web Conference.

SNELL, L. 1998. Introduction to Probability. McGraw-Hill International Edition, Englewood Cliffs, N.J.

WEISS, R., VELEZ, B., SHELDON, M., NAMPREMPRE, C., SZILAGYI, P., DUDA, A., AND GIFFORD, D. 1996. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In Proceedings of the 7th ACM Conference on Hypertext.

YAN, T., JACOBSEN, M., GARCIA-MOLINA, H., AND DAYAL, U. 1996. From user access patterns to dynamic hypertext linking. In Proceedings of the 5th International World Wide Web Conference (France).

YUWONO, B. AND LEE, D. 1996. Search and ranking algorithms for locating resources on the world wide web. In Proceedings of the 12th International Conference on Data Engineering (Mar.).

YUWONO, B., LAM, S., YING, J., AND LEE, D. 1995. A world wide web resource discovery system. In Proceedings of the 4th International World Wide Web Conference.

Received July 2001; revised May 2002; accepted August 2002
