


Decision Support Systems 54 (2012) 390–401

Contents lists available at SciVerse ScienceDirect

Decision Support Systems

journal homepage: www.elsevier.com/locate/dss

Constructing a reliable Web graph with information on browsing behavior

Yiqun Liu ⁎, Yufei Xue, Danqing Xu, Rongwei Cen, Min Zhang, Shaoping Ma, Liyun Ru
State Key Lab of Intelligent Technology & Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, China

⁎ Corresponding author at: FIT Building 1-506, Tsinghua University, Beijing 100084, PR China. Tel./fax: +86 10 62796672.

E-mail address: [email protected] (Y. Liu).

¹ The corpus is called the SogouT corpus. It contains 130 million Chinese Web pages and was constructed in July 2008. Web site: http://www.sogou.com/labs/dl/t.html.

0167-9236/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.dss.2012.06.001

Article info

Article history: Received 17 March 2010; Received in revised form 30 May 2012; Accepted 13 June 2012; Available online 23 June 2012.

Keywords: Web graph; Quality estimation; Hyperlink analysis; User behavior analysis; PageRank

Abstract

Page quality estimation is one of the greatest challenges for Web search engines. Hyperlink analysis algorithms such as PageRank and TrustRank are usually adopted for this task. However, low quality, unreliable and even spam data in the Web hyperlink graph makes it increasingly difficult to estimate page quality effectively. Analyzing large-scale user browsing behavior logs, we found that a more reliable Web graph can be constructed by incorporating browsing behavior information. The experimental results show that hyperlink graphs constructed with the proposed methods are much smaller in size than the original graph. In addition, algorithms based on the proposed “surfing with prior knowledge” model obtain better estimation results with these graphs for both high quality page and spam page identification tasks. Hyperlink graphs constructed with the proposed methods evaluate Web page quality more precisely and with less computational effort.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

The explosive growth of data on the Web makes information management and retrieval increasingly difficult. For contemporary search engines, estimating page quality plays an important role in crawling, indexing and ranking processes. For this reason, the estimation of Web page quality is considered as one of the greatest challenges for Web search engines [14].

Currently, the estimation of page quality mainly relies on an analysis of the hyperlink structure of the Web. The success of PageRank [24] and other hyperlink analysis algorithms such as HITS (Hyperlink-Induced Topic Search) [18] and TrustRank [10] shows that it is possible to estimate Web page quality query independently. These hyperlink analysis algorithms are based on two basic assumptions [7]: first, if two pages are connected by a hyperlink, the page linked to is recommended by the page that links to it (recommendation). Second, the two pages share a similar topic (locality). Hyperlink analysis algorithms adopted by both commercial search engines (such as [4,11,20,24]) and researchers (such as [10,12,13,18,19]) all rely on these two assumptions. However, these two assumptions miss subtleties in the structure of the actual Web graph. The assumptions and the consequent algorithms thus face challenges in the current Web environment.

For example, Table 1 shows several top Web sites ranked by PageRank on a Chinese Web corpus¹ of over 130 million pages. To


determine whether the PageRank score accurately represents the popularity of a Web site, we also gathered traffic rankings as measured by Alexa.com.

The data in Table 1 show that several of the top 10 Web sites ranked by PageRank also received a large number of user visits. For example, www.baidu.com, www.qq.com and www.sina.com.cn are also the three most frequently visited Web sites in China according to Alexa.com (their traffic rankings are shown in Table 1 in italics). In contrast, several top-ranked sites received a relatively small number of user visits, such as www.hd315.gov.cn and www.miibeian.gov.cn. According to [24], pages with high PageRank values are either well cited from many places around the Web or pointed to by other high PageRank pages. In either case, the pages with the highest PageRank values should be frequently visited by Web users because PageRank can be regarded as “the probability that a random surfer visits a page”. Traffic is also considered as one of the possible applications of the PageRank algorithm in [24]. However, these top-ranked sites do not receive as many user visits as their PageRank rankings indicate. Although authority does not necessarily mean high traffic on the Web, we believe that neither the MII site nor the www.hd315.gov.cn site should be ranked so high in quality estimation results, because there are many other government agencies which are also authoritative but ranked much lower than these two sites.

In order to find out why the MII site and www.hd315.gov.cn are ranked so high according to PageRank score, we examine the hyperlink structure of these sites. Fig. 1 shows how www.baidu.com (the most popular Chinese search engine) links to www.miibeian.gov.cn (home page of the Ministry of Industry and Information Technology of China). As shown in the red box, the hyperlink is located at the bottom of the page, and the anchor text contains the Web site's


Table 1
Top-ranked Web sites by PageRank in a Chinese Web hyperlink graph.

Web site              Ranked by PageRank    Ranked by Alexa.com (a) traffic rankings in China
www.hd315.gov.cn      2                     1655
www.qq.com            3                     2
www.baidu.com         6                     1
www.miibeian.gov.cn   7                     179
www.sina.com.cn       9                     3

(a) http://www.alexa.com/topsites/countries/CN.


registration information. Each Web site in China should register with the Ministry of Industry and Information Technology (MII), and site owners are requested to put the registration information on each page. Therefore, almost all Web sites in China link to the MII Web site, and the PageRank score of www.miibeian.gov.cn is very high because of the huge number of in-links. The Web site www.hd315.gov.cn is highly ranked by PageRank for a similar reason; each commercial site in China is required to put registration information on their pages, and the registration information contains a hyperlink to www.hd315.gov.cn.

From this example we can see that quality estimation results given by PageRank in a practical Web environment may not be so reasonable. Web sites such as the MII site are ranked quite high because many Web pages link to them. However, many of these hyperlinks are created for legal, commercial or even spamming reasons. Hyperlinks in the Web graph should not be treated as equally important, as is supposed in [2]. Practical Web users do not act like the “random surfer”; instead, they only click hyperlinks that interest them. Therefore, Web sites that are connected by hyperlinks that Web users are not interested in clicking usually get a high PageRank score that they do not deserve.

This example shows that hyperlink analysis algorithms are not always successful in the real Web environment because of the existence of hyperlinks that users seldom click. Removing these hyperlinks from the Web graph is an important step in constructing a more reliable graph on which link analysis algorithms can be performed more effectively.

To reduce noise in the Web graph, we analyze information on users' browsing behaviors collected by search engine toolbars or browser plug-in software. Information on browsing behavior can reveal which pages or hyperlinks are frequently visited by users and which are not, allowing construction of a more reliable Web graph.

Fig. 1. A sample site (http://www.baidu.com) that links to www.miibeian.gov.cn, the site in the sample corpus with the 7th highest PageRank score.

For example, although many pages link to the MII homepage, few people click on these links because site registration information is not interesting to most Web users. These hyperlinks may be regarded as “meaningless” or “invalid” because they are not involved in users' Web surfing process. If we construct a new Web graph without these links, the representation of users' browsing behavior will not be affected, but the PageRank score calculated on the new graph will be more accurate because most of the hyperlinks connecting to the MII homepage are removed.

The number of users visiting a site can be regarded as implicit feedback about the importance of both hyperlinks and pages in the Web graph. However, constructing a more reliable graph with this kind of information remains a challenging problem. Retaining only the nodes and edges that have been visited at least once is one potential option. Several researchers, such as Liu et al. [22], have constructed such a graph, called a ‘user browsing graph’, and have used it to gain better estimates of page quality than with the original Web graph.² However, with user browsing information, there are other options for constructing a Web graph besides the user browsing graph. The contributions of our work include:

• With user browsing information, a new Web surfing model is constructed, different from the “random surfer model” adopted by previous research such as PageRank. This “surfing with prior knowledge” model incorporates both user behavior information and hyperlink information and is a better simulation of Web users' surfing processes.

• Two quality estimation algorithms (userPageRank and userTrustRank) are proposed according to the new “surfing with prior knowledge” model. These algorithms take users' preferences among hyperlinks into consideration, and they can be performed on the user browsing graph.

• Two different kinds of Web graph construction algorithms, besides the user browsing graph, are proposed to combine both browsing and hyperlink structure information. The characteristics and evolution of these graphs are studied and compared with the original Web graph.

The remainder of the paper is organized as follows: Section 2 gives a review of related work on page quality estimation and user browsing behavior analysis. Section 3 introduces the “surfing with prior knowledge” model and the quality estimation algorithms based on it. Section 4 presents algorithms for constructing Web graphs based on both user browsing and hyperlink information. Section 5 describes the structure and evolution of the Web graphs constructed with the proposed algorithms. The experimental results of applying different algorithms to estimate page quality on different graphs are reported in Section 6. Conclusions and future work are provided in Section 7.

2. Related work

2.1. Page quality estimation

Most previous work on page quality estimation focuses on exploiting the hyperlink graph of the Web and builds a model based on that graph. Since the success of PageRank [24] in the late 1990s, extensive research has attempted to improve the efficiency and effectiveness of the original algorithm [11–13]. However, the basic idea has not changed: a Web page's quality is evaluated by estimating the probability of a Web surfer's visiting the page using a random walk model. The HITS algorithm evaluates Web page quality using two different metrics, the hub score and the authority score. Experimental results based on both IBM CLEVER search system evaluation [20] and human experts' annotations [1] have demonstrated the effectiveness of HITS.

² The original Web graph is the Web graph constructed with pages and hyperlinks collected from the real Web environment without removing noise.



In addition to methods to evaluate the quality of Web pages, researchers have proposed link analysis algorithms to identify spam pages. Spam pages are created with the intention of misleading search engines. Gyongyi et al. [10] developed the TrustRank algorithm to separate reputable pages from spam. This work was followed by other methods based on the link structure of spam pages, such as the Anti-Trust Rank [19] and Truncated PageRank [2] algorithms. TrustRank is an effective link analysis algorithm that assigns a trust score to Web pages. Pages with low trust scores tend to be spam pages, and pages with high trust scores tend to be high quality pages.

These link analysis algorithms have become popular and important tools in search engines' ranking mechanisms. However, the Web graph on which these algorithms are based is not particularly reliable because hyperlinks can be easily added or deleted by page authors or even by Web users (via Web 2.0 services). Therefore, as shown in Table 1, noise in Web graphs makes it difficult for these algorithms to evaluate page quality effectively.

Several methods have been proposed to counteract the manipulation of Web structure. Algorithms such as DiffusionRank [27] and AIR (Affinity Index Ranking) [17] were designed to fix the flaws of PageRank and TrustRank. DiffusionRank is motivated by the phenomenon of heat diffusion, which is analogous to the dissipation of energy via out-links. AIR scores for Web pages are obtained by using an equivalent electronic circuit model. Similar to TrustRank, both algorithms require the construction of a “high quality seed set”. Experimental results have shown that DiffusionRank and AIR perform better than PageRank and TrustRank in removing spam, both on toy graphs and in real Web graphs. However, aside from hyperlinks generated for Web structure manipulation and spam, most Web pages contain meaningless and low quality hyperlinks such as copyright links, advertisement links, registration information links and so on. These links are not popular and are seldom clicked by users, but they comprise a large part of Web graphs. Both the DiffusionRank and AIR algorithms are unable to deal with this kind of “noise” in hyperlink structure data.

Because of the problems that hyperlink analysis algorithms encounter in the real Web environment, researchers have tried to use features other than hyperlinks to evaluate the quality of Web pages. Chau et al. [6] have identified pages on certain topics using both content-based and link-based features. Liu et al. [21] have proposed a learning-based method for identifying search target pages query independently using content-based and hyperlink-based features, such as document length and in-link count. Jacob et al. [15] have also adopted both content-based and hyperlink-based approaches to detect Web spam. Although these methods use features other than links, link analysis algorithms still play an important role in the identification of high quality pages or spam pages. Therefore, the quality of Web hyperlink data and the effectiveness of link analysis algorithms remain challenging problems.

In contrast to these approaches, we incorporate Web users' browsing behavior to indicate page quality. Most users' browsing behavior is driven by their interests and information needs. Therefore, pages that are visited and hyperlinks that are clicked by users should be regarded as more meaningful and more important than those that are not. It is therefore reasonable to use users' preferences to prune the hyperlink graph.

2.2. User browsing behavior analysis

Although researchers such as Page et al. [24] tried to incorporate browsing information (collected from DNS providers) in page quality estimation at an early stage of hyperlink analysis research, browsing behavior analysis did not become popular until recent years. Web browser toolbars such as Google Toolbar and Live Toolbar collect user browsing information. It is considered an important source of implicit feedback on page relevance and importance and has been widely adopted in research on Web site usability [9,16,25], user intent understanding [26] and Web search [3,22,23,28].

Using this information on browsing behavior, it is possible to prune the Web graph by removing unvisited nodes and links. For example, Liu et al. [22] constructed a “user browsing graph” with Web access log data. It is believed that the user browsing graph can avoid most of the problems of the original Web graph because links in the browsing graph are actually chosen and clicked by users. Liu et al. also proposed an algorithm to estimate page quality, BrowseRank, which is based on a continuous-time Markov process model. Their study shows that the BrowseRank algorithm works better than hyperlink analysis algorithms such as PageRank and TrustRank when the latter two algorithms are performed on the whole Web graph.

The user browsing graph is not the only way to incorporate browsing behavior into page quality estimation. In addition, the interpretation of the user browsing graph is not obvious. For example, we can infer that the user browsing graph differs from the whole Web graph in some aspects, but precisely how do the structures of these two graphs differ from each other? How does the user browsing graph evolve over time? BrowseRank outperforms the PageRank and TrustRank algorithms when the latter two are performed on the original Web graph, but how do hyperlink analysis algorithms perform on the user browsing graph?

We try to answer these questions through experimental studies, and we also attempt to determine how data on users' browsing behavior can be better analyzed to construct a more reasonable Web surfing model than the widely adopted random surfer model.

3. Surfing with prior knowledge

From the example shown in Table 1 and Fig. 1, we know that hyperlinks are not clicked by users with equal probabilities, and they should not be treated as equally important in the construction of surfing models. However, due to the difficulty of collecting user browsing information, most previous work on Web graph mining is based on the “random surfer model”, which supposes that the user simply keeps clicking on successive links at random.

In contrast to these works, we collected a large amount of user browsing information with the help of a widely used search engine in China. These Web-access logs were collected from Aug. 3, 2008, to Oct. 6, 2008 (60 days; logs from Sept. 3 to Sept. 7 were not included because of a hard disk failure). Over 2.8 billion hyperlink click events were recorded and can be adopted as prior knowledge in the construction of surfing models. Details of these log data are introduced in Section 4.1.

Because it is designed around the random surfer model, one of the major flaws of the PageRank algorithm is “over-democracy” [27]. The original algorithm assumes that the Web user either randomly follows a hyperlink on a Web page and navigates to the destination (with probability α) or randomly chooses a different page on the given Web graph (with probability 1−α).

PageRank^{(k+1)}(X) = \alpha \cdot \sum_{X_i \Rightarrow X} \frac{PageRank^{(k)}(X_i)}{\#Outlink(X_i)} + (1-\alpha) \cdot \frac{1}{N} \qquad (1)

According to Eq. (1), the PageRank score of a page is divided evenly between all of its outgoing hyperlinks. However, hyperlinks on Web pages are not equally important. Some hyperlinks, such as “top stories” links on the CNN.com homepage, are more important, whereas others, such as advertisements, are less important.
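The update of Eq. (1) can be sketched as a simple power iteration. The toy graph, damping factor and iteration count below are illustrative assumptions, not values from the paper's experiments:

```python
# A minimal sketch of PageRank as in Eq. (1): each page's score is
# divided evenly among its out-links, plus a uniform teleportation term.

def pagerank(out_links, alpha=0.85, iterations=50):
    """out_links: dict mapping each page to the list of pages it links to."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {x: 1.0 / n for x in nodes}                # uniform initial scores
    for _ in range(iterations):
        new_rank = {x: (1 - alpha) / n for x in nodes}  # the (1 - alpha) * 1/N term
        for xi, targets in out_links.items():
            if not targets:
                continue                               # dangling pages leak mass in this sketch
            share = rank[xi] / len(targets)            # even division of Eq. (1)
            for x in targets:
                new_rank[x] += alpha * share
        rank = new_rank
    return rank

# Toy graph (hypothetical): C is linked from both A and B, so it ends up highest.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
```

Note that this sketch ignores refinements such as handling dangling pages; it only illustrates the even score division that the “surfing with prior knowledge” model later replaces.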

Therefore, it is not reasonable to assume that users will follow hyperlinks on a Web page with equal probabilities. If we introduce the probability of visiting page Xj directly after visiting page Xi, namely P(Xi⇒Xj), the random surfer model will be replaced by the “surfing

Page 4: Constructing a reliable Web graph with information on browsing behavior

Table 2
Information recorded in Web-access logs.

Name               Description
Time stamp         Date/time of the click event
Session ID         A randomly assigned ID for each user session
Source URL         URL of the page that the user is visiting
Destination URL    URL of the page to which the user navigates


with prior knowledge” model and the estimation of P(Xi⇒Xj) requiresprior knowledge of user browsing behaviors.

With the “surfing with prior knowledge” model, Web users do not click randomly on the hyperlinks of the Web pages they are visiting; instead, each hyperlink L is clicked with a probability P(Xi⇒Xj), in which Xi is the source page and Xj is the destination page of L.

With the new surfing model, Eq. (1) can be modified as follows:

PageRank^{(k+1)}(X) = \alpha \cdot \sum_{X_i \Rightarrow X} PageRank^{(k)}(X_i) \cdot P(X_i \Rightarrow X) + (1-\alpha) \cdot \frac{1}{N} \qquad (2)

In Eq. (2), P(Xi⇒X) is the probability of visiting page X directly after visiting page Xi. However, for the original Web graph, it is not possible to estimate this probability because the relevant information is not provided. Therefore, PageRank (as well as TrustRank) has to be computed using equal P(Xi⇒Xj) values (as in Eq. (1)).

To incorporate prior user browsing information into the original Web graph, the user-visited nodes and edges should be selected, and the number of user clicks on each hyperlink (edge) should be recorded. With this information, we can decide which hyperlinks are important and estimate the probability P(Xi⇒Xj) under the maximum likelihood assumption.

If we use UC(Xi⇒Xj) to represent the number of user clicks from Xi to Xj, the original PageRank algorithm can be modified as follows:

userPageRank^{(k+1)}(X) = \alpha \cdot \sum_{X_i \Rightarrow X} userPageRank^{(k)}(X_i) \cdot \frac{\#UC(X_i \Rightarrow X)}{\sum_{X_i \Rightarrow X_j} \#UC(X_i \Rightarrow X_j)} + (1-\alpha) \cdot \frac{1}{N} \qquad (3)

In Eq. (3), the probability P(Xi⇒Xj) is estimated by the weighted UC factor under the maximum likelihood assumption. The PageRank of page Xi is divided between its outgoing links, weighted by the UC of each link. Aside from this PageRank division, no other part of the original algorithm is changed. Therefore, the time complexity and the efficiency of the algorithm stay the same.
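The userPageRank update of Eq. (3) only changes how a page's score is split among its out-links: in proportion to recorded click counts rather than evenly. A minimal sketch, with hypothetical click counts:

```python
# A sketch of userPageRank (Eq. 3): P(X_i => X_j) is estimated by maximum
# likelihood as UC(X_i => X_j) divided by the total clicks out of X_i.
# The click counts below are illustrative assumptions.

def user_pagerank(clicks, alpha=0.85, iterations=50):
    """clicks: dict mapping (src, dst) edges to user click counts UC."""
    nodes = {x for edge in clicks for x in edge}
    n = len(nodes)
    out_total = {}                                    # sum_j UC(X_i => X_j)
    for (src, _), uc in clicks.items():
        out_total[src] = out_total.get(src, 0) + uc
    rank = {x: 1.0 / n for x in nodes}
    for _ in range(iterations):
        new_rank = {x: (1 - alpha) / n for x in nodes}
        for (src, dst), uc in clicks.items():
            # Score of src divided in proportion to click counts, per Eq. (3).
            new_rank[dst] += alpha * rank[src] * uc / out_total[src]
        rank = new_rank
    return rank

# Toy browsing graph: A's links to B and C were clicked 90 and 10 times,
# so B receives a much larger share of A's score than C does.
clicks = {("A", "B"): 90, ("A", "C"): 10, ("B", "A"): 50, ("C", "A"): 50}
scores = user_pagerank(clicks)
```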

A similar modification can be applied to the TrustRank algorithm, which traditionally divides the trust score equally between outgoing links. The original and the modified algorithms are shown in Eqs. (4) and (5), respectively.

TrustRank^{(k+1)}(X) = \alpha \cdot \sum_{X_i \Rightarrow X} \frac{TrustRank^{(k)}(X_i)}{\#Outlink(X_i)} + (1-\alpha) \cdot d \qquad (4)

userTrustRank^{(k+1)}(X) = \alpha \cdot \sum_{X_i \Rightarrow X} userTrustRank^{(k)}(X_i) \cdot \frac{UC(X_i \Rightarrow X)}{\sum_{X_i \Rightarrow X_j} UC(X_i \Rightarrow X_j)} + (1-\alpha) \cdot d \qquad (5)

With the “surfing with prior knowledge” model, hyperlinks on Web pages are not treated as equally important; instead, the probability of a user clicking each hyperlink is estimated with prior knowledge under the maximum likelihood assumption. By this means, we hope to improve the performance of PageRank and TrustRank, which are originally based on the random surfer model.
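Eqs. (4) and (5) differ from Eqs. (1) and (3) only in the teleportation term: the uniform 1/N is replaced by a static distribution d which, in TrustRank [10], is concentrated on a manually selected seed set of trusted pages. A sketch of the userTrustRank update of Eq. (5), with a hypothetical seed set and click counts:

```python
# A sketch of userTrustRank (Eq. 5): click-weighted propagation as in
# Eq. (3), but trust teleports back to a seed distribution d rather than
# uniformly. Seed set and click counts are illustrative assumptions.

def user_trustrank(clicks, seeds, alpha=0.85, iterations=50):
    """clicks: dict of (src, dst) -> UC; seeds: trusted pages defining d."""
    nodes = {x for edge in clicks for x in edge} | set(seeds)
    out_total = {}
    for (src, _), uc in clicks.items():
        out_total[src] = out_total.get(src, 0) + uc
    # d is uniform over the seed set and zero elsewhere.
    d = {x: (1.0 / len(seeds) if x in seeds else 0.0) for x in nodes}
    trust = dict(d)                                   # start from the seed distribution
    for _ in range(iterations):
        new_trust = {x: (1 - alpha) * d[x] for x in nodes}
        for (src, dst), uc in clicks.items():
            new_trust[dst] += alpha * trust[src] * uc / out_total[src]
        trust = new_trust
    return trust

# Hypothetical graph: a trusted seed whose users mostly click to "good".
clicks = {("seed", "good"): 80, ("seed", "shady"): 20, ("good", "seed"): 10}
scores = user_trustrank(clicks, seeds=["seed"])
```

Pages reached mainly through frequently clicked paths from the seeds accumulate more trust, which is the intended effect of weighting Eq. (5) by UC.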

We believe that the new surfing model can also be applied to graphs other than the Web hyperlink graph if the probability of visiting one node from another can be estimated. For example, let G=(V, E) denote a social graph, where V represents the users and E represents the relationships between them. In many Web-based social network services such as Twitter and Weibo,³ the relationship between users can be described as a directed edge from follower to followee, which is similar to the hyperlink from source page to destination page.

³ Weibo (http://www.weibo.com) is China's largest microblog service provider, with over 250 million users.

Intuitively, the influence of a social node in a social network is similar to the quality score of a Web page. This means that if we try to estimate influence scores on a social graph, hyperlink algorithms such as PageRank and TrustRank can also be utilized. As with hyperlinks in a Web graph, we believe that the “following” relationships between nodes in a social graph are also not equally important, because users may follow another user for different reasons, and the closest relationships should be valued more. Therefore, the “surfing with prior knowledge” model is also more reasonable than the random surfer model on the social graph, although the prior knowledge (P(Xi⇒Xj)) should be estimated by different means.

4. Web graph construction with information on browsing behavior

4.1. Data on user browsing behavior

Based on the “surfing with prior knowledge” model described in Section 3, we revise the original PageRank and TrustRank algorithms by incorporating prior user browsing behavior information. Therefore, the newly proposed userPageRank and userTrustRank algorithms require additional information and cannot be performed on the original Web graph. To construct a reliable Web graph that incorporates user browsing behavior information, we collected data on users' browsing behavior (also called Web-access log data or Web usage data). In contrast to log data from search engine queries and click-through data, this kind of data is collected using browser toolbars. It contains information on Web users' total browsing behavior, including their interactions with search engines and other Web sites.

To provide value-added services to users, most browser toolbars also collect anonymous click-through information on users' browsing behavior. Previous work such as [3] has used this kind of click-through information to improve ranking performance. Liu et al. [23] have proposed a Web spam identification algorithm based on this kind of user behavior data. In this paper, we also adopt Web access logs collected by toolbars because this enables us to freely collect users' browsing behavior information with no interruption to the users. An example of the information recorded in these logs is shown in Table 2 and Example 1.

Table 2 and Example 1 show that no private information was included in the log data. The information shown can be easily recorded using browser toolbars by commercial search engine systems. Therefore, collecting this kind of information for the construction of hyperlink graphs is practical and feasible.
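Records of the kind listed in Table 2 can be aggregated into the per-edge click counts UC(Xi⇒Xj) that userPageRank and userTrustRank require. A sketch, assuming each record follows the “(time stamp) (session ID) (source URL); (destination URL)” layout of Example 1; the URLs below are made up for illustration:

```python
# A sketch of aggregating Web-access log records into UC click counts.
# The record layout is an assumption based on Example 1.
import re
from collections import Counter

# One record: (time stamp) (session ID) (source URL); (destination URL)
LOG_RE = re.compile(
    r"\((?P<time>[^)]*)\) \((?P<sid>[^)]*)\) \((?P<src>[^)]*)\); \((?P<dst>[^)]*)\)"
)

def click_counts(lines):
    """Aggregate UC(src => dst) click counts from raw log lines."""
    clicks = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue                    # skip malformed records
        clicks[(m.group("src"), m.group("dst"))] += 1
    return clicks

# Illustrative records (session IDs and URLs are hypothetical).
log = [
    "(01:07:09) (3ffd50dc34fcd7409100101c63e9245b) (http://a.example.com/); (http://b.example.com/)",
    "(01:07:10) (3ffd50dc34fcd7409100101c63e9245b) (http://a.example.com/); (http://b.example.com/)",
    "(01:07:11) (f0ac3a4a87d1a24b9c1aa328120366b0) (http://b.example.com/); (http://c.example.com/)",
]
uc = click_counts(log)
```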

4.2. Construction of a user browsing graph and a user-oriented hyperlink graph

With the data on users' browsing behavior described in Section 4.1, we identified which pages and hyperlinks were visited, and the following two algorithms are adopted to construct the user browsing graph and the user-oriented hyperlink graph, respectively.

Algorithm 1 constructs a graph completely based on user behavior data. Only nodes and hyperlinks that were visited at least once are added to the graph. This graph is similar to the graph constructed by Liu et al. in [22], except that the number of user visits on each

Table 2. Information recorded in the Web-access logs.

Session ID         A randomly assigned ID for each user session
Source URL         URL of the page that the user is visiting
Destination URL    URL of the page to which the user navigates


Example 1. A sample Web-access log collected on Dec. 15, 2008.

(01:07:09) (3ffd50dc34fcd7409100101c63e9245b) (http://v.youku.com/v_playlist/f1707968o1p7.html); (http://www.youku.com/playlist_show/id_1707968.html)

(01:07:09) (f0ac3a4a87d1a24b9c1aa328120366b0) (http://user.qzone.qq.com/234866837); (http://cnc.imgcache.qq.com/qzone/blog/tmygb_static.htm)

(01:07:09) (3fb5ae2833252541b9ccd9820bad30f6) (http://www.qzone8.net/hack/45665.html); (http://www.qzone8.net/hack/)

394 Y. Liu et al. / Decision Support Systems 54 (2012) 390–401

edge is also recorded to estimate P(Xi⇒Xj) for userPageRank and userTrustRank. Following their convention, we also call this graph the user browsing graph (BG(V,E) for short).

Algorithm 2 constructs a graph distinct from BG(V,E). These two graphs share a common set of nodes, though the graph constructed with Algorithm 2 retains all of the edges between these nodes from the original Web graph. We call this graph a user-oriented hyperlink graph (user-HG(V,E) for short) because it is extracted from the original Web graph but has nodes selected with user information. The original Web graph was constructed by the same search engine company that provided Web access logs to us. Collected in July 2008, it contains over 3 billion pages from 111 million Web sites and covers a major proportion of Chinese Web pages at that time.

Thus, both BG(V,E) and user-HG(V,E) are constructed with the help of browsing behavior data. The latter graph contains more hyperlinks, whereas the former graph only retains hyperlinks that are actually followed by users. We can see that userPageRank and userTrustRank cannot be performed on user-HG(V,E) because browsing information is not recorded for all edges on this graph.

4.3. Comparison of the user browsing and user-oriented hyperlink graphs

We constructed BG(V,E) and user-HG(V,E) with the data on user behavior described in Section 4.1. Table 3 shows how the compositions of these two graphs differ from each other.

According to Table 3, we found that although the hyperlink graph user-HG(V,E) shares a common set of nodes with BG(V,E), the compositions of these two graphs differ significantly. First, BG(V,E) is less than one-tenth the size of user-HG(V,E). The percentage of common edges in user-HG(V,E) is only 1.86%; thus, most (98.14%) of the links in user-HG(V,E) are not actually clicked by users. This difference is consistent with people's Web browsing experience: pages usually provide far more hyperlinks than users can click.

Another interesting finding is that the user-HG(V,E) graph does not include all the edges in BG(V,E). Less than one-quarter of the edges in BG(V,E) also appear in user-HG(V,E). This phenomenon can

Algorithm 1. Construction of the user browsing graph.

1. V = {}, E = {}
2. For each record in the Web-access log, if the source URL is A and the destination URL is B, then:
   if A ∉ V, V = V ∪ {A};
   if B ∉ V, V = V ∪ {B};
   if (A, B) ∉ E, E = E ∪ {(A, B)}, Count(A, B) = 1;
   else Count(A, B)++.
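Algorithm 1 above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the record format (a sequence of source/destination URL pairs) and all names are assumptions.

```python
from collections import defaultdict

def build_browsing_graph(records):
    """Algorithm 1 sketch: build the user browsing graph BG(V, E) from
    Web-access log records, counting user traversals per edge.
    Each record is an assumed (source_url, destination_url) pair."""
    V = set()
    count = defaultdict(int)   # edge (A, B) -> number of user visits
    for a, b in records:
        V.add(a)               # if A not in V, add A
        V.add(b)               # if B not in V, add B
        count[(a, b)] += 1     # Count(A, B)++ (first visit sets it to 1)
    return V, dict(count)

# Toy log: two traversals of A -> B, one of B -> C.
log = [("A", "B"), ("B", "C"), ("A", "B")]
V, E = build_browsing_graph(log)
```

The per-edge counts are exactly what Section 3's P(Xi⇒Xj) estimation consumes.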

be partially explained by the fact that user-HG(V,E) is constructed with information collected by Web crawlers, and it is not possible for any crawler to collect the hyperlink graph of the whole Web; it is too large and changes too quickly. When we examined the links that only appear in BG(V,E), we found another reason why user-HG(V,E) does not include them. A large proportion of these links come from users' clicks on search engine result pages (SERPs). Table 4 shows the number of SERP-oriented hyperlinks in BG(V,E).

Tables 3 and 4 reveal that of the links that appear only in BG(V,E) (7.97 million edges in total), over 3.27 million come from SERPs of the five most frequently used Chinese search engines. This number constitutes 30.96% of all edges in BG(V,E). Web users click many links on SERPs, but almost none of these links would be collected by crawlers. These links contain valuable information because they link to Web pages that are both recommended by search engines and clicked by users. It is not possible for Web crawlers to collect all of the links from SERPs without information on user behavior because the number of such links would be overwhelmingly large.

Another important type of link that appears only in BG(V,E) is hyperlinks clicked in users' password-protected sessions. For example, login authorization is sometimes needed to visit blog pages. After logging in, Web users often navigate among these pages, and Web-access logs can record these browsing behaviors. However, ordinary Web crawlers cannot collect these links because they are not allowed to access the contents of protected Web pages.

4.4. Construction of the user-oriented combined graph

Section 4.3 shows that the user browsing graph differs from the user-oriented hyperlink graph in at least two ways: first, compared with user-HG(V,E), a large fraction of the edges (98.14% of E in user-HG(V,E)) are omitted from BG(V,E) because they are not clicked by any user. Second, BG(V,E) contains hyperlinks that are difficult or impossible for Web crawlers to collect. Thus, each graph contains unique information that is not contained by the other graph. Therefore, if we construct a graph containing all of the hyperlinks and nodes in

Algorithm 2. Construction of the user-oriented hyperlink graph.

1. V = {}, E = {}
2. For each record in the Web-access log, if the source URL is A and the destination URL is B, then:
   if A ∉ V, V = V ∪ {A};
   if B ∉ V, V = V ∪ {B}.
3. For each A and each B in V,
   if (A, B) ∈ (Original Web Graph) AND (A, B) ∉ E, E = E ∪ {(A, B)}.
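Algorithm 2 can be sketched as follows; a minimal illustration under the assumption that the crawled Web graph is available as a set of (source, destination) pairs. All names are illustrative.

```python
def build_user_hyperlink_graph(records, web_graph_edges):
    """Algorithm 2 sketch: user-oriented hyperlink graph user-HG(V, E).
    Nodes come from the access log; edges are every hyperlink of the
    original Web graph connecting two user-visited nodes, whether or
    not any user actually followed the link."""
    V = set()
    for a, b in records:
        V.update((a, b))
    E = {(a, b) for (a, b) in web_graph_edges if a in V and b in V}
    return V, E

# Toy example: users visited A, B, C. The crawl also knows the edge
# C -> A (never clicked) and an edge to the unvisited node D (dropped).
log = [("A", "B"), ("B", "C")]
crawl = {("A", "B"), ("C", "A"), ("C", "D")}
V, E = build_user_hyperlink_graph(log, crawl)
```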


Table 3. Differences between BG(V,E) and user-HG(V,E) in the edge sets.

Graph          # Common edges   # Total edges   Percentage of common edges
BG(V,E)        2,591,716        10,564,205      24.53%
User-HG(V,E)   2,591,716        139,125,250     1.86%

Algorithm 3. Construction of the user-oriented combined graph.

1. V = {}, E = {}
2. For each record in the Web-access log, if the source URL is A and the destination URL is B, then:
   if A ∉ V, V = V ∪ {A};
   if B ∉ V, V = V ∪ {B}.
3. For each A and each B in V,
   if (A, B) ∈ BG(V,E) OR (A, B) ∈ user-HG(V,E), E = E ∪ {(A, B)}.


BG(V,E) and user-HG(V,E), it should contain more complete hyperlink information. We adopt the following algorithm (Algorithm 3) to construct such a graph, which combines all of the hyperlink information in BG(V,E) and user-HG(V,E).

This algorithm constructs a graph that shares the same node set as BG(V,E) and user-HG(V,E) but contains the hyperlinks of both graphs. Because it combines the edge sets of BG(V,E) and user-HG(V,E), we call it a user-oriented combined graph (user-CG(V,E) for short). Similar to user-HG(V,E), it does not contain clicking information for all of its edges, and userPageRank/userTrustRank cannot be performed on it.
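Since the vertex set is shared, Algorithm 3's combination step reduces to a union of the two edge sets, as this sketch (with illustrative toy data) shows:

```python
def build_combined_graph(bg_edges, user_hg_edges):
    """Algorithm 3 sketch: user-oriented combined graph user-CG(V, E).
    Same vertex set as BG(V,E)/user-HG(V,E); the edge set is simply
    the union of the two graphs' edge sets."""
    return set(bg_edges) | set(user_hg_edges)

# BG holds clicked links (including SERP links that crawlers miss);
# user-HG holds crawled links between user-visited nodes.
bg = {("A", "B"), ("SERP", "A")}
user_hg = {("A", "B"), ("C", "A")}
cg = build_combined_graph(bg, user_hg)
```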

4.5. Statistics of the constructed graphs

With the data from Web-access logs described in Section 4.1 and the original whole Web graph (named whole-HG(V,E) for short) mentioned in Section 4.2, we constructed three graphs (BG(V,E), user-HG(V,E), and user-CG(V,E)). These graphs were constructed at the site level instead of the page level to improve efficiency. This level of resolution is also appropriate because a large number of search engines adopt site-level link analysis algorithms and then obtain page-level link analysis scores using a propagation process within Web sites. Another problem with a page-level graph is that, due to data sparsity, a large proportion of pages receive only a few user visits, so the behavior data may not be reliable. For a site-level graph, however, the average number of user visits per site is much larger, and data sparsity can be avoided to a large extent. In our previous work [28], we also found that a site-level model outperformed a page-level model because the average number of browsing activities per site is much larger, indicating more reliable behavior information sources.

Descriptive statistics of these constructed graphs are shown in Table 5.

We can see from Table 5 that BG(V,E), user-HG(V,E) and user-CG(V,E) cover a small percentage (3.83%) of the vertices of the original Web graph. The edge sets of these three graphs are also much smaller than that of the whole Web graph, but the average number of hyperlinks per node in user-HG(V,E) and user-CG(V,E) is higher than that of whole-HG(V,E). This result means that user-accessed nodes are more strongly connected to each other than the other parts of the Web. This pattern hints at the presence of a large SCC (strongly connected component), as proposed in [8], in the user browsing graphs. Another finding is that compared with user-HG(V,E) and user-CG(V,E), the ratio of edges to vertices in BG(V,E) is much smaller. Thus, a large fraction of hyperlinks are removed from this graph because they are not followed by users. The retained links are ostensibly more reliable than the others; however,

Table 4. Number of SERP-oriented edges that are not included in user-HG(V,E).

Search engine   Number of edges not included in user-HG(V,E)
Baidu.com       1,518,109
Google.cn       1,169,647
Sogou.com       291,829
Soso.com        147,034
Yahoo.com       143,860
Total           3,270,479 (30.96% of all edges in BG(V,E))

whether this information loss creates problems for link analysis algorithms remains to be determined.

5. Structure and evolution of constructed graphs

5.1. Structure of the constructed graphs

The degree distribution has been used to describe the structure of the Web by many researchers, such as Broder et al. [5]. The existence of a power law in the degree distribution has been verified by several Web crawls [5,8] and is regarded as a basic property of the Web. We were interested in whether power laws could also describe the in-degree and out-degree distributions in the constructed graphs. Experimental results on the degree distributions of both BG(V,E) and user-HG(V,E) are shown in Figs. 2 and 3. We did not consider the degree distributions of user-CG(V,E) because it is a combination of BG(V,E) and user-HG(V,E). If the in-degree and out-degree distributions of these two graphs follow a power law, user-CG(V,E) will as well.

Fig. 2 shows that the in-degree distributions of both BG(V,E) and user-HG(V,E) follow a power law. The exponent of the power law (1.75) is smaller than that found in previous results (approximately 2.1 in [6,9]). This difference arises because our hyperlink graph is based on sites, whereas previous graphs were based on pages. There are fewer unpopular (low in-degree) nodes in a site-level graph compared with a page-level graph because a large number of unpopular pages may come from the same Web site. Another phenomenon is that the exponent of the power law distribution in BG(V,E) (2.30) is larger than that of user-HG(V,E) (1.75). This difference implies that with an increase in in-degree i, the number of vertices with i in-links drops faster in the user browsing graph. This pattern can be explained by the fact that some Web sites are relatively more popular (have higher in-degree) in the user browsing graph than in the user-oriented hyperlink graph.

The out-degree distributions of both graphs also subscribe to the power law (Fig. 3). The exponent of the out-degree distribution in a page-based graph has been estimated to be 2.7 [5,8]. The exponent estimated for our site-based graph is much smaller (1.9). In a site-based graph, out-links that link to pages in the same site are omitted. This omission reduces the number of out-links of many vertices and reduces the difference between high and low out-link vertices. The exponent of

Table 5. Sizes of the constructed and the original Web graphs.

Graph           Vertices (#)   Edges (#)        Edges/vertices
BG(V,E)         4,252,495      10,564,205       2.48
User-HG(V,E)    4,252,495      139,125,250      32.72
User-CG(V,E)    4,252,495      147,097,739      34.59
Whole-HG(V,E)   110,960,971    1,706,085,215    15.38


Fig. 2. In-degree distributions of both BG(V,E) and user-HG(V,E) subscribe to the power law.


the out-degree distribution in BG(V,E) is larger than that in user-HG(V,E). For the out-degree distribution, this difference means that with the increase in out-degree o, the number of vertices with o out-links drops faster in the user browsing graph.

The experimental results shown in Figs. 2 and 3 confirm that, similar to the whole Web graph, the in-degree and out-degree distributions of both BG(V,E) and user-HG(V,E) follow a power law. However, the exponents of the power law distributions are different because the constructions of BG(V,E) and user-HG(V,E) decrease the numbers of valueless nodes and hyperlinks compared with the original Web graph. The fact that BG(V,E) and user-HG(V,E) inherit characteristics of the whole Web makes it possible for us to perform state-of-the-art link analysis algorithms on these graphs.
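A power-law check of the kind used for Figs. 2 and 3 can be sketched as a least-squares fit on a log-log scale. The paper does not state its fitting procedure, so this is only one plausible approach, on synthetic data:

```python
import math

def powerlaw_exponent(degree_counts):
    """Estimate the exponent k of n(d) ~ d^(-k) by linear regression of
    log n(d) on log d (a common, simple estimator; an assumption here,
    not necessarily the authors' method)."""
    pts = [(math.log(d), math.log(n))
           for d, n in degree_counts.items() if d > 0 and n > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    slope = sum((x - mx) * (y - my) for x, y in pts) / \
            sum((x - mx) ** 2 for x, _ in pts)
    return -slope  # the exponent is the negated log-log slope

# Synthetic degree distribution with exact exponent 2: n(d) = 1000 d^-2
counts = {d: 1000.0 * d ** -2 for d in range(1, 50)}
```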

5.2. Evolution of BG(V,E) and quality estimation of newly visited pages

The purpose of our work is to estimate Web page quality with the help of information on user browsing behavior. For practical search engine applications, an important issue is whether the page quality scores calculated off-line can be adopted for the on-line search process. BG(V,E), user-HG(V,E) and user-CG(V,E) were all constructed with browsing behavior information collected by search engines. This kind of information is collected during a certain time period. Therefore, user behavior outside this time period cannot be included in the construction of these graphs. If pages needed by users are not included in the graphs, it is impossible to calculate their quality scores. Therefore, it is important to determine how the compositions of these

Fig. 3. Out-degree distributions of both BG(V,E) and user-HG(V,E) subscribe to the power law.

graphs evolve over time and whether newly visited pages can be included in the graphs.

To determine whether the construction of BG(V,E), user-HG(V,E) and user-CG(V,E) can avoid the problem of newly visited and missing pages, we designed the following experiment:

Step 1 A large number of pages appear each day, and only a fraction of them are visited by users. We focus only on the new pages that are actually visited by users because the absence of such pages from the graph could affect users' browsing experiences. Therefore, we examine how many newly visited pages are included in the constructed graphs.
In Fig. 4, each data point shows the percentage of newly clicked pages/hyperlinks that are not included in BG(V,E). On each day, BG(V,E) is constructed with browsing behavior data collected from Aug. 3, 2008 (the first day in the figure), to the date before that day. We focus on BG(V,E) because user-HG(V,E) and user-CG(V,E) share the same vertex set.
On the first day, all of the edges and vertices are newly visited because no data has yet been included in BG(V,E). From the second day to approximately the 15th day, the percentage of newly visited edges and vertices drops. On each day after the 15th day, approximately 30% of the edges and 20% of the vertices are new to the BG(V,E) constructed with data collected before that day.
During the first 15 days, the percentage of newly visited edges and vertices drops because the structure of the browsing graph becomes more complete each day. On the 15th day, the browsing graph contains 6.12 million edges and 2.56 million vertices. From then on, the number of newly visited edges and vertices is relatively stable: approximately 0.3 million new edges and 0.1 million new vertices appear on each subsequent day. Therefore, it takes approximately 15 days to construct a stable user browsing graph, after which approximately 20% of newly visited Web sites are not included in BG(V,E) each day.

Step 2 According to Step 1, approximately 20% of newly visited sites would be missing if we adopted BG(V,E) for quality estimation (supposing BG(V,E) is updated daily). To determine whether this missing subset of newly visited sites affects quality estimation, we examined whether Web sites that are not included in the graph are indexed by search engines. If they are not indexed by search engines, it is not necessary to calculate their quality estimation scores because search engines will not require these scores. We sampled 30,605 pages from the sites that were visited by users but not included in BG(V,E) (approximately 1% of all visited pages in these sites) and checked whether they are indexed by four widely used Chinese search engines (Baidu.com, Google.cn, Sogou.com, Yahoo.cn). The experimental results are shown in Table 6 (SE1–SE4 are used instead of search engine names).
The experimental results in Table 6 show that most of these pages (88.74% on average) are not indexed by search engines. It is not necessary for BG(V,E) to include these pages because search engines do not require their quality estimation scores.

Step 3 According to the results of Steps 1 and 2, we can calculate that only 2.2% (11.26% × 20%) of newly visited pages are both excluded from BG(V,E) and required for quality estimation. Among the pages that are both indexed by search engines and visited by users, most will be included in BG(V,E) if this graph is updated daily with new log data on browsing behavior. Therefore, it is appropriate to use BG(V,E) in quality estimation. Because user-HG(V,E) and user-CG(V,E) share the same vertex set with BG(V,E), these constructed graphs are also not substantially affected by the problem of newly visited and missing pages. Thus, these graphs are also appropriate for quality estimation.


Table 6. Percentage of newly visited pages indexed by search engines.

Search engine   Percentage of pages indexed
SE1             8.65%
SE2             11.52%

Fig. 4. Evolution of BG(V,E). Category axis: day number, assuming Aug. 3, 2008, is the first day. Value axis: percentage of newly clicked pages/hyperlinks not included in BG(V,E) (BG(V,E) is constructed with data collected from the first day to the given day).


6. Experimental results and discussions

6.1. Experimental setup

In Section 1, we assume that the user-accessed part of the Web is more reliable than the parts that are never visited by users. On the basis of this assumption, we construct three different hyperlink graphs based on browsing behavior. To determine whether the constructed graphs outperform the original Web graph in estimating page quality, we adopted two evaluation methods.

The first method is based on the ROC/AUC metric, which is a traditional measure in quality estimation research, such as the “Web Spam Challenge”.4 To construct an ROC/AUC test set, we sampled 2279 Web sites randomly according to their frequencies of user visits and had two assessors annotate their quality scores. Approximately 39% of these sites were annotated as “high quality”, 19% were “spam”, and the others were “ordinary”. After performing link analysis algorithms, each site in the test set was assigned a quality estimation score. We can evaluate the performance of a link analysis algorithm on the basis of whether it assigns higher scores to good pages and lower scores to bad ones.
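The AUC value used here can be read as the probability that a randomly drawn good site outscores a randomly drawn bad one. A minimal sketch of this rank-based computation (the paper does not specify its implementation; the toy scores are illustrative):

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) score pairs in which
    the positive site outscores the negative one; ties count one half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# A perfect ranker scores every good site above every bad one.
good, bad = [0.9, 0.8, 0.7], [0.4, 0.1]
```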

The second method is a pairwise orderedness test. This test was first proposed by Gyöngyi et al. [10] and is based on the assumption that good pages should be ranked higher than bad pages by an ideal algorithm. We constructed a pairwise orderedness test set composed of 782 pairs of Web sites. These pairs were annotated by product managers of a Web user survey company. The pairwise orderedness is believed to reflect the two sites' differences in reputation. For example, both http://video.sina.com.cn/ and http://v.blog.sohu.com/ are famous video-sharing Web sites in China. However, the former site is more popular and receives more user visits, so the pairwise quality order is http://video.sina.com.cn/ > http://v.blog.sohu.com/. If an algorithm assigns a higher score to http://video.sina.com.cn/, it passes this pairwise orderedness test. We use the accuracy rate to evaluate performance on the pairwise orderedness test, defined as the percentage of correctly ranked Web site pairs.
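The accuracy rate of the pairwise orderedness test can be sketched as follows. The second pair and all scores are hypothetical, for illustration only:

```python
def pairwise_orderedness_accuracy(pairs, score):
    """Fraction of annotated (better_site, worse_site) pairs that an
    algorithm ranks correctly, i.e. assigns the better site a higher
    quality estimation score."""
    correct = sum(1 for better, worse in pairs if score[better] > score[worse])
    return correct / len(pairs)

# One real pair from the text plus one made-up pair the ranker gets wrong.
pairs = [("video.sina.com.cn", "v.blog.sohu.com"),
         ("siteA", "siteB")]
score = {"video.sina.com.cn": 0.9, "v.blog.sohu.com": 0.6,
         "siteA": 0.2, "siteB": 0.5}
```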

With these two evaluation methods, we tested whether traditional hyperlink analysis algorithms perform better on BG(V,E), user-HG(V,E) and user-CG(V,E) than on the original Web graph. In addition, we also investigated whether a link analysis algorithm specifically designed for browsing graphs (such as BrowseRank) performs better than the traditional methods (such as PageRank and TrustRank).

First, we compared the performance of the link analysis algorithms on the four graphs (BG(V,E), user-HG(V,E), user-CG(V,E) and whole-HG(V,E)). Second, we compared the performance of PageRank,

4 http://webspam.lip6.fr/.

TrustRank, DiffusionRank and BrowseRank on BG(V,E). The latter comparisons were only performed on BG(V,E) because BrowseRank requires users' stay time information, which is only available for BG(V,E). In addition, to examine how the proposed userPageRank and userTrustRank algorithms perform, we compared their performance to that of the original algorithms on both a user browsing graph and a social graph constructed with data from China's largest micro-blogging service provider, weibo.com.

For TrustRank and DiffusionRank, a high quality page “seed” set must be constructed. In these experiments, we follow the construction method proposed by Gyöngyi et al. in [10], which is based on an inverse PageRank algorithm and human annotation. The inverse PageRank algorithm was performed on the whole Web graph, and we annotated the top 2000 Web sites ranked by inverse PageRank. Finally, 1153 high quality and popular Web sites were selected to compose the seed set. The parameters in our implementations of the PageRank, TrustRank and DiffusionRank algorithms are all tuned according to their original implementations [10,24,27]. The α parameter of the PageRank and TrustRank algorithms is set to 0.85 according to [10,24], and the number of iterations is set to 30 for both because that is enough for the results to converge. Parameters for the DiffusionRank algorithm are set as γ=1.0, α=0.85, and M=100, according to [27].
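With these settings, plain PageRank by power iteration can be sketched as below. This is a minimal textbook sketch with the paper's parameters (α = 0.85, 30 iterations), not the authors' implementation; the dangling-node handling and toy graph are assumptions.

```python
def pagerank(edges, nodes, alpha=0.85, iters=30):
    """Power-iteration PageRank. Dangling nodes (no out-links)
    redistribute their mass uniformly over all nodes."""
    n = len(nodes)
    out = {u: [v for (a, v) in edges if a == u] for u in nodes}
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        dangling = sum(pr[u] for u in nodes if not out[u])
        nxt = {u: (1 - alpha) / n + alpha * dangling / n for u in nodes}
        for u in nodes:
            for v in out[u]:
                nxt[v] += alpha * pr[u] / len(out[u])
        pr = nxt
    return pr

# Toy site-level graph: C gets links from both A and B, B only from A.
nodes = {"A", "B", "C"}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
pr = pagerank(edges, nodes)
```

The modified userPageRank of Section 3 replaces the uniform 1/|out(u)| split with the click-estimated transition probability P(Xi⇒Xj).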

6.2. Quality estimation with different graphs

With the four different hyperlink graphs shown in Table 5, we applied the PageRank algorithm and evaluated the performance of page quality estimation. The experimental results for high quality page identification, spam page identification and the pairwise orderedness test are shown in Fig. 5. The performance of high quality and spam page identification is measured by the AUC value, whereas the pairwise orderedness test uses accuracy as the evaluation metric.

Fig. 5 shows that PageRank applied to the original Web graph (whole-HG(V,E)) performs the worst in all three quality estimation tasks. This result indicates that the graphs constructed by Algorithms 1–3 can more effectively estimate Web page quality than

Table 6 (continued)

SE3             10.47%
SE4             14.41%
Average         11.26%


Table 8. PageRank ranking comparison of some government agency Web sites on whole-HG(V,E) and BG(V,E).

Web site              PageRank ranking on whole-HG(V,E)   PageRank ranking on BG(V,E)
www.miibeian.gov.cn   5                                   23
www.hd315.gov.cn      2                                   117

Fig. 5. Quality estimation results with PageRank performed on BG(V,E), user-HG(V,E), user-CG(V,E) and whole-HG(V,E).



can the original Web graph. The improvements in performance associated with each of these three graphs are shown in Table 7.

According to Table 7, the graphs constructed with information on browsing behavior outperform the original Web graph by approximately 5–25%. The adoption of user browsing behavior helps reduce possible noise in the original graph and makes the graph more reliable. This finding agrees with the results in [22] that BG(V,E) outperforms the original Web graph. It also validates our assumption, proposed in Section 1, that the user-accessed part of the Web is more reliable than the parts that are never visited by users.

According to Fig. 5 and Table 7, among the three graphs constructed with user behavior information, BG(V,E) performs the worst, whereas user-HG(V,E) and user-CG(V,E) obtain very similar results. As described in Section 5.1, BG(V,E) contains fewer edges than the other two graphs. The retained links are on average more informative than the edges in the other graphs; however, this huge loss of edge data also compromises page quality estimation. User-HG(V,E) and user-CG(V,E) share the same vertex set, and their edge sets are also very similar (only 7.97 million edges are added to user-CG(V,E), making up 5.14% of the edges of the whole graph). Therefore, these two graphs perform similarly in page quality evaluation.

BG(V,E), user-HG(V,E) and user-CG(V,E) share the same vertex set, which is composed of all user-accessed sites recorded in the Web-access logs. Although BG(V,E) contains the fewest edges of the four graphs, it still outperforms whole-HG(V,E). This result shows that the selection of the vertex set is more important than the selection of the edge set. Removing the unvisited nodes from the original Web graph can be an effective method for constructing a hyperlink graph.

In Section 1, we showed in Table 1 a list of Web sites that are ranked at the top according to PageRank scores on the original Web graph. We also found that some government Web sites (e.g., www.miibeian.gov.cn and www.hd315.gov.cn) are ranked quite high but fail to draw much user attention. These Web sites are authoritative and important, but they should not be ranked so high because other similar government agency Web sites are generally ranked much lower. However, when we look into the results of PageRank performed on BG(V,E), we find that the rankings of www.miibeian.gov.cn and www.hd315.gov.cn are more reasonable.

According to Table 8, both www.miibeian.gov.cn and www.hd315.gov.cn are ranked lower according to PageRank on BG(V,E) than

Table 7. Performance improvements of the graphs constructed with Algorithms 1–3 compared to the original Web graph.

Test method                        Improvement compared with whole-HG(V,E)
                                   BG(V,E)    User-HG(V,E)   User-CG(V,E)
High quality page identification   +5.69%     +7.55%         +7.12%
Spam page identification           +3.77%     +7.44%         +7.46%
Pairwise orderedness test          +15.14%    +20.34%        +19.67%

on whole-HG(V,E). They are still important resources according to the algorithm on the user browsing graph, but not as important as the top-ranked ones. We believe that the rankings on BG(V,E) give a better estimation of their quality in terms of both popularity and authority.

6.3. Quality estimation with different link analysis algorithms

In [22], Liu et al. have shown that a specifically designed link analysis algorithm (BrowseRank) outperforms TrustRank and PageRank for both spam fighting and high quality page identification when the latter two algorithms are applied to the original Web graph. They explained that BrowseRank improves performance because it can better represent users' preferences than PageRank and TrustRank. However, it is still unclear whether this improvement comes from algorithm and model design or from the adoption of data on user behavior. Thus, we tested the performance of PageRank, TrustRank and BrowseRank on the same BG(V,E) graph. This comparison was only performed on BG(V,E) because the calculation of BrowseRank requires users' stay time information, which is applicable to BG(V,E) only.

PageRank performs better on BG(V,E) than on the original Web graph (Fig. 5). Therefore, it is possible that the BrowseRank algorithm improves performance simply because it is performed on a graph constructed from data on user browsing behavior. The experimental results shown in Fig. 6 validate this assumption. TrustRank performs the best in both spam page identification and high quality page identification, whereas PageRank performs slightly better than the other three algorithms in the pairwise orderedness test. The good performance of TrustRank might come from the prior information stored in the “seed” set.

According to the results, TrustRank outperforms BrowseRank by 4.12% and 2.84% in the high quality and spam page identification tasks, respectively. The performance improvements are small but demonstrate that the TrustRank algorithm can also be very effective on BG(V,E). The PageRank algorithm also performs no worse than BrowseRank on any of these tests. This result means that the performance improvement of the BrowseRank algorithm reported in [22] comes both from algorithm design and, perhaps more importantly, from the adoption of information on user browsing behavior. Additionally, PageRank and TrustRank are more efficient than BrowseRank because they do not require collecting information on users' stay time.

Fig. 6. Results of quality estimation with different link analysis algorithms on BG(V,E).


Fig. 7. Quality estimation results with the original PageRank/TrustRank and userPageRank/userTrustRank algorithms on BG(V,E).

Table 10. Information on sites that connect to a spam site (http://11sss11xp.org/).

Site              # Out-links   # User visits
www.yahoo.cn      35,000        208,658
my.51.com         86,295        19,443,717
image.baidu.com   148,611       8,218,706


These results and examples demonstrate that although BrowseRank is specially designed for BG(V,E), it does not perform better than PageRank, TrustRank or DiffusionRank applied to BG(V,E). BrowseRank favors the pages where users stay longer, but stay time does not necessarily indicate quality or user preference. Compared with the algorithm design, the incorporation of information on user browsing behavior into the construction of link graphs is perhaps more important.

6.4. UserPageRank and UserTrustRank on user browsing graph

In Section 3, we proposed the userPageRank and userTrustRank algorithms, which modify the original algorithms by estimating P(Xi⇒Xj) according to user browsing information recorded in BG(V,E). To examine the effectiveness of these algorithms, we compared their performance with that of the original PageRank/TrustRank algorithms (Fig. 7).

The modified algorithms perform slightly better than the original algorithms. They perform almost equivalently in high quality page identification and slightly differently in spam page identification. For both the PageRank and TrustRank algorithms, the modified versions outperform the original ones by approximately 3% in spam identification. Examining several test cases, we find that this performance improvement comes from the modification to the algorithms.

An example is the spam site whose URL is http://11sss11xp.org/. Among the 2279 Web sites in the ROC/AUC test set, it is ranked 1030th by the original TrustRank algorithm and 1672nd by the userTrustRank algorithm. Because a spam site should be assigned a low ranking position, userTrustRank performs better for this test case. We investigated the hyperlink structure of this site to analyze why the modified algorithm performs better.

In Tables 9 and 10, we can see that this site receives many in-links from search engines (such as www.yahoo.cn and image.baidu.com). This phenomenon can be explained by the fact that spam sites are designed to achieve unjustifiably favorable rankings in search engines. This spam site also receives in-links from several Web 2.0 sites, such as my.51.com, which is a blog service site. With the original TrustRank algorithm, the trust score of a source site is divided evenly among its outgoing links. In contrast, userTrustRank assigns trust scores by estimating P(Xi⇒Xj), the probability of visiting site Xj after visiting Xi. Because this site is a spam site that users generally do not visit, P(Xi⇒Xj) for this site should be low. For example,

Table 9. Web sites that link to a spam site (http://11sss11xp.org/) in BG(V,E).

Source Web site                 Destination Web site     # User visits
http://web.gougou.com/          http://11sss11xp.org/    3
http://image.baidu.com/         http://11sss11xp.org/    1
http://www.yahoo.cn/            http://11sss11xp.org/    1
http://domainhelp.search.com/   http://11sss11xp.org/    1
http://my.51.com/               http://11sss11xp.org/    1

the site www.yahoo.cn has 35,000 outgoing links in BG(V,E). Altogether, 208,658 user clicks are performed on these outgoing links, and only one of them links to 11sss11xp.org. With the original TrustRank algorithm, the spam site receives 1/35,000 of Yahoo's trust score, whereas userTrustRank assigns only 1/208,658 of the corresponding score to this spam site. We can see that userTrustRank divides a page's trust score according to counts of users' visits, and this adaptation can help identify spam sites.
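The arithmetic of this example can be made concrete with a small sketch. The trust score of 1.0 and the helper names are hypothetical; only the 35,000 out-link and 208,658 visit counts come from the text:

```python
# Illustrative comparison of how the original TrustRank and the
# visit-weighted userTrustRank split a source site's trust score.
# The trust value itself is a made-up placeholder.

def trust_share_uniform(trust, out_links):
    # Original TrustRank: divide evenly among all outgoing links.
    return trust / out_links

def trust_share_by_visits(trust, edge_visits, total_out_visits):
    # userTrustRank: divide in proportion to observed user traversals.
    return trust * edge_visits / total_out_visits

trust = 1.0  # hypothetical trust score of www.yahoo.cn
uniform = trust_share_uniform(trust, 35_000)         # 1/35,000 of the score
weighted = trust_share_by_visits(trust, 1, 208_658)  # 1/208,658 of the score
assert weighted < uniform  # the rarely visited spam site receives less trust
```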

6.5. UserPageRank and UserTrustRank on social graph

In order to further examine the performance of the userPageRank and userTrustRank algorithms, we also constructed a social graph as described in Section 3 and examined how the algorithms perform on it. The data was collected in September 2011 from weibo.com, which is China's largest social network service provider. Information on 2,631,342 users and about 3.6 billion relationships was collected. To the best of our knowledge, it is one of the largest corpora used in social network studies. The information recorded in our data set is shown in Table 11.

As described in Section 3, userPageRank and userTrustRank require the estimation of P(Xi⇒Xj) as prior knowledge. In the social graph, we adopted the number of common tags as a sign of closeness between users. We believe this assumption is reasonable because following relationships between users with many common interests should be more reliable than those between users without. Therefore, the weight of an edge in the social graph equals the number of common tags between the nodes it connects. After performing the userPageRank and userTrustRank algorithms on the weighted social graph, the resulting social influence estimates are shown in Fig. 8.
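A minimal sketch of this weighting scheme, assuming each user's tags are available as a set of strings (the function name and sample data are illustrative, not from the weibo.com corpus):

```python
# Weight each "follows" edge of the social graph by the number of tags
# the two users share, as described above. Data shapes are assumptions.

def build_weighted_social_graph(follows, tags):
    """follows: iterable of (follower, followee) pairs.
    tags: dict mapping user -> set of tag strings.
    Returns a dict mapping each edge to its common-tag count."""
    weights = {}
    for follower, followee in follows:
        common = tags.get(follower, set()) & tags.get(followee, set())
        weights[(follower, followee)] = len(common)
    return weights
```

An edge between users with no shared tags gets weight 0, so it contributes nothing when the weighted userPageRank/userTrustRank iteration normalizes outgoing weights.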

Fig. 8 shows the AUC performance of the different influence estimation algorithms on the social graph. We treat users with a “Verified sign” as the more influential ones in our evaluation because their identity has been verified by weibo.com and, according to the verification policy,5 only “authoritative” persons or organizations are verified. For the seed set of TrustRank and userTrustRank, we selected 100 people from the “Weibo hall of fame”,6 which is composed of famous people in fields such as entertainment, politics and technology.
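The AUC metric used here can be read as the probability that a randomly chosen influential (verified) user is ranked above a randomly chosen ordinary user. A minimal sketch of this rank-based computation (the scores and user sets are illustrative, not from the weibo.com data):

```python
# Rank-based AUC: the fraction of (positive, negative) pairs in which the
# positive example is scored higher, with ties counted as half a win.

def auc(scores, positives):
    """scores: dict user -> algorithm score; positives: set of users."""
    pos = [s for u, s in scores.items() if u in positives]
    neg = [s for u, s in scores.items() if u not in positives]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```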

According to the results shown in Fig. 8, the performance of PageRank, userPageRank and userTrustRank is similar, while TrustRank performs the worst among all algorithms. Although the AUC performance of PageRank is almost the same as that of userPageRank and userTrustRank, we find that these algorithms give quite different rankings. The top results in Table 12 show that both PageRank and TrustRank put famous entertainment stars (such as Xidi Xu, Chen Yao and Mi Yang) at the top of their result lists. Meanwhile, userPageRank and userTrustRank favor accounts that post interesting jokes or quotations (such as joke selection and classic quotations).

The differences in the top-ranked results are caused by the fact that although the entertainment stars have many followers, a large part of these followers do not share the same tags with the stars. This is because many of the stars, such as Xidi Xu and Chen Yao, do not list any tags on their accounts. People follow accounts such as joke selection and classic quotations because these accounts actually provide interesting information and influence people. Therefore, we believe that the userPageRank and userTrustRank algorithms give a more reasonable estimation of social influence.

5 http://weibo.com/verify.
6 http://weibo.com/pub/star.

Table 11. Information recorded in the collected micro-blogging data.

Information     Explanation
User ID         The unique identifier for each user
User name       The name of the user
Verified sign   Whether the user's identification is verified by weibo.com
Followees       The list of user IDs that are followed by the user
Followers       The list of user IDs that follow the user
Tags            A list of keywords describing the user's interests, with the purpose of self-introduction

Table 12. Top results of the PageRank, TrustRank, userPageRank and userTrustRank algorithms on the social graph of weibo.com.

Rank  PageRank             userPageRank         TrustRank     userTrustRank
1     Kangyong Cai         Joke selection       Kangyong Cai  Kangyong Cai
2     Xidi Xu              Kangyong Cai         Mi Yang       Joke selection
3     Cold joke selection  Classic quotations   Na Xie        Xiaoxian Zhang
4     Chen Yao             Cold joke selection  Weiqi Fan     Classic quotations
5     Xiaogang Feng        Global fashion       Lihong Wang   Cold joke selection

7. Conclusion and future work

Page quality estimation is one of the greatest challenges for search engines. Link analysis algorithms have made progress in this field but encounter increasing challenges in the real Web environment. In this paper, we analyzed user browsing behavior and proposed two hyperlink analysis algorithms based on the “surfing with prior knowledge” model instead of the random surfer model. We also constructed reliable link graphs in which this browsing behavior information is embedded. Three construction algorithms were adopted to construct three different kinds of link graphs: BG(V,E), user-HG(V,E) and user-CG(V,E). We examined the structure of these graphs and found that they inherit characteristics, such as power-law distributions of in-degrees and out-degrees, from the original Web graph. The evolution of these graphs was also studied, and they were found to be appropriate for page quality estimation by search engines.

The experimental results show that the graphs constructed with browsing behavior data are more effective than the original Web graph in estimating Web page quality. PageRank on BG(V,E), user-HG(V,E) and user-CG(V,E) outperforms PageRank on the whole Web graph. In addition, user-HG(V,E) and user-CG(V,E) work better than BG(V,E), probably because the construction process of BG(V,E) omits too many meaningful hyperlinks. We also found that PageRank, TrustRank and DiffusionRank perform as well as (or even better than) BrowseRank when they are run on the same graph (BG(V,E)). This result reveals that the incorporation of user browsing information is perhaps more important than the selection of the link analysis algorithm. Additionally, the construction of user browsing graphs introduces more information, which makes it possible to modify the original TrustRank/PageRank algorithms by estimating the importance of outgoing links. The modified algorithms (called userPageRank and userTrustRank) show better performance in both Web spam identification and social influence estimation.

Although the Web/micro-blogging collections and the data on user browsing behavior were collected in the Chinese Web environment, the algorithms are not specially designed for this specific collection. Therefore, they should not behave significantly differently on a multi-language collection as long as reliable data sources can be provided.

Fig. 8. Social influence estimation results with the original PageRank/TrustRank and userPageRank/userTrustRank algorithms on the social graph of weibo.com.

Several technical issues remain, which we address here as future work:

First, the Web pages that are visited by users comprise only a small fraction of the pages on the Web. Although it has been found that most pages that users need can be included in the vertex set of BG(V,E), search engines still need to keep many more pages in their index to meet all possible user needs. To estimate the quality of these pages, we plan to predict user preferences for a given page by using the pages that users previously visited as a training set. If we can calculate the probability that a Web page will be visited by users in the future, this information will help construct a large-scale, credible link graph that is not limited by data on user behavior.

Second, the evolution of the user browsing graph can be regarded as a combination of the evolution of both the Web and Web users' interests. In this paper, we analyzed the short-term evolution (a period of 60 days) of the graph. We are considering collecting long-term data to determine how the evolutionary process reflects changes in users' behavior and interests.

Acknowledgments

This work is supported by the Natural Science Foundation (60903107, 61073071) and the National High Technology Research and Development (863) Program of China (2011AA01A207). In the early stages of this work, we benefited enormously from discussions with Yijiang Jin. We thank Jianli Ni for kindly offering help in data collection and corpus construction. We also thank Tao Hong, Fei Ma, and Shouke Qin from Baidu.com and the anonymous referees of this paper for their valuable comments and suggestions.

References

[1] B. Amento, L. Terveen, W. Hill, Does authority mean quality? Predicting expert quality ratings of Web documents, In: Proc. of the 23rd ACM SIGIR Conference, 2000, pp. 296–303.

[2] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, R. Baeza-Yates, Using rank propagation and probabilistic counting for link-based spam detection, In: Proceedings of the Workshop on Web Mining and Web Usage Analysis, 2006.

[3] M. Bilenko, R.W. White, Mining the search trails of surfing crowds: identifying relevant Web sites from user activity, In: Proc. of the 17th WWW Conference, 2008, pp. 51–60.

[4] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems 30 (1998) 107–117.

[5] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener, Graph structure in the Web, Computer Networks 33 (2000) 309–320.

[6] M. Chau, H. Chen, A machine learning approach to web page filtering using content and structure analysis, Decision Support Systems 44 (2) (2008) 482–494.

[7] N. Craswell, D. Hawking, S. Robertson, Effective site finding using link anchor information, In: Proceedings of the 24th ACM SIGIR Conference, 2001, pp. 250–257.

[8] D. Donato, L. Laura, S. Leonardi, S. Millozzi, The Web as a graph: how far we are, ACM Transactions on Internet Technology 7 (1) (2007) 4.

[9] X. Fang, C.W. Holsapple, An empirical study of web site navigation structures' impacts on web site usability, Decision Support Systems 43 (2) (2007) 476–491.

[10] Z. Gyöngyi, H. Garcia-Molina, J. Pedersen, Combating Web spam with TrustRank, In: Proceedings of the Thirtieth International VLDB Conference, 2004, pp. 576–587.

[11] T. Haveliwala, Efficient computation of PageRank, Technical Report, Stanford University, 1999. http://dbpubs.stanford.edu/pub/1999-31.

[12] T. Haveliwala, Topic-sensitive PageRank: a context-sensitive ranking algorithm for Web search, IEEE Transactions on Knowledge and Data Engineering 15 (4) (2003) 784–796.

[13] T. Haveliwala, S. Kamvar, G. Jeh, An analytical comparison of approaches to personalizing PageRank, Stanford Technical Report, http://ilpubs.stanford.edu:8090/596/.

[14] M.R. Henzinger, R. Motwani, C. Silverstein, Challenges in web search engines, SIGIR Forum 36 (2) (2002) 11–22.

[15] J. Abernethy, O. Chapelle, C. Castillo, WITCH: a new approach to Web spam detection, Yahoo! Research Report No. YR-2008-001, 2008.

[16] Y. Kang, Y. Kim, Do visitors' interest level and perceived quantity of web page content matter in shaping the attitude toward a web site? Decision Support Systems 42 (2) (2006) 1187–1202.

[17] R. Kaul, Y. Yun, S. Kim, Ranking billions of web pages using diodes, Communications of the ACM 52 (8) (2009) 132–136.

[18] J.M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM 46 (5) (1999) 604–632.

[19] V. Krishnan, R. Raj, Web spam detection with anti-TrustRank, In: The 2nd International Workshop on Adversarial Information Retrieval on the Web, 2006, p. 3.

[20] R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, Core algorithms in the CLEVER system, ACM Transactions on Internet Technology 6 (2) (2006) 131–152.

[21] Y. Liu, M. Zhang, R. Cen, L. Ru, S. Ma, Data cleansing for Web information retrieval using query independent features, Journal of the American Society for Information Science and Technology 58 (12) (2007) 1884–1898.

[22] Y. Liu, B. Gao, T. Liu, Y. Zhang, Z. Ma, S. He, H. Li, BrowseRank: letting web users vote for page importance, In: Proc. of the 31st ACM SIGIR Conference, 2008, pp. 451–458.

[23] Y. Liu, F. Chen, W. Kong, H. Yu, M. Zhang, S. Ma, L. Ru, Identifying Web spam with the wisdom of the crowds, ACM Transactions on the Web 6 (1) (2012) Article No. 2, 30 pages.

[24] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: bringing order to the Web, Stanford Technical Report, 1999. http://ilpubs.stanford.edu:8090/422/.

[25] H. Wu, M. Gordon, K. DeMaagd, W. Fan, Mining web navigations for intelligence, Decision Support Systems 41 (3) (2006) 574–591.

[26] D. Xu, Y. Liu, M. Zhang, L. Ru, S. Ma, Predicting epidemic tendency through search behavior analysis, In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI-11), Barcelona, Spain, 2011, pp. 2361–2366.

[27] H. Yang, I. King, M.R. Lyu, DiffusionRank: a possible penicillin for Web spamming, In: Proc. of the 30th ACM SIGIR Conference, 2007, pp. 431–438.

[28] B. Zhou, Y. Liu, M. Zhang, Y. Jin, S. Ma, Incorporating Web browsing information into anchor texts for web search, Information Retrieval 14 (3) (2011) 290–314.

Yiqun Liu, Male, born in January 1981. He received his bachelor and Ph.D. degrees from the Dept. of Computer Science and Technology of Tsinghua University in July 2003 and July 2007, respectively. He is now working as an assistant professor at Tsinghua University and an undergraduate mentor for the C.S. & T. department. His research interests include Web information retrieval, Web user behavior analysis and performance evaluation of on-line services. Most of his recent works and publications can be found at his homepage (http://www.thuir.cn/group/~YQLiu/).

Yufei Xue, Male, born in September 1985. He received his bachelor degree from the Dept. of Computer Science and Technology of Tsinghua University in July 2007. He is now a Ph.D. candidate in the same department. His research interests include Web user behavior analysis and Web user intent analysis.

Danqing Xu, Female, born in June 1986. She received her bachelor degree from the Dept. of Computer Science and Technology of Tsinghua University in July 2010. She is now a master student in the same department. Her research interests include Web user behavior analysis and social networks.

Rongwei Cen, Male, born in April 1982. He received his bachelor and Ph.D. degrees from the Dept. of Computer Science and Technology of Tsinghua University in July 2005 and July 2010, respectively. He is now working as an engineer in the State Information Center of China. His research interests include Web information retrieval and Web user behavior analysis. Most of his recent works and publications can be found at his homepage (http://www.thuir.cn/group/~RWCen/).

Min Zhang, Female, born in December 1977. She received her bachelor and Ph.D. degrees from the Dept. of Computer Science and Technology of Tsinghua University in July 1999 and July 2003, respectively. She is now working as an associate professor at Tsinghua University. Her research interests include information retrieval, machine learning and natural language processing. Most of her recent works and publications can be found at her homepage (http://www.thuir.cn/group/~mzhang/).

Shaoping Ma, Male, born in February 1961. He received his bachelor and Ph.D. degrees from the Dept. of Computer Science and Technology of Tsinghua University in July 1982 and July 1997, respectively. He is now working as a professor at Tsinghua University. His research interests include knowledge engineering, Web information retrieval, and natural language processing. Most of his recent works and publications can be found at his homepage (http://www.thuir.cn/cms/index.php?page=msp).

Liyun Ru, Male, born in December 1979. He received his bachelor degree from the Dept. of Computer Science and Technology of Tsinghua University in July 2002. He is now a Ph.D. candidate in the same department. His research interests include Web search engines, user behavior analysis and Web user intent analysis.