
The Improved Pagerank in Web Crawler*

Ling Zhang
Department of Information Science and Engineering, Hunan First Normal University, Changsha, China
[email protected]

Zheng Qin
College of Computer and Communication, Hunan University, Changsha, China
[email protected]

* This paper is supported by the National Natural Science Foundation of China under Grant No. 60273070 and the Sci. & Tech. Project of Hunan under Grant No. 04GK3022.

Abstract—Pagerank is an algorithm for rating web pages. It introduces the citation relationship of academic papers to evaluate a web page's authority. Because it gives the same weight to all edges and ignores the relevancy of web pages to the topic, it suffers from the problem of topic-drift. On the analysis of several pagerank algorithms, an improved pagerank based upon thematic segments is proposed. In this algorithm, a web page is divided into several blocks according to the HTML document's structure, and the most weight is given to linkages in the block that is most relevant to the given topic. Moreover, the visited outlinks are regarded as feedback to modify the blocks' relevancy. The experiment on a web crawler shows that the new algorithm has some effect on resolving the problem of topic-drift.

Keywords—Pagerank; web crawler; topic-drift; relevancy

 Introduction

With the rapid growth of web resources on the Internet, people not only hope that search engines can provide more and more appropriate information, but also require centralized queries on a given topic. Because the searching range of a topic-specific search engine is limited to its professional area, and the searching object is only a small portion of web resources, the traditional breadth-first or depth-first searching strategy is no longer suitable. In order to reach more target pages while visiting few irrelevant ones, the web crawler usually takes a heuristic searching strategy that ranks URLs by their importance and preferentially visits the more important web pages [1]. How to decide a URL's importance has therefore become a hotspot in recent research.

The existing ranking algorithms mainly estimate a URL's importance by the web pages' relevancy to the topic or by their authorities [2, 3]. There are several common algorithms for evaluating authorities, such as pagerank, Kleinberg's HITS and SALSA [4]. These algorithms can evaluate a web page's authority exactly, but they hardly consider topical information, resulting in the problem of topic-drift [5, 6]: although a web page with a high authority score certainly has high universal authority, it does not always have high authority on the given topic. In order to resolve this problem, Bharat and Henzinger implemented a heuristic strategy that assigns different weights to outlinks [6]. Taher Haveliwala proposed the topic-sensitive pagerank [7], which uses different topic vectors for different topics and regards the similarity of these topic vectors to the given topic as weights; the weighted sum of the web pages' relevancies to each topic is then the pagerank score. Matthew Richardson combines linkage information and content information to improve the traditional pagerank [5]: when a web page's pagerank score is passed to its outlinks, not only the citation relationship but also the web pages' topical relevancy is taken into account. The double focused pagerank [8] proposed by Diligenti divides the probability of visiting a linkage into two parts: one is the probability of jumping to a certain web page or following outlinks in that page, which is proportional to the page's relevancy to the topic; the other is the probability of following a certain outlink, which is proportional to the linkage's relevancy to the topic. All of the above algorithms introduce topical information into pagerank to resolve the problem of topic-drift, but they differentiate linkages simply by assigning more pagerank score to the outlinks that are more relevant to the topic, without fully making use of the content structure of HTML documents.

On the analysis of the above pagerank algorithms and the content structure of HTML documents, we propose a pagerank based on thematic segments. It divides a web page into several blocks based upon its content structure, and then assigns the page's pagerank score to each block according to the blocks' relevancy to the topic. Finally, each block's pagerank score is further assigned to the outlinks in it according to the linkages' relevancy. Moreover, the visited linkages provide feedback to modify the blocks' relevancy. We applied this pagerank algorithm to experiments in a web crawler, and the results showed that, compared with the above-mentioned pagerank algorithms, the new pagerank improves the searching precision.

I.  PAGERANK  

Sergey Brin and Larry Page proposed the pagerank algorithm for scoring web pages. Each web page is given an authority score that evaluates its importance. In the beginning, pagerank was only used to rank the results of information retrieval, but it has since been applied in many fields, such as web crawling, clustering web pages and searching for relevant web pages.

Imagine a web surfer who jumps from web page to web page, choosing with uniform probability which link to follow at each step. The surfer will also occasionally jump to a random page with some small probability. We consider the web as a directed graph. Let F_i be the set of pages which page i links to, and B_i be the set of pages which link to page i. Averaged over a sufficient number of steps, the probability that the surfer is on page j at some point in time is given by formula (1):

\[ P(j) = \beta + (1-\beta)\sum_{i \in B_j} \frac{P(i)}{|F_i|} \tag{1} \]

Here 0 < β < 1, and the usual value is 0.15. The pagerank score reflects the citation relationship of web pages: if a web page is cited by many important pages, it is also an important page. Although the pagerank score can reflect the authority of web pages properly, it ignores the web pages' relevancy to the topic, and the authority score is independent of topics. A web page has only one pagerank score, but some web pages (especially door-way pages) include information about many different topics. For example, a web page that has high authority on the topic "art" may not have high authority on the topic "sports". Moreover, linkages often serve only for navigation or advertisement, so web pages linked to each other do not always have the same topical relevancy. Therefore, it is not appropriate to assign the same pagerank score to all outlinks in a web page.
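For concreteness, the iteration behind formula (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the toy graph, function names and iteration count are our own choices.

def pagerank(links, beta=0.15, iterations=50):
    """links maps each page to the list of pages it links to (F_i)."""
    pages = set(links) | {j for outs in links.values() for j in outs}
    score = {p: 1.0 / len(pages) for p in pages}   # uniform starting scores
    for _ in range(iterations):
        new = {p: beta for p in pages}             # random-jump term, beta
        for i, outs in links.items():
            if outs:                               # spread (1-beta)*P(i) over F_i
                share = (1 - beta) * score[i] / len(outs)
                for j in outs:
                    new[j] += share
        score = new
    return score

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}    # illustrative graph
print(pagerank(toy))

Note that pages without outlinks simply leak their score in this sketch; production implementations handle such dangling pages separately.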

II.  THE PAGERANK COMBINED WITH CONTENTS 

To combine authority with topical relevancy, Matthew Richardson improved the traditional pagerank algorithm. When the pagerank score is passed between web pages, the relevancy to the topic is considered, and the out-page more relevant to the topic is assigned more pagerank score [5]. He proposed an intelligent web surfer, who probabilistically hops from page to page depending on the content of the pages and the topic terms the surfer is looking for. The resulting probability distribution over pages is given by formula (2):

\[ P_q(j) = \beta P_q'(j) + (1-\beta)\sum_{i \in B_j} P_q(i)\, P_q(i \rightarrow j) \tag{2} \]

P(i) can be computed through formula (1). P_q(i→j) is the probability that the surfer transitions to out-page j, given that he is on page i and searching for the topic q, and P_q'(j) specifies where the surfer chooses to jump when not following outlinks. Both probabilities relate to the web pages' relevancy to topic q. Given that W is the set of all web pages, R_q(k) is a measure of the relevancy of page k to topic q. P_q(i→j) and P_q'(j) are defined in formulas (3) and (4):

\[ P_q(i \rightarrow j) = \frac{R_q(j)}{\sum_{k \in F_i} R_q(k)} \tag{3} \]

\[ P_q'(j) = \frac{R_q(j)}{\sum_{k \in W} R_q(k)} \tag{4} \]

As seen from the above formulas, the pagerank combined with content assigns pagerank score according to the web pages' and outlinks' relevancy to the topic. So, among the outlinks on the same web page, the one more relevant to the topic gets more of the parent web page's pagerank score.
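As a rough illustration, the following Python sketch implements formulas (2)-(4). The relevancy values R_q below are illustrative placeholders; a real system would compute them from page content for the query q.

def topic_pagerank(links, relevancy, beta=0.15, iterations=50):
    """links: page -> outlinks F_i; relevancy: page -> R_q(page) >= 0."""
    pages = set(links) | {j for outs in links.values() for j in outs}
    total_r = sum(relevancy[p] for p in pages)
    jump = {p: relevancy[p] / total_r for p in pages}  # P'_q(j), formula (4)
    score = dict(jump)
    for _ in range(iterations):
        new = {p: beta * jump[p] for p in pages}
        for i, outs in links.items():
            denom = sum(relevancy[k] for k in outs)    # sum over F_i
            if denom > 0:
                for j in outs:                          # P_q(i->j), formula (3)
                    new[j] += (1 - beta) * score[i] * relevancy[j] / denom
        score = new
    return score

toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_rel = {"a": 0.1, "b": 0.7, "c": 0.2}               # assumed R_q values
print(topic_pagerank(toy_links, toy_rel))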

III.  THE PAGERANK BASED ON TOPICAL BLOCKS 

By considering topical relevancy, the pagerank combined with content has some effect on the problem of topic-drift in information retrieval. However, when the pagerank algorithm is applied in web crawling, the crawler cannot see the contents of unvisited pages; it can only estimate an unvisited web page's relevancy from the visited pages and the information in hyperlinks, and hyperlinks are usually incapable of providing enough information. On the analysis of the content structure of web pages, an improved pagerank algorithm based upon thematic segments is proposed, in order to make the web crawler able to estimate links' importance more accurately.

The general web information processor usually treats the web page as a unit. In fact, this treatment is too coarse [9, 10]. When an author designs a web page, he does not pile up various information pell-mell, but organizes information by a certain layout and structure.

As seen from Figure 1, according to a web page's layout and structure, it can be divided into several information blocks, each consisting of many single information items. These information blocks can be classified into four types: the text block, the relevant hyperlinks block, the navigation and advertisement block, and the block relevant to other topics. If a web page refers to multiple topics rather than a single one, then, considering that information about one topic is often placed together, we need to further classify these information blocks by topic.

Figure 1. An example of an HTML document divided into information blocks (advertisement, relevant hyperlinks, navigation, other topics).
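The paper segments pages by the HTML document's layout structure but does not prescribe a concrete procedure. A minimal sketch follows, assuming that top-level <div> and <table> elements delimit blocks and using the BeautifulSoup parser; both the boundary rule and the library are our choices, not the authors'.

# Minimal block-segmentation sketch. Treating top-level <div> and <table>
# elements as information blocks is a simplifying assumption; the paper
# only says blocks follow the HTML document's structure.

from bs4 import BeautifulSoup

def extract_blocks(html):
    """Return one (text, outlinks) pair per candidate information block."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup.body or soup
    blocks = []
    for node in root.find_all(["div", "table"], recursive=False):
        text = node.get_text(" ", strip=True)
        links = [a.get("href") for a in node.find_all("a") if a.get("href")]
        if text or links:
            blocks.append((text, links))
    return blocks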

Among these information blocks, some include linkages relevant to the given topic, some include linkages only for navigation or advertisement, and some include linkages to other topics. So we need to further classify these blocks according to their relevance to the given topic, and then assign the web page's pagerank score to each linkage block in proportion: the more relevant to the topic, the more pagerank score the linkage block gets. The pagerank score that each linkage block receives is then assigned to each linkage in the block according to the linkages' relevance. Moreover, a visited outlink can be regarded as feedback to modify its block's relevance: if the outlink is relevant to the topic, the relevance of the block in which the outlink lies should be accordingly augmented; otherwise, the relevance should be diminished. The web crawler chooses the linkage which points to web page j with the probability given by formula (5):

\[ P_q(j) = \beta P_q'(j) + (1-\beta)\sum_{i \in B_j} P_q(i)\, S_q(m)\, P_q(i \rightarrow j) \tag{5} \]

Let S be the set of all information blocks in web page i. Linkage l_j points to web page j and lies in information block m. S_q(m) is the topical relevance of m compared with the other blocks, L(m) is the set of all linkages in block m, and W is the set of linkages in the URL candidate frontier. P_q'(j), S_q(m) and P_q(i→j) are defined as below:

\[ P_q'(j) = \frac{\frac{1}{|B_j|}\sum_{i \in B_j} R_q(l_j)}{\sum_{k \in W} R_q(k)} \tag{6} \]

\[ S_q(m) = \frac{R_q(m)}{\sum_{k \in S} R_q(k)} \tag{7} \]

\[ P_q(i \rightarrow j) = \frac{R_q(l_j)}{\sum_{k \in L(m)} R_q(k)} \tag{8} \]

Each time, the web crawler chooses the most important linkage in the URL frontier to visit. After the web page pointed to by this linkage is visited, the pagerank scores of the web pages and linkages connected with it are updated immediately.
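A minimal sketch of the per-block score assignment (formulas (7) and (8)) and the feedback step might look as follows. The data layout and the 10% feedback step are our assumptions; the paper does not give the update magnitude.

# Sketch of formulas (7)-(8): a page's score flows to blocks by S_q(m),
# then to each linkage by its relevance, plus the feedback adjustment.
# Dictionary layout and the feedback step size are illustrative choices.

def assign_link_scores(page_score, blocks, relevancy):
    """blocks: list of (block_id, links); relevancy: block_id/link -> R_q."""
    block_total = sum(relevancy[b] for b, _ in blocks)
    scores = {}
    for b, links in blocks:
        s_m = relevancy[b] / block_total                 # S_q(m), formula (7)
        link_total = sum(relevancy[l] for l in links)
        if link_total == 0:
            continue
        for l in links:                                  # formula (8)
            scores[l] = page_score * s_m * relevancy[l] / link_total
    return scores

def feedback(relevancy, block_id, outlink_was_relevant, step=0.1):
    """Augment or diminish a block's relevance after visiting one of its
    outlinks; the 10% multiplicative step is an assumed magnitude."""
    factor = 1 + step if outlink_was_relevant else 1 - step
    relevancy[block_id] = max(relevancy[block_id] * factor, 1e-9)

The crawler would call assign_link_scores whenever a page is fetched, push the scored linkages into the frontier, and call feedback once an outlink's page has been fetched and judged for relevance.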

IV.  EXPERIMENTS 

In order to compare with other pagerank algorithms, we implemented web crawlers based on three different pagerank algorithms: the traditional pagerank, the pagerank combined with content, and the pagerank based on segments. The targets of the web crawlers are computer-related web pages. To compute the relevance to the topic, we chose and enriched the FOLDOC online computer dictionary as the set of computer keywords. The semantic similarity of web pages, or of the anchor texts around linkages, to the computer dictionary is regarded as their relevance to the topic. Here we choose the vector space model to express web pages, and compute the semantic similarity by the cosine formula (9) [12]:

\[ s'(q,p) = \frac{\sum_{k \in q \cap p} f_{kq} f_{kp}}{\sqrt{\sum_{k \in q} f_{kq}^2}\,\sqrt{\sum_{k \in p} f_{kp}^2}} \tag{9} \]

In the above formula, q is the set of topic keyword terms, p is the set of word terms in anchor texts, and f_kd is the frequency of term k appearing in d.
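The cosine computation in formula (9) is straightforward; a small Python sketch follows, where the tokenization and the example terms are illustrative only.

# Cosine similarity of formula (9) between topic terms q and page/anchor
# terms p, using raw term frequencies f_kd.

import math
from collections import Counter

def cosine_similarity(q_terms, p_terms):
    """q_terms, p_terms: lists of word tokens."""
    fq, fp = Counter(q_terms), Counter(p_terms)
    num = sum(fq[k] * fp[k] for k in set(fq) & set(fp))
    den = math.sqrt(sum(f * f for f in fq.values())) * \
          math.sqrt(sum(f * f for f in fp.values()))
    return num / den if den else 0.0

topic = "computer software hardware network".split()
anchor = "free software download for computer".split()
print(round(cosine_similarity(topic, anchor), 3))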

We choose the searching precision to evaluate these algorithms' performance [13]:

\[ \text{Searching precision} = \frac{|\text{the crawled relevant pages}|}{|\text{the crawled pages}|} \tag{10} \]

"archive.ncsa.uiuc.edu" and "www.dmoz.org/science" were selected as the initial URL seeds. Each web crawler starts from these two URLs to search for web pages on the topic "computer". The results are shown in Figure 2.

Figure 2. Comparison of searching precision with the "archive.ncsa.uiuc.edu" and "www.dmoz.org/science" URL seeds, respectively.

As seen from the first graph of Figure 2, because the traditional pagerank ignores the web pages' text information, its performance is the worst, and its searching precision always stays under 0.3. In the initial and middle stages of the search, the precision of the pagerank combined with content is close to that of the pagerank based on segments, but it drops rather quickly at the end. So, over the whole crawl, the pagerank based on segments exceeds the other two pageranks in performance. From the second graph of Figure 2 we can draw the same conclusion, except that for a short stage the precision of the pagerank combined with content is a bit higher than that of the pagerank based on segments. By analyzing the crawled web pages, we found a website containing many abnormally designed web pages, which influences the performance of the pagerank that uses the content structure to segment. However, as more web pages are crawled, more and more information becomes available for segmentation, so the searching efficiency of the pagerank based on segments exceeds the other two more and more obviously.

REFERENCES

[1] Junghoo Cho, Hector Garcia-Molina, Lawrence Page, "Efficient crawling through URL ordering," in Proceedings of the 7th International World Wide Web Conference, 1998.
[2] P. Chirita, D. Olmedilla, W. Nejdl, "Finding related pages using the link structure of the WWW," in Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, 2004.
[3] P. Ingongngam, A. Rungsawang, "Topic-centric algorithm: a novel approach to Web link analysis," in Advanced Information Networking and Applications, vol. 2, 2004, pp. 299-301.
[4] B. L. Narayan, C. A. Murthy, Sankar K. Pal, "Topic continuity for Web document categorization and ranking," in IEEE/WIC International Conference on Web Intelligence, 2003, pp. 310-315.
[5] Matthew Richardson, Pedro Domingos, "The intelligent surfer: probabilistic combination of link and content information in PageRank," in Advances in Neural Information Processing Systems, vol. 14, 2002, pp. 673-680.
[6] K. Bharat, M. R. Henzinger, "Improved algorithms for topic distillation in a hyperlinked environment," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.
[7] Taher H. Haveliwala, "Topic-sensitive PageRank: a context-sensitive ranking algorithm for Web search," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, 2003, pp. 784-796.
[8] Michelangelo Diligenti, Marco Gori, Marco Maggini, "Web page scoring systems for horizontal and vertical search," in Proceedings of the 11th International World Wide Web Conference, 2002.
[9] Michael Brinkmeier, "PageRank revisited," ACM Transactions on Internet Technology, vol. 6, no. 3, 2006, pp. 282-301.
[10] L. Wood, "Programming the Web: the W3C DOM specification," IEEE Internet Computing, vol. 3, no. 1, 1999, pp. 48-54.
[11] Soumen Chakrabarti, Mukul Joshi, Vivek Tawde, "Enhanced topic distillation using text, markup tags, and hyperlinks," in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
[12] P. Srinivasan, G. Pant, F. Menczer, "Target seeking crawlers and their topical performance," in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.
[13] M. Diligenti, F. M. Coetzee, S. Lawrence, et al., "Focused crawling using context graphs," in Proceedings of the 26th International Conference on Very Large Databases, 2000.