The Improved Pagerank in Web Crawler

Ling Zhang
Department of Information Science and Engineering, Hunan First Normal University, Changsha, China

Zheng Qin
College of Computer and Communication, Hunan University, Changsha, China
Abstract — Pagerank is an algorithm for rating web pages. It introduces the citation relationship of academic papers to evaluate a web page's authority. However, it gives the same weight to all edges and ignores the relevancy of web pages to the topic, resulting in the problem of topic-drift. Based on an analysis of several pagerank algorithms, an improved pagerank based upon thematic segments is proposed. In this algorithm, a web page is divided into several blocks according to the structure of its HTML document, and the greatest weight is given to linkages in the block most relevant to the given topic. Moreover, visited outlinks are regarded as feedback to modify the blocks' relevancy. Experiments with a web crawler show that the new algorithm helps to resolve the problem of topic-drift.
Keywords — pagerank; web crawler; topic-drift; relevancy
Introduction
With the rapid growth of web resources on the Internet, people not only hope that search engines can provide more appropriate information, but also require centralized queries on a given topic. Because a topic-specific search engine limits its searching range to a professional area and its searching object is only a small portion of web resources, the traditional breadth-first or depth-first searching strategies are no longer suitable. In order to obtain more relevant pages while visiting few irrelevant ones, the web crawler usually adopts a heuristic searching strategy that ranks URLs by their importance and preferentially visits the more important web pages [1]. How to decide a URL's importance has therefore become a hotspot in recent research.
The existing ranking algorithms mainly estimate a URL's importance by the web page's relevancy to the topic or by its authority [2, 3]. There are several common algorithms for evaluating authority, such as PageRank, Kleinberg's HITS, and SALSA [4]. These algorithms can evaluate a web page's authority accurately, but they hardly consider topical information, resulting in the problem of topic-drift [5, 6]: although a web page with a high authority score certainly has high universal authority, it does not always have high authority on the given topic. In order to resolve this problem, Bharat and Henzinger implemented a new heuristic strategy that assigns different weights to outlinks. Taher Haveliwala proposed topic-sensitive pagerank [7], which uses different topic vectors for different topics and regards the similarity of these topic vectors to the given topic as weights; the weighted sum of the web pages' relevancies to each topic is then the pagerank score. Matthew Richardson combined linkage information and content information to improve the traditional pagerank [5]: when a web page's pagerank score is passed to its outlinks, not only the citation relationship but also the web pages' topical relevancy is taken into account. The double focused pagerank [8] proposed by Diligenti divides the probability of visiting a linkage into two parts: one is the probability of jumping to a certain web page or following outlinks in that page, which is proportional to the page's relevancy to the topic; the other is the probability of following a certain outlink, which is proportional to the linkage's relevancy to the topic. All of the above algorithms introduce topical information into pagerank to resolve the problem of topic-drift. However, they differentiate linkages simply by assigning more pagerank score to the outlinks that are more relevant to the topic, without fully making use of the content structure of HTML documents.

Based on an analysis of the above pagerank algorithms and the content structure of HTML documents, we propose a pagerank based on thematic segments. It divides a web page into several blocks according to its content structure, and then assigns the page's pagerank score to each block according to the blocks' relevancy to the topic. The block's pagerank score is further assigned to each outlink in it according to the linkages' relevancy. Moreover, visited linkages provide feedback to modify the blocks' relevancy. We applied this pagerank algorithm to experiments in a web crawler, and the results showed that, compared with the above-mentioned pagerank algorithms, the new pagerank improves the searching precision.
I. PAGERANK
Sergey Brin and Larry Page proposed the pagerank algorithm for scoring web pages. Each web page is given an authority score evaluating its importance. In the beginning, pagerank was only used for ranking results in information retrieval, but it has now been applied in many fields, such as web crawling, clustering web pages and searching for relevant web pages.

Imagine a web surfer who jumps from web page to web page, choosing with uniform probability which link to follow at each step. The surfer will occasionally jump to a random page with some small probability. We consider the web as a directed graph. Let F_i be the set of pages that page i links to, and B_i be the set of pages that link to page i. Averaged over a sufficient number of steps, the probability that the surfer is on page j at some point in time is given by formula (1):
* This paper is supported by the National Natural Science Foundation of China under Grant No. 60273070 and the Sci. & Tech. Project of Hunan under Grant No. 04GK3022.

The 1st International Conference on Information Science and Engineering (ICISE 2009)
978-0-7695-3887-7/09/$26.00 ©2009 IEEE
P(j) = β + (1 - β) · Σ_{i ∈ B_j} P(i) / |F_i|    (1)
where 0 < β < 1, with the usual value being 0.15. The pagerank score reflects the citation relationship of web pages: if a web page is cited by many important pages, it is also an important page. Although the pagerank score can reflect the authority of web pages properly, it ignores the web pages' relevancy to the topic, and the authority score is independent of topics. A web page has only one pagerank score, but some web pages (especially doorway pages) include information about many different topics. For example, a web page that has high authority on the topic "art" may not also have high authority on the topic "sports". Moreover, many linkages serve only for navigation or advertisement, so web pages linked to each other do not always have the same topical relevancy. Therefore, it is not appropriate to assign the same pagerank score to all outlinks in a web page.
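As an illustration, formula (1) can be computed by simple iteration over the link graph. The sketch below is not from the paper; the toy graph, damping value and iteration count are assumptions:

```python
def pagerank(links, beta=0.15, iters=50):
    """Iterate formula (1): P(j) = beta + (1 - beta) * sum over i in B_j of P(i)/|F_i|.

    `links` maps each page to the list of pages it links to (its F_i).
    """
    pages = set(links) | {j for outs in links.values() for j in outs}
    p = {page: 1.0 for page in pages}          # initial scores
    for _ in range(iters):
        nxt = {page: beta for page in pages}   # random-jump term beta
        for i, outs in links.items():
            if outs:                           # spread P(i) evenly over F_i
                share = (1 - beta) * p[i] / len(outs)
                for j in outs:
                    nxt[j] += share
        p = nxt
    return p

# hypothetical toy graph: a and b cite each other, both cite c
scores = pagerank({"a": ["b", "c"], "b": ["a", "c"], "c": []})
```

Here page c, cited by both a and b, accumulates the highest score, matching the intuition that a page cited by many pages is important.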
II. THE PAGERANK COMBINED WITH CONTENTS
To combine authority with topical relevancy, Matthew Richardson improved the traditional pagerank algorithm. When the pagerank score is passed between web pages, the relevancy to the topic is considered, and an out-page more relevant to the topic is assigned more pagerank score [4]. He proposed an intelligent web surfer, who probabilistically hops from page to page depending on the content of the pages and the topic terms the surfer is looking for. The resulting probability distribution over pages is given by formula (2):
P_q(j) = β · P'_q(j) + (1 - β) · Σ_{i ∈ B_j} P_q(i) · P_q(i → j)    (2)
P(i) can be computed through formula (1). P_q(i → j) is the probability that the surfer transitions to out-page j, given that he is on page i and searching for the topic q. P'_q(j) specifies where the surfer chooses to jump when not following outlinks. Both probabilities relate to the web pages' relevancy to topic q. Given that W is the set of all web pages, R_q(k) is a measure of the relevancy of page k to topic q. P_q(i → j) and P'_q(j) are defined in formulas (3) and (4):
P_q(i → j) = R_q(j) / Σ_{k ∈ F_i} R_q(k)    (3)
P'_q(j) = R_q(j) / Σ_{k ∈ W} R_q(k)    (4)
As seen from the above formulas, the pagerank combined with content assigns pagerank score according to the web pages' and outlinks' relevancy to the topic. So, among the outlinks on the same web page, those more relevant to the topic get more of the parent web page's pagerank score.
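The normalizations in formulas (3) and (4) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the relevance function and the toy scores are assumptions:

```python
def transition_probs(out_pages, rel):
    """Formula (3): normalize the out-pages' topical relevance R_q
    into transition probabilities P_q(i -> j)."""
    total = sum(rel(k) for k in out_pages)
    return {j: rel(j) / total for j in out_pages}

def jump_probs(all_pages, rel):
    """Formula (4): normalize relevance over all pages W into
    random-jump probabilities P'_q(j)."""
    total = sum(rel(k) for k in all_pages)
    return {j: rel(j) / total for j in all_pages}

# hypothetical relevance scores R_q(k) for some topic q
relevance = {"a": 0.9, "b": 0.3, "c": 0.6}
probs = transition_probs(["a", "b", "c"], relevance.get)
```

Each set of probabilities sums to one, so the most relevant out-page receives the largest share of the parent's score.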
III. THE PAGERANK BASED ON TOPICAL BLOCKS
By considering topical relevancy, the pagerank combined with content mitigates the problem of topic-drift in information retrieval. However, when the pagerank algorithm is applied in web crawling, the crawler cannot see the contents of unvisited pages; it can only estimate an unvisited web page's relevancy from the visited pages and the information in hyperlinks. But hyperlinks are usually incapable of providing enough information. Based on an analysis of the content structure of web pages, an improved pagerank algorithm based upon thematic segments is proposed, so that the web crawler can estimate the links' importance more accurately.
The general web information processor usually treats the web page as a unit. In fact, this treatment is too coarse [9, 10]. When an author designs a web page, he does not pile up various pieces of information pell-mell, but organizes the information with a certain layout and structure. As seen from Figure 1, a web page can be divided into several information blocks according to its layout and structure, each of which includes many single information items. Moreover, these information blocks can be classified into four types: the text block, the relevant hyperlinks block, the navigation and advertisement block, and the block relevant to other topics. If a web page includes information not about a single topic but about multiple topics, then, since information about one topic is often placed together, we need to classify these information blocks further by topic.
Figure 1. An example of an HTML document
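One simple way to divide a page into blocks is to cut at top-level container elements of the HTML document. The paper does not specify its segmentation procedure, so the following is only a rough sketch of the idea, using Python's standard HTML parser and treating each top-level `<div>` or `<table>` as one block:

```python
from html.parser import HTMLParser

class BlockSplitter(HTMLParser):
    """Split an HTML document into coarse information blocks by
    cutting at top-level <div> and <table> elements (a simplification
    of layout-based segmentation; real pages need finer rules)."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # text content of each block
        self.depth = 0     # nesting depth of container tags

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "table"):
            if self.depth == 0:
                self.blocks.append("")     # a new top-level block begins
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("div", "table"):
            self.depth -= 1

    def handle_data(self, data):
        if self.blocks and data.strip():
            self.blocks[-1] += data.strip() + " "

splitter = BlockSplitter()
splitter.feed("<div>computer science news</div><div>sports links</div>")
```

Each extracted block's text can then be scored against the topic dictionary to classify it as a text, hyperlink, navigation or other-topic block.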
Among these information blocks, some include linkages relevant to the given topic, some include linkages only for navigation or advertisement, and some include linkages for other topics. So we need to classify these blocks further according to their relevance to the given topic, and then assign the web page's pagerank score to each linkage block in proportion: the more relevant a linkage block is to the topic, the more pagerank score it gets. Then the pagerank score that each linkage block gets is assigned to each linkage in the block according to the linkages' relevance. Moreover, a visited outlink can be regarded as feedback to modify the block's relevance: if the outlink is relevant to the topic, the relevance of the block in which the outlink lies should be augmented accordingly; otherwise, the relevance should be diminished. The web crawler chooses the linkage which points to web page j with the probability given by formula (5):
P_q(j) = β · P'_q(j) + (1 - β) · Σ_{i ∈ B_j} P_q(i) · S_q(m) · P_q(i → j)    (5)
Here S is the set of all information blocks in web page i. Linkage l_j points to web page j and lies in information block m. S_q(m) is the topical relevance of m compared with the other blocks. L(m) is the set of all linkages in block m, and W is the set of linkages in the URL candidate frontier. P'_q(j), S_q(m) and P_q(i → j) are defined as below:
P'_q(j) = ( Σ_{i ∈ B_j} R_q(l_j) / |B_j| ) / Σ_{k ∈ W} R_q(k)    (6)
S_q(m) = R_q(m) / Σ_{k ∈ S} R_q(k)    (7)
P_q(i → j) = R_q(l_j) / Σ_{k ∈ L(m)} R_q(k)    (8)
Each time, the web crawler chooses the most important linkage in the URL frontier to visit. After the web page pointed to by this linkage has been visited, the pagerank scores of the web pages and linkages connected with it are updated immediately.
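The block-weighted assignment of formulas (7) and (8), together with the feedback step, can be sketched as follows. This is a minimal illustration under assumed data; the block relevance values and the feedback increment are not from the paper:

```python
def assign_scores(page_score, blocks):
    """Split a page's pagerank score over blocks by S_q(m) (formula (7)),
    then over each block's links by their relevance R_q(l) (formula (8)).

    `blocks` maps a block id to (block_relevance, {link: link_relevance}).
    """
    total_rel = sum(rel for rel, _ in blocks.values())
    scores = {}
    for rel, links in blocks.values():
        block_score = page_score * rel / total_rel              # formula (7)
        link_total = sum(links.values())
        for link, link_rel in links.items():
            scores[link] = block_score * link_rel / link_total  # formula (8)
    return scores

def feedback(blocks, block_id, relevant, step=0.1):
    """Raise or lower a block's relevance after one of its outlinks
    has been visited (the paper's feedback idea; `step` is assumed)."""
    rel, links = blocks[block_id]
    blocks[block_id] = (max(rel + (step if relevant else -step), 0.0), links)

# hypothetical page: a relevant text block and a navigation block
blocks = {"text": (0.8, {"u1": 0.9, "u2": 0.1}),
          "nav":  (0.2, {"u3": 0.5})}
scores = assign_scores(1.0, blocks)
```

The page's full score is preserved across the split, and a link in a highly relevant block (u1) receives far more of it than a navigation link (u3).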
IV. EXPERIMENTS
In order to compare with the other pagerank algorithms, we implemented web crawlers based on three different pagerank algorithms: the traditional pagerank, the pagerank combined with content, and the pagerank based on segments. The targets of the web crawlers are computer-related web pages. To compute the relevance to the topic, we chose and enriched the FOLDOC online computer dictionary as the set of computer keywords. The semantic similarity of web pages, or of the anchor texts around linkages, to the computer dictionary is regarded as their relevance to the topic. Here, we choose the vector space model to express web pages and compute the semantic similarity by the cosine formula (9) [12]:
s'(q, p) = ( Σ_{k ∈ q ∩ p} f_kq · f_kp ) / sqrt( Σ_{k ∈ q} f_kq² · Σ_{k ∈ p} f_kp² )    (9)
In the above formula, q is the set of topic keyword terms, p is the set of word terms in the anchor texts, and f_kd is the frequency of term k appearing in d. We choose the searching precision to evaluate the algorithms' performance [13]:

Searching precision = |the crawled relevant pages| / |the crawled pages|    (10)

"archive.ncsa.uiuc.edu" and "www.dmoz.org/science" are selected as the initial URL seeds. Each web crawler starts from these two URLs to search for web pages on the topic "computer". The results are shown in Figure 2.
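The cosine measure of formula (9) can be sketched as follows (a minimal illustration; the toy term-frequency vectors are assumptions):

```python
from math import sqrt

def cosine_similarity(q, p):
    """Formula (9): cosine similarity between two term-frequency
    dictionaries q and p, over the terms they share."""
    shared = set(q) & set(p)
    num = sum(q[k] * p[k] for k in shared)
    den = sqrt(sum(f * f for f in q.values()) * sum(f * f for f in p.values()))
    return num / den if den else 0.0

# hypothetical term frequencies for the topic and an anchor text
topic = {"computer": 3, "algorithm": 2}
anchor = {"computer": 1, "news": 1}
sim = cosine_similarity(topic, anchor)
```

The score lies between 0 and 1 and grows with the overlap between the anchor text and the topic dictionary, which is exactly the relevance signal the crawler feeds into formulas (3) and (8).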
Figure 2. The comparison of searching precision with the "archive.ncsa.uiuc.edu" and "www.dmoz.org/science" URL seeds, respectively
As seen from the first graph of Figure 2, because the traditional pagerank ignores the web pages' text information, its performance is the worst and its searching precision stays below 0.3. In the initial and middle searching stages, the precision of the pagerank combined with content is close to that of the pagerank based on segments, but it drops rather quickly at the end. So, over the whole crawl, the pagerank based on segments exceeds the other two in performance. From the second graph of Figure 2, we can draw the same conclusion, except that for a short stage the precision of the pagerank combined with content is a bit higher than that of the pagerank based on segments. From an analysis of the crawled pages, we found a website whose pages are designed irregularly, which hurts the performance of the pagerank that uses the content structure for segmentation. However, as the number of crawled web pages increases, more and more information becomes available for segmentation, so the searching precision of the pagerank based on segments rises ever more clearly above the other two.
REFERENCES

[1] Junghoo Cho, Hector Garcia-Molina, Lawrence Page, Efficient crawling through URL ordering, In Proceedings of the 7th International World Wide Web Conference, 1998.
[2] P. Chirita, D. Olmedilla, W. Nejdl, Finding Related Pages Using the Link Structure of the WWW, In Proceedings of the IEEE/WIC/ACM International Conference, 2004.
[3] P. Ingongngam, A. Rungsawang, Topic-centric algorithm: a novel approach to Web link analysis, Advanced Information Networking and Applications, Vol. 2, 2004, pp. 299-301.
[4] B. L. Narayan, C. A. Murthy, Sankar K. Pal, Topic continuity for Web document categorization and ranking, IEEE/WIC International Conference on Web Intelligence, 2003, pp. 310-315.
[5] Matthew Richardson, Pedro Domingos, The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank, In Advances in Neural Information Processing Systems, Vol. 14, 2002, pp. 673-680.
[6] K. Bharat, M. R. Henzinger, Improved algorithms for topic distillation in a hyperlinked environment, In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.
[7] Taher H. Haveliwala, Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search, IEEE Transactions on Knowledge and Data Engineering, Vol. 15(4), 2003, pp. 784-796.
[8] Michelangelo Diligenti, Marco Gori, Marco Maggini, Web Page Scoring Systems for Horizontal and Vertical Search, In Proceedings of the 11th International World Wide Web Conference, 2002.
[9] Michael Brinkmeier, PageRank revisited, ACM Transactions on Internet Technology, Vol. 6(3), 2006, pp. 282-301.
[10] L. Wood, Programming the Web: the W3C DOM specification, IEEE Internet Computing, Vol. 3(1), 1999, pp. 48-54.
[11] Soumen Chakrabarti, Mukul Joshi, Vivek Tawde, Enhanced topic distillation using text, markup tags, and hyperlinks, In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
[12] P. Srinivasan, G. Pant, F. Menczer, Target seeking crawlers and their topical performance, In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.
[13] M. Diligenti, F. M. Coetzee, S. Lawrence, et al., Focused Crawling Using Context Graphs, In Proceedings of the 26th International Conference on Very Large Databases, 2000.