Ontology Based Web Crawling – A Novel Approach

S. Ganesh

Assistant System Engineer, TCS Nortel - Product Test,
Tata Consultancy Services Ltd, Mumbai 400 066, India
[email protected]

Abstract. The requirement of a web crawler that downloads the most relevant pages is still a major challenge in the field of Information Retrieval systems. The use of link analysis algorithms such as PageRank, and of other importance metrics like backlink count, has opened a new approach to prioritizing the URL queue for downloading more relevant pages. In this paper, the combination of these metrics with a new metric called the association metric, which brings the use of ontology to crawling, is proposed and implemented. The use of a domain-dependent ontology brings into effect both the semantic and the link nature of a URL and its page. The association metric estimates the semantic content of the URL based on the domain-dependent ontology, which in turn strengthens the metric used for prioritizing the URL queue. In addition, after a page is downloaded, the association metric plays an important role in estimating the relevancy of the links in that page. This new metric solves, to an optimal level, the major problem of estimating the relevancy of pages before they are crawled. The crawler developed based on the association metric has shown encouraging results.

Keywords: Web Crawler, Ordering Metric, Importance Metrics, Association Metric, Ontology.

1 Introduction

The worldwide web, with well over millions of pages, continues to grow rapidly, at roughly a million pages per day, and it also changes rapidly. This rapid growth of the worldwide web poses unprecedented scaling challenges for general-purpose crawlers and search engines. The growing number of web pages accessible to Internet users imposes the need for a technique to retrieve only the relevant information. A crawler, which is a main component of a search engine, is a program that retrieves Web pages, e.g. for a Web cache [3]. Roughly, a crawler starts off with the URL of an initial page P0. It retrieves P0, extracts any URLs in it, and adds them to a queue of URLs to be scanned. The crawler then gets URLs from the queue (in some order) and repeats the process. Every page that is scanned is given to a client that saves the pages, creates an index for the pages, or summarizes or analyzes the content of the pages.

The design of a good crawler presents many challenges [4]. The most prominent challenge faced by current web crawlers is selecting important pages for downloading: since the crawler cannot download all pages from the web, it must select pages and visit "important" pages first by properly prioritizing the URLs in the queue. Other challenges are a proper refreshing strategy, minimizing the load on the websites crawled, and parallelization of the crawling process. This paper deals with the challenge of prioritizing the URL queue so that more relevant pages are crawled, based on a domain-dependent ontology. It explores the possibility of merging the present prioritizing algorithms with the semantic nature of the URL, which is obtained from the ontology through the association metric. Section 2 discusses the related work done so far on this challenge. Section 3 gives a detailed description of the working of a web crawler and of various prioritizing algorithms. Section 4 presents our work on this challenge and the new prioritizing algorithm based on the ontology. A system overview of the developed crawler is given in Section 5. Section 6 discusses the implementation of the ontology and the proposed ordering metric.

2 Related Work

There has been considerable work on prioritizing the URL queue for efficient crawling, but the performance of the existing prioritizing algorithms does not suit the requirements of the various kinds and levels of users. The use of link analysis algorithms has solved the problem partially. The best-known example of such link analysis is the PageRank algorithm successfully employed by the Google search engine [5]. The HITS algorithm proposed by Kleinberg [1] relies on query-time processing to deduce the hubs and authorities that exist in a subgraph of the web consisting of both the results to a query and the local neighborhood of these results. PageRank, however, suffers from slow computation due to the recursive nature of its algorithm, and the main drawback of Kleinberg's HITS algorithm is its query-time processing for crawling pages [1].

Recent work by Junghoo Cho et al. shows that evaluating the importance of a page P as I(P) using certain metrics solves the problem to a certain extent; these metrics are discussed in Section 3 [3]. But the efficiency of these metrics is affected by many factors: PageRank is less effective at the start of the crawl process, and the backlink metric has proved less effective for small domains. Their similarity metric evaluates only the textual similarity between the crawled pages and the driving query, based on only a few keywords. Another approach to crawling highly relevant pages is the use of neural networks [1], but even this approach has not been established as an efficient crawling technique so far.

All these approaches solve the problem only partially. In particular, the combination of importance metrics has not been explored explicitly [3].

3 Fundamentals

This section describes the basics of a web crawler and various importance metrics.

3.1 Web Crawler

The crawler receives a list of URLs to be downloaded and returns the full content of each HTML page, or any errors encountered while trying to get the pages. The crawler processes one URL from the queue at a time. The queue is prioritized based on the importance metrics discussed in the next section, and it is reordered according to the importance metric used after each web page is downloaded. Not all pages are of equal interest to the client, so each page should be associated with an importance value based on the metrics given below.

3.2 Importance Metrics

Given a Web page P, the importance of that page, I(P), can be evaluated in one of the following ways:

1. Similarity to a Driving Query Q. Based on a query Q that drives the crawling process, IS(P) is defined to be the textual similarity between P and Q.

2. Backlink Count. The value of IB(P) is the number of links to P that appear over the entire Web. Intuitively, a page P that is linked to by many pages is more important than one that is seldom referenced. A crawler may estimate the number of links to P that have been seen so far, which is IB'(P).

3. PageRank. The PageRank backlink metric, IR(P), recursively defines the importance of a page to be the weighted sum of the backlinks to it. PageRank is described in much greater detail in [3, 5].

4. Forward Link Count. The metric IF(P) counts the number of links that emanate from P. Under this metric, a page with many outgoing links is very valuable, since it may be a Web directory. This metric can be computed directly from P.

5. Location Metric. The importance IL(P) of page P is a function of its location, not of its contents. If URL u leads to P, then IL(P) is a function of u. URLs ending with ".com" or containing the string "home" may be deemed more useful.
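To make the directly computable metrics concrete, the following is a minimal sketch in Java (the implementation language named in Section 7): IB'(P) is estimated from the links seen so far during the crawl, IF(P) is read off the page's own out-links, and IL(P) is a heuristic on the URL string. The class name and the specific weights in locationScore are illustrative assumptions, not values from the paper.

import java.util.*;

class ImportanceMetrics {
    // Number of in-links observed so far for each URL (basis of IB'(P)).
    private final Map<String, Integer> backlinksSeen = new HashMap<>();

    // Record one observed link pointing at `url`.
    void recordBacklink(String url) {
        backlinksSeen.merge(url, 1, Integer::sum);
    }

    // IB'(P): the backlink count estimated from pages crawled so far,
    // since the true Web-wide count IB(P) is unknown during the crawl.
    int backlinkCount(String url) {
        return backlinksSeen.getOrDefault(url, 0);
    }

    // IF(P): the forward link count, computable directly from page P.
    int forwardLinkCount(List<String> outgoingLinks) {
        return outgoingLinks.size();
    }

    // IL(P): a location heuristic over the URL u that leads to P,
    // favouring ".com" hosts and URLs containing "home" as in the text.
    double locationScore(String url) {
        double score = 0.0;
        if (url.contains(".com")) score += 1.0;   // weight: assumption
        if (url.contains("home")) score += 0.5;   // weight: assumption
        return score;
    }
}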

4 Our Work

We have brought in a new method of ordering the URL queue that combines both the link structure of the web and its semantic nature. Our work towards this challenge can be divided into the following phases:

4.1 Combination Importance Metric

The importance metrics described in the previous section were combined to get a better composite metric. The simple combination of any two metrics obeys the results published in [2]. The importance metrics are evaluated for the crawled pages, and from these scores the ordering score O(u) for a URL u is evaluated (Section 4.3). The combination importance metric is denoted by CI(p), where p is the page to be crawled.

This can be defined as follows:

CI(p) = a1 IR(p) + a2 IB(p) + a3 IL(p) + a4 IF(p)

where a1, a2, a3, a4 are real constants that can be set appropriately to obtain an efficient CI(p). CI'(p) is defined as the metric evaluated for the downloaded (crawled) page p; CI'(p1, p2, ..., pn) is evaluated for all the crawled pages p1, p2, ..., pn. This composite importance metric is used together with the association metric for URL ordering.
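Read as code, the combination is a plain weighted sum. The sketch below assumes the four component scores have already been computed; the weight values are placeholders to be tuned, as the text says, not constants from the paper.

class CombinationMetric {
    // a1..a4 are the free real constants from the text; the values
    // below are placeholder assumptions, not tuned results.
    double a1 = 0.4, a2 = 0.3, a3 = 0.2, a4 = 0.1;

    // CI(p) = a1*IR(p) + a2*IB(p) + a3*IL(p) + a4*IF(p)
    double ci(double ir, double ib, double il, double ifp) {
        return a1 * ir + a2 * ib + a3 * il + a4 * ifp;
    }
}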

4.2 Association Metric

The use of the semantic nature of the web, particularly of the URL, throws new light on the URL ordering scheme. The association metric is similar to the similarity metric, but it captures the semantic nature of the URL through the use of ontology.

4.2.1 Why Ontology? Ontology is one of the increasingly popular ways to structure information; ontologies are also called graphs of concepts. Evaluating the association metric with the aid of an ontology, whether generic or domain dependent, gives more relevant results. This can be achieved by maintaining a reference ontology based on subject hierarchies, collected from directories such as Yahoo and the Open Directory Project. The reference ontology thus created has associations such as the "is a", "part of" and "has" relationships. How the proposed metric is evaluated for URLs and for the crawled web pages is discussed in Section 6. The semantic metric for a URL u is evaluated based on its relevancy to the reference ontology. Once the page p of the URL u is downloaded, the semantic metric for this page p is also calculated and maintained, as p will be a parent page for many links to be crawled. Calculating the metric for the parent page helps in deciding whether a link is relevant enough to be crawled, even before the association metric for the link itself is calculated. AS(p) is the same for all links from that page p, but it differs for links not extracted from this page. Hence we calculate two association metrics, listed below:

– the association metric for the URLs u1 to un to be scanned
– the association metric for the downloaded parent page p0

The association metric for a URL u is denoted by AS(u), and the association metric for a crawled page p is denoted by AS(p).
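As an illustration (this is not the paper's code), AS can be sketched as a keyword lookup against the knowledge-path: a URL string, or a page's title and body text, scores the depth value of the deepest knowledge-path keyword it contains. The keyword-to-score table below mirrors the Password Recovery example of Section 6.

import java.util.*;

class AssociationMetric {
    // Knowledge-path keywords mapped to their depth scores (see Section 6).
    private final Map<String, Integer> keywordScores = new LinkedHashMap<>();

    AssociationMetric() {
        keywordScores.put("security", 1);
        keywordScores.put("hackers", 2);
        keywordScores.put("cryptography", 3);
        keywordScores.put("password recovery", 4);
    }

    // AS(u) or AS(p): score of the deepest matching keyword, 0 if none.
    // e.g. as("http://www.dmoz.org/Computers/Security") returns 1.0.
    double as(String urlOrText) {
        String lower = urlOrText.toLowerCase();
        int best = 0;
        for (Map.Entry<String, Integer> e : keywordScores.entrySet()) {
            if (lower.contains(e.getKey())) best = Math.max(best, e.getValue());
        }
        return best;
    }
}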

4.3 Ordering Metric O(u)

The ordering metric O(u) used for reordering the URL queue in our crawler is a composite metric defined as follows:

O(u) = b1 CI'(p) + b2 AS(u) + b3 [AS(p1) + AS(p2) + ... + AS(pn)]

where pi is the i-th parent page of the URL u to be crawled, and b1, b2, b3 are real constants to be evaluated from the results of our crawl. By varying these constants, different results may be obtained. This O(u) metric is expected to give higher efficiency than the existing priority algorithms, as it combines both the semantic and the page-link nature of the web.
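As an illustrative example with made-up values (the paper does not fix the constants): take b1 = b2 = b3 = 1, CI'(p) = 0.5 for the parent page, AS(u) = 4 for a URL matching "Password Recovery", and two crawled parent pages with AS(p1) = 1 and AS(p2) = 2. Then O(u) = 1(0.5) + 1(4) + 1(1 + 2) = 7.5, and this URL is placed in the queue ahead of any URL with a lower score.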

4.4 Pseudocode of Our Prioritizing Algorithm

enqueue(url_queue, starting_url);
while (not empty(url_queue)) {
    url = dequeue(url_queue);
    page = crawl_page(url);
    enqueue(crawled_pages, (url, page));
    url_list = extract_urls(page);
    for each page p in crawled_pages
        if [page p has semantic associations with the keyword w in body or in title]
            AS(p) = weighted association value
    end loop
    for each u in url_list
        enqueue(links, (url, u));
        if [u not in url_queue] and [(u, -) not in crawled_pages]
            enqueue(url_queue, u);
        if [url u has semantic associations with the keyword w]
            AS(u) = weighted association value
        CI(u) = pagerank[u]
        O[u] = b1 CI(u) + b2 AS(u) + b3 [AS(p1) + AS(p2) + ... + AS(pn)]
            // where p1, p2, ..., pn are the parent pages of this url u
    end loop
    reorder_queue(url_queue);  // based on O[u]
}
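As a companion to this pseudocode, the following is a minimal runnable sketch of the prioritizing loop in Java, the language the crawler is implemented in (Section 7). The fetching, link-extraction, CI and AS components are stubbed out, the weights b1, b2, b3 are placeholders, and the b3 term uses only the immediate parent's AS(p) rather than the sum over all parent pages; it is a sketch under these assumptions, not the paper's implementation.

import java.util.*;

class OntologyCrawler {
    // O(u) scores; fixed before a URL is enqueued, so the heap order stays valid.
    private final Map<String, Double> ordering = new HashMap<>();
    private final PriorityQueue<String> urlQueue =
        new PriorityQueue<>((u, v) -> Double.compare(
            ordering.getOrDefault(v, 0.0), ordering.getOrDefault(u, 0.0)));
    private final Set<String> crawled = new HashSet<>();
    private final Map<String, Double> parentAs = new HashMap<>(); // AS(p) per crawled page

    double b1 = 1.0, b2 = 1.0, b3 = 1.0; // placeholder weights

    void crawl(String seedUrl, int maxPages) {
        urlQueue.add(seedUrl);
        while (!urlQueue.isEmpty() && crawled.size() < maxPages) {
            String url = urlQueue.poll();
            if (!crawled.add(url)) continue;          // skip already-crawled URLs
            String page = fetch(url);                 // stub
            parentAs.put(url, as(page));              // AS(p) of the new parent page
            for (String u : extractUrls(page)) {      // stub
                if (crawled.contains(u) || ordering.containsKey(u)) continue;
                // O(u) = b1 CI'(p) + b2 AS(u) + b3 AS(parent), simplified as noted above
                double o = b1 * ci(u) + b2 * as(u) + b3 * parentAs.get(url);
                ordering.put(u, o);
                urlQueue.add(u);                      // heap keeps the queue ordered by O(u)
            }
        }
    }

    // Stubs standing in for the real components described in Section 5.
    String fetch(String url) { return ""; }
    List<String> extractUrls(String page) { return Collections.emptyList(); }
    double ci(String url) { return 0.0; }   // combination importance metric estimate
    double as(String text) { return 0.0; }  // association metric for a URL or page
}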

5 System Overview

– Importance Metric Evaluator: The Importance Metric Evaluator module is the main alteration that has been made to the current crawling system. This module evaluates the importance of the pages to be crawled according to the proposed importance metric. In this module, the prioritizing algorithm based on the proposed importance metric, the association metric, is implemented.

Fig. 5.1

– URL Filter: The URL Filter module is another new module, not present in many crawling systems. Here the importance of the page behind the current URL is calculated, i.e. the ordering metric O(u) is calculated for the URL u based on the importance metric value provided for u by the Importance Metric Evaluator.

6 Implementation of the Ontology

In this section, the implementation of our ontology-based crawler is presented. The ontology used is similar to the hierarchical directory structure of the Google directory.

The Ontology structure is as follows:

Fig. 6.1

The working of the devised algorithm is discussed in this section with Password Recovery as the start topic. Our crawler, based on the association metric, is to crawl web pages related to Password Recovery, a topic under network security. The ontology used here has a node for Security, which comes under Computers, as shown in Fig. 6.1. Before crawling, the ontology tree is traversed to the node pointing to Password Recovery; this forms the knowledge-path for the current crawl. The ontology in our crawler is currently based on the Computers domain. It is a pure hierarchy based on the subject hierarchies for the Computers domain in the ODP project, http://www.dmoz.org/Computers.

The Open Directory Project (ODP), http://www.dmoz.org, has been used because it is open source and has less commercial bias. To implement this ontology structure, two text files are used. The first text file, Ontology.txt, establishes the parent-child relationships of the ontology, i.e. its "is a" and "part of" relations, in a simple way. To get the associated score for each node in the knowledge-path, Score.txt is used. This text file gives an integral score for each node based on the depth at which it sits in the ontology structure. For instance, Security gets the score 1, since it is at level 1 from the root http://www.dmoz.org/Computers, and Hackers, which is a child of Security, gets the score 2 in the same way.

Fig. 6.2 Illustration of Ontology.txt:

Security Computers
Hackers Security
Cryptography Hackers
Password Recovery Cryptography

Fig. 6.3 Illustration of Score.txt:

Security 1
Hackers 2
Cryptography 3
Password Recovery 4

The initial parameters for the crawler, namely the number of pages to be crawled, the current topic of the crawl (Password Recovery) and the seed URL (http://www.dmoz.org/Computers), are set before the crawler starts. The Crawl and Stop model is followed. Before the crawler starts, the knowledge-path for the current crawl is constructed from the stored ontology: Ontology.txt is traversed to get the path for the current crawl, Password Recovery. The path thus established is Computers-Security-Hackers-Cryptography-Password Recovery. This path decides and restricts the crawler's path for the current crawl. Also before the crawling starts, the scores for the different keywords in the knowledge-path are calculated using the Score.txt file. The scores thus established are: Security = 1, Hackers = 2, Cryptography = 3, Password Recovery = 4. The crawler is then ready to crawl based on this knowledge-path.

The seed URL, http://www.dmoz.org/Computers, is downloaded first, and the new links are extracted and added to the queue. These new URLs are checked for associations with the knowledge-path; if an association is present, the URL u is given the score based on the keyword score evaluated before crawling. The crawler restricts itself to the knowledge-path by following this URL path: http://www.dmoz.org/Computers, http://www.dmoz.org/Computers/Security, http://www.dmoz.org/Computers/Security/Hacking, http://www.dmoz.org/Computers/Security/Hacking/Cryptography, http://www.dmoz.org/Computers/Security/Hacking/Cryptography/Password
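To illustrate the two files, here is a minimal sketch, under the assumptions that Ontology.txt holds one "child parent" pair per line with the parent as the last token (Fig. 6.2) and Score.txt holds one "node score" pair per line (Fig. 6.3). It rebuilds the knowledge-path for Password Recovery by walking child-to-parent links up to the root; the file names match the text, everything else is illustrative.

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

class KnowledgePath {
    public static void main(String[] args) throws IOException {
        // child -> parent relations from Ontology.txt (parent = last token)
        Map<String, String> parentOf = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("Ontology.txt"))) {
            String t = line.trim();
            int cut = t.lastIndexOf(' ');
            if (cut > 0) parentOf.put(t.substring(0, cut).trim(), t.substring(cut + 1));
        }
        // depth-based keyword scores from Score.txt
        Map<String, Integer> score = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("Score.txt"))) {
            String t = line.trim();
            int cut = t.lastIndexOf(' ');
            if (cut > 0) score.put(t.substring(0, cut).trim(),
                                   Integer.parseInt(t.substring(cut + 1)));
        }
        // Walk upwards from the crawl topic to the root of the hierarchy.
        Deque<String> path = new ArrayDeque<>();
        for (String node = "Password Recovery"; node != null; node = parentOf.get(node)) {
            path.addFirst(node);
        }
        // e.g. "Computers - Security - Hackers - Cryptography - Password Recovery"
        System.out.println("Knowledge-Path: " + String.join(" - ", path));
        for (String node : path) {
            System.out.println(node + " = " + score.getOrDefault(node, 0));
        }
    }
}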

7 Experiments and Results

The various algorithms were run on this crawler and the pages were downloaded. The crawler is written in Java. A Crawl and Stop model of crawling has been followed, with the Open Directory Project as the test bed. The comparison of the various metrics is done based on the number of relevant and irrelevant pages and on the precision of the crawl.

Fig. 7.1 Relevancy of Downloaded Pages Topic Wise

Fig. 7.2 Precision of Various Metrics

Fig. 7.3 Precisions of the Metrics with Association

All these graphs establish that the association metric has higher precision and relevancy than the other metrics.

Fig. 7.4 Relevancies of Crawled Pages for the Same Topic (Security)

8 Conclusion

The web crawler based on importance metrics developed here combines the semantic and link nature of the web for crawling. The newly proposed ordering metric, the association metric, which is based both on the semantic content of the URL u and on that of all its parent pages, together with the importance metric, provides a new method for prioritizing the URL queue for crawling by taking into account both the semantic and the link structure of the web. This new algorithm has the ability to address the major problem of crawling relevant pages. This idea of bringing the nature of the ontology to the URL has not been explored before. A proof-of-concept web crawler based on these importance metrics has been developed successfully. This work can bring a new dimension to existing IR systems such as search engines, where crawling is purely based on the link structure of the web and on keyword matching. The encouraging results shown by this crawler make it well suited to the area of focused crawling. By making this crawler distributed, bringing more relationships into the reference ontology and updating the ontology dynamically, it can be scaled into a powerful, generic web crawler. This scalability advantage can substantially improve current crawling systems and the search engine architecture of maintaining large storage structures. This crawler is a new leaf in the semantic web concept.

References

[1] Filippo Menczer, Gautam Pant and Padmini Srinivasan, "Topical Web Crawlers: Evaluating Adaptive Algorithms", ACM Transactions on Internet Technology (2003)

[2] S. Ganesh, M. Jayaraj, V. Kalyan, Srinivasa Murthy, G. Aghila, "Ontology Based Web Crawler", in Proceedings of the IEEE-Sponsored International Conference on Information Technology: Coding and Computing, Pondicherry Engineering College, India, April 2004, pp. 337–342

[3] Junghoo Cho, Hector Garcia-Molina and Lawrence Page, "Efficient Crawling Through URL Ordering", in Proceedings of the Seventh International World Wide Web Conference, pp. 161–172, April 1998

[4] Monika R. Henzinger, "Algorithmic Challenges in Web Search Engines", Internet Mathematics, Volume 1, pp. 115–126, December 2002

[5] Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Proceedings of the Seventh International World Wide Web Conference, pp. 107–117, April 1998