Adaptive Focused Crawling
Adaptive Focused Crawling. Presented by: Siqing Du. Date: 10/19/05.
![Page 1: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/1.jpg)
Adaptive Focused Crawling
Presented by: Siqing Du
Date: 10/19/05
![Page 2: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/2.jpg)
Outline
Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation
![Page 3: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/3.jpg)
Crawling the Web
Simple crawling on the Web proceeds by following the urls in the seed pages, retrieving the linked web pages, and adding them to a local repository.
Taking the Web as a graph structure (V,E), web crawling is similar to a graph traversal problem.
Breadth-first search
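The breadth-first strategy above can be sketched as follows. This is a minimal illustration, not the presenter's code: the Web is replaced by a hypothetical in-memory dictionary `toy_web` mapping each url to its outgoing links, and "downloading" a page just means appending its url to the repository.

```python
from collections import deque

def bfs_crawl(graph, seeds, max_pages=100):
    """Visit pages in breadth-first order, starting from the seed urls."""
    frontier = deque(seeds)      # FIFO queue of urls waiting to be fetched
    seen = set(seeds)            # urls already queued, to avoid re-fetching
    repository = []              # stands in for the local page repository
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        repository.append(url)   # "download" the page
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

# Hypothetical toy web: url -> list of outgoing links.
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": ["a"]}
order = bfs_crawl(toy_web, ["a"])
print(order)
```

A real crawler would replace the dictionary lookup with an HTTP fetch and link extraction; the queue discipline is what makes the traversal breadth-first.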
![Page 4: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/4.jpg)
Flow of a Basic Sequential Crawler
![Page 5: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/5.jpg)
What is the Problem
The current size of the (static, crawlable, visible) Web is 4 to 10 billion pages, or maybe a lot more.
The average out-degree (number of urls in a page) of a random page on the Web is 7.
Hence the number of pages reachable from a seed grows exponentially with link depth, by a factor of about 7 per level.
Even a well-known web search engine can cover only a part of the whole Web.
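As a rough worked example of this growth, assume (unrealistically) that every link leads to a page not seen before, so the crawl frontier forms a tree with branching factor 7. The number of pages within d link hops of a single seed would then be

$$N(d) = \sum_{k=0}^{d} 7^{k} = \frac{7^{d+1} - 1}{6}$$

which gives N(11) ≈ 2.3 × 10^9 and N(12) ≈ 1.6 × 10^10. Under this simplification, about a dozen levels would already exceed the 4 to 10 billion page estimate. Real pages share many links, so the true growth is slower, but the exhaustive strategy is still impractical.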
![Page 6: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/6.jpg)
Adaptive Focused Crawling
Focused crawling: developing particular crawlers able to seek out and collect pages related to a given topic.
It is also called topical crawling.
If a focused crawler includes learning methods to adapt its behavior during the crawl to the particular environment and to its relationships with the given input parameters (e.g., the set of retrieved pages and the user-defined topic), the crawler is called adaptive.
Best-first search
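A best-first crawler can be sketched by replacing the FIFO queue with a priority queue. In the sketch below, each unvisited link inherits the relevance score of the page it was found on, which is one common heuristic and an assumption here. The `relevance` function (fraction of topic terms in the page text) and the toy data are hypothetical.

```python
import heapq

def relevance(page_text, topic_terms):
    """Toy score: fraction of the words in the page that are topic terms."""
    words = page_text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)

def best_first_crawl(graph, texts, seeds, topic_terms, max_pages=100):
    """Always fetch next the link whose parent page looked most relevant."""
    frontier = [(-1.0, url) for url in seeds]   # negated scores: heapq is a min-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        parent_score = relevance(texts.get(url, ""), topic_terms)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-parent_score, link))
    return visited

# Hypothetical toy web and page texts.
graph = {"seed": ["sports", "crawl"], "sports": ["s2"], "crawl": ["c2"], "s2": [], "c2": []}
texts = {
    "seed": "web crawler news and sports",
    "sports": "football scores today",
    "crawl": "focused crawler web crawler",
    "s2": "league table",
    "c2": "crawler frontier",
}
visited = best_first_crawl(graph, texts, ["seed"], {"crawler", "web"})
print(visited)
```

In this toy run the link found on the on-topic page ("c2") is fetched before the off-topic page "sports", even though "sports" was discovered earlier; a breadth-first crawler would do the opposite.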
![Page 8: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/8.jpg)
Exploiting the Hypertextual Information
PageRank and HITS build on citation analysis, which was started in the 1950s by Garfield.
In focused crawling systems, precision is defined not only in terms of the number of crawled pages, but also in terms of rank.
Short result lists of highly ranked documents are definitely better than long lists of interesting documents that force users to sift through them to find the most valuable information.
![Page 9: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/9.jpg)
Topical Locality and Anchors
Topical locality occurs each time a page is linked to others with related content (in order to give users the chance to see further related information or services).
Proximal cues, or residues, are the imperfect information at intermediate locations that a user exploits to decide which paths to follow in order to reach the target information.
Text snippets, anchor text, or icons usually provide this imperfect information about some distant content.
![Page 10: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/10.jpg)
HITS
Authorities: pages that have relevant content about a topic.
Hubs: pages that contain several links toward relevant authoritative pages.
$$a(p) = \sum_{q:(q,p)\in E} h(q)$$
$$h(p) = \sum_{q:(p,q)\in E} a(q)$$
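The two mutually recursive scores can be computed by simple iteration. The sketch below applies the authority update a(p) (sum of hub scores of pages linking to p) and the hub update h(p) (sum of authority scores of pages p links to) repeatedly. The normalization step after each round is not shown on the slide; it is the usual addition to keep the values bounded and is an assumption here, as is the toy edge list.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm > 0 else 0.0) for k, v in scores.items()}

def hits(edges, iterations=50):
    """edges: list of (source, target) links. Returns (authority, hub) score dicts."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all links (q, p)
        new_auth = {n: 0.0 for n in nodes}
        for q, p in edges:
            new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all links (p, q)
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical toy graph: three hub pages linking to two authority pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
auth, hub = hits(edges)
print(auth)
print(hub)
```

In this toy graph "a1" ends up with the highest authority score because all three hubs point to it, and "h3" gets the lowest hub score because it links to only one authority.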
![Page 11: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/11.jpg)
PageRank
Random surfer model: a surfer in this model randomly clicks on one of the links contained in a page p, each with equal probability 1/N_p, where N_p is the number of links in p.
$$\mathrm{rank}(p) = c \sum_{q:(q,p)\in E} \frac{\mathrm{rank}(q)}{N_q} + c\,E(p)$$
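The rank equation can be solved by repeated substitution. The sketch below follows the form on the slide: each page q passes rank(q)/N_q to every page it links to, a rank source E(p) is added, and the factor c is chosen each round so that the ranks sum to 1. The uniform choice of E(p), the parameter name `e`, and the toy graph are assumptions for illustration.

```python
def pagerank(out_links, e=0.15, iterations=100):
    """out_links: dict url -> list of distinct urls it links to. Returns url -> rank."""
    nodes = sorted(set(out_links) | {p for targets in out_links.values() for p in targets})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    source = {p: e / n for p in nodes}        # E(p): assumed uniform rank source
    for _ in range(iterations):
        new_rank = dict(source)
        for q, targets in out_links.items():
            for p in targets:
                new_rank[p] += rank[q] / len(targets)   # rank(q) / N_q
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}  # c = 1 / total
    return rank

# Hypothetical toy graph.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
print(sorted(ranks.items(), key=lambda item: -item[1]))
```

In the toy graph, page "c" collects links from three pages and receives the highest rank, while "d" has no incoming links and keeps only its share of the rank source.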
![Page 13: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/13.jpg)
AI-based Approaches
Treat crawlers as single autonomous units that live and keep moving in search of interesting resources.
Genetic-based crawlers
Ant paradigm
![Page 14: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/14.jpg)
Genetic-based crawlers
InfoSpiders, also known as ARACHNID (Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery)
Genetic algorithms have been introduced in order to find approximate solutions to hard-to-solve combinatorial optimization problems.
Inspired by evolutionary biology studies.
![Page 15: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/15.jpg)
Basic Idea of GA
A population of candidate solutions.
Genetic operators, such as inheritance, mutation, and crossover.
The individuals that are closer to the better solutions are given more chances to live and reproduce, while those that are ill-suited to the environment die out.
The initial population is generated randomly.
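The ingredients listed above can be put together in a few lines. The following is a generic genetic algorithm sketch on bit-string genomes, not the crawler-specific algorithm: the truncation selection, single-point crossover, elitism, and all parameter values are illustrative assumptions, and the fitness function used in the example (number of 1 bits) is a standard toy problem.

```python
import random

def evolve(fitness, genome_length=20, population_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Evolve bit-string genomes toward higher fitness. Returns (best, history)."""
    rng = random.Random(seed)
    # The initial population is generated randomly.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    history = []                                   # best fitness at each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        survivors = ranked[:population_size // 2]  # the ill-suited half dies out
        children = [ranked[0][:]]                  # elitism: keep the best unchanged
        while len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_length)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness), history

best, history = evolve(sum)   # fitness = number of 1 bits in the genome
print(sum(best), history[0], history[-1])
```

Because the best individual is always copied into the next generation, the best fitness never decreases from one generation to the next.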
![Page 16: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/16.jpg)
InfoSpider
In InfoSpiders, an evolving population of intelligent agents browses the Web, driven by the user queries.
Each agent can gather relevant resources and reason autonomously about the next page to download and analyze.
The goal is to mimic the intelligent browsing behavior of human users with little or no interaction among agents.
![Page 17: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/17.jpg)
InfoSpider cont.
Each agent is built on top of a genotype (a parameter that represents the degree to which an agent trusts the textual description of outgoing links, a set of keywords initialized with the query terms, and a vector of weights).
A feed-forward neural network is used to judge which keywords in the set best discriminate the documents relevant to the user.
![Page 18: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/18.jpg)
InfoSpider cont.
Adaptivity is both unsupervised and supervised (without or with user feedback).
If an error occurs (an uninteresting page) because of the agent's action selection, the weights of the neural network are updated accordingly.
Mutation and crossover provide the second kind of adaptivity to the environment.
An agent's energy value is assigned at the beginning and then updated according to the relevance of the pages visited.
The energy determines which agent survives or dies out.
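The energy bookkeeping can be illustrated with a small sketch. The slide only states that energy is assigned initially, updated by page relevance, and decides survival; the fixed cost per visit, the reproduction threshold, and the even split of energy between parent and offspring below are assumptions made for illustration, and all names are hypothetical.

```python
def step_population(agents, relevance_of_visit, cost=0.1, reproduce_at=2.0):
    """Apply one visit to every agent and return the surviving population.

    agents: list of dicts with "name" and "energy".
    relevance_of_visit: dict name -> relevance (0..1) of the page just visited.
    """
    next_population = []
    for agent in agents:
        # Energy grows with the relevance of the visited page, minus a visit cost.
        energy = agent["energy"] + relevance_of_visit[agent["name"]] - cost
        if energy <= 0:
            continue                               # the agent dies out
        if energy >= reproduce_at:
            half = energy / 2                      # assumed: parent and child share energy
            next_population.append({"name": agent["name"], "energy": half})
            next_population.append({"name": agent["name"] + "'", "energy": half})
        else:
            next_population.append({"name": agent["name"], "energy": energy})
    return next_population

agents = [{"name": "a", "energy": 0.05},
          {"name": "b", "energy": 1.5},
          {"name": "c", "energy": 1.0}]
visits = {"a": 0.0, "b": 0.9, "c": 0.3}
survivors = step_population(agents, visits)
print([(s["name"], round(s["energy"], 2)) for s in survivors])
```

In a full system the offspring's genotype would also be mutated or recombined, which is where the second kind of adaptivity enters.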
![Page 19: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/19.jpg)
Itsy Bitsy Spider
Itsy Bitsy Spider is an implementation of a genetic-based crawler that was tested on the Yahoo database.
In the evaluation, the genetic approach does not outperform the best-first search algorithm (recall is high; precision shows no significant difference).
However, Itsy Bitsy is a simplified version of InfoSpiders: it has no neural network, lacks some other components, and cannot reason autonomously.
![Page 21: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/21.jpg)
Ant-based Crawlers
Based on a model of the collective behavior of social insects.
The model comes from studies of how blind animals, such as ants, are able to find the shortest paths from their nest to feeding sources and back.
Ants can release a hormonal substance, the pheromone, to mark the ground, leaving a trail.
Other ants follow the trail and reinforce it.
![Page 22: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/22.jpg)
Mechanism
The first ants returning to their nest from the feeding sources are those that chose the shortest paths.
The back-and-forth trip lets them release pheromone on the path twice.
Other ants, when they have to choose between different paths, prefer those with more pheromone.
![Page 23: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/23.jpg)
Ant-based Crawlers
Each agent corresponds to a virtual ant that moves from url_i to url_j.
The system execution is divided into cycles; in each of them, the ants make a sequence of moves.
At the end of a cycle, the ants update the pheromone intensity values of the followed path as a function of the retrieved resource scores.
![Page 24: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/24.jpg)
Ant-based Crawlers
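The transition probability from url_i to url_j at cycle t is
$$p_{ij}(t) = \frac{\tau_{ij}(t)}{\sum_{l:(i,l)\in E} \tau_{il}(t)}$$
where τ_ij(t) is the pheromone intensity on the link from url_i to url_j at cycle t.
To prevent circular paths, each ant stores a list L containing the visited urls.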
![Page 25: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/25.jpg)
Updating Rule
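The pheromone intensity of the trail from url_i to url_j at cycle t+1 is
$$\tau_{ij}(t+1) = \rho\,\tau_{ij}(t) + \sum_{k=1}^{M} \Delta\tau^{(k)}$$
$$\Delta\tau^{(k)} = \frac{1}{|P^{(k)}|} \sum_{j=1}^{|P^{(k)}|} \mathrm{score}\big(P^{(k)}[j]\big)$$
where M is the number of ants, ρ is the trail evaporation coefficient, and P^(k) is the path followed by ant k; inside the second sum, j indexes the positions along that path.
Adaptivity: the pheromone intensities are updated according to the visited resource scores.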
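One cycle of the ant-based crawler can be sketched as follows: each ant picks its next url with probability proportional to the pheromone on the outgoing links, skips urls already on its path, and at the end of the cycle the trails evaporate and are reinforced with the mean score of the pages on each ant's path. Several details are assumptions for illustration: probabilities are renormalized over the unvisited links only, an ant deposits pheromone only on the links it actually followed, the start url counts as part of the path, and the value of `rho` and the toy data are hypothetical.

```python
import random

def transition_probabilities(pheromone, i, visited):
    """p_ij = tau_ij / sum of tau_il, restricted to links (i, l) not yet visited."""
    candidates = {j: tau for j, tau in pheromone.get(i, {}).items() if j not in visited}
    total = sum(candidates.values())
    if total <= 0:
        return {}
    return {j: tau / total for j, tau in candidates.items()}

def walk(pheromone, start, moves, rng):
    """Let one ant make up to `moves` moves; returns the list of urls on its path."""
    path = [start]
    for _ in range(moves):
        probs = transition_probabilities(pheromone, path[-1], set(path))
        if not probs:
            break                                   # dead end or everything visited
        targets = list(probs)
        path.append(rng.choices(targets, weights=[probs[j] for j in targets])[0])
    return path

def update_pheromone(pheromone, paths, score, rho=0.8):
    """Evaporate every trail by rho, then add each ant's mean path score."""
    for i in pheromone:
        for j in pheromone[i]:
            pheromone[i][j] *= rho
    for path in paths:
        delta = sum(score(url) for url in path) / len(path)
        for i, j in zip(path, path[1:]):
            pheromone[i][j] += delta

# Hypothetical toy graph with initial pheromone values and page scores.
pheromone = {"a": {"b": 1.0, "c": 3.0}, "b": {"d": 1.0}, "c": {"d": 1.0}, "d": {}}
scores = {"a": 0.0, "b": 0.0, "c": 1.0, "d": 0.5}
probs = transition_probabilities(pheromone, "a", {"a"})
path = walk(pheromone, "a", 5, random.Random(0))
update_pheromone(pheromone, [["a", "c", "d"]], scores.get)
print(probs, path, pheromone)
```

After the update, the links on the rewarding path a → c → d carry more pheromone than the alternatives, so ants in the next cycle are more likely to follow them.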
![Page 27: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/27.jpg)
Intelligent Crawling's Statistical Model
Aims at statistically learning the characteristics of the linkage structure of the Web while performing the search.
Uses the knowledge obtained during the search to calculate conditional probabilities and interest ratios, which determine whether an unseen page is likely to satisfy the user's needs.
It does not need any collection of topical examples for training.
The crawler adapts its behavior by learning the correlations among given features.
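The slide names the interest ratio without defining it. A common definition, assumed here, is the ratio between the probability that a page is relevant given that it shows some feature E and the overall probability of relevance: I(C, E) = P(C | E) / P(C). Values above 1 mean the feature makes relevance more likely. The sketch below estimates it from the pages crawled so far; the data layout and feature names are hypothetical.

```python
def interest_ratio(crawled, feature):
    """Estimate I(C, E) = P(C | E) / P(C) from pages seen so far.

    crawled: list of (features, is_relevant) pairs, where features is a set.
    Returns 1.0 (neutral) when there is not enough evidence.
    """
    total = len(crawled)
    relevant = sum(1 for _, is_relevant in crawled if is_relevant)
    with_feature = [is_relevant for features, is_relevant in crawled if feature in features]
    if total == 0 or relevant == 0 or not with_feature:
        return 1.0
    p_c = relevant / total                                # P(C)
    p_c_given_e = sum(with_feature) / len(with_feature)   # P(C | E)
    return p_c_given_e / p_c

# Hypothetical crawl history: features could be words in the url or anchor text.
crawled = [({"ski"}, True), ({"ski"}, True), ({"ski", "tax"}, False), ({"tax"}, False)]
print(interest_ratio(crawled, "ski"), interest_ratio(crawled, "tax"))
```

Because the statistics are recomputed as the crawl proceeds, the ranking of candidate links shifts with what the crawler has actually seen, which is the adaptive part.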
![Page 28: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/28.jpg)
Reinforcement Learning-based Approaches
A classifier evaluates the relevance of a hypertext document with respect to the chosen topics.
The interesting documents found are the rewards.
The crawler learns which text in the neighborhood of a hyperlink most likely points to relevant pages during the crawl.
![Page 30: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/30.jpg)
Evaluation Methodologies
The goodness of the retrieved documents is measured with precision and recall:
$$P_r = \frac{\text{number of relevant documents retrieved}}{\text{total number retrieved}}$$
$$R_r = \frac{\text{number of relevant documents retrieved}}{\text{total number relevant}}$$
The percentage of important pages retrieved over the progress of the crawl is another often-used measure.
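The two ratios on this slide, precision and recall, can be computed directly from the set of retrieved pages and the set of relevant pages. The function and the page identifiers below are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a crawl result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                 # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall(["p1", "p2", "p3", "p4"], ["p2", "p4", "p7"])
print(precision, recall)
```

Calling the function after every batch of downloaded pages gives the kind of curve over crawl progress that performance plots usually show.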
![Page 31: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/31.jpg)
An Example of Performance Plot
Calculated over 159 topics
One-tailed t-test performed, p < 0.01
![Page 32: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/32.jpg)
Summary
Focused crawling has become an interesting alternative to the current Web search tools.
Focused crawlers are a particular kind of crawler able to seek out and collect the subset of Web pages related to a given topic.
With learning methods, adaptive focused crawlers are able to adapt the system behavior to the particular environment and input parameters during the search.
Evaluation results show how the whole search process may profit from those techniques and increase crawling performance.
![Page 33: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/33.jpg)
Reference
Core paper:
– Alessandro Micarelli and Fabio Gasparetti, Adaptive Focused Crawling.
Additional papers:
– Gautam Pant, Padmini Srinivasan, and Filippo Menczer, Crawling the Web, Web Dynamics, Springer-Verlag, 2003.
– Martin Ester, Matthias Groß, Hans-Peter Kriegel, Focused Web Crawling: A Generic Framework for Specifying the User Interest and for Adaptive Crawling Strategies (VLDB 2001).
![Page 34: Adaptive Focused Crawling](https://reader036.vdocuments.net/reader036/viewer/2022081519/56813fe5550346895daad149/html5/thumbnails/34.jpg)
Questions & Comments?
Thanks!