Seminar on Crawler
TRANSCRIPT
WEB CRAWLERS
Siddharth Shankar
Resource Finding
• Finding information on the web: surfing, searching, crawling
• Uses of crawling: find content, gather content, check content
Crawling and Crawlers
WEB CRAWLERS
Also known as web spiders and web robots; less common names include ants, bots, and worms.
A program or automated script that browses the World Wide Web in a methodical, automated manner.
The process or program used by search engines to download pages from the web; the downloaded pages are later indexed by the search engine to provide fast searches.
WHY CRAWLERS?
The Internet holds a wide expanse of information, and finding relevant information requires an efficient mechanism.
• Web crawlers provide that mechanism to the search engine.
How does a web crawler work?
It starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.
URLs from the frontier are recursively visited according to a set of policies.
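A minimal sketch of this loop in Python, assuming the third-party requests and BeautifulSoup libraries for fetching and link extraction (the function name crawl and the max_pages limit are illustrative, not part of the original slides):

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)   # URLs still to visit (the crawl frontier)
    visited = set()           # URLs already downloaded

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue          # skip unreachable pages
        visited.add(url)

        # Identify all hyperlinks in the page and add unseen ones to the frontier.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in visited:
                frontier.append(link)

    return visited
```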
Prerequisites of a Crawling System
Flexibility: the system should be suitable for a variety of scenarios.
High Performance (Scalability): the system needs to scale from a minimum of one thousand pages per second up to millions of pages.
Fault Tolerance: the system must process invalid HTML code, deal with unexpected Web server behavior, and handle stopped processes or interruptions in network services.
Maintainability and Configurability: an appropriate interface is necessary for monitoring the crawling process, including:
I. Download speed
II. Statistics on the pages
III. Amount of data stored
Crawling Strategies
Breadth-First Crawling: launched from an initial set of pages and follows the hypertext links leading to the pages directly connected with this initial set.
Repetitive Crawling: once pages have been crawled, some systems repeat the process periodically so that the indexes are kept up to date.
Targeted Crawling: specialized search engines use heuristics in the crawling process in order to target a certain type of page.
Random Walks and Sampling: random walks on the Web graph, combined with sampling, are used to estimate the number of documents available online.
Deep Web Crawling: much of the data accessible via the Web is currently contained in databases and may only be downloaded through appropriate requests or forms. The Deep Web is the name given to the part of the Web containing this category of data.
Crawling Policies
Selection Policy: states which pages to download.
Re-visit Policy: states when to check for changes to the pages.
Politeness Policy: states how to avoid overloading Web sites.
Parallelization Policy: states how to coordinate distributed Web crawlers.
Selection Policy
Search engines cover only a fraction of the Internet, so the pages that are downloaded should be relevant ones; hence a good selection policy is very important. Common selection policies (two of them are sketched after the list):
Restricting followed links
Path-ascending crawling
Focused crawling
Crawling the Deep Web
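Two of these policies can be illustrated with a small Python sketch, assuming hypothetical helpers should_follow (restricting followed links by file type) and ancestor_urls (path-ascending crawling); the extension whitelist is an example, not a prescribed rule:

```python
import os.path
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = {"", ".html", ".htm", ".php"}   # example whitelist

def should_follow(url):
    """Restricting followed links: only follow URLs that look like HTML pages."""
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    return extension in ALLOWED_EXTENSIONS

def ancestor_urls(url):
    """Path-ascending crawling: also visit every ancestor path of a URL."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    return [
        (f"{parts.scheme}://{parts.netloc}/" + "/".join(segments[:i])).rstrip("/") + "/"
        for i in range(len(segments))
    ]

# ancestor_urls("http://example.com/a/b/page.html")
# -> ["http://example.com/", "http://example.com/a/", "http://example.com/a/b/"]
```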
Re-Visit Policy
The Web is dynamic and crawling takes a long time, so cost factors play an important role in crawling. Freshness and age are the commonly used cost functions; the objective of the crawler is a high average freshness and a low average age of web pages. Two re-visit policies:
Uniform policy: all pages are re-visited with the same frequency.
Proportional policy: pages that change more often are re-visited more often.
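The commonly used definitions of these two cost functions can be sketched as follows; the parameter names are illustrative:

```python
def freshness(local_copy_is_current):
    """Freshness of a page at time t: 1 if the local copy equals the live page, 0 otherwise."""
    return 1.0 if local_copy_is_current else 0.0

def age(now, last_modified, last_crawled):
    """Age of a page at time t: 0 while the local copy is still current,
    otherwise the time elapsed since the first modification not yet re-crawled."""
    return 0.0 if last_modified <= last_crawled else now - last_modified
```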
Politeness Policy
Crawlers can have a crippling impact on the overall performance of a site. The costs of using Web crawlers include:
Network resources
Server overload
Server/router crashes
Network and server disruption
A partial solution to these problems is the robots exclusion protocol.
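A sketch of honouring the robots exclusion protocol with Python's standard urllib.robotparser; the user-agent string and the fixed delay between requests are illustrative assumptions:

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler/1.0"        # illustrative user-agent string
CRAWL_DELAY_SECONDS = 1.0                # illustrative delay between requests to one host

robots = RobotFileParser()
robots.set_url("http://example.com/robots.txt")
robots.read()                            # download and parse the site's robots.txt

def allowed_to_fetch(url):
    """Only fetch URLs that robots.txt permits for our user-agent."""
    return robots.can_fetch(USER_AGENT, url)

def wait_between_requests():
    """Pause between requests to the same host to avoid overloading the server."""
    time.sleep(CRAWL_DELAY_SECONDS)
```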
Parallelization Policy
The crawler runs multiple processes in parallel. The goals are:
To maximize the download rate.
To minimize the overhead from parallelization.
To avoid repeated downloads of the same page.
The crawling system therefore requires a policy for assigning the new URLs discovered during the crawling process.
DISTRIBUTED WEB CRAWLING
A distributed computing technique whereby search engines employ many computers to index the Internet via web crawling.
The idea is to spread the required computation and bandwidth across many computers and networks.
Types of distributed web crawling:
1. Dynamic Assignment
2. Static Assignment
DYNAMIC ASSIGNMENT
With this approach, a central server assigns new URLs to different crawlers dynamically, which allows the central server to dynamically balance the load of each crawler.
Configurations of crawling architectures with dynamic assignment:
• A small crawler configuration, in which there is a central DNS resolver and central queues per Web site, and distributed downloaders.
• A large crawler configuration, in which the DNS resolver and the queues are also distributed.
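A minimal sketch of the dynamic-assignment idea, in which a hypothetical central dispatcher hands each newly discovered URL to the currently least-loaded crawler (class and method names are assumptions for illustration):

```python
import heapq

class CentralDispatcher:
    """Central server that dynamically assigns new URLs to crawler processes."""

    def __init__(self, crawler_ids):
        # Min-heap of (assigned_url_count, crawler_id): the least-loaded crawler is on top.
        self.load = [(0, crawler_id) for crawler_id in crawler_ids]
        heapq.heapify(self.load)

    def assign(self, url):
        """Return the id of the crawler that should download this URL."""
        count, crawler_id = heapq.heappop(self.load)
        heapq.heappush(self.load, (count + 1, crawler_id))
        return crawler_id

dispatcher = CentralDispatcher(["crawler-1", "crawler-2", "crawler-3"])
print(dispatcher.assign("http://example.com/"))   # -> "crawler-1"
```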
STATIC ASSIGNMENT
• Here a fixed rule is stated from the beginning of the crawl that defines how to assign new URLs to the crawlers.
• A hashing function can be used to transform URLs into a number that corresponds to the index of the corresponding crawling process.
• To reduce the overhead of exchanging URLs between crawling processes when links point from one website to another, the exchange should be done in batches.
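A small sketch of static assignment: hashing the host part of a URL to a crawler index, so that every URL from the same website always goes to the same crawling process (the function name and the number of crawlers are illustrative):

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4   # example number of crawling processes

def assigned_crawler(url, num_crawlers=NUM_CRAWLERS):
    """Map a URL to a crawler index via a stable hash of its host name."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

# All URLs of one website land on the same crawler, which keeps the
# batchable exchange of URLs between processes small.
print(assigned_crawler("http://example.com/a"))   # same index as ...
print(assigned_crawler("http://example.com/b"))   # ... this one
```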
FOCUSED CRAWLING
Focused crawling was first introduced by Chakrabarti. A focused crawler ideally downloads only web pages that are relevant to a particular topic and avoids downloading all others.
It assumes that some labeled examples of relevant and non-relevant pages are available.
STRATEGIES OF FOCUSED CRAWLING
A focused crawler predicts the probability that a link leads to a relevant page before actually downloading the page. A possible predictor is the anchor text of links.
In another approach, the relevance of a page is determined after downloading its content. Relevant pages are sent to content indexing and their contained URLs are added to the crawl frontier; pages that fall below a relevance threshold are discarded.
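A sketch of the first strategy, assuming a hypothetical keyword-based scorer: each link is scored by its anchor text against a set of topic keywords, and the frontier is kept as a priority queue so the most promising links are downloaded first:

```python
import heapq

TOPIC_KEYWORDS = {"crawler", "spider", "indexing", "search"}   # example topic

def anchor_score(anchor_text):
    """Predict link relevance as the fraction of topic words in its anchor text."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(word in TOPIC_KEYWORDS for word in words) / len(words)

frontier = []   # priority queue of (-score, url); highest score comes out first

def add_link(url, anchor_text):
    heapq.heappush(frontier, (-anchor_score(anchor_text), url))

def next_url():
    """Pop the link predicted to be most relevant to the topic."""
    return heapq.heappop(frontier)[1] if frontier else None

add_link("http://example.com/crawler.html", "web crawler basics")
add_link("http://example.com/cooking.html", "pasta recipes")
print(next_url())   # -> "http://example.com/crawler.html"
```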
EXAMPLES
Yahoo! Slurp: Yahoo Search crawler.
Msnbot: Microsoft's Bing web crawler.
Googlebot: Google's web crawler.
WebCrawler: used to build the first publicly available full-text index of a subset of the Web.
World Wide Web Worm: used to build a simple index of document titles and URLs.
Web Fountain: distributed, modular crawler written in C++.
Slug: semantic web crawler.
CONCLUSION
Web crawlers are an important aspect of search engines.
Web crawling processes deemed high-performance are the basic components of various Web services.
It is not a trivial matter to set up such systems:
1. Data manipulated by these crawlers cover a wide area.
2. It is crucial to preserve a good balance between random access memory and disk accesses.
THANK YOU