Crawlers - March 2008
(Web) Crawlers Domain
Presented by:
Or Shoham
Amit Yaniv
Guy Kroupp
Saar Kohanovitch
Crawlers
Introduction
Crawler Basics
Domain Terminology
In-Depth Domain Elaboration
Application Examples
UM Domain Analysis
CM Domain Analysis
Lessons Learned
Conclusion
Introduction
A little bit about search engines
How do search engines work?
Why are crawlers needed?
Many names – same meaning:
crawler, spider, robot, BOT, Grub, spy
The Google Phenomenon
Founders Larry Page and Sergey Brin, September 1998
Crawler Basics
What is a crawler?
How do crawlers work?
Crawling web pages:
What pages should the crawler download?
How should the crawler refresh pages?
How should the load on the visited web sites be minimized?
How do crawlers index web pages?
Link indexing
Text indexing
How do crawlers save data?
Scalability: distribute the repository across a cluster of computers and disks
Large bulk updates: the repository needs to handle a high rate of modifications
Obsolete pages: must have a mechanism for detecting and removing obsolete pages
Domain Terminology
Link – HTML code which redirects the user to a different web page
URL – Uniform Resource Locator; an Internet World Wide Web address
Seeds – a set of URLs which are the crawler's starting point
Parser – the element responsible for extracting links from pages
Thread – an independent execution stack within the same process
Queue – the element which holds the retrieved URLs
Politeness Policy – a common set of rules intended to protect sites from over-abuse while crawling them
Repository – the resource which stores all the crawler's retrieved data
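The terms above fit together in one simple loop: the seeds prime the queue, a "parser" step extracts links from each page, and everything retrieved lands in the repository. A minimal sketch in Python — the `TOY_WEB` mapping and every URL in it are invented stand-ins for real pages, not part of any actual crawler:

```python
from collections import deque

# A toy in-memory "web": URL -> list of links found on that page.
# All URLs here are invented for illustration.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def crawl(seeds, max_pages=10):
    """Breadth-first crawl: seeds prime the queue, the 'parser' step
    extracts links, and the repository records every visited URL."""
    queue = deque(seeds)   # the Queue of retrieved URLs
    repository = []        # stores the crawler's retrieved data
    seen = set(seeds)
    while queue and len(repository) < max_pages:
        url = queue.popleft()
        repository.append(url)
        for link in TOY_WEB.get(url, []):  # "parsing" the page
            if link not in seen:           # never re-visit a page
                seen.add(link)
                queue.append(link)
    return repository
```

A real crawler would additionally enforce the politeness policy (rate limits, robots rules) before each fetch.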
Domain Elaboration
Rules which apply to the domain:
All crawlers have a URL Fetcher
All crawlers have a Parser (Extractor)
Crawlers are multi-threaded processes
All crawlers have a Crawler Manager
Strongly related to the search engine domain
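The mandatory components listed above can be sketched as a class skeleton. This is only an illustration of the rules, not the design of any particular crawler: the stubbed HTML, the class names, and the `step` method are all assumptions.

```python
import re

class URLFetcher:
    def fetch(self, url):
        # A real fetcher would issue an HTTP GET; stubbed out here.
        return f"<html><a href='{url}/next'>next</a></html>"

class Parser:  # also called an Extractor
    def extract_links(self, html):
        # Naive link extraction via regex, good enough for the stub above.
        return re.findall(r"href='([^']+)'", html)

class CrawlerManager:
    """Ties the mandatory components together; a real crawler would run
    many such workers as threads."""
    def __init__(self):
        self.fetcher = URLFetcher()
        self.parser = Parser()

    def step(self, url):
        # One crawl step: fetch a page, then extract its links.
        return self.parser.extract_links(self.fetcher.fetch(url))
```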
Application Examples
Many different crawlers doing different things:
WebCrawler
Google Crawler
Heritrix
Mirroring applications
User Modeling
User Modeling: Class Diagram
Main classes:
Spider: The Spider is the base component of the crawler. While each spider has its own unique way of performing, most spiders contain the same basic features:
User Modeling: Class Diagram
Features:
Run/Kill: activation and deactivation of the spider.
Update: updating running parameters.
To get the requested URLs, the spider uses:
User Modeling: Class Diagram
To get the requested URLs, the spider uses:
URL FETCH NOW:
This is the basic class that actually fetches the URLs.
The basic features are:
URLFetchNow: activation of the class.
Get/Fetch URL: gets the URL.
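The two features above — activation, then fetching — can be sketched as follows. The method names, the activation flag, and the stubbed response dictionary are assumptions for illustration; the real network call is stubbed out.

```python
class URLFetchNow:
    """Sketch of the fetching class: it must be activated first,
    then it fetches URLs handed to it."""
    def __init__(self):
        self.active = False

    def activate(self):        # "URLFetchNow: activation of the class"
        self.active = True

    def fetch_url(self, url):  # "Get/Fetch URL: gets the URL"
        if not self.active:
            raise RuntimeError("fetcher not activated")
        # A real implementation would perform an HTTP request here.
        return {"url": url, "status": 200, "body": ""}
```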
User Modeling: Class Diagram
To configure the SPIDER's parameters:
SPIDER CONFIG:
This is the basic class that sets the SPIDER's configuration and lets the SPIDER update itself.
Features:
Set/Get Configuration.
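A minimal sketch of such a configuration class. The parameter names (`max_depth`, `delay`) are invented examples of the "running parameters" the Update feature would touch:

```python
class SpiderConfig:
    """Holds the spider's running parameters; the spider reads them
    and updates itself when they change."""
    def __init__(self, **params):
        self._params = dict(params)

    def set_configuration(self, **changes):
        # Merge new running parameters into the existing configuration.
        self._params.update(changes)

    def get_configuration(self):
        # Return a copy so callers cannot mutate internal state.
        return dict(self._params)
```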
User Modeling: Class Diagram
To sort results we are going to need some kind of data structure. The most common is a queue:
URL QUEUE HANDLER:
A class containing a queue, or any other data structure, which sorts results.
Features:
Queue/Dequeue.
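Since the handler is meant to sort results, a priority queue is a natural fit. A sketch using Python's `heapq` — the numeric scores are invented ranking values, and the lowest score dequeues first:

```python
import heapq

class URLQueueHandler:
    """Priority queue of URLs: dequeue always returns the URL with
    the best (lowest) score."""
    def __init__(self):
        self._heap = []

    def queue(self, url, score=0):
        # heapq keeps the smallest (score, url) tuple at the front.
        heapq.heappush(self._heap, (score, url))

    def dequeue(self):
        return heapq.heappop(self._heap)[1]
```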
User Modeling: Class Diagram
To make search and result handling more efficient we are going to use an:
INDEXER:
The INDEXER is a class that sets the most effective index and lets the spider use and set it.
Features:
Set/Get Index().
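One common form such an index takes is an inverted index (word → pages containing it); the slide does not name a specific structure, so this is an assumed interpretation of Set/Get Index:

```python
class Indexer:
    """Tiny inverted index: maps each word to the set of URLs
    containing it, so the spider can set and reuse it between crawls."""
    def __init__(self):
        self._index = {}

    def set_index(self, url, text):
        # Index every whitespace-separated word of the page text.
        for word in text.lower().split():
            self._index.setdefault(word, set()).add(url)

    def get_index(self, word):
        return self._index.get(word.lower(), set())
```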
User Modeling: Class Diagram
To control the SPIDER, an entity has to have access to create and kill it; this entity will be updated by the queue or the SCHEDULER. For this we are going to use a:
CRAWLER MANAGER:
The MANAGER is a class that makes the call whether the spider is created or killed.
Features:
Update By Scheduler/Queue: enables the queue/scheduler to inform the manager about ongoing activity.
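A sketch of that lifecycle decision: the scheduler/queue reports pending work, and the manager creates or kills spiders accordingly. The `pending`/`manage` mechanics are an assumed interpretation of the slide, not a documented design:

```python
class Spider:
    def __init__(self):
        self.alive = True

    def kill(self):
        self.alive = False

class CrawlerManager:
    """Decides whether spiders are created or killed, based on the
    activity reported by the queue/scheduler."""
    def __init__(self):
        self.spiders = []
        self.pending = 0

    def update_by_scheduler(self, pending_urls):
        # "Update By Scheduler/Queue": report ongoing activity.
        self.pending = pending_urls

    def manage(self):
        if self.pending > 0 and not self.spiders:
            self.spiders.append(Spider())    # work waiting: create one
        elif self.pending == 0:
            for s in self.spiders:
                s.kill()                     # nothing left: kill them
            self.spiders.clear()
        return len(self.spiders)
```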
User Modeling: Class Diagram
In most cases we are going to use a database to store our results; for this we're going to use a class that will communicate with the DB:
STORAGE MANAGER:
The STORAGE MANAGER is a class that writes the crawl results to the DB.
Features:
Sort Info(): the MANAGER sorts the info prior to writing it to the DB.
Write To DB(): writes crawl results to the DB.
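The sort-then-write contract can be sketched like this; the in-memory list stands in for a real database connection, which is an assumption made for the example:

```python
class StorageManager:
    """Sorts crawl results before writing them; self.db is an
    in-memory list standing in for a real database."""
    def __init__(self):
        self.db = []

    def sort_info(self, results):
        # "Sort Info(): sorts the info prior to writing it to the DB"
        return sorted(results)

    def write_to_db(self, results):
        # "Write To DB(): writes crawl results to the DB"
        self.db.extend(self.sort_info(results))
        return len(self.db)
```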
User Modeling: Sequence (1)
Getting schedule: the MANAGER is getting the next schedule.
Crawler Manager
Scheduler
User Modeling: Sequence (2)
Creating a new Spider: the MANAGER is creating a new Spider.
Crawler Manager
SPIDER
User Modeling: Sequence (3)
Creating a new search: the MANAGER tells the SPIDER to start a search.
Crawler Manager
SPIDER
User Modeling: Sequence (4)
Getting an index: the SPIDER is getting an index for the next crawl.
Crawler Manager
SPIDER
User Modeling: Sequence (5)
Actual fetching of URLs: the SPIDER is activating URL fetching.
SPIDER
URL FETCH NOW
User Modeling: Sequence (6)
Queuing results: the SPIDER is sending results to the queue.
SPIDER
URL QUEUE HANDLER
User Modeling: Sequence (7)
Dequeuing results: the SPIDER is dequeuing sorted results.
SPIDER
URL QUEUE HANDLER
User Modeling: Sequence (8)
Writing to DB: the SPIDER is sending sorted results to the DB.
SPIDER
STORAGE HANDLER
User Modeling: Sequence (9)
Update Scheduler: the Queue Handler updates the Scheduler.
QUEUE HANDLER
SCHEDULER
User Modeling: Sequence (10)
Update Manager: the Scheduler updates the Manager.
SCHEDULER
CRAWLER MANAGER
User Modeling: Sequence (11)
Kill SPIDER: the Manager kills the SPIDER at the end of the process.
CRAWLER MANAGER
SPIDER
Domain Patterns
How can a crawler cope with new page standards and conventions?
Fetch new standard pages
Index new standard pages
Factory Design Pattern
Domain Patterns (2)
The Parser class as a Factory design:
parses different page types: HTML, PDF, Word, etc.
The URL Fetcher class as a Factory design:
fetches pages over different protocols and conventions: UDP, TCP/IP, FTP, IPv6
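A sketch of the Factory idea applied to the parser: supporting a new page standard means registering one new class, with no change to client code. The class names and tagged return values are invented for the example:

```python
class HTMLParser:
    def parse(self, content):
        return f"html:{content}"

class PDFParser:
    def parse(self, content):
        return f"pdf:{content}"

def parser_factory(page_type):
    """Factory: returns the right parser for a page type. A new
    standard only requires adding an entry to this mapping."""
    parsers = {"html": HTMLParser, "pdf": PDFParser}
    return parsers[page_type]()
```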
How do we ensure we have only one Crawler Manager, Queue and Repository?
Singleton Design Pattern
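A minimal Singleton sketch for the shared queue; the same pattern would apply to the manager and repository. The `CrawlerQueue` name is invented for illustration:

```python
class CrawlerQueue:
    """Singleton: every part of the crawler that asks for the queue
    gets the same instance."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            # First call creates the single instance and its state.
            cls._instance = super().__new__(cls)
            cls._instance.urls = []
        return cls._instance
```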
User Modeling: Lessons
A problem: too little info or too much info?
Scoping: where does a crawler begin and where does it end?
What is a general feature and what is a specific feature?
Code varies more than the domain.
Automatic reverse engineering or manual?
Code Modeling
Code Modeling – Reverse Engineering – Applications (1)
Applications which were reverse engineered:
Arale, WebEater – basic web crawlers for file downloading (for offline viewing)
JoBo – advanced web crawler for file downloading (for offline viewing)
Heritrix – advanced distributed crawler for file downloading (to archives)
HyperSpider – basic crawler for displaying hyperlink trees
Code Modeling – Reverse Engineering – Applications (2)
Nutch (Lucene) – advanced distributed crawler / search engine for indexing
WebSphinx – crawler framework for mirroring and hyperlink tree display
Aperture – advanced crawler able to read HTTP, FTP, local files, for indexing
Code Modeling – Reverse Engineering – CASE Tool
Reverse engineering using Visual Paradigm for UML
Used only for class diagrams – use case and sequence diagrams were modeled by hand based on classes, usage and documentation
Good results for small applications, poor results for large applications (too much noise made the signal hard to find)
Application class: A single class containing the main application elements, starts the crawling sequence based on parameters
Page Manager (Page): Class holding all data relevant to a web (or local) page, may save entire page or only summary / relevant parts
Parameters: Class holding parameters required for the application to run
Robots: Class containing information on pages the crawler may not visit
Queue: Class containing a list of links (pages) the crawler should visit
Thread: Class containing information required for each crawler thread
Listener: Class responsible for receiving pages from the internet
Extractor: Class responsible for parsing pages and extracting links for queue
Filters: Classes responsible for deciding if a link should be queued or visited
Helpers: Classes responsible for helping the crawler deal with forms, cookies, etc.
DB / Merger / External DB: Classes required for saving data into databases for local / distributed applications with DBs
Code Modeling – Sequence (1)
Code Modeling – Sequence (2)
Code Modeling – Sequence (3)
Code Modeling – Sequence (4)
Code Modeling – Results Example
Code Modeling - Conclusions
Very difficult to reach domain-level abstraction based on code modeling
VP not very helpful in dealing with large applications (clutter)
Difficult to understand sequences and use cases correctly (no R.E. at all)
Documentation was often the most helpful tool for code modeling, rather than R.E.
Domain Modeling with ADOM
ADOM was helpful in establishing domain requirements
Difficult to model when many optional entities exist, some of which heavily impact class relations and sequences
ADOM was not very helpful with abstraction, but that may be a function of the domain itself (functional)
End results difficult to read, but seem to provide a good domain framework for applications
Domain Problems and Issues
The crawler domain contains many functional entities which do not necessarily store information (difficult to model)
Many optional controller / manager entities (clutter with relations)
Vast difference in application scale
Entity / function containment
Future Work (1)
Merging Code Modeling and User Modeling will be difficult:
User modeling focused mostly on large-scale crawlers (research focuses on these)
Mostly from a search engine perspective
Schedule-oriented
High level of abstraction
Future Work (2)
Code modeling focused mostly on smaller applications (easier to model, available)
Focus mostly on archival / mirroring
User-oriented
Medium level of abstraction
Future Work (3)
Merged product entities should be closer to User Modeling than Code Modeling (higher level of abstraction)
User vs. schedule
Indexing vs. archiving
Importance of optional entities
Web Crawlers Domain
Thank you
Any questions?