Web Scale Crawling with Apache Nutch
2 / 30 DigitalPebble Ltd
Based in Bristol (UK)
Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Data Mining
Strong focus on Open Source & Apache ecosystem
User | Contributor | Committer
– Nutch, SOLR, Lucene
– Tika
– GATE, UIMA
– Mahout
– Behemoth
3 / 30 Outline
Overview
Features
Data Structures
Use cases
What's new in Nutch 1.3
Nutch 2.0
GORA
Conclusion
4 / 30 Nutch?
“Distributed framework for large scale web crawling”
– but does not have to be large scale at all
– or even on the web (file-protocol)
Based on Apache Hadoop
Indexing and Search
Open Source – Apache 2.0 License
5 / 30 Short history
2002/2003 : Started by Doug Cutting & Mike Cafarella
2004 : sub-project of Lucene @Apache
2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache
2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache
May 2010 : TLP project at Apache
June 2011 (?) : Nutch 1.3
Q4 2011 (?) : Nutch 2.0
6 / 30 In a Nutch Shell (1.3)
Step by step :
1) Inject → populates CrawlDB from seed list
2) Generate → selects URLs to fetch in segment
3) Fetch → fetches URLs from segment
4) Parse → parses content (text + metadata)
5) UpdateDB → updates CrawlDB (new URLs, new status...)
6) InvertLinks → builds Webgraph
7) SOLRIndex → sends docs to SOLR
8) SOLRDedup → removes duplicate docs based on signature
Repeat steps 2 to 8
Or use the all-in-one 'nutch crawl' command
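The first five steps of the cycle can be sketched as a toy in-memory loop. This is an illustration only: the `WEB` dictionary, the function names and the status strings are hypothetical stand-ins, not Nutch classes or commands, and the link inversion, indexing and dedup steps are left out.

```python
# Toy model of the inject / generate / fetch / parse / updatedb cycle.
# Every name here is a hypothetical stand-in, not the Nutch API.

# A fake "web": URL -> outlinks found when the page is parsed.
WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def inject(crawldb, seeds):
    """Step 1: add seed URLs to the crawl DB as unfetched."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")

def generate(crawldb, top_n):
    """Step 2: select up to top_n unfetched URLs for a new segment."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]

def fetch_and_parse(segment):
    """Steps 3-4: 'fetch' each URL and extract its outlinks (None = failure)."""
    return {url: WEB.get(url) for url in segment}

def updatedb(crawldb, parsed):
    """Step 5: record new statuses and newly discovered URLs."""
    for url, outlinks in parsed.items():
        if outlinks is None:
            crawldb[url] = "failed"
            continue
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")

def crawl(seeds, rounds, top_n=10):
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):              # repeat generate .. update
        segment = generate(crawldb, top_n)
        if not segment:                  # nothing left to fetch
            break
        updatedb(crawldb, fetch_and_parse(segment))
    return crawldb

crawldb = crawl(["http://a.example/"], rounds=3)
print(crawldb)
```

Each pass through the loop corresponds to one segment; running it with `rounds=1` leaves the two discovered URLs in the unfetched state, ready for the next generate step.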
7 / 30 Frontier expansion
Manual “discovery”
– Adding new URLs by hand, “seeding”
Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful - control
– Requires content parsing and link extraction
[Figure: frontier growing outward from the seed over iterations i = 1, i = 2, i = 3]
[Slide courtesy of A. Bialecki]
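A minimal sketch of frontier expansion over a toy link graph, assuming made-up data: the `LINKS` graph and the `keep` rule (which drops `ads/` links to stand in for outlink control) are hypothetical.

```python
# Frontier expansion over a toy link graph (hypothetical data).
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "ads/x"],
    "b": ["c", "d"],
    "c": ["e"],
}

def keep(url):
    """Outlink control: a made-up rule that drops 'ads/' links."""
    return not url.startswith("ads/")

def expand(seeds, iterations):
    """Return the list of new URLs discovered at each iteration."""
    known = set(seeds)
    frontier = list(seeds)
    rounds = []
    for _ in range(iterations):
        discovered = []
        for url in frontier:
            for out in LINKS.get(url, []):
                if keep(out) and out not in known:
                    known.add(out)
                    discovered.append(out)
        rounds.append(discovered)
        frontier = discovered        # next round starts from the new URLs
    return rounds

for i, urls in enumerate(expand(["seed"], 3), start=1):
    print(f"i = {i}: {urls}")
```

Only URLs that pass `keep` and have not been seen before enter the next round, which is why link extraction and filtering both sit on the critical path of discovery.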
9 / 30 An extensible framework
Endpoints
– Protocol
– Parser
– HtmlParseFilter
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
Plugins
– Activated with parameter 'plugin.includes'
– Implement one or more endpoints
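The plugin mechanism can be pictured as a registry filtered by a regular expression. The sketch below assumes that `plugin.includes` holds a regex matched against plugin ids; the two plugin classes, their ids and the `load_plugins` helper are hypothetical and do not reproduce Nutch's plugin loader.

```python
import re

class PrefixFilter:
    """Toy URLFilter: returns the URL if accepted, None if rejected."""
    def filter(self, url):
        return url if url.startswith("http://") else None

class LowercaseNormalizer:
    """Toy URLNormalizer."""
    def normalize(self, url):
        return url.lower()

# Hypothetical registry: plugin id -> (endpoint name, implementation).
AVAILABLE = {
    "urlfilter-prefix": ("URLFilter", PrefixFilter()),
    "urlnormalizer-lowercase": ("URLNormalizer", LowercaseNormalizer()),
}

def load_plugins(plugin_includes):
    """Activate the plugins whose id fully matches the regex."""
    pattern = re.compile(plugin_includes)
    active = {}
    for plugin_id, (endpoint, impl) in AVAILABLE.items():
        if pattern.fullmatch(plugin_id):
            active.setdefault(endpoint, []).append(impl)
    return active

active = load_plugins(r"urlfilter-(prefix|regex)|urlnormalizer-lowercase")
print(sorted(active))
```

Code that needs an endpoint then asks for every active implementation of it, so adding behaviour becomes a configuration change rather than a code change.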
10 / 30 Features
Fetcher
– Multi-threaded fetcher
– Follows robots.txt
– Groups URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive
Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom scoring plugins
Scoring
– OPIC (On-line Page Importance Computation) by default
– LinkRank
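Grouping per host and capping the size of a round can be sketched as follows. This is not Nutch's generator logic: the function name and the `max_per_host` and `max_total` parameters are assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def build_fetch_list(urls, max_per_host, max_total):
    """Group URLs by hostname, cap each host, then cap the whole round."""
    by_host = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if len(by_host[host]) < max_per_host:
            by_host[host].append(url)

    # Interleave hosts so that no single host dominates the round.
    selected = []
    for rank in range(max_per_host):
        for host_urls in by_host.values():
            if rank < len(host_urls) and len(selected) < max_total:
                selected.append(host_urls[rank])
    return selected

urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/1",
]
print(build_fetch_list(urls, max_per_host=2, max_total=3))
```

Lower caps keep the crawl polite towards each host; raising them is the kind of change that makes the fetcher more aggressive.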
11 / 30 Features (cont.)
Protocols
– Http, file, ftp, https
Scheduling
– Specified or adaptive
URL filters
– Regex, FSA, TLD domain, prefix, suffix
URL normalisers
– Default, regex
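A regex URL filter can be sketched as an ordered list of accept (`+`) and reject (`-`) rules where the first matching rule decides and unmatched URLs are dropped. The rule syntax here is modelled on the style of Nutch's regex filter file, but the rules, the parser and the function names are written for this example and should be read as assumptions.

```python
import re

RULES_TEXT = r"""
# skip image and stylesheet suffixes
-\.(gif|jpg|png|css)$
# skip URLs containing query-like characters
-[?*!@=]
# accept anything under example.org
+^https?://([a-z0-9-]+\.)*example\.org/
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def url_filter(url, rules):
    """Return the URL if accepted, None if rejected (first match wins)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return url if accept else None
    return None                      # no rule matched: reject

rules = parse_rules(RULES_TEXT)
for candidate in ("http://www.example.org/index.html",
                  "http://www.example.org/logo.png",
                  "http://other.test/"):
    print(candidate, "->", url_filter(candidate, rules))
```

Rule order matters: putting the reject rules first means a `.png` under example.org is dropped before the accept rule is ever consulted.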
12 / 30 Features (cont.)
Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata
Indexing to SOLR
– Bespoke schema
Parsing with Apache Tika
– But some legacy parsers as well
14 / 30 Data Structures
MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
CrawlDB => status of known pages
CrawlDB
MapFile : <Text,CrawlDatum>
    byte status; [fetched? Unfetched? Failed? Redir?]
    long fetchTime;
    byte retries;
    int fetchInterval;
    float score = 1.0f;
    byte[] signature = null;
    long modifiedTime;
    org.apache.hadoop.io.MapWritable metaData;
Input of : generate - index
Output of : inject - update
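The CrawlDB record can be mirrored as a small Python dataclass keyed by URL. The field names follow the slide; the `Status` labels, the time units and the `mark_fetched` helper are assumptions for illustration, and an MD5 digest is used here as one possible page signature.

```python
import hashlib
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional

class Status(Enum):
    # Hypothetical labels for the states hinted at on the slide.
    UNFETCHED = "unfetched"
    FETCHED = "fetched"
    FAILED = "failed"
    REDIRECT = "redirect"

@dataclass
class CrawlDatum:
    status: Status = Status.UNFETCHED
    fetch_time: int = 0                  # milliseconds (unit is an assumption)
    retries: int = 0
    fetch_interval: int = 0              # seconds (unit is an assumption)
    score: float = 1.0
    signature: Optional[bytes] = None
    modified_time: int = 0
    meta_data: Dict[str, str] = field(default_factory=dict)

def mark_fetched(datum, now_ms, content):
    """Record a successful fetch and a digest of the fetched content."""
    datum.status = Status.FETCHED
    datum.fetch_time = now_ms
    datum.retries = 0
    datum.signature = hashlib.md5(content).digest()

# The CrawlDB itself is a map from URL to CrawlDatum.
crawldb: Dict[str, CrawlDatum] = {"http://example.org/": CrawlDatum()}
mark_fetched(crawldb["http://example.org/"], 1_300_000_000_000, b"<html>hello</html>")
print(crawldb["http://example.org/"].status)
```

Two pages with identical content get the same signature, which is what makes signature-based deduplication possible later in the pipeline.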
15 / 30 Data Structures 2
Segment
/crawl_generate/ → SequenceFile<Text,CrawlDatum>
/crawl_fetch/ → MapFile<Text,CrawlDatum>
/content/ → MapFile<Text,Content>
/crawl_parse/ → SequenceFile<Text,CrawlDatum>
/parse_data/ → MapFile<Text,ParseData>
/parse_text/ → MapFile<Text,ParseText>
Segment => round of fetching
Identified by a timestamp
Can have multiple versions of a page in different segments
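The layout of a segment can be sketched by building its sub-directory paths. The `yyyyMMddHHmmss` naming format, the `crawl/segments` base directory and the helper names are assumptions for the example; the sub-directory names come from the slide.

```python
from datetime import datetime
from pathlib import PurePosixPath

# Sub-directories of a segment and what each one stores (from the slide).
SEGMENT_PARTS = {
    "crawl_generate": "SequenceFile<Text,CrawlDatum>",
    "crawl_fetch": "MapFile<Text,CrawlDatum>",
    "content": "MapFile<Text,Content>",
    "crawl_parse": "SequenceFile<Text,CrawlDatum>",
    "parse_data": "MapFile<Text,ParseData>",
    "parse_text": "MapFile<Text,ParseText>",
}

def segment_name(when):
    """Name a segment after its creation time (format is an assumption)."""
    return when.strftime("%Y%m%d%H%M%S")

def segment_paths(segments_dir, when):
    """Map each part of a segment to its path under segments_dir."""
    base = PurePosixPath(segments_dir) / segment_name(when)
    return {part: str(base / part) for part in SEGMENT_PARTS}

def latest_segment(names):
    """Timestamp names sort chronologically, so max() is the newest round."""
    return max(names)

paths = segment_paths("crawl/segments", datetime(2011, 6, 1, 12, 30, 0))
print(paths["parse_text"])
```

Because a page may be fetched again in a later round, picking the newest segment that contains it is one simple way to choose between versions.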
16 / 30 Data Structures – 3
LinkDB
MapFile : <Text,Inlinks>
Inlinks : HashSet <Inlink>
Inlink :
    String fromUrl
    String anchor
Output of : invertlinks
Input of : SOLRIndex
linkDB => storage for Web Graph
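Link inversion turns "page → outlinks" into "page → inlinks". The sketch below mirrors the `Inlink` fields from the slide; the input shape and the `invert_links` function are assumptions and do not reproduce the MapReduce job.

```python
from collections import defaultdict
from typing import NamedTuple

class Inlink(NamedTuple):
    from_url: str
    anchor: str

def invert_links(outlinks):
    """outlinks: source URL -> list of (target URL, anchor text) pairs."""
    linkdb = defaultdict(set)
    for from_url, links in outlinks.items():
        for to_url, anchor in links:
            linkdb[to_url].add(Inlink(from_url, anchor))
    return dict(linkdb)

outlinks = {
    "http://a.example/": [("http://c.example/", "C page")],
    "http://b.example/": [("http://c.example/", "see C"),
                          ("http://a.example/", "home")],
}
linkdb = invert_links(outlinks)
print(linkdb["http://c.example/"])
```

The anchor texts collected for a target URL describe it in other authors' words, which is why the LinkDB feeds the indexing step.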
18 / 30 Use cases
Crawl for Search Systems
– Web wide or vertical
– Single node to large clusters
– Legacy Lucene-based search or SOLR
… but not necessarily
– NLP (e.g. Sentiment Analysis)
– ML, Classification / Clustering
– Data Mining
– MAHOUT / UIMA / GATE
– Use Behemoth as glueware (http://github.com/jnioche/behemoth)
SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved
20 / 30 NUTCH 1.3
Transition between 1.x and 2.0
http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/
1.3-RC3 => imminent
Removed Lucene-based indexing and search webapp
– delegate indexing / search remotely to SOLR
– change of focus : “Web search application” → “Crawler”
Removed deprecated parse plugins
– delegate most parsing to Tika
Separate local / distributed runtimes
Ivy-based dependency management
21 / 30 NUTCH 2.0
Became trunk in 2010
Same features as 1.3
– delegation to SOLR, TIKA, etc.
Moved to table-based architecture
– Wealth of NoSQL projects in last 2 years
Preliminary version known as NutchBase (Doğacan Güney)
Moved storage layer to subproject in incubator → GORA
22 / 30 GORA
http://incubator.apache.org/gora/
ORM for NoSQL databases
– and limited SQL support
Serialization with Apache AVRO
Object-to-datastore mappings (backend-specific)
Backend implementations
– HBase
– Cassandra
– SQL
– Memory
0.1 released in April 2011
23 / 30 GORA (cont.)
Atomic operations
– Get
– Put
– Delete
Querying
– Execute
– deleteByQuery
Wrappers for Apache Hadoop
– GORAInput|OutputFormat
– GORAMapper|Reducer
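The operations listed above can be pictured with a toy in-memory store. This is not GORA's API: the `MemoryStore` class, its method signatures and the key-range query model are assumptions, and the reversed-host row keys are only an example of a key scheme that keeps pages of one site adjacent.

```python
class MemoryStore:
    """Toy key-value store with the operations named on the slide."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        """Remove one row; report whether it existed."""
        existed = key in self._rows
        self._rows.pop(key, None)
        return existed

    def execute(self, start_key=None, end_key=None):
        """Return sorted (key, value) pairs with start_key <= key <= end_key."""
        return [
            (k, v) for k, v in sorted(self._rows.items())
            if (start_key is None or k >= start_key)
            and (end_key is None or k <= end_key)
        ]

    def delete_by_query(self, start_key=None, end_key=None):
        """Remove every row the query would return; report how many."""
        keys = [k for k, _ in self.execute(start_key, end_key)]
        for k in keys:
            del self._rows[k]
        return len(keys)

store = MemoryStore()
store.put("org.example.www:http/", {"status": "fetched"})
store.put("org.example.www:http/about", {"status": "unfetched"})
store.put("test.other:http/", {"status": "unfetched"})
print(store.execute("org.example.", "org.example.\uffff"))
```

A range query over such keys returns all rows for one domain in a single scan, and `delete_by_query` removes them with the same key range.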
24 / 30 Benefits for Nutch
Storage still distributed and replicated
but one big table
– status, metadata, content, text → one place
Simplified logic in Nutch
– Simpler code for updating / merging information
More efficient
– No need to read / write entire structure to update records
– e.g. update step in 1.x
Easier interaction with other resources
– Third-party code just needs to use GORA and schema
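The efficiency argument can be illustrated by counting how many records each style of update has to write. Both functions are simplified sketches with hypothetical names: the first models a file-based update that rewrites the whole structure, the second a table-based update that writes only the affected rows.

```python
def update_by_rewrite(crawldb, updates):
    """File-style sketch: merge old DB and updates into a brand-new DB."""
    new_db = {}
    touched = 0
    for url, datum in crawldb.items():       # every existing record is rewritten
        new_db[url] = {**datum, **updates.get(url, {})}
        touched += 1
    for url, datum in updates.items():       # plus the newly discovered ones
        if url not in crawldb:
            new_db[url] = dict(datum)
            touched += 1
    return new_db, touched

def update_in_place(table, updates):
    """Table-style sketch: only rows named in the updates are written."""
    touched = 0
    for url, datum in updates.items():
        table.setdefault(url, {}).update(datum)
        touched += 1
    return touched

old_db = {f"http://example.org/{i}": {"status": "unfetched"} for i in range(1000)}
updates = {
    "http://example.org/7": {"status": "fetched"},
    "http://example.org/new": {"status": "unfetched"},
}

rewritten, rewrite_cost = update_by_rewrite(old_db, updates)
table = {url: dict(datum) for url, datum in old_db.items()}
in_place_cost = update_in_place(table, updates)
print(rewrite_cost, in_place_cost)
```

Both approaches end with the same data, but the rewrite cost grows with the size of the whole DB while the in-place cost grows only with the number of changed rows.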
25 / 30 Status Nutch 2.0
Beta stage
– debugging / testing required
Compare performance of GORA backends
Need to update documentation / WIKI
Enthusiasm from community
GORA – next great project coming out of Nutch?
26 / 30 Future
Delegate code to crawler-commons (http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– Robots.txt parsing
– URL normalisation / filtering
New functionalities
– Sitemap
– Canonical tag
– More indexers (e.g. ElasticSearch) + pluggable indexers?
Definitive move to 2.0?
– Contribute backends and functionalities to GORA
28 / 30 Where to find out more?
Project page : http://nutch.apache.org/
Wiki : http://wiki.apache.org/nutch/
Mailing lists :
– [email protected]
– [email protected]
Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...
Support / consulting :
– http://wiki.apache.org/nutch/Support
– [email protected]
29 / 30 Questions?