Web Crawling and Data Gathering with Apache Nutch
DESCRIPTION
Apache Nutch presentation by Steve Watt at Data Day Austin 2011.
TRANSCRIPT
Apache Nutch
Web Crawling and Data Gathering
Steve Watt - @wattsteve
IBM Big Data Lead
Data Day Austin
Topics
Introduction
The Big Data Analytics Ecosystem
Load Tooling
How is Crawl data being used?
Web Crawling - Considerations
Apache Nutch Overview
Apache Nutch Crawl Lifecycle, Setup and Demos
The Offline (Analytics) Big Data Ecosystem
[Diagram: Web Content and Your Content flow through Load Tooling into Hadoop; on top of Hadoop sit Data Catalogs, Analytics Tooling and Export Tooling - Find, Analyze, Visualize, Consume]
Load Tooling - Data Gathering Patterns and Enablers
Web Content
– Downloading – Amazon Public DataSets / InfoChimps
– Stream Harvesting – Collecta / Roll-your-own (Twitter4J)
– API Harvesting – Roll your own (Facebook REST Query)
– Web Crawling – Nutch
Your Content
– Copy from FileSystem
– Load from Database - SQOOP
– Event Collection Frameworks - Scribe and Flume
How is Crawl data being used?
Build your own search engine
– Built-in Lucene indexes for querying
– Solr integration for multi-faceted search
Analytics
– Selective filtering and extraction with data from a single provider
– Joining datasets from multiple providers for further analytics
Event Portal example – Is Austin really a startup town?
Extension of the mashup paradigm - “Content Providers cannot predict how their data will be re-purposed”
Web Crawling - Considerations
Robots.txt
Facebook lawsuit against an API harvester
“No Crawling without written approval” in Mint.com Terms of Use
What if the web had as many crawlers as Apache web servers?
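The considerations above start with robots.txt: well-behaved crawlers, Nutch's fetcher included, honor a site's published crawl rules. A minimal sketch of such a file (the site and paths are hypothetical):

```shell
# Sketch of a typical robots.txt for a hypothetical site.
# Crawlers matching "User-agent: *" should wait 5s between requests
# and skip the disallowed path prefixes.
cat > robots.txt <<'EOF'
User-agent: *
Crawl-delay: 5
Disallow: /private/
Disallow: /cgi-bin/
EOF

# Count the disallowed path prefixes declared above
grep -c '^Disallow:' robots.txt
```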
Apache Nutch – What is it?
Apache Nutch Project – nutch.apache.org
– Hadoop + Web Crawler + Lucene
Hadoop-based web crawler? How does that work?
Apache Nutch Overview
Seeds and Crawl Filters
Crawl Depths
Fetch Lists and Partitioning
Segments - Segment Reading using Hadoop
Indexing / Lucene
Web Application for Querying
Apache Nutch - Web Application
Crawl Lifecycle
Inject
Generate
Fetch
CrawlDB Update
LinkDB
Index
Dedup
Merge
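Each phase of the lifecycle maps onto its own `bin/nutch` subcommand. A sketch of one pass, written to a script for illustration (Nutch 1.x command names; the segment path is whatever `generate` produced last):

```shell
# Sketch: one pass through the crawl lifecycle as individual bin/nutch
# commands (Nutch 1.x-era CLI; run from a Nutch installation directory).
cat > crawl-cycle.sh <<'EOF'
bin/nutch inject crawl/crawldb urls              # Inject: seed the CrawlDB
bin/nutch generate crawl/crawldb crawl/segments  # Generate: produce a fetch list
s=`ls -d crawl/segments/* | tail -1`             # newest segment
bin/nutch fetch $s                               # Fetch: download the pages
bin/nutch updatedb crawl/crawldb $s              # CrawlDB Update: add discovered links
bin/nutch invertlinks crawl/linkdb -dir crawl/segments  # LinkDB: invert the link graph
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*  # Index
bin/nutch dedup crawl/indexes                    # Dedup: drop duplicate documents
bin/nutch merge crawl/index crawl/indexes        # Merge: one queryable index
EOF
```

Repeating generate/fetch/updatedb is what deepens the crawl; the `crawl` convenience command shown later simply loops these phases.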
Single Process Web Crawling
- Create the seed file and copy it into a “urls” directory
- Export JAVA_HOME
- Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (usually via domain)
- Edit the conf/nutch-site.xml and specify an http.agent.name
- bin/nutch crawl urls -dir crawl -depth 2
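The steps above can be sketched as shell commands. `example.com` and `my-test-crawler` are placeholder values; the final crawl command needs an actual Nutch 1.x distribution, so it is left commented:

```shell
# Step 1: seed file in a "urls" directory (placeholder domain)
mkdir -p urls
echo 'http://example.com/' > urls/seed.txt

# Step 3: constrain the crawl to one domain in conf/crawl-urlfilter.txt
mkdir -p conf
cat > conf/crawl-urlfilter.txt <<'EOF'
# accept URLs within example.com, skip everything else
+^http://([a-z0-9]*\.)*example.com/
-.
EOF

# Step 4: identify the crawler; Nutch refuses to fetch without http.agent.name
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
</configuration>
EOF

# Step 5 (requires a Nutch install, with JAVA_HOME exported):
# bin/nutch crawl urls -dir crawl -depth 2
```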
D E M O
Distributed Web Crawling
- The Nutch distribution is overkill if you already have a Hadoop cluster. It's also not how you would really integrate with Hadoop these days, but there is some history to consider. The Nutch wiki covers the distributed setup.
- Why orchestrate your crawl?
- How?
– Create the seed file and copy it into a “urls” directory, then copy the directory up to HDFS
– Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (usually via domain)
– Copy conf/nutch-site.xml, conf/nutch-default.xml, conf/nutch-conf.xml and conf/crawl-urlfilter.txt to the Hadoop conf directory
– Restart Hadoop so the new files are picked up on the classpath
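The setup steps above, sketched as a script (the Hadoop invocations assume a 0.20-era cluster with `$HADOOP_HOME` set and the restart scripts of that era; the domain is a placeholder):

```shell
# Sketch of the distributed setup steps; written to a script here since the
# hadoop commands require a running cluster.
cat > distributed-setup.sh <<'EOF'
mkdir -p urls && echo 'http://example.com/' > urls/seed.txt
bin/hadoop fs -put urls urls                 # copy the seed directory up to HDFS
cp conf/nutch-site.xml conf/nutch-default.xml conf/crawl-urlfilter.txt \
   $HADOOP_HOME/conf/                        # make Nutch config visible to Hadoop
bin/stop-all.sh && bin/start-all.sh          # restart so the classpath picks them up
EOF
```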
- Code Review: org.apache.nutch.crawl.Crawl
- Orchestrated Crawl Example (Step 1 - Inject):
bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls
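Crawl.java simply orchestrates one job class per lifecycle phase, so the later phases can be driven the same way as the Injector. A sketch of the follow-on invocations (class names as of Nutch 1.2; `<segment>` is a placeholder for the directory Generator creates):

```shell
# Sketch: phase-by-phase orchestration via the job jar, mirroring what
# org.apache.nutch.crawl.Crawl does internally (Nutch 1.2 class names).
cat > orchestrated.sh <<'EOF'
bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls
bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments
bin/hadoop jar nutch-1.2.0.job org.apache.nutch.fetcher.Fetcher crawl/segments/<segment>
bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.CrawlDb crawl/crawldb crawl/segments/<segment>
bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.LinkDb crawl/linkdb -dir crawl/segments
EOF
```

Orchestrating the phases yourself is what lets you inject custom MapReduce steps between them instead of treating the crawl as one opaque command.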
D E M O
Segment Reading
Segment Readers
The SegmentReader class is not all that useful. But here it is anyway:
– bin/nutch readseg -list crawl/segments/20110128170617
– bin/nutch readseg -dump crawl/segments/20110128170617 dumpdir
What you really want to do is process each crawled page in M/R as an individual record
– SequenceFileInputFormatters over Nutch HDFS segments FTW
– RecordReader returns Content objects as values
Code Walkthrough
D E M O
Thanks
Questions?
Steve Watt - [email protected]
Twitter: @wattsteve
Blog: stevewatt.blogspot.com
austinhug.blogspot.com