Nutch in a Nutshell, presented by Liew Guo Min and Zhao Jin
TRANSCRIPT
![Page 1: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/1.jpg)
Nutch in a Nutshell
Presented by Liew Guo Min
Zhao Jin
![Page 2: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/2.jpg)
Outline
- Recap
- Special features
- Running Nutch in a distributed environment (with demo)
- Q&A
- Discussion
![Page 3: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/3.jpg)
Recap
- Complete web search engine: Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
- Java-based, open source
- Features: customizable, extensible, distributed
![Page 4: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/4.jpg)
Nutch as a crawler
[Diagram: the Injector seeds the initial URLs into the CrawlDB; the Generator reads the CrawlDB and generates fetch lists (segments); the Fetcher gets webpages/files from the Web and writes them into the segment; the Parser parses the fetched content; the CrawlDBTool updates the CrawlDB with the results.]
![Page 5: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/5.jpg)
Special Features: Extensible (Plugin system)
- Most of the essential functionalities of Nutch are implemented as plugins
- Three layers:
  - Extension points: what can be extended (Protocol, Parser, ScoringFilter, etc.)
  - Extensions: the interfaces to be implemented for the extension points
  - Plugins: the actual implementations
![Page 6: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/6.jpg)
Special Features: Extensible (Plugin system)
- Anyone can write a plugin:
  - Write the code
  - Prepare the metadata files:
    - plugin.xml: what has been extended by what
    - build.xml: how ant can build your source code
  - Ask Nutch to include your plugin in conf/nutch-site.xml
  - Tell ant to build your plugin in src/plugin/build.xml
- More details @ http://wiki.apache.org/nutch/PluginCentral
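As a rough sketch of the plugin.xml metadata file, a parser plugin's descriptor follows roughly this shape (the ids, names, and class below are invented for illustration, not from the slides):

```xml
<plugin id="parse-example" name="Example Parser Plug-in"
        version="1.0.0" provider-name="example.org">
  <runtime>
    <!-- The jar produced by the plugin's build.xml -->
    <library name="parse-example.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- Declare which extension point is extended, and by what -->
  <extension id="org.example.nutch.parse" name="ExampleParser"
             point="org.apache.nutch.parse.Parser">
    <implementation id="org.example.nutch.parse.ExampleParser"
                    class="org.example.nutch.parse.ExampleParser">
      <parameter name="contentType" value="text/example"/>
    </implementation>
  </extension>
</plugin>
```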
![Page 7: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/7.jpg)
Special Features: Extensible (Plugin system)
- To use a plugin:
  - Make sure you have modified conf/nutch-site.xml to include the plugin
  - Then, either Nutch will automatically call it when needed, or you can write something to load it by its class name and then use it
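Including a plugin is done through the plugin.includes property in conf/nutch-site.xml, whose value is a regular expression over plugin ids; a sketch (the plugin list shown, including parse-example, is illustrative):

```xml
<property>
  <name>plugin.includes</name>
  <!-- Add your plugin's id to this regex so Nutch loads it -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|parse-example|index-basic|query-(basic|site|url)</value>
</property>
```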
![Page 8: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/8.jpg)
Special Features: Distributed (Hadoop)
- MapReduce (diagram): a framework for distributed programming
  - Map: process the splits of data to get intermediate results, plus the keys that indicate what should be put together later
  - Reduce: process the intermediate results with the same key and output the final result
![Page 9: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/9.jpg)
Special Features: Distributed (Hadoop)
- MapReduce in Nutch
  - Example 1: Parsing
    - Input: <url, content> files from fetch
    - Map(url, content) → <url, parse>, by calling the parser plugins
    - Reduce is identity
  - Example 2: Dumping a segment
    - Input: <url, CrawlDatum>, <url, ParseText>, etc. files from the segment
    - Map is identity
    - Reduce(url, value*) → <url, ConcatenatedValue>, by simply concatenating the text representation of the values
![Page 10: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/10.jpg)
Special Features: Distributed (Hadoop)
- Distributed file system
  - Write-once-read-many coherence model → high throughput
  - Master/slave → simple architecture, but a single point of failure
  - Transparent access via Java API
- More info @ http://lucene.apache.org/hadoop/hdfs_design.html
![Page 11: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/11.jpg)
Running Nutch in a distributed environment: MapReduce
- In hadoop-site.xml:
  - Specify the job tracker host & port: mapred.job.tracker
  - Specify the task numbers: mapred.map.tasks, mapred.reduce.tasks
  - Specify the location for temporary files: mapred.local.dir
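A minimal hadoop-site.xml fragment for these MapReduce settings might look like the following (the host name, port, task counts, and paths are placeholders, not values from the slides):

```xml
<configuration>
  <!-- Job tracker host and port -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master.example.com:9001</value>
  </property>
  <!-- Task numbers -->
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
  <!-- Location for temporary files -->
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
```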
![Page 12: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/12.jpg)
Running Nutch in a distributed environment: DFS
- In hadoop-site.xml:
  - Specify the namenode host, port & directory: fs.default.name, dfs.name.dir
  - Specify the location for files on each datanode: dfs.data.dir
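The corresponding DFS fragment of hadoop-site.xml might look like this (again, the host name, port, and directories are illustrative placeholders):

```xml
<configuration>
  <!-- Namenode host and port -->
  <property>
    <name>fs.default.name</name>
    <value>master.example.com:9000</value>
  </property>
  <!-- Directory where the namenode stores its metadata -->
  <property>
    <name>dfs.name.dir</name>
    <value>/home/nutch/filesystem/name</value>
  </property>
  <!-- Location for block data on each datanode -->
  <property>
    <name>dfs.data.dir</name>
    <value>/home/nutch/filesystem/data</value>
  </property>
</configuration>
```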
![Page 13: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/13.jpg)
Demo time!
![Page 14: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/14.jpg)
Q&A
![Page 15: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/15.jpg)
Discussion
![Page 16: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/16.jpg)
Exercises
Hands-on exercises:
- Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
- Repeat the crawling process without using the crawl command
- Modify your configuration to perform each of the following crawl jobs, and think about when they would be useful:
  - Crawl only webpages and PDFs, but nothing else
  - Crawl the files on your hard disk
  - Crawl but do not parse
- (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
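For the first exercise, the one-shot crawl invocation in Nutch of this era looked roughly like the following sketch (the directory names, depth, and topN limit are placeholders to adjust for your setup):

```
# urls/ contains a text file listing the seed URLs, one per line
bin/nutch crawl urls -dir crawl.test -depth 3 -topN 50
```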
![Page 17: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/17.jpg)
Reference
- http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
- http://lucene.apache.org/hadoop/ -- Hadoop homepage
- http://wiki.apache.org/lucene-hadoop/ -- Hadoop wiki
- http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf -- "MapReduce in Nutch"
- http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
- http://www.mail-archive.com/[email protected]/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together
![Page 18: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/18.jpg)
Excursion: MapReduce
- Problem: find the number of occurrences of "cat" in a file. What if the file is 20 GB large? Why not do it with more computers?
- Solution: [Diagram: the file is divided into Split 1 and Split 2; PC1 counts 200 occurrences in one split and PC2 counts 300 in the other; PC1 then combines them into the total of 500.]
![Page 19: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/19.jpg)
Excursion: MapReduce
- Problem: find the number of occurrences of both "cat" and "dog" in a very large file.
- Solution: [Diagram: the input file is divided into Split 1 and Split 2. Map: each PC counts both words in its split (Split 1 → cat: 200, dog: 250; Split 2 → cat: 300, dog: 250) and writes intermediate files. Sort/Group: the intermediate results are grouped by key (cat: 200, 300; dog: 250, 250). Reduce: PC1 sums the cat counts into cat: 500, PC2 sums the dog counts into dog: 500, and the totals are written to the output files.]
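The cat/dog counting above can be sketched as a minimal MapReduce simulation in plain Python (not Hadoop; the splits and counts here are small illustrative stand-ins for the slide's 20 GB file):

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit (word, count) pairs for the words we care about."""
    return [(word, split.count(word)) for word in ("cat", "dog")]

def sort_group(all_pairs):
    """Sort/Group: collect all intermediate values under their key."""
    groups = defaultdict(list)
    for key, value in all_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the per-split counts into a final total."""
    return key, sum(values)

# Two "splits" of the input file, as in the diagram; in a real cluster
# each split would be processed by a different machine.
splits = ["cat dog cat", "dog cat"]
intermediate = [pair for s in splits for pair in map_phase(s)]
groups = sort_group(intermediate)
result = dict(reduce_phase(k, v) for k, v in groups.items())
print(result)  # {'cat': 3, 'dog': 2}
```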
![Page 20: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/20.jpg)
Excursion: MapReduce — Generalized Framework
[Diagram: a Master coordinates the Workers. The input files are divided into Splits 1-4. Map: each worker processes splits and emits intermediate key/value pairs (e.g. k1:v1, k3:v2; k1:v3, k2:v4; k2:v5, k4:v6) into intermediate files. Sort/Group: the pairs are grouped by key (k1:v1,v3; k2:v4,v5; k3:v2; k4:v6). Reduce: workers process each key group and write Output files 1-3.]