Nutch in a Nutshell, presented by Liew Guo Min and Zhao Jin
TRANSCRIPT
![Page 1: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/1.jpg)
Nutch in a Nutshell
Presented by Liew Guo Min
Zhao Jin
![Page 2: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/2.jpg)
Outline
- Recap
- Special features
- Running Nutch in a distributed environment (with demo)
- Q&A
- Discussion
![Page 3: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/3.jpg)
Recap
- Complete web search engine: Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
- Java-based, open source
- Features: customizable, extensible, distributed
![Page 4: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/4.jpg)
Nutch as a crawler
[Diagram: the Injector seeds the initial URLs into the CrawlDB; the Generator reads the CrawlDB and generates fetch lists (segments); the Fetcher gets webpages/files from the Web and writes them into the segment; the Parser parses the fetched content; the CrawlDBTool updates the CrawlDB with the results.]
![Page 5: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/5.jpg)
Special Features: Extensible (Plugin system)
- Most of the essential functionalities of Nutch are implemented as plugins
- Three layers:
  - Extension points: what can be extended (Protocol, Parser, ScoringFilter, etc.)
  - Extensions: the interfaces to be implemented for the extension points
  - Plugins: the actual implementations
![Page 6: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/6.jpg)
Special Features: Extensible (Plugin system)
- Anyone can write a plugin:
  - Write the code
  - Prepare the metadata files:
    - plugin.xml: what has been extended by what
    - build.xml: how ant can build your source code
  - Ask Nutch to include your plugin in conf/nutch-site.xml
  - Tell ant to build your plugin in src/plugin/build.xml
- More details @ http://wiki.apache.org/nutch/PluginCentral
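As a rough sketch of the plugin.xml metadata file, a parser plugin's descriptor follows roughly this shape (the ids, names, and class below are invented for illustration, not from the slides):

```xml
<plugin id="parse-example" name="Example Parser Plug-in"
        version="1.0.0" provider-name="example.org">
  <runtime>
    <!-- The jar produced by the plugin's build.xml -->
    <library name="parse-example.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- Declare which extension point is extended, and by what -->
  <extension id="org.example.nutch.parse" name="ExampleParser"
             point="org.apache.nutch.parse.Parser">
    <implementation id="org.example.nutch.parse.ExampleParser"
                    class="org.example.nutch.parse.ExampleParser">
      <parameter name="contentType" value="text/example"/>
    </implementation>
  </extension>
</plugin>
```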
![Page 7: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/7.jpg)
Special Features: Extensible (Plugin system)
- To use a plugin:
  - Make sure you have modified conf/nutch-site.xml to include the plugin
  - Then, either Nutch will automatically call it when needed, or you can write something to load it by its class name and then use it
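Including a plugin is done through the plugin.includes property in conf/nutch-site.xml, whose value is a regular expression over plugin ids; a sketch (the plugin list shown, including parse-example, is illustrative):

```xml
<property>
  <name>plugin.includes</name>
  <!-- Add your plugin's id to this regex so Nutch loads it -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|parse-example|index-basic|query-(basic|site|url)</value>
</property>
```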
![Page 8: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/8.jpg)
Special Features: Distributed (Hadoop)
- MapReduce (diagram): a framework for distributed programming
  - Map: process the splits of data to get intermediate results, plus the keys that indicate what should be put together later
  - Reduce: process the intermediate results with the same key and output the final result
![Page 9: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/9.jpg)
Special Features: Distributed (Hadoop)
- MapReduce in Nutch
  - Example 1: Parsing
    - Input: <url, content> files from fetch
    - Map(url, content) → <url, parse>, by calling the parser plugins
    - Reduce is identity
  - Example 2: Dumping a segment
    - Input: <url, CrawlDatum>, <url, ParseText>, etc. files from the segment
    - Map is identity
    - Reduce(url, value*) → <url, ConcatenatedValue>, by simply concatenating the text representation of the values
![Page 10: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/10.jpg)
Special Features: Distributed (Hadoop)
- Distributed file system
  - Write-once-read-many coherence model → high throughput
  - Master/slave → simple architecture, but a single point of failure
  - Transparent access via Java API
- More info @ http://lucene.apache.org/hadoop/hdfs_design.html
![Page 11: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/11.jpg)
Running Nutch in a distributed environment: MapReduce
- In hadoop-site.xml:
  - Specify the job tracker host & port: mapred.job.tracker
  - Specify the task numbers: mapred.map.tasks, mapred.reduce.tasks
  - Specify the location for temporary files: mapred.local.dir
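A minimal hadoop-site.xml fragment for these MapReduce settings might look like the following (the host name, port, task counts, and paths are placeholders, not values from the slides):

```xml
<configuration>
  <!-- Job tracker host and port -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master.example.com:9001</value>
  </property>
  <!-- Task numbers -->
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
  <!-- Location for temporary files -->
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
```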
![Page 12: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/12.jpg)
Running Nutch in a distributed environment: DFS
- In hadoop-site.xml:
  - Specify the namenode host, port & directory: fs.default.name, dfs.name.dir
  - Specify the location for files on each datanode: dfs.data.dir
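The corresponding DFS fragment of hadoop-site.xml might look like this (again, the host name, port, and directories are illustrative placeholders):

```xml
<configuration>
  <!-- Namenode host and port -->
  <property>
    <name>fs.default.name</name>
    <value>master.example.com:9000</value>
  </property>
  <!-- Directory where the namenode stores its metadata -->
  <property>
    <name>dfs.name.dir</name>
    <value>/home/nutch/filesystem/name</value>
  </property>
  <!-- Location for block data on each datanode -->
  <property>
    <name>dfs.data.dir</name>
    <value>/home/nutch/filesystem/data</value>
  </property>
</configuration>
```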
![Page 13: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/13.jpg)
Demo time!
![Page 14: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/14.jpg)
Q&A
![Page 15: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/15.jpg)
Discussion
![Page 16: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/16.jpg)
Exercises
Hands-on exercises:
- Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
- Repeat the crawling process without using the crawl command
- Modify your configuration to perform each of the following crawl jobs, and think about when they would be useful:
  - Crawl only webpages and PDFs, but nothing else
  - Crawl the files on your hard disk
  - Crawl but do not parse
- (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
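For the first exercise, the one-shot crawl invocation in Nutch of this era looked roughly like the following sketch (the directory names, depth, and topN limit are placeholders to adjust for your setup):

```
# urls/ contains a text file listing the seed URLs, one per line
bin/nutch crawl urls -dir crawl.test -depth 3 -topN 50
```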
![Page 17: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/17.jpg)
Reference
- http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
- http://lucene.apache.org/hadoop/ -- Hadoop homepage
- http://wiki.apache.org/lucene-hadoop/ -- Hadoop wiki
- http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf -- "MapReduce in Nutch"
- http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
- http://www.mail-archive.com/[email protected]/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together
![Page 18: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/18.jpg)
Excursion: MapReduce
- Problem: find the number of occurrences of "cat" in a file. What if the file is 20 GB large? Why not do it with more computers?
- Solution: [Diagram: the file is divided into Split 1 and Split 2; PC1 counts 200 occurrences in one split and PC2 counts 300 in the other; PC1 then combines them into the total of 500.]
![Page 19: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/19.jpg)
Excursion: MapReduce
- Problem: find the number of occurrences of both "cat" and "dog" in a very large file.
- Solution: [Diagram: the input file is divided into Split 1 and Split 2. Map: each PC counts both words in its split (Split 1 → cat: 200, dog: 250; Split 2 → cat: 300, dog: 250) and writes intermediate files. Sort/Group: the intermediate results are grouped by key (cat: 200, 300; dog: 250, 250). Reduce: PC1 sums the cat counts into cat: 500, PC2 sums the dog counts into dog: 500, and the totals are written to the output files.]
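The cat/dog counting above can be sketched as a minimal MapReduce simulation in plain Python (not Hadoop; the splits and counts here are small illustrative stand-ins for the slide's 20 GB file):

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit (word, count) pairs for the words we care about."""
    return [(word, split.count(word)) for word in ("cat", "dog")]

def sort_group(all_pairs):
    """Sort/Group: collect all intermediate values under their key."""
    groups = defaultdict(list)
    for key, value in all_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the per-split counts into a final total."""
    return key, sum(values)

# Two "splits" of the input file, as in the diagram; in a real cluster
# each split would be processed by a different machine.
splits = ["cat dog cat", "dog cat"]
intermediate = [pair for s in splits for pair in map_phase(s)]
groups = sort_group(intermediate)
result = dict(reduce_phase(k, v) for k, v in groups.items())
print(result)  # {'cat': 3, 'dog': 2}
```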
![Page 20: Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin](https://reader036.vdocuments.net/reader036/viewer/2022082518/56649e9d5503460f94b9ed0f/html5/thumbnails/20.jpg)
Excursion: MapReduce — Generalized Framework
[Diagram: a Master coordinates the Workers. The input files are divided into Splits 1-4. Map: each worker processes splits and emits intermediate key/value pairs (e.g. k1:v1, k3:v2; k1:v3, k2:v4; k2:v5, k4:v6) into intermediate files. Sort/Group: the pairs are grouped by key (k1:v1,v3; k2:v4,v5; k3:v2; k4:v6). Reduce: workers process each key group and write Output files 1-3.]