
Page 1: EPL660: Information Retrieval and Search Engines – Lab 8

University of Cyprus
Department of Computer Science
Παύλος Αντωνίου
Office: B109, ΘΕΕ01

Page 2: What is Apache Nutch?

• A production-ready web crawler
• Operates at one of three scales:
  – local filesystem (reliable: no network errors, caching is unnecessary)
  – intranet (local/corporate network)
  – whole web (whole-web crawling is difficult)
• Nutch can run on a single machine (local mode), but gains a lot of its strength from running on a Hadoop cluster (deploy mode)
• Relies on Apache Hadoop data structures, which are great for batch processing
• Open source
• Implemented in Java

Page 3: Nutch Code Bases

• Nutch 1.x:
  – A mature, production-ready crawler
  – Fine-grained configuration
  – Relies on Apache Hadoop data structures
• Nutch 2.x:
  – An emerging alternative that takes direct inspiration from 1.x
  – Differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistent-store mappings
  – Provides an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions

Page 4: Nutch vs Lucene

• Nutch uses Lucene (through Solr or Elasticsearch) for indexing
• Common question: "Should I use Lucene or Nutch?"
  – Simple answer: use Lucene if you don't need a web crawler, i.e. a component that fetches the documents to be indexed
• Nutch is a better fit for sites where
  – you don't have direct access to the underlying data
  – the data comes from disparate sources:
    • multiple domains
    • different document formats: JSON, XML, text, HTML, ...

Page 5: Nutch vs Solr/Elasticsearch

• Nutch is a web crawler
  – collects web pages and other web-accessible resources
  – uses Solr or Elasticsearch for indexing
• Solr/Elasticsearch is a search platform
  – no crawling: it doesn't fetch the data, you have to feed it
  – perfect if you already have the data to be indexed (in XML, JSON, a database, etc.)

Page 6: Nutch building blocks

Page 7: Nutch Data

• Nutch data is composed of:
  – crawl/crawldb
    • contains information about all pages (URLs) known to the crawler and their status, such as the last time the page was visited, its fetching status, refresh interval, content checksum, page importance, etc.
  – crawl/linkdb
    • for each URL known to Nutch, contains a list of other URLs pointing to it (incoming links) and their associated anchor text (from HTML <a href="…">anchor text</a> elements)

Page 8: Nutch Data

  – crawl/segments
    Segments are directories with the following subdirectories:
    • crawl_generate names a set of URLs to be fetched
    • crawl_fetch contains the status of fetching each URL
    • content contains the raw content retrieved from each URL (for indexing)
    • parse_text contains the parsed text of each URL
    • parse_data contains outlinks and metadata parsed from each URL (such as anchor text)
    • crawl_parse contains the outlink URLs, used to update the crawldb
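
For example, listing one segment directory (segments are named by creation timestamp; this name is borrowed from the solrindex example later in this lab) would show exactly these subdirectories:

  ls crawl/segments/20131108063838/
  # -> content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text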

Page 9: Crawling frontier challenge

• There is no authoritative catalog of web pages
• Where to start crawling from?
• Crawlers need to discover their own view of the web universe
  – start from a "seed list" and follow (walk) some (useful? interesting?) outlinks
• There are many dangers in simply wandering around
  – explosion or collapse of the frontier; collecting unwanted content (spam, junk, offensive material)

Page 10: Main Nutch workflow

• Inject: initial creation of CrawlDB
  – insert seed URLs into CrawlDB
  – initial LinkDB is empty
• Generate a new shard's fetchlist (from crawldb to crawl/segments/crawl_generate)
• Fetch raw content
• Parse content (discovers outlinks)
• Update CrawlDB from shards
• Update LinkDB from shards
• Index shards
• Repeat from the Generate step

Command line: bin/nutch inject | generate | fetch | parse | updatedb | invertlinks | index / solrindex

Every step is implemented as one (or more) MapReduce job(s).
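
Putting the steps together, a minimal shell sketch of this loop, using the same commands that are introduced step by step later in this lab (paths and the number of rounds are illustrative):

  bin/nutch inject crawl/crawldb urls          # seed the CrawlDB
  for round in 1 2 3; do                       # repeat generate -> fetch -> parse -> updatedb
      bin/nutch generate crawl/crawldb crawl/segments
      SEG=`ls -d crawl/segments/2* | tail -1`  # newest (timestamp-named) segment
      bin/nutch fetch $SEG
      bin/nutch parse $SEG
      bin/nutch updatedb crawl/crawldb $SEG
  done
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments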

Page 11: Injecting new URLs

1) Specify a list of URLs you want to crawl
2) Use a URL filter
3) Use the injector to add the URLs to the crawldb

Note: filters, normalizers and plugins allow Nutch to be highly modular, flexible and very customizable throughout the whole process.

Page 12: Generate-ing fetchlists

4) Generate a fetch list from the crawldb
5) Create a segment directory for the generated fetch list

Page 13: Fetching content

6) Fetch the segment

Page 14: Content processing

7) Parse the results and update the CrawlDB

Page 15: Link inversion

8) Before indexing, invert all links, so that incoming anchor text can be indexed with pages

Page 16: Link Inversion

• Pages (URLs) have outgoing links (outlinks)
  – … I know where I am pointing to
• Question: who points to me?
  – … I don't know; there is no catalog of pages
  – … NOBODY knows for sure either!
• In-degree may indicate the importance of a page
• Anchor text provides important semantic information
• Answer: invert the outlinks that I know about

Page 17: Link Inversion as MR job

• Goal: compute inlinks for all downloaded and parsed pages
• Input: each page as a pair <srcUrl, ParseData>
  – ParseData contains the page's outlinks (destUrls)
• Map: <srcUrl, ParseData> → <destUrl, Inlinks>
  – where Inlinks: <srcUrl, anchorText>
• Reduce: map output pairs <destUrl, Inlinks> are grouped by destUrl; the Inlinks are appended in a dedicated Java Writable class
• Output: <destUrl, list of Inlinks>
• For example, if pages A and B both link to page C, the map step emits <C, (A, anchorA)> and <C, (B, anchorB)>, and the reduce step for C collects them into <C, [(A, anchorA), (B, anchorB)]>

Page 18: Page importance – scoring

9) Page importance metadata, based on inverted links, is stored in the CrawlDB

Page 19: Indexing

10) Using data from all possible sources (crawldb, linkdb, segments), the indexer creates an index and saves it within the Solr directory. For indexing, the Lucene library is used.

11) Users can search for information regarding the crawled web pages via Solr.

Page 20: Nutch from binary distribution

• Download the Apache Nutch 1.16 binary package from here (you can also download Nutch 2.4)
• Unzip the binary Nutch package
• cd apache-nutch-1.16/
• Confirm correct installation
  – run "bin/nutch"
• If you see "Permission denied"
  – run "chmod +x bin/nutch"

Page 21: Crawl your first website

• Nutch requires two configuration changes before a website can be crawled:
  1. Customize your crawl properties: at a minimum, provide a name for your crawler so that external servers can recognize it
  2. Set a seed list of URLs to crawl

Page 22: Customize your crawl properties

• Default crawl properties: conf/nutch-default.xml
  – mostly remains unchanged
• conf/nutch-site.xml serves as a place to add your own custom crawl properties, which override conf/nutch-default.xml
  – add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, within the <configuration> element:

  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>My Nutch Spider</value>
    </property>
  </configuration>

Page 23: Crawl your first website: Seed list

• A URL seed list is a text file listing websites, one per line, which Nutch will crawl
• Create a URL seed list:
  – mkdir -p urls
  – cd urls
  – nano seed.txt to create a text file seed.txt under urls/ with the following content (one URL per line for each site you want Nutch to crawl):
    • http://nutch.apache.org/
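
Equivalently, the seed file can be created from the shell in one step:

  mkdir -p urls
  echo "http://nutch.apache.org/" > urls/seed.txt    # one URL per line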

Page 24: Configure Regular Expression Filters

• conf/regex-urlfilter.txt provides regular expressions that allow Nutch to filter and narrow the types of web resources to crawl and download
• Edit the file conf/regex-urlfilter.txt and REPLACE

  # accept anything else
  +.

  WITH

  +^http://([a-z0-9]*\.)*nutch.apache.org/

  if, for example, you wish to limit the crawl to the nutch.apache.org domain
• NOTE: if you do not specify any domains to include in regex-urlfilter.txt, all domains linked from your seed URLs will be crawled as well
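
To illustrate what this filter accepts and rejects (the subdomain below is a hypothetical example):

  # +^http://([a-z0-9]*\.)*nutch.apache.org/
  #   http://nutch.apache.org/index.html   -> accepted
  #   http://wiki.nutch.apache.org/        -> accepted (matches the optional subdomain group)
  #   http://lucene.apache.org/            -> rejected (host does not end in nutch.apache.org)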

Page 25: Seeding the crawldb with a list of URLs

• The injector adds URLs to the crawldb:
  – bin/nutch inject crawl/crawldb urls

• STEP 1: FETCHING, PARSING PAGES
• Generate a fetch list for all pages due to be fetched. The fetch list is placed in a newly created segment directory:
  – bin/nutch generate crawl/crawldb crawl/segments
  – The segment directory is named by the time it was created:
    • s1=`ls -d crawl/segments/2* | tail -1`
    • echo $s1
• Run the fetcher on this segment:
  – bin/nutch fetch $s1

Page 26: Seeding the crawldb with a list of URLs

• Parse the entries:
  – bin/nutch parse $s1
• When this is complete, we update the crawldb with the results of the fetch:
  – bin/nutch updatedb crawl/crawldb $s1
• After the first fetch, the crawldb contains both updated entries for all initial pages and new entries for newly discovered pages linked from the initial set.

Page 27: Seeding the crawldb with a list of URLs

• Now we generate and fetch a new segment containing the top-scoring 1,000 pages:
  – bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  – s2=`ls -d crawl/segments/2* | tail -1`
  – bin/nutch fetch $s2
  – bin/nutch parse $s2
  – bin/nutch updatedb crawl/crawldb $s2
• Let's fetch one more round:
  – bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  – s3=`ls -d crawl/segments/2* | tail -1`
  – bin/nutch fetch $s3
  – bin/nutch parse $s3
  – bin/nutch updatedb crawl/crawldb $s3

Page 28: Seeding the crawldb with a list of URLs

• STEP 2: INVERTLINKS
• Before indexing, we first invert all links, so that we may index incoming anchor text with the pages:
  – bin/nutch invertlinks crawl/linkdb -dir crawl/segments

• STEP 3: INDEXING INTO APACHE SOLR [Nutch–Solr integration needed]
• Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]
• Example: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize

Page 29: Seeding the crawldb with a list of URLs

• STEP 4: DELETING DUPLICATES
• Ensures URLs are unique in the index
• Usage: bin/nutch solrdedup <solr url>
• Example: bin/nutch solrdedup http://localhost:8983/solr

• STEP 5: CLEANING SOLR
• Scans the crawldb directory looking for entries with status DB_GONE (404) and sends delete requests to Solr for those documents
• Usage: bin/nutch solrclean <crawldb> <solrurl>
• Example: bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr

Page 30: All In One: Using the Crawl Command

bin/crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>

  -i|--index   Index crawl results into a configured indexer
  -D           A Java property to pass to Nutch calls
  Seed Dir     Directory in which to look for a seeds file
  Crawl Dir    Directory where the crawl/link/segments dirs are saved
  Num Rounds   The number of rounds to run this crawl for

Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2

Page 31: Nutch Command Line Options

Below are some of the command-line options:

• bin/nutch readdb crawlDir/crawldb -stats (print overall statistics of the crawldb)
• bin/nutch readdb crawlDir/crawldb -dump outdump (dump the crawldb as text into the outdump directory)
• bin/nutch readdb crawlDir/crawldb -topN 2 outreaddbtop (dump the top 2 URLs by score)
• bin/nutch readdb crawlDir/linkdb -dump outputlinkdb (dump the linkdb as text)

For more options: http://wiki.apache.org/nutch/CommandLineOptions

Page 32: Nutch deploy mode

• local mode: runs Nutch in a single process on one machine, using Hadoop as a dependency
• deploy mode: takes into account the Hadoop configuration installed on the machine
  – copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml from /usr/local/hadoop/conf to ~/apache-nutch-1.16/conf:
    • sudo cp /usr/local/hadoop/conf/hadoop-env.sh ~/apache-nutch-1.16/conf
    • sudo cp /usr/local/hadoop/conf/hdfs-site.xml ~/apache-nutch-1.16/conf
    • sudo cp /usr/local/hadoop/conf/mapred-site.xml ~/apache-nutch-1.16/conf
    • sudo cp /usr/local/hadoop/conf/core-site.xml ~/apache-nutch-1.16/conf

Page 33: Integrate Solr with Nutch

• https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
• Replace the Solr schema.xml with the Nutch-specific schema.xml
• Run the Solr index command:
  – bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

Page 34: Checking Your Index

• http://solr_server:8983/solr/admin/
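
Once documents are indexed, you can also query Solr directly from the shell; a quick check (the core name <core> depends on your Solr setup and is a placeholder here):

  # Query the index for pages matching "nutch"; replace <core> with your core/collection name
  curl "http://localhost:8983/solr/<core>/select?q=nutch&wt=json"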

Page 35: Useful Links

• http://wiki.apache.org/nutch/NutchTutorial
• http://wiki.apache.org/nutch/
• http://nutch.apache.org/
• http://wiki.apache.org/nutch/CommandLineOptions
• http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html