
    Homework: Crawling the World Wide Web with Apache Nutch

    1. Objective

    In Homework: Tika you used Tika parsers to extract text from the provided PDF files and helped search
    for UFOs in the FBI's vault dataset.

    But it's time to take a step back. Let's consider: how did we obtain all those PDF files in the first place?
    The short answer is that we crawled the FBI's Vault site http://vault.fbi.gov/, downloaded all of the
    relevant PDFs from that site, dropped them in a directory, and then packaged the whole directory into a
    tarball.

    We used Apache Nutch (http://nutch.apache.org/) to collect the vault.tar.gz dataset. Now, you get to
    do it too! You'd think getting all the PDF files would be a snap. That would be true, of course, if they
    were named with URLs ending in .pdf, and provided that the PDFs were all available from a similar
    directory structure, like http://vault.fbi.gov/pdfs/xxx.pdf. Unfortunately, they're not.

    Your goal in this assignment is to download, install and leverage the Apache Nutch web crawling
    framework to obtain all of the PDFs from a tiny subset of the FBI's Vault website that we have mirrored
    at http://fs.vsoeclassrooms.usc.edu/minivault/, and then to package those PDFs in order to create your
    own vault.tar.gz file.

    1.1 Individual Work

    This assignment is to be done as individual work only.

    2. Preliminaries

    We will be making use of the UNIX servers of the Student Compute Facility (SCF) at aludra.usc.edu for
    this exercise.[1] We will assume a working knowledge of the SCF from CSCI571 or another prerequisite
    course; if you are not familiar with this, please look up the ITS documentation available at:

    http://www.usc.edu/its/unix/servers/scf.html

    http://www.usc.edu/its/unix/

    http://www.usc.edu/its/ssh/putty.html

    http://www.usc.edu/its/ssh/

    http://www.usc.edu/its/sftp/

    [1] If you are an experienced Unix/Linux user, with the OS already installed, you may want to consider
    doing this on your own computer. Check with the TA first to obtain permission before proceeding.

    2.1 Making Space

    This exercise will require the use of a very large portion of your SCF disk quota, so if you have used your

    UNIX account for other courses or other purposes in the past, then odds are that you will need to back up

    your data onto your own local computer.[2]

    This exercise will require at least:

    Nutch: 65MB unpacked

    Crawl data: up to 200MB

    Extracted PDFs: approximately 100MB

    plus other assorted logs and outputs

    We recommend that you have more than 400 MB free on your account before beginning this exercise!

    You can check your current disk usage using the command:

    quota -v

    To see how much space is occupied by the contents of each of your directories, you can use:

    du -k
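
    For example (a small variant of the command above), to see a sorted per-directory summary of usage in
    kilobytes for your home directory, you can combine du with sort:

    du -sk ~/* | sort -n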

    Further information on managing the storage space available on your SCF account can be found at:

    http://www.usc.edu/its/accounts/unixquotas.html

    WARNING: Even done correctly, this exercise requires a very large portion of your SCF disk quota. Be
    very careful with your settings; if you botch something, you may end up exceeding your available disk
    space and/or encounter other technical issues that can potentially be painful to fix. In particular, pay
    close attention to the scope of the crawl to ensure that it is properly confined to our mirror site and
    does not extend beyond it to the FBI websites or, worse, the entire Internet!

    2.2 Setting up Java

    Nutch requires a more recent version of Java than the default configured on your SCF account.

    You can check your current version of Java on aludra by running:

    aludra.usc.edu(1): java -version

    java version "1.6.0_23"

    Java(TM) SE Runtime Environment (build 1.6.0_23-b05)

    Java HotSpot(TM) Server VM (build 19.0-b09, mixed mode)

    aludra.usc.edu(2):

    If you get anything lower than 1.6.0, you will need to change it! Full instructions are in Appendix A.

    [2] It is probably a good idea to back all your data up before you begin this exercise anyway, in case
    something goes wrong.

    3. Crawling using Apache Nutch

    3.1 Downloading and Installing Apache Nutch

    To get started with Apache Nutch, grab the latest release, available from:

    http://www.apache.org/dyn/closer.cgi/nutch/

    The release of Nutch that we will use in this assignment is the latest, Nutch 1.6. Grab the binary
    distribution (the bin.tar.gz file). You can download it directly into your home directory on aludra
    using wget, substituting a mirror host from the page above:

    wget http://<mirror>/apache/nutch/1.6/apache-nutch-1.6-bin.tar.gz

    And then unpack it:

    gtar -xzvf apache-nutch-1.6-bin.tar.gz

    Make sure to chmod +x bin/nutch, and then test it out by running bin/nutch and checking that you see
    command-line output.
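
    Concretely, assuming the tarball unpacks into a directory named apache-nutch-1.6 (the usual layout of
    the binary release), that amounts to:

    cd apache-nutch-1.6
    chmod +x bin/nutch
    bin/nutch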

    Note: Once you have finished unpacking Nutch, you should delete the bin.tar.gz to free up disk space.

    3.2 Configuring Nutch

    The best way to get started on configuring Nutch for your Web crawl is to read Sections 3 and 3.1 of

    the Quick Start on the wiki:

    http://wiki.apache.org/nutch/NutchTutorial

    Pay particular attention to the configuration files that you will need to change/update.

    Note: You must set the agent name and its related properties appropriately (this will be graded!).
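
    For reference, a minimal sketch of what the agent-name override might look like in conf/nutch-site.xml
    is shown below (the value is only a placeholder; see the tutorial and conf/nutch-default.xml for the
    related http.agent.* properties, which should be filled in similarly):

    <?xml version="1.0"?>
    <configuration>
      <!-- Identify your crawler to the sites you visit; the value below is a placeholder. -->
      <property>
        <name>http.agent.name</name>
        <value>CSCI572 Homework Crawler (your name here)</value>
      </property>
    </configuration>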

    As in previous assignments, you are advised to do any and all testing using a small crawl. You do not

    want to spend several hours running a crawl only to find that you botched it and wasted your time.

    Caution: Be very careful about the scope of your crawl! If your crawl is not properly confined to our
    mirror site and extends beyond it to the FBI websites or, worse, the entire Internet, you may end up
    filling all available disk space and/or encounter other technical issues.

    3.2.1 Politeness

    Crawlers can retrieve data much more quickly and in greater depth than human users, so they can have a
    crippling impact on the performance of a site. A server would naturally have a hard time keeping up with
    requests from multiple crawlers, as some of you may have already discovered while doing the earlier
    crawler assignment!

    Except this time all of you are going to be torturing the same server. Please be kind to it; if it dies
    under your bombardment, none of you will be able to complete the assignment!

    To avoid overloading the server, you should set an appropriate interval between successive requests to
    the same server. (Do not make your crawl any more aggressive than Nutch's default settings!)
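
    For reference, the interval between successive requests to the same host is controlled by the
    fetcher.server.delay property. In Nutch 1.x the shipped default in conf/nutch-default.xml is 5.0 seconds
    (verify this in your own copy); if you override it in conf/nutch-site.xml, keep it at or above the
    default, for example:

    <property>
      <name>fetcher.server.delay</name>
      <!-- Seconds to wait between requests to the same server; do not go below the default. -->
      <value>5.0</value>
    </property>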

    3.3 Crawling Hints

    For sanity's sake, please use a wired connection, if possible, to do your crawling.

    Configuring Nutch to perform this task will likely be one of the trickiest parts of the assignment. Pay particular attention to the RegexURLFilter required to crawl the site, and to the crawl command

    arguments/parameters.
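
    As a rough sketch (assuming the stock conf/regex-urlfilter.txt that ships with Nutch 1.6), one common
    way to confine the crawl is to replace the catch-all +. rule at the bottom of that file with a rule that
    only accepts URLs under the mirror:

    # accept only URLs under the minivault mirror
    +^http://fs\.vsoeclassrooms\.usc\.edu/minivault/

    Similarly, a typical invocation of the one-step crawl command in Nutch 1.6 looks like the following,
    where urls/ is a directory containing a seed file (e.g., urls/seed.txt) that lists the start URL; the
    depth and topN values below are placeholders that you will need to tune for this assignment:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50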

    It may be helpful to do manual inspection of the website to gain a better understanding of what you are dealing with, and what settings you need to adjust.

    o The site that you are crawling, http://fs.vsoeclassrooms.usc.edu/minivault/, was created by deleting a
      large portion of our original mirror site in order to reduce its size. There is no need to be alarmed
      by the 404 errors that you encounter; those files were intentionally deleted.

    o Consequently, you may find it more useful to inspect the larger version of the mirror at
      http://fs.vsoeclassrooms.usc.edu/vault/ or the original site at http://vault.fbi.gov/. However, DO NOT
      CRAWL THOSE SITES! Crawling them will exceed the space you have.

    You should observe that some of the URLs include spaces; these need to be percent-encoded (each space
    becomes %20) to normalize the URL into a valid form that Nutch can work with.

    Some of the files we want to crawl are pretty big.

    4. Extracting the PDFs from the Sequence File compressed format

    Nutch uses a compressed SequenceFile binary format to represent the data that it downloads. Data is
    stored in SequenceFiles, which are part of segments on disk (each segment being a splittable portion of
    the crawl). Nutch does this both for efficiency and to allow multiple fetches to be distributed across a
    cluster using the Apache Hadoop MapReduce framework.

    After successfully crawling the FBI vault mirror site, you should have a crawl folder with a bunch of
    segments in it, and inside of those segments, SequenceFiles with the content stored inside.

    To get some idea of how to extract data from Nutch SequenceFiles, have a look at this blog post:

    http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html

    Your goal in this portion of the assignment is to write a Java program, PDFExtractor.java, that will
    extract the PDF files out of the SequenceFile format and write the PDFs to disk. Your program should be
    executed with arguments indicating the source (crawl data) directory and the output (PDF files)
    directory, like this:

    java PDFExtractor <crawlDir> <outputDir>

    The result of this command will be to extract all PDF files from <crawlDir>, and to output them as
    individually named PDF files in the directory identified by <outputDir>.
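
    To give a sense of the approach, below is a minimal, unofficial sketch of such a program. It assumes the
    Hadoop 1.x SequenceFile API, the Nutch 1.6 Content class, and the usual local segment layout
    (<crawlDir>/segments/<segment>/content/part-XXXXX/data); verify those details against your own crawl
    data and the blog post above, and treat this as a starting point rather than a reference solution. The
    Nutch and Hadoop jars that ship with the Nutch distribution must be on your classpath to compile and
    run it.

    // PDFExtractor.java -- a minimal, unofficial sketch, not a reference solution.
    // Assumes the Hadoop 1.x SequenceFile API, the Nutch 1.6 Content class, and the
    // usual local segment layout: <crawlDir>/segments/<segment>/content/part-XXXXX/data.
    import java.io.File;
    import java.io.FileOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;
    import org.apache.nutch.protocol.Content;

    public class PDFExtractor {

      public static void main(String[] args) throws Exception {
        File segmentsDir = new File(args[0], "segments"); // args[0]: crawl data directory
        File outDir = new File(args[1]);                   // args[1]: output directory for PDFs
        outDir.mkdirs();

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        int count = 0;

        for (File segment : segmentsDir.listFiles()) {
          File contentDir = new File(segment, "content");
          if (!contentDir.isDirectory()) continue;

          for (File part : contentDir.listFiles()) {   // part-00000, part-00001, ...
            File data = new File(part, "data");        // the MapFile's data file is a SequenceFile
            if (!data.isFile()) continue;

            SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path(data.getAbsolutePath()), conf);
            Text url = new Text();                      // key: the fetched URL
            Writable value = (Writable)                 // value: a Nutch Content record
                ReflectionUtils.newInstance(reader.getValueClass(), conf);

            while (reader.next(url, value)) {
              Content content = (Content) value;
              // Keep only PDF records; adjust the type string if your crawl reports it differently.
              if (!"application/pdf".equals(content.getContentType())) continue;

              // Derive an output file name from the last path component of the URL.
              String name = url.toString().substring(url.toString().lastIndexOf('/') + 1);
              if (name.isEmpty()) name = "document";
              if (!name.toLowerCase().endsWith(".pdf")) name = name + ".pdf";

              FileOutputStream out =
                  new FileOutputStream(new File(outDir, count + "_" + name));
              out.write(content.getContent());
              out.close();
              count++;
            }
            reader.close();
          }
        }
        System.out.println("Extracted " + count + " PDF files to " + outDir);
      }
    }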

    4.1 Extraction Hints

    Once you have completed your crawl, the PDF extraction step may be done in any environment you prefer;
    the Java program PDFExtractor.java ought to be cross-platform, so there is no strict requirement for
    this step to be done on aludra.

    Successful PDF extraction should yield 20 files. If you don't have all the files, it may be because you
    failed to extract them, but it could also be because you failed to crawl them in the first place.

    This should be obvious: you can test your extracted PDF files by opening them in Adobe Reader.

    5. More Documentation and Resources

    You may find it helpful to refer to:

    Nutch API docs: http://nutch.apache.org/apidocs-1.6/index.html

    Hadoop API docs: http://hadoop.apache.org/common/docs/r1.0.0/api/index.html

    Besides consulting the available documentation and using Piazza, you may also wish to try the mailing
    lists at dev@nutch.apache.org and/or user@nutch.apache.org if you encounter any difficulties. Both of
    these lists are publicly archived:

    http://mail-archives.apache.org/mod_mbox/nutch-dev/

    http://mail-archives.apache.org/mod_mbox/nutch-user/

    6. Submission Guidelines

    This assignment is to be submitted electronically, before the start of class on the specified due date, via

    https://blackboard.usc.edu/ under Homework: Apache Nutch. A how-to movie is available at: http://www.usc.edu/its/blackboard/support/bb9/assets/movies/bb91/submittingassignment/index.html

    Include all Nutch configuration files and/or code that you needed to change or implement in order to

    crawl the FBI Vault mirror website. Avoid including unnecessary files. Place these in a directory named
    nutchconfig.

    Include all source code and external libraries needed to extract the PDFs from the SequenceFile
    compressed format. Put these in a directory named pdfextract.

    o All source code is expected to be commented, to compile, and to run. You should have (at least) one
      Java source file, PDFExtractor.java, containing your main() method. You should also include any other
      Java source files that you added. Do not submit *.class files. We will compile your program from the
      submitted source.

    o There is no need to submit jar files for Tika, Nutch and/or Hadoop. If you have used any other external libraries, you should include those jar files in your submission.

    o Prepare a readme.txt containing a detailed explanation of how to compile and execute your

    program. Be especially specific on how to include other external libraries if you have used them.

    Also include your name, USC ID number and email address in the readme.txt.

    Do not include your crawled data or extracted PDF files.

    Compress all of the above (nutchconfig folder, pdfextract folder and readme.txt) into a single zip archive and name it according to the following filename convention:

    __CSCI572_HW_NUTCH.zip

    Use only standard zip format. Do not use other formats such as zipx, rar, ace, etc.
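
    For example, from the directory that contains those three items, a standard zip archive can be created
    with the stock zip utility (substitute the archive name required by the convention above):

    zip -r <archive-name>.zip nutchconfig pdfextract readme.txt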

    Important Notes:

    Make sure that you have attached the file when submitting. Failure to do so will be treated as
    non-submission.

    Successful submission will be indicated in the assignment's submission history. We advise that you check
    to verify the timestamp, and download and double-check your zip file for good measure.

    Academic Integrity Notice: Your code will be compared to all other students' code via the MOSS code
    comparison tool to detect plagiarism.

    6.1 Late Assignment Policy

    -10% if submitted up to 7 days after the due date

    no submissions after that date will be accepted

    Appendix A: Java 1.6 Setup on aludra.usc.edu

    Check your current version of Java on aludra by running:

    aludra.usc.edu(1): java -version

    java version "1.6.0_23"

    Java(TM) SE Runtime Environment (build 1.6.0_23-b05)

    Java HotSpot(TM) Server VM (build 19.0-b09, mixed mode)

    aludra.usc.edu(2):

    If you get anything lower than 1.6.0, you need to change it.

    In order to set up Java properly, you need to know which shell you are using. By default, aludra
    accounts are set up to use the TC Shell (tcsh), and users have the option to change that. Here is how to
    check which shell you are using:

    aludra.usc.edu(3): ps -p $$

    PID TTY TIME CMD

    28945 pts/168 0:00 tcsh

    aludra.usc.edu(4):

    Alternatively, if the above command does not work for you, use:

    aludra.usc.edu(5): finger $USER

    Login name: ttrojan In real life: Tommy Trojan

    Directory: /home/scf-XX/ttrojan Shell: /bin/bash

    On since Sep 5 07:28:12 on pts/472 from xxx.usc.edu

    No unread mail

    No Plan.

    aludra.usc.edu(6):

    If neither of the above commands worked for you, it is safe to assume that you are using tcsh since it is

    the default shell.

    If you are using tcsh or csh, append the following two lines to your ~/.cshrc file:

    source /usr/usc/jdk/1.6.0_23/setup.csh

    setenv CLASSPATH .

    If you are using bash, append the following two lines to your ~/.bashrc file:

    source /usr/usc/jdk/1.6.0_23/setup.sh

    export CLASSPATH=.

    Once you have completed the steps outlined above, check again to make sure that you now have

    the correct version of Java set up. Note that you may need to log out and sign back in again for the

    changes to take effect.

    If you are still not able to get the desired Java version after following the above instructions, post on

    Piazza to obtain TA assistance.