
    Homework: Crawling the World Wide Web with Apache Nutch

    1. Objective

    In Homework: Tika you used Tika parsers to extract text from the provided PDF files and helped search
    for UFOs in the FBI's vault dataset.

    But it's time to take a step back. Let's consider: how did we obtain all those PDF files in the first place?
    The short answer is that we crawled the FBI's Vault site http://vault.fbi.gov/, downloaded all of the
    relevant PDFs from that site, dropped them in a directory, and then packaged the whole directory into a
    tarball.

    We used Apache Nutch (http://nutch.apache.org/) to collect the vault.tar.gz dataset. Now, you get to
    do it too! You'd think getting all the PDF files would be a snap. That would be true, of course, if they
    were named with URLs ending in .pdf, and provided that the PDFs were all available from a similar
    directory structure, like http://vault.fbi.gov/pdfs/xxx.pdf. Unfortunately, they're not.

    Your goal in this assignment is to download, install and leverage the Apache Nutch web crawling
    framework to obtain all of the PDFs from a tiny subset of the FBI's Vault website that we have mirrored
    at http://fs.vsoeclassrooms.usc.edu/minivault/, and then to package those PDFs in order to create your
    own vault.tar.gz file.

    1.1 Individual Work

    This assignment is to be done as individual work only.

    2. Preliminaries

    We will be making use of the UNIX servers of the Student Compute Facility (SCF) at aludra.usc.edu for
    this exercise.[1] We will assume a working knowledge of the SCF from CSCI571 or another prerequisite
    course; if you are not familiar with this, please look up the ITS documentation available at:

    http://www.usc.edu/its/unix/servers/scf.html

    http://www.usc.edu/its/unix/

    http://www.usc.edu/its/ssh/putty.html

    http://www.usc.edu/its/ssh/

    http://www.usc.edu/its/sftp/

    [1] If you are an experienced Unix/Linux user, with the OS already installed, you may want to consider
    doing this on your own computer. Check with the TA first to obtain permission before proceeding.

    2.1 Making Space

    This exercise will require the use of a very large portion of your SCF disk quota, so if you have used your

    UNIX account for other courses or other purposes in the past, then odds are that you will need to back up

    your data onto your own local computer.[2]

    This exercise will require at least:

    Nutch: 65MB unpacked

    Crawl data: up to 200MB

    Extracted PDFs: approximately 100MB

    plus other assorted logs and outputs

    We recommend that you have more than 400 MB free on your account before beginning this exercise!

    You can check your current disk usage using the command:

    quota -v

    To see how much space is occupied by the contents of each of your directories, you can use:

    du -k
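
    For example (a small variant of the command above), to see a sorted per-directory summary of usage in
    kilobytes for your home directory, you can combine du with sort:

    du -sk ~/* | sort -n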

    Further information on managing the storage space available on your SCF account can be found at:

    http://www.usc.edu/its/accounts/unixquotas.html

    WARNING: Even done correctly, this exercise requires a very large portion of your SCF disk quota. Be
    very careful with your settings; if you botch something, you may end up exceeding your available disk
    space and/or encounter other technical issues that can potentially be painful to fix. In particular, pay
    close attention to the scope of the crawl to ensure that it is properly confined to our mirror site and
    does not extend beyond it to the FBI websites or, worse, the entire Internet!

    2.2 Setting up Java

    Nutch requires a more recent version of Java than the default configured on your SCF account.

    You can check your current version of Java on aludra by running:

    aludra.usc.edu(1): java -version

    java version "1.6.0_23"

    Java(TM) SE Runtime Environment (build 1.6.0_23-b05)

    Java HotSpot(TM) Server VM (build 19.0-b09, mixed mode)

    aludra.usc.edu(2):

    If you get anything lower than 1.6.0, you will need to change it! Full instructions are in Appendix A.

    [2] It is probably a good idea to back all your data up before you begin this exercise anyway, in case
    something goes wrong.

    3. Crawling using Apache Nutch

    3.1 Downloading and Installing Apache Nutch

    To get started with Apache Nutch, grab the latest release, available from:

    http://www.apache.org/dyn/closer.cgi/nutch/

    The release of Nutch that we will use in this assignment is the latest, Nutch 1.6. Grab the binary
    distribution (the bin.tar.gz file). You can download it directly into your home directory on aludra
    using wget, substituting a mirror host from the page above:

    wget http://<mirror>/apache/nutch/1.6/apache-nutch-1.6-bin.tar.gz

    And then unpack it:

    gtar -xzvf apache-nutch-1.6-bin.tar.gz

    Make sure to chmod +x bin/nutch, and then test it out by running bin/nutch and checking that you see
    command-line output.
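
    Concretely, assuming the tarball unpacks into a directory named apache-nutch-1.6 (the usual layout of
    the binary release), that amounts to:

    cd apache-nutch-1.6
    chmod +x bin/nutch
    bin/nutch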

    Note: Once you have finished unpacking Nutch, you should delete the bin.tar.gz to free up disk space.

    3.2 Configuring Nutch

    The best way to get started on configuring Nutch for your Web crawl is to read Sections 3 and 3.1 of

    the Quick Start on the wiki:

    http://wiki.apache.org/nutch/NutchTutorial

    Pay particular attention to the configuration files that you will need to change/update.

    Note: You must set the agent name and its related properties appropriately (this will be graded!).
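
    For reference, a minimal sketch of what the agent-name override might look like in conf/nutch-site.xml
    is shown below (the value is only a placeholder; see the tutorial and conf/nutch-default.xml for the
    related http.agent.* properties, which should be filled in similarly):

    <?xml version="1.0"?>
    <configuration>
      <!-- Identify your crawler to the sites you visit; the value below is a placeholder. -->
      <property>
        <name>http.agent.name</name>
        <value>CSCI572 Homework Crawler (your name here)</value>
      </property>
    </configuration>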

    As in previous assignments, you are advised to do any and all testing using a small crawl. You do not

    want to spend several hours running a crawl only to find that you botched it and wasted your time.

    Caution: Be very careful about the scope of your crawl! If your crawl is not properly confined to our
    mirror site and extends beyond it to the FBI websites or, worse, the entire Internet, you may end up
    filling all available disk space and/or encounter other technical issues.

    3.2.1 Politeness

    Crawlers can retrieve data much more quickly and in greater depth than human users, so they can have a
    crippling impact on the performance of a site. A server would naturally have a hard time keeping up with
    requests from multiple crawlers, as some of you may have already discovered while doing the earlier
    crawler assignment!

    Except this time all of you are going to be torturing the same server. Please be kind to it; if it dies
    under your bombardment, none of you will be able to complete the assignment!

    To avoid overloading the server, you should set an appropriate interval between successive requests to
    the same server. (Do not make your crawl any more aggressive than Nutch's default settings!)
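
    For reference, the interval between successive requests to the same host is controlled by the
    fetcher.server.delay property. In Nutch 1.x the shipped default in conf/nutch-default.xml is 5.0 seconds
    (verify this in your own copy); if you override it in conf/nutch-site.xml, keep it at or above the
    default, for example:

    <property>
      <name>fetcher.server.delay</name>
      <!-- Seconds to wait between requests to the same server; do not go below the default. -->
      <value>5.0</value>
    </property>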

    3.3 Crawling Hints

    For sanity's sake, please use a wired connection, if possible, to do your crawling.

    Configuring Nutch to perform this task will likely be one of the trickiest parts of the assignment. Pay particular attention to the RegexURLFilter required to crawl the site, and to the crawl command

    arguments/parameters.
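
    As a rough sketch (assuming the stock conf/regex-urlfilter.txt that ships with Nutch 1.6), one common
    way to confine the crawl is to replace the catch-all +. rule at the bottom of that file with a rule that
    only accepts URLs under the mirror:

    # accept only URLs under the minivault mirror
    +^http://fs\.vsoeclassrooms\.usc\.edu/minivault/

    Similarly, a typical invocation of the one-step crawl command in Nutch 1.6 looks like the following,
    where urls/ is a directory containing a seed file (e.g., urls/seed.txt) that lists the start URL; the
    depth and topN values below are placeholders that you will need to tune for this assignment:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50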

    It may be helpful to do manual inspection of the website to gain a better understanding of what you are dealing with, and what settings you need to adjust.

    o The site that you are crawling, http://fs.vsoeclassrooms.usc.edu/minivault/, was created by deleting a
      large portion of our original mirror site in order to reduce its size. There is no need to be alarmed
      by the 404 errors that you encounter; those files were intentionally deleted.

    o Consequently, you may find it more useful to inspect the larger version of the mirror at
      http://fs.vsoeclassrooms.usc.edu/vault/ or the original site at http://vault.fbi.gov/. However, DO NOT
      CRAWL THOSE SITES! Crawling them will exceed the space you have.

    You should observe that some of the URLs include spaces; these need to be percent-encoded (each space
    becomes %20) to normalize the URL into a valid form that Nutch can work with.

    Some of the files we want to crawl are pretty big.

    4. Extracting the PDFs from the Sequence File compressed format

    Nutch uses a compressed SequenceFile binary format to represent the data that it downloads. Data is
    stored in SequenceFiles, which are part of segments on disk (each segment being a splittable portion of
    the crawl). Nutch does this both for efficiency and to allow multiple fetches to be distributed across a
    cluster using the Apache Hadoop MapReduce framework.

    After successfully crawling the FBI vault mirror site, you should have a crawl folder with a bunch of
    segments in it, and inside of those segments, SequenceFiles with the content stored inside.

    To get some idea of how to extract data from Nutch SequenceFiles, have a look at this blog post:

    http://www-scf.usc.edu/~csci572/2013Spring/homework/nutch/allenday20080829.html

    Your goal in this portion of the assignment is to write a Java program, PDFExtractor.java, that will
    extract the PDF files out of the SequenceFile format and write the PDFs to disk. Your program should be
    executed with arguments indicating the source (crawl data) directory and the output (PDF files)
    directory, like this:

    java PDFExtractor <crawlDir> <outputDir>

    The result of this command will be to extract all PDF files from <crawlDir>, and to output them as
    individually named PDF files in the directory identified by <outputDir>.
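
    To give a sense of the approach, below is a minimal, unofficial sketch of such a program. It assumes the
    Hadoop 1.x SequenceFile API, the Nutch 1.6 Content class, and the usual local segment layout
    (<crawlDir>/segments/<segment>/content/part-XXXXX/data); verify those details against your own crawl
    data and the blog post above, and treat this as a starting point rather than a reference solution. The
    Nutch and Hadoop jars that ship with the Nutch distribution must be on your classpath to compile and
    run it.

    // PDFExtractor.java -- a minimal, unofficial sketch, not a reference solution.
    // Assumes the Hadoop 1.x SequenceFile API, the Nutch 1.6 Content class, and the
    // usual local segment layout: <crawlDir>/segments/<segment>/content/part-XXXXX/data.
    import java.io.File;
    import java.io.FileOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;
    import org.apache.nutch.protocol.Content;

    public class PDFExtractor {

      public static void main(String[] args) throws Exception {
        File segmentsDir = new File(args[0], "segments"); // args[0]: crawl data directory
        File outDir = new File(args[1]);                   // args[1]: output directory for PDFs
        outDir.mkdirs();

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        int count = 0;

        for (File segment : segmentsDir.listFiles()) {
          File contentDir = new File(segment, "content");
          if (!contentDir.isDirectory()) continue;

          for (File part : contentDir.listFiles()) {   // part-00000, part-00001, ...
            File data = new File(part, "data");        // the MapFile's data file is a SequenceFile
            if (!data.isFile()) continue;

            SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path(data.getAbsolutePath()), conf);
            Text url = new Text();                      // key: the fetched URL
            Writable value = (Writable)                 // value: a Nutch Content record
                ReflectionUtils.newInstance(reader.getValueClass(), conf);

            while (reader.next(url, value)) {
              Content content = (Content) value;
              // Keep only PDF records; adjust the type string if your crawl reports it differently.
              if (!"application/pdf".equals(content.getContentType())) continue;

              // Derive an output file name from the last path component of the URL.
              String name = url.toString().substring(url.toString().lastIndexOf('/') + 1);
              if (name.isEmpty()) name = "document";
              if (!name.toLowerCase().endsWith(".pdf")) name = name + ".pdf";

              FileOutputStream out =
                  new FileOutputStream(new File(outDir, count + "_" + name));
              out.write(content.getContent());
              out.close();
              count++;
            }
            reader.close();
          }
        }
        System.out.println("Extracted " + count + " PDF files to " + outDir);
      }
    }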

    4.1 Extraction Hints

    Once you have completed your crawl, the PDF extraction step may be done in any environment you prefer;
    the Java program PDFExtractor.java ought to be cross-platform, so there is no strict requirement for
    this step to be done on aludra.

    Successful PDF extraction should yield 20 files. If you don't have all the files, it may be because you
    failed to extract them, but it could also be because you failed to crawl them in the first place.

    This should be obvious: you can test your extracted PDF files by opening them in Adobe Reader.

    5. More Documentation and Resources

    You may find it helpful to refer to:

    Nutch API docs: http://nutch.apache.org/apidocs-1.6/index.html

    Hadoop API docs: http://hadoop.apache.org/common/docs/r1.0.0/api/index.html

    Besides consulting the available documentation and using Piazza, you may also wish to try the mailing
    lists at dev@nutch.apache.org and/or user@nutch.apache.org if you encounter any difficulties. Both of
    these lists are publicly archived:

    http://mail-archives.apache.org/mod_mbox/nutch-dev/

    http://mail-archives.apache.org/mod_mbox/nutch-user/

    6. Submission Guidelines

    This assignment is to be submitted electronically, before the start of class on the specified due date, via

    https://blackboard.usc.edu/ under Homework: Apache Nutch. A how-to movie is available at: http://www.usc.edu/its/blackboard/support/bb9/assets/movies/bb91/submittingassignment/index.html

    Include all Nutch configuration files and/or code that you needed to change or implement in order to

    crawl the FBI Vault mirror website. Avoid including unnecessary files. Place these in a directory named
    nutchconfig.

    Include all source code and external libraries needed to extract the PDFs from the SequenceFile
    compressed format. Put these in a directory named pdfextract.

    o All source code is expected to be commented, to compile, and to run. You should have (at least) one
      Java source file, PDFExtractor.java, containing your main() method. You should also include any other
      Java source files that you added. Do not submit *.class files. We will compile your program from the
      submitted source.

    o There is no need to submit jar files for Tika, Nutch and/or Hadoop. If you have used any other external libraries, you should include those jar files in your submission.

    o Prepare a readme.txt containing a detailed explanation of how to compile and execute your

    program. Be especially specific on how to include other external libraries if you have used them.

    Also include your name, USC ID number and email address in the readme.txt.

    Do not include your crawled data or extracted PDF files.

    Compress all of the above (nutchconfig folder, pdfextract folder and readme.txt) into a single zip archive and name it according to the following filename convention:

    __CSCI572_HW_NUTCH.zip

    Use only standard zip format. Do not use other formats such as zipx, rar, ace, etc.
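
    For example, from the directory that contains those three items, a standard zip archive can be created
    with the stock zip utility (substitute the archive name required by the convention above):

    zip -r <archive-name>.zip nutchconfig pdfextract readme.txt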

    Important Notes:

    Make sure that you have attached the file when submitting. Failure to do so will be treated as
    non-submission.

    Successful submission will be indicated in the assignment's submission history. We advise that you check
    to verify the timestamp, and download and double-check your zip file for good measure.

    Academic Integrity Notice: Your code will be compared to all other students' code via the MOSS code
    comparison tool to detect plagiarism.

    6.1 Late Assignment Policy

    -10% if submitted up to 7 days after the due date

    no submissions after that date will be accepted

    Appendix A: Java 1.6 Setup on aludra.usc.edu

    Check your current version of Java on aludra by running:

    aludra.usc.edu(1): java -version

    java version "1.6.0_23"

    Java(TM) SE Runtime Environment (build 1.6.0_23-b05)

    Java HotSpot(TM) Server VM (build 19.0-b09, mixed mode)

    aludra.usc.edu(2):

    If you get anything lower than 1.6.0, you need to change it.

    In order to set up Java properly, you need to know which shell you are using. By default, aludra
    accounts are set up to use the TC Shell (tcsh), and users have the option to change that. Here is how to
    check which shell you are using:

    aludra.usc.edu(3): ps -p $$

    PID TTY TIME CMD

    28945 pts/168 0:00 tcsh

    aludra.usc.edu(4):

    Alternatively, if the above command does not work for you, use:

    aludra.usc.edu(5): finger $USER

    Login name: ttrojan In real life: Tommy Trojan

    Directory: /home/scf-XX/ttrojan Shell: /bin/bash

    On since Sep 5 07:28:12 on pts/472 from xxx.usc.edu

    No unread mail

    No Plan.

    aludra.usc.edu(6):

    If neither of the above commands worked for you, it is safe to assume that you are using tcsh since it is

    the default shell.

    If you are using tcsh or csh, append the following two lines to your ~/.cshrc file:

    source /usr/usc/jdk/1.6.0_23/setup.csh

    setenv CLASSPATH .

    If you are using bash, append the following two lines to your ~/.bashrc file:

    source /usr/usc/jdk/1.6.0_23/setup.sh

    export CLASSPATH=.

    Once you have completed the steps outlined above, check again to make sure that you now have

    the correct version of Java set up. Note that you may need to log out and sign back in again for the

    changes to take effect.

    If you are still not able to get the desired Java version after following the above instructions, post on

    Piazza to obtain TA assistance.