A brief, hands-on introduction to Hadoop & Pig

Erik Eldridge

Yahoo! Developer Network

OSCAMP 2009

Photo credit: http://www.flickr.com/photos/mckaysavage/1059144105/sizes/l/

Preamble

• Intention: approach Hadoop from a tool user's perspective, specifically a web developer's perspective

• Intended audience: anyone with a desire to begin using Hadoop

Requirements

• VMware
– Hadoop will be demonstrated using a VMware virtual machine
– I’ve found the use of a virtual machine to be the easiest way to get started with Hadoop

Setup VM

• Get the Hadoop VM from: http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-setup

• Note:
– user name: hadoop-user
– password: hadoop

• Launch the VM

• Log in

• Note the IP address of the machine

Start Hadoop

• Run the util to launch hadoop: $ ~/start-hadoop

• If it's already running, we'll get an error like:
"172.16.83.132: datanode running as process 6752. Stop it first.
172.16.83.132: secondarynamenode running as process 6845. Stop it first.
..."

Saying “hi” to Hadoop

• Call hadoop command line util: $ hadoop

• Hadoop command line options are listed here: http://hadoop.apache.org/common/docs/r0.17.0/hdfs_shell.html

• Hadoop should have been launched on boot. Verify this is the case: $ hadoop dfs -ls /

Saying “hi” to Hadoop

• If Hadoop has not been started, you'll see something like:
"09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 0 time(s).
09/07/22 10:01:24 INFO ipc.Client: Retrying connect to server: /172.16.83.132:9000. Already tried 1 time(s).
..."

• If hadoop has been launched, the dfs -ls command should show the contents of hdfs

• Before continuing, view all the hadoop utilities and sample files: $ ls

Install Apache

• Why? In the interest of creating a relevant example, I'm going to work on Apache access logs

• Update the apt package index so apt-get can find apache2: $ sudo apt-get update

• Install apache2 so we can generate access log data: $ sudo apt-get install apache2

Generate data

• Jump into the directory containing the apache logs: $ cd /var/log/apache2

• Show the last 10 lines of the access log and keep following it as new lines arrive: $ tail -f -n 10 access.log

Generate data

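• Put this script, or something similar, in an executable file on your local machine, e.g., generate.sh:

#!/bin/bash
# Base URL of the Apache server running in the VM
url='http://{VM IP address}/'
# Request the page 1000 times to populate the access log
for i in {1..1000}
do
  curl "$url"
done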

• Edit the IP address to that of your VM

Generate data

• Set executable permissions on the file: $ chmod +x generate.sh

• Run the file: $ ./generate.sh

• Note log data in tail output in VM
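If bash and curl aren't handy on your local machine, the same loop can be sketched in Python. This is only an illustration, assuming Python 3: `generate_traffic` and its `fetch` parameter are names I made up, and the URL is still the placeholder for your VM's address.

```python
from urllib.request import urlopen


def generate_traffic(url, count=1000, fetch=urlopen):
    """Request `url` `count` times and return how many requests succeeded."""
    succeeded = 0
    for _ in range(count):
        try:
            # Each successful GET should add one line to Apache's access.log.
            response = fetch(url)
            response.close()
            succeeded += 1
        except OSError:
            # URLError subclasses OSError; skip failed requests and keep going.
            continue
    return succeeded
```

Calling `generate_traffic('http://{VM IP address}/')` with the real address filled in should make the tail output in the VM scroll, just like the bash version.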

Exploring HDFS

• Ref: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html

• Show home dir structure:
– $ hadoop dfs -ls /user
– $ hadoop dfs -ls /user/hadoop-user

• Create a directory: $ hadoop dfs -mkdir /user/hadoop-user/foo

• Show new dir: $ hadoop dfs -ls /user/hadoop-user/



Exploring HDFS

• Attempt to re-create new dir and note error: $ hadoop dfs -mkdir /user/hadoop-user/foo

• Create a directory using a relative path, which resolves against the HDFS home directory /user/hadoop-user: $ hadoop dfs -mkdir bar

• Auto-create nested destination directories: $ hadoop dfs -mkdir dir1/dir2/dir3

• Remove dir: $ hadoop dfs -rmr /user/hadoop-user/foo

• Remove several dirs in one call: $ hadoop dfs -rmr bar dir1

• Try to re-remove dir and note error: $ hadoop dfs -rmr bar
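The commands above mix absolute paths (/user/hadoop-user/foo) with relative ones (bar, dir1/dir2/dir3). As a rough sketch of how the relative ones are interpreted: `resolve_hdfs_path` is a hypothetical helper, not part of Hadoop, and it assumes the HDFS home directory is /user/ followed by the user name, as on this VM.

```python
import posixpath


def resolve_hdfs_path(path, user="hadoop-user"):
    """Return the absolute HDFS path that a `hadoop dfs` argument refers to."""
    if path.startswith("/"):
        # Absolute paths are used as given.
        return posixpath.normpath(path)
    # Relative paths are interpreted against the user's HDFS home directory.
    return posixpath.normpath(posixpath.join("/user", user, path))
```

So `resolve_hdfs_path("bar")` gives /user/hadoop-user/bar, which is where -mkdir bar puts the directory.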

Browse HDFS using web UI

• Open http://{VM IP address}:50070 in a browser

• More info: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_user_guide.html#Web+Interface

Import access log data

• Load access log into hdfs: $ hadoop dfs -put /var/log/apache2/access.log input/access.log

• Verify it's in there: $ hadoop dfs -ls input/access.log

• View the contents: $ hadoop dfs -cat input/access.log

Do something w/ the data

• Ref: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29

• Credit: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python

• Save the mapper and reducer code in two separate files, e.g., mapper.py and reducer.py
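The mapper and reducer themselves live at the credited link. For orientation, here is a minimal word-count sketch of the two steps. It is my own illustration and not the code from that tutorial; `map_words` and `reduce_counts` are hypothetical names, and the tab-separated "word, count" record format is an assumption in line with how Hadoop Streaming passes key/value lines.

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_counts(sorted_records):
    """Reducer step: sum the counts of consecutive records that share a key.

    Hadoop Streaming sorts mapper output by key before the reducer sees it,
    so identical words arrive next to each other.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))
```

In mapper.py the function would be fed from standard input, e.g. `sys.stdout.writelines(r + "\n" for r in map_words(sys.stdin))` after `import sys`, and reducer.py would do the same with `reduce_counts`. Each file also needs a `#!/usr/bin/env python` first line and executable permissions, since the streaming job runs them as programs.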

• Stream data through these two files, saving the output back to HDFS:

#!/bin/bash
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.0-streaming.jar \
  -mapper /home/hadoop-user/wordcount/mapper.py \
  -reducer /home/hadoop-user/wordcount/reducer.py \
  -input /user/hadoop-user/input/access.log \
  -output /user/hadoop-user/output/mapReduceOut


• View output files: $ hadoop dfs -ls output/mapReduceOut

• Note multiple output files ("part-00000", "part-00001", etc)

• View output file contents: $ hadoop dfs -cat output/mapReduceOut/part-00000
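Why several part files? Each reduce task writes its own part-NNNNN file, and a key's hash decides which reducer receives it. The sketch below illustrates the idea only: `pick_reducer` and `split_into_parts` are hypothetical names, and CRC32 is a stand-in hash, not the function Hadoop actually uses.

```python
import zlib


def pick_reducer(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # CRC32 is a stand-in: deterministic, but not Hadoop's own hash function.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def split_into_parts(keys, num_reducers):
    """Group keys by reducer and label each group like an HDFS output file."""
    parts = {"part-%05d" % i: [] for i in range(num_reducers)}
    for key in keys:
        parts["part-%05d" % pick_reducer(key, num_reducers)].append(key)
    return parts
```

Because the same key always hashes to the same reducer, every occurrence of a word ends up in exactly one part file, so the per-word totals are complete even though the output is split.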

Pig

• Pig is a higher-level interface for Hadoop
– Interactive shell, Grunt
– Declarative, SQL-like language, Pig Latin
– Pig engine compiles Pig Latin into MapReduce
– Extensible via Java files

• “Writing MapReduce routines is like coding in assembly”

• Pig, Hive, etc.

Exploring Pig

• Ref: http://wiki.apache.org/pig/PigTutorial

• Pig is already on the VM

• Launch Pig w/ connection to cluster:

$ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main

• View contents of HDFS from grunt: > ls

Perform word count w/ Pig

Save this script in a file, e.g., wordcount.pig:

myinput = LOAD 'input/access.log' USING TextLoader();
words = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
ordered = ORDER counts BY $0;
STORE ordered INTO 'output/pigOut' USING PigStorage();
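To see what each Pig statement in the script above contributes, here is a rough single-machine equivalent in Python. It is a sketch only: `word_count` is a hypothetical name, and splitting on whitespace is a simplification, since Pig's TOKENIZE may break on additional delimiters.

```python
from collections import Counter


def word_count(lines):
    """Mirror the Pig script: tokenize, group by word, count, order by word."""
    counts = Counter()
    for line in lines:               # LOAD ... USING TextLoader(): one record per line
        counts.update(line.split())  # FLATTEN(TOKENIZE($0)), simplified to whitespace
    # GROUP and COUNT are folded into the Counter; ORDER counts BY $0 sorts by word.
    return sorted(counts.items())
```

For example, `word_count(["a b a", "b c"])` returns `[("a", 2), ("b", 2), ("c", 1)]`. The Pig version produces the same kind of result, but compiled into MapReduce jobs that run on the cluster.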

Perform word count w/ Pig

• Run this script: $ java -cp pig/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -f wordcount.pig

• View output: $ hadoop dfs -cat output/pigOut/part-00000

Resources

• Apache Hadoop Site

• Apache Pig Site

• YDN Hadoop Tutorial – Virtual Machine

Thank you

• Follow me on Twitter: http://twitter.com/erikeldridge

• Find these slides on Slideshare: http://slideshare.net/erikeldridge

• Rate this talk on SpeakerRate: http://speakerrate.com/erikeldridge
