apache pig on amazon aws - swine not?
DESCRIPTION
A basic introduction to Apache Pig, focused on understanding what it is as well as quickly getting started with it through Amazon's Elastic MapReduce service. The second part details my experience following along with Amazon's Pig tutorial at: http://aws.amazon.com/articles/2729
TRANSCRIPT
Apache Pig on Amazon AWS
Swine Not?
What is Apache Pig?
Pig is an execution framework that interprets scripts written in a language called Pig Latin and then runs them on a Hadoop cluster.
(Disturbing Logo)
Pig is a tool that...
● creates complex jobs that efficiently process large volumes of data
● supports many relational features, making it easy to join, group, and aggregate data
● performs ETL tasks quickly, on many servers simultaneously
What is Pig Latin?
It is a high-level data transformation language that:
● allows you to concentrate on the data transformations you require
Rather than:
● forcing you to be concerned with individual map and reduce functions
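As a sketch of what that looks like in practice (the field names and input path here are hypothetical, not from the tutorial), a whole group-and-sum job is just a few declarative statements in Pig Latin:

```pig
-- Hypothetical example: total bytes served per status code,
-- written as data transformations rather than map/reduce functions.
logs   = LOAD 'access_log' AS (status:int, bytes:long);
groups = GROUP logs BY status;
totals = FOREACH groups GENERATE group AS status, SUM(logs.bytes) AS total_bytes;
DUMP totals;
```

Pig compiles each of these statements into the map and reduce stages for you.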
Walkthrough - Create a Job Flow
* Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729
And now we wait...
SSH into the master instance
$ ssh -i ~/keys/crocs.pem -l hadoop \
    ec2-54-215-107-197.us-west-1.compute.amazonaws.com
Type "pig" to enter the grunt shell
$ pig
grunt> _
It's a freakin' shell!
grunt> pwd
hdfs://10.174.115.214:9000/
You can enter the HDFS file system:
grunt> cd hdfs:///
grunt> ls
hdfs://10.174.115.214:9000/mnt <dir>
Even enter an S3 bucket:
grunt> cd s3://elasticmapreduce/samples/pig-apache/input/
grunt> ls
s3://elasticmapreduce/samples/pig-apache/input/access_log_1<r 1> 8754118
s3://elasticmapreduce/samples/pig-apache/input/access_log_2<r 1> 8902171
Load Piggybank, an open-source library of user-contributed functions
grunt> register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE the EXTRACT alias from piggybank
grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
LOAD
Use TextLoader (a built-in Pig loader) to load each line of the source file:
grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader as (line:chararray);
ILLUSTRATE
Shows a step-by-step process on how Pig would transform a small sample of data
grunt> illustrate RAW_LOGS;
Connecting to hadoop file system at: hdfs://10.174.115.214:9000
Connecting to map-reduce job tracker at: 10.174.115.214:9001
...
---------------------------------------------------------------
| RAW_LOGS | line:chararray |
---------------------------------------------------------------
| | 65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] "GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)" |
---------------------------------------------------------------
Now let's:
● split each line into fields
● store everything in a bag
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
  FLATTEN(
    EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
  ) as (
    remoteAddr: chararray, remoteLogname: chararray, user: chararray,
    time: chararray, request: chararray, status: int,
    bytes_string: chararray, referrer: chararray, browser: chararray
  );
ILLUSTRATE an example of our work
grunt> illustrate LOGS_BASE;
...
| LOGS_BASE |
| remoteAddr:chararray | 74.125.74.193 |
| remoteLogname:chararray | - |
| user:chararray | - |
| time:chararray | 20/Jul/2009:20:30:55 -0700 |
| request:chararray | GET /gwidgets/alexa.xml HTTP/1.1 |
| status:int | 200 |
| bytes_string:chararray | 2969 |
| referrer:chararray | - |
| browser:chararray | Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html) |
Create a bag containing tuples with just the referrer element (limit 10 items):
grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
grunt> TEMP = LIMIT REFERRER_ONLY 10;
Output the contents of the bag:
grunt> DUMP TEMP;
Pig features used in the script: LIMIT
File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
creating jar file Job5394669249002614476.jar
Setting up single store job
1 map-reduce job(s) waiting for submission.
...
More log output before we get our results (cleaned up here)
...
Input(s):
Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"
Output(s):
Successfully stored 10 records (126 bytes) in: "hdfs://10.174.115.214:9000/tmp/temp948493830/tmp76754790"
Counters:
Total records written : 10
...
Voila! Our exciting results:
(-)
(-)
(-)
(-)
(-)
(-)
(http://example.org/)
(http://example.org/)
(-)
(-)
First 10 referrers (the dashes represent no referrer)
Now let's filter for only the referrals from bing.com*
grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
grunt> TEMP = LIMIT FILTERED 9;
grunt> DUMP TEMP;
(http://www.bing.com/search?q=login)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=search)
(http://www.bing.com/search?q=philmont)
* We all use Bing, am I right?
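A natural next step, not part of the Amazon tutorial, would be to aggregate these referrers. A sketch, assuming the FILTERED alias from above is still defined in the grunt shell:

```pig
-- Hypothetical follow-on: count hits per Bing referrer, keep the top 10.
REFERRER_GROUPS = GROUP FILTERED BY referrer;
REFERRER_COUNTS = FOREACH REFERRER_GROUPS GENERATE
  group AS referrer, COUNT(FILTERED) AS hits;
SORTED = ORDER REFERRER_COUNTS BY hits DESC;
TOP10 = LIMIT SORTED 10;
DUMP TOP10;
```

This is the GROUP/COUNT pattern that would otherwise take a hand-written map and reduce pair.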
Don't forget to terminate your Job Flow
Amazon will charge you even if it's idle!