
Apache Pig on Amazon AWS Swine Not?

Upload: drake-emko

Posted on 05-Dec-2014


DESCRIPTION

A basic introduction to Apache Pig, focused on understanding what it is as well as quickly getting started using it through Amazon's Elastic Map Reduce service. The second part details my experience following along with Amazon's Pig tutorial at: http://aws.amazon.com/articles/2729

TRANSCRIPT

Page 1: Apache Pig on Amazon AWS  - Swine Not?

Apache Pig on Amazon AWS

Swine Not?

Page 2: Apache Pig on Amazon AWS  - Swine Not?

What is Apache Pig?

Pig is an execution framework that interprets scripts written in a language called Pig Latin and then runs them on a Hadoop cluster.
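Concretely, a Pig Latin script is just a text file of statements: you hand it to the pig command, and the framework compiles it into one or more Hadoop map/reduce jobs. A minimal sketch (the script name here is an assumption):

$ pig -x mapreduce wordcount.pig
$ pig -x local wordcount.pig

The -x flag picks the execution mode: mapreduce runs the script on the Hadoop cluster, while local runs it on a single machine, which is handy for testing.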

(Disturbing Logo) -->

Page 3: Apache Pig on Amazon AWS  - Swine Not?

Pig is a tool that...

● creates complex jobs that efficiently process large volumes of data

● supports many relational features, making it easy to join, group, and aggregate data

● performs ETL tasks quickly, on many servers simultaneously
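The relational features above can be sketched in a few lines of Pig Latin (a hedged example: the file names and schemas are assumptions, not part of the tutorial):

-- Join users to their page views, then count views per user
users   = LOAD 'users.tsv' AS (user_id: int, name: chararray);
views   = LOAD 'views.tsv' AS (user_id: int, url: chararray);
joined  = JOIN users BY user_id, views BY user_id;
by_user = GROUP joined BY users::name;
counts  = FOREACH by_user GENERATE group AS name, COUNT(joined) AS view_count;
DUMP counts;

Each statement names an intermediate relation; Pig only plans and runs map/reduce jobs once a DUMP or STORE asks for output.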

Page 4: Apache Pig on Amazon AWS  - Swine Not?

What is Pig Latin?

It is a high-level data transformation language that:

● allows you to concentrate on the data transformations you require

Rather than:

● force you to be concerned with individual map and reduce functions
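For instance, counting log lines per HTTP status code takes three statements, with no mapper or reducer code anywhere (a hedged sketch; the input file and its one-column schema are assumptions):

codes   = LOAD 'status_codes.tsv' AS (status: int);
by_code = GROUP codes BY status;
counts  = FOREACH by_code GENERATE group AS status, COUNT(codes) AS hits;
STORE counts INTO 'status_counts';

Pig translates the GROUP and COUNT into the map and reduce phases for you.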

Page 5: Apache Pig on Amazon AWS  - Swine Not?

Walkthrough - Create a Job Flow

* Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729

Pages 6-13: (AWS console screenshots: creating and launching the Job Flow)

And now we wait...

Page 14: Apache Pig on Amazon AWS  - Swine Not?

SSH into the master instance:

$ ssh -i ~/keys/crocs.pem -l hadoop \
  ec2-54-215-107-197.us-west-1.compute.amazonaws.com

Page 15: Apache Pig on Amazon AWS  - Swine Not?

Type "pig" to enter the grunt shell:

$ pig
grunt> _

It's a freakin' shell!

grunt> pwd
hdfs://10.174.115.214:9000/

Page 16: Apache Pig on Amazon AWS  - Swine Not?

You can enter the HDFS file system:

grunt> cd hdfs:///
grunt> ls
hdfs://10.174.115.214:9000/mnt <dir>

Even enter an S3 bucket:

grunt> cd \
  s3://elasticmapreduce/samples/pig-apache/input/
grunt> ls
s3://elasticmapreduce/samples/pig-apache/input/access_log_1<r 1> 8754118
s3://elasticmapreduce/samples/pig-apache/input/access_log_2<r 1> 8902171

Page 17: Apache Pig on Amazon AWS  - Swine Not?

Load Piggybank, an open-source library of user-contributed functions:

grunt> register file:/home/hadoop/lib/pig/piggybank.jar

DEFINE the EXTRACT alias from piggybank:

grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;

Page 18: Apache Pig on Amazon AWS  - Swine Not?

LOAD

Use TextLoader (a built-in Pig load function) to load each line of the source file:

grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader AS (line: chararray);

Page 19: Apache Pig on Amazon AWS  - Swine Not?

ILLUSTRATE

Shows, step by step, how Pig would transform a small sample of the data:

grunt> illustrate RAW_LOGS;
Connecting to hadoop file system at: hdfs://10.174.115.214:9000
Connecting to map-reduce job tracker at: 10.174.115.214:9001
...
---------------------------------------------------------------
| RAW_LOGS | line:chararray |
---------------------------------------------------------------
| | 65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] "GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)" |
---------------------------------------------------------------

Page 20: Apache Pig on Amazon AWS  - Swine Not?

Now let's:

● split each line into fields

● store everything in a bag

grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
  FLATTEN(
    EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
  ) AS (
    remoteAddr: chararray, remoteLogname: chararray, user: chararray,
    time: chararray, request: chararray, status: int,
    bytes_string: chararray, referrer: chararray, browser: chararray
  );

Page 21: Apache Pig on Amazon AWS  - Swine Not?

ILLUSTRATE an example of our work:

grunt> illustrate LOGS_BASE;
...
| LOGS_BASE |
| remoteAddr:chararray | 74.125.74.193 |
| remoteLogname:chararray | - |
| user:chararray | - |
| time:chararray | 20/Jul/2009:20:30:55 -0700 |
| request:chararray | GET /gwidgets/alexa.xml HTTP/1.1 |
| status:int | 200 |
| bytes_string:chararray | 2969 |
| referrer:chararray | - |
| browser:chararray | Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html) |

Page 22: Apache Pig on Amazon AWS  - Swine Not?

Create a bag containing tuples with just the referrer element (limited to 10 items):

grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
grunt> TEMP = LIMIT REFERRER_ONLY 10;

Output the contents of the bag:

grunt> DUMP TEMP;
Pig features used in the script: LIMIT
File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
creating jar file Job5394669249002614476.jar
Setting up single store job
1 map-reduce job(s) waiting for submission.
...

Page 23: Apache Pig on Amazon AWS  - Swine Not?

More log output before we get our results (cleaned up here)

...

Input(s):
Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"

Output(s):
Successfully stored 10 records (126 bytes) in: "hdfs://10.174.115.214:9000/tmp/temp948493830/tmp76754790"

Counters:
Total records written : 10

...

Page 24: Apache Pig on Amazon AWS  - Swine Not?

Voila! Our exciting results:

(-)
(-)
(-)
(-)
(-)
(-)
(http://example.org/)
(http://example.org/)
(-)
(-)

First 10 referrers (the dashes represent no referrer)

Page 25: Apache Pig on Amazon AWS  - Swine Not?

Now let's filter only the referrals from bing.com*

grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
grunt> TEMP = LIMIT FILTERED 9;
grunt> DUMP TEMP;
(http://www.bing.com/search?q=login)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=search)
(http://www.bing.com/search?q=philmont)

* We all use Bing, am I right?

Page 26: Apache Pig on Amazon AWS  - Swine Not?

Don't forget to terminate your Job Flow

Amazon will charge you even if it's idle!