Practical Pig: Preventing Perilous Programming Pitfalls for Prestige & Profit
Jameson Lopp, Software Engineer, Bronto Software, Inc.
March 20, 2012

Uploaded by trihug on 26-Jan-2015

Page 1: Practical pig

Practical Pig: Preventing Perilous Programming Pitfalls for Prestige & Profit

Jameson Lopp, Software Engineer, Bronto Software, Inc.
March 20, 2012

Page 2: Practical pig

Why Pig?

● High level language
● Small learning curve
● Increases productivity
● Insulates you from the complexity of MapReduce
  ○ Job configuration tuning
  ○ Mapper / Reducer optimization
  ○ Data re-use
  ○ Job chains

Page 3: Practical pig

Simple MapReduce Example

Input: user profiles, page visits
Output: the top 5 most visited pages by users aged 18-25

Page 4: Practical pig

In Native Hadoop Code

Page 5: Practical pig

users = LOAD 'users' AS (name, age);
users = FILTER users BY age >= 18 AND age <= 25;
pages = LOAD 'pages' AS (user, url);
joined = JOIN users BY name, pages BY user;
grouped = GROUP joined BY url;
summed = FOREACH grouped GENERATE group, COUNT(joined) AS clicks;
sorted = ORDER summed BY clicks DESC;
top5 = LIMIT sorted 5;
STORE top5 INTO '/data/top5sites';

In Pig

Page 6: Practical pig

Comparisons

Significantly fewer lines of code
Considerably less development time
Reasonably close to optimal performance

Page 7: Practical pig

Under the Hood

Automagic!
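Under the hood, Pig compiles each script into a plan of chained MapReduce jobs. One way to peek at that "automagic" is the EXPLAIN command in the Grunt shell; a minimal sketch (the relation names here are hypothetical):

```pig
grunt> users = LOAD 'users' AS (name:chararray, age:int);
grunt> grouped = GROUP users BY age;
grunt> counts = FOREACH grouped GENERATE group, COUNT(users);
grunt> EXPLAIN counts;
-- prints the logical, physical, and MapReduce plans Pig will execute
```

No job actually runs; EXPLAIN only shows how Pig would translate the script.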

Page 8: Practical pig

Getting Up and Running

1) Build from source via repository checkout, or download a package from:
http://pig.apache.org/releases.html#Download
https://ccp.cloudera.com/display/SUPPORT/Downloads

2) Make sure your class paths are set:
export JAVA_HOME=/usr/java/default
export HBASE_HOME=/usr/lib/hbase
export PIG_HOME=/usr/lib/pig
export HADOOP_HOME=/usr/lib/hadoop
export PATH=$PIG_HOME/bin:$PATH

3) Run Grunt or execute a Pig Latin script:
$ pig -x local
... - Connecting to ...
grunt>
OR
$ pig -x mapreduce wordCount.pig

Page 9: Practical pig

Pig Latin Basics

Pig Latin statements allow you to transform relations.

● A relation is a bag.
● A bag is a collection of tuples.
● A tuple is an ordered set of fields.
● A field is a piece of data (int / long / float / double / chararray / bytearray)

Relations are referred to by name. Names are assigned by you as part of the Pig Latin statement. Fields are referred to by positional notation, or by name if you assign one.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name, $2;
DUMP X;

(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)
(Joe,3.8F)

Page 10: Practical pig

Pig Crash Course for SQL Users

SQL: SELECT * FROM users;
Pig Latin: users = LOAD '/hdfs/users' USING PigStorage('\t') AS (name:chararray, age:int, weight:int);

SQL: SELECT * FROM users WHERE weight < 150;
Pig Latin: skinnyUsers = FILTER users BY weight < 150;

SQL: SELECT name, age FROM users WHERE weight < 150;
Pig Latin: skinnyUserNames = FOREACH skinnyUsers GENERATE name, age;

Page 11: Practical pig

Pig Crash Course for SQL Users

SQL: SELECT name, SUM(orderAmount) FROM orders GROUP BY name...
Pig Latin:
A = GROUP orders BY name;
B = FOREACH A GENERATE $0 AS name, SUM($1.orderAmount) AS orderTotal;

SQL: ...HAVING SUM(orderAmount) > 500...
Pig Latin: C = FILTER B BY orderTotal > 500;

SQL: ...ORDER BY name ASC;
Pig Latin: D = ORDER C BY name ASC;

SQL: SELECT DISTINCT name FROM users;
Pig Latin:
names = FOREACH users GENERATE name;
uniqueNames = DISTINCT names;

SQL: SELECT name, COUNT(DISTINCT age) FROM users GROUP BY name;
Pig Latin:
usersByName = GROUP users BY name;
numAgesByName = FOREACH usersByName {
    ages = DISTINCT users.age;
    GENERATE FLATTEN(group), COUNT(ages);
}
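One mapping the crash course above doesn't show is a join. A hedged sketch, reusing the hypothetical users and orders relations from the examples (field names assumed):

```pig
-- SQL: SELECT u.name, o.orderAmount FROM users u JOIN orders o ON u.name = o.name;
joined = JOIN users BY name, orders BY name;
-- after a join, fields are disambiguated with the :: prefix
result = FOREACH joined GENERATE users::name, orders::orderAmount;
```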

Page 12: Practical pig

"Aggregate yesterday's API web server logs by client and function call."

logs = LOAD '/hdfs/logs/$date/api.log' USING PigStorage('\t')
    AS (type, date, ipAddress, sessionId, clientId, class, method);
methods = FILTER logs BY type == 'INFO';
methods = FOREACH methods GENERATE type, date, clientId, class, method;
methods = GROUP methods BY (clientId, class, method);
methodStats = FOREACH methods GENERATE
    FLATTEN(group), COUNT($1) AS methodCount;
STORE methodStats INTO '/stats/$date/api/apiUsageByClient';

Real World Pig Script

Page 13: Practical pig

"Find the most commonly used desktop browser, mobile browser, operating system, email client, and geographic location for every contact."

● 150 line Pig Latin script
● Runs daily on 6 node computation cluster
● Processes ~1B rows of raw tracking data in 40 minutes, doing multiple groups and joins via 16 chained MapReduce jobs with 2100 mappers
● Output: ~40M rows of contact attributes

Pig Job Performance

Page 14: Practical pig

● Reads input tracking data from sequence files on HDFS
logs = LOAD '/rawdata/track/{$dates}/part-*' USING SequenceFileLoader;
logs = FOREACH logs GENERATE $0, STRSPLIT($1, '\t');

● Filters out all tracking actions other than email opens
rawOpens = FILTER logs BY $1.$2 == 'open'
    AND $1.$15 IS NOT NULL
    AND ($1.$17 IS NOT NULL OR $1.$18 IS NOT NULL
         OR $1.$19 IS NOT NULL OR $1.$20 IS NOT NULL);

● Strips down each row to required data (memory usage optimization)
allBrowsers = FOREACH rawOpens GENERATE
    (chararray)$1.$15 AS subscriberId,
    (chararray)$1.$17 AS ipAddress,
    (chararray)$1.$18 AS userAgent,
    (chararray)$1.$19 AS httpReferer,
    (chararray)$1.$20 AS browser,
    (chararray)$1.$21 AS os;

● Separates mobile browser data from desktop browser data
SPLIT allBrowsers INTO
    mobile IF (browser == 'iPhone' OR browser == 'Android'),
    desktop IF (browser != 'iPhone' AND browser != 'Android');

Pig Job Performance

Page 15: Practical pig

OMGWTFBBQ

-- the last column is a concatenated 'index' we will use to diff between daily runs of this script
storeResults = FOREACH joinedResults {
    GENERATE
        joinedResults::compactResults::subscriberId AS subscriberId,
        joinedResults::subscriberModeIP::ipCountBySubscriber::ipAddress AS ipAddress,
        joinedResults::compactResults::primaryBrowser AS primaryBrowser,
        joinedResults::compactResults::primaryUserAgent AS primaryUserAgent,
        joinedResults::compactResults::primaryHttpReferer AS primaryHttpReferer,
        joinedResults::compactResults::mobileBrowser AS mobileBrowser,
        joinedResults::compactResults::mobileUserAgent AS mobileUserAgent,
        joinedResults::compactResults::mobileHttpReferer AS mobileHttpReferer,
        subscriberModeOS::osCountBySubscriber::os AS os,
        CONCAT(
            CONCAT(
                CONCAT(joinedResults::compactResults::subscriberId,
                    (joinedResults::subscriberModeIP::ipCountBySubscriber::ipAddress IS NULL ? ''
                        : joinedResults::subscriberModeIP::ipCountBySubscriber::ipAddress)),
                CONCAT(
                    (joinedResults::compactResults::primaryBrowser IS NULL ? ''
                        : joinedResults::compactResults::primaryBrowser),
                    (joinedResults::compactResults::mobileBrowser IS NULL ? ''
                        : joinedResults::compactResults::mobileBrowser))),
            (subscriberModeOS::osCountBySubscriber::os IS NULL ? ''
                : subscriberModeOS::osCountBySubscriber::os)) AS key;
}

Pig Job Performance

Page 16: Practical pig

Pig Job Performance

Page 17: Practical pig

UDFs allow you to perform more complex operations upon fields. They are written in Java, compiled into a jar, and loaded into your Pig script at runtime.

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}

User Defined Functions

Page 18: Practical pig

Making use of your UDF in a Pig script:

REGISTER myudfs.jar;
students = LOAD 'student_data' AS (name:chararray, age:int, gpa:float);
upperNames = FOREACH students GENERATE myudfs.UPPER(name);
DUMP upperNames;

User Defined Functions

Page 19: Practical pig

UDFs are limited: they can only operate on fields, not on groups of fields, and a given UDF can only return a single data type (integer / float / chararray / etc.). To build a jar file that contains all available UDFs, follow these steps:

● Check out the UDF code: svn co http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank
● Add pig.jar to your classpath: export CLASSPATH=$CLASSPATH:/path/to/pig.jar
● Build the jar file: cd trunk/contrib/piggybank/java and run "ant". This generates piggybank.jar in the same directory.

You must build piggybank in order to read the UDF documentation: run "ant javadoc" from trunk/contrib/piggybank/java; the documentation is generated in trunk/contrib/piggybank/java/build/javadoc. How to compile a custom UDF isn't obvious: after writing your UDF, you must place your Java code in an appropriate directory inside a checkout of the piggybank code and build the piggybank jar with ant.
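Once built, the piggybank jar can be registered and its functions used with fully qualified names. A hedged sketch (the Reverse class path reflects piggybank's layout at the time; the relation and file names are hypothetical):

```pig
REGISTER /path/to/piggybank.jar;
-- DEFINE gives the long class name a short alias
DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();
students = LOAD 'student_data' AS (name:chararray);
reversed = FOREACH students GENERATE Reverse(name);
```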

UDF Pitfalls

Page 20: Practical pig

Trying to match Pig versions with Hadoop / HBase versions. There is very little documentation on what is compatible with what. A few snippets from the mailing list: "Are you using Pig 8 distribution or Pig 8 from svn? You want the latter (soon-to-be-Pig 0.8.1)" "Please upgrade your pig version to the latest in the 0.8 branch. The 0.8 release is not compatible with 0.20+ versions of hbase; we bumped up the support in 0.8.1, which is nearing release. Cloudera's latest CDH3 GA might have these patches (it was just released today) but CDH3B4 didn't."

Common Pig Pitfalls

Page 21: Practical pig

Bugs in older versions of Pig require you to register jars, indicated by MapReduce job failure due to java.lang.ClassNotFoundException. I finally resolved the problem by manually registering jars:

REGISTER /path/to/pig_0.8/lib/google-collections-1.0.jar;
REGISTER /path/to/pig_0.8/lib/hbase-0.20.3-1.cloudera.jar;
REGISTER /path/to/pig_0.8/lib/zookeeper-hbase-1329.jar;

From the mailing list: "If you are using Hbase 0.91 and Pig 0.8.1, the hbaseStorage code in Pig is supposed to auto-register the hbase, zookeeper, and google-collections jars, so you won't have to do that." No more registering jars, though they do need to be on your classpath.

Common Pig Pitfalls

Page 22: Practical pig

HBaseLoader bug requiring disabling input splits. Pig versions prior to 0.8.1 will only load a single HBase region unless you disable input splits.

Fix via: SET pig.splitCombination 'false';

Obscure Pig Pitfalls

Page 23: Practical pig

visitors = LOAD 'hbase://tracking' USING HBaseStorage(
    'open:browser open:ip open:os open:createdDate')
    AS (browser:chararray, ipAddress:chararray, os:chararray, createdDate:chararray);

Resulted in:

java.lang.RuntimeException: Failed to create DataStorage
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
Caused by: Call to hadoopMaster failed on java.io.EOFException

Obscure Pig Pitfalls

Page 24: Practical pig

Join the pig-user mailing list: [email protected]

Use the latest complete Cloudera distribution to avoid version compatibility issues.

Learn the quick & dirty rules for optimizing performance: http://pig.apache.org/docs/r0.9.2/perf.html#performance-enhancers

Use the "set" command to tune your MapReduce jobs: http://pig.apache.org/docs/r0.9.2/cmds.html#set

Test & re-test. Walk through your Pig script in the Grunt shell and use DUMP / DESCRIBE / EXPLAIN / ILLUSTRATE on your variables / operations. Once you're happy with how the script looks on paper, run it on your cluster and examine for places you can tweak the MapReduce job config.
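That walk-through might look like the following in Grunt; a sketch with a hypothetical relation:

```pig
grunt> users = LOAD 'users' AS (name:chararray, age:int);
grunt> DESCRIBE users;     -- prints the schema of the relation
grunt> ILLUSTRATE users;   -- shows sample rows flowing through each operator
grunt> adults = FILTER users BY age >= 18;
grunt> EXPLAIN adults;     -- dumps the logical / physical / MapReduce plans
```

DESCRIBE and EXPLAIN are cheap; DUMP and ILLUSTRATE actually read data, so try them on a small local sample first.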

Recommendations

Page 25: Practical pig

Variable input requires passing arguments from an external wrapper script; we use Groovy scripts to kick start Pig jobs.

def day = new Date()
def dateString = (2..31).collect{ day.minus(it).format("yyyy-MM-dd") }.join(",")
def pig = "/usr/bin/pig -l /dev/null -param dates=${dateString} /path/to/pig/job.pig".execute()

Remember to filter out null data or you'll have wonky results when grouping by that field.

Tell Pig to parallelize reducers; tune for your cluster.
○ SET default_parallel 30;
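The null-data advice can be sketched in Pig Latin (relation and field names hypothetical): rows with a null group key all collapse into one bucket and skew your counts, so filter before grouping:

```pig
SET default_parallel 30;
logs = LOAD '/hdfs/logs' AS (clientId:chararray, bytes:long);
-- drop rows whose group key is null before the GROUP
cleaned = FILTER logs BY clientId IS NOT NULL;
grouped = GROUP cleaned BY clientId;
totals = FOREACH grouped GENERATE group, SUM(cleaned.bytes);
```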

Recommendations

Page 26: Practical pig

Increase the acceptable mapper failure rate (tweak for your cluster size):

SET mapred.reduce.max.attempts 10;
SET mapred.max.tracker.failures 10;
SET mapred.max.map.failures.percent 20;

Recommendations

Page 27: Practical pig

That's All, Folks!

Page 28: Practical pig

Example code & charts from "Practical Problem Solving with Hadoop and Pig" by Milind Bhandarkar ([email protected])
Sample log aggregation script by Jeff Turner ([email protected])
"Nerdy Pig" cartoon by http://artistahinworks.deviantart.com/
"Pig with Goggles" photo via http://funnyanimalsite.com
"Cinderella" photo via http://www.telegraph.co.uk/news/newstopics/howaboutthat/2105763/Meet-Cinderella-Pig-in-Boots.html
"Racing Piglets" via http://marshalltx.us/2012/01/l-a-racing-pig-show-to-be-in-marshall-texas/
"Flying Pig" cartoon via http://veil1.deviantart.com/art/Flying-Pig-198309604
"Fault Tolerance" comic by John Muellerleile (@jrecursive)
"Pug Pig" photo via http://dogswearinghats.tumblr.com/post/8831901318/pug-or-pig
"Angry Birds Pig" via http://samspratt.com
"Oh Bother" cartoon via http://suckerfordragons.deviantart.com/art/Oh-bother-289816100
"Trojan Pig" cartoon via http://www.forbes.com/sites/stevensalzberg/2011/12/29/the-skeptical-optimist/
"Drunk Man Rides Pig" via http://www.youtube.com/watch?v=XA-CSqTTvnM
"Redundancy" via http://www.fakeposters.com/posters/redundancy-you-can-never-be-too-sure/
"That's All, Folks" cartoon via http://www.digitalbusstop.com/pop-culture-illustrations/thats-all-folks/

Credits