-
Introduction to Apache Pig
Pelle Jakovits
28 September 2016, Tartu
-
Outline
• MapReduce recollection
• Apache Pig
– How to run Pig
– Pig Latin
• Data structures
• Examples
– Execution flow
– Advantages & Disadvantages
Pelle Jakovits 2/28
-
You already know MapReduce
• MapReduce = Map, GroupBy, Sort, Reduce
• Designed for huge scale data processing
• Provides
– Distributed file system
– High scalability
– Automatic parallelization
– Automatic fault recovery
• Data is replicated
• Failed tasks are re-executed on other nodes
-
But is MapReduce enough?
• Hadoop MapReduce is one of the most used frameworks for large scale data processing
• However:
– Writing low-level MapReduce code is slow
– Optimizing MapReduce code requires a lot of expertise
– Prototyping is slow
– A lot of custom code is required
• Even for the simplest tasks
– Hard to manage more complex MapReduce job chains
-
Apache Pig
• A data flow framework on top of Hadoop MapReduce
– Retains all of its advantages
– And some of its disadvantages
• Modeled as a scripting language
– Fast prototyping
• Uses the Pig Latin language
– Similar to declarative SQL
– Easier to get started with
• Pig Latin statements are automatically translated into MapReduce jobs
-
Running Pig
• Local mode
– Everything is installed locally on one machine
• Distributed mode
– Everything runs in a MapReduce cluster
• Interactive mode
– Grunt shell
• Batch mode
– Pig scripts
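The two execution modes and two usage modes combine freely; a minimal sketch of the four combinations (script name is a placeholder):

```
pig -x local              # Grunt shell, local mode
pig -x mapreduce          # Grunt shell, on the cluster (the default mode)
pig -x local script.pig   # batch mode, local
pig script.pig            # batch mode, on the cluster
```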
-
Pig Latin
• Write complex MapReduce transformations using a much simpler scripting language
• Not quite SQL, but similar
• Lazy evaluation - statements are only executed when output is requested
• Compilation to MapReduce is hidden from the user
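Lazy evaluation means that defining relations builds only a logical plan; a small sketch (file and field names are placeholders):

```pig
A = LOAD 'in.txt' AS (name:chararray, age:int);  -- nothing executes yet
B = FILTER A BY age >= 18;                       -- still only a logical plan
DUMP B;                                          -- triggers the MapReduce job
```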
-
Pig Latin Data Structures
• Relation
– Similar to a table in a relational database
– Consists of a Bag
– Can contain nested relations
• Bag
– A collection of unordered tuples
• Tuple
– An ordered set of fields
– Similar to a row in a relational database
– Can contain any number of fields; does not have to match other tuples
• Field
– A single piece of data
-
Pig Example
• A = LOAD 'student' USING PigStorage() AS (name, age, gpa);
• DUMP A;
– (John, 18, 4.0F)
– (Mary, 19, 3.8F)
– (Bill, 20, 3.9F)
– (Joe, 18, 3.8F)
• B = GROUP A BY age;
• C = FOREACH B GENERATE group, AVG(A.gpa);
-
WordCount in Pig
A = load '/tmp/books/books';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into '/user/labuser/pelle_jakovits/out';
• Input and output are HDFS folders or files
– /tmp/books/books
– /user/labuser/pelle_jakovits/out
• A, B, C, D are relations
• Right hand side contains Pig expressions
-
Fields
• A field consists of either:
– A data atom - int, long, float, double, chararray, boolean, datetime, etc.
– Complex data - Bag, Map, Tuple
• Assigning types to fields
– A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
• Referencing fields
– By position - $0, $1, $2
– By name - assigned by user-defined schemas
• A = LOAD 'in.txt' AS (age, name, occupation);
-
Complex data types
• Tuples - (a, b, c)
• Bags - {(a,b), (c,d)}
• Maps - [martin#18, daniel#27]
• Looking into complex, nested data
– client.$0
– author.age
• FLATTEN can "explode" a Bag into a set of separate Tuple records
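A small FLATTEN sketch (file and field names are placeholders): after grouping, each row holds a bag, and FLATTEN turns it back into one row per tuple.

```pig
A = LOAD 'student' AS (name:chararray, age:int);
B = GROUP A BY age;                             -- each row: (age, {bag of A tuples})
C = FOREACH B GENERATE group, FLATTEN(A.name);  -- one row per student again
```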
-
Loading and storing data
• LOAD
– A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
– User defines the data loader and delimiters
• STORE
– STORE A INTO 'output_1.txt' USING PigStorage(',');
– STORE B INTO 'output_2.txt' USING PigStorage('*');
• Other data loaders
– BinStorage
– PigDump
– TextLoader
– Or create a custom one
-
FOREACH … GENERATE
• General data transformation statement
• Used to:
– Change the structure of data
– Apply functions to data
– Flatten complex data to remove nesting
• X = FOREACH C GENERATE FLATTEN(A.(a1, a2)), FLATTEN(B.$1);
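As a simpler sketch (file and field names are placeholders), FOREACH … GENERATE can project columns and apply expressions to them:

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name, gpa * 2.0;  -- keep name, scale gpa
```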
-
GROUP … BY
• A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
• DUMP A;
– (John, 18, 4.0F)
– (Mary, 19, 3.8F)
– (Bill, 20, 3.9F)
– (Joe, 18, 3.8F)
• B = GROUP A BY age;
• DUMP B;
– (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
– (19, {(Mary, 19, 3.8F)})
– (20, {(Bill, 20, 3.9F)})
-
JOIN
• A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
• B = LOAD 'data2' AS (b1:int,b2:int);
• X = JOIN A BY a1, B BY b1;
• DUMP A;
– (1,2,3)
– (4,2,1)
• DUMP B;
– (1,3)
– (2,7)
– (4,6)
• DUMP X;
– (1,2,3,1,3)
– (4,2,1,4,6)
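JOIN defaults to an inner join; outer joins are also supported. A sketch using the same relations as above:

```pig
Y = JOIN A BY a1 LEFT OUTER, B BY b1;  -- keeps A tuples that have no match in B
```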
-
UNION
• A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
• B = LOAD 'data2' AS (b1:int, b2:int);
• X = UNION A, B;
• DUMP A;
– (1,2,3)
– (4,2,1)
• DUMP B;
– (2,4)
– (8,9)
• DUMP X;
– (1,2,3)
– (4,2,1)
– (2,4)
– (8,9)
-
Functions
• SAMPLE
– A = LOAD 'data' AS (f1:int,f2:int,f3:int);
– X = SAMPLE A 0.01;
– X will contain roughly 1% of the tuples in A (sampling is probabilistic, so the exact size varies)
• FILTER
– A = LOAD 'data' AS (a1:int, a2:int, a3:int);
– X = FILTER A BY a3 == 3;
-
Functions
• DISTINCT – removes duplicate tuples
– X = DISTINCT A;
• LIMIT – returns at most the given number of tuples
– X = LIMIT B 3;
• SPLIT – partitions a relation into several relations by condition
– SPLIT A INTO X IF f1 < 7, Y IF f1 >= 7;
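A sketch of processing the split halves separately (file names and the split condition are placeholders):

```pig
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
SPLIT A INTO small IF f1 < 7, big IF f1 >= 7;
STORE small INTO 'out_small';
STORE big INTO 'out_big';
```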
-
Nested Pig Statements
A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS (last_name, first_name, balance, address, city, last_transaction, bank_name);
B = GROUP A BY city;
C = FOREACH B {
    banks = A.bank_name;
    unique_banks = DISTINCT banks;
    GENERATE group AS city, unique_banks;
};
-
User Defined Functions (UDF)
• When the built-in Pig functions are not enough
• When we want to modify the behaviour of built-in functions
• Load Pig UDF from jar
REGISTER myudfs.jar;
A = load '/tmp/books/books';
B = foreach A generate flatten(myudfs.TOKENIZE((chararray)$0)) as word;
-
Pig UDF
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MYTOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray");
            }
            // Split the input string on whitespace and punctuation
            StringTokenizer tok = new StringTokenizer((String) o, " \",()*");
            while (tok.hasMoreTokens())
                output.add(mTupleFactory.newTuple(tok.nextToken()));
            return output;
        } catch (ExecException ee) {
            throw new IOException("Error processing input", ee);
        }
    }
}
-
Pig workflow
(workflow diagrams not included in this transcript)
-
Advantages of Pig
• Easy to program
– ~5% of the code, ~5% of the time required compared to plain MapReduce
• Self-optimizing
– Pig Latin statement optimizations
– Generated MapReduce code optimizations
• Can manage more complex data flows
– Easy to use and join multiple separate inputs, transformations and outputs
• Extensible
– Can be extended with User Defined Functions (UDFs) to provide more functionality
-
Pig disadvantages
• Slow start-up and clean-up of MapReduce jobs
– It takes time for Hadoop to schedule MR jobs
• Not suitable for interactive OLAP Analytics
– When results are expected in < 1 sec
• Complex applications may require many UDFs
– Pig loses its simplicity advantage over MapReduce
-
DEMO
TFIDF in Pig
-
That's All
• This week's practice session
– Processing data with Pig
– Processing unclaimed bank accounts, but this time using Pig
• Next lecture: Spark