-
Introduction to Apache Pig
Pelle Jakovits
28 September 2016, Tartu
-
Outline
• MapReduce recollection
• Apache Pig
– How to run Pig
– Pig Latin
• Data structures
• Examples
– Execution flow
– Advantages & Disadvantages
Pelle Jakovits 2/28
-
You already know MapReduce
• MapReduce = Map, GroupBy, Sort, Reduce
• Designed for huge scale data processing
• Provides
– Distributed file system
– High scalability
– Automatic parallelization
– Automatic fault recovery
• Data is replicated
• Failed tasks are re-executed on other nodes
-
But is MapReduce enough?
• Hadoop MapReduce is one of the most used frameworks for large scale data processing
• However:
– Writing low-level MapReduce code is slow
– Optimizing MapReduce code requires a lot of expertise
– Prototyping is slow
– A lot of custom code is required
• Even for the simplest tasks
– Hard to manage more complex MapReduce job chains
-
Apache Pig
• A data flow framework on top of Hadoop MapReduce
– Retains all of its advantages
– And some of its disadvantages
• Modeled as a scripting language
– Fast prototyping
• Uses the Pig Latin language
– Similar to declarative SQL
– Easier to get started with
• Pig Latin statements are automatically translated into MapReduce jobs
-
Running Pig
• Local mode
– Everything is installed locally on one machine
• Distributed mode
– Everything runs in a MapReduce cluster
• Interactive mode
– Grunt shell
• Batch mode
– Pig scripts
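The two execution modes and two usage modes combine freely; a minimal sketch of the four combinations (script name is a placeholder):

```
pig -x local              # Grunt shell, local mode
pig -x mapreduce          # Grunt shell, on the cluster (the default mode)
pig -x local script.pig   # batch mode, local
pig script.pig            # batch mode, on the cluster
```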
-
Pig Latin
• Write complex MapReduce transformations using a much simpler scripting language
• Not quite SQL, but similar
• Lazy evaluation - statements are only executed when output is requested
• Compilation to MapReduce is hidden from the user
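Lazy evaluation means that defining relations builds only a logical plan; a small sketch (file and field names are placeholders):

```pig
A = LOAD 'in.txt' AS (name:chararray, age:int);  -- nothing executes yet
B = FILTER A BY age >= 18;                       -- still only a logical plan
DUMP B;                                          -- triggers the MapReduce job
```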
-
Pig Latin Data Structures
• Relation
– Similar to a table in a relational database
– Consists of a Bag
– Can contain nested relations
• Bag
– A collection of unordered tuples
• Tuple
– An ordered set of fields
– Similar to a row in a relational database
– Can contain any number of fields; does not have to match other tuples
• Field
– A single piece of data
-
Pig Example
• A = LOAD 'student' USING PigStorage() AS (name, age, gpa);
• DUMP A;
– (John, 18, 4.0F)
– (Mary, 19, 3.8F)
– (Bill, 20, 3.9F)
– (Joe, 18, 3.8F)
• B = GROUP A BY age;
• C = FOREACH B GENERATE group, AVG(A.gpa);
-
WordCount in Pig
A = load '/tmp/books/books';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into '/user/labuser/pelle_jakovits/out';
• Input and output are HDFS folders or files
– /tmp/books/books
– /user/labuser/pelle_jakovits/out
• A, B, C, D are relations
• Right hand side contains Pig expressions
-
Fields
• A field consists of either:
– A data atom - int, long, float, double, chararray, boolean, datetime, etc.
– Complex data - Bag, Map, Tuple
• Assigning types to fields
– A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
• Referencing fields
– By position - $0, $1, $2
– By name - assigned by user-defined schemas
• A = LOAD 'in.txt' AS (age, name, occupation);
-
Complex data types
• Tuples - (a, b, c)
• Bags - {(a,b), (c,d)}
• Maps - [martin#18, daniel#27]
• Looking into complex, nested data
– client.$0
– author.age
• FLATTEN can "explode" a Bag into a set of separate Tuple records
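A small FLATTEN sketch (file and field names are placeholders): after grouping, each row holds a bag, and FLATTEN turns it back into one row per tuple.

```pig
A = LOAD 'student' AS (name:chararray, age:int);
B = GROUP A BY age;                             -- each row: (age, {bag of A tuples})
C = FOREACH B GENERATE group, FLATTEN(A.name);  -- one row per student again
```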
-
Loading and storing data
• LOAD
– A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
– User defines the data loader and delimiters
• STORE
– STORE A INTO 'output_1.txt' USING PigStorage(',');
– STORE B INTO 'output_2.txt' USING PigStorage('*');
• Other data loaders
– BinStorage
– PigDump
– TextLoader
– Or create a custom one
-
FOREACH … GENERATE
• General data transformation statement
• Used to:
– Change the structure of data
– Apply functions to data
– Flatten complex data to remove nesting
• X = FOREACH C GENERATE FLATTEN(A.(a1, a2)), FLATTEN(B.$1);
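As a simpler sketch (file and field names are placeholders), FOREACH … GENERATE can project columns and apply expressions to them:

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name, gpa * 2.0;  -- keep name, scale gpa
```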
-
GROUP … BY
• A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
• DUMP A;
– (John, 18, 4.0F)
– (Mary, 19, 3.8F)
– (Bill, 20, 3.9F)
– (Joe, 18, 3.8F)
• B = GROUP A BY age;
• DUMP B;
– (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
– (19, {(Mary, 19, 3.8F)})
– (20, {(Bill, 20, 3.9F)})
-
JOIN
• A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
• B = LOAD 'data2' AS (b1:int,b2:int);
• X = JOIN A BY a1, B BY b1;
• DUMP A;
– (1,2,3)
– (4,2,1)
• DUMP B;
– (1,3)
– (2,7)
– (4,6)
• DUMP X;
– (1,2,3,1,3)
– (4,2,1,4,6)
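JOIN defaults to an inner join; outer joins are also supported. A sketch using the same relations as above:

```pig
Y = JOIN A BY a1 LEFT OUTER, B BY b1;  -- keeps A tuples that have no match in B
```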
-
UNION
• A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
• B = LOAD 'data2' AS (b1:int, b2:int);
• X = UNION A, B;
• DUMP A;
– (1,2,3)
– (4,2,1)
• DUMP B;
– (2,4)
– (8,9)
• DUMP X;
– (1,2,3)
– (4,2,1)
– (2,4)
– (8,9)
-
Functions
• SAMPLE
– A = LOAD 'data' AS (f1:int,f2:int,f3:int);
– X = SAMPLE A 0.01;
– X will contain roughly 1% of the tuples in A (sampling is probabilistic, so the exact size varies)
• FILTER
– A = LOAD 'data' AS (a1:int, a2:int, a3:int);
– X = FILTER A BY a3 == 3;
-
Functions
• DISTINCT – removes duplicate tuples
– X = DISTINCT A;
• LIMIT – returns at most the given number of tuples
– X = LIMIT B 3;
• SPLIT – partitions a relation into several relations by condition
– SPLIT A INTO X IF f1 < 7, Y IF f1 >= 7;
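A sketch of processing the split halves separately (file names and the split condition are placeholders):

```pig
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
SPLIT A INTO small IF f1 < 7, big IF f1 >= 7;
STORE small INTO 'out_small';
STORE big INTO 'out_big';
```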
-
Nested Pig Statements
A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS (last_name, first_name, balance, address, city, last_transaction, bank_name);
B = GROUP A BY city;
C = FOREACH B {
    banks = A.bank_name;
    unique_banks = DISTINCT banks;
    GENERATE group AS city, unique_banks;
};
-
User Defined Functions (UDF)
• When the built-in Pig functions are not enough
• When we want to modify the behaviour of built-in functions
• Load Pig UDF from jar
REGISTER myudfs.jar;
A = load '/tmp/books/books';
B = foreach A generate flatten(myudfs.TOKENIZE((chararray)$0)) as word;
-
Pig UDF
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MYTOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray");
            }
            // Split the input string on whitespace and punctuation
            StringTokenizer tok = new StringTokenizer((String) o, " \",()*");
            while (tok.hasMoreTokens())
                output.add(mTupleFactory.newTuple(tok.nextToken()));
            return output;
        } catch (ExecException ee) {
            throw new IOException("Error processing input", ee);
        }
    }
}
-
Pig workflow
(workflow diagrams not included in this transcript)
-
Advantages of Pig
• Easy to program
– ~5% of the code, ~5% of the time required compared to plain MapReduce
• Self-optimizing
– Pig Latin statement optimizations
– Generated MapReduce code optimizations
• Can manage more complex data flows
– Easy to use and join multiple separate inputs, transformations and outputs
• Extensible
– Can be extended with User Defined Functions (UDFs) to provide more functionality
-
Pig disadvantages
• Slow start-up and clean-up of MapReduce jobs
– It takes time for Hadoop to schedule MR jobs
• Not suitable for interactive OLAP Analytics
– When results are expected in < 1 sec
• Complex applications may require many UDFs
– Pig loses its simplicity advantage over MapReduce
-
DEMO
TFIDF in Pig
-
That's All
• This week's practice session
– Processing data with Pig
– Processing unclaimed bank accounts, but this time using Pig
• Next lecture: Spark