Apache Pig: Making data transformation easy

Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015


Post on 21-Jan-2017



Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Map Reduce Problem Solving

Complex problem


➢ Need to solve a complex problem

➢ It requires more complex atomic operations than M/R offers

➢ Java is not a data-oriented language → Low productivity

➢ Any solutions?


Apache Pig to the rescue!


Join in Apache Hadoop

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper for the delivery file: emits (cellNumber, "DR~" + deliveryCode).
public class DeliveryFileMapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, Text> {
    private String cellNumber, deliveryCode, fileTag = "DR~";

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        String splitarray[] = line.split(",");
        cellNumber = splitarray[0].trim();
        deliveryCode = splitarray[1].trim();
        output.collect(new Text(cellNumber), new Text(fileTag + deliveryCode));
    }
}

// Reducer: joins the tagged values that share a key.
// "CD" values carry the customer name, "DR" values carry a delivery code.
public class SmsReducer extends MapReduceBase implements
        Reducer<Text, Text, Text, Text> {
    private String customerName, deliveryReport;
    private static Map<String, String> DeliveryCodesMap = new HashMap<String, String>();

    public void configure(JobConf job) {
        loadDeliveryStatusCodes(); // definition not shown on the slide
    }

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
            String currValue = values.next().toString();
            String valueSplitted[] = currValue.split("~");
            if (valueSplitted[0].equals("CD"))
                customerName = valueSplitted[1].trim();
            else if (valueSplitted[0].equals("DR"))
                deliveryReport = DeliveryCodesMap.get(valueSplitted[1].trim());
        }
        // Emit the joined pair, using a placeholder for whichever side is missing.
        if (customerName != null && deliveryReport != null)
            output.collect(new Text(customerName), new Text(deliveryReport));
        else if (customerName == null)
            output.collect(new Text("customerName"), new Text(deliveryReport));
        else if (deliveryReport == null)
            output.collect(new Text(customerName), new Text("deliveryReport"));
    }
}

** Extracted from http://kickstarthadoop.blogspot.com.es/2011/09/joins-with-plain-map-reduce.html


Join in Apache Pig

A = JOIN A BY keyA, B BY keyB;


Apache Pig overview

➢ Framework layer over HDFS and Hadoop

➢ Developed at Yahoo in 2006

➢ Users: Yahoo, LinkedIn, Twitter, IBM, etc.

➢ Last major release: 0.14.0 (November 2014)

http://pig.apache.org/


Apache Hadoop vs. Apache Pig

Apache Hadoop:
➢ M/R as atomic operations
➢ Java is not data oriented
➢ M/R inner flexibility
➢ Efficiency

Apache Pig:
➢ ETL operations: Join, Filter, Group, etc.
➢ Pig Latin: Data scripting language
➢ UDF with Java (and others)
➢ Transform to M/R overhead


Pig Programming Model: Data

➢ Pig operations operate on relations

➢ A relation is a bag

➢ A bag is a collection of tuples

➢ A tuple is an ordered set of fields

➢ A field is any type of data
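The nesting described above can be modelled with ordinary Python values. This is only an illustrative sketch in plain Python, not Pig code: tuples stand for Pig tuples, lists stand for bags, and the variable names are made up for the example.

```python
# A field is a single value; a tuple is an ordered sequence of fields.
john = (1, "John", "Doe", "M", 18)

# A bag is a collection of tuples, modelled here as a plain list.
# The outermost bag plays the role of a relation.
students = [
    john,
    (2, "Mary", "Doe", "F", 20),
    (3, "Lara", "Croft", "F", 25),
]

# A field can itself be a bag, so bags can nest inside tuples.
# Here: one tuple holding a key and a bag of the matching tuples.
by_gender = ("F", [t for t in students if t[3] == "F"])

print(len(students))    # number of tuples in the relation
print(by_gender[1][0])  # first tuple of the nested bag
```

Real Pig bags are distributed and unordered; the list here only captures the containment structure.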


Sounds complicated… but it’s not!

➢ Basic data types:
○ Boolean: True, False
○ Int and Long: 1, 2, 3, 4, 5
○ Float and Double: 2.3, 1.4, 4.5
○ Chararray: ‘Hello’, ‘I am a string’
○ DateTime: 2014-09-11T12:20:14.1234+00:00
○ … and more, but you probably won’t use them very often


Sounds complicated… but it’s not!

➢ Tuple: A catch-all data type


Sounds complicated… but it’s not!

➢ Bag:

➢ And relations? Just the outermost (distributed) bags


Loading data?

➢ Loading data? No, first let’s meet our friend Grunt

➢ Interactive pig shell → Nice for debugging/experimenting

➢ pig -x local or pig -x mapred


Loading data?

➢ Data source: Local or HDFS (usually!)
➢ LOAD instruction:
○ Data is automatically loaded in a distributed relation

Students = LOAD 'student_path' USING PigStorage('\t', '-noschema') AS (student_id: Long, name: Chararray, surname: Chararray, gender: Chararray, age: Int);

(Students: relation name; 'student_path': path to HD/HDFS; PigStorage: connector; '\t': field separator; AS (...): tuple schema)


Grades = LOAD 'grade_path' USING PigStorage(',', '-schema');

(Grades: relation name; 'grade_path': path to HD/HDFS; PigStorage: connector; ',': field separator; '-schema': load schema from .pig_schema)


Checking relations’ content

➢ DUMP instruction:
○ Prints the content of a relation at standard output

DUMP Students;
(Students: relation name)

(1,John,Doe,M,18)
(2,Mary,Doe,F,20)
(3,Lara,Croft,F,25)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(6,Sarah,Kerrigan,F,21)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(9,Princess,Peach,F,21)
(10,Peter,Parker,M,23)
grunt>


Checking relations’ content

➢ DESCRIBE instruction:
○ Prints the schema of the relation at standard output

DESCRIBE Students;
(Students: relation name)

Students: {student_id: long,name: chararray,surname: chararray,gender: chararray,age: int}
grunt>


Checking relations’ content

➢ ILLUSTRATE instruction:
○ Prints the schema of the relation and a tuple example at standard output

ILLUSTRATE Students;
(Students: relation name)

------------------------------------------------------------------------------------------------
| Students | student_id:long | name:chararray | surname:chararray | gender:chararray | age:int |
------------------------------------------------------------------------------------------------
|          | 9               | Princess       | Peach             | F                | 21      |
------------------------------------------------------------------------------------------------
grunt>


Operating on relations

➢ FOREACH instruction:
○ Generate new relations by projecting data of a relation

StudentsProj = FOREACH Students GENERATE student_id, name, age;

(StudentsProj: relation name; Students: base relation; student_id, name, age: projected data)


StudentsProj = FOREACH Students GENERATE student_id, CONCAT(name,surname) AS full_name, age;

(StudentsProj: relation name; Students: base relation; student_id, full_name, age: projected data)

We can generate new data too!!
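As a rough mental model, a FOREACH … GENERATE behaves like a per-tuple mapping. The sketch below imitates the projection with CONCAT in plain Python (not Pig); the two sample tuples come from the Students data shown earlier, and the variable names are illustrative.

```python
students = [
    (1, "John", "Doe", "M", 18),
    (2, "Mary", "Doe", "F", 20),
]

# FOREACH Students GENERATE student_id, CONCAT(name, surname) AS full_name, age;
# CONCAT joins the two chararrays directly, without a separator.
students_proj = [
    (student_id, name + surname, age)
    for (student_id, name, surname, _gender, age) in students
]

print(students_proj)
```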


Operating on relations

➢ FOREACH instruction:
○ Let us execute the instruction and… it seems that nothing happens!
○ We had some tracing output with LOAD, DUMP, and ILLUSTRATE…
○ Any ideas on this issue?


Operating on relations

➢ Pig employs lazy evaluation

➢ Computation only when:
○ LOAD, ILLUSTRATE, DUMP, STORE

➢ Pig keeps a DAG of the MR jobs needed to compute relations (optimized!)
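The idea of lazy evaluation can be illustrated with a toy sketch in plain Python. This is an assumption-laden model, not how Pig is implemented: the `LazyRelation` class and the `evaluated` log are hypothetical names, and no MapReduce jobs or optimization are involved. Defining a derived relation only records how to compute it; work happens when a result is requested.

```python
evaluated = []  # names of relations actually computed, in order

class LazyRelation:
    """Remembers how to compute a relation without computing it."""

    def __init__(self, name, compute):
        self.name = name
        self._compute = compute

    def foreach(self, name, fn):
        # Only records the plan: the parent is evaluated later, on demand.
        parent = self
        return LazyRelation(name, lambda: [fn(t) for t in parent.evaluate()])

    def evaluate(self):
        # Plays the role of an output instruction such as DUMP.
        evaluated.append(self.name)
        return self._compute()

students = LazyRelation("Students", lambda: [(1, "John", 18), (2, "Mary", 20)])
students_proj = students.foreach("StudentsProj", lambda t: (t[0], t[2]))

before_dump = list(evaluated)       # nothing has been computed so far
dumped = students_proj.evaluate()   # now the whole chain runs

print(before_dump, dumped, evaluated)
```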


Exercise: Who is under 25?

➢ Extend the Students relation with a field that indicates whether each student is under 25 years old

(1,John,Doe,M,18,true)
(2,Mary,Doe,F,20,true)
(3,Lara,Croft,F,25,false)
(4,Sherlock,Holmes,M,36,false)
(5,John,Watson,M,38,false)
(6,Sarah,Kerrigan,F,21,true)
...


Operating on relations

➢ FILTER instruction:
○ Generate a new relation by filtering data on a relation

StudentsFilt = FILTER Students BY age > 24 AND age < 34;

(StudentsFilt: relation name; Students: base relation; age > 24 AND age < 34: condition to fulfill)

DUMP StudentsFilt;
(3,Lara,Croft,F,25)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
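The same filter can be checked by hand with a plain-Python list comprehension over the ten Students tuples dumped earlier. This is a sketch of the semantics only, not Pig code.

```python
students = [
    (1, "John", "Doe", "M", 18),
    (2, "Mary", "Doe", "F", 20),
    (3, "Lara", "Croft", "F", 25),
    (4, "Sherlock", "Holmes", "M", 36),
    (5, "John", "Watson", "M", 38),
    (6, "Sarah", "Kerrigan", "F", 21),
    (7, "Bruce", "Wayne", "M", 32),
    (8, "Tony", "Stark", "M", 33),
    (9, "Princess", "Peach", "F", 21),
    (10, "Peter", "Parker", "M", 23),
]

# FILTER Students BY age > 24 AND age < 34;  (age is the fifth field)
students_filt = [t for t in students if 24 < t[4] < 34]

for t in students_filt:
    print(t)
```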


Operating on relations

➢ SPLIT instruction:
○ Splits a relation into multiple relations based on conditions

SPLIT Students INTO StudentsMale IF gender == 'M', StudentsFemale OTHERWISE;

(Students: base relation; StudentsMale: new relation, with the condition it must fulfill; StudentsFemale: new relation, where OTHERWISE means the rest)

DUMP StudentsMale;
(1,John,Doe,M,18)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(10,Peter,Parker,M,23)


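SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder30 IF age<30, OtherStudents OTHERWISE;

DUMP OtherStudents;
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(8,Tony,Stark,M,33)

○ Conditions may overlap: a tuple that satisfies several conditions is placed in every matching relation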


Operating on relations

➢ GROUP instruction:
○ Creates tuples with the key and a bag of the tuples that share that key value

StudentsGr = GROUP Students BY gender;

(StudentsGr: new relation; Students: base relation; gender: field whose values are used to make groups)

DUMP StudentsGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(3,Lara,Croft,F,25),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36),(1,John,Doe,M,18)})

DESCRIBE StudentsGr;
StudentsGr: {group: chararray,Students: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}

New schema!
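The shape of a grouped relation, one (group, bag) tuple per key, can be reproduced with a small plain-Python sketch. It models the semantics only and is not Pig code; the order of tuples inside each bag is not meaningful in Pig, and the sorting at the end is just for stable printing.

```python
from collections import defaultdict

students = [
    (1, "John", "Doe", "M", 18),
    (2, "Mary", "Doe", "F", 20),
    (3, "Lara", "Croft", "F", 25),
    (4, "Sherlock", "Holmes", "M", 36),
    (5, "John", "Watson", "M", 38),
    (6, "Sarah", "Kerrigan", "F", 21),
    (7, "Bruce", "Wayne", "M", 32),
    (8, "Tony", "Stark", "M", 33),
    (9, "Princess", "Peach", "F", 21),
    (10, "Peter", "Parker", "M", 23),
]

# GROUP Students BY gender;  (gender is the fourth field)
bags = defaultdict(list)
for t in students:
    bags[t[3]].append(t)

# Each output tuple is (group, bag of the original tuples with that key).
students_gr = sorted(bags.items())

for group, bag in students_gr:
    print(group, len(bag))
```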


Operating on relations

➢ GROUP instruction:
○ We can use multiple relations. Creates one bag per relation

StudentsCoGr = GROUP StudentsUnder25 BY gender, OtherStudents BY gender;

(StudentsCoGr: new relation; StudentsUnder25, OtherStudents: base relations; gender: field whose values are used to make groups)

DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(8,Tony,Stark,M,33),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36)})

DESCRIBE StudentsCoGr;
StudentsCoGr: {group: chararray,StudentsUnder25: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)},OtherStudents: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}

New schema!


Operating on relations

➢ Nested FOREACH:
○ Operate on data in bags inside a relation and then project

StudentsNested = FOREACH StudentsGr {
    Information = FOREACH Students GENERATE name, surname;
    GENERATE group AS gender, Information AS student_information;
}

(StudentsNested: new relation; StudentsGr: base relation; Students: bag inside base relation; Information: new bag; GENERATE: finally project)

DUMP StudentsNested;
(F,{(Princess,Peach),(Sarah,Kerrigan),(Lara,Croft),(Mary,Doe)})
(M,{(Peter,Parker),(Tony,Stark),(Bruce,Wayne),(John,Watson),(Sherlock,Holmes),(John,Doe)})


Operating on relations

➢ (inner) JOIN instruction:
○ Our classic database operator for relations!

StudentsGrades = JOIN Students BY student_id, Grades BY student_id;

(StudentsGrades: new relation; Students, Grades: base relations; student_id: field whose values are used to match tuples)

DUMP StudentsGrades;
(1,John,Doe,M,18,1,Physics,2.3)
(1,John,Doe,M,18,1,Biology,4.5)
(1,John,Doe,M,18,1,Engineering,7.7)
(1,John,Doe,M,18,1,Math,5.6)
(2,Mary,Doe,F,20,2,Engineering,6.7)
(2,Mary,Doe,F,20,2,Physics,6.7)
…

DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}

New schema!


Operating on relations

➢ (left) JOIN instruction:
○ Our classic database operator for relations!

StudentsGrades = JOIN Students BY student_id LEFT, Grades BY student_id;

(StudentsGrades: new relation; Students: left relation; Grades: right relation; LEFT: do not forget this one!)

DUMP StudentsGrades;
(6,Sarah,Kerrigan,F,21,,,)
(7,Bruce,Wayne,M,32,7,Engineering,8.5)
(7,Bruce,Wayne,M,32,7,Physics,8.9)
(7,Bruce,Wayne,M,32,7,Math,8.5)
(8,Tony,Stark,M,33,8,Math,6.7)
…

DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}

New schema!
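The difference between the inner and the left join can be sketched in plain Python with a naive nested-loop join. This models the result only, not how Pig executes joins; the `join` helper and `NULL_GRADE` are hypothetical names, and the data is the subset of Students and Grades visible in the dump above.

```python
students = [
    (6, "Sarah", "Kerrigan", "F", 21),
    (7, "Bruce", "Wayne", "M", 32),
    (8, "Tony", "Stark", "M", 33),
]
grades = [
    (7, "Engineering", 8.5),
    (7, "Physics", 8.9),
    (7, "Math", 8.5),
    (8, "Math", 6.7),
]

NULL_GRADE = (None, None, None)  # stands for the empty right-hand fields

def join(left, right, left_outer=False):
    """Join on the first field of each tuple (student_id)."""
    joined = []
    for s in left:
        matches = [g for g in right if g[0] == s[0]]
        if matches:
            joined.extend(s + g for g in matches)
        elif left_outer:
            # Left join: keep the unmatched left tuple, padded with nulls.
            joined.append(s + NULL_GRADE)
    return joined

inner = join(students, grades)
left = join(students, grades, left_outer=True)

print(len(inner), len(left))
print(left[0])
```

Sarah Kerrigan has no grades, so she disappears from the inner join and appears once, padded with nulls, in the left join.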


Operating on relations

➢ CROSS instruction:
○ Cartesian product of two or more relations

StudentsCr = CROSS Students, Grades;

(StudentsCr: new relation; Students: relation 1; Grades: relation 2)

DUMP StudentsCr;
(10,Peter,Parker,M,23,10,Physics,3.3)
(10,Peter,Parker,M,23,9,Physics,5.0)
(10,Peter,Parker,M,23,7,Physics,8.9)
(10,Peter,Parker,M,23,5,Physics,4.5)
(10,Peter,Parker,M,23,4,Physics,6.6)
(10,Peter,Parker,M,23,3,Physics,5.7)
(10,Peter,Parker,M,23,2,Physics,6.7)
(10,Peter,Parker,M,23,1,Physics,2.3)
…

DESCRIBE StudentsCr;
StudentsCr: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}

New schema!


Operating on relations

➢ UNION instruction:
○ Merges multiple relations into a single relation

StudentsUnion = UNION Students, Grades;

(StudentsUnion: new relation; Students: relation 1; Grades: relation 2)

DUMP StudentsUnion;
(1,John,Doe,M,18)
(1,Math,5.6)
(2,Mary,Doe,F,20)
(2,Math,8.9)
(3,Lara,Croft,F,25)
(3,Math,7.1)
…

DESCRIBE StudentsUnion;
Schema for StudentsUnion unknown.

Union does not preserve schemas when the input schemas differ!


Operating on relations

➢ DISTINCT instruction:
○ Only preserves unique tuples

Courses = FOREACH Grades GENERATE course AS course;
UniqueCourses = DISTINCT Courses;

(UniqueCourses: new relation)

DUMP UniqueCourses;
(Math)
(Biology)
(Physics)
(Engineering)


Operating on relations

➢ ORDER BY instruction:
○ Sorts a relation by a specific criterion

SortedGrades = ORDER Grades BY mark DESC;

(SortedGrades: new relation; Grades: base relation; mark: field(s) used to sort; sort criterion: DESC (descending) or ASC (ascending))

DUMP SortedGrades;
(2,Biology,10.0)
(10,Engineering,10.0)
(10,Math,10.0)
(5,Biology,10.0)
(5,Engineering,9.0)
(7,Physics,8.9)
…


Operating on relations

➢ LIMIT instruction:
○ Truncates relation’s size

BestGrades = LIMIT SortedGrades 3;

(BestGrades: new relation; SortedGrades: base relation; 3: maximum number of tuples)

DUMP BestGrades;
(10,Math,10.0)
(10,Engineering,10.0)
(2,Biology,10.0)


Operating on relations

➢ RANK instruction:
○ Appends position of each tuple in the relation

RankedGrades = RANK SortedGrades;

(RankedGrades: new relation; SortedGrades: base relation)

DUMP RankedGrades;
(1,2,Biology,10.0)
(2,10,Engineering,10.0)
(3,10,Math,10.0)
(4,5,Biology,10.0)
(5,5,Engineering,9.0)
…

DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}

(rank_SortedGrades: rank number!)


Operating on relations

➢ RANK instruction:
○ We can also sort and rank!

RankedGrades = RANK SortedGrades BY student_id ASC, mark DESC;

(RankedGrades: new relation; SortedGrades: base relation; student_id ASC, mark DESC: fields to sort)

DUMP RankedGrades;
(1,1,Engineering,7.7)
(2,1,Math,5.6)
(3,1,Biology,4.5)
(4,1,Physics,2.3)
(5,2,Biology,10.0)
…

DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}


Operating on relations

➢ SAMPLE instruction:
○ Sample the relation!

SampledGrades = SAMPLE Grades 0.05;

(SampledGrades: new relation; Grades: base relation; 0.05: proportion to sample)

DUMP SampledGrades;
(4,Engineering,8.0)


Exercise: Top grades

➢ Get the 3 top grades for each student

(1,{(Engineering,7.7),(Math,5.6),(Biology,4.5)})
(2,{(Biology,10.0),(Math,8.9),(Engineering,6.7)})
(3,{(Math,7.1),(Physics,5.7),(Engineering,4.3)})
(4,{(Engineering,8.0),(Biology,6.7),(Physics,6.6)})
(5,{(Biology,10.0),(Engineering,9.0),(Math,6.7)})
(6,{(,)})
...


Operating on relations

➢ CUBE instruction:
○ Is this really useful? Yes! Many aggregates with just one operation

CubedGrades = CUBE Grades BY CUBE(student_id,course);
CubedGrades = FOREACH CubedGrades GENERATE group, AVG(cube.mark);
DUMP CubedGrades;

((,Math),7.188888888888889)
((,Biology),7.8)
((,Physics),5.375)
((,Engineering),6.877777777777778)
((,),6.729032258064516)
((2,Math),8.9)
((2,Biology),10.0)
((2,),8.075)
…
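To see where the empty positions in the keys come from, the sketch below imitates a two-dimensional cube in plain Python: every tuple contributes to all four combinations of keeping or nulling each dimension. It is an illustrative model, not Pig code; `cube_keys` is a hypothetical helper, `None` stands for the null shown as an empty position in the dump, and the three sample grades are a small subset of the data.

```python
from collections import defaultdict
from itertools import product

grades = [
    (2, "Math", 8.9),
    (2, "Biology", 10.0),
    (3, "Math", 7.1),
]

def cube_keys(student_id, course):
    """All combinations of keeping or nulling each dimension."""
    return list(product((student_id, None), (course, None)))

# Collect the marks that fall under every generated key.
marks_by_key = defaultdict(list)
for student_id, course, mark in grades:
    for key in cube_keys(student_id, course):
        marks_by_key[key].append(mark)

# AVG(cube.mark) for each group.
averages = {key: sum(m) / len(m) for key, m in marks_by_key.items()}

for key, avg in averages.items():
    print(key, avg)
```

The key `(None, None)` aggregates every tuple, `(None, course)` aggregates per course, and `(student_id, None)` aggregates per student.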


Operating on relations

➢ CUBE/ROLLUP instruction:
○ Like standard CUBE, but null values are introduced from right to left

RolledGrades = CUBE Grades BY ROLLUP(course,student_id);
RolledGrades = FOREACH RolledGrades GENERATE group, AVG(cube.mark);
DUMP RolledGrades;

(ROLLUP(course,student_id): order matters!)

((Math,),7.188888888888889)
((Math,2),8.9)
((Math,3),7.1)
((Math,4),2.3)
((Math,5),6.7)
((Math,7),8.5)
((Math,8),6.7)
((Math,9),8.9)
…
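The right-to-left nulling can be sketched in plain Python as follows. This is an illustrative model rather than Pig code: `rollup_keys` is a hypothetical helper, `None` stands for null, and the sample grades are a small subset of the data.

```python
from collections import defaultdict

grades = [
    (2, "Math", 8.9),
    (2, "Biology", 10.0),
    (3, "Math", 7.1),
]

def rollup_keys(*dims):
    """Null out dimensions one at a time, starting from the rightmost."""
    n = len(dims)
    return [dims[:i] + (None,) * (n - i) for i in range(n, -1, -1)]

# ROLLUP(course, student_id): course is the leading dimension.
marks_by_key = defaultdict(list)
for student_id, course, mark in grades:
    for key in rollup_keys(course, student_id):
        marks_by_key[key].append(mark)

rolled_avg = {key: sum(m) / len(m) for key, m in marks_by_key.items()}

print(rollup_keys("Math", 2))
print(sorted(rolled_avg, key=str))
```

Because only trailing dimensions are nulled, a key such as `(None, 2)` is never produced, which is why the order of the ROLLUP arguments matters.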


Operating on relations

➢ ASSERT instruction:
○ Assert that the whole relation fulfills a condition
○ Useful for debugging

ASSERT Grades BY mark > 0.0, 'marks should be greater than 0';

(Grades: base relation; mark > 0.0: condition; 'marks should be greater than 0': error message)


Finally, storing data!

➢ STORE instruction:
○ Stores the relation into the local FS or HDFS (usually!)
○ Useful for debugging

STORE BestGrades INTO 'best_grades_path' USING PigStorage('\t', '-noschema');

(BestGrades: relation; 'best_grades_path': path to store data; PigStorage: connector; '\t': field separator)


Problems solved?!


Only these operations?

➢ ASSERT
➢ GROUP
➢ CROSS
➢ CUBE
➢ DISTINCT
➢ FILTER
➢ FOREACH
➢ JOIN
➢ LIMIT
➢ LOAD
➢ ORDER, RANK
➢ SAMPLE
➢ SPLIT
➢ UNION
➢ DUMP, ILLUSTRATE, DESCRIBE


Functions & user defined functions

➢ Transform data in data projections

➢ Built-in functions:
○ math functions, string functions, datetime functions, casting functions, etc.

➢ User defined functions:
○ Our own functions written in Java, Python, Ruby, Javascript, etc.


Functions & user defined functions

➢ Bag functions:
○ AVG/MAX/MIN/SUM: compute the average/max/min/sum of a bag of numeric values

GradesGr = GROUP Grades BY course;
GradesAvg = FOREACH GradesGr GENERATE group AS course, AVG(Grades.mark) AS avg_mark;

(Grades.mark: employ only this field in bag/tuple)

DUMP GradesAvg;
(Math,7.188888888888889)
(Biology,7.8)
(Physics,5.375000000000001)
(Engineering,6.877777777777777)
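The group-then-aggregate pattern can be traced by hand with a plain-Python sketch (not Pig code). The five sample grades are a small subset of the data, so the averages differ from the full output above.

```python
from collections import defaultdict

grades = [
    (1, "Math", 5.6),
    (2, "Math", 8.9),
    (3, "Math", 7.1),
    (1, "Biology", 4.5),
    (2, "Biology", 10.0),
]

# GROUP Grades BY course, keeping only the mark field of each tuple,
# which is what the projection Grades.mark does inside AVG(...).
marks_by_course = defaultdict(list)
for _student_id, course, mark in grades:
    marks_by_course[course].append(mark)

# AVG over each bag of marks.
grades_avg = {course: sum(m) / len(m) for course, m in marks_by_course.items()}

for course, avg in grades_avg.items():
    print(course, avg)
```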


Functions & user defined functions

➢ Bag functions:
○ COUNT: number of elements (not null) in a bag

GradesCount = FOREACH GradesGr GENERATE group AS course, COUNT(Grades) AS number_students;

DUMP GradesCount;
(Math,9)
(Biology,5)
(Physics,8)
(Engineering,9)


Functions & user defined functions

➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on input

DUMP GradesGr;
(Math,{(8,Math,6.7),(1,Math,5.6),(10,Math,10.0),(9,Math,8.9),(2,Math,8.9),(3,Math,7.1),(4,Math,2.3),(5,Math,6.7),(7,Math,8.5)})
(Biology,{(5,Biology,10.0),(4,Biology,6.7),(2,Biology,10.0),(1,Biology,4.5),(9,Biology,7.8)})
...

GradesFlat = FOREACH GradesGr GENERATE group AS course, FLATTEN(Grades.mark) AS mark;

DUMP GradesFlat;
(Math,6.7)
(Math,5.6)
(Math,10.0)
…
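When its input is a bag, FLATTEN un-nests it: each tuple in the bag yields one output tuple combined with the other generated fields. A plain-Python sketch of that behavior (a model only, with a shortened sample of the grouped data):

```python
grades_gr = [
    ("Math", [(8, "Math", 6.7), (1, "Math", 5.6), (10, "Math", 10.0)]),
    ("Biology", [(5, "Biology", 10.0)]),
]

# FOREACH GradesGr GENERATE group AS course, FLATTEN(Grades.mark) AS mark;
# One output tuple per element of the bag, paired with the group key.
grades_flat = [
    (course, t[2])      # t[2] is the mark field of each grade tuple
    for course, bag in grades_gr
    for t in bag
]

for row in grades_flat:
    print(row)
```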


Functions & user defined functions

➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on input

GradesTuple = FOREACH Grades GENERATE student_id, TOTUPLE(course, mark) AS tuple_mark;

DUMP GradesTuple;
(1,(Math,5.6))
(2,(Math,8.9))
(3,(Math,7.1))
(4,(Math,2.3))
...

GradesUntupled = FOREACH GradesTuple GENERATE student_id AS student_id, FLATTEN(tuple_mark);

DUMP GradesUntupled;
(1,Math,5.6)
(2,Math,8.9)
(3,Math,7.1)
(4,Math,2.3)
…


Functions & user defined functions

➢ Bag/Tuple functions:
○ SUBTRACT: Tuples on first bag not in the second

SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;

DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})

StudentsSub = FOREACH StudentsCoGr GENERATE group, SUBTRACT(StudentsUnder25, StudentsUnder20);

DUMP StudentsSub;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})


Functions & user defined functions

➢ Bag/Tuple functions:
○ DIFF: Non overlapping tuples on two bags

DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})

StudentsDiff = FOREACH StudentsCoGr GENERATE group, DIFF(StudentsUnder25, StudentsUnder20);

DUMP StudentsDiff;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})
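SUBTRACT and DIFF give the same result on this data, but they are not the same operation: SUBTRACT is one-sided, while DIFF keeps the tuples that appear in only one of the two bags, whichever it is. The plain-Python sketch below models that difference on the male group; `subtract` and `diff` are hypothetical helpers, not Pig functions, and bags are modelled as lists.

```python
under25 = [(10, "Peter", "Parker", "M", 23), (1, "John", "Doe", "M", 18)]
under20 = [(1, "John", "Doe", "M", 18)]

def subtract(first, second):
    """Tuples of the first bag that are not in the second."""
    return [t for t in first if t not in second]

def diff(first, second):
    """Tuples that appear in exactly one of the two bags."""
    return subtract(first, second) + subtract(second, first)

# Same result in this direction, because under20 is contained in under25...
print(subtract(under25, under20))
print(diff(under25, under20))

# ...but swapping the arguments only changes the one-sided operation.
print(subtract(under20, under25))
print(diff(under20, under25))
```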


Functions & user defined functions

➢ Math functions:
○ Common math functions for numeric values:
■ ABS
■ EXP
■ FLOOR
■ LOG
■ RANDOM
■ ROUND
■ SQRT
■ ...


Functions & user defined functions

➢ String functions:
○ Transform chararrays:
■ ENDSWITH
■ LOWER
■ UPPER
■ SUBSTRING
■ TRIM
■ REPLACE
■ ...


Functions & user defined functions

➢ Datetime functions:
○ Get information on dates and timestamps:
■ AddDuration
■ CurrentTime
■ ToDate
■ ToString
■ ToUnixTime
■ DaysBetween
■ ...


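User defined functions

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// UDF that returns the tuples of the input bag in random order.
public class SHUFFLE extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null)
            throw new IOException("Invalid input: null");
        if (input.size() != 1)
            throw new IOException("Expected one argument");
        if (input.get(0) == null)
            return null;
        DataBag bag = (DataBag) input.get(0);
        // Copy the tuples into a list so they can be shuffled in memory.
        List<Tuple> l = new ArrayList<Tuple>();
        for (Tuple t : bag)
            l.add(t);
        Collections.shuffle(l);
        DataBag resBag = BagFactory.getInstance().newDefaultBag(l);
        return resBag;
    }

    @Override
    public Schema outputSchema(Schema input) {
        // The output bag keeps the schema of the input bag.
        try {
            return new Schema(input.getField(0));
        } catch (Exception e) {
            return null;
        }
    }
}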


More functions: Datafu Pig

➢ Library of useful UDFs released 2010
➢ Created by LinkedIn engineering team:
○ Stats: variance, quantiles, median, etc.
○ Bags: concat, append, prepend, etc.
○ Sampling
○ Page rank
○ Session estimation

➢ Last major release: 1.2.0 (Dec, 2013)

http://datafu.incubator.apache.org/


How to use UDF libraries

REGISTER lib/datafu-1.2.0.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();

(REGISTER, DEFINE: indicate the UDF to be included and its name)

DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})

StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25,StudentsUnder20);

DUMP StudentBagConcat;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18),(1,John,Doe,M,18)})


Scripting

-- Libraries and UDFs
REGISTER lib/datafu-1.2.0.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();

-- Load data ($student_file is a parameter)
Students = LOAD '$student_file' USING PigStorage('\t', '-noschema') AS (student_id: Long, name: Chararray, surname: Chararray, gender: Chararray, age: Int);

-- Transform data
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25,StudentsUnder20);

-- Store data ($output is a parameter)
STORE StudentBagConcat INTO '$output' USING PigStorage('\t', '-schema');


Calling a script

pig -x mapred -f myscript.pig -param student_file=students.csv -param output=myoutput_path

(-x: execution mode; -f: script file; -param: parameter definition)


More on load/store

➢ Not limited to plain text

➢ Multiple supported formats: Json, Avro, Accumulo, etc.

➢ Connectors to data sources: MongoDb, Cassandra, HBase, etc.


Exercise: Product association

➢ Detect pairs of products bought together (e.g., chairs and tables)

➢ Goal: recommend related products

➢ Association score:


Product association

➢ Purchases: purchases.tsv

product_id  user_id  price  date
1           23       14.5   2014-03-03
4           15       11.2   2014-08-09
88          3        48.3   2011-01-01
...

➢ Products: products.tsv

product_id  status
1           ok
5           ko
99          ok
...


Time to work!


Final notes: Pros & cons

Pros:
➢ Clear and simple syntax
➢ Interactive client
➢ Transparent M/R jobs
➢ Integration with Java and others

Cons:
➢ Not as flexible as Hadoop
➢ Oriented towards ETL, not AI
➢ No loops


Extra information

➢ http://pig.apache.org/

➢ Programming Pig. Alan Gates. Ed. O’Reilly

➢ StackOverflow