  • Introduction to Apache Pig

    Pelle Jakovits

    28 September 2016, Tartu

  • Outline

    • MapReduce recollection

    • Apache Pig

    – How to run Pig

    – Pig Latin

    • Data structures

    • Examples

    – Execution flow

    – Advantages & Disadvantages

    Pelle Jakovits 2/28

  • You already know MapReduce

    • MapReduce = Map, GroupBy, Sort, Reduce

    • Designed for huge-scale data processing

    • Provides

    – Distributed file system

    – High scalability

    – Automatic parallelization

    – Automatic fault recovery

    • Data is replicated

    • Failed tasks are re-executed on other nodes
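
The Map, GroupBy, Sort and Reduce phases listed above can be sketched in plain Python. This is a single-process illustration only, not Hadoop code; `run_mapreduce`, `wc_map` and `wc_reduce` are hypothetical names chosen for this example.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Apply Map, then Sort and GroupBy on the key, then Reduce, in one process."""
    # Map: every input record may emit any number of (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Sort: bring identical keys next to each other.
    pairs.sort(key=itemgetter(0))
    # GroupBy + Reduce: hand each key and all of its values to the reducer.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def wc_map(line):
    # Emit (word, 1) for every whitespace-separated word.
    return [(word, 1) for word in line.split()]


def wc_reduce(word, counts):
    # Sum the ones collected for a single word.
    return (word, sum(counts))


counts = run_mapreduce(["a b a", "b c"], wc_map, wc_reduce)
```

In a real cluster the map and reduce calls run on different nodes and the sort/group step happens during the shuffle; the sketch only shows the order of the phases.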

  • But is MapReduce enough?

    • Hadoop MapReduce is one of the most used frameworks for large scale data processing

    • However:

    – Writing low-level MapReduce code is slow

    – Need a lot of expertise to optimize MapReduce code

    – Prototyping is slow

    – A lot of custom code required

    • Even for the simplest tasks

    – Hard to manage more complex MapReduce job chains

  • Apache Pig

    • A data flow framework on top of Hadoop MapReduce

    – Retains all its advantages

    – And some of its disadvantages

    • Models a scripting language

    – Fast prototyping

    • Uses the Pig Latin language

    – Similar to declarative SQL

    – Easier to get started with

    • Pig Latin statements are automatically translated into MapReduce jobs

  • Running Pig

    • Local mode

    – Everything installed locally on one machine

    • Distributed mode

    – Everything runs in a MapReduce cluster

    • Interactive mode

    – Grunt shell

    • Batch mode

    – Pig scripts

  • Pig Latin

    • Write complex MapReduce transformations using a much simpler scripting language

    • Not quite SQL, but similar

    • Lazy evaluation

    • Compiling is hidden from the user
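
Lazy evaluation means that statements only describe a data flow; nothing is computed until an output is requested (in Pig, by DUMP or STORE). A rough Python analogy using generators is sketched below. The helpers `load`, `foreach` and `dump` and the `executed` log are hypothetical names for this illustration and are not part of Pig.

```python
executed = []  # records which steps actually ran, for illustration only


def load(rows):
    # Generator: the body does not run until something iterates over it.
    executed.append("load")
    for row in rows:
        yield row


def foreach(relation, fn):
    # Also a generator, so calling it only builds another pipeline stage.
    executed.append("foreach")
    for row in relation:
        yield fn(row)


def dump(relation):
    # Requesting output is what finally drives the whole pipeline.
    return list(relation)


A = load([1, 2, 3])
B = foreach(A, lambda x: x * 10)
steps_before_dump = list(executed)  # still empty: nothing has run yet
result = dump(B)
```

Because the whole flow is known before anything runs, a system like Pig has the opportunity to optimize it as a whole before generating jobs.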

  • Pig Latin Data Structures

    • Relation

    – Can have nested relations

    – Similar to a table in a relational database

    – Consists of a bag

    • Bag

    – An unordered collection of tuples

    • Tuple

    – An ordered set of fields

    – Similar to a row in a relational database

    – Can contain any number of fields; the number does not have to match other tuples

    • Field

    – A piece of data

  • Pig Example

    • A = LOAD 'student' USING PigStorage() AS (name, age, gpa);

    • DUMP A;

    – (John, 18, 4.0F)

    – (Mary, 19, 3.8F)

    – (Bill, 20, 3.9F)

    – (Joe, 18, 3.8F)

    • B = GROUP A BY age;

    • C = FOREACH B GENERATE AVG(A.gpa);
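
As a rough model of what grouping by age and then averaging gpa computes, the same steps can be written in plain Python. The function names `group_by` and `average_gpa` are hypothetical; the data is the four student tuples from the DUMP output above.

```python
from collections import defaultdict

# The four student tuples shown by DUMP A: (name, age, gpa).
A = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]


def group_by(relation, key_index):
    """Model of GROUP ... BY: map each key to the bag of tuples sharing it."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return dict(groups)


def average_gpa(groups):
    """Average the gpa field (index 2) inside each group's bag."""
    return {age: sum(row[2] for row in bag) / len(bag) for age, bag in groups.items()}


B = group_by(A, 1)   # one entry per distinct age
C = average_gpa(B)   # one average per age
```

After grouping, each group holds a bag of the original tuples, which is why the aggregate has to be applied to a field of that bag rather than to a top-level field.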

  • WordCount in Pig

    A = load '/tmp/books/books';

    B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

    C = group B by word;

    D = foreach C generate COUNT(B), group;

    store D into '/user/labuser/pelle_jakovits/out';

    • Input and output are HDFS folders or files

    – /tmp/books/books

    – /user/labuser/pelle_jakovits/out

    • A, B, C, D are relations

    • Right hand side contains Pig expressions

  • Fields

    • A field consists of either:

    – Data atoms - int, long, float, double, chararray, boolean, datetime, etc.

    – Complex data - Bag, Map, Tuple

    • Assigning types to fields

    – A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

    • Referencing fields

    – By order - $0, $1, $2

    – By name - assigned by user schemas

    • A = LOAD 'in.txt' AS (age, name, occupation);

  • Complex data types

    • Tuples - (a, b, c)

    • Bags - {(a,b), (c,d)}

    • Maps - [martin#18, daniel#27]

    • Looking into complex, nested data

    – client.$0

    – author.age

    • FLATTEN can "explode" a Pig bag into a set of tuple records
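
The three complex types and the effect of FLATTEN on a bag can be modelled with ordinary Python containers. This is an analogy only: `flatten_bag` is a hypothetical helper, and the `grouped` rows are made-up sample data.

```python
# A tuple is an ordered sequence of fields.
t = ("a", "b", "c")

# A bag is a collection of tuples (modelled here as a list).
bag = [("a", "b"), ("c", "d")]

# A map associates keys with values, as in [martin#18, daniel#27].
ages = {"martin": 18, "daniel": 27}


def flatten_bag(rows, bag_index):
    """Model of FLATTEN on a bag field: emit one output tuple per inner tuple."""
    out = []
    for row in rows:
        before = row[:bag_index]
        inner_bag = row[bag_index]
        after = row[bag_index + 1:]
        for inner in inner_bag:
            # The inner tuple's fields replace the bag in the output tuple.
            out.append(before + inner + after)
    return out


grouped = [(18, [("John", 4.0), ("Joe", 3.8)]), (19, [("Mary", 3.8)])]
flat = flatten_bag(grouped, 1)
```

Each outer tuple is repeated once per element of its bag, so the nesting disappears and the result is a flat relation again.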

  • Loading and storing data

    • LOAD

    – A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);

    – User defines data loader and delimiters

    • STORE

    – STORE A INTO 'output_1.txt' USING PigStorage(',');

    – STORE B INTO 'output_2.txt' USING PigStorage('*');

    • Other load and store functions

    – BinStorage

    – PigDump

    – TextLoader

    – Or create a custom one
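
What a delimiter-based loader and storer do can be sketched in a few lines of Python. This only models the text format (one tuple per line, fields separated by a delimiter); the `load` and `store` helpers are hypothetical and work on strings instead of HDFS files.

```python
def load(text, delimiter, types):
    """Model of loading delimited text with a typed schema."""
    rows = []
    for line in text.splitlines():
        fields = line.split(delimiter)
        # Cast each field to its declared type. This sketch raises on a bad
        # value, which is a simplification and not how Pig reports such errors.
        rows.append(tuple(cast(value) for cast, value in zip(types, fields)))
    return rows


def store(relation, delimiter):
    """Model of storing a relation as delimited text, one line per tuple."""
    return "".join(delimiter.join(str(f) for f in row) + "\n" for row in relation)


A = load("1\t2\t3\n4\t2\t1\n", "\t", (int, int, int))
out = store(A, ",")
```

Loading with one delimiter and storing with another is a simple way to convert between text formats.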

  • FOREACH … GENERATE

    • General data transformation statement

    • Used to:

    – Change the structure of data

    – Apply functions to data

    – Flatten complex data to remove nesting

    • X = FOREACH C GENERATE FLATTEN (A.(a1, a2)), FLATTEN(B.$1);

  • GROUP … BY

    • A = load 'student' AS (name:chararray, age:int, gpa:float);

    • DUMP A;

    – (John, 18, 4.0F)

    – (Mary, 19, 3.8F)

    – (Bill, 20, 3.9F)

    – (Joe, 18, 3.8F)

    • B = GROUP A BY age;

    • DUMP B;

    – (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})

    – (19, {(Mary, 19, 3.8F)})

    – (20, {(Bill, 20, 3.9F)})

  • JOIN

    • A = LOAD 'data1' AS (a1:int,a2:int,a3:int);

    • B = LOAD 'data2' AS (b1:int,b2:int);

    • X = JOIN A BY a1, B BY b1;

    DUMP A; (1,2,3) (4,2,1)

    DUMP B; (1,3) (2,7) (4,6)

    DUMP X; (1,2,3,1,3) (4,2,1,4,6)
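
The result in X can be reproduced with a small in-memory join in Python. The sketch uses a hash join, which is a technique swapped in for illustration; Pig runs joins as MapReduce jobs, and `inner_join` is a hypothetical helper name.

```python
from collections import defaultdict


def inner_join(left, left_key, right, right_key):
    """Model of an inner JOIN on one key field of each relation."""
    # Index the right relation by its join key.
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Concatenate every pair of tuples whose join keys are equal.
    return [l + r for l in left for r in index.get(l[left_key], [])]


A = [(1, 2, 3), (4, 2, 1)]
B = [(1, 3), (2, 7), (4, 6)]
X = inner_join(A, 0, B, 0)
```

The tuple (2,7) from B has no partner in A, so it does not appear in the result.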

  • Union

    • A = LOAD 'data' AS (a1:int, a2:int, a3:int);

    • B = LOAD 'data' AS (b1:int, b2:int);

    • X = UNION A, B;

    DUMP A; (1,2,3) (4,2,1)

    DUMP B; (2,4) (8,9)

    DUMP X; (1,2,3) (4,2,1) (2,4) (8,9)

  • Functions

    • SAMPLE

    – A = LOAD 'data' AS (f1:int,f2:int,f3:int);

    – X = SAMPLE A 0.01;

    – X will contain approximately 1% of the tuples in A

    • FILTER

    – A = LOAD 'data' AS (a1:int, a2:int, a3:int);

    – X = FILTER A BY a3 == 3;
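
A minimal Python model of the two operators, assuming that sampling keeps each tuple independently with the given probability (so the result size is only approximately the requested fraction). The helper names, the `seed` parameter and the sample rows are assumptions made for this sketch and are not part of Pig.

```python
import random


def sample(relation, fraction, seed=None):
    """Model of SAMPLE: keep each tuple independently with probability `fraction`."""
    rng = random.Random(seed)  # `seed` is a hypothetical knob for repeatable demos
    return [row for row in relation if rng.random() < fraction]


def filter_by(relation, predicate):
    """Model of FILTER ... BY: keep only tuples for which the predicate holds."""
    return [row for row in relation if predicate(row)]


# Made-up rows with three integer fields (a1, a2, a3).
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3)]
X = filter_by(A, lambda row: row[2] == 3)  # same condition as a3 == 3
```

With a fraction of 0.01 and only four input tuples, the sample would usually be empty, which is why sampling is mainly useful on large inputs.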

  • Functions

    • DISTINCT – removes duplicate tuples

    – X = DISTINCT A;

    • LIMIT – limits the number of output tuples

    – X = LIMIT B 3;

    • SPLIT – partitions a relation into two or more relations

    – SPLIT A INTO X IF f1

  • Nested Pig Statements

    A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS (last_name,first_name,balance,address,city,last_transaction,bank_name);

    B = GROUP A BY city;

    C = foreach B {

    banks = A.bank_name ;

    unique_banks = distinct banks ;

    GENERATE group as city, unique_banks; }
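
The nested block above groups accounts by city and keeps the distinct bank names inside each group. A rough Python model of the same computation follows; `unique_banks_per_city` is a hypothetical name and the sample rows are made up.

```python
from collections import defaultdict


def unique_banks_per_city(accounts):
    """Model of grouping by city, then taking distinct bank names per group."""
    banks = defaultdict(set)
    for account in accounts:
        # A set drops duplicates, like DISTINCT inside the nested block.
        banks[account["city"]].add(account["bank_name"])
    # Sorting is only for readable, repeatable output; Pig bags are unordered.
    return {city: sorted(names) for city, names in banks.items()}


# Made-up sample rows; only the two fields the script uses are included.
accounts = [
    {"city": "Tartu", "bank_name": "Bank A"},
    {"city": "Tartu", "bank_name": "Bank B"},
    {"city": "Tartu", "bank_name": "Bank A"},
    {"city": "Tallinn", "bank_name": "Bank A"},
]
C = unique_banks_per_city(accounts)
```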

  • User Defined Functions (UDF)

    • When the built-in Pig functions are not enough

    • When we want to modify the behaviour of built-in functions

    • Load a Pig UDF from a jar

    REGISTER myudfs.jar;

    A = load '/tmp/books/books';

    B = foreach A generate flatten(myudfs.TOKENIZE((chararray)$0)) as word;

  • Pig UDF

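    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class MYTOKENIZE extends EvalFunc<DataBag> {
        TupleFactory mTupleFactory = TupleFactory.getInstance();
        BagFactory mBagFactory = BagFactory.getInstance();

        public DataBag exec(Tuple input) throws IOException {
            try {
                DataBag output = mBagFactory.newDefaultBag();
                Object o = input.get(0);
                if (!(o instanceof String)) {
                    throw new IOException("Expected input to be chararray");
                }
                // Split on space, double quote, comma, parentheses and asterisk
                StringTokenizer tok = new StringTokenizer((String) o, " \",()*");
                while (tok.hasMoreTokens())
                    output.add(mTupleFactory.newTuple(tok.nextToken()));
                return output;
            } catch (ExecException ee) {
                // Rethrow so that every path either returns a bag or throws
                throw new IOException("Failed to tokenize input", ee);
            }
        }
    }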
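
To see what this tokenizer produces without building a jar, its splitting logic can be mirrored in plain Python. This is only a model of the behaviour, not a Pig UDF; `tokenize` and `DELIMITERS` are hypothetical names, and the input string is made up.

```python
import re

# Same delimiter characters as the StringTokenizer call: space " , ( ) *
DELIMITERS = re.compile(r'[ ",()*]+')


def tokenize(line):
    """Return a bag (list) of one-field tuples, one per token in `line`."""
    if not isinstance(line, str):
        raise TypeError("Expected input to be chararray")
    # Drop the empty strings that appear when the line starts or ends
    # with a delimiter.
    return [(token,) for token in DELIMITERS.split(line) if token]


words = tokenize('Hello, "Pig" (Latin) * world')
```

Each token is wrapped in its own one-field tuple because a bag can only contain tuples; applying FLATTEN to the result then yields one word per record.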

  • Pig workflow

  • Advantages of Pig

    • Easy to program

    – ~5% of the code, ~5% of the time required compared with writing MapReduce directly

    • Self-optimizing

    – Pig Latin statement optimizations

    – Generated MapReduce code optimizations

    • Can manage more complex data flows

    – Easy to use and join multiple separate inputs, transformations and outputs

    • Extensible

    – Can be extended with User Defined Functions (UDF) to provide more functionality

  • Pig disadvantages

    • Slow start-up and clean-up of MapReduce jobs

    – It takes time for Hadoop to schedule MR jobs

    • Not suitable for interactive OLAP Analytics

    – When results are expected in < 1 sec

    • Complex applications may require many UDFs

    – Pig loses its simplicity over MapReduce

  • DEMO

    TFIDF in Pig

  • That's all

    • This week's practice session

    – Processing data with Pig

    – Processing unclaimed bank accounts, but this time using Pig

    • Next lecture: Spark
