Post on 26-Jan-2015


Apache Pig Power Tools: a quick tour

By Viswanath Gangavaram

Data Scientist, R&D, DSG, Ilabs, [24]7 Inc.

What we are going to cover

• A very short introduction to Apache Pig
• The Grunt shell: an interactive shell to write and execute Pig Latin and to access HDFS
• Advanced Pig relational operators
• Built-in functions
• User-defined functions
• DEFINE (UDFs, streaming, macros)
• UDFs vs. Pig streaming
• JSON parsing
• Single-row relations
• Real Python in Pig (nltk, numpy, scipy, etc.)
• Embedding Pig Latin in Python for iterative processing
• Hadoop globbing
• Hue: the Hadoop ecosystem in the browser
• External libraries: Piggybank, DataFu, DataFu Hourglass, SimpleJson, ElephantBird

A very short introduction to Apache Pig

Pig provides a higher level of abstraction for data users, giving them access to the power and flexibility of Hadoop without requiring them to write extensive data processing applications in low-level Java code (MapReduce code). From the preface of “Programming Pig”.

[Figures: Apache Pig in the Hadoop 1.0 ecosystem; Apache Pig execution life cycle]

• Apache Pig is a high-level platform for executing data flows in parallel on Hadoop. The language for this platform is called Pig Latin, which includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.

• What does it mean to be Pig?

– Pigs Fly
  • Pig processes data quickly. Designers want to consistently improve its performance, and not implement features in ways that weigh Pig down so it can't fly.

– Pigs Eat Everything
  • Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.

– Pigs Live Everywhere
  • Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework (see, for example, Pig on Tez).

– Pigs Are Domestic Animals
  • Pig is designed to be easily controlled and modified by its users.
  • Pig allows integration of user code wherever possible, so it currently supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals.
  • Pig supports user-provided load and store functions.
  • It supports external executables via its STREAM command and MapReduce jars via its MAPREDUCE command.
  • It allows users to provide a custom partitioner for their jobs in some circumstances and to set the level of reduce parallelism for their jobs.

Why do we need to embrace this sort of philosophy?

Because that’s the reality.

Apache Pig word counting: the “hello world” of MapReduce

    inputFile = LOAD 'mary' USING TextLoader() AS (line);
    words = FOREACH inputFile GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd = GROUP words BY word;
    cntd = FOREACH grpd GENERATE group, COUNT(words);
    DUMP cntd;

Contents of the “mary” file:

    This is my first apache pig program
    This is my first apache pig program

Output:

    (This,2)
    (is,2)
    (my,2)
    (first,2)
    (apache,2)
    (pig,2)
    (program,2)
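For readers who think more easily in Python, the same computation can be sketched in a few lines. This is only an in-memory analogy, not how Pig executes the script: `word_count` is a hypothetical helper, and `str.split()` on whitespace stands in for TOKENIZE, which splits on a few more delimiters.

```python
from collections import Counter


def word_count(lines):
    """Mirror TOKENIZE + FLATTEN + GROUP + COUNT on an in-memory list of lines."""
    counts = Counter()
    for line in lines:
        # Assumption: whitespace splitting approximates Pig's TOKENIZE.
        counts.update(line.split())
    return dict(counts)


# The two lines of the "mary" file from the slide.
mary = [
    "This is my first apache pig program",
    "This is my first apache pig program",
]
result = word_count(mary)
```

Each key of `result` corresponds to the `group` field of `cntd`, and each value to `COUNT(words)`.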

Apache Pig Latin: a data flow language

• Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel.
• To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
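To make the DAG idea concrete, here is a small sketch that writes the word-count script as a dependency graph and asks for a valid execution order. It assumes Python 3.9+ for `graphlib`; the `dataflow` dictionary is an illustration only and says nothing about how Pig represents its plans internally.

```python
from graphlib import TopologicalSorter

# Each key is a relation produced by one operator (a node);
# its value is the set of relations it reads from (incoming edges).
dataflow = {
    "inputFile": set(),        # LOAD
    "words": {"inputFile"},    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line))
    "grpd": {"words"},         # GROUP
    "cntd": {"grpd"},          # FOREACH ... GENERATE group, COUNT(words)
}

# A topological order lists every relation after the relations it depends on.
order = list(TopologicalSorter(dataflow).static_order())
```

Because the graph has no cycles, such an order always exists; a script with several inputs or outputs simply yields a graph with more than one source or sink.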

Comparing query languages (Hive/SQL) and data flow languages (Pig)

• After a cursory look, people often say that Pig Latin is a procedural version of SQL. Although there are certainly similarities, there are more differences. SQL is a query language. Its focus is to allow users to form queries. It allows users to describe what question they want answered, but not how they want it answered. In Pig Latin, on the other hand, the user describes exactly how to process the input data.

• Another major difference is that SQL is oriented around answering one question. When users want to do several data operations together, they must either write separate queries, storing the intermediate data into temporary tables, or write it in one query using subqueries inside that query to do the earlier steps of the processing. However, many SQL users find subqueries confusing and difficult to form properly. Also, using subqueries creates an inside-out design where the first step in the data pipeline is the innermost query.

• Pig, however, is designed with a long series of data operations in mind, so there is no need to write the data pipeline in an inverted set of subqueries or to worry about storing data in temporary tables.

• SQL is the English of data processing. It has the nice feature that everyone and every tool knows it, which means the barrier to adoption is very low. Our goal is to make Pig Latin the native language of parallel data-processing systems such as Hadoop. It may take some learning, but it will allow users to utilize the power of Hadoop much more fully.

Extracted from “Programming Pig”.


Pig’s data types

Scalar types
• int, long, float, double, chararray, bytearray

Complex types
• Map
  – A map in Pig is a chararray-to-data-element mapping, where that element can be any Pig type, including a complex type.
  – The chararray is called a key and is used as an index to find the element, referred to as the value.
  – Map constants are formed using brackets to delimit the map, a hash between keys and values, and a comma between key-value pairs.
    » ['dept'#'dsg', 'team'#'r&d']
• Tuple
  – A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field containing one data element. These elements can be of any type.
  – Tuple constants use parentheses to indicate the tuple and commas to delimit fields in the tuple.
    » ('boss', 55)
• Bag
  – A bag is an unordered collection of tuples.
  – Bag constants are constructed using braces, with the tuples in the bag separated by commas.
    » { ('a', 20), ('b', 20), ('c', 30) }
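As a rough analogy (not an exact equivalence), the three complex types map onto familiar Python containers. The variable names below are made up for illustration; the values are the constants from the slide.

```python
# Map: chararray keys pointing at values of any type (a dict with str keys).
pig_map = {"dept": "dsg", "team": "r&d"}

# Tuple: a fixed-length, ordered collection of fields.
pig_tuple = ("boss", 55)

# Bag: an unordered collection of tuples. Duplicates are allowed, so a list
# whose order we agree to ignore is a closer analogy than a set.
pig_bag = [("a", 20), ("b", 20), ("c", 30)]

# Complex types nest: a map value can itself be a bag.
nested = {"scores": pig_bag}
```

The nesting in the last line is what lets GROUP return one tuple per key with a whole bag of matching tuples inside it.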

Running Pig

One can run Pig (execute Pig Latin statements and Pig commands) using various modes.

                                                                        Local Mode    MapReduce Mode
    Interactive Mode (Grunt shell): Pig Latin statements, Pig commands  Yes           Yes
    Batch Mode                                                          Yes           Yes

Execution modes

Local Mode
To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).

MapReduce Mode
To run Pig in MapReduce mode, you need access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode; you can, but do not need to, specify it using the -x flag (pig or pig -x mapreduce).

    /* local mode */
    pig -x local …
    java -cp pig.jar org.apache.pig.Main -x local …

    /* mapreduce mode */
    pig …
    pig -x mapreduce …
    java -cp pig.jar org.apache.pig.Main …
    java -cp pig.jar org.apache.pig.Main -x mapreduce …

Relational operators

LOAD
Loads data from the file system. If you specify a directory name, all the files in the directory are loaded.
    Syntax:  LOAD 'data' [USING function] [AS schema];
    Example: A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);

STORE
Stores or saves results to the file system.
    Syntax:  STORE alias INTO 'directory' [USING function];
    Example: A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
             STORE A INTO 'myoutput' USING PigStorage('*');  -- 'myoutput' is a placeholder directory name

LIMIT
Limits the number of output tuples.
    Syntax:  alias = LIMIT alias n;
    Example: A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
             B = LIMIT A 5;

FILTER
Selects tuples from a relation based on some condition.
    Syntax:  alias = FILTER alias BY expression;
    Example: A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
             B = FILTER A BY f2 > 2;

DISTINCT
Removes duplicate tuples in a relation.
    Syntax:  alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];
    Example: A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
             B = DISTINCT A;

DUMP
Dumps or displays results to screen.
    Syntax:  DUMP alias;
    Example: A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
             DUMP A;

ORDER BY
Sorts a relation based on one or more fields.
    Syntax:  alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
    Example: A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
             B = ORDER A BY f2;
             DUMP B;

UNION
Computes the union of two or more relations.
    Syntax:  alias = UNION [ONSCHEMA] alias, alias [, alias …];
    Example: L1 = LOAD 'f1' AS (a:int, b:float);
             L2 = LOAD 'f1' AS (a:long, c:chararray);
             U = UNION ONSCHEMA L1, L2;
             DESCRIBE U;
             U: {a: long, b: float, c: chararray}
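The DESCRIBE output above shows the merged schema; what happens to the rows can be sketched in Python. This is a simplified model under stated assumptions: relations are lists of dicts keyed by field name, `union_onschema` is a hypothetical helper, and the int-to-long type promotion that Pig performs is not modelled.

```python
def union_onschema(*relations):
    """Union relations by field name, filling missing fields with None."""
    # Collect field names in first-seen order to form the merged schema.
    fields = []
    for rel in relations:
        for row in rel:
            for name in row:
                if name not in fields:
                    fields.append(name)
    # A row that lacks a field gets None there (Pig's null).
    return [
        {name: row.get(name) for name in fields}
        for rel in relations
        for row in rel
    ]


# Made-up sample rows matching the two schemas on the slide.
L1 = [{"a": 1, "b": 2.0}]
L2 = [{"a": 5, "c": "x"}]
U = union_onschema(L1, L2)
```

Without ONSCHEMA, UNION matches fields by position instead, which is why the keyword matters when the inputs have different columns.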

FOREACH
Generates data transformations based on columns of data.
    Syntax:  alias = FOREACH { block | nested_block };
    Example: X = FOREACH A GENERATE f1;
             X = FOREACH B {
                 S = FILTER A BY 'xyz' == '3';
                 GENERATE COUNT(S.$0);
             }

CROSS
Computes the cross product of two or more relations.
    Syntax:  alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
    Example: A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
             B = LOAD 'data2' AS (b1:int, b2:int);
             X = CROSS A, B;

(CO)GROUP
Groups the data in one or more relations. The GROUP and COGROUP operators are identical.
    Syntax:  alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
    Example: A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
             B = GROUP A BY age;
             DUMP B;
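The shape of what GROUP returns is easy to miss: one tuple per key, holding the key and a bag of all the original tuples with that key. A minimal in-memory sketch, where `group_by` is a hypothetical helper and the student rows are made-up sample data:

```python
from collections import defaultdict


def group_by(relation, key_index):
    """Return (group, bag) pairs, like B = GROUP A BY <field>."""
    bags = defaultdict(list)
    for row in relation:
        # Every row lands, unchanged, in the bag for its key.
        bags[row[key_index]].append(row)
    return list(bags.items())


# Made-up rows with the slide's schema (name, age, gpa).
students = [
    ("John", 18, 4.0),
    ("Mary", 19, 3.8),
    ("Bill", 20, 3.9),
    ("Joe", 18, 3.8),
]
B = group_by(students, 1)
```

In Pig the first field of each result tuple is named `group` and the bag takes the name of the input relation, which is why the word-count script calls `COUNT(words)` after grouping `words`.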

JOIN (inner)
Performs an inner join of two or more relations based on common field values.
    Syntax:  alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];
    Example: A = LOAD 'mydata';
             B = LOAD 'mydata';
             C = JOIN A BY $0, B BY $0;
             DUMP C;

JOIN (outer)
Performs an outer join of two relations based on common field values.
    Syntax:  alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [USING 'replicated' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
    Example: A = LOAD 'a.txt' AS (n:chararray, a:int);
             B = LOAD 'b.txt' AS (n:chararray, m:chararray);
             C = JOIN A BY $0 LEFT OUTER, B BY $0;
             DUMP C;
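The difference between the two joins shows up only for rows without a partner. The sketch below is a naive nested-loop illustration on made-up data, with `join` as a hypothetical helper; it says nothing about the join strategies ('replicated', 'skewed', 'merge') that Pig actually uses.

```python
def join(left, right, how="inner"):
    """Join two lists of tuples on their first field ($0)."""
    right_width = len(right[0]) if right else 0
    out = []
    for l in left:
        # Null keys never match in a join, so skip the lookup for None.
        matches = [r for r in right if r[0] == l[0]] if l[0] is not None else []
        if matches:
            # Output keeps all fields of both sides, including both key columns.
            out.extend(l + r for r in matches)
        elif how == "left":
            # LEFT OUTER: keep the unmatched left row, pad the right side with nulls.
            out.append(l + (None,) * right_width)
    return out


A = [("alice", 1), ("bob", 2)]
B = [("alice", "x")]
inner = join(A, B)
left_outer = join(A, B, how="left")
```

Here `inner` drops the `bob` row entirely, while `left_outer` keeps it with nulls in the fields that would have come from `B`.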


The Grunt shell: an interactive shell to write and execute Pig Latin and to access HDFS

Shell commands
• fs: invokes any FsShell command from within a Pig script or the Grunt shell.
    fs -mkdir /tmp
    fs -copyFromLocal file-x file-y
    fs -ls file-y
• sh: invokes any sh shell command from within a Pig script or the Grunt shell.
    sh ls
    sh pwd

Utility commands
• clear, exec, help, history, kill, run
• exec: runs a Pig script in batch mode, with no interaction between the script and the Grunt shell. Aliases defined in the script are not available to the shell.
    exec [-param param_name = param_value] [-param_file file_name] [script]
• run: runs a Pig script in interactive mode, so the script can interact with the Grunt shell and aliases defined in the script remain available to the shell.
    run [-param param_name = param_value] [-param_file file_name] script

Advanced relational operators

SPLIT: splitting data into training and testing datasets

    SPLIT users INTO kids IF age < 18, adults IF age >= 18 AND age < 65, seniors OTHERWISE;
    SPLIT data INTO testing IF RANDOM() <= 0.10, training OTHERWISE;

The SPLIT operator cannot handle non-deterministic functions (such as RANDOM), so the second command above won’t work and will raise an error. The macro below works around this by first storing the random value in a field and then splitting on that field:

    DEFINE split_into_training_testing(inputData, split_percentage)
    RETURNS training, testing
    {
        data = FOREACH $inputData GENERATE RANDOM() AS random_assignment, *;
        SPLIT data INTO testing_data IF random_assignment <= $split_percentage, training_data OTHERWISE;
        $training = FOREACH training_data GENERATE $1..;
        $testing = FOREACH testing_data GENERATE $1..;
    };

    inData = LOAD 'some_files.txt' USING PigStorage('\t');
    training, testing = split_into_training_testing(inData, 0.1);

Syntax for macro definition:
    DEFINE macro_name (param [, param ...]) RETURNS {void | alias [, alias ...]} { pig_latin_fragment };

Syntax for macro expansion:
    alias [, alias ...] = macro_name (param [, param ...]);
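The key point of the workaround is that the random number is drawn once per row and stored before any routing decision is made. A Python sketch of the same idea follows; it borrows the macro's name for readability, and the `seed` parameter is an addition for reproducibility that has no counterpart in the Pig macro.

```python
import random


def split_into_training_testing(rows, split_percentage, seed=None):
    """Route each row to testing or training based on one stored random draw."""
    rng = random.Random(seed)
    training, testing = [], []
    for row in rows:
        # Draw once and keep the value, like the random_assignment field.
        random_assignment = rng.random()
        if random_assignment <= split_percentage:
            testing.append(row)
        else:
            training.append(row)
    return training, testing


rows = list(range(1000))  # made-up input
training, testing = split_into_training_testing(rows, 0.1, seed=42)
```

Every row ends up in exactly one of the two outputs, and roughly `split_percentage` of them land in `testing`; the split is approximate, not an exact 10%.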

ASSERT
Asserts a condition on the data.
    Syntax:  ASSERT alias BY expression [, message];
    Example: A = LOAD 'data' AS (a0:int, a1:int, a2:int);
             ASSERT A BY a0 > 0, 'a0 should be greater than 0';

CUBE
Performs cube/rollup operations.
    Syntax:  alias = CUBE alias BY { CUBE expression | ROLLUP expression }, [ CUBE expression | ROLLUP expression ] [PARALLEL n];
    Example: cubedinp = CUBE salesinp BY CUBE(product, year);
             rolledup = CUBE salesinp BY ROLLUP(region, state, city);
             cubed_and_rolled = CUBE salesinp BY CUBE(product, year), ROLLUP(region, state, city);

SAMPLE
Selects a random sample of data based on the specified sample size.
    Syntax:  SAMPLE alias size;
    Example: A = LOAD 'data' AS (f1:int, f2:int, f3:int);
             X = SAMPLE A 0.01;

RANK
Returns each tuple with the rank within a relation.
    Syntax:  alias = RANK alias [ BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [DENSE] ];
    Example: B = RANK A;
             C = RANK A BY f1 DESC, f2 ASC;
             C = RANK A BY f1 DESC, f2 ASC DENSE;
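The effect of DENSE is easiest to see on ties. The sketch below ranks a single made-up column in ascending order with a hypothetical `rank` helper: tied values share a rank, and only the non-dense variant leaves a gap after them. It models `RANK ... BY` on one field; a plain `RANK A` without BY just numbers the rows.

```python
def rank(values, dense=False):
    """Rank values in ascending order; tied values share a rank."""
    ordered = sorted(values)
    ranks = []
    for i, value in enumerate(ordered):
        if i > 0 and value == ordered[i - 1]:
            ranks.append(ranks[-1])                      # tie: reuse previous rank
        elif dense:
            ranks.append(ranks[-1] + 1 if ranks else 1)  # next consecutive rank
        else:
            ranks.append(i + 1)                          # position-based, gaps after ties
    return list(zip(ranks, ordered))


scores = [20, 10, 30, 20]
plain = rank(scores)
dense = rank(scores, dense=True)
```

With these inputs the value 30 gets rank 4 in `plain` and rank 3 in `dense`.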

MAPREDUCE
Executes native MapReduce jobs inside a Pig script.
    Syntax:  alias1 = MAPREDUCE 'mr.jar' STORE alias2 INTO 'inputLocation' USING storeFunc LOAD 'outputLocation' USING loadFunc AS schema [`params, ... `];
    Example: A = LOAD 'WordcountInput.txt';
             B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir' AS (word:chararray, count:int) `org.myorg.WordCount inputDir outputDir`;

IMPORT
Imports macros defined in a separate file.
    Syntax:  IMPORT 'file-with-macro';

STREAM
Sends data to an external script or program.
    Syntax:  alias = STREAM alias [, alias …] THROUGH {`command` | cmd_alias } [AS schema];
    Example: A = LOAD 'data';
             B = STREAM A THROUGH `perl stream.pl -n 5`;

DEFINE: UDFs, streaming

Assigns an alias to a UDF or streaming command.
    Syntax:  DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };

    DEFINE CMD `perl PigStreaming.pl - nameMap` input(stdin using PigStreaming(',')) output(stdout using PigStreaming(','));
    A = LOAD 'file';
    B = STREAM A THROUGH CMD;

    DEFINE CMD `script` ship('/a/b/script');
    OP = STREAM IP THROUGH CMD;

    DEFINE Y `stream.pl data.gz` SHIP('/work/stream.pl') CACHE('/input/data.gz#data.gz');
    X = STREAM A THROUGH Y;
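From the external program's point of view, streaming is just delimited lines on stdin and delimited lines on stdout. The sketch below shows that contract with a hypothetical `stream` function that uppercases the first field; the tab delimiter is assumed as the default serialization, and in-memory buffers stand in for the pipes that Pig would attach.

```python
import io
import sys


def stream(infile=sys.stdin, outfile=sys.stdout, delimiter="\t"):
    """Read one tuple per line, uppercase its first field, write it back out."""
    for line in infile:
        fields = line.rstrip("\n").split(delimiter)
        fields[0] = fields[0].upper()
        outfile.write(delimiter.join(fields) + "\n")


# Simulate the data Pig would pipe through the process.
src = io.StringIO("alice\t1\nbob\t2\n")
dst = io.StringIO()
stream(src, dst)
output = dst.getvalue()
```

A real streaming script would call `stream()` with its defaults at the bottom of the file and be shipped to the cluster with a SHIP clause like the ones above.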

Built-in functions

• Eval functions: AVG, CONCAT, COUNT, COUNT_STAR, …
• Math functions: ABS, SQRT, …
• String functions: ENDSWITH, TRIM, …
• Datetime functions: AddDuration, GetDay, GetHour, …
• Dynamic invokers:

    DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
    encoded_strings = LOAD 'encoded_strings.txt' AS (encoded:chararray);
    decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
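For a sense of what the invoked Java method computes, the closest Python standard-library counterpart is `urllib.parse.unquote_plus`, which, like `java.net.URLDecoder.decode`, turns `+` into a space and decodes percent escapes. The sample strings are made up.

```python
from urllib.parse import unquote_plus

# Made-up stand-ins for the lines of encoded_strings.txt.
encoded_strings = ["hello%20world", "a%2Bb+c"]

# Same role as UrlDecode(encoded, 'UTF-8') in the FOREACH above.
decoded_strings = [unquote_plus(s, encoding="utf-8") for s in encoded_strings]
```

The dynamic invoker saves writing a UDF wrapper when an existing static Java method already does the job.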

User-defined functions

Single-row relations

    a = LOAD 'a.txt';
    b = GROUP a ALL;
    c = FOREACH b GENERATE COUNT(a) AS sum;
    d = ORDER a BY $0;
    e = LIMIT d c.sum/100;
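The trick in the script above is that `c` contains exactly one row, so `c.sum` can be used as a scalar inside another statement. In Python terms, with made-up data and one variable per Pig alias:

```python
# 200 made-up single-field rows standing in for a.txt.
a = [(v,) for v in [5, 3, 9, 1, 7] * 40]

# b = GROUP a ALL; c = FOREACH b GENERATE COUNT(a) AS sum;  (one row, one value)
c_sum = len(a)

# d = ORDER a BY $0;
d = sorted(a, key=lambda row: row[0])

# e = LIMIT d c.sum/100;  (integer division, so the smallest 1% of the rows)
e = d[: c_sum // 100]
```

If the relation used as a scalar had more than one row, Pig would fail at run time, so the pattern is only safe after something like `GROUP ... ALL`.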

Real Python in Pig (nltk, numpy, scipy, etc.)

streamingPython.py:

    from pig_util import outputSchema
    import nltk
    import sys
    import platform

    from nltk.stem.lancaster import LancasterStemmer

    @outputSchema("as:int")
    def square(num):
        if num is None:
            return None
        return num * num

    @outputSchema("word:chararray")
    def returnString(word):
        st = LancasterStemmer()
        return st.stem('maximum') + '\t' + word + '\t' + word + '\t' + platform.python_version()

    @outputSchema("word:chararray")
    def wordSteming(word):
        st = LancasterStemmer()
        return st.stem(word)

Pig script:

    REGISTER 'streamingPython.py' USING streaming_python AS myfuncs;
    a = LOAD 't.txt' AS (a:chararray, b:chararray);
    b = FOREACH a GENERATE myfuncs.returnString('this is pig; this is weird'), myfuncs.square(25);
    DUMP b;

Embedding Pig Latin in Python for iterative processing

To enable control flow, you can embed Pig Latin statements and Pig commands in the Python, JavaScript, and Groovy scripting languages using a JDBC-like compile, bind, run model.

DEMO

Hue: the Hadoop ecosystem in the browser

Pig’s debugging operators

• Use the DUMP operator to display results on your terminal screen.
• Use the DESCRIBE operator to review the schema of a relation.
• Use the EXPLAIN operator to view the logical, physical, or MapReduce execution plans used to compute a relation.
• Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.

Shortcuts for debugging operators

• \d alias: shortcut for DUMP. If alias is omitted, the last defined alias is used.
• \de alias: shortcut for DESCRIBE. If alias is omitted, the last defined alias is used.
• \e alias: shortcut for EXPLAIN. If alias is omitted, the last defined alias is used.
• \i alias: shortcut for ILLUSTRATE. If alias is omitted, the last defined alias is used.
• \q: quits the Grunt shell.

Resources

• Introduction to Apache Pig by Adam Kawa
• Apache DataFu (incubating)
• Building Data Products at LinkedIn with DataFu
• A Brief Tour of DataFu
• Pig Fundamentals
• Building a High-Level Dataflow System on top of MapReduce: The Pig Experience
• Pig Hive Cascading
• Developing Pig on Apache Tez
• How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations
• Apache Pig: Macro for splitting data into training and testing dataset

• Programming Pig
  – http://chimera.labs.oreilly.com/books/1234000001811/index.html
• Apache Pig’s official documentation
  – http://pig.apache.org/docs/r0.12.1/
• Pig Design Patterns
  – http://www.packtpub.com/pig-design-patterns/book
• External libraries
  – Piggybank
  – DataFu
  – DataFu Hourglass
  – SimpleJson
  – ElephantBird

So what is Pig?

Pig is a champion.