apache pig presentation_siddharth_mathur

CSC 5800: Intelligent Systems: Algorithms and Tools. Pig Latin: A Not-So-Foreign Language for Data Processing. By Siddharth Mathur


DESCRIPTION

Overview of MapReduce and Details of Apache Pig features, special commands and the Debugger

TRANSCRIPT

Page 1: Apache pig presentation_siddharth_mathur

1

CSC 5800:

Intelligent Systems: Algorithms and Tools

Pig Latin: A Not-So-Foreign Language for Data Processing

By Siddharth Mathur

Page 2: Apache pig presentation_siddharth_mathur

2

What we will be covering

Introduction

MapReduce Overview

Pig Overview

Pig Features

Pig Latin

Pig Debugger

Demo

Page 3: Apache pig presentation_siddharth_mathur

3

Introduction

Innovation critically depends on analyzing the terabytes of data collected every day.

SQL can solve structured-data problems.

Parallel database processing:

– Data this large can't be analyzed serially.

– It has to be analyzed in parallel.

– Shared-nothing clusters are the way to go.

Page 4: Apache pig presentation_siddharth_mathur

4

Parallel DB Products

Teradata, Oracle RAC, Netezza

Expensive at web scale

Programmers have to write complex SQL queries; because of this, the declarative programming style is not preferred.

Page 5: Apache pig presentation_siddharth_mathur

5

Procedural programming

Map-Reduce programming model

It can easily perform a group-by aggregation in parallel over a cluster of machines.

The programmer provides a map function that filters or transforms the input records.

The reduce function performs the aggregation.

Appealing to programmers because only two high-level functions are needed to enable parallel processing.
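The two-function contract just described can be sketched in plain Python. This is a toy single-machine model, not Hadoop; `run_mapreduce`, `map_fn`, and `reduce_fn` are illustrative names:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process model of the two MapReduce phases:
    map each record to (key, value) pairs, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in map_fn(record):
            groups[key].append(value)
    # reduce phase: one aggregation per key
    return {key: reduce_fn(values) for key, values in groups.items()}

# Group-by aggregation: total sales per city (made-up sample data)
sales = [("detroit", 10), ("chicago", 5), ("detroit", 7)]
totals = run_mapreduce(sales,
                       map_fn=lambda rec: [(rec[0], rec[1])],  # filter/transform step
                       reduce_fn=sum)                          # aggregation step
print(totals)  # {'detroit': 17, 'chicago': 5}
```

On a real cluster the grouping happens across machines during the shuffle; here a dictionary plays that role.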

Page 6: Apache pig presentation_siddharth_mathur

6

MapReduce Overview

Programming Model

– Designed for large-scale data analytics

– Works over Hadoop

– Java based

– Splits data into independent chunks and processes them in parallel

Program structure

– Mapper

– Reducer

– Driver Program

Page 7: Apache pig presentation_siddharth_mathur

7

MapReduce Driver Program

Works as the ‘main’ function for an MR job

Takes care of

– Number of arguments

– Input Data Location

– Input Data Types

– Output Data Location

– Output Data types

– Number of Mappers

– Number of Reducers

Page 8: Apache pig presentation_siddharth_mathur

8

Mapper and Reducer Class

Mapper Class

– Its main task is to implement the per-record processing logic

– Computes tasks like:

• Filtering

• Splitting

• Tokenizing

• Transforming

Reducer Class

– Works as an aggregator

– Aggregates the intermediate results gathered from the Mapper

Page 9: Apache pig presentation_siddharth_mathur

9

Word Count Execution

Input (three lines):

the quick brown fox
the fox ate the mouse
how now brown cow

Map: each mapper emits a (word, 1) pair per token, e.g. (the, 1), (quick, 1), (brown, 1), (fox, 1) for the first line.

Shuffle & Sort: pairs with the same word are routed to the same reducer.

Reduce: each reducer sums the counts per word.

Output:

brown 2, fox 2, how 1, now 1, the 3 (reducer 1)
ate 1, cow 1, mouse 1, quick 1 (reducer 2)

Input → Map → Shuffle & Sort → Reduce → Output

Page 10: Apache pig presentation_siddharth_mathur

10

MapReduce Word Count Program

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (token, 1) for every token in the input line.
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Sum the counts gathered for each word.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Page 11: Apache pig presentation_siddharth_mathur

11

Map Reduce Limitations

The one-input, two-stage data flow is extremely rigid.

– For tasks like joins or iterative computations, workarounds have to be devised.

– Custom code is needed for common tasks like filtering, transforming, or projection.

– The code is difficult to reuse and maintain.

Moreover, its custom data types, its workflow, and the fact that people have to learn Java make it a tough choice.

Page 12: Apache pig presentation_siddharth_mathur

12

Pig

An Apache open source project.

Provides an engine for executing data flows in parallel on Hadoop.

Includes a language called ‘Pig Latin’ for expressing these data flows.

A high-level data workflow language.

It has best of both worlds:

– High Level declarative querying like SQL

– Low Level procedural like Map Reduce

Page 13: Apache pig presentation_siddharth_mathur

13

Hadoop Stack

Data Processing Layer: Hive, Pig, HBase, Hadoop MR

Resource Management Layer: Hadoop YARN

Storage Layer: HDFS

Page 14: Apache pig presentation_siddharth_mathur

14

Why Choose Pig

Written like SQL, compiled into MapReduce

Fully nested data model

Extensive support for UDFs

Can answer multiple questions in one single workflow. Example (word count):

A = load './input.txt';

B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

C = group B by word;

D = foreach C generate COUNT(B), group;

store D into './output';
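As a rough illustration, the same word-count dataflow can be simulated in plain Python. This is only a sketch of the script's semantics (tokenize, flatten, group, count), not how Pig executes it:

```python
from collections import Counter

def pig_wordcount(lines):
    # B = foreach A generate flatten(TOKENIZE($0)) as word
    words = [word for line in lines for word in line.split()]
    # C = group B by word; D = foreach C generate COUNT(B), group
    return Counter(words)

result = pig_wordcount(["the quick brown fox", "the fox"])
print(result["the"])  # 2
```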

Page 15: Apache pig presentation_siddharth_mathur

15

Features and Motivation

The design goal of Pig is to provide programmers with an appealing experience for performing ad-hoc analysis of extremely large data sets.

– DataFlow Language

– QuickStart and Interoperability

– Nested Data Model

– UDF’s

– Debugging Environment

Page 16: Apache pig presentation_siddharth_mathur

16

Data Flow Language

Each step specifies a single high-level data transformation.

Different from SQL, where the query specifies a single combined result.

Because each step is explicit, the system still gets an opportunity to apply optimizations (e.g., reordering the filters below).

– Example:

A = LOAD 'input.txt';

B = FILTER A BY udf(column1);

C = FILTER B BY column1 > 0.8;

Page 17: Apache pig presentation_siddharth_mathur

17

Quick start and Interoperability

Data Load

– Capability for ad-hoc analysis.

– Can run queries directly on raw data, e.g., a dump from a search engine.

– The user just has to provide a function that tells Pig how to parse the file content into tuples.

– Similarly for output:

• Any output format.

• These functions can be reused.

• Output can be fed to visualization tools or dumped to Excel directly.

Page 18: Apache pig presentation_siddharth_mathur

18

Pig as part of workflow

Pig easily becomes part of a workflow ecosystem

– Can take most of the input types

– Can output in many of the forms

– Doesn’t take over the data, i.e., it does not lock the data that is being processed.

– Read only data analysis

Page 19: Apache pig presentation_siddharth_mathur

19

Optional data schemas

Schema can be provided by the user :

– In the beginning

– On the fly

– Example:

• A = LOAD 'input.txt' AS (column1, column2);

• B = FILTER A BY column1 > 5;

If the schema is not provided, the columns can be referred to as $0, $1, $2, … for the 1st, 2nd, 3rd column, etc. Example:

A = LOAD 'input.txt';

B = FILTER A BY $0 > 5;

Page 20: Apache pig presentation_siddharth_mathur

20

Nested Data Model

Suppose, for each document, we want to extract every term and its positions.

Desired output format: term → Map<documentId, Set<positions>>.

SQL data model:

Term | Document ID | Position
Hi   | 1           | 2
Hi   | 1           | 5

Or keep it in normalized form, i.e.,

– term_info(termId, termString)

– position_info(termId, position, documentId)

Page 21: Apache pig presentation_siddharth_mathur

21

Problem resolved using Pig

In Pig, complex data types like map, tuple, or bag can occur as a field of a table itself.

Example:

Term | Document ID | Positions
Hi   | 1           | (2, 5, 8, ..)

This approach is good because it is closer to how a programmer thinks.

Data is stored on disk in the same nested fashion.

It makes writing UDFs easier.
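A small plain-Python sketch of the contrast between the nested and the normalized form (illustrative data; `term_index` and `flat_rows` are made-up names):

```python
# Nested model: positions live with the term itself, the way a
# programmer thinks about it, instead of a separate position_info table.
term_index = {
    "hi": {1: [2, 5, 8]},   # term -> {documentId: [positions]}
}

# Normalized SQL-style equivalent: one row per (term, doc, position)
flat_rows = [(term, doc, pos)
             for term, docs in term_index.items()
             for doc, positions in docs.items()
             for pos in positions]
print(flat_rows)  # [('hi', 1, 2), ('hi', 1, 5), ('hi', 1, 8)]
```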

Page 22: Apache pig presentation_siddharth_mathur

22

UDFs

A significant part of data analysis is custom processing.

For example, a user might want to apply natural-language stemming,

or check whether a page is spam, among many other tasks.

To support this, Pig Latin has extensive support for UDFs; most such tasks can be solved with a UDF.

A UDF can take non-atomic input and can produce non-atomic output as well.

Currently, UDFs can be written in Java or Python.

Page 23: Apache pig presentation_siddharth_mathur

23

Debugging Environment

In any language, getting a data processing program to work correctly usually takes many iterations.

The first few iterations mostly produce errors.

With large-scale data, this wastes serious time and resources.

Debuggers can help.

Pig has a novel debugging environment:

It generates concise examples from the input data.

The data samples are carefully chosen to resemble the real data as closely as possible.

Page 24: Apache pig presentation_siddharth_mathur

24

Pig Latin

Language in which data workflow statements are written

It runs in a shell called ‘Grunt’

It has a shared repository named Piggybank

We can create our own custom UDFs and add them to Piggybank

Page 25: Apache pig presentation_siddharth_mathur

25

Data Model

A rich, yet simple data model

Atoms

– Simple atomic values like string or number

Tuple

– A collection of fields each of which can be of any data type

– Analogous to rows in SQL

Bag

– A collection of tuples (duplicates allowed)

– The tuples in a bag can be heterogeneous

Page 26: Apache pig presentation_siddharth_mathur

26

Data Model (cont.)

Example of a record with an atom, a tuple, and a bag:

T = ('alice', ('lakers', 1), {('ipod', 2), ('james')})

A tuple is written with round braces

A bag is written with curly braces
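As a rough analogy, such a record can be modeled with plain Python types (an assumed mapping for illustration, not Pig's internal representation; a list stands in for the bag since Python sets cannot hold arbitrary values):

```python
# Atoms -> str/int, tuples -> Python tuples, bags -> lists of tuples.
atom = 'alice'
tup = ('lakers', 1)                  # tuple: fields of any type
bag = [('ipod', 2), ('james',)]      # bag: a collection of tuples, possibly heterogeneous
record = (atom, tup, bag)            # nesting is allowed at any depth
```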

Page 27: Apache pig presentation_siddharth_mathur

27

Specifying Input Data : LOAD

It's the first step in a Pig Latin program

Specifies what the input files are

How their contents are to be deserialized, i.e., converted to the Pig data model

LOAD command

– Example:

queries = LOAD 'query_log.csv'

USING PigStorage(',')

AS (userId, queryString, timestamp);

Page 28: Apache pig presentation_siddharth_mathur

28

LOAD (cont.)

Both the ‘USING’ clause and the ‘AS’ clause are optional

We can work without them as shown earlier ($0 for first field)

PigStorage is a pre-defined function

A custom function can be used instead of PigStorage

Page 29: Apache pig presentation_siddharth_mathur

29

Per Tuple Processing : FOREACH

Similar to a FOR statement

It is used to apply per-tuple processing to each tuple of the dataset

Example:

expanded_queries = FOREACH queries GENERATE

userId, Expand(queryString), timestamp;

It is not a FILTERING command

‘Expand’ is a UDF that can take an atomic input and generate a bag of outputs
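The per-tuple semantics can be sketched in plain Python. Here `expand` is a stand-in for the Expand UDF named on the slide, with made-up behavior:

```python
def expand(query):
    """Hypothetical query-expansion UDF: one input, a bag of outputs."""
    return [query, query + " news"]

queries = [(1, "pig", 100), (2, "hadoop", 101)]  # (userId, queryString, timestamp)

# FOREACH queries GENERATE userId, Expand(queryString), timestamp:
# each input tuple is processed independently, so this loop could run
# in parallel across tuples with no coordination.
expanded = [(user, exp, ts)
            for (user, q, ts) in queries
            for exp in expand(q)]
```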

Page 30: Apache pig presentation_siddharth_mathur

30

Per Tuple Processing : FOREACH(cont.)

The semantics of FOREACH are such that there is no dependency between different input tuples, permitting an efficient parallel implementation

Page 31: Apache pig presentation_siddharth_mathur

31

Discarding Unwanted Data : FILTER

Used as a where clause

Can provide anything in the expression

– Query = FILTER queries By user_id neq ‘bot’;

We can provide a UDF also, like

– Query = FILTER queries by Isbot(user_id);

Page 32: Apache pig presentation_siddharth_mathur

32

COGROUP

Similar to Join

Groups bags of different inputs together

Ease of use for UDFs

– grouped_data = COGROUP results BY queryString, revenue BY queryString;
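A plain-Python sketch of the COGROUP semantics (illustrative only, not Pig's implementation): for each key, the matching tuples of each input are collected into separate bags, unlike JOIN, which flattens them together.

```python
from collections import defaultdict

def cogroup(left, right, key_l, key_r):
    """Group two inputs by key, keeping one bag per input."""
    out = defaultdict(lambda: ([], []))
    for t in left:
        out[t[key_l]][0].append(t)
    for t in right:
        out[t[key_r]][1].append(t)
    return dict(out)

results = [("pig", "url1"), ("pig", "url2")]  # (queryString, url)
revenue = [("pig", 10)]                       # (queryString, amount)
grouped = cogroup(results, revenue, 0, 0)
# grouped["pig"] -> ([("pig", "url1"), ("pig", "url2")], [("pig", 10)])
```

Keeping the bags separate is what makes COGROUP convenient for UDFs that want to see each side of the group individually.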

Page 33: Apache pig presentation_siddharth_mathur

33

JOIN

Not all users want to use COGROUP

Often a simple equi-join is all that is required

– Example

join_result = JOIN results BY queryString, revenue BY queryString;

Other types of join are also supported:

– Left outer

– Right outer

– Full outer
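A plain-Python sketch of the equi-join semantics (illustrative only): an equi-join behaves like a COGROUP followed by flattening the cross product of the two bags for each key.

```python
def equijoin(left, right, key_l, key_r):
    """Hash the right input by key, then emit one flattened tuple
    per matching (left, right) pair."""
    index = {}
    for t in right:
        index.setdefault(t[key_r], []).append(t)
    return [l + r for l in left for r in index.get(l[key_l], [])]

results = [("pig", "url1")]
revenue = [("pig", 10), ("pig", 20)]
joined = equijoin(results, revenue, 0, 0)
print(joined)  # [('pig', 'url1', 'pig', 10), ('pig', 'url1', 'pig', 20)]
```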

Page 34: Apache pig presentation_siddharth_mathur

34

Other Commands

Relational Operators

– UNION

– CROSS

– ORDER

– DISTINCT

– LIMIT

Eval Functions

– CONCAT

– COUNT

– DIFF

Page 35: Apache pig presentation_siddharth_mathur

35

PARALLEL clause

It is used to increase the parallelization of a job

We can specify the number of reduce tasks for the MR jobs created by Pig

It only affects the reduce tasks

There is no control over the number of map tasks

If it is not specified, the system falls back to its default, which is often a single reduce task

Page 36: Apache pig presentation_siddharth_mathur

36

PARALLEL clause (cont.)

Can only be applied to commands that involve a reduce phase

– COGROUP

– CROSS

– DISTINCT

– GROUP

– JOINS

– ORDER

Example:

A = LOAD 'file1';

B = LOAD 'file2';

C = CROSS A, B PARALLEL 10;

Page 37: Apache pig presentation_siddharth_mathur

37

Split Clause

We can split one input relation into several by providing conditions

A = LOAD 'data' AS (f1:int, f2:int, f3:int);

Suppose A contains the tuples (1, 2, 3) and (2, 5, 7).

SPLIT A INTO B IF f3 < 7, C IF f2 == 5;

B gets (1, 2, 3); C gets (2, 5, 7)

Any expression can be written as a condition

UDFs can be used

It is not a partitioning: a tuple may go to several outputs, or to none
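The semantics can be sketched in plain Python (conditions chosen for illustration so each output receives one of the sample tuples):

```python
# Each output gets every tuple that satisfies its condition; a tuple can
# land in several outputs (or in none), which is why SPLIT is not a
# partitioning of the input.
A = [(1, 2, 3), (2, 5, 7)]
B = [t for t in A if t[2] < 7]    # B IF f3 < 7
C = [t for t in A if t[1] == 5]   # C IF f2 == 5
print(B, C)  # [(1, 2, 3)] [(2, 5, 7)]
```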

Page 38: Apache pig presentation_siddharth_mathur

38

Output

There are two ways to emit results:

– STORE

• Used when you want to write the output to a location:

STORE output_1 INTO 'hadoopuser/output';

– DUMP

• Used to display the result in the Grunt shell itself:

DUMP query_result;

• Dumping doesn't store the output anywhere

Page 39: Apache pig presentation_siddharth_mathur

39

Building a Logical Plan

The Pig interpreter first parses all the commands the client issues

It verifies that the input files, bags, and columns referred to by each command are valid

It builds a logical plan for every bag the user defines

No processing is carried out at this point

Processing is triggered only when the user invokes a STORE or DUMP command

This is called a lazy execution approach

It enables optimizations such as FILTER reordering
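A minimal plain-Python sketch of the lazy-execution idea (illustrative only; Pig's planner is far more elaborate):

```python
class Plan:
    """Each command appends an operator to a logical plan; nothing runs
    until dump(), which gives a planner room to e.g. reorder filters
    before execution."""
    def __init__(self, data):
        self.data, self.ops = data, []

    def filter(self, pred):
        self.ops.append(pred)          # recorded, not executed
        return self

    def dump(self):                    # execution is triggered here
        out = self.data
        for pred in self.ops:
            out = [t for t in out if pred(t)]
        return out

plan = Plan([1, 5, 9]).filter(lambda x: x > 2).filter(lambda x: x < 9)
print(plan.dump())  # [5]
```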

Page 40: Apache pig presentation_siddharth_mathur

40

Debugging Environment

This is used to avoid running the complete script on the entire dataset

A user could create sample data by hand

But it is difficult to tailor such datasets, and users end up with self-cooked data

Pig Pen is Pig's debugging environment

It automatically creates a side dataset, called the sandbox dataset

Pig Pen has its own user interface

Page 41: Apache pig presentation_siddharth_mathur

41

Pig Pen

Outputs can be easily analyzed

Errors can be rectified earlier

Page 42: Apache pig presentation_siddharth_mathur

42

Future Work

User Interface

– A drag-and-drop style would help

– Easier creation of logical plan diagrams

UDF support for other languages

Unified Environment

– Currently Pig lacks control structures such as loops

– It has to be embedded in a host language for iterative tasks

Page 43: Apache pig presentation_siddharth_mathur

43

Summary

Not So Foreign Language

Aims at a sweet spot between SQL and MapReduce

Reusable and easy to use

Novel Debugging Environment: Pig Pen

Pig has an active and growing user base at Yahoo!

Pigs

– Eat anything

– Live anywhere

– Are domestic animals

Page 44: Apache pig presentation_siddharth_mathur

44

Page 45: Apache pig presentation_siddharth_mathur

45

Based on “Pig Latin: A Not-So-Foreign Language for Data Processing”

SIGMOD '08, June 9–12, 2008, Vancouver, BC, Canada

Christopher Olston (Yahoo! Research), Benjamin Reed (Yahoo! Research),

Utkarsh Srivastava (Yahoo! Research), Ravi Kumar (Yahoo! Research),

Andrew Tomkins (Yahoo! Research)