Hadoop Pig By Ravikrishna Adepu

Post on 14-Dec-2015

Page 1: Hadoop Pig By Ravikrishna Adepu. Overview What is Pig? Motivation How is it being used Data Model/Architecture Components Pig Latin By Example

Hadoop Pig

By Ravikrishna Adepu

Overview

• What is Pig?
• Motivation
• How is it being used
• Data Model/Architecture
• Components
• Pig Latin By Example

What is Pig?

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig Latin

Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
• Ease of programming.
• Optimization opportunities.
• Extensibility: users can create their own functions to do special-purpose processing.

Hadoop Pig Architecture

• Client Machine (Pig job submission)
• Pig -> MapReduce transformations
• MapReduce jobs
• HDFS (Hadoop Distributed File System)

[Diagram: Client Machine, MapReduce Transformations, MapReduce Jobs, HDFS]

Features

• Simple-to-understand data flow language for analysts familiar with scripting languages
• Fast, iterative language with a strong MapReduce compilation engine
• Rich, multivalued, nested operations performed on large data sets

Pig v/s SQL

Pig

• Pig is procedural

• Nested relational data model (No constraints on Data Types)

• Schema is optional

• Scan-centric analytic workloads (No Random reads or writes)

• Limited query optimization

SQL

• SQL is declarative

• Flat relational data model (Data is tied to a specific Data Type)

• Schema is required

• OLTP + OLAP workloads

• Significant opportunity for query optimization

Pig procedural v/s SQL declarative

PIG

Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

SQL

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using (ipaddr)
group by dma;

Motivation behind Pig

Challenges:
• MapReduce requires a Java programmer
• MapReduce can require multiple stages to come to a solution
• User has to reinvent common functionality (join, filter, etc.)
• Long development cycle with rigorous testing stages

Solution :

• Opens the system to users familiar with PHP, Ruby, Python
• 4 hours in Java -> 15 minutes in Pig Latin
• Provides common operations like join, group, filter, and sort
• Pig provides Pig Latin, which increases productivity 10x

How is Pig being used

• Web log processing
• Data processing for web search platforms
• Ad hoc queries across large data sets
• Rapid prototyping of algorithms for large data sets

Quick fact: 70% of production jobs at Yahoo Inc. are run using Hadoop Pig

Pig Processing :

• Grunt, the Pig shell
• Submit a script directly
• PigServer Java class, a JDBC-like interface
• PigPen, which allows textual and graphical scripting; it samples data and shows an example data flow

Components:
• Pig resides on the user machine
• No need to install anything extra on the cluster
• Job is submitted to the cluster and executed on the cluster

First look at the program:
• Let's first look at the programming language itself so that you can see how it's significantly easier than having to write mapper and reducer programs.
• The first step in a Pig program is to LOAD the data you want to manipulate from HDFS.
• Then you run the data through a set of transformations (which, under the covers, are translated into a set of mapper and reducer tasks).
• Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.
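
The three steps above can be sketched as a short script. This is only an illustration: the input file 'students' (tab-separated name, age, gpa), the filter condition, and the output directory 'good_students' are assumptions chosen for the example, not part of the original slides.

```pig
-- Step 1: LOAD the data. PigStorage is the default load function,
-- so naming it here only makes the tab delimiter explicit.
students = LOAD 'students' USING PigStorage('\t')
           AS (name:chararray, age:int, gpa:float);

-- Step 2: transform. Here, keep only rows with a gpa of 3.0 or higher.
good = FILTER students BY gpa >= 3.0;

-- Step 3a: DUMP prints the result to the terminal ...
DUMP good;

-- Step 3b: ... or STORE writes it to a directory (hypothetical name),
-- here as comma-separated text.
STORE good INTO 'good_students' USING PigStorage(',');
```

Pig builds a plan from the statements and only launches jobs when it reaches a DUMP or STORE, so a script without either produces no output.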

Starting grunt :

• cd /usr/share/doc/pig-0.11.0+44/examples/data
• ls
• $ pig -x local
• You should see a prompt like grunt>

We can run Pig in two modes:
• Standalone mode (local mode)
• Distributed mode (MapReduce mode)

Execution Modes

Pig has two execution modes:

Local Mode: To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).

Mapreduce Mode: To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).

Loading Data:
• Use the LOAD operator and the load/store functions to read data into Pig (PigStorage is the default load function).

Storing Final Results:
• Use the STORE operator and the load/store functions to write results to the file system (PigStorage is the default store function).

Debugging Pig Latin

Continued :

Pig Latin provides operators that can help you debug your Pig Latin statements:

• Use the DUMP operator to display results to your terminal screen.

• Use the DESCRIBE operator to review the schema of a relation.

• Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.

• Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
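
A minimal sketch of the four operators in a Grunt session follows. The 'students' input and the filter are assumptions made for illustration only.

```pig
A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
B = FILTER A BY age >= 20;

DESCRIBE B;    -- prints the schema of relation B
EXPLAIN B;     -- prints the logical, physical and MapReduce plans for B
ILLUSTRATE B;  -- runs the statements on a small sample, showing each step
DUMP B;        -- executes the plan and prints the tuples of B
```

DESCRIBE and EXPLAIN only inspect the plan, so they are cheap to use on large inputs. DUMP runs the full job.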

Pig Latin data types:

Basic data types:
• INT
• LONG
• FLOAT
• DOUBLE
• CHARARRAY
• BYTEARRAY
• BOOLEAN

Continued :

Complex data types:
• BAG
• TUPLE
• MAP

Syntax (cast operators):
{(data_type) | (tuple(data_type)) | (bag{tuple(data_type)}) | (map[]) } field

Usage :

Cast operators enable you to cast or convert data from one type to another, as long as the conversion is supported. For example, suppose you have an integer field, myint, which you want to convert to a string. You can cast this field from int to chararray using (chararray) myint.

• A field can be explicitly cast. Once cast, the field remains that type (it is not automatically cast back). In this example $0 is explicitly cast to int.

B = FOREACH A GENERATE (int)$0 + 1;

• Where possible, Pig performs implicit casts. In this example $0 is cast to int (regardless of underlying data) and $1 is cast to double.

B = FOREACH A GENERATE $0 + 1, $1 + 1.0;

Tuple construction

A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate (name, age);
store B into 'results';

Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1

Output (results):
(joe smith,20)
(amy chen,22)
(leo allen,18)

Bag construction

A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate {(name, age)}, {name, age};
store B into 'results';

Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1

Output (results):
{(joe smith,20)} {(joe smith),(20)}
{(amy chen,22)} {(amy chen),(22)}
{(leo allen,18)} {(leo allen),(18)}

Map construction

A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate [name, gpa];
store B into 'results';

Input (students):
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1

Output (results):
[joe smith#3.5]
[amy chen#3.2]
[leo allen#2.1]
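
Once tuples, bags, and maps have been constructed, their contents can be read back out. The sketch below is an illustration only: the aliases pair, pairs, and scores are names invented for this example, and the input is the same 'students' file used above.

```pig
A = LOAD 'students' AS (name:chararray, age:int, gpa:float);

-- Build one tuple, one bag and one map per input row.
B = FOREACH A GENERATE (name, age)   AS pair,
                       {(name, age)} AS pairs,
                       [name, gpa]   AS scores;

-- Tuple fields are read with a dot (positional here),
-- map values are looked up by key with #.
C = FOREACH B GENERATE pair.$0, pair.$1, scores#'joe smith';

-- FLATTEN un-nests a bag: one output row per tuple in the bag.
D = FOREACH B GENERATE FLATTEN(pairs);
```

In C, the map lookup yields the gpa only for the row whose key is 'joe smith'; for other rows the key is absent and the result is null.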

Pig Latin: UDF

• Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. Functions can be a part of almost every operator in Pig.

• All UDF names are case sensitive.

UDF: Types

• Eval Functions (EvalFunc)
  Ex: StringConcat (built-in): generates the concatenation of the first two fields of a tuple.

• Aggregate Functions (EvalFunc & Algebraic)
  Ex: COUNT, AVG (both built-in)

• Filter Functions (FilterFunc)
  Ex: IsEmpty (built-in)

• Load/Store Functions (LoadFunc/StoreFunc)
  Ex: PigStorage (built-in)

Note: URL for built in functions: http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-summary.html
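
As a rough illustration of how these function types appear in a script, the sketch below calls a few built-ins and one custom UDF. The 'students' input is assumed, and myudfs.jar and myudfs.UPPER are hypothetical names standing in for a jar and an eval function you would write yourself.

```pig
-- Load function: PigStorage reads delimited text.
A = LOAD 'students' USING PigStorage('\t')
    AS (name:chararray, age:int, gpa:float);

-- Eval function: CONCAT joins two chararray values.
B = FOREACH A GENERATE CONCAT(name, '!');

-- Aggregate functions operate on the bag produced by GROUP.
G = GROUP A BY age;
C = FOREACH G GENERATE group AS age, COUNT(A), AVG(A.gpa);

-- Filter function: IsEmpty tests whether a bag is empty.
D = FILTER G BY NOT IsEmpty(A);

-- Custom UDF (hypothetical jar and class): register, then call by its
-- fully qualified, case-sensitive name.
REGISTER myudfs.jar;
E = FOREACH A GENERATE myudfs.UPPER(name);
```

Because function names are case sensitive, writing count(A) instead of COUNT(A) would not resolve to the built-in.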

How It Works

A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';

Execution Plan
Map: Filter, Count
Combine/Reduce: Sum

pig.jar:
• parses
• checks
• optimizes
• plans execution
• submits jar to Hadoop
• monitors job progress
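
To make a filter, group, and count flow of this kind concrete, here is a small worked trace. The file contents are invented for illustration and the field types are an assumption; the comments show what each relation holds for that input.

```pig
-- Hypothetical contents of 'myfile' (tab-separated x, y, z):
--    1   a   b
--    1   c   d
--   -2   e   f
--    3   g   h
A = LOAD 'myfile' AS (x:int, y:chararray, z:chararray);
B = FILTER A BY x > 0;       -- drops the row with x = -2
C = GROUP B BY x;            -- (1,{(1,a,b),(1,c,d)}) and (3,{(3,g,h)})
D = FOREACH C GENERATE group, COUNT(B);
DUMP D;                      -- D holds the tuples (1,2) and (3,1)
```

The filter and the per-key partial counts can run on the map side, which is why the combine and reduce phases only need to sum those partial counts.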

Project

Word count using Hadoop Pig:

Preparing a text file:
• It's definitely a little more interesting if you can work with some data you know or at least have an interest in.
• I used sample data provided by Cloudera for Hadoop Pig.

Import the file into the Sandbox:
• Go to the File Browser tab and upload the .txt file. Take note of the default location it is uploaded to (/user/hue).

Write a Pig script to parse the data and dump to a file:

-- script starts here
a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';
/* multi line comments */
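
One possible extension of the word count, sketched here under the assumption that the same input path is used, sorts the words by frequency and prints the ten most common. The aliases cnt, e, and f are names chosen for this example.

```pig
a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;

-- Name the output fields so they can be referenced in later statements.
d = foreach c generate group as word, COUNT(b) as cnt;

-- Sort by count, highest first, and keep the top ten rows.
e = order d by cnt desc;
f = limit e 10;
dump f;
```

Using dump instead of store is convenient while experimenting, since the result appears in the terminal rather than in an HDFS directory.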

RESULTS

QUESTIONS?

Thank you !