Apache Pig

Apache Hadoop: Pig Fundamentals
Shashidhar HB

Upload: shashidhar-basavaraju

Post on 26-Jan-2015


DESCRIPTION

Introduction to Apache PIG

TRANSCRIPT

Page 1: Apache Pig

Apache Hadoop


Pig Fundamentals

Shashidhar HB

Page 2: Apache Pig

Course Outline

Why Hadoop
Hadoop and the Cloud Industry
Querying Large Data...
Pig to the Rescue
Pig: Why? What? How?
Pig Basics: Install, Configure, Try
Dwelling Deeper into Pig and Pig Latin
Q&A


Page 3: Apache Pig

Why Hadoop? (1/3)

You have 10x more DATA than you did 3 years ago!

BUT do you know 10x MORE about your

BUSINESS?

NO!

Page 4: Apache Pig

Why Hadoop? (2/3)


A lot of data, BIG data! Information

(The Big Picture)

We are not able to effectively store and analyze all the data we have, so we are not able to see the big picture!

Page 5: Apache Pig

Why Hadoop? (3/3)

Big Data / Web Scale: datasets that grow so large that they become awkward to work with using traditional database management tools

Handling Big Data using the traditional approach is costly and rigid (difficulties include capture, storage, search, sharing, analytics and visualization)

Google, Yahoo, Facebook and LinkedIn handle petabytes of data every day.

They all use HADOOP to solve their BIG DATA problem


Page 6: Apache Pig

How big is BIG DATA!?


Page 7: Apache Pig


So Mr. HADOOP says he has a solution to our BIG problem!

Page 8: Apache Pig

Hadoop

Hadoop is open-source software for RELIABLE and SCALABLE distributed computing

Hadoop provides a comprehensive solution to handle Big Data

Hadoop is HDFS: a high-availability data storage subsystem
(http://labs.google.com/papers/gfs.html: 2003)

+ MapReduce: a parallel processing system
(http://labs.google.com/papers/mapreduce.html: 2004)

Page 9: Apache Pig

Hadoop: Time line

2008: Yahoo! launched Hadoop

2009: Hadoop source code was made available to the free world

2010: Facebook claimed that they had the largest Hadoop cluster in the world, with 21 PB of storage

2011: Facebook announced the data had grown to 30 PB


Page 10: Apache Pig

Hadoop and the Cloud Industry

Stats: Facebook
▪ Started in 2004: 1 million users
▪ August 2008: Facebook reached over 100 million active users
▪ Now: 750+ million active users
"Bottom line: more users, more DATA"

The BIG challenge at Facebook!! Using historical data is a very big part of improving the user experience on Facebook. So storing and processing all these bytes is of immense importance.

Facebook tried Hadoop for this.

Page 11: Apache Pig


Hadoop turned out to be a great solution, but there was one little problem!

Page 12: Apache Pig

What is the PROBLEM?

MapReduce requires skilled Java programmers to write standard MapReduce programs

Developers are more fluent in querying data using SQL

"Pig says, No Problemo!"

Page 13: Apache Pig

Scenario


Input: User profiles, Page visits

Find the top 5 most visited pages by users aged 18-25

Page 14: Apache Pig

MapReduce solution


Page 15: Apache Pig

Same solution in Pig

1. Users = LOAD 'users' AS (name, age);

2. Filtered = FILTER Users BY age >= 18 AND age <= 25;

3. Pages = LOAD 'pages' AS (user, url);

4. Joined = JOIN Filtered BY name, Pages BY user;

5. Grouped = GROUP Joined BY url;

6. Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;

7. Sorted = ORDER Summed BY clicks DESC;

8. Top5 = LIMIT Sorted 5;

9. STORE Top5 INTO 'top5sites';


Page 16: Apache Pig

So what is Pig?

Pig is a dataflow language
• The language is called Pig Latin
• Pretty simple syntax
• Under the covers, Pig Latin scripts are turned into MapReduce jobs and executed on the cluster

Pig Latin: High-level procedural language

Pig Engine: Parser, Optimizer and distributed query execution
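One quick way to see the Pig Engine at work is the EXPLAIN operator. A minimal sketch in Grunt (the relation name and input file here are illustrative, not from this deck):

```pig
grunt> Users = LOAD 'users' AS (name, age);
-- Prints the logical, physical and MapReduce plans for this relation
grunt> EXPLAIN Users;
```

The MapReduce plan in that output is what actually runs on the cluster.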


Page 17: Apache Pig

Pig v/s SQL

PIG

• Pig is procedural
• Nested relational data model (no constraints on data types)
• Schema is optional
• Scan-centric analytic workloads (no random reads or writes)
• Limited query optimization

SQL

• SQL is declarative
• Flat relational data model (data is tied to a specific data type)
• Schema is required
• OLTP + OLAP workloads
• Significant opportunity for query optimization


Page 18: Apache Pig

Pig procedural v/s SQL declarative


PIG:

Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

SQL:

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo
join (
  select name, ipaddr
  from users
  join clicks on (users.name = clicks.user)
  where value > 0
) using (ipaddr)
group by dma;

Page 19: Apache Pig

Features of Pig

Joining datasets

Grouping data

Referring to elements by position rather than name ($0, $1, etc)

Loading non-delimited data using a custom SerDe (Writing a custom Reader and Writer)

Creation of user-defined functions (UDF), written in Java

And more..

Page 20: Apache Pig


Under the hood

Page 21: Apache Pig

Pig: Install

Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster, there is nothing extra to install on the cluster: Pig launches jobs and interacts with HDFS (or other Hadoop file systems) from your workstation.

Download a stable release from http://hadoop.apache.org/pig/releases.html and unpack the tarball in a suitable place on your workstation:

% tar xzf pig-x.y.z.tar.gz

It’s convenient to add Pig’s binary directory to your command-line path. For example:

% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin

You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.


Page 22: Apache Pig

Pig: Configure

Execution Types:

Local mode (pig -x local)

Hadoop mode: Pig must be configured with the cluster's namenode and jobtracker

1. Put the Hadoop config directory in the PIG classpath:
% export PIG_CLASSPATH=$HADOOP_INSTALL/conf/

2. Create a pig.properties file:
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021


Page 23: Apache Pig

Pig: Run

Script: Pig can run a script file that contains Pig commands. For example:
% pig script.pig
runs the commands in the local file 'script.pig'.

Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.

Grunt: Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run, and the -e option is not used. Note: It is also possible to run Pig scripts from within Grunt using run and exec.

Embedded: You can run Pig programs from Java, much like you can use JDBC to run SQL programs from Java. There are more details on the Pig wiki at http://wiki.apache.org/pig/EmbeddedPig


Page 24: Apache Pig

Pig-Pig Latin Constructs

PigLatin: A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example:

1. A GROUP operation is a type of statement:
grunt> grouped_records = GROUP records BY year;

2. The command to list the files in a Hadoop filesystem is another example of a statement:
ls /

3. A LOAD operation loads data from a tab-separated file into a Pig relation:
grunt> records = LOAD 'sample.txt' AS (year:chararray, temperature:int, quality:int);

Data: In Pig, a single element of data is an atom. A collection of atoms, such as a row or a partial row, is a tuple. Tuples are collected together into bags.

Atom –> Tuple (row / partial row) –> Bag
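As an illustrative sketch of this nesting (the file and field names are hypothetical), a GROUP produces tuples whose second field is a bag of the original tuples:

```pig
visits = LOAD 'visits.txt' AS (user:chararray, url:chararray);
-- each result tuple is (group, {bag of visit tuples})
byUser = GROUP visits BY user;
DUMP byUser;
-- e.g. a line like (alice,{(alice,a.com),(alice,b.com)})
```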


Page 25: Apache Pig

Demo: Sample Data (employee.txt)

Example contents of 'employee.txt', a tab-delimited text file:

1       Krishna      234000000   none
2       Krishna_01   234000000   none
124163  Shashi       10000       cloud
124164  Gopal        1000000     setlabs
124165  Govind       1000000     setlabs
124166  Ram          450000      es
124167  Madhusudhan  450000      e&r
124168  Hari         6500000     e&r
124169  Sachith      50000       cloud


Page 26: Apache Pig

Demo: Employees with salary > 1 lakh (100,000)

-- Loading data from employee.txt into the empls relation, with a schema
empls = LOAD 'employee.txt' AS (id:int, name:chararray, salary:double, dept:chararray);

-- Filtering the data as required
rich = FILTER empls BY $2 > 100000;

-- Sorting
sortd = ORDER rich BY salary DESC;

-- Storing the final results
STORE sortd INTO 'rich_employees.txt';

-- Or alternatively we can dump the records on the screen
DUMP sortd;

--------------------------------------------------------------------
-- Group by salary
grp = GROUP empls BY salary;

-- Get count of employees in each salary group
cnt = FOREACH grp GENERATE group, COUNT(empls.id) AS emp_cnt;


Page 27: Apache Pig

More PigLatin (1/2)

To view the schema of a relation:
DESCRIBE empls;

To view step-by-step execution of a series of statements:
ILLUSTRATE empls;

To view the execution plan of a relation:
EXPLAIN empls;

Join two data sets:
data1 = LOAD 'data1' AS (col1, col2, col3, col4);
data2 = LOAD 'data2' AS (colA, colB, colC);
jnd = JOIN data1 BY col3, data2 BY colA PARALLEL 50;
STORE jnd INTO 'outfile';


Page 28: Apache Pig

More PigLatin (2/2)

Load using PigStorage:
empls = LOAD 'employee.txt' USING PigStorage('\t') AS (id:int, name:chararray, salary:double, dept:chararray);

Store using PigStorage:
STORE sortd INTO 'rich_employees.txt' USING PigStorage('\t');


Page 29: Apache Pig


Flexibility with PIG

Is that all we can do with PIG!!??

Page 30: Apache Pig

PigLatin: UDF

Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. Functions can be a part of almost every operator in Pig

All UDFs are case sensitive


Page 31: Apache Pig

UDF: Types

Eval Functions (EvalFunc)
Ex: StringConcat (built-in): generates the concatenation of the first two fields of a tuple.

Aggregate Functions (EvalFunc & Algebraic)
Ex: COUNT, AVG (both built-in)

Filter Functions (FilterFunc)
Ex: IsEmpty (built-in)

Load/Store Functions (LoadFunc / StoreFunc)
Ex: PigStorage (built-in)

Note: URL for built-in functions: http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-summary.html
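A small sketch combining these function types, assuming the empls relation loaded in the demo earlier (IsEmpty, COUNT and AVG are the built-ins named above):

```pig
-- Group employees by department
by_dept = GROUP empls BY dept;
-- FilterFunc: drop any empty groups
nonempty = FILTER by_dept BY NOT IsEmpty(empls);
-- Aggregate functions: headcount and average salary per department
stats = FOREACH nonempty GENERATE group, COUNT(empls) AS n, AVG(empls.salary) AS avg_sal;
```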


Page 32: Apache Pig

UDF: Before writing a custom UDF visit...

Piggy Bank: a place for Pig users to share their functions

DataFu (LinkedIn's collection of UDFs): a Hadoop library for large-scale data processing


Page 33: Apache Pig

UDF: How to write a custom UDF (EvalFunc)

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}


Page 34: Apache Pig

UDF: How to use custom UDF in pig script?

-- myscript.pig
REGISTER myudfs.jar;

Note: myudfs.jar should not be surrounded with quotes

A = LOAD 'employee_data' AS (id:int, name:chararray, salary:double, dept:chararray);

B = FOREACH A GENERATE myudfs.UPPER(name);

DUMP B;


Page 35: Apache Pig

UDF: How to execute pig script with custom UDF?

java -cp pig.jar org.apache.pig.Main -x local myscript.pig

or pig -x local myscript.pig

Note: myudfs.jar should be in the classpath!

Locating a UDF jar file: Pig first checks the classpath. Pig assumes that the location is either an absolute path or a path relative to the location from which Pig was invoked.

Page 36: Apache Pig

PigLatin: Pig Data Types


Pig Type     Java Class
bytearray    DataByteArray
chararray    String
int          Integer
long         Long
float        Float
double       Double
tuple        Tuple
bag          DataBag
map          Map<Object, Object>
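The complex types (tuple, bag, map) can be nested inside a schema. A hedged sketch with hypothetical field names:

```pig
logs = LOAD 'logs' AS (user:chararray,
                       visit:tuple(url:chararray, time:long),
                       tags:bag{t:tuple(tag:chararray)},
                       props:map[]);
```

Fields without a declared type default to bytearray.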

Page 37: Apache Pig


All is well, but.. what about the performance trade-offs?

Page 38: Apache Pig

Performance of Pig v/s Hadoop (MapReduce)


Source: Yahoo

Page 39: Apache Pig


Q&A

Mail me:

[email protected]

Page 40: Apache Pig


That’s all folks!