apache pig: a big data processor

50
PIG: A Big Data Processor Tushar B. Kute, http://tusharkute.com

Upload: tushar-b-kute

Post on 12-Feb-2017

855 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Apache Pig: A big data processor

PIG: A Big Data Processor

Tushar B. Kute,http://tusharkute.com

Page 2: Apache Pig: A big data processor

What is Pig?

• Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger sets of data representing them as data flows.

• Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.

• To write data analysis programs, Pig provides a high-level language known as Pig Latin.

• This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.

Page 3: Apache Pig: A big data processor

Apache Pig

• To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language.

• All these scripts are internally converted to Map and Reduce tasks.

• Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.

Page 4: Apache Pig: A big data processor

Why do we need Apache Pig?

• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex codes in Java.

• Apache Pig uses multi-query approach, thereby reducing the length of codes. For example, an operation that would require you to type 200 lines of code (LoC) in Java can be easily done by typing as less as just 10 LoC in Apache Pig. Ultimately, Apache Pig reduces the development time by almost 16 times.

• Pig Latin is SQL-like language and it is easy to learn Apache Pig when you are familiar with SQL.

• Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.

Page 5: Apache Pig: A big data processor

Features of Pig

• Rich set of operators: It provides many operators to perform operations like join, sort, filer, etc.

• Ease of programming: Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.

• Optimization opportunities: The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on semantics of the language.

• Extensibility: Using the existing operators, users can develop their own functions to read, process, and write data.

• UDF’s: Pig provides the facility to create User-defined Functions in other programming languages such as Java and invoke or embed them in Pig Scripts.

• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.

Page 6: Apache Pig: A big data processor

Pig vs. MapReduce

Page 7: Apache Pig: A big data processor

Pig vs. SQL

Page 8: Apache Pig: A big data processor

Pig vs. Hive

Page 9: Apache Pig: A big data processor

Applications of Apache Pig

• To process huge data sources such as web logs.

• To perform data processing for search platforms.

• To process time sensitive data loads.

Page 10: Apache Pig: A big data processor

Apache Pig – History

• In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute MapReduce jobs on every dataset.

• In 2007, Apache Pig was open sourced via Apache incubator.

• In 2008, the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project.

Page 11: Apache Pig: A big data processor

Pig Architecture

Page 12: Apache Pig: A big data processor

Apache Pig – Components

• Parser: Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.

• Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.

• Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.

• Execution engine: Finally the MapReduce jobs are submitted to Hadoop in a sorted order. Finally, these MapReduce jobs are executed on Hadoop producing the desired results.

Page 13: Apache Pig: A big data processor

Apache Pig – Data Model

Page 14: Apache Pig: A big data processor

Apache Pig – Elements

• Atom

– Any single value in Pig Latin, irrespective of their data, type is known as an Atom.

– It is stored as string and can be used as string and number. int, long, float, double, chararray, and bytearray are the atomic values of Pig.

– A piece of data or a simple atomic value is known as a field.

– Example: ‘raja’ or ‘30’

Page 15: Apache Pig: A big data processor

Apache Pig – Elements

• Tuple

– A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any type. A tuple is similar to a row in a table of RDBMS.

– Example: (Raja, 30)

Page 16: Apache Pig: A big data processor

Apache Pig – Elements

• Bag

– A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in th same position (column) have the same type.

– Example: {(Raja, 30), (Mohammad, 45)}

– A bag can be a field in a relation; in that context, it is known as inner bag.

– Example: {Raja, 30, {9848022338, [email protected],}}

Page 17: Apache Pig: A big data processor

Apache Pig – Elements

• Relation

– A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).

• Map

– A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by ‘[]’

– Example: [name#Raja, age#30]

Page 18: Apache Pig: A big data processor

Installation of PIG

Page 19: Apache Pig: A big data processor

Download

• Download the tar.gz file of Apache Pig from here:

http://mirror.fibergrid.in/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz

Page 20: Apache Pig: A big data processor

Extract and copy

• Extract this file using right-click -> 'Extract here' option or by tar -xzvf command.

• Rename the created folder 'pig-0.15.0' to 'pig'

• Now, move this folder to /usr/lib using following command:

$ sudo mv pig/ /usr/lib

Page 21: Apache Pig: A big data processor

Edit the bashrc file

• Open the bashrc file:

sudo gedit ~/.bashrc

• Go to end of the file and add following lines.

export PIG_HOME=/usr/lib/pig

export PATH=$PATH:$PIG_HOME/bin

• Type following command to make it in effect:

source ~/.bashrc

Page 22: Apache Pig: A big data processor

Start the Pig

• Start the pig in local mode:

pig -x local

• Start the pig in mapreduce mode (needs hadoop datanode started):

pig -x mapreduce

Page 23: Apache Pig: A big data processor

Grunt shell

Page 24: Apache Pig: A big data processor

Data Processing with PIG

Page 25: Apache Pig: A big data processor

Example: movies_data.csv

1,Dhadakebaz,1986,3.2,7560

2,Dhumdhadaka,1985,3.8,6300

3,Ashi hi banva banvi,1988,4.1,7802

4,Zapatlela,1993,3.7,6022

5,Ayatya Gharat Gharoba,1991,3.4,5420

6,Navra Maza Navsacha,2004,3.9,4904

7,De danadan,1987,3.4,5623

8,Gammat Jammat,1987,3.4,7563

9,Eka peksha ek,1990,3.2,6244

10,Pachhadlela,2004,3.1,6956

Page 26: Apache Pig: A big data processor

Load data

• $ pig -x local

• grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration)

• grunt> dump movies;

it displays the contents

Page 27: Apache Pig: A big data processor

Filter data

• grunt> movies_greater_than_35 = FILTER movies BY (float)rating > 3.5;

• grunt> dump movies_greater_than_35;

Page 28: Apache Pig: A big data processor

Store the results data

• grunt> store movies_greater_than_35 into 'my_movies';

• It stores the result in local file system directory named 'my_movies'.

Page 29: Apache Pig: A big data processor

Display the result

• Now display the result from local file system.

cat my_movies/part-m-00000

Page 30: Apache Pig: A big data processor

Load command

• The load command specified only the column names. We can modify the statement as follows to include the data type of the columns:

• grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id:int, name:chararray, year:int, rating:double, duration:int);

Page 31: Apache Pig: A big data processor

Check the filters

• List the movies that were released between 1950 and 1960

grunt> movies_between_90_95 = FILTER movies by year > 1990 and year < 1995;

• List the movies that start with the Alpahbet D

grunt> movies_starting_with_D = FILTER movies by name matches 'D.*';

• List the movies that have duration greater that 2 hours

grunt> movies_duration_2_hrs = FILTER movies by duration > 7200; 

Page 32: Apache Pig: A big data processor

Output

Movies between

1990 to 1995

Movies starts

With 'D'

Movies greater

Than 2 hours

Page 33: Apache Pig: A big data processor

Describe

• DESCRIBE The schema of a relation/alias can be viewed using the DESCRIBE command:

grunt> DESCRIBE movies;

movies: {id: int, name: chararray, year: int, rating: double, duration: int} 

Page 34: Apache Pig: A big data processor

Foreach

• FOREACH gives a simple way to apply transformations based on columns. Let’s understand this with an example.

• List the movie names its duration in minutes

grunt> movie_duration = FOREACH movies GENERATE name, (double)(duration/60);

• The above statement generates a new alias that has the list of movies and it duration in minutes.

• You can check the results using the DUMP command.

Page 35: Apache Pig: A big data processor

Output

Page 36: Apache Pig: A big data processor

Group

• The GROUP keyword is used to group fields in a relation.

• List the years and the number of movies released each year.

grunt> grouped_by_year = group movies by year;

grunt> count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(movies);

Page 37: Apache Pig: A big data processor

Output

Page 38: Apache Pig: A big data processor

Order by

• Let us question the data to illustrate the ORDER BY operation.

• List all the movies in the ascending order of year.

grunt> desc_movies_by_year = ORDER movies BY year ASC;

grunt> DUMP desc_movies_by_year;• List all the movies in the descending order of year.

grunt> asc_movies_by_year = ORDER movies by year DESC;

grunt> DUMP asc_movies_by_year;

Page 39: Apache Pig: A big data processor

Output- Ascending by year

From1985

To2004

Page 40: Apache Pig: A big data processor

Limit

• Use the LIMIT keyword to get only a limited number for results from relation.

grunt> top_5_movies = LIMIT movies 5;

grunt> DUMP top_10_movies;

Page 41: Apache Pig: A big data processor

Pig: Modes of Execution

• Pig programs can be run in three methods which work in both local and MapReduce mode. They are

– Script Mode

– Grunt Mode

– Embedded Mode

Page 42: Apache Pig: A big data processor

Script mode

• Script Mode or Batch Mode: In script mode, pig runs the commands specified in a script file. The following example shows how to run a pig programs from a script file:

$ vim scriptfile.pig

   A = LOAD 'script_file';

   DUMP A;

$ pig ­x local scriptfile.pig

Page 43: Apache Pig: A big data processor

Grunt mode

• Grunt Mode or Interactive Mode: The grunt mode can also be called as interactive mode. Grunt is pig's interactive shell. It is started when no file is specified for pig to run.

$ pig ­x local

grunt> A = LOAD 'grunt_file';

grunt> DUMP A;

• You can also run pig scripts from grunt using run and exec commands.

grunt> run scriptfile.pig

grunt> exec scriptfile.pig

Page 44: Apache Pig: A big data processor

Embedded mode

• You can embed pig programs in Java, python and ruby and can run from the same.

Page 45: Apache Pig: A big data processor

Example: Wordcount program

• Q) How to find the number of occurrences of the words in a file using the pig script?

• You can find the famous word count example written in map reduce programs in apache website. Here we will write a simple pig script for the word count problem.

• The pig script given in next slide finds the number of times a word repeated in a file:

Page 46: Apache Pig: A big data processor

Example: text file- shivneri.txt

Page 47: Apache Pig: A big data processor

Example: Wordcount program

lines = LOAD 'shivneri.txt' AS (line:chararray);

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;

grouped = GROUP words BY word;

w_count = FOREACH grouped GENERATE group, COUNT(words);

DUMP w_count;

forts.pig

Page 48: Apache Pig: A big data processor

Output snapshot

$ pig -x local forts.pig

Page 49: Apache Pig: A big data processor

References

• “Programming Pig” by Alan Gates, O'Reilly Publishers.

• “Pig Design Patterns” by Pradeep Pasupuleti, PACKT Publishing

• Tutorials Point

• http://github.com/rohitdens

• http://pig.apache.org

Page 50: Apache Pig: A big data processor

[email protected]

Thank you

This presentation is created using LibreOffice Impress 4.2.8.2, can be used freely as per GNU General Public License

Blogshttp://digitallocha.blogspot.inhttp://kyamputar.blogspot.in

Web Resourceshttp://tusharkute.com