pig latin - ku leuvenbettina.berendt/... · pig latin includes a small set of carefully chosen...

Pig Latin

Dominique Fonteyn, Wim Leers

Universiteit Hasselt

Pig Latin ...

... is an English word game in which we place the first letter of a word at the end and add the suffix -ay.

Pig Latin becomes igpay atinlay

banana becomes anana-bay

What does this have to do with computer science?

Will the real Pig Latin please stand up?

Pig Latin is a language developed by Yahoo!, designed for ad-hoc data analysis.

Combination of

high-level declarative querying (SQL style)

low-level procedural programming (map-reduce)

First example

Find the average pagerank of high-pagerank URLs for each sufficiently large category in a table urls (url, category, pagerank).

SELECT category, AVG(pagerank)

FROM urls WHERE pagerank > 0.2

GROUP BY category HAVING COUNT(*) > 1000000

First example (2)

Find the average pagerank of high-pagerank URLs for each sufficiently large category in a table urls (url, category, pagerank).

PIG LATIN:

good_urls = FILTER urls BY pagerank > 0.2;

groups = GROUP good_urls BY category;

big_groups = FILTER groups BY COUNT(good_urls) > 1000000;

output = FOREACH big_groups GENERATE category,

AVG(good_urls.pagerank);
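To make the four steps concrete, here is a minimal Python sketch that mimics them on in-memory rows. It is not Pig code: the sample rows and the tiny MIN_GROUP_SIZE threshold (standing in for the large category-size threshold) are assumptions made purely for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows in the shape (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.1),
    ("d.com", "blogs", 0.8),
]
MIN_GROUP_SIZE = 1  # assumption: stands in for the large threshold on the slide

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [row for row in urls if row[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for row in good_urls:
    groups[row[1]].append(row)

# big_groups = FILTER groups BY COUNT(good_urls) > threshold;
big_groups = {cat: rows for cat, rows in groups.items() if len(rows) > MIN_GROUP_SIZE}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = [(cat, mean(r[2] for r in rows)) for cat, rows in big_groups.items()]
```

Each Python statement corresponds to exactly one Pig Latin step, which is the point of the dataflow style: every intermediate result has a name and can be inspected.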

First example (3)

Pig Latin programs are sequences of steps

Each step carries out a single data transformation

Transformations are fairly high-level

e.g. filtering, grouping, aggregation

low-level manipulations are unnecessary

Writing a Pig Latin program is similar to specifying a query execution plan, which makes it easier for programmers to understand and control how their data is processed.

Presentation Overview

1 Features and Motivation

2 Pig Latin, the Language

3 Implementation

4 Practical Notes

5 Copresentation


Dataflow Language

Pig Latin is a high-level dataflow language. The user specifies a sequence of steps. Each step performs only a single, high-level data transformation.

The operations need not be executed in the order of that sequence.

Using high-level, relational-algebra-style primitives such as GROUP and FILTER allows traditional database optimizations.

Dataflow Language (2)

Find the URLs of all pages that are classified as spam, but have a high pagerank.

spam_urls = FILTER urls BY isSpam(url);

culprit_urls = FILTER spam_urls BY pagerank > 0.8;

isSpam() is a user-defined function and may be expensive

calling it on every URL first is not the most efficient method

Dataflow Language (3)

More efficient would be:

Find the URLs of all pages that are classified as spam, but have a high pagerank.

high_pagerank_urls = FILTER urls BY pagerank > 0.8;

culprit_urls = FILTER high_pagerank_urls BY isSpam(url);

1 get all high-pagerank pages first

2 invoke isSpam() only on these high-pagerank pages

This optimization can be done automatically by the system.
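The effect of the reordering can be sketched in Python by counting how often the expensive function is called. The is_spam() stand-in, the call_log counter and the sample rows are hypothetical; the sketch only illustrates why applying the cheap filter first saves UDF invocations, not how Pig performs the rewrite.

```python
call_log = []  # records every URL passed to the (pretend) expensive UDF


def is_spam(url):
    """Hypothetical stand-in for an expensive user-defined function."""
    call_log.append(url)
    return url.startswith("spam")


# Hypothetical rows in the shape (url, pagerank).
urls = [("spam-a.com", 0.9), ("spam-b.com", 0.1), ("ok-c.com", 0.95), ("ok-d.com", 0.2)]


def udf_first(rows):
    # Order as written by the user: isSpam() runs on every row.
    spam_urls = [r for r in rows if is_spam(r[0])]
    return [r for r in spam_urls if r[1] > 0.8]


def pagerank_first(rows):
    # Reordered plan: the cheap comparison prunes rows before the UDF runs.
    high = [r for r in rows if r[1] > 0.8]
    return [r for r in high if is_spam(r[0])]


call_log.clear()
culprits_a = udf_first(urls)
calls_a = len(call_log)

call_log.clear()
culprits_b = pagerank_first(urls)
calls_b = len(call_log)
```

Both orderings return the same rows, but on this sample the reordered plan calls is_spam() on two rows instead of four.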

Quick Start and Interoperability

Pig Latin is designed to support ad-hoc data analysis.

queries can be run directly over data files

the user must provide a function to parse the file content into tuples

Output works similarly: the user provides a function that serializes tuples.

Quick Start and Interoperability (2)

Stored schemas are strictly optional.

Schema information can be provided on the fly, or even not at all.

Because ... PIGS EAT ANYTHING!

Nested Data Model

Programmers often think in terms of nested data structures.

Example: capture information about each pig in a collection of pig farms.

Map<pigFarmId, Set<pig>>

Nested Data Model (2)

Databases allow only flat tables, i.e., columns are atomic fields.

pig_farms: (pigFarmId, pigFarmName, ...)

pigs: (pigId, pigName, ...)

pig_info: (pigFarmId, pigId)

Nested Data Model (3)

Pig Latin offers a flexible, fully nested data model and allows complex, non-atomic data types as fields of a table.

Some reasons for having a nested data model:

closer to how programmers think, and thus much more natural to them than normalization

allows programmers to easily write a rich set of user-defined functions

UDFs as First-Class Citizens

Custom processing is a significant part of analysing data.

Pig Latin has extensive support for user-defined functions (UDFs). All aspects of Pig Latin processing can be customized through the use of UDFs.

Input and output of UDFs in Pig Latin follow the nested data model. A UDF can take non-atomic parameters as input, and also output non-atomic values.

UDFs as First-Class Citizens (2)

Example: Find the top 10 URLs according to pagerank for each category.

groups = GROUP urls BY category;

output = FOREACH groups GENERATE category,

top10(urls);

Here, top10() is a UDF that accepts a set of URLs, and outputs a set containing the top 10 URLs by pagerank for that group.

The final output contains non-atomic fields: there is a tuple for each category, and one of the fields is the set of top 10 URLs.
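As a rough illustration of a UDF that takes a bag and returns a bag, the following Python sketch groups rows by category and keeps the best k per group. The top_k() helper, the sample rows and the choice of k=2 (instead of 10, to keep the sample small) are assumptions for illustration, not the actual top10() implementation.

```python
from collections import defaultdict
import heapq


def top_k(rows, k=10):
    """Hypothetical UDF: takes a bag of (url, category, pagerank) tuples
    and returns a bag with the k highest-pagerank tuples."""
    return heapq.nlargest(k, rows, key=lambda r: r[2])


# Hypothetical sample rows in the shape (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.7),
    ("d.com", "blogs", 0.3),
]

# groups = GROUP urls BY category;
groups = defaultdict(list)
for row in urls:
    groups[row[1]].append(row)

# output = FOREACH groups GENERATE category, top10(urls);
output = [(category, top_k(rows, k=2)) for category, rows in groups.items()]
```

Each output tuple holds an atomic field (the category) and a non-atomic field (the list of top rows), mirroring the nested result described above.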

UDFs as First-Class Citizens (3)

Practical notes

UDFs are written in Java.

Yahoo! is building support for other languages, including C/C++, Perl (Erlpay) and Python (Ythonpay).

Parallelism Required

Processing web-scale data requires parallelism.

Pig Latin includes a small set of carefully chosen primitives that can be easily parallelized.

Other primitives that do not lend themselves to efficient parallel evaluation have been deliberately excluded.

They can still be carried out by UDFs. The user is then responsible for how efficient their programs are and whether they will be parallelized.

Debugging Environment

Getting a data processing program right usually takes many iterations. With web-scale data, a single iteration can take many minutes or hours. The usual run-debug-run cycle can be very slow and inefficient.

Pig comes with a novel interactive debugging environment that generates concise example data tables illustrating the output of each step of the user's program.

Debugging Environment (2)


Data Model

Pig uses a rich, yet simple data model consisting of 4 types: atom, tuple, bag and map.

Data Model (3)

Specifying Input Data

The first step is to specify what the input data files are, and how the file contents are to be deserialized. We use the LOAD command.

We assume the input file is a bag, i.e., it contains a sequence of tuples.

Specifying Input Data (2)

queries = LOAD 'query_log.txt'

USING myLoad()

AS (userId, queryString, timestamp);

the input file is query_log.txt

the input is converted into tuples using a custom myLoad deserializer

the loaded tuples have 3 fields named userId, queryString and timestamp

Specifying Input Data (3)

queries = LOAD 'query_log.txt'

USING myLoad()

AS (userId, queryString, timestamp);

Both the USING and AS clauses are optional.

If no deserializer is specified, Pig uses a default one that expects a plain-text, tab-delimited file.

If no schema is specified, fields must be referred to by position instead of by name. For readability it is desirable to include schemas.
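The roles of the two optional clauses can be sketched in Python. The load() function below is a hypothetical model, not Pig's loader: its deserializer parameter plays the part of USING, its schema parameter plays the part of AS, and the default parser assumes one tab-delimited tuple per line.

```python
def load(lines, deserializer=None, schema=None):
    """Sketch of LOAD: `deserializer` models USING, `schema` models AS."""
    if deserializer is None:
        # Assumed default: plain text, one tuple per line, tab-delimited.
        deserializer = lambda line: tuple(line.rstrip("\n").split("\t"))
    tuples = [deserializer(line) for line in lines]
    if schema is None:
        return tuples  # no schema: fields can only be addressed by position
    # With a schema, every field can also be addressed by name.
    return [dict(zip(schema, t)) for t in tuples]


# Hypothetical content of a query log, one tuple per line.
log_lines = ["alice\tlakers\t1\n", "bot\tipod\t2\n"]

by_position = load(log_lines)
by_name = load(log_lines, schema=("userId", "queryString", "timestamp"))
```

Without a schema the query string is reached as by_position[0][1]; with one it is by_name[0]["queryString"], which is the readability argument made above.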

Per-tuple Processing

The FOREACH command applies some processing to each tuple of a data set.

expanded_queries = FOREACH queries

GENERATE userId,

expandQuery(queryString);

Each tuple of the bag queries is processed independently to produce an output tuple.

The first field is the userId field of the input tuple.

The second field is the result of applying the UDF expandQuery() to the queryString field of the input tuple.

Per-tuple Processing (2)

The GENERATE clause can be followed by a list of expressions. A common expression type is flattening. The FLATTEN keyword eliminates nesting by extracting the fields of the tuples in the bag and making them fields of the tuple being output by GENERATE. This removes one level of nesting.

expanded_queries = FOREACH queries

GENERATE userId,

FLATTEN(expandQuery(queryString));
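The difference between the nested and the flattened form can be sketched in Python. The expand_query() function and the sample rows are hypothetical; the real expandQuery() UDF is not specified in these slides.

```python
def expand_query(query_string):
    """Hypothetical UDF: returns a bag (here a list) of expanded queries."""
    return [query_string, query_string + " news"]


# Hypothetical rows in the shape (userId, queryString, timestamp).
queries = [("alice", "lakers", 1), ("bob", "ipod", 2)]

# Without FLATTEN: one output tuple per input tuple, second field is a bag.
nested = [(user_id, expand_query(q)) for user_id, q, _ts in queries]

# With FLATTEN: one output tuple per element of the bag, no nesting left.
flattened = [
    (user_id, expanded)
    for user_id, q, _ts in queries
    for expanded in expand_query(q)
]
```

With two input tuples and two expansions each, the nested form has two tuples and the flattened form has four.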

Discarding Unwanted Data

The FILTER command discards all data that is not of interest.

Example: Get rid of bot traffic.

real_queries = FILTER queries BY userId neq 'bot';

comparison operators:

==, !=, <, >, ... (numbers)

eq, neq (strings)

logical operators: AND, OR, NOT

Discarding Unwanted Data (2)

We can use UDFs as well.

Example: Get rid of bot traffic.

real_queries = FILTER queries BY NOT isBot(userId);

Getting Related Data Together

It is often necessary to group together related tuples from one or more data sets. This is done with the COGROUP command.

Example: we have 2 data sets specified

results: (queryString, url, position)

revenue: (queryString, adSlot, amount)

Getting Related Data Together (2)

Example: group together all search result data and revenue data for the same query string

grouped_data = COGROUP results BY queryString,

revenue BY queryString;

Output: grouped_data: (group, results, revenue)

the first field is the group identifier, the value of queryString

each subsequent field is a bag, one for each input being cogrouped, and is named after the alias of that input

Example: join all search result data and revenue data for the same query string

join_result = JOIN results BY queryString,

revenue BY queryString;

How does this differ from COGROUP?

When there is only one data set, we use GROUP.

grouped_revenue = GROUP revenue BY queryString;

Getting Related Data Together - Summarized

When there is one data set: GROUP

When there are two or more data sets: COGROUP

JOIN equals a COGROUP followed by FLATTEN
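The relationship between COGROUP and JOIN can be sketched in Python. The cogroup() helper and the sample rows are assumptions for illustration; the sketch shows that flattening both bags of each cogrouped tuple (a cross product within the group) yields the join result.

```python
from collections import defaultdict
from itertools import product

# Hypothetical rows: results (queryString, url, position),
# revenue (queryString, adSlot, amount).
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]


def cogroup(left, right, key=lambda row: row[0]):
    """Return (group, left_bag, right_bag) tuples, one per distinct key."""
    grouped = defaultdict(lambda: ([], []))
    for row in left:
        grouped[key(row)][0].append(row)
    for row in right:
        grouped[key(row)][1].append(row)
    return [(k, lbag, rbag) for k, (lbag, rbag) in grouped.items()]


# grouped_data = COGROUP results BY queryString, revenue BY queryString;
grouped_data = cogroup(results, revenue)

# JOIN = COGROUP followed by FLATTEN of both bags.
join_result = [
    l + r
    for _group, lbag, rbag in grouped_data
    for l, r in product(lbag, rbag)
]
```

COGROUP keeps the two bags per key nested and separate, which lets later steps process them with arbitrary functions; JOIN commits to the cross product within each group.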

Map-Reduce in Pig Latin

The GROUP and FOREACH statements allow us to express a map-reduce program.

map_result = FOREACH input GENERATE FLATTEN(map(*));

key_groups = GROUP map_result BY $0;

output = FOREACH key_groups GENERATE reduce(*);
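The three statements above can be mirrored in Python with a word count, a common map-reduce example. The map_fn() and reduce_fn() functions and the input records are hypothetical placeholders for the map() and reduce() UDFs, which these slides do not define.

```python
from collections import defaultdict


def map_fn(record):
    """Hypothetical map UDF: emits a bag of (key, value) pairs per record."""
    return [(word, 1) for word in record.split()]


def reduce_fn(key, bag):
    """Hypothetical reduce UDF: sums the values of one key group."""
    return (key, sum(value for _key, value in bag))


input_records = ["pig latin", "pig pen"]

# map_result = FOREACH input GENERATE FLATTEN(map(*));
map_result = [pair for record in input_records for pair in map_fn(record)]

# key_groups = GROUP map_result BY $0;
key_groups = defaultdict(list)
for pair in map_result:
    key_groups[pair[0]].append(pair)

# output = FOREACH key_groups GENERATE reduce(*);
output = [reduce_fn(key, bag) for key, bag in key_groups.items()]
```

The grouping on the first field ($0) plays the role of the shuffle phase between map and reduce.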

Other Commands

Other commands include:

DISTINCT

Nested Operations

Each command operates over one or more bags or tuples as input.

When we have nested bags within tuples, we can nest some commands within a FOREACH command.

grouped_revenue = GROUP revenue BY queryString;

query_revenue = FOREACH grouped_revenue {

top_slot = FILTER revenue

BY adSlot eq 'top';

GENERATE queryString,

SUM(top_slot.amount),

SUM(revenue.amount);

};
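A Python sketch of the same nested operation: for each group, the inner FILTER runs on that group's bag only, and both sums are computed per group. The sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows in the shape (queryString, adSlot, amount).
revenue = [
    ("lakers", "top", 50),
    ("lakers", "side", 20),
    ("kings", "top", 30),
    ("kings", "top", 5),
]

# grouped_revenue = GROUP revenue BY queryString;
grouped_revenue = defaultdict(list)
for row in revenue:
    grouped_revenue[row[0]].append(row)

# query_revenue = FOREACH grouped_revenue { ... };
query_revenue = []
for query_string, bag in grouped_revenue.items():
    # Nested FILTER: applies to the bag of the current group only.
    top_slot = [r for r in bag if r[1] == "top"]
    query_revenue.append(
        (query_string, sum(r[2] for r in top_slot), sum(r[2] for r in bag))
    )
```

Each output tuple contains the query string, the revenue from the top slot, and the total revenue for that query.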

Asking for Output

Write results to a file with STORE

STORE query_revenue INTO 'myoutput' USING myStore();


Implementation

Pig Latin is implemented by the Pig system.

Programs are compiled into map-reduce jobs and executed by Hadoop.

It is an open source project in the Apache incubator.

Building a Logical Plan

The Pig interpreter first parses the Pig Latin commands and verifies that the referred input files and bags are valid.

e.g. when entering c = COGROUP a BY ..., b BY ..., Pig verifies that a and b are already defined

It builds a logical plan for each defined bag.

Building a Logical Plan (2)

When defining a new bag, the logical plan is constructed by combining the logical plans for the input bags and the current command.

e.g. when entering c = COGROUP a BY ..., b BY ...,

the logical plan for c consists of a cogroup command with the plans for a and b as input.

Building a Logical Plan (3)

When the logical plans are constructed, no processing is carried out.

Processing is only triggered when invoking a STORE command. Then the logical plan is compiled into a physical plan and executed.

This lazy style of execution permits in-memory pipelining and other optimizations.
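The lazy behaviour can be sketched in Python with a tiny plan-building model. The PlanNode class and the load(), filter_by() and store() helpers are hypothetical and far simpler than Pig's real logical and physical plans; the sketch only shows that defining bags runs nothing and that STORE triggers execution of the whole chain.

```python
class PlanNode:
    """Hypothetical logical-plan node: it records an operation but runs nothing."""

    def __init__(self, op, fn, inputs=()):
        self.op = op          # command name, e.g. "LOAD" or "FILTER"
        self.fn = fn          # work to perform when the plan is executed
        self.inputs = inputs  # plans of the input bags

    def execute(self, log):
        # Execute the input plans first, then this node's own command.
        args = [node.execute(log) for node in self.inputs]
        log.append(self.op)
        return self.fn(*args)


def load(rows):
    return PlanNode("LOAD", lambda: list(rows))


def filter_by(bag, predicate):
    return PlanNode("FILTER", lambda rows: [r for r in rows if predicate(r)], (bag,))


def store(bag, log):
    # Only STORE triggers execution of the accumulated plan.
    return bag.execute(log)


execution_log = []
urls = load([("a.com", 0.9), ("b.com", 0.1)])
good_urls = filter_by(urls, lambda r: r[1] > 0.2)
steps_before_store = list(execution_log)  # still empty: nothing has run yet
stored = store(good_urls, execution_log)
```

Because the full plan is known before anything runs, a system built this way has the chance to reorder or pipeline steps before execution.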

Map-Reduce Plan Compilation

Map-reduce provides the ability to do a large-scale group by

the map tasks assign keys for grouping

the reduce tasks process a group at a time


Practical Notes

More information can be found at pig.apache.org.

Pig is a project under active development. New features are to be added:

safe optimizer

user interfaces

external functions

unified environment
