

Page 1: Pig Latin Olston, Reed, Srivastava, Kumar, and Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008. Shahram Ghandeharizadeh

Pig Latin

Olston, Reed, Srivastava, Kumar, and Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.

Shahram Ghandeharizadeh

Computer Science Department

University of Southern California

Page 2

A Shared-Nothing Framework

Shared-nothing architecture consisting of thousands of nodes! A node is an off-the-shelf, commodity PC.

Google File System

Google’s Bigtable Data Model

Google’s Map/Reduce Framework

Yahoo’s Pig Latin

…

Page 3

Pig Latin

Supports read-only data analysis workloads that are scan-centric; no transactions!

Fully nested data model. Does not satisfy 1NF! By definition, it will violate the other normal forms.

Extensive support for user-defined functions (UDFs). UDFs are first-class citizens.

Manages plain input files without any schema information.

A novel debugging environment.

Page 4

Data Models

Conceptual

Logical

Physical

You are here! Relational data model, Relational Algebra, SQL

Page 5

Data Models

Conceptual

Logical

Physical

You are here! Nested data model, Pig Latin

Page 6

Why Nested Data Model?

Closer to how programmers think and more natural to them.

E.g., to capture information about the positional occurrences of terms in a collection of documents, a programmer may create a structure of the form Idx<documentId, Set<positions>> for each term.

Normalization of the data creates two tables:

Term_info: (TermId, termString, ….)

Pos_info: (TermId, documentId, position)

Obtain positional occurrences by joining these two tables on TermId and grouping on <TermId, documentId>.
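A minimal Python sketch of the contrast above, assuming tiny made-up tables. The names `term_info`, `pos_info`, and `nested_index` are illustrative and do not come from the paper. It rebuilds the nested Idx<documentId, Set<positions>> structure from the two normalized tables by joining on TermId and grouping on <TermId, documentId>.

```python
from collections import defaultdict

# Hypothetical normalized tables (illustrative data only).
term_info = [(1, "pig"), (2, "latin")]            # (TermId, termString)
pos_info = [                                      # (TermId, documentId, position)
    (1, "d1", 0), (1, "d1", 7), (1, "d2", 3), (2, "d1", 1),
]

def nested_index(term_info, pos_info):
    """Join on TermId, then group on (TermId, documentId)."""
    term_string = dict(term_info)
    index = defaultdict(lambda: defaultdict(set))
    for term_id, doc_id, position in pos_info:
        if term_id in term_string:                # join condition on TermId
            index[term_string[term_id]][doc_id].add(position)
    # Convert to plain dicts for easy inspection.
    return {term: dict(docs) for term, docs in index.items()}

idx = nested_index(term_info, pos_info)
```

With a nested model the programmer can store and pass around `idx` directly, instead of recomputing it from the flat tables each time.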

Page 7

Why Nested Data Model?

Data is often stored on disk in an inherently nested fashion. A web crawler might output, for each url, the set of outlinks from that url.

A nested data model justifies a new algebraic language!

Adoption by programmers because it is easier to write user-defined functions.

Page 8

Dataflow Language

User specifies a sequence of steps where each step specifies only a single, high-level data transformation. Similar to relational algebra and procedural; desirable for programmers.

With SQL, the user specifies a set of declarative constraints. Non-procedural and desirable for non-programmers.

Page 9

Dataflow Language: Example

A high-level program that specifies a query execution plan.

Example: For each sufficiently large category, retrieve the average pagerank of high-pagerank urls in that category.

SQL, assuming a table urls (url, category, pagerank):

    SELECT category, AVG(pagerank)
    FROM urls
    WHERE pagerank > 0.2
    GROUP BY category
    HAVING COUNT(*) > 1000000

Page 10

Dataflow Language: Example (Cont…)

A high-level program that specifies a query execution plan.

Example: For each sufficiently large category, retrieve the average pagerank of high-pagerank urls in that category.

Pig Latin:

    1. Good_urls = FILTER urls BY pagerank > 0.2;
    2. Groups = GROUP Good_urls BY category;
    3. Big_groups = FILTER Groups BY COUNT(Good_urls) > 1000000;
    4. Output = FOREACH Big_groups GENERATE category, AVG(Good_urls.pagerank);

Availability of schema is optional! Without a schema, columns are referenced using $0, $1, $2, …
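The four Pig Latin steps above can be mirrored in Python, one statement per step. This is a sketch of the semantics only, not of how Pig executes the program. The sample rows and the function name `avg_pagerank_by_category` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of urls: (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.1),
    ("d.com", "sports", 0.7),
]

def avg_pagerank_by_category(urls, min_rank=0.2, min_count=1_000_000):
    # Step 1: FILTER urls BY pagerank > min_rank
    good_urls = [row for row in urls if row[2] > min_rank]
    # Step 2: GROUP good_urls BY category
    groups = defaultdict(list)
    for row in good_urls:
        groups[row[1]].append(row)
    # Step 3: FILTER groups BY COUNT(good_urls) > min_count
    big_groups = {cat: rows for cat, rows in groups.items() if len(rows) > min_count}
    # Step 4: FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {cat: sum(r[2] for r in rows) / len(rows) for cat, rows in big_groups.items()}

# A threshold of 1 stands in for the 1,000,000 used on the slide.
result = avg_pagerank_by_category(urls, min_count=1)
```

Each intermediate name corresponds to one handle in the Pig Latin program, which is what makes the dataflow style read like a query execution plan.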

Page 11

Lazy Execution

Database-style optimization by lazy processing of expressions.

Example

Recall urls: (url, category, pagerank)

Set of urls of pages that are classified as spam and have a high pagerank score.

    1. Spam_urls = FILTER urls BY isSpam(url);
    2. Culprit_urls = FILTER Spam_urls BY pagerank > 0.8;

Optimized execution:

    1. HighRank_urls = FILTER urls BY pagerank > 0.8;
    2. Culprit_urls = FILTER HighRank_urls BY isSpam(url);
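A small Python sketch of why the reordering helps, assuming isSpam is an expensive UDF. Here `is_spam` is a stand-in with a call counter, and the rows are made up. Both orderings return the same tuples, but running the cheap pagerank filter first means the UDF is invoked on fewer tuples.

```python
# Hypothetical rows of urls: (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("spam1.biz", "ads", 0.95),
    ("spam2.biz", "ads", 0.1),
    ("b.com", "news", 0.3),
]

calls = {"isSpam": 0}

def is_spam(url):
    """Stand-in for an expensive UDF; the real isSpam is not shown in the slides."""
    calls["isSpam"] += 1
    return url.startswith("spam")

def culprits_as_written(urls):
    spam_urls = [r for r in urls if is_spam(r[0])]
    return [r for r in spam_urls if r[2] > 0.8]

def culprits_optimized(urls):
    high_rank_urls = [r for r in urls if r[2] > 0.8]   # cheap filter first
    return [r for r in high_rank_urls if is_spam(r[0])]

written = culprits_as_written(urls)
calls_written = calls["isSpam"]

calls["isSpam"] = 0
optimized = culprits_optimized(urls)
calls_optimized = calls["isSpam"]
```

On this sample the program as written calls the UDF once per input tuple, while the reordered plan calls it only on the tuples that survive the pagerank filter.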

Page 12

Quick Start/Interoperability

To process a file, the user provides a function that gives Pig the ability to parse the content of the file into records.

Output of a Pig program is formatted based on a user-defined function.

Why don't conventional DBMSs do the same? (They require importing data into system-managed tables.)

Page 13

Quick Start/Interoperability

To process a file, the user provides a function that gives Pig the ability to parse the content of the file into records.

Output of a Pig program is formatted based on a user-defined function.

Why don't conventional DBMSs do the same? (They require importing data into system-managed tables.)

To enable transactional consistency guarantees,

To enable efficient point lookups (RIDs),

To curate data on behalf of the user, and record the schema so that other users can make sense of the data.

Page 14

Pig

Page 15

Data Model

Consists of four types:

Atom: Contains a simple atomic value such as a string or a number, e.g., ‘Joe’.

Tuple: Sequence of fields, each of which might be any data type, e.g., (‘Joe’, ‘lakers’).

Bag: A collection of tuples with possible duplicates. Schema of a bag is flexible.

Map: A collection of data items, where each item has an associated key through which it can be looked up. Keys must be data atoms. Flexibility enables data to change without re-writing programs.
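One way to picture the four types is to map them onto Python built-ins. This mapping is an illustration only and not how Pig represents data internally; the sample values and the helper `is_atom` are made up.

```python
# Atom: a simple value such as a string or a number.
atom = "Joe"

# Tuple: a sequence of fields, each of which may be any data type.
tup = ("Joe", "lakers")

# Bag: a collection of tuples with possible duplicates. A list keeps the
# duplicates, and tuples in one bag need not share the same fields.
bag = [("lakers", 1), ("lakers", 1), ("iPod", ("music", "player"))]

# Map: items looked up by key; keys must be atoms, values may be any type.
mapping = {"fan of": [("lakers",), ("iPod",)], "age": 20}

def is_atom(value):
    """True for the simple values used as atoms in this sketch."""
    return isinstance(value, (str, int, float))

# A fully nested record: a tuple holding an atom, a bag, and a map.
record = (atom, bag, mapping)
```

The record at the end shows the nesting that 1NF forbids: a single field holds a whole collection.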

Page 16

A Comparison with Relational Algebra

Pig Latin: Everything is a bag. Dataflow language.

Relational Algebra: Everything is a table. Dataflow language.

Page 17

Expressions in Pig Latin

Page 18

Specifying Input Data

Use the LOAD command to specify the input data file.

Input file is query_log.txt.

Convert the input file into tuples using the myLoad deserializer.

Loaded tuples have 3 fields.

USING and AS clauses are optional. When USING is omitted, the default deserializer, which expects a plain-text, tab-delimited file, is used. When there is no schema, reference fields by position, e.g., $0.

Return value, assigned to “queries”, is a handle to a bag.

“queries” can be used as input to subsequent Pig Latin expressions.

Handles such as “queries” are logical. No data is actually read and no processing is carried out until the instruction that explicitly asks for output (STORE).

Think of it as a “logical view”.
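The idea of a logical handle can be sketched in Python with a generator: creating the handle touches no data, and reading happens only when the result is demanded. The functions `load`, `tab_delimited`, and `open_query_log` are hypothetical, and the file contents are made up; an in-memory buffer stands in for query_log.txt.

```python
import io

def tab_delimited(line):
    """Default-style deserializer: split a plain-text line on tabs."""
    return tuple(line.rstrip("\n").split("\t"))

def load(open_file, deserializer=tab_delimited):
    """Return a lazy handle; nothing is read until the handle is consumed."""
    def handle():
        with open_file() as f:
            for line in f:
                yield deserializer(line)
    return handle

reads = []

def open_query_log():
    # Stands in for opening query_log.txt; the contents are made up.
    reads.append("opened")
    return io.StringIO("alice\tlakers\t1\nbob\tiPod\t2\n")

queries = load(open_query_log)      # like LOAD: only a logical handle
reads_before_store = len(reads)     # no data has been touched yet

stored = list(queries())            # like STORE: forces evaluation
```

Because no schema is attached here, a field is picked by position, for example `stored[0][0]` plays the role of $0.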

Page 19

Per-tuple Processing

Iterate over members of a set using the FOREACH command.

expandQuery is a UDF that generates a bag of likely expansions of a given query string.

Semantics:

No dependence between the processing of different tuples of the input, which allows parallelism!

GENERATE can be followed by a list of any expressions from Table 1.
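A rough Python analogue of FOREACH ... GENERATE, assuming a toy version of the UDF. `expand_query` only appends fixed suffixes and is not the real expandQuery; `foreach` is a hypothetical helper. The point is that each output tuple depends on exactly one input tuple.

```python
def expand_query(query_string):
    """Hypothetical stand-in for the expandQuery UDF: returns a bag of expansions."""
    return [(query_string + " " + suffix,) for suffix in ("scores", "rumors")]

def foreach(bag, generate):
    """Apply generate to each tuple independently of all other tuples."""
    return [generate(t) for t in bag]

# Made-up tuples of the form (userId, queryString, timestamp).
queries = [("alice", "lakers", 1), ("bob", "iPod", 2)]

# Keep userId and replace queryString with the bag of its expansions.
expanded_queries = foreach(queries, lambda t: (t[0], expand_query(t[1])))
```

Since `generate` never looks at more than one tuple, the loop could be split across workers without changing the result, which is the source of the parallelism noted on the slide.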

Page 20

FOREACH & Flattening

To eliminate nesting in data, use FLATTEN.

FLATTEN consumes a bag, extracts the fields of the tuples in the bag, and makes them fields of the tuple being output by GENERATE, removing one level of nesting.

OUTPUT
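The effect of FLATTEN can be sketched in Python as follows. `flatten_field` is a hypothetical helper and the nested input is made up; each tuple inside the bag is spliced into a copy of the enclosing tuple, so one level of nesting disappears.

```python
def flatten_field(tup, index):
    """Emit one output tuple per tuple of the bag stored at position `index`."""
    before, bag, after = tup[:index], tup[index], tup[index + 1:]
    return [before + inner + after for inner in bag]

# Made-up nested tuples of the form (userId, bag of expanded queries).
nested = [
    ("alice", [("lakers scores",), ("lakers rumors",)]),
    ("bob", [("iPod nano",)]),
]

flat = [out for t in nested for out in flatten_field(t, 1)]
```

In this sketch a tuple whose bag is empty produces no output tuples at all.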

Page 21

FILTER

Discards unwanted data. Identical to the select operator of relational algebra.

Syntax: FILTER bag-id BY expression

Expression is:

field-name op constant

field-name op UDF

op might be ==, eq, !=, neq, <, >, <=, >=

A comparison operation may utilize boolean operators (AND, OR, NOT) with several expressions.
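A minimal Python sketch of the FILTER syntax above, assuming fields are addressed by position. The helpers `compare` and `filter_bag` and the sample tuples are illustrative. The table maps each operator listed on the slide to its Python counterpart.

```python
import operator

# Comparison operators listed on the slide, mapped to Python equivalents.
OPS = {
    "==": operator.eq, "eq": operator.eq,
    "!=": operator.ne, "neq": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}

def compare(field_index, op, constant):
    """Build a predicate of the form: field op constant."""
    return lambda t: OPS[op](t[field_index], constant)

def filter_bag(bag, predicate):
    """Keep only the tuples for which the predicate holds."""
    return [t for t in bag if predicate(t)]

# Made-up tuples of the form (userId, queryString, timestamp).
queries = [("alice", "lakers", 1), ("bot", "iPod", 2), ("bob", "news", 3)]

not_bot = compare(0, "neq", "bot")
recent = compare(2, ">=", 3)

real_queries = filter_bag(queries, not_bot)
# Boolean combination of two comparisons (AND).
recent_real = filter_bag(queries, lambda t: not_bot(t) and recent(t))
```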

Page 22

A Comparison with Relational Algebra

Pig Latin: Everything is a bag. Dataflow language. FILTER is the same as the Select operator.

Relational Algebra: Everything is a table. Dataflow language. Select operator is the same as the FILTER command.

Page 23

MAP part of MapReduce: Grouping related data

COGROUP groups together tuples from one or more data sets that are related in some way.

Example: Imagine two data sets:

Results contains, for different query strings, the urls shown as search results and the position at which they are shown.

Revenue contains, for different query strings and different ad slots, the average amount of revenue made by the ad for that query string at that slot.

For a queryString, group data together.

(queryString, adSlot, amount)

Page 24

COGROUP

The output of a COGROUP contains one tuple for each group.

The first field of the tuple, named group, is the group identifier.

Each of the next fields is a bag, one for each input being cogrouped, and is named the same as the alias of that input.
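The output shape described above can be sketched in Python. The `cogroup` helper and the sample rows are made up for illustration; each output tuple carries the group identifier followed by one bag per input, in the order the inputs were given.

```python
def cogroup(**inputs):
    """inputs maps alias -> (bag, key function).

    Returns one tuple per group: (group, bag_for_first_alias, bag_for_second_alias, ...).
    """
    groups = {}
    aliases = list(inputs)
    for position, alias in enumerate(aliases):
        bag, key = inputs[alias]
        for t in bag:
            # Create one empty bag per input the first time a group is seen.
            slot = groups.setdefault(key(t), [[] for _ in aliases])
            slot[position].append(t)
    return [(group, *bags) for group, bags in groups.items()]

# Made-up data: results (queryString, url, position), revenue (queryString, adSlot, amount).
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

grouped_data = cogroup(
    results=(results, lambda t: t[0]),
    revenue=(revenue, lambda t: t[0]),
)
```

Note that the tuples from each input stay in their own bag; nothing is paired up across inputs at this stage.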

Page 25

COGROUP

Grouping can be performed according to arbitrary expressions, which may include UDFs.

Grouping is different from “Join”.

Page 26

COGROUP is not JOIN

Assign search revenue to search-result urls to figure out the monetary worth of each url. A UDF, distributeRevenue, attributes revenue from the top slot entirely to the first search result, while the revenue from the side slot may be attributed equally to all the results.
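A hedged Python sketch of a UDF like the one described. The body of `distribute_revenue` is an assumption based only on the sentence above (the slot names "top" and "side" and the sample values are made up), and it assumes the results bag is non-empty. It needs the two bags of a group intact, which is what COGROUP provides and a JOIN would not.

```python
def distribute_revenue(results, revenue):
    """Hypothetical distributeRevenue: top-slot revenue goes to the first search
    result; revenue from any other slot is split equally among all results."""
    ranked = sorted(results, key=lambda t: t[2])          # order by position
    share = {url: 0.0 for _, url, _ in ranked}
    for _, ad_slot, amount in revenue:
        if ad_slot == "top":
            share[ranked[0][1]] += amount
        else:
            for _, url, _ in ranked:
                share[url] += amount / len(ranked)
    return list(share.items())                            # bag of (url, amount)

# One cogrouped tuple: (group, results bag, revenue bag), with made-up values.
grouped = (
    "lakers",
    [("lakers", "nba.com", 1), ("lakers", "espn.com", 2)],
    [("lakers", "top", 50), ("lakers", "side", 20)],
)

url_revenues = distribute_revenue(grouped[1], grouped[2])
```

A join would hand the UDF one (result, revenue) pair at a time, so it could not tell which result is first or how many results share the side slot.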

Page 27

WITH JOIN

Page 28

GROUP

A special case of COGROUP when there is only one data set involved.

Example: Find the total revenue for each query string.
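The total-revenue example can be sketched in Python as a single-input grouping followed by a per-group sum. The `group` helper and the sample rows are illustrative, with revenue tuples of the form (queryString, adSlot, amount).

```python
from collections import defaultdict

# Made-up revenue tuples: (queryString, adSlot, amount).
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

def group(bag, key):
    """GROUP as COGROUP over a single input; returns (group, bag) tuples."""
    groups = defaultdict(list)
    for t in bag:
        groups[key(t)].append(t)
    return list(groups.items())

grouped_revenue = group(revenue, key=lambda t: t[0])

# For each group, emit the query string and the sum of its amounts.
query_revenues = [
    (query, sum(amount for _, _, amount in bag)) for query, bag in grouped_revenue
]
```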

Page 29

JOIN

Pig Latin supports equi-joins.

Implemented using COGROUP.
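One way to see how an equi-join falls out of COGROUP is to cogroup both inputs on the join key and then flatten both bags of each group, which yields their cross product. The Python below is a sketch of that idea with made-up data; `equi_join` is a hypothetical helper, not Pig's implementation.

```python
from itertools import product

def equi_join(left, right, left_key, right_key):
    """Equi-join expressed as cogroup followed by flattening both bags."""
    groups = {}
    for t in left:
        groups.setdefault(left_key(t), ([], []))[0].append(t)
    for t in right:
        groups.setdefault(right_key(t), ([], []))[1].append(t)
    # Flattening two bags yields their cross product, so a group with an
    # empty bag on either side contributes no output tuples.
    return [
        lt + rt
        for left_bag, right_bag in groups.values()
        for lt, rt in product(left_bag, right_bag)
    ]

results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20)]

join_result = equi_join(results, revenue, lambda t: t[0], lambda t: t[0])
```

Here "kings" has no revenue tuples, so it drops out of the result, as expected for an inner equi-join.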

Page 30

MapReduce in Pig Latin

A map function operates on one input tuple at a time, and outputs a bag of key-value pairs.

The reduce function operates on all values for a key at a time to produce the final results.
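The two sentences above translate into three dataflow steps: a per-tuple step that flattens the map output, a grouping on the key, and a per-group reduce step. The Python below is a sketch of that structure; `map_reduce` is a hypothetical helper, and word count is used as an illustrative job that does not come from the slides.

```python
from collections import defaultdict

def map_reduce(input_bag, map_fn, reduce_fn):
    # Per-tuple step: apply map and flatten the resulting bags of key-value pairs.
    map_result = [pair for t in input_bag for pair in map_fn(t)]
    # Grouping step: collect all values that share a key.
    key_groups = defaultdict(list)
    for key, value in map_result:
        key_groups[key].append(value)
    # Per-group step: reduce sees all values for one key at a time.
    return [reduce_fn(key, values) for key, values in key_groups.items()]

# Word count as an illustrative job (made-up input).
lines = [("pig latin",), ("pig",)]
word_count = map_reduce(
    lines,
    map_fn=lambda t: [(word, 1) for word in t[0].split()],
    reduce_fn=lambda key, values: (key, sum(values)),
)
```

In Pig Latin terms, the first and last steps correspond to FOREACH with UDFs and the middle step to GROUP.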

Page 31

MapReduce Plan Compilation

Map tasks assign keys for grouping, and the reduce tasks process a group at a time.

Compiler: Converts each (CO)GROUP command in the logical plan into a distinct MapReduce job consisting of its own MAP and REDUCE functions.

Page 32

Debugging Environment

Iterative process for programming. Sandbox data set generated automatically to show results for the expressions.