EEDC Apache Pig Language

Execution Environments for Distributed Computing (EEDC 34330)
Apache Pig
Master in Computer Architecture, Networks and Systems - CANS
Homework number: 3
Group number: EEDC-3
Group members:
Javier Álvarez – [email protected]
Francesc Lordan – [email protected]
Roger Rafanell – [email protected]

Upload: roger-rafanell-mas

Post on 17-Jun-2015

Outline

1.- Introduction

2.- Pig Latin

2.1.- Data model

2.2.- Relational commands

3.- Implementation

4.- Conclusions

Part 1: Introduction

Why Apache Pig?

Today's Internet companies need to process huge data sets:

– Parallel databases can be prohibitively expensive at this scale.

– Programmers tend to find declarative languages such as SQL very unnatural.

– Other approaches such as map-reduce are low-level and rigid.

What is Apache Pig?

A platform for analyzing large data sets that:

– Is based on Pig Latin, a language that lies between declarative (SQL) and procedural (C++) programming styles.

– At the same time, enables the construction of programs with an easily parallelizable structure.

Which features does it have?

Dataflow language:
– Data processing is expressed step by step.

Quick start & interoperability:
– Pig can work over any kind of input and produce any kind of output.

Nested data model:
– Pig works with complex types like tuples, bags, ...

User Defined Functions (UDFs):
– Potentially in any programming language (only Java for the moment).

Only parallel:
– Pig Latin forces the use of directives that can be parallelized directly.

Debugging environment:
– Debugging at programming time.

Part 2: Pig Latin

Section 2.1: Data model

A rich data model consisting of four data types:

Atom: a simple atomic value such as a string or a number.
    'Alice'

Tuple: a sequence of fields, each of any data type.
    ('Alice', 'Apple')
    ('Alice', ('Barça', 'football'))

Bag: a collection of tuples with possible duplicates.
    { ('Alice', 'Apple'),
      ('Alice', ('Barça', 'football')) }

Map: a collection of data items, each with an associated key (always an atom).
    [ 'Fan of' → { ('Apple'), ('Barça', 'football') },
      'Age' → '20' ]
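The four types can be mirrored with ordinary Python containers. This is only an illustrative sketch, not Pig code: it assumes a list is an acceptable stand-in for a bag (it keeps duplicates) and a dict for a map, and the variable names are invented for the example.

```python
# Plain-Python stand-ins for the four Pig Latin data types (illustrative only).

# Atom: a single simple value.
atom = "Alice"

# Tuple: an ordered sequence of fields; a field may itself be a tuple.
tuple_flat = ("Alice", "Apple")
tuple_nested = ("Alice", ("Barça", "football"))

# Bag: a collection of tuples. A list is used because, unlike a set,
# it keeps duplicates.
bag = [tuple_flat, tuple_nested, tuple_flat]

# Map: keys are atoms, values can be of any type (here a bag and an atom).
record = {
    "Fan of": [("Apple",), ("Barça", "football")],
    "Age": "20",
}

duplicates = bag.count(tuple_flat)      # how many times the same tuple occurs
second_interest = record["Fan of"][1]   # look up a bag by key, then index it
print(duplicates, second_interest)
```

Nesting is the point of the model: a map value can be a bag, and a tuple field can be another tuple, so related data does not have to be flattened into separate tables.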

Section 2.2: Relational commands

visits = LOAD 'visits.txt' AS (user, url, time);

pages = LOAD 'pages.txt' AS (url, rank);

visits: ('Amy', 'cnn.com', '8am')
        ('Amy', 'nytimes.com', '9am')
        ('Bob', 'elmundotoday.com', '11am')

pages: ('cnn.com', '0.8')
       ('nytimes.com', '0.6')
       ('elmundotoday.com', '0.2')

vp = JOIN visits BY url, pages BY url;

vp: ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8')
    ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6')
    ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2')
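The effect of the JOIN can be emulated in a few lines of Python. This is a sketch of the semantics only, written as a simple hash join; it says nothing about how Pig executes the command. The helper `join_by` is a hypothetical name, fields are addressed by position, and the page key is assumed to be 'elmundotoday.com' so that it matches the visit.

```python
from collections import defaultdict

visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "nytimes.com", "9am"),
    ("Bob", "elmundotoday.com", "11am"),
]
pages = [
    ("cnn.com", "0.8"),
    ("nytimes.com", "0.6"),
    ("elmundotoday.com", "0.2"),
]


def join_by(left, left_key, right, right_key):
    """Equi-join: concatenate every pair of tuples whose key fields match."""
    # Index the right-hand side by its key field.
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Probe the index with each left-hand tuple.
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Field 1 of a visit and field 0 of a page both hold the url.
vp = join_by(visits, 1, pages, 0)
for row in vp:
    print(row)
```

Each output tuple carries the fields of both inputs, which is why the url appears twice.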

users = GROUP vp BY user;

users: ('Amy', { ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8'),
                 ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6') })
       ('Bob', { ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2') })
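GROUP produces one tuple per key, and its second field is a bag holding every input tuple with that key. A minimal Python sketch of that shape follows; `group_by` is a hypothetical helper, and the joined tuples are written out by hand so the example is self-contained.

```python
from collections import defaultdict

# Joined tuples: (user, url, time, url, rank).
vp = [
    ("Amy", "cnn.com", "8am", "cnn.com", "0.8"),
    ("Amy", "nytimes.com", "9am", "nytimes.com", "0.6"),
    ("Bob", "elmundotoday.com", "11am", "elmundotoday.com", "0.2"),
]


def group_by(rows, key):
    """Return (key value, bag of rows) pairs, one per distinct key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # Dicts preserve insertion order, so keys come out in first-seen order.
    return list(groups.items())


users = group_by(vp, 0)  # field 0 is the user
for name, bag in users:
    print(name, len(bag), "tuple(s)")
```

The rows are not summarized at this point: the whole tuples stay inside the bag, so a later step can still aggregate over any of their fields.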

useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;

useravg: ('Amy', 0.7)
         ('Bob', 0.2)

answer = FILTER useravg BY avgpr > 0.5;

answer: ('Amy', 0.7)
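The last two steps, averaging the rank inside each group and keeping only the users above a threshold, can be sketched in Python as follows. It assumes the ranks arrive as text and must be converted to numbers before averaging; `RANK` and `average` are names invented for the example.

```python
# Grouped input: (user, bag of joined tuples (user, url, time, url, rank)).
users = [
    ("Amy", [
        ("Amy", "cnn.com", "8am", "cnn.com", "0.8"),
        ("Amy", "nytimes.com", "9am", "nytimes.com", "0.6"),
    ]),
    ("Bob", [
        ("Bob", "elmundotoday.com", "11am", "elmundotoday.com", "0.2"),
    ]),
]

RANK = 4  # position of the rank field inside a joined tuple


def average(values):
    """Arithmetic mean of values that may be given as text."""
    numbers = [float(v) for v in values]
    return sum(numbers) / len(numbers)


# FOREACH ... GENERATE group, AVG(rank): one output tuple per group.
useravg = [(group, average(row[RANK] for row in bag)) for group, bag in users]

# FILTER ... BY avgpr > 0.5: a numeric comparison, not a string comparison.
answer = [(user, avgpr) for user, avgpr in useravg if avgpr > 0.5]
print(answer)
```

Comparing against the number 0.5 rather than the text '0.5' matters: a text comparison would order values character by character instead of by magnitude.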

Other relational operators:

– STORE: exports data into a file.
    STORE var1_name INTO 'output.txt';

– COGROUP: groups together tuples from different data sets.
    COGROUP var1_name BY field_id, var2_name BY field_id;

– UNION: computes the union of two variables.

– CROSS: computes the cross product.

– ORDER: sorts a data set by one or more fields.

– DISTINCT: removes duplicate tuples from a data set.
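COGROUP differs from JOIN in that it does not pair tuples up: for each key it returns one bag per input. The Python sketch below illustrates that shape under stated assumptions: `cogroup` is a hypothetical helper, fields are addressed by position, and 'example.org' is an invented page with no visits, added to show that a key present in only one input still yields a group.

```python
visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "nytimes.com", "9am"),
]
pages = [
    ("cnn.com", "0.8"),
    ("nytimes.com", "0.6"),
    ("example.org", "0.1"),  # hypothetical page with no visits
]


def cogroup(left, left_key, right, right_key):
    """For each key, return (key, bag of left tuples, bag of right tuples)."""
    groups = {}
    for row in left:
        groups.setdefault(row[left_key], ([], []))[0].append(row)
    for row in right:
        groups.setdefault(row[right_key], ([], []))[1].append(row)
    return [(key, l, r) for key, (l, r) in groups.items()]


grouped = cogroup(visits, 1, pages, 0)
for key, visit_bag, page_bag in grouped:
    print(key, visit_bag, page_bag)
```

Flattening the two bags of each group against each other would give back the JOIN result, so COGROUP can be read as the grouping half of a join.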

Part 3: Implementation

Implementation: Highlights

Works on top of the Hadoop ecosystem:
– The current implementation uses Hadoop as its execution platform.

On-the-fly compilation:
– Pig translates the Pig Latin commands into Map and Reduce methods.

Lazy-style language:
– Pig tries to postpone data materialization (writes to disk) as much as possible.

Implementation: Building the logical plan

Query parsing:
– The Pig interpreter parses each command, verifying that the input files and bags it references are valid.

On-the-fly compilation:
– Pig compiles the logical plan for a bag into a physical plan (Map-Reduce statements) only when the command can no longer be delayed and must be executed.

Lazy characteristics:
– No processing is carried out while the logical plan is being built.
– Processing is triggered only when the user invokes the STORE command on a bag.
– Lazy-style execution permits in-memory pipelining and other interesting optimizations.
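The lazy behaviour described above can be imitated with a toy class that only records operations until a store is requested. This is a sketch of the idea, not of Pig's internals: the class `Relation`, its methods and the `executed` counter are all invented for the example.

```python
class Relation:
    """Records operations instead of running them (a toy logical plan)."""

    executed = 0  # counts how many times data was actually processed

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def filter(self, predicate):
        # Only extends the plan; no data is touched.
        return Relation(self.source, self.steps + (("FILTER", predicate),))

    def foreach(self, function):
        # Only extends the plan; no data is touched.
        return Relation(self.source, self.steps + (("FOREACH", function),))

    def store(self):
        """Trigger execution: stream each row through all steps in one pass."""
        Relation.executed += 1
        out = []
        for row in self.source:
            keep = True
            for kind, fn in self.steps:
                if kind == "FILTER":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                out.append(row)
        return out


pages = Relation([("cnn.com", 0.8), ("nytimes.com", 0.6), ("elmundotoday.com", 0.2)])
plan = pages.filter(lambda r: r[1] > 0.5).foreach(lambda r: (r[0],))
ran_before_store = Relation.executed  # still 0: nothing has run yet
result = plan.store()
print(ran_before_store, result)
```

Because the whole plan is known before anything runs, the filter and the projection are applied to each row in a single pass, with no intermediate collection between them. That is the kind of pipelining a lazy design makes possible.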

Implementation: Map-Reduce plan compilation

(CO)GROUP:
– Each command is compiled into a distinct map-reduce job with its own map and reduce functions.
– Parallelism is achieved because the output of multiple map instances is repartitioned in parallel to multiple reduce instances.

LOAD:
– Parallelism is obtained because Pig operates over files residing in the Hadoop distributed file system.

FILTER/FOREACH:
– Parallelism is automatic because a map-reduce job runs several map and reduce instances in parallel.

ORDER (compiled into two map-reduce jobs):
– First: determines the quantiles of the sort key.
– Second: partitions the data according to those quantiles and sorts each partition locally in the reduce phase, producing a globally sorted file.
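The two-job ORDER strategy can be sketched in Python on a single machine. The sketch rests on several assumptions: the function names are invented, the "reducers" are just lists, and the quantiles are computed exactly from all keys, whereas a distributed system would more plausibly estimate them from a sample.

```python
import bisect


def quantile_cuts(keys, partitions):
    """Job 1: choose partitions - 1 cut points from the sort keys."""
    ordered = sorted(keys)
    return [ordered[len(ordered) * i // partitions] for i in range(1, partitions)]


def order_by(rows, key, partitions=3):
    """Job 2: range-partition by the cut points, then sort each range locally."""
    cuts = quantile_cuts([row[key] for row in rows], partitions)
    # Map side: send each row to the bucket whose key range contains it.
    buckets = [[] for _ in range(partitions)]
    for row in rows:
        buckets[bisect.bisect_right(cuts, row[key])].append(row)
    # Reduce side: every bucket is sorted independently of the others.
    sorted_buckets = [sorted(b, key=lambda r: r[key]) for b in buckets]
    # The buckets cover increasing key ranges, so concatenating them
    # yields a globally sorted result without a final merge.
    return [row for bucket in sorted_buckets for row in bucket]


pages = [("cnn.com", 0.8), ("nytimes.com", 0.6), ("elmundotoday.com", 0.2)]
ranked = order_by(pages, key=1, partitions=2)
print(ranked)
```

The first pass exists to balance the load. With cut points taken from the quantiles, each reducer receives roughly the same number of rows even when the keys are unevenly distributed.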

Part 4: Conclusions

Advantages:

– Step-by-step syntax.
– Flexible: UDFs, not locked to a fixed schema (allows schema changes over time).
– Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, …
– Takes advantage of native Hadoop properties such as parallelism, load balancing and fault tolerance.
– Debugging environment.
– Open source (IMPORTANT!!)

Disadvantages:
– UDFs can be a source of performance loss (control rests with the user).
– Overhead of compiling Pig Latin into map-reduce jobs.

Usage scenarios:
– Temporal analysis: analyzing search logs mainly involves studying how the search query distribution changes over time.
– Session analysis: web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate metrics such as:
    – How long is the average user session?
    – How many links does a user click on before leaving a website?
– Others, ...

Q&A