Pig Latin CS 6800 Utah State University

Upload: loren-miller

Post on 12-Jan-2016



Writing MapReduce Jobs

• Higher order functions
• Map applies a function to a list
  Example list: [1, 2, 3, 4]
  Want to square each number in the list
  Write function f(x) = x*x
  Compute [f(1), f(2), f(3), f(4)] = [1, 4, 9, 16]
  map function signature: (a -> b) -> [a] -> [b]
  Haskell specification:
    map f [] = []
    map f (x:xs) = (f x) : (map f xs)
  Call the function:
    map (\x -> x * x) [1, 2, 3, 4]
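The recursive map specification on this slide can also be written in Python. This is a minimal sketch for illustration only; the names my_map and square are hypothetical and do not come from the slides.

```python
def square(x):
    """f(x) = x*x, the function from the slide."""
    return x * x


def my_map(f, xs):
    """Recursive map that follows the two cases of the Haskell specification."""
    if not xs:                      # map f [] = []
        return []
    head, *tail = xs                # the (x:xs) pattern: first element and the rest
    return [f(head)] + my_map(f, tail)


squares = my_map(square, [1, 2, 3, 4])
# Python's built-in map gives the same result when materialized as a list.
builtin_squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(squares)
```

In practice the built-in map (or a list comprehension) would be used; the recursive version is shown only because it lines up with the specification case by case.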


Reduce

• Reduce converts a list into a scalar
  Example list: [1, 2, 3, 4]
  Want to sum the numbers in the list
  Write function g(x,y) = x+y
  Compute g(1,g(2,g(3,g(4,0)))) = 10
  reduce signature: (a -> b -> b) -> b -> [a] -> b
  Haskell specification:
    reduce g c [] = c
    reduce g c (x:xs) = g x (reduce g c xs)
  Call the function:
    reduce (\x y -> x + y) 0 [1, 2, 3, 4]
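The reduce specification can be mirrored in Python in the same way. This is a sketch under the assumption that the combining function takes two arguments, as g(x,y) does; my_reduce is a hypothetical name. Note that Python's functools.reduce folds from the left, while the specification folds from the right; for addition both give the same sum.

```python
from functools import reduce


def my_reduce(g, c, xs):
    """Right fold that follows the two cases of the Haskell specification."""
    if not xs:                      # reduce g c [] = c
        return c
    head, *tail = xs                # the (x:xs) pattern
    return g(head, my_reduce(g, c, tail))


# g(1, g(2, g(3, g(4, 0))))
total = my_reduce(lambda x, y: x + y, 0, [1, 2, 3, 4])

# The standard-library left fold; the initial value is the third argument.
builtin_total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
print(total, builtin_total)
```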


Use in Cloud Computing

• Map can be used to clean data and "group" it
• Suppose a list of words
    words = [Bat Volcano bat vulcano]
• Map to lower case
    lcase = map lowercase words
• Map to correct spelling
    s = map spellFix lcase
• Count each word
    groups = map (\x -> (x, 1)) s
  groups is [(bat, 1), (volcano, 1), (bat, 1) …
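The cleaning steps on this slide chain three maps. A minimal Python sketch follows; the CORRECTIONS table and spell_fix are hypothetical stand-ins for the slide's spellFix, whose actual behavior is not given.

```python
words = ["Bat", "Volcano", "bat", "vulcano"]

# Hypothetical correction table standing in for the slide's spellFix.
CORRECTIONS = {"vulcano": "volcano"}


def spell_fix(word):
    """Return the corrected spelling if one is known, else the word unchanged."""
    return CORRECTIONS.get(word, word)


lcase = list(map(str.lower, words))          # map to lower case
s = list(map(spell_fix, lcase))              # map to correct spelling
groups = list(map(lambda x: (x, 1), s))      # pair each word with a count of 1
print(groups)
```

Each step is independent of the others, which is what lets a MapReduce framework run the map phase in parallel over partitions of the input.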


Use in Cloud Computing (continued)

• Shuffle collects tuples with the same "group" value
• Reduce combines counts
    result = reduce (+) 0 groups
• Problem - MapReduce jobs written in a programming language (e.g., Java)
  Complicated
  Not reusable
  Database-like operations common
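The shuffle and reduce steps on this slide can be sketched in Python as follows. This is an illustration only: the shuffle function is a hypothetical in-memory stand-in for what the framework does across machines, and the fourth pair in groups is an assumption, since the slide elides the end of the list.

```python
from collections import defaultdict
from functools import reduce

# Output of the map phase; the last pair is assumed (the slide's list is elided).
groups = [("bat", 1), ("volcano", 1), ("bat", 1), ("volcano", 1)]


def shuffle(pairs):
    """Collect the values of all pairs that share the same key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return dict(buckets)


shuffled = shuffle(groups)

# Reduce each group's counts to a single total.
result = {
    key: reduce(lambda x, y: x + y, values, 0)
    for key, values in shuffled.items()
}
print(result)
```

Even this small job needs custom grouping and folding code, which hints at why hand-written MapReduce jobs tend to be complicated and hard to reuse.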


CouchDB - Count People per Gender


Pig Latin

• Yahoo
  40% of Hadoop jobs run using Pig

• Platform for analyzing massive data sets

• Runs on Hadoop (Map/Reduce)

• Version 0.12


What is Pig Latin?

• Dataflow language
• Non-1NF data model
  Tuples
  Sets
  Bags
• Use relational algebra-like operations to manipulate data
  Joins
  Filter - selection
  Generate - projection
• Compiles to MapReduce jobs on Hadoop cluster


Pig Latin Features

• A dataflow (NoSQL) language
  SQL is declarative, most PLs are not
  SQL is poor at expressing workflow
• Non-1NF data model
  Bags, sets, tuples, maps
  Data resides in read-only files
  Schema-less
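To make "non-1NF" concrete, the sketch below models tuples, bags, and maps with plain Python values. It is only an analogy, not Pig's actual representation; is_flat is a hypothetical helper, and the sample values are borrowed from the subscriber data used elsewhere in the slides.

```python
# A tuple is an ordered list of fields.
subscriber = ("Maya", "Logan", 20)

# A bag is a collection of tuples that may contain duplicates (modeled as a list).
logan_bag = [("Maya", "Logan", 20), ("Jose", "Logan", 15)]

# A map associates keys with values.
attributes = {"city": "Logan"}

# Non-1NF: a field of a tuple may itself be a bag rather than an atomic value.
nested = ("Logan", logan_bag)


def is_flat(tup):
    """True when every field is atomic, i.e. the tuple would fit a 1NF table."""
    return not any(isinstance(field, (list, tuple, dict)) for field in tup)


print(is_flat(subscriber), is_flat(nested))
```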


Example

• Count subscribers in each city
  A = LOAD 'subscribers.txt' AS
      (name: chararray, city: chararray, amount: int);
  -- B has two fields: group (the city) and A (a bag of that city's tuples)
  B = GROUP A BY city;
  C = FOREACH B GENERATE group, COUNT(A);
  DUMP C;

• Dataflow
  LOAD … -> A -> GROUP A … -> B -> FOREACH B … -> C
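The LOAD, GROUP, FOREACH dataflow of this example can be imitated in a few lines of Python. This is a sketch of the semantics only, not of how Pig executes the script: group_by is a hypothetical helper, and the rows are assumed sample data taken from the Magazine Subscriber Data slide, with amounts written as plain integers.

```python
from collections import defaultdict

# A: stand-in for the loaded relation (name, city, amount).
A = [("Maya", "Logan", 20), ("Jose", "Logan", 15), ("Knut", "Ogden", 20)]


def group_by(relation, key_index):
    """Return (key, bag) pairs, where the bag holds every tuple with that key."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return list(groups.items())


# B: one (city, bag-of-tuples) pair per distinct city.
B = group_by(A, 1)

# C: project the key and the size of each bag.
C = [(city, len(bag)) for city, bag in B]
print(C)
```

Each relation is derived from the previous one, which is the step-by-step dataflow style that distinguishes Pig Latin from a single declarative SQL query.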


Compilation

[Figure: a Pig Latin Program is passed to the Pig Latin Compiler, which produces a Map Reduce Job; Hadoop runs it as a chain of Map and Reduce stages with HDFS between them and returns the Result.]


Data Transformations

• Relational algebra-like
  JOIN (inner and outer joins), FILTER (selection), FOREACH (projection), CROSS (product), UNION
• SQL-like
  DISTINCT, LIMIT, ORDER BY, GROUP
• Non-traditional
  COGROUP, MAPREDUCE, FLATTEN, RANK, STREAM, SAMPLE, SPLIT
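Two of the non-traditional operators, COGROUP and FLATTEN, have no direct relational algebra counterpart because they create and remove nesting. The Python sketch below illustrates the general idea under simplifying assumptions; cogroup and flatten are hypothetical helpers, not Pig APIs, and the sample rows are abbreviated from the subscriber data.

```python
def cogroup(left, right, left_key, right_key):
    """Group two relations by key: one (key, left_bag, right_bag) tuple per key."""
    keys = []
    for row, idx in [(r, left_key) for r in left] + [(r, right_key) for r in right]:
        if row[idx] not in keys:
            keys.append(row[idx])
    return [
        (k,
         [r for r in left if r[left_key] == k],
         [r for r in right if r[right_key] == k])
        for k in keys
    ]


def flatten(rows, bag_index):
    """Un-nest the bag at bag_index: one output tuple per tuple inside the bag."""
    out = []
    for row in rows:
        for inner in row[bag_index]:
            out.append(row[:bag_index] + tuple(inner) + row[bag_index + 1:])
    return out


subscribers = [("Maya", "Logan"), ("Jose", "Logan"), ("Knut", "Ogden")]
per_info = [("Maya", 5), ("Jose", 6), ("Knut", 7)]

cogrouped = cogroup(subscribers, per_info, 0, 0)

# A grouped relation (city, bag of names) and its flattened form.
by_city = [("Logan", [("Maya",), ("Jose",)]), ("Ogden", [("Knut",)])]
flat = flatten(by_city, 1)
print(flat)
```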


Magazine Subscriber Data

Subscribers
(Name, City, Amt, Id)
(Maya, Logan, $20, 1)
(Jose, Logan, $15, 2)
(Knut, Ogden, $20, 3)
...

Personal Information
(Name, Email, Id)
(Maya, [email protected], 5)
(Jose, [email protected], 6)
(Knut, [email protected], 7)
...


FILTER

• A filter restricts the result
  /* Restrict to Logan subscribers */
  X = FILTER Subscribers BY city == 'Logan';
• FILTER example

Subscribers

(Name, City, Amt, Id)

(Maya, Logan, $20, 1)

(Jose, Logan, $15, 2)

(Knut, Ogden, $20, 3)

...
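The effect of the filter on the Subscribers rows shown above can be sketched in Python. The rows come from the slide, with amounts written as plain integers; the CITY column index is a hypothetical constant introduced for readability.

```python
# (name, city, amount, id) rows from the Subscribers relation.
subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]

CITY = 1  # position of the city field in each tuple

# Keep only the tuples whose city is Logan (a relational selection).
X = [row for row in subscribers if row[CITY] == "Logan"]

# The same selection with the built-in filter function.
X_builtin = list(filter(lambda row: row[CITY] == "Logan", subscribers))
print(X)
```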



Magazine Subscriber Data

B = JOIN Subscribers BY name, PerInfo BY name;

B
(Name, City, Amt, Id, Name, Email, Id)
(Maya, Logan, $20, 1, Maya, [email protected], 5)
(Jose, Logan, $15, 2, Jose, [email protected], 6)
(Knut, Ogden, $20, 3, Knut, [email protected], 7)
...
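The join result above concatenates each subscriber tuple with the personal information tuple that has the same name. A minimal Python sketch of an inner equi-join follows; inner_join is a hypothetical helper, and the email field reuses the placeholder text exactly as it appears in the transcript.

```python
from collections import defaultdict


def inner_join(left, right, left_key, right_key):
    """Concatenate every pair of tuples whose key fields are equal."""
    index = defaultdict(list)
    for r in right:                      # build a lookup table on the right input
        index[r[right_key]].append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]


subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]
per_info = [
    ("Maya", "[email protected]", 5),
    ("Jose", "[email protected]", 6),
    ("Knut", "[email protected]", 7),
]

B = inner_join(subscribers, per_info, 0, 0)
print(B[0])
```

As in the slide, the key column appears twice in each joined tuple, once from each input.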


Optimization

[Figure: a dataflow over relations A, B, C, D, and E built from the operations FILTER …, JOIN …, FILTER …, and CROSS …, divided into two Map/Reduce stages.]