
Pig Latin

CS 6800

Utah State University

Writing MapReduce Jobs

• Higher order functions

• Map applies a function to a list

Example

list [1, 2, 3, 4]
Want to square each number in the list
Write function f(x) = x*x
Compute [f(1), f(2), f(3), f(4)] = [1, 4, 9, 16]

map function signature: (a -> b) -> [a] -> [b]

Haskell specification

map f [] = []
map f (x:xs) = (f x) : (map f xs)

Call the function

map (\x -> x * x) [1, 2, 3, 4]
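As a rough cross-check of the specification above, here is a minimal Python sketch. The names my_map and square are illustrative and not part of the slides; my_map mirrors the two Haskell equations, and Python's built-in map is shown alongside it for comparison.

```python
def square(x):
    """f(x) = x*x from the example."""
    return x * x


def my_map(f, xs):
    """Recursive map mirroring the two Haskell equations."""
    if not xs:                              # map f [] = []
        return []
    return [f(xs[0])] + my_map(f, xs[1:])   # (f x) : (map f xs)


squares = my_map(square, [1, 2, 3, 4])

# Python's built-in map returns a lazy iterator, so wrap it in list()
builtin_squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
```

Both squares and builtin_squares hold [1, 4, 9, 16].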

Reduce

• Reduce converts a list into a scalar

Example

list [1, 2, 3, 4]
Want to sum the numbers in the list
Write function g(x,y) = x+y
Compute g(1,g(2,g(3,g(4,0)))) = 10

reduce signature: (a -> b -> b) -> b -> [a] -> b

Haskell specification

reduce g c [] = c
reduce g c (x:xs) = g x (reduce g c xs)

Call the function

reduce (\x y -> x + y) 0 [1, 2, 3, 4]
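The reduce specification can be sketched in Python in the same way. The name fold_right is an illustrative choice, not from the slides. Note that Python's functools.reduce folds from the left, whereas the specification folds from the right; for addition the two give the same answer.

```python
from functools import reduce


def fold_right(g, c, xs):
    """Right fold mirroring the Haskell specification of reduce."""
    if not xs:                                  # reduce g c [] = c
        return c
    return g(xs[0], fold_right(g, c, xs[1:]))   # g x (reduce g c xs)


# g(1, g(2, g(3, g(4, 0))))
total = fold_right(lambda x, y: x + y, 0, [1, 2, 3, 4])

# Left fold from the standard library: ((((0 + 1) + 2) + 3) + 4)
builtin_total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
```

Both total and builtin_total evaluate to 10.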

Use in Cloud Computing

• Map can be used to clean data and "group" it

• Suppose a list of words
words = [Bat Volcano bat vulcano]

• Map to lower case
lcase = map lowercase words

• Map to correct spelling
s = map spellFix lcase

• Count each word
groups = map (\x -> (x, 1)) s

groups is [(bat, 1), (volcano, 1), (bat, 1) …
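The cleaning steps above can be sketched in Python as a chain of map calls. The slides do not define spellFix, so the spell_fix function and its CORRECTIONS table below are hypothetical stand-ins that only know about the one misspelling in the example.

```python
words = ["Bat", "Volcano", "bat", "vulcano"]

# Hypothetical correction table; the slides do not define spellFix
CORRECTIONS = {"vulcano": "volcano"}


def spell_fix(word):
    """Return the corrected spelling if one is known, else the word itself."""
    return CORRECTIONS.get(word, word)


lcase = list(map(str.lower, words))        # map to lower case
s = list(map(spell_fix, lcase))            # map to correct spelling
groups = list(map(lambda x: (x, 1), s))    # pair each word with a count of 1
```

With these four input words, groups contains one (word, 1) pair per word, with "bat" and "volcano" each appearing twice.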

Use in Cloud Computing (continued)

• Shuffle collects tuples with the same "group" value

• Reduce combines the counts in each group
result = reduce (+) 0 groups

• Problem - MapReduce jobs written in a PL (e.g., Java)
    Complicated
    Not reusable
    Database-like operations common
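To make the shuffle and reduce steps concrete, here is a minimal single-machine Python sketch. It assumes the mapped pairs for the four example words; the shuffle function is an illustrative stand-in for what the framework does between the map and reduce phases, not an actual Hadoop API.

```python
from collections import defaultdict
from functools import reduce

# Assumed output of the map phase for the four example words
groups = [("bat", 1), ("volcano", 1), ("bat", 1), ("volcano", 1)]


def shuffle(pairs):
    """Collect the values of all pairs that share the same key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return dict(buckets)


shuffled = shuffle(groups)   # {"bat": [1, 1], "volcano": [1, 1]}

# Reduce each group's list of counts to a single total
result = {key: reduce(lambda x, y: x + y, counts, 0)
          for key, counts in shuffled.items()}
```

Even this toy version needs explicit grouping code for what is essentially a GROUP BY with a count, which is the kind of database-like operation the last bullet refers to.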

CouchDB - Count People per Gender

Pig Latin

• Yahoo

40% of Hadoop jobs run using Pig

• Platform for analyzing massive data sets

• Runs on Hadoop (Map/Reduce)

• Version 0.12

What is Pig Latin?

• Dataflow language

• Non-1NF data model
    Tuples
    Sets
    Bags

• Use relational algebra-like operations to manipulate data
    Joins
    Filter - selection
    Generate - projection

• Compiles to MapReduce jobs on Hadoop cluster

Pig Latin Features

• A dataflow (NoSQL) language
    SQL is declarative, most PLs are not
    SQL is poor at expressing workflow

• Non-1NF data model
    Bags, sets, tuples, maps
    Data resides in read-only files
    Schema-less
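As a loose analogy for the non-1NF data model, the Python sketch below uses built-in collections. The mapping of Pig types onto Python types is an assumption made for illustration; the sample rows come from the subscriber data used later in this deck, with one row repeated on purpose.

```python
# A tuple is an ordered sequence of fields
subscriber = ("Maya", "Logan", 20, 1)

# A bag is a collection of tuples that may contain duplicates (modeled as a list)
bag = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Maya", "Logan", 20, 1),   # duplicate row, allowed in a bag
]

# A set keeps only distinct tuples
distinct = set(bag)

# A map associates keys with values
info = {"city": "Logan", "amount": 20}

# Non-1NF: a field of a tuple may itself hold a whole bag
nested = ("Logan", bag)
```

The nested value is what first normal form forbids: its second field is a collection rather than an atomic value.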

Example

• Count subscribers in each city

A = LOAD 'subscribers.txt' AS (name: chararray, city: chararray, amount: int);
B = GROUP A BY city;
-- after GROUP, each tuple of B is (group, A): the city and the bag of its rows
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
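The same count can be traced by hand with a small Python sketch. It assumes subscribers.txt holds the three rows from the subscriber data shown later in the deck, and it only imitates the shape of the intermediate relations; it is not how Pig executes the script.

```python
from collections import defaultdict

# Assumed contents of subscribers.txt: (name, city, amount)
A = [
    ("Maya", "Logan", 20),
    ("Jose", "Logan", 15),
    ("Knut", "Ogden", 20),
]

# GROUP A BY city: each group key maps to the bag of rows that share it
B = defaultdict(list)
for row in A:
    city = row[1]
    B[city].append(row)

# FOREACH B GENERATE group, COUNT(A): one output tuple per group
C = [(city, len(rows)) for city, rows in B.items()]
```

Here C holds one (city, count) tuple per city: 2 for Logan and 1 for Ogden.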

• Dataflow

LOAD … → A → GROUP A … → B → FOREACH B … → C

Compilation

[Figure: a Pig Latin Program passes through the Pig Latin Compiler, which produces Map Reduce jobs; Hadoop runs them as a chain of Map, Reduce, HDFS stages and returns the Result]

Data Transformations

• Relational algebra-like
    JOIN (inner and outer joins)
    FILTER (selection)
    FOREACH (projection)
    CROSS (product)
    UNION

• SQL-like
    DISTINCT, LIMIT, ORDER BY, GROUP

• Non-traditional
    COGROUP, MAPREDUCE, FLATTEN, RANK, STREAM, SAMPLE, SPLIT
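Of the non-traditional operators, COGROUP is perhaps the least familiar from SQL. The Python sketch below is an approximation of its behavior as generally documented for Pig: it groups two relations by a key and emits, per key, one bag from each relation instead of flattened row pairs. The cogroup function and its parameters are illustrative names, and the rows are taken from this deck's subscriber data with the email field left as it appears in the source.

```python
def cogroup(left, right, left_key, right_key):
    """For every key in either relation, emit (key, left_bag, right_bag)."""
    keys = sorted({left_key(r) for r in left} | {right_key(r) for r in right})
    return [
        (k,
         [r for r in left if left_key(r) == k],
         [r for r in right if right_key(r) == k])
        for k in keys
    ]


subscribers = [("Maya", "Logan", 20, 1), ("Jose", "Logan", 15, 2),
               ("Knut", "Ogden", 20, 3)]
per_info = [("Maya", "[email protected]", 5), ("Jose", "[email protected]", 6),
            ("Knut", "[email protected]", 7)]

# Group both relations by name (field 0 in each)
cogrouped = cogroup(subscribers, per_info, lambda r: r[0], lambda r: r[0])
```

Each element of cogrouped is a nested, non-1NF tuple; a key that appears in only one relation would still show up, paired with an empty bag for the other.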

Magazine Subscriber Data

Subscribers

(Name, City, Amt, Id)
(Maya, Logan, $20, 1)
(Jose, Logan, $15, 2)
(Knut, Ogden, $20, 3)
...

Personal Information

(Name, Email, Id)
(Maya, [email protected], 5)
(Jose, [email protected], 6)
(Knut, [email protected], 7)
...

FILTER

• A filter restricts the result

/* Restrict to Logan subscribers */
X = FILTER R BY city == 'Logan';
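A filter of this kind corresponds to relational selection. As a minimal Python sketch, assuming R holds the three subscriber rows shown in this deck with fields (name, city, amount, id):

```python
# Assumed contents of relation R: (name, city, amount, id)
R = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]

# Keep only the rows whose city field equals "Logan"
X = [row for row in R if row[1] == "Logan"]
```

X keeps the Maya and Jose rows and drops the Ogden row; all columns are retained, since a filter removes rows but not fields.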

• FILTER example

Subscribers

...

Subscribers

...

Personal Information

(Name, Email, Id)
(Maya, [email protected], 5)
(Jose, [email protected], 6)
(Knut, [email protected], 7)
...

B = JOIN Subscribers BY name, PerInfo BY name;


B

(Name, City, Amt, Id, Name, Email, Id)
(Maya, Logan, $20, 1, Maya, [email protected], 5)
(Jose, Logan, $15, 2, Jose, [email protected], 6)
(Knut, Ogden, $20, 3, Knut, [email protected], 7)
...
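The shape of B can be reproduced with a small hash-join sketch in Python. The inner_join function is an illustrative helper, not Pig's implementation, and the email field is left as it appears in the source.

```python
from collections import defaultdict


def inner_join(left, right, left_key, right_key):
    """Inner equi-join: concatenate every pair of rows whose keys match."""
    index = defaultdict(list)
    for r in right:                 # build a lookup table on the right relation
        index[right_key(r)].append(r)
    return [l + r for l in left for r in index.get(left_key(l), [])]


subscribers = [("Maya", "Logan", 20, 1), ("Jose", "Logan", 15, 2),
               ("Knut", "Ogden", 20, 3)]
per_info = [("Maya", "[email protected]", 5), ("Jose", "[email protected]", 6),
            ("Knut", "[email protected]", 7)]

# JOIN Subscribers BY name, PerInfo BY name
B = inner_join(subscribers, per_info, lambda r: r[0], lambda r: r[0])
```

Each output row carries all fields of both inputs, so the join key (the name) appears twice, as in the table above.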


Optimization

[Figure: dataflow over relations A to E using FILTER …, JOIN …, FILTER …, and CROSS …, with the operators grouped into Map/Reduce jobs]
