Pig Latin
CS 6800
Utah State University
Writing MapReduce Jobs
• Higher order functions
• Map applies a function to each element of a list
    Example list: [1, 2, 3, 4]
    Want to square each number in the list
    Write function f(x) = x*x
    Compute [f(1), f(2), f(3), f(4)] = [1, 4, 9, 16]
    map function signature: (a -> b) -> [a] -> [b]
    Haskell specification
        map f [] = []
        map f (x:xs) = (f x) : (map f xs)
    Call the function
        map (\x -> x * x) [1, 2, 3, 4]
Reduce
• Reduce converts a list into a scalar
    Example list: [1, 2, 3, 4]
    Want to sum the numbers in the list
    Write function g(x,y) = x+y
    Compute g(1,g(2,g(3,g(4,0)))) = 10
    reduce signature: (a -> b -> b) -> b -> [a] -> b
    Haskell specification
        reduce g c [] = c
        reduce g c (x:xs) = g x (reduce g c xs)
    Call the function
        reduce (\x y -> x + y) 0 [1, 2, 3, 4]
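The two recursive specifications can also be written in Python as a rough illustration. This is a minimal sketch, not part of the original slides; the names map_list and reduce_right are made up here, and reduce_right folds from the right to match the nesting g(1,g(2,g(3,g(4,0)))).

```python
def map_list(f, xs):
    """Recursive map: the empty list maps to itself, otherwise apply f to the head."""
    if not xs:
        return []
    return [f(xs[0])] + map_list(f, xs[1:])


def reduce_right(g, c, xs):
    """Right fold: the empty list yields c, otherwise combine the head with the folded tail."""
    if not xs:
        return c
    return g(xs[0], reduce_right(g, c, xs[1:]))


squares = map_list(lambda x: x * x, [1, 2, 3, 4])
total = reduce_right(lambda x, y: x + y, 0, [1, 2, 3, 4])
print(squares)  # [1, 4, 9, 16]
print(total)    # 10
```

Note that the function passed to reduce takes two arguments, the current element and the result accumulated from the rest of the list.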
Use in Cloud Computing
• Map can be used to clean data and "group" it
• Suppose a list of words
    words = [Bat Volcano bat vulcano]
• Map to lower case
    lcase = map lowercase words
• Map to correct spelling
    s = map spellFix lcase
• Count each word
    groups = map (\x -> (x, 1)) s
    groups is [(bat, 1), (volcano, 1), (bat, 1) …
Use in Cloud Computing (continued)
• Shuffle collects tuples with the same "group" value
• Reduce combines counts
    result = reduce + 0 groups
• Problem: MapReduce jobs are written in a programming language (e.g., Java)
    Complicated
    Not reusable
    Database-like operations are common
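The map, shuffle, and reduce steps of the word-count example can be sketched in Python. This is an illustration under assumptions: the CORRECTIONS table is a hypothetical stand-in for spellFix, and shuffle is a made-up helper that mimics what the framework does between the map and reduce phases.

```python
from collections import defaultdict
from functools import reduce

words = ["Bat", "Volcano", "bat", "vulcano"]

# Hypothetical spelling table standing in for the slide's spellFix function.
CORRECTIONS = {"vulcano": "volcano"}


def spell_fix(word):
    return CORRECTIONS.get(word, word)


# Map phase: clean each word, then emit a (word, 1) pair per occurrence.
lcase = list(map(str.lower, words))
s = list(map(spell_fix, lcase))
groups = list(map(lambda w: (w, 1), s))


def shuffle(pairs):
    """Collect the values of all pairs that share the same key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return dict(buckets)


# Reduce phase: sum the counts collected for each word.
shuffled = shuffle(groups)
result = {
    word: reduce(lambda x, y: x + y, counts, 0)
    for word, counts in shuffled.items()
}
print(result)  # {'bat': 2, 'volcano': 2}
```

Even for this small job, the cleaning, grouping, and counting logic has to be written by hand, which is the motivation for a higher-level language.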
CouchDB - Count People per Gender
Pig Latin
• Yahoo
    40% of Hadoop jobs run using Pig
• Platform for analyzing massive data sets
• Runs on Hadoop (Map/Reduce)
• Version 0.12
What is Pig Latin?
• Dataflow language
• Non-1NF data model
    Tuples
    Sets
    Bags
• Use relational algebra-like operations to manipulate data
    Joins
    Filter - selection
    Generate - projection
• Compiles to MapReduce jobs on Hadoop cluster
Pig Latin Features
• A dataflow (NoSQL) language
    SQL is declarative, most PLs are not
    SQL is poor at expressing workflow
• Non-1NF data model
    Bags, sets, tuples, maps
    Data resides in read-only files
    Schema-less
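To make the non-1NF data model concrete, here is a small Python sketch. It only imitates Pig's types with Python values: a tuple for a tuple, a list for a bag, and a dict for a map. The variable names and the map keys are made up for illustration; the subscriber values come from the example data later in the slides.

```python
# A tuple whose second field is a bag of nested tuples, so the field
# is not atomic. This nesting is what "non-1NF" refers to.
logan = ("Logan", [("Maya", 20), ("Jose", 15)])

# A map field associates keys with values (keys here are hypothetical).
profile = {"name": "Maya", "city": "Logan"}

# Unnesting the bag yields flat, 1NF-style rows again.
city, subscribers = logan
flat = [(city, name, amount) for name, amount in subscribers]
print(flat)  # [('Logan', 'Maya', 20), ('Logan', 'Jose', 15)]
```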
Example
• Count subscribers in each city
    A = LOAD 'subscribers.txt' AS
        (name: chararray, city: chararray, amount: int);
    B = GROUP A BY city;
    -- After GROUP, each tuple of B holds the key in `group` and the matching rows in bag `A`
    C = FOREACH B GENERATE group, COUNT(A);
    DUMP C;
• Dataflow
    LOAD … -> A -> GROUP A … -> B -> FOREACH B … -> C
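What the GROUP and FOREACH steps compute can be imitated in plain Python. This is a sketch of the semantics only, not of how Pig executes the script; group_by is a made-up helper, and the rows are taken from the subscriber example in the slides instead of being loaded from subscribers.txt.

```python
from collections import defaultdict

# Relation A: (name, city, amount) rows, as LOAD would produce them.
A = [("Maya", "Logan", 20), ("Jose", "Logan", 15), ("Knut", "Ogden", 20)]


def group_by(rows, key_index):
    """Return (group, bag) pairs, where each bag holds the rows sharing a key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return list(bags.items())


# B = GROUP A BY city: one (group, bag) tuple per city.
B = group_by(A, 1)

# C = FOREACH B GENERATE group, COUNT(A): count the rows in each bag.
C = [(group, len(bag)) for group, bag in B]
print(C)  # [('Logan', 2), ('Ogden', 1)]
```

Each relation in the dataflow is the input to the next step, which is why B carries a nested bag instead of flat rows.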
Compilation
    [Figure: a Pig Latin Program goes through the Pig Latin Compiler, which produces a Map Reduce Job; Hadoop runs it as a series of Map and Reduce stages that read from and write to HDFS, producing the Result]
Data Transformations
• Relational algebra-like
    JOIN (inner and outer joins)
    FILTER (selection)
    FOREACH (projection)
    CROSS (product)
    UNION
• SQL-like
    DISTINCT
    LIMIT
    ORDER BY
    GROUP
• Non-traditional
    COGROUP
    MAPREDUCE
    FLATTEN
    RANK
    STREAM
    SAMPLE
    SPLIT
Magazine Subscriber Data
Subscribers
(Name, City, Amt, Id)
(Maya, Logan, $20, 1)
(Jose, Logan, $15, 2)
(Knut, Ogden, $20, 3)
...
Personal Information
(Name, Email, Id)
(Maya, [email protected], 5)
(Jose, [email protected], 6)
(Knut, [email protected], 7)
...
FILTER
• A filter restricts the result to the tuples that satisfy a condition
    /* Restrict to Logan subscribers */
    X = FILTER R BY city == 'Logan';
• FILTER example
Subscribers
(Name, City, Amt, Id)
(Maya, Logan, $20, 1)
(Jose, Logan, $15, 2)
(Knut, Ogden, $20, 3)
...
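The effect of the FILTER statement on the Subscribers rows shown above can be sketched in Python. This is an illustration of the selection semantics only; the list R and the field position used for city are assumptions made for the example.

```python
# Relation R: (name, city, amount, id) rows from the Subscribers table.
R = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]

# Keep only the tuples whose city field equals "Logan".
X = [row for row in R if row[1] == "Logan"]
print(X)  # [('Maya', 'Logan', 20, 1), ('Jose', 'Logan', 15, 2)]
```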
Magazine Subscriber Data
B = JOIN Subscribers BY name, PerInfo BY name;
B
(Name, City, Amt, Id, Name, Email, Id)
(Maya, Logan, $20, 1, Maya, [email protected], 5)
(Jose, Logan, $15, 2, Jose, [email protected], 6)
(Knut, Ogden, $20, 3, Knut, [email protected], 7)
...
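An inner join on name can be imitated in Python with a simple hash join. This sketch shows the shape of the result only and is not how Pig implements JOIN; inner_join is a made-up helper, and the email addresses are hypothetical placeholders because the originals are not legible in the slides.

```python
from collections import defaultdict

# (name, city, amount, id) rows from the Subscribers table.
subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]

# (name, email, id) rows; the email values are hypothetical placeholders.
per_info = [
    ("Maya", "maya@example.com", 5),
    ("Jose", "jose@example.com", 6),
    ("Knut", "knut@example.com", 7),
]


def inner_join(left, left_key, right, right_key):
    """Concatenate every pair of left and right rows whose key fields match."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]


B = inner_join(subscribers, 0, per_info, 0)
print(B[0])  # ('Maya', 'Logan', 20, 1, 'Maya', 'maya@example.com', 5)
```

As in the result table, the joined tuples keep the fields of both inputs, so the name and id columns appear twice.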
Optimization
    [Figure: a dataflow over relations A, B, C, D, E with FILTER …, JOIN …, FILTER …, and CROSS … steps, mapped onto two Map/Reduce jobs]