Post on 12-Jan-2016
Pig Latin
CS 6800
Utah State University
Writing MapReduce Jobs
• Higher order functions
• Map applies a function to each element of a list
Example list: [1, 2, 3, 4]
Want to square each number in the list
Write function f(x) = x*x
Compute [f(1), f(2), f(3), f(4)] = [1, 4, 9, 16]
map function signature: (a -> b) -> [a] -> [b]
Haskell specification:
  map f [] = []
  map f (x:xs) = (f x) : (map f xs)
Call the function:
  map (\x -> x * x) [1, 2, 3, 4]
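For comparison, here is a minimal Python sketch of the same recursive definition of map. The name `my_map` is an illustrative assumption and does not come from the slides; Python's built-in `map` would normally be used instead.

```python
def my_map(f, xs):
    """Apply f to every element of xs, following the recursive definition."""
    if not xs:                               # map f [] = []
        return []
    return [f(xs[0])] + my_map(f, xs[1:])    # map f (x:xs) = f x : map f xs


squares = my_map(lambda x: x * x, [1, 2, 3, 4])
print(squares)  # [1, 4, 9, 16]
```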
Reduce
• Reduce converts a list into a scalar
Example list: [1, 2, 3, 4]
Want to sum the numbers in the list
Write function g(x,y) = x+y
Compute g(1,g(2,g(3,g(4,0)))) = 10
reduce signature: (a -> b -> b) -> b -> [a] -> b
Haskell specification:
  reduce g c [] = c
  reduce g c (x:xs) = g x (reduce g c xs)
Call the function:
  reduce (\x y -> x + y) 0 [1, 2, 3, 4]
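The same right-to-left reduction can be sketched in Python. The name `my_reduce` is an illustrative assumption; note that Python's own `functools.reduce` combines from the left, which gives the same result for addition but not for every function.

```python
def my_reduce(g, c, xs):
    """Fold xs from the right with g, starting from the initial value c."""
    if not xs:                                   # reduce g c [] = c
        return c
    return g(xs[0], my_reduce(g, c, xs[1:]))     # reduce g c (x:xs) = g x (reduce g c xs)


total = my_reduce(lambda x, y: x + y, 0, [1, 2, 3, 4])
print(total)  # 10
```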
Use in Cloud Computing
• Map can be used to clean data and "group" it
• Suppose a list of words
  words = [Bat, Volcano, bat, vulcano]
• Map to lower case
  lcase = map lowercase words
• Map to correct spelling
  s = map spellFix lcase
• Count each word
  groups = map (\x -> (x, 1)) s
  groups is [(bat, 1), (volcano, 1), (bat, 1), …]
Use in Cloud Computing (continued)
• Shuffle collects tuples with the same "group" value
• Reduce combines counts
  result = reduce (+) 0 groups
• Problem: MapReduce jobs are written in a programming language (e.g., Java)
  Complicated
  Not reusable
  Database-like operations are common
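Before turning to that problem, the word-count steps from the two slides above can be sketched end to end in Python. This is only an illustration under stated assumptions: the `SPELLING_FIXES` table is a hypothetical stand-in for the slides' `spellFix`, and a plain dictionary plays the role of the shuffle step.

```python
from collections import defaultdict

words = ["Bat", "Volcano", "bat", "vulcano"]

# Hypothetical spelling table standing in for the slides' spellFix.
SPELLING_FIXES = {"vulcano": "volcano"}


def spell_fix(word):
    """Return the corrected spelling, or the word unchanged if no fix is known."""
    return SPELLING_FIXES.get(word, word)


# Map phase: clean each word, then emit (word, 1) pairs.
lcase = [w.lower() for w in words]
s = [spell_fix(w) for w in lcase]
groups = [(w, 1) for w in s]

# Shuffle phase: collect the counts that share the same "group" value.
shuffled = defaultdict(list)
for key, count in groups:
    shuffled[key].append(count)

# Reduce phase: combine the counts of each group into one number.
result = {key: sum(counts) for key, counts in shuffled.items()}
print(result)  # {'bat': 2, 'volcano': 2}
```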
CouchDB - Count People per Gender
Pig Latin
• Yahoo
  40% of Hadoop jobs run using Pig
• Platform for analyzing massive data sets
• Runs on Hadoop (Map/Reduce)
• Version 0.12
What is Pig Latin?
• Dataflow language
• Non-1NF data model
  Tuples
  Sets
  Bags
• Use relational algebra-like operations to manipulate data
  Joins
  Filter - selection
  Generate - projection
• Compiles to MapReduce jobs on a Hadoop cluster
Pig Latin Features
• A dataflow (NoSQL) language
  SQL is declarative, most PLs are not
  SQL is poor at expressing workflow
• Non-1NF data model
  Bags, sets, tuples, maps
  Data resides in read-only files
  Schema-less
Example
• Count subscribers in each city
A = LOAD 'subscribers.txt' AS
(name: chararray, city: chararray, amount: int);
B = GROUP A BY city;
/* Each tuple of B holds the key in "group" and the matching rows in bag A */
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
• Dataflow
LOAD … → A → GROUP A … → B → FOREACH B … → C
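To make the shape of the intermediate relation concrete, the GROUP and FOREACH steps can be imitated in plain Python. This is a rough sketch of the semantics, not how Pig executes the script; the sample rows and the helper name `group_by` are assumptions made for illustration.

```python
from collections import defaultdict

# Assumed contents of subscribers.txt: (name, city, amount).
A = [("Maya", "Logan", 20), ("Jose", "Logan", 15), ("Knut", "Ogden", 20)]


def group_by(relation, key_index):
    """Return (key, bag) pairs, where each bag lists the rows sharing that key."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)
    return list(bags.items())


# B = GROUP A BY city;  each tuple nests a bag of rows (non-1NF).
B = group_by(A, 1)

# C = FOREACH B GENERATE group, COUNT(A);
C = [(city, len(bag)) for city, bag in B]
print(C)  # [('Logan', 2), ('Ogden', 1)]
```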
Compilation
[Diagram: Pig Latin Program → Pig Latin Compiler → Map Reduce Job → Hadoop (repeated Map, Reduce, HDFS stages) → Result]
Data Transformations
• Relational algebra-like
  JOIN (inner and outer joins)
  FILTER (selection)
  FOREACH (projection)
  CROSS (product)
  UNION
• SQL-like
  DISTINCT
  LIMIT
  ORDER BY
  GROUP
• Non-traditional
  COGROUP
  MAPREDUCE
  FLATTEN
  RANK
  STREAM
  SAMPLE
  SPLIT
Magazine Subscriber Data
Subscribers
(Name, City, Amt, Id)
(Maya, Logan, $20, 1)
(Jose, Logan, $15, 2)
(Knut, Ogden, $20, 3)
...
Personal Information
(Name, Email, Id)
(Maya, maya@gmail.com, 5)
(Jose, jose@gmail.com, 6)
(Knut, knut@hotmail.com, 7)
...
FILTER
• A filter restricts the result
/* Restrict to Logan subscribers */
X = FILTER R BY city == 'Logan';
• FILTER example
Subscribers
(Name, City, Amt, Id)
(Maya, Logan, $20, 1)
(Jose, Logan, $15, 2)
(Knut, Ogden, $20, 3)
...
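Applied to the three rows listed above, a filter on city keeps only the Logan subscribers. The Python sketch below illustrates that selection; the helper name `filter_by` and the use of plain integers for the amounts are assumptions made for the example.

```python
# The three Subscribers rows shown on the slide: (name, city, amount, id).
subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]


def filter_by(relation, predicate):
    """Keep only the rows for which predicate(row) is true (selection)."""
    return [row for row in relation if predicate(row)]


# X = FILTER ... BY city == 'Logan';
X = filter_by(subscribers, lambda row: row[1] == "Logan")
print(X)  # [('Maya', 'Logan', 20, 1), ('Jose', 'Logan', 15, 2)]
```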
/* Join Subscribers with Personal Information (PerInfo) on name */
B = JOIN Subscribers BY name, PerInfo BY name;
Magazine Subscriber Data
B
(Name, City, Amt, Id, Name, Email, Id)
(Maya, Logan, $20, 1, Maya, maya@gmail.com, 5)
(Jose, Logan, $15, 2, Jose, jose@gmail.com, 6)
(Knut, Ogden, $20, 3, Knut, knut@hotmail.com, 7)
...
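The join result above concatenates each subscriber row with the personal-information row that has the same name. One way to picture this is a simple hash join, sketched here in Python. This is only an illustration of inner-join semantics and makes no claim about how Pig evaluates joins; the helper name `join` and the integer amounts are assumptions.

```python
from collections import defaultdict

subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]
per_info = [
    ("Maya", "maya@gmail.com", 5),
    ("Jose", "jose@gmail.com", 6),
    ("Knut", "knut@hotmail.com", 7),
]


def join(left, left_key, right, right_key):
    """Inner join: pair every left row with each right row sharing its key."""
    index = defaultdict(list)
    for r in right:                      # build a lookup table on the right side
        index[r[right_key]].append(r)
    # Concatenate matching rows, keeping the fields of both sides.
    return [l + r for l in left for r in index.get(l[left_key], [])]


# B = JOIN Subscribers BY name, PerInfo BY name;
B = join(subscribers, 0, per_info, 0)
for row in B:
    print(row)
```

As in the result table, both name columns and both id columns survive the join, so each output row has seven fields.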
Optimization
[Diagram: dataflow over relations A, B, C, D, E using FILTER …, JOIN …, FILTER …, and CROSS …, with two Map/Reduce stages marked]