
Pig Latin

CS 6800

Utah State University

Writing MapReduce Jobs
• Higher order functions
• Map applies a function to a list

Example list [1, 2, 3, 4]
  Want to square each number in the list
  Write function f(x) = x*x
  Compute [f(1), f(2), f(3), f(4)] = [1, 4, 9, 16]
map function signature: (a -> b) -> [a] -> [b]
Haskell specification
  map f [] = []
  map f (x:xs) = (f x) : (map f xs)
Call the function
  map (\x -> x * x) [1, 2, 3, 4]
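As a rough cross-check of the map example, here is a minimal Python sketch that mirrors the two Haskell equations. The name my_map is hypothetical (chosen to avoid shadowing Python's built-in map), and the recursion is written for clarity rather than efficiency.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def my_map(f: Callable[[A], B], xs: List[A]) -> List[B]:
    """Recursive map: apply f to every element of xs."""
    if not xs:
        # Base case, corresponds to: map f [] = []
        return []
    # Recursive case: apply f to the head, then map over the tail.
    return [f(xs[0])] + my_map(f, xs[1:])


squares = my_map(lambda x: x * x, [1, 2, 3, 4])
print(squares)  # [1, 4, 9, 16]
```

Python's built-in map(f, xs) computes the same values lazily; the hand-written version is only meant to show the recursion.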

Reduce
• Reduce converts a list into a scalar

Example list [1, 2, 3, 4]
  Want to sum the numbers in the list
  Write function g(x,y) = x+y
  Compute g(1,g(2,g(3,g(4,0)))) = 10
reduce signature: (a -> b -> b) -> b -> [a] -> b
Haskell specification
  reduce g c [] = c
  reduce g c (x:xs) = g x (reduce g c xs)
Call the function
  reduce (\x y -> x + y) 0 [1, 2, 3, 4]
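The reduce example can be sketched in Python as well. The function fold_right below is a hypothetical name for a right fold that follows the nesting g(1, g(2, g(3, g(4, 0)))); Python's own functools.reduce folds from the left instead, which gives the same answer here only because addition is associative and commutative.

```python
from functools import reduce
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def fold_right(g: Callable[[A, B], B], c: B, xs: List[A]) -> B:
    """Right fold: combine the head with the result of folding the tail."""
    if not xs:
        # Base case: an empty list reduces to the initial value c.
        return c
    return g(xs[0], fold_right(g, c, xs[1:]))


# The combining function takes two arguments: an element and the running result.
total = fold_right(lambda x, y: x + y, 0, [1, 2, 3, 4])
print(total)  # 10

# Left fold from the standard library, shown for comparison.
builtin_total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
print(builtin_total)  # 10
```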

Use in Cloud Computing
• Map can be used to clean data and "group" it
• Suppose a list of words
    words = [Bat, Volcano, bat, vulcano]
• Map to lower case
    lcase = map lowercase words
• Map to correct spelling
    s = map spellFix lcase
• Count each word
    groups = map (\x -> (x, 1)) s
    groups is [(bat, 1), (volcano, 1), (bat, 1) …

Use in Cloud Computing (continued)
• Shuffle collects tuples with the same "group" value
• Reduce combines the counts in each group
    result = reduce (+) 0 groups
• Problem - MapReduce jobs are written in a programming language (PL), e.g., Java
    Complicated
    Not reusable
    Database-like operations are common
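The word-count pipeline from the two "Use in Cloud Computing" slides (map, shuffle, reduce) can be sketched in a few lines of Python. This is an in-memory illustration only, not a Hadoop job; spell_fix and its CORRECTIONS table are hypothetical stand-ins for the slide's spellFix.

```python
from collections import defaultdict
from functools import reduce

words = ["Bat", "Volcano", "bat", "vulcano"]

# Hypothetical spelling table standing in for the slide's spellFix.
CORRECTIONS = {"vulcano": "volcano"}


def spell_fix(word: str) -> str:
    """Return the corrected spelling if one is known, else the word itself."""
    return CORRECTIONS.get(word, word)


# Map phase: clean the data, then emit (word, 1) pairs.
lcase = list(map(str.lower, words))
s = list(map(spell_fix, lcase))
groups = list(map(lambda x: (x, 1), s))

# Shuffle phase: collect the counts that share the same key.
shuffled = defaultdict(list)
for key, count in groups:
    shuffled[key].append(count)

# Reduce phase: sum the counts within each group.
result = {
    key: reduce(lambda x, y: x + y, counts, 0)
    for key, counts in shuffled.items()
}
print(result)  # {'bat': 2, 'volcano': 2}
```

Even this toy version needs explicit grouping code, which hints at why hand-written MapReduce jobs for database-like operations become repetitive.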

CouchDB - Count People per Gender

Pig Latin
• Yahoo
    40% of Hadoop jobs run using Pig
• Platform for analyzing massive data sets
• Runs on Hadoop (Map/Reduce)
• Version 0.12

What is Pig Latin?
• Dataflow language
• Non-1NF data model
    Tuples
    Sets
    Bags
• Use relational algebra-like operations to manipulate data
    Joins
    Filter - selection
    Generate - projection
• Compiles to MapReduce jobs on Hadoop cluster

Pig Latin Features
• A dataflow (NoSQL) language
    SQL is declarative, most PLs are not
    SQL is poor at expressing workflow
• Non-1NF data model
    Bags, sets, tuples, maps
    Data resides in read-only files
    Schema-less

Example
• Count subscribers in each city
    A = LOAD 'subscribers.txt' AS
        (name: chararray, city: chararray, amount: int);
    B = GROUP A BY city;
    -- after GROUP, the key field is named "group" and the grouped rows are in bag A
    C = FOREACH B GENERATE group, COUNT(A.name);
    DUMP C;

• Dataflow
    LOAD … → GROUP A … → FOREACH B …
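To make the LOAD, GROUP, FOREACH dataflow concrete, here is a small Python sketch of what each step produces. It is an in-memory analogy under stated assumptions: the rows are sample data borrowed from the Magazine Subscriber Data slide rather than read from subscribers.txt, and plain lists and dicts stand in for Pig relations and bags.

```python
from collections import defaultdict

# Stand-in for LOAD: sample rows in the shape (name, city, amount).
A = [
    ("Maya", "Logan", 20),
    ("Jose", "Logan", 15),
    ("Knut", "Ogden", 20),
]

# Stand-in for GROUP A BY city: one entry per city, holding a bag of rows.
B = defaultdict(list)
for row in A:
    city = row[1]
    B[city].append(row)

# Stand-in for FOREACH B GENERATE <key>, COUNT(<bag>): one output row per group.
C = [(city, len(bag)) for city, bag in B.items()]

# Stand-in for DUMP C.
for row in C:
    print(row)
# ('Logan', 2)
# ('Ogden', 1)
```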

Compilation

[Figure: Pig Latin Program → Pig Latin Compiler → Map/Reduce jobs on Hadoop (Map, Reduce, HDFS) → Result]

Data Transformations
• Relational algebra-like
    JOIN (inner and outer joins)
    FILTER (selection)
    FOREACH (projection)
    CROSS (product)
    UNION
• SQL-like
    DISTINCT
    LIMIT
    ORDER BY
    GROUP
• Non-traditional
    COGROUP
    MAPREDUCE
    FLATTEN
    RANK
    STREAM
    SAMPLE
    SPLIT

Magazine Subscriber Data

Subscribers
(Name, City, Amt, Id)
(Maya, Logan, $20, 1)
(Jose, Logan, $15, 2)
(Knut, Ogden, $20, 3)

Personal Information
(Name, Email, Id)
(Maya, maya@gmail.com, 5)
(Jose, jose@gmail.com, 6)
(Knut, knut@hotmail.com, 7)

FILTER
• A filter restricts the result
    /* Restrict to Logan subscribers */
    X = FILTER R BY city == 'Logan';

• FILTER example
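As a rough illustration of what FILTER computes, the following Python sketch applies the Logan restriction to the Subscribers rows from the Magazine Subscriber Data slide. It is only an in-memory analogy, not how Pig executes the statement; the amounts are stored as plain integers for simplicity.

```python
# Subscribers rows in the shape (name, city, amt, id).
subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]

# Keep only the rows whose city field equals "Logan" (selection).
X = [row for row in subscribers if row[1] == "Logan"]

for row in X:
    print(row)
# ('Maya', 'Logan', 20, 1)
# ('Jose', 'Logan', 15, 2)
```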

Subscribers

Personal Information
(Name, Email, Id)
(Maya, maya@gmail.com, 5)
(Jose, jose@gmail.com, 6)
(Knut, knut@hotmail.com, 7)

B = JOIN Subscribers BY name, PerInfo BY name;

(Name, City, Amt, Id, Name, Email, Id)
(Maya, Logan, $20, 1, Maya, maya@gmail.com, 5)
(Jose, Logan, $15, 2, Jose, jose@gmail.com, 6)
(Knut, Ogden, $20, 3, Knut, knut@hotmail.com, 7)
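The join result can be reproduced with a small Python sketch of an inner join on name. It assumes the two relations fit in memory and uses a simple hash lookup; this is one common way to compute an equi-join, not necessarily the strategy Pig chooses. The variable per_info stands in for the PerInfo relation, and amounts are stored as integers.

```python
from collections import defaultdict

# Subscribers: (name, city, amt, id)
subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]

# Personal Information: (name, email, id)
per_info = [
    ("Maya", "maya@gmail.com", 5),
    ("Jose", "jose@gmail.com", 6),
    ("Knut", "knut@hotmail.com", 7),
]

# Build a lookup from name to the matching personal-information rows.
by_name = defaultdict(list)
for p in per_info:
    by_name[p[0]].append(p)

# Inner join: concatenate each subscriber row with every row sharing its name.
B = [s + p for s in subscribers for p in by_name.get(s[0], [])]

for row in B:
    print(row)
```

As in the slide's result, both name columns and both id columns survive the join, so each output row has seven fields.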


Optimization

[Figure: dataflow graph over relations A through E with FILTER, JOIN, FILTER, and CROSS steps, grouped into two Map/Reduce jobs]
