Apache Pig

Upload: dina-mckinney

Post on 12-Jan-2016


TRANSCRIPT

Page 1: Apache Pig. Agenda What is Apache Pig How to Setup Tutorial Examples

Apache Pig

Page 2:

Agenda

• What is Apache Pig
• How to Set Up
• Tutorial Examples

Page 3:

PIG Introduction

• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

• Pig generates and compiles one or more Map/Reduce programs on the fly

[Figure: Pig Latin Scripts → PIG (Parse, Compile, Optimize, Plan)]

Page 4:

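With Pig, is Map-Reduce easier?!

• Apache Pig is a high-level query language for processing large-scale data
• It is well suited to working with large semi-structured data sets
• Compared with writing large-scale data processing programs in languages such as Java or C++, it is 16 times less difficult, and the same result takes 20 times less code.

• Pig components
– Pig Shell (Grunt)
– Pig Language (Latin)
– Libraries (Piggy Bank)
– UDF: user-defined functions

Figure source: http://www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing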

Page 5:

The Elephant Meets the Pig (Setup)

• Extract the archive

cd /home/hadoop
wget http://archive.cloudera.com/cdh5/cdh/5/pig-0.12.0-cdh5.3.2.tar.gz
tar -zxvf pig-0.12.0-cdh5.3.2.tar.gz
mv pig-0.12.0-cdh5.3.2 pig

• Edit ~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PIG_HOME=/home/hadoop/pig
export PATH=$PATH:$PIG_HOME/bin

• Start the Pig shell

$ pig
grunt> ls /
hdfs://master:9000/hadoop <dir>
hdfs://master:9000/tmp <dir>
hdfs://master:9000/user <dir>
grunt>

Page 6:

Programming Even a Pig Can Do

Pig Latin

Function           Commands
Load               LOAD
Store              STORE
Data processing    REGEX_EXTRACT, FILTER, FOREACH, GROUP, JOIN, UNION, SPLIT, …
Aggregation        AVG, COUNT, MAX, MIN, SIZE, …
Math               ABS, RANDOM, ROUND, …
String handling    INDEXOF, SUBSTRING, REGEX_EXTRACT, …
Debug              DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE
HDFS               cat, ls, cp, mkdir, …

$ pig -x local
grunt> A = LOAD 'file1' AS (x, y, z);
grunt> B = FILTER A BY y > 10000;
grunt> STORE B INTO 'output';
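To make the three statements above concrete, here is a minimal plain-Python sketch of what LOAD, FILTER, and STORE compute. It is not how Pig runs them (Pig compiles the script into Map/Reduce jobs); the sample rows and the helper names are assumptions made for illustration, and the tab delimiter reflects PigStorage's default.

```python
def load(text, delimiter="\t"):
    """LOAD: split each non-empty line into a tuple of fields (x, y, z)."""
    return [tuple(line.split(delimiter)) for line in text.splitlines() if line]


def filter_by_y(rows, threshold=10000):
    """FILTER ... BY y > threshold: y is the second field, compared as a number."""
    return [row for row in rows if int(row[1]) > threshold]


def store(rows, delimiter="\t"):
    """STORE: serialize the surviving rows, one delimited line per tuple."""
    return "".join(delimiter.join(row) + "\n" for row in rows)


# Hypothetical contents of 'file1' (not taken from the slides).
sample = "a\t5000\tx\nb\t20000\ty\nc\t10001\tz\n"

result = store(filter_by_y(load(sample)))
print(result, end="")
```

Only the rows whose second field exceeds 10000 reach the output, which is all the Pig script asks for.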

Page 7:

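Exercise 1:

• Scenario:
– My boss asked me to compute the average working hours of every employee in the organization. So I obtained the clock-in records for all of Taiwan (the time clock log files) and got the employee id mapping table from the HR department. The data is both plentiful and large, so I thought of feeding it into Hadoop's HDFS, ... and then
• Problem:
– To write MapReduce, I started learning Java, object-oriented programming, the Hadoop API, … @@
• Solutions:
– Hack the program together with shell scripts and give up on Hadoop
– Use PIG
– Buckle down and study day and night, skipping meals and sleep…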

Page 8:

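MapReduce code before the makeover

Data values are kept as in the source files: 劉 (Liu), 李 (Li), 王 (Wang); 北 (north), 中 (central).

Input 1, employee table:

nm  dp  id
劉  北  A1
李  中  B1
王  中  B2

Input 2, clock-in log:

id  dt   hr
A1  7/7  13
A1  7/8  12
A1  7/9  4

Java Code (Map-Reduce)

Intermediate rows:

A1  劉  北  7/7  13
A1  劉  北  7/8  12

Result:

A1  劉  北  Jul  12.5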

Page 9:

After the Pig makeover

Result: 北 A1 劉 12.5

Logical Plan, with the schema after each step:

LOAD, LOAD   (nm, dp, id) and (id, dt, hr)
FILTER       (nm, dp, id) and (id, dt, hr)
JOIN         (nm, dp, id, id, dt, hr)
GROUP        (group, {(nm, dp, id, id, dt, hr)})
FOREACH      (group, …., AVG(hr))
STORE        (dp, group, nm, hr)

Pig Latin:

-- employee table and clock-in log, both comma-separated
A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
B = LOAD 'file2.txt' USING PigStorage(',') AS (id, dt, hr);
-- keep only records with more than 8 hours
C = FILTER B BY hr > 8;
D = JOIN C BY id, A BY id;
E = GROUP D BY A::id;
-- average the hours within each employee's group
F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
STORE F INTO '/tmp/pig_output/';

Tip: validate with a small amount of data in pig -x local mode first; after each line, use DUMP or ILLUSTRATE to check that the result is correct.
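The FILTER, JOIN, GROUP, and FOREACH steps of this exercise can be traced by hand with a minimal plain-Python sketch that uses the sample rows from the slides. It only illustrates the data flow, not Pig's execution. One assumption to note: the sketch takes nm and dp from the first row of each group so that it yields the single row shown as the result, whereas `$1.dp` and `$1.nm` in the Pig script project bags.

```python
from collections import defaultdict

# Sample rows from the slides.
employees = [("劉", "北", "A1"), ("李", "中", "B1"), ("王", "中", "B2")]  # A: (nm, dp, id)
clock_log = [("A1", "7/7", 13), ("A1", "7/8", 12), ("A1", "7/9", 4)]      # B: (id, dt, hr)

# C = FILTER B BY hr > 8
c = [row for row in clock_log if row[2] > 8]

# D = JOIN C BY id, A BY id  (inner join on the id column)
d = [
    (cid, dt, hr, nm, dp, aid)
    for (cid, dt, hr) in c
    for (nm, dp, aid) in employees
    if cid == aid
]

# E = GROUP D BY A::id  (one bag of joined rows per employee id)
e = defaultdict(list)
for row in d:
    e[row[5]].append(row)

# F = one output row per group: (dp, group, nm, AVG(hr))
f = []
for emp_id, bag in e.items():
    hours = [row[2] for row in bag]
    nm, dp = bag[0][3], bag[0][4]  # assumption: scalar taken from the first row
    f.append((dp, emp_id, nm, sum(hours) / len(hours)))

print(f)
```

The 4-hour record is dropped by the filter, so the average is taken over 13 and 12 only. Employees with no surviving clock-in rows (B1, B2) disappear in the inner join.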

Page 10:

Exercise 1: Hands-on

cd ~
git clone https://github.com/waue0920/hadoop_example.git
cd ~/hadoop_example/pig/ex1
pig -x local -f exc1.pig
cat /tmp/pig_output/part-r-00000

Exercise: run pig -x mapreduce, execute each line of exc1.pig on its own, and use DUMP and ILLUSTRATE to inspect the results, for example:

grunt> A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
grunt> DUMP A;
grunt> ILLUSTRATE A;

Q: Is there room to improve the result?
Q: How would you improve it?

Page 11:

Advanced

Simple Types   Description                                        Example
int            Signed 32-bit integer                              10
long           Signed 64-bit integer                              Data: 10L or 10l; Display: 10L
float          32-bit floating point                              Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double         64-bit floating point                              Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray      Character array (string) in Unicode UTF-8 format   hello world
bytearray      Byte array (blob)
boolean        boolean                                            true/false (case insensitive)
datetime       datetime                                           1970-01-01T00:00:00.000+00:00
biginteger     Java BigInteger                                    200000000000
bigdecimal     Java BigDecimal                                    33.45678332

Complex Types  Description                                        Example
Fields         A piece of data                                    John
tuple          An ordered set of fields.                          (19,2)
bag            A collection of tuples.                            {(19,2), (18,1)}
map            A set of key value pairs.                          [open#apache]

Page 12:

Advanced

• cat data;

(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)

• A = LOAD 'data' AS ( t1:tuple(t1a:int,t1b:int,t1c:int), t2:tuple(t2a:int,t2b:int,t2c:int) );

Contents of A:

((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))

• X = FOREACH A GENERATE t1.t1a,t2.$0;

Contents of X:

(3,4)
(1,3)
(2,9)
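The projection in the FOREACH statement mixes two ways of addressing a field inside a tuple: by name (`t1.t1a`) and by position (`t2.$0`). A minimal plain-Python sketch of the same lookup on the rows shown above follows; the constant `T1_FIELDS` and the function `project` are illustrative names, not part of Pig.

```python
# Rows of relation A: each row holds two tuples, t1 and t2.
a = [
    ((3, 8, 9), (4, 5, 6)),
    ((1, 4, 7), (3, 7, 5)),
    ((2, 5, 8), (9, 5, 8)),
]

# Field names declared for t1 in the LOAD schema.
T1_FIELDS = ("t1a", "t1b", "t1c")


def project(row):
    """Return (t1.t1a, t2.$0) for one row of A."""
    t1, t2 = row
    by_name = t1[T1_FIELDS.index("t1a")]  # t1.t1a: resolve the name to a position
    by_position = t2[0]                   # t2.$0: positions start at 0
    return (by_name, by_position)


x = [project(row) for row in a]
print(x)  # [(3, 4), (1, 3), (2, 9)]
```

Both forms end up as a positional lookup; the name is simply resolved through the schema first.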

Page 13:

Advanced

Page 14:

Advanced

Page 15:

Advanced

Data Types and More
• Complex Types: Bags, Tuples, Fields, Map
• Simple Types: int, long, float, double, chararray, bytearray, boolean, datetime, biginteger, bigdecimal

Relational Operators
• ASSERT, COGROUP, CROSS, CUBE, DEFINE, DISTINCT, FILTER, FOREACH, GROUP, IMPORT, JOIN (inner), JOIN (outer), LIMIT, LOAD, MAPREDUCE, ORDER BY, RANK, SAMPLE, SPLIT, STORE, STREAM, UNION

UDF Statements
• DEFINE, REGISTER

Page 16:

Exercise 2

• Description: using arrays of numbers, observe Pig's syntax and how the results change

• Techniques used: filter .. by, group .. by, foreach .. generate, cogroup

• Input / output: myfile.txt, B.txt

See: https://wiki.apache.org/pig/PigLatin (last edited 2010)

Page 17:

Exercise 2

cd ~/hadoop_example/pig/ex2
hadoop fs -put myfile.txt B.txt ./
pig -x mapred
grunt> A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
grunt> B = LOAD 'B.txt'; dump A; dump B;
grunt> Y = FILTER A BY f1 == '8'; dump Y;
grunt> Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1)); dump Y;
grunt> X = GROUP A BY f1; dump X;
grunt> X = FOREACH A GENERATE f1, f2; dump X;
grunt> X = FOREACH A GENERATE f1+f2 as sumf1f2; dump X;
grunt> Y = FILTER X BY sumf1f2 > 5.0; dump Y;
grunt> C = COGROUP A BY $0, B BY $0; dump C;
grunt> C = COGROUP A BY $0 INNER, B BY $0 INNER; dump C;
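The difference between GROUP, COGROUP, and COGROUP with INNER is easiest to see on a small data set. The plain-Python sketch below is an illustration only: the rows in `a` and `b` are assumed sample values (the actual contents of myfile.txt and B.txt are not shown on the slides), and `group` and `cogroup` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed sample data standing in for myfile.txt (A) and B.txt (B).
a = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def group(relation, key=0):
    """GROUP rel BY $key: map each key value to the bag of rows that carry it."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key]].append(row)
    return dict(bags)


def cogroup(left, right, key=0, inner=False):
    """COGROUP left BY $key, right BY $key: one (key, left bag, right bag) row per key."""
    left_bags, right_bags = group(left, key), group(right, key)
    # Keys are sorted only to make the printout stable; Pig promises no order here.
    keys = sorted(set(left_bags) | set(right_bags))
    rows = [(k, left_bags.get(k, []), right_bags.get(k, [])) for k in keys]
    if inner:
        # INNER on both sides drops keys for which either bag is empty.
        rows = [row for row in rows if row[1] and row[2]]
    return rows


x = group(a)
c_outer = cogroup(a, b)
c_inner = cogroup(a, b, inner=True)

for row in c_outer:
    print(row)
```

With this data, keys 2 and 7 appear in only one relation, so they show up in the plain COGROUP with an empty bag on one side and vanish once INNER is added.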

Page 18:

Exercise 3

• Description: from a log file of <userid, time, query_term> records, analyze the keywords that users favor

• Techniques used: UDF, DISTINCT, FLATTEN, ORDER

• Source: pigtutorial.tar.gz
• Input / output

See: https://cwiki.apache.org/confluence/display/PIG/PigTutorial

Page 19:

Exercise 3

cd ~/hadoop_example/pig/ex3
pig -x local
grunt> REGISTER ./tutorial.jar;
grunt> raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
grunt> clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
grunt> clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
grunt> houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
grunt> ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
grunt> ngramed2 = DISTINCT ngramed1;
grunt> hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
grunt> hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
grunt> uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
grunt> uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
grunt> uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
grunt> filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
grunt> ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;
grunt> STORE ordered_uniq_frequency INTO 'result' USING PigStorage();

$ pig -x local -f script1-local.pig
$ cat result/part-r-00000
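The middle of the script above (DISTINCT, then GROUP BY (ngram, hour), then COUNT) can be sketched in plain Python. The rows below are hypothetical stand-ins for `ngramed1`, not data from excite-small.log, and the tutorial's UDFs such as NGramGenerator and ScoreGenerator are deliberately left out.

```python
from collections import Counter

# Hypothetical (user, hour, ngram) rows standing in for ngramed1.
ngramed1 = [
    ("u1", "09", "pig latin"),
    ("u1", "09", "pig latin"),  # exact duplicate row; DISTINCT removes it
    ("u2", "09", "pig latin"),
    ("u2", "10", "hadoop"),
    ("u3", "10", "hadoop"),
    ("u3", "10", "pig latin"),
]

# ngramed2 = DISTINCT ngramed1
# (sorted only so the result is stable to print; DISTINCT itself is about uniqueness)
ngramed2 = sorted(set(ngramed1))

# hour_frequency1 = GROUP ngramed2 BY (ngram, hour)
# hour_frequency2 = FOREACH ... GENERATE flatten($0), COUNT($1)
hour_frequency2 = Counter((ngram, hour) for _user, hour, ngram in ngramed2)

for (ngram, hour), count in sorted(hour_frequency2.items()):
    print(ngram, hour, count)
```

Running DISTINCT before the grouping means a user who repeats the same n-gram within an hour is counted once, so the count reflects distinct (user, hour, ngram) rows rather than raw query volume.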

Page 20:

UDF

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval function that upper-cases its first argument.
public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        // Return null for a missing or empty input tuple.
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            // Any failure, including a null field, surfaces as an IOException.
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}

grunt> REGISTER myudfs.jar;
grunt> A = LOAD 'student_data' USING PigStorage(',') AS (name:chararray, age:int, gpa:double);
grunt> B = FOREACH A GENERATE myudfs.UPPER(name);
grunt> DUMP B;

Page 21:

Reference

• Pig documentation
– http://pig.apache.org/docs/r0.12.0/basic.html

• Pig reference slides
– http://www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing

• Pig example reference
– https://cwiki.apache.org/confluence/display/PIG/PigTutorial