Map Reduce Programming Waue Chen


Post on 20-Jan-2016


Map Reduce Programming

Waue Chen


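Why ?

Moore's law ?
In its popular form: CPU clock frequency doubles every 18 months
This stopped holding around 2005

The era of multi-core and parallel computing has arrived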


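What is Hadoop

Hadoop is an open source framework for parallel, distributed programs that runs on large-scale clusters

It provides a distributed file system, HDFS, for storing data across the nodes

Highly fault tolerant: failed nodes are handled automatically
Implements Google's MapReduce algorithm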


What is MapReduce
Splits an application into many small units of work
Each unit can be executed or computed on any node


MapReduce: Example
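A minimal sketch of the map, shuffle, and reduce steps in plain Java, with no Hadoop dependency. The class name MapReduceSketch and its methods are illustrative assumptions, not Hadoop APIs; the sketch only shows how word counting decomposes into small independent units of work.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {

    // Map step: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group every emitted value under its key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse all values of one key into a single sum.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, map=1, reduce=1}
        System.out.println(wordCount(Arrays.asList("hello hadoop", "hello map reduce")));
    }
}
```

Each call to map depends only on its own line, and each call to reduce only on the values of its own key, which is what lets a framework place those calls on any node.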


MapReduce in Parallel: Example
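The parallel case can be sketched on a single machine with a thread pool standing in for cluster nodes. This is an assumption-laden illustration, not how Hadoop schedules tasks: ParallelWordCountSketch, countSplit, and the workers parameter are hypothetical names, and each map task here also pre-sums its own split (roughly what a combiner would do) before the final merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelWordCountSketch {

    // Map task: count the words of one input split, independently of the others.
    static Map<String, Integer> countSplit(String split) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                partial.merge(word, 1, Integer::sum);
            }
        }
        return partial;
    }

    // Runs one map task per split on a thread pool, then reduces the partial counts.
    static Map<String, Integer> wordCount(List<String> splits, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (String split : splits) {
                Callable<Map<String, Integer>> task = () -> countSplit(split);
                futures.add(pool.submit(task));
            }
            // Reduce: merge the partial counts key by key.
            Map<String, Integer> total = new TreeMap<>();
            for (Future<Map<String, Integer>> future : futures) {
                future.get().forEach((word, n) -> total.merge(word, n, Integer::sum));
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("word count task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("hello hadoop", "hello map", "map reduce");
        // prints {hadoop=1, hello=2, map=2, reduce=1}
        System.out.println(wordCount(splits, 2));
    }
}
```

The result does not depend on which worker handles which split, because the merge step is a sum and sums can be combined in any order.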


Thinking in Hadoop: MapReduce

HDFS
Map Class
Reduce Class
Overall Configuration


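Program prototype

class MR {

    // Map phase: emits intermediate key/value pairs
    class Map … { }

    // Reduce phase: merges the values that share a key
    class Reduce … { }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MR.class);               // job configuration
        conf.setInputPath(new Path("the_path_of_HDFS"));    // input location in HDFS
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        JobClient.runJob(conf);                             // submit the job and wait
    }
}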


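Word Count Sample

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

class WordCount {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // set path
        conf.setInputPath(new Path("/user/waue/input"));
        conf.setOutputPath(new Path("counts"));
        // remove output left over from an earlier run
        FileSystem.get(conf).delete(new Path("counts"));
        // set map reduce
        conf.setOutputKeyClass(Text.class);          // set every word as key
        conf.setOutputValueClass(IntWritable.class); // set 1 as value
        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(ReduceClass.class);
        conf.setNumMapTasks(1);
        conf.setNumReduceTasks(1);
        // run
        JobClient.runJob(conf);
    }
}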


Word Count Sample

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // key: byte offset of the line, value: the line itself
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);   // emit (word, 1)
        }
    }
}



Word Count Sample

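import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    IntWritable sumValue = new IntWritable();

    // key: one word, values: every 1 emitted for that word
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext())
            sum += values.next().get();
        sumValue.set(sum);
        output.collect(key, sumValue);   // emit (word, total)
    }
}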



Result


MapReduce with HBase prototype

class MR_HBase {

    class Map extends … { }

    class Reduce extends … { }

    public static void main(String[] args) throws Exception {
        JobConf conf = new …;
        conf.setInputPath(…);
        conf.setMapperClass(…);
        conf.setReducerClass(…);
        JobClient.runJob(conf);
    }
}

HBase API: TableMap (for Map), TableReduce (for Reduce)


WordCountIntoHbase Sample

class WordCountIntoHbase {

    public static void main(String[] args) throws Exception {
        // create the target table unless it already exists
        // (BuildHTable is the sample's own helper, not an HBase API class;
        //  Table_Name and ColumnF are set elsewhere)
        BuildHTable build_table = new BuildHTable(Table_Name, ColumnF);
        if (!build_table.checkTableExist(Table_Name)) {
            if (!build_table.createTable())
                System.err.println("create table error !");
        } else
            System.out.println("Table existed !");

        JobConf conf = new JobConf(WordCountIntoHbase.class);
        conf.setJobName("wordcount");
        conf.setInputPath(new Path("/user/waue/input"));
        // settings carried over from WordCount, disabled here
        //conf.setOutputPath(new Path("counts"));
        //FileSystem.get(conf).delete(new Path(wc.outputPath));
        //conf.setOutputKeyClass(Text.class); // set every word as key
        //conf.setOutputValueClass(IntWritable.class); // set 1 as value
        //conf.setMapperClass(MapClass.class);
        conf.setReducerClass(ReduceClass.class);
        conf.setNumMapTasks(0);
        conf.setNumReduceTasks(1);
        JobClient.runJob(conf);
    }
}


WordCountIntoHbase Sample

class ReduceClass extends TableReduce<LongWritable, Text> {

    // column to write: family "word", qualifier "text"
    Text col = new Text("word:text");
    private MapWritable map = new MapWritable();

    public void reduce(LongWritable key, Iterator<Text> values,
            OutputCollector<Text, MapWritable> output, Reporter reporter)
            throws IOException {
        // wrap the first value of this key as the cell content
        ImmutableBytesWritable bytes
                = new ImmutableBytesWritable(values.next().getBytes());
        map.clear();
        map.put(col, bytes);
        // the row key is the input key rendered as text
        output.collect(new Text(key.toString()), map);
    }
}



Result


WordCountFromHbase

Counts the words stored in HBase; run it after WordCountIntoHbase.

In Trac: http://trac.nchc.org.tw/cloud/browser/sample/hadoop-0.16/tw/org/nchc/code/WordCountFromHBase.java


What's HBaseRecordPro
Parses your records
Creates the HBase table
Sets the first line as the column qualifiers
Stores the records in HBase
Automatically
Locally
http://trac.nchc.org.tw/cloud/wiki/HBaseRecordPro


HBaseRecordPro

name:locate:years
waue:taiwan:1981
rock:taiwan:1981
aso:taiwan:1981
jazz:taiwan:1982

Run HBaseRecordPro.java

hql> Select * from Table;
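The parsing idea behind this input format (first line as column qualifiers, one colon-separated record per following line) can be sketched in plain Java. This is not the HBaseRecordPro source: RecordParseSketch and parse are hypothetical names, and the sketch stops at building rows in memory instead of writing them to HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecordParseSketch {

    // Parses colon-separated records; the first line supplies the column qualifiers.
    static List<Map<String, String>> parse(List<String> lines) {
        String[] qualifiers = lines.get(0).split(":");
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split(":");
            if (fields.length != qualifiers.length) {
                throw new IllegalArgumentException("Malformed record: " + line);
            }
            // LinkedHashMap keeps the columns in header order.
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < qualifiers.length; i++) {
                row.put(qualifiers[i], fields[i]);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "name:locate:years",
                "waue:taiwan:1981",
                "rock:taiwan:1981",
                "aso:taiwan:1981",
                "jazz:taiwan:1982");
        for (Map<String, String> row : parse(input)) {
            System.out.println(row);   // first row prints {name=waue, locate=taiwan, years=1981}
        }
    }
}
```

Each resulting map corresponds to one row, with the header names playing the role of column qualifiers.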


Detailed Code Explanation

Apache log parser
http://trac.nchc.org.tw/cloud/wiki/LogParser


More .. ?

Enjoy: http://trac.nchc.org.tw/cloud/

How to code Hadoop in Eclipse
http://trac.nchc.org.tw/cloud/browser/hadoop-eclipse.pdf
Map Reduce in Hadoop/HBase Manual
http://trac.nchc.org.tw/cloud/wiki/MR_manual
My code sources
http://trac.nchc.org.tw/cloud/browser/sample/hadoop-0.16/tw/org/nchc/code


Then .. ?

Intrusion-Detection-System log parser
Count => the last
Format => 6 lines / 1 cell

Apache Pig
Pig is a platform for analyzing large data sets that consists of a high-level language

Sample alert from the IDS log:

[**] [1:2189:3] BAD-TRAFFIC IP Proto 103 PIM [**]
[Classification: Detection of a non-standard protocol or event] [Priority: 2]
07/08-14:57:56.500718 140.110.138.253 -> 224.0.0.13
PIM TTL:1 TOS:0xC0 ID:11078 IpLen:20 DgmLen:54
[Xref => http://cve.mitre.org/cgi-bin/cvename.cgi?name=2003-0567][Xref => http://www.securityfocus.com/bid/8211]
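As a rough sketch of what parsing one such alert might involve, the following plain-Java example pulls a few fields out of the header line and the timestamp line with regular expressions. AlertParseSketch and its field layout are assumptions made for illustration; reading the bracketed triple as generator id, signature id, and revision follows the usual Snort convention and is not stated in the slides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlertParseSketch {

    // Header line: [**] [gid:sid:rev] message [**]
    private static final Pattern HEADER =
            Pattern.compile("\\[\\*\\*\\] \\[(\\d+):(\\d+):(\\d+)\\] (.*?) \\[\\*\\*\\]");

    // Flow line: timestamp source -> destination
    private static final Pattern FLOW =
            Pattern.compile("(\\S+) (\\S+) -> (\\S+)");

    // Returns {signature id, message, timestamp, source, destination}.
    static String[] parse(String headerLine, String flowLine) {
        Matcher h = HEADER.matcher(headerLine);
        Matcher f = FLOW.matcher(flowLine);
        if (!h.find() || !f.find()) {
            throw new IllegalArgumentException("Unrecognized alert format");
        }
        return new String[] {
            h.group(2), h.group(4), f.group(1), f.group(2), f.group(3)
        };
    }

    public static void main(String[] args) {
        String[] fields = parse(
                "[**] [1:2189:3] BAD-TRAFFIC IP Proto 103 PIM [**]",
                "07/08-14:57:56.500718 140.110.138.253 -> 224.0.0.13");
        System.out.println(String.join(" | ", fields));
    }
}
```

In a MapReduce job, the extracted fields could serve as the key and value emitted by the map step; grouping the multi-line alert into one record beforehand is left out of this sketch.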


References

API
http://hadoop.apache.org/hbase/docs/current/api/index.html
http://hadoop.apache.org/core/docs/r0.16.4/api/index.html

Distributed parallel programming with Hadoop (用 Hadoop 進行分佈式並行編程)
http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop1/index.html