Introduction to Google MapReduce

WING Group Meeting, 13 Oct 2006, Hendra Setiawan


Page 1: Introduction to  Google MapReduce

Introduction to Google MapReduce

WING Group Meeting

13 Oct 2006

Hendra Setiawan

Page 2: Introduction to  Google MapReduce

What is MapReduce?

- A programming model (and its associated implementation)
- For processing large data sets
- Exploits a large set of commodity computers
- Executes processes in a distributed manner
- Offers a high degree of transparency
- In other words: simple, and maybe suitable for your tasks!

Page 3: Introduction to  Google MapReduce

Distributed Grep

[Figure: very big data is split into four chunks; grep runs on each chunk and produces matches; cat combines them into all matches.]

Page 4: Introduction to  Google MapReduce

Distributed Word Count

[Figure: very big data is split into four chunks; count runs on each chunk and produces a count; merge combines them into the merged count.]

Page 5: Introduction to  Google MapReduce

Map Reduce

Map:
- Accepts an input key/value pair
- Emits intermediate key/value pairs

Reduce:
- Accepts an intermediate key/value* pair
- Emits output key/value pairs

[Figure: Very big data -> MAP -> Partitioning Function -> REDUCE -> Result]
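The map and reduce contract on this slide can be sketched in a single process. The following is only an illustration under stated assumptions: the names `MiniMapReduce`, `Mapper`, `Reducer`, `run`, and `wordCount` are hypothetical, nothing is distributed, and this is not the Google or Hadoop API. It shows map emitting intermediate pairs, the pairs being grouped by key, and reduce receiving each key with all of its values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Single-process sketch of the map -> group by key -> reduce flow (hypothetical names). */
class MiniMapReduce {

    /** Map: accepts one input key/value pair and emits intermediate pairs. */
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
    }

    /** Reduce: accepts one intermediate key with all its values and emits output pairs. */
    interface Reducer<K2, V2, K3, V3> {
        void reduce(K2 key, List<V2> values, BiConsumer<K3, V3> emit);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3 extends Comparable<K3>, V3>
            Map<K3, V3> run(Map<K1, V1> input,
                            Mapper<K1, V1, K2, V2> mapper,
                            Reducer<K2, V2, K3, V3> reducer) {
        // Map phase: collect intermediate values, grouped by intermediate key.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> e : input.entrySet()) {
            mapper.map(e.getKey(), e.getValue(),
                    (k, v) -> grouped.computeIfAbsent(k, unused -> new ArrayList<>()).add(v));
        }
        // Reduce phase: one call per distinct intermediate key.
        Map<K3, V3> output = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            reducer.reduce(e.getKey(), e.getValue(), output::put);
        }
        return output;
    }

    /** Demo job: input is (document name, text), output is (word, count). */
    static Map<String, Integer> wordCount(Map<String, String> documents) {
        Mapper<String, String, String, Integer> mapper = (name, text, emit) -> {
            for (String w : text.trim().split("\\s+")) {
                if (!w.isEmpty()) emit.accept(w, 1);
            }
        };
        Reducer<String, Integer, String, Integer> reducer = (word, counts, emit) -> {
            int sum = 0;
            for (int c : counts) sum += c;
            emit.accept(word, sum);
        };
        return run(documents, mapper, reducer);
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("a.txt", "to be or not to be");
        docs.put("b.txt", "be quick");
        System.out.println(wordCount(docs)); // prints {be=3, not=1, or=1, quick=1, to=2}
    }
}
```

In a real deployment the grouping step is where the partitioning function and the network shuffle sit; here it is just an in-memory sorted map.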

Page 6: Introduction to  Google MapReduce

Partitioning Function

Page 7: Introduction to  Google MapReduce

Partitioning Function (2)

- Default: hash(key) mod R
- Guarantee:
  - Relatively well-balanced partitions
  - Ordering guarantee within partition
- Distributed Sort
  - Map: emit(key, value)
  - Reduce (with R=1): emit(key, value)
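A small sketch may help connect the default partitioner to the sort example. It assumes string keys and uses Java's `hashCode()` as the hash; `PartitionSketch`, `partition`, `shuffle`, and `sort` are hypothetical names, and everything runs in one process. Each partition is a sorted map, which models the ordering guarantee within a partition, so with R = 1 an identity map and identity reduce yield globally sorted output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of hash(key) mod R partitioning and the R = 1 sort (hypothetical names). */
class PartitionSketch {

    /** Default partitioner from the slide: hash(key) mod R. */
    static int partition(String key, int numReducers) {
        // floorMod keeps the result in [0, R) even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numReducers);
    }

    /** Routes each (key, value) pair to its partition; keys stay ordered within a partition. */
    static List<TreeMap<String, List<String>>> shuffle(List<String[]> pairs, int numReducers) {
        List<TreeMap<String, List<String>>> partitions = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            partitions.add(new TreeMap<>());
        }
        for (String[] kv : pairs) {
            partitions.get(partition(kv[0], numReducers))
                      .computeIfAbsent(kv[0], k -> new ArrayList<>())
                      .add(kv[1]);
        }
        return partitions;
    }

    /** Distributed sort from the slide: identity map, identity reduce, R = 1. */
    static List<String[]> sort(List<String[]> pairs) {
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle(pairs, 1).get(0).entrySet()) {
            for (String v : e.getValue()) {
                out.add(new String[] {e.getKey(), v});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"pear", "3"},
                new String[] {"apple", "1"},
                new String[] {"fig", "2"});
        for (String[] kv : sort(pairs)) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
    }
}
```

With R greater than 1 the same code produces R outputs that are each sorted but not sorted relative to one another, which is why the slide fixes R = 1 for a total order.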

Page 8: Introduction to  Google MapReduce

MapReduce

- Distributed Grep
  - Map: if match(value, pattern) emit(value, 1)
  - Reduce: emit(key, sum(value*))
- Distributed Word Count
  - Map: for all w in value do emit(w, 1)
  - Reduce: emit(key, sum(value*))
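The grep pseudocode above can be tried out in plain Java. This is a single-process sketch, assuming the input values are lines of text and the pattern is a regular expression; `GrepSketch` and `grep` are hypothetical names. The map step emits (line, 1) for each matching line, and the reduce step sums the values per key, so the result maps each distinct matching line to the number of times it occurs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Single-process sketch of the distributed grep pseudocode (hypothetical names). */
class GrepSketch {

    static Map<String, Integer> grep(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Map: if match(value, pattern) emit(value, 1)
            if (pattern.matcher(line).find()) {
                // Reduce: emit(key, sum(value*)), folded in as each 1 arrives
                counts.merge(line, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "error: disk full",
                "ok",
                "error: disk full",
                "warning: error rate high");
        // prints {error: disk full=2, warning: error rate high=1}
        System.out.println(grep(lines, "error"));
    }
}
```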

Page 9: Introduction to  Google MapReduce

MapReduce Transparencies

Plus Google Distributed File System:
- Parallelization
- Fault-tolerance
- Locality optimization
- Load balancing

Page 10: Introduction to  Google MapReduce

Suitable for your task if

- You have a cluster
- You are working with a large dataset
- You are working with independent data (or data assumed to be independent)
- The task can be cast into map and reduce

Page 11: Introduction to  Google MapReduce

MapReduce outside Google

- Hadoop (Java)
  - Emulates MapReduce and GFS
- The architecture of Hadoop MapReduce and DFS is master/slave

             Master       Slave
MapReduce    jobtracker   tasktracker
DFS          namenode     datanode

Page 12: Introduction to  Google MapReduce

Example Word Count (1)

Map

// Imports needed at the top of the enclosing file:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Mapper: emits (word, 1) for every whitespace-separated token in the input line.
public static class MapClass extends MapReduceBase implements Mapper {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    String line = ((Text) value).toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

Page 13: Introduction to  Google MapReduce

Example Word Count (2)

Reduce

// Reducer: sums all counts received for one word and emits (word, total).
// Also needs: import java.util.Iterator;
public static class Reduce extends MapReduceBase implements Reducer {
  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += ((IntWritable) values.next()).get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Page 14: Introduction to  Google MapReduce

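Example Word Count (3)

Main

// Driver: configures the job and submits it.
// Also needs: import org.apache.hadoop.fs.Path;
public static void main(String[] args) throws IOException {
  //checking goes here
  JobConf conf = new JobConf();

  // Types of the output key/value pairs.
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  // Reduce doubles as the combiner, pre-summing counts on the map side.
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  // args[0] is the input path, args[1] the output path.
  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));

  JobClient.runJob(conf);
}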
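The driver registers the same Reduce class as both combiner and reducer. That is safe for word count because addition is associative and commutative, so partial sums computed on the map side can simply be summed again. The sketch below illustrates the idea in plain Java with no Hadoop classes; `CombinerSketch`, `mapAndCombine`, and `reduce` are hypothetical names, and each string stands in for one input split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of why a summing reducer can also serve as a combiner (hypothetical names). */
class CombinerSketch {

    /** Map plus combine for one input split: local per-word sums. */
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> local = new TreeMap<>();
        for (String w : split.trim().split("\\s+")) {
            if (!w.isEmpty()) local.merge(w, 1, Integer::sum);
        }
        return local;
    }

    /** Reduce: applies the same summing operation to the partial sums from all splits. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((w, n) -> total.merge(w, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String split : Arrays.asList("a b a", "b a")) {
            partials.add(mapAndCombine(split));
        }
        System.out.println(reduce(partials)); // prints {a=3, b=2}
    }
}
```

Without the combiner, the first split would send three (word, 1) pairs to the reducers; with it, only two pre-summed pairs leave the map side.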

Page 15: Introduction to  Google MapReduce

One time setup

- Set hadoop-site.xml and slaves
- Initiate namenode
- Run Hadoop MapReduce and DFS
- Upload your data to DFS
- Run your process…
- Download your data from DFS

Page 16: Introduction to  Google MapReduce

Summary

- A simple programming model for processing large datasets on a large cluster of computers
- Fun to use: focus on the problem and let the library deal with the messy details

Page 17: Introduction to  Google MapReduce

References

- Original paper (http://labs.google.com/papers/mapreduce.html)
- On Wikipedia (http://en.wikipedia.org/wiki/MapReduce)
- Hadoop - MapReduce in Java (http://lucene.apache.org/hadoop/)
- Starfish - MapReduce in Ruby (http://rufy.com/starfish/)