MapReduce and Hadoop
Aaron Birkland
Cornell Center for Advanced Computing
Motivation
• Simple programming model for Big Data
– Distributed, parallel – but hides this
• Established success at petabyte scale
– Internet search indexes, analysis
– Google, Yahoo, Facebook
• Recently: 8000 nodes sort 10PB in 6.5 hours
• Open source frameworks with different goals
– Hadoop, Phoenix
• Lots of research in the last 5 years
– Adapt scientific computation algorithms to MapReduce, performance analysis
May 16-17, 2012 www.cac.cornell.edu 2
A programming model with some nice consequences
• Map(D) → list(Ki, Vi)
• Reduce(Ki, list(Vi)) → list(Vf)
• Map: “Apply a function to every member of dataset” to produce a list of key-value pairs
– Dataset: set of values of uniform type D
• Image blobs, lines of text, individual points, etc.
– Function: transforms each value into a list of zero or more key,value pairs of types Ki, Vi
• Reduce: Given a key and all associated values, do some processing to produce list of type Vf
• Execution over data is managed by a MapReduce framework
Canonical example: Word Count
• D = lines of text
• Ki = Single Words
• Vi = Numbers
• Vf = Word/count pairs
• Map(D) = Emit pairs containing each word and the number 1
• Reduce(Ki, list(Vi)) = Sum all the numbers in the list associated with the given word. Emit the word and the resulting count
Map(D) → list(Ki, Vi)
Reduce(Ki, list(Vi)) → list(Vf)
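The two word-count functions defined above can be sketched in a few lines of Python. This is only an illustration of the signatures, not code from any MapReduce framework; the names `wc_map` and `wc_reduce` are hypothetical.

```python
from typing import Iterable, List, Tuple


def wc_map(line: str) -> List[Tuple[str, int]]:
    """Map(D) -> list(Ki, Vi): emit (word, 1) for every word in one line."""
    return [(word, 1) for word in line.split()]


def wc_reduce(word: str, counts: Iterable[int]) -> List[Tuple[str, int]]:
    """Reduce(Ki, list(Vi)) -> list(Vf): sum the counts seen for one word."""
    return [(word, sum(counts))]


print(wc_map("absence of evidence"))
print(wc_reduce("absence", [1, 1]))
```

Neither function knows about files, nodes, or other records, which is what lets a framework decide where and when to run them.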
Canonical example: Word Count
Input (D):
absence of evidence
is not evidence of
absence
Map(D) → list(Ki, Vi):
(absence, 1) (of, 1) (evidence, 1) (is, 1) (not, 1) (evidence, 1) (of, 1) (absence, 1)
Group by key:
(absence, 1) (absence, 1) (of, 1) (of, 1) (evidence, 1) (evidence, 1) (is, 1) (not, 1)
Reduce(Ki, list(Vi)) → list(Vf):
(absence, 2) (of, 2) (evidence, 2) (is, 1) (not, 1)
Somehow need to group by keys so Reduce can be given all associated values!
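The grouping step that sits between Map and Reduce might look like the following in-memory sketch. It assumes all intermediate pairs fit in one Python list, which a real framework does not assume; `group_by_key` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value emitted for a key into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Map output for the three input lines of the word-count example.
map_output = [
    ("absence", 1), ("of", 1), ("evidence", 1),
    ("is", 1), ("not", 1), ("evidence", 1), ("of", 1),
    ("absence", 1),
]

grouped = group_by_key(map_output)
# Reduce then only has to sum each list.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)
```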
Opportunities for Parallelism?
[Figure: the word-count data flow from the previous slide (input lines, Map output, grouped pairs, Reduce output), with stages annotated "Promising", "Promising", "Worrisome"]
Opportunities for Parallelism
• Map and Reduce functions are independent
– No explicit communication between them
– Grouping phase between Map and Reduce is the only point of data exchange
• Individual Map, Reduce results depend only on input value.
– Order of data, execution does not matter in the end.
• Input data read in parallel
• Output data written in parallel
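Because each Map call depends only on its own input value, the calls can be handed to a pool of workers in any order. The sketch below uses a thread pool purely to illustrate that independence; it is not how any framework schedules tasks, and Python threads will not speed up CPU-bound work. `parallel_map` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def wc_map(line):
    """Emit (word, 1) for every word in one line."""
    return [(word, 1) for word in line.split()]


def parallel_map(lines, workers=2):
    """Run wc_map on every line using a small worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_line = list(pool.map(wc_map, lines))
    return list(chain.from_iterable(per_line))


lines = ["absence of evidence", "is not evidence of", "absence"]
serial = [pair for line in lines for pair in wc_map(line)]
parallel = parallel_map(lines)

# The same multiset of pairs comes out regardless of execution order.
print(sorted(parallel) == sorted(serial))
```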
Parallel, Distributed execution
[Figure: word count over "absence of evidence is not evidence of absence", split across two Map tasks and two Reduce tasks]
Map task 1: (absence, 1) (of, 1) (evidence, 1)
Map task 2: (is, 1) (not, 1) (evidence, 1) (of, 1) (absence, 1)
Grouped: (absence, 1) (absence, 1) / (of, 1) (of, 1) / (not, 1) / (is, 1) / (evidence, 1) (evidence, 1)
Reduce task 1: (absence, 2) (not, 1) (of, 2)
Reduce task 2: (is, 1) (evidence, 2)
Full Parallel Pipeline
Split → Read → Map → (Combine) → Group → Partition → Reduce → Write
Full Parallel Pipeline
Split – Divide data into parallel streams
• Use features of underlying storage technology
• File sharding, locality information, parallel data formats
Full Parallel Pipeline
Read – Chop data into iterable units
• Most common in MapReduce world – Lines of Text
• Can be arbitrarily simple or complex: integer arrays, PDF documents, mesh fragments, etc.
Full Parallel Pipeline
Map – Apply a function, return a list of keys/values
Full Parallel Pipeline
Combine – (optional) execute a “mini-reduce” on some set of map output
• For optimization purposes
• May not be possible for every algorithm
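For word count, a combine step could pre-sum the counts produced by a single map task before anything is sent on for grouping. The sketch below assumes the map output of one task is available as a list; `combine` is a hypothetical name. It works here because addition is associative and commutative. A mean, for instance, cannot be combined this way, since an average of averages is generally not the overall average.

```python
from collections import Counter


def combine(map_output):
    """Mini-reduce over one map task's output: pre-sum counts per word."""
    local = Counter()
    for word, count in map_output:
        local[word] += count
    return list(local.items())


one_task = [("of", 1), ("evidence", 1), ("of", 1)]
combined = combine(one_task)
print(combined)  # fewer pairs to move, same totals per word
```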
Full Parallel Pipeline
Group – Group all results by key, collapse into a list of values for each key
• Need all intermediate values before this can complete
• Automatically performed by MapReduce framework
Full Parallel Pipeline
Partition – Send grouped data to reduce processes
• Typically, just a dumb hash to evenly distribute
• Opportunities for balancing or other optimization.
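A "dumb hash" partitioner can be sketched as a hash of the key taken modulo the number of reduce processes. The example uses `zlib.crc32` as a stand-in hash because it gives the same answer in every process (Python's built-in `hash` for strings is randomized per run); this choice is an assumption for illustration, not what any particular framework uses.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index in the range [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


for word in ["absence", "of", "evidence", "is", "not"]:
    print(word, "-> reducer", partition(word, 2))
```

Every pair with the same key lands on the same reducer, which is the only property Reduce needs. Balancing is left to chance unless a smarter function is supplied.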
Full Parallel Pipeline
Reduce – Run a computation over each aggregated result, produce a final list of values
Full Parallel Pipeline
Write – Move Reduce results to their final destination
• Could be storage, or another MapReduce process!
Programming considerations
You must provide:
• Map, Reduce functions
You may provide:
• Combine, if it helps
• Partition function, if it matters
Framework must provide:
• Grouping and data shuffling
Framework may provide:
• Read, Write
– For simple data such as lines of text
• Split
– For parallel storage or data formats it knows about
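The division of labor listed above can be sketched as a toy, single-process "framework": the caller supplies Map and Reduce (and optionally Combine), while the runner owns grouping. `run_mapreduce` and its parameters are hypothetical names for illustration and do not correspond to any real framework's API.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, combine_fn=None):
    """Toy serial runner: user code is map_fn/reduce_fn, the rest is 'framework'."""
    intermediate = []
    for record in records:
        pairs = map_fn(record)            # user-provided
        if combine_fn is not None:
            pairs = combine_fn(pairs)     # optional, user-provided
        intermediate.extend(pairs)

    groups = defaultdict(list)            # framework-provided grouping
    for key, value in intermediate:
        groups[key].append(value)

    results = []
    for key in sorted(groups):
        results.extend(reduce_fn(key, groups[key]))  # user-provided
    return results


def wc_map(line):
    return [(word, 1) for word in line.split()]


def wc_reduce(word, counts):
    return [(word, sum(counts))]


lines = ["absence of evidence", "is not evidence of", "absence"]
result = run_mapreduce(lines, wc_map, wc_reduce)
print(result)
```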
Benefits
• Presents an easy-to-use programming model
– No synchronization, communication by individual components. Ugly details hidden by framework.
• Execution managed by a framework
– Failure recovery (Maps/Reduces can always be re-run if necessary)
– Speculative execution (Several processes operate on same data, whoever finishes first wins)
– Load balancing
• Adapt and optimize for different storage paradigms
Drawbacks
• Grouping/partitioning is serial!
– Need to wait for all map tasks to complete before any reduce tasks can be run
• Some algorithms may be hard to conceptualize in MapReduce.
• Some algorithms may be inefficient to express in terms of MapReduce.
Hadoop
• Open Source MapReduce framework in Java
– Spinoff from Nutch web crawler project
• HDFS – Hadoop Distributed Filesystem
– Distributed, fault-tolerant, sharding
• Many sub-projects
– Pig: Data-flow and execution language. Scripting for MapReduce
– Hive: SQL-like language for analyzing data
– Mahout: Machine learning and data mining libraries
• K-means clustering, Singular Value Decomposition, Bayesian classification
Hadoop
• User provides Java classes for Map, Reduce functions
– Can subclass or implement virtually every aspect of MapReduce pipeline or scheduling
• Streaming mode to STDIN, STDOUT of external map, reduce processes (can be implemented in any language)
– Lots of scientific data that goes beyond lines of text
– Lots of existing/legacy code that can be adapted/wrapped into a Map or Reduce stage.
stream -input /dataDir/dataFile \
  -file myMapper.sh -mapper "myMapper.sh" \
  -file myReducer.sh -reducer "myReducer.sh" \
  -output /dataDir/myResults
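The slides use shell scripts for the external processes; as a stand-in, here is a sketch of what a streaming word-count mapper and reducer could look like in Python. It assumes the usual streaming convention of tab-separated `key<TAB>value` lines, with the reducer receiving its input sorted by key. In a real job each function would iterate over `sys.stdin` and print its output; here the framework's sort is imitated with `sorted` so the sketch runs locally.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts for runs of identical keys in key-sorted input."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


lines = ["absence of evidence", "is not evidence of", "absence"]
shuffled = sorted(mapper(lines))  # stands in for the sort between stages
result = list(reducer(shuffled))
print("\n".join(result))
```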
HDFS
• Data distributed among compute nodes
– Sharding: 64MB chunks
– Redundancy
• Small number of large files
• Not quite POSIX file semantics
– No random write, append
• Write once, read many
• Favor throughput over latency
• Streaming/sequential access to files
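To make the 64MB sharding concrete, the following sketch computes how many chunks a file of a given size would occupy and the byte range of each. It is plain arithmetic, not an HDFS API; `block_count` and `block_ranges` are hypothetical helper names.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB chunk size quoted on the slide


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of chunks needed to hold the file (ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size


def block_ranges(file_size_bytes, block_size=BLOCK_SIZE):
    """(start, end) byte offsets of each chunk; the last may be short."""
    return [
        (start, min(start + block_size, file_size_bytes))
        for start in range(0, file_size_bytes, block_size)
    ]


one_gib = 1024 ** 3
print(block_count(one_gib))                    # chunks in a 1 GiB file
print(len(block_ranges(150 * 1024 * 1024)))    # chunks in a 150MB file
```

A tiny file still occupies a whole entry in the filesystem's bookkeeping, which is one reason the slide favors a small number of large files.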
HDFS
[Figure: a file sharded into blocks 1, 2, 3, 4. Four DataNodes hold replicated blocks {4, 3, 1}, {3, 1, 2}, {4, 2, 3}, and {2, 1, 4}, alongside a NameNode. Labels: Sharding, Replication]
HDFS + MapReduce
[Figure: blocks 1, 2, 3, 4 stored on four DataNodes ({4, 3, 1}, {3, 1, 2}, {4, 2, 3}, {2, 1, 4}), each of which also runs Map/Reduce tasks. A NameNode and a JobTracker sit alongside. Labels: Locality metadata, Split fn]
HDFS + MapReduce
• Assume failure-prone nodes
– Data and computation recovery through redundancy
• Move computation to data
– Data is local to computation, direct-attached storage to each node
• Sequential reads on large blocks
• Minimal contention
– Simultaneous maps/reduces on a node can be controlled by configuration
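"Move computation to data" can be illustrated with a small assignment routine: each block's map task goes to a node that already stores a replica, preferring the least-loaded one. This is a simplified sketch and not Hadoop's actual scheduler; the node names are hypothetical, and the block layout follows the four-DataNode figure shown earlier.

```python
# Which blocks each node stores locally (three replicas per block).
replicas = {
    "node1": {4, 3, 1},
    "node2": {3, 1, 2},
    "node3": {4, 2, 3},
    "node4": {2, 1, 4},
}


def assign_local(blocks, replicas):
    """Give each block to the least-loaded node that holds a copy of it."""
    load = {node: 0 for node in replicas}
    assignment = {}
    for block in blocks:
        holders = [node for node in replicas if block in replicas[node]]
        chosen = min(holders, key=lambda node: load[node])
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


assignment = assign_local([1, 2, 3, 4], replicas)
print(assignment)
```

With this layout every task reads its block from local disk and each node receives exactly one task, so no block crosses the network during the Map phase.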
Hadoop + HDFS vs HPC
[Figure: comparison of data placement. Hadoop + HDFS: blocks replicated on the nodes themselves ({4, 3, 1}, {3, 1, 2}, {4, 2, 3}, {2, 1, 4}). HPC: nodes shown without local blocks, with blocks 1 2 3 4 held separately]
Hadoop in HPC environments
• Access to local storage can be problematic
– Local storage may not be available at all
– Even if so, long-term HDFS usually not possible
• HPC relies on global storage (e.g. Lustre) via high-speed interconnect.
– What is the meaning of “locality” in inherently non-local (but parallel) storage?
Hadoop @ TACC
• On Longhorn visualization cluster
• Special, local, persistent /hadoop filesystem on some machines
– 48 nodes with 2TB HDFS storage/node
– 16 nodes with 1TB HDFS storage/node, extra large memory (144GB memory)
• Modified Hadoop distribution
– Starts HDFS on allocated nodes
• Special Hadoop queue
• By request only
• Details at https://sites.google.com/site/tacchadoop
Still much to learn
• Most established patterns are from web and text processing (inverted indexes, ranking, clustering, etc.)
• Scientific data and algorithms much more varied
– Papers describing an existing problem applied to MapReduce are common
• When does HDFS provide benefit over traditional global shared FS?
– Tends to do poorly for small tasks; there may be a crossover point that needs to be found
• Lots of tuning parameters
– Data skew and heterogeneity may lead to long, inefficient jobs.
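One way to see why data skew hurts is to count how many records each reducer would receive under a plain hash partition. The sketch below uses made-up keys in which one key dominates; `zlib.crc32` is an illustrative hash choice and `reducer_load` is a hypothetical helper, not part of Hadoop.

```python
import zlib
from collections import Counter


def reducer_load(keys, num_reducers):
    """Count how many records a hash partitioner sends to each reducer."""
    load = Counter({r: 0 for r in range(num_reducers)})
    for key in keys:
        load[zlib.crc32(key.encode("utf-8")) % num_reducers] += 1
    return load


# Hypothetical skewed input: one hot key plus ten rare ones.
keys = ["the"] * 90 + [f"w{i}" for i in range(10)]
load = reducer_load(keys, 4)
mean = sum(load.values()) / len(load)
print("max/mean load ratio:", max(load.values()) / mean)
```

All records for the hot key must go to one reducer, so that reducer finishes long after the others however evenly the remaining keys are spread.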
Why Hadoop?
• If you find the programming model simple/easy
• If you have a data intensive workload
• If you need fault tolerance
• If you have dedicated nodes available
• If you like Java
• If you want to experiment.