Post on 19-Dec-2015
Distributed Computations
MapReduce / Dryad
MapReduce slides are adapted from Jeff Dean’s
Why distributed computations?
• How long to sort 1 TB on one computer?
  – One computer can read ~30 MB/s from disk
  – Takes ~2 days!!
• Google indexes 20 billion+ web pages
  – 20 * 10^9 pages * 20 KB/page = 400 TB
• The Large Hadron Collider is expected to produce 15 PB every year!
Solution: use many nodes!
• Cluster computing (cloud computing)
  – Thousands of PCs connected by high-speed LANs
• Grid computing
  – Hundreds of supercomputers connected by a high-speed network
• 1000 nodes potentially give a 1000X speedup
Distributed computations are difficult to program
• Sending data to/from nodes
• Coordinating among nodes
• Recovering from node failure
• Optimizing for locality
• Debugging
These difficulties are the same for every problem
MapReduce/Dryad
• Both are case studies of distributed systems
• Both address typical challenges of distributed systems:
  – Abstraction
  – Fault tolerance
  – Concurrency
  – Consistency
  – …
MapReduce
• A programming model for large-scale computations
  – Processes large amounts of input, produces output
  – No side-effects or persistent state (unlike a file system)
• MapReduce is implemented as a runtime library:
  – automatic parallelization
  – load balancing
  – locality optimization
  – handling of machine failures
MapReduce design
• Input data is partitioned into M splits
• Map: extract information from each split
  – Each Map produces R partitions
• Shuffle and sort
  – Bring the matching partitions from all M map tasks to the same reducer
• Reduce: aggregate, summarize, filter, or transform
• Output ends up in R result files
More specifically…
• Programmer specifies two methods:
  – map(k, v) → <k', v'>*
  – reduce(k', <v'>*) → <k', v'>*
• All v' with the same k' are reduced together, in order.
• Usually also specify:
  – partition(k', total partitions) → partition for k'
    • often a simple hash of the key
    • allows reduce operations for different k' to be parallelized
Example: Count word frequencies in web pages
• Input is files with one document per record
• Map parses documents into words
  – key = document URL
  – value = document contents
• Output of map:
  Input:  “doc1”, “to be or not to be”
  Output: “to”, “1”
          “be”, “1”
          “or”, “1”
          …
Example: word frequencies
• Reduce: computes the sum for each key
• Output of reduce is saved
  key = “be”,  values = “1”, “1”  →  “2”
  key = “not”, values = “1”       →  “1”
  key = “or”,  values = “1”       →  “1”
  key = “to”,  values = “1”, “1”  →  “2”
  Saved output: “be”, “2”  “not”, “1”  “or”, “1”  “to”, “2”
Example: Pseudo-code

Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String key, Iterator intermediate_values):
  // key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
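The pseudo-code above can be run as a single-machine simulation. The names below (map_fn, reduce_fn, run_job) are ours and stand in for the MapReduce runtime, which would normally handle the shuffle and sort:

```python
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # Map(input_key, input_value): emit (word, "1") for each word
    for word in doc_contents.split():
        yield (word, "1")

def reduce_fn(key, values):
    # Reduce(key, intermediate_values): sum the counts
    return str(sum(int(v) for v in values))

def run_job(inputs):
    # Shuffle: group intermediate values by key, then sort (the runtime's job)
    groups = defaultdict(list)
    for doc_name, contents in inputs:
        for k, v in map_fn(doc_name, contents):
            groups[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

print(run_job([("doc1", "to be or not to be")]))
# {'be': '2', 'not': '1', 'or': '1', 'to': '2'}
```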
MapReduce is widely applicable
• Distributed grep
• Document clustering
• Web link graph reversal
• Detecting approx. duplicate web pages
• …
MapReduce implementation
• Input data is partitioned into M splits
• Map: extract information from each split
  – Each Map produces R partitions
• Shuffle and sort
  – Bring the matching partitions from all M map tasks to the same reducer
• Reduce: aggregate, summarize, filter, or transform
• Output ends up in R result files
MapReduce scheduling
• One master, many workers
  – Input data split into M map tasks (e.g. 64 MB each)
  – R reduce tasks
  – Tasks are assigned to workers dynamically
  – Often: M = 200,000; R = 4,000; workers = 2,000
MapReduce scheduling
• Master assigns a map task to a free worker
  – Prefers “close-by” workers when assigning a task
  – Worker reads task input (often from local disk!)
  – Worker produces R local files containing intermediate k/v pairs
• Master assigns a reduce task to a free worker
  – Worker reads intermediate k/v pairs from map workers
  – Worker sorts & applies user’s Reduce op to produce the output
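The scheduling flow can be simulated on one machine. This is a minimal sketch with assumed names (part, map_task, reduce_task), not Google's code; the deterministic part function stands in for the library's hash partitioner:

```python
from collections import defaultdict

R = 3

def part(word):
    return sum(word.encode()) % R      # deterministic stand-in for hash(k') % R

def map_task(split):
    # "R local files" of intermediate k/v pairs, one dict per partition
    partitions = [defaultdict(list) for _ in range(R)]
    for word in split.split():
        partitions[part(word)][word].append(1)
    return partitions

def reduce_task(r, map_outputs):
    merged = defaultdict(list)
    for parts in map_outputs:          # fetch partition r from each mapper
        for k, vs in parts[r].items():
            merged[k].extend(vs)
    return {k: sum(vs) for k, vs in sorted(merged.items())}  # sort, then reduce

splits = ["to be or not to be", "do not be silly"]           # M = 2 splits
map_outputs = [map_task(s) for s in splits]
result = {}
for r in range(R):                     # R reduce tasks, one per partition
    result.update(reduce_task(r, map_outputs))
```

In the real system the R reduce tasks run on different workers, each pulling its partition over the network from every mapper's local disk.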
Parallel MapReduce
(Diagram: input data is split across parallel Map workers; a Shuffle stage routes each Map’s partitions to the Reduce workers, which write the partitioned output; a Master coordinates the job.)
WordCount Internals
• Input data is split into M map jobs
• Each map job generates R local partitions
  Map input:  “doc1”, “to be or not to be”
  Map output: “to”, “1”  “be”, “1”  “or”, “1”  “not”, “1”  “to”, “1”
  Grouped by Hash(key) % R into R local partitions:
    “to”, “1”, “1” | “be”, “1” | “not”, “1”  “or”, “1”
  Map input:  “doc234”, “do not be silly”
  Map output: “do”, “1”  “not”, “1”  “be”, “1”  “silly”, “1”
  R local partitions:
    “do”, “1” | “be”, “1” | “not”, “1”  “silly”, “1”
WordCount Internals
• Shuffle brings the same partitions to the same reducer
  From doc1’s mapper:   “to”, “1”, “1” | “be”, “1” | “not”, “1”  “or”, “1”
  From doc234’s mapper: “do”, “1”      | “be”, “1” | “not”, “1”
  After shuffle, each reducer holds one partition from every mapper:
    “to”, “1”, “1”  “do”, “1”
    “be”, “1”, “1”
    “not”, “1”, “1”  “or”, “1”
WordCount Internals
• Reduce aggregates sorted key/value pairs
  “to”, “1”, “1”  “do”, “1”   →  “do”, “1”  “to”, “2”
  “be”, “1”, “1”              →  “be”, “2”
  “not”, “1”, “1”  “or”, “1”  →  “not”, “2”  “or”, “1”
The importance of the partition function
• partition(k', total partitions) → partition for k'
  – e.g. hash(k') % R
• What is the partition function for sort?
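For sort, a hash partitioner won't do: the output files must concatenate, in partition order, into globally sorted output. That calls for a range partitioner over sampled key boundaries. Both are sketched below with hypothetical names (hash_partition, range_partition); the byte-sum hash is a deterministic stand-in for a real hash:

```python
R = 4

def hash_partition(key):
    # default partitioner: a simple hash of the key, modulo R
    return sum(key.encode()) % R

def range_partition(key, boundaries):
    # boundaries are sampled split points (e.g. from a pre-pass over the keys);
    # keys below boundaries[0] go to partition 0, and so on
    for i, b in enumerate(boundaries):
        if key < b:
            return i
    return len(boundaries)

boundaries = ["g", "n", "t"]          # hypothetical sampled split points
assert range_partition("apple", boundaries) == 0
assert range_partition("zebra", boundaries) == 3
# concatenating reduce outputs 0..R-1 now yields globally sorted order
```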
Load Balance and Pipelining
• Fine-granularity tasks: many more map tasks than machines
  – Minimizes time for fault recovery
  – Can pipeline shuffling with map execution
  – Better dynamic load balancing
• Often uses 200,000 map / 5,000 reduce tasks with 2,000 machines
Fault tolerance via re-execution
On worker failure:
• Re-execute completed and in-progress map tasks
• Re-execute in-progress reduce tasks
• Task completion is committed through the master
On master failure:
• State is checkpointed to GFS: a new master recovers & continues
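Re-execution can be sketched as a retry loop. The names below (run_with_retry, flaky_worker) are hypothetical illustrations, not the actual master's logic:

```python
import random

def run_with_retry(task, workers, max_attempts=5):
    # Master-style retry: keep re-executing the task on some worker
    for _ in range(max_attempts):
        worker = random.choice(workers)
        try:
            return worker(task)        # completion is committed via the master
        except RuntimeError:
            continue                   # worker failed: re-execute the task
    raise RuntimeError("task failed on all attempts")

calls = {"n": 0}
def flaky_worker(task):
    # simulated machine that dies twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("worker died")
    return task.upper()

print(run_with_retry("map split 7", [flaky_worker]))  # MAP SPLIT 7
```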
Avoiding stragglers using backup tasks
• Slow workers significantly lengthen completion time
  – Other jobs consuming resources on the machine
  – Bad disks with soft errors transfer data very slowly
  – Weird things: processor caches disabled (!!)
• Solution: near the end of a phase, spawn backup copies of tasks
  – Whichever copy finishes first “wins”
• Effect: dramatically shortens job completion time
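Backup tasks amount to speculative execution: run a duplicate of a slow task and keep the first result. A minimal sketch using Python threads, where the straggler/backup pair is a contrived stand-in for a slow and a healthy worker:

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def straggler(x):
    time.sleep(2.0)     # e.g. a bad disk making the original copy slow
    return x * x

def backup(x):
    time.sleep(0.1)     # backup copy spawned on a healthy machine
    return x * x

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(straggler, 7), pool.submit(backup, 7)]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    result = done.pop().result()   # whichever copy finishes first "wins"
print(result)  # 49
```

Since map and reduce tasks are deterministic and side-effect free, discarding the loser's output is safe.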
MapReduce Sort Performance
Dryad
• Similar goals to MapReduce, but a different design
• Computations are expressed as a graph
  – Vertices are computations
  – Edges are communication channels
  – Each vertex has several input and output edges
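The graph model can be sketched as a small DAG executor. Class and method names here (Graph, vertex, channel, run) are ours, not Dryad's API; a vertex runs once all of its inputs have arrived, i.e. in topological (Kahn's) order:

```python
from collections import defaultdict, deque

class Graph:
    def __init__(self):
        self.fn = {}                      # vertex name -> computation
        self.edges = defaultdict(list)    # vertex name -> downstream vertices
        self.indeg = defaultdict(int)     # number of input channels

    def vertex(self, name, fn):
        self.fn[name] = fn

    def channel(self, src, dst):
        self.edges[src].append(dst)
        self.indeg[dst] += 1

    def run(self, inputs):
        # a vertex becomes runnable once all input channels have delivered
        values = {v: list(xs) for v, xs in inputs.items()}
        ready = deque(v for v in self.fn if self.indeg[v] == 0)
        out = None
        while ready:
            v = ready.popleft()
            out = self.fn[v](values.get(v, []))
            for w in self.edges[v]:
                values.setdefault(w, []).append(out)
                self.indeg[w] -= 1
                if self.indeg[w] == 0:
                    ready.append(w)
        return out                        # output of the final vertex

g = Graph()
g.vertex("count1", lambda ins: len(ins[0].split()))
g.vertex("count2", lambda ins: len(ins[0].split()))
g.vertex("sum", lambda ins: sum(ins))
g.channel("count1", "sum")
g.channel("count2", "sum")
total = g.run({"count1": ["to be or not"], "count2": ["do not be silly"]})
print(total)  # 8
```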
WordCount in Dryad
(Diagram: a pipeline of vertices labeled Count Word:n, Distribute Word:n, MergeSort Word:n, and Count Word:n.)
Dryad Implementation
• Job manager (master)
  – Keeps execution records for each vertex
  – When all inputs are available, a vertex becomes runnable
• Daemon
  – Creates processes to run vertices (workers)
• Channels
  – Files, TCP connections, pipes, shared memory
• Copes with stragglers via re-execution
MapReduce vs. Dryad
• MapReduce:
  – Chain multiple MR jobs into a pipeline
• Dryad:
  – Express computation as an arbitrary DAG
• Minor differences?
  – A Dryad vertex takes an arbitrary # of input/output channels
  – Map takes one input(?) and generates R outputs
  – Dryad channels can be TCP, pipes, files, etc.
  – Map outputs are in R files