The MapReduce Environment
Distributed File Systems
Overview of the DFS Ecology
MapReduce and Hadoop

Jeffrey D. Ullman
Stanford University
Distributed File Systems
Chunking
Replication
Distribution on Racks
Commodity Clusters
Datasets can be very large. Tens to hundreds of terabytes. Cannot process on a single server.
Standard architecture emerging: Cluster of commodity Linux nodes (compute nodes). Gigabit Ethernet interconnect.
How to organize computations on this architecture? Mask issues such as hardware failure.
Cluster Architecture
[Figure: two racks of compute nodes, each node with a CPU, memory, and disk, joined by a per-rack switch; the rack switches connect through a backbone switch. Each rack contains 16-64 nodes; 1 Gbps between any pair of nodes in a rack; 2-10 Gbps backbone between racks.]
Stable Storage
First-order problem: if nodes can fail, how can we store data persistently?
Answer: Distributed File System. Provides global file namespace. Examples: Google GFS, Colossus; Hadoop HDFS.
Typical usage pattern: Huge files. Data is rarely updated in place. Reads and appends are common.
Distributed File System: Chunk Servers
A file is split into contiguous chunks, typically 64 MB. Each chunk is replicated (usually 2x or 3x), with an attempt to keep the replicas in different racks. Alternative: erasure coding.
Master node for a file: stores metadata, including the locations of all chunks. Possibly itself replicated.
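To make chunking and rack-aware replication concrete, here is a minimal Python sketch. It is illustrative only: `chunk_offsets` and `place_replicas` are names invented here, not GFS or HDFS APIs, and real systems also weigh disk usage and network topology when placing replicas.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # the typical 64 MB chunk size cited above

def chunk_offsets(file_size, chunk_size=CHUNK_SIZE):
    """Yield (offset, length) pairs that cover the file contiguously."""
    for off in range(0, file_size, chunk_size):
        yield off, min(chunk_size, file_size - off)

def place_replicas(racks, replication=3):
    """Pick distinct racks for one chunk's replicas, so no two
    copies share a rack (the different-racks rule above)."""
    return random.sample(racks, min(replication, len(racks)))

# A 200 MB file becomes four chunks, each replicated on three racks.
for off, length in chunk_offsets(200 * 1024 * 1024):
    print(off, length, place_replicas(["rack1", "rack2", "rack3", "rack4"]))
```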
Compute Nodes
Organized into racks. Intra-rack connection typically gigabit speed. Inter-rack connection faster by a small factor.
Racks of Compute Nodes
[Figure: a file divided into chunks, spread across racks of compute nodes.]
[Figure: 3-way replication of files, with copies on different racks.]
Above the DFS
MapReduce
Key-Value Stores
SQL Implementations
The New Stack
[Diagram: the new stack, from the bottom up]
Distributed File System
MapReduce, e.g., Hadoop
Object store (key-value store), e.g., BigTable, HBase, Cassandra
SQL implementations, e.g., Pig (relational algebra), Hive
MapReduce Systems
MapReduce (Google) and its open-source (Apache) equivalent, Hadoop.
An important specialized parallel-computing tool: it copes with compute-node failures while avoiding a restart of the entire job.
Key-Value Stores
BigTable (Google), HBase, Cassandra (Apache), Dynamo (Amazon).
Each row is a key plus values over a flexible set of columns. Each column component can be a set of values.
Example: the structure of the Web. The key is a URL. One column is a set of URLs – those the page represented by the key links to. A second column is the set of URLs linking to the key.
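As a concrete picture of such a row, here is a hedged sketch in plain Python; the column names `links_to` and `linked_from` are invented for illustration, not any store's schema.

```python
# One row of the web-structure table: a key (the URL) plus two columns,
# each holding a *set* of values. Names and URLs are illustrative only.
row = {
    "key": "http://example.com/",
    "links_to":    {"http://a.example.com/", "http://b.example.com/"},  # URLs this page links to
    "linked_from": {"http://other.org/refers.html"},                    # URLs linking to this page
}
```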
SQL-Like Systems
Pig – Yahoo!'s implementation of relational algebra. Translates to a sequence of MapReduce operations, using Hadoop.
Hive – open-source (Apache) implementation of a restricted SQL, called QL, over Hadoop.
SQL-Like Systems – (2)
Sawzall – Google's implementation of parallel select + aggregation, but using C++.
Dremel – (Google) a real, though restricted, SQL over a column-oriented store.
F1 – (Google) row-oriented and conventional, but at massive scale.
Scope – Microsoft's implementation of a restricted SQL.
MapReduce
Formal Definition
Implementation
Fault-Tolerance
Examples: Word-Count, Join
MapReduce
Input: a set of key/value pairs. User supplies two functions:
map(k, v) -> set of (k1, v1)
reduce(k1, list of v1) -> set of v2
Technically, the input consists of key-value pairs of some type, but usually only the value is important.
(k1,v1) is an intermediate key/value pair. Output is the set of (k1,v2) pairs.
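This contract can be modeled in a few lines of single-process Python. It is a sketch of the execution model only, not Hadoop's API; `map_reduce` and its arguments are names invented here.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Single-process model of map -> group-by-key -> reduce."""
    # Map phase: each (k, v) input may emit any number of (k1, v1) pairs.
    groups = defaultdict(list)
    for k, v in inputs:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)
    # Reduce phase: each intermediate key k1 sees the list of its values,
    # and each emitted v2 becomes an output pair (k1, v2).
    return [(k1, v2)
            for k1, values in sorted(groups.items())
            for v2 in reduce_fn(k1, values)]
```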
Map Tasks and Reduce Tasks
MapReduce job = Map function (inputs -> key-value pairs) + Reduce function (key and list of values -> outputs).
Map and Reduce Tasks apply the Map or Reduce function to (typically) many of their inputs. They are the unit of parallelism.
Behind the Scenes
The Map tasks generate key-value pairs. Each takes one or more chunks of input from the distributed file system.
The system takes all the key-value pairs from all the Map tasks and sorts them by key.
Then it forms key-(list-of-associated-values) pairs and passes each key-(value-list) pair to one of the Reduce tasks.
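The sort-and-group step just described can be sketched directly; this is an illustration of the behavior, not the real implementation, and `shuffle` is a name invented here.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(map_output):
    """Sort all Map output by key, then form key-(value-list) pairs."""
    pairs = sorted(map_output, key=itemgetter(0))
    return [(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

# shuffle([("b", 1), ("a", 2), ("b", 3)]) == [("a", [2]), ("b", [1, 3])]
```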
MapReduce Pattern
[Figure: Map tasks read input from the DFS and emit "key"-value pairs; the pairs are routed to Reduce tasks, which write output to the DFS.]
Example: Word Count
We have a large file of documents, which are sequences of words.
Count the number of times each distinct word appears in the file.
Word Count Using MapReduce
map(key, value):
    // key: document name; value: text of document
    FOR (each word w in value)
        emit(w, 1);

reduce(key, value-list):
    // key: a word; value-list: an iterator over the counts emitted for that word
    result = 0;
    FOR (each count v on value-list)
        result += v;
    emit(result);
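Here is the same word count as runnable Python, following the pseudocode above. The document list and the inline grouping loop are illustrative stand-ins for the DFS input and the system's shuffle; per the formal definition, the system attaches the key to each value the reducer emits.

```python
from collections import defaultdict

def wc_map(doc_name, text):
    # One (word, 1) pair per occurrence, as in map() above.
    for w in text.split():
        yield (w, 1)

def wc_reduce(word, counts):
    # Sum the counts; the system pairs the result with the key.
    yield sum(counts)

docs = [("d1", "the cat sat"), ("d2", "the cat ran")]
groups = defaultdict(list)            # stand-in for the shuffle
for name, text in docs:
    for word, one in wc_map(name, text):
        groups[word].append(one)
for word in sorted(groups):
    for total in wc_reduce(word, groups[word]):
        print(word, total)            # cat 2 / ran 1 / sat 1 / the 2
```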
Distributed Execution Overview
[Figure: distributed execution. The user program forks a Master and several Workers. The Master assigns Map tasks and Reduce tasks to Workers. Map Workers read input chunks (Chunk 0, Chunk 1, Chunk 2) and write intermediate results to local disk; Reduce Workers do remote reads and sort the intermediates, then write Output File 0 and Output File 1.]
Data Management
Input and final output are stored in the distributed file system. The scheduler tries to schedule Map tasks "close" to the physical storage location of their input data – preferably at the same node.
Intermediate results are stored on the local file storage of the Map and Reduce workers.
The Master Task
Maintains task status: idle, active, or completed. Idle tasks get scheduled as workers become available.
When a Map task completes, it sends the Master the locations and sizes of its intermediate files, one for each Reduce task.
The Master pushes the locations of the intermediates to the Reduce tasks.
The Master pings workers periodically to detect failures.
How Many Map and Reduce Tasks?
Rule of thumb: use several times more Map tasks and Reduce tasks than the number of compute nodes available.
This minimizes skew caused by different tasks taking different amounts of time.
One DFS chunk per Map task is common.
Combiners
Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k. E.g., popular words in Word Count.
We can save communication time by applying the Reduce function to values with the same key at the Map task. This is called a combiner.
Works only if the Reduce function is commutative and associative.
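A hedged sketch of a combiner for Word Count: because addition is commutative and associative, a Map task can collapse its own repeated keys before the shuffle. `combine` is a name invented here, not Hadoop's Combiner interface.

```python
from collections import Counter

def combine(map_output):
    """Pre-aggregate (word, count) pairs inside one Map task."""
    totals = Counter()
    for k, v in map_output:
        totals[k] += v
    return list(totals.items())

# combine([("the", 1), ("the", 1), ("cat", 1)]) -> [("the", 2), ("cat", 1)]
```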
Partition Function
We need to ensure that records with the same intermediate key end up at the same Reduce task.
The system uses a default partition function, e.g., hash(key) mod R, if there are R Reduce tasks.
Sometimes it is useful to override it. Example: hash(hostname(URL)) mod R ensures that URLs from the same host end up at the same Reduce task and therefore appear together in the output.
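The override described above might look like this sketch: urlparse does the hostname extraction, and crc32 stands in for a stable hash (Python's built-in hash() is randomized per run). This is not Hadoop's Partitioner API; `partition` is a name invented here.

```python
import zlib
from urllib.parse import urlparse

def partition(url, R):
    """hash(hostname(URL)) mod R: all URLs from one host go to the
    same one of the R Reduce tasks."""
    host = urlparse(url).hostname or url   # fall back to the raw key
    return zlib.crc32(host.encode()) % R

# partition("http://example.com/a", 10) == partition("http://example.com/b", 10)
```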
Coping With Failures
MapReduce is designed to deal with compute nodes failing to execute a task.
It re-executes failed tasks, not whole jobs. Failure modes:
1. Compute-node failure (e.g., disk crash).
2. Rack communication failure.
3. Software failures, e.g., a task requires Java version n; the node has Java version n-1.
Things MapReduce is Good At
1. Matrix-matrix and matrix-vector multiplication. One step of the PageRank iteration was the original application.
2. Relational-algebra operations. We'll do an example of the join.
3. Many other "embarrassingly parallel" operations.
Review of Terminology
Map-Reduce job = Map function (inputs -> key-value pairs) + Reduce function (key and list of values -> outputs).
Map and Reduce Tasks apply the Map or Reduce function to (typically) many of their inputs. They are the unit of parallelism.
Mapper = application of the Map function to a single input.
Reducer = application of the Reduce function to a single key-(list of values) pair.
Example: Natural Join
Join of R(A,B) with S(B,C) is the set of tuples (a,b,c) such that (a,b) is in R and (b,c) is in S.
Mappers need to send R(a,b) and S(b,c) to the same reducer, so they can be joined there.
Mapper output: key = B-value, value = relation name and the other component (A or C).
Example: R(1,2) -> (2, (R,1)); S(2,3) -> (2, (S,3)).
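As a sketch, the two Mappers tag each tuple with its relation's name and key it on the shared B-value; the function names are invented here for illustration.

```python
def map_R(a, b):
    yield (b, ("R", a))   # R(1,2) -> (2, ("R", 1))

def map_S(b, c):
    yield (b, ("S", c))   # S(2,3) -> (2, ("S", 3))
```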
Mapping Tuples
Mapper for R(1,2): R(1,2) -> (2, (R,1))
Mapper for R(4,2): R(4,2) -> (2, (R,4))
Mapper for S(2,3): S(2,3) -> (2, (S,3))
Mapper for S(5,6): S(5,6) -> (5, (S,6))
Grouping Phase
There is a reducer for each key. Every key-value pair generated by any mapper is sent to the reducer for its key.
Mapping Tuples
[Figure: the same four mappers; (2, (R,1)), (2, (R,4)), and (2, (S,3)) are routed to the reducer for B = 2, while (5, (S,6)) goes to the reducer for B = 5.]
Constructing Value-Lists
The input to each reducer is organized by the system into a pair: the key, and the list of values associated with that key.
The Value-List Format
[Figure: the reducer for B = 2 receives the pair (2, [(R,1), (R,4), (S,3)]); the reducer for B = 5 receives (5, [(S,6)]).]
The Reduce Function for Join
Given key b and a list of values that are either (R, ai) or (S, cj), output each triple (ai, b, cj).
Thus, the number of outputs made by a reducer is the product of the number of R's on the list and the number of S's on the list.
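A sketch of that Reduce function: the tags attached by the Mappers separate the two relations, and the nested loops produce exactly the product described above. `reduce_join` is a name invented here.

```python
def reduce_join(b, values):
    """values: a list of ("R", a) and ("S", c) pairs sharing key b."""
    r_vals = [x for tag, x in values if tag == "R"]
    s_vals = [x for tag, x in values if tag == "S"]
    for a in r_vals:
        for c in s_vals:
            yield (a, b, c)   # one output per (R-value, S-value) pair

# list(reduce_join(2, [("R", 1), ("R", 4), ("S", 3)])) == [(1, 2, 3), (4, 2, 3)]
```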
Output of the Reducers
[Figure: the reducer for B = 2, given (2, [(R,1), (R,4), (S,3)]), outputs (1,2,3) and (4,2,3); the reducer for B = 5, given (5, [(S,6)]), has no R-values to pair and outputs nothing.]