mapreduce and the new software stack
![Page 1: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/1.jpg)
MapReduce and the New Software Stack
Maruf Aytekin PhD Student
BAU Computer Engineering Department Besiktas/Istanbul January 5, 2015
![Page 2: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/2.jpg)
Outline
• Introduction
• DFS
• MapReduce
• Examples
• Matrix Calculation on Hadoop
![Page 3: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/3.jpg)
Introduction
Modern data-mining and ML applications, often called «big-data analysis», require us to manage massive amounts of data quickly.
![Page 4: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/4.jpg)
Important Examples
• The ranking of Web pages by importance, which involves an iterated matrix-vector multiplication where the dimension is many billions.
• Searches in social-networking sites, which involve graphs with hundreds of millions of nodes and many billions of edges.
• Processing large amounts of text or streams, for example for news recommendation.
![Page 5: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/5.jpg)
New software stack
• Not a “supercomputer” (Beowulf etc.)
• “Computing clusters”: large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.
![Page 6: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/6.jpg)
Distributed File System
• A new form of file system that works with much larger units than the disk blocks of a conventional operating system.
• Files can be enormous, possibly a terabyte in size.
• Files are rarely updated.
![Page 7: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/7.jpg)
Physical Organization
• Files are divided into chunks
• Chunks are replicated
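The two points above can be illustrated with a minimal in-memory sketch. The function names, the tiny chunk size, and the round-robin placement are assumptions made for illustration only; real systems such as HDFS use much larger chunks and rack-aware placement policies.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, nodes, replication=3):
    """Assign each chunk to `replication` distinct nodes.

    Hypothetical round-robin policy: chunk c goes to nodes c, c+1, ... (mod n).
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


chunks = split_into_chunks(b"abcdefghij", chunk_size=4)
placement = place_replicas(len(chunks), ["n0", "n1", "n2", "n3"])
for chunk_id, replicas in placement.items():
    print(chunk_id, chunks[chunk_id], replicas)
```

Because every chunk lives on several distinct nodes, losing a single node never loses the only copy of a chunk.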
![Page 8: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/8.jpg)
DFS Implementations
• The Google File System (GFS)
• Hadoop Distributed File System (HDFS)
• CloudStore, by Kosmix
![Page 9: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/9.jpg)
HDFS Architecture
![Page 10: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/10.jpg)
Block Replication
![Page 11: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/11.jpg)
MapReduce
Style of computing/framework/pattern.
Implementations:
• MapReduce by Google (internal)
• Hadoop by the Apache Software Foundation
![Page 12: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/12.jpg)
MapReduce
Operates exclusively on <key, value> pairs.
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
![Page 13: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/13.jpg)
MapReduce Computation
![Page 14: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/14.jpg)
MapReduce
In brief, a MapReduce computation executes as follows:
• Chunks from a DFS are given to Map tasks.
• These Map tasks turn the chunks into a sequence of <key, value> pairs.
• The <key, value> pairs from each Map task are collected by a master controller and sorted by key. (This grouping is usually called shuffle and sort; a combiner, where one is used, pre-aggregates each Map task's output locally first.)
• The keys are divided among all the Reduce tasks, so all <key, value> pairs with the same key wind up at the same Reduce task.
• Each Reduce task works on one key at a time, processes the values for that key, and outputs the results as <key, value> pairs.
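The steps above can be mimicked on a single machine. The following is only a sketch: `run_mapreduce`, its parameters, and the hash-based partitioning are hypothetical names and choices for illustration, not the Hadoop API.

```python
from collections import defaultdict


def run_mapreduce(chunks, map_fn, reduce_fn, num_reducers=2):
    """Single-process imitation of the Map, group-by-key, and Reduce steps."""
    # Map: every chunk is turned into a sequence of (key, value) pairs.
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_fn(chunk))

    # Group by key and divide the keys among the Reduce tasks, so that
    # all pairs with the same key end up in the same partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate:
        partitions[hash(key) % num_reducers][key].append(value)

    # Reduce: each task handles one key at a time, in sorted key order.
    output = []
    for partition in partitions:
        for key in sorted(partition):
            output.extend(reduce_fn(key, partition[key]))
    return output


# Toy job: sum the even and the odd numbers separately.
def parity_map(numbers):
    return [(n % 2, n) for n in numbers]


def sum_reduce(key, values):
    return [(key, sum(values))]


result = run_mapreduce([[1, 2, 3], [4, 5]], parity_map, sum_reduce)
print(sorted(result))
```

In a real cluster the Map and Reduce tasks run on different compute nodes and the grouping step moves data over the network; here everything happens in one loop.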
![Page 15: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/15.jpg)
Execution of MapReduce
![Page 16: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/16.jpg)
Hello World
Word Count
• file01: Hello World Bye World
• file02: Hello Hadoop Goodbye Hadoop
![Page 17: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/17.jpg)
Word Count
For the given sample input, the first map emits:
< Hello, 1 >
< World, 1 >
< Bye, 1 >
< World, 1 >
The second map emits:
< Hello, 1 >
< Hadoop, 1 >
< Goodbye, 1 >
< Hadoop, 1 >
Combiner (local aggregation of each map's output, after being sorted on the keys):
The output of the first map:
< Bye, 1 >
< Hello, 1 >
< World, 2 >
The output of the second map:
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 1 >
![Page 18: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/18.jpg)
Word Count
Thus the output of the job is:
< Bye, 1 >
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 2 >
< World, 2 >
The Reducer implementation, via the reduce method, just sums up the values, which are the occurrence counts for each key.
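The word-count flow on the two sample files can be traced with a short sketch. The function names are hypothetical, and this is plain Python rather than Hadoop's Java Mapper and Reducer classes. Since summing is associative, the same aggregation logic serves as both combiner and reducer here.

```python
from collections import Counter


def map_words(text):
    """Emit a (word, 1) pair for every word in the input text."""
    return [(word, 1) for word in text.split()]


def sum_by_key(pairs):
    """Sum the counts per word and return the pairs sorted by key.

    Used twice: as a local combiner on one map's output,
    and as the final reducer on the combined outputs.
    """
    counts = Counter()
    for word, count in pairs:
        counts[word] += count
    return sorted(counts.items())


file01 = "Hello World Bye World"
file02 = "Hello Hadoop Goodbye Hadoop"

combined01 = sum_by_key(map_words(file01))
combined02 = sum_by_key(map_words(file02))
job_output = sum_by_key(combined01 + combined02)

for word, count in job_output:
    print(f"< {word}, {count} >")
```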
![Page 19: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/19.jpg)
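Matrix Calculation
$P = MN$
[Figure: matrix M with rows indexed by i and columns by j, matrix N with rows indexed by j and columns by k, and the product P with entries such as P(1,1) and P(1,2)]
Matrix Data Model for MapReduce:
M: $(i, j, m_{ij})$
N: $(j, k, n_{jk})$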
![Page 20: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/20.jpg)
Matrix Data Files for MapReduce
Matrix M, one element per line in the form M, i, j, $m_{ij}$:
M,0,0,10.0
M,0,2,9.0
M,0,3,9.0
M,1,0,1.0
M,1,1,3.0
M,1,2,18.0
M,1,3,25.2
. . .
Matrix N, one element per line in the form N, j, k, $n_{jk}$:
N,0,0,1.0
N,0,2,3.0
N,0,4,2.0
N,1,0,2.0
N,3,2,-1.0
N,3,6,4.0
N,4,6,5.0
. . .
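A line in this format could be parsed as in the sketch below. The helper name `parse_element` is hypothetical, and the sketch assumes the comma-separated layout shown above with zero-based indices.

```python
def parse_element(line):
    """Parse 'M,1,3,25.2' into ('M', 1, 3, 25.2).

    The first field names the matrix, the next two are the row and
    column indices, and the last is the element's value.
    """
    name, row, col, value = line.strip().split(",")
    return name, int(row), int(col), float(value)


for line in ["M,1,3,25.2", "N,3,2,-1.0"]:
    print(parse_element(line))
```

Only non-zero elements need to appear in the files, which keeps sparse matrices compact.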
![Page 21: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/21.jpg)
Map
![Page 22: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/22.jpg)
Reduce
![Page 23: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/23.jpg)
Example
![Page 24: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/24.jpg)
Map Task
For matrix M, <key, value> pairs are produced as follows:
For matrix N, <key, value> pairs are produced as follows:
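The slide images with the actual Map and Reduce definitions did not survive in this transcript. The sketch below therefore assumes the common one-pass scheme, in which each element of M is sent to every cell (i, k) of its row of P and each element of N to every cell (i, k) of its column. All function names and the in-memory grouping are illustrative assumptions, not the presenter's Hadoop code.

```python
from collections import defaultdict


def map_element(element, num_rows_m, num_cols_n):
    """Emit ((i, k), (matrix, j, value)) for every cell of P the element affects."""
    name, a, b, value = element
    if name == "M":
        i, j = a, b
        return [((i, k), ("M", j, value)) for k in range(num_cols_n)]
    j, k = a, b
    return [((i, k), ("N", j, value)) for i in range(num_rows_m)]


def reduce_cell(key, values):
    """Compute p_ik as the sum over j of m_ij * n_jk for one output cell."""
    m_row = {j: v for name, j, v in values if name == "M"}
    n_col = {j: v for name, j, v in values if name == "N"}
    shared_j = m_row.keys() & n_col.keys()  # missing elements count as zero
    return key, sum(m_row[j] * n_col[j] for j in shared_j)


def multiply(elements, num_rows_m, num_cols_n):
    """Run map, group by key, and reduce in memory."""
    grouped = defaultdict(list)
    for element in elements:
        for key, value in map_element(element, num_rows_m, num_cols_n):
            grouped[key].append(value)
    return dict(reduce_cell(key, values) for key, values in grouped.items())


# M = [[1, 2], [3, 4]] and N = [[5, 6], [7, 8]] in (matrix, row, col, value) form.
elements = [
    ("M", 0, 0, 1.0), ("M", 0, 1, 2.0), ("M", 1, 0, 3.0), ("M", 1, 1, 4.0),
    ("N", 0, 0, 5.0), ("N", 0, 1, 6.0), ("N", 1, 0, 7.0), ("N", 1, 1, 8.0),
]
product = multiply(elements, num_rows_m=2, num_cols_n=2)
print(product)
```

The map step needs to know the number of rows of M and columns of N in advance, which is why they are passed in as parameters.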
![Page 25: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/25.jpg)
Map Task Output
![Page 26: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/26.jpg)
Reduce Task
P =
![Page 27: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/27.jpg)
Application
• Run the application on Hadoop
![Page 28: MapReduce and the New Software Stack](https://reader031.vdocuments.net/reader031/viewer/2022022123/58a1ac481a28abe6468b67df/html5/thumbnails/28.jpg)
Thank you!
Q & A