MapReduce and the New Software Stack

Maruf Aytekin, PhD Student

BAU Computer Engineering Department, Besiktas/Istanbul, January 5, 2015

Outline

• Introduction

• DFS

• MapReduce

• Examples

• Matrix Calculation on Hadoop

Introduction

Modern data-mining and machine-learning applications, often called “big-data analysis”, require us to manage massive amounts of data quickly.

Important Examples

• The ranking of Web pages by importance, which involves an iterated matrix-vector multiplication where the dimension is many billions.

• Searches in social-networking sites, which involve graphs with hundreds of millions of nodes and many billions of edges.

• Processing large amounts of text or streams, for example for news recommendation.

New software stack

• Not a “supercomputer” (Beowulf etc.)

• “Computing clusters”: large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.

Distributed File System

• A new form of file system that uses much larger units than the disk blocks in a conventional operating system.

• Files can be enormous, possibly a terabyte in size.

• Files are rarely updated.

Physical Organization

• Files are divided into chunks.

• Chunks are replicated.
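The two points above can be illustrated with a minimal sketch. The chunk size, the replication factor of 3, the node names, and the round-robin placement are all assumptions chosen for illustration; they are not the placement policy of GFS or HDFS.

```python
def split_into_chunks(data, chunk_size):
    """Cut a file's bytes into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, nodes, replication=3):
    """Assign every chunk to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        # Consecutive offsets modulo len(nodes) never repeat a node
        # as long as replication <= len(nodes).
        placement[chunk_id] = [
            nodes[(chunk_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Toy input: a 10-byte "file" and four hypothetical compute nodes.
chunks = split_into_chunks(b"x" * 10, chunk_size=4)
placement = place_replicas(len(chunks), ["node-a", "node-b", "node-c", "node-d"])
```

Because each chunk lives on several nodes, losing a single node does not lose any chunk in this sketch.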

DFS Implementations

• The Google File System (GFS)

• Hadoop Distributed File System (HDFS)

• CloudStore, by Kosmix

HDFS Architecture

Block Replication

MapReduce

A style of computing (a framework or programming pattern).

Implementations:

• MapReduce by Google (internal)

• Hadoop by the Apache Foundation

Operates exclusively on <key, value> pairs.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

MapReduce Computation

In brief, a MapReduce computation executes as follows:

• Chunks from a DFS are given to Map tasks.

• These Map tasks turn the chunks into a sequence of <key, value> pairs.

• The <key, value> pairs from each Map task are collected by a master controller and sorted by key. (Combine)

• The keys are divided among all the Reduce tasks, so all <key, value> pairs with the same key wind up at the same Reduce task.

• The Reduce tasks work on one key at a time: each processes the values for that key and then outputs the results as <key, value> pairs.
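The steps above can be mimicked in a single process. The following is a minimal sketch, not how Hadoop is implemented: `run_mapreduce`, `map_parity`, and `reduce_sum` are hypothetical names, the grouping happens in memory rather than across machines, and the input chunks are toy data.

```python
from collections import defaultdict


def run_mapreduce(chunks, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce in one process."""
    # Map phase: every chunk is turned into a sequence of (key, value) pairs.
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_fn(chunk))

    # Grouping phase: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: handle one key at a time, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def map_parity(chunk):
    """Emit ("even", n) or ("odd", n) for each integer in the chunk."""
    return [("even" if n % 2 == 0 else "odd", n) for n in chunk]


def reduce_sum(key, values):
    """Sum all values that arrived for one key."""
    return [(key, sum(values))]


# Toy input: two "chunks" of integers.
result = run_mapreduce([[1, 2, 3], [4, 5]], map_parity, reduce_sum)
# result == [("even", 6), ("odd", 9)]
```

In a real cluster the grouping step also decides which Reduce task receives each key; here a single dictionary plays that role.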

Execution of MapReduce

Hello World

Word Count

• file01: Hello World Bye World

• file02: Hello Hadoop Goodbye Hadoop

For the given sample input, the first map emits:

< Hello, 1 >
< World, 1 >
< Bye, 1 >
< World, 1 >

The second map emits:

< Hello, 1 >
< Hadoop, 1 >
< Goodbye, 1 >
< Hadoop, 1 >

Combiner: each map's output is sorted on the keys and aggregated locally.

The output of the first map:

< Bye, 1 >
< Hello, 1 >
< World, 2 >

The output of the second map:

< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 1 >
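The map and combine steps for the two sample files can be sketched as follows. This is an in-memory illustration; `map_words` and `combine` are hypothetical helper names, not Hadoop API calls.

```python
from collections import Counter


def map_words(text):
    """Emit a (word, 1) pair for every word in the input text."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Locally sum the counts per word and return the pairs sorted by key."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())


file01 = "Hello World Bye World"
file02 = "Hello Hadoop Goodbye Hadoop"

combined01 = combine(map_words(file01))
combined02 = combine(map_words(file02))
# combined01 == [("Bye", 1), ("Hello", 1), ("World", 2)]
# combined02 == [("Goodbye", 1), ("Hadoop", 2), ("Hello", 1)]
```

Combining on the map side means fewer pairs have to travel to the Reduce tasks: the first file sends three pairs instead of four.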

Thus the output of the job is:

< Bye, 1 >
< Goodbye, 1 >
< Hadoop, 2 >
< Hello, 2 >
< World, 2 >

The Reducer implementation, via the reduce method, simply sums up the values, which are the occurrence counts for each key.
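A minimal sketch of that reduce step, starting from the two combined map outputs of the word-count example. The names `shuffle` and `reduce_counts` are hypothetical, and the grouping is done in memory for illustration.

```python
from collections import defaultdict

# Combined outputs of the two map tasks, as listed in the example.
map_outputs = [
    [("Bye", 1), ("Hello", 1), ("World", 2)],
    [("Goodbye", 1), ("Hadoop", 2), ("Hello", 1)],
]


def shuffle(outputs):
    """Group the values from all map outputs by key."""
    groups = defaultdict(list)
    for pairs in outputs:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_counts(word, counts):
    """Sum the occurrence counts received for one word."""
    return (word, sum(counts))


result = [
    reduce_counts(word, counts)
    for word, counts in sorted(shuffle(map_outputs).items())
]
# result == [("Bye", 1), ("Goodbye", 1), ("Hadoop", 2), ("Hello", 2), ("World", 2)]
```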

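Matrix Calculation

$P = MN$

[Figure: $M$ has rows indexed by $i$ and columns indexed by $j$; $N$ has rows indexed by $j$ and columns indexed by $k$; $P$ has rows indexed by $i$ and columns indexed by $k$, with entries $P(1,1)$ and $P(1,2)$ labeled.]

Matrix Data Model for MapReduce:

$M$: $(i, j, m_{ij})$

$N$: $(j, k, n_{jk})$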

Matrix Data Files for MapReduce

Records of M (format: M, i, j, $m_{ij}$):

M,0,0,10.0
M,0,2,9.0
M,0,3,9.0
M,1,0,1.0
M,1,1,3.0
M,1,2,18.0
M,1,3,25.2
. . .

Records of N (format: N, j, k, $n_{jk}$):

N,0,0,1.0
N,0,2,3.0
N,0,4,2.0
N,1,0,2.0
N,3,2,-1.0
N,3,6,4.0
N,4,6,5.0
. . .
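Each line of these files can be turned into a typed record before it reaches a Map task. The sketch below assumes exactly four comma-separated fields per line; `parse_record` is a hypothetical helper, and the three sample lines are taken from the listings above.

```python
def parse_record(line):
    """Parse 'name,row,col,value' into (name, row index, column index, value)."""
    name, row, col, value = line.strip().split(",")
    return name, int(row), int(col), float(value)


lines = ["M,0,0,10.0", "M,1,3,25.2", "N,3,2,-1.0"]
records = [parse_record(line) for line in lines]
# records == [("M", 0, 0, 10.0), ("M", 1, 3, 25.2), ("N", 3, 2, -1.0)]
```

Only the entries that appear in the files are stored, so entries that are absent are treated as zeros.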

Reduce

Example

Map Task

For matrix M, <key, value> pairs are produced as follows:

For matrix N, <key, value> pairs are produced as follows:

Map Task Output

Reduce Task

P =
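One common way to compute $P = MN$ in a single MapReduce pass is to key every intermediate pair by the output cell $(i, k)$, so that each Reduce call computes $p_{ik} = \sum_j m_{ij} n_{jk}$. The sketch below follows that standard scheme and may differ in detail from the job run in this presentation. The function names, the small 2×2 test matrices, and the requirement that the number of rows of M and columns of N be known in advance are assumptions.

```python
from collections import defaultdict


def map_record(record, m_rows, n_cols):
    """Emit ((i, k), (matrix name, j, value)) for every cell the entry contributes to."""
    name, a, b, value = record
    if name == "M":
        # Entry m_ij (a = i, b = j) is needed by every cell in row i of P.
        for k in range(n_cols):
            yield (a, k), ("M", b, value)
    else:
        # Entry n_jk (a = j, b = k) is needed by every cell in column k of P.
        for i in range(m_rows):
            yield (i, b), ("N", a, value)


def reduce_cell(key, values):
    """Match M and N entries on j and sum the products for one cell of P."""
    m_row = {j: v for name, j, v in values if name == "M"}
    n_col = {j: v for name, j, v in values if name == "N"}
    total = sum(m_row[j] * n_col[j] for j in m_row if j in n_col)
    return key, total


# Toy matrices (assumed): M = [[1, 2], [3, 4]], N = [[5, 6], [7, 8]].
records = [
    ("M", 0, 0, 1.0), ("M", 0, 1, 2.0), ("M", 1, 0, 3.0), ("M", 1, 1, 4.0),
    ("N", 0, 0, 5.0), ("N", 0, 1, 6.0), ("N", 1, 0, 7.0), ("N", 1, 1, 8.0),
]

# Group the map output by key, as the framework would do between Map and Reduce.
groups = defaultdict(list)
for record in records:
    for key, value in map_record(record, m_rows=2, n_cols=2):
        groups[key].append(value)

P = dict(reduce_cell(key, values) for key, values in groups.items())
# P == {(0, 0): 19.0, (0, 1): 22.0, (1, 0): 43.0, (1, 1): 50.0}
```

Each matrix entry is copied once per output cell that needs it, which keeps the job to a single pass at the cost of more intermediate data.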

Application

• Run the application on Hadoop

Thank you!

Q & A