Databases 2 (VU) (707.030) - Map-Reduce (kti.tugraz.at/staff/denis/courses/dbase2/mp_dbase2.pdf)


Databases 2 (VU) (707.030) Map-Reduce

Denis Helic

KMI, TU Graz

Nov 4, 2013

Denis Helic (KMI, TU Graz) Map-Reduce Nov 4, 2013 1 / 90


Outline

1 Motivation

2 Large Scale Computation

3 Map-Reduce

4 Environment

5 Map-Reduce Skew

6 Map-Reduce Examples

Slides

Slides are partially based on the “Mining Massive Datasets” course from Stanford University by Jure Leskovec


Motivation

Map-Reduce

Today’s data is huge

Challenges:

How to distribute computation?

Distributed/parallel programming is hard

Map-reduce addresses both of these points

Google’s computational/data manipulation model

Elegant way to work with huge data


Single node architecture

[Figure: a single node consisting of CPU, memory, and disk]

Data fits in memory

Machine learning, statistics

“Classical” data mining


Motivation: Google example

20+ billion Web pages

Approx. 20 KB per page

Approx. 400+ TB for the whole Web

Approx. 1000 hard drives to store the Web

A single computer reads 30-35 MB/s from disk

Approx. 4 months to read the Web with a single computer
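The estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the slide's approximate figures; the function and constant names are chosen for illustration.

```python
# All figures are the slide's approximations.
PAGES = 20e9                # 20+ billion Web pages
PAGE_SIZE = 20e3            # approx. 20 KB per page, in bytes
READ_RATES = (30e6, 35e6)   # 30-35 MB/s, in bytes per second


def web_size_bytes(pages=PAGES, page_size=PAGE_SIZE):
    """Total size of the Web under the slide's assumptions."""
    return pages * page_size


def days_to_read(total_bytes, bytes_per_second):
    """Days a single disk needs to read total_bytes sequentially."""
    return total_bytes / bytes_per_second / 86_400


total = web_size_bytes()
print(f"Web size: {total / 1e12:.0f} TB")
for rate in READ_RATES:
    print(f"{rate / 1e6:.0f} MB/s -> {days_to_read(total, rate):.0f} days")
```

With these numbers a single computer needs somewhere between four and five months, which matches the order of magnitude on the slide.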


Motivation: Google example

Takes even more time to do something with the data

E.g. to calculate the PageRank

Let m be the number of links on the Web

The average degree on the Web is approx. 10, thus m ≈ 2 · 10^11

Each PageRank iteration step needs m multiplications

We need approx. 100+ iteration steps
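To see where the m multiplications per step come from, here is a simplified sketch of one PageRank iteration over an edge list. It deliberately omits damping and dead-end handling, and the tiny graph is hypothetical; it is meant only to show that the work per step grows with the number of links.

```python
def pagerank_step(edges, out_degree, rank):
    """One simplified PageRank iteration (no damping, no dead-end handling)."""
    new_rank = [0.0] * len(rank)
    multiplications = 0
    for src, dst in edges:
        # One multiplication per link: the rank of the source page times the
        # transition weight 1 / out_degree (a matrix entry, precomputable).
        weight = 1.0 / out_degree[src]
        new_rank[dst] += rank[src] * weight
        multiplications += 1
    return new_rank, multiplications


# Tiny hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
out_degree = [2, 1, 1]
rank = [1 / 3] * 3

rank, mults = pagerank_step(edges, out_degree, rank)
print(mults)        # equals the number of links m in the toy graph
print(100 * 2e11)   # multiplications for 100 steps at the slide's m
```

For the Web-scale estimate on the slide this gives roughly 2 · 10^13 multiplications in total.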


Motivation: Google example

Today a standard architecture for such problems is emerging

Cluster of commodity Linux nodes

Commodity network (ethernet) to connect them

2-10 Gbps between racks

1 Gbps within racks

Each rack contains 16-64 nodes


Cluster architecture

[Figure: cluster architecture. Nodes, each with CPU, memory, and disk, are grouped into racks; the nodes of a rack are connected by a switch, and the rack switches are connected by a further switch.]


Motivation: Google example

2011 estimate: Google had approx. 1 million machines

http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers/

Other examples: Facebook, Twitter, Amazon, etc.

But also smaller examples: e.g. Wikipedia

Single source shortest path: O(m + n) time complexity, approx. 260 · 10^6


Large Scale Computation

Large scale computation

Large scale computation for data mining on commodity hardware

Challenges

How to distribute computation?

How can we make it easy to write distributed programs?

How to cope with machine failures?


Large scale computation: machine failures

One server may stay up 3 years (1000 days)

The failure rate per day: p = 10^-3

How many failures per day if we have n machines?

Binomial r.v.


Large scale computation: machine failures

PMF of a Binomial r.v.

$$p(k) = \binom{n}{k} (1 - p)^{n-k} p^k$$

Expectation of a Binomial r.v.

$$E[X] = np$$


Large scale computation: machine failures

n = 1000, E[X] = 1

If we have 1000 machines we lose one per day on average

n = 1000000, E[X] = 1000

If we have 1 million machines (Google) we lose one thousand per day on average
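The binomial model from the previous slides can be evaluated directly. This is a small sketch using the slide's failure rate p = 10^-3; the function names are illustrative.

```python
from math import comb

P_FAIL = 1e-3  # failure rate per machine per day, as on the slides


def binomial_pmf(k, n, p=P_FAIL):
    """Probability of exactly k failures among n machines in one day."""
    return comb(n, k) * (1 - p) ** (n - k) * p ** k


def expected_failures(n, p=P_FAIL):
    """Expectation of a Binomial r.v.: E[X] = n * p."""
    return n * p


for n in (1_000, 1_000_000):
    print(n, expected_failures(n))

# Probability that a 1000-machine cluster sees at least one failure in a day.
print(1 - binomial_pmf(0, 1_000))
```

Even for the small cluster, a day without any failure is the less likely outcome, which is why failure handling has to be part of the computation model.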


Large scale computation: data copying

Copying data over network takes time

Bring data closer to computation

I.e. process data locally at each node

Replicate data to increase reliability


Map-Reduce

Solution: Map-reduce

Storage infrastructure

Distributed file system

Google: GFS

Hadoop: HDFS

Programming model

Map-reduce


Storage infrastructure

Problem: how do we store a file persistently if nodes can fail?

Distributed file system

Provides a global file namespace

Typical usage pattern

Huge files: several 100s of GB to 1 TB

Data is rarely updated in place

Reads and appends are common


Distributed file system

Chunk servers

File is split into contiguous chunks

Typically each chunk is 16-64 MB

Each chunk is replicated (usually 2x or 3x)

Try to keep replicas in different racks
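The chunking and replication rules above can be sketched as follows. This is not how GFS or HDFS actually place replicas; the round-robin placement, the rack and server names, and the 200 MB file are assumptions made only to illustrate "contiguous chunks, replicated in different racks".

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the upper end of the range on the slide
REPLICAS = 3


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) for each contiguous chunk of a file."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]


def place_replicas(num_chunks, racks, replicas=REPLICAS):
    """Assign each chunk to `replicas` servers, each in a different rack.

    `racks` maps a rack name to its list of chunk servers. Round-robin
    placement is an assumption of this sketch.
    """
    rack_names = sorted(racks)
    if replicas > len(rack_names):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for chunk in range(num_chunks):
        chosen = []
        for r in range(replicas):
            rack = rack_names[(chunk + r) % len(rack_names)]
            servers = racks[rack]
            chosen.append((rack, servers[chunk % len(servers)]))
        placement[chunk] = chosen
    return placement


# Hypothetical cluster and a hypothetical 200 MB file.
racks = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"],
         "rack-c": ["c1"], "rack-d": ["d1"]}
chunks = split_into_chunks(200 * 1024 * 1024)
for chunk, servers in place_replicas(len(chunks), racks).items():
    print(chunk, chunks[chunk], servers)
```

Because every replica of a chunk sits in a different rack, the loss of a single node or a single rack leaves at least one copy readable.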


Distributed file system

Master node

Stores metadata about where files are stored

Might be replicated

Client library for file access

Talks to master node to find chunk servers

Connects directly to chunk servers to access data


Distributed file system

Reliable distributed file system

Seamless recovery from node failures

Bring computation directly to data

Chunk servers also used as computation nodes

[Figure: chunks of files C and D (C0, C1, C2, C3, C5, D0, D1) replicated across chunk servers 1 to N. Chunk servers also serve as compute servers.]

Figure: Figure from slides by Jure Leskovec (Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu)


Programming model: Map-reduce

Running example

We want to count the number of occurrences of each word in a collection of documents. In this example, the input file is a repository of documents, and each document is an element.

Example

This has meanwhile become a standard Map-reduce example.


Programming model: Map-reduce

words input_file | sort | uniq -c

Three step process

1 Split file into words, each word on a separate line

2 Group and sort all words

3 Count the occurrences
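The same three steps can be written as a short single-machine Python sketch. The regular expression used to split words is an assumption; the slide does not say how `words` tokenizes its input.

```python
import re
from itertools import groupby


def split_words(text):
    """Step 1: split the text into words, one per list element."""
    return re.findall(r"[A-Za-z']+", text)


def count_words(text):
    """Steps 2 and 3: sort so equal words are adjacent, then count each run."""
    words = sorted(split_words(text))
    return [(word, len(list(run))) for word, run in groupby(words)]


print(count_words("the film series and the film"))
```

Sorting plays the role of `sort` in the pipeline, and counting each run of equal words plays the role of `uniq -c`.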


Programming model: Map-reduce

This captures the essence of Map-reduce

Split|Group|Count

Naturally parallelizable

E.g. split and count


Programming model: Map-reduce

Sequentially read a lot of data

Map: extract something that you care about as (key, value) pairs

Group by key: sort and shuffle

Reduce: Aggregate, summarize, filter or transform

Write the result

Outline

Outline is always the same: Map and Reduce change to fit the problem


Programming model: Map-reduce

possibility that one of these tasks will fail to execute. In brief, a map-reduce computation executes as follows:

1. Some number of Map tasks each are given one or more chunks from a distributed file system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.

2. The key-value pairs from each Map task are collected by a master controller and sorted by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.

3. The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way. The manner of combination of values is determined by the code written by the user for the Reduce function.

Figure 2.2 suggests this computation.

[Figure 2.2: Schematic of a map-reduce computation. Input chunks feed Map tasks, which emit key-value pairs (k, v); these are grouped by key into keys with all their values (k, [v, w, ...]), which Reduce tasks turn into the combined output.]

2.2.1 The Map Tasks

We view input files for a Map task as consisting of elements, which can be any type: a tuple or a document, for example. A chunk is a collection of elements, and no element is stored across two chunks. Technically, all inputs

Figure: Figure from the book: “Mining massive datasets”


Programming model: Map-reduce

Input: a set of (key, value) pairs (e.g. key is the filename, value is a single line in the file)

Map(k, v) → (k′, v′)*

Takes a (k, v) pair and outputs a set of (k′, v′) pairs

There is one Map call for each (k, v) pair

Reduce(k′, (v′)*) → (k″, v″)*

All values v′ with the same key k′ are reduced together and processed in v′ order

There is one Reduce call for each unique k′


Programming model: Map-reduce

Big Document

Star Wars is an American epic space opera franchise centered on a film series created by George Lucas. The film series has spawned an extensive media franchise called the Expanded Universe including books, television series, computer and video games, and comic books. These supplements to the two film trilogies...

Map

(Star, 1) (Wars, 1) (is, 1) (an, 1) (American, 1) (epic, 1) (space, 1) (opera, 1) (franchise, 1) (centered, 1) (on, 1) (a, 1) (film, 1) (series, 1) (created, 1) (by, 1) ...

Group by key

(Star, 1) (Star, 1)
(Wars, 1) (Wars, 1)
(a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1)
(film, 1) (film, 1) (film, 1)
(franchise, 1)
(series, 1) (series, 1)
...

Reduce

(Star, 2) (Wars, 2) (a, 6) (film, 3) (franchise, 1) (series, 2) ...


Programming model: Map-reduce

map(key, value):
    // key: document name
    // value: a single line from a document
    foreach word w in value:
        emit(w, 1)


Programming model: Map-reduce

reduce(key, values):
    // key: a word
    // values: an iterator over counts
    result = 0
    foreach count c in values:
        result += c
    emit(key, result)
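The map and reduce pseudocode can be run on a single machine with a small driver that performs the group-by-key step in memory. This is only a sketch of the programming model: `map_reduce` is a hypothetical helper, not the API of Hadoop or any other framework, and it ignores distribution and failures entirely.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name, value: a single line from the document
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    # key: a word, values: an iterable of counts
    yield key, sum(values)


def map_reduce(records, map_fn, reduce_fn):
    """Run map, group by key, and reduce sequentially in one process."""
    groups = defaultdict(list)
    for key, value in records:                 # Map phase
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)              # Group by key
    output = []
    for k2 in sorted(groups):                  # Reduce phase, one call per key
        output.extend(reduce_fn(k2, groups[k2]))
    return output


records = [("doc1", "Star Wars is a film series"),
           ("doc1", "a film a series")]
print(map_reduce(records, map_fn, reduce_fn))
```

Only `map_fn` and `reduce_fn` are specific to word counting; the driver stays the same for other problems.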


Environment

Map-reduce computation

Map-reduce environment takes care of:

1 Partitioning the input data

2 Scheduling the program’s execution across a set of machines

3 Performing the group by key step

4 Handling machine failures

5 Managing required inter-machine communication


Map-reduce computation

[Figure: a big document passes through three phases. MAP: reads input and produces a set of key-value pairs. Group by key: collects all pairs with the same key (hash merge, shuffle, sort, partition). Reduce: collects all values belonging to the key and outputs the result.]

Figure: Figure from the course by Jure Leskovec (Stanford University)


Map-reduce computation

All phases are distributed, with many tasks doing the work

Figure: Figure from the course by Jure Leskovec (Stanford University)


Data flow

Input and final output are stored in distributed file system

Scheduler tries to schedule map tasks close to the physical storage location of the input data

Intermediate results are stored on local file systems of Map and Reduce workers

Output is often input to another Map-reduce computation


Coordination: Master

Master node takes care of coordination

Task status, e.g. idle, in-progress, completed

Idle tasks get scheduled as workers become available

When a Map task completes, it notifies the master about the size and location of its intermediate files

Master pushes this info to reducers

Master pings workers periodically to detect failures


Map-reduce execution details

[Figure 2.3: Overview of the execution of a map-reduce program. The User Program forks a Master and several Workers; the Master assigns Map and Reduce tasks; Map workers read the input data and write intermediate files, which Reduce workers read to produce the output file.]

executing at a particular Worker, or completed). A Worker process reports to the Master when it finishes a task, and a new task is scheduled by the Master for that Worker process.

Each Map task is assigned one or more chunks of the input file(s) and executes on it the code written by the user. The Map task creates a file for each Reduce task on the local disk of the Worker that executes the Map task. The Master is informed of the location and sizes of each of these files, and the Reduce task for which each is destined. When a Reduce task is assigned by the Master to a Worker process, that task is given all the files that form its input. The Reduce task executes code written by the user and writes its output to a file that is part of the surrounding distributed file system.

2.2.6 Coping With Node Failures

The worst thing that can happen is that the compute node at which the Master is executing fails. In this case, the entire map-reduce job must be restarted. But only this one node can bring the entire process down; other failures will be managed by the Master, and the map-reduce job will complete eventually.

Suppose the compute node at which a Map worker resides fails. This failure will be detected by the Master, because it periodically pings the Worker processes. All the Map tasks that were assigned to this Worker will have to be redone, even if they had completed. The reason for redoing completed Map

Figure: Figure from the book: “Mining massive datasets”


Map-Reduce Skew

Maximizing parallelism

If we want maximum parallelism then

Use one Reduce task for each reducer (i.e. a single key and its associated value list)

Execute each Reduce task at a different compute node

This plan is typically not the best one


Maximizing parallelism

There is overhead associated with each task we create

We might want to keep the number of Reduce tasks lower than the number of different keys

We do not want to create a task for a key with a “short” list

There are often far more keys than there are compute nodes

E.g. count words from Wikipedia or from the Web


Input data skew: Exercise

Exercise

Suppose we execute the word-count map-reduce program on a large repository such as a copy of the Web. We shall use 100 Map tasks and some number of Reduce tasks.

1 Do you expect there to be significant skew in the times taken by the various reducers to process their value list? Why or why not?

2 If we combine the reducers into a small number of Reduce tasks, say 10 tasks, at random, do you expect the skew to be significant? What if we instead combine the reducers into 10,000 Reduce tasks?

Example

This exercise is based on Exercise 2.2.1 from “Mining Massive Datasets”.


Maximizing parallelism

There is often significant variation in the lengths of the value lists for different keys

Different reducers take different amounts of time to finish

If we make each reducer a separate Reduce task then the task execution times will exhibit significant variance


Input data skew

Input data skew describes an uneven distribution of the number of values per key

Examples include power-law graphs, e.g. the Web or Wikipedia

Other data with Zipfian distribution

E.g. the number of word occurrences


Power-law (Zipf) random variable

PMF

$$p(k) = \frac{k^{-\alpha}}{\zeta(\alpha)}$$

$$k \in \mathbb{N},\ k \geq 1,\ \alpha > 1$$

ζ(α) is the Riemann zeta function

$$\zeta(\alpha) = \sum_{k=1}^{\infty} k^{-\alpha}$$
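The PMF can be evaluated numerically by truncating the infinite sum for ζ(α). The sketch below assumes a cutoff of one million terms, which is an arbitrary choice that is accurate enough for α = 2 and α = 3.

```python
def zeta(alpha, terms=1_000_000):
    """Approximate the Riemann zeta function by a truncated sum (alpha > 1)."""
    return sum(k ** -alpha for k in range(1, terms + 1))


def zipf_pmf(k, alpha, z=None):
    """p(k) = k^(-alpha) / zeta(alpha); pass z to reuse a computed zeta."""
    z = zeta(alpha) if z is None else z
    return k ** -alpha / z


for alpha in (2.0, 3.0):
    z = zeta(alpha)
    print(alpha, [round(zipf_pmf(k, alpha, z), 3) for k in range(1, 6)])
```

Most of the probability mass sits on k = 1, and the tail decays more slowly for the smaller α.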


Power-law (Zipf) random variable

[Figure: Probability mass function of a Zipf random variable for differing α values (α = 2.0, α = 3.0); x-axis: k from 0 to 19, y-axis: probability of k]


Power-law (Zipf) input data

[Figure: Power-law (Zipf) distributed key sizes, plotted with a logarithmic axis from 10^0 to 10^5. Key size: sum = 189681, µ = 1.897, σ² = 58.853]


Tackling input data skew

We need to distribute skewed (power-law) input data over a number of reducers/Reduce tasks/compute nodes

The distribution of the key lengths inside reducers/Reduce tasks/compute nodes should be approximately normal

The variance of these distributions should be smaller than the original variance

If the variance is small, efficient load balancing is possible


Tackling input data skew

Each Reduce task receives a number of keys

The total number of values to process is the sum of the number of values over all keys

The average number of values that a Reduce task processes is the average of the number of values over all keys

Equivalently, each compute node receives a number of Reduce tasks

The sum and average for a compute node are the sum and average over all Reduce tasks for that node


Tackling input data skew

How should we distribute keys to Reduce tasks?

Uniformly at random

Other possibilities?

Calculate the capacity of a single Reduce task

Add keys until capacity is reached, etc.
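The two strategies named above can be compared with a short sketch. The key sizes below are a hypothetical skewed input, and defining the capacity as the total number of values divided by the number of Reduce tasks is an assumption; the slides do not fix how the capacity is calculated.

```python
import random


def assign_randomly(key_sizes, num_tasks, seed=0):
    """Send each key to a Reduce task chosen uniformly at random."""
    rng = random.Random(seed)
    loads = [0] * num_tasks
    for size in key_sizes.values():
        loads[rng.randrange(num_tasks)] += size
    return loads


def assign_by_capacity(key_sizes, num_tasks):
    """Fill one Reduce task until its capacity is reached, then use the next."""
    capacity = sum(key_sizes.values()) / num_tasks
    loads = [0] * num_tasks
    task = 0
    for size in key_sizes.values():
        if loads[task] >= capacity and task < num_tasks - 1:
            task += 1
        loads[task] += size
    return loads


# Hypothetical skewed input: key i has roughly 1000 / i values.
key_sizes = {f"key{i}": max(1, 1000 // i) for i in range(1, 501)}
print(assign_randomly(key_sizes, 5))
print(assign_by_capacity(key_sizes, 5))
```

Random assignment needs no knowledge of the key sizes, whereas the capacity-based variant requires the value-list lengths to be known before the keys are distributed.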


Tackling input data skew

We are averaging over a skewed distribution

Are there laws that describe how the averages of sufficiently large samples drawn from a probability distribution behave?

In other words, how are the averages of samples of a r.v. distributed?

Central-limit Theorem


Central-Limit Theorem

The central-limit theorem describes the distribution of the arithmetic mean of sufficiently large samples of independent and identically distributed random variables

The means are approximately normally distributed

The mean of the new distribution equals the mean of the original distribution

The variance of the new distribution equals σ²/n, where σ² is the variance of the original distribution and n is the sample size

Thus, we keep the mean and reduce the variance


Central-Limit Theorem

Theorem

Suppose X_1, ..., X_n are independent and identically distributed r.v. with expectation µ and variance σ². Let Y_n be a r.v. defined as:

$$Y_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

For large n, the CDF F_n(y) of Y_n approaches the CDF of a normal r.v. with mean µ and variance σ²/n:

$$F_n(y) \approx \frac{1}{\sqrt{2\pi\sigma^2/n}} \int_{-\infty}^{y} e^{-\frac{(x-\mu)^2}{2\sigma^2/n}} \, dx$$


Central-Limit Theorem

Practically, it is possible to replace Fn(y) with a normal distribution for n > 30

We should always average over at least 30 values

Example

Approximating uniform r.v. with a normal r.v. by sampling and averaging
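The example can be reproduced with the standard library alone. This is a sketch, not the course's notebook: the number of repetitions and the seed are arbitrary assumptions, and only the mean and variance of the averages are printed instead of a histogram.

```python
import random
from statistics import fmean, pvariance


def sample_means(sample_size, num_samples=20_000, seed=42):
    """Draw num_samples samples of uniform(0, 1) values and average each one."""
    rng = random.Random(seed)
    return [fmean(rng.random() for _ in range(sample_size))
            for _ in range(num_samples)]


for n in (1, 30):
    means = sample_means(n)
    print(n, round(fmean(means), 3), round(pvariance(means), 5))
```

A uniform r.v. on [0, 1] has variance 1/12, so the variance of the averages should be close to 1/12 for n = 1 and close to 1/(12 · 30) for n = 30, while the mean stays near 0.5.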


Central-Limit Theorem

[Figure: Histogram over the range 0.0 to 1.0. Averages: µ = 0.5, σ² = 0.08333]


Central-Limit Theorem

[Figure: distribution plot, x-axis from 0.30 to 0.70. Title: Averages: µ=0.499, σ²=0.00270]


Map-Reduce Skew

Central-Limit Theorem

IPython Notebook examples

http://kti.tugraz.at/staff/denis/courses/kddm1/clt.ipynb

Command Line

ipython notebook --pylab=inline clt.ipynb


Map-Reduce Skew

Input data skew

We can reduce the impact of the skew by using fewer Reduce tasks than there are reducers

If keys are sent randomly to Reduce tasks, we average over value list lengths

Thus, we average over the total time for each Reduce task (Central-limit Theorem)

We should make sure that the sample size is large enough (n > 30)
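The effect of grouping reducers into Reduce tasks can be illustrated with a small simulation. This is a sketch under stated assumptions: the heavy-tailed workload (`paretovariate`), the number of keys and the number of tasks are hypothetical and are not taken from the course notebook.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical skewed workload: the value-list length of each key (reducer).
key_sizes = [int(rng.paretovariate(1.5)) for _ in range(100_000)]

def task_averages(sizes, num_tasks, rng):
    """Send each key to a random Reduce task; return the mean list length per task."""
    totals = [0] * num_tasks
    counts = [0] * num_tasks
    for size in sizes:
        t = rng.randrange(num_tasks)
        totals[t] += size
        counts[t] += 1
    return [tot / cnt for tot, cnt in zip(totals, counts) if cnt > 0]

# 1000 tasks for 100,000 keys: about 100 keys per task, well above n > 30.
per_task = task_averages(key_sizes, 1000, rng)

print("variance over keys          :", statistics.pvariance(key_sizes))
print("variance over task averages :", statistics.pvariance(per_task))
```

The variance of the per-task averages comes out far smaller than the variance of the individual value-list lengths, which is the averaging effect the slide refers to.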


Map-Reduce Skew

Input data skew

We can further reduce the skew by using more Reduce tasks than there are compute nodes

Long Reduce tasks might occupy a compute node fully

Several shorter Reduce tasks are executed sequentially at a single compute node

Thus, we average over the total time for each compute node (Central-limit Theorem)

We should make sure that the sample size is large enough (n > 30)
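A second averaging step, from Reduce tasks to compute nodes, can be sketched in the same way. The workload, the task count (1000), the node count (10) and the round-robin placement are all assumptions made for this illustration; a real scheduler would hand out tasks dynamically.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical skewed workload: the value-list length of each key (reducer).
key_sizes = [int(rng.paretovariate(1.5)) for _ in range(100_000)]

NUM_TASKS, NUM_NODES = 1000, 10

# Step 1: each key goes to a random Reduce task.
task_of_key = [rng.randrange(NUM_TASKS) for _ in key_sizes]

# Step 2: tasks are dealt to compute nodes round-robin, so each node runs 100 tasks.
node_total = [0] * NUM_NODES
node_keys = [0] * NUM_NODES
for size, task in zip(key_sizes, task_of_key):
    node = task % NUM_NODES
    node_total[node] += size
    node_keys[node] += 1

node_avg = [tot / cnt for tot, cnt in zip(node_total, node_keys)]

print("variance over keys          :", statistics.pvariance(key_sizes))
print("variance over node averages :", statistics.pvariance(node_avg))
```

Each node now averages over roughly 10,000 keys, so the per-node averages lie very close together even though individual keys are heavily skewed.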


Map-Reduce Skew

Input data skew

[Figure: plot with x-axis from 0 to 12 and logarithmic y-axis from 10^0 to 10^5. Title: Key size: sum=196524, µ=1.965, σ²=243.245]


Map-Reduce Skew

Input data skew

[Figure: plot with x-axis from 0 to 1000 and y-axis from 0 to 4500. Title: Task key size: sum=196524, µ=196.524, σ²=25136.428]


Map-Reduce Skew

Input data skew

[Figure: plot with x-axis from 0 to 6 and logarithmic y-axis from 10^0 to 10^2. Title: Task key averages: µ=1.958, σ²=1.886]


Map-Reduce Skew

Input data skew

[Figure: plot with x-axis from 0 to 10 and y-axis from 0 to 25000. Title: Node key size: sum=196524, µ=19652.400, σ²=242116.267]

Node key averages: µ=1.976, σ²=0.030


Map-Reduce Skew

Input data skew

IPython Notebook examples

http://kti.tugraz.at/staff/denis/courses/kddm1/mrskew.ipynb

Command Line

ipython notebook --pylab=inline mrskew.ipynb


Map-Reduce Skew

Combiners

Sometimes a Reduce function is associative and commutative

Commutative: x ◦ y = y ◦ x

Associative: (x ◦ y) ◦ z = x ◦ (y ◦ z)

The values can be combined in any order, with the same result

The addition in the reducer of the word-count example is such an operation


Map-Reduce Skew

Combiners

When the Reduce function is associative and commutative we can push some of the reducers’ work to the Map tasks

E.g. instead of emitting (w, 1), (w, 1), . . . we can apply the Reduce function within the Map task

In that way the output of the Map task is “combined” before grouping and sorting
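A combiner for word count can be sketched as follows. The helper names (`map_task`, `combine`, `reduce_all`) and the two input chunks are hypothetical; the sketch only shows that combining inside each Map task shrinks the intermediate data without changing the final counts.

```python
from collections import Counter, defaultdict

def map_task(chunk):
    """Map: emit (word, 1) for every word in the chunk."""
    return [(word, 1) for word in chunk.split()]

def combine(pairs):
    """Combiner: apply the Reduce function (a sum) to one Map task's output."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_all(pairs):
    """Group by key, then sum each value list."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return {word: sum(counts) for word, counts in groups.items()}

chunks = ["a film a series a film", "a franchise a film"]  # illustrative input
plain = [p for c in chunks for p in map_task(c)]
combined = [p for c in chunks for p in combine(map_task(c))]

print(len(plain), len(combined))   # number of pairs sent to grouping
print(reduce_all(combined))
```

Because addition is associative and commutative, `reduce_all(plain)` and `reduce_all(combined)` agree, while the combined run ships fewer pairs to the grouping step.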


Map-Reduce Skew

Combiners

Map: (Star, 1), (Wars, 1), (is, 1), . . . (American, 1), (epic, 1), (space, 1), . . . (franchise, 1), (centered, 1), (on, 1), . . . (film, 1), (series, 1), (created, 1), . . .

Combiner: (Star, 2), . . . (Wars, 2), (a, 6), . . . (a, 3), . . . (film, 3), (franchise, 1), (series, 2), . . .

Group by key: (Star, 2), (Wars, 2), (a, 6), (a, 3), (a, 4), . . . (film, 3), (franchise, 1), (series, 2), . . .

Reduce: (Star, 2), (Wars, 2), (a, 13), (film, 3), (franchise, 1), (series, 2), . . .


Map-Reduce Skew

Input data skew: Exercise

Exercise

Suppose we execute the word-count map-reduce program on a large repository such as a copy of the Web. We shall use 100 Map tasks and some number of Reduce tasks.

1. Do you expect there to be significant skew in the times taken by the various reducers to process their value list? Why or why not?

2. If we combine the reducers into a small number of Reduce tasks, say 10 tasks, at random, do you expect the skew to be significant? What if we instead combine the reducers into 10,000 Reduce tasks?

3. Suppose we do use a combiner at the 100 Map tasks. Do you expect skew to be significant? Why or why not?


Map-Reduce Skew

Input data skew

[Figure: plot with x-axis from 0 to 12 and logarithmic y-axis from 10^0 to 10^5. Title: Key size: sum=195279, µ=1.953, σ²=83.105]


Map-Reduce Skew

Input data skew

[Figure: plot with x-axis from 0 to 7 and logarithmic y-axis from 10^0 to 10^5. Title: Task key averages: µ=1.793, σ²=10.986]


Map-Reduce Skew

Input data skew

IPython Notebook examples

http://kti.tugraz.at/staff/denis/courses/kddm1/combiner.ipynb

Command Line

ipython notebook --pylab=inline combiner.ipynb


Map-Reduce Examples

Map-Reduce: Applications

Map-reduce computation makes sense when files are large and rarely updated in place

E.g. we will not see map-reduce computation when managing online sales

We will not see map-reduce for handling Web requests (even if we have millions of users)

However, you would use map-reduce for analytic queries on the data generated by, e.g., a Web application

E.g. find users with similar buying patterns


Map-Reduce Examples

Map-Reduce: Applications

Computations such as analytic queries typically involve matrix operations

The original purpose of the map-reduce implementation was to execute large matrix-vector multiplications to calculate the PageRank

Matrix operations such as matrix-matrix and matrix-vector multiplications fit nicely into the map-reduce programming model

Another important class of operations that can use map-reduce effectively are the relational-algebra operations


Map-Reduce Examples

Map-Reduce: Exercise

Exercise

Suppose we have an n × n matrix M, whose element in row i and column j will be denoted m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the matrix-vector product is the vector x of length n, whose ith element is given by

$$x_i = \sum_{j=1}^{n} m_{ij} v_j$$

Outline a Map-Reduce program that calculates the vector x.


Map-Reduce Examples

Matrix-vector multiplication

Let us first assume that the vector v is large but still fits into memory

The matrix M and the vector v will each be stored in a file of the DFS

We assume that the row-column coordinates (indices) of a matrix element can be discovered

E.g. each value is stored as a triple (i, j, m_ij)

The position of v_j can be discovered in an analogous way


Map-Reduce Examples

Matrix-vector multiplication

The Map Function:

The map function applies to one element of the matrix M

The vector v is first read in its entirety and is subsequently available to all Map tasks at that compute node

From each matrix element m_ij the map function produces the key-value pair (i, m_ij v_j)

All terms of the sum that make up the component x_i of the matrix-vector product will get the same key i

The Reduce Function:

The reduce function sums all the values associated with a given key i

The result is a pair (i , xi )
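The Map and Reduce functions above can be sketched in plain Python. This is a single-process illustration, not a distributed implementation: the names `map_element`, `reduce_row` and `matvec` are hypothetical, indices are 0-based, and a dictionary stands in for the group-by-key step of the framework.

```python
from collections import defaultdict

def map_element(triple, v):
    """Map: from (i, j, m_ij) emit (i, m_ij * v_j); v is held in memory."""
    i, j, m_ij = triple
    return (i, m_ij * v[j])

def reduce_row(i, values):
    """Reduce: sum all terms that share the key i, giving (i, x_i)."""
    return (i, sum(values))

def matvec(triples, v):
    groups = defaultdict(list)   # stands in for the group-by-key step
    for t in triples:
        key, value = map_element(t, v)
        groups[key].append(value)
    return dict(reduce_row(i, vals) for i, vals in groups.items())

# M = [[1, 2], [3, 4]] stored as triples (i, j, m_ij); v = [10, 20]
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
x = matvec(M, [10, 20])
print(x)
```

For this input the sketch yields x_0 = 1·10 + 2·20 = 50 and x_1 = 3·10 + 4·20 = 110.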


Map-Reduce Examples

Matrix-vector multiplication

However, it is possible that the vector v cannot fit into main memory

It is not required that the vector v fits into the memory at a compute node, but if it does not there will be a very large number of disk accesses

As an alternative we can divide the matrix M into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height

The goal is to use enough stripes so that the portion of the vector in one stripe can fit into main memory at a compute node


Map-Reduce Examples

Matrix-vector multiplication

[Excerpt from page 30, Chapter 2, “Map-Reduce and the New Software Stack”]

disk accesses as we move pieces of the vector into main memory to multiply components by elements of the matrix. Thus, as an alternative, we can divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes, of the same height. Our goal is to use enough stripes so that the portion of the vector in one stripe can fit conveniently into main memory at a compute node. Figure 2.4 suggests what the partition looks like if the matrix and vector are each divided into five stripes.

[Figure 2.4: Division of a matrix and vector into five stripes (Matrix M, Vector v)]

The ith stripe of the matrix multiplies only components from the ith stripe of the vector. Thus, we can divide the matrix into one file for each stripe, and do the same for the vector. Each Map task is assigned a chunk from one of the stripes of the matrix and gets the entire corresponding stripe of the vector. The Map and Reduce tasks can then act exactly as was described above for the case where Map tasks get the entire vector.

We shall take up matrix-vector multiplication using map-reduce again in Section 5.2. There, because of the particular application (PageRank calculation), we have an additional constraint that the result vector should be partitioned in the same way as the input vector, so the output may become the input for another iteration of the matrix-vector multiplication. We shall see there that the best strategy involves partitioning the matrix M into square blocks, rather than stripes.

2.3.3 Relational-Algebra Operations

There are a number of operations on large-scale data that are used in database queries. Many traditional database applications involve retrieval of small amounts of data, even though the database itself may be large. For example, a query may ask for the bank balance of one particular account. Such queries are not useful applications of map-reduce.

However, there are many operations on data that can be described easily in terms of the common database-query primitives, even if the queries themselves


Map-Reduce Examples

Matrix-vector multiplication

The ith stripe of the matrix multiplies only components from the ith stripe of the vector

We can divide the matrix into one file for each stripe, and do the same for the vector

Each Map task is assigned a chunk from one of the stripes of the matrix and it gets the entire corresponding stripe of the vector

The Map and Reduce tasks can then act exactly as before

The partial results of the individual stripe multiplications must then be summed once more to obtain x
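The striped variant can be sketched as follows. This is an in-memory illustration under assumptions: the function name `striped_matvec`, the `width` parameter and the 0-based indices are hypothetical, and the loop over stripes stands in for Map tasks that each load only their own slice of v.

```python
from collections import defaultdict

def striped_matvec(triples, v, width):
    """Multiply M (as triples (i, j, m_ij)) by v using vertical stripes of
    `width` columns; each stripe needs only the matching slice of v."""
    # Split the matrix into one list per stripe (one file per stripe in the DFS).
    stripes = defaultdict(list)
    for i, j, m_ij in triples:
        stripes[j // width].append((i, j, m_ij))

    x = defaultdict(int)
    for s, elements in stripes.items():
        v_slice = v[s * width:(s + 1) * width]   # the only part of v that is loaded
        partial = defaultdict(int)               # Map and Reduce within one stripe
        for i, j, m_ij in elements:
            partial[i] += m_ij * v_slice[j - s * width]
        for i, value in partial.items():         # final sum across the stripes
            x[i] += value
    return dict(x)

# M = [[1, 2], [3, 4]], v = [10, 20], one column per stripe
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
x = striped_matvec(M, [10, 20], width=1)
print(x)
```

The result matches the unstriped product; only the amount of v held in memory at any one time changes.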


Map-Reduce Examples

Relational-Algebra operations

There are many operations on data that can be described easily in terms of the common database-query primitives

The queries themselves need not be executed within a DBMS

E.g. standard operations on relations

A relation is a table with column headers called attributes

The set of attributes of a relation is called its schema: R(A_1, A_2, . . . , A_n)


Map-Reduce Examples

Example of a relation

From    To
url1    url2
url1    url3
url2    url3
url2    url4
. . .   . . .

Table: Relation Links


Map-Reduce Examples

Example of a relation

A tuple is a pair of URLs such that there is at least one link from the first URL to the second

The first row (url1, url2) states that the Web page at url1 points to the Web page at url2

You would typically have a similar relation stored by a search engine

With billions of tuples


Map-Reduce Examples

Relational-Algebra

Standard operations on relations are

1. Selection: apply a condition C to each tuple and output only tuples that satisfy C

2. Projection: produce from each tuple only a subset S of attributes

3. Union, Intersection, Difference: set operations on tuples

4. Natural Join: given two relations, compare each pair of tuples and output those that agree on all common attributes

5. Grouping and Aggregation: partition the tuples in a relation according to their values in a set of attributes. For each group perform one of the operations such as SUM, COUNT, AVG, MIN or MAX


Map-Reduce Examples

Example of relational algebra

Find the paths of length two in the Web using the Links relation

In other words find triples of URLs (u, v, w) such that there is a link from u to v and a link from v to w

We want to take the natural join of Links with itself

Let us describe this with two copies of Links: L1(U1, U2) and L2(U2, U3)


Map-Reduce Examples

Example of relational algebra

Now we compute L1 ⋈ L2

For each tuple t1 of L1 and each tuple t2 of L2 we check whether their U2 components are the same

If these two components agree we produce (U1, U2, U3) as a result

If we only want to check for the existence of a path of length two we might want to project onto U1 and U3

π_{U1,U3}(L1 ⋈ L2)
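The join and projection can be sketched directly on a small in-memory copy of Links. The four tuples are the rows shown in the example relation; the variable names are assumptions made for the sketch, and the nested comprehension is a naive join rather than a map-reduce program.

```python
# The four rows shown in the Links relation (From, To).
links = {("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")}

# L1(U1, U2) joined with L2(U2, U3): tuples must agree on the shared attribute U2.
joined = {(u1, u2, u3) for (u1, u2) in links for (v2, u3) in links if u2 == v2}

# Projection onto (U1, U3); using a set removes duplicate pairs.
two_hop = {(u1, u3) for (u1, _, u3) in joined}

print(sorted(joined))
print(sorted(two_hop))
```

Here only url1 has paths of length two, both of them through url2.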


Map-Reduce Examples

Example of relational algebra

Imagine that a social-networking site has a relation Friends(User, Friend)

Suppose we want to calculate statistics about the number of friends of each user

In terms of relational algebra we would perform grouping and aggregation:

γ_{User, COUNT(Friend)}(Friends)

This operation groups all tuples by the value of the first component and then counts the number of friends
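On a small in-memory relation the same grouping and counting can be sketched with `collections.Counter`. The user names and tuples are made up for illustration.

```python
from collections import Counter

# Hypothetical Friends(User, Friend) tuples.
friends = [
    ("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
    ("bob", "ann"),
    ("cat", "ann"),
]

# Group by the User component and count the tuples in each group.
friend_count = Counter(user for user, _friend in friends)
print(dict(friend_count))
```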


Map-Reduce Examples

Selection by Map-Reduce

Selection

Given a relation R, we want to compute σ_C(R)

The Map Function:

For each tuple t in R test if it satisfies C

If so, produce the key-value pair (t, t)

The Reduce Function:

The reduce function is identity

It simply passes each key-value pair to the output
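Selection can be sketched as a pair of plain Python functions. The function names, the sample relation and the condition C (From equals url1) are hypothetical; the list comprehensions stand in for the framework that would run the Map and Reduce tasks.

```python
def map_select(t, condition):
    """Map: emit (t, t) only if tuple t satisfies the condition C."""
    return [(t, t)] if condition(t) else []

def reduce_identity(key, values):
    """Reduce: pass every key-value pair through unchanged."""
    return [(key, v) for v in values]

R = [("url1", "url2"), ("url1", "url3"), ("url2", "url3")]

def C(t):
    return t[0] == "url1"   # hypothetical condition: From = url1

pairs = [kv for t in R for kv in map_select(t, C)]
output = [kv for k, v in pairs for kv in reduce_identity(k, [v])]
selected = [k for k, _ in output]
print(selected)
```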


Map-Reduce Examples

Projection by Map-Reduce

Projection

Given a relation R, we want to compute π_S(R)

The Map Function:

For each tuple t in R construct a tuple t′ by eliminating from t those components whose attributes are not in S

Output the key-value pair (t′, t′)

The Reduce Function:

For each key t′ there will be one or more key-value pairs (t′, t′)

The reduce function turns (t′, [t′, t′, . . . , t′]) into (t′, t′)
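A minimal sketch of projection with duplicate elimination follows. The names `map_project` and `reduce_dedup`, the positional `keep` argument and the sample relation are assumptions; the dictionary plays the role of the group-by-key step.

```python
from collections import defaultdict

def map_project(t, keep):
    """Map: build t' from the components of t at the positions in `keep`; emit (t', t')."""
    t_prime = tuple(t[i] for i in keep)
    return (t_prime, t_prime)

def reduce_dedup(key, values):
    """Reduce: turn (t', [t', t', ...]) into a single (t', t')."""
    return (key, key)

R = [("url1", "url2"), ("url1", "url3"), ("url2", "url3")]

groups = defaultdict(list)
for t in R:
    k, v = map_project(t, keep=(0,))   # project onto the first attribute
    groups[k].append(v)

projected = sorted(reduce_dedup(k, vs)[0] for k, vs in groups.items())
print(projected)
```

The two tuples starting with url1 collapse into one output tuple, which is why the Reduce step is needed at all.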


Map-Reduce Examples

Natural-Join by Map-Reduce

Natural Join

Given relations R(A, B) and S(B, C), we want to compute R ⋈ S

The Map Function:

For each tuple (a, b) of R produce the key-value pair (b, (R, a))

For each tuple (b, c) of S produce the key-value pair (b, (S, c))

The Reduce Function:

Each key b will be associated with a list of pairs that are either of the form (R, a) or (S, c)

Construct all triples (a, b, c) by combining each pair (R, a) with each pair (S, c)
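The join can be sketched in plain Python as follows. The function names and the sample relations are hypothetical, and the relation name is passed in as a string tag so that the Reduce function can tell the two sides apart.

```python
from collections import defaultdict

def map_join(name, t):
    """Map: key every tuple by its B-value and tag it with its relation name."""
    if name == "R":
        a, b = t
        return (b, ("R", a))
    b, c = t
    return (b, ("S", c))

def reduce_join(b, tagged):
    """Reduce: pair every (R, a) with every (S, c) that shares the key b."""
    r_side = [x for tag, x in tagged if tag == "R"]
    s_side = [x for tag, x in tagged if tag == "S"]
    return [(a, b, c) for a in r_side for c in s_side]

R = [(1, "x"), (2, "x"), (3, "y")]   # hypothetical R(A, B)
S = [("x", 10), ("z", 30)]           # hypothetical S(B, C)

groups = defaultdict(list)           # stands in for the group-by-key step
for name, relation in (("R", R), ("S", S)):
    for t in relation:
        key, value = map_join(name, t)
        groups[key].append(value)

result = sorted(t for b, vals in groups.items() for t in reduce_join(b, vals))
print(result)
```

Keys that appear in only one relation (here "y" and "z") produce no output, as expected for a natural join.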


Map-Reduce Examples

Grouping and Aggregation by Map-Reduce

Grouping and Aggregation

Given a relation R(A, B, C), we want to compute γ_{A,θ(B)}(R)

The Map Function:

Map produces the grouping

For each tuple (a, b, c) produce the key-value pair (a, b)

The Reduce Function:

The reduce function produces the aggregation

Each key a represents a group

Apply the aggregation operator θ to the list [b_1, b_2, . . . , b_n] of the values associated with a

The output is a pair (a, x), where x is the result of θ applied to the list
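Grouping and aggregation can be sketched with the aggregation operator θ passed in as an ordinary Python function. The names `map_group` and `reduce_aggregate` and the sample relation are assumptions made for the illustration.

```python
from collections import defaultdict

def map_group(t):
    """Map: from (a, b, c) emit (a, b); the attribute C is dropped."""
    a, b, _c = t
    return (a, b)

def reduce_aggregate(a, values, theta):
    """Reduce: apply the aggregation operator theta to the values of group a."""
    return (a, theta(values))

R = [("u1", 5, "p"), ("u1", 7, "q"), ("u2", 1, "r")]   # hypothetical R(A, B, C)

groups = defaultdict(list)   # stands in for the group-by-key step
for t in R:
    a, b = map_group(t)
    groups[a].append(b)

sums = dict(reduce_aggregate(a, bs, sum) for a, bs in groups.items())
counts = dict(reduce_aggregate(a, bs, len) for a, bs in groups.items())
print(sums, counts)
```

Swapping `sum` for `len`, `min` or `max` gives COUNT, MIN or MAX without touching the Map function.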


Map-Reduce Examples

Matrix-matrix multiplication

Exercise

Suppose we have an n × n matrix M, whose element in row i and column j will be denoted m_ij. Suppose we also have another n × n matrix N with elements n_jk. Then the matrix-matrix product is the matrix P of dimensions n × n, whose elements p_ik are given by

$$p_{ik} = \sum_{j=1}^{n} m_{ij} n_{jk}$$

Outline a Map-Reduce program that calculates the matrix P.


Map-Reduce Examples

Matrix-matrix multiplication

We can think of a matrix as a relation with three attributes: the row number, the column number and the value in that row and column

Thus, we can represent M as a relation M(I, J, V) with tuples (i, j, m_ij)

Similarly, we can represent N as a relation N(J, K, W) with tuples (j, k, n_jk)

The matrix-matrix product is almost a natural join followed by grouping and aggregation


Map-Reduce Examples

Matrix-matrix multiplication

The common attribute of the two relations is J

The natural join of M(I, J, V) and N(J, K, W) would produce tuples (i, j, k, v, w)

Each five-component tuple represents a pair of matrix elements (m_ij, n_jk)

We need (i, j, k, v × w)

Once we have this relation we can perform grouping and aggregation


Map-Reduce Examples

Matrix-matrix multiplication

The Map Function:

For each matrix element m_ij produce the key-value pair (j, (M, i, m_ij))

Likewise for each matrix element n_jk produce the key-value pair (j, (N, k, n_jk))

M and N are names indicating from which matrix the element originates

The Reduce Function:

For each key j examine the list of associated values

For each value (M, i, m_ij) that comes from M and for each value (N, k, n_jk) that comes from N produce a key-value pair with key equal to (i, k) and value equal to m_ij n_jk


Map-Reduce Examples

Matrix-matrix multiplication

The Map Function:

The Map function is identity

The Reduce Function:

For each key (i, k) produce the sum of the list of values associated with this key

The result is a pair ((i , k), pik)
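The two map-reduce passes for matrix-matrix multiplication can be sketched in one short Python function. This is a single-process illustration with hypothetical names (`group_by_key`, `matmul`) and 0-based indices; the first pass joins on j and multiplies, the second pass sums the products for each (i, k).

```python
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the framework's group-by-key step."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def matmul(M, N):
    """M holds triples (i, j, m_ij); N holds triples (j, k, n_jk)."""
    # Pass 1, Map: key the elements of both matrices by the shared index j.
    pairs = [(j, ("M", i, m)) for i, j, m in M] + [(j, ("N", k, n)) for j, k, n in N]
    # Pass 1, Reduce: for each j, multiply every M-element with every N-element.
    products = []
    for j, values in group_by_key(pairs).items():
        ms = [(i, m) for tag, i, m in values if tag == "M"]
        ns = [(k, n) for tag, k, n in values if tag == "N"]
        products += [((i, k), m * n) for i, m in ms for k, n in ns]
    # Pass 2, Map is the identity; Reduce sums the products for each key (i, k).
    return {ik: sum(vals) for ik, vals in group_by_key(products).items()}

M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]   # [[1, 2], [3, 4]]
N = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]   # [[5, 6], [7, 8]]
P = matmul(M, N)
print(P)
```

For these inputs the product is [[19, 22], [43, 50]], returned as a dictionary keyed by (i, k).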
