Distributed Architecture: Map/Reduce

Software Architecture VO/KU (707.023/707.024)

Roman Kern

KMI, TU Graz

Dec 19, 2012

Outline

1 Introduction

2 Independent operations

3 Distributed operations

4 Summary

Recap

Figure: Client-server style

Figure: Layered system

Peer-to-Peer

Separation between client and server is removed

Each client is also a server at the same time; such a node is called a peer

The goal is to distribute the processing or data among many peers

No central administration or coordination

Introduction

Distributed Architectures

Goal is to achieve a scalable infrastructure

⇒ scale horizontally (scale out)

Different levels of complexity

Depends on the systems and the required attributes

Certain approaches have evolved

Frameworks have been developed

Parallel computing vs. distributed computing

In parallel computing all components share a common memory, typically threads within a single program

In distributed computing each component has its own memory

Typically in distributed computing the individual components are connected over a network

Dedicated programming languages (or extensions) for parallel computing

Fallacies of distributed computing: http://nighthacks.com/roller/jag/resource/Fallacies.html

Different levels of complexity

Lowest complexity for operations which can easily be distributed

If they are independent and short enough to be executed independently from each other

Higher degree of complexity for operations which compute a single result on multiple nodes

Independent operations

In a simple scenario, the system just consists of separate, independent operations

The operations do not require complex interaction

Input data are typically small chunks

Shared repository - all the data is available on all nodes

Still a number of issues to address

1 Group membership

2 Leader election

3 Queues - distribution of workload

4 Distributed locks

5 Barriers

6 Shared resources

7 Configuration

Independent operations - Group membership

Group membership

When a single node comes online...

How does it know where to connect to?

How do the other members know of an added node?

⇒ Peer-to-peer architectural style

Each node is a client as well as a server

Part of the bootstrapping mechanism

Dynamic vs. static

Fully dynamic via broadcast/multicast within local area networks (UDP)

Centralised P2P - e.g. central login components/servers

Static lists of group members (needs to be configurable)

Roman Kern (KMI, TU Graz) Distributed Architecture: Map/Reduce Dec 19, 2012 18 / 61

Page 19: Distributed Architecture: Map/Reduce - KTIkti.tugraz.at/staff/denis/courses/sa/slides_mapreduce.pdf · Distributed Architecture: Map/Reduce Software Architecture VO/KU (707.023/707.024)

Independent operations

Independent operations - Leader Election

Leader Election

Not all nodes are equal, e.g. centralised components in P2P networks

Single node acts as master, others are workers

Some nodes have additional responsibilities (supernodes)

Having centralised components makes some functionality easier to implement

E.g. assigning the workload

Disadvantage: might lead to a single point of failure

⇒ Client-server architectural style

Once the leader has been elected, it takes over the role of the server

All other group members then act as clients
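
A minimal in-memory sketch of one possible election rule: the live member with the smallest id becomes the leader, and the next smallest takes over when the leader leaves. The class Election and its methods are hypothetical names chosen for illustration; a real system would run such a rule over the network, for example on top of a coordination service.

```java
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical sketch: the live node with the smallest id is the leader.
final class Election {
    private final TreeSet<Integer> liveNodes = new TreeSet<>();

    // A node joins the group (group membership).
    synchronized void join(int nodeId) {
        liveNodes.add(nodeId);
    }

    // A node leaves the group or is detected as failed.
    synchronized void leave(int nodeId) {
        liveNodes.remove(nodeId);
    }

    // The leader acts as the server, all other members act as clients.
    synchronized Optional<Integer> leader() {
        return liveNodes.isEmpty() ? Optional.empty() : Optional.of(liveNodes.first());
    }

    public static void main(String[] args) {
        Election group = new Election();
        group.join(7);
        group.join(3);
        group.join(5);
        System.out.println("leader: " + group.leader().get());     // leader: 3
        group.leave(3);                                             // the leader fails
        System.out.println("new leader: " + group.leader().get()); // new leader: 5
    }
}
```

The sketch sidesteps the hard part, namely that all nodes have to agree on the set of live members. That agreement is what the single point of failure concern above is about.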

Independent operations - Queues

Queues

Important component in many distributed systems

Two types of nodes: manager of the queue, workers

Incoming requests are collected at a single point

And are stored as items in a queue

Many client nodes consume items from the queue

Queues are often FIFO (first-in, first-out)

Sometimes specific items are of higher priority

Crucial aspect is the coordinated access to the queue

Each item is only processed by a single client

What if the client crashes while processing an item from the queue?
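
One common answer to the crash question, sketched under assumptions: the queue manager remembers which client holds which item and only forgets an item after an acknowledgement. The class WorkQueue, its method names and the way a crash is reported (workerFailed) are hypothetical; items are assumed to be unique strings.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical queue manager: an item is handed to exactly one worker and is
// only dropped for good once that worker acknowledges it.
final class WorkQueue {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Map<String, String> inFlight = new HashMap<>(); // item -> worker

    synchronized void submit(String item) {
        pending.addLast(item);                      // FIFO: new items go to the end
    }

    // Returns the oldest pending item, or null if there is none.
    synchronized String take(String worker) {
        String item = pending.pollFirst();
        if (item != null) {
            inFlight.put(item, worker);
        }
        return item;
    }

    // The worker reports that the item was processed successfully.
    synchronized void ack(String item) {
        inFlight.remove(item);
    }

    // Called when a worker is considered crashed: its unfinished items are queued again.
    synchronized void workerFailed(String worker) {
        Iterator<Map.Entry<String, String>> it = inFlight.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (entry.getValue().equals(worker)) {
                pending.addFirst(entry.getKey());
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        WorkQueue queue = new WorkQueue();
        queue.submit("a");
        queue.submit("b");
        queue.take("worker-1");                     // worker-1 gets "a" ...
        queue.workerFailed("worker-1");             // ... and crashes before the ack
        System.out.println(queue.take("worker-2")); // a
        System.out.println(queue.take("worker-2")); // b
    }
}
```

With this design an item may be processed more than once if the first worker crashed after finishing but before acknowledging, so the processing should tolerate repeats.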

⇒ Publish-subscribe architectural style

Basically a producer-consumer pattern

Each worker client registers itself

Queue manager notifies the worker of new items

How to schedule the workers: which one should be picked next?

Independent operations - Locks

Distributed Locks

Restrict access to shared resources to only a single node at a time

E.g. allow only a single node to write to a file

May yield many non-trivial problems, for example deadlocks or race conditions

Distributed locks without a central component are very complex to realise

⇒ Blackboard architectural style

The shared repository is responsible for orchestrating the access to the locks

Notifies waiting nodes once the lock has been lifted

This functionality is often coupled with the elected leader
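
A minimal sketch of such a central lock manager, assuming a single lock and nodes identified by name. LockManager and its methods are hypothetical; the return value of release stands in for the notification that would be sent to the next waiting node.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical central lock manager, e.g. hosted by the elected leader.
final class LockManager {
    private String holder;                               // node currently holding the lock
    private final Deque<String> waiting = new ArrayDeque<>();

    // Returns true if the lock was granted immediately; otherwise the node is queued.
    synchronized boolean acquire(String node) {
        if (holder == null) {
            holder = node;
            return true;
        }
        waiting.addLast(node);
        return false;
    }

    // Releases the lock and returns the node that now holds it (null if nobody waits).
    synchronized String release(String node) {
        if (!node.equals(holder)) {
            throw new IllegalStateException(node + " does not hold the lock");
        }
        holder = waiting.pollFirst();                    // longest-waiting node is next
        return holder;
    }

    public static void main(String[] args) {
        LockManager lock = new LockManager();
        System.out.println(lock.acquire("node-1"));      // true
        System.out.println(lock.acquire("node-2"));      // false, node-2 is queued
        System.out.println(lock.release("node-1"));      // node-2
    }
}
```

The sketch does not handle a lock holder that crashes. A real implementation would need a timeout or a failure detector to lift such a lock.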

Independent operations - Barriers

Barriers

Specific type of distributed lock

Synchronise multiple nodes

E.g. multiple nodes should wait until a certain state has been reached

Used when a part of the processing can be done in parallel and some parts cannot be distributed
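
The idea can be tried out within a single process, with threads standing in for nodes. The sketch below uses java.util.concurrent.CyclicBarrier from the Java standard library; the class BarrierDemo and the summing workload are made up for illustration.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Threads play the role of nodes; this is an analogy, not a distributed barrier.
final class BarrierDemo {
    // Each worker adds its id in parallel; the total is taken once all have arrived.
    static int run(int nodes) {
        AtomicInteger partialSum = new AtomicInteger();
        AtomicInteger total = new AtomicInteger(-1);
        // The barrier action is the sequential part: it runs once, after the last arrival.
        CyclicBarrier barrier = new CyclicBarrier(nodes, () -> total.set(partialSum.get()));
        Thread[] workers = new Thread[nodes];
        for (int i = 0; i < nodes; i++) {
            final int id = i + 1;
            workers[i] = new Thread(() -> {
                partialSum.addAndGet(id);            // parallel part of the processing
                try {
                    barrier.await();                 // wait until every worker is done
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4)); // 1 + 2 + 3 + 4 = 10
    }
}
```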

Independent operations - Shared Resources

Shared Resources

If all nodes need to be able to access a common data-structure

Read-only vs. read-write

If read-write, the complexity rises due to synchronisation issues

Apache Zookeeper

Apache Zookeeper is a framework/library to address these coordination issues

Used by Yahoo!, LinkedIn, Facebook

Initially developed by Yahoo!

Now managed by Apache

Alternative approaches: Google Chubby, Microsoft Centrifuge

Components of Zookeeper

Coordination kernel

File-system like API

Synchronisation, Watches, Locks

Configuration

Shared data

Example of a Barrier with Zookeeper

    // Fragment of a class; zk, root, size and name are fields.
    // Uses CreateMode, KeeperException and ZooDefs.Ids from org.apache.zookeeper,
    // Stat from org.apache.zookeeper.data and InetAddress from java.net.
    Barrier(String address, String root, int size)
            throws KeeperException, InterruptedException, UnknownHostException {
        super(address);
        this.root = root;
        this.size = size;

        // Create the barrier node if it does not exist yet
        Stat s = zk.exists(root, false);
        if (s == null) {
            zk.create(root, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // My node name
        name = InetAddress.getLocalHost().getCanonicalHostName();
    }

    boolean enter() throws KeeperException, InterruptedException {
        // Register this node below the barrier node; the entry vanishes if the node dies
        zk.create(root + "/" + name, new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        while (true) {
            synchronized (mutex) {
                // true: set a watch, so that mutex is notified when the children change
                List<String> list = zk.getChildren(root, true);
                if (list.size() < size) {
                    mutex.wait();
                } else {
                    return true;
                }
            }
        }
    }

Example of a Queue with Zookeeper

    int consume() throws KeeperException, InterruptedException {
        Stat stat = null;

        // Get the first element available
        while (true) {
            synchronized (mutex) {
                List<String> list = zk.getChildren(root, true);
                if (!list.isEmpty()) {
                    // Children are named "element<number>"; find the smallest number
                    String minNode = list.get(0);
                    int min = Integer.parseInt(minNode.substring(7));
                    for (String s : list) {
                        int tempValue = Integer.parseInt(s.substring(7));
                        if (tempValue < min) {
                            min = tempValue;
                            minNode = s;   // keep the full name, including leading zeros
                        }
                    }
                    byte[] b = zk.getData(root + "/" + minNode, false, stat);
                    zk.delete(root + "/" + minNode, 0);
                    ByteBuffer buffer = ByteBuffer.wrap(b);
                    return buffer.getInt();
                }
                mutex.wait(); // Going to wait
            }
        }
    }

Distributed operations

Distributed Operations

If the processing cannot be split into separate, independent operations

If the data is too big to fit on a single machine

Need for a distributed processing of a single operation

Contemporary Computing Environment

Hardware basics

Access to data in memory is much faster than access to data on disk (or online).

Disk seeks: No data is transferred from disk while the disk head is being positioned.

Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.

Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks).
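
A small worked calculation shows the effect of seeks. The numbers are assumptions for illustration only (5 ms per seek, a transfer rate of 100 MB/s, 100 MB of data); DiskCost and readMillis are hypothetical names.

```java
// Back-of-the-envelope model: every chunk costs one seek plus its transfer time.
final class DiskCost {
    // Milliseconds needed to read totalBytes in the given number of separate chunks.
    static double readMillis(long totalBytes, int chunks, double seekMillis, double bytesPerMilli) {
        return chunks * seekMillis + totalBytes / bytesPerMilli;
    }

    public static void main(String[] args) {
        long total = 100_000_000L;   // 100 MB of data (assumed)
        double seek = 5.0;           // 5 ms per seek (assumed)
        double rate = 100_000.0;     // 100 MB/s = 100,000 bytes per ms (assumed)
        System.out.println(readMillis(total, 1, seek, rate));      // 1005.0
        System.out.println(readMillis(total, 10_000, seek, rate)); // 51000.0
    }
}
```

Under these assumptions, reading the data as one chunk takes about 1 second, while reading it as 10,000 scattered chunks of 10 KB takes about 51 seconds, almost all of it spent on seeks.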

Map/Reduce

Distributed indexing at Google

For web-scale indexing

Must use a distributed computing cluster

Individual machines are fault-prone

Can unpredictably slow down or fail

Based on distributed file system

Files are stored among different machines

Redundant storage

Information about storage is available to other components

MapReduce

MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing

Motivated by the indexing system at Google, which consists of a number of phases, each implemented in MapReduce

Approach: Bring the code to the data

Distributed computing ... without having to write code for the distribution part.

Google Infrastructure

Google data centres mainly contain commodity machines

Data centres are distributed around the world.

Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)

Estimate: Google installs 100,000 servers each quarter.

Based on expenditures of 200-250 million dollars per year

This would be 10% of the computing capacity of the world

Map/Reduce

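Figure: Map/Reduce data flow. The input data (Split1 to Split5) is read by the map workers (Map1 to Map3), which produce intermediate data; the reduce workers (Reduce1, Reduce2) process the intermediate data and write the output data (Output1, Output2).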

Task of the mapper: read a chunk of the input data and generate intermediate keys plus values

Task of the reducer: process a tuple of an intermediate key plus its values and write the output

Note: Often a number of additional functions need to be provided as well

              Input           Output
    Mapper    k1, v1          list(k2, v2)
    Reducer   k2, list(v2)    list(k3, v3)
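
The signatures in the table can be written down as Java interfaces. The following is a single-machine sketch, not the API of any real Map/Reduce framework: MiniMapReduce, Mapper, Reducer, run and wordCount are hypothetical names, and the grouping step in run stands in for what a framework does across many machines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

final class MiniMapReduce {
    // (k1, v1) -> list(k2, v2), delivered through the emit callback
    interface Mapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
    }

    // (k2, list(v2)) -> list(k3, v3), delivered through the emit callback
    interface Reducer<K2, V2, K3, V3> {
        void reduce(K2 key, List<V2> values, BiConsumer<K3, V3> emit);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> Map<K3, V3> run(
            Map<K1, V1> input,
            Mapper<K1, V1, K2, V2> mapper,
            Reducer<K2, V2, K3, V3> reducer) {
        // Map phase, with the intermediate values grouped by (sorted) key
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> entry : input.entrySet()) {
            mapper.map(entry.getKey(), entry.getValue(),
                    (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        // Reduce phase: one call per intermediate key
        Map<K3, V3> output = new LinkedHashMap<>();
        for (Map.Entry<K2, List<V2>> entry : grouped.entrySet()) {
            reducer.reduce(entry.getKey(), entry.getValue(), output::put);
        }
        return output;
    }

    // Usage example: counting words per collection of documents
    static Map<String, Integer> wordCount(Map<Integer, String> documents) {
        Mapper<Integer, String, String, Integer> mapper = (docId, text, emit) -> {
            for (String word : text.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) emit.accept(word, 1);
            }
        };
        Reducer<String, Integer, String, Integer> reducer = (word, counts, emit) -> {
            int sum = 0;
            for (int count : counts) sum += count;
            emit.accept(word, sum);
        };
        return run(documents, mapper, reducer);
    }

    public static void main(String[] args) {
        Map<Integer, String> documents = new TreeMap<>();
        documents.put(1, "to be or not to be");
        documents.put(2, "be quick");
        System.out.println(wordCount(documents)); // {be=3, not=1, or=1, quick=1, to=2}
    }
}
```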

Example of a Mapper

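    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Fragment of a class; dir (a java.io.File), tokenizeIntoWords and
    // writeToFile are defined elsewhere.
    void countWordsOldSchool() throws IOException {
        Map<String, Integer> wordToCountMap = new HashMap<String, Integer>();
        File[] fileList = dir.listFiles();
        for (File file : fileList) {
            String content = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
            List<String> wordList = tokenizeIntoWords(content);
            for (String word : wordList) {
                // Increment the counter of the word by one
                wordToCountMap.merge(word, 1, Integer::sum);
            }
        }
        writeToFile(wordToCountMap);
    }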

Example of a Mapper

    // emit() hands an intermediate (key, value) pair to the framework; it is
    // not called yield because that is a restricted identifier in recent Java
    void map(int documentId, String content) {
        List<String> wordList = tokenizeIntoWords(content);
        for (String word : wordList) {
            emit(word, 1);
        }
    }

Example of a Reducer

    // Called once per word with all counts emitted for it by the mappers
    void reduce(String word, List<Integer> countList) {
        int counter = 0;
        for (Integer count : countList) {
            counter += count;
        }
        write(word, counter);
    }

Overview Inverted Index

Input: Documents to be indexed; input documents are parsed and text is extracted

    3 ↦ Friends, Romans, countrymen.

Tokenizer: Produces a token stream from the text

    3 ↦ Friends Romans countrymen

Linguistic models: Analyse and modify the tokens

    3 ↦ friends romans countrymen

Indexer: Collects the tokens and inverts the data structure

    countrymen ↦ 2 → 3 ...

    friends ↦ 1 → 3 → 7 ...

    romans ↦ 3 → 9 ...
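
The first stages of this pipeline can be sketched in a few lines. The class Analyzer is a hypothetical name, and the linguistic model is reduced to case folding; real systems typically add steps such as stemming or stop-word removal at this point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class Analyzer {
    // Tokenizer: splits the text at anything that is not a letter or a digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    // Linguistic model, here only case folding.
    static List<String> normalize(List<String> tokens) {
        List<String> normalized = new ArrayList<>();
        for (String token : tokens) normalized.add(token.toLowerCase(Locale.ROOT));
        return normalized;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Friends, Romans, countrymen.");
        System.out.println(tokens);            // [Friends, Romans, countrymen]
        System.out.println(normalize(tokens)); // [friends, romans, countrymen]
    }
}
```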

Detail Inverted Index

Step 1: Build term-document table

Document 1: I did enact Julius Caesar I was killed in the Capitol; Brutus killed me.

Document 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

    Term       Doc #
    i          1
    did        1
    enact      1
    julius     1
    caesar     1
    i          1
    was        1
    killed     1
    in         1
    the        1
    capitol    1
    brutus     1
    killed     1
    me         1
    so         2
    let        2
    it         2
    be         2
    with       2
    caesar     2
    ...        ...

Step 2: Sort by terms

    Term       Doc #
    ambitious  2
    be         2
    brutus     1
    brutus     2
    capitol    1
    caesar     1
    caesar     2
    caesar     2
    did        1
    enact      1
    hath       2
    i          1
    i          1
    in         1
    it         2
    julius     1
    killed     1
    killed     1
    let        2
    me         1
    ...        ...

Step 3: Add term frequency; multiple entries from a single document get merged

    Term       Doc #  TF
    ambitious  2      1
    be         2      1
    brutus     1      1
    brutus     2      1
    capitol    1      1
    caesar     1      1
    caesar     2      2
    did        1      1
    enact      1      1
    hath       2      1
    i          1      2
    in         1      1
    it         2      1
    julius     1      1
    killed     1      2
    let        2      1
    me         1      1
    noble      2      1
    so         2      1
    the        1      1
    ...        ...    ...

Step 4: Result is split into dictionary file and postings file.

Term       Doc #  TF

ambitious  2      1
be         2      1
brutus     1      1
brutus     2      1
capitol    1      1
caesar     1      1
caesar     2      2
did        1      1
enact      1      1
hath       2      1
i          1      2
in         1      1
it         2      1
julius     1      1
killed     1      2
let        2      1
me         1      1
noble      2      1
...        ...    ...

Dictionary

#    Term       DF   CF

0    ambitious  1    1
1    be         1    1
2    brutus     2    2
3    capitol    1    1
4    caesar     2    3
5    did        1    1
6    enact      1    1
7    hath       1    1
8    i          1    2
...  ...        ...  ...

Postings

Term# ↦ {Doc#, TF}

0 ↦ 2,1
1 ↦ 2,1
2 ↦ 1,1 → 2,1
3 ↦ 1,1
4 ↦ 1,1 → 2,2
5 ↦ 1,1
6 ↦ 1,1
7 ↦ 2,1
8 ↦ 1,2
...
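The split in step 4 can be illustrated with a small Python sketch. It reads DF as document frequency (number of documents containing the term) and CF as collection frequency (total occurrences), which matches the numbers on the slide. The function name `split_index` and the in-memory lists standing in for the two files are assumptions of this sketch.

```python
def split_index(triples):
    """Split (term, doc_id, tf) triples, grouped by term, into dictionary and postings.

    dictionary[n] is (term, df, cf); postings[n] is the list of (doc_id, tf)
    for the same term number n.
    """
    dictionary = []
    postings = []
    for term, doc_id, tf in triples:
        # Start a new dictionary row whenever the term changes.
        if not dictionary or dictionary[-1][0] != term:
            dictionary.append((term, 0, 0))
            postings.append([])
        name, df, cf = dictionary[-1]
        dictionary[-1] = (name, df + 1, cf + tf)
        postings[-1].append((doc_id, tf))
    return dictionary, postings

# First rows of the table above, in the order used on the slide.
triples = [("ambitious", 2, 1), ("be", 2, 1), ("brutus", 1, 1), ("brutus", 2, 1),
           ("capitol", 1, 1), ("caesar", 1, 1), ("caesar", 2, 2)]

dictionary, postings = split_index(triples)
print(dictionary[4])  # ('caesar', 2, 3)
print(postings[4])    # [(1, 1), (2, 2)]
```

Keeping the dictionary small and separate from the postings allows the dictionary to be held in memory while the (much larger) postings stay on disk.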


Index Construction

What is the role of the Map/Reduce framework when building such an index?


Index Construction

Recall step 1 of inverted index creation.

Document 1
I did enact Julius Caesar I was killed in the Capitol; Brutus killed me.

Document 2
So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term       Doc #

i          1
did        1
enact      1
julius     1
caesar     1
i          1
was        1
killed     1
in         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
...        ...
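A minimal Python sketch of step 1, assuming a very simple tokenizer (lower-casing and keeping runs of letters); real parsers handle punctuation, numbers and languages far more carefully. The names `DOCS`, `tokenize` and `term_doc_pairs` are made up for this sketch.

```python
import re

# The two example documents from the slide, keyed by document number.
DOCS = {
    1: "I did enact Julius Caesar I was killed in the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    """Lower-case the text and return its alphabetic tokens (simplistic assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def term_doc_pairs(docs):
    """Yield one (term, doc_id) pair per token, in document order."""
    for doc_id, text in docs.items():
        for term in tokenize(text):
            yield term, doc_id

pairs = list(term_doc_pairs(DOCS))
print(pairs[:5])
# [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]
```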


Index Creation

After all documents have been parsed, the inverted file is sorted by terms.

There might be many items to sort.

Term       Doc #

i          1
did        1
enact      1
julius     1
caesar     1
i          1
was        1
killed     1
in         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
...        ...

Term       Doc #

ambitious  2
be         2
brutus     1
brutus     2
capitol    1
caesar     1
caesar     2
caesar     2
did        1
enact      1
hath       2
i          1
i          1
in         1
it         2
julius     1
killed     1
killed     1
let        2
me         1
...        ...
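When there are too many items to sort at once, one common approach is to sort small runs separately and then merge them. The Python sketch below illustrates the idea under simplifying assumptions: the runs stay in memory here, whereas a real system would write each run to disk or sort it on a different machine. The names `sorted_runs` and `merge_sort_pairs` and the run size are made up for this sketch.

```python
import heapq
from itertools import islice

def sorted_runs(pairs, run_size):
    """Cut the pair stream into runs of at most run_size items and sort each run."""
    it = iter(pairs)
    while True:
        run = sorted(islice(it, run_size))
        if not run:
            return
        yield run

def merge_sort_pairs(pairs, run_size=4):
    """Sort (term, doc_id) pairs by term, then doc_id, by merging pre-sorted runs."""
    return list(heapq.merge(*sorted_runs(pairs, run_size)))

pairs = [("i", 1), ("did", 1), ("enact", 1), ("julius", 1), ("caesar", 1),
         ("so", 2), ("let", 2), ("it", 2), ("be", 2), ("with", 2), ("caesar", 2)]

result = merge_sort_pairs(pairs)
print(result[:3])
# [('be', 2), ('caesar', 1), ('caesar', 2)]
```

The merge only ever looks at the current head of each run, so the memory needed during merging depends on the number of runs, not on the total number of items.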


Index Construction

Map step: parse the documents and yield terms as keys

Framework: Sort the keys from the mappers

Reduce: Collect all keys and write out the inverted index
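The three roles can be simulated on a single machine in plain Python. This is a sketch of the division of labour only, not of a real framework API: `map_document`, `shuffle` and `reduce_term` are hypothetical names, and the "framework" part is reduced to grouping and sorting in memory.

```python
import re
from collections import defaultdict

def map_document(doc_id, text):
    """Map step: parse one document and emit (term, doc_id), with the term as key."""
    for term in re.findall(r"[a-z]+", text.lower()):
        yield term, doc_id

def shuffle(mapped):
    """Framework step: group the values by key and deliver the keys in sorted order."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    for key in sorted(groups):
        yield key, groups[key]

def reduce_term(term, doc_ids):
    """Reduce step: turn all doc ids of one term into postings of (doc_id, tf)."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return term, sorted(counts.items())

docs = {
    1: "I did enact Julius Caesar I was killed in the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

mapped = [pair for doc_id, text in docs.items() for pair in map_document(doc_id, text)]
index = dict(reduce_term(term, ids) for term, ids in shuffle(mapped))
print(index["caesar"])  # [(1, 1), (2, 2)]
```

Only `map_document` and `reduce_term` would be written by the application developer; the grouping and sorting in between is what the framework takes over, including its distribution across machines.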


Map/Reduce Framework

Existing open-source framework: Apache Hadoop

Implemented in Java

Grew out of the Apache Nutch project; much of the early development was done at Yahoo!

Now used by many companies and organisations


Summary


If the system needs to be scalable, it needs to be appropriately designed

In a simple scenario, the load is distributed via individual operations

For more demanding operations, specific approaches are necessary


The simple scenario

Scalability is often limited by dedicated central components

E.g. the master node

Performance bottlenecks for shared resources

No guarantee on execution order

Limited suitability for interactive applications


The scenario with a complex operation

Scalability is very good

High complexity when implementing

Not suited for interactive applications


Questions?
