CSCE678 - MapReduce & Hadoop
courses.cse.tamu.edu/.../csce678/s19/slides/mapreduce.pdf


Page 1:

MapReduce

CSCE678

Page 2:

Parallel Computing is Hard (1/2)

• Data partitioning needs to be done by developers

• Which data need to be processed together?

• Synchronization can get pretty complex

• One node waits for data to be produced by other nodes

• Threading (on each node) causes race conditions

[Figure: compute nodes sharing a common data pool]

Page 3:

Parallel Computing is Hard (2/2)

• Wasteful data movement between nodes

• How to schedule the best route for sending the data?

• Scaling computation to more nodes

• Changing the partitions of data during computation

• What if a node fails? Do we rerun everything that ran on it?

[Figure: compute nodes sharing a common data pool]

Page 4:

MapReduce (1/3)

• A programming model for enabling:

• Automatic parallelization and distribution of workloads

• Data movement scheduling and optimization

• Scaling out computation to more commodity servers without affecting any running jobs

• Fault tolerance: handling machine failures


Page 5:

MapReduce (2/3)

• Many data problems can be deconstructed as Map and Reduce operations

[Figure: data splits 1–3 are read by map tasks on separate nodes (Map: extract useful information from each data record); map outputs are written locally, then sorted and read remotely by reduce tasks (Reduce: aggregate or filter multiple records), which write output files 1 and 2.]

Page 6:

MapReduce (3/3)

For each key-value pair, generate a list of key-value outputs

Collect all map outputs with the same key

Aggregated reduce outputs

• Map: (k1, v1) → list(k2, v2)

• Reduce: (k2, list(v2)) → list(v2)
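To make these signatures concrete, here is a minimal single-process Python sketch of the model (not Hadoop's API; the function names are illustrative): it applies a user-supplied map function to every input pair, groups the intermediate pairs by key, and then applies a user-supplied reduce function to each group.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal single-process simulation of the MapReduce model.

    records:   iterable of (k1, v1) input pairs
    map_fn:    (k1, v1) -> list of (k2, v2) pairs
    reduce_fn: (k2, [v2, ...]) -> aggregated output for that key
    """
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle phase: collect all map outputs with the same key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: apply reduce_fn to each key's list of values.
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}
```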

Page 7:

Basic Example: Word Count

• Problem: counting the number of occurrences of each word


• Map: (k1, v1) → list(k2, v2)

• Reduce: (k2, list(v2)) → list(v2)

(A.txt, “Hello This Is Hello Michael”)

(B.txt, “Michael Hello This”)

Page 8:

Basic Example: Word Count

• Problem: counting the number of occurrences of each word


• Map: (k1, v1) → list(k2, v2)

• Reduce: (k2, list(v2)) → list(v2)

Map: for each word in each value, emit (word, 1)
Reduce: for each key, emit (key, sum of all values)

Input:
(A.txt, “Hello This Is Hello Michael”)
(B.txt, “Michael Hello This”)

Map outputs:
(Hello, 1), (This, 1), (Is, 1), (Hello, 1), (Michael, 1)
(Michael, 1), (Hello, 1), (This, 1)

Grouped by key:
(Hello, [1, 1, 1]), (This, [1, 1]), (Is, [1]), (Michael, [1, 1])

Reduce outputs:
Hello: 3, This: 2, Is: 1, Michael: 2
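The same example as a standalone Python sketch (a simulation of the model rather than Hadoop code; all names are illustrative):

```python
from collections import defaultdict

def word_count_map(filename, text):
    # For each word in the value, emit (word, 1).
    return [(word, 1) for word in text.split()]

def word_count_reduce(word, counts):
    # For each key, emit the sum of all its values.
    return sum(counts)

inputs = [
    ("A.txt", "Hello This Is Hello Michael"),
    ("B.txt", "Michael Hello This"),
]

# Map phase
intermediate = []
for k, v in inputs:
    intermediate.extend(word_count_map(k, v))

# Shuffle: group values by key
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase
result = {word: word_count_reduce(word, counts) for word, counts in groups.items()}
print(result)  # {'Hello': 3, 'This': 2, 'Is': 1, 'Michael': 2}
```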

Page 9:

Basic Example: Word Count

It will be seen that this mere painstaking burrower and grub-worm of a poor devil of a Sub-Sub appears to have gone through the long Vaticans and street-stalls of the earth, picking up whatever random allusions to whales he could anyways find in any book whatsoever, sacred or profane. Therefore you must not, in every case at least, take the higgledy-piggledy whale statements, however authentic, in these extracts, for veritable gospel cetology. Far from it. As touching the ancient authors generally, as well as the poets here appearing, these extracts are solely valuable or entertaining, as affording a glancing bird's eye view of what has been promiscuously said, thought, fancied, and sung of Leviathan, by many nations and generations, including our own.

[Figure: map outputs for this passage are unsorted and unaggregated (it 1, will 1, be 1, seen 1, that 1, …); after sorting and reducing, the keys are sorted and aggregated (a 2, aback 2, abaft 2, abandon 3, abandoned 7, …).]

Page 10:

Basic Example: File Grep

• Problem: searching lines containing the word “Michael”


• Map: (k1, v1) → list(k2, v2)

• Reduce: (k2, list(v2)) → list(v2)

(line 1, “Hello This Is Michael”)

(line 2, “Hello Again”)

(line 3, “Michael Hello”)

Page 11:

Basic Example: File Grep

• Problem: searching lines containing the word “Michael”


• Map: (k1, v1) → list(k2, v2)

• Reduce: (k2, list(v2)) → list(v2)

Map: for each value containing “Michael”, emit (value, 1)
Reduce: emit each key-value pair from the input unchanged

Input:
(line 1, “Hello This Is Michael”)
(line 2, “Hello Again”)
(line 3, “Michael Hello”)

Map outputs:
(“Hello This Is Michael”, 1)
(“Michael Hello”, 1)

Reduce outputs (unchanged):
(“Hello This Is Michael”, 1)
(“Michael Hello”, 1)
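A standalone Python sketch of this grep example (again the model, not Hadoop; the pattern and names are illustrative):

```python
def grep_map(line_no, text, pattern="Michael"):
    # For each value containing the pattern, emit (value, 1).
    return [(text, 1)] if pattern in text else []

def grep_reduce(text, counts):
    # Identity reduce: pass matching lines through unchanged.
    return counts

lines = [
    (1, "Hello This Is Michael"),
    (2, "Hello Again"),
    (3, "Michael Hello"),
]

matches = [kv for no, text in lines for kv in grep_map(no, text)]
print(matches)  # [('Hello This Is Michael', 1), ('Michael Hello', 1)]
```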

Page 12:

Apache Hadoop

• The most common open-source MapReduce implementation

• Contains a node manager (resource manager and task scheduler) and a storage manager (HDFS)

• Basis of almost all MapReduce cloud offerings

• Amazon Elastic MapReduce

• Azure HDInsight

• Google Cloud Dataproc


Page 13:

Data Partitioning & Movement

• HDFS (Hadoop Distributed File System)

• Partitions input files into multiple splits (shards)

• Replicates splits (shards) across nodes

[Figure: input files partitioned into Split 1, Split 2, Split 3, Split 4, …, Split M]

Page 14:

Data Partitioning & Movement

• HDFS (Hadoop Distributed File System)

• Partitions input files into multiple splits (shards)

• Replicates splits (shards) across nodes

[Figure: the input files’ four splits are replicated across Nodes 1–4, with each node storing a subset of the splits (e.g., Splits 1 and 3 on one node, Splits 2 and 4 on another).]

Page 15:

Data Partitioning & Movement

• Move data to operations ➔ expensive network I/O

• Move operations to data ➔ cost-effective

[Figure: map tasks are scheduled on the nodes that already store the corresponding splits (Task 1 on a node holding Split 1, and so on). More data replicas = more nodes available for scheduling map tasks — see the sketch below.]
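A hedged Python sketch of this locality-aware placement idea (the data layout, function names, and slot counts are purely illustrative, not Hadoop's actual scheduler): each map task is assigned to a node that already stores a replica of its split whenever one has a free slot, falling back to a remote read otherwise.

```python
# Which nodes store a replica of each split (illustrative layout, not real HDFS metadata).
replicas = {
    "split1": ["node1", "node2"],
    "split2": ["node2", "node4"],
    "split3": ["node1", "node3"],
    "split4": ["node3", "node4"],
}

def schedule_map_tasks(replicas, free_slots):
    """Greedy, locality-aware assignment: prefer a node that already holds the split."""
    assignment = {}
    for split, nodes in replicas.items():
        local = [n for n in nodes if free_slots.get(n, 0) > 0]
        target = local[0] if local else max(free_slots, key=free_slots.get)
        assignment[split] = (target, "local" if local else "remote read")
        free_slots[target] -= 1
    return assignment

print(schedule_map_tasks(replicas, {"node1": 1, "node2": 1, "node3": 1, "node4": 1}))
```

With more replicas per split, the chance that some replica-holding node still has a free slot rises, so fewer tasks fall back to remote reads.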

Page 16:


Scheduling

• Hadoop master forks multiple workers across nodes

• Each worker is a single thread

• Each idle worker can be assigned as:

• Mapper: each works on one data split

• Reducer: each works on one partition of the map outputs

[Figure: the master node's scheduler remote-forks workers on the slave nodes; idle workers are assigned as mappers or reducers.]

Page 17:

Dealing with Stragglers

• Stragglers are workers whose tasks run abnormally long

• Example: a machine with a bad disk can slow down its read from 30MB/s to 1MB/s

• Backup tasks:

• Spawn backups of incomplete tasks when the whole computation is close to completion (sketched below)

• If the backup task finishes first, kill the original task
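A minimal Python sketch of that backup-task policy (the threshold value and all names are my own, purely illustrative):

```python
def maybe_launch_backups(tasks, threshold=0.95):
    """If the job is close to completion, schedule backup copies of
    the remaining in-progress tasks (the stragglers)."""
    done = sum(1 for t in tasks if t["state"] == "done")
    if done / len(tasks) < threshold:
        return []  # too early; backups would just waste resources
    return [t["id"] for t in tasks if t["state"] == "running"]

tasks = [{"id": i, "state": "done"} for i in range(19)] + [{"id": 19, "state": "running"}]
print(maybe_launch_backups(tasks))  # [19] -> spawn a backup; the first copy to finish wins
```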


Page 18:

Fault Tolerance

• Hadoop master pings each node periodically

• Recovery from a node failure

• Both map and reduce are deterministic

• Re-execute any tasks whose outputs have not yet been synced to HDFS

• Can recover from cluster failure or network outage

• Master failure:

• If Hadoop master fails, the whole system needs to abort

• Hadoop 2.0: high availability with two masters


Page 19:

Partitioner

• Decides which reducer processes each map output

• Default partitioner: (k, v) → hash(k) mod #reducers

• Same key ➔ always processed by the same reducer

• Users can customize the partitioner to change how map outputs are grouped for reducers
Ex: dates as keys ➔ group by month (see the sketch below)
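A small Python sketch of both rules (Python's built-in hash stands in for Hadoop's key hashing, and the month partitioner is an illustrative custom rule, not library code):

```python
from datetime import date

def default_partition(key, num_reducers):
    # Default rule: hash the key and take it modulo the number of reducers,
    # so the same key is always sent to the same reducer.
    return hash(key) % num_reducers

def month_partition(key, num_reducers):
    # Custom rule: date keys are grouped by month rather than by exact date.
    return (key.month - 1) % num_reducers

print(default_partition("Hello", 4))           # some reducer in [0, 4)
print(month_partition(date(2019, 3, 14), 12))  # 2 -- every March date maps here
```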

Page 20:

Shuffling & Sorting

• After partitioning, map outputs are sorted by keys


Map outputs:
(Hello, 2), (This, 1), (Is, 1), (Michael, 1)
(Michael, 1), (Hello, 1), (This, 1)

After partitioning and sorting:
(Hello, 2), (Hello, 1), (This, 1), (This, 1)
(Is, 1), (Michael, 1), (Michael, 1)

Reduce inputs:
(Hello, [2, 1]), (This, [1, 1])
(Is, [1]), (Michael, [1, 1])
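The same shuffle step reproduced as a short Python sketch (a single-process illustration of sort-then-group, not Hadoop's shuffle implementation):

```python
from itertools import groupby

# Map outputs from two mappers (matching the example above).
map_outputs = [
    [("Hello", 2), ("This", 1), ("Is", 1), ("Michael", 1)],
    [("Michael", 1), ("Hello", 1), ("This", 1)],
]

# Shuffle: merge all map outputs and sort them by key.
merged = sorted((kv for out in map_outputs for kv in out), key=lambda kv: kv[0])

# Group consecutive pairs with the same key into reduce inputs.
reduce_inputs = {k: [v for _, v in grp] for k, grp in groupby(merged, key=lambda kv: kv[0])}
print(reduce_inputs)
# {'Hello': [2, 1], 'Is': [1], 'Michael': [1, 1], 'This': [1, 1]}
```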

Page 21:

Advanced Example: TeraSort

• Problem: How to sort terabytes of data

[Figure: Map ➔ Partition ➔ Reduce pipeline]

Page 22:

Advanced Example: TeraSort

• Problem: How to sort terabytes of data

Map: (k, v) → (k, v) — default (no-op) mapper

Partition: the key range [Min, Max) is divided into #Reducers equal ranges, and key k is sent to the reducer owning its range, i.e., ⌊(k − Min) / ((Max − Min) / #Reducers)⌋

Reduce: (k, [v]) → (k, [v]) — default (no-op) reducer

Example with two reducers (ranges 0 ≤ k < 10 and 10 ≤ k < 20):
Map outputs: k = 15, 4, 10 and k = 7, 18, 3
Partitioned into ranges: k = 4, 7, 3 (first reducer) and k = 15, 10, 18 (second reducer)
Sorted within each reducer: k = 3, 4, 7 and k = 10, 15, 18
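A small Python sketch of this range partitioner (the function names and the edge-case clamping are my own, illustrative choices):

```python
def range_partition(key, k_min, k_max, num_reducers):
    # Send each key to the reducer responsible for its value range, so that
    # concatenating the reducers' sorted outputs yields a globally sorted result.
    width = (k_max - k_min) / num_reducers
    r = int((key - k_min) // width)
    return min(r, num_reducers - 1)  # clamp the key == k_max edge case

keys = [15, 4, 10, 7, 18, 3]
buckets = {r: [] for r in range(2)}
for k in keys:
    buckets[range_partition(k, 0, 20, 2)].append(k)

# Each "reducer" sorts its own range; the concatenation is globally sorted.
print([sorted(buckets[r]) for r in range(2)])  # [[3, 4, 7], [10, 15, 18]]
```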

Page 23:

TeraSort Performance

• TeraGen + TeraSort + TeraValidate (O’Malley 2008)

• 10 billion key-value pairs

• 910 machines with 4 dual-core Xeon CPUs, 8GB RAM

• 1800 mappers and 1800 reducers

All reducers completed within 209 seconds

Page 24:

Lessons from MapReduce

• A programming model with load distribution in mind

• Good at processing key-value data

• Easily scale out computation to nearly 1000 machines

• Used for calculating PageRank at Google

• Problems:

• Batch-oriented, can take too long to finish a job

• Reducers have to wait for mappers

• Cannot handle relational queries (e.g., SQL)
