TRANSCRIPT
Center-of-Gravity Reduce Task Scheduling to Lower MapReduce Network Traffic
Mohammad Hammoud, M. Suhail Rehman, and Majd F. Sakr
Hadoop MapReduce
• MapReduce is now a pervasive analytics engine on the cloud
• Hadoop is an open source implementation of MapReduce
• Hadoop MapReduce incorporates two phases, a Map phase and a Reduce phase, which encompass multiple map and reduce tasks
[Figure: MapReduce dataflow. HDFS blocks of the input dataset feed map tasks in the Map phase; each map task writes one partition per reduce task. In the Reduce phase, reduce tasks pull their partitions during the Shuffle stage, combine them during the Merge stage, apply the reduce function during the Reduce stage, and write the output back to HDFS.]
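To make the dataflow concrete, here is a minimal, pure-Python sketch (not Hadoop code) of the word-count pattern: map tasks emit (key, value) pairs, a hash partitioner cuts each map output into one partition per reduce task, and reduce tasks merge and aggregate their partitions. All names and the input splits are illustrative.

```python
from collections import defaultdict

NUM_REDUCE_TASKS = 3

def map_task(text):
    """A map task: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in text.split()]

def partition(key):
    """Hash partitioner: assigns each key to one reduce task."""
    return hash(key) % NUM_REDUCE_TASKS

def reduce_task(pairs):
    """A reduce task: merge its shuffled pairs and sum the values per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# "Map phase": each split is processed by one map task, and its output
# is cut into one partition per reduce task.
splits = ["the cat sat on the mat", "the dog sat"]
shuffled = defaultdict(list)            # reduce task id -> its shuffled pairs
for split in splits:
    for key, value in map_task(split):
        shuffled[partition(key)].append((key, value))

# "Reduce phase": every reduce task consumes exactly its own partitions.
for r in range(NUM_REDUCE_TASKS):
    print(r, reduce_task(shuffled[r]))
```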
Task Scheduling in Hadoop
• A golden principle adopted by Hadoop is: “Moving computation towards data is cheaper than moving data towards computation”
• Hadoop applies this principle to Map task scheduling but not to Reduce task scheduling
• In reduce task scheduling, once a slave (a TaskTracker, or TT) polls the master node (the JobTracker, or JT) for a reduce task, R, JT assigns TT any R
[Figure: a two-rack cluster with a core switch (CS), rack switches (RS1, RS2), TaskTrackers TT1-TT5, and the JobTracker (JT). TT1 requests a reduce task and JT assigns it R, even though R's partitions exist at TT4; shuffling them traverses four links, so the Total Network Distance (TND) = 4. This is a locality problem, where R is scheduled at TT1 while its partitions exist at TT4.]
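A minimal sketch of how such a network distance can be computed on a tree topology. The rack layout, the link costs (0 node-local, 2 rack-local, 4 off-rack, matching the four hops in the figure), and all function names are illustrative assumptions, not taken from the Hadoop source.

```python
# Assumed two-rack layout matching the figure above.
RACK = {"TT1": "RS1",
        "TT2": "RS2", "TT3": "RS2", "TT4": "RS2", "TT5": "RS2"}

def network_distance(src, dst):
    """Links traversed to move one partition from src to dst."""
    if src == dst:
        return 0          # node-local: nothing crosses the network
    if RACK[src] == RACK[dst]:
        return 2          # rack-local: up to the rack switch and back down
    return 4              # off-rack: through the core switch

def tnd(host, partition_locations):
    """Total Network Distance of a reduce task R placed on `host`."""
    return sum(network_distance(src, host) for src in partition_locations)

# The slide's example: R runs on TT1 while its only partition lives on TT4.
print(tnd("TT1", ["TT4"]))   # -> 4
```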
Data Locality: A Working Example
• TT1 and TT2 are feeding nodes of a reduce task R
• Every TT is requesting R from JT
• JT can assign R to any TT
[Figure: five possible placements of R on the two-rack cluster. CASE-I and CASE-II yield TND_R = 8, CASE-III yields TND_R = 4, and CASE-IV and CASE-V yield TND_R = 2. Hadoop does not distinguish between the different cases to choose the one that provides best locality (i.e., CASE-IV or CASE-V).]
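A locality-aware scheduler could simply evaluate TND_R for every candidate host and pick the minimum. The sketch below does exactly that; the rack layout is an assumption chosen so the computed values reproduce the five cases above (feeding nodes TT1 and TT2 share a rack, and the best placements are the feeding nodes themselves, as in CASE-IV and CASE-V).

```python
# Assumed layout: TT1-TT3 share rack RS1 with both feeding nodes; TT4 and TT5 sit in RS2.
RACK = {"TT1": "RS1", "TT2": "RS1", "TT3": "RS1", "TT4": "RS2", "TT5": "RS2"}

def network_distance(src, dst):
    if src == dst:
        return 0
    return 2 if RACK[src] == RACK[dst] else 4

def tnd(host, feeding_nodes):
    return sum(network_distance(f, host) for f in feeding_nodes)

feeding = ["TT1", "TT2"]              # R's feeding nodes
for host in RACK:
    print(host, tnd(host, feeding))   # TT1: 2, TT2: 2, TT3: 4, TT4: 8, TT5: 8

# A locality-aware choice would pick the minimum; ties are broken arbitrarily.
print(min(RACK, key=lambda host: tnd(host, feeding)))   # -> "TT1" (or "TT2")
```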
Partitioning Skew in MapReduce
• Hadoop's existing reduce task scheduler is not only locality unaware, but also partitioning skew unaware
• Partitioning skew refers to the significant variance in intermediate keys' frequencies and their distribution across different data nodes
• Partitioning skew has been reported to exist in many scientific applications including feature extraction and bioinformatics, among others
• Partitioning skew causes shuffle skew where some reduce tasks receive more data than others
[Figure: partitioning skew observed in WordCount, Sort, and K-Means.]
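A toy sketch of how skewed key frequencies turn into skewed partition (and hence shuffle) sizes under plain hash partitioning. The key frequencies below are invented purely for illustration.

```python
from collections import Counter

NUM_REDUCE_TASKS = 3

# Invented, heavily skewed key frequencies (a few very hot keys).
key_frequency = {"the": 50_000, "of": 30_000, "gene": 500, "protein": 400,
                 "cat": 20, "dog": 15, "mat": 5}

partition_size = Counter()
for key, freq in key_frequency.items():
    partition_size[hash(key) % NUM_REDUCE_TASKS] += freq

# Some reduce tasks end up receiving far more records than others,
# i.e., partitioning skew causes shuffle skew.
print(dict(partition_size))
```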
Partitioning Skew: A Working Example
• TT1 and TT2 are feeding nodes of a reduce task R
• TT1 and TT2 are requesting R from JT
• R's partitions at TT1 and TT2 are of sizes 100MB and 20MB, respectively
[Figure: two placements of R, both with TND_R = 2 (CASE-IV and CASE-V). Placing R at one feeding node shuffles 100MB, while placing it at the other shuffles only 20MB, yet Hadoop does not consider the partitioning skew exhibited by some MapReduce applications.]
Our Work
• We explore the locality and the partitioning skew problems present in the current Hadoop implementation
• We propose the Center-of-Gravity Reduce Scheduler (CoGRS), a locality-aware, skew-aware reduce task scheduler for MapReduce
• CoGRS attempts to schedule every reduce task, R, at its center-of-gravity node, determined by:
  – The network locations of R's feeding nodes
  – The skew in the sizes of R's partitions
• By scheduling reduce tasks at their center-of-gravity nodes, we argue for diminished network traffic and improved Hadoop performance
Talk Roadmap
• The proposed Center-of-Gravity Reduce Scheduler (CoGRS)
• Tradeoffs: Locality, Concurrency, Load Balancing, and Utilization
• CoGRS and the Shuffle Stage in Hadoop
• Quantitative Methodology and Evaluations
  – CoGRS on a Private Cloud
  – CoGRS on Amazon EC2
• Concluding Remarks
CoGRS Approach
• To address data locality and partitioning skew, CoGRS attempts to place every reduce task, R, at a suitable node that minimizes:
  – (1) the Total Network Distance of R (TND_R)
  – (2) the shuffle data
• We suggest that a suitable node would be the center-of-gravity node, chosen in accordance with:
  – The network locations of R's feeding nodes (to address 1)
  – The weights of R's partitions (to address 2)
• We define the weight of a partition P needed by R as the size of P divided by the total size of all the partitions needed by R
Weighted Total Network Distance
• We propose a new metric called Weighted Total Network Distance (WTND) and define it as follows:
  – WTND_R = Σ_{i=1..n} (ND_i × w_i), where n is the number of R's partitions, ND_i is the network distance required to shuffle partition i to R, and w_i is the weight of partition i
• In principle, the center-of-gravity of R is always one of R's feeding nodes, since it is less expensive to access data locally than to shuffle it over the network
• Hence, we designate the center-of-gravity of R to be the feeding node of R that provides the minimum WTND_R
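A minimal sketch of the center-of-gravity selection under the same assumptions as the earlier sketches (illustrative rack layout, link costs 0/2/4); the function names are hypothetical and this is not the actual CoGRS implementation. The usage example reuses the 100MB/20MB partitions from the partitioning-skew slide.

```python
# Assumed topology: TT1 and TT2 share a rack, TT4 and TT5 sit in another rack.
RACK = {"TT1": "RS1", "TT2": "RS1", "TT3": "RS1", "TT4": "RS2", "TT5": "RS2"}

def network_distance(src, dst):
    if src == dst:
        return 0
    return 2 if RACK[src] == RACK[dst] else 4

def center_of_gravity(partitions):
    """partitions: {feeding node -> partition size in MB} for one reduce task R.
    Returns the feeding node that minimizes WTND_R = sum_i ND_i * w_i."""
    total = sum(partitions.values())
    weights = {node: size / total for node, size in partitions.items()}

    def wtnd(host):
        return sum(network_distance(node, host) * weights[node]
                   for node in partitions)

    # The center-of-gravity is restricted to R's feeding nodes.
    return min(partitions, key=wtnd)

# Example from the partitioning-skew slide: 100MB at TT1, 20MB at TT2.
# Weights are 100/120 and 20/120; placing R at TT1 shuffles only the 20MB.
print(center_of_gravity({"TT1": 100, "TT2": 20}))   # -> "TT1"
```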
Locality, Load Balancing, Concurrency, and Utilization Tradeoffs
• Strictly exploiting data locality can lead to scheduling skew
• CoGRS gives up some locality for the sake of extracting more concurrency and improving load balancing and cluster utilization
[Figure: if reduce tasks R1-R4 were all scheduled at their center-of-gravity node (TT5 in this example), the remaining TaskTrackers would sit idle; when the center-of-gravity node is occupied, CoGRS instead attempts to schedule close to TT5, yielding better utilization and load balancing.]
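Below is a sketch of one plausible way to relax strict center-of-gravity placement when the chosen node has no free reduce slot: fall back to the closest available node (same node, then same rack, then anywhere). It only illustrates the tradeoff described above; the topology, names, and fallback policy are assumptions, not the actual CoGRS code.

```python
RACK = {"TT1": "RS1", "TT2": "RS1", "TT3": "RS1", "TT4": "RS2", "TT5": "RS2"}

def network_distance(src, dst):
    if src == dst:
        return 0
    return 2 if RACK[src] == RACK[dst] else 4

def place(cog_node, free_slots):
    """Pick the free node closest to the center-of-gravity node."""
    if not free_slots:
        return None                      # nothing available; wait and retry
    return min(free_slots, key=lambda node: network_distance(cog_node, node))

# TT5 is the center-of-gravity but is already occupied by another reduce task:
print(place("TT5", free_slots=["TT1", "TT4"]))   # -> "TT4" (same rack as TT5)
```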
CoGRS and Early Shuffle
• To determine the center-of-gravity node of a particular reduce task, R, we need to know the network locations of R's feeding nodes
• These cannot be precisely determined before the Map phase commits, because any map task could lead a cluster node to become a feeding node of R
• Default Hadoop starts scheduling reduce tasks after only 5% of map tasks commit, so as to overlap the Map and Reduce phases
• CoGRS defers early shuffle a little (e.g., until 20% of map tasks commit) so that most (or all) keys (which determine reduce tasks) will likely have been encountered
[Figure: execution timeline (Map, Shuffle, Sort, Reduce stages) for Hadoop with early shuffle turned off (H_OFF), annotated with where early shuffle would begin.]
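A sketch of the deferral logic: reduce tasks are scheduled (and early shuffle begins) only once the fraction of committed map tasks reaches a threshold. In stock Hadoop this fraction is controlled by the mapred.reduce.slowstart.completed.maps property (default 0.05, i.e., 5%); the function below is an illustrative stand-in, and the 0.20 value mirrors the example above.

```python
EARLY_SHUFFLE_THRESHOLD = 0.20   # CoGRS example; default Hadoop uses 0.05

def may_schedule_reduces(committed_maps, total_maps,
                         threshold=EARLY_SHUFFLE_THRESHOLD):
    """Start scheduling reduce tasks (and hence early shuffle) only after
    `threshold` of the map tasks have committed, so that the set of
    feeding nodes per reduce task is mostly known."""
    return committed_maps / total_maps >= threshold

print(may_schedule_reduces(12, 238))   # -> False (about 5% of maps done)
print(may_schedule_reduces(48, 238))   # -> True  (about 20% of maps done)
```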
Quantitative Methodology
Benchmark           Sort1     Sort2         WordCount        K-Means
Key Frequency       Uniform   Non-Uniform   Real Log Files   Random
Data Distribution   Uniform   Non-Uniform   Real Log Files   Random
Dataset Size        14GB      13.8GB        11GB             5.2GB
Map Tasks           238       228           11               84
Reduce Tasks        25        25            3                3
• We evaluate CoGRS on:
  – A private cloud with 14 machines
  – Amazon EC2 with 8, 16, and 32 instances
• We use Apache Hadoop 0.20.2
• We use various benchmarks with different dataset distributions
Timeline: Sort2 (An Example)
[Figure: Sort2 execution timelines (Map, Shuffle, Reduce) under H_OFF, H_ON, and CoGRS. H_OFF ends last, H_ON ends earlier, and CoGRS ends even earlier despite deferring early shuffle a little bit.]
Reduce Network Traffic on Private Cloud
On average, CoGRS increases node-local data by 34.5% and decreases off-rack data by 9.6% versus native Hadoop
[Figure: shuffle traffic breakdown (unshuffled, off-rack, shuffled) for Sort1, Sort2, WordCount, and K-Means under H_ON, H_OFF, and CoGRS.]
Execution Times on Private Cloud
CoGRS outperforms native Hadoop by an average of 3.2% and by up to 6.3%
[Figure: runtimes in seconds of Sort1, Sort2, WordCount, and K-Means under H_ON, H_OFF, and CoGRS.]
CoGRS on Amazon EC2: Sort2
[Figures: shuffle traffic breakdown and execution times for Sort2 on Amazon EC2 clusters of 8, 16, and 32 instances, native Hadoop vs. CoGRS.]
Compared to native Hadoop, on average, CoGRS increases node-local data by 1%, 32%, and 57.9%, and decreases off-rack data by 2.3%, 10%, and 38.6% with cluster sizes of 8, 16, and 32, respectively
This translates to average reductions of 1.9%, 7.4%, and 23.8% in job execution times under CoGRS versus native Hadoop with cluster sizes of 8, 16, and 32, respectively
Concluding Remarks
• In this work we observed that the network load is of special concern with MapReduce
  – A large amount of traffic can be generated during the shuffle stage
  – This can deteriorate Hadoop performance
• We realized that scheduling reduce tasks at their center-of-gravity nodes has positive effects on Hadoop's network traffic and performance
  – Average reductions of 9.6% and 38.6% in off-rack network traffic were achieved on a private cloud and on Amazon EC2, respectively
  – This provided Hadoop with up to 6.3% and 23.8% performance improvements on a private cloud and on Amazon EC2, respectively
• We expect CoGRS to play a major role in MapReduce for applications that exhibit high partitioning skew (e.g., scientific applications)
Thank You!
Questions?