TRANSCRIPT
Center-of-Gravity Reduce Task Scheduling to Lower MapReduce Network Traffic
Mohammad Hammoud, M. Suhail Rehman, and Majd F. Sakr
Hadoop MapReduce
• MapReduce is now a pervasive analytics engine on the cloud
• Hadoop is an open source implementation of MapReduce
• Hadoop MapReduce incorporates two phases, a Map phase and a Reduce phase, which encompass multiple map and reduce tasks
[Figure: MapReduce dataflow. HDFS blocks of the input dataset feed map tasks in the Map phase; each map task writes one partition per reduce task. In the Reduce phase, reduce tasks pull their partitions during the Shuffle stage, combine them during the Merge stage, apply the reduce function during the Reduce stage, and write the output back to HDFS.]
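To make the dataflow concrete, here is a minimal, pure-Python sketch (not Hadoop code) of the word-count pattern: map tasks emit (key, value) pairs, a hash partitioner cuts each map output into one partition per reduce task, and reduce tasks merge and aggregate their partitions. All names and the input splits are illustrative.

```python
from collections import defaultdict

NUM_REDUCE_TASKS = 3

def map_task(text):
    """A map task: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in text.split()]

def partition(key):
    """Hash partitioner: assigns each key to one reduce task."""
    return hash(key) % NUM_REDUCE_TASKS

def reduce_task(pairs):
    """A reduce task: merge its shuffled pairs and sum the values per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# "Map phase": each split is processed by one map task, and its output
# is cut into one partition per reduce task.
splits = ["the cat sat on the mat", "the dog sat"]
shuffled = defaultdict(list)            # reduce task id -> its shuffled pairs
for split in splits:
    for key, value in map_task(split):
        shuffled[partition(key)].append((key, value))

# "Reduce phase": every reduce task consumes exactly its own partitions.
for r in range(NUM_REDUCE_TASKS):
    print(r, reduce_task(shuffled[r]))
```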
Task Scheduling in Hadoop
• A golden principle adopted by Hadoop is: “Moving computation towards data is cheaper than moving data towards computation”
• Hadoop applies this principle to Map task scheduling but not to Reduce task scheduling
• In reduce task scheduling, once a slave (a TaskTracker, or TT) polls the master node (the JobTracker, or JT) for a reduce task, R, JT assigns TT any R
[Figure: a two-rack cluster with a core switch (CS), rack switches (RS1, RS2), TaskTrackers TT1-TT5, and the JobTracker (JT). TT1 requests a reduce task and JT assigns it R, even though R's partitions exist at TT4; shuffling them traverses four links, so the Total Network Distance (TND) = 4. This is a locality problem, where R is scheduled at TT1 while its partitions exist at TT4.]
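A minimal sketch of how such a network distance can be computed on a tree topology. The rack layout, the link costs (0 node-local, 2 rack-local, 4 off-rack, matching the four hops in the figure), and all function names are illustrative assumptions, not taken from the Hadoop source.

```python
# Assumed two-rack layout matching the figure above.
RACK = {"TT1": "RS1",
        "TT2": "RS2", "TT3": "RS2", "TT4": "RS2", "TT5": "RS2"}

def network_distance(src, dst):
    """Links traversed to move one partition from src to dst."""
    if src == dst:
        return 0          # node-local: nothing crosses the network
    if RACK[src] == RACK[dst]:
        return 2          # rack-local: up to the rack switch and back down
    return 4              # off-rack: through the core switch

def tnd(host, partition_locations):
    """Total Network Distance of a reduce task R placed on `host`."""
    return sum(network_distance(src, host) for src in partition_locations)

# The slide's example: R runs on TT1 while its only partition lives on TT4.
print(tnd("TT1", ["TT4"]))   # -> 4
```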
Data Locality: A Working Example
• TT1 and TT2 are feeding nodes of a reduce task R
• Every TT is requesting R from JT
• JT can assign R to any TT
[Figure: five possible placements of R on the two-rack cluster. CASE-I and CASE-II yield TND_R = 8, CASE-III yields TND_R = 4, and CASE-IV and CASE-V yield TND_R = 2. Hadoop does not distinguish between the different cases to choose the one that provides best locality (i.e., CASE-IV or CASE-V).]
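A locality-aware scheduler could simply evaluate TND_R for every candidate host and pick the minimum. The sketch below does exactly that; the rack layout is an assumption chosen so the computed values reproduce the five cases above (feeding nodes TT1 and TT2 share a rack, and the best placements are the feeding nodes themselves, as in CASE-IV and CASE-V).

```python
# Assumed layout: TT1-TT3 share rack RS1 with both feeding nodes; TT4 and TT5 sit in RS2.
RACK = {"TT1": "RS1", "TT2": "RS1", "TT3": "RS1", "TT4": "RS2", "TT5": "RS2"}

def network_distance(src, dst):
    if src == dst:
        return 0
    return 2 if RACK[src] == RACK[dst] else 4

def tnd(host, feeding_nodes):
    return sum(network_distance(f, host) for f in feeding_nodes)

feeding = ["TT1", "TT2"]              # R's feeding nodes
for host in RACK:
    print(host, tnd(host, feeding))   # TT1: 2, TT2: 2, TT3: 4, TT4: 8, TT5: 8

# A locality-aware choice would pick the minimum; ties are broken arbitrarily.
print(min(RACK, key=lambda host: tnd(host, feeding)))   # -> "TT1" (or "TT2")
```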
Partitioning Skew in MapReduce
• Hadoop's existing reduce task scheduler is not only locality unaware, but also partitioning skew unaware
• Partitioning skew refers to the significant variance in intermediate keys' frequencies and their distribution across different data nodes
• Partitioning skew has been reported to exist in many scientific applications including feature extraction and bioinformatics, among others
• Partitioning skew causes shuffle skew where some reduce tasks receive more data than others
[Figure: partitioning skew observed in WordCount, Sort, and K-Means.]
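A toy sketch of how skewed key frequencies turn into skewed partition (and hence shuffle) sizes under plain hash partitioning. The key frequencies below are invented purely for illustration.

```python
from collections import Counter

NUM_REDUCE_TASKS = 3

# Invented, heavily skewed key frequencies (a few very hot keys).
key_frequency = {"the": 50_000, "of": 30_000, "gene": 500, "protein": 400,
                 "cat": 20, "dog": 15, "mat": 5}

partition_size = Counter()
for key, freq in key_frequency.items():
    partition_size[hash(key) % NUM_REDUCE_TASKS] += freq

# Some reduce tasks end up receiving far more records than others,
# i.e., partitioning skew causes shuffle skew.
print(dict(partition_size))
```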
Partitioning Skew: A Working Example
• TT1 and TT2 are feeding nodes of a reduce task R
• TT1 and TT2 are requesting R from JT
• R's partitions at TT1 and TT2 are of sizes 100MB and 20MB, respectively
[Figure: two placements of R, both with TND_R = 2 (CASE-IV and CASE-V). Placing R at one feeding node shuffles 100MB, while placing it at the other shuffles only 20MB, yet Hadoop does not consider the partitioning skew exhibited by some MapReduce applications.]
Our Work
• We explore the locality and the partitioning skew problems present in the current Hadoop implementation
• We propose the Center-of-Gravity Reduce Scheduler (CoGRS), a locality-aware, skew-aware reduce task scheduler for MapReduce
• CoGRS attempts to schedule every reduce task, R, at its center-of-gravity node, determined by:
  – The network locations of R's feeding nodes
  – The skew in the sizes of R's partitions
• By scheduling reduce tasks at their center-of-gravity nodes, we argue for diminished network traffic and improved Hadoop performance
Talk Roadmap
• The proposed Center-of-Gravity Reduce Scheduler (CoGRS)
• Tradeoffs: Locality, Concurrency, Load Balancing, and Utilization
• CoGRS and the Shuffle Stage in Hadoop
• Quantitative Methodology and Evaluations
  – CoGRS on a Private Cloud
  – CoGRS on Amazon EC2
• Concluding Remarks
CoGRS Approach
• To address data locality and partitioning skew, CoGRS attempts to place every reduce task, R, at a suitable node that minimizes:
  – (1) the Total Network Distance of R (TND_R)
  – (2) the shuffle data
• We suggest that a suitable node would be the center-of-gravity node, chosen in accordance with:
  – The network locations of R's feeding nodes (to address 1)
  – The weights of R's partitions (to address 2)
• We define the weight of a partition P needed by R as the size of P divided by the total size of all the partitions needed by R
Weighted Total Network Distance
• We propose a new metric called Weighted Total Network Distance (WTND) and define it as follows:
  – WTND_R = Σ_{i=1..n} (ND_i × w_i), where n is the number of R's partitions, ND_i is the network distance required to shuffle partition i to R, and w_i is the weight of partition i
• In principle, the center-of-gravity of R is always one of R's feeding nodes, since it is less expensive to access data locally than to shuffle it over the network
• Hence, we designate the center-of-gravity of R to be the feeding node of R that provides the minimum WTND_R
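A minimal sketch of the center-of-gravity selection under the same assumptions as the earlier sketches (illustrative rack layout, link costs 0/2/4); the function names are hypothetical and this is not the actual CoGRS implementation. The usage example reuses the 100MB/20MB partitions from the partitioning-skew slide.

```python
# Assumed topology: TT1 and TT2 share a rack, TT4 and TT5 sit in another rack.
RACK = {"TT1": "RS1", "TT2": "RS1", "TT3": "RS1", "TT4": "RS2", "TT5": "RS2"}

def network_distance(src, dst):
    if src == dst:
        return 0
    return 2 if RACK[src] == RACK[dst] else 4

def center_of_gravity(partitions):
    """partitions: {feeding node -> partition size in MB} for one reduce task R.
    Returns the feeding node that minimizes WTND_R = sum_i ND_i * w_i."""
    total = sum(partitions.values())
    weights = {node: size / total for node, size in partitions.items()}

    def wtnd(host):
        return sum(network_distance(node, host) * weights[node]
                   for node in partitions)

    # The center-of-gravity is restricted to R's feeding nodes.
    return min(partitions, key=wtnd)

# Example from the partitioning-skew slide: 100MB at TT1, 20MB at TT2.
# Weights are 100/120 and 20/120; placing R at TT1 shuffles only the 20MB.
print(center_of_gravity({"TT1": 100, "TT2": 20}))   # -> "TT1"
```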
Locality, Load Balancing, Concurrency, and Utilization Tradeoffs
• Strictly exploiting data locality can lead to scheduling skew
• CoGRS gives up some locality for the sake of extracting more concurrency and improving load balancing and cluster utilization
[Figure: if reduce tasks R1-R4 were all scheduled at their center-of-gravity node (TT5 in this example), the remaining TaskTrackers would sit idle; when the center-of-gravity node is occupied, CoGRS instead attempts to schedule close to TT5, yielding better utilization and load balancing.]
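Below is a sketch of one plausible way to relax strict center-of-gravity placement when the chosen node has no free reduce slot: fall back to the closest available node (same node, then same rack, then anywhere). It only illustrates the tradeoff described above; the topology, names, and fallback policy are assumptions, not the actual CoGRS code.

```python
RACK = {"TT1": "RS1", "TT2": "RS1", "TT3": "RS1", "TT4": "RS2", "TT5": "RS2"}

def network_distance(src, dst):
    if src == dst:
        return 0
    return 2 if RACK[src] == RACK[dst] else 4

def place(cog_node, free_slots):
    """Pick the free node closest to the center-of-gravity node."""
    if not free_slots:
        return None                      # nothing available; wait and retry
    return min(free_slots, key=lambda node: network_distance(cog_node, node))

# TT5 is the center-of-gravity but is already occupied by another reduce task:
print(place("TT5", free_slots=["TT1", "TT4"]))   # -> "TT4" (same rack as TT5)
```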
CoGRS and Early Shuffle
• To determine the center-of-gravity node of a particular reduce task, R, we need to know the network locations of R's feeding nodes
• These cannot be precisely determined before the Map phase commits, because any map task could lead a cluster node to become a feeding node of R
• Default Hadoop starts scheduling reduce tasks after only 5% of map tasks commit, so as to overlap the Map and Reduce phases
• CoGRS defers early shuffle a little (e.g., until 20% of map tasks commit) so that most (or all) keys (which determine reduce tasks) will likely have been encountered
[Figure: execution timeline (Map, Shuffle, Sort, Reduce stages) for Hadoop with early shuffle turned off (H_OFF), annotated with where early shuffle would begin.]
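A sketch of the deferral logic: reduce tasks are scheduled (and early shuffle begins) only once the fraction of committed map tasks reaches a threshold. In stock Hadoop this fraction is controlled by the mapred.reduce.slowstart.completed.maps property (default 0.05, i.e., 5%); the function below is an illustrative stand-in, and the 0.20 value mirrors the example above.

```python
EARLY_SHUFFLE_THRESHOLD = 0.20   # CoGRS example; default Hadoop uses 0.05

def may_schedule_reduces(committed_maps, total_maps,
                         threshold=EARLY_SHUFFLE_THRESHOLD):
    """Start scheduling reduce tasks (and hence early shuffle) only after
    `threshold` of the map tasks have committed, so that the set of
    feeding nodes per reduce task is mostly known."""
    return committed_maps / total_maps >= threshold

print(may_schedule_reduces(12, 238))   # -> False (about 5% of maps done)
print(may_schedule_reduces(48, 238))   # -> True  (about 20% of maps done)
```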
Quantitative Methodology
Benchmark           Sort1     Sort2         WordCount        K-Means
Key Frequency       Uniform   Non-Uniform   Real Log Files   Random
Data Distribution   Uniform   Non-Uniform   Real Log Files   Random
Dataset Size        14GB      13.8GB        11GB             5.2GB
Map Tasks           238       228           11               84
Reduce Tasks        25        25            3                3
• We evaluate CoGRS on:
  – A private cloud with 14 machines
  – Amazon EC2 with 8, 16, and 32 instances
• We use Apache Hadoop 0.20.2
• We use various benchmarks with different dataset distributions
Timeline: Sort2 (An Example)
[Figure: Sort2 execution timelines (Map, Shuffle, Reduce) under H_OFF, H_ON, and CoGRS. H_OFF ends last, H_ON ends earlier, and CoGRS ends even earlier despite deferring early shuffle a little bit.]
Reduce Network Traffic on Private Cloud
On average, CoGRS increases node-local data by 34.5% and decreases off-rack data by 9.6% versus native Hadoop
[Figure: shuffle traffic breakdown (unshuffled, off-rack, shuffled) for Sort1, Sort2, WordCount, and K-Means under H_ON, H_OFF, and CoGRS.]
Execution Times on Private Cloud
CoGRS outperforms native Hadoop by an average of 3.2% and by up to 6.3%
[Figure: runtimes in seconds of Sort1, Sort2, WordCount, and K-Means under H_ON, H_OFF, and CoGRS.]
CoGRS on Amazon EC2: Sort2
[Figures: shuffle traffic breakdown and execution times for Sort2 on Amazon EC2 clusters of 8, 16, and 32 instances, native Hadoop vs. CoGRS.]
Compared to native Hadoop, on average, CoGRS increases node-local data by 1%, 32%, and 57.9%, and decreases off-rack data by 2.3%, 10%, and 38.6% with cluster sizes of 8, 16, and 32, respectively
This translates to average reductions of 1.9%, 7.4%, and 23.8% in job execution times under CoGRS versus native Hadoop with cluster sizes of 8, 16, and 32, respectively
Concluding Remarks
• In this work we observed that the network load is of special concern with MapReduce
  – A large amount of traffic can be generated during the shuffle stage
  – This can deteriorate Hadoop performance
• We realized that scheduling reduce tasks at their center-of-gravity nodes has positive effects on Hadoop's network traffic and performance
  – Average reductions of 9.6% and 38.6% in off-rack network traffic were achieved on a private cloud and on Amazon EC2, respectively
  – This provided Hadoop with up to 6.3% and 23.8% performance improvements on a private cloud and on Amazon EC2, respectively
• We expect CoGRS to play a major role in MapReduce for applications that exhibit high partitioning skew (e.g., scientific applications)
Thank You!
Questions?