
Dependency-aware Data Locality for MapReduce

Xiaoqiang Ma, Student Member, IEEE, Xiaoyi Fan, Student Member, IEEE, Jiangchuan Liu, Senior Member, IEEE, and Dan Li, Member, IEEE

Abstract—MapReduce effectively partitions and distributes computation workloads to a cluster of servers, facilitating today's big data processing. Given the massive data to be dispatched, and the intermediate results to be collected and aggregated, there have been significant studies on data locality that seek to co-locate computation with data, so as to reduce cross-server traffic in MapReduce. They generally assume that the input data have little dependency with each other, which however is not necessarily true for many real-world applications, and we show strong evidence that the finishing time of MapReduce tasks can be greatly prolonged by such data dependency. In this paper, we present DALM (Dependency-Aware Locality for MapReduce) for processing real-world input data that can be highly skewed and dependent. DALM accommodates data dependency in a data-locality framework, organically synthesizing the key components of data reorganization, replication, and placement. Besides the algorithmic design within the framework, we have also closely examined the deployment challenges, particularly in public virtualized cloud environments, and have implemented DALM on Hadoop 1.2.1 with Giraph 1.0.0. Its performance has been evaluated through both simulations and real-world experiments, and compared with that of state-of-the-art solutions.

Index Terms—MapReduce, data locality, data dependency.


1 INTRODUCTION

THE emergence of MapReduce as a convenient computation tool for data-intensive applications has greatly changed the landscape of large-scale distributed data processing [1]. Harnessing the power of large clusters of tens of thousands of servers with high fault-tolerance, such practical MapReduce implementations as the open-source Apache Hadoop project^1 have become a preferred choice in both academia and industry in this era of big data computation at terabyte- and even petabyte-scales.

A typical MapReduce workflow transforms an input pair to a list of intermediate key-value pairs (the mapping stage), and the intermediate values for the distinct keys are computed and then merged to form the final results (the reduce stage) [1]. This effectively distributes the computation workload to a cluster of servers; yet the data is to be dispatched, too, and the intermediate results are to be collected and aggregated. Given the massive data volume and the relatively scarce bandwidth resources (especially for clusters with a high over-provisioning ratio), fetching data from remote servers across multiple network switches can be costly. It is highly desirable to co-locate computation with data, making them as close as possible. Real-world systems, e.g., Google's MapReduce and Apache Hadoop, have attempted

• X. Ma, X. Fan and J. Liu are with the School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada. E-mail: {xma10,xiaoyif,jcliu}@sfu.ca.

• D. Li is with the Department of Computer Science, Tsinghua University, Beijing 100084, P.R. China. E-mail: [email protected].

Manuscript received xxx, 2015; revised xxx, 2015.

1. Apache Hadoop, http://hadoop.apache.org.

to achieve better data locality through replicating each file block on multiple servers. This simple uniform replication approach reduces cross-server traffic as well as job finish time for inputs of uniform popularity, but is known to be ineffective with skewed input [2], [3]. It has been observed that in certain real-world inputs, the number of accesses to popular files can be ten or even one hundred times more than that to less popular files, and highly popular files thus still experience severe contention with massive concurrent accesses.

There have been a series of works on alleviating hot spots with popularity-based approaches. Representatives include Scarlett [2], DARE [3] and PACMan [4], seeking to place more replicas for popular blocks/files. Using the spare disk or memory for these extra replicas, the overall completion time of data-intensive jobs can be reduced. The inherent relations among the input data, however, have yet to be considered. It is known that many of the real-world data inherently exhibit strong dependency. For example, Facebook relies on the Hadoop distributed file system (HDFS) to store its user data, which, in a volume over 15 petabytes [5], preserves diverse social relations. A sample social network graph with four communities is shown in Figure 1. The dependency between data files is determined by the size of the cut (the count of edges) between them. Here subsets A and B have very strong dependency (as compared to subsets A and C, as well as B and C), since their users are in the same community, thus being friends with each other with higher probability. As such, accesses to these two subsets are highly correlated. The dependency can be further aggravated during mapping and reducing: many jobs involve iterative steps, and in each iteration, specific tasks need to exchange the output of the current step with each other before going


Fig. 1. A social network graph with four communities (data subsets A, B, and C).

to the next iteration; in other words, the outputs become highly dependent. Take the PageRank [27] algorithm as an example. In each iteration, a node passes its PageRank score to all of its outlink neighbors, and then each node sums up all PageRank scores that have been passed to it, namely the output of mappers, and updates its PageRank score. This algorithm iterates until the PageRank scores converge. Since each mapper only has a subset of the whole graph, the PageRank scores need to be exchanged among tasks.
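For illustration, the following single-machine Java sketch shows one such iteration (an adjacency-list representation and a damping factor of 0.85 are assumed for concreteness): each node passes its score, split evenly, to its outlink neighbors, and then sums the scores passed to it.

    // One PageRank iteration over an in-memory graph. In a MapReduce or Giraph
    // deployment the graph is split across tasks, so the "received" scores must
    // be exchanged over the network, which is exactly the dependency discussed above.
    import java.util.List;

    public class PageRankStep {
      public static double[] iterate(List<List<Integer>> outLinks, double[] rank) {
        int n = rank.length;
        double[] received = new double[n];
        // Each node passes its current score, split evenly, to its outlink neighbors.
        for (int u = 0; u < n; u++) {
          List<Integer> nbrs = outLinks.get(u);
          if (nbrs.isEmpty()) continue;        // dangling nodes drop their mass in this sketch
          double share = rank[u] / nbrs.size();
          for (int v : nbrs) received[v] += share;
        }
        // Each node sums the scores passed to it and updates its own score.
        double[] next = new double[n];
        for (int v = 0; v < n; v++) {
          next[v] = (1 - 0.85) / n + 0.85 * received[v];
        }
        return next;
      }
    }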

In this paper, we present DALM (Dependency-Aware Locality for MapReduce) for processing real-world input data that can be highly skewed and dependent. DALM accommodates data dependency in a data-locality framework, organically synthesizing the key components of data reorganization, replication, and placement. Besides the algorithmic design, we also closely examine the deployment challenges, particularly in public virtualized cloud environments, and have implemented DALM on Hadoop 1.2.1 with Giraph 1.0.0. We evaluate DALM through both simulations and real-world experiments, and compare it with state-of-the-art solutions, including the basic Hadoop system, the representative popularity-based Scarlett [2], and the representative graph-partitioning-based Surfer [30]. The results demonstrate that DALM's replication strategy can significantly improve data locality for different inputs. For popular iterative graph processing applications on Hadoop, our prototype implementation of DALM significantly reduces the job finish time, as compared to the basic Hadoop system, Scarlett, and Surfer. For larger-scale multi-tenancy systems, more savings can be envisioned given that they are more susceptible to data dependency.

The rest of this paper is organized as follows. In Section 2, we introduce the background and related work of MapReduce and graph partitioning. In Section 3, we present our motivation, highlighting the impact of data locality. In Section 4, we formulate the problem and discuss the design principles of DALM. Section 5 addresses the deployment issues and describes our prototype implementation. In Section 6, we evaluate the effectiveness of DALM via both simulations and testbed experiments. Section 7 concludes the paper.

2 BACKGROUND AND RELATED WORK

In this section, we offer the necessary background of our work, including a brief overview of data locality in MapReduce, as well as processing data with dependency.

2.1 Data Locality in MapReduce

A MapReduce cluster generally consists of a master, which acts as a central controller, and a number of slaves, which store data and conduct user-defined computation tasks in a distributed fashion [1]. In a physical machine cluster, each slave node has a DataNode that stores a portion of data, and a TaskTracker that accepts and schedules tasks. The NameNode on the master node hosts the directory tree of all files on DataNodes, and keeps track of the locations of files. A typical MapReduce workflow consists of two major phases. First, the map processes (mappers) on slaves read the input data from the distributed file system, and transform the input data to a list of intermediate key-value pairs (known as the map phase); the reduce processes (reducers) then merge the intermediate values for the distinct keys, which are stored on the local disks of slaves, to form the final results that are written back to the distributed file system (known as the reduce phase). Both map and reduce phases can be further divided into multiple steps.
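For reference, the canonical word-count example below illustrates the two phases on the standard org.apache.hadoop.mapreduce API; it is a generic sketch and not specific to DALM.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Map phase: transform each input line into (word, 1) pairs.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: merge the intermediate values for each distinct key.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }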

Co-locating computation with data, namely, data locality, largely avoids the costly massive data exchange crossing switches, thereby reducing the job finish time of MapReduce applications [6]. By default, MapReduce achieves data locality via data replication, which divides data files into blocks of fixed size (e.g., 64 MB) and stores multiple replicas of each block on different servers. More advanced solutions have been proposed to further improve data locality, either through smart and dynamic data replication, e.g., [2], [3], or job scheduling, e.g., [9]–[12].

Scarlett [2] adopts a proactive replication design that periodically replicates files based on predicted popularity. It captures the popularity of files, and computes a replication factor for each file to alleviate hotspots. DARE [3] utilizes a reactive approach that dynamically detects popularity changes at smaller time scales. When a map task remotely


accesses a block, the block is inserted into the file system at the node that fetched it. By doing this, the number of replicas of the block is automatically increased by one, without generating explicit network traffic. Quincy [10] treats scheduling as an assignment problem, where different assignments have different costs based on locality and fairness. Rather than killing running tasks to free resources and launching new ones, delay scheduling [11] lets a job wait for a small amount of time if it cannot launch tasks locally, allowing other jobs to launch tasks instead. This simple strategy improves resource utilization, and works well for a variety of workloads. Purlieus [12] considers the locality of intermediate data, and aims to minimize the cost of data shuffling between map tasks and reduce tasks. These works generally assume that the input data have no strong dependency.

2.2 Processing Data with Dependency

As mentioned earlier, processing data with dependency, especially mining large graphs, involves intensive data exchange and state update. In Pregel [29], inter-worker communication is implemented through message passing rather than remotely reading the entire new state as in MapReduce. This design significantly improves the performance of processing large graphs, as compared to the original MapReduce, and has also been adopted in Apache Giraph^2, Surfer [30], and Mizan [28]. Giraph is essentially an open-source implementation of Pregel built on top of Hadoop, inheriting such conveniences of Hadoop as distributed storage and fault tolerance with the underlying HDFS, while Mizan and Surfer are developed in C++ to further improve efficiency. Using an online hash- or range-based partitioning strategy to dispatch subgraphs to workers, Giraph balances computation and communication well. Yet this naive partitioning strategy largely neglects the structure of graphs: subgraphs with strong dependency can be assigned to different workers, which incurs excessive message passing. Surfer and Mizan take a further step toward adapting to the structure of graphs. Surfer also considers the heterogeneous network bandwidth in the cloud environment under a network-performance-aware graph partitioning framework: the partitions with a large number of cross-partition edges are stored on the machines with high network bandwidth. Mizan, on the other hand, monitors the message traffic of each worker, and balances the workload by migrating selected vertices across workers. Our DALM is motivated by these earlier studies. The difference is that DALM incorporates a preprocessing module that reorganizes the raw input data based on graph partitioning techniques [14], which is then used to compute the dependency among the obtained partitions. This approach effectively minimizes the runtime overhead, as compared to Mizan. DALM then smartly replicates the partitioned data according to data popularity, and places the replicas on the servers so as to minimize the cross-server traffic, as will be detailed in Section 4.

2. Apache Giraph, http://giraph.apache.org.

3 MOTIVATION

To understand the impact of dependency in data replication and the necessity of joint optimization with popularity and dependency, we take Figure 2 as an example, where files A and B are popular, and files C, D, E, F are not. Now assume that there is strong data dependency between each of the file pairs (A,B), (C,D), and (E,F), weak dependency between (A,C), (A,E), and (D,F), and no dependency otherwise. We try to find a good storage strategy to minimize the traffic overhead, according to the dependency among files, as well as the file popularity.

As Figure 2(a) shows, the default replication strategy of Hadoop uniformly spreads files over all the servers, which achieves good fault tolerance and scalability. Yet it does not consider file popularity and dependency, and thus would incur high cross-server traffic. Figure 2(b) demonstrates the efforts of more advanced solutions with min-cut-based partitioning techniques to balance computation and communication [30]. Since files with strong dependency are stored together, a large amount of cross-server traffic among workers can be avoided. Unfortunately, without considering file popularity, they suffer from imbalanced workload on individual servers. For example, server 1 would become the hotspot since it hosts more popular files, as compared to other servers. (Server 4 even has no jobs to schedule.)

On the other hand, popularity-based replication strategies, e.g., Scarlett [2], work well only if the original data-intensive job can be evenly distributed, and each computation node can work in parallel. In practice, ultra-high parallelization is often hindered by data dependency among files. A basic popularity-based strategy would create two replicas for the popular files, namely A1, A2 for file A, and B1, B2 for file B, and one for each of the remaining files. The replicas of the popular files can then be evenly distributed to the four servers, each accompanying a replica of an unpopular file (see Figure 2(c)). This strategy, however, would incur frequent cross-server traffic when highly dependent files are stored separately. For example, since files A and B have strong dependency, there will be frequent communications between servers 1 and 2, as well as between servers 3 and 4.

Hence, a better replication strategy should consider both file popularity and dependency, as shown in Figure 2(d). We can replicate files based on file popularity and place the files with strong dependency in close vicinity. Not only can the cross-server traffic be largely reduced, but the imbalanced computation workloads can also be relieved.

Such data dependency exists in many real-world applications. For example, many jobs involve iterative steps, and in each iteration, specific tasks need to exchange the output of the current step with each other before going to the next iteration [26]. These tasks will keep waiting until all the necessary data exchanges are completed. Consider the PageRank algorithm [27]. Millions of websites and the hyperlinks among them compose a huge graph, which is divided into a large number of subgraphs for MapReduce. The ranking-computing task is usually conducted on a cluster with hundreds or even thousands of servers. Since each server only deals with a subset of the websites, to compute the ranks of all the websites, the servers storing dependent data need to exchange the intermediate results


Fig. 2. An example of different replication strategies based on (a) Hadoop only; (b) graph partition approach only; (c) data popularity only; (d) both data popularity and dependency. The number of edges between nodes denotes the level of dependency.

in each iteration. Another example is counting the hops between a user pair in an online social network, a frequently used function for such social network analyses as finding the closeness of two users in Facebook or LinkedIn. The shortest path calculation in a large social network will involve a number of files, since each single file generally stores only a small portion of the whole graph data.

It is worth noting that a dependency-aware strategy may result in a less-balanced workload across individual servers, as shown in Figure 2(d). The side effect, however, is minimal: (1) the popular files still have more replicas, which remain to be stored on different servers, thereby mitigating the hotspot problem; (2) for the servers storing popular files, e.g., servers 1 and 4, most of the data I/O requests can be serviced locally, resulting in higher I/O throughput and thus lower CPU idle time. In other words, a good dependency-based strategy should strive to localize data access as much as possible while avoiding potential hotspots through replicating popular files.

4 DALM: DESIGN AND OPTIMIZATION

Our dependency-aware locality for MapReduce (DALM) addresses the challenges above under a coherent framework. Figure 3 provides a high-level description of the DALM workflow. A data preprocess module first operates on the raw data to identify the dependency in the data, and reorganizes the raw data accordingly. A popularity module built on HDFS then records the access counters for each data block, and dynamically modifies the replication factors of individual files using an adaptive replication strategy. If a block has a higher replication factor than the actual number of replicas, the popularity module inserts replicated blocks into the HDFS, triggered by remote-data accesses. This series of modular operations ensures that DALM can accommodate both data popularity and dependency with minimized overhead. We next present the detailed design and optimization of DALM.

4.1 Dependency-aware Data Reorganization

Fig. 3. DALM workflow

In the existing HDFS, when the raw input data, which is either a single file or a set of files, is sequentially loaded into the distributed file system, the raw data will be segmented into data blocks of equal size, which are then distributed to individual slave nodes. Thus the dependency among the data will hardly be preserved. For example, a large connected component will be divided into a number of subgraphs and then stored on different servers. When specific tasks need to process this connected component, the states of all the servers storing its data need to be synchronized, which may incur considerable communication overhead. Instead of loading the raw data files directly into the underlying distributed file system, DALM first identifies data dependency in its preprocessing module through graph partitioning, which helps to place dependent data as close as possible. The key observation here is that the data points in the same cluster have a high degree of self-similarity, and a proper graph partitioning will divide the raw data into k partitions of almost equal size, where k is a user-defined parameter that can adapt to the storage system and system scale. In this way, the highly dependent data will be kept in the same file, and the dependency between a pair of files can be easily obtained by normalizing the size of the cut between the two files. There have been extensive studies on rapidly achieving good-quality partitioning for large-scale graphs [13]–[16]. DALM is not confined to work with a particular partitioning algorithm, and such popular tools as


ParMETIS [14] and SCOTCH [16] can both be incorporated. The algorithms implemented in ParMETIS are based on the parallel multilevel k-way graph-partitioning, adaptive repartitioning, and parallel multi-constrained partitioning schemes. The well-known multilevel k-way partition methods contain three steps: coarsening, initial partitioning, and uncoarsening. In the coarsening step, the algorithm generates a series of increasingly coarser graphs from the original graph. In the initial partitioning step, the algorithm builds an initial partitioning on the coarsest graph, which is much smaller than the original graph. Finally, in the uncoarsening step, the initial partitioning is used to derive partitionings of the successively finer graphs. This is done by first projecting the partition onto the finer graph and then applying partitioning refinement, which targets reducing cross-partition edges by moving points among the partitions.
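As a rough illustration, the dependency between partitions could be computed as in the sketch below, which counts cross-partition edges and normalizes the counts (the Edge type, the partitionOf map, and the normalization by the total edge count are assumptions for concreteness).

    import java.util.List;
    import java.util.Map;

    public class DependencyMatrix {
      public static class Edge {
        final int u, v;
        public Edge(int u, int v) { this.u = u; this.v = v; }
      }

      // partitionOf maps each vertex to the index (0..k-1) of the file/partition
      // holding it. Returns a symmetric matrix D where D[i][j] is the cut size
      // between partitions i and j, normalized by the total number of edges.
      public static double[][] build(List<Edge> edges,
                                     Map<Integer, Integer> partitionOf, int k) {
        double[][] d = new double[k][k];
        for (Edge e : edges) {
          int pi = partitionOf.get(e.u);
          int pj = partitionOf.get(e.v);
          if (pi != pj) {                  // only cross-partition edges contribute
            d[pi][pj] += 1.0;
            d[pj][pi] += 1.0;
          }
        }
        double total = Math.max(1, edges.size());
        for (int i = 0; i < k; i++)
          for (int j = 0; j < k; j++)
            d[i][j] /= total;              // normalize for comparability
        return d;
      }
    }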

Figure 3 shows how the preprocessing step is gracefully integrated into HDFS. DALM balances its workload by detecting the dependency in the graph data, placing dependent data blocks together, and then determining the replication factors of individual data blocks based on the file popularity, as shown below.

4.2 Minimizing Cross-server Traffic: Problem Formulation

After reorganizing the raw data based on dependency, DALM aims at reducing the cross-server traffic by placing dependent data on nearby servers, as formulated in the following. Table 1 summarizes the notations. Consider that there are $n$ different files, and for file $i$, denote its popularity by $p_i$ and its size by $s_i$. The element $d_{ij}$ of the matrix $D$ refers to the dependency degree between files $i$ and $j$. In DALM, the dependency degree is obtained by normalizing the size of the cut between the two files. In the first step, we need to determine the replication factor, i.e., how many replicas each file has, which is denoted by $r_i$. Then we need to place these replicas on $m$ independent servers, where for each server $j$, we denote the storage capacity by $c_j$. We use $\rho_i^k$ to represent the $k$th replica of file $i$; for a replica placement strategy, $\rho_i^k = j$ indicates that the $k$th replica of file $i$ is stored on server $j$.

In the following we discuss the constraints of the considered data replication and placement problem.

To ensure that the total file size on any server does not exceed the storage capacity of the server, we must have:

$$\sum_{i=1}^{n} \sum_{k=1}^{r_i} I_j(\rho_i^k)\, s_i \le c_j \quad (1)$$

where $I_j$ is the indicator function such that $I_j(\rho_i^k)$ is equal to 1 if $\rho_i^k = j$ holds, and 0 otherwise.

In off-the-shelf MapReduce-based systems, e.g., Hadoop, all files have the same replication factor, which is three by default and can be tuned by users. In our considered problem, the replication factor of each file can vary from the lower bound ($r_L$) to the upper bound ($r_U$) based on the file popularity to balance the workload.

Naturally, the extra replicas beyond the lower bound provide higher data availability, which also call for additional storage space. A user-set parameter $\delta$ controls the percentage of the total storage capacity reserved for extra replicas, which trades off the data locality and storage

TABLE 1
Summary of Notations

$p_i$       Popularity of file $i$
$s_i$       Size of file $i$
$d_{ij}$    Dependency between files $i$ and $j$
$c_j$       Storage capacity of server $j$
$P_j$       Aggregate file popularity of server $j$
$\bar{P}$   Average of the aggregate file popularity of all servers
$\eta$      Threshold of the deviation of $P_j$ from $\bar{P}$
$r_i$       Replication factor of file $i$
$r_L$       Lower bound of the replication factor
$r_U$       Upper bound of the replication factor
$\rho_i^k$  Placement of the $k$th replica of file $i$
$\delta$    Quota of storage capacity for extra replicas
$S$         Aggregate file size of the servers
$R$         Aggregate number of replicas
$D$         File dependency matrix, where the element $d_{ij}$ refers to the degree of dependency between files $i$ and $j$

efficiency. Let $S = \sum_{i=1}^{n} s_i r_L$ be the total storage capacity for storing $r_L$ replicas of each file. Then the replication factors should satisfy the following constraints:

$$\sum_{i=1}^{n} (r_i - r_L)\, s_i \le \delta S \quad (2)$$

$$r_L \le r_i \le r_U, \quad \text{for } i = 1, \ldots, n \quad (3)$$

The basic idea of the popularity-based replication approach is to adapt the replication factor of each file according to the file popularity in order to alleviate the hotspot problem and improve resource utilization. Let $P_j$ be the aggregate file popularity of server $j$, which is the sum of the popularity of all the replicas stored on this server, and $\bar{P}$ the average of the aggregate file popularity of all servers. Further, the replica placement should prevent any single server from hosting too many popular files. Hence we have the following constraint:

$$P_j = \sum_{i=1}^{n} \sum_{k=1}^{r_i} I_j(\rho_i^k)\, \frac{p_i}{r_i} \le (1 + \eta)\bar{P} \quad (4)$$

where $\eta$ is the threshold of the deviation of $P_j$ from $\bar{P}$. We divide $p_i$ by $r_i$ by assuming that the accesses to file $i$ evenly spread over all of its replicas.

In our considered problem, we aim at minimizing remote accesses. In the current Hadoop system, the computation node accesses the closest replica possible. Similar to the popularity-based strategies [2], [3], we take a file as the replication granularity, and our dependency-aware replication strategy is then divided into two successive steps. The first is to calculate the replication factor of each file based on its popularity, and the second is to decide on which server to store each replica, such that the replicas of highly dependent files are stored as close as possible while not exceeding the storage budget and preventing potential hotspots.

4.3 Replication Strategy

We compute the replication factor $r_i$ of file $i$ based on the file popularity and size, as well as the storage budget given by $\delta$. Denote the observed number of concurrent accesses by $p_i$. Files with larger sizes have a higher priority in getting their desired number of replicas, since it has been shown that large files are accessed more often [2].


The following optimization problem yields the file replication factors:

$$\min \; \left\| \frac{\mathbf{r}}{R} - \mathbf{p} \right\|^2 \quad (5)$$

$$\text{s.t.} \quad \sum_{i=1}^{n} (r_i - r_L)\, s_i \le \delta S$$

$$r_L \le r_i \le r_U, \quad \text{for } i = 1, \ldots, n$$

where $\mathbf{r} = (r_1, \ldots, r_n)$, $R = \sum_{i=1}^{n} r_i$, and $\mathbf{p} = (p_1, \ldots, p_n)$.

To solve it, we start from the Lagrangian function:

$$f(\mathbf{r}) = \left\| \frac{\mathbf{r}}{R} - \mathbf{p} \right\|^2 + \alpha \left[ \sum_{i=1}^{n} (r_i - r_L)\, s_i - \delta S \right] + \sum_{i=1}^{n} \beta_i (r_L - r_i) + \sum_{i=1}^{n} \gamma_i (r_i - r_U) \quad (6)$$

where $\alpha$, $\beta_i$ and $\gamma_i$ are the Lagrange multipliers. The Lagrangian dual then becomes

$$f_{LD} = \max_{\alpha \ge 0,\, \beta_i \ge 0,\, \gamma_i \ge 0} \; \min_{r_L \le r_i \le r_U} f(\mathbf{r}) = f^{\Delta}$$

Since the objective function in (5) is convex and the constraints are linear, Slater's condition holds and strong duality applies. Setting the partial derivative of $f(\mathbf{r})$ with respect to $r_i$ to zero, we have

$$\frac{\partial f}{\partial r_i} = 2\left(\frac{r_i}{R} - p_i\right)\frac{1}{R} + \alpha s_i - \beta_i + \gamma_i = 0$$

Namely,

$$r_i = R p_i - \frac{R^2}{2}\left(\alpha s_i - \beta_i + \gamma_i\right) \quad (7)$$

Combining Equations (6) and (7), the Lagrange dual problem becomes:

$$\max \; g(\alpha, \boldsymbol{\beta}, \boldsymbol{\gamma}) = \sum_{i=1}^{n} \left( \frac{R(\alpha s_i - \beta_i + \gamma_i)}{2} \right)^2 + \alpha \left[ \sum_{i=1}^{n} \left( R p_i - \frac{R^2}{2}(\alpha s_i - \beta_i + \gamma_i) - r_L \right) s_i - \delta S \right] + \sum_{i=1}^{n} \beta_i \left( r_L - R p_i + \frac{R^2}{2}(\alpha s_i - \beta_i + \gamma_i) \right) + \sum_{i=1}^{n} \gamma_i \left( R p_i - \frac{R^2}{2}(\alpha s_i - \beta_i + \gamma_i) - r_U \right)$$

$$\text{s.t.} \quad \alpha \ge 0, \; \beta_i \ge 0, \; \gamma_i \ge 0, \quad \text{for } i = 1, \ldots, n$$

This is a quadratic programming (QP) optimization problem with linear constraints. Rather than adopting time-consuming numerical QP optimization, we use the Sequential Minimal Optimization (SMO) technique [25] to solve the above problem and find $r_i$. The main idea of SMO is as follows. SMO breaks the original QP problem into a series of smallest-possible sub-problems, each involving only two multipliers, which can be solved analytically. The original problem is solved when all the Lagrange multipliers satisfy the Karush-Kuhn-Tucker (KKT) conditions. In other words, SMO consists of two components: solving a sub-problem with two Lagrange multipliers through an analytic method, and choosing which pair of multipliers to optimize, generally through heuristics. The first multiplier is selected among the multipliers that violate the KKT conditions, and the second multiplier is selected such that the dual objective has a large increase. SMO is guaranteed to converge, and various heuristics have been developed to speed up the algorithm. Compared to numerical QP optimization methods, SMO scales better and runs much faster.
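For intuition, the sketch below computes replication factors with a simple greedy heuristic under constraints (2) and (3), adding one replica at a time to the file whose share $r_i/R$ falls furthest below its popularity $p_i$ until the budget $\delta S$ is exhausted; it is a simplified alternative for illustration, not the SMO solver adopted here.

    public class GreedyReplication {
      // p: normalized popularity (sums to 1), s: file sizes, rL/rU: replication
      // bounds, delta: fraction of S = rL * sum(s) reserved for extra replicas.
      public static int[] solve(double[] p, double[] s, int rL, int rU, double delta) {
        int n = p.length;
        int[] r = new int[n];
        double S = 0;
        for (double size : s) S += size * rL;
        double budget = delta * S;
        int R = 0;
        for (int i = 0; i < n; i++) { r[i] = rL; R += rL; }

        while (true) {
          int best = -1;
          double bestGap = 0;
          for (int i = 0; i < n; i++) {
            double gap = p[i] - (double) r[i] / R;   // deficit w.r.t. popularity
            if (r[i] < rU && s[i] <= budget && gap > bestGap) {
              best = i;
              bestGap = gap;
            }
          }
          if (best < 0) break;            // no feasible improvement left
          r[best]++;                      // add one replica of the neediest file
          R++;
          budget -= s[best];
        }
        return r;
      }
    }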

4.4 Placement Strategy

After obtaining the replication factor of each file, the replicas are to be placed on a set of candidate servers, which is equivalent to finding a mapping from replicas to servers, so that replicas with strong dependency can be assigned to the same server or nearby ones. This is related to the clustering problem in which the replicas are partitioned into m groups, where m is the number of servers. The popular k-means clustering [7] (including its variations, e.g., k-median), however, does not work in this context, for it requires the data points to have coordinates in a feature space. In our scenario, only the distance information between data points is available, which reflects the data dependency. As such, we resort to a k-medoids algorithm [22] that relies on distance information only and is robust to outliers. The original k-medoids works as follows. Suppose that there are n data points to be divided into k clusters. In the first step, k data points are randomly selected as the medoids (initialization). In the second step, each of the remaining data points is associated to the closest medoid based on the distance information. In the third step, for each medoid m and each non-medoid data point o associated to m, swap m and o and compute the total cost of this new configuration; select the configuration with the lowest cost and update the corresponding medoid. Then repeat the second and third steps until there is no change in the medoids, namely the total cost converges. Yet in DALM, we need to further incorporate the following constraints:

• Only one replica of the same file can be placed on the same server.

• The aggregated popularity of each server cannot exceed $(1 + \eta)\bar{P}$.

• The total file/replica size on any server cannot exceed the storage capacity of the server.

Algorithm 1 The modified k-medoids algorithm for replica placement

1: Randomly select m replicas as the initial cluster centers, namely medoids.
2: For each of the remaining replicas i, assign it to the cluster with the most dependent files if the constraints are satisfied.
3: for each medoid replica i do
4:     for each non-medoid replica j do
5:         Swap i and j if the constraints are satisfied.
6:         Memorize the current partition if it preserves more data dependency.
7:     end for
8: end for
9: Iterate between steps 3 and 8 until convergence.
10: return m clusters

To accommodate these constraints, a modified k-medoids algorithm is developed (pseudo code shown in Algorithm 1). It takes the dependency matrix $D$ as the input, and returns a partition of replicas that minimizes the cross-server dependency while satisfying all the constraints. It is worth noting that this algorithm assumes that all the servers are in the same layer, e.g., interconnected by a single core


switch. For a multilevel tree-topology cluster, this algorithm should be modified in a hierarchical way, which means that the data replicas are iteratively partitioned from the top level to the bottom level, such that dependent data is placed as close as possible.
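The sketch below illustrates only the constrained assignment step of Algorithm 1 (step 2): each replica is attached to the server whose current contents it depends on most, subject to the capacity, popularity, and one-replica-per-file constraints. The medoid-swap loop (steps 3-9) is omitted for brevity, and the helper arrays (fileOfReplica, the per-replica popularity, etc.) are hypothetical.

    import java.util.ArrayList;
    import java.util.List;

    public class ReplicaAssignment {
      // D[i][j]: dependency between files i and j; size[i]: file size;
      // pop[i]: per-replica popularity p_i / r_i; fileOfReplica[rep]: the file a
      // replica belongs to; cap[j]: capacity of server j; popCap: (1+eta)*Pbar.
      public static List<List<Integer>> assign(double[][] D, double[] size, double[] pop,
                                               int[] fileOfReplica, double[] cap,
                                               double popCap) {
        int m = cap.length;
        List<List<Integer>> servers = new ArrayList<>();
        double[] used = new double[m];
        double[] aggPop = new double[m];
        for (int j = 0; j < m; j++) servers.add(new ArrayList<>());

        for (int rep = 0; rep < fileOfReplica.length; rep++) {
          int f = fileOfReplica[rep];
          int best = -1;
          double bestDep = -1;
          for (int j = 0; j < m; j++) {
            if (used[j] + size[f] > cap[j]) continue;     // capacity constraint
            if (aggPop[j] + pop[f] > popCap) continue;    // popularity constraint
            boolean duplicate = false;
            double dep = 0;
            for (int other : servers.get(j)) {
              if (fileOfReplica[other] == f) { duplicate = true; break; }
              dep += D[f][fileOfReplica[other]];          // dependency preserved locally
            }
            if (duplicate) continue;                      // one replica per file per server
            if (dep > bestDep) { bestDep = dep; best = j; }
          }
          if (best >= 0) {                                // skip replica if no feasible server
            servers.get(best).add(rep);
            used[best] += size[f];
            aggPop[best] += pop[f];
          }
        }
        return servers;
      }
    }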

5 DEPLOYMENT ISSUES AND PROTOTYPE IMPLEMENTATION

So far we have outlined the design of DALM and laid its algorithmic foundations. We now discuss the practical challenges toward deploying DALM in real-world cloud environments, and also present our prototype implementation.

5.1 Accommodating Machine Virtualization

To reduce the cross-server network traffic during job execution, the MapReduce task scheduler on the master node usually places a task onto a slave on which the required input data is available, if possible. This is not always successful, since the slave nodes having the input data may not have free slots at that time. Recall that the default replication factor is three in Hadoop systems, which means that each file block is stored on three servers: two of them are within the same rack and the remaining one is in a different rack. Hence, depending on the distance between the DataNode and the TaskTracker, the default Hadoop defines three levels of data locality, namely, node-local, rack-local, and off-switch. Node-local is the optimal case, in which the task is scheduled on the node having the data. Rack-local represents a suboptimal case, in which the task is scheduled on a node different from the node having the data, but the two nodes are within the same rack; rack-local thus incurs cross-server traffic. Off-switch is the worst case, in which neither node-local nor rack-local nodes have free slots, and thus the task needs to retrieve data from a node in a different rack, incurring cross-rack traffic. When scheduling a task, the task scheduler takes a priority-based strategy: node-local has the highest priority, whereas off-switch has the lowest.

DALM can be easily incorporated into this conventional three-level design with a physical machine cluster. Such machine virtualization tools as Xen, KVM, and VMware allow multiple virtual machines (VMs) to run on a single physical machine (PM), offering highly efficient hardware resource sharing and effectively reducing the operating costs of cloud providers. They have not only become a foundation for today's modern cloud platforms, but also affect the notion of data locality [8], [12], [20], [21]. For illustration, consider the overall architecture of a typical Xen-based virtualized system in Figure 4. For a file block stored on VM 1-a, scheduling the task on VM 1-b or on VM 2-b is considered identical by the task scheduler, for both are rack-local. The data exchanges between co-located VMs (e.g., VM 1-a and VM 1-b), however, are much faster than those between remote VMs (e.g., VM 1-b and VM 2-b), and hence the two cases should have different scheduling priorities.

To understand the impact on locality with highly dependent data, we set up a MapReduce cluster consisting of three physical machines: one acts as the master node, and the other two each run three VMs. As such, our cluster has six slave nodes in total. For the file block placement

Fig. 5. Job finish time (sec) with different storage strategies.

strategy, we consider the following three representatives: 1. first divide the whole graph into two partitions, and then further divide each partition into three smaller partitions, which are placed on the three co-located VMs, respectively; 2. divide the graph into six partitions, and randomly place each partition on a VM; 3. uniformly place the graph data on all VMs.

We extract 3.5 GB of Wikipedia data from the Wikimedia database^3, and then run the Shortest Path application in the widely used Giraph system as the microbenchmark. Figure 5 reports the average job finish time over five runs. The first storage strategy achieves the best performance since it matches well with the underlying server topology. As compared with the second strategy, it significantly reduces the cross-server traffic by placing dependent data on the same PM; the traffic is instead offloaded to the much more efficient communication among co-located VMs.

As such, it is necessary to modify the original three-level priority scheduling strategy by splitting node-local into two priority levels, namely, VM-local and PM-local. A task is VM-local if its required data is on the same virtual machine, while a task is PM-local if its required data is on a co-located virtual machine. At the beginning of running a MapReduce job, DALM launches a topology configuration process that identifies the structure of the underlying virtualized cluster from a configuration file. The points in the same or nearest communities in the graph can be placed in the same data block, so that they will be processed by the same computation node. A VM is associated with a unique ID, in the format of rack-PM-VM, such that the degree of locality can be easily obtained. The scheduling algorithm then schedules tasks in a virtualized MapReduce cluster based on the newly designed priority levels. The key change in this scheduling algorithm is that, when a VM has free slots, it requests new tasks from the master node through heartbeat, and the task scheduler on the master node assigns tasks according to the following priority order: VM-local, PM-local, rack-local, and off-switch.

5.2 Prototype Implementation

To demonstrate the practicability of DALM and investigate its performance in real-world MapReduce clusters, we have

3. Available at http://dumps.wikimedia.org/


Fig. 4. A typical virtual MapReduce cluster, with a master node, four PMs each hosting three VMs (each VM running a TaskTracker and a DataNode), two ToR switches, and a core switch. A core switch is a high-capacity switch interconnecting top-of-rack (ToR) switches, which have relatively low capacity.

implemented a prototype of DALM based on Hadoop 1.2.1 and Apache Giraph 1.0.0 by extending the Hadoop Distributed File System (HDFS) with the proposed adaptive replication scheme. We have also modified the existing job scheduling strategy to be virtualization-aware. As we have re-defined the locality levels and revised the task scheduling policy accordingly, we carefully examined the whole Hadoop package to identify the related source code and made the necessary modifications, ensuring that our modifications are compatible with the remaining modules. We highlight the key modifications of the practical implementation in the following.

First, we modify the original NODE-LOCAL level to PM-LOCAL and VM-LOCAL as follows. In Locality.java, NODE_LOCAL is replaced by VM_LOCAL or PM_LOCAL depending on whether a task is launched on the local VM or on a co-located VM. The NetworkTopology class in NetworkTopology.java, which defines the hierarchical tree topology of MapReduce clusters, is extended to VirtualNetworkTopology by overriding such methods as getDistance() and pseudoSortByDistance(). During the initialization of the NameNode, VirtualNetworkTopology returns clusterMap, which is the network topology, to the DataNodeManager. In DALM, the distance calculation is updated to VM-local (0), PM-local (1), rack-local (2), and off-switch (4).
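The intended distance semantics can be illustrated by the standalone sketch below, which parses the rack-PM-VM IDs introduced in Section 5.1; this is a simplified stand-in, not the actual VirtualNetworkTopology.getDistance() override.

    public class VirtualLocality {
      // IDs are assumed to look like "rack1-pm2-vm3"; the returned values match
      // the VM-local (0), PM-local (1), rack-local (2), off-switch (4) levels.
      public static int getDistance(String nodeA, String nodeB) {
        String[] a = nodeA.split("-");
        String[] b = nodeB.split("-");
        if (!a[0].equals(b[0])) return 4;  // different racks: off-switch
        if (!a[1].equals(b[1])) return 2;  // same rack, different PMs: rack-local
        if (!a[2].equals(b[2])) return 1;  // same PM, different VMs: PM-local
        return 0;                          // same VM: VM-local
      }
    }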

Second, we modify the task scheduling algorithm to be virtualization-aware, which is mainly implemented in JobInProgress.java and JobQueueTaskScheduler.java. In the standard Hadoop and vLocality systems, the task scheduler works as shown in Figure 6(a) and Figure 6(b), respectively. We modify the original JobQueueTaskScheduler() method to respond to TaskTracker heartbeat requests. The getMachineLevelForNodes() method in JobInProgress.java is the key component of the task scheduler. In our implementation, the task scheduler of DALM works as shown in Figure 6(b): it first assigns VM-local and PM-local tasks through

(a) Standard Hadoop:

    public synchronized List<Task> assignTasks(TaskTracker taskTracker) {
      List<Task> assignedTasks = new ArrayList<Task>();
      for (int i = 0; i < availableMapSlots; ++i) {
        synchronized (jobQueue) {
          for (JobInProgress job : jobQueue) {
            Task t = null;
            t = job.obtainNewLocalMapTask();
            if (t != null) {
              assignedTasks.add(t);
              break;
            }
            t = job.obtainNewNonLocalMapTask();
            if (t != null) {
              assignedTasks.add(t);
              break scheduleMaps;
            }
          }
        }
      }
      ...
    }

(b) vLocality:

    public synchronized List<Task> assignTasks(TaskTracker taskTracker) {
      List<Task> assignedTasks = new ArrayList<Task>();
      for (int i = 0; i < availableMapSlots; ++i) {
        ...
        t = job.obtainNewVmOrPmLocalMapTask();
        if (t != null) {
          assignedTasks.add(t);
          break;
        }
        t = job.obtainNewRackLocalMapTask();
        if (t != null) {
          assignedTasks.add(t);
          break;
        }
        t = job.obtainNewNonLocalMapTask();
        if (t != null) {
          assignedTasks.add(t);
          break scheduleMaps;
        }
        ...
      }
    }

Fig. 6. Task scheduler in standard Hadoop (a) and vLocality (b)

obtainVmOrPmLocalMapTask(), then rack-local tasks through obtainRackLocalMapTask(), and finally other tasks through obtainNonLocalMapTask(). All three methods are encapsulated in a general method, namely, obtainNewMapTask() in


Fig. 7. File rank ordered by relative popularity (x-axis: rank of file according to the popularity; y-axis: percentage of the total accesses; curves for α = 1.0, 2.0, 3.0, 4.0, and 5.0).

JobInProgress.class. We keep the original interfaces of obtainNewMapTask() to ensure compatibility.

After making the above modifications, we re-compile the whole package to get the executable .jar file, which is then deployed on the MapReduce cluster.

6 PERFORMANCE EVALUATION

In this section, we evaluate DALM through extensive simulations and testbed experiments. We also compare DALM with state-of-the-art data locality solutions.

6.1 Simulations

We first conduct a series of simulations to examine the effectiveness of the proposed dependency-based replication strategy. As in previous studies [23], we assume that the file popularity exhibits a Pareto distribution [24], although DALM works with other popularity distributions as well. The probability density function (PDF) of a random variable X with a Pareto distribution is

$$f_X(x) = \begin{cases} \dfrac{\alpha x_m^{\alpha}}{x^{\alpha+1}} & x \ge x_m, \\ 0 & x < x_m, \end{cases}$$

where $x_m$ is the minimum possible value of $x$, and $\alpha$ is a positive parameter that controls the skewness of the distribution. We plot the relative file popularity (the percentage of accesses of each file in the total accesses of all the files) as a function of the rank of the file for a sample set of 50 files in Figure 7. We can see that with larger $\alpha$, the most popular files will have more accesses.

We generate data dependency based on the file popularity, since dependent files will be accessed together with high probability and thus may have similar popularity. To reflect this fact, we randomly pick a dependency ratio for each pair of files between zero and the popularity of the less popular file, such that popular files will have strong mutual dependency with higher probability.
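A possible generator for this setup is sketched below; inverse-transform sampling with java.util.Random is assumed, as one of many ways to draw Pareto popularity values.

    import java.util.Random;

    public class SimInput {
      // Sample n popularity values from Pareto(alpha, xm) and normalize them.
      public static double[] paretoPopularity(int n, double alpha, double xm, Random rnd) {
        double[] p = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++) {
          // Inverse-transform sampling: x = xm / U^(1/alpha), U uniform in (0, 1].
          double u = 1.0 - rnd.nextDouble();
          p[i] = xm / Math.pow(u, 1.0 / alpha);
          sum += p[i];
        }
        for (int i = 0; i < n; i++) p[i] /= sum;   // relative popularity
        return p;
      }

      // Dependency of each file pair: uniform between 0 and the popularity of
      // the less popular file, so popular pairs tend to be strongly dependent.
      public static double[][] dependency(double[] p, Random rnd) {
        int n = p.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++)
          for (int j = i + 1; j < n; j++) {
            d[i][j] = rnd.nextDouble() * Math.min(p[i], p[j]);
            d[j][i] = d[i][j];
          }
        return d;
      }
    }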

We compare our dependency-based replication strategy (DALM) with a representative popularity-based system, Scarlett [2]. The basic Hadoop system serves as the baseline.

Fig. 8. Impact of the skewness of file popularity (reduced remote accesses (%) versus the Pareto index α, for DALM and Scarlett).

Fig. 9. Impact of the extra storage budget ratio (reduced remote accesses (%) versus the extra storage budget (%), for DALM and Scarlett).

In the simulation, there are 50 different files with equal size, and each file has at least 3 replicas. We examine the sensitivity to such system parameters as the extra storage budget ratio, the upper bound of the replication factor, the number of servers, and the Pareto index α, with default values of 20%, 5, 10, and 4.0, respectively.

The simulation results are shown in Figures 8-11, for each of the above system parameters, respectively. In general, we find that DALM outperforms the other two approaches with significantly reduced remote accesses. In the following we present our detailed analysis with respect to each system parameter.

Figure 8 shows the impact of the skewness of file popularity on the reduction of remote accesses, with the Pareto index α varying from 1.0 to 5.0. We can see that, with increased skewness, the data exchanges between the popular files will be more frequent, and thus DALM can reduce more remote accesses by putting highly dependent files together. However, when α = 5.0, the popularity of the most popular files is extremely high, such that the replicas of highly dependent files cannot be stored together if both of them are very popular, due to the constraint of aggregated


Fig. 10. Impact of the upper bound of the replication factor (reduced remote accesses (%) for DALM and Scarlett).

Fig. 11. Impact of system scale (reduced remote accesses (%) versus the number of servers, for DALM and Scarlett).

popularity. Scarlett also experiences a similar trend, with less improvement over the baseline.

Figure 9 illustrates the impact of the extra storage budget ratio. We can see that when the budget ratio is low, say 6.7% or 13.3%, the reduction of remote accesses is limited. The reason is that with limited extra storage, only a small number of the most popular files get many replicas, while the other popular files do not get any extra replicas. Therefore, a lot of data access/exchange requests still need to go to remote servers. The improvement of DALM becomes remarkable when the budget ratio reaches 20%, reducing remote accesses substantially by 22%, compared with 7% for Scarlett. It is worth noting that the gap between DALM and Scarlett slightly decreases as the budget ratio keeps growing. The reason is that with more extra storage, Scarlett will have more replicas, and thus dependent files would be stored together with higher probability.

With a storage budget of 20%, the impact of the upper bound of the replication factor is shown in Figure 10. The improvement of DALM becomes more significant when the upper bound grows from 3 to 6, since a higher upper bound gives more opportunities to alleviate frequent accesses to the popular files. On the other hand, when the upper bound is too high, the improvement drops, since the most popular files then receive excessive replicas while few of the other files get more than three replicas. Intuitively, this imbalance in the number of replicas maximizes the locality of the most popular files while sacrificing the locality of the other files, which account for a large portion of the total accesses.

Figure 11 shows the impact of the number of servers. Interestingly, the improvement of DALM diminishes as the number of servers grows, though DALM still noticeably outperforms Scarlett in most cases. The reason is that, with more servers, the replicas have to spread over all servers so that the deviation of the aggregated popularity of each server does not exceed a predefined value η, in which case the data dependency cannot be perfectly preserved. We expect to adapt our strategy to the number of servers. In particular, when servers are abundant, some imbalance in the servers' aggregated popularity can be tolerated, and thus more popular files with strong dependency can be placed on the same server without causing tangible hardware resource contention.
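The interplay between the η-bounded popularity balance and dependency preservation can be sketched as a greedy placement routine: each replica goes to the least-loaded server whose aggregated popularity would stay within (1 + η) of the average. The balance constraint follows the description above, but the greedy rule, function names, and parameters are our own illustrative assumptions, not DALM's actual placement algorithm (which also accounts for data dependency and topology).

```python
def place_replicas(replicas, popularity, num_servers, eta):
    # Greedy sketch: each replica goes to the least-loaded server (measured by
    # the aggregated popularity of hosted replicas) whose load would remain
    # within (1 + eta) of the average load; if no such server exists, fall back
    # to the least-loaded server that does not yet hold a copy of the file.
    total = sum(p * r for p, r in zip(popularity, replicas))
    avg_load = total / num_servers
    load = [0.0] * num_servers
    placement = {i: [] for i in range(len(replicas))}   # file -> servers
    for i in sorted(range(len(replicas)), key=lambda i: popularity[i], reverse=True):
        for _ in range(replicas[i]):
            fits = [s for s in range(num_servers)
                    if s not in placement[i]
                    and load[s] + popularity[i] <= (1 + eta) * avg_load]
            candidates = fits or [s for s in range(num_servers) if s not in placement[i]]
            s = min(candidates, key=lambda s: load[s])
            placement[i].append(s)
            load[s] += popularity[i]
    return placement

# Example: five files, three replicas each, ten servers, eta = 0.1.
print(place_replicas([3] * 5, [0.5, 0.2, 0.15, 0.1, 0.05], 10, 0.1))
```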

6.2 Testbed Experiments

We next present the real-world experiment results in a typical virtualized environment. Our testbed consists of 4 Dell servers (OPTIPLEX 7010), each equipped with an Intel Core i7-3770 3.4 GHz quad-core CPU, 16 GB 1333 MHz DDR3 RAM, a 1 TB 7200 RPM hard drive, and a 1 Gbps Network Interface Card (NIC). Hyper-Threading is enabled so that each CPU core can support two threads. All physical machines are interconnected by a NETGEAR 8-port gigabit switch. This fully controllable testbed allows the machines to communicate at full link speed, and enables us to closely examine the impact of data dependency and data locality.

We use a separate physical machine as the master node, which, as the central controller, involves heavy I/O operations and network communications to maintain the state of the whole cluster and schedule tasks. Using a dedicated physical machine, rather than a virtual machine, ensures fast response time with minimized resource contention, and accordingly enables a fair comparison with fully non-virtualized systems. The other physical machines, which host the virtual slave nodes, run the latest Xen Hypervisor (version 4.1.3). On each physical machine, besides the Domain0 VM, we configure three identical DomainU VMs, which act as slave nodes. For the operating system on the VMs (both Domain0 and DomainU), we use the popular Ubuntu 12.04 LTS 64-bit (kernel version 3.11.0-12). All the VMs are allocated two virtual CPUs and 4 GB of memory. We use the popular logical volume management (LVM) system for on-demand resizing, and allocate each DomainU VM a 100 GB disk by default. We use the open-source monitoring system Ganglia 3.5.12 to monitor the running conditions of the system.

We compare DALM with Scarlett and Surfer, as well as the basic Hadoop system, in terms of job finish time and cross-server traffic. We use a public social network dataset [19], which contains 8,164,015 vertices and 153,112,820 edges, with a total volume of 2.10 GB.

4. Ganglia Monitoring System, http://ganglia.sourceforge.net


Fig. 12. Average job finish time (x-axis: Shortest Path, PageRank; y-axis: improvement of job finish time, %; bars: Scarlett, Surfer, DALM).

There are roughly five communities in the graph, and we select one of them as the popular one. The default system settings are as follows: the block size is 32 MB, and the replication factor is 2 in the basic Hadoop system; the extra storage budget ratio is 20%, with a replication lower bound of 2 and an upper bound of 4, for both DALM and Scarlett. We allocate 32 workers in total, with 8 map workers on each slave server, and 7 map workers and 1 reduce worker on the master server. We run two representative applications to evaluate the performance of these systems, namely Shortest Path and PageRank. The Shortest Path application finds the shortest path for all pairs of vertices; the PageRank application ranks the vertices according to the PageRank algorithm.
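For reference, the per-vertex computation behind the PageRank benchmark is sketched below as a plain single-machine routine; in our experiments the equivalent vertex program runs on Giraph over Hadoop, so this code, with its hypothetical input names, only illustrates what each iteration computes.

```python
def pagerank(adjacency, num_iters=30, damping=0.85):
    # `adjacency` maps each vertex to the list of its out-neighbors.
    # This simplified version ignores dangling-vertex handling.
    vertices = list(adjacency)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(num_iters):
        contrib = {v: 0.0 for v in vertices}
        for v, neighbors in adjacency.items():
            if neighbors:
                share = rank[v] / len(neighbors)   # spread rank over out-edges
                for u in neighbors:
                    contrib[u] += share
        rank = {v: (1 - damping) / n + damping * contrib[v] for v in vertices}
    return rank

# Tiny usage example on a 4-vertex graph.
print(pagerank({0: [1, 2], 1: [2], 2: [0], 3: [2]}))
```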

6.2.1 Job Finish Time

We first compare the job finish time of the different systems. For each system, we run each benchmark application five times, and plot the average improvement of job finish time in Figure 12. Again, the basic Hadoop system serves as the baseline.

It can be observed that DALM significantly improves the job finish time over the other systems. Compared with the basic Hadoop, Scarlett, Surfer, and our proposed DALM improve the job finish time on average by 10.74%, 12.40%, and 28.09% for Shortest Path, respectively. For PageRank, the improvements are 6.74%, 16.37%, and 17.80%, respectively. These results clearly demonstrate that all three systems can significantly reduce the job finish time over the basic Hadoop system. Scarlett improves data locality by smartly replicating popular blocks, while Surfer reduces cross-server traffic with dependency-aware placement. DALM achieves the shortest job finish time by integrating the dependency-aware placement strategy of Surfer and the popularity-based replication strategy of Scarlett. Further, DALM adapts the task scheduling algorithm to be virtualization-aware, which has the potential to further improve the job finish time in larger-scale systems.

6.2.2 Data Locality and Cross-server Traffic

We now take a closer look at how DALM achieves the best performance. Data locality is an important metric in MapReduce systems, since it largely reflects the data traffic pressure on the network fabric.

Fig. 13. Data locality of jobs (x-axis: Shortest Path, PageRank; y-axis: improvement of data locality, %; bars: Scarlett, Surfer, DALM).

Fig. 14. Cross-server traffic (x-axis: Shortest Path, PageRank; y-axis: reduced cross-server traffic, %; bars: Scarlett, Surfer, DALM).

Data locality is highly desirable for MapReduce applications, especially in commercial data centers where the network fabrics are highly oversubscribed and communications across racks are generally costly. Figure 13 shows that DALM achieves a remarkable improvement in data locality compared to Scarlett and Surfer for both benchmark applications. Since Scarlett has more replicas for popular data blocks than Surfer, the data locality of Scarlett is higher. With both popularity- and dependency-based schemes, DALM is able to provide 100% data locality.
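Data locality here is measured in the usual way, as the fraction of map tasks that read their input block from the server they run on; the helper below, with hypothetical task and block identifiers, is our own illustration of that metric rather than code from the DALM prototype.

```python
def data_locality(task_assignments, block_locations):
    # `task_assignments` maps task -> (input_block, executing_server);
    # `block_locations` maps block -> set of servers holding a replica.
    # Returns the fraction of tasks whose input block is locally available.
    if not task_assignments:
        return 0.0
    local = sum(1 for block, server in task_assignments.values()
                if server in block_locations.get(block, ()))
    return local / len(task_assignments)

# Example: 3 of 4 tasks read a local replica -> 75% data locality.
assignments = {"t1": ("b1", "s1"), "t2": ("b2", "s2"),
               "t3": ("b1", "s3"), "t4": ("b3", "s1")}
locations = {"b1": {"s1", "s3"}, "b2": {"s2"}, "b3": {"s2"}}
print(data_locality(assignments, locations))  # 0.75
```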

Figure 14 explains the advantages of DALM over Hadoop, Scarlett, and Surfer from the perspective of cross-server traffic. In the basic Hadoop, only half of the tasks are localized, which means there is considerable overhead caused by remote task executions and communications. Scarlett and Surfer successfully reduce the cross-server traffic with more replicas and dependency-aware data placement, respectively. Yet Scarlett can still push strongly dependent tasks onto remote VMs, leading to increased task finish time. Surfer, on the other hand, successfully reduces a large amount of remote communication by placing dependent tasks on nearby VMs, but suffers from the hotspot problem for popular data blocks. In DALM, most tasks are assigned to the appropriate VMs storing the necessary data blocks, minimizing the number of out-of-rack tasks, and the hotspot problem is also largely mitigated through data replication.

7 CONCLUSION AND DISCUSSION

In this paper, we have shown strong evidence that data dependency, which exists in many real-world applications, has a significant impact on the performance of typical MapReduce applications. Taking graph data as a representative case study, we have illustrated that traditional data locality strategies can store highly dependent data on different servers, leading to massive remote data accesses. To this end, we have developed DALM, a comprehensive and practical solution toward dependency-aware locality for MapReduce in virtualized environments. DALM first leverages graph partitioning to identify the dependency of the input data and reorganizes the data storage accordingly. It then replicates data blocks according to historical data popularity, and smartly places data blocks across VMs subject to such practical constraints as the storage budget, the replication upper bound, and the underlying network topology. Through both simulations and a real-world deployment of the DALM prototype, we have illustrated the superiority of DALM over state-of-the-art solutions.

It is worth noting that cross-server traffic still exists in DALM. For example, if the VMs on one PM are all busy and have no free slots while some VMs on other PMs do, the task scheduler has to allocate the pending tasks to those remote VMs, incurring remote data accesses. This could be improved by incorporating delay scheduling [11] into DALM. DALM also needs the topology information of the underlying cluster for data placement. If such information is not available, a topology inference algorithm might be needed, e.g., based on bandwidth/latency measurements between VM pairs. DALM could be further improved by incorporating the prioritization technique of [31].
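One simple form such topology inference could take, sketched below under our own assumptions, is to measure pairwise round-trip latency between VMs and group those whose latency falls below a threshold as likely co-residents of the same physical machine; the threshold, the grouping rule, and all names are illustrative, not a component of DALM.

```python
def infer_colocation(latency_ms, threshold_ms=0.3):
    # `latency_ms` maps (vm_a, vm_b) -> measured round-trip time in ms.
    # VMs on the same physical machine typically see far lower RTTs than VMs
    # separated by the network, so we union pairs under the threshold.
    vms = sorted({v for pair in latency_ms for v in pair})
    parent = {v: v for v in vms}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for (a, b), rtt in latency_ms.items():
        if rtt <= threshold_ms:
            parent[find(a)] = find(b)

    groups = {}
    for v in vms:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

# Example: vm1 and vm2 share a PM (0.1 ms RTT); vm3 sits on another PM.
print(infer_colocation({("vm1", "vm2"): 0.1,
                        ("vm1", "vm3"): 0.6,
                        ("vm2", "vm3"): 0.55}))
```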

Recently, cluster or community detection in large-scale social graphs has also attracted remarkable attention [17]–[19]. The detection algorithms group densely connected nodes together, and could be used in DALM's pre-processing module. Unfortunately, they generally divide graphs into communities of at most hundreds of nodes [17]. Considering that the default file block size in HDFS is 64 MB, the granularity of such communities can be too small for data storage. Community sizes can also be highly heterogeneous, making it difficult to use a community as the basic storage unit. We expect more efficient and flexible community detection algorithms to be developed and possibly incorporated into DALM in the future.
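To illustrate the granularity mismatch, the sketch below packs small, heterogeneous communities into HDFS block-sized storage units with a first-fit-decreasing heuristic; communities larger than a block would still need to be split, which the sketch deliberately ignores. It is purely an illustration of the issue discussed above, not part of DALM.

```python
def pack_communities(community_sizes_mb, block_size_mb=64):
    # First-fit-decreasing packing of detected communities into block-sized
    # storage units, so that several small communities can share one block.
    bins = []   # each bin: [remaining_capacity_mb, [community indices]]
    order = sorted(range(len(community_sizes_mb)),
                   key=lambda i: community_sizes_mb[i], reverse=True)
    for i in order:
        size = community_sizes_mb[i]
        for b in bins:
            if size <= b[0]:
                b[0] -= size
                b[1].append(i)
                break
        else:
            bins.append([block_size_mb - size, [i]])
    return [members for _, members in bins]

# Example: communities of 40, 30, 20, 10, and 5 MB fit into two 64 MB units.
print(pack_communities([40, 30, 20, 10, 5]))
```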

REFERENCES

[1] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[2] G. Ananthanarayanan, S. Agarwal, S. Kandula, A. Greenberg, I. Stoica, D. Harlan, and E. Harris, “Scarlett: Coping with skewed content popularity in MapReduce clusters,” in Proceedings of European Conference on Computer Systems (EuroSys), 2011.

[3] C. L. Abad, Y. Lu, and R. H. Campbell, “DARE: Adaptive data replication for efficient cluster scheduling,” in Proceedings of IEEE International Conference on Cluster Computing (CLUSTER), 2011.

[4] G. Ananthanarayanan, A. Ghodsi, A. Wang, D. Borthakur, S. Kandula, S. Shenker, and I. Stoica, “PACMan: Coordinated memory caching for parallel jobs,” in Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.

[5] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, “Data warehousing and analytics infrastructure at Facebook,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010.

[6] Z. Guo, G. Fox, and M. Zhou, “Investigation of data locality in MapReduce,” in Proceedings of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2012.

[7] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, September 1999.

[8] H. Kang, Y. Chen, J. L. Wong, R. Sion, and J. Wu, “Enhancement of Xen's scheduler for MapReduce workloads,” in Proceedings of International Symposium on High Performance Distributed Computing (HPDC), 2011.

[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, “Improving MapReduce performance in heterogeneous environments,” in Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2008.

[10] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, “Quincy: Fair scheduling for distributed computing clusters,” in Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2009.

[11] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, “Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling,” in Proceedings of European Conference on Computer Systems (EuroSys), 2010.

[12] B. Palanisamy, A. Singh, L. Liu, and B. Jain, “Purlieus: Locality-aware resource allocation for MapReduce in a cloud,” in Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011.

[13] G. Karypis and V. Kumar, “Analysis of multilevel graph partitioning,” in Proceedings of ACM/IEEE Conference on Supercomputing, 1995.

[14] G. Karypis and V. Kumar, “Parallel multilevel k-way partitioning scheme for irregular graphs,” in Proceedings of ACM/IEEE Conference on Supercomputing, 1996.

[15] G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, August 1998.

[16] F. Pellegrini and J. Roman, “Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs,” in Proceedings of International Conference and Exhibition on High-Performance Computing and Networking, 1996.

[17] J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison of algorithms for network community detection,” in Proceedings of International Conference on World Wide Web (WWW), 2010.

[18] B. Karrer and M. E. J. Newman, “Stochastic blockmodels and community structure in networks,” Physical Review E, vol. 83, no. 1, p. 016107, January 2011.

[19] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” in Proceedings of ACM SIGKDD Workshop on Mining Data Semantics, 2012.

[20] J. Park, D. Lee, B. Kim, J. Huh, and S. Maeng, “Locality-aware dynamic VM reconfiguration on MapReduce clouds,” in Proceedings of International Symposium on High Performance Distributed Computing (HPDC), 2012.

[21] X. Bu, J. Rao, and C. Xu, “Interference and locality-aware task scheduling for MapReduce applications in virtual clusters,” in Proceedings of International Symposium on High Performance Distributed Computing (HPDC), 2013.

[22] H. S. Park and C. H. Jun, “A simple and fast algorithm for k-medoids clustering,” Expert Systems with Applications, vol. 36, no. 2, part 2, pp. 3336–3341, March 2009.

[23] J. Lin, “The curse of Zipf and limits to parallelization: A look at the stragglers problem in MapReduce,” in Workshop on Large-Scale Distributed Systems for Information Retrieval, 2009.

[24] M. Newman, “Power laws, Pareto distributions and Zipf's law,” Contemporary Physics, vol. 46, no. 5, pp. 323–351, 2005.

[25] J. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” in Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999, pp. 185–208.

[26] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, and G. Fox, “Twister: A runtime for iterative MapReduce,” in Proceedings of ACM International Symposium on High Performance Distributed Computing (HPDC), 2010.

[27] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web,” Technical Report, Stanford InfoLab, Stanford University, 1999.

[28] Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis, “Mizan: A system for dynamic load balancing in large-scale graph processing,” in Proceedings of ACM European Conference on Computer Systems (EuroSys), 2013.

[29] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: A system for large-scale graph processing,” in Proceedings of ACM SIGMOD International Conference on Management of Data, 2010.

[30] R. Chen, M. Yang, X. Weng, B. Choi, B. He, and X. Li, “Improving large graph processing on partitioned graphs in the cloud,” in Proceedings of ACM Symposium on Cloud Computing (SoCC), 2012.

[31] Y. Zhang, Q. Gao, L. Gao, and C. Wang, “PrIter: A distributed framework for prioritizing iterative computations,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 9, pp. 1884–1893, September 2013.