
Source: search.iiit.ac.in/uploads/NiteshM-FGCS.pdf

Dynamic Energy Efficient Data Placement and Cluster Reconfiguration Algorithm for MapReduce Framework

Nitesh Maheshwari∗, Radheshyam Nanduri, Vasudeva Varma

Search and Information Extraction Lab, Language Technologies Research Centre (LTRC)
International Institute of Information Technology, Hyderabad (IIIT Hyderabad)
Gachibowli, Hyderabad, India 500032

Abstract

With the recent emergence of cloud computing based services on the Internet, MapReduce and distributed file systems like HDFS have emerged as the paradigm of choice for developing large scale data intensive applications. Given the scale at which these applications are deployed, minimizing the power consumption of these clusters can significantly cut operational costs and reduce their carbon footprint, thereby increasing utility from a provider's point of view. This paper addresses energy conservation for clusters of nodes that run MapReduce jobs. The algorithm dynamically reconfigures the cluster based on the current workload, turning cluster nodes on or off as the average cluster utilization rises above or falls below administrator specified thresholds. We evaluate our algorithm using the GridSim toolkit, and our results show that the proposed algorithm achieves an energy reduction of 33% under average workloads and up to 54% under low workloads.

Keywords: Energy Efficiency, MapReduce, Cloud Computing

∗Corresponding author. Address: Search and Information Extraction Lab, Block B2, "Vindhya", International Institute of Information Technology, Gachibowli, Hyderabad - 500 032, India. Tel.: +91-40-66531157.

Email addresses: [email protected] (Nitesh Maheshwari), [email protected] (Radheshyam Nanduri), [email protected] (Vasudeva Varma)

Preprint submitted to FGCS (Future Generation Computer Systems) July 2011


1. Introduction

Most enterprises today are focusing their attention on energy efficient computing, motivated by the high operational costs of their large scale clusters and warehouses. These power related costs include capital investment, operating expenses, cooling costs and environmental impacts. The U.S. Environmental Protection Agency (EPA) datacenter report [1] notes that the energy consumed by datacenters doubled between 2000 and 2006, and estimates another twofold increase from 2007 to 2011 if server operation is not improved.

A substantial percentage of these datacenters run large scale data intensive applications, and MapReduce [2] has emerged as an important paradigm for building such applications. The MapReduce framework, by design, incorporates mechanisms to ensure reliability, load balancing, fault tolerance, etc., but such mechanisms can hurt energy efficiency, as even idle machines remain powered on to ensure data availability. The Server and Energy Efficiency Report [3] states that more than 15% of servers are run without being actively used on a daily basis.

Consider the case of a website whose traffic is not uniform throughout the year. The website owners can either invest in buying more servers or move to a cloud platform. Moving to the cloud is clearly the better alternative, as the owners can add resources based on traffic under a pay-per-use model. But for cloud computing to be efficient, the individual servers that make up the datacenter cloud need to be used optimally. None of the major cloud providers release their server utilization data, so it is not possible to know how efficient cloud computing actually is. We have also observed that most applications developed today fail to exploit the low power modes supported by the underlying framework.

In this paper, we address power conservation for clusters of nodes that run MapReduce jobs, as MapReduce frameworks such as Hadoop [4] have no separate power controller. Although we focus on one specific framework, our approach can also be applied to other application frameworks in which a power controller module can be connected to the framework's foundational software. Our key contribution is an algorithm that dynamically reconfigures the cluster based on the current workload. It scales up (down) the number of nodes in the cluster when the average cluster utilization rises above (falls below) a threshold specified by the cluster administrator. By doing this, underutilized nodes in the cluster can be turned off


to save power. We remodel the cluster's data placement policy to place data on the cluster and reconfigure the cluster depending on the workload experienced. The advantage of such a modification to the framework is that applications developed for these clusters need not be fine-tuned to be energy aware.

To verify the efficacy of our algorithm, we simulated the Hadoop Distributed File System (HDFS) [5] using the GridSim toolkit [6]. We studied the behavior of our algorithm with the default Hadoop implementation as the baseline, and our results indicate an energy reduction of 33% under average workloads and up to 54% under low workloads.

A substantial amount of earlier research has dealt with optimizing the energy efficiency of servers at the component level: processors, memories and disks [7, 8, 9, 10]. Most literature in this field addresses energy efficiency with local techniques such as Dynamic Voltage Scaling (DVS), request batching and multi-speed disks [11, 12, 13]. In our work, we employ a cluster wide technique in which the global state of the cluster is used to dynamically scale the cluster to handle the workload imposed on it.

The remainder of this paper is organized as follows. Section 2 gives a brief overview of the HDFS rack aware data placement policy and the key idea behind our chosen approach. Section 3 presents our energy efficient algorithm for cluster reconfiguration and data placement. Section 4 describes the methodology and simulation models used in evaluating our algorithm, along with the results of the evaluation. Section 5 compares our work with existing work. Finally, Section 6 concludes by summarizing the key contributions and touching on future topics in this area.

2. Background

2.1. Need for a Power Controller in MapReduce

With the emergence of cloud computing in the past few years, MapReduce [2] has seen tremendous growth, especially for large-scale data intensive computing [14]. A large segment of this MapReduce workload is managed by Hadoop [4], Amazon Elastic MapReduce [15] and Google's in-house MapReduce implementation. MapReduce was designed for deployment on clusters of inexpensive commodity hardware; hence, even idle nodes remain powered on to ensure data availability. However, MapReduce frameworks such as Hadoop have no separate power controller as yet.


 

Figure 1: Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work. Source: [16].

Datacenters are known to be expensive to operate, and they consume huge amounts of electric power [17]. Google's server utilization and energy consumption study [16] reports that energy efficiency peaks at full utilization and drops significantly as the utilization level decreases (Figure 1). Power consumption at zero utilization is still considerably high (around 50%); essentially, even an idle server consumes about half its maximum power. We also observe that the energy efficiency of these servers lies in the range of 20-60% when operating at 20-50% utilization. Figure 1 thus indicates that dynamically reconfiguring the cluster by scaling it, while running the active nodes at higher utilization, is the best decision from a power management point of view.
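The consolidation argument above can be made concrete with a simple linear power model. This is an illustrative sketch, not the paper's measured data: the idle fraction of 0.5 is an assumption taken from the observation that an idle server draws about half its peak power.

```python
# Illustrative linear server power model (assumed, not from the paper's data):
# an idle server draws about half of peak power, consistent with Figure 1.
IDLE_FRACTION = 0.5  # assumed idle power as a fraction of peak

def power_fraction(utilization):
    """Power drawn (fraction of peak) at a utilization level in [0, 1]."""
    return IDLE_FRACTION + (1.0 - IDLE_FRACTION) * utilization

def energy_efficiency(utilization):
    """Useful work per unit power, normalized so efficiency is 1.0 at peak."""
    if utilization == 0:
        return 0.0
    return utilization / power_fraction(utilization)
```

Under this model, two servers each at 50% utilization draw 2 × 0.75 = 1.5 times peak power, while one server at 100% draws only 1.0 times peak for the same total work, which is exactly why consolidating load onto fewer, busier nodes saves energy.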

The MapReduce framework implemented by Hadoop relies heavily on the performance and reliability of the underlying HDFS. It divides applications into numerous smaller blocks of work and application data into smaller data blocks. It then allows applications to execute in parallel in a fault tolerant way by creating redundant copies of both data and computation.


2.2. HDFS Data Layout

HDFS is based on the Google File System (GFS) [18] and has a master-slave architecture, with the master called the NameNode and the slaves called DataNodes. The NameNode is responsible for storing the HDFS namespace and records changes to the file system metadata. The DataNodes are spread across multiple racks and store the HDFS data in their local file systems, as shown in Figure 2. The DataNodes are connected via a network, and intrarack network bandwidth between machines is usually higher than interrack network bandwidth.

 

Figure 2: Architecture of the Hadoop Distributed File System (HDFS). Communication between the master node (NameNode) and the worker nodes (DataNodes) takes place by means of Heartbeat messages. Files stored on HDFS are split into 64 MB chunks and distributed across racks.

Each file stored on HDFS is split into smaller chunks of size 64 MB, called data blocks, and these blocks are distributed across the cluster. Each block is replicated to ensure data availability even in case of connectivity failure


to nodes or even complete racks. To this end, Hadoop implements a rack-aware data block replication policy for the data stored on HDFS. For the default replication factor of three, HDFS's replica placement policy is to put one replica of the block on one node in the local rack, another on a different node in the same rack, and the third on a node in some other rack.
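The default placement policy described above can be sketched as follows. This is a minimal illustration, not Hadoop's actual implementation; the cluster layout mapping and function name are hypothetical.

```python
import random

def place_replicas(local_node, local_rack, cluster):
    """Pick target nodes for three replicas, per the rack-aware policy
    described above: first replica on the local node, second on a different
    node in the same rack, third on a node in some other rack.
    `cluster` maps rack id -> list of node ids (hypothetical layout)."""
    same_rack = [n for n in cluster[local_rack] if n != local_node]
    other_racks = [r for r in cluster if r != local_rack]
    second = random.choice(same_rack)           # different node, same rack
    remote_rack = random.choice(other_racks)    # some other rack
    third = random.choice(cluster[remote_rack])
    return [local_node, second, third]
```

This spreads replicas so that the loss of a single node, or even a whole rack, still leaves at least one live copy of the block.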

Each DataNode periodically sends a heartbeat message to the NameNode to report its current state, including a block report of the blocks it stores. The NameNode assumes DataNodes without recent heartbeat messages to be dead and stops forwarding requests to them. The NameNode also re-replicates data lost because of dead nodes, corrupt blocks, hard disk failures, etc. If some DataNodes are turned off without informing the NameNode, the cluster might enter what can be called a panic phase, in which the NameNode tries to re-replicate the blocks previously stored on those DataNodes. This process can generate heavy network traffic [19].

2.3. Replica Placement in Hadoop

Figure 3: A typical replica pipeline in Hadoop for a replication factor of 3.

In the context of high-volume data processing, the limiting factor is the rate at which data can be transferred between nodes. Hadoop uses the bandwidth between nodes as a measure of the distance between them. Hence, the cost of data transfer between nodes in the same rack is less than that between


nodes in different racks, which in turn is less than that between nodes in different datacenters. The replication pipeline for a replication factor of three is shown in Figure 3.

3. Proposed Algorithm

We attempt to reduce the energy consumption of datacenters that run MapReduce jobs by reconfiguring the cluster. In this section, we explain the approach followed in our proposed algorithm for cluster reconfiguration.

3.1. Approach

We implement a channel between the cluster's power controller and HDFS that dynamically scales the number of nodes in the cluster to meet current service demands (Figure 4). We start with a subset of nodes powered on and intelligently accept job requests by checking whether they can be served with the current state of the cluster. We add more nodes to the cluster if the available resources are not enough to serve these requests, and power them off again when they are not being used optimally.

When the NameNode receives a request to create a file Fk of size Sk with replication factor Rk, it calculates the number of HDFS blocks needed to store it using the following formula:

⌈Sk / B⌉ × Rk    (1)

where B is the HDFS block size. The block size is configurable per file, with a default value of 64 MB. The space needed to store Fk on HDFS is:

Sk × Rk    (2)

The algorithm checks whether there is enough space available on HDFS to store the file and turns on more nodes if needed. At all times, we keep some space reserved on the DataNodes so that write operations do not block or fail waiting for more space while new DataNodes are being powered on. After the nodes are powered on, the cluster is rebalanced, as there is no data on the newly activated DataNodes while other DataNodes are already substantially filled. The NameNode instructs the DataNodes to create blocks in a rack-aware manner on HDFS and updates its namespace and metadata.
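Equations (1) and (2) translate directly into code. The sketch below uses the default 64 MB block size; the function names are illustrative, not from the paper.

```python
import math

BLOCK_SIZE_MB = 64  # HDFS default block size B

def blocks_needed(file_size_mb, replication, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks for a file: ceil(Sk / B) * Rk (Equation 1)."""
    return math.ceil(file_size_mb / block_size_mb) * replication

def space_needed_mb(file_size_mb, replication):
    """Raw HDFS space consumed by the file: Sk * Rk (Equation 2)."""
    return file_size_mb * replication
```

For example, a 200 MB file with replication factor 3 needs ⌈200/64⌉ = 4 blocks per replica, 12 blocks in total, and consumes 600 MB of raw HDFS storage.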


When the NameNode receives a request to delete a file, it instructs the DataNodes to delete the blocks corresponding to that file. This communication with the DataNodes happens by means of the heartbeat messages exchanged periodically between the NameNode and the DataNodes. If the cluster becomes underutilized after the data is deleted, we move the data from the least utilized nodes to other nodes and turn those nodes off.

We shall make use of the following notation in the rest of the paper:

• n = Number of currently active nodes on the cluster

• Cθ = Disk capacity of DataNode θ

• Uθ = Current utilization of DataNode θ

• Rθ = Set of active nodes in DataNode θ’s rack

• U = Average utilization of active nodes in the cluster

• U(R) = Average utilization of active nodes in rack R

• Uu = Scale up threshold

• Ud = Scale down threshold

• τ = Transfer threshold (in percentage)

• δ = Dampening factor for interrack transfer (in percentage)

• λ = Percentage of initially inactive nodes per rack

• µ = Minimum percentage of active nodes per rack

All the U values mentioned above are in percentage.

As demonstrated in Figure 4, when the NameNode receives a request to create a file, it checks with the Power Controller whether the cluster needs to scale up to accommodate the current request. Similarly, when data is deleted from the cluster, the NameNode consults the Power Controller to determine whether some nodes can be deactivated to save power. The Power Controller module responds with cluster reconfiguration information detailing the nodes to be activated or deactivated. It decides which nodes to activate or deactivate based on the scale up threshold Uu, the scale down threshold Ud, and the current average utilization U of active nodes in the cluster.


Figure 4: Proposed Approach: interaction of the Power Controller module with the NameNode. The NameNode performs rebalancing and scaling operations once it receives the reconfiguration information from the Power Controller.

Once the NameNode receives this cluster reconfiguration information, it performs scaling and rebalancing operations depending on the average utilization U(Ri) of active nodes in the racks. A detailed description of these operations is given in the subsequent sections. The latest version of Hadoop (0.21.0) comes with a pluggable interface for placing replicas of data blocks in HDFS and supports an API that allows a module external to HDFS to specify how data blocks should be placed [20]. This will ease the integration of the placement policy we suggest into Hadoop.

3.2. Cluster Reconfiguration

We consider the following types of cluster reconfiguration in our approach.


3.2.1. Scaling Up

Figure 5 presents our algorithm for scaling up the cluster when the current resources are not enough to serve write requests. Scaling up is triggered when the average utilization of the active nodes (U) exceeds the administrator specified threshold Uu.

 1: if U > Uu then
 2:   ActivateDataNodes(Θ)
 3:   for all θdst ∈ Θ do
 4:     for all θsrc ∈ [Rθdst − θdst] do
 5:       {For all nodes in the same rack}
 6:       if Uθsrc − U > τ then
 7:         p = (Uθsrc − U) × Cθsrc
 8:         IntrarackTransfer(θsrc, θdst, p)
 9:       end if
10:     end for
11:     if Uθdst < U(Rθdst) then
12:       for all θsrc ∉ Rθdst do
13:         {For all nodes in some other rack}
14:         if Uθsrc − U > τ then
15:           p = (Uθsrc − U − δ) × Cθsrc
16:           InterrackTransfer(θsrc, θdst, p)
17:         end if
18:       end for
19:     end if
20:   end for
21: end if

Figure 5: Algorithm for the Scaling Up reconfiguration. Upon reaching the trigger condition, we perform intrarack and interrack transfers to all the newly added nodes.

When the trigger condition is reached, we select a set of DataNodes (Θ) to be powered on in the function ActivateDataNodes(Θ). We choose the least recently deactivated nodes to avoid hotspots caused by the same set of machines running all the time. It is advantageous to activate nodes from different racks, because HDFS follows a rack-aware replica placement policy in which replicas of a data block are spread across multiple racks. Also, activating multiple nodes in the same iteration saves time, as these nodes can be booted in parallel.

Adding new nodes while scaling up results in more storage space on the cluster. As a result, the average utilization level of the cluster drops and the data distribution across nodes becomes skewed. Therefore, we balance the racks where new nodes were recently activated. For each newly activated DataNode (θdst), we first perform an intrarack transfer of blocks (Figure 5, lines 4 to 10). If the utilization of a node (θsrc) in the rack exceeds U by the transfer threshold (τ) specified by the administrator, we transfer the excess data from θsrc to θdst. If the intrarack transfer does not sufficiently fill the storage on θdst, an interrack transfer of blocks to θdst is performed (Figure 5, lines 12 to 18). Since interrack transfer is costlier than intrarack transfer, the amount of excess data moved in an interrack transfer is dampened by a value of δ.
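The transfer amounts computed in Figure 5 (lines 7 and 15) can be sketched in one function. This is an illustration under simplifying assumptions: utilizations are expressed as fractions of capacity rather than percentages, and the function name is hypothetical.

```python
def scale_up_transfer_mb(u_src, u_avg, capacity_src_mb, tau, delta=None):
    """Amount of data moved off an overfull source node during scale up.
    Intrarack case:  p = (Uθsrc − U) × Cθsrc   when the excess exceeds τ.
    Interrack case:  the excess is dampened by δ (pass delta to get this).
    Utilizations and thresholds are fractions in [0, 1] here for simplicity."""
    excess = u_src - u_avg
    if excess <= tau:
        return 0.0  # below the transfer threshold: leave the node alone
    if delta is not None:
        excess -= delta  # interrack moves are costlier, so move less data
    return excess * capacity_src_mb
```

For a 1000 MB node at 70% utilization against a 50% cluster average and τ = 5%, the intrarack transfer is 200 MB, while the interrack transfer with δ = 5% is only 150 MB.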

3.2.2. Scaling Down

Figure 6 presents our algorithm for scaling down the cluster. Scaling down is triggered when the average utilization of the active nodes (U) falls below the administrator specified threshold Ud.

If the trigger condition is reached, it is likely that some nodes in the cluster are underutilized. The function CandidateSet() considers the activation time and current utilization of nodes while selecting the set of candidate nodes (Θ) that can potentially be powered off. We perform two checks on each candidate node (θsrc) before deactivating it. First, we do not deactivate a node if some node from its rack has been deactivated recently. Second, we keep a percentage of nodes (µ) activated at all times in every rack, so that the cluster can still serve write requests that arrive in bursts.

For each candidate node θsrc, we calculate the expected average utilization of its rack (m′) if θsrc were to be powered off. We then perform an intrarack transfer of blocks to nodes in the same rack (Figure 6, lines 6 to 12). If the difference between Uθdst and m′ is less than the transfer threshold (τ) specified by the administrator, we transfer the excess data from θsrc to θdst. If the intrarack transfer does not completely vacate the storage on θsrc, an interrack transfer of blocks from θsrc is performed (Figure 6, lines 14 to 20). As in the scaling up reconfiguration, the amount of excess data moved in an interrack transfer is dampened by a value of δ. When all the data blocks from θsrc have been moved, we add it to the set of nodes


 1: if U < Ud then
 2:   Σ ← ∅ {Set of nodes that will be deactivated}
 3:   Θ = CandidateSet()
 4:   for all θsrc ∈ Θ do
 5:     m′ = (U(Rθsrc) × |Rθsrc|) / (|Rθsrc| − 1)
 6:     for all θdst ∈ [Rθsrc − θsrc] do
 7:       {For all nodes in the same rack}
 8:       if Uθdst − m′ < τ then
 9:         p = (m′ + δ − Uθdst) × Cθdst
10:         IntrarackTransfer(θsrc, θdst, p)
11:       end if
12:     end for
13:     if Uθsrc > 0 then
14:       for all θdst ∉ Rθsrc do
15:         {For all nodes in some other rack}
16:         if Uθdst − m′ < τ then
17:           p = (m′ − Uθdst) × Cθdst
18:           InterrackTransfer(θsrc, θdst, p)
19:         end if
20:       end for
21:     end if
22:     Σ ← Σ ∪ {θsrc}
23:   end for
24:   DeactivateDataNodes(Σ)
25: end if

Figure 6: Algorithm for the Scaling Down reconfiguration. Upon reaching the trigger condition, we perform intrarack and interrack transfers from the nodes to be removed.


to be deactivated (Σ). Finally, after considering all the candidate nodes, we deactivate the nodes in Σ.
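The quantity m′ from Figure 6 (line 5), and the intrarack amount pushed onto a destination node (line 9), can be sketched as follows. The function names are illustrative; utilizations are percentages, matching the paper's notation.

```python
def expected_rack_utilization(rack_avg, active_in_rack):
    """m' from Figure 6, line 5: the rack's average utilization if one of
    its active nodes is powered off and its data stays within the rack.
    m' = (U(R) * |R|) / (|R| - 1), with utilizations in percent."""
    return (rack_avg * active_in_rack) / (active_in_rack - 1)

def scale_down_intrarack_amount(m_prime, u_dst, capacity_dst, delta):
    """Data pushed onto a destination node while vacating a source node:
    p = (m' + δ − Uθdst) × Cθdst (Figure 6, line 9)."""
    return (m_prime + delta - u_dst) / 100.0 * capacity_dst
```

For example, a rack of 10 active nodes averaging 45% utilization would sit at m′ = 50% after one node is removed; a destination node at 40% with δ = 5% and 1000 GB capacity would then receive (50 + 5 − 40)% × 1000 = 150 GB.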

3.2.3. Avoiding Jitter Effect

While performing these scaling operations of activating and deactivating nodes, the cluster can experience a jitter effect if load variability is high. Consider a case in which, by the time new nodes have been activated, the load has dissipated, resulting in some already active nodes being shut down, by which time the load is increasing again. The parameters of our algorithm can be configured so that this jitter effect does not occur: the difference between the scale up threshold Uu and the scale down threshold Ud should be made sufficiently large that the cluster does not oscillate between activating and deactivating nodes due to small changes in the average utilization U of active nodes in the cluster.
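Why the threshold gap matters can be seen in a crude sketch that simply counts trigger events over a utilization trace. This deliberately ignores the feedback between reconfiguration and utilization, so it understates real jitter; it only shows that a narrow gap turns small oscillations into constant reconfiguration while a wide gap absorbs them.

```python
def count_reconfigurations(utilization_trace, scale_up, scale_down):
    """Count how often a utilization trace (percentages) crosses outside
    the [scale_down, scale_up] band and would trigger a reconfiguration."""
    events = 0
    for u in utilization_trace:
        if u > scale_up or u < scale_down:
            events += 1
    return events
```

A trace oscillating mildly between 60% and 70% triggers nothing with the defaults Uu = 80, Ud = 50, but triggers on every sample with a tight band such as (68, 62).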

3.3. Cluster Rebalancing

The level of network activity depends on the techniques used for rebalancing the cluster; these techniques are therefore crucial for our algorithm to perform well. The cluster needs to be rebalanced after scaling up the number of nodes in order to distribute the extra workload from overutilized nodes to the newly added nodes. Similarly, we need to perform rebalancing before scaling down the number of nodes, so that the workload of underutilized nodes can be transferred to already active nodes before they are turned off. We put an upper limit on the bandwidth used for rebalancing operations so that they do not congest the network. This also ensures that other ongoing file operations remain unaffected while the cluster is being rebalanced.

We employ a simple heuristic for rebalancing: we try to move the maximum amount of data to nodes in the same rack, since network bandwidth is greater for intrarack transfers. While moving data blocks across nodes, we do not violate the rack-aware replica placement policy followed by HDFS. We also do not move data currently in use by a job running on the cluster; this ensures that we do not compromise the fault tolerance or runtime of already running jobs while performing rebalancing operations. A data block is removed from the source DataNode only after all the running tasks using that block have completed. In order to respect the rack aware block replication policy followed by Hadoop, we


make the following checks before moving a data block within the same rack and across racks (Figure 7 illustrates this approach with examples):

Figure 7: Examples of valid and invalid movement of data blocks while performing intrarack and interrack transfers.

Intrarack Transfer:

Intrarack transfer means transferring a data block from a node to some other node in the same rack. While performing an intrarack transfer, we move a data block to another node only if a replica of that block does not already exist on the destination node. Nodes in the same rack are connected by a higher bandwidth link, so intrarack transfers are cheaper than interrack transfers.

Interrack Transfer:

Interrack transfer means transferring a data block from a node to one of the nodes in some other rack. In the case of an interrack transfer, a data block is moved only if a replica of that block does not exist on any node in the destination rack.


The subroutines IntrarackTransfer(θsrc, θdst, p) and InterrackTransfer(θsrc, θdst, p) in the algorithms of Figures 5 and 6 transfer p amount of data from node θsrc to node θdst. Table 1 summarizes the amount of data transferred from θsrc to θdst for intrarack and interrack transfers during scaling up and scaling down operations.

                     Scaling Up                  Scaling Down¹
IntrarackTransfer    (Uθsrc − U) × Cθsrc         (m′ + δ − Uθdst) × Cθdst
InterrackTransfer    (Uθsrc − U − δ) × Cθsrc     (m′ − Uθdst) × Cθdst

Table 1: Data transferred (p) from θsrc to θdst

4. Evaluation and Results

We use the GridSim toolkit to simulate the HDFS architecture and perform our experiments. We verify the efficacy of our algorithm and study its behavior with the default HDFS implementation as the baseline.

4.1. Simulation Model

The NameNode and DataNode classes in our simulations extend the GridSim toolkit's DataGridResource class. The DataNodes are distributed across racks, and metadata for these mappings is maintained at the NameNode. We implement the default rack-aware replica placement policy followed by HDFS. We then incorporate the cluster reconfiguration decisions suggested by our algorithm to dynamically scale the number of nodes in the cluster. We demonstrate the scale up and scale down operations of our algorithm and their corresponding energy savings.

The parameters used in our simulation runs, along with their default values, are specified in Table 2. These parameters are kept constant at their default values between different runs while comparing results. All values reported in these results are averaged over ten independent runs of varying workloads,

¹ where m′ = (U(Rθsrc) × |Rθsrc|) / (|Rθsrc| − 1)


Table 2: Simulation Parameters

Parameter                                             Value
Number of racks                                       UniformRandom(5, 10) [Default = 8]
Number of nodes per rack                              UniformRandom(10, 20) [Default = 15]
Storage capacity of nodes (GB)                        UniformRandom(100, 500)
HDFS default block size, B                            64 MB
Scale up threshold, Uu                                80%
Scale down threshold, Ud                              50%
Transfer threshold, τ                                 5%
Dampening factor for interrack transfer, δ            5%
Percentage of initially inactive nodes per rack, λ    30-70% [Default = 50%]
Minimum percentage of active nodes per rack, µ        30-70% [Default = 50%]

unless otherwise specified. We vary the number, size and replication factor of the files created on HDFS in our experiments.

4.2. Cluster Reconfiguration

To verify that our cluster reconfiguration algorithm can scale the cluster in accordance with the workload imposed on it, we perform basic experiments for the scale up and scale down operations. These experiments were performed with 50 nodes spread across 5 racks, each rack having 10 nodes. We plot the number of active nodes and their average utilization for the default HDFS implementation and for our algorithm when presented with the same workload (Figures 8a and 8b).

4.2.1. Scaling Up

Figure 8a summarizes the results of our scale up experiments. We start with a percentage of nodes (λ = 50%) turned off initially and scale up the



 

[Figure 8 plots omitted: each panel shows "Number of active nodes" and "Percentage Utilization of active nodes" against simulation time, for the default HDFS implementation and for our algorithm.]

(a) Basic scale up operation, with Uu = 80% and λ = 50%

(b) Basic scale down operation, with Ud = 50% and µ = 50%

Figure 8: Basic Scale Up and Scale Down operations for a cluster of 50 nodes spread across 5 racks, with each rack having 10 nodes.



cluster when the average utilization rises above the scale up threshold (Uu = 80%). The boxed area in the plot represents the scaling up operation, where more nodes are added to the cluster whenever the cluster utilization (U) rises above the value Uu. Adding more nodes to the cluster brings down the average utilization of the nodes, and the cluster accepts incoming requests until the average utilization reaches Uu again, as demonstrated by the peaks in the curve (Figure 8a). Our algorithm tries to maintain the utilization levels of the nodes at Uu, affirming that the nodes are being used in the region of higher energy efficiency, as specified in Figure 1. In this case, the average number of inactive nodes during the experiments denotes energy savings of about 36%.
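The scale up step above can be sketched as a simple loop; this is a minimal illustration of the threshold test, not our simulator's code, and it assumes a freshly activated node starts with zero load:

```python
def scale_up(active, inactive, utilization, U_u=0.80):
    """Activate standby nodes while the average utilization of active
    nodes exceeds the scale up threshold U_u; each freshly activated
    node starts with zero load and pulls the average back down."""
    while inactive and sum(utilization[n] for n in active) / len(active) > U_u:
        node = inactive.pop()
        utilization[node] = 0.0
        active.append(node)
    return active

# two heavily loaded nodes trip the 80% threshold; one standby node wakes
util = {"n1": 0.9, "n2": 0.9}
print(scale_up(["n1", "n2"], ["n3"], util))  # -> ['n1', 'n2', 'n3']
```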

4.2.2. Scaling Down

Figure 8b summarizes the results of our scale down experiments. Initially all the nodes in the cluster are being used, and we scale down the cluster when the average utilization falls below the scale down threshold (Ud = 50%). The boxed area in the plot represents the scaling down operation, where nodes are deactivated whenever the cluster utilization (U) falls below the value Ud. Deactivating nodes brings up the average utilization of the remaining nodes, and our algorithm scales down the cluster as long as the percentage of active nodes is more than the value specified by µ, as demonstrated by the curve in Figure 8b. The energy savings denoted by the experiments in this case are about 27%.
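The scale down step can be sketched analogously. This illustration deactivates the least-utilized node while the cluster average stays below Ud, subject to the per-rack minimum µ; the load-migration details are simplified away and the structure is illustrative rather than our simulator's code:

```python
def scale_down(racks, active, utilization, U_d=0.50, mu=0.50):
    """Deactivate the least-utilized node while the average utilization
    of active nodes is below U_d, never letting any rack's active
    fraction fall below mu (the data availability constraint)."""
    def avg():
        return sum(utilization[n] for n in active) / len(active)
    removed = True
    while removed and avg() < U_d:
        removed = False
        for rack_nodes in racks.values():
            rack_active = [n for n in rack_nodes if n in active]
            # only deactivate if the rack stays at or above fraction mu
            if (len(rack_active) - 1) / len(rack_nodes) >= mu:
                victim = min(rack_active, key=utilization.get)
                active.remove(victim)
                removed = True
                break
    return active

racks = {"r1": ["a", "b", "c", "d"]}
util = {"a": 0.1, "b": 0.2, "c": 0.6, "d": 0.7}
print(scale_down(racks, ["a", "b", "c", "d"], util))  # -> ['b', 'c', 'd']
```

Removing the least-utilized node always raises the remaining average, so the loop terminates either at Ud or at the µ floor.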

4.3. Energy Savings

Next, we perform scale up and scale down experiments over a range of workloads to calculate the average energy savings with our algorithm. These experiments were performed with 120 nodes spread across 8 racks, with each rack having 15 nodes. In our model, the amount of energy saved is proportional to the number of deactivated nodes. We plot the energy savings for the cases when the cluster is imposed with low and high workloads.
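Under this model, the savings over a run follow directly from the active-node counts sampled over time; a small sketch (the sample values below are illustrative, not measured data):

```python
def energy_savings(active_history, total_nodes):
    """Energy saved is proportional to the inactive-node count, so we
    average the fraction of deactivated nodes over the timeline."""
    return sum((total_nodes - a) / total_nodes
               for a in active_history) / len(active_history)

# e.g. a 120-node cluster sampled at four points in time
print(round(energy_savings([70, 80, 85, 85], 120) * 100, 1))  # -> 33.3
```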

4.3.1. Workload Imposed

The number of files is varied from 200 for low workloads to 400 for high workloads, and the minimum file size is varied from 3000 MB for low workloads to 15000 MB for high workloads. The maximum file size is set to 25000 MB across all the experiments, whereas the number of replicas of these files is selected randomly from 1 to 3. We observe that the cluster intelligently



reconfigures itself based on the workload imposed, proving the effectiveness of our algorithm. As expected, the energy conserved in the case of low workloads is significantly higher.

4.3.2. Scaling Up and the effect of λ

We calculate the average amount of energy saved during the scale up operation by performing experiments over a range of workloads (Figure 9). We analyze the behavior of the plots for various values of λ, which is the percentage of nodes kept inactive at the start of each experiment. The average energy savings denoted by these experiments are about 36%.

 

[Figure 9 bar chart; plotted values (energy savings in percentage):]

λ (%)            30      40      50      60      70
Low workload    26.35   38.24   44.05   52.49   54.66
High workload   19.87   26.98   30.48   35.76   37.89

Figure 9: The effect of varying λ on energy savings while scaling up

We make the following observations from the plots:

• Increasing the value of λ increases the percentage of energy conserved.

• Energy conserved for low workloads can reach up to 54% for higher values of λ.

• Even for high workloads and lower values of λ, the energy conserved is about 20%.



4.3.3. Scaling Down and the effect of µ

We calculate the average amount of energy saved during the scale down operation in a similar manner over a range of workloads (Figure 10). We analyze the behavior of the plots for various values of µ, which is the minimum percentage of nodes that must be kept active in each rack at all times. The average energy savings denoted by these experiments are about 30%.

 

[Figure 10 bar chart; plotted values (energy savings in percentage):]

µ (%)            30      40      50      60      70
Low workload    48.43   43.58   41.95   35.15   32.80
High workload   26.76   23.70   22.11   18.22   15.68

Figure 10: The effect of varying µ on energy savings while scaling down

We make the following observations from the plots:

• Increasing the value of µ decreases the percentage of energy conserved.

• Energy conserved for low workloads can reach up to 48% for lower values of µ.

• Even for high workloads and higher values of µ, the energy conserved is about 15%.

5. Related Work

Lowering the energy usage of datacenters is a challenging and complex issue because computing applications and data are growing so quickly that increasingly larger servers and disks are needed to process them fast enough



within the required time period [21]. Fan et al. report that the opportunities for power and energy savings are significant at the cluster level, and hence systems need to be energy efficient across their activity range [22].

For MapReduce frameworks like Hadoop, Vasic et al. present a design for energy-aware MapReduce and HDFS where they leverage the sleeping state of machines to save power [19]. Their work also proposes a model for collaborative power management in which a common control platform acts as a communication channel between the cluster power management and the services running on the cluster. Leverich et al. present a modified design for Hadoop that allows scale-down of an operational cluster [23]. They propose the notion of a covering subset during block replication: at least one replica of a data block must be stored in the covering subset. This ensures data availability even when all the nodes not in the covering subset are turned off. Also, the system configuration parameters used for MapReduce clusters have a high impact on their energy efficiency [24].

An important difference between the works mentioned above and ours is that the techniques they employ compromise on the replication factor of data blocks stored on the cluster. Our algorithm attempts to keep the cluster utilized to its maximum allowed potential and accordingly scales the number of nodes without compromising on the replication factor of the data blocks.

Our work is inspired by earlier works by Pinheiro et al., who present the problem of energy efficiency using cluster reconfiguration at the application level for clusters of nodes, and at the operating system level for standalone servers [25, 26]. We present the same problem for a cluster of machines running MapReduce frameworks such as Hadoop. Work by Perez et al. [27] presents a mathematical formalism to achieve dynamic reconfiguration with the use of storage groups for data-based clusters. Work by Duy et al. also presents this problem using a machine learning based approach, applying neural network predictors to forecast future load demand based on historical demand [28].

The technique of Dynamic Voltage Scaling (DVS) has been employed to provide power-aware scheduling algorithms that minimize power consumption [29, 30]. DVS is a power management technique in which undervolting (decreasing the voltage) is done to conserve power and overvolting (increasing the voltage) is done to increase computing performance. Hsu et al. apply a variation of DVS called Dynamic Voltage and Frequency Scaling (DVFS), operating servers at various CPU voltage and frequency levels to reduce overall



power consumption [31].

Berl et al. propose virtualization with cloud computing as a way forward

to identify the main sources of energy consumption [32]. Live migration and placement optimizations of virtual machines (VMs) have been used in earlier works to provide a mechanism to achieve energy efficiency [33, 34, 35]. Power-aware provisioning and scheduling of VMs has also been combined with DVFS techniques to reduce the overall power consumption [36, 37]. Although virtualization helps reduce a datacenter's power consumption, the ease of provisioning virtualized servers can lead to uncontrolled growth and more unused servers, a phenomenon called virtual server sprawl [3].
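The savings available from the DVS and DVFS techniques discussed above follow from the dynamic CMOS power relation P ≈ C · V² · f: scaling voltage and frequency together reduces power superlinearly. A back-of-the-envelope sketch (the 15% scaling factors are illustrative, not taken from the cited works):

```python
def dynamic_power_ratio(v_scale, f_scale):
    """Dynamic CMOS power scales as P = C * V^2 * f, so scaling voltage
    and frequency multiplies power by v_scale^2 * f_scale."""
    return v_scale ** 2 * f_scale

# undervolting by 15% with a matching 15% frequency drop
ratio = dynamic_power_ratio(0.85, 0.85)
print(f"power reduced to {ratio:.0%} of nominal")  # -> about 61%
```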

6. Conclusion and Future Work

In this paper we addressed the problem of energy conservation for large datacenters that run MapReduce jobs. We also discussed that servers are most energy efficient when used at a higher utilization level. In this context, we proposed an energy efficient data placement and cluster reconfiguration algorithm that dynamically scales the cluster in accordance with the workload imposed on it. However, the efficacy of this approach depends on how quickly a node can be activated. We evaluated our algorithm by performing simulations, and the results show energy savings as high as 54% under low workloads and about 33% under average workloads.

Future directions of our research include applying machine learning approaches to study the resource usage patterns of jobs and model their power usage characteristics in order to schedule them efficiently on the cluster. Upon submission of a job, we could make use of knowledge about its dataset to predict the data layout for its data blocks and optimally schedule the tasks by exploiting data locality. We would also like to study the performance of our algorithm when presented with different types of workloads, such as processor intensive, memory intensive, disk intensive and network intensive. We would further like to study the impact of a node's boot time and the ACPI [38] power states available in the hardware on the efficiency of our algorithm.

We plan to find optimal values of MapReduce configuration parameters so that our algorithm can provide even better energy savings. We also plan to make use of multiple heuristics, such as the access time and minimum replication factor of a file, while rebalancing the cluster to avoid overhead on the network.



Acknowledgment

This work is partly supported by a grant from Yahoo! India R&D, under the Nurture an Area: Cloud Computing project. We would like to thank Chidambaran Kollengode of Yahoo! and Reddyraja Annareddy of Pramati Technologies for their continual guidance throughout this work. We would also like to thank Jaideep Dhok, Sudip Datta and Manisha Verma for their constant feedback on our work.

References

[1] EPA Report to Congress on Server and Data Center Energy Efficiency, Tech. rep., U.S. Environmental Protection Agency (2007).

[2] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107–113. doi:10.1145/1327452.1327492.

[3] Server Energy and Efficiency Report, Tech. rep., 1E (2009).

[4] Apache Hadoop, http://hadoop.apache.org.

[5] Hadoop Distributed File System, http://hadoop.apache.org/hdfs/.

[6] R. Buyya, M. Murshed, GridSim: a toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing, Concurrency and Computation: Practice and Experience 14 (2002) 1175–1220. doi:10.1002/cpe.710.

[7] M. Weiser, B. Welch, A. Demers, S. Shenker, Scheduling for reduced CPU energy, in: OSDI '94: Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation, USENIX Association, Berkeley, CA, USA, 1994, p. 2.

[8] A. Rangasamy, R. Nagpal, Y. Srikant, Compiler-directed frequency and voltage scaling for a multiple clock domain microarchitecture, in: CF '08: Proceedings of the 5th Conference on Computing Frontiers, ACM, New York, NY, USA, 2008, pp. 209–218. doi:10.1145/1366230.1366267.

[9] A. R. Lebeck, X. Fan, H. Zeng, C. Ellis, Power aware page allocation, in: ASPLOS-IX: Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, New York, NY, USA, 2000, pp. 105–116. doi:10.1145/378993.379007.

[10] D. P. Helmbold, D. D. Long, T. L. Sconyers, B. Sherrod, Adaptive disk spin-down for mobile computers, Mobile Networks and Applications 5 (2000) 285–297.

[11] M. Elnozahy, M. Kistler, R. Rajamony, Energy conservation policies for web servers, in: USITS '03: Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems, USENIX Association, Berkeley, CA, USA, 2003.

[12] E. V. Carrera, E. Pinheiro, R. Bianchini, Conserving disk energy in network servers, in: ICS '03: Proceedings of the 17th Annual International Conference on Supercomputing, ACM, New York, NY, USA, 2003, pp. 86–97. doi:10.1145/782814.782829.

[13] S. Gurumurthi, A. Sivasubramaniam, M. Kandemir, H. Franke, DRPM: dynamic speed control for power management in server class disks, in: Proceedings of the International Symposium on Computer Architecture, 2003, p. 169. doi:10.1109/ISCA.2003.1206998.

[14] J. Dhok, N. Maheshwari, V. Varma, Learning based opportunistic admission control algorithm for MapReduce as a service, in: ISEC '10: Proceedings of the 3rd India Software Engineering Conference, ACM, 2010, pp. 153–160. doi:10.1145/1730874.1730903.

[15] Amazon Elastic MapReduce, http://aws.amazon.com/elasticmapreduce/.

[16] L. Barroso, U. Holzle, The case for energy-proportional computing, Computer 40 (12) (2007) 33–37. doi:10.1109/MC.2007.443.

[17] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, I. Brandic, Cloud computing and emerging IT platforms: vision, hype, and reality for delivering computing as the 5th utility, Future Generation Computer Systems 25 (6) (2009) 599–616. doi:10.1016/j.future.2008.12.001.

[18] S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, SIGOPS Oper. Syst. Rev. 37 (5) (2003) 29–43. doi:10.1145/1165389.945450.

[19] N. Vasic, M. Barisits, V. Salzgeber, D. Kostic, Making cluster applications energy-aware, in: ACDC '09: Proceedings of the 1st Workshop on Automated Control for Datacenters and Clouds, ACM, New York, NY, USA, 2009, pp. 37–42. doi:10.1145/1555271.1555281.

[20] [HDFS-385] Design a pluggable interface to place replicas of blocks in HDFS, https://issues.apache.org/jira/browse/HDFS-385.

[21] R. Buyya, A. Beloglazov, J. H. Abawajy, Energy-efficient management of data center resources for cloud computing: a vision, architectural elements, and open challenges, CoRR abs/1006.0308.

[22] X. Fan, W.-D. Weber, L. A. Barroso, Power provisioning for a warehouse-sized computer, in: ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, ACM, New York, NY, USA, 2007, pp. 13–23. doi:10.1145/1250662.1250665.

[23] J. Leverich, C. Kozyrakis, On the energy (in)efficiency of Hadoop clusters, SIGOPS Oper. Syst. Rev. 44 (1) (2010) 61–65. doi:10.1145/1740390.1740405.

[24] Y. Chen, T. X. Wang, Energy efficiency of MapReduce (2008).

[25] E. Pinheiro, R. Bianchini, E. V. Carrera, T. Heath, Load balancing and unbalancing for power and performance in cluster-based systems (2001).

[26] E. Pinheiro, R. Bianchini, E. V. Carrera, T. Heath, Dynamic cluster reconfiguration for power and performance, in: Power and Energy Management for Server Systems, IEEE Computer Society, 2004.

[27] M. S. Perez, A. Sanchez, J. M. Pena, V. Robles, A new formalism for dynamic reconfiguration of data servers in a cluster, Journal of Parallel and Distributed Computing 65 (10) (2005) 1134–1145. doi:10.1016/j.jpdc.2005.04.018.

[28] T. V. T. Duy, Y. Sato, Y. Inoguchi, Performance evaluation of a green scheduling algorithm for energy savings in cloud computing, in: IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010, pp. 1–8. doi:10.1109/IPDPSW.2010.5470908.

[29] K. H. Kim, R. Buyya, J. Kim, Power aware scheduling of bag-of-tasks applications with deadline constraints on DVS-enabled clusters, in: CCGRID 2007: Seventh IEEE International Symposium on Cluster Computing and the Grid, 2007, pp. 541–548. doi:10.1109/CCGRID.2007.85.

[30] Y. C. Lee, A. Y. Zomaya, Minimizing energy consumption for precedence-constrained applications using dynamic voltage scaling, in: CCGRID '09: Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, IEEE Computer Society, Washington, DC, USA, 2009, pp. 92–99. doi:10.1109/CCGRID.2009.16.

[31] C. Hsu, W. Feng, A power-aware run-time system for high-performance computing, in: SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, IEEE Computer Society, Washington, DC, USA, 2005, p. 1. doi:10.1109/SC.2005.3.

[32] A. Berl, E. Gelenbe, M. Di Girolamo, G. Giuliani, H. De Meer, M. Q. Dang, K. Pentikousis, Energy-efficient cloud computing, The Computer Journal 53 (7) (2010) 1045–1051. doi:10.1093/comjnl/bxp080.

[33] L. Liu, H. Wang, X. Liu, X. Jin, W. B. He, Q. B. Wang, Y. Chen, GreenCloud: a new architecture for green data center, in: ICAC-INDST '09: Proceedings of the 6th International Conference on Autonomic Computing and Communications, Industry Session, ACM, New York, NY, USA, 2009, pp. 29–38. doi:10.1145/1555312.1555319.

[34] L. Hu, H. Jin, X. Liao, X. Xiong, H. Liu, Magnet: a novel scheduling policy for power reduction in cluster with virtual machines, in: IEEE International Conference on Cluster Computing, 2008, pp. 13–22. doi:10.1109/CLUSTR.2008.4663751.

[35] M. Milenkovic, E. Castro-Leon, J. Blakley, Power-aware management in cloud data centers, in: M. Jaatun, G. Zhao, C. Rong (Eds.), Cloud Computing, Vol. 5931 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2009, pp. 668–673.

[36] K. H. Kim, A. Beloglazov, R. Buyya, Power-aware provisioning of cloud resources for real-time services, in: MGC '09: Proceedings of the 7th International Workshop on Middleware for Grids, Clouds and e-Science, ACM, New York, NY, USA, 2009, pp. 1–6. doi:10.1145/1657120.1657121.

[37] G. von Laszewski, L. Wang, A. Younge, X. He, Power-aware scheduling of virtual machines in DVFS-enabled clusters, in: CLUSTER '09: IEEE International Conference on Cluster Computing and Workshops, 2009, pp. 1–10. doi:10.1109/CLUSTR.2009.5289182.

[38] Advanced Configuration and Power Interface (ACPI), http://www.acpi.info/.

Nitesh Maheshwari received his B.Tech. degree in Information and Communication Technology from DA-IICT, Gandhinagar, India, in 2006. He is currently pursuing an M.S. by Research in Computer Science and Engineering at the International Institute of Information Technology, Hyderabad, India. He works as part of the Search and Information Extraction Lab at IIIT Hyderabad. His research interests are in the area of cloud computing, with emphasis on data placement algorithms and scheduling policies for energy efficient computing.

Radheshyam Nanduri received his B.Tech. degree in Information Technology from Vellore Institute of Technology, Vellore, India, in 2009. He is currently an M.S. by Research student in the Department of Computer Science and Engineering at the International Institute of Information Technology, Hyderabad, India. Since December 2009, he has been working as a research student at the Search and Information Extraction Lab at IIIT Hyderabad. His primary research interests are in cloud computing, and provisioning and scheduling algorithms for heterogeneous distributed systems.



Vasudeva Varma has been a faculty member at the International Institute of Information Technology, Hyderabad since 2002. He heads the Search and Information Extraction Lab and the Software Engineering Research Lab at IIIT Hyderabad. His research interests include search (information retrieval), information extraction, information access, knowledge management, cloud computing and software engineering. He has published a book on Software Architecture (Pearson Education) and over seventy technical papers in journals and conferences. In 2004, he obtained a young scientist award and grant from the Department of Science and Technology, Government of India, for his proposal on personalized search engines. In 2007, he was given the Research Faculty Award by AOL Labs.
