

Improving reliability in resource management through adaptive reinforcement learning for distributed systems

Masnida Hussin a,∗, Nor Asilah Wati Abdul Hamid a, Khairul Azhar Kasmiran b

a Department of Communication Technology & Networks, Faculty of Computer Science & IT, 43400 University Putra Malaysia (UPM), Selangor, Malaysia
b Department of Computer Science, Faculty of Computer Science & IT, 43400 University Putra Malaysia (UPM), Selangor, Malaysia

∗ Corresponding author. E-mail addresses: [email protected] (M. Hussin), [email protected] (N. Asilah Wati Abdul Hamid), [email protected] (K.A. Kasmiran).

http://dx.doi.org/10.1016/j.jpdc.2014.10.001

Highlights

• Able to drive the evolution of automation networks towards higher reliability.
• Able to handle tasks at different states in processing.
• Able to adapt to system changes while leading to better learning experiences.
• Able to sustain reliable performance.
• Able to deal with dynamic global network infrastructure.

Article info

Article history:
Received 10 January 2014
Received in revised form 13 September 2014
Accepted 2 October 2014
Available online xxxx

Keywords:
Distributed systems
Resource management
Adaptive reinforcement learning
System reliability
Computational complexity

Abstract

Demands on the capacity of distributed systems (e.g., Grid and Cloud) play a crucial role in today's information era due to the growing scale of these systems. While distributed systems provide a vast amount of computing power, their reliability is often hard to guarantee. This paper presents effective resource management using adaptive reinforcement learning (RL) that focuses on improving successful execution with low computational complexity. The approach uses an emerging RL methodology in conjunction with a neural network to help a scheduler effectively observe and adapt to dynamic changes in execution environments. The observation of the environment at various learning stages, normalized by resource-aware availability and feedback-based scheduling, significantly brings the environments closer to the optimal solutions. Our approach also addresses the high computational complexity of RL systems through on-demand information sharing. Results from our extensive simulations demonstrate the effectiveness of adaptive RL for improving system reliability.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Large-scale distributed systems, including Grids and Clouds, are driven by an expansion of the Internet that is able to provide massive information and dynamic computing services. Heterogeneity and dynamicity of resources and applications in these systems are rather common and must be effectively dealt with [11]. Resource allocation that considers these heterogeneous and dynamic characteristics has become increasingly important with the advent and prevalence of distributed systems, e.g., Amazon Elastic Compute Cloud (EC2) as a public cloud. However, the diverse nature of network devices/components and communication technologies greatly increases the complexity of resource management. It poses a number of new technical challenges, including performance variability and accountability. The demand for system reliability in massive and complex networked environments imposes much burden on resource allocation solutions. The changing nature of network services and the large number of devices connected in distributed systems demand new resource allocation approaches for efficient use of heterogeneous resources.

Robust and reliable network services depend heavily on the quality of resource allocation (scheduling) decisions for improving overall network performance [6,12]. The decisions not only take charge of matching and scheduling the processing capacities of resources and user requirements, but they also need to deal with a wide variety of resource behaviors and performance fluctuations. Services in large-scale distributed environments (e.g., the Internet of Things [15]) are required to stay up and continue operating even in the presence of malicious and unpredictable circumstances; that is, their processing capacity must not be significantly affected by the user requirements [11,8]. For instance, scheduling decisions are required to be reliable in the sense of tolerating failures automatically, or guaranteeing properties such as high availability, hence retaining good performance even under malfunction. However, it raises complexity to adjust resource allocation decisions in a highly dynamic and complex network environment, where quick solutions must be made without significant performance degradation.

We address adaptive and scalable resource allocation in the face of uncertainty. These abilities are important for trustworthy scheduling decisions. Mainly, distributed systems require consistent and iterative monitoring for evaluating resources' behaviors and processing requirements. Therefore, an autonomous, scalable and highly dynamic learning approach is needed. Recently, a promising approach based on reinforcement learning (RL) has been studied for dynamic task scheduling and resource allocation [18,16]. RL offers an effective decision-making policy while requiring little system-specific knowledge, which can be a very practical scheme for resource allocation and scheduling. It is also a promising approach for automatically developing effective and efficient policies for real-time self-adaptive management. Since an RL system is capable of repairing and improving its performance when the quality of its decisions is poor [16], dependable information for future decision-making is realized. However, it raises computational complexity when the problem space is too large to explore completely; hence it is hard to predict new situations and to find an optimal solution. The reinforcement-agent is required to effectively explore the environment while modifying its actions in response to system changes such as performance fluctuations, high computing complexity and massive distributed applications. An online and fast-adaptive learning approach that fabricates sufficient knowledge for reliable experiences is necessary, helping the agent to sense, manipulate and address information at once.

In this paper, a novel Dynamic-Adaptive resource allocation based on REinforcement Learning (DAREL) is proposed and evaluated. There are many different ways (e.g., [18,16,14,17]) to incorporate reinforcement learning in resource allocation. In this work, we combined an Artificial Neural Network (ANN) [18] with stochastic automata for handling environments that are not deterministic. Our model-based reinforcement learning strategy aims to simultaneously learn the system environment and improve the learning policy. This is realized by performing actions (or allocation decisions) and observing feedback while memorizing the actions that bring the system closer to the optimal decisions. The reinforcement-agents autonomously explore and respond to their environment to allocate applications to resources based on a reward system. In order to adapt to system changes and the scale of learning experiences, DAREL also supports an on-demand information sharing strategy that significantly accelerates the convergence speed of learning. Our experimental results demonstrate that DAREL plays a significant role in enhancing system reliability and also reduces the computational complexity of RL.

The remainder of this paper is organized as follows. A review of related work is presented in Section 2. In Section 3, we describe the models used in the paper. Section 4 presents the structure of reinforcement learning (RL) for dynamic and massive network environments. In Section 5, we detail the components of DAREL for reliable services. The experimental setting, comparison of different resource scheduling policies and results are presented in Section 6. Finally, conclusions are drawn in Section 7.

2. Related work

Effective resource allocation mainly considers some substantial issues such as effective matching and scheduling strategies, the usage status of resources, communication between schedulers and approximate task execution times. Developing a self-adaptive scheduling policy to deal with the variability of application services is a key issue in optimizing utility distributed systems. One of the strategies proposed for resource allocation problems is the reinforcement learning approach [18,16,10,2]. In [18], the authors introduced an RL-based dynamic scheduling policy for parallel processor systems. The work highlights scalability and adaptability as the main objectives of dynamic scheduling in a decentralized learning strategy. It also shows the effectiveness of the RL-based scheduler model in improving the quality of scheduling decisions through an ongoing learning process.

The authors in [16] proposed a method called ordinal sharing learning (OSL) to cope with the exploration and exploitation abilities of the agents. To increase learning exploration, the agents interact with the environment using an indirect method where they use the applications' information to estimate the state. Meanwhile, the utility table that is updated by each agent is ordinal and is iteratively shared in the neighborhood to view and estimate the efficiency of resources. As such, efficient coordination among agents and optimum utilization in the proposed learning strategy are realized. However, they do not formulate the decisions for highly dynamic environments where uncertainties and variability of distributed resources and applications are constantly present. Our work adopts the distributed reinforcement learning structure and learning-feedback loop in [18,16] for better resource scheduling, focusing on highly dynamic and probabilistic environments.

The learning-based scheduling policy mainly aims to build up knowledge for improving the ability of the scheduler to solve uncertain decision-making problems. The control learning strategy proposed in [2] develops a policy to identify the most appropriate action from different states. Their reinforcement-agents learn by trial and error to identify the optimal action for a given state. Our work focuses more on the reinforcement learning strategy for resource allocation compared to the agent capability that is highlighted in [2]. The adaptive reinforcement learning in [1] improves an online learning policy for Web systems auto-configuration. However, they do not formulate the RL framework especially for the initial learning phase of training data. For optimal allocation decisions, the study in [9] proposed negotiation-based resource allocation using the Q-learning technique (NQL). The system compositions (users and providers) make successive offers that depend on a very quick negotiation process. It identifies the possible negotiation state and selects the action with the highest learning rate. The agent of each resource provider behaves greedily most of the time.

QIA in [10] sustains an incentive in every resource in order to predict how well a provider can adjust its trust factor optimally. The trust factor, computed through the Markov Decision Process (MDP) method, aims to recognize dependable computing nodes among available resources. During the decision-making process, the agent learns the observed trust factor, and the resource with the highest Q-value is then chosen. The learning strategies based on ordinal information sharing [16], negotiation [9] and incentive [10] for resource allocation serve as points of comparison in our experimental evaluation. Their strategies address average distributed system utility in the learning policy, which applies to our work as well.

3. Model

In this section, we describe the system and application models employed in this work.

3.1. System model

Fig. 1. System model.

The target network system used in this work consists of multiple resource sites that are loosely connected by a communication network, and each has a resource manager (scheduler) and a set N of r compute nodes (Fig. 1). It is important to note that the system model focuses on resource allocation in parallel and distributed processing that is related to two notable distributed systems (i.e., Grid and Cloud). The bandwidth between any two individual resource sites varies, corresponding to a realistic network. The inter-site communications are assumed to contend with some delays. It is also assumed that a message can be transmitted from one site to another, in the sense that there is a communication route between them.

Each scheduler (reinforcement-agent) in each site is given the authority to keep track of its resources' details, and different schedulers deal with their tasks in parallel. The schedulers may communicate with each other to share and exchange resource information. Each of them manages a learning module that includes the scheduling policy and learning experiences, where the module varies in size. Henceforth, scheduler and agent will be used interchangeably.

Nodes inside the same site are fully interconnected through a high-bandwidth network, and their communications work without substantial contention, which is possible in many systems. Each node r_n, where n = {1, 2, . . . , M}, consists of a varying number of multi-core processors with a shared cache module; this is a clear indication of varying levels of processing power, which is given as:

PC_n = L / Σ_{i=1}^{L} exe_i,   (1)

where L is the total number of tasks completed within some observation period and exe_i is the execution time of completed task i. It is assumed that the capability of resources, such as processing capacity and network links, fluctuates. Therefore, the accurate (actual) completion time of a task on a particular computing node is difficult, if not impossible, to determine a priori. Because resources are accountable for executing both local and widespread tasks, we assume that the number of tasks to be scheduled at a given time is greater than the number of available resources. Hereafter, the terms node and resource are used interchangeably.
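As a minimal illustration of Eq. (1), the following Python sketch computes a node's processing capacity from the execution times of tasks completed in one observation period; the function name and the sample values are illustrative assumptions, not part of the paper.

```python
def processing_capacity(execution_times):
    """Eq. (1): PC_n = L / sum(exe_i) -- tasks completed per unit of
    accumulated execution time in one observation period."""
    L = len(execution_times)
    if L == 0:
        return 0.0  # no completed tasks observed yet
    return L / sum(execution_times)

# Example: a node completed 4 tasks taking 2.0, 3.5, 1.5 and 3.0 time units.
print(processing_capacity([2.0, 3.5, 1.5, 3.0]))  # 0.4 tasks per time unit
```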

3.2. Application model

Fig. 2. Reinforcement learning scheme.

The system users (local and global end-users) produce and submit tasks to the scheduler, where all tasks first wait at the scheduler for the scheduling process (i.e., matching and mapping). The tasks considered in this study are computation-intensive and independent of each other (i.e., no inter-task communication or dependencies). Each task is a single arrival unit and is associated with the set of parameters shown below.

T_i = {s_i, d_i},   (2)

where s_i is the computational size of task i and d_i is the latest time (deadline) by which task T_i is supposed to be completed. Tasks arrive according to a Poisson process. We assume that a task's profile is available and can be provided by the user using job profiling, analytical models or historical information.
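A small sketch of the task model in Eq. (2) and the Poisson arrival assumption, in Python; the class and function names, the seed and the mean inter-arrival time are assumptions for illustration only.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    """Eq. (2): T_i = {s_i, d_i}."""
    size: float      # computational size s_i (e.g., in MI)
    deadline: float  # latest completion time d_i

def poisson_arrivals(n_tasks, mean_iat=5.0, seed=1):
    """Arrival instants of a Poisson process: exponential inter-arrival times."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(n_tasks):
        t += rng.expovariate(1.0 / mean_iat)
        times.append(t)
    return times

print(poisson_arrivals(3))  # three sample arrival instants
```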

4. Reinforcement learning structure for dynamic and massive network environments

This section begins by briefly describing our reinforcement learning framework, and then gives details of the learning policy for dynamic and massive network environments.

4.1. Overview of reinforcement learning (RL)

Reinforcement learning systems process a large number of inputs and generate several actions for realizing defined goals. Particularly, the reinforcement learning system (Fig. 2) interacts between observed states of its environment (input) and chosen actions (output), and it obtains reinforcement feedback (reward or penalty) [7,13].

The reinforcement system heavily relies on the ability of agents to make accurate predictions of future state-feedback probabilities based on the current state and action. In order to choose the most appropriate action for a given state, the system must aim for a proper balance between exploitation and exploration [13]. For that purpose, the reinforcement-agent can choose an immediate action that leads to a high reward (exploitation) or explicitly explore its environment to perform an action that maximizes the average reward it receives in the long term (exploration).

Given that the optimal long-term reward is expected to be achieved when all states can be visited frequently [18], the structure of our reinforcement system is designed based on the Artificial Neural Network (ANN) presented in [18]. Due to the dynamic learning environment, the long-term rewards for scheduling decisions are usually unknown. However, they can still be estimated through various iterations and optimization cycles within the ANN layers. The neural network represents the resource interconnection and relays signals of processing performance across the network. Each connection between resources has a weight parameter that is used to reach the most appropriate decision as to which resource a task should be mapped. This weight value is produced based on the resource availability (available to perform computation in a timely manner, or by the task deadline). During the learning process, the weight value of a resource is regularly updated and propagated through the network. More specifically, referring to Fig. 3, the resource state from the Input layers is propagated to the Hidden layers, which act as mediators to update and formulate the weight value during the learning process. The Output layers correspond to scheduling decisions that identify a value for the most appropriate resource.

Fig. 3. A typical neural network.

In our work, the actor-critic network component of the ANN-based RL in [18] is replaced by the system feedback (i.e., either reward or penalty). For training the ANN, we rely on stochastic automata that are facilitated with a systematic trial-and-error scheme. Specifically, the trial-and-error scheme assists the reinforcement-agent in achieving better exploration with high predictive accuracy of actions. This is realized when the RL system frequently updates learning probabilities (or state transitions) while exploring new actions to increase the discounted cumulative reinforcement [13], or global utility value. Optimistically, the (near) optimal utility value can be achieved by continuously making such actions or allocation decisions.

In a complex network environment, taking the same action in the same state may result in different reinforcement feedback [7]. Our RL system encompasses four different steps to dynamically learn the significance of the states. The action (resource-task mapping) occurs according to the resource state (availability) and learning experiences (feedback). The action is then evaluated based on the immediate reinforcement feedback to identify how the allocation decision helps to provide better performance (analyze). The output from the evaluation process is kept in the learning module. The module stores the scheduling policy and learning experiences, including the state-action-feedback triplets and the global utility value; these are viewed as a table and updated frequently (update). The global utility value is given as the total number of successful tasks divided by the total number of decisions (actions) made in the system. Finally, the agent memorizes its learning experience while consistently monitoring the utility value (memorize). Since there is an expected feedback for each action in a multitude of states, the size of the table for maintaining the triplets grows exponentially. In response to this, only triplets that increase the global utility value are kept in the learning module. As such, the module is always kept at a feasible length, which expedites the learning process. When the agent receives other tasks, it performs the same operations (analyze, update, memorize). The reinforcement operations attempt to heal the agent's decisions and learn from the experiences simultaneously.
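The cycle above (act, analyze, update, memorize) can be summarized by a small learning-module sketch in Python. It is an assumption of how the triplet table, the global utility value (successful tasks divided by total decisions) and the pruning rule could fit together; the class and attribute names are not from the paper.

```python
class LearningModule:
    """Keeps (state, action, feedback) triplets, tracks the global utility
    value and retains only experiences that raised that value."""

    def __init__(self):
        self.triplets = []   # kept (state, action, feedback) entries
        self.decisions = 0   # total actions taken
        self.successes = 0   # actions whose task met its deadline

    @property
    def global_utility(self):
        return self.successes / self.decisions if self.decisions else 0.0

    def record(self, state, action, feedback, success):
        before = self.global_utility
        self.decisions += 1            # update step
        self.successes += int(success)
        if self.global_utility > before:
            # memorize step: keep only triplets that increase global utility
            self.triplets.append((state, action, feedback))
```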

4.2. Reinforcement-based scheduling

Fig. 4. Reinforcement-based scheduling.

The reinforcement-based scheduling (Fig. 4) is formulated by the interaction between the reinforcement-agent and resources (parallel compute nodes). At a particular point in time t, the system observes the resource's state, S(t) (i.e., idle, busy, overloaded), and this information is then used to determine the right action AS(t) for the state. From the given state, the learning policy then consults the weight value (i.e., the resource availability function) while providing evaluations of its experiences. The evaluation of the experiences is necessary for the agent to gather useful information, particularly in the presence of minimal prior knowledge about resources' states. Based on such information, the agent then performs the action. The action refers to a decision to schedule and assign tasks to the most favorable resource regardless of sites' autonomy. After the action is implemented, the reinforcement system receives the feedback FA(t) (either reward or penalty) to indicate the consequences of its action. The reward is defined in terms of the task to be achieved (successful execution), while a penalty is given for any action that impedes the agent from successfully completing the tasks (missed deadlines). If the reward is given, the action is marked as a trusted scheduling decision; otherwise, the scheduling policy needs to be revised. The scheduling policy is updated based on the feedback, given that each action corresponds to one state. This encourages the agent to consult its scheduling policy in determining the most appropriate action for the current state. The agent then keeps this state-action-feedback triplet, S(t)-AS(t)-FA(t), in its learning module for better future decision-making, while consistently updating the learning policy and global utility value.
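One pass of this state-action-feedback interaction could look like the Python sketch below. It reuses the LearningModule sketched in Section 4.1 and takes the availability and feedback functions (Eqs. (4) and (5)) as parameters; the resource interface (an execute method returning a completion time) is purely hypothetical.

```python
def schedule_step(task, resources, module, availability, feedback_of):
    """Observe resource states, act on the most favorable resource,
    collect the feedback and store the resulting triplet."""
    state = {r: availability(r) for r in resources}  # observed state S(t)
    chosen = max(state, key=state.get)               # action A_S(t)
    completion = chosen.execute(task)                # assign the task
    fb = feedback_of(completion, task.deadline)      # feedback F_A(t)
    module.record(state[chosen], chosen, fb,
                  success=completion <= task.deadline)
    return chosen, fb
```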

4.3. Different stages of learning

The reinforcement-agent is responsible for taking charge of resource scheduling for several users. Hence, it needs to observe a huge number of {state-action-feedback} triplets and state transitions to converge to the optimal action. However, such learning discovery might be intractably complex and prohibitively costly [13]. This hinders the effectiveness of RL in dynamic computation. In addition, the agent might begin by making stochastic actions from scratch [13], especially in its early stage of learning. This makes the actions at that stage poor, due to a lack of knowledge about which action can steer the environment in the desired direction. To address this issue, we emphasize RL for providing an action at different stages of learning. We divided the learning phase into two stages based on the learning cycles made by the agent. The first stage is denoted as the early-stage, where fewer than 20% of the learning cycles have been completed; otherwise it is the stable-stage. The stages are identified during the training phase that is initially conducted to identify the benefit of experience for improving the stability of learning. From this preliminary investigation, we identified that the global utility of the system during the stable-stage (more than 20% of cycles) gradually improves.

We also introduce on-demand information sharing through centralized and decentralized learning strategies for better learning experiences. It benefits our reinforcement-agents in performing dependable actions using both local information (centralized learning) and other agents' experience (decentralized learning). In centralized learning, the agent improves its experiences depending only on the local {state-action-feedback} triplets, without explicitly communicating with other agents. Decentralized learning, on the other hand, requires interaction among active agents and has been employed as an alternative solution for enhancing the quality of actions [16]. The main goal of decentralized learning in our approach is merely to derive an instant decision policy that maximizes the global utility value.

Since all agents learn and make decisions independently and simultaneously, the interaction and information sharing among them raise coordination problems and communication overhead [7,3]. To solve this problem in multi-agent interaction, each agent is designed to incorporate an on-demand information sharing process. In this process, an agent broadcasts a signal requesting other agents' learning experiences. Other agents (schedulers) are encouraged to respond to the signal. Due to delays in data transfer time and communication overhead, the agent only learns the scheduling policy {state-action-feedback} from the agent that has the highest utility value. It is assumed that an agent with a higher utility value than those of other agents can be better trusted for successful task execution. Because the agents work in a heterogeneous environment, not all shared information is beneficial or explicitly increases the agent's utility value. Therefore, on-demand information sharing only occurs when necessary; it is triggered when the global utility value has continually decreased within some observation period. This information sharing strategy helps to increase the observability of agents, particularly during the early stage of learning (early-stage), and improves their experiences specifically for working with uncertain or incomplete observations.
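A compact sketch of the two learning stages and the on-demand sharing trigger, under the stated 20% cycle boundary; the peer interface (utility, policy, adopt_policy) and the length of the observation window are assumptions.

```python
def learning_stage(cycles_done, total_cycles):
    """Early-stage for the first 20% of learning cycles, stable-stage afterwards."""
    return "early" if cycles_done < 0.2 * total_cycles else "stable"

def share_on_demand(agent, peers, utility_history, window=3):
    """Trigger sharing only when the global utility value has kept decreasing
    over the last `window` observations; then adopt the policy of the peer
    with the highest utility value."""
    if len(utility_history) <= window:
        return
    recent = utility_history[-(window + 1):]
    if all(a > b for a, b in zip(recent, recent[1:])):  # continual decrease
        best = max(peers, key=lambda p: p.utility)
        if best.utility > agent.utility:
            agent.adopt_policy(best.policy)
```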

5. Adaptive RL-based resource allocation

Due to the heterogeneity of resources, recognizing and efficiently exploiting optimal actions is particularly challenging in an RL system. Thus, in this work, adaptive control is incorporated into reinforcement learning to reduce computational expense while improving the quality of the action (allocation decision). In this section, we describe the two main components of DAREL that explicitly encourage the reinforcement-agent to determine the most appropriate action for the current state. A pictorial description of our framework is presented in Fig. 5.

5.1. Load-aware resource discovery

Generally, resource availability in large-scale distributed systems varies depending on the system workload. It is hard to obtain accurate information about resources because their capacities change more rapidly than the scheduler can adapt. For instance, network traffic grows when numerous tasks arrive within a short time interval. It is also hard to guarantee task deadlines in such a situation.

Fig. 5. Adaptive RL-based resource allocation.

To deal with the above difficulties, we incorporate load-aware resource characterization in DAREL. Given that resource failures in distributed systems may not occur very frequently, or can be recovered with minimal impact on system performance [6,8,3,4], this work rather focuses on available resources with their fluctuating performance. We adopted the indirect method proposed in [16] that uses a resource's score for selecting the resource for a given task. Since there are various task deadlines, the processing requirements of the tasks queued at a node r_n must be accurately identified and determined. They are used to measure the load on resource n, given as:

load_n = ( Σ_{i=1}^{T} s_i / d_i ) / ( Σ_{i=1}^{T} d_i ),   (3)

where T is the total number of tasks in the queue. This measurement includes the impact of queue size (the number of tasks in the waiting queue), because a larger queue size lengthens the response time for a task and reduces performance. Hence, we generalize the resource availability, or score, using the probability of successful execution of an application, given as

A_n = PC_n / load_n,   (4)

where PC_n is the processing capacity of node n. A resource is identified as Capable (i.e., A_n ≥ 1) if it can process the applications within their specified deadlines; otherwise it is referred to as Incapable. Each scheduler consistently collects and maintains its node information by interacting with the resource. Such information is updated regularly, each time the node's queue is half occupied, so that the agent can exactly indicate the current state of the resource. The resource characterization helps to simplify the modeling of optimal decision-making, which only considers the partial state of available resources learned online. The resource characterization technique significantly improves the discovery subscription of services while reducing resource allocation alternatives [4].
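A short Python sketch of Eqs. (3) and (4) as reconstructed above, using the Task fields from Section 3.2; the function names and the zero-load guard are assumptions added for the sketch.

```python
def node_load(queued_tasks):
    """Eq. (3): sum of s_i/d_i over the queue, normalized by the summed deadlines."""
    if not queued_tasks:
        return 0.0
    return (sum(t.size / t.deadline for t in queued_tasks)
            / sum(t.deadline for t in queued_tasks))

def availability_score(pc_n, queued_tasks):
    """Eq. (4): A_n = PC_n / load_n; a node is 'Capable' when A_n >= 1."""
    load = node_load(queued_tasks)
    a_n = float("inf") if load == 0 else pc_n / load
    return a_n, "Capable" if a_n >= 1 else "Incapable"
```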

5.2. Feedback-based scheduling

In our work, the reinforcement action is produced immediately, without explicitly searching or forecasting future states, in order to reduce task waiting time. The resource allocation decision (action) mainly aims to choose the right resource (current state) for a task and hence significantly increase successful execution (reward) in the system. The feedback (reward or penalty) is then generated for each state–action pair. Note that agents allocate a task to a resource according to the availability score (Eq. (4)). However, this strategy alone may miss the true optimal action because the most trusted resource is hard to discover. In response to this, we propose feedback-based scheduling.

The feedback-based RL scheme in [18] reports that an agent has to actively learn from its reinforcement feedback to determine and provide favorable actions. Because the resources have different processing capacities, the feedback in the learning module varies; choosing the same action might not result in the same feedback value. More formally, the feedback value of an action, F_A, denotes the efficiency of the agent in selecting a particular resource that meets the task deadline and is given by

F_A = 1 − 1/χ,   where χ = |d_i − ET_i| if ET_i < d_i, and χ = ET_i / d_i otherwise.   (5)

A high feedback value, such as 0.5 or more, is regarded as a reward; otherwise it is a penalty. In our large environment state space, setting the reward threshold at 0.5 or more is acceptable to satisfy a certain level of the agent's learning experience. Agents continuously aim to achieve as many rewards as possible. In some cases, due to uncertainty in resource availability, low feedback values (i.e., penalties) may occur. The agent then changes its scheduling policy by expanding its learning capability through the decentralized learning method (described in Section 4). This repeats until an improvement in the global utility value is possible. It helps to increase successful task execution with considerably improved allocation decisions.
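A minimal sketch of the feedback rule in Eq. (5) with the 0.5 reward threshold; the guard against a zero denominator is our addition for the sketch and not part of the paper.

```python
def feedback_value(et, deadline):
    """Eq. (5): F_A = 1 - 1/chi, with chi = |d_i - ET_i| when the task met
    its deadline and chi = ET_i / d_i otherwise."""
    chi = abs(deadline - et) if et < deadline else et / deadline
    if chi == 0:
        return float("-inf")  # guard added for the sketch only
    return 1.0 - 1.0 / chi

print(feedback_value(et=3.0, deadline=8.0))   # chi = 5    -> 0.8 (reward)
print(feedback_value(et=10.0, deadline=8.0))  # chi = 1.25 -> 0.2 (penalty)
```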

6. Performance evaluation and discussions

In this section, we detail the evaluation study conducted on the performance of our resource allocation approach. The experimental settings and results are described and discussed.

6.1. Experiment configuration

In our simulation system, there are 5–20 resource sites, in each of which its own agent resides. Due to the large-scale constitution of distributed environments, the parameters of the simulation model are assumed to be unbounded; instances can be created as needed. Each resource site contains a varying number of compute nodes, ranging from 5 to 8. For each node, there are 4–12 processors, with the number of cores dynamically chosen in the range of 2 to 8 with an interval of 2. The relative processing power (speed) of a processor is selected within the range of 1 to 7.5. The number of tasks in a particular simulation is set between 1000 and 6000. Task inter-arrival times (iat) follow a Poisson distribution with a mean of 5 time units. For a given task t_i, the computational size s_i is randomly generated from a uniform distribution ranging from 600 to 7200 MI. The task deadline d_i is set to between 20% and 70% of the computational size. Such deadlines bring diversity in processing requirements to the system.
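The platform side of this configuration can be generated along the following lines; the dictionary layout and the seed are illustrative assumptions, not the simulator used in the paper.

```python
import random

def build_platform(seed=42):
    """5-20 sites, 5-8 nodes per site, 4-12 processors per node,
    2-8 cores in steps of 2, relative speed in [1, 7.5]."""
    rng = random.Random(seed)
    sites = []
    for _ in range(rng.randint(5, 20)):
        nodes = [{"processors": rng.randint(4, 12),
                  "cores": rng.choice([2, 4, 6, 8]),
                  "speed": rng.uniform(1.0, 7.5)}
                 for _ in range(rng.randint(5, 8))]
        sites.append(nodes)
    return sites

platform = build_platform()
print(len(platform), "sites;", sum(len(s) for s in platform), "nodes")
```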

6.2. Performance metrics

The performance is studied using three metrics and they aredefined as follows.

• Successful task execution: This metric is used to measure the degree of reliable execution and to identify how well the resource allocation approach deals with diversity in processing requirements. It denotes the fraction of tasks that met their deadlines, given as:

s_rate = (1/L) Σ_{i=1}^{L} β_i,   where β_i = 1 if ACT_i ≤ d_i and 0 otherwise.   (6)

• Utilization rate: We define utilization rate as RU = busy_n / (busy_n + idle_n), where busy_n is the total time when node r_n is busy servicing tasks and idle_n is the total idle time of r_n, respectively.

• Global utility value: This metric is used to analyze the performance of the scheduler/agent with respect to its learning process when dealing with performance fluctuation. It is given in Eq. (7):

aveLearn = (1/S) Σ_a F_A(A_S)_a,   (7)

where S is the total number of agents in the network, F_A is the feedback, A_S is the action and a is the indicator for each action, respectively. A short computational sketch of these metrics follows this list.
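The three metrics can be computed as in the Python sketch below; Eq. (7) is implemented as literally reconstructed (feedback summed over each agent's actions, then averaged over the S agents), which is an interpretation rather than the authors' code.

```python
def success_rate(completion_times, deadlines):
    """Eq. (6): fraction of the L tasks whose actual completion time ACT_i
    met the deadline d_i."""
    L = len(completion_times)
    return sum(1 for act, d in zip(completion_times, deadlines) if act <= d) / L

def utilization(busy, idle):
    """RU = busy_n / (busy_n + idle_n)."""
    return busy / (busy + idle)

def global_utility(feedback_per_agent):
    """Eq. (7): aveLearn = (1/S) * sum of the feedback collected by each agent."""
    S = len(feedback_per_agent)
    return sum(sum(fb) for fb in feedback_per_agent) / S

print(success_rate([3, 9, 5], [5, 8, 6]))  # 2 of 3 tasks met deadlines -> 0.667
print(utilization(busy=70.0, idle=30.0))   # 0.7
```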

The parameters for the experimental study and the metrics were also used in our previous work [5], where dynamic scheduling is applied in a heterogeneous system environment. However, in the previous work we focused on different performance issues.

6.3. Experimental results

We carried out two different experiments that differ with regard to the learning methods. For the first experiment, comparisons were conducted between three previously proposed resource allocation approaches (i.e., NQL, QIA, OSL) introduced in Section 2 and DAREL. This selection was made based on their proven performance. Note that our allocation scheme is compared with those approaches in terms of scheduling policy. The second experiment studies how reliable performance is influenced by the learning capability of DAREL compared with D-REL. In D-REL, the resource-task match is constantly made using the resource availability, or score, without considering the two learning stages (early-stage, stable-stage) and the feedback-based scheduling.

6.3.1. Difference in scheduling policies

In Fig. 6, the average utility rate is plotted with respect to different numbers of tasks. It shows that as the number of tasks increases, the utility rate of the allocation approaches increases as well. This indicates that the adaptive learning experiences help much in allocating suitable resources for the tasks. The performance of DAREL, however, is more appealing as the number of tasks increases. While OSL delivers appealing results, it is more suitable for small volumes of tasks (fewer than 3000). We can conclude that on-demand information sharing between agents can assure a better trade-off between exploration and exploitation and works effectively under dynamic workload changes.

Fig. 7 shows that the discrepancy in successful execution rate among the approaches is small (approximately 20% on average) when the volume of tasks is low. A lightly loaded communication system to some extent influenced the rate. DAREL sustains better performance in successful execution regardless of the number of tasks. The primary source of this performance gain is the incorporation of the feedback scheme into our scheduling policy. In addition, better synchronization of centralized and decentralized information sharing in DAREL contributes to better learning experiences. Fig. 8 also shows that resource utilization when using DAREL exhibits a higher rate than that of the other approaches. DAREL dynamically monitors resource workload from the task queue, so the overall demand for resources can be managed effectively.


Fig. 6. Average utility rate with different learning approaches.

Fig. 7. Successful execution rate with different learning approaches.

Fig. 8. Utilization rate with different learning approaches.

We extend the analysis between DAREL and OSL as they showed compelling results in the former experiments. For this analysis, success rates are plotted with respect to the percentage of learning cycles previously observed. DAREL and OSL reach successful execution rates of more than 50% between 50% and 100% of cycles, as shown in Fig. 9. Interestingly, DAREL's rate increases linearly with the learning cycle, while OSL shows exponential growth. This can be explained by the incorporation of feedback-based scheduling into resource management, which is much preferable for making better (reliable) decisions. We also believe that the adaptive learning strategy at two different learning stages (early-stage, stable-stage) helps to develop useful experiences and hence improve performance.

Fig. 9. Successful execution rate of DAREL and OSL within the learning process.

6.3.2. Difference in learning capabilities

Fig. 10. Successful execution rate under different mapping strategies.

For comparisons with different learning capabilities, we vary the heterogeneity of resources according to the service coefficient of variation proposed and used in [5]. The service coefficient of variation is defined as the summation of differences in processing capacity on the target system divided by the processing capacity of the respective resource. For example, a resource heterogeneity rate of 0.1 (10%) indicates that the difference in processing capacity is relatively small.

Results in Fig. 10 demonstrate better successful execution when the system heterogeneity is low. The results show that the successful execution of DAREL and D-REL reaches more than 60%. This means that the heterogeneity factor does not significantly hamper either strategy from achieving reliable execution. In this case, because DAREL incorporates allocation decisions that apply at different learning stages, its average performance is higher than that of D-REL.

We next illustrate the effectiveness of the reinforcement-agents in DAREL and D-REL in gaining better global utility with varying resource capacities. The results are shown in Fig. 11. DAREL is capable of achieving a better global utility value for high system heterogeneity. This means that when there is a high degree of difference in resource capacities, the agents tend to produce better experiences, which leads to (near) optimal decisions. The better scheduling experiences make the system more reliable in dealing with different entities (resources and tasks) and varying characteristics (heterogeneous and dynamic). The careful balance between exploration (i.e., on-demand information sharing) and exploitation (i.e., feedback-based scheduling) helps in achieving better learning performance; this is another compelling strength of DAREL.

Fig. 11. Global utility value under different mapping strategies.

7. Conclusion

In this paper, we have addressed the problem of adaptive decision-making aimed at reliable services using reinforcement learning for distributed systems. We presented a novel RL-based resource management approach to drive the evolution of automation networks towards higher reliability. The incorporation of RL into resource management enables the scheduler (reinforcement-agent) to actively exploit diversity in both tasks and resources. Our approach is able to handle tasks at different states of processing and adapt to system changes. Two elements of adaptive scheduling (i.e., resource-aware availability and the feedback scheme) adopted in our work lead to better learning experiences, and this incorporation effectively deals with performance fluctuations. The inter-agent communication through on-demand information sharing also helps in reducing the computational complexity of RL. We have also confirmed that the incorporation of adaptive reinforcement learning into resource management is an effective means to sustain reliable performance, specifically for dealing with dynamic, heterogeneous distributed systems.

References

[1] X. Bu, J. Rao, C.-Z. Xu, A reinforcement learning approach to online web systems auto-configuration, in: 29th IEEE Int'l Conf. on Distributed Computing Systems, 2009, pp. 1–11.

[2] A. Galstyan, K. Czajkowski, K. Lerman, Resource allocation in the Grid using reinforcement learning, Presented at the 3rd Int'l Joint Conf. on Autonomous Agents and Multiagent Systems, AAMAS 2004, New York City, NY, USA, 2004.

[3] M. Holenderski, R.J. Bril, J.J. Lukkien, Grasp: visualizing the behavior of hierarchical multiprocessor real-time systems, J. Syst. Archit. (2012).

[4] M. Hussin, Y.C. Lee, A.Y. Zomaya, ADREA: a framework for adaptive resource allocation in distributed computing systems, Presented at the 11th Int'l Conf. on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Wuhan, China, 2010.

[5] M. Hussin, Y.C. Lee, A.Y. Zomaya, Efficient energy management using adaptive reinforcement learning-based scheduling in large-scale distributed systems, Presented at the 40th Int'l Conf. on Parallel Processing, ICPP 2011, Taipei, Taiwan, 2011.

[6] M. Hussin, Y.C. Lee, A.Y. Zomaya, Priority-based scheduling for large-scale distributed systems with energy awareness, in: Proc. of the 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, Sydney, Australia, 2011, pp. 503–509.

[7] L.P. Kaelbling, M.L. Littman, A.W. Moore, Reinforcement learning: a survey, J. Artificial Intelligence Res. 4 (1996) 237–285.

[8] Y.C. Lee, A.Y. Zomaya, Scheduling in Grid environments, in: S. Rajasekaran, J. Reif (Eds.), Handbook of Parallel Computing: Models, Algorithms and Applications, CRC Press, Boca Raton, Florida, USA, 2008, pp. 21.1–21.19.

[9] J. Li, R. Yayhapour, Learning-based negotiation strategies for Grid scheduling, Presented at the 6th IEEE Int'l Sym. on Cluster Computing and the Grid, CCGRID06.

[10] L. Lin, Y. Zhang, J. Huai, Sustaining incentive in Grid resource allocation: a reinforcement learning approach, Presented at the 7th IEEE Int'l Sym. on Cluster Computing and the Grid, CCGrid07, 2007.

[11] I. Llorente, R. Moreno-Vozmediano, R. Montero, Cloud computing for on-demand Grid resource provisioning, in: Advances in Parallel Computing, Vol. 18, IOS Press, 2009, pp. 177–191.

[12] N.B. Rizvandi, J. Taheri, R. Moraveji, A.Y. Zomaya, A study on using uncertain time series matching algorithms for MapReduce applications, Concurr. Comput.: Pract. Exper. 25 (2013) 1699–1718.

[13] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press: A Bradford Book, Cambridge, MA, 1998.

[14] D. Vengerov, A reinforcement learning approach to dynamic resource allocation, Eng. Appl. Artif. Intell. 20 (2007) 383–390.

[15] M.H.O. Vermesan, H. Vogt, K. Kalaboukas, M. Tomaselle, The Internet of Things—strategic research roadmap, 2009.

[16] J. Wu, X. Xu, P. Zhang, C. Liu, A novel multi-agent reinforcement-learning approach for job scheduling in Grid computing, Future Gener. Comput. Syst. 27 (2011) 430–439.

[17] C.-Z. Xu, J. Rao, X. Bu, URL: a unified reinforcement learning approach for autonomic cloud management, J. Parallel Distrib. Comput. 72 (2012) 95–105.

[18] A.Y. Zomaya, M. Clements, S. Olariu, A framework for reinforcement-based scheduling in parallel processor systems, IEEE Trans. Parallel Distrib. Syst. 9 (1998) 249–260.

Masnida Hussin is a senior lecturer at the Department of Communication Technology and Network, Faculty of Computer Science & IT, Universiti Putra Malaysia, Malaysia. She received her Ph.D. from the University of Sydney, Australia, in 2012. Her main research interests are QoS and resource management for distributed systems such as Grid and Cloud. She was also involved in a green computing project. She received a Huawei Technology Certification in 2012 as a System Instructor, which makes her specialized in configuring Huawei network computer components. She is a member of the IEEE and has published several papers related to parallel and distributed computing.

Nor Asilah Wati Abdul Hamid is a senior lecturer at the Department of Communication Technology and Network, Faculty of Computer Science & IT, Universiti Putra Malaysia, Malaysia. She received her Ph.D. from the University of Adelaide in 2008. Her research interests are in parallel and distributed high performance computing, cluster computing, computational science and other applications of high-performance computing. In 2011 she did her post-doctoral research at the High Performance Computing Lab at George Washington University, USA. She is also an Associate Researcher of High Speed Machine at the Institute for Mathematical Research (INSPEM), Universiti Putra Malaysia.

Khairul Azhar Kasmiran received his B.IT (Hons.) degree in software engineering from Multimedia University, Melaka, Malaysia, in 2003, the M.Sc. degree in software engineering from Universiti Putra Malaysia, Selangor, Malaysia, in 2006, and the Ph.D. degree in information technology from the University of Sydney, Sydney, Australia, in 2012. His current research interests include code optimization, computational intelligence, high-performance computing, and biomedical informatics.