Towards Fault-Adaptive Control of Enterprise Computing Systems—A Position Paper

Dara Kusic∗
Drexel University, Philadelphia, PA 19104
[email protected]

Nagarajan Kandasamy†
Drexel University, Philadelphia, PA 19104
[email protected]

Sherif Abdelwahed‡
Mississippi State University, Mississippi State, MS 39762
[email protected]

Guofei Jiang
NEC Laboratories America, Princeton, NJ 08540
[email protected]

ABSTRACT
There is a growing interest in implementing online control frameworks that manage distributed computing systems for power and performance objectives. While such frameworks continuously manage the system to optimize resource allocation and respond to dynamic environment input, they often rely upon static models of application behavior that do not adapt to slow behavior changes that occur during normal operation. By introducing adaptive models that dynamically adjust to the changing performance profile of an application, a robust controller can maintain performance objectives through normal changes that can occur in production and those introduced by software errors. In this paper, we characterize the effects of events that change an application's performance profile over time. Such studies motivate the need for model-adaptive control to maintain system power and performance objectives over time under dynamic operating conditions.

Key words: Adaptive model learning, dynamic control, robust control, resource provisioning

1. INTRODUCTION
Web-based services such as online banking and shopping are enabled by enterprise applications. We broadly define an enterprise application as any software hosted on a server that simultaneously provides services to a large number of users over a computer network. A promising method for automating system management tasks in enterprise computing systems is to formulate them as control problems in terms of cost or performance metrics. This approach offers some important advantages over heuristic or rule-based policies in that we can design a generic control framework to address a class of problems such as power management or resource provisioning using the same basic control concepts. Moreover, we can verify such a control scheme's feasibility with respect to the performance goals prior to actual deployment.

∗D. Kusic is supported by NSF grant DGE-0538476.
†N. Kandasamy acknowledges support from NSF grant CNS-0643888.
‡S. Abdelwahed acknowledges support from the NSF SOD Program, contract number CNS-0613971.

Researchers from academia and industry have successfully applied classical feedback and proportional, integral, and derivative (PID) control to CPU provisioning [1], load balancing [2], and power-management problems in Web servers [3, 4]. Assuming a linear time-invariant system, an unconstrained state space, and a continuous input and output domain, a closed-loop feedback controller is designed under stability and sensitivity requirements. Others have used concepts from model-predictive control (MPC) to pose resource provisioning problems as optimization problems, which are then solved online under dynamic operating constraints [5, 6]. These techniques can also accommodate hybrid and non-linear system behavior, dead times, as well as the cost of control. Recently, we have applied an MPC-based scheme to manage power consumption and performance in a virtualized computing environment, and experimental results obtained using a cluster of Dell PowerEdge servers indicate that the controller achieves a 26% reduction in power consumption costs over a 24-hour period when compared to an uncontrolled system while achieving the desired performance goals [7].

This paper motivates the need for fault-adaptive control of distributed computing systems. Our experience with the aforementioned cluster suggests that long-running software components managed by a controller be treated as adaptive processes that exhibit slow behavioral changes over time as a result of both internal and external influences. For example, consider Trade6, a stock-trading application from IBM where users can browse, buy, and sell stocks [8]. Trade6 has two main components, the WebSphere Server, housing the application logic, and DB2, the back-end database, that can be distributed across multiple machines comprising the application and database tiers. Now, the response time achieved by a long-running Trade6 application can be time varying even under a constant workload intensity due to: (1) internal changes to the application as it "ages" (e.g., slow performance degradation caused by memory leaks) and/or (2) external actions wherein users and stocks are added to the database or removed from it. This time-varying behavior has significant implications for model-based control techniques such as MPC: if a static model is used to predict the response time achieved by Trade6, then, over time, the predictions will not match the actual response times.

FeBid 2008, Third International Workshop on Feedback Control Implementation and Design in Computing Systems and Networks, 2008, Annapolis, Maryland, US

Based on the above discussion, we propose FACT (Fault Adaptive Control Technology), which will integrate online parameter tuning, machine learning, and model-based diagnosis within the overall control framework to: (1) improve the accuracy of partially specified system models, (2) maintain the correctness of the model against slow behavioral changes to system components over time, and (3) identify and localize component failures resulting in abrupt changes to system behavior.

We discuss FACT in the context of power and performance management in a virtualized execution environment. Virtualization allows a single server to be shared among multiple performance-isolated platforms called virtual machines (VMs), where each virtual machine can, in turn, host multiple enterprise applications. Virtualization also enables on-demand or utility computing, a just-in-time resource provisioning model in which computing resources such as CPU, memory, and disk space are made available to applications only as needed and not allocated statically based on the peak workload demand. By dynamically provisioning virtual machines, consolidating the workload, and turning servers on and off as needed, hosting center operators can maintain the desired performance objectives while achieving higher server utilization and energy efficiency [7, 9, 10].

The rest of this paper is organized as follows. Section 2 describes the physical testbed and the resource provisioning problem, and summarizes the main results. Section 3 uses the Trade6 and RUBBoS applications to make the case for treating long-running software components as adaptive processes. Specifically, we show that, even under a constant workload intensity, increasing (or decreasing) the size of the back-end database (an external influence) and small memory leaks (internal changes to the application) change the response times achieved by these applications. Section 4 describes the proposed FACT framework as well as the research challenges that must be addressed to realize FACT in practical settings. The key challenge is to integrate control, machine learning, and diagnosis processes into a common model-based framework that will adapt to slow behavioral changes to the underlying system components as well as identify faulty components to help the reconfiguration and recovery process.

2. PRELIMINARIES
Recently published work by the authors [7] has shown the effectiveness of an online control framework to manage a small virtualized hosting environment for power and performance objectives. The resource-provisioning problem is posed as one of sequential optimization under uncertainty and solved using limited lookahead control (LLC), a predictive control approach previously introduced in [11]. This approach allows for multi-objective optimization under explicit operating constraints and is applicable to computing systems with non-linear dynamics where control inputs must be chosen from a finite set.

Table 1: Control performance, in terms of the average energy savings and SLA violations, for five different workloads as published in [7].

Workload      Total Energy Savings   % SLA Violations (App. 1)   % SLA Violations (App. 2)
Workload 1    18%                    3.2%                        2.3%
Workload 2    17%                    1.2%                        0.5%
Workload 3    17%                    1.4%                        0.4%
Workload 4    45%                    1.1%                        0.2%
Workload 5    32%                    3.5%                        1.8%

The testbed hosts multiple web-based services, comprising front-end application servers and back-end database servers. The applications perform dynamic content retrieval (customer browsing) as well as transaction commitments (customer purchases) requiring database reads and writes, respectively. The hosting hardware consists of a cluster of heterogeneous Dell PowerEdge servers running VMware ESX Server 3.5 virtualization software to logically partition the computing resources (CPU, memory, disk, etc.). Incoming requests for each service, from workloads such as those shown in Fig. 2, are dispatched to a dedicated cluster of virtual machines (VMs) distributed across the host machines.

The virtual clusters generate revenue as per the non-linear pricing scheme or service-level agreement (SLA) shown in Fig. 1 that relates the average response time achieved per transaction to a dollar value that clients are willing to pay. Response times below a threshold value result in a reward paid to the service provider, while response times violating the SLA result in the provider paying a penalty to the client.
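A stepwise SLA of this kind can be sketched as a piecewise-constant function of the average response time. The thresholds and dollar values below are illustrative assumptions only; they are not read from Fig. 1, and the name `sla_revenue` is hypothetical.

```python
from bisect import bisect_right

# Hypothetical SLA steps (assumed for illustration, not taken from Fig. 1):
# response-time boundaries in seconds and the revenue per transaction in
# dollars that applies within each resulting interval.
THRESHOLDS_S = [1.0, 2.0]
REVENUE_USD = [7e-4, 3e-4, -3e-4]  # reward, reduced reward, penalty


def sla_revenue(response_time_s: float) -> float:
    """Return the per-transaction revenue for an average response time."""
    if response_time_s < 0:
        raise ValueError("response time must be non-negative")
    # bisect_right counts the boundaries that are <= response_time_s,
    # which is exactly the index of the pricing step that applies.
    return REVENUE_USD[bisect_right(THRESHOLDS_S, response_time_s)]
```

With these assumed steps, a response time below the first boundary earns the full reward, and one beyond the last boundary yields a negative value, i.e., a penalty paid by the provider.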

The control objective is to maximize the profit generated by this system by minimizing both the power consumption and SLA violations. To achieve this objective, the online controller decides (1) the number of physical and virtual machines to allocate to each service, with the VMs and their hosts turned on or off according to workload demand, and (2) the CPU share and the fraction of the incoming workload to allocate to each VM.

[Figure 1 plot. Title: "A stepwise pricing SLA for the online services". x-axis: response time, seconds. y-axis: revenue, dollars. Series: RUBBoS SLA, Trade6 SLA, DVD Store SLA.]
Figure 1: A pricing strategy that differentiates the online services.

[Figure 2 plot. Title: "Workload Arrivals to Three Online Applications". x-axis: time instance in 30 second increments. y-axis: number of arrivals. Series: Trade6 workload, DVD Store workload, RUBBoS workload.]
Figure 2: Transaction requests to the online applications, plotted at 30-second intervals.

[Figure 3 schematic. Blocks: predictive filter, system model, optimizer, system. Signals: environment input λ(k) and its forecast λ̂(l), measured state x(k), estimated state x̂(l), candidate input û(l), applied input u(k).]
Figure 3: The schematic of a limited lookahead controller.

Experimental results show that the hosting environment, when managed using the implemented LLC approach, saves, on average, 26% in power consumption over a twenty-four hour period when compared to a system operating without dynamic control. An uncontrolled system is defined as one in which all available host machines remain powered on. Power consumption during run-time was estimated from models of the measured power consumption of the host machines in various operating states (e.g., "idle", in boot, n virtual machines serving workloads). These power savings are achieved with very few SLA violations, 1.6% of the total number of service requests. An uncontrolled system also incurs some SLA violations due to normal variability in application performance, ranging from less than 1% to about 5% of the total requests made to the system. Results of the energy savings and SLA violations of two applications for five different workloads similar to those shown in Fig. 2 are summarized in Table 1.
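The summary figures quoted above can be cross-checked against the rows of Table 1. The short sketch below takes plain unweighted means of the tabulated percentages; this is an assumption made for illustration, since the published 1.6% is stated relative to the total number of service requests and a request-weighted figure need not coincide with an unweighted mean.

```python
from statistics import mean

# Percentages transcribed from Table 1.
energy_savings = [18, 17, 17, 45, 32]
sla_app1 = [3.2, 1.2, 1.4, 1.1, 3.5]
sla_app2 = [2.3, 0.5, 0.4, 0.2, 1.8]

# Unweighted means over the five workloads (an illustrative assumption).
avg_energy = mean(energy_savings)
avg_sla = mean(sla_app1 + sla_app2)

print(f"average energy savings: {avg_energy:.1f}%")
print(f"average SLA violations: {avg_sla:.2f}%")
```

Under this assumption the means come out at 25.8% and 1.56%, which round to the 26% and 1.6% reported in the text.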

A key aspect of implementing the LLC approach is to formulate a model that estimates the future state of the system so that control decisions can be evaluated. Fig. 3 illustrates the basic LLC concept, where the environment input λ is estimated over a prediction horizon h and used by the system model to forecast future system states x̂. At each time step k, the controller finds a feasible sequence {u*(l) | l ∈ [k+1, k+h]} of control actions within the prediction horizon that maximizes the profit generated by the cluster. Then, only the first control action in the chosen sequence, u*(k+1), is applied to the system and the rest are discarded. The entire process is repeated at time k+1, when the controller can adjust the trajectory given new state information and an updated workload forecast.
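The receding-horizon procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the controller of [7]: the function name `llc_step`, the scalar state, and the exhaustive enumeration of control sequences are all hypothetical choices, and the `model` and `profit` callables must be supplied by the caller.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

State = float
Control = int


def llc_step(
    x: State,
    forecast: Sequence[float],
    controls: Sequence[Control],
    model: Callable[[State, Control, float], State],
    profit: Callable[[State, Control], float],
) -> Control:
    """Return the first action of the most profitable control sequence.

    forecast[i] is the predicted environment input for step k+1+i, so
    len(forecast) plays the role of the prediction horizon h.
    """
    if not forecast or not controls:
        raise ValueError("need a non-empty forecast and control set")
    best_seq: Tuple[Control, ...] = ()
    best_value = float("-inf")
    # Exhaustive search over all len(controls)**h sequences; practical only
    # for a small finite control set and a short horizon.
    for seq in product(controls, repeat=len(forecast)):
        state, value = x, 0.0
        for u, lam in zip(seq, forecast):
            state = model(state, u, lam)  # forecast the next state
            value += profit(state, u)     # accumulate profit over the horizon
        if value > best_value:
            best_seq, best_value = seq, value
    # Apply only the first action; the rest of the sequence is discarded
    # and the search is repeated at the next time step.
    return best_seq[0]
```

Calling `llc_step` once per time step, each time with the newly measured state and an updated forecast, reproduces the repeat-at-k+1 behavior. The enumeration cost grows exponentially with the horizon, so a practical controller would likely need pruning or a bounded search.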

[Figure 4 plot. Title: "Measured Average Response Time for the Trade6 Application". x-axis: workload in arrivals per second. y-axis: response time in ms. Series: 3 GHz CPU share VM, 4.5 GHz CPU share VM, 6 GHz CPU share VM, Trade6 SLA.]
Figure 4: The measured average response times used to construct a static model for the Trade6 application as a function of a VM's CPU share. The VM is allocated 1 GB of memory.

In previous work using the LLC approach to perform dynamic resource provisioning [7, 12], we generated models of the applications to use in the online controller shown in Fig. 3 via simulation-based learning. An example of the profiling data captured via offline simulations of the Trade6 application is shown in Fig. 4. To collect this data, we allocated 1 GB of memory to a virtual machine and varied the CPU share to measure the average response times under increasing workload intensities. The data is then used to construct lookup tables of the average expected response times per CPU share and workload intensity. These are static models and, as such, are inherently inflexible to slow behavioral changes in the system due to software ageing or external changes introduced by a system administrator.
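A static lookup-table model of this kind might be organized as follows. The numbers are invented for illustration and are not the measurements of Fig. 4; the linear interpolation between profiled workload points is also an assumption added here, since the text only states that lookup tables are built.

```python
from bisect import bisect_left

# Hypothetical profiling data (illustrative values, not those of Fig. 4):
# average response time in ms, indexed by CPU share in GHz and by
# workload intensity in arrivals per second.
WORKLOADS = [20, 25, 30, 35]
TABLE_MS = {
    3.0: [150.0, 260.0, 480.0, 780.0],
    4.5: [110.0, 170.0, 290.0, 520.0],
    6.0: [90.0, 120.0, 180.0, 310.0],
}


def predict_response_ms(cpu_ghz: float, arrivals_per_s: float) -> float:
    """Look up the expected response time for a CPU share and workload."""
    row = TABLE_MS[cpu_ghz]  # CPU shares form a finite set of control inputs
    # Clamp to the profiled range rather than extrapolating.
    if arrivals_per_s <= WORKLOADS[0]:
        return row[0]
    if arrivals_per_s >= WORKLOADS[-1]:
        return row[-1]
    # Interpolate linearly between the two neighboring profiled workloads.
    hi = bisect_left(WORKLOADS, arrivals_per_s)
    lo = hi - 1
    frac = (arrivals_per_s - WORKLOADS[lo]) / (WORKLOADS[hi] - WORKLOADS[lo])
    return row[lo] + frac * (row[hi] - row[lo])
```

Because the table contents are fixed after profiling, the predictions cannot follow an application whose response times drift, which is the inflexibility noted above.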

In order for the LLC approach or any model-based control strategy to be of practical value over time, it must adapt to changes in the operating conditions that can compromise the accuracy of predictions about application behavior. Without robustness to application dynamics that increase resource utilization and response times, the number of SLA violations shown in Table 1 will increase even as workload intensity remains constant. Some common conditions that slowly change an application's performance profile over time are presented in the next section.

3. SOFTWARE COMPONENTS AS ADAPTIVE PROCESSES

We use three multi-tier, transaction-based applications for testing in our virtualized hosting environment. The first, IBM's Trade6 benchmark, a multi-threaded stock-trading application that allows users to browse, buy, and sell stocks, is used in the work presented in [7]. Trade6 is a transaction-based application integrated within the IBM WebSphere Application Server V6.0 and uses DB2 Enterprise Server Edition V9.2 as the database component. This execution environment is then distributed across multiple servers comprising the application and database tiers.

The second application hosted within our virtualized cluster is Dell's DVD Store [13], an open source simulation of an online e-commerce site that we host on an Apache Tomcat 6.0.14 application server and have adapted for use with IBM's DB2 Enterprise Server Edition V9.2 for the database component. The DVD Store has several default database sizes available, and for our experiments we have chosen the 'small' database size.

[Figure 5 plot. Title: "Trade6 application performance with database scaling". x-axis: arrivals per second. y-axis: average measured response time for 2K requests. Series: 1K stocks, 500 users; 10K stocks, 5K users; 25K stocks, 13K users; 30K stocks, 15K users; 35K stocks, 18K users; 50K stocks, 25K users.]
Figure 5: The changing performance profile of the Trade6 application as the database size grows.

Rice University's servlet implementation of their RUBBoS application [14] is the third application hosted within our virtualized cluster. RUBBoS is a bulletin board benchmark similar to the online forum Slashdot which has browsing capabilities similar to an e-commerce site, and posting capabilities that are similar to purchase transactions that require database writes. We host RUBBoS on an Apache Tomcat 6.0.14 application server and have adapted the database component to IBM's DB2 Enterprise Server Edition V9.2. The experimental results for all three applications reflect application workloads requiring a 50/50 mix of database reads and writes.

3.1 Effect of Database Size on Response Time
We first sought to characterize the effects of database file size growth on application performance. The size of an application's database may increase due to production activity as new users accumulate, or as additional data is introduced by a system administrator. The database size affects the retrieval time for dynamic content, as well as the time to commit new transactions. Fig. 5 clearly shows the effects of database size on the average response time per request for the Trade6 application. The default database for the Trade6 application consists of 500 users and 1,000 stocks and is scaled upward from that size. As the size of the database grows, the maximum workload arrival rate that can be tolerated within the bounds of the SLA shown in Fig. 1 decreases. An adaptive online controller would scale back the workload directed to each node to accommodate the increased response time per transaction.
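The shrinking tolerable arrival rate can be illustrated with a small sketch. The response-time profiles and the 150 ms threshold below are hypothetical values chosen for illustration, not data from Fig. 5 or Fig. 1, and `max_tolerable_rate` is an assumed helper name.

```python
from typing import Sequence


def max_tolerable_rate(
    rates: Sequence[float], response_ms: Sequence[float], sla_ms: float
) -> float:
    """Return the largest profiled rate up to which the SLA is always met.

    Returns 0.0 when even the lowest profiled rate violates the SLA.
    """
    best = 0.0
    for rate, resp in sorted(zip(rates, response_ms)):
        if resp > sla_ms:
            break  # stop at the first violation as the rate increases
        best = rate
    return best


# Hypothetical profiles: arrival rates in requests per second and the
# average response time in ms for a small and a large database.
RATES = [5, 10, 15, 20, 25, 30, 35]
SMALL_DB_MS = [40, 45, 50, 60, 80, 120, 200]
LARGE_DB_MS = [60, 80, 120, 180, 260, 330, 400]

small_limit = max_tolerable_rate(RATES, SMALL_DB_MS, sla_ms=150)
large_limit = max_tolerable_rate(RATES, LARGE_DB_MS, sla_ms=150)
```

With these assumed profiles the limit drops from 30 to 15 requests per second as the database grows, so a controller that still used the small-database profile would direct too much workload to each node.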

3.2 Effect of Memory Leaks on Response Time
While growing database sizes can be considered an effect of normal operation, memory leaks caused by software bugs are a common source of errant operation in an application, particularly when using the C/C++ programming languages [15]. A memory leak is defined as memory that has been dynamically allocated by an application, such as by using the malloc() operation in the C language, but that is never released back to the operating system. Memory leaks that accumulate will eventually exhaust the available physical memory, at which point the operating system will fall back on swap space (virtual memory). It is typically at this point, when the application has reserved the system memory to capacity, that performance significantly degrades.

[Figure 6 plots, panels (a) to (c). Titles: "Trade6, 4 req./sec., 1 MB/min. memory leak", "Trade6, 4 req./sec., 5 MB/min. memory leak", "Trade6, 4 req./sec., 10 MB/min. memory leak". x-axis: time instance in 1 min. increments. y-axis: response time in ms.]
Figure 6: The measured response times for the Trade6 application as 1 MB, 5 MB, and 10 MB memory leaks accumulate per minute. The VM is allocated a 6 GHz CPU share and 1 GB of memory.

[Figure 7 plots, panels (a) to (c). Titles: "RUBBoS, 4 req./sec., 1 MB/min. memory leak", "RUBBoS, 4 req./sec., 5 MB/min. memory leak", "RUBBoS, 4 req./sec., 10 MB/min. memory leak". x-axis: time instance in 1 min. increments. y-axis: response time in ms.]
Figure 7: The measured response times for the RUBBoS application as 1 MB, 5 MB, and 10 MB memory leaks accumulate per minute. The solid line indicates the average response time per one minute increment. The VM is allocated a 6 GHz CPU share and 1 GB of memory.

Fig. 6 and Fig. 7 show the response times measured per transaction for the Trade6 and RUBBoS applications, respectively. The solid black lines through Fig. 6 and Fig. 7 indicate the response time averaged over the last 20 time samples. In both applications, the memory leak was introduced at the application server and database tiers and the workload arrival rate was held constant. The point at which application performance degrades significantly is commensurate with the rate of the memory leak, occurring at about t = 150, or two and a half hours into a 5 MB per minute memory leak, and at half that time, occurring at about t = 70, or an hour and fifteen minutes into a 10 MB per minute memory leak. The 1 MB per minute memory leak shown in Fig. 7(a) exhibits a more gradual increase before causing the application server to crash, at about t = 1100.
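The averaging over the last 20 time samples can be sketched as a trailing mean. This is one plausible reading of how the solid lines were computed, not a description of the authors' tooling; the function name `trailing_mean` is hypothetical, and early samples are averaged over however many values are available so far.

```python
from collections import deque
from typing import Deque, Iterable, List


def trailing_mean(samples: Iterable[float], window: int = 20) -> List[float]:
    """Average each sample with up to window-1 samples that precede it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent: Deque[float] = deque(maxlen=window)
    total = 0.0
    out: List[float] = []
    for sample in samples:
        if len(recent) == window:
            total -= recent[0]  # this value is evicted by the append below
        recent.append(sample)
        total += sample
        out.append(total / len(recent))
    return out
```

A trailing window smooths the per-transaction noise while still rising shortly after the onset of degradation, at the cost of a lag of up to one window length.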

In both applications, the number of per-transaction SLA violations, i.e., requests that exceed the acceptable response time thresholds in Fig. 1, jumps from less than one hundredth of a percent to between 2% and 5% of the total requests over a given time period. Because SLA goals are set somewhat conservatively, the percentage of per-transaction SLA violations may not be large. However, the increases in average response times are a more dramatic measure of performance degradation. The Trade6 application exhibits somewhat modest increases in average response times (about 60% to 75% above normal) as the memory is allocated to capacity; the RUBBoS application, however, exhibits more severe increases in average response times (about 150% to 500% above normal) under the same conditions.

4. THE FACT FRAMEWORK AND ASSOCIATED RESEARCH CHALLENGES

To make good provisioning decisions, an online controller must maintain an accurate model of application behavior under various workload intensities when hosted by a VM given a particular CPU and memory allocation. In previous work, we have captured this behavior using supervised learning, i.e., by extensively simulating the VM (and the underlying application) in offline fashion and using the observations to train an approximation structure such as a neural network for use at run time [16]. However, it is generally not possible, via offline simulations, to create an exact model of system dynamics using a limited set of training data. Therefore, we propose to integrate online parameter tuning and model-learning techniques within the LLC control framework to: (1) improve the accuracy of partially specified system models, and (2) maintain the correctness of the model against slow behavioral changes to system components over time.

We can use supervised online learning to construct (approximate) dynamical models of the application components under control. These models capture the complex and non-linear relationship between the response time achieved by a VM, and the CPU and memory share provided to it, under dynamically changing operating conditions. We propose to augment the online control framework with an online model-learning component, as shown in Fig. 8, aimed at developing autonomic systems that anticipate changes in operating conditions and react to gradual behavior changes in the system components.

[Figure 8 schematic. Blocks: predictive filter, neural network (system model), optimizer, system, adaptation algorithm, comparator Σ, and chains of z^-1 delay blocks. Signals: λ(k), λ̂(k), x(k), x̂(k), u(k), û(k), e(k), and the delayed values λ(k−1) through λ(k−M) and x(k−1) through x(k−M). The extracted labels also mention a fault diagnosis component (fault model, fault diagnoser, comparator, fault reporting/recovery, recovery action a(k)) alongside the self-optimizing control framework.]
Figure 8: The proposed fault-adaptive control technology framework utilizing a recurrent neural network.

Fig. 8 shows FACT components integrated with the LLC framework shown in Fig. 3, using a recurrent neural network for the System Model [17]. Re-training of the neural network utilizes datasets of length M, where the previous environment inputs and system state inputs are retained from time (k−1) through time (k−M), using blocks denoted as z^-1 for the time delay. A comparator, denoted as Σ, takes the actual state of the system, x, and compares it against the estimated state of the system as output by the neural network, denoted as x̂. The error, e(k), is then applied to a model adaptation algorithm. The process by which the model is updated will be dependent upon the modeling approach chosen. For example, for a simple lookup table, the adaptation process is a matter of updating the table entries. For abstract models, such as the recurrent neural network in Fig. 8, the adaptation process involves re-training the neural network as new data becomes available. The cost in terms of the time to perform the adaptation will depend upon the length M of the datasets.
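For the lookup-table case, the compare-and-adapt loop can be sketched as below. This is a minimal illustration under stated assumptions rather than the FACT design: the class name `AdaptiveLookupModel`, the `gain` step size, the error sign convention e(k) = x(k) − x̂(k), and the proportional correction rule are all hypothetical choices. The bounded history stands in for the chain of z^-1 delay blocks that retains the last M samples.

```python
from collections import deque
from typing import Deque, Dict, Tuple

Key = Tuple[float, int]  # (CPU share in GHz, workload in arrivals per second)


class AdaptiveLookupModel:
    """Lookup-table model whose entries are corrected from prediction errors."""

    def __init__(self, table: Dict[Key, float], history_len: int, gain: float = 0.5):
        if not 0.0 < gain <= 1.0:
            raise ValueError("gain must be in (0, 1]")
        self.table = dict(table)
        self.gain = gain  # hypothetical step size, not taken from the paper
        # Keep only the last M (key, measurement) pairs for later re-training.
        self.history: Deque[Tuple[Key, float]] = deque(maxlen=history_len)

    def predict(self, key: Key) -> float:
        """Return the estimated state for an operating point."""
        return self.table[key]

    def update(self, key: Key, measured: float) -> float:
        """Compare measured and estimated state and correct the table entry."""
        error = measured - self.table[key]    # assumed sign: actual minus estimate
        self.table[key] += self.gain * error  # move the entry toward the measurement
        self.history.append((key, measured))
        return error
```

A gain of 1.0 overwrites the entry with the latest measurement, while smaller gains filter out noise but converge more slowly; that trade-off between estimation accuracy and convergence time is the kind of question the framework would need to study.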

The key challenge is to learn the new system behavior under dynamic conditions and update the model used by the controller in real time. An accurate system model helps to avoid the excessive switching that a controller introduces into the system to correct for misguided control actions, and to reduce the number of SLA violations. The model adaptation unit must therefore learn an accurate new model of the system in a timely manner, because, as shown in Fig. 7, changes in the application behavior can occur suddenly. To tackle this problem, it is useful to study several modeling techniques (non-linear regression, auto-regressive moving averages, etc.) for their estimation accuracy and convergence time.

5. CONCLUSION
The results presented in this paper support the position that, to obtain good control performance, the dynamical models of application components should be refined to match the time-varying behavior of the real system. We propose the development of FACT (Fault Adaptive Control Technology), which continuously refines system or component models using feedback from the system. It observes the current system state and compares it to the estimated system state to update the underlying approximation structure (e.g., lookup tables, neural networks). We also propose an analysis of how the convergence speed of a given learning structure impacts control performance. The potential compensatory behavior of the controller can also be considered.

6. REFERENCES
[1] W. Xu, X. Zhu, S. Singhal, and Z. Wang, "Predictive control for dynamic resource allocation in enterprise data centers," in Proc. of the IEEE Network Ops. and Mgmt. Sym., Apr. 2006, pp. 115–26.

[2] Y. Diao, J. Hellerstein, A. Storm, M. Surendra, S. Lightstone, S. Parekh, and C. Garcia-Arellano, "Using MIMO linear control for load balancing in computing systems," in Proc. of the IEEE American Control Conf., Jun. 2004, pp. 2045–50.

[3] V. Sharma, A. Thomas, T. Abdelzaher, K. Skadron, and Z. Lu, "Power-aware QoS management in web servers," in Proc. of the IEEE Real-Time Systems Sym., Dec. 2003, pp. 63–72.

[4] J. L. Hellerstein, Y. Diao, S. Parekh, and D. M. Tilbury, Feedback Control of Computing Systems. Wiley-IEEE Press, 2004.

[5] B. Moerdyk, R. Decarlo, D. Birdwell, M. Zefran, and J. J. Chiasson, "Hybrid optimal control for load balancing in a cluster of computer nodes," in Proc. of the IEEE Conf. on Control Applications, Oct. 2006, pp. 1713–18.

[6] C. Lu, X. Wang, and X. Koutsoukos, "Feedback utilization control in distributed real-time systems with end-to-end tasks," Perf. Eval. Review, vol. 16, pp. 550–61, Jun. 2005.

[7] D. Kusic, J. Kephart, J. Hanson, N. Kandasamy, and G. Jiang, "Power and performance management of virtualized computing environments via lookahead control," in Proc. IEEE Intl. Conf. on Autonomic Computing (ICAC), Jun. 2008.

[8] IBM, Trade6 Performance-Characterizing Application for WebSphere. http://www.ibm.com/developerworks/edu/dm-dw-dm-0506lau.html, 2005.

[9] G. Khanna, K. Beaty, G. Kar, and A. Kochut, "Application performance management in virtualized server environments," in Proc. of the IEEE Network Ops. and Mgmt. Sym., Apr. 2006, pp. 373–381.

[10] M. Steinder, I. Whalley, D. Carrera, I. Gaweda, and D. Chess, "Server virtualization in autonomic management of heterogeneous workloads," in Proc. of the IEEE Sym. on Integrated Network Management, May 2007, pp. 139–148.

[11] S. Abdelwahed, N. Kandasamy, and S. Neema, "Online control for self-management in computing systems," in Proc. IEEE Real-Time & Embedded Technology & Application Symp. (RTAS), 2004, pp. 368–376.

[12] D. Kusic and N. Kandasamy, "Risk-aware limited lookahead control for dynamic resource provisioning in enterprise computing systems," in Proc. of IEEE Intl. Conf. on Autonomic Computing (ICAC), June 2006.

[13] Dell, Dell DVD Store Database Test Suite. http://linux.dell.com/dvdstore.

[14] RUBBoS, RUBBoS: Bulletin Board Benchmark. http://jmob.objectweb.org/rubbos.html.

[15] "Software trends for the 21st century," Computer Weekly, Jun. 6 2006.

[16] D. Kusic and N. Kandasamy, "Approximation modeling for the online performance management of distributed computing systems," in Proc. of IEEE Intl. Conf. on Autonomic Computing (ICAC), June 2007.

[17] D. Mandic and J. Chambers, Recurrent Neural Networks for Prediction. John Wiley & Sons, Ltd., 2001.
