Autonomic Decentralised Elasticity Management of Cloud Applications

Srikumar Venugopal, Reza Nouri and Han Li

School of Computer Science and Engineering

University of New South Wales, Sydney, Australia

E: [email protected]

W: http://www.cse.unsw.edu.au/~srikumarv

Agenda

Background & Motivation

Problem Statement

Solution Overview

Evaluation: Methodology and Results

Conclusion

The Promise of Cloud Computing

Background & Motivation

State-of-the-art in Auto-scaling

Product/Project       | Trigger                        | Controller                  | Actions
Amazon Autoscaling    | CloudWatch metrics / Threshold | Rule-based / Schedule-based | Add/Remove Capacity
WASABi                | Azure Diagnostics / Threshold  | Rule-based                  | Add/Remove Capacity, Custom
RightScale/Scalr      | Load monitoring                | Rule-based / Schedule-based | Add/Remove Capacity, Custom
Google Compute Engine | CPU Load, etc.                 | Rule-based                  | Add/Remove Capacity
Academic
CloudScale            | Demand Prediction              | Control theory              | Voltage-scaling
Cataclysm             | Threshold-based                | Queueing-model              | Admission Control
IBM Unity             | Application Utility            | Utility functions / RL      | Add/Remove Capacity

Cons of Rule-based Auto-scaling

• Currently, the most popular mechanisms for auto-scaling are rule-based mechanisms

• The effectiveness of rule-based auto-scaling is determined by the trigger conditions

• Setting up the triggers is a trial-and-error process

Cons of Rule-based Autoscaling

• Commercial products are rule-based

– Gives “illusion of control” to users

– Leads to the problem of defining the “right” thresholds

• Centralised controllers

– Communication overhead increases with size

– Processing overhead also increases (Big Data!)

• Limited to one application per VM

Challenges of Large-scale Elasticity

• Large numbers of instances and apps

– Deriving solutions takes time

• Dynamic conditions

– Apps go critical all the time

• Shifting bottlenecks

– Greedy solutions may create bottlenecks in other places

• Network partitions, fault tolerance…

H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of the 8th ICAC '11.

Problem Statement

Initial Conditions

Instance1 (App Server1): app1, app2

Instance2 (App Server2): app3, app4

Both instances are hosted by the IaaS Provider

A Critical Event

Instance1 (App Server1): app1, app2

Instance2 (App Server2): app3, app4

Both instances are hosted by the IaaS Provider

Placement 1

Instance1 (App Server1): app2

Instance2 (App Server2): app3, app4, app1

Both instances are hosted by the IaaS Provider

Placement 2

Instance1 (App Server1): app1

Instance2 (App Server2): app3, app4, app2

Both instances are hosted by the IaaS Provider

Placement 3

Instance1 (App Server1): app2

Instance2 (App Server2): app3, app4

Instance3 (App Server3): app1

All three instances are hosted by the IaaS Provider

Placements 4 & 5

[Figure: two alternative placements on the IaaS Provider. In each, Instance1 (App Server1) keeps app2, app3 and app4 stay together, and app1 appears twice.]

Challenges of App Placement

• Load shifts are dynamic

• Multiple applications may go critical simultaneously

• Instance provisioning should be controlled

• Service QoS must be maintained

Twin Objectives

• Provisioning Problem

– To determine the smallest number of servers required to satisfy resource requirements of all the applications

• Dynamic Placement Problem

– To distribute the applications so as to maximise utilisation yet meet each app’s response time and availability requirements
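
The provisioning problem above resembles bin packing. As a rough illustration only, the sketch below uses the first-fit-decreasing heuristic, which is not the method used in ADEC and does not guarantee the true minimum. The demand figures, the 0.85 capacity and the function name are assumptions made for this example.

```python
from typing import Dict, List


def first_fit_decreasing(demands: Dict[str, float],
                         capacity: float = 1.0) -> List[Dict[str, float]]:
    """Pack applications onto servers, largest demand first (heuristic)."""
    servers: List[Dict[str, float]] = []
    ordered = sorted(demands.items(), key=lambda kv: kv[1], reverse=True)
    for app, demand in ordered:
        if demand > capacity:
            raise ValueError(f"{app} does not fit on a single server")
        for server in servers:
            # Place the app on the first server that still has room.
            if sum(server.values()) + demand <= capacity + 1e-9:
                server[app] = demand
                break
        else:
            # No existing server can host the app: provision a new one.
            servers.append({app: demand})
    return servers


# Hypothetical utilisation demands, expressed as fractions of one server.
demands = {"app1": 0.5, "app2": 0.3, "app3": 0.4, "app4": 0.2}
placement = first_fit_decreasing(demands, capacity=0.85)
```

With these illustrative numbers the heuristic needs two servers. It ignores response time and availability, which the dynamic placement problem must also respect.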

Solution Overview

Decentralised Elastic Control

• Instances control their own utilisation

– Monitoring, management and feedback

• Local controllers are learning agents

– Reinforcement Learning

• Servers are linked by ZooKeeper

– Agility, Flexibility, Co-ordination

• We call our system ADEC (Autonomic Decentralised Elasticity Control)

Software Architecture of ADEC

Reinforcement Learning

• Learn optimal management policies over time

– vs. Model-based policies

• Learn long-term effects of short-term actions

– If the state-action pairs are chosen correctly

• We have applied Q-Learning to this problem

– Initial actions are drawn using a Boltzmann distribution
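
A minimal sketch of tabular Q-learning with Boltzmann (softmax) action selection, to make the two ideas above concrete. The state names, the action subset, and the values of alpha, gamma and the temperature are assumptions for illustration; they are not taken from the ADEC implementation.

```python
import math
import random
from collections import defaultdict

# Hypothetical subset of composite actions; the full ADEC set is not listed here.
ACTIONS = ["find+move", "merge+terminate", "create+move"]
q = defaultdict(float)  # maps (state, action) to its learned Q-value


def boltzmann_choice(state, temperature=1.0, rng=random):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = [q[(state, a)] / temperature for a in ACTIONS]
    top = max(prefs)  # subtract the maximum for numerical stability
    weights = [math.exp(p - top) for p in prefs]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard Q-learning update towards reward + gamma * max Q(next_state, .)."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


# One illustrative step: a critical server tries an action and observes a reward.
action = boltzmann_choice("critical")
q_update("critical", "find+move", reward=1.0, next_state="normal")
```

A high temperature makes the choice close to uniform, which suits early exploration; lowering it makes the agent favour actions with higher Q-values.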

Abstract View of the Control Scheme

States

Basic Actions

Server: create (-3.5), terminate (3.5), find (3.5)

Application: move (0.5), duplicate (0.5), merge (0.5)

Actions and Rewards

• Actual actions are a combination of a server action and an application action

– E.g. find and move, merge and terminate

• 11 pre-defined actions

– Reducing complexity

• Each action is associated with a reward

– -ve rewards for actions incurring costs (e.g. start server)

– +ve rewards for actions that save costs (e.g. terminate)

Co-ordination using find

• A server looks up the other servers with the least load

– ZooKeeper lookup

• It sends a move message to the selected server

• The selected server replies with accept or reject

– accept has a +ve reward
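
The find and move exchange can be sketched with in-memory objects. This is an illustration under stated assumptions: a plain Python list stands in for the ZooKeeper lookup, and the `Server` class, the 0.85 acceptance limit and the load figures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Server:
    name: str
    capacity: float = 0.85  # assumed upper load limit for accepting an app
    apps: Dict[str, float] = field(default_factory=dict)

    @property
    def load(self) -> float:
        return sum(self.apps.values())

    def handle_move(self, app: str, demand: float) -> bool:
        """Reply to a move message: True means accept, False means reject."""
        if self.load + demand > self.capacity:
            return False
        self.apps[app] = demand
        return True


def find_and_move(source: Server, app: str,
                  registry: List[Server]) -> Optional[Server]:
    """Look up the least-loaded peer and offer it one application."""
    peers = [s for s in registry if s is not source]
    if not peers:
        return None
    target = min(peers, key=lambda s: s.load)
    if target.handle_move(app, source.apps[app]):
        del source.apps[app]  # the app now lives on the target
        return target
    return None  # rejected: the caller may try another action


s1 = Server("Instance1", apps={"app1": 0.6, "app2": 0.3})
s2 = Server("Instance2", apps={"app3": 0.1, "app4": 0.1})
moved_to = find_and_move(s1, "app1", [s1, s2])
```

In a learning controller, the accept or reject reply would feed back as the reward for the chosen action.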

Shrinking

• The controller is always reward-maximising

– The highest reward is for merge+terminate

• A controller initiates its own shutdown

– Low load on its applications

• Gets exclusive lock on termination

– Only one instance can terminate at a time

• Transfers state before shutdown
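
A rough sketch of the shrinking steps above: check for low load, take an exclusive termination lock, transfer the applications, then release the lock. A `threading.Lock` stands in for a cluster-wide lock, and the function name, the dictionaries and the 0.6 and 0.85 limits are assumptions for this example.

```python
import threading

# Stand-in for a cluster-wide lock; a real deployment would need a distributed one.
termination_lock = threading.Lock()


def try_shutdown(apps, peers, low_load=0.6, high_load=0.85):
    """Merge this server's apps into the least-loaded peer, if permitted.

    apps:  {app_name: load} hosted on the server that wants to terminate
    peers: {peer_name: {app_name: load}} for the other servers
    Returns the name of the receiving peer, or None if no shutdown happened.
    """
    my_load = sum(apps.values())
    if my_load >= low_load or not peers:
        return None  # not idle enough, or nowhere to move the apps
    if not termination_lock.acquire(blocking=False):
        return None  # another instance is already terminating
    try:
        target = min(peers, key=lambda p: sum(peers[p].values()))
        if sum(peers[target].values()) + my_load > high_load:
            return None  # merging would overload the target
        peers[target].update(apps)  # transfer state before shutdown
        apps.clear()                # the server is empty and can terminate
        return target
    finally:
        termination_lock.release()


apps = {"app1": 0.1}
peers = {"Instance2": {"app3": 0.2}, "Instance3": {"app4": 0.5}}
receiver = try_shutdown(apps, peers)
```

The non-blocking acquire means a controller that loses the race simply continues running and can retry later.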

Information on the DHT

• Server event notification

• List of applications on each server

• Server status updates (load information)

• Q-value updates

Evaluation

Experiment 1: Testing ADEC

• IaaS provider: Amazon EC2

– Small and high-CPU instances

• Load-tester: Apache JMeter

• Application server: Tomcat 6.0

– JVM with 1 GB RAM

• Server thresholds: 60% and 85%

Experiment 1: Testing

• Six web applications

– Test Application: Hotel Management

– Search, Book, Confirm

• Five were subjected to a background load

– Uniform Random

• One was subjected to the test load

• Application thresholds: 200 ms and 500 ms

• Metrics

– Average Response Time, Drop Rate, Servers
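
One plausible reading of the threshold pairs (60% and 85% for servers, 200 ms and 500 ms for applications) is a lower and an upper bound that split each measurement into three bands. The sketch below assumes that reading; the band names and the choice of metric for servers are hypothetical.

```python
def classify(value, low, high):
    """Map a measurement to a coarse band using a lower and an upper threshold."""
    if value < low:
        return "low"
    if value <= high:
        return "normal"
    return "critical"


def server_state(utilisation_pct):
    # Assumed: the server thresholds apply to utilisation in percent.
    return classify(utilisation_pct, 60, 85)


def app_state(response_time_ms):
    # Assumed: the application thresholds apply to response time in ms.
    return classify(response_time_ms, 200, 500)


states = [server_state(50), server_state(70), server_state(90), app_state(650)]
```

Bands like these could serve as the discrete states seen by each local controller.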

Peaking Workload

Poisson Workload

Conclusion


• Demonstrated a co-ordination architecture for provisioning web applications

• Each server is independent and the system is managed by a set of simple states and actions

• Instances start and shut down on their own to meet application objectives

Ongoing Work

• Improved performance modelling for quick detection of slowdowns

• Using utility functions for defining application priorities

• Extension to SOA and BPM

– Collaboration with Technical University of Vienna, Austria

• Scaling the database

– ElasCass project

Questions?

[email protected]

Thank you!
