Autonomic Decentralised Elasticity Management of Cloud Applications
Posted on 14-Jul-2015
Autonomic Decentralised Elasticity
Management of Cloud Applications
Srikumar Venugopal, Reza Nouri and Han Li
School of Computer Science and Engineering
University of New South Wales, Sydney, Australia
E: srikumarv@cse.unsw.edu.au
W: http://www.cse.unsw.edu.au/~srikumarv
Agenda
Background & Motivation
Problem Statement
Solution Overview
Evaluation: Methodology and Results
Conclusion
The Promise of Cloud Computing
Background & Motivation
State-of-the-art in Auto-scaling

Product/Project        Trigger                        Controller                  Actions
Amazon Auto Scaling    CloudWatch metrics/Threshold   Rule-based/Schedule-based   Add/Remove Capacity
WASABi                 Azure Diagnostics/Threshold    Rule-based                  Add/Remove Capacity, Custom
RightScale/Scalr       Load monitoring                Rule-based/Schedule-based   Add/Remove Capacity, Custom
Google Compute Engine  CPU Load, etc.                 Rule-based                  Add/Remove Capacity

Academic:
CloudScale             Demand Prediction              Control theory              Voltage-scaling
Cataclysm              Threshold-based                Queueing-model              Admission Control
IBM Unity              Application Utility            Utility functions/RL        Add/Remove Capacity
Cons of Rule-based Auto-scaling
• Currently, the most popular mechanisms for auto-scaling are rule-based
• The effectiveness of rule-based auto-scaling is determined by its trigger conditions
• Setting up the triggers is a trial-and-error process
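A threshold trigger of the kind these products use can be sketched in a few lines. The utilisation values and thresholds below are hypothetical, chosen only to illustrate why picking the "right" band is trial-and-error:

```python
# Minimal sketch of a rule-based auto-scaling trigger.
# The thresholds here are hypothetical, not from any particular product.
def scale_decision(cpu_util, scale_up=0.85, scale_down=0.30):
    """Return +1 to add a VM, -1 to remove one, 0 to do nothing."""
    if cpu_util > scale_up:
        return +1   # add capacity
    if cpu_util < scale_down:
        return -1   # remove capacity
    return 0        # inside the band -- assuming the band was chosen well
```

If the band is too narrow the system oscillates; too wide and it reacts late. That tuning burden is exactly the "illusion of control" problem on the next slide.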
Cons of Rule-based Auto-scaling
• Commercial products are rule-based
  – Gives an “illusion of control” to users
  – Leads to the problem of defining the “right” thresholds
• Centralised controllers
  – Communication overhead increases with size
  – Processing overhead also increases (Big Data!)
• Limited to one application per VM
Challenges of Large-scale Elasticity
• Large numbers of instances and apps
  – Deriving solutions takes time
• Dynamic conditions
  – Apps go critical all the time
• Shifting bottlenecks
  – Greedy solutions may create bottlenecks elsewhere
• Network partitions, fault tolerance…
H. Li and S. Venugopal, “Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform,” in Proceedings of the 8th ICAC, 2011.
Problem Statement
Initial Conditions
[Diagram: on the IaaS provider, Instance1 (App Server1) hosts app1 and app2; Instance2 (App Server2) hosts app3 and app4.]
IaaS Provider
A Critical Event
[Diagram: Instance1 hosts app1 and app2; Instance2 hosts app3 and app4; app1 goes critical under load.]
Placement 1
[Diagram: app1 is moved to Instance2; Instance1 hosts app2; Instance2 hosts app3, app4 and app1.]
Placement 2
[Diagram: app2 is moved to Instance2; Instance1 hosts app1; Instance2 hosts app3, app4 and app2.]
Placement 3
[Diagram: a new Instance3 (App Server3) is provisioned to host app1; Instance1 hosts app2; Instance2 hosts app3 and app4.]
Placements 4 & 5
[Diagrams: two placements in which app1 is duplicated rather than moved, with copies of app1 spread across the existing instances and, in the second placement, onto a newly provisioned Instance3 (App Server3).]
Challenges of App Placement
• Load shifts are dynamic
• Multiple applications may go critical
simultaneously
• Instance provisioning should be controlled
• Service QoS must be maintained
Twin Objectives
• Provisioning Problem
– To determine the smallest number of servers
required to satisfy resource requirements of
all the applications
• Dynamic Placement Problem
– To distribute the applications so as to
maximise utilisation yet meet each app’s
response time and availability requirements
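The provisioning objective is essentially a bin-packing problem. A first-fit heuristic, shown below purely as an illustration of the flavour of the computation (it is not the method proposed in this work, and the demand numbers are made up), packs per-application resource demands into the fewest servers:

```python
# First-fit bin packing: an illustrative baseline for the provisioning
# problem, not the approach proposed in these slides.
def first_fit(app_demands, server_capacity=1.0):
    """Pack app resource demands into servers; return the server count."""
    servers = []  # each entry: remaining capacity of one provisioned server
    for demand in app_demands:
        for i, remaining in enumerate(servers):
            if demand <= remaining:
                servers[i] -= demand  # place app on an existing server
                break
        else:
            servers.append(server_capacity - demand)  # provision a new one
    return len(servers)
```

Even this greedy variant hints at why centralised solving is costly at scale: the dynamic version must be re-solved whenever loads shift, which motivates the decentralised approach that follows.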
Solution Overview
Decentralised Elastic Control
• Instances control their own utilisation
– Monitoring, management and feedback
• Local controllers are learning agents
– Reinforcement Learning
• Servers are linked by Zookeeper
– Agility, Flexibility, Co-ordination
• We call our system ADEC (Autonomic
Decentralised Elasticity Control)
Software Architecture of ADEC
Reinforcement Learning
• Learn optimal management policies over time
– vs. Model-based policies
• Learn long-term effects of short-term actions
– If the state-action pairs are chosen correctly
• We have applied Q-Learning to this problem
– Initial actions are drawn using Boltzmann dist.
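A minimal sketch of tabular Q-Learning with Boltzmann (softmax) action selection follows. The discrete state encoding, learning rate and discount factor are illustrative assumptions, not values from the system described here:

```python
import math
import random

def boltzmann_select(q_values, temperature=1.0):
    """Draw an action index with probability proportional to exp(Q/T)."""
    weights = [math.exp(q / temperature) for q in q_values]
    total = sum(weights)
    r, cum = random.random() * total, 0.0
    for i, w in enumerate(weights):
        cum += w
        if r <= cum:
            return i
    return len(weights) - 1

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard Q-Learning update:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
```

Boltzmann selection explores all actions early on (high temperature) while increasingly favouring high-value actions as the Q-table converges, which matches the slide's note that initial actions are drawn from a Boltzmann distribution.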
Abstract View of the Control Scheme
States
Basic Actions (with rewards):
  Server: create (-3.5), terminate (3.5), find (3.5)
  Application: move (0.5), duplicate (0.5), merge (0.5)
Actions and Rewards
• Actual actions are a combination of a server action and an application action
  – e.g. find and move, merge and terminate
• 11 pre-defined actions
  – Reduces complexity
• Each action is associated with a reward
  – Negative rewards for actions that incur costs (e.g. starting a server)
  – Positive rewards for actions that save costs (e.g. terminating a server)
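Using the per-action reward values shown on the control-scheme slide, and assuming for illustration that a composite action's reward is simply the sum of its parts (the slides do not state the combination rule):

```python
# Reward values taken from the control-scheme slide; the summation of
# server and application rewards into a composite reward is an assumption.
SERVER_REWARDS = {"create": -3.5, "terminate": 3.5, "find": 3.5}
APP_REWARDS = {"move": 0.5, "duplicate": 0.5, "merge": 0.5}

def composite_reward(server_action, app_action):
    """Reward for a combined server + application action."""
    return SERVER_REWARDS[server_action] + APP_REWARDS[app_action]
```

Under this sketch, cost-incurring combinations such as create + move come out negative, while capacity-saving ones such as merge + terminate come out strongly positive, steering the reward-maximising controller toward shrinking.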
Co-ordination using find
• The server looks up other servers with the least load
  – Zookeeper lookup
• It sends a move message to the selected server
• The selected server replies with accept or reject
  – accept carries a positive reward
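The find/move handshake might be sketched as below, with a plain in-memory dictionary standing in for the ZooKeeper registry; all names and the load threshold here are hypothetical:

```python
# Sketch of the find/move handshake. A dict of server -> reported load
# stands in for ZooKeeper here; the real system coordinates via znodes.
def find_least_loaded(registry, exclude):
    """Look up the server reporting the lowest load, other than ourselves."""
    candidates = {s: load for s, load in registry.items() if s != exclude}
    return min(candidates, key=candidates.get) if candidates else None

def handle_move(own_load, threshold=0.85):
    """Target server accepts the incoming app only if it has headroom."""
    return "accept" if own_load < threshold else "reject"
```

A reject simply sends the initiating controller back to the registry for the next candidate, so no central scheduler is involved in the exchange.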
Shrinking
• The controller is always reward-maximising
  – The highest reward is for merge + terminate
• A controller initiates its own shutdown
  – When load on its applications is low
• It acquires an exclusive lock on termination
  – Only one instance can terminate at a time
• It transfers state before shutdown
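The shrinking step can be sketched as follows; a process-local threading.Lock stands in for ZooKeeper's distributed lock, and the idle threshold is an illustrative assumption:

```python
import threading

# A process-local lock stands in for ZooKeeper's distributed lock recipe;
# this is an illustration only, not the system's actual coordination code.
_termination_lock = threading.Lock()

def try_shutdown(server_load, idle_threshold=0.1):
    """Initiate shutdown only under low load and while holding the lock."""
    if server_load >= idle_threshold:
        return "stay-up"
    if not _termination_lock.acquire(blocking=False):
        return "retry-later"   # another instance is already terminating
    try:
        return "terminated"    # merge apps away, transfer state, shut down
    finally:
        _termination_lock.release()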
Information on the DHT
• Server event notification
• List of applications on each server
• Server status updates (load information)
• Q-value updates
Evaluation
Experiment 1: Testing ADEC
• IaaS provider: Amazon EC2
– Small instances and High-CPU instances
• Load tester: Apache JMeter
• Application server: Tomcat 6.0
– JVM with 1 GB RAM
• Server thresholds: 60% and 85%
Experiment 1: Testing
• Six web applications
– Test application: Hotel Management (Search, Book, Confirm)
• Five were subjected to a background load
– Uniform Random
• One was subjected to the test load
• Application thresholds: 200 ms and 500 ms
• Metrics
– Average Response Time, Drop Rate, Servers
Peaking Workload
Poisson Workload
Conclusion
• Demonstrated a co-ordination architecture for provisioning web applications
• Each server is independent, and the system is managed by a set of simple states and actions
• Instances start and shut down on their own to meet application objectives
Ongoing Work
• Improved performance modelling for quick detection of slowdowns
• Using utility functions for defining application priorities
• Extension to SOA and BPM
  – Collaboration with the Technical University of Vienna, Austria
• Scaling the database
  – ElasCass project
Questions?
srikumarv@cse.unsw.edu.au
Thank you!