Autonomic SLA-driven Provisioning for Cloud Applications
Nicolas Bonvin, Thanasis Papaioannou, Karl Aberer
CCGRID 2011, May 23-26 2011, Newport Beach, CA, USA
nicolas.bonvin@epfl.ch, LSIR - EPFL
Cloud Apps – Issue #1 : Placement
● A distributed, component-based application running on an elastic infrastructure
[Figure: C1 and C2 share VM1, C3 runs in VM2 and C4 in VM3; VM1 and VM2 are colocated on Server 1 while VM3 runs alone on Server 2]
● Performance of C1, C2 and C3 is probably lower than that of C4
● No info on the other VMs colocated on the same server!
● No control over placement
Cloud Apps – Issue #2 : Instability
● Load-balanced traffic to 4 identical components on 4 identical VMs
– VM performance can vary by up to a factor of 4! [Dej2009]
● Causes: physical server, hypervisor, storage, ...
● Component overloaded
● Component bug, crash, deadlock, ...
● Failure of C1 on VM4 -> load is rebalanced over the remaining VMs
[Figure: response times of the C1 replicas on VM1–VM4 drift from 100 ms each to 140 ms, 150 ms, 130 ms and infinity as VMs slow down and one replica fails]
The application should react early!
● Build for failures
– Do not trust the underlying infrastructure
– Do not trust your components either !
● Components should adapt to the changing conditions
– Quickly
– Automatically
– e.g. by replacing a wonky VM with a new one
Cloud Apps – Overview
Scarce: a framework to build scalable cloud applications
Architecture Overview
[Figure: one agent per server, hosting components (A, B, E) and exchanging information via gossiping and broadcast]
● An agent runs on each server / VM
– Starts/stops/monitors the components
– Makes decisions on behalf of the components
● An agent communicates with the other agents
– Routing table
– Status of the server (resource usage)
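The agent-to-agent exchange described above can be sketched as follows. This is a minimal illustration of status gossiping, not the framework's actual code; class and field names are assumptions.

```python
class Agent:
    """Per-server agent sketch (hypothetical API): each agent tracks the
    resource status of the servers it has heard about and merges fresher
    information received via gossip."""

    def __init__(self, server_id):
        self.server_id = server_id
        # server_id -> (epoch, cpu_usage); our own entry starts at epoch 0
        self.status = {server_id: (0, 0.0)}

    def local_update(self, epoch, cpu_usage):
        # refresh our own server's status for the current epoch
        self.status[self.server_id] = (epoch, cpu_usage)

    def receive_gossip(self, remote_status):
        # keep the freshest entry per server (higher epoch wins)
        for sid, (epoch, cpu) in remote_status.items():
            if sid not in self.status or self.status[sid][0] < epoch:
                self.status[sid] = (epoch, cpu)

    def gossip_to(self, other):
        # push our whole view to a peer agent
        other.receive_gossip(self.status)

# Two agents exchange status: b learns a's newer view of server "a".
a, b = Agent("a"), Agent("b")
a.local_update(epoch=3, cpu_usage=0.7)
a.gossip_to(b)
```

In a real deployment each agent would pick random gossip partners every epoch; here a single push suffices to show the merge rule.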
An economic approach
● Time is split into epochs (no synchronization between servers)
● Servers charge a virtual rent for hosting a component according to
– Current resource usage (I/O, CPU, ...) of the server
– Technical factors (HW, connectivity, ...)
– Non-technical factors (country stability, ...)
● Components
– Pay the virtual rent at each epoch
– Gain virtual money by processing requests
– Take decisions based on their balance ( = gain – rent )
● Replicate, migrate, suicide, or stay
● Virtual rents are updated by gossiping (no centralized board)
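The per-epoch decision rule can be sketched as below. The thresholds are illustrative assumptions; the talk only states that decisions (replicate, migrate, suicide, stay) are taken from the balance.

```python
def end_of_epoch_decision(gain, rent, replicate_margin=2.0):
    """Toy decision rule for one component at the end of an epoch.

    balance = gain - rent, as on the slide. A component earning far more
    than its rent replicates; one that cannot pay its rent migrates or
    'suicides'; otherwise it stays. `replicate_margin` is an assumption.
    """
    balance = gain - rent
    if balance > replicate_margin * rent:
        return "replicate"            # very profitable: add a replica
    if balance < 0:
        return "migrate_or_suicide"   # cannot afford this server: move away
    return "stay"
```

A busy component on a cheap server replicates; an idle component on an expensive server leaves, which is how the system drains load away from overloaded machines.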
Economic model (i)
● The rent of a server is different for each component!
Economic model (ii)
● VM1 and VM2 have an « identical » overall resource usage: 45%
● Server rent = server's resource usage weighted by the component's resource weights
– Rent for C1 @ VM1 > rent for C1 @ VM2
[Figure: C1 (CPU: 30%, I/O: 5%) choosing between VM1 (CPU: 70%, I/O: 20%) and VM2 (CPU: 25%, I/O: 65%); server resources are multiplexed]
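The slide's example can be reproduced with a simple weighted sum. The exact rent formula is not shown in the transcript, so this linear weighting is a sketch of the idea, using the numbers from the figure.

```python
def weighted_rent(server_usage, component_weights):
    """Rent of a server as seen by one component: the server's per-resource
    usage weighted by how much the component relies on each resource
    (illustrative; the paper's exact formula may differ)."""
    return sum(component_weights[r] * server_usage[r] for r in component_weights)

c1 = {"cpu": 0.30, "io": 0.05}    # C1 is CPU-bound
vm1 = {"cpu": 0.70, "io": 0.20}   # both VMs average ~45% usage overall
vm2 = {"cpu": 0.25, "io": 0.65}

# C1's dominant resource (CPU) is less loaded on VM2, so VM2 is cheaper for C1.
rent_vm1 = weighted_rent(vm1, c1)
rent_vm2 = weighted_rent(vm2, c1)
```

This is why two servers with the same average utilization can charge very different rents to the same component.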
Economic model (iii)
● Choosing a candidate server j during replication/migration of a component i
– Net benefit maximization
● 2 optimization goals:
– High availability through geographical diversity of replicas
– Low latency by grouping related components
● gj : weight related to the proximity of server j to the geographical distribution of the client requests to the component
● Si : the set of servers hosting a replica of component i
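A candidate-selection loop in the spirit of the slide might look like the sketch below. The net-benefit formula (geographic weight times expected gain, minus rent) and all names are assumptions; the original equation was an image lost in the transcript.

```python
def choose_candidate(component, servers, hosting):
    """Pick a candidate server j for replicating/migrating component i by
    maximizing a toy net benefit: g_j * expected gain - rent_j. Servers in
    S_i (`hosting`) are skipped so replicas stay geographically diverse.
    Formula and field names are illustrative assumptions."""
    best, best_benefit = None, float("-inf")
    for sid, info in servers.items():
        if sid in hosting:   # S_i: server already hosts a replica of i
            continue
        benefit = info["g"] * component["expected_gain"] - info["rent"]
        if benefit > best_benefit:
            best, best_benefit = sid, benefit
    return best

servers = {
    "s1": {"g": 0.9, "rent": 5.0},  # close to the clients but busy
    "s2": {"g": 0.5, "rent": 2.0},  # cheap but far away
    "s3": {"g": 1.0, "rent": 1.0},  # best, but already hosts a replica
}
best = choose_candidate({"expected_gain": 10.0}, servers, hosting={"s3"})
```

Excluding S_i implements the high-availability goal; the g_j factor implements the low-latency goal.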
SLA Performance Guarantees (i)
● Each component has its own SLA constraints
● SLAs are derived directly from the entry components
● Resp. Time = Service Time + max (Resp. Time of Dependencies)
[Figure: dependency graph of components C1–C5; the entry component C1 carries the end-to-end SLA of 500 ms]
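The response-time recursion above can be computed directly over the dependency graph. The per-component service times below are hypothetical numbers, since the slide only shows the graph and the 500 ms entry SLA.

```python
def response_time(component, service_time, deps):
    """Resp. Time = Service Time + max(Resp. Time of Dependencies),
    exactly the slide's recursion. `deps` maps a component to the
    components it calls; leaves contribute only their service time."""
    children = deps.get(component, [])
    if not children:
        return service_time[component]
    return service_time[component] + max(
        response_time(c, service_time, deps) for c in children
    )

# Hypothetical service times (ms) for the 5-component graph on the slide,
# assuming C1 calls C2 and C3, and C2 calls C4 and C5:
service_time = {"C1": 50, "C2": 100, "C3": 80, "C4": 120, "C5": 60}
deps = {"C1": ["C2", "C3"], "C2": ["C4", "C5"]}
end_to_end = response_time("C1", service_time, deps)
```

With these numbers the end-to-end response time is 50 + (100 + 120) = 270 ms, comfortably inside the 500 ms SLA.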
SLA Performance Guarantees (ii)
● SLA propagation from parents to children
● A parent j sends its performance constraints (e.g. a response-time upper bound) to its dependencies D(j)
● A child i computes its own performance constraints from the groups of constraints sent by the replicas of each parent g
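The propagation step can be sketched as follows. The equations on the original slides were images and are not in the transcript, so both the budget split and the min-aggregation are plausible readings, not the paper's exact formulas.

```python
def propagate(parent_bound_ms, parent_service_time_ms):
    """Bound a parent j sends to its dependencies D(j): the response time
    left over after the parent's own service time (an assumed split)."""
    return parent_bound_ms - parent_service_time_ms

def child_constraint(received_bounds_ms):
    """A child i may receive constraints from several parents (and from
    several replicas of each parent g); taking the minimum, i.e. the
    strictest bound, satisfies all of them."""
    return min(received_bounds_ms)

# C1 (SLA 500 ms, service time 50 ms) propagates 450 ms to its children;
# a child that also receives 300 ms from another parent must meet 300 ms.
budget_from_c1 = propagate(500, 50)
child_budget = child_constraint([budget_from_c1, 300])
```

Aggregating with min is the conservative choice: a child that meets the tightest parent bound automatically meets the looser ones.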
SLA Performance Guarantees (iii)
● SLA propagation from parents to children
Automatic Provisioning
● Usage of allocated resources is maximized:
– Autonomic migration / replication / suicide of components
– Not enough on its own to ensure the end-to-end response time
● Cloud resources are managed by the framework via the cloud API
● Each individual component has to satisfy its own SLA
– SLA easily met -> decrease resources (scale down)
– SLA not met -> increase resources (scale up, scale out)
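The scale-down/scale-up rule can be sketched as a per-component check. The `slack` threshold that defines "easily met" is an illustrative assumption; the slide only gives the two directions.

```python
def provisioning_action(measured_rt_ms, sla_rt_ms, slack=0.5):
    """Toy per-component provisioning decision:
    SLA violated -> acquire resources through the cloud API;
    SLA met with a large margin -> release resources; otherwise keep."""
    if measured_rt_ms > sla_rt_ms:
        return "scale_up_or_out"   # SLA not met: add cores or servers
    if measured_rt_ms < slack * sla_rt_ms:
        return "scale_down"        # SLA easily met: shrink allocation
    return "keep"
```

Keeping a "keep" band between the two thresholds avoids oscillating between scale-up and scale-down on noisy measurements.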
Adaptivity to slow servers
● Each component keeps statistics about its children
– e.g. 95th percentile response time
● A routing coefficient is computed for each child at each epoch
– Send more requests to the better-performing children
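One way to turn the per-child statistics into routing coefficients is shown below. Inverse proportionality to the 95th-percentile response time is an assumption; the talk only says faster children should receive more requests.

```python
def routing_coefficients(p95_rt_ms):
    """Per-child routing weights recomputed each epoch: inversely
    proportional to each child's 95th-percentile response time, then
    normalized to sum to 1, so faster children get more traffic."""
    inv = {child: 1.0 / rt for child, rt in p95_rt_ms.items()}
    total = sum(inv.values())
    return {child: w / total for child, w in inv.items()}

# A child twice as slow receives half the weight:
coeffs = routing_coefficients({"VM1": 100.0, "VM2": 200.0})
```

This softly drains traffic away from a slow VM instead of cutting it off, which matches the gradual degradation seen in Issue #2.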
Evaluation
Evaluation: Setup
● 5 components, mostly CPU-intensive (wc >> wm,wn,wd)
● 8 8-core servers (Intel Core i7 920, 2.67 GHz, 8 GB, Linux 2.6.32-trunk-amd64)
● d=0, C=110, k =10000, xs* = 25%
[Figure: the evaluated 5-component dependency graph (C1–C5); the entry component C1 has an SLA of 500 ms]
Adaptation to Varying Load (i)
● 5 rps to 60 rps at minute 8, step 5 rps/min
● Static setup: 2 servers with 2 cores
Adaptation to Varying Load (ii)
● 5 rps to 60 rps at minute 8, step 5 rps/min
● Static setup: 2 servers with 2 cores
Adaptation to Slow Server
● Max 2 cores/server, 25 rps
● At minute 4, a server gets slower (200 ms delay)
Scalability
● Add 5 rps per minute until 150 rps
● Max 6 cores/server
Conclusion
● Framework for building cloud applications
● Elasticity: add/remove resources
● High Availability: software, hardware, network failures
● Scalability: growing load, peaks, scaling down, ...
– Quick replication of busy components
● Load Balancing : load has to be shared by all available servers
– Replication of busy components
– Migration of less busy components
– Reach equilibrium when load is stable
● SLA performance guarantees
– Automatic provisioning
● No synchronization, fully decentralized
Thank you !