self learning cloud controllers

Self-Learning Cloud Controllers

Pooyan Jamshidi, Amir Sharifloo, Claus Pahl, Andreas Metzger, Giovani Estrada IC4, Dublin City University, Ireland University of Duisburg-Essen, Germany Intel, Ireland

[email protected]

Invited presentation at UFC, Brazil, Fortaleza

~50% = wasted hardware

Actual traffic

Typical weekly traffic to Web-based applications (e.g., Amazon.com)

Problem 1: ~75% wasted capacity Actual

demand

Problem 2: customer lost

Traffic in an unexpected burst in requests (e.g. end of year traffic to Amazon.com)

Really like this??

Auto-scaling enables you to realize this ideal on-demand provisioning

Time

Demand

? Enacting change in the Cloud resources are not real-time

Capacity we can provision with Auto-Scaling

A realistic figure of dynamic provisioning

0 50 1000

500

1000

1500

0 50 100100

200

300

400

500

0 50 1000

1000

2000

0 50 1000

200

400

600

These quantitative values are required to be determined by the user ⇒  requires deep

knowledge of application (CPU, memory, thresholds)

⇒  requires performance modeling expertise (when and how to scale)

⇒  A unified opinion of user(s) is required

Amazon auto scaling Microsoft Azure Watch

9

Microsoft Azure Auto-scaling Application Block

Naeem Esfahani and Sam Malek, “Uncertainty in Self-Adaptive Software Systems”

Pooyan Jamshidi, Aakash Ahmad, Claus Pahl, Muhammad Ali Babar , “Sources of Uncertainty in Dynamic Management of Elastic Systems”, Under Review

Uncertainty related to enactment latency: The same scaling action (adding/removing a VM with precisely the same size) took different time to be enacted on the cloud platform (here is Microsoft Azure) at different points and this difference were significant (up to couple of minutes). The enactment latency would be also different on different cloud platforms.

Ø  Offline benchmarking

Ø  Trial-and-error

Ø  Expert knowledge

Costly and not systematic

A. Gandhi, P. Dube, A. Karve, A. Kochut, L. Zhang, Adaptive, “Model-driven Autoscaling for Cloud Applications”, ICAC’14

arrival rate (req/s)

95% Resp. :me (m

s)

400 ms

60 req/s

RobusT2Scale Initial setting + elasticity rules + response-time SLA

environment monitoring

application monitoring

scaling actions

Fuzzy Reasoning Users

Prediction/Smoothing

0 0.5 1 1.5 2 2.5 30

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Region of definite satisfaction

Region of definite dissatisfaction Region of

uncertain satisfaction

Performance Index

Pos

sibi

lity

Performance Index P

ossi

bilit

y

words can mean different things to different people

Different users often recommend different elasticity policies

0 0.5 1 1.5 2 2.5 30

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Type-2 MF

Type-1 MF

Workload

Response time

0 10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x2

uM

embe

rshi

p gr

ade

0 10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x2u

Mem

bers

hip

grad

e

=>

=>

UMF

LMF

Embedded

FOU

mean

sd

Rule (𝒍)

Antecedents Consequent

𝒄𝒂𝒗𝒈𝒍 Workload Response-‐

time Normal (-‐2)

Effort (-‐1)

Medium Effort (0)

High Effort (+1)

Maximum Effort (+2)

1 Very low Instantaneous 7 2 1 0 0 -‐1.6 2 Very low Fast 5 4 1 0 0 -‐1.4 3 Very low Medium 0 2 6 2 0 0 4 Very low Slow 0 0 4 6 0 0.6 5 Very low Very slow 0 0 0 6 4 1.4 6 Low Instantaneous 5 3 2 0 0 -‐1.3 7 Low Fast 2 7 1 0 0 -‐1.1 8 Low Medium 0 1 5 3 1 0.4 9 Low Slow 0 0 1 8 1 1 10 Low Very slow 0 0 0 4 6 1.6 11 Medium Instantaneous 6 4 0 0 0 -‐1.6 12 Medium Fast 2 5 3 0 0 -‐0.9 13 Medium Medium 0 0 5 4 1 0.6 14 Medium Slow 0 0 1 7 2 1.1 15 Medium Very slow 0 0 1 3 6 1.5 16 High Instantaneous 8 2 0 0 0 -‐1.8 17 High Fast 4 6 0 0 0 -‐1.4 18 High Medium 0 1 5 3 1 0.4 19 High Slow 0 0 1 7 2 1.1 20 High Very slow 0 0 0 6 4 1.4 21 Very high Instantaneous 9 1 0 0 0 -‐1.9 22 Very high Fast 3 6 1 0 0 -‐1.2 23 Very high Medium 0 1 4 4 1 0.5 24 Very high Slow 0 0 1 8 1 1 25 Very high Very slow 0 0 0 4 6 1.6

Rule ()

Antecedents Consequent

Work load

Response

-time -2 -1 0 +1 +2

12 Medium Fast 2 5 3 0 0 -0.9

10 experts’ responses

𝑅↑𝑙 : IF (the workload (𝑥↓1 ) is 𝐹 ↓𝑖↓1  , AND the response-time (𝑥↓2 ) is 𝐺 ↓𝑖↓2  ), THEN (add/remove 𝑐↓𝑎𝑣𝑔↑𝑙  instances).

𝑐↓𝑎𝑣𝑔↑𝑙 = ∑𝑢=1↑𝑁↓𝑙 ▒𝑤↓𝑢↑𝑙 ×𝐶 /∑𝑢=1↑𝑁↓𝑙 ▒𝑤↓𝑢↑𝑙   

Goal: pre-computations of costly calculations to make a runtime efficient elasticity reasoning based on fuzzy inference

Liang, Q., Mendel, J. M. (2000). Interval type-2 fuzzy logic systems: theory and design. Fuzzy Systems, IEEE Transactions on, 8(5), 535-550.

Scaling Actions

Monitoring Data

0 10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.5954

0.3797

𝑀

0 10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.2212

0.0000

0 10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x2

u

0 10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

x2

uMonitoring data

Workload

Response time

0 10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.5954

0.3797

0 10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

10.95680.9377

Performance index

𝑦↓𝑙 , 𝑦↓𝑟 

0 50 1000

500

1000

1500

0 50 100100

200

300

400

500

0 50 1000

1000

2000

0 50 1000

200

400

600

0 50 1000

500

1000

0 50 1000

500

1000

0 10 20 30 40 50 60 70 80 90 100-500

0

500

1000

1500

2000

Time (seconds)

Num

ber o

f hits

Original databetta=0.10, gamma=0.94, rmse=308.1565, rrse=0.79703betta=0.27, gamma=0.94, rmse=209.7852, rrse=0.54504betta=0.80, gamma=0.94, rmse=272.6285, rrse=0.70858

0

0.2

0.4

0.6

0.8

1

1.2

1.4

Big spike Dual phase Large variations Quickly varying Slowly varying Steep tri phase

0 50 1000

500

1000

1500

0 50 100100

200

300

400

500

0 50 1000

1000

2000

0 50 1000

200

400

6000 50 1000

500

1000

0 50 1000

500

1000

Roo

t Rel

ativ

e S

quar

ed E

rror

SUT Criteria Big spike

Dual phase

Large variations

Quickly varying

Slowly varying

Steep tri phase

RobusT2Scale 973ms 537ms 509ms 451ms 423ms 498ms

3.2 3.8 5.1 5.3 3.7 3.9

Overprovisioning

354ms 411ms 395ms 446ms 371ms 491ms

6 6 6 6 6 6

Under provisioning

1465ms 1832ms 1789ms 1594ms 1898ms 2194ms

2 2 2 2 2 2

SLA: 𝒓𝒕↓𝟗𝟓 ≤𝟔𝟎𝟎𝒎𝒔 For every 10s control interval

• RobusT2Scale is superior to under-provisioning in terms of guaranteeing the SLA and does not require excessive resources • RobusT2Scale is superior to over-provisioning in terms of guaranteeing required resources while guaranteeing the SLA

0

0.02

0.04

0.06

0.08

0.1

alpha=0.1 alpha=0.5 alpha=0.9 alpha=1.0

Roo

t Mea

n S

quar

e E

rror

Noise level: 10%

Self-Learning Controller

Current Solution (my PhD Thesis + IC4 work on auto-scaling)

Future Plan Updating K in MAPE-K @ Runtime Design-time

Assistance

Multi-cloud

Fuzzifier

Inference Engine

Defuzzifier Rule base

Fuzzy Q-learning

Cloud Application Monitoring Actuator

Cloud Platform

Fuzzy Logic Controller

Knowledge Learning

Aut

onom

ic C

ontr

olle

r

𝑟𝑡 𝑤

𝑤, 𝑟𝑡, 𝑡ℎ, 𝑣𝑚

𝑠𝑎

system state system goal

RobusT2Scale

Learned rules

FQL

Monitoring Actuator

Cloud Platform

.fis

LW

W

ElasticBench

𝑤, 𝑟𝑡


𝑠𝑎

Load Generator

C

system state

WCF

REST

𝛾,𝜂,𝜀,𝑟

Cloud Platform (PaaS) On-Premise

P: Worker

Role

L: Web Role

P: Worker

Role

P: Worker

Role

Cache

M: Worker

Role

Results:

Storage

Blackboard: Storage

LG: Console

Auto-scaling Logic (controller)

Policy Enforcer

1 2 3

7 8

11 12

4

10 9

LB: Load

Balancer

6

5 Queue

Actuator

Monitoring

0

0.2

0.4

0.6

0.8

1

1.2

0 4 12

15

21

26

33

39

47

53

60

68

76

83

90

95

103

110

118

124

130

136

145

152

159

166

171

179

185

193

200

206

214

218

219

228

236

242

249

255

263

267

271

275

283

290

295

303

307

314

320

S1 S2 S3 S4 S5

0

0.5

1

1.5

2

2.5

0 50 100 150 200 250 300 350

q(9,5)

0 0.02 0.04 0.06 0.08

0.1 0.12 0.14 0.16 0.18

0.2

0 50 100 150 200 250 300 350

q(7,5)

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

1.2

0 50 100 150 200 250 300 350

q(9,3)

-0.2

-0.1

0

0.1

0.2

0.3

0.4

0.5

0 50 100 150 200 250 300 350

q(3,3)

0

1

2

3

4

5

6

7

8

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

1

2

3

4

Challenge 1: ~ 75% wasted capacityA c t u a l

d e m a n d

Challenge 2: customer lost

Fuzzifier

Inference Eng ine

DefuzzifierRule base

FuzzyQ-‐learning

Cloud ApplicationMonitoring Actuator

Cloud Platform

Fuzzy Log ic Controller

Knowledge Learning

Auton

omic Con

troller

𝑟𝑡𝑤

𝑤,𝑟𝑡,𝑡ℎ,𝑣𝑚

𝑠𝑎

system state system goal

RobusT2Scale

Learned rules

FQL

Monitoring Actuator

Cloud Platform

.fis

LW

W

ElasticBench

𝑤, 𝑟𝑡


𝑠𝑎

Load Generator

C

system state

WCF

REST

𝛾, 𝜂, 𝜀, 𝑟

http://computing.dcu.ie/~pjamshidi/PDF/SEAMS2014.pdf

More Details? => http://www.slideshare.net/pooyanjamshidi/

Slides?

=>

Thank you!

Aakash Ahmad Claus Pahl Soodeh Farokhi Amir Sharifloo Armin Balalaie

https://github.com/pooyanjamshidi

Code? =>

Hamed Jamshidi

Nabor Mendonca

Brian Carroll Reza Teimourzadegan

self learning cloud controllers

Software

different time

different cloud platforms

different things

different points

different elasticity

cloud applications

medium effort

mf workload response