Monitoring Latency Sensitive Enterprise Applications on the Cloud
Shankar Narayanan, Ashiwan Sivakumar
Enterprise Applications (EA): StockTrader Benchmark Application
• Database (DB)
• Business Service (BS)
• Front End (FE)
• Configuration Service (CS)
• Order Processing Service (OS)
EA as Services

[Diagram: users reach replicated service instances (FE, BS, OS, DB) through load balancers sitting at the service endpoints between each tier]
EA Characteristics

Notice the dynamic and distributed nature of cloud deployments.
Reducing user-observed latency is the goal – monitor this!

EA property     Relevant cloud characteristic
Scalability     Dynamic deployment sizes
Availability    Geo-redundancy
Economics       Pay-as-you-use
Elasticity      Decoupled services
Low latency     Deploy closer to user groups
Utilization     Load balancing
Performance Variation: Time Series and CDF of DB Latency

[Figure: time series and CDF of DB latency, a 4-hour data snapshot spanning both days]
Monitoring Framework – Design Goals
• Resilience: less sensitive to cloud variability
• Scalability: capable of scaling with component instances
• Portability: easy to integrate with applications
• Flexibility: multiple levels of measurement
  – User-level latency
  – Component-level isolation
• Efficiency: fast and accurate measurements
Why Is Monitoring Hard?
• Dynamic environment – the number of components changes
• Distributed deployment – needs a collection framework
• Variable request path – different choices of components
Existing monitoring tools:
• Do not support service-oriented architectures
• Too detailed
• Not scalable
Remember: user-observed latency is our goal. Abstract away unnecessary details!
Measuring End-points – Existing Tools

[Diagram: a user request flowing through FE, BS, and DB with 13 numbered measurement points along the path, labeled HTTP Request, HTTP Response, SOAP Response, and MySQL Replies. Takeaway: aggregate!]
Measurement Model

[Diagram: component C_i calls C_{i+1}, which in turn calls C_{i+2}; each call and reply is timestamped, giving link times T_{i,i+1} and T_{i+1,i+2} and end-to-end times T_{i,i+2}, with primed values T'_{i,i+2}, T''_{i,i+2}, ... marking repeated calls on the same path]

CL_i = component latency of the i-th component
LL_{i,i+1} = link latency across components i and i+1
N = number of components C_i communicates with
n_j = number of calls made by C_i to each of the j components
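To make the model concrete, the following is a minimal Java sketch of the decomposition, assuming synchronous calls, synchronized clocks, and four timestamps per call (request out, request in, response out, response in). The class and method names are illustrative, not the framework's API.

```java
// Minimal sketch of the latency decomposition, assuming each call records
// four timestamps: reqOut (call sent), reqIn (call received downstream),
// respOut (reply sent), respIn (reply received). Names are illustrative.
import java.util.List;

final class CallRecord {
    long reqOut, reqIn, respOut, respIn;  // epoch millis at each hop
    CallRecord(long reqOut, long reqIn, long respOut, long respIn) {
        this.reqOut = reqOut; this.reqIn = reqIn;
        this.respOut = respOut; this.respIn = respIn;
    }
    // Time spent inside the downstream component (CL_{i+1})
    long componentLatency() { return respOut - reqIn; }
    // Round-trip time on the link between the two components (LL_{i,i+1})
    long linkLatency() { return (respIn - reqOut) - componentLatency(); }
}

public class MeasurementModel {
    // CL_i for the caller: its total residence time minus the time it was
    // blocked on the n_j calls it made to its N downstream components.
    static long callerLatency(long requestIn, long responseOut,
                              List<CallRecord> downstreamCalls) {
        long blocked = downstreamCalls.stream()
                .mapToLong(c -> c.respIn - c.reqOut).sum();
        return (responseOut - requestIn) - blocked;
    }

    public static void main(String[] args) {
        CallRecord dbCall = new CallRecord(100, 105, 150, 156);
        System.out.println("CL_db = " + dbCall.componentLatency() + " ms"); // 45
        System.out.println("LL    = " + dbCall.linkLatency() + " ms");      // 11
        System.out.println("CL_bs = "
            + callerLatency(90, 170, List.of(dbCall)) + " ms");             // 24
    }
}
```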
Monitoring Framework Architecture

[Diagram: each instrumented application component writes raw logs to local storage through a local log server; a notification queue signals the global collector, which pulls the aggregated logs from every deployment]
Outline
• Monitoring tool
  – Collection framework
  – Instrumentation framework
The Collection Framework
• Each component writes to local storage
• The front end sends a "done" message to the local queue
• Queues: decouple producer and consumer entities
• Storage: persistence, no limit on size
• Both: scalable, robust
Question: Why is this the right model? When in doubt, measure! (See the sketch below.)
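A minimal sketch of the producer side of this model, with hypothetical LocalStore and LocalQueue interfaces standing in for the cloud provider's blob storage and queue SDKs:

```java
// Producer side of the collection framework: every component persists its
// raw log locally; only the front end drops a small "done" notification on
// the queue so the global collector knows a complete request is ready.
// LocalStore and LocalQueue are illustrative, not a real provider API.
import java.util.UUID;

interface LocalStore { void write(String key, byte[] rawLog); }
interface LocalQueue { void enqueue(String message); }

public class CollectionProducer {
    private final LocalStore store;
    private final LocalQueue queue;

    CollectionProducer(LocalStore store, LocalQueue queue) {
        this.store = store;
        this.queue = queue;
    }

    // Called by every component: logs go to cheap, unbounded storage.
    void persistLog(String requestId, String component, byte[] rawLog) {
        store.write(requestId + "/" + component, rawLog);
    }

    // Called only by the front end, once per user request.
    void signalDone(String requestId) {
        queue.enqueue("done:" + requestId);
    }

    public static void main(String[] args) {
        LocalStore store = (k, v) -> System.out.println("stored " + k);
        LocalQueue queue = m -> System.out.println("enqueued " + m);
        CollectionProducer p = new CollectionProducer(store, queue);
        String requestId = UUID.randomUUID().toString();
        p.persistLog(requestId, "FE", "t0=...,t3=...".getBytes());
        p.signalDone(requestId);
    }
}
```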
Alternative Model
• All components write to the queue
• The collection framework de-queues and forms a P2P network to collate the data
Experiments on Azure and EC2
• Experiments evaluating the performance of storage and queues
• Real cloud deployments (Microsoft Azure, Amazon AWS)
• Extensive measurements from all data centers:
  – US (East/West/North/South)
  – Europe (West/Central)
  – Asia (East/Southeast)
Performance of Storage and Queues

[Figure: CDFs of Write Q, Read Q, and Write Store latencies; left panel Microsoft Azure, right panel Amazon AWS]

• Measurements made in all 12 datacenter regions (Azure and AWS)
• Experiment length: 24–26 hours
• Approx. 100,000 requests to storage, 16,000 requests to the queues
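A minimal sketch of the kind of timing loop behind such CDFs, with a hypothetical QueueClient standing in for the provider SDK's queue client:

```java
// Micro-benchmark sketch: repeatedly time one operation (here a queue
// write) against a cloud endpoint and keep the raw samples; sorting the
// samples gives the empirical CDF directly. QueueClient is illustrative.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

interface QueueClient { void send(String message); }

public class QueueLatencyBench {
    static List<Long> measure(QueueClient queue, int requests) {
        List<Long> samplesMicros = new ArrayList<>(requests);
        for (int i = 0; i < requests; i++) {
            long start = System.nanoTime();
            queue.send("probe-" + i);
            samplesMicros.add((System.nanoTime() - start) / 1_000);
        }
        Collections.sort(samplesMicros);  // sorted samples ~ empirical CDF
        return samplesMicros;
    }

    public static void main(String[] args) {
        QueueClient fake = m -> { /* network call elided */ };
        List<Long> cdf = measure(fake, 1_000);
        System.out.println("p50 = " + cdf.get(cdf.size() / 2) + " us");
        System.out.println("p99 = " + cdf.get((int) (cdf.size() * 0.99)) + " us");
    }
}
```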
Outline
• Monitoring tool
  – Collection framework
  – Instrumentation framework
Instrumentation Framework – Goals
• Minimize coding effort and intervention
• Measure latency at the granularity of a user request
• Automate instrumentation as much as possible
• Generate minimal measurement parameters
Comparison of Existing Tools
Instrumentation Framework

[Diagram: the original application component plus aspects yields the instrumented application component. Inputs: a specification of the application end-points (X-Trace: log events), a measurement metric specification (X-Trace: metadata), and log format specifications]
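A minimal AspectJ-style sketch of how such an end-point specification can be woven around service methods without touching application code; the pointcut pattern and LogClient are assumptions, not the framework's actual specification files:

```java
// Weaving latency measurement around service end-points with AspectJ.
// The pointcut pattern, package name, and LogClient are illustrative.
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class EndpointLatencyAspect {

    private final LogClient log = new LogClient();  // hypothetical local log server client

    // End-point specification: every public method in the service layer.
    @Around("execution(public * com.example.trader.service..*(..))")
    public Object timeEndpoint(ProceedingJoinPoint jp) throws Throwable {
        long start = System.nanoTime();
        try {
            return jp.proceed();  // run the original end-point unchanged
        } finally {
            long micros = (System.nanoTime() - start) / 1_000;
            // One log event per call, tagged so that records belonging to
            // the same user request can be stitched together later.
            log.event(jp.getSignature().toShortString(), micros);
        }
    }
}

// Hypothetical client that forwards events to the local log server.
class LogClient {
    void event(String endpoint, long micros) {
        System.out.println(endpoint + " took " + micros + " us");
    }
}
```

Because the pointcut captures every service end-point, adding or removing component instances requires no change to the measurement code, which is what makes this step automatable.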
Experiment Set-up
• Deployed two similar benchmark applications:
  – DayTrader on Amazon AWS
  – StockTrader on Windows Azure (prior work)
• Deployed the collection framework on AWS and Azure
• User sessions and request patterns from the DaCapo benchmark suite
• Instrumentation:
  – Automated using aspects: DayTrader (AWS)
  – Custom coded: DayTrader and StockTrader
Aggregation Benefit: DayTrader

User request type   Storage writes without aggregation   Storage writes with aggregation
                    FE        BS                          FE        BS
Login               3         5                           1         1
Portfolio           10        10                          1         1
Update profile      4         5                           1         1
Home                2         2                           1         1
Buy                 1         7                           1         1
Sell                1         8                           1         1
Account             3         3                           1         1
Total               24        40                          7         7

• User sessions: 20, one every 10 seconds
• Results shown for a random user from DaCapo
• Writes reduced by 78% in the above case (64 writes become 14)
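A minimal sketch of the aggregation step on the local log server, assuming an in-memory buffer keyed by request id and a hypothetical Store interface:

```java
// How the local log server turns N per-call log records into a single
// storage write per user request. Buffering policy and Store interface
// are illustrative assumptions, not the framework's actual design.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface Store { void write(String key, String blob); }

public class AggregatingLogServer {
    private final Map<String, List<String>> pending = new HashMap<>();
    private final Store store;

    AggregatingLogServer(Store store) { this.store = store; }

    // Each instrumented call produces one record; buffer it in memory.
    void record(String requestId, String logLine) {
        pending.computeIfAbsent(requestId, k -> new ArrayList<>()).add(logLine);
    }

    // When the request completes, all its records become one write.
    void complete(String requestId) {
        List<String> lines = pending.remove(requestId);
        if (lines != null) {
            store.write(requestId, String.join("\n", lines));
        }
    }

    public static void main(String[] args) {
        AggregatingLogServer s =
            new AggregatingLogServer((k, b) -> System.out.println("1 write: " + k));
        for (int i = 0; i < 10; i++) s.record("req-42", "hop " + i);  // 10 records
        s.complete("req-42");                                         // 1 storage write
    }
}
```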
Aggregation Benefit: MedRec Application Suite

Application     Storage writes without aggregation   Storage writes with aggregation
                FE        BS                          FE        BS
MedRec App      4         8                           1         1
Physician App   8         15                          1         1
Admin App       2         5                           1         1

• Storage writes reduced by at least 50% from FE and at least 80% from BS
Instrumentation Benefit

Category      Handcrafted code (# of files)   X-Trace with aspects (# of files)
same          15250 (88)                      15250 (92)
modified      593 (74)                        465 (70)
added         878 (0)                         166 (2)
automatable   0 (0)                           166 (2)

• FE component code: automatable using aspects with X-Trace
• Cross-component calls: the X-Trace object is passed as a parameter (see the sketch below)
• New lines of code reduced by ~80%
• SLOC reduced by ~20%
• Aspects can be automated
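A minimal sketch of passing trace metadata across a cross-component call, with an illustrative XTraceMetadata record standing in for X-Trace's own metadata type:

```java
// Threading X-Trace-style metadata through a cross-component call so that
// log events on both sides share a task identifier. XTraceMetadata and
// the service signature are illustrative, not the real X-Trace library.
public class CrossComponentCall {

    // Hypothetical stand-in for X-Trace metadata: task id + event counter.
    record XTraceMetadata(String taskId, long parentEventId) {
        XTraceMetadata nextEvent() {
            return new XTraceMetadata(taskId, parentEventId + 1);
        }
    }

    // BS-side end-point: accepts the metadata as an extra parameter.
    static double getQuote(String symbol, XTraceMetadata xtr) {
        System.out.println("[" + xtr.taskId() + "] BS.getQuote(" + symbol + ")");
        return 42.0;
    }

    public static void main(String[] args) {
        // FE creates the metadata when the user request arrives...
        XTraceMetadata xtr = new XTraceMetadata("task-7f3a", 0);
        // ...and passes it along on every downstream call it makes.
        double price = getQuote("IBM", xtr.nextEvent());
        System.out.println("[" + xtr.taskId() + "] FE got price " + price);
    }
}
```

Passing the metadata explicitly is the one manual step, since it changes method signatures across components, which is reflected in the "modified" row of the table above.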
Future Work
• Scaling the framework
  – Ratio of application scale to framework scale
  – Per datacenter? Per VM? Does it vary per cloud provider?
  – Impact of these design decisions on the sensitivity of the framework
Conclusions
• Architectural benefits:
  – Generic across applications, number of components, and access patterns
  – Scalable: decoupled entities
• Aggregation benefits:
  – N writes to storage become one write
  – The log server offloads work from the application
• Instrumentation benefits:
  – Easy to integrate with applications
  – New lines of code reduced by ~80%
  – SLOC reduced by ~20%
Q & A
Backup Slides
Azure Blob Read and Write Latency

[Figure: blob read and write latencies are at least 30–40 ms]
Azure Queue Read and Write Latency

[Figure: queue reads are costly; queue writes are comparable to blob writes]
SQL Azure Performance Issue Snapshot (6 Days)