Monitoring Latency Sensitive Enterprise Applications on the Cloud
Shankar Narayanan, Ashiwan Sivakumar
Enterprise Applications (EA): StockTrader Benchmark Application
• Database (DB)
• Business Service (BS)
• Front End (FE)
• Configuration Service (CS)
• Order Processing Service (OS)
EA as Services

[Diagram: users reach replicated service instances (FE, BS, OS, DB) through load balancers sitting at the service endpoints between each tier]
EA Characteristics

Notice the dynamic and distributed nature of cloud deployments.
Reducing user-observed latency is the goal – monitor this!

EA property     Relevant cloud characteristic
Scalability     Dynamic deployment sizes
Availability    Geo-redundancy
Economics       Pay-as-you-use
Elasticity      Decoupled services
Low latency     Deploy closer to user groups
Utilization     Load balancing
Performance Variation: Time Series and CDF of DB Latency

[Figure: time series and CDF of DB latency, a 4-hour data snapshot spanning both days]
Monitoring Framework – Design Goals
• Resilience: less sensitive to cloud variability
• Scalability: capable of scaling with component instances
• Portability: easy to integrate with applications
• Flexibility: multiple levels of measurement
  – User-level latency
  – Component-level isolation
• Efficiency: fast and accurate measurements
Why Is Monitoring Hard?
• Dynamic environment – the number of components changes
• Distributed deployment – needs a collection framework
• Variable request path – different choices of components
Existing monitoring tools:
• Do not support service-oriented architectures
• Too detailed
• Not scalable
Remember: user-observed latency is our goal. Abstract away unnecessary details!
Measuring End-points – Existing Tools

[Diagram: a user request flowing through FE, BS, and DB with 13 numbered measurement points along the path, labeled HTTP Request, HTTP Response, SOAP Response, and MySQL Replies. Takeaway: aggregate!]
Measurement Model

[Diagram: component C_i calls C_{i+1}, which in turn calls C_{i+2}; each call and reply is timestamped, giving link times T_{i,i+1} and T_{i+1,i+2} and end-to-end times T_{i,i+2}, with primed values T'_{i,i+2}, T''_{i,i+2}, ... marking repeated calls on the same path]

CL_i = component latency of the i-th component
LL_{i,i+1} = link latency across components i and i+1
N = number of components C_i communicates with
n_j = number of calls made by C_i to each of the j components
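To make the model concrete, the following is a minimal Java sketch of the decomposition, assuming synchronous calls, synchronized clocks, and four timestamps per call (request out, request in, response out, response in). The class and method names are illustrative, not the framework's API.

```java
// Minimal sketch of the latency decomposition, assuming each call records
// four timestamps: reqOut (call sent), reqIn (call received downstream),
// respOut (reply sent), respIn (reply received). Names are illustrative.
import java.util.List;

final class CallRecord {
    long reqOut, reqIn, respOut, respIn;  // epoch millis at each hop
    CallRecord(long reqOut, long reqIn, long respOut, long respIn) {
        this.reqOut = reqOut; this.reqIn = reqIn;
        this.respOut = respOut; this.respIn = respIn;
    }
    // Time spent inside the downstream component (CL_{i+1})
    long componentLatency() { return respOut - reqIn; }
    // Round-trip time on the link between the two components (LL_{i,i+1})
    long linkLatency() { return (respIn - reqOut) - componentLatency(); }
}

public class MeasurementModel {
    // CL_i for the caller: its total residence time minus the time it was
    // blocked on the n_j calls it made to its N downstream components.
    static long callerLatency(long requestIn, long responseOut,
                              List<CallRecord> downstreamCalls) {
        long blocked = downstreamCalls.stream()
                .mapToLong(c -> c.respIn - c.reqOut).sum();
        return (responseOut - requestIn) - blocked;
    }

    public static void main(String[] args) {
        CallRecord dbCall = new CallRecord(100, 105, 150, 156);
        System.out.println("CL_db = " + dbCall.componentLatency() + " ms"); // 45
        System.out.println("LL    = " + dbCall.linkLatency() + " ms");      // 11
        System.out.println("CL_bs = "
            + callerLatency(90, 170, List.of(dbCall)) + " ms");             // 24
    }
}
```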
Monitoring Framework Architecture

[Diagram: each instrumented application component writes raw logs to local storage through a local log server; a notification queue signals the global collector, which pulls the aggregated logs from every deployment]
Outline
• Monitoring tool
  – Collection framework
  – Instrumentation framework
The Collection Framework
• Each component writes to local storage
• The front end sends a "done" message to the local queue
• Queues: decouple producer and consumer entities
• Storage: persistence, no limit on size
• Both: scalable, robust
Question: Why is this the right model? When in doubt, measure! (See the sketch below.)
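A minimal sketch of the producer side of this model, with hypothetical LocalStore and LocalQueue interfaces standing in for the cloud provider's blob storage and queue SDKs:

```java
// Producer side of the collection framework: every component persists its
// raw log locally; only the front end drops a small "done" notification on
// the queue so the global collector knows a complete request is ready.
// LocalStore and LocalQueue are illustrative, not a real provider API.
import java.util.UUID;

interface LocalStore { void write(String key, byte[] rawLog); }
interface LocalQueue { void enqueue(String message); }

public class CollectionProducer {
    private final LocalStore store;
    private final LocalQueue queue;

    CollectionProducer(LocalStore store, LocalQueue queue) {
        this.store = store;
        this.queue = queue;
    }

    // Called by every component: logs go to cheap, unbounded storage.
    void persistLog(String requestId, String component, byte[] rawLog) {
        store.write(requestId + "/" + component, rawLog);
    }

    // Called only by the front end, once per user request.
    void signalDone(String requestId) {
        queue.enqueue("done:" + requestId);
    }

    public static void main(String[] args) {
        LocalStore store = (k, v) -> System.out.println("stored " + k);
        LocalQueue queue = m -> System.out.println("enqueued " + m);
        CollectionProducer p = new CollectionProducer(store, queue);
        String requestId = UUID.randomUUID().toString();
        p.persistLog(requestId, "FE", "t0=...,t3=...".getBytes());
        p.signalDone(requestId);
    }
}
```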
Alternative Model
• All components write to the queue
• The collection framework de-queues and forms a P2P network to collate the data
Experiments on Azure and EC2
• Experiments evaluating the performance of storage and queues
• Real cloud deployments (Microsoft Azure, Amazon AWS)
• Extensive measurements from all data centers:
  – US (East/West/North/South)
  – Europe (West/Central)
  – Asia (East/Southeast)
Performance of Storage and Queues

[Figure: CDFs of Write Q, Read Q, and Write Store latencies; left panel Microsoft Azure, right panel Amazon AWS]

• Measurements made in all 12 datacenter regions (Azure and AWS)
• Experiment length: 24–26 hours
• Approx. 100,000 requests to storage, 16,000 requests to the queues
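A minimal sketch of the kind of timing loop behind such CDFs, with a hypothetical QueueClient standing in for the provider SDK's queue client:

```java
// Micro-benchmark sketch: repeatedly time one operation (here a queue
// write) against a cloud endpoint and keep the raw samples; sorting the
// samples gives the empirical CDF directly. QueueClient is illustrative.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

interface QueueClient { void send(String message); }

public class QueueLatencyBench {
    static List<Long> measure(QueueClient queue, int requests) {
        List<Long> samplesMicros = new ArrayList<>(requests);
        for (int i = 0; i < requests; i++) {
            long start = System.nanoTime();
            queue.send("probe-" + i);
            samplesMicros.add((System.nanoTime() - start) / 1_000);
        }
        Collections.sort(samplesMicros);  // sorted samples ~ empirical CDF
        return samplesMicros;
    }

    public static void main(String[] args) {
        QueueClient fake = m -> { /* network call elided */ };
        List<Long> cdf = measure(fake, 1_000);
        System.out.println("p50 = " + cdf.get(cdf.size() / 2) + " us");
        System.out.println("p99 = " + cdf.get((int) (cdf.size() * 0.99)) + " us");
    }
}
```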
Outline
• Monitoring tool
  – Collection framework
  – Instrumentation framework
Instrumentation Framework – Goals
• Minimize coding effort and intervention
• Measure latency at the granularity of a user request
• Automate instrumentation as much as possible
• Generate minimal measurement parameters
Comparison of Existing Tools
Instrumentation Framework

[Diagram: the original application component plus aspects yields the instrumented application component. Inputs: a specification of the application end-points (X-Trace: log events), a measurement metric specification (X-Trace: metadata), and log format specifications]
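A minimal AspectJ-style sketch of how such an end-point specification can be woven around service methods without touching application code; the pointcut pattern and LogClient are assumptions, not the framework's actual specification files:

```java
// Weaving latency measurement around service end-points with AspectJ.
// The pointcut pattern, package name, and LogClient are illustrative.
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class EndpointLatencyAspect {

    private final LogClient log = new LogClient();  // hypothetical local log server client

    // End-point specification: every public method in the service layer.
    @Around("execution(public * com.example.trader.service..*(..))")
    public Object timeEndpoint(ProceedingJoinPoint jp) throws Throwable {
        long start = System.nanoTime();
        try {
            return jp.proceed();  // run the original end-point unchanged
        } finally {
            long micros = (System.nanoTime() - start) / 1_000;
            // One log event per call, tagged so that records belonging to
            // the same user request can be stitched together later.
            log.event(jp.getSignature().toShortString(), micros);
        }
    }
}

// Hypothetical client that forwards events to the local log server.
class LogClient {
    void event(String endpoint, long micros) {
        System.out.println(endpoint + " took " + micros + " us");
    }
}
```

Because the pointcut captures every service end-point, adding or removing component instances requires no change to the measurement code, which is what makes this step automatable.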
Experiment Set-up
• Deployed two similar benchmark applications:
  – DayTrader on Amazon AWS
  – StockTrader on Windows Azure (prior work)
• Deployed the collection framework on AWS and Azure
• User sessions and request patterns from the DaCapo benchmark suite
• Instrumentation:
  – Automated using aspects: DayTrader (AWS)
  – Custom coded: DayTrader and StockTrader
Aggregation Benefit: DayTrader

User request type   Storage writes without aggregation   Storage writes with aggregation
                    FE        BS                          FE        BS
Login               3         5                           1         1
Portfolio           10        10                          1         1
Update profile      4         5                           1         1
Home                2         2                           1         1
Buy                 1         7                           1         1
Sell                1         8                           1         1
Account             3         3                           1         1
Total               24        40                          7         7

• User sessions: 20, one every 10 seconds
• Results shown for a random user from DaCapo
• Writes reduced by 78% in the above case (64 writes become 14)
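A minimal sketch of the aggregation step on the local log server, assuming an in-memory buffer keyed by request id and a hypothetical Store interface:

```java
// How the local log server turns N per-call log records into a single
// storage write per user request. Buffering policy and Store interface
// are illustrative assumptions, not the framework's actual design.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface Store { void write(String key, String blob); }

public class AggregatingLogServer {
    private final Map<String, List<String>> pending = new HashMap<>();
    private final Store store;

    AggregatingLogServer(Store store) { this.store = store; }

    // Each instrumented call produces one record; buffer it in memory.
    void record(String requestId, String logLine) {
        pending.computeIfAbsent(requestId, k -> new ArrayList<>()).add(logLine);
    }

    // When the request completes, all its records become one write.
    void complete(String requestId) {
        List<String> lines = pending.remove(requestId);
        if (lines != null) {
            store.write(requestId, String.join("\n", lines));
        }
    }

    public static void main(String[] args) {
        AggregatingLogServer s =
            new AggregatingLogServer((k, b) -> System.out.println("1 write: " + k));
        for (int i = 0; i < 10; i++) s.record("req-42", "hop " + i);  // 10 records
        s.complete("req-42");                                         // 1 storage write
    }
}
```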
Aggregation Benefit: MedRec Application Suite

Application     Storage writes without aggregation   Storage writes with aggregation
                FE        BS                          FE        BS
MedRec App      4         8                           1         1
Physician App   8         15                          1         1
Admin App       2         5                           1         1

• Storage writes reduced by at least 50% from FE and at least 80% from BS
Instrumentation Benefit

Category      Handcrafted code (# of files)   X-Trace with aspects (# of files)
same          15250 (88)                      15250 (92)
modified      593 (74)                        465 (70)
added         878 (0)                         166 (2)
automatable   0 (0)                           166 (2)

• FE component code: automatable using aspects with X-Trace
• Cross-component calls: the X-Trace object is passed as a parameter (see the sketch below)
• New lines of code reduced by ~80%
• SLOC reduced by ~20%
• Aspects can be automated
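A minimal sketch of passing trace metadata across a cross-component call, with an illustrative XTraceMetadata record standing in for X-Trace's own metadata type:

```java
// Threading X-Trace-style metadata through a cross-component call so that
// log events on both sides share a task identifier. XTraceMetadata and
// the service signature are illustrative, not the real X-Trace library.
public class CrossComponentCall {

    // Hypothetical stand-in for X-Trace metadata: task id + event counter.
    record XTraceMetadata(String taskId, long parentEventId) {
        XTraceMetadata nextEvent() {
            return new XTraceMetadata(taskId, parentEventId + 1);
        }
    }

    // BS-side end-point: accepts the metadata as an extra parameter.
    static double getQuote(String symbol, XTraceMetadata xtr) {
        System.out.println("[" + xtr.taskId() + "] BS.getQuote(" + symbol + ")");
        return 42.0;
    }

    public static void main(String[] args) {
        // FE creates the metadata when the user request arrives...
        XTraceMetadata xtr = new XTraceMetadata("task-7f3a", 0);
        // ...and passes it along on every downstream call it makes.
        double price = getQuote("IBM", xtr.nextEvent());
        System.out.println("[" + xtr.taskId() + "] FE got price " + price);
    }
}
```

Passing the metadata explicitly is the one manual step, since it changes method signatures across components, which is reflected in the "modified" row of the table above.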
Future Work
• Scaling the framework
  – Ratio of application scale to framework scale
  – Per datacenter? Per VM? Does it vary per cloud provider?
  – Impact of these design decisions on the sensitivity of the framework
Conclusions
• Architectural benefits:
  – Generic across applications, number of components, and access patterns
  – Scalable: decoupled entities
• Aggregation benefits:
  – N writes to storage become one write
  – The log server offloads work from the application
• Instrumentation benefits:
  – Easy to integrate with applications
  – New lines of code reduced by ~80%
  – SLOC reduced by ~20%
Q & A
Backup Slides
Azure Blob Read and Write Latency

[Figure: blob read and write latencies are at least 30–40 ms]
Azure Queue Read and Write Latency

[Figure: queue reads are costly; queue writes are comparable to blob writes]
SQL Azure Performance Issue Snapshot (6 Days)