Troubleshooting App Health and Performance with PCF Metrics 1.2
TRANSCRIPT
PCF Metrics – App Dev
Providing App Developers insight into app performance
Pieter Humphrey, Allen Duet
Gartner believes that more than 80% of all mission-critical IT service outages result
from people and process errors and failures, and of those outages, more than
50% result from a lack of coordination between change, release and configuration
management processes.
Four Steps to Optimize Configuration Management Process and Tools, By Ronni J. Colville, Doc #G00258557 Oct 2013
Modern infrastructure is constantly changing
Methodologies Deployment
Sparingly at designated times
Ready for prod at any time
Architecture Technologies Operations
App Server on Machine
Containers, Public / Private /
Hybrid Cloud
Monolithic App
Microservices / Composite app
Linear / Sequential
Agile, DevOps
CI / CD Pipelines
Many tools, ad hoc automation
Manage services, not servers
Rate of change is driving more outages
Outages often preventable using automation
2015
Facebook: 1 hour, Jan 26th. Config / app / net failures
Apple App Store: 11 hours, March 11th. Internal DNS error
NYSE, United, WSJ: 4 hr, 1.5 hr, 1 hr, July 8th. Software update, routing failure, server overload
UltraDNS: 2.5 hours, Oct 15th. Configuration Errors
https://blog.thousandeyes.com/top-internet-outages-2015/
http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=2
http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=4
http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=8
“25% of customers will abandon a web page that takes more than 4 seconds to load”
“47% of consumers expect a web page to load in < 2 seconds”
“Customers prefer a competitor’s website if it is 250 ms faster”
“Increase revenue 1% for each 100ms improvement”
Sources: Gartner, Google, Amazon, Walmart
Speed and Availability Matter
Speed, Performance and Human Perception
Delay time: User Reaction
0 - 100 ms: Instant
100 - 300 ms: Feels sluggish
300 - 1000 ms: Machine is working..
1 second +: Mental context switch
10 seconds +: I’ll come back later ..
Stay under 250 ms to feel "fast".
Stay under 1000 ms to keep users' attention.
Breaking the 1000 ms Mobile Barrier - Velocity - Google Slides
https://docs.google.com/presentation/d/1wAxB5DPN-rcelwbGO6lCOus_S1rP24LMqA8m1eXEDRo/present?slide=id.p19
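The delay bands above can be turned into a small lookup, which is handy when labeling measured response times. This is a minimal sketch only: the function name `user_reaction` is hypothetical, and treating each band as a half-open range (so exactly 100 ms counts as "Feels sluggish") is an assumption, since the slide's ranges touch at their edges.

```python
from bisect import bisect_right

# Upper bounds (in ms) of the first four bands from the slide's table.
# Assumption: each band is half-open, [low, high).
BOUNDS_MS = [100, 300, 1000, 10_000]
REACTIONS = [
    "Instant",
    "Feels sluggish",
    "Machine is working..",
    "Mental context switch",
    "I'll come back later ..",
]


def user_reaction(delay_ms: float) -> str:
    """Map a response delay in milliseconds to the slide's user reaction."""
    if delay_ms < 0:
        raise ValueError("delay must be non-negative")
    # bisect_right counts how many band boundaries the delay has reached,
    # which is exactly the index of the matching reaction.
    return REACTIONS[bisect_right(BOUNDS_MS, delay_ms)]


if __name__ == "__main__":
    for ms in (40, 250, 800, 3000, 15000):
        print(f"{ms} ms: {user_reaction(ms)}")
```

A list of sorted boundaries plus `bisect_right` keeps the bands in one place, so adjusting a threshold means changing one number.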
Changes to a single microservice or monolithic app can impact
performance of downstream apps and services, or cause breakage
Troubleshooting apps and microservices is hard
Most platforms have:
Disparate permissions on different apps
Data silos across subsystems
Trouble reconciling time series data
Multiple Languages
Microservices Support
Services Marketplace
Native, User Provided, Partner
DEVELOPMENT
Operating System
Cloud API
Container Orchestration
App Deployment & Management
Availability
Visibility & Administration
CI/CD Tools, ID, Security
Health, Metrics, Patching
Apps & Platform Dashboards
OPERATIONS
4 Levels of High Availability
Level 4: Availability Zone Fail
Level 3: VM Fail
Level 2: Process Fail
Level 1: App Instance Fail
Container Scheduler Handles Workloads
250,000 containers managed in a single environment
https://blog.pivotal.io/pivotal-cloud-foundry/products/250k-containers-in-production-a-real-test-for-the-real-world
Dynamic load balancing
Remediation and rebalance of workloads
Each Layer Upgradable with No Downtime
App Runtime*
File system mapping
Application
Linux host & kernel
Blue-Green deploy
Canary style deploy
* e.g. Embedded webserver, app configurations, JRE, agents for services packaged as buildpacks
Container
Our Charter
To provide App Devs with data points to assess overall solution performance and health
Providing App Developers insight into app performance
• Near real-time view
• Covers 80-90% of the problems
• One tool correlates events, logs, metrics
• Common set of facts for Dev+Ops
• Designed for PCF multi-tenancy
• Agentless, no install
• Enabled automatically for all applications
Immediate Integrated Automated
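The idea of one tool correlating events, logs, and metrics comes down to putting all three on a shared timeline. The sketch below illustrates that idea only; it is not how PCF Metrics is implemented. The `Record` type, `merge_timeline`, `around`, and the sample data are all hypothetical, and each input stream is assumed to be already sorted by timestamp.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    timestamp: float  # seconds since some common epoch
    source: str       # "event", "log" or "metric"
    detail: str


def merge_timeline(*streams: Iterable[Record]) -> List[Record]:
    """Merge time-sorted streams into one time-ordered list."""
    return list(heapq.merge(*streams, key=lambda r: r.timestamp))


def around(timeline: List[Record], t: float, window_s: float) -> List[Record]:
    """Return every record within window_s seconds of time t."""
    return [r for r in timeline if abs(r.timestamp - t) <= window_s]


if __name__ == "__main__":
    # Made-up sample data for illustration.
    events = [Record(100.0, "event", "app updated"),
              Record(160.0, "event", "app crashed")]
    logs = [Record(158.5, "log", "OutOfMemoryError"),
            Record(300.0, "log", "request served")]
    metrics = [Record(120.0, "metric", "memory 60%"),
               Record(157.0, "metric", "memory 99%")]

    timeline = merge_timeline(events, logs, metrics)
    # What else happened within 5 seconds of the crash at t=160?
    for record in around(timeline, 160.0, 5.0):
        print(record)
```

With everything on one timeline, the memory spike and the error log appear next to the crash event, which is the kind of common set of facts the slide describes for Dev and Ops.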
Available Data
CF EVENTS
APP LOGS
APP METRICS
ROUTES
Select an app, watch streaming data
v1.2.1 PCF Metrics
• 2 weeks of app log storage
• 2 weeks of detailed container and HTTP start stop metric storage
• App Log distribution histogram
• App Event UI improvements
• Fault tolerance on all storage services
• Testing and tuning for large ingestion loads
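A log distribution histogram of the kind listed for v1.2.1 amounts to counting log lines per time bucket. The following is a minimal sketch of that counting step, not the product's implementation; the function name `log_histogram` and the 60-second default bucket width are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def log_histogram(timestamps: Iterable[float], bucket_s: int = 60) -> Dict[int, int]:
    """Count log lines per fixed-width time bucket, keyed by bucket start (seconds)."""
    counts = Counter(int(ts // bucket_s) * bucket_s for ts in timestamps)
    # Sort by bucket start so the result reads left to right like a chart axis.
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up log timestamps, in seconds.
    sample = [0, 10, 59, 60, 125]
    for start, count in log_histogram(sample).items():
        print(f"{start:>4}s {'#' * count}")
```

A sudden jump in one bucket is often the quickest visual cue for where to start reading logs.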
Data Correlation Demo
PCF Metrics 1.2 Architecture
Our Journey
PCF Metrics v1.0: Aggregate Container and HTTP metrics provided for Apps
PCF Metrics v1.1: Aggregate Container and HTTP metrics + App events and Logs (24 hour storage)
PCF Metrics v1.2.1: Aggregate Container and HTTP metrics + App events and Logs (2 weeks storage)
PCF Metrics v1.3: Aggregate Container and HTTP metrics + App events and Logs (2 weeks storage), TraceID capture and Trace Logs
v1.3+ App Developers
• Spring Boot actuator support
• Expanded event descriptions
• Additional Log sources *
• Data exposed as API
• Continued UX improvements