Troubleshooting App Health and Performance with PCF Metrics 1.2

PCF Metrics – App Dev: Providing App Developers insight into app performance. Pieter Humphrey, Allen Duet

Upload: pivotal

Post on 16-Apr-2017


Category:

Technology



TRANSCRIPT

Page 1: Troubleshooting App Health and Performance with PCF Metrics 1.2

PCF Metrics – App Dev

Providing App Developers insight into app performance

PCF Metrics

Providing App Developers insight into app performance

Pieter Humphrey, Allen Duet

Page 2: Troubleshooting App Health and Performance with PCF Metrics 1.2

Gartner believes that more than 80% of all mission-critical IT service outages result from people and process errors and failures, and of those outages, more than 50% result from a lack of coordination between change, release and configuration management processes.

Four Steps to Optimize Configuration Management Process and Tools, By Ronni J. Colville, Doc #G00258557 Oct 2013

Page 3: Troubleshooting App Health and Performance with PCF Metrics 1.2

Modern infrastructure is constantly changing

Methodologies Deployment

Sparingly at designated times

Ready for prod at any time

Architecture Technologies Operations

App Server on Machine

Containers, Public / Private / Hybrid Cloud

Monolithic App

Microservices / Composite app

Linear / Sequential

Agile, DevOps

CI / CD Pipelines

Many tools, ad hoc automation

Manage services, not servers

Page 4: Troubleshooting App Health and Performance with PCF Metrics 1.2

Rate of change is driving more outages

Page 5: Troubleshooting App Health and Performance with PCF Metrics 1.2

Outages often preventable using automation

2015

Facebook: 1 hour, Jan 26th. Config / app / net failures

Apple App Store: 11 hours, March 11th. Internal DNS error

NYSE, United, WSJ: 4 hr, 1.5 hr, 1 hr, July 8th. Software update, routing failure, server overload

UltraDNS: 2.5 hours, Oct 15th. Configuration errors

https://blog.thousandeyes.com/top-internet-outages-2015/

http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=2

http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=4

http://www.informationweek.com/cloud/9-spectacular-cloud-computing-fails/d/d-id/1321305?image_number=8

Page 6: Troubleshooting App Health and Performance with PCF Metrics 1.2

“25% of customers will abandon a web page that takes more than 4 seconds to load”

“47% of consumers expect a web page to load in < 2 seconds”

“Customers prefer a competitor's website if it is 250 ms faster”

“Increase revenue 1% for each 100ms improvement”

Sources: Gartner, Google, Amazon, Walmart

Speed and Availability Matter

Page 7: Troubleshooting App Health and Performance with PCF Metrics 1.2

Speed Performance and Human Perception

Delay time        User reaction

0 - 100 ms        Instant

100 - 300 ms      Feels sluggish

300 - 1000 ms     Machine is working..

1 second +        Mental context switch

10 seconds +      I’ll come back later ..

Stay under 250 ms to feel "fast".

Stay under 1000 ms to keep users' attention.

Breaking the 1000 ms Mobile Barrier - Velocity - Google Slides

https://docs.google.com/presentation/d/1wAxB5DPN-rcelwbGO6lCOus_S1rP24LMqA8m1eXEDRo/present?slide=id.p19
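
The delay-to-reaction mapping on this slide can be expressed as a small lookup. The following is only a sketch: the thresholds and labels come from the slide, while the function name `user_reaction` and the choice to treat each upper bound as exclusive are assumptions made for illustration.

```python
# Upper bounds (in ms) and reactions taken from the slide's delay table.
# Treating each upper bound as exclusive is an assumption; the slide's
# ranges share their boundary values (e.g. 100 ms appears in two rows).
PERCEPTION_BUCKETS = [
    (100, "Instant"),
    (300, "Feels sluggish"),
    (1000, "Machine is working.."),
    (10000, "Mental context switch"),
]


def user_reaction(delay_ms: float) -> str:
    """Return the slide's user reaction for a response delay in milliseconds."""
    if delay_ms < 0:
        raise ValueError("delay must be non-negative")
    for upper_ms, reaction in PERCEPTION_BUCKETS:
        if delay_ms < upper_ms:
            return reaction
    # Anything at or above 10 seconds falls into the last row of the table.
    return "I'll come back later .."
```

A check such as `user_reaction(measured_ms)` could be run against response times to flag requests that leave the "Instant" or "Feels sluggish" ranges.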

Page 8: Troubleshooting App Health and Performance with PCF Metrics 1.2

Changes to a single microservice or monolithic app can impact performance of downstream apps and services, or cause breakage

Page 9: Troubleshooting App Health and Performance with PCF Metrics 1.2

Troubleshooting apps and microservices is hard

Most platforms have:

• Disparate permissions on different apps

• Data silos across subsystems

• Trouble reconciling time series data

Page 10: Troubleshooting App Health and Performance with PCF Metrics 1.2

DEVELOPMENT

Multiple Languages

Microservices Support

Services Marketplace

Native, User Provided, Partner

OPERATIONS

Operating System

Cloud API

Container Orchestration

App Deployment & Management

Availability

Visibility & Administration

CI/CD Tools, ID, Security

Health, Metrics, Patching

Apps & Platform Dashboards

Page 11: Troubleshooting App Health and Performance with PCF Metrics 1.2

4 Levels of High Availability

4. Availability Zone Fail

3. VM Fail

2. Process Fail

1. App Instance Fail

Page 12: Troubleshooting App Health and Performance with PCF Metrics 1.2

Container Scheduler Handles Workloads

250,000 containers managed in a single environment

https://blog.pivotal.io/pivotal-cloud-foundry/products/250k-containers-in-production-a-real-test-for-the-real-world

Page 13: Troubleshooting App Health and Performance with PCF Metrics 1.2

Container Scheduler Handles Workloads

Dynamic load balancing

Page 14: Troubleshooting App Health and Performance with PCF Metrics 1.2

Container Scheduler Handles Workloads

Dynamic load balancing

Remediation and rebalance of workloads

Page 15: Troubleshooting App Health and Performance with PCF Metrics 1.2

Each Layer Upgradable with No Downtime

App Runtime*

File system mapping

Application

Linux host & kernel

Blue-Green deploy

Canary style deploy

* e.g. Embedded webserver, app configurations, JRE, agents for services packaged as buildpacks

Container

Page 16: Troubleshooting App Health and Performance with PCF Metrics 1.2

Our Charter

To provide App Devs with data points to assess overall solution performance and health

Providing App Developers insight into app performance

Page 17: Troubleshooting App Health and Performance with PCF Metrics 1.2

• Near real-time view

• Covers 80-90% of the problems

• One tool correlates events, logs, metrics

• Common set of facts for Dev+Ops

• Designed for PCF multi-tenancy

• Agentless, no install

• Enabled automatically for all applications

Immediate Integrated Automated
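
The idea of correlating events, logs, and metrics can be pictured as placing all three on one timeline and looking at what happened around a given moment. The sketch below illustrates only that idea. It is not the PCF Metrics API or implementation, and the `Record` type, the `correlate` function, and the 30-second window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    """Hypothetical record type; not a PCF Metrics data structure."""
    timestamp: datetime
    kind: str    # "event", "log", or "metric"
    detail: str


def correlate(records, around, window=timedelta(seconds=30)):
    """Return records of any kind within +/- window of `around`, oldest first."""
    nearby = [r for r in records if abs(r.timestamp - around) <= window]
    return sorted(nearby, key=lambda r: r.timestamp)
```

Given a spike at some time `t`, calling `correlate(records, t)` would return the events, log lines, and metric samples recorded close to it, which is the kind of shared view of facts the bullets describe for Dev and Ops.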

Page 18: Troubleshooting App Health and Performance with PCF Metrics 1.2

Available Data

CF EVENTS

APP LOGS

APP METRICS

ROUTES

Page 19: Troubleshooting App Health and Performance with PCF Metrics 1.2

Select an app, watch streaming data

Page 20: Troubleshooting App Health and Performance with PCF Metrics 1.2

• 2 weeks of app log storage

• 2 weeks of detailed container and HTTP start stop metric storage

• App Log distribution histogram

• App Event UI improvements

• Fault tolerance on all storage services

• Testing and tuning for large ingestion loads

v1.2.1 PCF Metrics

Page 21: Troubleshooting App Health and Performance with PCF Metrics 1.2

Data Correlation Demo

Page 22: Troubleshooting App Health and Performance with PCF Metrics 1.2

PCF Metrics 1.2 Architecture

Page 23: Troubleshooting App Health and Performance with PCF Metrics 1.2

Our Journey

PCF Metrics v1.0

Aggregate Container and HTTP metrics provided for Apps

PCF Metrics v1.1

Aggregate Container and HTTP metrics + App events and Logs (24 hour storage)

PCF Metrics v1.2.1

Aggregate Container and HTTP metrics + App events and Logs (2 weeks storage)

PCF Metrics v1.3

Aggregate Container and HTTP metrics + App events and Logs (2 weeks storage)

TraceID capture and Trace Logs

Page 24: Troubleshooting App Health and Performance with PCF Metrics 1.2

• Spring Boot actuator support

• Expanded event descriptions

• Additional Log sources *

• Data exposed as API

• Continued UX improvements

v1.3+ App Developers

Page 25: Troubleshooting App Health and Performance with PCF Metrics 1.2
Page 26: Troubleshooting App Health and Performance with PCF Metrics 1.2