Transcript

Optimizing Monitoring Feedback for your DevOps Teams

Priit Potter

Role of monitoring in DevOps toolchains

Ways to set up monitoring

Monitoring for incident management

Monitoring for problem management

Best practice examples

Plumbr - sign up for your free trial at https://www.plumbr.io

The Business Problem

Large companies are forced to take plays from start-ups’ playbooks to stay competitive.

Enterprises are under pressure to innovate faster in order to stay in business.

----- McKinsey, 2019

Move fast(er), or fall out of business

Response by the IT world

Image source: https://www.perforce.com/solutions/devops

The DevOps Landscape

Coding – code development and review, source code management, code merging

Building – continuous integration, build status

Testing – continuous testing tools that provide quick and timely feedback on business risks

Packaging – artifact repository, application pre-deployment staging

Releasing – change management, release approvals, release automation

Configuring – infrastructure configuration and management, infrastructure as code tools

Monitoring – application performance monitoring, end-user experience

Source - https://en.wikipedia.org/wiki/DevOps

Ship code faster and with fewer errors

In shipped code, find and fix errors fast

Releases:

• 60% contain bugs

• 20% severely impact users

Tools supporting DevOps

The Monitoring Landscape

Infrastructure Monitoring – Nagios, Zabbix, Prometheus

Log Monitoring – Splunk, Elasticsearch-Logstash-Kibana (ELK) stack

Synthetic Monitoring – Pingdom, Uptime

Application Performance Monitoring – New Relic, Plumbr, AppDynamics

Real User Monitoring – New Relic, Plumbr, AppDynamics

How do RUM and APM work?

How to install a RUM solution?

How do you install RUM?

How to install an APM solution?

$ java -javaagent:/path/to/plumbr.jar com.example.YourExecutable

How do you install APM?

Summary

• DevOps suggests 7 categories of toolchains

DevOps tools

Monitoring tools

RUM

APM

1. Incident management

2. Problem management

Incident Management / Alerting

Problem: whenever the availability or performance of the application degrades beyond acceptable limits, the on-call DevOps engineer should be alerted.

Solution: pick a low-noise, high-signal metric to base the alerts on.

Benefit: be aware of performance and availability issues in real time.
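
One way to read "low noise, high signal" is sketched below in Python. It assumes the metric is the share of failed user interactions per time window and that an alert fires only after several consecutive windows breach a threshold. The `Window` shape, the 5% threshold and the three-window rule are illustrative assumptions, not settings from the talk.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Aggregated user interactions for one time window (hypothetical shape)."""
    total: int
    failed: int

    @property
    def failure_ratio(self) -> float:
        # An empty window carries no signal, so treat it as healthy.
        return self.failed / self.total if self.total else 0.0


def should_alert(windows: list[Window], threshold: float, consecutive: int) -> bool:
    """Alert only if the last `consecutive` windows all breach the threshold."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(windows) < consecutive:
        return False
    recent = windows[-consecutive:]
    return all(w.failure_ratio > threshold for w in recent)


if __name__ == "__main__":
    history = [Window(1000, 3), Window(1000, 80), Window(1000, 95), Window(1000, 120)]
    # The last three windows all exceed a 5% failure ratio, so this alerts.
    print(should_alert(history, threshold=0.05, consecutive=3))
```

Requiring consecutive breaches trades a little detection delay for fewer pages caused by a single noisy window.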

RUM as a solution

Set up Real User Monitoring

Define user-experience-based performance and availability objectives

Configure alert channels (PagerDuty/Slack/email/…)

Be immediately aware when such issues arise
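
The steps above can be sketched as follows. This is a minimal Python illustration, not Plumbr's configuration format: the `Interaction` and `Objectives` shapes and all numeric targets are assumptions, and the alert channels are plain callables standing in for PagerDuty, Slack or email integrations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Interaction:
    """One user interaction as a RUM tool might record it (hypothetical shape)."""
    duration_ms: float
    succeeded: bool


@dataclass
class Objectives:
    """Hypothetical objectives; the numbers are illustrative, not recommendations."""
    min_success_share: float = 0.99   # at least 99% of interactions succeed
    max_duration_ms: float = 2000.0   # an interaction is fast at or below this
    min_fast_share: float = 0.95      # at least 95% of interactions are fast


def violations(interactions: list[Interaction], objectives: Objectives) -> list[str]:
    """Return a human-readable message for each objective that is not met."""
    if not interactions:
        return []
    total = len(interactions)
    success_share = sum(i.succeeded for i in interactions) / total
    fast_share = sum(i.duration_ms <= objectives.max_duration_ms for i in interactions) / total
    messages = []
    if success_share < objectives.min_success_share:
        messages.append(f"availability: {success_share:.1%} of interactions succeeded")
    if fast_share < objectives.min_fast_share:
        messages.append(f"performance: {fast_share:.1%} of interactions were fast")
    return messages


def notify(messages: list[str], channels: list[Callable[[str], None]]) -> None:
    """Send every message to every configured channel."""
    for message in messages:
        for send in channels:
            send(message)


if __name__ == "__main__":
    sample = [Interaction(300, True)] * 90 + [Interaction(5000, False)] * 10
    # `print` stands in for a real alert channel here.
    notify(violations(sample, Objectives()), channels=[print])
```

Expressing objectives as shares of real user interactions keeps the alert tied to what users experience rather than to a host-level metric.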

Problem Management / Post-Mortems

Problem: once an incident has been triggered, the root cause must be resolved fast to mitigate the impact.

Solution: have information about the root cause at your fingertips.

Benefit: remove the need to gather additional evidence, reproduce the issue, or troubleshoot.

APM as a solution

Set up APM to trace user interactions across the distributed back-end nodes

Use the root-cause information it exposes to mitigate the problem fast.
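
To make the idea of tracing an interaction across nodes concrete, here is a small Python sketch. It assumes a trace is a set of spans linked by parent IDs, which is a common model in tracing tools. The `Span` fields and the heuristic "look at the deepest failed span first" are illustrative assumptions, not a description of how any particular APM product works.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work on one node within a trace (hypothetical shape)."""
    span_id: str
    parent_id: Optional[str]
    node: str
    operation: str
    error: Optional[str] = None


def root_cause_candidates(spans: list[Span]) -> list[Span]:
    """Return failed spans that have no failed child span.

    Errors often propagate upwards to the callers, so the deepest failed
    span is a reasonable first place to look.
    """
    parents_of_failed = {s.parent_id for s in spans if s.error and s.parent_id}
    return [s for s in spans if s.error and s.span_id not in parents_of_failed]


if __name__ == "__main__":
    trace = [
        Span("a", None, "frontend", "GET /report", error="HTTP 500"),
        Span("b", "a", "api", "loadReport", error="HTTP 500"),
        Span("c", "b", "db", "SELECT", error="connection timeout"),
        Span("d", "a", "auth", "checkToken"),
    ]
    for span in root_cause_candidates(trace):
        print(span.node, span.operation, span.error)
```

In this made-up trace the front end and the API both report a failure, but only the database span has no failing child, so it is the one surfaced.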

Examples. How APM/RUM enable you to:

Be aware of issues that appear

Understand impact

Prioritize response

Fix issues

Let us walk through two real-world use cases

An availability incident, rendering one of our key services unavailable for some users

A performance issue, degrading the tail performance of another service

Availability incident: groundwork laid before

Availability incident: alert to PagerDuty at 08:31 on July 30

Availability incident: responding

Availability incident: understanding the impact

Availability incident: what was the error causing it?

Availability incident: fixing it

• Enable a banner, notifying impacted accounts

• Patch the data processor (released 2.5 hours after the alert)

• Reprocess data for impacted accounts (~24 hours)

Availability incident: responding to support tickets

Availability incident: aftermath

Availability incident: summary

Detect the incident

Trigger an alert

Understand the root cause

Monitor impact in real time

Help support team

Confirm resolution

Performance issue: groundwork laid before

Performance issue: alert to Slack chat at 14:21 on August 7

Performance issue: responding

Performance issue: understanding the impact

Performance issue: impact via distributed traces

• The captured distributed trace exposes how the under-provisioned thread pool hits the dynamically spawned threads
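
The slide does not spell out the mechanism. One common way an under-provisioned pool shows up in a trace is as time spent waiting for a free worker, and the Python sketch below simulates that effect deterministically. The function name, the fixed service time and the first come, first served policy are all assumptions made for illustration.

```python
import heapq


def queue_waits(arrivals: list[float], service_time: float, pool_size: int) -> list[float]:
    """Simulate a fixed-size pool and return how long each task waited for a worker.

    `arrivals` must be sorted; tasks are served first come, first served.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    free_at = [0.0] * pool_size  # time at which each worker becomes free
    heapq.heapify(free_at)
    waits = []
    for arrival in arrivals:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + service_time)
    return waits


if __name__ == "__main__":
    burst = [0.0, 0.0, 0.0, 0.0]
    # Two workers: the third and fourth task each wait one service time.
    print(queue_waits(burst, service_time=1.0, pool_size=2))
    # Four workers: nothing waits.
    print(queue_waits(burst, service_time=1.0, pool_size=4))
```

The waiting time never appears in the task's own execution time, which is why a trace that covers the whole interaction is useful for spotting it.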

Performance issue: patching it

Mitigated the impact by manually altering the configuration in the current production set-up

Forgot to change the scripts that build the machines

Got the alert again after the next day's release

Patched the issue for good after altering the build scripts as well

Performance issue: aftermath

Take-away. Value of APM/RUM for DevOps

Alert you of incidents

Make impact estimation easy

Help prioritize based on real objective impact

Help respond to support tickets

Expose root cause in source code

When you plan to add APM / RUM to your monitoring stack…

… Plumbr will be the solution to consider

Integrates with existing monitoring/alerting ecosystem

And exposes root causes to enable faster mitigation

Thank you!

Priit Potter

Plumbr

