Role of monitoring in DevOps toolchains
Ways to set up monitoring
Monitoring for incident management
Monitoring for problem management
Best practice examples
Optimizing Monitoring Feedback for your DevOps Teams
Plumbr - sign up for your free trial at https://www.plumbr.io
The Business Problem
Large companies are forced to take plays from start-ups’ playbooks to stay competitive.
Enterprises are under pressure to innovate faster in order to stay in business.
----- McKinsey, 2019
Move fast(er), or fall out of business
The DevOps Landscape
Coding – code development and review, source code management, code merging
Building – continuous integration, build status
Testing – continuous testing tools that provide quick and timely feedback on business risks
Packaging – artifact repository, application pre-deployment staging
Releasing – change management, release approvals, release automation
Configuring – infrastructure configuration and management, infrastructure as code tools
Monitoring – application performance monitoring, end-user experience
Source - https://en.wikipedia.org/wiki/DevOps
Ship code faster and with fewer errors
In shipped code, find and fix errors fast
Releases:
• 60% contain bugs
• 20% severely impact users
The Monitoring Landscape
Infrastructure Monitoring – Nagios, Zabbix, Prometheus
Log Monitoring – Splunk, Elasticsearch-Logstash-Kibana (ELK) stack
Synthetic Monitoring – Pingdom, Uptime
Application Performance Monitoring – New Relic, Plumbr, AppDynamics
Real User Monitoring – New Relic, Plumbr, AppDynamics
How to install a RUM solution?
How do you install RUM?
How to install an APM solution?
$ java -javaagent:/path/to/plumbr.jar com.example.YourExecutable
How do you install APM?
Summary
• DevOps suggests 7 categories of toolchains
• DevOps tools → Monitoring tools → RUM, APM
1. Incident management
2. Problem management
Incident Management / Alerting
Problem: whenever the availability or performance of the application degrades beyond acceptable levels, the on-call DevOps engineer should be alerted
Solution: pick a low-noise, high-signal metric to base the alerts on
Benefit: be aware of performance and availability issues in real time
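One way to picture a low-noise, high-signal metric is the share of failed user interactions over a sliding time window. The sketch below is only an illustration under assumptions: the class name, the 5-minute window, the 5% threshold and the minimum sample count are all hypothetical and are not taken from any particular monitoring product.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Interaction:
    timestamp: float  # seconds since epoch
    failed: bool      # True if the user interaction ended in an error


class FailureRateAlert:
    """Fires when the share of failed interactions in a sliding window exceeds a threshold."""

    def __init__(self, window_seconds=300.0, threshold=0.05, min_samples=20):
        # All three defaults are assumed values, chosen only for illustration.
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples  # avoids alerting on a handful of requests
        self._window = deque()

    def observe(self, interaction):
        """Record one interaction and return True if the alert should fire."""
        self._window.append(interaction)
        # Drop interactions that are older than the sliding window.
        cutoff = interaction.timestamp - self.window_seconds
        while self._window and self._window[0].timestamp < cutoff:
            self._window.popleft()
        if len(self._window) < self.min_samples:
            return False
        failures = sum(1 for item in self._window if item.failed)
        return failures / len(self._window) > self.threshold
```

The minimum sample count is what keeps the noise down: a single failed request during a quiet period does not page anyone.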
RUM as a solution
Set up Real User Monitoring
Define user-experience-based performance and availability objectives
Configure alert channels (PagerDuty/Slack/email/…)
Be immediately aware when such issues arise
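The steps above can be sketched in a few lines. This is a minimal illustration, not how any specific RUM product works: the objective values, the `Objectives` type and the idea of a channel as a plain callable are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objectives:
    max_duration_ms: float    # an interaction slower than this counts as slow
    min_fast_ratio: float     # required share of interactions that are not slow
    min_success_ratio: float  # required share of interactions that succeed


def violated_objectives(interactions, objectives):
    """Return the names of the objectives that the interactions violate.

    `interactions` is a list of (duration_ms, succeeded) tuples.
    """
    if not interactions:
        return []
    total = len(interactions)
    fast = sum(1 for duration, _ in interactions
               if duration <= objectives.max_duration_ms)
    succeeded = sum(1 for _, ok in interactions if ok)
    violations = []
    if fast / total < objectives.min_fast_ratio:
        violations.append("performance")
    if succeeded / total < objectives.min_success_ratio:
        violations.append("availability")
    return violations


def notify(channels, violations):
    """Send one message per violated objective to every configured channel.

    Each channel is a hypothetical callable taking the message text; a real
    set-up would wrap the PagerDuty, Slack or email integration here.
    """
    for violation in violations:
        for send in channels:
            send(f"{violation} objective violated")
```

For example, `Objectives(2000, 0.95, 0.99)` would read as "95% of interactions finish within 2 seconds and 99% of them succeed".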
Problem Management / Post-Mortems
Problem: once an incident has been triggered, the root cause must be found and resolved fast to mitigate the impact
Solution: have information about the root cause at your fingertips
Benefit: remove the need to gather additional evidence, reproduce the issue, or troubleshoot
APM as a solution
Set up APM to trace user interactions across the distributed back-end nodes
Use the root causes it exposes to mitigate the problem fast
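To make the idea of a trace concrete, the sketch below models one user interaction as a tree of spans, one per back-end call, and picks a root cause candidate from it. The `Span` type and the selection heuristic are assumptions made for illustration; they do not describe how Plumbr or any other APM tool works internally. The self-time calculation also assumes that child spans run one after another.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    node: str                    # back-end node that executed this step
    operation: str
    duration_ms: float
    error: Optional[str] = None  # error message, if this step failed
    children: list = field(default_factory=list)


def self_time(span):
    """Time spent in the span itself, excluding its child spans."""
    return span.duration_ms - sum(child.duration_ms for child in span.children)


def walk(span, depth=0):
    """Yield (depth, span) pairs for the span and all of its descendants."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def root_cause_candidate(trace):
    """Pick a deepest failing span, or else the span with the largest self time."""
    spans = list(walk(trace))
    failing = [(depth, span) for depth, span in spans if span.error is not None]
    if failing:
        return max(failing, key=lambda item: item[0])[1]
    return max(spans, key=lambda item: self_time(item[1]))[1]
```

The heuristic prefers the deepest failing span because errors higher up the tree are often just the failure being propagated back to the user.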
Examples. How APM/RUM enable you to:
Be aware of issues that appear
Understand impact
Prioritize response
Fix issues
Let us walk through two real-world use cases
An availability incident, rendering one of our key services unavailable for some users
A performance issue, degrading the tail performance of another service
Availability incident: alert to PagerDuty at 08:31 on July 30
Availability incident: understanding the impact
Availability incident: what was the error causing it?
Availability incident: fixing it
• Enable a banner, notifying impacted accounts
• Patch the data processor (released 2.5 hours after the alert)
• Reprocess data for impacted accounts (~24 hours)
Availability incident: responding to support tickets
Availability incident: summary
Detect the incident
Trigger an alert
Understand the root cause
Monitor impact in real time
Help support team
Confirm resolution
Performance issue: groundwork laid before
Performance issue: understanding the impact
Performance issue: impact via distributed traces
• The captured distributed trace exposes how the under-provisioned thread pool affects the dynamically spawned threads
Performance issue: patching it
Mitigated the impact by manually altering the configuration in the current production set-up
Forgot to change the scripts that build the machines
Got the alert again on the next day's release
Patched the issue for good after altering the build scripts as well
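The sequence above is a classic case of configuration drift: the running system was fixed by hand, but the scripts that rebuild it were not. A minimal sketch of a drift check follows. The function and the key names in the usage note are hypothetical; the original does not say which settings were changed.

```python
def config_drift(running, provisioned):
    """Return {key: (running_value, provisioned_value)} for every key that differs.

    `running` is the configuration currently active in production and
    `provisioned` is what the build scripts would produce. A key that is
    missing on one side is reported with None for that side.
    """
    missing = object()  # sentinel, so that an explicit None value is not lost
    drift = {}
    for key in sorted(set(running) | set(provisioned)):
        running_value = running.get(key, missing)
        provisioned_value = provisioned.get(key, missing)
        if running_value != provisioned_value:
            drift[key] = (
                None if running_value is missing else running_value,
                None if provisioned_value is missing else provisioned_value,
            )
    return drift
```

Comparing a hand-patched `{"pool.size": 64}` with the scripted `{"pool.size": 16}` reports `{"pool.size": (64, 16)}`, which flags the forgotten build script before the next release does.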
Take-away. Value of APM/RUM for DevOps
Alert you to incidents
Make impact estimation easy
Help prioritize based on real objective impact
Help respond to support tickets
Expose root cause in source code
When you plan to add APM / RUM to your monitoring stack…
… Plumbr will be the solution to consider
Integrates with existing monitoring/alerting ecosystem
And exposes root causes to enable faster mitigation