Post on 21-May-2020

  • Optimizing Monitoring Feedback for your DevOps Teams

    Priit Potter

  • Role of monitoring in DevOps toolchains

    Ways to set up monitoring

    Monitoring for incident management

    Monitoring for problem management

    Best practice examples

    Plumbr - sign up for your free trial at https://www.plumbr.io

  • The Business Problem

    Large companies are forced to take plays from start-ups’ playbooks to stay competitive.

    Enterprises are under pressure to innovate faster in order to stay in business.

    ----- McKinsey, 2019

  • The Business Problem

    Move fast(er), or fall out of business

  • Response by IT world

    Image source: https://www.perforce.com/solutions/devops

  • The DevOps Landscape

    Coding – code development and review, source code management, code merging

    Building – continuous integration, build status

    Testing – continuous testing tools that provide quick and timely feedback on business risks

    Packaging – artifact repository, application pre-deployment staging

    Releasing – change management, release approvals, release automation

    Configuring – infrastructure configuration and management, infrastructure as code tools

    Monitoring – applications performance monitoring, end-user experience

    Source - https://en.wikipedia.org/wiki/DevOps

  • The DevOps Landscape

    Coding

    Building

    Testing

    Packaging

    Releasing

    Configuring

    Monitoring

    Ship code faster and with fewer errors

  • The DevOps Landscape

    Coding

    Building

    Testing

    Packaging

    Releasing

    Configuring

    Monitoring

    In shipped code, find and fix errors fast

  • The DevOps Landscape

    Coding

    Building

    Testing

    Packaging

    Releasing

    Configuring

    Monitoring

    Releases:
    • 60% contain bugs
    • 20% severely impact users

  • Tools supporting DevOps

  • The Monitoring Landscape

    Infrastructure Monitoring – Nagios, Zabbix, Prometheus

    Log Monitoring – Splunk, Elasticsearch-Logstash-Kibana (ELK) stack

    Synthetic Monitoring – Pingdom, Uptime

    Application Performance Monitoring – New Relic, Plumbr, AppDynamics

    Real User Monitoring – New Relic, Plumbr, AppDynamics

  • How do RUM and APM work?

  • How to install a RUM solution?

    How do you install RUM?

  • How to install an APM solution?

    $ java -javaagent:/path/to/plumbr.jar com.example.YourExecutable

    How do you install APM?

  • Summary

    • DevOps suggests 7 categories of toolchains

    DevOps tools, monitoring tools, RUM, APM

    1. Incident management
    2. Problem management

  • Incident Management / Alerting

    Problem: whenever the availability or performance of the application degrades beyond an acceptable level, the on-call DevOps engineer should be alerted.

    Solution: pick a low-noise, high-signal metric to base the alerts on.

    Benefit: be aware of performance and availability issues in real time.
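
One way to read "low noise/high signal" is to alert on the share of failed user interactions in a sliding window, and to stay silent until enough samples have arrived. The sketch below is only an illustration under that assumption; the class name, window size, and thresholds are hypothetical and are not taken from the talk or from any monitoring product.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FailureRateAlert:
    """Fires when too many of the most recent user interactions have failed."""

    window_size: int = 100           # hypothetical: how many recent interactions to consider
    min_samples: int = 20            # hypothetical: stay silent until this many have arrived
    max_failure_ratio: float = 0.05  # hypothetical: acceptable share of failed interactions
    _outcomes: deque = field(default_factory=deque)

    def record(self, succeeded: bool) -> bool:
        """Record one interaction; return True if the on-call engineer should be alerted."""
        self._outcomes.append(succeeded)
        if len(self._outcomes) > self.window_size:
            self._outcomes.popleft()  # forget the oldest interaction
        if len(self._outcomes) < self.min_samples:
            return False              # too little data, so an alert would be noise
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes) > self.max_failure_ratio
```

The minimum sample count is what keeps a single failed request at 3 a.m. from paging anyone; the ratio keeps the alert tied to what users actually experience.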

  • RUM as a solution

    Set up Real User Monitoring

    Define user experience based performance and availability objectives

    Configure alert channels (PagerDuty/Slack/email/…)

    Be immediately aware when such issues arise
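
The steps above can be sketched roughly as follows: objectives are expressed in terms of what real users experienced, and every violation is sent to each configured channel. All names and thresholds here are hypothetical, and the channels are plain callables standing in for PagerDuty, Slack, or email integrations rather than real client libraries.

```python
# Hypothetical objectives; the keys and values are illustrative only.
OBJECTIVES = {
    "max_failure_ratio": 0.01,   # at most 1% of user interactions may fail
    "max_slow_ratio": 0.05,      # at most 5% may be slower than the threshold
    "slow_threshold_ms": 2000,
}


def violated_objectives(interactions, objectives=OBJECTIVES):
    """interactions: list of (duration_ms, succeeded) tuples captured from real users."""
    if not interactions:
        return []
    total = len(interactions)
    failed = sum(1 for _, ok in interactions if not ok)
    slow = sum(1 for ms, ok in interactions
               if ok and ms > objectives["slow_threshold_ms"])
    violations = []
    if failed / total > objectives["max_failure_ratio"]:
        violations.append("availability")
    if slow / total > objectives["max_slow_ratio"]:
        violations.append("performance")
    return violations


def alert(interactions, channels):
    """Send one message per violated objective to every configured channel."""
    violations = violated_objectives(interactions)
    for name in violations:
        for send in channels:
            send(f"{name} objective violated")
    return violations
```

For example, `alert(interactions, [print])` would write any violations to standard output; a real set-up would pass functions that post to the team's actual alerting tools.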

  • Problem Management / Post-Mortems

    Problem: once an incident has been triggered, the root cause must be resolved fast to mitigate the impact.

    Solution: have information about the root cause at your fingertips.

    Benefit: no need to gather additional evidence, reproduce the issue, or troubleshoot.

  • APM as a solution

    Set up APM to trace the user interactions throughout the distributed back-end nodes

    Use the information exposed as root causes to mitigate the problem fast.
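
As a rough illustration of how a trace can point at a root cause, the sketch below models a trace as a list of spans linked by parent IDs and picks the failed spans that have no failed children, that is, the deepest failures in the call tree. The `Span` structure and the heuristic are assumptions made for this example; they do not describe how Plumbr or any other APM product works internally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]     # None for the span that started the user interaction
    service: str                 # back-end node that executed this span
    error: Optional[str] = None  # None when the span succeeded


def root_cause_spans(spans: List[Span]) -> List[Span]:
    """Return failed spans none of whose children failed (the deepest failures)."""
    # IDs of spans that have at least one failed child.
    failed_parents = {s.parent_id for s in spans if s.error and s.parent_id}
    return [s for s in spans if s.error and s.span_id not in failed_parents]
```

In a trace where the front end, the order service, and the database all report errors along one call chain, this heuristic returns only the database span, since the errors above it are likely consequences rather than causes.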

  • Examples. How APM/RUM enable you to:

    Be aware of issues that appear

    Understand impact

    Prioritize response

    Fix issues

  • Let us walk through two real-world use cases

    An availability incident, rendering one of our key services unavailable for some users

    A performance issue, degrading the tail performance of another service

  • Availability incident: groundwork laid before

  • Availability incident: alert to PagerDuty at 08:31 on July 30

  • Availability incident: responding

  • Availability incident: understanding the impact

  • Availability incident: what was the error causing it?

  • Availability incident: fixing it

    • Enable a banner, notifying impacted accounts

    • Patch the data processor (released 2.5 hours after the alert)

    • Reprocess data for impacted accounts (~24 hours)

  • Availability incident: responding to support tickets

  • Availability incident: aftermath

  • Availability incident: summary

    Detect the incident

    Trigger an alert

    Understand the root cause

    Monitor impact in real time

    Help support team

    Confirm resolution

  • Performance issue: groundwork laid before

  • Performance issue: alert to Slack chat on 7 August at 14:21

  • Performance issue: responding

  • Performance issue: understanding the impact

  • Performance issue: impact via distributed traces

    • The captured distributed trace exposes how the under-provisioned thread pool hits the dynamically spawned threads
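
The trace itself is not reproduced in this transcript. As a generic illustration of why an under-provisioned pool degrades tail performance, the sketch below simulates a fixed-size worker pool: once every worker is busy, later requests queue, and their latency grows even though each request needs the same amount of work. The function and its parameters are hypothetical and are not taken from the incident.

```python
import heapq


def simulate_latencies(arrivals, service_time, workers):
    """Return per-request latency (queue wait plus service) for a fixed-size pool.

    arrivals: request arrival times in ascending order
    service_time: time each request needs once a worker picks it up
    workers: number of threads in the pool
    """
    free_at = [0.0] * workers  # time at which each worker next becomes idle
    heapq.heapify(free_at)
    latencies = []
    for arrival in arrivals:
        # Take the worker that frees up first; wait for it if it is still busy.
        start = max(arrival, heapq.heappop(free_at))
        finish = start + service_time
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies
```

With four simultaneous requests that each need one second, a pool of four workers serves all of them in one second, while a pool of two serves the last two in two seconds: the median barely moves, but the tail doubles.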

  • Performance issue: patching it

    Mitigated the impact by manually altering the configuration in the current production set-up

    Forgot to change the scripts that build the machines

    Got the alert again after the next day's release

    Patched the issue for good after altering the build scripts as well
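
A manual production fix that never reaches the build scripts is a form of configuration drift. One possible safeguard, sketched here under the assumption that both configurations can be loaded as dictionaries, is to compare what the build scripts would produce with what is actually running. The setting names in the usage note are hypothetical.

```python
def config_drift(build_config, live_config):
    """Return {setting: (build_value, live_value)} for every setting that differs."""
    keys = set(build_config) | set(live_config)
    return {
        key: (build_config.get(key), live_config.get(key))
        for key in sorted(keys)
        if build_config.get(key) != live_config.get(key)
    }
```

For instance, `config_drift({"thread_pool_size": 8}, {"thread_pool_size": 32})` reports that the pool size was changed in production only, which is the kind of gap that caused the alert to fire again after the next release.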

  • Performance issue: aftermath

  • Take-away. Value of APM/RUM for DevOps

    Alert you of incidents

    Make impact estimation easy

    Help prioritize based on real objective impact

    Help respond to support tickets

    Expose root cause in source code

  • When you plan to add APM / RUM to your monitoring stack…

  • … Plumbr will be the solution to consider

  • Integrates with existing monitoring/alerting ecosystem

  • And ex
