reducing mttr and false escalations: event correlation at linkedin · 2017-07-14 · reducing mttr...

34
Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability Engineer LinkedIn

Upload: others

Post on 13-Mar-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Reducing MTTR and False Escalations:Event Correlation at LinkedIn

Michael KehoeStaff Site Reliability EngineerLinkedIn

Page 2: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Have you ever?

2

False Escalations

• Been woken because your service is unhealthy because of a dependency?

• Been woken because someone believes your service is responsible?

• Spent hours trying to work out why your service is broken?

Page 3: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

3

Agenda

• Project Problem Statement

• Project Goals

• Architecture Considerations

• Correlation Engine Overview

• Results & Takeaways

• Questions

Page 4: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

$ whoami

4

Michael Kehoe

• Staff Site Reliability Engineer (SRE) @ LinkedIn• Production-SRE team• Funny accent = Australian + 3 years American

Page 5: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

$ whatis PROD-SRE

5

Michael Kehoe

• Production-SRE• Develop applications to improve MTTD and

MTTR• Build tools for efficient site issue

troubleshooting, issue detection & correlation• Provide direction on site monitoring• Assist in restoring stability to services during site

critical issues

Page 6: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

6

Problem Statement

Service Complexity

Learning Curve MTTRReliability

Page 7: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Project Technical Goal

7

Problem Statement

Find problem with a service between a given time period (or ongoing) using:

Unified API Web Frontend

Page 8: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Project Success Criteria

8

Problem Statement

• Reduce MTTR on incidents

• Reduce false/ needless escalations

Page 9: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Expected Use-Cases

9

Problem Statement

Applicable use-cases:• A service has high latency or error rates

• Find the problematic service(s)

Non-applicable use-cases:• External monitoring services show slow page-load times

Page 10: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

10

Architecture Considerations

Real-Time metrics analytics (stream processing)

Ad-Hoc metrics Analytics Alert Correlation

Page 11: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Evaluation

11

Architecture Considerations

• Real-Time metrics analytics (stream processing)• Pros

• Fast response time• Ability to do advanced analytics in real-time

• Cons• Resource intensive (especially at LinkedIn scale)

Page 12: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Evaluation

12

Architecture Considerations

• Ad-Hoc metric analytics• Pros

• Smaller resource footprint• Cons

• Analysis time is slow

Page 13: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Evaluation

13

Architecture Considerations

• Alert Correlation• Pros

• Leverage already existing alerts• Strong signal-to-noise ratio

• Cons• Analysis constrained to alerts only (boolean state)

Page 14: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Evaluation

14

Architecture Considerations

• Real-time analytics is expensive, but useful

• Ad-Hoc metric analytics is slower, but cheaper

• Alert Correlation gives us strong signal

Page 15: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

15

Correlation Engine Overview

At LinkedIn, we had two smaller projects that we could leverage Drilldown + Site-Stabilizer

Near-Time metric analytics & event correlation Invisualize

Alert Correlation

Existing knowledge available

Page 16: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Where to get started

16

Correlation Engine Overview

The ability to correlate is great!

But you need to understand dependencies

Build a callgraph!

Page 17: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Callgraph

17

Correlation Engine Overview

LinkedIn applications emit metrics on a per-API and per-dependency basis

Map metrics to understand dependencies

Simple to build callgraph platform!

Page 18: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Callgraph

18

Correlation Engine Overview

Callgraph-be

Voldemort(RO Datastore)

Espresso(RW Datastore)

Collect:● Call count● Latency

Page 19: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

drilldown (Near-Time analytics)

19

Correlation Engine Overview

Using callgraph, identifies high-value dependencies (and the associated metrics)

In 5min chunks, analyses high-value metrics Using a k-means unsupervised algorithm, find similar trends between service metrics

Queryable API

Outputs correlation confidence scores Normalised between 0-100

Page 20: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

inVisualize (Alert Correlation)

20

Correlation Engine Overview

inVisualize analyses alerts (in realtime) from each service

Use callgraph to calculate the unhealthy service and affected services

Queryable APIResults normalised between 0-100

Visualizes impact

Page 21: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

inVisualize

21

Correlation Engine Overview

Page 22: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Site-Stabilizer

22

Correlation Engine Overview

Backend service Collates recommendations from Drilldown & inVisualize

Decorates recommendations with: Scheduled changes Deployment events A/B experiment changes

Page 23: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Architecture

23

Correlation Engine Overview

Callgraph-api

Callgraph-be

drilldown invisualize

site-stabilizer

Page 24: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Correlate-fe

24

Correlation Engine Overview

API for automation Auto-remediation Alert suppressing

UI for manual introspection

Page 25: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Correlate-fe

25

Correlation Engine Overview

User Interfaces gives Responsible service Correlation Confidence Root cause SRE team Analysis

Page 26: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Architecture

26

Correlation Engine Overview

Callgraph-api

Callgraph-be

correlate-fe

drilldown invisualize

site-stabilizer

Page 27: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Latency Alert

NURSE

Nurse Plan arguments

• service-name: my-frontend

• req_confidence = 85• escalate = True

Escalate to correct SRE

Find what’s wrong with ‘my-frontend’ in

DatacenterB

IrisAlert Correlation API

Service: Service-CConfidence: 91%Reason: ‘Service-C’ has high latency after a deployService Owner: SRE

Page 28: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

28

Early Results

Siteops (NOC) has greater visibility on the site

Reducing MTTR

Reducing false escalations

Page 29: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

29

Conclusion

Understand what correlation approach makes sense for you

Understand your dependencies

Build, Integrate and benefit!

Page 30: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

30

Team

Govindaluri Kishore

Renjith Rajan

Reynold Perumpilly

Rusty Wickell

Michael Kehoe

Page 31: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

31

Questions?

Page 32: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

©2014 LinkedIn Corporation. All Rights Reserved.©2014 LinkedIn Corporation. All Rights Reserved.

Page 33: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Callgraph

33

Correlation Engine Overview

Callgraph-be

RestLi(Internal API’s)

Voldemort(RO Datastore)

Espresso(RW Datastore)

Call count

Latency

Page 34: Reducing MTTR and False Escalations: Event Correlation at LinkedIn · 2017-07-14 · Reducing MTTR and False Escalations: Event Correlation at LinkedIn Michael Kehoe Staff Site Reliability

Architecture

34

Correlation Engine Overview

Callgraph-api

Callgraph-be

correlate-fe

drilldown invisualize

site-stabilizer