Real-Time Metrics and Distributed Monitoring - Jeff Pierce, Change.org - DevOpsDays Tel Aviv 2015

Upload: devopsdays-tel-aviv

Post on 15-Apr-2017

DevOps Days 2015

Real Time Metrics and Distributed Monitoring

Jeff Pierce
Senior DevOps Engineer @[email protected]
https://github.com/jeffpierce
@Th3Technomancer

● Consulted for Citigroup on their High Frequency Trading Servers

● Stints at:
  ○ Apple
  ○ Rackspace

● Project Lead on Cassabon (https://github.com/jeffpierce/cassabon)

Background

About Change.org

● Global platform where people start and win campaigns for change

● 120 million users worldwide

● Rapidly expanding user base and engineering team

● Spiky, unpredictable traffic based on current events and viral petitions

Why not outsource it?

We tried!

We weren’t happy with the price

We weren’t happy with the resolution of the stats we were capturing

Why do we need our monitoring distributed and high res metrics?

In a cloud world, centralized services are asking for failure

High resolution metrics are awesome!

Faster response time to outages

Able to autoscale on our own terms

What else influenced our decision?

● We were pretty understaffed!

● Low implementation time was key

● We needed to rely on the knowledge the team already had

● We needed something with low maintenance and relatively easy scalability

Searching For A Solution

First Attempt: Try other providers!

● Unable to find a provider that met both our price and resolution requirements

● None that we investigated had reasonable pricing for temporary, autoscaling pool hosts

● Decided to see what we could come up with in-house!

Requirements For A DIY Stack

● Leverage tools team members were familiar with

● Relatively low maintenance

● Flexible, resilient, distributed

● Cost-competitive with outsourced services and with higher resolution

● Uses many parts that we were already using in our infrastructure

We settled on...

● collectd with statsd plugin (http://collectd.org)

● Cyanite (https://github.com/pyr/cyanite)

● graphite-api (https://github.com/brutasse/graphite-api)

● Grafana (http://grafana.org)

JSON Dashboards Are A Big Deal!

● Developers often know better which stats and graphs are important

● Takes work off of the plate of DevOps

● Can be checked in with app code

● Can also be generated via change control with custom libraries

● JSON is a familiar format to devs, increasing adoption rate
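
As one illustration of generating dashboards from code kept under change control, here is a minimal Python sketch. The slides do not show the custom libraries that were used, so `build_dashboard`, `dashboard_json`, and the metric paths are hypothetical. The field names (`title`, `rows`, `panels`, `targets`) only loosely follow the shape of Grafana dashboard JSON and have not been checked against a specific Grafana version.

```python
import json


def build_dashboard(title, graphs):
    """Return a dashboard definition as a plain dict.

    `graphs` maps a panel title to a list of Graphite-style metric paths.
    The layout is a simplified, assumed schema, not Grafana's exact one.
    """
    panels = []
    # Sort by panel title so the output does not depend on dict ordering.
    for panel_id, (panel_title, metric_paths) in enumerate(sorted(graphs.items()), start=1):
        panels.append({
            "id": panel_id,
            "title": panel_title,
            "type": "graph",
            "targets": [{"target": path} for path in metric_paths],
        })
    return {"title": title, "rows": [{"panels": panels}]}


def dashboard_json(title, graphs):
    """Serialize with sorted keys so diffs in version control stay small."""
    return json.dumps(build_dashboard(title, graphs), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical metric paths, for illustration only.
    print(dashboard_json("Signup service", {
        "Request latency": ["app.signup.request_ms"],
        "Errors": ["app.signup.errors"],
    }))
```

Because the output is deterministic, the generated file can be committed next to the application code and reviewed like any other change.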

[Architecture diagram. Labels: App Servers; “Central” Monitor; Ext. Stat Gatherer; TCP 2003; Cyanite (×4); Cassandra (×6); TCP 8080; Elastic Search; Grafana + Graphite-API; TCP 80; Dashboard Requests]
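
The diagram labels the path into Cyanite as TCP 2003, the port conventionally used for Graphite's plaintext protocol. As a rough sketch of what a sender on that path does, assuming the listener accepts lines of the form `path value timestamp`, the following Python formats and ships metrics. The function names and the metric path are illustrative and do not come from the talk.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric as a Graphite plaintext line ending in a newline."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host, port=2003, timeout=2.0):
    """Open one TCP connection and write all metric lines to it."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)


if __name__ == "__main__":
    # Only prints the line; call send_metrics([...], "<your host>") to ship it.
    print(graphite_line("app.web01.requests", 42), end="")
```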

The Monitoring Side

Monitoring Implementation Goals

● Write/run simple scripts to query Cyanite

● Use PagerDuty for alerting/paging

● Only use external monitoring to check application-wide or aggregate stats

● Try to use external monitoring services as little as possible

● Template as many checks as possible for easy management by change control
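
The goals above might translate into something like the following Python sketch. The slides do not show the actual scripts, so everything here is an assumption: checks are kept as plain data so they can be templated, datapoints are assumed to arrive in the shape a Graphite-compatible render endpoint returns for JSON (`[[value, timestamp], ...]`), and the alert body follows the general shape of PagerDuty's Events API v2, which may differ from what was in use at the time. The check names, thresholds, and routing key are hypothetical.

```python
def latest_value(series):
    """Return the most recent non-null value from one render-style series."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def evaluate(check, series):
    """Compare the latest value with the check's threshold.

    Returns None when healthy, otherwise a human-readable summary string.
    """
    value = latest_value(series)
    if value is None:
        return f"{check['name']}: no data for {series['target']}"
    if value > check["max"]:
        return f"{check['name']}: {series['target']} = {value} (max {check['max']})"
    return None


def pagerduty_event(summary, source, routing_key):
    """Build a trigger event shaped like a PagerDuty Events API v2 body."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": "critical"},
    }


# Checks as data, so they can be templated and kept under change control.
CHECKS = [
    {"name": "signup latency", "target": "app.signup.request_ms", "max": 500},
]

if __name__ == "__main__":
    # Sample data standing in for a real query result.
    sample = {"target": "app.signup.request_ms",
              "datapoints": [[120, 1700000000], [730, 1700000010], [None, 1700000020]]}
    for check in CHECKS:
        problem = evaluate(check, sample)
        if problem:
            print(pagerduty_event(problem, "metrics-check", "HYPOTHETICAL_KEY"))
```

Skipping trailing null datapoints matters with high-resolution metrics, since the newest interval is often still empty when the check runs.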

Getting Developer Buy-In

● Make it simple to add stats and monitors so that we get a high adoption rate

● Make importable code in commonly used languages

● Demo ease of use

● Consult individual, influential developers on importance of getting stats everywhere
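
"Importable code" for stats can be very small. The sketch below is a hypothetical Python helper, not the library used at Change.org. It assumes a StatsD-compatible listener, such as collectd's statsd plugin, on UDP port 8125 of the local host, and it uses the StatsD line format `name:value|type`.

```python
import socket
import time
from contextlib import contextmanager


class StatsClient:
    """Fire-and-forget StatsD client over UDP."""

    def __init__(self, host="127.0.0.1", port=8125):
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def format(name, value, kind):
        """StatsD line format: '<name>:<value>|<type>', e.g. c, g, or ms."""
        return f"{name}:{value}|{kind}"

    def send(self, name, value, kind):
        try:
            self._sock.sendto(self.format(name, value, kind).encode("ascii"), self._addr)
        except OSError:
            pass  # Metrics must never break the application.

    def incr(self, name, count=1):
        """Increment a counter."""
        self.send(name, count, "c")

    @contextmanager
    def timed(self, name):
        """Time the enclosed block and report it in milliseconds."""
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = int((time.monotonic() - start) * 1000)
            self.send(name, elapsed_ms, "ms")


if __name__ == "__main__":
    stats = StatsClient()
    with stats.timed("signup.request"):
        time.sleep(0.01)  # Stand-in for real work.
    stats.incr("signup.success")
```

With a helper like this, adding a stat is one line at the call site, which is the kind of low friction the bullets above are aiming for.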

What We Got From All This Work

Wins Thus Far

● Faster code!

● Faster and fewer rollbacks!

● Finding problem instances is easier than ever!

● Faster, easier troubleshooting!

And The Biggest Win...

Increased Communication Between Feature Developers and DevOps!

● App developers have an increased sense of ownership -- they choose what stats to capture and which dashboards matter

● When something is wrong, it’s easier to accept it from stats than the Ops person

Winners Ask Questions!