Monitoring OpenConnect CDN

Sergey Fedorov, Netflix, Monitorama 2015

Upload: sergey-fedorov

Post on 28-Jul-2015

TRANSCRIPT

Page 1: Monitorama 2015 Monitoring OpenConnect CDN

Monitoring OpenConnect CDN

Sergey Fedorov, Netflix

Monitorama 2015

Page 2: Monitorama 2015 Monitoring OpenConnect CDN

What is OpenConnect

36.5% of US downstream traffic *

* 2015 Sandvine report

Page 3: Monitorama 2015 Monitoring OpenConnect CDN

OpenConnect Cache Appliance

Space/power optimized
10/40 Gbps network interface
FreeBSD OS
NGINX server
BIRD routing proxy

Gizmodo, “This box can hold an entire Netflix” http://gizmodo.com/this-box-can-hold-an-entire-netflix-1592590450

Page 4: Monitorama 2015 Monitoring OpenConnect CDN

Network

Transit

Internet Exchange

ISP embedded

Page 5: Monitorama 2015 Monitoring OpenConnect CDN

Intelligent clients

Page 6: Monitorama 2015 Monitoring OpenConnect CDN

Control Plane

End-user content request router

client location
network conditions
server utilization
content distribution
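The four router inputs listed on this slide could be combined in many ways; the slides do not say how. The following is only a minimal sketch under stated assumptions: the `Cache` type, the `rank_caches` function, the use of round-trip time as a stand-in for client location and network conditions, the 0.9 utilization cutoff, and the cost formula are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Cache:
    """Hypothetical view of one cache appliance as seen by a router."""
    name: str
    rtt_ms: float        # assumed proxy for client location and network conditions
    utilization: float   # fraction of serving capacity in use, 0.0 to 1.0
    titles: Set[str]     # content currently stored on this cache


def rank_caches(caches: List[Cache], title: str,
                max_utilization: float = 0.9) -> List[Cache]:
    """Return the caches that can serve `title`, best candidate first.

    Caches that do not hold the title or are above the (assumed)
    utilization cutoff are dropped; the rest are ordered by an
    illustrative cost that grows with both latency and load.
    """
    candidates = [
        c for c in caches
        if title in c.titles and c.utilization < max_utilization
    ]
    return sorted(candidates, key=lambda c: c.rtt_ms * (1.0 + c.utilization))


caches = [
    Cache("a", rtt_ms=10.0, utilization=0.50, titles={"title-1"}),
    Cache("b", rtt_ms=5.0, utilization=0.95, titles={"title-1"}),   # overloaded
    Cache("c", rtt_ms=12.0, utilization=0.10, titles={"title-1"}),
    Cache("d", rtt_ms=1.0, utilization=0.00, titles={"title-2"}),   # lacks the title
]
ranked = [c.name for c in rank_caches(caches, "title-1")]
```

In this example `b` is excluded for load and `d` for missing content, so only `c` and `a` remain as candidates.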

Page 7: Monitorama 2015 Monitoring OpenConnect CDN

Who we are

Sergey Fedorov
Stefan Praszalowicz

Page 8: Monitorama 2015 Monitoring OpenConnect CDN

Monitoring challenge

Page 9: Monitorama 2015 Monitoring OpenConnect CDN

Testing in prod*

Network changes
Firmware deployments
App pushes
Updating content
...

Page 10: Monitorama 2015 Monitoring OpenConnect CDN

Caches
Clients
Control Plane
Microservices
Network
Capacity
Config
Content
Telemetry (Atlas)
Logs (Elasticsearch)

Data sources

METRICS

Page 11: Monitorama 2015 Monitoring OpenConnect CDN

Something breaks all the time

Page 12: Monitorama 2015 Monitoring OpenConnect CDN
Page 13: Monitorama 2015 Monitoring OpenConnect CDN

Big problems start small

Page 14: Monitorama 2015 Monitoring OpenConnect CDN

Context matters

Page 15: Monitorama 2015 Monitoring OpenConnect CDN

Page 16: Monitorama 2015 Monitoring OpenConnect CDN

Small SRE team

Page 17: Monitorama 2015 Monitoring OpenConnect CDN
Page 18: Monitorama 2015 Monitoring OpenConnect CDN
Page 19: Monitorama 2015 Monitoring OpenConnect CDN

Elastic

Page 20: Monitorama 2015 Monitoring OpenConnect CDN

How we do it

Page 21: Monitorama 2015 Monitoring OpenConnect CDN

Data sources: Netflix Clients, Caches, Network, Config, ...

Page 22: Monitorama 2015 Monitoring OpenConnect CDN

Data sources: Netflix Clients, Caches, Network, Config, ...

Orchestration

Data processing: stream processors, pollers

Page 23: Monitorama 2015 Monitoring OpenConnect CDN

State processing: FSM

Data sources: Netflix Clients, Caches, Network, Config, ...

Orchestration

Data processing: stream processors, pollers

Page 24: Monitorama 2015 Monitoring OpenConnect CDN

MAINTENANCE

start fixing
end fixing

threshold=75%

Page 25: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 26: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 27: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 28: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 29: Monitorama 2015 Monitoring OpenConnect CDN

action: silence
from: config

Page 30: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 31: Monitorama 2015 Monitoring OpenConnect CDN

action: silence
from: config

Page 32: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 33: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 34: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 35: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 36: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 37: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 38: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 39: Monitorama 2015 Monitoring OpenConnect CDN

action: unsilence
from: config

Page 40: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 41: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 42: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 43: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 44: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 45: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 46: Monitorama 2015 Monitoring OpenConnect CDN

action: start_fix
from: user

Page 47: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 48: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 49: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 50: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 51: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 52: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 53: Monitorama 2015 Monitoring OpenConnect CDN

action: break
from: cpu

Page 54: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 55: Monitorama 2015 Monitoring OpenConnect CDN

action: ok
from: cpu

Page 56: Monitorama 2015 Monitoring OpenConnect CDN

action: end_fix
from: user

Page 57: Monitorama 2015 Monitoring OpenConnect CDN

MAINTENANCE

start fixing
end fixing

threshold=75%
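The walkthrough on pages 24 to 57 can be read as a per-metric state machine driven by input actions (`ok`, `break`, `silence`, `unsilence`, `start_fix`, `end_fix`) that arrive from different sources (`cpu`, `config`, `user`). The sketch below is one possible reading, not the implementation shown in the talk. Only the `MAINTENANCE` state, the action names, and the 75% threshold come from the slides; the `OK`, `BROKEN`, and `SILENCED` states, the transition table, and the rule that a sample above the threshold produces `break` are assumptions made for illustration.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class State(Enum):
    OK = "OK"
    BROKEN = "BROKEN"
    SILENCED = "SILENCED"        # assumed state name
    MAINTENANCE = "MAINTENANCE"


# Hypothetical transition table: (current state, input action) -> next state.
# Any pair that is not listed leaves the state unchanged.
TRANSITIONS = {
    (State.OK, "break"): State.BROKEN,
    (State.BROKEN, "ok"): State.OK,
    (State.OK, "silence"): State.SILENCED,
    (State.BROKEN, "silence"): State.SILENCED,
    (State.SILENCED, "unsilence"): State.OK,
    (State.OK, "start_fix"): State.MAINTENANCE,
    (State.BROKEN, "start_fix"): State.MAINTENANCE,
    (State.MAINTENANCE, "end_fix"): State.OK,
}


def classify(value: float, threshold: float = 75.0) -> str:
    """Turn one metric sample into an input action (assumed rule)."""
    return "break" if value > threshold else "ok"


def step(state: State, action: str) -> State:
    """Apply one input action; unknown pairs keep the current state."""
    return TRANSITIONS.get((state, action), state)


def run(actions: Iterable[Tuple[str, str]], state: State = State.OK):
    """Feed (action, source) pairs through the FSM.

    Returns the final state and the list of actual state changes as
    (old_state, new_state, action, source) tuples.
    """
    changes: List[Tuple[State, State, str, str]] = []
    for action, source in actions:
        new_state = step(state, action)
        if new_state is not state:
            changes.append((state, new_state, action, source))
        state = new_state
    return state, changes


# A shortened version of the sequence shown on the slides.
actions = [
    ("ok", "cpu"), ("silence", "config"), ("break", "cpu"),
    ("unsilence", "config"), ("break", "cpu"), ("start_fix", "user"),
    ("break", "cpu"), ("end_fix", "user"),
]
final_state, changes = run(actions)
```

Under these assumptions, `break` actions that arrive while the metric is silenced or under maintenance produce no state change, so only real transitions reach whatever consumes `changes`.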

Page 58: Monitorama 2015 Monitoring OpenConnect CDN

State processing: FSM

Data sources: Netflix Clients, Caches, Network, Config, ...

Orchestration

Data processing: stream processors, pollers

Page 59: Monitorama 2015 Monitoring OpenConnect CDN

State processing: FSM

Data sources: Netflix Clients, Caches, Network, Config, ...

Orchestration

Data processing: stream processors, pollers

Events processing: event handlers

Page 60: Monitorama 2015 Monitoring OpenConnect CDN

STATE TRANSITION EVENT
● Old state
● New state
● Input action
● Metric name
● Action metadata
  ○ metric value
  ○ comments
  ○ tags
  ○ timestamp
  ○ ...

Triggers an event

Event handlers

RULES
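The fields of the state transition event on this slide map naturally onto a small record type, and "rules" can be modelled as predicate and handler pairs. This is a sketch under assumptions: the class name `StateTransitionEvent`, the `dispatch` function, the string state names, and the two example rules are hypothetical; only the field list comes from the slide.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class StateTransitionEvent:
    old_state: str
    new_state: str
    action: str                  # input action that caused the transition
    metric_name: str
    # Action metadata: metric value, comments, tags, timestamp, ...
    metadata: Dict[str, Any] = field(default_factory=dict)


# A rule pairs a predicate with the handler to run when it matches.
Rule = Tuple[
    Callable[[StateTransitionEvent], bool],
    Callable[[StateTransitionEvent], str],
]


def dispatch(event: StateTransitionEvent, rules: List[Rule]) -> List[str]:
    """Run every handler whose predicate matches; collect the results."""
    return [handler(event) for matches, handler in rules if matches(event)]


# Illustrative rules only; real handlers would notify or open tickets.
RULES: List[Rule] = [
    (lambda e: e.new_state == "BROKEN",
     lambda e: f"alert: {e.metric_name} is BROKEN"),
    (lambda e: e.old_state == "BROKEN" and e.new_state == "OK",
     lambda e: f"resolve: {e.metric_name} recovered"),
]

event = StateTransitionEvent(
    old_state="OK", new_state="BROKEN", action="break",
    metric_name="cpu", metadata={"metric_value": 91.0},
)
results = dispatch(event, RULES)
```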

Page 61: Monitorama 2015 Monitoring OpenConnect CDN

Events priority

Severity: Info, Notice, Warning, Critical
Escalation: 0, 1, 2, 3
Priority: Do Never, Do Last, Do Next, Do Now
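The slide shows priority as a function of severity and escalation level, but the exact cell-by-cell mapping is not recoverable from the transcript. The sketch below assumes one simple mapping for illustration: severity sets the base priority and each escalation step raises it by one, capped at "Do Now". The `priority` function and this additive rule are hypothetical.

```python
SEVERITIES = ["Info", "Notice", "Warning", "Critical"]
PRIORITIES = ["Do Never", "Do Last", "Do Next", "Do Now"]


def priority(severity: str, escalation: int) -> str:
    """Map (severity, escalation 0..3) to a priority label.

    Assumed rule: base priority is the severity rank, and every
    escalation step bumps it by one, capped at the highest label.
    """
    if escalation not in range(4):
        raise ValueError("escalation must be 0, 1, 2 or 3")
    rank = SEVERITIES.index(severity) + escalation
    return PRIORITIES[min(rank, len(PRIORITIES) - 1)]
```

With this rule an unescalated Info event is never acted on, while the same event escalated three times becomes as urgent as a fresh Critical one.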

Page 62: Monitorama 2015 Monitoring OpenConnect CDN

Notifications

Severity (Info, Notice, Warning, Critical) by Escalation (0, 1, 2, 3), shown as two grids

Page 63: Monitorama 2015 Monitoring OpenConnect CDN

State processing: FSM

Data sources: Netflix Clients, Caches, Network, Config, ...

Orchestration

Data processing: stream processors, pollers

Events processing: event handlers

Page 64: Monitorama 2015 Monitoring OpenConnect CDN

Aggregation

Cache state = aggregation of states of its metrics

Cluster state = aggregation of states of its caches

OK: all OK
DEGRADED: some BROKEN or DEGRADED
BROKEN: most BROKEN

All caches are OK → cluster state is OK

Page 65: Monitorama 2015 Monitoring OpenConnect CDN

Aggregation

OK: all OK
DEGRADED: some BROKEN or DEGRADED
BROKEN: most BROKEN

2/12 caches are BROKEN → cluster state is DEGRADED

Page 66: Monitorama 2015 Monitoring OpenConnect CDN

Aggregation

OK: all OK
DEGRADED: some BROKEN or DEGRADED
BROKEN: most BROKEN

7/12 caches are BROKEN → cluster state is BROKEN
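The aggregation rules on these three slides can be sketched as a single function applied at two levels: metrics roll up into a cache state, and cache states roll up into a cluster state. The function names are hypothetical, and "most" is interpreted here as a strict majority, which is an assumption; the slides only give the 2/12 and 7/12 examples.

```python
from collections import Counter
from typing import Dict, Iterable, List


def aggregate(states: Iterable[str]) -> str:
    """Aggregate child states into one of OK, DEGRADED, BROKEN.

    OK:       all children are OK
    DEGRADED: some children are BROKEN or DEGRADED
    BROKEN:   most children are BROKEN (assumed: more than half)
    """
    states = list(states)
    if not states:
        raise ValueError("cannot aggregate an empty set of states")
    counts = Counter(states)
    if counts["BROKEN"] * 2 > len(states):
        return "BROKEN"
    if counts["BROKEN"] or counts["DEGRADED"]:
        return "DEGRADED"
    return "OK"


def cluster_state(metric_states_by_cache: Dict[str, List[str]]) -> str:
    """Cache state = aggregate of its metrics; cluster = aggregate of caches."""
    cache_states = [aggregate(m) for m in metric_states_by_cache.values()]
    return aggregate(cache_states)


# The three cases from the slides, for a cluster of 12 caches.
all_ok = aggregate(["OK"] * 12)
two_broken = aggregate(["BROKEN"] * 2 + ["OK"] * 10)
seven_broken = aggregate(["BROKEN"] * 7 + ["OK"] * 5)
```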

Page 67: Monitorama 2015 Monitoring OpenConnect CDN

State processing: FSM

Data sources: Netflix Clients, Caches, Network, Config, ...

Orchestration

Data processing: stream processors, pollers

Events processing: event handlers

Page 68: Monitorama 2015 Monitoring OpenConnect CDN

Challenges

Setup

Page 69: Monitorama 2015 Monitoring OpenConnect CDN

Challenges

Setup
Predefined groupings

Page 70: Monitorama 2015 Monitoring OpenConnect CDN

Challenges

Setup
Predefined groupings
UI

Page 71: Monitorama 2015 Monitoring OpenConnect CDN

Challenges

Setup
Predefined groupings
UI
Issues correlation

Page 72: Monitorama 2015 Monitoring OpenConnect CDN

Challenges

Setup
Predefined groupings
UI
Issues correlation
Failure forecasting

Page 73: Monitorama 2015 Monitoring OpenConnect CDN

Challenges

Setup
Predefined groupings
UI
Issues correlation
Failure forecasting
OSS

Page 74: Monitorama 2015 Monitoring OpenConnect CDN

Feedback

Page 75: Monitorama 2015 Monitoring OpenConnect CDN

jobs.netflix.com/jobs/1693/

jobs.netflix.com/jobs/2240/

Sergey Fedorov
OpenConnect, [email protected]