aws loft talk: behind the scenes with signalfx


Upload: signalfx

Post on 08-Aug-2015


TRANSCRIPT

Page 1: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

Page 2: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

Behind the Scenes with SignalFx

Phillip Liu [email protected]

@SignalFx - signalfx.com

Page 3: AWS Loft Talk: Behind the Scenes with SignalFx

Agenda

• Background

• Overview of Key SignalFx Services

• SignalFx infrastructure and operations

• Analytics approach to monitoring

• Code push side effects, an example

• Summary

Page 4: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

Background

Page 5: AWS Loft Talk: Behind the Scenes with SignalFx

About Me

[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics

[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house Analytics

[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk

[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool

[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software

[ … ]

Page 6: AWS Loft Talk: Behind the Scenes with SignalFx

About SignalFx

Page 7: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

Overview of SignalFx Services

Page 8: AWS Loft Talk: Behind the Scenes with SignalFx

A Microservices Definition

Loosely coupled service oriented architecture with bounded context.

Adrian Cockcroft

Page 9: AWS Loft Talk: Behind the Scenes with SignalFx

Overview of Key SignalFx Services

Page 10: AWS Loft Talk: Behind the Scenes with SignalFx

Microservice Complexity

More than 15 internal services. Services span hundreds of instances across multiple AZs.

Have dependencies on tens of external services.

Page 11: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

SignalFx Infrastructure

Page 12: AWS Loft Talk: Behind the Scenes with SignalFx

Amazon EC2

Page 13: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

Operations at SignalFx

Page 14: AWS Loft Talk: Behind the Scenes with SignalFx

Shared Responsibility

• Engineering is organized around the services it provides

• No dedicated operations team

• Each service team is responsible for building and operating their services

• Infrastructure team provides IaaS - DNS, LB, Mail, Server, and Network configuration and provisioning

• Ingest team provides Ingest API, Quantization, and TSDB services

Page 15: AWS Loft Talk: Behind the Scenes with SignalFx

Continuous Build and Deployment

• Services are built and tested on each commit

• Each service deploys at its own cadence

• Nearly all deployments are non-disruptive

• Push to lab, test; push to production canary, test; push to rest of prod

• Services are engineered to be resilient to partial cluster availability

• Each service is engineered to support +1/-1 upgrades
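The lab, canary, rest-of-prod sequence above can be pictured as a gated loop. The following is only a minimal sketch under stated assumptions: the stage names, the `deploy` callback, and the `is_healthy` check are hypothetical placeholders, not SignalFx tooling.

```python
def staged_rollout(stages, deploy, is_healthy):
    """Deploy to each stage in order, stopping at the first unhealthy one.

    `deploy` and `is_healthy` are hypothetical callbacks supplied by the
    caller. Returns (completed_stages, failed_stage_or_None).
    """
    completed = []
    for stage in stages:
        deploy(stage)
        if not is_healthy(stage):
            # Stop here so later stages never receive the bad build.
            return completed, stage
        completed.append(stage)
    return completed, None


def demo():
    deployed = []
    # Pretend the canary check fails, so the rest of prod is never touched.
    result = staged_rollout(
        ["lab", "canary", "rest-of-prod"],
        deploy=deployed.append,
        is_healthy=lambda stage: stage != "canary",
    )
    return deployed, result
```

In this sketch a failed canary check leaves the remaining production hosts on the previous version, which is the point of testing between pushes.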

Page 16: AWS Loft Talk: Behind the Scenes with SignalFx

On-call Rotation

• All devs are on a weekly on-call rotation (a couple of times a year each)

• On-call works on operational tools

• On-call rotates from lab -> production

• On-call is the incident manager

  • Owns driving both blackout and brownout incidents to resolution

Page 17: AWS Loft Talk: Behind the Scenes with SignalFx

Operations Tools

sfhost - CLI for VM configuration and provisioning

sfc - console to access management data for all services

signalscope - deep transaction tracing

maestro - Docker orchestrator

jenkins - continuous build and deployment

Page 18: AWS Loft Talk: Behind the Scenes with SignalFx

Monitoring

• We use SignalFx to monitor SignalFx

• Engineers instrument their code as part of dev process

• Each service provides at least one dashboard

• CollectD for OS and Docker metrics on all VMs

• Yammer metrics for all Java app servers

• Custom logger to count exception types
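The slides do not show the custom logger itself. As a rough illustration of the idea of counting exceptions by type, here is a minimal sketch using Python's standard `logging` module. The class name `ExceptionCountingHandler` and the demo logger name are assumptions made for this example, and a real deployment would report the counters to a metrics system rather than keep them in memory.

```python
import logging
from collections import Counter


class ExceptionCountingHandler(logging.Handler):
    """Hypothetical handler that counts logged exceptions by class name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # exc_info is populated when the call site uses logger.exception()
        # or passes exc_info=True.
        if record.exc_info and record.exc_info[0] is not None:
            self.counts[record.exc_info[0].__name__] += 1


def demo():
    logger = logging.getLogger("exception_count_demo")
    logger.propagate = False  # keep the demo quiet on stderr
    handler = ExceptionCountingHandler()
    logger.addHandler(handler)
    for bad_input in ("x", "y", None):
        try:
            int(bad_input)
        except Exception:
            logger.exception("parse failed")
    logger.removeHandler(handler)
    return dict(handler.counts)
```

Each exception type then becomes its own counter, which is what makes per-type charts and sums possible later.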

Page 19: AWS Loft Talk: Behind the Scenes with SignalFx

Monitoring - API Service Dashboard

Page 20: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

Analytics Approach to Monitoring

Page 21: AWS Loft Talk: Behind the Scenes with SignalFx

Monitoring Challenges

• High iteration rate leads to shortened test cycles

• Integration test combinations are intractable

• Catch problems during rolling deployments

• Identify upstream/downstream side effects

• e.g. backpressure

• Identify brownouts before the customer

• etc.

Page 22: AWS Loft Talk: Behind the Scenes with SignalFx

Analytics Approach to Monitoring

Measure

Page 23: AWS Loft Talk: Behind the Scenes with SignalFx

Analytics Approach to Monitoring

Analyze

Page 24: AWS Loft Talk: Behind the Scenes with SignalFx

Analytics Approach to Monitoring

Detect

Page 25: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

Examples

Page 26: AWS Loft Talk: Behind the Scenes with SignalFx

Code Push Side Effects - Time Series Router

Page 27: AWS Loft Talk: Behind the Scenes with SignalFx

Code Push Side Effects

Pushed a canary instance; the Metadata API dashboard showed a healthy tier.

Page 28: AWS Loft Talk: Behind the Scenes with SignalFx

Code Push Side Effects

However, the upstream UI dashboard showed an unusual number of timeouts.

Page 29: AWS Loft Talk: Behind the Scenes with SignalFx

Code Push Side Effects

In search of root cause. Always safe to start by looking at exception counts.

Can’t derive much from all the noise.

Page 30: AWS Loft Talk: Behind the Scenes with SignalFx

Code Push Side Effects

Sum the # of exceptions to create a single signal.
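The chart on this slide is not in the transcript. As a sketch of what "sum into a single signal" means, the function below collapses per-host exception counts into one series keyed by timestamp. The data layout (`{host: {timestamp: count}}`) and the host names are assumptions made for illustration.

```python
from collections import defaultdict


def sum_across_hosts(series_by_host):
    """Collapse {host: {timestamp: count}} into one {timestamp: total} series."""
    total = defaultdict(int)
    for points in series_by_host.values():
        for timestamp, count in points.items():
            total[timestamp] += count
    # Sort by timestamp so the result reads as a time series.
    return dict(sorted(total.items()))


def demo():
    # Hypothetical hosts reporting exception counts at two timestamps.
    return sum_across_hosts({
        "api-1": {0: 2, 60: 3},
        "api-2": {0: 1, 60: 0},
        "analytics-1": {60: 40},
    })
```

A single summed line is easier to read than one noisy line per host, at the cost of hiding which host contributes most.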

Page 31: AWS Loft Talk: Behind the Scenes with SignalFx

Code Push Side Effects

Compare sum with time-shifted sum from a day ago.
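A time-shift comparison can be sketched as follows, assuming the summed signal is a plain `{timestamp_seconds: value}` mapping. The function names are hypothetical and this is not the SignalFx analytics API.

```python
DAY_SECONDS = 86400


def timeshift(series, offset_seconds):
    """Move every point forward in time by offset_seconds."""
    return {ts + offset_seconds: value for ts, value in series.items()}


def compare_with_day_ago(series, offset_seconds=DAY_SECONDS):
    """Pair each point with the value observed offset_seconds earlier.

    Returns {timestamp: (current_value, earlier_value)} for timestamps
    that have data on both days.
    """
    shifted = timeshift(series, offset_seconds)
    return {
        ts: (value, shifted[ts])
        for ts, value in sorted(series.items())
        if ts in shifted
    }


def demo():
    series = {0: 10, 60: 12, 86400: 50, 86460: 11}
    return compare_with_day_ago(series)
```

Yesterday's value at the same time of day serves as a rough baseline, so a jump after a code push stands out even when the absolute count has a daily pattern.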

Page 32: AWS Loft Talk: Behind the Scenes with SignalFx

Code Push Side Effects

Look at an outlier host - an Analytics service host.
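The slides do not say how the outlier host was picked out. One simple approach, shown here purely as an assumption, flags hosts whose exception count is several times the median across hosts. The host names and the factor of 3 are illustrative.

```python
from statistics import median


def outlier_hosts(count_by_host, factor=3.0):
    """Return hosts whose count exceeds factor times the median count.

    The baseline is floored at 1 so that an all-zero fleet does not flag
    every host with a single exception.
    """
    baseline = max(median(count_by_host.values()), 1)
    return sorted(
        host for host, count in count_by_host.items()
        if count > factor * baseline
    )


def demo():
    return outlier_hosts({
        "analytics-1": 400,
        "api-1": 12,
        "api-2": 9,
        "ui-1": 15,
    })
```

The median is used instead of the mean so that the outlier itself does not drag the baseline upward.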

Page 33: AWS Loft Talk: Behind the Scenes with SignalFx

Code Push Side Effects

java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies
    at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_79]
    …

Looking at the Analytics service's logs revealed the source of the problem.

Page 34: AWS Loft Talk: Behind the Scenes with SignalFx

Code Push Side Effects

• Analytics across multiple microservices reduced the time to identify the problem. Push to resolution took ~15 min

• Service instrumentation helped narrow down the root cause

• Discovery allowed us to create a detector using analytics to notify us of similar problems in the future
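The detector itself is not shown in the slides. A minimal sketch of one possible rule, assuming aligned lists of current and day-ago exception sums, fires when the current value stays above a multiple of the baseline for several consecutive points. The ratio and streak length are illustrative defaults, not SignalFx settings.

```python
def detect(current, baseline, ratio=2.0, min_consecutive=3):
    """Return the index at which the alert fires, or None if it never does.

    Fires once `current` has exceeded ratio * baseline for min_consecutive
    points in a row. The baseline is floored at 1 to avoid firing on
    0 -> 1 noise.
    """
    streak = 0
    for index, (now, then) in enumerate(zip(current, baseline)):
        if now > ratio * max(then, 1):
            streak += 1
            if streak >= min_consecutive:
                return index
        else:
            streak = 0  # a single normal point resets the streak
    return None


def demo():
    current = [10, 30, 35, 9, 40, 45, 50]
    baseline = [10] * 7
    return detect(current, baseline)
```

Requiring a streak trades a little detection latency for fewer alerts on one-off spikes.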

Page 35: AWS Loft Talk: Behind the Scenes with SignalFx

Other Examples

• A customer started dropping data because they reverted to an unsupported API

• Compare TSDB write throughput of two different write strategies

• Create per-service capacity reports

• Identify memory usage patterns across our Analytics service

• Create a detector for every previously uncaught error condition - postmortem output

Page 36: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

Summary

Page 37: AWS Loft Talk: Behind the Scenes with SignalFx

Summary

• Microservice architecture is inherently complex

• Measure all the things

• Use data analytics techniques to

  • Identify problems

  • Chase down root cause

• Use intelligent detectors to catch recurrence

Page 38: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

Questions

Page 39: AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx

Thank You!

Phillip Liu [email protected]

WE’RE HIRING http://signalfx.com/careers.html