AWS Loft Talk: Behind the Scenes with SignalFx
TRANSCRIPT
SignalFx
Agenda
• Background
• Overview of Key SignalFx Services
• SignalFx infrastructure and operations
• Analytics approach to monitoring
• Code push side effects, an example
• Summary
Background
About Me
[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics
[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house Analytics
[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk
[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool
[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software
[ … ]
About SignalFx
Overview of SignalFx Services
A Microservices Definition
"Loosely coupled service oriented architecture with bounded context."
- Adrian Cockcroft
Overview of Key SignalFx Services
Microservice Complexity
More than 15 internal services, spanning hundreds of instances across multiple AZs.
They also depend on tens of external services.
SignalFx Infrastructure
Amazon EC2
Operations at SignalFx
Shared Responsibility
• Engineering is organized around the services teams provide
• No dedicated operations team
• Each service team is responsible for building and operating their services
• Infrastructure team provides IaaS - DNS, LB, Mail, Server, and Network configuration and provisioning
• Ingest team provides Ingest API, Quantization, and TSDB services
Continuous Build and Deployment
• Services are built and tested on each commit
• Each service team deploys at its own cadence
• Nearly all deployments are non-disruptive
• Push to lab, test; push to prod canary, test; then roll out to the rest of prod
• Services are engineered to be resilient to partial cluster availability
• Each service is engineered to support +1/-1 upgrades
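The staged pipeline described above (lab, then prod canary, then the rest of prod, testing at each step) could be sketched roughly as follows. `deploy` and `is_healthy` are hypothetical stand-ins for the real tooling (maestro, jenkins, and the service dashboards), not SignalFx's actual code:

```python
# Sketch of a staged rollout that halts on the first unhealthy stage.
# Stage names follow the talk; the functions are illustrative placeholders.

STAGES = ["lab", "prod-canary", "prod-rest"]

def deploy(stage, version):
    """Placeholder deploy step; returns True on success."""
    print(f"deploying {version} to {stage}")
    return True

def is_healthy(stage):
    """Placeholder health check; in reality this would query dashboards."""
    return True

def staged_rollout(version):
    for stage in STAGES:
        if not deploy(stage, version):
            return f"deploy failed at {stage}"
        if not is_healthy(stage):
            return f"unhealthy at {stage}, halting rollout"
    return "rollout complete"
```

The point of the structure is that a bad canary stops the rollout before it ever reaches the rest of prod.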
On-call Rotation
• All devs take a weekly on-call rotation (a couple of times a year each)
• On-call works on operational tools
• On-call rotates from lab -> production
• On-call is the incident manager
• Owns driving both blackout and brownout incidents to resolution
Operations Tools
sfhost - CLI for VM configuration and provisioning
sfc - console to access management data for all services
signalscope - deep transaction tracing
maestro - Docker orchestrator
jenkins - continuous build and deployment
Monitoring
• We use SignalFx to monitor SignalFx
• Engineers instrument their code as part of dev process
• Each service provides at least one dashboard
• CollectD for OS and Docker metrics on all VMs
• Yammer metrics for all Java app servers
• Custom logger to count exception types
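A custom logger that counts exception types, as mentioned above, might look like this minimal sketch. The class and logger names are illustrative, not SignalFx's actual implementation:

```python
import logging
from collections import Counter

class ExceptionCounter(logging.Handler):
    """Logging handler that tallies log records by exception class.
    Illustrative sketch of a 'custom logger to count exception types';
    a real version would emit these counts as metrics."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # record.exc_info is set when logging via logger.exception(...)
        if record.exc_info:
            exc_type = record.exc_info[0].__name__
            self.counts[exc_type] += 1

log = logging.getLogger("svc")
counter = ExceptionCounter()
log.addHandler(counter)

try:
    raise ValueError("bad input")
except ValueError:
    log.exception("request failed")
```

After the logged exception, `counter.counts` holds `{"ValueError": 1}`, which is exactly the per-exception-type signal the later examples aggregate.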
Monitoring - API Service Dashboard
SignalFx
Analytics Approach to Monitoring
Monitoring Challenges
• High iteration rate leads to shortened test cycles
• Integration test combinations are intractable
• Catch problems during rolling deployments
• Identify upstream/downstream side effects
• e.g. backpressure
• Identify brownouts before the customer
• etc.
Analytics Approach to Monitoring
Measure
Analytics Approach to Monitoring
Analyze
Analytics Approach to Monitoring
Detect
SignalFx
Examples
Code Push Side Effects - Time Series Router
Code Push Side Effects
Pushed the canary instance, and the Metadata API dashboard shows a healthy tier.
Code Push Side Effects
However, the upstream UI dashboard showed an unusual number of timeouts.
Code Push Side Effects
In search of the root cause. It's always safe to start by looking at exception counts. Can't derive much from all the noise.
Code Push Side Effects
Sum the # of exceptions to create a single signal.
Code Push Side Effects
Compare sum with time-shifted sum from a day ago.
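The two analytics steps just shown (sum the per-host exception counts into one signal, then compare against the same signal time-shifted by a day) can be illustrated with a small sketch. The data here is synthetic; in practice SignalFx computes this over live streaming metrics:

```python
# Step 1: collapse per-host exception-count series into a single signal.
# Step 2: subtract the day-ago (time-shifted) signal to expose anomalies.

def total_signal(per_host_series):
    """Sum exception counts across hosts at each timestamp."""
    return [sum(vals) for vals in zip(*per_host_series)]

def day_over_day_delta(today, yesterday):
    """Difference between today's signal and the time-shifted baseline."""
    return [t - y for t, y in zip(today, yesterday)]

# Two hosts, three timestamps each; one host spikes at the last timestamp.
today = total_signal([[1, 2, 40], [0, 1, 38]])
yesterday = total_signal([[1, 2, 3], [0, 1, 2]])
delta = day_over_day_delta(today, yesterday)  # large final value flags the spike
```

The time-shifted comparison is what separates a genuine anomaly from a normal daily pattern.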
Code Push Side Effects
Look at an outlier host - an Analytics service host.
Code Push Side Effects
java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies
    at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_79]
    …
Looking at the Analytics service's logs revealed the source of the problem.
Code Push Side Effects
• Analytics across multiple microservices reduced the time to identify the problem; from push to resolution was ~15 minutes
• Service instrumentation helped narrow down the root cause
• The discovery allowed us to create a detector using analytics to notify us of similar problems in the future
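The detector mentioned above can be sketched as a simple rule over the summed signal and its time-shifted baseline. The ratio threshold here is an illustrative choice, not SignalFx's actual detector logic:

```python
def detector(signal, baseline, ratio=3.0):
    """Fire an alert wherever the current signal exceeds `ratio` times the
    time-shifted baseline. `max(base, 1)` avoids firing on a zero baseline.
    The threshold is illustrative, not SignalFx's production setting."""
    return [cur > ratio * max(base, 1) for cur, base in zip(signal, baseline)]

# Using the summed-exception example: only the final spike should fire.
alerts = detector([1, 3, 78], [1, 3, 5])
```

Encoding the postmortem's finding as a detector like this is what lets the system catch a recurrence automatically instead of relying on someone watching a dashboard.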
Other Examples
• A customer started dropping data because they reverted to an unsupported API
• Compare TSDB write throughput of two different write strategies
• Create per-service capacity reports
• Identify memory usage patterns across our Analytics service
• Create a detector for every previously uncaught error condition - postmortem output
Summary
• Microservice architecture is inherently complex
• Measure all the things
• Use data analytics techniques to:
  • Identify problems
  • Chase down root cause
• Use intelligent detectors to catch recurrence
Questions