Observability foundations in dynamically evolving architectures
TRANSCRIPT
Boyan Dimitrov, Director Platform Engineering, Sixt
@nathariel
Cloud architectures are getting complex
Many small services and functions dedicated to one thing only
Multiple use-case driven persistence options
Streaming pipelines
Complex ETL pipelines
Many clients
Serverless workflows
Is my business running? How is my workflow doing? How are my applications/infra doing?
What is happening? Why is it happening? What is about to happen?
Same “old” challenges remain
We need visibility into all components; a lot of it is delivered for free by cloud providers and APM vendors
Today we will focus on the components we control
It's a complex world of dependencies
Pillars of observability
Events
Metrics
Traces
Analytics / A.I.
Alerting
Visualisation
Scope for today
The journal of what happened with an application, curated by engineers.
Great when you have identified a failing component and want to know more
Logging: intro
✓ Structured logging
✓ Add correlation-ids
✓ Add context (service identifier, Docker container meta, ECS Task definition…)
✓ Agree on standard keys:
• "severity": "WARNING"
• "level": "WARN"
• "criticality": "W"
Logging: best practices
{ … "_source": { "stream": "stderr", "time": "2016-06-27T14:48:39.15693871Z", "correlation-id": "2e71f0ee-aab0-4fd0-87e7-530a9b26a37f", "level": "debug", "logger": "handler", "message": "Handling MintSnowflakeID()", "service": "com.sixt.service.idgen", "service-id": "51494694-37c1-11e6-955e-02420afe040e", "docker": { … } ”k8s": { … }}
DEBUG[2017-09-27T14:48:39+02:00] Handling MintSnowflakeId() correlation-id=1365c96e-281d-44aa-805a-1072ab165de6 ip-address=192.168.99.1 logger=handler service=com.sixt.service.idgen service-id=a122da41-3d2e-11e6-8158-a0999b047f3f service-version=0.3.0
Whan an engineer seeslocally
Logging in actionActual formatLogging in action
// ConfigureForService configures the default logger for a given micro.Service ONCE
func ConfigureForService(service micro.Service) {
	serverOptions := service.Server().Options()
	defaultLogger.AddFields(Fields{
		"service":         serverOptions.Name,
		"service-id":      serverOptions.Id,
		"service-version": serverOptions.Version,
	})
}

//…

// Always pass your req context in your handler
logger = logger.WithContext(ctx)
logger.WithFields(log.Fields{
	"param": req.Param,
}).Debug("Handling MintSnowflakeID()")

Logging context in your code
Expressing and evaluating application health in code
"isHealthy": "True"
Application Health Checks: intro
✓ Make those health checks available as endpoints or on the event stream (see the sketch below)
✓ Use tags / metadata
✓ Expose them to your app provisioner / orchestrator (ECS, K8S, Mesos…)
✓ React and/or notify on them
Application Health Checks: best practices
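A minimal sketch of such a health endpoint, assuming a plain net/http service; the /health path, the response shape, and the checkDependencies helper are illustrative assumptions, not a fixed standard:

package main

import (
	"encoding/json"
	"net/http"
)

// checkDependencies is a hypothetical probe of whatever this service
// depends on (database, downstream services…); always healthy here.
func checkDependencies() bool {
	return true
}

// healthHandler reports application health so that an orchestrator
// (ECS, K8S, Mesos…) or an alerting pipeline can react on it.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	healthy := checkDependencies()
	status := http.StatusOK
	if !healthy {
		status = http.StatusServiceUnavailable
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(map[string]interface{}{
		"isHealthy": healthy,
		// Tags / metadata make the check correlatable with other signals.
		"service": "com.sixt.service.idgen",
	})
}

func main() {
	http.HandleFunc("/health", healthHandler)
	http.ListenAndServe(":8080", nil)
}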
[Diagram: services reporting isHealthy:true; one service reports isHealthy:false and an alert is raised]
Application Health Checks: why is it important?
Trigger compensation actions on failure
Expose service readiness to your orchestrator
Inform
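To illustrate the "react and/or notify" side, a hedged sketch of a watcher that polls a health endpoint like the one above and triggers a compensation action on failure; in practice this job belongs to your orchestrator or alerting pipeline:

package main

import (
	"log"
	"net/http"
	"time"
)

// watchHealth polls a service's health endpoint and reacts when it
// stops reporting healthy.
func watchHealth(url string, interval time.Duration) {
	for range time.Tick(interval) {
		resp, err := http.Get(url)
		if err != nil {
			log.Printf("health probe failed: %v", err)
			continue
		}
		healthy := resp.StatusCode == http.StatusOK
		resp.Body.Close()
		if !healthy {
			// Hypothetical compensation hook: alert the owning team,
			// restart or replace the instance, etc.
			log.Printf("service unhealthy (%d), compensating", resp.StatusCode)
		}
	}
}

func main() {
	watchHealth("http://localhost:8080/health", 30*time.Second)
}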
Relevant happenings around the system:
• Instance rollouts
• Configuration changes
• Deployments
Changes
Sporadic in nature but can influence your system
Great for tracing interesting system behaviours
Events are correlatable
Legend:
• High correlation (no smart tooling)
• Low / Medium correlation (needs smart tooling)
• tag(svc,hc): simply correlated by using health-check id and service name
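As a concrete illustration, a hedged sketch of emitting such a correlatable event in Go; the Event struct and field names are hypothetical, the point is carrying the tag(svc,hc) context:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event is a hypothetical payload for system happenings such as
// instance rollouts, configuration changes and deployments.
type Event struct {
	Type          string    `json:"type"`
	Timestamp     time.Time `json:"timestamp"`
	CorrelationID string    `json:"correlation-id"`
	// Tags carry the context that makes the event correlatable
	// with health checks, metrics and traces: tag(svc,hc).
	Tags map[string]string `json:"tags"`
}

func main() {
	e := Event{
		Type:          "deployment",
		Timestamp:     time.Now(),
		CorrelationID: "2e71f0ee-aab0-4fd0-87e7-530a9b26a37f", // normally generated per request
		Tags: map[string]string{
			"service":      "com.sixt.service.idgen",
			"health-check": "idgen-readiness", // hypothetical health-check id
		},
	}
	b, _ := json.Marshal(e)
	fmt.Println(string(b))
}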
Metrics: intro
Counter: simple numerical value that only goes up (requests, errors)
Gauge: arbitrary value that gets recorded and can go up or down (current threads, used memory…)
Histogram: used for sampling and categorization of observations in buckets per type, time, etc.
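A minimal sketch of all three metric types, using the Prometheus Go client as one possible library; the metric names are illustrative:

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: only goes up (requests, errors).
	requests = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "handler_requests_total",
		Help: "Total number of handled requests.",
	})
	// Gauge: can go up or down (current threads, used memory…).
	inflight = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "handler_inflight_requests",
		Help: "Requests currently being handled.",
	})
	// Histogram: buckets observations, from which percentiles
	// (99th, 95th, 75th…) can be derived.
	latency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "handler_duration_seconds",
		Help:    "Request handling latency.",
		Buckets: prometheus.DefBuckets,
	})
)

func main() {
	prometheus.MustRegister(requests, inflight, latency)
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		inflight.Inc()
		defer inflight.Dec()
		start := time.Now()
		// … do the actual work …
		requests.Inc()
		latency.Observe(time.Since(start).Seconds())
	})
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}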
✓ Instrument as much as possible
✓ Focus on the important things: response times, errors, traffic
✓ Use the right percentiles for your use case: 99th, 95th, 75th
✓ Use tags (context!) if possible
✓ Agree on some naming standards; it helps ;)
✓ Build the right visualization: common vs specific
Metrics: best practices
// Somewhere in your handler
tags := map[string]string{
	"method":         req.Method(),
	"origin_service": fromService,
	"origin_method":  fromMethod,
}

err := fn(ctx, req, rsp)

// Instrument errors
if err != nil {
	TaggedCounter(tags, 1.0, "server_handler", "error", 1)
	TaggedTiming(tags, 1.0, "server_handler", "error", time.Since(start))
} else {
	// Otherwise, success!
	TaggedCounter(tags, 1.0, "server_handler", "success", 1)
	TaggedTiming(tags, 1.0, "server_handler", "success", time.Since(start))
}
Metrics: in action
The journey of a request (transaction)
What is tracing?
Trace: a collection of linked spans. Span: a timed unit of work within a service
[Diagram: three services connected by RPC calls, each hop carrying the trace context]

Trace-Id: 1, Span: A, Parent-Span: none
Trace-Id: 1, Span: B, Parent-Span: A
Trace-Id: 1, Span: C, Parent-Span: A
How does it work?
Tracing: intro
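To make the Trace-Id / Parent-Span mechanics concrete, here is a hedged sketch of propagating Zipkin-style B3 headers by hand; real tracers do this inside their instrumentation, and the ids here are illustrative (real ones are random hex, not "A" or "B"):

package main

import (
	"context"
	"net/http"
)

// callDownstream makes an outgoing RPC from span A, passing the trace
// context along so the callee can create child spans (B, C…).
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// B3 propagation headers as used by Zipkin-compatible tracers.
	req.Header.Set("X-B3-TraceId", "1")      // stays the same for the whole trace
	req.Header.Set("X-B3-SpanId", "B")       // id of the new child span
	req.Header.Set("X-B3-ParentSpanId", "A") // the calling span
	return http.DefaultClient.Do(req)
}

func main() {
	resp, err := callDownstream(context.Background(), "http://localhost:8080/mint")
	if err == nil {
		resp.Body.Close()
	}
}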
Why is this useful?
Debug distributed workflows (think microservices or your dozen Lambda functions)
Aggregate what happened ( timings, errors )
Identify latency bottlenecks
Highlight deps between services
AWS X-Ray
Several leading standards: Zipkin, OpenTracing, X-Ray, Stackdriver… (possible to convert one into another)
Many open source and managed tracers to choose from: Zipkin, Appdash, Jaeger, SkyWalking, Instana, LightStep, X-Ray, Stackdriver…
Tracing: available tooling
Most tracers today are based on or influenced by the Google Dapper paper
✓ Add correlation data and context (see the sketch below)
✓ Use logs and baggage (if available)
✓ Make sure your framework is instrumented
✓ Enable sampling for high-throughput systems
Tracing: best practices
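A hedged sketch of these practices against the opentracing-go API; tracer setup is omitted (a no-op tracer is used by default), and the operation, tag and baggage values are illustrative:

package main

import (
	"context"

	"github.com/opentracing/opentracing-go"
)

// handleMint continues the trace from the incoming request context.
func handleMint(ctx context.Context) error {
	span, ctx := opentracing.StartSpanFromContext(ctx, "MintSnowflakeID")
	defer span.Finish()

	// Correlation data and context as span tags.
	span.SetTag("service", "com.sixt.service.idgen")

	// Logs attached to the span itself.
	span.LogKV("event", "handling request")

	// Baggage propagates across process boundaries with the trace.
	span.SetBaggageItem("correlation-id", "2e71f0ee-aab0-4fd0-87e7-530a9b26a37f")

	// Sampling is configured on the tracer, not per span.
	// Call downstream services with ctx so child spans link up.
	_ = ctx
	return nil
}

func main() {
	_ = handleMint(context.Background())
}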
Simple examples once all in place

Metrics (Workflow A): error rate increased by 80% in the last 5 min
→ Trace (sample failing request): Service X is timing out
→ Logs (Service X logs): deadlock errors
Metrics (some event ingestion workflow): partition lag high-water mark reached
→ Health check (Ingester Instance X): has not processed any messages for 5 min
→ Logs (Ingester Instance X): failed to act on repartitioning
→ Compensation policy (container orchestrator): add / replace ingestors
• Collection: CloudTrail, CloudWatch, X-Ray, VPC Flow Logs
• Signals: traces, app & OS logs, metrics, health checks
• Storage & analytics: ElasticSearch, S3, Athena, QuickSight

Sample architecture to get started
AWS managed services already get you far
• Collection: external provider, CloudTrail, CloudWatch, VPC Flow Logs, Fluentd, Kinesis
• Signals: traces, metrics, health checks, app & OS logs
• Storage: ElasticSearch, S3

Our architecture
✓ Basic investments in instrumentation, logging and tracing pay off in the long run
✓ Share context between different observability systems so that you can correlate them
✓ Once you have the basics, it is easy to visualize relationships and work on causation even without "smart" tooling
✓ Having separate systems ensures no single point of failure and enables power users
To sum up
Pillars of observability: recap
Events
Metrics
Traces
Analytics / A.I.
Alerting
Visualisation
✓ Alert based on severity, escalating to the right team
✓ Aggregate & index all data
✓ Identify dependencies between applications and components
✓ Identify patterns and potential issues automatically
✓ Forecasting
✓ Use-case driven visualisation
You may not want to build this on your own