Observability foundations in dynamically evolving architectures
TRANSCRIPT
Boyan Dimitrov, Director Platform Engineering, Sixt
@nathariel
Cloud architectures are getting complex
Many small services and functions dedicated to one thing only
Multiple use-case driven persistence options
Streaming pipelines
Complex ETL pipelines
Many clients
Serverless workflows
Is my business running? How is my workflow doing? How are my applications/infra doing?
What is happening? Why is it happening? What is about to happen?
Same “old” challenges remain
We need visibility into all components; a lot of it is delivered for free by cloud providers and APM vendors
Today we will focus on the components we control
It's a complex world of dependencies
Pillars of observability
Events
Metrics
Traces
Analytics / A.I.
Alerting
Visualisation
Scope for today
The journal of what happened with an application, curated by engineers.
Great when you have identified a failing component and want to know more
Logging: intro
✓ Structured logging
✓ Add correlation-ids
✓ Add context (service identifier, Docker container meta, ECS Task definition…)
✓ Agree on standard keys:
• "severity": "WARNING"
• "level": "WARN"
• "criticality": "W"
Logging: best practices
{ … "_source": { "stream": "stderr", "time": "2016-06-27T14:48:39.15693871Z", "correlation-id": "2e71f0ee-aab0-4fd0-87e7-530a9b26a37f", "level": "debug", "logger": "handler", "message": "Handling MintSnowflakeID()", "service": "com.sixt.service.idgen", "service-id": "51494694-37c1-11e6-955e-02420afe040e", "docker": { … } ”k8s": { … }}
DEBUG[2017-09-27T14:48:39+02:00] Handling MintSnowflakeId() correlation-id=1365c96e-281d-44aa-805a-1072ab165de6 ip-address=192.168.99.1 logger=handler service=com.sixt.service.idgen service-id=a122da41-3d2e-11e6-8158-a0999b047f3f service-version=0.3.0
Whan an engineer seeslocally
Logging in actionActual formatLogging in action
// ConfigureForService configures the default logger for a given micro.Service ONCE
func ConfigureForService(service micro.Service) {
	serverOptions := service.Server().Options()
	defaultLogger.AddFields(Fields{
		"service":         serverOptions.Name,
		"service-id":      serverOptions.Id,
		"service-version": serverOptions.Version,
	})
}

//…

// Always pass your req context in your handler
logger = logger.WithContext(ctx)
logger.WithFields(log.Fields{
	"param": req.Param,
}).Debug("Handling MintSnowflakeID()")

Logging context in your code
Expressing and evaluating application health in code
"isHealthy": "True"
Application Health Checks: intro
✓ Make those health checks available as endpoints or on the event stream (see the sketch below)
✓ Use tags / metadata
✓ Expose them to your app provisioner / orchestrator (ECS, K8S, Mesos…)
✓ React and/or notify on them
Application Health Checks: best practices
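A minimal sketch of such a health endpoint, assuming a plain net/http service; the /health path, the response shape, and the checkDependencies helper are illustrative assumptions, not a fixed standard:

package main

import (
	"encoding/json"
	"net/http"
)

// checkDependencies is a hypothetical probe of whatever this service
// depends on (database, downstream services…); always healthy here.
func checkDependencies() bool {
	return true
}

// healthHandler reports application health so that an orchestrator
// (ECS, K8S, Mesos…) or an alerting pipeline can react on it.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	healthy := checkDependencies()
	status := http.StatusOK
	if !healthy {
		status = http.StatusServiceUnavailable
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(map[string]interface{}{
		"isHealthy": healthy,
		// Tags / metadata make the check correlatable with other signals.
		"service": "com.sixt.service.idgen",
	})
}

func main() {
	http.HandleFunc("/health", healthHandler)
	http.ListenAndServe(":8080", nil)
}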
[Diagram: services reporting isHealthy:true; one service reports isHealthy:false and an alert is raised]
Application Health Checks: why is it important?
Trigger compensation actions on failure
Expose service readiness to your orchestrator
Inform
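To illustrate the "react and/or notify" side, a hedged sketch of a watcher that polls a health endpoint like the one above and triggers a compensation action on failure; in practice this job belongs to your orchestrator or alerting pipeline:

package main

import (
	"log"
	"net/http"
	"time"
)

// watchHealth polls a service's health endpoint and reacts when it
// stops reporting healthy.
func watchHealth(url string, interval time.Duration) {
	for range time.Tick(interval) {
		resp, err := http.Get(url)
		if err != nil {
			log.Printf("health probe failed: %v", err)
			continue
		}
		healthy := resp.StatusCode == http.StatusOK
		resp.Body.Close()
		if !healthy {
			// Hypothetical compensation hook: alert the owning team,
			// restart or replace the instance, etc.
			log.Printf("service unhealthy (%d), compensating", resp.StatusCode)
		}
	}
}

func main() {
	watchHealth("http://localhost:8080/health", 30*time.Second)
}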
Relevant happenings around the system:
• Instance rollouts
• Configuration changes
• Deployments
Changes
Sporadic in nature but can influence your system
Great for tracing interesting system behaviours
Events are correlatable
Legend:
• High correlation (no smart tooling)
• Low / Medium correlation (needs smart tooling)
• tag(svc,hc): simply correlated by using health-check id and service name
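As a concrete illustration, a hedged sketch of emitting such a correlatable event in Go; the Event struct and field names are hypothetical, the point is carrying the tag(svc,hc) context:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event is a hypothetical payload for system happenings such as
// instance rollouts, configuration changes and deployments.
type Event struct {
	Type          string    `json:"type"`
	Timestamp     time.Time `json:"timestamp"`
	CorrelationID string    `json:"correlation-id"`
	// Tags carry the context that makes the event correlatable
	// with health checks, metrics and traces: tag(svc,hc).
	Tags map[string]string `json:"tags"`
}

func main() {
	e := Event{
		Type:          "deployment",
		Timestamp:     time.Now(),
		CorrelationID: "2e71f0ee-aab0-4fd0-87e7-530a9b26a37f", // normally generated per request
		Tags: map[string]string{
			"service":      "com.sixt.service.idgen",
			"health-check": "idgen-readiness", // hypothetical health-check id
		},
	}
	b, _ := json.Marshal(e)
	fmt.Println(string(b))
}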
Metrics: intro
Counter: simple numerical value that only goes up (requests, errors)
Gauge: arbitrary value that gets recorded and can go up or down (current threads, used memory…)
Histogram: used for sampling and categorization of observations in buckets per type, time, etc.
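A minimal sketch of all three metric types, using the Prometheus Go client as one possible library; the metric names are illustrative:

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: only goes up (requests, errors).
	requests = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "handler_requests_total",
		Help: "Total number of handled requests.",
	})
	// Gauge: can go up or down (current threads, used memory…).
	inflight = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "handler_inflight_requests",
		Help: "Requests currently being handled.",
	})
	// Histogram: buckets observations, from which percentiles
	// (99th, 95th, 75th…) can be derived.
	latency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "handler_duration_seconds",
		Help:    "Request handling latency.",
		Buckets: prometheus.DefBuckets,
	})
)

func main() {
	prometheus.MustRegister(requests, inflight, latency)
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		inflight.Inc()
		defer inflight.Dec()
		start := time.Now()
		// … do the actual work …
		requests.Inc()
		latency.Observe(time.Since(start).Seconds())
	})
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}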
✓ Instrument as much as possible
✓ Focus on the important things: response times, errors, traffic
✓ Use the right percentiles for your use case: 99th, 95th, 75th
✓ Use tags (context!) if possible
✓ Agree on some naming standards; it helps ;)
✓ Build the right visualization: common vs specific
Metrics: best practices
// Somewhere in your handler
tags := map[string]string{
	"method":         req.Method(),
	"origin_service": fromService,
	"origin_method":  fromMethod,
}

err := fn(ctx, req, rsp)

// Instrument errors
if err != nil {
	TaggedCounter(tags, 1.0, "server_handler", "error", 1)
	TaggedTiming(tags, 1.0, "server_handler", "error", time.Since(start))
} else {
	// Otherwise, success!
	TaggedCounter(tags, 1.0, "server_handler", "success", 1)
	TaggedTiming(tags, 1.0, "server_handler", "success", time.Since(start))
}
Metrics: in action
The journey of a request (transaction)
What is tracing?
Trace: a collection of linked spans. Span: a timed unit of work within a service
[Diagram: three services connected by RPC calls, each hop carrying the trace context]

Trace-Id: 1, Span: A, Parent-Span: none
Trace-Id: 1, Span: B, Parent-Span: A
Trace-Id: 1, Span: C, Parent-Span: A
How does it work?
Tracing: intro
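To make the Trace-Id / Parent-Span mechanics concrete, here is a hedged sketch of propagating Zipkin-style B3 headers by hand; real tracers do this inside their instrumentation, and the ids here are illustrative (real ones are random hex, not "A" or "B"):

package main

import (
	"context"
	"net/http"
)

// callDownstream makes an outgoing RPC from span A, passing the trace
// context along so the callee can create child spans (B, C…).
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// B3 propagation headers as used by Zipkin-compatible tracers.
	req.Header.Set("X-B3-TraceId", "1")      // stays the same for the whole trace
	req.Header.Set("X-B3-SpanId", "B")       // id of the new child span
	req.Header.Set("X-B3-ParentSpanId", "A") // the calling span
	return http.DefaultClient.Do(req)
}

func main() {
	resp, err := callDownstream(context.Background(), "http://localhost:8080/mint")
	if err == nil {
		resp.Body.Close()
	}
}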
Why is this useful?
Debug distributed workflows (think microservices or your dozen Lambda functions)
Aggregate what happened ( timings, errors )
Identify latency bottlenecks
Highlight deps between services
AWS X-Ray
Several leading standards: Zipkin, OpenTracing, X-Ray, Stackdriver… (possible to convert one into another)
Many open source and managed tracers to choose from: Zipkin, Appdash, Jaeger, SkyWalking, Instana, LightStep, X-Ray, Stackdriver…
Tracing: available tooling
Most tracers today are based on or influenced by the Google Dapper paper
✓ Add correlation data and context (see the sketch below)
✓ Use logs and baggage (if available)
✓ Make sure your framework is instrumented
✓ Enable sampling for high-throughput systems
Tracing: best practices
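A hedged sketch of these practices against the opentracing-go API; tracer setup is omitted (a no-op tracer is used by default), and the operation, tag and baggage values are illustrative:

package main

import (
	"context"

	"github.com/opentracing/opentracing-go"
)

// handleMint continues the trace from the incoming request context.
func handleMint(ctx context.Context) error {
	span, ctx := opentracing.StartSpanFromContext(ctx, "MintSnowflakeID")
	defer span.Finish()

	// Correlation data and context as span tags.
	span.SetTag("service", "com.sixt.service.idgen")

	// Logs attached to the span itself.
	span.LogKV("event", "handling request")

	// Baggage propagates across process boundaries with the trace.
	span.SetBaggageItem("correlation-id", "2e71f0ee-aab0-4fd0-87e7-530a9b26a37f")

	// Sampling is configured on the tracer, not per span.
	// Call downstream services with ctx so child spans link up.
	_ = ctx
	return nil
}

func main() {
	_ = handleMint(context.Background())
}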
Simple examples once all in place

Metrics (Workflow A): error rate increased by 80% in the last 5 min
→ Trace (sample failing request): Service X is timing out
→ Logs (Service X logs): deadlock errors
Metrics (some event ingestion workflow): partition lag high-water mark reached
→ Health check (Ingester Instance X): has not processed any messages for 5 min
→ Logs (Ingester Instance X): failed to act on repartitioning
→ Compensation policy (container orchestrator): add / replace ingestors
• Collection: CloudTrail, CloudWatch, X-Ray, VPC Flow Logs
• Signals: traces, app & OS logs, metrics, health checks
• Storage & analytics: ElasticSearch, S3, Athena, QuickSight

Sample architecture to get started
AWS managed services already get you far
• Collection: external provider, CloudTrail, CloudWatch, VPC Flow Logs, Fluentd, Kinesis
• Signals: traces, metrics, health checks, app & OS logs
• Storage: ElasticSearch, S3

Our architecture
✓ Basic investments in instrumentation, logging and tracing pay off in the long run
✓ Share context between different observability systems so that you can correlate them
✓ Once you have the basics, it is easy to visualize relationships and work on causation even without "smart" tooling
✓ Having separate systems ensures no single point of failure and enables power users
To sum up
Pillars of observability: recap
Events
Metrics
Traces
Analytics / A.I.
Alerting
Visualisation
✓ Alert based on severity, escalating to the right team
✓ Aggregate & index all data
✓ Identify dependencies between applications and components
✓ Identify patterns and potential issues automatically
✓ Forecasting
✓ Use-case driven visualisation
You may not want to build this on your own