distributed tracing @ jet - usenix · problem: bad ui/ux solution: 33 •current frameworks...

40
Distributed Tracing @ Jet

Upload: others

Post on 22-Aug-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Distributed Tracing @ Jet

Page 2: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on
Page 3: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Gina MainiSenior Platform Engineer

3

@wiredsis

Page 4: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Leo Gorodinski

Hussam Abu-Libdeh

Erich Ess

4

@egarhardess

@eulerfx

@hussam

Page 5: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Not Your Average E-Comm 5

• Event Sourcing• F# Language• Multi-region• Containers• Asynchronous• Hosted in Azure

Page 6: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Why does Jet care about Distributed Tracing?

Page 7: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

7

“Show me all inventory updates which failed for a given merchant.”

Page 8: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

8

CHANGE SKU

HTTP REQ

CHANGE SKUHTTP REPLY

USER

MERCHANT API

CATALOG

SEARCH

INVENTORY INDEX SKU

MESSAGE BUS

KAFKA

PUBLISHMESSAGE BUS

ELASTIC SEARCHTopology Diagram:“A merchant updates a SKU.”

Page 9: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

9USER MERCHANT CATALOG SEARCH

UPDATE SKU ENQUEUE

DEQUEUE

INDEX SKUVisualizingCommunications By Services:“A merchant updates a SKU.”

PUBLISH

Page 10: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

10USER MERCHANT CATALOG SEARCH

START

COMPLETE

COMPLETE

START

STARTSTART

START

COMPLETE

INFO EVENT

SYNC

ASYNCASYNC

ASYNC

SYNC

VisualizingCommunicationsBy Type:“A merchant updates a SKU.”

COMPLETE

Page 11: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

11

Visualizing AnEvent Log :“A merchant updates a SKU.”

TIME

START

START

COMPLETE

COMPLETE

START

START

COMPLETE

COMPLETE

INFO EVENT

START

COMPLETE

Page 12: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

What exists in the ecosystem today?

Page 13: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Ecosystem Overview: 13

• OpenTracing a vendor-neutral open standard for distributed tracing. Influenced by Dapper & Zipkin.

• Dapper Google’s tracing platform, used mostly for RPC interactions.

• Zipkin Twitter’s tracing system based on Dapper.

• OpenCensus A single distribution of libraries for metrics and distributed tracing with minimal overhead that allows you to export data to multiple backends.

Page 14: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

We decided to make a customDistributed Tracing platform.

WCGW?

Page 15: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on
Page 16: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on
Page 17: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Problem: Can’t Trace All The Things All At OnceSolution:

17

• Treat it like a traditional MVP for a product, not R&D.

• Differentiate solutions on persona needs.• Leverage cross-team coordination to identify

which flows people “want to trace” and “need to trace.” Find Tracing ”Sponsors” in your company.

Page 18: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Problem: Unwieldy Over-The-Wire SpecificationSolution:

18

• Emit “baggage” both scoped to the Span & the entire Trace as ‘events’

• Since events are immutable and written to a log with a sequence number (using Kafka) we avoid mutable global state.

• Minimalist wire spec containing just Guids and “Operation Names”

Page 19: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

type Header =

{HeaderVersion : intCorrelationIds : string listMessageId : stringParentIds : string listProducerId : stringPayloadSchemaUri : stringTicksFromEpoch : int64MTags : Map<string,string>Ptags : Map<string,string>

}

19

Old Metadata Format:

Page 20: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

type TraceContext =

{// a GUID which represents the // instance of the traced actiontrace_id : string

// the operation name which generated // this trace dataop : string

// optional tags which don’t get emitted in carrier // just key/value pairstags : TraceTags

}

20

New Metadata Format:

Page 21: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Problem: Not Great Library, Not Easy To Add-In TracingSolution:

21

• Auto-calculate Span Latency information• Integrate with Nomad Metadata• Computation Expressions in F#• Compatibility with multiple mediums: Kafka,

Event Streaming Libs, Azure Databases, HTTP, etc• Bussing of Telemetry Events

Page 22: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Propagation Overview: 22

µH1

P1

H1’

P1’

H2’

P2’ H2

P2

PROCESSING OUTIN

DESERIALIZATION SERIALIZATION

PROPAGATION

OVER WIRE OUTPUTOVER WIRE INPUT

Page 23: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

module Envelope =

/// A message in its serialized statetype Serialized = Serialized

/// The message read from a data sourcetype In = In

// Currently in processtype InProcess = InProcess

/// The message which will be written to a data sourcetype Out = Out

23

Thinking in Envelopes

Page 24: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

let prepareOutEnvelopeserviceName schemaUri (outEnvelope:Envelope<'kind, _>): Envelope<Out,_> =

let header =Header.Propagate serviceName schemaUri [outEnvelope.Header]

{ Header = headerPayload = outEnvelope.Payload }

let prepareOriginEnvelope serviceName schemaUri outPayload: Envelope<Out,_> =

let header = Header.Origin serviceName schemaUri{ Header = headerPayload = outPayload }

24

Thinking in Envelopes, Phantom Types

Page 25: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

// The madness continues…

let map f (envelope:Envelope<'kind,_>) : Envelope<InProcess,_> ={ Header = envelope.Header; Payload = f envelope.Payload }

let resultMap(f: 'a -> Result<'b, _>) (envelope:(Envelope<In, Result<'a, _>>)): Envelope<InProcess, _> =

envelope |> map (function| Ok j -> f j| Error err -> Error err)

25

Envelopes all the way down…

Page 26: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

/// A request-reply server channel wherein the channel/// handles inputs of type 'i and produces outputs of type 'o./// The server is expressed in terms of inputs of type 'a/// and outputs of type 'b.type ReqRepServer<'i, 'o, 'a, 'b> ={ ch : Channeldecoder : Dec<'i, 'a * TraceContext>encoder : Enc<'o, 'b * TraceContext> }

/// A publish-subscribe channel with inputs of type /// 'i, outputs of type ‘o and domain-specific message type 'a.type PubClient<'m, 'a> ={ ch : Channelencode : Enc<'m, 'a * TraceContext> }

/// A publish-subscribe channel with inputs of type /// 'i, outputs of type ‘o and domain-specific message type 'a.type SubClient<'m, 'a> ={ ch : Channeldecoder : Dec<'m, 'a * TraceContext> }

26

Rework: Channel Types, Working over “mediums”…

Page 27: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

/// A request-reply server channel wherein the channel/// handles inputs of type 'i and produces outputs of type 'o./// The server is expressed in terms of inputs of type 'a/// and outputs of type 'b.type ReqRepServer<'i, 'o, 'a, 'b> ={ ch : Channeldecoder : Dec<'i, 'a * TraceContext>encoder : Enc<'o, 'b * TraceContext> }

/// A publish-subscribe channel with inputs of type /// 'i, outputs of type ‘o and domain-specific message type 'a.type PubClient<'m, 'a> ={ ch : Channelencode : Enc<'m, 'a * TraceContext> }

/// A publish-subscribe channel with inputs of type /// 'i, outputs of type ‘o and domain-specific message type 'a.type SubClient<'m, 'a> ={ ch : Channeldecoder : Dec<'m, 'a * TraceContext> }

27

Rework: Channel Types, Working over “mediums”…

Page 28: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

28

Feeling experimental??Let’s go to an editor.

Page 29: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

type TelemetryEvent =

{trace_id : string

/// Either Start, Complete, or a Custom event typeevent_type : string

/// The timestamp of the event being writtentimestamp : DateTimeOffset

tags : TraceTags}

29

New Metadata Format:

Page 30: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

30

Visualizing AnEvent Log :“A merchant updates a SKU.”

TIME

START

START

COMPLETE

COMPLETE

START

START

COMPLETE

COMPLETE

INFO EVENT

START

COMPLETE

{ “event_type”: “START”,

“trace_id”: “1a2b3”,

“timestamp”: “123456789”

“tags” : {“op_name”: “IndexSkus”“merchant_id”: “4c5d6e7”

}}

Page 31: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Problem: Uncertain Querying & IndexingSolution:

31

• Graph Definitions via Reflection• Semantics around “starting” & “stopping” an

“Operation”• Projection from CosmosDB which sends

snapshots to Splunk

Page 32: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

32

Page 33: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Problem: Bad UI/UX Solution:

33

• Current frameworks aren’t a great fit.• Leverage Splunk for custom dashboards per

team.• Iterating on multiple custom UIs that feature

topology graphs (force-directed graphs) and various hierarchical tree views.

Page 34: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

In the future… 34

• Unified Logging• Schematization• Higher level abstractions• More integrations, more backends, more

projections• Open Source!?!?!?!

Page 35: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

Lessons You Can Go To The Bank On: 35

1. Leverage product to target owners and “tracing sponsors” around your org.

2. You might need to make some flashy views just to get buy in from stakeholders even if it doesn’t help developers solve real problems.

3. This project takes time and careful requirement gathering. Carve out a useful tracing flow, one distributed action at a time.

Page 36: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

THANK YOU FOR YOUR TIME! 36

Come find me in the hallway if you:- Want to talk about tracing- Want to not talk about tracing- Want to write F# for a living :]

@wiredsis

Page 37: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on

More reading… 37

• https://opencensus.io/

• http://opentracing.io/

• https://www.usenix.org/conference/nsdi-07/x-trace-pervasive-network-tracing-framework

• https://research.google.com/pubs/pub36356.html

• https://www.usenix.org/conference/nsdi-07/x-trace-pervasive-network-tracing-framework

• https://zipkin.io/

• https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/

• https://medium.com/@eulerfx/scaling-event-sourcing-at-jet-9c873cac33b8 (He’s Hiring)

Page 38: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on
Page 39: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on
Page 40: Distributed Tracing @ Jet - USENIX · Problem: Bad UI/UX Solution: 33 •Current frameworks aren’t a great fit. •Leverage Splunkfor custom dashboards per team. •Iterating on