distributed tracing in openstack - cern
Distributed Tracing in OpenStack
Ilya Shakhat
Huawei Technologies, Munich Research Center
27 May 2019
About the presenter
Ilya Shakhat
• One of the developers of Neutron LBaaS 1.0
• Co-author of Stackalytics
• Member of the scale and performance team
• Maintainer of performa/shaker[1] and performa/os-faults[2] tools
• Core reviewer of osprofiler library
[1] Distributed data-plane testing tool: https://opendev.org/performa/shaker
[2] OpenStack fault-injection library: https://opendev.org/performa/os-faults
What is distributed tracing?
Observability = logs + metrics + tracing
• Request tracking in distributed systems.
• Performance and latency measurement.
• Service dependency analysis.
• System exploration and debugging.
• Root cause analysis.
[1] Maze is generated with http://www.mazegenerator.net/
Trace models
Span model
• Ideal for a synchronous programming model. [1]
• Implementations: Jaeger, Zipkin, OpenTracing, OpenCensus.
Event model
• Designed for an asynchronous programming model and messaging patterns. [2]
• Trace is a DAG (directed acyclic graph).
[1] Google Dapper: https://ai.google/research/pubs/pub36356
[2] Facebook Canopy: https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
[3] Diagram is from https://medium.com/opentracing/open-for-event-based-tracing-a326c295f2a2
Tracing in OpenStack
OSProfiler
• Project under the Oslo umbrella.
• Instrumentation library – event collection and storage.
• CLI – event processing and visualization.
• Trace model is event-based, but with events aggregated into spans on the client side.
Tracing is enabled explicitly for each command, e.g.:
openstack --os-profile SECRET_KEY <command>
Trace can be viewed via CLI:
osprofiler trace show <trace-id>
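The client-side aggregation of events into spans can be sketched as follows. This is a minimal illustration and not OSProfiler's actual code: the event fields (`span_id`, `parent_id`, `name`, `phase`, `timestamp`) and the function name `events_to_spans` are assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical span record; field names are chosen for this sketch only.
Span = namedtuple("Span", "span_id parent_id name start stop")


def events_to_spans(events):
    """Pair "start" and "stop" events that share a span_id into spans.

    Events may arrive in any order. A span is emitted only when both
    of its events are present.
    """
    started, stopped = {}, {}
    for event in events:
        bucket = started if event["phase"] == "start" else stopped
        bucket[event["span_id"]] = event

    spans = []
    for span_id, start in started.items():
        stop = stopped.get(span_id)
        if stop is None:
            continue  # still running, or the stop event was lost
        spans.append(Span(span_id, start["parent_id"], start["name"],
                          start["timestamp"], stop["timestamp"]))
    # Order spans by start time, as a trace viewer would show them.
    return sorted(spans, key=lambda span: span.start)
```

Because pairing happens on the reader's side, the instrumented services only need to emit small independent events, which suits asynchronous flows where start and stop are reported by different code paths.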
Code instrumentation
Context propagation
At boundaries:
• request is received;
• outgoing connection is made;
• system tool is called;
• DB query is executed.
At branching:
• a new thread is spawned.
Instrumentation code in libraries.
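[Diagram: a service exposes a REST API and an RPC API; from inside the service, traced operations are REST API client calls, RPC call, RPC cast, system tool calls, thread spawn, and DB access.]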
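Context propagation at the boundaries and branching points listed above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the OSProfiler implementation: the header names `X-Trace-Id` and `X-Parent-Id` and the helpers `extract`, `inject` and `spawn` are hypothetical.

```python
import contextvars
import threading

# Trace context of the current execution flow: (trace_id, parent_span_id).
_current = contextvars.ContextVar("trace_context", default=None)


def extract(headers):
    """Boundary, incoming request: restore the context from headers."""
    if "X-Trace-Id" in headers:
        _current.set((headers["X-Trace-Id"], headers.get("X-Parent-Id")))


def inject(headers):
    """Boundary, outgoing call: copy the context into headers."""
    context = _current.get()
    if context is not None:
        headers["X-Trace-Id"], headers["X-Parent-Id"] = context
    return headers


def spawn(target, *args):
    """Branching: run target in a new thread with a copy of the context."""
    context = contextvars.copy_context()
    thread = threading.Thread(target=context.run, args=(target, *args))
    thread.start()
    return thread
```

Keeping helpers like these inside shared libraries, rather than in each service, is what lets the instrumentation stay out of service code.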
Demo setup
• OpenStack Stein installed using PackStack in multi-node mode.
• Additional instrumentation in oslo.service, oslo.concurrency, oslo.privsep and neutronclient.
• OSProfiler with Zipkin driver.
• Spans are collected and processed in Jaeger and stored in Elasticsearch.
[Diagram: OpenStack services (Keystone, Nova, Neutron, Glance) send spans to Jaeger, which stores them in Elasticsearch (ES).]
System exploration
Traces help to understand the code flow in a distributed system.
Server creation in theory [1]
Real view in dynamics
[1] https://docs.openstack.org/nova/stein/reference/vm-states.html
Performance analysis
Spans are transformed into metrics, giving a view into internal operations such as RPC calls.
Figure captions: a metric is extracted from spans and visualized in Kibana; span durations are visualized in Jaeger, where slower spans are redder.
Request profiling helps to find bottlenecks and to analyze the critical path.
Outlier – 3 times longer than usual
Neutron operation takes most
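Turning spans into a duration metric and flagging outliers can be sketched as below. The span fields (`name`, `start`, `stop`) and the default threshold of 3 times the median are assumptions for illustration; the talk does not say how its Kibana metric was computed.

```python
from collections import defaultdict
from statistics import median


def duration_metrics(spans):
    """Group span durations by operation name."""
    metrics = defaultdict(list)
    for span in spans:
        metrics[span["name"]].append(span["stop"] - span["start"])
    return metrics


def find_outliers(spans, factor=3.0):
    """Return spans lasting more than `factor` times the median duration
    of spans with the same operation name."""
    typical = {name: median(values)
               for name, values in duration_metrics(spans).items()}
    return [span for span in spans
            if span["stop"] - span["start"] > factor * typical[span["name"]]]
```

The median is used here instead of the mean so that a single very slow span does not shift the baseline it is compared against.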
RCA scenario setup
Server creation command:
$ openstack --os-profile SECRET_KEY server create --network private --image cirros --flavor m1.tiny test
The command returns once the DB object is created, and the VM is spawned in the background. The user has to poll Nova to get the VM status.
Nova architecture diagram is based on https://docs.openstack.org/nova/stein/user/architecture.html
Server creation trace
Response is sent to the user
Execution continues asynchronously
Fault injection
Injected fault: OVS DB service is down on the compute node.
Note: Neutron OVS agent is still considered alive (failure not detected yet).
Without tracing, root-cause analysis means:
• grep logs for VM id;
• filter messages by request-id;
• jump to the next service along the path;
• repeat until the error is found.
Nova architecture diagram is based on https://docs.openstack.org/nova/stein/user/architecture.html
Fault observation
VM status is error:
Build of instance ac50cb4a-ad7c-4abb-8bee-d8d025b545a3 aborted: Failed to allocate the network(s), not rescheduling.
Trace overview:
Root-cause analysis
Error in VIF driver
Failed to call OVS utility
Very long operation (timeout)
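Locating the origin of a failure in a trace can be sketched as a search for failed spans that have no failed children. The span fields (`id`, `parent_id`, `error`) and the function `root_causes` are hypothetical; in the demo this step was done by reading the trace in the Jaeger UI.

```python
def root_causes(spans):
    """Return failed spans none of whose children failed.

    Errors usually propagate from callee to caller, so a failed span
    without failed children is the closest one to the origin of the fault.
    """
    failed_parents = {span["parent_id"] for span in spans if span["error"]}
    return [span for span in spans
            if span["error"] and span["id"] not in failed_parents]
```

For a chain such as instance build, VIF driver, OVS utility call, where all three spans are marked as failed, only the OVS utility call is returned.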
Trace comparison
Structural changes
The structure of a trace with a fault differs significantly from that of a normal one.
red – missing in trace with fault
green – missing in normal trace
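A structural comparison of two traces can be sketched by reducing each trace to the set of name paths from the root to every span and taking set differences. The span fields (`id`, `parent_id`, `name`) and both functions are assumptions for illustration; the talk does not describe the comparison algorithm it used. Each trace is assumed to be a tree.

```python
def span_paths(spans):
    """Return the set of root-to-span name paths found in a trace."""
    by_id = {span["id"]: span for span in spans}

    def path(span):
        names = []
        while span is not None:
            names.append(span["name"])
            span = by_id.get(span["parent_id"])  # None once past the root
        return tuple(reversed(names))

    return {path(span) for span in spans}


def compare_traces(normal, faulty):
    """Split structural differences into the two color groups above."""
    normal_paths, faulty_paths = span_paths(normal), span_paths(faulty)
    return {
        "red": normal_paths - faulty_paths,    # missing in trace with fault
        "green": faulty_paths - normal_paths,  # missing in normal trace
    }
```

Comparing by name path rather than by span id matters because ids are unique per request, while the call structure is what repeats between a normal run and a faulty one.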
How to use tracing now?
DevStack [1]
enable_plugin osprofiler https://opendev.org/openstack/osprofiler master
OSPROFILER_COLLECTOR=redis
The plugin enables tracing in all OpenStack services and Tempest. The default driver is Redis.
Zuul Tempest job [2]
Zuul v3 makes it easy to configure a job to run Tempest tests with the tracing switched on.
Rally [3]
Collect traces for all iterations in a Rally scenario.
[1] https://opendev.org/openstack/osprofiler/src/branch/master/devstack
[2] https://opendev.org/openstack/osprofiler/src/tag/2.8.0/.zuul.yaml#L27-L42
[3] https://review.opendev.org/#/c/615350/
Available drivers
Driver                                   Collector [1]   View [2]   Zuul job [3]
Redis                                    ✓               ✓          ✓
SQLAlchemy (SQLite, PostgreSQL, MySQL)   ✓               ✓          ✓
Elasticsearch                            ✓               ✓          ✕
MongoDB                                  ✓               ✓          ✕
Jaeger                                   ✓               ✕          ✕
Oslo.Messaging (deprecated)              ✓               ✕          ✕
[1] Expose trace events from instrumented code
[2] View traces in osprofiler CLI
[3] Integration testing in OpenStack gate
Future Work
OpenTracing compatibility
Why?
• OpenTracing is a de facto standard for distributed tracing.
• OpenTracing is part of the OpenTelemetry initiative [1].
Benefits
• Transparent tracing through OpenStack services.
• Out-of-the-box support for advanced platforms such as CNCF Jaeger.
[1] https://opentelemetry.io/
Thank you.