![Page 1: The Mystery Machine: End-to-end performance analysis of large-scale Internet services Michael Chow David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch](https://reader035.vdocuments.net/reader035/viewer/2022062421/56649c795503460f9492e406/html5/thumbnails/1.jpg)
The Mystery Machine: End-to-end performance analysis of large-scale Internet services
Michael Chow
David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch
Internet services are complex
[Figure: axes are Tasks and Time]
Scale and heterogeneity make Internet services complex
Analysis Pipeline
Step 1: Identify segments
Step 2: Infer causal model
Step 3: Analyze individual requests
Step 4: Aggregate results
Challenges
• Previous methods derive a causal model
  – Instrument scheduler and communication
  – Build model through human knowledge
Need method that works at scale with heterogeneous components
Opportunities
• Component-level logging is ubiquitous
  – Tremendous detail about a request’s execution
• Handle a large number of requests
  – Coverage of a large range of behaviors
The Mystery Machine
1) Infer causal model from large corpus of traces
  – Identify segments
  – Hypothesize all possible causal relationships
  – Reject hypotheses with observed counterexamples
2) Analysis
  – Critical path, slack, anomaly detection, what-if
Step 1: Identify segments
Define a minimal schema
[Figure: one Task containing three Events, with Segment 1 and Segment 2 between them]
Schema fields:
• Request identifier
• Machine identifier
• Timestamp
• Task
• Event
Aggregate existing logs using minimal schema
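A minimal sketch of this schema in Python. It assumes, beyond what the slide states, that each pair of consecutive events within a task delimits one segment and that timestamps are in milliseconds. The names `LogEvent`, `Segment`, and `identify_segments`, and the sample events, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    request_id: str   # request identifier
    machine_id: str   # machine identifier
    timestamp: float  # assumed to be milliseconds on the machine's local clock
    task: str         # task the event belongs to
    event: str        # event name


@dataclass(frozen=True)
class Segment:
    request_id: str
    task: str
    start_event: str
    end_event: str
    start: float
    end: float


def identify_segments(events):
    """Group events by (request, task) and pair consecutive events into segments."""
    by_task = defaultdict(list)
    for ev in events:
        by_task[(ev.request_id, ev.task)].append(ev)

    segments = []
    for (request_id, task), evs in by_task.items():
        evs.sort(key=lambda ev: ev.timestamp)
        for first, second in zip(evs, evs[1:]):
            segments.append(Segment(request_id, task, first.event, second.event,
                                    first.timestamp, second.timestamp))
    return segments


# Made-up events for one request, for illustration only.
example_events = [
    LogEvent("r1", "web01", 0.0, "server", "start"),
    LogEvent("r1", "web01", 12.0, "server", "flush"),
    LogEvent("r1", "web01", 30.0, "server", "end"),
]
example_segments = identify_segments(example_events)
```

Three events in one task yield two segments, matching the Segment 1 / Segment 2 layout in the figure.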
Step 2: Infer causal model
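Types of causal relationships
Relationship: Counterexample
• Happens-Before (A always precedes B): a trace in which B precedes A
• Mutual Exclusion (A before B, or B before A, never concurrently): a trace in which A and B overlap
• Pipeline (task t1 handles A, B, C and task t2 handles A', B', C' in the same order): a trace in which t2's order differs from t1's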
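The counterexample checks can be sketched as small predicates over observed time intervals. This is an illustrative reading of the three relationship types, not the authors' implementation. The `Interval` type and function names are hypothetical, and the pipeline check assumes that items in t2 have already been matched to their counterparts in t1.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float
    end: float


def violates_happens_before(a: Interval, b: Interval) -> bool:
    """True if the trace contradicts 'A happens before B': B started before A ended."""
    return b.start < a.end


def violates_mutual_exclusion(a: Interval, b: Interval) -> bool:
    """True if A and B overlap in time, contradicting 'A and B never run concurrently'."""
    return a.start < b.end and b.start < a.end


def violates_pipeline(order_t1, order_t2) -> bool:
    """True if task t2 handled the matched items in a different order than task t1."""
    return list(order_t1) != list(order_t2)
```

A single violating trace is enough to reject a hypothesis; traces that do not violate it leave it standing.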
Producing causal model
[Figure: causal model over segments S1, N1, C1, S2, N2, C2, shown with the timeline of Trace 1]
[Figure: causal model over segments S1, N1, C1, S2, N2, C2, shown with the timeline of Trace 2]
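The hypothesize-then-reject loop can be sketched for happens-before relationships as follows. The segment names come from the slides, but the timings in `trace_1` and `trace_2` are invented for illustration, and `infer_happens_before` is a hypothetical helper rather than the tool's actual code.

```python
from itertools import permutations


def infer_happens_before(traces):
    """Start from every ordered pair of segments and drop pairs contradicted by a trace.

    Each trace maps a segment name to its (start, end) interval.
    """
    names = sorted(set().union(*traces))
    hypotheses = set(permutations(names, 2))  # (a, b) means "a happens before b"
    for trace in traces:
        for a, b in list(hypotheses):
            if a in trace and b in trace and trace[b][0] < trace[a][1]:
                hypotheses.discard((a, b))  # counterexample: b started before a ended
    return hypotheses


# Made-up timings. In trace_1 every segment runs back to back, so trace_1 alone
# would suggest, for example, that C1 must finish before S2 starts.
trace_1 = {"S1": (0, 2), "N1": (2, 4), "C1": (4, 6),
           "S2": (6, 8), "N2": (8, 10), "C2": (10, 12)}
# In trace_2, S2 overlaps N1 and C1, which rejects those hypotheses.
trace_2 = {"S1": (0, 2), "N1": (2, 4), "S2": (2, 5),
           "C1": (4, 9), "N2": (5, 7), "C2": (9, 11)}

model = infer_happens_before([trace_1, trace_2])
```

With only `trace_1`, the model keeps every ordering seen in that trace; adding `trace_2` removes the orderings it contradicts. More traces can only remove hypotheses, which is why a large corpus matters.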
Step 3: Analyze individual requests
Critical path using causal model
[Figure: causal model over segments S1, N1, C1, S2, N2, C2, applied to the timeline of Trace 1]
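Given a causal model and the segment durations observed in one trace, the critical path is the longest chain of dependent segments. The sketch below is one standard way to compute it (longest path in a DAG); the dependency edges and durations are assumptions made up for this example, not values from the talk.

```python
from graphlib import TopologicalSorter


def critical_path(durations, depends_on):
    """Return (total_time, path) for the longest dependency chain.

    durations:  segment name -> observed duration in this trace
    depends_on: segment name -> set of segments that must finish first
    """
    finish = {}    # earliest finish time of each segment
    previous = {}  # predecessor on the longest chain ending at each segment
    graph = {name: depends_on.get(name, set()) for name in durations}
    for name in TopologicalSorter(graph).static_order():
        latest = max(graph[name], key=lambda p: finish[p], default=None)
        previous[name] = latest
        finish[name] = (finish[latest] if latest is not None else 0) + durations[name]

    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = previous[node]
    return total, path[::-1]


# Hypothetical model edges and durations, for illustration only.
depends_on = {"N1": {"S1"}, "C1": {"N1"}, "S2": {"S1"},
              "N2": {"S2"}, "C2": {"C1", "N2"}}
durations = {"S1": 2, "N1": 2, "C1": 5, "S2": 3, "N2": 2, "C2": 2}

total, path = critical_path(durations, depends_on)
```

In this example the chain through C1 is longer than the chain through S2 and N2, so S2 and N2 are off the critical path.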
Step 4: Aggregate results
Inaccuracies of Naïve Aggregation
Need a causal model to correctly understand latency
High variance in critical path
• Breakdown in critical path shifts drastically
  – Server, network, or client can dominate latency
[Figure: three CDF panels (Server, Network, Client); x-axis: percent of critical path]
20% of requests, server contributes 10% of latency
20% of requests, server contributes 50% or more of latency
Diverse clients and networks
[Figure: three panels, each labeled Server, Network, Client]
Differentiated service
Deliver data when needed and reduce average response time
• No slack in server generation time
  – Produce data faster
  – Decrease end-to-end latency
• Slack in server generation time
  – Produce data slower
  – End-to-end latency stays same
Additional analysis techniques
• Slack analysis
• What-if analysis
  – Use natural variation in large data set
[Figure: timeline of segments S1, N1, C1 and S2, N2, C2 over Time]
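Slack can be read as how long a segment could be delayed without lengthening the whole request. The sketch below computes it with a forward and a backward pass over the causal model, which is a common scheduling technique and only an assumed reading of the slide. The model edges and durations are invented for illustration.

```python
from graphlib import TopologicalSorter


def segment_slack(durations, depends_on):
    """Slack per segment: how long it could be delayed without lengthening the request."""
    graph = {name: depends_on.get(name, set()) for name in durations}
    order = list(TopologicalSorter(graph).static_order())

    # Forward pass: earliest time each segment can start.
    earliest_start = {}
    for name in order:
        earliest_start[name] = max(
            (earliest_start[p] + durations[p] for p in graph[name]), default=0)
    end_to_end = max(earliest_start[n] + durations[n] for n in order)

    # Backward pass: latest time each segment may finish.
    latest_finish = {name: end_to_end for name in order}
    for name in reversed(order):
        for p in graph[name]:
            latest_finish[p] = min(latest_finish[p], latest_finish[name] - durations[name])

    return {n: latest_finish[n] - durations[n] - earliest_start[n] for n in order}


# Hypothetical model edges and durations, for illustration only.
depends_on = {"N1": {"S1"}, "C1": {"N1"}, "S2": {"S1"},
              "N2": {"S2"}, "C2": {"C1", "N2"}}
durations = {"S1": 2, "N1": 2, "C1": 5, "S2": 3, "N2": 2, "C2": 2}

slack = segment_slack(durations, depends_on)
```

Segments on the critical path have zero slack; here S2 and N2 could each be delayed without changing the end-to-end time.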
What-if questions
• Does server generation time affect end-to-end latency?
• Can we predict which connections exhibit server slack?
Server slack analysis
[Figure: two panels, Slack < 25ms and Slack > 2.5s]
• Slack < 25ms: End-to-end latency increases as server generation time increases
• Slack > 2.5s: Server generation time has little effect on end-to-end latency
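One way to ask this what-if question from existing data is to bucket requests by slack and measure, within each bucket, how strongly end-to-end latency tracks server generation time. The sketch below uses Pearson correlation as an assumed measure; the slides do not say which statistic was used. The thresholds mirror the two panels, and the request tuples are made-up numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_slack(requests, low_ms=25, high_ms=2500):
    """requests: iterable of (server_ms, end_to_end_ms, slack_ms) tuples."""
    buckets = {"low_slack": [], "high_slack": []}
    for server_ms, latency_ms, slack_ms in requests:
        if slack_ms < low_ms:
            buckets["low_slack"].append((server_ms, latency_ms))
        elif slack_ms > high_ms:
            buckets["high_slack"].append((server_ms, latency_ms))
    return {name: pearson(*zip(*pairs))
            for name, pairs in buckets.items() if len(pairs) > 1}


# Invented sample data: (server generation time, end-to-end latency, slack), all in ms.
requests = [
    (100, 1100, 0), (200, 1200, 10), (300, 1300, 5),            # little slack
    (100, 4000, 3000), (200, 3900, 2600), (300, 4000, 4000),    # lots of slack
]
result = correlation_by_slack(requests)
```

In this toy data the low-slack bucket is perfectly correlated and the high-slack bucket is uncorrelated, mirroring the qualitative contrast between the two panels.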
Predicting server slack
• Predict slack at the receipt of a request
• Past slack is representative of future slack
[Figure: second slack (ms) plotted against first slack (ms)]
Classifies 83% of requests correctly
Type I Error 8%
Type II Error 9%
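A predictor of this kind can be sketched as a threshold on the previously observed slack. The threshold value, the sample pairs, and the mapping of Type I to false positives and Type II to false negatives are assumptions (the conventional reading); the slide does not define them.

```python
def evaluate_slack_predictor(pairs, threshold_ms):
    """Evaluate 'past slack predicts future slack' on (first_slack, second_slack) pairs.

    Returns the fractions (correct, type_1, type_2).
    """
    correct = type_1 = type_2 = 0
    for first, second in pairs:
        predicted = first >= threshold_ms  # prediction made when the request arrives
        actual = second >= threshold_ms    # what the request actually exhibited
        if predicted == actual:
            correct += 1
        elif predicted:
            type_1 += 1  # false positive: predicted slack that did not materialize
        else:
            type_2 += 1  # false negative: missed slack that was available
    n = len(pairs)
    return correct / n, type_1 / n, type_2 / n


# Invented (first slack, second slack) pairs in ms, for illustration only.
pairs = [(3000, 2800), (10, 20), (2600, 100), (50, 4000)]
accuracy, type_1_rate, type_2_rate = evaluate_slack_predictor(pairs, threshold_ms=2500)
```

The three fractions always sum to one, the same accounting as the 83% / 8% / 9% split reported on the slide.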
Conclusion
Questions