TRANSCRIPT
The Mystery Machine: End-to-End Performance Analysis of Large-Scale Internet Services
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch
Michael Chow 2
Internet services are complex
[Figure: many concurrent tasks plotted over time]
Scale and heterogeneity make Internet services complex
Analysis Pipeline
Step 1: Identify segments
Step 2: Infer causal model
Step 3: Analyze individual requests
Step 4: Aggregate results
Challenges
• Previous methods derive a causal model
– Instrument the scheduler and communication
– Build the model through human knowledge
Need a method that works at scale with heterogeneous components
Opportunities
• Component-level logging is ubiquitous
• Logs cover a large number of requests
Tremendous detail about each request's execution
Coverage of a large range of behaviors
The Mystery Machine
1) Infer causal model from a large corpus of traces
– Identify segments
– Hypothesize all possible causal relationships
– Reject hypotheses with observed counterexamples
2) Analysis
– Critical path, slack, anomaly detection, what-if
Step 1: Identify segments
Define a minimal schema
[Figure: events divide a task's timeline into Segment 1 and Segment 2]
– Request identifier
– Machine identifier
– Timestamp
– Task
– Event
Aggregate existing logs using the minimal schema
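The minimal schema above can be sketched as a small record type. This is an illustration only; the field names, task names, and event names below are assumptions, not the actual log format.

```python
from dataclasses import dataclass

# A minimal sketch of the schema above; all names here are invented.
@dataclass(frozen=True)
class LogEvent:
    request_id: str   # joins events from every component handling one request
    machine_id: str   # where the event was logged
    timestamp: float  # local clock time, in seconds
    task: str         # unit of work, e.g. a hypothetical "server_generate"
    event: str        # named point within a task

# Two consecutive events on the same task delimit one segment.
e1 = LogEvent("req1", "web042", 0.010, "server_generate", "start")
e2 = LogEvent("req1", "web042", 0.085, "server_generate", "end")
print(e2.timestamp - e1.timestamp)  # segment duration in seconds
```

Because the schema is so small, heterogeneous per-component logs can be mapped into it without changing how each component logs.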
Step 2: Infer causal model
Types of causal relationships
Relationship – Counterexample
• Happens-Before: A always ends before B begins (A → B); rejected if B is ever observed starting before A ends
• Mutual Exclusion: A and B never overlap (A before B, or B before A); rejected if A and B are observed overlapping
• Pipeline: requests preserve their order across tasks t1 and t2 (A, B, C on t1 then A', B', C' on t2); rejected if the order differs between the tasks (e.g. A, B, C on t1 but C', B', A' on t2)
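The rejection step for the happens-before relationship can be sketched as follows. This is a minimal illustration, assuming each trace is a list of (segment, start, end) tuples with comparable timestamps; the real system must also handle clock skew and the other relationship types.

```python
# Counterexample-driven inference: hypothesize everything, reject on evidence.
def infer_happens_before(traces):
    """Hypothesize A -> B for every ordered segment pair, then reject any
    hypothesis that an observed trace contradicts."""
    segments = {name for trace in traces for name, _, _ in trace}
    hypotheses = {(a, b) for a in segments for b in segments if a != b}
    for trace in traces:
        times = {name: (start, end) for name, start, end in trace}
        for a, b in list(hypotheses):
            # Counterexample: A ended after B began, so A cannot precede B.
            if a in times and b in times and times[a][1] > times[b][0]:
                hypotheses.discard((a, b))
    return hypotheses

traces = [
    [("S1", 0, 10), ("N1", 10, 30), ("C1", 30, 45)],
    [("S1", 0, 8), ("N1", 9, 20), ("C1", 25, 40)],
]
print(sorted(infer_happens_before(traces)))
# [('N1', 'C1'), ('S1', 'C1'), ('S1', 'N1')]
```

With a large corpus, spurious orderings that hold by coincidence in a few traces are eventually contradicted, which is why the method improves with scale.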
Producing causal model
[Figure: hypothesized causal model over segments S1, N1, C1 and S2, N2, C2, refined against Trace 1 (segments plotted over time)]
Producing causal model
[Figure: the causal model further refined against Trace 2 (segments S1, N1, C1 and S2, N2, C2 plotted over time)]
Step 3: Analyze individual requests
Critical path using causal model
[Figure: Trace 1 with the critical path through segments S1, N1, C1 and S2, N2, C2 highlighted over time]
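Given the inferred model, the critical path of a request is the slowest chain of dependent segments. A minimal sketch, assuming the model is a DAG of happens-before edges with per-segment durations (the edge set and durations below are invented):

```python
from functools import lru_cache

def critical_path(edges, dur):
    """Return the longest weighted dependency chain: the critical path."""
    preds = {}
    for a, b in edges:
        preds.setdefault(b, []).append(a)
        preds.setdefault(a, [])

    @lru_cache(maxsize=None)
    def path(seg):
        # Slowest chain of dependencies ending at `seg`, inclusive.
        best = max((path(p) for p in preds[seg]),
                   key=lambda pth: sum(dur[s] for s in pth), default=())
        return best + (seg,)

    sink = max(preds, key=lambda s: sum(dur[x] for x in path(s)))
    return list(path(sink))

edges = {("S1", "N1"), ("N1", "C1"), ("S2", "N2"), ("N2", "C2"), ("N1", "C2")}
dur = {"S1": 10, "N1": 20, "C1": 15, "S2": 5, "N2": 8, "C2": 30}
print(critical_path(edges, dur))  # ['S1', 'N1', 'C2']
```

Shortening a segment off this path (here S2 or N2) would not reduce end-to-end latency; that observation is what the later slack analysis builds on.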
Step 4: Aggregate results
Inaccuracies of Naïve Aggregation
Need a causal model to correctly understand latency
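A toy illustration of why naïve aggregation misleads (the numbers below are invented, and the concurrency assumption is hypothetical): when two segments overlap, summing average segment durations overstates end-to-end latency and can misattribute where time goes.

```python
# Two synthetic requests; in the first, network and client overlap.
requests = [
    {"server": 50, "network": 100, "client": 100},
    {"server": 200, "network": 20, "client": 30},
]

# Naive aggregation: pretend every segment lies on the critical path.
naive = sum(sum(r.values()) for r in requests) / len(requests)

# Causal aggregation (assumed model: network and client run concurrently,
# both after the server), measured per request and then averaged.
causal = sum(r["server"] + max(r["network"], r["client"])
             for r in requests) / len(requests)

print(naive, causal)  # 250.0 190.0
```

The naïve average overstates latency by 60 ms here and hides that the server dominates the second request, which is exactly the error a per-request causal model avoids.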
High variance in critical path
• Breakdown of the critical path shifts drastically
– Server, network, or client can dominate latency
[Figure: CDFs of the percent of the critical path attributable to server, network, and client. For 20% of requests the server contributes 10% of latency; for another 20% of requests the server contributes 50% or more.]
Diverse clients and networks
[Figure: per-connection critical-path breakdowns (server, network, client) for three example connections]
Differentiated service
Deliver data when needed and reduce average response time
• No slack in server generation time: produce data faster → end-to-end latency decreases
• Slack in server generation time: produce data slower → end-to-end latency stays the same
Additional analysis techniques
• Slack analysis
• What-if analysis
– Use natural variation in a large data set
[Figure: timelines of segments S1, N1, C1 and S2, N2, C2]
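One way to make the slack idea concrete (a sketch, not the paper's implementation): a segment's slack is the end-to-end latency minus the longest dependency chain passing through that segment, so critical-path segments get zero. The edges and durations below are invented.

```python
def slack_of(edges, dur, seg):
    """Extra time `seg` could take without lengthening the request:
    classic critical-path slack, zero for segments on the critical path."""
    preds = {n: [] for n in dur}
    succs = {n: [] for n in dur}
    for a, b in edges:
        preds[b].append(a)
        succs[a].append(b)

    def longest(node, nbrs, memo):
        # Longest dependency chain reaching `node` in direction `nbrs`, inclusive.
        if node not in memo:
            memo[node] = dur[node] + max(
                (longest(n, nbrs, memo) for n in nbrs[node]), default=0)
        return memo[node]

    before, after = {}, {}
    total = max(longest(n, preds, before) for n in dur)  # end-to-end latency
    through = longest(seg, preds, before) + longest(seg, succs, after) - dur[seg]
    return total - through

edges = {("S1", "N1"), ("N1", "C1")}
dur = {"S1": 10, "N1": 20, "C1": 15, "S2": 5}
print(slack_of(edges, dur, "S2"))  # 40: S2 is entirely off the critical path
print(slack_of(edges, dur, "N1"))  # 0: N1 is on the critical path
```

A segment with large slack can be deliberately slowed (or deprioritized) without affecting user-visible latency, which is what makes differentiated service possible.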
What-if questions
• Does server generation time affect end-to-end latency?
• Can we predict which connections exhibit server slack?
Server slack analysis
[Figure: two panels. Slack < 25ms: end-to-end latency increases as server generation time increases. Slack > 2.5s: server generation time has little effect on end-to-end latency.]
Predicting server slack
• Predict slack at the receipt of a request
• Past slack is representative of future slack
[Figure: scatter plot of first slack (ms) vs. second slack (ms) per connection]
Classifies 83% of requests correctly (Type I error 8%, Type II error 9%)
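Since past slack is representative of future slack, a last-value predictor is one natural way to realize this. The sketch below is an assumption about the mechanism, not the paper's classifier, and the 25 ms threshold and default for unseen connections are invented.

```python
def predict_slack(history, threshold_ms=25):
    """Flag a connection as having slack when its most recent observed
    slack exceeded the threshold; unseen connections default to no slack."""
    last = {}
    predictions = []
    for conn, observed_slack in history:
        predictions.append((conn, last.get(conn, 0) > threshold_ms))
        last[conn] = observed_slack
    return predictions

history = [("c1", 3000), ("c2", 5), ("c1", 2800), ("c2", 10)]
print(predict_slack(history))
# [('c1', False), ('c2', False), ('c1', True), ('c2', False)]
```

Predicting at request receipt matters because the server must decide how aggressively to generate data before it knows this request's actual slack.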
Conclusion
Questions