TRANSCRIPT
The Mystery Machine: End-to-End Performance Analysis of Large-Scale Internet Services
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch
Michael Chow 2
Internet services are complex
[Figure: many concurrent tasks plotted over time]
Scale and heterogeneity make Internet services complex
Analysis Pipeline
Step 1: Identify segments
Step 2: Infer causal model
Step 3: Analyze individual requests
Step 4: Aggregate results
Challenges
• Previous methods derive a causal model
– Instrument the scheduler and communication
– Build the model through human knowledge
Need a method that works at scale with heterogeneous components
Opportunities
• Component-level logging is ubiquitous
• Logs cover a large number of requests
Tremendous detail about each request's execution
Coverage of a large range of behaviors
The Mystery Machine
1) Infer causal model from a large corpus of traces
– Identify segments
– Hypothesize all possible causal relationships
– Reject hypotheses with observed counterexamples
2) Analysis
– Critical path, slack, anomaly detection, what-if
Step 1: Identify segments
Define a minimal schema
[Figure: events divide a task's timeline into Segment 1 and Segment 2]
– Request identifier
– Machine identifier
– Timestamp
– Task
– Event
Aggregate existing logs using the minimal schema
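The minimal schema above can be sketched as a small record type. This is an illustration only; the field names, task names, and event names below are assumptions, not the actual log format.

```python
from dataclasses import dataclass

# A minimal sketch of the schema above; all names here are invented.
@dataclass(frozen=True)
class LogEvent:
    request_id: str   # joins events from every component handling one request
    machine_id: str   # where the event was logged
    timestamp: float  # local clock time, in seconds
    task: str         # unit of work, e.g. a hypothetical "server_generate"
    event: str        # named point within a task

# Two consecutive events on the same task delimit one segment.
e1 = LogEvent("req1", "web042", 0.010, "server_generate", "start")
e2 = LogEvent("req1", "web042", 0.085, "server_generate", "end")
print(e2.timestamp - e1.timestamp)  # segment duration in seconds
```

Because the schema is so small, heterogeneous per-component logs can be mapped into it without changing how each component logs.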
Step 2: Infer causal model
Types of causal relationships
Relationship – Counterexample
• Happens-Before: A always ends before B begins (A → B); rejected if B is ever observed starting before A ends
• Mutual Exclusion: A and B never overlap (A before B, or B before A); rejected if A and B are observed overlapping
• Pipeline: requests preserve their order across tasks t1 and t2 (A, B, C on t1 then A', B', C' on t2); rejected if the order differs between the tasks (e.g. A, B, C on t1 but C', B', A' on t2)
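The rejection step for the happens-before relationship can be sketched as follows. This is a minimal illustration, assuming each trace is a list of (segment, start, end) tuples with comparable timestamps; the real system must also handle clock skew and the other relationship types.

```python
# Counterexample-driven inference: hypothesize everything, reject on evidence.
def infer_happens_before(traces):
    """Hypothesize A -> B for every ordered segment pair, then reject any
    hypothesis that an observed trace contradicts."""
    segments = {name for trace in traces for name, _, _ in trace}
    hypotheses = {(a, b) for a in segments for b in segments if a != b}
    for trace in traces:
        times = {name: (start, end) for name, start, end in trace}
        for a, b in list(hypotheses):
            # Counterexample: A ended after B began, so A cannot precede B.
            if a in times and b in times and times[a][1] > times[b][0]:
                hypotheses.discard((a, b))
    return hypotheses

traces = [
    [("S1", 0, 10), ("N1", 10, 30), ("C1", 30, 45)],
    [("S1", 0, 8), ("N1", 9, 20), ("C1", 25, 40)],
]
print(sorted(infer_happens_before(traces)))
# [('N1', 'C1'), ('S1', 'C1'), ('S1', 'N1')]
```

With a large corpus, spurious orderings that hold by coincidence in a few traces are eventually contradicted, which is why the method improves with scale.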
Producing causal model
[Figure: hypothesized causal model over segments S1, N1, C1 and S2, N2, C2, refined against Trace 1 (segments plotted over time)]
Producing causal model
[Figure: the causal model further refined against Trace 2 (segments S1, N1, C1 and S2, N2, C2 plotted over time)]
Step 3: Analyze individual requests
Critical path using causal model
[Figure: Trace 1 with the critical path through segments S1, N1, C1 and S2, N2, C2 highlighted over time]
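Given the inferred model, the critical path of a request is the slowest chain of dependent segments. A minimal sketch, assuming the model is a DAG of happens-before edges with per-segment durations (the edge set and durations below are invented):

```python
from functools import lru_cache

def critical_path(edges, dur):
    """Return the longest weighted dependency chain: the critical path."""
    preds = {}
    for a, b in edges:
        preds.setdefault(b, []).append(a)
        preds.setdefault(a, [])

    @lru_cache(maxsize=None)
    def path(seg):
        # Slowest chain of dependencies ending at `seg`, inclusive.
        best = max((path(p) for p in preds[seg]),
                   key=lambda pth: sum(dur[s] for s in pth), default=())
        return best + (seg,)

    sink = max(preds, key=lambda s: sum(dur[x] for x in path(s)))
    return list(path(sink))

edges = {("S1", "N1"), ("N1", "C1"), ("S2", "N2"), ("N2", "C2"), ("N1", "C2")}
dur = {"S1": 10, "N1": 20, "C1": 15, "S2": 5, "N2": 8, "C2": 30}
print(critical_path(edges, dur))  # ['S1', 'N1', 'C2']
```

Shortening a segment off this path (here S2 or N2) would not reduce end-to-end latency; that observation is what the later slack analysis builds on.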
Step 4: Aggregate results
Inaccuracies of Naïve Aggregation
Need a causal model to correctly understand latency
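A toy illustration of why naïve aggregation misleads (the numbers below are invented, and the concurrency assumption is hypothetical): when two segments overlap, summing average segment durations overstates end-to-end latency and can misattribute where time goes.

```python
# Two synthetic requests; in the first, network and client overlap.
requests = [
    {"server": 50, "network": 100, "client": 100},
    {"server": 200, "network": 20, "client": 30},
]

# Naive aggregation: pretend every segment lies on the critical path.
naive = sum(sum(r.values()) for r in requests) / len(requests)

# Causal aggregation (assumed model: network and client run concurrently,
# both after the server), measured per request and then averaged.
causal = sum(r["server"] + max(r["network"], r["client"])
             for r in requests) / len(requests)

print(naive, causal)  # 250.0 190.0
```

The naïve average overstates latency by 60 ms here and hides that the server dominates the second request, which is exactly the error a per-request causal model avoids.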
High variance in critical path
• Breakdown of the critical path shifts drastically
– Server, network, or client can dominate latency
[Figure: CDFs of the percent of the critical path attributable to server, network, and client. For 20% of requests the server contributes 10% of latency; for another 20% of requests the server contributes 50% or more.]
Diverse clients and networks
[Figure: per-connection critical-path breakdowns (server, network, client) for three example connections]
Differentiated service
Deliver data when needed and reduce average response time
• No slack in server generation time: produce data faster → end-to-end latency decreases
• Slack in server generation time: produce data slower → end-to-end latency stays the same
Additional analysis techniques
• Slack analysis
• What-if analysis
– Use natural variation in a large data set
[Figure: timelines of segments S1, N1, C1 and S2, N2, C2]
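One way to make the slack idea concrete (a sketch, not the paper's implementation): a segment's slack is the end-to-end latency minus the longest dependency chain passing through that segment, so critical-path segments get zero. The edges and durations below are invented.

```python
def slack_of(edges, dur, seg):
    """Extra time `seg` could take without lengthening the request:
    classic critical-path slack, zero for segments on the critical path."""
    preds = {n: [] for n in dur}
    succs = {n: [] for n in dur}
    for a, b in edges:
        preds[b].append(a)
        succs[a].append(b)

    def longest(node, nbrs, memo):
        # Longest dependency chain reaching `node` in direction `nbrs`, inclusive.
        if node not in memo:
            memo[node] = dur[node] + max(
                (longest(n, nbrs, memo) for n in nbrs[node]), default=0)
        return memo[node]

    before, after = {}, {}
    total = max(longest(n, preds, before) for n in dur)  # end-to-end latency
    through = longest(seg, preds, before) + longest(seg, succs, after) - dur[seg]
    return total - through

edges = {("S1", "N1"), ("N1", "C1")}
dur = {"S1": 10, "N1": 20, "C1": 15, "S2": 5}
print(slack_of(edges, dur, "S2"))  # 40: S2 is entirely off the critical path
print(slack_of(edges, dur, "N1"))  # 0: N1 is on the critical path
```

A segment with large slack can be deliberately slowed (or deprioritized) without affecting user-visible latency, which is what makes differentiated service possible.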
What-if questions
• Does server generation time affect end-to-end latency?
• Can we predict which connections exhibit server slack?
Server slack analysis
[Figure: two panels. Slack < 25ms: end-to-end latency increases as server generation time increases. Slack > 2.5s: server generation time has little effect on end-to-end latency.]
Predicting server slack
• Predict slack at the receipt of a request
• Past slack is representative of future slack
[Figure: scatter plot of first slack (ms) vs. second slack (ms) per connection]
Classifies 83% of requests correctly (Type I error 8%, Type II error 9%)
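Since past slack is representative of future slack, a last-value predictor is one natural way to realize this. The sketch below is an assumption about the mechanism, not the paper's classifier, and the 25 ms threshold and default for unseen connections are invented.

```python
def predict_slack(history, threshold_ms=25):
    """Flag a connection as having slack when its most recent observed
    slack exceeded the threshold; unseen connections default to no slack."""
    last = {}
    predictions = []
    for conn, observed_slack in history:
        predictions.append((conn, last.get(conn, 0) > threshold_ms))
        last[conn] = observed_slack
    return predictions

history = [("c1", 3000), ("c2", 5), ("c1", 2800), ("c2", 10)]
print(predict_slack(history))
# [('c1', False), ('c2', False), ('c1', True), ('c2', False)]
```

Predicting at request receipt matters because the server must decide how aggressively to generate data before it knows this request's actual slack.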
Conclusion
Questions