Monitoring With Percentiles Baron Schwartz

Upload: vividcortex

Post on 21-Jan-2018

TRANSCRIPT

Page 1: Monitoring with Percentiles

Monitoring With Percentiles
Baron Schwartz

Page 2: Monitoring with Percentiles

#percentiles

Introduction

● My email is [email protected]
● Now let’s learn as much as possible about percentiles in 25 minutes!

Page 3: Monitoring with Percentiles

What Are Percentiles?

● More generally, quantiles - percentiles are just a common type of quantile.
● Quantiles divide an ordered distribution of values into intervals that each hold an equal share of the values.
● Percentiles divide the distribution into 100 such intervals.
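As a rough illustration of that definition, Python's standard `statistics.quantiles` returns the cut points that split a dataset into `n` equal-share intervals. The sample data below is invented for this sketch.

```python
import statistics

# Hypothetical latency samples in milliseconds (illustrative data only).
latencies_ms = list(range(1, 1001))

# n=4 yields the three quartile cut points; n=100 yields the 99 percentile cut points.
quartiles = statistics.quantiles(latencies_ms, n=4)
percentiles = statistics.quantiles(latencies_ms, n=100)

p50 = percentiles[49]  # 50th percentile, i.e. the median
p99 = percentiles[98]  # 99th percentile
print(len(quartiles), len(percentiles), p50, p99)
```

Asking for `n` intervals gives `n - 1` cut points, which is why the 99th percentile sits at index 98.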

Page 4: Monitoring with Percentiles

Page 5: Monitoring with Percentiles

What’s the 99.9th percentile?

● It’s loose terminology, but we all know what we mean.
● Strictly speaking, it’d be the 999th permille.

Page 6: Monitoring with Percentiles

What are Percentiles Good For?

● They show some measure of how extreme the outliers are.
● They help avoid outliers being obscured.
● They help hide the outliers so the bulk of the values aren’t obscured.
● They show “worst common case behavior.”

Page 7: Monitoring with Percentiles

http://nyti.ms/2cLigbH

Page 8: Monitoring with Percentiles

Problem: Averages Hide Outliers

Source: http://www.brendangregg.com/FrequencyTrails/outliers.html

Page 9: Monitoring with Percentiles

Problem: Outliers Skew Averages

● It’s hard to see the shape of the chart because the spikes cause the rest of the data to be scaled down near the axis.
● This is a chart of an average; how far out did the outlier really extend? Is the outlier itself being scaled by the rest of the data?
● Net result: averages show us neither the outliers, nor the bulk of the data.
● The average is neither robust nor representative.
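A small sketch of that net result, using invented response times: the mean lands far from both the typical request and the outliers.

```python
import statistics

# Hypothetical response times in ms: 98 fast requests and 2 slow outliers.
samples = [10] * 98 + [2000, 5000]

mean = statistics.mean(samples)      # dragged upward by the two outliers
median = statistics.median(samples)  # what a typical request looks like
worst = max(samples)                 # how far the outlier really extended

print(mean, median, worst)
```

With this data the mean is several times the typical latency, yet still nowhere near the worst case.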

Page 10: Monitoring with Percentiles

Definition of Average

Average (def): a random number that falls somewhere between the maximum and 1/2 the median. Most often used to ignore reality.

- Gil Tene

Source: http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-average-random.html

Page 11: Monitoring with Percentiles

Is There A Robust, Representative Metric?

● You probably know that the median is “robust and representative.”
● It’s commonly used to represent “the common case.”

Page 12: Monitoring with Percentiles

Problem: Median Isn’t Most

● The median is the 50th percentile: the midpoint of the distribution.
● When it comes to performance, the median isn’t representative of most.
● We should care about “most people’s experience.”
● And we should also care about “some people’s experience.”

Page 13: Monitoring with Percentiles

Definition of Median

Median Server Response Time: The number that 99.9999999999% of page views can be worse than.

- Gil Tene

Source: http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-median-server.html

(This is possible because most page views issue multiple requests to backend servers.)
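A back-of-the-envelope sketch of why multiple requests make this possible. It assumes, purely for illustration, that each page view issues a fixed number of independent backend requests; the function name and the figure of 40 requests are not from the source.

```python
def fraction_worse_than(quantile: float, requests_per_view: int) -> float:
    """Fraction of page views in which at least one request is slower than the quantile."""
    # Each request beats the quantile with probability `quantile`;
    # all of them must do so for the page view to avoid a slow request.
    return 1.0 - quantile ** requests_per_view

# With the median (quantile 0.5) and an assumed 40 requests per page view:
print(fraction_worse_than(0.5, 40))
```

Under these assumptions, roughly 40 requests per page view are enough for all but about one in a trillion page views to include a worse-than-median response.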

Page 14: Monitoring with Percentiles

The Median Is Too Coarse

● The median is too coarse, much more so than you’d expect.
● High quantiles are better for understanding typical experiences.
● You should care about the edge cases, i.e. the 99th percentile and higher.
● This helps you understand and design for the impact of outliers on your architecture, and the impact of your architecture and design choices on outliers.
● Design systems to “bend but not break” -- @kellabyte
  ○ Source: http://kellabyte.com/2014/10/29/the-99th-percentile-matters/

Page 15: Monitoring with Percentiles

How Do Percentiles Work?

● We’re typically dealing with measurements of highly variable quantities.
● These come from processes with properties (i.e. models) that are usually not knowable a priori, and usually not even stable, so you can’t assume things like “normally distributed.”
● Examples: response size in bytes, response latency in seconds.
● There are many more wrong ways to do percentiles than right.

Page 16: Monitoring with Percentiles

How Do You Compute Percentiles?

● Divide the possible range of values into partitions.
● Place each measurement into a partition and increment its count.
● Count the total of all partitions (e.g. “19,847 measurements”).
● Multiply the total by the desired quantile to get the rank N (e.g. for the 99.9th percentile, 19,847 × 0.999 ≈ 19,827).
● Find the partition that contains the Nth measurement (e.g. partition 1201).
● The upper boundary of that partition (e.g. 1822ms) is the result.
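The steps on this slide can be sketched roughly as follows. The function name, the explicit list of partition upper boundaries, and the overflow partition are assumptions made for this illustration, not details from the talk.

```python
import bisect

def histogram_percentile(values, upper_bounds, quantile):
    """Approximate a quantile (0 < quantile <= 1) from partition counts."""
    # One count per partition, plus an overflow partition past the last boundary.
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        # bisect_left picks the first partition whose upper boundary is >= v.
        counts[bisect.bisect_left(upper_bounds, v)] += 1

    total = sum(counts)
    if total == 0:
        raise ValueError("no measurements")

    # Rank of the measurement we want, e.g. int(19847 * 0.999) == 19827.
    rank = max(1, int(total * quantile))

    running = 0
    for i, count in enumerate(counts):
        running += count
        if running >= rank:
            # Report the partition's upper boundary (infinity for overflow).
            return upper_bounds[i] if i < len(upper_bounds) else float("inf")
```

Because only the counts are needed once the measurements are placed, the raw values can be discarded; the price is that the answer is only as precise as the partition width.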

Page 17: Monitoring with Percentiles

Alternative Definitions

● The Nth percentile is approximately the max value of all but the worst (100 - N)% of measurements.
● So you can discard the worst (100 - N)% and measure the max.
● e.g. to get the 95th percentile, ignore the worst 5% and measure the max value of what remains.

(This isn’t strictly correct, since it’s not based on an even partitioning of the value space into equal -iles.)
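A minimal sketch of this alternative definition, working on raw measurements. The function name is hypothetical, and rounding the kept count down is one choice among several.

```python
import math

def percentile_by_discarding(values, percentile):
    """Discard the worst (100 - percentile)% of measurements; return the max of the rest."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no measurements")
    # Number of measurements kept after dropping the worst ones (at least one).
    keep = max(1, math.floor(len(ordered) * percentile / 100))
    return ordered[keep - 1]
```

For example, with the values 1 through 100, asking for the 95th percentile drops the five largest values and returns the largest one left.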

Page 18: Monitoring with Percentiles

How Monitoring Tools Implement Percentiles

● Vastly differently!
● Don’t assume your monitoring tool does it the right way, the one true way, or the same way any other tool does.

Page 19: Monitoring with Percentiles

Factors To Consider

1. How are values measured?
2. What’s the definition of “percentile” in use?
3. How are values aggregated into metrics or other representations?
4. How are metrics (assuming it’s metrics) emitted?
5. How are metrics transmitted and stored?
6. How are metrics retrieved?
7. How are metrics displayed?
8. How are metrics recomputed and transformed for long-term retention?

Page 20: Monitoring with Percentiles

Garbage In, Garbage Out

Page 21: Monitoring with Percentiles

StatsD and Graphite

● StatsD lets you compute percentiles in the aggregator itself before sending them to Graphite. The result is a “metric of the percentile.” More on this later.
● StatsD’s metrics such as upper_99 and mean_99 are confusing.
● It’s possible to track banded metrics in StatsD and Graphite, and then to compute percentiles from the bands later.
● Graphite itself has percentile functions for wildcard series.

Page 22: Monitoring with Percentiles

Datadog

● If you collect a histogram with Datadog’s DogStatsD, it emits metrics of min, max, median, and 95th percentile.

● The same caveats apply as for StatsD’s percentiles.

Page 23: Monitoring with Percentiles

Coda Hale’s Metrics

● Many typical/common metrics coming from this set of libraries are exponentially biased over time.
● http://metrics.dropwizard.io/3.1.0/manual/core/#exponentially-decaying-reservoirs
● They’re also computed from statistically representative samples of the population. Approximately representative, but still be aware it’s only a sample.
● Many, many products (e.g. Cassandra) use Coda Hale’s Metrics library.
  ○ E.g. http://wiki.apache.org/cassandra/Metrics NOTE: these are silently distorted values, not raw measurements.

Page 24: Monitoring with Percentiles

VividCortex

● We capture banded metrics. At the moment we only visualize them as rainbow charts.

Page 25: Monitoring with Percentiles

Honeycomb

● Based on the raw dataset.
● If the raw dataset is sampled instead of captured in full, the results may be skewed.

Page 26: Monitoring with Percentiles

Circonus

● High-resolution histograms using “llquantize()” type bucketing
● i.e. “Two base-ten significant digits of precision”
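To give a feel for what “two base-ten significant digits” means for bucketing, here is a small sketch for positive integer measurements (say, microseconds). It illustrates the general idea only; it is not Circonus’s implementation, and the function name is made up.

```python
def two_digit_bucket(value: int) -> int:
    """Lower bound of the bucket holding a positive integer, keeping two significant digits."""
    if value <= 0:
        raise ValueError("this sketch handles positive integers only")
    scale = 1
    # Grow the bucket width by powers of ten until only two leading digits remain.
    while value // scale >= 100:
        scale *= 10
    return (value // scale) * scale
```

Small values keep their exact value, while larger ones share a bucket with neighbours of the same magnitude: 1822 falls in the bucket starting at 1800, and 19847 in the one starting at 19000.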

Page 27: Monitoring with Percentiles

Advice

For understandability and fidelity, you’re best off with:

● Raw data, not predigested or distorted.
● Definitely not exponentially decayed at the aggregator, if possible.
  ○ i.e. Coda Hale’s Metrics library probably distorts more than you’d be happy with if you really knew the truth about the underlying measurements.
● Banded or histogrammed is better than a single metric of a percentile
  ○ (more on this to come)
● You have to research the underlying implementation yourself.

Page 28: Monitoring with Percentiles

How Can I Visualize Percentiles?

● There’s a variety of ways.
● Distributions of data contain a lot of information, so visualization is essential.
● You’re usually most interested in how it changes over time.
● A few ways to visualize percentiles and distributions over intervals...

Page 29: Monitoring with Percentiles

Time Series Graphs

Page 30: Monitoring with Percentiles

Banded Metrics

Phusion Passenger; VividCortex

Page 31: Monitoring with Percentiles

Histograms

Apex Ping

Page 32: Monitoring with Percentiles

Heat Maps

Fastly

Page 33: Monitoring with Percentiles

How Can I Describe Distributions?

● If you knew that your values fit a particular distribution…
● Then you’d be able to just record the distribution’s parameters.
● But that’s basically never the case.
● In practice, something equivalent to histograms ends up being necessary.

Page 34: Monitoring with Percentiles

Histogram Implementations

● HdrHistogram is the “canonical” implementation for many purposes.
  ○ Fast, flexible, can be merged together (i.e. downsampled)
● Many roll-your-own examples exist
● For predefined ranges, good bucket values aren’t hard to choose
● If you don’t know the values’ characteristics in advance, it’s harder
  ○ Powers of two? Powers of… 1.05?
  ○ Linear buckets?
  ○ Log-linear buckets?
  ○ These can end up being equivalent to “achieve desired significant digits in base 10”
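As one illustration of the “powers of…” option, the sketch below assigns positive values to geometrically growing buckets. The function names and the default base of 1.05 are choices made for this example; values that sit exactly on a boundary may land on either side because of floating-point rounding.

```python
import math

def log_bucket_index(value: float, base: float = 1.05) -> int:
    """Index i such that base**i <= value < base**(i + 1), for value > 0."""
    return math.floor(math.log(value) / math.log(base))

def bucket_bounds(index: int, base: float = 1.05):
    """Lower and upper boundary of the bucket with the given index."""
    return base ** index, base ** (index + 1)

low, high = bucket_bounds(log_bucket_index(100.0))
print(low, high)
```

Each bucket’s upper boundary is 5% above its lower one, so the relative error of any reported value stays bounded no matter how large the measurements get. That is the sense in which such schemes resemble “desired significant digits.”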

Page 35: Monitoring with Percentiles

Banded Metrics

● Banded metrics can be essentially equivalent to histograms.
● One significant difference is that their cut points are static, unlike histograms which may dynamically differ depending on the actual data in a range of time.

Page 36: Monitoring with Percentiles

Histograms to Quantiles

● You don't have to store 100 bands/buckets to get percentiles.
● Simply sum and find the cutoff, then the bucket and value as before.
● This ends up being an approximation, again, not the strictly correct statistician’s definition of a quantile.

Page 37: Monitoring with Percentiles

What Insights Can Percentiles Give You?

● How bad is your typical user’s experience? (Use a high percentile)
● Are there occasional issues that mean you’re providing low-quality service overall? (High-quality service is consistently fast)
● Is there a rare occurrence that’s going to escalate?
  ○ Note: this is equivalent in some ways to what VividCortex’s Adaptive Fault Detection does
● In other words, monitoring at the edges helps you be more proactive by listening to the canaries in the coal mine.

Page 38: Monitoring with Percentiles

Percentile Pitfalls

● Percentile math can be confusing.
● Tools and their distortions can be confusing.
● The math isn’t commutative.
● A metric of a percentile doesn’t make sense over time.
  ○ You can’t take averages of percentiles.
  ○ You can’t downsample/resample over time.
● You can’t take percentiles of averages.
  ○ (OK, you can, but the result has no defined meaning.)
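A small demonstration of why averaging percentiles goes wrong, using invented latencies from two intervals. The nearest-rank percentile helper and the data are assumptions made for this sketch.

```python
from collections import Counter

def p99(values):
    """Nearest-rank 99th percentile of raw measurements."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]

def p99_from_counts(counts):
    """The same percentile, computed from value -> count pairs."""
    rank = (99 * sum(counts.values()) + 99) // 100
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= rank:
            return value

# Hypothetical latencies (ms): a busy, fast minute and a quiet, slow one.
busy_minute = [10] * 990 + [100] * 10
quiet_minute = [500] * 10

average_of_p99s = (p99(busy_minute) + p99(quiet_minute)) / 2
true_p99 = p99(busy_minute + quiet_minute)

# Counts (a histogram) can be merged across intervals without losing the answer.
merged_p99 = p99_from_counts(Counter(busy_minute) + Counter(quiet_minute))
print(average_of_p99s, true_p99, merged_p99)
```

The average of the two per-minute values ignores how many measurements each minute holds, so it lands far from the percentile of the combined data. Merging the counts first and then computing the percentile recovers it.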

Page 39: Monitoring with Percentiles

Percentile of an Average

Q: “what’s the 99th percentile of this metric of buffer-pool-reads-per-second?”

A: “it depends on what you mean.”

It’s possible to imagine uses for this, but note that things-per-second is by definition an average (aggregate!!) with seconds as the denominator. It’s not a population.

Page 40: Monitoring with Percentiles

Percentile Pitfalls, Cont’d

● Computing percentiles can be computationally expensive.
● There are efficient online approximations if you’re interested.
  ○ Search for “streaming approximate quantiles.”
● The trouble is they pre-digest and result in an approximate “percentile metric.”
● General rule of thumb for safety:
  ○ Don’t emit or store any time series metric that’s not robust when averaged over time.
  ○ In other words, no fractions or other derived metrics. They don’t work right.
  ○ This isn’t specific to percentiles, it’s just broad-based advice.
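A sketch of that rule of thumb applied to a fraction, using made-up per-minute counters: averaging stored ratios gives a different answer than storing the raw counts and dividing afterwards.

```python
# Hypothetical per-minute counters: (errors, requests).
minutes = [(1, 10), (50, 1000)]

# Averaging the stored fractions weights a 10-request minute like a 1000-request one.
average_of_ratios = sum(e / r for e, r in minutes) / len(minutes)

# Storing the raw counts and dividing at the end gives the overall error rate.
ratio_of_sums = sum(e for e, _ in minutes) / sum(r for _, r in minutes)

print(average_of_ratios, ratio_of_sums)
```

Counts can be summed over any time range and still mean something; a stored fraction cannot.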

Page 41: Monitoring with Percentiles

Percentile Pitfalls, Cont’d Again

● Percentiles aren’t intuitive.
● The high percentiles happen to most of your users, not just some.
● The probability any given user will not have a high-percentile experience with your app is vanishingly small.
● See again: http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-most-page-loads.html
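The arithmetic behind that claim can be sketched as follows, assuming (for illustration only) that a user’s requests are independent of one another. The function name and the request counts are hypothetical.

```python
def chance_of_slow_experience(percentile: float, requests: int) -> float:
    """Probability that at least one of `requests` independent requests lands above the percentile."""
    return 1.0 - (percentile / 100.0) ** requests

# How likely is a user to hit a worse-than-99th-percentile response?
for n in (1, 10, 100, 1000):
    print(n, chance_of_slow_experience(99, n))
```

Under this assumption, a user who makes around a hundred requests is more likely than not to see at least one response beyond the 99th percentile, and by a thousand requests it is close to certain.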

Page 42: Monitoring with Percentiles

Graphite and StatsD Percentile Pitfalls

● I’m not picking on Graphite and StatsD, but they’re especially fraught.
● There are many ways to do things wrong with them.
● If you’re using them, you need to learn how to use them right.

Page 43: Monitoring with Percentiles

To Sum Up

● You need to examine the outliers, not just the bulk of the data.
● Percentiles are computed from a population. You can’t store a percentile itself; you have to store either the population or a representation of it (histograms or banded metrics).
● Tools -- almost all of them -- lack guard rails to keep you away from invalid uses of percentiles. There’s a moral hazard: you could lead others astray.
● A percentile is still just a single number. Distributions are better than simplifying to a single number.
● All measurements are wrong; some are useful anyway.

Page 44: Monitoring with Percentiles

Questions?

Don’t Miss Our Next Webinar!

What's New in MySQL 8.0 and PostgreSQL 9.6

Tuesday, October 25th, 2pm EDT

Features to be discussed include:
● New replication capabilities.
● More extensibility.
● Improved performance.
● Broader SQL implementation.
● Better observation and monitorability.
● Improved operability.

Subscribe to our newsletter for details!