The Tale of Heavy Tails in Computer Networking

Stenio Fernandes, CIn/UFPE, Recife, Brazil

Carleton University - ARS Lab – May 2016


Outline

Essential Concepts and Terminology
• The heavy-tail phenomenon
• Outlier detection
• Heavy-tailed distributions and their variations (subclasses)

Evidence of Heavy-Tailedness in Computer Networks
• Examples


Essential Concepts and Terminology


The heavy-tail phenomenon

• Heavy-tailedness in computer networking is like ninjas: "they're everywhere" (Internet meme)

• Extreme observations must be handled carefully and taken very seriously

• A dataset that exhibits very large observation values makes descriptive and inferential statistical analysis much more difficult

• It might not make sense to use traditional statistical techniques and tools in these cases

• Some important initial questions:
  • Can we confidently discard single, scattered, or bursty observations that present extreme values due to uncontrolled factors?
  • Do the extreme values come from valid measurements?


Statistical black sheep

This is what I call a Statistical Black Sheep
• one that causes shame or embarrassment because of deviation from the accepted standards of his or her group (Black Sheep definition in Merriam-Webster)

You can either keep or discard such measurement values based on subjective analysis
• e.g., when they are out of the scope of your interest
  • Ex.: mean length of cat videos on YouTube

Take-home lesson:
• do not disgrace the black sheep without proper reason

The decision can also be based on rigorous statistical analysis
• a quantitative analysis
• Recall that an outlier might be influential in regression modeling (more on that later)


It starts with outliers

[Figure annotations: "Here is an outlier" and "Here is another one!"]


Outliers

An observation can be considered an outlier if it falls below or above certain limits
• detection is only an indication that you might want to think carefully about such observations

There are a number of formal tests and rules of thumb to detect outliers in an observed variable
1. Grubbs' test
2. Tietjen-Moore test
3. Mahalanobis distance
4. Extreme Value Theory (EVT)
5. Generalized Extreme Studentized Deviate (ESD)

Try not to be too picky when choosing the method
• simply because outlier detection and handling is an art
• a subjective approach plays an important role in accommodating outliers in your analysis


Outliers in Regression Models
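As an illustration of how a single outlier can be influential in a regression model, the following minimal sketch fits an ordinary least-squares line with and without one extreme point. The data points and the helper name `fit_line` are hypothetical and chosen only for the example.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: ten points lying exactly on y = 2x + 1 ...
clean = [(float(x), 2.0 * x + 1.0) for x in range(10)]
# ... plus one extreme observation at the edge of the x range.
with_outlier = clean + [(10.0, 100.0)]

slope_clean, _ = fit_line(clean)
slope_outlier, _ = fit_line(with_outlier)
print(f"slope without the outlier: {slope_clean:.2f}")
print(f"slope with the outlier:    {slope_outlier:.2f}")
```

One observation out of eleven is enough to move the fitted slope well away from the value that describes the other ten points, which is why outliers deserve a deliberate decision before modeling.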


Outliers

Kurtosis: gives a concrete idea about the expected number of outliers
• High (heavy tails, more outliers) or low (light tails, fewer outliers)

A general approach for outlier detection
• identify values far from the central values (in terms of $\sigma$)
• A common and simple approach
  • define the fences as $\bar{x} \pm 3s$
  • $\bar{x}$ is the sample mean
  • $s$ is the sample standard deviation (more conservative: use 4 instead of 3)
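The fence rule above can be sketched in a few lines of Python. The function names `fences` and `outside_fences` and the sample values are hypothetical; only the rule itself (mean plus or minus k sample standard deviations, with k = 3 or 4) comes from the slide.

```python
import statistics


def fences(values, k=3.0):
    """Return (lower, upper) fences at mean -/+ k sample standard deviations."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return mean - k * s, mean + k * s


def outside_fences(values, k=3.0):
    """Return the observations that fall outside the fences."""
    low, high = fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical measurements: twenty ordinary values and one extreme value.
delays = [10.0] * 20 + [500.0]
print(outside_fences(delays, k=3.0))  # default rule
print(outside_fences(delays, k=4.0))  # more conservative rule
```

A caveat on sample size: no observation can lie more than (n - 1)/sqrt(n) sample standard deviations from the sample mean, so with k = 3 this rule cannot flag anything in samples of ten or fewer points.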


Outliers and Heavy-Tails

Verify whether there are many observations outside the fences
• This might indicate that the underlying phenomenon generates heavy-tailed data
  • Your black sheep metamorphoses into a black swan

If extreme values come from distributions with heavy tails
  • e.g., Weibull, Gamma, Pareto
• Such events are not so rare
• They are likely to be part of the underlying phenomenon

If you decide to keep the outliers
• You recognize them as part of the underlying data generation process
• You need to address them properly

Why do we need to use other statistical measures when dealing with heavy-tailed distributions?


Moments of Heavy-Tailed Distributions


More on Heavy-Tails

Classification
• Light or thin tail
• Fat or heavy tail
• Long tail

Light- (thin-) tailed distributions are commonly used as references
• Normal and Exponential distributions
• Definition: a probability distribution whose complementary CDF decays at least exponentially fast

Heavy-tailed distributions are the more general class
• most formal analyses of heavy-tailed distributions deal with right heavy-tailed distributions with support on [0, ∞)
• the term "fat tail" is not well accepted by the traditional (and more formal) communities of statisticians and mathematicians, although it is widely used in finance


Some Formal Stuff

Some intuition behind the concepts
• A power law is a relation between two variables of the form $y \propto x^{-\alpha}$, where $y$ takes the general form $y = c\,x^{-\alpha}$
• There are dozens of power-law distributions
  • Zipf and Pareto are the best-known ones in the computer networking field
• They have interesting mathematical properties, such as tails that fall off asymptotically according to the power parameter

The Pareto distribution became famous due to its ability to fit and model real-world problems well
• The Pareto rule (or principle), aka the 80/20 rule, has been used to illustrate clearly that phenomena of all sorts are far from following the Normal distribution
  • It is clear that the Normal is not being normal!


Some Formalities

A non-negative random variable X, either continuous or discrete, can be considered to follow a power-law distribution if, as $x \to \infty$,
$$P(X > x) \sim c\,x^{-\alpha}$$
• where $c$ and $\alpha$ are the constant parameters that characterize the distribution
  • $\alpha$ is known as the scaling parameter. Both constants are positive.

Heavy-tailedness
• The tail of a distribution is denoted by $\bar{F}(x) = 1 - F(x)$, where F is the distribution function of a random variable X
  • F is (right) heavy-tailed if $\limsup_{x \to \infty} e^{\lambda x}\,\bar{F}(x) = \infty$ for all λ > 0
  • The distribution is light-tailed when $\bar{F}(x) = O(e^{-\lambda x})$ for some λ > 0
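A quick worked check may help to see what the definition separates. It writes the tail as $\bar{F}(x) = 1 - F(x)$ and uses two textbook tails as assumptions: a Pareto tail with a hypothetical minimum value $x_m > 0$ and tail index $\alpha > 0$, and an exponential tail with a hypothetical rate $\mu > 0$.

```latex
\begin{align*}
\text{Pareto: } \bar{F}(x) &= \left(\frac{x_m}{x}\right)^{\alpha},\ x \ge x_m
  &&\Rightarrow\quad e^{\lambda x}\,\bar{F}(x) = x_m^{\alpha}\,\frac{e^{\lambda x}}{x^{\alpha}}
     \longrightarrow \infty \ \text{ as } x \to \infty,\ \text{for every } \lambda > 0 \\
\text{Exponential: } \bar{F}(x) &= e^{-\mu x},\ x \ge 0
  &&\Rightarrow\quad e^{\lambda x}\,\bar{F}(x) = e^{-(\mu - \lambda)x} \le 1
     \ \text{ for } 0 < \lambda \le \mu
\end{align*}
```

The exponential factor always beats a polynomial tail, so the Pareto is heavy-tailed, while the exponential tail stays bounded for small enough λ and is therefore light-tailed.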


Some Formalities

Long-tailedness
• $\bar{F}$ (the survival function) is long-tailed when $\lim_{x \to \infty} \bar{F}(x + y) / \bar{F}(x) = 1$ for all $y > 0$
  • $\bar{F}$ is a non-increasing function, so the ratio never exceeds 1 and converges to 1 from below
• Considering that the tail of F has a polynomial decay rate $\alpha$ (i.e., the tail index), its moments of order $k$ are infinite for all $k \ge \alpha$

The Pareto case
• One interesting property of a power-law distribution is that a logarithmic-scale (i.e., log-log) plot of the CCDF, as in a rank plot, should present a straight line
• Its density function is given by:
$$f(x) = \frac{\alpha\,x_m^{\alpha}}{x^{\alpha + 1}}, \quad x \ge x_m$$


Some Formalities

Pareto shows interesting features
• If $\alpha \le 1$, there is no first moment, i.e., its mean is infinite
• If $\alpha \le 2$, its variance is infinite (heavy tail)
• A Pareto PDF is scale free
  • In computer networking problems, it can capture self-similar (aka fractal) behavior in several layers of the protocol stack
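The moment conditions can be verified directly. The short derivation below assumes the standard Pareto density $f(x) = \alpha\,x_m^{\alpha} / x^{\alpha+1}$ for $x \ge x_m$, where the minimum value $x_m > 0$ is notation introduced here for the sketch.

```latex
\begin{align*}
E[X^k] &= \int_{x_m}^{\infty} x^{k}\,\frac{\alpha\,x_m^{\alpha}}{x^{\alpha+1}}\,dx
        = \alpha\,x_m^{\alpha}\int_{x_m}^{\infty} x^{k-\alpha-1}\,dx \\
       &= \begin{cases}
            \dfrac{\alpha\,x_m^{k}}{\alpha - k}, & k < \alpha \\[1ex]
            \infty, & k \ge \alpha
          \end{cases}
\end{align*}
```

Setting $k = 1$ gives a finite mean $\alpha\,x_m / (\alpha - 1)$ only for $\alpha > 1$, and setting $k = 2$ gives a finite second moment (hence a finite variance) only for $\alpha > 2$.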

A log-log view of the Pareto PDF reveals, as expected, a straight line, as follows:
$$\log f(x) = -(\alpha + 1)\log x + \alpha \log x_m + \log \alpha$$
• The second and third terms of the equation are constants
• The relation between $\log f(x)$ and $\log x$ is linear, where $-(\alpha + 1)$ is its slope
  • A simple approach for identifying the scaling parameter $\alpha$ is by means of linear regression
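The regression idea can be sketched as follows. As an assumption of this sketch, the fit is done on the empirical CCDF rather than on the PDF: for a Pareto tail the log-log CCDF is also a straight line, with slope equal to minus α, and it needs no histogram binning. The function names are hypothetical, and the input is a set of evenly spaced Pareto quantiles standing in for real measurements.

```python
import math


def pareto_inverse_cdf(u, x_m, alpha):
    """Quantile function of a Pareto tail: solves (x_m / x) ** alpha = 1 - u."""
    return x_m * (1.0 - u) ** (-1.0 / alpha)


def estimate_alpha_loglog(samples):
    """Estimate alpha as minus the least-squares slope of log CCDF vs. log x.

    Assumes strictly positive samples with no (or few) ties.
    """
    xs = sorted(samples)
    n = len(xs)
    log_x = [math.log(x) for x in xs]
    # Empirical CCDF at the i-th smallest value: fraction of samples >= it.
    log_ccdf = [math.log((n - i) / n) for i in range(n)]
    mean_x = sum(log_x) / n
    mean_y = sum(log_ccdf) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_ccdf))
    return -sxy / sxx


# Deterministic stand-in data: 1000 evenly spaced quantiles of Pareto(x_m=1, alpha=1.5).
n = 1000
samples = [pareto_inverse_cdf((i + 0.5) / n, x_m=1.0, alpha=1.5) for i in range(n)]
print(f"estimated alpha: {estimate_alpha_loglog(samples):.2f}")
```

On real traces the few largest observations carry a lot of leverage in such a fit, so the estimate is best treated as a first approximation.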


Take-home lesson

The fact is that some universal statistical practices and theories do not hold if the data follow a heavy-tailed distribution
• The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) may not hold when dealing with heavy-tailed distributions
  • This happens when their first or second moments are not finite: a finite mean is the fundamental assumption behind the LLN, and a finite variance is the one behind the classical CLT
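A small simulation can make this lesson concrete. The sketch below tracks the running sample mean of Pareto draws with α = 0.8 (infinite mean) next to exponential draws with mean 1. The seed, the parameters, and the helper names are assumptions made for the illustration, and no specific output is claimed for the heavy-tailed column.

```python
import random


def pareto_sample(rng, x_m, alpha):
    """Draw one Pareto(x_m, alpha) value by inverse-transform sampling."""
    u = rng.random()  # uniform in [0, 1), so 1 - u is never zero
    return x_m * (1.0 - u) ** (-1.0 / alpha)


def running_means(draw, n, checkpoints):
    """Return {sample size: sample mean} at the requested checkpoints."""
    total = 0.0
    means = {}
    for i in range(1, n + 1):
        total += draw()
        if i in checkpoints:
            means[i] = total / i
    return means


rng = random.Random(2016)  # fixed seed so the run is repeatable
checkpoints = {10 ** k for k in range(2, 6)}
heavy = running_means(lambda: pareto_sample(rng, 1.0, 0.8), 10 ** 5, checkpoints)
light = running_means(lambda: rng.expovariate(1.0), 10 ** 5, checkpoints)

for size in sorted(checkpoints):
    print(f"n={size:>6}  Pareto(alpha=0.8): {heavy[size]:12.2f}   Exp(1): {light[size]:.3f}")
```

The exponential column settles near 1 as n grows, as the LLN predicts, whereas the Pareto column has no finite value to settle on and keeps being pushed around by the largest draws.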


Evidence of Heavy-Tailedness in Computer Networks


Evidence of Heavy-Tailedness

Extreme events in nature occur at both micro and macro scales; there are a number of case studies and pieces of evidence of the occurrence of extreme events
• nature (e.g., earthquakes, landslides, floods, droughts, storms)
• human-induced catastrophes (e.g., spills, nuclear accidents, dam ruptures, power outages)
• finance (e.g., wealth distribution, when the richest 0.1% hold 50% of the world's wealth)
• geo- and socio-political area (e.g., human fatalities in wars)
• online social network phenomena (e.g., tweets like "the naked celebrity pics leak cracks down the Internet"), which are known to cause spikes in traffic from time to time

Extreme events in computer networking have been studied (by measurements, modelling, and analyses) for decades
• Unfortunately, a number of network engineers and researchers still do not take such phenomena seriously


Some Examples

Power-law distributions in Internet measurements
• web objects have a tight relation with long tails
  • Images, text, video, embedded code
• modelling issues and implications for network planning and design (e.g., web caching architectures)
  • A question like "What is the average size of web objects in the Internet?" should not be answered by calculating the mean value!
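To see why the mean is a poor answer to the "average size" question, consider this minimal sketch. The object sizes are hypothetical numbers invented for the illustration (many small objects and a couple of very large ones), not measurements.

```python
import statistics

# Hypothetical web object sizes in kilobytes: mostly small, a few huge.
sizes_kb = [2, 3, 3, 4, 5, 5, 6, 8, 10, 12, 15, 20, 4000, 25000]

mean_kb = statistics.mean(sizes_kb)
median_kb = statistics.median(sizes_kb)
above_mean = sum(size > mean_kb for size in sizes_kb)

print(f"mean   = {mean_kb:.1f} KB")
print(f"median = {median_kb:.1f} KB")
print(f"{above_mean} of {len(sizes_kb)} objects are larger than the mean")
```

The mean is dictated by the two largest objects and describes almost none of the others, while the median (or other quantiles) stays close to what a typical request looks like.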

Recent studies in mobile environments
• typical performance metrics follow heavy-tailed distributions
  • main object size
  • embedded object size
  • number of embedded objects in one request
  • embedded object inter-arrival time
  • session duration
  • interval between two consecutive requests (aka the reading time)


Some Examples

Video Systems
• YouTube: the number of views can be modelled well by Zipf, Weibull, or Gamma distributions
  • Zipf-like distributions fit this popularity metric well in mobile environments
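A Zipf-like popularity model is easy to sketch: the item at rank r gets a share of the views proportional to r^(-s). The catalogue size (1000 videos), the exponent s = 1, and the function names below are assumptions for illustration, not values taken from any of the studies mentioned.

```python
def zipf_pmf(n_items, s=1.0):
    """Probability of each rank 1..n_items under a Zipf law with exponent s."""
    weights = [rank ** -s for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_share(pmf, fraction):
    """Share of all views attracted by the most popular `fraction` of items."""
    k = max(1, round(len(pmf) * fraction))
    return sum(pmf[:k])


# Hypothetical catalogue of 1000 videos with exponent s = 1.
popularity = zipf_pmf(1000, s=1.0)
print(f"top 20% of videos attract {top_share(popularity, 0.2):.0%} of the views")
```

Under these assumed parameters a small fraction of the catalogue concentrates most of the views, which is the kind of skew that matters for caching and content placement.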

Intriguing cases of heavy-tailedness in the Internet are in the network layer
• strong evidence of heavy-tailedness for the sampled IP addresses
• distributions of IP packets per aggregation all follow a power-law distribution
  • the number of packets per flow, unique address, or IP prefix

Internet connectivity at several levels of aggregation can be modeled with heavy-tailed distributions

P2P Systems
• video popularity
• session duration
• churn of peers
  • user arrival at and departure from the overlay network
• Different studies have reported different distributions (just be careful with the choice)
