drive THE STORK CORRELATION USE & ABUSE OF STATISTICS IN CAPACITY PLANNING Denise P. Kalm R&D Sr. Product Specialist BMC Software, Inc.

Post on 26-Dec-2015



THE STORK CORRELATION
USE & ABUSE OF STATISTICS IN CAPACITY PLANNING

Denise P. Kalm
R&D Sr. Product Specialist
BMC Software, Inc.


“Statistical analysis – Mysterious, sometimes bizarre manipulations performed upon the collected data of an experiment in order to obscure the fact that the results have no generalizable meaning for humanity. Commonly, computers are used, lending an aura of unreality to the proceedings.”


Agenda

Why a stork?
Tools of the trade
Getting the terminology right
My favorite statistics
Lies, damned lies and telling your manager what he wants to know
Summary


The Stork Correlation

In a small Welsh town, there was a 0.95 correlation between the arrival of storks and the arrival of babies.

Why?


The Stork Correlation

There was also a 1.0 correlation between the dates fishermen were home from the sea and the likely dates of conception.


Statistical Abuse

Correlation is the most misused statistic
Standard deviation and mean rank 2nd and 3rd
Ignorance of statistics leads to career-limiting recommendations
Statistics can be your best friend, once you understand them
But… it is more art than science.


Why Me?

Background
Training
Experience


Why You?

Determining the significance of changes
Not confusing correlation with cause-and-effect
Saving time on problem resolution
Theory testing


Big Caveats

Only use statistics with like-minded individuals. Managers typically only understand average and percentiles.

When statistics don’t appear to be working for you, check out statistics that do not require the assumption of normality.


Tools of the Trade

SAS/SAS Graph
SPSS
Statistical calculator
Excel
Brute force with the equations (not recommended, but possible)


Definitions

Sample & population
Normality
Outlier
Mean, median, mode & percentile
Standard deviation & variance
Misc. terms


“Statistics is a systematic method for getting the wrong conclusion with 95% confidence.”


Sample & Population

Population – all the data for the period of time studied, e.g., every RMF/SMF data record for an hour.

Sample – a random selection from all the data points/records available.


Normal Distribution

A distribution that describes many situations where observations are distributed symmetrically around the mean. About 68% of all values under the curve lie within one standard deviation of the mean and about 95% lie within two standard deviations.
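The 68% and 95% figures can be checked directly from the normal distribution. The sketch below is only an illustration: the function name `fraction_within` is made up for this example, and it relies on the standard identity that the share of a normal distribution within k standard deviations of the mean is erf(k / √2).

```python
import math

def fraction_within(k: float) -> float:
    """Share of a normal distribution lying within k standard deviations of the mean."""
    # For normal data, P(|x - mean| <= k * SD) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

within_one = fraction_within(1)  # roughly 0.68
within_two = fraction_within(2)  # roughly 0.95

print(f"within 1 SD: {within_one:.4f}")
print(f"within 2 SD: {within_two:.4f}")
```

These shares hold only for normally distributed data; skewed performance data such as response times can be far from them.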


Central Limit Theorem

As sample size increases, the distribution of the sample mean approaches a normal distribution whose mean equals the mean of the population and whose standard deviation equals the standard deviation of the population divided by the square root of the sample size.

More samples, better data.
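A small simulation can make the theorem concrete. This is a sketch under stated assumptions, not anything taken from the slides: the population is an exponential distribution (skewed, with mean 1.0 and standard deviation 1.0), and the sample size, number of trials and seed are arbitrary choices. It compares the spread of the sample means with the predicted value, population SD divided by the square root of the sample size.

```python
import math
import random
import statistics

def sample_means(sample_size, trials, rng):
    """Return the mean of each of `trials` random samples of `sample_size` values."""
    means = []
    for _ in range(trials):
        # Assumed population: exponential with rate 1.0 (mean 1.0, SD 1.0), clearly not normal.
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

rng = random.Random(2002)  # fixed seed so the run is repeatable
sample_size = 50
means = sample_means(sample_size, trials=2000, rng=rng)

observed_mean = statistics.fmean(means)       # should be close to the population mean, 1.0
observed_sd = statistics.stdev(means)         # spread of the sample means
predicted_sd = 1.0 / math.sqrt(sample_size)   # population SD / sqrt(n)

print(f"mean of sample means: {observed_mean:.3f}")
print(f"SD of sample means:   {observed_sd:.3f} (predicted {predicted_sd:.3f})")
```

Even though the individual values are strongly skewed, the means of the samples cluster symmetrically around the population mean.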


Formulas

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty$$

where
µ is the mean
σ is the standard deviation
e is the base of the natural logarithm, sometimes called Euler's e (2.71...)
π is the constant Pi (3.14...)


Outlier

Outlier – a point that, because of observation noise, does not follow the characteristics of the input (or desired response) data.


“There are liars, outliers, and out-and-out liars.”


Mean

Arithmetic Mean – numeric average of all the data.

$$\bar{X} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

where n is the number of data points.

Assumes normality
Affected by outliers
Plot data to understand


Plot to see meaning of “mean”

[Figure: frequency histogram. Y-axis: Frequency (0 to 60); x-axis: values from 0.2 to 14.8. The mean is marked separately from the median/mode.]


Median and Mode

Median – middle value, where half the values lie on each side of the median, when they are ordered by value.

Mode – most frequently observed value. If no value repeats, there is no mode.


Percentile & Percentage Change

Percentile – group data by putting an equal number of data points into each group. Ex. 95th percentile – 95% of values are less than x.

Percentage Change = (after value – before value) / before value

Risk of using percentage change
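The risk is easiest to see with numbers. In this sketch the function name and the before/after values are illustrative assumptions; the formula is the one on the slide, multiplied by 100 to express it as a percentage.

```python
def percentage_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`: (after - before) / before * 100."""
    if before == 0:
        raise ValueError("percentage change is undefined when the before value is 0")
    return (after - before) / before * 100

# Two hypothetical tuning results, response times in seconds.
small_pct = percentage_change(0.2, 0.1)   # saves 0.1 s
large_pct = percentage_change(4.0, 2.0)   # saves 2.0 s

print(f"0.2 s -> 0.1 s: {small_pct:.0f}% (absolute change {0.1 - 0.2:+.1f} s)")
print(f"4.0 s -> 2.0 s: {large_pct:.0f}% (absolute change {2.0 - 4.0:+.1f} s)")
```

Both cases report the same percentage, yet one saves a tenth of a second and the other saves two seconds, so a percentage quoted without its base value hides the size of the effect.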


Standard Deviation & Variance

Standard Deviation – square root of the variance. For normal data, about 2/3 (68%) of the data points are within 1 SD of the mean on either side.

Variance – amount of “spread” of the data around the mean:

$$s^2 = \frac{(x_1-\bar{X})^2 + (x_2-\bar{X})^2 + \dots + (x_n-\bar{X})^2}{n-1}$$

where $\bar{X}$ is the mean, $x_1 \dots x_n$ are the data points, and n is the number of samples.
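The variance formula above can be worked through in a few lines. The CPU readings are made-up values, and `sample_variance` is a name chosen for this sketch; the result is compared with `statistics.variance`, which uses the same n-1 divisor.

```python
import math
import statistics

def sample_variance(data):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

# Hypothetical %CPU readings from five intervals.
cpu_busy = [52, 55, 61, 58, 49]

variance = sample_variance(cpu_busy)
std_dev = math.sqrt(variance)

print(f"variance: {variance:.2f}")
print(f"standard deviation: {std_dev:.2f}")
print(f"statistics.variance agrees: {statistics.variance(cpu_busy):.2f}")
```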


Standard Deviation of a Sample

If the SD is large, you need to inspect your sampling method. This may indicate suspect data, poor interval choices, etc.


My Favorite Statistics

Linear Regression
Correlation


“In ancient times, they had no statistics, so they had to fall back on lies.” - Stephen B. Leacock


Linear Regression

Linear Regression – describing the relationship between two data elements by fitting a straight line to the data.
Ex. X = transaction rate, Y = %CPU utilization

Y = bX + C, where X and Y are the variables, b is the slope of the line and C is the point where the line intercepts the y-axis.
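As a rough sketch of how b and C might be obtained, the code below fits a least-squares line by hand. Least squares is an assumption here (the slides do not say which fitting method a tool uses), and the transaction rates and CPU figures are invented for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = b*X + C; returns (b, C)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b = sum of cross-deviations / sum of squared X deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    c = mean_y - b * mean_x  # the fitted line passes through (mean_x, mean_y)
    return b, c

# Hypothetical measurements: X = transactions per second, Y = %CPU utilization.
tx_rate = [10, 20, 30, 40, 50]
cpu_pct = [12.0, 21.0, 33.0, 41.0, 53.0]

b, c = fit_line(tx_rate, cpu_pct)
print(f"Y = {b:.2f} * X + {c:.2f}")
```

With these numbers each additional transaction per second costs about 1% CPU, and the intercept approximates the CPU used when no transactions are running.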



Good Candidate for Regression

[Figure: “Potential Impact of Increasing Volume” (Predict Impact of Change). Y-axis: % Proc (0 to 100); x-axis: BASELINE, PLAN1, PLAN2, PLAN3, PLAN4; series: PRD1, PRD2. © 2001 BMC Software, Inc.]


Bad Candidate for Regression

[Figure: “Impact on Response Time” (Predict Impact of Change). Y-axis: Secs (0 to 2.5); x-axis: BASELINE, PLAN1, PLAN2, PLAN3, PLAN4; series: BILLWEB Page Service, Page Wait, I/O Wait, CPU Service, I/O Service, CPU Wait. © 2001 BMC Software, Inc.]


Gotchas

Make sure relating the variables makes sense.
Plot data when not sure of the relationship (scatter plot)
Do not throw out outliers until you are sure of why they occurred
Do not commit linear “progression”


Correlation

Correlation coefficient – r measures the degree (and direction) of the linear relationship between two variables. r = 1.00 indicates a perfect positive correlation; r = 0.0 means there is no linear relationship at all. A negative r means that as one variable increases, the other decreases. R2, the square of r, is never negative, so it shows the strength of the relationship but not its direction.
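To show how the sign behaves, here is a minimal hand-rolled Pearson correlation. The function name and the tiny data sets are illustrative only; in practice a statistics tool would do this calculation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases with x
falling = [10, 8, 6, 4, 2]   # decreases as x increases

r_up = pearson_r(x, rising)
r_down = pearson_r(x, falling)

print(f"rising:  r = {r_up:+.2f}, r squared = {r_up ** 2:.2f}")
print(f"falling: r = {r_down:+.2f}, r squared = {r_down ** 2:.2f}")
```

The direction shows up only in r; squaring it gives the same value for both series.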


Correlation is NOT cause and effect.

Though there may be a causal relationship between two variables, you cannot infer it from a correlation analysis.

A third factor may really be causing the correlation.
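The third-factor effect can be reproduced with a toy data set. Everything in this sketch is invented for illustration: a hidden `season` value drives both series, each series gets its own small wiggle, and neither series is ever computed from the other.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical hidden factor that both observed series depend on.
season = [1, 2, 3, 4, 5, 6, 7, 8]
wiggle_a = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3]
wiggle_b = [-0.2, 0.3, -0.3, 0.1, -0.1, 0.4, -0.4, 0.2]

# Each series is built from the hidden factor only, never from the other series.
storks = [s + a for s, a in zip(season, wiggle_a)]
babies = [2 * s + b for s, b in zip(season, wiggle_b)]

r = pearson_r(storks, babies)
print(f"r(storks, babies) = {r:.3f}")
```

The correlation comes out very high even though, by construction, storks have no effect on babies.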


Don’t calculate it by hand – use a tool.

Use your brain to interpret the results.


“A statistician is someone who is skilled at drawing a precise line from an unwarranted assumption to a foregone conclusion.”


How to Lie With Statistics


Statisticulation

“Statistics are like a bikini. What they reveal is suggestive, but what they conceal is vital.” - Aaron Levenstein


Why Lie?

Outliers make your data look bad
You are trying to comply with a performance clause
You are too busy writing the great American novel to do your job
Your manager wouldn’t understand anyway


Averaging Averages

Why do it?
Most performance data is already averaged, so it is easier
Makes response times look better in most cases
Smooths out all variability
Mostly eliminates outliers, particularly in plotting data
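A short sketch of why averaging averages flatters response time. The interval data are hypothetical: each tuple is a transaction count and the average response time already recorded for that interval, with the busiest interval also being the slowest.

```python
# Hypothetical interval records: (transaction count, average response time in seconds).
intervals = [
    (100, 0.2),    # quiet interval
    (150, 0.3),    # quiet interval
    (2000, 1.5),   # peak interval: most of the work, worst response time
]

# Average of the interval averages: every interval counts the same.
naive_avg = sum(rt for _, rt in intervals) / len(intervals)

# Transaction-weighted average: every transaction counts the same.
total_tx = sum(count for count, _ in intervals)
weighted_avg = sum(count * rt for count, rt in intervals) / total_tx

print(f"average of averages: {naive_avg:.3f} s")
print(f"weighted by volume:  {weighted_avg:.3f} s")
```

Because most transactions ran in the slow peak interval, the volume-weighted figure is about twice the average of averages, which is the number users actually experienced.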


Using Percentage Change

Why do it?
To exaggerate the benefit of a performance change.
Ex. RT decreased 50% going from 0.2 to 0.1.

To justify a processor upgrade
Ex. Doubling application volume will increase its CPU demand 100% (even when the CPU demand was very small)

To impress or terrify


Small Sample Size

Why do it?
SAS jobs run faster
Large, randomly obtained data doesn’t give the right results; a small, selected window does
You don’t really have any data and have to invent some


Stupid Graph Tricks

Why do it?
To make your data look better

How to do it
Log functions on one axis – to diminish the impact of a change. Or just use different orders of magnitude for x and y axes
Select graph type (pie, line, stacked bar) which best misleads your audience
Eliminate actual metrics so you can draw the line to reflect your reality
Put time on the wrong axis
Eliminate all legends, data tables, etc.


Invalid Metrics

How to do it
Use your own definitions. Ex. Typical CICS tran = non-browsing, non-batch work
Use multiple decimal places to lend an air of precision to the data. Good with small or unreliable sample or poor capture ratio.
Compare apples to oranges. Ex. Compare performance after tuning using a period of low demand to compare to a “before” of high demand
Add percentage changes together. Ex. If volume changes cause a 10% inc. in DB2, a 15% inc. in CICS and a 20% increase in batch, that’s 45%.
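The last trick is easy to debunk with arithmetic. The growth rates below are the ones from the example; the baseline CPU figures for each workload are assumptions made up for this sketch, since the overall change depends on how large each workload was to begin with.

```python
# Hypothetical baseline CPU consumption per workload (any consistent unit, e.g. CPU seconds).
baseline = {"DB2": 30.0, "CICS": 50.0, "batch": 20.0}

# Growth rates from the example: 10% DB2, 15% CICS, 20% batch.
growth = {"DB2": 0.10, "CICS": 0.15, "batch": 0.20}

before_total = sum(baseline.values())
after_total = sum(baseline[name] * (1 + growth[name]) for name in baseline)

overall_pct = (after_total - before_total) / before_total * 100
added_pct = sum(growth.values()) * 100  # the misleading "just add them up" figure

print(f"sum of percentages: {added_pct:.1f}%")
print(f"actual overall increase: {overall_pct:.1f}%")
```

With this baseline the real increase is 14.5%, a weighted average of the individual changes, so it can never exceed the largest of them (20%), let alone reach 45%.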


Correlation Abuse

How to do it

Select two metrics that aren’t usually related (I/O response time and file size), draw a correlation and justify a memory upgrade.

Most people don’t know performance metrics well enough to challenge you.


Another Common Lie

Linear progression – forecasting the line past the data points you have

Unless you are sure the relationship between two variables is linear, do not attempt this. Even mostly linear relationships (such as CPU vs. volume) may go non-linear at near-saturation.
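One way to see the danger is to fit a line to measurements taken at low utilization and then extend it toward saturation. This sketch rests on an assumption that is not in the slides: response time is modeled with the simple single-server queueing approximation R = S / (1 - U), with a made-up service time S of 0.1 seconds. The fit itself is ordinary least squares.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = b*X + C; returns (b, C)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x

def response_time(utilization, service_time=0.1):
    """Assumed model (not from the slides): single-server queue, R = S / (1 - U)."""
    return service_time / (1.0 - utilization)

# "Measurements" taken only at low to moderate utilization.
observed_util = [0.1, 0.2, 0.3, 0.4, 0.5]
observed_rt = [response_time(u) for u in observed_util]

slope, intercept = fit_line(observed_util, observed_rt)

target_util = 0.95
linear_forecast = slope * target_util + intercept   # the straight line pushed past the data
model_value = response_time(target_util)            # what the assumed model gives

print(f"linear forecast at 95% busy: {linear_forecast:.2f} s")
print(f"assumed queueing model:      {model_value:.2f} s")
```

The line fits the observed range well, yet at 95% busy it predicts a response time several times lower than the model it was fitted to.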


What Can Go Wrong

What you think might happen

What might really be happening


What We Didn’t Cover

Hypothesis testing – valuable if you want to see how likely it is that your theory matches reality. Is the change in the data due to chance, or did you really make a difference?

Chi-square
T-test

When you don’t have enough information about the data (population) or about cause-and-effect relationships


Summary

Turn data into information by applying statistics and your knowledge.

Practice “safe performance analysis” and protect your job.

CYA


“Numbers are like people; torture them enough and they’ll tell you anything.”


References

Geis “How to Lie with Statistics”
Dixon and Massey “Introduction to Statistical Analysis”
Gonick & Smith “The Cartoon Guide to Statistics”
Sziede “Statistics for the Algebraically Challenged”
Munoz “Sampling Issues in the Collection of Performance Data” CMG2002


Questions?

Denise P. Kalm
BMC Software, Inc.