ch3 elementary descriptive statistics


Upload: adie

Post on 23-Feb-2016


TRANSCRIPT

Page 1: Ch3 Elementary Descriptive Statistics

Ch3 Elementary Descriptive Statistics

Page 2: Ch3 Elementary Descriptive Statistics

Section 3.1: Elementary Graphical Treatment of Data

Before doing ANYTHING with data:

• Understand the question.

– An approximate answer to the exact question is always better than an exact answer to an approximate question. John Tukey.

• Know how the experiment was conducted.

Page 3: Ch3 Elementary Descriptive Statistics

The FIRST thing to do with the data is to PLOT THE DATA.

– Plot all individual points.

– If there are connections between points, e.g. points come from the same pairs (or sometimes separate blocks), show the connections between related points.

Page 4: Ch3 Elementary Descriptive Statistics

Plotting data is an extremely important step.

• More often than not, the data I get when consulting have problems, such as incorrect values or attributes the client didn’t tell me about.

• Plotting helps reveal relationships and answers.

• Plotting is a very effective way to present results.

– “A picture is worth a thousand words.”

Page 5: Ch3 Elementary Descriptive Statistics

Example: 8 lb. test fishing line

Question: Which type(s) of line are strongest?

Listing numerical data:

Trilene XL  11.5 11.3 11.7 11.6 11.7 11.4 11.5 11.5 11.6 11.4
Trilene XT  11.6 11.8 11.7 11.7 11.5 11.6 11.6 11.8 11.5 11.7
Stren       11.1 11.1 11.2 11.0 11.1 11.3 11.2 10.9 11.0 11.1

It’s hard to see what’s happening without organizing the data.

Page 6: Ch3 Elementary Descriptive Statistics

A “dot” diagram

       XL    XT    Stren
11.8         **
11.7   **    ***
11.6   **    ***
11.5   ***   **
11.4   **
11.3   *           *
11.2               **
11.1               ****
11.0               **
10.9               *
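A dot diagram like this one can be tabulated directly from the listed values. The following is a minimal sketch in Python; the function name `dot_diagram` and the text layout are illustrative choices, not part of the original slides.

```python
from collections import Counter

# Values as listed on the fishing line slide.
lines = {
    "XL": [11.5, 11.3, 11.7, 11.6, 11.7, 11.4, 11.5, 11.5, 11.6, 11.4],
    "XT": [11.6, 11.8, 11.7, 11.7, 11.5, 11.6, 11.6, 11.8, 11.5, 11.7],
    "Stren": [11.1, 11.1, 11.2, 11.0, 11.1, 11.3, 11.2, 10.9, 11.0, 11.1],
}


def dot_diagram(groups):
    """Return text rows: one row per distinct value, one '*' per observation."""
    counts = {name: Counter(values) for name, values in groups.items()}
    # Distinct values across all groups, largest first.
    levels = sorted({v for values in groups.values() for v in values}, reverse=True)
    rows = ["value " + " ".join(f"{name:<6}" for name in groups)]
    for level in levels:
        cells = " ".join(f"{'*' * counts[name][level]:<6}" for name in groups)
        rows.append(f"{level:<5} {cells}".rstrip())
    return rows


print("\n".join(dot_diagram(lines)))
```

Each group gets its own column, so shifts between the three types of line show up as stars moving up or down the value scale.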

Page 7: Ch3 Elementary Descriptive Statistics

Stem and leaf plot

It shows the shape of the distribution and at the same time preserves the original values.

In the gear runout example, for the hung gears group, we have data points of

7, 8, 8, 10, 10, 10, 10, 11, 11, 11, 12, 13…

A stem and leaf plot is

0 | 7 8 8
1 | 0 0 0 0 1 1 1 2 3
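Building a stem and leaf plot amounts to splitting each value into a stem (here the tens digit) and a leaf (the last digit). A minimal sketch, assuming non-negative integer data; the function name `stem_and_leaf` is illustrative, and only the runout values actually listed on the slide are used.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows: stem = value // 10, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    # Walk every stem from smallest to largest so empty stems still appear.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows


runouts = [7, 8, 8, 10, 10, 10, 10, 11, 11, 11, 12, 13]  # only the values listed on the slide
print("\n".join(stem_and_leaf(runouts)))
```

Keeping empty stems in the output matters: a gap in the stems is a gap in the distribution, and dropping it would hide that.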

Page 8: Ch3 Elementary Descriptive Statistics

Two groups can be compared with back to back stem and leaf diagrams.

E.g. Stopping distances of bikes

Treaded tire |    | Smooth tire
             | 34 | 1 8 9
             | 35 | 5
           5 | 36 |
         6 4 | 37 | 5
             | 38 |
           1 | 39 | 1
         2 0 | 40 |

Or dot diagrams

Treaded   |     |     |  *  | **  |     |  *  | **
         340   350   360   370   380   390   400
Smooth    | *** |  *  |     |  *  |     |  *  |

Page 9: Ch3 Elementary Descriptive Statistics

When there are associations between sets of data values, plot the data accordingly.

E.g., snowfall for Duluth and White Bear Lake, 1972-2000.

A not very good way to plot the data:

   WB Lake        Duluth
            130   *
            120   *
            110   **
        **  100   ***
         *   90   *****
             80   ******
    ******   70   **
       ***   60   **
**********   50   ****
       ***   40   ***
       ***   30   *
       ***   20

Page 10: Ch3 Elementary Descriptive Statistics

Snowfall plot

[Figure: snow_total (0 to 140) versus year (1972 to 1997), with one series each for Duluth and White Bear.]

Page 11: Ch3 Elementary Descriptive Statistics

A study of trace metals in South Indian River

[Figure: sampling locations numbered 1 to 6.]

T = top water zinc concentration (mg/L)
B = bottom water zinc concentration (mg/L)

Location  1      2      3      4      5      6
Top       0.415  0.238  0.390  0.410  0.605  0.609
Bottom    0.430  0.266  0.567  0.531  0.707  0.716

Page 12: Ch3 Elementary Descriptive Statistics

• One of the first things to do when analyzing data is to PLOT the data

• This is not a useful way to plot the data. There is not a clear distinction between bottom water and top water zinc—even though Bottom>Top at all 6 locations.

[Figure: Zinc (0 to 0.8) plotted separately for Top and Bottom, with the points not connected.]

Page 13: Ch3 Elementary Descriptive Statistics

A better way

[Figure: Zinc (0.2 to 0.7) for Top and Bottom, with a line joining the two points of each pair.]

Connect points in the same pair.
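Connecting the pairs works because the information is in the within-pair differences, not in the two separate clouds of points. A short sketch in Python using the zinc values from the table; the variable names are illustrative.

```python
# Zinc concentrations (mg/L) at locations 1 to 6, from the table in these slides.
top = [0.415, 0.238, 0.390, 0.410, 0.605, 0.609]
bottom = [0.430, 0.266, 0.567, 0.531, 0.707, 0.716]

# Difference within each pair (same location): Bottom minus Top.
differences = [round(b - t, 3) for t, b in zip(top, bottom)]
print(differences)

# Every paired difference is positive ...
print(all(d > 0 for d in differences))

# ... even though the two groups overlap heavily when the pairing is ignored.
print(max(top) > min(bottom))
```

The last line shows why the unpaired plot looks inconclusive: the largest Top value exceeds the smallest Bottom value, so the two groups overlap unless each location is compared with itself.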

Page 14: Ch3 Elementary Descriptive Statistics

Another way (scatter plot)

[Figure: scatter plot of top and bottom zinc, both axes 0 to 0.8, with the reference line Bottom=Top.]

Page 15: Ch3 Elementary Descriptive Statistics

• The following plot would imply a natural ordering of sites from 1 to 6.

• This would not be the best way to plot the data unless the sites 1-6 correspond to a natural ordering such as distance downstream of a factory.

[Figure: Zinc (0 to 0.8) versus Site (0 to 7), with separate series for Top and Bottom.]

Page 16: Ch3 Elementary Descriptive Statistics

Run charts (a version of scatter plot)

• The variable on the x axis is a time variable.

• Table: 30 consecutive outer diameters turned on a lathe

Joint  Diameter (inches above nominal)   Joint  Diameter (inches above nominal)
 1     -0.005                            16      0.015
 2      0                                17      0
 3     -0.01                             18      0
 4     -0.03                             19     -0.015
 5     -0.01                             20     -0.015
 6     -0.025                            21     -0.015
 7     -0.03                             22     -0.015
 8     -0.035                            23     -0.015
 9     -0.025                            24     -0.01
10     -0.025                            25     -0.015
11     -0.025                            26     -0.035
12     -0.035                            27     -0.025
13     -0.04                             28     -0.02
14     -0.035                            29     -0.025
15     -0.035                            30     -0.015

Page 17: Ch3 Elementary Descriptive Statistics

Moving along time, the outer diameters tend to get smaller until part 16, where there is a large jump, followed by a pattern of diameter generally decreasing in time.

[Figure: run chart of Diameter (inches above nominal), -0.05 to 0.02, against joint number, 0 to 35.]
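The jump described for the run chart can also be located numerically by looking at the change from each part to the next. A minimal sketch in Python using the 30 tabulated diameters; the variable names are illustrative.

```python
# Diameters (inches above nominal) for parts 1 to 30, in production order.
diameters = [
    -0.005, 0.0, -0.01, -0.03, -0.01, -0.025, -0.03, -0.035, -0.025, -0.025,
    -0.025, -0.035, -0.04, -0.035, -0.035, 0.015, 0.0, 0.0, -0.015, -0.015,
    -0.015, -0.015, -0.015, -0.01, -0.015, -0.035, -0.025, -0.02, -0.025, -0.015,
]

# changes[k] is the change from part k+1 to part k+2 (parts are numbered from 1).
changes = [b - a for a, b in zip(diameters, diameters[1:])]

# Index of the largest increase, converted to the part number where the jump lands.
largest = max(range(len(changes)), key=lambda k: changes[k])
part = largest + 2
print(part, round(changes[largest], 3))
```

This is only a check on what the plot already shows; the run chart remains the better tool because it also reveals the downward drift before and after the jump.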

Page 18: Ch3 Elementary Descriptive Statistics

Section 3.2: Quantiles and Related Graphical Tools

Quantile: Roughly speaking, for a number p between 0 and 1, the p quantile of a distribution is a number such that a fraction p of the distribution lies to the left and a fraction 1-p of the distribution lies to the right.

Page 19: Ch3 Elementary Descriptive Statistics

p quantile = 100*p th percentile

Q(0.10) = 0.10 quantile = 10th percentile

Q(0.50) = 0.50 quantile = 50th percentile = median

Q(0.25) = 0.25 quantile = 25th percentile = first quartile

Q(0.75) = 0.75 quantile = 75th percentile = third quartile

Page 20: Ch3 Elementary Descriptive Statistics

The p quantile is the ordered data point with index $i = np + 0.5$.

So the cumulative probability corresponding to the $i$ th ordered point is $p = \frac{i - 0.5}{n}$.

Page 21: Ch3 Elementary Descriptive Statistics

Consider the following n=10 points

Q(0.25) = 0.25 quantile = 8572

Q(0.50) = median =

Q(0.75) = 9614

IQR = Interquartile Range = Q(0.75) - Q(0.25) = 9614 - 8572 = 1042

Page 22: Ch3 Elementary Descriptive Statistics

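To find the 93rd percentile: 0.93 is part way between 0.85 and 0.95, the cumulative probabilities of the 9th and 10th ordered points.

Since (0.93 - 0.85)/(0.95 - 0.85) = 0.8, Q(0.93) is 0.8 of the way from Q(0.85) to Q(0.95):

Q(0.93) = Q(0.85) + 0.8(Q(0.95) - Q(0.85)) = 0.2*Q(0.85) + 0.8*Q(0.95) = 0.2(9614) + 0.8(10,688) = 10,473.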
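The interpolation rule can be written as a small function. This is a sketch under stated assumptions: it uses the index rule i = np + 0.5 (the inverse of the (i-0.5)/n column used elsewhere in these slides), the name `quantile` is illustrative, and clamping p outside the range covered by the data is a choice made here, not something the slides specify. Because the ten data points for this slide are not in the transcript, the example uses the 950 stress spring lifetimes from Section 3.2.3.

```python
import math


def quantile(data, p):
    """p quantile via index i = n*p + 0.5, interpolating between ordered points."""
    x = sorted(data)
    n = len(x)
    i = n * p + 0.5                 # 1-based, possibly fractional, index
    i = min(max(i, 1.0), float(n))  # assumption: clamp to the smallest/largest point
    lower = math.floor(i)
    if lower >= n:
        return x[-1]
    frac = i - lower                # how far to move toward the next ordered point
    return x[lower - 1] + frac * (x[lower] - x[lower - 1])


# Ordered lifetimes of springs at 950 stress (n = 10), from Section 3.2.3.
life_950 = [117, 135, 135, 162, 162, 171, 189, 189, 198, 225]

q1 = quantile(life_950, 0.25)
med = quantile(life_950, 0.50)
q3 = quantile(life_950, 0.75)
print(q1, med, q3, q3 - q1)          # quartiles, median, and IQR
print(quantile(life_950, 0.93))      # 0.8 of the way from the 9th to the 10th point
```

With n = 10 the quartiles land exactly on the 3rd and 8th ordered points, while the median and the 0.93 quantile fall between two points and are interpolated.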

Page 23: Ch3 Elementary Descriptive Statistics

• Boxplots are useful summaries, particularly when there are too many points for a dot plot.

• To make a boxplot, we need essentially 5 numbers.

Page 24: Ch3 Elementary Descriptive Statistics
Page 25: Ch3 Elementary Descriptive Statistics
Page 26: Ch3 Elementary Descriptive Statistics

Section 3.2.3 Q-Q Plots and Comparing Distributional Shapes

• Most of the statistical tools we will use in this class assume normal distributions (a bell shaped distribution for the population of possible values).

• In order to know if these are the right tools for a particular job, we need to be able to assess if the data appear to have come from a normal population.

Page 27: Ch3 Elementary Descriptive Statistics

• With large amounts of data, one can draw a histogram of the measured values and see if it is bell-shaped.

• A normal plot is a method for assessing normality that works well with big or small data sets. It gives a good visual check for normality.

Page 28: Ch3 Elementary Descriptive Statistics

Simulation: 100 observations, normal with mean=5, st dev=1

• x <- rnorm(100, mean=5, sd=1)

• qqnorm(x)

[Figure: normal Q-Q plot of x (about 2 to 8) against Quantiles of Standard Normal (-2 to 2).]

Page 29: Ch3 Elementary Descriptive Statistics

• A normal plot is a plot of the data in a way such that data from normal populations will come out pretty much in a straight line.

• We plot the corresponding quantiles of a “standard normal” distribution versus the ordered y values.

Page 30: Ch3 Elementary Descriptive Statistics

In other words

In order to plot the data and check for normality, we compare

• our observed data to

• what we would expect from a sample of standard normal data.

Page 31: Ch3 Elementary Descriptive Statistics

A standard normal distribution is a normal distribution with mean 0 and standard deviation 1.

Any normal population can be thought of as a rescaled standard normal population. For example, if Z is standard normal, then

100 + 5Z will have mean μ = 100 and standard deviation σ = 5.

Multiplying all values by 5 multiplies the standard deviation by 5. Adding 100 to every number adds 100 to the mean.

Page 32: Ch3 Elementary Descriptive Statistics

• So if we plot ordered values from a normal population against corresponding quantiles of a standard normal population, we expect to get a reasonably straight line, since any normal distribution is linearly related to the standard normal distribution.
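The linear relation can be checked numerically. A minimal sketch using Python's standard library `statistics.NormalDist`; the choice of 100 + 5Z follows the example in these slides, and the particular probabilities are illustrative.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal
x = z * 5 + 100                # rescaled population: mean 100, standard deviation 5

for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    zq = z.inv_cdf(p)          # standard normal quantile
    xq = x.inv_cdf(p)          # same quantile of the rescaled population
    # The rescaled quantile is the same linear function of zq for every p.
    print(p, round(zq, 3), round(xq, 3), round(100 + 5 * zq, 3))
```

Plotting xq against zq would give an exact straight line with intercept 100 and slope 5; a sample from that population scatters around such a line, which is what a normal plot looks for.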

Page 33: Ch3 Elementary Descriptive Statistics

With Excel, normal quantiles can be found with the NORMINV function. NORMDIST finds probabilities given a particular value. NORMINV is the inverse function: it finds the value with a given probability of being less than it. For example, a cell is assigned the formula =NORMINV(C3, 0, 1). The 0, 1 indicates μ = 0 and σ = 1, i.e. a standard normal quantile.
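For readers without Excel, the same standard normal quantiles can be computed in Python. This sketch assumes `statistics.NormalDist.inv_cdf` as the counterpart of NORMINV and uses the cumulative probabilities (i-0.5)/n with n = 10, as in the spring lifetime table in these slides.

```python
from statistics import NormalDist

n = 10
standard_normal = NormalDist(mu=0, sigma=1)

# Cumulative probability (i - 0.5)/n for each ordered point i = 1..n.
probabilities = [(i - 0.5) / n for i in range(1, n + 1)]

# Standard normal quantile for each probability, rounded to 3 decimals.
quantiles = [round(standard_normal.inv_cdf(p), 3) for p in probabilities]
print(quantiles)
```

The quantiles depend only on n, not on the data, so any two samples of the same size share the same set of normal quantiles.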

Page 34: Ch3 Elementary Descriptive Statistics

The textbook plots

• the standard normal quantiles on the vertical axis and

• the ordered data points on the horizontal axis.

Many software packages and other books plot the standard normal quantiles on the horizontal axis and the ordered data points on the vertical axis.

Either way, the plot should look “fairly” straight if the data are from a normal distribution.

Page 35: Ch3 Elementary Descriptive Statistics

Here are ordered lifetimes of springs under 2 levels of stress (page 379), with n = 10.

 i   (i-0.5)/n   Normal Quantile   950 stress Lifetime   900 stress Lifetime
 1   0.05        -1.645            117                   153
 2   0.15        -1.036            135                   162
 3   0.25        -0.674            135                   189
 4   0.35        -0.385            162                   216
 5   0.45        -0.126            162                   216
 6   0.55         0.126            171                   216
 7   0.65         0.385            189                   225
 8   0.75         0.674            189                   225
 9   0.85         1.036            198                   243
10   0.95         1.645            225                   306

Since n=10 for both sets the corresponding normal quantiles are the same for both sets.

Page 36: Ch3 Elementary Descriptive Statistics

To construct normal plots for these two data sets, we plot each ordered data set versus the standard normal quantiles from Excel.

[Figure: normal plots of Life-length (0 to 350) versus Normal Quantiles (-2 to 2) for 950 stress and 900 stress.]

Since both plots are fairly straight, these data are fairly normal.

Page 37: Ch3 Elementary Descriptive Statistics

n = 10

 i   (i-0.5)/n   Normal Quantile   E(Z)     Ordered 900 stress   Ordered 950 stress
 1   0.05        -1.645            -1.539   153                  117
 2   0.15        -1.036            -1.001   162                  135
 3   0.25        -0.674            -0.656   189                  135
 4   0.35        -0.385            -0.376   216                  162
 5   0.45        -0.126            -0.123   216                  162
 6   0.55         0.126             0.123   216                  171
 7   0.65         0.385             0.376   225                  189
 8   0.75         0.674             0.656   225                  189
 9   0.85         1.036             1.001   243                  198
10   0.95         1.645             1.539   306                  225

Excel File of Lifetime of Springs Data

Page 38: Ch3 Elementary Descriptive Statistics

Section 3.3: Numerical Summaries

Measures of Location: Around what value are the data spread?

Median = Q(0.50) = 50th percentile.

Sample mean = arithmetic mean = average:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The mean is more affected by unusual values than the median.
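A quick way to see how differently the mean and the median react to an unusual value is to corrupt one observation. In this sketch the data are the 900 stress spring lifetimes from Section 3.2.3; the recording error (306 entered as 3060) is hypothetical and introduced only for illustration.

```python
from statistics import mean, median

# Ordered lifetimes of springs at 900 stress, from Section 3.2.3.
life_900 = [153, 162, 189, 216, 216, 216, 225, 225, 243, 306]

# Hypothetical data-entry error: the largest value 306 recorded as 3060.
with_error = life_900[:-1] + [3060]

print(mean(life_900), median(life_900))
print(mean(with_error), median(with_error))
```

The single bad value more than doubles the mean, while the median does not move at all, since it depends only on the middle of the ordered data.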

Page 39: Ch3 Elementary Descriptive Statistics

Measures of Spread:

• R = Range = Biggest – Smallest

• The size of the range can be affected by how many values we have. Many numbers will tend to have a larger range than fewer numbers.

• IQR = Interquartile Range = Q(0.75) – Q(0.25)

The range that includes the middle half of the values.

Page 40: Ch3 Elementary Descriptive Statistics

• Sample variance = $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$

Essentially an average squared deviation from the mean.

• Sample standard deviation = $s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$

Page 41: Ch3 Elementary Descriptive Statistics

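Example: $x_1 = 8$, $x_2 = 9$, $x_3 = 4$

$$\bar{x} = \frac{8 + 9 + 4}{3} = 7$$

$$s^2 = \frac{(8-7)^2 + (9-7)^2 + (4-7)^2}{3-1} = \frac{1 + 4 + 9}{2} = 7$$

$$s = \sqrt{7} \approx 2.65$$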
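The same three-value example can be checked in Python. This sketch computes the statistics from the definitions and then compares them with the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n-1.

```python
from statistics import mean, stdev, variance

x = [8, 9, 4]

x_bar = mean(x)
# Sum of squared deviations from the sample mean, divided by n - 1.
s2 = sum((xi - x_bar) ** 2 for xi in x) / (len(x) - 1)
s = s2 ** 0.5
print(x_bar, s2, round(s, 2))

# Library versions of the sample variance and sample standard deviation.
print(variance(x), round(stdev(x), 2))
```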

Page 42: Ch3 Elementary Descriptive Statistics

Statistics and Parameters

A statistic is a numerical summary of the sample data.

$\bar{x}$ = sample mean

$s^2$ = sample variance

Page 43: Ch3 Elementary Descriptive Statistics

A parameter is a summary of an entire population or a theoretical distribution, for example a normal distribution.

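$\mu$ = population mean

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

$\sigma^2$ = population variance

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

Average squared deviation from the mean.

$\sigma$ = population standard deviation $= \sqrt{\sigma^2}$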

Page 44: Ch3 Elementary Descriptive Statistics

• For a sample of size n, the sample variance is

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

• Why divide by n-1? This makes $s^2$ an unbiased estimator of $\sigma^2$. Unbiased means correct on the average.

Page 45: Ch3 Elementary Descriptive Statistics

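Suppose we have a large population of ball bearings with diameters $\mu$ = 1 cm and $\sigma$ = 0.02 ($\sigma^2$ = 0.0004).

Sample   $\bar{x}$    $s^2$
1        0.98     0.00032
2        1.03     0.00031
3        1.01     0.00045
4        1.02     0.00052
.        .        .
.        .        .
∞        ------   --------
Mean     1.00     0.0004

If we knew $\mu$ we would find

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$$

Fact:

$$\sum (x_i - \bar{x})^2 = \min_m \sum (x_i - m)^2$$

So $\sum (x_i - \bar{x})^2 \le \sum (x_i - \mu)^2$, and $\frac{1}{n}\sum (x_i - \bar{x})^2$ would be too small for $\sigma^2$.

Dividing by n-1 makes $s^2$ come out right ($\sigma^2$) on average.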
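The effect of the divisor can be seen in a small simulation. This is a sketch under assumptions that are not in the slides: the population is taken to be normal, each sample has n = 5 bearings, and 20,000 samples are drawn with a fixed seed. Only μ = 1 and σ = 0.02 come from the example.

```python
import random

random.seed(1)            # fixed seed so the sketch is repeatable
mu, sigma = 1.0, 0.02     # population mean (cm) and standard deviation from the example
n, n_samples = 5, 20000   # assumed sample size and number of simulated samples

sum_div_n = 0.0
sum_div_n_minus_1 = 0.0
for _ in range(n_samples):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    # Squared deviations about the sample mean, not the population mean.
    ss = sum((xi - x_bar) ** 2 for xi in sample)
    sum_div_n += ss / n
    sum_div_n_minus_1 += ss / (n - 1)

avg_div_n = sum_div_n / n_samples
avg_div_n_minus_1 = sum_div_n_minus_1 / n_samples
print(avg_div_n, avg_div_n_minus_1)
```

Averaged over many samples, dividing by n-1 should come out close to σ² = 0.0004, while dividing by n should come out close to (n-1)/n times that, about 0.00032 for n = 5.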

Page 46: Ch3 Elementary Descriptive Statistics

Notice that s2 is undefined if n=1; we can't divide by zero.

This makes sense.

If we have only one number, that number tells us nothing about potential spread in the population.

Page 47: Ch3 Elementary Descriptive Statistics

Plotting summary statistics over time is useful for issues such as quality control.

Read section 3.3.4 for general information.

Page 48: Ch3 Elementary Descriptive Statistics