+ Chapter 3: Describing Data Numerically Lecture PowerPoint Slides Discovering Statistics 2nd Edition Daniel T. Larose

Upload: brice-turner

Post on 26-Dec-2015

Chapter 3: Describing Data Numerically

Lecture PowerPoint Slides

Discovering Statistics

2nd Edition Daniel T. Larose

+ Chapter 3 Overview

3.1 Measures of Center

3.2 Measures of Variability

3.3 Working with Grouped Data

3.4 Measures of Position and Outliers

3.5 The Five-Number Summary and Boxplots

3.6 Chebyshev’s Rule and the Empirical Rule

+ The Big Picture

Where we are coming from and where we are headed…

Chapter 2 showed us graphical and tabular summaries of data.

In Chapter 3, we “crunch the numbers,” that is, develop numerical summaries of data. We examine measures of center, measures of variability, measures of position, and many other numerical summaries of data.

In Chapter 4, we will learn how to summarize the relationship between two quantitative variables.

+ 3.1: Measures of Center

Objectives:

Calculate the mean for a given data set.

Find the median, and describe why the median is sometimes preferable to the mean.

Find the mode of a data set.

Describe how skewness and symmetry affect these measures of center.

The Mean

The most well-known and widely used measure of center is the mean. In everyday usage, the word average is often used for mean.

To find the mean of the values in a data set, simply add up all the numbers and divide by how many numbers you have.

Notation:
•The sample size (how many observations are in the data set) is always denoted by n.
•The ith data value is denoted by x_i, where i is an index or counter indicating which data point we are specifying.
•The notation for “add them together” is Σ (capital sigma), the Greek letter “S,” because it stands for “Summation.”
•The sample mean is called $\bar{x}$ (pronounced “x-bar”).

The sample mean can be written as $\bar{x} = \frac{\sum x}{n}$. In plain English, this just means that, in order to find the mean, we
1. Add up all the data values, giving us Σx
2. Divide by how many observations are in the data set, giving us $\bar{x} = \frac{\sum x}{n}$

The Population Mean

The mean value of the population is usually unknown. We denote the population mean with µ (mu), which is the Greek letter “m.” The population size is denoted by N.

When all the values of the population are known, the population mean is calculated as $\mu = \frac{\sum x}{N}$.

We can use the sample mean as an estimate of µ. Note, however, different samples may yield different sample means.

One drawback to using the mean to measure the center of the data is that the mean is sensitive to the presence of extreme values in the data set.

The Median

The median of a data set is the middle data value when the data are put into ascending order. Half of the data values lie below the median, and half lie above.
•If the sample size n is odd, then the median is the middle value.
•If the sample size n is even, then the median is the mean of the two middle data values.

Unlike the mean, the median is not sensitive to extreme values.

The Mode

A third measure of center is called the mode. In a data set, the mode is the value that occurs the most.

The mode of a data set is the data value that occurs with the greatest frequency.

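Rank  Person            Followers (millions)
1     Lady Gaga         6.6
2     Britney Spears    6.1
3     Ashton Kutcher    5.9
4     Justin Bieber     5.6
5     Ellen DeGeneres   5.3
6     Kim Kardashian    5.0
7     Taylor Swift      4.4
8     Oprah Winfrey     4.4
9     Katy Perry        4.2
10    John Mayer        3.7

Sample Mean: the ten values sum to 51.2, so the sample mean is 51.2 / 10 = 5.12 million followers.

Median: with n = 10 (even), the median is the mean of the two middle values in the ordered data, (5.0 + 5.3) / 2 = 5.15 million followers.

Mode: Two people have 4.4 million followers, and no other value repeats, so 4.4 million is the mode.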
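The three measures of center for the followers table can be checked with a short Python sketch. It assumes only the standard-library `statistics` module; the variable names are illustrative and not part of the slides.

```python
from statistics import mean, median, multimode

# Followers (millions) for the ten accounts listed in the table.
followers = [6.6, 6.1, 5.9, 5.6, 5.3, 5.0, 4.4, 4.4, 4.2, 3.7]

sample_mean = mean(followers)      # sum of the values divided by n
sample_median = median(followers)  # n is even: mean of the two middle values
modes = multimode(followers)       # every value tied for the highest frequency

print(round(sample_mean, 2), round(sample_median, 2), modes)
```

`multimode` is used rather than `mode` so that a data set with several equally frequent values would report all of them.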

Skewness and Measures of Center

The skewness of a distribution can often tell us something about the relative values of the mean, median, and mode.

How Skewness Affects the Mean and Median
•For a right-skewed distribution, the mean is larger than the median.
•For a left-skewed distribution, the median is larger than the mean.
•For a symmetric unimodal distribution, the mean, median, and mode are fairly close to one another.
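A small illustration of the first two bullets, using a made-up data set (the `incomes` values are hypothetical and chosen only to produce a long right tail):

```python
from statistics import mean, median

# Hypothetical right-skewed sample: most values are small, one is very large.
incomes = [30, 32, 35, 38, 40, 45, 250]

# The single extreme value pulls the mean toward the right tail.
right_skew_mean_larger = mean(incomes) > median(incomes)

# Mirroring the data gives a left-skewed sample, and the ordering reverses.
mirrored = [-x for x in incomes]
left_skew_median_larger = median(mirrored) > mean(mirrored)

print(right_skew_mean_larger, left_skew_median_larger)
```

The same comparison also shows why the median is sometimes preferred: one extreme value moves the mean a long way but leaves the median unchanged.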

+ 3.2: Measures of Variability

Objectives:

Understand and calculate the range of a data set.

Explain in my own words what a deviation is.

Calculate the variance and the standard deviation for a population or a sample.

The Range

Section 3.1 introduced ways to find the center of a data set. Two data sets can have exactly the same mean, median, and mode and yet be quite different. We need measures that summarize the variation, or variability, of the data.

Women’s Volleyball Team Heights (in)

Western Massachusetts Univ   Northern Connecticut Univ
60                           66
70                           67
70                           70
70                           70
75                           72

There are a variety of ways to measure how spread out a data set is. The simplest measure is the range.

The range of a data set is the difference between the largest value and the smallest value in the data set:

range = largest value – smallest value

range_WMU = 75 – 60 = 15 inches
range_NCU = 72 – 66 = 6 inches
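The volleyball example can be reproduced in a few lines of Python. This is only a sketch: `data_range` is an illustrative helper name, and the heights are the ones given in the table.

```python
from statistics import mean, median, mode

wmu = [60, 70, 70, 70, 75]  # Western Massachusetts Univ heights (in)
ncu = [66, 67, 70, 70, 72]  # Northern Connecticut Univ heights (in)

def data_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)

for name, team in (("WMU", wmu), ("NCU", ncu)):
    # Both teams share the same mean, median, and mode but differ in range.
    print(name, mean(team), median(team), mode(team), data_range(team))
```

The identical centers alongside different ranges are what motivates a separate measure of variability.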

What is Deviation?

The range is simple to calculate, but has its drawbacks. It is quite sensitive to extreme values and it completely ignores all of the values in the data set other than the extremes. The standard deviation quantifies spread with respect to the center and uses all available data values.

Deviation

A deviation for a given data value x is the difference between the data value and the mean of the data set. For a sample, the deviation equals x – x-bar. For a population, the deviation equals x – µ.

•If the data value is larger than the mean, the deviation will be positive.
•If the data value is smaller than the mean, the deviation will be negative.
•If the data value equals the mean, the deviation will be zero.

The deviation can roughly be thought of as the distance between a data value and the mean, except that the deviation can be negative while distance is always positive.
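As a quick illustration, the deviations for the Western Massachusetts heights can be listed directly. This is a sketch with illustrative variable names; it also shows that the deviations sum to zero, which is why the squared deviations are used to measure spread.

```python
heights = [60, 70, 70, 70, 75]  # Western Massachusetts Univ heights (in)

x_bar = sum(heights) / len(heights)        # sample mean
deviations = [x - x_bar for x in heights]  # x minus x-bar for each value

print(x_bar, deviations, sum(deviations))
```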

The Variance and Standard Deviation

To compute the standard deviation and variance, we consider the squared deviations. It is logical to build our measure of spread using the mean squared deviation.

The population variance σ² is the mean of the squared deviations in the population, given by the formula

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

The population standard deviation σ is the positive square root of the population variance and is found by

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

The population standard deviation σ represents a distance from the mean that is representative for that data set.

The Sample Variance and Sample Standard Deviation

In the real world, we use the sample mean and sample standard deviation to estimate the population parameters. The sample variance also depends on the concept of the mean squared deviation. However, we replace the denominator with n – 1 to better estimate the parameter.

The sample variance s² is approximately the mean of the squared deviations in the sample, given by the formula

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

The sample standard deviation s is the positive square root of the sample variance and is found by

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

The value of s may be interpreted as the typical difference between a data value and the sample mean for a given data set.
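The difference between the two denominators can be seen in a short sketch. The function names are illustrative, and the data are the Western Massachusetts heights from Section 3.2, treated once as a full population and once as a sample purely for comparison.

```python
import math

def population_variance(values):
    """Mean of the squared deviations: divide by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of the squared deviations divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

heights = [60, 70, 70, 70, 75]

sigma_squared = population_variance(heights)
s_squared = sample_variance(heights)

# The standard deviation is the positive square root of the variance.
print(sigma_squared, math.sqrt(sigma_squared))
print(s_squared, math.sqrt(s_squared))
```

Dividing by n - 1 always gives a slightly larger value than dividing by N for the same data.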

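Computational Formulas

The following computational formulas simplify the calculations for variance and standard deviation. They are equivalent to the definition formulas.

Computational Formulas for the Variance and Standard Deviation

Population Variance: $\sigma^2 = \frac{\sum x^2 - (\sum x)^2 / N}{N}$

Population Standard Deviation: $\sigma = \sqrt{\frac{\sum x^2 - (\sum x)^2 / N}{N}}$

Sample Variance: $s^2 = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1}$

Sample Standard Deviation: $s = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$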

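Population Example

With N = 8, Σx = 357.3, and Σx² = 20,997.91:

$$\sigma^2 = \frac{\sum x^2 - (\sum x)^2 / N}{N} = \frac{20{,}997.91 - (357.3)^2 / 8}{8} = 629.9998438 \approx 630.0$$

$$\sigma = \sqrt{\sigma^2} = \sqrt{629.9998438} \approx 25.1$$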

The standard deviation of farmland for all counties in Connecticut is almost 25,100 acres.

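Sample Example

Suppose we take a sample of three counties. With n = 3, Σx = 213.6, and Σx² = 15,963.38:

$$s^2 = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1} = \frac{15{,}963.38 - (213.6)^2 / 3}{3 - 1} = 377.53$$

$$s = \sqrt{s^2} = \sqrt{377.53} \approx 19.4$$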

The standard deviation of farmland for this sample of three counties in Connecticut is about 19,400 acres.
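Both farmland examples can be recomputed from their summary sums. The helper below is a sketch with an illustrative name and signature; it takes Σx, Σx², and the count, and switches the denominator between N and n – 1.

```python
import math

def variance_from_sums(sum_x, sum_x2, count, sample=False):
    """Computational formula: (sum of x^2 - (sum of x)^2 / count) / denominator."""
    numerator = sum_x2 - sum_x ** 2 / count
    denominator = count - 1 if sample else count
    return numerator / denominator

# Population example: all 8 counties (values in thousands of acres).
pop_var = variance_from_sums(357.3, 20997.91, 8)

# Sample example: 3 counties, so the denominator is n - 1 = 2.
samp_var = variance_from_sums(213.6, 15963.38, 3, sample=True)

print(round(pop_var, 1), round(math.sqrt(pop_var), 1))
print(round(samp_var, 2), round(math.sqrt(samp_var), 1))
```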

+ 3.3: Working with Grouped Data

Objectives:

Calculate the weighted means.

Estimate the mean for grouped data.

Estimate the variance and standard deviation for grouped data.

The Weighted Mean

Sometimes not all the data values in a data set are of equal importance. Certain data values may be assigned greater weight than others when calculating the mean.

Weighted Mean

To find the weighted mean:

1. Multiply each data point x_i by its respective weight w_i.

2. Sum these products.

3. Divide the result by the sum of the weights:

$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$$
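The three steps translate directly into code. The scores and weights below are hypothetical (a course grade built from three components) and serve only to illustrate the calculation.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    products = [w * x for x, w in zip(values, weights)]  # step 1
    return sum(products) / sum(weights)                  # steps 2 and 3

# Hypothetical example: homework, midterm, and final exam scores,
# weighted 2 : 3 : 5.
scores = [80, 90, 70]
weights = [2, 3, 5]

print(weighted_mean(scores, weights))
```

When all weights are equal, the result reduces to the ordinary mean.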

Estimating the Mean for Grouped Data

Data are often reported using frequency distributions. Without the original data, we cannot calculate the exact values of the measures of center and spread.

For each class in the frequency distribution, we estimate the class mean using the class midpoint. The class midpoint is defined as the mean of two adjoining lower class limits and is denoted m_i.

The product of the class frequency f_i and class midpoint m_i is used as an estimate of the sum of the data values within that class.

Summing these products across all classes and dividing by the total population size N = Σf_i provides us with an estimated mean for data grouped into a frequency distribution: $\hat{\mu} = \frac{\sum m_i f_i}{N}$.

Estimating the Mean for Grouped Data

Calculate the estimated mean age of the adopted children in this table.

Σm_i f_i = (0.5)(12) + (3.5)(611) + (8.5)(320) + (13.5)(161) + (17)(46) = 6 + 2138.5 + 2720 + 2173.5 + 782 = 7820

N = Σf_i = 12 + 611 + 320 + 161 + 46 = 1150

Estimated mean = Σm_i f_i / N = 7820 / 1150 = 6.8 years
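The same calculation as a short sketch, using the class midpoints and frequencies that appear in the sum for the adopted-children example (the function name is illustrative):

```python
def grouped_mean(midpoints, frequencies):
    """Estimate the mean as sum(m_i * f_i) divided by sum(f_i)."""
    total = sum(m * f for m, f in zip(midpoints, frequencies))
    return total / sum(frequencies)

# Class midpoints (ages in years) and class frequencies from the example.
midpoints = [0.5, 3.5, 8.5, 13.5, 17.0]
frequencies = [12, 611, 320, 161, 46]

estimated_mean = grouped_mean(midpoints, frequencies)
print(sum(frequencies), round(estimated_mean, 1))
```

The result is an estimate because every value in a class is treated as if it sat at the class midpoint.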

Estimating the Variance and Standard Deviation for Grouped Data

We also use class midpoints and class frequencies to calculate the estimated variance and the estimated standard deviation for data grouped into a frequency distribution.

Estimated Variance and Standard Deviation for Data Grouped into a Frequency Distribution

Given a frequency distribution with k classes, the estimated variance for the variable is given by

$$\hat{\sigma}^2 = \frac{\sum (m_i - \hat{\mu})^2 f_i}{N}$$

and the estimated standard deviation is given by

$$\hat{\sigma} = \sqrt{\hat{\sigma}^2}$$

where m_i and f_i are the class midpoints and frequencies, $\hat{\mu}$ is the estimated mean, and N = Σf_i.
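A sketch of the grouped-data variance, applied to the adopted-children frequencies from the previous example. It assumes the population-style form in which each squared deviation of a class midpoint from the estimated mean is weighted by the class frequency and the total is divided by N; the function name is illustrative.

```python
import math

def grouped_variance(midpoints, frequencies):
    """Estimated variance: sum((m_i - mean)^2 * f_i) / N, with N = sum(f_i)."""
    n_total = sum(frequencies)
    est_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n_total
    weighted_sq = sum((m - est_mean) ** 2 * f
                      for m, f in zip(midpoints, frequencies))
    return weighted_sq / n_total

# Class midpoints (ages in years) and frequencies from the adoption example.
midpoints = [0.5, 3.5, 8.5, 13.5, 17.0]
frequencies = [12, 611, 320, 161, 46]

est_variance = grouped_variance(midpoints, frequencies)
est_std_dev = math.sqrt(est_variance)
print(round(est_variance, 2), round(est_std_dev, 2))
```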

+ 3.4: Measures of Position and Outliers

Objectives:

Calculate z-scores and explain why we use them.

Detect outliers using the z-score method.

Find percentiles and percentile ranks for both small and large data sets.

Compute quartiles and the interquartile range.

z-Scores

Our first measure of position is the z-score. The z-score indicates how many standard deviations a particular data value is from the mean.

z-Score

The z-score for a particular data value from a sample is

$$z = \frac{x - \bar{x}}{s}$$

The z-score for a particular data value from a population is

$$z = \frac{x - \mu}{\sigma}$$

z-Scores

Suppose the mean score on the Math SAT is µ = 500, with a standard deviation of σ = 100 points.

Jasmine’s Math SAT score is 650. What is her z-score?

Jasmine’s z-score is (650 – 500) / 100 = 1.5, so her score lies 1.5 standard deviations above the mean.

z-Scores

In some cases, we may be given a z-score and asked to find its associated data value x.

Given a z-score, to find its associated value x:

For a sample: $x = z \cdot s + \bar{x}$

For a population: $x = z \cdot \sigma + \mu$

where µ is the population mean, x-bar is the sample mean, σ is the population standard deviation, and s is the sample standard deviation.

z-scores can also be used to compare data from different data sets. That is, relative positions can be compared even when the means and standard deviations of the data sets are different.
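Both directions of the z-score calculation fit in a short sketch. The function names are illustrative; the same two functions serve a sample (x-bar and s) or a population (µ and σ), since only the inputs differ. The numbers are those of the Math SAT example.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

def value_from_z(z, mean, std_dev):
    """Invert the z-score: data value located z standard deviations from the mean."""
    return z * std_dev + mean

# Math SAT example: mean 500, standard deviation 100, score 650.
jasmine_z = z_score(650, 500, 100)
recovered_score = value_from_z(jasmine_z, 500, 100)

print(jasmine_z, recovered_score)
```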

Detecting Outliers with z-Scores

An outlier is an extremely large or extremely small data value relative to the rest of the data set. It may represent a data entry error, or it may be genuine data.

Guidelines for Identifying Outliers

1. A data value whose z-score lies in the following range is not considered to be unusual:
-2 < z-score < 2

2. A data value whose z-score lies in the following range may be considered moderately unusual:
-3 < z-score ≤ -2 or 2 ≤ z-score < 3

3. A data value whose z-score lies in the following range may be considered an outlier:
z-score ≤ -3 or z-score ≥ 3
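The three guidelines can be written as a small classifier. This is a sketch; the function name and the returned labels are illustrative, while the cutoffs are the ones stated in the guidelines.

```python
def classify_z(z):
    """Label a z-score using the 2 and 3 standard deviation cutoffs."""
    size = abs(z)
    if size < 2:
        return "not unusual"
    if size < 3:
        return "moderately unusual"
    return "outlier"

for z in (1.5, -2.0, 2.9, -3.0, 4.2):
    print(z, classify_z(z))
```

Working with the absolute value handles the negative and positive ranges together, and the boundary values 2 and 3 fall in the more extreme category, as in the guidelines.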

Percentiles and Percentile Ranks

The next measure of position we consider is the percentile, which shows the location of a data value relative to the other values in the data set.

Percentile

Let p be any integer between 0 and 100. The pth percentile of a data set is the data value at which p percent of the values in the data set are less than or equal to the value.

Percentile Rank

The percentile rank of a data value x equals the percentage of values in the data set that are less than or equal to x. In other words:

$$\text{percentile rank of } x = \frac{\text{number of values} \le x}{\text{total number of values}} \cdot 100\%$$
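The percentile rank definition amounts to a count and a division. The sketch below uses an illustrative function name and, as sample data, the 12 dance scores that appear in the quartiles example later in this section.

```python
def percentile_rank(data, x):
    """Percentage of values in data that are less than or equal to x."""
    at_or_below = sum(1 for value in data if value <= x)
    return 100 * at_or_below / len(data)

dance_scores = [30, 44, 56, 62, 65, 68, 75, 78, 81, 85, 89, 94]

# 8 of the 12 scores are less than or equal to 78.
rank_of_78 = percentile_rank(dance_scores, 78)
print(round(rank_of_78, 1))
```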

Quartiles

Just as the median divides the data set into halves, the quartiles are the percentiles that divide the data set into quarters.

Quartiles

The quartiles of a data set divide the data set into four parts, each containing 25% of the data.

•The first quartile (Q1) is the 25th percentile.

•The second quartile (Q2) is the 50th percentile.

•The third quartile (Q3) is the 75th percentile.

For small data sets, the division may be into four parts of only approximately equal size.

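Quartiles

Find the quartiles of the dance scores of the 12 students on page 129:

First, arrange them in order from smallest to largest:
30 44 56 62 65 68 75 78 81 85 89 94

With n = 12, each quartile falls between two data values, so we average them: Q1 = (56 + 62)/2 = 59, Q2 = (68 + 75)/2 = 71.5, and Q3 = (81 + 85)/2 = 83.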
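One way to compute the quartiles of the dance scores is sketched below. It assumes a position rule that the slides do not spell out: take the position (p/100)·n in the ordered data; if it is a whole number, average that value and the next one, otherwise round the position up. Other conventions exist (for example, the defaults in statistical software often interpolate) and can give slightly different quartiles for small data sets.

```python
def percentile(data, p):
    """pth percentile (integer p, 1 to 99) using the assumed position rule."""
    if not 1 <= p <= 99:
        raise ValueError("p must be an integer from 1 to 99")
    values = sorted(data)
    position, remainder = divmod(p * len(values), 100)
    if remainder == 0:
        # Whole-number position: average that value and the next one.
        return (values[position - 1] + values[position]) / 2
    # Otherwise round the position up (index is zero-based).
    return values[position]

dance_scores = [30, 44, 56, 62, 65, 68, 75, 78, 81, 85, 89, 94]

q1, q2, q3 = (percentile(dance_scores, p) for p in (25, 50, 75))
print(q1, q2, q3, q3 - q1)
```

Under this rule the quartiles agree with the interquartile range of 83 – 59 = 24 used for these scores.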

Interquartile Range

The variance and standard deviation are measures of spread that are sensitive to the presence of extreme values. A more robust (less sensitive) measure of variability is the interquartile range.

Interquartile Range

The interquartile range (IQR) is a robust measure of variability. It is calculated as: IQR = Q3 – Q1.

The interquartile range is interpreted to be the spread of the middle 50% of the data.

IQR = 83 – 59 = 24

+ 3.5: Five-Number Summary and Boxplots

Objectives:

Calculate the five-number summary of a data set.

Construct and interpret a boxplot for a given data set.

Detect outliers using the IQR method.

The Five-Number Summary

One robust (or resistant) method of summarizing data that is used widely is called the five-number summary. The set consists of five measures we have already seen.

The five-number summary consists of the following set of statistics, which together constitute a robust summarization of a data set:

1. Minimum, the smallest value in the data set
2. First quartile, Q1
3. Median, Q2
4. Third quartile, Q3
5. Maximum, the largest value in the data set

For the dance scores: Min = 30, Q1 = 59, Median = 71.5, Q3 = 83, Max = 94

The Boxplot

The boxplot is a convenient graphical display of the five-number summary of a data set.

Constructing a Boxplot by Hand

1. Determine the lower and upper fences:
Lower fence = Q1 – 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)

2. Draw a horizontal number line that encompasses the range of your data, including the fences. Draw vertical lines at Q1, the median, and Q3. Connect the lines for Q1 and Q3 to form a box.

3. Temporarily indicate the fences with brackets [ and ].

4. Draw a horizontal line from Q1 to the smallest value greater than the lower fence. Draw a horizontal line from Q3 to the largest value smaller than the upper fence.

5. Indicate any data values smaller than the lower fence or larger than the upper fence using an asterisk *.

The Boxplot

Min = 30, Max = 94

IQR = 83 – 59 = 24

Lower fence = 59 – 1.5(24) = 23

Upper fence = 83 + 1.5(24) = 119

Boxplots for Skewed Data

Detecting Outliers with the IQR

The mean and standard deviation are sensitive to outliers. We can use a more robust method of detecting outliers by using the IQR.

IQR Method to Detect Outliers

A data value is an outlier if
a. it is located 1.5(IQR) or more below Q1, or
b. it is located 1.5(IQR) or more above Q3.
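The IQR method can be sketched as follows, using the dance scores with Q1 = 59 and Q3 = 83. The function name is illustrative, the quartiles are passed in rather than computed, and the extra score of 120 in the second call is hypothetical, added only to show a value being flagged.

```python
def iqr_outliers(data, q1, q3):
    """Return values located 1.5 * IQR or more below Q1 or above Q3."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x <= lower_fence or x >= upper_fence]

dance_scores = [30, 44, 56, 62, 65, 68, 75, 78, 81, 85, 89, 94]

# Fences are 59 - 36 = 23 and 83 + 36 = 119, so no score is flagged.
print(iqr_outliers(dance_scores, 59, 83))

# Hypothetical extra score above the upper fence.
print(iqr_outliers(dance_scores + [120], 59, 83))
```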

+ 3.6: Chebyshev’s Rule and the Empirical Rule

Objectives:

Calculate percentages using Chebyshev’s Rule.

Find percentages and data values using the Empirical Rule.

Chebyshev’s Rule

P.L. Chebyshev derived a result, called Chebyshev’s Rule, which can be applied to any continuous data set whatsoever.

Chebyshev’s Rule

The proportion of values from a data set that will fall within k standard deviations of the mean will be at least

$$\left(1 - \frac{1}{k^2}\right) \cdot 100\%$$

where k > 1. Chebyshev’s Rule may be applied to either samples or populations. For example:
• k = 2. At least 3/4 (or 75%) of the data values will fall within 2 standard deviations of the mean.
• k = 3. At least 8/9 (or 88.89%) of the data values will fall within 3 standard deviations of the mean.
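The bound is a one-line calculation. The sketch below (illustrative function name) reproduces the k = 2 and k = 3 examples and rejects values of k for which the rule does not apply.

```python
def chebyshev_bound(k):
    """Minimum percentage of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule requires k > 1")
    return (1 - 1 / k ** 2) * 100

for k in (2, 3):
    print(k, round(chebyshev_bound(k), 2))
```

Non-integer values such as k = 1.5 are allowed as long as k is greater than 1.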

The Empirical Rule

When the data distribution is bell-shaped, the Empirical Rule outperforms Chebyshev.

The Empirical Rule

If the data distribution is bell-shaped:
•About 68% of the data values will fall within 1 standard deviation of the mean.
•About 95% of the data values will fall within 2 standard deviations of the mean.
•About 99.7% of the data values will fall within 3 standard deviations of the mean.

Stated in terms of z-scores:
•About 68% of the data values will have z-scores between -1 and 1.
•About 95% of the data values will have z-scores between -2 and 2.
•About 99.7% of the data values will have z-scores between -3 and 3.
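As an illustration, the Empirical Rule intervals can be computed for the Math SAT figures used earlier (µ = 500, σ = 100). This sketch assumes, for the sake of the example, that the scores are bell-shaped; the function name is illustrative.

```python
# Approximate percentages from the Empirical Rule, keyed by k.
EMPIRICAL_PERCENTAGES = {1: 68.0, 2: 95.0, 3: 99.7}

def empirical_intervals(mean, std_dev):
    """Map k to (lower, upper, approximate percent of data inside)."""
    return {
        k: (mean - k * std_dev, mean + k * std_dev, pct)
        for k, pct in EMPIRICAL_PERCENTAGES.items()
    }

sat_intervals = empirical_intervals(500, 100)
for k, (low, high, pct) in sat_intervals.items():
    print(f"about {pct}% of scores between {low} and {high}")
```

For comparison, Chebyshev’s Rule guarantees only at least 75% within 2 standard deviations, while the Empirical Rule gives about 95% when the bell-shape assumption holds.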
