

Chapter 3 Numerically Summarizing Data

Chapter 3.1 Measures of Central Tendency

Objective A: Mean, Median and Mode

Three measures of central tendency: the mean, the median, and the mode.

A1. Mean

The mean of a variable is the sum of all data values divided by the number of observations.

Population mean: $\mu = \frac{\sum x_i}{N}$, where $x_i$ is each data value and $N$ is the population size (the number of observations in the population).

Sample mean: $\bar{x} = \frac{\sum x_i}{n}$, where $x_i$ is each data value and $n$ is the sample size (the number of observations in the sample).

Example 1: Population: 12 16 23 17 32 27 14 16

Compute the population mean and the sample mean of a simple random sample of size 4.

Is the sample mean equal to the population mean? Does the population mean or the sample mean stay the same? Explain.

(a) Population mean: (Round the mean to one more decimal place than the raw data.)

$\mu = \frac{\sum x_i}{N}$, with $N = 8$

$\mu = \frac{12+16+23+17+32+27+14+16}{8} = \frac{157}{8} = 19.625 \approx 19.6$

(b) Sample mean:

Using a lottery method, the values 23, 16, 14, and 17 were selected.

$\bar{x} = \frac{\sum x_i}{n}$, with $n = 4$

$\bar{x} = \frac{23+16+14+17}{4} = \frac{70}{4} = 17.5$

(c) Is the sample mean equal to the population mean?

No. The sample mean is 17.5, while the population mean is about 19.6.

(d) Does the population mean or the sample mean stay the same? Explain.

$\mu$ stays the same because it is computed from the entire population.

$\bar{x}$ varies from sample to sample.
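The contrast between a fixed population mean and a varying sample mean can be checked with a short Python sketch. It uses the population and the lottery sample from Example 1; the second sample, `other_sample`, is a hypothetical draw added only for illustration.

```python
from statistics import mean

population = [12, 16, 23, 17, 32, 27, 14, 16]
sample = [23, 16, 14, 17]          # the lottery sample from part (b)

mu = mean(population)              # population mean: fixed for this population
x_bar = mean(sample)               # sample mean: depends on the values drawn

# A different (hypothetical) sample of size 4 gives a different sample mean.
other_sample = [12, 32, 27, 16]
other_x_bar = mean(other_sample)

print(round(mu, 1), x_bar, other_x_bar)
```

Both samples come from the same population, yet their means differ from each other and from $\mu$.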

A2. Median

The median, M, is the value that lies in the middle of the data when arranged in ascending

order.

Example 1: Find the median of the data given below.

4 12 32 24 9 18 28 10 36

Reorder: 4 9 10 12 18 24 28 32 36

n = 9 = odd

The location of the median is at the $\frac{n+1}{2} = \frac{9+1}{2} = 5$th position.

The median is 18.

Example 2: Find the median of the data given below.

$35.34 $42.09 $38.72 $43.28 $39.45 $49.36 $30.15 $40.88

Reorder: $30.15 $35.34 $38.72 $39.45 $40.88 $42.09 $43.28 $49.36

n = 8 = even

The location of the median is between the n/2 and (n/2) + 1 positions, that is, between the 8/2 = 4th and (8/2) + 1 = 5th positions.

The median = ($39.45 + $40.88) / 2 = $80.33 / 2 = $40.165
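The odd and even cases from the two median examples can be written as one small function. This is a minimal sketch, and the function name `median_by_position` is chosen here for illustration; Python's standard library offers `statistics.median` for the same purpose.

```python
def median_by_position(values):
    """Median found by sorting and locating the middle position(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1) / 2-th value, which is index n // 2 when counting from 0.
        return ordered[mid]
    # Even n: average the n/2-th and (n/2 + 1)-th values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [4, 12, 32, 24, 9, 18, 28, 10, 36]
even_data = [35.34, 42.09, 38.72, 43.28, 39.45, 49.36, 30.15, 40.88]

print(median_by_position(odd_data))
print(round(median_by_position(even_data), 3))
```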


A3. Mode

The mode is the most frequent observation in the data set. A data set can have one mode, more than one mode, or no mode (when no value repeats).

Example 1: Find the mode of the data given below. 76 60 81 72 60 80 68 73 80 67

Reorder: 60 60 67 68 72 73 76 80 80 81

Mode = 60 and 80

Example 2: Find the mode of the data given below.

A C D C B C A B B F B W F D B W D A D C D

Reorder: A A A B B B B B C C C C D D D D D F F W W

Mode = B and D

Example 3: The following data represent the G.P.A. of 12 students.

2.56 3.21 3.88 2.44 1.96 2.85 2.32 3.38 1.86 3.04 2.75 2.23

Find the mean, median, and mode G.P.A.

Reorder: 1.86 1.96 2.23 2.32 2.44 2.56 2.75 2.85 3.04

3.21 3.38 3.88

(a) mean

$\bar{x} = \frac{\sum x_i}{n}$, with $n = 12$

$\bar{x} = \frac{1.86+1.96+2.23+2.32+2.44+2.56+2.75+2.85+3.04+3.21+3.38+3.88}{12} = \frac{32.48}{12} \approx 2.7067 \approx 2.707$

(b) median

Reorder: 1.86 1.96 2.23 2.32 2.44 2.56 2.75 2.85 3.04

3.21 3.38 3.88

median = (2.56+2.75)/2 = 2.655

(c) mode

Reorder: 1.86 1.96 2.23 2.32 2.44 2.56 2.75 2.85 3.04

3.21 3.38 3.88

mode = None
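The three mode examples can be reproduced by counting frequencies. The helper below is a sketch: the name `modes` and the convention of returning an empty list when no value repeats are choices made here to match the notes' "mode = None", and the function assumes a non-empty data set.

```python
from collections import Counter
from statistics import mean, median

def modes(values):
    """Return every value tied for the highest frequency, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

scores = [76, 60, 81, 72, 60, 80, 68, 73, 80, 67]
grades = list("ACDCBCABBFBWFDBWDADCD")
gpa = [2.56, 3.21, 3.88, 2.44, 1.96, 2.85, 2.32, 3.38, 1.86, 3.04, 2.75, 2.23]

print(modes(scores), modes(grades), modes(gpa))
print(round(mean(gpa), 3), round(median(gpa), 3))
```

The same approach works for qualitative data such as letter grades, which is why the mode is the measure used for that kind of variable.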


Objective B: Relation Between the Mean, Median and Distribution Shape

- The mean is sensitive to extreme data. For continuous data, if the distribution shape

is a bell-shaped curve, the mean is a better measure of central tendency because it

includes all data values in a data set.

- The median is resistant to extreme data. For continuous data, if the distribution

shape is skewed to the right or left, the median is a better measure of central

tendency.

- The mode is used to represent the measure of central tendency for qualitative data.

Definition: A numerical summary of data is said to be resistant if extreme values (very large or

small) relative to the data do not affect its value substantially.
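Resistance can be seen numerically by adding one extreme value to a data set and comparing how far each summary moves. The sketch below reuses the population from Example 1 of Chapter 3.1; the extreme value 250 is hypothetical and was chosen only to make the effect visible.

```python
from statistics import mean, median

data = [12, 16, 23, 17, 32, 27, 14, 16]
with_extreme = data + [250]            # hypothetical extreme observation

mean_shift = mean(with_extreme) - mean(data)
median_shift = median(with_extreme) - median(data)

print(round(mean_shift, 1), median_shift)
```

The mean moves by more than 25 units while the median moves by only half a unit, which is what it means for the median to be resistant and the mean to be sensitive.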

[Figure: Mean or Median versus Skewness]

Chapter 3.2 Measures of Dispersion

A measure of dispersion is a numerical measure that quantifies the spread of data.

In this section, the three numerical measures of dispersion that we will discuss are the range, variance, and standard deviation. In a later section, we will discuss another measure of dispersion called the interquartile range (IQR).

Objective A: Range, Variance and Standard Deviation

A1. Range

Range = R = largest data value - smallest data value

The range is not resistant because it is affected by extreme values in the data set.

A2. Variance and Standard Deviation

Variance is based on the deviations about the mean. Since the sum of the deviations about the mean is zero, we cannot use the average deviation about the mean as a measure of spread. We use the average squared deviation instead.

The population variance, $\sigma^2$, of a variable is the sum of the squared deviations about the population mean, $\mu$, divided by the number of observations in the population, $N$.

Definition Formula: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

Computational Formula: $\sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N}$

The sample variance, $s^2$, of a variable is the sum of the squared deviations about the sample mean, $\bar{x}$, divided by the number of observations in the sample minus 1, $n - 1$.

Definition Formula: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Computational Formula: $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$


In order to use the sample variance to obtain an unbiased estimate of the population variance, we divide the sum of the squared deviations about the sample mean by $n - 1$. We call $n - 1$ the degrees of freedom because the first $n - 1$ observations have freedom to be whatever value they wish, but the $n$th value has no freedom: it must be the value that forces $\sum (x_i - \bar{x})$ to be zero.

The population standard deviation, $\sigma$, is the square root of the population variance: $\sigma = \sqrt{\text{Population Variance}}$

The sample standard deviation, $s$, is the square root of the sample variance: $s = \sqrt{\text{Sample Variance}}$

To avoid round-off error, never use the rounded value of the variance to compute the standard

deviation. Keep a few more decimal places for an intermediate step calculation.

Example 1: Use the definition formula to find the population variance and standard deviation.

Population: 4, 10, 12, 13, 21

Definition formula: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$, where $\mu = \frac{\sum x_i}{N}$

$\mu = \frac{4+10+12+13+21}{5} = \frac{60}{5} = 12$

x_i     x_i - μ           (x_i - μ)²
4       4 - 12 = -8       (-8)² = 64
10      10 - 12 = -2      (-2)² = 4
12      12 - 12 = 0       0² = 0
13      13 - 12 = 1       1² = 1
21      21 - 12 = 9       9² = 81
                          Σ(x_i - μ)² = 150

Population variance: $\sigma^2 = \frac{150}{5} = 30$

Population standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{30} \approx 5.5$
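As a cross-check on Example 1, the definition formula and the computational formula can be evaluated side by side. This is a sketch with variable names chosen here for illustration; `statistics.pvariance` from the Python standard library is used as an independent reference.

```python
from math import sqrt, isclose
from statistics import pvariance

population = [4, 10, 12, 13, 21]
N = len(population)
mu = sum(population) / N

# Definition formula: average squared deviation about the population mean.
var_definition = sum((x - mu) ** 2 for x in population) / N

# Computational formula: uses only the sum of x and the sum of x squared.
sum_x = sum(population)
sum_x_squared = sum(x * x for x in population)
var_computational = (sum_x_squared - sum_x ** 2 / N) / N

sigma = sqrt(var_definition)           # keep the unrounded variance here

print(var_definition, var_computational, round(sigma, 1))
```

Both formulas give the same variance, so the choice between them is a matter of convenience when working by hand.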


Example 2: Use the definition formula to find the sample variance and standard deviation.

Sample: 83, 65, 91, 84

Definition formula: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, where $\bar{x} = \frac{\sum x_i}{n}$

Sample mean: $\bar{x} = \frac{83+65+91+84}{4} = \frac{323}{4} = 80.75 \approx 80.8$

x_i     x_i - x̄                  (x_i - x̄)²
83      83 - 80.75 = 2.25        (2.25)² = 5.0625
65      65 - 80.75 = -15.75      (-15.75)² = 248.0625
91      91 - 80.75 = 10.25       (10.25)² = 105.0625
84      84 - 80.75 = 3.25        (3.25)² = 10.5625
                                 Σ(x_i - x̄)² = 368.75

Sample variance: $s^2 = \frac{368.75}{4 - 1} \approx 122.9167 \approx 122.9$

Sample standard deviation: $s = \sqrt{122.9167} \approx 11.0868 \approx 11.1$

Example 3: Use StatCrunch to find the sample variance and standard deviation.

Sample: 83, 65, 91, 84 (same data set as Example 2)

Step 1:

Click the StatCrunch navigation button under the Course Home page → Click StatCrunch website → Click Open StatCrunch → Input the raw data in the var1 column → Click Stat → Click Summary Stats → Columns

Step 2:

Click var1 under Select column(s): → Under Statistics:, choose Variance and Std. dev. (click them while holding the Ctrl key on the keyboard) → Click Compute!

Variance and standard deviation are computed.

$s^2 \approx 122.9$

$s \approx 11.1$

For more detailed instructions, please download "Q3.2.20" by clicking the StatCrunch Handout navigation button on the course homepage.

Note : For a small data set, students are expected to calculate the standard deviation by

hand.
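The sample variance and standard deviation for the data set 83, 65, 91, 84 can also be cross-checked outside StatCrunch. The sketch below is not part of the course workflow; it uses Python's standard `statistics` module, whose `variance` and `stdev` functions divide by $n - 1$, and repeats the by-hand definition formula for comparison.

```python
from math import isclose
from statistics import mean, variance, stdev

sample = [83, 65, 91, 84]
n = len(sample)
x_bar = mean(sample)

# Definition formula by hand: squared deviations divided by n - 1.
s2_by_hand = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

s2 = variance(sample)                  # sample variance (n - 1 in the denominator)
s = stdev(sample)                      # square root of the unrounded variance

print(round(s2, 1), round(s, 1))
```

Note that `s` is computed from the unrounded variance, in line with the advice above about avoiding round-off error.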


Objective C: Empirical Rule

[Figure: illustration of the Empirical Rule]

Example 1: SAT Math scores have a bell-shaped distribution with a mean of 515 and a

standard deviation of 114. (Source: College Board, 2007)

(a) What percentage of SAT scores is between 401 and 629?

$\mu - 1\sigma = 515 - 114 = 401$ and $\mu + 1\sigma = 515 + 114 = 629$

According to the Empirical Rule, approximately 68% of the data will lie within 1 standard deviation of the mean.

About 68% of SAT scores are between 401 and 629.

(b) What percentage of SAT scores is between 287 and 743?

$\mu - 2\sigma = 515 - 2(114) = 287$ and $\mu + 2\sigma = 515 + 2(114) = 743$

According to the Empirical Rule, approximately 95% of the data will lie within 2 standard deviations of the mean.

About 95% of SAT scores are between 287 and 743.

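(c) What percentage of SAT scores is less than 401 or greater than 629?

About 68% of scores are between 401 and 629 (within 1 standard deviation of the mean), so the percentage outside this interval is 100% - 68% = 32%.

(d) What percentage of SAT scores is between 515 and 743?

About 95% of scores are between 287 and 743 (within 2 standard deviations of the mean). Because a bell-shaped distribution is symmetric about the mean of 515, half of that 95% lies between 515 and 743: 95% ÷ 2 = 47.5%.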

(e) About 99.7% of SAT scores will be between what scores?

According to the Empirical Rule, approximately 99.7% of the data will lie

within 3 standard deviations of the mean.

(𝜇 − 3𝜎, 𝜇 + 3𝜎)

= (515-3(114), 515 + 3(114))

= (173, 857)
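The interval arithmetic in parts (a) through (e) follows one pattern, which the sketch below collects in a single function. The function name `empirical_rule_intervals` is chosen here for illustration, and the percentages are the approximate values stated by the Empirical Rule, which applies only to bell-shaped distributions.

```python
def empirical_rule_intervals(mean, sd):
    """Return {k: (lower, upper, approx_percent)} for k = 1, 2, 3 standard deviations."""
    approx_percent = {1: 68.0, 2: 95.0, 3: 99.7}
    return {
        k: (mean - k * sd, mean + k * sd, pct)
        for k, pct in approx_percent.items()
    }

intervals = empirical_rule_intervals(515, 114)
for k, (lower, upper, pct) in intervals.items():
    print(f"within {k} SD: {lower} to {upper}, about {pct}%")

outside_one_sd = 100 - intervals[1][2]        # part (c): below 401 or above 629
mean_to_plus_two_sd = intervals[2][2] / 2     # part (d): between 515 and 743
```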

Chapter 3.4 Measures of Position and Outliers

Measures of position determine the relative position of a certain data value within the entire

set of data.

Objective A: z-scores

The z-score represents the distance that a data value is from the mean in terms of the number of standard deviations.

Population z-score: $z = \frac{x - \mu}{\sigma}$

Sample z-score: $z = \frac{x - \bar{x}}{s}$

Example 1: The average 20- to 29-year-old man is 69.6 inches tall, with a standard deviation of 3.0 inches, while the average 20- to 29-year-old woman is 64.1 inches tall, with a standard deviation of 3.8 inches. Who is relatively taller, a 67-inch man or a 62-inch woman?

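Man: $\mu = 69.6$ inches, $\sigma = 3.0$ inches, $x = 67$ inches

$z = \frac{x - \mu}{\sigma} = \frac{67 - 69.6}{3.0} \approx -0.87$

The man's height is 0.87 standard deviations below the mean.

Woman: $\mu = 64.1$ inches, $\sigma = 3.8$ inches, $x = 62$ inches

$z = \frac{x - \mu}{\sigma} = \frac{62 - 64.1}{3.8} \approx -0.55$

The woman's height is 0.55 standard deviations below the mean.

Therefore, the 62-inch woman is relatively taller than the 67-inch man, because her z-score is larger (-0.55 > -0.87).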
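The comparison in Example 1 amounts to computing two z-scores and seeing which is larger. A minimal Python sketch, with the function name `z_score` chosen here for illustration:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_man = z_score(67, 69.6, 3.0)
z_woman = z_score(62, 64.1, 3.8)

print(round(z_man, 2), round(z_woman, 2))
print("woman is relatively taller" if z_woman > z_man else "man is relatively taller")
```

Because z-scores are unitless, they allow values from two different distributions to be compared directly.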


Objective B: Percentiles and Quartiles

B1. Percentiles

The $k$th percentile, $P_k$, of a set of data is a value such that $k$ percent of the observations are less than or equal to the value.

Example 1: Explain the meaning of the following statement: the 5th percentile of the weight of males 36 months of age is 12.0 kg.

5% of 36-month-old males weigh 12.0 kg or less.

95% of 36-month-old males weigh more than 12.0 kg.

The most common percentiles are quartiles.

The first quartile, $Q_1$, is equivalent to $P_{25}$.

The second quartile, $Q_2$, is equivalent to $P_{50}$.

The third quartile, $Q_3$, is equivalent to $P_{75}$.


Example 2: Determine the quartiles of the following data.

46 45 58 71 42 66 72 42 61 49 80

Ascending order:

42 42 45 46 49 58 61 66 71 72 80

M = 58

𝑄2 = 58

Lower half of the data:

42 42 45 46 49

𝑄1 = 45

Upper half of the data:

61 66 71 72 80

𝑸𝟑 = 71
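The procedure used in Example 2 (find the median, then take the median of each half, leaving the overall median out of both halves when $n$ is odd) can be sketched in Python as follows. The function name `quartiles` is chosen here for illustration, and the sketch assumes at least two data values. Statistical software may use other quartile conventions and can return slightly different values for some data sets.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                 # values below the median position
    upper = ordered[half + (n % 2):]       # skip the median itself when n is odd
    return median(lower), median(ordered), median(upper)

data = [46, 45, 58, 71, 42, 66, 72, 42, 61, 49, 80]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
```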

B2. Interquartile Range

The interquartile range, IQR, is a measure of dispersion that is based on quartiles. The range and standard deviation are affected by extreme values. The IQR is resistant to extreme values.

Example 1: One variable that is measured by online homework systems is the amount of time a student spends on homework for each section of the text. The following is a summary of the number of minutes a student spends for each section of the text for the fall 2007 semester in a College Algebra class at Joliet Junior College.

$Q_1 = 42$, $Q_2 = 51.5$, $Q_3 = 72.5$

(a) Provide an interpretation of these quartiles.

$Q_1 = 42$: 25% of the students spend 42 minutes or less on homework for each section, and 75% of the students spend more than 42 minutes.

$Q_2 = 51.5$: 50% of the students spend 51.5 minutes or less on homework for each section, and 50% of the students spend more than 51.5 minutes.

$Q_3 = 72.5$: 75% of the students spend 72.5 minutes or less on homework for each section, and 25% of the students spend more than 72.5 minutes.

(b) Determine and interpret the interquartile range.

IQR = $Q_3 - Q_1$ = 72.5 - 42 = 30.5 minutes

The middle 50% of the students' homework times span a range of 30.5 minutes.

(c) Do you believe that the distribution of time spent doing homework is skewed or symmetric? Why?

Skewed right. The difference between $Q_2$ and $Q_1$ (51.5 - 42 = 9.5) is less than the difference between $Q_3$ and $Q_2$ (72.5 - 51.5 = 21).

Objective C: Outliers

Extreme observations are called outliers; they may occur because of errors in measurement, data entry, or sampling.

Chapter 3.5 The Five-Number Summary and Boxplots

Objective A: The Five-Number Summary

Example 1: The number of chocolate chips in each of 21 randomly selected name-brand cookies was recorded. The results are shown below.

28 23 28 31 27 29 24 19 26 23 21 25 22 23
21 23 33 28 33 21 30

Find the five-number summary.

Ascending order:

19 21 21 21 22 23 23 23 23 24 25 26 27 28 28 28 29 30 31 33 33

$M = 25$

Lower half of the data: 19 21 21 21 22 23 23 23 23 24

$Q_1 = \frac{22+23}{2} = 22.5$

Upper half of the data: 26 27 28 28 28 29 30 31 33 33

$Q_3 = \frac{28+29}{2} = 28.5$

Five-number summary:

Minimum = 19, $Q_1$ = 22.5, $Q_2$ = 25, $Q_3$ = 28.5, Maximum = 33
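The five-number summary for the chocolate chip data can be reproduced with a short Python sketch. The function name `five_number_summary` is chosen here for illustration; it finds the quartiles as the medians of the lower and upper halves, leaving the overall median out of both halves when $n$ is odd, and it assumes at least two data values.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + (n % 2):])  # skip the median itself when n is odd
    return ordered[0], q1, median(ordered), q3, ordered[-1]

chips = [28, 23, 28, 31, 27, 29, 24, 19, 26, 23, 21, 25, 22, 23,
         21, 23, 33, 28, 33, 21, 30]

summary = five_number_summary(chips)
iqr = summary[3] - summary[1]
print(summary, iqr)
```

These five values are exactly what is needed to draw the box and whiskers of a boxplot, and the difference between the third and first quartiles gives the IQR.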

Objective B: Boxplot

The five-number summary can be used to construct a graph called the boxplot.

Objective C: Using a Boxplot to Describe the Shape of a Distribution

Example 1: Use the side-by-side boxplots shown to answer the questions that follow.

(a) To the nearest integer, what is the median of variable x ?

≈ 15

(b) To the nearest integer, what is the first quartile of variable y ?

≈ 22

(c) Which variable has more dispersion? Why?

The y variable has more dispersion because the IQR of y is wider than the IQR of x.

(d) Does the variable x have any outliers? If so, what is the value of the outlier?

Yes, there is an asterisk on the right side of the boxplot. Outlier ≈ 30

(e) Describe the shape of the distribution of variable y. Support your position.

Since there is a longer whisker on the left and $Q_2 - Q_1$ is bigger than $Q_3 - Q_2$, the shape of the distribution is skewed to the left.