
Upload: hoangque

Post on 22-May-2018


Copyright Reserved 1


Chapter 3 - Descriptive stats: Numerical measures

3.1 Measures of Location

Mean

Perhaps the most important measure of location is the mean (average).

Sample mean: $\bar{x} = \frac{\sum x_i}{n}$

where n = sample size

Example:

The number of students per class is as follows:

46 54 42 46 32

The mean is: $\bar{x} = \frac{46 + 54 + 42 + 46 + 32}{5} = \frac{220}{5} = 44$

Median

The median is another measure of location for a variable.

The median is the value in the middle when the data are arranged in ascending order (smallest to largest value).

Computation:

o Arrange the data in ascending order (smallest to largest value)

o For an odd number of observations, the median is the middle value

o For an even number of observations, the median is the average of the middle 2 values

Example:

The number of students per class is as follows:

46 54 42 46 32

The median is:

Arrange the values from smallest to largest:

32 42 46 46 54

Middle value = Median = 46

Example

The yearly income (in R1000s) of 8 workers is as follows:

95 102 105 120 125 150 220 450

1. Calculate the mean and the median.

Answers:

Mean/average: $\bar{x} = \frac{95 + 102 + 105 + 120 + 125 + 150 + 220 + 450}{8} = \frac{1367}{8} = 170.875$

Median:

For the median, we arrange the values from smallest to largest:

95 102 105 120 125 150 220 450

Median = $\frac{120 + 125}{2} = 122.5$

Although the mean is the more commonly used measure of central location, in some situations the median is preferred.

The mean is influenced by extremely small and large data values, while the median is not influenced by extreme values. In the income example, the single large value 450 pulls the mean (170.875) well above the median (122.5).
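The effect of an extreme value can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the variable names and the more extreme value 4500 are assumptions made for illustration and are not part of the notes.

```python
import statistics

# Yearly incomes (in R1000s) of the 8 workers from the example
incomes = [95, 102, 105, 120, 125, 150, 220, 450]

mean_income = statistics.mean(incomes)      # sum of the values divided by n
median_income = statistics.median(incomes)  # n is even: average of the two middle values

# Hypothetical variation (not from the notes): make the largest value more extreme
incomes_extreme = incomes[:-1] + [4500]
mean_extreme = statistics.mean(incomes_extreme)
median_extreme = statistics.median(incomes_extreme)

print("original:", mean_income, median_income)
print("extreme: ", mean_extreme, median_extreme)
```

The mean changes when the largest value changes, while the median stays the same because the two middle values are unaffected.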

Mode

Definition:

The mode is the value that occurs with greatest frequency.

Example:

The number of students per class is as follows:

46 54 42 46 32

The mode is: 46

Note:

Bimodal: the data have exactly 2 modes.

Example of a bimodal data set: 46 54 42 46 32 54 (the modes are 46 and 54)

Multimodal: the data have more than 2 modes.
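As a small illustration, the modes of the two class-size data sets can be found with `statistics.multimode` (available from Python 3.8). This is a sketch only; the variable names are chosen here for illustration.

```python
from statistics import multimode  # returns every value that has the highest frequency

class_sizes = [46, 54, 42, 46, 32]
bimodal_sizes = [46, 54, 42, 46, 32, 54]

modes_single = multimode(class_sizes)    # one mode
modes_double = multimode(bimodal_sizes)  # two modes, so the data are bimodal

print(modes_single, modes_double)
```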

Example:

Give the appropriate measure of location for the following data:

Soft drink Frequency

Coke Classic 19

Diet Coke 8

Dr. Pepper 5

Pepsi-Cola 13

Sprite 5

The mode is: Coke Classic

These data are categorical (the values are names, not numbers), so it makes no sense to speak of the mean or median; the mode is the appropriate measure of location.

Using Microsoft Excel 2007 to compute the mean, median and mode

[Figure: Excel formula worksheet and value worksheet; images not reproduced]

Percentiles

Definition: The pth percentile is a value such that at least p percent of the observations are less than or equal to this value and at least (100 – p) percent of the observations are greater than or equal to this value.

Calculating the pth percentile:

• Arrange the data in ascending order (smallest to largest value)

• Compute an index i:

$i = \left(\frac{p}{100}\right) n$, where p = percentile of interest and n = sample size

(a) If i is not an integer, round up: the next integer greater than i gives the position of the pth percentile

(b) If i is an integer, the pth percentile is the average of the values in positions i and (i + 1)
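The steps above can be sketched as a small Python function. This is an illustrative sketch, not a library routine: the function name `percentile` is an assumption, it expects an integer p with 0 < p < 100, and it is applied here to the income data from the earlier example. Statistical packages may use other interpolation conventions, so their results can differ from this rule.

```python
def percentile(data, p):
    """pth percentile using the index rule i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    values = sorted(data)                   # arrange the data in ascending order
    n = len(values)
    whole, remainder = divmod(p * n, 100)   # i = whole + remainder / 100
    if remainder != 0:
        # (a) i is not an integer: round up to position whole + 1 (counting from 1)
        return values[whole]
    # (b) i is an integer: average the values in positions i and i + 1 (counting from 1)
    return (values[whole - 1] + values[whole]) / 2


incomes = [95, 102, 105, 120, 125, 150, 220, 450]

p25 = percentile(incomes, 25)   # i = 2, so average the 2nd and 3rd values
p50 = percentile(incomes, 50)   # i = 4, so average the 4th and 5th values
p85 = percentile(incomes, 85)   # i = 6.8, so round up to the 7th value

print(p25, p50, p85)
```

Integer arithmetic (`divmod`) is used to test whether i is an integer, which avoids floating-point rounding in products such as 0.85 × 12.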

Example:

Determine the 85th percentile ($P_{85}$) for the starting salary data:

Step 1: Arrange the data in ascending order

Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{85}{100}\right)(12) = 10.2$, which is rounded up to 11

Step 3: The value in the 11th position (after the data have been arranged in ascending order) is $P_{85}$ = 3 730.

Interpretation: 85% of the graduates have a starting salary of R3 730 or less.

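Determine the 33rd percentile ($P_{33}$) for the starting salary:

Step 1: Arrange the data in ascending order

Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{33}{100}\right)(12) = 3.96$, which is rounded up to 4

Step 3: The value in the 4th position (after the data have been arranged in ascending order) is $P_{33}$ = 3 480.

Interpretation: 33% of the graduates have a starting salary of R3 480 or less.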

Determine the median ($P_{50}$) for the starting salary:

Step 1: Arrange the data in ascending order

Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{50}{100}\right)(12) = 6$, so i + 1 = 7

Step 3: The median is the average of the values in the 6th and 7th positions: Median = 3 505

Interpretation: 50% of the graduates have a starting salary of R3 505 or less.

Determine the 25th percentile ($P_{25}$) for the starting salary:

Step 1: Arrange the data in ascending order

Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{25}{100}\right)(12) = 3$, so i + 1 = 4

Step 3: $P_{25}$ is the average of the values in the 3rd and 4th positions: $P_{25}$ = 3 465

Interpretation: 25% of the graduates have a starting salary of R3 465 or less.

Determine the 75th percentile ($P_{75}$) for the starting salary:

Step 1: Arrange the data in ascending order

Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{75}{100}\right)(12) = 9$, so i + 1 = 10

Step 3: $P_{75}$ is the average of the values in the 9th and 10th positions: $P_{75}$ = 3 600

Interpretation: 75% of the graduates have a starting salary of R3 600 or less.

Quartiles

$Q_1$ = first quartile, or 25th percentile

$Q_2$ = second quartile, or 50th percentile (also the median)

$Q_3$ = third quartile, or 75th percentile

3.2 Measures of variability

Range

Range = Largest Value – Smallest Value

Example of the salary data.

The range is: 3 925 – 3 310 = 615

Advantages:

o Easy to calculate

Disadvantages:

o It depends on just 2 data values: the largest value and the smallest value.

o It is unstable, because it is influenced by extreme values.

Suppose one of the graduates received a starting salary of 10 000 per month. Then the range is equal to: 10 000 – 3 310 = 6 690.

Interquartile Range (IQR)

$IQR = Q_3 - Q_1$

It is the range of the middle 50% of the data.

Example of the salary data.

The interquartile range for the salary data is: $IQR = Q_3 - Q_1$ = 3 600 – 3 465 = 135

Advantages:

o Easy to interpret

o Is not influenced by extreme values

Disadvantages:

o It is based only on the middle 50% of the data.

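Variance

The variance is a measure of variability that utilizes all the data.

The Sample Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Standard Deviation

Sample Standard Deviation: $s = \sqrt{s^2}$ and therefore $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Example

Given: 46 54 42 46 32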

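Example: Calculate the standard deviation of the class sizes.

The columns are the number of students in class ($x_i$), the mean class size ($\bar{x}$), the deviation about the mean ($x_i - \bar{x}$) and the squared deviation about the mean ($(x_i - \bar{x})^2$).

$x_i$   $\bar{x}$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$

46 44 2 4

54 44 10 100

42 44 -2 4

46 44 2 4

32 44 -12 144

Totals: $\sum (x_i - \bar{x}) = 0$ and $\sum (x_i - \bar{x})^2 = 256$

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{256}{5 - 1} = 64$ and $s = \sqrt{64} = 8$

OR

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(46 - 44)^2 + (54 - 44)^2 + (42 - 44)^2 + (46 - 44)^2 + (32 - 44)^2}{5 - 1} = \frac{256}{4} = 64$ and $s = \sqrt{64} = 8$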

Interpretation:

The average deviation of the class sizes from the average class size (44) is 8 students.
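The table calculation can be reproduced step by step in Python. This is a minimal sketch that mirrors the columns of the table; the variable names are chosen here for illustration.

```python
import math

class_sizes = [46, 54, 42, 46, 32]
n = len(class_sizes)
mean = sum(class_sizes) / n                    # sample mean

deviations = [x - mean for x in class_sizes]   # deviation about the mean
squared = [d ** 2 for d in deviations]         # squared deviation about the mean

variance = sum(squared) / (n - 1)              # sample variance: divide by n - 1
std_dev = math.sqrt(variance)                  # sample standard deviation

print(sum(deviations), sum(squared), variance, std_dev)
```

The deviations about the mean always sum to zero, which is why they are squared before being added.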

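Coefficient of Variation

It is a relative measure of variability: it measures the standard deviation relative to the mean.

Coefficient of Variation: $CV = \left(\frac{s}{\bar{x}} \times 100\right)\%$

The coefficient of variation tells us what percentage the sample standard deviation is of the sample mean. For the class sizes, $CV = \left(\frac{8}{44} \times 100\right)\% \approx 18.2\%$.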

Example:

The class test mark (out of 10) and the semester test mark (out of 50) of 5 students are investigated.

Class test (out of 10) Semester test (out of 50)

4 13

5 20

7 25

6 32

8 40

Average of class test marks = 6 Average of semester test marks = 26

Variance of class test marks = 2.5 Variance of semester test marks = 109.5

Which test has the biggest relative variation? Calculate the relevant numerical measures.

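Coefficient of variation for the class test marks: $CV = \left(\frac{\sqrt{2.5}}{6} \times 100\right)\% = 26.35\%$

Coefficient of variation for the semester test marks: $CV = \left(\frac{\sqrt{109.5}}{26} \times 100\right)\% = 40.25\%$

Therefore, the semester test has the larger relative variation.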
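The comparison can be checked with a short Python sketch. The helper name `coefficient_of_variation` is an assumption made for illustration; it takes the mean and the variance given in the example.

```python
import math

def coefficient_of_variation(mean, variance):
    """CV as a percentage: (s / mean) * 100, where s = sqrt(variance)."""
    return math.sqrt(variance) / mean * 100

cv_class = coefficient_of_variation(6, 2.5)         # class test, marked out of 10
cv_semester = coefficient_of_variation(26, 109.5)   # semester test, marked out of 50

print(round(cv_class, 2), round(cv_semester, 2))
```

Because the CV is expressed relative to the mean, the two tests can be compared even though they are marked on different scales.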

Using the Descriptive Statistics Tool in Microsoft Excel 2007

Self-study (see page 115)

3.3 Measures of Distribution Shape, Relative Location and Detecting Outliers

Distribution Shapes

Read through by yourself.

z-Scores

z-score: $z_i = \frac{x_i - \bar{x}}{s}$

The z-score is called the standardized value.

It can be interpreted as the number of standard deviations $x_i$ is from the mean $\bar{x}$.

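Example:

z-scores of the class sizes dataset.

(We calculated the mean and standard deviation previously: $\bar{x} = 44$ and s = 8).

The columns are the number of students in class ($x_i$), the deviation about the mean ($x_i - \bar{x}$) and the z-score ($\frac{x_i - \bar{x}}{s}$).

$x_i$   $x_i - \bar{x}$   $z_i$

46 2 0.25

54 10 1.25

42 -2 -0.25

46 2 0.25

32 -12 -1.5

Interpretation:

54 is 1.25 standard deviations above the mean.

32 is 1.5 standard deviations below the mean.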
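A minimal Python sketch of the same calculation, using the mean and standard deviation found earlier for the class sizes (the variable names are chosen here for illustration):

```python
class_sizes = [46, 54, 42, 46, 32]
mean = 44   # sample mean of the class sizes
s = 8       # sample standard deviation of the class sizes

# z-score: number of standard deviations each value lies from the mean
z_scores = [(x - mean) / s for x in class_sizes]

for x, z in zip(class_sizes, z_scores):
    position = "above" if z > 0 else "below"
    print(f"{x} is {abs(z)} standard deviations {position} the mean")
```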

Example:

The Mathematics marks of 2 students are compared.

Student 1 75% (in School A)

Student 2 80% (in School B)

Which student did better, relative to his or her school?

School   $\bar{x}$   $s^2$   s

A 55 64 8

B 80 144 12

Student 1: $z = \frac{75 - 55}{8} = 2.5$

Student 1’s mark is 2.5 standard deviations above the mean.

Student 2: $z = \frac{80 - 80}{12} = 0$

Student 2’s mark is exactly the same value as the mean.

Conclusion:

Student 1 has done relatively better in his school than Student 2.

Chebyshev’s Theorem – Not for examination

Empirical Rule

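Empirical Rule (for data with a bell-shaped distribution):

Approximately 68% of the data values will be within 1 std dev of the mean.

Approximately 95% of the data values will be within 2 std dev of the mean.

Almost all (approximately 100%) of the data values will be within 3 std dev of the mean.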

Example of the application of the empirical rule:

Suppose IQ scores have a bell-shaped distribution with a mean of 100 and a standard deviation of 15.

a) What percentage of people should have an IQ score between 85 and 115? Answer = 68%

b) What percentage of people should have an IQ score between 70 and 130? Answer = 95%

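c) What percentage of people should have an IQ score of more than 130? Answer = 2.5%

100% - 95% = 5% of the scores lie outside the interval from 70 to 130 and, by symmetry, half of these lie above 130: 5%/2 = 2.5%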

d) The 16th percentile ($P_{16}$) is equal to:

100% - 68% = 32% and 32%/2 = 16%. Therefore, $P_{16}$ = 85.

e) The 84th percentile ($P_{84}$) is equal to:

16% + 68% = 84%. Therefore, $P_{84}$ = 115.

f) Is a person with an IQ score of 160 seen as an outlier?

Yes, since approximately 100% of the values are between 55 and 145, an IQ score of 160 is seen as an outlier.

OR

$z = \frac{160 - 100}{15} = 4 > 3$ (see the next section on outliers).
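The percentages in the empirical rule are rounded. As a rough check, and under the stronger assumption (made here, not in the notes) that IQ scores are exactly normally distributed, the proportions can be computed with `statistics.NormalDist` (available from Python 3.8):

```python
from statistics import NormalDist

# Assumption for illustration: IQ scores follow an exact normal distribution
iq = NormalDist(mu=100, sigma=15)

within_1 = iq.cdf(115) - iq.cdf(85)   # within 1 std dev of the mean
within_2 = iq.cdf(130) - iq.cdf(70)   # within 2 std dev of the mean
within_3 = iq.cdf(145) - iq.cdf(55)   # within 3 std dev of the mean
above_130 = 1 - iq.cdf(130)           # upper tail beyond 2 std dev

z_160 = (160 - 100) / 15              # z-score of an IQ of 160

print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
print(round(above_130, 4), z_160)
```

Under this assumption the exact proportions are close to, but not exactly, the 68%, 95% and 100% quoted in the rule, and the upper tail beyond 130 is slightly below 2.5%.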

Detecting Outliers

Sometimes a data set will have one or more observations with unusually large or unusually small values.

These extreme values are called outliers.

Standardized values (z-scores) can be used to identify outliers.

In the case of a bell-shaped distribution, the following rule can be applied:

Since almost all of the data will be within 3 std dev of the mean, we recommend treating any data value with a z-score < -3 or a z-score > 3 as an outlier.

3.4 Exploratory Data Analysis

Five-Number Summary

The following 5 numbers are used to summarize the data:

1. Smallest Value

2. First Quartile ($Q_1$)

3. Second Quartile ($Q_2$), the median

4. Third Quartile ($Q_3$)

5. Largest Value

The five-number summary of the salary data is:

Smallest value = 3310

$Q_1$ = 3465

$Q_2$ = 3505 (Median)

$Q_3$ = 3600

Largest value = 3925

(These values have been calculated previously).

Box Plot

A box plot is a graphical summary of data that is based on a five-number summary.

A box plot provides another way to identify outliers.

Upper limit = Q3 + (1.5)(IQR) = 3600 + (1.5)(135) = 3802.5

Lower limit = Q1 - (1.5)(IQR) = 3465 - (1.5)(135) = 3262.5

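If a point falls above the upper limit or below the lower limit, the point is seen as an outlier. For the salary data, the largest value (3925) lies above the upper limit of 3802.5 and is therefore seen as an outlier.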
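The limits can be computed directly from the quartiles. The sketch below uses the salary quartiles and extreme values quoted in the notes; the helper name `is_outlier` is an assumption made for illustration.

```python
# Quartiles and extreme values of the salary data, as given in the notes
q1, q3 = 3465, 3600
smallest, largest = 3310, 3925

iqr = q3 - q1                    # interquartile range
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

def is_outlier(value):
    """A value outside the two limits is treated as an outlier."""
    return value < lower_limit or value > upper_limit

print(lower_limit, upper_limit)
print(is_outlier(smallest), is_outlier(largest))
```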

Box plots and skewness:

The median is in the middle of the box, indicating symmetry.

The median is not centered in the middle of the box. The median is closer to $Q_1$, indicating that the shape of the distribution is skewed to the right.

The median is not centered in the middle of the box. The median is closer to $Q_3$, indicating that the shape of the distribution is skewed to the left.

Skewness:

Skewed to the left (negative skew): The left tail is longer; the mass of the distribution is concentrated on the right of the figure. It has relatively few low values.

Skewed to the right (positive skew): The right tail is longer; the mass of the distribution is concentrated on the left of the figure. It has relatively few high values.

Symmetric: The left and right sides of the distribution are mirror images of each other.

Note: A normal distribution is symmetric.

3.5 Measures of association between two variables

Covariance

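Sample Covariance: Measure of the linear relationship between x and y.

$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

Note:

$s_{xy} > 0$: Positive linear relationship

$s_{xy} < 0$: Negative linear relationship

$s_{xy} \approx 0$: No linear relationship

Note: (Not in the textbook)

$s_{xx} = \frac{\sum (x_i - \bar{x})(x_i - \bar{x})}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1} = s_x^2$

where $s_x^2$ denotes the sample variance of the x observations.

Similarly:

$s_{yy} = \frac{\sum (y_i - \bar{y})(y_i - \bar{y})}{n - 1} = \frac{\sum (y_i - \bar{y})^2}{n - 1} = s_y^2$

where $s_y^2$ denotes the sample variance of the y observations.

Calculations for the variance and standard deviation of x, the variance and standard deviation of y and the covariance between x and y:

x   y   $x_i - \bar{x}$   $(x_i - \bar{x})^2$   $y_i - \bar{y}$   $(y_i - \bar{y})^2$   $(x_i - \bar{x})(y_i - \bar{y})$

2 50 -1 1 -1 1 1

5 57 2 4 6 36 12

1 41 -2 4 -10 100 20

3 54 0 0 3 9 0

4 54 1 1 3 9 3

1 38 -2 4 -13 169 26

5 63 2 4 12 144 24

3 48 0 0 -3 9 0

4 59 1 1 8 64 8

2 46 -1 1 -5 25 5

Totals: 30 510 0 20 0 566 99

$\bar{x} = \frac{30}{10} = 3$ and $\bar{y} = \frac{510}{10} = 51$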

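1. Calculate the variance and the standard deviation of x:

$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{20}{9} = 2.2222$ and $s_x = \sqrt{2.2222} = 1.4907$

2. Calculate the variance and the standard deviation of y:

$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{566}{9} = 62.8889$ and $s_y = \sqrt{62.8889} = 7.9303$

3. Calculate and interpret the covariance between x and y:

$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{99}{9} = 11$. There is a positive linear relationship between x and y.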
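The column totals and the three results can be reproduced with a few lines of Python. This is a sketch only; the variable names are chosen here for illustration.

```python
import math

x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Column totals of the calculation table
sum_xx = sum((xi - x_bar) ** 2 for xi in x)
sum_yy = sum((yi - y_bar) ** 2 for yi in y)
sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

var_x = sum_xx / (n - 1)    # sample variance of x
var_y = sum_yy / (n - 1)    # sample variance of y
cov_xy = sum_xy / (n - 1)   # sample covariance between x and y

print(round(var_x, 4), round(math.sqrt(var_x), 4))
print(round(var_y, 4), round(math.sqrt(var_y), 4))
print(cov_xy)
```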

Interpretation of sample covariance

[Figure: two scatter plots of y against x, one showing a positive linear relationship and one showing a negative linear relationship; images not reproduced]

Correlation Coefficient

To measure the strength of the linear relationship between x and y.

$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{11}{(1.4907)(7.9303)} = 0.9305$. Strong positive linear relationship between x and y.

where

$s_{xy}$ = Sample covariance between x and y.

$s_x$ = Sample standard deviation of x.

$s_y$ = Sample standard deviation of y.

Interpretation of the Correlation Coefficient

Measures the linear relationship between x and y: $-1 \le r_{xy} \le 1$

i. $r_{xy} > 0$: Positive linear relationship

$r_{xy} = +1$: Perfect positive linear relationship

ii. $r_{xy} < 0$: Negative linear relationship

$r_{xy} = -1$: Perfect negative linear relationship

iii. $r_{xy} = 0$: No linear relationship (the variables may still have a non-linear relationship)

[Figure: scale of $r_{xy}$ from -1 to +1. Values close to -1 indicate a strong negative linear relationship between x and y, negative values close to 0 a weak negative linear relationship, 0 no linear relationship, positive values close to 0 a weak positive linear relationship, and values close to +1 a strong positive linear relationship.]
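A short Python sketch of the correlation coefficient, applied to the x and y data from the covariance example. The function name `correlation` and the second, perfectly linear data set are assumptions made for illustration.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient r = s_xy / (s_x * s_y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    sum_yy = sum((yi - y_bar) ** 2 for yi in y)
    # The (n - 1) divisors of s_xy, s_x and s_y cancel, leaving this ratio
    return sum_xy / math.sqrt(sum_xx * sum_yy)

x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]
r = correlation(x, y)

# Hypothetical data lying exactly on a falling straight line
r_perfect_negative = correlation([1, 2, 3, 4], [8, 6, 4, 2])

print(round(r, 4), r_perfect_negative)
```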

Using Microsoft Excel 2007 to compute the covariance and correlation coefficient

[Figure: Excel formula worksheet and value worksheet; images not reproduced]

Note: We have to adjust the Excel result of 9.9 for the covariance, since the COVAR function in Excel calculates the population covariance.

$s_{xy}$ = sample covariance

$\sigma_{xy}$ = population covariance

$s_{xy} = \left(\frac{n}{n - 1}\right) \sigma_{xy} = \left(\frac{10}{9}\right)(9.9) = 11$
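The adjustment can be illustrated in Python by computing the covariance with both divisors. This is a sketch with illustrative variable names; it does not call Excel and only mirrors the arithmetic of the note above.

```python
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

population_cov = sum_xy / n      # divisor n, the value the notes report from Excel
sample_cov = sum_xy / (n - 1)    # divisor n - 1

# Converting the population covariance into the sample covariance
adjusted = population_cov * n / (n - 1)

print(population_cov, sample_cov, adjusted)
```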

Homework (work through the following example on your own):

The class test mark (out of 10) (x) and the semester test mark (out of 50) (y) of 5 students are

investigated.

Class test (out of 10) (x) Semester test (out of 50) (y)

4 13

5 20

7 25

6 32

8 40

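(a) Calculate the mean mark and the variance for the class test:

$\bar{x} = \frac{4 + 5 + 7 + 6 + 8}{5} = 6$ and

$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(4 - 6)^2 + (5 - 6)^2 + (7 - 6)^2 + (6 - 6)^2 + (8 - 6)^2}{5 - 1} = \frac{10}{4} = 2.5$.

(b) Calculate the mean mark and the variance for the semester test:

$\bar{y} = \frac{13 + 20 + 25 + 32 + 40}{5} = 26$ and

$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{(13 - 26)^2 + (20 - 26)^2 + (25 - 26)^2 + (32 - 26)^2 + (40 - 26)^2}{5 - 1} = \frac{438}{4} = 109.5$.

(c) Calculate and interpret the standard deviation for the semester test:

$s_y = \sqrt{109.5} = 10.46$.

The average deviation of the semester test marks from the average ($\bar{y} = 26$) is approximately 10.5.

(d) Calculate and interpret the covariance:

Answer:

x   y   $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$

4 13 -2 -13 26

5 20 -1 -6 6

7 25 1 -1 -1

6 32 0 6 0

8 40 2 14 28

$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{59}{4} = 14.75$. There is a positive linear relationship between x and y.

(e) Calculate and interpret the correlation coefficient:

$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{14.75}{\sqrt{2.5}\sqrt{109.5}} = 0.89$. There is a strong positive linear relationship between x and y.

(f) Suppose a student obtained 6/10 for the class test and 30/50 for the semester test. In which test did the student perform the best, relative to the other students?

$z_x = \frac{6 - 6}{\sqrt{2.5}} = 0$ and $z_y = \frac{30 - 26}{\sqrt{109.5}} = 0.38$. The student performed the best in the semester test, relative to the other students.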

3.6 The weighted mean and working with grouped data

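Weighted Mean

$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$, where $w_i$ = weight for observation i

Example

Consider the following sample of 5 purchases of raw material:

Purchase   Cost per pound ($)   Number of pounds

1 3.00 1200

2 3.40 500

3 2.80 2750

4 2.90 1000

5 3.25 800

Question: What is the mean cost per pound for the raw material?

The weighted mean, with the number of pounds as the weights:

$\bar{x} = \frac{(1200)(3.00) + (500)(3.40) + (2750)(2.80) + (1000)(2.90) + (800)(3.25)}{1200 + 500 + 2750 + 1000 + 800} = \frac{18500}{6250} = 2.96$

Example:

The net full supply capacity (FSC) (in millions of cubic metres) in the various regions and catchment areas in South Africa, and also the percentage content as on 31 August 1992, are given in the table below.

Region/catchment area   FSC   % content

Vaaldam 2529 20

Bloemhofdam 1269 20

Sterkfonteindam 2617 99

Question: Calculate the weighted mean for the % content in the catchment area:

With the FSC values as the weights:

$\bar{x} = \frac{(2529)(20) + (1269)(20) + (2617)(99)}{2529 + 1269 + 2617} = \frac{335043}{6415} = 52.23\%$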
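Both weighted means can be checked with a small Python helper. The function name `weighted_mean` is an assumption made for illustration.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Raw material: cost per pound weighted by the number of pounds purchased
cost_per_pound = weighted_mean([3.00, 3.40, 2.80, 2.90, 3.25],
                               [1200, 500, 2750, 1000, 800])

# Dams: % content weighted by the full supply capacity (FSC)
content = weighted_mean([20, 20, 99], [2529, 1269, 2617])

print(round(cost_per_pound, 2), round(content, 2))
```

The unweighted average of the three % content values would be about 46.3%, which understates the overall content because the fullest dam also has the largest capacity.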

Grouped data

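The audit times for 20 clients were as follows:

Audit times (in days)   Frequency ($f_i$)   Class Midpoint ($M_i$)

10-14 4 12

15-19 8 17

20-24 5 22

25-29 2 27

30-34 1 32

Total 20

Sample mean for grouped data: $\bar{x} = \frac{\sum f_i M_i}{n}$

$M_i$ = The midpoint for class i

$f_i$ = The frequency for class i

$\bar{x} = \frac{(4)(12) + (8)(17) + (5)(22) + (2)(27) + (1)(32)}{20} = \frac{380}{20} = 19$

Sample variance for grouped data: $s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$

$s^2 = \frac{4(12 - 19)^2 + 8(17 - 19)^2 + 5(22 - 19)^2 + 2(27 - 19)^2 + 1(32 - 19)^2}{20 - 1} = \frac{570}{19} = 30$

The standard deviation: $s = \sqrt{30} = 5.48$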
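The grouped-data formulas can be sketched in Python as follows. The tuple layout (lower limit, upper limit, frequency) and the variable names are assumptions made for illustration.

```python
import math

# (lower class limit, upper class limit, frequency) for the audit times
classes = [(10, 14, 4), (15, 19, 8), (20, 24, 5), (25, 29, 2), (30, 34, 1)]

midpoints = [(low + high) / 2 for low, high, _ in classes]   # class midpoints
freqs = [f for _, _, f in classes]                           # class frequencies
n = sum(freqs)

mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
std_dev = math.sqrt(variance)

print(midpoints, mean, variance, round(std_dev, 2))
```

Each class is represented by its midpoint, so the results approximate the mean and variance of the original, ungrouped audit times.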

Homework (go through this example on your own)

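Automobiles traveling on a road that has a posted speed limit of 55 miles per hour are checked for speed by a state police radar system. Following is a frequency distribution of speeds.

Speed (miles per hour)   Frequency ($f_i$)   Class Midpoint ($M_i$)

45-49 10 47

50-54 40 52

55-59 150 57

60-64 175 62

65-69 75 67

70-74 15 72

75-79 10 77

Total 475

(a) Calculate the average speed of the automobiles.

$\bar{x} = \frac{\sum f_i M_i}{n} = \frac{28825}{475} = 60.68$ miles per hour

(b) Calculate the variance and the standard deviation.

$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1} = \frac{14802.63}{474} = 31.23$ and $s = \sqrt{31.23} = 5.59$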

Typical exam questions:

The annual amounts (in $ millions) spent on research and development for a random sample of 30 electronic component manufacturers are given in the following Excel spreadsheets. By using the Sort option in Excel the data set is sorted according to the amount spent.

[Figure: Excel spreadsheets with the unsorted and the sorted data; images not reproduced]

The annual amounts (in $ millions) for electronic component manufacturers have a bell-shaped distribution with a mean of 20 and a standard deviation of 7.

Question 1

The range is:

Answer 1

Range = $x_{max} - x_{min}$ = 38 – 6 = 32.

Question 2

The median is:

Answer 2

$i = \left(\frac{p}{100}\right) n = \left(\frac{50}{100}\right)(30) = 15$. We need to take the average of the values in the 15th and 16th positions. In position 15 we have 20 and in position 16 we have 20, therefore Median = $\frac{20 + 20}{2} = 20$.

Question 3: The data type of annual amounts is: Answer 3: Continuous

Question 4

According to the coefficient of variation:

Answer 4

$CV = \left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{7}{20} \times 100\right)\% = 35\%$. The standard deviation is 35% of the average.

Questions 5 to 8 are based on the following information:

The relationship between the age (in years) of a motorist and the speed (in km/h) of the car on the highway is summarised in the following Excel spreadsheet:

[Figure: Excel formula sheet and value sheet; images not reproduced]

Question 5

The variance of the age of the motorists is:

Answer 5

$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, where $x_i$ denotes the age of motorist i

Question 6

The coefficient of variation of the age of the motorists is:

Answer 6

$CV = \left(\frac{s_x}{\bar{x}} \times 100\right)\%$

Question 7

The sample covariance is:

Answer 7

Sample covariance = $\left(\frac{n}{n - 1}\right)$ × Population covariance

Question 8

The relationship between the age of a motorist and the speed of the car on the highway can be described as:

(A) no linear relationship

(B) a strong negative linear relationship

(C) a weak negative linear relationship

(D) a strong positive linear relationship

(E) a weak positive linear relationship

Answer 8

r = -0.78 which is close to -1. Consequently, we have a strong negative linear relationship.

Questions 9 to 11 are based on the following information:

Consider the following set of descriptive statistics on time per week (in hours) spent on campaigning for the upcoming general election for a specific political party:

Descriptive statistic   Value

$\bar{x}$ 22

$s^2$ 25

$Q_1$ 18

$Q_2$ (median) 22

$Q_3$ 26

Smallest value 8

Largest value 36

Question 9

The distribution of time per week (in hours) is:

(A) Bimodal (B) Multimodal

(C) Symmetrical (D) Skewed to the right

(E) Skewed to the left

Answer 9

Q1 and Q3 are equally far away from the median, therefore the distribution is symmetrical. The box plot, for example, will look something like this:

[Figure: box plot; image not reproduced]

The median is in the middle of the box, indicating symmetry.

Question 10

Using the box and whisker plot approach, an outlier is a value greater than:

Answer 10

Upper limit = $Q_3 + 1.5(IQR) = 26 + 1.5(26 - 18) = 38$.

Question 11

The z-score (standardised value) for the largest value in the data set is:

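Answer 11

Taking the mean as 22 and the variance as 25 from the table: $z = \frac{x - \bar{x}}{s} = \frac{36 - 22}{\sqrt{25}} = 2.8$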