1 descriptive statistics: numerical methods chapter 4


Upload: candace-pierce

Post on 26-Dec-2015


TRANSCRIPT

Page 1: 1 Descriptive Statistics: Numerical Methods Chapter 4

1

Descriptive Statistics:

Numerical Methods

Chapter 4

Page 2: 1 Descriptive Statistics: Numerical Methods Chapter 4

2

Introduction

In this chapter we use numerical measures to describe data sets that represent populations or samples.

Usually, we focus our attention on two types of measures when describing population characteristics: measures of central location and measures of dispersion.

Page 3: 1 Descriptive Statistics: Numerical Methods Chapter 4

3

Why are both the central location and the variability used to describe a set of numbers?

Observe the following example.

Introduction

Page 4: 1 Descriptive Statistics: Numerical Methods Chapter 4

4

Introduction

Think of a sample portfolio composed of three stocks:

100 shares, ARR = 10%
200 shares, ARR = 15%
100 shares, ARR = 20%

A central measure of this portfolio’s ARR is 15%.

Now observe the following portfolio:

100 shares, ARR = 5%
200 shares, ARR = 15%
100 shares, ARR = 25%

A central measure of this portfolio’s ARR is 15% too.

Page 5: 1 Descriptive Statistics: Numerical Methods Chapter 4

5

Introduction

Considering the average ARR only, the two portfolios are equal. But are they really?

Is the dispersion of ARR the same for the two portfolios?

The dispersion (variability) is an important property when describing a set of numbers, at least as important as the central location.

We’ll have more detailed discussions on these two important measures later.
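To make the comparison concrete, here is a minimal Python sketch that computes a share-weighted average ARR and the gap between the lowest and highest ARR for the two portfolios. Weighting by the number of shares is an assumption (it treats every share as having the same price), and the helper names `weighted_mean` and `arr_range` are illustrative only.

```python
# (shares, ARR in percent) for each stock, as listed for the two portfolios.
portfolio_1 = [(100, 10.0), (200, 15.0), (100, 20.0)]
portfolio_2 = [(100, 5.0), (200, 15.0), (100, 25.0)]


def weighted_mean(holdings):
    """Share-weighted average ARR (assumes equal share prices)."""
    total_shares = sum(shares for shares, _ in holdings)
    return sum(shares * arr for shares, arr in holdings) / total_shares


def arr_range(holdings):
    """Largest ARR minus smallest ARR: a crude measure of dispersion."""
    rates = [arr for _, arr in holdings]
    return max(rates) - min(rates)


for name, holdings in [("Portfolio 1", portfolio_1), ("Portfolio 2", portfolio_2)]:
    print(name, "central ARR:", weighted_mean(holdings), "spread:", arr_range(holdings))
```

Both portfolios share the same central value, while the spread of the second is twice that of the first.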

Page 6: 1 Descriptive Statistics: Numerical Methods Chapter 4

6

4.1 Measures of Central Location

With one data point, clearly the central location is at the point itself.

The central data point reflects the locations of all the actual data points. How?

With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).

Page 7: 1 Descriptive Statistics: Numerical Methods Chapter 4

7

4.1 Measures of Central Location

The central data point reflects the locations of all the actual data points. How?

If the third data point appears in the center, the measure of central location will remain in the center.

But if the third data point appears on the left-hand side of the midrange, it should “pull” the central location to the left.

Page 8: 1 Descriptive Statistics: Numerical Methods Chapter 4

8

As more and more data points are added, the central location moves (left and right) as required in order to reflect the effects of all the points.

4.1 Measures of Central Location

Page 9: 1 Descriptive Statistics: Numerical Methods Chapter 4

9

The Arithmetic Mean

This is the most popular and useful measure of central location.

$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$

Page 10: 1 Descriptive Statistics: Numerical Methods Chapter 4

10

The Arithmetic Mean

Sample mean (n is the sample size):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Population mean (N is the population size):

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Page 11: 1 Descriptive Statistics: Numerical Methods Chapter 4

11

The Arithmetic Mean

Example 1

Find the mean rate of return for a portfolio equally invested in five stocks having the following annual rates of return: 11.2%, 8.07%, 5.55%, 13.7%, 21%.

Solution

$$\bar{x} = \frac{11.2 + 8.07 + 5.55 + 13.7 + 21}{5} = \frac{59.52}{5} = 11.904\%$$
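The arithmetic in Example 1 can be checked with a few lines of Python. This is only a sketch of the sample-mean formula applied to the five returns listed in the example; the variable names are illustrative.

```python
# Annual rates of return (in percent) from Example 1.
returns = [11.2, 8.07, 5.55, 13.7, 21.0]

# Sample mean: sum of the measurements divided by the number of measurements.
mean_return = sum(returns) / len(returns)

print(f"sum = {sum(returns):.2f}, n = {len(returns)}, mean = {mean_return:.3f}%")
```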

Page 12: 1 Descriptive Statistics: Numerical Methods Chapter 4

12

The Median

The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.

When determining the median, pay attention to the number of observations (k).

If k is odd: Median = the number at the (k+1)/2-th location of the ordered array.

If k is even: Median = the average of the two numbers in the middle (the numbers at the (k/2)-th and the [(k/2)+1]-th locations of the ordered array).

Page 13: 1 Descriptive Statistics: Numerical Methods Chapter 4

13

The Median

Example 2

The salaries of seven employees were recorded (in $1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.

Odd number of observations. Sorted: 26, 26, 28, 29, 30, 32, 60. There are seven salaries (k = 7). The (k+1)/2-th salary of the ordered array is the number at the (7+1)/2 = 4th location. The median is 29.

Suppose an additional salary of $31,000 is added to the group of salaries recorded before. Find the median salary.

Even number of observations. Sorted: 26, 26, 28, 29, 30, 31, 32, 60. There are eight salaries (k = 8). The two salaries in the middle are 29 (in the (k/2)-th = 4th location) and 30 (in the [(k/2)+1]-th = 5th location). The median is their average, 29.5.
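The odd/even rule for the median can be sketched in Python as follows. The function name `median` and the variable names are illustrative; the data are the salaries of Example 2.

```python
def median(values):
    """Median by the odd/even location rule on the ordered array."""
    ordered = sorted(values)
    k = len(ordered)
    if k == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = k // 2
    if k % 2 == 1:
        # Odd k: the (k+1)/2-th value, i.e. index k // 2 when counting from 0.
        return ordered[middle]
    # Even k: average of the (k/2)-th and (k/2 + 1)-th values.
    return (ordered[middle - 1] + ordered[middle]) / 2


salaries = [28, 60, 26, 32, 30, 26, 29]   # in $1000s
print(median(salaries))                   # seven observations
print(median(salaries + [31]))            # eight observations
```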

Page 14: 1 Descriptive Statistics: Numerical Methods Chapter 4

14

The Mode

The mode of a set of measurements is the value that occurs most frequently.

A set of data may have one mode (or modal class), or two or more modes.

The modal class: for large data sets, the modal class is much more relevant than a single-value mode.

Page 15: 1 Descriptive Statistics: Numerical Methods Chapter 4

15

The Mode

Example 3

The manager of a men’s clothing store observes the waist size (in inches) of trousers sold last week: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.

The mode of this data set is 34 in.

This information seems to be valuable (for example, for the design of a new display in the store), much more than “the median is 33.5 in.”
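A short Python sketch of finding the mode by counting frequencies is shown below. It returns every value that ties for the highest count, since a data set may have more than one mode; the helper name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most frequently, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


# Waist sizes (in inches) from Example 3.
waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]
print(modes(waist_sizes))
```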

Page 16: 1 Descriptive Statistics: Numerical Methods Chapter 4

16

Relationship among Mean, Median, and Mode

If a distribution is symmetrical, the mean, median and mode coincide.

If a distribution is not symmetrical, and is skewed to the left or to the right, the three measures differ.

A positively skewed distribution (“skewed to the right”): [Figure: skewed curve with the mode, median, and mean marked.]

Page 17: 1 Descriptive Statistics: Numerical Methods Chapter 4

17

Relationship among Mean, Median, and Mode

A negatively skewed distribution (“skewed to the left”): [Figure: skewed curve with the mean, median, and mode marked.]

Page 18: 1 Descriptive Statistics: Numerical Methods Chapter 4

18

Using the Mean, Median, and Mode

When to use (or not use) each measure of central location:

• The mean is very sensitive to extreme values, so it should not be used when a few extreme values, lying far from most of the observations, are present. The mean is used in most statistical analyses.

• The median is not affected by extreme values and can therefore be used in their presence. However, the median does not reflect all the values in the data set, only the location of the observation in the middle.

• The mode should be used mainly for categorical data.

Page 19: 1 Descriptive Statistics: Numerical Methods Chapter 4

19

Summary Examples

Example 4

A professor of statistics wants to report the results of a midterm exam, taken by 100 students.

• The mean of the test marks is 73.90
• The median of the test marks is 81
• The mode of the test marks is 84

Describe the information each one provides.

The mean provides information about the overall performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.

The median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%. A student can use this statistic to place his/her mark relative to other students in the class.

The mode must be used when data are qualitative. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.

Page 20: 1 Descriptive Statistics: Numerical Methods Chapter 4

20

Summary Examples

Example 5

The following sample represents the lateness of arriving flights at a certain domestic airport (in minutes): 22, 12, 4, -3,… (the data are found in Lateness.xls)

(a) Find the mean, median, and mode of this sample. Do these data form a skewed distribution? Negative or positive?

(b) Which measure should not be used? Change the largest lateness to 34 minutes (rather than 67). Which central location measures are affected?

(c) A person is waiting for the arrival of a certain flight. He is told the flight will probably be late by not more than 10 minutes. Should he believe this is a reliable estimate? Use the distribution of data requested in part (b).

Page 21: 1 Descriptive Statistics: Numerical Methods Chapter 4

21

Summary Examples

Example 5 - solution

We run the data in Excel using the ‘Descriptive Statistics’ tool.

Lateness
Mean                 10.8709677
Standard Error       2.6436135
Median               6
Mode                 4
Standard Deviation   14.719017
Sample Variance      216.649462
Kurtosis             6.39059859
Skewness             2.17922953
Range                75
Minimum              -8
Maximum              67
Sum                  337
Count                31

The distribution of these data shows a positive skewness. [Figure: histogram of Lateness.]

Do not use the mean, because an ‘outlier’ of 67 minutes lateness affects (increases) the mean value to be almost 11 minutes.

Page 22: 1 Descriptive Statistics: Numerical Methods Chapter 4

22

Summary Examples

Example 5 - solution

When changing the largest observation from 67 to 34, the mean reduces to 9.81 minutes, but the median and mode do not change.

Lateness
Mean                 9.806451613
Standard Error       2.034339265
Median               6
Mode                 4
Standard Deviation   11.32672166
Sample Variance      128.2946237
Kurtosis             0.919374432
Skewness             1.051857781
Range                48
Minimum              -8
Maximum              40
Sum                  304
Count                31

[Figure: histogram and ogive (cumulative relative frequency) of Lateness.]

• It is reasonable to believe that the lateness will not exceed 10 minutes. From the ogive we see that about 60% of the flights arrive within 10 minutes of the scheduled arrival time.

Page 23: 1 Descriptive Statistics: Numerical Methods Chapter 4

23

Problems

P4-1: Consider the following sample of measurements: 27, 32, 30, 28, 31, 32, 35, 28, 28, 29. Compute the mean, median, and mode. Does it appear that the mode is a good measure of central location for this set of numbers?

P4-2: The manager at a local supermarket (facing tough competition) tries to improve service to customers waiting to pay by adding a second cashier. The goal is to have customers wait at most 4.5 minutes before leaving the cashier area. From the data presented in P4-02.xls, was the manager successful in achieving this goal? Use Excel and numerical descriptive measures.

Page 24: 1 Descriptive Statistics: Numerical Methods Chapter 4

24

4.2 Measures of Variability

Measures of central location fail to tell the whole story about the distribution.

A question of interest still remains unanswered:

How much are the values of a given set spread out around the mean value?

Page 25: 1 Descriptive Statistics: Numerical Methods Chapter 4

25

Why do we need measures of variability?

Observe two hypothetical data sets:

Set 1: Small variability

The mean provides a good representation of the values in the data set.

Page 26: 1 Descriptive Statistics: Numerical Methods Chapter 4

26

Why do we need measures of variability?

Observe two hypothetical data sets:

Set 1: Small variability. The mean provides a good representation of the values in the data set.

Set 2: Larger variability. The mean is the same as before but no longer represents the set values as well as before.

Page 27: 1 Descriptive Statistics: Numerical Methods Chapter 4

27

The Range

The range of a set of measurements is the difference between the largest and smallest measurements.

Its major advantage is the ease with which it can be computed.

Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points.

[Figure: the range spans from the smallest measurement to the largest measurement.]

But how do all the measurements spread out? The range cannot assist in answering this question.

Page 28: 1 Descriptive Statistics: Numerical Methods Chapter 4

28

The Variance

This measure reflects the dispersion of all the measurement values.

The variance of a population of N measurements x1, x2, …, xN having a mean $\mu$ is defined as

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

The variance of a sample of n measurements x1, x2, …, xn having a mean $\bar{x}$ is defined as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Page 29: 1 Descriptive Statistics: Numerical Methods Chapter 4

29

The Variance

Consider two small populations:

Population A: 8, 9, 10, 11, 12
Population B: 4, 7, 10, 13, 16

The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.

Can the sum of deviations from the mean be a good measure of dispersion?

Deviations in A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2. Sum = 0
Deviations in B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6. Sum = 0

Page 30: 1 Descriptive Statistics: Numerical Methods Chapter 4

30

The sum of deviations is zero for both populations; therefore, it is not a good measure of dispersion, since clearly their dispersions are not equal.

The Variance

Page 31: 1 Descriptive Statistics: Numerical Methods Chapter 4

31

The Variance

Let us calculate the variance of the two populations:

$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$

$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead?

After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!

Page 32: 1 Descriptive Statistics: Numerical Methods Chapter 4

32

The Variance

Which data set has a larger dispersion?

[Figure: data set A has observations at 1 and 3 (mean 2); data set B has observations at 1 and 5 (mean 3).]

Data set B is more dispersed around the mean.

Let us calculate the sum of squared deviations for both data sets.

Page 33: 1 Descriptive Statistics: Numerical Methods Chapter 4

33

The Variance

SumA = (1-2)² + … + (1-2)² + (3-2)² + … + (3-2)² = 10

SumB = (1-3)² + (5-3)² = 8

SumA > SumB. This is inconsistent with the observation that set B is more dispersed.

Page 34: 1 Descriptive Statistics: Numerical Methods Chapter 4

34

The Variance

However, when calculated on a “per observation” basis (variance), the dispersions are properly ordered.

$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$

$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$
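The contrast between the raw sum of squared deviations and the per-observation variance can be reproduced with a small Python sketch. The exact make-up of data set A (five observations at 1 and five at 3) is an assumption inferred from SumA = 10 and N = 10; the function names are illustrative.

```python
def sum_of_squared_deviations(values):
    """Sum of squared deviations from the mean of the data set."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def population_variance(values):
    """Sum of squared deviations divided by the number of observations N."""
    return sum_of_squared_deviations(values) / len(values)


set_a = [1] * 5 + [3] * 5   # assumed: ten observations, mean 2
set_b = [1, 5]              # two observations, mean 3

for name, data in [("A", set_a), ("B", set_b)]:
    print(name,
          "sum of squares:", sum_of_squared_deviations(data),
          "variance:", population_variance(data))
```

The raw sum ranks A above B only because A has more observations; dividing by N removes that effect.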

Page 35: 1 Descriptive Statistics: Numerical Methods Chapter 4

35

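The Variance

Example 6

Find the variance of the following set of numbers, representing annual rates of return for a group of mutual funds. Assume the set is (i) a sample, (ii) a population: -2, 4, 5, 6.9, 10

Solution

Assuming a sample:

$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{-2 + 4 + 5 + 6.9 + 10}{5} = \frac{23.9}{5} = 4.78$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{5 - 1}\left[(-2 - 4.78)^2 + (4 - 4.78)^2 + \dots + (10 - 4.78)^2\right] = 19.59 \text{ percent}^2$$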

Page 36: 1 Descriptive Statistics: Numerical Methods Chapter 4

36

The Variance

Example 6 - solution continued

Assuming a population:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{1}{5}\left[(-2 - 4.78)^2 + (4 - 4.78)^2 + \dots + (10 - 4.78)^2\right] = 15.6736 \text{ percent}^2$$
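Both answers of Example 6 can be checked with Python's standard `statistics` module, where `variance` divides by n - 1 (sample) and `pvariance` divides by N (population). This is a sketch for verification only; the variable names are illustrative.

```python
import statistics

# Annual rates of return (in percent) from Example 6.
rates = [-2, 4, 5, 6.9, 10]

mean_rate = statistics.mean(rates)
sample_var = statistics.variance(rates)        # divides by n - 1
population_var = statistics.pvariance(rates)   # divides by N

print(f"mean = {mean_rate:.2f}")
print(f"sample variance = {sample_var:.2f} percent^2")
print(f"population variance = {population_var:.4f} percent^2")
```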

Page 37: 1 Descriptive Statistics: Numerical Methods Chapter 4

37

Standard Deviation

The standard deviation of a set of measurements is the square root of the variance of the set.

Sample standard deviation: $s = \sqrt{s^2}$

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Page 38: 1 Descriptive Statistics: Numerical Methods Chapter 4

38

Standard Deviation

Example 7

The daily percentages of defective items in two weeks of production (10 working days) were calculated for two production lines. Which line provides good items more consistently?

Line 1: 8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05

Line 2: 12.1, 2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, 1.3, 11.4

Page 39: 1 Descriptive Statistics: Numerical Methods Chapter 4

39

Standard Deviation

Example 7, Solution

Let us use the Excel printout obtained from the “Descriptive Statistics” sub-menu.

                     Line 1    Line 2
Mean                 16        12
Standard Error       5.295     3.152
Median               14.6      11.75
Mode                 #N/A      #N/A
Standard Deviation   16.74     9.969
Sample Variance      280.3     99.37
Kurtosis             -1.34     -0.46
Skewness             0.217     0.107
Range                49.1      30.6
Minimum              -6.2      -2.8
Maximum              42.9      27.8
Sum                  160       120
Count                10        10

Line 1 should be considered less consistent because the standard deviation of its defective proportion is larger (and therefore the standard deviation of its good-item proportion is also larger).

Page 40: 1 Descriptive Statistics: Numerical Methods Chapter 4

40

Interpreting the Standard Deviation

The standard deviation can be used to
• compare the variability of several distributions
• make a statement about the general shape of a distribution.

When describing the shape of a distribution we refer to
• a distribution with any shape
• a mound-shaped distribution

Page 41: 1 Descriptive Statistics: Numerical Methods Chapter 4

41

The Empirical Rule – Describing a Mound Shaped Data Set

If a sample of measurements has a mound-shaped distribution, the interval…

$(\bar{x} - s,\ \bar{x} + s)$ contains approximately 68% of the measurements

$(\bar{x} - 2s,\ \bar{x} + 2s)$ contains approximately 95% of the measurements

$(\bar{x} - 3s,\ \bar{x} + 3s)$ contains approximately 99.7% of the measurements

Page 42: 1 Descriptive Statistics: Numerical Methods Chapter 4

42

The Empirical Rule

Example 10

Describe the set of data provided in Data 10 using numerical descriptive measures.

[Figure: histogram of the data; horizontal axis: Measurements (17 to 18.6), vertical axis: Frequency.]

Solution

From the histogram it appears that the distribution is approximately mound shaped. We’ll use the empirical rule to describe the data.

Page 43: 1 Descriptive Statistics: Numerical Methods Chapter 4

43

The Empirical Rule – Interpreting the Standard Deviation

Example 10 – solution continued

Running the Descriptive Statistics tool in Excel we have

Mean = 17.959
Standard deviation (sample) = 0.556

From the empirical rule we get:

Approximately 68% of the data lie between 17.403 and 18.515 [17.959 - 1(.556), 17.959 + 1(.556)]. Actual count: 19 (73%)

Approximately 95% of the data lie between 16.847 and 19.071 [17.959 - 2(.556), 17.959 + 2(.556)]. Actual count: 25 (96%)

Approximately 99.7% of the data lie between 16.291 and 19.627 [17.959 - 3(.556), 17.959 + 3(.556)]. Actual count: 26 (100%)
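The three empirical-rule intervals follow directly from the mean and standard deviation reported for Example 10. The Python sketch below recomputes them; the helper name `k_sigma_interval` is illustrative, and the raw data (Data 10) are not needed for this step.

```python
def k_sigma_interval(mean, sd, k):
    """Interval (mean - k*sd, mean + k*sd)."""
    return (mean - k * sd, mean + k * sd)


mean, sd = 17.959, 0.556   # values reported for Example 10

for k, share in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high = k_sigma_interval(mean, sd, k)
    print(f"approximately {share} of the data: {low:.3f} to {high:.3f}")
```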

Page 44: 1 Descriptive Statistics: Numerical Methods Chapter 4

44

The Chebyshev Theorem - Describing Any Data Set

The proportion of observations in any sample that lie within k standard deviations of the mean is at least $1 - 1/k^2$ for any k > 1.

This theorem is valid for any set of measurements (sample, population) of any shape!

k   Interval                               Minimum %
2   $(\bar{x} - 2s,\ \bar{x} + 2s)$        at least 75%   (1 - 1/2²)
3   $(\bar{x} - 3s,\ \bar{x} + 3s)$        at least 89%   (1 - 1/3²)
4   $(\bar{x} - 4s,\ \bar{x} + 4s)$        at least 94%   (1 - 1/4²)
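The minimum percentages in the table come straight from the expression 1 - 1/k². A minimal Python sketch (the function name `chebyshev_lower_bound` is illustrative):

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%}")
```

Unlike the empirical rule, this bound does not assume a mound-shaped distribution, which is why it is looser.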

Page 45: 1 Descriptive Statistics: Numerical Methods Chapter 4

45

The Chebyshev Theorem

Example 9

Employee salaries were recorded and a histogram was created. Describe these data using the correct numerical measures.

[Figure: histogram; horizontal axis: Salary (155 to 425), vertical axis: Frequency.]

Solution

From the histogram we see that the distribution is positively skewed. The Chebyshev Theorem needs to be used to describe the data.

Page 46: 1 Descriptive Statistics: Numerical Methods Chapter 4

46

The Chebyshev Theorem

Example 9 – solution continued

From Excel we have:

Mean = 243.2
Standard deviation = 58.354

Applying the Chebyshev Theorem:

• At least 75% of the salaries lie within [243.2-2(58.354), 243.2+2(58.354)] = [126.492, 359.908]. Actual count: 39 (97.5%)

• At least 88.9% of the salaries lie within [243.2-3(58.354), 243.2+3(58.354)] = [68.138, 418.262]. Actual count: all (100%)

Page 47: 1 Descriptive Statistics: Numerical Methods Chapter 4

47

The Coefficient of Variation

The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.

Sample coefficient of variation: $cv = \dfrac{s}{\bar{x}}$

Population coefficient of variation: $CV = \dfrac{\sigma}{\mu}$

This coefficient provides a proportionate measure of variation.

A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
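The comparison in the last sentence can be expressed numerically. The sketch below simply divides the standard deviation by the mean for the two cases mentioned; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation divided by the mean (undefined when the mean is zero)."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return sd / mean


print(coefficient_of_variation(10, 100))   # standard deviation 10, mean 100
print(coefficient_of_variation(10, 500))   # standard deviation 10, mean 500
```

The same standard deviation is 10% of the mean in the first case and 2% in the second.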

Page 48: 1 Descriptive Statistics: Numerical Methods Chapter 4

48

4.3 Measures of Relative Location and Box Plots

Additional information on the general shape of a data set can be obtained by describing the relative location of 5 values within the data set.

We use percentiles to describe these 5 relative locations. What is a percentile?

Page 49: 1 Descriptive Statistics: Numerical Methods Chapter 4

49

4.3 Measures of Relative Location and Box Plots

Percentile

The pth percentile of a set of measurements is the value for which
• At most p% of the measurements are less than that value
• At most (100-p)% of all the measurements are greater than that value.

Example

Suppose your score is the 60th percentile of a SAT test. Then 60% of all the scores lie below your score, and 40% lie above it.

Page 50: 1 Descriptive Statistics: Numerical Methods Chapter 4

50

4.3 Measures of Relative Location and Box Plots

Here are two possible approaches commonly used to describe a set of values.

The five number summary:
• Smallest value
• First quartile (Q1)
• Median (Q2)
• Third quartile (Q3)
• Largest value

- OR -
• The first decile (the 10th percentile)
• First quartile (Q1)
• Median (Q2)
• Third quartile (Q3)
• The ninth decile (the 90th percentile)

Page 51: 1 Descriptive Statistics: Numerical Methods Chapter 4

51

A demonstration of commonly used percentiles

• First (lower) decile = 10th percentile
• First (lower) quartile, Q1 = 25th percentile
• Median = 50th percentile
• Third quartile, Q3 = 75th percentile
• Ninth (upper) decile = 90th percentile

Lower decile: 10% of the measurements lie below it and 90% lie above it.

Page 52: 1 Descriptive Statistics: Numerical Methods Chapter 4

52

A demonstration of commonly used percentiles - optional

Lower quartile: 25% of the measurements lie below it and 75% lie above it.

Page 53: 1 Descriptive Statistics: Numerical Methods Chapter 4

53

A demonstration of commonly used percentiles

Median: 50% of the measurements lie below it and 50% lie above it.

And so on…

Page 54: 1 Descriptive Statistics: Numerical Methods Chapter 4

54

Determining Percentiles and their Location

There are two general cases to consider:
• The percentile is a member of the data set.
• The percentile is not a member of the data set; it falls in between two values of the data set.

Let us demonstrate the two cases with two examples.

Page 55: 1 Descriptive Statistics: Numerical Methods Chapter 4

55

Example 11

Find the quartiles for the data set of flight lateness presented in example 4.5.

Data: 8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05

Determining Percentiles and their Location

Page 56: 1 Descriptive Statistics: Numerical Methods Chapter 4

56

Example 11 - Solution

Sort the measurements (10 measurements):

2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9

The first quartile

At most (.25)(10) = 2.5 measurements should appear below the first quartile. Check the smallest 2 measurements on the left-hand side.

At most (.75)(10) = 7.5 measurements should appear above the first quartile. Check the largest 7 measurements on the right-hand side.

Determining Percentiles and their Location

Page 57: 1 Descriptive Statistics: Numerical Methods Chapter 4

57

Example 11 – solution continued

The second quartile (median):

• At most (.5)(10) = 5 numbers lie below Q2 and at most 5 numbers lie above it.

• 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9

Q2 = (8.3 + 20.9)/2 = 14.6

Determining Percentiles and their Location

Page 58: 1 Descriptive Statistics: Numerical Methods Chapter 4

58

Example 11 – solution continued

The third quartile:

• At most (.75)(10) = 7.5 numbers lie below Q3

• At most (.25)(10) = 2.5 numbers lie above Q3

• 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9

Determining Percentiles and their Location

Page 59: 1 Descriptive Statistics: Numerical Methods Chapter 4

59

Example 12

Find the 20th percentile for the data set of flight lateness presented in Example 11.

Solution

Following the procedure applied to the previous example,

• At most (.20)(10) = 2 numbers should fall below the 20th percentile.
• At most (.80)(10) = 8 numbers should fall above the 20th percentile.
• The sorted data set is: 2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9
• From the sorted data set we see that every number greater than 3.1 and smaller than 5.2 meets these two conditions.
• We show next how to determine the location and value of a percentile whose value is not one of the data set points.

Determining Percentiles and their Location

Page 60: 1 Descriptive Statistics: Numerical Methods Chapter 4

60

Find the location of any percentile using the formula

$$L_P = (n + 1)\frac{P}{100}$$

where $L_P$ is the location of the P-th percentile.

Determining Percentiles and their Location

Page 61: 1 Descriptive Statistics: Numerical Methods Chapter 4

61

Example 12 - solution continued

Finding the location of the 20th percentile:

$$L_{20} = (n + 1)\frac{P}{100} = (10 + 1)\frac{20}{100} = 2.2$$

2.7, 3.1, 5.2, 6.2, 8.3, 20.9, 24.4, 30.05, 33.6, 42.9

Finding the value of the 20th percentile:

The 20th percentile is at location 2.2, that is, at .2 of the distance from 3.1 (location 2) to 5.2 (location 3). Therefore,

P20 = 3.1 + .2(5.2 – 3.1) = 3.52
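The location formula (n + 1)P/100 and the interpolation step can be sketched in Python as below. The function name `percentile` and the clamping at the ends of the data are illustrative choices, not part of the slides; other software may use different percentile conventions and return slightly different values.

```python
def percentile(values, p):
    """P-th percentile using the location L = (n + 1) * P / 100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100
    # Assumption: clamp locations that fall outside the ordered array.
    if location <= 1:
        return ordered[0]
    if location >= n:
        return ordered[-1]
    lower = int(location)             # 1-based location just below L
    fraction = location - lower       # fractional distance to the next location
    below = ordered[lower - 1]
    above = ordered[lower]
    return below + fraction * (above - below)


data = [8.3, 6.2, 20.9, 2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05]
for p in (20, 25, 50):
    location = (len(data) + 1) * p / 100
    print(f"P{p}: location {location:.2f}, value {percentile(data, p):.3f}")
```

For these ten measurements, P = 20 gives location 2.2, P = 25 gives location 2.75, and P = 50 gives location 5.5.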

Determining Percentiles and their Location

Page 62: 1 Descriptive Statistics: Numerical Methods Chapter 4

62

Quartiles and Variability

Quartiles can provide an idea about the shape of a histogram.

[Figure: positively skewed histogram with Q1, Q2, Q3 marked.]

[Figure: negatively skewed histogram with Q1, Q2, Q3 marked.]

Page 63: 1 Descriptive Statistics: Numerical Methods Chapter 4

63

Inter-quartile Range

Interquartile range = Q3 – Q1

This is a measure of the spread of the middle 50% of the observations.

A large value indicates a large spread of the observations.

Page 64: 1 Descriptive Statistics: Numerical Methods Chapter 4

64

Box Plot

A box plot is a pictorial display that provides the main descriptive measures of the measurement set:

• L - the largest measurement
• Q3 - the upper quartile
• Q2 - the median
• Q1 - the lower quartile
• S - the smallest measurement

[Figure: box plot with S, Q1, Q2, Q3, and L marked; whiskers extend from the box, and a distance of 1.5(Q3 – Q1) is marked on each side of the box.]

An outlier is defined as any value that is more than 1.5(Q3 – Q1) away from the box.

Page 65: 1 Descriptive Statistics: Numerical Methods Chapter 4

65

Box Plot

Example 13

Create a box plot for the data regarding the GMAT scores of 200 applicants (see Data13.xls).

GMAT: 512, 531, 461, 515, ...

Smallest = 449
Q1 = 512
Median = 537
Q3 = 575
Largest = 788
IQR = 63
Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675)

512 - 1.5(IQR) = 417.5
575 + 1.5(IQR) = 669.5

[Figure: box plot marked at 417.5, 449, 512, 537, 575, 669.5, and 788.]
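The two outlier limits in Example 13 follow from the quartiles alone. The Python sketch below recomputes them and checks the listed outliers against the upper limit; the function name `outlier_fences` is illustrative, and the full GMAT data set (Data13.xls) is not reproduced here.

```python
def outlier_fences(q1, q3):
    """Limits 1.5 * IQR below Q1 and above Q3; values outside them are outliers."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr)


lower_fence, upper_fence = outlier_fences(512, 575)
print("IQR:", 575 - 512, "fences:", lower_fence, upper_fence)

# Outliers as listed in Example 13.
listed_outliers = [788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675]
print(all(score > upper_fence for score in listed_outliers))

# The smallest score (449) is above the lower limit, so there are no low outliers.
print(449 >= lower_fence)
```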

Page 66: 1 Descriptive Statistics: Numerical Methods Chapter 4

66

Box Plot

Example 13 - continued

Interpreting the box plot results
• The scores range from 449 to 788.
• About half the scores are smaller than 537, and about half are larger than 537.
• About half the scores lie between 512 and 575.
• About a quarter lie below 512 and a quarter above 575.

[Figure: box plot with Q1 = 512, Q2 = 537, Q3 = 575; 25% of the scores lie below Q1, 50% between Q1 and Q3, and 25% above Q3; whiskers end at 449 and 669.5.]

Page 67: 1 Descriptive Statistics: Numerical Methods Chapter 4

67

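Box Plot

Example 13 - continued

The data set is positively skewed.

[Figure: box plot with Q1 = 512, Q2 = 537, Q3 = 575; 25% of the scores lie below Q1, 50% between Q1 and Q3, and 25% above Q3; whiskers end at 449 and 669.5.]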

Page 68: 1 Descriptive Statistics: Numerical Methods Chapter 4

68

4.4 Measures of Linear Relationship

The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.

• The covariance answers the question: Is there any pattern to the way two variables move together?
• The correlation coefficient answers the question: How strong is the linear relationship between two variables?

Page 69: 1 Descriptive Statistics: Numerical Methods Chapter 4

69

Covariance

Population covariance:

$$COV(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$

$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y). N is the population size.

Sample covariance:

$$cov(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

$\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y). n is the sample size.

Page 70: 1 Descriptive Statistics: Numerical Methods Chapter 4

70

Covariance

If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.

[Figure: example data for X and Y illustrating this case.]

Page 71: 1 Descriptive Statistics: Numerical Methods Chapter 4

71

Covariance

If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.

[Figure: example data for X and Y illustrating this case.]

Page 72: 1 Descriptive Statistics: Numerical Methods Chapter 4

72

Covariance

If the two variables are unrelated, the covariance will be close to zero.

[Figure: example data for X and Y illustrating this case.]

Page 73: 1 Descriptive Statistics: Numerical Methods Chapter 4

73

The coefficient of correlation

Population coefficient of correlation:

$$\rho = \frac{COV(X, Y)}{\sigma_x \sigma_y}$$

Sample coefficient of correlation:

$$r = \frac{cov(X, Y)}{s_x s_y}$$

The coefficient of correlation measures the strength of the linear relationship between two variables.

Page 74: 1 Descriptive Statistics: Numerical Methods Chapter 4

74

If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).

The coefficient of correlation

Page 75: 1 Descriptive Statistics: Numerical Methods Chapter 4

75

The coefficient of correlation

If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).

Page 76: 1 Descriptive Statistics: Numerical Methods Chapter 4

76

A weak linear relationship is indicated by a coefficient close to zero.

Also, a non-linear relationship can translate to a weak linear relationship.

The coefficient of correlation

Page 77: 1 Descriptive Statistics: Numerical Methods Chapter 4

77

Example 14

Compute the covariance and the coefficient of correlation to measure how car speed (miles per hour) and gas consumption (miles per gallon) are related to one another (see data next).

Solution

We believe speed affects gas consumption. Thus
• Speed is labeled X
• Miles per gallon is labeled Y

The coefficient of correlation and the covariance

Page 78: 1 Descriptive Statistics: Numerical Methods Chapter 4

78

The coefficient of correlation and the covariance

Example 14 – solution continued

Shortcut formulas:

$$cov(x, y) = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]$$

$$s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$

Car     x     y       x²      y²        xy
1       15    7.1     225     50.41     106.5
2       35    15.5    1225    240.25    542.5
3       35    18.5    1225    342.25    647.5
4       40    19.7    1600    388.09    788
5       40    22.4    1600    501.76    896
6       45    21.3    2025    453.69    958.5
7       45    22.8    2025    519.84    1026
8       45    23.1    2025    533.61    1039.5
9       50    22.8    2500    519.84    1140
10      50    21.3    2500    453.69    1065
Total   400   194.5   16950   4003.43   8209.5

Using the shortcut formula:

$$cov(x, y) = \frac{1}{10 - 1}\left[8209.5 - \frac{(400)(194.5)}{10}\right] = 47.7$$

Page 79: 1 Descriptive Statistics: Numerical Methods Chapter 4

79

The coefficient of correlation and the covariance

Example 14 – solution continued

To calculate the coefficient of correlation we first compute the standard deviations of X and Y. From the shortcut formula (with the column totals Σx = 400, Σy = 194.5, Σx² = 16950, Σy² = 4003.43) we have:

$$s_x = \sqrt{\frac{1}{10 - 1}\left[16950 - \frac{400^2}{10}\right]} = 10.27$$

$$s_y = \sqrt{\frac{1}{10 - 1}\left[4003.43 - \frac{194.5^2}{10}\right]} = 4.948$$

Page 80: 1 Descriptive Statistics: Numerical Methods Chapter 4

80

The coefficient of correlation and the covariance

Example 14 – solution continued

$$r = \frac{cov(X, Y)}{s_x s_y} = \frac{47.7}{(10.27)(4.948)} = .9387$$

Interpretation: Speed and miles per gallon are strongly positively linearly related for the speed range of 15 to 50 miles per hour.
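The shortcut formulas for Example 14 can be evaluated directly from the ten (speed, miles per gallon) pairs in the table. The Python sketch below is one way to do so; the function names `sample_covariance` and `sample_sd` are illustrative.

```python
import math

# Data from the Example 14 table: x = speed (mph), y = miles per gallon.
speed = [15, 35, 35, 40, 40, 45, 45, 45, 50, 50]
mpg = [7.1, 15.5, 18.5, 19.7, 22.4, 21.3, 22.8, 23.1, 22.8, 21.3]


def sample_covariance(x, y):
    """Shortcut formula: [sum(xy) - sum(x) * sum(y) / n] / (n - 1)."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / (n - 1)


def sample_sd(x):
    """Shortcut formula: sqrt([sum(x^2) - (sum(x))^2 / n] / (n - 1))."""
    n = len(x)
    sum_sq = sum(a * a for a in x)
    return math.sqrt((sum_sq - sum(x) ** 2 / n) / (n - 1))


cov_xy = sample_covariance(speed, mpg)
s_x, s_y = sample_sd(speed), sample_sd(mpg)
r = cov_xy / (s_x * s_y)

print(f"cov = {cov_xy:.2f}, s_x = {s_x:.2f}, s_y = {s_y:.3f}, r = {r:.3f}")
```

With unrounded intermediate values the coefficient of correlation comes out at about 0.939.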