8/9/2015slide 1 the standard deviation statistic is challenging to present to our audiences....

Upload: lillian-park

Post on 24-Jan-2016

04/21/23 Slide 1

• The standard deviation statistic is challenging to present to our audiences.

• Statisticians often resort to the “empirical rule” to describe the standard deviation.

• The empirical rule states that approximately 68% of the cases in a normal distribution fall within one standard deviation of the mean, 95% of the cases in a normal distribution fall within two standard deviations of the mean, and 99.7% of all cases fall within three standard deviations of the mean.

• While distributions of real quantitative variables are usually not normal, the empirical rule has been demonstrated to be applicable if the distribution is “nearly normal.”

• The determination that a variable is “nearly normal” requires us to propose a set of criteria for determining the boundary between “nearly normal” and “not nearly normal.”
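The percentages quoted in the empirical rule can be recovered from the normal distribution itself. The following is a minimal sketch in Python rather than SPSS; the function name `share_within` is our own, chosen only for illustration.

```python
import math

def share_within(k: float) -> float:
    """Share of a normal distribution within k standard deviations of the mean."""
    # For a standard normal variable Z, P(-k <= Z <= k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

The three results come out at roughly 68%, 95%, and 99.7%, which is why the rule is usually stated with those rounded figures.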

04/21/23 Slide 2

• As with all of the criteria that we use in statistics, we will propose a set of criteria, recognize that they are really an approximation rather than a precise estimate, and hope that common sense will prevail in applying them.

• We have previously identified the criteria we will use for assessing the “nearly normal” condition: skewness, kurtosis, and extreme outliers.

• We will use our previous requirements for skewness and kurtosis (both between -1.0 and +1.0), but we will define outliers as cases that are more than 3 standard deviations from the mean (either above or below).

• The last criterion is derived from the empirical rule: if 99.7% of the cases in a normal distribution fall within three standard deviations of the mean, then those that fall outside three standard deviations must be relatively uncommon.
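The three-standard-deviation criterion can be sketched directly as a comparison against the mean plus or minus three standard deviations. This is an illustrative Python sketch, not the SPSS procedure used in these problems; the function name and the scores are made up, and we assume the sample standard deviation (n - 1 denominator).

```python
import statistics

def three_sd_outliers(values):
    """Return the values lying more than 3 sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    lower, upper = mean - 3 * sd, mean + 3 * sd
    return [x for x in values if x < lower or x > upper]

# Made-up scores: twenty ordinary values plus one extreme case.
scores = [8, 9, 10, 11, 12] * 4 + [100]
print(three_sd_outliers(scores))
```

Note that an extreme case inflates the standard deviation it is measured against, so in very small samples no case can reach three standard deviations from the mean.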

04/21/23 Slide 3

• The requirement to compare the scores in a distribution to the mean plus or minus three standard deviations could lead to a lot of tedious arithmetic.

• Fortunately, there is a relatively easy substitute – converting the values in the distribution to “standard scores.”

• Standard scores convert the values in any distribution into the distance between each score and the mean of the distribution, expressed in standard deviation units.

• Standardizing variables gives them a common unit of measure that makes it easy to compare scores across quantitative variables.

• For example, if I converted a student’s GRE score (e.g., 1100) and GPA (e.g., 3.78) to standard scores, I would know which was farther above the mean for all students, and thus the stronger indicator of academic potential.
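A standard score is simply (value - mean) / standard deviation. The sketch below applies that formula to the GRE and GPA example in Python. The means and standard deviations are hypothetical, invented only so the arithmetic can be shown; they are not taken from any real group of students.

```python
def z_score(value: float, mean: float, sd: float) -> float:
    """Distance of a value from the mean, in standard deviation units."""
    return (value - mean) / sd

# Hypothetical means and SDs for "all students" (assumptions for illustration).
gre_z = z_score(1100, mean=1000, sd=200)
gpa_z = z_score(3.78, mean=3.30, sd=0.40)

stronger = "GPA" if gpa_z > gre_z else "GRE"
print(f"GRE z = {gre_z:.2f}, GPA z = {gpa_z:.2f}, stronger indicator: {stronger}")
```

Under these made-up figures the GPA sits farther above its mean than the GRE score does, even though the raw numbers are on completely different scales.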

04/21/23 Slide 4

• SPSS will automatically convert any distribution to standard scores (also referred to as z-scores) and we can use the same formula over and over to identify outliers.

• Many procedures use standardized scores to present findings or diagnostics, e.g. we will analyze standardized residuals in regression analysis.

• If the original variable does not satisfy the “nearly normal” condition, we will re-express the data values as logarithms or squares to see if we can induce normality.

• If the transformation is successful at meeting the criteria for a nearly normal distribution, we will calculate the percentage of cases falling within 1 and 2 standard deviations of the mean and compare our findings to the percentage prescribed by the empirical rule.

04/21/23 Slide 5

• In these problems, we will base our assessment of normality on more expanded criteria than we have used previously.

• Since we are concerned with determining probabilities or percentages based on the normal distribution, we are concerned with kurtosis as well as skewness. The height of the distribution as measured by kurtosis has an impact on the standard deviation, which in turn has an impact on the percentage of cases within one standard deviation of the mean and within two standard deviations of the mean.

• In the last assignment, we used a boxplot strategy to identify outliers. In this assignment, we will define outliers as cases falling outside three standard deviations of the mean.

SOLVING HOMEWORK PROBLEMS

The Empirical Rule states that about 68% of the values will fall within 1 standard deviation of the mean and 95% of the values will fall within 2 standard deviations of the mean, provided the variable satisfies the nearly normal condition that the distribution is unimodal and symmetric.

There are numerous statistical tests and graphic methods for evaluating the normality of a distribution. In these problems, we will use a simple rule of thumb: the distribution of the variable is reasonably normal if both the skewness and the kurtosis of the distribution are between -1.0 and +1.0 and there are no outliers, that is, no cases three or more standard deviations below or above the mean.
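The rule of thumb can be written out as a short check. This Python sketch uses the bias-adjusted sample formulas for skewness and excess kurtosis; we assume these match the statistics SPSS reports, and the function names are our own.

```python
import statistics

def skewness(values):
    """Bias-adjusted sample skewness (assumed to match the SPSS statistic)."""
    n = len(values)
    mean, sd = statistics.mean(values), statistics.stdev(values)
    total = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total

def kurtosis(values):
    """Bias-adjusted sample excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    mean, sd = statistics.mean(values), statistics.stdev(values)
    total = sum(((x - mean) / sd) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * total - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

def nearly_normal(values):
    """Skewness and kurtosis within [-1, +1] and no |z| of 3.0 or more."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    has_outlier = any(abs((x - mean) / sd) >= 3.0 for x in values)
    return (-1.0 <= skewness(values) <= 1.0
            and -1.0 <= kurtosis(values) <= 1.0
            and not has_outlier)

# Made-up data: a small symmetric, mound-shaped set of values.
print(nearly_normal([1, 2, 2, 3, 3, 3, 4, 4, 5]))
```

All three conditions must hold at once; failing any one of them means the variable does not satisfy the nearly normal condition.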

Slide 6

If the distribution satisfies the nearly normal condition, we will test whether or not the percentages specified by the empirical rule hold for the variable. We will consider the rule to be satisfied if the actual percentage of values falls within 2 percentage points of the percentage indicated by the empirical rule.
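The comparison against the empirical rule amounts to a single absolute difference. The sketch below is in Python; it reads "within 2%" as 2 percentage points, and the function name is hypothetical.

```python
def close_to_rule(actual_pct: float, expected_pct: float, tolerance: float = 2.0) -> bool:
    """True if the actual percentage is within `tolerance` points of the expected one."""
    return abs(actual_pct - expected_pct) <= tolerance

# Percentages reported for the log of population in this problem.
print(close_to_rule(66.5, 68.0))  # one standard deviation
print(close_to_rule(95.9, 95.0))  # two standard deviations
```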

If the distribution does not satisfy the nearly normal condition, we will re-express the values and examine the impact on the normality assumption. If the variable is skewed to the right (positively skewed), we will compute the logarithm of the values. If the variable is negatively skewed, we will square the values. If the transformation is successful at meeting the criteria for a nearly normal distribution, we will calculate the percentage of cases falling within 1 and 2 standard deviations of the mean and compare the actual percentage to the percentage prescribed by the empirical rule.

Slide 7
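The choice of transformation depends only on the sign of the skewness. A minimal Python sketch of that decision follows, assuming the skewness has already been computed and that all values are positive when the logarithm is taken; the function name is our own.

```python
import math

def reexpress(values, skew):
    """Log-transform right-skewed data; square left-skewed data."""
    if skew > 0:
        # Base 10 logarithm; every value must be greater than zero.
        return [math.log10(x) for x in values]
    if skew < 0:
        return [x ** 2 for x in values]
    return list(values)  # no skew: leave the values unchanged

print(reexpress([1, 10, 100], skew=2.0))
print(reexpress([1, 2, 3], skew=-2.0))
```

The transformed values still have to be checked against the same criteria, since a transformation is not guaranteed to induce normality.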

Slide 8

The introductory statement in the question indicates:

• The data set to use (2001WorldFactBook)

• The task to accomplish (verifying the empirical rule for a normally distributed variable)

• The variable to use in the analysis: population [pop]

Slide 9

These problems also contain a second paragraph of instructions that provides the formulas to use if the analysis requires us to re-express or transform the variable to achieve normality, and the formula to restore the transformed values to the original scale.

Slide 10

The first statement concerns the number of valid cases. To answer this question, we produce the descriptive statistics using the SPSS Descriptives procedure.

The Descriptives procedure can also create standard scores for the variable, which will facilitate our check of the empirical rule.

Slide 11

To compute the descriptive statistics and standard scores, select the Descriptive Statistics > Descriptives command from the Analyze menu.

Slide 12

Move the variable for the analysis, pop, to the Variable(s) list box.

Click on the Options button to select optional statistics.

Slide 13

The check boxes for Mean and Std. Deviation are already marked by default.

Mark the Kurtosis and Skewness check boxes. This will provide the statistics for assessing normality.

Click on the Continue button to close the dialog box.

Slide 14

Mark the check box Save standardized values as variables.

Click on the OK button to produce the output.

Slide 15

If we scroll the Data View all the way to the right, we see that SPSS has created the standard scores. To name the new variable, SPSS prepends the letter “Z” to the original variable name.

Slide 16

In the output table for Descriptive Statistics, the number of valid cases for population is 218.

If we had more than one variable in the table, the Valid N (listwise) row would tell us the number of cases that are not missing data for any of the variables in the table.

SPSS does not tell us the number of cases that are missing data in this table. To get the number missing, we would have to compare the number of cases in the data set to the N for population.
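The number of missing cases is the number of cases in the data set minus the valid N. As a small Python sketch of that subtraction, with made-up values and with None standing in for a missing value (an assumption for illustration):

```python
def count_valid_and_missing(column):
    """Return (valid N, number missing) for a column where None marks missing data."""
    valid = sum(1 for value in column if value is not None)
    return valid, len(column) - valid

# Made-up column with two missing values.
example = [3.2, None, 5.1, 4.8, None, 6.0]
valid_n, missing_n = count_valid_and_missing(example)
print(f"valid: {valid_n}, missing: {missing_n}")
```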

Slide 17

The 'Descriptive Statistics' table in the SPSS output showed the number of cases for the variable "population" [pop] to be 218.

Click on the check box to mark the statement as correct.

Slide 18

The next statement requires us to check the evidence for meeting the “nearly normal condition”:

• Skewness between -1.0 and +1.0

• Kurtosis between -1.0 and +1.0

• No outliers with standard scores less than or equal to -3.0 or greater than or equal to +3.0

Slide 19

"Population" [pop] did not satisfy the criteria for a normal distribution. Both the skewness (11.71) and kurtosis (155.82) fell outside the range from -1.0 to +1.0.

Slide 20

Though we know that we do not satisfy the “nearly normal” condition, we will still do the check for outliers.

Click the right mouse button on the column header for Zpop, and select Sort Ascending from the pop-up menu. This will show any negative outliers at the top of the column.

Slide 21

At the top of the column, we do not see any negative values less than or equal to -3.0.

Slide 22

Click the right mouse button again on the column header for Zpop, and select Sort Descending from the pop-up menu. This will show any positive outliers at the top of the column.

Slide 23

At the top of the column, we see one positive value (13.52) greater than or equal to +3.0.

Slide 24

If we scroll back to the left, we see that the outlier for population was China, with a population of 1,273,111,290.

Slide 25

"Population" [pop] did not satisfy the criteria for a normal distribution. Both the skewness (11.71) and kurtosis (155.82) fell outside the range from -1.0 to +1.0.

There was one outlier that had a standard score less than or equal to -3.0 or greater than or equal to +3.0: China, with a value of 1,273,111,290 (z = 13.52).

We do not mark the check for the nearly normal condition.

Slide 26

The next pair of statements asks us about two possibilities for re-expressing the values to see if the transformed distribution satisfies the nearly normal condition.

If the skewness of the distribution of the variable is positive, we test the log transformation. If the skewness is negative, we test the square transformation.

In this problem, the skewness was 11.71, so we use the logarithmic transformation.

Slide 27

The formula for transforming pop to LG_pop is provided in the second paragraph of instructions:

LG10(pop).
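LG10 is the base 10 logarithm. As a quick Python illustration of what the transformation does to a single value, the sketch below uses China's population from this data set. Raising 10 to the transformed value is the mathematical inverse of LG10; we assume that is the restoring formula the instructions refer to, since the transcript does not show it.

```python
import math

pop_china = 1_273_111_290         # China's population as listed in the data set

lg_pop = math.log10(pop_china)    # same operation as LG10(pop) in SPSS
restored = 10 ** lg_pop           # inverse of the base 10 log

print(f"LG_pop = {lg_pop:.3f}")
print(f"restored = {restored:,.0f}")
```

A value that was more than a billion on the original scale becomes a value a little above 9 on the log scale, which is how the transformation pulls in the long right tail.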

Slide 28

To compute the transformed variable, select the Compute command from the Transform menu.

Slide 29

In the Compute Variable dialog box, we type the name for the new variable, LG_pop, in the Target Variable text box.

In the Numeric Expression text box, type the formula as shown to compute base 10 logarithms of the values of pop.

My convention for naming transformed variables is to add the variable name to the letters LG_ for a log transformation and SQ_ for a square transformation. This helps me keep the relationship between the variables clear.

Click on the OK button to compute the transformed variable.

Slide 30

Scroll the data editor window to the right to see the transformed variable, LG_pop.

Slide 31

To calculate the descriptive statistics so we can check the normality conditions for the transformed variable, click on the Dialog Recall tool button, and select Descriptives.

Slide 32

Since we want the same statistics that we computed for the variable pop, we only need to replace the variable pop with LG_pop.

Be sure the check box for saving standardized values remains checked so that Descriptives will compute standard scores for LG_pop.

Click on the OK button to produce the output.

Slide 33

The log transformation of "population" [LG_pop] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.50) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.41) was between -1.0 and +1.0.

Next, we will check for outliers that had a standard score less than or equal to -3.0 or greater than or equal to +3.0.

Slide 34

The Descriptives procedure adds ZLG_pop to the data set.

When we sort ZLG_pop in ascending order, we see that there are no outliers with standard scores less than or equal to -3.0.

Slide 35

When we sort ZLG_pop in descending order, we see that there are no outliers with standard scores greater than or equal to +3.0.

Slide 36

The log transformation of "population" [LG_pop] satisfied the criteria for a normal distribution. The skewness of the distribution (-0.50) was between -1.0 and +1.0 and the kurtosis of the distribution (-0.41) was between -1.0 and +1.0.

There were no outliers that had a standard score less than or equal to -3.0 or greater than or equal to +3.0.

The log distribution satisfies the nearly normal condition so we mark the check box.

Slide 37

The final pair of questions in the problem focuses on verifying whether or not percentages based on the distribution of the log transformed variable agree with the percentages specified in the empirical rule.

Slide 38

We will create a new variable that will have a value of 1 if the standard score is within 1 standard deviation of the mean, and 0 if it has a value outside this range.

To compute the new variable, select the Compute command from the Transform menu.

Slide 39

We will name the new variable within1sd, selecting a name which describes its contents. Type the formula as shown in the Numeric Expression text box.

The formula will assign within1sd a value of 1 if the standard score for the log transformation of population is greater than or equal to -1.0 and less than or equal to +1.0. If the value is not between -1.0 and +1.0, within1sd will be assigned a 0.
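The logic of the indicator variable, and of the frequency tally run on it, can be sketched in Python as follows. This is not the SPSS formula shown on the slide, which the transcript does not reproduce; the function names and the sample values are made up for illustration.

```python
import statistics

def within_k_sd_flags(values, k=1.0):
    """1 if a value's standard score is between -k and +k inclusive, else 0."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    return [1 if -k <= (x - mean) / sd <= k else 0 for x in values]

def percent_flagged(flags):
    """Percentage of cases coded 1, as a frequency table would report it."""
    return 100.0 * sum(flags) / len(flags)

# Made-up values standing in for a transformed variable.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
within1sd = within_k_sd_flags(sample, k=1.0)
print(within1sd, f"{percent_flagged(within1sd):.1f}%")
```

Calling the same function with k=2.0 gives the corresponding indicator for two standard deviations.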

Slide 40

Scroll down in data view to see values of 0 and 1 for within1sd.

When the standard scores for LG_pop are larger than 1.0, within1sd is assigned the value of 0.

When the standard scores for LG_pop are less than or equal to 1.0, within1sd is assigned the value of 1.

Slide 41

To find the percentage of cases that have a standard score between -1.0 and +1.0 (within1sd = 1), we will run a frequency distribution on within1sd.

To create the frequency distribution, select Descriptive Statistics > Frequencies from the Analyze menu.

Slide 42

First, move the variable within1sd to the Variable(s) list box.

Second, click on the OK button to produce the output.

Slide 43

66.5% of the values fall within one standard deviation of the mean.

If we use 2% as the margin of error, 66.5% is within 2% of the 68% prescribed by the empirical rule.

Slide 44

We will create a second new variable that will have a value of 1 if the standard score is within 2 standard deviations of the mean, and 0 if it has a value outside this range.

To compute the new variable, select the Compute Variable command from the Recall Dialog pop-up menu.

Slide 45

Replace the variable name “within1sd” with the name “within2sd”.

Replace the criteria of -1.0 with -2.0 and replace +1.0 with +2.0.

Slide 46

Scroll down in data view to see values of 0 and 1 for within2sd.

When the standard scores for LG_pop are larger than 2.0, within2sd is assigned the value of 0.

When the standard scores for LG_pop are less than or equal to 2.0, within2sd is assigned the value of 1.

Slide 47

We will request a second frequency distribution to tally within2sd.

To request the frequency distribution, select the Frequencies command from the Recall Dialog pop-up menu.

Slide 48

First, remove the variable within1sd from the Variable(s) list box and move the variable within2sd into the list box.

Second, click on the OK button to produce the output.

Slide 49

95.9% of the values fall within two standard deviations of the mean.

If we use 2% as the margin of error, 95.9% is within 2% of the 95% prescribed by the empirical rule.

Slide 50

The actual percentage of the values of ZLG_pop between -1.0 and +1.0 was 66.5%, which is within 2% of 68%. The actual percentage of the values of ZLG_pop between -2.0 and +2.0 was 95.9%, which is within 2% of 95%.

The statement that "the actual percentage of cases within one standard deviation of the mean was close to the percentage predicted by the empirical rule" is correct. The statement that "the actual percentage of cases within two standard deviations of the mean was close to the percentage predicted by the empirical rule" is correct.

We mark both of the check boxes.

Slide 51

In solving this problem, we created five variables in the data set: Zpop, LG_pop, ZLG_pop, within1sd, and within2sd.

Since subsequent problems will create other, additional variables, I suggest you delete the created variables at the conclusion of each problem by selecting the columns in the data set and using the Clear command in the Edit menu.