
Upload: trevor-jefferson

Post on 24-Dec-2015


Slide 1

SOLVING THE HOMEWORK PROBLEMS

Simple linear regression is an appropriate model of the relationship between two quantitative variables provided the data satisfies the assumption of linearity in a scatterplot of the raw data, provided the spread of the residuals is equal for all of the predicted values in the residual plot, and provided there are no outliers impacting the linear model.

We test the assumption of linearity by comparing a loess fit line to a linear fit line in a scatterplot of the raw data for the variables. The assumption of linearity is satisfied if the loess fit line does not curve outside a confidence interval plotted around the linear fit line. We confirm our evaluation of linearity by checking the plot of the standardized residuals by the standardized predictions (the residual plot) for linearity. Our assessment of linearity in the residual plot should agree with our assessment of the scatterplot of the raw data.
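
Outside SPSS, the comparison of a loess line against the band around the linear fit can be roughed out in a few lines. The sketch below is only an illustration under stated assumptions: it uses a simplified tricube-weighted local linear fit (no robustness passes) rather than SPSS's loess, a normal quantile in place of the t quantile for the 99% band, and a hypothetical span of 0.5. All function names and the sample data are made up for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean

def linear_fit(x, y):
    """Ordinary least squares line; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def mean_band(x, y, x0, level=0.99):
    """Approximate confidence interval for the mean of y at x0."""
    n = len(x)
    slope, intercept = linear_fit(x, y)
    mx = mean(x)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))  # standard error of the estimate
    # Assumption: normal quantile used instead of the t quantile.
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    half = crit * s * sqrt(1 / n + (x0 - mx) ** 2 / sxx)
    centre = intercept + slope * x0
    return centre - half, centre + half

def loess_point(x, y, x0, frac=0.5):
    """Tricube-weighted local linear fit evaluated at x0 (simplified loess)."""
    k = max(2, ceil(frac * len(x)))
    h = sorted(abs(xi - x0) for xi in x)[k - 1]  # distance to k-th neighbour
    if h == 0:
        return mean(yi for xi, yi in zip(x, y) if xi == x0)
    w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:
        return my
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    return my + (sxy / sxx) * (x0 - mx)

def points_outside_band(x, y, frac=0.5, level=0.99):
    """x values where the loess curve leaves the band around the linear fit."""
    outside = []
    for x0 in sorted(set(x)):
        low, high = mean_band(x, y, x0, level)
        if not low <= loess_point(x, y, x0, frac) <= high:
            outside.append(x0)
    return outside

# Hypothetical, clearly curved data: y = x squared.
xs = [float(i) for i in range(21)]
ys = [xi ** 2 for xi in xs]
print(points_outside_band(xs, ys))
```

A non-empty list is a prompt to look at the plot, not a verdict: isolated points that touch or slightly cross the band are the minor fluctuations we ignore, while a long run in the middle and at both ends is the consistent pattern that indicates non-linearity.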

We test the assumption of equal variance by examining the overall shape of the points in a plot of the standardized residuals by the standardized predictions. If the spread of the residuals is the same across the predicted values, the assumption is satisfied. If the pattern is narrower at one side than the other (funnel shaped), the assumption is violated.

Slide 2

SOLVING THE HOMEWORK PROBLEMS (continued)

The identification of outliers is based on the presence of standardized residuals or standardized predicted values that have an absolute value equal to or greater than 3.0 in the table of residual statistics that is part of the regression output in SPSS.
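
The outlier rule is a simple threshold test, so it can be written as a one-line filter. This is a minimal sketch; the function name is hypothetical. The first two lists are the minimum and maximum values that the residual statistics table reports for this problem, and the third list is invented to show a flagged case.

```python
def flag_outliers(z_values, cutoff=3.0):
    """Return the standardized values whose absolute value is at least cutoff."""
    return [z for z in z_values if abs(z) >= cutoff]

residual_extremes = [-2.135, 2.578]    # min and max standardized residual
prediction_extremes = [-1.566, 2.850]  # min and max standardized prediction

print(flag_outliers(residual_extremes))    # []
print(flag_outliers(prediction_extremes))  # []
print(flag_outliers([-3.0, 0.4, 3.2]))     # hypothetical values -> [-3.0, 3.2]
```

Checking only the minimum and maximum is enough to decide whether any outlier exists, which is why the table is sufficient for this question.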

In this problem set, we will not re-express the variables in an effort to satisfy any assumptions violated by the distributions of the variables.

When we satisfy these requirements, simple linear regression provides us with information about the direction and the strength of the linear relationship. The direction of the relationship is based on the sign of the slope. If the slope is positive, there is a direct or positive relationship and the scores on the variables move in the same direction. If the slope is negative, there is an inverse or negative relationship and the scores on the variables move in opposite directions.
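
As a small illustration, reading the direction from the sign of the slope can be written as follows. The function name and the label for a zero slope are assumptions; the value 0.25 is the slope that the SPSS output reports for this problem.

```python
def describe_direction(slope):
    """Translate the sign of a regression slope into a direction label."""
    if slope > 0:
        return "direct (positive)"
    if slope < 0:
        return "inverse (negative)"
    return "no linear trend"  # assumption: label chosen for a slope of exactly zero

print(describe_direction(0.25))  # direct (positive)
```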

The strength of the relationship is based on the coefficient of determination, R². Since R² is the square of Pearson's r, we can square the Tukey and Cohen guidelines for interpreting the strength of the relationship.
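
Squaring the guidelines is plain arithmetic. In the sketch below, the r cutoffs are an assumption: they are inferred as the square roots of the R² cutoffs quoted later in these slides, not copied from Tukey or Cohen directly.

```python
# Assumed cutoffs on the scale of Pearson's r (inferred, see above).
TUKEY_R_CUTOFFS = [0.2, 0.4, 0.6, 0.8]
COHEN_R_CUTOFFS = [0.1, 0.3, 0.5]

def squared_cutoffs(r_cutoffs):
    """Convert cutoffs for r into cutoffs for R squared, rounded to 2 places."""
    return [round(r * r, 2) for r in r_cutoffs]

tukey_r2 = squared_cutoffs(TUKEY_R_CUTOFFS)
cohen_r2 = squared_cutoffs(COHEN_R_CUTOFFS)
print(tukey_r2)  # [0.04, 0.16, 0.36, 0.64]
print(cohen_r2)  # [0.01, 0.09, 0.25]
```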

Slide 3

The introductory statement in the question indicates:
• The data set to use (world2007.sav)
• The task to accomplish (simple linear regression)
• The variables to use in the analysis: percent of the population below poverty line [poverty] and disparity in the distribution of family income (Gini index) [ginindex].

Slide 4

The first question concerns the appropriateness of using linear regression to analyze the relationship between these variables. We will create and analyze a scatterplot of the raw data to answer this question.

Slide 5

To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.

Slide 6

In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create.

Click on the Define button to go to the next step.

Slide 7

First, move the dependent variable ginindex to the Y axis text box.

Second, move the independent variable poverty to the X axis text box.

Third, click on the OK button to produce the plot.

Slide 8

The scatterplot appears in the SPSS output window.

To facilitate our determination about the linearity of the plot, we will add a linear fit line and a loess fit line.

To edit the plot, double click on the plot area to open the chart in the Chart Editor.

Slide 9

In the Chart Editor, click on the tool button for adding a fit line.

Slide 10

When we click on the Add Fit Line at Total button, a fit line is added to the chart and the Properties dialog box appears. The blue outline around the fit lines shows that the fit line is selected.

In the Properties dialog box, we first accept the default Fit Method of Linear.

Second, we request a Confidence Interval around the Mean.

Third, we change the default width from 95% to 99%.

Fourth, we click on the tab for Lines to format the fit line.

The confidence interval is not designed to test linearity, but it provides us with a convenient guideline.

Slide 11

In the Lines tab page, select 1.5 from the Weight drop down menu to increase the width of the line.

Slide 12

Next, click on the Red color swatch to change the color of the line.

Slide 13

Click on the Apply button to change the appearance of the line and add the confidence interval lines to the chart.

Slide 14

The lines added to the scatterplot are complete, so we click on the Close button.

We next want to add a loess fit line to the chart. We close this fit line dialog and open a second fit line dialog to add the second fit line.

Trying to add both lines in one step can cause problems because the properties may get applied to the wrong lines.

Slide 15

In the Chart Editor, click on the tool button for adding a fit line again.

Slide 16

When we click on the Add Fit Line at Total button, a second fit line is added to the chart. (It is not very visible because it is overlaid on the first fit line.) The blue outline shows that the line is selected.

In the Properties dialog box, we first click on the Loess Fit Method option button.

Second, we click on the tab for Lines to format the fit line.

Slide 17

We change the weight of the loess line to 1.5, and the color to Cyan.

We click on the Apply button to change the line.

Slide 18

Having made the desired changes to the scatterplot, click on the Close box to exit the Chart Editor.

Slide 19

The loess fit line falls within the 99% confidence interval around the linear fit line. The relationship is linear. Minor fluctuations over the lines of the confidence interval are ignored.

The pattern of points in the scatterplot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied.

Slide 20

Do not over-react to minor fluctuations in the loess line, even if it touches or slightly crosses the confidence interval lines.

I would still characterize this plot as linear.

What we want to detect are major, consistent patterns of a non-linear relationship. In this plot, the middle of the loess line is clearly on one side of the confidence interval, and both ends are on the other side of the confidence interval.

I would characterize this plot as non-linear.

Slide 21

The pattern of points in the scatterplot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied.

We mark the check box for the question as correct.

Slide 22

The next question asks about whether the pattern in the residual plot shows any violation of the assumptions of linearity or equal variance.

Since the residual plot is part of the SPSS regression output, we will compute the simple linear regression.

Slide 23

To compute a simple linear regression, select Regression > Linear from the Analyze menu.

Slide 24

First, move the dependent variable, ginindex, to the Dependent text box.

Second, move the independent variable, poverty, to the Independent(s) list box.

Third, click on the Statistics button to request basic descriptive statistics.

Slide 25

First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis.

Second, click on the Continue button to close the dialog box.

Slide 26

Next, click on the Plots button to request the residual plot.

Slide 27

For our residual plot, we will use standardized residuals on the vertical axis and standardized predictions on the horizontal axis.

Select *ZRESID (for standardized residuals) in the list of variables to plot.

Click on the right arrow button to move *ZRESID to the Y axis text box.

Slide 28

Select *ZPRED (for standardized predictions) in the list of variables to plot.

Click on the right arrow button to move *ZPRED to the X axis text box.
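
To make the two plotted quantities concrete, here is a rough sketch of how standardized predictions and standardized residuals can be computed by hand. It assumes that *ZPRED is the predicted value rescaled to mean 0 and standard deviation 1, and that *ZRESID is the residual divided by the standard error of the estimate. The function names and the sample data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def fit_line(x, y):
    """Ordinary least squares line; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def standardized_values(x, y):
    """Return (zpred, zresid) for a simple linear regression of y on x."""
    slope, intercept = fit_line(x, y)
    pred = [intercept + slope * xi for xi in x]
    resid = [yi - pi for yi, pi in zip(y, pred)]
    # Standard error of the estimate: sqrt(SSE / (n - 2)).
    se = sqrt(sum(e * e for e in resid) / (len(x) - 2))
    pred_mean, pred_sd = mean(pred), stdev(pred)
    zpred = [(p - pred_mean) / pred_sd for p in pred]
    zresid = [e / se for e in resid]
    return zpred, zresid

# Hypothetical data, not taken from world2007.sav.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
zpred, zresid = standardized_values(x, y)
for zp, zr in zip(zpred, zresid):
    print(f"{zp:7.3f} {zr:7.3f}")
```

Plotting zresid against zpred gives the same kind of picture as the SPSS residual plot: a shapeless cloud when the assumptions hold, a curve or a funnel when they do not.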

Slide 29

While SPSS makes other diagnostic plots available, we will forgo them for the time being.

Click on the Continue button to close the dialog box.

Slide 30

Click on the OK button to request the output.

While there are many more optional outputs for linear regression, what we have requested is sufficient for these problems.

Slide 31

The pattern of points in the residual plot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied, confirming the finding of linearity in the scatterplot of the raw data.

When linear regression satisfies the assumptions of linearity and equal variance of residuals, the plot of standardized residuals across the standardized predictions does not show any pattern, direction, or shape.

The pattern of points in the residual plot shows equal spread across the standardized predictions of "disparity in the distribution of family income (Gini index)" [ginindex]. The assumption of equal variance is satisfied.

Slide 32

The pattern of points in the residual plot shows equal spread across the standardized predictions of "disparity in the distribution of family income (Gini index)" [ginindex]. The assumption of equal variance is satisfied.

We mark the check box for the question as correct.

Slide 33

The next question asks whether there are outliers in the distributions of either of the variables.

Slide 34

We can determine whether there are outliers that are three or more standard deviations from the mean, among both the standardized residuals and the standardized predicted values, by examining the values in the table of residual statistics.

The distribution of the standardized residuals contained no outliers that were three or more standard deviations from the mean. The smallest value (-2.135) was larger than -3.0 and the largest value (2.578) was less than +3.0.

Outliers can also be identified from the residual plot, but I find it more convenient to use the table of numbers, especially when there are values close to -3.0 or +3.0 and it is not clear from the plot whether they are beyond the cutoff.

Slide 35

The distribution of the standardized predictions contained no outliers that were three or more standard deviations from the mean. The smallest value (-1.566) was larger than -3.0 and the largest value (2.850) was less than +3.0.

There is no problem with outliers in this analysis.

Slide 36

Since the values for both standardized residuals and standardized predictions are within three standard deviations of the mean, there are no outliers.

Mark the check box for the problem.

Slide 37

The next statement asks about the number of cases available for the analysis.

Slide 38

The number of cases with valid data to analyze the relationship between "percent of the population below poverty line" and "disparity in the distribution of family income (Gini index)" was 105, out of the total of 192 cases in the data set.
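
The count of valid cases is the number of cases that have a value on both variables. The sketch below assumes that cases missing either variable are excluded from the analysis; the function name and the five sample cases are hypothetical and only illustrate the counting.

```python
def count_valid_pairs(x, y):
    """Count cases where both variables have a non-missing value."""
    return sum(1 for xi, yi in zip(x, y) if xi is not None and yi is not None)

# Hypothetical cases; None marks a missing value.
poverty = [12.0, None, 30.5, 8.1, None]
ginindex = [35.2, 41.0, None, 28.9, None]

valid = count_valid_pairs(poverty, ginindex)
print(f"{valid} of {len(poverty)} cases have valid data")  # 2 of 5 cases have valid data
```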

Slide 39

The number of cases with valid data to analyze the relationship between "percent of the population below poverty line" and "disparity in the distribution of family income (Gini index)" was 105, out of the total of 192 cases in the data set.

Click on the check box to mark the statement as correct.

Slide 40

The next statement asks us about the direction of the relationship between the variables, e.g. high scores on the independent variable are associated with higher/lower scores on the dependent variable.

If we had violated any of the assumptions for a linear model, attempting to characterize the direction of the relationship could lead to incorrect findings, so this statement would be skipped.

Slide 41

The slope for the regression relationship between "percent of the population below poverty line" [poverty] and "disparity in the distribution of family income (Gini index)" [ginindex] was .25. The positive value for the slope means that scores for the two variables change in the same direction. Higher scores on the variable "percent of the population below poverty line" were associated with higher scores on "disparity in the distribution of family income (Gini index)".

The sign of the B coefficient for the independent variable poverty tells us the direction of the relationship.

Slide 42

Higher scores on the variable "percent of the population below poverty line" were associated with higher scores on "disparity in the distribution of family income (Gini index)". The statement that "countries who had a higher percentage of persons in poverty had greater disparity in the distribution of family income (Gini index)" is correct.

Mark the check box for a correct answer.

Slide 43

The next statement asks us to interpret the strength of the relationship using the guidelines attributed to Tukey.

If we had violated any of the assumptions for a linear model, attempting to characterize the strength of the relationship could lead to incorrect findings, so this statement would be skipped.

Slide 44

The rule of thumb attributed to Tukey, with the cutoffs squared for R², is: 0.0 to 0.04 is very weak; 0.04 to 0.16 is weak; 0.16 to 0.36 is moderate; 0.36 to 0.64 is strong; and greater than 0.64 is very strong. With R² = 22.2%, the relationship was correctly characterized as moderate.
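
The Tukey lookup can be sketched as a search over the R² cutoffs. The function name is hypothetical, and the handling of values that fall exactly on a cutoff is an assumption: a value on the boundary is placed in the lower category, in line with "greater than 0.64 is very strong".

```python
from bisect import bisect_left

TUKEY_R2_CUTOFFS = [0.04, 0.16, 0.36, 0.64]
TUKEY_LABELS = ["very weak", "weak", "moderate", "strong", "very strong"]

def tukey_strength(r_squared):
    """Label R squared using the squared Tukey cutoffs (boundary -> lower label)."""
    return TUKEY_LABELS[bisect_left(TUKEY_R2_CUTOFFS, r_squared)]

print(tukey_strength(0.222))  # moderate
```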

Slide 45

The relationship was correctly characterized as moderate (R² = 22.2%) using the Tukey guidelines.

Click on the check box to mark the statement as correct.

Slide 46

The next statement asks us to interpret the strength of the relationship using the guidelines attributed to Cohen.

If we had violated any of the assumptions for a linear model, attempting to characterize the strength of the relationship could lead to incorrect findings, so this statement would be skipped.

Slide 47

Applying Cohen's criteria for effect size (less than 0.01 = trivial; 0.01 up to 0.09 = weak or small; 0.09 up to 0.25 = moderate; 0.25 or greater = strong or large), the relationship was correctly characterized as moderate (R² = 22.2%).
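
Cohen's criteria, as worded above, translate directly into a chain of comparisons. The function name is hypothetical; the cutoffs and the "up to" boundaries follow the wording on this slide.

```python
def cohen_strength(r_squared):
    """Label R squared using the Cohen criteria quoted on this slide."""
    if r_squared < 0.01:
        return "trivial"
    if r_squared < 0.09:
        return "weak"
    if r_squared < 0.25:
        return "moderate"
    return "strong"

print(cohen_strength(0.222))  # moderate
```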

Slide 48

The relationship was correctly characterized as moderate (R² = 22.2%) using the Cohen guidelines.

Click on the check box to mark the statement as correct.