SOLVING THE HOMEWORK PROBLEMS

Simple linear regression is an appropriate model of the relationship between two quantitative variables provided the data satisfies the assumption of linearity in a scatterplot of the raw data, provided the spread of the residuals is equal for all of the predicted values in the residual plot, and provided there are no outliers impacting the linear model.

We test the assumption of linearity by comparing a loess fit line to a linear fit line in a scatterplot of the raw data for the variables. The assumption of linearity is satisfied if the loess fit line does not curve outside a confidence interval plotted around the linear fit line. We confirm our evaluation of linearity by checking the plot of the standardized residuals by the standardized predictions (the residual plot) for linearity. Our assessment of linearity in the residual plot should agree with our assessment of the scatterplot of the raw data.
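Outside SPSS, the comparison of the two fit lines can be sketched in a few lines of Python. This is a simplified illustration only: the loess here is a plain local linear fit with tricube weights and no robustness iterations, which is an assumption and not necessarily the algorithm SPSS uses. The function names and the frac setting are hypothetical. It reports the largest vertical gap between the two lines and does not compute the confidence interval.

```python
def weighted_line(xs, ys, ws):
    """Weighted least-squares line; returns (intercept, slope)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope


def linear_fit(xs, ys):
    """Fitted values of the ordinary (unweighted) linear fit line."""
    a, b = weighted_line(xs, ys, [1.0] * len(xs))
    return [a + b * x for x in xs]


def loess_fit(xs, ys, frac=0.5):
    """Simplified loess: local linear fit with tricube weights at each x."""
    n = len(xs)
    k = min(n, max(3, int(round(frac * n))))  # neighbours used per local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0  # bandwidth: distance to the k-th nearest point
        ws = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        a, b = weighted_line(xs, ys, ws)
        fitted.append(a + b * x0)
    return fitted


def max_gap(xs, ys, frac=0.5):
    """Largest vertical distance between the linear and loess fit lines."""
    lin = linear_fit(xs, ys)
    loe = loess_fit(xs, ys, frac)
    return max(abs(p - q) for p, q in zip(lin, loe))
```

For data that lie exactly on a straight line the gap is essentially zero, while for strongly curved data (for example y = x²) it is large.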

We test the assumption of equal variance by examining the overall shape of the points in a plot of the standardized residuals by the standardized predictions. If the spread of the residuals is the same across the predicted values, the assumption is satisfied. If the pattern is narrower at one side than the other (funnel shaped), the assumption is violated.
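The visual check for a funnel shape can be supplemented with a rough numeric summary. The sketch below is a heuristic of our own, not part of the SPSS output: it splits the cases at the median predicted value and compares the spread of the residuals in the two halves. A ratio near 1 suggests equal spread; a ratio far from 1 suggests a funnel. The function name is hypothetical.

```python
from statistics import median, pstdev


def spread_ratio(predicted, residuals):
    """Residual spread in the upper half of predictions divided by the lower half."""
    cut = median(predicted)
    low = [r for p, r in zip(predicted, residuals) if p <= cut]
    high = [r for p, r in zip(predicted, residuals) if p > cut]
    # Raises ZeroDivisionError if the lower half has no spread at all.
    return pstdev(high) / pstdev(low)
```

How far from 1 the ratio must be before we call the assumption violated is a judgment call that this sketch does not settle.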

SOLVING THE HOMEWORK PROBLEMS (continued)

The identification of outliers is based on the presence of standardized residuals or standardized predicted values that have an absolute value equal to or greater than 3.0 in the table of residual statistics that is part of the regression output in SPSS.

In this problem set, we will not re-express the variables in an effort to satisfy any assumptions violated by the distributions of the variables.

When we satisfy these requirements, simple linear regression provides us with information about the direction and the strength of the linear relationship. The direction of the relationship is based on the sign of the slope. If the slope is positive, there is a direct or positive relationship and the scores on the variables move in the same direction. If the slope is negative, there is an inverse or negative relationship and the scores on the variables move in opposite directions.
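As a minimal sketch of how the direction follows from the slope, the code below computes the least-squares slope by hand and maps its sign to the wording used above. The function names are hypothetical, and the sketch is not how SPSS reports the result.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def direction(b):
    """Describe the relationship implied by the sign of the slope b."""
    if b > 0:
        return "direct (positive): scores move in the same direction"
    if b < 0:
        return "inverse (negative): scores move in opposite directions"
    return "no linear relationship"
```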

The strength of the relationship is based on the coefficient of determination, R². Since R² is the square of Pearson's r, we can square the Tukey and Cohen guidelines for interpreting the strength of the relationship.
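The squaring step can be checked directly. The cutoffs for r used below (0.2, 0.4, 0.6, 0.8 for Tukey and 0.1, 0.3, 0.5 for Cohen) are an assumption: they are the values whose squares match the R² thresholds quoted later in this walkthrough.

```python
# Assumed cutoffs for Pearson's r (not stated explicitly in the slides).
TUKEY_R = [0.2, 0.4, 0.6, 0.8]
COHEN_R = [0.1, 0.3, 0.5]


def squared(cutoffs):
    """Convert cutoffs for r into cutoffs for R-squared."""
    return [round(r * r, 2) for r in cutoffs]


tukey_r2 = squared(TUKEY_R)  # [0.04, 0.16, 0.36, 0.64]
cohen_r2 = squared(COHEN_R)  # [0.01, 0.09, 0.25]
```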

The introductory statement in the question indicates:
• The data set to use (world2007.sav)
• The task to accomplish (simple linear regression)
• The variables to use in the analysis: percent of the population below poverty line [poverty] and disparity in the distribution of family income (Gini index) [ginindex].

The first question concerns the appropriateness of using linear regression to analyze the relationship between these variables. We will create and analyze a scatterplot of the raw data to answer this question.

To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.

In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create.

Click on the Define button to go to the next step.

First, move the dependent variable ginindex to the Y axis text box.

Second, move the independent variable poverty to the X axis text box.

Third, click on the OK button to produce the plot.

The scatterplot appears in the SPSS output window.

To facilitate our determination about the linearity of the plot, we will add a linear fit line and a loess fit line.

To edit the plot, double click on the plot area to open the chart in the Chart Editor.

In the Chart Editor, click on the tool button for adding a fit line.

When we click on the Add Fit Line at Total button, a fit line is added to the chart and the Properties dialog box appears. The blue outline around the fit line shows that the fit line is selected.

In the Properties dialog box, we first accept the default Fit Method of Linear.

Second, we request a Confidence Interval around the Mean.

Third, we change the default width from 95% to 99%.

Fourth, we click on the tab for Lines to format the fit line.

The confidence interval is not designed to test linearity, but it provides us with a convenient guideline.

In the Lines tab page, select 1.5 from the Weight drop down menu to increase the width of the line.

Next, click on the Red color swatch to change the color of the line.

Click on the Apply button to change the appearance of the line and add the confidence interval lines to the chart.

The lines added to the scatterplot are complete, so we click on the Close button.

We next want to add a loess fit line to the chart. We close this fit line dialog and open a second fit line dialog to add the second fit line.

Trying to add both lines in one step can cause problems because the properties may get applied to the wrong lines.

In the Chart Editor, click on the tool button for adding a fit line again.

When we click on the Add Fit Line at Total button, a second fit line is added to the chart. (It is not very visible because it is overlaid on the first fit line.) The blue outline shows that the line is selected.

In the Properties dialog box, we first click on the Loess Fit Method option button.

Second, we click on the tab for Lines to format the fit line.

We change the weight of the loess line to 1.5, and the color to Cyan.

We click on the Apply button to change the line.

Having made the desired changes to the scatterplot, click on the Close box to exit the Chart Editor.

The loess fit line falls within the 99% confidence interval around the linear fit line. The relationship is linear. Minor fluctuations over the lines of the confidence interval are ignored.

The pattern of points in the scatterplot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied.

Do not over-react to minor fluctuations in the loess line, even if it touches or slightly crosses the confidence interval lines.

I would still characterize this plot as linear.

What we want to detect are major consistent patterns of a non-linear relationship. In this plot the middle of the loess line is clear to one side of the confidence interval, and both ends are on the other side of the confidence interval.

I would characterize this plot as non-linear.
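The distinction between minor fluctuations and a major consistent pattern can be sketched as follows, assuming we have the loess values and the lower and upper confidence limits at the same x positions (SPSS does not export these from the chart by default, so the inputs here are hypothetical). The rule in looks_curved encodes only the pattern described above: both ends outside the band on one side and some of the middle outside on the other side.

```python
def outside_band(loess, lower, upper):
    """Per point: -1 if below the band, +1 if above it, 0 if inside."""
    return [(-1 if v < lo else 1 if v > hi else 0)
            for v, lo, hi in zip(loess, lower, upper)]


def share_outside(flags):
    """Fraction of points where the loess line leaves the band."""
    return sum(1 for f in flags if f != 0) / len(flags)


def looks_curved(flags):
    """True when both ends lie on one side and part of the middle on the other."""
    first, last = flags[0], flags[-1]
    if first == 0 or first != last:
        return False
    return -first in flags[1:-1]
```

A small share of points outside the band, with looks_curved returning False, corresponds to the minor fluctuations we are told to ignore.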

We mark the check box for the question as correct.

The next question asks about whether the pattern in the residual plot shows any violation of the assumptions of linearity or equal variance.

Since the residual plot is part of the SPSS regression output, we will compute the simple linear regression.

To compute a simple linear regression, select Regression > Linear from the Analyze menu.

First, move the dependent variable, ginindex, to the Dependent text box.

Second, move the independent variable, poverty, to the Independent(s) list box.

Third, click on the Statistics button to request basic descriptive statistics.

First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis.

Second, click on the Continue button to close the dialog box.

Next, click on the Plots button to request the residual plot.

For our residual plot, we will use standardized residuals on the vertical axis and standardized predictions on the horizontal axis.

Select *ZRESID (for standardized residuals) in the list of variables to plot.

Click on the right arrow button to move *ZRESID to the Y axis text box.

Select *ZPRED (for standardized predictions) in the list of variables to plot.

Click on the right arrow button to move *ZPRED to the X axis text box.
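For readers who want to see what goes on the two axes, here is a minimal sketch of the quantities being plotted. It assumes that the standardized residual is the raw residual divided by the standard error of the estimate, and that the standardized prediction is the predicted value expressed as a z-score; we take these to correspond to *ZRESID and *ZPRED, but the SPSS documentation should be consulted for the exact definitions. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_plot_values(xs, ys):
    """Return (zpred, zresid) for a simple linear regression of ys on xs."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    pred = [a + b * x for x in xs]
    resid = [y - p for y, p in zip(ys, pred)]
    # Standard error of the estimate: residual sum of squares over n - 2.
    see = sqrt(sum(r * r for r in resid) / (n - 2))
    zresid = [r / see for r in resid]
    mp, sp = mean(pred), stdev(pred)
    zpred = [(p - mp) / sp for p in pred]
    return zpred, zresid
```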

While SPSS makes other diagnostic plots available, we will forgo them for the time being.

Click on the Continue button to close the dialog box.

Click on the OK button to request the output.

While there are many more optional outputs for linear regression, what we have requested is sufficient for these problems.

The pattern of points in the residual plot does not show an obvious curve indicating non-linearity. The assumption of linearity is satisfied, confirming the finding of linearity in the scatterplot for the raw data.

When linear regression satisfies the assumptions of linearity and equal variance of residuals, the plot of standardized residuals across the standardized predictions does not show any pattern, direction, or shape.

The pattern of points in the residual plot shows equal spread across the standardized predicted values of "disparity in the distribution of family income (Gini index)" [ginindex]. The assumption of equal variance is satisfied.

We mark the check box for the question as correct.

The next question asks whether there are outliers in the distributions of either of the variables.

We can determine whether there are outliers that are three or more standard deviations from the mean, for both the standardized residuals and the standardized predictions, by examining the values in the table of residual statistics.

The distribution of the standardized residuals contained no outliers that were three or more standard deviations from the mean. The smallest value (-2.135) was larger than -3.0 and the largest value (2.578) was less than +3.0.

Outliers can also be determined from the residual plot, but I find it more convenient to use the table of numbers, especially when there are values close to -3.0 or +3.0, but it is not clear whether they are beyond it or not.

The distribution of the standardized predictions contained no outliers that were three or more standard deviations from the mean. The smallest value (-1.566) was larger than -3.0 and the largest value (2.850) was less than +3.0.

There is no problem with outliers in this analysis.

Since the values for both standardized residuals and standardized predictions are within three standard deviations of the mean, there are no outliers.
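The check against the table of residual statistics amounts to comparing the minimum and maximum of each standardized variable with the cutoff of 3.0. A minimal sketch using the values reported above (the names are hypothetical):

```python
CUTOFF = 3.0

# (minimum, maximum) as reported in the table of residual statistics.
residual_stats = {
    "standardized residual": (-2.135, 2.578),
    "standardized predicted value": (-1.566, 2.850),
}


def has_outliers(minimum, maximum, cutoff=CUTOFF):
    """True if either extreme is at or beyond the cutoff in absolute value."""
    return abs(minimum) >= cutoff or abs(maximum) >= cutoff


flags = {name: has_outliers(lo, hi) for name, (lo, hi) in residual_stats.items()}
no_outliers = not any(flags.values())
```

Checking only the minimum and maximum is enough, because any value at or beyond the cutoff would itself be one of the extremes.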

Mark the check box for the problem.

The next statement asks about the number of cases available for the analysis.

The number of cases with valid data to analyze the relationship between "percent of the population below poverty line" and "disparity in the distribution of family income (Gini index)" was 105, out of the total of 192 cases in the data set.
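The count of valid cases can be sketched as follows, assuming that a case is used only when both variables are present (listwise exclusion, which we take to be what produced the count in the Descriptives output). The data rows below are hypothetical and are not taken from world2007.sav.

```python
def valid_pairs(xs, ys):
    """Keep only the cases where neither variable is missing (None)."""
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]


# Hypothetical values for five countries; None marks a missing value.
poverty = [12.0, None, 30.5, 8.1, None]
ginindex = [36.0, 41.2, None, 28.0, None]

n_valid = len(valid_pairs(poverty, ginindex))
n_total = len(poverty)
```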

Click on the check box to mark the statement as correct.

The next statement asks us about the direction of the relationship between the variables, e.g. high scores on the independent variable are associated with higher/lower scores on the dependent variable.

If we had violated any of the assumptions for a linear model, attempting to characterize the direction of the relationship could lead to incorrect findings, so this statement would be skipped.

The slope for the regression relationship between "percent of the population below poverty line" [poverty] and "disparity in the distribution of family income (Gini index)" [ginindex] was .25. The positive value for the slope means that scores for the two variables change in the same direction. Higher scores on the variable "percent of the population below poverty line" were associated with higher scores on "disparity in the distribution of family income (Gini index)".

The sign of the B coefficient for the independent variable poverty tells us the direction of the relationship.

Higher scores on the variable "percent of the population below poverty line" were associated with higher scores on "disparity in the distribution of family income (Gini index)". The statement that "countries who had a higher percentage of persons in poverty had greater disparity in the distribution of family income (Gini index)" is correct.

Mark the check box for a correct answer.

The next statement asks us to interpret strength of the relationship using the guidelines attributed to Tukey.

If we had violated any of the assumptions for a linear model, attempting to characterize the strength of the relationship could lead to incorrect findings, so this statement would be skipped.

The rule of thumb attributed to Tukey treats an R² between 0.0 and 0.04 as very weak; 0.04 to 0.16 as weak; 0.16 to 0.36 as moderate; 0.36 to 0.64 as strong; and greater than 0.64 as very strong. By this rule, the relationship was correctly characterized as moderate (R² = 22.2%).

The relationship was correctly characterized as moderate (R² = 22.2%) using the Tukey guidelines.
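The Tukey lookup can be written as a small function. The slides say only that "greater than 0.64" is very strong. Treating every cutoff as the upper end of the weaker category (so that exactly 0.16 counts as weak) is our assumption about the boundaries. The names are hypothetical.

```python
from bisect import bisect_left

TUKEY_CUTS = [0.04, 0.16, 0.36, 0.64]
TUKEY_LABELS = ["very weak", "weak", "moderate", "strong", "very strong"]


def tukey_strength(r2):
    """Label an R-squared value using the squared Tukey guidelines."""
    # bisect_left places a value equal to a cutoff in the weaker category.
    return TUKEY_LABELS[bisect_left(TUKEY_CUTS, r2)]
```

With R² = 0.222, the function returns "moderate", in agreement with the answer above.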


The next statement asks us to interpret strength of the relationship using the guidelines attributed to Cohen.

If we had violated any of the assumptions for a linear model, attempting to characterize the strength of the relationship could lead to incorrect findings, so this statement would be skipped.

Applying Cohen's criteria for effect size (less than 0.01 = trivial; 0.01 up to 0.09 = weak or small; 0.09 up to 0.25 = moderate; 0.25 or greater = strong or large), the relationship was correctly characterized as moderate (R² = 22.2%).

The relationship was correctly characterized as moderate (R² = 22.2%) using the Cohen guidelines.
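Cohen's criteria state their boundaries more explicitly ("up to" and "or greater"), so each cutoff belongs to the stronger category. A minimal sketch, with a hypothetical function name:

```python
def cohen_strength(r2):
    """Label an R-squared value using Cohen's effect size criteria."""
    if r2 < 0.01:
        return "trivial"
    if r2 < 0.09:
        return "weak"      # also described as small
    if r2 < 0.25:
        return "moderate"
    return "strong"        # also described as large
```

With R² = 0.222, the function returns "moderate". A value of exactly 0.25 would be labeled "strong".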