PLSC 503: Quantitative Methods, Week 4
Mechanics of Bivariate Regression

Thad Dunning
Department of Political Science
Yale University

Lecture outline:
- Why Is It Called Regression?
- The Correlation Coefficient
- Two Facts About Regression
- The Analysis of Variance and the R²
- Regression Minimizes the Sum of Squared Residuals
- Using Bivariate Analysis



Objectives

The goal of this lecture is to describe the mechanics and properties of the bivariate regression line.

This line can be fitted to any two-variable scatterplot.

It is important to understand the algebra involved in fitting this line and the properties that the line has.

To fit the line to data, we need neither a statistical model nor a causal model of the data-generating process; it is simply a matter of applying the algebra of variances and covariances.

What the line means is another question . . .


The Heights of Fathers and Sons

Galton (1822-1911) and Pearson (1857-1936) studied the relationship between the heights of fathers and sons.

Freedman et al. (2007: 171) plot the height of sons at maturity against the height of their fathers.

Each dot represents a father-son pair; the x-coordinate gives the father's height and the y-coordinate gives the son's height.

The dotted "SD line" passes through the point of averages (the average value of x and the average value of y) as well as all points that are an equal number of standard deviations away from average, for both variables.


The SD Line and the Regression Line

[Figure: scatterplot of fathers' and sons' heights, with the SD line and the regression line drawn in.]

The Regression Effect

However, the SD line does not provide the best way of predicting son's height, given father's height.

E.g., look at the vertical strip in Figure 5 over a father's height of 72 inches: most of the dots fall well below the SD line.

- Very tall fathers have slightly shorter sons, on average.
- Very short fathers have slightly taller sons, on average.

This is the "regression effect" or "regression to the mean" (Galton charmingly called it "regression to mediocrity").


The Regression Line

So by how much is a unit increase in father's height associated, on average, with an increase in son's height?

The regression line answers this question, if there is a linear relationship between the heights of fathers and sons.

Indeed, the best linear predictor of y given x is the regression line. (We'll look at this claim in more detail later.)


The Regression Line and the Graph of Averages

The regression line passes through the point of averages (like the SD line), but unlike the SD line it approximately passes through the average of each vertical strip.

We could plot the average value of y for each value of x.

The regression line approximates this "graph of averages" (that is, it is a smoothed version of the graph of averages).

If the graph of averages follows a straight line, that line is the regression line.

Also, if we fit the regression line to the (weighted) graph of averages, this line coincides with the regression line fitted to the original scatterplot.


The Graph of Averages

[Figure: the graph of averages for the father-son height data.]

The Regression Line At Work

The equation for the regression line drawn in Figure 5 of Chapter 10 in Freedman, Pisani, and Purves (2007) is

$$ Y_i = 34.188 + 0.514 X_i, \qquad (1) $$

where $Y_i$ is the height of son $i$ and $X_i$ is the height of his father.

The value of the regression line at the y-intercept of the figure is $Y_i = 34.188 + 0.514(58) = 64$. However, 58 inches (4.8 feet) was pretty short, even in Victorian England; in Figure 5, no fathers are that short.

When we use the regression equation to pick the average value of y at the y-intercept, we are extrapolating from the data.
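To make the extrapolation point concrete, here is a minimal sketch in Python. The intercept and slope come from equation (1); the observed range of fathers' heights used here is an assumed illustration, not a number from the source.

```python
# Evaluate the fitted line Y_hat = 34.188 + 0.514 * X and flag
# predictions that extrapolate beyond the observed range of x.
a, b = 34.188, 0.514  # intercept and slope from equation (1)

def predict_son_height(father_height, x_min=59.0, x_max=75.0):
    # x_min and x_max are hypothetical bounds of the observed data
    y_hat = a + b * father_height
    extrapolating = not (x_min <= father_height <= x_max)
    return y_hat, extrapolating

print(predict_son_height(72))  # (71.196, False): within the data
print(predict_son_height(58))  # (64.0, True): an extrapolation
```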


The slope of the regression line

Notice that the slope of the regression line is less steep than the slope of the SD line:

- An increase of one SD above average father's height is associated, on average, with a gain of less than one SD in son's height.
- In fact, it is associated with an increase of r × SD of son's height, where r is the coefficient of correlation between father's height and son's height.

Letting b denote the slope of the regression line, we have

$$ b = r \cdot \frac{SD_y}{SD_x}, \qquad (2) $$

where $SD_y$ is the standard deviation of son's height and $SD_x$ is the standard deviation of father's height.


The slope of the regression line, like any line, is the rise over the run.

- If the run is $SD_x$, the rise is $r \times SD_y$.
- When a father's height goes up by one SD of fathers' heights, his son's height goes up by r × SD of sons' heights, on average.

Why r turns out to be the right factor is something we will discuss later.


Standard units

First, a bit of review: to calculate r, we must first convert both x and y to standard units.

Denote the value of the variable x for unit i as $x_i$. You can convert this value $x_i$ to standard units by dividing its deviation from the average value of x by the standard deviation of x:

$$ \frac{x_i - \bar{x}}{SD_x}. \qquad (3) $$

Here, $\bar{x}$ is the average of x. This formula says how many standard deviations $x_i$ is above or below $\bar{x}$.

When a variable x is converted to standard units, its average is 0 and its SD (and variance) are equal to 1 (you will prove this on your problem set).
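As a quick illustration, here is a sketch with made-up heights; it applies equation (3) and checks the claim that standardized values have average 0 and SD 1. The lecture's 1/n convention matches NumPy's default (population) standard deviation.

```python
import numpy as np

x = np.array([63.0, 67.0, 68.0, 70.0, 74.0])  # hypothetical data
z = (x - x.mean()) / x.std()                  # equation (3)

print(z.mean())  # 0.0 (up to floating-point error)
print(z.std())   # 1.0
```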


The variance and the SD

To understand r, we also need some basic definitions of variance and covariance. Recall that the variance is the mean-square deviation from average:

$$ Var(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2. \qquad (4) $$

The variance can also be written as follows:

$$ Var(x) = \overline{x^2} - \bar{x}^2, \qquad (5) $$

where the first term is the average of the squares and the second is the square of the average.

The standard deviation $SD_x$ is the square root of the variance.
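A quick numerical check, with made-up data, that formulas (4) and (5) agree:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data

var_dev = np.mean((x - x.mean()) ** 2)     # equation (4)
var_alt = np.mean(x ** 2) - x.mean() ** 2  # equation (5)

print(var_dev, var_alt)  # both 4.0
```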


The covariance

Also recall that the covariance of two variables x and y is

$$ Cov(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). \qquad (6) $$

Here is an alternative, useful way of writing the covariance:

$$ Cov(x, y) = \overline{xy} - \bar{x}\,\bar{y}, \qquad (7) $$

where the first term is the average value of $x_i$ times $y_i$, and $\bar{x}$ and $\bar{y}$ are the averages of x and y, respectively.
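The same kind of check works for the two covariance formulas; a minimal sketch with made-up data:

```python
import numpy as np

x = np.array([65.0, 67.0, 68.0, 70.0, 72.0])  # hypothetical data
y = np.array([66.0, 68.0, 67.0, 71.0, 70.0])

cov_dev = np.mean((x - x.mean()) * (y - y.mean()))  # equation (6)
cov_alt = np.mean(x * y) - x.mean() * y.mean()      # equation (7)

print(cov_dev, cov_alt)  # identical values
```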


The correlation coefficient

The correlation coefficient is computed by converting x and y to standard units and taking the average of the product.

In symbols, the correlation coefficient r between x and y is given by

$$ r_{x,y} = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{SD_x \cdot SD_y}, \qquad (8) $$

where n is the number of observations (units).

Notice that this is simply the covariance of x and y, divided by the product of the standard deviations.
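A sketch (same made-up data as above) verifying that the average product of standard units, equation (8), equals the covariance divided by the product of the SDs:

```python
import numpy as np

x = np.array([65.0, 67.0, 68.0, 70.0, 72.0])  # hypothetical data
y = np.array([66.0, 68.0, 67.0, 71.0, 70.0])

zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r_std = np.mean(zx * zy)                        # equation (8)

cov = np.mean((x - x.mean()) * (y - y.mean()))
r_cov = cov / (x.std() * y.std())               # Cov(x, y)/(SD_x * SD_y)

print(r_std, r_cov)  # identical values
```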


The correlation coefficient as a measure of association

[Figure: scatterplot divided into four quadrants at the point of averages.]

Values of x and y in the top-right and bottom-left quadrants enter positively; those in the other quadrants enter negatively. (Why?)

Problem cases for the correlation coefficient

[Figure: two scatterplots, (a) and (b), illustrating problem cases.]

The correlation coefficient is sensitive to outliers. In (b), the four quadrants drawn on the previous slide would be similarly populated.

The slope of the regression line, redux

We can now use the definitions of r and the variance and covariance to rewrite

$$ b = r_{x,y} \cdot \frac{SD_y}{SD_x} = \frac{Cov(x, y)}{SD_x \cdot SD_y} \cdot \frac{SD_y}{SD_x}. \qquad (9) $$

This reduces to

$$ b = \frac{Cov(x, y)}{Var(x)}. \qquad (10) $$

This is the standard formula for the slope of the bivariate regression line.
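Numerically, with the same made-up data, the two expressions for b coincide:

```python
import numpy as np

x = np.array([65.0, 67.0, 68.0, 70.0, 72.0])  # hypothetical data
y = np.array([66.0, 68.0, 67.0, 71.0, 70.0])

cov = np.mean((x - x.mean()) * (y - y.mean()))
r = cov / (x.std() * y.std())

b_via_r = r * y.std() / x.std()  # equation (9)
b_direct = cov / np.var(x)       # equation (10)

print(b_via_r, b_direct)  # identical values
```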


Actual and predicted values

Having discussed the slope of the regression line, let's talk about the difference between actual and predicted values; that is, the residual.

The actual value of y for any unit i is

$$ y_i = a + b x_i + e_i. \qquad (11) $$

Here, a is the intercept, b is the fitted regression slope coefficient, and $e_i$ is the deviation of unit i from the y-value predicted by the regression line.

The "predicted" value of y for unit i is simply the value of y on the regression line, given $x_i$:

$$ \hat{y}_i = a + b x_i. \qquad (12) $$
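Putting the pieces together, a minimal sketch (made-up data) that fits the line by the formulas above and forms predicted values and residuals:

```python
import numpy as np

x = np.array([65.0, 67.0, 68.0, 70.0, 72.0])  # hypothetical data
y = np.array([66.0, 68.0, 67.0, 71.0, 70.0])

b = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)  # Cov(x, y)/Var(x)
a = y.mean() - b * x.mean()  # line passes through the point of averages

y_hat = a + b * x  # predicted values, equation (12)
e = y - y_hat      # residuals, from equation (11)

print(a, b)
print(e)
```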


Graphing the regression residual

Thus the residual $e_i = y_i - \hat{y}_i$ is the actual value of y for unit i, minus the predicted value of y for unit i. For the dot at the lower left of the graph, the residual is negative.

[Figure: scatterplot with the regression line; the residual $e_i$ is the vertical distance between a point $(x_i, y_i)$ and the line at $x_i$.]

Two facts about regression

Here are two important facts about the residuals from any regression:

1. The average of the residuals is always zero (as long as there is an intercept).
2. The correlation between x and the residuals is 0, because Cov(x, e) = 0.
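Both facts are easy to verify numerically before proving them; a sketch with simulated (made-up) data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)  # hypothetical data

b = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)
a = y.mean() - b * x.mean()
e = y - (a + b * x)

print(np.mean(e))                                # fact 1: ~0
print(np.mean((x - x.mean()) * (e - e.mean())))  # fact 2: Cov(x, e) ~0
```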


Fact 1: the average of the residuals is zero. Proof:

Since $\bar{y} = a + b\bar{x}$ (the regression line goes through the point of averages), we can write $a = \bar{y} - b\bar{x}$. Thus,

$$ \hat{y}_i = a + b x_i = \bar{y} - b(\bar{x} - x_i); \qquad (13) $$

$$ e_i = y_i - \hat{y}_i = (y_i - \bar{y}) - b(x_i - \bar{x}); \qquad (14) $$

and

$$ \bar{e} = \frac{1}{n} \sum_{i=1}^{n} \left[ (y_i - \bar{y}) - b(x_i - \bar{x}) \right] = 0, \qquad (15) $$

since the deviations of y and of x from their respective averages each sum to zero.

Fact 2: Cov(x, e) = 0. Proof:

$$
\begin{aligned}
Cov(x, e) &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(e_i - \bar{e}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) e_i \\
&= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) \left[ (y_i - \bar{y}) - b(x_i - \bar{x}) \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) - b \cdot \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= Cov(x, y) - b \, Var(x) \\
&= 0,
\end{aligned}
$$

where the last step uses $b = Cov(x, y)/Var(x)$.

Understanding Cov(x, e) = 0

The second fact, that Cov(x, e) = 0, merits special attention.

This may be the most important fact about the mechanics of regression.

For reasons we'll discuss later, this is called orthogonality: the regression slope coefficient ensures that e and x are orthogonal (that is, at right angles).

Why at right angles? That's a topic for next week, when we use matrix algebra.


Two facts in summary

These two propositions are cold, mechanical facts about regression.

They hold for the residuals from any regression line fit to any data set.

Notice that we have not specified a regression model, nor made any assumptions about error terms or their relationship to covariates in that model.


The Analysis of Variance

We now move to our penultimate topic: the analysis of variance.

First, we define a measure of goodness-of-fit, the R², and discuss the idea of "variance explained."

Then, we'll see an alternate definition of the R² based on a decomposition of the variance of y.

We can calculate the R² measure for any regression line fitted to data; again, no statistical model is needed. Causation has also not entered the picture yet.


Deriving R²

As a preliminary to deriving the measure R², note that since

$$ \hat{y}_i = a + b x_i \qquad (17) $$

is the equation for the regression line, we have

$$ Var(\hat{y}) = b^2 Var(x). \qquad (18) $$

(You will show this on your problem set.)

Recall that the slope coefficient b is defined as

$$ b = \frac{Cov(x, y)}{Var(x)}. \qquad (19) $$

Thus, from (18), we have

$$ Var(\hat{y}) = \left( \frac{Cov(x, y)}{Var(x)} \right)^2 Var(x) = \frac{Cov(x, y)^2}{Var(x)} = r^2 \cdot Var(y). \qquad (20) $$

To get the final line of (20), use the definition of r.
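A numerical sanity check of the chain in (18) through (20), using simulated (made-up) data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 - 1.5 * x + rng.normal(size=200)  # hypothetical data

cov = np.mean((x - x.mean()) * (y - y.mean()))
b = cov / np.var(x)
r = cov / (x.std() * y.std())
y_hat = y.mean() + b * (x - x.mean())  # fitted values

print(np.var(y_hat))       # Var(y_hat)
print(b**2 * np.var(x))    # equation (18)
print(cov**2 / np.var(x))  # middle expression in (20)
print(r**2 * np.var(y))    # final line of (20)
```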

The Definition of R²

Then, rearranging the final line of (20), we have

$$ r^2 = \frac{Var(\hat{y})}{Var(y)}. \qquad (21) $$

$Var(\hat{y})$ is sometimes called the "explained variance" of y; this is the variance of y along the regression line.

Var(y) is the total variance of y.

It is conventional to use capitals and write R² instead of r². However, in a bivariate regression, the formula simply gives the square of the correlation coefficient r.


Page 78: Plsc 503 spring 2013 lecture 4

Why Is It Called Regression?The Correlation CoefficientTwo facts about regression

The Analysis of Variance and the R2

Regression Minimizes the Sum of Squared ResidualsUsing bivariate analysis

The R2 of the regressionThe Decomposition of Variance

It may be helpful to consider two pictures. On the left, the slope of the regression line is zero, and so is the R². (Question: why is the intercept on the left equal to the mean of Y?)

On the right, the correlation r is 1, and the slope is SDy/SDx; the R² is 1.

[Figure: two scatterplots. Left panel: r = 0 and b = 0. Right panel: r = 1 and b = SDy/SDx. Compare Var(ŷ) to Var(y) in each case.]


The Concept of “Explained Variance”

The R² is sometimes referred to as the fraction of variance "explained" by the regression.

You should be careful when you talk about "explained variance."

Causality is not in the picture.

We are simply assessing the "portion" of y that is associated with x.

What we mean by this will be made more precise later, when we consider multivariate regression.


The Decomposition of Variance

Another, closely related, way of defining the R² measure is by decomposing the total variance of y as follows:

  Var(y) = Var(ŷ) + Var(e).   (22)

In words, the total variance of y is the sum of the variance of ŷ (the predicted value of y given x) and the variance of the residuals e.


Equation (22) follows because:

1. yi = ŷi + ei;
2. ŷi = a + bxi;
3. Cov(x, e) = 0, as shown above;
4. therefore, Cov(ŷ, e) = Cov(a + bx, e) = b·Cov(x, e) = 0 (by exercise 11 on p. 46 of Freedman and Lane 1981);
5. thus, since Var(A + B) = Var(A) + Var(B) + 2Cov(A, B), we have

  Var(y) = Var(ŷ + e)
         = Var(ŷ) + Var(e) + 2Cov(ŷ, e)
         = Var(ŷ) + Var(e).   (23)
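The decomposition is easy to verify numerically. A minimal self-contained Python sketch (simulated data, as above; not from the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 2.0 + 1.5 * x + rng.normal(size=500)  # toy data

    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x
    e = y - y_hat                                # residuals

    print(np.cov(y_hat, e, bias=True)[0, 1])     # Cov(ŷ, e): zero up to rounding
    print(np.var(y))                             # total variance of y
    print(np.var(y_hat) + np.var(e))             # Var(ŷ) + Var(e): the same number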


If we multiply (23) through by n, the left-hand side is the sum of squared deviations of each data point from the average value of y; this is sometimes called the "total sum of squares" (SST) (see, e.g., Wooldridge 2003: 39).

The right-hand side is then the "explained sum of squares" (SSE), the sum of squared deviations of the predicted values, based on the regression line, from the average value of y, plus the "residual sum of squares" (SSR), the sum of squared deviations of each data point from the predicted value, based on the regression line.


Using (23), we then have

  SST = SSE + SSR.   (24)

This leads to two common ways of rewriting (21):

  r² = SSE / SST,   (25)

which is a restatement of equation (21); and

  r² = 1 − SSR / SST.   (26)

(Equation 26 simply follows by dividing SST = SSE + SSR through by SST.)
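A small Python check (again a sketch with simulated data, not from the notes) that the two sum-of-squares formulas give the same R²:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 2.0 + 1.5 * x + rng.normal(size=500)  # toy data

    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    sse = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
    ssr = np.sum((y - y_hat) ** 2)         # residual sum of squares

    print(sse / sst)      # equation (25)
    print(1 - ssr / sst)  # equation (26): identical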


Why the regression line?

Now we turn to our final topic: the derivation of the slope coefficient b.

The regression line has a unique property: it minimizes the sum of squared residuals (as first proved by Carl Friedrich Gauss, Germany, 1777-1855). That is, among all possible lines, the line that makes the smallest root-mean-square error in predicting y from x is the regression line.

This can be shown in two ways: by using calculus, or by the method used by Gauss.

We will use calculus to derive the slope coefficient b.

Then we will sketch Gauss's proof to show that if a line minimizes the sum of squared residuals, that line is the regression line.


The Sum of Squared Residuals

First, note that the sum of squared residuals is defined as

  ∑_{i=1}^n ei² = ∑_{i=1}^n (yi − ŷi)².   (27)

The square root of this quantity is a measure of the "average" prediction error:

  r.m.s. error = √[ (1/n) ∑_{i=1}^n (yi − ŷi)² ].   (28)

This is sometimes called the "root-mean-square error" of the regression line. (This can become confusing once we discuss statistical models, which have error terms; it may be better to stick with "residuals.")
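In Python (an illustrative sketch with simulated data; not from the notes), equation (28) is one line:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 2.0 + 1.5 * x + rng.normal(size=500)  # toy data with noise SD 1

    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = y.mean() - b * x.mean()
    resid = y - (a + b * x)                   # residuals of the fitted line

    rmse = np.sqrt(np.mean(resid ** 2))       # equation (28)
    print(rmse)                               # close to 1, the simulated noise SD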


The method of least squares

The problem is to choose the regression intercept a and the slope coefficient b to minimize the sum of squared residuals (and thus also the root-mean-square error). That is, we should choose a and b to solve

  min_{a,b} ∑_{i=1}^n (yi − a − bxi)²,   (29)

where yi − a − bxi = yi − ŷi = ei is the residual for unit i, given x and our choices of a and b. (Note that here we treat xi and yi as constants; the "variables" are a and b, because these are what we must choose to minimize the function.)


This minimization problem has two necessary first-order conditions for a solution, one with respect to a:

  −2 [ ∑_{i=1}^n (yi − a − bxi) ] = 0;   (30)

and one with respect to b:

  −2 [ ∑_{i=1}^n (yi − a − bxi) xi ] = 0.   (31)

(Technical note: for the sufficient conditions, we need to check that the matrix of second-order derivatives is positive definite.)
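Conditions (30) and (31) are linear in a and b, so they can also be solved directly as a 2-by-2 system (the "normal equations"). A hedged Python sketch (simulated data; the setup is my own, not from the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 2.0 + 1.5 * x + rng.normal(size=500)  # toy data
    n = len(x)

    # Rearranging (30) and (31):
    #   n·a      + (Σ xi)·b  = Σ yi
    #   (Σ xi)·a + (Σ xi²)·b = Σ xi·yi
    A = np.array([[n, x.sum()], [x.sum(), (x ** 2).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    a, b = np.linalg.solve(A, rhs)

    print(b, np.cov(x, y, bias=True)[0, 1] / np.var(x))  # same slope as Cov/Var
    print(a, y.mean() - b * x.mean())                    # same intercept as ȳ − b·x̄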


We can ignore the −2 in (30) and (31), because the term in brackets must be zero for the left-hand side to equal zero. Equation (30) can thus be rewritten as

  ∑_{i=1}^n a = ∑_{i=1}^n (yi − bxi).   (32)

Recalling that ∑_{i=1}^n a = na and dividing both sides of (32) by n gives

  a = ȳ − b·x̄,   (33)

where ȳ is the average of y and x̄ is the average of x. Note that this simply says that the regression line goes through the point of averages; that is, ȳ = a + b·x̄.


We can plug the expression for a in (33) into (31), again dropping the −2, to get

  ∑_{i=1}^n (yi − (ȳ − b·x̄) − bxi) xi = 0.   (34)

Rearranging terms, we have

  ∑_{i=1}^n (yi − ȳ) xi = b ∑_{i=1}^n (xi − x̄) xi.   (35)

If we divide both sides of (35) by n, the left-hand side is Cov(x, y) and the right-hand side is b·Var(x). (See Wooldridge 2003, e.g., p. 29, or equations A.7 and A.8 on p. 677 in Appendix A.)


Thus we have

  b = Cov(x, y) / Var(x).   (36)

This is the key result, which we saw above. It says that the slope of the regression line that minimizes the sum of squared prediction errors is the covariance between x and y, divided by the variance of x.

As we saw above, we can convert this formula to b = r·(SDy/SDx); that is why r is the right factor to use to get the slope.

You should always remember these formulas, even if you forget the derivations!! (But you should remember the derivations!)
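As a sanity check (an illustrative Python sketch with simulated data; np.polyfit is just one stand-in for a library least-squares routine), all three expressions for the slope agree:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 2.0 + 1.5 * x + rng.normal(size=500)  # toy data

    b_cov = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # equation (36)
    r = np.corrcoef(x, y)[0, 1]
    b_r = r * y.std() / x.std()                        # b = r · SDy/SDx
    b_ls = np.polyfit(x, y, deg=1)[0]                  # library least squares

    print(b_cov, b_r, b_ls)  # three identical slopes, up to rounding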


Gauss’s proof

Recall that the mean square error (or "mean square residual") of the regression line is

  (1/n) ∑_{i=1}^n ei² = (1/n) ∑_{i=1}^n (yi − ŷi)².   (37)

Gauss proposed the following:

Proposition. Among all lines for predicting y from x, the regression line has the smallest mean square error.

Thus, it also has the smallest root-mean-square (R.M.S.) error.

Let's work through a sketch of a proof...


We are going to compare the mean square error of an arbitrary line, with predictions cxi + d, to the mean square error of the regression line, with predictions ŷi = a + bxi.

First, denote the prediction error for the new line as fi = yi − cxi − d.

Averaging over i, we also have f̄ = ȳ − cx̄ − d. (Unlike the regression line, the new line need not go through the point of averages, so f̄ need not be zero.)


The prediction error of the regression line can be written as ei = (yi − ȳ) − b(xi − x̄) (see 14). Thus we have

  fi = yi − cxi − d
     = [(yi − ȳ) − b(xi − x̄)] + (b − c)xi + (ȳ − b·x̄ − d)
     = ei + (b − c)xi + (ȳ − b·x̄ − d),   (38)

where in the second line we have simply added and subtracted terms.


Since ȳ − b·x̄ − d is a constant, we have

  Var(f) = Var(e) + (b − c)²·Var(x),   (39)

where we have used the rule Var(aA + bB) = a²·Var(A) + b²·Var(B) + 2ab·Cov(A, B) for the variance of a sum of two variables A and B with coefficients a and b, together with the fact that Cov(x, e) = 0, as we showed above.


Now, recall (see 5) that a variance is the mean of the squares minus the square of the mean; for example, Var(x) = (1/n) ∑_{i=1}^n xi² − x̄². Applying this to f and to e, we have

  Var(f) = (1/n) ∑_{i=1}^n fi² − f̄²
         = (1/n) ∑_{i=1}^n fi² − [ȳ − cx̄ − d]²   (40)

and

  Var(e) = (1/n) ∑_{i=1}^n ei²,   (41)

where we use f̄ = ȳ − cx̄ − d in (40) and ē = 0 in (41).


We can thus rewrite equation (39) as

  (1/n) ∑_{i=1}^n fi² = (1/n) ∑_{i=1}^n ei² + (b − c)²·Var(x) + [ȳ − cx̄ − d]².   (42)

On the left-hand side we have the mean square residual of the new line; on the right-hand side is the mean square residual of the regression line, plus (b − c)²·Var(x) + [ȳ − cx̄ − d]².

Notice, then, that the mean square residual of the regression line must be smaller than that of the new line, unless both extra terms are zero, that is, unless b = c and d = ȳ − b·x̄!

If the line cx + d has the same MSE as the regression line, then it is the regression line. QED.
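The inequality is easy to see numerically. An illustrative Python sketch (simulated data; the competing line is an arbitrary perturbation of my own choosing):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 2.0 + 1.5 * x + rng.normal(size=500)  # toy data

    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = y.mean() - b * x.mean()
    mse_reg = np.mean((y - (a + b * x)) ** 2)  # MSE of the regression line

    c, d = b + 0.3, a - 0.5                    # an arbitrary competing line
    mse_new = np.mean((y - (c * x + d)) ** 2)  # MSE of the new line

    gap = (b - c) ** 2 * np.var(x) + (y.mean() - c * x.mean() - d) ** 2
    print(mse_new - mse_reg, gap)              # equal, per equation (42)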


Key ideas

We can fit the regression line to any scatterplot (but x must have positive variance; why?).

Among all lines for predicting y from x, the regression line minimizes the sum of the squared residuals.

The residuals e from the regression are always uncorrelated with x. (More precisely, they are always orthogonal to x; and they are thus uncorrelated as long as there is an intercept that is not forced through zero.)

The regression line and its residuals will always have these properties.


None of the properties of the regression line reviewed in this lecture (or associated measures like the R²) requires a statistical model.

Nor is causality necessarily in the picture: we may use regression simply to summarize a data set.

Whether the regression line is useful as a summary description of the data may depend on the shape of the relationship:

- If the relationship between y and x is approximately linear (e.g., father's height and son's height), the regression line gives us nearly the average value of y, given x.

- If the relationship between y and x has a U (or inverted-U) shape, the relevance of the regression line is unclear, and it won't help much with prediction.


Bivariate analysis of experiments

Let Yi be an outcome variable, and let Di = 1 if unit i is assigned to treatment and 0 otherwise.

Suppose m out of N units are assigned to treatment. Thus, a fraction D̄ = (∑_{i=1}^N Di) / N = m/N is assigned to treatment.

The equation for the regression "line" is

  Ŷi = a + bDi.   (43)

What are a and b? (That is, how does regression fit the intercept a and "slope" b?)


The “slope” coefficient

Writing avg(YD) for (1/N) ∑_{i=1}^N YiDi, and using the fact that Var(D) = D̄(1 − D̄) for a 0-1 variable:

  b = Cov(Y, D) / Var(D)

    = [avg(YD) − Ȳ·D̄] / [D̄(1 − D̄)]

    = [avg(YD) − avg(YD)·D̄ − Ȳ·D̄ + avg(YD)·D̄] / [D̄(1 − D̄)]

    = (1 − D̄)·avg(YD) / [D̄(1 − D̄)] − (Ȳ − avg(YD))·D̄ / [D̄(1 − D̄)]

    = ∑_{i=1}^N YiDi / ∑_{i=1}^N Di − ∑_{i=1}^N (1 − Di)Yi / ∑_{i=1}^N (1 − Di).


The intercept

  a = Ȳ − b·D̄

    = Ȳ − { [avg(YD) − Ȳ·D̄] / [D̄(1 − D̄)] }·D̄

    = [(1 − D̄)·Ȳ − avg(YD) + Ȳ·D̄] / (1 − D̄)

    = (Ȳ − avg(YD)) / (1 − D̄)

    = ∑_{i=1}^N (1 − Di)Yi / ∑_{i=1}^N (1 − Di).


Bivariate analysis of experiments

Here, a is the average outcome in the control group, and b is the difference of the treatment and control group averages.

- In a bivariate regression with a 0-1 variable for treatment, the "slope" coefficient is the difference of means between treatment and control groups.

Recall that we used the difference of means to estimate the average causal effect under the Neyman model.

N.B.: equivalence in this simple case does not imply that regression is equivalent to analysis under the potential outcomes model; we will discuss this further . . .
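An illustrative Python check (simulated 0-1 treatment and outcomes; all names and effect sizes are invented for the example) that the fitted slope is exactly the difference of group means and the intercept is the control-group mean:

    import numpy as np

    rng = np.random.default_rng(1)
    D = rng.integers(0, 2, size=200)               # 0-1 treatment assignment (toy)
    Y = 5.0 + 2.0 * D + rng.normal(size=200)       # toy outcomes

    b = np.cov(Y, D, bias=True)[0, 1] / np.var(D)  # the regression "slope"
    a = Y.mean() - b * D.mean()                    # the regression intercept

    print(b, Y[D == 1].mean() - Y[D == 0].mean())  # slope = difference of means
    print(a, Y[D == 0].mean())                     # intercept = control-group mean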


Experiments with multiple levels

Consider an experiment where the dependent variable Yi is average annual wages. Subjects are randomly assigned to complete middle school (Xi = 1), high school (Xi = 2), university (Xi = 3), or a Ph.D. (Xi = 4), and are tracked later in life to measure Yi. (Just suppose . . . )

A researcher fits a regression to data from this experiment and finds the following fitted equation:

  Ŷi = $22,000 + $18,000·Xi.   (44)

(Here, "fitted" means we calculate the intercept and slope using the formula for the regression line.)


Now, suppose the researcher claims the following: "Each additional block of education garners an additional $18,000 in income on average, above the wages of $22,000 that a person with no education would expect to earn."

What might concern you about this statement?

- Linearity: does the average of Y depend linearly on X? Maybe not (sad to say for you future Ph.D.s . . . ).

- Extrapolation: there's no one in the study group assigned to zero education.

- Predicting the results of interventions: we're probably on better ground here, because the researcher actually did the intervention. Still, the results may not hold for other study groups, etc.

Lesson: doing the experiment doesn't justify the linear regression.

It may be better to calculate the average response separately for each treatment condition (a series of dummy variables in the regression would do the trick, as sketched below); here, linearity might be assessed empirically.
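A hedged Python sketch of the dummy-variable approach (the data, level means, and names are invented for illustration): regressing Y on one indicator per level, with no intercept, reproduces the per-level average responses, which can then be inspected for (non)linearity.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.integers(1, 5, size=400)  # levels 1-4, randomly assigned (toy)
    level_means = {1: 30_000, 2: 45_000, 3: 70_000, 4: 80_000}  # deliberately nonlinear
    Y = np.array([level_means[k] for k in X]) + rng.normal(0.0, 5_000.0, size=400)

    # One dummy per level, no intercept: coefficients are the group averages.
    dummies = np.column_stack([(X == k).astype(float) for k in (1, 2, 3, 4)])
    coefs, *_ = np.linalg.lstsq(dummies, Y, rcond=None)

    for k in (1, 2, 3, 4):
        print(k, coefs[k - 1], Y[X == k].mean())  # coefficient = average response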

Lecture Notes, Week 4 56/ 62

Page 140: Plsc 503 spring 2013 lecture 4

Why Is It Called Regression?The Correlation CoefficientTwo facts about regression

The Analysis of Variance and the R2

Regression Minimizes the Sum of Squared ResidualsUsing bivariate analysis

ExperimentsAn application on observational data

Experiments with multiple levels

Now, suppose the researcher claims the following: “Eachadditional block of education garners an additional $18,000 inincome on average, above the wages of $22,000 that aperson with no education would expect to earn.”What might concern you about this statement?

I Linearity: does the average of Y depend linearly on X? Maybenot (sad to say for you future Ph.D.s . . .)

I Extrapolation: There’s no one in the study group assigned tozero education.

I Predicting the results of interventions: we’re probably on betterground here because the researcher actually did theintervention. Still, may not hold for other study groups, etc.

Lesson: doing the experiment doesn’t justify the linearregression.It may be better to calculate the average response separatelyfor each treatment condition (a series of dummy variables inthe regression would do the trick)–here, linearity might beassessed empirically.

Lecture Notes, Week 4 56/ 62

Page 141: Plsc 503 spring 2013 lecture 4

Why Is It Called Regression?The Correlation CoefficientTwo facts about regression

The Analysis of Variance and the R2

Regression Minimizes the Sum of Squared ResidualsUsing bivariate analysis

ExperimentsAn application on observational data

Experiments with multiple levels

Now, suppose the researcher claims the following: “Eachadditional block of education garners an additional $18,000 inincome on average, above the wages of $22,000 that aperson with no education would expect to earn.”What might concern you about this statement?

I Linearity: does the average of Y depend linearly on X? Maybenot (sad to say for you future Ph.D.s . . .)

I Extrapolation: There’s no one in the study group assigned tozero education.

I Predicting the results of interventions: we’re probably on betterground here because the researcher actually did theintervention. Still, may not hold for other study groups, etc.

Lesson: doing the experiment doesn’t justify the linearregression.It may be better to calculate the average response separatelyfor each treatment condition (a series of dummy variables inthe regression would do the trick)–here, linearity might beassessed empirically.

Lecture Notes, Week 4 56/ 62

Page 142: Plsc 503 spring 2013 lecture 4

Why Is It Called Regression?The Correlation CoefficientTwo facts about regression

The Analysis of Variance and the R2

Regression Minimizes the Sum of Squared ResidualsUsing bivariate analysis

ExperimentsAn application on observational data

Experiments with multiple levels

Now, suppose the researcher claims the following: “Eachadditional block of education garners an additional $18,000 inincome on average, above the wages of $22,000 that aperson with no education would expect to earn.”What might concern you about this statement?

I Linearity: does the average of Y depend linearly on X? Maybenot (sad to say for you future Ph.D.s . . .)

I Extrapolation: There’s no one in the study group assigned tozero education.

I Predicting the results of interventions: we’re probably on betterground here because the researcher actually did theintervention. Still, may not hold for other study groups, etc.

Lesson: doing the experiment doesn’t justify the linearregression.It may be better to calculate the average response separatelyfor each treatment condition (a series of dummy variables inthe regression would do the trick)–here, linearity might beassessed empirically.

Lecture Notes, Week 4 56/ 62



The political economy of strike activity

Hibbs (1978) is interested in changes in industrial strike activity in advanced capitalist democracies in the twentieth century.

He observes a shift in strike volume in many countries after the Second World War (but not a marked decline in the late 1960s, as other scholars had alleged).

His hypothesis is that changes in the post-war political economy (the rise of Left-Labor governments and the welfare state) shifted the locus of conflict away from private firms.

Lecture Notes, Week 4 57/ 62



Inter- and post-war strike volumes, by country

[Figure omitted: Hibbs (1978), Fig. 3, a bar chart of average strike volumes by country, comparing the inter-war period (1918-38) with the post-war period (1944-72). Original caption: "Average strike volumes, inter-war and post-war periods. Inter-war data are incomplete for some nations; 1945 is excluded for Sweden."]

. . . most cases to negligible levels. This contrasts sharply with the record for most of the other countries. Elsewhere industrial conflict has either oscillated about the same average level for approximately three-quarters of a century, or has actually increased somewhat during the post-war era.

2. INDUSTRIAL CONFLICT AND THE POLITICAL ECONOMY OF DISTRIBUTION

What explains the long-run patterns of strike trends summarized in Fig. 3? I hardly need mention that bearing on this question is an enormous literature on the evolution of class relations in industrial capitalist societies. However, the arguments are . . .

Lecture Notes, Week 4 58/ 62


Left governments and changes in strike volume

[Figure omitted: Hibbs (1978), Fig. 4, a scatterplot with countries labeled (United States, Canada, Finland, France, Italy, Belgium, Netherlands, Denmark, United Kingdom, Sweden, Norway) and the correlation r = -.96 printed on the plot. Horizontal axis (0 to 100%): change in Socialist-Labour and Communist percentage cabinet representation, post-war minus inter-war mean; vertical axis: change in strike volume, running from about +400 down to -1,800. Original caption: "Change in average strike volume and average Socialist-Labour and Communist percentage of cabinet representation, inter-war to post-war period (Sources: for strike data see Hibbs, Strikes; for cabinet data see Statesman's Yearbook, various issues)."]

. . . strike activity during the post-war era are largely those where Social Democratic and Labour parties based on the working class and trade unions successfully mobilized mass political support in the electoral arena, gained control (or at least shared control) of the state, and sharply expanded the scope of collective consumption and distribution. This historical development in the political economy of distribution in these societies represented a massive shift of political power away from business interests and their middle-class allies to the 'organized working class'. Some idea of the close association between the evolution of strike activity and the shift of political power between the social classes is given by Fig. 4, which shows a scatterplot of the inter-war to post-war change in average strike volume and the inter-war to post-war change in the average percentage of cabinet (executive) posts held by Socialist, Labour, and Communist parties. (The years in which Socialist/Labour parties were continuously in power or alternated regularly in power with bourgeois parties are identified on the strike volume and participation plots in Fig. 2, pp. 158-60.)

The variables in Fig. 4 clearly exhibit a strong linear association (the correlation is -.96; see n. 19); at one extreme of the Figure lie the countries in which centre and right-wing . . .

19. The nations with the highest inter-war strike means (Norway, Sweden, the United Kingdom) have the largest absolute changes (declines) in strike volume and contribute heavily to the high . . .
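As a quick sketch of what that summary statistic is doing, a correlation like the one reported in Fig. 4 is just the average product of standardized scores. The paired values below are hypothetical stand-ins (not Hibbs's data), chosen only to produce a strongly negative r:

```python
# Minimal sketch: the correlation coefficient as the average product of
# standardized scores, r = mean(z_x * z_y), using SDs with n in the
# denominator. The paired values are hypothetical, not Hibbs's data.
import numpy as np

# x: change in Left-Labor cabinet share; y: change in strike volume.
x = np.array([0, 5, 10, 20, 35, 50, 60, 75, 90])
y = np.array([300, 150, 100, -100, -300, -700, -900, -1200, -1500])

zx = (x - x.mean()) / x.std()   # standardize x (population SD)
zy = (y - y.mean()) / y.std()   # standardize y
r = (zx * zy).mean()            # average product of z-scores

print(f"r = {r:.2f}")                                   # strongly negative
print(f"check vs numpy: {np.corrcoef(x, y)[0, 1]:.2f}")  # same value
```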

Lecture Notes, Week 4 59/ 62


Use of bivariate analysis in Hibbs

Obviously, there are causal hypotheses involved in this argument: the rise of Left-Labor governments and the welfare state caused the decline in strike activity.

However, Hibbs is also using a simple bivariate scatterplot to summarize the data, and to show a quite striking relationship between strike activity and Left-Labor governments.

The analysis is simple, but the relationship is suggestive. The plots with the country names displayed are also a nice touch.
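A country-labeled scatterplot of this kind is easy to produce. Here is a minimal matplotlib sketch, with entirely hypothetical coordinates standing in for Hibbs's data:

```python
# Sketch of a country-labeled scatterplot in the style of Hibbs's Fig. 4.
# Coordinates are hypothetical; the point is the annotation pattern.
import matplotlib.pyplot as plt

data = {  # country: (x, y) placeholder values
    "Norway": (90, -1500),
    "Sweden": (85, -1300),
    "United Kingdom": (60, -900),
    "France": (10, 100),
    "United States": (0, 250),
}

fig, ax = plt.subplots()
for country, (x, y) in data.items():
    ax.scatter(x, y, color="black", s=15)
    ax.annotate(country, (x, y), textcoords="offset points", xytext=(5, 3))

ax.set_xlabel("Change in Left-Labor cabinet representation (%)")
ax.set_ylabel("Change in average strike volume")
plt.show()
```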

Lecture Notes, Week 4 60/ 62



Ecological correlations

One caveat is that taking averages (the mean change in strike volume and the mean change in Left-Labor cabinet representation) smooths out a lot of the variability, and so boosts the R2.

This is similar to the problem of ecological correlation (FPP Ch. 9.4).
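A minimal simulation (all numbers made up) shows the aggregation effect: the correlation computed from group means is typically far stronger than the correlation in the underlying individual-level data.

```python
# Sketch: aggregation inflates correlation. Individuals are noisy around
# a common linear trend; averaging within groups smooths out the noise,
# so the correlation of group means is much closer to 1 in magnitude.
import numpy as np

rng = np.random.default_rng(1)

n_groups, n_per = 10, 200
centers = np.linspace(0, 9, n_groups)           # group-level x locations

x = np.repeat(centers, n_per) + rng.normal(0, 1.0, n_groups * n_per)
y = 2 * x + rng.normal(0, 8.0, n_groups * n_per)  # weak signal, much noise
groups = np.repeat(np.arange(n_groups), n_per)

x_bar = np.array([x[groups == g].mean() for g in range(n_groups)])
y_bar = np.array([y[groups == g].mean() for g in range(n_groups)])

print(f"individual-level r:         {np.corrcoef(x, y)[0, 1]:.2f}")   # modest
print(f"group-mean (ecological) r:  {np.corrcoef(x_bar, y_bar)[0, 1]:.2f}")
```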

Lecture Notes, Week 4 61/ 62



Readings

Freedman, Pisani, and Purves (2007), Chapters 8-12.

Freedman and Lane (1981), e.g. pp. 48-86. (Work the exercises!)

Angrist and Pischke (2009: 29-40) discuss the graph of averages as the "conditional expectation function."

The normal equations for the method of least squares are derived in Wooldridge (2003, Chapter 3 and Appendix 2A).

An abbreviated discussion of the regression line is in Freedman (2007, Chapter 2).

Lecture Notes, Week 4 62/ 62
