
    Correlation and Regression Geography 450, Urban Research Elvin Wyly

    “To avoid falling for the post hoc fallacy and thus wind up believing many things that are not so, you need to put any statement of relationship through a sharp inspection. The correlation, that convincingly precise figure that seems to prove that something is because of something, can actually be any of several types.”1 “The correlation coefficient is the most commonly seen measure of association between two variables. It is often denoted r or R, and sometimes by the Greek r, ρ (rho). ... the correlation coefficient, R, is not always a sufficient summary of association, but it is useful and often used. The fact is that no ideal summary numbers exist.”2

    Suppose we’re doing a study of working-class housing in the Vancouver region, and we’re interested in the circumstances of people who live as renters in mobile homes. We have survey responses from a sample of people, and two of the questions deal with the total monthly rent, and the total household income. The survey responses are listed in Table 1. What is the relationship between total monthly rent and total household income? Does rent co-vary with income? In other words, is there a correlation between these two measures?

1 Darrell Huff (1954). How to Lie With Statistics. New York: W.W. Norton, p. 89.
2 Loren Haskins and Kirk Jeffrey (1990). Understanding Quantitative History. Cambridge, MA: MIT Press, p. 234.

Table 1. Rent and Income for a Sample of Renters in Mobile Homes, Vancouver CMA, 2001.

    Gross monthly rent    Total household income
                  1033                     60000
                   179                     38595
                   850                     34267
                   608                     23071
                   413                     34300
                   710                     50165
                   850                     29064
                   726                     61506
                   350                     45382
                    99                     33501
                   425                     59000
                   825                     32804
                   792                     28688
                  1192                     38513
                   718                     43411
                    99                     16864
                    99                     12000
                    99                     24312
                  1300                     46624
                   560                     20608
                    99                     27214

Data Source: Statistics Canada (2005). 2001 Census, Public Use Microdata File (PUMF), households and housing file. Ottawa: Statistics Canada.

When two or more things co-vary with one another, they share variance. If households with higher incomes tend to have higher monthly rents, and if those with lower incomes tend to also have lower rents, then these two variables have a positive covariance. If the opposite held -- if households with low incomes tended to have higher rents and those with high incomes had lower rents -- then the two measures would have a negative or inverse covariance. Given everything that we know about household finances and housing markets in this society, in this region, at this point in time, we would not expect to observe negative covariance between income and rent for renters living in mobile homes. It would be logical to anticipate some kind of positive covariance. But what is covariance? Recall that variance is one of the measures of the 'spread' of a set of numerical scores. Take the difference between each observation and its mean, square the result, add up all the squared deviations, and then divide by the number of observations to obtain the mean squared deviation:

s^2 = \frac{\sum (X - \bar{X})^2}{n}

In some textbooks you'll see an equation that is just a little bit different:

s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}

Statistical purists emphasize that when calculating the variance for a sample, the denominator should be n - 1 rather than n in order to provide an unbiased estimate. This adjustment doesn't make much of a difference when n is large, but it does matter a lot if you're working with a small sample. Covariance is the product of the differences, for two separate variables, of each score from its mean value. Instead of multiplying a score's deviation from the mean by itself (that is, squaring it), we multiply the deviation by the corresponding deviation from the mean for another variable. For variables X and Y, then, covariance is calculated as

COV(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n}

As in the case of the equation for the variance, when you're working with a sample, the equation is

COV(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}

In both of these equations, however, covariance is affected by the scale of measurement of the two variables. In our case, rent is measured on a scale that varies by several hundred dollars, while the range for income is many thousands of dollars. If we multiply the denominator by the product of the standard deviations of both variables, we effectively standardize the covariance. This creates a ratio that will always range between -1.0 and +1.0, no matter what the measurement scale of the original variables (kilometers, liters, thousands of dollars, etc.). The only restriction is that the variables must be measured on an interval or ratio scale. The standardized covariance is known as the correlation coefficient:

r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1) s_x s_y}

The correlation coefficient is often called Pearson's r, or Pearson's product-moment correlation coefficient. Karl Pearson developed this measure in 1895, as part of a series of breakthroughs in measurement, probability theory, and the assessment of "goodness of fit" between observed patterns and expectations derived either from a priori theory or an assumed benchmark of pure, random variation.3 If this equation looks a bit cumbersome or complicated, just keep in mind that expressing a score in terms of standard deviations from its mean is the same as taking its z-score. So the correlation coefficient can also be calculated as

r = \frac{\sum z_x z_y}{n - 1}

Table 2 shows the calculations for the variance, covariance, and then the correlation coefficient for our small sample of renters in mobile homes in the Vancouver region. The covariance is positive -- as we expected it would be -- and the correlation coefficient is also positive. Correlation coefficients range between -1.0 and +1.0. If two variables have no relationship whatsoever, the correlation will be close to zero. Two variables that approach "perfect" positive correlation will have a coefficient close to +1.0. Two variables that approach perfect negative correlation will have a coefficient near -1.0. If we take the square of the correlation coefficient, we obtain the coefficient of determination, r2. The coefficient of determination ranges from 0 to +1.0, and it has a more interesting and valuable property: r2 measures the proportion of variance that two variables share. For our example, r2 is 0.2021. This means that 20.21 percent of the variance in monthly rents for mobile home renters in the Vancouver region can be associated with the variance in total household income.

    3 For a fascinating history, see M. Eileen Magnello (1999). “The Non-Correlation of Biometrics and Eugenics: Rival Forms of Laboratory Work in Karl Pearson’s Career at University College London, Part 1.” History of Science 37, 79-106, especially p. 96.
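The worksheet arithmetic can be cross-checked in a few lines of code. Here is an illustrative Python sketch (not part of the course's Stata workflow) using the rent and income pairs from Table 1:

```python
from statistics import mean, stdev

# Rent and income pairs transcribed from Table 1
rents = [1033, 179, 850, 608, 413, 710, 850, 726, 350, 99, 425,
         825, 792, 1192, 718, 99, 99, 99, 1300, 560, 99]
incomes = [60000, 38595, 34267, 23071, 34300, 50165, 29064, 61506,
           45382, 33501, 59000, 32804, 28688, 38513, 43411, 16864,
           12000, 24312, 46624, 20608, 27214]

n = len(rents)
xbar, ybar = mean(rents), mean(incomes)

# Sample covariance: sum of cross-products divided by n - 1
cov = sum((x - xbar) * (y - ybar) for x, y in zip(rents, incomes)) / (n - 1)

# Standardize by the two sample standard deviations to get Pearson's r
r = cov / (stdev(rents) * stdev(incomes))
r_squared = r ** 2
```

The results match the worksheet: the covariance comes out at about 2363335.37, r at 0.4495, and r2 at 0.2021.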

Table 2. Calculating the Variance, Covariance, and Correlation.

      (1)         (2)        (3)        (4)         (5)          (6)         (7)
    Gross  Diff. from    Squared      Total  Diff. from      Squared  Col. (2) x
  monthly        mean      diff.  household        mean        diff.    Col. (5)
     rent                            income
     1033         460     211907      60000       23815    567145153    10962751
      179        -394     154973      38595        2410      5807182     -948662
      850         277      76914      34267       -1918      3679455     -531978
      608          35       1248      23071      -13114    171981992     -463368
      413        -160      25493      34300       -1885      3553943      301002
      710         137      18860      50165       13980    195435074     1919894
      850         277      76914      29064       -7121     50711354    -1974943
      726         153      23511      61506       25321    641143395     3882524
      350        -223      49580      45382        9197     84581305    -2047823
       99        -474     224360      33501       -2684      7204879     1271412
      425        -148      21805      59000       22815    520515534    -3368987
      825         252      63672      32804       -3381     11432449     -853187
      792         219      48107      28688       -7497     56207865    -1644384
     1192         619     383574      38513        2328      5418697     1441690
      718         145      21122      43411        7226     52212323     1050151
       99        -474     224360      16864      -19321    373308401     9151804
       99        -474     224360      12000      -24185    584923438    11455719
       99        -474     224360      24312      -11873    140972652     5623935
     1300         727     529014      46624       10439    108968744     7592494
      560         -13        160      20608      -15577    242648863      197311
       99        -474     224360      27214       -8971     80482259     4249354

Mean                 573                36185
Variance          141433            195416748
Std. deviation       376                13979

Sum of the products (column 7): A = 47266707.3
A divided by N-1 is the covariance: B = 2363335.37
N-1 x (product of the two standard deviations): C = 105144366

A divided by C gives the correlation coefficient: r = 0.4495.

Assessing the Significance of r

If we're working with sample data, we know that our results will be different if we draw a different random sample. Correlation coefficients, like means, ratios, and other parameters, are subject to random sampling variability. If we find a particular correlation coefficient in our sample (r), how can we know whether r is just the product of chance, random sampling variability? Perhaps the r we observe is just random, chance variation that would lead us to believe there is a relationship when in fact the true, population correlation coefficient (ρ) is actually zero? If we can safely assume that the data for each variable come from a population distribution that is normal, and if we can safely assume that observations are independent -- that is, that one observation for x does not affect the other observations of x, and the same holds for y -- then we can use a t-test to evaluate the significance of a sample correlation coefficient:

t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}

If the null hypothesis is correct -- if, in fact, there is no true correlation in the population -- then this statistic will follow a Student's t distribution, with n-2 degrees of freedom. This means that even if there is no true correlation in the population, if we were to draw repeated random samples and calculate correlation coefficients for each sample, there would be a sampling distribution something like what appears in Figure 2. Most of the sample correlation coefficients would cluster fairly close to the true zero population correlation. But in a small number of cases -- the "tails" -- we would obtain coefficients very far away from zero. The shape of this distribution depends on the degrees of freedom -- the number of sampled observations minus two (to adjust for the calculation of standard deviations from two different variables). So we calculate the t statistic using the formula above, and then look up the critical values of the t distribution in an appendix of any standard statistics textbook.
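The formula is easy to apply directly. A minimal Python sketch, using the r and n from our mobile-home example (the 2.093 critical value is taken from a standard t table for df = 19, two-tailed, at the 0.05 level):

```python
import math

# Values from the mobile-home example: r from Table 2, n = 21 households
r, n = 0.4495, 21
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-tailed critical value at the 0.05 level for df = n - 2 = 19
t_crit = 2.093
reject_null = abs(t) > t_crit
```

The computed t of about 2.19 exceeds the critical value, so the null hypothesis of zero population correlation is rejected.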

The Distribution of Pearson’s r. Source: Perry R. Hinton (1995). Statistics Explained. New York: Routledge, p. 261.

For our example, the formula yields a t value of 2.19. For df=19 (our sample of 21 households minus 2), a table of "Critical Values of the t Distribution" indicates that in ninety-five percent of all random samples when the population correlation coefficient is zero, the t statistic will be between -2.093 and +2.093. Since our t value is outside this range, we can reject the null hypothesis. We do have sufficient evidence to conclude that there is a statistically significant correlation between the monthly rents paid and the total household income of renters living in mobile homes in the Vancouver region.

Correlation in Stata

Fortunately, we don't have to go through all the tedious calculations that Karl Pearson (or, to be much more accurate, Karl Pearson's many hardworking assistants) had to do in the 1890s. Make sure the 2001 Census of Canada PUMF is located in your c:\data\pumf directory, and then open Stata and issue the following commands:

set memory 200m
use "c:\data\pumf\2001hh.dta"
corr grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000


The "corr" command asks for an analysis of the correlation between grosrth (monthly gross rent) and totinch (total household income). All the specifications after the "if" narrow the analysis to renters (tenurh 2) who live in dwellings classified as "mobile home or other movable dwelling" (dtypeh 8); finally, the analysis excludes households with annual incomes of less than $10,000 or more than $70,000. After you submit these commands, your screen should look something like this: The correlation between grosrth (monthly gross rent) and totinch (total household income) is 0.4495 for all of the households who meet the criteria in that command. The figure of 0.4495 is precisely what we calculated in the worksheet shown in Table 2. If you would like to request a t-test for the significance of the correlation coefficient, the command is a little bit different:

pwcorr grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000, sig

    which gives you this


The figure below the correlation coefficient -- the 0.0409 below the 0.4495 -- indicates that, given the sample size we're working with, random sampling variability means that about 4 percent of the time, a random sample will yield a correlation this large even when the correlation in the population is actually zero. Any correlation with a probability below 0.05 is usually regarded as "statistically significant" -- meaning that it probably did not occur solely through chance, random sampling variability. Statistical significance is not the same as practical significance, however. As sample sizes increase, even very small correlation coefficients will yield t statistics that lie in the extreme ranges of the tails of the t distribution. This means that analysts who are working with small sample sizes tend to "accept" correlation coefficients as meaningful if they pass a t test at P < 0.05. To see the relationship for yourself, draw a scatterplot:

twoway scatter grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000
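For the curious, the probability that Stata reports can be approximated numerically from the t distribution itself. This is an illustrative Python sketch (Stata does this for you); the integration is a plain trapezoid rule over the upper tail, not a library routine:

```python
import math

def two_sided_p(t, df, hi=100.0, steps=100000):
    """Two-sided p-value from the Student's t density, integrated numerically."""
    # Normalizing constant of the t density with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    # Trapezoid rule over [t, hi]; the tail beyond hi is negligible for df = 19
    h = (hi - t) / steps
    area = 0.5 * (pdf(t) + pdf(hi)) + sum(pdf(t + i * h) for i in range(1, steps))
    return 2 * area * h

r, n = 0.4495, 21
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
p = two_sided_p(t, n - 2)
```

The result is approximately 0.041, matching the probability printed by pwcorr.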

[Scatterplot: monthly gross rent (vertical axis, $0 to $1,500) against total household income (horizontal axis, $10,000 to $60,000)]

This is an immediately intuitive confirmation of a positive -- but weak -- relationship between total household income and monthly gross rent for our sample of households. In fact, the relationship is even weaker than it appears. Since I couldn't bring myself to work all the way through the calculations for a pathetically small correlation coefficient, the Stata commands we've been using in this example have excluded households with incomes lower than $10,000 per year, and a small number of households with incomes over $70,000. Get rid of these restrictions and draw the scatterplot again:

twoway scatter grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8

[Scatterplot: monthly gross rent ($0 to $1,500) against total household income ($0 to $100,000), with the income restrictions removed]

Here, the relationship is even weaker. The correlation coefficient for this graph is 0.0981, which means that only 0.96 percent -- less than one percent -- of the variance in monthly rents can be associated with variance in total household income. Given the sample size and the small value of the coefficient, a t test fails to reject the null hypothesis -- and so we cannot have confidence that the observed correlation is anything more than a random sampling fluctuation from a zero correlation in the population. The relationship is a little bit stronger in some places, however. Try these two commands to explore the relations in the Edmonton metropolitan area:

twoway scatter grosrth totinch if cmah==835 & tenurh==2 & dtypeh==8

pwcorr grosrth totinch if cmah==835 & tenurh==2 & dtypeh==8, sig

[Scatterplot: monthly gross rent ($0 to $1,500) against total household income (-$50,000 to $150,000), Edmonton CMA]

The scatter diagram seems to show a somewhat stronger relationship, and the pwcorr command -- "pairwise correlation" -- yields a coefficient of 0.4340, implying a coefficient of determination that can account for about nineteen percent of all the variance in gross rent values among mobile-home renters. But the sample size is pretty small -- only 18 sampled households -- and thus there is a 0.0719 probability level attached to the t statistic. It's a judgment call as to how much confidence to place in this correlation. Indeed, if we are suspicious about that one sample household in the upper right-hand corner -- if there's any reason to believe that there is something fundamentally unique or un-generalizable about this household with an income of about $135,000 living in a rented mobile home -- then we might make a case for eliminating this "outlier." We can do this by editing the command like this:

twoway scatter grosrth totinch if cmah==835 & tenurh==2 & dtypeh==8 & totinch < 130000

And we get the scatterplot that follows. If we also issue the pwcorr command, we see that the scatter indicates no relationship whatsoever, and the correlation coefficient -- 0.0463 -- implies that only about two-tenths of one percent of the variance in rent levels can be associated with the total household income of renters living in mobile homes in Edmonton. Again, whether it makes sense to exclude the "outlier" household is a subjective judgment call -- one that should be guided by sifting through the data to explore other characteristics of this household, in an attempt to draw a conceptual inference. (For example, perhaps this household is a middle-class family living temporarily in a mobile home while their new, custom home is under construction; we could make a case that this household is different from most other renters living in mobile homes.)

[Scatterplot: monthly gross rent ($0 to $1,000) against total household income (-$50,000 to $100,000), Edmonton CMA, with the outlier excluded]
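The influence of a single extreme household is easy to demonstrate with made-up numbers. In this hypothetical Python sketch (the data are invented for illustration, not drawn from the PUMF), six households show essentially no income-rent relationship until one extreme household is appended:

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    # Pearson's r from the definitional formula
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / ((len(xs) - 1) * stdev(xs) * stdev(ys))

# Hypothetical incomes and rents with essentially no relationship
incomes = [20000, 25000, 30000, 35000, 40000, 45000]
rents = [600, 400, 700, 350, 650, 500]

r_without = pearson_r(incomes, rents)                    # near zero
r_with = pearson_r(incomes + [135000], rents + [1400])   # one extreme household added
```

A single high-income, high-rent household pushes the coefficient from roughly zero up past 0.9 -- which is why a lone point in the corner of a scatterplot deserves scrutiny.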

    You should always draw scatter diagrams when exploring correlations among variables. Scatter diagrams are often the best way to begin your inquiry. This is true even if you’ll eventually calculate correlations and publish the results in a table, like Martin Danyluk and David Ley did when they correlated neighborhood-level gentrification in Vancouver, Toronto, and Montreal to the proportion of workers commuting to work by various means.

    Danyluk and Ley’s Correlation Analysis. Source: Martin Danyluk and David Ley (2007). “Modalities of the New Middle Class: Ideology and Behavior in the Journey to Work from Gentrified Neighbourhoods in Canada.” Urban Studies 44(11), 2195-2210.


    Scatter diagrams are also essential in detecting non-linear relationships (Figure 4). The correlation coefficient measures the strength of a linear relationship between two variables -- and thus it is entirely possible to obtain weak correlation coefficients for relations that are strong but non-linear. A “tight” scatter of observations along something that looks roughly like a straight line will yield a very large correlation coefficient -- approaching +1.0 if the slope is upward to the right, or approaching -1.0 if the slope is down to the right (compare the left and right scatters of the top panel in Figure 4). Conventionally, the vertical axis Y in a correlation analysis is referred to as the dependent variable, and the horizontal axis X is described as the independent variable.
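A quick way to see this limitation is to construct a relationship that is perfectly deterministic but curved, and watch Pearson's r collapse to zero. A small illustrative Python sketch:

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    # Pearson's r from the definitional formula
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / ((len(xs) - 1) * stdev(xs) * stdev(ys))

# y is completely determined by x -- but the relationship is a parabola,
# not a straight line, so the linear correlation is zero
xs = list(range(-5, 6))
ys = [x * x for x in xs]
r = pearson_r(xs, ys)
```

Only a scatter diagram would reveal that these two variables are in fact perfectly related.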

Scatter Diagrams for Various Kinds of Relations between Two Variables. Source: Martin Bland (2005). Clinical Biostatistics, Lecture Notes, Week 7. Toronto: Department of Health Sciences, York University. Available at http://www-users.york.ac.uk/~mb55/msc/clinbio/week7/corr.htm

Regression

If a correlation coefficient is useful to evaluate the strength of a relationship, and if scatter diagrams are useful to convey this information visually, these approaches still leave important questions unanswered: if there is a relationship, what form does it take? To use our example of mobile home renters, how much does monthly rent increase with each unit change in total household income? Answering questions like these requires a technique known as regression.4 Perry Hinton distinguishes correlation from regression

this way: "A linear correlation tells us how close the relationship between two variables is to a straight line. A linear regression is the straight line that best describes the linear relationship between the two variables."5 Let's return to our sample of Vancouver-area mobile home renters:

[Scatterplot: monthly gross rent ($0 to $1,500) against total household income ($10,000 to $60,000), Vancouver CMA]

To describe the straight line that would achieve the "best fit" with these points, we only need to know a few pieces of information. The equation for the straight line would relate the dependent variable (monthly gross rent, Y) to variation in the independent variable (total household income, X) -- while also specifying the point where the line would intersect the vertical axis. In other words, the equation for our line would take the form

Y = a + bX

where Y is the value of the dependent variable, a is the value of the vertical axis where the line intersects it (i.e., where X is equal to zero), and b is the slope coefficient that relates the change in units of X to corresponding changes in the value of Y. Regression involves finding the values of a and b that achieve the "best fit" of a line to the scatter of points. Achieving the best fit requires minimizing the sum of the squared deviations of the dependent variable, Y. This simple approach is often labeled "ordinary least squares" or OLS regression.

4 Why is it called "regression"? The word comes from the Latin regredi, "to go back," and was used by nineteenth-century researchers to describe a phenomenon known as 'reversion to the mean.' Francis Galton, in a series of studies of the heredity of height and other physical characteristics, observed that very tall people tended to have children shorter than themselves (i.e., closer to the average), while very short parents tended to have children who were taller than themselves.
5 Perry Hinton (1995). Statistics Explained. New York: Routledge, p. 262.

The Line of Best Fit. Source: Peter J. Taylor (1977). Quantitative Methods in Geography: An Introduction to Spatial Analysis. Prospect Heights, IL: Waveland Press, p. 198.

The sum of squares is at a minimum when

b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}

    Look carefully at the numerator in this equation. This is identical to the numerator in the equation for the correlation coefficient. The equation expresses the ratio between the joint variation of X and Y and the variation of X with itself (i.e., the sum of the squared deviations). Once we’ve figured out b, then a can be calculated as

a = \bar{Y} - b\bar{X}

The figures in the worksheet in Table 2 can be used to calculate a and b for this small sample; we obtain b=0.0121 and a=135.05. The line of best fit crosses the vertical axis at Y=$135.05 gross monthly rent, and each one-unit increase on the X axis (i.e., one dollar of total household income) yields a corresponding increase in rent of $0.0121. Since the units for the variables are so different, it might help to express the change in rent associated with, say, an increase of $10,000 in total household income: this is associated with an increase in rent of $121. In Stata, issue the following command:

regress grosrth totinch if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000

    and your results panel will give you this:
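The key numbers in that output can also be reproduced by hand from the Table 1 data. Here is an illustrative Python sketch (a cross-check, not a substitute for Stata's regress) that computes the slope, intercept, residuals, standard error of the estimate, and the t statistic for the slope:

```python
from statistics import mean
import math

# Rent (Y) and income (X) for the 21 Table 1 households
rents = [1033, 179, 850, 608, 413, 710, 850, 726, 350, 99, 425,
         825, 792, 1192, 718, 99, 99, 99, 1300, 560, 99]
incomes = [60000, 38595, 34267, 23071, 34300, 50165, 29064, 61506,
           45382, 33501, 59000, 32804, 28688, 38513, 43411, 16864,
           12000, 24312, 46624, 20608, 27214]

n = len(rents)
xbar, ybar = mean(incomes), mean(rents)

# OLS slope and intercept: b = S_xy / S_xx, a = ybar - b * xbar
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(incomes, rents))
s_xx = sum((x - xbar) ** 2 for x in incomes)
b = s_xy / s_xx
a = ybar - b * xbar

# Residuals, the standard error of the estimate (their standard deviation
# with n - 2 degrees of freedom), and the t statistic for the slope
residuals = [y - (a + b * x) for x, y in zip(incomes, rents)]
sse = sum(e * e for e in residuals)
se_est = math.sqrt(sse / (n - 2))
t_slope = b / (se_est / math.sqrt(s_xx))
```

The slope comes out at about 0.0120938 and the intercept at about 135.05, matching the Stata coefficients, and the t statistic on the slope is about 2.19.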

Notice three parts of the output. First, the "R-squared" value in the upper-right corner is 0.2021, which is the coefficient of determination we calculated earlier -- also equivalent to the squared value of the correlation coefficient. About 20.2 percent of the variance in monthly rent levels can be associated with the variance in total household income among mobile home renters in the Vancouver metropolitan area in 2001. The "Adj R-squared" value takes the degrees of freedom into consideration -- such that an analysis with comparatively few observations will be 'penalized' with a lower coefficient of determination. Second, note the "Coef." column in the lower-left corner. The coefficient for totinch is 0.0120938, which is our b value, sometimes called a "beta coefficient." Third, the coefficient for "_cons" is Stata's way of labeling the intercept, which is also sometimes called the "constant."

Notice that Stata also provides a column labeled "Std. Err." When the scatter diagram is diffuse, our line of best fit will provide rather unreliable estimates for the dependent variable. There will be large differences between the line of best fit -- the line of Y values predicted with that Y = a + bX equation -- and the actual values for each sampled household. Notice the right-hand side of the graph, where household income is about $60,000: there are three sample households, with rents ranging from less than $500 to more than $1,000. This introduces considerable uncertainty. The difference between the observed value and the model-predicted value for each observation is known as a residual. If we calculate the residuals for all the observations, they will have their own mean and standard deviation. The standard deviation of the residuals is known as the standard error of the estimate. We can use the standard error of the estimate to calculate a t statistic for the beta coefficient, to test the null hypothesis that the slope in the population is zero, signifying no relationship. In our example, the t-test yields a probability of 0.041, indicating that we can be more than 95 percent confident that the coefficient in the population is not zero. There does seem to be a relationship, although it is a weak one.

Multiple Regression

Our examples thus far are quite simple, with correlations between one variable and another. Things get more interesting when we consider the effect of multiple independent variables on our dependent variable. If we add one more predictor variable, our simple bivariate regression equation, Y = a + bX, becomes a multivariate regression,

Y = a + b_1 X_1 + b_2 X_2

with two separate beta or slope coefficients. Instead of fitting a line to a scatter of points plotted on a two-dimensional graph, we are now fitting a plane to a cloud of points plotted in a three-dimensional space:

Visualizing Multiple Regression as a Sloping Plane. Source: Peter J. Taylor (1977). Quantitative Methods in Geography: An Introduction to Spatial Analysis. Prospect Heights, IL: Waveland Press, p. 208.

There's no need for us to remain in the realm of three dimensions; mathematically, the model can be extended to the general form,


Y = a + b_1 X_1 + b_2 X_2 + ... + b_n X_n

In Stata, issue this command:

regress grosrth totinch roomh if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000

and the results panel will yield this: Our r-squared value has increased from 0.2021 to 0.2234 with the addition of a variable measuring the number of rooms in the dwelling. We cannot, however, simply subtract these two values to determine the amount of variance accounted for by the addition of the new variable; this is because totinch and roomh may themselves be correlated. You can test this by issuing this command:

pwcorr grosrth totinch roomh if cmah==933 & tenurh==2 & dtypeh==8 & totinch > 10000 & totinch < 70000

which gives these results: Note that while rent and income are correlated (0.45), there is a much weaker relation between rent and the number of rooms (0.18); the number of rooms is also related to income (0.071). When independent variables exhibit interdependencies, we have the problem of collinearity; when the problem involves multiple inter-relations amongst the predictors, it's called multicollinearity.
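The overlap between correlated predictors can be demonstrated with a small made-up example. In this hypothetical Python sketch (data invented for illustration), two strongly correlated predictors each "explain" most of the variance in y on their own, so their individual r-squared values sum to far more than the r-squared of the joint model:

```python
from statistics import mean

def pearson_r(xs, ys):
    # Pearson's r from the definitional sums of squares
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((u - xbar) * (v - ybar) for u, v in zip(xs, ys))
    sxx = sum((u - xbar) ** 2 for u in xs)
    syy = sum((v - ybar) ** 2 for v in ys)
    return sxy / (sxx * syy) ** 0.5

def r_squared_two_predictors(y, x1, x2):
    # Fit Y = a + b1*X1 + b2*X2 by solving the 2x2 normal equations
    y0, x10, x20 = mean(y), mean(x1), mean(x2)
    s11 = sum((v - x10) ** 2 for v in x1)
    s22 = sum((v - x20) ** 2 for v in x2)
    s12 = sum((u - x10) * (v - x20) for u, v in zip(x1, x2))
    s1y = sum((u - x10) * (v - y0) for u, v in zip(x1, y))
    s2y = sum((u - x20) * (v - y0) for u, v in zip(x2, y))
    syy = sum((v - y0) ** 2 for v in y)
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return (b1 * s1y + b2 * s2y) / syy   # explained / total sum of squares

# Hypothetical data: x2 is strongly correlated with x1
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 3, 3, 5, 4, 7, 6, 9]
y = [3, 4, 6, 8, 9, 12, 13, 16]

r2_x1 = pearson_r(y, x1) ** 2
r2_x2 = pearson_r(y, x2) ** 2
r2_both = r_squared_two_predictors(y, x1, x2)
```

Because the predictors share variance, r2_both is far less than r2_x1 + r2_x2 -- which is exactly why we cannot simply subtract one model's r-squared from another's to credit a new variable.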