# 20.sykes .regression

Post on 07-Apr-2018

216 views

Embed Size (px)

TRANSCRIPT

8/6/2019 20.Sykes .Regression

1/33

The Inaugural Coase Lecture

An Introduction to Regression Analysis

Alan O. Sykes*

Regression analysis is a statistical tool for the investigation of re-lationships between variables. Usually, the investigator seeks toascertain the causal effect of one variable upon anotherthe effect ofa price increase upon demand, for example, or the effect of changesin the money supply upon the inflation rate. To explore such issues,

the investigator assembles data on the underlying variables ofinterest and employs regression to estimate the quantitative effect ofthe causal variables upon the variable that they influence. Theinvestigator also typically assesses the statistical significance of theestimated relationships, that is, the degree of confidence that thetrue relationship is close to the estimated relationship.

Regression techniques have long been central to the field of eco-nomic statistics (econometrics). Increasingly, they have becomeimportant to lawyers and legal policy makers as well. Regression hasbeen offered as evidence of liability under Title VII of the CivilRights Act of, as evidence of racial bias in death penalty litiga-tion, as evidence of damages in contract actions, as evidence ofviolations under the Voting Rights Act, and as evidence of damagesin antitrust litigation, among other things.

In this lecture, I will provide an overview of the most basic tech-niques of regression analysishow they work, what they assume,

Professor of Law, University of Chicago, The Law School. I thank DonnaCote for helpful research assistance.

See, e.g, Bazemore v. Friday, U.S., ().See, e.g., McClesky v. Kemp, U.S. ().See, e.g., Cotton Brothers Baking Co. v. Industrial Risk Insurers, F.d

(th Cir. ).See, e.g., Thornburgh v. Gingles, U.S. ().See, e.g., Sprayrite Service Corp. v. Monsanto Co., F. d (th Cir.

).

8/6/2019 20.Sykes .Regression

2/33

Chicago Working Paper in Law & Economics

and how they may go awry when key assumptions do not hold. Tomake the discussion concrete, I will employ a series of illustrationsinvolving a hypothetical analysis of the factors that determine indi-vidual earnings in the labor market. The illustrations will have alegal flavor in the latter part of the lecture, where they willincorporate the possibility that earnings are impermissibly influencedby gender in violation of the federal civil rights laws. I wish toemphasize that this lecture is not a comprehensive treatment of thestatistical issues that arise in Title VII litigation, and that thediscussion of gender discrimination is simply a vehicle for expositingcertain aspects of regression technique.Also, of necessity, there are

many important topics that I omit, including simultaneous equationmodels and generalized least squares. The lecture is limited to theassumptions, mechanics, and common difficulties with single-equation, ordinary least squares regression.

. What is Regression?

For purposes of illustration, suppose that we wish to identify andquantify the factors that determine earnings in the labor market. Amoments reflection suggests a myriad of factors that are associatedwith variations in earnings across individualsoccupation, age, ex-perience, educational attainment, motivation, and innate abilitycome to mind, perhaps along with factors such as race and gender

that can be of particular concern to lawyers. For the time being, letus restrict attention to a single factorcall it education. Regressionanalysis with a single explanatory variable is termed simple regres-sion.

See U.S.C. e- (), as amended.Readers with a particular interest in the use of regression analysis under Title

VII may wish to consult the following references: Campbell, Regression Analysisin Title VII CasesMinimum Standards, Comparable Worth, and Other IssuesWhere Law and Statistics Meet,Stan. L. Rev. (); Connolly, The Useof Multiple Rgeression Analysis in Employment Discrimination Cases, Population Res. and Pol. Rev. (); Finkelstein, The Judicial Reception of

Multiple Regression Studies in Race and Sex Discrimination Cases, Colum. L.Rev. (); and Fisher, Multiple Regression in Legal Proceedings, Colum. L. Rev. (), at .

8/6/2019 20.Sykes .Regression

3/33

A Introduction to Regression Analysis

. Simple Regression

In reality, any effort to quantify the effects of education uponearnings without careful attention to the other factors that affectearnings could create serious statistical difficulties (termed omittedvariables bias), which I will discuss later. But for now let us assumeaway this problem. We also assume, again quite unrealistically, thateducation can be measured by a single attributeyears of school-ing. We thus suppress the fact that a given number of years in schoolmay represent widely varying academic programs.

At the outset of any regression study, one formulates some hy-pothesis about the relationship between the variables of interest,

here, education and earnings. Common experience suggests thatbetter educated people tend to make more money. It further suggeststhat the causal relation likely runs from education to earnings ratherthan the other way around. Thus, the tentative hypothesis is thathigher levels of education cause higher levels of earnings, otherthings being equal.

To investigate this hypothesis, imagine that we gather data oneducation and earnings for various individuals. Let E denote educa-tion in years of schooling for each individual, and let I denote thatindividuals earnings in dollars per year. We can plot this informa-tion for all of the individuals in the sample using a two-dimensional

diagram, conventionally termed a scatter diagram. Each point inthe diagram represents an individual in the sample.

8/6/2019 20.Sykes .Regression

4/33

Chicago Working Paper in Law & Economics

The diagram indeed suggests that higher values ofE tend toyield higher values ofI, but the relationship is not perfectit seemsthat knowledge ofEdoes not suffice for an entirely accurate predic-tion about I. We can then deduce either that the effect of educationupon earnings differs across individuals, or that factors other thaneducation influence earnings. Regression analysis ordinarily

embraces the latter explanation. Thus, pending discussion below ofomitted variables bias, we now hypothesize that earnings for eachindividual are determined by education and by an aggregation ofomitted factors that we term noise.

To refine the hypothesis further, it is natural to suppose thatpeople in the labor force with no education nevertheless make some

More accurately, what one can infer from the diagram is that if knowledge ofEsuffices to predictIperfectly, then the relationship between them is a complex,nonlinear one. Because we have no reason to suspect that the true relationshipbetween education and earnings is of that form, we are more likely to concludethat knowledge ofEis not sufficient to predictIperfectly.

The alternative possibilitythat the relationship between two variables isunstableis termed the problem of random or time varying coefficients andraises somewhat different statistical problems. See, e.g., H . Theil, Principles ofEconometrics(); G. Chow,Econometrics().

8/6/2019 20.Sykes .Regression

5/33

A Introduction to Regression Analysis

positive amount of money, and that education increases earningsabove this baseline. We might also suppose that education affects in-come in a linear fashionthat is, each additional year of schoolingadds the same amount to income. This linearity assumption is com-mon in regression studies but is by no means essential to the appli-cation of the technique, and can be relaxed where the investigatorhas reason to suppose a priori that the relationship in question isnonlinear.

Then, the hypothesized relationship between education andearnings may be written

I= + E+ where

= a constant amount (what one earns with zero education);

= the effect in dollars of an additional year of schooling on in-come, hypothesized to be positive; and

= the noise term reflecting other factors that influence earn-ings.

The variableI is termed the dependent or endogenous vari-able;E is termed the independent, explanatory, or exogenous

variable; is the constant term and the coefficient of the vari-

ableE.Remember what is observable and what is not. The data set

contains observations for IandE. The noise component is com-prised of factors that are unobservable, or at least unobserved. The

parameters and are also unobservable. The task of regressionanalysis is to produce an estimate of these two parameters, based

When nonlinear relationships are thought to be present, investigators typi-cally seek to model them in a manner that permits them to be transformed into

linear relationships. For example, the relationship y = cx can be transformed into

the linear relationship log y = log c + logx. The reason for modeling nonlinearrelationships in this fashion is that the estimation of linear regressions is muchsimpler and their statistical properties are better known. Where this approach isinfeasible, however, techniques for the estimation of nonlinear regressions havebeen developed. See, e.g., G. Chow, supra note , at .

8/6/2019 20.Sykes .Regression

6/33

Chicago Working Paper in Law & Economics

upon the information contained in the data set and, as shall be seen,

upon some assumptions about the characteristics of.

To understand how the parameter estimates are generated, note

that if we ignore the noise term , the equation above for the rela-tionship between IandE is the equatio