Transcript

    The Inaugural Coase Lecture

    An Introduction to Regression Analysis

    Alan O. Sykes*

    Regression analysis is a statistical tool for the investigation of relationships between variables. Usually, the investigator seeks to ascertain the causal effect of one variable upon another: the effect of a price increase upon demand, for example, or the effect of changes in the money supply upon the inflation rate. To explore such issues, the investigator assembles data on the underlying variables of interest and employs regression to estimate the quantitative effect of the causal variables upon the variable that they influence. The investigator also typically assesses the "statistical significance" of the estimated relationships, that is, the degree of confidence that the true relationship is close to the estimated relationship.

    Regression techniques have long been central to the field of economic statistics ("econometrics"). Increasingly, they have become important to lawyers and legal policy makers as well. Regression has been offered as evidence of liability under Title VII of the Civil Rights Act of 1964, as evidence of racial bias in death penalty litigation, as evidence of damages in contract actions, as evidence of violations under the Voting Rights Act, and as evidence of damages in antitrust litigation, among other things.

    In this lecture, I will provide an overview of the most basic techniques of regression analysis: how they work, what they assume,

    * Professor of Law, University of Chicago, The Law School. I thank Donna Cote for helpful research assistance.

    See, e.g., Bazemore v. Friday, U.S. ().
    See, e.g., McCleskey v. Kemp, U.S. ().
    See, e.g., Cotton Brothers Baking Co. v. Industrial Risk Insurers, F.2d (th Cir. ).
    See, e.g., Thornburg v. Gingles, U.S. ().
    See, e.g., Spray-Rite Service Corp. v. Monsanto Co., F.2d (th Cir. ).


    and how they may go awry when key assumptions do not hold. To make the discussion concrete, I will employ a series of illustrations involving a hypothetical analysis of the factors that determine individual earnings in the labor market. The illustrations will have a "legal" flavor in the latter part of the lecture, where they will incorporate the possibility that earnings are impermissibly influenced by gender in violation of the federal civil rights laws. I wish to emphasize that this lecture is not a comprehensive treatment of the statistical issues that arise in Title VII litigation, and that the discussion of gender discrimination is simply a vehicle for expositing certain aspects of regression technique. Also, of necessity, there are many important topics that I omit, including simultaneous equation models and generalized least squares. The lecture is limited to the assumptions, mechanics, and common difficulties with single-equation, ordinary least squares regression.

    What is Regression?

    For purposes of illustration, suppose that we wish to identify and quantify the factors that determine earnings in the labor market. A moment's reflection suggests a myriad of factors that are associated with variations in earnings across individuals: occupation, age, experience, educational attainment, motivation, and innate ability come to mind, perhaps along with factors such as race and gender that can be of particular concern to lawyers. For the time being, let us restrict attention to a single factor; call it education. Regression analysis with a single explanatory variable is termed "simple regression."

    See 42 U.S.C. § 2000e-2, as amended.
    Readers with a particular interest in the use of regression analysis under Title VII may wish to consult the following references: Campbell, Regression Analysis in Title VII Cases: Minimum Standards, Comparable Worth, and Other Issues Where Law and Statistics Meet, Stan. L. Rev. (); Connolly, The Use of Multiple Regression Analysis in Employment Discrimination Cases, Population Res. and Pol. Rev. (); Finkelstein, The Judicial Reception of Multiple Regression Studies in Race and Sex Discrimination Cases, Colum. L. Rev. (); and Fisher, Multiple Regression in Legal Proceedings, Colum. L. Rev. ().


    Simple Regression

    In reality, any effort to quantify the effects of education upon earnings without careful attention to the other factors that affect earnings could create serious statistical difficulties (termed "omitted variables bias"), which I will discuss later. But for now let us assume away this problem. We also assume, again quite unrealistically, that "education" can be measured by a single attribute: years of schooling. We thus suppress the fact that a given number of years in school may represent widely varying academic programs.

    At the outset of any regression study, one formulates some hypothesis about the relationship between the variables of interest, here, education and earnings. Common experience suggests that better educated people tend to make more money. It further suggests that the causal relation likely runs from education to earnings rather than the other way around. Thus, the tentative hypothesis is that higher levels of education cause higher levels of earnings, other things being equal.

    To investigate this hypothesis, imagine that we gather data on education and earnings for various individuals. Let E denote education in years of schooling for each individual, and let I denote that individual's earnings in dollars per year. We can plot this information for all of the individuals in the sample using a two-dimensional diagram, conventionally termed a "scatter" diagram. Each point in the diagram represents an individual in the sample.
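    A minimal sketch of how such a scatter diagram might be produced, using made-up values and Python's matplotlib (the particular numbers are illustrative assumptions, not data from the lecture):

        import matplotlib.pyplot as plt

        # Hypothetical sample: years of schooling (E) and annual earnings (I)
        E = [8, 10, 12, 12, 14, 16, 16, 18]
        I = [18000, 22000, 27000, 24000, 31000, 38000, 35000, 42000]

        plt.scatter(E, I)  # one point per individual in the sample
        plt.xlabel("Education (years of schooling)")
        plt.ylabel("Earnings (dollars per year)")
        plt.title("Scatter diagram of earnings against education")
        plt.show()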


    The diagram indeed suggests that higher values of E tend to yield higher values of I, but the relationship is not perfect; it seems that knowledge of E does not suffice for an entirely accurate prediction about I. We can then deduce either that the effect of education upon earnings differs across individuals, or that factors other than education influence earnings. Regression analysis ordinarily embraces the latter explanation. Thus, pending discussion below of omitted variables bias, we now hypothesize that earnings for each individual are determined by education and by an aggregation of omitted factors that we term "noise."

    To refine the hypothesis further, it is natural to suppose that people in the labor force with no education nevertheless make some

    More accurately, what one can infer from the diagram is that if knowledge of E suffices to predict I perfectly, then the relationship between them is a complex, nonlinear one. Because we have no reason to suspect that the true relationship between education and earnings is of that form, we are more likely to conclude that knowledge of E is not sufficient to predict I perfectly.

    The alternative possibility, that the relationship between two variables is unstable, is termed the problem of "random" or "time varying" coefficients and raises somewhat different statistical problems. See, e.g., H. Theil, Principles of Econometrics (); G. Chow, Econometrics ().


    positive amount of money, and that education increases earnings above this baseline. We might also suppose that education affects income in a "linear" fashion; that is, each additional year of schooling adds the same amount to income. This linearity assumption is common in regression studies but is by no means essential to the application of the technique, and can be relaxed where the investigator has reason to suppose a priori that the relationship in question is nonlinear.

    Then, the hypothesized relationship between education and earnings may be written

    I = α + βE + ε

    where

    α = a constant amount (what one earns with zero education);

    β = the effect in dollars of an additional year of schooling on income, hypothesized to be positive; and

    ε = the "noise" term reflecting other factors that influence earnings.

    The variable I is termed the "dependent" or "endogenous" variable; E is termed the "independent," "explanatory," or "exogenous" variable; α is the "constant term" and β the "coefficient" of the variable E.

    Remember what is observable and what is not. The data set contains observations for I and E. The noise component ε is comprised of factors that are unobservable, or at least unobserved. The parameters α and β are also unobservable. The task of regression analysis is to produce an estimate of these two parameters, based

    When nonlinear relationships are thought to be present, investigators typically seek to model them in a manner that permits them to be transformed into linear relationships. For example, the relationship y = cx^a can be transformed into the linear relationship log y = log c + a·log x. The reason for modeling nonlinear relationships in this fashion is that the estimation of linear regressions is much simpler and their statistical properties are better known. Where this approach is infeasible, however, techniques for the estimation of nonlinear regressions have been developed. See, e.g., G. Chow, supra note , at .
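    A quick sketch of this transformation (a hypothetical example with assumed values of c and a, not taken from the lecture): regressing log y on log x recovers log c as the intercept and a as the slope.

        import numpy as np

        # Hypothetical data generated from y = c * x**a with c = 2.0 and a = 1.5
        rng = np.random.default_rng(0)
        x = rng.uniform(1.0, 10.0, size=200)
        y = 2.0 * x**1.5 * np.exp(rng.normal(0.0, 0.05, size=200))  # multiplicative noise

        # Fit the linearized model: log y = log c + a * log x
        slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
        print("estimated a:", slope)              # close to 1.5
        print("estimated c:", np.exp(intercept))  # close to 2.0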


    upon the information contained in the data set and, as shall be seen, upon some assumptions about the characteristics of ε.

    To understand how the parameter estimates are generated, note that if we ignore the noise term ε, the equation above for the relationship between I and E is the equation for a line: a line with an intercept of α on the vertical axis and a slope of β. Returning to the scatter diagram, the hypothesized relationship thus implies that somewhere on the diagram may be found a line with the equation I = α + βE. The task of estimating α and β is equivalent to the task of estimating where this line is located.

    What is the best estimate regarding the location of this line? The answer depends in part upon what we think about the nature of the noise term ε. If we believed that ε was usually a large negative number, for example, we would want to pick a line lying above most or all of our data points; the logic is that if ε is negative, the true value of I (which we observe), given by I = α + βE + ε, will be less than the value of I on the line I = α + βE. Likewise, if we believed that ε was systematically positive, a line lying below the majority of data points would be appropriate. Regression analysis assumes, however, that the noise term has no such systematic property, but is on average equal to zero; I will make the assumptions about the noise term more precise in a moment. The assumption that the noise term is usually zero suggests an estimate of the line that lies roughly in the midst of the data, some observations below and some observations above.

    But there are many such lines, and it remains to pick one line in particular. Regression analysis does so by embracing a criterion that relates to the estimated noise term or "error" for each observation. To be precise, define the estimated error for each observation as the vertical distance between the value of I along the estimated line I = α + βE (generated by plugging the actual value of E into this equation) and the true value of I for the same observation. Superimposing a candidate line on the scatter diagram, the estimated errors for each observation may be seen as follows:


    With each possible line that might be superimposed upon the data, a different set of estimated errors will result. Regression analysis then chooses among all possible lines by selecting the one for which the sum of the squares of the estimated errors is at a minimum. This is termed the minimum sum of squared errors (minimum SSE) criterion. The intercept of the line chosen by this criterion provides the estimate of α, and its slope provides the estimate of β.

    It is hardly obvious why we should choose our line using the minimum SSE criterion. We can readily imagine other criteria that might be utilized (minimizing the sum of errors in absolute value, for example). One virtue of the SSE criterion is that it is very easy to employ computationally. When one expresses the sum of squared errors mathematically and employs calculus techniques to ascertain the values of α and β that minimize it, one obtains expressions for α and β that are easy to evaluate with a computer using only the

    It should be obvious why simply minimizing the sum of errors is not an attractive criterion: large negative errors and large positive errors would cancel out, so that this sum could be at a minimum even though the line selected fitted the data very poorly.


    observed values of E and I in the data sample. But computational convenience is not the only virtue of the minimum SSE criterion; it also has some attractive statistical properties under plausible assumptions about the noise term. These properties will be discussed in a moment, after we introduce the concept of multiple regression.

    Multiple Regression

    Plainly, earnings are affected by a variety of factors in addition to years of schooling, factors that were aggregated into the noise term in the simple regression model above. "Multiple regression" is a technique that allows additional factors to enter the analysis separately so that the effect of each can be estimated. It is valuable for quantifying the impact of various simultaneous influences upon a single dependent variable. Further, because of omitted variables bias with simple regression, multiple regression is often essential even when the investigator is only interested in the effects of one of the independent variables.

    For purposes of illustration, consider the introduction into the earnings analysis of a second independent variable called "experience." Holding constant the level of education, we would expect someone who has been working for a longer time to earn more. Let X denote years of experience in the labor force and, as in the case of education, we will assume that it has a linear effect upon earnings that is stable across individuals. The modified model may be written:

    I = α + βE + γX + ε

    where γ is expected to be positive.

    The derivation is so simple in the case of one explanatory variable that it is worth including here: Continuing with the example in the text, we imagine that we have data on education and earnings for a number of individuals, let them be indexed by j. The actual value of earnings for the jth individual is Ij, and its estimated value for any line with intercept α and slope β will be α + βEj. The estimated error is thus Ij − α − βEj. The sum of squared errors is then Σj(Ij − α − βEj)². Minimizing this sum with respect to α requires that its derivative with respect to α be set to zero, or −2Σj(Ij − α − βEj) = 0. Minimizing with respect to β likewise requires −2ΣjEj(Ij − α − βEj) = 0. We now have two equations in two unknowns that can be solved for α and β.
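    A minimal sketch of this computation with hypothetical data (the closed-form expressions below follow from solving the two first-order conditions in the footnote):

        import numpy as np

        # Hypothetical observations: years of schooling (E) and earnings (I)
        E = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
        I = np.array([18000, 22000, 27000, 24000, 31000, 38000, 35000, 42000], dtype=float)

        # Solving the two equations gives the familiar formulas:
        #   beta  = sum((E - mean(E)) * (I - mean(I))) / sum((E - mean(E))**2)
        #   alpha = mean(I) - beta * mean(E)
        beta = np.sum((E - E.mean()) * (I - I.mean())) / np.sum((E - E.mean()) ** 2)
        alpha = I.mean() - beta * E.mean()

        print("estimated slope (beta):", beta)
        print("estimated intercept (alpha):", alpha)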


    The task of estimating the parameters α, β, and γ is conceptually identical to the earlier task of estimating only α and β. The difference is that we can no longer think of regression as choosing a line in a two-dimensional diagram; with two explanatory variables we need three dimensions, and instead of estimating a line we are estimating a plane. Multiple regression analysis will select a plane so that the sum of squared errors (the error here being the vertical distance between the actual value of I and the estimated plane) is at a minimum. The intercept of that plane with the I-axis (where E and X are zero) implies the constant term α, its slope in the education dimension implies the coefficient β, and its slope in the experience dimension implies the coefficient γ.

    Multiple regression analysis is in fact capable of dealing with an arbitrarily large number of explanatory variables. Though people lack the capacity to visualize in more than three dimensions, mathematics does not. With n explanatory variables, multiple regression analysis will estimate the equation of a "hyperplane" in n-space such that the sum of squared errors has been minimized. Its intercept implies the constant term, and its slope in each dimension implies one of the regression coefficients. As in the case of simple regression, the SSE criterion is quite convenient computationally. Formulae for the parameters α, β, γ, ... can be derived readily and evaluated easily on a computer, again using only the observed values of the dependent and independent variables.
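    A sketch of how such a fit might be computed (hypothetical data; numpy's least-squares routine minimizes the same sum of squared errors over the plane's parameters):

        import numpy as np

        # Hypothetical data: education (years), experience (years), earnings (dollars)
        E = np.array([12, 16, 12, 18, 14, 16, 10, 12], dtype=float)
        X = np.array([ 5,  2, 10,  7,  3, 12,  8,  1], dtype=float)
        I = np.array([28000, 31000, 34000, 45000, 30000, 43000, 24000, 22000], dtype=float)

        # Design matrix with a column of ones for the constant term alpha
        A = np.column_stack([np.ones_like(E), E, X])

        # Least-squares solution: coefficients [alpha, beta, gamma]
        coeffs, *_ = np.linalg.lstsq(A, I, rcond=None)
        alpha, beta, gamma = coeffs
        print("alpha:", alpha, "beta:", beta, "gamma:", gamma)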

    The interpretation of the coefficient estimates in a multiple regression warrants brief comment. In the model I = α + βE + γX + ε, α captures what an individual earns with no education or experience, β captures the effect on income of a year of education, and γ captures the effect on income of a year of experience. To put it slightly differently, β is an estimate of the effect of a year of education on income,

    The derivation may be found in any standard econometrics text. See, e.g., E. Hanushek and J. Jackson, Statistical Methods for Social Scientists (); J. Johnston, Econometric Methods (d ed. ).


    holding experience constant. Likewise, γ is the estimated effect of a year of experience on income, holding education constant.

    Essential Assumptions and Statistical Properties of Regression

    As noted, the use of the minimum SSE criterion may be defended on two grounds: its computational convenience, and its desirable statistical properties. We now consider these properties and the assumptions that are necessary to ensure them.

    Continuing with our illustration, the hypothesis is that earnings in the "real world" are determined in accordance with the equation

    I = α + βE + γX + ε

    True values of α, β, and γ exist, and we desire to ascertain what they are. Because of the noise term ε, however, we can only estimate these parameters.

    We can think of the noise term ε as a random variable, drawn by nature from some probability distribution: people obtain an education and accumulate work experience, then nature generates a random number for each individual, called ε, which increases or decreases income accordingly. Once we think of the noise term as a random variable, it becomes clear that the estimates of α, β, and γ (as distinguished from their true values) will also be random variables, because the estimates generated by the SSE criterion will depend upon the particular value of ε drawn by nature for each individual in the data set. Likewise, because there exists a probability distribution from which each ε is drawn, there must also exist a probability distribution from which each parameter estimate is drawn, the latter distribution a function of the former distributions. The attractive statistical properties of regression all concern the relationship between the probability distribution of the parameter estimates and the true values of those parameters.

    We begin with some definitions. The minimum SSE criterion is termed an estimator. Alternative criteria for generating parameter estimates (such as minimizing the sum of errors in absolute value) are also estimators.

    An accessible and more extensive discussion of the key assumptions of regression may be found in Fisher, supra note .


    Each parameter estimate that an estimator produces, as noted, can be viewed as a random variable drawn from some probability distribution. If the mean of that probability distribution is equal to the true value of the parameter that we are trying to estimate, then the estimator is unbiased. In other words, to return to our illustration, imagine creating a sequence of data sets each containing the same individuals with the same values of education and experience, differing only in that nature draws a different ε for each individual for each data set. Imagine further that we recompute our parameter estimates for each data set, thus generating a range of estimates for each parameter α, β, and γ. If the estimator is unbiased, we would find that on average we recovered the true value of each parameter.

    An estimator is termed consistent if it takes advantage of additional data to generate more accurate estimates. More precisely, a consistent estimator yields estimates that converge on the true value of the underlying parameter as the sample size gets larger and larger. Thus, the probability distribution of the estimate for any parameter has lower variance as the sample size increases, and in the limit (infinite sample size) the estimate will equal the true value.

    The variance of an estimator for a given sample size is also of interest. In particular, let us restrict attention to estimators that are unbiased. Then, lower variance in the probability distribution of the estimator is clearly desirable: it reduces the probability of an estimate that differs greatly from the true value of the underlying parameter. In comparing different unbiased estimators, the one with the lowest variance is termed efficient or best.

    Under certain assumptions, the minimum SSE criterion has the characteristics of unbiasedness, consistency, and efficiency; these assumptions and their consequences follow:

    Variance is a measure of the dispersion of the probability distribution of a random variable. Consider two random variables with the same mean (same average value). If one of them has a distribution with greater variance, then, roughly speaking, the probability that the variable will take on a value far from the mean is greater.

    Lower variance by itself is not necessarily an attractive property for an estimator. For example, we could employ an estimator for β of the form β = some arbitrary constant, irrespective of the information in the data set. This estimator has zero variance.


    (1) If the noise term for each observation, ε, is drawn from a distribution that has a mean of zero, then the sum of squared errors criterion generates estimates that are unbiased and consistent.

    That is, we can imagine that for each observation in the sample, nature draws a noise term from a different probability distribution. As long as each of these distributions has a mean of zero (even if the distributions are not the same), the minimum SSE criterion is unbiased and consistent. This assumption is logically sufficient to ensure that one other condition holds: namely, that each of the explanatory variables in the model is uncorrelated with the expected value of the noise term. This will prove important later.

    (2) If the distributions from which the noise terms are drawn for each observation have the same variance, and the noise terms are statistically independent of each other (so that if there is a positive noise term for one observation, for example, there is no reason to expect a positive or negative noise term for any other observation), then the sum of squared errors criterion gives us the best or most efficient estimates available from any linear estimator (defined as an estimator that computes the parameter estimates as a linear function of the noise term, which the SSE criterion does).

    If assumptions (2) are violated, the SSE criterion remains unbiased and consistent but it is possible to reduce the variance of the estimator by taking account of what we know about the noise term. For example, if we know that the variance of the distribution from which the noise term is drawn is bigger for certain observations, then the size of the noise term for those observations is likely to be larger. And, because the noise is larger, we will want to give those observations less weight in our analysis. The statistical procedure for dealing

    See, e.g., P. Kennedy, A Guide to Econometrics (d ed. ).

    If the expected value of the noise term is always zero irrespective of the values of the explanatory variables for the observation with which the noise term is associated, then by definition the noise term cannot be correlated with any explanatory variable.

    E.g., id. at ; J. Johnston, supra note , at .


    with this sort of problem is termed "generalized least squares," which is beyond the scope of this lecture.
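    To make the unbiasedness property concrete, here is a hypothetical simulation sketch (the model, parameter values, and noise distribution are all assumptions chosen for illustration): the noise terms are redrawn many times with the explanatory variables held fixed, the coefficients are re-estimated each time, and the estimates average out near the true values.

        import numpy as np

        rng = np.random.default_rng(1)
        n = 50
        E = rng.uniform(8, 18, n)             # education, held fixed across replications
        X = rng.uniform(0, 20, n)             # experience, held fixed across replications
        alpha, beta, gamma = 10.0, 2.0, 0.5   # hypothetical "true" parameters

        A = np.column_stack([np.ones(n), E, X])
        estimates = []
        for _ in range(2000):
            eps = rng.normal(0.0, 5.0, n)     # noise: mean zero, same variance, independent draws
            I = alpha + beta * E + gamma * X + eps
            estimates.append(np.linalg.lstsq(A, I, rcond=None)[0])

        # Averaging over many replications recovers roughly (10, 2, 0.5)
        print(np.round(np.mean(estimates, axis=0), 2))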

    An Illustration: Discrimination on the Basis of Gender

    To illustrate the ideas to this point, as well as to suggest how regression may have useful applications in a legal proceeding, imagine a hypothetical firm that has been sued for wage discrimination on the basis of gender. To investigate these allegations, data have been gathered for all of the firm's employees. The questions to be answered are (a) whether discrimination is occurring (liability), and (b) what its consequences are (damages). We will address them using a modified version of the earnings model developed earlier.

    The usefulness of multiple regression here should be intuitively apparent. Suppose, for example, that according to the data, women at the firm on average make less than men. Is this fact sufficient to establish actionable discrimination? The answer is no if the difference arises because women at this firm are less well-educated, for example (and thus by inference less productive), or because they are less experienced. In short, the legal question is whether women earn less after all of the factors that the firm may permissibly consider in setting wages have been taken into account.

    To generate the data for this illustration, I assume a hypothetical "real world" in which earnings are determined by equation (1):

    Earnings = 5000 + 1000·School + 50·Aptitude + 300·Experience − 2000·Gendum + Noise   (1)

    where "School" is years of schooling; "Aptitude" is a score on an aptitude test; "Experience" is years of experience in the work force; and "Gendum" is a variable that equals one for women and zero for men (more about this variable in a moment). To produce the artificial data set, I made up fifty observations (corresponding to fifty fictitious individuals) for each of the explanatory variables, half men and half women. In making up the data, I deliberately tried to introduce some positive correlation between the

    See, e.g., id. at .
    See, e.g., Miller v. Kansas Electric Power Cooperative, Inc., WL (D. Kan.).


    schooling and aptitude variables, for reasons that will become clear later. I then employed a random number generator to produce a noise term drawn from a normal distribution with a mean of zero. The standard deviation (the square root of the variance) of this distribution was chosen more or less arbitrarily to introduce a considerable but not overwhelming amount of noise in proportion to the total variation in earnings. The right-hand-side variables were then used to generate the actual value of earnings for each of the fifty individuals.
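    A sketch of how such an artificial data set might be generated and then fit (hypothetical: the noise standard deviation, the aptitude score range, and the induced schooling–aptitude correlation are illustrative assumptions, since the lecture's exact values are not reproduced here):

        import numpy as np

        rng = np.random.default_rng(42)
        n = 50
        gendum = np.repeat([0, 1], n // 2)                  # 25 men (0), 25 women (1)
        school = rng.integers(10, 19, n).astype(float)      # years of schooling
        aptitude = 100 + 5 * school + rng.normal(0, 15, n)  # positively correlated with schooling
        experience = rng.integers(0, 21, n).astype(float)   # years in the work force
        noise = rng.normal(0.0, 3000.0, n)                  # illustrative standard deviation

        # Equation (1): the "true" earnings process, including a -2000 shift for women
        earnings = (5000 + 1000 * school + 50 * aptitude
                    + 300 * experience - 2000 * gendum + noise)

        # Recover the parameters by minimizing the sum of squared errors
        A = np.column_stack([np.ones(n), school, aptitude, experience, gendum])
        estimates, *_ = np.linalg.lstsq(A, earnings, rcond=None)
        print("constant, School, Aptitude, Experience, Gendum:", np.round(estimates, 1))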

    The effect of gender on earnings in this hypothetical firm enters through the variable Gendum. Gendum is a "dummy" variable in econometric parlance because its numerical value is arbitrary, and it simply captures some nonnumerical attribute of the sample population. By construction here, men and women both earn the same returns to education, experience, and aptitude, but holding these factors constant the earnings of women are $2,000 lower. In effect, the constant term (baseline earnings) is lower for women, but otherwise women are treated equally. In reality, of course, gender discrimination could arise in other ways (such as lower returns to education and experience for women, for example), and I assume that it takes this form only for purposes of illustration.

    Note that the random number generator that I employed here generates noise terms with an expected value of zero, each drawn from a distribution with the same variance. Further, the noise terms for the various observations are statistically independent (the realized value of the noise term for each observation has no influence on the noise term drawn for any other observation). Hence, the noise terms satisfy the assumptions necessary to ensure that the minimum SSE criterion yields unbiased, consistent, and efficient estimates. The expected value of the estimate for each parameter is equal to the true value, therefore, and no other linear estimator will do a better job at recovering the true parameters than the minimum SSE criterion. It is nevertheless interesting to see just how well regression analysis performs. I used a standard computer package to estimate the constant term and the coefficients of the four independent variables from the observed values of Earnings, School, Aptitude, Experience, and Gendum for each of the fifty hypothetical individuals. The results are reproduced in table 1, under the column labeled


    "Estimated Value." (We will discuss the last three columns and the R² statistic in the next section.)

    Table 1 (noise term with the larger standard deviation)

    Variable      True value   Estimated value   Standard error   t-statistic   Prob (2-tail)
    Constant        5000             .                 .               .              .
    School          1000             .                 .               .              .
    Aptitude          50             .                 .               .              .
    Experience       300             .                 .               .              .
    Gendum         −2000             .                 .               .              .

    R² = .

    Note that all of the estimated parameters have the right sign. Just by chance, it turns out that the regression overestimates the returns to schooling and underestimates the other parameters. The estimated coefficient for Aptitude is off by a great deal in proportion to its true value, and in a later section I will offer an hypothesis as to what the problem is. The other parameter estimates, though obviously different from the true value of the underlying parameter, are much closer to the mark. With particular reference to the coefficient of Gendum, the regression results correctly suggest the presence of gender discrimination, though its magnitude is underestimated (remember that an overestimate of the same magnitude was just as likely ex ante, that is, before the actual values for the noise terms were generated).

    The source of the error in the coefficient estimates, of course, is the presence of noise. If the noise term were equal to zero for every observation, the true values of the underlying parameters could be recovered in this illustration with perfect accuracy from the data for only five hypothetical individuals; it would be a simple matter of solving five equations in five unknowns. And, if noise is the source of error in the parameter estimates, intuition suggests that the magnitude of the noise will affect the accuracy of the regression estimates, with more noise leading to less accuracy on average. We will make this intuition precise in the next section, but before


    proceeding it is perhaps useful to repeat the parameter estimation experiment for a hypothetical firm in which the data contain less noise. To do so, I took the data for the independent variables used in the experiment above and again generated values for earnings for the fifty hypothetical individuals using equation (1), changing only the noise terms. This time, the noise terms were drawn by the random number generator from a normal distribution with a substantially smaller standard deviation (a significant reduction in the amount of noise). Reestimating the regression parameters from this modified data set produced the results in table 2:

    Table 2 (noise term with the smaller standard deviation)

    Variable      True value   Estimated value   Standard error   t-statistic   Prob (2-tail)
    Constant        5000             .                 .               .              .
    School          1000             .                 .               .              .
    Aptitude          50             .                 .               .              .
    Experience       300             .                 .               .              .
    Gendum         −2000             .                 .               .              .

    R² = .

    Not surprisingly, the estimated parameters here are considerably closer to their true values. It was not certain that they would be, because after all their expected values are equal to their true values regardless of the amount of noise (the estimator is unbiased). But on average we would expect greater accuracy, and greater accuracy indeed emerges here. Put more formally, the probability distributions of the parameter estimates have greater variance, the greater the variance of the noise term. The variance of the noise term thus affects the degree of confidence that we have in the accuracy of regression estimates.

    In real applications, of course, the noise term is unobservable, as is the distribution from which it is drawn. The variance of the noise term is thus unknown. It can, however, be estimated using the difference between the predicted values of the dependent variable for each observation and the actual value (the estimated errors defined


    variance of the normal distribution for each coefficient. The square root of this estimate is called the standard error of the coefficient; call this measure s.

    It is also possible to show that if the parameter estimate, call it β̂, is normally distributed with a mean of β, then (β̂ − β)/s has a Student's t distribution. The t-distribution looks very much like the normal, only it has fatter tails and its mean is zero. Using this result, suppose we hypothesize that the true value of a parameter in our regression model is β. Call this the "null hypothesis." Because the minimum SSE criterion is an unbiased estimator, we can deduce that our parameter estimate is drawn from a normal distribution with a mean of β if the null hypothesis is true. If we then subtract β from our actual parameter estimate and divide by its standard error, we obtain a number called the t-statistic, which is drawn from a t-distribution if the null hypothesis is true. This statistic can be positive or negative as the parameter estimate from which it is derived is greater or less than the hypothesized true value of the parameter. Recalling that the t-distribution is much like a normal with mean of zero, we know that large values of the t-statistic (in absolute value) will be drawn considerably less frequently than small values of the t-statistic. And, from the construction of the t-statistic, large values for that statistic arise (in absolute value), other things being equal, when the parameter estimate on which it is based differs from its true (hypothesized) value by a great deal.

    This insight is turned on its head for hypothesis testing. We have just argued that a large t-statistic (in absolute value) will arise fairly infrequently if the null hypothesis is correct. Hence, when a large t-statistic does arise, it will be tempting to conclude that the null hypothesis is false. The essence of hypothesis testing with a regression coefficient, then, is to formulate a null hypothesis as to its true value, and then to decide whether to accept or reject it according to whether the t-statistic associated with that null hypothesis is large enough that the plausibility of the null hypothesis is sufficiently in doubt.

    See sources cited note  supra.

    I limit the discussion here to hypothesis testing regarding the value of a particular parameter. In fact, other sorts of hypotheses may readily be tested, such as the hypothesis that all parameters in the model are zero, the hypothesis that some subset of the parameters are zero, and so on.


    One can be somewhat more precise. We might resolve that the null hypothesis is implausible if the t-statistic associated with our regression estimate lies so far out in one tail of its t-distribution that such a value, or one even larger in absolute value, would arise less than, say, 5 percent of the time if the null hypothesis is correct. Put differently, we will reject the null hypothesis if the t-statistic falls either in the uppermost tail of the t-distribution, containing 2.5 percent of the draws representing the largest positive values, or in the lowermost tail, containing 2.5 percent of the draws representing the largest negative values. This is called a "two-tailed" test.

    Alternatively, we might have a strong prior belief about the true value of a parameter that would lead us to accept the null hypothesis even if the t-statistic lies far out in one of the tails of the distribution. Consider the coefficient of the gender dummy in table 1 as an illustration. Suppose the null hypothesis is that the true value of this coefficient is zero. Under what circumstances would we reject it? We might find it implausible that the true value of the coefficient would be positive, reflecting discrimination against men. Then, even if the estimated coefficient for the gender dummy is positive with a large positive t-statistic, we would still accept the null hypothesis that its true value is zero. Only a negative coefficient estimate with a large negative t-statistic would lead us to conclude that the null hypothesis was false. Where we reject the null hypothesis only if a t-statistic that is large in absolute value has a particular sign, we are employing a "one-tailed" test.

    To operationalize either a one- or two-tailed test, it is necessary to compute the exact probability of a t-statistic as large or larger in absolute value as the one associated with the parameter estimate at issue. In turn, it is necessary to know exactly how "spread out" is the t-distribution from which the estimate has been drawn. A further parameter that we need to pin down the shape of the t-distribution in this respect is called the "degrees of freedom," defined as the number of observations in the sample less the number of parameters to be estimated. In the illustrations of tables 1 and 2, we have fifty observations in the sample, and we are estimating five parameters, so


    the t-distribution for any of the parameter estimates has 45 degrees of freedom. The fewer the degrees of freedom, the more spread out is the t-distribution and thus the greater is the probability of drawing large t-statistics. The intuition is that the larger the sample, the more collapsed is the distribution of any parameter estimate (recall the concept of consistency above). By contrast, the more parameters we seek to estimate from a sample of given size, the more information we are trying to extract from the data and the less confident we can be in the estimate of each parameter; hence, the associated t-distribution is more spread out.

    Knowing the degrees of freedom for the t-distribution allows an investigator to compute the probability of drawing the t-statistic in question, or one larger in absolute value, assuming the truth of the null hypothesis. Using the appropriate one- or two-tailed test (the former necessary only when the t-statistic is of the "right" sign), the investigator then rejects the null hypothesis if this probability is sufficiently small.
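    A sketch of this computation with scipy (the estimate, standard error, and hypothesized value below are hypothetical numbers): the t-statistic is the estimate less the hypothesized value, divided by the standard error, and the p-value comes from a t-distribution with degrees of freedom equal to the number of observations less the number of estimated parameters.

        from scipy import stats

        beta_hat = -1500.0   # hypothetical coefficient estimate
        beta_null = 0.0      # null hypothesis: true coefficient is zero
        std_err = 1200.0     # hypothetical standard error of the estimate
        df = 50 - 5          # fifty observations, five estimated parameters

        t_stat = (beta_hat - beta_null) / std_err
        p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)  # both tails
        p_one_tailed = stats.t.cdf(t_stat, df)          # lower tail only

        print("t-statistic:", t_stat)
        print("two-tailed p-value:", p_two_tailed)
        print("one-tailed p-value:", p_one_tailed)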

    But what do we mean by "sufficiently small"? The answer is by no means obvious, and it depends upon the circumstances. It has become convention in social scientific research to test one particular null hypothesis: namely, the hypothesis that the true value of a coefficient is zero. Under this hypothesis, β in our notation above is equal to zero, and hence the t-statistic is simply β̂/s, the coefficient estimate divided by its standard error. It is also convention to embrace a significance level of .05, .01, or .001; that is, to inquire whether the t-statistic that the investigator has obtained, or one even larger in absolute value, would arise more than 5 percent, 1 percent, or 0.1 percent of the time when the null hypothesis is correct. Where the answer to this question is no, the null hypothesis is rejected and the coefficient in question is said to be "statistically significant." For example, if the parameter estimate that was obtained is far enough from zero that an estimate of that magnitude, or one even further from zero, would arise less than 5 percent of the time, then the coefficient is said to be significant at the .05 level.

    The question whether the conventional social scientific significance tests are appropriate when regression analysis is used for legal

    See sources cited note supra.


    applications, particularly in litigation, is a difficult one that I will defer to the concluding section of this lecture. I will simply assume for now that we are interested in the general problem of testing some null hypothesis, and that we will reject it if the parameter estimate obtained lies far enough out in one of the tails of the distribution from which the estimate has been drawn. We leave open the question of what constitutes "far enough" and simply seek to compute the probability under a one- or two-tailed test of obtaining an estimate as far from the mean of the distribution as that generated by the regression if the null hypothesis is true.

    Most computerized regression packages report not only the parameter estimate itself (β̂ in our notation), but also the standard error of each parameter estimate (s in our notation). This value, coupled with the hypothesized true parameter value (β in our notation), can then be employed to generate the appropriate t-statistic for any null hypothesis. Many regression packages also report a number called the "t-statistic," which is invariably based upon the conventional social scientific null hypothesis that the true parameter value is zero. Finally, some packages report the probability that the t-statistic at issue could have been generated from a t-distribution with the appropriate degrees of freedom under a one- or two-tailed test.

    Returning to tables 1 and 2, all of this information is reported for each of the five parameter estimates: the standard error, the value of the t-statistic for the null hypothesis that the true parameter value is zero, and the probability of getting a t-statistic that large or larger in absolute value under a two-tailed test with 45 degrees of freedom. To interpret this information, consider the estimated coefficient for the

    If the regression package does not report these probabilities, they can readily be found elsewhere. It has become common practice to include in statistics and econometrics books tables of probabilities for t-distributions with varying degrees of freedom. Knowing the degrees of freedom associated with a t-statistic, therefore, one can consult such a table to ascertain the probability of obtaining a t-statistic as far from zero or farther as the one generated by the regression (the concept "far from zero" again defined by either a one- or two-tailed test). As a point of reference, when the degrees of freedom are large, the .05 significance level for a two-tailed test requires a t-statistic approximately equal to 2.


    gender dummy in table 1. The estimated coefficient, divided by its standard error, yields the t-statistic, and the associated probability under a two-tailed test is reported as .30. This means that if the true value of the coefficient for the gender dummy were zero, a coefficient as large as the one estimated here (or larger) in absolute value would nevertheless arise 30 percent of the time given the degrees of freedom of the t-distribution from which the coefficient estimate is drawn. A rejection of the null hypothesis on the basis of a parameter estimate of this magnitude or greater in absolute value, therefore, will be erroneous three times out of ten when the null hypothesis is true. By conventional social science standards, therefore, the significance level here is too low to reject the null hypothesis, and the coefficient of the gender dummy is not statistically significant. It is noteworthy that in this instance (in contrast to any real world application), we know the true parameter value, namely −2,000. Hence, if we employ a conventional two-tailed significance test, we are led erroneously to reject the hypothesis that gender discrimination is present.

    As noted, we may regard the two-tailed test as inappropriate for the coefficient of the gender dummy because we find the possibility of discrimination against men to be implausible. It is a simple matter to construct an alternative one-tailed test: Table 1 indicates that a coefficient estimate as large in absolute value as the one obtained here will occur 30 percent of the time if the true value of the coefficient is zero. Put differently, an estimate of the gender dummy coefficient greater than or equal to the absolute value of that estimate will arise 15 percent of the time, and an estimate less than or equal to its negative will arise 15 percent of the time. It follows that if we are only interested in the lower tail of the t-distribution, rejection of the null hypothesis (when it is true) will be erroneous only 15 percent of the time if we require a parameter estimate at least as negative as the one obtained. The one-tailed significance level is thus .15, still below the conventional thresholds for statistical significance. Using such significance levels, therefore, we again are led to accept the null hypothesis, in this case erroneously.

    The result in this illustration is general: for any t-statistic, the probability of rejecting the null hypothesis erroneously under a one-tailed test will be exactly half that probability under a two-tailed test.


    I offer this illustration not to suggest that there is anything wrong with conventional significance tests, but simply to indicate how one reduces the chance of erroneously rejecting the null hypothesis (call this "Type I error") only by increasing the chance of erroneously accepting it (call this "Type II error"). The conventional significance tests implicitly give great weight to the importance of avoiding Type I errors, and less weight to the avoidance of Type II errors, by requiring a high degree of confidence in the falsity of the null hypothesis before rejecting it. This seems perfectly appropriate for most scientific applications, in which the researcher is justifiably asked to bear a considerable burden of proof before the scientific community will accept that the data establish an asserted causal relation. Whether the proponent of regression evidence in a legal proceeding should bear that same burden of proof is a more subtle issue.

    Goodness of Fit

    Another common statistic associated with regression analysis is the R². This has a simple definition: it is equal to one minus the ratio of the sum of squared estimated errors (the deviation of the actual value of the dependent variable from the regression line) to the sum of squared deviations about the mean of the dependent variable. Intuitively, the sum of squared deviations about its mean is a measure of the total variation of the dependent variable. The sum of squared deviations about the regression line is a measure of the extent to which the regression fails to explain the dependent variable (a measure of the noise). Hence, the R² statistic is a measure of the extent to which the total variation of the dependent variable is explained by the regression. It is not difficult to show that the R² statistic necessarily takes on a value between zero and one.
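    A sketch of the definition in code (hypothetical arrays): R² = 1 − SSE/SST, where SSE is the sum of squared estimated errors and SST is the sum of squared deviations of the dependent variable about its mean.

        import numpy as np

        def r_squared(actual, predicted):
            """One minus the ratio of squared estimated errors to total squared deviations."""
            actual = np.asarray(actual, dtype=float)
            predicted = np.asarray(predicted, dtype=float)
            sse = np.sum((actual - predicted) ** 2)        # unexplained variation
            sst = np.sum((actual - actual.mean()) ** 2)    # total variation
            return 1.0 - sse / sst

        # Hypothetical usage with fitted values from any regression
        print(r_squared([10.0, 12.0, 15.0, 20.0], [11.0, 12.5, 14.0, 19.5]))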

    A high value of R², suggesting that the regression model "explains" the variation in the dependent variable well, is obviously important if one wishes to use the model for predictive or forecasting purposes. It is considerably less important if one is simply interested in particular parameter estimates (as, for example, if one is searching for evidence of discrimination, as in our illustration, and thus

    See, e.g., E. Hanushek and J. Jackson, supra note , at.


    focused on the coefficient of the gender dummy). To be sure, a large unexplained variation in the dependent variable will increase the standard error of the coefficients in the model (which are a function of the estimated variance of the noise term), and hence regressions with low values of R² will often (but by no means always) yield parameter estimates with small t-statistics for any null hypothesis. Because this consequence of a low R² will be reflected in the t-statistics, however, it does not afford any reason to be concerned about a low R² per se.

    As a quick illustration, turn back to tables 1 and 2. Recall that the noise terms for the data set from which the estimates in table 1 were generated were drawn from a distribution with a substantially larger standard deviation than the noise terms used for table 2. The unexplained variation in the earnings variable is likely to be greater in the first data set, therefore, and indeed the R² statistics confirm that it is: the R² reported in table 1 is considerably lower than that in table 2. Likewise, because the estimated variance of the noise term is greater for the estimates in table 1, we expect the coefficient estimates to have larger standard errors and smaller t-statistics. This expectation is also borne out on inspection of the two tables. Variables with coefficients that are statistically significant by conventional tests in table 2, therefore, such as the gender dummy, are not statistically significant in table 1.

    In these illustrations, the value of R² simply reflects the amount of noise in the data, and a low R² is not inconsistent with the minimum SSE criterion serving as an unbiased, consistent, and efficient estimator because we know that the noise terms were all independent draws from the same distribution with a zero mean. In practice, however, a low value of R² may indicate that important and systematic factors have been omitted from the regression model. This possibility raises again the concern about omitted variables bias.

    Two Common Statistical Problems in Regression Analysis

    Much of the typical econometrics course is devoted to what happens when the assumptions that are necessary to make the minimum SSE criterion unbiased, consistent, and efficient do not hold. I cannot begin to provide a full sense of these issues in such a brief lecture


    and will simply illustrate two of the many complications that may arise, chosen because they are both common and quite important.

    Omitted Variables

    As noted, the omission from a regression of some variables that affect the dependent variable may cause an "omitted variables bias." The problem arises because any omitted variable becomes part of the noise term, and the result may be a violation of the assumption necessary for the minimum SSE criterion to be an unbiased estimator.

    Recall that assumption: that each noise term is drawn from a distribution with a mean of zero. We noted that this assumption logically implies the absence of correlation between the explanatory variables included in the regression and the expected value of the noise term (because whatever the value of any explanatory variable, the expected value of the noise term is always zero). Thus, suppose we start with a "properly specified" model in which the noise term for every observation has an expected value of zero. Now, omit one of the independent variables. If the effect of this variable upon the dependent variable is not zero for each observation, the new noise terms now come from distributions with nonzero means. One consequence is that the estimate of the constant term will be biased (part of the estimated value for the constant term is actually the mean effect of the omitted variable). Further, unless the omitted variable is uncorrelated with the included ones, the coefficients of the included ones will be biased because they now reflect not only an estimate of the effect of the variable with which they are associated, but also partly the effects of the omitted variable.

    To illustrate the omitted variables problem, I took the data on which the estimates reported in table 1 are based, and reran the regression after omitting the schooling variable. The results are shown in table 3:

    See J. Johnston, supra note , at ; E. Hanushek and J. Jackson, supra note , at . The bias is a function of two things: the true coefficients of the excluded variables, and the correlation within the data set between the included and the excluded variables.


    Table 3. Omitted variable illustration

    Variable      True value   Estimated value   Standard error   t-statistic   Prob (2-tail)
    Constant        5000             .                 .               .              .
    School          1000          omitted
    Aptitude          50             .                 .               .              .
    Experience       300             .                 .               .              .
    Gendum         −2000             .                 .               .              .

    R² = .

    You will note that the omission of the schooling variable lowers the R² of the regression, which is not surprising given the original importance of the variable. It also alters the coefficient estimates. The estimate for the constant term rises considerably, because the mean effect of schooling on income is positive. It is not surprising that the constant term is thus estimated to be greater than its true value. An even more significant effect of the omission of schooling is on the coefficient estimate for the aptitude variable, which increases dramatically from below its true value to well above its true value and becomes highly significant. The reason is that the schooling variable is highly correlated (positively) with aptitude in the data set, and schooling has a positive effect on earnings. Hence, with the schooling variable omitted, the aptitude coefficient is erroneously capturing some of the (positive) returns to education as well as the returns to aptitude. The consequence is that the minimum SSE criterion yields an upward biased estimate of the coefficient for aptitude, and in this case the actual estimate is indeed above the true value of that coefficient.

    The effect on the other coefficients is more modest, though nontrivial. Notice, for example, that the coefficient of Gendum increases (in absolute value) significantly. This is because schooling happens to be positively correlated with being male in my fictitious data set; without controlling for schooling, the apparent effect of gender is exaggerated because females are somewhat less well educated on average.
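    A sketch of the mechanism (a hypothetical simulation, simplified to two regressors and with an assumed degree of schooling–aptitude correlation): dropping schooling pushes the estimated aptitude coefficient well above its true value of 50.

        import numpy as np

        rng = np.random.default_rng(7)
        n = 5000
        school = rng.normal(14, 2, n)
        aptitude = 100 + 5 * school + rng.normal(0, 10, n)  # positively correlated with schooling
        noise = rng.normal(0, 3000, n)
        earnings = 5000 + 1000 * school + 50 * aptitude + noise

        full = np.column_stack([np.ones(n), school, aptitude])
        short = np.column_stack([np.ones(n), aptitude])      # schooling omitted

        b_full, *_ = np.linalg.lstsq(full, earnings, rcond=None)
        b_short, *_ = np.linalg.lstsq(short, earnings, rcond=None)
        print("aptitude coefficient, schooling included:", round(b_full[2], 1))   # near 50
        print("aptitude coefficient, schooling omitted:", round(b_short[1], 1))   # biased well above 50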


    The omitted variables problem is troublesome to investigators not simply because it requires them to collect data on more variables to avoid it, but because the omitted variables are often unobservable. Real world studies of gender discrimination are perhaps a case in point. One can readily imagine that earnings depend on such factors as innate ability and motivation, both of which may be unobservable to an investigator. Omitted variables bias may then become something that the investigator cannot avoid, and an understanding of its consequences then becomes doubly important. For an investigator concerned primarily with the coefficient of the gender dummy, it might be argued, the omitted variables bias caused by the exclusion of innate ability and motivation should be modest because the correlation in the sample between gender and those omitted variables might plausibly be assumed to be small. Where the problem appears likely to be serious, by contrast, the utility of conventional regression as an investigative tool diminishes considerably.

    I note in passing that the problem of including extraneous or irrelevant variables is less serious. Their expected coefficient is zero and the estimates of the other coefficients are not biased, although the efficiency of the minimum SSE criterion is lessened.

    I also note in passing a problem that is closely related to the omitted variables problem, termed errors in variables. In many regression studies, it is inevitable that some explanatory variables will be measured with error. Such errors become part of the noise term. Let us assume that in the absence of measurement error, the noise terms obey the assumption needed for unbiasedness and consistency: they are all drawn from a distribution with zero mean and are thus uncorrelated with the explanatory variables. With measurement error, however, this assumption will no longer hold.
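    The direction of the resulting bias is worth seeing in a small sketch of my own (the numbers are invented and are not from the lecture): with classical measurement error, the observed regressor is correlated with the composite disturbance, and the familiar consequence is a coefficient estimate biased toward zero.

```python
# Errors-in-variables sketch (my own illustrative numbers): classical
# measurement error in a regressor biases its coefficient toward zero.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000

true_school = rng.normal(12, 2, n)                     # variance of 4
earnings = 5000 + 1000 * true_school + rng.normal(0, 3000, n)

observed_school = true_school + rng.normal(0, 2, n)    # measurement error, variance 4

fit = sm.OLS(earnings, sm.add_constant(observed_school)).fit()
print(fit.params[1])
# Classical attenuation factor: var(true) / (var(true) + var(error)) = 4/8,
# so the estimate hovers around 500 rather than the true 1000.
```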

    Econometricians have developed some more sophisticated regression techniques to deal with the problem of unobservable variables, but these are not always satisfactory because of certain restrictive assumptions that an investigator must make in using them. See, e.g., Griliches, Errors in Variables and Other Unobservables, Econometrica (). An accessible discussion of the omitted variables problem and related issues may be found in P. Kennedy, supra note , at .

    Id.


    giving. We somehow know that each time both give a speech in a month, alumni contributions rise by $,. When they each give two speeches in a month, contributions rise by $,, and so on.

    Thus, by hypothesis, we know the joint effect of a speech by both Baird and Picker ($,), but nothing in the data permits us to ascertain their individual effects. Perhaps each speech increases contributions by $,, but it might be that one speaker induces an extra $, in giving and the other none, or that one speaker induces an extra $, in giving and the other reduces giving by $,. In econometric parlance, the data on speeches given by Baird and Picker are perfectly collinear: the correlation between the number of speeches per month by Baird and the number per month by Picker is 1.0. An attempt to estimate the effect of the number of speeches given by each upon contributions would fail and result in an error message from the computer (for reasons that we need not detail, it would find itself trying to divide by zero).
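    A small numerical sketch of what perfect collinearity does to the computation may help; the speech counts and the dollar figure are hypothetical values of my own, since the lecture's figures are not reproduced here.

```python
# Perfect collinearity sketch (hypothetical speech counts and dollar effect):
# identical regressor columns make X'X singular, so the individual
# coefficients cannot be computed and regression software reports an error.
import numpy as np

baird = np.array([1.0, 2.0, 3.0, 2.0, 4.0, 1.0])   # speeches per month
picker = baird.copy()                               # always the same count
contributions = 800.0 * (baird + picker)            # only a joint effect is in the data

X = np.column_stack([np.ones_like(baird), baird, picker])
print(np.linalg.matrix_rank(X))      # 2, not 3: the columns are linearly dependent
# Attempting np.linalg.inv(X.T @ X) raises a LinAlgError here: the normal
# equations have no unique solution, the "divide by zero" failure in the text.
```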

    The term multicollinearity usually refers to a problem in the data short of the perfect collinearity in our illustration, but where changes in two variables are nevertheless highly correlated, to the point that it is difficult to separate their effects. Because multicollinearity does not go to any property of the noise term, the minimum SSE criterion can still be unbiased, consistent, and efficient. But the difficulty in separating the effects of the two variables introduces greater uncertainty into the estimator, manifest as an increase in the standard errors of the coefficients and a reduction in their t-statistics.
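    The contrast with merely high (rather than perfect) correlation can be seen in another short sketch with invented data: the coefficient estimates remain unbiased on average, but their standard errors grow large when the two regressors move together.

```python
# Near-collinearity sketch (invented data): highly correlated regressors
# leave OLS unbiased but inflate the standard errors of their coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(0, 1, n)

x2_close = x1 + rng.normal(0, 0.05, n)     # correlation with x1 near 1
x2_indep = rng.normal(0, 1, n)             # essentially uncorrelated with x1

def fit_model(x2):
    # Same true model (coefficients of 3 on each regressor) in both cases.
    y = 3 * x1 + 3 * x2 + rng.normal(0, 1, n)
    return sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(fit_model(x2_close).bse[1:])   # large standard errors, low t-statistics
print(fit_model(x2_indep).bse[1:])   # much smaller standard errors
```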

    One illustration of the effects of multicollinearity may already have been provided. In our discussion of table , we noted that the coefficient estimate for the aptitude variable was far below its true value. As it turns out, aptitude and schooling are highly correlated in the data set, and this affords a plausible conjecture as to why the coefficient for the schooling variable is too high and that for aptitude insignificantly small (some of the effects of aptitude in the sample are captured by the schooling coefficient).

    To give another illustration, which incidentally allows us to introduce another use of dummy variables, suppose that gender discrimination at our hypothetical firm affects the earnings of women in two ways: through an effect on the baseline earnings of women as before, and through an effect on the returns to education for women. In particular, recall that in equation () both sexes earned $1,000 per year of schooling. Suppose now that males earn $1,000, but females earn only $800. This effect can be captured mathematically by an interaction term incorporating the gender dummy, so that earnings are now determined in accordance with equation ():

    Earnings = 5000 + 1000 School + 50 Aptitude + 300 Experience
               - 2000 Gendum - 200 Gendum × School    ()

    Using the same hypothetical data for the explanatory variables as before, I produced new values of earnings using equation () and (just for variety's sake) noise terms drawn from a distribution with standard deviation of ,. I then estimated a regression from the new data set, including the variable Gendum × School as an additional explanatory variable. The results are in table , where the variable Interact is simply Gendum × School.

    Table . Multicollinearity illustration

    Variable      True value    Estimated value    Standard error    t-statistic    Prob (2-tail)
    Constant          .               .                  .                .               .
    School            .               .                  .                .               .
    Aptitude          .               .                  .                .               .
    Experience        .               .                  .                .               .
    Gendum            .               .                  .                .               .
    Interact          .               .                  .                .               .

    R2 = .

    Observe that, in contrast to table , the coefficient for the gender dummy is now higher than the real value by more than a factor of two. The coefficient for the interaction term, by contrast, has the wrong sign and is close to zero. The other parameter estimates are not too far off the mark.

    The poor results for the coefficients of Gendum and Interact are almost certainly a result of a severe multicollinearity problem. Note that whenever Gendum = 0, Interact = 0 as well. Gendum is positive only when Interact is positive. We would expect a high correlation between them, and indeed the correlation coefficient is .. Under these circumstances, it is no surprise that the regression cannot separate the effects of the two variables with any accuracy. The estimated coefficient for Interact is insignificant by any plausible test, and the coefficient for Gendum also has a large standard error that produces a rather poor t-statistic despite the high absolute value of the coefficient estimate.

    Notwithstanding the considerable uncertainty introduced into the coefficient estimates, however, it is plausible that the multicollinearity problem here is not disastrous for an investigator interested in identifying the extent of gender discrimination. The reason is that the estimate of the joint effect of Gendum and Interact may not be too far afield (one is inflated and one is understated, with the errors to a great extent canceling each other), and as a legal matter an estimate of the joint effect may be all that is needed. The caveat is that multicollinearity reduces the t-statistics for both variables, and might thereby lead the investigator to reject the hypothesis that discrimination is present at all. To deal with the effects of multicollinearity here, therefore, the investigator might simply wish to discount the low t-statistics, or else to omit one of the two variables and recognize that the coefficient estimate for the included variable will be biased and will include the effect of the omitted variable.
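    For readers who want to see these effects numerically, the sketch below uses simulated data of my own built from the parameters of the interaction equation above (the sample, the noise level, and the 13 years of schooling used for the joint effect are assumptions, not the paper's data set). It confirms the near-perfect correlation between Gendum and Interact, the imprecise individual coefficients, and the comparatively well-determined joint effect computed from the coefficient covariance matrix.

```python
# Gendum/Interact multicollinearity sketch (simulated data of my own, built
# from the interaction-equation parameters; not the paper's actual data set).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
school = rng.normal(13, 2, n)
aptitude = rng.normal(100, 15, n)
experience = rng.uniform(0, 20, n)
gendum = rng.integers(0, 2, n)            # 1 = female in this sketch
interact = gendum * school

earnings = (5000 + 1000 * school + 50 * aptitude + 300 * experience
            - 2000 * gendum - 200 * interact + rng.normal(0, 3000, n))

print(np.corrcoef(gendum, interact)[0, 1])           # close to 1

X = sm.add_constant(np.column_stack(
    [school, aptitude, experience, gendum, interact]))
fit = sm.OLS(earnings, X).fit()
print(fit.bse[4:], fit.tvalues[4:])                   # imprecise individual terms

# Joint effect of the two gender terms for a woman with (hypothetically)
# 13 years of schooling, and its standard error via the covariance matrix.
b, V = fit.params, np.asarray(fit.cov_params())
s = 13.0
grad = np.zeros(len(b))
grad[4], grad[5] = 1.0, s
print(b[4] + b[5] * s, float(np.sqrt(grad @ V @ grad)))
```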

    In many instances, however, the investigator will not be satisfied with an estimate of the joint effect of two variables, but needs to separate them. Here, multicollinearity can become highly problematic. There is no simple, acceptable solution for all cases, though various options warrant consideration, all beyond the scope of this lecture.

    It is important to recollect that this approach raises the problem of omitted variables bias for the other variables as well.

    See P. Kennedy, supra note , at .

    . A Final Note on the Law: Regression Analysis and the Burden of Proof

    A key issue that one must confront whenever a regression study is introduced into litigation is the question of how much weight to give it. I hope that the illustrations in this lecture afford some basis for optimism that such studies can be helpful, while also suggesting considerable basis for caution in their use.

    I return now to an issue deferred earlier in the discussion of hypothesis testing: the relationship between the statistical significance test and the burden of proof. Suppose, for example, that to establish liability for wage discrimination on the basis of gender under Title VII, a plaintiff need simply show by a preponderance of the evidence that women employed by the defendant suffer some measure of discrimination. With reference to our first illustration, we might say that the required showing on liability is that, by a preponderance of the evidence, the coefficient of the gender dummy is negative.

    See, e.g., Texas Department of Community Affairs v. Burdine, U.S. ().

    Unfortunately, there is no simple relationship between this burden of proof and the statistical significance test. At one extreme, if we imagine that the parameter estimate in the regression study is the only information we have about the presence or absence of discrimination, one might argue that liability is established by a preponderance of the evidence if the estimated coefficient for the gender dummy is negative, regardless of its statistical significance or standard error. The rationale would be that the negative estimate, however subject to uncertainty, is unbiased and is the best evidence we have.

    But this is much too simplistic. Very rarely is the regression estimate the only information available, and when the standard errors are high the estimate may be among the least reliable information available. Further, regression analysis is subject to considerable manipulation. It is not obvious precisely which variables should be included in a model, or what proxies to use for included variables that cannot be measured precisely. There is considerable room for experimentation, and this experimentation can become data mining, whereby an investigator tries numerous regression specifications until the desired result appears. An advocate quite naturally may have a tendency to present only those estimates that support the client's position. Hence, if the best result that an advocate can present contains high standard errors and low statistical significance, it is often plausible to suppose that numerous even less impressive results remain hidden, and conceivably shielded from discovery by the work product doctrine.

    For these reasons, those who use regression analysis in litigation tend to report results that satisfy the conventional significance tests (often the 5-percent significance level) and to suppose that less significant results are not terribly interesting. Before most experts would feel comfortable asserting that gender discrimination has been established by a study such as that in our illustration, therefore, they likely would require that the coefficient estimate for the gender dummy be negative and statistically significant. Even then, they would anticipate a vigorous cross-examination based on a number of matters, many suggested by the discussion above.

    Still more difficult issues arise when an exact parameter estimate is needed for some purpose, such as for computing damages. The fact that the parameter is statistically significant simply means that by conventional tests, one can reject the hypothesis that its true value is zero. But there are surely many other hypotheses about the parameter value that cannot be rejected, and indeed the likelihood that regression will produce a perfectly accurate estimate of any parameter is negligible. About the only guidance that can be given from a statistical standpoint is the obvious: parameter estimates with proportionally low standard errors are less likely to be wide of the mark than others.
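    A brief sketch with invented numbers illustrates the point: the t-statistic rejects only the single hypothesis that the coefficient is zero, while the confidence interval displays the whole range of values that the data cannot rule out.

```python
# Confidence-interval sketch (invented data): significance only rejects the
# hypothesis that the true coefficient is zero; many other values survive.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 4, n)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.tvalues[1])       # comfortably "significant": zero is rejected
print(fit.conf_int()[1])    # yet the 95% interval is wide relative to the estimate
```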

    Ultimately, therefore, statistics itself does not say how much weight a regression study ought to be given, or whether it is reasonable to use a particular parameter estimate for some legal purpose or other. These assessments are inevitably entrusted to triers of fact, whose judgments on the matter, if well informed, are likely as good as those of anyone else.

    I will not digress on the rules of discovery here. In practice, the raw data may be discoverable, for example, while the expert's undisclosed analysis of the data may not be.

    See the discussion in Fisher, Statisticians, Econometricians and Adversary Proceedings, J. Am. Stat. Assn. ().

