
Notes on Some Aspects of Regression Analysis

By D. R. COX
Imperial College

[Read before the ROYAL STATISTICAL SOCIETY on Wednesday, March 20th, 1968, the President, Dr F. YATES, C.B.E., F.R.S., in the Chair]

SUMMARY

Miscellaneous comments are made on regression analysis under four broad headings: regression of a dependent variable on a single regressor variable; regression on many regressor variables; analysis of bivariate and multivariate populations; models with components of variation.

1. INTRODUCTION

This is an expository paper consisting not of new results but of miscellaneous and isolated comments on the theory of regression. The subject is a very broad one and the paper is in no sense comprehensive. In particular, the ideas of regression are the basis of much work in time series analysis and in multivariate analysis and these specialized subjects are barely mentioned; nor are experimental design and sampling theory problems associated with regression considered. Another serious limitation to the paper is the omission of relevant parts of the econometric literature.

Two general situations are distinguished and in their simplest forms are: (i) a dependent variable Y has a distribution depending on a regressor variable x and it is required to assess this dependence; (ii) there is a bivariate population of pairs (X, Y) and the joint distribution is to be analysed. Capital letters are used for observations represented by random variables and lower-case letters for other observations.

Most theoretical discussion of regression starts from a quite tightly specified model in which some observations are regarded as corresponding to random variables with probability distributions depending in a given way on unknown parameters. Many of the difficulties of regression analysis, however, concern such questions as which observations should be treated as random variables, what are suitable families of models, and what is the practical interpretation of the conclusions.

The special computational and other problems associated with fitting non-linear models will not be considered explicitly, although much of the discussion applies as much to such problems as to the simpler linear models with which the paper is overtly concerned. Particular applications will not be discussed in detail but one or two typical but hypothetical situations will be outlined later for illustration. Throughout the discussion the measurement of uncertainty by significance tests and confidence limits is important but not paramount.

Inevitably the paper tends to emphasize difficulties likely to be encountered; of course awareness of potential difficulties is a good thing, but only as one facet of constructive scepticism.


The methodology of regression is described by Williams (1959) and by Draper and Smith (1966) and the more theoretical aspects by Plackett (1960), Kendall and Stuart (1967, Chapters 26-29) and Rao (1965).

2. REGRESSION ON A SINGLE REGRESSOR VARIABLE

Suppose that there are n pairs of observations $(x_1, Y_1), \ldots, (x_n, Y_n)$ and that for each value of the regressor variable x there is a population of values of the dependent variable from which the observed $Y_i$'s are randomly chosen. This is often taken for granted as the starting point for a theoretical discussion of regression; in (i) and (ii) which follow, some of its implications are discussed and then in (iii)-(vi) comments on some more advanced matters are made.

(i) Choice of dependent and regressor variables. In some situations the x values may be chosen deliberately by the experimenter and Y is a response dependent on x. The more difficult situation is when both types of observation can be regarded as random. Then we take as dependent variable: (a) the "effect", the regressor variable being the explanatory variable; for a given value of the explanatory variable, we ask what is the distribution of possible responses; (b) the variable to be predicted, the regressor variable being the variable on which the prediction is to be based. A full solution to the prediction problem is to give the conditional distribution of the variable to be predicted given all available information on the individual concerned.

Suppose that it is reasonable to consider an existing or hypothetical population of Y values for each x. Now whether or not the x's can be regarded as random may well affect the interpretation and application of the conclusions. For analysis of the regression coefficient in the model, however, we argue conditionally on the x values actually observed (Fisher, 1956, p. 156), provided that the x values by themselves would have given no information about the parameter of interest.

Example 1. A random sample of fibre segments of fixed lengths is taken from a homogeneous source and for each segment the mean diameter is measured accurately and the breaking load determined. Both observations can be regarded as random variables but in view of (a) it is reasonable to take breaking load, or rather its log, as dependent variable and log diameter as regressor variable. The same model would be appropriate if, for example, a fixed number of fibres is selected randomly from each of a number of diameter groups. Whether the regression is a fruitful thing to consider depends on its stability, i.e. on whether there is a reproducible relationship involved; see (iii).

Example 2. A more difficult case is illustrated by a calibration experiment in which for n individuals (a) a "slow" measurement and (b) a "quick" measurement are made of some property. Often (a) is a definitive determination, for example an optical measurement of fibre diameter, and (b) is the result of some indirect and much easier method. In future, only the "quick" measurement will be obtained and it is required to predict from this the corresponding "slow" measurement. If the n individuals initially observed are a random sample from the same population as the individuals for which future predictions are to be made, we take the "slow" measurement to be the dependent variable, since this is the one to be predicted. Suppose, however, that the n individuals were chosen systematically, for example to have "slow" values approximately evenly distributed over the range of interest.


Usually it would be reasonable to regard the "quick" value as having a random component and, provided that a physically stable random system is involved and the relationship is linear, to write

$$\text{"quick"} = \alpha + \beta\,\text{"slow"} + \text{"error"},$$

where the "error" does not depend on the "slow" value. Given a new "quick" measurement, we have to estimate a non-random variable by inverse estimation (Williams, 1959, pp. 91, 95). If both variables can in fact be regarded as random, the second approach is inefficient because it ignores the information about the marginal distribution of the "slow" measurement.

(ii) The omitted variables. Suppose that z is a regressor variable that might have been included in the regression analysis but in fact is not, for example because no observations of it are available. What assumptions about z are made when we consider the regression of Y on x alone? Box (1966) has given an illuminating discussion of the dangers of omitting a relevant variable. The relationship ignoring z will be meaningful if: (a) changes in z have no effect on Y; or (b) in a randomized experiment in which x corresponds to a treatment, unit-treatment additivity holds. Then the usual analysis will give an estimate of the effect of changing x, and an estimate of the standard error. The estimate refers to the difference between the response on one unit with certain values for x and for the omitted variable z and what would have been observed on that same unit with a different value of x and the same z; or (c) z is a random variable, say Z, and the distributions of Z, given x, and of Y, given Z = z and x, are well defined. If x is a random variable X, this amounts to the requirement that (X, Y, Z) have a well-defined three-dimensional distribution. Then the regression of Y on x is well defined and includes a contribution associated with changes in z.

Example 3. Consider observational data for individuals with an ultimately fatal disease, Y being the log time to death and x some aspect of the treatment applied, called the dose; suppose that the regression of Y on x is analysed. The missing variable z is the initial severity of the disease. If, as might well be the case, z largely determines x, the regression of Y on x, although well defined under the circumstances of (c), would be of very limited usefulness. In particular it would not give for a particular individual an estimate of the effect on his Y of changing dose. This is an extreme example of a difficulty that applies to many regression studies based on observational data.

(iii) Stability of regression. While a fitted regression equation may often be useful simply as a concise summary of data, it is obviously desirable that the relation should be stable and reproducible. This is stressed by Ehrenberg (1968) and Nelder (1968); see also Tukey (1954). Stability might mean that when the experiment is repeated under different conditions: (a) the same regression equation holds, even though other aspects of the data change; or (b) parallel regression equations are obtained; or (c) satisfactory regression lines are always obtained but with different positions and slopes. In cases (b) and (c) the fitting of regression lines will be an important first step in the analysis. The second step will be to try to account for the variation in the parameters whose estimates do vary appreciably, possibly by a further regression analysis on regressor variables characterizing the different groups of observations and taking the initial regression coefficients as dependent variables. In testing the significance of the differences between the regression coefficients in different groups it will often be important to allow for correlations between the groups of data (Yates, 1939).
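A small simulated sketch of this two-step analysis, assuming four groups whose slopes depend linearly on a single group characteristic z; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical repetitions of an experiment under conditions indexed
# by a group-level variable z suspected of shifting the slope.
zs = np.array([0.5, 1.0, 1.5, 2.0])
slopes = []
for z in zs:
    x = rng.uniform(0, 10, 30)
    y = 1.0 + (2.0 + 0.8 * z) * x + rng.normal(0, 1.0, 30)
    slope, intercept = np.polyfit(x, y, 1)   # first step: within-group fit
    slopes.append(slope)

# Second step: take the estimated slopes as dependent variable and
# regress them on the group characteristic z.
print(np.polyfit(zs, slopes, 1))   # roughly [0.8, 2.0] under these assumptions
```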

(iv) Choice of relation to be fitted. This choice will depend on preliminary plotting and inspection of the data and possibly on the outcome of earlier unsuccessful analyses. In addition, the model may take account of: (a) conclusions from previous sets of data; (b) theoretical analysis, including dimensional analysis, of the system; (c) limiting behaviour. Further, any given model can be parametrized in various ways and, in choosing a parametrization, the following considerations may be relevant: (a)' individual parameters should have a physical interpretation, say in terms of components in a theoretical model or in terms of combinations of regressor variables of physical meaning; (b)' individual parameters and estimates should have a descriptive interpretation, for example in terms of the average slope and curvature of response over the range considered; (c)' interpretations, such as those of (a)' and (b)', should be insensitive to secondary departures from the model; (d)' any instability between groups should be confined to as few parameters as possible; (e)' the sampling errors of estimates of different parameters should not be highly correlated. These requirements may be to some extent mutually conflicting.

There is not space here to discuss and exemplify all these points. As just one example, preliminary analysis of data of Example 1 might, if the x's have relatively little variation, suggest that linear regressions of breaking load on diameter and of log breaking load on log diameter would fit about equally well. The second would in general be preferable because, with respect to the above conditions: (b) it permits easier comparison with the theoretical model breaking load $\propto$ (diameter)$^2$; (c) it ensures that breaking load vanishes with diameter; (a, b)' the regression coefficient, being a dimensionless power, is easier to think about than a coefficient having the dimensions of load/diameter.

(v) Goodness of fit. This can be examined in a number of ways: (a) by a non-probabilistic graphical or tabular analysis, for example of residuals; (b) by a significance test, using as a test statistic some aspect of the data thought to be a reasonable measure of departure from the model. Thus, the standardized third moment of the residuals could be used, if possible skewness is of interest; (c) by the fitting of an extended model reducing to the given model for particular parameter values. The most familiar example is the inclusion of a new regressor variable, possibly a power of the first variable, or a product of variables, when multiple regression is being considered; (d) by the fitting of a quite different model, seeing whether it fits better than the initial one.
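For point (b), a minimal sketch of the standardized third moment of residuals as a test statistic, on simulated data with deliberately skewed errors:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 200)
y = 2.0 + 0.5 * x + rng.gamma(2.0, 1.0, x.size)   # skewed errors

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Standardized third moment of the residuals: a natural test
# statistic when possible skewness is the departure of interest.
r = residuals - residuals.mean()
g1 = (r**3).mean() / (r**2).mean() ** 1.5
print(g1)   # near 0 for symmetric errors; clearly positive here
```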

Such examination of the adequacy of the model is important if models are to be refined and improved. Often, but not always, the primary aspect of the model will be the form of the regression equation and the adequacy of what may be secondary assumptions about constancy of variance, normality of distribution, etc. will be of rather less importance. Formal significance tests are valuable but, of course, need correct interpretation. A very significant lack of fit means that there is decisive evidence of systematic departures from the model; nevertheless the model may account for enough of the variation to be very valuable (Nelder, 1968). A non-significant test result means that in the respect tested the model is reasonably consistent with the data; nevertheless there may be other reasons for regarding the model as inadequate.

Of (a)-(d) the least standard is (d). It is closely related to the problem of choosing between alternative regression equations (Williams, 1959, Chapter 5), for example between the regression of Y on $x_1$ alone and that of Y on $x_2$ alone. The more usual procedure in such a case will, however, be to fit both variables to cover the possibility that the joint regression is appreciably better than either separate one.

As a rather different example, suppose that normal theory linear regressions of ($\alpha$) Y on x, ($\beta$) Y on log x, ($\gamma$) log Y on log x are considered. The goodness of fit of ($\alpha$) and ($\beta$) can be compared descriptively by the residual sums of squares, but to compare say ($\alpha$) with ($\gamma$) the residual sum of squares cannot be used directly. The most usual procedure is probably then to compare squared correlation coefficients, but for comparison of the full models it is probably better to compare the maximized log likelihoods of $Y_1, \ldots, Y_n$ under the two models. Cox (1961, 1962) has discussed the construction of significance tests in such situations.

An alternative and in many ways preferable approach is to consider a comprehensive model containing ($\alpha$), ($\beta$), ($\gamma$) as special cases. For example the normal theory linear regression of

$$\frac{Y^{\lambda_1} - 1}{\lambda_1} \quad \text{on} \quad \frac{x^{\lambda_2} - 1}{\lambda_2}$$

could be taken and all parameters, including $(\lambda_1, \lambda_2)$, estimated and tested by maximum likelihood (Box and Tidwell, 1962; Box and Cox, 1964). This is computationally formidable if there are several regressor variables.
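A crude illustration of the comprehensive model on simulated data: a grid search over $(\lambda_1, \lambda_2)$ maximizing the normal-theory profile log likelihood, with the Jacobian term for the transformation of Y; the data are generated so that the log-log choice should win.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1.0, 10.0, 100)
y = np.exp(0.3 + 0.7 * np.log(x) + rng.normal(0, 0.2, x.size))

def boxcox(v, lam):
    # Box-Cox transform, with the log limit at lam = 0.
    return np.log(v) if lam == 0 else (v**lam - 1.0) / lam

def profile_loglik(lam1, lam2):
    # Profile log likelihood of the comprehensive model: normal theory
    # linear regression of boxcox(y, lam1) on boxcox(x, lam2), plus the
    # Jacobian term (lam1 - 1) * sum(log y) for the Y-transformation.
    yt, xt = boxcox(y, lam1), boxcox(x, lam2)
    slope, intercept = np.polyfit(xt, yt, 1)
    rss = np.sum((yt - intercept - slope * xt) ** 2)
    n = y.size
    return -0.5 * n * np.log(rss / n) + (lam1 - 1.0) * np.log(y).sum()

# Crude grid search over (lambda1, lambda2); under these assumptions
# the maximum should fall near (0, 0), i.e. the log-log regression.
grid = [(l1, l2, profile_loglik(l1, l2))
        for l1 in (-1.0, -0.5, 0.0, 0.5, 1.0)
        for l2 in (-1.0, -0.5, 0.0, 0.5, 1.0)]
print(max(grid, key=lambda t: t[2]))
```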

(vi) More complex dependence on the regressor variable. In most regression analyses it is assumed that the dependence on the regressor variable x is confined to changes in the conditional mean of Y. Transformation of Y may be necessary to achieve this; if different transformations are required to linearize the regression of the mean and to stabilize the variance, the first will usually have preference, simply because primary interest will usually lie in the mean. To study, for example, changes in the conditional variance of Y we can: (a) plot residuals; (b) group on x, calculate variances within groups, if necessary applying an adjustment for changes of mean within groups, and then consider the regression on x of log variance; (c) fit, for example by maximum likelihood, a model in which parameters are added to account for changes in variance. For example, the variance might be taken to be $\sigma^2 \exp\{\gamma(x - \bar{x})\}$. It would nearly always be right to precede any such fitting by (a) or (b) in order to get some idea of an appropriate model and of whether the more complex fitting is likely to be fruitful.
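A sketch of approach (c) on simulated data, fitting the variance model $\sigma^2 \exp\{\gamma(x - \bar{x})\}$ together with a linear mean by direct maximization of the normal log likelihood; SciPy's general-purpose optimizer is assumed.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 200)
sd = np.exp(0.15 * (x - x.mean()))          # spread grows with x
y = 1.0 + 2.0 * x + rng.normal(0, sd)

def negloglik(theta):
    # Normal log likelihood with mean a + b*x and
    # variance exp(log_s2) * exp(g * (x - xbar)).
    a, b, log_s2, g = theta
    var = np.exp(log_s2) * np.exp(g * (x - x.mean()))
    resid = y - a - b * x
    return 0.5 * np.sum(np.log(var) + resid**2 / var)

fit = minimize(negloglik, x0=[0.0, 1.0, 0.0, 0.0], method="Nelder-Mead")
print(fit.x)   # estimates of (alpha, beta, log sigma^2, gamma)
```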

Similar remarks apply to the study of changes of distributional shape. If the regression of the mean is linear but there are substantial changes in variance, a weighted analysis will often be required, although the changes in variance have to be quite substantial before there is appreciable gain in precision in estimating the regression coefficient. Of course the changes in variance may be of intrinsic interest, or need separate study in order to specify how the precision of prediction depends on x.

3. REGRESSION ON SEVERAL REGRESSOR VARIABLES

Suppose now that for each individual several regressor variables are available, i.e. that for the ith individual we observe $(Y_i, x_{i1}, \ldots, x_{ip})$. We consider mainly the case where $x_1, \ldots, x_p$ are physically distinct measurements rather than, for example, powers of a single x. Virtually all the discussion of Section 2 is relevant, but there are new points mostly connected with the choice of the regressor variables and the interpretation of situations in which there is appreciable non-orthogonality among the regressor variables.

There are two extreme situations to consider. In the first the number of regressor variables is quite small, say not more than three or four. It is then perfectly feasible both to fit the $2^p$ possible regression equations and to examine them individually. Those regressor variables the nature of whose effects is clearly established can be isolated and ambiguities arising from non-orthogonality of other variables listed and, as far as possible, interpreted. Also further regression equations involving, say, squares and cross-products of some of the original regressor variables can be fitted, if required. In the second case the number p of regressor variables is larger. It may still be computationally feasible to fit all $2^p$ equations, but unless all pairs of regressor variables are nearly orthogonal, the interpretation is likely to be difficult and, at the least, some further techniques are required for handling the information from the fits. In many applications of this type there is a reasonable hope that only a fairly small number of regressor variables have important effects over the region studied. The broad distinction between these two cases should be borne in mind in the following discussion.
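A minimal sketch of fitting all $2^p$ equations for a small p, recording the residual sum of squares of each subset; the data are simulated and illustrative only.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(0, 1.0, n)

# Fit all 2^p regression equations (including the empty one) and
# record the residual sum of squares for each subset of variables.
fits = {}
for k in range(p + 1):
    for subset in combinations(range(p), k):
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        fits[subset] = np.sum((y - Z @ coef) ** 2)

for subset, rss in sorted(fits.items(), key=lambda t: t[1])[:5]:
    print(subset, round(rss, 1))   # the best few subsets by RSS
```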

(i) Interpretation and objectives. Lindley (1968) has emphasized that the choice between alternative equations depends on the purpose of the analysis and has discussed two cases in detail from a decision-theoretic viewpoint, one a prediction problem and one a control problem. His results show very explicitly the consequences of strong assumptions about the problem and are likely to give useful guidance in other cases too. The following remarks refer to cases where less explicit assumptions are possible about the nature of the problem and the objectives of the analysis.

Suppose first that the objective is to predict Y for future individuals in the region of x-space covered by the data. In particular, the x's may be random variables and the new individuals be drawn from the same population. Then any regression equation that fits the data adequately will be about equally effective on the average over a series of x-values. If, however, it is thought that not all regressor variables contribute, there is likely to be a gain from excluding regressor variables with an insignificant effect. Note that a Bayesian analysis of this situation suggests reducing the contribution of, rather than eliminating, such variables and this is sensible also from a sampling theory viewpoint. So long as p is small compared with n, it is not likely to make a major difference which of these various possibilities is taken. The algorithm of Beale et al. (1967) for selecting the "best" equation with a specified number of regressor variables and the various automatic stepwise procedures described by Draper and Smith (1966, Chapter 6) will be relevant.

Suppose next that the prediction is to be made for an individual in a new region of x-space. Things are now different. For example, suppose that $x_1$ and $x_2$ are almost linearly related in the initial sample of observations and that the partial regression coefficients are insignificant, the combined regression being very highly significant. It is thus known that at least one of $x_1$ and $x_2$ has an important contribution, but there will be many regression equations fitting the data about equally well. Under the circumstances of the previous paragraph this is immaterial, but if prediction of Y is attempted for $(x_1, x_2)$ far from the original linear relation, extremely different results will be obtained from the different fits. In such cases the possibilities are: (a) to postpone setting up a prediction equation until better data are available for estimation; (b) to use external information to decide which is "really" the appropriate equation; (c) to use the formal variance of prediction from the full equation as a means of detecting individuals for which prediction from any regression equation is hazardous.
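Possibility (c) can be sketched as follows on simulated data with two almost linearly related regressors; the formal prediction variance $s^2 x_0'(X'X)^{-1}x_0$ is small near the original linear relation and explodes away from it.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, n)       # x1, x2 almost linearly related
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + x1 + x2 + rng.normal(0, 1.0, n)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
s2 = np.sum((y - X @ coef) ** 2) / (n - 3)
XtX_inv = np.linalg.inv(X.T @ X)

def prediction_variance(x0):
    # Formal variance of the fitted mean at the point x0,
    # computed from the full equation.
    v = np.concatenate([[1.0], x0])
    return s2 * v @ XtX_inv @ v

print(prediction_variance([1.0, 1.0]))   # near the x1 = x2 relation: small
print(prediction_variance([1.0, -1.0]))  # far from it: very large
```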

The next, and in many ways the most important, case is where we hope that there is a unique dependence of Y on some, or all, of the regressor variables that will remain stable over a range of conditions; we wish to estimate this relation and in particular to identify the regressor variables that occur in it. In a randomized experiment it may ideally be possible to estimate the contrasts of primary interest separately and efficiently, to show that they do not interact with external factors and that they account for most of the variability. Even here there are difficulties, particularly if the response surface is relatively complicated. For observational data, there are two major difficulties: (a) the possibility of important omitted variables (see Section 2, point (ii)); (b) ambiguities arising from appreciable non-orthogonality of regressor variables. There is discussion below of some of the devices that can be used to try to overcome (b).

In the situation contemplated in the previous paragraph, the objective is essentially the same as that in a randomized experiment. A more limited objective is to analyse preliminary data in order to suggest which factors would be worth including in a subsequent experiment and to suggest appropriate spacing for the levels. It would be interesting to examine the performance of some simple strategies, even though there will always be further information to be taken into account.

(ii) Aids to interpretation. In some cases the main interest may lie in the regression on $x_1$, the variable $x_2$ being included as characterizing say different groups of observations, or some potentially important aspect of secondary interest in the particular investigation. If $x_2$ can conveniently be grouped, it will often be good to fit separate regressions on $x_1$ within each $x_2$ group and then to relate the estimated parameters to $x_2$. This leads to an analysis of the stability of the regression equation and possibly to the construction of models containing interaction. More generally $x_1$ and $x_2$ may be sets of regressor variables.

If there is a property x that is thought not to have an effect on Y, it will often be good to include x as a regressor variable. Significant regression on x would then be a warning, for example of an important omitted variable.

The next set of remarks refers to ambiguities arising from non-orthogonality; the devices described all depend upon introducing further information in some form.

(a) It may be thought that the regression coefficient on say $x_1$ should be non-negative. In some special cases this may resolve an apparent ambiguity. For instance, suppose that $x_1$ and $x_2$ are closely positively related, that the combined regression is large, but the partial regressions are insignificant, that on $x_1$ being negative. Incidentally the attitude to assumptions such as that about the sign of a regression coefficient needs comment. That taken here is that any such assumption should, so far as possible, be tested on the data and, if consistent with the data, its consequences should be analysed and compared with the conclusions without the assumption. It might be argued from a Bayesian viewpoint that a prior probability should be attached to the assumption and a single conclusion obtained, but, even apart from the difficulty of doing this quantitatively in a meaningful way, it seems likely that the more cautious approach will be more informative.

(b) There may be sets of regressor variables which are to a large extent physically equivalent. For example, in a textile experiment yarn strength can be measured by several different methods. Quite often the measurements may be expected to be highly correlated and equivalent as regressor variables, although the data may show this expectation to be false. In applications like this it will be natural to try to use throughout one regressor variable, possibly a simple combination of the separate variables, provided that this does not give an appreciably worse fit than full fitting.

(c) Kendall (1957, p. 75) suggested applying principal component analysis to the regressor variables and then taking new regressor variables specified by the first few principal components. Jeffers (1967) and Spurrell (1963) have given interesting applications. A difficulty seems to be that there is no logical reason why the dependent variable should not be closely tied to the least important principal component. The following modification is worth considering. The principal components may suggest simple combinations of regressor variables with physical meaning. These simple combinations, not the principal components, can be used as regressor variables and if a good fit is obtained a constructive, although not necessarily unique, simplification has emerged. If the regressor variables can be divided into meaningful sets, e.g. into physical measurements and chemical measurements, separate principal component analyses could be considered for the two sets.
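A brief sketch of the principal component device on simulated data, using the singular value decomposition of the centred regressor matrix; note that nothing prevents Y from being tied to a minor component, which is the difficulty mentioned above. All data and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 0] + X[:, 1] + rng.normal(0, 0.1, n)  # correlated block
y = X[:, 2] + rng.normal(0, 0.5, n)

# Principal components of the (centred) regressor variables.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                      # component scores, ordered by s

# Regression of Y on the first two components (Kendall's suggestion);
# the loadings in Vt may instead suggest simple physical combinations.
Z = np.column_stack([np.ones(n), scores[:, :2]])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
rss = np.sum((y - Z @ coef) ** 2)
print(coef, rss)   # may fit poorly if Y is tied to a minor component
```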

(d) In some situations, especially in the physical sciences, the method of dimensional analysis may lead to a reduction in the effective number of regressor variables.

(e) Another general way of clarifying the regression relation when some of the regressor variables are random variables is to examine plausible special models for the interrelationships between all the variables. There are two rather different cases. If the additional assumptions cannot be tested from the data, then parameters not previously estimable may become so, and those previously estimable may have the precision of estimation increased. On the other hand, if the additional assumptions can be tested, then the gain is confined to improved precision. Sewall Wright's method of path coefficients is essentially a device for handling complex systems of interrelations. For general discussion of path coefficients not specifically in genetic terms, see Tukey (1954), Turner and Stevens (1959), Turner et al. (1961) and, particularly for the connection with multiple regression, Kempthorne (1957, Chapter 14). The most familiar example of the second type of situation is the use of a concomitant variable to increase the precision of treatment contrasts in controlled experiments. When the concomitant variable is measured before the treatments are applied, the special model is justified by the randomization of treatments. Another simple example is the use of an intermediate variable (Cox, 1960). Here the regression of Y on $X_1$ is of interest and the supposition is that $X_2$ is a further variable such that, given $X_2 = x_2$, Y is independent of $X_1$. Then, under some circumstances, observation of $X_2$ can lead to appreciable increase in the precision of the estimated regression of Y on $X_1$. In other applications, analysis of covariance is used to see whether the data are in accord with the hypothesis that Y is affected by $X_1$ only via $X_2$.


(f) A very special case is when the regressor variables can be arranged in order of priority. The main cases are the fitting of polynomials and Fourier series.

(iii) Analysis of a set of fitted regressions. For problems in which many alternative equations, for example all $2^p$ linear regressions, are fitted to the same data, the handling of the resulting information needs comment. In a prediction problem in which the predictions are to be made over a set of x values distributed in much the same way as the data, an average variance of prediction, or better the corresponding standard deviation, will often be a reasonable measure of adequacy; of course, in some applications there may be particular points in x-space at which prediction is required. Those equations significantly worse than the overall fit can be identified in some way. Note that an equation significantly in conflict with the data may still be used, for example because it involves substantial economy in the number of variables to be measured. This would be reasonable if the standard deviation of prediction is thought satisfactory, but the use of such an equation in a new region of x-space is likely to be especially hazardous.

Where we are looking for a (hopefully) unique relation, the first step will often be to list all equations of a particular type that are not significantly contradicted by the data, as a preliminary to trying to narrow down the choice by some of the arguments sketched in (ii). Automatic devices for selecting equations should be used with great caution, if at all. Gorman and Toman (1966) and Hocking and Leslie (1967) have discussed some further methods and in particular have outlined some of the recent unpublished work of Dr C. L. Mallows. A different approach is taken by Newton and Spurrell (1967a, b), who introduce quantities called elements to summarize the set of all $2^p$ regression sums of squares.

Particular caution is necessary in examining the effect of regressor variables which vary much less in the data than would be expected in future applications. The standard errors of the regression coefficients will be high and there is an obvious danger in judging the potential importance of such variables solely from the statistical significance of their regression coefficients.

4. BIVARIATE POPULATIONS

Consider now situations in which the observations are pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ drawn from a bivariate population and in which there is no particular reason for studying the dependence of Y on X rather than that of X on Y. The example concerning heights and weights of schoolchildren discussed by Ehrenberg (1968) is an instance. Either or both regressions could legitimately be considered, but the question is whether it is fruitful to do so.

With one homogeneous set of data the concise description of the joint distribution is all that can be attempted, in the absence of a more specific objective. This may be done by a frequency table or by an estimate of the joint cumulative distribution function, or some parametric bivariate distribution can be fitted. While there has been discussion of special families of bivariate distributions other than the bivariate normal (Plackett, 1965; Moran, 1967) the bivariate normal distribution is nevertheless the one most likely to arise. Preliminary transformation may be desirable and one possibility is to consider transformations from (x, y) to

$$\left(\frac{x^{\lambda_1} - 1}{\lambda_1},\; \frac{y^{\lambda_2} - 1}{\lambda_2}\right)$$

and to estimate $(\lambda_1, \lambda_2)$ by maximum likelihood (Box and Cox, 1964), assuming that on the transformed scale a bivariate normal distribution does apply. In some applications it may be reasonable to take $\lambda_1 = \lambda_2$.

If a bivariate normal distribution is fitted, estimates of five parameters are required and these might, for example, be the means $(\mu_X, \mu_Y)$, the variances $(\sigma_X^2, \sigma_Y^2)$ and the correlation coefficient $\rho$; see, however, Section 2, point (iv) for remarks on parametrization.

When there are k populations the problem will be to describe the set of populations in a concise way. There are many possibilities. Often separate descriptions will be attempted of (a) the means $(\mu_{Xi}, \mu_{Yi})$ $(i = 1, \ldots, k)$ and of (b) the parameters determining the covariance matrices. For (a) such questions will arise as whether the means lie on or around a line or curve and whether their position can be linked with some other variable characterizing the populations. Ehrenberg's (1963) criticisms of regression applied to bivariate populations are partly directed at confusions of comparisons between populations with those within populations.

If the covariance matrices are not constant, it will be natural to look for aspects that are constant and these might include one or other regression coefficient, the ratio of the standard deviations, the correlation coefficient, etc. Any changes in covariance matrix may be linked with changes in mean.

Of course once a potentially reasonable representation is obtained, standard techniques, especially maximum likelihood, are available for fitting and for constructing significance tests. In many cases, however, the most challenging problem will be to discover the most fruitful concise representation among the many possibilities.

All the remarks of this section apply in principle to $p$-variate problems.

5. MODELS WITH COMPONENTS OF VARIATION

In the mathematical theory of regression the most awkward problems are probably those in which the observations are split into components not directly observable, and the relationships between these components are to be explored. There is a very extensive theoretical literature on such situations; see, in particular, Lindley (1947), Madansky (1959), Tukey (1951), Sprent (1966), Fisk (1967), Kendall and Stuart (1967) and Nelder (1968).

In this section a few comments on such systems will be made, particularly on points which connect with the previous discussion.

The simplest situation is where only the dependent variable is split into components, a hypothetical true value and a measurement or sampling error. The main question, easily answered, is then to assess how much of the observed dispersion of Y about its regression on x is accounted for by the measurement or sampling error. For example Y might be the square root of a Poisson distributed variable, when the sampling error has variance nearly 1/4. One would, in particular, want to know whether this accounted for all the random variation present. More difficult problems would arise if it were required to estimate the distributional form of the "hidden" component of random variation.
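The 1/4 figure is easily checked by simulation; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(9)

# Variance-stabilizing check: for a Poisson variable with a moderate
# mean, the square root has sampling variance close to 1/4.
for mean in (4.0, 10.0, 25.0):
    sample = rng.poisson(mean, 200_000)
    print(mean, np.sqrt(sample).var())   # each close to 0.25
```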

The more interesting cases are where both the dependent and the regressor variables can be split into components:

$$X_i = \Phi_i + \epsilon_i, \qquad Y_i = \Psi_i + \eta_i, \qquad \Psi_i = \alpha + \beta\Phi_i + \epsilon_i'.$$


Here $\epsilon_i$, $\eta_i$ are measurement or sampling errors of zero mean and $\epsilon_i'$ is a deviation from the regression line, again of zero mean. The simplest case is where $\Phi_i$, the "true" value of the regressor variable, is a random variable. Random variables for different i are assumed independent and the triple $(\epsilon_i, \eta_i, \epsilon_i')$ is assumed independent of $\Phi_i$. Various cases may arise for the covariance matrix of the triple, the simplest being that the three components are mutually independent. Fisk (1967) and Nelder (1968) have considered models in which the regression coefficient is a random variable.

Sometimes it is convenient to write $\beta_{Y\Phi}$ instead of $\beta$ to distinguish it from $\beta_{YX}$, the population least squares regression coefficient of Y on X. In fact

$$\beta_{Y\Phi} = \beta_{YX}\,\frac{\mathrm{var}(X)}{\mathrm{var}(X) - \mathrm{var}(\epsilon)}.$$

If prediction of Y directly from X is the objective, $\beta_{YX}$ is required, not $\beta_{Y\Phi}$, so long as X is a random variable; if, however, the future X's at which prediction is to be attempted are not random, or come from a different distribution, the presence of the components $\epsilon$ does need consideration.

Much published discussion concentrates on the estimation of $\beta$ and in particular on the circumstances under which $\beta$ is consistently estimable; for some purposes it is enough to note that $\beta$ lies between $\beta_{YX}$ and $1/\beta_{XY}$ (Moran, 1956). The simplest case is when $\mathrm{var}(\epsilon)$ can be estimated from separate data, for instance from within-replicate variation, or theoretically. Quite often the correction factor $\beta_{Y\Phi}/\beta_{YX}$ is very near one.
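A simulated sketch of the attenuation and its correction, assuming $\mathrm{var}(\epsilon)$ is known from separate data; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)
n, beta = 100_000, 2.0
phi = rng.normal(0, 1.0, n)            # true regressor values Phi
x = phi + rng.normal(0, 0.5, n)        # X observed with error epsilon
y = 1.0 + beta * phi + rng.normal(0, 1.0, n)

beta_yx = np.cov(x, y)[0, 1] / x.var()       # attenuated coefficient
var_eps = 0.25                               # assumed known separately
beta_corrected = beta_yx * x.var() / (x.var() - var_eps)
print(beta_yx, beta_corrected)   # about 1.6 versus about 2.0
```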

Some further problems arise naturally and some, but not all, can be answered in a fairly direct way. In most cases separate estimates of at least part of the covariance matrix of $(\epsilon, \eta)$ are required. If more than the minimum amount of information is available a more searching test of the model is possible. The following illustrate the further problems:

(i) Estimate the three components of variance of Y, namely $\beta^2\mathrm{var}(\Phi)$, $\mathrm{var}(\epsilon')$ and $\mathrm{var}(\eta)$.

(ii) In particular, are the data consistent with all the $\epsilon_i'$ being zero?

(iii) Is a discrepancy between the estimated regression coefficient of Y on X and a theoretical value explicable in terms of "errors" in the regressor variable?

(iv) Are apparent differences between groups in the regression coefficients of Y on X explicable in terms of "errors" in the regressor variable?

(v) In the context of Section 4, $(X_i, Y_i)$ may refer to the sample means of the ith group, $(\Phi_i, \Psi_i)$ being the corresponding population means. The covariance matrix of $(\epsilon, \eta)$ can be estimated: what can be said about the relation between $\Psi_i$ and $\Phi_i$?

(vi) How much more effectively could Y be predicted from X if X were measured more precisely, for example by additional replication?

(vii) Is non-linearity in the regression of Y on X explicable by errors in the regressor variable? (In non-normal cases, if $\Psi$ has linear regression on $\Phi$, Y will not in general have linear regression on X.)

When there is more than one regressor variable similar problems arise. The important techniques based on instrumental variables will not be considered here; see, however, Section 3, point (ii).

6. MISCELLANEOUS POINTS

This final section deals with a number of miscellaneous topics not discussed earlier.


(i) Graphical methods. These are very important both for the direct plotting of scatter diagrams of pairs of variables, possibly distinguishing other variables by a coarse grouping, and for the systematic plotting of residuals; see, for example, Anscombe (1961). Particularly with extensive data, the systematic plotting of residuals is likely to be the most searching way of testing and improving models. It is possible that developments in computer display devices will lead to valuable ways of inspecting relationships involving more than two variables.

(ii) Outliers and robust estimation. The screening of data for suspect observations will often be required. With limited data it will be usual to look at suspect values individually in order to decide whether to include them in any subsequent analysis; often analyses with and without suspect values will be needed. With p observations for each individual the best way of looking for outliers will depend on the type of effect expected. Thus (a) if any extreme deviation is thought to be in a particular known variable, usually the dependent variable, residuals from its regression on the other variables should be examined. For further discussion, see Mickey et al. (1967); (b) suppose that any extreme deviation is thought to be confined to one variable, but not necessarily the same variable for different individuals. This might be the case, for example, with occasional gross recording errors. One procedure is then to calculate p residuals for each individual, one for each variable regressed on all the others; (c) if any individual may be subject to extreme deviations in one or more variables simultaneously, and the joint distribution is approximately p-variate normal, it may be reasonable to calculate for the ith individual, with vector observation $Y_i$, a standardized squared distance from the mean $\bar{Y}$, given by $D_i = (Y_i - \bar{Y})'\,S^{-1}(Y_i - \bar{Y})$, where S is the estimated covariance matrix. Then the ordered $D_i$'s can be plotted against the expected order statistics for samples from the chi-squared distribution with p degrees of freedom. Iteration of the procedure may be desirable. Wilk and Gnanadesikan (1964) have given a general discussion of graphical methods for multiresponse experiments.
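A sketch of procedure (c) on simulated p-variate data with one planted outlier; for brevity the ordered $D_i$'s are paired with approximate chi-squared order-statistic values rather than plotted.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(11)
n, p = 200, 3
Y = rng.multivariate_normal(np.zeros(p), np.eye(p), n)
Y[0] = [6.0, -6.0, 6.0]                 # one planted outlier

mean = Y.mean(axis=0)
S_inv = np.linalg.inv(np.cov(Y, rowvar=False))
# Standardized squared distances D_i = (Y_i - mean)' S^{-1} (Y_i - mean).
D = np.einsum("ij,jk,ik->i", Y - mean, S_inv, Y - mean)

# Plot-ready pairing: ordered D_i against chi-squared(p) quantiles;
# the planted outlier stands far above the reference values.
expected = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)
print(np.sort(D)[-3:], expected[-3:])
```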

With extensive data, however, it may be necessary to use methods of analysis that are insensitive to outliers, so-called methods of robust estimation; see, for example, Huber (1964).

(iii) Missing values. Afifi and Elashoff (1966, 1967) have reviewed the literature on missing values in multivariate data and have considered in some detail point estimation in simple linear regression. Univariate missing value theory concentrates on the computational aspects of exploiting the near-balance of a balanced design spoiled by a missing observation, but no information is contributed by the missing observations. In a multivariate case, however, information may be contributed by individuals for which some component observations are missing. In a multiple regression problem, there is usually no information from individuals in which a regressor variable is missing, unless that variable can be regarded as random. An exception is when there is, say, an individual with $x_1$ missing and analysis of the other individuals suggests the omission of $x_1$ from the regression equation. Suppose, however, that a regressor variable is random, and the individuals with that variable missing can be regarded as selected randomly, a quite severe assumption, which should be tested where possible. Then more can be done. In some applications nearly all individuals may have at least one missing component and then use of some missing value theory is essential. Roughly speaking, the covariance between any two random variables can be estimated from those individuals on which both variables are available; there seems scope for further work to settle just when it is wise to do this and when something more elaborate such as full maximum likelihood estimation is desirable.
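A sketch of the pairwise covariance estimate under the randomly-missing assumption, on simulated trivariate data; the deletion mechanism and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(12)
n = 500
data = rng.multivariate_normal([0, 0, 0],
                               [[1.0, 0.5, 0.2],
                                [0.5, 1.0, 0.3],
                                [0.2, 0.3, 1.0]], n)
# Delete entries completely at random, the quite severe assumption
# noted above, which should be tested where possible.
mask = rng.random(data.shape) < 0.2
data[mask] = np.nan

def pairwise_cov(a, b):
    # Covariance from just the individuals with both variables present.
    ok = ~np.isnan(a) & ~np.isnan(b)
    return np.cov(a[ok], b[ok])[0, 1]

est = [[pairwise_cov(data[:, i], data[:, j]) for j in range(3)]
       for i in range(3)]
print(np.round(est, 2))   # close to the true covariance matrix
```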

(iv) Non-normal variation. The present paper is largely concerned with problems to which least squares methods are reasonably applicable, possibly after transformation. In regression-like problems in which particular non-normal distributions can be specified, we have usually to apply maximum likelihood methods. These are locally equivalent to least squares techniques and therefore a great deal of the above discussion, for example that on the choice of regressor variables, is immediately relevant. Anscombe (1967) considered in some detail the analysis of a linear model with non-normal distribution of error; Cox and Hinkley (1968) found the asymptotic efficiency of least squares estimates in such situations.

The justification of maximum likelihood methods is asymptotic but sometimes analogues of at least a few of the "exact" properties of normal-theory linear models can be obtained. The simplest case is when the ith observation on the dependent variable has a distribution in the exponential family (Lehmann, 1959, p. 50)

$$\exp\{A(y)\,B(\theta_i) + C(y) + D(\theta_i)\},$$

where $\theta_i$ is a single parameter and there is a linear model

$$B(\theta_i) = \sum_r x_{ir}\beta_r,$$

where the $\beta$'s are unknown parameters and the x's known constants. Special cases are the binomial, Poisson and gamma distributions when the "linear" model applies to the logit transform, to the log of the Poisson mean and to the reciprocal of the mean of the gamma distribution. Sufficient statistics are obtained and in very fortunate cases useful "exact" significance tests for single regression coefficients emerge.
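As an illustration of maximum likelihood fitting in one of these special cases, a minimal Newton-Raphson sketch for the Poisson log-linear model on simulated data:

```python
import numpy as np

rng = np.random.default_rng(13)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.8])))

# Newton-Raphson for the Poisson log-linear model: log mean = X beta.
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    score = X.T @ (y - mu)                 # gradient of log likelihood
    fisher = X.T @ (mu[:, None] * X)       # expected information
    beta = beta + np.linalg.solve(fisher, score)
print(beta)   # close to (0.5, 0.8)
```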

(v) Experimental and observational data. Many of the issues discussed in the paper apply less acutely to the analysis of controlled experiments than to the analysis of observational data and that is why the paper may seem overweighted towards the latter type of problem. In fact, in terms of the discussion in this paper, there are three rather different reasons why fewer difficulties arise in the analysis of experimental data, quite apart from the smaller random error to which such data are likely to be subject. These reasons are: (1) the spacing of regressor variables is likely to be more suitable; (2) substantial non-orthogonalities of estimation will be avoided; (3) factors omitted from the treatments will be randomized and hence the worst difficulties associated with omitted variables (Section 2, point (ii)) will be avoided.

ACKNOWLEDGEMENT

I am grateful to Mrs E. J. Snell and to the referees for constructive comments.

REFERENCES

AFIFI, A. A. and ELASHOFF, R. M. (1966). Missing observations in multivariate statistics. I. Review of the literature. J. Am. Statist. Ass., 61, 595-604.
—— (1967). Missing observations in multivariate statistics. II. Point estimation in simple linear regression. J. Am. Statist. Ass., 62, 10-29.
ANSCOMBE, F. J. (1961). Examination of residuals. Proc. 4th Berkeley Symp., 1, 1-36.
—— (1967). Topics in the investigation of linear relations fitted by the method of least squares. J. R. Statist. Soc. B, 29, 1-52.
BEALE, E. M. L., KENDALL, M. G. and MANN, D. W. (1967). The discarding of variables in multivariate analysis. Biometrika, 54, 357-366.
BOX, G. E. P. (1966). Use and abuse of regression. Technometrics, 8, 625-630.
BOX, G. E. P. and COX, D. R. (1964). An analysis of transformations. J. R. Statist. Soc. B, 26, 211-252.
BOX, G. E. P. and TIDWELL, P. W. (1962). Transformation of the independent variables. Technometrics, 4, 531-550.
COX, D. R. (1960). Regression analysis when there is prior information about supplementary variables. J. R. Statist. Soc. B, 22, 172-176.
—— (1961). Tests of separate families of hypotheses. Proc. 4th Berkeley Symp., 1, 105-123.
—— (1962). Further results on tests of separate families of hypotheses. J. R. Statist. Soc. B, 24, 406-424.
COX, D. R. and HINKLEY, D. V. (1968). A note on the efficiency of least squares estimates. J. R. Statist. Soc. B, 30, 284-289.
DRAPER, N. R. and SMITH, H. (1966). Applied Regression Analysis. New York: Wiley.
EHRENBERG, A. S. C. (1963). Bivariate regression is useless. Appl. Statist., 12, 161-179.
—— (1968). The elements of law-like relationships. J. R. Statist. Soc. A, 131, 280-302.
FISHER, R. A. (1956). Statistical Methods and Scientific Inference. Edinburgh: Oliver and Boyd.
FISK, P. (1967). Models of the second kind in regression analysis. J. R. Statist. Soc. B, 29, 266-281.
GORMAN, J. W. and TOMAN, R. J. (1966). Selection of variables for fitting equations to data. Technometrics, 8, 27-51.
HOCKING, R. R. and LESLIE, R. N. (1967). Selection of the best subset in regression analysis. Technometrics, 9, 531-540.
HUBER, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist., 35, 73-101.
JEFFERS, J. N. R. (1967). Two case studies in the application of principal component analysis. Appl. Statist., 16, 225-236.
KEMPTHORNE, O. (1957). An Introduction to Genetic Statistics. New York: Wiley.
KENDALL, M. G. (1957). A Course in Multivariate Analysis. London: Griffin.
KENDALL, M. G. and STUART, A. (1967). The Advanced Theory of Statistics, Vol. 2 (2nd ed.). London: Griffin.
LEHMANN, E. L. (1959). Testing Statistical Hypotheses. New York: Wiley.
LINDLEY, D. V. (1947). Regression lines and linear functional relationships. J. R. Statist. Soc. B, 9, 218-244.
—— (1968). The choice of variables in multiple regression. J. R. Statist. Soc. B, 30, 31-66.
MADANSKY, A. (1959). The fitting of straight lines when both variables are subject to error. J. Am. Statist. Ass., 54, 173-205.
MICKEY, M. R., DUNN, O. J. and CLARK, V. (1967). Note on the use of stepwise regression in detecting outliers. Comp. and Biomed. Res., 1, 105-111.
MORAN, P. A. P. (1956). A test of significance for an unidentified relation. J. R. Statist. Soc. B, 18, 61-64.
—— (1967). Testing for correlation between non-negative variates. Biometrika, 54, 385-394.
NELDER, J. A. (1968). Regression, model-building and invariance. J. R. Statist. Soc. A, 131, 303-315.
NEWTON, R. G. and SPURRELL, D. J. (1967a). A development of multiple regression for the analysis of routine data. Appl. Statist., 16, 51-64.
—— (1967b). Examples of the use of elements for clarifying regression analysis. Appl. Statist., 16, 165-172.
PLACKETT, R. L. (1960). Regression Analysis. Oxford: Clarendon Press.
—— (1965). A class of bivariate distributions. J. Am. Statist. Ass., 60, 516-522.
RAO, C. R. (1965). Linear Statistical Inference and its Applications. New York: Wiley.
SPRENT, P. (1966). A generalized least-squares approach to linear functional relationships. J. R. Statist. Soc. B, 28, 278-297.
SPURRELL, D. J. (1963). Some metallurgical applications of principal components. Appl. Statist., 12, 180-188.
TUKEY, J. W. (1951). Components in regression. Biometrics, 7, 33-69.
—— (1954). Causation, regression and path analysis. In Statistics and Mathematics in Biology (ed. O. Kempthorne). Ames: Iowa State College Press.
TURNER, M. E., MONROE, R. J. and LUCAS, H. L. (1961). Generalized asymptotic regression and non-linear path analysis. Biometrics, 17, 120-143.
TURNER, M. E. and STEVENS, C. D. (1959). The regression analysis of causal paths. Biometrics, 15, 236-258.
WILK, M. B. and GNANADESIKAN, R. (1964). Graphical methods for internal comparisons in multiresponse experiments. Ann. Math. Statist., 35, 613-631.
WILLIAMS, E. J. (1959). Regression Analysis. New York: Wiley.
YATES, F. (1939). Tests of significance of the differences between regression coefficients derived from two sets of correlated variates. Proc. R. Soc. Edinb., 59, 184-194.
