Evaluation of Goodness-of-Fit Indices for Structural Equation Models

Stanley A. Mulaik, Larry R. James, Judith Van Alstine, Nathan Bennett, Sherri Lind, and C. Dean Stilwell
Georgia Institute of Technology

Discusses how current goodness-of-fit indices fail to assess parsimony and hence disconfirmability of a model and are insensitive to misspecifications of causal relations (a) among latent variables when measurement model with many indicators is correct and (b) when causal relations corresponding to free parameters expected to be nonzero turn out to be zero or near zero. A discussion of philosophy of parsimony elucidates relations of parsimony to parameter estimation, disconfirmability, and goodness of fit. AGFI in LISREL is rejected. A method of adjusting goodness-of-fit indices by a parsimony ratio is described. Also discusses less biased estimates of goodness of fit and a relative normed-fit index for testing fit of structural model exclusive of the measurement model.

By a goodness-of-fit index, in structural equations modeling, we mean an index for assessing the fit of a model to data that ranges in possible value between zero and unity, with zero indicating a complete lack of fit and unity indicating perfect fit. Although chi-square statistics are often used as goodness-of-fit indices, they range between zero and infinity, with zero indicating perfect fit and a large number indicating extreme lack of fit. We prefer to call chi-square and other indices with this property lack-of-fit indices. For a recent discussion of both lack-of-fit and goodness-of-fit indices, see Wheaton (1988).

In this article we evaluate the use of goodness-of-fit indices for the assessment of the fit of structural equation models to data. Our aim is to review their rationales and to assess their strengths and weaknesses. We also consider other aspects of the problem of evaluating a structural equation model with goodness-of-fit indices. For example, are certain goodness-of-fit indices to be used only in certain stages of research (a contention of Sobel & Bohrnstedt, 1985)? Or, how biased are estimates of goodness of fit in small samples? What bearing does parsimony have on assessing the goodness of fit of the model? Can goodness-of-fit indices focus on the fit of certain aspects of a model as opposed to the fit of the overall model? For example, to what extent do current goodness-of-fit indices fail to reveal poor fits in the structural submodel among the latent variables because of good fits in the measurement model relating latent variables to manifest indicators? We describe a goodness-of-fit index now used by some researchers that addresses this problem. Finally, to what extent do goodness-of-fit indices fail to represent misspecifications of a model when hypothesized causal paths turn out to have associated with them zero or near-zero estimates for their structural parameters? Our answer is that current goodness-of-fit indices evaluate only certain aspects of a model and must be used judiciously in connection with other methods for the evaluation of a model.

This article is based in part on a paper presented by the first author to the Society of Multivariate Experimental Psychology at its Annual Meeting in Atlanta, Georgia, October 30 to November 1, 1986.

We are indebted to Chris Hertzog for comments made to earlier versions of this article, particularly in connection with the relative normed-fit index, which we thought we had invented, only to discover that he had independently invented the same index a short time before. We use his name for the index and add corrections for bias in small samples to its formula.

Correspondence concerning this article should be addressed to Stanley A. Mulaik, School of Psychology, Georgia Institute of Technology, Atlanta, Georgia 30332.

Survey of Current Indices

Earlier reviews and discussions (Bentler & Bonett, 1980; Sobel & Bohrnstedt, 1985; Specht, 1975; Specht & Warren, 1976) point out that the use of goodness-of-fit indices has grown out of researchers' dissatisfaction with the chi-square statistic traditionally used in assessing the fit of models. Typically, the values of the chi-square statistic for most researchers' models are significant, implying that the researchers must reject their models. And yet, in many of these cases, an inspection of the residuals representing the difference between the elements of the unrestricted sample covariance matrix and those of the estimated hypothetical model covariance matrix for the observed variables reveals that they are small in an absolute sense, giving rise to the impression that the models may not be so theoretically off-target as the significance of the chi-square statistic suggests.

Chi-Square Test Justified Just When Test Has Near-Maximum Power

Bentler and Bonett (1980) sought to qualify use of the chi-square statistic by pointing out that regarding the chi-square statistic as having a chi-square distribution is justified by asymptotic distribution theory only in large samples, precisely when the power of the statistic to detect small discrepancies between the model and the data becomes very large. Many researchers may regard a rejected model as due to a poor specification on their part of theoretical values for the fixed parameters of the model. But in our opinion, the fault in the model may not always be a misspecification of the parameters of the model but may reflect a failure to satisfy other conditions necessary for the test of the model (James, Mulaik, & Brett, 1982; Mulaik, 1987). For example, the assumption that the data represent a random sample from a multivariate normal distribution may be wrong. Or, although one may assume that the causal relations between the variables in all subjects in a sample are adequately described by the model, there may be a few isolated individuals in the sample for whom the model is not appropriate. Or the assumption that one has achieved a completely closed system of variables may be incorrect, even though one may have included in the model those causal variables that account for a substantial portion of the variance in the dependent variables. Consequently, researchers have desired an index that does not simply tell them that their model does or does not fit the data precisely but also indicates how closely their model fits the data. Even if the model is to be rejected by the chi-square test, a high degree of fit may suggest that much is to be salvaged in the model, because a more careful assessment of the model's assumptions and the manner in which the data conform to these assumptions may reveal where the discrepancy lies.
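As a small numerical sketch of this point (our own illustration, with hypothetical discrepancy and degrees-of-freedom values, using SciPy's chi-square quantile function), note that for a fixed, small discrepancy F between model and data, the test statistic (N − 1)F grows without bound with the sample size, so the same nearly fitting model is retained in a small sample and rejected in a large one:

from scipy.stats import chi2

F, df = 0.05, 40                       # hypothetical fixed discrepancy (fit-function value) and model df
critical = chi2.ppf(0.95, df)          # 5% critical value, about 55.8
for N in (100, 500, 2000):
    statistic = (N - 1) * F            # chi-square statistic under maximum likelihood estimation
    print(N, round(statistic, 2), statistic > critical)
# N = 100 and N = 500 retain the model; N = 2000 rejects it, with the model unchanged.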

Indices Patterned After Multiple Correlation

The squared multiple correlation has served as a paradigm to inspire a number of goodness-of-fit indices in causal modeling.

For example, Specht (1975) developed a generalized multiple correlation coefficient to indicate how well variation among the exogenous variables in a causal model determines the variation among the endogenous variables. However, such an index has a more special purpose than that of an index of the fit of a whole model to data, and we do not consider it in detail here. Specht (1975) also offered an index Q analogous to the generalized multiple correlation coefficient to indicate how well a causal model reproduces the observed covariance matrix:

Q = |S| / |Σ̂j|,

where |S| is the determinant of the actually observed, unconstrained variance-covariance matrix for the observed variables and |Σ̂j| is the determinant of the reproduced covariance matrix under (overidentified) Model j. This index varies between zero and unity, with zero indicating total lack of fit and unity indicating perfect fit. Note that the numerator, which is a function of the observed data to be explained by a model, remains constant for a given set of data, and the denominator varies with different models offered in explanation of the data. This is just the reverse of a coefficient of determination, which has an unconstrained estimate of the variance to be explained in its denominator and varies in value with the values of predicted variance under various models in the numerator. Therefore, the Q index cannot be interpreted as a proportion of, say, total variation accounted for. Furthermore, whereas in principle a zero value for Q indicates complete lack of fit, in practice, against almost any worst fitting null model, Q will be bounded from below by some value greater than zero. This is because the determinant |S| of an empirical covariance matrix S involving only moderately correlated variables and no linear dependencies among variables will almost always be greater than zero, whereas the determinant |Σ̂j| of the covariance matrix for any most restricted null model, say, one that generates a diagonal covariance matrix of zero off-diagonal covariances, will be finite and not always much larger than |S|. It would seem that a rational choice for a goodness-of-fit index would require that a worst fitting model, say, the null model, have a zero value for its goodness-of-fit index. Q does not provide this.
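As an illustration of this last point (a hypothetical numerical sketch of our own, not from Specht, 1975), even the no-covariance null model yields a Q value well above zero for a moderately correlated set of variables:

import numpy as np

S = np.array([[1.0, 0.4, 0.3],
              [0.4, 1.0, 0.5],
              [0.3, 0.5, 1.0]])            # observed covariance matrix (hypothetical)

Sigma_null = np.diag(np.diag(S))           # null model: sample variances, zero covariances

Q_null = np.linalg.det(S) / np.linalg.det(Sigma_null)
print(Q_null)                              # about 0.62 -- far from zero for the worst-fitting model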

Normed-Fit Index

Nested models concept. Before we consider the normed-fit index, we must first consider what is meant by a nested sequence of models, because the normed-fit index depends on comparisons of lack of fit between models in a nested sequence of models. Actually, the term a nested sequence of models is used in two ways in the literature. According to Bentler and Bonett (1980), one way, the parameter-nested sequence of models, involves a sequence of models nested strictly according to their parameters; the other way, the covariance matrix-nested sequence of models, uses nested in a less restricted sense to refer to a sequence of models nested according to their covariance matrices. The distinction between these two forms of nesting is given as follows:

On the one hand, a parameter-nested sequence of models is a sequence of similar models having the same parameters but ordered according to increasingly more restricted a priori constraints placed on their parameters. For example, a typical model in a nested sequence may have five parameters,

b1 b2 b3 b4 b5.

Beginning with a completely unrestricted model with no a priori restrictions on its parameters, we may construct an increasingly restricted nested sequence of models by fixing one additional parameter in each succeeding model:

M5: b1 b2 b3 b4 b5
M1: b1 b2 b3 b4 0
M2: b1 b2 b3 0 0
M3: b1 b2 0 0 0
M4: b1 0 0 0 0
M0: 0 0 0 0 0.

Model M5 is the least restricted model, because all of its parameters are free to range over the set of real numbers in the estimation of the parameters. Succeeding models are more restricted variants of those preceding them. For a model in the sequence (other than the first), those parameters corresponding to fixed parameters in the preceding model in the sequence are also fixed to the same values in the current and subsequent models. Certain additional parameters, corresponding to some of the free parameters in the preceding model, are then fixed to certain values in the current and subsequent models. The remaining parameters, also corresponding to the remaining free parameters of the preceding model, are left free in the current model.
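A minimal sketch (our own illustration, with hypothetical parameter names) of the nesting relation just described: each successively more restricted model fixes a superset of the parameters fixed by the model before it, so its set of free parameters shrinks accordingly.

# Fixed-parameter sets for the sequence M5, M1, ..., M0 displayed above (all fixed to zero here).
models = {
    "M5": set(),
    "M1": {"b5"},
    "M2": {"b4", "b5"},
    "M3": {"b3", "b4", "b5"},
    "M4": {"b2", "b3", "b4", "b5"},
    "M0": {"b1", "b2", "b3", "b4", "b5"},
}

names = list(models)
for earlier, later in zip(names, names[1:]):
    # Parameter nesting: every parameter fixed in the less restricted model
    # is also fixed (to the same value) in the more restricted one.
    assert models[earlier] <= models[later]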

Aside from simply fixing parameters to specified point values to place constraints on the parameters, one has other, sometimes less restrictive, forms of constraint. For example, one may constrain a model by requiring that a parameter fall within a specified interval, and in a subsequent, more restricted model, one may require that the parameter then take a specific value in this interval. One may constrain a parameter by requiring its estimated value to equal the estimated value of another parameter. See Bentler and Bonett (1980) and Steiger, Shapiro, and Browne (1985) for discussion of the various ways of constraining parameters in nested sequences of models.

Now, the most important characteristic of parameter-nested sequences of increasingly restricted models is that they can display increasing (but never decreasing) magnitudes of lack of fit as one assesses the fit of each successive, more restricted model in the sequence to a given set of data (Bentler & Bonett, 1980). Because the fit of structural equation models is assessed by comparing a model's reproduced covariance matrix to the observed sample covariance matrix, any other sequence of models that generates the same sequence of reproduced covariance matrices as does a parameter-nested sequence of models will generate the same sequence of lack-of-fit index values as does the parameter-nested sequence.

Consequently, a covariance matrix-nested sequence of models is a sequence of structural equation models that generates the same sequence of reproduced covariance matrices as does some parameter-nested sequence of structural equation models. Many nested sequences of models described in the literature are only covariance matrix nested, deemed by researchers to be more convenient to use as proxies for their more strictly parameter-nested counterparts. For example, each model of the nested sequence, going from a saturated model (having perfect fit to the observed covariance matrix) through the measurement model to a more constrained structural model, and then to a null model, can be shown to correspond to a model of a parameter-nested sequence in having the same reproduced covariance matrix as the parameter-nested model. It is important to realize that although a parameter-nested sequence of models may be unique, a covariance matrix-nested sequence of models may correspond to any number of distinct parameter-nested sequences of models, all of which generate the same sequence of covariance matrices.

The importance of nested sequences of models is that they may be used in connection with a rational sequence of tests designed to provide information about distinct aspects of a structural equation model embedded within the sequence. The principle of nested sequences of models is not of recent origin, having been described by Roy (1958), Roy and Bargmann (1958), Kabe (1963), Bock and Haggard (1968), and Mulaik (1972) in connection with step-down procedures in multivariate analysis. Discussions of such nested sequences in structural equations modeling are given in Bentler and Bonett (1980) and James, Mulaik, and Brett (1982), and the reader is referred to these references for details. It is also important to note that sequences of chi-square difference tests comparing differences in lack of fit between adjacent models in a nested sequence of models are asymptotically independent (Steiger, Shapiro, & Browne, 1985). Such tests permit one to isolate where fit and lack of fit arise in a model in the nested sequence.

Normed index for comparing models. Bentler and Bonett (1980) described a normed index for the comparison of fit of two nested models against a given set of data. The index can be constructed using any one of a number of lack-of-fit indices as the basis for the measurement of lack of fit of a model to data, such as the chi-square index obtained when fitting the model with maximum likelihood or generalized least squares estimation or the sum of squared residuals obtained using least squares estimation. We present Bentler and Bonett's (1980) normed index here with a slight modification designed to give it greater general application:

Δkj = (Fk − Fj)/(Fo − Fh), (1)

where Fh, Fj, Fk, and Fo are the lack-of-fit indices of four increasingly restricted nested models, Mh, Mj, Mk, and M0, respectively, with M0 known as the null model. In effect, the difference in the lack of fit of the most restricted null model M0 and the least restricted model Mh is used as a norm by which to evaluate the difference between the two intermediate models. (Bentler and Bonett [1980] did not include Fh in the denominator of their index; but our index is equivalent to theirs if one takes Fh to be the lack of fit of a saturated or just-identified model, which has a lack of fit of zero.) Over the total range of increasing a priori restrictions on parameters in the sequence of nested models, beginning with Mh and ending with M0, Δkj gives the proportion of the difference in lack of fit between the most and least restricted models contributed by the difference in restrictions between the two intermediate models, Mj and Mk.

Normed-fit index. A popular index that is a specialization of the normed index for comparing models and that, when used in certain contexts, reflects the proportion of total information "accounted for" by a model is the normed-fit index of Bentler and Bonett (1980), which, with some license in notation on our part, is given as

NFI(j) = (Fo − Fj)/(Fo − Fs), (2)

where Fo is a lack-of-fit measure, for example, chi-square (for maximum likelihood estimation) or the sum of squared residuals (for unrestricted least squares estimation) when comparing the sample covariance matrix with the hypothetical covariance matrix derived from the parameters of a null model; Fj is the comparable lack-of-fit measure (chi-square or sum of squared residuals) when comparing the sample covariance matrix with the hypothetical covariance matrix derived from the parameters of a less restricted model (Model j); and Fs is the comparable lack-of-fit measure when comparing the sample covariance matrix with the hypothetical covariance matrix derived from the parameters of a saturated or just-identified model. A saturated model has as many parameters to estimate as there are observed parameters in the sample covariance matrix from which to derive estimates of those parameters. Consequently, the estimated (reproduced) covariance matrix for the saturated or just-identified model equals the sample covariance matrix. So, the lack-of-fit index Fs for the saturated model equals zero, because there is no discrepancy between the sample covariance matrix and the covariance matrix derived from the estimates of the saturated model's parameters. The rationale of the normed-fit index NFI(j) is as follows:

Given that the models in a nested sequence are all identified, a nested sequence of models can range, in the most extreme case, from a completely saturated model to a completely specified model having no estimated parameters. Frequently, however, the range of a nested sequence of models is less than this. For example, the most restricted model in the sequence, designated the null model M0, may not specify all of its parameters a priori. Now, Fo, the lack of fit of the null model, is the maximum possible lack of fit one might obtain in a nested sequence of models ranging from a saturated or just-identified model (with lack of fit Fs equal to zero) through Model j (with lack of fit Fj) to the null model (with lack of fit equal to Fo). Thus Fo can serve as a norm by which to evaluate the degree to which Model j reduces lack of fit from the maximum possible lack of fit obtained in the nested sequence of models. The ratio (Fo − Fj)/Fo (dropping the expression for Fs in the denominator because it equals zero) thus represents the proportion of the total lack of fit that has been reduced by the use of Model j.
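The following short sketch (our own, with made-up chi-square values for illustration) computes Equations 1 and 2 and shows that with Fs = 0 the normed-fit index is simply the proportion of the null model's lack of fit removed by Model j:

def delta(F_k, F_j, F_0, F_h=0.0):
    """Normed comparison of two intermediate nested models (Equation 1)."""
    return (F_k - F_j) / (F_0 - F_h)

def nfi(F_j, F_0, F_s=0.0):
    """Bentler-Bonett normed-fit index (Equation 2); F_s = 0 for a saturated model."""
    return (F_0 - F_j) / (F_0 - F_s)

chi2_null, chi2_j = 900.0, 90.0       # hypothetical chi-square lack-of-fit values
print(nfi(chi2_j, chi2_null))         # 0.90: Model j removes 90% of the null model's lack of fit
print(delta(300.0, 90.0, chi2_null))  # share of lack of fit removed between two intermediate models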

It must be obvious by now that one should never try to use the normed-fit index when the lack-of-fit index of the null model is zero, because then one would divide by zero in the calculation of the normed-fit index. On the other hand, if (in the case in which the lack-of-fit index is a chi-square statistic) the lack of fit of the null model is not significant but still not equal to zero, then performing a sequence of difference chi-square tests on the models of the sequence will converge to acceptance of the null model for the nested sequence in question.

Null model. In using Bentler's normed-fit index, one must determine the null model relevant to one's purposes whose lack-of-fit index will serve as Fo in the normed-fit index. Bentler and Bonett (1980) and James et al. (1982) argued that in the case of many structural equations models, and in common factor analysis models particularly, the aim is only to account for relationships among a set of observed variables. No attempt is made to account for the variances of these variables, which may be arbitrarily scaled. Thus a natural, most restricted null model would be one in which there is no relationship among the observed variables, with the consequence that the hypothetical covariance matrix is a diagonal matrix with fixed zero off-diagonal covariances and unspecified variances in the diagonal. The variances of this diagonal matrix, being free parameters, are estimated by the sample variances of these variables. Thus the sample covariance matrix will differ from the diagonal covariance matrix of the null model only in terms of the nonzero covariances among the variables. It is these differences, representing causal relationships, that are to be explained by a model. And so the difference in lack of fit between the saturated model and the null model represents the range of information that is to be accounted for by any model that seeks to reduce that lack of fit in a parsimonious way. This difference is thus the norm for the index.

On the other hand, the term null model must not be interpreted to always mean a model that postulates no relationships between variables. Bentler and Bonett (1980) regarded a null model as a general, most restricted model against which other less restricted models are to be compared in a nested sequence of models. This concept clearly leaves open the possibility that in some situations, choices for a null model other than the model of no covariances between variables may be appropriate.

Any structural model that fixes a structural parameter to a value other than zero (other than for purposes of arbitrarily specifying the metric of a latent variable) will necessarily be nested within a sequence in which the most restricted model in the sequence also fixes the same parameter to the same nonzero value. Such a null model may generate a covariance matrix that is not a diagonal matrix. Thus one may wish to hazard hypotheses that specify a priori not just the zero parameters but the nonzero values of other parameters as well. In the nested sequence of models, beginning with a saturated model, one may, in successive models, fix parameters to specified values in the order of one's decreasing confidence in these specified values. Although the lack of fit Fo of the most restricted model in the sequence of models may be significant statistically, one may wish to test the fit of some intermediate model in the sequence involving a subset of the fixed parameters of the most restricted null model about which one has the greatest confidence. The normed-fit index for this intermediate model, however, is not to be interpreted as a "proportion-of-total-covariance" index, because the norm of the index does not in this case correspond to a measure of the covariation to be explained. Rather, the norm corresponds to a measure of the discrepancy between the (possibly nondiagonal) covariance matrix generated under the most restricted model and the covariance matrix generated under the saturated or unrestricted model. Thus the normed-fit index in this case is to be interpreted as the proportional reduction in the lack of fit between the null and saturated models achieved by the intermediate model's fixing fewer and estimating more parameters.

Sobel and Bohrnstedt's criticisms. Sobel and Bohrnstedt (1985) criticized the use of the uncorrelated variables null model. They referred to this null model as the "no-factor" null model, because a diagonal covariance matrix would be obtained if there were no common factors among a set of variables. They claimed that it should not be used in other than purely exploratory contexts. They argued that by using the no-factor null model, one might conclude that a relatively large normed-fit index suggests that a model in question is scientifically adequate, but in fact the normed-fit index used in this way will not tell one whether the model represents a substantial improvement in knowledge over what is already available. It only tells one, said Sobel and Bohrnstedt, that the model is substantially better than a no-factor model. For example, we may already know that a set of variables are intercorrelated with other than zero correlations, so using the null model of no factors or zero correlations is to use a baseline model (Sobel and Bohrnstedt's new term for the null model) that is already rejected by current knowledge. Sobel and Bohrnstedt (1985) argued further that in many instances one should use some other baseline model in lieu of the no-factor null model. The choice for the baseline model depends on the current theoretical context within which the hypothesis is being considered. For example, in a factor analytic context, we may already assume the existence of two factors but consider a less restricted hypothetical model involving additional factors as a hypothesis. Thus a two-factor model could be the baseline or null model rather than the no-factor model against which our hypothetical model is to be compared. Or, to consider another example, within the context of hypothesizing two factors, we may compare a model, in which certain factor loadings are free to be estimated and are different from one another, against a baseline or null model, in which the same loadings are constrained to be equal to one another. Because a model that is less restricted than the no-factor model is used as the baseline or null model, the normed-fit index is much more sensitive (by having a smaller denominator) to tests of improvement in fit in going from the baseline model to the hypothesized model.

We accept Sobel and Bohrnstedt's (1985) observation that use of the no-factor null model leads to normed-fit indices that may not be very sensitive to important differences between models that are of current theoretical interest. However, we still find the use of the no-factor null model in a goodness-of-fit index useful. The index reveals, in relation to the observed covariance matrix, the proportional degree to which the many relationships observed between variables within that matrix are reproduced by the model. That is useful information for model comparison, especially when the models are not members of the same nested sequence of models. In fact, most competing models in science are not nested one within the other, because they often represent the phenomena in quite different ways with different sets of parameters. Yet it is the more or less absolute, overall fit of these models to the same data that is important information for comparing them. And that is the information that is provided by using a no-factor null model with a normed-fit index. (And similar information is provided by the GFIs of LISREL.) For purposes other than this, we believe that the nested models concept has primarily a limited application, that of evaluating a given hypothetical model by testing different aspects of the model (e.g., in the manner described by Bentler & Bonett, 1980, Hertzog, in press, and James et al., 1982).

On the other hand, we agree with Sobel and Bohrnstedt (1985) that if two models applied to the same data both obtain normed-fit indices in the .90s, the differences in fit between them may indeed be small, involving only differences in a few parameters, and yet the differences may have considerable theoretical importance at a given historical moment. To deal with the detection of these differences, we think the answer is to magnify these differences in the normed-fit index by the use of less restricted null models and norms other than the difference in lack of fit between the no-factor model and the saturated model (which has zero lack of fit). Sobel and Bohrnstedt (1985) were indeed moving in that direction by suggesting the use of other null models. However, they did not consider anchoring the norm of the normed-fit index in the difference between their baseline model and some model in the nested sequence intermediate between the tested model and the saturated model. (Indeed, their formula for the normed-fit index was the same as Bentler & Bonett's [1980] formula, and so they did not suggest that the norm of the index is a difference in lack of fit between two models.) Had Sobel and Bohrnstedt done so, they would have made an index much more sensitive to the differences they wished to detect than their own modified normed-fit indices. We have more to say about this later on in connection with relative normed-fit indices.

Small sample bias of normed-fit index. Marsh, Balla, and McDonald (1988) showed with a Monte Carlo study that the normed-fit index of Bentler and Bonett (1980) belongs to a class of goodness-of-fit indices, the Type 1 incremental-fit indices, which on the average, for small samples less than 200 in size, significantly underestimate the asymptotic value of the same index. A Type 1 incremental-fit index is of the form

IFI1(F) = (Fo − Fj)/Fo,

where F is some basic lack-of-fit index, such as the maximum likelihood fit function value for the model (FF), χ², χ²/df, the likelihood ratio, or the root-mean-square residual (RMR). Thus Fo is the lack-of-fit index for the null model, and Fj is the lack-of-fit index for Model j. In contrast, Marsh et al. (1988) reported a second class of goodness-of-fit indices, the Type 2 incremental-fit indices, which are of the form

IFI2(F) = (Fo − Fj)/[Fo − E(Fj | Model j is true)],

where E( ) is the expected value operator. These indices as a class tend to underestimate their asymptotic value in small samples to a much lesser degree, and any Type 2 incremental-fit index based on the fit function FF, χ², χ²/df, or on the Akaike (1987) information criterion (AIC), χ² + 2q(j), or its variant [χ² + 2q(j)]/N as modified by Cudeck and Browne (1983; where q(j) is the number of parameters estimated in the model) was recommended by Marsh, Balla, and McDonald (1988) in place of the corresponding Type 1 incremental-fit index.

Those committed in the past to using a Type 1 incremental-fit index such as the normed-fit index of Bentler and Bonett (1980) may be inclined to resist accepting use of Type 2 incremental-fit indices because a rational analysis of their formulas suggests that they measure slightly different aspects of fit. One may still like the Type 1 index because it indicates the proportion of the information about associations between variables explained by a model. The Type 2 incremental-fit indices seem not to have quite this same interpretation. However, whatever controversy there may be over whether to choose the Type 1 or Type 2 incremental-fit index in large samples, this controversy is made moot by the fact, not noted by Marsh et al. (1988), that each of the Type 2 incremental-fit indices recommended by them asymptotically equals the asymptotic value of its corresponding Type 1 incremental-fit index. Consequently, because a Type 2 incremental-fit index is less biased as an estimator of its asymptotic value, it may be used as a superior estimator of the asymptotic value of the corresponding Type 1 incremental-fit index. For example, when F = χ², the corresponding Type 1 incremental-fit index (the Bentler-Bonett normed-fit index) equals

IFI1(χ²) = (χo² − χj²)/χo².

But because χ² = (N − 1)FF, where FF is the value of the maximum likelihood fit function for the model and N is the sample size, we may write

IFI1(χ²) = (FFo − FFj)/FFo,

canceling the factor (N − 1), which appears implicitly in expressions in both the numerator and denominator. On the other hand, the corresponding Type 2 incremental-fit index equals

IFI2(χ²) = (χo² − χj²)/(χo² − df),

because E(χj²) = df when the model is true. Dividing each element by (N − 1) results in

IFI2(χ²) = (FFo − FFj)/[FFo − df/(N − 1)].

But asymptotically, as N increases indefinitely, df/(N − 1) approaches zero, with the consequence that IFI2(χ²) asymptotically approaches the asymptotic value of IFI1(χ²). Similar convergence to the corresponding Type 1 incremental-fit index can be shown for the other Type 2 incremental-fit indices recommended by Marsh et al. (1988).
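A brief numerical sketch of this convergence (our own, with hypothetical fit-function values and degrees of freedom): holding the fit-function values fixed, IFI1 does not change with N, while IFI2 differs from it only by the df/(N − 1) term in the denominator and so approaches it as N grows.

def ifi1(chi2_0, chi2_j):
    return (chi2_0 - chi2_j) / chi2_0

def ifi2(chi2_0, chi2_j, df_j):
    return (chi2_0 - chi2_j) / (chi2_0 - df_j)

FF0, FFj, df_j = 4.0, 0.5, 40                 # hypothetical fit-function values and model df
for N in (100, 200, 1000, 10000):
    chi2_0, chi2_j = (N - 1) * FF0, (N - 1) * FFj
    print(N, round(ifi1(chi2_0, chi2_j), 4), round(ifi2(chi2_0, chi2_j, df_j), 4))
# IFI1 stays at 0.875; IFI2 declines from about 0.97 toward 0.875 as N increases.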

According to the empirically generated tables of Marsh et al. (1988), the correction for bias due to sample size in using the Type 2 incremental-fit index in place of the corresponding Type 1 incremental-fit index does not overcorrect the bias. On the average, for a given sample size, the Type 2 incremental-fit index is a better estimator of the asymptotic Type 1 incremental-fit index. (A mathematical proof to replace this inductive generalization is still unavailable.) But the interpretation can still be that of a Type 1 incremental-fit index. We should note, however, that in some cases, sampling fluctuations may permit some Type 2 incremental-fit indices to exceed unity by a small amount.

GFIs of LISREL

Jöreskog and Sörbom (1984) described several variants of a goodness-of-fit index (GFI) reported in output for the LISREL VI program. However, the formulas given by Jöreskog and Sörbom (1984) for these indices are not given with their rationale. It would seem enlightening to discover a rationale for these indices. The formulas for these indices are as follows: On the one hand,

GFI(ML) = 1 − tr(Σ̂⁻¹S − I)²/tr(Σ̂⁻¹S)² (3)

is to be used with models whose parameters are estimated by maximum likelihood (ML) estimation, where Σ̂ is the estimated covariance matrix for the observed variables derived from a restricted model; S is the unrestricted, sample covariance matrix (corresponding to the covariance matrix of a saturated model); and tr( ) is the trace or sum of the diagonal elements of the matrix contained within the parentheses. On the other hand, using Σ̂ and S as in Equation 3,

GFI(ULS) = 1 − tr(S − Σ̂)²/tr(S²) (4)

is to be used with models whose parameters are estimated by unrestricted least squares (ULS) estimation. Assuming that these indices were invented on the basis of a common principle, it seems that this principle was not that on which the normed-fit index in Equation 2 was based. This is most evident in the case of GFI(ML). One must note that the normed-fit index is based on lack-of-fit indices directly derived from an index of lack of fit that is minimized in the process of estimating free parameters. For example, the chi-square statistic used as the lack-of-fit index for maximum likelihood estimation is given as the sample size multiplied by the expression

F(ML) = log|Σ̂| − log|S| + tr(Σ̂⁻¹S) − k, (5)

which is the loss function to be minimized in maximum likelihood estimation, with k being the number of manifest variables in Σ̂ and S. When the restricted model covariance matrix Σ̂ equals the saturated model covariance matrix S, then log|Σ̂| and log|S| are equal, and tr(Σ̂⁻¹S) equals the trace of an identity matrix that contains k ones in its principal diagonal, with the consequence that F(ML) equals zero. In GFI(ML) we observe the numerator expression tr(Σ̂⁻¹S − I)². Although this is related to the comparison of tr(Σ̂⁻¹S) with the value k in Equation 5, it is not identical to it. The numerator of GFI(ML) is also not the lack-of-fit index of a null model, and for this reason, GFI(ML) should not be regarded as a case of Bentler and Bonett's (1980) normed-fit index.

Analogy with coefficient of determination. The GFIs seem more inspired by analogy with the concept of a coefficient of determination, an index also reported in the LISREL output and discussed by Jöreskog and Sörbom (1984). (In this respect the GFIs are analogous to Specht's [1975] generalized multiple correlation coefficient.) In general, a coefficient of determination may be expressed as

ρ² = 1 − (error variance/total variance).

In the case of GFI(ML),

tr(Σ̂⁻¹S)² = tr(Σ̂⁻¹SΣ̂⁻¹S) = tr[(Σ̂⁻¹/²SΣ̂⁻¹/²)(Σ̂⁻¹/²SΣ̂⁻¹/²)]

represents the sum of the squares of the elements of (Σ̂⁻¹/²SΣ̂⁻¹/²), the sample covariance matrix S premultiplied and postmultiplied by the "weight" matrix Σ̂⁻¹/². The (weighted) "error" in the fit of Σ̂ to S is given by the elements of the matrix [Σ̂⁻¹/²(S − Σ̂)Σ̂⁻¹/²]. The sum of the squares of this weighted error matrix is given by tr[Σ̂⁻¹/²(S − Σ̂)Σ̂⁻¹/²]² = tr(Σ̂⁻¹S − I)², the numerator of GFI(ML) given in Equation 3. Therefore, the proportion of squared (weighted) error tr[Σ̂⁻¹/²(S − Σ̂)Σ̂⁻¹/²]² in fitting Σ̂ to the (weighted) matrix Σ̂⁻¹/²SΣ̂⁻¹/² is given by the ratio tr(Σ̂⁻¹S − I)²/tr(Σ̂⁻¹S)². Subtracting this ratio from unity yields a measure of the proportion of weighted information in S that fits the weighted information in Σ̂.

This can similarly be seen in GFI(ULS). The error in an element of Σ̂ may be determined by how much it differs from the corresponding element in S. The sum of the squares of these errors is given by tr(S − Σ̂)². On the other hand, the sum of the squares of the elements of S, given by tr(S²), gives a measure of the total information to be explained by a model. The ratio tr(S − Σ̂)²/tr(S²) gives the proportion of information in S that is in error in fitting Σ̂ to S. Subtracting this ratio from unity gives the proportion of information in S that is fit by Σ̂. Evidently, the weighting of information in this method of estimation is by the matrix I as opposed to Σ̂⁻¹/².

In general, a GFI is given by

GFI = 1 − tr[W⁻¹/²(S − Σ̂)W⁻¹/²]²/tr(W⁻¹/²SW⁻¹/²)²,

where W is some weight matrix, depending on the method of estimation. For maximum likelihood, W = Σ̂; for unrestricted least squares, W = I; and for generalized least squares, W = S (Tanaka & Huba, 1985).

The theory behind the weighting matrix W for the GFIs was first given by Bentler (1983), who drew on the work of Browne (1982) and Shapiro (1983) on the theory of generalized least squares estimation in proposing a goodness-of-fit index for models estimated by generalized least squares. Bentler seems, however, not to have realized at that time that this theory is a basis for the goodness-of-fit indices of LISREL. Using this theory, Tanaka and Huba (1985) were able to derive the goodness-of-fit indices of LISREL and show that they were optimized by the estimation methods. In addition, they derived a GFI(GLS) index for generalized least squares (GLS) estimation in structural equation modeling, showing that the weight matrix W in this case is S.
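The general weighted form can be sketched as follows (our own illustration with hypothetical matrices; it uses the cyclic property of the trace to replace W⁻¹/²( )W⁻¹/² with W⁻¹( ) inside the squared traces), so that the three GFIs differ only in the choice of W:

import numpy as np

def gfi(S, Sigma_hat, W):
    # 1 - tr[W^(-1/2)(S - Sigma_hat)W^(-1/2)]^2 / tr[W^(-1/2) S W^(-1/2)]^2,
    # computed equivalently with W^(-1) because the trace is invariant under cyclic permutation.
    W_inv = np.linalg.inv(W)
    resid = W_inv @ (S - Sigma_hat)
    total = W_inv @ S
    return 1.0 - np.trace(resid @ resid) / np.trace(total @ total)

S = np.array([[1.0, 0.5], [0.5, 1.0]])             # observed covariance matrix (hypothetical)
Sigma_hat = np.array([[1.0, 0.4], [0.4, 1.0]])     # reproduced matrix from some fitted model
print(gfi(S, Sigma_hat, Sigma_hat),                # GFI(ML):  W = Sigma_hat
      gfi(S, Sigma_hat, np.eye(2)),                # GFI(ULS): W = I
      gfi(S, Sigma_hat, S))                        # GFI(GLS): W = S; all three near 1 here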


Relation to normed-fit indices. Although the normed-fit index (using the no-factor null model) of Bentler and Bonett (1980) is not based on the same rationale as the GFIs of LISREL, the normed-fit index nevertheless often generates indices similar in magnitude to those of the GFIs. Furthermore, when one bases the norm of the normed-fit index on the no-factor model, the normed-fit index, like the GFI, yields a measure of the proportion of some suitably defined total observed information fit by the model in question. However, using the no-factor null model, the normed-fit index measures the fit of the model to the off-diagonal elements of the covariance matrix, those elements representing the relations between the observed variables. One may wonder why the null model of the normed-fit index is not a null matrix. Although this is a possibility in unrestricted least squares estimation, it is not a possibility in maximum likelihood estimation. The fit function of the maximum likelihood estimation procedure is undefined when the model covariance matrix is a null matrix. The inverse of a null matrix is not defined. So, the lack-of-fit index for the null matrix as model covariance matrix is undefined in maximum likelihood estimation, and the null matrix cannot be the null model of the normed-fit index in this case. On the other hand, the GFI seeks to measure the fit of the model to the whole covariance matrix. This is possible because in computing the GFI for a model, one does not need to know the fit function value of a null matrix.

Marsh et al. (1988) showed that the GFI of LISREL in small samples does not underestimate its asymptotic value to quite the same extent as does the normed-fit index, although there is still a notable sample size effect. However, the GFI seems affected by violation of the assumptions made by the maximum likelihood estimation method on which it is based, which is seen in the considerably lower average GFI value obtained for the Students' Evaluation of Teaching Effectiveness data reported by Marsh et al. (1988), which represented a large empirical sample of subjects' responses to a teaching evaluation questionnaire. There is reason to believe that the subjects in this sample were not homogeneous for the factor model. Our recommendation is to continue to use the GFIs for the appropriate method of estimation when the conditions for that method are satisfied and when one has samples at least 200 in size.

Tanaka (1987) reported evidence that the normed-fit index varies considerably in value across maximum likelihood and generalized least squares estimations when applied to the same model and data. For example, a model whose free parameters were estimated by maximum likelihood yielded a normed-fit index of .88. When the free parameters of this model were estimated by generalized least squares using the same data, the corresponding normed-fit index was .62. On the other hand, Tanaka found that GFI(ML) and GFI(GLS) as described here yielded the single value of .89 for the same data. One would expect the GFI indices for maximum likelihood estimated models and generalized least squares estimated models to converge asymptotically as sample sizes increase, because the matrices S and Σ̂ should converge as long as the model is correctly specified. However, Tanaka (1987) did not report studies comparing the GFI(ML) and GFI(GLS) indices across possibly misspecified models. The weight matrix for generalized least squares (GLS) estimation will remain the same across all models, whereas it can vary across the models, especially misspecified ones, in the case of maximum likelihood (ML) estimation. This might produce a difference in the results. Tanaka (personal communication, June 1988) also indicated that the sample size on which this result was based was N = 112. This sample size is well within the range in which the normed-fit index is seriously underestimated. We cannot yet resolve whether his results reflect a small sample effect rather than a major discrepancy between normed-fit indices and GFIs. Studies are needed to resolve this problem. Tanaka (1987) also did not report comparisons with GFI(ULS) indices, which use the fixed weight matrix I. In any case, although one should expect high correlations between the GFI indices (Anderson & Gerbing, 1984), one should use caution in comparing the goodness of fit of models estimated by different methods.

Parsimony and the Problem of Inflated Indices

A drawback of the normed-fit indices formulated along the lines of Bentler and Bonett's (1980) index and the GFI of Jöreskog and Sörbom's (1984) LISREL program was pointed out by James et al. (1982): One can get goodness-of-fit indices approaching unity by simply freeing up more parameters in a model. This is because estimates of free parameters are obtained in such a manner as to get best fits to the observed covariance matrix conditional on the fixed parameters. So, each additional parameter freed to be estimated leaves one less constraint on the final solution, with consequently better fits of the model-reproduced covariance matrix to the sample covariance matrix. A just-identified model with as many parameters to estimate as there are independent elements of the observed variables' covariance matrix has a lack-of-fit index of zero and consequently a normed-fit index of unity. The degrees of freedom of the just-identified model are also zero. Hence James et al. (1982) suggested adjusting the normed-fit index for loss of degrees of freedom by multiplying the normed-fit index NFI(j) for Model j by the ratio of the degrees of freedom, dj, of the model to the degrees of freedom, do, of the null model. This ratio assumes values between zero and unity and was called the parsimony index of the model. The resulting index

PNFI(j) = (dj/do)NFI(j)

can be called a parsimonious normed-fit index (PNFI). The effect of this multiplication of the normed-fit index by the parsimony index is to reduce the normed-fit index to a value closer to zero. This reduction in value of NFI(j) compensates for the increase in fit of a less restricted model obtained at the expense of degrees of freedom lost in the estimation of free parameters.¹ In some ways, this index has certain affinities to the Akaike (1987) AIC lack-of-fit index, which also penalizes a model for losses in degrees of freedom resulting from estimating more parameters, when comparing models according to their lack of fit to the data. However, the PNFI is a goodness-of-fit index, whereas the AIC index is a lack-of-fit index.

¹ One may wonder whether multiplying the normed-fit index by the simple ratio (dj/do) and not by some other nonunit power of this ratio, (dj/do)^c, where c ≠ 1, provides the optimal adjustment for loss in degrees of freedom. We favor the simple ratio because each degree of freedom lost corresponds to a parameter estimated, and the ratio is simply reduced by 1/do for each degree of freedom lost, no matter how many other degrees of freedom have been lost. So, all degrees of freedom (and estimated parameters) are treated equally. However, further study of this issue is warranted.

A comparable parsimonious GFI, PGFI, can be formed from a GFI reported by the LISREL program:

PGFI(j) = (dj/do)GFI(j),

where it should be noted that do for a GFI equals k(k + 1)/2, the number of independent elements in the diagonal and off-diagonal of the covariance matrix of observed variables, rather than the number of distinct off-diagonal elements, k(k − 1)/2, as is the case with the normed-fit index.
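A minimal sketch (hypothetical values of our own) of the parsimony adjustment, using the null-model degrees of freedom just described for each index:

def pnfi(nfi_j, df_j, k):
    d0 = k * (k - 1) // 2            # null-model df for the normed-fit index (off-diagonal elements)
    return (df_j / d0) * nfi_j

def pgfi(gfi_j, df_j, k):
    d0 = k * (k + 1) // 2            # null-model df for the GFI (all nonduplicated elements)
    return (df_j / d0) * gfi_j

# A hypothetical model with k = 12 indicators, 50 degrees of freedom, NFI = .95, GFI = .96:
print(round(pnfi(0.95, 50, 12), 3), round(pgfi(0.96, 50, 12), 3))
# about 0.720 and 0.615: fit bought by estimating many parameters is discounted accordingly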

Parsimony of a Model

Parsimony in the history of science. We believe that in assessing the quality of a model, especially when comparing different models formulated for a given set of data, the goodness of fit of the model should never be taken into account without also taking into account the parsimony of the model. The value of the PNFIs and PGFIs is that they combine information about goodness of fit with information about parsimony into a single index that seeks to compensate for the artifactual increase in fit resulting from estimating more parameters. As a result, these indices may furnish information leading to inferences concerning the acceptance or rejection of a model that differ from inferences based on indices that consider goodness of fit alone.

Historically, parsimony in the formulation of theories has been advocated as a virtue in its own right, depending on no other principle. For example, the 14th-century English nominalist philosopher and theologian William of Occam formulated the parsimony principle in what is known today as Occam's razor: Entities are not to be multiplied except as may be necessary. Occam's razor came to signify that theories should be as simple as possible (Jones, 1952). But insisting on simplicity in theories may at times seem arbitrary. Kant (1781/1900) recognized Occam's razor as a regulative principle of reason impelling us to unify experience as much as possible by means of the smallest number of concepts. But Kant cautioned that the principle is not to be applied uncritically, for against it one could cite another regulative principle, that the varieties of things are not to be rashly diminished if we are to capture the individuality and distinctness of things in experience.

Toward the end of the 19th century, the German physicist and Kantian Heinrich Hertz put forth the view that our theories are not merely summary descriptions of that which is given to us in experience but are constructs or models actively imposed by us onto experience. There are many models we might construct that account for the relations among a given set of objects. Thus, to choose between competing models, we must evaluate them in terms of their logical or formal consistency, their empirical adequacy, their ability to represent more of the essential relations of the objects, and their simplicity (Janik & Toulmin, 1973). Hertz's stress on simplicity was echoed later by other influential physical scientists (cf. Poincaré, 1902/1952).

The simplicity of theories in representing experience was often cited as a fundamental principle by scientists in the 1930s and 1940s. For example, George Herbert Mead (1938) argued that one persists in acting according to a hypothesis as long as it works to solve some problem, and one abandons the hypothesis for another only if that other is simpler. Science pursues the simpler hypothesis because science has found it to be more successful to do so. But Mead's position does not elucidate why science has this success.

The quantitative psychologist L. L. Thurstone (1947), an admirer of Mead (Still, 1987), came closer to clarifying the function of parsimony when he argued that "the criterion by which a new ideal construct in science is accepted or rejected is the degree to which it facilitates the comprehension of a class of phenomena which can be thought of as examples of a single construct rather than as individualized events" (L. L. Thurstone, 1947, p. 52). He then argued that in any situation in which a rational equation is proposed as the law governing the relation between two variables, the ideal equation is one in which the number of parameters of the equation that must be estimated is considerably smaller than the number of observations to be subsumed under it. Unfortunately, he did not clarify why the number of parameters to be estimated must be fewer than the number of observations to be subsumed under the curve. Nevertheless, parsimony became a central principle in his use of the method of factor analysis, influencing his concepts of minimum rank, of the overdetermination of factors, and of simple structure. Many of Thurstone's ideas about parsimony presage principles commonly invoked in structural equation modeling.

The Austrian philosopher of science Karl Popper (1934/1961) argued that the principle of parsimony does not stand on its own but rather works in the service of a more fundamental principle, the elimination of false theories by experience. He regarded the simplicity or parsimony of a hypothesis to be essential to evaluating the merits of a hypothesis before and after it is subjected to empirical tests. "The epistemological questions which arise in connection with the concept of simplicity," he said, "can all be answered if we equate this concept with degree of falsifiability" (Popper, 1934/1961, p. 140). Popper grasped to a considerable degree the significance of how, in connection with a given set of observations, a hypothesis with few freely estimated parameters may be subjected to more tests of possible disconfirmation than a hypothesis containing numerous freely estimated parameters. His thoughts on this topic pointed the way to seeing how a degree of freedom in the test of a structural equation model corresponds to an independent condition by which the model may be disconfirmed.

An example. To see how "falsifiability" (or better, "disconfirmability"), parsimony, parameter estimation, and goodness of fit are interrelated concepts, let us see what is involved, say, in fitting a function to a set of data points, a problem considered both by Thurstone (1947, p. 52) and by Popper (1934/1961, p. 138) in connection with parsimony. Our treatment of this problem here is more extensive than theirs and makes the relations among these concepts more perspicuous than Popper's treatment. The principles to be demonstrated in this example readily generalize to structural equations modeling.

Suppose we are given five data points plotted in a two-dimensional coordinate system, and our task is to find a graphical representation of a law that corresponds to a curve that passes through these points under the assumption that they are generated according to the same law. Unfortunately, the data do not determine a unique curve that passes through these points, because an unlimited number of curves may be found that pass through these points (Hempel, 1965). And so, as Popper (1934/1961) pointed out, a problem for so-called inductive logics of discovery has always been how to choose the optimal curve that fits the points.

Frequently, the advice has been to "choose the simplest curve" that fits the points (Popper, 1934/1961, p. 138). Thus linear functions have been regarded as simpler than quadratic functions, and quadratic functions as simpler than quartic functions, and so on. However, Popper pointed out that it is not self-evident that this principle is necessarily the only or the optimal way of ordering functions according to a concept of simplicity. Furthermore, even finding the simplest curve that fits these points in no way guarantees that one has found the law by which these points were generated. The only adequate test of the curve as an inductive generalization from the data is how well it allows one to extrapolate and interpolate to new data points not used in identifying the curve but presumed to be generated by the same process. Hence Popper argued that we should not be preoccupied in these problems with just finding methods that always find curves that fit a given set of data points optimally; rather, we should be concerned with testing hypo- thetical curves, whatever their origin, against new data. Further- more, the more ways we are able to subject a curve to a test against data, and the more the curve passes these tests, the more corroboration we have for use of the curve; and such a curve is preferred. 2

Given that a researcher has a certain number of data elements that he or she may hypothesize are generated by the same func- tional process, parsimony in formulating this hypothesis con- cerns the proportion of these data elements that will be used in estimating parameters to uniquely identify this function. It is quite possible that no parameters will need to be estimated, that the parameter values are already given from other sources. This is the most parsimonious situation with respect to use of the data at hand, for it leaves all of the data available for testing of the hypothesis. But if the data are consulted to determine the values of some of the parameters, then it must be realized that the data elements used in this determination are then unavail- able for testing the model, because the estimated curve will then pass through these data elements necessarily, and one cannot speak with respect to them of a possibility of disconfirmation of the hypothesis.

For example, consider that in the example of five points, we may hypothesize that a quadratic function fits the five points. A quadratic equation is of the form y = a0 + a1x + a2x². We may pick any three of the five points and, using the values of their x and y coordinates, substitute these values into the quadratic equation to form three simultaneous equations linear in the unknown coefficients a0, a1, and a2. Solving this system of equations for a0, a1, and a2, we then identify an equation that fits the three points exactly. However, we cannot test the resulting equation against these same three points, because the curve based on the equation necessarily passes through them. It would make no sense to talk about a possible lack of fit here. But there remain two points not used in estimating the parameters of the curve against which we can now test the adequacy of the hypothesis.

If the resulting second-degree equation fails to pass through ei- ther of these two points, the hypothesis is disconfirmed. Hence each of these two remaining points corresponds to a condition by which the hypothesis may be disconfirmed, and statisticians speak of these conditions as degrees of freedom. 3
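To make the arithmetic of this example concrete, the following Python sketch (with hypothetical data values chosen only for illustration) solves the three simultaneous equations exactly for a0, a1, and a2 from three of the five points and then checks the fitted quadratic against the two points withheld as degrees of freedom.

    import numpy as np

    # Five hypothetical (x, y) points assumed to be generated by one process.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.0, 2.5, 6.0, 11.5, 19.4])  # illustrative values only

    # Use the first three points to identify y = a0 + a1*x + a2*x**2 exactly:
    # three equations in three unknowns, so no degrees of freedom remain in them.
    A = np.vander(x[:3], 3, increasing=True)   # columns: 1, x, x**2
    a0, a1, a2 = np.linalg.solve(A, y[:3])

    # The two remaining points were not used in estimation; each supplies one
    # independent condition (degree of freedom) by which the hypothesis can fail.
    predicted = a0 + a1 * x[3:] + a2 * x[3:] ** 2
    print("estimated parameters:", a0, a1, a2)
    print("residuals at the two test points:", y[3:] - predicted)

With these particular made-up values, the first withheld point lies on the fitted curve and the second does not, so one of the two potentially disconfirming conditions fails, which is exactly the situation described in the text.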

Suppose we had come to the five data points with a second- degree curve whose three parameters were already completely specified by either previous experience or pure conjecture. In this case we would not need to estimate any parameters, and so all five data points would be available for testing the curve. Here, the degrees of freedom for the test of the curve against data is equal to the number of points, five, against which the curve may be compared for lack of fit. The difference between the previous case, in which we had to estimate three parameters and thus lost three degrees of freedom, and the present case lies in the gain in degrees of freedom, because no parameters have to be estimated. In short, one can use data in two ways: One can use it to estimate parameters of functions and thereby lose it for testing goodness of fit, or one can forego using it to estimate parameters and use it for testing a prespecified hypothesis for goodness of fit.

We now see what is meant by saying, "One loses a degree of freedom for each parameter estimated," which occurs frequently in discussions of structural equations models. We also see why lower degree polynomials seem simpler, because they require using fewer independent elements of the data for the estimation of parameters and leave more of these elements for testing the fit of the model to the data. We also see why estimating more parameters increases goodness of fit artifactually, because more components of the data are then made to fit the model. We also see why using all of the data elements to determine a curve that fits all of them perfectly is unparsimonious, because it requires a high-degree polynomial and the estimating of many parameters and leaves no data elements available for testing the empirical fit of the curve.

2 It is easy to believe that here, Popper (1934/1961) finally succumbed to the temptations of the very inductivism he sought to overturn, for he seems to argue that a hypothetical curve is better (more likely to pass tests in the future?) because it has passed more tests. But Popper resisted offering such an inductive justification for why a well-corroborated hypothesis is better. A more appropriate way to see why Popper says a well-corroborated hypothesis is better is to see that this is just what Popper means by a better theory, that he stands ready to offer no further reasons for such a definition. One might say that with such a move, Popper abandoned his avowed intention to provide a purely rational basis for doing science. But maybe there is no such thing as acting in a purely rational way, for as Wittgenstein (1953) pointed out, we always come to a point at which we run out of reasons and must say, "This is simply what we do."

3 The curve-fitting example used here is an oversimplification but makes clear the points to be made. Statisticians usually use all of the data points in estimating the free parameters of a function to be fit to the data but treat the system of simultaneous equations associated with the data points as possibly inconsistent (Schneider, Steeg, & Young, 1982). Estimates of the free parameters are obtained by minimizing some lack-of-fit function applied to all the data points, which has the effect of identifying a component of the data in some subspace of the data space (the "reproduced" data space) from which the free parameters are then uniquely determined. When the lack-of-fit function used is least squares or its variants, it can be shown that the dimensionality of the reproduced data space is (locally) equal to the number of free parameters estimated and is further, by the projection theorem (Brockwell & Davis, 1987; Deutsch, 1965; Schneider et al., 1982), orthogonal to the residual data space, which in turn has dimensionality equal to the number of data points minus the number of estimated parameters. Degrees of freedom in this case equal the dimensionality of the residual data space. Thus the free parameters are determined by a component of the data not used in assessing the lack of fit, which nevertheless is reproduced perfectly as a function of the estimated parameters.

At this point we should be able to see that the simplicity of a model depends not so much on its dimensionality but on the number of free parameters that must be estimated. In the examples we have used so far, the equation to be estimated was of lower degree than the number of points available. But the relevant case that can be extended by analogy to structural modeling is one in which the equation has more parameters than data points, that is, an equation of higher degree than the number of data points. In this case one must fix at least as many parameters as is necessary to reduce the number of free parameters to no more than the number of data points available with which to estimate them. We will preferably fix even more parameters than this, so that we will have fewer parameters to estimate than data points available and thereby be in a position to test the resulting equation against some subset of points (Mulaik, 1987). But simplicity is not gauged by the number of parameters in the equation but by the paucity of parameters that must be estimated or, inversely, by the number of degrees of freedom by which the equation may be tested.

Parsimony ratio. We should also see now why a ratio of the degrees of freedom (of the test) of a model to the total number of relevant degrees of freedom in the data (the parsimony ratio) reflects the parsimony or simplicity of the model. Only in the case of those models that estimate very few of the available parameters will this ratio approach unity. Given two models with equally high goodness-of-fit indices in connection with the same data, the one to be preferred is the one with the higher parsimony ratio, because it has been subjected to more potentially disconfirming tests. Keep in mind that good fit can come about in two ways: (a) by a hypothesis that correctly constrains parameters of the model and (b) by estimating many parameters, which necessarily contributes to good fit no matter what the data are. Consequently, the parsimony ratio reflects an upper bound to the proportion of the independent elements in the data that are relevant to the assessment of goodness of fit.
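A minimal numerical sketch of this idea follows, written in Python with made-up fit values that do not come from any model discussed in this article. It computes the Bentler-Bonett normed-fit index for two hypothetical models that fit the same data about equally well and then multiplies each by its parsimony ratio, illustrating how the more parsimonious model is preferred once the ratio is taken into account.

    def parsimony_ratio(df_model, df_null):
        # Degrees of freedom of the tested model over the degrees of freedom
        # of the relevant null model (the potentially disconfirming conditions).
        return df_model / df_null

    def normed_fit_index(f_null, f_model):
        # Normed-fit index computed from two lack-of-fit values (e.g., chi-squares).
        return (f_null - f_model) / f_null

    f_null, df_null = 1500.0, 45             # hypothetical null-model values
    nfi_a = normed_fit_index(f_null, 60.0)   # model A: 40 df, few free parameters
    nfi_b = normed_fit_index(f_null, 50.0)   # model B: 20 df, many free parameters
    pnfi_a = parsimony_ratio(40, df_null) * nfi_a
    pnfi_b = parsimony_ratio(20, df_null) * nfi_b
    print(round(nfi_a, 3), round(nfi_b, 3))    # .96 and .967: nearly indistinguishable
    print(round(pnfi_a, 3), round(pnfi_b, 3))  # .853 and .43: model A clearly preferred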

Parsimonious-fit index is not the same as a goodness-of-fit index. Some researchers have been dismayed when normed-fit indices in the high .90s drop to parsimonious normed-fit indices in the .50s. They have been reluctant to report parsimonious normed-fit indices in the .50s because they believe it suggests that something is wrong with their models. But this need not be the interpretation. The parsimonious normed-fit index is not simply a goodness-of-fit index: Rather, it is an index that seeks to combine two logically interdependent pieces of information about a model, the goodness of fit of the model and the parsimony of the model, into a single index that gives a more realistic assessment of how well the model has been subjected to tests against available data and passed those tests. Steiger (1987) suggested that goodness of fit and parsimony are just two of many dimensions of a multidimensional preference function that individual researchers may use in evaluating models. Researchers might, he suggested, consider attaching different weights to parsimony and goodness of fit. Although this may be so, it must be kept in mind that goodness of fit and parsimony are logically interdependent dimensions: Low parsimony implies high goodness of fit. To assess what is objective and not simply artifact in the goodness of fit of a model, one must consider how parsimonious the model is in its use of the data in achieving that goodness of fit. Weighting parsimony and goodness of fit equally strikes us as the only rational thing to do.

It is not inconceivable to have acceptable models with nonsignificant chi-squares, goodness-of-fit indices in the high .90s, and parsimonious-fit indices in the .50s. A nonsignificant chi-square means that a model is statistically acceptable insofar as the constraints on its parameters are consistent with aspects of the data not used in the estimation of free parameters. Goodness-of-fit indices will always be near unity when chi-square is nonsignificant and may even be near unity when chi-square is significant, indicating that the model with its constrained and estimated parameters reproduces the data very well, although statistically there is a detectable discrepancy. But reproducing the data is not the same as a test of a completely specified model. A moderate parsimonious-fit index corresponding to a high normed-fit index or goodness-of-fit index indicates that much of the good fit, that which is due principally to the estimated values of the free parameters, remains untested, unexplained (from outside the data), and in question.

The parsimonious-fit index should be especially useful when comparing models, for it simultaneously takes into account the goodness of fit of the model to data and the parsimony of the model. Thus one can clearly see the difference in quality of two models that fit the same data equally well when one of the models is far more parsimonious than the other. One can also see the difference in quality of two models that have equal parsi- mony ratios when one fits the data better than the other.

Inadequacies of Adjusted Goodness-of-Fit Index

Jöreskog and Sörbom (1984) described an adjusted goodness-of-fit index (AGFI) designed to compensate for the increase in goodness of fit of a less restricted model obtained by estimating more free parameters:

AGFI = 1 - (1 - GFI)[k(k + 1)/2d],

where GFI is the goodness-of-fit index, k is the number of manifest variables in the model, and d is the degrees of freedom of the model to which GFI applies. With GFI formulated in analogy with the coefficient of determination, AGFI is apparently formulated in analogy with the correction for bias of a squared multiple correlation coefficient (an index of determination; cf. Guilford, 1950, p. 434):

cR² = 1 - (1 - R²)[(N - 1)/(N - k - 1)],

where cR² denotes the squared multiple correlation corrected for bias; R² is the original uncorrected squared multiple correlation; N is the total number of observations; (N - 1) is the total number of potential degrees of freedom, with one degree of freedom lost in the estimation of the mean of the dependent variable in a null model of no relation (all regression coefficients are fixed equal to zero except the intercept); and (N - k - 1) is the number of degrees of freedom of the prediction model, with one degree of freedom lost for each parameter estimated of the multiple regression equation with k predictors, which has k + 1 parameters. (Guilford's [1950] statement that one degree of freedom is lost in estimating the mean of each variable is misleading.) The correction for bias of the squared multiple correlation has the defect that it can take on negative values when the number of predictor variables k is large in relation to N.

Although the AGFI uses the same information as the parsimonious-fit index of James et al. (1982), it does not use this information in a completely rational way, for the resulting AGFI, like the correction for bias of the squared multiple correlation, can take on negative values, as Jöreskog and Sörbom (1984, p. 1.40) noted. It is informative to see how the AGFI could be negative with an example: Suppose GFI = .90 in a model with 20 manifest variables and two degrees of freedom. AGFI in this case equals -9.5. A corresponding parsimonious-fit index for this case, obtained by multiplying the parsimony ratio of 2/210 (formulated in relation to a null model that seeks to account for all of the information in the covariance matrix of the manifest variables, which contains 210 distinct elements) by GFI, would equal .00857. Furthermore, for a just-identified or saturated model with zero degrees of freedom and GFI equal to 1.00, AGFI is undefined. But the corresponding parsimonious-fit index would equal zero. On the other hand, the AGFI is not very sensitive to losses in degrees of freedom for models with moderately high degrees of freedom. For example, with GFI = .90, k = 20, and d = 150, AGFI = .86, a reduction of only .04. But 60, or 28.5%, of the 210 potential degrees of freedom have been lost in going to 150 degrees of freedom. A corresponding parsimonious-fit index, assuming a null model with 210 degrees of freedom, would equal (150/210)(.90) = .643. Thus the AGFI index does not have the rational norm of a meaningful zero point as does the parsimonious-fit index of James et al. (1982). A negative AGFI may be diagnostic of a poor model (as was suggested by one reviewer of this article), but because zero and negative values have no rationale in the formulation of the AGFI, it is difficult to know what further interpretation to give to them.
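The following Python sketch simply re-derives the numbers in this comparison from the formulas above; the GFI of .90 and the 20 manifest variables are the same illustrative values used in the text, not output from a fitted model.

    def agfi(gfi, k, d):
        # Joreskog-Sorbom adjustment; can become negative when d is very small.
        return 1.0 - (1.0 - gfi) * (k * (k + 1) / 2.0) / d

    def pgfi(gfi, d, df_null):
        # Parsimonious GFI: GFI weighted by the parsimony ratio d / df_null.
        return (d / df_null) * gfi

    k = 20
    df_null = k * (k + 1) // 2           # 210 distinct elements in the covariance matrix
    print(agfi(0.90, k, 2))              # -9.5: no meaningful zero point
    print(pgfi(0.90, 2, df_null))        # about .0086
    print(agfi(0.90, k, 150))            # .86: barely penalizes the 60 lost df
    print(pgfi(0.90, 150, df_null))      # about .643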

Computational Formulas for Use With LISREL Output

Because current versions of the LISREL program report sev- eral goodness-of-fit indices, including the AGFI, it may be help- ful for the researcher to be able to convert these indices into a parsimonious-fit index along the lines of that of James et al. (1982).

When the aim is to account for all of the information in the variance-covariance matrix for the observed variables, then a parsimonious GFI is given by

PGFI(1) = [2d/k(k + 1)]GFI,

where d is the degrees of freedom of the tested model and k the number of observed variables in the model, with GFI being the goodness-of-fit index computed by LISREL (see Jöreskog & Sörbom, 1984, p. 1.40).

When the aim of one's model is to account for just the relationships between the observed variables and hence only the covariances between the observed variables, the potential degrees of freedom of a null model with a covariance matrix among the observed variables equal to a diagonal matrix with free diagonal parameters is equal to k(k - 1)/2, and we should compute

PNFI2 = {2d/[k(k - 1)]}[(F0 - Fj)/(F0 - d)],

where PNFI2 is the Type 2 parsimonious normed-fit index; d is the degrees of freedom of the model being tested; k is the number of observed variables; Fj is the lack-of-fit index for the model being tested (chi-square for maximum likelihood or the sum of the squared residuals for unrestricted least squares estimation); and F0 is the lack-of-fit index for the null model whose covariance matrix is hypothesized to be a diagonal matrix with free diagonal elements (chi-square or sum of squared residuals, depending on method of estimation).4 In this formula, obtaining F0 may present the most problems, especially when maximum likelihood estimation is used. In this case it is recommended that one simply test a model in which the covariance matrix for the observed variables is hypothesized to be a diagonal covariance matrix with free diagonal variance parameters and let F0 be the chi-square of the test of fit of the model to the sample covariance matrix. In the case of unrestricted least squares estimation,

Fj = RMR²[k(k + 1)/2] = tr(S - Σ̂)²

F0 = {RMR²[k(k + 1)/2]/[1 - GFI(ULS)]} - V = tr[S - diag(S)]²,

where RMR is the root-mean-square residual reported in the LISREL output (see Jöreskog & Sörbom, 1984, p. 1.41), k is the number of observed variables, GFI(ULS) is the goodness-of-fit index reported in the LISREL output, and V is the sum of squared sample variances. These formulas simply take advantage of data provided in the LISREL output to obtain the necessary sums of squared residuals in these indices.
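As a sketch of how these computational formulas can be applied, the Python fragment below defines PNFI2 and the ULS recovery of Fj and F0; the function names are ours, and the arguments are placeholders for values a researcher would read from LISREL output. The final line uses the chi-squares and degrees of freedom reported later in Table 1 for the correct model (with k = 28 observed variables, as implied by the null model's 378 degrees of freedom) and reproduces the tabled PNFI2 of .897.

    def pnfi2(d, k, f_null, f_model):
        # Type 2 parsimonious normed-fit index relative to a null model with
        # only free diagonal variances, which has k(k - 1)/2 degrees of freedom.
        return (2.0 * d / (k * (k - 1))) * (f_null - f_model) / (f_null - d)

    def uls_fit_values(rmr, gfi_uls, k, sum_sq_variances):
        # Recover the ULS sums of squared residuals from the reported RMR and GFI.
        f_model = rmr ** 2 * k * (k + 1) / 2.0                  # tr(S - Sigma-hat)^2
        f_null = f_model / (1.0 - gfi_uls) - sum_sq_variances   # tr[S - diag(S)]^2
        return f_model, f_null

    # ULS case (values read from LISREL output): f_j, f_0 = uls_fit_values(rmr, gfi_uls, k, V)
    # Maximum likelihood case, using the Table 1 values for the correct model:
    print(round(pnfi2(d=341, k=28, f_null=6841.05, f_model=377.49), 3))  # .897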

Relative Normed-Fit Indices

The various goodness-of-fit indices described up to now assess the fit of the full structural model's reproduced covariance matrix to the actual observed covariance matrix for the manifest variables of the model. But a little-recognized drawback of these goodness-of-fit indices is that they are usually heavily influenced by the goodness of fit of the measurement model portion of the overall model and only reflect, to a much lesser degree, the goodness of fit of the causal model portion of the overall model. It is quite possible to have a model in which the measurement model portion involving relations of the latent variables to manifest indicator variables is correctly specified but in which the causal model portion involving structural relations among the latent variables is misspecified and to still have a goodness-of-fit index for the overall model in the high .80s and .90s.

4 In the case of ULS estimation, d in this equation should be replaced by E[tr(S - Σ̂)² | the model for Σ is true]. Unfortunately, an expression for this term is not now available in a readily usable form, although provisional analysis suggests that it is equal to the sum of the variances of the respective elements of the sample variance-covariance matrix, with each of these variances converging asymptotically to zero as sample size increases indefinitely. It is recommended that one simply set d to zero in this case, realizing that the resulting parsimonious normed-fit index will likely be underestimated on average in small samples.

Figure 1. Model by which artificial data set was generated, that is, the "correct model."

To illustrate, suppose we have some data generated by the model whose path diagram is given in Figure 1. Suppose further that a researcher hypothesizes a model according to the model given in Figure 2. Notice that the researcher has correctly speci- fied the measurement submodel (involving relations between manifest and latent variables) by specifying correctly the num- ber of latent variables for the model and the relations of these latent variables to the manifest variables of the model. But no- tice also that the researcher has incorrectly specified the causal relations between the latent variables. One would hope that with the structural submodel of the relations between the latent

variables of central theoretical concern, the traditional goodness-of-fit indices would be highly sensitive to the misspecification of the structural submodel as given in connection with Figure 2. But they are not. We used Monte Carlo methods to generate a sample from a multivariate normal distribution whose population covariance matrix was determined by a model consistent with the model in Figure 1 and then tested the model in Figure 2 against these data. We obtained a Type 2 adjusted normed-fit index of .932 for the fit of the model in Figure 2 to the data. (The chi-square statistic for the fit of the model in Figure 2 indicated a significant lack of fit, but our point concerns interpretation of a high goodness-of-fit index.) This index is quite high and would be accepted by many researchers as a very promising fit. It does not differ very much from the Type 2 adjusted normed-fit index of .994 that was obtained when the correct model in Figure 1 was applied against the data (whose chi-square was not significant).


Figure 2. Misspecified model applied to artificial data.

This disparity between the influence of the fits of the mea- surement and structural submodels on the goodness-of-fit index for the overall structural model usually arises because, in pursu- ing the goal of parsimony, the researcher generates a model in which the measurement model portion of the model usually contains the bulk of the parameters of the model. With few la- tent variables and many manifest indicators for each latent vari- able, the number of parameters involving relations of manifest indicators to latent variables is much greater than the number of parameters involving relations between the few latent vari- ables. The parameters of the measurement model may then de- termine the greater portion of the covariances among the mani- fest variables, especially if the manifest indicator variables are highly reliable indicators of the latent variables.

One way to deal with this problem is with a relative goodness-of-fit index (Hertzog, in press; Lerner, Hertzog, Hooker, Hassibi, & Thomas, 1988). The aim of this index is to assess the relative fit of the structural or causal model among the latent variables independently of assessing the fit of the hypothesized relations of the indicator variables to the latent variables.

James et al. (1982), as influenced by Bentler and Bonett (1980), described a nested sequence of models to be used in assessing the fit of a model to data. These are (a) the just-identified or saturated model, (b) the measurement model (a confirmatory factor analysis model used to test the model of relations between latent variables and manifest indicators while leaving relations among the latent variables saturated), (c) the structural relations model that imposes some constraints on the relations among the latent variables, (d) the uncorrelated latent variables model, and (e) the null model of no relations between the manifest variables. This sequence of models is a covariance matrix-nested sequence. The measurement model, as a factor analysis model, does not specify causal relations among the latent variables but corresponds to, and fits the data exactly as well as, a model in which the causal relations among the latent variables are fully saturated.

Let Fu be the lack-of-fit index (chi-square) for the model of uncorrelated latent variables. We use this model as the null model for construction of a normed-fit index for the structural model. Let Fm be the lack-of-fit index (chi-square) for the con- firmatory factor analysis model used to test the measurement model. Let Fj be the lack-of-fit index for the structural relations model of interest. Now define

RNFI(j) = (Fu - Fj)/[Fu - Fm - (dj - dm)]

as the Type 2 adjusted relative normed-fit index for the structural model of causal relations among the latent variables of the full structural equation model, which contains a correction for bias according to principles given by Marsh et al. (1988). Here, the norm for the normed-fit index is the difference in lack of fit between the uncorrelated latent variables model and the measurement model. A corresponding relative parsimony ratio for the causal model would be given by

RP(j) = [dj - dm]/[du - dm],

where dj is the degrees of freedom of the structural equation model, dm is the degrees of freedom of the confirmatory factor analysis measurement model, and du is the degrees of freedom of the uncorrelated latent variables model. When comparing the fit of different causal models defined on the same latent variables, one would multiply RP(j) by RNFI(j) to get a relative parsimonious-fit index appropriate for assessing how well and to what degree the models explain, from outside the data, all possible relations among the latent variables.

For the artificial data generated according to the model in Figure 1, we obtained chi-square, the normed-fit index, the parsimonious normed-fit index, the Type 2 adjusted normed-fit index, the Type 2 parsimonious normed-fit index, the LISREL GFI, the parsimonious GFI, the LISREL adjusted GFI, and the Akaike (1987) AIC for each of the following models when applied to the data: (a) the null model, (b) the uncorrelated factors model, (c) the misspecified model (in Figure 2), (d) the correct model (in Figure 1), (e) the measurement model, and (f) the saturated model. The indices for these models are shown in Table 1. It is interesting to see that the normed-fit index and the LISREL GFI are quite comparable. However, the GFI of .281 for the null model reflects the fact that for this index, one has already accounted for a portion of the model by providing estimates of the variances, which are not relevant to the normed-fit index. It is also interesting to note that of the various models, the correct model had the highest parsimonious-fit indices. Although the normed-fit indices for the correct model and the measurement model are almost identical and, in fact, higher for the measurement model, the increase in fit at the expense of loss in degrees of freedom (in comparison with those of the correct model) slightly degrades the quality of the measurement model. This is reflected in the higher parsimonious normed-fit index for the correct model. It is also interesting that the Akaike (1987) AIC index reached its smallest value of 497.10 for the measurement model rather than for the more constrained correct model. Using the data in Table 1, we also computed the Type 2 relative

normed-fit index for the correct model M1 to be RNFI2(1) = (2229.42 - 377.49)/[2229.42 - 343.10 - (341 - 329)] = .988. On the other hand, the relative normed-fit index for the misspecified model was given by RNFI2(2) = (2229.42 - 781.58)/[2229.42 - 343.10 - (344 - 329)] = .774. Here we see that the relative normed-fit index magnifies the difference in the fit of the causal model portions of the two structural models far better than does the ordinary normed-fit index.
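A short Python sketch of these two computations follows, using the chi-squares and degrees of freedom from Table 1; the function names are ours, and the comments simply restate which model each value comes from.

    def rnfi(f_u, f_j, f_m, d_j, d_m):
        # Type 2 adjusted relative normed-fit index for the causal (latent) submodel.
        return (f_u - f_j) / (f_u - f_m - (d_j - d_m))

    def relative_parsimony(d_j, d_m, d_u):
        # Share of the latent-level degrees of freedom retained by model j.
        return (d_j - d_m) / (d_u - d_m)

    f_u, d_u = 2229.42, 350   # uncorrelated latent variables model
    f_m, d_m = 343.10, 329    # measurement (confirmatory factor analysis) model
    print(round(rnfi(f_u, 377.49, f_m, 341, d_m), 3))   # correct model: .988
    print(round(rnfi(f_u, 781.58, f_m, 344, d_m), 3))   # misspecified model: .774
    print(round(relative_parsimony(341, d_m, d_u), 3))  # correct model keeps 12 of 21 latent-level df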

Limitations of Goodness-of-Fit Indices

Goodness-of-fit indices indicate how well a model fits data, even when, statistically, it does not do so perfectly. Many models in science are useful because they fit data well, even though it is known that the fit is not perfect. For example, the idealized models of Newtonian mechanics, involving isolated bodies moving in perfect vacuums or oscillating springs free of internal friction, are regarded as useful, approximate descriptions of physical phenomena, even though careful measurements will reveal that they do not perfectly fit the everyday data to which they are usually applied (Giere, 1985). Psychological theories should be similarly regarded as useful when they fit data well although not perfectly. Goodness-of-fit indices serve to indicate such degrees of fit and reinforce researchers for their efforts when the indices approach unity in value.

However, researchers should realize that goodness-of-fit indi- ces do not assess all aspects of a model's appropriateness for data. Specifically, goodness-of-fit statistics assess directly the vi- ability of overidentifying restrictions in both the structural and measurement portions of a latent variable model that evolve from fixing or constraining parameters. However, hypotheses regarding structural coefficients that are predicted to be non- zero in the population but are estimated as free parameters in the model are not directly assessed by goodness-of-fit indices. One can obtain a high goodness-of-fit index value for a model in which certain structural coefficients hypothesized to be non- zero but treated as free parameters turn out to have estimated values of zero. The results contradict one's hypothesis, but the index alone does not indicate this. Thus goodness-of-fit indices should only be used conditionally on a significant chi-square for the appropriate null model (that is, if one rejects the hypothesis that all structural coefficients are simultaneously equal to zero) and on the significance of tests of individual parameters of spe- cial salience to a model.

But it is also important to realize that tests of the fit or the lack of fit of a model do not depend on the validity of the model alone (Garrison, 1986; James et al., 1982). This is because most research hypotheses are stated in the following way: If certain foundational theories T1, . . . , Tp are true and background conditions C1, C2, . . . , Ck are the case and Model X is true, then consequence O should be observed. Now, if consequence O is not observed, this may mean that Model X is false, but it logically can also mean that any number of the foundational theories T1, . . . , Tp or background conditions C1, . . . , Ck are false while Model X is true. The test of the model cannot logically isolate where the failure to confirm it comes from. On the other hand, if consequence O is observed, this is no guarantee that Model X is true, for it is logically possible that the reason O is observed is because some other model under other background


Table 1
Chi-Squares, Degrees of Freedom, NFI, PNFI, NFI2, PNFI2, GFI, PGFI, AGFI, and AIC for Models Tested Against Artificial Example

Model  Description            χ²       df   NFI    PNFI  NFI2   PNFI2  GFI    PGFI  AGFI   AIC
M0     Null model             6841.05  378  .000   .000  .000   .000   .281   .262  .228   6897.05
Mu     Uncorrelated factors   2229.42  350  .674   .624  .710   .658   .775   .668  .739   2341.42
Mc     Misspecified model      781.58  344  .886   .806  .933   .849   .911   .772  .895    905.58
M1     Correct model           377.49  341  .945   .852  .994   .897   .949   .797  .939    507.49
Mm     Measurement model       343.10  329  .950   .827  .998   .868   .954   .773  .943    497.10
Ms     Saturated model           0.00    0  1.000  .000  1.000  .000   1.000  .000  Div/0   812.00

Note. NFI = normed-fit index; PNFI = parsimonious normed-fit index; NFI2 = Type 2 adjusted normed-fit index; PNFI2 = Type 2 parsimonious normed-fit index; GFI = LISREL goodness-of-fit index; PGFI = parsimonious goodness-of-fit index; AGFI = LISREL adjusted goodness-of-fit index; AIC = Akaike (1987) information criterion.

theories and conditions is the case. Such logical indeterminacies in the use of experience to confirm or disconfirm hypotheses are dealt with pragmatically by most researchers by embedding their hypotheses in specific theoretical frameworks that they are more or less strongly committed to treat as true (perhaps for good formal as well as empirical reasons) and by seeking in their experimental and observational techniques to assure themselves that the appropriate background conditions are reasonably satisfied. Their decisions to confirm or disconfirm their hypotheses are then made conditional on their assumptions, which may be modified with subsequent thought and experience (Aune, 1970).

We have mentioned that the testing of models generally de- pends on reasonably establishing that certain background con- ditions are the case. Discussions of these background conditions as they apply in structural equations modeling are succinctly given in James et al. (1982) and Mulaik (1986, 1987). When performing tests of the fit of a model, it is assumed for the purposes of eliminating the ambiguity of the test that these background assumptions are met. The test is not regarded as a test of these assumptions but of the model. It is possible to test these background assumptions in separate studies, but tests of these background assumptions themselves will depend on the satisfaction of other assumptions not assessed by these tests. Consequently, in any research activity there is always an ele- ment of faith regarding the reasonableness and appropriateness of one's assumptions. The researcher can only proceed on the basis of his or her assumptions, knowing that whatever conclu- sions are drawn from research are only provisional and at risk of being either rejected by others who do not share these as- sumptions or overturned by the results of future research that shows these assumptions to be untenable.

Conclusion

Goodness-of-fit indices are often used to supplement chi-square tests of lack of fit in evaluating the acceptability of structural equation and other models. A high goodness-of-fit index may be an encouraging sign that a model is useful even when it fails to fit exactly on statistical grounds. However, a major limitation of most goodness-of-fit indices now in current use is that index values near unity can give the false impression that much is explained by the constraints on the parameters of the model when in fact the high degree of fit results from freeing most of the parameters so that they can be estimated from the data. In principle, one can get a goodness-of-fit index value of unity by estimating as many parameters in the model as there are independent elements potentially available in the data. Such a model explains nothing, for it has nothing in it from outside the data. Furthermore, nothing has been confirmed about the model. A way to compensate for high goodness-of-fit index values obtained at the expense of loss of degrees of freedom is to multiply the index by the parsimony ratio, which in general is the ratio of the degrees of freedom in the test of a model to the total number of potentially relevant degrees of freedom available in the data. The resulting product is called a parsimonious-fit index and is best interpreted as indicating roughly the proportion of the independent elements of the data determined by the hypothesized constraints of the model. However, assessing the acceptability of a model depends on more than considering the parameter constraints used to specify the model. Poor fits may be obtained, not because one's specification of parameter constraints is wrong, but because one's assumption about how the data are distributed probabilistically is incorrect, leading one, especially in the case of maximum likelihood estimation, to obtain poor estimates of the free parameters. One may also have errors in the data or violate any number of other background assumptions required by one's model (James et al., 1982; Mulaik, 1986, 1987). Violations of these assumptions may have any number of unknown effects on the parsimonious-fit index and must be taken into account when evaluating the model.

Traditional goodness-of-fit indices also are unduly influenced by the good fits in the measurement portions of the model and can yield values in the .90s even when the structural relations among the latent variables of the model are seriously misspecified. We report a new index, the relative normed-fit index, formulated independently by us and by Hertzog (in press), that allows one to assess the fit of the causal model concerning just the relations between the latent variables of a structural equation model. The principles on which this new index is based allow for the formulation of other indices to magnify differences in degrees of fit in connection with specific aspects of a model.

Traditional goodness-of-fit indices also will yield high values when structural parameters reflecting hypothesized causal paths are left free to be estimated and turn out empirically to have near-zero values, which may correspond to zero population values. Such misspecifications in a model must be tested by means other than the traditional goodness-of-fit index.

References

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.

Anderson, J. C., & Gerbing, D. W. (1984). The effect of sampling error on convergence, improper solutions, and goodness-of-fit indices for maximum likelihood confirmatory factor analysis. Psychometrika, 49, 155-173.

Aune, B. (1970). Rationalism, empiricism, and pragmatism. New York: Random House.

Bentler, P. (1983). Some contributions to efficient statistics in structural models: Specification and estimation of moment structures. Psychometrika, 48, 493-517.

Bentler, P., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.

Bock, R. D., & Haggard, E. A. (1968). The use of multivariate analysis in behavioral research. In Dean K. Whitla (Ed.), Handbook of measurement and assessment in the behavioral sciences (pp. 100-142). Reading, MA: Addison-Wesley.

Brockwell, P. J., & Davis, R. A. (1987). Time series: Theory and methods. New York: Springer-Verlag.

Browne, M. W. (1982). Covariance structures. In D. M. Hawkins (Ed.), Topics in applied multivariate analysis (pp. 72-141). London: Cam- bridge University Press.

Cudeck, R., & Browne, M. W. (1983). Cross-validation of covariance structures. Multivariate Behavioral Research, 18, 147-167.

Deutsch, R. (1965). Estimation theory. Englewood Cliffs, NJ: Prentice-Hall.

Garrison, J. W. (1986). Some principles of postpositivistic philosophy of science. Educational Researcher, 15, 12-18.

Giere, R. N. (1985). Constructive realism. In P. M. Churchland & C. A. Hooker (Eds. ), Images of science. Chicago: University of Chicago Press.

Guilford, J. P. (1950). Fundamental statistics in psychology and educa- tion (2nd ed.). New York: McGraw-Hill.

Hempel, C. G. (1965). Aspects of scientific explanation. New York: Free Press.

Hertzog, C. (in press). On the utility of structural equation models in developmental research. In P. B. Baltes, D. L. Featherman, & R. M. Lerner (Eds.), Life-span development and behavior (Vol. 9). Hillsdale, NJ: Erlbaum.

James, L. R., Mulaik, S. A., & Brett, J. (1982). Causal analysis: Models, assumptions and data. Beverly Hills, CA: Sage.

Janik, A., & Toulmin, S. (1973). Wittgenstein's Vienna. New York: Simon and Schuster.

Jones, W. T. (1952). A history of western philosophy. New York: Har- court, Brace.

Jöreskog, K. G., & Sörbom, D. (1984). LISREL VI. Mooresville, IN: Scientific Software, Inc.

Kabe, D. G. (1963). Stepwise multivariate linear regression. Journal of the American Statistical Association, 58, 770-773.

Kant, I. (1900). Critique of pure reason (J. M. D. Meiklejohn, Trans.). New York: Wiley. (Original work published 1781 )

Lerner, J. V., Hertzog, C., Hooker, K. A., Hassibi, M., & Thomas, A. (1988). A longitudinal study of negative emotional states and adjustment from early childhood through adolescence. Child Development, 59, 356-366.

Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indices in confirmatory factor analysis: The effect of sample size. Psychological Bulletin, 103, 391-410.

Mead, G. H. (1938). The philosophy of the act. Chicago: University of Chicago Press.

Mulaik, S. A. (1972). The foundations of factor analysis. New York: McGraw-Hill.

Mulaik, S. A. (1986). Toward a synthesis of deterministic and probabi- listic formulations of causal relations by the functional relation con- cept. Philosophy of Science, 53, 313-332.

Mulaik, S. A. (1987). Toward a conception of causality applicable to experimentation and causal modeling. Child Development, 58, 18-32.

Poincaré, H. (1952). Science and hypothesis (W. J. Greenstreet, Trans.). New York: Dover. (Original work published 1902)

Popper, K. R. (1961). The logic of scientific discovery (translated and revised by the author). New York: Science Editions. (Original work published 1934)

Roy, S. N. (1958). Step-down procedure in multivariate analysis. Annals of Mathematical Statistics, 29, 1177-1187.

Roy, S. N., & Bargmann, R. E. (1958). Tests of multiple independence and the associated confidence bounds. Annals of Mathematical Sta- tistics, 29, 491-503.

Schneider, D. M., Steeg, M., & Young, F. H. (1982). Linear algebra: A concrete introduction. New York: Macmillan.

Shapiro, A. (1983). Asymptotic distribution theory in the analysis of covariance structures (a unified approach). South African Statistical Journal, 17, 33-81.

Sobel, M. E., & Bohrnstedt, G. W. (1985). Use of null models in evaluating the fit of covariance structure models. In N. B. Tuma (Ed.), Sociological methodology. San Francisco: Jossey-Bass.

Specht, D. A. (1975). On the evaluation of causal models. Social Sci- ence Research, 4, 113-133.

Specht, D. A., & Warren, R. D. (1976). Comparing causal models. In D. R. Heise (Ed.), Sociological methodology. San Francisco, CA: Jossey-Bass.

Steiger, J. H. (1987, October). R.M.S. confidence intervals for goodness of fit in the analysis of covariance structures. Paper presented to the annual meeting of the Society for Multivariate Experimental Psychol- ogy, Vancouver, British Columbia, Canada.

Steiger, J. H., Shapiro, A., & Browne, M. W. (1985). On the multivariate asymptotic distribution of sequential chi-square statistics. Psychometrika, 50, 253-263.

Still, A. (1987). L. L. Thurstone: A new assessment. British Journal of Mathematical and Statistical Psychology, 40, 101-108.

Tanaka, J. S. (1987). "How big is big enough?": Sample size and goodness of fit in structural equation models with latent variables. Child Development, 58, 134-146.

Tanaka, J. S., & Huba, G. J. (1985). A fit index for covariance structure models under arbitrary GLS estimation. British Journal of Mathematical and Statistical Psychology, 38, 197-201.

Thurstone, L. L. (1947). Multiple factor analysis. Chicago: University of Chicago Press.

Wheaton, B. (1988). Assessment of fit in overidentified models. In J. S. Long (Ed.), Common problems/Proper solutions (pp. 193-225). Beverly Hills, CA: Sage.

Wittgenstein, L. (1953). Philosophical investigations (G. E. M. Anscombe, Trans.). New York: Macmillan.

Received August 18, 1987
Revision received April 26, 1988
Accepted August 19, 1988