
APPLIED MEASUREMENT IN EDUCATION, 24: 1–21, 2011
Copyright © Taylor & Francis Group, LLC
ISSN: 0895-7347 print / 1532-4818 online
DOI: 10.1080/08957347.2011.532417

Generalizability Theory and Classical Test Theory

Robert L. Brennan
Center for Advanced Studies in Measurement and Assessment

University of Iowa

Broadly conceived, reliability involves quantifying the consistencies and inconsistencies in observed scores. Generalizability theory, or G theory, is particularly well suited to addressing such matters in that it enables an investigator to quantify and distinguish the sources of inconsistencies in observed scores that arise, or could arise, over replications of a measurement procedure. Classical test theory is an historical predecessor to G theory and, as such, it is sometimes called a parent of G theory. Important characteristics of both theories are considered in this article, but primary emphasis is placed on G theory. In addition, the two theories are briefly compared with item response theory.

The pursuit of scientific endeavors necessitates careful attention to measurement procedures, the purpose of which is to acquire information about certain attributes or characteristics of objects. The data obtained from any measurement procedure include errors, however, since the measurements may vary depending on numerous conditions of measurement. From this perspective on measurement, "error" does not mean mistake in the conventional sense, and what constitutes error in scores from a measurement procedure is, in part, a matter of definition. It is one thing to say that error is an inherent aspect of a measurement procedure; it is quite another thing to quantify error and specify which conditions of

An earlier version of this paper was presented at the 2008 annual meeting of the American Educational Research Association. The paper was one of two presented in a symposium sponsored by the Buros Center for Testing, the sponsor of this journal. The other paper enumerated the benefits of item response theory. We hope to be able to present this item response theory paper in a future issue of the journal.

Correspondence should be addressed to Robert L. Brennan, E. F. Lindquist Chair in Measurement and Testing and Director, Center for Advanced Studies in Measurement and Assessment (CASMA), 210D Lindquist, University of Iowa, Iowa City, IA 52242. E-mail: [email protected]


measurement contribute to it. Doing so necessitates specifying what would constitute an "ideal" measurement (i.e., over what conditions of measurement is generalization intended) and the conditions under which observed scores are obtained.

These and other measurement issues are of concern in virtually all areas of science. Different fields may emphasize different issues, different objects, different characteristics of objects, and even different ways of addressing measurement issues, but the issues themselves pervade scientific endeavors. In education and psychology, historically these types of issues have been subsumed under the heading of "reliability."

Broadly conceived, reliability involves quantifying the consistencies and inconsistencies in observed scores. It has been stated that "A person with one watch knows what time it is; a person with two watches is never quite sure!" This simple aphorism highlights how easily investigators can be deceived by having information from only one element in a larger set of interest.

The above discussion is closely associated with the conceptual framework of generalizability theory, or G theory, which is the principal focus of this article. G theory enables an investigator to quantify and distinguish the sources of inconsistencies in observed scores that arise, or could arise, over replications of a measurement procedure. Classical test theory (CTT) is an historical predecessor to G theory. Indeed, CTT is sometimes called a parent of G theory.

Provided next is a brief overview of CTT that serves as a bridge to the subsequent overview of G theory.¹ The focus here is on important aspects of the theories that serve to illustrate similarities and differences between them, as well as between them and other theories, particularly item response theory (IRT).

CLASSICAL TEST THEORY

To understand G theory, it is helpful to consider first some aspects of the CTT model

X = T + E, (1)

where X, T, and E are observed, true, and error score random variables, respectively. Although CTT is very useful, the simplicity of this model masks at least four important considerations.

¹For more complete overviews of CTT see Lord and Novick (1968), Feldt and Brennan (1989), and Haertel (2006). For more complete overviews of G theory see Cronbach, Gleser, Nanda, and Rajaratnam (1972) and Brennan (1992, 2001b).


First, since T and E are both unobserved variables, to use this model one must make some additional assumptions. There are at least two ways to proceed. First, one can define T as the expected value of the observed scores X, which leads to the expected value of E being zero. Second, one can define the expected value of E as zero, which leads to T being the expected value of X. Clearly, both ways of proceeding lead to the same result, but they differ with respect to what is assumed and what is a consequence of the assumptions. Whichever way one proceeds, however, once T (or E) is defined, then E (or T) is derived unambiguously. That is, the CTT model suggests that T and E are so tightly tied together that if one of them were known, the other would be entirely evident.

Second, it is important to note that in the CTT model, T is definitely not platonic or "in the eye of God" true score. Lord and Novick (1968) emphasized this over 40 years ago. More recently, Borsboom (2005, pp. 33–34) provided the following interesting example. Currently, an autopsy is required for a definitive diagnosis of Alzheimer's disease. Let C be a nominal variable that takes two values: c = 0 for absence of Alzheimer's disease based on an autopsy, and c = 1 for presence of Alzheimer's disease. This nominal variable can be viewed as a platonic true score. (We neglect the possibility of autopsy errors.) Now, suppose there is some observational test that results in a diagnosis of x = 0 if Alzheimer's is not suspected and x = 1 if Alzheimer's is suspected. If this diagnostic test (or different forms of it) is repeated, clearly the expected value (i.e., true score) will be neither 0 nor 1; hence, platonic true score C and expected-value true score T will not be the same.

Third, the form of the CTT model in Equation 1 is so clearly reminiscent of a simple linear regression equation that it is easy to think of E as nothing more than model fit error in the traditional statistical sense. Such a conception is misleading at best, if not outright wrong. The CTT model is a tautology in which all variables on the right-hand side are unobservable, and these unobservable variables have no meaning beyond the assumptions we attach to them. In particular, T does not have some status independent of the other variables in the model, which means that it is misleading to characterize E as a residual or model fit error. Part of the problem here is the multiple connotations associated with the word "model." In traditional statistical contexts, the word "model" often carries with it the connotation of a relationship between dependent and fixed (i.e., known a priori) independent variables. This notion of the word "model" clearly does not apply to the CTT model; nor does it apply to G theory.

Fourth, as mentioned above, the CTT model is a tautology. As such, it is true by definition. Its truth or falsity cannot be tested by comparing it or its results to some "objective" reality. Physical scientists tend to reserve the word "theory" for models that can be falsified. No such falsification is possible for the CTT model or for G theory. In applications of CTT what shall count as true score and what shall count as error are very much under the control of the investigator, although this


fact is frequently overlooked. In this sense "truth" and "error" are not realities to be discovered—they are investigator-specific constructions to be studied. In CTT "error" does not mean "mistake," it does not mean lack of model fit, and "truth" and "error" are defined by the investigator even if he or she does not realize it!

Reliability Coefficients and Error Variances

The canonical definition of reliability is usually taken to be that it is the squared correlation between observed and true scores, ρ²(X, T). Other expressions for reliability are given below:

ρ²(X, T) = ρ(X, X′) = σ²(T)/σ²(X) = σ²(T)/[σ²(T) + σ²(E)]. (2)

The last three expressions are typically derived by assuming that, for the indefinitely large population of examinees: (a) test forms (say X and X′) are classically parallel, which means that they have equal observed score means, variances, and covariances, and they covary equally with any other measure; (b) the covariance between errors for parallel forms is 0; and (c) the covariance between true and error scores is 0.² Several traditional estimates of reliability are motivated by the ρ(X, X′) expression for reliability. These estimates differ overtly with respect to their data collection designs, and they also differ with respect to how error is implicitly defined. For example, if reliability is estimated by computing the correlation between "parallel" forms, then the only errors that are taken into account are those attributable to form differences. By contrast, if reliability is estimated by computing a test–retest correlation, then form differences do not contribute to error variance, but occasion differences do. Clearly, these two estimates of reliability are not estimates of the same parameter, but the CTT model is not rich enough to distinguish clearly between them. These distinctions are much more evident in G theory.
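A small simulation can make this point concrete. The sketch below is ours, not the article's, and the variance values are entirely hypothetical: each observed score is a true score plus a person-by-form effect and a person-by-occasion effect, so the parallel-forms correlation (same occasion) and the test–retest correlation (same form) converge to different parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
t = rng.normal(0.0, 1.0, n)              # true scores
pf1, pf2 = rng.normal(0.0, 0.3, (2, n))  # person-by-form effects, forms 1 and 2
po1, po2 = rng.normal(0.0, 0.6, (2, n))  # person-by-occasion effects, occasions 1 and 2

x_f1_o1 = t + pf1 + po1                  # form 1, occasion 1
x_f2_o1 = t + pf2 + po1                  # form 2, same occasion: form differences are "error"
x_f1_o2 = t + pf1 + po2                  # form 1, new occasion: occasion differences are "error"

r_parallel = np.corrcoef(x_f1_o1, x_f2_o1)[0, 1]  # approx (1 + .36)/(1 + .09 + .36) = .94
r_retest = np.corrcoef(x_f1_o1, x_f1_o2)[0, 1]    # approx (1 + .09)/(1 + .09 + .36) = .75
print(round(r_parallel, 2), round(r_retest, 2))
```

Each design silently treats whatever the two measurements share (the occasion in one case, the form in the other) as part of true score, which is exactly the confounding that G theory is designed to expose.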

Other estimates of reliability are more closely linked to one or the other of the last two expressions in Equation 2, both of which make explicit reference to true score variance which, of course, is unknown. Typically, these estimates make use of the fact that the covariance between scores for classically parallel forms is true score variance, that is, σ(X, X′) = σ²(T). The best known of these coefficients is Coefficient α.

Strictly speaking, Coefficient α can be derived using a parallelism assumption that is weaker than classically parallel forms, called essentially tau-equivalent forms, which are special cases of what are called congeneric forms.

²Equivalently, for any indefinitely large subpopulation of examinees, the expected value of the errors is 0 provided examinees are not selected based on their observed scores.


Two forms are congeneric if their true scores are linearly related; further, their error variances need not be equal, and it follows that their observed score variances need not be equal. Notationally, scores for forms i and j are congeneric if

Xi = (ai + biT) + Ei and Xj = (aj + bjT) + Ej. (3)

When bi = bj we say the forms are essentially tau-equivalent. Lord and Novick (1968), Feldt and Brennan (1989), and Haertel (2006) provide extensive discussions of reliability coefficients based on these (and other) different definitions of parallelism.

Reliability coefficients seldom play a role in other areas of scientific inquiry. Why are they so prevalent in psychometrics? There are probably at least three reasons. First, psychometrics is generally viewed as beginning with Spearman's (1904) study of what we now call corrections for attenuation, which adjust observed score correlations using reliability coefficients. Corrections for attenuation are still of considerable interest in educational and psychological measurement. Second, the fact that reliability ranges between 0 and 1 is very appealing to many. Unfortunately, the appeal is deceptive in that it suggests that all of reliability can be captured in a single dimensionless number. That is not true, but the appeal persists, even though reliability coefficients are rather difficult to interpret correctly.³ Third, under the assumptions of CTT, it can be shown that the standard error of measurement (SEM) is a function of reliability. Specifically,

σ(E) = σ(X)√[1 − ρ²(X, T)], (4)

which is arguably more important than ρ²(X, T) itself.
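As a quick numerical illustration of Equation 4 (the numbers below are hypothetical, chosen only for round arithmetic):

```python
import math

sd_x = 15.0         # hypothetical observed-score standard deviation, sigma(X)
reliability = 0.91  # hypothetical reliability, rho^2(X, T)

# Equation 4: sigma(E) = sigma(X) * sqrt(1 - rho^2(X, T))
sem = sd_x * math.sqrt(1.0 - reliability)
print(sem)  # 4.5
```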

Coefficient α and Its Misunderstandings

Without question, the most popular reliability coefficient is Coefficient α, which is often called Cronbach's α, since Cronbach (1951) popularized it and derived it from several different perspectives. As valuable and useful as this coefficient may be, unfortunately it is widely misunderstood and misused, in part because it is so easy to compute.
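That computational ease is easy to demonstrate. A minimal sketch (the item-score matrix and names are ours, purely illustrative) computes α in its common variance form, α = [k/(k − 1)][1 − Σσ²(item)/σ²(total)]:

```python
import numpy as np

def coefficient_alpha(x):
    """Coefficient alpha for a persons-by-items score matrix x."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)      # variance of each item across persons
    total_var = x.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1.0)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical 0/1 item scores: 5 examinees by 4 items.
scores = np.array([[1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 1, 0, 1],
                   [1, 1, 0, 1]])
print(round(coefficient_alpha(scores), 3))
```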

One misunderstanding is the common attribution of Coefficient α to Cronbach. As Cronbach (2003) himself noted, he did not invent Coefficient α; other equivalent coefficients were reported in the literature prior to Cronbach (1951).

³One complexity is that reliability coefficients have nonlinear characteristics. That is why it is much more difficult to raise a reliability coefficient from .90 to .95 than from .50 to .55.


Indeed, derivations of one or more versions of Coefficient α (before and since 1951) might be the all-time favorite psychometric parlor game!

As noted previously, from the perspective of CTT, the derivation of Coefficient α requires that forms be essentially tau-equivalent.⁴ In the vast majority of cases, Coefficient α is computed based on item scores; that is, items play the role of forms. In most circumstances, however, it seems highly unlikely that item scores satisfy the assumption of essential tau-equivalence.

A particularly problematic misunderstanding is the frequently cited statement that "Coefficient α is a lower limit to reliability." Under a particular set of stringent assumptions, this is a mathematically correct statement (see Lord & Novick, 1968, pp. 87–88; Novick & Lewis, 1967), but these assumptions are rarely defensible in real-world situations. In most cases, it is much more likely that Coefficient α is an upper limit to reliability, as Cronbach (1951) noted over a half century ago. This misinterpretation occurs when there is a disconnect between the data used to estimate Coefficient α and the definition of reliability intended by the investigator. For example, if data are collected on a single occasion, but the investigator's notion of reliability involves generalizing to different occasions (as it usually does), then it is almost certain that error variance will be underestimated.

Although Cronbach did not invent Coefficient α, he did name it, and his choice of a name was not accidental. Consider the following quote from Cronbach (1951):

A . . . reason for the symbol is that α is one of six analogous coefficients (to be designated β, γ, δ, etc.) which deal with such other concepts as like-mindedness of persons, stability of scores, etc. (pp. 299–300)

Essentially, this quote reinforces the fact that there are many reliability coefficients for any set of test scores. Cronbach did not publish subsequent papers that specifically identified all of the other coefficients (i.e., β, γ, δ, etc.); rather, these notions got incorporated into what came to be called G theory. In short, Coefficient α is properly viewed as an historically important and often useful estimator of reliability, but α should not be deified, and it is much overused.

Lord’s SEM

There are topics that are usually included in the CTT literature that are not quite consonant with the assumptions noted above. For the purposes of this article, a particularly important example is Lord's (1955, 1957) SEM. Consider a test consisting of k dichotomously scored items.

⁴Classically parallel forms satisfy the assumptions of essential tau-equivalence, but this is not necessarily true for congeneric forms.


Lord suggested that the SEM for an examinee can be viewed as the standard error of the mean for that examinee, where each observable mean is the examinee's mean score on a random sample of k items drawn from an infinite universe of items. In terms of parameters, Lord's SEM is simply

σ(E*) = √[τp(1 − τp)/k], (5)

where τp is the true score for the examinee in the mean-score metric (i.e., proportion-correct scores).⁵ It is worth noting that Lord's SEM is not a simple function of reliability, whereas the CTT formula in Equation 4 is. Furthermore, it can be shown that the average value of σ²(E*) is greater than σ²(E) when both are on the same metric (see Brennan, 2001b, pp. 33, 160).
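A minimal sketch of the estimation formula given in footnote 5, for a single hypothetical examinee (the numbers are ours):

```python
import math

k = 40          # number of dichotomously scored items (hypothetical)
n_correct = 32  # hypothetical number of items this examinee answered correctly
mean_score = n_correct / k   # observed proportion-correct score for this examinee

# Footnote 5: estimated Lord's SEM in the mean-score metric,
# sqrt(X_p(1 - X_p)/(k - 1)), an examinee-specific standard error.
sem_lord = math.sqrt(mean_score * (1.0 - mean_score) / (k - 1))
print(round(sem_lord, 3))  # about 0.064
```

Unlike the CTT SEM in Equation 4, this value varies from examinee to examinee.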

Lord’s SEM is a kind of bridge between CTT and G theory in at least twosenses. First, Lord’s SEM uses a random sampling model to estimate errorvariance rather than CTT notions of parallelism. Second, Lord’s SEM uses awithin-person design as opposed to the across-persons design that characterizedvirtually all the reliability literature prior to the 1950s. As discussed next, G theoryreplaces CTT notions of parallelism with randomly parallel forms, and G theoryexplicitly incorporates different types of data collection designs.

UNIVARIATE GENERALIZABILITY THEORY

G theory offers an extensive conceptual framework and a powerful set of statistical procedures for addressing numerous measurement issues. Often, CTT and analysis of variance (ANOVA) are viewed as the parents of G theory.

Parents and Some History

In CTT, there is only one E term, which does not mean there is necessarily only one source of error; it does mean, however, that in a single application of CTT, all sources of error are confounded in one E term. One of the most important and simplest perspectives on the G theory model is that it disconfounds the multiple sources of error that interest an investigator, say H of them; so, in a sense, the G theory model can be viewed as

X = μp + E1 + E2 + · · · + EH, (6)

⁵The more familiar estimation formula for Lord's SEM in the mean-score metric is: σ̂(E*) = √[X̄p(1 − X̄p)/(k − 1)].


where μp is universe score, which is the G theory analogue of true score. Importantly, in G theory the investigator must decide which sources of error are of interest, which effectively defines the facets of measurement. Universe score is then defined as the expected value of observed scores over replications of the measurement procedure (see Brennan, 2001a), where each such replication involves a different random sample of conditions from each of the measurement facets.

In its essential features, the high-level model in Equation 6 is quite consistent with important aspects of the statistical framework for ANOVA. As noted by Cronbach et al. (1972), when Fisher (1925) introduced ANOVA, he

revolutionized statistical thinking with the concept of the factorial experiment in which the conditions of observations are classified in several respects. Investigators who adopt Fisher's line of thought must abandon the concept of undifferentiated error. The error formerly seen as amorphous is now attributed to multiple sources, and a suitable experiment can estimate how much variation arises from each controllable source. (p. 1)

The defining treatment of G theory is a monograph by Cronbach et al. (1972) entitled The Dependability of Behavioral Measurements. A history of the theory is provided by Brennan (1997). Brennan (2001b) provides an extensive exposition of G theory. Shavelson and Webb (1991) provide a primer. Cardinet, Johnson, and Pini (2010) provide a treatment of G theory based on a perspective that is somewhat different from that of the previously cited authors. In discussing the genesis of G theory, Cronbach (1991, pp. 391–392) states:

In 1957 I obtained funds from the National Institute of Mental Health to produce, with Gleser's collaboration, a kind of handbook of measurement theory. . . . "Since reliability has been studied thoroughly and is now understood," I suggested to the team, "let us devote our first few weeks to outlining that section of the handbook, to get a feel for the undertaking." We learned humility the hard way—the enterprise never got past that topic. Not until 1972 did the book appear . . . that exhausted our findings on reliability reinterpreted as generalizability. Even then, we did not exhaust the topic.

When we tried initially to summarize prominent, seemingly transparent, convincingly argued papers on test reliability, the messages conflicted.

To resolve these conflicts, Cronbach and his colleagues devised a rich conceptual framework and married it to analysis of random effects variance components. The net effect is "a tapestry that interweaves ideas from at least two dozen authors" (Cronbach, 1991, p. 394). In particular, the work of Burt (1936), Ebel (1951), and Lindquist (1953, chap. 16) appears to have anticipated various aspects of G theory.


Framework and Machinery

Although CTT and ANOVA can be viewed as the parents of G theory, the child is both more and less than the simple conjunction of its parents, and appreciating G theory requires an understanding of more than its lineage. For example, although G theory liberalizes CTT, not all aspects of CTT are incorporated in G theory. Also, the ANOVA issues emphasized in G theory are different from those that predominate in many experimental design and ANOVA texts. In particular, G theory concentrates on variance components and their estimation, not F tests.

Perhaps the most important aspect and unique feature of G theory is its conceptual framework. Among the concepts are universes of admissible observations and G (generalizability) studies, as well as universes of generalization and D (decision) studies. Some of the more important concepts and methods of G theory are introduced next using a hypothetical scenario.

Suppose a testing company ABC decides that it wants to begin offering a writing proficiency testing program called WPT. ABC needs to identify, or otherwise characterize, the types of essay prompts, t, that will be used and the types of raters, r. Obviously, there are other considerations, too, but we will consider only these two facets here. (A facet is simply a set of similar conditions of measurement, where the investigator decides what "similar" means.) Suppose that, in theory, responses to any prompt could be evaluated by any rater, and the number of potential prompts and raters is indefinitely large. Under these specifications, we say that both facets are infinite in the universe of admissible observations, and they are crossed, that is, t × r.

So far, no reference has been made to persons who respond to the essay prompts. In G theory the word universe is reserved for conditions of measurement (prompts and raters, here), while the word population is used for the objects of measurement (persons, here). In the population and universe of admissible observations, any observable score for a single essay prompt evaluated by a single rater can be represented as:

Xptr = μ + νp + νt + νr + νpt + νpr + νtr + νptr, (7)

where μ is the grand mean in the population and universe and ν designates any one of the seven uncorrelated effects, or components. We say that Equation 7 is the p × t × r (persons crossed with tasks crossed with raters) linear model.

Assuming that the effects in Equation 7 are uncorrelated, the variance of the observed scores is:

σ²(Xptr) = σ²(p) + σ²(t) + σ²(r) + σ²(pt) + σ²(pr) + σ²(tr) + σ²(ptr). (8)


The terms to the right of the equal sign are called random effects variance components. They can be estimated using expected mean square equations for a G study in which a sample of np persons respond to nt prompts that are evaluated by nr raters.
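A sketch of that estimation for a balanced, fully crossed p × t × r G study: the function and names below are ours, and the code assumes a complete score array; the expected-mean-square solutions it applies are the standard ones for the random-effects model.

```python
import numpy as np

def g_study_ptr(X):
    """Estimate the seven random-effects variance components for a
    crossed p x t x r G study from a complete score array X of shape
    (n_p, n_t, n_r), using mean squares and their expected values."""
    n_p, n_t, n_r = X.shape
    m = X.mean()
    m_p = X.mean(axis=(1, 2))   # person means
    m_t = X.mean(axis=(0, 2))   # prompt (task) means
    m_r = X.mean(axis=(0, 1))   # rater means
    m_pt = X.mean(axis=2)       # person-by-prompt means
    m_pr = X.mean(axis=1)       # person-by-rater means
    m_tr = X.mean(axis=0)       # prompt-by-rater means

    # Mean squares: sums of squared deviations divided by degrees of freedom.
    ms_p = n_t * n_r * np.sum((m_p - m) ** 2) / (n_p - 1)
    ms_t = n_p * n_r * np.sum((m_t - m) ** 2) / (n_t - 1)
    ms_r = n_p * n_t * np.sum((m_r - m) ** 2) / (n_r - 1)
    ms_pt = n_r * np.sum((m_pt - m_p[:, None] - m_t[None, :] + m) ** 2) / ((n_p - 1) * (n_t - 1))
    ms_pr = n_t * np.sum((m_pr - m_p[:, None] - m_r[None, :] + m) ** 2) / ((n_p - 1) * (n_r - 1))
    ms_tr = n_p * np.sum((m_tr - m_t[:, None] - m_r[None, :] + m) ** 2) / ((n_t - 1) * (n_r - 1))
    resid = (X - m_pt[:, :, None] - m_pr[:, None, :] - m_tr[None, :, :]
             + m_p[:, None, None] + m_t[None, :, None] + m_r[None, None, :] - m)
    ms_ptr = np.sum(resid ** 2) / ((n_p - 1) * (n_t - 1) * (n_r - 1))

    # Solve the expected-mean-square equations (random model); negative
    # estimates are possible in practice and are usually set to zero.
    v = {'ptr': ms_ptr}
    v['pt'] = (ms_pt - ms_ptr) / n_r
    v['pr'] = (ms_pr - ms_ptr) / n_t
    v['tr'] = (ms_tr - ms_ptr) / n_p
    v['p'] = (ms_p - ms_pt - ms_pr + ms_ptr) / (n_t * n_r)
    v['t'] = (ms_t - ms_pt - ms_tr + ms_ptr) / (n_p * n_r)
    v['r'] = (ms_r - ms_pr - ms_tr + ms_ptr) / (n_p * n_t)
    return v

# Example with synthetic data: 50 persons, 4 prompts, 2 raters.
rng = np.random.default_rng(0)
print(g_study_ptr(rng.normal(size=(50, 4, 2))))
```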

Once estimated variance components are available, they can be used to estimate universe score variance, error variances, and reliability-like coefficients for various universes of generalization and D study designs. A universe of generalization can be viewed as the universe of randomly parallel forms of WPT, where each such form uses n′t prompts and n′r raters.⁶ A D study design is the design used operationally for a form of WPT.⁷

A crucial consideration in defining a universe of generalization is answering the question, "Which facet(s) shall be considered random and which shall be considered fixed?" A facet is considered random when its conditions in the D study are a sample from those in the universe of generalization.⁸ A facet is fixed when its conditions in the D study exhaust its conditions in the universe of generalization. G theory does not specify which facets should be considered random and which should be considered fixed; that is the prerogative and the responsibility of the investigator. It should be noted, however, that fixing one or more facets generally lowers error variance and increases coefficients at the expense of narrowing interpretations.

Infinite universe of generalization and crossed D study design

Suppose that ABC decides that both prompts and raters shall be viewed as random for WPT, and the D study design will have the same crossed structure as the G study design.⁹ Then, universe score variance is

σ²(τ) = σ²(p), (9)

relative error variance is

σ²(δ) = σ²(pt)/n′t + σ²(pr)/n′r + σ²(ptr)/n′tn′r, (10)

⁶It need not be true that n′t = nt nor that n′r = nr; that is, the sample sizes used to estimate variance components need not equal the sample sizes used in an operational form of the test.

⁷D study designs can differ with respect to structure and/or sample sizes.

⁸Strictly speaking, for a random facet it is assumed that the number of conditions in the universe of generalization is indefinitely large.

⁹That is, the D study design shall be p × T × R with n′t prompts and n′r raters.


absolute error variance is

σ²(Δ) = σ²(t)/n′t + σ²(r)/n′r + σ²(tr)/n′tn′r + σ²(pt)/n′t + σ²(pr)/n′r + σ²(ptr)/n′tn′r, (11)

a generalizability coefficient is

Eρ² = σ²(τ)/[σ²(τ) + σ²(δ)], (12)

and a dependability coefficient is

Φ = σ²(τ)/[σ²(τ) + σ²(Δ)]. (13)

Equations 9–13 are expressed in terms of the mean score metric, which is the tradition in G theory; by contrast, CTT equations are almost always expressed in terms of the total score metric.
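Continuing the sketch begun above, a hypothetical helper (ours, not the article's) turns the estimated variance components into the Equation 9–13 quantities for a crossed p × T × R design with both facets random:

```python
def d_study_crossed_random(v, n_t, n_r):
    """Equations 9-13 for a crossed p x T x R D study with both the
    task and rater facets random; v holds G study variance components."""
    var_tau = v['p']                                                    # Eq. 9
    var_delta = v['pt'] / n_t + v['pr'] / n_r + v['ptr'] / (n_t * n_r)  # Eq. 10
    var_abs = (v['t'] / n_t + v['r'] / n_r + v['tr'] / (n_t * n_r)
               + var_delta)                                             # Eq. 11
    e_rho2 = var_tau / (var_tau + var_delta)                            # Eq. 12
    phi = var_tau / (var_tau + var_abs)                                 # Eq. 13
    return {'tau': var_tau, 'delta': var_delta, 'Delta': var_abs,
            'Erho2': e_rho2, 'Phi': phi}
```

Varying n_t and n_r in such a helper is the usual way D study "what if" questions about sample sizes are explored.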

Relative error variance, σ²(δ), and a generalizability coefficient, Eρ², are analogous to σ²(E) and ρ²(X, T), respectively, in CTT in that they characterize error and reliability for decisions based on comparing examinees. It is important to note, however, that except in trivial cases σ²(δ) ≠ σ²(E) and Eρ² ≠ ρ²(X, T).¹⁰

By contrast, strictly speaking, CTT has no analogue for σ²(Δ), which is the error variance for making absolute (e.g., pass–fail) decisions about examinees. If we go beyond the strict realm of CTT and consider Lord's error variance, however, there are some clear similarities—most obviously, both σ²(Δ) and Lord's error variance are derived under random sampling assumptions (see Brennan, 1997, for more details).

If an investigator performs a CTT analysis (e.g., computes Coefficient α) when there is more than one random facet, it is likely that error variance will be underestimated. Consider, for example, σ²(δ) in Equation 10, which is based on n′tn′r observations for each examinee. If Coefficient α is computed using the n′tn′r observations for each examinee, then the estimated error variance in the mean score metric will be [σ̂²(pt) + σ̂²(pr) + σ̂²(ptr)]/(n′tn′r), which is clearly smaller than the estimate of σ²(δ) based on Equation 10. This illustrates that CTT estimated error variances are generally too small when there is more than one random facet in the universe of generalization.

¹⁰The most common "trivial" case is a design and universe with a single random facet.


Different universes of generalization and D study designs

For different universes of generalization and D study designs, the expressions for Eρ² and Φ in Equations 12 and 13, respectively, still apply. Universe score variance and error variances change, however, if the universe of generalization changes. In addition, error variances change if the design changes and/or sample sizes change.¹¹

Suppose ABC decides to use the same tasks for all forms of WPT. If so, we would say that tasks are fixed in the universe of generalization, and it can be shown that universe score variance is

σ²(τ) = σ²(p) + σ²(pt)/n′t, (14)

relative error variance is

σ²(δ) = σ²(pr)/n′r + σ²(ptr)/n′tn′r, (15)

and absolute error variance is

σ²(Δ) = σ²(r)/n′r + σ²(tr)/n′tn′r + σ²(pr)/n′r + σ²(ptr)/n′tn′r. (16)

Comparing these equations with Equations 9–11, it is evident that when tasks are fixed, universe score variance increases and error variances decrease, which leads to larger coefficients. Conceptually, fixing a facet restricts the universe of generalization and, in doing so, decreases the gap between observed and universe scores at the price of narrowing interpretations.
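A companion sketch (again ours) for the fixed-task universe of Equations 14–16, reusing the variance-component dictionary produced by the earlier g_study_ptr sketch:

```python
def d_study_fixed_tasks(v, n_t, n_r):
    """Equations 14-16: tasks fixed, raters random, crossed D study."""
    var_tau = v['p'] + v['pt'] / n_t                    # Eq. 14
    var_delta = v['pr'] / n_r + v['ptr'] / (n_t * n_r)  # Eq. 15
    var_abs = (v['r'] / n_r + v['tr'] / (n_t * n_r)
               + var_delta)                             # Eq. 16
    return {'tau': var_tau, 'delta': var_delta, 'Delta': var_abs}
```

Relative to the random-facets helper, σ²(pt)/n′t has simply moved from error variance into universe score variance, which is the algebraic face of the narrowed interpretation just described.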

Suppose ABC publishes a technical manual in which it claims that WPT is a highly reliable testing program because inter-rater reliability coefficients are high. Let us consider this claim from the perspective of G theory. Suppose each inter-rater coefficient is a Pearson correlation based on the responses of examinees to a single task with each response rated by the same two raters. Even if there are multiple coefficients reported, as long as each of them is based on a single task, then task is effectively being treated as fixed, whether or not ABC realizes it.¹²

¹¹CTT deals with sample size changes through the Spearman-Brown formula (see Feldt & Brennan, 1989, and Haertel, 2006), which does not apply when there is more than one random facet. See Brennan (2001b, pp. 116–117) for an example.

¹²Averaging inter-rater coefficients does not obviate this problem; it merely masks it.


Furthermore, a correlation between two conditions or units (here, raters) is an estimate of reliability for one of them. Therefore, the inter-rater coefficients are interpretable as estimates of

Eρ² = [σ²(p) + σ²(pt)]/{σ²(p) + σ²(pt) + [σ²(pr) + σ²(ptr)]}, (17)

where σ²(δ) is enclosed in square brackets.

Compare this with Eρ² when both raters and tasks are random, and an examinee's score is the average over n′t tasks and n′r raters (see Equations 9, 10, and 12):

Eρ² = σ²(p)/{σ²(p) + [σ²(pt)/n′t + σ²(pr)/n′r + σ²(ptr)/n′tn′r]}, (18)

where σ²(δ) is enclosed in square brackets. Equation 18 almost always reflects the D study design and intended universe much better than Equation 17, but Eρ² in Equation 18 is likely to be much smaller than Eρ² in Equation 17, primarily because σ²(pt) moves from universe score variance in Equation 17 to error variance in Equation 18. This is an important matter. In most testing programs σ²(pt) is quite large, which more than offsets the decrease in error variance that results from division by sample sizes in Equation 18, especially since n′t and n′r tend to be quite small in writing assessments.

Sometimes inter-rater coefficients are reported based on a side study, but operationally each response is rated by a single rater. If so, Equations 17 and 18 still apply, but n′r = 1 in Equation 18. Importantly, however, σ²(pr) cannot be estimated unless a G study is conducted that has nr ≥ 2.

The above discussion may be somewhat challenging, but it is still oversimplified relative to what often happens in practice. In particular, the assignment of raters to prompts and/or examinees is often more complicated than implied by the design considered above. Suppose, for example, that for the operational assessment, a different set of raters will evaluate responses to each prompt or task, t. This is a verbal description of the D study p × (R:T) design, where ":" is read "nested within." For this design, if both raters and tasks are random, it can be shown that

Eρ² = σ²(p)/{σ²(p) + [σ²(pt)/n′t + σ²(pr:t)/n′tn′r]}, (19)

where σ²(pr:t) represents the confounding of σ²(pr) and σ²(ptr). This means that if the G study were conducted using the p × t × r design, then σ²(pr:t) = σ²(pr) + σ²(ptr).


It follows that σ²(pr) is divided by both n′t and n′r in Equation 19, whereas σ²(pr) is divided by n′r only in Equation 18. Therefore, when n′r > 1 and n′t > 1, σ²(δ) is smaller and Eρ² is larger for the nested design than for the crossed design. (A similar statement holds for Φ.)
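That comparison is easy to verify numerically. A short sketch with hypothetical variance-component values (ours, chosen only for illustration):

```python
# Hypothetical estimated variance components (illustrative values only).
v = {'p': 0.25, 'pt': 0.20, 'pr': 0.05, 'ptr': 0.30}
n_t, n_r = 3, 2

# Equation 18: crossed p x T x R design, both facets random.
delta_crossed = v['pt'] / n_t + v['pr'] / n_r + v['ptr'] / (n_t * n_r)
rho2_crossed = v['p'] / (v['p'] + delta_crossed)

# Equation 19: nested p x (R:T) design; sigma^2(pr:t) = sigma^2(pr) + sigma^2(ptr)
# when the components come from a crossed G study.
delta_nested = v['pt'] / n_t + (v['pr'] + v['ptr']) / (n_t * n_r)
rho2_nested = v['p'] / (v['p'] + delta_nested)

print(round(rho2_crossed, 3), round(rho2_nested, 3))  # the nested value is the larger one
```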

In brief, this hypothetical WPT scenario illustrates that:

• universe score variance gets larger and error variances get smaller if a facet shifts from being considered random to being considered fixed;
• larger D study sample sizes lead to smaller error variances; and
• nested D study designs usually lead to smaller error variances and larger coefficients.

These conclusions are entirely predictable given the rich conceptual framework of G theory.

MULTIVARIATE GENERALIZABILITY THEORY

The essential features of univariate G theory were largely completed with technical reports by the Cronbach team in 1960–1961. These were revised into three journal articles, each with a different first author (Cronbach, Rajaratnam, & Gleser, 1963; Gleser, Cronbach, & Rajaratnam, 1965; Rajaratnam, Cronbach, & Gleser, 1965). In the mid 1960s, motivated by Harinder Nanda's studies on interbattery reliability, the Cronbach team began their development of multivariate G theory, which is incorporated in their 1972 monograph, and which they regarded as the most unique aspect of G theory.¹³ Cronbach (1976) provides more historical details. The last four chapters in Brennan (2001b) provide an integrated and extended treatment of multivariate G theory.

Multivariate G theory is multivariate primarily in the sense of multiple universes of generalization and, hence, multiple universe scores for each examinee. In addition, there are corresponding multiple universes of admissible observations. Each one of the multiple universes is associated with a single fixed condition of measurement. Statistically this implies that multivariate G theory analyses involve not only variance components but also covariance components.

To continue with the WPT example, suppose each form involves both narrative and informative types of prompts. We will designate these prompt types as v1 and v2, respectively.

¹³It can be argued that stratified alpha (Cronbach, Schönemann, & McKie, 1965) is a CTT precursor to multivariate G theory.


If, for each type, the population and universe of admissible observations is fully crossed (i.e., p × t × r), then there are seven variance components for v1 and a different seven components for v2. For example, σ²₁(p) is the person variance component for v1, and σ²₂(p) is the person variance component for v2. In addition, for each pair of variance components there is the possibility of a covariance component. For this example, almost certainly persons would respond to prompts of both types, which means that the covariance component for persons, σ₁₂(p), would be non-zero. On the other hand, probably there would be different prompts (t) for v1 and v2. If so, σ₁₂(t) = σ₁₂(pt) = σ₁₂(tr) = σ₁₂(ptr) = 0. The same raters might or might not be used for the two types of prompts, which means that σ₁₂(r) and σ₁₂(pr) might or might not be zero. In short, the multivariate WPT example has seven variance–covariance matrices that replace the seven variance components for the univariate example.

Univariate D study analyses for the WPT example can be performed for v1 and v2 separately, which gives results specific to narrative and informative prompts, respectively. In addition, for the WPT example it is likely that ABC would perform analyses for one or more composite universe scores defined generally as

μpC = w1μp1 + w2μp2.

For example, if w1 + w2 = 1, then the analyses would be for weighted mean scores over both narrative and informative prompts. If w1 = 1 and w2 = −1, then the analyses would be for difference scores.
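Since μpC is simply a weighted sum, its person-level (universe score) variance follows the usual rule for variances of weighted sums, combining the two person variance components with the person covariance component. A sketch with hypothetical values (ours):

```python
# Hypothetical person-level variance and covariance components.
var1_p, var2_p, cov12_p = 0.30, 0.22, 0.18  # sigma^2_1(p), sigma^2_2(p), sigma_12(p)
w1, w2 = 0.5, 0.5                           # weights defining the composite

# Variance of the composite universe score mu_pC = w1*mu_p1 + w2*mu_p2:
# the usual variance-of-a-weighted-sum rule.
var_composite = w1**2 * var1_p + w2**2 * var2_p + 2 * w1 * w2 * cov12_p
print(var_composite)  # with w1 = 1, w2 = -1 this would be the difference-score variance
```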

This relatively simple WPT example hints at the power and flexibility of multivariate G theory. Indeed, it can be said that multivariate G theory is the whole of G theory, with univariate G theory simply being a special case. This multivariate perspective on G theory illustrates that it is essentially a random effects theory.

The reader may quarrel with this last assertion by noting that the previous discussion of univariate G theory considered a mixed model in which there was a fixed facet. True enough, but any univariate mixed model can always be reformulated as a multivariate model in which the levels of the fixed facet(s) become levels of v. Indeed, doing so provides a more flexible representation of levels of a fixed facet,¹⁴ and usually greatly simplifies estimation, especially for mixed models that have designs that are unbalanced with respect to nesting (see, for example, Brennan, 2001b, pp. 268–273).

¹⁴A mixed-model univariate analysis effectively makes a statistical "hidden" choice for the w weights for each fixed level, whereas a multivariate analysis leaves the choice of weights to the investigator.


COMPARING THEORIES

For ease of reference, in this section CTT and G theory are sometimes referred to as expected value theories to contrast them with item response theory (IRT). We begin with a comparison of CTT and G theory that includes consideration of some of the strengths and weaknesses of these expected value theories. Then these theories are briefly compared with IRT.

Expected Value Theories

CTT and G theory have a number of similarities. They are both tautologies in which terms to the right of the equal sign are unobserved, both theories define true (or universe) score as an expected value of observed scores, both theories explicitly incorporate random errors of measurement, and both theories have well-defined (and similar) notions of reliability (or generalizability).

It has been said by Cronbach et al. (1972) and by Brennan (2001b) that G theory "liberalizes" CTT. This is true in several senses. First, G theory permits disentangling the multiple sources of error that are confounded in the single E term of CTT. Second, G theory has a much richer conceptual framework than CTT, which leads to resolutions of a number of apparent contradictions in various CTT discussions of reliability. The two most important characteristics of G theory that facilitate resolving contradictions are: (a) G theory's distinction between fixed and random measurement facets and (b) G theory's capability of dealing with different D study designs. Third, multivariate G theory expands reliability considerations to multiple universes of generalization, which have no corresponding status in CTT. Fourth, as noted by Cronbach et al. (1972) and Brennan (2001b), G theory blurs distinctions between reliability and validity. Kane (1982), for example, provides a particularly prescient discussion of the reliability–validity paradox from the perspective of G theory.

To say that G theory liberalizes CTT does not mean, however, that all of CTT is subsumed under G theory or that CTT can or should be completely replaced by G theory. There are still some important differences between the two theories that more than justify retaining both. Perhaps the most obvious difference is in definitions of parallelism. G theory incorporates a single notion of parallelism, namely, the notion of randomly parallel forms. This is quite different from the notion of classically parallel forms in CTT. Both types of parallelism are idealized and not ever likely to be strictly true, although one or the other may be more sensible in particular contexts. Furthermore, CTT has several well-developed, useful definitions of parallelism that are weaker than classically parallel forms (in particular, essentially tau-equivalent forms and congeneric forms), whereas G theory has no role, as yet, for different types of parallelism.

In considering models, it often seems that what is a strength from one perspective is a weakness or limitation from another perspective. For example,


one of the strengths of CTT is that it is based on the very simple model X = T + E, but the simplicity of the model is also a weakness in that it does not permit us to disentangle the multiple sources of error in E. By contrast, the capability of disentangling error sources is an important strength of G theory, but that strength is purchased at the price of conceptual complexity.

The complexity of G theory is often a stumbling block for those who seek to find simple answers to measurement questions. In reality, however, most thoughtful consideration of such questions requires grappling with conceptual matters that are often complex and not easily addressable with a template. An important strength of G theory is that it is rich enough to guide investigators through such measurement mazes, but that strength makes cognitive demands on investigators. In the end, there is no psychometric "free lunch."

Expected Value Theories and IRT

Given the popularity of item response theory (IRT) (see, for example, Lord, 1980, and Yen & Fitzpatrick, 2006), it seems obvious to consider some similarities and differences between IRT and the two expected value theories discussed in this article. In both substantive and utilitarian senses, there is a rather obvious difference between the two types of theories. Specifically, IRT focuses on item responses, whereas CTT and G theory focus on test or form scores. Using IRT, investigators can clearly distinguish among different items. By contrast, G theory cannot distinguish among items, since it is a random sampling model, just as different persons are not distinguishable in survey sampling research. CTT can make distinctions among items only if items are defined as forms, but if that is done, parallelism assumptions are often suspect.¹⁵

Some may object to the above characterization of CTT by noting that there is a long history of using so-called classical item analysis statistics such as difficulty levels and point-biserial discrimination indices. True enough. Such statistics, however, are not easily defended from a strict interpretation of CTT as discussed in this article. The essential problem is that almost always item scores grossly violate the assumption of classically parallel forms, and even the assumptions of essentially tau-equivalent forms. Classical item analysis statistics have a long-standing demonstrated utility for test development, but that does not mean they are well modelled by CTT.

A forest-trees metaphor is reasonably apt for considering IRT vis-à-vis expected value theories. Consider individual items as trees and the universe of items as the forest. If we focus on individual trees as we do in IRT, then we are easily oblivious to the forest. If we focus on the forest, then the trees are indistinguishable.

¹⁵If items are considered as congeneric forms, then perhaps this problem can be circumvented (L. S. Feldt, personal communication, March 3, 2010).


To put it another way, in IRT items (more correctly, item parameters) are effectively fixed, which means that a replication would consist of identically the same items (or, more correctly, a set of items with identically the same parameters). Call this "strictly" parallel forms. The notion of randomly parallel forms in G theory is much less restrictive, and even the various CTT notions of parallel forms are much weaker than "strictly" parallel forms.

Traditional developments of IRT do not typically mention fixed items or strictly parallel forms. These notions are implicit, however, in other aspects of IRT. For example, in the derivation of the standard error of the maximum-likelihood estimate of θ, there is no consideration of sampling items; and if items are not sampled, they must be fixed. Also, the expected number-correct (ENR) on the vertical axis in a test characteristic curve (TCC) is typically viewed as number-correct true score. However, ENR is not an expected value over any set of items different from those for the specific TCC, since the TCC itself is conditional on a very specific set of items.¹⁶ Therefore, there is a discontinuity between the IRT notion of true score (ENR) and the notion of true score in CTT and G theory. This is particularly evident in comparing IRT and G theory: items are fixed in IRT, whereas they are almost always treated as random in G theory.

Some of the above comments may appear to conflict with some old and current literature. For example, Lord and Novick (1968) show relationships between certain classical item analysis statistics and normal ogive item parameters. True enough, but the relationships are based on first assuming that a normal ogive model fits. The fact that proposition A implies proposition B does not mean that B implies A; that is, the Lord and Novick (1968) demonstration does not mean that CTT and IRT are interchangeable for item analysis purposes. A similar type of comment, although more nuanced, applies to Holland and Hoskens (2003), which in no way mitigates the quality or importance of their research. It is true that Lord and Novick (1968) and Holland and Hoskens (2003) have taken steps in the direction of integrating CTT and IRT from certain perspectives; it is not true that the two theories are fully integrated, or that one is a subset of the other.

Current IRT models and G theory differ not only with respect to items being fixed (IRT) or random (G theory), but also in the sense that G theory emphasizes the contributions of multiple facets to measurement error, whereas almost all of the widely used IRT models have no explicit role for multiple facets. There is some research, however, that seeks to integrate aspects of G theory and IRT. For example, Bock, Brennan, and Muraki (2002) have proposed a procedure that incorporates multiple sources of error directly into the information function and, hence, into the IRT SEM.

¹⁶It might be argued that ENR is an expected value over a propensity distribution of performance on the fixed items, but even then, the items (or item parameters) themselves are still fixed.


TABLE 1
Comparisons Among CTT, G Theory, and IRT

Forms and parallelism
  CTT: classically parallel, essentially tau-equivalent, etc.
  G theory: randomly parallel
  IRT: strictly parallel
True score
  CTT: expectation over forms
  G theory: expectation over randomly parallel forms
  IRT: expected number right for fixed set of items
Assumptions
  CTT: relatively weak
  G theory: relatively weak
  IRT: very strong
Primary strengths
  CTT: simplicity; widely used; has stood test of time
  G theory: conceptual breadth; disentangles multiple sources of error; distinguishes between fixed and random facets
  IRT: mathematically elegant; solves many complex measurement problems if assumptions hold
Primary weaknesses
  CTT: undifferentiated error
  G theory: conceptual complexity
  IRT: only fixed facet(s)
Use and understanding
  CTT: easy
  G theory: sometimes challenging
  IRT: sometimes challenging

Also, Briggs and Wilson (2007) and Chien (2008) have considered an approach that estimates variance components based on IRT estimates of expected number correct scores rather than the actual observed scores used in G theory. Briggs and Wilson (2007) consider an items facet only; Chien (2008) considers two facets. In addition, there have been a number of informal, unpublished suggestions that Bayesian priors be used to turn the fixed items (more correctly, fixed item parameters) in IRT into random variables.¹⁷ None of these approaches have been studied much yet, but it is encouraging that researchers are making attempts at integrating G theory and IRT. Even if the attempts fall short, they may lead to beneficial insights.

Table 1 provides a comparison among CTT, G theory, and IRT with respect to many of the issues considered in this section. The comparative phrases in Table 1 are necessarily succinct; they should be interpreted in the more extended sense discussed in this section. The differences among models are substantive and important, but each of these models is defensible and valuable, and no one of them is a substitute for the other, at least not in their current instantiations. It is unfortunate that much of the current research and practice in educational measurement does not give more attention to the differences among these models, and especially the differences among their assumptions.

¹⁷Bayesian priors are actually involved in the Briggs and Wilson (2007) and Chien (2008) approaches, which employ MCMC methods.


REFERENCES

Bock, R. D., Brennan, R. L., & Muraki, E. (2002). The information in multiple ratings. Applied Psychological Measurement, 26, 364–375.
Borsboom, D. (2005). Measuring the mind. Cambridge, UK: Cambridge University Press.
Brennan, R. L. (1992). Elements of generalizability theory (rev. ed.). Iowa City, IA: ACT, Inc.
Brennan, R. L. (1997). A perspective on the history of generalizability theory. Educational Measurement: Issues and Practice, 16(4), 14–20.
Brennan, R. L. (2001a). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38, 285–317.
Brennan, R. L. (2001b). Generalizability theory. New York: Springer-Verlag.
Briggs, D. C., & Wilson, M. (2007). Generalizability in item response modeling. Journal of Educational Measurement, 44, 131–155.
Burt, C. (1936). The analysis of examination marks. In P. Hartog & E. C. Rhodes (Eds.), The marks of examiners. London: Macmillan.
Cardinet, J., Johnson, S., & Pini, G. (2010). Applying generalizability theory using EduG. New York: Taylor and Francis.
Chien, Y. (2008). An investigation of testlet-based item response models with a random facets design in generalizability theory. Unpublished doctoral dissertation, University of Iowa, Iowa City, Iowa.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L. J. (1976). On the design of educational measures. In D. N. M. de Gruijter & L. J. T. van der Kamp (Eds.), Advances in psychological and educational measurement (pp. 199–208). New York: Wiley.
Cronbach, L. J. (1991). Methodological studies—A personal retrospective. In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp. 385–400). Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (2003). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391–418.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137–163.
Cronbach, L. J., Schönemann, P., & McKie, T. D. (1965). Alpha coefficients for stratified-parallel tests. Educational and Psychological Measurement, 25, 291–312.
Ebel, R. L. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 407–424.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: American Council on Education and Macmillan.
Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver & Boyd.
Gleser, G. C., Cronbach, L. J., & Rajaratnam, N. (1965). Generalizability of scores influenced by multiple sources of variance. Psychometrika, 30, 395–418.
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Westport, CT: American Council on Education/Praeger.
Holland, P. W., & Hoskens, M. (2003). Classical test theory as a first order item response theory: Application to true-score prediction from a possibly nonparallel test. Psychometrika, 68, 123–149.
Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125–160.
Lindquist, E. F. (1953). Design and analysis of experiments in psychology and education. Boston: Houghton Mifflin.
Lord, F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 15, 325–336.


Lord, F. M. (1957). Do tests of the same length have the same standard error of measurement? Educational and Psychological Measurement, 17, 510–521.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32, 1–13.
Rajaratnam, N., Cronbach, L. J., & Gleser, G. C. (1965). Generalizability of stratified-parallel tests. Psychometrika, 30, 39–56.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 11–154). Westport, CT: American Council on Education/Praeger.
