

Classical Test Theory in Historical Perspective

Ross E. Traub
The Ontario Institute for Studies in Education of the University of Toronto

What were the historic origins of classical test theory? What have been major milestones in its development?

Classical test theory is an emanation of the early 20th century. It was born of a ferment the ingredients of which included three remarkable achievements of the previous 150 years: a recognition of the presence of errors in measurements, a conception of that error as a random variable, and a conception of correlation and how to index it. Then in 1904 Charles Spearman showed us how to correct a correlation coefficient for attenuation due to measurement error and how to obtain the index of reliability needed in making the correction. Spearman's demonstration marked the beginning, as I see it, of classical test theory. Subsequently, the framework of classical theory was elaborated and refined by Spearman, George Udny Yule, Truman Lee Kelley, and others over the quarter century or so following 1904. Another milestone was laid in 1937 with the publication of the Kuder-Richardson formulas. This event was followed, shortly thereafter, by the idea of lower bounds to reliability and the framework for enhanced understanding found in the work of Louis Guttman. The culmination of classical test theory was realized in the systematic treatment it received from Melvin Novick (1966; Lord & Novick, 1968).

What follows is an attempt to add a little flesh to the bare bones of the foregoing outline. Before doing so, however, I should be clear about what I take classical test theory to be.

Classical Theory Circumscribed

Classical¹ test theory is founded on the proposition that measurement error, a random latent variable, is a component of the observed score random variable. The latter variable is realized in the measurements that may be taken of a characteristic of the persons² in some more-or-less well-defined population. Add to this proposition (a) a result that is true by construction, namely, that the error variable has zero covariance with that other latent component of observed measurements, the true score variable (Lord & Novick, 1968), and (b) a crucial assumption that the error component of a measure is independent of the error components of other measures, either of the same characteristic or of different characteristics, and one is able to prove the basic theorems of classical test theory. Essential adjuncts to the theory are the ancillary assumptions invoked and experimental procedures followed in estimating coefficients of reliability and standard errors of measurement, two of classical test theory's basic results.
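The proposition can be made concrete with a small simulation. The sketch below is illustrative only and is not drawn from the article: the score scale, the two standard deviations, the sample size, and every function and variable name are assumptions chosen for the example. It builds two parallel measures from the same true scores plus independent errors, then compares the model's reliability and standard error of measurement with what the simulated data yield.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Population variance (divide by N)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def correlation(xs, ys):
    """Product-moment correlation: covariance over the product of the SDs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (variance(xs) * variance(ys)) ** 0.5


# Hypothetical values, chosen only for illustration.
TRUE_SD, ERROR_SD, N_PERSONS = 10.0, 5.0, 20000

rng = random.Random(0)  # fixed seed so the run is reproducible
true_scores = [rng.gauss(50.0, TRUE_SD) for _ in range(N_PERSONS)]
# Two parallel measures: same true score, independent errors of equal variance.
form_a = [t + rng.gauss(0.0, ERROR_SD) for t in true_scores]
form_b = [t + rng.gauss(0.0, ERROR_SD) for t in true_scores]

# Reliability implied by the model: true variance over observed variance.
model_reliability = TRUE_SD ** 2 / (TRUE_SD ** 2 + ERROR_SD ** 2)

# Estimates recovered from the simulated observed scores alone.
estimated_reliability = correlation(form_a, form_b)
estimated_sem = (variance(form_a) * (1.0 - estimated_reliability)) ** 0.5

print(f"model reliability    {model_reliability:.3f}")
print(f"parallel-forms r     {estimated_reliability:.3f}")
print(f"observed variance    {variance(form_a):.1f}")
print(f"estimated SEM        {estimated_sem:.2f}")
```

Because the errors are generated independently of the true scores and of each other, the correlation between the two forms approximates the ratio of true to observed variance, and the standard error of measurement recovered from observed scores approximates the error standard deviation that was built in.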

Ross E. Traub is a Professor at the Ontario Institute for Studies in Education, OISE, University of Toronto, 252 Bloor St. W., Toronto, Ontario, Canada M5S 1V6. His specializations are test theory and educational assessment.

Educational Measurement: Issues and Practice, Winter 1997

The Zeitgeist

A prominent feature of the zeitgeist at the turn of the 20th century was the notion of errors in scientific observations. As early as the 17th century, apparently, "Galileo had . . . reasoned that errors of observation are distributed symmetrically and tend to cluster around their true value" (Read, 1985, p. 348). "By the eighteenth century," according to Churchill Eisenhart (1983a),

the practice of taking the arithmetic mean [of a set of observations] "for truth" had become fairly widespread. Nonetheless, Thomas Simpson (1710-1761) wrote to the president of the Royal Society in March 1755 that some persons of considerable note maintained that one single observation, taken with due care, was as much to be relied on as the mean of a great number. As this appeared to him to be a matter of much importance, he said that he had a strong inclination to ascertain whether by the application of mathematical principles the utility and advantage of the practice might be demonstrated. Measurements and functions of measurements, such as their arithmetic means, are not amenable to mathematical theory, however, as long as individual measurements are regarded as unique entities, that is as fixed numbers y₁, y₂, . . . . A mathematical theory of measurements, and of functions of measurements, is possible only when particular measurements y₁, y₂, . . . are regarded as instances characteristic of hypothetical measurements Y₁, Y₂, . . . that might have been, or might be, yielded by the same measurement process under the same circumstances. Consequently, Simpson hypothesized that the respective chances of the different errors to which any single observation is subject could be expressed as a discrete probability distribution of error. . . . (p. 531)

By the beginning of the 19th century, astronomers had come to recognize errors in observations as an area worthy of research. Carl Friedrich Gauss derived the normal distribution while trying to prove that the mean of many observations of an unknown quantity, such as a parameter of the orbit of a planet, is the most likely value of that quantity (Eisenhart, 1983b; Read, 1985).³ In other work (for example, that of Friedrich Wilhelm Bessel⁴) errors in astronomical observations were thought to be composed of many independent, elementary parts. The distribution of the sum of these errors was shown to be normal under various assumptions, hence the references, commonplace in the 19th century, to the law of errors in observations (Eisenhart, 1983b; Read, 1985). Later in the 19th century, however, as noted by O. B. Sheynin (1968),

[the normal distribution,] as far as the theory of errors is concerned, had been almost forgotten and even the concept of random errors of observation . . . [had become] divorced from the concept of random quantities in the theory of probability. . . . A qualifying remark should be added: . . . in the second half of the 19th century a definition of a random quantity as "dependent on chance" and possessing a certain law of distribution had become . . . natural. . . . As to random errors, these were usually taken to be errors with certain probabilistic properties, their specific distribution . . . being not so important. . . . It seems that Vassiliev (1885, Theory of Probabilities, in Russian, a lithographic edition, Kazan, p. 133) was the first who definitely held that random errors of observation are to be ranked among random quantities. (pp. 236-237)

We might expect the notion of correlation, that other essential feature of Spearman's zeitgeist, to have emerged from a study of the distribution of joint errors. Indeed, several individuals separately derived expressions resembling the "ordinates of the probability surface of normal correlation for two variables" (Walker, 1929, p. 93). These individuals included Robert Adrain, an Irishman who emigrated to the U.S., taught at Rutgers, Columbia College, and the University of Pennsylvania, and published such an equation in 1808; Pierre-Simon, Marquis de Laplace, who published such an equation in 1810; Giovanni Antonio Amedeo Plana, professor of astronomy at Turin, who published an equation in 1812; Gauss, in 1823; and Auguste Bravais, professor of astronomy at Lyons and professor of physics at Paris, in 1846. But none of these scholars seems to have interpreted the cross-product term in the exponents of their expressions as an indicator of covariation or correlation. This enormous idea was left to Francis Galton. Although he did not actually derive the formula for the bivariate normal distribution (it can be argued that this was done by a Cambridge mathematician named J. D. Hamilton Dickson, to whom Galton gave the relevant concepts⁵), Galton did originate the basic ideas we associate with the bivariate distribution. The notion of the scatterplot, on which are displayed the two regression lines, can be found in Galton's 1885 presidential address to the Anthropological Section of the British Association, subsequently an article published in 1886 in the Journal of the Anthropological Institute under the title "Regression Towards Mediocrity in Hereditary Stature."⁶ The term correlation was first used in a technical sense 2 years later in an article entitled "Co-Relations and Their Measurement" (Galton, 1888). Galton, however, used the symbol r to refer to reversion or regression. Credit for calling r the coefficient of correlation belongs to Francis Y. Edgeworth, and dates from 1892 (Boring, 1957).⁷

Karl Pearson contributed important mathematical work in support of his friend Galton's coefficient. For example, Pearson (1896) proved that the best value of r is the covariance divided by the product of the standard deviations.
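The product-moment expression just described is short enough to state as code. The following is a minimal sketch, not anything taken from Pearson's paper: the function name and the two data lists are hypothetical, and population (divide-by-N) moments are assumed throughout.

```python
import math


def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences of at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (sd_x * sd_y)


# Hypothetical paired measurements, for illustration only.
first_measure = [1.0, 2.0, 3.0, 4.0, 5.0]
second_measure = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(first_measure, second_measure)
print(round(r, 4))
```

Dividing the covariance by both standard deviations removes the units of measurement, which is why the coefficient is bounded by -1 and +1 whatever scales the two variables are expressed in.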

Defining Achievements in the History of Classical Test Theory

We find ourselves at the very beginning of the 20th century, a time when the idea of errors in measurements had become widely accepted. Also, the coefficient of correlation was well established as an important statistical concept, although Pearson's product moment expression had not yet gained acceptance as the computational formula of choice. In addition, we encounter around this time the first scientific publications in which coefficients of correlation were the principal results of the research. For example, Karl Pearson, who, like Galton, had an abiding interest in eugenics, investigated the correlation between characteristics of pairs of brothers (Pearson, 1904; Pearson & Lee, 1903). Pearson found, to his astonishment, that the magnitude of this correlation was about 0.5, regardless of characteristic, physical or psychical. So this was the time, and this was the context for the first of the defining achievements in the history of classical test theory: the correction of a correlation coefficient for attenuation due to measurement error. In what follows, I describe this and four other milestones marking the development of classical test theory.

Spearman's correction for attenuation. Spearman was a psychologist. At the turn of the 20th century, he was just beginning his study of intelligence. In the course of this early work, he had discovered that independent measurements of a psychical characteristic of a person (for example, mental ability) vary in a random (Spearman called it an "accidental"; 1904, p. 89) fashion from one measurement trial to another. In other words, the coefficient of correlation between such independent measurements for a group of persons is not perfect. Spearman then had the insight that the absolute value of the coefficient of correlation between the measurements for any pair of variables must be smaller when the measurements for either or both variables are influenced by accidental variation than it would otherwise be. Eliminating this attenuating effect of accidental error was just one of the matters, albeit in retrospect the most important of those matters, addressed by Spearman in his 1904 article, entitled "The Proof and Measurement of Association Between Two Things" and published in the American Journal of Psychology.

Spearman's article angered Karl Pearson. The reason was that Spearman had had the temerity to challenge Pearson's conclusion that the coefficient of correlation between pairs of brothers was 0.5 for psychic characteristics, just as it had been found to be for physical characteristics. Using reliability estimates from his own work, Spearman estimated the corrected correlation coefficient for mental ability to be 0.8.

Pearson was co-editor of a newly founded journal, Biometrika. He inserted the following petulant note in his 1904 article in that journal on the correlation between selected characteristics of sibling pairs.


I hardly know whether it is needful to refer here to a recent article by Mr. C. Spearman . . . criticising my results for the similarity of inheritance in the physical and psychical characters. Without waiting to read my paper in full he seems to think I must have disregarded "home influences" and the personal equation of the school teachers. He proceeded to "correct" my results for the error of what he calls dilation on the double basis (i) of a formula invented by himself, but given without proof, and (ii) of his own experience that two observers' observations or measurements of the same series of two characters were such that the correlation between their determinations was .58 in one case and .22 in the other. The formula invented by Mr. Spearman for his so-called "dilation" is clearly wrong, for applied to perfectly definite cases, it gives values greater than unity for the correlation coefficient. As to his second basis, all I can say is that if the correlation between two observers of the same thing in Mr. Spearman's case can be as low as .22, he must have employed the most incompetent observers, or given them the most imperfect instructions, or chosen a character [more] suitable for random guessing than observation in the scientific sense. Mr. Spearman says that "it is difficult to avoid the conclusion that the remarkable coincidence announced between physical and mental heredity can be [nothing] more than mere accidental coincidence" (p. 98). I think I may safely leave him to calculate the odds for or against this most remarkable "mere accidental coincidence" . . . . Perhaps the best thing at present would be for Mr. Spearman to write a paper giving algebraic proofs of all the formulae he has used; and if he did not discover their erroneous nature in the process, he would at least provide tangible material for definite criticism, which it is difficult to apply to mere unproven assertions. (Pearson, 1904, p. 160)

Stung into responding, Spearman published a proof of the correction for attenuation in 1907, again using the American Journal of Psychology. Subsequently, a proof in a form often encountered in present day textbooks on educational and psychological measurement was given by Spearman (1910) and also by William Brown (1910), both of whom ascribed it to Yule. This derivation stressed that the error components of all measures should be independent, and hence uncorrelated. Spearman's earlier "proof" had not emphasized this restriction (Walker, 1929).

Pearson was not the only critic of the correction for attenuation. In particular, Brown (1910) challenged it on the grounds that measurement error is not really random (accidental). Brown proposed a way of testing the equality of the covariances S(x₁y₁) and S(x₂y₂), where x₁, x₂, y₁, and y₂ are observed-score variables. Assuming x₁ is parallel to x₂ and y₁ is parallel to y₂, then both these covariances, according to classical theory, should equal S(xy), where x and y are the true-score variables associated with {x₁, x₂} and {y₁, y₂}, respectively. Brown (1913) reported results based on an application of this proposal, results he claimed did show that measurement errors are not accidental.

Spearman was aware of Brown's criticism, among others, and presciently responded as follows in his British Journal of Psychology article of 1910:

1. Spearman reiterated his position that of the two kinds of errors in measurement, "regular" and "accidental" (p. 273), the correction formula applies only to the latter. As regards Brown's contention that accidental errors can be correlated, Spearman observed that, if errors were indeed found to be linked (as might be the case, e.g., if a person were ill at the time of taking both tests x and y), then the investigator should employ a better experimental design.

2. It had been suggested, according to Spearman (1910, p. 272), that investigators should make measurements so "efficient" that no correction would be needed. Spearman wondered how an investigator would know his measurements were efficient enough except by using the correction formula.

3. To Pearson's criticism that the correction could produce coefficients greater than one, Spearman countered that this might occur due to sampling error. He recommended (p. 277) the coefficient be set to one whenever this happened.
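The exchange is easier to follow with the correction written out. The sketch below uses the form found in present-day textbooks (the observed correlation divided by the square root of the product of the two reliability coefficients); it is not transcribed from Spearman's articles. The function name, the `cap` parameter, and the numerical inputs are hypothetical, and the symmetric lower limit of -1 is an assumption added here, since the recommendation reported above concerns only coefficients greater than one.

```python
import math


def correct_for_attenuation(r_xy, rel_x, rel_y, cap=True):
    """Disattenuated correlation: r_xy / sqrt(rel_x * rel_y).

    r_xy  -- observed correlation between measures x and y
    rel_x -- reliability coefficient of x
    rel_y -- reliability coefficient of y
    cap   -- if True, limit the result to the interval [-1, 1]
    """
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    corrected = r_xy / math.sqrt(rel_x * rel_y)
    if cap:
        # Upper cap follows the recommendation to set the coefficient to one;
        # the lower cap at -1 is an assumption made for symmetry.
        corrected = max(-1.0, min(1.0, corrected))
    return corrected


# Hypothetical inputs, for illustration only.
typical = correct_for_attenuation(0.40, 0.64, 0.81)
over_unity = correct_for_attenuation(0.50, 0.30, 0.60, cap=False)
capped = correct_for_attenuation(0.50, 0.30, 0.60)

print(round(typical, 3), round(over_unity, 3), capped)
```

The second call shows the behavior Pearson objected to: with low reliability estimates the uncapped value exceeds one. The third call applies the cap, which treats such a result as a product of sampling error rather than as evidence against the formula.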

The Spearman-Brown formula. Coefficients of reliability are needed in order to apply the correction for attenuation. In his 1904 article, Spearman had assumed the availability of two independent measurements of both the characteristics for which a corrected correlation coefficient is desired. The breakthrough, apparently achieved independently by Spearman and Brown, to a formula by which to calculate a reliability coefficient from the two halves of just one composite measure was published in adjacent articles in a 1910 issue of The British Journal of Psychology. Brown's proof of the formula is the more elegant and bears the stamp of Yule.

During the second and third decades of the 20th century, numerous experiments were conducted to test predictions of the Spearman-Brown formula. A thoughtful review of publications emanating from this preoccupation of early psychometricians can be found in the notes on test theory prepared by Louis Leon Thurstone (1932).

The index of reliability and other results. It is a worthwhile experience, though in the late 20th century a humbling one, to read the articles on test theory that were published between 1910 and 1925 by Spearman, Brown, and Kelley, among others. These documents contain a great many of the basic results of classical test theory. Kelley's 1923 text, Statistical Method, included a compilation of these results in a section on reliability theory. Kelley also stated in this text the definition of reliability that he championed throughout his long career: the coefficient of correlation between "comparable tests" (p. 203). Kelley laid down three conditions for test comparability:

(1) sufficient fore-exercise should be provided to establish an attitude or set, thus lessening the likelihood of the second test being different from the first, due to a new level of familiarity with the mechanical features, etc.; (2) the elements of the first test should be as similar in difficulty and type to those in the second, pair by pair, as possible; but (3) should not be so identical in word or form as to commonly lead to a memory transfer of correlation between errors. (p. 203)

Kelley was critical of Brown (1910) for using the term reliability coefficient to refer to the correlation between scores on repeated administrations of the same test. Brown's definition fails to meet the third of Kelley's conditions. (Kelley's conclusion: "Accordingly the repetition of a test to secure a reliability coefficient is to be deprecated" [Kelley, 1923, p. 203].)

An important result, used by Spearman in his 1910 proof of the prophecy formula, was the expression for the correlation between two composite measures in terms of the variances and covariances of the components. From this result, Abelson (1911, p. 314) derived the formula for what came to be known several years later as the index of reliability.⁸ It seems this name was coined quite by accident. Kelley independently derived the formula for the index in 1916 and then in using it wrote that "the extent to which the grade determined by means of this test of forty words would correlate with the true spelling ability of the individual is probably an even more significant index of reliability" (p. 74). Walker reported that "[w]hen Munroe published the formula in his Introduction to the Theory of Educational Measurements (1923) he . . . [set the phrase in capital letters], ascribed it to Kelley, and established Index of Reliability as a definite term" (1929, p. 118).

The Kuder-Richardson formulas. Writing the coefficient of correlation between two composite measures in terms of the variances and covariances of the measures' components made it possible to study the effects of the characteristics of item scores on the characteristics of total test scores. By 1936, Marion Richardson, in an article in the first volume of Psychometrika, had demonstrated several propositions for tests composed of discrete, dichotomously scored items. Invoking the assumption that all test items have equal variances (Richardson noted that this assumption is close to true for a wide range of item difficulty values), he showed that "the rejection of items with low item-test correlations raises the reliability of a test, if the number of items is held constant" (p. 72). Richardson also showed that "for tests of homogeneous [item] difficulty and constant length, the true variance is proportional to the average item intercorrelation" (p. 75).

Given his work on the relationship between item and total test scores, it is perhaps not surprising to find Richardson a co-author, with Frederic Kuder, of the blockbuster article of 1937, the one containing the famous Formulas 20 and 21. (In a footnote to the article, Kuder and Richardson reported they had independently arrived at the results contained in the article. They apparently discovered this fact quite by accident, at which time they decided to publish the results jointly.) Kuder and Richardson began their article with a critique of existing approaches to the estimation of reliability. Of the test-retest method, they said,

[Using] the same form gives, in general, estimates that are too high because of material remembered on the second application of the test. This memory factor cannot be eliminated by increasing the length of time between the two applications, because of variable growth in the function tested within the population of individuals. These difficulties are so serious that the method is rarely used. (p. 151)

Of the split-half approach, Kuder and Richardson concluded that "although the authors have made no actual count, it seems safe to say that most technicians use the split-half method of estimating reliability" (p. 151). They then observed that the number of splits possible for an n-item test (n being an even number) is

$$\frac{n!}{2\left[\left(\frac{n}{2}\right)!\right]^{2}},$$

a number so large for any test of reasonable length that the reliability estimates from all possible split-halves of the test are very likely to vary considerably. So the issue for Kuder and Richardson was to devise a procedure by which the information in the item scores of a test could be used to produce a single estimate of reliability.

In deriving Formulas 20 and 21, Kuder and Richardson began by writing the following expression, their Equation 3, for the correlation of a test composed of n dichotomously scored items with a hypothetical second test of n such items:

$$r_{tt} = \frac{\sigma_t^{2} - \sum_{i=1}^{n} p_i q_i + \sum_{i=1}^{n} r_{ii}\, p_i q_i}{\sigma_t^{2}}, \tag{3}$$

where $\sigma_t^{2}$ is the test variance, $\sum_{i=1}^{n} p_i q_i$ is the sum of the item variances, and $\sum_{i=1}^{n} r_{ii}\, p_i q_i$ is the sum of the item true score variances. The problem Kuder and Richardson set for themselves was that of estimating $r_{ii}$, the item reliability coefficient. They did so under various assumptions. Those leading to Equation 20 were that the matrix of interitem correlation coefficients has rank 1 and that all interitem correlation coefficients are equal. Setting $r_{ii}$ equal to $r_{ij}$, the interitem correlation coefficient, Kuder and Richardson derived the result:

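$$r_{ii} = \frac{\sigma_t^{2} - \sum p_i q_i}{\left(\sum \sqrt{p_i q_i}\right)^{2} - \sum p_i q_i} = \frac{\sigma_t^{2} - n\,\overline{pq}}{(n-1)\, n\,\overline{pq}},$$

which, when substituted in Equation 3, and simplified, gives

$$r_{tt} = \frac{n}{n-1} \cdot \frac{\sigma_t^{2} - n\,\overline{pq}}{\sigma_t^{2}},$$

where $\overline{pq}$ denotes the mean of the item variances $p_i q_i$, so that $n\,\overline{pq} = \sum p_i q_i$.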

Formula 21 was derived under theadditional assumption of equal itemdifficulty indices .
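The contrast between the many split-half estimates and the single Kuder-Richardson estimate can be sketched numerically. Everything specific in the example below is an assumption made for illustration: the 8-by-6 matrix of item scores is invented, the function names are hypothetical, variances are population (divide-by-N) variances, and the half-test correlation is stepped up with the usual two-halves form of the Spearman-Brown formula, 2r / (1 + r), which the article names but does not write out. KR-20 and KR-21 are coded in their standard textbook forms.

```python
import math
from itertools import combinations

# Hypothetical 0/1 item scores: 8 examinees (rows) by 6 items (columns).
SCORES = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def variance(xs):
    """Population variance (divide by N)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def correlation(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / math.sqrt(variance(xs) * variance(ys))


def number_of_splits(n_items):
    """n! / (2 [(n/2)!]^2): distinct ways to halve a test of even length n."""
    if n_items % 2:
        raise ValueError("n_items must be even")
    return math.comb(n_items, n_items // 2) // 2


def split_half_estimates(scores):
    """Spearman-Brown stepped-up half-test correlation for every split."""
    n_items = len(scores[0])
    estimates = []
    # Keeping item 0 in the first half counts each split exactly once.
    for rest in combinations(range(1, n_items), n_items // 2 - 1):
        first_half = (0,) + rest
        half_a = [sum(row[j] for j in first_half) for row in scores]
        half_b = [sum(row) - a for row, a in zip(scores, half_a)]
        r = correlation(half_a, half_b)
        estimates.append(2 * r / (1 + r))
    return estimates


def kr20(scores):
    """n/(n-1) * (total variance - sum of p_i q_i) / total variance."""
    n_items, n_persons = len(scores[0]), len(scores)
    total_var = variance([sum(row) for row in scores])
    p = [sum(row[j] for row in scores) / n_persons for j in range(n_items)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    return n_items / (n_items - 1) * (total_var - sum_pq) / total_var


def kr21(scores):
    """As KR-20, but with every item assigned the mean difficulty."""
    n_items = len(scores[0])
    totals = [sum(row) for row in scores]
    total_var = variance(totals)
    p_bar = sum(totals) / (len(totals) * n_items)
    return n_items / (n_items - 1) * (total_var - n_items * p_bar * (1 - p_bar)) / total_var


estimates = split_half_estimates(SCORES)
print("splits for 6 items :", number_of_splits(6))
print("splits for 20 items:", number_of_splits(20))
print("split-half range   : %.3f to %.3f" % (min(estimates), max(estimates)))
print("KR-20 = %.3f, KR-21 = %.3f" % (kr20(SCORES), kr21(SCORES)))
```

Even with six items the split-half estimates spread over a noticeable range, whereas each Kuder-Richardson formula returns one value for the whole test. KR-21 cannot exceed KR-20, because replacing the individual item difficulties with their mean can only increase the subtracted sum of item variances.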

Kelley criticized the KR formulas in a 1942 article in Psychometrika on the grounds that they are valid only for tests "with unity of purpose" (that is, for tests composed of items that share just one factor in common). Reiterating his long-standing advocacy of the parallel-test design for reliability estimation, Kelley went on to say "we conclude that a belief that two or more measures of a mental function exist is prerequisite to the concept of reliability, and, further, not only that they exist but that they are available before a measure of reliability is possible" (p. 76). As a further challenge to the KR formulation, Kelley demonstrated that a test with zero interitem covariances could produce a reasonable correlation with a "similar" (p. 81) test, even though its KR20 index would be zero.

It is obvious, a half-century later,that Kelley's view did not prevail .The KR formulations quickly re-ceived widespread acceptance, abet-ted in part by the publication of anarticle by Paul Dressel (1940). Dres-sel showed that, when all the itemsof a test intercorrelate perfectly andall item variances are equal, KR20attains the value of l ; otherwise, itis less . He further demonstratedthat KR20 can take values less than0 . Dressel also increased the applic-ability of KR20 by deriving a versionfor tests to which the correction forguessing is applied .Lower bounds to reliability. Per-

haps it was Dressel's demonstrationthat KR20 can be negative thatmarks the beginning of work onlower bounds to reliability. Alterna-tively, a case might be made forPhilip Rulon's (1939) article in theHarvard Educational Review . Rulonintroduced the notion, not the name,of essentially tau-equivalent testhalves . The halves of a test are es-

sentially tau equivalent if, for anyexaminee in the test population, thetrue score on one test half differsfrom the true score on the other halfby a constant, which is the same forall examinees . Also, under the es-sentially tau-equivalent assump-tion, as distinct from the paralleltest-halves assumption, the errorvariance for an examinee on onetest half is not necessarily equal tothe error variance for the examineeon the other half (see Lord &Novick, 1968, p . 50) .Whatever the wellspring of work

on lower bounds, Louis Guttman(1945) published the first article, asfar as I know, in which lower boundsto reliability are explicitly derived .But this Psychometrika article isimportant for a reason other thanthe lower bounds it contains .Guttman also offered a theoreticalframework within which to treat, ifnot actually to reconcile, the antag-onistic views of Brown and Kelleyregarding how the reliability coeffi-cient should be estimated . Guttmandid this by first identifying threesources of variation in test re-sponses-persons, items, and trials .Guttman defined error variance ex-clusively in terms of variation in re-sponses over the universe of trials .This definition leads to a proof oftotal test variance as the sum oftrue-score and error variance, with-out the need to assume zero covari-ance of true and error scores . (Thelatter assumption lies at the heartof Yule's proof of the correction forattenuation .) Defining the reliabil-ity coefficient "as the complement ofthe ratio of error variance to total[test] variance" (p . 257), Guttmanthen went on to demonstrate (pp .267-268) that the reliability coeffi-cient can be estimated as the corre-lation between the test scores for agroup of examinees on two "experi-mentally independent" (p . 264) tri-als of a test . Given the results fromonly one trial, however, Guttmanshowed that the best result possibleis an estimate of a lower bound toreliability. This result was shown torest on the assumption

that the errors of observation are independent between items and between persons over the universe of trials. In the conventional [Yule] approach, independence is taken over persons rather than trials, and the problem of observability from a single trial is not explicitly analyzed. (p. 257)

Guttman derived six lower bounds to reliability, of which three are noted here. One of these is a generalization of KR20 to tests composed of items scored on any scale, dichotomous or otherwise. He labeled this index λ3. Subsequently, λ3 became better known as coefficient alpha (Cronbach, 1951). Two of the other lower bounds to reliability were labeled, not surprisingly, λ1 and λ2, with the former typically smaller than alpha, the latter typically larger. Research on lower bounds to reliability constitutes a small but still active line of psychometric research.
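
As a concrete illustration of the three bounds just named, the following Python sketch computes λ1, λ2, and λ3 from a persons-by-items score matrix. The formulas are those commonly given for Guttman's bounds in the psychometric literature rather than quotations from this article. The function names and the small data set are hypothetical, and population (divide-by-N) variances are assumed throughout.

```python
from math import sqrt


def covariance_matrix(scores):
    """Item-by-item covariance matrix (dividing by N) for a list of person rows."""
    n_persons = len(scores)
    n_items = len(scores[0])
    means = [sum(row[j] for row in scores) / n_persons for j in range(n_items)]
    return [
        [
            sum((row[j] - means[j]) * (row[k] - means[k]) for row in scores) / n_persons
            for k in range(n_items)
        ]
        for j in range(n_items)
    ]


def guttman_lower_bounds(scores):
    """Return (lambda1, lambda2, lambda3) for a persons-by-items score matrix.

    Assumed formulas (standard textbook statements, not taken from this article):
      lambda1 = 1 - (sum of item variances) / (total score variance)
      lambda3 = n / (n - 1) * lambda1              (coefficient alpha)
      lambda2 = lambda1 + sqrt(n / (n - 1) * C2) / (total score variance),
                where C2 is the sum of squared off-diagonal covariances.
    """
    cov = covariance_matrix(scores)
    n = len(cov)
    # The variance of the total score equals the sum of all covariance entries.
    total_var = sum(sum(row) for row in cov)
    item_var_sum = sum(cov[j][j] for j in range(n))
    off_diag_sq = sum(cov[j][k] ** 2 for j in range(n) for k in range(n) if j != k)

    lambda1 = 1 - item_var_sum / total_var
    lambda3 = n / (n - 1) * lambda1
    lambda2 = lambda1 + sqrt(n / (n - 1) * off_diag_sq) / total_var
    return lambda1, lambda2, lambda3


# Hypothetical data: six examinees (rows) by four dichotomously scored items.
example_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

lambda1, lambda2, lambda3 = guttman_lower_bounds(example_scores)
print(f"lambda1={lambda1:.3f}  lambda3 (alpha)={lambda3:.3f}  lambda2={lambda2:.3f}")
```

When λ1 is nonnegative, the returned values satisfy λ1 ≤ λ3 ≤ λ2, and for dichotomous items λ3 computed this way coincides with KR20 under the same variance convention.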

Formalization

Various attempts to formalize classical test theory have been made over the years. Already mentioned is the section on reliability in Kelley's (1923) Statistical Method. Another early work is that by Thurstone (1932). The next presentation of note is Theory of Mental Tests by Harold Gulliksen (1950). The culmination of such efforts as these was realized in the work of Melvin Novick (1966) and in the early chapters of Statistical Theories of Mental Test Scores (1968) by Frederic Lord and Novick.

Concluding Remarks

Several important topics from the realm of classical test theory have not been covered in this brief retrospective. Among them are the effects of range restriction on the magnitude of the reliability coefficient, the application of analysis of variance to the study of measurement error and reliability (this before the advent of generalizability theory), and the modeling of test data generally addressed under the topic of congeneric models. Clearly, there is more to classical test theory, and its history, than the work reviewed in this article.

Lest we leave this topic thinking classical test theory an unduly important area of research in the history of empirical research in psychology and education, we will find it salutary to reflect on the following remarks from the preface of the 1932 edition of Thurstone's notes on test theory:

Since this volume is devoted to the validity and reliability concepts in their applications to mental tests and related correlational procedures, it is only fair to say that personally I do not believe that these correlational methods and particularly the reliability formulae have been responsible for much that can be called fundamental, important or significant in psychology. On the contrary, the correlational methods have probably stifled scientific imagination as often as they have been of service. As tools in their proper place they are useful but as the central theme of mental measurement they are rather sterile.

Notes

This article is a revised version of a paper presented at the 1996 Annual Meeting of the National Council on Measurement in Education (Session B1), New York.

1 Something classical, according to Webster's New Collegiate Dictionary, is accepted as standard and authoritative, as distinguished from novel or experimental: viz., classical physics.

2 Classical test theory applies to the measurements of a characteristic of the members of any collection of objects whatsoever. In particular, it is not restricted, as the wording here might imply, to measurements of human characteristics.

3 Although the normal distribution is commonly referred to as the Gaussian distribution, priority in its formulation rightfully belongs to Abraham de Moivre, who obtained it in 1733 (Read, 1985).

4 In the 18th and early 19th centuries, astronomers were required to make difficult judgments, based on a combination of auditory and visual cues, in order to time stellar transits. A well-known story from the history of science (Boring, 1957) is the firing in 1796 of Kinnebrook, an assistant to Maskelyne, the Astronomer Royal of England. Kinnebrook was relieved of his job for giving inaccurate readings of stellar transits. Although he had provided readings in agreement with Maskelyne's 18 months prior to his dismissal, the hapless Kinnebrook by August 1795 had begun to give times that differed from Maskelyne's by one-half second. Subsequently, Kinnebrook's readings grew even more discrepant, so by the time of his firing they were almost a second later than Maskelyne's. This matter might not have attracted much interest had not Maskelyne recorded it in Astronomical Observations at Greenwich (1799). Seventeen years later, in a history of Greenwich Observatory published in German, Kinnebrook's tribulation came to the attention of Bessel, an astronomer at Konigsberg. Bessel conducted a series of studies culminating in the notion of the personal equation, the name given the systematic difference in recording times found to characterize the stellar transits of almost any pair of astronomers. From the perspective of reliability theory, the personal equation itself was not a highly significant discovery, for it refers to systematic error, not the random error treated by reliability theory. What interests us, instead, is Bessel's finding that the personal equation itself is a variable quantity, one that differs from one pair of astronomers to another. This variation suggests random or accidental errors in observations, errors that, if neither controllable nor amenable to elimination, at the least demand an explanation grounded in a theory or a scientific law.

Winter 1997

5 Karl Pearson (1930) noted that Dickson did not actually write down the equation for the bivariate normal distribution but stopped one step short of doing so.

6 According to Walker (1929, pp. 98-101), H. P. Bowditch published a two-way table of height and age for 24,000 Boston school boys in 1877 in a manuscript entitled Growth of Children. Although he described one of the regression lines of a bivariate distribution in his work, Bowditch neither produced both lines nor conceived of a measure of the relationship between the variables.

7 Another of Galton's ideas was that the (normal) law of errors in observations might describe the frequency distributions of measurements of such human characteristics as mental ability. This idea was accepted readily enough by John Venn (1888), who wrote as follows: "That our mental qualities, if they could be submitted to accurate measurement, would be found to follow the usual Law of Error may be assumed without much hesitation" (p. 49). Venn expressed skepticism, however, over the idea that a normal distribution of mental measurements is yet another manifestation of the law of error:

When we perform an operation ourselves with a clear consciousness of what we are aiming at, we may quite correctly speak of every deviation from this as being an error; but when Nature presents us with a group of objects of every kind, it is using a rather bold metaphor to speak in this case also of a law of error, as if she had been aiming at something all the time, and had like the rest of us missed her mark more or less in every instance. (p. 42)

8 Walker (1929, p. 117) suggested the derivation of the formula may have been given to Abelson by Spearman. Abelson's article is, however, unclear on this point. The article contains two appendices. The first is described as having been "kindly supplied by Prof. Spearman" (p. 312). But the second appendix, which contains the index of reliability, bears no attribution to Spearman.

References

Abelson, A. R. (1911). The measurement of mental ability of "backward children." British Journal of Psychology, 4, 268-314.

Boring, E. G. (1957). A history of experimental psychology (2nd ed.). New York: Appleton-Century-Crofts.

Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.

Brown, W. (1913). The effects of "observational errors" and other factors upon correlation coefficients in psychology. British Journal of Psychology, 6, 223-235.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Dressel, P. L. (1940). Some remarks on the Kuder-Richardson reliability coefficient. Psychometrika, 5, 305-310.

Eisenhart, C. (1983a). Laws of error I: Development of the concept. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 4, pp. 530-547). Toronto: Wiley.

Eisenhart, C. (1983b). Laws of error II: The Gaussian distribution. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 4, pp. 547-562). Toronto: Wiley.

Galton, F. (1888). Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society, 45, 135-145.

Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255-282.

Kelley, T. L. (1916). A simplified method of using scaled data for purposes of testing. School and Society, 4, 34-37, 71-75.


Kelley, T. L. (1923). Statistical method. New York: Macmillan.

Kelley, T. L. (1942). The reliability coefficient. Psychometrika, 7, 75-83.

Kuder, G. F., & Richardson, M. W. (1937). The theory of estimation of test reliability. Psychometrika, 2, 151-160.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1-18.

Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia. Philosophical Transactions, A, 187, 252-318.

Pearson, K. (1904). On the laws of inheritance in man. II. On the inheritance of the mental and moral characters in man, and its comparison with the inheritance of physical characters. Biometrika, 3, 131-190.

Pearson, K. (1930). The life, letters, and labours of Francis Galton. Vol. IIIA. Correlation, personal identification and eugenics. Cambridge: The University Press.

Pearson, K., & Lee, A. (1903). On the laws of inheritance in man. I. Inheritance of physical characters. Biometrika, 2, 357-462.

Read, C. B. (1985). Normal distribution. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 6, pp. 347-359). Toronto: Wiley.

Richardson, M. W. (1936). Notes on the rationale of item analysis. Psychometrika, 1(1), 69-76.

Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test by split-halves. Harvard Educational Review, 9, 99-103.

Sheynin, O. B. (1968). On the early history of the law of large numbers. Biometrika, 55, 459-467.

Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.

Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 18, 160-169.

Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271-295.

Thurstone, L. L. (1932). The reliability and validity of tests. Ann Arbor, MI: N.p.

Venn, J. (1888). The logic of chance (3rd ed.). London: Macmillan.

Walker, H. M. (1929). Studies in the history of statistical method. Baltimore: Williams & Wilkins.

A Perspective on the History of Generalizability Theory

Robert L. Brennan
University of Iowa

What psychometric and scientific perspectives influenced the development of G theory? What practical testing problems gave impetus to its adoption? What work remains to be done?

Overviews of various parts of the history of generalizability (G) theory are provided elsewhere. An indispensable starting point is the preface and parts of the first chapter of Cronbach, Gleser, Nanda, and Rajaratnam (1972) entitled The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. The Cronbach et al. monograph is still the most definitive treatment of G theory. Shavelson and Webb (1981) review the G theory literature from 1973-1980, and Shavelson, Webb, and Rowley (1989) cover additional contributions in the 1980s. A very brief historical overview is provided by Brennan (1983, 1992a, pp. 1-2). In addition, Cronbach (1976, 1989, 1991) offers numerous perspectives on G theory and its history. Cronbach (1991) is particularly rich with first-person reflections.

This historical overview is not intended to repeat everything already covered in published reviews, although a summary is provided. Parts of this article are based largely on my personal experience with G theory. Consequently, this article provides a somewhat idiosyncratic perspective on the history of G theory and what I perceive as unfinished work for the theory. Almost certainly, other reviewers would see the landscape somewhat differently.

Theory Development and Enabling Work

In discussing the genesis of G theory, Cronbach (1991) states:

In 1957 I obtained funds from theNational Institute of MentalHealth to produce, with Gleser's

Robert L. Brennan is Lindquist Professor of Educational Measurement and Director of the Iowa Testing Programs, University of Iowa, 334A Lindquist Center, Iowa City, IA 52242. His specializations are generalizability theory, equating, and scaling.

Educational Measurement: Issues and Practice