Reporting Diagnostic Scores in Educational Testing: Temptations, Pitfalls, and Some Solutions
TRANSCRIPT
Multivariate Behavioral Research, 45:553–573, 2010
Copyright © Educational Testing Service
ISSN: 0027-3171 print/1532-7906 online
DOI: 10.1080/00273171.2010.483382
Reporting Diagnostic Scores in Educational Testing: Temptations,
Pitfalls, and Some Solutions

Sandip Sinharay, Gautam Puhan, and Shelby J. Haberman
Educational Testing Service
Diagnostic scores are of increasing interest in educational testing due to their
potential remedial and instructional benefit. Naturally, the number of educational
tests that report diagnostic scores is on the rise, as is the number of research
publications on such scores. This article provides a critical evaluation of diagnostic
score reporting in educational testing. The existing methods for diagnostic score
reporting are discussed. A recent method (Haberman, 2008a) that examines if
diagnostic scores are worth reporting is reviewed. It is demonstrated, using results
from operational and simulated data, that diagnostic scores have to be based on a
sufficient number of items and have to be sufficiently distinct from each other to
be worth reporting and that several operationally reported subscores are actually
not worth reporting. Several recommendations are made for those interested in
reporting diagnostic scores for educational tests.
Diagnostic scores are of increasing interest in educational testing due to their
potential remedial and instructional benefit. According to the National Research
Council report “Knowing What Students Know” (2001), the target of assessment
is to provide particular information about an examinee’s knowledge, skills, and
abilities. Diagnostic scores are often perceived as providing such information.
Naturally, there is substantial pressure on testing programs to report
diagnostic scores both at the individual examinee level and at aggregate levels
(such as at the level of institutions or states).
Despite this demand and the apparent usefulness of subscores, certain impor-
tant factors must be considered before diagnostic scores can be considered
Correspondence concerning this article should be addressed to Sandip Sinharay, Educational
Testing Service, Rosedale Road, P159B, Princeton, NJ 08541. E-mail: [email protected]
useful. According to Haberman (2008a), diagnostic scores may be considered
of “added value” only when they provide a more accurate measure of the
construct being measured than is provided by the total score. Similarly, Tate
(2004) has emphasized the importance of ensuring reasonable diagnostic score
performance in terms of high reliability and validity to minimize incorrect
instructional and remediation decisions. As is evident, there appears to be a conflict
between the demand for diagnostic scores and the need to exercise caution when
reporting diagnostic scores, especially if they do not provide any additional
information beyond what is provided by the total score or when they are less
reliable. Through this article, we intend to raise general consciousness regarding
potential psychometric issues surrounding the reporting of diagnostic subscores
in educational testing. This article starts by showing examples of diagnostic
scores reported by operational testing programs. Then a brief discussion of exist-
ing psychometric methods for reporting diagnostic scores is provided, followed
by a closer examination of some existing diagnostic scores and their usefulness.
Then a brief review of a method proposed by Haberman (2008a) that examines
if subscores (which are the simplest form of diagnostic scores and are reported
by several testing programs) have added value over the total score is provided.
Using results from several operational and simulated data sets, it is demonstrated
that it is not straightforward to have diagnostic scores with added value and that
factors such as scale length, reliability, and correlation among diagnostic scores
and their interactions with each other often determine when diagnostic scores
can be of added value. Finally, some recommendations are made for researchers
and practitioners interested in the issue of diagnostic score reporting.
WHAT ARE DIAGNOSTIC SCORES?
Figures 1 and 2 show the top and bottom parts of the score report of an imaginary
examinee (Mary D. Poppins) on two Praxis Series™ assessments (Educational
Testing Service, 2008)—Mathematics: Content Knowledge and Principles of
Learning and Teaching: Grades 7–12. Figure 1 shows the scaled scores on the
two assessments obtained by Mary and Figure 2 shows the raw scores earned
by Mary in the different score categories on the two assessments. The figure
also shows the raw points available in the categories and the range of scores
obtained by the middle 50% of a group of examinees of appropriate education
level. Figure 2 represents a typical diagnostic score report for examinees—the
scores on the categories are the diagnostic scores. The common perception is that
(a) such a report provides trustworthy information about the examinee’s strengths
and weaknesses, and (b) the examinee will work harder on the categories on
which she performed poorly (e.g., on “algebra and number theory,” on which
Mary scored less than the lower bound of the range) and improve on those areas.
FIGURE 1 Top part of an operational score report for an examinee.
Figure 3 is a typical diagnostic score report at an aggregate level. It shows,
for the Praxis Series™ Pre-Professional Skills Tests (PPST) Writing Assessment
(Educational Testing Service, 2008), the average percentage correct score of an
institution on four categories of the assessment, the corresponding averages
for the state the institution belongs to, and the corresponding averages for the
whole nation. Here, the common perception is that (a) such a report provides
trustworthy information about the overall strengths and weaknesses of examinees
from the institution, and (b) if an institution finds that its examinees performed
poorly on a score category compared with the state or the nation, a remedial
and instructional workshop can be given to the examinees to improve their
performance on the category. The categories usually correspond to the content
areas in the test (see Figure 2).
SUBSCORES, AUGMENTED SUBSCORES, AND
OBJECTIVE PERFORMANCE INDEX
The diagnostic score report shown in Figure 2 is based on raw scores on different
categories. These are also referred to as subscores. The score report in Figure 3 is
based on the average of subscores. Subscores are the simplest (e.g., raw number
correct) possible diagnostic scores and are used by several testing programs such
as SAT®, ACT®, and LSAT.
FIGURE 2 Bottom part of an operational score report for an examinee.
FIGURE 3 An operational score report for an institution with hypothetical data.
Wainer et al. (2001) suggested an approach to increase the precision of a
subscore by borrowing information from the other subscores. Because subscores
are almost always found to correlate moderately or highly with each other, it is
reasonable to assume that, for example, the science subscore of a student has
some information about the math subscore of the same student. In this approach,
weights (or regression coefficients) are assigned to each of the subscores and
an examinee’s “augmented” score on a particular subscale (e.g., math) would
be a function of that examinee’s ability on math and that person’s ability on
the remaining subscales (e.g., science, reading, etc.). The subscales that have
the strongest correlation with the math subscale have larger weights and thus
provide more information on the “augmented” math subscore. It has been shown
in previous research (e.g., Puhan, Sinharay, Haberman, & Larkin, 2008; Wainer
et al., 2001) that augmented subscores are often found to be of added value and
are more reliable than subscores.
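To make the augmentation idea concrete, the following is a minimal sketch of a Kelley-type implementation under CTT assumptions (measurement errors independent across subscores). It is our illustration of the idea, not the exact procedure of Wainer et al. (2001), and the function and variable names are ours.

    import numpy as np

    def augmented_subscores(scores, reliabilities):
        """Estimate each true subscore from ALL observed subscores by linear
        regression, in the spirit of Wainer et al. (2001).

        scores        : (n_examinees, k) array of observed subscores
        reliabilities : length-k array of subscore reliabilities
        """
        scores = np.asarray(scores, dtype=float)
        alpha = np.asarray(reliabilities, dtype=float)
        mean = scores.mean(axis=0)
        cov_obs = np.cov(scores, rowvar=False)   # observed covariance matrix
        k = scores.shape[1]
        B = np.empty((k, k))
        for j in range(k):
            # Covariance of the observed subscores with TRUE subscore j:
            # off-diagonal observed covariances carry over (independent errors);
            # the j-th entry shrinks by the reliability of subscore j.
            target = cov_obs[:, j].copy()
            target[j] = alpha[j] * cov_obs[j, j]
            B[:, j] = np.linalg.solve(cov_obs, target)   # regression weights
        return mean + (scores - mean) @ B

As in the description above, the subscales that correlate most strongly with the target subscale receive the largest regression weights and so contribute the most information to the augmented subscore.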
The objective performance index (OPI; Yen, 1987) is another approach to
enhance a subscore by borrowing information from other parts of the test. This ap-
proach uses a combination of item response theory (IRT) and Bayesian method-
ology. OPI is a weighted average of two estimates of performance: (a) the
observed subscore and (b) an estimate, obtained using a unidimensional IRT
model, of the subscore based on the examinee’s overall test performance. If the
observed and estimated subscores differ significantly, then the OPI is defined as
the observed subscore expressed as a percentage. It should be noted that this
approach, because of the use of a unidimensional IRT model, may not provide
accurate results when the data are truly multidimensional. Ironically, that is when
subscores can be expected to have added value.
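The exact weights and the significance test in Yen (1987) come from a fitted IRT model and Bayesian reasoning, which are not reproduced here; the schematic below (ours) mirrors only the structure just described, with an assumed weight and a simple binomial check standing in for quantities that would come from the fitted model.

    import numpy as np

    def opi_schematic(observed_correct, n_items, irt_expected_pct, weight):
        """Skeleton of the OPI logic: a weighted average of (a) the observed
        percent-correct subscore and (b) a unidimensional-IRT estimate of the
        subscore; if the two disagree significantly, report (a) alone.
        `irt_expected_pct` and `weight` are placeholders for model outputs.
        """
        observed_pct = 100.0 * observed_correct / n_items
        p = irt_expected_pct / 100.0           # model-implied success probability
        z = (observed_correct - n_items * p) / np.sqrt(n_items * p * (1.0 - p))
        if abs(z) > 1.96:                      # observed and estimated subscores differ
            return observed_pct                # fall back to the observed subscore
        return weight * observed_pct + (1.0 - weight) * irt_expected_pct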
MODEL-BASED APPROACHES FOR DIAGNOSTIC
SCORE REPORTING
Researchers have suggested several model-based approaches for diagnostic score
reporting. These models assume that (a) solving each test item requires one or
more skills, (b) each examinee has a latent ability parameter corresponding
to each of the skills, and (c) the probability that an examinee will answer an
item correctly is a mathematical function of the skills the item requires and
the latent ability parameters of the examinee. Several of these models, mostly
those that assume that the examinee ability parameters are discrete, fall under
the family of cognitive diagnosis models (CDM; Fu & Li, 2007). CDMs are
alternatively known as diagnostic classification models (DCM; Rupp & Templin,
2008) or models for cognitive diagnosis (von Davier, DiBello, & Yamamoto,
2008). After such a model is fitted to a data set, the diagnostic scores are the
estimated values of the ability parameters. These models include the rule space
model (Tatsuoka, 1983, 2009), the attribute hierarchy method (Leighton, Gierl,
& Hunka, 2004), the fusion model (Roussos et al., 2007), the Deterministic Inputs,
Noisy And (DINA) and Noisy Inputs, Deterministic And (NIDA) gate models
(Junker & Sijtsma, 2001), the multiple classification latent class model (Maris,
1999), the general diagnostic model (von Davier, 2008), the unified model
(DiBello, Stout, & Roussos, 1995), the reparameterized unified model (Roussos
et al., 2007), the Bayesian inference networks (e.g., Almond, DiBello, Moulder,
and Zapata-Rivera, 2007), the higher order latent-trait model (de la Torre &
Douglas, 2004), the Deterministic Inputs, Noisy Or (DINO) and Noisy Inputs,
Deterministic Or (NIDO) gate models (see, e.g., Rupp & Templin, 2008), and
the multicomponent latent trait model (e.g., Embretson, 1997). See, for example,
DiBello, Roussos, and Stout (2007), Fu and Li (2007), Rupp and Templin (2008),
and von Davier et al. (2008) for excellent reviews of models that can be used for
diagnostic score reporting. De la Torre and Patz (2005), Haberman and Sinharay
(in press), and Yao and Boughton (2007) examined reporting of diagnostic scores
using multidimensional IRT models (MIRT; e.g., Reckase, 1997); the first two of
these papers found that there is not much of a difference between MIRT-based
diagnostic scores and Classical Test Theory (CTT)-based augmented subscores.
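To make assumptions (a) through (c) concrete, here is the item response function of one of the simplest CDMs listed above, the DINA gate model (Junker & Sijtsma, 2001); the slip and guess values in the usage lines are purely illustrative.

    import numpy as np

    def dina_prob_correct(skills, q_row, guess, slip):
        """DINA item response function: the examinee answers correctly with
        probability 1 - slip if he or she has mastered every skill the item
        requires (per the item's Q-matrix row), and with probability guess
        otherwise.

        skills : 0/1 vector of mastered skills (discrete ability parameters)
        q_row  : 0/1 vector of the skills this item requires
        """
        skills, q_row = np.asarray(skills), np.asarray(q_row)
        has_all_required = bool(np.all(skills >= q_row))
        return 1.0 - slip if has_all_required else guess

    # An examinee who has mastered skills 1 and 3 but not skill 2, on an item
    # requiring skills 1 and 2, answers correctly only by guessing:
    p = dina_prob_correct([1, 0, 1], [1, 1, 0], guess=0.2, slip=0.1)  # p = 0.2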
HOW GOOD ARE THE EXISTING DIAGNOSTIC
SCORES? A CLOSER LOOK
It is not uncommon to observe decent reliabilities of diagnostic scores on
personality inventories designed to measure specific personality traits such as
anxiety, hostility, trust, and so on. For example, Goldberg (1999) reported that
the reliabilities of the 30 subscale scores (each subscale consisting of 8 items)
in the revised Neuroticism, Extraversion, Openness to Experience Personality
Inventory (NEO PI-R) ranged from 0.61 to 0.85 (M = 0.75). Considering
the relatively small number of items in the subscales, these reliabilities seem
reasonably high for diagnostic use. One possible reason for this could be that
personality inventories are often designed to measure fewer but much more
focused traits (e.g., aggression, gregariousness, etc.) and are therefore more
reliable. The diagnostic scores found in educational measurement literature and
operational practice often consist of only a few items. It is not uncommon to find
diagnostic scores on, for example, 10 skills based on only about 20–30 items.
Such scores are most often outcomes of retrofitting (reporting of diagnostic
scores from tests that were designed to measure only one overall skill). These
scores are usually provided to comply with clients’ requests for more diagnostic
information on examinees without an increase in test length. Because these tests
have been constructed specifically to measure a single construct, little reason
exists to expect useful diagnostic scores. In addition, it is often ignored that
a diagnostic score in educational measurement refers to a domain area that is
usually much broader than those covered by diagnostic scores in personality
inventories. Consider a certification test for prospective teachers of elementary
education that measures content knowledge in domain areas such as math,
science, and reading. It is often observed that, within each domain area, not
all items measure a single ability. For example, the math domain may have
numerous smaller subdomains such as problem solving; representation; number
sense; numeration; algebra; geometry; data organization using pie charts, bar
graphs, and so on. Therefore a much larger number of items in each subarea
will typically be required to measure the math domain reliably.
Further, few checks are done to make sure that the reported diagnostic
scores have decent psychometric properties. Even though Standards 2.1 and
6.5 of the Standards for Educational and Psychological Testing (American Ed-
ucational Research Association, American Psychological Association, National
Council on Measurement in Education, 1999) demand proof of adequate reliabil-
ity of any reported scores, reliability of diagnostic scores is often not reported so
that examinees and users of test results are not informed of the degree to which
confidence can be placed in skill classifications. In addition, the reliability figures
in some applications of psychometric models for diagnostic score reporting are
based on simulated data (see, e.g., Roussos et al., 2007, pp. 304–305) rather than
on empirical data. It may be unwise to report diagnostic information for a test
unless there is clear evidence that reliable skill classifications can be obtained
from the test data.
There is a lack of studies demonstrating the validity of diagnostic scores.
For example, there is little evidence showing that diagnostic information is
related to other external criteria. It is difficult to have much confidence in
any diagnostic information whose validity has not been established. Haberman
(2008b) demonstrated via theoretical derivations that the validity of subscores is
limited when the subscores are either not reliable or are highly correlated with
total scores.
As Sinharay and Haberman (2008b) explained, the data may not provide
information as fine-grained as suggested by the cognitive theory or as hoped
by the testing practitioner. A theory of response processes based on cognitive
psychology may suggest several skills. But a test includes a limited number of
items and may not provide enough information about all
of these skills. Thissen-Roe, Hunt, and Minstrell (2004) have shown in the area
of physics education that out of a large number of hypothesized misconceptions
in student learners, only a very few misconceptions could be found empirically.
Another example is the iSkillsTM test (e.g., Katz, Attali, Rijmen, & Williamson,
2008), for which an expert committee identified seven performance areas that
they thought constituted Information and Communications Technology (ICT)
literacy skill, but a factor analysis of the data revealed only one factor and the
confirmatory factor models in which the factors corresponded to performance
areas or a combination thereof did not fit the data at all (Katz et al., 2008). As a
result, only an overall ICT literacy score is reported for the test. The basic issue
is that the data may not support the conjectures made by content experts about
how examinees behave. Hence, an investigator attempting to report diagnostic
information has to make an informed judgment on how much evidence the data
can reliably provide and report only that much information.
To demonstrate the problems with subscores consisting of few items, we used
actual test data from a licensure test that is designed for prospective teachers of
children in primary through upper elementary school grades. The 120 multiple-
choice questions focus on four major subject areas: language arts/reading, math-
ematics, social studies, and science. There are 30 questions per area and a
subscore is reported for each of these areas. The subscore reliabilities are
between 0.71 and 0.83. We ranked the questions on mathematics and science
separately in the order of difficulty (proportion correct) and then created a form
A that consists of the questions ranked 1, 4, 7, …, 28 in mathematics and the
questions ranked 1, 4, 7, …, 28 in science. Similarly, we created a form B with
questions ranked 2, 5, 8, …, 29 in mathematics and in science and a form C
with the remaining questions. Forms A, B, and C can be considered roughly
parallel forms and, by construction, all of the several thousand examinees took
all three of these forms. The subscore reliabilities on these forms range between
0.46 and 0.60. We identified all the 271 examinees who obtained a subscore
of 7 on mathematics and 3 on science on Form A. Such examinees will most
likely be thought to be strong on mathematics and weak on science and given
additional science lessons. Is that justified? We examined the mathematics and
science subscores of the same examinees on Forms B and C. Table 1 gives a
cross-tabulation of the subscores on form B of such examinees. It shows that
the subscores on Form B often lead to different conclusions; for example, the
science subscore is higher than the mathematics subscore for several of these
examinees. The percentage of the 271 examinees with a mathematics subscore
of 5 or lower is 34 and 39, respectively, for Forms B and C. The percentage
with a science subscore of 6 or higher is 39 and 32, respectively, for Forms
B and C. The percentage of examinees whose mathematics score is higher
than their science score is only 59 and 66, respectively, on Forms B and C.
This simple example demonstrates that remedial and instructional decisions based
on short (and hence less reliable) subscores will often be wrong. Note that if we
had identified examinees whose mathematics and science scores were closer
(e.g., 6 in mathematics and 4 in science), then the remedial and instructional
TABLE 1
Cross-Tabulations on Form B of the 271 Examinees With Mathematics
Subscore of 7 and Science Subscore of 3 on Form A

                            Science Subscore
Math
Subscore    0    1    2    3    4    5    6    7    8    9   Total
1           0    0    0    1    1    1    0    0    0    0      3
2           0    0    0    1    3    1    2    0    0    0      7
3           1    0    0    1    2    4    3    1    1    1     14
4           0    0    2    7    7    6    4    0    1    0     27
5           0    1    1    1   10   14    8    5    1    1     42
6           2    0    1    5   10   11   15    8    1    1     54
7           0    1    4    4    4   11   10    7    4    0     45
8           0    1    1    5   12   13    7    5    4    0     48
9           0    0    1    1    6    3    7    4    3    0     25
10          0    0    0    1    1    2    1    1    0    0      6
Total       3    3   10   27   56   66   57   31   15    3    271
decisions based on the subscores would be even more inaccurate. In addition,
the correlation of the science subscore on Form A with the science subscore on
Form B is 0.48, whereas the correlation of the science subscore on Form A with
the total score on Form B is 0.63; thus the total score on a form is a better
predictor of the science subscore on a parallel form than is the science subscore
itself.
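For readers who wish to repeat this kind of check on their own data, here is a minimal sketch of the form-splitting step described above (our code; the 0/1 response-matrix layout is an assumption of the sketch):

    import numpy as np

    def split_into_parallel_forms(responses):
        """Split one 30-item subtest into three roughly parallel 10-item forms
        by stratifying on item difficulty, as in the demonstration above.

        responses : (n_examinees, n_items) 0/1 matrix for one content area
        Returns the examinees' subscores on Forms A, B, and C.
        """
        difficulty = responses.mean(axis=0)      # proportion correct per item
        order = np.argsort(-difficulty)          # items ranked by difficulty
        # ranks 1, 4, 7, ... to Form A; 2, 5, 8, ... to Form B; rest to Form C
        forms = [order[0::3], order[1::3], order[2::3]]
        return [responses[:, f].sum(axis=1) for f in forms]

One can then select, say, the examinees scoring 7 in mathematics and 3 in science on Form A and cross-tabulate their Form B subscores, as in Table 1.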
The aforementioned description shows that diagnostic scores based on a few
items may have poor psychometric quality. Unfortunately, researchers and prac-
titioners often do not pay adequate attention to the quality of diagnostic scores.
The next section describes a method that can be used to examine the quality of
subscores. The method was suggested by Haberman (2008a) to determine when
it is justified to report subscores in addition to the total score.
A METHOD BASED ON CLASSICAL TEST THEORY TO
EXAMINE IF SUBSCORES HAVE ADDED VALUE
GIVEN THE TOTAL SCORE
Let us denote the subscore and the total score of an examinee as $s$ and $x$,
respectively. Haberman (2008a), taking a classical test theory (CTT) viewpoint,
assumed that a reported subscore is intended to be an estimate of the true
subscore, $s_t$, and considered the following estimates of the true subscore:

• An estimate $s_s = \bar{s} + \alpha(s - \bar{s})$ based on the observed subscore, where $\bar{s}$ is
the average subscore for the sample of examinees and $\alpha$ is the reliability
of the subscore.
• An estimate $s_x = \bar{s} + c(x - \bar{x})$ based on the observed total score, where
$\bar{x}$ is the average total score and $c$ is a constant that depends on the
reliabilities and standard deviations of the subscore and the total score
and the correlations between the subscores.
• An estimate $s_{sx} = \bar{s} + a(s - \bar{s}) + b(x - \bar{x})$ that is a weighted average
of the observed subscore and the observed total score, where $a$ and $b$ are
constants that depend on the reliabilities and standard deviations of the
subscore and the total score and the correlations between the subscores.
It is also possible to consider the augmented subscore (Wainer et al., 2001) that
was discussed earlier as an estimate of the true subscore. The estimate ssx is a
special case of the augmented subscore. However, augmented subscores yield
results that are very similar to those for ssx (e.g., Sinharay, in press) and hence
are not considered here.
To compare the performances of $s_s$, $s_x$, and $s_{sx}$ as estimates of $s_t$, Haberman
(2008a) suggested the use of the proportional reduction in mean squared error
or PRMSE. (Computational details can be found in Haberman, 2008a; Sinharay,
Haberman, & Puhan, 2007). The larger the PRMSE, the more accurate is the
estimate.1 This article denotes the PRMSE for $s_s$, $s_x$, and $s_{sx}$ as $\mathrm{PRMSE}_s$,
$\mathrm{PRMSE}_x$, and $\mathrm{PRMSE}_{sx}$, respectively. The quantity $\mathrm{PRMSE}_s$ can be shown to be
exactly equal to the reliability of the subscore. Haberman (2008a) recommended
the following strategies to decide whether a subscore or a weighted average has
added value:

• If $\mathrm{PRMSE}_s$ is less than $\mathrm{PRMSE}_x$, declare that the subscore “does not
provide added value over the total score” because the observed total score
will provide more accurate diagnostic information (in the form of a lower
mean squared error in estimating the true subscore) than the observed
subscore in that case. Thus, from a statistical perspective, added value is
assessed in terms of a larger PRMSE for the subscore. Sinharay et al.
(2007) discussed why this strategy is reasonable and how it ensures that
a subscore satisfies professional standards. Conceptually, if the subscore
is highly correlated with the total score (i.e., the subscore and the total
score measure the same basic underlying skill or skills), then the subscore
does not provide any added value (information) beyond what is already
provided by the total score.
• The quantity $\mathrm{PRMSE}_{sx}$ will always be at least as large as $\mathrm{PRMSE}_s$ and
$\mathrm{PRMSE}_x$. However, $s_{sx}$ requires a bit more computation than does either
$s_s$ or $s_x$. Hence, declare that a weighted average has added value only if
$\mathrm{PRMSE}_{sx}$ is substantially larger than both $\mathrm{PRMSE}_s$ and $\mathrm{PRMSE}_x$.
If neither the subscore nor the weighted average has added value, diagnostic
information should not be reported for the test.
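Under the standard CTT assumptions (measurement errors uncorrelated with true scores and with one another), the first of these comparisons can be written compactly; the display below is our summary of the logic just described rather than a formula quoted from Haberman (2008a):

\[
\mathrm{PRMSE}_s = \mathrm{Corr}^2(s, s_t) = \alpha_s,
\qquad
\mathrm{PRMSE}_x = \mathrm{Corr}^2(x, s_t) = \rho^2(s_t, x_t)\,\alpha_x,
\]

where $\rho(s_t, x_t)$ is the disattenuated correlation between the subscore and the total score and $\alpha_x$ is the reliability of the total score. The subscore thus has added value exactly when $\alpha_s > \rho^2(s_t, x_t)\,\alpha_x$; that is, the subscore must be reliable and its true score must be sufficiently distinct from the true total score.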
The computations for application of the method of Haberman (2008a) are
simple and involve only the sample variances, correlations, and reliabilities of
the total score and the subscores; a computational sketch is given below. The
computations of the PRMSEs involve the disattenuated correlations2 among the
subscores. For some data sets, it is possible for the disattenuated correlation
between subscores to be larger than 1 and, as a result, for computed PRMSE
values to be larger than 1. In such cases, it is concluded that neither subscores
nor weighted averages have any added value.
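The sketch below (our code) carries out these computations for a total score defined as the sum of the subscores, assuming measurement errors uncorrelated with true scores and with one another; only the subscore covariance matrix and the reliabilities are needed.

    import numpy as np

    def prmse_report(cov_sub, reliabilities):
        """For each subscore, compute PRMSE_s, PRMSE_x, and PRMSE_sx following
        the logic of Haberman (2008a) as summarized in the text.

        cov_sub       : (k, k) observed covariance matrix of the subscores
        reliabilities : length-k vector of subscore reliabilities
        """
        cov_sub = np.asarray(cov_sub, dtype=float)
        alpha = np.asarray(reliabilities, dtype=float)
        var_x = cov_sub.sum()                    # Var(total) = sum of all cells
        for j in range(cov_sub.shape[0]):
            var_s = cov_sub[j, j]
            var_st = alpha[j] * var_s            # true-subscore variance
            cov_s_x = cov_sub[:, j].sum()        # Cov(s_j, x)
            cov_st_x = cov_s_x - var_s + var_st  # errors drop out of Cov(s_t, x)
            prmse_s = alpha[j]                   # equals the subscore reliability
            prmse_x = cov_st_x ** 2 / (var_x * var_st)
            # PRMSE_sx: variance of the true subscore explained by the best
            # linear combination of the observed subscore and total score
            sigma = np.array([[var_s, cov_s_x], [cov_s_x, var_x]])
            c = np.array([var_st, cov_st_x])
            prmse_sx = c @ np.linalg.solve(sigma, c) / var_st
            print(f"subscore {j}: PRMSE_s={prmse_s:.3f}, "
                  f"PRMSE_x={prmse_x:.3f}, PRMSE_sx={prmse_sx:.3f}")

A subscore is then declared to have added value when its PRMSE_s exceeds PRMSE_x, and a weighted average when PRMSE_sx exceeds both by a meaningful margin.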
Haberman (2008a) and Sinharay et al. (2007) explained that a subscore is
more likely to have added value when it has high reliability and it is distinct
from the other subscores.

1A larger PRMSE is equivalent to a smaller mean squared error in estimating the true subscore
and hence is desirable.
2The disattenuated correlation between two subscores is equal to the simple correlation between
them divided by the square root of the product of their reliabilities: $r_{12}^{\mathrm{disatt}} = r_{12}/\sqrt{\alpha_1 \alpha_2}$.
Haberman, Sinharay, and Puhan (2009) extended the method of Haberman
(2008a) to aggregate-level subscores. Because several
large-scale examinations such as the SAT Reasoning test, American College
Testing (ACT), and Law School Admission Test (LSAT) report overall scores
using CTT, the methods of Haberman (2008a) and Haberman et al. (2009) will
have wide applicability.
One could consider other methods, for example the method of fitting beta-
binomial models to the observed subscore distributions (Harris & Hanson, 1991)
or factor analysis (e.g., as in Ling, 2009; Stone, Ye, Zhu, & Lane, 2010; Wainer
et al., 2001), to determine whether a subscore has added value. However, the
method of Harris and Hanson (1991) involves significance testing with a $\chi^2$
statistic whose null distribution is not well established (p. 5), and factor analysis
involves several issues such as determining whether to analyze at the item level or
at the item parcel level, determining whether to use exploratory or confirmatory
factor analysis, and determining which test statistics to use to find the number
of factors in the data, which complicate the process of determining whether a
subscore has added value. Wainer et al. (2001) recommended that a test can
be considered unidimensional (which implies that subscores do not have added
value) if the reliabilities of the augmented scores are close to the reliability of
the total test score. Another way to examine if subscores have added value is to
perform a statistical test of whether a multidimensional IRT model provides a
better fit than a unidimensional IRT model (see, e.g., von Davier, 2008, Table
12). Researchers like Ackerman and Shu (2009) used dimensionality assessment
programs such as DIMTEST (Stout, 1987) and DETECT (Zhang & Stout, 1999)
to determine the usefulness of subscores.
RESULTS FROM A SURVEY OF OPERATIONAL DATA
Sinharay (in press) performed an extensive survey regarding whether subscores
have added value (where value-addedness of a subscore was determined by the
method of Haberman, 2008a) for operational tests. The findings are summarized
in Table 2.
Each row in Table 2 shows, for a test, the number of subscores, the average
length of the subscores, the average reliability of the subscores, the average
correlation among the subscores, the average disattenuated correlation,3 the
number of subscores that have added value, and the number of weighted averages
that have added value (where the assumption was made that a weighted average
has added value if the corresponding $\mathrm{PRMSE}_{sx}$ is larger than the maximum of
$\mathrm{PRMSE}_s$ and $\mathrm{PRMSE}_x$ by 0.01 or more4).

3Note that although the table reports averages to summarize a lot of information in a
compendious manner, for some of these tests the lengths, reliabilities, and correlations of the
subscores are substantially unequal.
4Changing 0.01 to other small values such as 0.02 or 0.03 did not affect the conclusions much.
TABLE 2
Results From Analysis of Operational Data Sets

                                                No. of     Avg.    Avg.  Avg.   Avg. Corr.  Subscores With  Wtd. Avs. With
Name/Nature of the Test                         Subscores  Length  α     Corr.  (Disatt.)   Added Value     Added Value
P-ACTC Eng                                      2          25      0.79  0.76   0.95        None            None
P-ACTC Math                                     2          20      0.71  0.67   0.94        None            One
SAT Verbal                                      3          26      0.79  0.74   0.95        None            One
SAT Math                                        3          20      0.78  0.75   0.97        None            None
SAT                                             2          69      0.92  0.70   0.76        Two             Two
Praxis                                          4          25      0.72  0.56   0.78        Two             Four
DSTP Math (8th grade)                           4          19      0.77  0.77   1.00        None            None
State Reading (5th grade)                       2          37      0.72  0.65   0.92        None            One
SweSAT                                          5          24      0.78  0.55   0.71        Four            Five
MFT Business                                    7          17      0.56  0.47   0.85        None            Six
CA (for teachers in elementary schools)         4          30      0.74  0.59   0.79        One             Four
CB (for teachers of special ed. programs)       3          19      0.46  0.42   0.96        None            None
CC (for beginning teachers)                     4          19      0.38  0.44   1.00        None            None
CD (for teachers of social studies)             6          22      0.63  0.54   0.87        None            Six
CE (for teachers of Spanish)                    4          29      0.80  0.65   0.80        One             Two
CF (for principals and school leaders)          4          15      0.48  0.41   0.85        None            Four
CG (for teachers of mathematics)                3          16      0.62  0.59   0.95        None            None
CH (for paraprofessionals)                      3          24      0.85  0.76   0.89        None            Three
TA (measures cognitive and technical skills)    7          11      0.42  0.51   1.00        None            None
TB1 (tests mastery of a language)               2          44      0.85  0.77   0.90        One             Two
TB2 (tests mastery of a language)               2          43      0.90  0.68   0.75        Two             Two
TC1 (measures achievement in a discipline)      3          68      0.85  0.76   0.90        One             Three
TC2 (measures achievement in a discipline)      3          67      0.87  0.72   0.82        Two             Three
TD1 (measures school and individual
  student progress)                             4          15      0.70  0.73   0.98        None            None
TD2 (measures school and individual
  student progress)                             6          13      0.70  0.75   1.00        None            None

Note. The reliability is denoted as α. Weighted averages are denoted as “Wtd. Avs.” The first two tests were
discussed in Harris & Hanson (1991). The next four tests were discussed by Haberman (2008a). The next four
tests were discussed in Stone, Ye, Zhu, & Lane (2010), Ackerman & Shu (2009), Lyren (2009), and Ling (2009).
The eight tests denoted CA-CH are certification tests discussed in Puhan, Sinharay, Haberman, & Larkin (2008).
The last seven tests, denoted TA, TB1, …, TD2, were discussed in Sinharay & Haberman (2008a).
The reliability of the different scores and subscores was estimated using
Cronbach’s α. Most of the tests had only multiple-choice items. For these tests,
“length” refers to the number of items. Some tests, such as CF (see Table 2), had
constructed-response (CR) items. For a subscore involving CR items, “length”
refers to the total number of score categories minus the number of items (e.g.,
for a subscore with four items, each with score categories 0, 1, and 2, the length
is $4 \times 3 - 4 = 8$).
Figures 4–6 show, for the operational data sets, the percentage of subscores
(Figures 4 and 5) or weighted averages (Figure 6) that had added value. In each
of these figures, the y-axis corresponds to the average disattenuated correlation
among the subscores. In Figure 4, the x-axis denotes the average length of
the subscores, whereas in Figures 5 and 6, the x-axis denotes average subscore
reliability.
The three figures plot, for each row listed in Table 2, a number that is the same
as the percentage of subscores (or weighted averages) that have added value
at the point $(x, y)$, where $x$ is the corresponding average subscore reliability
multiplied by 100 (or length in Figure 4) and y is 100 times the average
disattenuated correlation. For example, the row for the SAT in Table 2 shows
an average length of 69, an average disattenuated correlation of 0.76, and
two subscores (that is, 100% of all subscores) that had added value. Hence
Figure 4 has the number 100 plotted at the point (69,76) at the bottom right
corner.
FIGURE 4 The percentage of subscores that have added value for different subscore length
and average disattenuated correlation for the operational data.
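A display of this kind is easy to reproduce; the sketch below (our code, with three illustrative rows taken from Table 2) plots each percentage as a text label at its (x, y) position.

    import matplotlib.pyplot as plt

    def added_value_plot(rows):
        """Plot, for each test, the percentage of subscores with added value
        as a text label at (x, y) = (average length, 100 * average
        disattenuated correlation), in the style of Figure 4.
        """
        fig, ax = plt.subplots()
        for x, y, pct in rows:
            ax.text(x, y, str(pct), ha="center", va="center")
        ax.set_xlabel("Average subscore length")
        ax.set_ylabel("Average disattenuated correlation x 100")
        ax.set_xlim(5, 75)
        ax.set_ylim(65, 105)
        plt.show()

    # From Table 2: SAT (length 69, disatt. corr. 0.76, 100% of subscores with
    # added value); P-ACTC Eng (25, 0.95, 0%); SweSAT (24, 0.71, 80%)
    added_value_plot([(69, 76, 100), (25, 95, 0), (24, 71, 80)])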
FIGURE 5 The percentage of subscores that have added value for different average
subscore reliability and average disattenuated correlation for the operational data.
Table 2 and Figures 4–6 show that several of the subscores, including some
operationally reported ones, do not have added value. They also show that in
general, long subscores (which have high reliability) tend to have added value.
For example, the tests for which one or more subscores have added value have at
least 24 items per subscore. Of these, for TC1 and TC2 (see Table 2), subscores
consist of 67–68 items on average. However, not all long subscores have
added value. For example, for the test TC1, which has an average subscore
length of 68, two of the three subscores do not have added value. In general,
tests with low average disattenuated correlation tended to have subscores with
added value. The average disattenuated correlation for tests for which one or
more subscores have added value is never above 0.90. However, not all tests
with average disattenuated correlation less than 0.90 have subscores with added
value. For
example, the average disattenuated correlation is 0.85 for Test CF, and none of
its subscores have added value.
Often, the percentage of subscores with a specific average length (or average
reliability) that have added value depends on the average disattenuated correlations.5
FIGURE 6 The percentage of weighted averages that have added value for different
average subscore reliability and average disattenuated correlation for the operational data.
In each of these figures, as one goes from the top left corner to the
bottom right corner (that is, as the average length/reliability increases and the
average disattenuated correlation decreases), the subscores show a greater
tendency to have added value. Figure 6 shows that the weighted averages have
added value for many of the operational data sets and that weighted averages
are much more likely to have added value compared with the subscores themselves.

5In other words, there is an interaction between the two factors: average length (or average
reliability) and average disattenuated correlation.
The following question naturally arises: Could the conclusions from the
operational data be different if a DCM (one of those discussed earlier) were
applied? Though this issue requires further research, the answer is most likely
negative. Haberman and Sinharay (in press) found that PRMSE of subscores
based on MIRT are very close to PRMSE of subscores based on CTT. They
also found in a study of several operational data sets that MIRT-based subscores
had added value if, and only if, the CTT-based subscores had added value.
DCMs are quite similar to MIRT models (e.g., both of them are latent variable
models) except that the examinee abilities are considered discrete in the former
and continuous in the latter. Hence it can be expected that DCM-based ability
estimates will have added value if, and only if, the CTT-based subscores have
added value. In addition, Henson, Templin, and Douglas (2007) showed that the
use of subscores resulted in only a modest reduction in correct classification
rates in comparison with a DCM—so DCM-based ability estimates seem to
lead to conclusions similar to the subscores. Hence, although the values of the
relevant numerical measures and the relationships between them in the DCM-
based methods will differ from the CTT-based method, the DCM-based methods
are also expected to require subtests that include a sufficient number of items
are as distinct as possible from the other subtests to be able to provide useful
feedback to the examinees.
It may be of interest to look at what the researchers using other methods have
found regarding the usefulness of subscores. Stone et al. (2010) reported, using
an exploratory factor analysis method, the presence of only one factor in the
aforementioned Spring 2006 assessment of the Delaware Student Testing Pro-
gram (DSTP) eighth-grade mathematics assessment. Harris and Hanson (1991)
found subscores to have little added value for the English and mathematics tests
from the P-ACTC examination. Wainer et al. (2001) found that the six subscores
in the American Production and Inventory Control Society certification examina-
tion were not useful because the test was extremely unidimensional, whereas
the four subscores of the performance assessment part of the North Carolina Test
of Computer Skills were useful. Ackerman and Shu (2009) found the subscores to be not
useful for the aforementioned fifth-grade end-of-grade assessment. So, it seems
that subscores in operational tests have more often been found to be not useful.
RECOMMENDATIONS AND CONCLUSIONS
This article demonstrated that even though there is an increasing interest in
reporting diagnostic scores, diagnostic scores often lack adequate psychometric
quality. Our recommendations on diagnostic score reporting are given here:
• As implied earlier, evidence of reliability and validity of the information
should be reported whenever diagnostic scores are provided.
• If a psychometric model is employed to report diagnostic scores, the burden
of proof lies on the person applying the model to demonstrate that the
model parameters can be reliably estimated, that the model approximates
the observed pattern of responses better than a simpler model (e.g., a
univariate IRT model), and that the diagnostic scores reported by the model
have added value over a simple subscore or over the score(s) reported by
a simple model. The simplest of the models passing these ordeals may be
operationally used for diagnostic scoring.
• Any reported diagnostic information, in order to be reliable, should be
based on a sufficient number of carefully constructed items. Combining
some subscores may result in subscores that have higher reliability and
hence added value (although it might make the definition of the subscore
broader). For example, subscores for “Physics: theory” and “Physics: ap-
plications” may be combined to yield one subscore for “Physics.” It is also
important to ensure that the skills of interest are as distinct as possible
from each other (though this is quite a difficult task before seeing the
data). Sinharay (in press) provides some ideas regarding the number of
items and the extent of distinctness needed.
• It is important to remember the advice of Luecht, Gierl, Tan, and Huff
(2006) that “inherently unidimensional item and test information cannot be
decomposed to produce useful multidimensional score profiles-no matter
how well intentioned or which psychometric model is used to extract the
information” and that we should not “try to extract something that is
not there” (p. 6). Thus, for some tests, where the diagnostic scores are
not distinct from each other (i.e., they are highly correlated), it will not
be possible to report diagnostic scores, regardless of the method used.
Changing the structure of such tests, for example, using sound assessment
engineering practices for item and test design (Luecht et al., 2006) may
be the only option in order to be able to report diagnostic scores. If
restructuring the test is not a reasonable option, then, instead of diagnostic
score reporting, one can consider alternatives such as scale anchoring
(e.g., Beaton & Allen, 1992), which makes claims about what students
at different scale points know and can do, and item mapping (e.g., Zwick,
Senturk, Wang, & Loomis, 2001), which refers to the use of exemplar
items to characterize particular score points.
• Weighted averages are often found to have added value. This finding
should come as good news to practitioners interested in reporting diagnostic
scores. Weighted averages may be difficult to explain to the general public,
who may not like the idea that, for example, a reported reading subscore is
based not only on the observed reading subscore but also on the observed
writing subscore. Several approaches to the problem of explanation can be
considered. One is that the weighted average better estimates the examinee
proficiency in the content domain represented by the subscore than does
the subscore itself. This result can be discussed in terms of prediction
of performance on an alternative test. The issue can also be discussed in
terms of common cases in which information is customarily combined. For
example, premiums for automobile insurance reflect not just the driving ex-
perience of the policyholder but also related information (such as education
and marital status) that predicts future driving performance. However, this
difficulty in explaining the results is more than compensated for by the higher
PRMSE (that is, more precision) of the weighted average. Note that if a
test has only a few short subscores, a weighted average may have added
value but should not be reported because its PRMSE, although substantially
larger than $\mathrm{PRMSE}_s$ and $\mathrm{PRMSE}_x$, may still not be sufficiently high.
• Diagnostic scores must be reported on an established scale. A temptation
may exist to make this scale comparable to the scale for the total score or
to the fraction of the scale that corresponds to the relative importance of
the diagnostic score, but these choices are not without difficulties given that
diagnostic scores and total scores typically differ in reliability. In addition,
reported diagnostic scores should be properly equated so that the definition
of a strong performance in a subject area does not change between different
administrations of a test. In typical cases, equating is feasible for the total
score but not for diagnostic scores (e.g., if an anchor test is used to equate
the total test, only a few of the items will correspond to a particular subtest
so that an anchor test equating of the corresponding diagnostic score is not
feasible). Some work on equating of subscores has been done by Puhan
and Liang (in press).
ACKNOWLEDGMENT
Any opinions expressed in this article are those of the authors and not necessarily
of Educational Testing Service.
REFERENCES
Ackerman, T., & Shu, Z. (2009, April). Using confirmatory MIRT modeling to provide diagnostic
information in large scale assessment. Paper presented at the annual meeting of the National
Council of Measurement in Education, San Diego, CA.
Almond, R. G., DiBello, L. V., Moulder, B., & Zapata-Rivera, J. (2007). Modeling diagnostic
assessment with Bayesian networks. Journal of Educational Measurement, 44, 341–359.
American Educational Research Association (AERA), American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and psychological
testing. Washington, DC: AERA.
Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of
Educational Statistics, 17, 191–204.
de la Torre, J., & Douglas, J. (2004). Higher-order latent trait models for cognitive diagnosis.
Psychometrika, 69, 333–353.
de la Torre, J., & Patz, R. J. (2005). Making the most of what we have: A practical application
of multidimensional IRT in test scoring. Journal of Educational and Behavioral Statistics, 30,
295–311.
DiBello, L. V., Roussos, L., & Stout, W. F. (2007). Review of cognitive diagnostic assessment and
a summary of psychometric models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics,
Volume 26 (pp. 45–79). Amsterdam: Elsevier Science B.V.
DiBello, L. V., Stout, W. F., & Roussos, L. A. (1995). Unified cognitive/psychometric diagnostic
assessment likelihood-based classification techniques. In P. D. Nichols, S. F. Chipman, & R. L.
Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–389). Hillsdale, NJ: Erlbaum.
Educational Testing Service. (2008). Praxis™ 2008–09 information bulletin. Princeton, NJ: Author.
Embretson, S. E. (1997). Multicomponent latent trait models. In W. van der Linden & R. Hambleton
(Eds.), Handbook of modern item response theory (pp. 305–322). New York: Springer-Verlag.
Fu, J., & Li, Y. (2007, April). Cognitively diagnostic psychometric models: An integrative review.
Paper presented at the annual meeting of the National Council on Measurement in Education,
Chicago, IL.
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the
lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F.
Ostendorf (Eds.), Personality Psychology in Europe (Vol. 7, pp. 7–28). Tilburg, The Netherlands:
Tilburg University Press.
Haberman, S. J. (2008a). When can subscores have value? Journal of Educational and Behavioral
Statistics, 33, 204–229.
Haberman, S. J. (2008b). Subscores and validity (ETS Research Report No. RR-08-64). Princeton,
NJ: Educational Testing Service.
Haberman, S. J., & Sinharay, S. (in press). Reporting of subscores using multidimensional item
response theory. Psychometrika.
Haberman, S. J., Sinharay, S., & Puhan, G. (2009). Reporting subscores for institutions. British
Journal of Mathematical and Statistical Psychology, 62, 79–95.
Harris, D. J., & Hanson, B. A. (1991, April). Methods of examining the usefulness of subscores. Paper
presented at the annual meeting of National Council on Measurement in Education, Chicago, IL.
Henson, R., Templin, J., & Douglas, J. (2007). Using efficient model based sum-scores for conducting
skills diagnoses. Journal of Educational Measurement, 44, 361–376.
Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and
connections with nonparametric item response theory. Applied Psychological Measurement, 25,
258–272.
Katz, I. R., Attali, Y., Rijmen, F., & Williamson, D. M. (2008, April). ETS’s iSkillsTM assessment:
Measurement of information and communication technology literacy. Paper presented at the
twenty-third annual conference of the Society for Industrial and Organizational Psychology, San
Francisco, CA.
Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for cognitive
assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement,
41, 205–237.
Ling, G. (2009, April). Why the major field (business) test does not report subscores of individual
testtakers—reliability and construct validity evidence. Paper presented at the annual meeting of
the National Council of Measurement in Education, San Diego, CA.
Luecht, R. M., Gierl, M. J., Tan, X., & Huff, K. (2006, April). Scalability and the development
of useful diagnostic scales. Paper presented at the annual meeting of the National Council on
Measurement in Education, San Francisco, CA.
Lyren, P. (2009). Reporting subscores from college admission tests. Practical Assessment, Research,
and Evaluation, 14, 1–10.
Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64, 187–212.
National Research Council. (2001). Knowing what students know: The science and design of
educational assessment. Washington, DC: The National Academies Press.
Puhan, G., & Liang, L. (in press). Equating subscores under the non equivalent anchor test (NEAT)
design (ETS Research Report). Princeton, NJ: Educational Testing Service.
Puhan, G., Sinharay, S., Haberman, S. J., & Larkin, K. (2008). Comparison of subscores based on
classical test theory (ETS Research Report No. RR-08-54). Princeton, NJ: Educational Testing
Service.
Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied
Psychological Measurement, 21, 25–36.
Roussos, L. A., DiBello, L. V., Stout, W. F., Hartz, S. M., Henson, R. A., & Templin, J. L. (2007).
The fusion model skills diagnostic system. In J. Leighton & M. Gierl (Eds.), Cognitive diagnostic
assessment for education: Theory and applications (pp. 275–318). Cambridge, UK: Cambridge
University Press.
Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification models: A
comprehensive review of the current state-of-the-art. Measurement, 6, 219–262.
Sinharay, S. (in press). How often do subscores have added value? Results from operational and
simulated data. Journal of Educational Measurement.
Sinharay, S., & Haberman, S. J. (2008a). Reporting subscores: A survey (ETS Research Memoran-
dum No. RR-08-18). Princeton, NJ: Educational Testing Service.
Sinharay, S., & Haberman, S. J. (2008b). How much can we reliably know about what students
know? Measurement: Interdisciplinary Research and Perspectives, 6, 46–49.
Sinharay, S., Haberman, S. J., & Puhan, G. (2007). Subscores based on classical test theory: To
report or not to report. Educational Measurement: Issues and Practice, 26(4), 21–28.
Stone, C. A., Ye, F., Zhu, X., & Lane, S. (2010). Providing subscale scores for diagnostic information:
A case study when the test is essentially unidimensional. Applied Measurement in Education,
23(1), 63–86.
Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psycho-
metrika, 52, 589–617.
Tate, R. L. (2004). Implications of multidimensionality for total score and subscore performance.
Applied Measurement in Education, 17, 89–112.
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item
response theory. Journal of Educational Measurement, 20, 345–354.
Tatsuoka, K. K. (2009). Cognitive assessment: An introduction to the rule space method. New York:
Routledge Academic.
Thissen-Roe, A., Hunt, E., & Minstrell, J. (2004). The DIAGNOSER project: Combining assessment
and learning. Behavior Research Methods, Instruments, & Computers, 36, 234–240.
von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal
of Mathematical and Statistical Psychology, 61, 287–307.
von Davier, M., DiBello, L. V., & Yamamoto, K. (2008). Reporting test outcomes using models for
cognitive diagnosis. In J. Hartig, E. Klieme, & D. Leutner (Eds.), Assessment of competencies in
educational contexts (pp. 151–176). Toronto: Hogrefe & Huber.
Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., Rosa, K., Nelson, L., et al. (2001). Augmented
scores—“borrowing strength” to compute scores based on small numbers of items. In D. Thissen
& H. Wainer (Eds.), Test scoring (pp. 343–387). Mahwah, NJ: Erlbaum.
Yao, L., & Boughton, K. A. (2007). A multidimensional item response modeling approach for
improving subscale proficiency estimation and classification. Applied Psychological Measurement,
31, 83–105.
Yen, W. M. (1987, June). A Bayesian/IRT index of objective performance. Paper presented at the
annual meeting of the Psychometric Society, Montreal, Quebec, Canada.
Zhang, J., & Stout, W. (1999). The theoretical DETECT index of dimensionality and its application
to approximate simple structure. Psychometrika, 64, 213–249.
Zwick, R., Senturk, D., Wang, J., & Loomis, S. C. (2001). An investigation of alternative methods
for item mapping on the National Assessment of Educational Progress. Educational Measurement:
Issues and Practice, 20, 15–25.