Reporting Diagnostic Scores in Educational Testing: Temptations, Pitfalls, and Some Solutions
TRANSCRIPT
Multivariate Behavioral Research, 45:553–573, 2010
Copyright © Educational Testing Service
ISSN: 0027-3171 print/1532-7906 online
DOI: 10.1080/00273171.2010.483382
Reporting Diagnostic Scores in Educational Testing: Temptations,
Pitfalls, and Some Solutions

Sandip Sinharay, Gautam Puhan, and Shelby J. Haberman
Educational Testing Service
Diagnostic scores are of increasing interest in educational testing due to their
potential remedial and instructional benefit. Naturally, the number of educational
tests that report diagnostic scores is on the rise, as is the number of research
publications on such scores. This article provides a critical evaluation of diagnostic
score reporting in educational testing. The existing methods for diagnostic score
reporting are discussed. A recent method (Haberman, 2008a) that examines if
diagnostic scores are worth reporting is reviewed. It is demonstrated, using results
from operational and simulated data, that diagnostic scores have to be based on a
sufficient number of items and have to be sufficiently distinct from each other to
be worth reporting and that several operationally reported subscores are actually
not worth reporting. Several recommendations are made for those interested in
reporting diagnostic scores for educational tests.
Diagnostic scores are of increasing interest in educational testing due to their
potential remedial and instructional benefit. According to the National Research
Council report “Knowing What Students Know” (2001), the target of assessment
is to provide particular information about an examinee’s knowledge, skills, and
abilities. Diagnostic scores are often perceived as providing such information.
Naturally, there is substantial pressure on testing programs to report
diagnostic scores both at the individual examinee level and at aggregate levels
(such as at the level of institutions or states).
Despite this demand and the apparent usefulness of subscores, certain impor-
tant factors must be considered before diagnostic scores can be considered
Correspondence concerning this article should be addressed to Sandip Sinharay, Educational
Testing Service, Rosedale Road, P159B, Princeton, NJ 08541. E-mail: [email protected]
useful. According to Haberman (2008a), diagnostic scores may be considered
of “added value” only when they provide a more accurate measure of the
construct being measured than is provided by the total score. Similarly, Tate
(2004) has emphasized the importance of ensuring reasonable diagnostic score
performance in terms of high reliability and validity to minimize incorrect
instructional and remediation decisions. As is evident, there appears to be a conflict
between the demand for diagnostic scores and the need to exercise caution when
reporting diagnostic scores, especially if they do not provide any additional
information beyond what is provided by the total score or when they are less
reliable. Through this article, we intend to raise general consciousness regarding
potential psychometric issues surrounding the reporting of diagnostic subscores
in educational testing. This article starts by showing examples of diagnostic
scores reported by operational testing programs. Then a brief discussion of exist-
ing psychometric methods for reporting diagnostic scores is provided, followed
by a closer examination of some existing diagnostic scores and their usefulness.
Then a brief review of a method proposed by Haberman (2008a) that examines
if subscores (which are the simplest form of diagnostic scores and are reported
by several testing programs) have added value over the total score is provided.
Using results from several operational and simulated data sets, it is demonstrated
that it is not straightforward to have diagnostic scores with added value and that
factors such as scale length, reliability, and correlation among diagnostic scores
and their interactions with each other often determine when diagnostic scores
can be of added value. Finally, some recommendations are made for researchers
and practitioners interested in the issue of diagnostic score reporting.
WHAT ARE DIAGNOSTIC SCORES?
Figures 1 and 2 show the top and bottom parts of the score report of an imaginary
examinee (Mary D. Poppins) on two Praxis Series™ assessments (Educational
Testing Service, 2008)—Mathematics: Content Knowledge and Principles of
Learning and Teaching: Grades 7–12. Figure 1 shows the scaled scores on the
two assessments obtained by Mary and Figure 2 shows the raw scores earned
by Mary in the different score categories on the two assessments. The figure
also shows the raw points available in the categories and the range of scores
obtained by the middle 50% of a group of examinees of appropriate education
level. Figure 2 represents a typical diagnostic score report for examinees—the
scores on the categories are the diagnostic scores. The common perception is that
(a) such a report provides trustworthy information about the examinee’s strengths
and weaknesses, and (b) the examinee will work harder on the categories on
which she performed poorly (e.g., on “algebra and number theory,” on which
Mary scored less than the lower bound of the range) and improve on those areas.
FIGURE 1 Top part of an operational score report for an examinee.
Figure 3 is a typical diagnostic score report at an aggregate level. It shows,
for the Praxis Series™ Pre-Professional Skills Tests (PPST) Writing Assessment
(Educational Testing Service, 2008), the average percentage correct score of an
institution on four categories of the assessment, the corresponding averages
for the state the institution belongs to, and the corresponding averages for the
whole nation. Here, the common perception is that (a) such a report provides
trustworthy information about the overall strengths and weaknesses of examinees
from the institution, and (b) if an institution finds that its examinees performed
poorly on a score category compared with the state or the nation, a remedial
and instructional workshop can be given to the examinees to improve their
performance on the category. The categories usually correspond to the content
areas in the test (see Figure 2).
SUBSCORES, AUGMENTED SUBSCORES, AND
OBJECTIVE PERFORMANCE INDEX
The diagnostic score report shown in Figure 2 is based on raw scores on different
categories. These are also referred to as subscores. The score report in Figure 3 is
based on the average of subscores. Subscores are the simplest (e.g., raw number
correct) possible diagnostic scores and are used by several testing programs such
as SAT®, ACT®, and LSAT.
FIGURE 2 Bottom part of an operational score report for an examinee.
FIGURE 3 An operational score report for an institution with hypothetical data.
Wainer et al. (2001) suggested an approach to increase the precision of a
subscore by borrowing information from the other subscores. Because subscores
are almost always found to correlate moderately or highly with each other, it is
reasonable to assume that, for example, the science subscore of a student has
some information about the math subscore of the same student. In this approach,
weights (or regression coefficients) are assigned to each of the subscores and
an examinee’s “augmented” score on a particular subscale (e.g., math) would
be a function of that examinee’s ability on math and that person’s ability on
the remaining subscales (e.g., science, reading, etc.). The subscales that have
the strongest correlation with the math subscale have larger weights and thus
provide more information on the “augmented” math subscore. It has been shown
in previous research (e.g., Puhan, Sinharay, Haberman, & Larkin, 2008; Wainer
et al., 2001) that augmented subscores are often found to be of added value and
are more reliable than subscores.
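To make the augmentation idea concrete, the following is a minimal sketch of a Kelley-type implementation under CTT assumptions (measurement errors independent across subscores). It is our illustration of the idea, not the exact procedure of Wainer et al. (2001), and the function and variable names are ours.

    import numpy as np

    def augmented_subscores(scores, reliabilities):
        """Estimate each true subscore from ALL observed subscores by linear
        regression, in the spirit of Wainer et al. (2001).

        scores        : (n_examinees, k) array of observed subscores
        reliabilities : length-k array of subscore reliabilities
        """
        scores = np.asarray(scores, dtype=float)
        alpha = np.asarray(reliabilities, dtype=float)
        mean = scores.mean(axis=0)
        cov_obs = np.cov(scores, rowvar=False)   # observed covariance matrix
        k = scores.shape[1]
        B = np.empty((k, k))
        for j in range(k):
            # Covariance of the observed subscores with TRUE subscore j:
            # off-diagonal observed covariances carry over (independent errors);
            # the j-th entry shrinks by the reliability of subscore j.
            target = cov_obs[:, j].copy()
            target[j] = alpha[j] * cov_obs[j, j]
            B[:, j] = np.linalg.solve(cov_obs, target)   # regression weights
        return mean + (scores - mean) @ B

As in the description above, the subscales that correlate most strongly with the target subscale receive the largest regression weights and so contribute the most information to the augmented subscore.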
The objective performance index (OPI; Yen, 1987) is another approach to
enhance a subscore by borrowing information from other parts of the test. This ap-
proach uses a combination of item response theory (IRT) and Bayesian method-
ology. OPI is a weighted average of two estimates of performance: (a) the
observed subscore and (b) an estimate, obtained using a unidimensional IRT
model, of the subscore based on the examinee’s overall test performance. If the
observed and estimated subscores differ significantly, then the OPI is defined as
the observed subscore expressed as a percentage. It should be noted that this
approach, because of the use of a unidimensional IRT model, may not provide
accurate results when the data are truly multidimensional. Ironically, that is when
subscores can be expected to have added value.
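The exact weights and the significance test in Yen (1987) come from a fitted IRT model and Bayesian reasoning, which are not reproduced here; the schematic below (ours) mirrors only the structure just described, with an assumed weight and a simple binomial check standing in for quantities that would come from the fitted model.

    import numpy as np

    def opi_schematic(observed_correct, n_items, irt_expected_pct, weight):
        """Skeleton of the OPI logic: a weighted average of (a) the observed
        percent-correct subscore and (b) a unidimensional-IRT estimate of the
        subscore; if the two disagree significantly, report (a) alone.
        `irt_expected_pct` and `weight` are placeholders for model outputs.
        """
        observed_pct = 100.0 * observed_correct / n_items
        p = irt_expected_pct / 100.0           # model-implied success probability
        z = (observed_correct - n_items * p) / np.sqrt(n_items * p * (1.0 - p))
        if abs(z) > 1.96:                      # observed and estimated subscores differ
            return observed_pct                # fall back to the observed subscore
        return weight * observed_pct + (1.0 - weight) * irt_expected_pct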
MODEL-BASED APPROACHES FOR DIAGNOSTIC
SCORE REPORTING
Researchers have suggested several model-based approaches for diagnostic score
reporting. These models assume that (a) solving each test item requires one or
more skills, (b) each examinee has a latent ability parameter corresponding
to each of the skills, and (c) the probability that an examinee will answer an
item correctly is a mathematical function of the skills the item requires and
the latent ability parameters of the examinee. Several of these models, mostly
those that assume that the examinee ability parameters are discrete, fall under
the family of cognitive diagnosis models (CDM; Fu & Li, 2007). CDMs are
alternatively known as diagnostic classification models (DCM; Rupp & Templin,
2008) or models for cognitive diagnosis (von Davier, DiBello, & Yamamoto,
2008). After such a model is fitted to a data set, the diagnostic scores are the
estimated values of the ability parameters. These models include the rule space
model (Tatsuoka, 1983, 2009), the attribute hierarchy method (Leighton, Gierl,
& Hunka, 2004), the fusion model (Roussos et al., 2007), the Deterministic Inputs,
Noisy And (DINA) and Noisy Inputs, Deterministic And (NIDA) gate models
(Junker & Sijtsma, 2001), the multiple classification latent class model (Maris,
1999), the general diagnostic model (von Davier, 2008), the unified model
(DiBello, Stout, & Roussos, 1995), the reparameterized unified model (Roussos
et al., 2007), the Bayesian inference networks (e.g., Almond, DiBello, Moulder,
and Zapata-Rivera, 2007), the higher order latent-trait model (de la Torre &
Douglas, 2004), the Deterministic Inputs, Noisy Or (DINO) and Noisy Inputs,
Deterministic Or (NIDO) gate models (see, e.g., Rupp & Templin, 2008), and
the multicomponent latent trait model (e.g., Embretson, 1997). See, for example,
DiBello, Roussos, and Stout (2007), Fu and Li (2007), Rupp and Templin (2008),
and von Davier et al. (2008) for excellent reviews of models that can be used for
diagnostic score reporting. De la Torre and Patz (2005), Haberman and Sinharay
(in press), and Yao and Boughton (2007) examined reporting of diagnostic scores
using multidimensional IRT models (MIRT; e.g., Reckase, 1997); the first two of
these papers found that there is not much of a difference between MIRT-based
diagnostic scores and Classical Test Theory (CTT)-based augmented subscores.
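To make assumptions (a) through (c) concrete, here is the item response function of one of the simplest CDMs listed above, the DINA gate model (Junker & Sijtsma, 2001); the slip and guess values in the usage lines are purely illustrative.

    import numpy as np

    def dina_prob_correct(skills, q_row, guess, slip):
        """DINA item response function: the examinee answers correctly with
        probability 1 - slip if he or she has mastered every skill the item
        requires (per the item's Q-matrix row), and with probability guess
        otherwise.

        skills : 0/1 vector of mastered skills (discrete ability parameters)
        q_row  : 0/1 vector of the skills this item requires
        """
        skills, q_row = np.asarray(skills), np.asarray(q_row)
        has_all_required = bool(np.all(skills >= q_row))
        return 1.0 - slip if has_all_required else guess

    # An examinee who has mastered skills 1 and 3 but not skill 2, on an item
    # requiring skills 1 and 2, answers correctly only by guessing:
    p = dina_prob_correct([1, 0, 1], [1, 1, 0], guess=0.2, slip=0.1)  # p = 0.2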
HOW GOOD ARE THE EXISTING DIAGNOSTIC
SCORES? A CLOSER LOOK
It is not uncommon to observe decent reliabilities of diagnostic scores on
personality inventories designed to measure specific personality traits such as
anxiety, hostility, trust, and so on. For example, Goldberg (1999) reported that
the reliabilities of the 30 subscale scores (each subscale consisting of 8 items)
in the revised Neuroticism, Extraversion, Openness to Experience Personality
Inventory (NEO PI-R) ranged from 0.61 to 0.85 (M = 0.75). Considering
the relatively small number of items in the subscales, these reliabilities seem
reasonably high for diagnostic use. One possible reason for this could be that
personality inventories are often designed to measure fewer but much more
focused traits (e.g., aggression, gregariousness, etc.) and are therefore more
reliable. The diagnostic scores found in educational measurement literature and
operational practice often consist of only a few items. It is not uncommon to find
diagnostic scores on, for example, 10 skills based on only about 20–30 items.
Such scores are most often outcomes of retrofitting (reporting of diagnostic
scores from tests that were designed to measure only one overall skill). These
scores are usually provided to comply with clients’ requests for more diagnostic
information on examinees without an increase in test length. Because these tests
have been constructed specifically to measure a single construct, little reason
exists to expect useful diagnostic scores. In addition, it is often ignored that
a diagnostic score in educational measurement refers to a domain area that is
usually much broader than those covered by diagnostic scores in personality
inventories. Consider a certification test for prospective teachers of elementary
education that measures content knowledge in domain areas such as math,
science, and reading. It is often observed that, within each domain area, not
all items measure a single ability. For example, the math domain may have
numerous smaller subdomains such as problem solving; representation; number
sense; numeration; algebra; geometry; data organization using pie charts, bar
graphs, and so on. Therefore a much larger number of items in each subarea
will typically be required to measure the math domain reliably.
Further, few checks are done to make sure that the reported diagnostic
scores have decent psychometric properties. Even though Standards 2.1 and
6.5 of the Standards for Educational and Psychological Testing (American Ed-
ucational Research Association, American Psychological Association, National
Council on Measurement in Education, 1999) demand proof of adequate reliabil-
ity of any reported scores, reliability of diagnostic scores is often not reported so
that examinees and users of test results are not informed of the degree to which
confidence can be placed in skill classifications. In addition, the reliability figures
in some applications of psychometric models for diagnostic score reporting are
based on simulated data (see, e.g., Roussos et al., 2007, pp. 304–305) rather than
on empirical data. It may be unwise to report diagnostic information for a test
unless there is clear evidence that reliable skill classifications can be obtained
from the test data.
There is a lack of studies demonstrating the validity of diagnostic scores.
For example, there is little evidence showing that diagnostic information is
related to other external criteria. It is difficult to have much confidence in
any diagnostic information whose validity has not been established. Haberman
(2008b) demonstrated via theoretical derivations that the validity of subscores is
limited when the subscores are either not reliable or are highly correlated with
total scores.
As Sinharay and Haberman (2008b) explained, the data may not provide
information as fine-grained as suggested by the cognitive theory or as hoped
by the testing practitioner. A theory of response processes based on cognitive
psychology may suggest several skills. But a test includes a limited number of
items and may not provide enough information about all
of these skills. Thissen-Roe, Hunt, and Minstrell (2004) have shown in the area
of physics education that out of a large number of hypothesized misconceptions
in student learners, only a very few misconceptions could be found empirically.
Another example is the iSkillsTM test (e.g., Katz, Attali, Rijmen, & Williamson,
2008), for which an expert committee identified seven performance areas that
they thought constituted Information and Communications Technology (ICT)
literacy skill, but a factor analysis of the data revealed only one factor and the
confirmatory factor models in which the factors corresponded to performance
areas or a combination thereof did not fit the data at all (Katz et al., 2008). As a
result, only an overall ICT literacy score is reported for the test. The basic issue
is that the data may not support the conjectures made by content experts about
how examinees behave. Hence, an investigator attempting to report diagnostic
information has to make an informed judgment on how much evidence the data
can reliably provide and report only that much information.
To demonstrate the problems with subscores consisting of few items, we used
actual test data from a licensure test that is designed for prospective teachers of
children in primary through upper elementary school grades. The 120 multiple-
choice questions focus on four major subject areas: language arts/reading, math-
ematics, social studies, and science. There are 30 questions per area and a
subscore is reported for each of these areas. The subscore reliabilities are
between 0.71 and 0.83. We ranked the questions on mathematics and science
separately in the order of difficulty (proportion correct) and then created a form
A that consists of the questions ranked 1, 4, 7, …, 28 in mathematics and the
questions ranked 1, 4, 7, …, 28 in science. Similarly, we created a form B with
questions ranked 2, 5, 8, …, 29 in mathematics and in science and a form C
with the remaining questions. Forms A, B, and C can be considered roughly
parallel forms and, by construction, all of the several thousand examinees took
all three of these forms. The subscore reliabilities on these forms range between
0.46 and 0.60. We identified all the 271 examinees who obtained a subscore
of 7 on mathematics and 3 on science on Form A. Such examinees will most
likely be thought to be strong on mathematics and weak on science and given
additional science lessons. Is that justified? We examined the mathematics and
science subscores of the same examinees on Forms B and C. Table 1 gives a
cross-tabulation of the subscores on form B of such examinees. It shows that
the subscores on Form B often lead to different conclusions; for example, the
science subscore is higher than the mathematics subscore for several of these
examinees. The percentage of the 271 examinees with a mathematics subscore
of 5 or lower is 34 and 39, respectively, for Forms B and C. The percentage
with a science subscore of 6 or higher is 39 and 32, respectively, for Forms
B and C. The percentage of examinees whose mathematics score is higher
than their science score is only 59 and 66, respectively, on Forms B and C.
This simple example demonstrates that remedial and instructional decisions based
on short (and hence less reliable) subscores will often be wrong. Note that if we
had identified examinees whose mathematics and science scores were closer
(e.g., 6 in mathematics and 4 in science), then the remedial and instructional
TABLE 1
Cross-Tabulations on Form B of the 271 Examinees With Mathematics
Subscore of 7 and Science Subscore of 3 on Form A

                            Science Subscore
Math
Subscore    0    1    2    3    4    5    6    7    8    9   Total
1           0    0    0    1    1    1    0    0    0    0      3
2           0    0    0    1    3    1    2    0    0    0      7
3           1    0    0    1    2    4    3    1    1    1     14
4           0    0    2    7    7    6    4    0    1    0     27
5           0    1    1    1   10   14    8    5    1    1     42
6           2    0    1    5   10   11   15    8    1    1     54
7           0    1    4    4    4   11   10    7    4    0     45
8           0    1    1    5   12   13    7    5    4    0     48
9           0    0    1    1    6    3    7    4    3    0     25
10          0    0    0    1    1    2    1    1    0    0      6
Total       3    3   10   27   56   66   57   31   15    3    271
decisions based on the subscores would be even more inaccurate. In addition,
the correlation of the science subscore on Form A with the science subscore on
Form B is 0.48, whereas the correlation of the science subscore on Form A with
the total score on Form B is 0.63; thus the total score on a form is a better
predictor of the science subscore on a parallel form than is the science subscore
itself.
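For readers who wish to repeat this kind of check on their own data, here is a minimal sketch of the form-splitting step described above (our code; the 0/1 response-matrix layout is an assumption of the sketch):

    import numpy as np

    def split_into_parallel_forms(responses):
        """Split one 30-item subtest into three roughly parallel 10-item forms
        by stratifying on item difficulty, as in the demonstration above.

        responses : (n_examinees, n_items) 0/1 matrix for one content area
        Returns the examinees' subscores on Forms A, B, and C.
        """
        difficulty = responses.mean(axis=0)      # proportion correct per item
        order = np.argsort(-difficulty)          # items ranked by difficulty
        # ranks 1, 4, 7, ... to Form A; 2, 5, 8, ... to Form B; rest to Form C
        forms = [order[0::3], order[1::3], order[2::3]]
        return [responses[:, f].sum(axis=1) for f in forms]

One can then select, say, the examinees scoring 7 in mathematics and 3 in science on Form A and cross-tabulate their Form B subscores, as in Table 1.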
The aforementioned description shows that diagnostic scores based on a few
items may have poor psychometric quality. Unfortunately, researchers and prac-
titioners often do not pay adequate attention to the quality of diagnostic scores.
The next section describes a method that can be used to examine the quality of
subscores. The method was suggested by Haberman (2008a) to determine when
it is justified to report subscores in addition to the total score.
A METHOD BASED ON CLASSICAL TEST THEORY TO
EXAMINE IF SUBSCORES HAVE ADDED VALUE
GIVEN THE TOTAL SCORE
Let us denote the subscore and the total score of an examinee as $s$ and $x$,
respectively. Haberman (2008a), taking a classical test theory (CTT) viewpoint,
assumed that a reported subscore is intended to be an estimate of the true
subscore, $s_t$, and considered the following estimates of the true subscore:

• An estimate $s_s = \bar{s} + \alpha(s - \bar{s})$ based on the observed subscore, where $\bar{s}$ is
the average subscore for the sample of examinees and $\alpha$ is the reliability
of the subscore.
• An estimate $s_x = \bar{s} + c(x - \bar{x})$ based on the observed total score, where
$\bar{x}$ is the average total score and $c$ is a constant that depends on the
reliabilities and standard deviations of the subscore and the total score
and the correlations between the subscores.
• An estimate $s_{sx} = \bar{s} + a(s - \bar{s}) + b(x - \bar{x})$ that is a weighted average
of the observed subscore and the observed total score, where $a$ and $b$ are
constants that depend on the reliabilities and standard deviations of the
subscore and the total score and the correlations between the subscores.
It is also possible to consider the augmented subscore (Wainer et al., 2001) that
was discussed earlier as an estimate of the true subscore. The estimate ssx is a
special case of the augmented subscore. However, augmented subscores yield
results that are very similar to those for ssx (e.g., Sinharay, in press) and hence
are not considered here.
To compare the performances of $s_s$, $s_x$, and $s_{sx}$ as estimates of $s_t$, Haberman
(2008a) suggested the use of the proportional reduction in mean squared error
or PRMSE. (Computational details can be found in Haberman, 2008a; Sinharay,
Haberman, & Puhan, 2007). The larger the PRMSE, the more accurate is the
estimate.1 This article denotes the PRMSE for $s_s$, $s_x$, and $s_{sx}$ as $\mathrm{PRMSE}_s$,
$\mathrm{PRMSE}_x$, and $\mathrm{PRMSE}_{sx}$, respectively. The quantity $\mathrm{PRMSE}_s$ can be shown to be
exactly equal to the reliability of the subscore. Haberman (2008a) recommended
the following strategies to decide whether a subscore or a weighted average has
added value:

• If $\mathrm{PRMSE}_s$ is less than $\mathrm{PRMSE}_x$, declare that the subscore “does not
provide added value over the total score” because the observed total score
will provide more accurate diagnostic information (in the form of a lower
mean squared error in estimating the true subscore) than the observed
subscore in that case. Thus, from a statistical perspective, added value is
assessed in terms of a larger PRMSE for the subscore. Sinharay et al.
(2007) discussed why this strategy is reasonable and how it ensures that
a subscore satisfies professional standards. Conceptually, if the subscore
is highly correlated with the total score (i.e., the subscore and the total
score measure the same basic underlying skill or skills), then the subscore
does not provide any added value (information) beyond what is already
provided by the total score.
• The quantity $\mathrm{PRMSE}_{sx}$ will always be at least as large as $\mathrm{PRMSE}_s$ and
$\mathrm{PRMSE}_x$. However, $s_{sx}$ requires a bit more computation than does either
$s_s$ or $s_x$. Hence, declare that a weighted average has added value only if
$\mathrm{PRMSE}_{sx}$ is substantially larger than both $\mathrm{PRMSE}_s$ and $\mathrm{PRMSE}_x$.
If neither the subscore nor the weighted average has added value, diagnostic
information should not be reported for the test.
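Under the standard CTT assumptions (measurement errors uncorrelated with true scores and with one another), the first of these comparisons can be written compactly; the display below is our summary of the logic just described rather than a formula quoted from Haberman (2008a):

\[
\mathrm{PRMSE}_s = \mathrm{Corr}^2(s, s_t) = \alpha_s,
\qquad
\mathrm{PRMSE}_x = \mathrm{Corr}^2(x, s_t) = \rho^2(s_t, x_t)\,\alpha_x,
\]

where $\rho(s_t, x_t)$ is the disattenuated correlation between the subscore and the total score and $\alpha_x$ is the reliability of the total score. The subscore thus has added value exactly when $\alpha_s > \rho^2(s_t, x_t)\,\alpha_x$; that is, the subscore must be reliable and its true score must be sufficiently distinct from the true total score.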
The computations for application of the method of Haberman (2008a) are
simple and involve only the sample variances, correlations, and reliabilities of
the total score and the subscores; a computational sketch is given below. The
computations of the PRMSEs involve the disattenuated correlations2 among the
subscores. For some data sets, it is possible for the disattenuated correlation
between subscores to be larger than 1 and, as a result, for computed PRMSE
values to be larger than 1. In such cases, it is concluded that neither subscores
nor weighted averages have any added value.
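The sketch below (our code) carries out these computations for a total score defined as the sum of the subscores, assuming measurement errors uncorrelated with true scores and with one another; only the subscore covariance matrix and the reliabilities are needed.

    import numpy as np

    def prmse_report(cov_sub, reliabilities):
        """For each subscore, compute PRMSE_s, PRMSE_x, and PRMSE_sx following
        the logic of Haberman (2008a) as summarized in the text.

        cov_sub       : (k, k) observed covariance matrix of the subscores
        reliabilities : length-k vector of subscore reliabilities
        """
        cov_sub = np.asarray(cov_sub, dtype=float)
        alpha = np.asarray(reliabilities, dtype=float)
        var_x = cov_sub.sum()                    # Var(total) = sum of all cells
        for j in range(cov_sub.shape[0]):
            var_s = cov_sub[j, j]
            var_st = alpha[j] * var_s            # true-subscore variance
            cov_s_x = cov_sub[:, j].sum()        # Cov(s_j, x)
            cov_st_x = cov_s_x - var_s + var_st  # errors drop out of Cov(s_t, x)
            prmse_s = alpha[j]                   # equals the subscore reliability
            prmse_x = cov_st_x ** 2 / (var_x * var_st)
            # PRMSE_sx: variance of the true subscore explained by the best
            # linear combination of the observed subscore and total score
            sigma = np.array([[var_s, cov_s_x], [cov_s_x, var_x]])
            c = np.array([var_st, cov_st_x])
            prmse_sx = c @ np.linalg.solve(sigma, c) / var_st
            print(f"subscore {j}: PRMSE_s={prmse_s:.3f}, "
                  f"PRMSE_x={prmse_x:.3f}, PRMSE_sx={prmse_sx:.3f}")

A subscore is then declared to have added value when its PRMSE_s exceeds PRMSE_x, and a weighted average when PRMSE_sx exceeds both by a meaningful margin.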
Haberman (2008a) and Sinharay et al. (2007) explained that a subscore is
more likely to have added value when it has high reliability and it is distinct
from the other subscores.

1A larger PRMSE is equivalent to a smaller mean squared error in estimating the true subscore
and hence is desirable.
2The disattenuated correlation between two subscores is equal to the simple correlation between
them divided by the square root of the product of their reliabilities: $r_{12}^{\mathrm{disatt}} = r_{12}/\sqrt{\alpha_1 \alpha_2}$.
Haberman, Sinharay, and Puhan (2009) extended the method of Haberman
(2008a) to aggregate-level subscores. Because several
large-scale examinations such as the SAT Reasoning test, American College
Testing (ACT), and Law School Admission Test (LSAT) report overall scores
using CTT, the methods of Haberman (2008a) and Haberman et al. (2009) will
have wide applicability.
One could consider other methods, for example the method of fitting beta-
binomial models to the observed subscore distributions (Harris & Hanson, 1991)
or factor analysis (e.g., as in Ling, 2009; Stone, Ye, Zhu, & Lane, 2010; Wainer
et al., 2001), to determine whether a subscore has added value. However, the
method of Harris and Hanson (1991) involves significance testing with a $\chi^2$
statistic whose null distribution is not well established (p. 5), and factor analysis
involves several issues such as determining whether to analyze at the item level or
at the item parcel level, determining whether to use exploratory or confirmatory
factor analysis, and determining which test statistics to use to find the number
of factors in the data, which complicate the process of determining whether a
subscore has added value. Wainer et al. (2001) recommended that a test can
be considered unidimensional (which implies that subscores do not have added
value) if the reliabilities of the augmented scores are close to the reliability of
the total test score. Another way to examine if subscores have added value is to
perform a statistical test of whether a multidimensional IRT model provides a
better fit than a unidimensional IRT model (see, e.g., von Davier, 2008, Table
12). Researchers like Ackerman and Shu (2009) used dimensionality assessment
programs such as DIMTEST (Stout, 1987) and DETECT (Zhang & Stout, 1999)
to determine the usefulness of subscores.
RESULTS FROM A SURVEY OF OPERATIONAL DATA
Sinharay (in press) performed an extensive survey regarding whether subscores
have added value (where value-addedness of a subscore was determined by the
method of Haberman, 2008a) for operational tests. The findings are summarized
in Table 2.
Each row in Table 2 shows, for a test, the number of subscores, the average
length of the subscores, the average reliability of the subscores, the average
correlation among the subscores, the average disattenuated correlation,3 the
number of subscores that have added value, and the number of weighted averages
that have added value (where the assumption was made that a weighted average
has added value if the corresponding $\mathrm{PRMSE}_{sx}$ is larger than the maximum of
$\mathrm{PRMSE}_s$ and $\mathrm{PRMSE}_x$ by 0.01 or more4).

3Note that although the table reports averages to summarize a lot of information in a
compendious manner, for some of these tests the lengths, reliabilities, and correlations of the
subscores are substantially unequal.
4Changing 0.01 to other small values such as 0.02 or 0.03 did not affect the conclusions much.
TABLE 2
Results From Analysis of Operational Data Sets

                                                No. of     Avg.    Avg.  Avg.   Avg. Corr.  Subscores With  Wtd. Avs. With
Name/Nature of the Test                         Subscores  Length  α     Corr.  (Disatt.)   Added Value     Added Value
P-ACTC Eng                                      2          25      0.79  0.76   0.95        None            None
P-ACTC Math                                     2          20      0.71  0.67   0.94        None            One
SAT Verbal                                      3          26      0.79  0.74   0.95        None            One
SAT Math                                        3          20      0.78  0.75   0.97        None            None
SAT                                             2          69      0.92  0.70   0.76        Two             Two
Praxis                                          4          25      0.72  0.56   0.78        Two             Four
DSTP Math (8th grade)                           4          19      0.77  0.77   1.00        None            None
State Reading (5th grade)                       2          37      0.72  0.65   0.92        None            One
SweSAT                                          5          24      0.78  0.55   0.71        Four            Five
MFT Business                                    7          17      0.56  0.47   0.85        None            Six
CA (for teachers in elementary schools)         4          30      0.74  0.59   0.79        One             Four
CB (for teachers of special ed. programs)       3          19      0.46  0.42   0.96        None            None
CC (for beginning teachers)                     4          19      0.38  0.44   1.00        None            None
CD (for teachers of social studies)             6          22      0.63  0.54   0.87        None            Six
CE (for teachers of Spanish)                    4          29      0.80  0.65   0.80        One             Two
CF (for principals and school leaders)          4          15      0.48  0.41   0.85        None            Four
CG (for teachers of mathematics)                3          16      0.62  0.59   0.95        None            None
CH (for paraprofessionals)                      3          24      0.85  0.76   0.89        None            Three
TA (measures cognitive and technical skills)    7          11      0.42  0.51   1.00        None            None
TB1 (tests mastery of a language)               2          44      0.85  0.77   0.90        One             Two
TB2 (tests mastery of a language)               2          43      0.90  0.68   0.75        Two             Two
TC1 (measures achievement in a discipline)      3          68      0.85  0.76   0.90        One             Three
TC2 (measures achievement in a discipline)      3          67      0.87  0.72   0.82        Two             Three
TD1 (measures school and individual
  student progress)                             4          15      0.70  0.73   0.98        None            None
TD2 (measures school and individual
  student progress)                             6          13      0.70  0.75   1.00        None            None

Note. The reliability is denoted as α. Weighted averages are denoted as “Wtd. Avs.” The first two tests were
discussed in Harris & Hanson (1991). The next four tests were discussed by Haberman (2008a). The next four
tests were discussed in Stone, Ye, Zhu, & Lane (2010), Ackerman & Shu (2009), Lyren (2009), and Ling (2009).
The eight tests denoted CA-CH are certification tests discussed in Puhan, Sinharay, Haberman, & Larkin (2008).
The last seven tests, denoted TA, TB1, …, TD2, were discussed in Sinharay & Haberman (2008a).
The reliability of the different scores and subscores was estimated using
Cronbach’s α. Most of the tests had only multiple-choice items. For these tests,
“length” refers to the number of items. Some tests, such as CF (see Table 2), had
constructed-response (CR) items. For a subscore involving CR items, “length”
refers to the total number of score categories minus the number of items (e.g.,
for a subscore with four items, each with score categories 0, 1, and 2, the length
is $4 \times 3 - 4 = 8$).
Figures 4–6 show, for the operational data sets, the percentage of subscores
(Figures 4 and 5) or weighted averages (Figure 6) that had added value. In each
of these figures, the y-axis corresponds to the average disattenuated correlation
among the subscores. In Figure 4, the x-axis denotes the average length of
the subscores, whereas in Figures 5 and 6, the x-axis denotes average subscore
reliability.
The three figures plot, for each row listed in Table 2, a number that is the same
as the percentage of subscores (or weighted averages) that have added value
at the point $(x, y)$, where $x$ is the corresponding average subscore reliability
multiplied by 100 (or length in Figure 4) and y is 100 times the average
disattenuated correlation. For example, the row for the SAT in Table 2 shows
an average length of 69, an average disattenuated correlation of 0.76, and
two subscores (that is, 100% of all subscores) that had added value. Hence
Figure 4 has the number 100 plotted at the point (69,76) at the bottom right
corner.
FIGURE 4 The percentage of subscores that have added value for different subscore length
and average disattenuated correlation for the operational data.
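A display of this kind is easy to reproduce; the sketch below (our code, with three illustrative rows taken from Table 2) plots each percentage as a text label at its (x, y) position.

    import matplotlib.pyplot as plt

    def added_value_plot(rows):
        """Plot, for each test, the percentage of subscores with added value
        as a text label at (x, y) = (average length, 100 * average
        disattenuated correlation), in the style of Figure 4.
        """
        fig, ax = plt.subplots()
        for x, y, pct in rows:
            ax.text(x, y, str(pct), ha="center", va="center")
        ax.set_xlabel("Average subscore length")
        ax.set_ylabel("Average disattenuated correlation x 100")
        ax.set_xlim(5, 75)
        ax.set_ylim(65, 105)
        plt.show()

    # From Table 2: SAT (length 69, disatt. corr. 0.76, 100% of subscores with
    # added value); P-ACTC Eng (25, 0.95, 0%); SweSAT (24, 0.71, 80%)
    added_value_plot([(69, 76, 100), (25, 95, 0), (24, 71, 80)])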
FIGURE 5 The percentage of subscores that have added value for different average
subscore reliability and average disattenuated correlation for the operational data.
Table 2 and Figures 4–6 show that several of the subscores, including some
operationally reported ones, do not have added value. They also show that in
general, long subscores (which have high reliability) tend to have added value.
For example, the tests for which one or more subscores have added value have at
least 24 items per subscore. Of these, for TC1 and TC2 (see Table 2), subscores
consist of 67–68 items on average. However, not all long subscores have
added value. For example, for the test TC1, which has an average subscore
length of 68, two of the three subscores do not have added value. In general,
tests with low average disattenuated correlation tended to have subscores with
added value. The average disattenuated correlation for tests for which one or
more subscores have added value is never above 0.90. However, not all tests
with average disattenuated correlation less than 0.90 have subscores with added
value. For
example, the average disattenuated correlation is 0.85 for Test CF, and none of
its subscores have added value.
Often, the percentage of subscores with a specific average length (or average
reliability) that have added value depends on the average disattenuated correlations.5
FIGURE 6 The percentage of weighted averages that have added value for different
average subscore reliability and average disattenuated correlation for the operational data.
In each of these figures, as one goes from the top left corner to the
bottom right corner (that is, as the average length/reliability increases and the
average disattenuated correlation decreases), the subscores show a greater
tendency to have added value. Figure 6 shows that the weighted averages have
added value for many of the operational data sets and that weighted averages
are much more likely to have added value compared with the subscores themselves.

5In other words, there is an interaction between the two factors: average length (or average
reliability) and average disattenuated correlation.
The following question naturally arises: Could the conclusions from the
operational data be different if a DCM (one of those discussed earlier) were
applied? Though this issue requires further research, the answer is most likely
negative. Haberman and Sinharay (in press) found that PRMSE of subscores
based on MIRT are very close to PRMSE of subscores based on CTT. They
also found in a study of several operational data sets that MIRT-based subscores
had added value if, and only if, the CTT-based subscores had added value.
DCMs are quite similar to MIRT models (e.g., both of them are latent variable
models) except that the examinee abilities are considered discrete in the former
and continuous in the latter. Hence it can be expected that DCM-based ability
estimates will have added value if, and only if, the CTT-based subscores have
added value. In addition, Henson, Templin, and Douglas (2007) showed that the
use of subscores resulted in only a modest reduction in correct classification
rates in comparison with a DCM—so DCM-based ability estimates seem to
lead to conclusions similar to the subscores. Hence, although the values of the
relevant numerical measures and the relationships between them in the DCM-
based methods will differ from the CTT-based method, the DCM-based methods
are also expected to require subtests that include a sufficient number of items
are as distinct as possible from the other subtests to be able to provide useful
feedback to the examinees.
It may be of interest to look at what the researchers using other methods have
found regarding the usefulness of subscores. Stone et al. (2010) reported, using
an exploratory factor analysis method, the presence of only one factor in the
aforementioned Spring 2006 assessment of the Delaware Student Testing Pro-
gram (DSTP) eighth-grade mathematics assessment. Harris and Hanson (1991)
found subscores to have little added value for the English and mathematics tests
from the P-ACTC examination. Wainer et al. (2001) found that the six subscores
in the American Production and Inventory Control Society certification examina-
tion were not useful because the test was extremely unidimensional, whereas
the four subscores of the performance assessment part of the North Carolina Test
of Computer Skills were useful. Ackerman and Shu (2009) found the subscores to be not
useful for the aforementioned fifth-grade end-of-grade assessment. So, it seems
that subscores in operational tests have more often been found to be not useful.
RECOMMENDATIONS AND CONCLUSIONS
This article demonstrated that even though there is an increasing interest in
reporting diagnostic scores, diagnostic scores often lack adequate psychometric
quality. Our recommendations on diagnostic score reporting are given here:
• As implied earlier, evidence of reliability and validity of the information
should be reported whenever diagnostic scores are provided.
• If a psychometric model is employed to report diagnostic scores, the burden
of proof lies on the person applying the model to demonstrate that the
model parameters can be reliably estimated, that the model approximates
the observed pattern of responses better than a simpler model (e.g., a
univariate IRT model), and that the diagnostic scores reported by the model
have added value over a simple subscore or over the score(s) reported by
a simple model. The simplest of the models passing these ordeals may be
operationally used for diagnostic scoring.
• Any reported diagnostic information, in order to be reliable, should be
based on a sufficient number of carefully constructed items. Combining
some subscores may result in subscores that have higher reliability and
hence added value (although it might make the definition of the subscore
broader). For example, subscores for “Physics: theory” and “Physics: ap-
plications” may be combined to yield one subscore for “Physics.” It is also
important to ensure that the skills of interest are as distinct as possible
from each other (though this is quite a difficult task before seeing the
data). Sinharay (in press) provides some ideas regarding the number of
items and the extent of distinctness needed.
• It is important to remember the advice of Luecht, Gierl, Tan, and Huff
(2006) that “inherently unidimensional item and test information cannot be
decomposed to produce useful multidimensional score profiles-no matter
how well intentioned or which psychometric model is used to extract the
information” and that we should not “try to extract something that is
not there” (p. 6). Thus, for some tests, where the diagnostic scores are
not distinct from each other (i.e., they are highly correlated), it will not
be possible to report diagnostic scores, regardless of the method used.
Changing the structure of such tests, for example, using sound assessment
engineering practices for item and test design (Luecht et al., 2006) may
be the only option in order to be able to report diagnostic scores. If
restructuring the test is not a reasonable option, then, instead of diagnostic
score reporting, one can consider alternatives such as scale anchoring
(e.g., Beaton & Allen, 1992), which makes claims about what students
at different scale points know and can do, and item mapping (e.g., Zwick,
Senturk, Wang, & Loomis, 2001), which refers to the use of exemplar
items to characterize particular score points.
• Weighted averages are often found to have added value. This finding
should come as good news to practitioners interested in reporting diagnostic
scores. Weighted averages may be difficult to explain to the general public,
who may not like the idea that, for example, a reported reading subscore is
based not only on the observed reading subscore but also on the observed
writing subscore. Several approaches to the problem of explanation can be
considered. One is that the weighted average better estimates the examinee
proficiency in the content domain represented by the subscore than does
the subscore itself. This result can be discussed in terms of prediction
of performance on an alternative test. The issue can also be discussed in
terms of common cases in which information is customarily combined. For
example, premiums for automobile insurance reflect not just the driving ex-
perience of the policyholder but also related information (such as education
and marital status) that predicts future driving performance. However, this
difficulty in explaining the results is more than compensated for by the higher
PRMSE (that is, more precision) of the weighted average. Note that if a
test has only a few short subscores, a weighted average may have added
value but should not be reported because its PRMSE, although substantially
larger than $\mathrm{PRMSE}_s$ and $\mathrm{PRMSE}_x$, may still not be sufficiently high.
• Diagnostic scores must be reported on an established scale. A temptation
may exist to make this scale comparable to the scale for the total score or
to the fraction of the scale that corresponds to the relative importance of
the diagnostic score, but these choices are not without difficulties given that
diagnostic scores and total scores typically differ in reliability. In addition,
reported diagnostic scores should be properly equated so that the definition
of a strong performance in a subject area does not change between different
administrations of a test. In typical cases, equating is feasible for the total
score but not for diagnostic scores (e.g., if an anchor test is used to equate
the total test, only a few of the items will correspond to a particular subtest
so that an anchor test equating of the corresponding diagnostic score is not
feasible). Some work on equating of subscores has been done by Puhan
and Liang (in press).
ACKNOWLEDGMENT
Any opinions expressed in this article are those of the authors and not necessarily
of Educational Testing Service.
REFERENCES
Ackerman, T., & Shu, Z. (2009, April). Using confirmatory MIRT modeling to provide diagnostic
information in large scale assessment. Paper presented at the annual meeting of the National
Council of Measurement in Education, San Diego, CA.
Almond, R. G., DiBello, L. V., Moulder, B., & Zapata-Rivera, J. (2007). Modeling diagnostic
assessment with Bayesian networks. Journal of Educational Measurement, 44, 341–359.
American Educational Research Association (AERA), American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and psychological
testing. Washington, DC: AERA.
Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of
Educational Statistics, 17, 191–204.
de la Torre, J., & Douglas, J. (2004). Higher-order latent trait models for cognitive diagnosis.
Psychometrika, 69, 333–353.
de la Torre, J., & Patz, R. J. (2005). Making the most of what we have: A practical application
of multidimensional IRT in test scoring. Journal of Educational and Behavioral Statistics, 30,
295–311.
DiBello, L. V., Roussos, L., & Stout, W. F. (2007). Review of cognitive diagnostic assessment and
a summary of psychometric models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics,
Volume 26 (pp. 45–79). Amsterdam: Elsevier Science B.V.
DiBello, L. V., Stout, W. F., & Roussos, L. A. (1995). Unified cognitive/psychometric diagnostic
assessment likelihood-based classification techniques. In P. D. Nichols, S. F. Chipman, & R. L.
Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–389). Hillsdale, NJ: Erlbaum.
Educational Testing Service. (2008). Praxis™ 2008–09 information bulletin. Princeton, NJ: Author.
Embretson, S. E. (1997). Multicomponent latent trait models. In W. van der Linden & R. Hambleton
(Eds.), Handbook of modern item response theory (pp. 305–322). New York: Springer-Verlag.
Fu, J., & Li, Y. (2007, April). Cognitively diagnostic psychometric models: An integrative review.
Paper presented at the annual meeting of the National Council on Measurement in Education,
Chicago, IL.
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the
lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F.
Ostendorf (Eds.), Personality Psychology in Europe (Vol. 7, pp. 7–28). Tilburg, The Netherlands:
Tilburg University Press.
Haberman, S. J. (2008a). When can subscores have value? Journal of Educational and Behavioral
Statistics, 33, 204–229.
Haberman, S. J. (2008b). Subscores and validity (ETS Research Report No. RR-08-64). Princeton,
NJ: Educational Testing Service.
Haberman, S. J., & Sinharay, S. (in press). Reporting of subscores using multidimensional item
response theory. Psychometrika.
Haberman, S. J., Sinharay, S., & Puhan, G. (2009). Reporting subscores for institutions. British
Journal of Mathematical and Statistical Psychology, 62, 79–95.
Harris, D. J., & Hanson, B. A. (1991, April). Methods of examining the usefulness of subscores. Paper
presented at the annual meeting of National Council on Measurement in Education, Chicago, IL.
Henson, R., Templin, J., & Douglas, J. (2007). Using efficient model based sum-scores for conducting
skills diagnoses. Journal of Educational Measurement, 44, 361–376.
Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and
connections with nonparametric item response theory. Applied Psychological Measurement, 25,
258–272.
Katz, I. R., Attali, Y., Rijmen, F., & Williamson, D. M. (2008, April). ETS’s iSkillsTM assessment:
Measurement of information and communication technology literacy. Paper presented at the
twenty-third annual conference of the Society for Industrial and Organizational Psychology, San
Francisco, CA.
Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy model for cognitive
assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement,
41, 205–237.
Ling, G. (2009, April). Why the major field (business) test does not report subscores of individual
testtakers—reliability and construct validity evidence. Paper presented at the annual meeting of
the National Council of Measurement in Education, San Diego, CA.
Luecht, R. M., Gierl, M. J., Tan, X., & Huff, K. (2006, April). Scalability and the development
of useful diagnostic scales. Paper presented at the annual meeting of the National Council on
Measurement in Education, San Francisco, CA.
Lyren, P. (2009). Reporting subscores from college admission tests. Practical Assessment, Research,
and Evaluation, 14, 1–10.
Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64, 187–212.
National Research Council. (2001). Knowing what students know: The science and design of
educational assessment. Washington, DC: The National Academies Press.
Puhan, G., & Liang, L. (in press). Equating subscores under the non equivalent anchor test (NEAT)
design (ETS Research Report). Princeton, NJ: Educational Testing Service.
Puhan, G., Sinharay, S., Haberman, S. J., & Larkin, K. (2008). Comparison of subscores based on
classical test theory (ETS Research Report No. RR-08-54). Princeton, NJ: Educational Testing
Service.
Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied
Psychological Measurement, 21, 25–36.
Roussos, L. A., DiBello, L. V., Stout, W. F., Hartz, S. M., Henson, R. A., & Templin, J. L. (2007).
The fusion model skills diagnostic system. In J. Leighton & M. Gierl (Eds.), Cognitive diagnostic
assessment for education: Theory and applications (pp. 275–318). Cambridge, UK: Cambridge
University Press.
Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification models: A
comprehensive review of the current state-of-the-art. Measurement, 6, 219–262.
Sinharay, S. (in press). How often do subscores have added value? Results from operational and
simulated data. Journal of Educational Measurement.
Sinharay, S., & Haberman, S. J. (2008a). Reporting subscores: A survey (ETS Research Memoran-
dum No. RR-08-18). Princeton, NJ: Educational Testing Service.
Sinharay, S., & Haberman, S. J. (2008b). How much can we reliably know about what students
know? Measurement: Interdisciplinary Research and Perspectives, 6, 46–49.
Sinharay, S., Haberman, S. J., & Puhan, G. (2007). Subscores based on classical test theory: To
report or not to report. Educational Measurement: Issues and Practice, 26(4), 21–28.
Stone, C. A., Ye, F., Zhu, X., & Lane, S. (2010). Providing subscale scores for diagnostic information:
A case study when the test is essentially unidimensional. Applied Measurement in Education,
23(1), 63–86.
Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psycho-
metrika, 52, 589–617.
Tate, R. L. (2004). Implications of multidimensionality for total score and subscore performance.
Applied Measurement in Education, 17, 89–112.
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item
response theory. Journal of Educational Measurement, 20, 345–354.
Tatsuoka, K. K. (2009). Cognitive assessment: An introduction to the rule space method. New York:
Routledge Academic.
Thissen-Roe, A., Hunt, E., & Minstrell, J. (2004). The DIAGNOSER project: Combining assessment
and learning. Behavior Research Methods, Instruments, & Computers, 36, 234–240.
von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal
of Mathematical and Statistical Psychology, 61, 287–307.
von Davier, M., DiBello, L. V., & Yamamoto, K. (2008). Reporting test outcomes using models for
cognitive diagnosis. In J. Hartig, E. Klieme, & D. Leutner (Eds.), Assessment of competencies in
educational contexts (pp. 151–176). Toronto: Hogrefe & Huber.
Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., Rosa, K., Nelson, L., et al. (2001). Augmented
scores—“borrowing strength” to compute scores based on small numbers of items. In D. Thissen
& H. Wainer (Eds.), Test scoring (pp. 343–387). Mahwah, NJ: Erlbaum.
Yao, L., & Boughton, K. A. (2007). A multidimensional item response modeling approach for
improving subscale proficiency estimation and classification. Applied Psychological Measurement,
31, 83–105.
Yen, W. M. (1987, June). A Bayesian/IRT index of objective performance. Paper presented at the
annual meeting of the Psychometric Society, Montreal, Quebec, Canada.
Zhang, J., & Stout, W. (1999). The theoretical DETECT index of dimensionality and its application
to approximate simple structure. Psychometrika, 64, 213–249.
Zwick, R., Senturk, D., Wang, J., & Loomis, S. C. (2001). An investigation of alternative methods
for item mapping on the National Assessment of Educational Progress. Educational Measurement:
Issues and Practice, 20, 15–25.