using technology-enhanced items to measure fourth grade ... · annual meeting of the american...
TRANSCRIPT
Using Technology-Enhanced Items to Measure
Fourth Grade Geometry Knowledge
Jessica Masters, Ph.D.
Matthew Gushta, Ph.D.
Paper presented at the
Annual Meeting of the American Educational Research Association
Annual Meeting
Washington, D.C.
April, 2018
Abstract
Technology-enhanced items have the potential to provide improved measures of student
knowledge compared to traditional item types. This paper uses quantitative analysis of fourth-
grade geometry field test data to explore a) the validity of inferences made from certain kinds of
technology-enhanced items and b) whether those item types provide improved measurement
compared to traditional item types. There was strong evidence based on internal structure and
based on the relationship with other variables that technology-enhanced items provided valid
inferences. The evidence of whether that measurement was an improvement over the
measurement provided by traditional selected-response items was mixed.
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 2/46
High-quality assessment is critical to the educational process. Teachers and educational
stakeholders can only make effective decisions about instruction and student progress if they
have access to assessments that result in reliable and valid inferences about student knowledge.
Technology-enhanced (TE) items have the potential to provide improved measures of student
knowledge over traditional item types because they require students to produce, rather than
select, a response. TE items can create a more engaging environment for students and can reduce
the effects of guessing and test-taking skills. Constructed-response (CR) items can often provide
similar benefits, but TE items have the additional benefit of supporting automated scoring, a
critical feature in the modern classroom that requires fast-paced, inexpensive, and accurate
assessment feedback. For these and other reasons, national assessment consortia, state
departments of education, and assessment developers have widely incorporated TE items into
their formative and summative assessments. While these potential benefits of TE items have
spurred the forward momentum of TE item development and use, there has been a lack of
rigorous research on the validity of inferences made from TE items and the ability of TE items to
provide improved measurement over traditional selected-response (SR) items (Bryant, 2017).
Researchers have stressed the need for evidence that TE items are more than merely engaging,
but also provide an accurate, and potentially improved, measure of student knowledge.
The current paper contributes to the small but growing base of research related to whether the
inferences made from TE items are valid and whether TE items provide improved measurement
over SR items. Field test data are examined to address the following research questions:
RQ1) To what extent do TE items provide a valid measure of geometry standards in the
elementary grades?
RQ2) To what extent do TE items provide improved measurement compared to SR items?
To address these questions, the Validity of Technology-Enhanced Assessment in Geometry
(VTAG) project collected from classroom administration of TE, SR, and CR items targeting
fourth grade Common Core State Standards in geometry (shown in Figure 1).
CCSS.MATH.CONTENT.4.G.A.1: Draw points, lines, line segments, rays, angles (right, acute,
obtuse), and perpendicular and parallel lines. Identify these in two-dimensional figures.
CCSS.MATH.CONTENT.4.G.A.2: Classify two-dimensional figures based on the presence or
absence of parallel or perpendicular lines, or the presence or absence of angles of a specified size.
Recognize right triangles as a category, and identify right triangles.
CCSS.MATH.CONTENT.4.G.A.3: Recognize a line of symmetry for a two-dimensional figure as a
line across the figure such that the figure can be folded along the line into matching parts. Identify line-
symmetric figures and draw lines of symmetry.
Figure 1: Common Core State Standards in Fourth Grade Geometry
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 3/46
1. Theoretical Framework
Assessment is a critical component within the instructional process and instruction should be
differentiated based on the results of assessments (Pellegrino, Chudowsky, & Glaser, 2001). The
2010 National Education Technology (NET) Plan’s goal related to assessment is that “Our
education system at all levels will leverage the power of technology to measure what matters and
use assessment data for continuous improvement (USDE, p. xvii).” Research has documented the
inadequacies of SR items to measure many types of high-level knowledge and understanding
(Archbald & Newmann, 1988; Bennett, 1993; Birenbaum & Tatsuoka, 1987; Hickson & Reed,
2009; Lane, 2004; Livingston, 2009; Darling-Hammond & Lieberman, 1992; Quellmalz, Timms,
& Schneider, 2009). One approach to overcome the shortcomings of SR items is the use of text-
entry or CR items, which have been used to measure high-order skills and knowledge. In recent
years, researchers have leveraged technological advancements to combine the measurement
power of CR items with the automated-scoring capability of SR items. One branch of this
research has focused on automated text and essay scoring (e.g., Dikli, 2006), while another
branch has focused on using technology to allow students to interact with digital content in
innovative ways, through the development of TE items. This second line of research is consistent
with the NET Plan’s assessment-related recommendations, which include the development of
assessments that provide “new and better ways” to assess students and the expansion of the
capacity to design, develop, and validate technology-enhanced assessments that can access
constructs difficult to measure with traditional assessments (USDE, 2010). For this
recommendation to be realized, more research is needed on the validity of technology-enhanced
assessments in a variety of contexts.
TE items offer many potential benefits over SR items. The most significant is that TE items have
the potential to provide improved measurement of certain constructs, specifically high-level or
cognitively complex constructs, because, rather than simply select information, they require
students to produce information, which is often a more authentic form of measurement
(Archbald & Newmann, 1988; Bennett, 1999; Harlen & Crick, 2003; Huff & Sireci, 2001;
Jodoin, 2003; McFarlane, Williams, & Bonnett, 2000; Sireci & Zenisky, 2006; Zenisky & Sireci,
2002). A second benefit is that TE items reduce the effects of test-taking skills and random
guessing (Huff & Sireci, 2001). A third benefit is that TE items have the potential to provide
richer diagnostic information by recording not only the student’s final response, but also the
interaction and response processes that can reveal the student’s thought process that lead to that
response (Birenbaum & Tatsuoka, 1987). CR items have always offered the first of these two
benefits, but TE items allow these benefits to be leveraged on items administered via computer
that can be automatically and instantly scored. A fourth potential benefit of TE items is a
possible reduction of cognitive load from non-relevant constructs, such as the reading load for
non-reading related items or the cognitive load required to keep various item constructs in
memory (Mayer & Moreno, 2003; Thomas, 2016). Finally, TE items tend to be more engaging to
students, an important consideration in an era when students frequently feel over-tested (Strain-
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 4/46
Seymour, Way, & Dolan, 2009; Dolan, Goodman, Strain-Seymour, Adams, & Sethuraman,
2011).
The potential of TE items to provide improved measurement was initially underscored by the
awarding of Race to the Top Assessment funds to PARCC and Smarter Balanced, who proposed
to develop next-generation assessment systems that would incorporate TE items in their
summative and non-summative assessments (PARCC, 2010; SBAC, 2010). Since that time, state
departments of education have continued to pursue the promise of TE items. Statewide
summative assessment Requests for Proposals frequently include specific provisions for the
development and administration of TE items, stipulating the availability of specific interaction
types (e.g., State of Maine, Department of Education, 2014) and the presumption of
measurement of higher-order thinking skills (e.g., Oklahoma State Department of Education,
2017). Despite the forward momentum to develop and use TE items, there is only a small
research base evaluating the validity of TE items in various contexts within K-12 education.
Pearson conducted cognitive labs with elementary, middle, and high school students to evaluate
perceptions of TE items, the cognitive processes used to respond to TE items, and the potential
for TE items to better evaluate constructs in both mathematics and English language arts (Dolan,
Goodman, Strain-Seymour, Adams, & Sethuraman, 2011). Although the results cannot be
broadly generalized because of the small sample sizes, the research found preliminary evidence
to suggest that TE items were highly usable and engaging. More importantly, the research found
that TE items produced measurements of constructs, particularly high-level constructs, that were
not easily measured with traditional items types. The study found that the use of TE items
reduced guessing and allowed students to have more authentic interactions with content. The
study also found that TE items required more time to complete and that this factor was
influenced by students’ technical proficiencies (ibid.).
In a separate research effort, Pearson partnered with the Minnesota State Department of
Education to evaluate and compare the performance of TE items, SR items, and CR items in the
context of fifth grade, eighth grade, and high school science (Wan & Henley, 2012). This study
explored TE items of the figural response type, which includes “hotspot” identification, drag-
and-drop, and reordering. Through item response theory analyses, this study found that TE items
provided the same amount of information as SR items in fifth and eighth grade, and slightly
more information in high school. While CR items provided more information than both TE items
and SR items, those items required human scoring. This study also found that TE items and SR
items were equally efficient (i.e., they provided equivalent amounts of information per unit of
time). The researchers concluded that their statistical analyses support the use of TE items in K-
12. However, they were also careful to note that further psychological testing (e.g., cognitive
labs) should be conducted to confirm the results of their statistical testing. The researchers also
advised caution in using TE items when standard SR items are able to measure a construct. “We
reviewed the test forms administered in this study and found that a number of innovative items
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 5/46
could be easily replaced by [multiple choice] items without changing the [knowledge, skills, and
abilities] measured. This issue is not uncommon in the development of innovative assessments:
the face validity of innovative item formats is so appealing that their real potential of providing
something beyond what is available using the [multiple choice] format may be overlooked (ibid.,
p. 74).” This is supported by other researchers who warn of innovation solely for innovation’s
sake when traditional items can fully and accurately measure a construct (Haldyna, 1999; Sireci
& Zenisky, 2006). The researchers recommended both psychometric and psychological research
to determine when TE items are appropriate.
Research conducted by Pacific Metrics Corporation, with funding from the U.S. Department of
Education through a Grant for Enhanced Assessment Instruments, explored SR items, CR items,
and TE items in the context of seventh grade mathematics and Algebra I. The researchers found
no significant difference when they compared the correlation between a test comprised of CR
and TE items and teacher ratings of student knowledge with the correlation of a test comprised of
SR items and teacher ratings of student knowledge. The CR/TE test was reviewed by experts and
found to be similar to the SR test in terms of measuring the intent of the standards and the depth
of knowledge. The CR/TE test was found to be significantly more reliable and provide more
information than the SR test (Winter, Wood, Lottridge, Hughes, & Walker, 2012). “These results
indicate that tests incorporating CR/TE items can measure some mathematics content with less
error than tests comprising only [SR] items (ibid, p. 53).” This study provides promising results,
but must be generalized cautiously because of the narrow content focus and the blending of text-
based CR items and TE items on the same test form.
The VTAG project contributes to this small but critical base of research related to the validity of
TE items, differing from previous efforts and making new contributions to the research base in
two important ways. First, VTAG used a mixed-methods design. The researchers conducted an
analysis of qualitative data collected via cognitive labs with small samples and statistical analysis
of field test data from larger samples (using a field test). (he latter analysis is the focus of the
current paper. Second, the VTAG project focuses on a broad content area not previously studied:
elementary geometry.
2. Selected Response and Technology-Enhanced Item Types
The VTAG project uses the taxonomy of items defined by Scalise and Gifford (2006) to
determine whether an item is categorized as SR or TE. This taxonomy, displayed in Figure 2,
organizes items by the degree of constraint placed on the student’s options for responding to or
interacting with the item. The taxonomy does not classify items based on the method or mode of
interaction required by the student or the media included in the item.
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 6/46
For example, the item displayed in Exhibit A.1(a) is a traditional multiple-choice item. This item
is clearly SR. The item could be reformatted, however, to use a different method of interaction.
In Exhibit A.1(b), a similar item is rewritten using a drag and drop interface. Exhibit A.1(c)
shows another similar item that uses a hotspot interface.
For the VTAG project, the researchers argue that the interface or mode of interaction is irrelevant
and that all three of these items are SR because they have the same degree of constraint: the
student is asked to select one option out of four options. Although Exhibit A.1(b) and Exhibit
A.1(c) use a more modern and interactive interface, the content of the items and the extent to
which they ask students to select (rather than produce) content is the same as the item displayed
in Exhibit A.1(a); all three of the items displayed in Exhibit A.1 are consistent with the “Multiple
Choice” category of the Scalise and Gifford taxonomy (displayed in Figure 2).
What defines an SR item is that it places a high degree of constraint on the student’s response,
thus asking them to select, rather than produce, a response. A TE item, on the other hand,
requires the student to produce a response. For the VTAG project, the item must also be able to
be automatically scored. For example, the item in Exhibit A.2(a) asks a student to produce
(draw) a shape as their response. Although this item is similar to what could be offered on paper-
and-pencil as a (non-text-based) CR item, the item is administered via computer and can be
Figure 2: Taxonomy of Item Types based on Level of Constraint (Scalise & Gifford, 2006, p. 9)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 7/46
automatically scored via the computer; this difference makes the item TE for the purposes of the
VTAG project. Other items require a student to produce a different kind of response compared to
what could be elicited from a paper-and-pencil CR item. For example, the item displayed in
Exhibit A.2(b) asks students to move objects into various categories or buckets, which would be
more difficult to directly translate to a traditional non-computer based administration. The key
distinctions again of why this item is considered TE for the VTAG project is that it requires a
student to produce a response and can be automatically scored.
Some TE items do ask students to select a response. But, unlike the SR items, the combination of
possible responses from which to select result in an item with a higher cognitive complexity.
While the items do place more constraint on the student than one in which the student is truly
producing the response, they place much less constraint than a selection of one in four/five/or
six. For example, the item in Exhibit A.3 is classified as a TE item. In theory, this item could be
rewritten as a series of three multiple choice items, each with a list of all 36 possible street pairs
as the options. Thus, while technically the student is selecting their response, not producing it,
they are selecting it from a large enough sample of responses that this item is cognitively more
akin to item types that require a produced response.
To summarize, VTAG researchers defined SR items as those that require a student to select a
response from a small set of responses. SR items place a high degree of constraint on the
response. TE items were defined as those that require a student to either produce a response or
select a response from a very large set of responses (which are not explicitly enumerated). TE
items place a low degree of constraint on the response.
It should be noted that the VTAG project aimed to contribute to the current knowledge base in
topics that have been less explored, including those that would inform the ongoing and
increasing use of computer-based assessments in large-scale summative and classroom-based
formative diagnostic assessment. To this end, the project compared the measurement qualities of
TE items to SR items, not to CR items. The researchers acknowledge the possibility (even
probability!) that many CR items can provide improved measurement of constructs. However,
many of these items either cannot currently be automatically scored or require substantial
investment to establish automated scoring algorithms. Thus, the current study addresses the
comparison of item types that can be widely administered and instantly scored by the computer
(i.e., TE vs SR). The researchers further acknowledge the large body of existing research
exploring automated text and numerical response analysis; they acknowledge that many of these
types of CR items can be automatically scored. However, in the interest of contributing to the
current knowledge base in an area less explored, those item types were not included in the
VTAG project.
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 8/46
2.1. Selected Response Item Types
As described above, for the purposes of the VTAG project, SR items are items that place a high
degree of constraint on the response. The VTAG project defines four SR item types: standard
multiple choice, select-all, complex multiple choice, and matching. These items come primarily
from the two most constrained categories from the Scalise and Gifford (2006) taxonomy
(Multiple Choice and Selection/Identification) and approximately correspond to levels 1C
(Conventional or Standard Multiple Choice), 2A (Multiple True/False), 2C (Multiple Answer),
2D (Complex Multiple Choice), and 3A (Matching) (as displayed in Figure 2).
Standard multiple choice items and select-all items require a student to select one or more
options from a small set of options. Exhibit A.4 shows example standard multiple choice items;
Exhibit A.5 shows example select-all items. Complex multiple choice items are a combination of
more than one standard multiple choice or select-all items. These types of items could be revised
into multiple items that use either a standard multiple choice or a select-all format (each with a
reasonable number of response options). Exhibit A.6 shows example complex multiple choice
items. Matching items require a student to connect pieces of content. There is a one-to-one
correspondence between the content, i.e., if content A, B, C, and D is matched with 1, 2, 3, and
4, exactly one letter will be matched to exactly one number. Exhibit A.7 shows example
matching items. Note that while a more interactive interface may be employed with the
aforementioned SR items, they are still considered SR because of the degree of constraint placed
on student responses.
2.2. Technology-Enhanced Item Types
TE items place a lower degree of constraint on the response. The VTAG project defines five TE
item types: categorization, hotspot, matrix completion, figure placement, and drawing. These
items come primarily from the constraint categories 3-6 from the Scalise and Gifford (2006)
taxonomy (Reordering/Rearrangement, Substitution/Correction, Completion, and Construction)
and approximately correspond to levels 3B (Categorizing), 3C (Ranking and Sequencing), 4C
(Limited Figural Drawing), 5D (Matrix Completion), and 6B (Figural Constructed response) (as
displayed in Figure 2).
Categorization items require students to sort or classify content. It’s important to note that, in
theory, categorization items could be rewritten as SR items; however, many of these items would
have too large a response space, and thus be unworkable to administer in SR format. Exhibit A.8
shows examples of categorization items. For categorization items, the general guideline used in
the VTAG project was that if at least one “content tile” (i.e., object to be classified) belonged in
more than one “bucket” (i.e., category), the item would likely be classified as a TE item. Some
items fit this description but overall had a small, constrained response space (e.g., the item
displayed in Exhibit A.6); these items were instead classified as SR instead of TE because of the
higher level of constraint placed on the response options.
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 9/46
Hotspot items require the student to identify part of a graphical figure. Exhibit A.9 shows
examples of hotspot items. Similar to the categorization items, it is often possible to use the
hotspot interface to write SR items. The general guideline used for hotspot was that, for a hotspot
item to be considered a TE item, the placement of the response within the overall visual space
must be critical to the overall item context (e.g., select the parallel lines on the images below or
select all coastal cities on the map below).
Matrix completion items require students to place content into pre-defined spaces within the
problem space. This includes items asking students to rank or order content. Exhibit A.10 shows
examples of this type of item. Figure placement items require students to place content within the
larger space, which do not have pre-defined placement options (e.g., no “slots” or “buckets” in
which to drop objects). Exhibit A.11 shows example items. Drawing items require students to
draw a response, including the drawing of points, line segments, lines, rays, angles, or polygons.i
Exhibit A.12 shows example drawing items.
3. Methodology
The researchers administered SR, TE, and CR items, along with surveys, to fourth grade students
in a field test designed to collect data to address RQ1 and RQ2.
3.1. Instruments
The researchers developed a set of approximately ten SR and ten TE items to target each of the
three standards in fourth grade geometry. The items were revised based on feedback from
internal project content experts, independent experts on the project advisory board, and based on
the qualitative analysis of data collected from cognitive labs (Masters, in press; Masters,
Famularo, King, 2016). These items were assembled to create three fourth grade test forms, each
of which consisted of two sub-forms, covering standards 4.G.A.1 (Test AB), 4.G.A.2 (Test CD),
and 4.G.A.3 (Test EF). Table 1 presents test construction information for these three test forms
(six sub-forms), showing the breakdown in the number of SR and TE item types. The tests were
presented to students in sub-forms. For example, a student might take sub-form A in a single
sitting and then complete sub-form B in another sitting.
A small set of text-based CR items was developed for each of the targeted standards. CR items
are generally accepted as capable of providing a strong and reliable measure of student
knowledge (e.g., Bennett, 1993; Hickson & Reed, 2009; Livingston, 2009). Just like any item,
the ability to provide an accurate measure is dependent on the quality of the item, the context of
administration, and the scoring approach or rubrics (which is particularly important when it
involves human, subjective scoring). However, well-designed and administered CR items
(including text-based CR items) are often considered an improved form of measurement (albeit
one that cannot be administered or scored at a large scale). Accordingly, for VTAG, the CR
items were designed to serve as an independent measure of student knowledge, something that
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 10/46
could be considered as close to the true measure of the student’s knowledge as is possible to
measure within the current research context.
Table 1: Test Construction
Number of SR Items Number of TE Items
Standard
Tes
t
Su
b-F
orm
Sta
nd
ard
Mu
ltip
le
Ch
oic
e
Sel
ect-
All
Co
mp
lex
Mu
ltip
le
Ch
oic
e
Mat
chin
g
To
tal
Cat
egori
zati
on
Ho
tsp
ot
Mat
rix C
om
ple
tio
n
Fig
ure
Pla
cem
ent
To
tal
4.G.A.1 A 1 4 0 1 6 2 2 1 0 8
B 3 2 2 1 8 4 1 0 0 8
4.G.A.2 C 3 3 1 0 7 4 0 0 0 6
D 1 3 1 0 5 3 0 2 0 8
4.G.A.3 E 4 0 0 2 6 1 0 2 0 8
F 2 5 1 0 8 2 0 0 2 6
The researchers collected released items and item contexts from state and national tests,
modifying the language, targeted content, or item structure when necessary.ii To supplement the
limited pool of publicly available test items, the VTAG researchers developed additional CR
items specifically for use in this project. Exhibit A.13 shows example CR items.
A Technology Survey was designed to measure students’ comfort with, proficiency with, access
to, and frequency of use related to technology, as well as their attitude towards technology and
attitude towards mathematics. These areas were chosen because they represent the various
factors that might influence student performance on TE items. The researchers gathered released
survey items from various years of the fourth grade National Assessment of Educational
Progress (NCES, 2016) and Trends in International Mathematics and Science Study (NCES,
2018) surveys. Some questions were modified based on other smaller surveys (e.g., the Use,
Support, and Effect of Instructional Technology Project) (UseIt, 2018). The survey items were
reviewed internally by project experts, reviewed by members of the external VTAG expert
advisory board, and piloted with a sample of over 740 elementary students; the survey items
were revised accordingly.
3.2. Data Collection
The researchers collected initial validity evidence based on test content, e.g., the extent to which
the assessment represents the domain of interest with minimal construct-irrelevant information.
This type of evidence was collected through review by the expert advisory panel. Because
evidence based on test content is a fundamental underpinning of validity, item revisions were
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 11/46
made based on the feedback of the expert advisors before the items were used to collect
additional validity evidence.
Cognitive labs, or think alouds, are face-to-face interactions during which a researcher observes
and evaluates a student’s cognitive processes. Cognitive labs have become a widely used method
of gathering evidence related to the validity of inferences made by assessments, specifically
evidence about whether assessment items are measuring the intended constructs (Dolan,
Goodman, Strain-Seymour, Adams, & Sethuraman, 2011; Ericsson & Simon, 1999; Gorin, 2006;
Willis, 1999). This type of qualitative evidence adds significant value to more traditional,
quantitative validity evidence collected from larger samples (Beatty & Willis, 2007; Willis,
1999; Zucker, Sassman, & Case, 2004).
The expert review and cognitive labs were used both to collect validity evidence- evidence based
on test content and based on student thought processes. They were also used to revise and refine
the items used in the field test and described in the current paper (Masters, in press; Masters,
Famularo, King, 2016). The cognitive labs also revealed that: a) the increased number of
components in the TE items resulted in those items being both more difficult and more time
intensive; b) students generally preferred the most interactive TE items; and c) when the
technology of the testing interface was not a barrier, the TE items provided an equally accurate,
and sometimes improved, measure of knowledge (ibid). These results contributed to the overall
understanding related to the research questions and helped inform the final revisions to the items
and testing system that was used in the field test that is described in the current paper.
In spring 2016, the researchers conducted a field test of the SR and TE items, to collect validity
evidence based on internal structure and based on the relationship to other variables. All SR and
TE items were instantly computer-scored so that teachers received immediate feedback with
student results. In addition, students completed the CR items, a background survey, and the
Technology Survey (all administered via computer).
4. Results
There were 384 students that completed the combined Test Form AB (targeting standard
4.G.A.1), 374 students that completed the combined Test Form CD (targeting standard 4.G.A.2),
and 342 students that completed the combined Test Form E F (targeting standard 4.G.A.3). Table
2 presents information about student gender, race, ethnicity, and the frequency of when a
language other than English was spoken in the home. Answering these questions was voluntary
for students; thus, percentages may not always add up to 100%.
Table 3 shows the average test scores, ranges of item difficulty, and ranges of item-total
correlations. For the item analysis described throughout this section, all items were worth 1
point. Some items were dichotomously scored and others were scored with partial credit of 0,
0.5, or 1.iii
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 12/46
Demographic Subgroup Value
Total n 425
Gender Female 53.20%
Male 38.60%
Race
Asian 5.20%
American Indian / Alaskan
Native 7.80%
Black 22.40%
Native Hawaiian / Pacific
Islander 4.20%
White 42.60%
Ethnicity Hispanic / Latino 12.50%
Language Other
than English
All or most of the time 12.20%
About half of the time 5.40%
Once in a while 12.20%
Never 58.10%
Item Set Number of
Items Average Score
Range of Item
Difficulty
Range of Item-
Total Correlations
Test Form AB (Standard 4.G.A.1)
Combined SR
and TE Items 30 11.065 / 30 = 37% 0.063-0.865 0.269-0.644
SR Items 14 6.307 / 14 = 45% 0.129-0.865 0.216-0.560
TE Items 16 4.758 / 16 = 30% 0.063-0.543 0.365-0.587
Test Form CD (Standard 4.G.A.2)
Combined SR
and TE Items 24iv 7.253/ 24 = 30% 0.055-0.74 0.293-0.713
SR Items 11 4.350 / 11 = 40% 0.156-0.74 0.304-0.623
TE Items 13 2.902 / 13 = 22% 0.055-0.72 0.275-0.652
Test Form EF (Standard 4.G.A.3)
Combined SR
and TE Items 28 11.610 / 28 = 41% 0.039-0.824 0.222-0.631
SR Items 14 6.949 / 14 = 50% 0.156-0.824 0.326-0.602
TE Items 14 4.661 / 14 = 33% 0.039-0.614 0.191-0.624
Table 2: Participating Student Demographics
Table 3: Item Difficulties and Item-Total Correlations
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 13/46
Table 4 presents the results of a two-way ANOVA examining item difficulty by test form (AB,
CD, EF) and item type (SR or TE). In addition, effect size for each factor was calculated as η2
(SSeffect/SStotal). The results show a significant effect of Item Type (F1,76 = 14.265, p < 0.001) and
a medium-to-large effect size (η2 = 0.158; Cohen, 1988). A follow-up t-test confirms that the TE
items were more difficult than the SR items (t = 3.825, df = 77.154, p < 0.001). The effect of test
form was non-significant and small indicating no difference in difficulty between Test AB, Test
CD, and Test EF. Similarly, Table 5 presents the results of a two-way ANOVA examining item-
total correlation by test form and item type. The results show no significant effects of test form,
item type, or the interaction.
Factor df Sum of
Squares F sig η2
Test Form 2 0.171 2.291 n.s. 0.048
Item Type 1 0.533 14.265 p < 0.001 0.150
Interaction 2 0.001 0.014 n.s. 0
Residuals 76 2.837 0.801
Factor df Sum of
Squares F sig η2
Test Form 2 0.010 0.421 n.s. 0.011
Item Type 1 0.015 1.258 n.s. 0.016
Interaction 2 0.011 0.456 n.s. 0.012
Residuals 76 0.883 0.962
4.1. Internal Structure via Classical Test Theory
RQ1 asks whether TE items provide valid evidence. One type of evidence is based on internal
structure of the assessments, i.e., the statistical properties of an assessment which support the
idea that respondents are applying the targeted construct without applying additional constructs
not of direct interest (Goodwin & Leech, 2003). This was first evaluated using methods from
classical test theory. Cronbach’s alpha was calculated for all items on the test, the subset of SR
items, and the subset of TE items.
As shown in Table 6, for all three tests, the reliability was high for the full set of items and for
the subsets of SR and TE items (α ≥ 0.70 is often considered as the threshold for acceptable
reliability of smaller, formative assessments while α ≥ 0.80 is often considered as the threshold
for acceptable reliability of large-scale, high-stakes assessment; see Salvia, Ysseldyke, & Bolt,
2007). The Spearman-Brown correction was applied to the SR reliability index to estimate the
value it would have if the number of SR points were equal to the number of TE points. Table 6
Table 4: Item Difficulty ANOVA and Effect Size
Table 5: Item-Total Correlation ANOVA and Effect Size
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 14/46
shows the results, which are comparable to the reliability analysis conducted without the
correction: both item sets showed high reliabilities and the TE items showed a marginally higher
reliability for one of the test forms (Test CD).
To further evaluate internal consistency (RQ1), an exploratory factor analysis was conducted
using a Varimax rotation on the full set of items, the SR subset, and the TE subset. As shown in
Table 7, the variance accounted for by the first factor in the full set of items was comparable to
the variance accounted for by the first factor in the SR and TE subtests. For Test Forms AB and
CD, the first factor accounted for a somewhat greater percentage of variance for the TE items
compared to the SR items or the full test form. For all three sets of all three tests, the items
loaded strongly on the first factor and the average factor loadings were strong.
These results indicate that for all three tests, the group of items as a whole, the subset of SR
items, and the subset of TE items each show evidence of forming a unidimensional scale.
The reliability and factor analysis results provide one source of evidence that the TE items are a
valid measure of knowledge of the targeted standard (RQ1); the strong internal structure
demonstrated by these items suggests that valid inferences may be drawn about the assessed
geometry standards.
One way to evaluate if there is an improved measure (RQ2) is to evaluate whether the TE items
provide better information than SR items. Better information might be indicated by larger
reliability indices, greater proportion of variance accounted for by the first factor, or stronger
factor loadings in comparing SR and TE item subsets. As shown in Table 6, the Spearman-
Brown corrected reliability indices for the TE items was higher than for the SR items for Tests
AB and CD. The difference in reliabilities was statistically significant (p<0.05, Diedenhofen and
Musch, 2016) for only Test AB. For Tests AB and CD, the TE items also showed stronger
internal structure based on the other characteristics (including higher factor loadings).
4.2. Internal Structure via Item Response Theory
Item response theory (IRT) was employed to further examine the internal structure of the
assessments. A two-parameter logistic (2PL) model was fit to the dichotomous items and a
graded response model (GRM) was fit to the polytomous items, allowing for the IRT models to
accommodate differences across items with respect to both difficulty and discrimination.v
Similar to the study by Jodoin (2003), this analysis focused on the Test Information Functions
(TIFs), which are derived from the IRT item parameters estimated separately for the SR and TE
subtests.
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 15/46
Item Set Number
of Items
Reliability
(Cronbach’s α) Reliability
(Spearman-Brown)
Test Form AB (Standard 4.G.A.1)
Combined SR
and TE Items 30 0.897 --
SR Items 14 0.789 0.84
TE Items 16 0.834 0.84
Test Form CD (Standard 4.G.A.2)
Combined SR
and TE Items 24 0.889 --
SR Items 11 0.790 0.80
TE Items 13 0.827 0.84
Test Form EF (Standard 4.G.A.3)
Combined SR
and TE Items 28 0.902 --
SR Items 14 0.829 0.84
TE Items 14 0.818 0.82
Item Set Number of
Items
% of Variance
Accounted for
by First Factor
Range of Factor
Loadings on
First Factor
Average Factor
Loadings
Test Form AB (Standard 4.G.A.1)
Combined SR
and TE Items 30 27.095% 0.307-0.680 0.510
SR Items 14 28.076% 0.289-0.677 0.513
TE Items 16 31.095% 0.430-0.668 0.553
Test Form CD (Standard 4.G.A.2)
Combined SR
and TE Items 30 31.495% 0.314-0.770 0.547
SR Items 14 33.422% 0.378-0.748 0.566
TE Items 16 36.906% 0.357-0.755 0.594
Test Form EF (Standard 4.G.A.3)
Combined SR
and TE Items 28 28.760% 0.253-0.690 0.526
SR Items 14 31.852% 0.402-0.705 0.557
TE Items 14 31.178% 0.248-0.730 0.545
TIF is inversely proportional to the conditional standard error of measurement for the estimate
of theta (the latent trait variable), which is the error distribution of theta at various ability levels
(Lohman & Al-Mahrazi, 2003). To oversimplify, this can be restated as such: when the test gives
more information, there is less uncertainty in the estimate of the respondent’s ability.
Table 6: Reliability Analyses
Table 7: Factor Analysis
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 16/46
Interpretation of TIF values can be difficult, but one way to compare TIF peak values is to note
the relationship between TIF peak values and respondent ability.
Because the TIF is linearly related to the number of items (and, thus, the number of points) on a
test, the use of TIF as a comparison must allow for the adjustment of the TIF for the SR items
upward, to account for there being fewer SR points; this adjustment is analogous to the
Spearman-Brown correction applied when comparing the reliabilities.
Plots of the adjusted TIF values are presented in Figure 3 for the SR (solid line) and TE (dashed
line) items. Table 8 shows the comparison of the adjusted peak values.
The difference in TIF peak values between two tests can allow us to compare the standard errors,
or measurement precision, of two tests. The peak of the SR and TE TIF was comparable for Test
AB; the TE peak was notably higher for Test CD; the SR peak was slightly higher for Test EF.
Item Set Peak TIF Ratio TIF
Test Form AB (Standard 4.G.A.1)
SR Items 8.72 0.998
TE Items 8.75 1.002
Test Form CD (Standard 4.G.A.2)
SR Items 6.21 0.829
TE Items 9.04 1.207
Test Form EF (Standard 4.G.A.3)
SR Items 8.00 1.069
TE Items 7.00 0.935
By taking the square root of the ratio of the TIF peak value for the TE items on Test CD to the
TIF peak value for the SR items on Test CD, it can be estimated that scale scores from a test
composed only of SR items would have a standard error that is approximately 20% larger than
Figure 3: TIF Curves (Solid Line- SR, Dashed Line- TE)
Table 8: Peak TIF Values
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 17/46
the standard error of the scale scores from a test composed only of TE items. This indicates that
this set of TE items did, in fact, provide better information compared to the SR items. Using this
same analysis, the comparison of TIF peak values for Tests AB and EF indicate that the
difference is insignificant.
It should also be noted that the TIF peak value for the TE items always occurred at a higher
value of theta than for the SR items. The SR items had greater TIF than the TE items at lower
values of theta, while the opposite was true at higher values of theta. These differences in TIF
shape indicate that the TE items tended to have notably higher difficulty than the SR items, and
this difference was the primary difference in TIF between the SR and TE items for these tests.
The TIF analysis, resulting from IRT estimation, provides evidence that not only do the TE items
provide comparable statistical information or certainty about student ability as the SR items
(RQ1), but may provide differential certainty, i.e., an improvement over SR items (RQ2).
Assessment via TE items for Test CD demonstrated approximately 20% less error than
assessment via the traditional SR items.
4.3. Comparison with CR Items
Another source of validity evidence is based on the relationship of test scores to other variables,
which describes the extent to which an assessment provides information consistent with other
information about the examinee with respect to the construct of interest. The researchers used the
CR items as the external measure of geometry knowledge for comparison.
Students’ performance on the CR items is summarized in Table 9. The average scores and item
difficulties are generally comparable to performance on the SR and TE items, separately and
combined (displayed in Table 3), with the CR items tending to be slightly more difficult.
A one-way ANOVA was conducted across items difficulties (SR, TE, and CR items) along with
effect size (η2) and necessary follow-up analyses (displayed in Table 10). Test form was
excluded from analysis having previously been shown to be a non-significant factor in item
difficulty. Again, item type was found to be a significant factor in item difficulty (F2,94 = 8.89, p
< 0.001), accounting for 15.9% of the total variance in item difficulty. Follow-up t-tests indicated
again that TE items are more difficult than SR items (t = 3.825, df = 77.154, p < 0.001) and that
CR items are also more difficult than SR items (t = 3.4087, df = 47.009, p < 0.001); no
significance difference was demonstrated between TE and CR item difficulty.
Test Form Number of
Items Average Score
Range of Item
Difficulty
Test Form AB 15 5.3313 / 15 = 36% 0.148-0.628
Test Form CD 15 5.2120 / 15 = 35% 0.151-0.603
Test Form EF 15 5.3957 / 15 = 36% 0.153-0.621
Table 9: CR Item Set Difficulties
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 18/46
Factor df Sum of
Squares F sig η2
Item
Type 2 0.595 8.890 p < 0.001 0.159
Residuals 94 3.146 0.841
Students’ performance on the SR and TE items was compared to their performance on the CR
items. As displayed in Table 11, the combined items and the two subsets of items all displayed a
strong and statistically significant correlation to the CR items.
The strong, positive correlations to CR item performance provides another source of evidence
that the TE items provided a valid measure of knowledge (RQ1). It should also be noted that the
SR and TE items were strongly and statistically significantly correlated to each other
(r = 0.810** for Test AB, r = 0.755** for Test CD, and r = 0.813** for Test EF), which suggests
that the two tests might be measuring the same construct, the construct measured by the
independent measure (the CR items), which is being used as an independent estimate of the
student’s knowledge. This further indicates that the TE items are a valid measure of knowledge.
Again, TE items might be considered an improved measure if they provide a better measure. As
shown in Table 11, for all three tests, the correlation between the TE items and the CR items was
stronger than the correlation between the SR items and the CR items, although, again, both
correlations were strong and statistically significant. The correlation between performance on the
CR items and either SR or TE items is compared according to Fisher’s r-to-z transformation
(Preacher, 2002) indicating no significant difference for Test AB (z = -1.655, p > 0.05) or Test
EF (z = -0.342, p < 0.05). However, the correlation with CR item performance is stronger for TE
items than SR items on Test CD (z = -2.021, p < 0.05). The significance testing indicates that, for
Test CD, the TE items might be a better measure (i.e., closer to the measure provided by the CR
items) of that construct compared to the SR items.
Table 10: Item Difficulty ANOVA and Effect Size
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 19/46
Item Set Correlation
with CR Items
Test Form AB (Standard 4.G.A.1)
Combined SR
and TE Items r = 0.797**
SR Items r = 0.732**
TE Items r = 0.782**
Test Form CD (Standard 4.G.A.2)
Combined SR
and TE Items r = 0.771**
SR Items r = 0.689**
TE Items r = 0.759**
Test Form EF (Standard 4.G.A.3)
Combined SR
and TE Items r = 0.682**
SR Items r = 0.643**
TE Items r = 0.658**
* p<0.05, ** p < 0.01
4.4. Survey Data and Time
The researchers explored the relationship between subtest performance and students’ technology
affect, frequency of technology use, and students’ math affect. Table 12 shows correlations
between these factors and the three test forms plus the CR items.
Student affect towards technology was demonstrated weak positive correlations to each of the
four sets of items (r = 0.195 to 0.261, p < 0.05). These results suggest no practical relationship
between technology affect and performance on SR, TE, and CR item types. Frequency of
technology use was weak and negatively correlated to each of the four sets of items (r = -0.098 to
-0.187, p < 0.05), suggesting no practical relationship between the frequency with which students
use technology and their performance on any of the item types. Lastly, , student affect towards
math in general demonstrated weak positive correlations to all four item types, indicating that
students’ opinions towards math were not meaningfully related to their performance on any of
the item types.
It is often the case that TE items take longer to complete, which is often considered a potential
negative factor that can counter potential benefits of TE items (unless TE items provide more
information, thus potentially justifying the extra time required). Therefore, average time to
complete each item was compared across TE and DR items. The median seconds to answer the
SR items was 42, 26, and 11 for Test AB, Test CD, and Test EF, respectively. The median
number of seconds to answer the TE items was 64, 63, and 24. For each test, the difference in
time to answer by item type was statistically significant (p < 0.01).
Table 11: Correlations with CR Items for Fourth Grade
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 20/46
Table 12: Correlations with Student Survey Data for Fourth Grade
Item Set Technology
Affect
Frequency of
Technology
Use
Math
Affect
Test Form AB (Standard 4.G.A.1)
Combined SR
and TE Items r = 0.225** r = -0.110* r = 0.263**
SR Items r = 0.211** r = -0.111* r = 0.259**
TE Items r = 0.218** r = -0.098 r = 0.243**
CR Items r = 0.244** r = -0.139* r = 0.246**
Test Form CD (Standard 4.G.A.2)
Combined SR
and TE Items r = 0.222** r = -0.182** r = 0.274**
SR Items r = 0.202** r = -0.187* r = 0.286**
TE Items r = 0.216* r = -0.152** r = 0.227**
CR Items r = 0.261** r = -0.143* r = 0.241**
Test Form EF (Standard 4.G.A.3)
Combined SR
and TE Items r = 0.220** r = -0.179** r = 0.236**
SR Items r = 0.223** r = -0.173** r = 0.213**
TE Items r = 0.195** r = -0.169** r = 0.239**
CR Items r = 0.236** r = -0.161* r = 0.239**
* p<0.05, ** p < 0.01
These results show a weak relationship between student performance on the items and both
technology use and affect towards technology. This suggest that technology use and affect do not
present construct-irrelevant barriers to students’ performance on the test. These are positive
results in support of the use of TE items, as issues related to technology use and affect are often
considered potential reasons to avoid TE items. The results did show that the TE items took
significantly more time to complete, but that finding must be considered along with the overall
findings about whether the items also provide more or better information (Jodoin, 2003).
5. Discussion
A condensed summary of the results is provided in Figure 4. The analyses clearly provide
evidence of validity of inferences made from the TE items (RQ1). The evidence on whether the
TE items provided improved measures (RQ2) is somewhat inconsistent. In several cases, there
was evidence to support better or more information being obtained from the TE items. The
evidence was not generalizable across all tests or overwhelmingly conclusive for any individual
test, particularly considering the equally strong performance of the SR items. The analysis seems
to support the development of tests with a combination of SR and TE items as the best
measurement of student knowledge. It is clear that the circumstances under which TEs will
provide an improved measure are very context-dependent.
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 21/46
RQ Evidence Test
AB CD EF
RQ1
internal structure (reliability)
internal structure (unidimensionality)
internal structure (information)
relationship to other variables (correlation with
CR items)
lack of evidence of the effect of construct
irrelevant factors (technology and affect)
RQ2
comparable item difficulties between CR and TE
items (compared to significant differences in
difficulty between CR and SR items)
internal structure (significant improved
reliability) of TE over SR items
Stronger internal structure (unidimensionality)
for TE versus SR items
more information (higher TIF values and smaller
IRT-based standard errors) for TE versus SR
items
stronger relationship to other variables
(correlation between performance on CR items
and TE items, versus CR items and SR items)
Figure 4: Summary of Evidence from Field Test
There was strong evidence from all three tests that the TE items were a valid measure of student
knowledge for the three targeted standards (RQ1). The TE items had strong internal structure,
evidence of unidimensionality, and a strong relationship to an external measure of knowledge.
Further, there was no evidence found to support what are often considered potential negative
confounding factors that inhibit the ability of TE items to provide an accurate measure of
knowledge. Specifically, there was no evidence of a meaningful relationship between
performance on TE items and students’ opinions towards technology, frequency of use of
technology, or opinions towards math. These findings alleviate some of the more common
concerns about the use of TE items. Thus, there is strong evidence related to RQ1.
There was some evidence that TE items provided an improved measure (RQ2), but the evidence
was not consistent across tests. Tests CD had the most consistent evidence that TE items
provided an improved measure. For this test, TE and CR items demonstrated comparable
difficulties, indicating that, despite the TE items being more difficult than the SR items, they
were closer to the independent measure of knowledge, as provided by the CR items. This idea
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 22/46
was also supported by the significantly stronger correlation between the TE and CR items,
compared to the SR and CR items. The TE items had significantly higher reliability and higher
average factor loadings. Additionally, the IRT analysis provided information indicating that the
TE items on Test CD provided more statistical information, and therefore, improved
measurement over the SR items.
The evidence was less consistent for the other two tests. For Test AB, the TE items had higher
average factor loadings compared to the SR items and they had a significantly stronger
correlation to the CR items compared to the SR items. For Tests EF, the only evidence that TE
items might provide improved measurement was non-significant difference between TE and CR
item difficulties.
There are many indications in the data that, from an assessment development perspective, when
limited only to items that can be computer-scored, the best way measurement might be achieved
by including both SR and TE items. For all three tests, the individual item subsets and the
combined item sets all showed strong and often comparable measures of internal structure. And
the evidence of TE items providing improved measurement was inconsistent. Thus, under what
circumstances TEs will provide an improved measure might be incredibly context-dependent,
and, thus, without further research on specific contexts, the best approach is likely a blend of
item types. Further, the inclusion of SR items can mitigate the effects of the increased time
required to respond to TE items.
This recommendation is supported by an examination of the context of the three targeted
standards. The standards contained a mix of components, asking students to draw (4.G.A.1- Test
AB, 4.G.A.3- Test EF), identify (4.G.A.1, 4.G.A.2, 4.G.A.3, all three tests), classify (4.G.A.2-
Test CD), and recognize (4.G.A.2, 4.G.A.3- Test CD and Test EF). Two of these actions,
identifying and recognizing are more likely to be able to be accurately measured by an item with
a high level of constraint, meaning a typical SR item. By definition, these actions do not require
a student to produce their own response. Drawing, on the other hand, by definition does require a
student to produce a response, and thus it is perhaps best to measure this ability using an item
with a lower level of constraint (a TE item). The ability to classify appears to also be more
accurately measured by TE items with less constraint. While a more constrained, typical SR item
certainly can measure student’s ability to classify, it also allows guessing and “chance
classifications.” A less constrained, TE-style item requires the student to classify with less
opportunity for guessing. A TE-style item also allows for greater flexibility in the grouping of
objects, which might allow response options to more closely match a student’s true
understanding. The three targeted standards contain a mix of components, some of which would
be seemingly more appropriate to measure with more constrained SR items and others with less
constrained TE items. This helps contextualize the results and supports the idea that the content
being measured should be the driving force in selecting appropriate item types and that a blend
of item types might often be the best approach.
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 23/46
TE items are quickly becoming ubiquitous and thus it is critical to generate a research base about
the quality of measurement and the validity of inferences based on TE items in a variety of
contexts. The current paper contributes to this small but important base of research. The results
provide evidence that TE items can provide a valid measure of geometry standards in the
elementary grades
Based on the VTAG findings, the researchers recommend that item developers not consider TE
items a panacea to solve all modern measurement challenges. Rather, TE item types should be
treated as one of many tools in the item developer’s toolbox. This represents a kind of back to
basics for item writing: the selection of item type should be based on the construct being
measured. TE items should not be automatically favored over SR items because it is assumed
they will always provide a better measure. Whether TE items will provide improved
measurement is likely to be highly dependent on the construct. Item writers should consider the
action students will take to demonstrate knowledge to inform the selection of item type. For
example, measuring a student’s ability to identify or recognize might be actions that can be
accurately and fully measured with an SR item because the action can be captured even when a
high degree of constraint is placed upon the response options. Measuring a student’s ability to
draw, classify, represent, or interpret might be better measured with a TE item because those
actions might be better demonstrated with a less constrained response space.
The researchers further recommend that TE items should not be avoided on the basis that they
are typically more expensive to develop and test, they might introduce construct-irrelevant
factors, or because they require more time to answer. Well-designed TE items do have potential
to provide improved measurement of some constructs. Thus, again, item developers should
choose the item type based on the best way to measure a specifically targeted construct,
considering the varying levels of constraint represented by SR and TE item types as the broad set
of possible item types from which to choose.
There are limitations of the VTAG project. Participating teachers and students were small
samples of convenience, limiting statistical the generalizability and available statistical
procedures. With larger, representative samples, more complex IRT models could be estimated
and the impact of guessing evaluated. Finally, the research presented was conducted only with
students that did not require accommodations during testing. Future research should explore the
same questions posed by VTAG, using the VTAG item sets, but with the full breadth of the
student population, including students with a variety of visual, auditory, psychical, and motor
disabilities. The researchers acknowledge the strong (and continually growing) base of research
related to TE accommodations and acknowledge the VTAG project did not include test
accommodations.
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 24/46
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant
Number 1316557. Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of the National
Science Foundation.
i For drawing items, given the targeted age range, the interface decision was made to use “snap-to” behavior,
meaning that if the student clicks somewhere in between grid lines, rather than directly on the intersection of two
grid lines, the point would be placed on the nearest grid line. This is one of two possible behaviors for graphing (the
other being allowing a student to place a point anywhere on the grid). Given that the constructs being measured did
not relate to concepts involving fractions or decimals, and that the item writers and researchers purposefully avoided
fractions and decimals due to the age range and confounding influence of these mathematical ideas, the snap-to
behavior was chosen. The snap-to behavior also allows for more precise scoring algorithms that do not require
tolerances around correct values (which, in turn, require extensive field and validity testing). ii Because the majority of the items were released prior to the adoption of Common Core State Standards, and thus
were based on other standards, many of these items did not align clearly to the standards being targeted for the
VTAG project. Some required modification. Further, the researchers required that the items be strictly constructed-
response, meaning that items could not primarily use an interaction that could be used in a SR or TE item (e.g.,
selecting options, graphing, plotting, drawing, etc.). The researchers contended that the inclusion of these kinds of
items could influence the alignment between CR items and similar SR or TE items. For example, if the set of CR
items targeting a standard included items asking students to plot points, and the analysis later revealed a stronger
correlation between the CR and TE items (which also asked students to plot points) than between the CR and SR
items, one could argue that the correlation was a result of the interface rather than the accuracy of the construct
measurement. iii This assessment design was purposefully chosen to not presuppose that items supporting partial credit would be
worth more or weigh more heavily towards the overall test score. During initial item design, the number of points an
item was worth was made for each item on an individual basis, as was the decision of whether to award partial
credit. Half-points were not awarded; thus, items that awarded partial credit were, by default, written as 2-point
items. These are common conventions used when writing items for large-scale assessment. In response to the expert
advisors’ feedback, the researchers made all items worth one point. This decision was made so as not to assume that
any individual item or item type would return more information than any other, since whether or not this is the case
is part of what the VTAG project seeks to discover (i.e., do TE items provide more information than SR items). The
opportunity to earn partial credit (a score of 0.5) was used for all items where a response would indicate some level
of partial understanding without the demonstration of conflicting misunderstanding. Partial credit was awarded to
both SR items (select-all, matching, and complex multiplex choice types only) and TE items (all types). iv Two items were dropped from Test CD (one SR item and one TE item). These items both had very low item
difficulty statistics which means they were very difficult items (.073 and .005) and had very low or negative item-
total correlations, both when compared with the combined set of SR and TE items and with the SR or TE subset.
The analysis in this section is based on the CD test without these two items. v For the analysis using item response theory (IRT), response data was recoded such that items were either
dichotomously scored or polytomously scored as 0, 1, or 2.
Appendix A: Example Items
Note: Example items are presented to display SR and TE item type. Some items target the fourth
grade standards described in the current paper. Other items target fifth grade standards, also a
target of focus of the VTAG project.
Exhibit A.1: Examples of Traditionally-Formatted and Interactive SR Items
(a) (b) (c)
Exhibit A.2: Examples of TE Item
(a)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 26/46
(b)
Exhibit A.3: Example TE Item for which the Response Space is Too Large to
Administer as SR
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 27/46
Exhibit A.4: Example SR Items (Standard Multiple Choice)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 28/46
Exhibit A.5: Example SR Items (Select-All)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 29/46
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 30/46
Exhibit A.6: Example SR Item (Complex Multiple Choice)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 31/46
Exhibit A.7: Example SR Items (Matching)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 32/46
Exhibit A.8: Example TE Items (Categorization)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 33/46
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 34/46
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 35/46
Exhibit A.9: Example TE Item (Hotspot)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 36/46
Exhibit A.10: Example TE Items (Matrix Completion)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 37/46
Exhibit A.11: Example TE Items (Figure Placement)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 38/46
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 39/46
Exhibit A.12: Example TE Item (Drawing)
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 40/46
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 41/46
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 42/46
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 43/46
Exhibit A.13: Example CR Items
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 44/46
References Cited
Archbald, D. A. & Newmann, F. M. (1988). Beyond Standardized Testing: Assessing Authentic
Academic Achievement in Secondary School. Reston, VA: National Association of Secondary
School Principals.
Beatty, P. C. & Willis, G. B. (2007). Research synthesis: The practice of cognitive interviewing.
Public Opinion Quarterly, 71(2), 287-311.
Bennett, R. E. (1993). On the meaning of constructed response. In R. E. Bennett and W. C.
Ward, (Eds.), Construction versus Choice in Cognitive Measurement: Issues in Constructed
Response, Performance Testing, and Portfolio Assessment (pp. 1–27). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Bennett, R. E. (1999). Using new technology to improve assessment. Educational Measurement:
Issues and Practice, 18(3), 5-12.
Birenbaum, M., Kelly, A. E. & Tatsuoka, K. (1992). Towards a stable diagnostic representation
of students' errors in algebra. ETS-RR-92-58-ONR. ERIC Clearinghouse Doc. ED 356973.
Bryant, W. (2017). Developing a strategy for using technology-enhanced items in large-scale
standardized tests. Practical Assessment, Research & Evaluation, 22(1).
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Lawrence Erlbaum Associates
Darling-Hammond, L. & Lieberman, A. (1992). The shortcomings of standardized tests. The
Chronicle of Higher Education, B1-B2.
Diedenhofen, B., & Musch, J. (2016). cocron: A web interface and R package for the statistical
comparison of Cronbach’s alpha coefficients. International Journal of Internet Science, 11,
51–60.
Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology,
Learning, and Assessment, 5(1).
Dolan, R. P., Goodman, J., Strain-Seymour, E., Adams, J., & Sethuraman, S. (2011). Cognitive
Lab Evaluation of Innovative Items in Mathematics and English Language Arts Assessment
of Elementary, Middle, and High School Students. Iowa City, IA: Pearson Education.
Ericsson, K. A., & Simon, H. A. (1999). Protocol analysis: Verbal reports as data. Cambridge,
MA: Massachusetts Institute of Technology.
Goodwin, L. D. & Leech, N. L. (2003). The meaning of validity in the new Standards for
Educational and Psychological Testing: Implications for measurement courses. Measurement
and Evaluation in Counseling and Development, 36, 181-191.
Gorin, J. S. (2006). Test design with cognition in mind. Educational Measurement: Issues and
Practice, 25(4), 21-35.
Haladyna, T. M. (1999). Developing and validating multiple-choice test items. Hillsdale, NJ:
Erlbaum.
Harlen, W. & Crick, R. D. (2003). A Systematic Review of the Impact on Students and Teachers
of the Use of ICT for Assessment of Creative and Critical Thinking Skills. London: Institute
of Education, University of London.
Hickson, S., & Reed, W.R. (2009) Do constructed-response and multiple-choice questions
measure the same thing? Department of Economics. Working Paper Series. Retrieved March,
2018 from http://hdl.handle.net/10092/2465.
Huff, K. L. & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational
Measurement: Issues and Practice. 20(3), 16-25.
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 45/46
inTASC. (2018). USEiT: Use, Support, and Effect of Instructional Technology Study. Retrieved
March, 2018, from http://www.bc.edu/research/intasc/researchprojects/USEIT/useit.shtml
Jodoin, M. G. (2003). Measurement efficiency of innovative item formats in computer-based
testing. Journal of Educational Measurement, 40(1), 1-15.
Lane, S. (2004). Validity of high-stakes assessment: Are students engaged in complex thinking?
Educational Measurement: Issues and Practice, 23(3), 6-14.
Livingston, S. (2009, September). Constructed-response test questions: Why we use them; how
we score them. ETS R&D Connections, no. 11.
Lohman, D. F., & Al-Mahrzi, R. (2003). Personal standard errors of measurement. Retrieved
March, 2018 from https://faculty.education.uiowa.edu/docs/dlohman/personalSEMr5.pdf.
Masters, J. (in press) The Validity of Technology-Enhanced Assessment in Geometry (VTAG)
Project: Technical Report. Measured Progress.
Masters, J., Famularo, L., & King, K. (2016). Using Technology-Enhanced Items to Measure
Fifth Grade Geometry. Annual Meeting of the National Council on Measurement in
Education. Washington, D.C.
Mayer, R. E., & Moreno, R. (2003). Nine ways to reduce cognitive load in multimedia learning.
Educational Psychologist, 38, 43–52.
McFarlane, A., Williams, J. M., & Bonnett, M. (2000). Assessment and multimedia authoring- a
tool for externalizing understanding. Journal of Computer Assisted Learning, 16, 201-212.
National Council of Education Statistics (NCES). (2016). NAEP Tools and Applications.
Retrieved March, 2018, https://nces.ed.gov/nationsreportcard/about/naeptools.aspx/
National Council of Education Statistics (NCES). (2018). Trends in International Math and
Science Study. Retrieved March, 2018, https://nces.ed.gov/timss/educators.asp
Oklahoma State Department of Education (2017). Solicitation #2650000340. Retrieved March,
2018 from https://www.ok.gov/cio/documents/Solicitation2650000340.pdf
Partnership for Assessment of Readiness for College and Careers (PARCC). (2010). Application
for the Race to the Top Comprehensive Assessment Systems Competition. Retrieved
December, 2011 from
http://www.parcconline.org/sites/parcc/files/PARCC%20Application%20-%20FINAL.pdf.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (Eds.) (2001). Knowing What Students Know:
The Science and Design of Educational Assessment. Washington, DC: The National
Academy Press.
Preacher, K. J. (2002, May). Calculation for the test of the difference between two independent
correlation coefficients [Computer software]. Available March, 2018 from
http://quantpsy.org.
Quellmalz, E. S., Timms, M. J., & Schneider, S. A. (2009). Assessment of student learning in
science simulations and games. Paper prepared for the National Research Council Workshop
on Gaming and Simulations. Washington, DC: National Research Council.
Salvia, J., Ysseldyke, J. E., & Bolt, S. (2007). Assessment in special and inclusive education.
Wadsworth.
Scalise, K. & Gifford, B. (2006). Computer-based assessment in e-earning: A framework for
constructing “intermediate constraint” questions and tasks for technology platforms. The
Journal of Technology, Learning, and Assessment, 4(6).
Sireci, S. G. & Zenisky, A. L. (2006). Innovative item formats in computer-based testing: In
pursuit of improved construct representation. In S. M. Downing and T. M. Haladyna (Eds.),
Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 46/46
Handbook of test development (pp. 329-347). Mahwah, NJ, US: Lawrence Erlbaum
Associates Publishers.
Smarter Balanced Assessment Consortium (SBAC). (2010). Race to the Top Assessment
Program Application for New Grants; Comprehensive Assessment Systems CFDA Number:
84.395B. Washington, D.C.: U.S. Department of Education.
State of Maine, Department of Education (2014). Maine Comprehensive Asssessment System.
Procurement for the Implementation of English Language Arts and Mathematics Assessment
in Grades 3 through 8 and High School. Retrieved March, 2018 from
http://www.maine.gov/doe/assessment/documents/rfp.doc
Strain-Seymour, E., Way, W., & Dolan, R. (2009). Strategies and Processes for Developing
Innovative Items in Large-Scale Assessments. Iowa City, IA: Pearson Education.
Thomas, A. (2016). Evaluating the Validity of Technology-Enhanced Educational Assessment
Items and Tasks: An Empirical Approach to Studying Item Features and Scoring Rubrics.
(2016). CUNY Academic Works.
U.S. Department of Education. (2010). Transforming American Education: Learning Powered by
Technology. National Educational Technology Plan 2010. Washington, D. C.
National Council of Education Statistics (NCES). (2016). NAEP Tools and Applications.
Retrieved March, 2018, https://nces.ed.gov/nationsreportcard/about/naeptools.aspx/
Wan, L. & Henly, G. A. (2012): Measurement properties of two innovative item formats in a
computer-based test, Applied Measurement in Education, 25(1), 58-78.
Willis, G. B. (1999). Cognitive interviewing: A “how to” guide. Paper presented at the Meeting
of the American Statistical Association. North Carolina: Research Triangle Institute.
Winter, P. C., Wood, S. W., Lottridge, S. M., Hughes, T. B., & Walker, T. E. (2012). The utility
of online mathematics constructed-response items: Maintaining important mathematics in
state assessments and providing appropriate access to students. Final research report. Pacific
Metrics Corporation.
Zenisky, A. L. & Sireci, S. G. (2002). Technological innovations in large-scale assessment.
Applied Measurement in Education, 15(4), 337-362.
Zucker, S., Sassman, C., & Case, B. J. (2004). Cognitive Labs. San Antonio, TX: Harcourt
Assessment, Inc.