using technology-enhanced items to measure fourth grade ... · annual meeting of the american...

Using Technology-Enhanced Items to Measure

Fourth Grade Geometry Knowledge

Jessica Masters, Ph.D.

[email protected]

Matthew Gushta, Ph.D.

[email protected]

Paper presented at the

Annual Meeting of the American Educational Research Association

Annual Meeting

Washington, D.C.

April, 2018

Abstract

Technology-enhanced items have the potential to provide improved measures of student

knowledge compared to traditional item types. This paper uses quantitative analysis of fourth-

grade geometry field test data to explore a) the validity of inferences made from certain kinds of

technology-enhanced items and b) whether those item types provide improved measurement

compared to traditional item types. There was strong evidence based on internal structure and

based on the relationship with other variables that technology-enhanced items provided valid

inferences. The evidence of whether that measurement was an improvement over the

measurement provided by traditional selected-response items was mixed.

Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge 2/46

High-quality assessment is critical to the educational process. Teachers and educational

stakeholders can only make effective decisions about instruction and student progress if they

have access to assessments that result in reliable and valid inferences about student knowledge.

Technology-enhanced (TE) items have the potential to provide improved measures of student

knowledge over traditional item types because they require students to produce, rather than

select, a response. TE items can create a more engaging environment for students and can reduce

the effects of guessing and test-taking skills. Constructed-response (CR) items can often provide

similar benefits, but TE items have the additional benefit of supporting automated scoring, a

critical feature in the modern classroom that requires fast-paced, inexpensive, and accurate

assessment feedback. For these and other reasons, national assessment consortia, state

departments of education, and assessment developers have widely incorporated TE items into

their formative and summative assessments. While these potential benefits of TE items have

spurred the forward momentum of TE item development and use, there has been a lack of

rigorous research on the validity of inferences made from TE items and the ability of TE items to

provide improved measurement over traditional selected-response (SR) items (Bryant, 2017).

Researchers have stressed the need for evidence that TE items are more than merely engaging,

but also provide an accurate, and potentially improved, measure of student knowledge.

The current paper contributes to the small but growing base of research related to whether the

inferences made from TE items are valid and whether TE items provide improved measurement

over SR items. Field test data are examined to address the following research questions:

RQ1) To what extent do TE items provide a valid measure of geometry standards in the

elementary grades?

RQ2) To what extent do TE items provide improved measurement compared to SR items?

To address these questions, the Validity of Technology-Enhanced Assessment in Geometry

(VTAG) project collected from classroom administration of TE, SR, and CR items targeting

fourth grade Common Core State Standards in geometry (shown in Figure 1).

CCSS.MATH.CONTENT.4.G.A.1: Draw points, lines, line segments, rays, angles (right, acute,

obtuse), and perpendicular and parallel lines. Identify these in two-dimensional figures.

CCSS.MATH.CONTENT.4.G.A.2: Classify two-dimensional figures based on the presence or

absence of parallel or perpendicular lines, or the presence or absence of angles of a specified size.

Recognize right triangles as a category, and identify right triangles.

CCSS.MATH.CONTENT.4.G.A.3: Recognize a line of symmetry for a two-dimensional figure as a

line across the figure such that the figure can be folded along the line into matching parts. Identify line-

symmetric figures and draw lines of symmetry.

Figure 1: Common Core State Standards in Fourth Grade Geometry


1. Theoretical Framework

Assessment is a critical component within the instructional process and instruction should be

differentiated based on the results of assessments (Pellegrino, Chudowsky, & Glaser, 2001). The

2010 National Education Technology (NET) Plan’s goal related to assessment is that “Our

education system at all levels will leverage the power of technology to measure what matters and

use assessment data for continuous improvement (USDE, p. xvii).” Research has documented the

inadequacies of SR items to measure many types of high-level knowledge and understanding

(Archbald & Newmann, 1988; Bennett, 1993; Birenbaum & Tatsuoka, 1987; Hickson & Reed,

2009; Lane, 2004; Livingston, 2009; Darling-Hammond & Lieberman, 1992; Quellmalz, Timms,

& Schneider, 2009). One approach to overcome the shortcomings of SR items is the use of text-

entry or CR items, which have been used to measure high-order skills and knowledge. In recent

years, researchers have leveraged technological advancements to combine the measurement

power of CR items with the automated-scoring capability of SR items. One branch of this

research has focused on automated text and essay scoring (e.g., Dikli, 2006), while another

branch has focused on using technology to allow students to interact with digital content in

innovative ways, through the development of TE items. This second line of research is consistent

with the NET Plan’s assessment-related recommendations, which include the development of

assessments that provide “new and better ways” to assess students and the expansion of the

capacity to design, develop, and validate technology-enhanced assessments that can access

constructs difficult to measure with traditional assessments (USDE, 2010). For this

recommendation to be realized, more research is needed on the validity of technology-enhanced

assessments in a variety of contexts.

TE items offer many potential benefits over SR items. The most significant is that TE items have

the potential to provide improved measurement of certain constructs, specifically high-level or

cognitively complex constructs, because, rather than simply select information, they require

students to produce information, which is often a more authentic form of measurement

(Archbald & Newmann, 1988; Bennett, 1999; Harlen & Crick, 2003; Huff & Sireci, 2001;

Jodoin, 2003; McFarlane, Williams, & Bonnett, 2000; Sireci & Zenisky, 2006; Zenisky & Sireci,

2002). A second benefit is that TE items reduce the effects of test-taking skills and random

guessing (Huff & Sireci, 2001). A third benefit is that TE items have the potential to provide

richer diagnostic information by recording not only the student’s final response, but also the

interaction and response processes that can reveal the student’s thought process that lead to that

response (Birenbaum & Tatsuoka, 1987). CR items have always offered the first of these two

benefits, but TE items allow these benefits to be leveraged on items administered via computer

that can be automatically and instantly scored. A fourth potential benefit of TE items is a

possible reduction of cognitive load from non-relevant constructs, such as the reading load for

non-reading related items or the cognitive load required to keep various item constructs in

memory (Mayer & Moreno, 2003; Thomas, 2016). Finally, TE items tend to be more engaging to

students, an important consideration in an era when students frequently feel over-tested (Strain-


Seymour, Way, & Dolan, 2009; Dolan, Goodman, Strain-Seymour, Adams, & Sethuraman,

2011).

The potential of TE items to provide improved measurement was initially underscored by the

awarding of Race to the Top Assessment funds to PARCC and Smarter Balanced, who proposed

to develop next-generation assessment systems that would incorporate TE items in their

summative and non-summative assessments (PARCC, 2010; SBAC, 2010). Since that time, state

departments of education have continued to pursue the promise of TE items. Statewide

summative assessment Requests for Proposals frequently include specific provisions for the

development and administration of TE items, stipulating the availability of specific interaction

types (e.g., State of Maine, Department of Education, 2014) and the presumption of

measurement of higher-order thinking skills (e.g., Oklahoma State Department of Education,

2017). Despite the forward momentum to develop and use TE items, there is only a small

research base evaluating the validity of TE items in various contexts within K-12 education.

Pearson conducted cognitive labs with elementary, middle, and high school students to evaluate

perceptions of TE items, the cognitive processes used to respond to TE items, and the potential

for TE items to better evaluate constructs in both mathematics and English language arts (Dolan,

Goodman, Strain-Seymour, Adams, & Sethuraman, 2011). Although the results cannot be

broadly generalized because of the small sample sizes, the research found preliminary evidence

to suggest that TE items were highly usable and engaging. More importantly, the research found

that TE items produced measurements of constructs, particularly high-level constructs, that were

not easily measured with traditional items types. The study found that the use of TE items

reduced guessing and allowed students to have more authentic interactions with content. The

study also found that TE items required more time to complete and that this factor was

influenced by students’ technical proficiencies (ibid.).

In a separate research effort, Pearson partnered with the Minnesota State Department of

Education to evaluate and compare the performance of TE items, SR items, and CR items in the

context of fifth grade, eighth grade, and high school science (Wan & Henley, 2012). This study

explored TE items of the figural response type, which includes “hotspot” identification, drag-

and-drop, and reordering. Through item response theory analyses, this study found that TE items

provided the same amount of information as SR items in fifth and eighth grade, and slightly

more information in high school. While CR items provided more information than both TE items

and SR items, those items required human scoring. This study also found that TE items and SR

items were equally efficient (i.e., they provided equivalent amounts of information per unit of

time). The researchers concluded that their statistical analyses support the use of TE items in K-

12. However, they were also careful to note that further psychological testing (e.g., cognitive

labs) should be conducted to confirm the results of their statistical testing. The researchers also

advised caution in using TE items when standard SR items are able to measure a construct. “We

reviewed the test forms administered in this study and found that a number of innovative items


could be easily replaced by [multiple choice] items without changing the [knowledge, skills, and

abilities] measured. This issue is not uncommon in the development of innovative assessments:

the face validity of innovative item formats is so appealing that their real potential of providing

something beyond what is available using the [multiple choice] format may be overlooked (ibid.,

p. 74).” This is supported by other researchers who warn of innovation solely for innovation’s

sake when traditional items can fully and accurately measure a construct (Haldyna, 1999; Sireci

& Zenisky, 2006). The researchers recommended both psychometric and psychological research

to determine when TE items are appropriate.

Research conducted by Pacific Metrics Corporation, with funding from the U.S. Department of

Education through a Grant for Enhanced Assessment Instruments, explored SR items, CR items,

and TE items in the context of seventh grade mathematics and Algebra I. The researchers found

no significant difference when they compared the correlation between a test comprised of CR

and TE items and teacher ratings of student knowledge with the correlation of a test comprised of

SR items and teacher ratings of student knowledge. The CR/TE test was reviewed by experts and

found to be similar to the SR test in terms of measuring the intent of the standards and the depth

of knowledge. The CR/TE test was found to be significantly more reliable and provide more

information than the SR test (Winter, Wood, Lottridge, Hughes, & Walker, 2012). “These results

indicate that tests incorporating CR/TE items can measure some mathematics content with less

error than tests comprising only [SR] items (ibid, p. 53).” This study provides promising results,

but must be generalized cautiously because of the narrow content focus and the blending of text-

based CR items and TE items on the same test form.

The VTAG project contributes to this small but critical base of research related to the validity of

TE items, differing from previous efforts and making new contributions to the research base in

two important ways. First, VTAG used a mixed-methods design. The researchers conducted an

analysis of qualitative data collected via cognitive labs with small samples and statistical analysis

of field test data from larger samples (using a field test). (he latter analysis is the focus of the

current paper. Second, the VTAG project focuses on a broad content area not previously studied:

elementary geometry.

2. Selected Response and Technology-Enhanced Item Types

The VTAG project uses the taxonomy of items defined by Scalise and Gifford (2006) to

determine whether an item is categorized as SR or TE. This taxonomy, displayed in Figure 2,

organizes items by the degree of constraint placed on the student’s options for responding to or

interacting with the item. The taxonomy does not classify items based on the method or mode of

interaction required by the student or the media included in the item.


For example, the item displayed in Exhibit A.1(a) is a traditional multiple-choice item. This item

is clearly SR. The item could be reformatted, however, to use a different method of interaction.

In Exhibit A.1(b), a similar item is rewritten using a drag and drop interface. Exhibit A.1(c)

shows another similar item that uses a hotspot interface.

For the VTAG project, the researchers argue that the interface or mode of interaction is irrelevant

and that all three of these items are SR because they have the same degree of constraint: the

student is asked to select one option out of four options. Although Exhibit A.1(b) and Exhibit

A.1(c) use a more modern and interactive interface, the content of the items and the extent to

which they ask students to select (rather than produce) content is the same as the item displayed

in Exhibit A.1(a); all three of the items displayed in Exhibit A.1 are consistent with the “Multiple

Choice” category of the Scalise and Gifford taxonomy (displayed in Figure 2).

What defines an SR item is that it places a high degree of constraint on the student’s response,

thus asking them to select, rather than produce, a response. A TE item, on the other hand,

requires the student to produce a response. For the VTAG project, the item must also be able to

be automatically scored. For example, the item in Exhibit A.2(a) asks a student to produce

(draw) a shape as their response. Although this item is similar to what could be offered on paper-

and-pencil as a (non-text-based) CR item, the item is administered via computer and can be

Figure 2: Taxonomy of Item Types based on Level of Constraint (Scalise & Gifford, 2006, p. 9)


automatically scored via the computer; this difference makes the item TE for the purposes of the

VTAG project. Other items require a student to produce a different kind of response compared to

what could be elicited from a paper-and-pencil CR item. For example, the item displayed in

Exhibit A.2(b) asks students to move objects into various categories or buckets, which would be

more difficult to directly translate to a traditional non-computer based administration. The key

distinctions again of why this item is considered TE for the VTAG project is that it requires a

student to produce a response and can be automatically scored.

Some TE items do ask students to select a response. But, unlike the SR items, the combination of

possible responses from which to select result in an item with a higher cognitive complexity.

While the items do place more constraint on the student than one in which the student is truly

producing the response, they place much less constraint than a selection of one in four/five/or

six. For example, the item in Exhibit A.3 is classified as a TE item. In theory, this item could be

rewritten as a series of three multiple choice items, each with a list of all 36 possible street pairs

as the options. Thus, while technically the student is selecting their response, not producing it,

they are selecting it from a large enough sample of responses that this item is cognitively more

akin to item types that require a produced response.

To summarize, VTAG researchers defined SR items as those that require a student to select a

response from a small set of responses. SR items place a high degree of constraint on the

response. TE items were defined as those that require a student to either produce a response or

select a response from a very large set of responses (which are not explicitly enumerated). TE

items place a low degree of constraint on the response.

It should be noted that the VTAG project aimed to contribute to the current knowledge base in

topics that have been less explored, including those that would inform the ongoing and

increasing use of computer-based assessments in large-scale summative and classroom-based

formative diagnostic assessment. To this end, the project compared the measurement qualities of

TE items to SR items, not to CR items. The researchers acknowledge the possibility (even

probability!) that many CR items can provide improved measurement of constructs. However,

many of these items either cannot currently be automatically scored or require substantial

investment to establish automated scoring algorithms. Thus, the current study addresses the

comparison of item types that can be widely administered and instantly scored by the computer

(i.e., TE vs SR). The researchers further acknowledge the large body of existing research

exploring automated text and numerical response analysis; they acknowledge that many of these

types of CR items can be automatically scored. However, in the interest of contributing to the

current knowledge base in an area less explored, those item types were not included in the

VTAG project.


2.1. Selected Response Item Types

As described above, for the purposes of the VTAG project, SR items are items that place a high

degree of constraint on the response. The VTAG project defines four SR item types: standard

multiple choice, select-all, complex multiple choice, and matching. These items come primarily

from the two most constrained categories from the Scalise and Gifford (2006) taxonomy

(Multiple Choice and Selection/Identification) and approximately correspond to levels 1C

(Conventional or Standard Multiple Choice), 2A (Multiple True/False), 2C (Multiple Answer),

2D (Complex Multiple Choice), and 3A (Matching) (as displayed in Figure 2).

Standard multiple choice items and select-all items require a student to select one or more

options from a small set of options. Exhibit A.4 shows example standard multiple choice items;

Exhibit A.5 shows example select-all items. Complex multiple choice items are a combination of

more than one standard multiple choice or select-all items. These types of items could be revised

into multiple items that use either a standard multiple choice or a select-all format (each with a

reasonable number of response options). Exhibit A.6 shows example complex multiple choice

items. Matching items require a student to connect pieces of content. There is a one-to-one

correspondence between the content, i.e., if content A, B, C, and D is matched with 1, 2, 3, and

4, exactly one letter will be matched to exactly one number. Exhibit A.7 shows example

matching items. Note that while a more interactive interface may be employed with the

aforementioned SR items, they are still considered SR because of the degree of constraint placed

on student responses.

2.2. Technology-Enhanced Item Types

TE items place a lower degree of constraint on the response. The VTAG project defines five TE

item types: categorization, hotspot, matrix completion, figure placement, and drawing. These

items come primarily from the constraint categories 3-6 from the Scalise and Gifford (2006)

taxonomy (Reordering/Rearrangement, Substitution/Correction, Completion, and Construction)

and approximately correspond to levels 3B (Categorizing), 3C (Ranking and Sequencing), 4C

(Limited Figural Drawing), 5D (Matrix Completion), and 6B (Figural Constructed response) (as

displayed in Figure 2).

Categorization items require students to sort or classify content. It’s important to note that, in

theory, categorization items could be rewritten as SR items; however, many of these items would

have too large a response space, and thus be unworkable to administer in SR format. Exhibit A.8

shows examples of categorization items. For categorization items, the general guideline used in

the VTAG project was that if at least one “content tile” (i.e., object to be classified) belonged in

more than one “bucket” (i.e., category), the item would likely be classified as a TE item. Some

items fit this description but overall had a small, constrained response space (e.g., the item

displayed in Exhibit A.6); these items were instead classified as SR instead of TE because of the

higher level of constraint placed on the response options.


Hotspot items require the student to identify part of a graphical figure. Exhibit A.9 shows

examples of hotspot items. Similar to the categorization items, it is often possible to use the

hotspot interface to write SR items. The general guideline used for hotspot was that, for a hotspot

item to be considered a TE item, the placement of the response within the overall visual space

must be critical to the overall item context (e.g., select the parallel lines on the images below or

select all coastal cities on the map below).

Matrix completion items require students to place content into pre-defined spaces within the

problem space. This includes items asking students to rank or order content. Exhibit A.10 shows

examples of this type of item. Figure placement items require students to place content within the

larger space, which do not have pre-defined placement options (e.g., no “slots” or “buckets” in

which to drop objects). Exhibit A.11 shows example items. Drawing items require students to

draw a response, including the drawing of points, line segments, lines, rays, angles, or polygons.i

Exhibit A.12 shows example drawing items.

3. Methodology

The researchers administered SR, TE, and CR items, along with surveys, to fourth grade students

in a field test designed to collect data to address RQ1 and RQ2.

3.1. Instruments

The researchers developed a set of approximately ten SR and ten TE items to target each of the

three standards in fourth grade geometry. The items were revised based on feedback from

internal project content experts, independent experts on the project advisory board, and based on

the qualitative analysis of data collected from cognitive labs (Masters, in press; Masters,

Famularo, King, 2016). These items were assembled to create three fourth grade test forms, each

of which consisted of two sub-forms, covering standards 4.G.A.1 (Test AB), 4.G.A.2 (Test CD),

and 4.G.A.3 (Test EF). Table 1 presents test construction information for these three test forms

(six sub-forms), showing the breakdown in the number of SR and TE item types. The tests were

presented to students in sub-forms. For example, a student might take sub-form A in a single

sitting and then complete sub-form B in another sitting.

A small set of text-based CR items was developed for each of the targeted standards. CR items

are generally accepted as capable of providing a strong and reliable measure of student

knowledge (e.g., Bennett, 1993; Hickson & Reed, 2009; Livingston, 2009). Just like any item,

the ability to provide an accurate measure is dependent on the quality of the item, the context of

administration, and the scoring approach or rubrics (which is particularly important when it

involves human, subjective scoring). However, well-designed and administered CR items

(including text-based CR items) are often considered an improved form of measurement (albeit

one that cannot be administered or scored at a large scale). Accordingly, for VTAG, the CR

items were designed to serve as an independent measure of student knowledge, something that


could be considered as close to the true measure of the student’s knowledge as is possible to

measure within the current research context.

Table 1: Test Construction

Number of SR Items Number of TE Items

Standard

Tes

t

Su

b-F

orm

Sta

nd

ard

Mu

ltip

le

Ch

oic

e

Sel

ect-

All

Co

mp

lex

Mu

ltip

le

Ch

oic

e

Mat

chin

g

To

tal

Cat

egori

zati

on

Ho

tsp

ot

Mat

rix C

om

ple

tio

n

Fig

ure

Pla

cem

ent

To

tal

4.G.A.1 A 1 4 0 1 6 2 2 1 0 8

B 3 2 2 1 8 4 1 0 0 8

4.G.A.2 C 3 3 1 0 7 4 0 0 0 6

D 1 3 1 0 5 3 0 2 0 8

4.G.A.3 E 4 0 0 2 6 1 0 2 0 8

F 2 5 1 0 8 2 0 0 2 6

The researchers collected released items and item contexts from state and national tests,

modifying the language, targeted content, or item structure when necessary.ii To supplement the

limited pool of publicly available test items, the VTAG researchers developed additional CR

items specifically for use in this project. Exhibit A.13 shows example CR items.

A Technology Survey was designed to measure students’ comfort with, proficiency with, access

to, and frequency of use related to technology, as well as their attitude towards technology and

attitude towards mathematics. These areas were chosen because they represent the various

factors that might influence student performance on TE items. The researchers gathered released

survey items from various years of the fourth grade National Assessment of Educational

Progress (NCES, 2016) and Trends in International Mathematics and Science Study (NCES,

2018) surveys. Some questions were modified based on other smaller surveys (e.g., the Use,

Support, and Effect of Instructional Technology Project) (UseIt, 2018). The survey items were

reviewed internally by project experts, reviewed by members of the external VTAG expert

advisory board, and piloted with a sample of over 740 elementary students; the survey items

were revised accordingly.

3.2. Data Collection

The researchers collected initial validity evidence based on test content, e.g., the extent to which

the assessment represents the domain of interest with minimal construct-irrelevant information.

This type of evidence was collected through review by the expert advisory panel. Because

evidence based on test content is a fundamental underpinning of validity, item revisions were


made based on the feedback of the expert advisors before the items were used to collect

additional validity evidence.

Cognitive labs, or think alouds, are face-to-face interactions during which a researcher observes

and evaluates a student’s cognitive processes. Cognitive labs have become a widely used method

of gathering evidence related to the validity of inferences made by assessments, specifically

evidence about whether assessment items are measuring the intended constructs (Dolan,

Goodman, Strain-Seymour, Adams, & Sethuraman, 2011; Ericsson & Simon, 1999; Gorin, 2006;

Willis, 1999). This type of qualitative evidence adds significant value to more traditional,

quantitative validity evidence collected from larger samples (Beatty & Willis, 2007; Willis,

1999; Zucker, Sassman, & Case, 2004).

The expert review and cognitive labs were used both to collect validity evidence- evidence based

on test content and based on student thought processes. They were also used to revise and refine

the items used in the field test and described in the current paper (Masters, in press; Masters,

Famularo, King, 2016). The cognitive labs also revealed that: a) the increased number of

components in the TE items resulted in those items being both more difficult and more time

intensive; b) students generally preferred the most interactive TE items; and c) when the

technology of the testing interface was not a barrier, the TE items provided an equally accurate,

and sometimes improved, measure of knowledge (ibid). These results contributed to the overall

understanding related to the research questions and helped inform the final revisions to the items

and testing system that was used in the field test that is described in the current paper.

In spring 2016, the researchers conducted a field test of the SR and TE items, to collect validity

evidence based on internal structure and based on the relationship to other variables. All SR and

TE items were instantly computer-scored so that teachers received immediate feedback with

student results. In addition, students completed the CR items, a background survey, and the

Technology Survey (all administered via computer).

4. Results

There were 384 students that completed the combined Test Form AB (targeting standard

4.G.A.1), 374 students that completed the combined Test Form CD (targeting standard 4.G.A.2),

and 342 students that completed the combined Test Form E F (targeting standard 4.G.A.3). Table

2 presents information about student gender, race, ethnicity, and the frequency of when a

language other than English was spoken in the home. Answering these questions was voluntary

for students; thus, percentages may not always add up to 100%.

Table 3 shows the average test scores, ranges of item difficulty, and ranges of item-total

correlations. For the item analysis described throughout this section, all items were worth 1

point. Some items were dichotomously scored and others were scored with partial credit of 0,

0.5, or 1.iii


Demographic Subgroup Value

Total n 425

Gender Female 53.20%

Male 38.60%

Race

Asian 5.20%

American Indian / Alaskan

Native 7.80%

Black 22.40%

Native Hawaiian / Pacific

Islander 4.20%

White 42.60%

Ethnicity Hispanic / Latino 12.50%

Language Other

than English

All or most of the time 12.20%

About half of the time 5.40%

Once in a while 12.20%

Never 58.10%

Item Set Number of

Items Average Score

Range of Item

Difficulty

Range of Item-

Total Correlations

Test Form AB (Standard 4.G.A.1)

Combined SR

and TE Items 30 11.065 / 30 = 37% 0.063-0.865 0.269-0.644

SR Items 14 6.307 / 14 = 45% 0.129-0.865 0.216-0.560

TE Items 16 4.758 / 16 = 30% 0.063-0.543 0.365-0.587

Test Form CD (Standard 4.G.A.2)

Combined SR

and TE Items 24iv 7.253/ 24 = 30% 0.055-0.74 0.293-0.713

SR Items 11 4.350 / 11 = 40% 0.156-0.74 0.304-0.623

TE Items 13 2.902 / 13 = 22% 0.055-0.72 0.275-0.652

Test Form EF (Standard 4.G.A.3)

Combined SR

and TE Items 28 11.610 / 28 = 41% 0.039-0.824 0.222-0.631

SR Items 14 6.949 / 14 = 50% 0.156-0.824 0.326-0.602

TE Items 14 4.661 / 14 = 33% 0.039-0.614 0.191-0.624

Table 2: Participating Student Demographics

Table 3: Item Difficulties and Item-Total Correlations


Table 4 presents the results of a two-way ANOVA examining item difficulty by test form (AB,

CD, EF) and item type (SR or TE). In addition, effect size for each factor was calculated as η2

(SSeffect/SStotal). The results show a significant effect of Item Type (F1,76 = 14.265, p < 0.001) and

a medium-to-large effect size (η2 = 0.158; Cohen, 1988). A follow-up t-test confirms that the TE

items were more difficult than the SR items (t = 3.825, df = 77.154, p < 0.001). The effect of test

form was non-significant and small indicating no difference in difficulty between Test AB, Test

CD, and Test EF. Similarly, Table 5 presents the results of a two-way ANOVA examining item-

total correlation by test form and item type. The results show no significant effects of test form,

item type, or the interaction.

Factor df Sum of

Squares F sig η2

Test Form 2 0.171 2.291 n.s. 0.048

Item Type 1 0.533 14.265 p < 0.001 0.150

Interaction 2 0.001 0.014 n.s. 0

Residuals 76 2.837 0.801

Factor df Sum of

Squares F sig η2

Test Form 2 0.010 0.421 n.s. 0.011

Item Type 1 0.015 1.258 n.s. 0.016

Interaction 2 0.011 0.456 n.s. 0.012

Residuals 76 0.883 0.962

4.1. Internal Structure via Classical Test Theory

RQ1 asks whether TE items provide valid evidence. One type of evidence is based on internal

structure of the assessments, i.e., the statistical properties of an assessment which support the

idea that respondents are applying the targeted construct without applying additional constructs

not of direct interest (Goodwin & Leech, 2003). This was first evaluated using methods from

classical test theory. Cronbach’s alpha was calculated for all items on the test, the subset of SR

items, and the subset of TE items.

As shown in Table 6, for all three tests, the reliability was high for the full set of items and for

the subsets of SR and TE items (α ≥ 0.70 is often considered as the threshold for acceptable

reliability of smaller, formative assessments while α ≥ 0.80 is often considered as the threshold

for acceptable reliability of large-scale, high-stakes assessment; see Salvia, Ysseldyke, & Bolt,

2007). The Spearman-Brown correction was applied to the SR reliability index to estimate the

value it would have if the number of SR points were equal to the number of TE points. Table 6

Table 4: Item Difficulty ANOVA and Effect Size

Table 5: Item-Total Correlation ANOVA and Effect Size


shows the results, which are comparable to the reliability analysis conducted without the

correction: both item sets showed high reliabilities and the TE items showed a marginally higher

reliability for one of the test forms (Test CD).

To further evaluate internal consistency (RQ1), an exploratory factor analysis was conducted

using a Varimax rotation on the full set of items, the SR subset, and the TE subset. As shown in

Table 7, the variance accounted for by the first factor in the full set of items was comparable to

the variance accounted for by the first factor in the SR and TE subtests. For Test Forms AB and

CD, the first factor accounted for a somewhat greater percentage of variance for the TE items

compared to the SR items or the full test form. For all three sets of all three tests, the items

loaded strongly on the first factor and the average factor loadings were strong.

These results indicate that for all three tests, the group of items as a whole, the subset of SR

items, and the subset of TE items each show evidence of forming a unidimensional scale.

The reliability and factor analysis results provide one source of evidence that the TE items are a

valid measure of knowledge of the targeted standard (RQ1); the strong internal structure

demonstrated by these items suggests that valid inferences may be drawn about the assessed

geometry standards.

One way to evaluate if there is an improved measure (RQ2) is to evaluate whether the TE items

provide better information than SR items. Better information might be indicated by larger

reliability indices, greater proportion of variance accounted for by the first factor, or stronger

factor loadings in comparing SR and TE item subsets. As shown in Table 6, the Spearman-

Brown corrected reliability indices for the TE items was higher than for the SR items for Tests

AB and CD. The difference in reliabilities was statistically significant (p<0.05, Diedenhofen and

Musch, 2016) for only Test AB. For Tests AB and CD, the TE items also showed stronger

internal structure based on the other characteristics (including higher factor loadings).

4.2. Internal Structure via Item Response Theory

Item response theory (IRT) was employed to further examine the internal structure of the

assessments. A two-parameter logistic (2PL) model was fit to the dichotomous items and a

graded response model (GRM) was fit to the polytomous items, allowing for the IRT models to

accommodate differences across items with respect to both difficulty and discrimination.v

Similar to the study by Jodoin (2003), this analysis focused on the Test Information Functions

(TIFs), which are derived from the IRT item parameters estimated separately for the SR and TE

subtests.


Item Set Number

of Items

Reliability

(Cronbach’s α) Reliability

(Spearman-Brown)


Combined SR

and TE Items 30 0.897 --

SR Items 14 0.789 0.84

TE Items 16 0.834 0.84


Combined SR


SR Items 11 0.790 0.80

TE Items 13 0.827 0.84


Combined SR


SR Items 14 0.829 0.84

TE Items 14 0.818 0.82

Item Set Number of

Items

% of Variance

Accounted for

by First Factor

Range of Factor

Loadings on

First Factor

Average Factor

Loadings


Combined SR

and TE Items 30 27.095% 0.307-0.680 0.510

SR Items 14 28.076% 0.289-0.677 0.513

TE Items 16 31.095% 0.430-0.668 0.553


Combined SR

and TE Items 30 31.495% 0.314-0.770 0.547

SR Items 14 33.422% 0.378-0.748 0.566

TE Items 16 36.906% 0.357-0.755 0.594


Combined SR

and TE Items 28 28.760% 0.253-0.690 0.526

SR Items 14 31.852% 0.402-0.705 0.557

TE Items 14 31.178% 0.248-0.730 0.545

TIF is inversely proportional to the conditional standard error of measurement for the estimate

of theta (the latent trait variable), which is the error distribution of theta at various ability levels

(Lohman & Al-Mahrazi, 2003). To oversimplify, this can be restated as such: when the test gives

more information, there is less uncertainty in the estimate of the respondent’s ability.

Table 6: Reliability Analyses

Table 7: Factor Analysis


Interpretation of TIF values can be difficult, but one way to compare TIF peak values is to note

the relationship between TIF peak values and respondent ability.

Because the TIF is linearly related to the number of items (and, thus, the number of points) on a

test, the use of TIF as a comparison must allow for the adjustment of the TIF for the SR items

upward, to account for there being fewer SR points; this adjustment is analogous to the

Spearman-Brown correction applied when comparing the reliabilities.

Plots of the adjusted TIF values are presented in Figure 3 for the SR (solid line) and TE (dashed

line) items. Table 8 shows the comparison of the adjusted peak values.

The difference in TIF peak values between two tests can allow us to compare the standard errors,

or measurement precision, of two tests. The peak of the SR and TE TIF was comparable for Test

AB; the TE peak was notably higher for Test CD; the SR peak was slightly higher for Test EF.

Item Set Peak TIF Ratio TIF


SR Items 8.72 0.998

TE Items 8.75 1.002


SR Items 6.21 0.829

TE Items 9.04 1.207


SR Items 8.00 1.069

TE Items 7.00 0.935

By taking the square root of the ratio of the TIF peak value for the TE items on Test CD to the

TIF peak value for the SR items on Test CD, it can be estimated that scale scores from a test

composed only of SR items would have a standard error that is approximately 20% larger than

Figure 3: TIF Curves (Solid Line- SR, Dashed Line- TE)

Table 8: Peak TIF Values


the standard error of the scale scores from a test composed only of TE items. This indicates that

this set of TE items did, in fact, provide better information compared to the SR items. Using this

same analysis, the comparison of TIF peak values for Tests AB and EF indicate that the

difference is insignificant.

It should also be noted that the TIF peak value for the TE items always occurred at a higher

value of theta than for the SR items. The SR items had greater TIF than the TE items at lower

values of theta, while the opposite was true at higher values of theta. These differences in TIF

shape indicate that the TE items tended to have notably higher difficulty than the SR items, and

this difference was the primary difference in TIF between the SR and TE items for these tests.

The TIF analysis, resulting from IRT estimation, provides evidence that not only do the TE items

provide comparable statistical information or certainty about student ability as the SR items

(RQ1), but may provide differential certainty, i.e., an improvement over SR items (RQ2).

Assessment via TE items for Test CD demonstrated approximately 20% less error than

assessment via the traditional SR items.

4.3. Comparison with CR Items

Another source of validity evidence is based on the relationship of test scores to other variables,

which describes the extent to which an assessment provides information consistent with other

information about the examinee with respect to the construct of interest. The researchers used the

CR items as the external measure of geometry knowledge for comparison.

Students’ performance on the CR items is summarized in Table 9. The average scores and item

difficulties are generally comparable to performance on the SR and TE items, separately and

combined (displayed in Table 3), with the CR items tending to be slightly more difficult.

A one-way ANOVA was conducted across items difficulties (SR, TE, and CR items) along with

effect size (η2) and necessary follow-up analyses (displayed in Table 10). Test form was

excluded from analysis having previously been shown to be a non-significant factor in item

difficulty. Again, item type was found to be a significant factor in item difficulty (F2,94 = 8.89, p

< 0.001), accounting for 15.9% of the total variance in item difficulty. Follow-up t-tests indicated

again that TE items are more difficult than SR items (t = 3.825, df = 77.154, p < 0.001) and that

CR items are also more difficult than SR items (t = 3.4087, df = 47.009, p < 0.001); no

significance difference was demonstrated between TE and CR item difficulty.

Test Form Number of

Items Average Score

Range of Item

Difficulty

Test Form AB 15 5.3313 / 15 = 36% 0.148-0.628

Test Form CD 15 5.2120 / 15 = 35% 0.151-0.603

Test Form EF 15 5.3957 / 15 = 36% 0.153-0.621

Table 9: CR Item Set Difficulties


Factor df Sum of

Squares F sig η2

Item

Type 2 0.595 8.890 p < 0.001 0.159

Residuals 94 3.146 0.841

Students’ performance on the SR and TE items was compared to their performance on the CR

items. As displayed in Table 11, the combined items and the two subsets of items all displayed a

strong and statistically significant correlation to the CR items.

The strong, positive correlations to CR item performance provides another source of evidence

that the TE items provided a valid measure of knowledge (RQ1). It should also be noted that the

SR and TE items were strongly and statistically significantly correlated to each other

(r = 0.810** for Test AB, r = 0.755** for Test CD, and r = 0.813** for Test EF), which suggests

that the two tests might be measuring the same construct, the construct measured by the

independent measure (the CR items), which is being used as an independent estimate of the

student’s knowledge. This further indicates that the TE items are a valid measure of knowledge.

Again, TE items might be considered an improved measure if they provide a better measure. As

shown in Table 11, for all three tests, the correlation between the TE items and the CR items was

stronger than the correlation between the SR items and the CR items, although, again, both

correlations were strong and statistically significant. The correlation between performance on the

CR items and either SR or TE items is compared according to Fisher’s r-to-z transformation

(Preacher, 2002) indicating no significant difference for Test AB (z = -1.655, p > 0.05) or Test

EF (z = -0.342, p < 0.05). However, the correlation with CR item performance is stronger for TE

items than SR items on Test CD (z = -2.021, p < 0.05). The significance testing indicates that, for

Test CD, the TE items might be a better measure (i.e., closer to the measure provided by the CR

items) of that construct compared to the SR items.

Table 10: Item Difficulty ANOVA and Effect Size


Item Set Correlation

with CR Items


Combined SR

and TE Items r = 0.797**

SR Items r = 0.732**

TE Items r = 0.782**


Combined SR





Combined SR




* p<0.05, ** p < 0.01

4.4. Survey Data and Time

The researchers explored the relationship between subtest performance and students’ technology

affect, frequency of technology use, and students’ math affect. Table 12 shows correlations

between these factors and the three test forms plus the CR items.

Student affect towards technology was demonstrated weak positive correlations to each of the

four sets of items (r = 0.195 to 0.261, p < 0.05). These results suggest no practical relationship

between technology affect and performance on SR, TE, and CR item types. Frequency of

technology use was weak and negatively correlated to each of the four sets of items (r = -0.098 to

-0.187, p < 0.05), suggesting no practical relationship between the frequency with which students

use technology and their performance on any of the item types. Lastly, , student affect towards

math in general demonstrated weak positive correlations to all four item types, indicating that

students’ opinions towards math were not meaningfully related to their performance on any of

the item types.

It is often the case that TE items take longer to complete, which is often considered a potential

negative factor that can counter potential benefits of TE items (unless TE items provide more

information, thus potentially justifying the extra time required). Therefore, average time to

complete each item was compared across TE and DR items. The median seconds to answer the

SR items was 42, 26, and 11 for Test AB, Test CD, and Test EF, respectively. The median

number of seconds to answer the TE items was 64, 63, and 24. For each test, the difference in

time to answer by item type was statistically significant (p < 0.01).

Table 11: Correlations with CR Items for Fourth Grade


Table 12: Correlations with Student Survey Data for Fourth Grade

Item Set Technology

Affect

Frequency of

Technology

Use

Math

Affect


Combined SR

and TE Items r = 0.225** r = -0.110* r = 0.263**

SR Items r = 0.211** r = -0.111* r = 0.259**

TE Items r = 0.218** r = -0.098 r = 0.243**

CR Items r = 0.244** r = -0.139* r = 0.246**


Combined SR

and TE Items r = 0.222** r = -0.182** r = 0.274**

SR Items r = 0.202** r = -0.187* r = 0.286**

TE Items r = 0.216* r = -0.152** r = 0.227**

CR Items r = 0.261** r = -0.143* r = 0.241**


Combined SR

and TE Items r = 0.220** r = -0.179** r = 0.236**

SR Items r = 0.223** r = -0.173** r = 0.213**

TE Items r = 0.195** r = -0.169** r = 0.239**

CR Items r = 0.236** r = -0.161* r = 0.239**

* p<0.05, ** p < 0.01

These results show a weak relationship between student performance on the items and both

technology use and affect towards technology. This suggest that technology use and affect do not

present construct-irrelevant barriers to students’ performance on the test. These are positive

results in support of the use of TE items, as issues related to technology use and affect are often

considered potential reasons to avoid TE items. The results did show that the TE items took

significantly more time to complete, but that finding must be considered along with the overall

findings about whether the items also provide more or better information (Jodoin, 2003).

5. Discussion

A condensed summary of the results is provided in Figure 4. The analyses clearly provide

evidence of validity of inferences made from the TE items (RQ1). The evidence on whether the

TE items provided improved measures (RQ2) is somewhat inconsistent. In several cases, there

was evidence to support better or more information being obtained from the TE items. The

evidence was not generalizable across all tests or overwhelmingly conclusive for any individual

test, particularly considering the equally strong performance of the SR items. The analysis seems

to support the development of tests with a combination of SR and TE items as the best

measurement of student knowledge. It is clear that the circumstances under which TEs will

provide an improved measure are very context-dependent.


RQ Evidence Test

AB CD EF

RQ1

internal structure (reliability)

internal structure (unidimensionality)

internal structure (information)

relationship to other variables (correlation with

CR items)

lack of evidence of the effect of construct

irrelevant factors (technology and affect)

RQ2

comparable item difficulties between CR and TE

items (compared to significant differences in

difficulty between CR and SR items)

internal structure (significant improved

reliability) of TE over SR items

Stronger internal structure (unidimensionality)

for TE versus SR items

more information (higher TIF values and smaller

IRT-based standard errors) for TE versus SR

items

stronger relationship to other variables

(correlation between performance on CR items

and TE items, versus CR items and SR items)

Figure 4: Summary of Evidence from Field Test

There was strong evidence from all three tests that the TE items were a valid measure of student

knowledge for the three targeted standards (RQ1). The TE items had strong internal structure,

evidence of unidimensionality, and a strong relationship to an external measure of knowledge.

Further, there was no evidence found to support what are often considered potential negative

confounding factors that inhibit the ability of TE items to provide an accurate measure of

knowledge. Specifically, there was no evidence of a meaningful relationship between

performance on TE items and students’ opinions towards technology, frequency of use of

technology, or opinions towards math. These findings alleviate some of the more common

concerns about the use of TE items. Thus, there is strong evidence related to RQ1.

There was some evidence that TE items provided an improved measure (RQ2), but the evidence

was not consistent across tests. Tests CD had the most consistent evidence that TE items

provided an improved measure. For this test, TE and CR items demonstrated comparable

difficulties, indicating that, despite the TE items being more difficult than the SR items, they

were closer to the independent measure of knowledge, as provided by the CR items. This idea


was also supported by the significantly stronger correlation between the TE and CR items,

compared to the SR and CR items. The TE items had significantly higher reliability and higher

average factor loadings. Additionally, the IRT analysis provided information indicating that the

TE items on Test CD provided more statistical information, and therefore, improved

measurement over the SR items.

The evidence was less consistent for the other two tests. For Test AB, the TE items had higher

average factor loadings compared to the SR items and they had a significantly stronger

correlation to the CR items compared to the SR items. For Tests EF, the only evidence that TE

items might provide improved measurement was non-significant difference between TE and CR

item difficulties.

There are many indications in the data that, from an assessment development perspective, when

limited only to items that can be computer-scored, the best way measurement might be achieved

by including both SR and TE items. For all three tests, the individual item subsets and the

combined item sets all showed strong and often comparable measures of internal structure. And

the evidence of TE items providing improved measurement was inconsistent. Thus, under what

circumstances TEs will provide an improved measure might be incredibly context-dependent,

and, thus, without further research on specific contexts, the best approach is likely a blend of

item types. Further, the inclusion of SR items can mitigate the effects of the increased time

required to respond to TE items.

This recommendation is supported by an examination of the context of the three targeted

standards. The standards contained a mix of components, asking students to draw (4.G.A.1- Test

AB, 4.G.A.3- Test EF), identify (4.G.A.1, 4.G.A.2, 4.G.A.3, all three tests), classify (4.G.A.2-

Test CD), and recognize (4.G.A.2, 4.G.A.3- Test CD and Test EF). Two of these actions,

identifying and recognizing are more likely to be able to be accurately measured by an item with

a high level of constraint, meaning a typical SR item. By definition, these actions do not require

a student to produce their own response. Drawing, on the other hand, by definition does require a

student to produce a response, and thus it is perhaps best to measure this ability using an item

with a lower level of constraint (a TE item). The ability to classify appears to also be more

accurately measured by TE items with less constraint. While a more constrained, typical SR item

certainly can measure student’s ability to classify, it also allows guessing and “chance

classifications.” A less constrained, TE-style item requires the student to classify with less

opportunity for guessing. A TE-style item also allows for greater flexibility in the grouping of

objects, which might allow response options to more closely match a student’s true

understanding. The three targeted standards contain a mix of components, some of which would

be seemingly more appropriate to measure with more constrained SR items and others with less

constrained TE items. This helps contextualize the results and supports the idea that the content

being measured should be the driving force in selecting appropriate item types and that a blend

of item types might often be the best approach.


TE items are quickly becoming ubiquitous and thus it is critical to generate a research base about

the quality of measurement and the validity of inferences based on TE items in a variety of

contexts. The current paper contributes to this small but important base of research. The results

provide evidence that TE items can provide a valid measure of geometry standards in the

elementary grades

Based on the VTAG findings, the researchers recommend that item developers not consider TE

items a panacea to solve all modern measurement challenges. Rather, TE item types should be

treated as one of many tools in the item developer’s toolbox. This represents a kind of back to

basics for item writing: the selection of item type should be based on the construct being

measured. TE items should not be automatically favored over SR items because it is assumed

they will always provide a better measure. Whether TE items will provide improved

measurement is likely to be highly dependent on the construct. Item writers should consider the

action students will take to demonstrate knowledge to inform the selection of item type. For

example, measuring a student’s ability to identify or recognize might be actions that can be

accurately and fully measured with an SR item because the action can be captured even when a

high degree of constraint is placed upon the response options. Measuring a student’s ability to

draw, classify, represent, or interpret might be better measured with a TE item because those

actions might be better demonstrated with a less constrained response space.

The researchers further recommend that TE items should not be avoided on the basis that they

are typically more expensive to develop and test, they might introduce construct-irrelevant

factors, or because they require more time to answer. Well-designed TE items do have potential

to provide improved measurement of some constructs. Thus, again, item developers should

choose the item type based on the best way to measure a specifically targeted construct,

considering the varying levels of constraint represented by SR and TE item types as the broad set

of possible item types from which to choose.

There are limitations of the VTAG project. Participating teachers and students were small

samples of convenience, limiting statistical the generalizability and available statistical

procedures. With larger, representative samples, more complex IRT models could be estimated

and the impact of guessing evaluated. Finally, the research presented was conducted only with

students that did not require accommodations during testing. Future research should explore the

same questions posed by VTAG, using the VTAG item sets, but with the full breadth of the

student population, including students with a variety of visual, auditory, psychical, and motor

disabilities. The researchers acknowledge the strong (and continually growing) base of research

related to TE accommodations and acknowledge the VTAG project did not include test

accommodations.


Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant

Number 1316557. Any opinions, findings, and conclusions or recommendations expressed in this

material are those of the author(s) and do not necessarily reflect the views of the National

Science Foundation.

i For drawing items, given the targeted age range, the interface decision was made to use “snap-to” behavior,

meaning that if the student clicks somewhere in between grid lines, rather than directly on the intersection of two

grid lines, the point would be placed on the nearest grid line. This is one of two possible behaviors for graphing (the

other being allowing a student to place a point anywhere on the grid). Given that the constructs being measured did

not relate to concepts involving fractions or decimals, and that the item writers and researchers purposefully avoided

fractions and decimals due to the age range and confounding influence of these mathematical ideas, the snap-to

behavior was chosen. The snap-to behavior also allows for more precise scoring algorithms that do not require

tolerances around correct values (which, in turn, require extensive field and validity testing). ii Because the majority of the items were released prior to the adoption of Common Core State Standards, and thus

were based on other standards, many of these items did not align clearly to the standards being targeted for the

VTAG project. Some required modification. Further, the researchers required that the items be strictly constructed-

response, meaning that items could not primarily use an interaction that could be used in a SR or TE item (e.g.,

selecting options, graphing, plotting, drawing, etc.). The researchers contended that the inclusion of these kinds of

items could influence the alignment between CR items and similar SR or TE items. For example, if the set of CR

items targeting a standard included items asking students to plot points, and the analysis later revealed a stronger

correlation between the CR and TE items (which also asked students to plot points) than between the CR and SR

items, one could argue that the correlation was a result of the interface rather than the accuracy of the construct

measurement. iii This assessment design was purposefully chosen to not presuppose that items supporting partial credit would be

worth more or weigh more heavily towards the overall test score. During initial item design, the number of points an

item was worth was made for each item on an individual basis, as was the decision of whether to award partial

credit. Half-points were not awarded; thus, items that awarded partial credit were, by default, written as 2-point

items. These are common conventions used when writing items for large-scale assessment. In response to the expert

advisors’ feedback, the researchers made all items worth one point. This decision was made so as not to assume that

any individual item or item type would return more information than any other, since whether or not this is the case

is part of what the VTAG project seeks to discover (i.e., do TE items provide more information than SR items). The

opportunity to earn partial credit (a score of 0.5) was used for all items where a response would indicate some level

of partial understanding without the demonstration of conflicting misunderstanding. Partial credit was awarded to

both SR items (select-all, matching, and complex multiplex choice types only) and TE items (all types). iv Two items were dropped from Test CD (one SR item and one TE item). These items both had very low item

difficulty statistics which means they were very difficult items (.073 and .005) and had very low or negative item-

total correlations, both when compared with the combined set of SR and TE items and with the SR or TE subset.

The analysis in this section is based on the CD test without these two items. v For the analysis using item response theory (IRT), response data was recoded such that items were either

dichotomously scored or polytomously scored as 0, 1, or 2.

Appendix A: Example Items

Note: Example items are presented to display SR and TE item type. Some items target the fourth

grade standards described in the current paper. Other items target fifth grade standards, also a

target of focus of the VTAG project.

Exhibit A.1: Examples of Traditionally-Formatted and Interactive SR Items

(a) (b) (c)

Exhibit A.2: Examples of TE Item

(a)


(b)

Exhibit A.3: Example TE Item for which the Response Space is Too Large to

Administer as SR


Exhibit A.4: Example SR Items (Standard Multiple Choice)


Exhibit A.5: Example SR Items (Select-All)


Exhibit A.6: Example SR Item (Complex Multiple Choice)


Exhibit A.7: Example SR Items (Matching)


Exhibit A.8: Example TE Items (Categorization)


Exhibit A.9: Example TE Item (Hotspot)


Exhibit A.10: Example TE Items (Matrix Completion)


Exhibit A.11: Example TE Items (Figure Placement)


Exhibit A.12: Example TE Item (Drawing)


Exhibit A.13: Example CR Items


References Cited

Archbald, D. A. & Newmann, F. M. (1988). Beyond Standardized Testing: Assessing Authentic

Academic Achievement in Secondary School. Reston, VA: National Association of Secondary

School Principals.

Beatty, P. C. & Willis, G. B. (2007). Research synthesis: The practice of cognitive interviewing.

Public Opinion Quarterly, 71(2), 287-311.

Bennett, R. E. (1993). On the meaning of constructed response. In R. E. Bennett and W. C.

Ward, (Eds.), Construction versus Choice in Cognitive Measurement: Issues in Constructed

Response, Performance Testing, and Portfolio Assessment (pp. 1–27). Hillsdale, NJ:

Lawrence Erlbaum Associates.

Bennett, R. E. (1999). Using new technology to improve assessment. Educational Measurement:

Issues and Practice, 18(3), 5-12.

Birenbaum, M., Kelly, A. E. & Tatsuoka, K. (1992). Towards a stable diagnostic representation

of students' errors in algebra. ETS-RR-92-58-ONR. ERIC Clearinghouse Doc. ED 356973.

Bryant, W. (2017). Developing a strategy for using technology-enhanced items in large-scale

standardized tests. Practical Assessment, Research & Evaluation, 22(1).

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:

Lawrence Erlbaum Associates

Darling-Hammond, L. & Lieberman, A. (1992). The shortcomings of standardized tests. The

Chronicle of Higher Education, B1-B2.

Diedenhofen, B., & Musch, J. (2016). cocron: A web interface and R package for the statistical

comparison of Cronbach’s alpha coefficients. International Journal of Internet Science, 11,

51–60.

Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology,

Learning, and Assessment, 5(1).

Dolan, R. P., Goodman, J., Strain-Seymour, E., Adams, J., & Sethuraman, S. (2011). Cognitive

Lab Evaluation of Innovative Items in Mathematics and English Language Arts Assessment

of Elementary, Middle, and High School Students. Iowa City, IA: Pearson Education.

Ericsson, K. A., & Simon, H. A. (1999). Protocol analysis: Verbal reports as data. Cambridge,

MA: Massachusetts Institute of Technology.

Goodwin, L. D. & Leech, N. L. (2003). The meaning of validity in the new Standards for

Educational and Psychological Testing: Implications for measurement courses. Measurement

and Evaluation in Counseling and Development, 36, 181-191.

Gorin, J. S. (2006). Test design with cognition in mind. Educational Measurement: Issues and

Practice, 25(4), 21-35.

Haladyna, T. M. (1999). Developing and validating multiple-choice test items. Hillsdale, NJ:

Erlbaum.

Harlen, W. & Crick, R. D. (2003). A Systematic Review of the Impact on Students and Teachers

of the Use of ICT for Assessment of Creative and Critical Thinking Skills. London: Institute

of Education, University of London.

Hickson, S., & Reed, W.R. (2009) Do constructed-response and multiple-choice questions

measure the same thing? Department of Economics. Working Paper Series. Retrieved March,

2018 from http://hdl.handle.net/10092/2465.

Huff, K. L. & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational

Measurement: Issues and Practice. 20(3), 16-25.


inTASC. (2018). USEiT: Use, Support, and Effect of Instructional Technology Study. Retrieved

March, 2018, from http://www.bc.edu/research/intasc/researchprojects/USEIT/useit.shtml

Jodoin, M. G. (2003). Measurement efficiency of innovative item formats in computer-based

testing. Journal of Educational Measurement, 40(1), 1-15.

Lane, S. (2004). Validity of high-stakes assessment: Are students engaged in complex thinking?

Educational Measurement: Issues and Practice, 23(3), 6-14.

Livingston, S. (2009, September). Constructed-response test questions: Why we use them; how

we score them. ETS R&D Connections, no. 11.

Lohman, D. F., & Al-Mahrzi, R. (2003). Personal standard errors of measurement. Retrieved

March, 2018 from https://faculty.education.uiowa.edu/docs/dlohman/personalSEMr5.pdf.

Masters, J. (in press) The Validity of Technology-Enhanced Assessment in Geometry (VTAG)

Project: Technical Report. Measured Progress.

Masters, J., Famularo, L., & King, K. (2016). Using Technology-Enhanced Items to Measure

Fifth Grade Geometry. Annual Meeting of the National Council on Measurement in

Education. Washington, D.C.

Mayer, R. E., & Moreno, R. (2003). Nine ways to reduce cognitive load in multimedia learning.

Educational Psychologist, 38, 43–52.

McFarlane, A., Williams, J. M., & Bonnett, M. (2000). Assessment and multimedia authoring- a

tool for externalizing understanding. Journal of Computer Assisted Learning, 16, 201-212.

National Council of Education Statistics (NCES). (2016). NAEP Tools and Applications.

Retrieved March, 2018, https://nces.ed.gov/nationsreportcard/about/naeptools.aspx/

National Council of Education Statistics (NCES). (2018). Trends in International Math and

Science Study. Retrieved March, 2018, https://nces.ed.gov/timss/educators.asp

Oklahoma State Department of Education (2017). Solicitation #2650000340. Retrieved March,

2018 from https://www.ok.gov/cio/documents/Solicitation2650000340.pdf

Partnership for Assessment of Readiness for College and Careers (PARCC). (2010). Application

for the Race to the Top Comprehensive Assessment Systems Competition. Retrieved

December, 2011 from

http://www.parcconline.org/sites/parcc/files/PARCC%20Application%20-%20FINAL.pdf.

Pellegrino, J. W., Chudowsky, N., & Glaser, R. (Eds.) (2001). Knowing What Students Know:

The Science and Design of Educational Assessment. Washington, DC: The National

Academy Press.

Preacher, K. J. (2002, May). Calculation for the test of the difference between two independent

correlation coefficients [Computer software]. Available March, 2018 from

http://quantpsy.org.

Quellmalz, E. S., Timms, M. J., & Schneider, S. A. (2009). Assessment of student learning in

science simulations and games. Paper prepared for the National Research Council Workshop

on Gaming and Simulations. Washington, DC: National Research Council.

Salvia, J., Ysseldyke, J. E., & Bolt, S. (2007). Assessment in special and inclusive education.

Wadsworth.

Scalise, K. & Gifford, B. (2006). Computer-based assessment in e-earning: A framework for

constructing “intermediate constraint” questions and tasks for technology platforms. The

Journal of Technology, Learning, and Assessment, 4(6).

Sireci, S. G. & Zenisky, A. L. (2006). Innovative item formats in computer-based testing: In

pursuit of improved construct representation. In S. M. Downing and T. M. Haladyna (Eds.),


Handbook of test development (pp. 329-347). Mahwah, NJ, US: Lawrence Erlbaum

Associates Publishers.

Smarter Balanced Assessment Consortium (SBAC). (2010). Race to the Top Assessment

Program Application for New Grants; Comprehensive Assessment Systems CFDA Number:

84.395B. Washington, D.C.: U.S. Department of Education.

State of Maine, Department of Education (2014). Maine Comprehensive Asssessment System.

Procurement for the Implementation of English Language Arts and Mathematics Assessment

in Grades 3 through 8 and High School. Retrieved March, 2018 from

http://www.maine.gov/doe/assessment/documents/rfp.doc

Strain-Seymour, E., Way, W., & Dolan, R. (2009). Strategies and Processes for Developing

Innovative Items in Large-Scale Assessments. Iowa City, IA: Pearson Education.

Thomas, A. (2016). Evaluating the Validity of Technology-Enhanced Educational Assessment

Items and Tasks: An Empirical Approach to Studying Item Features and Scoring Rubrics.

(2016). CUNY Academic Works.

U.S. Department of Education. (2010). Transforming American Education: Learning Powered by

Technology. National Educational Technology Plan 2010. Washington, D. C.

National Council of Education Statistics (NCES). (2016). NAEP Tools and Applications.

Retrieved March, 2018, https://nces.ed.gov/nationsreportcard/about/naeptools.aspx/

Wan, L. & Henly, G. A. (2012): Measurement properties of two innovative item formats in a

computer-based test, Applied Measurement in Education, 25(1), 58-78.

Willis, G. B. (1999). Cognitive interviewing: A “how to” guide. Paper presented at the Meeting

of the American Statistical Association. North Carolina: Research Triangle Institute.

Winter, P. C., Wood, S. W., Lottridge, S. M., Hughes, T. B., & Walker, T. E. (2012). The utility

of online mathematics constructed-response items: Maintaining important mathematics in

state assessments and providing appropriate access to students. Final research report. Pacific

Metrics Corporation.

Zenisky, A. L. & Sireci, S. G. (2002). Technological innovations in large-scale assessment.

Applied Measurement in Education, 15(4), 337-362.

Zucker, S., Sassman, C., & Case, B. J. (2004). Cognitive Labs. San Antonio, TX: Harcourt

Assessment, Inc.

using technology-enhanced items to measure fourth grade ... · annual meeting of the american...

Documents