
APPLIED MEASUREMENT IN EDUCATION, 22: 38–60, 2009
Copyright © Taylor & Francis Group, LLC
ISSN: 0895-7347 print / 1532-4818 online
DOI: 10.1080/08957340802558342


Item Position and Item Difficulty Change in an IRT-Based Common Item Equating Design

Jason L. Meyers
Pearson

G. Edward Miller
Texas Education Agency

Walter D. Way
Pearson

In operational testing programs using item response theory (IRT), item parameter invariance is threatened when an item appears in a different location on the live test than it did when it was field tested. This study utilizes data from a large state's assessments to model change in Rasch item difficulty (RID) as a function of item position change, test level, test content, and item format. As a follow-up to the real data analysis, a simulation study was performed to assess the effect of item position change on equating. Results from this study indicate that item position change significantly affects change in RID. In addition, although the test construction procedures used in the investigated state seem to somewhat mitigate the impact of item position change, equating results might be impacted in testing programs where other test construction practices or equating methods are utilized.

INTRODUCTION

Item response theory (IRT) applications typically utilize the characteristic of item parameter invariance. That is, when the underlying assumptions are satisfied, IRT supports the practice of applying item parameter estimates obtained from a

Correspondence should be addressed to Jason L. Meyers, Pearson, Austin Operations Center—Tech Ridge, 400 Center Ridge Drive, Austin, TX 78753. E-mail: [email protected]


previous sample of test-takers to new samples. In theory, the advantages of item parameter invariance permit applications such as computerized-adaptive testing (CAT) and test pre-equating.

A number of threats to item parameter invariance exist at the item level. These include context effects, item position effects, instructional effects, variable sample sizes, and other sources of item parameter drift that are not formally recognized or controlled for in IRT applications. Several researchers have documented the existence and investigated the influence of such item-level effects (Whitely & Dawis, 1976; Yen, 1980; Klein & Bolus, 1983; Kingston & Dorans, 1984; Rubin & Mott, 1984; Leary & Dorans, 1985).

In operational testing programs using IRT, model advantages must often be weighed against concerns over threats to item parameter invariance. One place where this tension occurs is the position of items in a test from one use to the next. The conservative view is that item sets used for test equating should be kept in identical or very similar positions within the test from one use to the next. Of course, this requirement can be limiting and difficult to sustain over time. For example, using items more than once in operational administrations may be a concern, for security reasons, because of test disclosure requirements, or both. In addition, test content and design considerations often require at least some shifting in the location of common items across test administrations. Many state departments of education face both of the above. It is common for states to be required to release their operational test items after test administration, which makes embedding field-test items on multiple forms necessary. It is virtually impossible to construct new tests from these sets of embedded field-test items with the item positions intact. In addition, reading tests contain items that are connected to a reading passage and, therefore, must stay together. This further complicates states' ability to maintain item position from field test to operational test.

The most liberal stance with respect to position effects is implicit in CAT, in which items are administered by design in varied positions that are difficult to predict. CAT assumes that item position effects may be ignored. In practice, it is difficult to investigate this assumption because CAT data are sparse and because each test-taker typically receives a unique test.

Most scaling and equating designs fall between the extremes of re-administering linking items in exactly the same positions and the typical CAT administration. In some IRT common item equating designs, item parameter estimates from several previously administered forms may be used to link a new operational test form to a common scale. Two procedures are commonly used in such designs to mitigate the potential impact of position and other context effects. First, item parameters are re-estimated based on the operational data and linked back to the previous estimates (as opposed to developing pre-equated conversions based on the initial estimates). Second, a screening process is typically used to eliminate items from the common item set if the new and previous estimates differ by more than some


threshold (Miller, Rotou, & Twing, 2004). There is disagreement among psychometricians over what this threshold should be. Theoretically, the threshold should be related to sample size. Wright and Douglas (1975, pp. 35–39), however, note that random uncertainty in item difficulty of less than 0.3 logits has no practical bearing on person measurement. As a consequence, a number of Rasch-based assessment programs use the threshold of 0.3.
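To make the screening step concrete, the following is a minimal sketch (not the procedure of any particular program) of a 0.3-logit stability check: it assumes two aligned arrays of previous and new Rasch difficulty estimates, iteratively drops items whose centered difference exceeds the criterion, and returns the resulting linking constant. The function and variable names are ours.

```python
import numpy as np

def stable_link(b_old, b_new, threshold=0.3):
    """Mean-shift linking with an iterative screen: drop items whose centered
    difficulty change exceeds `threshold` logits, then re-link until stable."""
    keep = np.arange(len(b_old))
    while True:
        shift = np.mean(b_new[keep] - b_old[keep])        # current linking constant
        resid = (b_new[keep] - shift) - b_old[keep]       # change after removing the scale shift
        ok = np.abs(resid) <= threshold
        if ok.all():
            return shift, keep
        keep = keep[ok]                                   # remove unstable items and iterate

# toy usage with made-up values
rng = np.random.default_rng(0)
b_field = rng.normal(0.0, 1.0, 40)
b_live = b_field + rng.normal(0.0, 0.1, 40)
b_live[5] += 0.8                                          # one drifting item
constant, kept = stable_link(b_field, b_live)
print(round(float(constant), 3), len(kept))
```

Whether the 0.3 criterion is applied to raw or centered differences varies in practice; the sketch uses centered differences so that a uniform scale shift does not flag every item.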

Because operational testing programs often require scaling and equating approaches where the positions of linking items vary across administrations, it is important to examine position effects and to try to quantify these effects in relation to other possible sources of variation in item performance. The purpose of this study was to analyze and model the effects of item position on item difficulty estimates in a large-scale K–12 testing program. The testing program utilizes a Rasch-based, common item equating design where the operational tests are equated by re-estimating item parameters and linking them back to field-test estimates. Because the field testing occurs by embedding field-test items within the operational test, the positions of the items within the test change between field testing and operational use. Changes in Rasch item difficulties (RIDs) from field testing to operational testing were modeled as a function of changes in item position, test level, test content, and item format (e.g., passage-based vs. discrete). Based on the final models, a simulation study was also performed to assess the effect of item position change on equating.

RELATED LITERATURE

A number of researchers have investigated the effect of item position (or the effect of changing the item position) on item difficulty. Tests varying across age levels and purposes have been considered in these studies: from elementary grade reading and math tests to post-graduate level bar and licensure exams. Reviewed in this article are studies published post-1975; an excellent review of studies performed prior to 1975 (with a few later than 1975) is provided by Leary and Dorans (1982).

Many studies have documented the finding that changing the position of items in a test can affect item difficulty. Whitely and Dawis (1976) found significantly different Rasch item difficulty estimates for 6 of 15 core items that differed only in item position across forms. Eignor and Stocking (1986) also found that the positioning of SAT items from pretest to operational test appeared to make a difference in the item difficulty. Harris (1991) analyzed three forms of the ACT with items located in different positions and found that different scale scores would have been obtained for some students. The PISA (Programme for International Student Assessment) 2000 Technical Report identified a booklet (or item position) effect with the only difference in the different


test booklets being item arrangement. Haertel (2004) found that differences in linking item position between the year 1 and year 2 tests caused anomalous behavior in linking items. Finally, Davis and Ferdous (2005) identified item position effects on item difficulty for a grade 3 reading test, and for math and reading tests at grade 5. From these studies it is fairly evident that when items change positions between testing administrations, the associated item difficulties can change as well.

In addition to impacting item difficulty, change in item position has been shown to affect equating results. Yen (1980) investigated item position on the California Achievement Test (CAT) and found some effects on parameter estimates as well as an impact on equating results. Similarly, Kingston and Dorans (1984) found that item position effects had an adverse effect on equating forms of the Graduate Record Examination (GRE). Brennan (1992) found that equating of the ACT was affected by the scrambling of items on different forms. Kolen and Harris (1990) found that when ACT mathematics items were pretested in a separate section (presumably at the end of the test), effects due to a lack of motivation or fatigue had a subsequent adverse impact on equating results. Zwick (1991) attributed a scaling anomaly between the 1984 and 1986 reading NAEP to differences in item positions. Pommerich and Harris (2003) identified an item position effect for both ACT Math and Reading and concluded that different equating relations may be obtained as a result of different item positions. In each of these studies, item position changes led not only to changes in individual item difficulties, but also contributed to differences in test equating results.

Other studies have identified more specific impacts of changing item position. Wise, Chia, and Park (1989) found that item order and other item context alterations impacted low-achieving test-takers more than high-achieving test-takers on tests of word knowledge and arithmetic reasoning. In addition, Way, Carey, and Golub-Smith (1992) found that TOEFL reading comprehension items that were pretested toward the beginning of the section but that appeared toward the end of the section on the operational test became more difficult, but items pretested near the end of the section and operationally tested near the beginning of the section became easier. These studies suggest that item position effects can vary depending on factors such as how much the item position changes, the direction of change (i.e., toward the beginning of the test versus toward the end of the test), and the ability levels of the test-takers.

Taken as a whole, these studies consistently indicate effects on IRT-based item parameter estimates when items change positions between administrations. In this study, our interest was in modeling these effects based on data from a specific testing program, and in investigating the implications of these effects for the equating procedures used to support the program.


MODELING OF CHANGE IN ITEM DIFFICULTY FROM FIELD TEST TO OPERATIONAL TEST

The data used for modeling the change in Rasch item difficulty (RID) from field testing to operational testing came from an assessment program in a large state in which students take standardized tests in the areas of reading, mathematics, writing, social studies, and science. These assessments have been administered since the 2003 school year. In grades 3 through 8, the tests contain between 40 and 50 multiple-choice items. They are untimed, and students have up to an entire school day to complete the assessment. In fact, there have been reports of students staying well into the evening hours in order to finish testing. The assessments in this state are independently scaled for each grade and subject using the Rasch model. The item statistics are those of items that appeared on live tests in spring 2004. The statistics are based on exceptionally large sample sizes, with live test items being taken by several hundred thousand students.

An analysis of item position change effects was particularly important for this testing program because the field-tested item difficulties are utilized in the operational form equating procedures. For this testing program, the need to disclose complete operational test forms makes it infeasible to maintain an internal anchor set of equating items from one test administration to the next. As a result, a field-test design is employed in which the operational item parameter estimates are scaled through the parameter estimates obtained when the items were field tested. In this design, all of the operational items are effectively the common item anchor set. However, the design requires changing the positions of the items between field testing and use in the final forms.

Only data from reading and mathematics were analyzed. In addition, only grades 3 through 8 were used, because these grades are based on a common objective structure; the curriculum shifts substantially at grade 9. Descriptive statistics for the items appearing at each grade and subject are presented in Table 1.

For each grade and subject, the table displays the number of items on each test, the average field-test RID with its standard deviation, minimum, and maximum, and the average live-test RID with its standard deviation, minimum, and maximum. On average, the field tests and live tests were of roughly equal difficulty.

TABLE 1
Descriptive Statistics for Items Included in Analysis

                                 Field Test RIDs                       Live Test RIDs
Grade  Subject   Items    Mean     STD     Min      Max         Mean     STD     Min      Max
3      MATH       27      0.2322   0.8518  −1.7587  2.4118      0.1930   0.8463  −1.7689  2.3582
4      MATH       42      0.0000   1.0000  −1.9874  2.1925      0.0000   1.0000  −2.0353  2.1762
5      MATH       37      0.2169   1.0000  −2.5930  2.1861      0.1855   1.0222  −2.6489  1.8287
6      MATH       39      0.1223   0.9754  −1.9140  1.6994      0.0975   0.9884  −1.9250  1.6390
7      MATH       48      0.0000   1.0000  −3.1525  2.2535      0.0000   1.0000  −3.3407  2.2509
8      MATH       48      0.0025   1.0151  −2.5637  2.1599     −0.0041   1.0151  −2.1135  2.1467
3      READING    36      0.0000   1.0000  −2.2519  2.3310      0.0000   1.0000  −2.3091  2.0422
4      READING    28      0.0742   0.9615  −1.1489  2.3823     −0.1169   1.0535  −2.1965  2.3200
5      READING    42      0.0000   1.0000  −2.2883  2.4483      0.0000   1.0000  −2.9300  2.2668
6      READING    42      0.0000   1.0000  −2.1119  1.6004      0.0000   1.0000  −2.5112  1.7821
7      READING    48      0.0000   1.0000  −1.7013  2.5878      0.0000   1.0000  −2.3077  2.6413
8      READING    48      0.0000   1.0000  −1.9310  2.0124      0.0000   1.0000  −1.9486  2.0124

In addition, Table 2 displays, for all grades and subjects, the average RID change between field test and live test based on the number of item positions an item moved. It should be noted that for a given grade and subject, the field test and live test contain the same number of items. So, for example, an item appearing as item number 7 on the field test and item number 7 on the live test would have an item position change of zero. As shown in the table, the majority of items moved at least 5 positions in either direction, and many moved more than 20 positions.

TABLE 2
Average Change in RID as a Function of Change in Item Position

Item Position Change     N    Mean RID Change   Std Dev    Min       Max
Less than −20            66       −0.2337       0.2829    −1.0085    0.5591
Between −20 and −15      48       −0.1397       0.2773    −0.6656    0.7371
Between −15 and −10      66       −0.0767       0.2345    −0.8550    0.3795
Between −10 and −5       42        0.0029       0.1964    −0.3289    0.4091
Between −5 and 0          8       −0.0214       0.1571    −0.3041    0.1998
Between 0 and 5          15        0.0466       0.1992    −0.4378    0.4845
Between 5 and 10         49        0.0480       0.2176    −0.4161    0.5931
Between 10 and 15        56        0.1669       0.2715    −0.3738    1.1266
Between 15 and 20        47        0.1910       0.2360    −0.2389    0.8927
Greater than 20          74        0.2120       0.2921    −0.4447    1.4158

Note. Item position change is calculated as position on live test − position on field test. Negative values indicate the item moved more toward the beginning of the test and positive values indicate that the item moved more toward the end of the test.


As expected, the average RID change decreases as item position change decreases and increases as item position change increases.
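A summary of the kind shown in Table 2 can be computed directly from an item-level file. The sketch below assumes a pandas DataFrame with hypothetical column names (pos_field, pos_live, rid_field, rid_live) and the sign convention that a positive position change means the item moved toward the end of the test.

```python
import pandas as pd

def rid_change_by_position_bin(items: pd.DataFrame) -> pd.DataFrame:
    """Count, mean, SD, min, and max of RID change within bins of item
    position change (column names are hypothetical)."""
    items = items.copy()
    items["pos_change"] = items["pos_live"] - items["pos_field"]   # positive = moved toward the end
    items["rid_change"] = items["rid_live"] - items["rid_field"]
    bins = [-60, -20, -15, -10, -5, 0, 5, 10, 15, 20, 60]
    items["bin"] = pd.cut(items["pos_change"], bins=bins)
    return (items.groupby("bin", observed=False)["rid_change"]
                 .agg(["count", "mean", "std", "min", "max"]))
```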

Math Analyses

Using multiple regression, change in RID was first modeled as a function of item position change (item position), grade, objective, and time between field testing and live testing (time). Because each grade and subject is scaled independently, pre- and posttest RIDs were standardized to unit normal; this allowed for meaningful comparisons across grade and subject. Regressions were conducted separately for each grade, and then across grades using grade as a predictor. This resulted in a total of seven regressions.

Inspection of the parameter estimates and t-test p-values for all seven regressions suggested that all independent variables other than item position and time could be dropped from the analyses. Full versus reduced model tests were conducted to confirm this. Because grade was dropped from the combined 3–8 model, no further analysis of the separate single-grade datasets was deemed appropriate.
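The full-versus-reduced comparisons referred to here are nested-model F tests. The following numpy/scipy sketch shows only the arithmetic for testing whether a block of predictors can be dropped; the design matrices are hypothetical and this is not the authors' actual analysis code.

```python
import numpy as np
from scipy import stats

def nested_f_test(y, X_full, X_reduced):
    """F test for dropping the predictors that are in X_full but not in X_reduced.
    Both design matrices should contain an intercept column if one is wanted."""
    def sse(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    sse_full, sse_red = sse(X_full), sse(X_reduced)
    df_num = X_full.shape[1] - X_reduced.shape[1]
    df_den = len(y) - X_full.shape[1]
    f_stat = ((sse_red - sse_full) / df_num) / (sse_full / df_den)
    p_value = stats.f.sf(f_stat, df_num, df_den)
    return f_stat, p_value
```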

At this point, only item position and time remained as predictors in the analysis. However, a closer look at the original regression output revealed that the variable time should also be dropped. Time was significant only for grade 6, it had a negative coefficient (which would suggest that items become more difficult over time, contrary to research), and only one item was making the coefficient statistically significant. Time was dropped from subsequent models and it was decided that only items field tested one year prior to live testing would be included in future analyses.

The mathematics test model now included only the single explanatory variable item position. A scatterplot of item position change by change in RID (Figure 1) indicated a curvilinear relationship. Both quadratic and cubic models were estimated; the cubic model provided the closest fit. Even so, the cubic model predicted close to a 0.06 RID change with no item position change. Knowing from the actual data that the mean RID difference for item position differences of 5 or less was approximately 0.005, the cubic model was re-estimated with the intercept constrained to be zero. The final mathematics test model was

RID_difference_MATH = 0.00329(item position change) + 0.00002173(item position change)² + 0.00000677(item position change)³    (1)

FIGURE 1 RID difference by item position change for grades 3–8 math combined. (Scatterplot; x-axis: item position change, y-axis: RID difference; the fitted curve corresponds to Equation 1.)


The mathematics regression model had an R-square value of 0.5604, indicating that about 56% of the variance in change in Rasch item difficulty could be attributed to item position change. One may note that the model indicates that the effect of a change in item position is asymmetric about zero; specifically, the model indicates that placing an item nearer the end of the test has slightly more effect on its item difficulty than placing it nearer the beginning of the test. Yen (1980) and Wise et al. (1989) found that fatigue impacts low-achieving test-takers more than high-achieving test-takers and that this effect is most pronounced toward the end of the test. Thus, the asymmetry of the regression equation produced appears to be supported by previous research. It should also be noted that a cross-validation of the regression equations was considered for this study, but was ultimately decided against due to the large sample sizes and the fact that the regression equations were only used for demonstration purposes.
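For readers wanting to reproduce this kind of fit, the sketch below estimates a cubic in item position change constrained through the origin, as was done for Equation 1, by ordinary least squares on the powers of the predictor; the array names are hypothetical.

```python
import numpy as np

def fit_zero_intercept_cubic(pos_change, rid_change):
    """Least-squares fit of rid_change = b1*x + b2*x**2 + b3*x**3 with no intercept."""
    x = np.asarray(pos_change, dtype=float)
    y = np.asarray(rid_change, dtype=float)
    X = np.column_stack([x, x**2, x**3])      # no column of ones, so the intercept is fixed at 0
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ coefs
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum(y ** 2)   # uncentered R^2 for a no-intercept model
    return coefs, r2
```

As an arithmetic check on Equation 1, an item moving 20 positions toward the end of the test is predicted to become about 0.00329(20) + 0.00002173(400) + 0.00000677(8000) ≈ 0.13 logits harder, whereas a 20-position move toward the beginning predicts a change of only about −0.11 logits, which is the asymmetry discussed above.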

Reading Analyses

Change in RID for the reading tests was modeled as a function of item position, grade, and time. Objective was excluded due to (a) possible objective-passage confounding and (b) objective-passage combination sample size considerations. However, the consideration that change in RID might be affected by the reading passage that the item is associated with had to be taken into account. Consequently, the reading test data were initially modeled using a two-level random coefficient model where items were nested within passages. Separate random coefficients regressions by grade were performed along with a single random coefficients regression on the combined-grades dataset.
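A nested (items-within-passages) random coefficients model of this general kind can be specified with standard mixed-model software. The sketch below uses the statsmodels formula interface with hypothetical column names; it fits a random passage intercept and slope for position change and is meant as an illustration of the structure described here, not the authors' exact specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_passage_model(items: pd.DataFrame):
    """Random-coefficients model with items (level 1) nested in passages (level 2):
    fixed effect of position change plus a passage-specific random intercept and slope."""
    model = smf.mixedlm(
        "rid_change ~ pos_change",    # fixed part
        data=items,
        groups=items["passage"],      # level-2 grouping: reading passage
        re_formula="~pos_change",     # random intercept and slope by passage
    )
    return model.fit(reml=True)

# usage (hypothetical): result = fit_passage_model(items); print(result.summary())
```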

Results from these regressions indicated that passage variability was not statistically significant and, thus, standard ordinary least squares (OLS) regression was used for all the subsequent analyses.

As was the case in mathematics, the variable grade (in the combined data regression analysis) was not found to be statistically significant in any of the regressions. Additionally, as observed in all but grade 6 of the mathematics tests, the variable time was not statistically significant for any reading tests. Thus, all further analysis of reading data was restricted to the combined grades data with the single independent variable item position.

A scatterplot of item position change by change in RID (Figure 2) indicated that the relationship between the two variables was curvilinear. Both quadratic and cubic models were estimated; the cubic model provided the closest fit. Finally, as had been done with math, the cubic model was re-estimated with the intercept constrained to be zero. The final reading test model was

RID_difference_READ = 0.00845(item position change) − 0.00008343(item position change)² + 0.00001135(item position change)³    (2)

The reading regression model had an R-square value of 0.7326, indicating that about 73% of the variance in change in Rasch item difficulty could be attributed to item position change.

FIGURE 2 RID difference by item position change for grades 3–8 reading combined. (Scatterplot; x-axis: item position change, y-axis: RID difference; the fitted curve corresponds to Equation 2.)

SIMULATION STUDY-METHOD

To further illustrate the effects of item position change in the context of the test equating procedures used in this testing program, demonstration simulations were conducted using a set of 281 field-test items from the grade 5 mathematics test. We chose not to simulate reading data because the modeling results for reading and math were similar and a single set of simulations seemed sufficient to demonstrate the effects in question. The mean, standard deviation, minimum, and maximum of the RIDs for the mathematics items were 0.20, 1.32, −3.43, and 3.21, respectively. These items were spiraled in sets of eight in different operational test form versions containing 44 items. The field-test item difficulty estimates were used as true item parameters for the simulations, which repeated the following steps for each of 100 iterations:

1. Create a set of "field-test" item parameter estimates by "perturbing" (i.e., adding an error component to) the true parameters, with error components based on SE_b, assuming a particular distribution of ability (θ) and sample size.

First, for the Rasch item difficulty (RID) of each field-test item i, i = 1, . . . , 281, compute its information as follows:

I(b_i) = N_FT Σ_{j=0}^{44} p_θj P_ij (1 − P_ij),

where

P_ij = 1 / [1 + e^−(θj − b_i)]
b_i = "true" RID for item i
p_θj = assumed proportion of students with ability θj
N_FT = assumed number of students field tested.

The standard error for each b_i is then computed as follows:

SE_bi = 1 / √I(b_i)

Finally, the perturbed "field-test" item parameter estimate for each item is

b̂_i = b_i + z_random(SE_bi),

where z_random = random variate from the N(0,1) distribution.

2. Construct a test by selecting at random from the available pool of 281 field-test items a fixed number of items across various difficulty strata that were established from analyzing a final test form used operationally. The selected items were then ordered by difficulty in a manner that modeled operational practice. Figure 3 presents the field-test Rasch difficulty values plotted against their operational item positions for the mathematics tests at grades 3 through 8. As can be seen in Figure 3, these tests were constructed by placing easier items toward the beginning and end of the test, and more difficult items in the middle of the test. (Note that new field-test items are placed in positions 21 to 30.) The relationship between Rasch difficulty and item position is fit reasonably well by a polynomial regression. This relationship was used to order the items in the simulations.

FIGURE 3 Field-test Rasch difficulty by item position for math tests at grades 3–8. (Scatterplot; x-axis: item position, y-axis: field-test b-value; fitted curve: y = −0.0024x² + 0.14x − 1.294, R² = 0.3245.)


3. For the items selected, create a set of "final-form" item parameter estimates by perturbing the true parameters with error components based on SE_b as in step 1, assuming a sample size that was five times the corresponding field-test sample size. In addition, a fixed component, calculated by entering the difference between each item's field-test and final-form positions into Equation 1, was added to the true item difficulty.

For the RID of each field-test item i, i = 1, . . . , 281, compute its information as follows:

I(b_i) = N_FF Σ_{j=0}^{44} p_θj P_ij (1 − P_ij),

where

P_ij = 1 / [1 + e^−(θj − b_i)]
b_i = "true" RID for item i
p_θj = assumed proportion of students with ability θj
N_FF = assumed number of students tested with the final form.

The standard error for each b_i is then:

SE_bi = 1 / √I(b_i)

The perturbed "final-form" item parameter estimate for each item is

b̂_i = b_i + 0.00329Δ + 0.00002173Δ² + 0.00000677Δ³ + z_random(SE_bi),

where Δ = final-form item position − field-test item position and z_random = random variate from the N(0,1) distribution.

4. Link the final-form item parameter estimates to the field-test estimates by calculating the mean difference between the two sets of estimates. As part of the linking, an iterative "stability check" was utilized, through which items with differences between field-test and final-form difficulty estimates greater than 0.3 were eliminated from the linking item set. This step imitated operational procedures used in the testing program, which are commonly employed in IRT test equating settings.

5. Generate a test characteristic curve (TCC) based on the final linked item parameter estimates and compare it to the true TCC based on the true item parameters.

The outcomes of interest over the 100 replications included the true minus estimated TCC differences (evaluated over fixed theta levels), the differences between the mean of the true and final estimated item difficulties (i.e., the equating constant), and the number of items removed in the iterative stability checks across the 100 replications. To vary the values of SE_b for the simulations, four combinations of field-test and final-form sample sizes were used: 500 and 2,500; 1,000 and 5,000; 2,000 and 10,000; and 20,000 and 100,000. For each sample size combination, one set of 100 replications simulated the systematic error due to the calculated changes in item position and a second set of 100 replications did not incorporate this systematic error. In the second set of 100 replications, step 3 included the z_random(SE_bi) component but did not include the component based on the function of item position change. This resulted in eight sets of replicated simulations in total.
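The five steps above translate into a short Monte Carlo routine. The sketch below is a simplified, self-contained version under stated assumptions: a uniform weight over a fixed theta grid, a simple rank-based stand-in for the difficulty-by-position ordering rule, and function and variable names of our own choosing. It illustrates the logic of the design rather than reproducing the operational study.

```python
import numpy as np

THETA = np.linspace(-4, 4, 45)                   # assumed ability grid
P_THETA = np.full(THETA.size, 1.0 / THETA.size)  # assumed proportion of students at each theta

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def se_b(b, n):
    """SE of a Rasch difficulty from the information at that difficulty (step 1 formulas)."""
    p = rasch_p(THETA, b)
    info = n * np.sum(P_THETA * p * (1.0 - p))
    return 1.0 / np.sqrt(info)

def perturb(b_true, n, rng, position_shift=None):
    """Steps 1 and 3: add sampling error and, optionally, the Equation 1 position effect."""
    noisy = np.array([b + rng.normal(0.0, se_b(b, n)) for b in b_true])
    if position_shift is not None:
        d = np.asarray(position_shift, dtype=float)
        noisy = noisy + 0.00329 * d + 0.00002173 * d**2 + 0.00000677 * d**3
    return noisy

def one_replication(b_pool, n_ft, n_ff, rng, use_position_effect=True, n_items=44):
    # Step 1: "field-test" estimates for the whole pool.
    b_ft = perturb(b_pool, n_ft, rng)

    # Step 2 (stand-in): draw a form at random; field-test positions are random,
    # final-form positions are assigned by difficulty rank (the operational rule
    # described in the text instead placed easy items at both ends of the form).
    chosen = rng.choice(b_pool.size, size=n_items, replace=False)
    ft_pos = rng.permutation(np.arange(1, n_items + 1))
    ff_pos = np.argsort(np.argsort(b_ft[chosen])) + 1
    shift = (ff_pos - ft_pos) if use_position_effect else None

    # Step 3: "final-form" estimates with the larger sample and the position component.
    b_ff = perturb(b_pool[chosen], n_ff, rng, position_shift=shift)

    # Step 4: mean-difference linking with the iterative 0.3-logit stability check.
    b_anchor = b_ft[chosen]
    keep = np.arange(n_items)
    while True:
        const = np.mean(b_ff[keep] - b_anchor[keep])
        diff = (b_ff[keep] - const) - b_anchor[keep]
        ok = np.abs(diff) <= 0.3
        if ok.all():
            break
        keep = keep[ok]
    b_linked = b_ff - const

    # Step 5: compare the true and estimated test characteristic curves.
    tcc_true = np.array([rasch_p(t, b_pool[chosen]).sum() for t in THETA])
    tcc_est = np.array([rasch_p(t, b_linked).sum() for t in THETA])
    return const, keep.size, float(np.mean(tcc_true - tcc_est))

rng = np.random.default_rng(1)
pool = rng.normal(0.2, 1.3, 281)                 # stand-in for the 281-item grade 5 pool
results = [one_replication(pool, 500, 2500, rng) for _ in range(100)]
print(np.mean(results, axis=0))                  # mean equating constant, items kept, TCC difference
```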


SIMULATION STUDY-RESULTS

Figure 4 presents the relationships between the field-test Rasch difficulties and the final-form item positions for the test forms selected in 10 simulation iterations of the grade 5 math tests. It can be seen from comparing Figure 3 and Figure 4 that the relationships between item difficulty and item position in the simulations are quite similar to the relationships seen in the operational tests.

FIGURE 4 Field-test Rasch difficulty by item position for 10 simulation replications, grade 5 math test. (Scatterplot; x-axis: item position, y-axis: field-test b-value; fitted curve: y = −0.0024x² + 0.1269x − 1.036, R² = 0.3336.)

Table 3 summarizes the results from the various simulation study conditions. As shown in the table, the number of items remaining in the equating set after the iterative stability check removed items with RID changes of 0.30 or greater from field test to final form increased as the sample sizes increased. Fewer items remained in the equating set in the conditions where the effect of item position change was incorporated into the model, although once the sample size reached 2,000 on the field test and 10,000 on the final form there were no observable differences.

TABLE 3
Summary of Grade 5 Math Simulation Results

                     Items Remaining            Equating Constant                True − Estimated TCCs
Condition            M (SD)         Min, Max    M (SD)          Min, Max         M (SD)           Min, Max
500/2500*            39.51 (2.11)   35, 44      −0.016 (0.033)  −0.100, 0.080    −0.007 (0.009)   −0.020, 0.003
500/2500             41.96 (1.21)   39, 44      −0.001 (0.024)  −0.056, 0.055    −0.005 (0.007)   −0.014, 0.005
1000/5000*           41.78 (1.14)   38, 44      −0.012 (0.019)  −0.054, 0.050    −0.009 (0.005)   −0.016, 0.000
1000/5000            43.65 (0.58)   42, 44      −0.001 (0.017)  −0.049, 0.045    −0.014 (0.008)   −0.023, 0.000
2000/10000*          44 (0)         44, 44      −0.000 (0.011)  −0.033, 0.029    −0.007 (0.004)   −0.013, 0.000
2000/10000           44 (0)         44, 44      −0.001 (0.011)  −0.029, 0.027    −0.006 (0.003)   −0.009, 0.000
20000/100000*        44 (0)         44, 44      −0.000 (0.004)  −0.013, 0.012    −0.004 (0.002)   −0.006, 0.000
20000/100000         44 (0)         44, 44      −0.000 (0.004)  −0.009, 0.009    −0.003 (0.002)   −0.004, 0.000

Note. * = effect of item position change simulated. Condition labels give the field-test/final-form sample sizes.

Table 3 also shows that the magnitude of the equating constant decreased as sample size increased. Again, in the conditions in which the effect of item position change was incorporated, the equating constants were larger than when it was not included. However, once the sample size reached 2,000 on the field test, the impact of item position change was negligible.

Finally, the mean differences between the true and estimated test characteristic curves, also shown in Table 3, reveal a similar pattern. The larger the sample size, the more accurate were the estimated TCCs. Item position change impacted the accuracy slightly, although once the sample size reached 2,000 on the field test the differences between the conditions simulating the impact of item position change and those ignoring it were minimal.

Figure 5 summarizes the true minus estimated TCC differences over fixed theta levels for simulation sample sizes of 500 (field test) and 2,500 (final form). The upper panel shows results where no item position effects were simulated and the lower panel shows results based on incorporating the item position effects into the simulations. In these graphs, the plotted points represent the mean difference between the true and estimated TCC equivalents over the 100 simulated replications at each true theta level, and the vertical lines extending from the points are based on one standard deviation of the differences between the true and estimated TCC equivalents over the 100 replications. These results indicate only a very small effect due to changes in item position. The reason for this finding is the practice followed for ordering the items in the final operational forms. Because easy and difficult items are placed in the operational forms symmetrically around the positions where the items were originally field tested, the effects of item position changes largely cancel out at the level of the test characteristic curve. Figures 6 through 8 present the corresponding TCCs for the 1,000/5,000, 2,000/10,000, and 20,000/100,000 conditions. These figures indicate that effects due to changes in item position decreased even more as sample sizes increased.

FIGURE 5 True minus estimated TCCs for the grade 5 math test by ability level with no item position change effects simulated (upper panel) and with item position change effects simulated (lower panel). Note: Plotted points represent the mean difference between the true and estimated TCC and lines extending from the points represent one standard deviation of these differences.

FIGURE 6 True minus estimated TCCs for the grade 5 math test by ability level with no item position change effects simulated (upper panel) and with item position change effects simulated (lower panel) for the 1000/5000 condition.

FIGURE 7 True minus estimated TCCs for the grade 5 math test by ability level with no item position change effects simulated (upper panel) and with item position change effects simulated (lower panel) for the 2000/10000 condition.

FIGURE 8 True minus estimated TCCs for the grade 5 math test by ability level with no item position change effects simulated (upper panel) and with item position change effects simulated (lower panel) for the 20,000/100,000 condition.

Sequencing a test form so that the most difficult items appear in the middle of the test is somewhat unusual, as conventional test development practices typically order items so that the easier items appear at the beginning of the test and the most difficult items appear toward the end of the test. For the testing program studied here, the rationale for ordering items so that the most difficult items are in the middle of the exam is that students have warmed up by the middle of the test but are not yet fatigued, so their concentration is thought to be at its most optimal. It should be noted that the tests in this program are untimed, so there are no concerns about test speededness. With timed tests, an additional rationale for including the most difficult items at the end is to reduce the potential penalty on students who find themselves running out of time.

The effect of changes in item position on equating results would be much more severe if this program ordered items in the operational tests according to conventional practice, that is, from easiest to most difficult. To illustrate this effect, the simulations were repeated using an "easiest to most difficult" ordering based on the field-test item difficulties. To make the ordering realistic, an N(0,1) random error component was added to the field-test b-values chosen at each iteration, and the sum of the original field-test b-value and the N(0,1) error component was sorted from low to high to determine the final-form item positions. Figure 9 summarizes the relationships between the field-test Rasch difficulties and the final-form item positions for test forms selected in 10 simulation iterations using this alternate approach. In this figure, the most difficult items tend to be placed toward the end of the test, although the relationship is not perfect due to the influence of the random error component.
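The alternative ordering rule is straightforward to express in code. The sketch below (with hypothetical names) adds standard normal noise to the field-test difficulties selected for a form and sorts on the noisy values to assign final-form positions, so easier items tend to appear earlier but not deterministically.

```python
import numpy as np

def noisy_easy_to_hard_positions(b_field, rng):
    """Assign final-form positions 1..k by sorting field-test difficulty plus
    N(0, 1) noise, so easier items tend to come first but not deterministically."""
    b_field = np.asarray(b_field, dtype=float)
    noisy = b_field + rng.normal(0.0, 1.0, size=b_field.size)
    order = np.argsort(noisy)                     # item indices from (noisy) easiest to hardest
    positions = np.empty(b_field.size, dtype=int)
    positions[order] = np.arange(1, b_field.size + 1)
    return positions
```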

Figure 10 summarizes the true minus estimated TCC differences over fixed theta levels for simulation sample sizes of 500 (field test) and 2,500 (final form) over 100 replications where the operational items were approximately ordered from easiest to most difficult. These results indicate that when final-form items are ordered in this manner, the changes in item position impact the equating results such that estimated scores at high ability levels are lower than the true scores, and vice versa. At the lower ability levels, the differences are approximately 0.3 of a raw score point and at the higher ability levels the differences are approximately 0.2 of a raw score point. These differences, although relatively small, could have a tangible impact on equating results. In addition, the direction of the change is systematic in that the easy items are getting easier and the more difficult items are getting more difficult.

FIGURE 9 Field-test Rasch difficulty by item position for 10 simulation replications, grade 5 math test, with items ordered from easiest to most difficult. (Scatterplot; x-axis: item position, y-axis: field-test b-value; fitted curve: y = 0.0008x² − 0.0083x − 0.518, R² = 0.4628.)


DISCUSSION AND CONCLUSIONS

The results of this study indicated that although item position change from field testing to live testing can impact Rasch item difficulties, the observed effects for the testing program examined in this study were mitigated by the practice of ordering the field-test items such that easier items appear at the beginning and end of the test while more difficult items appear in the middle of the test. Although this ordering of items in the operational form was originally done for different reasons, it turns out to be very beneficial for the equating design employed with this testing program. That is, the simulations indicated that the effects on item difficulty due to position change essentially cancel each other out under this design.

However, the simulations in which final-form items were ordered by putting the easier items (based on field-tested b-values) in the beginning of the test and the more difficult items toward the end of the test did indicate measurable effects related to test equating. These effects reflected the modeled relationship between field-test to final-form item position change and b-value changes: difficult items became more difficult as they moved toward the end of the test, and easier items became easier as they moved toward the beginning of the test. As Figure 10 indicates, the resulting impact is that the test seems more difficult than it truly is for higher ability students and easier than it truly is for lower ability students. This would benefit higher ability students and disadvantage lower ability students in a resulting equating.

FIGURE 10 True minus estimated TCCs for grade 5 math test by ability level with item position change effects simulated based on easiest to most difficult item ordering. (x-axis: theta, y-axis: true minus estimated TCC.)


Clearly, these results justify the concerns expressed in the literature about item positions shifting from one administration to another when the IRT item parameter estimates for these items are utilized in the subsequent equating procedures. A strategy of field testing items together in the middle of the test seems to be tenable, although it seems to be somewhat dependent on the strategy used to order items in the final form. Another seemingly feasible strategy would involve field testing items in an interspersed fashion throughout the test and placing limits on how much the position can change when the item is used in a final form (e.g., 5 positions or less). Such rules could work well for content areas involving discrete items. However, when reading passages are involved, it is much more difficult to restrict test construction so that items are ordered in a prescribed way or the number of positions an item can shift is limited. One flawed strategy of obtaining field-test item parameter estimates, assuming such estimates are to be used in subsequent equating procedures, is to place all field-test items at the end of the test. Based on the prediction equations obtained in this study, as well as the literature on item context effects, this practice would almost certainly make the operational test appear more difficult than it truly was. As a result, such a practice could give the false appearance of a sustained improvement in student performance over time.

This study was limited in several ways. First, we looked at only one large K–12 testing program that utilizes a particular measurement model, equating procedures, test administration procedures (i.e., an untimed test), and test construction procedures. These program-specific factors limit the generalization of findings to other testing programs. In addition, the simulations conducted to investigate the effects of the relationships between item position change and item difficulty change assumed all changes followed the polynomial regression model, and that the only other component affecting item difficulty was estimation error. Although these assumptions seemed reasonable for the purposes of the study, undoubtedly other factors exist that impact this relationship that we did not consider in our simulations.

In large-scale, high-stakes testing, which has increased exponentially under the No Child Left Behind legislation, state testing programs should continually evaluate their practices to ensure that they result in reliable results and valid inferences. Testing programs sometimes utilize item statistics obtained from field testing to support the equating and scaling of operational tests. The results of this study illustrated that, in the context of one large-scale testing program, the effects of item position changes between field testing and final form use can differ depending on multiple factors, including the original field-test positions and the way that field-test items are sequenced once they are used in final forms. Future research is needed to compare different strategies for embedding field-test items (e.g., located together in the middle of the test, placed in two different locations, interspersed throughout the test, located at the end of the test), and should investigate how these strategies interact with practices for sequencing the items in operational test forms. Finally, additional empirical data are needed to


confirm or contradict the models developed in this study to predict item difficulty change as a function of item position change, particularly from testing programs with characteristics that differ from the testing program utilized in this study.

ACKNOWLEDGMENT

The authors thank the reviewers for their extremely helpful comments and suggestions toward the improvement of this article.

REFERENCES

Brennan, R. (1992). The context of context effects. Applied Measurement in Education, 5, 225–264.

Davis, J., & Ferdous, A. (2005). Using item difficulty and item position to measure test fatigue. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Quebec, Canada.

Eignor, D. R., & Stocking, M. (1986). An investigation of possible causes for the inadequacy of IRT pre-equating (ETS RR-86-14). Princeton, NJ: Educational Testing Service.

Haertel, E. (2004). The behavior of linking items in test equating (CSE Report 630). CRESST.

Harris, D. (1991). Effects of passage and item scrambling on equating relationships. Applied Psychological Measurement, 15(3), 247–256.

Kingston, N. M., & Dorans, N. J. (1984). Item location effects and their implications for IRT equating and adaptive testing. Applied Psychological Measurement, 8, 147–154.

Klein, S., & Bolus, R. (1983). The effect of item sequence on bar examination scores. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada.

Kolen, M., & Harris, D. (1990). Comparison of item pre-equating and random groups equating using IRT and equipercentile methods. Journal of Educational Measurement, 27(1), 27–29.

Leary, L., & Dorans, N. (1982). The effects of item rearrangement on test performance: A review of the literature (ETS Research Report RR-82-30). Princeton, NJ: Educational Testing Service.

Leary, L., & Dorans, N. (1985). Implications for altering the context in which test items appear: A historical perspective on an immediate concern. Review of Educational Research, 55, 387–413.

Miller, G. E., Rotou, O., & Twing, J. S. (2004). Evaluation of the 0.3 logits criterion in common item equating. Journal of Applied Measurement, 5(2), 172–177.

PISA. (2000). PISA 2000 technical report: Magnitude of booklet effects. Paris, France: OECD Programme for International Student Assessment (PISA).

Pommerich, M., & Harris, D. J. (2003). Context effects in pretesting: Impact on item statistics and examinee scores. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Rubin, L., & Mott, D. (1984). The effect of the position of an item within a test on the item difficulty value. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Way, W. D., Carey, P., & Golub-Smith, M. (1992). An exploratory study of characteristics related to IRT item parameter invariance with the Test of English as a Foreign Language (TOEFL Technical Report No. 6). Princeton, NJ: Educational Testing Service.

Whitely, E., & Dawis, R. (1976). The influence of test context on item difficulty. Educational and Psychological Measurement, 36, 329–337.

Wise, L., Chia, W., & Park, R. (1989). Item position effects for tests of word knowledge and arithmetic reasoning. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Wright, B. D., & Douglas, G. A. (1975). Best test design and self-tailored testing (Research Memorandum No. 19). Chicago, IL: Statistical Laboratory, Department of Education, University of Chicago.

Yen, W. M. (1980). The extent, causes and importance of context effects on item parameters for two latent trait models. Journal of Educational Measurement, 17, 297–311.

Zwick, R. (1991). Effects of item order and context on estimation of NAEP reading proficiency. Educational Measurement: Issues and Practice, 10, 10–16.
