THE EFFECT OF THE POSITION OF AN ITEM WITHIN
A TEST ON ITEM RESPONDING BEHAVIOR:
AN ANALYSIS BASED ON ITEM RESPONSE THEORY
Neal M. Kingston
and
Neil J. Dorans
GRE Board Professional Report GREB No. 79-12bP
ETS Research Report 82-22
June 1982
This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.
THE EFFECT OF THE POSITION OF AN ITEM WITHIN A TEST ON ITEM RESPONDING BEHAVIOR:
AN ANALYSIS BASED ON ITEM RESPONSE THEORY
Neal M. Kingston
and
Neil J. Dorans
GRE Board Professional Report GREB No. 79-12bP
May 1982
Copyright © 1982 by Educational Testing Service. All rights reserved.
ABSTRACT
The research described in this paper deals solely with the effect of the position of an item within a test on examinees' responding behavior at the item level. For simplicity's sake, this effect will be referred to as a practice effect when the result is improved examinee performance and as a fatigue effect when the result is poorer examinee performance. Item response theory item statistics were used to assess position effects because, unlike traditional item statistics, they are sample invariant. In addition, the use of item response theory statistics allows one to make a reasonable adjustment for speededness, which is important when, as in this research, the same item administered in different positions is likely to be affected differently by speededness, depending upon its location in the test.
Five types of analyses were performed as part of this research. The first three types involved analyses of differences between the two estimates of the item difficulty (b), item discrimination (a), and pseudoguessing (c) parameters. The fourth type was an analysis of the differences between equatings based on items calibrated when administered in the operational section and equatings based on items calibrated when administered in section V. Finally, an analysis of the regression of the difference between b's on item position within the operational section was conducted. The analysis of estimated item difficulty parameters showed a strong practice effect for analysis of explanations and logical diagrams items and a moderate fatigue effect for reading comprehension items. Analysis of the other estimated item parameters, a and c, produced no consistent results for the two test forms analyzed.
Analysis of the difference between equatings for Form 3CGR1 reflected the differences between estimated b's found for the verbal, quantitative, and analytical item types. A large practice effect was evident for the analytical section, a small practice effect, probably due to capitalization on chance, was found for the quantitative section, and no effect was found for the verbal section.
Analysis of the regression of the difference between b's on item position within the operational section for analysis of explanations items showed a rather consistent relationship for Form ZGR1 and a weaker but still definite relationship for Form 3CGR1.
The results of this research strongly suggest one particularly important implication for equating. If an item type exhibits a within-test context effect, any equating method, e.g., IRT based equating, that uses item data either directly or as part of an equating section score should provide for administration of the items in the same position in the old and new forms. Although a within-test context effect might have a negligible influence on a single equating, a chain of such equatings might drift because of the systematic bias.
TABLE OF CONTENTS

INTRODUCTION
    Traditional Analysis of Practice Effects
    Item Response Theory
    Potential Advantages of Using IRT Item Statistics to Investigate the Effects of Item Position on Item Responding Behavior
RESEARCH DESIGN
    Test Forms
    Item Calibration Procedures
    IRT Linking Procedure
THE EFFECT OF ITEM POSITION ON ITEM RESPONDING BEHAVIOR
    Verbal Item Types
    Quantitative Item Types
    Analytical Item Types
SUMMARY AND IMPLICATIONS
REFERENCES
INTRODUCTION
When test forms are equated one wants to minimize the error variance of the process. Generally, one attempts to do so by choosing an appropriate data collection design. The standard errors of equating associated with linear methods for various data collection designs are well known (Angoff, 1971; Lord, 1950; Lord & Stocking, 1973). Designs that yield the smallest standard errors of equating for a given sample size are those in which the examinees taking both the old and the new forms take some common items. This can be accomplished in several ways. Each test may have a set of items, which may or may not count toward an examinee's score, identical to a set in the other test or, to carry the common item idea to its extreme, each examinee can take both forms of the test.
It has long been assumed (Lord (1950) is the earliest reference the authors have found, but the idea is probably older) that if all examinees took form 1 followed by form 2, the equating might be biased by an order effect. To overcome this effect, a counterbalanced administration of forms has been used: A random half of the examinees takes form 1 followed by form 2; the other half takes form 2 followed by form 1. The order effect could then be estimated and accounted for in the equating process (Lord, 1950; Angoff, 1971). This estimation procedure assumes that order effect is proportional to the standard deviation of scores on each test form and that there is no form by order interaction. We are aware of no empirical evidence supporting these assumptions.
It is usually difficult to get a sufficient number of examinees willing to take two full-length tests. In addition, new legislative test disclosure requirements are producing new constraints on the collection of data for equating. One relatively new equating method, item response theory (IRT) based true score equating (Lord, 1980), has been the subject of considerable interest (for example, Cowell, 1981; Kingston & Dorans, 1982; Petersen, Cook, & Stocking, 1981; Scheuneman & Kay, 1981). Of particular interest is the use of a data collection scheme known as precalibration, which would allow the collection of item statistics (and thus the equating of the test form) before the test form is operationally administered. The appropriateness of IRT equating based on precalibration requires either that the position of an item within a test have no effect on examinees' item responding behavior or that items be calibrated and operationally administered in the same position within old and new forms. The latter solution, same positioning of items in old and new forms, assumes that form-specific context variance is negligible. Even if this is so, the administrative complexities of such a solution make it less than appealing.
The research described in this paper deals solely with the effect of the position of an item within a test on examinees' responding behavior at the item level. For simplicity's sake, this effect will be referred to as a practice effect when the result is improved examinee performance and as a fatigue effect when the result is poorer examinee performance. It is realized that this simplistic labeling in no way fully describes these effects and that there might be other explanations for these results. The interested reader is referred to Greene (1941) or Wing (1980) for further discussion of the underlying psychology of these effects.
Traditional Analysis of Practice Effects
Traditional item level analyses have focused on p, the proportion of examinees responding to an item correctly, or a normalized version of p such as delta. Attempts have been made to adjust these statistics by taking into account only the examinees who had sufficient time to respond to the item. There is, however, no demonstrably correct way based on classical test theory to make this adjustment.
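For concreteness, the delta index mentioned above can be sketched in a few lines. The conventional ETS scaling of 13 + 4z is background knowledge assumed here, not stated in this report:

```python
from statistics import NormalDist

def delta(p):
    """ETS delta index: a normalized transformation of proportion
    correct p, computed as 13 + 4 * z, where z is the normal deviate
    that cuts off the upper proportion p of the normal curve.
    Higher delta corresponds to a harder item."""
    z = NormalDist().inv_cdf(1.0 - p)
    return 13.0 + 4.0 * z
```

An item answered correctly by half the examinees sits at the scale midpoint of 13; a harder item (smaller p) receives a larger delta.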
Lord (1977) has pointed out that p, the proportion of examinees responding correctly to an item, is not a true measure of item difficulty. If two items are administered to two groups with different distributions of ability, the p of item 1 might be larger than that of item 2 for the first group but smaller for the second group. Thus, neither proportion correct nor normalized proportion correct (e.g., delta) is a particularly good statistic to analyze in order to ascertain the effect of the position of an item within a test on item responding behavior. In addition, a within-test context effect might well affect the discrimination of an item as well as its difficulty. Classical measures of item discrimination (biserial or point-biserial correlation between item score and total test score) are both difficult to estimate accurately and confounded with item difficulty (Kingston & White, 1980; Lord & Novick, 1968).
Item Response Theory
Item response theory provides a mathematical expression for the probability of success on an item as a function of a single characteristic of the individual answering the item, his or her ability, and multiple characteristics of the item. This mathematical expression is called an item response function. A reasonable mathematical form for the item response function of a multiple-choice item (both on psychometric grounds and for reasons of tractability) is the three-parameter logistic model,
(1)    P_g(theta) = c_g + (1 - c_g) / (1 + e^(-1.7 a_g (theta - b_g))),

where

P_g(theta) is the probability that an examinee with ability theta answers item g correctly,

e is the base of the natural logarithm, approximately 2.7183,

a_g is a measure of item discrimination for item g,

b_g is a measure of item difficulty for item g, and

c_g is the lower asymptote of the item response curve, equal to the probability of very low ability examinees answering item g correctly.

In equation (1), theta is the ability parameter, a characteristic of the examinee, and a_g, b_g, and c_g are item parameters that determine the shape of the item response function (see Figure 1).
Figure 1

[Item response function for an illustrative item with b = 0 and c = .2: probability of a correct response plotted against ability.]
One of the major assumptions of IRT embodied in equation (1) is that the set of items under study is unidimensional, i.e., the probability of successful response by examinees to a set of items can be modelled with only one ability parameter, theta. The second major assumption is that performance on an item can be adequately described by the three-parameter logistic model. Previous research has shown that, to some extent, GRE items violate these assumptions but, for the most part, IRT based methods seem robust to these violations (Kingston & Dorans, 1982).
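The logistic form in equation (1) is easy to compute directly; a minimal sketch follows (the parameter values shown are illustrative, not estimates from this study):

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic model of equation (1): probability that
    an examinee with ability theta answers an item correctly, given the
    item's discrimination a, difficulty b, and lower asymptote c."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# For the item pictured in Figure 1 (b = 0, c = .2) and an illustrative
# a = 1.0, an examinee at theta = 0 succeeds with probability
# c + (1 - c)/2 = .6; the probability approaches c for very low ability
# and 1 for very high ability.
```

The constant 1.7 scales the logistic function to closely approximate the normal ogive.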
Potential Advantages of Using IRT Item Statistics to Investigate the Effects of Item Position on Item Responding Behavior
IRT item statistics have several advantages over classical item statistics when one is investigating practice and fatigue effects. To the extent that model assumptions are correct, IRT item statistics are sample invariant. The IRT ability metric provides interval level data. The IRT item discrimination statistic is sample invariant, not confounded with item difficulty, and is more accurately estimated than classical item discrimination indices (Kingston & White, 1980). In addition, the use of IRT statistics allows one to make a reasonable adjustment for speededness (see the section on item calibration procedures). This is important when, as in this research, the items administered in one position in a test are likely to be affected differently by speededness than the same items administered in a different position.
RESEARCH DESIGN
Test Forms
Two operational forms of the GRE Aptitude Test were used in this study: ZGR1 and 3CGR1. Form ZGR1 is composed of four separately timed operational sections:

                                     Timing in   Number of
Section                              Minutes     Items
I    Verbal                          50          80
       analogies                                 17
       antonyms                                  18
       sentence completion                       20
       reading comprehension                     25
II   Quantitative                    50          55
       quantitative comparisons                  30
       data interpretation                       10
       regular mathematics                       15
III  Analytical                      25          40
       analysis of explanations                  40
IV   Analytical                      25          30
       logical diagrams                          15
       analytical reasoning                      15

Form 3CGR1 is also composed of four separately timed operational sections:

                                     Timing in   Number of
Section                              Minutes     Items
I    Verbal                          50          75
       analogies                                 18
       antonyms                                  22
       sentence completion                       13
       reading comprehension                     22
II   Quantitative                    50          55
       quantitative comparisons                  30
       data interpretation                       10
       regular mathematics                       15
III  Analytical                      25          36
       analysis of explanations                  36
IV   Analytical                      25          30
       logical diagrams                          15
       analytical reasoning                      15
Examples of the various GRE Aptitude Test item types can be found in Conrad, Trismen, and Miller (1977).
Form 3CGR1 was administered with six different 25-minute fifth sections. The items in these fifth sections were not experimental pretest items. Instead they were items taken from the four operational sections of Form ZGR1. Table 1 lists the six fifth sections of 3CGR1, the number of items in each section, and the section of ZGR1 from which they were drawn.
Form ZGR1 was administered with six section V's at the same administration at which Form 3CGR1 was administered with the six section V's listed in Table 1. Table 2 lists the six fifth sections of Form ZGR1, indicating the number of items in each section and the section of 3CGR1 from which they were drawn.
Inspection of Tables 1 and 2 reveals that each operational item from Form ZGR1 appears in one of the six section V's of Form 3CGR1 and each operational item from 3CGR1 appears in one of the six "C-subforms" of Form ZGR1. This commonality of items was used to study position effects.
Table 1

Six Section V's for Form 3CGR1

Designation   Item Type      Number of Items   Location in ZGR1
C41           Verbal               39          Section I
C42           Verbal               41          Section I
C43           Quantitative         27          Section II
C44           Quantitative         28          Section II
C45           Analytical           40          Section III
C46           Analytical           30          Section IV

Table 2

Six Section V's for Form ZGR1

Designation   Item Type      Number of Items   Location in 3CGR1
C47           Verbal               37          Section I
C48           Verbal               38          Section I
C49           Quantitative         27          Section II
C50           Quantitative         28          Section II
C51           Analytical           36          Section III
C52           Analytical           30          Section IV
Table 3

Description of Samples Used in this Research

                               Formula Score Means (M) and Standard Deviations (S)
Form and            Sample        Pretest             Operational*
Pretest Section     Size        M        S          M        S
ZGR1C47             2483      13.23     8.01      31.61    15.86
ZGR1C48             2486      14.62     8.10      31.53    16.30
ZGR1C49             2898      11.94     6.43      24.46    10.47
ZGR1C50             2484      12.88     5.93      24.26    10.34
ZGR1C51             2488      18.73     9.13      32.89    15.21
ZGR1C52             2482      14.14     6.98      32.69    15.66
3CGR1C41            1489      15.54     8.42      30.17    15.38
3CGR1C42            1495      15.91     8.80      30.43    15.55
3CGR1C43            1487      11.65     5.59      24.94    11.51
3CGR1C44            1497      12.27     5.43      24.41    11.74
3CGR1C45            1526      24.26    11.75      28.86    15.41
3CGR1C46            1476      15.92     7.19      28.52    14.87

* Operational formula raw scores are for the operational score (V, Q, or A) corresponding to the pretest section listed in the first column. When comparing these statistics, keep in mind that the number of items going into each score is not constant. Tables 1 and 2 contain the number of items in each of the pretest sections. The tables embedded in the text above contain the number of operational items for each section of Forms ZGR1 and 3CGR1.
Item Calibration Procedures
A total of 10 different item types were administered within each form. Parameter estimates were based on the set of all verbal items (analogies, antonyms, sentence completion, and reading comprehension), all quantitative items (quantitative comparisons, data interpretation, and regular mathematics), or all analytical items (analysis of explanations, logical diagrams, and analytical reasoning).
All item parameter estimates and ability estimates were obtained with the program LOGIST (Wood, Wingersky, & Lord, 1978). The function of LOGIST is to estimate, for each item, the three item parameters of the three-parameter logistic model: a (discrimination), b (difficulty), and c (pseudoguessing); and, for each examinee, theta (ability). The following constraints were imposed on the estimation process: a was restricted to values between 0.01 and 1.50 inclusive, except for analytical item calibrations, where the upper bound was 1.20; the lower limit for estimated theta was -7; and c was restricted to values between 0.0 and 0.5. We also required each examinee to have responded to at least 20 items in order to insure stable ability estimates. Choosing appropriate constraints is a complex procedure, but necessary to speed convergence and produce stable estimates.
Since IRT based item parameters are sample invariant and the IRT ability parameter is item invariant, IRT parameter estimation allows a reasonable correction for speededness. If one assumes that an examinee answers test questions in a sequential progression, one can consider all contiguous unanswered items at the end of a test as not reached and ignore them in the ability estimation process. Hence, the item responses (actually, lack of responses) of all examinees whose item responses are coded "not reached" are ignored in the item parameter estimation process. Use of this coding convention will minimize any differences in item calibrations due to a differential speededness in the two administrations of each item set. Each item was calibrated twice, once as an operational item and once when it appeared in section V.
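The not-reached convention can be made concrete with a small sketch (the coding symbols here are hypothetical; LOGIST's actual input format differs):

```python
def code_not_reached(responses):
    """Recode trailing omits as not reached.  Each response is 1
    (correct), 0 (incorrect), or 'O' (omitted).  Assuming examinees
    answer sequentially, all contiguous unanswered items at the end of
    the test are coded 'NR' and ignored during item calibration;
    omits embedded earlier in the test are left as omits."""
    coded = list(responses)
    end = len(coded)
    while end > 0 and coded[end - 1] == 'O':
        end -= 1                      # back up over the trailing omits
    for j in range(end, len(coded)):
        coded[j] = 'NR'               # not reached: excluded from calibration
    return coded

# A speeded examinee who stopped after item 4:
# code_not_reached([1, 0, 'O', 1, 'O', 'O']) -> [1, 0, 'O', 1, 'NR', 'NR']
```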
IRT Linking Procedure
Spiralling* of test forms at the June 1980 administration of the GRE Aptitude Test was used to link parameter estimates on Form 3CGR1 to parameter estimates on the base form, Form ZGR1. Linking by spiralling assumes that alternating forms administered to examinees results in a random assignment of examinees to forms. Since large equivalent groups take each form, the distributions of ability in the two groups should be the same, and separate parameterizations based on these two random groups via separate LOGIST runs should produce a single ability metric.
*Spiralling is a term used to describe a test administration practice in which test books are packaged in spiralled order, i.e., alternating Form A with Form B, such that half the examinees at any testing center take Form A while the other half take Form B.
THE EFFECT OF ITEM POSITION ON ITEM RESPONDING BEHAVIOR
Verbal Item Types
Tables 5, 6, and 7 summarize the effect of item position on IRT difficulty, discrimination, and pseudoguessing parameter estimates, respectively, for the verbal item types of Form ZGR1. Tables 8, 9, and 10 do likewise for Form 3CGR1. Each table contains means and standard deviations of the appropriate estimated item parameter for the five item types (all verbal, analogies, antonyms, sentence completion, and reading comprehension), as well as mean differences, standard deviations of the differences, and their associated dependent sample t statistics. In these tables and in Tables 11-16, significant mean differences at the .01 level between item parameters estimated in their operational and nonoperational locations are marked by double asterisks, while a single asterisk denotes a significant mean difference at the .05 level. For all t-tests, the number of degrees of freedom (df) is one less than the number of items of each type.
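The dependent sample t statistics in these tables follow directly from the per-item differences; a minimal sketch of the computation (df = n - 1, as noted above):

```python
import math

def paired_t(operational_bs, section_v_bs):
    """Dependent (paired) sample t statistic for the mean difference
    between parameter estimates of the same items calibrated in two
    positions: t = mean(d) / (sd(d) / sqrt(n)), with n - 1 df."""
    diffs = [o - v for o, v in zip(operational_bs, section_v_bs)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # unbiased variance
    return mean, mean / math.sqrt(var / n)
```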
Tables 5 and 8 show no strong consistent evidence of a practice or fatigue effect. The reading comprehension items, however, exhibit a consistent (though statistically significant only for Form ZGR1) moderate fatigue effect. The mean difference between b's for Form ZGR1 was -.14 (t = -2.72, df = 24, significant at the .05 level) and for Form 3CGR1 was -.13 (t = -1.76, df = 21).
Tables 6 and 9 show three of the 10 mean differences between estimated a's to be statistically significant at the .05 level. Since none of these findings is consistent across the two forms, these results indicate either a highly complex relationship between within-test item context and item discrimination or, more likely, a chance result.
Tables 7 and 10 do not indicate any relationship between within-test item position and estimated c. In interpreting these results, one must consider the difficulties in estimating c and the artificially constrained variance of the estimated c's (Kingston & Dorans, 1982; Wood, Wingersky, & Lord, 1978).
As stated previously, the primary goal in this research is to assess the effect of within-test context on IRT true score equating. To estimate this effect, Form 3CGR1 was equated to Form ZGR1 twice, once based on the item parameters for Form 3CGR1 estimated when the items appeared in their operational positions, and once based on the item parameters estimated when they appeared in section V of Form ZGR1. In the first case, there was no context effect, but the estimated parameters were linked to scale by a relatively weak procedure, dependence on group equivalence. In the latter case, there might be a context effect, but the linking of the metrics was by a relatively strong procedure, common items within a single LOGIST run. The operational position calibrations, relative to the section V calibrations, have a greater opportunity for error variance but
Table 5

IRT Item Difficulty (b) Parameter Estimates for Operational Verbal Items from Form ZGR1

              ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Verbal        .3136    1.1223     .3466    1.0337    -.0330    .2574     -1.15
Analogy       .6597    1.3122     .5849    1.1941     .0748    .1774      1.79
Antonyms      .7217    1.2481     .6760    1.1689     .0457    .3023       .68
Sent. Comp.  -.0148     .8786     .0654     .8836    -.0802    .2050     -1.61
Read. Comp.  -.0387     .7764     .1029     .7267    -.1415    .2603     -2.72*

Table 6

IRT Item Discrimination (a) Parameter Estimates for Operational Verbal Items from Form ZGR1

              ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Verbal        .8722     .2764     .8886     .2858    -.0164    .1486      -.99
Analogy       .8413     .2572     .9087     .2206    -.0674    .1220     -2.34*
Antonyms     1.0482     .3344    1.0242     .3755     .0240    .2046       .52
Sent. Comp.   .7831     .2535     .7936     .2740    -.0106    .1312      -.33
Read. Comp.   .8143     .1714     .8302     .1949    -.0159    .1191      -.67

Table 7

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Verbal Items from Form ZGR1

              ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Verbal        .1778     .0545     .1784     .0604    -.0006    .0504      -.11
Analogy       .1673     .0386     .1727     .0458    -.0054    .0212     -1.08
Antonyms      .2219     .0707     .2149     .0874     .0070    .0828       .38
Sent. Comp.   .1526     .0301     .1571     .0313    -.0045    .0158     -1.17
Read. Comp.   .1672     .0405     .1677     .0433    -.0005    .0487      -.05

*p < .05
Table 8

IRT Item Difficulty (b) Parameter Estimates for Operational Verbal Items from Form 3CGR1

              ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Verbal        .2534    1.1739     .2630    1.1371     .0095    .2817       .29
Analogy       .4535    1.3307     .4819    1.1823     .0284    .2350       .51
Antonyms      .1140    1.0975     .2342    1.0663     .1202    .2527      2.23*
Sent. Comp.  -.0369    1.1734    -.0003    1.1339     .0365    .1464       .90
Read. Comp.   .4007    1.0524     .2682    1.1342    -.1325    .3523     -1.76

Table 9

IRT Item Discrimination (a) Parameter Estimates for Operational Verbal Items from Form 3CGR1

              ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Verbal        .8926     .3048     .9027     .3153     .0101    .1522       .57
Analogy       .8518     .2750     .9022     .3129     .0504    .1634      1.31
Antonyms     1.0693     .3699    1.1406     .3075     .0714    .1267      2.65*
Sent. Comp.   .8662     .1829     .8722     .2213     .0060    .1088       .20
Read. Comp.   .7649     .2211     .6830     .1729    -.0819    .1521     -2.53*

Table 10

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Verbal Items from Form 3CGR1

              ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Verbal        .1682     .0460     .1729     .0441     .0047    .0372      1.09
Analogy       .1602     .0365     .1660     .0370     .0057    .0265       .91
Antonyms      .1862     .0498     .1946     .0473     .0084    .0408       .97
Sent. Comp.   .1577     .0460     .1697     .0520     .0120    .0265      1.63
Read. Comp.   .1629     .0440     .1588     .0309    -.0041    .0461      -.42

*p < .05
Figure 2

[Effect of Item Position on the Equating of the Verbal Section of Form 3CGR1: difference between converted scores from the two equatings plotted against formula score (0 to 80), with the vertical axis running from -20 to 20.]
lesser opportunity for bias entering the system during the linking of the metrics. It is assumed, however, that error variance, based on the sample sizes used in this research, is small in relation to bias.
Figure 2 shows the differences between the two equatings. At each formula raw score, the converted score based on the equating that used section V calibrations was subtracted from the converted score based on the equating that used the operational section calibrations. A horizontal line at zero, what one would expect if there were no context effect and no error of equating, is drawn as a no-effect reference. The results, despite the earlier finding of a moderate fatigue effect for reading comprehension items, lend no support to the hypothesis of an overall verbal context effect.
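The difference curve in Figure 2 is constructed pointwise; a minimal sketch (the conversion tables here are hypothetical placeholders for the two equatings):

```python
def equating_difference(conv_operational, conv_section_v):
    """At each formula raw score, subtract the converted score from the
    equating based on section V calibrations from the converted score
    based on operational section calibrations.  Inputs are lists
    indexed by formula raw score; an all-zero result is what one would
    expect with no context effect and no error of equating."""
    return [op - sv for op, sv in zip(conv_operational, conv_section_v)]
```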
Quantitative Item Types
Tables 11, 12, and 13 summarize the effect of item position on IRT difficulty, discrimination, and pseudoguessing parameter estimates for the quantitative item types (all quantitative, quantitative comparisons, data interpretation, regular mathematics) from Form ZGR1, using the same format as the similar verbal tables. Tables 14, 15, and 16 present similar summaries for Form 3CGR1.
Tables 11 and 14 do not provide evidence of any strong consistent practice or fatigue effect. The results for data interpretation items, however, were peculiar and demanded further scrutiny. Form ZGR1 showed a strong, highly significant fatigue effect (mean difference = -.20, t = -4.31 with 9 df), but Form 3CGR1 showed the largest practice effect found in this study (mean difference = .39), although this result was not statistically significant (t = 1.10 with 9 df). The large standard deviation of the differences (1.11) helps explain this.
Further investigation showed this large standard deviation was due to a single item that had a b of 1.66 when administered operationally, but a b of -1.75 when administered in section V, a difference of 3.41. Analysis of the actual proportion of examinees getting the item correct when grouped by estimated theta proved illuminating. Just as very different regression lines may result from two samples from a population with an underlying correlation of zero, very different item response functions resulted from the three-parameter logistic regressions for this particular item. It should be noted that this extreme difference between b's is highly unusual and that the median absolute difference for all quantitative items was about .15.
The discrepancy between the mean difference between b's for the data interpretation items in forms ZGR1 and 3CGR1 is much less extreme when this questionable item is not included in the analysis. The mean difference between b's for the data interpretation items of .39 in Table 14 would only be a mean difference of .05. Similarly, the mean difference between b's for all quantitative items in form 3CGR1 is artificially inflated by
Table 11

IRT Item Difficulty (b) Parameter Estimates for Operational Quantitative Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Quantitative    .0051    1.5040     .0154    1.5301    -.0103    .3123      -.24
Reg. Math       .6870    1.3936     .7317    1.3510    -.0447    .1159     -1.11
Data Int.     -1.0795    1.0441    -.8789     .9985    -.2007    .1472     -4.31**
Quant. Comp.    .0257    1.4791    -.0447    1.5877     .0704    .3786      1.02

Table 12

IRT Item Discrimination (a) Parameter Estimates for Operational Quantitative Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Quantitative    .8560     .3947     .8484     .3950     .0076    .1545       .36
Reg. Math       .9308     .4135     .9430     .4296    -.0122    .1213      -.39
Data Int.       .5894     .2305     .6283     .2552    -.0389    .0706     -1.74
Quant. Comp.    .9075     .3915     .8745     .3883     .0331    .1849       .98

Table 13

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Quantitative Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Quantitative    .1826     .0733     .1667     .0771     .0159    .0315      3.74**
Reg. Math       .1780     .0894     .1602     .0979     .0178    .0241      2.86*
Data Int.       .1376     .02_2     .1303     .0249     .0074    .0357       .66
Quant. Comp.    .1999     .0686     .1821     .0721     .0178    .0338      2.88**

**p < .01
*p < .05
Table 14

IRT Item Difficulty (b) Parameter Estimates for Operational Quantitative Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Quantitative   -.2571    1.4435    -.1175    1.3656     .1396    .5403      1.92
Reg. Math       .2431     .9190     .3352     .9945     .0921    .2436      1.46
Data Int.      -.6966    1.9727    -.3104    1.9248     .3861    1.1055     1.10
Quant. Comp.   -.3607    1.3786    -.2796    1.2443     .0811    .3315      1.34

Table 15

IRT Item Discrimination (a) Parameter Estimates for Operational Quantitative Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Quantitative    .8376     .3231     .8825     .3284     .0449    .1735      1.92
Reg. Math       .8792     .2605     .9690     .2895     .0898    .1728      2.01
Data Int.       .4801     .1879     .5012     .1681     .0211    .2074       .32
Quant. Comp.    .9360     .3042     .9663     .2952     .0304    .1638      1.02

Table 16

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Quantitative Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean      S.D.      Mean      S.D.      Mean      S.D.       t
Quantitative    .1693     .0591     .1655     .0770    -.0039    .0700      -.41
Reg. Math       .1288     .0500     .1415     .0740     .0128    .0498      1.00
Data Int.       .16.4     .0708     .1522     .1172    -.0123    .1456      -.27
Quant. Comp.    .1912     .0466     .1818     .0543    -.0094    .0340     -1.51
the inclusion of this item. The mean difference of .14 given in Table 14 would have been only .08 without the questionable item.
The mean differences between b's for the data interpretation items in the two forms are quite discrepant even when the peculiar item is removed (-.20 versus .05). This discrepancy might be due to a lack of local independence for the data interpretation items. This dependence has been demonstrated empirically (Dorans & Kingston, 1982) but also follows intuition. Data interpretation items come in sets that refer to a single graphical display. It is possible that in general there is a fatigue effect for sets of data interpretation items but for idiosyncratic sets there is either no effect or a practice effect.
Tables 12 and 15 show that none of the eight mean differences between estimated a's is statistically significant at the .05 level.
Table 13 indicates a statistically significant effect on estimated c for all quantitative, regular mathematics, and quantitative comparison items for Form ZGR1. These findings were not replicated for Form 3CGR1, as can be seen in Table 16. The problems in interpreting the results of the analysis of estimated c's, mentioned earlier, apply to this analysis.
Figure 3 compares the two quantitative equatings of Form 3CGR1 to Form ZGR1. As with Figure 2, the converted score resulting from the equating based on the section V estimation of item parameters is subtracted from the converted score resulting from the equating based on the operational section estimation of item parameters.
Although the mean difference in b's indicated in Table 14, .14, was not statistically significant, this result is reflected in the difference between the two equatings.
Analytical Item Types
Tables 17 through 22 summarize the effect of item position on IRT parameter estimates for the analytical item types from Form ZGRl and Form 3CGRl. The format of these tables is the same as for the previous verbal and quantitative tables.
Tables 17 and 20 show a large, consistent, statistically significant difference between item difficulty estimates based on operational section and section V administration of the same items for analysis of explanations and logical diagrams items, and consequently for analytical items as a whole. Analysis of explanations items are considerably easier (mean difference between b's of .25 for Form ZGR1 and .30 for Form 3CGR1) when administered after examinees have already answered an operational section containing this item type. This might be due both to a pervasive relative unfamiliarity with this item type within the GRE test-taker population and to the complexity of the directions. It is likely that initially most examinees frequently refer back to the directions for this
Table 17

IRT Item Difficulty (b) Parameter Estimates for Operational Analytical Items from Form ZGR1

                 ZGR1 Operational     3CGR1 Section V      ZGR1-3CGR1 Difference
Item Type        Mean      S.D.       Mean      S.D.       Mean      S.D.      t
Analytical       -.0467    1.0853     -.2322    1.0920     .1855     .2605     5.96**
Anal. of Exp.    .1514     1.0619     -.0959    1.1094     .2473     .2378     6.58**
Log. Diag.       -.2729    .9009      -.4950    .8348      .2221     .2609     3.30**
Anal. Reas.      -.3486    1.1963     -.3327    1.2069     -.0159    .2305     -.27
Table 18

IRT Item Discrimination (a) Parameter Estimates for Operational Analytical Items from Form ZGR1

                 ZGR1 Operational     3CGR1 Section V      ZGR1-3CGR1 Difference
Item Type        Mean      S.D.       Mean      S.D.       Mean      S.D.      t
Analytical       .7683     .2595      .7287     .2666      .0396     .1460     2.27*
Anal. of Exp.    .7385     .2593      .6919     .2716      .0466     .1512     1.95
Log. Diag.       .9806     .2176      .8930     .2254      .0876     .1191     2.85*
Anal. Reas.      .6356     .1504      .6626     .2217      -.0270    .1406     -.74
Table 19

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Analytical Items from Form ZGR1

                 ZGR1 Operational     3CGR1 Section V      ZGR1-3CGR1 Difference
Item Type        Mean      S.D.       Mean      S.D.       Mean      S.D.      t
Analytical       .1608     .0480      .1421     .0499      .0186     .0447     3.48**
Anal. of Exp.    .1602     .0392      .1452     .0360      .0150     .0408     2.33*
Log. Diag.       .1652     .0473      .1326     .0238      .0326     .0520     2.43*
Anal. Reas.      .1579     .0663      .1435     .0866      .0144     .0471     1.18
**p < .01
*p < .05
Table 20

IRT Item Difficulty (b) Parameter Estimates for Operational Analytical Items from Form 3CGR1

                 ZGR1 Section V       3CGR1 Operational    3CGR1-ZGR1 Difference
Item Type        Mean      S.D.       Mean      S.D.       Mean      S.D.      t
Analytical       -.1688    .9043      .0431     .8821      .2118     .2022     8.51**
Anal. of Exp.    -.2105    .8639      .0919     .8234      .3024     .1837     9.88**
Log. Diag.       -.0806    .5914      .0700     .6437      .1507     .1346     4.33**
Anal. Reas.      -.1566    1.2022     -.1010    1.1639     .0556     .1897     1.14

Table 21

IRT Item Discrimination (a) Parameter Estimates for Operational Analytical Items from Form 3CGR1

                 ZGR1 Section V       3CGR1 Operational    3CGR1-ZGR1 Difference
Item Type        Mean      S.D.       Mean      S.D.       Mean      S.D.      t
Analytical       .8221     .2439      .8033     .2344      -.0188    .1120     -1.36
Anal. of Exp.    .8716     .2016      .8733     .1841      .0017     .1235     .08
Log. Diag.       .9261     .2781      .8499     .2688      -.0762    .0677     -4.36**
Anal. Reas.      .5994     .1434      .5888     .1681      -.0105    .1044     -.39
Table 22

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Analytical Items from Form 3CGR1

                 ZGR1 Section V       3CGR1 Operational    3CGR1-ZGR1 Difference
Item Type        Mean      S.D.       Mean      S.D.       Mean      S.D.      t
Analytical       .1613     .0498      .1594     .0555      -.0019    .0250     -.62
Anal. of Exp.    .1609     .0507      .1580     .0547      -.0029    .0232     -.75
Log. Diag.       .1783     .0571      .1871     .0619      .0088     .0315     1.08
Anal. Reas.      .1454     .0304      .1354     .0349      -.0100    .0189     -2.05
**p < .01
item type. By the time an examinee has responded to some number of analysis of explanations items, perhaps 30 or more, the examinee might no longer have to refer back to the directions. Until the directions are internalized by the examinee, items will be more difficult than they would be otherwise. If this is so, it is reasonable to expect that items appearing early in the operational section will undergo a larger practice effect than those appearing late in the operational section. If the law of diminishing returns applies, the difference between b's will approach some value asymptotically.
To investigate this hypothesis, the regression of the difference between estimated b's on item position was plotted. In order to smooth the regression, item position was grouped; that is, the mean difference between b's for items 1 through 4 (based on the operational appearance of the items) was plotted against grouped item position 1, the mean difference between b's for items 5 through 8 was plotted against grouped item position 2, and so on. Figures 4 and 5 show these regressions for the operational items from Forms ZGR1 and 3CGR1, respectively.
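The grouping used to smooth these regressions can be sketched as follows; the per-item differences between b's below are hypothetical, and only the grouping scheme follows the text:

```python
def grouped_means(diffs, group_size=4):
    """Mean difference between b's within successive blocks of item positions,
    yielding one point per grouped item position."""
    return [sum(diffs[i:i + group_size]) / len(diffs[i:i + group_size])
            for i in range(0, len(diffs), group_size)]

# Hypothetical differences between b's, listed in operational item order
diffs = [0.48, 0.41, 0.44, 0.39, 0.33, 0.36, 0.28, 0.31, 0.22, 0.25, 0.18, 0.21]
print(grouped_means(diffs))
```

A declining sequence of grouped means is the pattern the report describes: a larger practice effect for items appearing early in the operational section.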
The relationship between practice effect and item position is shown clearly for Form ZGR1 and less clearly, but still evidently, for Form 3CGR1. The regression shows no clear sign of leveling off, perhaps because the operational section was not of sufficient length. Interpretation of this result might be confounded by at least two factors: (1) the possible existence of a similar effect within the section V administration of the items, and (2) the fact that item ordering within a section is not random. This result does, however, point out that data collection design has a potentially large impact on the results of a practice effect study.
Logical diagrams items also exhibit a strong practice effect: mean differences between b's of .22 for Form ZGR1 and .15 for Form 3CGR1. Analytical reasoning items showed no evidence of a practice or fatigue effect.
Tables 18 and 21 present the results for the differences between estimated a's. Although three of the eight tests yield statistically significant differences at either the .05 or .01 level, the only item type significant in both forms, logical diagrams, shows effects in opposite directions, so no consistent effect is evident.
Tables 19 and 22 present the results of the analysis of differences between estimated c's. Three of the four analyses produced statistically significant differences at either the .05 or .01 level for Form ZGR1, but none of these results was replicated for Form 3CGR1. The difficulties inherent in interpreting the results of the analysis of c's have been mentioned earlier in this report.
Figure 6 compares the two equatings of the analytical section of Form 3CGR1 to Form ZGR1, as Figures 2 and 3 did for the verbal and quantitative equatings. The results reflect the differences found in Table 20.
[Figure 4: Form ZGR1 operational items, item position versus mean difference between b's]
[Figure 5: Form 3CGR1 operational items, item position versus mean difference between b's]
[Figure 6: Effect of item position on the equating of the analytical section of Form 3CGR1 to Form ZGR1, difference between converted scores plotted against formula score]
SUMMARY AND IMPLICATIONS
Five types of analyses were performed as part of this research. The first three types involved analyses of differences between the two estimations of item difficulty (b), item discrimination (a), and pseudoguessing (c) parameters. The fourth type was an analysis of the differences between equatings based on items calibrated when administered in the operational section and equatings based on items calibrated when administered in section V. Finally, an analysis of the regression of the difference between b's on item position within the operational section was conducted. The analysis of estimated item difficulty parameters showed a strong practice effect for analysis of explanations and logical diagrams items² and a moderate fatigue effect for reading comprehension items. Analysis of the other estimated item parameters, a and c, produced no consistent results for the two test forms analyzed.
Analysis of the difference between equatings for Form 3CGR1 reflected the differences between estimated b's found for the verbal, quantitative, and analytical item types. A large practice effect was evident for the analytical section; a small practice effect, probably due to capitalization on chance, was found for the quantitative section; and no effect was found for the verbal section.
Analysis of the regression of the difference between b's on item position within the operational section for analysis of explanations items showed a rather consistent relationship for Form ZGR1 and a weaker but still definite relationship for Form 3CGR1.
The results of this research strongly suggest one particularly important implication for equating. If an item type exhibits a within-test context effect, any equating method (e.g., IRT-based equating) that uses item data either directly or as part of an equating section score should provide for administration of the items in the same position in the old and new forms. Although a within-test context effect might have a negligible influence on a single equating, a chain of such equatings might drift because of the systematic bias.
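The drift concern can be made concrete by composing a chain of linear equatings, each carrying a small systematic bias: the biases accumulate even though each link looks harmless. A sketch with hypothetical numbers:

```python
def compose(eqs):
    """Compose a chain of linear equatings y = a*x + b, applied oldest to newest.
    Returns the slope and intercept of the overall composed transformation."""
    A, B = 1.0, 0.0
    for a, b in eqs:
        A, B = a * A, a * B + b
    return A, B

# Each link is nearly the identity but carries a small systematic intercept bias
link = (1.0, 0.5)          # hypothetical: half a point of bias per equating
A, B = compose([link] * 10)
print(A, B)                # over ten links the half-point biases sum to 5 points
```

Under these assumed values, a bias far too small to matter in any single equating shifts the scale by five points after ten links, which is the sense in which a chain of equatings "might drift because of the systematic bias."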
Item responding behavior might be affected by other within-test variables besides item position. The difficulty (relative or absolute) of items preceding an item in question might influence responding behavior. The presence or absence of dissimilar item types also might affect responding behavior. Variables such as these might have confounded the results of this study. Further research should prove enlightening. Interested readers may want to consult Swinton, Wild, and Wallmark (1982) for further research on GRE item types.
²Because of these large practice effects, the GRE analytical section has been revised. Both analysis of explanations items and logical diagrams items have been dropped from the test.
REFERENCES
Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education, 1971.
Conrad, L., Trismen, D., & Miller, R. (Eds.) Graduate Record Examinations Technical Manual. Princeton, NJ: Educational Testing Service, 1977.
Cowell, W. R. Applicability of a simplified three-parameter logistic model for equating tests. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, April 14, 1981.
Dorans, N. J., & Kingston, N. M. Assessing the local independence assumption of item response theory in GRE item types and populations. Paper presented at the annual meeting of the Psychometric Society, Chapel Hill, NC, May 1981.
Greene, E. B. Measurement of human behavior. New York: The Odyssey Press, 1941.
Kingston, N. M., & Dorans, N. J. The feasibility of using item response theory as a psychometric model for the GRE Aptitude Test. GRE Board Professional Report 79-12bP, Princeton, NJ: Educational Testing Service, 1982.
Kingston, N. M., & White, E. B. Item response theory statistics, classical test and item statistics: What does it all mean when you sit down to construct a test? Paper presented at the annual meeting of the American Educational Research Association, Boston, April 8, 1980.
Lord, F. M. Notes on comparable scales for test scores. Research Bulletin 50-48. Princeton, NJ: Educational Testing Service, 1950.
Lord, F. M. A study of item bias, using item characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology. Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.
Lord, F. M. Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, 1980.
Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, MA: Addison-Wesley, 1968.
Lord, F. M., & Stocking, M. L. Autest: A program to perform automated hypothesis tests for nonstandard problems. Research Memorandum 73-7. Princeton, NJ: Educational Testing Service, 1973.
Petersen, N., Cook, L., & Stocking, M. L. IRT versus conventional equating methods: A comparative study of scale stability. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, April 14, 1981.
Scheuneman, J. D., & Kay, E. F. Homogeneity of ability and certification decisions. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, April 14, 1981.
Swinton, S., Wild, C. L., & Wallmark, M. Investigation of practice effects on item types in the Graduate Record Examinations Aptitude Test. GRE Board Professional Report 80-01bP, Princeton, NJ: Educational Testing Service, 1982.
Wing, H. Practice effects with traditional mental test items. Applied Psychological Measurement, 1980, 4, 141-155.
Wood, R. L., Wingersky, M. S., & Lord, F. M. LOGIST: A computer program for estimating examinee ability and item characteristic curve parameters. ETS Research Memorandum 76-6 (modified 1/78). Princeton, NJ: Educational Testing Service, 1978.