document resume - eric · document resume ed 235.197 tm 830 594 author eignor, daniel r.; cook,...
TRANSCRIPT
DOCUMENT RESUME
ED 235.197 TM 830 594
AUTHOR Eignor, Daniel R.; Cook, Linda L.TITLE An Investigation of the Feasibility of USing Item
Response Theory in the Pre-Equating of AptitudeTests.
INSTITUTION College Entrance Eiamination Board, New York, N.Y.
PUB DATE Apr 83NOTE 54p.; Paper presented at the Annual Meeting of the
American Educational Research Association (67th,
Montreal, Quebec, April 11-15, 1983). Small print insome figures.
PUB TYPE Speeches/Conference Papers (150) -- ReportsResearch /Technical (143)
EDRS PRICEDESCRIPTORS
IDENTIFIERS
ABSTRACTThe purpose of this study was to determine the extent
to which item parameters estimated on pretest data from the verbal
section of the Scholastic Aptitude Test (SAT) can be used forequating purposes in a situation where intact final form SAT testingdata have normally been used Items appearing in two final SAT - verbal
formt were calibrated almost completely_from pretest data An
elaborate linkage system was devised and_utilized to get parameterestimates for the items, contained in multiple pretests, on the sameScale. The final forms understudy were equated to two different oldforms; the equatings were then redone using item parameter estimates
based on the pretest data For each form, the item response theory
(IRT) equating based on pretest Statistics was then compared to theIRT equating based on intact final form_data and the linear equating
used operationally. The results varied considerably across forms,ranging from acceptable to marginally acceptable. The overall resultsof the pre-equating were deemed sufficiently promising that aninvestigation_of pre-equating two forms of SAT-mathematical will be
undertaken. (BW)
MF01/PC03 Plus Postage.Aptitude Tettt; *College Entrance Examinations;*Equated Scoresr-*Feasibility Studies; Goodness ofFit; *Latent Trait Theory*Pre Equating (Tests); Scholastic Aptitude Test
***********************************************************************Reproductions supplied by FDRS are the best that can be made
from the original document.***********************************************************************
N-
(
U.S. DEPARTMENT OF EDUCATIONNATIONAL INSTITUTE OF EOUCATION
EDUCATIONAL RESOURCES INFORMATIONCENTER (ERIC)
This oocumont has boon reproduced asreceived from the person or organizationorigineting it
I Minor chang rm des have be mae to improveregroduction quality.
Points of view or opinions stated in this documeet do not necessarily represent official NIEposmonoroacy.
c4-\
(NJ An Investigation of the Feasibility of Using Item Response
an _1,2,3Theory in the Pre-equating of Aptitude Tests
LLi
cm
Daniel R. EignorLinda L. Cook
Educational Testing Service
"PERMISSION TO REPR_O_DUCETRISMATERIAL HAS BEEN GRANTED BY
TO THE EDUCATIONAL RESOURCESINFORMATION CENTER (ERIC)...
1A paper presented at the annual meeting of A.. RA, Montreal, 1983.
2This study was supported by the College Board through Jbint Staff Research and
Development Committee funding.3-The authors would like to acknowledge the technical assistance of Martha Stoc king
and Nancy Wright and the programming assistance of Inge Stiebritz and Karen Carroll
in performing-this study.
An Investigation of the Feasibility -of Using Item Response
Theory in the Pre-equating of Aptitude Tests
Daniel R. EignorLinda L. Cook
Educational Testing Service
Introduction
The current thrust of research devoted to the applications of
item- response theory (IRT) has generated an active interest in the use
Of IRT methods in the solution of score equating problems (see Cook and
Eignor, 1983). Because of the Special properties of test data
characterized by IRT models, users are often able to'solve prOblemS not
amenable to traditional equa tang methods. For other situations. IRT
equating offers an alternative against whiCh to evaluate traditional
methods. In addition, a number of other important outcomes accrue fr om
the use of IRT Eor equating tests; among these are: 1) Improved
equating,, including better equating at the ends of the Scald where
important decisions are often made, 2) greater test security through
less dependence on items in common with a single old form easier
re-equating should items be deleted; and 4) the possible reduction of
bias or drift in equating introduced when traditional methods are used
over time in-certain situations, most notably when the equating samples
for the old and new forms are not random samples from the same
population.
While the abOVe listed Outcomes accrue as the result of the
application Of any IRT equating method; if the test forms to be equated
can be pre-equated using IRT:methods, a number of additional advantages
accrue. Pre-equating refers to the process of establishing conversions
from raw to scaled scores prior to the tidew_ the new test is administered
operationally,- The process depends on the adequate pretesting of a pool
of items from which the new test will be built, the calibratidil of these
items using .CRT methods and the utilization of a linking scheme to
place the IRT parameters from the pretested items on the same scale
Among the additional advantage§ offered by IRT pre-equating are: 1)
Since equating using IRT pre - equating methods is possible prior to the
actual administration of the test, new forms can be introduced at low
volume special administrations, a particular problem if traditional
methods are used; 2) since pre-equatitg permits linkages to many old
forms, it is the most likely of any equating method to yield acceptable
results ShoUld testing legislation Mandate the disclosure of pretest cr
equating items; 3) pre-equating would allow more time to do
_
reasonableness and quality control checks, Which are normally done iit.a
hu't-ried fashion due to score reporting deadlines; and 4) pre-equating
would actually permit a reduction in the usual score reporting cycle
While simultaneously allowing more time to do the equating itself. In
short; the listed advantages that can potentially accrue from the use of
IRT pre-equating build a strong case for investigation of the
feasibility of application of this method; In this report, the
applicability of IRT pre-equating to the:Scholastic Aptitude Test (SAT)
verbal section is considered.
-3-
Problem and Purpose
date; investigations of the feasibility of pre-equating using IRT
for tests developed and administered by Educational Testing Service for
the College Bbard have been done using data from the Test of Standard
Written English (TSWE) (Bejar and Wingersky, 1982). The Bejar and
WingerSky study (1982) indicated some discrepancies between pre-equating
results and the results;from traditional equating§ in situations where
tradi.tonal equating was a reasonable procedure; The calibration system
used for pre-equating TSWE was considerably different, howeVer, frOM any
system that could be devised for pre-equating the SAT. ThuS, although
the results of the TSWE pre-equating study were not altogether
promising, there is little reason to suggest that these results are
generalizable to Ore=eqUating the SAT; For this reason, it was dedthed
important to investigate the feasibility of pre - equating the SAT using
an appropriate calibration system, such as that deviSed for this study;
-The purpose of this study was to determine the extent to which item
parameters estimated on SAT verbal pretest data can be used for equating
purposes in a Sitdatibn where intact final for SAT testing data has .
normally been used.' The items that appear in any final SAT=Verbal form
come from multiple pretests and to the extent that the item parameter
estimates are sensitive to the context in which the item appears, vr to
sample differenceS, there may be differences between these parameter
estimates and parameter estimates generated using data from the actual
final form administration, resulting in a discrepancy between equating
based on pretest item parameter estimates and intact final form item
-4--
parameter estimates. More specifically; in the study, items appearing
in two final SAT-verbal forms, 3ASA3 and 3BSA3, were calibrated almost
completely from pretest data. (See section on IRT Calibration Design
and Linkage System.) An elaborate linkage system; quite representative
of the system that would exist Were pre-equating to be considered for
operational use; was devised and utilized to get; parameter estimates for
the items, contained in multiple pretests, on the Sate Scale. The two
verbal forms under consideration were both part of thiS linkage system.
The effects of using the parameter estimates; obtained from the
pretest data, on the equating process were evaluated in the following
way; Each of the SAT - verbal final forms under study, when administered
for the firSt time operationally, was equated to two different old forms
and t!-.6 results of the equatings averaged. ConventiOnal linear equating
methods were used when this equating was done These equatings were
redone using item parameter estimates based on the pretest data and item
'parameter estimates generated from the intact final form adtinistratioti.
In each Cage, IRT true-score equating was performed. For each form the
IRT equating based on pretest statistics was then compared to the IRT
equating based on intact rinal form data and the linear equating used
operationally When each form was put on scale. IRT equating based on
intact final form data and linear equating rdSultS were used as criteria
in this study for the following reasons: (1) In recent IRT equating
feasibility studies (Petersen, Cook; and Stocking, in press; Kingston
and Dorans, 1982), it has been demonstrated that intact form IRT
true-score equating is a viable equating method for aptitude test data;
and, (2) the linear methodsactually performed to put the forms on scale
-5-
operationally haVe undergone many years of scrutiny through their use
fOr operational score reporting purposes. This study was done using two
SAT-verbal forms so that all results could be replicated. This should
form the basis for drawing stronger ConcldSions about the feasibility of
pre-equating the SAT-verbal section than had the replication not taken
place.
Me-thOdology
Description of Tests
Test booklets containing SAT fords suctias those used in this study
consist of six 30-Minute sections: two SAT-verbal sect.onsi two
SAT-mathematical sections, one Test of Standard Written English (TSWE),
and one variably section. All examinees at a given administration take
the same test sections except for the variable section; where different
subsampleS of the total group receive different variable sections. The
variable section consists of either one of two verbal or mathematical
common item equating sections (anchor tests) or one of a number of
verbal, matheMatical, or TSWE pretests. In this study; data from only
the verbal sections, Verbal common item equating sections, and verbal
pretests were used. The samples used for calibration purposes in the
Study either took the verbal sections and one of the verbal common item
equating sections or the verbal sections and one Of the verbal pretests.
she two SAT-verbal sections contain a total of 85 five-choice items
(45 items in one section; 40 items in the other section) c.omprised of 25
antonyms, 20 analogies, 15 sentence completiona; and 5 reading passages
7
-6-
each of which is followed by 5 items based on the passage. The verbal
Common Item equating sections contain 40 items (10 of each type); theSe
sections are built to be as parallel as possible to the 40 item
section. The verbal pretest sections either contain 45 or 40
items and are built to be as parallel as possible to the comparable
length SAT-verbal sections.
Prior to 1982; raw scores on the SAT were typically transformed to
scaled scores on the College Board 200 to 800 scale via linear equating
tethOds. Since January of 1982, IRT true -score equating using intact
fihal fort data has been used to put forms on scale. SAT-verbal raw
scores are obtained scores that have been corrected for guessing. Raw
scores are computed by the formula R-04, where R is the number of
correct responses, W is the number of incorrect responses, and (k+1)
equaIS the number of choices per item.
Item-Calibration Design and Linkage-System
Pretest items corresponding to the verbal sections of two forms of
the SAr, 3ASA3 and 3BSA3, were calibrated and placed on a common scale
through an elaborate linkage system which utilized data on overlapping
items from the administration of intact final forms with either pretest
Sections or common item equating sections. The calibration linkage
system, involVing the pretests; final forms, and equating sections is
depicted in Figure 1. Responses from randomly Selected samples of
Approximately 3000 examinees taking each pretest-final form combination
and approximately 2700 takihg each final form-equating section
combination were used for calibration purposes. Each box in Figure 1
represents a separate calibration ( computer run). The dotted-lire boxes
77-
within the larger bokda indicate the overlapping items that were used to
place parameter estimates on the same scale within a single calibration
-tun. The directional arrows between the boxes indicate that a scaling
program (described in a later section Of this paper) was run to place
parameter estimates from the separate calibration runs on the same
scale; It should be noted that all items contained in each 40 item
equating section appearing in Figure 1 were calibrated; howevet this was
not thb case for all items in each pretest or 4461 fort. In. order to
reduce calibration costs, only the 40 i m section of SAT-verbal forms
-Iused for linking purposes and OnIT-the 170 (85 items X 2 forms) pretest
items which eventually appeared in final forms 3ASA3 and 3BSA3 were
calibrated. Table 1 contains the total number of items and also the
total number of examinees responding to the items for each of the 13
calibration runs. Table 2 lists the number of pretest items calibrated
in each of the runs. Further reduction in costs were made possible by
using existing parameter estimates from the SAT IRT Scald Drift Study
(Petersen, Cook; and Stocking, in press) whenever gossible Also;
certain final form-equating section combinations from the Scale Drift
Study (labeled C-C, in Figiird 1) and certain final formequating section
calibration runs (numbered 9 and 13 in Figure 1) were linked into the
overall calibration linking system though they were not essential to
getting the pretest parameter estimates on the same scale. This was
done for equating purposes, and will be described in a later section.
IRT_ModeI.and Item Calibration
Item response theory (IRT) assumes that there is a mathematidal
function which relates the probability of a correct response on an item
Pretest data did not exist for 8 of the 85 items in Fort 3ASA3; Therefore,
final form data had tc be used in thecalibratiOn Syatem; This data was
obtained from calibration run number 9 in Figure 1;
9
A
6
EicYt
it)
I
Ce
1l. i'ttlIn
100lit I
...
I 1AL I
kW, ;AL
(
IA/
X I241XL
ICI
It IPA-
46.
A3
lormr
Nara 3: Verbal pa-OquitIng ceillretionudil.Iloking plan,..Upper7ceav letters
iulluved by oae dlgic_ileulgnate intactSAT flu/ lump. -(ln the
(Regrets, all Intact SAT final furor JebignatIons r abbreviated; Al is Uit_altbreviatian fur 3ASA3, etc.) Upperran Idlers iuliothof by
three or ionr_digits &alpaca pretest (urns, Loot:in:de! letter( depignete cowmanlints equating aeelIune, _Bow (nuAiled 17,11) indicate
Rpm: IIX,IST ollIvotP:o runs Ilnatni huOeq %Obi the lay( boles inilleate_usurInpping Iteav that ogre wed to (Awl parawitt
natkatuu on CI4 mat stall LOCIST111116 OVAI, (1,ttettol A,C)rums and quAtins sectiao lot Aid parotid estilligdg
already Cobb u iron SAT IliT Drift Study, Amite indicatti ditection in which eating (linking) rune took place.
LoctsT run inaber n peceutiary in that kill 85 Al Imo can ba calibrated. (Only 40 Revs (run Al were calibrated In gag run unbar 3.)
eaof a necessary nu Owl the AS
calibrated in IJICIST run number 8 could be placed on (Neill,
ll
Table 1 I
Total Number of Items and Total Number of Examinees
for each of the LOGIST Calibration RUN
LOGIST Calibration
Run Number
Total Number of.
Items Calibrated
Number of PeeteSt
It6mS Calibrated
Number of Equating
Section Items Calibrated
Number of SATyerbal Total NuMbers
Section Items Calibrated-- of Examinag
135 55 40 40 8;459
2 162 2 80 80 8;519
14 3.121 1. 80 40 7;964
4 120 80 40 6;181
5 174 14 80 80 14,069
6 132 12 80 40 22,922
120 80 40 _5)41
8 137 17 80 40 '4,778
9 ,125 40 85 2,777
10 298 58 120 120 20;460
11 161 1 80 80 10;347
12 82 2 40 40 8;146
13 125 40 85 2;754
;.;
1;892 1622 920 810 143,499
LOGIST run number refers to identification scheme in Figure 1;
Pretest data did not exist for 8 of the 85 itemSIii_1ASA3, and hence;final form data had to be used for
calibration purposes. Thus only 162 of the total 170 pretest items were calibrated;
2-
Table 2
Number of items Calibthted from Each Pretest For
Pretest
Form
LOGIST1
Run No.
Total No.
of Items
Calibrated
No, of
Items
'in 3ASA3
No, of
Itimi
in 38SA3
Pretest LOGIST
-F Ron No.
Total No;
of Items
Calibrated
No; of
Items
in 3ASA3
No. of
_ItOmi
in DSO
C167 I. 27 13 14 X2232 8 2 1 1
0168 1 28 16 12 X2111 8 1-
X4058 2 2 - 2 X2069 8 1 -
A1128 3 1- I X2163 8 2 1 1
A2120 5 7 - 7 X2134 8 4 4 -
A2061 5 4 - .4 X2216 8 1 1
W4057 5X2128 8 6 6
X5050 625069 10 1 - 1
X5126 6 2 C237 10 29 15 14
X5I61 1 C238 Ili 28 14 14
X5132 1NO W5014 11
X5111 6 5- 5 24125 12 1 1
24066 12 1 1
Totals 1622 772 85
LWIST run number refers to the identification scheme in Figure 1.
2-Pretest data did not exist for 8 of the 85 items in 3ASA3, and
hence, final form data had to be used for
calibration purposes. Thus, only 77 (of 85) pretest items were calibrated for 3ASA3 and 162 (of 170) for
both forms.
14
to an examinee's ability. (See Lord, 1980, for a detailed discussion.)
Many different mathematical models of this functional relationship are
possible. The model chosen for this study was the three-parameter
logistic model. In this model, the probability of a correct response to
item i, i(0),
p (0) _ c.6-1;702 ai(6-bi)
1 - ni(1)
where,
and ci
are three parameters describing the item andai
,represents an examinee's ability. These parameters have specific
interpretations: bi is the point on the e Metric at the inflection
point.of Pi(6) and is interpreted as the iterii-d-ifficulty; ai is
proportional to the sloPe of Pi(0) at the point ofsinfiection and
represents the item diactknttnavtanl and ci is the lower asymptote of
P.(e) and represents a pseudo,,guessing_ parameter.
The item parameters and examinee abilities for this study were
calibrated using the program LOGIST (WingerSky, Barton; and Lord, 1982;
Wingerskyi 1983). The estimates are obtained by a (modified) maximum
likelihood procedure with special procedures for the treatment of
omitted items (see Lord, 1974).
LOGIST requires as input the responses to a set of items from a
group of examinees, coded to reflect items answered correctly,
incorrectly, omitted and not reached. In additicim, the user may specify
certain restrictions on the data and parameters in order to speed
convergence of the iterative procedure. The major restrictions
specified for the study for most of the LOGIST computer runs were:
16
-12-
1. examinees who answered less than one-third of the items were not
used,
2. a's were restricted to a range of .01 to 1.75,
3. c'S were restricted to a range of .0 to .50 or .75(p+), and
4. 0's were restricted to a range of -7.0 to 5;0;
LOGIST produces as output estimates of the a; b, and c for each item,
and e for each examinee.
Thirteen separate LOGIST runs were necessary to calibrate the
pretest items, final form and equating section items useor linking
purposes, and the final forms to be used for equating purposes. These'
LOGIST runs are numbered 1-1 hi Figure 1. Each of the separate LOGIST
runs generated item parameter estimates on the particular:scale defined
by the ability distribution of the group of examinees used it the
calibration, and hence; a scaling program had to be run to put parameter
estimates from the separate LOGIST runs on a common scale. This Scaling
program also had to oe run to put the final form-equating section
combinations from the SAT IRT Scale Drift Study (Petersen et al,
press) on the cotton scale; LOGIST run 10 in Figure 1 was chosen as the
base form for scaling purposes because it contains an SAT-verbal form
and equating sectio,.. which are in common with a partial pre - calibration
linkage system recently deViSed (Cook and Petersen, 1982) for possible
future operational SAT use.
Scalings
The Scalings Just referred to are indicated by the directional
arrows in Figure 1 (and alai) Figure 2, to be diScussed in the foliowing
section). A recently devised scaling method (Stocking and Lord, 1982)
17
13
was used in the study. Briefly, the method works as follows. Letting
b, a; and c denote item difficulty; discrimination, and lower asymptote
parameters, a linear transformation of the form
.-bT =rb +
aT = a/r (T = transformed)(2)
is found which places new fort Item parameters on the base form scale.
The r and m of this transformation are chosen to minimize the average
squared difference between true scores on the common item set for a
particular group of examinees who have taken the ',lase fort; It should
be noted that cf = c. so that there is no necessity to transform lower
asymptote parameters. This method implicitly makes use of data from all
the parameterS-chatacterizing,an item because true scores are used in
the minimization process.
Equating Design
Operationally, the verbal sections of 3ASA3 and 3BSA3 were each
linearly equated to two old SAT forms and the results averaged. Thdad
equatingS can be used as a means for evaluating the effects of using
items calibrated from pretest data in the equating process. The
following diagram depicts the actual equatings that took place, and the
common item sections used for the equatings.
3ASA3
XSA2 YSA3
SA33
fk lw
= 3ASA1YSA1
For each equating depicted, IRT truescore equating, to be de-Scribed
in detail in the next section, was done three different ways. The first
way, referred to as IRT preequating, involved the use of item paramet'er
18
-14-
estimates based on pretest items which constitute 3ASA3 and 3BSA3, while
the other two ways (both used as criteria to evaluate the IRT.
pre - equating) involve the use of item parameter estimates based on data
collected when 3ASA3 and 3BSA3 were administered. as final forMs in an
intact fashion. The second and third ways differ in the following
fashion. In One situation, referred to as intact form calibration
system equating, item parameter estimates for 3ASA3, 3BSA3, and the old
forms to which they were equated were placed on the same acald, which is
essential for IRT equating, by being linked into the overall calibration
and linking plan shoWn in Figure 1. In this situation; the forms to be
equated were linked indirectly through multiple scaling runs applied to
a number of intervening LOG1ST runs which contain multiple final forms
and equating sections. This was done in an attempt to simulate_
conditions of one possible model under which intact final form IRT
equating might take place for the SAT in -the future; In the other case,
referred to as intact firm direct link equating, parameter estimates for
the new (3ASA3 and 3BSA3) and old forms to be equated were linked
directly through common equating sections. This linking is depicted in
Figure 2.
Equating Methods
Linear equating methods produce an equating transforMation of the
form T(x) = Ax + B; where T is the equating transformation, x is the
test score to which it is applied, and A and B are parameters estimated
from the data. The Tudker, Levine Equally Reliable; and Levine
Unequally Reliable linear equating models (Angoff, 1971, pp: 579-583)
are the models that have been used until 1982 for equating SAT=verbal.
19
-15-
To equate A3 to X2 and Y3directly using intact form A3 data
fmA3
INTACT
C fm X2
9
fm Y3 )
To- equate B3 4i Y2 and Aldirectly -using intact-- -form B3 data
(: B3 fk 2)-INTACT
fk Y2 :)
E
INTACT133_
13
fw Al
Figure 2: Verbal intact form direct link calibration and linking plan.liPer-case letters followed by one digit designate intact SAT
final forms. Lower-case letters designate common item equating
sections` -Boxes and ovals are numbered to directly correspond_
to comparable boxes and ovals in Figure 1. Arrows indicate
direction in which scaling (linking) took place.
-16-
Choice of which of the three models to use for score reporting purposes
dePeheS on 1) differences in ability between new and old form groups, as
Measured by a-set of common items (anchor test), and 2) whether the new
_-and old forms are equally reliable, which is typically interpreted to
mean --of equal test length. TheSe models are based on univariate
selectiOn sampling theory. Scores on the relevant selection attribute
(the attribute on which the equating savTles vary) are assumed to be
Collinear with scores on the anchor test in the case of the TUcker model
and with true scores on both the anchor test and the test form in the
case of the Levine models. Scores on the common item set (anchor test)
are used to estimate performance of the combined group of examinees on
both the old and new forms of the test; thus simulating by statistical
Methods the situation in which the same group of examinees takes both
forms of the test.
The parameters A and B of the equating tranformation are estimated
by means of an equation that expresses the relationship between raw
scores on two test for in standard score terms:
X - M_ y - 14_x y
S S:y
(3)
where x and y refer to the test scores to be equated, and M and S refer
to the means and standard deviations of the scores in some group of
examinees. Methods using the above equation differ in their
identification of the means and standard deviations to be estimated.
The Tucker and Levine Equally Reliable methods are based on the
estimated means and standard deviations of observed scores whereas the
Levine Unequally Reliable method is based on the estimated means and
standard deviations of true scores.
21
-17-
Table 3
Formulas for Linder Conversion Parameters
Tucker
A4. 62 it2 t2 ,/t4 .1
yb yvb' vc vb" vb xa xyas ye ve va'
2 -2B H + C 01 - H )/S - AM - AC (H - H _ )/S
yb yvb ye vb vb xa xva ye va va
Levine Equally Reliable
2 2 2A (S- + (S-- - 52., = S- )/(S2 - S-
2
yb yb y' c vb vb v"b
--2 -2 1 2 2 -2 2(Sva + (S_x_a
Sx"a)(Svc - Sva va v-)/(S--
- S- "a )
B ti + (1 - )((S2 - S2 ) ifi2 -N2 -05yb vc vb yb vb V"b"
2 2 2 -2 LI
- AM5ai A01 -vc
)((S3-c, Sx.4)/(Nv4 Sv.,))
Levine Unequally Reliable
2A ((S - S
2. )/(S-
2
yb y b vb
B 14yb
ii0)11(02_ 2(s
i2V_
"xa 4a
2 2 2(1.4 - )((Ssl - Sy0/(Svi, - Nei))
1/4
-
Angoff Error Variance Estimates (Anchor Test External
2S (S
2S2 - C
2 MS2+ C _
P pg vg pvg vg pvg
2 -2S2
(S2
S2
- C-pv g pg vg vg)/(S + C
pg pvg)
Notatiotr
Xi
to Total Test)
New Test Form ZOICTest FormEither New or Old Test FormAnch-or Test V
Observed Score x, y, v, p
2".Error Scoria et v", P"Group Taking Test X and That VGroup Taking Test Y and Test VGroup Taking'Test P.and Teat VCombined Group or (a + b)
MeanStandard DeviationCovariance
22
The formulas for computing the A and B parameters for the Tucker;
Levine Equally Reliable; and LeVine Unequally Reliable modeIs,are given
in Table 3. AS noted in Table 3; the formulas for the Levine models
require error variance estimates; Angoff's method (1953) Of estimating
error variances is used for operational linear equating. ThiS method
assumes that the test to be equated and the anchor test are parallel_
except for length.
When a new form is equated to two old forms; the final linear
parameters to 1:1Ut the new form on scale are arrived at in the folloWing
faShiort. Each of the old forms has linear parameters for placing it on
Scale; these parameters are combined with linear parameters generated
from the equating relationship to derive parameters to put the new form
on scale. There will be a set of parameters for each equating to each
Old fotM; the final set of parameters are arrived at by averaging
parameters from each of the single equatingS.
Although there are a number of equating techniques possible when
using IRT, this study was concerned only with true formula score
equating (Lord, 1980). The expected value of an examinee's doserVed
formula score is defined as his or her true formula score. For the true
formula score, s , we have
(k + 1)
= E P (e)k
i 1
(4)
n is the number of items in the test and (ki+1) is the number of
choices for item i. If we have two tests measuring the same ability8
23
k}ien true formula scores and n from the two tests are relatea by the
equations
mn= E
j = 1
(ki_ +1)
k; IT1T1 ]
(k. + 1)
(5)
Clearly, for a particular 6 corresponding true scores and n have
identidal meaning. They are said to be equated.
Because true formula scores below the chance score level are
undefined for the three-parameter logistic model, Some Method must be
established to obtain a relatiOnShip between scores below the chance
level oft the two test fortS to be equated; The approach used for this
study (Lord, 1980) was tb estimate the mean (M) and standard-deviatidh
(S) Of below chance level scores on the two tests to be equated via the
formulas
IE
i =3-Li+ 1)/ki liki. and (6)
2n
S = E (c
=
where n is the huMbet of items in the test; (ki I) is the number of
choices for item i, and d iis the psuedo-guessing parameter for item i;
And then to use these estimates to do a simple linear equating (see
equation (3)) between the two sets of bdloW chance level scores.
=20=
o_
.
In practice, true score equating is carried out by Substituting
'// estimated parameters into the equations (5)-and (6). Paired values of
and n are then computed for a.SerieS Of arbitrary values of e. Since we
cannot know an examinee's true formula -.sqpre; we act as if relationships
(5) and (6) apply to an examinee's observed formula score.
Two furner points. require clarification. First, the mechanics of
doing IRT true -score equating based on pretest data (pre-equating) and
based on intact final f rm data are exactly the same. What differs are
the item parameter estimates that are used to calbulate P1(e) in
equation (4). In one instance the parameters have been calibrated for
the item-When given in a pretest, and in the other instance, when the
item was given as part of an intact final form. Second, when performirig
-score equating to two old-forms using IRT true-score equating
techniques,, a conversion table is generated for each'new form-old form -
relationship ant' then the correspondingAlf
entries in each table are Simply
averaged to generate the final table.
Results
Pre-e tatin<z Results
4$ A number of figures and 'tables have been prepared to summarize the
results 'of this study; Because' the equatings done for 3ASA3 and for
3BSA3 are independent, and meant to serve as replications of the
pre-equating process, the figures for each form can be viewed
separately. BeCAUSe each of the forms was equated to two old forms,
there are figures for each of the single equatings and then the equating
'tegtlting from the averaging of the single equatings.
.9
-21-
In the figures for each equating performed, thete are two plbts.
The first plot compares the raw to scaled score conversion line
resulting from the IRT pre-equating to one of the three Comparison
conversion lines, resulting from either the intact form calibration
system IRT equating, the intact form direct link IRT equating, or the
intact forM linear equating actually used operationally for score
reporting purposes. The second plot contains residuals. These
residuals are simple differentet between scaled scores resulting from
the IRT pre-equating and one of the comparison equatings fOr each
possible formula score point. The plots use the comparison- equating
(intact form calibration IRT equating, intact fOrM direct link IRT
equating; or intact form linear equating) as the baseline and show
differences between the Ptd=equating equating and the baselineequating
results across the formula score scale. As mentioned earlier, the
intact form talibiation system and 1irect link IRT equatings were chosen
as baseline egliatings for these residual plots because this sort of IRT
equating has been shown in previout studies to be a viable eq.,:ating
method for SAT data, and provides a good criterion equating against
-
-which to evaluate the Pre-equating results; The intact form linear
equating was also used as abasellne.because thit was the method
actually used to put 3ASA3 and 3BSA3 on scale operationally. Of the
three comparison equatings, the intact form direct link equating should
provide the best criterion against which to evaluate the pre-equating
regultt in that 1) the relationship between the parameter estimates; for
the forms to be equated, .from the separate LOGIST runs have not been
influenced by intervening scalingt, and 2) in contrast with linear4
\_
-22-
Equating Plot Res-a, dual Plot
D 35SAT-V A37SAT-V X2
SOO 1.----L F
I 30 -00 700 L F 25 -
.?-1 S- 0 20
-0M
C 600 -A
F 15 -L s 10 -
-CP E SOO I"-43
C -D 1- A
E-1 5400-E
Ir 0-ad1-1 C I- D- -5 -
0 300 fra-10 -
S RI/ E
1-1 ld 200 - - - IRT PRE-EQUATING..15
o -20 -al >,
- INTACT FORM _ 4CS)
S -30-25___
1--C0 100- IRT EQUATING
E
,--1
SAT-V A3/SAT-V X2 EQUATING RESIDUALS_
E1-4
0
m00
4.J --i
0
"! JJ
0
0 1 1 1 1 1 1 1 I 1-i-1_1_1_1 1_1_1 III LI I 1 1 35-30 -20 -10 0 10 20 30 - 40_ - 50 10 70 80 90
FORMULA TRUE-SCORE
600!
700 T-
S rC 800A
E500
Sol"C _
R0 3001-Ea4-
looE11 t1-1--rii-r-t-i-1-1111,1
-30 -20 -10 0 10_ -20 30 40 50 60 70 80 90FORMULA TRUE-SCORE
800
700
SC 6001-A
= --LSCO
0t-1
S 4°°
0 300 L
Egoal-
i CO
SAT-V A3/SAT-V X2
IRT PRE-EQUATING
INTACT FORM CDL):CRT EQUATING
SAT -V A3/SAT -V X2
- IRT PRE-EQUATING
INTACT FORM OPER,LINEAR EQUATING
0 t-1 1 1 1 -1 1 l 1 1 1 1 1 1 1 1 I 1 I 1 1 1 1 I 1 1-
-30 -20 -10 0 10 20 30 40 SO 80 70 SO 90FORMULA TRUE-SCORE
D
I 30F 25
-
1.1- =RT PRE-EQUATING
INTACT-FORM-41:S)IRT EQUATING
-30 -20 -10 O 10 20 30 40 SO so 70 60 DO
FORMULA TRUE -SCORE
SAT-V A3/SAT-V X2 EQUATING RESIDUALS
O 20F 15
S 10CAL 0ED
s __r J--zo
RI
E -25 r-30
- IRT PRE-EQUATING
INTACT -FORM CDL)IRT EQUATING
sl-1 -11 1_1 1_1 1 I _I_ I 1 -r- 1 1 t 1 t I 1 t h 1 f- t-30 -20 -10 0 10 29 30 40 50 80 70 80 90
FORMULA TRUE-SCORE
SAT -V AS/SAT -V X2 EQUATING RESIDUALS
D 35IF 3°F 25
6 20F IS
s 10
S
LED -5sc --I sO -20R 1
/ E -25S -30
351
I- - IRT PRE-EQUATING
f. ---- INTACT FORM OPER.rT LINEAR EQUATING[L
I--
I.- 7 ._01- /
1---.
I-
pI-I1-
1_1 _1_1 L1 1 i I 1 1 1 I 11 1 1-1-1-1-1-1- I-1--30 -20 -10 0 10 20 30 40, SO 60 70 60 20
7UFKLA TRUE-SCORE
Figure 3: SAT-verbal Form 3ASA3 equated- to erb XSA2 - Picts of I) IRT
pre-equating raw to scaled score transformation compared to corresponding
intact form calibration system IRT, direct link IRT; and operational;. _ ___ . _
rawTine_ equating raw to scaled score transformati,:ms, and 2) difference-S.
betWeen Sdaied scores (IRT pre-equating - comparison equating) resulting
from the equatings
27.
600
700
S-C600A
EUM0
__400C0 300
E-200
100
0
SOO
E.suaring_Plot
-23-
SAT-V A9/SAT-V Y3
7
I- I I-
- IRT PRE - EQUATING
INTACT-FORM-ECS)IRT EQUATING
1 1 1 1 t 1 1 1 1 I 1 1 1 1 1 I t I I
-30 -20 -10 0 10 20 90 40 SO 80 70 60 00FORMULA TRUE -SCORE ,
SAT-V A3/SAT-V Y3
0 60
-4C9 C31
ME crW P4
P.0
E-4
.1-.1
4.1 5_Z
1.4
700
SC 600A
E S001-D
S4001-
C
0 300PE
100
0
r
-
h"*
-30 .20 -10 0
0E CO p-3
1J f(1
Cl CIC 0
.1.3 rlC
1--I
C)
aO
600rt
tCO
- IRT PRE-EQUATING
_ FORM COL)IRT EQUATING
10 20 30 _ 40 _ -50 80FORMULA TRUE-SCORE
70 SO Go
SAT-V A3/SAT -V Y3
-30 -20 --IC 0 10 20 30 40 SO 60 70FORMLLA TRUE -SCONE
Figure 4.
o nI
3O
F 2S
O 20
F 16
S 10C s
LL 0
-6
S0-160-20E -25
S,3
Residual Pl
SAT-V A9 /SAT=4, Y9 EQUATING RESIDUALS
35 30 -.20 .10 0 10 20 30 40 50 6Q 70 60 610
FORMULA TRUE-SCORE
IRT PRE-EQUATING
INTACT-FORM-CCS)IRT EQUATING _
SAT -V A3/SAT -V Y3 EQUATING RESIDUALS
ss
F --F 2S
0 dF isS 10CAL OrED -
..1SC-16r° -20 -RE25
-30
/
- IRT PRE-EQUATING
INTACT_ F_ORM CDL)IRT EQUATING
-35 1i1-10 0 113 20 30 40 SO 00 70 60 GO
FORMULA TRUE-SCORE
D 3SI 30F 25
2°F IS1-s 10 1--
C sAL 0E -5
-100-150 .40RE -25
-3035
-90
SAT -/ A3 /SAT -/ Y9 EQUATING RESIDUALS
- IRT PRE-EQUATING
INTACT FORM OPER.LINEAR EQUATING
..
-20 -IC 0 10 20 30 40 50FORMULA TRUE -SCORE
00 70 80 90
SAT-verbal Form 3ASA3 equated to SAT- verbal Form YSA3 - Plots of 1) IRT
pre-equating raw -to scaled Store transformation compared to corresponding
intact form calibratidn system IRT; direct link IRT, and operational
linear equating rat..? to scaled score transformationsi and 2) differences
between scaled_Scrites (IRT pre-equating - comparison equating) resulting
from the equatings
Equating-Plat Residual Plot
0-tm 00J.) clam
X MIa
(.7
E Pas.. i-4-0
Eti)-Lt
L1 tit
m
700
S600
AL _E 500
400 I-
0300RE-
200
co
SAT -V AWSAT-v X2 AND Y3
---1-
IRT PRE - EQUATING
INTACT FORM CCS)IRT EQUATING
+-1}14111-NY -20 -10 0 tO 20 30 40 50 80 70 80 GO
FORMULA TRUE-SCORE
SAT -V A3 /SAT -V X2 AND Y3
awl
700 -Sc 800 -A
,
E SOO
1-
S 450C
0 3001R ,
E r-200 I-
1901- rel[
LLI.J..11 1 if t U 1 1 IL_L-1 / 1 1 1 t 1 t L,-30 -20 -10 0 10 29 30 40 SO 60 70 80 GO
FORMULA TRUE -SCORE
- IRT PRE - EQUATING
INTACT -FORM LDL)IRT EQUATING
M_ WS Z0 ..- Q0LL
H -,14,-1 ta 4.:U C Z4) -..-i .., p',14 c
ii
euP
MO -SC81:10 ',--A I
i_ _!E 500 1--0 1--
S400 I--
0 300 1PE I--
W
SAT -V .43./sAT-v X2 AND v3
IRT PRE-EQUATING
cL 7 INTACT FORM OPER.CD 100 r . LINEAR EQUATING
01h' j 1[1-- 1--1-- I f-lr-V-1-4- 4-1 -1- i 1 1 1 1 I I 1 1 1 1
-30 -20 -10 0 10 20 30 40 50 190 70 80 GO
FORMULA TRUE-SCORE
1
1) 35ISO
F zs
oF 15S 10
SACOED
S-10
c -160 -20E -25
-30SS -30
SAT -v AS/SAT -V X2 AND Y3 EQUATING RESIDUALS
D 3530zs -
0
IsFS 10
ShAL -6 F-
E -I-5 r--t
c -IS
E-25s -30 -
35-30
o/
- - - IRT PRE - EQUATING
INTACT_ FORM (CS)IRT EQUATING
I I 1_11 I : t
-20 -10 O 1o__20FORMULA
3 0 4 0 5 0
TRUE-SCORE
60 70 80 80
SAT -V A3/SAT-v X2 AND Y3 EQUATING RESICUALS__
- - IRT PRE-EQUATING
INTACT-FORM-0740IRT EQUATING
-20 -10 0 10 20 30 40 50 80 70
FORMULA TRUE-SCORE
80 GO
SAT -V AS/SAT-v X2 MELTS EQUATING RESIDUALS
D 35
FF .2s -O ::0 -F 1G-
loi-
A5
L
r)D-t
c-I50 -20 -R-25 -
S
35
- IRT PRE - EQUATING
_____INTACT FORM OPER.LINEAR EQUATING
/ N,:---=
-,... /--,
.\.....--=- '* \N -/
-20 -10 O 1+0_ 20 30 4_0___50 $0FORMULA TRUE-SCORE
Figure SATverbal Form 3ASiLLeeiyated to .:1AT=Verbal Fort-XSA2 AteJ. SAT-verbal
Form YSA3 - Plots of 1) IRT ptdEotiati4 raw to scaled score trans-
"i formation compared td Corresponding intact form calibration systemIRT,diredt link IRT, and operational linear equating raw to sca4ed
Score tranSfortaticts, and 2) differences between scaled scores (IRT
pre-equating -.comparison equating) resulting from the equ4lAinv.
29
70 80 GO
E 0O cc
Lz. C-1
4-1 MC
0 0C 4.1 r. r.1
:4 0
CY0
DqUatiag Residual Plot
SAT-V AUSAT-V X2 AND Y3
700
SC 800 ,-A -
E SOO -O 0-.
-1-
S 4C)°C0 300RE200 - - - IRT PRE - EQUATING
C -16 -
°-20 --
INTACT FORM CCS) R
mOH IRT EQUATINGE-25 -
,
------INTACT_FORM tCS)
S -30 - IRT EQUATING
3,-1--1-1-it'.-L-1-E-111t111__IJILIIIII -351 1 i 1
1_11L_II1 I- -t-r-t-i-r-t --I- -I- -F : f-II-3o- -20 -10 0 10 20 30 AO SO SO 70 80 GO -30 -20 -10 0 10 20 90 40 SO 80 70 SO GO.
FORMULA TRUE-SCORE FORMULA TRUE -SCORE
0IFF
0F
SAT-4 AS/SAT V X2 AND Y3 EQUATING RESIDUALS
3$
30-25-20-IS -
S 10 -CAL 0 -g =5
so
- IRT PRE- EQUATING
SAT-V A3/SAT-V X2 AND Y3
ecuL
700-
C 800A
LE 5001-O 1--
S 443a
C
° 300 ;--Rr_ r200 I-
L_
loc4-
ied--
; L_ILL I I J till!! LL:=11111__I 1 I
-30 -20 -10 0 10 20 30 40 50 00 70 80 GO
FORMULA TRUE-SCORE
- IRT PRE-EQUATING
IRT EQUATING
8up
7G0
C BCCAL
Scc{-
S 400"C 1-p0 300 {-
E
1 A3/SAT-V X2 AND V3
200 F -- -IRT PRE-EQUATING
INTACT FORM OPER.100r- LINEAR EQUATING
01r-11,1-1-4-1 1-1--r-F1-1-1f11J11111111-30 11-20 -10 0 10 20 30 40 SO 80 70 SO GO
FORMULA TRUE-SCORE
35I 30
F 2$ -
0 at$ :50 F
1_A
5
L 0 FE 1
-10$C -15
R. -20-205 -2S -S -30 -
3$
SAT-4 A3/SAT-V X2 AND Y3 EQUATING RESICLI
-30 -20 -10
PRE - EQUATING
INTACT-FORM-LDL)IRT EQUATING-1-
O 10 20 30 40 50 80FORMULA TRUE-SCORE
70 80 GO
SAT--V Al/SAT-V X2 AND _Y3 EQUATING RESIDUALS
35
3°1-F :5-
20^-1 i c
5 10LgL
A -1
0EL 0
0_20_
Es -25
35
------IRT PRE-EQUATING
L-__.--INTACT FORM OPER.LINEAR EQUATING
-20 -10 0: 1_0__ 20 30 40__50 80 73FORMULA TRUE-SCORE
Figure SAT-verbal Form 3ASiL3 _e_luated to .3AT-verbal Form-XSA2 Ll SAT-verbal
Form YSA3 Plots of I.) IRT pre- equating raw to scaled score trans-
formation compared to corresponding intact form calibration systemIRT, ;direct link IRT, and operational linear equating raw to scaldscore traUfOrtations; arid 2) differences between scai2d scores (IRT
pre-equating - .comparison equating) resulting from the eq74cAngr7
29
SO GO
'C
±.) .5,L)
.a iti
1-at P.1OE
800
700
C 800A
E SOOD
4°°
-
L 1-4 C
0 300 -RE
C.) UJ
C13
r./1 143C4-
01_1_1_1 t-30 -20
T..1
u 007001-
49 S I-4 Li
two 1-
=I MZ
A
hE SOO
cr"1.i LT.1
0 -C
EU004
fr. SC
)..) 1-I 0300-U Ral ...
5E
200--t-
G,-)1-1
loo
4.)
-26-
WIatidg Plat Residual Plot
SAT-V 83/SAT-V Al
-7
- - IRT ME-EQUATINGINTACT FORM <CS)
IRT EQUATING
1 I I 4-1-1- -t -1 -t t f l t 4-1 I II-10 0 10 20 30 40 50 80 70 80 00
FORMULA TRUE-SCORE
SAT -v 83,SAT -v A I
- - IRT PRE-EQUATING
INTACT -FORM- <DoIRT EQUATING
0- ..1_1 1 l l 1 II 1 1 1 1 1 1 1 1 I 1 1 1 1
- 30. -20 - 0 0 10 20 30 40 ZO 80 70 SO 00
FORMULA TRUE -SCORE
600
noL$c6-661-A
E SOO -
--1 ,-1D
z4001 -
93ou
kCi 200r- - - - IRT PRE-EQUATINGW.
0 100 L/NEAR EQUATINGINTAcT FORm opER
-30 - 01 0I
10I
20\ 30 0i_50 -1-
SO 1 70FoRmuLAITRuE-SCORE
_ _5AT-VU/SAT-1.1 A
80 00
3130
F 256 20
F 15S 10
As
L 0
O -5-10
sc I 6
Rci -20 k
E -25S
-35-30
SAT -V 83/SAT-V Al EQUATING RESIDUAL
-20 -10 0 10 20FORMULA
--urr PRE-EcUATINGINTACT-FORM-GCS)
IRT EQUATING .
30 40 SO 89 70 80 GO
TRUE-SCORE
SAT-V 83/SAT-V Al EQUATING RESIDUALS
1) 35IFF 2s -O 20 -F 15
s 10CAL 0
g -15F_ -10 -- I S -
R_20
E-25 -S_
5511 1 1 I 1 I
-30 -20 -10
D 35SD
F 25O 20F isS 10CAL 0-O -5
R-25 - INTACT -FORM OPER,
-30 - LINEAR EQUATING
35 LI 1_1 1_1 l I L I _L I 1_4_4_1 11 I
-30 -20 -10 0 t0 20 30 49_ So 60 70 80 DO
'ORMULA ;'RUE -SCORE
- - IRT PRE-EQUATING
--,INTACT FORM -<1:41:5....... IRT EQUATING
1 . 1 1 I 1 1 1 1 1_1 1_4_1 1 1 1-1-1--1E--0 10 20 30 - 40_ - GO SO 70 80 20
FORMULA TRUE-SCORE
SAT -7 53 /SAT -V A l EQUATING_RESIDUAL_S
N
-IRT PRE-EQUATING
Fig,.:re 7: SA.T-vzrbal-Fr.i_m_3BSA3 equated to SAT-verbal-FOrm_3ASA1 Piocs of I) TRT
pre-equatingr_raw to scaled sCete_ttah§f.dtMation compared to corresponding'
intact form talibiatiOn system IRT, direct link IRT; and operational
linear equating raw to scaled score transformations; and 2) differences
between scaled scores (IRT pre - equating - comparison equating) resulting
from the equatings.
800
700 -
SC 800 -A
E 500 -0 _
400 -C
300 -RE
200 7
100 p
-SO -20 -10 0 10_ 20 30 _40 SO 80 70 80 90
FOPMULA TRUE-SCORE
-27-
Equating Plot Residual Plot
SAT-V 53/SAT-V12AND AL
800
- IRT PRE-EQUATING
INTACT TORN -SCS)IRT EQUATING
SAT-V 83/SAT-V Y2 AND AI
...., ru to 700c-,
W C S r,..
,-1 4.3C 800 -
a m - AI= /-000 f
5 -or 0 ,--1-
)1 430
9-4007
PL. E7C 1-
JJ /--I ° 300 I-U _ RM .W E I-4-1 C 200 ,
ro
1-1 74C
1-2
.63 ro
M 0L r10 .1.3
CU
0
00 -
--IRT PRE-EQUATING
INTACT FORM -(Dl)IRT EQUATING
0-30 -20 -10 0 10 20 30 40 50 00 70 80 GO
FORMULA TRUE -SCORE
8C0
SAT -V 83/SAT-V Y2 AND Al
700 7
scam. r-A
66soo
n
Cd
or
5400C
0300
200
100 -
IRT PRE-EQUATING
INTACT FORm_OPER-LINEAR EQUATING
I 1 1 LI 1_1 1 1
-30 -20 -10 0 10 20 30 49 __S0 80 70 80 00FORMULA TRUE-SCORE
SAT-V 83/SAT AND AI EQUATING RESIDUALS
D 95p 30_F 25-0 20 -r 15 -S 10 -cAC 0E
°
s-10 -
0 _E -25 - INTACT_FORM tCS)
-30 -IRT EQUATING
35111-1 1_1111_11 1111-r-t11111fil
-30 -20 -i0 0 L0 20 SO _40 40 00 70 80 GO
FORMULA TRUE-SCORE
.......'-"'",0.--/ .
- IRT PRE-EQUATING
SAT -V 83/SAT -V Y2 AND Al EQUATING RESIDUALS
0 95F 30-F 25 -0 20 -F Is -S 10 -C 5-AL o
O./
E-- ...... ,...
-s-10-
C - I 5 -
R - - -----IRT PRE-EQUATING
E -25 - INTACT FORM _COL)S.30.. IRT EQUATING
35 1- 1 1-1 1 I- -1-1--1-1-1 1-1 t 1 I 1 1 I 1 1 1 I I I
-30 -20 -10 n 10 20 30 40 50 80 70 80 00
FORMULA TRUE -SCORE
SAT -V 83/SAT-V Y2 AND Al EQUATING RESIDUALS
D 35I 30-
F 25-0 07-F 15_
S 10 -C /'A
0 -E0 -5
S _ _
0-20-R
-----IRT PRE - EQUATING
INTACT FORM OPER,S.-30 - LINEAR EQUATING
35 I- I 1111 1111 1 -1- 1 1- I -1- I -1- 1 1
-30 -20 -10 0 10 20 30 40 50 60 70 CO 03FORMULA TRUE -SCORE
Figure :SAT-verhaL_Form 3BSA3 equated to SAT-verbal F6rm-YSA2_and SAT-verbal
Form3ASA1 - Plots of 1) IRT pre- equating raw to scaled score trans-.
formation _tompared to corresponding intact form calibration system IRTi
direct link IRTi. and_ operational linear equating raw to scaled score
transformations,_2) differences between scaled scores (IRT Pre-equating -
comparison equating) resulting from the equatings.
4
32
Table 4
Scaled Score Summary Statistics Resulting from Application of Four Equating Methods
SAT-verbal Sections of Forms 3ASA3 and 38SA3
Form N
IRT Intact Form
(Direct Link)
IRT Intact Form
(Calibration System)
Intact Form Linear IRT Pre-equating
3ASA3 126,788
M 437.04 437.01 441.45 439.26
S.D. 111.91 111.30 108,34 109.65
3BSA3 253,354
430.25 43042 431.42 440,39
S. D, 105.99 105.57 106.53 110.55
34
33
-29-
equating, curvilinear relationships are permitted. The residual plots,
in conjunction with data presented in Table 4, to be described next,
provide much of the data upon which to evaluate the results of this
study.
Table 4 provides the Scaled score means and standard deviations for
FOrMS 3ASA3 and 3BSA3 that would have resulted froth use for score
reporting purposes of pre-equating, intact form calibration system IRT
equating, intact form direct link IRT equating,. and intact form linear
equating to the old fen:ha. The means and standard deviations_were
computed using frequendiea for the total groups taking Forms 3ASA3 and
3BSA3.at the respective initial intact form administrations.
Based on the data presented for Form 3ASA3, it is clear that the
pre-equating was quite successful. In no residual plot is the
difference between the Scaled score resulting from the pre-equating and
the comparison intact form calibration system IRT or direct liOk IRT
equatings tere.then 15 score peihts ot a-SCald 'Containing'600-score--
pOints; The differences between the pre-equating results and the intact
form linear results are grater than the differences resulting fret the
intact forM IRT.equatings, particularly at the upper end of the formula
score :4ca1e. This is beauseall three IRT equatings demonstrate that
the raw to scaled score conversion is curVilitear_in this region, and
the linear equating cannot account 'for this curviIinearity. The
differences in scaled score means and standard deviations presented in
Table 4 are very small. The scaled score means and standard deviations
resulting frOM the two IRT methods used as criteria are Almost
identical. The scaled score mean resulting from the pre-equating lies
-30=
between the scaled score mean resulting from either the intact form
calibration system or direct lidkJRT equatings and the scaled score
mean resulting frOm the intact form linear equating. The scaled score
difference between the mean resulting from the pre-equating and any of
the Other equatings is about 2 points. What is particularly interesting
to note is the pattern of the residuals plots for the comparison of the
pre- equating results with the intact forM calibration system and direct
link IRT equating results, displayed in Figures 3-5 The patterns of
residuals are the same across both the single equatings and the. average
equating. The pre-equating results in lower scaled scores at the bottom
and top of the formula score scale and slightly higher scaled scores in
the middle region. As mentioned earlier, at no point are these
differenceS greater than 15 scaled score points, and hence, although the
pattern Of differences is consistent across equatings, the differences
themselves are minor when compared toi for instance the scaled standard
error of me357uremenrIor:S4T7-Nerhali_Whic.h ta. approximately. :0 sCaled_
score points.
The nre=equating of Form 3BSA3 was clearly not as successful as the
pre-equatitig for Form 3ASA3. The residual plots show maximum
differences in scaled scores resulting from the pre-equating and the
operational calibration system or diteet link IRT equating of upwards of
20 score points. Once again, the differences between.the pre-equating
and the intact form linear equating are even greater, partiCularly in
the regions of the formula score scale where.the raw to scaled
conversion is curvilinear. The differences in scaled score means and
standard deviations resulting from the pre-equating and the comparison
36
-31-
equatings are mktph larger than thoSe for Form 3ASA3. The two IRT
methods used as criteria produced scaled score summary statistics that
are very similar. Unlike the equatings for 3ASA3, scaled-score summary
statistics produced by the linear equatings are fairly similar to those
produced by the 1RT criterion equatings. The scaled score mean
resulting from the pre equating IS about ten points greater than the
scaled score means resulting from the IRT intadt form calibratien
system, IRT intact forth direct link, and intact form linear equatings,
Which are all within a scaled score point of each other. Once again,
the patterns in the residual plots for the IRT pre-equating and the
comparison IRT equatings are the an across both of the single
uatings (to 7SA2 and to 3ASA1) and; ,hence, the subsequent average
equating. The pre equating results in slightly loWer Scaled scores at
the lower end of the formula score scale and consistently higher scaled
scores through the middle and,upper end of the formula score scale. The
maximum differences-odour 5n-all-plots .pround_a_formula score of_70,_
S elemental Investib,ationS-and-Results
A number of possible explanations were generated-fo why the 3BSA3
pre-equating results were different from the 3BSA3 comparison equating
results and clearly not of the same quality as the 3ASA3 pre-equating
results; Exploration of these possible reasons for the inferiority of
.
the 3BSA3 ore- equating results are reported on next, and also diScussed
briefly in the conclusions section.
One possible explanation for the 3BSA3 pre-equating results has to
do with practice effects generated from the manner in which the test
sections are sequenced. In other words, for 3ASA3 there may have been
-32--
more or less of a balancing effect of the sequencing of the operational
final form section'and the pretest section (perhaps in about 50% of the
final form = pretest combinations represented in Figure 1 the
operational section appeared first and in the other 50% of the
combinations the pretest section appeared first), while for 3BSA3 the
balancing may not haVe occured. It can be hypothesized; given that the
3BSA3 pre-equating results are consistently higher in the upper part Of
the fOrmdla score scale than'any of the comparison equatings, that the
preteOt section followed the operational section in a disproportionate
number of. cases, and that practice effects resulted. An investigation
of the sequencing of sections did indeed show that; for 3BSA3, the
pretests occurred after the operational sections in 65% of the final
form - pretest combinations; but for 3ASA3, this was true 64% of the
time. Hence it would appear that the above explanation cannot be used
to explain why the 3BSA3 pre-equating results were so different from
those for 3ASA3.
Two other potential explanations for differences in pre-equating
results have to do with equating samples and LOGIST calibration runs.
These are:
I, The use of two different equating samples with the 3BSA3 intact
form calibt8tiOn system and direct link equatings. In doing the
3ASA3 intact form calibration system and direct link equatinge,
the same equating section, fm, was in common with old fotms XSA2
and YSA3; and hence the same sample, and subsequent set of
parameter estimates; could be used for both equatings. This was
not true for 3BSA3 in that fk was in common with YSA2 and fw
with 3ASA1. This necessitated the use of two different samples,
And hence, two different sets of item parameter estimates (both
Sets taken from the SAT IRT Scale Drift. Study) to perform the
equatings.
2. The use of different versions of the LOGIST program to generate
item. parameter estimates. For. 3ASA3, both the pretest and. the
final form parameter estimates were generated .from the current
version of LOGIST, and this is also true of the 3BSA3 pretest
parameter estimates. TO save on calibration costs; the'-3BSA3,
:final intact fOrM parameter estimates were recovered from the
SAT -IRT Scale Drift study (Petersen, et al; in press) run on a
different version of LOGIST. It is possible that the updating
and refinement of the LOGIST prOgram caused subtle differences
in parameter estimates, which collectively caused the
differences seen in the residual plots for 3BSA3.
The Possible-explanations above implicitly assume that it is not the
pre-equating for 3BSA3 that is somehow faulty, but instead the
comparison IRT equatings. To investigate whether or not it is
reasonable to explain the differences in pre-equating results this way,
the folloWing was done. The operational final form-equating section
combinations needed to equate intact final form 3BSA3 to old forma YSA2/
and 3ASA1 (see Figure 2) were run together in one large LOGIST;tUn,
using the current version of LOGIST and the intact final form equating
.redone. As a result, the patattitei- estimates for the 3BSA3 pre-equating
and the 3BSA3 intact final form equating were generated using the same
version of LOGIST. Further, by running the data for 3BSA3 and the two
33
-34--
old forms concurrently, there-was no need for scaling parameter
estimates (all parameter. estimates needed in the equating'are
automatically on the same scale) and only one set of 3BSA3 final form
parameter estimates were used in the equating (unlike the previous IRT
comparison equatings); In sum, the results of- equating intact final
form 3BSA3 to the old forms using the parameter estimates from the
concurrent LOGIST run should provide the best criterion possible for
evaluating the 3ESA3'pre-equating results.
A comparison of the 3BSA3 pre-equating results to this new IRT
comparison equating is presented in Figure 9 for each of the single
equatings and the average equating. The new' tomparison equating has
been labeled intact form concurrent equating in this figure.
Information on the scaled score summary statistics resulting from this
new equating and the others previously deScribed is presented in
Table 5.
The results presented in Figure 9 clearly lead to the conclusion
that it is not the comparison IRT equatings for Fort 3BSA3 that are
faulty, but rather the Form 3BSA3 pre-equating results. The data
presented in Figure 9 show differences between the IRT pre-equating and
the intact form concurrent IRT equating that are comparable to the
differences in the residual plots using the other intact form comparison
equatings.. Thus the poSsible explanations for differences in equating
results based on the use of different versions of LOGIST and multiple
sets Of parameter estimates, generated from the IRT Scale Drift Study
(Petersen et al, in press) must be discounted.
Equating Plot
=35-
C
R0 300 -
E ----TAT PRE-EQUATING
INTACT FORS 2 CONC3
100 IR7 EQUATING
0I-1-1-1 1-1-1-34 -20 -t0 0
11 I 1 1 I 1-I 1 -1-1 -1-11 1
10 20 30 40 50 80 70 80 90
FORNuLA TRUE-SCORE
SAT-V 83 /SAT-V Al
600
700
C 800ALESN)
cn-mmC034w
200
100
01-1 1 I I 1 -I 1
Residual -P-1 16-t
SAT-v 83/SAT-v Y2 EQUATING RESIDUALS
O 3S30 -
F 25 -ID 20 -F 15.-S 10 -CAL
(7-5
e-15R-5E -25 - INTACT-FORM-CC:0NC)
S -30 -IRT EQUATING_
-1-1-1-1 I 1,,1 1 IlL11111_1_1_3s -301 1,20t -I--1-101 01
10 20 30 40 60 80 70 80 90
FORMULA TRUE-SCORE
\ .
--IRT PRE-EQUATING
SAT-V 63/SAT -V AI EQUATING RESIDUALS
35
F 3°F 25 -
OF 15
10-C 5-AL 0 7E
S-
D
---IRT PRE-EQUATING 0-20 _
de'll
-.LL.-INTACT Fowl CCONC3IRT EQUATING
RE- -S -30 -
-INTACT_FORM CCONC3IRT EQUATING
I I I 1_ 1 I 35 1 I 1 1 L 1 1 I I 1 1-1-1-4 I I 1 1-
-30 -20 -10 0 ol 20 30 40 SO 80 70 80 90 -30 -20 -10 0 tO 20 30 40 CO 00 70 80 90
FORMULA TRIE SCORE FORMA TRUE-SCORE
.
- -IRT PRE-EQUATING
3A7-4 EM/SAT4 Y2 AND Al
800
7C0
C801
"c3 LESTM
71 -4 0 7CV < 3400 -
C
>4$00
a
200 - - PRE-EQUATING
I 00r _SW A CT-FORM-CO:WC,IRT EQUATING
0171-1--1-1-( I I I 1_1 I L I I 11 1 I I 1 1 I 1
-30 -20 -0 0 10 20 30 40 SO 00 70 BO 90
FORMULA TRUE-SCORE
D 3$I 30FF25O 20F 15s 10C gAL 0E
-6D-10
C-1604CiRE -25 - FORM CCONC35--30 _
IRT EQUATING
1_1 1 ri 1--1 1-1 1.1 I35
-30 -20 -10
SAT -V 83/SAT,V Y2 AND AI EQUATING RESIDUALS
MIN
4""
-IRT PRE-EQUATING
O 10 20FORMULA
30 40 60TRUE-SCORE
60 70 80
Samme_r_bel Form 3BSA3 equated to SAT-verbal Form YSA2; Form 3ASA1) and.
Eorms YSA2 and 3ASA1 -_Pldta 6f, 1) IRT pre-equating raw to scaled score
transformation compared to intact final form concurrent_IRT equating
n,4 to scaled score tranSformation; and 2) differences between scaled
scores (IRT Ord=equating - intact final form concurrent IRT equating)
resulting from the equatings;
90
Table 5
Scaled Score Summary Statistics Resulting from40licatiOn of Five Equating Methods
SAT-verbal Sections of Form 3BSA3
Form N
IRT Intact Form
(Direct Link)
IRT Intact Form
(Calibration System)
Intact Form Linear IRT Intact Form
(Concurrent Run) '
,
IRT Pre-equating
3BSA3 253,354
M 430;25 430.42 431,42 431.54 440.39
S.D.
......
105.99 105.57 106.53 105.86 110.55
43
42
The only'other possible explanation for,the differences between the
Form 3BSA3 pre-equating and intact form comparison equating results haS
to do with the quality of the parameter estimates for the 85 3BSA3 items
when they appeared in pretest form: In order f r the equatings to be as
discrepant as they are, the pretest and final form parameter estimates
for certain of the items must be quite different. The following method
was used to compare the-Se two sets of parameter estimates in an attempt
to locate those items for which the pretest parameter estimates were
probleMatiC. A mean absolute difference And a Mean signed difference
_
between the item response functionS for each item, where the functions
were generated using the pretest and the final form parameter estimates,
were obtained. Using all individuals in the sample taking Form 3BSA3
when calibrated as an intact final form, the abSolUte difference and the
signed difference between the item response functions for each person
(i.e., value of S) were obtained and then averaged across people. Items
having the largest mean AbSolUte difference and signed absolute
difference valudS were then located. The above analysis was also done
for the two sets of Form 3ASA3 item parameter estimates so that the
discrepancies between parameter estimates for 3ASA3, where the
pre-equating results were more than acceptable, could be compared to the
3BSA3 discrepancies.
Using the mean absolute and mean signed differences between the item
response functions as criteria for selection Of problematic items,
thirteen items from 3BSA3 and twelve from 3ASA3 were identified. Upon
inspection of these two subsets of problem items, they were found to
.
differconsiderably in characteristics. Of the thirteen items
-38-
_
identified for ForM 3BSA3, eleven were reading comprehension items. Of
the eleven, four were based on the same passage and .thred on another
passage; The remaining four reading comprehension items were single
items based on four different passages. Of the twelve items identified
for Form 3ASA3., four were reading comprehension items (two from one
passage, the other two from two other passages different from the
first), four were antonym items, three were analogies, and one was a
sentence completion item. Of the thirteen 3BSA3 items identified;
twelve were more difficult when given in a pretest than in the final
fort-. Of the twelve 3ASA3 items, seven were more difficult when given
in a pretest and five when given as part of the intact final form.
Upon closer inspection of the eleven reading comprehension items
from 3BSA3 exhibiting large absolute and signed differences in item
response functions, it was found that nine Of these items came from
pretests in Which the passage they were linked to .was located in the
final position Of the pretest section. For all but one of these items,
the item as it appeared in the,pretest was more difficult, sometimes
considerably more, than when it appeared in the final form. For the
lone exception, a word was deleted from the correct response (the only
such occurrence on either 3ASA3 or 3BSA3) between the time when the item
was given in a pretest and the time it appeared in the final fort. This
word also appeared as a key word in the text of the passage and it may
be hypothesized that it acted as a "clue" to the correct response, thus
explaining the increase in difficulty upon removal of the word from the
correct response when the item appeared in the final form. Figure 10
contains plots of the item response functionS baSed on pretest and
-39-
PEVIM CINIPRDIENa014 11E14 1
1.0
0.9
r0.1a O.70A 0.60
0.5
10.4
0.3
0.2
0.1
0.0 -3
A01.572OP .236c. .147
----FINAL MOA01.11-7
C. .092
0.9
II-111111ft2 3
I -1
mm
-2 0
'META
READING CONPREHEWIIN ITEN s
1,0.6
Ft00.7 r-
6A0.3-67:0.6-L-
3:0.4-
Y0.31-
0.2...........
0.1 -
I 1
---*
1 I0.0
MI .itltOW .931
-Aat.243OP .1G1Co .170
I -t-I L-4-1 IA1 2
THETA
1.0
0.2
0.t013.7
A0.60 .60I 0. 5
10.4Y 0.3
0;2
0.1
0.0
DING CONPREHOICION ITEM z
1.0
0.2
r 0' 41
:13.71-0 -_-A 0 . 60I0.5L10.4 -
151: .443..tag1:1
T
...-'FINAL FOR0.2 - .... - - 11
Ka . AOC!la-. 207
0 . I - Ca .147
0.0 1 1 I 1---1---3 -2 -1 0 1 2 3
THETA
1111111111
1 I-3
A±, .501111-. 1 OSOn .096
-FINAL FMA. 4966-.G24C. ;147
I I 1 1 1 1 1 1 1 -1--
Mml
- 2 0THETA
2
READING CONME107191121 ITEM 4
1 1 1 I 1
Figure 10: Plots of item response functions based on pretest and intact final form
item parameter estimates for thirteen problematic Form 3BSA3
46
1.0
0.2
REJOINS CIIMPREHEMS19c>f'1 ITEM
0.2
0.1
0.0 I 1 i -I- A-z -1
THETA
REPDIMS 0:11CPCHEMIZMI% 11131 7
1.0
00.7-3-A0.6a-x13.5
ID. 4
0.3 -0. 2
0.1_
0.0 t-3
1 1 1 I 1 1 1 1 1
Am .00715o .410Co Atia
-FINAL FORMAm .620Oo .10SCo ;147
1 I -V- I
0 1
THETA
A01.00Doo.70C
_.2011_---FINAL. FORM
Am .20.X1.114CS .147
-ILI-- 1 11110
THETA
1 2 3
FEARING CORREA:MUM 11134 I
0.a f_
a .7 -0
-4 A0.6 -alox =10.4 -Y0.3-
- 0.2
-3I-
-=t
A! -.7a40m-.345Ca .138
-FINAL FORMAo .7tS
-0o-.723Co. ;147
1 1- -I- -10 1
THETA
Figure 10: Plots Of item response functions based on pretest and intact final form
item parameter estimates for-thirteen problematic Form 3BSA3 items.
4'7
1.0
0.9
-ra.e
00.79
-10.4
Y0.3
0.1
0.01
rF
-3
A -6g4-Do-G476Co ;ON
FORMA .612Bo-1.623Co .147
111 1 I
aTHETA
THETA
0.9
0.1
'Ae .692Bo .446Cr .303
AP .7406o,-.160Co .147
mei
0.0 L l I I 1 1 1 1 1 L 1 I
-3 2 -1 0 1 2 3
I.0
0.0
THETA
£0. -00.70-A0.6
YO.SL_INALser:Ivocfrets
A. .S49
Xr0 4Ar.4CE
-.
-0.2 -
0.1 er..-Co .418347
0.0 1 A L I I I I I 1 I ' : t-5 -2 -1 0 i Z 3
1.0
r0CR0"A 0.111
10.6
X0.4T
Ya.31-
a.2
0.1
0.0
ANTONYM 1Te1
1
-
-
1
T 1 I 1 I 1
1-
AP .4421006-2.t73Co .121-
.....;-F274AC FORMAP .6410A-2.1191Co .147
1=1,
ME.
-3 -t -1 0THETA
1 2 3
THETA
Figure 10: Plots of item response functions based on pretest and intact final form
item parameter estimates for thirteen problematic Form 3BSA3 items.
-42-
intact final form parameter estimates for the thirteen problematic 3BSA3
items, identified by item type. For the reading comprehension items
(numbered 1-11), items numbered 1-4 are all based on the same reading
passage, as mentioned earlier, and items.numbered 5-7 are based an the
other passage discussed. Reading comPrehensicn item nuMbet 11 is the
item in which the word was deleted from the correct response betWeed
When the item was given in pretest and in final form; The remaining two
problematic items are presented after the reading comprehension items;
one of the items is an analogy item and the other is an antonym item.
Upon inspection Of the four problematic reading comprehension items
identified in Form 3ASA3, it was determined that, exactly like the
situation for Form 3BSA3, the items were located in pretests where the
passage they were linked to was located last in the pretest section.
All four items were also more difficult when they appeared in the
pretest section than in the final form.
On the basis, of the above data, it may be hypothesized that either
something approaching a "fatigue factot" is being exhibited in the
responses of candidates to reading comprehension item! based on passages
located at the end of pretest sections or, because of lack of time,
random responses are being supplied by these candidatea to certain of
the questions based on this last passage. In aither case, the items are
more difficult in the pretest than they are in the final form; where due
to passage location, a fatigue factor or the supplying of random
responses is not occurring. The data on the reading comprehenaion items
from both 3BSA3 and 3ASA3 are consistent with this statement.
4 9
-43-
If the above is happening to pretest reading comprehension items
based on passages located at the end of pretest sections, one might be
concerned about whether there ire large discrepancies in parameter
estimates between pretest and final form for reading comprehension items
in the intact final form based on passages at the end ofthe SAT-verbal
45 item and 40 item sections. The 45 item SAT-verbal sections do not
en&with reading comprehension items, but the 40 item sections do. For
3ASA3, the passage upon which the last set of reading COthprehenSiOh
items (items 36=40) were baSed Wa: also located in the final position in
the pretest::: TWO of the five items demonstrated discrepancies large
enough -to be included in the overall set of twelve items discussed
earlier. For 3BSk3, the passage upon which the last set of reading
\ comprehension items (items 36=40) were based was not located at the endN
of the pretest section. It was, however, the only reading comprehension
pa60age in the pretest, and one of these last five items (36-40) in
3BSA3did exhibit large discrepancies in pretest-final form parameter
estimates. Thus it would appear that While the outcome in terms of
parameter estimate discrepancies for reading comprehension items located
at the end of SAT-verbal sections is not as clear cut as for comparable
pretest reading comprehension items, there is still cause for concern.
The effect on, equating of having, in particular, reading
comprehension preteSt ittt difficulties estimated to be lower than they
are when estimated onAntact final form data is predictable', and
demonstrated in the Forth 3BSk3 pre-equating results.; If the same items
are more difficult ih the first "test" (made up of pretest items) than
in the second (made up of the items in the intact final form), then the
-44-
same raw score on both "tests" should result in a higher scaled score on
the first "test" than the second. This appears-to be exactly what is
happening with the 3BSA3 pre-equating results. For 3ASA3, on the other
hand, there is more of a balancing effect of the discrepancies between
pretest and intact final form parameter estimates and the result is that
the pre-equating and the intact final form comparison IRT equatings more
closely coincide. One might still be Concerned, however, about the fact
that any discrepancies at all showed up between the pretest and intact
final form parameter estimates, particularly for 3ASA3, where the
problem with final passage reading comprehension items was minimal.
,--study by Cook, Eignor, and Petersen (1982), examining the stability over
time of intact final form SAT-verbal (and other testing data)parameter
estimates, can be used to address this issue. The magnitudes of the
discrepancies found in the Cook, et al, (1982) study, based on the same
intact final form items given on two occasions, were of the magnitudes
of the discrepancies found in thit study. Hence; the lack of parameter
invariance demonstrated by certain of the items in this study may not be
such. a . serious cause f5r concern; this will be discussed further in the
conclusions'sectiOn.
Additional-F54ating,_ResuIts
In Concluding tliis section, one other noteworthy result should be
mentioned; this folloWs from a comparison of the intact form calibration
system IRT equating to the intact form direct link and intact form
concurrent IRT equatings. It would appear, based upon the equatings
done, that the equating is.more than adequate-when done thrOugh the
indirect linking of the new and old forms used for equating via the
'94
overall calibration system: That isi.even though in this situation the
forms to be equated are linked indirectly through intervening LOGIST
runs; and parameterestimates placed on a scale defined by the ability
distribution of the sample taking a form not*used in the equatings; the
quality of the eqUatinga are comparable to those resulting from either
linking the new and old forms directly (direct link equating) or
calibrating all data concurrently so that new and old form parameter
estimates are automatically on the same scale. This has important'
implications for the future construction of a large pool of calibrated
items and test forms to be used in intact final form IRT equating of
SAT-verbal.
Conclu sions
The results of pie-equating the two forals of SAT-verbal reported on
in this study; when compared to the intact final form IRT equating;
varied considerably; ranging from acceptable for Poi= 3ASA3 to only
marginally acceptable for Fort 3BSA3. Reasons for the infetiority
the Form 3BSA3 pre=eqUating results; having to do with the location of
reading passages and reading comprehension items at the end Of pretest
sections; have been advanced and discussed. The overall results of the
pre-equating of the two forma of SAT-verbal were deemed sufficiently
promising that an investigation .of pre-equating two forms of;
4
SAT=MSthematical is presently being undertaken. Data ftbitithe second
study, when considered-with the data froM this study; should supply
information necessary for the consideration of IRT pre-equating of the
52
SAT on a regular Operational basis. The results reported here also have
clear implications for changeS in test deVelopment practice; having to
do with the positioning of pretest and final form reading comprehension
items and making "minor" changes in the wording of items between. pretest
and final form; if pre-equating the SAT-verbal section is to become a
reality-
On a more general level, the results of this study indicate that the
IRT item parameter estimates generated for certain items When 1Jyen in
pretest form did not remain invariant when given in intact final form.
Rased on the results of recent studies; partidularly-Cooki Signor, and.
Petersen (1982), parameter invariance for all items in a test forth Would
not be expected to be the case. The real issue is whether the lack of
parameter invariance is serious enough to cause one to dismiss the use
of item response theory*for the particular application of concern. The
application in this study is pre-equating and the results, particularly
as they pertain to SAT --verbal Form 3ASA3, suggest that even though there'
is some lack of parameter invariance; IRT pre- equating may be A.
reasonable equating method for the SAT - verbal section.
v"
-47.-
References
AngoffW. H. Test reliability and effective test length. PSychometrika,
'1953, 1-8-, 1-14;
Angoff; W. H. Scales; norms and equivalent scores._ In R. L. Thorndike
(Ed.); Educational Measurement (2nd dd.). Washington, DC: American
Council on Education, 1971.
Bejar; I. I., and Wingerdky,_M, S. A study of pre-equating based on item
response theOtY. Applied Psychologital_Measurement; 1982; 6, 309-325
,
Cook, L. L., and Eignor, D. R. Practical considerations regarding the use
Of item response theory to equate tests. In R. K.,_HambletOn (Ed.),
Applications of item response theory. Vancouver, B.C.: Edutational
Research Institute of British Columbia, 1983.
Cook; L. L.; and Petersen, N. S. Item response theory equating for the
SAT: New designs and directions, Proposal submitted to College
Board -- ETS Joint Staff Research and Development Committee; 1982.
Cook; L. L., Eignor, D.. R., and Petersen; N. S. A study of the teMporal
stability of IRT item parameter estimates. A paper presented at the
Annual meeting of AERA; New York; 1982.
Kingston, N. M.; and Dorans, N. J. The feasibility of using item response
theory as a psychometric model for the GRE Aptitude Test. RR-82-I2.
Princeton, NJ: Educational Tegting Service; 1982:
Lord, F. M. Estimation Of latent ability and item parameters when there
are omitted responses. P- sychometrika; 1977; 14, 117 -138.
Lprd; F,_M. Applations of_ttem response theory to practical testing
problems. HIlisdale, NJ: ErIbaum. 1980.
Petersen; 1% S;; Cook, L. L., and StoCking, M. L. IRT versus conventional
equating methods: A comparative study of scale stability. Journal
of Educational Statistics, in press.
.Stocking, M. L.i and Lord, F. M. Developing a common metric in item
response theOry. RR.=-82-25-ONR; Princeton; NJ: Educational Testing
SerViCe, 1982.
WingerSky, M. S. LOGIST: A program for computing maximum likelihoodprocedures for logistic test models. In R. K. Hambleton (Ed.);
Applirrittons of item response theory. Vancouver; B.C.: Educational
'-Research Institute of BritiSh COluMbid, 1983.
Wingersky, M. S.,- Barton, M. A.; and Lord; F. M. LOGIST V user's guide.
PrinOetOn, N.J.: Educational Testing Service, 1982.
54