vert&hor equating 111024
Vertical and horizontal equating and measurement invariance
Vertical and horizontal test equating in educational research
Eveline Gebhardt & Wolfram Schulz
Method for estimating
• change over time in student abilities
• growth between year levels
A method based on CLASSICAL TEST THEORY
Classical test theory
• Student performance: % correct on set of items
– Compare students that respond to identical set of items
• Item difficulty: % of students responding correctly
– Compare items that were administered to the same group of students
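The two classical statistics can be sketched in a few lines. The response matrix is made up for illustration (1 = correct, 0 = incorrect), and the function names are not from the slides.

```python
# Hypothetical scored responses: rows are students, columns are items.
responses = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
]

def percent_correct_per_student(matrix):
    """Student performance: % of items each student answered correctly."""
    return [100.0 * sum(row) / len(row) for row in matrix]

def percent_correct_per_item(matrix):
    """Item difficulty: % of students who answered each item correctly."""
    n_students = len(matrix)
    return [100.0 * sum(col) / n_students for col in zip(*matrix)]

student_scores = percent_correct_per_student(responses)
item_percent_correct = percent_correct_per_item(responses)
```

Both statistics are only comparable when the set of items, or the group of students, is the same.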
Constraints
• Limited number of items to measure a domain
• All items need to be kept secure
Problematic
• Comparing students from different age groups (ceiling or floor effect)
• Comparing student abilities over time when not all items can be kept secure
• Item difficulty and student performance are confounded
A method based on ITEM RESPONSE THEORY
Rasch model
Common scale for item difficulties and student abilities
– If ability = difficulty, the student has a 50% chance to respond correctly to that item
– If ability > difficulty, most likely to respond correctly
– If ability < difficulty, most likely to respond incorrectly
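The three cases follow from the Rasch response function, P(correct) = exp(ability - difficulty) / (1 + exp(ability - difficulty)). A minimal sketch, with example values chosen only for illustration:

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Illustrative values in logits; the item difficulty is 0.5 in all three cases.
p_equal = rasch_probability(0.5, 0.5)    # ability = difficulty
p_above = rasch_probability(1.5, 0.5)    # ability > difficulty
p_below = rasch_probability(-0.5, 0.5)   # ability < difficulty
```

Only the difference between ability and difficulty enters the formula, which is why students and items can share one scale.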
Example scale – Year 6
[Figure: Year 6 students and items 1 to 10 located on a common scale running from -3 to 3.]
Example scale – Year 10
[Figure: Year 10 students and items located on a common scale running from -3 to 3.]
Example scale – Combined
[Figure: Year 6 and Year 10 students and items 1 to 15 located on one combined scale running from -3 to 3.]
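Vertical and horizontal equating
[Figure: Year 6 and Year 10 assessments in 2011 and 2014, with the vertical direction across year levels and the horizontal direction across assessment years.]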
COMMON ITEM EQUATING: Three methods
Several methods
• Average item difficulty of set of link items needs to be equal in both tests
• Three common methods:
– Shift method (trends)
– Joint scaling (booklets)
– Anchoring item difficulties
Method 1: SHIFT METHOD
Shift method
• Test 1 and test 2 are scaled separately
• Average difficulty of items B in test 1 (MN1) and test 2 (MN2) is computed
• Difference between averages (d = MN1 - MN2) is computed
• Difference is added to the student abilities of test 2 (θ2* = θ2 + d)
         Items A   Items B   Items C
Test 1      X         X
Test 2                X         X
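[Figure: the test 1 and test 2 scales, each with its own zero point, showing the link item averages MN1 and MN2 and the difference d between them.]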
Set   Item   Difficulty T1   Difficulty T2
A       1        -1.1
        2         1.6
        3        -2.6
        4         0.9
        5        -1.8
B       6         0.8            -1.2
        7         1.7            -0.3
        8         0.9            -1.1
        9        -0.2            -2.2
       10        -0.9            -2.9
C      11                         2.1
       12                         1.5
       13                         0.4
       14                         2.4
       15                         1.2
AVG all           0.0             0.0
AVG link          0.5            -1.5     2.0 = shift
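The shift computation can be reproduced from the link item (set B) difficulties in the table. This is a minimal sketch: the function names and the three test 2 abilities are made up for illustration.

```python
# Link item (set B) difficulties from the two separate scalings.
link_t1 = {6: 0.8, 7: 1.7, 8: 0.9, 9: -0.2, 10: -0.9}
link_t2 = {6: -1.2, 7: -0.3, 8: -1.1, 9: -2.2, 10: -2.9}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def equating_shift(difficulties_t1, difficulties_t2):
    """d = MN1 - MN2, using only items present in both tests."""
    common = sorted(set(difficulties_t1) & set(difficulties_t2))
    mn1 = mean(difficulties_t1[i] for i in common)
    mn2 = mean(difficulties_t2[i] for i in common)
    return mn1 - mn2

d = equating_shift(link_t1, link_t2)

# Hypothetical test 2 abilities (logits), moved onto the test 1 scale.
theta_2 = [-0.5, 0.0, 1.2]
theta_2_equated = [theta + d for theta in theta_2]
```

The unrounded link averages are 0.46 and -1.54, which the table reports as 0.5 and -1.5; their difference is the shift of 2.0.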
Method 2: JOINT SCALING
Joint scaling
• Data of test 1 and 2 are joined in one data set
• Test 1 and 2 are scaled together
• Difficulties of items B are estimated only once
• Difficulties of items B are identical for test 1 and 2
• Tests are on the same scale
• Also called concurrent equating
Joint scaling - Data file
Set       A   A   A   A   A   B   B   B   B   B   C   C   C   C   C
Std Year  i1  i2  i3  i4  i5  i6  i7  i8  i9  i10 i11 i12 i13 i14 i15
1   6     0   1   0   0   1   0   0   1   1   1   n   n   n   n   n
2   6     0   0   1   1   1   0   1   0   0   0   n   n   n   n   n
3   6     1   0   1   1   1   0   1   1   0   1   n   n   n   n   n
4   6     0   0   1   1   0   1   1   0   0   1   n   n   n   n   n
5   6     1   1   0   1   1   1   1   1   1   1   n   n   n   n   n
6   10    n   n   n   n   n   0   0   0   0   0   0   1   0   0   0
7   10    n   n   n   n   n   0   1   0   1   0   0   0   0   0   0
8   10    n   n   n   n   n   0   1   1   0   1   1   1   1   1   1
9   10    n   n   n   n   n   1   1   0   0   1   1   1   0   1   1
10  10    n   n   n   n   n   1   1   1   1   1   1   1   0   1   1
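A joint data file of this shape can be assembled from two separately recorded tests. The sketch assumes that "n" marks an item that was not administered to the student, which the slides do not state explicitly; the function and constant names are illustrative.

```python
ITEMS_TEST1 = [f"i{k}" for k in range(1, 11)]   # sets A and B
ITEMS_TEST2 = [f"i{k}" for k in range(6, 16)]   # sets B and C
ALL_ITEMS = [f"i{k}" for k in range(1, 16)]
NOT_ADMINISTERED = "n"  # assumed meaning of "n" in the data file

def to_joint_rows(rows, items, year):
    """Spread each student's responses over the full item list."""
    joint = []
    for student_id, scores in rows:
        by_item = dict(zip(items, scores))
        joint.append([student_id, year] +
                     [by_item.get(item, NOT_ADMINISTERED) for item in ALL_ITEMS])
    return joint

# Students 1 and 6 from the data file.
year6 = [(1, [0, 1, 0, 0, 1, 0, 0, 1, 1, 1])]
year10 = [(6, [0, 0, 0, 0, 0, 0, 1, 0, 0, 0])]

joint_data = to_joint_rows(year6, ITEMS_TEST1, 6) + to_joint_rows(year10, ITEMS_TEST2, 10)
```

Because the set B columns are shared by both year levels, a scaling of this file estimates one difficulty per link item.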
Method 3: ANCHORING
Anchoring
• Test 1 (items A and B) is scaled
• Difficulties of items B are copied
• Test 2 (items B and C) is scaled, anchoring items B to the same values as test 1
Set   Item   Difficulty T1   Difficulty T2
A       1        -1.1
        2         1.6
        3        -2.6
        4         0.9
        5        -1.8
B       6         0.8             0.8*
        7         1.7             1.7*
        8         0.9             0.9*
        9        -0.2            -0.2*
       10        -0.9            -0.9*
C      11                         4.1
       12                         3.5
       13                         2.4
       14                         4.4
       15                         3.2
AVG all           0.0             2.5
AVG link          0.5             0.5
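In this worked example the link items keep the same relative difficulties in both scalings, so the anchored test 2 difficulties should equal the separately scaled test 2 difficulties plus the shift of 2.0 from the shift method. A small check of that arithmetic, using only the item values from the two tables:

```python
# Test 2 difficulties from the separate scaling (shift method table).
t2_separate = {6: -1.2, 7: -0.3, 8: -1.1, 9: -2.2, 10: -2.9,
               11: 2.1, 12: 1.5, 13: 0.4, 14: 2.4, 15: 1.2}
# Test 2 difficulties from the anchored scaling (anchoring table).
t2_anchored = {6: 0.8, 7: 1.7, 8: 0.9, 9: -0.2, 10: -0.9,
               11: 4.1, 12: 3.5, 13: 2.4, 14: 4.4, 15: 3.2}
shift = 2.0  # equating shift reported for the shift method

# Largest disagreement between "separate + shift" and "anchored".
max_gap = max(abs(t2_separate[i] + shift - t2_anchored[i]) for i in t2_anchored)
methods_agree = max_gap < 1e-9
```

With real data the link items rarely behave identically in both tests, so the two methods would typically give slightly different results.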
Before equating tests: EVALUATION OF LINK ITEMS
Link item invariance
• Relative item difficulty
• Discrimination
• Differential item functioning (DIF)
Evaluation of RELATIVE ITEM DIFFICULTY
Relative item difficulty
[Figure: plot of relative item difficulties.]
Evaluation of ITEM DISCRIMINATION
Item discrimination
• Discriminate between able and less able
• Some items discriminate more than others
• Average abilities of students:
           Item 1   Item 2
Answer A    1.00     0.62
Answer B   -0.22     0.61
Answer C   -0.15     0.81
Answer D   -0.02     0.53
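Average abilities per answer option, as in the table, can be computed from (answer, ability) pairs. The records below are hypothetical, and option A is assumed to be the correct answer; the slides do not say which option is the key.

```python
from collections import defaultdict

def mean_ability_by_answer(records):
    """records: iterable of (answer, ability) pairs for one item."""
    totals = defaultdict(lambda: [0.0, 0])
    for answer, ability in records:
        totals[answer][0] += ability
        totals[answer][1] += 1
    return {answer: total / count for answer, (total, count) in totals.items()}

# Hypothetical data; A is taken to be the correct answer.
records = [("A", 1.2), ("A", 0.8), ("B", -0.4), ("B", 0.0), ("C", -0.1), ("D", 0.1)]
means = mean_ability_by_answer(records)

# Gap between the keyed option and the best distractor.
gap = means["A"] - max(value for answer, value in means.items() if answer != "A")
```

A large gap, as for Item 1 in the table if A is its key, suggests the item separates able from less able students; near-equal averages, as for Item 2, suggest it does not.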
Slopes
• Level of discrimination is reflected by the slope of the item characteristic curve
Assumption
• Assumption of the Rasch model: slopes are equal across items
• However, in practice slopes always vary a little within a test
• The expected slope is the average slope of all items in a test
• Steeper average slopes reflect a larger spread in abilities in the population
Link items & discrimination
• The average discrimination of link items can vary between tests
• Individual link items can vary in discrimination between tests
Experiment - 1
• Same test with 10 items is used in Year 6 and Year 10
• Spread in abilities is larger in Year 10 than in Year 6
• Items discriminate more in Year 10 than in Year 6
Results experiment 1
          Average discrimination   Population variance   True variance
          Separate   Joint         Separate   Joint
Year 6    0.25       0.34          0.76       1.07       0.80
Year 10   0.41       0.34          1.89       1.49       2.00
Evaluation of DIFFERENTIAL ITEM FUNCTIONING
Differential Item Functioning
• Assumption of Rasch model: all students with the same ability have the same probability to respond correctly to an item, independent of the subgroup a student belongs to
• The violation of this assumption is called Differential Item Functioning (DIF)
Example: sex DIF
Link items & DIF
• Set of link items needs to have the same average DIF as the non-link items in both tests
• The following experiment shows why
Experiment 2
• Item pool of 105 items for assessment at time 1
• Selection of 55 trend items all favouring boys
• Scale two sets of items on the same set of student responses
Results experiment 2
[Figure: Abilities by subgroup (M, F), scaled on all items and on the link items favouring boys; values shown: 0.44, 0.50, 0.60, 0.44.]
Conclusion experiment 2
• Selecting link items that on average favour a subgroup of students changes the gap in performance between subgroups
• The average DIF should be as close to 0 as possible
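A candidate link set can be checked for average DIF before it is used. The DIF values below are hypothetical, and the sign convention (difficulty for girls minus difficulty for boys, so positive values favour boys) is an assumption made for this sketch.

```python
def average_dif(dif_by_item, item_ids):
    """Mean DIF over a set of items."""
    return sum(dif_by_item[i] for i in item_ids) / len(item_ids)

# Hypothetical DIF values in logits: difficulty for girls minus difficulty
# for boys, so positive values mark items that favour boys.
dif_by_item = {1: 0.30, 2: -0.25, 3: 0.10, 4: -0.15, 5: 0.20, 6: -0.20}

balanced_links = [1, 2, 3, 4]   # DIF cancels out on average
boys_links = [1, 3, 5]          # every link item favours boys

balanced_avg = average_dif(dif_by_item, balanced_links)
boys_avg = average_dif(dif_by_item, boys_links)
```

The second link set has an average DIF of 0.2 logits in favour of boys, the kind of imbalance that shifted the gender gap in experiment 2.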
Evaluation of OTHER ITEM CHARACTERISTICS
Link items & sub-domains
• Equating shift should be based on a set of items that is representative of the whole test
• Equating shifts can be slightly different for different sub-domains
• Best practice to have equal proportions of sub-domains in trend items and in the total item pool
Link items & item types
• Equating shifts can be slightly different for multiple choice items than for open ended items
• Best practice to have equal proportions of item types in trend items and in the total item pool
EQUATING EXAMPLE: Horizontal and vertical equating in NAP Civics and Citizenship
Equating in practice
• NAP-CC survey
• Year 6 and Year 10
• Assessment every 3 years since 2004
Equating overview
45 horizontal link items in Year 10
[Figure: Relative difficulties of link items.]
Average discrimination
2007 2010
45 link items 0.43 0.45
[Figure: Plot of link item discrimination.]
Average gender DIF
2007 2010
45 link items -0.027 -0.014
Selection of link items
• 32 of 45 items were selected to use as link items based on:
– change in relative difficulty
– change in discrimination
– average gender DIF
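The first criterion, change in relative difficulty, can be sketched as follows. The item values and the 0.5 logit threshold are hypothetical; the slides do not report the cut-off that was used, and the other two criteria are not covered here.

```python
def centred(difficulties):
    """Express difficulties relative to the mean of the item set."""
    mean = sum(difficulties.values()) / len(difficulties)
    return {item: value - mean for item, value in difficulties.items()}

def stable_link_items(diff_a, diff_b, threshold):
    """Items whose relative difficulty changed by no more than threshold."""
    rel_a, rel_b = centred(diff_a), centred(diff_b)
    return [item for item in sorted(rel_a)
            if abs(rel_a[item] - rel_b[item]) <= threshold]

# Hypothetical difficulties (logits) for four candidate link items.
cycle_2007 = {"q1": -1.0, "q2": 0.0, "q3": 0.5, "q4": 0.5}
cycle_2010 = {"q1": -0.6, "q2": 0.4, "q3": 0.9, "q4": 2.5}

kept = stable_link_items(cycle_2007, cycle_2010, threshold=0.5)
```

Item q4 became much harder relative to the other items and is dropped, while the remaining three keep their relative positions.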
[Figure: Relative difficulties, 45 link items.]
[Figure: Relative difficulties, 32 link items.]
Average discrimination
2007 2010
45 link items 0.43 0.45
32 link items 0.41 0.42
[Figure: Discrimination, 45 items.]
[Figure: Discrimination, 32 items.]
Average gender DIF
2007 2010
45 link items -0.027 -0.014
32 link items -0.035 -0.023
Horizontal equating Year 6
• The process for Year 6 was identical
• 24 out of 27 link items could be used for equating from 2010 to 2007
Equating shifts
                          Year 6    Year 10
Average difficulty 2010    0.384     0.618
Average difficulty 2007   -0.089    -0.159
Difference (= shift)      -0.473    -0.777
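The shifts in the table are the 2007 average minus the 2010 average of the link item difficulties, which matches the shift method with 2010 in the role of test 2. A short check of the arithmetic:

```python
# Average link item difficulties from the table.
averages = {
    "Year 6": {"2010": 0.384, "2007": -0.089},
    "Year 10": {"2010": 0.618, "2007": -0.159},
}

# Shift = average difficulty 2007 - average difficulty 2010.
shifts = {year: round(values["2007"] - values["2010"], 3)
          for year, values in averages.items()}
```

Adding the shift to a 2010 ability places it on the 2007 scale.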
Equating overview
Related to common item equating is the EQUATING ERROR
Uncertainty in the link
• The equating shift depends on the change in relative difficulty of each item
• Different sets of items will lead to slightly different shifts
• An uncertainty is associated with equating two tests due to sampling of items
Equating error
• Expressed as a standard error, just like the student sampling error
• Take into account when estimating change over time
• The equating error is added to the standard error of the difference when comparing across time
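One common way to put numbers on these points is sketched below. It rests on two assumptions that the slides do not spell out: the equating error is taken as the standard error of the mean item-level shift across link items, and it is combined with the two sampling errors as a root sum of squares. All values are hypothetical.

```python
import math
import statistics

def equating_error(item_shifts):
    """Standard error of the mean shift across link items (assumed formula)."""
    return statistics.stdev(item_shifts) / math.sqrt(len(item_shifts))

def se_of_difference(se_time1, se_time2, eq_error):
    """Combine two sampling errors and the equating error in quadrature."""
    return math.sqrt(se_time1 ** 2 + se_time2 ** 2 + eq_error ** 2)

# Hypothetical per-item changes in difficulty between two cycles (logits).
item_shifts = [-0.6, -0.4, -0.5, -0.3, -0.7]
eq_err = equating_error(item_shifts)

# Hypothetical sampling standard errors of the two cycle means.
se_diff = se_of_difference(0.03, 0.04, eq_err)
```

Including the equating error always widens the standard error of the difference, so a change over time has to be larger to count as significant.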