vert&hor equating 111024

Vertical and horizontal test equating in educational research

Eveline Gebhardt & Wolfram Schulz

Method for estimating

• change over time in student abilities

• growth between year levels

A method based on CLASSICAL TEST THEORY

Classical test theory

• Student performance: % correct on set of items

– Compare students that respond to identical set of items

• Item difficulty: % of students responding correctly

– Compare items that were administered to the same group of students
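As a minimal sketch of these two statistics, assuming a small made-up response matrix (rows are students, columns are items, 1 = correct; the data and function names are illustrative only):

```python
# Hypothetical scored responses: rows = students, columns = items (1 = correct).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]

def percent_correct_by_student(data):
    """Student performance: % correct on the set of items."""
    return [100.0 * sum(row) / len(row) for row in data]

def percent_correct_by_item(data):
    """Item difficulty: % of students responding correctly."""
    n_students = len(data)
    return [100.0 * sum(col) / n_students for col in zip(*data)]

student_scores = percent_correct_by_student(responses)
item_difficulties = percent_correct_by_item(responses)
print(student_scores)
print(item_difficulties)
```

Both statistics are only comparable when every student answers the same items and every item is answered by the same students.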

Constraints

• Limited number of items to measure a domain

• All items need to be kept secure

Problematic

• Comparing students from different age groups (ceiling or floor effect)

• Comparing student abilities over time when not all items can be kept secure

• Item difficulty and student performance are confounded

A method based on ITEM RESPONSE THEORY

Rasch model

Common scale for item difficulties and student abilities

– If ability = difficulty, the student has 50% chance to respond correctly to that item

– If ability > difficulty, most likely to respond correctly

– If ability < difficulty, most likely to respond incorrectly
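The three cases above follow from the standard Rasch response function, in which the probability of a correct response depends only on ability minus difficulty (both in logits). A minimal sketch, with an illustrative function name:

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(rasch_probability(0.5, 0.5))   # ability = difficulty: 0.5
print(rasch_probability(1.5, 0.5))   # ability > difficulty: above 0.5
print(rasch_probability(-0.5, 0.5))  # ability < difficulty: below 0.5
```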

Example scale – Year 6

[Figure: Year 6 students (x) and items 1 to 10 located on a common scale from -3 to 3]
Example scale – Year 10

[Figure: Year 10 students (x) and items located on a common scale from -3 to 3]
Example scale – Combined

[Figure: Year 6 and Year 10 students (x) and items 1 to 15 located on one common scale from -3 to 3]
Vertical and horizontal equating

[Figure: Year 6 and Year 10 assessed in 2011 and 2014; vertical equating links the year levels, horizontal equating links the assessment years]

Three methods: COMMON ITEM EQUATING

Several methods

• Average item difficulty of set of link items needs to be equal in both tests

• Three common methods:

– Shift method (trends)

– Joint scaling (booklets)

– Anchoring item difficulties

Method 1: SHIFT METHOD

Shift method

• Test 1 and test 2 are scaled separately

• Average difficulty of items B in test 1 (MN1) and test 2 (MN2) is computed

• Difference between averages (d = MN1 – MN2) is computed

• Difference is added to the student abilities of test 2 (θ2* = θ2 + d)

          Items A   Items B   Items C
Test 1       X         X
Test 2                 X         X

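[Figure: separately scaled Test 1 and Test 2, each centred on 0, with the link-item averages MN1 and MN2 and the difference d between them]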

Set   Item   Difficulty T1   Difficulty T2
A      1         -1.1
A      2          1.6
A      3         -2.6
A      4          0.9
A      5         -1.8
B      6          0.8            -1.2
B      7          1.7            -0.3
B      8          0.9            -1.1
B      9         -0.2            -2.2
B     10         -0.9            -2.9
C     11                          2.1
C     12                          1.5
C     13                          0.4
C     14                          2.4
C     15                          1.2

AVG all           0.0             0.0
AVG link          0.5            -1.5
Shift             2.0
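The shift computation can be retraced with the link-item difficulties (items 6 to 10) from the table above. This is a sketch of the arithmetic only; the student ability `theta2` is a hypothetical value, and the function names are illustrative.

```python
# Separately scaled difficulties of the link items (items B, numbers 6-10).
link_t1 = {6: 0.8, 7: 1.7, 8: 0.9, 9: -0.2, 10: -0.9}
link_t2 = {6: -1.2, 7: -0.3, 8: -1.1, 9: -2.2, 10: -2.9}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

mn1 = mean(link_t1.values())  # average link difficulty in test 1
mn2 = mean(link_t2.values())  # average link difficulty in test 2
d = mn1 - mn2                 # equating shift

def shift_to_test1_scale(theta2, shift):
    """Place a test 2 ability on the test 1 scale: theta2* = theta2 + d."""
    return theta2 + shift

theta2 = 0.3  # hypothetical ability of a test 2 student
print(round(mn1, 2), round(mn2, 2), round(d, 2))
print(round(shift_to_test1_scale(theta2, d), 2))
```

The unrounded link averages are 0.46 and -1.54, which the table reports as 0.5 and -1.5; the shift is 2.0 either way.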

Method 2: JOINT SCALING

Joint scaling

• Data of test 1 and 2 are joined in one data set

• Test 1 and 2 are scaled together

• Difficulties of items B are estimated only once

• Difficulties of items B are identical for test 1 and 2

• Tests are on the same scale

• Also called concurrent equating

Joint scaling - Data file

Item sets: A = i1 to i5, B = i6 to i10, C = i11 to i15 (n = item not administered)

Std  Year   i1 i2 i3 i4 i5   i6 i7 i8 i9 i10   i11 i12 i13 i14 i15
  1     6    0  1  0  0  1    0  0  1  1   1     n   n   n   n   n
  2     6    0  0  1  1  1    0  1  0  0   0     n   n   n   n   n
  3     6    1  0  1  1  1    0  1  1  0   1     n   n   n   n   n
  4     6    0  0  1  1  0    1  1  0  0   1     n   n   n   n   n
  5     6    1  1  0  1  1    1  1  1  1   1     n   n   n   n   n
  6    10    n  n  n  n  n    0  0  0  0   0     0   1   0   0   0
  7    10    n  n  n  n  n    0  1  0  1   0     0   0   0   0   0
  8    10    n  n  n  n  n    0  1  1  0   1     1   1   1   1   1
  9    10    n  n  n  n  n    1  1  0  0   1     1   1   0   1   1
 10    10    n  n  n  n  n    1  1  1  1   1     1   1   0   1   1
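A joint file of this shape can be assembled from the two separate test files by spreading each record over all 15 item columns. The sketch below uses only the first Year 6 and the first Year 10 student from the data file above; the record layout and the function name are assumptions made for illustration.

```python
# Item sets as in the design: A = i1-i5, B = i6-i10 (link items), C = i11-i15.
ITEMS_A = [f"i{k}" for k in range(1, 6)]
ITEMS_B = [f"i{k}" for k in range(6, 11)]
ITEMS_C = [f"i{k}" for k in range(11, 16)]
ALL_ITEMS = ITEMS_A + ITEMS_B + ITEMS_C

# One student per test; scores are in the order the items were administered.
test1 = [{"std": 1, "year": 6, "scores": [0, 1, 0, 0, 1, 0, 0, 1, 1, 1]}]
test2 = [{"std": 6, "year": 10, "scores": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]}]

def to_joint_rows(records, administered):
    """Spread each record over all item columns; 'n' marks items not administered."""
    rows = []
    for rec in records:
        answered = dict(zip(administered, rec["scores"]))
        row = [rec["std"], rec["year"]]
        row += [answered.get(item, "n") for item in ALL_ITEMS]
        rows.append(row)
    return rows

joint = to_joint_rows(test1, ITEMS_A + ITEMS_B) + to_joint_rows(test2, ITEMS_B + ITEMS_C)
for row in joint:
    print(" ".join(str(v) for v in row))
```

Only the columns of items B contain responses from both year levels, which is what ties the two tests to one scale when the file is scaled in a single run.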

Method 3: ANCHORING

Anchoring

• Test 1 (items A and B) is scaled

• Difficulties of items B are copied

• Test 2 (items B and C) is scaled, anchoring items B to the same values as test 1

Set   Item   Difficulty T1   Difficulty T2
A      1         -1.1
A      2          1.6
A      3         -2.6
A      4          0.9
A      5         -1.8
B      6          0.8             0.8*
B      7          1.7             1.7*
B      8          0.9             0.9*
B      9         -0.2            -0.2*
B     10         -0.9            -0.9*
C     11                          4.1
C     12                          3.5
C     13                          2.4
C     14                          4.4
C     15                          3.2

AVG all           0.0             2.0
AVG link          0.5             0.5

* anchored to the test 1 value
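As a quick consistency check between methods 1 and 3, the difficulties of items C under anchoring should equal their separately scaled test 2 values plus the shift d = 2.0. The sketch below only retraces that arithmetic with the values from the two tables; it does not perform any scaling.

```python
# Items C scaled separately in test 2 (shift-method table) and scaled with
# items B anchored to their test 1 values (anchoring table).
c_separate = {11: 2.1, 12: 1.5, 13: 0.4, 14: 2.4, 15: 1.2}
c_anchored = {11: 4.1, 12: 3.5, 13: 2.4, 14: 4.4, 15: 3.2}
shift = 2.0  # d = MN1 - MN2 from the shift method

# Difference per item, rounded to the one decimal reported in the tables.
differences = {
    item: round(c_anchored[item] - c_separate[item], 1) for item in c_separate
}
print(differences)
print(all(diff == shift for diff in differences.values()))
```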

Before equating tests: EVALUATION OF LINK ITEMS

Link item invariance

• Relative item difficulty

• Discrimination

• Differential item functioning (DIF)

Evaluation of RELATIVE ITEM DIFFICULTY

Relative item difficulty

[Figure: plot of relative item difficulties; axis tick labels run from -5.0 to 5.0 and from -3.0 to 3.0]

Evaluation of ITEM DISCRIMINATION

Item discrimination

• Discriminate between able and less able

• Some items discriminate more than others

• Average abilities of students:

            Item 1   Item 2
Answer A     1.00     0.62
Answer B    -0.22     0.61
Answer C    -0.15     0.81
Answer D    -0.02     0.53

Slopes

• Level of discrimination is reflected by the slope of the item characteristic curve

Assumption

• Assumption of the Rasch model: slopes are equal across items

• However, in practice slopes always vary a little within a test

• The expected slope is the average slope of all items in a test

• Steeper average slopes reflect a larger spread in abilities in the population

Link items & Discrimination

• The average discrimination of link items can vary between tests

• Individual link items can vary in discrimination between tests

Experiment 1

• Same test with 10 items is used in Year 6 and Year 10

• Spread in abilities is larger in Year 10 than in Year 6

• Items discriminate more in Year 10 than in Year 6

Results experiment 1

           Average discrimination    Population variance    True variance
           Separate    Joint         Separate    Joint
Year 6       0.25      0.34            0.76      1.07           0.80
Year 10      0.41      0.34            1.89      1.49           2.00

Evaluation of DIFFERENTIAL ITEM FUNCTIONING

Differential Item Functioning

• Assumption of Rasch model: all students with the same ability have the same probability to respond correctly to an item, independent of the subgroup a student belongs to

• The violation of this assumption is called Differential Item Functioning (DIF)

Example: sex DIF

Link items & DIF

• Set of link items needs to have the same average DIF as the non-link items in both tests

• The following experiment shows why

Experiment 2

• Item pool of 105 items for assessment at time 1

• Selection of 55 trend items all favouring boys

• Scale two sets of items on the same set of student responses

Results experiment 2

[Figure: Abilities by subgroup (M, F), estimated with all items and with the link items favouring boys; plotted values 0.44, 0.50, 0.60, 0.44]

Conclusion experiment 2

• Selecting link items that on average favour a subgroup of students changes the gap in performance between subgroups

• The average DIF should be as close to 0 as possible
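The effect of item selection on average DIF can be illustrated with made-up numbers. The DIF values and the sign convention below are assumptions for illustration only (positive = item relatively easier for boys); they are not taken from the experiment.

```python
# Hypothetical gender DIF values in logits for a small item pool
# (assumed convention: positive = relatively easier for boys).
pool_dif = [0.30, -0.25, 0.10, -0.15, 0.20, -0.20, 0.05, -0.05]

def average_dif(values):
    """Average DIF of a set of items."""
    return sum(values) / len(values)

# Link set made up only of the items that favour boys.
boys_items = [v for v in pool_dif if v > 0]

print(round(average_dif(pool_dif), 3))
print(round(average_dif(boys_items), 3))
```

The full pool is balanced, while the boys-only link set has a clearly positive average DIF, so a scale linked through it would shift the estimated gap between the subgroups.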

Evaluation of OTHER ITEM CHARACTERISTICS

Link items & Sub-domains

• Equating shift should be based on a set of items that is representative of the whole test

• Equating shifts can be slightly different for different sub-domains

• Best practice to have equal proportions of sub-domains in trend items and in the total item pool

Link items & Item types

• Equating shifts can be slightly different for multiple choice items than for open ended items

• Best practice to have equal proportions of item types in trend items and in the total item pool

EQUATING EXAMPLE: Horizontal and vertical equating in NAP Civics and Citizenship

Equating in practice

• NAP-CC survey

• Year 6 and Year 10

• Assessment every 3 years since 2004

Equating overview

45 horizontal link items in Year 10

[Figure: Relative difficulties link items]


2007 2010

45 link items 0.43 0.45

[Figure: Plot discrimination]

Average gender DIF

2007 2010

45 link items -0.027 -0.014

Selection of link items

• 32 of 45 items were selected to use as link items based on:

– change in relative difficulty

– change in discrimination

– average gender DIF

[Figure: Relative difficulties 45 link items]

[Figure: Relative difficulties 32 link items]


2007 2010

45 link items 0.43 0.45

32 link items 0.41 0.42

[Figure: Discrimination 45 items]

[Figure: Discrimination 32 items]

Average gender DIF

2007 2010

45 link items -0.027 -0.014

32 link items -0.035 -0.023

Horizontal equating Year 6

• The process for Year 6 was identical

• 24 out of 27 link items could be used for equating from 2010 to 2007

Equating shifts

                           Year 6   Year 10
Average difficulty 2010     0.384     0.618
Average difficulty 2007    -0.089    -0.159
Difference (=shift)        -0.473    -0.777
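The shifts in this table follow the shift method described earlier, with 2007 in the role of test 1 and 2010 in the role of test 2 (d = MN1 – MN2). A short sketch that retraces the arithmetic; the function name is illustrative.

```python
# Average link-item difficulties from the table above.
avg_difficulty = {
    "Year 6": {"2010": 0.384, "2007": -0.089},
    "Year 10": {"2010": 0.618, "2007": -0.159},
}

def equating_shift(avg_2007, avg_2010):
    """Shift added to 2010 results to place them on the 2007 scale."""
    return avg_2007 - avg_2010

for level, avg in avg_difficulty.items():
    print(level, round(equating_shift(avg["2007"], avg["2010"]), 3))
```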

Equating overview

Related to common item equating is the EQUATING ERROR

Uncertainty in the link

• The equating shift depends on the change in relative difficulty of each item

• Different sets of items will lead to slightly different shifts

• An uncertainty is associated with equating two tests due to sampling of items

Equating error

• Expressed as a standard error, just like the student sampling error

• Take into account when estimating change over time

• The equating error is added to the standard error of the difference when comparing across time
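The slides do not give formulas for these steps. The sketch below rests on two assumptions that follow a common convention in large-scale assessments: the equating error is taken as the standard error of the mean item-level shift (link items treated as a random sample of items), and it is combined with the two sampling errors in quadrature. All numbers are hypothetical.

```python
import math
import statistics

def equating_error(item_shifts):
    """Assumed form: standard error of the mean shift across link items."""
    return statistics.stdev(item_shifts) / math.sqrt(len(item_shifts))

def se_of_difference(se_time1, se_time2, eq_error):
    """Assumed form: sampling errors and equating error combined in quadrature."""
    return math.sqrt(se_time1 ** 2 + se_time2 ** 2 + eq_error ** 2)

# Hypothetical per-item differences in difficulty between two cycles (logits).
item_shifts = [-0.60, -0.35, -0.50, -0.40, -0.55, -0.45]

eq_err = equating_error(item_shifts)
print(round(eq_err, 3))
print(round(se_of_difference(0.03, 0.04, eq_err), 3))
```

Under these assumptions the standard error of a change over time is always at least as large as the equating error itself, however large the student samples are.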
