TRANSCRIPT

Page 1

Scaling and Equating

Joe Willhoft, Assistant Superintendent of Assessment and Student Information

Yoonsun Lee, Director of Assessment and Psychometrics

Office of Superintendent of Public Instruction

Page 2

Overview

• Scaling
  – Definition
  – Purposes

• Equating
  – Definition
  – Purposes
  – Designs
  – Procedures

• Vertical Scale

Page 3

What is Scaling?

• Scaling is the process of associating numbers with the performance of examinees

• What does a score of 400 mean on the WASL? It is not a raw score but a scaled score.

Page 4

Primary Score Scale

• Many educational tests use one primary score scale for reporting scores

• Raw scores, scaled scores, percentiles

• WASL and WLPT-II use scaled scores

Page 5

Activity

Grade 3 Mathematics Items

Page 6

G3 Math Items

Form     Points Possible   Test Difficulty   Cut Score
Form A   6                 Easy              5
Form B   6                 Difficult         3

Page 7

Why Use a Scaled Score?

• Minimizing misinterpretations
  e.g. “Emmy got 30 points last year and met the standard. I got 31 points this year but did not meet the standard. Why? The cut score last year was 30 points and the cut score this year is 32 points. Did you raise the standard?”

Page 8

Why Use a Scaled Score?

• Facilitate meaningful interpretation
  – Comparison of examinees’ performance on different forms
  – Tracking of trends in group performance over time
  – Comparison of examinees’ performance on different difficulty levels of a test

Page 9

Raw Score and Scaled Score

• Monotonically related (the raw-to-theta mapping is nonlinear; the theta-to-scale step is linear)

• Based on the Item Response Theory ability scale
  – Each observed performance corresponds to an ability value (theta)
  – Scaled score = a + b*(theta)

Page 10

Linear Transformation

Simple linear transformation:

  Scaled Score = a + b*(ability)

Two parameters, a and b, describe that relationship. We obtain some sample data and find the values of a and b that best fit the data to the linear model.

Page 11

WASL

400 = a + b*(theta 1)
375 = a + b*(theta 2)

• Theta 1 and theta 2 are established by the standard setting committees.

• a and b are determined by solving the equations above.
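With two (theta, scaled score) points, a and b are pinned down exactly. A minimal sketch of that solve in Python (an illustration, not OSPI's production code); the numeric values are the Grade 4 Mathematics cut thetas reported later in this deck (page 32):

def solve_scale_constants(ss1, theta1, ss2, theta2):
    # Solve ss = a + b*theta through two (theta, ss) points.
    b = (ss1 - ss2) / (theta1 - theta2)
    a = ss1 - b * theta1
    return a, b

# Grade 4 Math: theta = 0.572 maps to 400, theta = -0.090 maps to 375
a, b = solve_scale_constants(400, 0.572, 375, -0.090)
print(a, b)           # ~378.3988 and ~37.76435, matching the page 32 formula
print(a + b * 1.290)  # ~427.115, the L4 cut derived on page 32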

Page 12

WLPT-II

• Min Scaled Score = 300

• Max Scaled Score = 900

300 = a + b*(theta 1)

900 = a + b*(theta 2)
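The same two-point solve applies here, anchored at the scale endpoints rather than at two cut scores. The endpoint thetas below are placeholders, since the slide does not give the actual WLPT-II values:

# Hypothetical endpoint thetas, for illustration only
theta_min, theta_max = -4.0, 4.0
b = (900 - 300) / (theta_max - theta_min)  # slope
a = 300 - b * theta_min                    # intercept
print(a, b)  # 600.0 and 75.0 for these placeholder thetas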

Page 13

WASL Scaling

• 375 is the cut between level 1 and level 2 for all grade levels and content areas.

• 400 is the cut between level 2 and level 3 for all grade levels and content areas.

• Each grade/content area has a separate scale (WASL).

• All grade levels are on the same scale (WLPT-II) – vertically linked.

Page 14

WASL

[Figure: separate scales for G3–G8 and HS; horizontal lines mark the 375 and 400 cut scores, which are the same across all grades.]

Page 15

WLPT-II (Vertical Scale)

[Figure: a single vertical scale from 300 to 900 spanning grades K–12.]

Page 16

Equating

Page 17
Purpose of Equating

• Large scale testing programs use multiple forms of the same test

• Differences in item and test difficulties across forms must be controlled

• Equating is used to ensure that scale scores are equivalent across tests

Page 18

Requirements of Equating

Four necessary conditions for equating (Lord, 1980):

• Ability - Equated tests must measure the same construct (ability)

• Equity – After transformation, the conditional frequencies for each test are the same

• Population invariance

• Symmetry

Page 19

Ability - Equated Tests Must Measure the Same Construct (Ability)

• Item and test specifications are based on definitions of the abilities to be assessed
  – Item specifications define how the abilities are shown
  – Test specifications ensure representation of all aspects of the construct

• Tests to be equated should measure the same abilities in the same ways

Page 20
Equity

• Scales on the tests to be equated should be strictly parallel after equating

• Frequency distributions should be roughly equivalent after transformation

Page 21

Population Invariance

• The outcome of the transformation must be the same regardless of which group is used as the anchor

• If score Y1 on Y is equated to score X1 on X, the result should be the same as if score X1 is equated to score Y1

• If a score of 10 on 2007 Mathematics is equivalent to a score of 11 on 2006 Mathematics (when 2006 is used as the anchor), then a score of 11 on 2006 Mathematics should be equivalent to a score of 10 on 2007 Mathematics (when 2007 is used as the anchor)

Page 22

Symmetry

• The function used to transform the Y scale to the X scale is the inverse of the function used to transform the X scale to the Y scale

• If the 2007 Mathematics scale is equated to 2006 Mathematics scale, the function used to do the equating should be the inverse of the function used when the 2006 Mathematics scale is equated to the 2007 Mathematics scale

Page 23
Equating Design Used in WASL

• Common-Item Nonequivalent Groups Design (Kolen & Brennan, 1995)

1. A set of items in common (anchor items)

2. Different groups of examinees (in different years)

Page 24
Equating Method

• Item Response Theory Equating uses a transformation from one scale to the other

1. to make score scales comparable

2. to make item parameters comparable

Page 25

Equating of WASL

• The items on a WASL test differ from year-to-year (within grade and content area)

• Some items on the WASL have appeared in earlier forms of the test, and item calibrations (“b” difficulty/step values) were established. These are called “Anchor Items”.

• Each year’s WASL is equated to the previous year’s scale using these anchor items.

Page 26

Equating Procedure

1. Identify anchor item difficulties from the bank.

2. Calibrate all items on the current test form without fixing anchor item difficulties.

3. Calculate the mean of the anchor items using bank difficulties.

4. Calculate the mean of the anchor items using calibrated difficulties from the current test form.

5. Add a constant to the current test difficulties so the mean equals the mean from the bank values.

Page 27

Equating Procedure

6. For each anchor item, subtract the current difficulty (after adding the constant) from the bank difficulty.

7. If the largest absolute difference exceeds 0.3, drop that item from consideration as an anchor item.

8. Repeat steps 3-7 using the remaining anchor items until no difference exceeds 0.3.
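A minimal sketch of steps 3-8 in Python (an illustration, not OSPI's production code). It starts from the bank difficulties and the freely calibrated current difficulties (steps 1-2), using the anchor items from the example on the next two pages:

anchors = {  # item: (bank difficulty, current calibrated difficulty)
    1: (-1.484, -1.451),  8: (-0.258, -0.542), 10: (0.560, 0.530),
    17: (0.305, 0.803),  22: (-0.100, -0.181), 24: (-1.254, -1.191),
    27: (0.446, 0.446),  28: (-0.055, -0.285), 29: (0.453, 0.462),
    30: (-2.605, -2.693),
}

while True:
    n = len(anchors)
    shift = (sum(b for b, c in anchors.values()) -
             sum(c for b, c in anchors.values())) / n   # bank mean - current mean
    diffs = {i: abs(b - (c + shift)) for i, (b, c) in anchors.items()}
    worst = max(diffs, key=diffs.get)                   # most aberrant anchor
    if diffs[worst] <= 0.3:
        break                                           # remaining anchors are stable
    del anchors[worst]                                  # drop it and re-run

print(shift)            # ~0.068 once item 17 is dropped
print(sorted(anchors))  # [1, 8, 10, 22, 24, 27, 28, 29, 30]

This reproduces the two rounds in the equating example: round 1 gives a constant of 0.011 and flags item 17 (difference 0.509 > 0.3); round 2, without item 17, gives the final constant of about 0.068.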

Page 28

Equating Example

+------------------------------------------------------------------------------------------------+
|ENTRY RAW MODEL| INFIT | OUTFIT |PTMEA| |
|NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| ITEM G |
|------------------------------------+----------+----------+-----+-------------------------------|
| 1 19948 22864 -1.451 .021| .89 -8.3| .66 -9.9| .42| 1,Op,B,1,ME,-1.484,Y 0 |
| 2 17402 22864 -.588 .017| .99 -1.1| .96 -2.5| .37| 2,Op,A,1,PS,-0.2, 0 |
| 3 17164 22864 -.523 .016|1.10 9.9|1.15 9.2| .28| 3,Op,A,1,GS,-0.467, 0 |
| 4 41987 22864 -1.544 .016|1.01 .6|1.41 9.9| .31| 4,Op,S,2,NS,-1.275000036, 0 |
| 5 17243 22864 1.329 .010|1.07 8.9|1.07 6.5| .51| 5,Op,S,2,CU,1.715, 0 |
| 6 10013 22864 1.081 .014|1.05 8.3|1.11 9.9| .37| 6,Op,C,1,NS,0.68, 0 |
| 7 26514 22864 .488 .009|1.00 .0|1.00 .0| .58| 7,Op,S,2,MC,1.165, 0 |
| 8 17236 22864 -.542 .016| .88 -9.9| .78 -9.9| .48| 8,Op,C,1,MC,-0.258,Y 0 |
| 9 20200 22864 -1.564 .021| .86 -9.9| .58 -9.9| .44| 9,Op,A,1,AS,-1.728999972, 0 |
| 10 26018 22864 .530 .009|1.03 3.8|1.09 5.5| .56| 10,Op,S,2,PS,0.56,Y 0 |
| 11 73926 22864 -.258 .007|1.18 9.9|1.44 9.9| .59| 11,Op,E,4,SR,-0.2175, 0 |
| 12 11697 22864 .726 .014|1.05 9.5|1.13 9.9| .36| 12,Op,B,1,ME,0.215000004, 0 |
| 13 9021 22864 1.294 .015|1.07 9.9|1.13 9.9| .35| 13,Op,C,1,GS,0.939999998, 0 |
| 14 14762 22864 .066 .015|1.02 3.7|1.08 6.9| .37| 14,Op,B,1,AS,-0.17, 0 |
| 15 23072 22864 .760 .010| .99 -1.1| .99 -1.4| .55| 15,Op,S,2,SR,0.51, 0 |
| 16 31836 22864 -.316 .012|1.00 -.1| .99 -.7| .49| 16,Op,S,2,GS,-0.525, 0 |
| 17 11330 22864 .803 .014|1.05 9.2|1.09 9.9| .37| 17,Op,A,1,PS,0.305000007,Y 0 |
| 18 21885 22864 .893 .011| .77 -9.9| .75 -9.9| .68| 18,Op,S,2,MC,1.235, 0 |
+------------------------------------------------------------------------------------------------+

• Item calibrations before equating (anchor items flagged on the right with “Y”)

Page 29

Equating Example

        ---------- First Round ----------    --------- Second Round ----------
Item    Bank     Current   Adj      Dif      Bank     Current   Adj      Dif
1       -1.484   -1.451    -1.440   0.044    -1.484   -1.451    -1.383   0.101
8       -0.258   -0.542    -0.531   0.273    -0.258   -0.542    -0.474   0.216
10       0.560    0.530     0.541   0.019     0.560    0.530     0.598   0.038
17       0.305    0.803     0.814   0.509    - - - removed - - -
22      -0.100   -0.181    -0.170   0.070    -0.100   -0.181    -0.113   0.013
24      -1.254   -1.191    -1.180   0.074    -1.254   -1.191    -1.123   0.131
27       0.446    0.446     0.457   0.011     0.446    0.446     0.514   0.068
28      -0.055   -0.285    -0.274   0.219    -0.055   -0.285    -0.217   0.162
29       0.453    0.462     0.473   0.020     0.453    0.462     0.530   0.077
30      -2.605   -2.693    -2.682   0.077    -2.605   -2.693    -2.625   0.020

Mean    -0.39920 -0.41020  -0.39920 0.13160  -0.47744 -0.54500  -0.47744 0.09173

Equating constant (Bank - Current): 0.011 (first round); 0.06756 (second round)

• Item #17 was removed as an anchor item; the other anchors were kept.

Page 30

Equating Example

+---------------------------------------------------------------------------------------------------------+
|ENTRY RAW MODEL| INFIT | OUTFIT |PTMEA| | |
|NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.|DISPLACE| ITEM G |
|------------------------------------+----------+----------+-----+--------+-------------------------------|
| 1 19948 22864 -1.484A .021| .96 -2.8| .71 -9.9| .42| .107| 1,Op,B,1,ME,-1.484,Y 0 |
| 2 17402 22864 -.515 .017| .99 -1.1| .96 -2.5| .37| .000| 2,Op,A,1,PS,-0.2, 0 |
| 3 17164 22864 -.450 .016|1.10 9.9|1.15 9.3| .28| .000| 3,Op,A,1,GS,-0.467, 0 |
| 4 41987 22864 -1.472 .016|1.01 .6|1.41 9.9| .31| .000| 4,Op,S,2,NS,-1.275000036, 0 |
| 5 17243 22864 1.402 .010|1.07 9.1|1.07 6.6| .51| .000| 5,Op,S,2,CU,1.715, 0 |
| 6 10013 22864 1.154 .014|1.05 8.5|1.11 9.9| .37| .000| 6,Op,C,1,NS,0.68, 0 |
| 7 26514 22864 .561 .009|1.00 .1|1.00 .1| .58| .000| 7,Op,S,2,MC,1.165, 0 |
| 8 17236 22864 -.258A .016| .82 -9.9| .72 -9.9| .48| -.216| 8,Op,C,1,MC,-0.258,Y 0 |
| 9 20200 22864 -1.492 .021| .86 -9.9| .58 -9.9| .44| .000| 9,Op,A,1,AS,-1.728999972, 0 |
| 10 26018 22864 .560A .009|1.06 7.9|1.11 7.0| .56| .037| 10,Op,S,2,PS,0.56,Y 0 |
| 11 73926 22864 -.187 .007|1.18 9.9|1.44 9.9| .59| .000| 11,Op,E,4,SR,-0.2175, 0 |
| 12 11697 22864 .799 .014|1.05 9.6|1.13 9.9| .36| .000| 12,Op,B,1,ME,0.215000004, 0 |
| 13 9021 22864 1.368 .015|1.07 9.9|1.13 9.9| .35| .000| 13,Op,C,1,GS,0.939999998, 0 |
| 14 14762 22864 .139 .015|1.02 3.8|1.08 7.0| .37| .000| 14,Op,B,1,AS,-0.17, 0 |
| 15 23072 22864 .833 .010| .99 -.9| .99 -1.2| .55| .000| 15,Op,S,2,SR,0.51, 0 |
| 16 31836 22864 -.244 .012|1.00 .0| .99 -.6| .49| .000| 16,Op,S,2,GS,-0.525, 0 |
| 17 11330 22864 .876 .014|1.05 9.4|1.09 9.9| .37| .000| 17,Op,A,1,PS,0.305000007,Y 0 |
| 18 21885 22864 .966 .011| .77 -9.9| .75 -9.9| .68| .000| 18,Op,S,2,MC,1.235, 0 |
+---------------------------------------------------------------------------------------------------------+

• Item calibrations after equating (anchor items fixed, marked with “A” in the Measure column)

Page 31

Transformed Scores
Raw-to-Theta-to-Scale Procedures

1. Calibration software provides a Raw-to-Theta look-up table.

2. The Theta-to-Scale Score transformation is applied, derived from the thetas at three cut-points set by the Standard Setting committee:

(L2) 375

(L3) 400

(L4) SS, obtained by solving SS = m*(theta) + b, where m and b are derived from (L2) and (L3)

Page 32

Transformed Scores Example

• In Grade 4 Mathematics, the Standard Setting Committee established the following cut-scores:

Cut   Theta
L2    -0.090
L3     0.572
L4     1.290

• Setting (L2) = 375 and (L3) = 400 establishes this Theta-to-SS formula:

SS = (37.76435 * theta) + 378.3988

• Solving for (L4): SS(L4) = 427.115

Page 33

Theta-to-SS Transformations

• The current Theta-to-SS transformations (new 2004/05 standards):

Content   Grade    L2       L3      L4      Theta-to-SS formula
Reading   Gr 4    -0.331    0.952   2.178   SS = Theta*19.48558 + 381.4497
Reading   Gr 7    -0.045    1.092   1.918   SS = Theta*21.98769 + 375.9894
Reading   Gr 10   -0.145    0.693   1.596   SS = Theta*29.83294 + 379.3258
Math      Gr 4    -0.090    0.572   1.290   SS = Theta*37.76435 + 378.3988
Math      Gr 7    -0.397    0.200   0.967   SS = Theta*41.87605 + 391.6248
Math      Gr 10   -0.385    0.291   1.213   SS = Theta*36.98225 + 389.2382
Science   Gr 5     0.219    1.262   2.38    SS = Theta*23.96932 + 369.7507
Science   Gr 8    -0.063    0.729   1.739   SS = Theta*31.56566 + 376.9886
Science   Gr 10    0.004    0.575   1.91    SS = Theta*43.78284 + 374.8249

Grade 4 and 7 Reading and Math went into effect in 2004.
Grade 10 Reading and Math went into effect in 2005.

Page 34

Transformed Scores

• Raw-to-Scale Score table from the equating report

SS = Theta*37.76435 + 378.3988

RS   Theta    SS        Rounded SS   SE
 0   -5.378   175.302   175          69.637
 1   -4.129   222.470   222          38.935
 2   -3.376   250.906   251          28.210
 3   -2.914   268.353   268          23.527
 4   -2.573   281.231   281          20.770
 5   -2.298   291.616   292          18.958
 6   -2.064   300.453   300          17.598
 7   -1.859   308.195   308          16.579
 8   -1.676   315.106   315          15.785
 9   -1.509   321.412   321          15.106
10   -1.356   327.190   327          14.502
11   -1.214   332.553   333          13.935
12   -1.083   337.500   338          13.406
13   -0.961   342.107   342          12.915
14   -0.848   346.375   346          12.462
15   -0.743   350.340   350          12.047
16   -0.645   354.041   354          11.669
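A small sketch of how such a table can be generated from the raw-to-theta lookup (an illustration; the real table, including the standard errors, comes from the calibration software). Only the first few rows are used:

# Raw-to-theta values from the calibration software (first rows above)
raw_to_theta = {0: -5.378, 1: -4.129, 2: -3.376, 3: -2.914, 4: -2.573}

A, B = 378.3988, 37.76435  # Grade 4 Math intercept and slope

for raw, theta in raw_to_theta.items():
    ss = A + B * theta
    print(raw, round(ss, 3), round(ss))  # e.g. 0 175.302 175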

Page 35

How to Determine Cut Score (Until 2006)

• If 400 is an attainable scale score, the cut score is 400

• If 400 is not attainable, the nearest attainable score becomes the cut score

e.g.

- 397, 400, 402: 400 is the cut score

- 398, 401, 403: 401 is the cut score

- 399, 402, 405: 399 is the cut score

Page 36

How to Determine Cut Score (2007)

• If 400 is an attainable scale score, the cut score is 400

• If 400 is not attainable, the next lowest attainable score becomes the cut score

e.g.

- 397, 400, 402: 400 is the cut score

- 398, 401, 403: 398 is the cut score

- 399, 402, 405: 399 is the cut score
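Both rules are easy to state in code. A hedged sketch (the function names are mine, and tie-breaking for the pre-2006 “nearest” rule is not specified on the slides):

def cut_until_2006(scores, target=400):
    # Nearest attainable score to the target (400 itself if present).
    return min(scores, key=lambda s: abs(s - target))

def cut_2007(scores, target=400):
    # The target if attainable, otherwise the next lowest attainable score.
    if target in scores:
        return target
    return max(s for s in scores if s < target)

print(cut_until_2006([398, 401, 403]))  # 401
print(cut_until_2006([399, 402, 405]))  # 399
print(cut_2007([398, 401, 403]))        # 398
print(cut_2007([399, 402, 405]))        # 399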

Page 37

Vertical Scaling

Page 38

Vertical Scale

• Places examinee performance across grade levels on a single scale

• Measures individual student growth

• Locates all items across grade levels on a single scale

• Places proficiency standards from different grade levels on a single scale

Page 39

Vertical Scaling vs. Equating

• Equating: scores on different test forms to be used interchangeably within grade level

• Vertical scaling:
  – Performance across all grade levels on the same scale
  – Measure students’ growth
  – Not equating

Page 40

Data Collection Design

• Common item design
  – Common items between adjacent grade levels
  – Appropriate-level items are selected for each grade

• Equivalent groups design
  – Same examinees
  – Take an on-grade test or an off-grade test (usually the lower-grade test)

Page 41

Common Item Design (WASL)

             Base Grade
Test grade   3    4    5    6    7    8
    3        G3
    4        G3   G4
    5             G4   G5
    6                  G5   G6
    7                       G6   G7
    8                            G7

Page 42

Previous Vertical Linking Study

• Math in Grades 3, 4, and 5

• Purpose of the study
  – How much are students growing over time?
  – What is the precision of these estimates?

Page 43

Data

• The data consist of items used in the pilot tests for Grades 3 and 5 in 2004 and 2005

• Operational data for Grade 4 in 2005

Page 44

Linking Design

• Items across all forms in three grades

• Each form within grade includes a common block of items

• Common item non-equivalent groups design

Page 45

Common Item Design (WASL)

             Base Grade
Test grade   3    4    5
    3        G3   G4
    4        G3   G4   G5
    5             G4   G5

Page 46

Item Review (Item Means)

Common Item   Grade 3   Grade 4
 1            0.65      0.96
 2            0.85      0.95
 3            1.59      1.84
 4            1.06      1.35
 5            0.81      0.89
 6            0.59      0.72
 7            0.63      1.14
 8            0.67      0.87
 9            0.77      0.87
10            0.32      0.53
11            0.95      1.42

Page 47

Item Review

Item   Grade 4   Grade 5
 1     0.49      0.56
 2     0.45      0.51
 3     0.97      1.03
 4     0.59      0.60
 5     1.13      1.12
 6     1.14      1.18
 7     0.98      1.17
 8     0.54      0.67
 9     1.26      1.17
10     0.67      0.56
11     0.72      0.61
12     0.80      0.85

Page 48

Results

• Comparing the p-values for the linking items across grades suggests some instability

• Growth is larger from grade 3 to 4 than from grade 4 to 5

• Pilot data vs. operational data

• Motivation factor (G4 to G5)

• Backward Equating
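A quick, hedged check of the growth claim, using the item means from the two Item Review tables above (treating the mean item score on each common item as an easiness estimate; this is an illustration, not the study's actual analysis):

g3      = [0.65, 0.85, 1.59, 1.06, 0.81, 0.59, 0.63, 0.67, 0.77, 0.32, 0.95]
g4_same = [0.96, 0.95, 1.84, 1.35, 0.89, 0.72, 1.14, 0.87, 0.87, 0.53, 1.42]
g4      = [0.49, 0.45, 0.97, 0.59, 1.13, 1.14, 0.98, 0.54, 1.26, 0.67, 0.72, 0.80]
g5_same = [0.56, 0.51, 1.03, 0.60, 1.12, 1.18, 1.17, 0.67, 1.17, 0.56, 0.61, 0.85]

gain_3_4 = sum(b - a for a, b in zip(g3, g4_same)) / len(g3)
gain_4_5 = sum(b - a for a, b in zip(g4, g5_same)) / len(g4)
print(round(gain_3_4, 3), round(gain_4_5, 3))  # ~0.241 vs ~0.024

The mean gain on the common items is roughly ten times larger from grade 3 to 4 than from grade 4 to 5, consistent with the bullets above and with the motivation and backward-equating caveats.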

Page 49

Future Plan

• A vertical linking study will be conducted in January 2008 using the 2007 reading WASL.

• The results will be presented next year.