Psychometric Issues in Formative Assessment: Measuring Student Learning Throughout the
Academic Year Using Interim Assessments
Jonathan Templin, The University of Georgia
Neal Kingston and Wenhao Wang, University of Kansas
MARCES 2011
Talk Overview
• Interim versus Summative Tests
• Examining current interim tests for growth
• Are interim tests able to provide additional (helpful) feedback for students and teachers?
  - What is the structure/nature of interim tests?
  - Can psychometric models be adapted to fit the structure of interim tests as-is?
• Issues in measuring growth of students across academic year: psychometric and statistical
Talk Summary
• The problems of dimensionality…
Premises for Talk
• This talk is based on two (somewhat controversial) premises:
1. We spend too much time testing students relative to the information currently obtained from tests
  - Utility of the outcome is key to the usefulness of a test
  - For learning and growth, tests are generally at too abstract a level to present actionable evidence to help students
2. Psychometrics (statistics) must be able to adapt to the empirical nature of the traits being studied
  - For too long we have tailored our tests to fit existing psychometric models, or, more often, software choices
  - The field has advanced to a state where this no longer has to be done
CURRENT PRACTICES IN LARGE SCALE ASSESSMENT
Current Practices in Large Scale Assessment
• Current large scale (summative) assessment practices are driven by the measurement of a single, coarse-level trait
  - Such as state End-of-Course or End-of-Grade tests for assessing NCLB proficiency reporting requirements
  - Traits measured: Math (grade-level), English reading (grade-level), Science, History
• Single scores are reported
  - Typically students are then rated into proficiency categories
  - Proficiency categories are mandated by state law and set by psychometric standard setting procedures
Test Blueprint
• Although a single trait is measured/reported, tests are constructed to match a content-area blueprint
  - The blueprint mirrors the underlying dimensions of the trait
• Psychometric issue: finding items which fit into the content areas of the blueprint but are unidimensional "enough" to still measure the overall trait
  - Not quite multidimensional: there are certainly not enough items in each content area to measure students on a continuum with adequate reliability
  - Not quite unidimensional
• Knowledge of the blueprint means knowledge of the test
  - Teachers under duress may teach the subjects appearing most frequently on the test
Example Math Test Blueprint
Retrieved from: http://www.mde.k12.ms.us/acad/osa/blueprints/
[Figure: MCT2 Mathematics and Algebra 1 Test Blueprint, listing the content areas Numbers and Operations; Algebra; Geometry; Measurement; Data Analysis and Probability.]
Items: From Creation to Piloting
• Large-scale test items are typically created in large quantities, often by third-party vendors
  - Writing many items ensures that some will be of high enough quality to be included on the operational test form
  - Items which have "face validity" are included in pilot forms
• Items are often piloted prior to operational use to determine whether they have the "necessary" statistical properties
  - Piloted items are stratified by blueprint content areas, which ensures blueprint coverage for the final operational items
• Items are written by many types of people: content area "experts", former teachers, researchers in the field
Model Level Psychometric Screening
[Figure: Item characteristic curve for item MATH01 (3-parameter model, normal metric; subtest: PRETEST). Estimated parameters: a = 0.651, b = −1.599, c = 0.186; fit statistic: χ² = 28.98, df = 9, p < 0.0007. Axes: ability (−3 to 3) versus probability (0 to 1), with b and c marked on the curve.]
This item does not fit the 3PL – and therefore will not likely be used in operational test form
Depending on state law, items must “fit” the prescribed unidimensional IRT model
Items that fit a given model are then further screened (i.e., for bias)
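For intuition, here is a minimal sketch of the 3PL curve shown above, using the printed MATH01 parameters, together with the general idea behind a stratified chi-square item-fit statistic. The observed proportions and group sizes are made up for illustration; this is not the screening software's actual computation.

```python
import numpy as np

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve on the normal metric (D = 1.7)."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Parameter estimates printed on the MATH01 curve above
a, b, c = 0.651, -1.599, 0.186
theta = np.linspace(-3, 3, 7)
expected = icc_3pl(theta, a, b, c)
print(np.round(expected, 3))

# The chi-square item-fit idea: compare observed proportions correct
# within ability strata to the model-implied probabilities.
obs_p = np.array([0.22, 0.25, 0.35, 0.52, 0.70, 0.82, 0.90])  # made-up observed proportions
n_per_group = np.full(7, 150)                                  # made-up stratum sizes
chisq = np.sum(n_per_group * (obs_p - expected) ** 2
               / (expected * (1 - expected)))
print(round(chisq, 2))
```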
Resulting Test Data
• Through the process of psychometric item screening, the resulting test data are usually unidimensional:
  - Items exhibiting "too much" dimensionality are omitted
• The test measures a single trait with a reasonable level of reliability
• Measuring more than one trait, even with different analysis methods, is typically impossible
  - Because final item screening is based on unidimensional psychometric models, the test is unidimensional
• Unidimensionality of the test data is a product of the process
  - It remains to be seen whether multiple dimensions exist in the underlying traits
Test Utility
• The testing process is the result of years of psychometric research, assuring stakeholders that test results are reliable and valid
  - But… are they useful?
• Knowing, in a normative sense, how students rank in terms of an overall ability seems to have little use for the practice of teaching and informing students
  - Yet this is continually how we measure students
  - We do not even attempt to measure more than one dimension
INTERIM ASSESSMENTS: BETWEEN FORMATIVE AND SUMMATIVE
Interim Assessments: The Middle Ground
• Interim assessments are increasingly popular methods of assessment used to evaluate student progress toward large scale educational goals
• Interim Assessment (from Perie, Marion, & Gong, 2009, p. 6): “Assessments administered during instruction to evaluate students’ knowledge and skills relative to a specific set of academic goals in order to inform policymaker or educator decisions at the classroom, school, or district level. The specific interim assessment designs are driven by the purposes and intended uses, but the results of any interim assessment must be reported in a manner allowing aggregation across students, occasions, or concepts”
• Key terms: aggregation across…
  - Occasions
  - Concepts
Positioning Interim Assessments
From: p. 7 of Perie, Marion, and Gong (2009). Moving toward a comprehensive assessment system: a framework for considering interim assessments. Educational Measurement: Issues and Practice, 28, 5-13.
Purposes of Interim Assessments
• Generally, interim assessments fall into one of three categories:
  - Instructional: meant to inform instruction
  - Evaluative: meant to demonstrate the effects of instruction
  - Predictive: meant to predict student performance on summative assessments
• As such, interim assessments are generally less flexible than formative assessments
  - Constructed at a wider level
Questions of Our Talk
• What information can be gained from interim assessments currently in place?
• Is the information useful and actionable?
• What is the dimensionality of interim assessments?
• What are the psychometric issues that are present in interim assessment?
LONGITUDINAL ANALYSIS OF INTERIM ASSESSMENT DATA
State Interim Assessment Data
• Data for this talk come from the 2010 interim mathematics assessment of students in a midwestern state
  - Teachers were allowed to opt in to the testing system
  - Grades 3-8
  - Students took up to three interim tests: Fall #1, Fall #2, and Winter
• Items on the tests were piloted during the 2009 summative tests
  - Item parameters were fixed across the 2010 administrations
  - Is the dimensionality the same? Are the parameters invariant across time?
• Scores on all tests were reported, but no tracking was done for instructional, evaluative, or predictive purposes
Longitudinal Analyses Results by Grade
• Shown: Scaled scores (percent correct)
• Intercepts: 56.2 (7th grade) – 68.3 (3rd grade)
• Slopes: 4.1 (8th grade) – 10.3 (3rd grade)
[Figure: Mean scaled scores (percent correct) by grade (3 through 8) across the three interim administrations (Fall 1, Fall 2, Winter); y-axis ranges from 50 to 80, with one growth line per grade.]
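The growth lines behind this figure amount to a per-grade least-squares fit over three administrations. A sketch of that two-stage computation follows; the score values are hypothetical placeholders, not the values behind the figure.

```python
import numpy as np

# Two-stage growth sketch: an ordinary least-squares line per grade
# over the three administrations.
time = np.array([0, 1, 2])           # Fall 1, Fall 2, Winter
scores = {3: [68.3, 73.1, 78.9],     # hypothetical grade-level means
          7: [56.2, 60.0, 64.4]}

for grade, y in scores.items():
    slope, intercept = np.polyfit(time, np.asarray(y, dtype=float), 1)
    print(f"Grade {grade}: intercept = {intercept:.1f}, slope = {slope:.1f}")
```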
A Closer Look: Grade 7 Data
• From Grade 7:
[Figure: Individual Grade 7 scaled-score trajectories across Fall 1, Fall 2, and Winter.]
Question: Can Growth Results Be Trusted?
• As with any analysis, there are reasons to question the results
• Reliability:
  - The two-stage analysis assumed scaled scores were perfectly reliable; they likely contain much more error than reported
  - A single-stage analysis would control for this error
  - Accuracy of student trajectories is low: with only three time points, it is hard to determine where students are
• Statistical issues:
  - The assessment uses fixed item parameters with a unidimensional scale, but dimensionality may be different in emerging learners
  - Item parameters are assumed invariant across time; this is hard to test with the current implementation design
Current Status of Interim Assessments
• Currently, interim assessment data are helpful, but limited
• Question: can interim assessments provide more helpful information to students and teachers?
  - Multidimensional feedback
• Is dimensionality constant across time?
• Are methods available for reporting multidimensional feedback?
ASSESSING THE DIMENSIONALITY OF INTERIM ASSESSMENTS
Large Scale Tests
• Most large scale tests are created based on a test blueprint
• Blueprint details the content areas assessed by the test and the numbers of items for each area
• This has the potential to introduce areas where diagnostic information may be obtained
• However, because most tests measure a common content area, there is still a dominant dimension
  - Most large scale tests are analyzed with unidimensional models
Large Scale Test Construction Process
• Often, many items are piloted
• Items not fitting the psychometric model used by the testing program are omitted from the exam
  - 3PL used in Florida and Kansas
  - Rasch used in Georgia
• By omitting items not fitting the unidimensional model, the resulting test loses the ability to detect multiple dimensions…
Path Diagram of Unidimensional IRT Models
[Figure: items such as "2+3−1", "4/2", and "(4×2)+3" all measuring a single "Basic Math Ability" factor.]

Path Diagram of Multidimensional IRT Models
[Figure: the same items measuring multiple correlated continuous abilities.]

Path Diagram of Diagnostic Models
[Figure: the same items measuring the discrete attributes Addition, Subtraction, Multiplication, and Division.]
Potential for Multidimensional Models?
• Because of the unique construction of large scale tests, multidimensional feedback may be possible
  - A question of utility, and of grain size
• Multidimensional psychometric models:
  - Multidimensional item response (MIRT) models: multiple scaled scores
  - Diagnostic classification models (DCMs): like MIRT models, but with ordinal-level measurement (proficiency status)
  - Hybrid approaches such as bi-factor models: an overall continuous ability plus sub-abilities for each dimension
Concerns About Multidimensional Models
• Many psychometric approaches have been developed for measuring multiple dimensions:
  - Classical test theory: scale subscores
  - Multidimensional item response theory (MIRT)
  - Diagnostic classification models
• Yet issues in application have remained: low reliability, the need for large samples, and dimensions that are often very highly correlated
• DCMs have changed the landscape, but is the information worthwhile?
  - Question: would ordinal information (master/non-master or proficient/not proficient) be valuable in interim testing situations?
Theoretical Reliability Comparison
[Figure: theoretical reliability (0 to 1) as a function of the number of items (3 to 93) for a DCM versus a Rasch IRT model.]

Items needed to reach a given reliability level:

Reliability | DCM      | IRT
0.80        | 8 items  | 34 items
0.85        | 10 items | 48 items
0.90        | 13 items | 77 items
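For the continuous-trait column, the classical Spearman-Brown projection shows how reliability grows with test length. A sketch under an assumed, purely hypothetical per-item reliability; this is standard CTT machinery, not the exact calculation behind the slide's curves.

```python
def spearman_brown(rel_1, k):
    """Projected reliability of a test lengthened by factor k,
    given the reliability rel_1 of the original (one-item) test."""
    return k * rel_1 / (1 + (k - 1) * rel_1)

def items_needed(rel_item, target):
    """Smallest number of parallel items reaching the target reliability."""
    k = 1
    while spearman_brown(rel_item, k) < target:
        k += 1
    return k

rel_item = 0.105   # hypothetical per-item reliability, chosen for illustration
for target in (0.80, 0.85, 0.90):
    print(target, items_needed(rel_item, target))
```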
Uni- and Multidimensional Comparison
[Figure: bar chart comparing the reliability (0.5 to 1.1 scale) of DCM and IRT estimates under 1-dimension, 2-dimension, and bi-factor model specifications.]
Multidimensionality and Large Scale Tests
• Applications of multidimensional psychometric models can lead to very poor results:
  - Non-convergence (or weak convergence)
  - Non-informative item parameters
  - Highly associated latent variables, essentially forming a unidimensional factor
• Such results hold for both MIRT models and DCMs
  - You will see DCM analyses published; however, I argue that is because goodness-of-fit procedures lag behind model application
  - It is currently easy to hide poor DCM fit behind the novelty of the models
WHEN MULTIDIMENSIONALITY GOES BAD
How DCMs and Multidimensional Models Fail
• Willse, Henson, & Templin (2009, NCME) presented the results of a DCM analysis of a 5th grade End-of-Grade reading exam
  - 73 items
  - 2 content standards, used as the Q-matrix
  - All items had simple structure (one attribute measured per item)
• For the measured attributes, the analysis revealed:
  - High reliabilities (.95 and .90)
  - Highly certain classifications
However…Model Fit Was Poor
[Figure, left: relative frequency of total test scores (10 to 70), observed data versus data simulated from the model parameters. Figure, right: histogram of residual correlations, ranging from roughly −0.1 to 0.3.]
Q-matrix used fit marginally better than randomly permuted Q-matrix!
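The check illustrated here can be sketched generically: simulate response data from the fitted item parameters and compare the simulated total-score distribution with the observed one. A simplified version using a unidimensional 2PL, with all parameters hypothetical and the "observed" data standing in for a real data set.

```python
import numpy as np

rng = np.random.default_rng(2011)

def simulate_total_scores(a, b, thetas):
    """Simulate total scores from 2PL item parameters."""
    p = 1 / (1 + np.exp(-a * (thetas[:, None] - b)))   # person-by-item probabilities
    return (rng.random(p.shape) < p).sum(axis=1)

# Hypothetical "fitted" parameters for a 20-item test
n_items = 20
a = rng.uniform(0.5, 2.0, n_items)
b = rng.normal(0.0, 1.0, n_items)

observed = simulate_total_scores(a, b, rng.normal(0, 1, 5000))   # stand-in for real data
replicate = simulate_total_scores(a, b, rng.normal(0, 1, 5000))  # model-implied data

bins = np.arange(n_items + 2)
obs_freq, _ = np.histogram(observed, bins=bins, density=True)
rep_freq, _ = np.histogram(replicate, bins=bins, density=True)
print("max |difference| in relative frequency:", np.abs(obs_freq - rep_freq).max())
```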
DIMENSIONAL ANALYSIS OF INTERIM TEST DATA
Assessing Dimensionality of Interim Assessments
• The dimensionality of interim assessments is important
  - If more dimensions are present, the potential exists for more informative feedback for students and teachers
• Classical notions of dimensionality are due for an update
  - Newer classes of statistical models challenge traditional notions of dimensionality
• We used multiple models to assess the dimensionality of the interim assessment data for each grade/time combination
  - Models shown subsequently: unidimensional/multidimensional, continuous/categorical traits, and bi-factor combinations
The 1PL Item Response Model
[Path diagram: a single continuous latent variable θ (variance σ²_θ) measured by observed items X1 through X5, with all loadings fixed to 1.]

Logit P(X_ri = 1 | θ_r) = θ_r − b_i = λ_i,0 + θ_r
θ_r ~ N(0, σ²_θ)

Item intercepts: λ_i,0
The 2PL Item Response Model
[Path diagram: a single continuous latent variable θ (variance fixed to 1) measured by observed items X1 through X5, with freely estimated loadings.]

Logit P(X_ri = 1 | θ_r) = a_i(θ_r − b_i) = λ_i,0 + λ_i,1 θ_r
θ_r ~ N(0, 1)

Item intercepts: λ_i,0
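A minimal numeric sketch of the 1PL and 2PL response functions just defined (logistic metric; all parameter values are illustrative):

```python
import numpy as np

def irf_1pl(theta, b):
    """1PL: logit P(X = 1) = theta - b."""
    return 1 / (1 + np.exp(-(theta - b)))

def irf_2pl(theta, a, b):
    """2PL: logit P(X = 1) = a * (theta - b); in slope-intercept form
    lambda_i0 = -a * b and lambda_i1 = a."""
    return 1 / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(np.round(irf_1pl(theta, b=0.5), 3))
print(np.round(irf_2pl(theta, a=1.2, b=0.5), 3))
```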
The MIRT Model
[Path diagram: observed items X1 through X5, each measuring one of the correlated continuous latent variables θ1 through θ4 (simple structure).]

Logit P(X_ri = 1 | θ_r) = λ_i,0 + λ_i,f θ_r,f
θ_r ~ N(0, Σ)

One latent variable is measured per item; the latent variable correlations are contained in Σ.
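A sketch of the simple-structure MIRT response function, where a Q-matrix-style index maps each item to the single factor it measures (all values hypothetical):

```python
import numpy as np

def irf_mirt(thetas, item_factor, lam0, lam1):
    """Simple-structure MIRT: each item loads on exactly one factor.
    thetas: person-by-factor array; item_factor: factor index per item."""
    logit = lam0 + lam1 * thetas[:, item_factor]
    return 1 / (1 + np.exp(-logit))

# Five hypothetical items, each assigned to one of three factors
item_factor = np.array([0, 0, 1, 1, 2])
lam0 = np.array([-0.5, 0.2, 0.0, -1.0, 0.4])   # intercepts
lam1 = np.array([1.1, 0.8, 1.4, 0.9, 1.2])     # loadings

thetas = np.array([[0.5, -0.3, 1.0]])          # one examinee, three factors
print(np.round(irf_mirt(thetas, item_factor, lam0, lam1), 3))
```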
The Bi-Factor MIRT Model
[Path diagram: observed items X1 through X5, each measuring one specific factor among θ1 through θ4, plus a general factor G measured by all items.]

Logit P(X_ri = 1 | θ_r, G_r) = λ_i,0 + λ_i,f θ_r,f + λ_i,g G_r
G_r ~ N(0, 1); θ_r ~ N(0, Σ)

One specific latent variable is measured per item; the general "ability" G is measured by all items of the test.
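Extending the previous sketch, the bi-factor version adds a general-factor loading to every item (again, all values hypothetical):

```python
import numpy as np

def irf_bifactor(general, specifics, item_factor, lam0, lam_s, lam_g):
    """Bi-factor MIRT: a general factor loads on every item;
    each item also loads on one specific factor."""
    logit = (lam0 + lam_s * specifics[:, item_factor]
             + lam_g * general[:, None])
    return 1 / (1 + np.exp(-logit))

item_factor = np.array([0, 0, 1, 1, 2])
lam0 = np.array([-0.5, 0.2, 0.0, -1.0, 0.4])
lam_s = np.array([0.6, 0.5, 0.7, 0.4, 0.6])    # specific-factor loadings
lam_g = np.array([1.0, 0.9, 1.1, 0.8, 1.0])    # general-factor loadings

general = np.array([0.3])                      # one examinee's general ability
specifics = np.array([[0.5, -0.3, 1.0]])       # and specific abilities
print(np.round(irf_bifactor(general, specifics, item_factor,
                            lam0, lam_s, lam_g), 3))
```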
Bi-Factor MIRT Model Specifics
• The Bi-Factor MIRT Model has:
  - A continuous latent variable measured by all items: general content area ability
  - A set of continuous latent variables measured by the items indicated in the Q-matrix
• For the current data set, these represent state-level standards:
  1. Number and Computation
  2. Algebra
  3. Geometry
  4. Data
• …and it is easily estimable using MML with Mplus
The General DCM Approach (LCDM)
[Path diagram: observed items X1 through X5 measured by discrete latent attributes α1 through α4, with latent variable interactions governed by a diagnostic structural model.]

Logit P(X_ri = 1 | α_r) = λ_i,0 + λ_iᵀ h(q_i, α_r)
α_r ~ MVB(η)
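A sketch of the LCDM kernel h(q_i, α_r) for a single item measuring two attributes: an intercept, main effects for mastered measured attributes, and a two-way interaction. Parameter values are hypothetical; note the response probability increases with each mastered attribute, as the LCDM requires.

```python
import numpy as np
from itertools import product

def lcdm_logit(alpha, q, lam0, lam_main, lam_int):
    """LCDM kernel for an item measuring the attributes flagged in q:
    intercept + main effects for mastered measured attributes
    + a two-way interaction when both measured attributes are mastered."""
    measured = alpha * q
    logit = lam0 + lam_main @ measured
    if q.sum() == 2:
        logit += lam_int * np.prod(measured[q == 1])
    return logit

# Hypothetical item measuring attributes 1 and 2 of 4
q = np.array([1, 1, 0, 0])
lam0, lam_int = -2.0, 0.8
lam_main = np.array([1.5, 1.2, 0.0, 0.0])

for a1, a2 in product([0, 1], repeat=2):
    alpha = np.array([a1, a2, 0, 0])
    p = 1 / (1 + np.exp(-lcdm_logit(alpha, q, lam0, lam_main, lam_int)))
    print((a1, a2), round(float(p), 3))
```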
Modifications to DCMs
• Because of the suspected unidimensionality of existing large scale tests, DCMs are not directly applicable
  - As shown previously
• The general form of DCMs allows more types of latent variables to be present in the model
• Our solution: an analog of Thurstone's hierarchical factors model (i.e., general and specific intelligences), with the sub-factors represented by categorical latent variables
  - Here the general "intelligence" is overall content area knowledge
  - The specific "intelligences" are the set of standards
The Bi-Factor DCM
[Path diagram: observed items X1 through X5 measured by a continuous general ability θ (variance fixed to 1) and by discrete latent attributes α1 through α4, with attribute interactions from the diagnostic structural model.]

Logit P(X_ri = 1 | θ_r, α_r) = λ_i,0 + λ_i,1 θ_r + λ_iᵀ h(q_i, α_r)
θ_r ~ N(0, 1); α_r ~ MVB(η)
Bi-Factor DCM Item Response Function
[Figure: Bi-Factor DCM IRF for an item measuring one attribute: P(X_ri = 1 | α_r, θ_r) plotted against θ_r. Red line: students proficient at the sub-area; black line: students not proficient.]
Bi-Factor DCM Specifics
• The Bi-Factor DCM has:
  - A continuous latent variable measured by all items: general content area ability
  - A set of categorical latent variables measured by the items indicated in the Q-matrix: mastery/non-mastery of content area knowledge
• For the current data set, these represent state-level standards:
  1. Number and Computation
  2. Algebra
  3. Geometry
  4. Data
• …and it is also easily estimable using MML with Mplus
• The model better accounts for violations of local independence that arise when a pure DCM is used
  - Examinees with higher content area knowledge get a boost in the probability of a correct response
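To make the IRF shown two slides back concrete, here is a small sketch of the bi-factor DCM response curve for masters and non-masters of a single measured attribute. Parameter values are hypothetical, not estimates from the study's data.

```python
import numpy as np

def irf_bifactor_dcm(theta, mastered, lam0, lam_theta, lam_attr):
    """Bi-factor DCM IRF for an item measuring one attribute:
    logit = intercept + loading * theta + attribute effect if mastered."""
    logit = lam0 + lam_theta * theta + lam_attr * mastered
    return 1 / (1 + np.exp(-logit))

theta = np.linspace(-3, 3, 7)
lam0, lam_theta, lam_attr = -1.0, 1.0, 1.5   # hypothetical values

print("non-masters:", np.round(irf_bifactor_dcm(theta, 0, lam0, lam_theta, lam_attr), 3))
print("masters:    ", np.round(irf_bifactor_dcm(theta, 1, lam0, lam_theta, lam_attr), 3))
```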
Other Details
• If we had observed the continuous latent ability and the categorical latent attributes, this model would be an ANCOVA model (the logistic version)
  - Estimated model parameters would essentially be adjusted for equal ability in each group
• Following Thurstone's convention, the general factor was uncorrelated with the specific factors
  - The specific factors are correlated with each other; this may or may not be needed/useful
  - This helps preserve the semi-multidimensional nature of the model
• Content area attributes have smaller variability than comparable continuous latent variables
  - Not "as" multidimensional as the full Bi-Factor model
Model Fit Results
• Best model based on information criteria (table below)
• In every grade and test administration, a bi-factor model fit best
  - Question: what do the latent variables mean?
• The data may be multidimensional
  - We may be able to harness this dimensionality to provide more feedback to students
Grade | Fall 1 | Fall 2  | Winter
3rd   | BF-DCM | BF-DCM  | BF-DCM
4th   | BF-DCM | BF-DCM  | BF-DCM
5th   | BF-DCM | BF-DCM  | BF-DCM
6th   | BF-DCM | BF-DCM  | BF-MIRT
7th   | BF-DCM | BF-MIRT | BF-MIRT
8th   | BF-DCM | BF-MIRT | BF-DCM
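Model selection of this kind typically compares information criteria across the fitted candidates. A sketch with hypothetical log-likelihoods and parameter counts (not the study's values):

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion; lower is better."""
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return -2 * loglik + n_params * math.log(n_obs)

# Hypothetical log-likelihoods and parameter counts for one grade/test
candidates = {"IRT": (-10450.2, 60), "MIRT": (-10380.7, 68),
              "BF-MIRT": (-10301.4, 96), "BF-DCM": (-10295.9, 88)}
n_obs = 2500

for name, (ll, p) in candidates.items():
    print(f"{name}: AIC = {aic(ll, p):.1f}, BIC = {bic(ll, p, n_obs):.1f}")
print("best by BIC:", min(candidates, key=lambda m: bic(*candidates[m], n_obs)))
```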
LONGITUDINAL MEASUREMENT ISSUES AND PRACTICES
Issues with Measurement Longitudinally
• Invariance of item parameters: do items function similarly across time?
• Reliability of measurement at each time point: continuous versus categorical latent variable models
• Reliability of the measurement of growth
  - The number of measurements needed for adequate reliability may be impossible to reach
  - Additional measures may be needed (formative tests, homework assignments, etc.)
• Longitudinal modeling of latent variables
  - Two-stage modeling propagates measurement error
  - DCMs (if used) have implementation issues longitudinally
  - Parameters for change are numerous (but help is on the way)
Testing Practices that Affect Measurement of Growth
• The most important design element for measuring growth is having students take overlapping items across time
  - Current implementations confound growth with item change: it is impossible to discern change in item properties from growth in student ability
• Items (or testlets) can come from an overall pool, randomly assigned to each examinee across time
  - Some overlap is needed
• The validity of modeling results needs verification
  - Perhaps the most important point
CONCLUDING REMARKS
Interim Assessments: Potential for More Feedback
• Interim assessments are part of the current (and new) testing environment and are likely here to stay
  - Better psychometric modeling will increase their usefulness
• There may be more dimensionality in tests than we have previously realized
  - Which dimensions these are remains to be determined
• The key to any longitudinal assessment program is utility
  - If a test is built that does not serve its purpose, what good is it?
• We can make interim assessments part of the learning process and help students grow
Thank You!
• Questions? Comments? Complaints? References?
• Contact me at:
• Talk slides available at:
http://projects.coe.uga.edu/jtemplin
• Thank you!