Psychometric Issues in Formative Assessment: Measuring Student Learning Throughout the
Academic Year Using Interim Assessments
Jonathan Templin, The University of Georgia
Neal Kingston and Wenhao Wang, University of Kansas
MARCES 2011
Talk Overview
• Interim versus Summative Tests
• Examining current interim tests for growth
• Are interim tests able to provide additional (helpful) feedback for students and teachers?
  - What is the structure/nature of interim tests?
  - Can psychometric models be adapted to fit the structure of interim tests as-is?
• Issues in measuring growth of students across academic year: psychometric and statistical
Talk Summary
• The problems of dimensionality…
Premises for Talk
• This talk is based on two (somewhat controversial) premises:
1. We spend too much time testing students relative to the information currently obtained from tests
  - Utility of the outcome is key to the usefulness of a test
  - For learning and growth, tests are generally at too abstract a level to present actionable evidence to help students
2. Psychometrics (statistics) must be able to adapt to the empirical nature of the traits being studied
  - For too long we have tailored our tests to fit existing psychometric models, or, more often, software choices
  - The field has advanced to a state where this no longer has to be done
CURRENT PRACTICES IN LARGE SCALE ASSESSMENT
Current Practices in Large Scale Assessment
• Current large scale (summative) assessment practices are driven by the measurement of a single, coarse-level trait
  - Such as state End-of-Course or End-of-Grade tests for assessing NCLB proficiency reporting requirements
  - Traits measured: Math (grade-level), English reading (grade-level), Science, History
• Single scores are reported
  - Typically students are then rated into proficiency categories
  - Proficiency categories are mandated by state law and set by psychometric standard setting procedures
Test Blueprint
• Although a single trait is measured/reported, tests are constructed to match a content-area blueprint
  - The blueprint mirrors the underlying dimensions of the trait
• Psychometric issue: finding items which fit into the content areas of the blueprint but are unidimensional "enough" to still measure the overall trait
  - Not quite multidimensional: there are certainly not enough items in each content area to measure students on a continuum with adequate reliability
  - Not quite unidimensional
• Knowledge of the blueprint means knowledge of the test
  - Teachers under duress may teach the subjects appearing most frequently on the test
Example Math Test Blueprint
Retrieved from: http://www.mde.k12.ms.us/acad/osa/blueprints/
[Figure: MCT2 Mathematics and Algebra 1 Test Blueprint, listing the content areas Numbers and Operations; Algebra; Geometry; Measurement; Data Analysis and Probability.]
Items: From Creation to Piloting
• Large-scale test items are typically created in large quantities, often by third-party vendors
  - Writing many items ensures that some will be of high enough quality to be included on the operational test form
  - Items which have "face validity" are included in pilot forms
• Items are often piloted prior to operational use to determine whether they have the "necessary" statistical properties
  - Piloted items are stratified by blueprint content areas, which ensures blueprint coverage for the final operational items
• Items are written by many types of people: content area "experts", former teachers, researchers in the field
Model Level Psychometric Screening
[Figure: Item characteristic curve for item MATH01 (3-parameter model, normal metric; subtest: PRETEST). Estimated parameters: a = 0.651, b = −1.599, c = 0.186; fit statistic: χ² = 28.98, df = 9, p < 0.0007. Axes: ability (−3 to 3) versus probability (0 to 1), with b and c marked on the curve.]
This item does not fit the 3PL – and therefore will not likely be used in operational test form
Depending on state law, items must “fit” the prescribed unidimensional IRT model
Items that fit a given model are then further screened (i.e., for bias)
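For intuition, here is a minimal sketch of the 3PL curve shown above, using the printed MATH01 parameters, together with the general idea behind a stratified chi-square item-fit statistic. The observed proportions and group sizes are made up for illustration; this is not the screening software's actual computation.

```python
import numpy as np

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve on the normal metric (D = 1.7)."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Parameter estimates printed on the MATH01 curve above
a, b, c = 0.651, -1.599, 0.186
theta = np.linspace(-3, 3, 7)
expected = icc_3pl(theta, a, b, c)
print(np.round(expected, 3))

# The chi-square item-fit idea: compare observed proportions correct
# within ability strata to the model-implied probabilities.
obs_p = np.array([0.22, 0.25, 0.35, 0.52, 0.70, 0.82, 0.90])  # made-up observed proportions
n_per_group = np.full(7, 150)                                  # made-up stratum sizes
chisq = np.sum(n_per_group * (obs_p - expected) ** 2
               / (expected * (1 - expected)))
print(round(chisq, 2))
```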
Resulting Test Data
• Through the process of psychometric item screening, the resulting test data are usually unidimensional:
  - Items exhibiting "too much" dimensionality are omitted
• The test measures a single trait with a reasonable level of reliability
• Measuring more than one trait, even with different analysis methods, is typically impossible
  - Because final item screening is based on unidimensional psychometric models, the test is unidimensional
• Unidimensionality of the test data is a product of the process
  - It remains to be seen whether multiple dimensions exist in the underlying traits
Test Utility
• The testing process is the result of years of psychometric research, assuring stakeholders that test results are reliable and valid
  - But… are they useful?
• Knowing, in a normative sense, how students rank in terms of an overall ability seems to have little use for the practice of teaching and informing students
  - Yet this is continually how we measure students
  - We do not even attempt to measure more than one dimension
INTERIM ASSESSMENTS: BETWEEN FORMATIVE AND SUMMATIVE
Interim Assessments: The Middle Ground
• Interim assessments are increasingly popular methods of assessment used to evaluate student progress toward large scale educational goals
• Interim Assessment (from Perie, Marion, & Gong, 2009, p. 6): “Assessments administered during instruction to evaluate students’ knowledge and skills relative to a specific set of academic goals in order to inform policymaker or educator decisions at the classroom, school, or district level. The specific interim assessment designs are driven by the purposes and intended uses, but the results of any interim assessment must be reported in a manner allowing aggregation across students, occasions, or concepts”
• Key terms: aggregation across…
  - Occasions
  - Concepts
Positioning Interim Assessments
From: p. 7 of Perie, Marion, and Gong (2009). Moving toward a comprehensive assessment system: a framework for considering interim assessments. Educational Measurement: Issues and Practice, 28, 5-13.
Purposes of Interim Assessments
• Generally, interim assessments fall into one of three categories:
  - Instructional: meant to inform instruction
  - Evaluative: meant to demonstrate the effects of instruction
  - Predictive: meant to predict student performance on summative assessments
• As such, interim assessments are generally less flexible than formative assessments
  - Constructed at a wider level
Questions of Our Talk
• What information can be gained from interim assessments currently in place?
• Is the information useful and actionable?
• What is the dimensionality of interim assessments?
• What are the psychometric issues that are present in interim assessment?
LONGITUDINAL ANALYSIS OF INTERIM ASSESSMENT DATA
State Interim Assessment Data
• Data for this talk come from the 2010 interim mathematics assessment of students in a midwestern state
  - Teachers were allowed to opt in to the testing system
  - Grades 3-8
  - Students took up to three interim tests: Fall #1, Fall #2, and Winter
• Items on the tests were piloted during the 2009 summative tests
  - Item parameters were fixed across the 2010 administrations
  - Is the dimensionality the same? Are the parameters invariant across time?
• Scores on all tests were reported, but no tracking was done for instructional, evaluative, or predictive purposes
Longitudinal Analyses Results by Grade
• Shown: Scaled scores (percent correct)
• Intercepts: 56.2 (7th grade) – 68.3 (3rd grade)
• Slopes: 4.1 (8th grade) – 10.3 (3rd grade)
[Figure: Mean scaled scores (percent correct) by grade (3 through 8) across the three interim administrations (Fall 1, Fall 2, Winter); y-axis ranges from 50 to 80, with one growth line per grade.]
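The growth lines behind this figure amount to a per-grade least-squares fit over three administrations. A sketch of that two-stage computation follows; the score values are hypothetical placeholders, not the values behind the figure.

```python
import numpy as np

# Two-stage growth sketch: an ordinary least-squares line per grade
# over the three administrations.
time = np.array([0, 1, 2])           # Fall 1, Fall 2, Winter
scores = {3: [68.3, 73.1, 78.9],     # hypothetical grade-level means
          7: [56.2, 60.0, 64.4]}

for grade, y in scores.items():
    slope, intercept = np.polyfit(time, np.asarray(y, dtype=float), 1)
    print(f"Grade {grade}: intercept = {intercept:.1f}, slope = {slope:.1f}")
```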
A Closer Look: Grade 7 Data
• From Grade 7:
[Figure: Individual Grade 7 scaled-score trajectories across Fall 1, Fall 2, and Winter.]
Question: Can Growth Results Be Trusted?
• As with any analysis, there are reasons to question the results
• Reliability:
  - The two-stage analysis assumed scaled scores were perfectly reliable; they likely contain much more error than reported
  - A single-stage analysis would control for this error
  - Accuracy of student trajectories is low: with only three time points, it is hard to determine where students are
• Statistical issues:
  - The assessment uses fixed item parameters with a unidimensional scale, but dimensionality may be different in emerging learners
  - Item parameters are assumed invariant across time; this is hard to test with the current implementation design
Current Status of Interim Assessments
• Currently, interim assessment data are helpful, but limited
• Question: can interim assessments provide more helpful information to students and teachers?
  - Multidimensional feedback
• Is dimensionality constant across time?
• Are methods available for reporting multidimensional feedback?
ASSESSING THE DIMENSIONALITY OF INTERIM ASSESSMENTS
Large Scale Tests
• Most large scale tests are created based on a test blueprint
• Blueprint details the content areas assessed by the test and the numbers of items for each area
• This has the potential to introduce areas where diagnostic information may be obtained
• However, because most tests measure a common content area, there is still a dominant dimension
  - Most large scale tests are analyzed with unidimensional models
Large Scale Test Construction Process
• Often, many items are piloted
• Items not fitting the psychometric model used by the testing program are omitted from the exam
  - 3PL used in Florida and Kansas
  - Rasch used in Georgia
• By omitting items not fitting the unidimensional model, the resulting test loses the ability to detect multiple dimensions…
Path Diagram of Unidimensional IRT Models
[Figure: items such as "2+3−1", "4/2", and "(4×2)+3" all measuring a single "Basic Math Ability" factor.]

Path Diagram of Multidimensional IRT Models
[Figure: the same items measuring multiple correlated continuous abilities.]

Path Diagram of Diagnostic Models
[Figure: the same items measuring the discrete attributes Addition, Subtraction, Multiplication, and Division.]
Potential for Multidimensional Models?
• Because of the unique construction of large scale tests, multidimensional feedback may be possible
  - A question of utility, and of grain size
• Multidimensional psychometric models:
  - Multidimensional item response (MIRT) models: multiple scaled scores
  - Diagnostic classification models (DCMs): like MIRT models, but with ordinal-level measurement (proficiency status)
  - Hybrid approaches such as bi-factor models: an overall continuous ability plus sub-abilities for each dimension
Concerns About Multidimensional Models
• Many psychometric approaches have been developed for measuring multiple dimensions:
  - Classical test theory: scale subscores
  - Multidimensional item response theory (MIRT)
  - Diagnostic classification models
• Yet issues in application have remained: low reliability, the need for large samples, and dimensions that are often very highly correlated
• DCMs have changed the landscape, but is the information worthwhile?
  - Question: would ordinal information (master/non-master or proficient/not proficient) be valuable in interim testing situations?
Theoretical Reliability Comparison
[Figure: theoretical reliability (0 to 1) as a function of the number of items (3 to 93) for a DCM versus a Rasch IRT model.]

Items needed to reach a given reliability level:

Reliability | DCM      | IRT
0.80        | 8 items  | 34 items
0.85        | 10 items | 48 items
0.90        | 13 items | 77 items
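For the continuous-trait column, the classical Spearman-Brown projection shows how reliability grows with test length. A sketch under an assumed, purely hypothetical per-item reliability; this is standard CTT machinery, not the exact calculation behind the slide's curves.

```python
def spearman_brown(rel_1, k):
    """Projected reliability of a test lengthened by factor k,
    given the reliability rel_1 of the original (one-item) test."""
    return k * rel_1 / (1 + (k - 1) * rel_1)

def items_needed(rel_item, target):
    """Smallest number of parallel items reaching the target reliability."""
    k = 1
    while spearman_brown(rel_item, k) < target:
        k += 1
    return k

rel_item = 0.105   # hypothetical per-item reliability, chosen for illustration
for target in (0.80, 0.85, 0.90):
    print(target, items_needed(rel_item, target))
```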
Uni- and Multidimensional Comparison
[Figure: bar chart comparing the reliability (0.5 to 1.1 scale) of DCM and IRT estimates under 1-dimension, 2-dimension, and bi-factor model specifications.]
Multidimensionality and Large Scale Tests
• Applications of multidimensional psychometric models can lead to very poor results:
  - Non-convergence (or weak convergence)
  - Non-informative item parameters
  - Highly associated latent variables, essentially forming a unidimensional factor
• Such results hold for both MIRT models and DCMs
  - You will see DCM analyses published; however, I argue that is because goodness-of-fit procedures lag behind model application
  - It is currently easy to hide poor DCM fit behind the novelty of the models
WHEN MULTIDIMENSIONALITY GOES BAD
How DCMs and Multidimensional Models Fail
• Willse, Henson, & Templin (2009, NCME) presented the results of a DCM analysis of a 5th grade End-of-Grade reading exam
  - 73 items
  - 2 content standards, used as the Q-matrix
  - All items had simple structure (one attribute measured per item)
• For the measured attributes, the analysis revealed:
  - High reliabilities (.95 and .90)
  - Highly certain classifications
However…Model Fit Was Poor
[Figure, left: relative frequency of total test scores (10 to 70), observed data versus data simulated from the model parameters. Figure, right: histogram of residual correlations, ranging from roughly −0.1 to 0.3.]
Q-matrix used fit marginally better than randomly permuted Q-matrix!
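The check illustrated here can be sketched generically: simulate response data from the fitted item parameters and compare the simulated total-score distribution with the observed one. A simplified version using a unidimensional 2PL, with all parameters hypothetical and the "observed" data standing in for a real data set.

```python
import numpy as np

rng = np.random.default_rng(2011)

def simulate_total_scores(a, b, thetas):
    """Simulate total scores from 2PL item parameters."""
    p = 1 / (1 + np.exp(-a * (thetas[:, None] - b)))   # person-by-item probabilities
    return (rng.random(p.shape) < p).sum(axis=1)

# Hypothetical "fitted" parameters for a 20-item test
n_items = 20
a = rng.uniform(0.5, 2.0, n_items)
b = rng.normal(0.0, 1.0, n_items)

observed = simulate_total_scores(a, b, rng.normal(0, 1, 5000))   # stand-in for real data
replicate = simulate_total_scores(a, b, rng.normal(0, 1, 5000))  # model-implied data

bins = np.arange(n_items + 2)
obs_freq, _ = np.histogram(observed, bins=bins, density=True)
rep_freq, _ = np.histogram(replicate, bins=bins, density=True)
print("max |difference| in relative frequency:", np.abs(obs_freq - rep_freq).max())
```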
DIMENSIONAL ANALYSIS OF INTERIM TEST DATA
Assessing Dimensionality of Interim Assessments
• The dimensionality of interim assessments is important
  - If more dimensions are present, the potential exists for more informative feedback for students and teachers
• Classical notions of dimensionality are due for an update
  - Newer classes of statistical models challenge traditional notions of dimensionality
• We used multiple models to assess the dimensionality of the interim assessment data for each grade/time combination
  - Models shown subsequently: unidimensional/multidimensional, continuous/categorical traits, and bi-factor combinations
The 1PL Item Response Model
[Path diagram: a single continuous latent variable θ (variance σ²_θ) measured by observed items X1 through X5, with all loadings fixed to 1.]

Logit P(X_ri = 1 | θ_r) = θ_r − b_i = λ_i,0 + θ_r
θ_r ~ N(0, σ²_θ)

Item intercepts: λ_i,0
The 2PL Item Response Model
[Path diagram: a single continuous latent variable θ (variance fixed to 1) measured by observed items X1 through X5, with freely estimated loadings.]

Logit P(X_ri = 1 | θ_r) = a_i(θ_r − b_i) = λ_i,0 + λ_i,1 θ_r
θ_r ~ N(0, 1)

Item intercepts: λ_i,0
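A minimal numeric sketch of the 1PL and 2PL response functions just defined (logistic metric; all parameter values are illustrative):

```python
import numpy as np

def irf_1pl(theta, b):
    """1PL: logit P(X = 1) = theta - b."""
    return 1 / (1 + np.exp(-(theta - b)))

def irf_2pl(theta, a, b):
    """2PL: logit P(X = 1) = a * (theta - b); in slope-intercept form
    lambda_i0 = -a * b and lambda_i1 = a."""
    return 1 / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(np.round(irf_1pl(theta, b=0.5), 3))
print(np.round(irf_2pl(theta, a=1.2, b=0.5), 3))
```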
The MIRT Model
[Path diagram: observed items X1 through X5, each measuring one of the correlated continuous latent variables θ1 through θ4 (simple structure).]

Logit P(X_ri = 1 | θ_r) = λ_i,0 + λ_i,f θ_r,f
θ_r ~ N(0, Σ)

One latent variable is measured per item; the latent variable correlations are contained in Σ.
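A sketch of the simple-structure MIRT response function, where a Q-matrix-style index maps each item to the single factor it measures (all values hypothetical):

```python
import numpy as np

def irf_mirt(thetas, item_factor, lam0, lam1):
    """Simple-structure MIRT: each item loads on exactly one factor.
    thetas: person-by-factor array; item_factor: factor index per item."""
    logit = lam0 + lam1 * thetas[:, item_factor]
    return 1 / (1 + np.exp(-logit))

# Five hypothetical items, each assigned to one of three factors
item_factor = np.array([0, 0, 1, 1, 2])
lam0 = np.array([-0.5, 0.2, 0.0, -1.0, 0.4])   # intercepts
lam1 = np.array([1.1, 0.8, 1.4, 0.9, 1.2])     # loadings

thetas = np.array([[0.5, -0.3, 1.0]])          # one examinee, three factors
print(np.round(irf_mirt(thetas, item_factor, lam0, lam1), 3))
```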
The Bi-Factor MIRT Model
[Path diagram: observed items X1 through X5, each measuring one specific factor among θ1 through θ4, plus a general factor G measured by all items.]

Logit P(X_ri = 1 | θ_r, G_r) = λ_i,0 + λ_i,f θ_r,f + λ_i,g G_r
G_r ~ N(0, 1); θ_r ~ N(0, Σ)

One specific latent variable is measured per item; the general "ability" G is measured by all items of the test.
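Extending the previous sketch, the bi-factor version adds a general-factor loading to every item (again, all values hypothetical):

```python
import numpy as np

def irf_bifactor(general, specifics, item_factor, lam0, lam_s, lam_g):
    """Bi-factor MIRT: a general factor loads on every item;
    each item also loads on one specific factor."""
    logit = (lam0 + lam_s * specifics[:, item_factor]
             + lam_g * general[:, None])
    return 1 / (1 + np.exp(-logit))

item_factor = np.array([0, 0, 1, 1, 2])
lam0 = np.array([-0.5, 0.2, 0.0, -1.0, 0.4])
lam_s = np.array([0.6, 0.5, 0.7, 0.4, 0.6])    # specific-factor loadings
lam_g = np.array([1.0, 0.9, 1.1, 0.8, 1.0])    # general-factor loadings

general = np.array([0.3])                      # one examinee's general ability
specifics = np.array([[0.5, -0.3, 1.0]])       # and specific abilities
print(np.round(irf_bifactor(general, specifics, item_factor,
                            lam0, lam_s, lam_g), 3))
```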
Bi-Factor MIRT Model Specifics
• The Bi-Factor MIRT Model has:
  - A continuous latent variable measured by all items: general content area ability
  - A set of continuous latent variables measured by the items indicated in the Q-matrix
• For the current data set, these represent state-level standards:
  1. Number and Computation
  2. Algebra
  3. Geometry
  4. Data
• …and it is easily estimable using MML with Mplus
The General DCM Approach (LCDM)
[Path diagram: observed items X1 through X5 measured by discrete latent attributes α1 through α4, with latent variable interactions governed by a diagnostic structural model.]

Logit P(X_ri = 1 | α_r) = λ_i,0 + λ_iᵀ h(q_i, α_r)
α_r ~ MVB(η)
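A sketch of the LCDM kernel h(q_i, α_r) for a single item measuring two attributes: an intercept, main effects for mastered measured attributes, and a two-way interaction. Parameter values are hypothetical; note the response probability increases with each mastered attribute, as the LCDM requires.

```python
import numpy as np
from itertools import product

def lcdm_logit(alpha, q, lam0, lam_main, lam_int):
    """LCDM kernel for an item measuring the attributes flagged in q:
    intercept + main effects for mastered measured attributes
    + a two-way interaction when both measured attributes are mastered."""
    measured = alpha * q
    logit = lam0 + lam_main @ measured
    if q.sum() == 2:
        logit += lam_int * np.prod(measured[q == 1])
    return logit

# Hypothetical item measuring attributes 1 and 2 of 4
q = np.array([1, 1, 0, 0])
lam0, lam_int = -2.0, 0.8
lam_main = np.array([1.5, 1.2, 0.0, 0.0])

for a1, a2 in product([0, 1], repeat=2):
    alpha = np.array([a1, a2, 0, 0])
    p = 1 / (1 + np.exp(-lcdm_logit(alpha, q, lam0, lam_main, lam_int)))
    print((a1, a2), round(float(p), 3))
```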
Modifications to DCMs
• Because of the suspected unidimensionality of existing large scale tests, DCMs are not directly applicable
  - As shown previously
• The general form of DCMs allows more types of latent variables to be present in the model
• Our solution: an analog of Thurstone's hierarchical factors model (i.e., general and specific intelligences), with the sub-factors represented by categorical latent variables
  - Here the general "intelligence" is overall content area knowledge
  - The specific "intelligences" are the set of standards
The Bi-Factor DCM
[Path diagram: observed items X1 through X5 measured by a continuous general ability θ (variance fixed to 1) and by discrete latent attributes α1 through α4, with attribute interactions from the diagnostic structural model.]

Logit P(X_ri = 1 | θ_r, α_r) = λ_i,0 + λ_i,1 θ_r + λ_iᵀ h(q_i, α_r)
θ_r ~ N(0, 1); α_r ~ MVB(η)
Bi-Factor DCM Item Response Function
[Figure: Bi-Factor DCM IRF for an item measuring one attribute: P(X_ri = 1 | α_r, θ_r) plotted against θ_r. Red line: students proficient at the sub-area; black line: students not proficient.]
Bi-Factor DCM Specifics
• The Bi-Factor DCM has:
  - A continuous latent variable measured by all items: general content area ability
  - A set of categorical latent variables measured by the items indicated in the Q-matrix: mastery/non-mastery of content area knowledge
• For the current data set, these represent state-level standards:
  1. Number and Computation
  2. Algebra
  3. Geometry
  4. Data
• …and it is also easily estimable using MML with Mplus
• The model better accounts for violations of local independence that arise when a pure DCM is used
  - Examinees with higher content area knowledge get a boost in the probability of a correct response
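To make the IRF shown two slides back concrete, here is a small sketch of the bi-factor DCM response curve for masters and non-masters of a single measured attribute. Parameter values are hypothetical, not estimates from the study's data.

```python
import numpy as np

def irf_bifactor_dcm(theta, mastered, lam0, lam_theta, lam_attr):
    """Bi-factor DCM IRF for an item measuring one attribute:
    logit = intercept + loading * theta + attribute effect if mastered."""
    logit = lam0 + lam_theta * theta + lam_attr * mastered
    return 1 / (1 + np.exp(-logit))

theta = np.linspace(-3, 3, 7)
lam0, lam_theta, lam_attr = -1.0, 1.0, 1.5   # hypothetical values

print("non-masters:", np.round(irf_bifactor_dcm(theta, 0, lam0, lam_theta, lam_attr), 3))
print("masters:    ", np.round(irf_bifactor_dcm(theta, 1, lam0, lam_theta, lam_attr), 3))
```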
Other Details
• If we had observed the continuous latent ability and the categorical latent attributes, this model would be an ANCOVA model (the logistic version)
  - Estimated model parameters would essentially be adjusted for equal ability in each group
• Following Thurstone's convention, the general factor was uncorrelated with the specific factors
  - The specific factors are correlated with each other; this may or may not be needed/useful
  - This helps preserve the semi-multidimensional nature of the model
• Content area attributes have smaller variability than comparable continuous latent variables
  - Not "as" multidimensional as the full Bi-Factor model
Model Fit Results
• Best model based on information criteria (table below)
• In every grade and test administration, a bi-factor model fit best
  - Question: what do the latent variables mean?
• The data may be multidimensional
  - We may be able to harness this dimensionality to provide more feedback to students
Grade | Fall 1 | Fall 2  | Winter
3rd   | BF-DCM | BF-DCM  | BF-DCM
4th   | BF-DCM | BF-DCM  | BF-DCM
5th   | BF-DCM | BF-DCM  | BF-DCM
6th   | BF-DCM | BF-DCM  | BF-MIRT
7th   | BF-DCM | BF-MIRT | BF-MIRT
8th   | BF-DCM | BF-MIRT | BF-DCM
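Model selection of this kind typically compares information criteria across the fitted candidates. A sketch with hypothetical log-likelihoods and parameter counts (not the study's values):

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion; lower is better."""
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return -2 * loglik + n_params * math.log(n_obs)

# Hypothetical log-likelihoods and parameter counts for one grade/test
candidates = {"IRT": (-10450.2, 60), "MIRT": (-10380.7, 68),
              "BF-MIRT": (-10301.4, 96), "BF-DCM": (-10295.9, 88)}
n_obs = 2500

for name, (ll, p) in candidates.items():
    print(f"{name}: AIC = {aic(ll, p):.1f}, BIC = {bic(ll, p, n_obs):.1f}")
print("best by BIC:", min(candidates, key=lambda m: bic(*candidates[m], n_obs)))
```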
LONGITUDINAL MEASUREMENT ISSUES AND PRACTICES
Issues with Measurement Longitudinally
• Invariance of item parameters: do items function similarly across time?
• Reliability of measurement at each time point: continuous versus categorical latent variable models
• Reliability of the measurement of growth
  - The number of measurements needed for adequate reliability may be impossible to reach
  - Additional measures may be needed (formative tests, homework assignments, etc.)
• Longitudinal modeling of latent variables
  - Two-stage modeling propagates measurement error
  - DCMs (if used) have implementation issues longitudinally
  - Parameters for change are numerous (but help is on the way)
Testing Practices that Affect Measurement of Growth
• The most important design element for measuring growth is having students take overlapping items across time
  - Current implementations confound growth with item change: it is impossible to discern change in item properties from growth in student ability
• Items (or testlets) can come from an overall pool, randomly assigned to each examinee across time
  - Some overlap is needed
• The validity of modeling results needs verification
  - Perhaps the most important point
CONCLUDING REMARKS
Interim Assessments: Potential for More Feedback
• Interim assessments are part of the current (and new) testing environment and are likely here to stay
  - Better psychometric modeling will increase their usefulness
• There may be more dimensionality in tests than we have previously realized
  - Which dimensions these are remains to be determined
• The key to any longitudinal assessment program is utility
  - If a test is built that does not serve its purpose, what good is it?
• We can make interim assessments part of the learning process and help students grow
Thank You!
• Questions? Comments? Complaints? References?
• Contact me at:
• Talk slides available at:
http://projects.coe.uga.edu/jtemplin
• Thank you!