
Steven Viger
Lead Psychometrician
Michigan Department of Education
Office of Educational Assessment and Accountability

Measurement 102

2

Student Performance Measurement

• The previous session discussed some basic mechanics involved in psychometric analysis.
– Graphical and statistical methods

• The focus of this session is on the interpretation of the data in light of the often-used terms reliability and validity.

• Some attention will also be paid to some of the higher-level psychometrics that go on behind the scenes.
– How the scale scores are REALLY made!

3

Making inferences from measurements

• The inferences one can make based solely on educational measurement are limited.

• The extent of the limitation is largely a function of whether or not evidence of the valid use of scores is accumulated.

• At times, the terms validity and reliability are confused. Unfortunately, these terms describe extremely different concepts.

4

Some basic validity definitions

• Validity

• The degree to which the assessment measures the intended construct(s)

• Answers the question, “are you measuring what you think you are?”

• More contemporary definitions focus on the accumulation of evidence for the validity of the inferences and interpretations made from the scores produced.

5

Some basic reliability definitions

• Reliability

• Consistency

• The degree to which students would be rank ordered the same if they were to be administered the same assessment numerous times.

– Actually, the assumption is based on an ‘infinite amount’ of retesting with no memory of the previous administrations…an unrealistic scenario.

6

More about reliability

• Reliability is one of the most fundamental requirements for measurement—if the measures are not reliable, then it is difficult to support claims that the measures can be valid for any particular decision.

• Reliability refers to the degree to which instrument scores for a group of participants are consistent over repeated applications of a measurement procedure and are, therefore, dependable and repeatable.

7

Reliability and Classical Test Theory

X = T + E
• True Score (T): a theoretical score for a person on an instrument that is equal to the average score for that person over an infinitely large number of 'retakes'.
• Error (E): the degree to which an observed score (X) varies from the person's theoretical true score (T).

In this context, reliability refers to the degree to which scores are free of measurement error for a particular group, if we assume the relationship between observed and true scores is as depicted above.
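As a rough illustration of the X = T + E decomposition, the sketch below simulates true scores and errors and estimates reliability as the ratio of true-score variance to observed-score variance; the distributions and sample size are arbitrary assumptions, not MDE values.

# Sketch: simulate classical test theory scores and estimate reliability
# as the ratio of true-score variance to observed-score variance.
import random

random.seed(0)
n_students = 1000
true_scores = [random.gauss(50, 10) for _ in range(n_students)]   # T
errors      = [random.gauss(0, 5) for _ in range(n_students)]     # E
observed    = [t + e for t, e in zip(true_scores, errors)]        # X = T + E

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))   # close to 10**2 / (10**2 + 5**2) = 0.80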

8

‘Unreliability’ AKA the standard error of measurement

• The standard error of measurement (SEM) is an estimate of the amount of error present in a student’s score.

– If X= T + E, the SEM serves as a general estimate of the ‘E’ portion of the equation.

• There is an inverse relationship between the SEM and reliability. Tests with higher reliability have smaller SEMs.  

• Reliability coefficients are indicators that reflect the degree to which scores are free of measurement error.
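A small worked example of this inverse relationship, using the standard formula SEM = SD * sqrt(1 - reliability); the standard deviation and reliability values below are illustrative only, not MEAP or MME figures.

# Sketch: the SEM shrinks as reliability rises, for a fixed score SD.
import math

score_sd = 25.0                      # illustrative scale-score standard deviation
for reliability in (0.70, 0.80, 0.90, 0.95):
    sem = score_sd * math.sqrt(1 - reliability)
    print(f"reliability {reliability:.2f} -> SEM {sem:.1f}")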

9

More on the Standard Error of Measurement

• The smaller the SEM for a test (and, therefore, the higher the reliability), the greater one can depend on the ordering of scores to represent stable differences between students.

– The higher the reliability, the more likely it is that the rank ordering of students by score is due to differences in true ability rather than random error.

– The higher the reliability, the more confident you can be in the observed score, X, being an accurate estimate of the student’s true score, T.

10

Standards for Reliability

• There are no mathematical ‘rules’ to determine what constitutes an acceptable reliability coefficient.

• Some advice:
• Individual-based decisions should be based on scores produced from highly precise instruments.
• The higher the stakes, the higher you will want your reliability to be.
• Group-based decisions in a research setting typically allow lower reliability.
• If you are making high-stakes decisions about individuals, you need reliabilities above .80 and preferably in the .90s.

11

Establishing validity

• Past practice has been to treat validity as if there were a criterion amount necessary to deem an instrument valid.

• That practice is outdated and inappropriate.
– Does not acknowledge that numerous pieces of information need to come together to facilitate valid inferences.
– Tends to discount some pieces of evidence and over-emphasize others.
– Leads to a narrowing of scope and can encourage one to be limited in their approach to gathering evidence.

12

Process vs. Product

• Rather than speak of validity as a thing, we need to start approaching it as an ongoing process that is fed by all aspects of a testing program: validation.
• The current AERA and APA standards for validity tend to treat the validation process like a civil court proceeding.
– A preponderance of the evidence is sought, with the evidence coming from multiple sources.

13

Validation from item evidence

• Focus is on elimination of "construct-irrelevant variance"
• Some ways this is accomplished:
– Well-established item development/review procedures
– Demonstrate alignment of individual items to standards
– Show the items/assessments are free of bias, quantitatively and qualitatively
– Simple item analyses: eliminate items with questionable stats (e.g., p-values too high, low point-biserial correlation, etc.)

14

Validation from scaled scores

• Scale score level validity evidence includes but is not limited to:
– Input from item-level validity evidence (the validity of the score scale depends upon the validity of the items that contribute to that score scale)
– Convergent and divergent relationships with appropriate external criteria
– Reliability evidence
– Appropriate use of a 'strong' measurement model for the production of student scores

15

Is it valid, reliable, or both?

– Low reliability, high validity
– Low reliability, low validity
– High reliability, high validity
– High reliability, low validity

16

Measurement models

• The measurement models used by MDE fall under the general category of Item Response Theory (IRT) models.

• IRT models depict the statistical relationship that results from person/item interactions.
– Specifically, statistical information regarding the persons and the items is used to predict the probability of correctly responding to a particular item; for a constructed-response item, it is the probability of a person receiving a specific score point from the rubric.

• Like all statistically based models, IRT models carry with them some assumptions; some are theoretical whereas others are numerical.

17

IRT assumptions

• Unidimensionality: there is a single underlying construct being measured by the assessment (e.g., mathematics achievement, writing achievement, etc.)
• As a result of the assumption of a single construct, the model dictates that we treat all sub-components (strand level, domain, subscales in general) as contributing to the single construct.
– Assumes that there is a high correlation between sub-components.
– It would probably be better to measure the sub-components separately, but that would require significantly more assessment items to attain decent reliability.

18

IRT assumptions

• Assumes that a more able person has a higher probability of responding correctly to an item than a less able person.
– Specifically, when a person's ability is greater than the item difficulty, they have a better than 50% chance of getting the item correct.
• Local independence: the response to one item is independent of and does not influence the probability of responding correctly to another item.
• The data fit the model!
– The item and person parameter estimates are reasonable representations of reality, and the data collected meet the IRT model assumptions.

19

The Rasch Model (MEAP and ELPA)

P(θ) = e^(θ - b) / (1 + e^(θ - b))

where θ is the person's ability and b is the item difficulty.
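A minimal sketch of this formula in Python; the function name and the example theta and b values are illustrative, not taken from any MEAP item.

# Sketch: probability of a correct response under the Rasch (1PL) model.
import math

def rasch_prob(theta, b):
    """P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return math.exp(theta - b) / (1 + math.exp(theta - b))

print(rasch_prob(theta=0.0, b=0.0))   # 0.5 when ability equals difficulty
print(rasch_prob(theta=1.0, b=0.0))   # ~0.73 when ability exceeds difficulty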

20

The Rasch Model (1 parameter logistic model)

• An item characteristic curve for a sample MEAP item.

[Figure: "Simple IRT Model"; probability of correct response (0.0 to 1.0) plotted against achievement (-3 to 3). Annotations mark the item difficulty, the inflection point, and the 50% probability of a correct response.]

21

The 3 Parameter Logistic Model (MME and MEAP Writing)

P(θ) = c + (1 - c) · e^(Da(θ - b)) / (1 + e^(Da(θ - b)))

where b is the item difficulty, a is the item discrimination, c is the lower asymptote ("guessability"), and D is a scaling constant (typically 1.7).
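A minimal sketch of the 3PL formula in the same style; the parameter values, including the D = 1.7 default, are illustrative assumptions rather than actual MME item parameters.

# Sketch: probability of a correct response under the 3 parameter logistic model.
import math

def three_pl_prob(theta, a, b, c, D=1.7):
    """P(correct) = c + (1 - c) * exp(D*a*(theta - b)) / (1 + exp(D*a*(theta - b)))."""
    z = D * a * (theta - b)
    return c + (1 - c) * math.exp(z) / (1 + math.exp(z))

# When ability equals difficulty, the probability is halfway between c and 1.
print(three_pl_prob(theta=0.0, a=1.2, b=0.0, c=0.20))   # 0.60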

22

The 3 Parameter Logistic Model

• An item characteristic curve for a sample MME item.

[Figure: "More Complex IRT Model"; probability of correct response (0.0 to 1.0) plotted against achievement (-3 to 3). Annotations mark the item difficulty, the slope at the inflection point (which indicates how well the item discriminates between high and low achievers), the item "guessability" (lower asymptote), and the point where the probability is halfway between the item "guessability" and 1.]

23

• Before I show you what a string of items looks like using IRT, I'd like to first point out some differences in the models that will lead to some major differences in the way the items look graphically.

• In particular, we need to pay attention to the differences in the formulas.

– Are there features of the 3PL model that do not appear in the 1PL model?

24

• In both models, the quantity driving the solution to the equation is the difference between person ability and item difficulty; θ - b.

• However, in one model, that relationship is altered and we cannot rely on the difference between ability and difficulty alone to determine the probability of a correct response to an item.

25

1PL vs. 3PL

• In the 1 parameter model, the item difficulty parameter (assuming the student's ability is a known and fixed quantity) and its difference from student ability drive the probability of a correct response. All other elements are constants in the equation.
– Hence the name, 1 parameter model.
– Therefore, when you see the plots of multiple items, they should only differ by a constant in terms of their location on the scale.

26

1PL vs. 3PL

• In the 3 parameter model, there are still constants, and the difference between ability and difficulty is still the critical piece. However, a, the discrimination parameter, has a multiplicative effect on the difference between ability and difficulty. Furthermore, the minimum possible result for the equation is influenced by the c parameter.
– If c > 0.00, the probability of a correct response must be greater than 0.
– Item characteristic curves will vary by location on the scale as well as by origin (c parameter) and slope (a parameter).
– Knowing how difficult an item is compared to another is still relevant but is not the only piece of information that leads to item differences.
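To make the contrast concrete, the short sketch below (with made-up parameter values) shows that under the 1PL model the probability depends only on the difference theta - b, while under the 3PL model the a and c parameters also change the result.

# Sketch: same theta - b difference, different probabilities once a and c enter.
import math

def logistic(z):
    return math.exp(z) / (1 + math.exp(z))

theta, b = 0.5, -0.5          # theta - b = 1.0 in every case below

# 1PL: only the difference matters, so both comparisons give the same probability.
print(logistic(theta - b), logistic((theta + 2) - (b + 2)))

# 3PL: a scales the difference and c raises the floor, so probabilities differ.
for a, c in ((0.8, 0.0), (1.5, 0.0), (0.8, 0.25)):
    print(a, c, c + (1 - c) * logistic(1.7 * a * (theta - b)))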

27

MEAP example (10 items scaled using Rasch)

[Figure: ten item characteristic curves (Items 1 through 10) plotted together; probability of correct response (0.0 to 1.0) against item difficulty / student achievement (-3 to 3). The curves share a common shape and differ only in their location on the scale.]

28

MME example (10 items scaled using the 3-PL model)

[Figure: ten item characteristic curves (Items 1 through 10) plotted together; probability of correct response (0.0 to 1.0) against item difficulty / student achievement (-3 to 3). The curves differ in location, slope, and lower asymptote.]

29

How do we get there?

• Although the graphics and equations on the previous screens may make conceptual sense, you may have noticed that the solution to the equations depends on knowledge of the values of some of the variables.

• We are psychometricians…not psychomagicians, so the numbers come from somewhere.

• The item and person parameters have to be estimated.
• We need a person-by-item matrix to begin the process; for example, a string of dichotomous (0/1) responses for MME Science (shown here in rows):

010100101111
000111101110
101111000101
100110110011
011000011101
001011010111
011001010011

30

IRT Estimation

• The person-by-item matrix is fed into an IRT program to produce estimates of item parameters and person parameters.
• An estimation algorithm is used, which is essentially a predefined process with 'stop and go' rules. The end products are best estimates of the item parameters and person ability estimates.
– Item parameters are the 'guessability', discrimination, and difficulty parameters.
– Person parameters are the ability estimates we use to create a student's scale score.
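The real estimation is carried out by specialized software (WINSTEPS, PARSCALE, and the like); purely as a flavor of where such an algorithm might start, the sketch below computes crude Rasch difficulty starting values from a toy person-by-item matrix using the log-odds of each item's proportion correct. It illustrates the idea, not the actual MDE procedure.

# Sketch: crude starting values for Rasch item difficulties from a 0/1 matrix.
# Real programs iterate from values like these until convergence.
import math

# Toy person-by-item matrix: rows are students, columns are items.
matrix = [
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]

n_items = len(matrix[0])
for j in range(n_items):
    p = sum(row[j] for row in matrix) / len(matrix)    # proportion correct
    p = min(max(p, 0.01), 0.99)                        # guard against 0 or 1
    difficulty = math.log((1 - p) / p)                 # harder items get higher values
    print(f"item {j + 1}: p = {p:.2f}, starting difficulty = {difficulty:+.2f}")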

31

Parameter Estimation

• For single parameter (item difficulty) models, WINSTEPS is the industry standard.

• More complex models like the 3 parameter model used in the MME require more specialized software such as PARSCALE.

• The estimation process is iterative but happens very quickly; most programs converge in less than 10 seconds.

– Typically, item parameters are estimated followed by person ability parameters.

32

Estimating Ability

• Once item parameters are known, we can use the item responses for the individuals to estimate their ability (theta).

• For the 3PL model, when people share the same response string (pattern of correct and incorrect responses) they will have the same estimate of theta.

• In the 1PL model, the raw score is used to derive the thetas.
– Essentially, the same raw score will generate different estimates of theta, but they are close. The program will create a table that relates raw scores to thetas to scale scores based on maximum likelihood estimation.
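As a simplified stand-in for the maximum likelihood estimation the scoring software performs, the sketch below finds the theta that maximizes the likelihood of one response string under the Rasch model, using a grid search and made-up item difficulties.

# Sketch: maximum likelihood estimate of theta via grid search, Rasch model.
import math

def rasch_prob(theta, b):
    return math.exp(theta - b) / (1 + math.exp(theta - b))

def log_likelihood(theta, responses, difficulties):
    ll = 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_prob(theta, b)
        ll += math.log(p) if x == 1 else math.log(1 - p)
    return ll

difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]   # assumed, already-estimated item difficulties
responses    = [1, 1, 1, 0, 0]               # one student's response string

grid = [i / 100 for i in range(-400, 401)]   # candidate thetas from -4.00 to 4.00
theta_hat = max(grid, key=lambda t: log_likelihood(t, responses, difficulties))
print(theta_hat)   # the theta that makes this response string most likely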

33

From theta to scale score

• Remember the following formula?
– y = mx + b
• That is an example of a linear equation.
• MDE uses linear equations to transform thetas to scale scores.
• There is a different transformation for each grade and content area.
• Performance levels are determined by the student's scale score.
– Cut scores are produced by standard setting panelists.
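A minimal sketch of such a linear transformation; the slope, intercept, and cut score below are invented for illustration and are not the values used for any actual grade or content area.

# Sketch: convert a theta estimate to a scale score with a linear transformation.
def theta_to_scale_score(theta, slope=25.0, intercept=500.0):
    # slope and intercept are illustrative; a different pair applies to each grade/content area
    return round(slope * theta + intercept)

proficient_cut = 525          # hypothetical cut score from standard setting
theta = 1.10
score = theta_to_scale_score(theta)
print(score, "proficient" if score >= proficient_cut else "not proficient")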

34

Summary

• In this session you found out a bit about reliability and validity.
– Two important pieces of information for any assessment.
– Remember, it is the validity of the inferences we make that is important.
• The evidence is accumulated and the process is ongoing.
• There are no 'types' of validity.

• You were also introduced to item response theory models and how they are used to produce MDE scale scores.

• The hope is that you leave with a greater understanding of how MDE assessments are scored, scaled, and interpreted.

• In addition, you now have some ‘tools’ that can assist you in your own analyses.

35

Contact Information

Steve Viger

Michigan Department of Education
608 W. Allegan St.
Lansing, MI 48909

(517) [email protected]