
Validity of Standard-setting Outcomes

Michael Kane

Educational Testing Service

Oslo, September 2015

Unpublished Work Copyright © 2013 by Educational Testing Service. All Rights Reserved. These materials are an unpublished, proprietary work of ETS. Any limited distribution shall not constitute publication. This work may not be reproduced or distributed to third parties without ETS's prior written consent. Submit all requests through www.ets.org/legal/index.html.

Educational Testing Service, ETS, the ETS logo, and Listening. Learning. Leading. are registered trademarks of Educational Testing Service (ETS).


A Tragic Story

• On June 17, 1998, almost thirty million Americans became clinically overweight and several million became clinically obese.

• This catastrophe was not caused by an eating binge.

• Rather, the U.S. National Institutes of Health (NIH) changed its cutscores for these diagnoses on the body mass index (BMI), as sketched below.

• Changes in the cutscores on tests can have similarly abrupt and dramatic effects.
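To make the BMI example concrete, here is a minimal sketch (not from the talk) of how a standard of this kind operates: the score is computed by a fixed formula, and the diagnosis is just a comparison of that score to the cutscores the NIH adopted in 1998 (25 for overweight, 30 for obese). The function names and the example values are illustrative.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

def classify_bmi(score: float) -> str:
    """Apply the 1998 NIH cutscores: 25 for 'overweight', 30 for 'obese'."""
    if score >= 30:
        return "obese"
    if score >= 25:
        return "overweight"
    return "not overweight"

# Example: a fixed person, classified under the 1998 cutscores.
person = bmi(78.0, 1.75)        # about 25.5
print(classify_bmi(person))     # 'overweight' under the 1998 cutscore of 25
# Under the earlier, higher cutscore for 'overweight', the same BMI
# would not have triggered the diagnosis: the person did not change, the standard did.
```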


Arbitrariness and Subjectivity

• Glass (1978) suggested that the results of educational standard setting tend to be arbitrary, and that it is “...wishful thinking to base a grand scheme on a fundamental unsolved problem” (p. 237).

• Hambleton, Popham, Shepard, and others acknowledged that standards are judgmental, but argued that they need not be arbitrary in the sense of being unjustified or capricious.


Degree of Arbitrariness

• All standards are arbitrary to some extent.

• Some standards are more arbitrary than others.

• The extent to which the arbitrariness is a problem depends on how much it interferes with the intended use of the standard.

• If it is a problem, arbitrariness needs to be controlled by providing support for the standard, or by changing its use.


The NIH Standards

• The 1998 changes in the BMI cutscores were developed subjectively (by a committee), but they were supported by clinical research, and their general locations were not arbitrary.

• Their exact values were a bit arbitrary, but that ambiguity was not too serious.

• In cases where the evidence is less clear or more precision is needed, health agencies can defer action or provide weaker advice.


Why Do We Set Standards?

• By setting a standard (e.g., a passing score on a test), we can change a subjective evaluation into a simple, objective comparison of a test score to the passing score.

• To the extent that the standard is applied uniformly, it promotes fairness.


Standard Setting Can Be Very Useful

• Standard setting adds a layer of interpretation (involving a series of qualitative levels) to some underlying assessment or judgment.

• By doing so, it can change subjective evaluations of “more or less” into objective decisions based on the level assignment.

• To the extent that the resulting standard is widely adopted, the interpretation is transportable (e.g., CEFR).


Objectivity and Fairness

• Objectivity is especially valued in contexts where fairness is a major issue (e.g., in high-stakes testing):

Scientific objectivity thus provides an answer to a moral demand for impartiality and fairness. Quantification is a way of making decisions without seeming to decide. (Porter, 1995, p.8).

• Instead of telling me that I am getting fat, my doctor can tell me that my BMI indicates that I am overweight (much nicer, more scientific).


Acceptability to Stakeholders

• To be widely accepted, the standards have to meet certain criteria:
  – They have to be based on relevant data.
  – They have to be applied consistently.
  – They have to be at an appropriate level, or at least, not at an inappropriate level.
  – They have to support the claims associated with the levels.

• These criteria are basically issues of validity.


Standard Setting as Policy Making

• Standards are “set”, not found or estimated.

• They don’t exist until we develop them.

• The issue is not whether the standards are accurate, but rather, whether they work, in the sense that they achieve their intended goals at acceptable cost.

• The consequences of standards-based decisions need to be acceptable to relevant stakeholders.


Overview

• Argument-based Validation

• Standard Setting (Broadly Conceived)

• Validating Standards-based Interpretations and Decisions


Validity: an Argument-based Approach


Argument-based Validation

• It is the proposed interpretations and uses of scores that are validated, not the test itself.

• The validity of a proposed interpretation/use depends on how well the evidence supports the claims being made.

• More ambitious claims require more support.


The Argument-based Approach Employs Two Kinds of Argument.

• An interpretation/use argument (IUA) specifies the proposed score interpretations and uses as a chain of inferences and assumptions leading from the test performances to the interpretation and use.

• The validity argument evaluates the IUA’s coherence and completeness and the plausibility of its inferences and assumptions.


IUAs are Presumptive Arguments

• Presumptive arguments make real-world claims, based on warranted inferences (and reasonable assumptions).

• The IUA is presumptive in that it makes the claims plausible/reasonable, rather than certain.

• To the extent that it is accepted, it establishes a presumption in favor of the claims being made.


Toulmin’s Model for Presumptive Inferences


[Toulmin diagram: a Datum, via a Warrant (supported by Backing, subject to Exceptions), leads to a Qualified Claim.]

The warrants are "if-then" rules.



The warrants support specific kinds of inferences.


Warrants and Backing

• Scoring inference from test performance to score
  – SME judgment, interrater analyses, model fit

• Generalization from score to expected score or latent variable
  – Reliability/generalizability studies, model fit

• Extrapolation from test domain to other contexts
  – Regression analyses, or task analyses

• Decision based on the score
  – Consequences, positive and negative


Evaluative Criteria for IUAs

Clarity and completeness of the argument: The IUA should represent the score-based claims accurately and completely.

Coherence of the IUA: The chain of inferences from observed performances to conclusions and decisions should be sound.

Plausibility of inferences and assumptions: The assumptions should be plausible (based on relevant evidence), and alternative assumptions should be considered.


Standards-based Inference

• Standard setting adds a layer to the score interpretation and adds an inference (from test scores to levels) to the IUA.

• For the interpretation and use of the scores to be valid, the warrant for this additional inference has to be justified.

• The other inferences in the IUA also have to have adequate backing.


Standard Setting

• Prototype Standards

• Empirical Methods
  – (Dosage-response curves)

• Judgmental Standard Setting


Prototype Standards

• An ordered sequence of prototypes (or examples) can be used to specify the levels.

• For example, in scoring essays on an n-point scale, scorers are often trained and monitored using prototype essays for each score point.


Standard Kilogram & Standard Meter


Secondary Prototypes (with many levels)


Using the Metric Standards

• The metric standards were arbitrary.

• But the use of the metric prototype standards is objective (i.e., does not involve much judgment).

• The meaning assigned to these standards is clear and free of excess meaning.


Benefits of Prototype Standards

• The standards are stable and can be transportable.

• To the extent that the standards are accepted, they can change vague, variable, and context-based descriptions into standardized, objective claims.

• They can change subjective decisions into objective decisions (e.g., postal weights).

• They are value free.


Prototype Scales in Education

• These kinds of prototype standards may seem very different from educational standard setting, but they have essentially the same purpose.

• Ayres’ scale for handwriting is a classic example of this kind of prototype scale in education.


Ayres Handwriting Scale (1920)


Metric Standards: Arbitrary but Useful

• The metric prototype standards are mainly valued for their consistency and convenience.

• The choices made in developing the metric prototypes were quite arbitrary, but that has not interfered with their utility.

• They are generally accepted by the scientific and commercial stakeholders, in part because they have little excess meaning.


Empirical Standard Setting: Dosage-Response Curves

• An empirical relationship between an input (the “dosage”) and a desirable outcome (the “response”) can be helpful in setting a standard dosage.

• If the relationship is roughly logistic, it can suggest and support an appropriate standard.
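As an illustration of this idea, the sketch below (not part of the talk) fits a logistic curve to invented dosage/response data with scipy and reads off the dosage at which the fitted response probability reaches a chosen target; the choice of target (here 90%) remains a judgment.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(dose, d50, slope):
    """Probability of a positive response as a logistic function of dose."""
    return 1.0 / (1.0 + np.exp(-slope * (dose - d50)))

# Illustrative data: observed response rates at several dosage levels.
dose = np.array([5, 10, 15, 20, 25, 30, 35, 40], dtype=float)
response_rate = np.array([0.02, 0.05, 0.15, 0.45, 0.80, 0.95, 0.98, 0.99])

(d50, slope), _ = curve_fit(logistic, dose, response_rate, p0=[20.0, 0.3])

# A standard dosage might be read off where the fitted curve reaches,
# say, a 90% response rate (the 90% target is itself a policy judgment).
target = 0.90
standard_dose = d50 + np.log(target / (1 - target)) / slope
print(f"estimated d50 = {d50:.1f}, dose for 90% response = {standard_dose:.1f}")
```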


#1: An Easy Case


#2: An Intermediate Case


#3: A Hard Case


Implications of a DR Curve for a Standard Dosage

• #1 suggests that the dosage should be about 30.

• #2 suggests that the dosage could fall anywhere from about 15 or 20 up to about 40 or 45.

• #3 does not provide much help in picking a dosage.

• Without additional constraints, establishing a standard dosage within these ranges can be arbitrary.


Reducing the Ambiguity for Cases #2 or #3

• A common way to get a better fix on an appropriate standard is to evaluate the consequences (e.g., side effects) for various dosages.

• One might go for the strongest response, while avoiding serious side effects.

• The issue is one of balancing positive and negative consequences.
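A minimal sketch of this balancing act, with made-up numbers: pick the dosage that maximizes the response rate minus a weighted side-effect rate, where the weight expresses how heavily negative consequences count against the goal.

```python
import numpy as np

dose = np.arange(10, 51, 5)
# Illustrative curves: response rises with dose, but so do serious side effects.
response     = np.array([0.10, 0.35, 0.60, 0.78, 0.88, 0.93, 0.96, 0.97, 0.98])
side_effects = np.array([0.01, 0.02, 0.04, 0.07, 0.12, 0.20, 0.32, 0.48, 0.65])

weight = 2.0   # how much worse a serious side effect is than a missed cure (a judgment)
net_benefit = response - weight * side_effects

best = dose[np.argmax(net_benefit)]
print(f"dose maximizing net benefit (weight={weight}): {best}")
```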


Avoiding Serious Side Effects


DR Curves in Educational Standard Setting

• To translate this approach to education, we would replace the response with the probability, p, of being at or above level-n, and the dosage with a test score.

• The cutscore for level-n can be set at the test score for which p = 0.5, or it can be adjusted based on concerns about false positives and false negatives.

• Unfortunately, we don’t usually have a good a priori way of identifying which test takers are at or above level-n.
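A minimal sketch of this translation, assuming we did have an a priori classification of test takers (the data, names, and the 0.7 threshold are all illustrative): fit a logistic regression of level membership on test score, take the score at which the estimated probability is 0.5, and shift the threshold when false positives are of greater concern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: test scores and an (assumed) a priori classification of
# whether each test taker is at or above the target level.  Obtaining this
# classification is exactly the hard part in practice.
scores = np.array([12, 15, 18, 20, 22, 24, 26, 28, 30, 33, 36, 40]).reshape(-1, 1)
at_or_above = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(scores, at_or_above)
b0, b1 = model.intercept_[0], model.coef_[0, 0]

# Cutscore where the estimated probability of being at/above the level is 0.5.
cutscore_p50 = -b0 / b1

# To make false positives rarer, require a higher probability, e.g. 0.7.
p = 0.7
cutscore_p70 = (np.log(p / (1 - p)) - b0) / b1
print(f"cutscore at p=0.5: {cutscore_p50:.1f}; at p=0.7: {cutscore_p70:.1f}")
```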


Relevance of Empirical Methods to Educational Standard Setting

• This empirical, curve-based method is used to some extent:
  – in examinee-centered methods (e.g., contrasting groups)
  – in placement testing
  – for screening tests

• However, in setting standards, we often act as if we have Case #1, even though the data are more like Case #3.


Balancing Consequences

• Standard setting has a goal.

• The goal may be to cure patients or to have students achieve some level of competence in some area.

• In setting the standards, we want to achieve the goal (a positive consequence) without too many negative consequences.

• It is necessarily a balancing act.

• Goals and side effects are easier to deal with when they are well defined and specific.


Judgmental Standard Setting

• Judgmental standard setting involves the use of “judges” to set cutscores on a test to represent particular levels of performance.

• I will refer to the levels of performance as target performance levels (TPLs).

• The TPL provides a conceptual specification of the Standard.

• The cutscore provides an operational specification of the Standard.
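As a concrete illustration (the talk does not commit to any particular method), the sketch below implements the arithmetic of one widely used test-centered judgmental procedure, a modified Angoff approach: panelists judge, for each item, the probability that a test taker who is just at the TPL would answer correctly, and each panelist's implied cutscore is the sum of those ratings. All ratings shown are invented.

```python
import numpy as np

# Illustrative ratings: judged probabilities that a borderline ("just at the
# TPL") test taker answers each item correctly.
# rows = panelists, columns = items
ratings = np.array([
    [0.7, 0.5, 0.8, 0.4, 0.6, 0.9, 0.3, 0.6],
    [0.6, 0.6, 0.7, 0.5, 0.5, 0.8, 0.4, 0.6],
    [0.7, 0.4, 0.9, 0.4, 0.7, 0.9, 0.2, 0.5],
])

# Each panelist's implied cutscore is the sum of that panelist's item ratings.
panelist_cutscores = ratings.sum(axis=1)

# The recommended cutscore is typically the mean across panelists,
# often reported with its standard error as a measure of consistency.
cutscore = panelist_cutscores.mean()
se = panelist_cutscores.std(ddof=1) / np.sqrt(len(panelist_cutscores))
print(f"panel cutscore: {cutscore:.2f} (SE {se:.2f}) out of {ratings.shape[1]} points")
```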


An Iterative Process

• The panelists start the process with preliminary versions of the TPLs, which are used to identify corresponding cutscores.

• If the panelists encounter ambiguities, they may clarify the descriptions of the TPLs.

• The TPLs define a new level of interpretation, and the cutscores provide a rule (or warrant) for going from a test score to a TPL.


Standard Setting as Policy Making

• Standard setting turns a general policy outline into a more explicit, detailed, and operational policy.

• The policy is outlined in advance, as TPL descriptions and labels are specified.

• But the TPL descriptions tend to be fleshed out and revised during standard setting, and the cutscores are identified during standard setting.


The Validity of Standards-based Interpretations


TPL Inferences

• The cutscores are used to assign test takers to levels based on their test scores.

• For this TPL inference to be plausible, it needs backing (evidence) that the cutscore reflects the requirements in the TPL.
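Once the cutscores exist, the level assignment itself is a mechanical comparison; what needs justification is the cutscores, not their application. A minimal sketch, with illustrative labels and cutscores:

```python
def assign_level(score: float, cutscores: dict[str, float]) -> str:
    """Assign the highest level whose cutscore the score meets or exceeds.

    `cutscores` maps level labels to minimum scores; scores below the
    lowest cutscore fall in "Below Basic".
    """
    level = "Below Basic"
    for label, cut in sorted(cutscores.items(), key=lambda kv: kv[1]):
        if score >= cut:
            level = label
    return level

# Illustrative cutscores on a hypothetical test.
print(assign_level(35, {"Basic": 20, "Proficient": 32, "Advanced": 41}))  # Proficient
```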


Labels (Merriam-Webster)

• Basic
  – forming or relating to the most important part of something
  – forming or relating to the first or easiest part of something
  – not including anything extra

• Proficient
  – good at doing something


Evidence/Backing for TPL Warrants

• Procedural - methods used to define the TPLs and to set the corresponding cutscores should be consistent with the intended use of the cutscores.

• Internal Consistency – the results should be internally consistent.

• External Criteria - TPL assignments should be consistent with appropriate external criteria.


Procedural Evidence Can Support a TPL Inference

• The validity of metric measurements is taken for granted because:
  – the interpretation and use of the prototypes is closely tied to their development and use,
  – and the labels assigned to the standards have little excess meaning.

• Similarly, empirical standards based on DR curves provide strong support for the cutscores, because they reflect the goal (the “response”).


Procedural Evidence for Judgmental Standards

• Relevance of test content and format to use
• Specification of goals of the process
• Sampling of judges
• Training of judges
• Sampling of items or test-taker performances
• Appropriate feedback to judges
• Confidence of judges


Evaluating Procedural Evidence

• Procedural evidence can be decisive in undermining validity, but cannot, in itself, justify the TPL inference.

• For judgmental standard setting, the inherent subjectivity of the process makes it suspect to anyone who disagrees with the outcomes.

• Additional evidence is called for, especially in high-stakes contexts.


Internal Consistency Evidence

• Precision (reliability or generalizability) over judges, panels, occasions.

• For test-centered methods, agreement between item ratings and empirical item difficulties.

• Reasonableness of changes in ratings over rounds of the standard-setting process.

• Again, discrepancies undermine validity claims, but consistency is less decisive:
  – Consistency is necessary but not sufficient.
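A minimal sketch of two such checks, using invented Angoff-style ratings and empirical item difficulties: the correlation between the panel's mean item ratings and the items' observed p-values, and the spread of the implied cutscore across panelists.

```python
import numpy as np

# rows = panelists, columns = items (judged probabilities for a borderline test taker)
ratings = np.array([
    [0.7, 0.5, 0.8, 0.4, 0.6, 0.9, 0.3, 0.6],
    [0.6, 0.6, 0.7, 0.5, 0.5, 0.8, 0.4, 0.6],
    [0.7, 0.4, 0.9, 0.4, 0.7, 0.9, 0.2, 0.5],
])
# Empirical item difficulties (proportion correct in the tested population).
p_values = np.array([0.75, 0.55, 0.85, 0.40, 0.65, 0.90, 0.35, 0.60])

# Agreement between judged and empirical difficulty (test-centered methods).
mean_ratings = ratings.mean(axis=0)
r = np.corrcoef(mean_ratings, p_values)[0, 1]

# Precision of the implied cutscore over panelists.
panelist_cutscores = ratings.sum(axis=1)
sd = panelist_cutscores.std(ddof=1)

print(f"rating/difficulty correlation: {r:.2f}; cutscore SD over panelists: {sd:.2f}")
```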


Some External Criteria

• Comparisons of level assignments to direct measures of the TPL performances.

• Comparisons to performances suggested by TPLs and labels (e.g., “advanced”)

• Comparisons to the results of other standard-setting methods.

• Comparisons to level assignments based on different tests or assessments.

• Comparisons to group-level data (percentiles).
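A minimal sketch of one such comparison, with invented data: cross-tabulate the test-based level assignments against assignments based on a direct criterion measure, and report exact agreement and Cohen's kappa (here via scikit-learn).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

levels = ["Below Basic", "Basic", "Proficient", "Advanced"]

# Illustrative level assignments for the same test takers.
from_test      = ["Basic", "Proficient", "Basic", "Advanced", "Proficient",
                  "Below Basic", "Basic", "Proficient", "Advanced", "Basic"]
from_criterion = ["Basic", "Proficient", "Proficient", "Advanced", "Basic",
                  "Below Basic", "Basic", "Proficient", "Proficient", "Basic"]

exact_agreement = np.mean([a == b for a, b in zip(from_test, from_criterion)])
kappa = cohen_kappa_score(from_test, from_criterion)
table = confusion_matrix(from_criterion, from_test, labels=levels)

print(f"exact agreement: {exact_agreement:.2f}, kappa: {kappa:.2f}")
print(table)
```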


Suitability of Criteria

• The quality of a criterion depends on its relevance to the goal of standard setting:
  – Easy case: placement testing
  – Hard cases: HS diplomas, NAEP, CEFR

• Developing relevant, reliable criterion measures is always difficult, but it is easier if the TPLs involve specific, well-defined performance descriptions.


Criterion Data Can Clarify the TPLs

• Criterion-related analyses can help flesh out the meaning of the TPLs by indicating the kinds of performances associated with different levels,

• and they can provide checks on the claims being made in the TPL descriptors and labels,

• or they can help to sharpen the descriptors and labels.


Using Partial Criteria

• Getting satisfactory criterion measures for TPLs is difficult, at best.

• But it may be possible to get partial criteria that provide empirical checks on cutscores.

• The strategy is akin to scientific theory testing or program evaluation.

• For example, see:
  – Validity Argument for NAEP Reporting on 12th Grade Academic Preparedness for College (Ray Fields, 2013)


From Fields (2013)


Local Ambiguity

• In educational standard setting, we are imposing a sharp distinction where none exists to begin with.

• Wherever we set the cutscore, there will not be much substantive difference between a test taker with a score one point above the cutscore and one with a score one point below it.

• If we think in terms of DR curves, we may be closer to #3 than we are to #1.


The TPL Inference is Presumptive

• For the TPL inference to be accepted, it needs to be backed by adequate evidence.

• An effective response to charges of arbitrariness is a demonstration of an appropriate relationship between the standards and the goals of the program.

• Some ambiguity is OK, but we should try to get TPLs and cutscores that are in the right neighborhood.


“Take-away” Messages 1

• Standard setting adds a TPL inference to the IUA.

• For standards-based interpretations and uses to be valid, the TPL inference needs to be justified.

• The supporting evidence can be procedural, internal, and criterion-based.

• Criterion-based evidence is especially important in high-stakes contexts, where the standards tend to be contentious.


“Take-away” Messages 2

• As an exercise in policy making, standard setting is evaluated in terms of the consequences of the decisions being made.

• Standards are inherently judgmental, and therefore, to some extent, arbitrary.

• Some arbitrariness is OK, but the arbitrariness needs to be controlled, so that positive consequences outweigh negative consequences.


Finally

• In evaluating standards, the question is not whether we got it right.

• Rather, the question is whether the decisions based on the cutscores are reasonable, broadly acceptable, and have mostly positive consequences.


Thank You
