Demonstrating validity for high-impact language tests - EALTA
TRANSCRIPT
Demonstrating validity
Nivja de Jong & Jelle Goeman
Work in Progress - PLEASE DO NOT QUOTE WITHOUT PERMISSION
What is validity?
• Construct validity
• Criterion validity
• Face validity
• Content validity
• Consequential validity
Validity, back to basics
• Cattell, 1946; Kelley, 1927; Borsboom et al., 2004:
whether an instrument actually measures what it sets out to measure
• Borsboom et al. (2004): “a test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes”
What is validity?
• Whether an instrument actually measures what it sets out to measure
• To demonstrate validity, we need theory that specifies the processes that bring about the causal effect between variations in the attribute and variation in measurement outcome
Item difficulty should be theoretically predictable
What does a valid test look like?
• Item difficulty should be theoretically predictable
• Some form of unidimensionality of a test or sub-test is important because we want to summarize over all items
• Purposeful construction of items: sum-score should summarize
How do we check validity?
• In practice: what do we do about validity?
– Post hoc, relate item difficulty to item characteristics?
– Correlation between test scores and scores from other tests?
– Unidimensionality: Cronbach’s alpha?
– …
• Is this OK?
Correlation between test scores and scores from other tests?
• A new test on English language proficiency should be strongly related to our old (previously validated) test of English language proficiency
• A new scale to measure weight should be strongly related to our old (previously validated) scale that measured weight
r = .8? Are we measuring the same construct?
How do we check validity?
• In practice: what do we do about validity?
– Post hoc, relate item difficulty to item characteristics?
– Correlation between test scores and scores from other tests?
– Unidimensionality: Cronbach’s alpha?
– …
• Is this OK?
Cronbach’s alpha: persisting confusion
• Cronbach’s alpha is intended to measure reliability (test–retest)
• Reliability is not the same as unidimensionality / internal consistency.
Cronbach’s alpha: persisting confusion
• Sijtsma, 2009:
“A single number—alpha—that expresses both reliability and internal consistency—conceived of as an aspect of validity that suggests that items “measure the same thing”—is a blessing for the assessment of test quality. In the meantime, alpha “only” is a lower bound to the reliability and not even a realistic one”
Cronbach’s alpha: persisting confusion
• Cronbach’s alpha is used as:
– Test–retest reliability
– Agreement between judges
– Unidimensionality across items
• In textbooks and handbooks: “alpha > .x allows you to sum the scores”
• But is this true? Does alpha reveal anything about summability?
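Cronbach’s alpha can be computed directly from the quantities used later in the deck: v, the sum of the item variances, and c, the variance of the sum-score. A minimal sketch in Python/NumPy:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_subjects, k_items) score matrix.

    Standard formula: alpha = k/(k-1) * (1 - v/c), where v is the sum
    of the item variances and c is the variance of the sum-score
    (i.e. the sum of all item variances and covariances).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    v = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    c = scores.sum(axis=1).var(ddof=1)     # variance of the sum-score
    return k / (k - 1) * (1 - v / c)
```

For k parallel items with average inter-item correlation r, this reduces to the Spearman-Brown-style value k·r / (1 + (k − 1)·r), which already hints that alpha grows with k.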
Simulation 1
• We simulate subjects with either one skill or two unrelated skills
• Each skill is tested with k items
• We make a test with all k or 2k items
• Item score = ability + noise
• Ability explains 50% of variance of each item
• k ranges from 2 to 50.
• We calculate Cronbach’s alpha for one and two skills within one test
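The recipe above can be sketched as follows, under the stated assumptions (normally distributed abilities and noise with equal unit variance, so ability explains 50% of each item's variance):

```python
import numpy as np

def cronbach_alpha(scores):
    """alpha = m/(m-1) * (1 - v/c) for an (n_subjects, m_items) matrix."""
    m = scores.shape[1]
    v = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    c = scores.sum(axis=1).var(ddof=1)     # variance of the sum-score
    return m / (m - 1) * (1 - v / c)

def simulate_test(n_skills, k, n_subjects=2000, seed=1):
    """Item score = ability + noise, k items per skill, unit variances."""
    rng = np.random.default_rng(seed)
    abilities = rng.normal(size=(n_subjects, n_skills))
    return np.concatenate(
        [abilities[:, [s]] + rng.normal(size=(n_subjects, k))
         for s in range(n_skills)], axis=1)

for k in (2, 10, 50):
    one = cronbach_alpha(simulate_test(1, k))
    two = cronbach_alpha(simulate_test(2, k))
    print(f"k={k:2d}  one skill: {one:.2f}  two skills: {two:.2f}")
```

Under these assumptions theory gives alpha of roughly .67, .91 and .98 for one skill at k = 2, 10, 50, and roughly .44, .86 and .97 for two unrelated skills: with enough items, alpha is high either way.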
Cronbach’s alpha for one or two constructs within one test
Cronbach’s alpha is an excellent measure of test length!
The concept of summability
Unidimensionality?
• Alpha is not about unidimensionality but about reliability
• Factor analysis is a good alternative
• BUT is unidimensionality actually what we are looking for?
– Example: a miscoded multiple-choice item will happily load onto the strongest first factor
Summability:
• We need the sum-score to summarize, not just any factor!
• Items are purposefully constructed
Measuring summability
• How much of the variance of the item scores is captured by the sum-score?
• Our definition of summability: percentage of total item variance explained by the sum-score
• Like an R2 in regression
• Comes in unadjusted and adjusted form
Summability formula
• Unadjusted: c / (k · v)
• Adjusted: (c − v) / ((k − 1) · v)
• v is the sum of all item variances
• c is the sum of all item variances and covariances
• k is the number of items
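In code, with v, c and k as defined above. The formulas here are reconstructed from those definitions (the adjusted form is the average inter-item covariance divided by the average item variance) and may differ in detail from the authors' exact version:

```python
import numpy as np

def summability(scores, adjusted=True):
    """Summability of an (n_subjects, k_items) score matrix.

    v = sum of item variances; c = sum of all item variances and
    covariances (= variance of the sum-score); k = number of items.
    Reconstructed formulas (an assumption, not necessarily the
    authors' exact definition):
        unadjusted: c / (k * v)
        adjusted:   (c - v) / ((k - 1) * v)
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    v = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    c = scores.sum(axis=1).var(ddof=1)     # variance of the sum-score
    return (c - v) / ((k - 1) * v) if adjusted else c / (k * v)
```

The adjustment removes each item's trivial correlation with itself: the adjusted value is 0 for mutually uncorrelated items and 1 for perfectly parallel ones, and unlike alpha it does not grow mechanically with test length.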
Simulation 2
• Two unrelated skills or one skill
• Each skill tested with k items
• A test with all k or 2k items
• Item score = ability + noise
• Ability explains 50% of variance of each item
• k ranges from 2 to 50.
• We compare Cronbach’s alpha with Summability
Simulation 2: comparing Cronbach’s alpha with Summability for one construct
Simulation 2: comparing Cronbach’s alpha with Summability for two constructs
Recap valid test characteristics
• Item difficulty should be theoretically predictable
• Some form of unidimensionality of a test or sub-test is important because we want to summarize over all items
• Purposeful construction of items: sum-score should summarize
Variation between items
• Item difficulty should relate to the theoretically grounded item characteristics
• Example: in a multiple-choice reading comprehension test
– texts differ in (linguistic) difficulty (sentence length, …)
– answer options differ in plausibility
• In a valid MC reading test, the (linguistic) difficulty of the texts predicts item difficulty
• This is easy to check with a correlation or regression analysis
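For instance, such a check might look like the following; `log_freq` and `prop_correct` are made-up illustrative values, not data from this deck:

```python
import numpy as np

# Hypothetical example: item difficulty (proportion correct) for 10
# items, and a theoretically motivated item characteristic such as
# the log frequency of the tested word. Values are invented purely
# for illustration.
log_freq = np.array([8.2, 7.9, 7.1, 6.8, 6.0, 5.5, 4.9, 4.1, 3.6, 3.0])
prop_correct = np.array([.95, .91, .88, .80, .74, .70, .61, .55, .48, .40])

# Simple linear regression: does the characteristic predict difficulty?
slope, intercept = np.polyfit(log_freq, prop_correct, 1)
r = np.corrcoef(log_freq, prop_correct)[0, 1]
print(f"slope = {slope:.3f}, R^2 = {r ** 2:.2f}")
```

A substantial R2, with the slope in the theoretically expected direction, is the kind of post-hoc evidence the bullet above asks for.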
Application 1: productive vocabulary knowledge
• 90 vocabulary items completed by 198 participants (binary item scores).
Item example: “Het was gisteren dinsdag, dus is het v___ woensdag.” (“Yesterday it was Tuesday, so t___ it is Wednesday.”)
Tested words were taken from 10 frequency bands (rank 1–1000, 1001–2000, …)
• Is this a valid test of vocabulary knowledge? – Can we sum the scores?
– Can we predict item difficulty in a theoretically grounded manner?
De Jong et al., 2012; Hulstijn et al., 2012
Application 1: productive vocabulary knowledge
• Summability of item-scores:
– Summability: .28
– (Cronbach’s alpha: .97)
• Item difficulty:
– Calculated as number of correct answers / number of all answers
– Related to (log) tested word frequency: R2 = .45
De Jong et al., 2012; Hulstijn et al., 2012
Application 2: human ratings
• 100 judges rating 5 different aspects (5 groups of 20 judges) for 90 speech samples:
1. Fluency (pauses, speed, and repairs)
2. Pausing
3. Speed
4. Repairs
5. Accentedness
• Are these valid measures of fluency/accent?
– Can we sum the scores for each group of 20 judges?
– Can we predict ‘item difficulty’ in a theoretically grounded manner?
Bosker et al., 2013; Pinget et al., submitted
Application 2: human ratings
• Summability of judge scores for group 1 (fluency):
– Summability: .56
– (Cronbach’s alpha: .97)
• NB: collapsing over the 80 judges in groups 2–5 (pausing, speed, repairs, accentedness):
– Summability: .25
– (Cronbach’s alpha: .97)
• ‘Item difficulty’: mean judge score
– Calculated as the mean score over all judges
– Related to a combination of objectively measured fluency characteristics of the speech samples: R2 = .84
Bosker et al., 2013; Pinget et al., submitted
Discussion
On summability:
• In the end a (sub)test is reduced to a single sum-score; validity is relevant for that sum-score
• The concept of summability is useful in language testing practice whenever scores are summarized in a single sum-score
• Whether .28 and .56 are high or low: more experience is needed
On validity:
• Purposeful construction of items: item characteristics that are identified beforehand must relate to post-hoc item difficulty
Questions?