Demonstrating validity for high-impact language tests - EALTA
TRANSCRIPT
Demonstrating validity
Nivja de Jong & Jelle Goeman
Work in Progress - PLEASE DO NOT QUOTE WITHOUT PERMISSION
What is validity?
• Construct validity
• Criterion validity
• Face validity
• Content validity
• Consequential validity
Validity, back to basics
• Cattell, 1946; Kelley, 1927; Borsboom et al., 2004:
whether an instrument actually measures what it sets out to measure
• Borsboom et al. (2004): “a test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes”
What is validity?
• Whether an instrument actually measures what it sets out to measure
• To demonstrate validity, we need theory that specifies the processes that bring about the causal effect between variations in the attribute and variation in measurement outcome
Item difficulty should be theoretically predictable
What does a valid test look like?
• Item difficulty should be theoretically predictable
• Some form of unidimensionality of a test or sub-test is important because we want to summarize over all items
• Purposeful construction of items: sum-score should summarize
How do we check validity?
• In practice: what do we do about validity?
– Post hoc, relate item difficulty to item characteristics?
– Correlation between test scores and scores from other tests?
– Unidimensionality: Cronbach’s alpha?
– …
• Is this OK?
Correlation between test scores and scores from other tests?
• A new test on English language proficiency should be strongly related to our old (previously validated) test of English language proficiency
• A new scale to measure weight should be strongly related to our old (previously validated) scale that measured weight
r = .8? Are we measuring the same construct?
How do we check validity?
• In practice: what do we do about validity?
– Post hoc, relate item difficulty to item characteristics?
– Correlation between test scores and scores from other tests?
– Unidimensionality: Cronbach’s alpha?
– …
• Is this OK?
Cronbach’s alpha: persisting confusion
• Cronbach’s alpha is intended to measure reliability (test–retest)
• Reliability is not the same as unidimensionality / internal consistency.
Cronbach’s alpha: persisting confusion
• Sijtsma, 2009:
“A single number—alpha—that expresses both reliability and internal consistency—conceived of as an aspect of validity that suggests that items “measure the same thing”—is a blessing for the assessment of test quality. In the meantime, alpha “only” is a lower bound to the reliability and not even a realistic one”
Cronbach’s alpha: persisting confusion
• Cronbach’s alpha is used as:
– Test–retest reliability
– Agreement between judges
– Unidimensionality across items
• In textbooks and handbooks: “alpha > .x allows you to sum the scores”
• But is this true? Does alpha reveal anything about summability?
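Cronbach’s alpha can be computed directly from the quantities used later in the deck: v, the sum of the item variances, and c, the variance of the sum-score. A minimal sketch in Python/NumPy:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_subjects, k_items) score matrix.

    Standard formula: alpha = k/(k-1) * (1 - v/c), where v is the sum
    of the item variances and c is the variance of the sum-score
    (i.e. the sum of all item variances and covariances).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    v = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    c = scores.sum(axis=1).var(ddof=1)     # variance of the sum-score
    return k / (k - 1) * (1 - v / c)
```

For k parallel items with average inter-item correlation r, this reduces to the Spearman-Brown-style value k·r / (1 + (k − 1)·r), which already hints that alpha grows with k.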
Simulation 1
• We simulate subjects with either one skill or two unrelated skills
• Each skill is tested with k items
• We make a test with all k or 2k items
• Item score = ability + noise
• Ability explains 50% of variance of each item
• k ranges from 2 to 50.
• We calculate Cronbach’s alpha for one and two skills within one test
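The recipe above can be sketched as follows, under the stated assumptions (normally distributed abilities and noise with equal unit variance, so ability explains 50% of each item's variance):

```python
import numpy as np

def cronbach_alpha(scores):
    """alpha = m/(m-1) * (1 - v/c) for an (n_subjects, m_items) matrix."""
    m = scores.shape[1]
    v = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    c = scores.sum(axis=1).var(ddof=1)     # variance of the sum-score
    return m / (m - 1) * (1 - v / c)

def simulate_test(n_skills, k, n_subjects=2000, seed=1):
    """Item score = ability + noise, k items per skill, unit variances."""
    rng = np.random.default_rng(seed)
    abilities = rng.normal(size=(n_subjects, n_skills))
    return np.concatenate(
        [abilities[:, [s]] + rng.normal(size=(n_subjects, k))
         for s in range(n_skills)], axis=1)

for k in (2, 10, 50):
    one = cronbach_alpha(simulate_test(1, k))
    two = cronbach_alpha(simulate_test(2, k))
    print(f"k={k:2d}  one skill: {one:.2f}  two skills: {two:.2f}")
```

Under these assumptions theory gives alpha of roughly .67, .91 and .98 for one skill at k = 2, 10, 50, and roughly .44, .86 and .97 for two unrelated skills: with enough items, alpha is high either way.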
Cronbach’s alpha for one or two constructs within one test
Cronbach’s alpha is an excellent measure of test length!
The concept of summability
Unidimensionality?
• Alpha is not about unidimensionality but about reliability
• Factor analysis is a good alternative
• BUT is unidimensionality actually what we are looking for?
– Example: a miscoded multiple-choice item will happily load onto the strongest first factor
Summability:
• We need the sum-score to summarize, not just any factor!
• Items are purposefully constructed
Measuring summability
• How much of the variance of the item scores is captured by the sum-score?
• Our definition of summability: percentage of total item variance explained by the sum-score
• Like an R2 in regression
• Comes in unadjusted and adjusted form
Summability formula
• Unadjusted: c / (k · v)
• Adjusted: (c − v) / ((k − 1) · v)
• v is the sum of all item variances
• c is the sum of all item variances and covariances
• k is the number of items
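In code, with v, c and k as defined above. The formulas here are reconstructed from those definitions (the adjusted form is the average inter-item covariance divided by the average item variance) and may differ in detail from the authors' exact version:

```python
import numpy as np

def summability(scores, adjusted=True):
    """Summability of an (n_subjects, k_items) score matrix.

    v = sum of item variances; c = sum of all item variances and
    covariances (= variance of the sum-score); k = number of items.
    Reconstructed formulas (an assumption, not necessarily the
    authors' exact definition):
        unadjusted: c / (k * v)
        adjusted:   (c - v) / ((k - 1) * v)
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    v = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    c = scores.sum(axis=1).var(ddof=1)     # variance of the sum-score
    return (c - v) / ((k - 1) * v) if adjusted else c / (k * v)
```

The adjustment removes each item's trivial correlation with itself: the adjusted value is 0 for mutually uncorrelated items and 1 for perfectly parallel ones, and unlike alpha it does not grow mechanically with test length.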
Simulation 2
• Two unrelated skills or one skill
• Each skill tested with k items
• A test with all k or 2k items
• Item score = ability + noise
• Ability explains 50% of variance of each item
• k ranges from 2 to 50.
• We compare Cronbach’s alpha with Summability
Simulation 2: comparing Cronbach’s alpha with Summability for one construct
Simulation 2: comparing Cronbach’s alpha with Summability for two constructs
Recap valid test characteristics
• Item difficulty should be theoretically predictable
• Some form of unidimensionality of a test or sub-test is important because we want to summarize over all items
• Purposeful construction of items: sum-score should summarize
Variation between items
• Item difficulty should relate to the theoretically grounded item characteristics
• Example: in a multiple-choice reading comprehension test
– texts differ in (linguistic) difficulty (sentence length, …)
– answer options differ in plausibility
• In a valid MC reading test, the (linguistic) difficulty of the texts predicts item difficulty
• This is easy to check with a correlation or regression analysis
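For instance, such a check might look like the following; `log_freq` and `prop_correct` are made-up illustrative values, not data from this deck:

```python
import numpy as np

# Hypothetical example: item difficulty (proportion correct) for 10
# items, and a theoretically motivated item characteristic such as
# the log frequency of the tested word. Values are invented purely
# for illustration.
log_freq = np.array([8.2, 7.9, 7.1, 6.8, 6.0, 5.5, 4.9, 4.1, 3.6, 3.0])
prop_correct = np.array([.95, .91, .88, .80, .74, .70, .61, .55, .48, .40])

# Simple linear regression: does the characteristic predict difficulty?
slope, intercept = np.polyfit(log_freq, prop_correct, 1)
r = np.corrcoef(log_freq, prop_correct)[0, 1]
print(f"slope = {slope:.3f}, R^2 = {r ** 2:.2f}")
```

A substantial R2, with the slope in the theoretically expected direction, is the kind of post-hoc evidence the bullet above asks for.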
Application 1: productive vocabulary knowledge
• 90 vocabulary items completed by 198 participants (binary item scores).
Item example: “Het was gisteren dinsdag, dus is het v___ woensdag.” (“Yesterday it was Tuesday, so t___ it is Wednesday.”)
Tested words were taken from 10 frequency bands (rank 1–1000, 1001–2000, …)
• Is this a valid test of vocabulary knowledge? – Can we sum the scores?
– Can we predict item difficulty in a theoretically grounded manner?
De Jong et al., 2012; Hulstijn et al., 2012
Application 1: productive vocabulary knowledge
• Summability of item-scores:
– Summability: .28
– (Cronbach’s alpha: .97)
• Item difficulty:
– Calculated as number of correct answers / number of all answers
– Related to (log) tested word frequency: R2 = .45
De Jong et al., 2012; Hulstijn et al., 2012
Application 2: human ratings
• 100 judges rating 5 different aspects (5 groups of 20 judges) for 90 speech samples:
1. Fluency (pauses, speed, and repairs)
2. Pausing
3. Speed
4. Repairs
5. Accentedness
• Are these valid measures of fluency/accent?
– Can we sum the scores for each group of 20 judges?
– Can we predict ‘item difficulty’ in a theoretically grounded manner?
Bosker et al., 2013; Pinget et al., submitted
Application 2: human ratings
• Summability of judge scores for group 1 (fluency):
– Summability: .56
– (Cronbach’s alpha: .97)
• NB: collapsing over the 80 judges in groups 2–5 (pausing, speed, repairs, accentedness):
– Summability: .25
– (Cronbach’s alpha: .97)
• ‘Item difficulty’: mean judge score
– Calculated as the mean score over all judges
– Related to a combination of objectively measured fluency characteristics of the speech samples: R2 = .84
Bosker et al., 2013; Pinget et al., submitted
Discussion
On summability:
• In the end a (sub)test is reduced to a single sum-score; validity is relevant for that sum-score
• The concept of summability is useful in language testing practice whenever scores are summarized in a single sum-score
• Whether .28 and .56 are high or low: more experience is needed
On validity:
• Purposeful construction of items: item characteristics that are identified beforehand must relate to post-hoc item difficulty
Questions?