30 Years of Evidence on the Comparability of Exam Standards: Myths, Fiascos and Unrealistic Expectations
Paul E. Newton
Centre for Evaluation & Monitoring, University of Durham, 30th Anniversary Conference: 30 Years of Evidence in Education. 23 September 2014. London.
Statistics vs. Judgement: What Does 30 Years of Research Tell Us About the Best and Worst Way to Maintain Exam Standards?
What does it mean to ‘maintain’ an exam standard?
Grade Awarding
The process of identifying which marks on this year’s exam correspond to the levels of attainment (i.e. levels of knowledge, skill and understanding) that were associated with grade boundary marks on last year’s exam.
Why do exam boards need to move grade boundaries?
Because even exams that are designed to measure exactly the same kind of attainment in exactly the same way may end up being slightly different in terms of the overall difficulty of their questions.
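As a rough illustration of why a boundary has to move when a paper turns out harder, the sketch below finds the mark on this year's paper that passes the same cumulative percentage of candidates as last year's boundary did. This is only the crudest statistical approach: it assumes the two cohorts are equally able, which is exactly what awarding cannot take for granted. The function names and all marks are hypothetical, invented for illustration.

```python
from typing import Sequence


def cumulative_pct_at_or_above(marks: Sequence[int], boundary: int) -> float:
    """Percentage of candidates scoring at or above the boundary mark."""
    return 100.0 * sum(m >= boundary for m in marks) / len(marks)


def matching_boundary(
    last_year_marks: Sequence[int],
    last_year_boundary: int,
    this_year_marks: Sequence[int],
    max_mark: int,
) -> int:
    """Return this year's boundary whose cumulative percentage is closest
    to last year's. Ties are broken in favour of the highest mark.

    ASSUMPTION: the two cohorts are of equal ability, so equal
    percentages imply equal attainment. Real awarding does not rely on this alone.
    """
    target = cumulative_pct_at_or_above(last_year_marks, last_year_boundary)
    return min(
        range(max_mark + 1),
        key=lambda b: (abs(cumulative_pct_at_or_above(this_year_marks, b) - target), -b),
    )


# Hypothetical data: this year's paper is harder, so every candidate
# scores 5 marks fewer than an equivalent candidate did last year.
last_year = [30, 40, 50, 60, 70]
this_year = [25, 35, 45, 55, 65]
new_boundary = matching_boundary(last_year, 40, this_year, max_mark=100)
print(new_boundary)  # 35: the boundary drops by 5 marks
```

In this toy case the boundary moves from 40 to 35, so the same candidates pass even though the raw marks are lower.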
Have we always maintained exam standards like this?
30 years ago – in 1984? 60 years ago – in 1954?
… yes, pretty much!
Attainment-referencing
From one examination to the next, corresponding grade boundaries should be located at marks associated with equivalent levels of attainment.
The myth
[Figure: HYPOTHETICAL A level pass-rates for UCLES (Summer examinations, Home candidates only), 1960 to 1986. Vertical axis: 0 to 100. Series: Latin, French, Physics, Biology.]
The myth… debunked
[Figure: A level pass-rates for the 'Cambridge' board UCLES (1960 to 1984) (Summer examinations, Home candidates only). Vertical axis: 50 to 100. Series: Latin, Physics.]
How do you operationalise attainment-referencing?
[Figure: Cumulative percentage of A level Sociology students awarded grade E (blue) against total number of results awarded (red), 1990 to 2009 (for All Boards, Summer Awards, All Modes, by Syllabus Group). Axes: Cum.% E (60.0 to 100.0); No. Results (20000 to 34000).]
Scrutiny of scripts (undertaken by examiners)
Comparing levels of attainment ‘directly’ by inspecting performances in examination scripts
a.k.a. ‘Judgement’
Scrutiny of data (undertaken by the Board)
Comparing levels of attainment indirectly by ‘modelling’ the causal determinants of attainment
a.k.a. ‘Statistics’
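One simple way to picture the statistical approach is sketched below: take pass-rates observed in a reference year for groups of candidates with similar prior attainment, then weight them by the make-up of this year's cohort to get a predicted pass-rate. This is an assumed, simplified illustration rather than any board's actual procedure. The band names, rates and counts are all hypothetical.

```python
from typing import Dict


def predicted_pass_rate(
    reference_pass_rates: Dict[str, float],
    cohort_counts: Dict[str, int],
) -> float:
    """Predict this year's pass-rate (%) as the average of reference-year
    pass-rates per prior-attainment band, weighted by this year's entries.

    ASSUMPTION: prior attainment is the only modelled determinant of
    attainment, and its relationship with outcomes is stable across years.
    """
    total = sum(cohort_counts.values())
    weighted = sum(reference_pass_rates[band] * n for band, n in cohort_counts.items())
    return weighted / total


# Hypothetical reference-year pass-rates (%) by prior-attainment band.
reference = {"high": 95.0, "middle": 80.0, "low": 50.0}
# Hypothetical entry numbers for this year's cohort, by the same bands.
this_year = {"high": 200, "middle": 500, "low": 300}

predicted = predicted_pass_rate(reference, this_year)
print(predicted)  # 74.0
```

If the pass-rate proposed from script scrutiny is well above such a prediction, the Board has grounds to ask the examiners for further evidence. The prediction is only as good as the model behind it.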
Which is better – statistics or judgement?
The battle of grade awarding
Examiners
We are just so impressed by the quality of performances that we see in our French exams.
The Board
But do you really have enough evidence to justify raising the pass-rate yet again?
After all, pass-rates haven’t been rising in German or Spanish, and the French cohort is expanding massively.
What Does 30 Years of Research Tell Us About the Best and Worst Way to Maintain Exam Standards?
Evidence from Exam Boards
Evidence from Academia
Evidence from Regulators
What have we learned since 1984?
We shouldn’t put too much confidence in statistics
[Figure: Cumulative % candidates with grade E (or higher), averaged across 13 UCLES A level subjects, 1960-1984 (Summer examinations, Home candidates only, Main syllabuses only). Vertical axis: 0 to 100.]
4 NEAB maths A levels: P&A, P&M, P&S, SMP
Multilevel modelling (MLM) to control for prior achievement, gender, etc.
Even after control, SMP still appeared too lenient
However, the SMP syllabus had other distinguishing features: more motivating; excellent support materials; more time-consuming
We shouldn’t put too much confidence in judgement
Grade boundaries set by examiner judgement alone for two exam papers: same subject, different tiers, sat by the same candidates.
Many more students ended up with higher grades on the lower tier exam (than on the higher tier).
Judgemental innovations
We have learned how to harness examiner judgement more effectively
Statistical innovations
We have learned how to compute statistical analyses more effectively
It is extremely hard to predict and control comparability threats.
Summer 2012
GCSE English anomaly
Summer 2002
Curriculum 2000 anomaly
The ‘fiascos’
January awarding, 2012: clear tendency to ensure students marked ‘comfortably’ above historical boundaries
June awarding, 2012: same tendency, but many students no longer ‘comfortably’ above the raised boundaries
So, which is better – statistics or judgement?
Unrealistic expectations
Three ‘stages’ in understanding comparability
1. statistical auditing: problems are routine; solutions require ‘back of the envelope’ sums
2. scientific research: problems are difficult; solutions require rigorous and objective investigations
3. art criticism: problems are perhaps insurmountable; solutions require value judgements
(Bardell, Forrest and Shoesmith, 1978)
Realistic expectations + Persuasive justifications
Four ‘stages’ in understanding comparability
1. statistical auditing
2. scientific research
3. art criticism
4. engineering pragmatism
many comparability problems are technically insurmountable… but some are less insurmountable than others and should be prioritised
all comparability solutions are inevitably imperfect… but some are less imperfect than others and should be prioritised
technically insurmountable problems and inevitably imperfect solutions highlight the fundamental importance of strong arguments in defence of policy and practice