30 years of evidence on the comparability of exam standards: myths, fiascos and unrealistic...

30 Years of Evidence on the Comparability of Exam Standards: Myths, Fiascos and Unrealistic Expectations Paul E. Newton Centre for Evaluation & Monitoring, University of Durham, 30th Anniversary Conference: 30 Years of Evidence in Education. 23 September 2014. London.

Page 1

30 Years of Evidence on the Comparability of Exam Standards: Myths, Fiascos and Unrealistic Expectations

Paul E. Newton

Centre for Evaluation & Monitoring, University of Durham, 30th Anniversary Conference: 30 Years of Evidence in Education. 23 September 2014. London.

Page 2

Statistics vs. Judgement: What Does 30 Years of Research Tell Us About the Best and Worst Way to Maintain Exam Standards?

Page 3

What does it mean to ‘maintain’ an exam standard?

Grade Awarding

The process of identifying which marks on this year’s exam correspond to levels of attainment (i.e. levels of knowledge, skill and understanding) that were associated with grade boundary marks on last year’s exam.

Page 4

Why do exam boards need to move grade boundaries?

Because even exams that are designed to measure exactly the same kind of attainment, in exactly the same way, may end up being slightly different in terms of the overall difficulty of their questions.
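One way to picture a boundary moving with paper difficulty is the minimal sketch below. It assumes, hypothetically, that the two cohorts are equally able, so the share of candidates reaching the grade stands in for "equivalent attainment". The function names and all marks are invented for illustration; the slides do not prescribe this method.

```python
def share_at_or_above(marks, boundary):
    """Fraction of candidates whose mark is at or above the boundary."""
    return sum(m >= boundary for m in marks) / len(marks)


def equivalent_boundary(last_year_marks, last_year_boundary, this_year_marks):
    """Highest mark on this year's paper at which at least the same share
    of candidates reaches the grade as did last year.

    Assumes (hypothetically) that the two cohorts are equally able, so any
    shift in the mark distribution reflects paper difficulty alone.
    """
    target = share_at_or_above(last_year_marks, last_year_boundary)
    for candidate in sorted(set(this_year_marks), reverse=True):
        if share_at_or_above(this_year_marks, candidate) >= target:
            return candidate
    return min(this_year_marks)


# Hypothetical marks: this year's paper is 4 marks harder for every candidate.
last_year = [35, 42, 48, 51, 55, 60, 63, 67, 72, 80]
this_year = [m - 4 for m in last_year]

new_boundary = equivalent_boundary(last_year, 48, this_year)
print(new_boundary)
```

With these invented marks the boundary drops by the same 4 marks that the paper became harder by. In practice the equal-ability assumption is exactly what cannot be taken for granted.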

Page 6

Have we always maintained exam standards like this?

30 years ago – in 1984? 60 years ago – in 1954?

… yes, pretty much!

Page 7

Attainment-referencing

From one examination to the next, corresponding grade boundaries should be located at marks associated with equivalent levels of attainment.

Page 8

The myth

Figure: HYPOTHETICAL A level pass-rates for UCLES (Summer examinations, Home candidates only). Axes: pass-rate 0 to 100; years 1960 to 1986. Series: Latin, French, Physics, Biology.

Page 9

The myth… debunked

Figure: A level pass-rates for the 'Cambridge' board UCLES (1960 to 1984) (Summer examinations, Home candidates only). Axes: pass-rate 50 to 100; years 1960 to 1984. Series: Latin, Physics.

Page 10

How do you operationalise attainment-referencing?

Figure: Cumulative percentage of A level Sociology students awarded grade E (blue) against total number of results awarded (red) (for All Boards, Summer Awards, All Modes, by Syllabus Group). Axes: Cum.% E, 60.0 to 100.0; No. Results, 20000 to 34000; years 1990 to 2009.

Page 11

Scrutiny of scripts (undertaken by examiners)

Comparing levels of attainment ‘directly’ by inspecting performances in examination scripts

a.k.a. ‘Judgement’

Page 12

Scrutiny of data (undertaken by the Board)

Comparing levels of attainment indirectly by ‘modelling’ the causal determinants of attainment

a.k.a. ‘Statistics’
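As a rough illustration of comparing attainment indirectly, the sketch below predicts a pass-rate from the make-up of the cohort. It assumes, hypothetically, that prior attainment is the only determinant modelled and that last year's pass-rate within each prior-attainment band still holds. The band names, rates and counts are invented, and this is not presented as any board's actual procedure.

```python
def predicted_pass_rate(reference_rates, cohort_counts):
    """Weighted average of reference pass-rates, weighted by how many of
    the cohort's candidates fall into each prior-attainment band."""
    total = sum(cohort_counts.values())
    weighted = sum(reference_rates[band] * n for band, n in cohort_counts.items())
    return weighted / total


# Hypothetical pass-rates observed last year within each prior-attainment band.
reference_rates = {"low": 0.50, "middle": 0.80, "high": 0.95}

# Hypothetical cohorts: this year's entry has expanded at the lower end only.
last_year_counts = {"low": 200, "middle": 500, "high": 300}
this_year_counts = {"low": 400, "middle": 500, "high": 300}

last_prediction = predicted_pass_rate(reference_rates, last_year_counts)
this_prediction = predicted_pass_rate(reference_rates, this_year_counts)
print(f"{last_prediction:.4f} -> {this_prediction:.4f}")
```

Under these assumptions an expanding cohort with weaker prior attainment yields a lower predicted pass-rate, which is the kind of evidence a board might set against examiners' impressions of scripts.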

Figure: Cumulative percentage of A level Sociology students awarded grade E (blue) against total number of results awarded (red) (for All Boards, Summer Awards, All Modes, by Syllabus Group). Axes: Cum.% E, 60.0 to 100.0; No. Results, 20000 to 34000; years 1990 to 2009.

Page 13

Which is better – statistics or judgement?

Figure: Cumulative percentage of A level Sociology students awarded grade E (blue) against total number of results awarded (red) (for All Boards, Summer Awards, All Modes, by Syllabus Group). Axes: Cum.% E, 60.0 to 100.0; No. Results, 20000 to 34000; years 1990 to 2009.

Page 15

The battle of grade awarding

Examiners

We are just so impressed by the quality of performances that we see in our French exams.

The Board

But do you really have enough evidence to justify raising the pass-rate yet again?

After all: pass-rates haven’t been rising in German or Spanish, and the French cohort is expanding massively.

Page 16

What Does 30 Years of Research Tell Us About the Best and Worst Way to Maintain Exam Standards?

Page 17

Evidence from Exam Boards

Page 18

Evidence from Academia

Page 19

Evidence from Regulators

Page 20

What have we learned since 1984?

Page 21

We shouldn’t put too much confidence in statistics

Figure: Cumulative % candidates with grade E (or higher), averaged across 13 UCLES A level subjects, 1960-1984 (Summer examinations, Home candidates only, Main syllabuses only). Axis: 0 to 100.

Page 22

4 NEAB maths A levels: P&A, P&M, P&S, SMP

MLM (multilevel modelling) to control for prior achievement, gender, etc.

even after control, SMP still appeared too lenient

However, the SMP syllabus: more motivating; excellent support materials; more time-consuming

Page 23

We shouldn’t put too much confidence in judgement

Page 24

Grade boundaries set by examiner judgement alone for two exam papers: same subject; different tiers; sat by same candidates

Many more students ended up with higher grades on the lower tier exam (than on the higher tier).
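A comparison of this kind can be sketched as follows: grade each candidate on both tiers and count which tier gave the better grade. The grade scale, boundaries and marks are hypothetical, chosen only to show the mechanics, and are not data from the study the slide describes.

```python
GRADE_ORDER = ["G", "F", "E", "D", "C", "B", "A"]  # assumed scale, lowest to highest


def grade_for(mark, boundaries):
    """Highest grade whose boundary mark is met; "U" if none is met.

    Assumes boundary marks rise with the grade.
    """
    awarded = "U"
    for grade in GRADE_ORDER:
        if grade in boundaries and mark >= boundaries[grade]:
            awarded = grade
    return awarded


def rank(grade):
    """Position of a grade on the scale; "U" ranks below every grade."""
    return GRADE_ORDER.index(grade) if grade in GRADE_ORDER else -1


def compare_tiers(candidates, lower_boundaries, higher_boundaries):
    """Count candidates by which tier gave them the better grade."""
    counts = {"lower_tier_better": 0, "same": 0, "higher_tier_better": 0}
    for lower_mark, higher_mark in candidates:
        lower_rank = rank(grade_for(lower_mark, lower_boundaries))
        higher_rank = rank(grade_for(higher_mark, higher_boundaries))
        if lower_rank > higher_rank:
            counts["lower_tier_better"] += 1
        elif lower_rank < higher_rank:
            counts["higher_tier_better"] += 1
        else:
            counts["same"] += 1
    return counts


# Hypothetical boundaries and (lower-tier mark, higher-tier mark) pairs.
lower_boundaries = {"E": 20, "D": 35, "C": 50}
higher_boundaries = {"D": 20, "C": 35, "B": 50, "A": 65}
candidates = [(55, 30), (52, 36), (40, 22), (60, 33), (58, 52), (51, 28)]

tier_counts = compare_tiers(candidates, lower_boundaries, higher_boundaries)
print(tier_counts)
```

If the two sets of boundaries represented the same standard, the "lower_tier_better" and "higher_tier_better" counts would be expected to be roughly balanced; a strong imbalance suggests the judgemental boundaries were not aligned.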

Page 25

Judgemental innovations

We have learned how to harness examiner judgement more effectively

Page 26

Statistical innovations

We have learned how to compute statistical analyses more effectively

Page 27

It is extremely hard to predict and control comparability threats.

Page 28

The ‘fiascos’

Summer 2012: GCSE English anomaly

Summer 2002: Curriculum 2000 anomaly

Page 29

January awarding, 2012

Clear tendency to ensure students marked ‘comfortably’ above historical boundaries

Page 30

June awarding, 2012

Same tendency, but many students no longer ‘comfortably’ above the raised boundaries
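To see why marks clustered just above a historical boundary are so sensitive to a raised boundary, consider the minimal sketch below. The marks and both boundaries are hypothetical and are not taken from the 2012 awards.

```python
def share_reaching(marks, boundary):
    """Fraction of candidates whose mark is at or above the boundary."""
    return sum(m >= boundary for m in marks) / len(marks)


# Hypothetical marks, clustered a few marks above a historical boundary of 40.
marks = [41, 43, 43, 44, 44, 45, 46, 48, 52, 57]
historical_boundary = 40
raised_boundary = 44  # hypothetical raised boundary

before = share_reaching(marks, historical_boundary)
after = share_reaching(marks, raised_boundary)
print(f"{before:.0%} reach the grade at {historical_boundary}, {after:.0%} at {raised_boundary}")
```

Because the invented marks sit so close to the old boundary, a small upward move removes a large share of candidates from the grade.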

Page 31

So, which is better – statistics or judgement?

Figure: Cumulative percentage of A level Sociology students awarded grade E (blue) against total number of results awarded (red) (for All Boards, Summer Awards, All Modes, by Syllabus Group). Axes: Cum.% E, 60.0 to 100.0; No. Results, 20000 to 34000; years 1990 to 2009.

Page 32

Unrealistic expectations

Three ‘stages’ in understanding comparability

1. statistical auditing: problems are routine; solutions require ‘back of the envelope’ sums

2. scientific research: problems are difficult; solutions require rigorous and objective investigations

3. art criticism: problems are perhaps insurmountable; solutions require value judgements

(Bardell, Forrest and Shoesmith, 1978)

Page 33

Realistic expectations + Persuasive justifications

Four ‘stages’ in understanding comparability

1. statistical auditing

2. scientific research

3. art criticism

4. engineering pragmatism

many comparability problems are technically insurmountable… but some are less insurmountable than others and should be prioritised

all comparability solutions are inevitably imperfect… but some are less imperfect than others and should be prioritised

technically insurmountable problems and inevitably imperfect solutions highlight the fundamental importance of strong arguments in defence of policy and practice