TRANSCRIPT
www.cemcentre.org
Teacher Assessment versus Exams
Peter Tymms
CEM, Durham University
Overview
• The Issue
• The importance of LAs, schools and teachers
• Fairness and bias
• Coverage and sampling
• Teacher assessment
• Exams and tests
• Predictive validity
• Conclusions
The Issue
• Teacher assessment is unfair because it is unreliable and biased.
• Exams are simply snapshots and are unrepresentative of the work that has really been done.
Which matters most?
1. LA
2. School
3. Teacher
4. Pupil
Newcastle Commission: Data Sources
• Several national datasets including
– ASPECTS, PIPS, MidYIS & YELLIS – KS1, KS2, KS3 & GCSE
• Looked at value-added using 3-level multilevel models
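A three-level (pupil within school within LEA) value-added model of this kind can be sketched in standard multilevel notation; the exact specification used in these analyses is not shown in the slides, so the single prior-attainment covariate here is an assumption:

```latex
y_{ijk} = \beta_0 + \beta_1\,\mathrm{prior}_{ijk} + u_k + v_{jk} + e_{ijk},
\qquad
u_k \sim N(0,\sigma^2_{\mathrm{LEA}}),\;
v_{jk} \sim N(0,\sigma^2_{\mathrm{school}}),\;
e_{ijk} \sim N(0,\sigma^2_{\mathrm{pupil}})
```

where $y_{ijk}$ is the outcome for pupil $i$ in school $j$ in LEA $k$. Comparing the three estimated variance components shows how much of the variation sits at each level, which is what the charts that follow display.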
Example using KS2 English
[Charts: raw and value-added differences between pupils, schools and LEAs for KS2 English, each plotted on a common scale from -3.00 to 2.00]
Willms’ Diagram
The Teacher Effect
Repeated Boosts: Vocabulary
[Chart: vocabulary levels (0 to 5) from entry to Reception (ER) through Years 1 to 6]
Which matters most?
1. LA
2. School
3. Teacher
4. Pupil
Pupils vary enormously
Teachers have the greatest impact
Schools are relevant
Authorities hardly vary at all
Conclusion
Hypothesis
• The best teachers will be best at judging their students
What is bias?
• Bias appears in a test when part of an assessment is harder for a particular group.
• Or when an assessor systematically downgrades a group or an individual for construct-irrelevant reasons.
Example of item bias
Pigeon
Turtle
Examples of teacher bias
• Anecdote
• By sex (e.g. baseline & page 17, Harlen)
• By ability – judgement anchored by experience
• By ethnicity – assault experiments
• By social class
• By behaviour (the origin of ability testing, Binet)
• By age (EPICure study)
• By incident – e.g. spilling a glass of water
• The halo (or horns) effect (e.g. P scales)
P Scales in 2004: correlations between teacher-assessed scales

             speak. listen. read. write. using number shape sci.enq life  mat.prop
speaking       –
listening    0.98
reading      0.86  0.86
writing      0.87  0.86  0.93
using        0.80  0.80  0.79  0.81
number       0.84  0.84  0.85  0.87  0.89
shape        0.86  0.86  0.85  0.87  0.89  0.93
sci. enq     0.78  0.78  0.75  0.78  0.82  0.82  0.83
life proc.   0.80  0.80  0.76  0.79  0.81  0.82  0.83  0.95
mat. prop.   0.79  0.79  0.75  0.78  0.82  0.83  0.84  0.96  0.97
phys. proc.  0.78  0.78  0.75  0.78  0.82  0.82  0.84  0.95  0.96  0.97
Teacher reliability
• How should reliability be assessed?
– By looking at the internal consistency of judgements?
– By looking at the link to external assessments?
– By comparing over time?
– By comparing one teacher with others?
• Facets model within Rasch measurement
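The facets extension of the Rasch model adds a severity term for each judge, so that a rating depends on the pupil, the task and the rater. A sketch of the standard many-facet formulation (symbols chosen here for illustration):

```latex
\log\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
```

where $B_n$ is pupil $n$'s ability, $D_i$ task $i$'s difficulty, $C_j$ rater $j$'s severity and $F_k$ the threshold for rating category $k$. A teacher's systematic harshness or leniency then becomes an estimable, correctable parameter rather than invisible bias.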
Trusting teachers’ judgement Harlen 2005
“The findings of the review by no means constitute a ringing endorsement of teachers’ assessment; there was evidence of low reliability and bias in teachers’ judgements”
5-14, Portfolios & single level tests
• 5-14 assessments
• What about portfolios?
– inter-rater reliability very low for maths and writing
• English teacher levels in SATs
– early 1990s: "considerable error"
– later, quite common to find teacher = test results
– single level tests compromised by teacher judgement
Is it OK for teachers to assess their own pupils for High Stakes exams?
• How does the power to grade affect relationships?
• Would you give McEnroe a B?
Exam/test reliability
• Typically around 0.9 but …
• Distinguish the assessment of:
– Convergent questions
– Divergent questions
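A reliability of 0.9 still leaves non-trivial uncertainty around individual scores. The standard error of measurement, SEM = SD·√(1 − r), makes this concrete; the sketch below uses assumed illustrative figures, not numbers from the talk:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: the typical size of the random
    error attached to an individual's observed score."""
    return sd * math.sqrt(1.0 - reliability)

# Illustrative: a test scaled to SD 15 (assumed), reliability 0.9.
print(round(sem(15, 0.9), 1))  # → 4.7
```

So even a well-constructed exam locates each pupil only within a band of several points, which matters when a single grade boundary decides the outcome.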
Exam/test bias
• Pre-tests are often used to address issues of bias
• But we place much reliance on judgement.
• England’s major exams are largely not pre-tested.
Are Exams inappropriate snapshots?
• Issue 1: Questions must be representative samples of the course under exam conditions.
• Issue 2: Constraint on the nature of the assessment – Multi-method Multi-trait challenge
• Issue 3: Impact of stress on performance– Positive & Negative (links to introversion)
Introvert and Extrovert
[Chart: effort plotted against stimulus for introverts and extroverts]
We need to match format to content
• Some things must be assessed by judgement:
– Social interactions
– Quality of research
– Poetry
– Art
• Some things are best left to tests:
– Mental arithmetic
– Spelling
– Phonological awareness
– Diagnostic assessments (e.g. INCAS)
• Even so, perhaps there is a final arbiter
Predictive validity
[Diagram: developed ability tests (MidYIS/IQ etc.), attainment tests (Standard Grade/Highers) and teacher grades as predictors of later success – degree, salary etc.]
We need the evidence, but …
• Prediction is often poor, for two major reasons
Prediction of Educational Achievement
[Scatterplots of later achievement against prior achievement: full sample, correlation = 0.7; selecting the top 15%, correlation = 0.39; creaming the top 3%, r = 0.19]
So, poor prediction because of
• Prior selection
• Variable outcome measures
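The fall from r = 0.7 in the full population to much weaker correlations among selected groups is the classic restriction-of-range effect, and a small simulation reproduces the pattern. This is a sketch with simulated data; the exact restricted values depend on the selection rule and will differ from the real-data figures on the slides:

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
n = 100_000
prior = [random.gauss(0, 1) for _ in range(n)]
# Later achievement built to correlate 0.7 with prior achievement.
later = [0.7 * p + math.sqrt(1 - 0.7 ** 2) * random.gauss(0, 1) for p in prior]

ranked = sorted(zip(prior, later), reverse=True)       # rank by prior score
r_full = correlation(prior, later)
r_top15 = correlation(*zip(*ranked[: int(0.15 * n)]))  # select top 15%
r_top3 = correlation(*zip(*ranked[: int(0.03 * n)]))   # cream top 3%

print(f"full: {r_full:.2f}  top 15%: {r_top15:.2f}  top 3%: {r_top3:.2f}")
```

The harder the selection, the more of the predictor's variance is cut away and the weaker the observed correlation, even though the underlying relationship is unchanged.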
Conclusion: Judgements or tests?
• Should we do both? (Profiles)
– But how do we ensure that judgements and tests are independent?
– How can judgements be kept free from bias?
• Virtually impossible in high stakes tests
• Essential for formative work
References
• Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
• Cooper, B. (1998). Using Bernstein and Bourdieu to understand children's difficulties with "realistic" mathematics testing: an exploratory study. Qualitative Studies in Education, 11(4), 511-532.
• Eysenck, H. J. (2006). The biological basis of personality. Transaction Publishers.
• Harlen, W. (2005). Trusting teachers' judgement: research evidence of the reliability and validity of teachers' assessment used for summative purposes. Research Papers in Education, 20(3), 245-270.
• Johnson, S., Hennessy, E., Smith, R., Trikic, R., Wolke, D., & Marlow, N. (2009). The EPICure Study: Academic attainment and special educational needs in extremely preterm children at 11 years. Nottingham/London/Warwick.
• Koretz, D., Stecher, B. M., Klein, S. P., & McCaffrey, D. (1994). The Vermont Portfolio Assessment Program: findings and implications. Educational Measurement: Issues & Practice, 13, 5-16.
• Tymms, P. (1997). Value-added Key Stage 1 to Key Stage 2. London: School Curriculum and Assessment Authority.
• Tymms, P., Jones, P., Albone, S., & Henderson, B. (2009). The first seven years at school. Educational Assessment, Evaluation and Accountability, 21, 67-80.
• Tymms, P., Merrell, C., Heron, T., Jones, P., Albone, S., & Henderson, B. (2008). The importance of districts. School Effectiveness and School Improvement, 19(3), 261-274.
• Tymms, P., Merrell, C., & Jones, P. (2004). Using baseline assessment data to make international comparisons. British Educational Research Journal, 30(5), 673-689.
• Willms, J. D. (1987). Differences between Scottish education authorities in their examination attainment. Oxford Review of Education, 13(2), 211-232.