Inter-rater reliability in the Performance Test: Summer workshop 2014

By: Dustin Florence

Definitions

• ITA candidates: international graduate students seeking TA positions
• Performance test: a short teaching demonstration, graded on a Likert scale
• Discourse intonation: intonation contours (thought groups, prominence, tone)
• Inter-rater reliability: a coefficient that measures how similarly the members of a rater team rate the same performance

Why the workshop?

• Tech needs ITAs to teach
• ITAs need to be able to communicate with undergraduates
• ITAs need to pass three tests

Who is the workshop for?

• Stakeholders
1. ITA candidates
2. Performance test raters
3. ITA workshop directors
4. TTU administrators
5. Department heads of ITA candidates’ departments
6. Undergraduates who might be taught by ITAs

Who is this presentation for?

1. Performance test raters
2. ITA workshop directors
3. Anyone involved with rater training
4. Anyone interested in issues of rater reliability

Why?
• Inter-rater reliability is necessary to answer stakeholders’ worries
• This study is a first step towards validating the summer workshop program

What type of research is inter-rater reliability research?

Because this study observes what the ITA candidates do rather than experimentally manipulating them, its method is “correlational or cross-sectional research” (Field, 2013, p. 13). In this case, we measure how closely two different raters rate the same criterion for the same ITA candidate on the same performance test. Correlation is an accepted measure of reliability. This study uses Kendall’s Tau as the correlation coefficient. Kendall’s Tau fits our needs because it does not require normally distributed data and it works well with many tied scores (like the strings of 4s and 5s in our data) (Field, 2013).
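As a hedged illustration of why tied scores matter, the tie-corrected form of the coefficient (tau-b) can be written as below. The slides do not say which variant of Kendall’s Tau was computed, so tau-b is an assumption here, and the symbols n_c, n_d, t_i and u_j are introduced only for this sketch.

```latex
\tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}},
\qquad n_0 = \frac{n(n-1)}{2},
\quad n_1 = \sum_i \frac{t_i(t_i - 1)}{2},
\quad n_2 = \sum_j \frac{u_j(u_j - 1)}{2}
```

Here n is the number of candidates rated by both raters, n_c and n_d are the numbers of concordant and discordant candidate pairs, and t_i and u_j are the sizes of the groups of tied scores given by the first and second rater. If one rater gives every candidate the same score, then n_1 (or n_2) equals n_0, the denominator is zero, and the coefficient is undefined, which would be consistent with the “can’t compute” entries in the results tables.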

Research Questions

• Do raters rate the same criteria in the same way? In other words, is there a moderate to high level of inter-rater reliability on the final performance test?

• Do raters’ ratings become more reliable the more experience they have with the ITA candidates and with the constructs of the ITA workshop course? Does inter-rater reliability increase from the midterm test to the final test?

Raters

• Three teams of paired raters
• 4 male and 2 female raters
• 4 native and 2 non-native English speakers
• 5 held Master’s degrees and 1 was in the process of completing a Master’s degree
• Experience with ITA candidates ranged from 0 to 20 years (but all had experience teaching EFL/ESL)

Rater training

• Conducted by the ITA workshop directors
• Both trainers were authors of the text used in the workshop and had many years of experience with ITA candidates
• Two-day training session before the workshop
• Raters listened to performances, rated them, and discussed their ratings and reasons

Participants

• Gender: 53 male and 31 female
• Native language: 33 Chinese speakers, 10 Bengali speakers, 8 Farsi speakers, 6 Arabic speakers, 6 Korean speakers, 6 Sinhalese speakers, 4 Tamil speakers, 3 Nepali speakers, 2 Spanish speakers, 2 French speakers, 2 Hindi speakers, and 1 speaker of each of the following: English, Indonesian, Japanese, Kamona, Urdu, Vietnamese and Yoruba
• Number in each group: Team one rated 30 students, Team two rated 29 students, and Team three rated 25 students

Materials

• ITA Performance test version 9.0
• Four constructs and ten criteria
1. Grammatical competence: pronunciation, word stress, thought groups
2. Textual competence: grammatical structures, transitional phrases, definitions
3. Sociolinguistic competence: prominence, comprehension checks, tone
4. Functional competence: answering students’ questions

Procedures

• Entered scores in Excel
• Used SPSS to calculate Kendall’s Tau coefficients for each rater team’s ratings of each candidate on the midterm and final tests
• Evaluated the reliability coefficients of the final test ratings
• Compared the coefficients of the midterm and final tests
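The procedure above can be sketched in plain Python rather than SPSS. This is a minimal illustration, not the study’s actual workflow: it assumes the tie-corrected tau-b variant, the function names `kendall_tau_b` and `reliability_change` are hypothetical, and the scores are invented for the example.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two raters' scores of the same candidates.

    Returns None when either rater's scores are constant, because the
    coefficient is undefined in that case.
    """
    if len(x) != len(y):
        raise ValueError("both raters must rate the same candidates")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0 means the pair is tied for at least one rater
    n0 = len(x) * (len(x) - 1) // 2
    ties_x = sum(t * (t - 1) // 2 for t in Counter(x).values())
    ties_y = sum(t * (t - 1) // 2 for t in Counter(y).values())
    denominator = sqrt((n0 - ties_x) * (n0 - ties_y))
    if denominator == 0:
        return None
    return (concordant - discordant) / denominator


def reliability_change(midterm, final):
    """Final minus midterm coefficient, or None if either is undefined."""
    if midterm is None or final is None:
        return None
    return final - midterm


# Hypothetical scores for one rater team (NOT data from the study).
# Each criterion maps to (rater A's scores, rater B's scores).
midterm_scores = {
    "pronunciation": ([3, 4, 4, 5], [4, 3, 4, 5]),
    "word stress": ([4, 4, 4, 4], [4, 5, 4, 3]),  # rater A is constant
}
final_scores = {
    "pronunciation": ([3, 4, 4, 5], [3, 4, 5, 5]),
    "word stress": ([4, 5, 4, 3], [4, 5, 4, 4]),
}

for criterion in midterm_scores:
    mid = kendall_tau_b(*midterm_scores[criterion])
    fin = kendall_tau_b(*final_scores[criterion])
    print(criterion, mid, fin, reliability_change(mid, fin))
```

Returning `None` for constant ratings keeps the "cannot compute" case explicit, so a missing coefficient is never mistaken for a correlation of zero when the midterm and final values are compared.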

Analysis

• Expect Kendall’s Tau values of 0.2 to 0.4 to indicate moderate correlation and 0.4 or better to indicate good correlation. These cut-offs are modest because the ratings are subjective and many factors can influence raters (fatigue, experience with certain groups of English learners).

• Expect the difference between the final and midterm correlation coefficients (final minus midterm) for each criterion to be positive, indicating that the raters are rating more similarly.
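The two expectations above can be expressed as a small sketch. The cut-offs of 0.2 and 0.4 come from the analysis plan; treating exactly 0.4 as "good", the label "below moderate" for values under 0.2, and the function names are assumptions made for this illustration. The three example pairs are midterm and final coefficients reported for Team 1.

```python
MODERATE_CUTOFF = 0.2  # from the analysis plan
GOOD_CUTOFF = 0.4      # from the analysis plan; 0.4 itself is treated as "good" (assumption)


def describe_tau(tau):
    """Label a Kendall's Tau coefficient using the stated cut-offs."""
    if tau is None:
        return "not computable"
    if tau >= GOOD_CUTOFF:
        return "good"
    if tau >= MODERATE_CUTOFF:
        return "moderate"
    return "below moderate"  # assumed label; the slides name no band under 0.2


def describe_change(midterm, final):
    """Say whether raters rated more or less similarly on the final test."""
    if midterm is None or final is None:
        return "not computable"
    difference = final - midterm
    if difference > 0:
        return "more similar"
    if difference < 0:
        return "less similar"
    return "unchanged"


# (midterm, final) coefficients for three Team 1 criteria;
# None marks a coefficient that could not be computed.
team_1 = {
    "pronunciation": (0.272, 0.620),
    "word stress": (0.169, None),
    "prominence": (0.391, -0.050),
}

for criterion, (midterm, final) in team_1.items():
    print(criterion, describe_tau(final), describe_change(midterm, final))
```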

Team 1

Criterion                     Midterm        Final          Final - Midterm
1  pronunciation              .272           .620**         .348
2  word stress                .169           constant (4s)  n/a
3  thought groups             -.083          .686**         .769
4  grammar                    .171           .431*          .260
5  transitional phrases       .331           .354*          .023
6  definitions and examples   .389*          .566**         .177
7  prominence                 .391*          -.050          -.441
8  comprehension checks       .173           .546**         .373
9  intonation                 .279           .527**         .248
10 answering questions        .651**         .458**         -.193

constant (4s): ratings were constant (all 4s), so the coefficient can’t be computed.

Team 2

Criterion                     Midterm        Final          Final - Midterm
1  pronunciation              .071           .494*          .423
2  word stress                .405*          .314           -.091
3  thought groups             -.063          .316           .379
4  grammar                    constant (4s)  .692**         n/a
5  transitional phrases       .382           .330           -.052
6  definitions and examples   constant (4s)  .520*          n/a
7  prominence                 .244           .014           -.230
8  comprehension checks       .216           .407*          .191
9  intonation                 .329           .459*          .130
10 answering questions        .626**         .307           -.319

constant (4s): ratings were constant (all 4s), so the coefficient can’t be computed.

Team 3

Criterion                     Midterm        Final          Final - Midterm
1  pronunciation              .309           .443*          .134
2  word stress                .341           constant (4s)  n/a
3  thought groups             .243           .454*          .211
4  grammar                    -.107          .801**         .908
5  transitional phrases       .342           .256           -.086
6  definitions and examples   .548**         .336           -.212
7  prominence                 .258           .753**         .495
8  comprehension checks       .586**         .503**         -.083
9  intonation                 .389*          .373*          -.016
10 answering questions        .486**         .618**         .132

constant (4s): ratings were constant (all 4s), so the coefficient can’t be computed.

The Good, the Bad and the Ugly News

• What we are really concerned with are the final test scores; the midterm is practice for both the ITA candidates and the raters. On the final test, all of the criteria for every rater team (where a coefficient could be computed) had moderate to strong correlation, with the notable exception of criterion 7, prominence.

• Inter-rater reliability for most of the criteria for most of the teams went up from midterm to final, with gains in reliability far outweighing losses in all cases except criterion 7, prominence (and, oddly enough, criterion 10, answering students’ questions).

RQs answered

• Do raters rate the same criteria in the same way? In other words, is there a moderate to high level of inter-rater reliability on the final performance test? Yes, in every case except criterion 7, prominence.

• Do raters’ ratings become more reliable the more experience they have with the ITA candidates and the constructs of the ITA workshop course? Does inter-rater reliability increase from the midterm test to the final test? Yes, except for prominence (and to a lesser extent criterion 10, answering students’ questions).

What’s it all mean?

• Since two of the three teams had reliability issues with prominence and there were no other reliability issues,

• And there was a decrease in reliability on prominence with both of these teams from the midterm to the final test,

• It seems that the raters had only a vague understanding of prominence and/or difficulty perceiving it.

What’s to be done?

• More time in rater training sessions devoted to understanding prominence and its role in the construct of sociolinguistic competence in general.

• More time in rater training sessions devoted to hearing prominence when it is used.

Limitations of this study

• Only studies one summer workshop
• Only studies three rater teams
• Different rater teams are likely to have different reliability issues
• Did not interview raters to learn their justifications for their ratings

Thank you for your attention.

Have a great day.