Statistical Analysis of Scorer Interrater Reliability


Jenna Porter
David Jelinek
Sacramento State University

Background & Overview

Piloted since 2003-2004
Began with our own version of training & calibration
Overarching question: scorer interrater reliability
  i.e., Are we confident that the TE scores our candidates receive are consistent from one scorer to the next?
  If not, what are the reasons? What can we do about it?
Jenna: analysis of the data to get at these questions
Then we'll open it up to "Now what?" "Is there anything we can/should do about it?" "Like what?"

Presentation Overview

• Is there interrater reliability among our PACT scorers?
• How do we know?
• Two methods of analysis
• Results
• What do we do about it?
• Implications

Data Collection

• 8 credential cohorts
  – 4 Multiple Subject and 4 Single Subject
• 181 Teaching Events total
  – 11 rubrics (excludes pilot Feedback rubric)
  – 10% randomly selected for double scoring
  – 10% of TEs were failing and double scored
• 38 Teaching Events were double scored (20%)

Scoring Procedures

Trained and calibrated scorers (university faculty) calibrate once per academic year
Followed the PACT calibration standard; scores must:
  Result in the same pass/fail decision (overall)
  Have exact agreement with the benchmark at least 6 times
  Be within 1 point of the benchmark
  (a check against this standard is sketched in code below)
All TEs scored independently once
If failing, scored by a second scorer and evidence reviewed by the chief trainer
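A minimal sketch, in Python, of how the calibration standard above could be checked. The 11 rubric scores, the 1-4 score scale, and passing the overall pass/fail decisions in directly are assumptions for illustration; the slides do not specify how the overall decision is derived from the rubric scores, and the function and variable names are not part of the PACT materials.

```python
# Hypothetical check of a scorer's calibration attempt against the three
# PACT-style criteria listed above. Pass/fail decisions are supplied as
# inputs because the slides do not say how they follow from rubric scores.

def meets_calibration_standard(scorer_scores, benchmark_scores,
                               scorer_pass, benchmark_pass,
                               min_exact=6, max_diff=1):
    """Return True if the scorer meets all three calibration criteria."""
    assert len(scorer_scores) == len(benchmark_scores)

    # 1. Same overall pass/fail decision as the benchmark
    same_decision = scorer_pass == benchmark_pass

    # 2. Exact agreement with the benchmark on at least `min_exact` rubrics
    exact_matches = sum(s == b for s, b in zip(scorer_scores, benchmark_scores))

    # 3. Every rubric score within `max_diff` point of the benchmark
    within_range = all(abs(s - b) <= max_diff
                       for s, b in zip(scorer_scores, benchmark_scores))

    return same_decision and exact_matches >= min_exact and within_range


# Illustrative 11 rubric scores (1-4 scale) for one calibration TE
scorer    = [2, 3, 2, 3, 3, 2, 2, 3, 2, 3, 2]
benchmark = [2, 3, 3, 3, 3, 2, 2, 2, 2, 3, 2]
print(meets_calibration_standard(scorer, benchmark,
                                 scorer_pass=True, benchmark_pass=True))  # True
```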

Methods of Analysis

Percent Agreement
  Exact agreement
  Agreement within 1 point
  Combined (exact and within 1 point)
  (sketched in code below)
Cohen's Kappa (Cohen, 1960)
  Indicates the proportion of agreement between raters above what is expected by chance

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.
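A minimal sketch of the three percent-agreement variants named above, computed over paired rubric scores from two independent scorers. The example scores are illustrative, and reading "agreement within 1 point" as scores differing by exactly one point (so that "combined" means differing by at most one point) is an assumption.

```python
# Percent agreement between two raters over the same set of rubric scores:
# exact agreement, agreement within 1 point (off by exactly one), and the
# combined rate (scores differ by at most one point).

def percent_agreement(scores_a, scores_b):
    pairs = list(zip(scores_a, scores_b))
    n = len(pairs)
    exact    = sum(a == b for a, b in pairs) / n            # identical scores
    within_1 = sum(abs(a - b) == 1 for a, b in pairs) / n   # off by exactly 1 point
    combined = exact + within_1                             # differ by at most 1 point
    return exact, within_1, combined


# Illustrative scores from a double-scored Teaching Event (11 rubrics, 1-4 scale)
rater_1 = [2, 3, 2, 4, 3, 2, 1, 3, 2, 3, 2]
rater_2 = [2, 3, 3, 4, 2, 2, 2, 3, 2, 4, 2]
exact, within_1, combined = percent_agreement(rater_1, rater_2)
print(f"exact: {exact:.0%}, within 1: {within_1:.0%}, combined: {combined:.0%}")
```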

Percent Agreement

Benefits
  Easy to understand
Limitations
  Does not account for chance agreement
  Tends to overestimate true agreement (Berk, 1979; Grayson, 2001)

Berk, R. A. (1979). Generalizability of behavioral observations: A clarification of interobserver agreement and interobserver reliability. American Journal of Mental Deficiency, 83, 460-472.
Grayson, K. (2001). Interrater reliability. Journal of Consumer Psychology, 10, 71-73.

Cohen’s Kappa

Benefits
  Accounts for chance agreement
  Can be used to compare across different conditions (Ciminero, Calhoun, & Adams, 1986)
Limitations
  Kappa may decrease when base rates are low, so at least 10 occurrences are needed (Nelson & Cicchetti, 1995)

Ciminero, A. R., Calhoun, K. S., & Adams, H. E. (Eds.). (1986). Handbook of behavioral assessment (2nd ed.). New York: Wiley.
Nelson, L. D., & Cicchetti, D. V. (1995). Assessment of emotional functioning in brain impaired individuals. Psychological Assessment, 7, 404-413.

Kappa coefficient

Kappa = (proportion of observed agreement - chance agreement) / (1 - chance agreement)
(a computational sketch follows the references below)

Coefficient ranges from -1.0 (disagreement) to 1.0 (perfect agreement)

Altman, D. G. (1991). Practical statistics for medical research. London: Chapman and Hall.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.
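A minimal sketch of Cohen's kappa computed directly from the formula above, with chance agreement estimated from each rater's marginal score distribution. The example scores are illustrative only; for real data, sklearn.metrics.cohen_kappa_score gives the same unweighted statistic.

```python
from collections import Counter

def cohens_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(scores_a)
    categories = set(scores_a) | set(scores_b)

    # p_o: proportion of observed agreement
    p_o = sum(a == b for a, b in zip(scores_a, scores_b)) / n

    # p_e: chance agreement expected from each rater's marginal proportions
    count_a, count_b = Counter(scores_a), Counter(scores_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    return (p_o - p_e) / (1 - p_e)


# Illustrative scores from a double-scored Teaching Event (11 rubrics, 1-4 scale)
rater_1 = [2, 3, 2, 4, 3, 2, 1, 3, 2, 3, 2]
rater_2 = [2, 3, 3, 4, 2, 2, 2, 3, 2, 4, 2]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.43 with these example scores
```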

Percent Agreement

Percent Agreement

Pass/Fail Disagreement

Cohen’s Kappa

Interrater Reliability Compared

Implications

Overall interrater reliability: poor to fair
Consider/reevaluate the protocol for calibration
  Calibration protocol may be interpreted differently
  Drifting scorers?
Use multiple methods to calculate interrater reliability
Other?

How can we increase interrater reliability?

Your thoughts . . .
Training protocol
Adding "Evidence Based" training (Jeanne Stone, UCI)
More calibration

Contact Information

Jenna Porter [email protected]

David Jelinek [email protected]