evaluating rater performance in the scoring of career

Upload: susan-metzger-gracia

Post on 03-Apr-2018


  • 7/29/2019 Evaluating Rater Performance in the Scoring of Career

    Evaluating Rater Performance in the Scoring of Career Commitment Essays

    Susan Gracia

    Ph.D. in Education Faculty Research Seminar

    December 4, 2008

    Career Commitment Essay

    One of several admissions requirements to FSEHD

    2-3 page essay describing:

      Why a candidate wants to be a teacher

      Personal skills and characteristics s/he brings

      Area for improvement

    CCE Rubric

    Evaluation based on 4 dimensions or traits:

      Content

      Expression/voice

      Organization

      Conventions

    Plus an overall, holistic score

    The 2 types of scoring provide a useful distinction between overall performance and performance using particular skills

    Scoring

    Scorers: higher ed. faculty & K-12 practitioners

    Scorer training: 1.5-2 hours

    Each essay gets 2 blind reads

    Holistic scores are averaged to determine pass/fail

    If 2 essay scores are exactly the same or within 1 point of each other, student receives average of the 2 scores

      .5 averages are bumped up to next highest score

    Average holistic scores of 3 or 4: pass

    Average holistic scores of 1 or 2: revise and resubmit

    If essay scores deviate by more than 1 point, essay is read by a 3rd scorer

      Student receives average of 2 highest scores
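The resolution rules on this slide can be sketched in Python. This is a minimal sketch rather than the program's actual procedure: the function names are hypothetical, and it assumes that the bump-up of .5 averages also applies to the average of the two highest scores after a third read, which the slide does not state.

```python
import math


def final_holistic_score(scores):
    """Resolve 2 or 3 blind-read holistic scores (1-4) into one final score."""
    if len(scores) == 2:
        first, second = scores
        if abs(first - second) > 1:
            # Scores deviate by more than 1 point: a 3rd scorer must read the essay.
            raise ValueError("third read required")
        average = (first + second) / 2
    elif len(scores) == 3:
        # After a 3rd read, average the two highest scores.
        average = sum(sorted(scores)[-2:]) / 2
    else:
        raise ValueError("expected 2 or 3 scores")
    # Bump .5 averages up to the next highest score (e.g. 2.5 -> 3).
    return math.floor(average + 0.5)


def decision(final_score):
    """Scores of 3 or 4 pass; scores of 1 or 2 are revise and resubmit."""
    return "pass" if final_score >= 3 else "revise and resubmit"


print(decision(final_holistic_score([2, 3])))     # 2.5 is bumped to 3
print(decision(final_holistic_score([1, 3, 2])))  # top two reads are 3 and 2
```

Under these assumptions an observed average of 2.5 is the effective passing threshold, because it is bumped up to 3.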

    Accreditation and other requirements

    o The unit conducts thorough studies to establish validity and reliability of its performance assessment procedures.

    Research Questions

    Do the raters differ in the levels of severity they exercise?

    Do faculty and practitioners rate essays in the same manner?

    Are there any inconsistent raters whose patterns of ratings show little systematic relationship to the ratings that other raters give?

    Research Questions

    Are there any raters who cannot effectively differentiate between rubric dimensions, giving each candidate very similar ratings across a number of conceptually distinct dimensions?

    Do some candidates exhibit unusual profiles of ratings, receiving unexpectedly high (or low) ratings on certain dimensions, given the ratings the candidate received on other dimensions?

    Sample

    40 scorers (17 higher ed.; 23 practitioners) over 4 scoring sessions from Oct. 2006 to March 2007

    476 teacher candidates

    Essays were randomly assigned to scorers

    Data Analysis

    Many-Facet Rasch Analysis (many-facet Rasch measurement, MFRM)

      A statistical approach to the analysis of rating data

      Facilitates the study of facets of interest in assessments that typically involve human judgment

    Facet: A definable aspect of an assessment setting that may exert influence on the measurement process

      Raters, tasks, students, rater or student background variables, situational variables, etc.

      Each facet in the analysis is composed of various individual elements

    Facets examined in this study: Candidates, raters, tasks (i.e., rubric dimensions)

    FACETS software was used
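To make the three facets concrete, here is a minimal Python sketch of how a many-facet Rasch model turns candidate proficiency, rater severity, and dimension difficulty into rating probabilities. It assumes the common rating scale form of the model, in which the log-odds of category k over category k-1 equal proficiency minus severity minus difficulty minus a threshold. The slides do not show the exact model specification used in FACETS, and the function names and threshold values below are illustrative only.

```python
import math


def category_probabilities(proficiency, severity, difficulty, thresholds):
    """Return P(rating = 1), ..., P(rating = len(thresholds) + 1), all in logits.

    thresholds[k] is the step from the (k+1)-th to the (k+2)-th category.
    """
    # Cumulative log-numerators; the lowest category is the reference (0).
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + proficiency - severity - difficulty - tau)
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_rating(probabilities, lowest_category=1):
    """Model-expected rating for one candidate, rater, and dimension."""
    return sum((lowest_category + i) * p for i, p in enumerate(probabilities))


# Hypothetical thresholds for a 4-category scale (not taken from the study).
TAUS = [-2.5, 0.0, 2.5]
lenient = expected_rating(category_probabilities(1.0, -1.0, 0.0, TAUS))
severe = expected_rating(category_probabilities(1.0, 1.0, 0.0, TAUS))
print(lenient > severe)  # the same candidate is expected to score lower with a severe rater
```

The difference between an observed rating and this model-expected rating is the residual that fit analysis examines.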

    How is MFRM useful?

    Makes possible the analysis of assessments that have multiple potential sources of measurement error, such as:

      Tasks

      Raters

      Rating scales

    Allows us to quantify typical and expected sources of variability within the assessment system

    Enables us to identify observations that lie outside the usual ranges of variability (fit analysis)

    Helps answer the critical question: Are any of these sources introducing unwanted construct-irrelevant variation into the ratings?

    What is MFRM designed to do?

    Helps establish quality control over an assessment system by:

      Providing useful information about how individual elements within a facet are performing

      Determining which elements of the system are/are not working as intended

      Identifying specific aspects of the system that may need to be tweaked to remediate system deficiencies

    Findings: Raters

    Not all raters score with similar levels of severity

      10.4 statistically distinct levels of rater severity

      Significant fixed chi-square statistic, rejecting null hypothesis that all raters are equally lenient

    Exact inter-rater agreement = 34.4% (1654 out of 4802 ratings)
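The two summary figures on this slide can be reproduced with a few lines of Python. This sketch assumes that the "statistically distinct levels" figure is the usual Rasch strata index, H = (4G + 1) / 3, where G is the separation ratio. The slides do not report G, so the value derived below is a back-calculation for illustration, not a reported result, and the function names are hypothetical.

```python
def exact_agreement_percent(agreements, opportunities):
    """Percentage of rating pairs on which two raters gave the same score."""
    return 100 * agreements / opportunities


def strata_from_separation(separation):
    """Strata index H = (4G + 1) / 3 for a separation ratio G."""
    return (4 * separation + 1) / 3


def separation_from_strata(strata):
    """Invert the strata formula: G = (3H - 1) / 4."""
    return (3 * strata - 1) / 4


def reliability_from_separation(separation):
    """Separation reliability R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)


print(round(exact_agreement_percent(1654, 4802), 1))  # agreement figure from the slide
implied_g = separation_from_strata(10.4)               # implied rater separation ratio
print(round(implied_g, 2), round(reliability_from_separation(implied_g), 3))
```

A high separation for raters means that severity differences are reliably detectable, which is the opposite of the interchangeability one would want among raters.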

    ----------------------------------------
    |Measr|-Judges                   |Scale|
    ----------------------------------------
    +   4 +                          + (4) +
    |     |                          |     |
    |     | 1067                     |     |
    |     |                          |     |
    |     |                          |     |
    +   3 +                          +     +
    |     |                          |     |
    |     |                          |     |
    |     | 1015 1022                |     |
    |     | 1003                     |     |
    +   2 +                          +  3  +
    |     |                          |     |
    |     |                          |     |
    |     | 1069                     |     |
    |     |                          |     |
    +   1 + 1001 1006 1010 1064      +     +
    |     | 1018                     |     |
    |     | 1005 1009                |     |
    |     |                          |     |
    |     | 1066                     |     |
    *   0 * 1012 1021                * --- *
    |     | 1011 1062 1063           |     |
    |     | 1002 1019 1024 1061      |     |
    |     | 1008 1013 1023 1060 1065 |     |
    |     | 1072                     |     |
    +  -1 + 1004                     +     +
    |     |                          |     |
    |     | 1017                     |     |
    |     |                          |     |
    |     | 1025                     |     |
    +  -2 + 1016 1050                +  2  +
    |     | 1007 1020                |     |
    |     | 1070                     |     |
    |     | 1071                     |     |
    |     |                          |     |
    +  -3 +                          +     +
    |     |                          |     |
    |     |                          |     |
    |     |                          |     |
    |     |                          |     |
    +  -4 + 1068                     + --- +
    |     |                          |     |
    |     | 1014                     |     |
    |     |                          |     |
    |     | 1073                     |     |
    +  -5 +                          + (1) +
    ----------------------------------------
    |Measr|-Judges                   |Scale|
    ----------------------------------------

    Findings: Raters

    21 of 4802 ratings (0.4%) were highly unexpected

      Taking into account the rater's overall severity and the other ratings the candidate received from other judges

    Two raters gave 8 of 21 unexpected ratings

    Most misfitting ratings were awarded on Conventions, followed by Content

    Unexpected Ratings

    --------------------------------------------------------------
    |Cat Step Exp. Resd StRes | Num Stu  Num Judg N Item         |
    --------------------------------------------------------------
    |  3    3  3.9  -.9  -4.1 | 325 325 1073 1073 2 Expression   |
    |  3    3  3.9  -.9  -3.7 | 416 416 1073 1073 1 Content      |
    |  3    3  3.9  -.9  -3.5 | 325 325 1073 1073 4 Conventions  |
    |  3    3  3.9  -.9  -3.5 | 416 416 1073 1073 5 Overall      |
    |  1    1  3.0 -2.0  -3.5 | 927 927 1065 1065 4 Conventions  |
    |  1    1  2.8 -1.8  -3.0 | 906 906 1065 1065 2 Expression   |
    |  1    1  2.9 -1.9  -3.4 |  42  42 1060 1060 1 Content      |
    |  1    1  2.9 -1.9  -3.4 | 915 915 1060 1060 1 Content      |
    |  1    1  2.9 -1.9  -3.3 |  42  42 1060 1060 5 Overall      |
    |  1    1  2.9 -1.9  -3.3 | 915 915 1060 1060 5 Overall      |
    |  3    3  3.9  -.9  -3.7 | 280 280 1025 1025 4 Conventions  |
    |  4    4  2.0  2.0   3.3 | 987 987 1022 1022 3 Organization |
    |  1    1  2.9 -1.9  -3.4 | 387 387 1012 1012 1 Content      |
    |  4    4  2.1  1.9   3.1 | 991 991 1012 1012 4 Conventions  |
    |  4    4  1.8  2.2   3.7 | 381 381 1012 1012 4 Conventions  |
    |  1    1  2.8 -1.8  -3.0 | 924 924 1006 1006 1 Content      |
    |  3    3  3.9  -.9  -3.6 | 186 186 1002 1002 4 Conventions  |
    |  1    1  3.1 -2.1  -3.7 | 914 914 1002 1002 4 Conventions  |
    |  3    3  1.3  1.7   3.7 | 328 328 1001 1001 4 Conventions  |
    |  1    1  3.0 -2.0  -3.6 |  81  81 1001 1001 1 Content      |
    --------------------------------------------------------------
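The rows listed above can be tallied by rater and by rubric dimension to see where the unexpected ratings concentrate. The Python sketch below transcribes the 20 rows that survive in this transcript (the slides refer to 21 unexpected ratings, so one row appears to be missing here) and treats the residual as the observed category minus the model-expected rating. The variable names are illustrative.

```python
from collections import Counter

# (observed, expected, student, judge, item) transcribed from the rows listed above.
UNEXPECTED = [
    (3, 3.9, 325, 1073, "Expression"),
    (3, 3.9, 416, 1073, "Content"),
    (3, 3.9, 325, 1073, "Conventions"),
    (3, 3.9, 416, 1073, "Overall"),
    (1, 3.0, 927, 1065, "Conventions"),
    (1, 2.8, 906, 1065, "Expression"),
    (1, 2.9, 42, 1060, "Content"),
    (1, 2.9, 915, 1060, "Content"),
    (1, 2.9, 42, 1060, "Overall"),
    (1, 2.9, 915, 1060, "Overall"),
    (3, 3.9, 280, 1025, "Conventions"),
    (4, 2.0, 987, 1022, "Organization"),
    (1, 2.9, 387, 1012, "Content"),
    (4, 2.1, 991, 1012, "Conventions"),
    (4, 1.8, 381, 1012, "Conventions"),
    (1, 2.8, 924, 1006, "Content"),
    (3, 3.9, 186, 1002, "Conventions"),
    (1, 3.1, 914, 1002, "Conventions"),
    (3, 1.3, 328, 1001, "Conventions"),
    (1, 3.0, 81, 1001, "Content"),
]

# Residual = observed rating minus model-expected rating (the Resd column).
residuals = [round(observed - expected, 1) for observed, expected, *_ in UNEXPECTED]

# How many unexpected ratings each judge gave, and on which dimension.
by_judge = Counter(judge for _, _, _, judge, _ in UNEXPECTED)
by_item = Counter(item for *_, item in UNEXPECTED)

print(by_judge.most_common(3))
print(by_item.most_common())
```

Negative residuals mark ratings that were lower than the model expected for that candidate and rater; positive residuals mark ratings that were higher.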

    Findings: Raters

    6 of 9 scorers with unexpected ratings were practitioners

    15 of 21 unexpected ratings were from practitioners

    Overall, practitioner raters were more lenient than faculty raters

    21 raters did not use rating scales consistently across all candidates and all rubric dimensions (mean square infit and outfit statistics >1.2 or
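For readers unfamiliar with the fit statistics named on this slide, here is a minimal Python sketch of the standard mean-square definitions. Outfit is the unweighted mean of squared standardized residuals, and infit is the information-weighted version. The upper bound of 1.2 comes from the slide; the lower bound is cut off in this transcript, so the sketch requires the caller to supply it rather than guessing. The function names and sample numbers are hypothetical.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one rater's ratings.

    observed: ratings given; expected: model-expected ratings;
    variance: model variance of each rating.
    """
    residuals = [x - e for x, e in zip(observed, expected)]
    # Outfit: mean of squared standardized residuals (sensitive to outliers).
    outfit = sum(r * r / v for r, v in zip(residuals, variance)) / len(residuals)
    # Infit: squared residuals weighted by model variance (information).
    infit = sum(r * r for r in residuals) / sum(variance)
    return infit, outfit


def is_misfitting(infit, outfit, lower, upper=1.2):
    """Flag a rater whose infit or outfit falls outside [lower, upper]."""
    return not (lower <= infit <= upper and lower <= outfit <= upper)


# Made-up ratings for illustration; 0.8 is a placeholder lower bound, not the study's.
infit, outfit = fit_mean_squares([3, 2, 4, 1], [2.5, 2.5, 3.0, 2.0], [0.5, 0.5, 0.5, 0.5])
print(infit, outfit, is_misfitting(infit, outfit, lower=0.8))
```

Values well above 1 indicate more variation than the model expects (inconsistent use of the scale); values well below 1 indicate less variation than expected (overly uniform ratings).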

    Findings: Candidates

    Candidates are well differentiated in terms of essay writing skills

      4.1 statistically distinct levels of candidate proficiency

    26% had ratings that were more variable than expected (unusually severe or lenient raters)

    51% had ratings with less variance than expected (little variation in ratings across rubric dimensions)

    Findings: Candidates

    Observed vs. Fair Average scores

      Fair Average score adjusts Observed Average based on differences in severity/leniency of raters

    Overall Score:

      Observed Average = 2.62; Fair Average = 2.64

      Mean difference is not statistically significant, but it may be significant for some students!

    Observed vs. Fair Average Scores

    Use of Fair Average would affect individual pass rates

    If used, cut off score must be carefully selected

    Observed   Fair   Difference   Student
    2          1.11   0.89         363
    3.5        2.79   0.71         382
    3.5        2.79   0.71         416
    3          2.31   0.69         336
    3.5        2.82   0.68         72
    2.5        1.82   0.68         91
    3          2.33   0.67         254
    3.5        2.87   0.63         341
    2          1.41   0.59         298
    2          1.44   0.56         343
    2.5        1.94   0.56         337
    3          2.48   0.52         310
    3.5        2.98   0.52         304
    3          2.5    0.5          79
    4          3.5    0.5          96
    3.5        3.01   0.49         66
    2          1.52   0.48         961
    3          2.52   0.48         331
    3          2.52   0.48         389
    3.5        3.02   0.48         350
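To illustrate why the cut off score matters, the Python sketch below applies one hypothetical cutoff to the 20 students listed above. It assumes that an observed average of 2.5 passes under current practice (because .5 averages are bumped up to 3) and, purely for illustration, applies the same 2.5 cutoff to the Fair Average. The slides do not propose a specific Fair Average cutoff, and the function names are hypothetical.

```python
# (observed average, fair average, student id) for the students listed above.
ROWS = [
    (2.0, 1.11, 363), (3.5, 2.79, 382), (3.5, 2.79, 416), (3.0, 2.31, 336),
    (3.5, 2.82, 72), (2.5, 1.82, 91), (3.0, 2.33, 254), (3.5, 2.87, 341),
    (2.0, 1.41, 298), (2.0, 1.44, 343), (2.5, 1.94, 337), (3.0, 2.48, 310),
    (3.5, 2.98, 304), (3.0, 2.5, 79), (4.0, 3.5, 96), (3.5, 3.01, 66),
    (2.0, 1.52, 961), (3.0, 2.52, 331), (3.0, 2.52, 389), (3.5, 3.02, 350),
]


def passes(average, cutoff=2.5):
    """Pass when the average reaches the cutoff (2.5 is bumped up to 3 today)."""
    return average >= cutoff


def changed_decisions(rows, fair_cutoff=2.5):
    """Students whose pass/fail decision differs between the two averages."""
    return [
        student
        for observed, fair, student in rows
        if passes(observed) != passes(fair, fair_cutoff)
    ]


print(changed_decisions(ROWS))                   # hypothetical cutoff of 2.5
print(changed_decisions(ROWS, fair_cutoff=2.3))  # a lower hypothetical cutoff
```

Small shifts in the assumed cutoff change which students are affected, which is the reason the slide calls for careful selection.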

    Findings: Rubrics

    Rubric dimensions differ significantly in terms of difficulty

    Difficulty measures (in logits)

      Overall Score (.33)

      Content (.19)

      Conventions (-.06)

      Organization (-.08)

      Expression (-.38)

    ---------------------------------------------------------------------------------------
    |Measr|+Students   |-Judges  |-Item                     | S.1 | S.2 | S.3 | S.4 | S.5 |
    ---------------------------------------------------------------------------------------
    +   6 + *.         +         +                          + (4) + (4) + (4) + (4) + (4) +
    |     |            |         |                          |     |     |     |     |     |
    |     |            |         |                          |     |     |     |     |     |
    |     | .          |         |                          |     |     |     |     |     |
    +   5 + .          +         +                          +     +     +     +     +     +
    |     | .          |         |                          |     |     |     |     |     |
    |     | .          |         |                          |     |     |     |     |     |
    |     |            |         |                          |     |     |     |     |     |
    +   4 + .          +         +                          +     +     +     +     +     +
    |     | .          |         |                          |     |     |     |     |     |
    |     | *.         |         |                          |     |     |     |     |     |
    |     | .          |         |                          |     | --- |     |     |     |
    +   3 + .          +         +                          +     +     +     + --- + --- +
    |     | .          |         |                          | --- |     | --- |     |     |
    |     | ***        |         |                          |     |     |     |     |     |
    |     | *.         |         |                          |     |     |     |     |     |
    +   2 + **.        +         +                          +     +     +     +     +     +
    |     | ****.      |         |                          |     |     |     |     |     |
    |     | ****.      | *       |                          |     |  3  |     |     |  3  |
    |     | *****.     | **      |                          |  3  |     |  3  |  3  |     |
    +   1 + ******     + ***     +                          +     +     +     +     +     +
    |     | ****.      |         |                          |     |     |     |     |     |
    |     | *******.   | **      |                          |     |     |     |     |     |
    |     | ********.  | **      | Content Overall          |     |     |     |     |     |
    *   0 * *******.   * ******  * Conventions Organization *     *     * --- *     * --- *
    |     | *******.   | **      |                          | --- | --- |     | --- |     |
    |     | *********. | ***     | Expression               |     |     |     |     |     |
    |     | ******.    | *****   |                          |     |     |     |     |     |
    +  -1 + *******    + ***     +                          +     +     +     +     +     +
    |     | ***        | ***     |                          |     |     |  2  |     |     |
    |     | ****.      | *       |                          |  2  |  2  |     |  2  |  2  |
    |     | *****      |         |                          |     |     |     |     |     |
    +  -2 + ***.       + ***     +                          +     +     +     +     +     +
    |     | **         | *       |                          |     |     |     |     |     |
    |     | *.         |         |                          |     |     | --- |     |     |
    |     | **.        | *       |                          | --- |     |     | --- |     |
    +  -3 + .          +         +                          +     + --- +     +     + --- +
    |     | *.         | *       |                          |     |     |     |     |     |
    |     | .          |         |                          |     |     |     |     |     |
    |     | .          |         |                          |     |     |     |     |     |
    +  -4 +            +         +                          +     +     +     +     +     +
    |     | .          |         |                          |     |     |     |     |     |
    |     |            |         |                          |     |     |     |     |     |
    |     | .          | *       |                          |     |     |     |     |     |
    +  -5 +            +         +                          + (1) + (1) + (1) + (1) + (1) +
    ---------------------------------------------------------------------------------------

    Figure 1: Career Commitment Essay Variable Map

    Conclusions

    Inter-rater reliability could be improved

    Some ratings are overly consistent

    Some raters are inconsistent

    Current practice of bumping scores up compensates for overly severe raters

    Recommendations

    Increase rater training overall, with targeted training for specific raters. Focus on:

      Meaning and distinctions among rubric dimensions

      Using entire rating scale

    Use skilled, experienced scorers rather than continuously recruiting new scorers

    Share results such as these in rater training

    Consider using Fair Average score

    Implement standard setting procedure for determining cut off score

    Evaluate predictive validity of Career Commitment Essay