1
Producing Unbiased Performance Assessment Scores Using the Many-Facet Rasch Model
Ross Brown, Ph.D.
Measurement Incorporated
2
Background
• For more than two decades, the MRA Division of MI has used the many-facet Rasch model (MFRM) for the analysis of client performance assessments.
• Using MFRM for performance assessments offers benefits relating to measurement, fairness, administration, resources, and security.
3
Other Session Objectives
• Understand the psychometric properties of a many-facet Rasch measurement approach to performance assessment scoring
• Understand how stakeholder concerns regarding a MFRM approach can be addressed
• Understand the basics of setting up a performance assessment for a MFRM analysis, as well as setting a passing standard and equating that standard
4
Why Performance Assessments?
• Performance assessments complement written examinations, allowing testing organizations to assess candidates on higher-level decision-making abilities.
• Our clients often use performance assessments to measure candidates’ abilities to apply skills such as diagnosis, treatment, and management of complications in a clinical context, replicating real-world patient situations.
5
Performance Assessment Format
• Examiners rate candidate performance on standardized protocols (i.e., hypothetical patient scenarios) or on candidates’ actual patients.
• Candidates describe how they would diagnose and treat.
• Some other permutations
• The methods we use for organizing and analyzing such performance assessments can be used in different ways and in different fields.
6
Benefits of MFRM: Fairness
• Different examiners have different levels of severity when they assign ratings to candidates.
• If candidate outcomes were determined based on raw scores alone, the severity of individual examiners could differentially affect candidates.
• MFRM allows for the severity of individual examiners to be accounted for before candidate scores are calculated.
7
Benefits of MFRM: Security
• Using an MFRM approach, different exam content (i.e., patient scenarios) can be used for different candidates, reducing the likelihood that candidates will be able to accurately disclose information about the exam content to other candidates.
8
Benefits of MFRM: Fairness
• However, different exam content logically would have different levels of difficulty.
• If candidate outcomes were determined using raw scores, this differential difficulty could unfairly penalize or benefit individual candidates.
• But calculating candidate scores using the MFRM takes into account differences in the particular exam content that different candidates are tested on.
9
Structuring a Performance Assessment for MFRM Analysis
• Examiners interview candidates regarding exam material such as standardized patient scenarios.
• Examiners lead the discussion, asking pointed questions about how candidates would diagnose patients’ illnesses, treat patients, and manage complications.
10
Structuring a Performance Assessment
• Examiners use a rating scale, typically with four points on it, to assign ratings to candidates’ responses.
• Candidates rotate between examiners who assess them on different protocols.
• Ratings are assigned to specific skills related to the exam materials, such as diagnosis, treatment, and management of complications.
• Therefore, you have examiners rating candidates’ performance on skills within protocols.
11
Linking Facet Elements
Facets of the performance assessment:
• Candidates
• Examiners
• Protocols
• Skills
12
Linking Facet Elements
• To quantify and account for differences in the severity of individual examiners and the difficulty of individual protocols, the performance assessment must be carefully structured so that there is overlap of examiners’ ratings on candidates and protocols.
• This overlap links the different facet elements and allows for differences between individual elements to be quantified and accounted for.
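The linking requirement can be checked before any ratings are collected. The following is a minimal sketch, assuming the rating plan is available as a list of (candidate, examiner, protocol) encounters; the function name and the sample data are hypothetical. It treats facet elements as nodes of a graph and reports the linked subsets. A single subset is a necessary condition for placing all elements on one scale, although dedicated MFRM software applies stricter checks.

```python
from collections import defaultdict


def connected_subsets(observations):
    """Group facet elements into linked subsets.

    observations: iterable of (candidate, examiner, protocol) tuples,
    one per rating encounter. Elements that appear together in an
    encounter are treated as linked.
    """
    graph = defaultdict(set)
    for cand, exam, prot in observations:
        nodes = [("candidate", cand), ("examiner", exam), ("protocol", prot)]
        for a in nodes:
            for b in nodes:
                if a != b:
                    graph[a].add(b)

    seen, subsets = set(), []
    for start in graph:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        subsets.append(component)
    return subsets


# Hypothetical plans: the first overlaps examiners and protocols across
# candidates; the second leaves two groups with nothing in common.
linked_plan = [("c1", "e1", "p1"), ("c1", "e2", "p2"), ("c2", "e2", "p1")]
disjoint_plan = [("c1", "e1", "p1"), ("c2", "e2", "p2")]

n_linked = len(connected_subsets(linked_plan))
n_disjoint = len(connected_subsets(disjoint_plan))
```

A plan that yields more than one subset cannot support comparisons between candidates in different subsets, because no common examiner or protocol ties their measures together.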
13
Benefits: Resources and Administration
• No adjustments necessary if all candidates perform the same skills on all protocols and are evaluated by the same examiners.
• Reality: This is usually too expensive or logistically impossible.
• MFRM: Candidates interact with some examiners on selected protocols; each candidate takes a parallel examination form.
14
Benefits: Resources and Administration
• The differences and biases in each of these examination forms must be accounted for to make the candidate ability estimates reasonably consistent, objective and reproducible.
• Organizing a PA this way also affords benefits in terms of the resources required to conduct the PA and the administration of the PA.
15
Benefits: Resources and Administration
• Candidates move through several pairs of examiners who assess them on several protocols.
• A lot of performance information is collected efficiently as several candidates are assessed simultaneously.
16
Benefits: Resources and Administration
• Like the regular Rasch model with only two facets (persons and items), the MFRM produces candidate ability estimates of known precision (error) and reproducibility (reliability).
• Testing organizations can scale their performance assessments so that they achieve the measurement precision and reliability they desire with the resources (time for administration, number of examiners) that they have available.
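As an illustration of how precision and reliability relate, here is a minimal sketch of a separation reliability calculation, assuming candidate measures and their standard errors (both in logits) have already been estimated. It uses the common Rasch definition of reliability as the share of observed variance not attributable to measurement error; the function name and the sample values are hypothetical, and specific software may differ in details such as the variance estimator.

```python
from statistics import mean, variance


def separation_reliability(measures, standard_errors):
    """Estimate reliability as true variance / observed variance.

    measures: candidate ability estimates in logits.
    standard_errors: standard error of each estimate, same order.
    """
    observed_var = variance(measures)          # sample variance of measures
    error_var = mean(se ** 2 for se in standard_errors)
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var


# Hypothetical estimates for three candidates.
abilities = [-1.0, 0.0, 1.0]
errors = [0.5, 0.5, 0.5]
reliability = separation_reliability(abilities, errors)
```

Collecting more ratings per candidate (more examiners or protocols) shrinks the standard errors, which raises the reliability for the same spread of candidate ability.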
17
Psychometric Model
\[
\log\left(\frac{P_{nmijk}}{P_{nmij(k-1)}}\right) = B_n - S_m - C_i - D_j - F_k
\]
where
$P_{nmijk}$ = probability of person n being rated in category k by examiner m on skill j in protocol i,
$P_{nmij(k-1)}$ = probability of person n being rated in category (k - 1) by examiner m on skill j in protocol i,
$B_n$ = the ability of candidate n,
$S_m$ = the severity of examiner m,
$C_i$ = the difficulty of protocol i,
$D_j$ = the difficulty of skill j, and
$F_k$ = the difficulty of the step up from category (k - 1) to category k.
18
Psychometric Model
• Probability of a performance: A function of the difference between candidate ability and skill difficulty, after adjustment for the severity of the examiner and the difficulty of the protocol.
• If, after adjustment, the candidate's ability is higher than the skill difficulty, the probability of an acceptable performance is greater than 50%.
• If, after adjustment, the skill difficulty is greater than the candidate's ability, the probability of an acceptable performance is less than 50%.
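The relationship between the facet measures and rating probabilities can be sketched in a few lines. This is an illustrative calculation, not the estimation procedure itself: it assumes the facet measures are already known, and the function name and all numeric values are hypothetical. It builds category probabilities so that the log-odds of each adjacent pair of categories equals ability minus examiner severity, protocol difficulty, skill difficulty, and step difficulty.

```python
import math


def category_probabilities(b, s, c, d, thresholds):
    """Return P(rating = k) for k = 0..K under a rating-scale MFRM.

    b: candidate ability, s: examiner severity, c: protocol difficulty,
    d: skill difficulty, thresholds: step difficulties [F_1, ..., F_K].
    All values are in logits.
    """
    # Category 0 has log-numerator 0; each step up adds (b - s - c - d - F_k).
    log_num = [0.0]
    for f in thresholds:
        log_num.append(log_num[-1] + (b - s - c - d - f))
    top = max(log_num)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical four-point scale (three steps) and facet measures.
probs = category_probabilities(b=1.0, s=0.5, c=0.2, d=0.3,
                               thresholds=[-1.0, 0.0, 1.0])

# Two-category case: when ability equals the combined difficulty,
# the chance of the higher category is exactly one half.
half = category_probabilities(0.0, 0.0, 0.0, 0.0, [0.0])
```

In the two-category case, raising the ability above the adjusted difficulty pushes the probability of the higher category above 50%, and lowering it pushes the probability below 50%.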
19
Psychometric Model: Ordering Facet Elements
• Ordering of the candidates, examiners, protocols, and skills on a linear scale provides a frame of reference for understanding the relationship of the facets of the PA:
• Candidate ability (Bn) from highest to lowest
• Skill difficulty (Dj) from most to least difficult
• Examiner severity (Sm) from most to least severe
• Protocol difficulty (Ci) from most to least difficult
20
Psychometric Model: Sums of Ratings
• Ratings given by examiners are the basic units of analysis.
• Skill difficulty is calculated from all ratings given to all candidates by all examiners on the skill.
• Protocol difficulty includes all ratings given to all candidates by all examiners on the protocol.
• Examiner severity includes the ratings given by the examiner on all skills across all protocols to all candidates encountered.
21
Psychometric Model: Logits
• Estimates are based on probability of performance given the nature of the facets of the examination encountered by a candidate.
• Log odds units or logits are used to construct an equal interval scale.
• All facet element calibration estimates (candidate ability, examiner severity, skill and/or protocol difficulty) are reported in logits, with a mean of zero.
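A short sketch may help in reading logit values. It assumes a set of uncentered examiner severity estimates (the names and numbers are hypothetical), centers them so the facet mean is zero, and converts a logit difference into a probability for the two-category case.

```python
import math


def center(measures):
    """Shift a facet's measures so that their mean is zero logits."""
    facet_mean = sum(measures.values()) / len(measures)
    return {name: value - facet_mean for name, value in measures.items()}


def logit_to_probability(logit):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical, uncentered examiner severities in logits.
severities = {"examiner_A": 0.9, "examiner_B": 0.3, "examiner_C": -0.3}
centered = center(severities)

# A candidate one logit above the adjusted difficulty succeeds about 73%
# of the time in the two-category case.
p_one_logit = logit_to_probability(1.0)
```

Because the scale is equal-interval, a one-logit advantage corresponds to the same change in log-odds anywhere on the scale, whichever facet it comes from.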
22
Psychometric Model: Measurement Statistics
• Error
• Reliability
• Fit
23
Psychometric Model: Fit
• Estimates of the consistency of the ratings across examiners, skills, and protocols, reported as the fit of the data to the model. Fit statistics indicate inconsistent rating patterns on any of the facets.
• Model expects observed ratings to be consistent:
• More able candidates should earn higher ratings more frequently than less able candidates from all examiners on skills within the protocols.
• More difficult skills and protocols cause lower ratings to be awarded more frequently than easier skills and protocols by all examiners.
24
Psychometric Model: Fit
• The fit statistic (a mean-square) is the ratio of the observed variation in ratings to the variation expected under the model.
• 1 is perfect fit; range of acceptable fit is generally 0.5 to 1.5, although more stringent criteria have been suggested for high-stakes examinations.
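For readers who want to see how such a statistic is formed, here is a minimal sketch of the two mean-square fit statistics commonly reported by Rasch software, infit (information-weighted) and outfit (unweighted). It assumes the model probability of each rating category is already available for every observation; the function name and the sample data are hypothetical.

```python
def mean_square_fit(observations):
    """Return (infit, outfit) mean-squares for one facet element.

    observations: list of (observed_rating, category_probs) pairs, where
    category_probs[k] is the model probability of rating k.
    """
    z2_sum = 0.0      # sum of squared standardized residuals
    resid2_sum = 0.0  # sum of squared raw residuals
    var_sum = 0.0     # sum of model variances
    for observed, probs in observations:
        expected = sum(k * p for k, p in enumerate(probs))
        model_var = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        resid2 = (observed - expected) ** 2
        z2_sum += resid2 / model_var
        resid2_sum += resid2
        var_sum += model_var
    outfit = z2_sum / len(observations)
    infit = resid2_sum / var_sum
    return infit, outfit


# Hypothetical two-category data in which the ratings vary exactly as
# much as the model predicts, so both statistics equal 1.
ratings = [(0, [0.5, 0.5]), (1, [0.5, 0.5])]
infit, outfit = mean_square_fit(ratings)
```

Values well above 1 signal more variation than the model predicts (erratic ratings), while values well below 1 signal less variation than predicted (overly uniform ratings).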
25
Fit Statistic: Examiners
• The fit statistics for examiners indicate the degree to which each examiner is internally consistent across candidates, skills, and protocols (intra-examiner consistency).
• The fit statistic allows examiners who award unexpectedly high or low ratings to some candidates on some skills or protocols to be identified.
26
Fit Statistic: Candidates, Protocols and Skills
• The fit statistic for each candidate, protocol and skill indicates inter-examiner consistency.
• Misfit indicates that some examiners deviated significantly from others when grading the skill or protocol for some candidates.
• This information is useful for testing organizations to monitor, and, if necessary, conduct additional analysis to identify which rating situations are resulting in the larger unexpected ratings.
27
Guidelines for Implementing a MFRM PA
• Development of the rating scale is critical
• Allows for a “disciplined dialogue” among examiners about candidate performance
• Rating scale example: Unacceptable, Deficient, Acceptable, and Excellent
• Defining these terms and providing specific examples of candidate performance for each scale point is critical
39
Thank You
If you have any questions, contact [email protected]
Please complete the session evaluation that has been distributed to you.