

RESEARCH INSTITUTE

The Unreliability of Reliability Statistics: A Primer on Calculating Interrater Reliability in CNS Trials

Popp D1; Mallinckrodt CH2; Williams JBW1,3; Detke MJ3,4

1 MedAvante, Inc.; 2 Eli Lilly and Company; 3 College of Physicians and Surgeons, Columbia University; 4 Indiana University School of Medicine

©2013 MedAvante Inc.

Commonly used reliability statistics are reviewed and the appropriateness of their use with various data types and methodologies typical of CNS clinical trials is evaluated. Guidelines for selecting appropriate reliability statistics are presented.

Finally, common misuses of reliability statistics are discussed and the impact of inappropriate analyses on estimates of reliability is demonstrated.

In CNS clinical trial research, IRR is typically measured using one or both of the following methodologies:

• Investigator Meeting (IM) Rating Precision Exercises: typically, a large group of raters independently score one or more subject videotapes prior to study start

• In-Study Surveillance: an expert clinician reviews and independently scores audio/videotaped in-study assessments

Further, IRR can be measured for both diagnosis and outcome variables (e.g., severity scales).

The decision tree summarized below ("Guidelines for selecting the appropriate IRR statistic") can be used to determine the appropriate IRR measure for various methodologies based on the type of variable, the number of raters and the number of subjects or observations.

Success rates in clinical trials of approved antidepressants are less than 50 percent even when theoretically powered at 80–90 percent (Khin et al., 2011). However, power calculations rarely take into account the variability attributable to the less than perfect agreement between raters in the subjective assessments of symptom severity in CNS clinical trials – that is, interrater reliability. As depicted in the table below, failing to account for interrater reliability can have substantial implications for study power and the ability to distinguish effective drugs from placebo.

Impact of Interrater Reliability (IRR) on Power and Sample Size

Interrater Reliability | Power (1 – β)* | Sample Size Required to Retain 80% Power** | % Increase in Sample Size to Retain 80% Power
1.0 | 80% | 100 | -
0.9 | 76% | 111 | 11%
0.7 | 65% | 143 | 43%
0.5 | 51% | 200 | 100%

*Muller & Szegedi, 2002  **Perkins, Wyatt & Bartko, 2002

Poor IRR or inaccurate reliability estimates resulting from inappropriate reliability statistics can have significant consequences, including increased R&D costs, significant delays in getting effective drugs to patients who need them, and terminating development of effective drugs.

Despite the importance of reliable outcome assessments, clinical trial reporting seldom includes estimates of IRR (Mulsant et al., 2002). When reported, selection of reliability statistics is inconsistent and often inappropriate for the level of measurement or methodology employed. A set of guidelines is proposed for the appropriate selection of reliability measures for CNS clinical trials.
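The poster does not give the design behind this table, but the pattern can be approximated with a short calculation. The sketch below assumes a two-arm parallel design, two-sided alpha = 0.05, a true effect size calibrated so that 100 subjects per arm gives 80% power at perfect reliability, and an observed effect size attenuated by the square root of the reliability; these assumptions are ours, not the authors' calculation.

```python
# Rough reconstruction of the table's pattern, not the authors' calculation.
# Assumptions: two-arm parallel design, two-sided alpha = 0.05, true effect size
# calibrated so that n = 100 per arm gives 80% power at perfect reliability,
# observed effect size attenuated by sqrt(ICC).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
alpha, n_ref = 0.05, 100
d_true = analysis.solve_power(nobs1=n_ref, alpha=alpha, power=0.80)  # effect size at ICC = 1.0

for icc in (1.0, 0.9, 0.7, 0.5):
    d_obs = d_true * icc ** 0.5                                      # attenuated effect size
    power = analysis.power(effect_size=d_obs, nobs1=n_ref, alpha=alpha)
    n_req = analysis.solve_power(effect_size=d_obs, alpha=alpha, power=0.80)
    print(f"ICC {icc:.1f}: power at n=100/arm = {power:.0%}, n/arm for 80% power = {n_req:.0f}")
```

Under these assumptions the output closely tracks the pattern shown in the table (power falling toward roughly 50% and required sample size roughly doubling as reliability drops to 0.5).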

Kappa
The most commonly used measure of IRR for psychiatric diagnosis (Cohen, 1960; Fleiss, 1971), Kappa is a measure of agreement between two or more raters across two or more subjects. Kappa can be used with binary, nominal or ordinal data. Kappa is preferred to percent agreement as it is corrected for chance agreement. Cohen's Kappa is used when two raters rate two or more subjects, such as with in-study surveillance methods, whereas Fleiss' Kappa is used for multiple raters, such as data collected at IMs.

Methodology: Recommendations for Diagnostic Reliability
- Investigator meeting rating precision exercise: Fleiss' Kappa
- In-study surveillance: Cohen's Kappa
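As an illustration, a minimal sketch of both forms of Kappa using common Python libraries; the diagnosis labels and ratings below are hypothetical, not study data.

```python
# Illustrative only: hypothetical diagnosis labels, not study data.
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Cohen's Kappa: expert reviewer vs. site rater across 6 subjects (in-study surveillance)
site_rater = ["MDD", "MDD", "GAD", "MDD", "GAD", "MDD"]
reviewer   = ["MDD", "MDD", "MDD", "MDD", "GAD", "MDD"]
print("Cohen's kappa:", cohen_kappa_score(site_rater, reviewer))

# Fleiss' Kappa: several raters score the same videotaped subjects (investigator meeting)
# rows = subjects (videos), columns = raters, entries = diagnostic category codes
ratings = [[0, 0, 1, 0],
           [1, 1, 1, 0],
           [0, 0, 0, 0],
           [1, 0, 1, 1]]
table, _ = aggregate_raters(ratings)          # counts of each category per subject
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))
```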

Outcome Measures
Common efficacy outcomes in CNS clinical trials are summed total or subscale scores on psychiatric rating scales (e.g., MADRS, PANSS).

T-Tests/Analysis of Variance (ANOVA)
One method to assess agreement between two or more raters is a means comparison test, such as a paired-sample t-test or one-way repeated measures ANOVA. These tests examine whether multiple raters' scores of the same subjects are statistically significantly different from one another (i.e., whether the disagreement between raters reaches statistical significance). Regardless of statistical significance, results of means comparisons should be accompanied by estimates of effect size, such as Cohen's d, in order to judge the magnitude of difference between raters.
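A minimal sketch of such a paired comparison with an accompanying effect size, on hypothetical rater totals; here Cohen's d for paired data is taken as the mean difference divided by the standard deviation of the differences.

```python
# Illustrative only: hypothetical MADRS totals from two raters scoring the same 8 subjects.
import numpy as np
from scipy import stats

rater_a = np.array([24, 31, 18, 27, 35, 22, 29, 20], dtype=float)
rater_b = np.array([26, 30, 21, 27, 38, 24, 28, 23], dtype=float)

t, p = stats.ttest_rel(rater_a, rater_b)       # paired-samples t-test on the rater difference
diff = rater_a - rater_b
d = diff.mean() / diff.std(ddof=1)             # Cohen's d for paired data (mean diff / SD of diffs)
print(f"t = {t:.2f}, p = {p:.3f}, Cohen's d = {d:.2f}")
```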



Guidelines for selecting the appropriate IRR statistic (observations = subjects or videos)

Categorical variable (e.g., Diagnosis):
- 1 rater: cannot calculate reliability
- 2 raters, 1 observation: cannot calculate reliability
- 2 raters, 2+ observations: Cohen's Kappa
- 3+ raters, 1 observation: % Agreement
- 3+ raters, 2+ observations: Fleiss' Kappa

Continuous variable (e.g., Severity Scale):
- 1 rater: cannot calculate reliability
- 2 raters, 1 observation: CoV, r_wg, ADI
- 2 raters, 2+ observations: Paired t-test, Bland-Altman, ICC
- 3+ raters, 1 observation: CoV, r_wg, ADI
- 3+ raters, 2+ observations: Repeated measures ANOVA, ICC



Bland-Altman Plots
Another measure of the magnitude of (dis)agreement between two raters is the Bland-Altman test (Bland & Altman, 1986). A Bland-Altman plot visually depicts agreement between two raters across multiple observations. The difference of the two ratings is plotted on the Y-axis and the average of the two ratings on the X-axis. Three reference lines delineated on the plot indicate the average difference between the raters and the upper and lower confidence limits. The greater the agreement between the two raters, the more closely clustered the points around zero on the Y-axis. A sample Bland-Altman plot using surveillance data showed good agreement, with values clustered around zero on the Y-axis and confidence limits near +/- 3 points on the MADRS.

[Figure: Bland-Altman plot of in-study surveillance ratings. X-axis: average of the two ratings (0-40); Y-axis: difference between the ratings (-6 to +6).]
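For illustration, a minimal sketch of such a plot on hypothetical ratings; the reference lines are drawn at the mean difference and at the 95% limits of agreement (mean +/- 1.96 SD of the differences), one common choice for the upper and lower limits described above.

```python
# A minimal Bland-Altman plot for two raters' totals (hypothetical data).
import numpy as np
import matplotlib.pyplot as plt

rater_a = np.array([24, 31, 18, 27, 35, 22, 29, 20, 12, 33], dtype=float)
rater_b = np.array([26, 30, 21, 27, 38, 24, 28, 23, 11, 34], dtype=float)

avg = (rater_a + rater_b) / 2
diff = rater_a - rater_b
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)                  # 95% limits of agreement

plt.scatter(avg, diff)
plt.axhline(bias, linestyle="-")               # average difference between raters
plt.axhline(bias + loa, linestyle="--")        # upper limit of agreement
plt.axhline(bias - loa, linestyle="--")        # lower limit of agreement
plt.xlabel("Average of the two ratings")
plt.ylabel("Difference between ratings")
plt.show()
```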


Shrout and Fleiss (1979) proposed six forms of ICC. Decisions about which form of ICC is estimated should be based on the type and number of raters and whether the outcome variable of interest is from a single rater or the average score from multiple raters (e.g., four raters assess all subjects on the MADRS and the outcome variable is the average of the four scores).


Guidelines for selecting the appropriate ICC

- Different raters rate each subject (the same raters did not rate all subjects): one-way random effects ANOVA; ICC(1,1) when the variable of interest is a single rating, ICC(1,n) when it is the average rating.
- The same raters rate all subjects and were selected from a larger pool of raters: two-way random effects ANOVA; ICC(2,1) for a single rating, ICC(2,n) for the average rating.
- The same raters rate all subjects and were not selected from a larger pool: two-way fixed effects ANOVA; ICC(3,1) for a single rating, ICC(3,n) for the average rating.
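One practical way to follow these guidelines is to compute all six forms and report the one that matches the design. The sketch below assumes the third-party pingouin package and long-format data with hypothetical column names.

```python
# A sketch (assuming the third-party pingouin package) that reports all six
# Shrout & Fleiss ICC forms so the one matching the design can be selected.
import pandas as pd
import pingouin as pg

# Long format, hypothetical column names; every rater scores every subject.
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":   ["A", "B", "C"] * 4,
    "madrs":   [24, 26, 25, 31, 30, 33, 18, 21, 19, 27, 27, 29],
})

icc = pg.intraclass_corr(data=df, targets="subject", raters="rater", ratings="madrs")
# Rows ICC1/ICC2/ICC3 are the single-rating forms ICC(1,1), ICC(2,1), ICC(3,1);
# ICC1k/ICC2k/ICC3k are the corresponding average-rating forms.
print(icc[["Type", "ICC", "CI95%"]])
```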

Common Misuses of Reliability Statistics

Dichotomizing continuous outcome measures
Kappa has often been misused to estimate the IRR of continuous outcome measures. In order to estimate Kappa from continuous outcome measures, the variable must be artificially transformed into a dichotomous or categorical variable. Kappa is highly influenced by the criterion measure selected. At times, a fixed criterion (e.g., +/- 20 percent) is used to indicate rater agreement with a "gold standard" score. For example, with a criterion of +/- 20 percent of the gold standard, 85 percent of raters may "meet criteria." However, if the criterion is narrowed to within +/- 10 percent of the gold standard, the number of raters meeting criteria may drop to 45 percent. Selecting a broader criterion range can artificially inflate Kappa.

IRR must be estimated using the variable as it will be used in the primary efficacy analysis to accurately assess the IRR of an outcome measure. That is, dichotomization of variables for IRR should only take place if one plans to dichotomize the outcome measure in the final data analysis. Therefore, Kappa is almost always the incorrect measure of IRR for severity scales in CNS clinical trials.
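The sensitivity to the criterion width is easy to demonstrate with a small simulation (not poster data); the exact percentages depend on the assumed rater error, which is arbitrary here.

```python
# Simulation (not poster data): how the criterion width changes the share of
# raters who "meet criteria" relative to a gold-standard score.
import numpy as np

rng = np.random.default_rng(1)
gold = 28.0                                    # hypothetical gold-standard MADRS total
raters = gold + rng.normal(0, 4, size=200)     # 200 raters with SD = 4 points of error

for pct in (0.20, 0.10):
    within = np.abs(raters - gold) <= pct * gold
    print(f"+/-{pct:.0%} criterion: {within.mean():.0%} of raters meet criteria")
```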

Treating Items as Subjects
It is sometimes impossible to obtain ratings of multiple observations or subjects. In cases where two or more raters rated a single subject (as in a group calibration at an investigator meeting), one common error is to treat individual items on a scale as independent observations to compensate for the lack of multiple observations. However, ICCs calculated this way may be inversely related to the reliability of a construct (James, Demaree, & Wolf, 1984). For example, imagine a situation in which 20 raters scored one videotaped Montgomery-Asberg Depression Rating Scale (MADRS) assessment at an investigator meeting. If one treats the individual items of the MADRS as 10 independent observations, by definition a high ICC is achieved only because the between-item mean squares are large in relation to the within-item mean square. That is, higher ICCs are actually inversely related to internal scale consistency, which may indicate that raters are not applying the scale correctly, and additional observations may reveal that interrater reliability issues are present.

When it is not possible to obtain ratings of multiple subjects, interrater agreement (not reliability) should be estimated using the agreement indices presented below.

Diagnosis
Reliability of psychiatric diagnosis can only be calculated when two or more raters rate two or more subjects. Reliability for diagnosis cannot be estimated with only one observation or subject because there is no variability.

Intraclass Correlation Coefficient (ICC)
The ICC (Shrout & Fleiss, 1979) is required for appropriate measurement of IRR with continuous outcome measures. ICC is a measure of the interchangeability of raters in a larger cohort. To calculate ICC, two or more raters must rate two or more subjects. An ICC, or any measure of reliability, cannot be calculated on ratings of a single subject. ICC is calculated as:

ICC = Variance due to rated subjects / (Variance due to subjects + Variance due to raters + Residual Variance)

Larger ICCs indicate better agreement between raters or a higher degree of interchangeability. Confidence intervals should be reported when calculating ICCs.

Methodology: Recommendations for Outcome Measure Reliability (Two or more observations)
- Investigator meeting rating precision exercise: Repeated measures ANOVA with effect size; ICC(2,1) with 95% confidence intervals
- In-study surveillance: Paired samples t-test with effect size; Bland-Altman plot; ICC(1,1) with 95% confidence intervals
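The variance-ratio formula above can be computed directly from the two-way ANOVA mean squares. The sketch below implements ICC(2,1) (two-way random effects, single rating) on a hypothetical subjects-by-raters array; it is a bare-bones illustration and omits the recommended confidence interval.

```python
# Bare-bones ICC(2,1): two-way random effects, absolute agreement, single rating
# (Shrout & Fleiss, 1979), computed from the two-way ANOVA mean squares.
import numpy as np

def icc_2_1(scores):
    """scores: n_subjects x k_raters array; every rater rates every subject."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_subj = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)    # between-subjects MS
    ms_rater = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between-raters MS
    ss_total = ((x - grand) ** 2).sum()
    ms_err = (ss_total - (n - 1) * ms_subj - (k - 1) * ms_rater) / ((n - 1) * (k - 1))
    return (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err + k * (ms_rater - ms_err) / n)

# Hypothetical data: 5 subjects rated by 3 raters (e.g., MADRS totals)
scores = [[24, 26, 25], [31, 30, 33], [18, 21, 19], [27, 27, 29], [12, 11, 14]]
print(f"ICC(2,1) = {icc_2_1(scores):.2f}")
```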


Estimating Interrater Agreement from a Single Observation
The statistics shown require that reliability be measured on more than one subject. However, it is not always possible to obtain multiple observations. While it is not possible to estimate reliability with only one observation, agreement can be estimated.

The most straightforward agreement statistic for a single observation is the Coefficient of Variation (CoV), a standardized measure of the variability of rater scores, calculated as the standard deviation divided by the mean. The lower the CoV, the more aligned the raters' scores, with 0 indicating that all of the scores are the same.

Alternatively, one can estimate r_wg (James, Demaree & Wolf, 1984) to compare the observed variance in multiple raters' ratings of a single target to the variance expected if all of the ratings were random. r_wg typically ranges from zero to one, with higher values indicating greater agreement.
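A brief sketch of both single-observation indices on hypothetical ratings of one video; the r_wg null variance here assumes uniformly random responding over the scale's possible scores, which is one common (but not the only) choice of null distribution.

```python
# Agreement on a single observation: hypothetical ratings of one video by 8 raters.
import numpy as np

ratings = np.array([27, 29, 25, 30, 28, 26, 31, 27], dtype=float)

cov = ratings.std(ddof=1) / ratings.mean()       # Coefficient of Variation
# r_wg (James, Demaree & Wolf, 1984): 1 - observed variance / null variance.
# Assumption: the null is uniform responding over the scale's possible scores.
A = 61                                           # e.g., a 0-60 total score -> 61 options
sigma_e2 = (A ** 2 - 1) / 12.0                   # variance of a discrete uniform null
r_wg = 1 - ratings.var(ddof=1) / sigma_e2

print(f"CoV = {cov:.3f}, r_wg = {r_wg:.3f}")
```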


Finally, average deviation (AD) indices such as the average deviation of the mean (AD_M) or median (AD_MD) can be used to estimate agreement among raters on a single observation (Burke, Finkelstein & Dusig, 1999). Average deviation is calculated as the average absolute deviation across raters from a point of central tendency, namely the mean or median. One benefit of AD_M and AD_MD is that they maintain the raw metric of the observed variable.

Methodology: Recommendations for Outcome Measure Reliability* (Single observation)
- Investigator meeting rating precision exercise and in-study surveillance: CoV; r_wg; AD_M or AD_MD

*Reliability cannot be estimated from one observation. Therefore, we recommend obtaining ratings of two or more subjects whenever possible.
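For completeness, the AD indices on the same style of hypothetical single-observation data; both stay in the raw metric of the scale (here, points).

```python
# Average deviation indices on hypothetical single-observation ratings.
import numpy as np

ratings = np.array([27, 29, 25, 30, 28, 26, 31, 27], dtype=float)

ad_m = np.abs(ratings - ratings.mean()).mean()       # AD_M: mean absolute deviation from the mean
ad_md = np.abs(ratings - np.median(ratings)).mean()  # AD_MD: mean absolute deviation from the median
print(f"AD_M = {ad_m:.2f} points, AD_MD = {ad_md:.2f} points")
```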


Reliability can have a significant impact on clinical trial outcomes. It is important to accurately assess and report IRR prior to study start and throughout the course of a clinical trial. When IRR is assessed prior to study start, it is possible for researchers to employ a methodology for obtaining IRR data that fully exploits the strengths of a particular statistic. However, since these estimates are often obtained without independent interviews (i.e., watching videotaped assessments), in artificial settings (i.e., at investigator meetings) and at a single point in time (i.e., prior to the start of the study), it is important to couple these estimates with IRR calculated from actual trial assessments throughout a study.

When selecting reliability statistics, researchers must take into account the type of variable (e.g., binary, nominal, interval), the number of raters, the composition of the rater pool (i.e., same raters rate all subjects vs. raters selected from a larger pool) and the number of observations, using the guidelines presented for various methodologies.

Disclosure
One or more authors report potential conflicts, which are described in the program.

References
Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 1986; 327(8476):307-310.
Burke MJ, Finkelstein LM, Dusig MS. On average deviation indices for estimating interrater agreement. Organizational Research Methods, 1999; 2(1):49-68.
Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960; 20(1):37-46.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971; 76(5):378-382.
James LR, Demaree RG, Wolf G. Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 1984; 69:85-98.
Khin NA, Chen Y, Yang Y, Yang P, Laughren TP. Exploratory analyses of efficacy data from major depressive disorder trials submitted to the US Food and Drug Administration in support of new drug applications. Journal of Clinical Psychiatry, 2011; 72(4):464-472.
Muller MJ, Szegedi A. Effects of interrater reliability of psychopathologic assessment on power and sample size calculations in clinical trials. Journal of Clinical Psychopharmacology, 2002; 22:318-325.
Mulsant BH, Kastango KB, Rosen J, Stone RA, Mazumdar S, Pollock BG. Interrater reliability in clinical trials of depressive disorders. American Journal of Psychiatry, 2002; 159:1598-1600.
Perkins DO, Wyatt RJ, Bartko JJ. Penny-wise and pound-foolish: the impact of measurement error on sample size requirements in clinical trials. Biological Psychiatry, 2002; 47:762-766.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 1979; 86(2):420-428.
