
    Review

Inter-rater reliability of case-note audit: a systematic review

Richard Lilford, Alex Edwards, Alan Girling, Timothy Hofer1, Gian Luca Di Tanna, Jane Petty2, Jon Nicholl3

Department of Public Health & Epidemiology, University of Birmingham, Birmingham, UK; 1Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, MI, USA; 2School of Psychology, University of Birmingham, UK; 3Medical Care Research Unit, University of Sheffield, Sheffield, UK

Objective: The quality of clinical care is often assessed by retrospective examination of case-notes (charts, medical records). Our objective was to determine the inter-rater reliability of case-note audit.

Methods: We conducted a systematic review of the inter-rater reliability of case-note audit. Analysis was restricted to 26 papers reporting comparisons of two or three raters making independent judgements about the quality of care.

Results: Sixty-six separate comparisons were possible, since some papers reported more than one measurement of reliability. Mean kappa values ranged from 0.32 to 0.70. These may be inflated due to publication bias. Measured reliabilities were found to be higher for case-note reviews based on explicit, as opposed to implicit, criteria and for reviews that focused on outcome (including adverse effects) rather than process errors. We found an association between kappa and the prevalence of errors (poor quality care), suggesting alternatives such as tetrachoric and polychoric correlation coefficients be considered to assess inter-rater reliability.

Conclusions: Comparative studies should take into account the relationship between kappa and the prevalence of the events being measured.

Journal of Health Services Research & Policy Vol 12 No 3, 2007: 173–180 © The Royal Society of Medicine Press Ltd 2007

Introduction

Improving the quality and safety of health care has become the focus of much scientific, management and policy effort. In order to establish whether or not a change in policy or management practice is effective, it is necessary to develop metrics of quality.1 Similarly, performance management and quality assurance programmes are based on measurements. In each case, the following two questions can be asked:

Is the measure valid? i.e. does it measure the underlying construct we wish to examine, namely the quality of care?

Is it reliable? i.e. what is the intra- and inter-observer variation?

    This article is concerned with inter-rater reliability.

Quality of care may be measured in many ways, including direct observation, prospective data collection by staff, the use of simulated patients and evaluation of videotapes of episodes of care.2,3 However, case-notes (referred to as medical records and, in North America, as charts) are the most widely used (and studied) source of information. Review of case-notes is used routinely by the world's largest provider of medical services (Medicare). It formed the basis of the Harvard Medical Practice Study,4 the related Colorado–Utah Study5 and the Quality in Australian Health Care Study.6 It does not disrupt normal patterns of care and can be conducted independently of care-givers to reduce observer bias.7 Given the importance of case-note review in both quality improvement and research, it is important to know how reliable it is. Goldman8 conducted a review of 12 studies published between the years 1959 and 1991.4,9–19

We extended and updated that review and sought to introduce a taxonomical layer by examining different types of quality measurement with respect to what is being assessed and how it is being assessed.

First, we consider what is being assessed. We distinguish three types of endpoint. First, quality may be assessed with respect to clinical processes where the correct standard of care is judged to be violated, because the right care was not given (error of omission) or the wrong care was given (error of commission).

Correspondence to: [email protected]

Richard Lilford PhD, Professor of Clinical Epidemiology, Alex Edwards PhD, Research Fellow, Alan Girling MA, Senior Research Fellow, Gian Luca Di Tanna MPhil, Research Fellow, Department of Public Health & Epidemiology, Jane Petty BSc, Research Assistant, School of Psychology, University of Birmingham, Birmingham B15 2TT, UK; Timothy Hofer MD, Associate Professor, Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, MI, USA; Jon Nicholl MA, Professor of Health Services Research, Medical Care Research Unit, University of Sheffield, Sheffield, UK.


We will refer to these assessments as measurements of process. Second, quality may be assessed in terms of the occurrence of an adverse event or outcome. Lastly, quality may be assessed in terms of adverse events that can be attributed to clinical process error; these we call measures of causality.

Next, we turn to how assessments are made. We discern two broad methods: explicit (algorithmic) methods, based on highly specified and detailed checklists, and implicit (holistic) methods, based on expert judgement. The latter may be subclassified as unstructured (where there is no predetermined guidance for the reviewer) or structured (where the reviewer is guided to look for certain categories of error or adverse event).

We will refer to the [Process, Causality, Adverse Event] axis as Focus and the [Implicit, Explicit] axis as Style. This typology is summarized in Figure 1. Intuitively, one might hypothesize that the explicit methods would be more reliable than the implicit methods and that outcome measures would be more reliable than either process or causality. We examine these hypotheses.

    Methods

    Summary measures of reliability

The statistic used for the calculation of reliability can affect the measurement obtained.20 We were constrained in our choice, since the most widely used method was Cohen's kappa,21 either in its weighted or unweighted form. Also, some studies used the intra-class correlation coefficient (ICC) as the measure of inter-rater agreement. These two methods are equivalent for the case where ratings have multiple ordered categories.22 Moreover, an ICC calculated for binary (0/1) data is identical to kappa calculated under the assumption of rater homogeneity (the so-called intra-class kappa).23 Accordingly, we use the term kappa generically, to encompass all versions of Cohen's kappa as well as the few studies that calculate an ICC.
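For reference (this restates Cohen's standard definition rather than anything specific to the reviewed studies), kappa for two raters is computed from the observed agreement p_o and the agreement p_e expected by chance from the raters' marginal distributions:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_o = \sum_i p_{ii},
\qquad
p_e = \sum_i p_{i\cdot}\, p_{\cdot i}
```

where p_{ii} is the proportion of cases both raters place in category i, and p_{i\cdot}, p_{\cdot i} are the corresponding marginal proportions for each rater.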

Kappa is affected by the prevalence of the event on which reviewers' judgements are required. An increase in the prevalence of an event being observed will, of itself, generate an increase in kappa until the event rate reaches 50%, following which it will again decline.24 Prevalence is estimated from the marginal distribution of the individual raters' assessments.25–28 Thus, kappa may depend on the overall frequency of error or adverse event in a study, even though the raters' fundamental measurement process does not change. We therefore analysed our data-set with a view to detecting any effect of prevalence on kappa and adjusted for prevalence in the analysis where possible.
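To illustrate this prevalence effect in isolation, the following sketch (with purely hypothetical counts of our own, not data from any reviewed study) computes kappa for two 2 x 2 rater tables that share the same 90% raw agreement but differ in how often an error is recorded:

```python
import numpy as np

def cohen_kappa(table):
    """Unweighted Cohen's kappa for a k x k cross-tabulation of two raters."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_obs = np.trace(table) / n                                     # observed agreement
    p_exp = (table.sum(axis=1) * table.sum(axis=0)).sum() / n ** 2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical tables: rows = rater 1, columns = rater 2, categories = [error, no error].
# Both show 90% raw agreement; only the prevalence of "error" judgements differs.
common_event = [[40, 5], [5, 50]]   # "error" recorded in roughly half of the cases
rare_event   = [[ 5, 5], [5, 85]]   # "error" recorded in roughly 10% of the cases

print(round(cohen_kappa(common_event), 2))  # about 0.80
print(round(cohen_kappa(rare_event), 2))    # about 0.44
```

Raw agreement is identical in both tables; only the event rate changes, yet kappa nearly halves.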

In Cohen's original paper,21 reliability is defined in terms of a direct comparison between the judgements of two reviewers. However, more than two reviewers may be involved. For instance, the Harvard Medical Practice Study29 averaged the rating of two reviewers and then calculated reliability between that average rating and an average (consensus) rating obtained independently from a group of experts. Takayanagi et al.30 compared panels of raters of the quality of care. Likewise, Thomas et al.3 compared panels of reviewers. Rubin et al.20 constructed yet another variant: they quoted the reliability of a single review compared with an average of a panel of reviewers. Higher levels of agreement will occur when a measurement is an average over several raters than when individual raters are compared, as evidenced by the Spearman–Brown formula. Thus, including studies such as the Harvard Practice Study or Rubin's method would have resulted in higher (more flattering) measurements of agreement than those found in other studies. Therefore, we restricted ourselves to comparisons of two or three reviewers making separate judgements about the quality of care. (In the event, the Harvard Medical Practice Study data was re-analysed by Localio et al.31 in a way that did allow just two raters to be compared, and this study was included in the analysis.)
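The Spearman–Brown formula referred to above relates the reliability r_1 of a single rater to the reliability r_k of an average over k raters (a standard result, stated here in our own notation):

```latex
r_k = \frac{k\, r_1}{1 + (k - 1)\, r_1}
\qquad\Longleftrightarrow\qquad
r_1 = \frac{r_k}{k - (k - 1)\, r_k}
```

The inverted form is what allows a reliability quoted for the average of two reviewers to be converted back to a single-reviewer value, as in the Hayward & Hofer entry in Appendix A, where a kappa of 0.34 for the average of two reviewers corresponds to approximately 0.20 for a single reviewer.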

    Search strategy

We used the National Library of Medicine (NLM) Gateway facility to search the MEDLINE and SCISEARCH databases using the 21 search strings shown in Table 1. As can be seen, many produced massive yields.

[Figure 1 is a diagram: each comparison is classified by Focus (Process Error (Clinical), Causality, Adverse Event) and by Style (Explicit (Algorithmic) or Implicit (Holistic), the latter subdivided into Structured and Unstructured).]

Figure 1 Taxonomy of comparison type for studies of inter-rater reliability. Each instance where inter-rater agreement was measured was classified according to focus and then again according to style. Within these categories (focus and style), the material was broken down into the mutually exclusive categories shown. Note that a single paper might contain more than one measurement of inter-rater agreement.


Scrutinizing the abstracts to identify relevant papers would have been a huge task. We therefore used Goldman's papers as a benchmark: we selected those strings that were most efficient at uncovering his original 12 references (we were unable to replicate Goldman's method since he did not report his search method explicitly). Table 1 shows the search strings that were found collectively to locate 11 of the 12 Goldman papers we were able to find via Gateway.

The most productive search strings are defined as those that gave the highest value of the ratio:

Productivity = 100 x (No. of Goldman papers found / Total no. of papers found)

Searches using the most productive strings and their variants were continued until the level of productivity became prohibitively costly of resources. For example, using the string 'peer review quality medical care records', which had a productivity of 2.17 (Table 1), delivered two Goldman papers among 92 hits, which meant that on average 46 papers needed to be inspected before a potentially useful paper was found. On the other hand, use of the string 'care quality', with a productivity of 0.24, required, on average, the inspection of 408 papers for every potentially useful paper found, and this clearly represented an unacceptable level of cost. There was a step decrease in productivity beyond the sixth string (Table 1).

We reviewed the abstracts of the papers identified by the most productive six strings. This resulted in identification of 54 papers that appeared promising. We obtained all papers that were available via the Internet or from the University of Birmingham library. These papers were included in our analysis if they contained information about the degree of agreement between reviewers. The object of investigation was a set of case-notes, and the topic of study was quality of care as reflected by process, adverse event or causality.

Additional papers were uncovered by examining reference lists in the retrieved articles. In all, 32 eligible papers were found, including nine of the 12 original Goldman papers. We were also previously aware of one further paper31 that we added to our total, yielding 33 papers in all.

    Data extraction

The papers were read independently by two investigators (AE and JP) who extracted the data (Appendix A). Where discrepancies occurred, RL read the article and arbitrated. The most frequent point of disagreement was between the unstructured and structured categories of implicit reviews. In some cases, inter-rater agreement had to be calculated from source data, and this was carried out by AE and confirmed by AG.

Results

Excluded papers

Our search yielded 33 papers.2–4,6,9,10,12–14,17–20,24,29–47

Six papers3,12,17,20,29,30 were excluded because they used large numbers of raters. The remaining 27 papers had all used two or three raters. One paper32 analysed 70 items relating to quality of care, but did not measure inter-rater agreement for each of the items and failed to provide information necessary for classification according to style or focus; it gave only an average kappa value across all items (0.9) and the kappa value for the lowest scoring item (0.6). We excluded this paper, leaving 26 papers that yielded comparable data; these are listed in Appendix A.

Table 1 Productive search strings used for the location of Goldman papers via Gateway

String \ Goldman ref | 10 6 11 5 7 3 9 4 62 8 63 64 | C | P | Total | Prod

(1) Kappa care | 1 1 1 | 3 | 3 | 66 | 4.55
(2) Medical records peer review* care | 1 1 1 | 3 | 6 | 119 | 2.52
(3) Peer review quality medical care records | 1 1 | 2 | 6 | 92 | 2.17
(4) Outcome and process assessment (MH) peer review | 1 | 1 | 7 | 61 | 1.64
(5) Care quality medical records review* | 1 1 1 1 | 4 | 8 | 257 | 1.56
(6) Medical records (MH) review* care quality | 1 1 1 | 3 | 8 | 219 | 1.37
(7) Peer review | 1 1 1 1 | 4 | 8 | 3647 | 1.10
(8) Medical review* care | 1 1 1 1 1 1 1 | 7 | 10 | 7989 | 0.88
(9) Care quality medical records | 1 1 1 1 1 | 5 | 11 | 808 | 0.62
(10) Peer medical review* care | 1 1 1 1 | 4 | 11 | 685 | 0.58
(11) Medical records review* care | 1 1 1 1 | 4 | 11 | 780 | 0.51
(12) Medical records (MH) | 1 1 1 1 1 | 5 | 11 | 11520 | 0.43
(13) Kappa | 1 1 1 | 3 | 11 | 8044 | 0.37
(14) Outcome and process assessment | 1 1 | 2 | 11 | 6705 | 0.30
(15) Care quality medical review* | 1 1 1 1 1 | 5 | 11 | 1787 | 0.28
(16) Care quality | 1 1 1 1 1 1 | 6 | 11 | 24495 | 0.24
(17) Medical review | 1 1 1 1 1 1 1 | 7 | 11 | 39132 | 0.18
(18) Measurement of care | 1 | 1 | 11 | 8927 | 0.11
(19) (patient or medical) and records | 1 1 | 2 | 11 | 21616 | 0.09
(20) Care | 1 1 1 1 1 1 1 1 1 1 1 | 11 | 11 | 233410 | 0.05
(21) Review* | 1 1 1 1 1 1 1 1 1 1 | 10 | 11 | 354651 | 0.03

The table is ordered by decreasing productivity. The column headed C (for Coverage) contains the number of Goldman papers found by the string in the first column. The column headed P contains the cumulative coverage of all strings up to and including the string in the first column.
* Wildcard search string. MH, MeSH term


    Statistical methods

In some papers, the assessments entailed the same combination of focus and style; in others, kappas from two or more focus/style combinations were reported. For example, the reliability of focus and style measurements might have been assessed for more than one clinical domain. For analytical purposes, all assessments from the same focus/style typology within a given paper were combined to form a cluster. A nested hierarchical ANOVA (analysis of variance [assessment within cluster within focus x style]) was conducted using MINITAB Release 14. The impact of prevalence was explored further by introducing prevalence as a covariate into the ANOVA for kappa.
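For readers who wish to reproduce this kind of analysis outside MINITAB, a broadly analogous model can be fitted by treating the within-paper cluster as a random effect and prevalence as a covariate. The sketch below is our own illustration on a hypothetical data frame, not the authors' code; the column names and simulated values are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical data set: one row per assessment, with its kappa value, its
# focus/style classification, the within-paper cluster it belongs to, and the
# prevalence of unsatisfactory ratings where this could be ascertained.
n = 40
data = pd.DataFrame({
    "focus": rng.choice(["adverse_event", "causality", "process"], size=n),
    "style": rng.choice(["explicit", "implicit_structured", "implicit_unstructured"], size=n),
    "cluster": rng.integers(0, 15, size=n).astype(str),
    "prevalence": rng.uniform(0.05, 0.50, size=n),
})
data["kappa"] = 0.30 + 0.40 * data["prevalence"] + rng.normal(0, 0.10, size=n)

# Fixed effects for focus, style and prevalence; a random intercept for cluster
# captures the similarity of assessments drawn from the same paper.
model = smf.mixedlm("kappa ~ C(focus) + C(style) + prevalence",
                    data, groups=data["cluster"]).fit()
print(model.summary())
```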

    Inter-rater reliability

Between them, the 26 papers reported kappa values from 66 separate assessments (Appendix A). A summary of the data, cross-classified by focus and style, is presented in Table 2. As hypothesized, concordance between raters appears to be greater for reviews guided by explicit criteria as compared with implicit reviews. The reliability is also higher for reviews that are focused towards outcome rather than process. These results are shown graphically in Figure 2. Just eight papers reported kappas from two or more focus/style combinations, and a total of 39 clusters were formed for the hierarchical analysis. A significant cluster effect was obtained (P = 0.034), with an ICC of 0.38, reflecting the degree of similarity between assessments within a paper. Significant results for both the effect of focus (P = 0.033) and style (P = 0.034) were found, after allowing for the cluster effect. The interaction effect (style x focus) was not significant (P = 0.973).

The following conclusions may be drawn:

kappa tends to be higher for explicit than for implicit reviews;

kappa tends to decline where greater concern for process is present in the assessment; and

these effects operate independently of one another.

In order to take into account the fact that differences in the frequency of events may account for systematic changes in kappa values, we examined the effect of event rates by examining each paper to ascertain the proportion of ratings classified as unsatisfactory; we refer to this as prevalence. Prevalence could be ascertained for 36 of the 66 assessments.

The overall correlation between prevalence and kappa was 0.44 (P < 0.008), thereby confirming the (expected) positive correlation between kappa value and prevalence. After adjusting for prevalence, the effects of Focus (P = 0.114) and Style (P = 0.112) were no longer significant. The directions of effect are, however, the same as before and these comparisons are of low statistical power, since prevalence was not recorded in all cases. Furthermore, if we combine the two categories of implicit review into one larger category, a significant difference between styles of review (P = 0.024) is found even after allowing for prevalence.

Table 2 Summary of kappa data classified by focus and style from 66 assessments in 26 papers

Style / measure | Adverse event | Causality | Process | Total

Explicit
No. of assessments (clusters) | 4 (3) | 1 (1) | 4 (3) | 9 (7)
Cases per assessment: median | 31 | 15 | 28 | 25
kappa: mean (SD) | 0.70 (0.13) | 0.64 (.) | 0.55 (0.21) | 0.62 (0.17)

Implicit, structured
No. of assessments (clusters) | 7 (5) | 9 (6) | 21 (9) | 37 (20)
Cases per assessment: median | 237 | 140 | 95 | 132
kappa: mean (SD) | 0.56 (0.16) | 0.39 (0.14) | 0.35 (0.19) | 0.40 (0.19)

Implicit, unstructured
No. of assessments (clusters) | 3 (2) | 7 (5) | 10 (5) | 20 (12)
Cases per assessment: median | 37 | 225 | 171 | 171
kappa: mean (SD) | 0.51 (0.27) | 0.40 (0.12) | 0.32 (0.18) | 0.38 (0.18)

Total
No. of assessments (clusters) | 14 (10) | 17 (12) | 35 (17) | 66 (39)
Cases per assessment: median | 166 | 140 | 89 | 108
kappa: mean (SD) | 0.59 (0.18) | 0.41 (0.14) | 0.37 (0.20) | 0.42 (0.20)

Each paper provides one or more assessments, each with a kappa value classified by style and focus. 'Cases' refers to the numbers of medical records reviewed in each assessment. Means and standard deviations (SDs) for kappa are calculated across assessments. Clusters were formed within the papers (average size of cluster 66/39 = 1.7) by pooling assessments with the same focus and style.

[Figure 2 is a chart: mean kappa (vertical axis, 0.0 to 1.0) plotted by Focus (Adverse Event, Causality, Process) for each Style (Explicit; Implicit (structured); Implicit (unstructured)).]

Figure 2 Mean kappa values and two standard error bars by focus and style


    Discussion

Mean values of kappa are higher in studies that evaluate adverse events rather than processes and that use explicit review rather than either type of implicit review. There are three broad reasons that could explain these findings. The first reason is that adverse events and explicit criteria are both more clearly defined and discernible than their alternatives. The second reason is that these are artefacts of changing prevalence, given the use of kappa as a measure of inter-rater agreement. The third reason could be that some other type of confounder is at work, for example, that adverse events are measured in one type of disease and process in another. The data-set we had was far too small to test for the effects of disease or clinical setting on reliability. However, in over half the cases it was possible to control for prevalence. After controlling for prevalence, the statistically significant associations between kappa and both style and focus disappeared. That said, the direction of effect remained unchanged and, in the case of style, the statistical association remained significant if semi-structured and unstructured methods were combined. Tentatively, we think it is reasonable to conclude that the associations that we observed (both for style and focus) are partially, but perhaps not totally, explained by the inherent nature of the tasks.

Intuitively, one might expect explicit review to yield higher reliability since it is based on the reviewer being guided in an overt way by a prescribed set of norms of good practice and then applying these norms in the evaluation of a set of case-notes. Implicit review, on the other hand, relies on the reviewer making judgements through the employment of relatively uncodified knowledge held in his or her mind and perhaps tailored to the circumstances of a specific case. The measurement process for explicit review does not include the development of the criteria for forming the algorithms. The criteria and algorithms that result are taken as fixed, and thus any imprecision from that step is removed from the estimate of reliability. The measurement procedure in implicit review requires the reviewers to form their criteria and apply them, and thus both sources of variability are included in the measurement of reliability. Although explicit methods appear to yield higher reliability than implicit alternatives, they have the disadvantage of missing elements not taken into account during the prescription of an explicit framework. A way around this might be to adopt a mixed strategy in which a reviewer addresses a set of designed explicit requirements, but is then invited to make any further implicit contribution that might be forthcoming.

The mean kappa values in our review are moderate to good, ranging from 0.32 (Implicit review of Process) to 0.70 (Explicit review of Adverse events). It is possible that all these results are somewhat inflated, since studies reporting low reliability may be submitted or accepted for publication less frequently than those with better results.

In Cohen's21 original work, kappa was introduced as a measure of agreement for use with nominal (i.e. unordered) categories. But it is now recognized that there are better alternatives to kappa for ordered categories,48 which include the tetrachoric and polychoric correlation coefficients.49 A major advantage of these measures is that they overcome the problem of sensitivity to prevalence. The association between kappa and the prevalence of error in our review suggests that further consideration be given to using tetrachoric and polychoric correlation coefficients in studies of inter-rater reliability, particularly in comparative studies across different clinical settings. In the meantime, it is worth noting that good care, with low error rates, is likely to be associated with lower inter-rater agreement, because of the correlation between kappa and prevalence.
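For those who want to experiment with the tetrachoric alternative, the sketch below is a minimal two-step maximum-likelihood estimator for a 2 x 2 rater table. It is our own illustrative code, not part of the review; dedicated implementations exist elsewhere (for example in R's polycor package), and the example counts are hypothetical.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def tetrachoric(table):
    """Two-step ML estimate of the tetrachoric correlation for a 2 x 2 table.

    table[i][j] = number of cases placed in category i by rater 1 and
    category j by rater 2 (category 0 = no error, 1 = error)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Thresholds of the latent bivariate normal, fixed from each rater's marginals
    tau1 = norm.ppf(table[0, :].sum() / n)   # P(rater 1 chooses category 0)
    tau2 = norm.ppf(table[:, 0].sum() / n)   # P(rater 2 chooses category 0)

    def neg_loglik(rho):
        bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
        p00 = bvn.cdf([tau1, tau2])          # both raters choose category 0
        p0_ = norm.cdf(tau1)                 # rater 1 chooses category 0
        p_0 = norm.cdf(tau2)                 # rater 2 chooses category 0
        probs = np.array([[p00, p0_ - p00],
                          [p_0 - p00, 1.0 - p0_ - p_0 + p00]])
        probs = np.clip(probs, 1e-12, 1.0)
        return -(table * np.log(probs)).sum()

    return minimize_scalar(neg_loglik, bounds=(-0.999, 0.999), method="bounded").x

# Hypothetical agreement table for two reviewers (counts are illustrative only)
print(round(tetrachoric([[85, 5], [5, 5]]), 2))
```

Because the thresholds absorb the marginal event rates, the estimated correlation is not driven by prevalence in the way kappa is.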

    Acknowledgements

This study was supported by an MRC Patient Safety Research Grant and the National Health Service R & D grant supporting the Co-ordinating Centre for Research Methodology.

    References

1. Lilford R, Mohammed MA, Spiegelhalter D, Thomson R. Use and misuse of process and outcome data in managing performance of acute medical care: avoiding institutional stigma. Lancet 2004;363:1147–54

2. Michel P, Quenon JL, de Sarasqueta AM, Scemama O. Comparison of three methods for estimating rates of adverse events and rates of preventable adverse events in acute care hospitals. BMJ 2004;328:199

3. Thomas EJ, Lipsitz SR, Studdert DM, Brennan TA. The reliability of medical record review for estimating adverse event rates. Ann Intern Med 2002;136:812–6

4. Brennan TA, Localio RJ, Laird NL. Reliability and validity of judgments concerning adverse events suffered by hospitalized patients. Med Care 1989;27:1148–58

5. Thomas EJ, Studdert DM, Burstin HR, et al. Incidence and types of adverse events and negligent care in Utah and Colorado. Med Care 2000;38:261–71

6. Wilson RM, Runciman WB, Gibberd RW, Harrison BT, Newby L, Hamilton JD. The Quality in Australian Health Care Study. Med J Aust 1995;163:458–71

7. Lilford RJ, Mohammed MA, Braunholtz D, Hofer TP. The measurement of active errors: methodological issues. Qual Saf Health Care 2003;12(Suppl 2):ii8–12

8. Goldman RL. The reliability of peer assessments of quality of care. JAMA 1992;267:958–60

9. Empire State Medical, Scientific and Educational Foundation, Inc. Rochester region perinatal study. Medical review project. N Y State J Med 1967;67:1205–10

10. Bigby J, Dunn J, Goldman L, et al. Assessing the preventability of emergency hospital admissions. A method for evaluating the quality of medical care in a primary care facility. Am J Med 1987;83:1031–6

11. Brook RH. Quality of Care Assessment: A Comparison of Five Methods of Peer Review. Washington, DC: US Department of Health, Education and Welfare, Public Health Service, Health Resources Administration, Bureau of Health Services Research and Evaluation, US Dept of Health, Education, and Welfare publication HRA-74-3100, 1973


12. Caplan RA, Posner KL, Cheney FW. Effect of outcome on physician judgments of appropriateness of care. JAMA 1991;265:1957–60

13. Dubois RW, Brook RH. Preventable deaths: who, how often, and why? Ann Intern Med 1988;109:582–9

14. Hastings GE, Sonneborn R, Lee GH, Vick L, Sasmor L. Peer review checklist: reproducibility and validity of a method for evaluating the quality of ambulatory care. Am J Public Health 1980;70:222–8

15. Horn SD, Pozen MW. An interpretation of implicit judgments in chart review. J Community Health 1977;2:251–8

16. Morehead MA, Donaldson RS, Sanderson S, Burt FE. A Study of the Quality of Hospital Care Secured by a Sample of Teamster Family Members in New York City. New York, NY: Columbia University School of Public Health and Administrative Medicine, 1964

17. Posner KL, Sampson PD, Caplan RA, Ward RJ, Cheney FW. Measuring interrater reliability among multiple raters: an example of methods for nominal data. Stat Med 1990;9:1103–15

18. Rosenfeld LS. Quality of medical care in hospitals. Am J Public Health 1957;47:856–65

19. Posner KL, Caplan RA, Cheney FW. Physician agreement in judging clinical performance. Anesthesiology 1991;75(Suppl 3A):A1058

20. Rubin HR, Rogers WH, Kahn KL, Rubenstein LV, Brook RH. Watching the doctor-watchers. How well do peer review organization methods detect hospital care quality problems? JAMA 1992;267:2349–54

21. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37–46

22. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas 1973;55:613–9

23. Shoukri MM. Measures of Interobserver Agreement. Boca Raton, FL: CRC Press, Chapman & Hall, 2003

24. Hayward RA, Hofer TP. Estimating hospital deaths due to medical errors: preventability is in the eye of the reviewer. JAMA 2001;286:415–20

25. Fleiss JL, Nee JCM, Landis JR. Large sample variance of kappa in the case of different sets of raters. Psychol Bull 1979;86:974–7

26. Hutchinson TP. Focus on psychometrics. Kappa muddles together two sources of disagreement: tetrachoric correlation is preferable. Res Nurs Health 1993;16:313–6

27. Tanner MA, Young MA. Modelling agreement among raters. J Am Stat Assoc 1985;80:175–80

28. Williamson JM, Manatunga AK. Assessing interrater agreement from dependent data. Biometrics 1997;53:707–14

29. Brennan TA, Leape LL, Laird NM, et al. Incidence of adverse events and negligence in hospitalized patients. Results of the Harvard Medical Practice Study I. N Engl J Med 1991;324:370–6

30. Takayanagi K, Koseki K, Aruga T. Preventable trauma deaths: evaluation by peer review and a guide for quality improvement. Emergency Medical Study Group for Quality. Clin Perform Qual Health Care 1998;6:163–7

31. Localio AR, Weaver SL, Landis JR, et al. Identifying adverse events caused by medical care: degree of physician agreement in a retrospective chart review. Ann Intern Med 1996;125:457–64

32. Rottman SJ, Schriger DL, Charlop G, Salas JH, Lee S. On-line medical control versus protocol-based prehospital care. Ann Emerg Med 1997;30:62–8

33. Dobscha SK, Gerrity MS, Corson K, Bahr A, Cuilwik NM. Measuring adherence to depression treatment guidelines in a VA primary care clinic. Gen Hosp Psychiatry 2003;25:230–7

34. Forbes SA, Duncan PW, Zimmerman MK. Review criteria for stroke rehabilitation outcomes. Arch Phys Med Rehabil 1997;78:1112–6

35. Hofer TP, Bernstein SJ, DeMonner S, Hayward RA. Discussion between reviewers does not improve reliability of peer review of hospital quality. Med Care 2000;38:152–61

36. Hofer TP, Asch SM, Hayward RA, et al. Measuring quality of care: is there a role for peer review? BMC Health Serv Res 2004. http://www.biomedcentral.com/1472-6963/4/9

37. Rubenstein LV, Kahn KL, Reinisch EJ, et al. Changes in quality of care for five diseases measured by implicit review, 1981 to 1986. JAMA 1990;264:1974–9

38. Ashton CM, Kuykendall DH, Johnson ML, Wray NP. An empirical assessment of the validity of explicit and implicit process-of-care criteria for quality assessment. Med Care 1999;37:798–808

39. Baker GR, Norton PG, Flintoft V, et al. The Canadian Adverse Events Study: the incidence of adverse events among hospital patients in Canada. CMAJ 2004;170:1678–86

40. Weingart SN, Davis RB, Palmer RH, et al. Discrepancies between explicit and implicit review: physician and nurse assessments of complications and quality. Health Serv Res 2002;37:483–98

41. Saliba D, Kington R, Buchanan J, et al. Appropriateness of the decision to transfer nursing facility residents to the hospital. J Am Geriatr Soc 2000;48:154–63

42. Lorenzo S, Lang T, Pastor R, et al. Reliability study of the European appropriateness evaluation protocol. Int J Qual Health Care 1999;11:419–24

43. Hayward RA, McMahon Jr LF, Bernard AM. Evaluating the care of general medicine inpatients: how good is implicit review? Ann Intern Med 1993;118:550–6

44. Bair AE, Panacek EA, Wisner DH, Bales R, Sakles JC. Cricothyrotomy: a 5-year experience at one institution. J Emerg Med 2003;24:151–6

45. Smith MA, Atherly AJ, Kane RL, Pacala JT. Peer review of the quality of care. Reliability and sources of variability for outcome and process assessments. JAMA 1997;278:1573–8

46. Camacho LA, Rubin HR. Reliability of medical audit in quality assessment of medical care. Cad Saude Publica 1996;12(Suppl 2):85–93

47. Pearson ML, Lee JL, Chang BL, Elliott M, Kahn KL, Rubenstein LV. Structured implicit review: a new method for monitoring nursing care quality. Med Care 2000;38:1074–91

48. Kraemer HC, Periyakoil VS, Noda A. Kappa coefficients in medical research. Stat Med 2002;21:2109–29

49. Agresti A. Modelling ordered categorical data: recent advances and future challenges. Stat Med 1999;18:2191–207

    Appendix A

Papers included in our analysis of the reliability of quality assurance, broken down by the individual assessment, i.e. comparisons of inter-rater reliability according to style, focus, prevalence and number of raters.


Author | Focus (P Ca AE), Style (E I T) | Design (R C S) | K | Prevalence | Notes

Dubois and Brook (1988) | X X | 3 2 105 | 0.4 | 0.15 | Three different kappas for three different diseases
| X X | 3 2 140 | 0.3 | 0.20 |
| X X | 3 2 132 | 0.2 | 0.25 |
Rosenfeld (1957) | X X | 2 4 105 | 0.325 | 0.543 | Three different pairs of raters in three different specialities
| | 70 | 0.326 | 0.387 |
| | 112 | 0.342 | 0.589 |
Hastings (1980) | X X | 2 3 10 | 0.35 | 0.14 |
Bigby (1987) | X X | 2 2 110 | 0.33 | 0.16 |
Posner (1991) | X X | 2 3 42 | 0.42 | - |
Brennan (1989) | X X | 4 2 225 | 0.57 | | Two kappas for causality using a high and a low cut-off
| X X | 4 2 225 | 0.34 | |
Rochester study (1967) | X X | 2 2 1258 | 0.28 | 0.16 | Two kappas, one for mothers and one for infants
| X X | 2 2 1198 | 0.18 | 0.12 |
Bair (2003) | X X | 3 2 25 | 0.87 | 0.32 |
Camacho and Rubin (1996) | X X | 2 5 423 | 0.11 | 0.028 | Paper reports five kappas but the fifth (Quality) kappa is not explained, so not included
| X X | 2 2 423 | 0.58 | 0.340 |
| X X | 2 2 423 | 0.40 | 0.222 |
| X X | 2 2 423 | 0.39 | 0.036 |
Michel (2004) | X X | 2 2 145 | 0.83 | | The P/O study was based on a 25% sample
| X X | 2 2 145 | 0.31 | |
Rubenstein (1990) | X X | 2 4 342 | 0.57 | 0.28 |
Wilson (1995) | X X | 2 2 2574 | 0.67 | 0.437 | Nurses first screened charts for adverse events
| X | 2 2 4207 | 0.55 | 0.20 | Doctors' evaluation of outcome
| X X | 2 3 4207 | 0.33 | 0.51 | Preventability
| X X | 2 3 4207 | 0.42 | | Causation
Ashton (1999) | X X | 2 2 37 | 0.13 | | Three different diseases
| X X | 2 2 16 | 0.55 | |
| X X | 2 2 55 | 0.23 | |
Hofer (2004) | X X | 3 6 56 | 0.46 | 0.58 | Four different diseases
| X X | 3 6 40 | 0.26 | 0.20 |
| X X | 3 6 37 | 0.46 | 0.55 |
| X X | 3 6 59 | 0.16 | 0.25 |
Pearson (2000) | X X | 2 2 89 | 0.43 | | Two kappas for two diseases
| X X | 2 2 85 | 0.49 | |
Saliba (2000) | X X | 2 2 100 | 0.68 | | First kappa for appropriateness of transfer to hospital. The results were repeated taking into account advance directives for this population in long-term care and thus produced very similar results
| X X | 2 2 100 | 0.78 | |
Smith (1997) | X X | 2 5 180 | 0.17 | | Process of care, acute/long-term care, frail older people


Smith (1997) | X X | 2 5 180 | 0.42 | | Outcome of care, acute/long-term care, frail older people
Weingart (2002) | X X | 2 37 | 0.70 | | Four kappas (2 nurse, 2 physician) for quality of care; four kappas (2 nurse, 2 physician) for complications of treatment
| X X | 2 37 | 0.22 | |
| X X | 2 37 | 0.62 | |
| X | 2 37 | 0.22 | |
| X X | 2 19 | 0.55 | |
| X X | 2 19 | 0.76 | |
| X X | 2 19 | 0.41 | |
| X X | 2 37 | 0.15 | |
Dobscha (2003) | X X | 2 27 60 | 0.81 | 0.51 |
Forbes (1997) | X X | 2 5 15 | 0.64 | |
Hayward & Hofer (2001) | X X | 2 5 62 | 0.20 | 0.23 | Kappa for single reviewer derived from that for average of two (0.34) using the Spearman–Brown formula
Hayward (1993) | X X | ?2 6 171 | 0.5 | 0.11 | Overall quality
| X X | ?2 5 34 | 0.5 | 0.09 | Death preventable
| X X | ?2 5 171 | 0.1 | 0.08 | Errors in following medical orders
| X | ?2 3 171 | 0.1 | 0.16 | Readiness for discharge
| X X | ?2 5 171 | 0.3 | 0.07 | Post-discharge follow-up
| X X | ?2 5 171 | 0.2 | - | Care for presenting problem
| X X | ?2 5 171 | 0.3 | 0.21 | Appropriate response to new information
| X X | ?2 5 171 | 0.3 | 0.14 | Adequacy documented
Hofer (2000) | X X | 2 5 95 | 0.46 | - | Whether laboratory result iatrogenic
| X X | 2 6 95 | 0.35 | | Overall quality
Lorenzo (1999) | X X | 2 2 19 | 0.61 | 0.37 | Appropriateness of admission
| X X | 2 2 31 | 0.58 | 0.42 | Appropriateness of care
Baker (2004) | X X | 2 2 375 | 0.70 | | Nurses and then doctors
| X X | 2 6 151 | 0.47 | | Prevalence of adverse events
| X X | 2 6 151 | 0.45 | | Causality
| X X | 2 6 151 | 0.69 | 0.42 | Preventability
Localio (1996) | X X | 2 6 237 | 0.50 | 0.18 | Kappa quoted only for cases reviewed by senior physicians. Raw data given for all 7533 cases in the study

Papers not included in analysis: Thomas et al.3; Takayanagi30; Rottman et al.32; Rubin et al.20; Brennan et al.29; Caplan et al.12; Posner et al.17.
Key: P = process; Ca = process from outcome; AE = outcome; E = explicit; I = implicit; T = structured implicit (semi-implicit); R = number of reviewers; C = classes (see text); ? = exact number of raters is unclear
