TRANSCRIPT
Calculating Kappa EXAMPLES & PRACTICE QUESTIONS
POPM*3240
In research and medicine, there are many examples where multiple individuals assess the same situation and then their results are compared. In research, this is common when a trait or characteristic is assessed visually or based on a somewhat subjective set of criteria. In care, this could be a simple situation of having two radiologists separately evaluate the same x-ray. As epidemiologists, we are interested in how similar (or dissimilar) these rankings or ratings might be. One approach to capturing this is to measure the "percent of agreement". However, this method does not account for the proportion of agreement due to chance alone; thus, it overestimates the true degree of agreement between raters. A better measure for this purpose is the "kappa statistic". Kappa is a statistic that measures agreement beyond what would be due to chance alone. The following is a brief self-directed tutorial to help you better understand how to set up and calculate a kappa statistic. The steps of this process are as follows:
a) Construct a 2x2 table for observed (dis)agreement
b) Calculate % observed agreement
c) Calculate % expected agreement (due to chance)
d) Enter values into final formula
The following formula will be provided to you on any examination:

AgreementExpected = [(a+b)*(a+c)/n + (c+d)*(b+d)/n] / n
Kappa = (AgreementObs – AgreementExp) / (1 – AgreementExp)

Example 1: A farmer is trying to decide which of 60 cattle to cull from her herd. She asks two veterinarians to independently come and assess each animal and make a recommendation to either "keep" or "cull". Both veterinarians agree to cull 18 of the animals and agree to keep 32 of the animals. The 1st veterinarian had recommended culling 6 animals that veterinarian #2 had said to keep. The 1st veterinarian had recommended keeping 4 animals that veterinarian #2 had said to cull. What was the kappa statistic for the two veterinarians' rankings?

Step a: Construct a 2x2 table for observed (dis)agreement

Using the information in the above example, set up a 2x2 table where each rater/ranker (in this case, each veterinarian) is placed at either the top or the side. Then the two options for the rating (in this case, "cull" or "keep") are listed under each rater. Read the information in the paragraph carefully to assign the appropriate values to the appropriate boxes in the 2x2 table.
                     Veterinarian #1
                     Cull        Keep
Veterinarian #2
    Cull              18           4
    Keep               6          32
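The formulas below refer to the four cells of this table as a, b, c and d: a is the upper-left cell (both say cull), b the upper-right, c the lower-left, and d the lower-right (both say keep), as you can see from the numbers substituted in step c. As a rough sketch in Python (not part of the original handout), the Example 1 counts could be entered like this:

    # Example 1 cell counts, read off the 2x2 table above
    # a = both cull, b = Vet #1 keep / Vet #2 cull,
    # c = Vet #1 cull / Vet #2 keep, d = both keep
    a, b, c, d = 18, 4, 6, 32
    n = a + b + c + d   # total number of animals assessed
    print(n)            # 60, matching the herd size given in the question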
Step b: Calculate % observed agreement

The percentage of observed agreement is a simple metric used to ascertain how often both raters/rankers made the same decision (in this example, either both say "cull" or both say "keep") out of all of the animals that were assessed.

Observed Agreement = (a + d) / n = (18 + 32) / 60 = 50 / 60 = 0.833 or 83.3%

Step c: Calculate % expected agreement (due to chance)

The percentage of expected agreement can seem complicated to calculate, but it is the easiest way of determining how much of the agreement is based on chance alone. If you recall the use of chi-square tests from previous statistics training, the calculation for expected agreement is related.

Expected Agreement = [(a+b)*(a+c)/n + (c+d)*(b+d)/n] / n
                   = [(18+4)*(18+6)/60 + (6+32)*(4+32)/60] / 60
                   = [(22)*(24)/60 + (38)*(36)/60] / 60
                   = [528/60 + 1368/60] / 60
                   = (8.8 + 22.8) / 60
                   = 31.6 / 60
                   = 0.527 or 52.7%

Step d: Enter values into final formula

Using the values of observed and expected agreement calculated in the previous two steps, we are now ready to calculate the actual kappa statistic.

Observed = 0.833 (from step "b")
Expected = 0.527 (from step "c")

Kappa = (AgreementObs – AgreementExp) / (1 – AgreementExp)
      = (0.833 – 0.527) / (1 – 0.527)
      = 0.306 / 0.473
      = 0.647
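For readers who like to check the arithmetic by machine, here is a minimal Python sketch (not part of the handout; the function name calculate_kappa is simply an illustrative choice) that reproduces steps b through d from the four cell counts:

    def calculate_kappa(a, b, c, d):
        """Return (observed, expected, kappa) agreement for a 2x2 table."""
        n = a + b + c + d
        observed = (a + d) / n                                          # step b
        expected = ((a + b) * (a + c) / n + (c + d) * (b + d) / n) / n  # step c
        kappa = (observed - expected) / (1 - expected)                  # step d
        return observed, expected, kappa

    obs, exp, k = calculate_kappa(18, 4, 6, 32)   # Example 1 counts
    print(round(obs, 3), round(exp, 3), round(k, 3))
    # 0.833 0.527 0.648 -- the handout's 0.647 reflects rounding the
    # intermediate values before the final division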
It is always important to interpret the kappa statistic based on the following guidelines:
< 0.2        slight agreement
0.2 – 0.4    fair agreement
0.4 – 0.6    moderate agreement
0.6 – 0.8    substantial agreement
> 0.8        excellent agreement
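If you wanted to automate the interpretation step, a minimal sketch based on the cut-offs above might look like the following (how to treat a kappa that lands exactly on a boundary is a judgment call; here a boundary value falls into the lower category):

    def interpret_kappa(kappa):
        # Map a kappa value to the agreement categories listed above.
        if kappa < 0.2:
            return "slight agreement"
        elif kappa <= 0.4:
            return "fair agreement"
        elif kappa <= 0.6:
            return "moderate agreement"
        elif kappa <= 0.8:
            return "substantial agreement"
        else:
            return "excellent agreement"

    print(interpret_kappa(0.647))   # substantial agreement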
While you don't need to memorize these categories, it is important to understand that the greater the kappa, the stronger the indication of agreement between rankers/raters (excluding the role of chance). Therefore, based on the above finding of kappa = 0.647, we would say that the two veterinarians had substantial agreement.

Example 2: A new test (which we shall call "HPV-NEW") is available that detects human papillomavirus type 16. The current test, "HPV-OLD", is considered the "gold standard". The new test is considerably cheaper than the old test and provides results in a much shorter time. A physician is interested in determining how well this new test compares with the gold standard old test, so she gives both tests to 200 of her patients: 67 test negative on both, 113 test positive on both, 12 test positive only on HPV-NEW, and 8 test positive only on HPV-OLD. How well do these two tests agree?

Step a: Construct a 2x2 table for observed (dis)agreement

Using the information in the above example, set up a 2x2 table where each test (in this case, the tests for HPV) is placed at either the top or the side. Then the two options for the test result (in this case, HPV+ or HPV-) are listed under each test. Read the information in the paragraph carefully to assign the appropriate values to the appropriate boxes in the 2x2 table.
                     HPV-OLD
                     HPV+        HPV-
HPV-NEW
    HPV+              113          12
    HPV-                8          67
Step b: Calculate % observed agreement

The percentage of observed agreement is a simple metric used to ascertain how often both tests gave the same result (in this example, either both come back HPV+ or both come back HPV-) out of all of the individuals tested.
Observed Agreement = (a + d) / n = (113 + 67) / 200 = 180 / 200 = 0.90 or 90%

Step c: Calculate % expected agreement (due to chance)

The percentage of expected agreement can seem complicated to calculate, but it is the easiest way of determining how much of the agreement is based on chance alone. If you recall the use of chi-square tests from previous statistics training, the calculation for expected agreement is related.

Expected Agreement = [(a+b)*(a+c)/n + (c+d)*(b+d)/n] / n
                   = [(113+8)*(113+12)/200 + (12+67)*(8+67)/200] / 200
                   = [(121)*(125)/200 + (79)*(75)/200] / 200
                   = [15125/200 + 5925/200] / 200
                   = (75.625 + 29.625) / 200
                   = 105.25 / 200
                   = 0.526 or 52.6%

NB: I swear it's a coincidence these values are so similar between the two examples.

Step d: Enter values into final formula

Using the values of observed and expected agreement calculated in the previous two steps, we are now ready to calculate the actual kappa statistic.
Observed = 0.90 (from step "b")
Expected = 0.526 (from step "c")
Kappa = (AgreementObs – AgreementExp) / (1 – AgreementExp)
      = (0.9 – 0.526) / (1 – 0.526)
      = 0.374 / 0.474
      = 0.789

Therefore, based on the above finding of kappa = 0.789, we would say that the two tests had substantial agreement.

A further question we might ask is whether you think the physician should stick with the old test or switch to the new test, and to provide a rationale. What do you think? Why? A simple answer is that the tests have substantial agreement, and so it might be reasonable to opt for using the HPV-NEW test in lieu of HPV-OLD, as it is cheaper and quicker. The benefit of cost savings and more rapid results for the patient could outweigh potential changes to the sensitivity and specificity of HPV-NEW, or its predictive values (you could use the same 2x2 table constructed earlier in this example to calculate all of those values as well; these values could be used to rationalize a decision regarding which test to use if interpreted within the context of the disease, in this case HPV).
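To make that last point concrete, here is a small Python sketch (again, an illustration rather than part of the handout) that reproduces the Example 2 kappa and also reads sensitivity, specificity and the predictive values off the same table, taking HPV-OLD as the gold standard:

    # Example 2 cell counts (rows = HPV-NEW, columns = HPV-OLD)
    a, b, c, d = 113, 12, 8, 67
    n = a + b + c + d                                               # 200 patients

    observed = (a + d) / n                                          # 0.90
    expected = ((a + b) * (a + c) / n + (c + d) * (b + d) / n) / n  # ~0.526
    kappa = (observed - expected) / (1 - expected)                  # ~0.789

    # Other test-evaluation measures, treating HPV-OLD as the gold standard:
    sensitivity = a / (a + c)   # 113/121 ~ 0.934
    specificity = d / (b + d)   # 67/79  ~ 0.848
    ppv = a / (a + b)           # 113/125 = 0.904
    npv = d / (c + d)           # 67/75  ~ 0.893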
ADDITIONAL PRACTICE QUESTIONS
a) A Canadian Social Deprivation Index (SDI) was developed in 2006 to categorize neighbourhoods based on social and material inequality. This has been considered the "gold standard" of the field ever since. A new Equity Index (EI) is being proposed based on questions in the 2011 census and National Household Survey. The regional public health unit is interested in determining how well the two indices relate to each other, and whether the new EI is substantially different from the previous SDI. The public health office uses the same data to calculate both indices to indicate whether a neighbourhood has "low inequality" or "high inequality". The SDI and EI both rate 44 neighbourhoods as having low inequality, and 22 neighbourhoods as having high inequality. The SDI categorized another 22 neighbourhoods as high inequality that the EI found to be "low", while the EI ranked 4 neighbourhoods as high for inequality that the SDI rated as "low". What is the measure of agreement between the two indices? What would be the positive and negative predictive values of the EI if the SDI was used as a gold standard?
b) The superintendent of the local school board is seeking to see how well the two assessors for the board agree with regard to classifying pupils as "gifted" within the school system. Being denoted gifted opens up educational enrichment opportunities for the child, but also costs the school board additional funds to run these programs. During the 2012-2013 school year, the superintendent has both assessors work with each candidate for the gifted program to independently provide a decision on whether they should be "admitted" or "not". The first assessor agrees with the second assessor in admitting 65 students to the program, but felt a further 15 should also be accepted. The second assessor thought a different 10 students should also be accepted. Both assessors agreed that 210 students should not be admitted. How congruent are the two assessors' decisions?
c) The Canadian Food Inspection Agency is trying to determine how well food inspectors agree when using an abbreviated questionnaire regarding "Safe Food Handling" (SFH-short), which was developed from the original gold-standard Safe Food Handling long form (SFH-long). The abbreviated questionnaire takes considerably less time to administer, but the Director of CFIA wants to be convinced that there won't be great disparities between inspectors' adjudications. In order to build some evidence, you have two food inspectors separately use the questionnaire to rate 40 food establishments in Wellington County. The two inspectors agree on 32 of the food establishments: 20 of which are deemed "unsafe" and 12 of which are deemed "safe". The first inspector also deemed the remaining establishments safe, while the second inspector assessed them to be unsafe. How well do the inspectors agree in using the SFH-short? How would you advise the CFIA Director?
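Once you have worked each question by hand, one way to check yourself is to plug your own cell counts into the same calculation. As an illustration only, the counts below are one reading of practice question (a), with the SDI (the gold standard) across the top and the EI down the side; build your own table first and compare before running it:

    # One reading of practice question (a) -- verify against your own 2x2 table.
    a = 22   # both indices rate the neighbourhood "high inequality"
    b = 4    # EI "high", SDI "low"
    c = 22   # EI "low", SDI "high"
    d = 44   # both indices rate it "low inequality"
    n = a + b + c + d

    observed = (a + d) / n
    expected = ((a + b) * (a + c) / n + (c + d) * (b + d) / n) / n
    kappa = (observed - expected) / (1 - expected)
    print(round(kappa, 3))

    # The predictive values of the EI (with SDI as the gold standard)
    # come from the same cells:
    ppv = a / (a + b)
    npv = d / (c + d)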