
Calculating Kappa: EXAMPLES & PRACTICE QUESTIONS

POPM*3240

In research and medicine, there are many situations where multiple individuals assess the same thing and their results are then compared. In research, this is common when a trait or characteristic is assessed visually or against a somewhat subjective set of criteria. In clinical care, it could be as simple as having two radiologists separately evaluate the same x-ray. As epidemiologists, we are interested in how similar (or dissimilar) these rankings or ratings are. One approach to capturing this is to measure the "percent of agreement". However, this method does not account for the proportion of agreement expected due to chance alone; as a result, it overestimates the true degree of agreement between raters. A better measure is the "kappa statistic". Kappa measures agreement beyond what would be expected by chance alone.

The following is a brief self-directed tutorial to help you better understand how to set up and calculate a kappa statistic. The steps of the process are as follows:

a) Construct a 2x2 table for observed (dis)agreement
b) Calculate % observed agreement
c) Calculate % expected agreement (due to chance)
d) Enter values into final formula

The following formulas will be provided to you on any examination:

AgreementExpected = [(a+b)*(a+c)/n + (c+d)*(b+d)/n] / n
Kappa = (AgreementObs - AgreementExp) / (1 - AgreementExp)

Example 1:

A farmer is trying to decide which of 60 cattle to cull from her herd. She asks two veterinarians to independently assess each animal and recommend either "keep" or "cull". Both veterinarians agree to cull 18 of the animals and agree to keep 32 of the animals. The 1st veterinarian recommended culling 6 animals that veterinarian #2 said to keep. The 1st veterinarian recommended keeping 4 animals that veterinarian #2 said to cull. What was the kappa statistic for the two veterinarians' recommendations?

Step a: Construct a 2x2 table for observed (dis)agreement

Using the information in the example above, set up a 2x2 table with one rater (in this case, one of the veterinarians) along the top and the other along the side. The two rating options (in this case, cull or keep) form the rows and columns. Read the paragraph carefully to assign the appropriate values to the appropriate cells of the 2x2 table.


                         Veterinarian #1
                         Cull      Keep
Veterinarian #2   Cull    18         4
                  Keep     6        32
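If you find it helpful to see the bookkeeping written out, here is a minimal Python sketch of Step a that simply records the four cell counts using the conventional labels a, b, c, and d (a and d are the cells where the two raters agree); the labels and layout below are one common convention rather than the only correct one.

    # Example 1, Step a: record the 2x2 table of observed (dis)agreement.
    # Rows correspond to Veterinarian #2, columns to Veterinarian #1.

    a = 18  # both veterinarians say "cull"
    b = 4   # veterinarian #2 says "cull", veterinarian #1 says "keep"
    c = 6   # veterinarian #2 says "keep", veterinarian #1 says "cull"
    d = 32  # both veterinarians say "keep"

    n = a + b + c + d
    print(n)  # 60, matching the size of the herd being assessed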

Step b: Calculate % observed agreement

The percentage of observed agreement is a simple metric used to ascertain how often both raters made the same decision (in this example, either both say "cull" or both say "keep") out of all of the animals assessed.

Observed Agreement = (a + d) / n = (18 + 32) / 60 = 50 / 60 = 0.833 or 83.3%

Step c: Calculate % expected agreement (due to chance)

The percentage of expected agreement can seem complicated to calculate, but it is the most straightforward way of determining how much of the agreement would occur by chance alone. If you recall the use of chi-square tests from previous statistics training, the calculation of expected agreement is closely related.

Expected Agreement = [(a+b)*(a+c)/n + (c+d)*(b+d)/n] / n
= [(18+4)*(18+6)/60 + (6+32)*(4+32)/60] / 60
= [(22)*(24)/60 + (38)*(36)/60] / 60
= [528/60 + 1368/60] / 60
= (8.8 + 22.8) / 60
= 31.6 / 60
= 0.527 or 52.7%

Step d: Enter values into final formula

Using the values of observed and expected agreement calculated in the previous two steps, we are now ready to calculate the kappa statistic itself.

Observed = 0.833 (from step "b")
Expected = 0.527 (from step "c")

Kappa = (AgreementObs - AgreementExp) / (1 - AgreementExp)
= (0.833 - 0.527) / (1 - 0.527)
= 0.306 / 0.473
= 0.647
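To tie steps b through d together, the following short Python sketch reproduces the Example 1 arithmetic using the same formulas given above (the function name calculate_kappa is simply an illustrative choice).

    def calculate_kappa(a, b, c, d):
        """Kappa from a 2x2 agreement table, using the formulas above.

        a and d are the cells where the raters agree; b and c are the
        cells where they disagree.
        """
        n = a + b + c + d
        observed = (a + d) / n
        expected = ((a + b) * (a + c) / n + (c + d) * (b + d) / n) / n
        kappa = (observed - expected) / (1 - expected)
        return observed, expected, kappa

    # Example 1: two veterinarians rating 60 cattle as "cull" or "keep".
    obs, exp, k = calculate_kappa(a=18, b=4, c=6, d=32)
    print(round(obs, 3), round(exp, 3), round(k, 3))
    # 0.833 0.527 0.648 (the worked example rounds intermediate values
    # and reports kappa = 0.647; the small difference is only rounding)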

It is always important to interpret the kappa statistic based on the following guidelines:

< 0.2        slight agreement
0.2 - 0.4    fair agreement
0.4 - 0.6    moderate agreement
0.6 - 0.8    substantial agreement
> 0.8        excellent agreement
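If it helps to see these bands written out explicitly, here is a small Python helper; note that how the boundary values themselves (e.g. exactly 0.4) are classified is an assumption, since the guidelines above only give the ranges.

    def interpret_kappa(kappa):
        """Map a kappa value to the agreement bands listed above.

        Which band a boundary value (e.g. exactly 0.4) falls into is an
        assumption; the guidelines only give the ranges.
        """
        if kappa < 0.2:
            return "slight agreement"
        elif kappa < 0.4:
            return "fair agreement"
        elif kappa < 0.6:
            return "moderate agreement"
        elif kappa < 0.8:
            return "substantial agreement"
        else:
            return "excellent agreement"

    print(interpret_kappa(0.647))  # "substantial agreement", as in Example 1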

While you don't need to memorize these categories, it is important to understand that the greater the kappa, the stronger the indication of agreement between raters beyond the role of chance. Therefore, based on the above finding of kappa = 0.647, we would say that the two veterinarians had substantial agreement.

Example 2:

A new test (which we shall call "HPV-NEW") is available that detects human papillomavirus type 16. The current test, "HPV-OLD", is considered the "gold standard". The new test is considerably cheaper than the old test and provides results in a much shorter time. A physician is interested in determining how well this new test compares with the gold standard old test, so she gives both tests to 200 of her patients: 67 test negative on both, 113 test positive on both, 12 test positive only on HPV-NEW, and 8 test positive only on HPV-OLD. How well do these two tests agree?

Step a: Construct a 2x2 table for observed (dis)agreement

Using the information in the example above, set up a 2x2 table with one test (in this case, one of the HPV tests) along the top and the other along the side. The two possible results (in this case, HPV+ or HPV-) form the rows and columns. Read the paragraph carefully to assign the appropriate values to the appropriate cells of the 2x2 table.

                      HPV-OLD
                      HPV+      HPV-
HPV-NEW   HPV+         113        12
          HPV-           8        67

Step b: Calculate % observed agreement

The percentage of observed agreement is a simple metric used to ascertain how often both tests returned the same result (in this example, either both come back HPV+ or both come back HPV-) out of all of the individuals tested.

Observed Agreement = (a + d) / n = (113 + 67) / 200 = 180 / 200 = 0.90 or 90%

Step c: Calculate % expected agreement (due to chance)

The percentage of expected agreement can seem complicated to calculate, but it is the most straightforward way of determining how much of the agreement would occur by chance alone. If you recall the use of chi-square tests from previous statistics training, the calculation of expected agreement is closely related.

Expected Agreement = [(a+b)*(a+c)/n + (c+d)*(b+d)/n] / n
= [(113+12)*(113+8)/200 + (8+67)*(12+67)/200] / 200
= [(125)*(121)/200 + (75)*(79)/200] / 200
= [15125/200 + 5925/200] / 200
= (75.625 + 29.625) / 200
= 105.25 / 200
= 0.526 or 52.6%

NB: I swear it's a coincidence these values are so similar between the two examples.

Step d: Enter values into final formula

Using the values of observed and expected agreement calculated in the previous two steps, we are now ready to calculate the kappa statistic itself.

Observed = 0.90 (from step "b")
Expected = 0.526 (from step "c")

Kappa = (AgreementObs - AgreementExp) / (1 - AgreementExp)
= (0.9 - 0.526) / (1 - 0.526)
= 0.374 / 0.474
= 0.789

Therefore, based on the above finding of kappa = 0.789, we would say that the two tests had substantial agreement.

A further question we might ask is whether you think the physician should stick with the old test or switch to the new test, and to provide a rationale. What do you think? Why? A simple answer is that the tests have substantial agreement, so it might be reasonable to use the HPV-NEW test in lieu of HPV-OLD, as it is cheaper and quicker. The benefits of cost savings and more rapid results for the patient could outweigh potential changes to the sensitivity and specificity of HPV-NEW, or its predictive values. (You could use the same 2x2 table constructed earlier in this example to calculate all of those values as well; interpreted within the context of the disease, in this case HPV, they could be used to justify a decision about which test to use.)
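To illustrate that last point, the sketch below reuses the Example 2 counts to compute kappa along with the sensitivity, specificity, and predictive values of HPV-NEW, treating HPV-OLD as the gold standard as the example does; it is one way to organize the calculation, not a required method.

    # Example 2 cell counts (rows = HPV-NEW, columns = HPV-OLD).
    a = 113  # positive on both tests
    b = 12   # positive on HPV-NEW only
    c = 8    # positive on HPV-OLD only
    d = 67   # negative on both tests
    n = a + b + c + d  # 200 patients

    observed = (a + d) / n
    expected = ((a + b) * (a + c) / n + (c + d) * (b + d) / n) / n
    kappa = (observed - expected) / (1 - expected)
    print(round(kappa, 3))  # 0.789, matching step d above

    # Treating HPV-OLD as the gold standard, HPV-NEW's performance is:
    sensitivity = a / (a + c)  # 113 / 121, about 0.93
    specificity = d / (b + d)  # 67 / 79, about 0.85
    ppv = a / (a + b)          # 113 / 125 = 0.904
    npv = d / (c + d)          # 67 / 75, about 0.89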

ADDITIONAL PRACTICE QUESTIONS

a) A Canadian Social Deprivation Index (SDI) was developed in 2006 to categorize neighbourhoods based on social and material inequality. It has been considered the "gold standard" of the field ever since. A new Equity Index (EI) is being proposed based on questions in the 2011 census and National Household Survey. The regional public health unit is interested in determining how well the two indices relate to each other, and whether the new EI is substantially different from the previous SDI. The public health office calculates both indices from the same data to indicate whether a neighbourhood has "low inequality" or "high inequality". The SDI and EI both rate 44 neighbourhoods as having low inequality and 22 neighbourhoods as having high inequality. The SDI categorized another 22 neighbourhoods as high inequality that the EI found to be "low", while the EI ranked 4 neighbourhoods as high for inequality that the SDI rated as "low". What is the measure of agreement between the two indices? What would be the positive and negative predictive values of the EI if the SDI were used as the gold standard?

b) The superintendent of the local school board wants to see how well the board's two assessors agree when classifying pupils as "gifted" within the school system. Being designated gifted opens up educational enrichment opportunities for the child, but also costs the school board additional funds to run these programs. During the 2012-2013 school year, the superintendent has both assessors work with each candidate for the gifted program and independently decide whether the candidate should be "admitted" or "not". The first assessor agrees with the second assessor in admitting 65 students to the program, but felt a further 15 should also be accepted. The second assessor thought a different 10 students should also be accepted. Both assessors agreed that 210 students should not be admitted. How congruent are the two assessors' decisions?

c) The Canadian Food Inspection Agency is trying to determine how well food inspectors can use an abbreviated questionnaire regarding "Safe Food Handling" (SFH-short), which was developed from the original gold-standard Safe Food Handling long form (SFH-long). The abbreviated questionnaire takes considerably less time to administer, but the Director of the CFIA wants to be convinced that there won't be great disparities between inspectors' adjudications. In order to build some evidence, you have two food inspectors separately use the questionnaire to rate 40 food establishments in Wellington County. The two inspectors agree on 32 of the food establishments: 20 of which are deemed "unsafe" and 12 of which are deemed "safe". The first inspector deemed the remaining establishments safe, while the second inspector assessed them to be unsafe. How well do the inspectors agree in using the SFH-short? How would you advise the CFIA Director?
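If you would like to check your work, the same arithmetic can be run in code. The short Python sketch below does this for question (c) only; the table orientation used (inspector #2 in the rows, inspector #1 in the columns) is just one reasonable choice, and you should try setting up the 2x2 table yourself before looking at the cell values.

    # Practice question (c): two inspectors rating 40 establishments.
    a = 20  # both inspectors deem the establishment "unsafe"
    b = 8   # inspector #1 says "safe", inspector #2 says "unsafe"
    c = 0   # inspector #1 says "unsafe", inspector #2 says "safe"
    d = 12  # both inspectors deem the establishment "safe"
    n = a + b + c + d  # 40 establishments

    observed = (a + d) / n
    expected = ((a + b) * (a + c) / n + (c + d) * (b + d) / n) / n
    kappa = (observed - expected) / (1 - expected)
    print(round(observed, 3), round(expected, 3), round(kappa, 3))  # 0.8 0.5 0.6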
