Why are biologists terrified? · 2010. 4. 6.


  • Why are biologists terrified?

  • Hypothesis: All swans are white

    Observe: White swans

    CANNOT conclude that H is true

  • Hypothesis: All swans are white

    Observe: A black swan

    CAN conclude H is FALSE

  • falsificationism doesn’t work

    • Nice in principle, but only works for LOGICAL hypotheses, not for PROBABILISTIC hypotheses.

    Hypothesis: Most swans are white

    Observe: A black swan

    Conclude what?

  • our goal

    • We want to be able to compare the predictive accuracy of different models.

    • Hypotheses take the form of different functions and combinations of variables.

    • How to compare them?

  • our goal

    • How to compare them?

    • Several common ways:

      • p-values and null hypothesis tests

      • stepwise procedures

      • information criteria

  • comparing models by using p-values is bad

    • Common (bad) approach:

      • Fit a single model containing all variables you think might matter

      • Conclude that those variables with “significant” effects matter

      • Conclude those without “significant” effects do not matter

  • how people use p

    • Most people perform a simple ritual: the null hypothesis significance test (NHST).

    • (1) Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.

    • (2) Use 5% as a convention for rejecting the null. If rejected, accept your research hypothesis.

    • (3) Always perform this procedure.

  • NHST (null hypothesis significance testing)

    • what is a “p-value”?

    • what “p” is not

    • how people use p-values

    • problems with using p-values

    • attempts to defend p-values

    • so what instead?

  • p-values

    • What is a p-value?

    Pr(estimate-or-more-extreme-estimate | true-value = 0)

    Pr(observation-or-more-extreme-observation | true-expectation = 0)

    [Figure: density (y-axis, 0.00 to 0.12) against estimate (x-axis, -15 to 15)]

  • p-values

    • “Probability of obtaining this data or more extreme data, given that the null hypothesis is true.”

    p ≡ Pr(data | hypothesis)

  • example

    • Flip a coin 10 times. Observe 3 heads.

    [Figure: likelihood | prob = 0.5 (y-axis, 0.00 to 0.20) against heads observed (x-axis, 0 to 10)]

  • example

    • What is the likelihood of 3 or fewer heads, assuming an unbiased coin?

    [Figure: likelihood | prob = 0.5 (y-axis, 0.00 to 0.20) against heads observed (x-axis, 0 to 10)]
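
    A minimal sketch of the calculation this slide asks for, assuming a binomial model with prob = 0.5. The names `binom_pmf` and `lower_tail` are illustrative and do not come from the slides. It keeps the likelihood of exactly 3 heads separate from the tail sum over 3 or fewer heads, since only the tail sum is the p-value-style quantity.

```python
from math import comb

def binom_pmf(k, n, prob):
    """Probability of exactly k heads in n tosses."""
    return comb(n, k) * prob**k * (1 - prob)**(n - k)

def lower_tail(k, n, prob):
    """Probability of k or fewer heads in n tosses."""
    return sum(binom_pmf(i, n, prob) for i in range(k + 1))

# Likelihood of the observed count itself.
exactly_3 = binom_pmf(3, 10, 0.5)
# One-sided tail: the observed count plus every more extreme count.
three_or_fewer = lower_tail(3, 10, 0.5)

print(f"Pr(exactly 3 heads | prob=0.5)   = {exactly_3:.4f}")
print(f"Pr(3 or fewer heads | prob=0.5) = {three_or_fewer:.4f}")
```

    Most of the tail sum's excess over the single likelihood comes from outcomes (0, 1, or 2 heads) that were never observed.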

  • example

    • For 20 coin tosses and 6 observed heads:

    [Figure: likelihood | prob = 0.5 (y-axis, 0.00 to 0.10) against heads observed (x-axis, 0 to 20)]

  • example

    • For 40 coin tosses and 13 observed heads:

    [Figure: likelihood | prob = 0.5 (y-axis, 0.00 to 0.12) against heads observed (x-axis, 0 to 40)]

  • [Figure: three plots of likelihood | prob = 0.5 against heads observed, with x-axes matching the 10-, 20-, and 40-toss examples, shown together]
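
    The three coin examples can be compared numerically. This is a sketch under the same binomial assumption (prob = 0.5), using a one-sided lower tail; the helper name `lower_tail` is illustrative. The observed proportion of heads is roughly 0.3 in every case, and only the number of tosses changes.

```python
from math import comb

def lower_tail(k, n, prob=0.5):
    """Probability of k or fewer heads in n tosses."""
    return sum(comb(n, i) * prob**i * (1 - prob)**(n - i) for i in range(k + 1))

# (tosses, heads observed) taken from the three example slides.
examples = [(10, 3), (20, 6), (40, 13)]
tails = {n: lower_tail(k, n) for n, k in examples}

for n, k in examples:
    print(f"{k}/{n} heads (proportion {k / n:.3f}): "
          f"Pr({k} or fewer | prob=0.5) = {tails[n]:.4f}")
```

    The tail probability shrinks as the number of tosses grows even though the estimated proportion barely moves, so the tail tracks the amount of data at least as much as it tracks the size of the bias.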

  • example

    • For parameter estimates (like betas), the p-value is about the estimate, not the data directly.

    [Figure: likelihood (y-axis, 0.00 to 0.12) against estimate (x-axis, -15 to 15), with the null (β = 0) and the mle (β̂) marked]

  • example

    [Figure: likelihood (y-axis, 0.00 to 0.20) against estimate (x-axis, -15 to 15), with the null (β = 0) and the mle (β̂) marked; annotations: “p” and “not the same as p”]

  • what p is not

    Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say, 20 subjects in each sample). The observed difference between the means of the groups is 12.7. Furthermore, suppose you use a simple independent means t-test and your result is significant (p = .01). Please mark each of the statements below as “true” or “false.”

  • what p is not

    You have absolutely disproved the null hypothesis (i.e., that there is no difference between the population means).

    • FALSE. Probabilities are statements of uncertainty, and cannot prove or disprove anything.

  • what p is not

    You have found the probability of the null hypothesis being true.

    • FALSE. p is Pr(D|H), not Pr(H|D). We cannot invert the probability just because we wish we could.
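
    To see why Pr(D|H) cannot simply be inverted, here is a small sketch of Bayes' rule for two competing hypotheses. Every number in it is hypothetical: the slides give no prior and no Pr(D|H1), and the sketch treats Pr(D|H) loosely as a single probability, as the slides do. The function name `posterior_null` is illustrative.

```python
def posterior_null(pr_d_given_h0, pr_d_given_h1, prior_h0):
    """Pr(H0 | D) by Bayes' rule, assuming H0 and H1 are the only candidates."""
    numerator = pr_d_given_h0 * prior_h0
    return numerator / (numerator + pr_d_given_h1 * (1 - prior_h0))

# Hypothetical inputs: Pr(D|H0) is fixed at 0.01, Pr(D|H1) at 0.10.
# Only the prior probability of H0 varies.
for prior in (0.1, 0.5, 0.9):
    post = posterior_null(0.01, 0.10, prior)
    print(f"prior Pr(H0) = {prior:.1f}  ->  Pr(H0 | D) = {post:.3f}")
```

    The same Pr(D|H0) = 0.01 yields very different values of Pr(H0|D) depending on the prior and on how well the alternative predicts the data, neither of which a p-value contains.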

  • what p is not

    You have absolutely proved your experimental hypothesis (that there is a difference between the population means).

    • FALSE. p is a probability, and therefore it cannot prove anything.

  • what p is not

    You can deduce the probability of the experimental hypothesis being true.

    • FALSE. p provides no information about the experimental hypothesis, only the null hypothesis.

  • what p is not

    You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

    • FALSE. You want the probability of the hypothesis being true, but you calculated Pr(D|H), not Pr(H|D). You cannot calculate the probability the hypothesis is true or false.

  • what p is not

    You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

    • FALSE. p is Pr(D|H), remember: it is calculated assuming the null is true. How often a repeat of the experiment comes out significant depends on which hypothesis is actually true and on the size of the effect, and we don’t know either.
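
    As a rough illustration of the replication statement, the sketch below asks how often a same-sized repeat would reach two-sided p < 0.05 under one strong assumption: that the true effect is exactly the size that produced p = .01. It uses a normal approximation rather than the t-test from the scenario, and the name `replication_rate` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()

def replication_rate(p_observed, alpha=0.05):
    """Chance that a same-sized replication gives two-sided p < alpha,
    ASSUMING the true effect equals the observed estimate (normal approximation)."""
    z_observed = std_normal.inv_cdf(1 - p_observed / 2)
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    # Replication z-scores are centred on z_observed with unit variance;
    # count both rejection regions.
    upper = std_normal.cdf(z_observed - z_critical)
    lower = std_normal.cdf(-z_observed - z_critical)
    return upper + lower

rate = replication_rate(0.01)
print(f"replications with p < 0.05: {rate:.3f}")
```

    Under this assumption the rate is about 0.73, well short of 99%, and it would be lower still if the true effect were smaller than the estimate.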

  • what p is not

    ...The observed difference between the means of the groups is 12.7. Furthermore, suppose you use a simple independent means t-test and your result is NOT significant (p = .06).

  • what p is not

    You can conclude that there is no real difference between the means of the two groups.

    • FALSE. The maximum likelihood estimate of the difference in means is 12.7, and this is true whether or not p < 0.05. Information about the size of the effect and confidence interval of the effect is not the same as p.
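
    The estimate and p-value in this scenario are enough to back out an approximate confidence interval. The sketch below assumes a normal approximation in place of the t distribution (with 20 subjects per group the t-based interval would be slightly wider), and the name `interval_from_p` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()

def interval_from_p(estimate, p_value, level=0.95):
    """Approximate confidence interval recovered from an estimate and its
    two-sided p-value against zero (normal approximation)."""
    z_observed = std_normal.inv_cdf(1 - p_value / 2)
    standard_error = abs(estimate) / z_observed
    half_width = std_normal.inv_cdf(0.5 + level / 2) * standard_error
    return estimate - half_width, estimate + half_width

low, high = interval_from_p(12.7, 0.06)
print(f"difference in means: 12.7, approx. 95% interval: ({low:.2f}, {high:.2f})")
```

    The interval runs from just below zero to roughly 26, which is a long way from evidence of “no real difference.”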

  • how people use p

    • Most people perform a simple ritual: the null hypothesis significance test (NHST).

    • (1) Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.

    • (2) Use 5% as a convention for rejecting the null. If rejected, accept your research hypothesis.

    • (3) Always perform this procedure.

  • what we want to do

    • Which of several potentially useful models is best?

    • In answering this question, p-values have no role to play.

    • Worse, p-values encourage bad inference.

  • problems with p

    • null hypothesis is almost always false a priori

    • p overstates evidence for null

    • information about a hypothesis we don’t care about

    • always < 0.05, with enough data

    • no information about size of effect

    • no information about precision

    • thresholds are arbitrary superstitions

  • null hypothesis is almost always false a priori

    • Do you think any coin has an exact 1/2 chance of heads?

    • Do you think any two groups of people can have exactly the same average height?

  • null hypothesis is almost always false a priori

    • The hypothesis that all group means are the same is false, a priori, because it is a POINT HYPOTHESIS.

    • The difference will not be exactly zero. It will not be exactly 3, either.

    • What we want to know is HOW BIG the difference is.

  • p overstates evidence for null

    • Pr(D|H) uses the TAIL of the sampling distribution.

    • These are mostly probabilities of data that we have NOT observed.

    • Thus we base most of our judgment about the null hypothesis on events that have not happened!

    • This inflates the likelihood of finding an observation “consistent” with the null.

    [Figure: density (y-axis, 0.00 to 0.12) against estimate (x-axis, -15 to 15)]

  • information about a hypothesis we don’t care about

    • Pr(D|H0) does not tell us Pr(D|H1).

    • How can we learn about H1 without fitting it to the data?

    • The law of likelihood needs to compare likelihoods => multiple models fit to data.
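
    A minimal sketch of comparing likelihoods, reusing the 3-heads-in-10-tosses data from the earlier coin example. The alternative model (prob = 0.3) is a hypothetical choice made for illustration, and `binom_likelihood` is an illustrative name.

```python
from math import comb

def binom_likelihood(prob, heads, tosses):
    """Likelihood of a coin model with the given prob, for the observed data."""
    return comb(tosses, heads) * prob**heads * (1 - prob)**(tosses - heads)

heads, tosses = 3, 10
fair = binom_likelihood(0.5, heads, tosses)    # H0: unbiased coin
biased = binom_likelihood(0.3, heads, tosses)  # H1: hypothetical alternative

ratio = biased / fair
print(f"Pr(D|H0) = {fair:.4f}, Pr(D|H1) = {biased:.4f}, ratio = {ratio:.2f}")
```

    The ratio only exists because both models were evaluated on the same data; Pr(D|H0) on its own has nothing to be compared with.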

  • always < 0.05, with enough data

    • Because the null is false, a priori, as we collect more data, p eventually falls below 0.05.

    • Thus all p > 0.05 tells us is WE DIDN’T COLLECT ENOUGH DATA.

    • All p < 0.05 tells us is WE DID COLLECT ENOUGH DATA.

    • p-values are routinely ignored in fields with very large data sets (because everything is “significant”).
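
    A sketch of this effect for a coin that is only slightly biased. The true probability of 0.51 is a hypothetical value, the observed proportion is set equal to its expected value rather than simulated, and the p-value uses the normal approximation to the binomial. The name `two_sided_p` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(true_prob, tosses, null_prob=0.5):
    """Two-sided p-value against null_prob when the observed proportion
    equals true_prob (normal approximation to the binomial)."""
    standard_error = sqrt(null_prob * (1 - null_prob) / tosses)
    z = abs(true_prob - null_prob) / standard_error
    return 2 * (1 - NormalDist().cdf(z))

# Same tiny bias every time; only the amount of data changes.
for tosses in (100, 1_000, 10_000, 100_000):
    print(f"{tosses:>7} tosses: p = {two_sided_p(0.51, tosses):.3g}")
```

    The bias is 0.01 throughout, yet p moves from far above 0.05 to far below it purely because the number of tosses grows.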

  • no information about size of effect

    • p < 0.05 doesn’t tell us how scientifically important the effect is.

    • The maximum likelihood estimate is the effect size.

  • no information about precision

    • Well, vague information.

    • We want something like the confidence interval around the estimate.

    • The p-value is often correlated with precision, but it is not the same calculation.

    • Better to use the actual confidence interval.

  • thresholds are arbitrary superstitions

    • Why is p < 0.05 the threshold for true/false?

    • If p = 0.06, is null always true? Of course not.

    • If p = 0.04, is null always false? Of course not.

    • But people say of p = 0.12 (e.g.): “There was no effect.”

    • This is superstitious.

  • thresholds are arbitrary superstitions

    • Given all the uncertainty in statistical inference, how can we justify a hair-line cutoff criterion for “truth”?

    • Law of likelihood does not imply a cutoff.

  • defenses of p

    • Weak defenses:

      • useful, when used with other information

      • used for a long time, so must be useful

      • have to use them to get published

  • defenses of p

    • Weak defenses:

      • useful, when used with other information

    No, need multiple models to use law of likelihood.

      • used for a long time, so must be useful

      • have to use them to get published

  • defenses of p

    • Weak defenses:

      • useful, when used with other information

      • used for a long time, so must be useful

    No, astrology has been used for a long time, too. What important scientific result hinged upon NHST?

      • have to use them to get published

  • defenses of p

    • Weak defenses:

      • useful, when used with other information

      • used for a long time, so must be useful

      • have to use them to get published

    Justification of a coward: you can get published using estimates and confidence intervals and/or real model comparison (in the coming weeks).

  • what instead of “p”?

    • Never use the word “significant.”

    • Always communicate the EFFECT SIZE (estimate) and PRECISION (confidence interval).

    • Do not lie about uncertainty. Quantify and communicate the uncertainty.

    • Use multiple plausible hypotheses; no obviously false “null” hypotheses.