Why are biologists terrified? · 2010. 4. 6.
TRANSCRIPT
-
Why are biologists terrified?
-
Hypothesis: All swans are white
Observe: White swans
CANNOT conclude that H is true
-
Hypothesis: All swans are white
Observe: A black swan
CAN conclude H is FALSE
-
falsificationism doesn’t work
• Nice in principle, but only works for LOGICAL hypotheses, not for PROBABILISTIC hypotheses.
Hypothesis: Most swans are white
Observe: A black swan
Conclude what?
-
our goal
• We want to be able to compare the predictive accuracy of different models.
• Hypotheses take the form of different functions and combinations of variables.
• How to compare them?
-
our goal
• How to compare them?
• Several common ways:
  • p-values and null hypothesis tests
  • stepwise procedures
  • information criteria
-
comparing models by using p-values is bad
• Common (bad) approach:
  • Fit a single model containing all variables you think might matter
  • Conclude that those variables with “significant” effects matter
  • Conclude those without “significant” effects do not matter
-
how people use p
• Most people perform a simple ritual: the null hypothesis significance test (NHST).
• (1) Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.
• (2) Use 5% as a convention for rejecting the null. If rejected, accept your research hypothesis.
• (3) Always perform this procedure.
-
NHST (null hypothesis significance testing)
• what is a “p-value”?
• what “p” is not
• how people use p-values
• problems with using p-values
• attempts to defend p-values
• so what instead?
-
p-values
• What is a p-value?
Pr(estimate-or-more-extreme-estimate | true-value = 0)
[Figure: density (y-axis) against estimate (x-axis, -15 to 15)]
Pr(observation-or-more-extreme-observation | true-expectation = 0)
-
p-values
• “Probability of obtaining this data or more extreme data, given that the null hypothesis is true.”
p ≡ Pr(data|hypothesis)
-
example
• Flip a coin 10 times. Observe 3 heads.
[Figure: likelihood | prob = 0.5 (y-axis) against heads observed, 0 to 10 (x-axis)]
-
example
• What is likelihood of 3 or fewer heads, assuming unbiased coin?
[Figure: likelihood | prob = 0.5 (y-axis) against heads observed, 0 to 10 (x-axis)]
-
example
• For 20 coin tosses and 6 observed heads:
[Figure: likelihood | prob = 0.5 (y-axis) against heads observed, 0 to 20 (x-axis)]
-
example
• For 40 coin tosses and 13 observed heads:
[Figure: likelihood | prob = 0.5 (y-axis) against heads observed, 0 to 40 (x-axis)]
-
[Figure, three panels: likelihood | prob = 0.5 (y-axis) against heads observed (x-axis), for 10, 20, and 40 tosses]
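To put numbers on the three coin-toss cases (3 of 10, 6 of 20, 13 of 40), here is a small sketch that computes the lower-tail probability for each. It assumes a one-sided tail, since the slides do not say which tail convention the plots use; `binom_tail_le` is a hypothetical helper name.

```python
from math import comb

def binom_tail_le(k, n, prob=0.5):
    """Pr(X <= k) for X ~ Binomial(n, prob)."""
    return sum(comb(n, i) * prob**i * (1 - prob)**(n - i) for i in range(k + 1))

# (heads observed, tosses) for the three cases on the slides
cases = [(3, 10), (6, 20), (13, 40)]
tails = {(k, n): binom_tail_le(k, n) for k, n in cases}

for (k, n), tail in tails.items():
    print(f"{k}/{n} heads: Pr(X <= {k} | prob = 0.5) = {tail:.4f}")
```

The observed proportion of heads is roughly the same in all three cases, yet the tail probability shrinks as the number of tosses grows.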
-
example
• For parameter estimates (like betas), the p-value is about the estimate, not the data directly.
[Figure: likelihood (y-axis) against estimate (x-axis, -15 to 15), marking the null (β = 0) and the mle (β̂)]
-
example
[Figure: likelihood (y-axis) against estimate (x-axis, -15 to 15), with labels: null (β = 0), mle (β̂), p, and “not the same as p”]
-
what p is not
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say, 20 subjects in each sample). The observed difference between the means of the groups is 12.7. Furthermore, suppose you use a simple independent means t-test and your result is significant (p = .01). Please mark each of the statements below as “true” or “false.”
-
what p is not
You have absolutely disproved the null hypothesis (i.e., there is no difference between the population means).
• FALSE. Probabilities are statements of uncertainty, and cannot prove or disprove anything.
-
what p is not
You have found the probability of the null hypothesis being true.
• FALSE. p is Pr(D|H), not Pr(H|D). We cannot invert the probability just because we wish we could.
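To see why Pr(D|H) cannot simply be read as Pr(H|D), here is a minimal sketch of Bayes' rule for two exhaustive hypotheses. All numbers are hypothetical and chosen only for illustration; the function name `posterior_null` is not from the slides.

```python
def posterior_null(pr_d_given_h0, pr_d_given_h1, prior_h0):
    """Pr(H0 | D) by Bayes' rule, assuming H0 and H1 are the only two hypotheses."""
    joint_h0 = pr_d_given_h0 * prior_h0
    joint_h1 = pr_d_given_h1 * (1 - prior_h0)
    return joint_h0 / (joint_h0 + joint_h1)

# Hypothetical inputs: Pr(D|H0) is fixed at 0.01 throughout,
# while the prior and Pr(D|H1) vary.
for prior_h0, pr_d_h1 in [(0.5, 0.5), (0.9, 0.5), (0.9, 0.02)]:
    post = posterior_null(0.01, pr_d_h1, prior_h0)
    print(f"prior Pr(H0) = {prior_h0}, Pr(D|H1) = {pr_d_h1}: Pr(H0|D) = {post:.3f}")
```

The same Pr(D|H0) yields very different values of Pr(H0|D), depending on quantities that a p-value never looks at.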
-
what p is not
You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
• FALSE. p is a probability, and therefore it cannot prove anything.
-
what p is not
You can deduce the probability of the experimental hypothesis being true.
• FALSE. p provides no information about the experimental hypothesis, only the null hypothesis.
-
what p is not
You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
• FALSE. You want the probability of the hypothesis being true, but you calculated Pr(D|H), not Pr(H|D). You cannot calculate the probability the hypothesis is true or false.
-
what p is not
You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
• FALSE. We don’t know if H is true, and the above would only be true if it were. If some other hypothesis is true, then we can’t expect to have the right probability of the data. p is Pr(D|H), remember.
-
what p is not
...The observed difference between the means of the groups is 12.7. Furthermore, suppose you use a simple independent means t-test and your result is NOT significant (p = .06).
-
what p is not
You can conclude that there is no real difference between the means of the two groups.
• FALSE. The maximum likelihood estimate of the difference in means is 12.7, and this is true whether or not p < 0.05. Information about the size of the effect and confidence interval of the effect is not the same as p.
-
how people use p
• Most people perform a simple ritual: the null hypothesis significance test (NHST).
• (1) Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.
• (2) Use 5% as a convention for rejecting the null. If rejected, accept your research hypothesis.
• (3) Always perform this procedure.
-
what we want to do
• Which of several potentially useful models is best?
• In answering this question, p-values have no role to play.
• Worse, p-values encourage bad inference.
-
problems with p
• null hypothesis is almost always false a priori
• p overstates evidence for null
• information about a hypothesis we don’t care about
• always < 0.05, with enough data
• no information about size of effect
• no information about precision
• thresholds are arbitrary superstitions
-
null hypothesis is almost always false a priori
• Do you think any coin has an exact 1/2 chance of heads?
• Do you think any two groups of people can have exactly the same average height?
-
null hypothesis is almost always false a priori
• The hypothesis that all group means are the same is false, a priori, because it is a POINT HYPOTHESIS.
• The difference will not be exactly zero. It will not be exactly 3, either.
• What we want to know is HOW BIG is the difference.
-
p overstates evidence for null
• Pr(D|H) uses the TAIL of the sampling distribution.
• These are mostly probabilities of data that we have NOT observed.
• Thus we base most of our judgment about the null hypothesis on events that have not happened!
• This inflates likelihood of finding observation “consistent” with the null.
[Figure: density (y-axis) against estimate (x-axis, -15 to 15)]
-
information about a hypothesis we don’t care about
• Pr(D|H0) does not tell us Pr(D|H1).
• How can we learn about H1 without fitting it to the data?
• Law of likelihood needs to compare likelihoods => multiple models fit to data.
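As a sketch of what comparing likelihoods looks like in the simplest case, the code below evaluates two coin hypotheses against the earlier example of 3 heads in 10 tosses. The alternative value of 0.3 is a hypothetical choice for illustration, and `binom_likelihood` is an illustrative name.

```python
from math import comb

def binom_likelihood(k, n, prob):
    """Likelihood of exactly k heads in n tosses when Pr(heads) = prob."""
    return comb(n, k) * prob**k * (1 - prob)**(n - k)

k, n = 3, 10
lik_h0 = binom_likelihood(k, n, 0.5)  # H0: unbiased coin
lik_h1 = binom_likelihood(k, n, 0.3)  # H1: hypothetical alternative, Pr(heads) = 0.3
ratio = lik_h1 / lik_h0

print(f"L(H0) = {lik_h0:.4f}, L(H1) = {lik_h1:.4f}, ratio H1/H0 = {ratio:.2f}")
```

Both likelihoods use only the data actually observed, and the ratio says how much better one hypothesis accounts for that data than the other. Neither number alone, and no tail area, supplies that comparison.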
-
always < 0.05, with enough data
• Because null is false, a priori, as we collect more data, p eventually falls below 0.05.
• Thus all p > 0.05 tells us is WE DIDN’T COLLECT ENOUGH DATA.
• All p < 0.05 tells us is WE DID COLLECT ENOUGH DATA.
• p-value routinely ignored in fields with very large data sets (because everything is “significant”).
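The dependence of p on sample size can be sketched with a coin that is only slightly biased. The code assumes, hypothetically, that every sample shows exactly 51% heads, and it uses a normal approximation to the binomial for the two-sided p-value; neither assumption comes from the slides.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(observed_prop, n, null_prop=0.5):
    """Normal-approximation two-sided p-value for a proportion against null_prop."""
    se = sqrt(null_prop * (1 - null_prop) / n)  # standard error under the null
    z = (observed_prop - null_prop) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same tiny effect (51% heads) at increasing sample sizes
p_values = {n: two_sided_p(0.51, n) for n in (100, 1_000, 10_000, 100_000)}

for n, p in p_values.items():
    print(f"n = {n:>7}: p = {p:.3g}")
```

The effect size never changes, but p crosses 0.05 once n is large enough.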
-
no information about size of effect
• p < 0.05 doesn’t tell us how scientifically important the effect is.
• The maximum likelihood estimate is the effect size.
-
no information about precision
• Well, vague information.
• We want something like the confidence interval around the estimate.
• p-value often correlated with precision, but not the same calculation.
• Better to use the actual confidence interval.
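A minimal sketch of reporting an estimate with its precision, using the coin example (13 heads in 40 tosses). It assumes the simple normal-approximation (Wald) interval, which is only one of several ways to build a confidence interval for a proportion; the second case (130 of 400) is hypothetical and added to show the effect of more data.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(k, n, level=0.95):
    """Estimate and normal-approximation confidence interval for a proportion."""
    est = k / n
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level = 0.95
    half_width = z * sqrt(est * (1 - est) / n)
    return est, est - half_width, est + half_width

results = {(k, n): wald_interval(k, n) for k, n in [(13, 40), (130, 400)]}

for (k, n), (est, lo, hi) in results.items():
    print(f"{k}/{n}: estimate = {est:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

The estimate is the same in both cases; only the width of the interval changes, and that width is what communicates precision.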
-
thresholds are arbitrary superstitions
• Why is p < 0.05 the threshold for true/false?
• If p = 0.06, is null always true? Of course not.
• If p = 0.04, is null always false? Of course not.
• But people say of p = 0.12 (e.g.): “There was no effect.”
• This is superstitious.
-
thresholds are arbitrary superstitions
• Given all the uncertainty in statistical inference, how can we justify a hair-line cutoff criterion for “truth”?
• Law of likelihood does not imply a cutoff.
-
defenses of p
• Weak defenses:
  • useful, when used with other information
  • used for long time, so must be useful
  • have to use them to get published
-
defenses of p
• Weak defenses:
  • useful, when used with other information
    No, need multiple models to use law of likelihood.
  • used for long time, so must be useful
  • have to use them to get published
-
defenses of p
• Weak defenses:
  • useful, when used with other information
  • used for long time, so must be useful
    No, astrology used for a long time, too. What important scientific result hinged upon NHST?
  • have to use them to get published
-
defenses of p
• Weak defenses:
  • useful, when used with other information
  • used for long time, so must be useful
  • have to use them to get published
    Justification of a coward: you can get published using estimates and confidence intervals and/or real model comparison (next weeks).
-
what instead of “p”?
• Never use the word “significant.”
• Always communicate the EFFECT SIZE (estimate) and PRECISION (confidence interval).
• Do not lie about uncertainty. Quantify and communicate the uncertainty.
• Use multiple plausible hypotheses; no obviously false “null” hypotheses.