why$are$biologists$terriﬁed? · 2010. 4. 6. · 0 2 4 6 8 11 14 17 20 23 26 29 32 35 38 heads...

Why are biologists terrified?

Hypothesis: All swans are white

Observe: White swans

CANNOT conclude that H is true

Hypothesis: All swans are white

Observe: A black swan

CAN conclude H is FALSE

falsificationism doesn’t work

• Nice in principle, but only works for LOGICAL hypotheses, not for PROBABILISTIC hypotheses.

Hypothesis: Most swans are white

Observe: A black swan

Conclude what?

our goal

• We want to be able to compare the predictive accuracy of different models.

• Hypotheses take the form of different functions and combinations of variables.

• How to compare them?

• Several common ways:
  • p-values and null hypothesis tests
  • stepwise procedures
  • information criteria

comparing models by using p-values is bad

• Common (bad) approach:
  • Fit a single model containing all variables you think might matter
  • Conclude that those variables with “significant” effects matter
  • Conclude those without “significant” effects do not matter

how people use p

• Most people perform a simple ritual: the null hypothesis significance test (NHST).

• (1) Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses.

• (2) Use 5% as a convention for rejecting the null. If rejected, accept your research hypothesis.

• (3) Always perform this procedure.

NHST (null hypothesis significance testing)

• what is a “p-value”?
• what “p” is not
• how people use p-values
• problems with using p-values
• attempts to defend p-values
• so what instead?

p-values

• What is a p-value?

Pr(estimate-or-more-extreme-estimate|true-value = 0)

[Figure: sampling density of the estimate when the true value is 0; x-axis: estimate (-15 to 15), y-axis: density]

Pr(observation-or-more-extreme-observation|true-expectation = 0)

p-values

• “Probability of obtaining this data or more extreme data, given that the null hypothesis is true.”

p ≡ Pr(data|hypothesis)

example

• Flip a coin 10 times. Observe 3 heads.

[Figure: likelihood of each number of heads in 10 tosses, given prob = 0.5; x-axis: heads observed (0 to 10), y-axis: likelihood]

example

• What is the likelihood of 3 or fewer heads, assuming an unbiased coin?

[Figure: likelihood of each number of heads in 10 tosses, given prob = 0.5; x-axis: heads observed (0 to 10), y-axis: likelihood]

example

• For 20 coin tosses and 6 observed heads:

[Figure: likelihood of each number of heads in 20 tosses, given prob = 0.5; x-axis: heads observed (0 to 20), y-axis: likelihood]

example

• For 40 coin tosses and 13 observed heads:

[Figure: likelihood of each number of heads in 40 tosses, given prob = 0.5; x-axis: heads observed (0 to 40), y-axis: likelihood]

[Figure: three panels showing the likelihood of heads observed, given prob = 0.5, for 10, 20, and 40 tosses]
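The tail probabilities in the three coin examples can be computed directly. The following is a minimal sketch, assuming a one-sided (lower-tail) binomial calculation with prob = 0.5; the function name `lower_tail_p` is illustrative and does not come from the slides.

```python
from math import comb


def lower_tail_p(heads: int, tosses: int, prob: float = 0.5) -> float:
    """Pr(heads-or-fewer | prob): sum of binomial probabilities for 0..heads."""
    return sum(
        comb(tosses, k) * prob**k * (1 - prob) ** (tosses - k)
        for k in range(heads + 1)
    )


# The three examples from the slides: (heads observed, tosses).
examples = [(3, 10), (6, 20), (13, 40)]
tail_probs = {(h, n): lower_tail_p(h, n) for h, n in examples}

for (h, n), p in tail_probs.items():
    print(f"{h} heads in {n} tosses: Pr({h} or fewer | prob=0.5) = {p:.4f}")
```

The observed proportion of heads stays near 0.3 in all three cases, yet the tail probability shrinks as the number of tosses grows.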

example

• For parameter estimates (like betas), the p-value is about the estimate, not the data directly.

[Figure: likelihood over the estimate (-15 to 15), marking the null (β = 0) and the mle (β̂)]

example

[Figure: likelihood over the estimate (-15 to 15), marking the null (β = 0) and the mle (β̂), with annotations “p” and “not the same as p”]

what p is not

Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say, 20 subjects in each sample). The observed difference between the means of the groups is 12.7. Furthermore, suppose you use a simple independent means t-test and your result is significant (p = .01). Please mark each of the statements below as “true” or “false.”

what p is not

You have absolutely disproved the null hypothesis (i.e., there is no difference between the population means).

• FALSE. Probabilities are statements of uncertainty, and cannot prove or disprove anything.

what p is not

You have found the probability of the null hypothesis being true.

• FALSE. p is the Pr(D|H), not Pr(H|D). We cannot invert the probability just because we wish we could.
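To see why Pr(D|H) cannot simply be inverted, here is a minimal sketch using Bayes' theorem. It treats "the result is significant" as the data, and every number in it (the prior probability of the null, the 5% threshold, and the power of 0.8 under a single alternative) is a hypothetical input chosen for illustration, not something given in the slides.

```python
def prob_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """Pr(H0 | significant result) via Bayes' theorem.

    alpha = Pr(significant | H0 true); power = Pr(significant | H0 false).
    """
    p_sig_and_null = alpha * prior_null
    p_sig_and_alt = power * (1 - prior_null)
    return p_sig_and_null / (p_sig_and_null + p_sig_and_alt)


# Hypothetical inputs, chosen only for illustration.
alpha, power = 0.05, 0.8
posteriors = {
    prior: prob_null_given_significant(prior, alpha, power)
    for prior in (0.1, 0.5, 0.9)
}

for prior, post in posteriors.items():
    print(f"Pr(H0) = {prior:.1f} -> Pr(H0 | significant) = {post:.3f}")
```

The same Pr(D|H) of 0.05 yields very different values of Pr(H|D) depending on the prior, so the inversion needs information that a p-value does not contain.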

what p is not

You have absolutely proved your experimental hypothesis (that there is a difference between the population means).

• FALSE. p is a probability, and therefore it cannot prove anything.

what p is not

You can deduce the probability of the experimental hypothesis being true.

• FALSE. p provides no information about the experimental hypothesis, only the null hypothesis.

what p is not

You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

• FALSE. You want the probability of the hypothesis being true, but you calculated Pr(D|H), not Pr(H|D). You cannot calculate the probability the hypothesis is true or false.

what p is not

You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

• FALSE. We don’t know if H is true, and the above would only be true if it were. If some other hypothesis is true, then we can’t expect to have the right probability of the data. p is Pr(D|H), remember.

what p is not

...The observed difference between the means of the groups is 12.7. Furthermore, suppose you use a simple independent means t-test and your result is NOT significant (p = .06).

what p is not

You can conclude that there is no real difference between the means of the two groups.

• FALSE. The maximum likelihood estimate of the difference in means is 12.7, and this is true whether or not p < 0.05. Information about the size of the effect and confidence interval of the effect is not the same as p.
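A rough sketch of what the two scenarios (p = .01 and p = .06, both with an observed difference of 12.7) imply about precision. It backs out a standard error from the two-sided p-value and builds an approximate 95% confidence interval. As a simplifying assumption it uses the normal distribution in place of the t distribution, so the intervals come out slightly narrower than a real t-based interval for 20 subjects per group; the function name `approx_ci` is illustrative.

```python
from statistics import NormalDist


def approx_ci(estimate: float, p_two_sided: float, level: float = 0.95):
    """Back out a standard error from a two-sided p-value, then build a CI.

    Assumption: the normal distribution stands in for the t distribution.
    """
    std_normal = NormalDist()
    # |estimate| / SE equals the quantile that leaves p/2 in each tail.
    z_observed = std_normal.inv_cdf(1 - p_two_sided / 2)
    std_error = abs(estimate) / z_observed
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    return estimate - z_crit * std_error, estimate + z_crit * std_error


ci_p01 = approx_ci(12.7, 0.01)  # the "significant" scenario
ci_p06 = approx_ci(12.7, 0.06)  # the "not significant" scenario

print(f"p = .01: approx. 95% CI = ({ci_p01[0]:.1f}, {ci_p01[1]:.1f})")
print(f"p = .06: approx. 95% CI = ({ci_p06[0]:.1f}, {ci_p06[1]:.1f})")
```

Both intervals are centered on 12.7. The p = .06 interval barely includes zero and also includes differences larger than 20, which is a long way from "no real difference."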


what we want to do

• Which of several potentially useful models is best?

• In answering this question, p-values have no role to play.

• Worse, p-values encourage bad inference.

problems with p

• null hypothesis is almost always false a priori
• p overstates evidence for null
• information about a hypothesis we don’t care about
• always < 0.05, with enough data
• no information about size of effect
• no information about precision
• thresholds are arbitrary superstitions

null hypothesis is almost always false a priori

• Do you think any coin has an exact 1/2 chance of heads?

• Do you think any two groups of people can have exactly the same average height?

null hypothesis is almost always false a priori

• The hypothesis that all group means are the same is false, a priori, because it is a POINT HYPOTHESIS.

• The difference will not be exactly zero. It will not be exactly 3, either.

• What we want to know is HOW BIG is the difference.

p overstates evidence for null

• Pr(D|H) uses the TAIL of the sampling distribution.

• These are mostly probabilities of data that we have NOT observed.

• Thus we base most of our judgment about the null hypothesis on events that have not happened!

• This inflates the likelihood of finding an observation “consistent” with the null.

[Figure: sampling density of the estimate; x-axis: estimate (-15 to 15), y-axis: density]

information about a hypothesis we don’t care about

• Pr(D|H0) does not tell us Pr(D|H1).

• How can we learn about H1 without fitting it to the data?

• Law of likelihood needs to compare likelihoods => multiple models fit to data.
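The comparison that the law of likelihood calls for can be sketched with the earlier coin data (13 heads in 40 tosses). The alternative hypothesis prob = 0.3 is a hypothetical choice made only for this illustration; the slides do not name an alternative.

```python
from math import comb


def binom_likelihood(heads: int, tosses: int, prob: float) -> float:
    """Likelihood of the observed number of heads under a given prob of heads."""
    return comb(tosses, heads) * prob**heads * (1 - prob) ** (tosses - heads)


heads, tosses = 13, 40  # coin example from the earlier slides

lik_h0 = binom_likelihood(heads, tosses, 0.5)  # H0: unbiased coin
lik_h1 = binom_likelihood(heads, tosses, 0.3)  # H1: hypothetical alternative

# Ratio > 1 means the data are more probable under H1 than under H0.
likelihood_ratio = lik_h1 / lik_h0
print(f"Pr(D|H0) = {lik_h0:.4f}")
print(f"Pr(D|H1) = {lik_h1:.4f}")
print(f"likelihood ratio H1:H0 = {likelihood_ratio:.1f}")
```

Neither likelihood means much on its own. The evidence is in the ratio, which requires fitting both hypotheses to the same data.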

always < 0.05, with enough data

• Because null is false, a priori, as we collect more data, p eventually falls below 0.05.

• Thus all p > 0.05 tells us is WE DIDN’T COLLECT ENOUGH DATA.

• All p < 0.05 tells us is WE DID COLLECT ENOUGH DATA.

• The p-value is routinely ignored in fields with very large data sets (because everything is “significant”).
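A small sketch of this effect, assuming a coin whose observed proportion of heads is always 0.51 (a hypothetical, scientifically trivial departure from 0.5) and using the normal approximation to the binomial for a two-sided p-value. The function name and the sample sizes are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist


def approx_two_sided_p(observed_prop: float, n: int, null_prop: float = 0.5) -> float:
    """Two-sided p-value for a proportion, using the normal approximation."""
    std_error = sqrt(null_prop * (1 - null_prop) / n)
    z = (observed_prop - null_prop) / std_error
    return 2 * NormalDist().cdf(-abs(z))


# Same observed proportion every time; only the amount of data changes.
sample_sizes = [100, 10_000, 100_000]
p_values = [approx_two_sided_p(0.51, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n = {n:>7}: p = {p:.3g}")
```

The effect size never changes, but the result crosses from "not significant" to "significant" purely because more data were collected.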

no information about size of effect

• p < 0.05 doesn’t tell us how scientifically important the effect is.

• The maximum likelihood estimate is the effect size.

no information about precision

• Well, vague information.

• We want something like the confidence interval around the estimate.

• p-value often correlated with precision, but not the same calculation.

• Better to use the actual confidence interval.

thresholds are arbitrary superstitions

• Why is p < 0.05 the threshold for true/false?

• If p = 0.06, is null always true? Of course not.

• If p = 0.04, is null always false? Of course not.

• But people say of p = 0.12 (e.g.): “There was no effect.”

• This is superstitious.

thresholds are arbitrary superstitions

• Given all the uncertainty in statistical inference, how can we justify a hairline cutoff criterion for “truth”?

• Law of likelihood does not imply a cutoff.

defenses of p

• Weak defenses:
  • useful, when used with other information
    No, need multiple models to use law of likelihood.
  • used for long time, so must be useful
    No, astrology used for a long time, too. What important scientific result hinged upon NHST?
  • have to use them to get published
    Justification of a coward: you can get published using estimates and confidence intervals and/or real model comparison (next weeks).

what instead of “p”?

• Never use the word “significant.”

• Always communicate the EFFECT SIZE (estimate) and PRECISION (confidence interval).

• Do not lie about uncertainty. Quantify and communicate the uncertainty.

• Use multiple plausible hypotheses; no obviously false “null” hypotheses.
