Crisis of confidence, p-hacking and the future of psychology

Uploaded by matti-heino, 11-Apr-2017

TRANSCRIPT

After the crisis: Past disappointments, Present developments, Future opportunities

The past: Birth of a monster

Progress from academic freedom?

"Scientific progress results from the free play of intellects, dictated by their curiosity"
(Vannevar Bush, 1945) [see link]

"You throw something at the wall, and P is less than 0.05, you win"
(Bruce Cuthbert, 2014) [see link]

Will a study show an effect, if it exists?

• Effects are usually hard to detect
• Statistical power: the probability of detecting an effect, if it is real
• Usually set at 80%, by convention
• Even then, once in every five studies you'd miss the effect! (see the sketch below)
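To make this concrete, here is a minimal power simulation, a Python sketch of my own (not from the slides), with hypothetical numbers: how often does a two-sample t-test detect a small true effect?

```python
# A minimal power simulation sketch (hypothetical numbers): how often
# does a two-sample t-test detect a small true effect (Cohen's d = 0.2)
# with 50 participants per group at alpha = 0.05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_per_group, d, n_sims = 50, 0.2, 10_000

hits = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n_per_group)   # control group
    b = rng.normal(d, 1.0, n_per_group)     # treatment group with true effect d
    _, p = stats.ttest_ind(a, b)
    hits += p < 0.05

print(f"Estimated power: {hits / n_sims:.2f}")   # roughly 0.17, far below 80%
```

With these settings the true effect is missed in the large majority of studies, which is exactly the telescope point on the next slide: a bigger sample is a bigger telescope.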

Will a study show an effect, if it exists?

• How many examples do you see? Bigger sample = bigger telescope
• How much variation is there? If not much, it is easier to detect the effect

[Figure: two pairs of distributions: hard to distinguish (unless you know what you're looking for) vs. easy to distinguish]

Will a study show an effect, if it exists?

• Mean statistical power in the social, behavioural, and biological sciences, to detect small effects (which are the most common): 24% (Smaldino & McElreath, 2016)
• But over 90% of published studies show an effect!

"Obviously, an experiment designed [with low power] is not worth performing"
- J. Neyman, 1977

Will a study show an effect, if it exists?

• Those with p < 0.05 get published
• The rest stay hidden, banished to file drawers

"The Null Hypothesis Tests Us"
- Cohen (1990)

"… a puzzling state of affairs in the currently accepted methodology of the behavior sciences"
- Meehl (1967)

"Research experience is unlikely to help much"
- Tversky & Kahneman (1971)

What publication bias looks like

• Do people in bars perform worse than average on intelligence tests?
• What about holiday resorts?

P-value distribution when the effect is real (50% power), versus when the effect is NOT real (only the 5% false-positive rate remains)

[Figure: simulated p-values under each scenario]

• No real effect: 100 studies done, 5 positive published, 95 in the file drawer
• Real effect: 10 studies done, 5 positive published, 5 in the file drawer
• Either way, the published record is the same handful of significant p-values (e.g. 0.003, 0.007, 0.014, 0.017, 0.021); a simulation sketch follows
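Here is a hypothetical Python sketch of that file-drawer effect (my illustration, with made-up sample sizes): the published record looks identical whether the significant results are false positives from a hundred null studies or true positives from ten adequately powered ones.

```python
# A hypothetical sketch of the file-drawer effect: the published record
# looks much the same whether ~5 of 100 null studies are false positives
# or ~5 of 10 fifty-percent-powered studies are true positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

def run_studies(n_studies, n_per_group, true_d):
    """Return the p-values of n_studies two-group experiments."""
    ps = []
    for _ in range(n_studies):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_d, 1.0, n_per_group)
        ps.append(stats.ttest_ind(a, b)[1])
    return np.array(ps)

null_ps = run_studies(100, 50, true_d=0.0)   # no real effect anywhere
real_ps = run_studies(10, 50, true_d=0.4)    # real effect, ~50% power

print("published (null world):", np.round(null_ps[null_ps < 0.05], 3))
print("published (real world):", np.round(real_ps[real_ps < 0.05], 3))
```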

P-curve for beer: studies' evidential value, if any, is inadequate: Z = 0.194; p = .577 [link]

P-curve for vacations: studies' evidential value, if any, is inadequate: Z = -2.109; p = .017 [link]

P-curve for Moscovici (1980) (ca. 850 citations): studies' evidential value, if any, is inadequate: Z = -2.031; p = .021 [link]

… This study is now a meta-analysis!
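The slides used the p-curve app for these results; as a minimal sketch of the underlying right-skew test (my Python illustration, not the app's code): under H0, a significant p-value is uniform on (0, alpha), so pp = p / alpha is uniform on (0, 1), and Stouffer's method combines the probit-transformed pp-values.

```python
# A minimal sketch of the p-curve right-skew test: under H0, a
# significant p-value is uniform on (0, alpha), so pp = p / alpha is
# uniform on (0, 1); Stouffer's method combines the probit-transforms.
import numpy as np
from scipy.stats import norm

def p_curve_right_skew(ps, alpha=0.05):
    ps = np.asarray([p for p in ps if p < alpha])
    pp = ps / alpha                      # uniform on (0, 1) under H0
    z = norm.ppf(pp)
    Z = z.sum() / np.sqrt(len(z))        # Stouffer's Z
    return Z, norm.cdf(Z)                # small p means right skew, i.e. evidence

# The five published p-values from the example above
print(p_curve_right_skew([0.003, 0.007, 0.014, 0.017, 0.021]))
```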

But we don't need 100 studies… just go p-hacking!

- Saves resources!
- Wins fame and prestige!
- More publish, less perish!

… But we know better, right?

The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time
- Gelman & Loken (2013)

The Garden of Forking Paths

Data → Main analysis: p > 0.05
"Does that really look normally distributed?" → Analysis 1: p > 0.05
"Hey, aren't those outliers?" → Analysis 2: p > 0.05
"Actually, we need to combine some conditions…" → Analysis 3: p > 0.05
… "YESSSS! p < 0.05"

Check out these slides from Felix Schönbrodt, or this talk by Neuroskeptic! (A simulation sketch of the forking follows.)
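Here is a minimal Python sketch of why forking inflates false positives, under assumed analysis paths of my own choosing (the data are pure noise throughout):

```python
# A hypothetical forking-paths simulation: the data are pure noise, but
# we allow ourselves extra analysis paths and stop at any p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

def smallest_p(a, b):
    """Try the main analysis plus two 'justifiable' variations."""
    paths = [
        (a, b),                                              # main analysis
        (np.log(a - a.min() + 1), np.log(b - b.min() + 1)),  # transform for "skew"
        (a[np.abs(stats.zscore(a)) < 2],                     # drop "outliers"
         b[np.abs(stats.zscore(b)) < 2]),
    ]
    return min(stats.ttest_ind(x, y)[1] for x, y in paths)

n_sims = 5_000
wins = sum(smallest_p(*rng.normal(0.0, 1.0, (2, 40))) < 0.05
           for _ in range(n_sims))
print(f"False-positive rate with forking: {wins / n_sims:.3f}")   # above 0.05
```

Even three correlated paths push the error rate past the nominal 5%; with the dozens of forks available in a real dataset, it climbs much higher.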

The present: A counter-culture awakens

Traditional analysis

• Logic:
1. [Premise] If nothing's going on, then data x is unlikely.
2. [Premise] Data x is observed.
3. [Conclusion] Something's going on.

http://blog.efpsa.org/2015/08/03/bayesian-statistics-why-and-how/

Traditional analysis: an example

• Logic:
1. [Premise] The probability of being named Nelli, if human, is extremely small: 4,055 people out of about 7 billion are named Nelli.
2. [Premise] We meet someone named Nelli.
3. [Conclusion] That person is not human.

http://blog.efpsa.org/2015/08/03/bayesian-statistics-why-and-how/

What went wrong?

• The probability of "Nelli" if NOT human is even smaller!
• Maybe not that extraterrestrial, after all...
• But this is a subjective judgement (how credible are alien abduction stories?)

A better question: "Which is more probable, null or alternative?"

• Remember, the p-value is the probability of the data, given H0
• A Bayes factor: BF10 = Pr(data | H1) / Pr(data | H0)

A great explanation: http://alexanderetz.com/2015/11/01/evidence-vs-conclusions/
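A self-contained Python sketch of that ratio, using the simplest possible case (my example, not from the slides): k heads in n coin flips, point null theta = 0.5 versus a uniform prior on theta, where both marginal likelihoods have closed forms.

```python
# A minimal Bayes factor sketch: k heads in n coin flips.
# H0: theta = 0.5; H1: theta ~ Uniform(0, 1).
# The marginal likelihood under H1 is C(n, k) * B(k+1, n-k+1) = 1 / (n + 1).
from math import comb

def bf10_binomial(k, n):
    pr_data_h0 = comb(n, k) * 0.5**n
    pr_data_h1 = 1 / (n + 1)
    return pr_data_h1 / pr_data_h0

print(bf10_binomial(60, 100))   # ~0.92: essentially uninformative
print(bf10_binomial(80, 100))   # huge: data strongly favor H1
```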

A better question: "Which is more probable, null or alternative?"

• BF10 runs from 0 through 1 to ∞: below 1 the data favor the null, above 1 they favor the alternative
• One proposal: when 1/10 < BF < 10, the data are insufficient and the evidence is quite weak

So, will Bayes save us?

The bad news...
● BFs can be hacked, just as p-values currently are
● Selective reporting will still undermine the reliability of results **
● Average power in good psych journals is still low ***

The good news...
• Bayes may help! (if applied transparently and mindfully *)
• … but subjectivity is salient (think ESP vs. Higgs boson)
• … but (maybe) we know to ask for more info
• … but lack of evidence from small samples becomes clear

* See e.g. link or link
** see link
*** see link

The future: Transparency or death!

Conventions test us

• We set alpha at p = 0.05 because of orthodoxy: the finding is then "worthy of another look"
• It never means "the finding is real"
• Nor is it the probability that the hypothesis is false
• We set power at 80% because of orthodoxy: "type 2 errors are four times more acceptable than type 1 errors"

Conventions test us

• We set BF thresholds at 3 / 10 / 100 because…?!
• With Bayes factors, evidence is relative
• If H1 is 1000 times more likely than H2, a third hypothesis might be more likely than either!
• Note: in frequentism, there is no measure of evidence
• Only long-run error rates (conditional on p being computed correctly!*)

* Greenland et al., Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations [link]

Do BF cutoffs make sense?!

• "How much should I update my prior odds?" Posterior odds = prior odds × BF (see the sketch below)
• Example: BF = 6 for ESP: prior odds 1:10,000 (≈ 0.01%) × 6 = 6:10,000 ≈ 0.06%
• Example: BF = 6 for the Higgs boson: prior odds 1:2 (≈ 33%) × 6 = 3:1, i.e. 75%
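A two-line Python sketch of that update rule (posterior_prob is a hypothetical helper name of mine):

```python
# Posterior probability after seeing BF = 6, for two different priors:
# a minimal sketch of posterior odds = prior odds * BF.
def posterior_prob(prior_prob, bf):
    odds = prior_prob / (1 - prior_prob) * bf
    return odds / (1 + odds)

print(posterior_prob(1 / 10_001, 6))   # ESP, prior odds 1:10,000 -> ~0.0006
print(posterior_prob(1 / 3, 6))        # Higgs, prior odds 1:2 -> 0.75
```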

Researcher wants to show an effect:
● Reports a p-value

Researcher wants to show evidence of no effect:
● Reports a BF (with a wide prior scale)

A new kind of selective reporting?

New tools are coming

• TIVA: test of insufficient variance
• R-index
• Statcheck
• Powergraphs
• Z-curve…

New tools are coming: the GRIM test

• Are the reported results mathematically possible?
• E.g. with whole-number scores, the mean of two scores can be 3.5 or 2.0; 2.25 or 4.10 cannot happen (see the sketch below)
• 50% of articles contained at least one inconsistent value; 20% contained several

Link to GRIM paper: Brown, N. J. L., & Heathers, J. A. J. (2016)
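The check itself fits in a few lines; here is a minimal GRIM-style sketch in Python (my illustration of the idea in the paper, ignoring rounding-tie edge cases):

```python
# A minimal GRIM-style consistency check: with n integer scores, the sum
# must be a whole number, so mean * n must round back to the reported mean.
# (This sketch ignores rounding-tie edge cases.)
def grim_consistent(reported_mean, n, decimals=2):
    total = round(reported_mean * n)      # nearest achievable integer sum
    return round(total / n, decimals) == round(reported_mean, decimals)

print(grim_consistent(3.50, 2))   # True: e.g. scores 3 and 4
print(grim_consistent(2.25, 2))   # False: no two integers average to 2.25
```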

New tools: the GRIM test

[Main results table from Festinger, L., & Carlsmith, J. M. (1959) (3,037 citations)]

A "taxpayer's wish list"

• Reproducible analysis scripts
• Pre-registered hypotheses
• Shared data
• Open access

"We report how we determined our sample size, all data exclusions (if any), all manipulations and all measures in the study."
- Simmons, Nelson & Simonsohn (2012): A 21 word solution

Thank you! Take home:

@heinonmatti · www.mattiheino.com

• Transparency counteracts hacking
• Subjective elements (e.g. priors) can and need to be justified
• Lack of reporting space is no issue nowadays
• Pre-registration, data sharing, supplementary materials…
➡ e.g. store them at OSF: osf.io

Additional Slides

Maximum BF for a given p-value

More info: https://alexanderetz.com/2016/06/19/understanding-bayes-how-to-cheat-to-get-the-maximum-bayes-factor-for-a-given-p-value/

BF ≈ 8

• 1 in 9 chance you're wrong if you start with 50% probability (1/1 × 8 = 8:1 posterior odds)
• 1 in 3 if you start with prior odds of 1:4, i.e. 20% probability (1/4 × 8 = 2:1 posterior odds)
• A sketch of one such bound follows
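The slides do not give a formula, but one well-known upper bound of this kind (Sellke, Bayarri & Berger, 2001) is BF_max = 1 / (-e · p · ln p) for p < 1/e; a quick Python sketch:

```python
# Sketch: the Sellke-Bayarri-Berger upper bound on the Bayes factor
# obtainable from a given p-value: BF_max = 1 / (-e * p * ln p), p < 1/e.
import math

def max_bf(p):
    return 1.0 / (-math.e * p * math.log(p))

for p in (0.05, 0.01, 0.005):
    print(f"p = {p}: BF at most ~{max_bf(p):.1f}")
# p = 0.05 caps the BF near 2.5; even p = 0.01 allows only ~8.
```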

To consider:

How much money, how many years of work is this worth?

Tools of the trade

"The JASP Project aims to produce software for both Bayesian and Frequentist statistical analyses, that is easy to use and familiar to users of SPSS."

Tools of the trade

• You need to choose a Cauchy prior width around zero
• The default is 0.707, which is not appropriate in many of our contexts!
• Richard et al. (2003): the average (Cohen's) d in health psychology is ~0.3
• If you think half of your effects lie between d = -0.3 and d = 0.3, you set the width to 0.3 (see the sketch below)
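Why the width equals that half-mass interval: a Cauchy distribution with scale r places exactly half its mass in [-r, r], as this small Python check shows (my illustration):

```python
# Why "half of my effects lie within d = +/-0.3" means width 0.3:
# a Cauchy prior with scale r puts exactly half its mass in [-r, r].
from scipy.stats import cauchy

for r in (0.707, 0.3):
    mass = cauchy.cdf(r, scale=r) - cauchy.cdf(-r, scale=r)
    print(f"width {r}: P(-r < delta < r) = {mass:.2f}")   # 0.50 for any r
```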

Example: BCT usage among girls and boys

Frequentist report: "We did not detect a difference between boys and girls, t(439) = -0.773, p = 0.440"

Bayesian report: "The results indicated moderate support for the null hypothesis of no difference between boys and girls (BF01 = 6.579)" (a sketch of the underlying computation follows)
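The slide reports JASP output; as an illustration of what JASP computes (my Python sketch, not the slides' analysis), here is the default Bayesian two-sample t-test: a Cauchy prior on the standardized effect size, with the marginal likelihood under H1 obtained by integrating a noncentral t density. The 220/221 group split is hypothetical, since only df = 439 is reported, and bf01_ttest is my helper name.

```python
# A sketch of the default Bayesian two-sample t-test (Cauchy prior on the
# standardized effect size delta, as in JASP / Rouder et al., 2009):
# BF01 = Pr(t | H0) / integral of Pr(t | delta) * Cauchy(delta; 0, r) d delta.
import numpy as np
from scipy import stats, integrate

def bf01_ttest(t, n1, n2, r=0.707):
    df = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)                  # effective sample size
    def marginal_h1(delta):
        return (stats.nct.pdf(t, df, delta * np.sqrt(n_eff))
                * stats.cauchy.pdf(delta, scale=r))
    m1, _ = integrate.quad(marginal_h1, -5, 5,   # integrand negligible beyond +/-5
                           points=[t / np.sqrt(n_eff)], limit=200)
    m0 = stats.t.pdf(t, df)
    return m0 / m1

# Hypothetical 220/221 split; only t(439) = -0.773 is reported on the slide
print(bf01_ttest(-0.773, 220, 221))   # lands in the ballpark of BF01 = 6.579
```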

What's under the hood?

[JASP prior and posterior plots for prior widths 0.707 and 0.30]

You get these graphs from JASP with 2 clicks!

Resources

Etz et al. 2016: "How to become a Bayesian in eight easy steps: An annotated reading list" [link]

http://xcelab.net/rm/statistical-rethinking/ (A coding approach, no math needed!)