Joke Durnez, Poldrack Lab
Department of Psychology, Stanford University
Power and Sample Size Calculations in fMRI
Thanks to Tom Nichols & Russ Poldrack for slides! SAMSI CCNS 2016
OVERVIEW
1. Statistical power and sample size calculations
2. Power in neuroscience: why bother?
3. Power and sample size calculations for fMRI
4. Other variables affecting power and reproducibility
1. Statistical power and sample size calculations
[Figure: probability density over the set of possible results. Two curves are shown: the density function when H0 is true and the density function when Ha is true, with an observed data point, the threshold for activation, and the α-level marked.]
[Figure: the same two density functions, with the threshold for activation and the area 1 − β (power) marked.]
α: probability of rejecting H0 when H0 is true (false positive)
Power (1 − β): probability of rejecting H0 when Ha is true (true positive)
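These two definitions can be checked by simulation. A minimal sketch (my own illustration, not from the slides): draw normal samples under H0 (mean 0) and under Ha (here mean 0.5), with σ = 1 assumed known, and count one-sided rejections at α = 0.05.

```python
import random
from statistics import mean

random.seed(0)

def rejection_rate(mu, n=20, reps=2000, z_crit=1.645):
    """Fraction of simulated studies whose z statistic exceeds z_crit.
    With mu = 0 this estimates alpha; with mu > 0 it estimates power."""
    hits = 0
    for _ in range(reps):
        sample = [random.gauss(mu, 1.0) for _ in range(n)]
        z = mean(sample) / (1.0 / n ** 0.5)  # sigma = 1 assumed known
        hits += z > z_crit
    return hits / reps

print(rejection_rate(0.0))  # close to alpha = 0.05
print(rejection_rate(0.5))  # power: close to 1 - Phi(1.645 - 0.5*sqrt(20))
```

The rejection rate under H0 hovers around 0.05 by construction; under Ha it depends on the effect size and n, which is the subject of the next slides.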
Statistical power
[Figure: the same effect on two measurement scales, and the resulting test statistic.]
Example 1: μ1 = 1000, μ2 = 1020, σ = 100
Example 2: μ1 = 1, μ2 = 2, σ = 5
Writing μ for the mean difference, both give the same standardized effect size Δ = μ/σ = 0.2.
The expected test statistic grows with sample size: E(T) = μ / (σ/√n) = Δ√n, shown for n = 10, 50, 100.
Data Units, Effect Sizes, Statistics
Standardized effect size Δ = 0.2:
n = 10 → power = 0.14
n = 50 → power = 0.41
n = 100 → power = 0.64
[Figure: the distribution of T under Ha shifts right as n grows, so more of it falls above the threshold.]
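These power values can be reproduced approximately with the normal approximation E(T) = Δ√n and a one-sided α = 0.05 test. A sketch (my own, stdlib only; the slide's n = 10 value likely comes from the noncentral t distribution, so it differs slightly from the normal approximation):

```python
from statistics import NormalDist

def approx_power(delta, n, alpha=0.05):
    """One-sided power under the normal approximation:
    power = 1 - Phi(z_{1-alpha} - delta * sqrt(n))."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - delta * n ** 0.5)

for n in (10, 50, 100):
    print(n, round(approx_power(0.2, n), 2))  # approx 0.16, 0.41, 0.64
```

The n = 50 and n = 100 values match the slide; n = 10 comes out slightly higher (≈0.16 vs 0.14) because the t correction matters most at small n.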
Sample Size and Power
[Figure: the null distribution (H0) with critical value cα, and the alternative distribution (Ha) centered at E(T | Ha) = Δ√n, with c1−β marked.]
For the test to reach power 1 − β, the distance between the two distributions must satisfy
Δ√n = zα + zβ
which gives the required sample size
n = ( (zα + zβ) / (μ/σ) )²   (1)
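Equation (1) inverts directly into a sample-size calculator. A sketch (my own illustration) for a one-sided test, where zα and zβ are the upper-α and upper-β normal quantiles:

```python
import math
from statistics import NormalDist

def required_n(delta, alpha=0.05, power=0.80):
    """n = ((z_alpha + z_beta) / delta)^2, rounded up; delta = mu/sigma."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)   # z_beta = z_{1 - beta}
    return math.ceil(((z_alpha + z_beta) / delta) ** 2)

print(required_n(0.2))  # 155 subjects for 80% power at delta = 0.2
```

For the slide's Δ = 0.2 this gives 155 subjects, which is why small standardized effects demand large samples.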
Sample Size Calculations
2. Power: why bother?
[Figure: a simulated activation map plus simulated Gaussian noise yields, for n = 15, a simulated T-map.]
Why bother about power?
1. A low-powered study makes it hard to find the effects of interest.
[Figure: True Positive Rate (0–1) as a function of sample size (10–60).]
1. A low-powered study makes it hard to find the effects of interest.
poldracklab.org
Sample size in neuroimaging studies
Thanks to Sean David for sharing data. Image: Russ Poldrack
1. A low-powered study makes it hard to find the effects of interest.
poldracklab.org
Power in neuroimaging studies
Assuming a lenient threshold of p < 0.005 uncorrected
Image: Russ Poldrack
1. A low-powered study makes it hard to find the effects of interest.
• Recorded the median power per meta-analysis: the median of these medians is 21%.
Button et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci., 14(5), 365–76.
50% of all neuroscience studies have at most a 1-in-5 chance of replicating!
It has been claimed and demonstrated that many (and possibly most) of the conclusions drawn from biomedical research are probably false1. A central cause for this important problem is that researchers must publish in order to succeed, and publishing is a highly competitive enterprise, with certain kinds of findings more likely to be published than others. Research that produces novel results, statistically significant results (that is, typically p < 0.05) and seemingly 'clean' results is more likely to be published2,3. As a consequence, researchers have strong incentives to engage in research practices that make their findings publishable quickly, even if those practices reduce the likelihood that the findings reflect a true (that is, non-null) effect4. Such practices include using flexible study designs and flexible statistical analyses and running small studies with low statistical power1,5. A simulation of genetic association studies showed that a typical dataset would generate at least one false positive result almost 97% of the time6, and two efforts to replicate promising findings in biomedicine reveal replication rates of 25% or less7,8. Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect at least as much, if not even more so, the most prominent journals9,10.
Here, we focus on one major aspect of the problem: low statistical power. The relationship between study power and the veracity of the resulting finding is underappreciated. Low statistical power (because of low sample size of studies, small effects or both) negatively affects the likelihood that a nominally statistically significant finding actually reflects a true effect. We discuss the problems that arise when low-powered research designs are pervasive. In general, these problems can be divided into two categories. The first concerns problems that are mathematically expected to arise even if the research conducted is otherwise perfect: in other words, when there are no biases that tend to create statistically significant (that is, 'positive') results that are spurious. The second category concerns problems that reflect biases that tend to co-occur with studies of low power or that become worse in small, underpowered studies. We next empirically show that statistical power is typically low in the field of neuroscience by using evidence from a range of subfields within the neuroscience literature. We illustrate that low statistical power is an endemic problem in neuroscience and discuss the implications of this for interpreting the results of individual studies.
Low power in the absence of other biases
Three main problems contribute to producing unreliable findings in studies with low power, even when all other research practices are ideal. They are: the low probability of finding true effects; the low positive predictive value (PPV; see BOX 1 for definitions of key statistical terms) when an effect is claimed; and an exaggerated estimate of the magnitude of the effect when a true effect is discovered. Here, we discuss these problems in more detail.
1School of Experimental Psychology, University of Bristol, Bristol, BS8 1TU, UK. 2School of Social and Community Medicine, University of Bristol, Bristol, BS8 2BN, UK. 3Stanford University School of Medicine, Stanford, California 94305, USA. 4Department of Psychology, University of Virginia, Charlottesville, Virginia 22904, USA. 5Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK. 6School of Physiology and Pharmacology, University of Bristol, Bristol, BS8 1TD, UK. Correspondence to M.R.M. e-mail: [email protected]. doi:10.1038/nrn3475. Published online 10 April 2013. Corrected online 15 April 2013.
Power failure: why small sample size undermines the reliability of neuroscience
Katherine S. Button1,2, John P. A. Ioannidis3, Claire Mokrysz1, Brian A. Nosek4, Jonathan Flint5, Emma S. J. Robinson6 and Marcus R. Munafò1
Abstract | A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.
ANALYSIS
NATURE REVIEWS | NEUROSCIENCE VOLUME 14 | MAY 2013 | 365
© 2013 Macmillan Publishers Limited. All rights reserved
Cited: 880
2. In a low-powered study, most research findings are false.
α = P(FP) = 5%; power = P(TP) = 10%
Pre-study probability of a true effect = 0.2
Post-study probability of a true effect: PPV = 1/3
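The slide's numbers follow from Ioannidis's formula PPV = (1 − β)R / (R − βR + α), where R is the prior odds of a true effect, here R = 0.2/0.8 = 0.25. A quick check (my own sketch):

```python
def ppv(alpha, power, prior_prob):
    """Positive predictive value of a significant finding:
    PPV = (1 - beta) R / (R - beta R + alpha), R = prior odds."""
    R = prior_prob / (1 - prior_prob)
    beta = 1 - power
    return power * R / (R - beta * R + alpha)

print(round(ppv(alpha=0.05, power=0.10, prior_prob=0.2), 3))  # 0.333
```

With α = 5%, power = 10% and a prior probability of 0.2, only one in three "significant" findings is true.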
PLoS Medicine | www.plosmedicine.org 0696
Essay
Open access, freely available online
August 2005 | Volume 2 | Issue 8 | e124
Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.
Modeling the Framework for False Positive Findings
Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. "Negative" research is also very useful. "Negative" is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.
As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of "true relationships" to "no relationships" among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R/(R − βR + α). A research finding is thus
The Essay section contains opinion pieces on topics of broad interest to a general medical audience.
Why Most Published Research Findings Are False
John P. A. Ioannidis
Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
Copyright: © 2005 John P. A. Ioannidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abbreviation: PPV, positive predictive value
John P. A. Ioannidis is in the Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts, United States of America. E-mail: [email protected]
Competing Interests: The author has declared that no competing interests exist.
DOI: 10.1371/journal.pmed.0020124
Summary
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
It can be proven that most claimed research findings are false.
Cited: 3532
2. In a low-powered study, most research findings are false.
[Figure: Positive Predictive Value (0–1) as a function of sample size (10–60).]
2. In a low-powered study, most research findings are false.
Ioannidis (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
PLoS Medicine | www.plosmedicine.org 0699
alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].
These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.
Most Research Findings Are False for Most Research Designs and for Most Fields
In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies where pooling is used to "correct" the low power of single studies, is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable
Box 1. An Example: Science at Low Pre-Study Odds
Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10⁻⁴, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10⁻⁴. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10⁻⁴.
Now let us suppose that the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results. Commercially available "data mining" packages actually are proud of their ability to yield statistically significant results through data dredging. In the presence of bias with u = 0.10, the post-study probability that a research finding is true is only 4.4 × 10⁻⁴. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10⁻⁴, hardly any higher than the probability we had before any of this extensive research was undertaken!
DOI: 10.1371/journal.pmed.0020124.g002
Figure 2. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Numbers of Conducted Studies, n. Panels correspond to power of 0.20, 0.50, and 0.80.
August 2005 | Volume 2 | Issue 8 | e124
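Box 1's arithmetic can be reproduced from the PPV formulas in the Ioannidis essay. A sketch (my own; the bias-adjusted expression is the one the paper gives, PPV = ((1 − β)R + uβR) / (R + α − βR + u − uα + uβR), where u is the fraction of analyses biased toward a positive result):

```python
def ppv_no_bias(R, power, alpha):
    """PPV = (1 - beta) R / (R - beta R + alpha), R = prior odds."""
    beta = 1 - power
    return power * R / (R - beta * R + alpha)

def ppv_with_bias(R, power, alpha, u):
    """Bias-adjusted PPV; u = proportion of analyses that would report
    a 'positive' purely through bias."""
    beta = 1 - power
    num = (1 - beta) * R + u * beta * R
    den = R + alpha - beta * R + u - u * alpha + u * beta * R
    return num / den

R = 10 / 100_000                            # 10 true among 100,000 tested
print(ppv_no_bias(R, 0.60, 0.05))           # ~1.2e-3, the "12 x 10^-4" in Box 1
print(ppv_with_bias(R, 0.60, 0.05, 0.10))   # ~4.4e-4
```

Both numbers match the box: even unbiased, a barely-significant hit in this setting is almost certainly false, and a modest bias u = 0.10 cuts the PPV by a further factor of roughly three.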
PLoS Medicine | www.plosmedicine.org 0699
alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].
These corollaries consider each factor separately, but these factors often infl uence each other. For example, investigators working in fi elds where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fi elds where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientifi c fi eld, further undermining the predictive value of its research fi ndings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a fi eld
is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research fi ndings. Or massive discovery-oriented testing may result in such a large yield of signifi cant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.
Most Research Findings Are False for Most Research Designs and for Most FieldsIn the described framework, a PPV exceeding 50% is quite diffi cult to get. Table 4 provides the results of simulations using the formulas developed for the infl uence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specifi c study designs and settings. A fi nding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is
eventually true about 85% of the time. A fairly similar performance is expected of a confi rmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic fi nding from inconclusive studies where pooling is used to “correct” the low power of single studies, is probably false if R ≤ 1:3. Research fi ndings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in fi ve chance being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable
Box 1. An Example: Science at Low Pre-Study Odds
Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10−4, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10−4. Let us also suppose that the study has 60% power to fi nd an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically signifi cant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10−4.
Now let us suppose that the investigators manipulate their design,
analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specifi ed, changes in the disease or control defi nitions, and various combinations of selective or distorted reporting of the results. Commercially available “data mining” packages actually are proud of their ability to yield statistically signifi cant results through data dredging. In the presence of bias with u = 0.10, the post-study probability that a research fi nding is true is only 4.4 × 10−4. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them fi nds a formally statistically signifi cant association, the probability that the research fi nding is true is only 1.5 × 10−4, hardly any higher than the probability we had before any of this extensive research was undertaken!
DOI: 10.1371/journal.pmed.0020124.g002
Figure 2. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Numbers of Conducted Studies, nPanels correspond to power of 0.20, 0.50, and 0.80.
August 2005 | Volume 2 | Issue 8 | e124
PLoS Medicine | www.plosmedicine.org 0699
alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].
These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.
Most Research Findings Are False for Most Research Designs and for Most Fields
In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies where pooling is used to "correct" the low power of single studies, is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance of being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable
Box 1. An Example: Science at Low Pre-Study Odds
Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10−4, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10−4. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10−4.
Now let us suppose that the investigators manipulate their design,
2. In a low powered study: most research findings are false.
17
RESEARCH ARTICLE SUMMARY
PSYCHOLOGY
Estimating the reproducibility of psychological science. Open Science Collaboration*
INTRODUCTION: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence. Even research of exemplary quality may have irreproducible empirical findings because of random or systematic error.
RATIONALE: There is concern about the rate and predictors of reproducibility, but limited evidence. Potentially problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results. Direct replication is the attempt to recreate the conditions believed sufficient for obtaining a previously observed finding and is the means of establishing reproducibility of a finding with new data. We conducted a large-scale, collaborative effort to obtain an initial estimate of the reproducibility of psychological science.
RESULTS: We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
CONCLUSION: No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise). The latter factors certainly can influence replication success, but they did not appear to do so here.
Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that "we already know this" belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.
SCIENCE sciencemag.org 28 AUGUST 2015 • VOL 349 ISSUE 6251 943
The list of author affiliations is available in the full article online. *Corresponding author. E-mail: [email protected]. Cite this article as Open Science Collaboration, Science 349, aac4716 (2015). DOI: 10.1126/science.aac4716
Original study effect size versus replication effect size (correlation coefficients). Diagonal line represents replication effect size equal to original effect size. Dotted line represents replication effect size of 0. Points below the dotted line were effects in the opposite direction of the original. Density plots are separated by significant (blue) and nonsignificant (red) effects.
ON OUR WEB SITE: Read the full article at http://dx.doi.org/10.1126/science.aac4716
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, 6251.
• Open Science Collaboration (2015)
• 100 replications of experimental and correlational studies
• 36% of replications could reproduce result
• Inability to replicate has several sources:
– P-hacking
– Publication bias
– False positives due to low PPV
3. In a low powered study: winner’s curse.
18
[Figure: "Winner's curse" — estimated effect size (0.50–1.50) versus sample size (10–60).]
3. In a low powered study: winner’s curse.
19
Law of Small Numbers aka "Winner's Curse": small studies over-estimate effect size.
[Figure: effect size (log odds ratio) versus sample size (log of total N in meta-analysis).]
• 256 meta-analyses
– For a binary effect (odds ratio)
– Drawn from the Cochrane database
• Lowest N, biggest effect sizes!
Ioannidis (2008). "Why most discovered true associations are inflated." Epidemiology, 19(5), 640–8.
Power and sample size calculations for fMRI 3
State of the art in Neuroimaging
• Most work on power in neuroimaging: voxelwise BOLD contrasts: – Comparing activation between conditions/groups (task fMRI)
“Which brain region is responsible for understanding language?” – Comparing structural measurements between groups (VBM)
“Comparing patients with Alzheimer's disease with HC's: are there specific regions that are structurally different?”
– Comparing seed-based connectivity maps “When we look at the co-activations (connections) of the brain with the primary visual cortex, is there a difference between patients with psychosis and HC’s?”
– … • Power for (graph theory) connectivity analyses: only open questions
(see later)
21
[Figure: two groups of subjects, A A A and B B B.]
FOR EACH VOXEL: Y = b0 + b1X + ε ⇒ b1, SE(b1), T = b1/SE(b1); group: H0: b1 = 0
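A minimal sketch of that per-voxel GLM, on simulated data (all names, sizes, and effect magnitudes here are illustrative, not from the talk):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sub, n_vox = 30, 1000
group = np.repeat([0.0, 1.0], n_sub // 2)          # A A A ... B B B
Y = rng.normal(size=(n_sub, n_vox))
Y[group == 1, :50] += 0.8                          # plant signal in 50 voxels

X = np.column_stack([np.ones(n_sub), group])       # columns: b0, b1
beta, rss, *_ = np.linalg.lstsq(X, Y, rcond=None)  # fit every voxel at once
df = n_sub - X.shape[1]
se_b1 = np.sqrt((rss / df) * np.linalg.inv(X.T @ X)[1, 1])
T = beta[1] / se_b1                                # one T statistic per voxel
p = 2 * stats.t.sf(np.abs(T), df)                  # test H0: b1 = 0
```

The signal voxels get small p-values at a rate set by the power of the design; the null voxels reject at roughly the α level, which is exactly the multiple testing problem discussed next.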
Applications in Neuroimaging
• Problems for power analysis in fMRI – Multiple testing: different definitions of power – Very complex modeling strategies
Within subject variance, between subject variance, temporal correlation, …
– Spatial inference strategies: Should be reflected in power analysis
– Probably more than one alternative: • Varying Ha: δ1, δ2, δ3, …, δ99,999, δ100,000 (hard part) • Also: σ1, σ2, σ3 … too
22
Mixed Effects fMRI Modeling
Power in group fMRI depends on d and within- & between-subject variability…
2nd-level fMRI model: the first-level contrast estimates ĉβk are stacked as
ĉβk = Xg βg + εg, with βg = (βg1, βg2)
Cov(εg) = Vg = diag{c(Xkᵀ Vk⁻¹ Xk)⁻¹ σk² cᵀ} + σB² IN
(first term: within-subject variability; second term: between-subject variability)
Mumford & Nichols (2008). Power calculation for group fMRI studies accounting for arbitrary design and temporal autocorrelation. NeuroImage, 39(1), 261–8.
Mixed Effects fMRI Modeling
• Requires specifying
– Intra-subject correlation Vk
– Intra-subject variance σk²
– Between-subject variance σB²
– Not to mention Xk, c & d
• But then gives flexibility
– Can consider different designs
• E.g. shorter runs, more subjects.
• Optimize event design for optimal power
Cov(εg) = Vg = diag{c(Xkᵀ Vk⁻¹ Xk)⁻¹ σk² cᵀ} + σB² IN
(within-subject variability + between-subject variability)
[Figure: power (%) as a function of the number of on/off (20 s/10 s) cycles (TR = 2.5) and scan time (0–0.16 hours), with curves for 12–24 subjects.]
Mumford & Nichols (2008). Power calculation for group fMRI studies accounting for arbitrary design and temporal autocorrelation. NeuroImage, 39(1), 261–8.
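The shape of those curves can be sketched with a noncentral t: the group-level standard error combines the within- and between-subject variance components divided by N, and power follows from the noncentral t distribution (a rough sketch of the idea, not Mumford & Nichols' implementation; all numbers illustrative):

```python
import numpy as np
from scipy import stats

def group_power(delta, var_within, var_between, n, alpha=0.001):
    """One-sample group-level power for a contrast of magnitude delta."""
    se = np.sqrt((var_within + var_between) / n)   # SE of the group mean
    t_crit = stats.t.isf(alpha, n - 1)             # uncorrected threshold
    return stats.nct.sf(t_crit, n - 1, delta / se)

for n in (12, 16, 20, 24):
    print(n, round(float(group_power(0.5, 1.0, 0.5, n)), 3))
```

Shortening runs raises var_within while adding subjects lowers the whole term by 1/n, which is why the subject count usually dominates.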
Mixed Effects fMRI Modeling
Toolbox to estimate all this from existing data
– Jeanette Mumford's http://fmripower.org
Works with FSL & SPM!
25
Power based on non-central Random Field Theory
• The previous method ignored spatial correlation
• Power for RFT inference
– Can provide power given a mask of signal
– Or provide maps of 'local power'
– Still voxelwise!
Hayasaka et al. (2007). Power and sample size calculation for neuroimaging studies by non-central random field theory. NeuroImage, 37:721–30.
Power based on non-central Random Field Theory
• PowerMap by Satoru Hayasaka
http://sourceforge.net/projects/powermap
27
Peak power (self promotion)
– Many variables to specify or estimate → robustness?
- Voxelwise: correlated tests + huge multiple testing problem
- Peak inference:
- Peaks are independent
- Huge data reduction, but localizing power
- Takes into account smoothness
- Effect sizes are reported
- ROI and whole-brain
- Simplify model
28
[Figure: two "Activation intensity" maps, X coordinate (left–right) versus Y coordinate (anterior–posterior), intensity scale 0.0–0.7.]
Peak power
Neuropower: www.neuropowertools.org
29
Applications in Neuroimaging
30
fMRIpower / Powermap / Neuropower:
– Inference level: ROI / voxels / peaks
– Alternative hypothesis: average ROI effect / distribution of effect sizes / distribution of effect sizes
– Power definition: power / family-wise power / average power
– Different variance components: all variance components estimated / X / X
– Spatial alternative: X / yes / independence of peaks
– Varying effect sizes: per ROI / yes / yes
– Requires pilot data: yes / yes / yes
A note on pilot data
31
- To estimate effect sizes / variance components / …
- Pilot data = expensive!
- Solution in clinical trials:
- multi-stage design with interim analyses
- correction for peeking
- sample size adaptation
- Our proposal:
- rough estimate based on open data
- after n1 subjects: re-estimate sample size
How about functional connectivity?
32
Smith et al. (2011). Network modeling methods for fMRI. NeuroImage, 54(2).
How about functional connectivity?
33
Smith et al. (2011). Network modeling methods for fMRI. NeuroImage, 54(2).
How about functional connectivity?
34
Fifty Shades of Gray, Matter:Using Bayesian priors to improve the power of whole-brain
voxel- and connexelwise inferences
Krzysztof J. Gorgolewski∗, Pierre-Louis Bazin†, Haakon Engen‡, Daniel S. Margulies∗
∗Max Planck Research Group: Neuroanatomy and Connectivity
†Department of Neurophysics
‡Department of Social Neuroscience
Max Planck Institute for Human Cognitive and Brain Sciences
Leipzig, Germany
Abstract—To increase the power of neuroimaging analyses, it is common practice to reduce the whole-brain search space to a subset of hypothesis-driven regions-of-interest (ROIs). Rather than strictly constrain analyses, we propose to incorporate prior knowledge using probabilistic ROIs (pROIs) using a hierarchical Bayesian framework. Each voxel's prior probability of being “of-interest” or “of-non-interest” is used to perform a weighted fit of a mixture model. We demonstrate the utility of this approach through simulations with various pROIs, and the applicability using a prior based on the NeuroSynth database search term “emotion” for thresholding the fMRI results of an emotion processing task. The modular structure of pROI correction facilitates the inclusion of other innovations in Bayesian mixture modeling, and offers a foundation for balancing between exploratory analyses without neglecting prior knowledge.
Keywords: inference; Bayesian inference; fMRI priors; mixture models
I. INTRODUCTION
Many of our neuroimaging studies begin with a region-specific hypothesis, and yet we conduct whole-brain voxelwise analyses to explore the entire brain for potential signal. While whole-brain analyses tolerably decrease signal-to-noise (SNR), when we move to the space of voxelwise connections, or connexels [1], the quadratic decrease in SNR for full exploration becomes prohibitive (Fig. 1). Effect sizes which were sufficient in voxelwise analyses cannot be used to distinguish signal from noise in the connexel space. Low SNR in connectivity analyses has traditionally been compensated for by using different kinds of region-of-interest (ROI) selection (“seed-based”) or data reduction approaches. The search space is thus restricted to only the connections from an a priori ROI drawn on the connectivity matrix. However, this approach completely discards the information outside the ROI, thus excluding even strong signal that does not fall within it. Additionally, analyses using ROI masks are highly sensitive to their size and shape criteria, and do not convey information about the uncertainty of their borders. To balance these interests, we propose a novel approach for performing inference using a two-level hierarchical mixture model. Instead of using a binary ROI mask we propose a prior probability map (“probabilistic ROI” or pROI) with values ranging from 0 to 1. This map is used to set mixing parameters for a probabilistic mixture model distinguishing between voxels-of-interest and those of non-interest (Fig. 2A). In a second level of this hierarchical model, voxels-/connexels-of-interest are modeled as two Gaussian distributions: noise and signal (Fig. 2B). The result of this two-level hierarchical model is inference that incorporates non-binary prior knowledge in the form of pROIs.
Figure 1. Schematic illustration of whole-brain voxel and connexel analyses. Left: Symbolic representation of flattened voxels (non-active (black) and active (red)). Right: Connectivity matrix where each point corresponds to one connexel. Below: The relation between SNR in the voxelwise (SNRv) and connexelwise (SNRc) cases, where n stands for the number of voxels.
II. METHODS
We propose to formally incorporate prior knowledge into the inference process by using a Bayesian framework. The prior informs the search area, which in turn is subdivided
2013 3rd International Workshop on Pattern Recognition in Neuroimaging
978-0-7695-5061-9/13 $26.00 © 2013 IEEE. DOI 10.1109/PRNI.2013.57
Gorgolewski et al. (2013). Fifty shades of Gray, Matter: Using Bayesian priors to improve the power of whole-brain voxel- and connexelwise inference. Proceedings of the 3rd International Workshop on Pattern Recognition in Neuroimaging.
1. Correlation matrix between all voxels: (multiple testing problem)²
Difficult problem for inference = difficult problem for power calculations
2. Cluster voxels in a meaningful way: atlas-based, function-based, data-driven, …
Still huge multiple testing problem
Throw away data
3. Summarize connectivity in global metrics:
Inference = subjectwise comparison of metrics
Power calculation = classical power analysis (G*Power, ...)
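For option 3 the calculation is the classical one that tools like G*Power perform; a sketch for a two-sample comparison of some global metric between groups (the effect size d = 0.5 is an assumption for illustration, not from the source):

```python
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test at standardized effect d."""
    df = 2 * n_per_group - 2
    nc = d * np.sqrt(n_per_group / 2)              # noncentrality parameter
    t_crit = stats.t.isf(alpha / 2, df)
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

def sample_size(d, power=0.80, alpha=0.05):
    """Smallest per-group n reaching the target power."""
    n = 2
    while two_sample_power(d, n, alpha) < power:
        n += 1
    return n

print(sample_size(0.5))    # ~64 per group for a medium effect
```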
Other variables affecting power and reproducibility 4
Questionable research practices
36
• Sloppy or nonexistent analysis protocols
– You stop when you get the result you expect
– These "vibrations" can only lead to inflated false positives
• Afflicts well-intended researchers
– Multitude of preprocessing/modelling choices
• Linear vs. non-linear alignment
• Canonical HRF? Derivatives? FLOBS?
“Try voxel-wise whole brain, then cluster-wise, then if not getting good results, look for subjects with bad movement, if still nothing, maybe try a global signal regressor; if still nothing do SVC for frontal lobe, if not, then try DLPFC (probably only right side), if still nothing, will look in literature for xyz coordinates near my activation, use spherical SVC… surely that’ll work!”
Does your lab have written protocols?
Design efficiency
37
signal.
Although this design type might seem suboptimal because of the slow transitions between conditions, for certain psychological experiments this method is appropriate. This is illustrated in one of the studies used in this master dissertation, the study of Van Opstal et al. (2008). In that study different stages in a learning process were compared. The study is explained further in more detail (see 1.4.1).
[Figure panels: Blocked Design (1), Blocked Design (2), Event-Related Design; y-axes: Condition / Convolved Condition, switching between OFF and ON.]
Figure 3: Trial course for 2 blocked and event-related designs.
Event-related design. The basic principle underlying event-related fMRI designs is very simple: one looks for stimulus-induced activity. Here the stimulus can be seen as e.g. the onset of a condition or the beginning of a task. The distinctive feature of this design is that it permits relating changes in the BOLD signal to specific stimuli/trials/events, contrary to the blocked designs where the activation is related to a longer time period. Since the relation between the stimulus and the BOLD signal is known or can be estimated, one can declare regions or voxels active if there is increased activity at moments predicted to be active. This is what is described in the right panel of Figure 3. In the lower right panel the convolved consequence of the trials is shown.
As mentioned before, fMRI offers more temporal flexibility (albeit not unlimited) than PET. Additionally, with fMRI a more precise estimation of the HRF can be obtained. Although this is considered a major advantage (Amaro & Barker, 2006), it is not used in the major part of cognitive neuroscience studies (Lindquist, Loh, Atlas, & Wager, 2009). From a pure psychological perspective the gain in temporal flexibility is probably
Design efficiency
38
• Yi = Xiβ + εi
• Var(β̂) = σ²(X′X)⁻¹ → variance proportional to diag[(X′X)⁻¹] → standardised effect size dependent on design
• Read more:
– Smith et al. (2007). Meaningful design and contrast estimability in fMRI. NeuroImage, 34, 127–136.
– Wager & Nichols (2003). Optimization of experimental design in fMRI: a general framework using a genetic algorithm. NeuroImage, 18, 293–309.
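A toy demonstration of this design dependence: the efficiency of a contrast, 1/(c(X′X)⁻¹cᵀ), for a blocked versus a fast alternating design. The HRF parameters and the two designs below are made up for illustration; the HRF's low-pass filtering is what makes the fast design inefficient.

```python
import numpy as np
from scipy import stats

def hrf(t):
    """Toy double-gamma HRF (parameters are illustrative, not SPM's)."""
    return stats.gamma.pdf(t, 6) - stats.gamma.pdf(t, 16) / 6

def efficiency(box, c, tr=2.0):
    """1 / (c (X'X)^-1 c') for an HRF-convolved boxcar regressor."""
    x = np.convolve(box, hrf(np.arange(0, 32, tr)))[: len(box)]
    X = np.column_stack([np.ones(len(x)), x])       # intercept + condition
    c = np.asarray(c, float)
    return 1.0 / (c @ np.linalg.inv(X.T @ X) @ c)

T = 160                                             # scans at TR = 2 s
blocked = np.tile(np.repeat([0.0, 1.0], 8), T // 16)   # 16 s on/off blocks
rapid = np.tile([0.0, 1.0], T // 2)                    # flip every scan
c = [0, 1]                                             # condition effect
print(efficiency(blocked, c), efficiency(rapid, c))    # blocked wins here
```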
Power in bayesian analyses
39
• Bayesian analysis: mixture of H0 and Ha
• BF a measure of both FPR and power
• Still: a cutoff can be found more easily if the distance between H0 and Ha is maximal.
• Same for any kind of analysis:
– Different analyses → different cutoffs
– Same data!
– Goal: maximal separation between true and false effects
Two-sided testing
40
• 1 vs. 2 sided testing should reflect the theoretical hypothesis
– With a directional hypothesis: one-sided testing!!
– Stats = confirmatory
– Same false positive rate, better true positive rate
– If you find the opposite effect: more exploratory, no significance testing!
– Why allow FPRs for an effect you're not even looking for?
• Read more:
– Cho and Abe (2013). Is two-tailed testing for directional research hypotheses tests legitimate? Journal of Business Research, 66, 9.
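The "same false positive rate, better true positive rate" claim is easy to check with a normal approximation (the effect size and n are illustrative):

```python
from scipy import stats

def power(shift, alpha=0.05, two_sided=False):
    """Power to detect a positive mean shift delta*sqrt(n), normal approx."""
    z = stats.norm.isf(alpha / 2 if two_sided else alpha)
    return stats.norm.sf(z - shift)

shift = 0.5 * 30 ** 0.5               # e.g. d = 0.5, n = 30
print(power(shift))                   # one-sided, ~0.86
print(power(shift, two_sided=True))   # two-sided, ~0.78 in the same direction
```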
Conclusion
41
• Low power has detrimental effects on science
– Inability to find true effects
– Most findings are false
– Effects are overestimated
• Specific details of fMRI require specialised approaches.
• Solutions for voxelwise analyses
• Mostly an open question for connectivity analyses
Thank you!