Joke Durnez, Poldrack Lab
Department of Psychology, Stanford University
Power and Sample Size Calculations in fMRI
Thanks to Tom Nichols & Russ Poldrack for slides! SAMSI CCNS 2016
OVERVIEW
1. Statistical power and sample size calculations
2. Power in neuroscience: why bother?
3. Power and sample size calculations for fMRI
4. Other variables affecting power and reproducibility
1. Statistical power and sample size calculations
[Figure: probability density over the set of possible results. Two curves are shown: the density function when H0 is true and the density function when Ha is true, with an observed data point, the threshold for activation, and the α-level marked.]
[Figure: the same two density functions, with the threshold for activation and the area 1 − β (power) marked.]
α: probability of rejecting H0 when H0 is true (false positive)
Power (1 − β): probability of rejecting H0 when Ha is true (true positive)
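These two definitions can be checked by simulation. A minimal sketch (my own illustration, not from the slides): draw normal samples under H0 (mean 0) and under Ha (here mean 0.5), with σ = 1 assumed known, and count one-sided rejections at α = 0.05.

```python
import random
from statistics import mean

random.seed(0)

def rejection_rate(mu, n=20, reps=2000, z_crit=1.645):
    """Fraction of simulated studies whose z statistic exceeds z_crit.
    With mu = 0 this estimates alpha; with mu > 0 it estimates power."""
    hits = 0
    for _ in range(reps):
        sample = [random.gauss(mu, 1.0) for _ in range(n)]
        z = mean(sample) / (1.0 / n ** 0.5)  # sigma = 1 assumed known
        hits += z > z_crit
    return hits / reps

print(rejection_rate(0.0))  # close to alpha = 0.05
print(rejection_rate(0.5))  # power: close to 1 - Phi(1.645 - 0.5*sqrt(20))
```

The rejection rate under H0 hovers around 0.05 by construction; under Ha it depends on the effect size and n, which is the subject of the next slides.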
Statistical power
[Figure: the same effect on two measurement scales, and the resulting test statistic.]
Example 1: μ1 = 1000, μ2 = 1020, σ = 100
Example 2: μ1 = 1, μ2 = 2, σ = 5
Writing μ for the mean difference, both give the same standardized effect size Δ = μ/σ = 0.2.
The expected test statistic grows with sample size: E(T) = μ / (σ/√n) = Δ√n, shown for n = 10, 50, 100.
Data Units, Effect Sizes, Statistics
Standardized effect size Δ = 0.2:
n = 10 → power = 0.14
n = 50 → power = 0.41
n = 100 → power = 0.64
[Figure: the distribution of T under Ha shifts right as n grows, so more of it falls above the threshold.]
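These power values can be reproduced approximately with the normal approximation E(T) = Δ√n and a one-sided α = 0.05 test. A sketch (my own, stdlib only; the slide's n = 10 value likely comes from the noncentral t distribution, so it differs slightly from the normal approximation):

```python
from statistics import NormalDist

def approx_power(delta, n, alpha=0.05):
    """One-sided power under the normal approximation:
    power = 1 - Phi(z_{1-alpha} - delta * sqrt(n))."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - delta * n ** 0.5)

for n in (10, 50, 100):
    print(n, round(approx_power(0.2, n), 2))  # approx 0.16, 0.41, 0.64
```

The n = 50 and n = 100 values match the slide; n = 10 comes out slightly higher (≈0.16 vs 0.14) because the t correction matters most at small n.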
Sample Size and Power
[Figure: the null distribution (H0) with critical value cα, and the alternative distribution (Ha) centered at E(T | Ha) = Δ√n, with c1−β marked.]
For the test to reach power 1 − β, the distance between the two distributions must satisfy
Δ√n = zα + zβ
which gives the required sample size
n = ( (zα + zβ) / (μ/σ) )²   (1)
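Equation (1) inverts directly into a sample-size calculator. A sketch (my own illustration) for a one-sided test, where zα and zβ are the upper-α and upper-β normal quantiles:

```python
import math
from statistics import NormalDist

def required_n(delta, alpha=0.05, power=0.80):
    """n = ((z_alpha + z_beta) / delta)^2, rounded up; delta = mu/sigma."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)   # z_beta = z_{1 - beta}
    return math.ceil(((z_alpha + z_beta) / delta) ** 2)

print(required_n(0.2))  # 155 subjects for 80% power at delta = 0.2
```

For the slide's Δ = 0.2 this gives 155 subjects, which is why small standardized effects demand large samples.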
Sample Size Calculations
2. Power: why bother?
[Figure: a simulated activation map plus simulated Gaussian noise yields, for n = 15, a simulated T-map.]
Why bother about power?
1. A low-powered study makes it hard to find the effects of interest.
[Figure: True Positive Rate (0–1) as a function of sample size (10–60).]
1. A low-powered study makes it hard to find the effects of interest.
poldracklab.org
Sample size in neuroimaging studies
Thanks to Sean David for sharing data. Image: Russ Poldrack
1. A low-powered study makes it hard to find the effects of interest.
poldracklab.org
Power in neuroimaging studies
Assuming a lenient threshold of p < 0.005 uncorrected
Image: Russ Poldrack
1. A low-powered study makes it hard to find the effects of interest.
• Recorded the median power per meta-analysis: the median of these medians is 21%.
Button et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci., 14(5), 365–76.
50% of all neuroscience studies have at most a 1-in-5 chance of replicating!
It has been claimed and demonstrated that many (and possibly most) of the conclusions drawn from biomedical research are probably false1. A central cause for this important problem is that researchers must publish in order to succeed, and publishing is a highly competitive enterprise, with certain kinds of findings more likely to be published than others. Research that produces novel results, statistically significant results (that is, typically p < 0.05) and seemingly 'clean' results is more likely to be published2,3. As a consequence, researchers have strong incentives to engage in research practices that make their findings publishable quickly, even if those practices reduce the likelihood that the findings reflect a true (that is, non-null) effect4. Such practices include using flexible study designs and flexible statistical analyses and running small studies with low statistical power1,5. A simulation of genetic association studies showed that a typical dataset would generate at least one false positive result almost 97% of the time6, and two efforts to replicate promising findings in biomedicine reveal replication rates of 25% or less7,8. Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect at least as much, if not even more so, the most prominent journals9,10.
Here, we focus on one major aspect of the problem: low statistical power. The relationship between study power and the veracity of the resulting finding is underappreciated. Low statistical power (because of low sample size of studies, small effects or both) negatively affects the likelihood that a nominally statistically significant finding actually reflects a true effect. We discuss the problems that arise when low-powered research designs are pervasive. In general, these problems can be divided into two categories. The first concerns problems that are mathematically expected to arise even if the research conducted is otherwise perfect: in other words, when there are no biases that tend to create statistically significant (that is, 'positive') results that are spurious. The second category concerns problems that reflect biases that tend to co-occur with studies of low power or that become worse in small, underpowered studies. We next empirically show that statistical power is typically low in the field of neuroscience by using evidence from a range of subfields within the neuroscience literature. We illustrate that low statistical power is an endemic problem in neuroscience and discuss the implications of this for interpreting the results of individual studies.
Low power in the absence of other biases
Three main problems contribute to producing unreliable findings in studies with low power, even when all other research practices are ideal. They are: the low probability of finding true effects; the low positive predictive value (PPV; see BOX 1 for definitions of key statistical terms) when an effect is claimed; and an exaggerated estimate of the magnitude of the effect when a true effect is discovered. Here, we discuss these problems in more detail.
1School of Experimental Psychology, University of Bristol, Bristol, BS8 1TU, UK. 2School of Social and Community Medicine, University of Bristol, Bristol, BS8 2BN, UK. 3Stanford University School of Medicine, Stanford, California 94305, USA. 4Department of Psychology, University of Virginia, Charlottesville, Virginia 22904, USA. 5Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK. 6School of Physiology and Pharmacology, University of Bristol, Bristol, BS8 1TD, UK. Correspondence to M.R.M. e-mail: [email protected]. doi:10.1038/nrn3475. Published online 10 April 2013. Corrected online 15 April 2013.
Power failure: why small sample size undermines the reliability of neuroscience
Katherine S. Button1,2, John P. A. Ioannidis3, Claire Mokrysz1, Brian A. Nosek4, Jonathan Flint5, Emma S. J. Robinson6 and Marcus R. Munafò1
Abstract | A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.
ANALYSIS
NATURE REVIEWS | NEUROSCIENCE VOLUME 14 | MAY 2013 | 365
© 2013 Macmillan Publishers Limited. All rights reserved
Cited: 880
2. In a low-powered study, most research findings are false.
α = P(FP) = 5%; power = P(TP) = 10%
Pre-study probability of a true effect = 0.2
Post-study probability of a true effect: PPV = 1/3
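The slide's numbers follow from Ioannidis's formula PPV = (1 − β)R / (R − βR + α), where R is the prior odds of a true effect, here R = 0.2/0.8 = 0.25. A quick check (my own sketch):

```python
def ppv(alpha, power, prior_prob):
    """Positive predictive value of a significant finding:
    PPV = (1 - beta) R / (R - beta R + alpha), R = prior odds."""
    R = prior_prob / (1 - prior_prob)
    beta = 1 - power
    return power * R / (R - beta * R + alpha)

print(round(ppv(alpha=0.05, power=0.10, prior_prob=0.2), 3))  # 0.333
```

With α = 5%, power = 10% and a prior probability of 0.2, only one in three "significant" findings is true.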
PLoS Medicine | www.plosmedicine.org 0696
Essay
Open access, freely available online
August 2005 | Volume 2 | Issue 8 | e124
Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.
Modeling the Framework for False Positive Findings
Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. "Negative" research is also very useful. "Negative" is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.
As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of "true relationships" to "no relationships" among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R/(R − βR + α). A research finding is thus
The Essay section contains opinion pieces on topics of broad interest to a general medical audience.
Why Most Published Research Findings Are False
John P. A. Ioannidis
Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
Copyright: © 2005 John P. A. Ioannidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abbreviation: PPV, positive predictive value
John P. A. Ioannidis is in the Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts, United States of America. E-mail: [email protected]
Competing Interests: The author has declared that no competing interests exist.
DOI: 10.1371/journal.pmed.0020124
Summary
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
It can be proven that most claimed research findings are false.
Cited: 3532
2. In a low-powered study, most research findings are false.
[Figure: Positive Predictive Value (0–1) as a function of sample size (10–60).]
2. In a low-powered study, most research findings are false.
Ioannidis (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
PLoS Medicine | www.plosmedicine.org 0699
alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].
These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.
Most Research Findings Are False for Most Research Designs and for Most Fields
In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies where pooling is used to "correct" the low power of single studies, is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable
Box 1. An Example: Science at Low Pre-Study Odds
Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10⁻⁴, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10⁻⁴. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10⁻⁴.
Now let us suppose that the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results. Commercially available "data mining" packages actually are proud of their ability to yield statistically significant results through data dredging. In the presence of bias with u = 0.10, the post-study probability that a research finding is true is only 4.4 × 10⁻⁴. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10⁻⁴, hardly any higher than the probability we had before any of this extensive research was undertaken!
DOI: 10.1371/journal.pmed.0020124.g002
Figure 2. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Numbers of Conducted Studies, n. Panels correspond to power of 0.20, 0.50, and 0.80.
August 2005 | Volume 2 | Issue 8 | e124
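Box 1's arithmetic can be reproduced from the PPV formulas in the Ioannidis essay. A sketch (my own; the bias-adjusted expression is the one the paper gives, PPV = ((1 − β)R + uβR) / (R + α − βR + u − uα + uβR), where u is the fraction of analyses biased toward a positive result):

```python
def ppv_no_bias(R, power, alpha):
    """PPV = (1 - beta) R / (R - beta R + alpha), R = prior odds."""
    beta = 1 - power
    return power * R / (R - beta * R + alpha)

def ppv_with_bias(R, power, alpha, u):
    """Bias-adjusted PPV; u = proportion of analyses that would report
    a 'positive' purely through bias."""
    beta = 1 - power
    num = (1 - beta) * R + u * beta * R
    den = R + alpha - beta * R + u - u * alpha + u * beta * R
    return num / den

R = 10 / 100_000                            # 10 true among 100,000 tested
print(ppv_no_bias(R, 0.60, 0.05))           # ~1.2e-3, the "12 x 10^-4" in Box 1
print(ppv_with_bias(R, 0.60, 0.05, 0.10))   # ~4.4e-4
```

Both numbers match the box: even unbiased, a barely-significant hit in this setting is almost certainly false, and a modest bias u = 0.10 cuts the PPV by a further factor of roughly three.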
PLoS Medicine | www.plosmedicine.org 0699
alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].
These corollaries consider each factor separately, but these factors often infl uence each other. For example, investigators working in fi elds where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fi elds where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientifi c fi eld, further undermining the predictive value of its research fi ndings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a fi eld
is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research fi ndings. Or massive discovery-oriented testing may result in such a large yield of signifi cant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.
Most Research Findings Are False for Most Research Designs and for Most FieldsIn the described framework, a PPV exceeding 50% is quite diffi cult to get. Table 4 provides the results of simulations using the formulas developed for the infl uence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specifi c study designs and settings. A fi nding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is
eventually true about 85% of the time. A fairly similar performance is expected of a confi rmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic fi nding from inconclusive studies where pooling is used to “correct” the low power of single studies, is probably false if R ≤ 1:3. Research fi ndings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in fi ve chance being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable
Box 1. An Example: Science at Low Pre-Study Odds
Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10−4, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10−4. Let us also suppose that the study has 60% power to fi nd an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically signifi cant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10−4.
Now let us suppose that the investigators manipulate their design,
analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specifi ed, changes in the disease or control defi nitions, and various combinations of selective or distorted reporting of the results. Commercially available “data mining” packages actually are proud of their ability to yield statistically signifi cant results through data dredging. In the presence of bias with u = 0.10, the post-study probability that a research fi nding is true is only 4.4 × 10−4. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them fi nds a formally statistically signifi cant association, the probability that the research fi nding is true is only 1.5 × 10−4, hardly any higher than the probability we had before any of this extensive research was undertaken!
DOI: 10.1371/journal.pmed.0020124.g002
Figure 2. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Numbers of Conducted Studies, nPanels correspond to power of 0.20, 0.50, and 0.80.
August 2005 | Volume 2 | Issue 8 | e124
PLoS Medicine | www.plosmedicine.org 0699
alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].
These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.
Most Research Findings Are False for Most Research Designs and for Most Fields
In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies where pooling is used to "correct" the low power of single studies, is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance of being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable
Box 1. An Example: Science at Low Pre-Study Odds
Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10−4, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10−4. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10−4.
Now let us suppose that the investigators manipulate their design,
2. In a low powered study: most research findings are false.
17
RESEARCH ARTICLE SUMMARY
PSYCHOLOGY
Estimating the reproducibility of psychological science. Open Science Collaboration*
INTRODUCTION: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence. Even research of exemplary quality may have irreproducible empirical findings because of random or systematic error.
RATIONALE: There is concern about the rate and predictors of reproducibility, but limited evidence. Potentially problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results. Direct replication is the attempt to recreate the conditions believed sufficient for obtaining a previously observed finding and is the means of establishing reproducibility of a finding with new data. We conducted a large-scale, collaborative effort to obtain an initial estimate of the reproducibility of psychological science.
RESULTS: We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
CONCLUSION: No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise). The latter factors certainly can influence replication success, but they did not appear to do so here.
Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that "we already know this" belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.
SCIENCE sciencemag.org 28 AUGUST 2015 • VOL 349 ISSUE 6251 943
The list of author affiliations is available in the full article online. *Corresponding author. E-mail: [email protected]. Cite this article as Open Science Collaboration, Science 349, aac4716 (2015). DOI: 10.1126/science.aac4716
Original study effect size versus replication effect size (correlation coefficients). Diagonal line represents replication effect size equal to original effect size. Dotted line represents replication effect size of 0. Points below the dotted line were effects in the opposite direction of the original. Density plots are separated by significant (blue) and nonsignificant (red) effects.
ON OUR WEB SITE: Read the full article at http://dx.doi.org/10.1126/science.aac4716
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, 6251.
• Open Science Collaboration (2015)
• 100 replications of experimental and correlational studies
• 36% of replications could reproduce result
• Inability to replicate has several sources:
– P-hacking
– Publication bias
– False positives due to low PPV
3. In a low powered study: winner’s curse.
18
[Figure: "Winner's curse" — estimated effect size (0.50–1.50) versus sample size (10–60).]
3. In a low powered study: winner’s curse.
19
Law of Small Numbers aka "Winner's Curse": small studies over-estimate effect size.
[Figure: effect size (log odds ratio) versus sample size (log of total N in meta-analysis).]
• 256 meta-analyses
– For a binary effect (odds ratio)
– Drawn from the Cochrane database
• Lowest N, biggest effect sizes!
Ioannidis (2008). "Why most discovered true associations are inflated." Epidemiology, 19(5), 640–8.
Power and sample size calculations for fMRI 3
State of the art in Neuroimaging
• Most work on power in neuroimaging: voxelwise BOLD contrasts: – Comparing activation between conditions/groups (task fMRI)
“Which brain region is responsible for understanding language?” – Comparing structural measurements between groups (VBM)
“Comparing patients with Alzheimer's disease with HC's: are there specific regions that are structurally different?”
– Comparing seed-based connectivity maps “When we look at the co-activations (connections) of the brain with the primary visual cortex, is there a difference between patients with psychosis and HC’s?”
– … • Power for (graph theory) connectivity analyses: only open questions
(see later)
21
[Figure: two groups of subjects, A A A and B B B.]
FOR EACH VOXEL: Y = b0 + b1X + ε ⇒ b1, SE(b1), T = b1/SE(b1); group: H0: b1 = 0
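A minimal sketch of that per-voxel GLM, on simulated data (all names, sizes, and effect magnitudes here are illustrative, not from the talk):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sub, n_vox = 30, 1000
group = np.repeat([0.0, 1.0], n_sub // 2)          # A A A ... B B B
Y = rng.normal(size=(n_sub, n_vox))
Y[group == 1, :50] += 0.8                          # plant signal in 50 voxels

X = np.column_stack([np.ones(n_sub), group])       # columns: b0, b1
beta, rss, *_ = np.linalg.lstsq(X, Y, rcond=None)  # fit every voxel at once
df = n_sub - X.shape[1]
se_b1 = np.sqrt((rss / df) * np.linalg.inv(X.T @ X)[1, 1])
T = beta[1] / se_b1                                # one T statistic per voxel
p = 2 * stats.t.sf(np.abs(T), df)                  # test H0: b1 = 0
```

The signal voxels get small p-values at a rate set by the power of the design; the null voxels reject at roughly the α level, which is exactly the multiple testing problem discussed next.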
Applications in Neuroimaging
• Problems for power analysis in fMRI – Multiple testing: different definitions of power – Very complex modeling strategies
Within subject variance, between subject variance, temporal correlation, …
– Spatial inference strategies: Should be reflected in power analysis
– Probably more than one alternative: • Varying Ha: δ1, δ2, δ3, …, δ99,999, δ100,000 (hard part) • Also: σ1, σ2, σ3 … too
22
Mixed Effects fMRI Modeling
Power in group fMRI depends on d and within- & between-subject variability…
2nd-level fMRI model: the first-level contrast estimates ĉβk are stacked as
ĉβk = Xg βg + εg, with βg = (βg1, βg2)
Cov(εg) = Vg = diag{c(Xkᵀ Vk⁻¹ Xk)⁻¹ σk² cᵀ} + σB² IN
(first term: within-subject variability; second term: between-subject variability)
Mumford & Nichols (2008). Power calculation for group fMRI studies accounting for arbitrary design and temporal autocorrelation. NeuroImage, 39(1), 261–8.
Mixed Effects fMRI Modeling
• Requires specifying
– Intra-subject correlation Vk
– Intra-subject variance σk²
– Between-subject variance σB²
– Not to mention Xk, c & d
• But then gives flexibility
– Can consider different designs
• E.g. shorter runs, more subjects.
• Optimize event design for optimal power
Cov(εg) = Vg = diag{c(Xkᵀ Vk⁻¹ Xk)⁻¹ σk² cᵀ} + σB² IN
(within-subject variability + between-subject variability)
[Figure: power (%) as a function of the number of on/off (20 s/10 s) cycles (TR = 2.5) and scan time (0–0.16 hours), with curves for 12–24 subjects.]
Mumford & Nichols (2008). Power calculation for group fMRI studies accounting for arbitrary design and temporal autocorrelation. NeuroImage, 39(1), 261–8.
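The shape of those curves can be sketched with a noncentral t: the group-level standard error combines the within- and between-subject variance components divided by N, and power follows from the noncentral t distribution (a rough sketch of the idea, not Mumford & Nichols' implementation; all numbers illustrative):

```python
import numpy as np
from scipy import stats

def group_power(delta, var_within, var_between, n, alpha=0.001):
    """One-sample group-level power for a contrast of magnitude delta."""
    se = np.sqrt((var_within + var_between) / n)   # SE of the group mean
    t_crit = stats.t.isf(alpha, n - 1)             # uncorrected threshold
    return stats.nct.sf(t_crit, n - 1, delta / se)

for n in (12, 16, 20, 24):
    print(n, round(float(group_power(0.5, 1.0, 0.5, n)), 3))
```

Shortening runs raises var_within while adding subjects lowers the whole term by 1/n, which is why the subject count usually dominates.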
Mixed Effects fMRI Modeling
Toolbox to estimate all this from existing data
– Jeanette Mumford's http://fmripower.org
Works with FSL & SPM!
25
Power based on non-central Random Field Theory
• The previous method ignored spatial correlation
• Power for RFT inference
– Can provide power given a mask of signal
– Or provide maps of 'local power'
– Still voxelwise!
Hayasaka et al. (2007). Power and sample size calculation for neuroimaging studies by non-central random field theory. NeuroImage, 37:721–30.
Power based on non-central Random Field Theory
• PowerMap by Satoru Hayasaka
http://sourceforge.net/projects/powermap
27
Peak power (self promotion)
– Many variables to specify or estimate → robustness?
- Voxelwise: correlated tests + huge multiple testing problem
- Peak inference:
- Peaks are independent
- Huge data reduction, but localizing power
- Takes into account smoothness
- Effect sizes are reported
- ROI and whole-brain
- Simplify model
28
[Figure: two "Activation intensity" maps, X coordinate (left–right) versus Y coordinate (anterior–posterior), intensity scale 0.0–0.7.]
Peak power
Neuropower: www.neuropowertools.org
29
Applications in Neuroimaging
30
fMRIpower / Powermap / Neuropower:
– Inference level: ROI / voxels / peaks
– Alternative hypothesis: average ROI effect / distribution of effect sizes / distribution of effect sizes
– Power definition: power / family-wise power / average power
– Different variance components: all variance components estimated / X / X
– Spatial alternative: X / yes / independence of peaks
– Varying effect sizes: per ROI / yes / yes
– Requires pilot data: yes / yes / yes
A note on pilot data
31
- To estimate effect sizes / variance components / …
- Pilot data = expensive!
- Solution in clinical trials:
- multi-stage design with interim analyses
- correction for peeking
- sample size adaptation
- Our proposal:
- rough estimate based on open data
- after n1 subjects: re-estimate sample size
How about functional connectivity?
32
Smith et al. (2011). Network modeling methods for fMRI. NeuroImage, 54(2).
How about functional connectivity?
33
Smith et al. (2011). Network modeling methods for fMRI. NeuroImage, 54(2).
How about functional connectivity?
34
Fifty Shades of Gray, Matter:Using Bayesian priors to improve the power of whole-brain
voxel- and connexelwise inferences
Krzysztof J. Gorgolewski∗, Pierre-Louis Bazin†, Haakon Engen‡, Daniel S. Margulies∗
∗Max Planck Research Group: Neuroanatomy and Connectivity
†Department of Neurophysics
‡Department of Social Neuroscience
Max Planck Institute for Human Cognitive and Brain Sciences
Leipzig, Germany
Abstract—To increase the power of neuroimaging analyses, it is common practice to reduce the whole-brain search space to a subset of hypothesis-driven regions-of-interest (ROIs). Rather than strictly constrain analyses, we propose to incorporate prior knowledge using probabilistic ROIs (pROIs) using a hierarchical Bayesian framework. Each voxel's prior probability of being “of-interest” or “of-non-interest” is used to perform a weighted fit of a mixture model. We demonstrate the utility of this approach through simulations with various pROIs, and the applicability using a prior based on the NeuroSynth database search term “emotion” for thresholding the fMRI results of an emotion processing task. The modular structure of pROI correction facilitates the inclusion of other innovations in Bayesian mixture modeling, and offers a foundation for balancing between exploratory analyses without neglecting prior knowledge.
Keywords: inference; Bayesian inference; fMRI priors; mixture models
I. INTRODUCTION
Many of our neuroimaging studies begin with a region-specific hypothesis, and yet we conduct whole-brain voxelwise analyses to explore the entire brain for potential signal. While whole-brain analyses tolerably decrease signal-to-noise (SNR), when we move to the space of voxelwise connections, or connexels [1], the quadratic decrease in SNR for full exploration becomes prohibitive (Fig. 1). Effect sizes which were sufficient in voxelwise analyses cannot be used to distinguish signal from noise in the connexel space. Low SNR in connectivity analyses has traditionally been compensated for by using different kinds of region-of-interest (ROI) selection (“seed-based”) or data reduction approaches. The search space is thus restricted to only the connections from an a priori ROI drawn on the connectivity matrix. However, this approach completely discards the information outside the ROI, thus excluding even strong signal that does not fall within it. Additionally, analyses using ROI masks are highly sensitive to their size and shape criteria, and do not convey information about the uncertainty of their borders. To balance these interests, we propose a novel approach for performing inference using a two-level hierarchical mixture model. Instead of using a binary ROI mask we propose a prior probability map (“probabilistic ROI” or pROI) with values ranging from 0 to 1. This map is used to set mixing parameters for a probabilistic mixture model distinguishing between voxels-of-interest and those of non-interest (Fig. 2A). In a second level of this hierarchical model, voxels-/connexels-of-interest are modeled as two Gaussian distributions: noise and signal (Fig. 2B). The result of this two-level hierarchical model is inference that incorporates non-binary prior knowledge in the form of pROIs.
Figure 1. Schematic illustration of whole-brain voxel and connexel analyses. Left: Symbolic representation of flattened voxels (non-active (black) and active (red)). Right: Connectivity matrix where each point corresponds to one connexel. Below: The relation between SNR in the voxelwise (SNRv) and connexelwise (SNRc) cases, where n stands for the number of voxels.
II. METHODS
We propose to formally incorporate prior knowledge into the inference process by using a Bayesian framework. The prior informs the search area, which in turn is subdivided
2013 3rd International Workshop on Pattern Recognition in Neuroimaging
978-0-7695-5061-9/13 $26.00 © 2013 IEEE. DOI 10.1109/PRNI.2013.57
Gorgolewski et al. (2013). Fifty shades of Gray, Matter: Using Bayesian priors to improve the power of whole-brain voxel- and connexelwise inference. Proceedings of the 3rd International Workshop on Pattern Recognition in Neuroimaging.
1. Correlation matrix between all voxels: (multiple testing problem)²
Difficult problem for inference = difficult problem for power calculations
2. Cluster voxels in a meaningful way: atlas-based, function-based, data-driven, …
Still huge multiple testing problem
Throw away data
3. Summarize connectivity in global metrics:
Inference = subjectwise comparison of metrics
Power calculation = classical power analysis (G*Power, ...)
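For option 3 the calculation is the classical one that tools like G*Power perform; a sketch for a two-sample comparison of some global metric between groups (the effect size d = 0.5 is an assumption for illustration, not from the source):

```python
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test at standardized effect d."""
    df = 2 * n_per_group - 2
    nc = d * np.sqrt(n_per_group / 2)              # noncentrality parameter
    t_crit = stats.t.isf(alpha / 2, df)
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

def sample_size(d, power=0.80, alpha=0.05):
    """Smallest per-group n reaching the target power."""
    n = 2
    while two_sample_power(d, n, alpha) < power:
        n += 1
    return n

print(sample_size(0.5))    # ~64 per group for a medium effect
```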
Other variables affecting power and reproducibility 4
Questionable research practices
36
• Sloppy or nonexistent analysis protocols
– You stop when you get the result you expect
– These "vibrations" can only lead to inflated false positives
• Afflicts well-intended researchers
– Multitude of preprocessing/modelling choices
• Linear vs. non-linear alignment
• Canonical HRF? Derivatives? FLOBS?
“Try voxel-wise whole brain, then cluster-wise, then if not getting good results, look for subjects with bad movement, if still nothing, maybe try a global signal regressor; if still nothing do SVC for frontal lobe, if not, then try DLPFC (probably only right side), if still nothing, will look in literature for xyz coordinates near my activation, use spherical SVC… surely that’ll work!”
Does your lab have written protocols?
Design efficiency
37
signal.
Although this design type might seem suboptimal because of the slow transitions between conditions, for certain psychological experiments this method is appropriate. This is illustrated in one of the studies used in this master dissertation, the study of Van Opstal et al. (2008). In that study different stages in a learning process were compared. The study is explained further in more detail (see 1.4.1).
[Figure panels: Blocked Design (1), Blocked Design (2), Event-Related Design; y-axes: Condition / Convolved Condition, switching between OFF and ON.]
Figure 3: Trial course for 2 blocked and event-related designs.
Event-related design. The basic principle underlying event-related fMRI designs is very simple: one looks for stimulus-induced activity. Here the stimulus can be seen as e.g. the onset of a condition or the beginning of a task. The distinctive feature of this design is that it permits relating changes in the BOLD signal to specific stimuli/trials/events, contrary to the blocked designs where the activation is related to a longer time period. Since the relation between the stimulus and the BOLD signal is known or can be estimated, one can declare regions or voxels active if there is increased activity at moments predicted to be active. This is what is described in the right panel of Figure 3. In the lower right panel the convolved consequence of the trials is shown.
As mentioned before, fMRI offers more temporal flexibility (albeit not unlimited) than PET. Additionally, with fMRI a more precise estimation of the HRF can be obtained. Although this is considered a major advantage (Amaro & Barker, 2006), it is not used in the major part of cognitive neuroscience studies (Lindquist, Loh, Atlas, & Wager, 2009). From a pure psychological perspective the gain in temporal flexibility is probably
Design efficiency
38
• Yi = Xiβ + εi
• Var(β̂) = σ²(X′X)⁻¹ → variance proportional to diag[(X′X)⁻¹] → standardised effect size dependent on design
• Read more:
– Smith et al. (2007). Meaningful design and contrast estimability in fMRI. NeuroImage, 34, 127–136.
– Wager & Nichols (2003). Optimization of experimental design in fMRI: a general framework using a genetic algorithm. NeuroImage, 18, 293–309.
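A toy demonstration of this design dependence: the efficiency of a contrast, 1/(c(X′X)⁻¹cᵀ), for a blocked versus a fast alternating design. The HRF parameters and the two designs below are made up for illustration; the HRF's low-pass filtering is what makes the fast design inefficient.

```python
import numpy as np
from scipy import stats

def hrf(t):
    """Toy double-gamma HRF (parameters are illustrative, not SPM's)."""
    return stats.gamma.pdf(t, 6) - stats.gamma.pdf(t, 16) / 6

def efficiency(box, c, tr=2.0):
    """1 / (c (X'X)^-1 c') for an HRF-convolved boxcar regressor."""
    x = np.convolve(box, hrf(np.arange(0, 32, tr)))[: len(box)]
    X = np.column_stack([np.ones(len(x)), x])       # intercept + condition
    c = np.asarray(c, float)
    return 1.0 / (c @ np.linalg.inv(X.T @ X) @ c)

T = 160                                             # scans at TR = 2 s
blocked = np.tile(np.repeat([0.0, 1.0], 8), T // 16)   # 16 s on/off blocks
rapid = np.tile([0.0, 1.0], T // 2)                    # flip every scan
c = [0, 1]                                             # condition effect
print(efficiency(blocked, c), efficiency(rapid, c))    # blocked wins here
```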
Power in bayesian analyses
39
• Bayesian analysis: mixture of H0 and Ha
• BF a measure of both FPR and power
• Still: a cutoff can be found more easily if the distance between H0 and Ha is maximal.
• Same for any kind of analysis:
– Different analyses → different cutoffs
– Same data!
– Goal: maximal separation between true and false effects
Two-sided testing
40
• 1 vs. 2 sided testing should reflect the theoretical hypothesis
– With a directional hypothesis: one-sided testing!!
– Stats = confirmatory
– Same false positive rate, better true positive rate
– If you find the opposite effect: more exploratory, no significance testing!
– Why allow FPRs for an effect you're not even looking for?
• Read more:
– Cho and Abe (2013). Is two-tailed testing for directional research hypotheses tests legitimate? Journal of Business Research, 66, 9.
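The "same false positive rate, better true positive rate" claim is easy to check with a normal approximation (the effect size and n are illustrative):

```python
from scipy import stats

def power(shift, alpha=0.05, two_sided=False):
    """Power to detect a positive mean shift delta*sqrt(n), normal approx."""
    z = stats.norm.isf(alpha / 2 if two_sided else alpha)
    return stats.norm.sf(z - shift)

shift = 0.5 * 30 ** 0.5               # e.g. d = 0.5, n = 30
print(power(shift))                   # one-sided, ~0.86
print(power(shift, two_sided=True))   # two-sided, ~0.78 in the same direction
```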
Conclusion
41
• Low power has detrimental effects on science
– Inability to find true effects
– Most findings are false
– Effects are overestimated
• Specific details of fMRI require specialised approaches.
• Solutions for voxelwise analyses
• Mostly an open question for connectivity analyses
Thank you!