1
Statistics – Modelling Your Data
Chris Rorden
1. Modelling data:
– Signal, Error and Covariates
– Parametric Statistics
2. Thresholding Results:
– Statistical power and statistical errors
– The multiple comparison problem
– Familywise error and Bonferroni Thresholding
– Permutation Thresholding
– False Discovery Rate Thresholding
– Implications: null results uninterpretable
2
The fMRI signal
Last lecture: we predict areas that are involved with a task will become brighter (after a delay)
Therefore, we expect that if someone repeatedly does a task for 12 seconds, and rests for 12 seconds, our signal should look like this:
[Figure: predicted signal (Model). x-axis: Time (volumes, each 2sec); y-axis: Signal Brightness]
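The predicted time course for this 12 seconds on, 12 seconds off design can be sketched as a simple boxcar. This is a minimal illustration only: the function name `boxcar` and the `delay_sec` parameter are hypothetical, and a plain time shift stands in for the delayed response (analysis packages model the delay with a smoother response function).

```python
def boxcar(n_volumes, tr_sec=2.0, on_sec=12.0, off_sec=12.0, delay_sec=0.0):
    """Return 1.0 for volumes where we predict task signal, else 0.0.

    delay_sec shifts the whole prediction later in time (a crude,
    assumed stand-in for the delayed brightening after task onset).
    """
    period = on_sec + off_sec
    model = []
    for volume in range(n_volumes):
        t = volume * tr_sec - delay_sec
        active = t >= 0 and (t % period) < on_sec
        model.append(1.0 if active else 0.0)
    return model


# 24 volumes at 2 sec each = two full task/rest cycles
model = boxcar(24)
# The same design, with the prediction shifted 4 sec later (assumed value)
delayed_model = boxcar(24, delay_sec=4.0)
```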
3
Calculating statistics
Does this brain area change brightness when we do the task?
– Top panel: very good predictor (very little error)
– Lower panel: somewhat less good predictor
[Figure: two panels, each plotting Observed, Model and Error. x-axis: Time; y-axis: Signal Brightness]
4
General Linear Model
The observed data is composed of a signal that is predicted by our model and unexplained noise (Boynton et al., 1996).
$$Y = M\beta + \epsilon$$
where Y = Measured Data, M = Design Model, β = Amplitude (solve for), ε = Noise
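Solving for the amplitude can be sketched with ordinary least squares for a single regressor. This is a minimal sketch, not the method of any particular package: the function name `fit_glm` and the data are hypothetical, and the constant (baseline) term is an assumption added here so the model does not have to pass through zero.

```python
def fit_glm(y, m):
    """Least-squares fit of y = baseline + amplitude * m + noise.

    Returns (amplitude, baseline, residuals). The residuals are the
    part of the measured data that the model leaves unexplained.
    """
    n = len(y)
    mean_y = sum(y) / n
    mean_m = sum(m) / n
    sxy = sum((mi - mean_m) * (yi - mean_y) for mi, yi in zip(m, y))
    sxx = sum((mi - mean_m) ** 2 for mi in m)
    amplitude = sxy / sxx
    baseline = mean_y - amplitude * mean_m
    residuals = [yi - (baseline + amplitude * mi) for mi, yi in zip(m, y)]
    return amplitude, baseline, residuals


# Hypothetical data: baseline 10, true amplitude 3, small noise
m = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
e = [0.5, -0.5, 0.5, -0.5, -0.5, 0.5, -0.5, 0.5]
y = [10.0 + 3.0 * mi + ei for mi, ei in zip(m, e)]

amplitude, baseline, residuals = fit_glm(y, m)
```

In this toy data the noise is balanced across task and rest, so the fit recovers the amplitude and the residuals equal the noise that was added.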
5
What is your model?
Model is predicted effect.
Consider Block design experiment:
– Three conditions, each for 11.2sec
1. Press left index finger when you see [image]
2. Press right index finger when you see [image]
3. Do nothing when you see [image]
[Figure: model plotted as Intensity against Time]
6
FSL/SPM display of model
Analysis programs display model as grid.
Each column is regressor
– e.g. left / right arrows.
Each row is a volume of data
– for within-subject fMRI = time
Brightness of row is model’s predicted intensity.
[Figure: model shown as a plot (Intensity, Time) and as a grid (Time)]
7
Statistical Contrasts
fMRI inference based on contrast.
Consider study with left arrow and right arrow as regressors
1. [1 0] identifies activation correlated with left arrows: we could expect visual and motor effects.
2. [1 –1] identifies regions that show more response to left arrows than right arrows. Visual effects should be similar, so should select contralateral motoric.
Choice of contrasts crucial to inference.
8
Statistical Contrasts
t-Test is one tailed, F-test is two-tailed.
– T-test: [1 –1] mutually exclusive of [-1 1]: left>right vs right>left.
– F-test: [1 –1] = [-1 1]: difference between left and right.
Choice of test crucial to inference.
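How contrast weights combine the regressor amplitudes can be sketched as a weighted sum. The amplitudes below are hypothetical values for a single voxel, and `contrast_effect` is an illustrative name. Only the numerator of the statistic is shown; a real t or F value also divides by an estimate of the noise.

```python
def contrast_effect(contrast, betas):
    """Weighted sum of regressor amplitudes for one voxel."""
    if len(contrast) != len(betas):
        raise ValueError("need one contrast weight per regressor")
    return sum(c * b for c, b in zip(contrast, betas))


# Hypothetical amplitudes for one voxel: [left arrow, right arrow]
betas = [2.5, 0.5]

left_only = contrast_effect([1, 0], betas)       # response to left arrows
left_gt_right = contrast_effect([1, -1], betas)  # left more than right
right_gt_left = contrast_effect([-1, 1], betas)  # right more than left

# A one-tailed t-test keeps the sign, so the two directions differ.
# An F-test works on the squared effect, so both directions agree.
same_for_f_test = left_gt_right ** 2 == right_gt_left ** 2
```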
9
How many regressors?
We collected data during a block design, where the participant completed 3 tasks
– Left hand movement
– Right hand movement
– Rest
We are only interested in the brain areas involved with Left hand movement.
Should we include uninteresting right hand movement as a regressor in our statistical model?
– I.E. Is a [1] analysis the same as a [1 0]?
– Is a [1 0] analysis identical, better, worse or different from a [1] analysis?
[Figure: the two models side by side, marked =?]
10
Meaningful regressors decrease noise
Meaningful regressors can explain some of the variability.
Adding a meaningful regressor can reduce the unexplained noise from our contrast.
11
Correlated regressors decrease signal
If a regressor is strongly correlated with our effect, it can reduce the residual signal
– Our signal is excluded as the regressor explains this variability.
– Example: responses highly correlated to visual stimuli
12
Single factor…
Consider a test to see how well height predicts weight.
$$t = \frac{\text{Explained Variance}}{\text{Unexplained Variance}}$$
[Figure: Weight and Height shown in two panels]
– Small t-score: height only weakly predicts weight
– High t-score: height strongly predicts weight
13
Adding a second factor…
How does an additional factor influence our test?
E.G. We can add waist diameter as a regressor.
Does this regressor influence the t-test regarding how well height predicts weight?
Consider ratio of cyan to green.
– Increased t: Waist explains portion of weight not predicted by height.
– Decreased t: Waist explains portion of weight predicted by height.
[Figure: regions labelled Weight, Height and Waist]
14
Regressors and statistics
Our analysis identifies three classes of variability:
1. Signal: Predicted effect of interest
2. Noise (aka Error): Unexplained variance
3. Covariates: Predicted effects that are not relevant.
Statistical significance is the ratio:
$$t = \frac{\text{Signal}}{\text{Noise}}$$
Covariates will
– Improve sensitivity if they reduce error (explain otherwise unexplained variance).
– Reduce sensitivity if they reduce signal (explain variance that is also predicted by our effect of interest).
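Both effects of a covariate can be sketched with toy numbers. This is an illustration under stated assumptions: the data are invented so that the arithmetic is clean, and `t_for_effect` is a hypothetical helper that removes the covariate from both the data and the effect of interest before computing the t-score, which gives the same t as fitting both regressors together.

```python
import math


def _residualize(values, covariate):
    """Remove the mean and, if given, the linear effect of a covariate."""
    n = len(values)
    mean_v = sum(values) / n
    centered = [v - mean_v for v in values]
    if covariate is None:
        return centered
    mean_c = sum(covariate) / n
    cov_c = [c - mean_c for c in covariate]
    slope = sum(a * b for a, b in zip(centered, cov_c)) / sum(c * c for c in cov_c)
    return [a - slope * c for a, c in zip(centered, cov_c)]


def t_for_effect(y, effect, covariate=None):
    """t-score for `effect` in a model with a constant and an optional covariate."""
    ry = _residualize(y, covariate)
    rx = _residualize(effect, covariate)
    sxx = sum(x * x for x in rx)
    beta = sum(x * v for x, v in zip(rx, ry)) / sxx
    rss = sum((v - beta * x) ** 2 for x, v in zip(rx, ry))
    df = len(y) - (2 if covariate is None else 3)
    return beta / math.sqrt(rss / df / sxx)


# Invented, mutually uncorrelated patterns
effect = [-1, -1, 1, 1, -1, -1, 1, 1]
nuisance = [-1, 1, -1, 1, -1, 1, -1, 1]
noise = [1, -1, -1, 1, -1, 1, 1, -1]

# Case A: the covariate explains variance our effect does not
y_a = [2 * s + 3 * z + 0.5 * e for s, z, e in zip(effect, nuisance, noise)]
t_plain_a = t_for_effect(y_a, effect)
t_with_covariate = t_for_effect(y_a, effect, nuisance)  # larger t

# Case B: the covariate overlaps with our effect
y_b = [2 * s + 0.5 * e for s, e in zip(effect, noise)]
overlapping = [s + 0.5 * z for s, z in zip(effect, nuisance)]
t_plain_b = t_for_effect(y_b, effect)
t_with_overlap = t_for_effect(y_b, effect, overlapping)  # smaller t
```

In case A the covariate soaks up error, so the t-score for the effect rises. In case B the covariate claims variance that the effect would otherwise explain, so the t-score falls.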
15
Summary
Regressors should be orthogonal
– Each regressor describes independent variance.
– Variance should not be explained by more than one regressor.
E.G. we will see that including temporal derivatives as regressors tends to help event related designs (temporal processing lecture).
16
Group Analysis
We typically want to make inferences about the general population
Conduct time course analysis on many people. Identify which patterns are consistent across group.
17
Parametric Statistics
SPM and FSL conduct parametric statistics.
– T-test, F-test, Correlation
These make assumptions about data.
We will not check to see if these assumptions are valid.
18
Parametric Statistics
Parameters = Assumptions
Parametric Statistics assume that data can be accurately defined by two values:
1. Mean = measure of central tendency
2. Variance = measure of noise
[Figure: distributions f(x) in two panels: Means differ, Variabilities differ]
19
Parametric Statistics
Parametric Statistics are popular
– Simple (complex data described by two numbers: mean and variability)
– Flexible (can look at how multiple factors interact)
– Powerful: very sensitive at detecting real effects
– Robust: usually work even if assumptions violated
Tend to fail gracefully: by becoming more conservative
20
Normal Distribution
Parametric Statistics Assume Bell-Shaped data:
Often, this is wrong. Mean may not be a good measure:
– Positive Skew: response times, hard exam
– Negative Skew: easy exam
– Bimodal: some students got it
21
Rank-Order Statistics
Rank-order statistics make fewer assumptions.
Have less power (if data is normal)
– Require more measurements
– May fail to detect real results
Computationally slow
Classic examples:
– Wilcoxon Mann-Whitney
– Fligner and Policello’s robust rank order test
22
Problem with rank order statistics
While rank-order statistics are often referred to as non-parametric, most make assumptions:
– WMW: assume both distributions have same shape.
– FP: assume both distributions are symmetrical.
Both these tests become liberal if their assumptions are not met.
– They fail catastrophically.
23
What to do?
In general, use parametric tests.
– In face of violations, you will simply lose power
One alternative is to use permutation testing, e.g. SnPM.
– Permutation testing is only as powerful as the test statistic it uses: SnPM uses the t-test, which is sensitive to changes in mean (so it can be blind to changes in median).
Recent alternative is truly non-parametric test of Brunner and Munzel.
– Can offer slightly better power than t-test if data is skewed. Rorden et al. 2007.
24
Statistical Thresholding
– Type I/II Errors
– Power
– Multiple Comparison Problem
Bonferroni Correction
Permutation Thresholding
False Discovery Rate
ROI Analysis
25
Statistics
E.G. erythropoietin (EPO) doping in athletes
– In endurance athletes, EPO improves performance ~ 10%
– Races often won by less than 1%
– Without testing, athletes forced to dope to be competitive
– Dangers: Carcinogenic and can cause heart-attacks
– Therefore: Measure haematocrit level to identify drug users…
If there was no noise in our measure, it would be easy to identify EPO doping:
[Figure: haematocrit scale, 30% to 50%]
26
The problem of noise
Science tests hypotheses based on observations
– We need statistics because our data is noisy
In the real world, haematocrit levels vary
– This unrelated noise in our measure is called ‘error’
How do we identify dopers?
[Figure: haematocrit scale, 30% to 50%. Caption: In the real world, haematocrit varies between people]
27
Statistical Threshold
[Figure: two haematocrit scales, 30% to 50%, one for each threshold setting]
If we set the threshold too low, we will accuse innocent people (high rate of false alarms).
If we set the threshold too high, we will fail to detect dopers (high rate of misses).
28
Possible outcomes of drug test
Columns: Reality (unknown). Rows: Decision.
Decision | nonDoper | EPO Doper
Accuse and expel | Innocent accused (false alarm), Type I error | Doper expelled (hit)
Allow to compete | Innocent competes (correct rejection) | Doper sneaks through (miss), Type II error
29
Errors
With noisy data, we will make mistakes.
Statistics allows us to
– Estimate our confidence
– Bias the type of mistake we make (e.g. we can decide whether we will tend to make false alarms or misses)
We can be liberal: avoiding misses
We can be conservative: avoiding false alarms.
We want liberal tests for airport weapons detection (X-ray often leads to innocent cases being opened).
Our society wants conservative tests for criminal conviction: avoid sending innocent people to jail.
30
Liberal vs Conservative Thresholds
LIBERAL
With a low threshold, we will accuse innocent people (high rate of false alarms, Type I).
CONSERVATIVE
With a high threshold, we will fail to detect dopers (high rate of misses, Type II).
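The trade-off between the two thresholds can be sketched by counting both kinds of error at several cut-offs. The haematocrit values below are invented for illustration, and `count_errors` is a hypothetical helper; an athlete is accused when the measured level is above the threshold.

```python
def count_errors(threshold, clean, doped):
    """Return (false_alarms, misses) for a given haematocrit threshold."""
    false_alarms = sum(1 for level in clean if level > threshold)
    misses = sum(1 for level in doped if level <= threshold)
    return false_alarms, misses


# Invented haematocrit levels (%), overlapping because of noise
clean = [38, 40, 41, 42, 43, 44, 45, 47]
doped = [44, 46, 48, 49, 50, 51, 52, 54]

liberal = count_errors(42, clean, doped)       # low threshold
middle = count_errors(46, clean, doped)
conservative = count_errors(50, clean, doped)  # high threshold
```

With these numbers the low threshold misses nobody but accuses four innocent athletes, while the high threshold accuses nobody innocent but lets five dopers through.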
31
Statistical Power
Statistical Power is our probability of making a Hit. It reflects our ability to detect real effects.
Columns: Reality. Rows: Decision.
Decision | Ho true | Ho false
Reject Ho | Type I error | Hit
Accept Ho | Correct rejection | Type II error
To make new discoveries, we need to optimize power.
There are 4 ways to increase power…
32
1.) Alpha and Power
By making alpha less strict, we can increase power (e.g. p < 0.05 instead of 0.01).
However, we increase the chance of a Type I error!
33
2.) Effect Size and Power
Power will increase if the effect size increases. (e.g. higher dose of drug, 7T MRI instead of 1.5T).
Unfortunately, effect sizes are often small and fixed.
34
3.) Variability and Power
Reducing variability increases the relative effect size. Most measures of brain activity are noisy.
35
4.) Sample Size
A final way to increase our power is to collect more data.
We can sample a person’s brain activity on many similar trials.
We can test more people.
The disadvantage is time and money.
Increasing the sample size is often our only option for increasing statistical power.
36
Reflection
Statistically, relative ‘effect size’ and ‘variability’ are equivalent.
Our confidence is the ratio of effect size versus variability (signal versus noise).
In graphs below, same α is used.
[Figure: two graphs, shown as equivalent (=)]
37
Alpha level
Statistics allow us to estimate our confidence.
α is our statistical threshold: it measures our chance of Type I error.
An alpha level of 5% means only 1/20 chance of false alarm (we will only accept p < 0.05).
An alpha level of 1% means only 1/100 chance of false alarm (p < 0.01).
Therefore, a 1% alpha is more conservative than a 5% alpha.
38
Multiple Comparison Problem
Assume a 1% alpha for drug testing.
An innocent athlete only has 1% chance of being accused.
Problem: 10,500 athletes in the Olympics.
If all innocent, and α = 1%, we will wrongly accuse 105 athletes (0.01*10500)!
This is the multiple comparison problem.
39
Multiple Comparisons
The gray matter volume ~900cc (900,000mm³)
Typical fMRI voxel is 3x3x3mm (27mm³)
Therefore, we will conduct >30,000 tests
With 5% alpha, we will make >1500 false alarms!
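The arithmetic behind the athlete and voxel examples can be checked with a few lines. This is only a sketch of the expected number of false alarms when no real effect is present; the function and variable names are illustrative.

```python
def expected_false_alarms(n_tests, alpha):
    """Average number of false alarms when every null hypothesis is true."""
    return n_tests * alpha


# Olympic example: 10,500 innocent athletes tested at alpha = 1%
athlete_false_alarms = expected_false_alarms(10500, 0.01)  # about 105

# fMRI example: 900,000 mm3 of gray matter in 27 mm3 voxels
n_voxels = 900_000 // (3 * 3 * 3)  # more than 30,000 tests
voxel_false_alarms = expected_false_alarms(n_voxels, 0.05)  # more than 1500
```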
40
Multiple Comparison Problem
If we conduct 20 tests, with an α = 5%, we will on average make one false alarm (20x0.05).
If we make twenty comparisons, it is possible that we may be making 0, 1, 2 or in rare cases even more errors.
The chance we will make at least one error is given by the formula $1-(1-\alpha)^C$, where C is the number of comparisons: if we make twenty comparisons at p < .05, we have a $1-(0.95)^{20}$ = 64% chance that we are reporting at least one erroneous finding. This is our familywise error (FWE) rate.
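The familywise error formula is easy to evaluate directly. A minimal sketch, assuming the comparisons are independent (the function name is illustrative):

```python
def familywise_error(alpha, n_comparisons):
    """Chance of at least one false alarm across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


fwe_20_tests = familywise_error(0.05, 20)  # roughly 0.64
fwe_1_test = familywise_error(0.05, 1)     # a single test: just alpha
```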
41
Bonferroni Correction
Bonferroni Correction: controls FWE.
For example: if we conduct 10 tests, and want a 5% chance of any errors, we will adjust our threshold to be p < 0.005 (0.05/10).
Benefits: Controls for FWE.
Problem: Very conservative = very little chance of detecting real effects = low power.
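The Bonferroni adjustment and its effect on the familywise error rate can be sketched as follows. The function names are illustrative, and the familywise check assumes independent tests.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the familywise error at or below alpha."""
    return alpha / n_tests


def familywise_error(per_test_alpha, n_tests):
    """Chance of at least one false alarm across independent tests."""
    return 1 - (1 - per_test_alpha) ** n_tests


threshold = bonferroni_threshold(0.05, 10)     # 0.05 / 10
familywise = familywise_error(threshold, 10)   # just under 5%
uncorrected = familywise_error(0.05, 10)       # far above 5%
```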
42
Random Field Theory
We spatially smooth our data – peaks due to noise should be attenuated by neighbors.
– Worsley et al, HBM 4:58-73, 1995.
RFT uses resolution elements (resels) instead of voxels.
– If we smooth our data with 8mm FWHM, then resel size is 8mm.
SPM uses RFT for FWE correction: only requires statistical map, smoothness and cluster size threshold.
– Euler characteristic: unsmoothed noise will have high peaks but few clusters, smoothed data will have lower peaks but show clustering.
RFT has many unchecked assumptions (Nichols)
Works best for heavily smoothed data (x3 voxel size)
[Figure: panels labelled 5mm, 10mm, 15mm. Image from Nichols]
43
Permutation Thresholding
Prediction: Labels ‘Group 1’ and ‘Group 2’ mean something.
Null Hypothesis (Ho): Labels are meaningless.
If Ho true, we should get similar t-scores if we randomly scramble order.
[Figure: images labelled Group 1 and Group 2]
44
Permutation Thresholding
[Figure: images labelled Group 1 and Group 2]
Observed, max T = 4.1
1. Permutation 1, max T = 3.2
2. Permutation 2, max T = 2.9
3. Permutation 3, max T = 3.3
4. Permutation 4, max T = 2.8
5. Permutation 5, max T = 3.5
…
1000. Permutation 1000, max T = 3.1
45
Permutation Thresholding
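Compute maximum T-score for 1000 permutations.
Find the max T that only 5% of permutations exceed (the 5th percentile when ranked from highest to lowest).
If Ho is true, there is only a 5% chance that any voxel in our observed dataset exceeds this threshold.
[Figure: Max T (0 to 5) plotted against Percentile (0 to 100); the 5% cut-off is at T = 3.9]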
Permutation Thresholding
Permutation Thresholding offers the same protection against false alarms as Bonferroni.
Typically, much more powerful than Bonferroni.
Implementations include SnPM, FSL’s randomise, and my own NPM.
Disadvantage: computing 1000 permutations means it takes 1000 times longer than typical analysis!
Simulation data from Nichols et al.: Permutation always optimal. Bonferroni typically conservative. Random Fields only accurate with high DF and heavily smoothed.
47
False Discovery Rate
Traditional statistics attempts to control the False Alarm rate.
‘False Discovery Rate’ controls the proportion of detections (false alarms plus hits) that are false alarms.
It often provides much more power than Bonferroni correction.
48
FDR
Assume Olympics where no athletes took EPO:
Assume Olympics where some cheat:
– When we conduct many tests, we can estimate the amount of real signal
49
FDR vs FWE
Bonferroni FWE applies same threshold to each data set
FDR is dynamic: threshold based on signal detected.
5% Bonferroni: only a 5% chance that any innocent athlete will be accused.
5% FDR: only 5% of expelled athletes are innocent.
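A dynamic FDR threshold can be sketched with the Benjamini-Hochberg step-up rule. The slides do not say which FDR procedure is intended, so this choice is an assumption, and the p-values are invented for illustration.

```python
def fdr_threshold(p_values, q=0.05):
    """Benjamini-Hochberg: largest sorted p(i) with p(i) <= (i / n) * q.

    Returns None when no p-value passes, i.e. nothing is reported.
    """
    ranked = sorted(p_values)
    n = len(ranked)
    threshold = None
    for i, p in enumerate(ranked, start=1):
        if p <= i / n * q:
            threshold = p
    return threshold


# Invented p-values: no real signal anywhere
p_no_signal = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.3, 0.1]
# Invented p-values: some tests carry real signal
p_some_signal = [0.001, 0.004, 0.012, 0.019, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

none_found = fdr_threshold(p_no_signal)
fdr_cutoff = fdr_threshold(p_some_signal)
bonferroni_cutoff = 0.05 / len(p_some_signal)

fdr_hits = sum(1 for p in p_some_signal if p <= fdr_cutoff)
bonferroni_hits = sum(1 for p in p_some_signal if p <= bonferroni_cutoff)
```

With no signal nothing passes. With some signal the FDR cut-off relaxes and admits more tests than the fixed Bonferroni cut-off does.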
50
Controlling for multiple comparisons
Bonferroni correction
– We will often fail to find real results.
RFT correction
– Typically less conservative than Bonferroni.
– Requires large DF and broad smoothing.
Permutation Thresholding
– Offers same inference as Bonferroni correction.
– Typically much less conservative than Bonferroni.
– Computationally very slow
FDR correction
– At FDR of .05, about 5% of ‘activated’ voxels will be false alarms.
– If signal is only tiny proportion of data, FDR will be similar to Bonferroni.
51
Alternatives to voxelwise analysis
Conventional fMRI statistics compute one statistical comparison per voxel.
– Advantage: can discover effects anywhere in brain.
– Disadvantage: low statistical power due to multiple comparisons.
Small Volume Comparison: Only test a small proportion of voxels. (Still have to adjust for RFT).
Region of Interest: Pool data across anatomical region for single statistical test.
[Figure: panels labelled SVC, ROI and SPM]
Example: how many comparisons on this slice?
•SPM: 1600
•SVC: 57
•ROI: 1
52
ROI analysis
In voxelwise analysis, we conduct an independent test for every voxel
– Each voxel is noisy
– Huge number of tests, so severe penalty for multiple comparisons
Alternative: pool data from region of interest.
– Averaging across meaningful region should reduce noise.
– One test per region, so FWE adjustment less severe.
Region must be selected independently of statistical contrast!
– Anatomically predefined
– Defined based on previous localizer session
– Selected based on combination of conditions you will contrast.
[Figure: regions labelled M1: movement and S1: sensation]
53
Inference from fMRI statistics
fMRI studies have very low power.
– Correction for multiple comparisons
– Poor signal to noise
– Variability in functional anatomy between people.
Null results impossible to interpret. (Hard to say an area is not involved with task).
54
Between and Within Subject Variance
Consider experiment to see if music influences typing speed.
Possible effect will be small.
Large variability between people: some people much better typists than others.
Solution: repeated measure design to separate between and within subject variability.
[Figure: typing speed (words per minute, 0 to 70) for Bach, Rock and Silent conditions, one series per person: Alice, Bob, Donna, Nick, Sam]
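Separating the two sources of variability can be sketched by removing each person's own average before comparing conditions. The typing speeds below are invented (the values in the figure are not recoverable) and the helper names are hypothetical.

```python
from statistics import mean


def center_within_subject(speeds):
    """Subtract each person's own mean, removing between-subject variability."""
    centered = {}
    for person, by_condition in speeds.items():
        person_mean = mean(by_condition.values())
        centered[person] = {c: s - person_mean for c, s in by_condition.items()}
    return centered


def condition_means(table):
    """Average each condition across people."""
    conditions = next(iter(table.values())).keys()
    return {c: mean(row[c] for row in table.values()) for c in conditions}


# Invented typing speeds in words per minute
speeds = {
    "Alice": {"Bach": 63, "Rock": 60, "Silent": 66},
    "Bob": {"Bach": 33, "Rock": 30, "Silent": 36},
    "Donna": {"Bach": 48, "Rock": 45, "Silent": 51},
    "Nick": {"Bach": 23, "Rock": 21, "Silent": 25},
    "Sam": {"Bach": 54, "Rock": 50, "Silent": 58},
}

raw_rock = [row["Rock"] for row in speeds.values()]
centered = center_within_subject(speeds)
centered_rock = [row["Rock"] for row in centered.values()]
effects = condition_means(centered)
```

In the raw data, Rock speeds range from 21 to 60 words per minute because people differ. After centring, every person shows a drop of 2 to 4 words per minute for Rock, so the small condition effect is no longer hidden by the large differences between typists.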
55
Multiple Subject Analysis: Mixed Model
Model all of the data at once
Between and within subject variation is accounted for
Can’t apply mixed model directly to fMRI data because there is so much data!
[Diagram: Sub 1, Sub 2, Sub 3 and Sub 4 feed into Group, which gives Z stats]
56
Multiple Subject Analysis: SPM2
First estimate each subject’s contrast effect sizes (copes)
Run a t-test on the copes
Holmes and Friston assume within subject variation is same for all subjects; this allows them to ignore it at the group level
– Not equivalent to a mixed model
[Diagram: Sub 1, Sub 2, Sub 3 and Sub 4 each supply copes to Group: T-test, which gives Results]
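The group-level step described here can be sketched as a one-sample t-test on the subjects' copes at a single voxel. The cope values are invented and `group_t` is an illustrative name; this shows only the t-score, not how any package computes its full output.

```python
import math
from statistics import mean, stdev


def group_t(copes):
    """One-sample t-score: is the mean cope across subjects different from zero?"""
    n = len(copes)
    standard_error = stdev(copes) / math.sqrt(n)
    return mean(copes) / standard_error


# Invented copes for one voxel, one value per subject (Sub 1 to Sub 4)
copes = [1.0, 2.0, 3.0, 2.0]
t_value = group_t(copes)
degrees_of_freedom = len(copes) - 1
```

Only the spread of copes between subjects enters the calculation, which reflects the assumption that within subject variation can be ignored at the group level.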
57
Multiple Subject Analysis: FSL
First estimate each subject’s copes and cope variability (varcopes)
Then enter the copes and varcopes into group model
– varcopes supply within subject variation
– Between subject variation and group level means are then estimated
Equivalent to mixed model
Much slower than SPM
[Diagram: Sub 1, Sub 2, Sub 3 and Sub 4 each supply copes and varcopes to Group, which gives Z stats]