introduction to resampling in matlab feb 5 2009 austin hilliard

28
Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Upload: imogen-barker

Post on 28-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Introduction to resampling in MATLAB

Feb 5 2009Austin Hilliard

Page 2: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

So you've done an experiment...

Two independent datasets: control experimental

Have n numbers in each dataset, representing independent measurements

Question:

Did the experimental manipulation have an effect, i.e. did it make a difference?

Page 3: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

The question restated

Is there a significant (i.e. “real”) numerical difference between the control and experimental datasets?

What is the probability that the two datasets are subsets of two different/independent distributions?

What is the probability that the two datasets are NOT subsets of the same underlying distribution?

Page 4: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Answering the question

What do we need? A number(s) to represent/summarize each dataset

→ data descriptors A test for comparing these numbers

→ test statistic A way to assess the significance of the test statistic

→ distribution of test statistics generated from datasets that ARE from the same underlying distribution

Page 5: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Descriptors and tests

Know your data! Visualize it Does it have a central tendency? (i.e. is histogram

vaguely mound shaped?) - If yes, use the mean as descriptor - If no, median may be better

Typical test: compute difference (subtraction) between descriptors

Page 6: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

We'll discuss how to assess the significance of your test statistic soon, that's the resampling

part...

But first let's visualize some data

Page 7: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Simulate and inspect data

ctrl = 10.*(5.75+randn(25,1));exp = 10.*(5+randn(25,1));

boxplot([ctrl exp]) groups = {'control' 'experimental'}boxplot([ctrl exp],'labels',groups)

Page 8: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Central tendency?

hist(ctrl) hist(exp) title('Control group') title('Experimental group')

Page 9: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Compute descriptors and test stat

Our data looks pretty mound shaped, so let's use the mean as data descriptor

Take the absolute value of the difference as our test stat

ctrl_d = mean(ctrl); exp_d = mean(exp);

test_stat = abs(ctrl_d - exp_d);

Page 10: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Assessing the test statistic

What we would really like to test: The probability that the two datasets are subsets of

two different/independent distributions

Problem: We don't know what those hypothetical independent

distributions are, if we did we wouldn't have to go through all of this!

Page 11: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Assessing the test statistic

Solution: Compute the probability that the two datasets are

NOT subsets of the same underlying distribution How to do this?

Start by assuming datasets are subsets of same distribution → null hypothesis

See what test statistic looks like when null hypothesis is true

Page 12: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Assessing the test statistic

Need to generate a distribution of test statistic generated when datasets really ARE from the same underlying distribution

→ distribution of test stat under null hypothesis

Then we can quantify how (a)typical the value of our test stat is under null hypothesis if typical, our datasets are likely from same

distribution → no effect of experiment if atypical, there is a good chance datasets are from

different distributions → experiment had an effect

Page 13: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Generate the null distribution using bootstrap

Basic procedure: Combine datasets Randomly sample (w/ replacement) from

combined dataset to create two pseudo datasets w/ same n as real datasets

Compute descriptors and test statistic for pseudo datasets

Repeat 10,000 times, keeping track of pseudo test stat

Page 14: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Combine datasets and resamplebox = [ctrl; exp]; % combine datasets into 'box' % could use concat() pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist' % to store results of resampling % --- start resampling --- for trials = 1:10000 % resample 10000 times pseudo_ctrl = ... % create pseudo groups to be sample(length(ctrl), box); % same size as actual groups pseudo_exp = ... sample(length(exp), box); pseudo_ctrl_d = ... % compute pseudo group mean(pseudo_ctrl); % descriptors pseudo_exp_d = ... mean(pseudo_exp); pseudo_stat = ... % compute pseudo test stat abs(pseudo_ctrl_d - pseudo_exp_d); pseudo_stat_dist(trials) = ... % store pseudo test stat to pseudo_stat; % build null distribution end

Page 15: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Now what?

Compute how many values of pseudo test stat are greater than our actual test stat, then divide by the total number of data points in the null distribution

This approximates the likelihood of getting our actual test stat (i.e. the likelihood of seeing a difference as big as we see) if our two datasets were truly from the same underlying distribution

→ p-value

Page 16: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Compute p-val and visualize null distribution

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')

Page 17: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

How to interpret the p-value?

Assuming that the null hypothesis is true, the p-value is the likelihood that we would see a value of the test statistic that is greater than the value of our actual test statistic

Restated → Assuming both groups really are from the same

underlying distribution, the p-value is the likelihood that we would see a difference between them as big as we do

Page 18: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

“Paired” data

• What if your two datasets are not actually independent?

• Maybe you have a single group of subjects/birds/cells/etc. with measurements taken under different conditions

• Important: the measurements within each condition must still be independent

Page 19: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Bootstrap to test paired data

• Basic strategy is the same, but...

• Instead of one descriptor/group, have as many descriptors as data pairs

– compute difference between measurement under condition1 and measurement under condition2

• Test statistic is mean/median/etc. of the differences

• Slightly different resampling procedure

Page 20: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Visualize data

• Investigate distributions for raw data, also should plot distribution of the differences between paired data-points

• Why?

→ Test statistic is computed from these differences, not the raw datasets

Page 21: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Resampling procedure

• Randomly sample 1 or -1 and store in vector

• Multiply by vector of differences

– randomizes direction of change

• Compute test stat (mean, median, etc.) for randomized differences

• Repeat

• Now we have distribution of test statistic if direction of change is random, i.e. NOT dependent on experimental conditions

Page 22: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Compute descriptors and test stat

diffs = ctrl - exp; % compute change from ctrl condition % to exp condition test_stat = abs(mean(diffs)); % compute test statistic pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test % stats generated from resampling

Page 23: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Resampling

for trials = 1:10000 % resample 10,000 times signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's % and -1's same size as diffs pseudo_diffs = signs .* diffs; % multiply by actual diffs to % randomize direction of diff pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from % pseudo diffs pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to % grow null distribution end

Page 24: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Compute p-val and visualize null distribution

bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')

Page 25: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

function [] = bootstrap2groupsMEAN_PAIRED (ctrl, exp) % --- prepare for resampling --- diffs = ctrl - exp; % compute change from ctrl condition % to exp condition test_stat = abs(mean(diffs)) % compute test statistic pseudo_stat_dist = zeros(1,10000); % create vector to store pseudo test % stats generated from resampling % --- resampling --- for trials = 1:10000 % resample 10,000 times signs = sample(length(diffs), [-1 1])'; % randomly create vector of 1's % and -1's same size as diffs pseudo_diffs = signs .* diffs; % multiply by actual diffs to % randomize direction of diff pseudo_stat = abs(mean(pseudo_diffs)); % compute pseudo test stat from % pseudo diffs pseudo_stat_dist(trials) = pseudo_stat; % store pseudo test stat to % grow null distribution end % --- end resampling --- bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')

Page 26: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

function [] = bootstrap2groupsMEANS (ctrl, exp)%% function [] = bootstrap2groupsMEANS (ctrl, exp)%% 'ctrl' and 'exp' are assumed to be column vectors, this is% necessary for creation of 'box'% % --- prepare for resampling --- ctrl_d = mean(ctrl); % compute descriptors exp_d = mean(exp); % 'ctrl_d' and 'exp_d' test_stat = abs(ctrl_d - exp_d) % compute test statistic 'test_stat' box = [ctrl; exp]; % combine datasets into 'box' % could use concat() pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist' % to store results of resampling % --- start resampling --- for trials = 1:10000 % resample 10,000 times pseudo_ctrl = ... % create pseudo groups to be sample(length(ctrl), box); % same size as actual groups pseudo_exp = ... sample(length(exp), box); pseudo_ctrl_d = ... % compute pseudo group mean(pseudo_ctrl); % descriptors pseudo_exp_d = ... mean(pseudo_exp); pseudo_stat = ... % compute pseudo test stat abs(pseudo_ctrl_d - pseudo_exp_d); pseudo_stat_dist(trials) = ... % store pseudo test stat to pseudo_stat; % build null distribution end % --- end resampling --- bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')

Page 27: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Don't forget to look online to get a more generalized version of 'bootstrap2groupsMEANS.m' called 'bootstrap2groups.m'

It's more flexible (can define method for descriptor and iterations as input), stores output in a struct, plots raw data, has one-tailed tests if needed, and can be used for demo (try running 'out = bootstrap2groups(@mean)'

Page 28: Introduction to resampling in MATLAB Feb 5 2009 Austin Hilliard

Possible topics for next week

• confidence intervals

• normality testing

• ANOVA (1 or 2-way)

• power