Introduction to resampling in MATLAB. So you've done an experiment... Two independent datasets:...
-
Upload
gerard-young -
Category
Documents
-
view
212 -
download
0
Transcript of Introduction to resampling in MATLAB. So you've done an experiment... Two independent datasets:...
Introduction to resampling in MATLAB
So you've done an experiment...
Two independent datasets: control experimental
Have n numbers in each dataset, representing independent measurements
Question:
Did the experimental manipulation have an effect, i.e. did it make a difference?
The question restated
Is there a significant (i.e. “real”) numerical difference between the control and experimental datasets?
What is the probability that the two datasets are subsets of two different/independent distributions?
What is the probability that the two datasets are NOT subsets of the same underlying distribution?
Answering the question
What do we need? A number(s) to represent/summarize each dataset
→ data descriptors A test for comparing these numbers
→ test statistic A way to assess the significance of the test statistic
→ distribution of test statistics generated from datasets that ARE from the same
underlying distribution
Descriptors and tests
Know your data! Visualize it Does it have a central tendency? (i.e. is histogram
vaguely mound shaped?) If yes, use the mean as descriptor If no, median may be better
Typical test: compute difference (subtraction) between descriptors
We'll discuss how to assess the significance of your test statistic soon, that's the resampling
part...
But first let's visualize some data
Simulate and inspect data
ctrl = 10.*(5.75+randn(25,1));exp = 10.*(5+randn(25,1));
boxplot([ctrl exp]) groups = {'control' 'experimental'}boxplot([ctrl exp],'labels',groups)
Central tendency?
hist(ctrl) hist(exp) title('Control group') title('Experimental group')
Compute descriptors and test stat
Our data looks pretty mound shaped, so let's use the mean as data descriptor
Take the absolute value of the difference as our test stat
ctrl_d = mean(ctrl); exp_d = mean(exp);
test_stat = abs(ctrl_d - exp_d);
Assessing the test statistic
What we would really like to test: The probability that the two datasets are subsets of
two different/independent distributions
Problem: We don't know what those hypothetical independent
distributions are, if we did we wouldn't have to go through all of this!
Assessing the test statistic
Solution: Compute the probability that the two datasets are
NOT subsets of the same underlying distribution How to do this?
Start by assuming datasets are subsets of same distribution → null hypothesis
See what test statistic looks like when null hypothesis is true
Assessing the test statistic
Need to generate a distribution of test statistic generated when datasets really ARE from the same underlying distribution
→ distribution of test stat under null hypothesis
Then we can quantify how (a)typical the value of our test stat is under null hypothesis if typical, our datasets are likely from same
distribution → no effect of experiment if atypical, there is a good chance datasets are from
different distributions → experiment had an effect
Generate the null distribution using bootstrap
Basic procedure: Combine datasets Randomly sample (w/ replacement) from
combined dataset to create two pseudo datasets w/ same n as real datasets
Compute descriptors and test statistic for pseudo datasets
Repeat 10,000 times, keeping track of pseudo test stat
Combine datasets and resamplebox = [ctrl; exp]; % combine datasets into 'box' % could use concat() pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist' % to store results of resampling % --- start resampling --- for trials = 1:10000 % resample 10000 times pseudo_ctrl = ... % create pseudo groups to be sample(length(ctrl), box); % same size as actual groups pseudo_exp = ... sample(length(exp), box); pseudo_ctrl_d = ... % compute pseudo group mean(pseudo_ctrl); % descriptors pseudo_exp_d = ... mean(pseudo_exp); pseudo_stat = ... % compute pseudo test stat abs(pseudo_ctrl_d - pseudo_exp_d); pseudo_stat_dist(trials) = ... % store pseudo test stat to pseudo_stat; % build null distribution end
Now what?
Compute how many values of pseudo test stat are greater than our actual test stat, then divide by the total number of data points in the null distribution
This numerically approximates the likelihood of getting our actual test stat (i.e. the likelihood of seeing a difference as big as we see) if our two datasets were truly from the same underlying distribution
→ p-value
Compute p-val and visualize null distribution
bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')
How to interpret the p-value?
Assuming that the null hypothesis is true, the p-value is the likelihood that we would see a value of the test statistic that is greater than the value of our actual test statistic
Restated → Assuming both datasets really are from the
same underlying distribution, the p-value is the likelihood that we would see a difference as big as we do
function [] = bootstrap2groupsMEANS (ctrl, exp)%% function [] = bootstrap2groupsMEANS (ctrl, exp)%% 'ctrl' and 'exp' are assumed to be column vectors, this is% necessary for creation of 'box'%% This function will use bootstrap to compute the probability% that data in 'ctrl' and 'exp' are from the same distribution,% using the group means as descriptors, and the absolute value% of the difference between means as the test statistic % (similar to Student's t-test) % --- prepare for resampling --- ctrl_d = mean(ctrl); % compute descriptors exp_d = mean(exp); % 'ctrl_d' and 'exp_d' test_stat = abs(ctrl_d - exp_d) % compute test statistic 'test_stat' box = [ctrl; exp]; % combine datasets into 'box' % could use concat() pseudo_stat_dist = zeros(1,10000); % create vector 'pseudo_stat_dist' % to store results of resampling % --- start resampling --- for trials = 1:10000 % resample 10000 times pseudo_ctrl = ... % create pseudo groups to be sample(length(ctrl), box); % same size as actual groups pseudo_exp = ... sample(length(exp), box); pseudo_ctrl_d = ... % compute pseudo group mean(pseudo_ctrl); % descriptors pseudo_exp_d = ... mean(pseudo_exp); pseudo_stat = ... % compute pseudo test stat abs(pseudo_ctrl_d - pseudo_exp_d); pseudo_stat_dist(trials) = ... % store pseudo test stat to pseudo_stat; % build null distribution end % --- end resampling --- bigger = count(pseudo_stat_dist > test_stat); % count pseudo test stats % bigger than actual stat pval = bigger / length(pseudo_stat_dist) % divide by total number % of pseudo test stats to % compute p-value % could use proportion() hist(pseudo_stat_dist) % show histogram of pseudo title('Pseudo test stat distribution') % test stats xlabel('Pseudo test stat') ylabel('Frequency')