Biostatistics IV
An introduction to bootstrap
2 Getting something from nothing?
In Rudolph Erich Raspe's tale, Baron Munchausen had, in one of his many adventurous travels, fallen to the bottom of a deep lake and just as he was to succumb to his fate he thought to pull himself up by his own BOOTSTRAP.
The original version of the tale is in German, where Munchausen actually draws himself up by the hair, not the bootstraps. The figure on the left refers to German version of the story.
Efron and LePage gave the method they developed the name Bootstrap in honour of the unbelievable stories that the Baron told of his travels.
3 The general problem
“We have a set of real-valued observations x1, ... xn independently sampled from an unknown probability distribution F. We are interested in estimating some parameter by using the information in the sample data with an estimator -hat = t(x). Some measure of the estimate’s accuracy is as important as the estimate itself; we want a standard error of -hat and, even better a confidence interval on the true value .”
Efron and LePage (1992)
4 Our sample mean
From the 4 samples we take a mean and calculate a standard deviation! By doing this we are assuming that the
population is normally distributed But what if we don’t know the distribution?
0
50
100
150
200
250
300
350
400
450
0 5 10 15 20 25 30 35 40
4 sample
5 The solution: Bootstrap
Generate large numbers of “bootstrap” data sets, x1, x2, x3, ... xb, from the original data.
The observations in each data set is generated by a random draw, with replacement, from the observations in the original data set.
Each data set has the same number of observation (n) as the original data set.
Refit the model to each bootstrap data set. Compute the statistics of interest (probability
profile, standard deviations, confidence intervals) from the results for each model fit.
6 Resampling the sample
Within a bootstrap sample the values are drawn randomly from the original sample.
Could thus get some measurements more than once within a particular bootstrap sample
The sample Bootsrap sample numberi Data 1 2 3 4 … n1 3 7 8 5 3 72 5 3 9 3 7 113 7 3 11 7 8 34 8 5 11 5 9 115 9 8 5 8 3 96 11 9 8 3 3 3
The mean 7.17 5.83 8.67 5.17 5.50 7.33The std. Deviation 2.86 2.56 2.25 2.04 2.81 3.67
7 Bootstrap: Why?
When the sample contains all the available information about the population one can act as if the sample is really the population for the purpose of estimating the sampling distribution.
Sampling with replacement is consistent with sampling a population that is effectively infinite - treat the sample as the total population.
Elegant, powerful and easy :-)
8 Bootstrap: When?
When sample cannot be represented by a distribution, especially if the underlying population distribution is not known.
If one knows the distribution there is little advantage to using bootstrap. There is however no harm in
bootstrapping such data sets!
Bootstrapping the residuals
10 Resampling residuals
In a time series model one must maintain the time order of the data.
Thus, resample the residuals with replacement from the optimum fit.
The randomly sampled residuals are applied to the optimum fitted values to generate new bootstrap samples.
Process repeated n-times to obtain a probability profile of the value of interest.
Use a catch curve analysis as an illustration
11 Original data set and statistics
z = -0.86
-10123456789
0 5 10 15
Age
ln c
@a
Data set: c@a of one cohort
Analysis: Slope estimate of fully recruited fish
Z = -Slope = Total mortality
Task: Obtain some information about the confidence interval of the Z using bootstrapping techniques.
12 The residuals to resample
Z = 0.86
-2
-1
0
1
2
3
4
5
6
7
8
9
5 6 7 8 9 10 11 12 13 14 15
13 One bootstrap sample
Original data set 1 Bootstrap sample
Residual Observed Predicted Random value from Bootstrap
Age ln c@a ln c@a obs-pred draw (age) draw ln c@a6 7.056 6.990 0.066 12 1.117 8.1077 5.406 6.134 -0.728 9 -0.424 5.7108 6.143 5.279 0.865 6 0.066 5.3449 3.999 4.423 -0.424 14 -0.645 3.778
10 3.468 3.567 -0.099 6 0.066 3.63311 2.622 2.711 -0.090 8 0.865 3.57612 2.972 1.855 1.117 7 -0.728 1.12713 0.939 1.000 -0.061 6 0.066 1.06514 -0.501 0.144 -0.645 11 -0.090 0.054
Intercept -Slope (Z) -Slope (Z)12.1 0.86 n 9 0.91
Bootstrap value = Predicted value + random residual value
14 More formally stated …
*
*
*
*
ˆ
ˆ
where
ln( )
a a a
a aa
a a
K K
K K
K C
The * represents a random draw of the residuals from the set available
15 Original and 1 bootstrap sample
Z = 0.86Z = 0.91
-2
-1
0
1
2
3
4
5
6
7
8
9
5 6 7 8 9 10 11 12 13 14 15
Original1 boot
16 n bootstrap samples
1 Bootstrap sample
Residual Random value for BootstrapRandom draw ln c@a
6 0.066 7.0568 0.865 6.999
14 -0.645 4.63412 1.117 5.54013 -0.061 3.50611 -0.090 2.6227 -0.728 1.1277 -0.728 0.271
11 -0.090 0.054
Slope (-Z)n 9 0.97
1 Bootstrap sample
Residual Random value for BootstrapRandom draw ln c@a
14 -0.645 6.34611 -0.090 6.0458 0.865 6.143
10 -0.099 4.32412 1.117 4.6846 0.066 2.777
14 -0.645 1.21111 -0.090 0.91011 -0.090 0.054
Slope (-Z)n 9 0.87
1 Bootstrap sample
Residual Random value for BootstrapRandom draw ln c@a
8 0.865 7.8559 -0.424 5.710
14 -0.645 4.6347 -0.728 3.695
11 -0.090 3.47714 -0.645 2.0676 0.066 1.921
11 -0.090 0.91014 -0.645 -0.501
Slope (-Z)n 9 0.91
1 Bootstrap sample
Residual Random value for BootstrapRandom draw ln c@a
14 -0.645 6.3466 0.066 6.2008 0.865 6.1436 0.066 4.489
10 -0.099 3.46810 -0.099 2.61213 -0.061 1.79514 -0.645 0.3558 0.865 1.008
Slope (-Z)n 9 0.82
1 Bootstrap sample
Residual Random value for BootstrapRandom draw ln c@a
9 -0.424 6.56610 -0.099 6.03512 1.117 6.39511 -0.090 4.33311 -0.090 3.47714 -0.645 2.06711 -0.090 1.76613 -0.061 0.9398 0.865 1.008
Slope (-Z)n 9 0.82
1 Bootstrap sample
Residual Random value for BootstrapRandom draw ln c@a
12 1.117 8.1078 0.865 6.9998 0.865 6.143
14 -0.645 3.77810 -0.099 3.46813 -0.061 2.65013 -0.061 1.7958 0.865 1.864
14 -0.645 -0.501
Slope (-Z)n 9 0.99
….
17
5000 bootstrap samples: Distribution
Bootstrap# Z1 0.8962 0.8583 0.8094 0.9865 0.9676 0.9207 0.8858 0.8589 0.891
10 0.91811 0.93212 0.89813 0.86514 0.90615 0.973... ...... ...
4992 0.8174993 0.6664994 0.8144995 0.8404996 0.8904997 0.7564998 0.8874999 0.7025000 0.716
0
50
100
150
200
250
300
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
1.05
Z value
# o
bse
rvat
ion
s
18 Bootstrap confidence intervals
With b bootstrap estimates of the parameter of interest b: Obtain confidence interval simply by
finding the percentile bootstrap estimates that contain the desired confidence.
More formally stated: An estimate of the100(1-)% CI around the sample estimate of is obtained from the two bootstrap estimates that contain the central 100(1-)% of all b bootstrap estimates.
19 80% & 95% confidence interval
0
50
100
150
200
250
300
350
0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 1.05
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
0.40
0.45
0.50
0.55
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
20 How many bootstraps?
This was more of an issue prior to the common availability of powerful computers.
However, in complicated models with many parameters issue is still valid and the number of bootstraps needed is a question of efficiency.
Can simply test the sensitivity of the parameters in question by running different number of runs ------- >
21
95% CI of Z and bootstrap numbers
0.6
0.7
0.8
0.9
1.0
1.1
0 200 400 600 800 1000
Number of bootstraps
95%
Confiden
ce inte
rval
In this simple case do not need much more than around 200 bootstraps to get CI
22
Different CV in catches: 1000 bootstraps
Z bootstrap distribution
0
20
40
60
80
100
120
140
160
180
200
0.60 0.70 0.80 0.90 1.00
n
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Cu
mu
lati
ve s
um
Z bootstrap distribution
0
20
40
60
80
100
120
140
160
180
200
0.60 0.70 0.80 0.90 1.00
n
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Cu
mu
lati
ve s
um
Z bootstrap distribution
0
20
40
60
80
100
120
140
160
180
200
0.60 0.70 0.80 0.90 1.00
n
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Cu
mu
lati
ve s
um
Z bootstrap distribution
0
20
40
60
80
100
120
140
160
180
200
0.60 0.70 0.80 0.90 1.00
n
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Cu
mu
lati
ve s
um
CV=0.1 CV=0.2
CV=0.3 CV=0.4
True Z=0.8
23 Final point
Bootstrap seems like the magic thing and it looks very impressive. It is a helpful explorative tool It represent the noise in the data given the
model assumption. Thus not a true confidence profile of the total error.
Should thus talk of “pseudo-confidence intervals” or “bootstrap confidence interval”.
The same rule applies with using bootstrap as with any other tools: Understand how it is implemented in the
particular software package that you may be using and ask if that is appropriate to the data set that you have.
Top Related