Download - Biostatistics IV An introduction to bootstrap. 2 Getting something from nothing? In Rudolph Erich Raspe's tale, Baron Munchausen had, in one of his many.

Biostatistics IV

An introduction to bootstrap

2 Getting something from nothing?

In Rudolph Erich Raspe's tale, Baron Munchausen had, in one of his many adventurous travels, fallen to the bottom of a deep lake and just as he was to succumb to his fate he thought to pull himself up by his own BOOTSTRAP.

The original version of the tale is in German, where Munchausen actually draws himself up by the hair, not the bootstraps. The figure on the left refers to German version of the story.

Efron and LePage gave the method they developed the name Bootstrap in honour of the unbelievable stories that the Baron told of his travels.

3 The general problem

“We have a set of real-valued observations x1, ... xn independently sampled from an unknown probability distribution F. We are interested in estimating some parameter by using the information in the sample data with an estimator -hat = t(x). Some measure of the estimate’s accuracy is as important as the estimate itself; we want a standard error of -hat and, even better a confidence interval on the true value .”

Efron and LePage (1992)

4 Our sample mean

From the 4 samples we take a mean and calculate a standard deviation! By doing this we are assuming that the

population is normally distributed But what if we don’t know the distribution?

0

50

100

150

200

250

300

350

400

450

0 5 10 15 20 25 30 35 40

4 sample

5 The solution: Bootstrap

Generate large numbers of “bootstrap” data sets, x1, x2, x3, ... xb, from the original data.

The observations in each data set is generated by a random draw, with replacement, from the observations in the original data set.

Each data set has the same number of observation (n) as the original data set.

Refit the model to each bootstrap data set. Compute the statistics of interest (probability

profile, standard deviations, confidence intervals) from the results for each model fit.

6 Resampling the sample

Within a bootstrap sample the values are drawn randomly from the original sample.

Could thus get some measurements more than once within a particular bootstrap sample

The sample Bootsrap sample numberi Data 1 2 3 4 … n1 3 7 8 5 3 72 5 3 9 3 7 113 7 3 11 7 8 34 8 5 11 5 9 115 9 8 5 8 3 96 11 9 8 3 3 3

The mean 7.17 5.83 8.67 5.17 5.50 7.33The std. Deviation 2.86 2.56 2.25 2.04 2.81 3.67

7 Bootstrap: Why?

When the sample contains all the available information about the population one can act as if the sample is really the population for the purpose of estimating the sampling distribution.

Sampling with replacement is consistent with sampling a population that is effectively infinite - treat the sample as the total population.

Elegant, powerful and easy :-)

8 Bootstrap: When?

When sample cannot be represented by a distribution, especially if the underlying population distribution is not known.

If one knows the distribution there is little advantage to using bootstrap. There is however no harm in

bootstrapping such data sets!

Bootstrapping the residuals

10 Resampling residuals

In a time series model one must maintain the time order of the data.

Thus, resample the residuals with replacement from the optimum fit.

The randomly sampled residuals are applied to the optimum fitted values to generate new bootstrap samples.

Process repeated n-times to obtain a probability profile of the value of interest.

Use a catch curve analysis as an illustration

11 Original data set and statistics

z = -0.86

-10123456789

0 5 10 15

Age

ln c

@a

Data set: c@a of one cohort

Analysis: Slope estimate of fully recruited fish

Z = -Slope = Total mortality

Task: Obtain some information about the confidence interval of the Z using bootstrapping techniques.

12 The residuals to resample

Z = 0.86

-2

-1

0

1

2

3

4

5

6

7

8

9

5 6 7 8 9 10 11 12 13 14 15

13 One bootstrap sample

Original data set 1 Bootstrap sample

Residual Observed Predicted Random value from Bootstrap

Age ln c@a ln c@a obs-pred draw (age) draw ln c@a6 7.056 6.990 0.066 12 1.117 8.1077 5.406 6.134 -0.728 9 -0.424 5.7108 6.143 5.279 0.865 6 0.066 5.3449 3.999 4.423 -0.424 14 -0.645 3.778

10 3.468 3.567 -0.099 6 0.066 3.63311 2.622 2.711 -0.090 8 0.865 3.57612 2.972 1.855 1.117 7 -0.728 1.12713 0.939 1.000 -0.061 6 0.066 1.06514 -0.501 0.144 -0.645 11 -0.090 0.054

Intercept -Slope (Z) -Slope (Z)12.1 0.86 n 9 0.91

Bootstrap value = Predicted value + random residual value

14 More formally stated …

*

*

*

*

ˆ

ˆ

where

ln( )

a a a

a aa

a a

K K

K K

K C

The * represents a random draw of the residuals from the set available

15 Original and 1 bootstrap sample

Z = 0.86Z = 0.91

-2

-1

0

1

2

3

4

5

6

7

8

9

5 6 7 8 9 10 11 12 13 14 15

Original1 boot

16 n bootstrap samples

1 Bootstrap sample

Residual Random value for BootstrapRandom draw ln c@a

6 0.066 7.0568 0.865 6.999

14 -0.645 4.63412 1.117 5.54013 -0.061 3.50611 -0.090 2.6227 -0.728 1.1277 -0.728 0.271

11 -0.090 0.054

Slope (-Z)n 9 0.97

1 Bootstrap sample


14 -0.645 6.34611 -0.090 6.0458 0.865 6.143

10 -0.099 4.32412 1.117 4.6846 0.066 2.777

14 -0.645 1.21111 -0.090 0.91011 -0.090 0.054

Slope (-Z)n 9 0.87

1 Bootstrap sample


8 0.865 7.8559 -0.424 5.710

14 -0.645 4.6347 -0.728 3.695

11 -0.090 3.47714 -0.645 2.0676 0.066 1.921

11 -0.090 0.91014 -0.645 -0.501

Slope (-Z)n 9 0.91

1 Bootstrap sample


14 -0.645 6.3466 0.066 6.2008 0.865 6.1436 0.066 4.489

10 -0.099 3.46810 -0.099 2.61213 -0.061 1.79514 -0.645 0.3558 0.865 1.008

Slope (-Z)n 9 0.82

1 Bootstrap sample


9 -0.424 6.56610 -0.099 6.03512 1.117 6.39511 -0.090 4.33311 -0.090 3.47714 -0.645 2.06711 -0.090 1.76613 -0.061 0.9398 0.865 1.008

Slope (-Z)n 9 0.82

1 Bootstrap sample


12 1.117 8.1078 0.865 6.9998 0.865 6.143

14 -0.645 3.77810 -0.099 3.46813 -0.061 2.65013 -0.061 1.7958 0.865 1.864

14 -0.645 -0.501

Slope (-Z)n 9 0.99

….

17

5000 bootstrap samples: Distribution

Bootstrap# Z1 0.8962 0.8583 0.8094 0.9865 0.9676 0.9207 0.8858 0.8589 0.891

10 0.91811 0.93212 0.89813 0.86514 0.90615 0.973... ...... ...

4992 0.8174993 0.6664994 0.8144995 0.8404996 0.8904997 0.7564998 0.8874999 0.7025000 0.716

0

50

100

150

200

250

300

0.60

0.65

0.70

0.75

0.80

0.85

0.90

0.95

1.00

1.05

Z value

# o

bse

rvat

ion

s

18 Bootstrap confidence intervals

With b bootstrap estimates of the parameter of interest b: Obtain confidence interval simply by

finding the percentile bootstrap estimates that contain the desired confidence.

More formally stated: An estimate of the100(1-)% CI around the sample estimate of is obtained from the two bootstrap estimates that contain the central 100(1-)% of all b bootstrap estimates.

19 80% & 95% confidence interval

0

50

100

150

200

250

300

350

0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 1.05

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

0.50

0.55

0.60

0.65

0.70

0.75

0.80

0.85

0.90

0.95

1.00

20 How many bootstraps?

This was more of an issue prior to the common availability of powerful computers.

However, in complicated models with many parameters issue is still valid and the number of bootstraps needed is a question of efficiency.

Can simply test the sensitivity of the parameters in question by running different number of runs ------- >

21

95% CI of Z and bootstrap numbers

0.6

0.7

0.8

0.9

1.0

1.1

0 200 400 600 800 1000

Number of bootstraps

95%

Confiden

ce inte

rval

In this simple case do not need much more than around 200 bootstraps to get CI

22

Different CV in catches: 1000 bootstraps

Z bootstrap distribution

0

20

40

60

80

100

120

140

160

180

200

0.60 0.70 0.80 0.90 1.00

n

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Cu

mu

lati

ve s

um


0

20

40

60

80

100

120

140

160

180

200

0.60 0.70 0.80 0.90 1.00

n

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Cu

mu

lati

ve s

um


0

20

40

60

80

100

120

140

160

180

200

0.60 0.70 0.80 0.90 1.00

n

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Cu

mu

lati

ve s

um


0

20

40

60

80

100

120

140

160

180

200

0.60 0.70 0.80 0.90 1.00

n

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Cu

mu

lati

ve s

um

CV=0.1 CV=0.2

CV=0.3 CV=0.4

True Z=0.8

23 Final point

Bootstrap seems like the magic thing and it looks very impressive. It is a helpful explorative tool It represent the noise in the data given the

model assumption. Thus not a true confidence profile of the total error.

Should thus talk of “pseudo-confidence intervals” or “bootstrap confidence interval”.

The same rule applies with using bootstrap as with any other tools: Understand how it is implemented in the

particular software package that you may be using and ask if that is appropriate to the data set that you have.