Bootstrapping
PUBH 7401: Fundamentals of Biostatistical Inference
Eric F. Lock
UMN Division of Biostatistics, SPH
10/18/2018
PUBH 7401: Fundamentals of Biostatistical Inference Bootstrapping
Recall: Sampling distributions
Question: how do I find/approximate the sampling distribution of a statistic?
1. Derivation from the probability distribution of the $X_i$'s
2. Simulation from the probability distribution of the $X_i$'s
3. Approximation using asymptotic theory
4. Bootstrapping
Option 1 is often not possible. Option 3 is nice, but...
- The Central Limit Theorem is only for a sample mean
- May not be a good approximation (especially for small sample size n)
Sampling Distribution Via Simulation
In an ideal world, to learn about the sampling distribution we would
1. Take lots of different samples from the population
2. Calculate the statistic in each of those samples
3. Plot the sampling distribution
This is totally unreasonable, as we would never take multiple samples from the population.
Sampling Distribution of Statistics Using Simulation
In a slightly more realistic world, to learn about the sampling distribution we could
1. Simulate lots of different samples (of the same size) from the population
2. Calculate the statistic in each of those samples
3. Plot the sampling distribution
(But this is challenging if we do not know the distribution of the population.)
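The three steps above can be sketched in a few lines of Python. This is only an illustration: the exponential population with mean 2, the sample size of 30, and the function name `simulate_sampling_distribution` are all assumptions made for the example, not part of the course material.

```python
import random
import statistics

def simulate_sampling_distribution(draw_one, n, statistic, n_sims=2000, seed=0):
    """Approximate the sampling distribution of `statistic` by simulation.

    draw_one: function that takes a random.Random and returns one population draw.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_sims):
        # Step 1: simulate a sample of size n from the (assumed known) population.
        sample = [draw_one(rng) for _ in range(n)]
        # Step 2: calculate the statistic in that sample.
        stats.append(statistic(sample))
    # Step 3: the returned values can be plotted as a histogram.
    return stats

# Hypothetical population: exponential with mean 2 (rate 1/2).
medians = simulate_sampling_distribution(
    draw_one=lambda rng: rng.expovariate(1 / 2),
    n=30,
    statistic=statistics.median,
)
print(len(medians), round(statistics.mean(medians), 2))
```

Note that the statistic here is the sample median, a case the Central Limit Theorem for means does not cover directly.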
Idea of the (nonparametric) bootstrap
- In practice, we only collect a single sample from the population
    - A single sample statistic
- Want to know how statistics could vary with different samples
- Bootstrapping: Using sample data, create an artificial “population” to sample from
    - Many copies of the original sample
- Reese’s Pieces: http://ericfrazerlock.com/Reeses_bootstrap.pptx
Idea of the (nonparametric) bootstrap
- A bootstrap “population” from a sample: [figure]
Sampling with replacement
- To draw a sample from the bootstrap population, sample with replacement from the sample we have.
- Each unit can be selected more than once.
- The bootstrap sample is of the original sample size n.
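A minimal sketch of drawing one bootstrap sample, using `random.choices` from the Python standard library (which samples with replacement). The data values and the helper name `bootstrap_sample` are made up for illustration.

```python
import random

def bootstrap_sample(sample, rng):
    """Draw n values with replacement from a sample of size n."""
    return rng.choices(sample, k=len(sample))

rng = random.Random(1)
original = [3.1, 4.7, 5.0, 6.2, 8.9]  # made-up values for illustration
resample = bootstrap_sample(original, rng)
print(resample)  # same length as `original`; values may repeat or be missing
```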
Bootstrap terminology
- A bootstrap sample is a random sample taken with replacement from the original sample, of the same size as the original sample.
- A bootstrap statistic is the statistic computed from a bootstrap sample.
    - E.g., sample mean, sample proportion, correlation, etc.
- A bootstrap distribution is the distribution of many bootstrap statistics.
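The three terms map onto code fairly directly. The sketch below is one possible way to write it; the function name `bootstrap_distribution`, the default of 1,000 resamples, and the data values are assumptions for the example.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, n_boot=1000, seed=0):
    """Return n_boot bootstrap statistics computed from `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Each rng.choices(...) call is one bootstrap sample;
    # statistic(...) of it is one bootstrap statistic.
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]

sample = [2.3, 3.1, 3.8, 4.0, 4.4, 5.2, 5.9, 6.1, 7.5, 9.0]  # made-up data
boot_means = bootstrap_distribution(sample, statistics.mean)

# The standard deviation of the bootstrap statistics is a bootstrap standard error.
boot_se = statistics.stdev(boot_means)
print(round(statistics.mean(sample), 2), round(boot_se, 2))
```

Passing the statistic as a function means the same code works for a median, a proportion, or any other statistic of a single sample.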
Finding a bootstrap distribution
http://ericfrazerlock.com/bootstrap_analogy.pptx
Golden rule of bootstrapping
- Bootstrap statistics are to the original sample statistic as the original sample statistic is to the population parameter
- The bootstrap distribution approximates the shape and spread (variance) of the unknown sampling distribution of the statistic; it is centered at the original sample statistic rather than at the population parameter.
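One way to check this rule numerically is to use a statistic whose standard error has a known formula, the sample mean. The sketch below assumes a hypothetical Normal(50, 10) population and n = 50; both are choices made for the illustration only.

```python
import math
import random
import statistics

rng = random.Random(42)

# Hypothetical population (an assumption for this sketch): Normal(mean=50, sd=10).
n = 50
sample = [rng.gauss(50, 10) for _ in range(n)]
x_bar = statistics.mean(sample)

# Bootstrap distribution of the sample mean (2,000 resamples).
boot_means = [
    statistics.mean(rng.choices(sample, k=n)) for _ in range(2000)
]

boot_center = statistics.mean(boot_means)  # close to x_bar, not necessarily to 50
boot_se = statistics.stdev(boot_means)     # spread of the bootstrap distribution
formula_se = statistics.stdev(sample) / math.sqrt(n)  # usual SE estimate for a mean

print(round(x_bar, 2), round(boot_center, 2))
print(round(boot_se, 2), round(formula_se, 2))
```

The two standard errors should be close to each other, while the bootstrap distribution sits around the sample mean rather than the population mean.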
Example: Body temperatures
- Consider body temperatures for a random sample of 50 individuals [1]
- The mean body temperature in the sample is x̄ = 98.26°F
- Generate a bootstrap distribution for the sample mean
- http://www.lock5stat.com/StatKey/bootstrap_1_quant/bootstrap_1_quant.html
[1] https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910512
Probability interpretation of bootstrap
- Recall: For a simple random sample, $X_1, \ldots, X_n$ are independent and from the same probability distribution
- In practice, we may not know the probability distribution
- Our “best guess” of the population pmf is often the empirical pmf. That is, we assume that
$$\hat{P}(X = x) = \frac{\text{\# of times } x \text{ is observed in the sample}}{n}$$
- This can be a good approximation even if the distribution is continuous
- When we bootstrap, we simulate from this distribution
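The empirical pmf is just a table of relative frequencies. A small sketch, with a made-up sample and a hypothetical helper name `empirical_pmf`:

```python
from collections import Counter
from fractions import Fraction

def empirical_pmf(sample):
    """Map each observed value x to (# of times x is observed) / n."""
    n = len(sample)
    counts = Counter(sample)
    return {x: Fraction(c, n) for x, c in counts.items()}

pmf = empirical_pmf([1, 2, 2, 3, 3, 3])  # made-up sample
for x, p in sorted(pmf.items()):
    print(x, p)
```

Exact fractions are used here only so that the probabilities visibly sum to 1; floating-point division would work just as well.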
Empirical Distribution
Consider previous health cost data for El Goog
Assume that we took a sample of size 5 from the cost data:
## [1] 479 489 725 1955 2809
The empirical distribution p̂(x) is
x       479   489   725   1955   2809
p̂(x)    0.2   0.2   0.2   0.2    0.2
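Using the five cost values above, the sketch below builds p̂(x), computes the mean of the empirical distribution, and draws one bootstrap sample from it. The variable names and the random seed are arbitrary choices for the example.

```python
import random

costs = [479, 489, 725, 1955, 2809]  # the sample of size 5 shown above
n = len(costs)

# Empirical distribution: each distinct value gets probability count / n.
p_hat = {x: costs.count(x) / n for x in costs}

# The mean of the empirical distribution equals the sample mean.
emp_mean = sum(x * p for x, p in p_hat.items())

# Simulating n draws from p_hat is the same as resampling with replacement.
rng = random.Random(0)
resample = rng.choices(list(p_hat), weights=list(p_hat.values()), k=n)

print(p_hat)
print(emp_mean, sum(costs) / n)
print(resample)
```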
Bootstrap Sampling Distribution
1,000 bootstrap simulations with the original n = 50 dataset
[Figure: histogram of Mean Cost in Bootstrap Samples (x-axis, 1500 to 2500) against Density (y-axis, 0.0000 to 0.0020)]