
Post on 26-Jun-2020


Bootstrapping

PUBH 7401: Fundamentals of Biostatistical Inference

Eric F. Lock
UMN Division of Biostatistics, SPH

elock@umn.edu

10/18/2018

PUBH 7401: Fundamentals of Biostatistical Inference Bootstrapping

Recall: Sampling distributions

Question: how do I find/approximate the sampling distribution of a statistic?

1. Derivation from the probability distribution of the Xi's
2. Simulation from the probability distribution of the Xi's
3. Approximation using asymptotic theory
4. Bootstrapping

Option 1.) is often not possible
Option 3.) is nice, but...

- The Central Limit Theorem is only for a sample mean
- May not be a good approximation (especially for small sample size n)


Sampling Distribution Via Simulation

In an ideal world, to learn about the sampling distribution we would

1. Take lots of different samples from the population
2. Calculate the statistic in each of those samples
3. Plot the sampling distribution

This is totally unreasonable, as we would never take multiple samples from the population


Sampling Distribution of Statistics Using Simulation

In a slightly more realistic world, to learn about the sampling distribution we could

1. Simulate lots of different samples (of the same size) from the population
2. Calculate the statistic in each of those samples
3. Plot the sampling distribution

(But this is challenging if we do not know the distribution of the population)
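The three simulation steps above can be sketched as follows. This is a minimal illustration, not part of the original slides: the exponential population, the sample size n = 30, the choice of the median as the statistic, and the helper name `simulate_sampling_distribution` are all assumptions made for the example.

```python
import random
import statistics

def simulate_sampling_distribution(draw_one, n, statistic, n_sims=2000, seed=0):
    """Approximate the sampling distribution of `statistic` by simulation.

    draw_one(rng) must return one observation from the assumed population.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        sample = [draw_one(rng) for _ in range(n)]  # step 1: simulate a sample of size n
        results.append(statistic(sample))           # step 2: compute the statistic
    return results                                  # step 3: summarize or plot these values

# Assumed population (illustration only): exponential with rate 1.
medians = simulate_sampling_distribution(
    draw_one=lambda rng: rng.expovariate(1.0),
    n=30,
    statistic=statistics.median,
)
print(f"mean of simulated medians: {statistics.mean(medians):.3f}")
print(f"SD of simulated medians:   {statistics.stdev(medians):.3f}")
```

This only works because the population distribution was specified up front, which is exactly what is usually unknown in practice.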


Idea of the (nonparametric) bootstrap

- In practice, we only collect a single sample from the population
  - A single sample statistic
- Want to know how statistics could vary with different samples
- Bootstrapping: Using sample data, create an artificial “population” to sample from
  - Many copies of the original sample
- Reese’s Pieces: http://ericfrazerlock.com/Reeses_bootstrap.pptx


Idea of the (nonparametric) bootstrap

- A bootstrap “population” from a sample (figure not reproduced in this transcript)


Sampling with replacement

- To draw a sample from the bootstrap population, sample with replacement from the sample we have.
- Each unit can be selected more than once.
- The bootstrap sample is of the original sample size n.
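A minimal sketch of one bootstrap draw, using Python's standard `random.choices`, which samples with replacement. The five unit labels are hypothetical and stand in for any observed sample.

```python
import random

rng = random.Random(1)                  # fixed seed so the draw is reproducible

original = ["A", "B", "C", "D", "E"]    # hypothetical sample of n = 5 units
n = len(original)

# n draws with replacement: some units may repeat, others may be left out.
boot = rng.choices(original, k=n)

print("bootstrap sample:", boot)
print("units left out:  ", sorted(set(original) - set(boot)))
```

Because each draw is independent, a given unit is left out of one bootstrap sample with probability (1 - 1/n)^n, which approaches e^(-1), or about 0.368, as n grows.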


Bootstrap terminology

- A bootstrap sample is a random sample taken with replacement from the original sample, of the same size as the original sample.
- A bootstrap statistic is the statistic computed from a bootstrap sample.
  - E.g., sample mean, sample proportion, correlation, etc.
- A bootstrap distribution is the distribution of many bootstrap statistics.
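The three terms map directly onto code. The sketch below is illustrative only: the function name `bootstrap_distribution`, the ten data values, and the choice of 1,000 resamples are assumptions, not taken from the slides.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, n_boot=1000, seed=0):
    """Return n_boot bootstrap statistics computed from `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    boot_stats = []
    for _ in range(n_boot):
        boot_sample = rng.choices(sample, k=n)      # a bootstrap sample (size n, with replacement)
        boot_stats.append(statistic(boot_sample))   # a bootstrap statistic
    return boot_stats                               # together: the bootstrap distribution

# Hypothetical data, used only to exercise the function.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
boot_means = bootstrap_distribution(sample, statistics.mean)

print(f"sample mean:             {statistics.mean(sample):.3f}")
print(f"mean of bootstrap means: {statistics.mean(boot_means):.3f}")
print(f"SD of bootstrap means:   {statistics.stdev(boot_means):.3f}")
```

Any function of a list can be passed as `statistic`, for example `statistics.median`.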


Finding a bootstrap distribution

http://ericfrazerlock.com/bootstrap_analogy.pptx


Golden rule of bootstrapping

- Bootstrap statistics are to the original sample statistic as the original sample statistic is to the population parameter
- The bootstrap distribution approximates the shape and spread (variance) of the unknown sampling distribution of the statistic.
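One way to write this analogy in symbols is given below. The notation is an assumption introduced here, not taken from the slides: θ is the population parameter, θ̂ the statistic from the original sample, and θ̂* a bootstrap statistic.

```latex
\hat{\theta}^{*} - \hat{\theta} \;\overset{d}{\approx}\; \hat{\theta} - \theta,
\qquad
\operatorname{SD}\bigl(\hat{\theta}^{*} \mid X_1, \dots, X_n\bigr) \;\approx\; \operatorname{SE}\bigl(\hat{\theta}\bigr).
```

Read this way, the bootstrap distribution is centered near the sample statistic θ̂ rather than near θ, so it is its shape and spread, not its center, that carry over to the sampling distribution.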


Example: Body temperatures

- Consider body temperatures for a random sample of 50 individuals [1]
- The mean body temperature in the sample is x̄ = 98.26°F
- Generate a bootstrap distribution for the sample mean
- http://www.lock5stat.com/StatKey/bootstrap_1_quant/bootstrap_1_quant.html

[1] https://www.tandfonline.com/doi/full/10.1080/10691898.1996.11910512

Probability interpretation of bootstrap

- Recall: For a simple random sample, X1, . . . , Xn are independent and come from the same probability distribution
- In practice, we may not know the probability distribution
- Our “best guess” of the population pmf is often the empirical pmf. That is, we assume that

$$\hat{P}(X = x) = \frac{\text{\# of times } x \text{ is observed in the sample}}{n}$$

- This can be a good approximation even if the distribution is continuous
- When we bootstrap, we simulate from this distribution
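The link between the empirical pmf and resampling can be sketched as below. The helper name `empirical_pmf` and the six data values are hypothetical. Drawing n values from the empirical pmf is the same as drawing n values with replacement from the sample.

```python
import random
from collections import Counter

def empirical_pmf(sample):
    """Map each distinct value x to (# of times x is observed) / n."""
    n = len(sample)
    return {x: count / n for x, count in Counter(sample).items()}

sample = [3, 1, 3, 2, 3, 1]           # hypothetical sample, n = 6
pmf = empirical_pmf(sample)
print(pmf)                            # 3 has mass 3/6, 1 has mass 2/6, 2 has mass 1/6

# Simulating from the empirical pmf: weighted draws over the distinct values.
rng = random.Random(0)
values = list(pmf)
weights = [pmf[x] for x in values]
draws = rng.choices(values, weights=weights, k=len(sample))
print(draws)
```

In practice the pmf is rarely built explicitly. Calling `rng.choices(sample, k=len(sample))` on the raw sample draws from the same distribution.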


Empirical Distribution

Consider previous health cost data for El Goog
Assume that we took a sample of size 5 from the cost data:

## [1] 479 489 725 1955 2809

The empirical distribution p̂(x) is

x       479   489   725   1955   2809
p̂(x)    0.2   0.2   0.2   0.2    0.2
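As a worked check on this table, the sketch below builds p̂(x) from the five sampled costs, computes the mean and variance under p̂, and compares the exact standard deviation of a bootstrap mean with a simulated one. The variable names and the choice of 1,000 resamples are illustrative assumptions. Only the five values shown above are used, not the full cost dataset.

```python
import random
import statistics
from collections import Counter

costs = [479, 489, 725, 1955, 2809]   # the five sampled costs shown above
n = len(costs)

# Empirical pmf: every distinct value gets mass (count / n) = 0.2 here.
p_hat = {x: count / n for x, count in Counter(costs).items()}

# Mean and variance of the distribution p_hat (note: divides by n, not n - 1).
mean_hat = sum(x * p for x, p in p_hat.items())
var_hat = sum((x - mean_hat) ** 2 * p for x, p in p_hat.items())

# A bootstrap mean averages n independent draws from p_hat,
# so its standard deviation given the sample is sqrt(var_hat / n).
se_boot_exact = (var_hat / n) ** 0.5

# Monte Carlo version: 1,000 bootstrap means.
rng = random.Random(0)
boot_means = [statistics.mean(rng.choices(costs, k=n)) for _ in range(1000)]

print("p_hat:", p_hat)
print(f"mean under p_hat:        {mean_hat:.1f}")
print(f"exact SD of boot mean:   {se_boot_exact:.1f}")
print(f"simulated SD (B = 1000): {statistics.stdev(boot_means):.1f}")
```

The mean under p̂ is just the sample mean, (479 + 489 + 725 + 1955 + 2809) / 5 = 1291.4, and the simulated standard deviation should land close to the exact value, up to Monte Carlo error.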


Bootstrap Sampling Distribution

1,000 bootstrap simulations with the original n = 50 dataset

[Figure: bootstrap distribution. x-axis: Mean Cost in Bootstrap Samples (ticks at 1500, 2000, 2500); y-axis: Density (0.0000 to 0.0020)]
