COMP STAT WEEK 4 DAY 2 (people.stat.sfu.ca/.../CompStat_Week4_Day2-2016.pdf)

COMP STAT WEEK 4 DAY 2 Chapter 7 of Statistical Computing with R Dave Campbell, www.stat.sfu.ca/~dac5



R tips!

Then:


> library(parallel)  # clusterSetRNGStream() is in parallel; snow's counterpart is clusterSetupRNGstream()

> cl = makeCluster(4, "PSOCK")

> clusterSetRNGStream(cl, iseed)

iseed can be set to make it "reproducibly random"

separate streams of length 2^127 for each node!

> clusterSetRNGStream(cl, 100:105)

> clusterCall(cl, rnorm, n = 1)
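A minimal sketch of what "reproducibly random" means in practice, assuming the base `parallel` package, a 2-worker cluster, and a hypothetical seed of 42 (none of these values come from the slides). Resetting the streams with the same `iseed` should give the same draws again, while each node still gets its own stream:

```r
library(parallel)

cl <- makeCluster(2)                      # 2 local workers (assumption)

clusterSetRNGStream(cl, iseed = 42)       # hypothetical seed
first <- unlist(clusterCall(cl, rnorm, n = 1))

clusterSetRNGStream(cl, iseed = 42)       # same seed: streams restart
second <- unlist(clusterCall(cl, rnorm, n = 1))

stopCluster(cl)

identical(first, second)                  # same seed, same draws
first[1] != first[2]                      # different node, different stream
```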


In genoud, you can include bounds and the option:

boundary.enforcement=2 enforces the boundary

boundary.enforcement=1 encourages the boundary

boundary.enforcement=0 ignores the boundary


Reproducibility step #1

Start your code with this:

>rm(list=ls())

why?


BootStrap:


Two flavours of bootstraps:

parametric, non-parametric

Goal is to estimate the standard error


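Parametric bootstrap:

For model $Y_1,\dots,Y_N \sim f(\theta)$

Estimate $\hat\theta$ from the data, by MLE, MOM,…

Generate B new data sets

$Y^1_1,\dots,Y^1_N \sim f(\hat\theta)$

$\vdots$

$Y^B_1,\dots,Y^B_N \sim f(\hat\theta)$

From which we obtain $\hat\theta^*_1,\dots,\hat\theta^*_B$ and

$E(\hat\theta^*) = \frac{1}{B}\sum_{b=1}^{B} \hat\theta^*_b$

Use

$SE_{\hat\theta} = \sqrt{\mathrm{var}\left(\hat\theta^*_1,\dots,\hat\theta^*_B\right)} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left[E(\hat\theta^*) - \hat\theta^*_b\right]^2}$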


Use the original data to get a point estimate and then explore the behaviour of the sampling distribution of the estimate when applied to simulated data.

In Bayes-land this is like taking your predictive distribution to give you your parameter distribution.
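A rough sketch of the parametric bootstrap standard error, assuming a Normal model and taking the mean as the quantity of interest. The simulated `y`, the sample size, and B = 1000 are illustrative choices, not values from the slides:

```r
set.seed(2016)
y <- rnorm(30, mean = 5, sd = 2)          # stand-in for the observed data

# Estimate theta = (mu, sigma) from the data by maximum likelihood
n <- length(y)
mu_hat <- mean(y)
sigma_hat <- sqrt(mean((y - mu_hat)^2))

# Generate B new data sets from f(theta_hat) and re-estimate on each one
B <- 1000
mu_star <- replicate(B, mean(rnorm(n, mean = mu_hat, sd = sigma_hat)))

# Bootstrap SE: standard deviation of the B bootstrap estimates
se_boot <- sd(mu_star)

# For the Normal mean the answer is known, which gives a sanity check
se_theory <- sigma_hat / sqrt(n)
c(bootstrap = se_boot, theory = se_theory)
```

Here the known answer makes the bootstrap redundant; the same three steps apply unchanged when no closed form for the SE exists.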


Parametric bootstrap (as stated initially) assumes the sampling distribution is symmetric and the model is good [i.e. if the model has a Normal f(.) and the real data truly has a Normal f(.)].

Symmetry is important if we want to use the standard error for confidence intervals.

We will use and then remove this assumption later when we get into confidence intervals. But first we will consider estimator bias.


R Studio

# 1-Basic parametric bootstrap interval for the mean


Estimate bias via parametric bootstrap samples

$E(\hat\theta) - \theta_{true} \approx E(\hat\theta^*) - \hat\theta$

This bias approximation is valid if the sampling distribution of $\hat\theta^* - \hat\theta$ is close to that of $\hat\theta - \theta_{true}$
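As an illustration of the bias approximation, here is a sketch that assumes a Normal model and uses the divide-by-n variance estimator, chosen only because its bias is known to be about $-\sigma^2/n$. The data, n = 10, and B = 5000 are made up for the example:

```r
set.seed(1)
n <- 10
y <- rnorm(n, mean = 0, sd = 3)           # stand-in for the observed data

# Estimator under study: variance with divisor n (biased downward)
var_n <- function(x) mean((x - mean(x))^2)
theta_hat <- var_n(y)

# Parametric bootstrap: simulate from the fitted Normal and re-estimate
B <- 5000
theta_star <- replicate(B, var_n(rnorm(n, mean = mean(y), sd = sqrt(theta_hat))))

# Bias estimate: E(theta*) - theta_hat, with E(theta*) as the bootstrap mean
bias_boot <- mean(theta_star) - theta_hat

# Plug-in version of the known bias -sigma^2 / n, for comparison
bias_expected <- -theta_hat / n
c(bootstrap = bias_boot, expected = bias_expected)
```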


R Studio

# 2 - estimated bias of the estimator:


Estimate bias via parametric bootstrap samples

*must have unbiased estimator

$E(\hat\theta) - \theta_{true} \approx E(\hat\theta^*) - \hat\theta$


R Studio

# 3-parametric bootstrap Small sample bias estimation for using the formula V1 as a variance estimate:


Basic bootstrap confidence interval is

$\hat\theta \pm t_{df,\,1-\alpha/2}\, SE_{\hat\theta}$

But that assumes a symmetric sampling distribution
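A short sketch of this interval, assuming an Exponential model with the rate as the parameter and df = n - 1. The model, the degrees of freedom, and the sample are assumptions made for illustration:

```r
set.seed(42)
y <- rexp(25, rate = 2)                   # stand-in for the observed data
n <- length(y)

rate_hat <- 1 / mean(y)                   # MLE of the Exponential rate

# Parametric bootstrap SE of the rate estimate
B <- 2000
rate_star <- replicate(B, 1 / mean(rexp(n, rate = rate_hat)))
se_boot <- sd(rate_star)

# theta_hat +/- t * SE, symmetric about theta_hat by construction
alpha <- 0.05
ci_t <- rate_hat + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se_boot
ci_t
```

The interval is forced to be symmetric even though the bootstrap distribution of `rate_star` is right-skewed, which is exactly the assumption being flagged.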


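Alternatively, use the bootstrap confidence interval:

$(\text{lower},\ \text{upper}) = \left(\hat\theta - \left[q(\hat\theta^*, 1-\alpha/2) - \hat\theta\right],\ \ \hat\theta - \left[q(\hat\theta^*, \alpha/2) - \hat\theta\right]\right)$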


Alternatively, use the bootstrap confidence interval:

$(\text{lower},\ \text{upper}) = \left(\hat\theta - \left[q(\hat\theta^*, 1-\alpha/2) - \hat\theta\right],\ \ \hat\theta - \left[q(\hat\theta^*, \alpha/2) - \hat\theta\right]\right)$

Note that q is the quantile function and, with α = 5%, this might look backwards: the upper quantile sets the lower limit.

For a skewed distribution, the data estimator is more likely to be in the direction of the skew (compared to the mean), and this formulation fixes that.
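To see the "backwards" quantiles at work, a sketch assuming an Exponential model with the rate as the parameter (an illustrative choice with a right-skewed sampling distribution; the data and B are made up):

```r
set.seed(42)
y <- rexp(25, rate = 2)                   # stand-in for the observed data
n <- length(y)
rate_hat <- 1 / mean(y)

B <- 2000
rate_star <- replicate(B, 1 / mean(rexp(n, rate = rate_hat)))

alpha <- 0.05
q_hi <- quantile(rate_star, 1 - alpha / 2, names = FALSE)
q_lo <- quantile(rate_star, alpha / 2, names = FALSE)

# The upper quantile sets the lower limit and vice versa
lower <- rate_hat - (q_hi - rate_hat)
upper <- rate_hat - (q_lo - rate_hat)

# Distance from the estimate to each end of the interval
c(below = rate_hat - lower, above = upper - rate_hat)
```

Because `rate_star` has a long right tail, the interval extends further below the estimate than above it, unlike the symmetric t-based interval.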


R Studio

# 4-obtain a Bootstrap CI


80th percentile of a sample: in RStudio

Bootstrap distribution vs. actual sampling distribution

Bias and CI for the 80th percentile

RStudio:

# 5-80th percentile


80th percentile of a sample: in RStudio

Bootstrap distribution vs. actual sampling distribution

Bias and CI for the 80th percentile

Coverage probability of the interval estimate

RStudio:

# 6-coverage probability
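One possible shape for such a coverage study, assuming standard Normal data so that the true 80th percentile is `qnorm(0.8)`. The sizes n = 10, B = 200, and 300 repetitions are illustrative and kept small so the sketch runs quickly:

```r
set.seed(123)
n <- 10        # sample size
B <- 200       # bootstrap replicates per interval
reps <- 300    # simulated data sets
alpha <- 0.05
theta_true <- qnorm(0.8)                  # true 80th percentile of N(0, 1)

# Build one parametric bootstrap interval from a fresh sample
one_interval <- function() {
  y <- rnorm(n)
  theta_hat <- quantile(y, 0.8, names = FALSE)
  theta_star <- replicate(
    B, quantile(rnorm(n, mean(y), sd(y)), 0.8, names = FALSE)
  )
  q <- quantile(theta_star, c(1 - alpha / 2, alpha / 2), names = FALSE)
  2 * theta_hat - q                       # (lower, upper)
}

# Coverage: fraction of intervals that contain the true value
covered <- replicate(reps, {
  ci <- one_interval()
  ci[1] <= theta_true && theta_true <= ci[2]
})
coverage <- mean(covered)
coverage                                  # compare with the nominal 1 - alpha
```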


Note the potential sources of error

Parametric Bootstrap:

model error: is the assumed form of f(.) correct?

simulation error (is the likelihood bimodal?)

statistical error: sampling distribution values unknown (like using N instead of t when σ is unknown)

Nonparametric Bootstrap:

????


non-parametric Bootstrap

For data: $Y_1,\dots,Y_n$

and estimator of something useful: $\hat\theta(Y)$, which is a function of Y

produce a new sample $Y^*_1,\dots,Y^*_n$ chosen by resampling n values from $Y_1,\dots,Y_n$ with replacement.

Compute $\hat\theta(Y^*)$

Repeat sampling and estimation B times. Use the B estimates to get an interval estimate for $\hat\theta(Y)$ based on the sampling variability of $\hat\theta(Y^*)$
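The resampling recipe can be sketched as follows, assuming the median as the "something useful" and a simulated sample as stand-in data (both are illustrative choices, as is B = 2000):

```r
set.seed(7)
y <- rexp(40, rate = 1)                   # stand-in for the observed data

theta <- function(x) median(x)            # estimator, a function of the data
theta_hat <- theta(y)

# Resample n values from y with replacement, re-estimate, repeat B times
B <- 2000
theta_star <- replicate(
  B, theta(sample(y, size = length(y), replace = TRUE))
)

# Sampling variability of theta(Y*) gives the SE and an interval estimate
se_boot <- sd(theta_star)
ci <- 2 * theta_hat - quantile(theta_star, c(0.975, 0.025), names = FALSE)
ci
```

No model f(.) appears anywhere: the empirical distribution of `y` plays the role that $f(\hat\theta)$ played in the parametric version.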


Using a parallel algorithm - write pseudocode

Use one of:

> clusterApply sends jobs to cores in a round-robin style, like dealing cards.

> clusterApplyLB lets cores pull in new jobs as soon as the old one is done. Basically always do this with more jobs than cores.


non-parametric bootstrap

Iris data is also in “Statistical Computing with R” Ch4 (plotting)

RStudio: for the iris data set, obtain a point and interval estimate for the 33rd percentile. Why 33%?

Algorithm is parallel, but how to avoid communication overhead?
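One way to cut communication overhead is to send each worker a whole chunk of bootstrap replicates rather than one replicate per job. The sketch below assumes the `parallel` package, 2 workers, 8 chunks, and the `Sepal.Length` column; the slides do not say which iris variable to use, so that choice is hypothetical:

```r
library(parallel)

y <- iris$Sepal.Length    # assumption: column not specified in the slides
B <- 2000
n_chunks <- 8             # more jobs than cores, but far fewer than B

# One job = a chunk of replicates, so each message carries real work
boot_chunk <- function(chunk_size, y, p) {
  replicate(chunk_size, quantile(sample(y, replace = TRUE), p, names = FALSE))
}

cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 2016)     # hypothetical seed
chunks <- rep(B / n_chunks, n_chunks)
theta_star <- unlist(clusterApplyLB(cl, chunks, boot_chunk, y = y, p = 0.33))
stopCluster(cl)

# Point estimate and interval for the 33rd percentile
theta_hat <- quantile(y, 0.33, names = FALSE)
ci <- 2 * theta_hat - quantile(theta_star, c(0.975, 0.025), names = FALSE)
c(estimate = theta_hat, lower = ci[1], upper = ci[2])
```

With load balancing, which worker picks up which chunk can vary between runs, so the exact draws may differ even with a fixed seed; `clusterApply` trades the balancing for a fixed assignment.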


Note the potential sources of error

Parametric Bootstrap:

model error: is the assumed form of f(.) correct?

simulation error (is the likelihood bimodal?)

statistical error: sampling distribution values unknown (like using N instead of t when σ is unknown)

Nonparametric Bootstrap:

simulation error


Bootstrap fails

The bootstrap fails when the estimator is very sensitive to only a few points in the data, so the data are not a great representation of the population.

The bootstrap tends to work when small data changes lead to small changes in the estimator.

If you think the bootstrap might be dodgy, try a simulation study (as we did with the 80th percentile and n=10).

Don't bother bootstrapping the min or the max.

Many other variants exist for bootstrapping residuals, …
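A quick sketch of why the max is a bad candidate, assuming Uniform(0, 1) data with n = 50 (illustrative values). A resample contains the original maximum with probability $1 - (1 - 1/n)^n$, roughly 0.63, so the bootstrap distribution piles up on a single point:

```r
set.seed(1)
n <- 50
y <- runif(n)                             # stand-in for the observed data

# Nonparametric bootstrap of the sample maximum
B <- 2000
max_star <- replicate(B, max(sample(y, replace = TRUE)))

# Share of bootstrap maxima that are exactly the observed maximum
prop_equal <- mean(max_star == max(y))

# Probability that a resample of size n includes the largest point
prop_theory <- 1 - (1 - 1 / n)^n
c(simulated = prop_equal, theory = prop_theory)

# The bootstrap can never produce a value above the observed max
all(max_star <= max(y))
```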