COMP STAT WEEK 4 DAY 2people.stat.sfu.ca/.../CompStat_Week4_Day2-2016.pdf · Parametric bootstrap:...

COMP STAT WEEK 4 DAY 2

Chapter 7 of Statistical Computing with R

Dave Campbell, www.stat.sfu.ca/~dac5

http://www.stat.sfu.ca/~dac5

R tips!

Then:

>library(snow)

>cl = makeCluster(4,"SOCK")

> clusterSetRNGStream(cl , iseed)

iseed can be set to make it “reproducibly random”

separate streams with period of 2^127 for each node!

>clusterSetRNGStream(cl , 100:105)

> clusterCall(cl,rnorm,n=1)
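A minimal sketch of what "reproducibly random" means in practice. It assumes the `parallel` package (which ships with R and provides `clusterSetRNGStream()`) rather than `snow`, and a hypothetical cluster of two local workers with seed 100.

```r
library(parallel)  # assumption: using parallel, which provides clusterSetRNGStream()

cl <- makeCluster(2)                   # two local workers

clusterSetRNGStream(cl, iseed = 100)   # give each worker its own random stream
draw1 <- unlist(clusterCall(cl, rnorm, n = 1))

clusterSetRNGStream(cl, iseed = 100)   # reset the streams with the same seed
draw2 <- unlist(clusterCall(cl, rnorm, n = 1))

stopCluster(cl)

identical(draw1, draw2)   # same iseed, same draws on every worker
draw1[1] != draw1[2]      # but the two workers do not share a stream
```

Resetting the seed before each parallel run is what makes a parallel simulation repeatable.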

In genoud, you can include bounds and the option:

boundary.enforcement=2 enforces the boundary

boundary.enforcement=1 encourages the boundary

boundary.enforcement=0 ignores the boundary

Reproducibility step #1

Start your code with this:

>rm(list=ls())

why?

Bootstrap:

Two flavours of bootstraps:

parametric, non-parametric

Goal is to estimate the standard error

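Parametric bootstrap:

For model $Y_1,\dots,Y_N \sim f(\theta)$

Estimate $\hat\theta$ from the data, by MLE, MOM,…

Generate B new data sets

$Y^1_1,\dots,Y^1_N \sim f(\hat\theta)$

$\vdots$

$Y^B_1,\dots,Y^B_N \sim f(\hat\theta)$

From which we obtain $\hat\theta^*_1,\dots,\hat\theta^*_B$ and $E(\hat\theta^*) = \frac{1}{B}\sum_{b=1}^{B}\hat\theta^*_b$

Use $SE_{\hat\theta}^2 = \mathrm{var}\left(\hat\theta^*_1,\dots,\hat\theta^*_B\right) = \frac{1}{B-1}\sum_{b=1}^{B}\left[E(\hat\theta^*) - \hat\theta^*_b\right]^2$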

Use the original data to get a point estimate and then explore the behaviour of the sampling distribution of the estimate when applied to simulated data.

In Bayes-land this is like taking your predictive distribution to give you your parameter distribution.

Parametric bootstrap (as stated initially) assumes the sampling distribution is symmetric and the model is good [i.e. if the model has a Normal f(.) and the real data truly has a Normal f(.)].

Symmetry is important if we want to use the standard error for confidence intervals.

We will use and then remove this assumption later when we get into confidence intervals. But first we will consider estimator bias.

R Studio

# 1-Basic parametric bootstrap interval for the mean
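The RStudio demo itself is not part of this transcript. The following is only a sketch of what a basic parametric bootstrap interval for the mean could look like, assuming a Normal model, simulated stand-in data, a hypothetical seed, and `N - 1` degrees of freedom for the t quantile.

```r
set.seed(1)                          # hypothetical seed, for reproducibility
y <- rnorm(30, mean = 5, sd = 2)     # stand-in for the observed data
N <- length(y)

# Estimate the model parameters from the data
mu_hat    <- mean(y)
sigma_hat <- sd(y)

# Generate B new data sets from f(theta_hat) and re-estimate the mean on each
B <- 2000
mu_star <- replicate(B, mean(rnorm(N, mean = mu_hat, sd = sigma_hat)))

# Bootstrap standard error; sd() uses the 1/(B - 1) divisor
se_boot <- sd(mu_star)

# Symmetric interval: theta_hat +/- t quantile * SE
alpha <- 0.05
ci <- mu_hat + c(-1, 1) * qt(1 - alpha / 2, df = N - 1) * se_boot

se_boot
sigma_hat / sqrt(N)   # analytic SE of the mean, for comparison
ci
```

For the mean of a Normal sample the analytic standard error is available, which makes this a convenient check that the bootstrap machinery is working.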

Estimate bias via parametric bootstrap samples

$E(\hat\theta) - \theta_{true} \approx E(\hat\theta^*) - \hat\theta$

This bias approximation is valid if the sampling distribution of $\hat\theta^* - \hat\theta$ is close to that of $\hat\theta - \theta_{true}$

R Studio

# 2 - estimated bias of the estimator:
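A sketch of the bias estimate $E(\hat\theta^*) - \hat\theta$, not the demo from the lecture. It assumes a Normal model and uses a hypothetical estimator `var_div_n` (the variance estimate with divisor N), chosen because its true bias is known to be $-\sigma^2/N$.

```r
set.seed(2)                           # hypothetical seed
y <- rnorm(15, mean = 0, sd = 3)      # stand-in data
N <- length(y)

# Hypothetical estimator: variance estimate with divisor N (known to be biased)
var_div_n <- function(x) mean((x - mean(x))^2)

theta_hat <- var_div_n(y)
mu_hat    <- mean(y)

# Simulate from the fitted model and re-apply the estimator
B <- 5000
theta_star <- replicate(
  B, var_div_n(rnorm(N, mean = mu_hat, sd = sqrt(theta_hat)))
)

bias_boot <- mean(theta_star) - theta_hat   # E(theta*) - theta_hat
bias_boot
-theta_hat / N   # theoretical bias, with sigma^2 replaced by theta_hat
```

The two printed numbers should be close, since in the bootstrap world the fitted variance plays the role of the true variance.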

Estimate bias via parametric bootstrap samples

*must have unbiased estimator

$E(\hat\theta) - \theta_{true} \approx E(\hat\theta^*) - \hat\theta$

R Studio

# 3-parametric bootstrap Small sample bias estimation for using the formula V1 as a variance estimate:

Basic bootstrap confidence interval is $\hat\theta \pm t_{df,1-\alpha/2}\,SE_{\hat\theta}$

But that assumes a symmetric sampling distribution

Alternatively, use the bootstrap confidence interval (lower, upper):

$\left(\hat\theta - \left[q(\hat\theta^*, 1-\alpha/2) - \hat\theta\right],\ \hat\theta - \left[q(\hat\theta^*, \alpha/2) - \hat\theta\right]\right)$

Note that q is the quantile function; with α=5% this might look backwards, because the lower limit uses the upper (1−α/2) quantile and the upper limit uses the lower (α/2) quantile.

For a skewed distribution, the estimator is more likely to fall in the direction of the skew (compared to the mean), and this formulation corrects for that.

R Studio

# 4-obtain a Bootstrap CI
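A sketch of the quantile-based interval for a skewed case, under assumptions that are not from the lecture: an exponential model, simulated stand-in data, and the sample mean as the estimator.

```r
set.seed(3)                        # hypothetical seed
y <- rexp(25, rate = 0.5)          # stand-in skewed data
N <- length(y)

theta_hat <- mean(y)               # estimate of the exponential mean

# Parametric bootstrap: simulate from the fitted exponential model
B <- 4000
theta_star <- replicate(B, mean(rexp(N, rate = 1 / theta_hat)))

alpha <- 0.05
q <- quantile(theta_star, c(1 - alpha / 2, alpha / 2), names = FALSE)

lower <- theta_hat - (q[1] - theta_hat)   # built from the upper quantile
upper <- theta_hat - (q[2] - theta_hat)   # built from the lower quantile
c(lower = lower, upper = upper)
```

The interval is not symmetric about `theta_hat`: the long right tail of the bootstrap distribution is reflected to the left of the estimate.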

80th percentile of a sample: in RStudio

Bootstrap distribution vs. actual sampling distribution

Bias and CI for the 80th percentile

RStudio:

# 5-80th percentile

Coverage probability of the interval estimate

RStudio:

# 6-coverage probability
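A sketch of a coverage study for the 80th percentile, assuming a standard Normal truth, n = 10, a parametric (Normal) bootstrap, and small hypothetical values of `B` and `M` so that it runs quickly. The lecture demo may have been set up differently.

```r
set.seed(4)                                    # hypothetical seed
n <- 10; B <- 200; M <- 200; alpha <- 0.05
theta_true <- qnorm(0.8)                       # true 80th percentile of N(0, 1)
est <- function(x) quantile(x, 0.8, names = FALSE)

covers <- replicate(M, {
  y <- rnorm(n)                                # fresh data set from the true model
  theta_hat <- est(y)
  # Parametric bootstrap from the fitted Normal model
  theta_star <- replicate(B, est(rnorm(n, mean(y), sd(y))))
  q <- quantile(theta_star, c(1 - alpha / 2, alpha / 2), names = FALSE)
  lower <- theta_hat - (q[1] - theta_hat)
  upper <- theta_hat - (q[2] - theta_hat)
  lower <= theta_true && theta_true <= upper   # did this interval cover the truth?
})

coverage <- mean(covers)   # proportion of the M intervals containing theta_true
coverage
```

Comparing `coverage` with the nominal 1 − α shows how well the interval behaves at this sample size.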

Note the potential sources of error

Parametric Bootstrap:

model error: is the assumed form of f(.) correct?

simulation error (is the likelihood bimodal?)

statistical error: sampling distribution values unknown (like using N instead of t when σ unknown)

Nonparametric Bootstrap:

????

non-parametric Bootstrap

For data: $Y_1,\dots,Y_n$

and estimator of something useful: $\hat\theta(Y)$, which is a function of Y

produce a new sample $Y^*_1,\dots,Y^*_n$ chosen by resampling n values from $Y_1,\dots,Y_n$ with replacement.

Compute the $\hat\theta(Y^*)$

Repeat sampling and estimation B times. Use the B estimates to get an interval estimate for $\hat\theta(Y)$ based on the sampling variability of $\hat\theta(Y^*)$
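The resampling recipe can be sketched in a few lines. The data, the seed, and the choice of the median as the estimator are stand-ins for illustration only.

```r
set.seed(5)                                # hypothetical seed
y <- rexp(40, rate = 1)                    # stand-in data
n <- length(y)
theta_hat <- median(y)                     # theta_hat(Y)

# Resample n values from y with replacement, B times, re-estimating each time
B <- 2000
theta_star <- replicate(B, median(sample(y, size = n, replace = TRUE)))

se_boot <- sd(theta_star)                  # sampling variability of theta_hat(Y*)

alpha <- 0.05
q <- quantile(theta_star, c(1 - alpha / 2, alpha / 2), names = FALSE)
ci <- c(lower = 2 * theta_hat - q[1], upper = 2 * theta_hat - q[2])

se_boot
ci
```

No model f(.) appears anywhere: the observed data stand in for the population.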

>clusterApply sends jobs to cores in a round-robin style, like dealing cards.

>clusterApplyLB lets cores pull in new jobs as soon as the old one is done. Basically, always use this when there are more jobs than cores.

Using a parallel algorithm - write pseudocode

Use one of:

non-parametric bootstrap

Iris data is also in “Statistical Computing with R” Ch4 (plotting)

RStudio: iris data set. Obtain a point and interval estimate for the 33rd percentile. Why 33%?

Algorithm is parallel, but how to avoid communication overhead?
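One possible answer to the overhead question, sketched under assumptions: the `parallel` package, two local workers, `Sepal.Length` as the iris variable (the slides do not say which column), and a hypothetical helper `boot_block`. The idea is to make each job a whole block of bootstrap replicates, so each worker receives the data once per block and sends back a single vector.

```r
library(parallel)   # assumption: parallel rather than snow

y <- iris$Sepal.Length   # assumption: the slides do not name the column

# Hypothetical helper: one job = a block of bootstrap replicates
boot_block <- function(B_block, y) {
  replicate(B_block, quantile(sample(y, replace = TRUE), 0.33, names = FALSE))
}

cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 2016)        # reproducible worker streams
# 4 jobs of 500 replicates each, instead of 2000 tiny jobs
blocks <- clusterApply(cl, rep(500, 4), boot_block, y = y)
stopCluster(cl)

theta_star <- unlist(blocks)
theta_hat  <- quantile(y, 0.33, names = FALSE)   # point estimate

q <- quantile(theta_star, c(0.975, 0.025), names = FALSE)
lower <- 2 * theta_hat - q[1]
upper <- 2 * theta_hat - q[2]
c(estimate = theta_hat, lower = lower, upper = upper)
```

`clusterApply` is used here so that the job-to-worker assignment, and hence the result, is repeatable. `clusterApplyLB` would balance uneven job times better, at the cost of that repeatability.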

Note the potential sources of error

Parametric Bootstrap:

model error: is the assumed form of f(.) correct?

simulation error (is the likelihood bimodal?)

statistical error: sampling distribution values unknown (like using N instead of t when σ unknown)

Nonparametric Bootstrap:

simulation error

Bootstrap fails when it is very sensitive to only a few points in the data so data is not a great representation of the population.

Bootstrap tends to work when small data changes lead to small changes in the estimator.

If you think the bootstrap might be dodgy, try a simulation study (as we did with the 80th percentile and n=10).

Don’t bother bootstrapping the min or the max

Many other variants exist for bootstrapping residuals,

Bootstrap fails
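To illustrate the warning about the min and the max, here is a small sketch with simulated stand-in data. A bootstrap replicate of the maximum can never exceed the observed maximum, and a large share of replicates simply reproduce it.

```r
set.seed(6)                                 # hypothetical seed
n <- 50
y <- runif(n)                               # stand-in data

B <- 5000
max_star <- replicate(B, max(sample(y, replace = TRUE)))

share_equal <- mean(max_star == max(y))     # replicates equal to the sample max
share_equal
1 - (1 - 1 / n)^n                           # probability the max is resampled
all(max_star <= max(y))                     # never above the observed max
```

The bootstrap distribution of the maximum is a point mass of about 63% at the observed value plus a few lower values, which says very little about the actual sampling distribution.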
