COMP STAT WEEK 4 DAY 2people.stat.sfu.ca/.../CompStat_Week4_Day2-2016.pdf · Parametric bootstrap:...
COMP STAT WEEK 4 DAY 2
Chapter 7 of Statistical Computing with R
Dave Campbell, www.stat.sfu.ca/~dac5
R tips!
Then:
>library(snow)
>cl = makeCluster(4, "SOCK")
>clusterSetRNGStream(cl, iseed)
iseed can be set to make it “reproducibly random”
separate streams of length 2^127 for each node!
>clusterSetRNGStream(cl , 100:105)
> clusterCall(cl,rnorm,n=1)
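A minimal sketch of what "reproducibly random" means in practice. It assumes the base-R parallel package (which is where clusterSetRNGStream lives) rather than snow, and it uses 2 workers and the seed 100 purely for illustration.

```r
library(parallel)  # assumption: parallel is used here instead of snow

cl <- makeCluster(2)  # 2 workers, chosen only to keep the example light

# Seed the per-worker streams, then draw one normal variate on each worker
clusterSetRNGStream(cl, iseed = 100)
first <- unlist(clusterCall(cl, rnorm, n = 1))

# Re-seeding with the same iseed should reproduce the same draws
clusterSetRNGStream(cl, iseed = 100)
second <- unlist(clusterCall(cl, rnorm, n = 1))

stopCluster(cl)

identical(first, second)  # same seed, same draws
first[1] != first[2]      # each worker has its own stream
```

Both checks are expected to be TRUE: the draws repeat across runs, but the workers do not duplicate each other.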
In genoud, you can include bounds and the option:
boundary.enforcement=2 enforces the boundary
boundary.enforcement=1 encourages the boundary
boundary.enforcement=0 ignores the boundary
Reproducibility step #1
Start your code with this:
>rm(list=ls())
why?
BootStrap:
Two flavours of bootstraps:
parametric and non-parametric
Goal is to estimate the standard error
Parametric bootstrap:
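For model $Y_1,\dots,Y_N \sim f(\theta)$
Estimate $\hat\theta$ from the data, by MLE, MOM,…
Generate B new data sets
$Y^1_1,\dots,Y^1_N \sim f(\hat\theta)$
$\vdots$
$Y^B_1,\dots,Y^B_N \sim f(\hat\theta)$
From which we obtain $\hat\theta^*_1,\dots,\hat\theta^*_B$ and $E(\hat\theta^*) = \frac{1}{B}\sum_{b=1}^{B}\hat\theta^*_b$
Use
$SE_{\hat\theta} = \sqrt{\mathrm{var}\left(\hat\theta^*_1,\dots,\hat\theta^*_B\right)} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left[E(\hat\theta^*) - \hat\theta^*_b\right]^2}$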
Use the original data to get a point estimate and then explore the behaviour of the sampling distribution of the estimate when applied to simulated data.
In Bayes-land this is like taking your predictive distribution to give you your parameter distribution.
Parametric bootstrap (as stated initially) assumes the sampling distribution is symmetric and the model is good [i.e. if the model has a Normal f(.) and the real data truly has a Normal f(.)].
Symmetry is important if we want to use the standard error for confidence intervals.
We will use and then remove this assumption later when we get into confidence intervals. But first we will consider estimator bias.
R Studio
# 1-Basic parametric bootstrap interval for the mean
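A sketch of the parametric bootstrap recipe for the mean, under assumptions that are not in the slides: a Normal model, a simulated data set of size 30, B = 2000, and N - 1 degrees of freedom for the t quantile. The interval has the form estimate ± t × SE.

```r
set.seed(1)                       # assumed seed, only for reproducibility
y <- rnorm(30, mean = 5, sd = 2)  # hypothetical data; replace with real data
N <- length(y)
B <- 2000                         # number of bootstrap data sets (assumed)

# Step 1: estimate the parameters of the assumed Normal model by MLE
mu_hat    <- mean(y)
sigma_hat <- sqrt(mean((y - mu_hat)^2))

# Step 2: generate B data sets from f(theta_hat) and re-estimate the mean
theta_star <- replicate(B, mean(rnorm(N, mean = mu_hat, sd = sigma_hat)))

# Step 3: bootstrap standard error = sd of the B bootstrap estimates
se_boot <- sd(theta_star)

# Symmetric interval: estimate +/- t quantile * SE
alpha    <- 0.05
ci_basic <- mu_hat + c(-1, 1) * qt(1 - alpha / 2, df = N - 1) * se_boot
ci_basic
```

For the mean of a Normal model, se_boot should land close to sigma_hat / sqrt(N), which gives a quick sanity check on the simulation.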
Estimate bias via parametric bootstrap samples
$E(\hat\theta) - \theta_{true} \approx E(\hat\theta^*) - \hat\theta$
This bias approximation is valid if the sampling distribution of $\hat\theta^* - \hat\theta$ is close to that of $\hat\theta - \theta_{true}$
R Studio
# 2 - estimated bias of the estimator:
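A sketch of the bootstrap bias estimate, E(θ̂*) − θ̂. The estimator chosen here (the divide-by-N variance under a Normal model), the simulated data, and B = 5000 are all assumptions made for illustration.

```r
set.seed(5)                        # assumed seed
y <- rnorm(10, mean = 0, sd = 2)   # hypothetical small sample
N <- length(y)
B <- 5000

# Estimator of interest: divide-by-N variance (the Normal MLE, known to be biased)
var_mle <- function(x) mean((x - mean(x))^2)

theta_hat <- var_mle(y)
mu_hat    <- mean(y)

# Simulate B data sets from the fitted model and re-apply the estimator
theta_star <- replicate(B, var_mle(rnorm(N, mean = mu_hat, sd = sqrt(theta_hat))))

# Bootstrap bias estimate and a bias-corrected estimate
bias_hat        <- mean(theta_star) - theta_hat
theta_corrected <- theta_hat - bias_hat
c(estimate = theta_hat, bias = bias_hat, corrected = theta_corrected)
```

For this particular estimator the bootstrap mean is (N − 1)/N times theta_hat in expectation, so bias_hat should come out near −theta_hat / N.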
Estimate bias via parametric bootstrap samples (*must have unbiased estimator)
$E(\hat\theta) - \theta_{true} \approx E(\hat\theta^*) - \hat\theta$
R Studio
# 3-parametric bootstrap Small sample bias estimation for using the formula V1 as a variance estimate:
Basic bootstrap confidence interval is
$\hat\theta \pm t_{df,1-\alpha/2}\, SE_{\hat\theta}$
But that assumes a symmetric sampling distribution
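Alternatively, use the bootstrap confidence interval (lower, upper):
lower: $\hat\theta - \left[q(\hat\theta^*, 1-\alpha/2) - \hat\theta\right]$
upper: $\hat\theta - \left[q(\hat\theta^*, \alpha/2) - \hat\theta\right]$
Note that q is the quantile function. With α=5%, the lower limit uses the 97.5% quantile and the upper limit uses the 2.5% quantile, so this might look backwards.
For a skewed distribution, the estimate from the data is more likely to be in the direction of the skew (compared to the mean), and this formulation corrects for that.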
R Studio
# 4-obtain a Bootstrap CI
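A sketch of the quantile-based bootstrap interval, with the upper bootstrap quantile feeding the lower limit and vice versa. The Exponential model, the simulated data, and B = 5000 are assumptions chosen to give a skewed sampling distribution.

```r
set.seed(2)                  # assumed seed
y <- rexp(20, rate = 0.5)    # hypothetical skewed data
N <- length(y)
B <- 5000
alpha <- 0.05

# Parameter of interest: the mean of an Exponential model, estimated by the sample mean
theta_hat <- mean(y)

# Parametric bootstrap: simulate from the fitted Exponential and re-estimate
theta_star <- replicate(B, mean(rexp(N, rate = 1 / theta_hat)))

# q[1] is the 1 - alpha/2 quantile, q[2] is the alpha/2 quantile of theta_star
q <- quantile(theta_star, c(1 - alpha / 2, alpha / 2), names = FALSE)

lower <- theta_hat - (q[1] - theta_hat)
upper <- theta_hat - (q[2] - theta_hat)
c(lower = lower, estimate = theta_hat, upper = upper)
```

Because the bootstrap distribution is right-skewed here, the resulting interval is not symmetric about theta_hat, unlike the t-based interval.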
80th percentile of a sample: in RStudio
Bootstrap distribution vs. actual sampling distribution
Bias and CI for the 80th percentile
RStudio:
# 5-80th percentile
Coverage probability of the interval estimate
RStudio:
# 6-coverage probability
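A sketch of a coverage study for the 80th percentile with n = 10. The N(0, 1) population, the fitted-Normal parametric bootstrap, the quantile-based interval, and the sizes M = 200 and B = 300 are all assumptions made to keep the run short.

```r
set.seed(4)                 # assumed seed
n <- 10                     # sample size
B <- 300                    # bootstrap samples per data set (kept small for speed)
M <- 200                    # number of simulated "original" data sets
alpha <- 0.05
p <- 0.8
theta_true <- qnorm(p)      # true 80th percentile of the assumed N(0, 1) population

covers <- replicate(M, {
  y <- rnorm(n)                                 # a fresh data set from the true model
  theta_hat <- quantile(y, p, names = FALSE)    # sample 80th percentile
  # Parametric bootstrap: simulate from the Normal fitted to y
  theta_star <- replicate(B, quantile(rnorm(n, mean(y), sd(y)), p, names = FALSE))
  q <- quantile(theta_star, c(1 - alpha / 2, alpha / 2), names = FALSE)
  lower <- 2 * theta_hat - q[1]
  upper <- 2 * theta_hat - q[2]
  lower <= theta_true && theta_true <= upper    # did the interval catch the truth?
})

coverage <- mean(covers)    # compare with the nominal 1 - alpha = 0.95
coverage
```

If the estimated coverage falls well below the nominal level, that is a sign the interval should not be trusted at this sample size.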
Parametric Bootstrap
model error: is the assumed form of f(.) correct?
simulation error (is the likelihood bimodal?)
statistical error: sampling distribution values unknown (like using N instead of t when σ unknown)
Nonparametric Bootstrap:
????
Note the potential sources of error
non-parametric Bootstrap
For data: $Y_1,\dots,Y_n$
and estimator of something useful: $\hat\theta(Y)$, which is a function of Y
produce a new sample $Y^*_1,\dots,Y^*_n$ chosen by resampling n values from $Y_1,\dots,Y_n$ with replacement.
Compute the $\hat\theta(Y^*)$
Repeat sampling and estimation B times. Use the B estimates to get an interval estimate for $\hat\theta(Y)$ based on the sampling variability of $\hat\theta(Y^*)$
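A sketch of the resampling recipe above. The simulated Gamma data, the choice of the median as the estimator, B = 2000, and the quantile-based interval are assumptions for illustration only.

```r
set.seed(3)                            # assumed seed
y <- rgamma(25, shape = 2, rate = 1)   # hypothetical data; no model is used below
n <- length(y)
B <- 2000
alpha <- 0.05

theta_hat <- median(y)                 # estimator of "something useful"

# Resample n values from y with replacement and re-apply the estimator, B times
theta_star <- replicate(B, median(sample(y, size = n, replace = TRUE)))

# Sampling variability of the estimator, as seen through the resamples
se_boot <- sd(theta_star)

# Interval built from the quantiles of the bootstrap estimates
q  <- quantile(theta_star, c(1 - alpha / 2, alpha / 2), names = FALSE)
ci <- c(lower = 2 * theta_hat - q[1], upper = 2 * theta_hat - q[2])
ci
```

The only change from the parametric version is the data-generating step: sample() on the observed values replaces simulation from a fitted f(.).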
>clusterApply sends jobs to cores in a round-robin style, like dealing cards.
>clusterApplyLB lets cores pull in new jobs as soon as the old one is done. Basically, always do this with more jobs than cores.
Using a parallel algorithm - write pseudocode
Use one of:
non-parametric bootstrap
Iris data is also in “Statistical Computing with R” Ch4 (plotting)
RStudio: for the iris data set, obtain a point and interval estimate for the 33rd percentile. Why 33%?
Algorithm is parallel, but how to avoid communication overhead?
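One possible answer, sketched under assumptions: give each worker a single large chunk of bootstrap replicates, so the data and results cross the network once per worker rather than once per replicate. The variable Sepal.Length, the 2 workers, B = 2000, the seed, and the helper name boot_chunk are hypothetical choices, and the base-R parallel package is assumed in place of snow.

```r
library(parallel)   # assumption: parallel is used here instead of snow

y <- iris$Sepal.Length   # assumption: the slides do not say which variable to use
p <- 0.33
theta_hat <- quantile(y, p, names = FALSE)   # point estimate

# One job = a whole chunk of b bootstrap replicates (hypothetical helper)
boot_chunk <- function(b, y, p) {
  replicate(b, quantile(sample(y, replace = TRUE), p, names = FALSE))
}

cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 2016)        # reproducible per-worker streams

B <- 2000
chunks <- rep(B / length(cl), length(cl))    # one chunk per worker
theta_star <- unlist(clusterApply(cl, chunks, boot_chunk, y = y, p = p))
stopCluster(cl)

# Interval from the quantiles of the pooled bootstrap estimates
alpha <- 0.05
q  <- quantile(theta_star, c(1 - alpha / 2, alpha / 2), names = FALSE)
ci <- c(lower = 2 * theta_hat - q[1], upper = 2 * theta_hat - q[2])
ci
```

With one chunk per worker, clusterApply keeps the run reproducible. Splitting into many smaller chunks and using clusterApplyLB would balance the load better, at the cost of more communication and a result that depends on which worker picks up which job.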
Parametric Bootstrap
model error: is the assumed form of f(.) correct?
simulation error (is the likelihood bimodal?)
statistical error: sampling distribution values unknown (like using N instead of t when σ unknown)
Nonparametric Bootstrap:
simulation error
Note the potential sources of error
Bootstrap fails when the estimator is very sensitive to only a few points in the data, so that the data is not a great representation of the population.
Bootstrap tends to work when small data changes lead to small changes in the estimator.
If you think the bootstrap might be dodgy, try a simulation study (as we did with the 80th percentile and n=10).
Don’t bother bootstrapping the min or the max
Many other variants exist for bootstrapping residuals,
Bootstrap fails