Post on 03-Jan-2016
Limits to Statistical Theory
Bootstrap analysis
ESM 206
11 April 2006
Assumption of t-test
• Sample mean, once standardized by its standard error, is a t-distributed random variable
– Guaranteed if observations are normally distributed random variables; approximately true if sample size is very large
– In practice, OK if observations are not too skewed and sample size is reasonably large
• This assumption also applies when using standard formula for 95% CI of mean
Resampling for a confidence interval of the mean
IN AN IDEAL WORLD
• Take sample
• Calculate sample mean
• Take new sample
• Calculate new mean
• Repeat many times
• Look at the distribution of sample means
• 95% CI ranges from 2.5 percentile to 97.5 percentile
IN THE REAL WORLD
• Find some way to simulate taking a sample
• Calculate the sample mean
• Repeat many times
• Look at the distribution of sample means
• 95% CI ranges from 2.5 percentile to 97.5 percentile
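The "ideal world" procedure can be sketched in a few lines when the population is known, which is exactly what is not available in practice. The sketch below is only an illustration: the normal population (mean 10, SD 2), the sample size of 30, and the function names are assumptions made for the example, not part of the lecture.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed, only for repeatability

def take_sample(n=30):
    """Draw a fresh sample from a hypothetical, fully known population."""
    return [rng.gauss(10, 2) for _ in range(n)]

def percentile_interval(values, lower=0.025, upper=0.975):
    """Return the values at the given lower and upper percentiles."""
    ordered = sorted(values)
    m = len(ordered)
    # Rank convention p * (m + 1); other percentile conventions exist.
    lo_idx = min(max(round(lower * (m + 1)) - 1, 0), m - 1)
    hi_idx = min(max(round(upper * (m + 1)) - 1, 0), m - 1)
    return ordered[lo_idx], ordered[hi_idx]

# Take sample, calculate mean, repeat many times.
sample_means = [statistics.mean(take_sample()) for _ in range(999)]

# 95% CI runs from the 2.5th to the 97.5th percentile of the sample means.
ci_low, ci_high = percentile_interval(sample_means)
```

In the real world `take_sample` cannot be called repeatedly, so the bootstrap replaces it with a simulated sample.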
Bootstrap resampling
PARAMETRIC BOOTSTRAP
• Assume data are random variables from a particular distribution
– E.g., log-normal
• Use data to estimate parameters of the distribution
– E.g., mean, variance
• Use random number generator to create sample
– Same size as original
– Calculate sample mean
• Allows us to ask: What if data were a random sample from specified distribution with specified parameters?
NONPARAMETRIC BOOTSTRAP
• Assume underlying distribution from which data come is unknown
– Best estimate of this distribution is the data themselves: the empirical distribution function
• Create a new dataset by sampling with replacement from the data
– Same size as original
– Calculate sample mean
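A minimal sketch of the nonparametric bootstrap for the mean, assuming a percentile interval taken from the sorted bootstrap means. The function name, the default of 999 replicates, and the example data are illustrative assumptions; the data are made up and are not the TcCB measurements.

```python
import random
import statistics

def nonparametric_bootstrap_ci(data, n_boot=999, seed=None):
    """Percentile 95% CI of the mean from resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    boot_means = sorted(
        # random.choices samples WITH replacement; k=n keeps the original size.
        statistics.mean(rng.choices(data, k=n))
        for _ in range(n_boot)
    )
    # Ranks 0.025 * (n_boot + 1) and 0.975 * (n_boot + 1), clamped to the list.
    lo_idx = min(max(round(0.025 * (n_boot + 1)) - 1, 0), n_boot - 1)
    hi_idx = min(max(round(0.975 * (n_boot + 1)) - 1, 0), n_boot - 1)
    return boot_means[lo_idx], boot_means[hi_idx]

# Made-up, right-skewed example data.
example = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.7, 1.1, 2.5, 9.0]
low, high = nonparametric_bootstrap_ci(example, seed=42)
```

Because every bootstrap sample is built from the observed values, the interval can never extend beyond the range of the data.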
WHICH IS BETTER?
• If underlying distribution is correctly chosen, parametric has more precision
• If underlying distribution is incorrectly chosen, parametric has more bias
TcCB in the cleanup site
• Parametric bootstrap
– If Y is log-normal, it is specified in terms of mean and standard deviation of X = log(Y)
– Mean = -0.547
– SD = 1.360
– Use “Monte Carlo Simulation” to generate 999 replicate simulated datasets from log-normal distribution
– Calculate mean of each replicate and sort means
– 25th value is lower end of 95% CI
– 975th value is upper end of 95% CI
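The steps above can be sketched with Python's standard library. The log-scale mean and SD are the values on the slide, and the sample size of 77 is the N reported in the Moments output for the cleanup data. The seed and variable names are assumptions for illustration, and the resulting interval will vary from run to run and from the lecture's own simulation.

```python
import random
import statistics

MU_LOG = -0.547     # mean of X = log(Y), from the slide
SD_LOG = 1.360      # SD of X = log(Y), from the slide
N_OBS = 77          # sample size of the cleanup data
N_REPLICATES = 999

rng = random.Random(2006)  # arbitrary seed, only for repeatability

# Each replicate is a simulated dataset of log-normal values, same size as the original.
replicate_means = sorted(
    statistics.mean([rng.lognormvariate(MU_LOG, SD_LOG) for _ in range(N_OBS)])
    for _ in range(N_REPLICATES)
)

ci_lower = replicate_means[24]    # 25th sorted value (index 24)
ci_upper = replicate_means[974]   # 975th sorted value (index 974)
```

With 999 replicates, the 25th and 975th sorted values sit at ranks 0.025 × 1000 and 0.975 × 1000, which is why 999 rather than 1000 replicates is convenient.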
Distributions: Cleanup
[Figure: distribution plot of Cleanup; axis from 0 to 150]

Quantiles
100.0%   maximum    168.64
 99.5%              168.64
 97.5%               57.80
 90.0%                2.70
 75.0%   quartile     1.15
 50.0%   median       0.43
 25.0%   quartile     0.23
 10.0%                0.17
  2.5%                0.09
  0.5%                0.09
  0.0%   minimum      0.09

Moments
Mean              3.9151948
Std Dev           20.0156
Std Err Mean      2.2809894
upper 95% Mean    8.4581788
lower 95% Mean    -0.627789
N                 77
95% CI: [-0.628, 8.458]
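As a check on the standard-formula interval, the limits can be recomputed from the Mean, Std Dev and N in the Moments output for the cleanup data. The critical value 1.9917 is an assumption taken from a t table (97.5th percentile with 76 degrees of freedom), since the Python standard library has no t quantile function.

```python
import math

mean, sd, n = 3.9151948, 20.0156, 77   # from the Moments output
t_crit = 1.9917                        # assumed t-table value for 76 df

std_err = sd / math.sqrt(n)            # standard error of the mean
lower = mean - t_crit * std_err
upper = mean + t_crit * std_err

print(f"SE = {std_err:.4f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

The lower limit is negative even though the smallest observation is 0.09, which is a symptom of applying the t-based formula to strongly skewed data.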
Parametric bootstrap: results
• 95% CI: [0.917, 2.293]
Distribution of sample means
[Figure: histogram. x-axis: Bin (label shows upper limit), from about 0.84 to 4.07; y-axis: Frequency, 0 to 180]
Distributions: log(cleanup)
[Figure: distribution plot of log(cleanup); axis from -3 to 6]

Moments
Mean              -0.547426
Std Dev           1.3604488
Std Err Mean      0.1550375
upper 95% Mean    -0.238642
lower 95% Mean    -0.85621
N                 77
Normal QQ Plot
• Sort data
• Index the values (i = 1,2,…,n)
• Calculate q = i/(n+1)
– This is the quantile
• Plot quantiles against data values
– This is the empirical cumulative distribution function (CDF)
• Construct CDF of standard normal using same quantiles
• Compare the distributions at the same quantiles
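The construction above can be sketched as follows, assuming Python 3.8 or later for `statistics.NormalDist`. The function name and the example values are made up for illustration. Each returned pair is one point of the QQ plot: the standard normal value at quantile q, and the data value at the same quantile.

```python
from statistics import NormalDist

def normal_qq_points(data):
    """Return (standard normal quantile, data value) pairs for a normal QQ plot."""
    ordered = sorted(data)                   # sort data
    n = len(ordered)
    std_normal = NormalDist()                # mean 0, SD 1
    points = []
    for i, value in enumerate(ordered, start=1):   # index i = 1, 2, ..., n
        q = i / (n + 1)                      # quantile of the i-th sorted value
        points.append((std_normal.inv_cdf(q), value))
    return points

# Made-up example data.
for z, x in normal_qq_points([2.1, -0.3, 0.8, 1.5, -1.2]):
    print(f"{z:7.3f}  {x:5.2f}")
```

If the data are close to normal, the points fall near a straight line. Dividing by n + 1 rather than n keeps q strictly between 0 and 1, so the normal quantile is always finite.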
[Figure: Normal Quantile Plot of log(cleanup). Data axis from -3 to 6; normal quantile axis from -3 to 3, with probability scale from .01 to .99]
Nonparametric bootstrap: results
• 95% CI: [0.851, 9.248]
[Figure: histogram of bootstrap means. x-axis: Bootstrap mean, 0.5 to 14.5; y-axis: Frequency, 0 to 140]
Bootstrap and hypothesis tests
• One-sample t-test
– Calculate bootstrap CI of mean
– Does it contain the test value?
• Paired t-test
– Calculate differences:
    • Di = xi - yi
– Find bootstrap CI of mean difference
– Does it contain zero?
• Two-sample t-test
– Want to create simulated data where H0 is true (same mean) but allow variance and shape of distribution to differ between populations
– Easiest with nonparametric:
    • Subtract mean from each sample. Now both samples have mean zero
    • Resample these residuals, creating simulated group A from residuals of group A and simulated group B from residuals of group B
– Generate distribution of t values
– P is fraction of simulated t’s that exceed t calculated from data
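A minimal sketch of the two-sample procedure, under stated assumptions: the slides do not say which t statistic is used, so an unequal-variance (Welch) form is assumed here because the variances are allowed to differ. The P value is one-sided, following the "exceed" wording above. Function names and example data are hypothetical.

```python
import math
import random
import statistics

def welch_t(a, b):
    """t statistic for a difference in means, allowing unequal variances (assumed form)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0:
        return 0.0  # degenerate resample with no spread; counted as not exceeding
    return (statistics.mean(a) - statistics.mean(b)) / se

def bootstrap_two_sample_p(a, b, n_boot=999, seed=None):
    """Fraction of simulated t values, generated under H0, that exceed the observed t."""
    rng = random.Random(seed)
    t_obs = welch_t(a, b)
    # Subtract each group's own mean so both sets of residuals have mean zero.
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    resid_a = [x - mean_a for x in a]
    resid_b = [x - mean_b for x in b]
    exceed = 0
    for _ in range(n_boot):
        sim_a = rng.choices(resid_a, k=len(a))   # group A from residuals of A
        sim_b = rng.choices(resid_b, k=len(b))   # group B from residuals of B
        if welch_t(sim_a, sim_b) > t_obs:
            exceed += 1
    return exceed / n_boot

# Made-up example data.
group_a = [10.2, 11.5, 12.1, 13.4, 14.0, 12.7]
group_b = [1.1, 2.3, 2.9, 4.2, 5.0, 3.3]
p_value = bootstrap_two_sample_p(group_a, group_b, seed=7)
```

Resampling each group's residuals separately keeps that group's own spread and shape, while centering forces the null hypothesis of equal means to hold in the simulated data.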
TcCB: H0: cleanup mean = reference mean
• t = 1.45
• Bootstrapped ‘t’ values do not follow a t distribution!
• P = 0.02
[Figure: histogram, presumably of the bootstrapped t values. x-axis: Bin (label shows upper limit), from about -37.7 to 2.4; y-axis: Frequency, 0 to 600]