FSRM 16:958:587 Advanced Simulation Methods for Finance (Lecture 4) · 2016. 2. 17. · (Lecture 4)...
FSRM 16:958:587 Advanced Simulation Methods for Finance
(Lecture 4)
Min-ge Xie
Department of Statistics & Biostatistics, Rutgers University
Bootstrap – a Simulation & Resampling Method
General statements (overly simplified intro/some key words)
The bootstrap method in statistics is a resampling (simulation) approach for making statistical inference about unknown parameters (of an underlying population from which the observed sample data are generated).
Bootstrap samples are simulated "phantom" samples based on the observed sample data; bootstrap distributions are derived from the bootstrap samples and can be used to make statistical inference.
Although it is a simulation method, it is a little different from the simulation techniques that we have learned before.
Motivation/background
The primary task of a statistician is to summarize a sample-based study and generalize the findings to the parent (underlying) population in a scientific manner.
The summary (often through a sample statistic such as the mean, median, correlation, etc.) will fluctuate from sample to sample.
We would like to know the magnitude of these fluctuations to get an overall picture. This fluctuation can often be described in the form of a probability distribution called a sampling distribution.
Motivation/background (continued)
Suppose we do not make many assumptions (do not know much) about the underlying population:
Ideally, if we could repeatedly draw samples from the target distribution ⟹ we would have multiple (many) copies of the sample statistic, one from each repeatedly drawn sample ⟹ these multiple (many) copies of the sample statistic would give us a good idea of the fluctuation and the sampling distribution.
But, in reality, we have only one set (copy) of observed data (sample).
The idea behind the bootstrap is to use the observed sample data as a "surrogate population" for the purpose of approximating the sampling distribution of a statistic.
Specifically,
– We resample with replacement from the sample data at hand and create a large number of "phantom samples" known as bootstrap samples.
– These bootstrap samples can be used to quantify the fluctuation ("make inference") of a "population parameter" of the "surrogate population".
– Under some conditions, the "phantom" inference is the same as (or can help to derive) the real inference that we are looking for.
This leads us to the "bootstrap method"!
Read a companion note (introductory review article) by Singh and Xie (2010) in International Encyclopedia of Education.
(http://stat.rutgers.edu/~mxie/RCPapers/bootstrap.pdf) or (http://www.sciencedirect.com/science/referenceworks/9780080448947)
Setting/Setup:
We have a sample data set $x_1, \ldots, x_n \overset{\text{i.i.d.}}{\sim} F(x)$. Also, let $\theta$ be a population characteristic of the distribution $F$, and suppose we have an estimator $\hat\theta$ for $\theta$, where $\hat\theta = \hat\theta(x_1, \ldots, x_n)$ is a function of the sample $\mathbf{x} = (x_1, \ldots, x_n)$.
For example, $\theta$ is the mean of the distribution $F$ (population mean) and $\hat\theta = \bar{X}$.
Goal:
We need to make inference about $\theta$. Besides the point estimator $\hat\theta$, we would like to know the sampling distribution of $\hat\theta = \hat\theta(x_1, \ldots, x_n)$; in particular, we would like to find a confidence interval for $\theta$, etc.
Bootstrap (generate) a set of new ("phantom") data
From the observed sample data set $\{x_1, x_2, \ldots, x_n\}$, resample with replacement to get a new data set of size $n$:
– Randomly pick (each with probability $1/n$) a data point from $\{x_1, x_2, \ldots, x_n\}$ and set it to be $x_1^*$; repeat exactly the same random pick $n - 1$ more times to get $x_2^*, x_3^*, \ldots, x_n^*$.
This new set of data $\{x_1^*, \ldots, x_n^*\}$ is called a bootstrap sample.
To make statistical inference, we repeat this bootstrap sampling process a large number of (say $N$) times to get $N$ bootstrap samples.
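The resampling step can be sketched in a few lines of base R using `sample()`. The data vector `x` and the names `x.star` and `boot.samples` are illustrative assumptions, not part of the lecture.

```r
set.seed(1)                       # fix the seed so the draws are reproducible
x <- c(2.1, 3.4, 1.9, 5.0, 4.2)   # hypothetical observed sample (n = 5)
n <- length(x)

# One bootstrap sample: n draws from x, each point picked with probability 1/n
x.star <- sample(x, size = n, replace = TRUE)

# N bootstrap samples, stored as the columns of an n x N matrix
N <- 1000
boot.samples <- replicate(N, sample(x, size = n, replace = TRUE))
dim(boot.samples)
```

Because the draws are made with replacement, a bootstrap sample typically repeats some observed values and omits others.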
A bootstrap sampling algorithm to get a CI for $\theta$:
1. Generate $N$ bootstrap data sets, each of size $n$, and compute the corresponding bootstrap estimator $\hat\theta^*$:
   1st: $\{x_1^*, \ldots, x_n^*\}^{[1]} \sim \{x_1, \ldots, x_n\}$, $\hat\theta_1^* = \hat\theta\left(\{x_1^*, \ldots, x_n^*\}^{[1]}\right)$
   2nd: $\{x_1^*, \ldots, x_n^*\}^{[2]} \sim \{x_1, \ldots, x_n\}$, $\hat\theta_2^* = \hat\theta\left(\{x_1^*, \ldots, x_n^*\}^{[2]}\right)$
   ...
   Nth: $\{x_1^*, \ldots, x_n^*\}^{[N]} \sim \{x_1, \ldots, x_n\}$, $\hat\theta_N^* = \hat\theta\left(\{x_1^*, \ldots, x_n^*\}^{[N]}\right)$
2. Sort $\{\hat\theta_1^*, \hat\theta_2^*, \ldots, \hat\theta_N^*\}$ from the smallest to the largest. Now we have $\hat\theta^*_{(1)} \le \hat\theta^*_{(2)} \le \cdots \le \hat\theta^*_{(N)}$; a histogram could be constructed.
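The two steps above can be sketched as a small R helper. The function name `boot.estimates`, its arguments, and the exponential test data are assumptions made for illustration; any statistic that maps a numeric vector to a single number could be passed as `stat`.

```r
# Sorted bootstrap estimates of a statistic `stat` computed from a data vector `x`
boot.estimates <- function(x, stat, N = 1000) {
  n <- length(x)
  est <- numeric(N)
  for (k in seq_len(N)) {
    x.star <- x[sample.int(n, size = n, replace = TRUE)]  # k-th bootstrap sample
    est[k] <- stat(x.star)                                 # k-th bootstrap estimate
  }
  sort(est)   # order statistics theta*_(1) <= ... <= theta*_(N)
}

set.seed(2)
x <- rexp(30, rate = 1)                        # hypothetical data
theta.star <- boot.estimates(x, mean, N = 1000)
range(theta.star)
# hist(theta.star) would display the bootstrap distribution
```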
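Claim (symmetric case): A $(1 - \alpha)100\%$ confidence interval for the parameter $\theta$ is simply
$$\left[\hat\theta^*_{(L)},\ \hat\theta^*_{(U)}\right],$$
where $L = \frac{\alpha N}{2}$ and $U = \left(1 - \frac{\alpha}{2}\right)N$, for $0 < \alpha < 1$.
For example, a 95% C.I. for the parameter $\theta$ is $\left[\hat\theta^*_{(25)},\ \hat\theta^*_{(975)}\right]$ if $N = 1000$. (Note: $L = 1000 \times .025 = 25$, $U = 1000 \times .975 = 975$.)
An equivalent way to write the above confidence interval is simply
$$\left[\hat\theta^*_{\alpha/2},\ \hat\theta^*_{1-\alpha/2}\right],$$
where $\hat\theta^*_{q}$ denotes the $q$-th quantile of the bootstrap estimates.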
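A minimal R sketch of this interval is given here. The function name `percentile.ci` is hypothetical, and the use of `round()` with the `max`/`min` guards is an assumption for the case where $\alpha N/2$ is not an integer; the claim itself only covers the integer case.

```r
percentile.ci <- function(theta.star, alpha = 0.05) {
  N <- length(theta.star)
  srt <- sort(theta.star)
  L <- max(1, round(alpha * N / 2))          # assumption: round if not an integer
  U <- min(N, round((1 - alpha / 2) * N))
  c(lower = srt[L], upper = srt[U])          # [theta*_(L), theta*_(U)]
}

set.seed(3)
x <- rnorm(50, mean = 10, sd = 2)            # hypothetical sample
theta.star <- replicate(1000, mean(sample(x, replace = TRUE)))
percentile.ci(theta.star)                    # 95% interval for the population mean
```

With N = 1000 and alpha = 0.05 the function picks the 25th and 975th sorted values, matching the worked example in the claim.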
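Claim (asymmetric case): A $(1 - \alpha)100\%$ confidence interval for the parameter $\theta$ is
$$\left[2\hat\theta - \hat\theta^*_{(U)},\ 2\hat\theta - \hat\theta^*_{(L)}\right],$$
where $L = \frac{\alpha N}{2}$ and $U = \left(1 - \frac{\alpha}{2}\right)N$, for $0 < \alpha < 1$.
For example, a 95% C.I. for the parameter $\theta$ is just $\left[2\hat\theta - \hat\theta^*_{(975)},\ 2\hat\theta - \hat\theta^*_{(25)}\right]$ if $N = 1000$.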
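The asymmetric-case interval can be sketched the same way. The function name `reflected.ci`, the rounding of the indices, and the skewed test data are assumptions for illustration; this form of interval is often called the basic bootstrap interval.

```r
reflected.ci <- function(theta.hat, theta.star, alpha = 0.05) {
  N <- length(theta.star)
  srt <- sort(theta.star)
  L <- max(1, round(alpha * N / 2))          # assumption: round if not an integer
  U <- min(N, round((1 - alpha / 2) * N))
  # [2*theta.hat - theta*_(U), 2*theta.hat - theta*_(L)]
  c(lower = 2 * theta.hat - srt[U], upper = 2 * theta.hat - srt[L])
}

set.seed(4)
x <- rexp(40, rate = 0.5)                    # hypothetical skewed sample
theta.hat <- median(x)
theta.star <- replicate(1000, median(sample(x, replace = TRUE)))
reflected.ci(theta.hat, theta.star)
```

Note that the upper bootstrap order statistic produces the lower endpoint and vice versa, because the bootstrap estimates are reflected around the point estimate.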
Example (cf. Example 2 of the companion note by Singh and Xie (2010)):
Data: Two types of measurements to assess body fat in n = 20 collegiate football players

BOD   2.5  4.0  4.1  6.2  7.1  7.0  8.3  9.2  9.3 12.0 12.2 12.6 14.2 14.4 15.1 15.2 16.3 17.1 17.9 17.9
HW    8.0  6.2  9.2  6.4  8.6 12.2  7.2 12.0 14.9 12.1 15.3 14.8 14.3 16.3 17.9 19.5 17.5 14.3 18.3 16.2

– BOD is BOD POD, a whole-body air-displacement plethysmograph.
– HW refers to hydrostatic weighing.
Question: Study the correlation between the BOD and HW measurements (find a confidence interval).
## R program of bootstrap algorithm for correlation parameter
## Example 2 of Singh and Xie (2010)
# Data (entered with c() so that the script also runs non-interactively)
BOD = c(2.5, 4.0, 4.1, 6.2, 7.1, 7.0, 8.3, 9.2, 9.3, 12.0,
        12.2, 12.6, 14.2, 14.4, 15.1, 15.2, 16.3, 17.1, 17.9, 17.9)
HW = c(8.0, 6.2, 9.2, 6.4, 8.6, 12.2, 7.2, 12.0, 14.9, 12.1,
       15.3, 14.8, 14.3, 16.3, 17.9, 19.5, 17.5, 14.3, 18.3, 16.2)
data.ex2 = cbind(BOD, HW)
## Boxplot of the data and the scatter plot
par(mfrow = c(1,2))
boxplot(data.ex2); plot(BOD, HW)
[Figure: boxplots of BOD and HW (left panel); scatter plot of HW against BOD (right panel).]
A generic bootstrap algorithm for this example
Step 1: At each iteration $k = 1, 2, \ldots, N$ (with $N = 1000$), generate a bootstrap data set of $n = 20$ pairs by repeating the following procedure:
1. For $i = 1, \ldots, 20$, randomly sample a pair $(x_i^*, y_i^*)$ from the 20 observed data pairs $\{(2.5, 8.0), (4.0, 6.2), \ldots, (17.9, 16.2)\}$ (sample with replacement). These new 20 pairs form a bootstrap sample set $(\mathbf{x}^*, \mathbf{y}^*) = \{(x_1^*, y_1^*), (x_2^*, y_2^*), \ldots, (x_{20}^*, y_{20}^*)\}$.
2. Compute the bootstrap sample correlation coefficient $\hat\rho^* = \mathrm{corr}(\mathbf{x}^*, \mathbf{y}^*)$.
Step 2: Produce a histogram using the $N = 1000$ values of $\hat\rho^*$ and also sort these values. The histogram of the bootstrap correlations suggests that the bootstrap distribution is skewed, so the 95% confidence interval for $\rho$ is $\left[2\hat\rho - \hat\rho^*_{(975)},\ 2\hat\rho - \hat\rho^*_{(25)}\right]$.
## R program of bootstrap algorithm for correlation parameter
## Example 2 of Singh and Xie (2010)
# Bootstrapping and calculation of bootstrap corr coef.
corr.b = matrix(0, 1000)
for (i in 1:1000) {
  # generate the row indices of a new bootstrap sample
  indx = sample(1:nrow(data.ex2), replace = TRUE)
  data.bt = data.ex2[indx, ]
  # calculate correlation coefficient
  corr.b[i] = cor(data.bt[,1], data.bt[,2])
}
## Estimate of corr using the original data
cor(data.ex2[,1], data.ex2[,2]); cor(BOD, HW)
[1] 0.8678753
[1] 0.8678753
# Histogram and bootstrap 95% CI
hist(corr.b); summary(corr.b)
 Min.   :0.6495
 1st Qu.:0.8434
 Median :0.8736
 Mean   :0.8667
 3rd Qu.:0.8966
 Max.   :0.9584
corr.b.srt = sort(corr.b)
CI.95 = c(2 * cor(BOD, HW) - corr.b.srt[975], 2 * cor(BOD, HW) - corr.b.srt[25]); CI.95
[1] 0.7998790 0.9692848
[Figure: Histogram of corr.b. x-axis: corr.b (0.6 to 1.0); y-axis: Frequency (0 to 500).]
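Bootstrap Central Limit Theory (Singh, 1981):
Theorem: Under some mild conditions, when $n$ is large ($n \to \infty$), we have
$$\left(\hat\theta^* - \hat\theta\right) \Big|\, \hat\theta \ \sim\ \left(\hat\theta - \theta_0\right) \Big|\, \theta_0. \tag{1}$$
Here $\sim$ means that the two sides have approximately the same distribution.
Proof: Omitted.
(Notation: The distribution (1) has a cumulative distribution function $G(\cdot)$.)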
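A small simulation can make statement (1) concrete. The sketch below takes the sample mean of exponential data as the estimator; the choice of distribution, the sample size, and all variable names are assumptions for illustration. It compares only the spread of the two sides, not the full distributions.

```r
set.seed(5)
n <- 200
theta0 <- 1                                  # true mean of Exp(rate = 1)

# Right-hand side of (1): theta.hat - theta0 over many independent samples
rhs <- replicate(2000, mean(rexp(n, rate = 1)) - theta0)

# Left-hand side of (1): theta.star - theta.hat, given one observed sample
x <- rexp(n, rate = 1)
lhs <- replicate(2000, mean(sample(x, replace = TRUE)) - mean(x))

# The spreads of the two distributions should be similar for large n
c(sd.bootstrap = sd(lhs), sd.sampling = sd(rhs))
# qqplot(rhs, lhs); abline(0, 1) would compare the two distributions visually
```

In practice only the left-hand side is available, since it needs just the one observed sample; the right-hand side is computable here only because the population is known in the simulation.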
Based on the Bootstrap Central Limit Theory, we can show that the two confidence-interval claims (symmetric and asymmetric cases) are justified.
Proof of the claims:
Case (i) The distribution (1) is symmetric.
We define the cumulative distribution function of the bootstrap estimator given the sample data:
$$B_n(t) = P\left(\hat\theta^* \le t \mid \hat\theta\right).$$
($B_n(t)$ is also known as the bootstrap distribution.)
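We have the following statements:
$B_n(t)$ is monotonically increasing in $t$ (since it is a cumulative distribution function).
When $t = \hat\theta^*_\alpha$, $B_n\left(\hat\theta^*_\alpha\right) = P\left(\hat\theta^* \le \hat\theta^*_\alpha \mid \hat\theta\right) = \alpha$. So we know that $\hat\theta^*_\alpha = B_n^{-1}(\alpha)$.
When $t = \theta_0$, the true parameter value, we have
$$\begin{aligned}
B_n(\theta_0) &= P\left(\hat\theta^* \le \theta_0 \mid \hat\theta\right) = P\left(\hat\theta^* - \hat\theta \le \theta_0 - \hat\theta \mid \hat\theta\right) \\
&= G\left(\theta_0 - \hat\theta\right) && \text{(by $G$'s definition)} \\
&\overset{d}{=} G\left(\hat\theta - \theta_0\right) && \text{(by symmetry; equality in distribution)} \\
&\sim U(0, 1) && \text{(by the theorem, we also have $\hat\theta - \theta_0 \sim G$)}
\end{aligned}$$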
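So for any $\alpha$, $0 < \alpha < 1$,
$$P\left\{\theta_0 \le \hat\theta^*_\alpha\right\} = P\left\{\theta_0 \le B_n^{-1}(\alpha)\right\} = P\left\{B_n(\theta_0) \le \alpha\right\} = P(U \le \alpha) = \alpha.$$
Thus, $\left[\hat\theta^*_{2.5\%},\ \hat\theta^*_{97.5\%}\right]$ is a 95% confidence interval for $\theta$ (with 95% confidence to cover the true $\theta_0$), because
$$P\left(\hat\theta^*_{2.5\%} \le \theta_0 \le \hat\theta^*_{97.5\%}\right) = P\left(\theta_0 \le \hat\theta^*_{97.5\%}\right) - P\left(\theta_0 \le \hat\theta^*_{2.5\%}\right) = 97.5\% - 2.5\% = 95\%.$$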
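Case (ii) The distribution (1) is not symmetric.
We define
$$C_n(t) = P\left(2\hat\theta - \hat\theta^* \le t \mid \hat\theta\right).$$
We have the following statements:
$C_n(t)$ is monotonically increasing in $t$.
When $t = 2\hat\theta - \hat\theta^*_\alpha$, $C_n\left(2\hat\theta - \hat\theta^*_\alpha\right) = P\left(2\hat\theta - \hat\theta^* \le 2\hat\theta - \hat\theta^*_\alpha \mid \hat\theta\right) = P\left(\hat\theta^* \ge \hat\theta^*_\alpha \mid \hat\theta\right) = 1 - \alpha$. So we know that $2\hat\theta - \hat\theta^*_\alpha = C_n^{-1}(1 - \alpha)$.
When $t = \theta_0$, the true parameter value, we have
$$\begin{aligned}
C_n(\theta_0) &= P\left(2\hat\theta - \hat\theta^* \le \theta_0 \mid \hat\theta\right) = P\left(\hat\theta^* - \hat\theta \ge \hat\theta - \theta_0 \mid \hat\theta\right) \\
&= 1 - G\left(\hat\theta - \theta_0\right) && \text{(by $G$'s definition)} \\
&\sim U(0, 1) && \text{(by the theorem, we also have $\hat\theta - \theta_0 \sim G$)}
\end{aligned}$$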
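So for any $\alpha$, $0 < \alpha < 1$,
$$P\left\{\theta_0 \le 2\hat\theta - \hat\theta^*_\alpha\right\} = P\left\{\theta_0 \le C_n^{-1}(1 - \alpha)\right\} = P\left\{C_n(\theta_0) \le 1 - \alpha\right\} = P(U \le 1 - \alpha) = 1 - \alpha.$$
Thus, $\left[2\hat\theta - \hat\theta^*_{97.5\%},\ 2\hat\theta - \hat\theta^*_{2.5\%}\right]$ is a 95% confidence interval for $\theta$ (with 95% confidence to cover the true $\theta_0$), because
$$P\left(2\hat\theta - \hat\theta^*_{97.5\%} \le \theta_0 \le 2\hat\theta - \hat\theta^*_{2.5\%}\right) = P\left(\theta_0 \le 2\hat\theta - \hat\theta^*_{2.5\%}\right) - P\left(\theta_0 \le 2\hat\theta - \hat\theta^*_{97.5\%}\right) = 97.5\% - 2.5\% = 95\%.$$
□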
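The coverage statements in the proof can be checked empirically. The sketch below is a small Monte Carlo experiment under assumed settings (normal data, the sample mean as estimator, `n = 40`, `N = 400` bootstrap replicates, 300 repetitions); none of these choices come from the lecture. It estimates how often each of the two 95% intervals covers the true value.

```r
set.seed(6)
theta0 <- 0; n <- 40; N <- 400; reps <- 300   # illustrative settings

covered <- replicate(reps, {
  x <- rnorm(n, mean = theta0)                 # a fresh sample from the population
  theta.hat <- mean(x)
  srt <- sort(replicate(N, mean(sample(x, replace = TRUE))))
  lo <- srt[round(0.025 * N)]; hi <- srt[round(0.975 * N)]
  c(percentile = (lo <= theta0 && theta0 <= hi),
    reflected  = (2 * theta.hat - hi <= theta0 && theta0 <= 2 * theta.hat - lo))
})
rowMeans(covered)   # empirical coverage of each interval
```

The result is asymptotic, so with a moderate sample size the empirical coverage is expected to be close to, but not exactly, 95%.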
Other primary applications (besides CIs) of the bootstrap sampling method
Approximating the standard error of a sample estimate: use
$$\mathrm{se}_B = \left[\frac{1}{N}\sum_{i=1}^{N}\left(\hat\theta^*_i - \hat\theta\right)^2\right]^{1/2}$$
to estimate the standard error $\mathrm{se}(\hat\theta)$.
Bias correction by bootstrap: often, $\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta_0 \approx O(1/n)$. This bias can be estimated by
$$\mathrm{Bias}_B = \frac{1}{N}\sum_{i=1}^{N}\hat\theta^*_i - \hat\theta.$$
Let's bootstrap Bill Gates! ... Happy "bootstrappers"
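The standard-error and bias formulas can be sketched in R as follows. The function name `boot.se.bias` and the simulated data are assumptions for illustration; the two returned quantities follow the definitions of $\mathrm{se}_B$ and $\mathrm{Bias}_B$ given on this slide.

```r
boot.se.bias <- function(x, stat, N = 1000) {
  theta.hat <- stat(x)
  theta.star <- replicate(N, stat(sample(x, replace = TRUE)))
  c(se   = sqrt(mean((theta.star - theta.hat)^2)),  # se_B
    bias = mean(theta.star) - theta.hat)            # Bias_B
}

set.seed(7)
x <- rnorm(100, mean = 5, sd = 2)     # hypothetical sample
boot.se.bias(x, mean)
sd(x) / sqrt(length(x))               # textbook standard error of the mean, for comparison
```

For the sample mean the bootstrap standard error should land near the textbook value, and the estimated bias should be near zero, which gives a quick sanity check before applying the same function to a statistic with no closed-form standard error.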
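Application – Bootstrap method in regression models
Linear regression model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \text{for } i = 1, 2, \ldots, n,$$
where $\varepsilon_i \sim (0, \sigma^2)$, that is, the errors have mean 0 and variance $\sigma^2$, with no distributional form assumed.
Least squares (LS) estimator
$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}.$$
We want to make inference on $\beta_1$.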
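If $\varepsilon_i \sim N(0, \sigma^2)$,
$$\hat\beta_1 \sim N\left(\beta_1,\ \sigma^2\left\{\sum_{i=1}^{n}(x_i - \bar x)^2\right\}^{-1}\right), \qquad (n - 2)s^2 \sim \sigma^2\chi^2_{n-2}.$$
⟹ we can use the conventional $t$ test, $t = \hat\beta_1 / \mathrm{se}(\hat\beta_1)$ with $\mathrm{se}(\hat\beta_1) = s\left\{\sum_{i=1}^{n}(x_i - \bar x)^2\right\}^{-1/2}$ (or the $z$ test when $n$ is large), to make inference on $\beta_1$.
Alternatively, we can use the bootstrap approach to make inference for $\beta_1$ (we only need to assume $\varepsilon_i \sim (0, \sigma^2)$).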
Method 1: Resample data pairs
Resample from $(x_i, y_i)$, preserving the pairs:
$$\begin{array}{ccc}
(x_1, y_1) & & (x_1^*, y_1^*) \\
(x_2, y_2) & & (x_2^*, y_2^*) \\
\vdots & \overset{\text{bootstrap}}{\Longrightarrow} & \vdots \\
(x_n, y_n) & & (x_n^*, y_n^*) \\
\Downarrow & & \Downarrow \\
\hat\beta_1 & & \hat\beta_1^*
\end{array}$$
Repeat $N$ times to get $N$ copies of $\hat\beta_1^*$.
Based on these $N$ copies of $\hat\beta_1^*$, we can make inference about $\beta_1$ (compute confidence intervals, perform tests, etc.).
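Method 1 can be sketched for a simple linear regression with a continuous covariate. The simulated data, the true slope of 2, and the variable names are assumptions for illustration. Whole rows are resampled so that each $(x_i, y_i)$ pair stays together.

```r
set.seed(8)
n <- 50
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 1.5)     # hypothetical data with true slope 2
dat <- data.frame(x = x, y = y)

N <- 1000
beta1.star <- numeric(N)
for (k in seq_len(N)) {
  idx <- sample.int(n, size = n, replace = TRUE)   # resample row indices
  beta1.star[k] <- coef(lm(y ~ x, data = dat[idx, ]))["x"]
}
sort(beta1.star)[c(25, 975)]            # percentile-type 95% interval for the slope
```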
Method 2: Resample residuals
Based on the sample data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we can obtain the LS estimates $\hat\beta_0$ and $\hat\beta_1$. Also, compute the residuals $\{e_1, \ldots, e_n\}$.
Resample from the residual set $\{e_1, \ldots, e_n\}$ to obtain bootstrap residuals $\{e_1^*, \ldots, e_n^*\}$.
Define $y_i^* = \hat\beta_0 + \hat\beta_1 x_i + e_i^*$, for $i = 1, \ldots, n$, so that we have a bootstrap data set:
$$\{(x_1, y_1^*), (x_2, y_2^*), \ldots, (x_n, y_n^*)\}.$$
Based on this bootstrap data set, we can get a bootstrap estimate $\hat\beta_1^*$.
Repeat the last three steps (resample the residuals, form the bootstrap data set, and re-estimate) $N$ times to get $N$ copies of $\hat\beta_1^*$.
Based on these $N$ copies of $\hat\beta_1^*$, we can make inference about $\beta_1$ (compute confidence intervals, perform tests, etc.).
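Method 2 can be sketched under the same kind of setup. The simulated data and variable names are again assumptions for illustration. The covariate values stay fixed, and only the residuals are resampled and added back to the fitted values.

```r
set.seed(9)
n <- 50
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 1.5)     # hypothetical data with true slope 2
fit <- lm(y ~ x)
e <- residuals(fit)                      # residuals e_1, ..., e_n
y.fit <- fitted(fit)                     # beta0.hat + beta1.hat * x_i

N <- 1000
beta1.star <- replicate(N, {
  y.star <- y.fit + sample(e, size = n, replace = TRUE)  # y*_i = fitted_i + e*_i
  coef(lm(y.star ~ x))["x"]
})
sort(beta1.star)[c(25, 975)]            # percentile-type 95% interval for the slope
```

Resampling residuals relies on the errors being exchangeable across observations (for example, constant variance), whereas resampling pairs does not; this is one practical consideration when choosing between the two methods.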
### Example: Bootstrap method for regression
## Annette Dobson (1990) "An Introduction to
## Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
mydata <- data.frame(weight, group)
## Linear regression:
lm.D9 <- lm(weight ~ group, data = mydata)
> summary(lm.D9)
## Parameter estimate
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.2202  22.850 9.55e-15 ***
groupTrt     -0.3710     0.3114  -1.191    0.249
## 95% Confidence Intervals:
> confint(lm.D9, "groupTrt")
           2.5 %    97.5 %
groupTrt -1.0253 0.2833003
## Bootstrap Methods:
## Function for method 1:
boot.meth1 <- function(data = mydata, indices){
  data <- data[indices,]  # select obs. in bootstrap sample
  mod <- lm(formula = weight ~ group, data = data)
  coefficients(mod)  # return coefficient vector
}
## Function for method 2:
boot.meth2 <- function(data = mydata, indices, fit = lm.D9){
  # fitted values plus resampled residuals (uses the 'fit' argument)
  weight.boot <- fitted(fit) + residuals(fit)[indices]
  data.star <- data; data.star[,1] <- weight.boot
  mod <- lm(weight ~ group, data = data.star)
  coefficients(mod)  # return coefficient vector
}
> #### Use my own code to run the bootstrap
> ## Bootstrap sample size 5000
> b1.vec = b2.vec = rep(0, 5000)
> for (ii in 1:5000) {
+   b.indx = sample(1:nrow(mydata), replace = TRUE)
+   b1.vec[ii] = boot.meth1(mydata, b.indx)["groupTrt"]
+   b2.vec[ii] = boot.meth2(mydata, b.indx, lm.D9)["groupTrt"]}
> ## Confidence intervals
> b1.vec = sort(b1.vec); b2.vec = sort(b2.vec)
> c(low = b1.vec[125], up = b1.vec[4875]);
       low         up
-0.9650505  0.2376923
> c(low = b2.vec[125], up = b2.vec[4875]);
    low      up
-0.9632  0.1942
> ## histograms
> par(mfrow = c(1,2)); hist(b1.vec); hist(b2.vec)
[Figure: Histogram of b1.vec (left panel) and Histogram of b2.vec (right panel). x-axes: b1.vec and b2.vec; y-axes: Frequency.]
> ## Run bootstrap through R's "boot" function:
> library(boot)
> out.boot.meth1 <- boot(mydata, boot.meth1, 5000)
> out.boot.meth1
Bootstrap Statistics :
    original        bias    std. error
t1*    5.032  -0.002156102   0.1769029
t2*   -0.371   0.005725787   0.3045753
> boot.ci(out.boot.meth1, index=2, type=c("norm", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
Intervals :
Level      Normal              Percentile
95%   (-0.9737,  0.2202 )   (-0.9492,  0.2200 )
Calculations and Intervals on Original Scale
> out.boot.meth2 <- boot(mydata, boot.meth2, 5000)
> out.boot.meth2
Bootstrap Statistics :
    original        bias    std. error
t1*    5.032   0.00198706    0.2065991
t2*   -0.371  -0.00548282    0.2957972
> boot.ci(out.boot.meth2, index=2, type=c("norm", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
Intervals :
Level      Normal              Percentile
95%   (-0.9453,  0.2142 )   (-0.9500,  0.2023 )
Calculations and Intervals on Original Scale
> plot(out.boot.meth1, index = 2)
[Figure: Histogram of t (x-axis: t*, y-axis: Density) and normal Q-Q plot (x-axis: Quantiles of Standard Normal, y-axis: t*) for method 1.]
> plot(out.boot.meth2, index = 2)
[Figure: Histogram of t (x-axis: t*, y-axis: Density) and normal Q-Q plot (x-axis: Quantiles of Standard Normal, y-axis: t*) for method 2.]
Further remarks on bootstrap estimation
We introduced the bootstrap approach and illustrated it using some basic and regression examples. The methodology is very broad and can be used in many applications.
– Because it is a simulation-based method, one may not get exactly the same numerical answer when repeating the same code. (A common practical solution to this problem: fix the random seed at the beginning.)
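The effect of fixing the seed can be sketched with `set.seed()`. The helper `boot.mean.ci` and the seed value 2016 are assumptions for illustration; the data are the control-group plant weights from the regression example.

```r
# Percentile-type 95% interval for the mean, from N bootstrap replicates
boot.mean.ci <- function(x, N = 1000) {
  srt <- sort(replicate(N, mean(sample(x, replace = TRUE))))
  srt[c(round(0.025 * N), round(0.975 * N))]
}
x <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)  # ctl weights

set.seed(2016); ci.a <- boot.mean.ci(x)
set.seed(2016); ci.b <- boot.mean.ci(x)   # same seed, hence the same resamples
ci.c <- boot.mean.ci(x)                   # seed not reset: a different stream of resamples

identical(ci.a, ci.b)   # TRUE
rbind(ci.a, ci.c)       # the endpoints generally differ slightly
```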
Further remarks on bootstrap estimation (continued)
Most bootstrap methods are developed to study independent observations. When used to study correlations or dependent data, the key is to preserve the correlation/dependence.
– For example, in our examples on correlation coefficients and regressions, we have tried to preserve the correlation/dependence.
– For dependent samples (for example, time series models, Brownian motion or other stochastic processes), a useful scheme is the moving-block bootstrap. [Self study material - (http://www2.econ.iastate.edu/classes/econ674/bunzel/documents/DepBootstrap.pdf)]
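A minimal sketch of a moving-block resampling step is given here, assuming overlapping blocks of a fixed length that are drawn with replacement and concatenated. The function name `moving.block.sample`, the block length of 10, and the simulated AR(1) series are illustrative assumptions and are not taken from the lecture or the self-study material.

```r
moving.block.sample <- function(x, block.len) {
  n <- length(x)
  n.blocks <- ceiling(n / block.len)
  # starting positions of the overlapping blocks, drawn with replacement
  starts <- sample.int(n - block.len + 1, size = n.blocks, replace = TRUE)
  idx <- unlist(lapply(starts, function(s) s:(s + block.len - 1)))
  x[idx[seq_len(n)]]                      # trim to the original length
}

set.seed(10)
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = 200))  # hypothetical AR(1) series
boot.means <- replicate(1000, mean(moving.block.sample(x, block.len = 10)))
sd(boot.means)    # block-bootstrap standard error of the sample mean
```

Resampling blocks rather than single observations keeps neighbouring values together, so the short-range dependence within each block is preserved in the bootstrap series.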
Good night!