FSRM 16:958:587 Advanced Simulation Methods for Finance (Lecture 4) · 2016. 2. 17. · (Lecture 4)...

FSRM 16:958:587 Advanced Simulation Methods for Finance

(Lecture 4)

Min-ge Xie

Department of Statistics & Biostatistics, Rutgers University

Bootstrap – a Simulation & Resampling Method

General statements (overly simplified intro/some key words)

Bootstrap method in Statistics is a resampling (simulation) approach for making statistical inference about unknown parameters (of an underlying population from which the observed sample data are generated).

Bootstrap samples are simulated “phantom” samples based on the observed sample data; bootstrap distributions are derived from the bootstrap samples, and they can be used to make statistical inference.

Although it is a simulation method, it is a little different from the simulation techniques that we have learned before.


Motivation/background

The primary task of a statistician is to summarize a sample-based study and generalize the finding to the parent (underlying) population in a scientific manner.

The summary (often through a sample statistic such as the mean, median, correlation, etc.) will fluctuate from sample to sample.

We would like to know the magnitude of these fluctuations to get an overall picture. This fluctuation can often be described in the form of a probability distribution called a sampling distribution.


Motivation/background (continued)

Suppose we do not make many assumptions (do not know much) about the underlying population:

Ideally, if we could repeatedly draw samples from the target distribution ⟹ we would have many copies of the sample statistic from these repeatedly drawn samples ⟹ these many copies of the sample statistic would give us a good idea of the fluctuation and the sampling distribution.

But, in reality, we only have one set (copy) of observed data (sample).


The idea behind the bootstrap is to use the observed sample data as a “surrogate population” for the purpose of approximating the sampling distribution of a statistic.

Specifically,

– We resample with replacement from the sample data at hand and create a large number of “phantom samples” known as bootstrap samples.

– These bootstrap samples can be used to quantify the fluctuation (“make inference”) of a “population parameter” of the “surrogate population”.

– Under some conditions, the “phantom” inference is the same as (can help to derive) the real inference that we are looking for.

This leads us to the “bootstrap method”!


Read a companion note (introductory review article) by Singh and Xie (2010) in the International Encyclopedia of Education:

http://stat.rutgers.edu/~mxie/RCPapers/bootstrap.pdf
or
http://www.sciencedirect.com/science/referenceworks/9780080448947


Setting/Setup: We have a sample data set $x_1, \ldots, x_n \overset{\text{i.i.d.}}{\sim} F(x)$. Also, let $\theta$ be a population characteristic of the distribution $F$, and suppose we have an estimator $\hat\theta$ for $\theta$, where $\hat\theta = \hat\theta(x_1, \ldots, x_n)$ is a function of the sample set $\mathbf{x} = (x_1, \ldots, x_n)$.

For example, $\theta$ is the mean of the distribution $F$ (population mean) and $\hat\theta = \bar{X}$.

Goal: We need to make an inference about $\theta$. Besides the point estimator $\hat\theta$, we would like to know the sampling distribution of $\hat\theta = \hat\theta(x_1, \ldots, x_n)$; in particular, we want to find a confidence interval for $\theta$, etc.


Bootstrap (generate) a set of new (“phantom”) data

From the observed sample data set $\{x_1, x_2, \ldots, x_n\}$, resample with replacement to get a new data set of size $n$:

– Randomly pick (each with probability $1/n$) a data point from $\{x_1, x_2, \ldots, x_n\}$ and set it to be $x_1^*$; repeat exactly the same random pick $n - 1$ more times to get $x_2^*, x_3^*, \ldots, x_n^*$.

This new set of data $\{x_1^*, \ldots, x_n^*\}$ is called a bootstrap sample.

To make statistical inference, we repeat this bootstrap sampling process a large number of times (say $N$) to get $N$ bootstrap samples.
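The resampling step described above can be sketched in R with `sample()`. This is only a minimal illustration: the five data values, the seed, and the names `x.star` and `boot.samples` are assumptions made for the sketch, not part of the lecture.

```r
# Illustrative data; any numeric vector of length > 1 could be used here.
set.seed(1)                        # fix the seed so the sketch is reproducible
x <- c(2.5, 4.0, 4.1, 6.2, 7.1)    # observed sample, n = 5
n <- length(x)

# One bootstrap sample: n draws from x, each point picked with probability 1/n
x.star <- sample(x, size = n, replace = TRUE)

# N bootstrap samples, stored as the columns of an n x N matrix
N <- 1000
boot.samples <- replicate(N, sample(x, size = n, replace = TRUE))
dim(boot.samples)                  # n rows, N columns
```

Because the draws are made with replacement, a bootstrap sample usually repeats some observed values and omits others.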


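A bootstrap sampling algorithm to get a CI for $\theta$:

1. Generate $N$ bootstrap data sets, each of size $n$, and compute the corresponding bootstrap estimator $\hat\theta^*$:

   1st: $\{x_1^*, \ldots, x_n^*\}_{[1]} \sim \{x_1, \ldots, x_n\}$, $\quad \hat\theta_1^* = \hat\theta\left(\{x_1^*, \ldots, x_n^*\}_{[1]}\right)$

   2nd: $\{x_1^*, \ldots, x_n^*\}_{[2]} \sim \{x_1, \ldots, x_n\}$, $\quad \hat\theta_2^* = \hat\theta\left(\{x_1^*, \ldots, x_n^*\}_{[2]}\right)$

   ...

   Nth: $\{x_1^*, \ldots, x_n^*\}_{[N]} \sim \{x_1, \ldots, x_n\}$, $\quad \hat\theta_N^* = \hat\theta\left(\{x_1^*, \ldots, x_n^*\}_{[N]}\right)$

2. Sort $\{\hat\theta_1^*, \hat\theta_2^*, \ldots, \hat\theta_N^*\}$ from the smallest to the largest. Now we have $\hat\theta^*_{(1)} \le \hat\theta^*_{(2)} \le \cdots \le \hat\theta^*_{(N)}$; a histogram could be constructed.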


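Claim (symmetric case): A $(1-\alpha)100\%$ confidence interval for the parameter $\theta$ is simply

$$\left[\hat\theta^*_{(L)},\ \hat\theta^*_{(U)}\right],$$

where $L = \alpha N/2$ and $U = (1 - \alpha/2)N$, for $0 < \alpha < 1$.

For example, a 95% C.I. for the parameter $\theta$ is $\left[\hat\theta^*_{(25)},\ \hat\theta^*_{(975)}\right]$ if $N = 1000$. (Note: $L = 1000 \times 0.025 = 25$, $U = 1000 \times 0.975 = 975$.)

An equivalent way to write the above confidence interval is simply

$$\left[\hat\theta^*_{\alpha/2},\ \hat\theta^*_{1-\alpha/2}\right],$$

where $\hat\theta^*_{\alpha}$ denotes the $\alpha$-quantile of the bootstrap estimates.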
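The symmetric-case interval can be computed in a few lines of R. The sketch below is a hedged illustration: the function name `percentile.ci`, the simulated data, and the choice to round $L$ and $U$ to the nearest integer are assumptions, not part of the lecture.

```r
# Percentile-type interval [theta*_(L), theta*_(U)] from sorted bootstrap estimates
percentile.ci <- function(x, estimator, N = 1000, alpha = 0.05) {
  # Step 1: N bootstrap estimates, each from a resample of size length(x)
  theta.star <- replicate(N, estimator(sample(x, replace = TRUE)))
  # Step 2: sort from smallest to largest
  theta.star <- sort(theta.star)
  # L = alpha*N/2 and U = (1 - alpha/2)*N; rounding is an assumed convention
  L <- max(1, round(alpha * N / 2))
  U <- min(N, round((1 - alpha / 2) * N))
  c(lower = theta.star[L], upper = theta.star[U])
}

set.seed(2016)
x  <- rnorm(50, mean = 10, sd = 2)   # simulated data, for illustration only
ci <- percentile.ci(x, mean)         # 95% interval for the population mean
ci
```

With the defaults `N = 1000` and `alpha = 0.05`, the function returns the 25th and 975th sorted bootstrap estimates, matching the example in the claim.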


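Claim (asymmetric case): A $(1-\alpha)100\%$ confidence interval for the parameter $\theta$ is

$$\left[2\hat\theta - \hat\theta^*_{(U)},\ 2\hat\theta - \hat\theta^*_{(L)}\right],$$

where $L = \alpha N/2$ and $U = (1 - \alpha/2)N$, for $0 < \alpha < 1$.

For example, a 95% C.I. for the parameter $\theta$ is just $\left[2\hat\theta - \hat\theta^*_{(975)},\ 2\hat\theta - \hat\theta^*_{(25)}\right]$ if $N = 1000$.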
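A minimal R sketch of the asymmetric-case interval follows. The function name `reflected.ci`, the skewed simulated data, and the use of the median as the estimator are assumptions chosen only for illustration.

```r
# Interval [2*theta.hat - theta*_(U), 2*theta.hat - theta*_(L)]
reflected.ci <- function(x, estimator, N = 1000, alpha = 0.05) {
  theta.hat  <- estimator(x)                        # estimate from the observed data
  theta.star <- sort(replicate(N, estimator(sample(x, replace = TRUE))))
  L <- max(1, round(alpha * N / 2))                 # rounding is an assumed convention
  U <- min(N, round((1 - alpha / 2) * N))
  c(lower = 2 * theta.hat - theta.star[U],
    upper = 2 * theta.hat - theta.star[L])
}

set.seed(587)
x  <- rexp(40, rate = 1)        # skewed simulated data, for illustration only
ci <- reflected.ci(x, median)   # 95% interval for the population median
ci
```

Note that the upper sorted estimate determines the lower endpoint and vice versa, which is what reflects the bootstrap distribution around $\hat\theta$.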


Example (cf. Example 2 of the companion note by Singh and Xie (2010)):

Data: Two types of measurements to assess body fat in n = 20 collegiate football players

BOD:  2.5  4.0  4.1  6.2  7.1  7.0  8.3  9.2  9.3 12.0 12.2 12.6 14.2 14.4 15.1 15.2 16.3 17.1 17.9 17.9
HW:   8.0  6.2  9.2  6.4  8.6 12.2  7.2 12.0 14.9 12.1 15.3 14.8 14.3 16.3 17.9 19.5 17.5 14.3 18.3 16.2

– BOD is BOD POD, a whole-body air-displacement plethysmograph.

– HW refers to hydrostatic weighing.

Question: Study the correlation between the BOD and HW measurements (find a confidence interval).


## R program of bootstrap algorithm for correlation parameter
## Example 2 of Singh and Xie (2010)

# Data (entered with c() so that the script also runs non-interactively)
BOD = c(2.5, 4.0, 4.1, 6.2, 7.1, 7.0, 8.3, 9.2, 9.3, 12.0,
        12.2, 12.6, 14.2, 14.4, 15.1, 15.2, 16.3, 17.1, 17.9, 17.9)
HW = c(8.0, 6.2, 9.2, 6.4, 8.6, 12.2, 7.2, 12.0, 14.9, 12.1,
       15.3, 14.8, 14.3, 16.3, 17.9, 19.5, 17.5, 14.3, 18.3, 16.2)
data.ex2 = cbind(BOD, HW)

## Boxplot of the data and the scatter plot
par(mfrow = c(1,2))
boxplot(data.ex2); plot(BOD, HW)


[Figure: boxplots of BOD and HW (left panel); scatter plot of HW against BOD (right panel).]


A generic bootstrap algorithm for this example

Step 1: At each iteration $k = 1, 2, \ldots, N$ (with $N = 1000$), generate a bootstrap data set of $n = 20$ pairs by repeating the following procedure:

1. For $i = 1, \ldots, 20$, randomly sample a pair $(x_i^*, y_i^*)$ from the 20 observed data pairs $\{(2.5, 8.0), (4.0, 6.2), \ldots, (17.9, 16.2)\}$ (sample with replacement). These new 20 pairs form a bootstrap sample set $(\mathbf{x}^*, \mathbf{y}^*) = \{(x_1^*, y_1^*), (x_2^*, y_2^*), \ldots, (x_{20}^*, y_{20}^*)\}$.

2. Compute the bootstrap sample correlation coefficient $\hat\rho^* = \mathrm{corr}(\mathbf{x}^*, \mathbf{y}^*)$.

Step 2: Produce a histogram using the $N = 1000$ values of $\hat\rho^*$ and also sort them. The histogram of the bootstrap correlations suggests that the bootstrap distribution is skewed, so the 95% confidence interval for $\rho$ is $\left[2\hat\rho - \hat\rho^*_{(975)},\ 2\hat\rho - \hat\rho^*_{(25)}\right]$.


## R program of bootstrap algorithm for correlation parameter
## Example 2 of Singh and Xie (2010)

# Bootstrapping and calculation of bootstrap corr coef.
corr.b = matrix(0, 1000)
for (i in 1:1000) {
  # generate a new bootstrap sample by resampling row indices with replacement
  indx = sample(1:nrow(data.ex2), replace = TRUE)
  data.bt = data.ex2[indx, ]
  # calculate correlation coefficient
  corr.b[i] = cor(data.bt[,1], data.bt[,2])
}


## Estimate of corr using the original data
cor(data.ex2[,1], data.ex2[,2]); cor(BOD, HW)
[1] 0.8678753
[1] 0.8678753

# Histogram and bootstrap 95% CI
hist(corr.b); summary(corr.b)
 Min.   :0.6495
 1st Qu.:0.8434
 Median :0.8736
 Mean   :0.8667
 3rd Qu.:0.8966
 Max.   :0.9584

corr.b.srt = sort(corr.b)
CI.95 = c(2 * cor(BOD, HW) - corr.b.srt[975], 2 * cor(BOD, HW) - corr.b.srt[25]); CI.95
[1] 0.7998790 0.9692848


[Figure: Histogram of corr.b. x-axis: corr.b (0.6 to 1.0); y-axis: Frequency (0 to 500).]


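Bootstrap Central Limit Theory (Singh, 1981):

Theorem: Under some mild conditions, when $n$ is large ($n \to \infty$), we have

$$(\hat\theta^* - \hat\theta)\,\big|\,\hat\theta \ \sim\ (\hat\theta - \theta_0)\,\big|\,\theta_0. \qquad (1)$$

Here $\theta_0$ denotes the true parameter value.

Proof: Omitted.

(Notation: The distribution in (1) has a cumulative distribution function $G(\cdot)$.)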
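A small simulation can make statement (1) more concrete. The sketch below is an illustration under assumed settings (an Exponential(1) population, the sample mean as estimator, $n = 100$); it is not a proof and is not taken from the lecture.

```r
set.seed(4)
n <- 100
theta0 <- 1    # true mean of the Exponential(1) population

# (a) Sampling distribution of theta.hat - theta0: redraw from the population
samp.diff <- replicate(2000, mean(rexp(n, rate = 1)) - theta0)

# (b) Bootstrap distribution of theta.star - theta.hat: uses ONE observed sample
x <- rexp(n, rate = 1)
boot.diff <- replicate(2000, mean(sample(x, replace = TRUE)) - mean(x))

# For large n, the two distributions should have similar spread and quantiles
c(sd.sampling = sd(samp.diff), sd.bootstrap = sd(boot.diff))
rbind(sampling  = quantile(samp.diff, c(0.025, 0.975)),
      bootstrap = quantile(boot.diff, c(0.025, 0.975)))
```

In practice only (b) is available, since (a) requires repeated access to the population; the theorem is what allows (b) to stand in for (a).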


Based on the Bootstrap Central Limit Theory, we can show that the two confidence-interval claims above (symmetric and asymmetric cases) are justified.

Proof of the claims:

Case (i): The distribution (1) is symmetric.

We define the cumulative distribution of the bootstrap estimator given the sample data:

$$B_n(t) = P\left(\hat\theta^* \le t \mid \hat\theta\right).$$

($B_n(t)$ is also known as the bootstrap distribution.)


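We have the following statements:

– $B_n(t)$ is monotonically increasing in $t$ (since it is a cumulative distribution function).

– When $t = \hat\theta^*_\alpha$, $B_n(\hat\theta^*_\alpha) = P\left(\hat\theta^* \le \hat\theta^*_\alpha \mid \hat\theta\right) = \alpha$. So we know that $\hat\theta^*_\alpha = B_n^{-1}(\alpha)$.

– When $t = \theta_0$, the true parameter value, we have

$$\begin{aligned}
B_n(\theta_0) &= P\left(\hat\theta^* \le \theta_0 \mid \hat\theta\right) = P\left(\hat\theta^* - \hat\theta \le \theta_0 - \hat\theta \mid \hat\theta\right) \\
&= G(\theta_0 - \hat\theta) && \text{(by $G$'s definition)} \\
&\overset{d}{=} G(\hat\theta - \theta_0) && \text{(by symmetry; equality in distribution)} \\
&\sim U(0, 1) && \text{(by the theorem that we also have $\hat\theta - \theta_0 \sim G$).}
\end{aligned}$$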


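So for any $\alpha$, $0 < \alpha < 1$,

$$P\left\{\theta_0 \le \hat\theta^*_\alpha\right\} = P\left\{\theta_0 \le B_n^{-1}(\alpha)\right\} = P\left\{B_n(\theta_0) \le \alpha\right\} = P(U \le \alpha) = \alpha.$$

Thus, $\left[\hat\theta^*_{2.5\%},\ \hat\theta^*_{97.5\%}\right]$ is a 95% confidence interval for $\theta$ (with 95% confidence to cover the true $\theta_0$), because

$$P\left(\hat\theta^*_{2.5\%} \le \theta_0 \le \hat\theta^*_{97.5\%}\right) = P\left(\theta_0 \le \hat\theta^*_{97.5\%}\right) - P\left(\theta_0 \le \hat\theta^*_{2.5\%}\right) = 97.5\% - 2.5\% = 95\%.$$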


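Case (ii): The distribution (1) is not symmetric.

We define

$$C_n(t) = P\left(2\hat\theta - \hat\theta^* \le t \mid \hat\theta\right).$$

We have the following statements:

– $C_n(t)$ is monotonically increasing in $t$.

– When $t = 2\hat\theta - \hat\theta^*_\alpha$, $C_n(2\hat\theta - \hat\theta^*_\alpha) = P\left(2\hat\theta - \hat\theta^* \le 2\hat\theta - \hat\theta^*_\alpha \mid \hat\theta\right) = P\left(\hat\theta^* \ge \hat\theta^*_\alpha \mid \hat\theta\right) = 1 - \alpha$. So we know that $2\hat\theta - \hat\theta^*_\alpha = C_n^{-1}(1 - \alpha)$.

– When $t = \theta_0$, the true parameter value, we have

$$\begin{aligned}
C_n(\theta_0) &= P\left(2\hat\theta - \hat\theta^* \le \theta_0 \mid \hat\theta\right) = P\left(\hat\theta^* - \hat\theta \ge \hat\theta - \theta_0 \mid \hat\theta\right) \\
&= 1 - G(\hat\theta - \theta_0) && \text{(by $G$'s definition)} \\
&\sim U(0, 1) && \text{(by the theorem that we also have $\hat\theta - \theta_0 \sim G$).}
\end{aligned}$$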


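So for any $\alpha$, $0 < \alpha < 1$,

$$P\left\{\theta_0 \le 2\hat\theta - \hat\theta^*_\alpha\right\} = P\left\{\theta_0 \le C_n^{-1}(1 - \alpha)\right\} = P\left\{C_n(\theta_0) \le 1 - \alpha\right\} = P(U \le 1 - \alpha) = 1 - \alpha.$$

Thus, $\left[2\hat\theta - \hat\theta^*_{97.5\%},\ 2\hat\theta - \hat\theta^*_{2.5\%}\right]$ is a 95% confidence interval for $\theta$ (with 95% confidence to cover the true $\theta_0$), because

$$P\left(2\hat\theta - \hat\theta^*_{97.5\%} \le \theta_0 \le 2\hat\theta - \hat\theta^*_{2.5\%}\right) = P\left(\theta_0 \le 2\hat\theta - \hat\theta^*_{2.5\%}\right) - P\left(\theta_0 \le 2\hat\theta - \hat\theta^*_{97.5\%}\right) = 97.5\% - 2.5\% = 95\%.$$

□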
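The coverage statements in the proof can be checked empirically. The following R sketch estimates the coverage of both 95% intervals for the mean of a normal population; the population, sample size, and numbers of replications are assumptions chosen to keep the run short, and with finite $n$ the observed coverage is typically somewhat below the nominal 95%.

```r
set.seed(2016)
theta0 <- 0      # true mean, known here because we simulate the data
n      <- 30     # sample size
N      <- 400    # bootstrap replicates per data set
reps   <- 300    # number of simulated data sets

cover <- replicate(reps, {
  x          <- rnorm(n, mean = theta0)
  theta.hat  <- mean(x)
  theta.star <- sort(replicate(N, mean(sample(x, replace = TRUE))))
  lo <- theta.star[round(0.025 * N)]    # theta*_(L), L = 10
  hi <- theta.star[round(0.975 * N)]    # theta*_(U), U = 390
  c(symmetric.case  = (lo <= theta0 && theta0 <= hi),
    asymmetric.case = (2 * theta.hat - hi <= theta0 &&
                       theta0 <= 2 * theta.hat - lo))
})

rowMeans(cover)   # proportion of data sets whose interval covers theta0
```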


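Other primary applications (besides CIs) of the bootstrap sampling method

Approximating the standard error of a sample estimate: Use

$$se_B = \left[\frac{1}{N}\sum_{i=1}^{N}\left(\hat\theta_i^* - \hat\theta\right)^2\right]^{1/2}$$

to estimate the standard error $se(\hat\theta)$.

Bias correction by bootstrap: Often, $\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta_0$ is of order $O(1/n)$. This bias can be estimated by

$$\mathrm{Bias}_B = \frac{1}{N}\sum_{i=1}^{N}\hat\theta_i^* - \hat\theta.$$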
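Both quantities are one-liners once the bootstrap estimates are available. The R sketch below uses simulated data and the plug-in estimator $\bar{X}^2$ of $(E X)^2$, which has a bias of order $1/n$; the data, the estimator, and the variable names are assumptions made for illustration.

```r
set.seed(10)
x <- rexp(30, rate = 1)                 # simulated data, for illustration only
estimator <- function(z) mean(z)^2      # plug-in estimator of (E X)^2, biased upward
theta.hat <- estimator(x)

N <- 2000
theta.star <- replicate(N, estimator(sample(x, replace = TRUE)))

# Bootstrap standard error, centered at theta.hat as in the se_B formula
se.B <- sqrt(mean((theta.star - theta.hat)^2))

# Bootstrap bias estimate: mean of the bootstrap estimates minus theta.hat
bias.B <- mean(theta.star) - theta.hat

# The usual bias-corrected estimate subtracts the estimated bias
theta.corrected <- theta.hat - bias.B

c(estimate = theta.hat, se.B = se.B, bias.B = bias.B, corrected = theta.corrected)
```

A common alternative is `sd(theta.star)`, which centers at the mean of the bootstrap estimates instead of at $\hat\theta$; the two agree closely when the estimated bias is small.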

Let’s bootstrap Bill Gates! ... Happy “bootstrappers”


Application – Bootstrap method in regression models

Linear regression model

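$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \text{for } i = 1, 2, \ldots, n,$$

where $\varepsilon_i \sim (0, \sigma^2)$, i.e., the errors have mean 0 and variance $\sigma^2$.

Least squares (LS) estimator

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$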

We want to make inference on β1.


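If $\varepsilon_i \sim N(0, \sigma^2)$,

$$\hat\beta_1 \sim N\left(\beta_1,\ \sigma^2\left\{\sum_{i=1}^{n} (x_i - \bar{x})^2\right\}^{-1}\right), \qquad (n - 2)s^2 \sim \sigma^2\chi^2_{n-2}.$$

⟹ we can use the conventional $t = \hat\beta_1 / s_{\hat\beta_1}$ test (or $z$ test when $n$ is large) to make inference on $\beta_1$, where $s_{\hat\beta_1} = s\left\{\sum_{i=1}^{n} (x_i - \bar{x})^2\right\}^{-1/2}$ is the estimated standard error of $\hat\beta_1$.

Alternatively, we can use the bootstrap approach to make inference for $\beta_1$ (we only need to assume $\varepsilon_i \sim (0, \sigma^2)$).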


Method 1: Resample data pairs

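Resample from the pairs $(x_i, y_i)$, preserving the pairs:

$$\begin{array}{ccc}
(x_1, y_1) & & (x_1^*, y_1^*) \\
(x_2, y_2) & & (x_2^*, y_2^*) \\
\vdots & \overset{\text{bootstrap}}{\Longrightarrow} & \vdots \\
(x_n, y_n) & & (x_n^*, y_n^*) \\
\Downarrow & & \Downarrow \\
\hat\beta_1 & & \hat\beta_1^*
\end{array}$$

Repeat $N$ times to get $N$ copies of $\hat\beta_1^*$.

Based on these $N$ copies of $\hat\beta_1^*$, we can make inference about $\beta_1$ (compute confidence intervals, perform tests, etc.).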


Method 2: Resample residuals

Based on the sample data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we can obtain the LS estimates $\hat\beta_0$ and $\hat\beta_1$. Also, compute the residuals $\{e_1, \ldots, e_n\}$.

Resample from the residual set $\{e_1, \ldots, e_n\}$ to obtain bootstrap residuals $\{e_1^*, \ldots, e_n^*\}$.

Define $y_i^* = \hat\beta_0 + \hat\beta_1 x_i + e_i^*$, for $i = 1, \ldots, n$, so that we have a bootstrap data set:

$$\{(x_1, y_1^*), (x_2, y_2^*), \ldots, (x_n, y_n^*)\}.$$

Based on this bootstrap data set, we can get a bootstrap estimate $\hat\beta_1^*$.


Method 2: Resample residuals (continued)

Repeat the resampling steps (resample the residuals, form the bootstrap data set, and re-estimate) $N$ times to get $N$ copies of $\hat\beta_1^*$.

Based on these $N$ copies of $\hat\beta_1^*$, we can make inference about $\beta_1$ (compute confidence intervals, perform tests, etc.).


### Example: Bootstrap method for regression
## Annette Dobson (1990) "An Introduction to
## Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
mydata <- data.frame(weight, group)

## Linear regression:
lm.D9 <- lm(weight ~ group, data = mydata)


> summary(lm.D9)
## Parameter estimate
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.2202  22.850 9.55e-15 ***
groupTrt     -0.3710     0.3114  -1.191    0.249

## 95% Confidence Intervals:
> confint(lm.D9, "groupTrt")
           2.5 %    97.5 %
groupTrt -1.0253 0.2833003


## Bootstrap Methods:
## Function for method 1 (resample data pairs):
boot.meth1 <- function(data = mydata, indices){
  data <- data[indices,]   # select obs. in bootstrap sample
  mod <- lm(formula = weight ~ group, data = data)
  coefficients(mod)        # return coefficient vector
}

## Function for method 2 (resample residuals):
boot.meth2 <- function(data = mydata, indices, fit = lm.D9){
  # fitted values from 'fit' plus resampled residuals give the bootstrap response
  weight.boot <- fitted(fit) + residuals(fit)[indices]
  data.star <- data; data.star[,1] <- weight.boot
  mod <- lm(weight ~ group, data = data.star)
  coefficients(mod)        # return coefficient vector
}


> #### Use my own code to run bootstrap
> ## Number of bootstrap samples: 5000
> b1.vec = b2.vec = rep(0, 5000)
> for (ii in 1:5000) {
+   b.indx = sample(1:nrow(mydata), replace = TRUE)
+   b1.vec[ii] = boot.meth1(mydata, b.indx)["groupTrt"]
+   b2.vec[ii] = boot.meth2(mydata, b.indx, lm.D9)["groupTrt"]}
> ## Confidence intervals
> b1.vec = sort(b1.vec); b2.vec = sort(b2.vec)
> c(low = b1.vec[125], up = b1.vec[4875]);
       low         up
-0.9650505  0.2376923
> c(low = b2.vec[125], up = b2.vec[4875]);
    low      up
-0.9632  0.1942
> ## histograms
> par(mfrow = c(1,2)); hist(b1.vec); hist(b2.vec)


[Figure: Histogram of b1.vec (left panel) and Histogram of b2.vec (right panel). x-axes: b1.vec and b2.vec; y-axes: Frequency.]


> ## Run bootstrap through R's "boot" function:
> library(boot)
> out.boot.meth1 <- boot(mydata, boot.meth1, 5000)
> out.boot.meth1
Bootstrap Statistics :
    original        bias    std. error
t1*    5.032  -0.002156102   0.1769029
t2*   -0.371   0.005725787   0.3045753
> boot.ci(out.boot.meth1, index=2, type=c("norm", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
Intervals :
Level      Normal              Percentile
95%   (-0.9737,  0.2202 )   (-0.9492,  0.2200 )
Calculations and Intervals on Original Scale


> out.boot.meth2 <- boot(mydata, boot.meth2, 5000)
> out.boot.meth2
Bootstrap Statistics :
    original        bias    std. error
t1*    5.032   0.00198706   0.2065991
t2*   -0.371  -0.00548282   0.2957972
> boot.ci(out.boot.meth2, index=2, type=c("norm", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
Intervals :
Level      Normal              Percentile
95%   (-0.9453,  0.2142 )   (-0.9500,  0.2023 )
Calculations and Intervals on Original Scale


> plot(out.boot.meth1, index = 2)

[Figure: Left panel: Histogram of t (x-axis: t*; y-axis: Density). Right panel: normal Q-Q plot of t* against Quantiles of Standard Normal.]


> plot(out.boot.meth2, index = 2)

[Figure: Left panel: Histogram of t (x-axis: t*; y-axis: Density). Right panel: normal Q-Q plot of t* against Quantiles of Standard Normal.]


Further remarks on bootstrap estimation

We introduced the bootstrap approach and illustrated it using some basic and regression examples. The methodology is very broad and can be used in many applications.

– Because it is a simulation-based method, one may not get exactly the same numerical answer when repeating the same code. (A common practical solution to this problem: fix the random seed at the beginning.)
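Fixing the seed can be illustrated with `set.seed()`. In this minimal sketch the seed value 123 and the variable names are arbitrary choices, not part of the lecture.

```r
# Without a fixed seed, two runs of the same resampling code generally differ.
# With the same seed set before each run, the resamples are identical.
set.seed(123)
a <- sample(1:10, replace = TRUE)

set.seed(123)
b <- sample(1:10, replace = TRUE)

identical(a, b)   # TRUE: the two bootstrap index vectors coincide
```

Placing a single `set.seed()` call at the top of a script is usually enough to make the whole bootstrap analysis reproducible.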


Further remarks on bootstrap estimation (continued)

Most bootstrap methods are developed to study independent observations. When used to study correlations or dependent data, the key is to preserve the correlation/dependence.

– For example, in our examples on correlation coefficients and regressions, we have tried to preserve the correlation/dependence.

– For dependent samples (for example, time series models, Brownian motion or other stochastic processes), a useful scheme is the moving-block bootstrap. [Self-study material: http://www2.econ.iastate.edu/classes/econ674/bunzel/documents/DepBootstrap.pdf]
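The idea of the moving-block bootstrap can be sketched in R as follows. This is one simple variant under stated assumptions (overlapping blocks of fixed length `b`, the resampled series trimmed to the original length); the function name `mbb.sample`, the block length 10, and the simulated AR(1) series are illustrative choices and are not taken from the lecture or the linked material.

```r
# Moving-block bootstrap: resample whole blocks of consecutive observations,
# so that the short-range dependence inside each block is preserved.
mbb.sample <- function(x, b) {
  n        <- length(x)
  n.blocks <- ceiling(n / b)                         # blocks needed to cover n points
  starts   <- sample(1:(n - b + 1), n.blocks, replace = TRUE)
  idx      <- unlist(lapply(starts, function(s) s:(s + b - 1)))
  x[idx[1:n]]                                        # trim to the original length
}

set.seed(7)
x <- as.numeric(arima.sim(list(ar = 0.6), n = 200))  # simulated AR(1) series
x.star <- mbb.sample(x, b = 10)                      # one bootstrap series

# Bootstrap standard error of the sample mean under dependence
mean.star <- replicate(1000, mean(mbb.sample(x, b = 10)))
sd(mean.star)
```

Resampling individual observations instead of blocks would destroy the serial correlation, which is why the block length matters for dependent data.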

Good night!
