Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed...
Transcript of Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed...
![Page 1: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/1.jpg)
Introduction Statistical Estimation Central Limit Theorem
Review of Statistics
Instructor: Yuta Toyama
Last updated: 2020-05-10
1 / 31
![Page 2: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/2.jpg)
Introduction Statistical Estimation Central Limit Theorem
Section 1
Introduction
2 / 31
![Page 3: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/3.jpg)
Introduction Statistical Estimation Central Limit Theorem
Acknowledgement
Acknowledgement: This chapter is largely based on chapter 3 of“Introduction to Econometrics with R”.https://www.econometrics-with-r.org/index.html
3 / 31
![Page 4: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/4.jpg)
Introduction Statistical Estimation Central Limit Theorem
Introduction
The goal of this chapter is
1. Review of EstimationI Properties of Estimators: Unbiasedness, ConsistencyI Law of large numbers
2. Review of Central Limit TheoremI Important tool for hypothesis testing (to be covered later)
4 / 31
![Page 5: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/5.jpg)
Introduction Statistical Estimation Central Limit Theorem
Section 2
Statistical Estimation
5 / 31
![Page 6: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/6.jpg)
Introduction Statistical Estimation Central Limit Theorem
Estimation
I Estimator: A mapping from the sample data drawn from an unknownpopulation to a certain feature in the populationI Example: Consider hourly earnings of college graduates Y .
I You want to estimate the mean of Y , defined as E [Y ] = µyI Draw a random sample of n i.i.d. (identically and independently
distributed) observations Y1,Y2, . . . ,YN
I How to estimate E [Y ] from the data?
6 / 31
![Page 7: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/7.jpg)
Introduction Statistical Estimation Central Limit Theorem
I Idea 1: Sample mean
Y = 1n
n∑i=1
Yi ,
I Idea 2: Pick the first observation of the sample.I Question: How can we say which is better?
7 / 31
![Page 8: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/8.jpg)
Introduction Statistical Estimation Central Limit Theorem
Properties of the estimator
Consider the estimator µN for the unknown parameter µ.
1. Unbiasdeness: The expectation of the estimator is the same as the trueparameter in the population.
E [µN ] = µ
2. Consistency: The estimator converges to the true parameter inprobability.
∀ε > 0, limN→∞
Prob(|µN − µ| < ε) = 1
I Intuition: As the sample size gets larger, the estimator and the trueparameter is close with probability one.
I Note: a bit different from the usual convergence of the sequence.
8 / 31
![Page 9: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/9.jpg)
Introduction Statistical Estimation Central Limit Theorem
Sample mean Y is unbiased and consistent
I Showing these two properties using mathmaetics is straightforward:I Unbiasedness: Take expectation.I Consistency: Law of large numbers.
I Let’s examine these two properties using R programming!
9 / 31
![Page 10: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/10.jpg)
Introduction Statistical Estimation Central Limit Theorem
Step 0: Preparing packages
# Use the following packageslibrary("readr")library("ggplot2")library("reshape")
# If not yet, please install by install.packages("").
10 / 31
![Page 11: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/11.jpg)
Introduction Statistical Estimation Central Limit Theorem
Step 1: Prepare a population
I Use income and age data from PUMS 5% sample of U.S. Census 2000.I PUMS: Public Use Microdata SampleI Download the example data. Put this file in the same folder as your R
script file.I https://yutatoyama.github.io/AppliedEconometrics2020/03_Stat/data_
pums_2000.csv
11 / 31
![Page 12: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/12.jpg)
Introduction Statistical Estimation Central Limit Theorem
# Use "readr" packagelibrary(readr)pums2000 <- read_csv("data_pums_2000.csv")
## Parsed with column specification:## cols(## AGE = col_double(),## INCTOT = col_double()## )
I We treat this dataset as population.
pop <- as.vector(pums2000$INCTOT)
12 / 31
![Page 13: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/13.jpg)
Introduction Statistical Estimation Central Limit Theorem
I Population mean and standard deviation
pop_mean = mean(pop)pop_sd = sd(pop)
# Average income in populationpop_mean
## [1] 30165.47
# Standard deviation of income in populationpop_sd
## [1] 38306.17
13 / 31
![Page 14: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/14.jpg)
Introduction Statistical Estimation Central Limit Theorem
I income distribution in population (Unit in USD)
fig <- ggplot2::qplot(pop, geom = "density",xlab = "Income",ylab = "Density")
14 / 31
![Page 15: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/15.jpg)
Introduction Statistical Estimation Central Limit Theorem
plot(fig)
0.00000
0.00001
0.00002
0 200000 400000 600000Income
Den
sity
15 / 31
![Page 16: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/16.jpg)
Introduction Statistical Estimation Central Limit Theorem
I The distribution has a long tail.I Let’s plot the distribution in log scale
# `log` option specifies which axis is represented in log scale.fig2 <- qplot(pop, geom = "density",
xlab = "Income",ylab = "Density",log = "x")
16 / 31
![Page 17: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/17.jpg)
Introduction Statistical Estimation Central Limit Theorem
plot(fig2)
0.00
0.25
0.50
0.75
1.00
100 10000 1000000Income
Den
sity
17 / 31
![Page 18: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/18.jpg)
Introduction Statistical Estimation Central Limit Theorem
I Let’s investigate how close the sample mean constucted from therandom sample is to the true population mean.
I Step 1: Draw random samples from this population and calculate Y foreach sample.I Set the sample size N.
I Step 2: Repeat 2000 times. You now have 2000 sample means.
# Set the seed for the random number.# This is needed to maintaine the reproducibility of the results.set.seed(123)
# draw random sample of 100 observations from the variable poptest <- sample(x = pop, size = 100)
18 / 31
![Page 19: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/19.jpg)
Introduction Statistical Estimation Central Limit Theorem
# Use loop to repeat 2000 times.Nsamples = 2000result1 <- numeric(Nsamples)
for (i in 1:Nsamples ){
test <- sample(x = pop, size = 100)result1[i] <- mean(test)
}
19 / 31
![Page 20: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/20.jpg)
Introduction Statistical Estimation Central Limit Theorem
# Anotther way to do this.result1 <- replicate(expr = mean(sample(x = pop, size = 10)), n = Nsamples)result2 <- replicate(expr = mean(sample(x = pop, size = 100)),
n = Nsamples)result3 <- replicate(expr = mean(sample(x = pop, size = 500)),
n = Nsamples)
# Create dataframeresult_data <- data.frame( Ybar10 = result1,
Ybar100 = result2,Ybar500 = result3)
20 / 31
![Page 21: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/21.jpg)
Introduction Statistical Estimation Central Limit Theorem
I Step 3: See the distribution of those 2000 sample means.
# Use "melt" to change the format of result_datadata_for_plot <- melt(data = result_data, variable.name = "Variable" )
## Using as id variables
# Use "ggplot2" to create the figure.# The variable `fig` contains the information about the figurefig <-
ggplot(data = data_for_plot) +xlab("Sample mean") +geom_line(aes(x = value, colour = variable ), stat = "density" ) +geom_vline(xintercept=pop_mean ,colour="black")
21 / 31
![Page 22: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/22.jpg)
Introduction Statistical Estimation Central Limit Theorem
plot(fig)
0.00000
0.00005
0.00010
0.00015
0.00020
25000 50000 75000Sample mean
dens
ity
variable
Ybar10
Ybar100
Ybar500
22 / 31
![Page 23: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/23.jpg)
Introduction Statistical Estimation Central Limit Theorem
I Observation 1: Regardless of the sample size, the average of the samplemeans is close to the population mean. Unbiasdeness
I Observation 2: As the sample size gets larger, the distribution isconcentrated around the population mean. Consistency (law of largenumbers)
23 / 31
![Page 24: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/24.jpg)
Introduction Statistical Estimation Central Limit Theorem
Section 3
Central Limit Theorem
24 / 31
![Page 25: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/25.jpg)
Introduction Statistical Estimation Central Limit Theorem
Central limit theorem
I Cental limit theorem: Consider the i.i.d. sample of Y1, · · · ,YN drawnfrom the random variable Y with mean µ and variance σ2. Thefollowing Z converges in distribution to the normal distribution.
Z = 1√N
N∑i=1
Yi − µσ
d→ N(0, 1)
In other words,lim
N→∞P (Z ≤ z) = Φ(z)
25 / 31
![Page 26: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/26.jpg)
Introduction Statistical Estimation Central Limit Theorem
What does CLT mean?
I The central limit theorem implies that if N is large enough, we canapproximate the distribution of Y by the standard normal distributionwith mean µ and variance σ2/N regardless of the underlyingdistribution of Y .
I This property is called asymptotic normality.
I Let’s examine this property through simulation!!
26 / 31
![Page 27: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/27.jpg)
Introduction Statistical Estimation Central Limit Theorem
Numerical Simulation
I Use the same example as before. Remember that the underlying incomedistribution is clearly NOT normal.I Population mean µ = 30165.4673315I standard deviation σ = 38306.1712336.
27 / 31
![Page 28: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/28.jpg)
Introduction Statistical Estimation Central Limit Theorem
# define function for simulationf_simu_CLT = function(Nsamples, samplesize, pop, pop_mean, pop_sd ){
output = numeric(Nsamples)for (i in 1:Nsamples ){
test <- sample(x = pop, size = samplesize)output[i] <- ( mean(test) - pop_mean ) / (pop_sd / sqrt(samplesize))
}
return(output)
}
28 / 31
![Page 29: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/29.jpg)
Introduction Statistical Estimation Central Limit Theorem
# Set the seed for the random numberset.seed(124)
# Run simulationNsamples = 2000result_CLT1 <- f_simu_CLT(Nsamples, 10, pop, pop_mean, pop_sd )result_CLT2 <- f_simu_CLT(Nsamples, 100, pop, pop_mean, pop_sd )result_CLT3 <- f_simu_CLT(Nsamples, 1000, pop, pop_mean, pop_sd )
# Random draw from standard normal distribution as comparisonresult_stdnorm = rnorm(Nsamples)
# Create dataframeresult_CLT_data <- data.frame( Ybar_standardized_10 = result_CLT1,
Ybar_standardized_100 = result_CLT2,Ybar_standardized_1000 = result_CLT3,Standard_Normal = result_stdnorm)
29 / 31
![Page 30: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/30.jpg)
Introduction Statistical Estimation Central Limit Theorem
I Now take a look at the distribution.
# Use "melt" to change the format of result_datadata_for_plot <- melt(data = result_CLT_data, variable.name = "Variable" )
## Using as id variables
# Use "ggplot2" to create the figure.fig <-
ggplot(data = data_for_plot) +xlab("Sample mean") +geom_line(aes(x = value, colour = variable ), stat = "density" ) +geom_vline(xintercept=0 ,colour="black")
30 / 31
![Page 31: Review of Statistics · Introduction Statistical Estimation Central Limit Theorem # Set the seed for the random number set.seed(124) # Run simulation Nsamples =2000 result_CLT1](https://reader033.fdocuments.us/reader033/viewer/2022042417/5f33031289ee497efa361357/html5/thumbnails/31.jpg)
Introduction Statistical Estimation Central Limit Theorem
plot(fig)
0.0
0.2
0.4
0 5Sample mean
dens
ity
variable
Ybar_standardized_10
Ybar_standardized_100
Ybar_standardized_1000
Standard_Normal
I As N grows, the distribution is getting closer to the standard normaldistribution. 31 / 31