Econometrics I Review
EC3304 Econometrics II
Textbook: Stock and Watson’s Introduction to Econometrics
Topics:
Ch. 2 and 3: Review of probability and statistics, matrix algebra, and simple regression model
Ch. 10: Panel data
Ch. 11: Binary dependent variables
Ch. 12: Instrumental variables
Ch. 13: Experiments and quasi-experiments
Ch. 14: Intro to time series and forecasting
Ch. 15: Estimation of dynamic causal effects
Ch. 16: Additional topics in time series
Assessment:
1. Tutorial participation 20%
2. Midterm exam 30%
• week 7, time and location to be announced
3. Final exam 50%
• 28 April, 2014 (Monday), 1PM, location to be announced
Office hours: TBA
Tutorials start week 4
Chapter 2 Review of Probability
Section 2.2 Expected values, mean, and variance
• The expected value of a random variable Y, denoted E(Y) or µY, is a weighted average of all possible values of Y
– For a discrete random variable with k possible outcomes and probability function Pr(Y = yj):
E(Y) = \sum_{j=1}^{k} y_j \Pr(Y = y_j)
– For a continuous random variable with probability density function f(y):
E(Y) = \int_{-\infty}^{\infty} y \, f(y) \, dy
• Intuitively, the expectation can be thought of as the long-run average of a random variable over many repeated trials or occurrences
• The variance and standard deviation measure the dispersion or spread of a probability distribution
• For a discrete random variable:
\sigma_Y^2 = \mathrm{var}(Y) = E\left[(Y - \mu_Y)^2\right] = \sum_{j=1}^{k} (y_j - \mu_Y)^2 \Pr(Y = y_j)
• For a continuous random variable:
\sigma_Y^2 = \mathrm{var}(Y) = E\left[(Y - \mu_Y)^2\right] = \int_{-\infty}^{\infty} (y - \mu_Y)^2 f(y) \, dy
• The standard deviation is the square root of the variance, denoted σY
• It is easier to interpret the SD since it has the same units as Y (a numerical sketch follows below)
– The units of the variance are the squared units of Y
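As a quick numerical check of the two discrete formulas above, here is a minimal Python sketch; the outcomes and probabilities are made up for illustration.

```python
import numpy as np

# Hypothetical discrete distribution: outcomes y_j and probabilities Pr(Y = y_j)
y = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.1, 0.2, 0.4, 0.3])   # must sum to 1

mu_Y = np.sum(y * p)                  # E(Y) = sum_j y_j * Pr(Y = y_j)
var_Y = np.sum((y - mu_Y) ** 2 * p)   # var(Y) = sum_j (y_j - mu_Y)^2 * Pr(Y = y_j)
sd_Y = np.sqrt(var_Y)                 # standard deviation, same units as Y

print(mu_Y, var_Y, sd_Y)              # 1.9, 0.89, ~0.943
```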
• Linear functions of a random variable have convenient properties
• Suppose
Y = a + bX,
where a and b are constants
• The expectation of Y is
\mu_Y = a + b\mu_X
• The variance of Y is
\sigma_Y^2 = b^2 \sigma_X^2
• The standard deviation of Y is
\sigma_Y = |b| \, \sigma_X
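These rules for Y = a + bX are easy to verify by simulation. A minimal sketch, with hypothetical values a = 3, b = −2, and X ∼ N(5, 4):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=1_000_000)  # hypothetical X: mu_X = 5, sigma_X = 2
a, b = 3.0, -2.0
Y = a + b * X

print(Y.mean(), a + b * 5.0)   # both close to a + b*mu_X = -7
print(Y.std(), abs(b) * 2.0)   # both close to |b|*sigma_X = 4
```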
• A random variable is standardized by the formula
Z \equiv \frac{X - \mu_X}{\sigma_X},
which can be written as
Z = aX + b,
where a = 1/\sigma_X and b = -\mu_X/\sigma_X
• The expectation of Z is
\mu_Z = a\mu_X + b = \frac{\mu_X}{\sigma_X} - \frac{\mu_X}{\sigma_X} = 0
• The variance of Z is
\sigma_Z^2 = a^2 \sigma_X^2 = \frac{\sigma_X^2}{\sigma_X^2} = 1
• Thus, the standardized random variable Z has a mean of zero and a variance of 1
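A short simulation sketch of standardization; the distribution of X here is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=3.0, size=1_000_000)  # hypothetical X

Z = (X - X.mean()) / X.std()   # standardize: Z = (X - mu_X) / sigma_X
print(Z.mean(), Z.var())       # ~0 and ~1
```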
• Covariance measures the extent to which two random variables move together:
\sigma_{XY} \equiv \mathrm{cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right] = \sum_{i=1}^{k} \sum_{j=1}^{l} (x_i - \mu_X)(y_j - \mu_Y) \Pr(X = x_i, Y = y_j)
• The covariance depends on the units of measurement
• Correlation does not
• Correlation is the covariance divided by the standard deviations of X and Y:
\rho_{XY} \equiv \mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}
• The correlation is constrained to the values:
-1 \le \mathrm{corr}(X, Y) \le 1
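The covariance and correlation formulas have direct sample analogues. A sketch with made-up data, where Y is constructed to co-move with X:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=100_000)
Y = 0.5 * X + rng.normal(size=100_000)   # hypothetical: Y partly driven by X

cov_XY = np.cov(X, Y, ddof=0)[0, 1]       # sample analogue of sigma_XY
corr_XY = cov_XY / (X.std() * Y.std())    # rho_XY = sigma_XY / (sigma_X * sigma_Y)
print(cov_XY, corr_XY, np.corrcoef(X, Y)[0, 1])  # last two agree, and lie in [-1, 1]
```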
Section 2.4 The normal, chi-squared, Student t, and F distributions
• We will concern ourselves mainly with the normal distribution, which we will see over and over
• A normally distributed random variable is continuous and can take on any value
• Its PDF has the familiar bell-shaped graph
• We say X is normally distributed with mean µ and variance σ², written X ∼ N(µ, σ²)
• Mathematically, the PDF of a normal random variable X with mean µ and variance σ² is
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]
• If X ∼ N(µ, σ²), then aX + b ∼ N(aµ + b, a²σ²)
• A good relationship to memorize: a normal random variable falls within ±1.96 standard deviations of its mean with probability 0.95
• A random variable Z such that Z ∼ N(0, 1) has the standard normal distribution
• The standard normal PDF is denoted by φ(z) and is given by
\varphi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^2\right)
• The standard normal CDF is denoted by Φ(z)
• In other words:
\Pr(Z \le c) = \Phi(c)
• Ex. Suppose X ∼ N(3, 4) and we want to know Pr(X ≤ 1)
• We compute the probability by standardizing and then looking it up in a table (or computing it in software; see the sketch below):
\Pr(X \le 1) = \Pr(X - 3 \le 1 - 3) = \Pr\left(\frac{X - 3}{2} \le \frac{-2}{2}\right) = \Pr(Z \le -1) = \Phi(-1) = 0.159
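The same probability can be computed in software rather than from a table; a minimal sketch using scipy (assuming it is installed):

```python
from scipy.stats import norm

# Pr(X <= 1) for X ~ N(3, 4): standardize with mu = 3, sigma = 2
print(norm.cdf((1 - 3) / 2))        # Phi(-1) ~ 0.1587
print(norm.cdf(1, loc=3, scale=2))  # same answer without standardizing by hand
```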
Section 2.5 Random sampling and the distribution of the sample average
• A population is any well-defined group of subjects, such as individuals, firms, cities, etc.
• We would like to know something about the population
– Ex. the distribution of wages of the working population (or mean and variance, etc.)
• We cannot survey the whole population
– instead we use random sampling and make inferences regarding the distribution
• Random sampling: n objects are selected at random from a population, with each member of the population having an equal chance of being selected
– If we obtain the wages of 500 randomly chosen people from the working population, then
we have a random sample of wages from the population of all working people
– We use the sample to infer the distribution
• The observations are random, reflecting the fact that many different outcomes are possible
– If we sample another 500 people, then the wages in this sample will differ
– So our estimation of the distribution (or some statistic) is itself random
• Formally, let f (y) be some unknown pdf that we want to learn about
• The value taken on by random variable Y is the outcome of an experiment and the associated
probabilities are defined by the function f (y)
• Suppose that we repeat the experiment n times independently
• We say that the n observations, denoted Y1, Y2, . . . , Yn, are a random sample of size n from the population f(y)
• This random sample consists of observations of n independently and identically distributed
random variables Y1, Y2, . . . , Yn, each with the pdf f (y)
• The sample mean Ȳ is
\bar{Y} = \frac{1}{n}(Y_1 + Y_2 + \cdots + Y_n) = \frac{1}{n} \sum_{i=1}^{n} Y_i
• The sample mean is random and has a sampling distribution
• The mean of Ȳ is
E(\bar{Y}) = E\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right) = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) = \frac{1}{n} \cdot n \cdot \mu_Y = \mu_Y
• The variance of Ȳ is
\sigma_{\bar{Y}}^2 \equiv \mathrm{var}(\bar{Y}) = \mathrm{var}\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{var}(Y_i) = \frac{1}{n^2} \cdot n \cdot \sigma_Y^2 = \frac{\sigma_Y^2}{n}
(since Y1, Y2, . . . , Yn are i.i.d., cov(Yi, Yj) = 0 for i ≠ j)
• These results hold whatever the underlying distribution of Y is (the simulation sketch below illustrates them)
• If Y ∼ N(µY, σ²Y), then Ȳ ∼ N(µY, σ²Y/n)
(the sum of normally distributed random variables is normally distributed)
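A simulation sketch of these sampling-distribution results, using hypothetical population parameters µY = 2, σY = 1.5 and sample size n = 100:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_Y, sigma_Y, n = 2.0, 1.5, 100   # hypothetical population parameters

# Draw many random samples and record each sample mean
ybars = rng.normal(mu_Y, sigma_Y, size=(10_000, n)).mean(axis=1)

print(ybars.mean(), mu_Y)              # E(Ybar) = mu_Y
print(ybars.var(), sigma_Y**2 / n)     # var(Ybar) = sigma_Y^2 / n
```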
Section 2.6 Large-sample approximations to sampling distributions
• There are two approaches to characterizing sampling distributions: exact or approximate
• The exact approach means deriving the exact sampling distribution for any value of n
– finite-sample distribution
• The approximate approach means using an approximation that can work well for large n
– asymptotic distribution
• The key concepts for asymptotic results are the Law of Large Numbers and the Central
Limit Theorem
Convergence in Probability
• Let S1, S2, . . . , Sn be a sequence of random variables. The sequence Sn is said to converge
in probability to α if the probability that Sn is “close” to α tends to 1 as n→∞
• The constant α is called the probability limit of Sn
• Ex. Sn is the sample average of a sample of n observations: Sn = Ȳ. We will argue below that the probability limit of Ȳ is µY
• This idea is written mathematically as
\bar{Y} \xrightarrow{p} \mu_Y \quad \text{or} \quad \mathrm{plim}(\bar{Y}) = \mu_Y
• Usually we will use the Law of Large Numbers to find the probability limit of a random
variable
• The Law of Large Numbers states that, under certain conditions, Ȳ will be close to µY with very high probability when n is large (see the simulation sketch below):
If Y_i, i = 1, \ldots, n, are i.i.d. with E(Y_i) = \mu_Y and \mathrm{var}(Y_i) < \infty, then
\bar{Y} \xrightarrow{p} \mu_Y
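A minimal simulation sketch of the LLN, drawing i.i.d. exponential random variables with a made-up mean µ = 2 and watching the sample mean settle down:

```python
import numpy as np

rng = np.random.default_rng(4)
draws = rng.exponential(scale=2.0, size=1_000_000)  # i.i.d. draws with mu = 2

for n in (10, 1_000, 1_000_000):
    print(n, draws[:n].mean())   # sample mean approaches mu = 2 as n grows
```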
Convergence in Distribution
• Let F1, F2, F3, . . . , Fn be a sequence of cumulative distribution functions corresponding to
a sequence of random variables S1, S2, . . . , Sn. The sequence Sn is said to converge in
distribution to S if the sequence of distribution functions Fn converges to F , the CDF
of S
• That is, as n→∞, the distribution of Sn is approximately equal to the distribution of S
• Ex. Sn may be the standardized sample average
S_n = \frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}},
which we will show below has a CDF similar to that of S ∼ N(0, 1) when n is large
• We write
S_n \xrightarrow{d} S
• The distribution F is called the asymptotic distribution of Sn
• Usually we will use the Central Limit Theorem to argue convergence in distribution
• The Central Limit Theorem states that under certain conditions the distribution of
a standardized sample mean is well approximated by a normal distribution when n is large
• Suppose that Y1, . . . , Yn are i.i.d. with E(Yi) = µY and var(Yi) = σ²Y, where 0 < σ²Y < ∞. Then, for large n,
\frac{\bar{Y} - \mu_Y}{\sigma_Y / \sqrt{n}} \overset{a}{\sim} N(0, 1)
• This is a very convenient property that is used over and over in econometrics (a simulation sketch follows below)
• A slightly different presentation of the CLT that is often useful:
\sqrt{n}\,(\bar{Y} - \mu_Y) \overset{a}{\sim} N(0, \sigma_Y^2)
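A sketch of the CLT in action: even though the underlying draws are exponential (heavily skewed), the standardized sample means behave like N(0, 1). The population and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n = 1.0, 1.0, 200   # exponential(1): skewed, with mu = sigma = 1

# 10,000 standardized sample means from a non-normal population
samples = rng.exponential(scale=1.0, size=(10_000, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

print(z.mean(), z.std())            # ~0 and ~1
print(np.mean(np.abs(z) <= 1.96))   # ~0.95, matching the N(0, 1) benchmark
```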
Continuity Theorem
Suppose that Y_n \xrightarrow{p} \alpha and g(·) is a continuous function. Then
g(Y_n) \xrightarrow{p} g(\alpha)
In words:
To find the probability limit of a function of a random variable, replace the random variable with its probability limit
Ex. If s_Y^2 \xrightarrow{p} \sigma_Y^2, then s_Y \xrightarrow{p} \sigma_Y.
Slutsky’s Theorem
Suppose that X_n \xrightarrow{d} X, Y_n \xrightarrow{p} \alpha, and g(·) is a continuous function. Then
g(X_n, Y_n) \xrightarrow{d} g(X, \alpha)
In words:
The asymptotic distribution of g(X_n, Y_n) is approximately equal to the distribution of g(X, α)
Ex. Suppose Z_n = \frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}} \xrightarrow{d} Z, where Z ∼ N(0, 1). Then the asymptotic distribution of Z_n^2 is equal to the distribution of Z^2, which is \chi_1^2.
Chapter 3 Review of statistics
• Statistics is learning something about a population using a sample from that population
• Estimation, hypothesis testing, and confidence intervals
Section 3.1 Estimation of the population mean
• An estimator is a function of the sample data
• An estimate is the numerical value of the estimator actually computed using data
• Suppose we want to learn the mean of Y, µY
• How do we do it? We come up with an estimator of µY, often denoted µ̂Y
• Naturally, we could use the sample average Ȳ or even the first observation, Y1
• There are many possible estimators of µY, so we need some criteria to decide which ones are "good"
Some desirable properties of estimators
• Unbiasedness: E(µ̂Y) = µY
• Consistency: µ̂Y →p µY
• Efficiency:
Let µ̂Y and µ̃Y be unbiased estimators. µ̂Y is more efficient than µ̃Y if var(µ̂Y) < var(µ̃Y).
• Ȳ is unbiased, E(Ȳ) = µY, and consistent, Ȳ →p µY
• Y1 is unbiased, E(Y1) = µY, but not consistent (why?)
• Efficiency? var(Ȳ) = σ²Y/n < var(Y1) = σ²Y
• Ȳ is the Best Linear Unbiased Estimator (BLUE)
– That is, Ȳ is the most efficient estimator of µY among all unbiased estimators that are weighted averages of Y1, . . . , Yn
Section 3.2 Hypothesis tests concerning the population mean
• The null hypothesis is a hypothesis about the population that we want to test
– Typically not what the researcher thinks is true
• A second hypothesis is called the alternative hypothesis
– Typically what the researcher thinks is true
H0: E(Y) = µY,0 and H1: E(Y) ≠ µY,0
• Hypothesis testing entails using a test statistic to decide whether to accept the null hypothesis or reject it in favor of the alternative hypothesis
• In our example, we use the sample mean to test hypotheses about the population mean
• Since the sample is random, any test statistic is random, and we have to reject or accept
the null hypothesis using a probabilistic calculation
• Given a null hypothesis, the p-value is the probability of drawing a test statistic (Ȳ) that is at least as extreme as the observed value of the test statistic (ȳ)
• A small p-value indicates that the null hypothesis is unlikely to be true
• To compute the p-value we make use of the CLT and treat Ȳ ∼ᵃ N(µY, σ²Ȳ)
• Under the null hypothesis, Ȳ ∼ᵃ N(µY,0, σ²Ȳ) (assume for now that σȲ is known)
• For ȳ > µY,0, the probability of getting a more extreme positive value than ȳ is:
\Pr(\bar{Y} > \bar{y} \mid \mu_{Y,0}) = \Pr\left(\frac{\bar{Y} - \mu_{Y,0}}{\sigma_{\bar{Y}}} > \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Pr\left(Z > \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Phi\left(-\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
• The p-value is thus
\text{p-value} = 2 \cdot \Phi\left(-\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
• For ȳ < µY,0, the probability of getting a more extreme negative value than ȳ is:
\Pr(\bar{Y} < \bar{y} \mid \mu_{Y,0}) = \Pr\left(\frac{\bar{Y} - \mu_{Y,0}}{\sigma_{\bar{Y}}} < \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Pr\left(Z < \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Phi\left(\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
• The p-value is thus
\text{p-value} = 2 \cdot \Phi\left(\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
• We can encompass both these cases by using the absolute value function:
\text{p-value} = 2 \cdot \Phi\left(-\left|\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right|\right)
• Typically, the standard deviation σȲ is unknown and must be estimated
• The sample variance is
s_Y^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2
and the standard error of Ȳ is
\hat{\sigma}_{\bar{Y}} = \frac{s_Y}{\sqrt{n}}
• The p-value formula is simply altered by replacing the standard deviation with the standard error:
\text{p-value} = 2 \cdot \Phi\left(-\left|\frac{\bar{y} - \mu_{Y,0}}{\hat{\sigma}_{\bar{Y}}}\right|\right)
• That is, when n is large, Ȳ ∼ᵃ N(µY, σ̂²Ȳ)
• If the p-value is less than some pre-decided level (often 0.05), then the null hypothesis is rejected. Otherwise, it is accepted (or, more precisely, not rejected). The sketch below works through the calculation
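Putting the pieces together, here is a minimal sketch of a two-sided test of H0: µY = 2.0 on hypothetical data, using the standard error and 2·Φ(−|t|):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
y = rng.normal(loc=2.2, scale=1.0, size=400)   # hypothetical sample
mu_0 = 2.0                                     # null hypothesis value

se = y.std(ddof=1) / np.sqrt(len(y))           # standard error: s_Y / sqrt(n)
t = (y.mean() - mu_0) / se                     # standardized test statistic
p_value = 2 * norm.cdf(-abs(t))                # 2 * Phi(-|t|)
print(t, p_value)                              # small p-value: reject H0 at 5%
```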
Section 3.3 Confidence intervals for the population mean
• An estimate is a single best “guess” of µY
• A confidence interval gives a range of values that contains µY with a certain probability
• Using the sampling distribution Ȳ ∼ᵃ N(µY, σ²Ȳ), we have:
\Pr\left(-1.96 \le \frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}} \le 1.96\right) = 0.95
\Rightarrow \Pr\left(\bar{Y} - 1.96\,\sigma_{\bar{Y}} \le \mu_Y \le \bar{Y} + 1.96\,\sigma_{\bar{Y}}\right) = 0.95
• The interval [Ȳ − 1.96σȲ, Ȳ + 1.96σȲ] contains µY with probability 0.95
• An estimate of this interval, [ȳ − 1.96σ̂Ȳ, ȳ + 1.96σ̂Ȳ], is called a 95% confidence interval for µY (computed numerically in the sketch below)
• Note that the first interval is random but the second is not
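And a matching sketch for the 95% confidence interval, again on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(loc=5.0, scale=2.0, size=250)   # hypothetical sample

se = y.std(ddof=1) / np.sqrt(len(y))           # estimated sigma_Ybar
ci = (y.mean() - 1.96 * se, y.mean() + 1.96 * se)  # 95% CI for mu_Y
print(ci)                                      # should cover mu_Y = 5 in ~95% of samples
```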
Appendix 18.1 Summary of matrix algebra
• A matrix is a collection of numbers (called elements) that are laid out in columns and
rows
• The dimension of a matrix is m×n, where m is the number of rows and n is the number
of columns
• An m × n matrix A is
A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}
• The element in the ith row and jth column of matrix A is denoted aij
• A vector of dimension n is a collection of n numbers (called elements) arranged either in a column or a row:
b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \quad \text{and} \quad c = [\, c_1 \ c_2 \ \cdots \ c_n \,]
• A column vector is an n × 1 matrix and a row vector is a 1 × n matrix
• In most cases, we'll consider only column vectors
• The ith element of vector b is denoted bi
• A matrix of dimension 1 × 1 is called a scalar
• Matrix addition: two matrices A and B, each of the same dimension, are added together by adding their elements
• Vector multiplication (the dot product or inner product) of two n-dimensional column vectors a and b is computed as
a'b = \sum_{i=1}^{n} a_i b_i
• Matrix multiplication: two matrices A and B can be multiplied together to form the product C ≡ AB if they are conformable
– conformable: the number of columns of A equals the number of rows of B
• The (i, j) element of C is the dot product of the ith row of A and the jth column of B
• The identity matrix In is an n × n matrix with 1's on the diagonal and 0's everywhere else:
I_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}
• The inverse of a square matrix A, denoted A⁻¹, is defined as the matrix for which A⁻¹A = In
• The transpose of matrix A, denoted A', switches the rows and columns. That is,
– element (i, j) of A becomes element (j, i) of A'
– the element in the ith row and jth column is moved to the jth row and ith column
• If A has dimension m × n, then A' has dimension n × m
• Some useful properties of matrix algebra and calculus:
1. (A + B)' = A' + B'
2. (A + B)C = AC + BC
3. (AB)' = B'A'
4. \frac{\partial}{\partial z}(z'Az) = (A + A')z = 2Az (the last step assumes A is symmetric)
5. \frac{\partial}{\partial z}(a'z) = a
Simple linear regression model: summation notation vs. matrix notation
• The simple linear regression model is
Yi = β0 + β1Xi + ui, i = 1, . . . , n
where Yi is the dependent variable, Xi is the independent variable, β0 is the intercept, β1
is the slope, and ui is the error term
• The OLS estimator minimizes the sum of squared residuals:
\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2
• The first-order conditions are:
\sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0
and
\sum_{i=1}^{n} X_i \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0
• Solving for β̂0 and β̂1:
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
and
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
(a numerical sketch follows below)
• Properties of the OLS estimator of the slope of the simple linear regression model
• We can also write the linear regression model in matrix notation
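A numerical sketch of the summation formulas, on simulated data with made-up true coefficients β0 = 1 and β1 = 2:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
X = rng.normal(size=n)
u = rng.normal(size=n)
Y = 1.0 + 2.0 * X + u   # hypothetical data: beta0 = 1, beta1 = 2

beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()
print(beta0_hat, beta1_hat)   # close to 1 and 2
```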
• We generalize the model to include k independent variables (plus an intercept):
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i, \quad i = 1, \ldots, n
• The (k + 1)-dimensional column vector Xi contains the independent variables of the ith observation and β is a (k + 1)-dimensional column vector of coefficients. Then, we can rewrite the model as
Y_i = X_i'\beta + u_i, \quad i = 1, \ldots, n
• The n × (k + 1)-dimensional matrix X contains the stacked Xi', i = 1, . . . , n:
X = \begin{bmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{bmatrix}
• The n-dimensional column vector Y contains the stacked observations of the dependent variable and the n-dimensional column vector u contains the stacked error terms:
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \quad \text{and} \quad u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}
• We can now write all n observations compactly as
Y = X\beta + u
• The sum of squared residuals is
\hat{u}'\hat{u} = (Y - X\hat{\beta})'(Y - X\hat{\beta}) = Y'Y - 2Y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta}
• The first-order conditions are
-2X'Y + 2X'X\hat{\beta} = 0,
which can be solved for β̂:
(X'X)\hat{\beta} = X'Y \;\Rightarrow\; \hat{\beta} = (X'X)^{-1}X'Y
• β̂ is the vector representation of the OLS estimators (see the numerical sketch below)
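Finally, a sketch of the matrix formula β̂ = (X'X)⁻¹X'Y on simulated data with made-up coefficients; solving the normal equations with a linear solver is numerically preferable to forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k regressors
beta = np.array([1.0, 2.0, -0.5])                           # hypothetical coefficients
Y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solves (X'X) b = X'Y for b
print(beta_hat)                                # close to [1, 2, -0.5]
```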