Econometrics I Review
EC3304 Econometrics II
Textbook: Stock and Watson’s Introduction to Econometrics
Topics:
Ch. 2 and 3: Review of probability and statistics, matrix algebra, and simple regression model
Ch. 10: Panel data
Ch. 11: Binary dependent variables
Ch. 12: Instrumental variables
Ch. 13: Experiments and quasi-experiments
Ch. 14: Intro to time series and forecasting
Ch. 15: Estimation of dynamic causal effects
Ch. 16: Additional topics in time series
Assessment:
1. Tutorial participation 20%
2. Midterm exam 30%
• week 7, time and location to be announced
3. Final exam 50%
• 28 April, 2014 (Monday), 1PM, location to be announced
Office hours: TBA
Tutorials start week 4
Chapter 2 Review of Probability
Section 2.2 Expected values, mean, and variance
• The expected value of a random variable Y, denoted E(Y) or µY, is a weighted average of all possible values of Y
– For a discrete random variable with k possible outcomes and probability function Pr(Y = yj):
E(Y) = \sum_{j=1}^{k} y_j \Pr(Y = y_j)
– For a continuous random variable with probability density function f(y):
E(Y) = \int_{-\infty}^{\infty} y \, f(y) \, dy
• Intuitively, the expectation can be thought of as the long-run average of a random variable over many repeated trials or occurrences
• The variance and standard deviation measure the dispersion or spread of a probability distribution
• For a discrete random variable:
\sigma_Y^2 = \mathrm{var}(Y) = E\left[(Y - \mu_Y)^2\right] = \sum_{j=1}^{k} (y_j - \mu_Y)^2 \Pr(Y = y_j)
• For a continuous random variable:
\sigma_Y^2 = \mathrm{var}(Y) = E\left[(Y - \mu_Y)^2\right] = \int_{-\infty}^{\infty} (y - \mu_Y)^2 f(y) \, dy
• The standard deviation is the square root of the variance, denoted σY
• It is easier to interpret the SD since it has the same units as Y (a numerical sketch follows below)
– The units of the variance are the squared units of Y
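As a quick numerical check of the two discrete formulas above, here is a minimal Python sketch; the outcomes and probabilities are made up for illustration.

```python
import numpy as np

# Hypothetical discrete distribution: outcomes y_j and probabilities Pr(Y = y_j)
y = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.1, 0.2, 0.4, 0.3])   # must sum to 1

mu_Y = np.sum(y * p)                  # E(Y) = sum_j y_j * Pr(Y = y_j)
var_Y = np.sum((y - mu_Y) ** 2 * p)   # var(Y) = sum_j (y_j - mu_Y)^2 * Pr(Y = y_j)
sd_Y = np.sqrt(var_Y)                 # standard deviation, same units as Y

print(mu_Y, var_Y, sd_Y)              # 1.9, 0.89, ~0.943
```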
• Linear functions of a random variable have convenient properties
• Suppose
Y = a + bX,
where a and b are constants
• The expectation of Y is
\mu_Y = a + b\mu_X
• The variance of Y is
\sigma_Y^2 = b^2 \sigma_X^2
• The standard deviation of Y is
\sigma_Y = |b| \, \sigma_X
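These rules for Y = a + bX are easy to verify by simulation. A minimal sketch, with hypothetical values a = 3, b = −2, and X ∼ N(5, 4):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=1_000_000)  # hypothetical X: mu_X = 5, sigma_X = 2
a, b = 3.0, -2.0
Y = a + b * X

print(Y.mean(), a + b * 5.0)   # both close to a + b*mu_X = -7
print(Y.std(), abs(b) * 2.0)   # both close to |b|*sigma_X = 4
```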
• A random variable is standardized by the formula
Z \equiv \frac{X - \mu_X}{\sigma_X},
which can be written as
Z = aX + b,
where a = 1/\sigma_X and b = -\mu_X/\sigma_X
• The expectation of Z is
\mu_Z = a\mu_X + b = \frac{\mu_X}{\sigma_X} - \frac{\mu_X}{\sigma_X} = 0
• The variance of Z is
\sigma_Z^2 = a^2 \sigma_X^2 = \frac{\sigma_X^2}{\sigma_X^2} = 1
• Thus, the standardized random variable Z has a mean of zero and a variance of 1
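A short simulation sketch of standardization; the distribution of X here is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=3.0, size=1_000_000)  # hypothetical X

Z = (X - X.mean()) / X.std()   # standardize: Z = (X - mu_X) / sigma_X
print(Z.mean(), Z.var())       # ~0 and ~1
```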
• Covariance measures the extent to which two random variables move together:
\sigma_{XY} \equiv \mathrm{cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right] = \sum_{i=1}^{k} \sum_{j=1}^{l} (x_i - \mu_X)(y_j - \mu_Y) \Pr(X = x_i, Y = y_j)
• The covariance depends on the units of measurement
• Correlation does not
• Correlation is the covariance divided by the standard deviations of X and Y:
\rho_{XY} \equiv \mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}
• The correlation is constrained to the values:
-1 \le \mathrm{corr}(X, Y) \le 1
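The covariance and correlation formulas have direct sample analogues. A sketch with made-up data, where Y is constructed to co-move with X:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=100_000)
Y = 0.5 * X + rng.normal(size=100_000)   # hypothetical: Y partly driven by X

cov_XY = np.cov(X, Y, ddof=0)[0, 1]       # sample analogue of sigma_XY
corr_XY = cov_XY / (X.std() * Y.std())    # rho_XY = sigma_XY / (sigma_X * sigma_Y)
print(cov_XY, corr_XY, np.corrcoef(X, Y)[0, 1])  # last two agree, and lie in [-1, 1]
```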
Section 2.4 The normal, chi-squared, Student t, and F distributions
• We will concern ourselves mainly with the normal distribution, which we will see over and over
• A normally distributed random variable is continuous and can take on any value
• Its PDF has the familiar bell-shaped graph
• We say X is normally distributed with mean µ and variance σ², written X ∼ N(µ, σ²)
• Mathematically, the PDF of a normal random variable X with mean µ and variance σ² is
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]
• If X ∼ N(µ, σ²), then aX + b ∼ N(aµ + b, a²σ²)
• A good relationship to memorize: a normal random variable falls within ±1.96 standard deviations of its mean with probability 0.95
• A random variable Z such that Z ∼ N(0, 1) has the standard normal distribution
• The standard normal PDF is denoted by φ(z) and is given by
\varphi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^2\right)
• The standard normal CDF is denoted by Φ(z)
• In other words:
\Pr(Z \le c) = \Phi(c)
• Ex. Suppose X ∼ N(3, 4) and we want to know Pr(X ≤ 1)
• We compute the probability by standardizing and then looking it up in a table (or computing it in software; see the sketch below):
\Pr(X \le 1) = \Pr(X - 3 \le 1 - 3) = \Pr\left(\frac{X - 3}{2} \le \frac{-2}{2}\right) = \Pr(Z \le -1) = \Phi(-1) = 0.159
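The same probability can be computed in software rather than from a table; a minimal sketch using scipy (assuming it is installed):

```python
from scipy.stats import norm

# Pr(X <= 1) for X ~ N(3, 4): standardize with mu = 3, sigma = 2
print(norm.cdf((1 - 3) / 2))        # Phi(-1) ~ 0.1587
print(norm.cdf(1, loc=3, scale=2))  # same answer without standardizing by hand
```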
Section 2.5 Random sampling and the distribution of the sample average
• A population is any well-defined group of subjects, such as individuals, firms, cities, etc.
• We would like to know something about the population
– Ex. the distribution of wages of the working population (or mean and variance, etc.)
• We cannot survey the whole population
– instead we use random sampling and make inferences regarding the distribution
• Random sampling: n objects are selected at random from a population, with each member of the population having an equal chance of being selected
– If we obtain the wages of 500 randomly chosen people from the working population, then
we have a random sample of wages from the population of all working people
– We use the sample to infer the distribution
• The observations are random, reflecting the fact that many different outcomes are possible
– If we sample another 500 people, then the wages in this sample will differ
– So our estimation of the distribution (or some statistic) is itself random
• Formally, let f (y) be some unknown pdf that we want to learn about
• The value taken on by random variable Y is the outcome of an experiment and the associated
probabilities are defined by the function f (y)
• Suppose that we repeat the experiment n times independently
• We say that the n observations, denoted Y1, Y2, . . . , Yn, are a random sample of size n from the population f(y)
• This random sample consists of observations of n independently and identically distributed
random variables Y1, Y2, . . . , Yn, each with the pdf f (y)
• The sample mean Ȳ is
\bar{Y} = \frac{1}{n}(Y_1 + Y_2 + \cdots + Y_n) = \frac{1}{n} \sum_{i=1}^{n} Y_i
• The sample mean is random and has a sampling distribution
• The mean of Ȳ is
E(\bar{Y}) = E\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right) = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) = \frac{1}{n} \cdot n \cdot \mu_Y = \mu_Y
• The variance of Ȳ is
\sigma_{\bar{Y}}^2 \equiv \mathrm{var}(\bar{Y}) = \mathrm{var}\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{var}(Y_i) = \frac{1}{n^2} \cdot n \cdot \sigma_Y^2 = \frac{\sigma_Y^2}{n}
(since Y1, Y2, . . . , Yn are i.i.d., cov(Yi, Yj) = 0 for i ≠ j)
• These results hold whatever the underlying distribution of Y is (the simulation sketch below illustrates them)
• If Y ∼ N(µY, σ²Y), then Ȳ ∼ N(µY, σ²Y/n)
(the sum of normally distributed random variables is normally distributed)
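A simulation sketch of these sampling-distribution results, using hypothetical population parameters µY = 2, σY = 1.5 and sample size n = 100:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_Y, sigma_Y, n = 2.0, 1.5, 100   # hypothetical population parameters

# Draw many random samples and record each sample mean
ybars = rng.normal(mu_Y, sigma_Y, size=(10_000, n)).mean(axis=1)

print(ybars.mean(), mu_Y)              # E(Ybar) = mu_Y
print(ybars.var(), sigma_Y**2 / n)     # var(Ybar) = sigma_Y^2 / n
```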
Section 2.6 Large-sample approximations to sampling distributions
• There are two approaches to characterizing sampling distributions: exact or approximate
• The exact approach means deriving the exact sampling distribution for any value of n
– finite-sample distribution
• The approximate approach means using an approximation that can work well for large n
– asymptotic distribution
• The key concepts for asymptotic results are the Law of Large Numbers and the Central
Limit Theorem
Convergence in Probability
• Let S1, S2, . . . , Sn be a sequence of random variables. The sequence Sn is said to converge
in probability to α if the probability that Sn is “close” to α tends to 1 as n→∞
• The constant α is called the probability limit of Sn
• Ex. Sn is the sample average of a sample of n observations: Sn = Ȳ. We will argue below that the probability limit of Ȳ is µY
• This idea is written mathematically as
\bar{Y} \xrightarrow{p} \mu_Y \quad \text{or} \quad \mathrm{plim}(\bar{Y}) = \mu_Y
• Usually we will use the Law of Large Numbers to find the probability limit of a random
variable
• The Law of Large Numbers states that, under certain conditions, Ȳ will be close to µY with very high probability when n is large (see the simulation sketch below):
If Y_i, i = 1, \ldots, n, are i.i.d. with E(Y_i) = \mu_Y and \mathrm{var}(Y_i) < \infty, then
\bar{Y} \xrightarrow{p} \mu_Y
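A minimal simulation sketch of the LLN, drawing i.i.d. exponential random variables with a made-up mean µ = 2 and watching the sample mean settle down:

```python
import numpy as np

rng = np.random.default_rng(4)
draws = rng.exponential(scale=2.0, size=1_000_000)  # i.i.d. draws with mu = 2

for n in (10, 1_000, 1_000_000):
    print(n, draws[:n].mean())   # sample mean approaches mu = 2 as n grows
```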
Convergence in Distribution
• Let F1, F2, F3, . . . , Fn be a sequence of cumulative distribution functions corresponding to
a sequence of random variables S1, S2, . . . , Sn. The sequence Sn is said to converge in
distribution to S if the sequence of distribution functions Fn converges to F , the CDF
of S
• That is, as n→∞, the distribution of Sn is approximately equal to the distribution of S
• Ex. Sn may be the standardized sample average
S_n = \frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}},
which we will show below has a CDF similar to that of S ∼ N(0, 1) when n is large
• We write
S_n \xrightarrow{d} S
• The distribution F is called the asymptotic distribution of Sn
• Usually we will use the Central Limit Theorem to argue convergence in distribution
• The Central Limit Theorem states that under certain conditions the distribution of
a standardized sample mean is well approximated by a normal distribution when n is large
• Suppose that Y1, . . . , Yn are i.i.d. with E(Yi) = µY and var(Yi) = σ²Y, where 0 < σ²Y < ∞. Then, for large n,
\frac{\bar{Y} - \mu_Y}{\sigma_Y / \sqrt{n}} \overset{a}{\sim} N(0, 1)
• This is a very convenient property that is used over and over in econometrics (a simulation sketch follows below)
• A slightly different presentation of the CLT that is often useful:
\sqrt{n}\,(\bar{Y} - \mu_Y) \overset{a}{\sim} N(0, \sigma_Y^2)
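A sketch of the CLT in action: even though the underlying draws are exponential (heavily skewed), the standardized sample means behave like N(0, 1). The population and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n = 1.0, 1.0, 200   # exponential(1): skewed, with mu = sigma = 1

# 10,000 standardized sample means from a non-normal population
samples = rng.exponential(scale=1.0, size=(10_000, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

print(z.mean(), z.std())            # ~0 and ~1
print(np.mean(np.abs(z) <= 1.96))   # ~0.95, matching the N(0, 1) benchmark
```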
Continuity Theorem
Suppose that Y_n \xrightarrow{p} \alpha and g(·) is a continuous function. Then
g(Y_n) \xrightarrow{p} g(\alpha)
In words:
To find the probability limit of a function of a random variable, replace the random variable with its probability limit
Ex. If s_Y^2 \xrightarrow{p} \sigma_Y^2, then s_Y \xrightarrow{p} \sigma_Y.
Slutsky’s Theorem
Suppose that X_n \xrightarrow{d} X, Y_n \xrightarrow{p} \alpha, and g(·) is a continuous function. Then
g(X_n, Y_n) \xrightarrow{d} g(X, \alpha)
In words:
The asymptotic distribution of g(X_n, Y_n) is approximately equal to the distribution of g(X, α)
Ex. Suppose Z_n = \frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}} \xrightarrow{d} Z, where Z ∼ N(0, 1). Then the asymptotic distribution of Z_n^2 is equal to the distribution of Z^2, which is \chi_1^2.
Chapter 3 Review of statistics
• Statistics is learning something about a population using a sample from that population
• Estimation, hypothesis testing, and confidence intervals
Section 3.1 Estimation of the population mean
• An estimator is a function of the sample data
• An estimate is the numerical value of the estimator actually computed using data
• Suppose we want to learn the mean of Y, µY
• How do we do it? We come up with an estimator of µY, often denoted µ̂Y
• Naturally, we could use the sample average Ȳ or even the first observation, Y1
• There are many possible estimators of µY, so we need some criteria to decide which ones are "good"
Some desirable properties of estimators
• Unbiasedness: E(µ̂Y) = µY
• Consistency: µ̂Y →p µY
• Efficiency:
Let µ̂Y and µ̃Y be unbiased estimators. µ̂Y is more efficient than µ̃Y if var(µ̂Y) < var(µ̃Y).
• Ȳ is unbiased, E(Ȳ) = µY, and consistent, Ȳ →p µY
• Y1 is unbiased, E(Y1) = µY, but not consistent (why?)
• Efficiency? var(Ȳ) = σ²Y/n < var(Y1) = σ²Y
• Ȳ is the Best Linear Unbiased Estimator (BLUE)
– That is, Ȳ is the most efficient estimator of µY among all unbiased estimators that are weighted averages of Y1, . . . , Yn
Section 3.2 Hypothesis tests concerning the population mean
• The null hypothesis is a hypothesis about the population that we want to test
– Typically not what the researcher thinks is true
• A second hypothesis is called the alternative hypothesis
– Typically what the researcher thinks is true
H0: E(Y) = µY,0 and H1: E(Y) ≠ µY,0
• Hypothesis testing entails using a test statistic to decide whether to accept the null hypothesis or reject it in favor of the alternative hypothesis
• In our example, we use the sample mean to test hypotheses about the population mean
• Since the sample is random, any test statistic is random, and we have to reject or accept
the null hypothesis using a probabilistic calculation
• Given a null hypothesis, the p-value is the probability of drawing a test statistic (Ȳ) that is at least as extreme as the observed value of the test statistic (ȳ)
• A small p-value indicates that the null hypothesis is unlikely to be true
• To compute the p-value we make use of the CLT and treat Ȳ ∼ᵃ N(µY, σ²Ȳ)
• Under the null hypothesis, Ȳ ∼ᵃ N(µY,0, σ²Ȳ) (assume for now that σȲ is known)
• For ȳ > µY,0, the probability of getting a more extreme positive value than ȳ is:
\Pr(\bar{Y} > \bar{y} \mid \mu_{Y,0}) = \Pr\left(\frac{\bar{Y} - \mu_{Y,0}}{\sigma_{\bar{Y}}} > \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Pr\left(Z > \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Phi\left(-\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
• The p-value is thus
\text{p-value} = 2 \cdot \Phi\left(-\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
• For ȳ < µY,0, the probability of getting a more extreme negative value than ȳ is:
\Pr(\bar{Y} < \bar{y} \mid \mu_{Y,0}) = \Pr\left(\frac{\bar{Y} - \mu_{Y,0}}{\sigma_{\bar{Y}}} < \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Pr\left(Z < \frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right) = \Phi\left(\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
• The p-value is thus
\text{p-value} = 2 \cdot \Phi\left(\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right)
• We can encompass both these cases by using the absolute value function:
\text{p-value} = 2 \cdot \Phi\left(-\left|\frac{\bar{y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right|\right)
• Typically, the standard deviation σȲ is unknown and must be estimated
• The sample variance is
s_Y^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2
and the standard error of Ȳ is
\hat{\sigma}_{\bar{Y}} = \frac{s_Y}{\sqrt{n}}
• The p-value formula is simply altered by replacing the standard deviation with the standard error:
\text{p-value} = 2 \cdot \Phi\left(-\left|\frac{\bar{y} - \mu_{Y,0}}{\hat{\sigma}_{\bar{Y}}}\right|\right)
• That is, when n is large, Ȳ ∼ᵃ N(µY, σ̂²Ȳ)
• If the p-value is less than some pre-decided level (often 0.05), then the null hypothesis is rejected. Otherwise, it is accepted (or, more precisely, not rejected). The sketch below works through the calculation
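Putting the pieces together, here is a minimal sketch of a two-sided test of H0: µY = 2.0 on hypothetical data, using the standard error and 2·Φ(−|t|):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
y = rng.normal(loc=2.2, scale=1.0, size=400)   # hypothetical sample
mu_0 = 2.0                                     # null hypothesis value

se = y.std(ddof=1) / np.sqrt(len(y))           # standard error: s_Y / sqrt(n)
t = (y.mean() - mu_0) / se                     # standardized test statistic
p_value = 2 * norm.cdf(-abs(t))                # 2 * Phi(-|t|)
print(t, p_value)                              # small p-value: reject H0 at 5%
```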
Section 3.3 Confidence intervals for the population mean
• An estimate is a single best “guess” of µY
• A confidence interval gives a range of values that contains µY with a certain probability
• Using the sampling distribution Ȳ ∼ᵃ N(µY, σ²Ȳ), we have:
\Pr\left(-1.96 \le \frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}} \le 1.96\right) = 0.95
\Rightarrow \Pr\left(\bar{Y} - 1.96\,\sigma_{\bar{Y}} \le \mu_Y \le \bar{Y} + 1.96\,\sigma_{\bar{Y}}\right) = 0.95
• The interval [Ȳ − 1.96σȲ, Ȳ + 1.96σȲ] contains µY with probability 0.95
• An estimate of this interval, [ȳ − 1.96σ̂Ȳ, ȳ + 1.96σ̂Ȳ], is called a 95% confidence interval for µY (computed numerically in the sketch below)
• Note that the first interval is random but the second is not
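And a matching sketch for the 95% confidence interval, again on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(loc=5.0, scale=2.0, size=250)   # hypothetical sample

se = y.std(ddof=1) / np.sqrt(len(y))           # estimated sigma_Ybar
ci = (y.mean() - 1.96 * se, y.mean() + 1.96 * se)  # 95% CI for mu_Y
print(ci)                                      # should cover mu_Y = 5 in ~95% of samples
```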
Appendix 18.1 Summary of matrix algebra
• A matrix is a collection of numbers (called elements) that are laid out in columns and
rows
• The dimension of a matrix is m×n, where m is the number of rows and n is the number
of columns
• An m × n matrix A is
A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}
• The element in the ith row and jth column of matrix A is denoted aij
• A vector of dimension n is a collection of n numbers (called elements) arranged either in a column or a row:
b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \quad \text{and} \quad c = [\, c_1 \ c_2 \ \cdots \ c_n \,]
• A column vector is an n × 1 matrix and a row vector is a 1 × n matrix
• In most cases, we'll consider only column vectors
• The ith element of vector b is denoted bi
• A matrix of dimension 1 × 1 is called a scalar
• Matrix addition: two matrices A and B, each of the same dimension, are added together by adding their elements
• Vector multiplication (the dot product or inner product) of two n-dimensional column vectors a and b is computed as
a'b = \sum_{i=1}^{n} a_i b_i
• Matrix multiplication: two matrices A and B can be multiplied together to form the product C ≡ AB if they are conformable
– conformable: the number of columns of A equals the number of rows of B
• The (i, j) element of C is the dot product of the ith row of A and the jth column of B
• The identity matrix In is an n × n matrix with 1's on the diagonal and 0's everywhere else:
I_n = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}
• The inverse of a square matrix A, denoted A⁻¹, is defined as the matrix for which A⁻¹A = In
• The transpose of matrix A, denoted A', switches the rows and columns. That is,
– element (i, j) of A becomes element (j, i) of A'
– the element in the ith row and jth column is moved to the jth row and ith column
• If A has dimension m × n, then A' has dimension n × m
• Some useful properties of matrix algebra and calculus:
1. (A + B)' = A' + B'
2. (A + B)C = AC + BC
3. (AB)' = B'A'
4. \frac{\partial}{\partial z}(z'Az) = (A + A')z = 2Az (the last step assumes A is symmetric)
5. \frac{\partial}{\partial z}(a'z) = a
Simple linear regression model: summation notation vs. matrix notation
• The simple linear regression model is
Yi = β0 + β1Xi + ui, i = 1, . . . , n
where Yi is the dependent variable, Xi is the independent variable, β0 is the intercept, β1
is the slope, and ui is the error term
• The OLS estimator minimizes the sum of squared residuals:
\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2
• The first-order conditions are:
\sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0
and
\sum_{i=1}^{n} X_i \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0
• Solving for β̂0 and β̂1:
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
and
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
(a numerical sketch follows below)
• Properties of the OLS estimator of the slope of the simple linear regression model
• We can also write the linear regression model in matrix notation
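A numerical sketch of the summation formulas, on simulated data with made-up true coefficients β0 = 1 and β1 = 2:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
X = rng.normal(size=n)
u = rng.normal(size=n)
Y = 1.0 + 2.0 * X + u   # hypothetical data: beta0 = 1, beta1 = 2

beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()
print(beta0_hat, beta1_hat)   # close to 1 and 2
```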
• We generalize the model to include k independent variables (plus an intercept):
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i, \quad i = 1, \ldots, n
• The (k + 1)-dimensional column vector Xi contains the independent variables of the ith observation and β is a (k + 1)-dimensional column vector of coefficients. Then, we can rewrite the model as
Y_i = X_i'\beta + u_i, \quad i = 1, \ldots, n
• The n × (k + 1)-dimensional matrix X contains the stacked Xi', i = 1, . . . , n:
X = \begin{bmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{bmatrix}
• The n-dimensional column vector Y contains the stacked observations of the dependent variable and the n-dimensional column vector u contains the stacked error terms:
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \quad \text{and} \quad u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}
• We can now write all n observations compactly as
Y = X\beta + u
• The sum of squared residuals is
\hat{u}'\hat{u} = (Y - X\hat{\beta})'(Y - X\hat{\beta}) = Y'Y - 2Y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta}
• The first-order conditions are
-2X'Y + 2X'X\hat{\beta} = 0,
which can be solved for β̂:
(X'X)\hat{\beta} = X'Y \;\Rightarrow\; \hat{\beta} = (X'X)^{-1}X'Y
• β̂ is the vector representation of the OLS estimators (see the numerical sketch below)
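Finally, a sketch of the matrix formula β̂ = (X'X)⁻¹X'Y on simulated data with made-up coefficients; solving the normal equations with a linear solver is numerically preferable to forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k regressors
beta = np.array([1.0, 2.0, -0.5])                           # hypothetical coefficients
Y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solves (X'X) b = X'Y for b
print(beta_hat)                                # close to [1, 2, -0.5]
```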