Lecture 1: Introduction, Regressions and Causal Inference

January 10, 2016

Introduction Regressions Causal Inference Control Variables Randomized Experiments

Introduction

From PPG1004H you should understand the following concepts:

1 Understand means and variances
2 Understand p-values and Hypothesis Tests
3 Underlying Regression Assumptions (and why $E[U \mid X] = 0$ is so important)
4 Understand what controls do
5 Fixed-effects
6 Interpreting Regression Coefficients
7 The Basics of Causal Inference

This will be the focus of the next course

Quick Review

p-values

Definition: The p-value is the probability of getting results at least as extreme as the ones you observed, given that the null hypothesis is correct

It can’t tell you the magnitude of an effect, the strength of the evidence or the probability that the finding was the result of chance.

Layman’s explanation: You suspect a coin is weighted toward heads (therefore set $H_0: p = 0.5$). You flip it 100 times and get more heads than tails. The p-value won’t tell you whether the coin is fair, but it will tell you the probability that you’d get at least as many heads as you did if the coin were fair. That’s it.
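
The coin example can be made concrete with a short calculation. This is only a sketch: the count of 60 heads is an assumed number (the slide just says "more heads than tails"), and the function name is hypothetical. It sums the binomial probabilities of every outcome at least as extreme as the one observed, under the null of a fair coin.

```python
from math import comb


def one_sided_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """P(X >= heads) when X ~ Binomial(flips, p_null), i.e. under the null."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# Assumed data: 60 heads in 100 flips.
p_value = one_sided_p_value(60, 100)
print(f"one-sided p-value: {p_value:.4f}")
```

The result is a probability computed as if the coin were fair; it is not the probability that the coin is fair.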

Quick Review

P-value and Economic vs. Statistical Significance

Statistical Significance: If the p-value < 0.05, then your result is statistically significant (at the 5% level)

Economic Significance: We could not care less about the p-value; what matters is the magnitude of the effect

Regressions

Regression: a measure of the relation between the mean value of one variable and corresponding values of other variables

There are many types of regressions (logit, probit, IV)
We focus on Ordinary Least Squares (OLS) regressions

OLS: Minimizes the sum of squared differences between the observed responses and the responses predicted by a linear regression model

$Y_i = \alpha + \beta_1 X_{1i} + \varepsilon_i$ ⇒ “univariate” regression
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ ⇒ “multivariate” regression
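
For the univariate case, the least-squares solution has a closed form: the slope is the sample covariance of X and Y divided by the sample variance of X, and the intercept makes the line pass through the means. The sketch below illustrates this with made-up numbers; the data and the function name are assumptions, not part of the lecture.

```python
def ols_univariate(x, y):
    """Return (alpha_hat, beta_hat) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical data, roughly Y = 1 + 2X plus noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
alpha_hat, beta_hat = ols_univariate(x, y)
print(alpha_hat, beta_hat)
```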

Regressions and STATA

A regression equation tells you what to write in STATA

$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

In STATA:

Unit of Analysis

Before writing down a regression equation, know the unit of analysis
Often used subscripts: i = individual, t = time, s = school, g = grade, p = province

$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

Unit is:

$Y_{pt} = \alpha + \beta_1 X_{1pt} + \beta_2 X_{2pt} + \varepsilon_{pt}$

Unit is:

Unit of Analysis II

Not all variables need to be at the same unit of analysis
Just the outcome (Y), the regressor of interest and the error term

“Identifying variation” comes from the unit of analysis

$Y_{sgt} = \alpha + \beta_1 X_{1sgt} + \beta_2 X_{2st} + \varepsilon_{sgt}$

The above regression uses variation from:

Underlying Regression Assumptions

The required underlying regression assumptions are (***** = IMPORTANT):

1 Correct specification: $Y_i = \alpha + \beta X_i + \varepsilon_i$ (linearity and additivity) ****
2 Exogeneity: $E[\varepsilon \mid X] = 0$ *****
3 No perfect multicollinearity
4 Homoskedasticity: $\mathrm{Var}[\varepsilon \mid X] = \sigma^2$

The other two often used assumptions are:

1 Normality: $\varepsilon \mid X \sim N(0, \sigma^2)$
2 Observations are i.i.d.

Regression Assumption 1

Specification is correct: $Y_i = \alpha + \beta X_i + \varepsilon_i$ (linearity and additivity)

While this assumption is important, we generally have to assume some functional form. If we believe this is wrong we can add interactions or polynomials:

Interaction: $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_1 \cdot X_2)_i + \varepsilon_i$
Polynomial: $Y_i = \alpha + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
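
As a worked step (an addition to the slide, treating the regressors as continuous), differentiating the interaction and polynomial specifications shows how they relax linearity and additivity: the effect of a regressor is no longer a single constant but depends on the level of another regressor or of the regressor itself.

```latex
\begin{align*}
\text{Interaction:}\quad \frac{\partial Y_i}{\partial X_{1i}} &= \beta_1 + \beta_3 X_{2i} \\
\text{Polynomial:}\quad \frac{\partial Y_i}{\partial X_i} &= \beta_1 + 2\beta_2 X_i
\end{align*}
```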

Regression Assumption 2

Exogeneity: $E[\varepsilon \mid X] = 0$. This is BY FAR the most important assumption.

The assumption for the regression equation $Y_i = \alpha + \beta_1 X_{1i} + \varepsilon_i$ is violated when an omitted variable, $X_2$, is correlated with BOTH the outcome and $X_1$. I.e.:

$\mathrm{Corr}(Y, X_2) \neq 0$
$\mathrm{Corr}(X_1, X_2) \neq 0$

If both conditions hold, the estimate of $\beta_1$ is BIASED
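
A small simulation can illustrate omitted variable bias. Everything here is an assumed setup rather than lecture material: the true model is taken to be Y = 1 + 2·X1 + 3·X2 + noise, X2 is left out of the regression, and the hypothetical parameter `link` controls how strongly X1 depends on X2. With `link = 0.5` the short-regression slope should land near 2 + 3·Cov(X1, X2)/Var(X1) = 2 + 3·(0.5/1.25) = 3.2 instead of the true 2; with `link = 0` it should land near 2.

```python
import random


def slope(x, y):
    """Univariate OLS slope: sample Cov(x, y) / Var(x)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx


def short_regression_slope(link, n=20000, seed=0):
    """Regress Y on X1 only, omitting X2. `link` sets Corr(X1, X2)."""
    rng = random.Random(seed)
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [link * b + rng.gauss(0, 1) for b in x2]
    # Assumed true model: beta1 = 2, beta2 = 3.
    y = [1 + 2 * a + 3 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return slope(x1, y)


biased = short_regression_slope(link=0.5)    # X1 and X2 correlated
unbiased = short_regression_slope(link=0.0)  # X1 and X2 uncorrelated
print(biased, unbiased)
```

Setting `link` to zero is the simulated counterpart of shutting down the connection between X1 and X2.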

Regression Assumption 2 Continued

Exogeneity: $E[\varepsilon \mid X] = 0$.

Much of what empirical economists do is to find a way to make this assumption hold

We do it by shutting down the link between $X_1$ and $X_2$ (i.e. make $\mathrm{Corr}(X_1, X_2) = 0$)

The Other Assumptions

No perfect multicollinearity
Not really a problem; just do not add collinear variables together in a regression (or let STATA solve the problem)

Homoskedasticity: $\mathrm{Var}[\varepsilon \mid X] = \sigma^2$

Not a problem; can allow for heteroskedastic or clustered standard errors easily. In STATA, for heteroskedasticity-robust standard errors put “,r” at the end of the “reg” command

Normality - Does not affect bias, only efficiency of OLS
i.i.d. - Matters when the error terms are serially correlated (autocorrelated). We can (somewhat) correct for this.

Introduction

What do you think of these studies? (that we see all the time in newspapers)

1 http://www.nbcnews.com/health/cancer/university-texas-study-links-meat-kidney-cancer-n459811

2 http://hereandnow.wbur.org/2016/01/06/sugar-breast-cancer-study

Correlation and Causation

Correlation:

Causation:

1 http://www.tylervigen.com/spurious-correlations

Causal Inference

For any causal statement you should be able to answer all the following questions:

1 What is the unit of analysis?
2 What is the treatment?
3 What outcome are we interested in?
4 What are the counterfactual outcomes?
5 What is the causal link?
6 How is the counterfactual mimicked? Does this sound reasonable?

Causal Inference

A good way to think about these is to do a thought experiment and think about ‘treated’ and ‘untreated’
Suppose we have two types of people. People A get a drug. People B do not. We are interested in their blood pressure.

What is our treatment?

What is our counterfactual?

Causal Inference

I am going to introduce some math notation here:
The outcome for each treated person is: $Y_{1,i}$
The outcome for each untreated person is: $Y_{0,i}$

What are the outcomes for the treated?

For the untreated?

What is the treatment effect?

What is the selection bias?
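
One standard way to organize the answers to these questions (a sketch, not taken from the slides) reads $Y_{1,i}$ and $Y_{0,i}$ as person i's outcome with and without treatment and introduces a treatment indicator $D_i$, an assumed symbol equal to 1 for people who get the drug and 0 otherwise. The observed difference in mean outcomes then splits into the treatment effect on the treated plus selection bias:

```latex
\begin{align*}
\underbrace{E[Y_{1,i} \mid D_i = 1] - E[Y_{0,i} \mid D_i = 0]}_{\text{observed difference}}
&= \underbrace{E[Y_{1,i} - Y_{0,i} \mid D_i = 1]}_{\text{treatment effect on the treated}} \\
&\quad + \underbrace{E[Y_{0,i} \mid D_i = 1] - E[Y_{0,i} \mid D_i = 0]}_{\text{selection bias}}
\end{align*}
```

The term $E[Y_{0,i} \mid D_i = 1]$ is the counterfactual: what the treated would have experienced without treatment. It is never observed, which is why it has to be mimicked.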

Thought experiments to regressions

How can we find the difference in outcomes between people A and B?

$\mu_A - \mu_B$, or we can regress:
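
The two approaches give the same number: regressing the outcome on a 0/1 treatment dummy yields a slope equal to the difference in group means, and an intercept equal to the untreated mean. The sketch below checks this with invented blood pressure readings; the data and variable names are assumptions for illustration only.

```python
def ols_univariate(x, y):
    """Return (alpha_hat, beta_hat) from a univariate OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    beta_hat = sxy / sxx
    return y_bar - beta_hat * x_bar, beta_hat


# Hypothetical blood pressure readings.
people_a = [118.0, 122.0, 125.0, 119.0]  # got the drug
people_b = [130.0, 128.0, 135.0, 127.0]  # did not

treated = [1] * len(people_a) + [0] * len(people_b)
outcome = people_a + people_b

alpha_hat, beta_hat = ols_univariate(treated, outcome)
diff_in_means = sum(people_a) / len(people_a) - sum(people_b) / len(people_b)
print(beta_hat, diff_in_means)
```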

Example

Let’s try doing the causal statements for the following article:
http://www.nbcnews.com/health/cancer/university-texas-study-links-meat-kidney-cancer-n459811

1 What is the unit of analysis?
2 What is the treatment?
3 What outcome are we interested in?
4 What are the counterfactual outcomes?
5 What is the causal link?
6 How is the counterfactual mimicked? Does this sound reasonable?

Causal Inference

There are 5 basic empirical methods to obtain causal inference:

1 Controls (includes matching/fixed-effects)
2 Randomized Experiments
3 Difference-in-Differences
4 Instrumental Variables
5 Regression Discontinuity

Control Variables

The main problem of causal inference is the possibility of omitted variable bias (OVB)

So why do we not just control for all omitted variables?

Control Variables

So while controls do not seem great at getting causality, they help by:

Eliminating obvious OVB
Reducing standard errors

For this reason, controls are included in almost every regression

Control Variables

http://well.blogs.nytimes.com/2015/08/19/researchers-link-longer-work-hours-and-stroke-risk/?_r=0

1 What is the unit of analysis?
2 What is the treatment?
3 What outcome are we interested in?
4 What are the counterfactual outcomes?
5 What is the causal link?
6 How is the counterfactual mimicked? Does this sound reasonable?

Control Variables

$\text{stroke}_i = \alpha + \beta_1 \text{HoursWorked}_i + \beta_2 \text{Controls}_i + \varepsilon_i$

What controls could you realistically never put in the above regression that may lead to OVB?

Control Variables

Whether you add a control or not often depends on qualitative reasoning

For this reason, researchers often report many regressions, using varying sets of controls:

In general, add variables that either:
Directly affect the outcome
Proxy for another unobserved variable that affects the outcome

Table 1: Difference-in-Differences Estimates of CSR on Private School Share

Outcome Variable: Private School Share (%)

                            (1)        (2)        (3)        (4)
Treatment*Post           -1.41***   -1.35***   -0.91***   -1.32***
                         (0.17)     (0.18)     (0.28)     (0.27)
Treatment                 2.87***    2.82***    4.33***    3.32***
                         (0.25)     (0.52)     (0.60)     (0.46)
Post                     -0.73***    0.26*     -0.01       0.24*
                         (0.15)     (0.15)     (0.10)     (0.13)
Year/Grade FE             No         Yes        Yes        Yes
Demographic Controls      No         No         Yes        Yes
District FE               No         No         No         Yes
Number of Observations    253,056    253,056    188,210    188,210

Control Variables

For example,
$\text{stroke}_i = \alpha + \beta_1 \text{HoursWorked}_i + \beta_2 \text{HoursExercised}_i + \beta_3 \text{Race}_i + \varepsilon_i$

What is the control $\text{HoursExercised}_i$ for?

What is the control $\text{Race}_i$ for?

Fixed-effects

Fixed-effects can be a bit confusing. Often they are a control. Sometimes they can be used for causal inference.

As a control:
Fixed effects are just a bunch of dummy variables. They are added as controls just like any other variable.

e.g. time FEs, province FEs, school FEs

For time FEs, if you have 10 years of data you:
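
A minimal sketch of what adding time FEs means mechanically, assuming 10 hypothetical years (2006 to 2015) and the usual convention of dropping one base year so that the dummies are not perfectly collinear with the constant. The function and column names are made up for illustration.

```python
def year_dummies(years, base_year):
    """Map each observation's year to 0/1 indicators, omitting base_year."""
    categories = sorted(set(years) - {base_year})
    return [{f"year_{c}": int(y == c) for c in categories} for y in years]


# Hypothetical sample with one observation per year.
years = list(range(2006, 2016))
rows = year_dummies(years, base_year=2006)

print(len(rows[0]))  # number of dummy columns created
print(rows[1])       # the 2007 observation: only year_2007 is switched on
```

Under this convention 10 years of data produce 9 dummy columns, and each coefficient is read relative to the omitted base year.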

Fixed-effects II

For causal inference:

Our essential concern is that people who work longer hours also differ in other dimensions (e.g. diet)

Idea: Why not control for the fact they are the same person?
Must have panel data (i.e. variation over time within the same person)

Essentially we estimate the effect of an increase/decrease in working hours for person X on their likelihood of stroke

Fixed-effects in a Regression

In a regression, you write down a fixed effect without a β in front. The subscript then denotes the fixed-effect.

$\text{stroke}_{it} = \alpha + \beta_1 \text{HoursWorked}_{it} + \beta_2 \text{Controls}_{it} + \lambda_t + \delta_i + \varepsilon_{it}$

$\delta_i$ =

$\lambda_t$ =
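
To see what a person-level fixed effect does, the sketch below uses the "within" transformation: subtracting each person's own mean from their observations removes anything constant within a person. The tiny panel is invented, contains no noise and no time effects, and is built so that the outcome equals 2 times hours plus a person-specific constant that is larger for the person who works more. All names and numbers are assumptions.

```python
from collections import defaultdict


def slope(x, y):
    """Univariate OLS slope: sample Cov(x, y) / Var(x)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx


def demean_within(ids, values):
    """Subtract each person's own mean from that person's observations."""
    groups = defaultdict(list)
    for i, v in zip(ids, values):
        groups[i].append(v)
    means = {i: sum(vs) / len(vs) for i, vs in groups.items()}
    return [v - means[i] for i, v in zip(ids, values)]


# Hypothetical panel: outcome = 2 * hours + person effect (0 for A, 30 for B).
person = ["A", "A", "A", "B", "B", "B"]
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
outcome = [2.0, 4.0, 6.0, 38.0, 40.0, 42.0]

pooled = slope(hours, outcome)  # ignores who is who
within = slope(demean_within(person, hours), demean_within(person, outcome))
print(pooled, within)
```

The pooled slope mixes the effect of hours with differences between people, while the within slope uses only changes over time for the same person.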

Using Fixed-effects

Implementing fixed-effects:

Open up STATA

Pros and Cons

What is the major bias concern in fixed-effects?

Pros vs. Cons of fixed-effects:

Causal Inference

There are 5 basic empirical methods to obtain causal inference:

1 Controls (includes matching/fixed-effects)
2 Randomized Experiments
3 Difference-in-Differences
4 Instrumental Variables
5 Regression Discontinuity

Randomized Experiments

Randomized experiments (or RCTs) are the “gold standard” of policy evaluation

Unfortunately, they are really tough to get off the ground

Also, some questions are not amenable to RCTs
For example:

Implementation

Implementation of randomized experiments can be difficult.

Before the experiment can be run you need to:
1 Find necessary sample size (use “ssi” in STATA)
2 Get funding (they are often very expensive)
3 Get ethical approval (can be very difficult)
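
For intuition on the sample size step, here is a sketch of one textbook calculation: the normal-approximation formula for a two-sided test comparing two means with equal-sized arms and a common standard deviation. This is not a description of what the STATA command computes, and the effect size, significance level, and power used below are assumed values.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per arm to detect a difference in means of `effect`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


# Assumed target: detect an effect of 0.2 standard deviations.
n = n_per_arm(effect=0.2, sd=1.0)
print(n)
```

Because the required sample grows with the inverse square of the effect size, halving the detectable effect roughly quadruples the sample, which is one reason experiments get expensive.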

Implementation II

Afterwards, you need to:
1 Randomize
  Since experiments are so expensive, to prevent improper randomization due to “luck”, researchers often “stratify” by some characteristic, which guarantees balance between treatment and control in that characteristic
2 Ensure there is limited attrition (often the bane of randomized trials)
3 Ensure there is no cross-contamination
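
Stratified randomization can be sketched as follows: group the units by the chosen characteristic, shuffle within each group, and assign half of each group to treatment. The student identifiers, the use of gender as the stratifying characteristic, and the function name are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_assignment(units, stratum_of, seed=0):
    """Assign half of each stratum to treatment (1) and the rest to control (0)."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for unit in units:
        by_stratum[stratum_of[unit]].append(unit)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)  # random order within the stratum
        half = len(members) // 2
        for position, unit in enumerate(members):
            assignment[unit] = 1 if position < half else 0
    return assignment


# Hypothetical sample: 8 students, stratified by gender.
students = [f"s{i}" for i in range(8)]
gender = {s: ("F" if i % 2 == 0 else "M") for i, s in enumerate(students)}
assignment = stratified_assignment(students, gender)

for g in ("F", "M"):
    treated = sum(assignment[s] for s in students if gender[s] == g)
    print(g, "treated:", treated)
```

Within each stratum the split is fixed by construction, so treatment and control cannot end up unbalanced on that characteristic by chance.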

Evaluating Randomized Trials

To evaluate randomized trials, researchers look at internal and external validity

We will do this for Project STAR

Internal Validity

For internal validity we look at:

1 Proper Randomization (look for covariate balance)

2 Differential Attrition

3 Cross Contamination

4 Hawthorne Effects (could also be under external validity)
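
Checking covariate balance usually means comparing baseline characteristics across the treatment and control groups. One simple summary, sketched here with invented baseline ages, is the standardized difference in means: the gap between group means divided by a pooled standard deviation. The data and the function name are assumptions, and this is one of several possible balance checks.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(treated, control):
    """Difference in group means, scaled by the pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical baseline ages, measured before treatment starts.
age_treated = [5.1, 5.4, 5.0, 5.3, 5.2]
age_control = [5.2, 5.3, 5.1, 5.4, 5.0]

balance = standardized_difference(age_treated, age_control)
print(balance)
```

A value close to zero is what proper randomization should deliver on average for any characteristic measured before treatment.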

External Validity

For external validity we look at:

1 Generalizability (i.e. sample selection)

2 Scalability

3 General Equilibrium Effects

Project STAR

What is Project STAR?

Open up STATA....