Generating Plausible Causal Hypotheses

By Larry V. Hedges

Northwestern University

Presented at the 2010 IES Research Conference

Goals

Provide a brief introduction to causal inference

Explain why experiments provide model free estimates of causal effects

Examine the possibility of causal inference from a few quasi-experimental designs

-Assignment based on a covariate

-Regression discontinuity design

-Nonequivalent control group design

Examine the difference in differences approach in more detail

What is Causal Inference?

We all think we know what we mean by cause and effect

But a formal treatment is useful

It turns out that there are several treatments of cause and effect

The modern statistical approach is often called the Rubin-Holland-Rosenbaum model

(But its roots go back as far as Neyman, 1923)

The Rubin Holland Model

Key concepts

Units (e.g., individuals)

Treatments (e.g., 0, 1)

Responses (e.g., $r^0$, $r^1$)

$r_i^0$: the response of unit i if it got treatment 0

$r_i^1$: the response of unit i if it got treatment 1

Causal effect of treatment 1 versus 0 on unit i:

$\tau_i = r_i^1 - r_i^0$

The Rubin Holland Model

The definition of the causal effect of treatment 1 versus 0 on unit i

$\tau_i = r_i^1 - r_i^0$

• This is a relative definition: the effect of treatment 1 compared to treatment 0

• This is a counterfactual definition: you can’t observe both $r_i^0$ and $r_i^1$

• The (relative) causal effect of a treatment on a single unit cannot be estimated without additional assumptions (although with additional assumptions single subject designs attempt to do so)

Causal Inference and Missing Data

Note that causal inference is a missing data problem

You cannot observe both $r_i^0$ and $r_i^1$; one of them is always missing

Not surprisingly, modern ideas for causal inference sometimes draw on modern ideas for handling missing data

Missing data methods try to find conditions that reduce the missing data to be (conditionally) as if “random sampling”

Methods for causal inference try to find conditions that reduce the missing data to be (conditionally) “as if” random assignment

We will discuss some of these later

The Rubin Holland Model

Example

Note that we assume that both $r_i^0$ and $r_i^1$ are known for the purposes of illustration

Unit  r1  r0   τ
1     20  10  10
2     20  10  10
3     20  10  10
4     20  10  10
5     11  20  -9
6     11  20  -9
7     11  20  -9
8     11  20  -9

The Rubin Holland Model

Example

Any particular experiment would assign some units to treatment and others to control, so some $r_i^0$’s would be observed and some $r_i^1$’s would be observed

Unit  r1  r0   τ
1     20  10  10
2     20  10  10
3     20  10  10
4     20  10  10
5     11  20  -9
6     11  20  -9
7     11  20  -9
8     11  20  -9

The Rubin Holland Model

Example

Each possible experiment would give a different estimate of the average treatment effect, but the average over all possible assignments would be the average treatment effect

Unit  r1  r0   τ
1     20  10  10
2     20  10  10
3     20  10  10
4     20  10  10
5     11  20  -9
6     11  20  -9
7     11  20  -9
8     11  20  -9
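The claim about averaging over all possible assignments can be checked directly on the eight units in the table. The sketch below is illustrative only and assumes something the slide does not state: that every experiment assigns exactly 4 of the 8 units to treatment and estimates the effect by the treatment-control difference in means.

```python
from fractions import Fraction
from itertools import combinations

# Potential outcomes from the example table, units 1..8
r1 = [20, 20, 20, 20, 11, 11, 11, 11]
r0 = [10, 10, 10, 10, 20, 20, 20, 20]
units = range(len(r1))

# True average treatment effect: mean of r1 - r0 over all units
ate = Fraction(sum(a - b for a, b in zip(r1, r0)), len(r1))

# Assumption (not in the slides): each experiment treats exactly 4 units
estimates = []
for treated in combinations(units, 4):
    control = [u for u in units if u not in treated]
    mean_t = Fraction(sum(r1[u] for u in treated), 4)   # observed r1's
    mean_c = Fraction(sum(r0[u] for u in control), 4)   # observed r0's
    estimates.append(mean_t - mean_c)

mean_estimate = sum(estimates) / len(estimates)
print(ate, mean_estimate, min(estimates), max(estimates))
```

Under that assumption there are 70 possible experiments. Individual estimates range from 0 to 1, and their average is 1/2, which equals the average of the τ column.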

The Rubin Holland Model

Example

Note that assigning the best treatment to a unit does not give an unbiased estimate of the average treatment effect

Unit  r1  r0   τ
1     20  10  10
2     20  10  10
3     20  10  10
4     20  10  10
5     11  20  -9
6     11  20  -9
7     11  20  -9
8     11  20  -9
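As a small check of this point, the sketch below gives each unit in the table whichever treatment yields its larger response and then takes the usual treatment-control difference in means. The assignment rule is a literal reading of "assigning the best treatment"; the variable names are illustrative.

```python
# Potential outcomes from the example table, units 1..8
r1 = [20, 20, 20, 20, 11, 11, 11, 11]
r0 = [10, 10, 10, 10, 20, 20, 20, 20]

ate = sum(a - b for a, b in zip(r1, r0)) / len(r1)

# Give each unit whichever treatment yields its larger response
treated = [i for i in range(len(r1)) if r1[i] > r0[i]]
control = [i for i in range(len(r1)) if r1[i] <= r0[i]]

mean_treated = sum(r1[i] for i in treated) / len(treated)
mean_control = sum(r0[i] for i in control) / len(control)
naive_estimate = mean_treated - mean_control
print(ate, naive_estimate)
```

Both groups show a mean response of 20, so the comparison suggests no effect, while the average of the unit-level effects is 0.5.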

The Rubin Holland Model

Randomized experiments

Define the assignment variable Z via Z = 0 if a unit gets control and Z = 1 if a unit gets treatment

Random assignment means that (r0, r1) is independent of Z (assignment)

Therefore

$\mathrm{E}\{r^0 \mid Z = 0\} = \mathrm{E}\{r^0 \mid Z = 1\} = \mathrm{E}\{r^0\}$

$\mathrm{E}\{r^1 \mid Z = 0\} = \mathrm{E}\{r^1 \mid Z = 1\} = \mathrm{E}\{r^1\}$

The Rubin Holland Model

Randomized experiments give model free estimates of the average (relative) causal effect of a treatment

Why?

Because independence of Z (assignment) and (r0, r1) implies

$\mathrm{E}\{r^1 - r^0\} = \mathrm{E}\{r^1\} - \mathrm{E}\{r^0\} = \mathrm{E}\{r^1 \mid Z = 1\} - \mathrm{E}\{r^0 \mid Z = 0\}$

The Rubin Holland Model

This is all very simple

But this is deceptive

I have already embedded assumptions into the model (as had Rubin, 1974)

Why are there only 2 possible outcomes?

What if the treatment I get affects your response to treatment?

This assumption is called “no interference between units” (e.g., Cox, 1958) or the stable unit treatment value (SUTV) assumption (e.g., Rubin)

The Rubin Holland Model

SUTV can be wrong!

Consider response to vaccines

The response to the smallpox vaccine (or not) depends on who else is vaccinated

This is how eradication is possible

Consider classrooms or schools where social interaction is possible (indeed probable)

Contamination is a violation of SUTV

The Rubin Holland Model

Some associations cannot be causal

Suppose one of $r_i^0$ or $r_i^1$ does not exist

• Some individuals would never accept treatment (refusers)

• Some individuals would always get treatment (always takers)

• Some individuals would always do the opposite of what they were assigned (defiers)

This leads to the concept of compliers and complier average treatment effect

The Rubin Holland Model

On a more philosophical level, not all “what if” questions have causal answers

The idea of a randomized experiment helps clarify what effects might be causal

If you cannot imagine an experiment that assigns the treatments being compared, it may not be sensible to talk of causal effects

It may not be sensible to talk of sex differences as causal effects

But it might be sensible to talk of gender (social) differences as causal effects

The Rubin Holland Model

Similarly, it may not make sense to talk about causal effects of treatments on

• Never takers

• Always takers

• Defiers

It makes sense to explicitly limit the scope of our attempts at causal inference to the compliers

Scope of Causal Inference

Randomized experiments give model-free estimates of average causal effects

Is there any other way to get them?

No other model-free methods are known

Many other methods can give estimates of causal effects given that a model is true

The key problem with these methods is that the model must be assumed to be true, and the model assumptions are often difficult or impossible to verify

But such methods are useful when experiments cannot be done or to suggest plausible causal hypotheses

Estimating Treatment Effects

Consider treatment assignment (dummy variable) Z and outcome Y

Regress Y on Z

Yi = β0 + β1 Zi + εi

The estimate of β1 is just the difference between the mean Y for Z = 1 (the treatment group) and the mean Y for Z = 0 (the control group)

$\bar{Y}_1 = \beta_0 + \beta_1 + \bar{\varepsilon}_1$

$\bar{Y}_0 = \beta_0 + \bar{\varepsilon}_0$

Thus the OLS estimate is

$\bar{Y}_1 - \bar{Y}_0 = \beta_1 + (\bar{\varepsilon}_1 - \bar{\varepsilon}_0)$
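The statement that the OLS estimate of β1 in a regression of Y on a treatment dummy equals the treatment-control difference in means can be verified numerically. The following is a minimal sketch on simulated data; the sample size and the values β0 = 1 and β1 = 2 are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.integers(0, 2, size=n)            # treatment dummy Z
y = 1.0 + 2.0 * z + rng.normal(size=n)    # hypothetical beta0 = 1, beta1 = 2

# OLS of Y on an intercept and Z
design = np.column_stack([np.ones(n), z])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# Difference between the treatment-group and control-group means
mean_diff = y[z == 1].mean() - y[z == 0].mean()
print(beta_hat[1], mean_diff)
```

The two printed numbers agree up to floating-point error, whatever the data, because the identity is algebraic rather than statistical.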

Estimating Treatment Effects (With Random Assignment)

If the treatment is randomly assigned, then Z is uncorrelated with ε (Z is exogenous)

Z is uncorrelated with ε if and only if $\bar{\varepsilon}_1 = \bar{\varepsilon}_0$

But if $\bar{\varepsilon}_1 = \bar{\varepsilon}_0$, then the mean difference is

$\bar{Y}_1 - \bar{Y}_0 = \beta_1 + (\bar{\varepsilon}_1 - \bar{\varepsilon}_0) = \beta_1$

This implies that standard methods (OLS) give an unbiased estimate of β1, which is the average treatment effect

That is, the treatment-control mean difference is an unbiased estimate of β1

What goes wrong without randomization? (Simple Case)

If we do not have randomization, there is no guarantee that Z is uncorrelated with ε (Z may be endogenous)

Thus the OLS estimate is still

$\bar{Y}_1 - \bar{Y}_0 = \beta_1 + (\bar{\varepsilon}_1 - \bar{\varepsilon}_0)$

If Z is correlated with ε, then $\bar{\varepsilon}_1 \neq \bar{\varepsilon}_0$

Hence $\bar{Y}_1 - \bar{Y}_0$ does not estimate β1, but some other quantity that depends on the correlation of Z and ε

If Z is correlated with ε, then standard methods give a biased estimate of β1

Instrumental Variables

One way to see this is in terms of two regression equations

Yi = β0 + β1Zi + εi

Zi = γ0 + γ1Xi + ηi

Note that, in this model Z is endogenous (may be correlated with ε)

The instrumental variables model requires that:

1. γ1 ≠ 0 so that X predicts Z, and

2. X uncorrelated with ε (X is exogenous) [Cov{ε, X} = 0]
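To make the two-equation setup concrete, here is a rough simulation sketch in which Z is endogenous because an unobserved variable u enters both ε and η, while X is randomly assigned. All parameter values, the variable u, and the helper `cov` are hypothetical. The IV estimate is computed as the ratio Cov(X, Y) / Cov(X, Z), which is the standard single-instrument estimator, and Z is continuous here, following the linear equation for Z rather than a dummy.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta0, beta1 = 1.0, 2.0        # hypothetical outcome-equation parameters
gamma0, gamma1 = 0.0, 1.0      # hypothetical first-stage parameters

x = rng.integers(0, 2, size=n).astype(float)   # instrument, randomly assigned
u = rng.normal(size=n)                          # unobserved common cause
eta = u + rng.normal(size=n)
eps = u + rng.normal(size=n)                    # correlated with eta through u

z = gamma0 + gamma1 * x + eta                   # Z is endogenous
y = beta0 + beta1 * z + eps

def cov(a, b):
    """Sample covariance of two arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))

ols_slope = cov(z, y) / cov(z, z)   # biased because Cov(Z, eps) != 0
iv_slope = cov(x, y) / cov(x, z)    # uses only the variation in Z induced by X
print(ols_slope, iv_slope)
```

In this setup the OLS slope settles noticeably above β1 = 2, while the IV ratio stays close to it. Both requirements listed above matter: if γ1 were 0 the denominator Cov(X, Z) would vanish, and if X were correlated with ε the numerator would pick up the same kind of bias as OLS.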

Estimating Causal Effects (IV Studies)

Angrist, Imbens, & Rubin (1996) showed that IV can estimate average causal effects of Z on Y, if the following assumptions hold:

1. SUTVA

2. Random assignment of X

3. Exclusion restriction (exogeneity of X)

4. Nonzero causal effect of X on Z

5. Monotonicity (no defiers)

Then the IV estimate is an estimate of the average treatment effect for those who comply with assignment
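The link between the IV estimate and the compliers can be illustrated with a population-level calculation. The sketch below is built on invented numbers: the type shares and mean responses are hypothetical, there are no defiers (monotonicity), and assignment X affects the outcome only through the treatment actually received (exclusion restriction). Because X is randomly assigned, the same type shares apply in both assignment arms.

```python
# Hypothetical population: share of each type and mean potential responses.
# None marks a response that is never realized for that type.
types = {
    "complier":     {"share": 0.5, "r0": 10.0, "r1": 14.0},
    "never_taker":  {"share": 0.3, "r0": 8.0,  "r1": None},
    "always_taker": {"share": 0.2, "r0": None, "r1": 15.0},
}

def takes_treatment(kind, assigned):
    """Treatment actually received (Z) given the assignment (X)."""
    if kind == "always_taker":
        return True
    if kind == "never_taker":
        return False
    return assigned  # compliers do what they are assigned

def expected(assigned):
    """Return (E[Z | X = assigned], E[Y | X = assigned]) in the population."""
    ez = ey = 0.0
    for kind, t in types.items():
        z = takes_treatment(kind, assigned)
        ez += t["share"] * z
        ey += t["share"] * (t["r1"] if z else t["r0"])
    return ez, ey

z1, y1 = expected(True)
z0, y0 = expected(False)

# Effect of assignment on the outcome divided by its effect on treatment received
iv_estimate = (y1 - y0) / (z1 - z0)
complier_effect = types["complier"]["r1"] - types["complier"]["r0"]
print(iv_estimate, complier_effect)
```

The never takers and always takers contribute the same mean response under either assignment, so they cancel in the numerator, and the ratio recovers the complier effect of 4 (up to floating-point rounding).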

Assignment by Covariate Value

Let X be a covariate and x be the value of X

Suppose that units with the same X value are randomly assigned with probability π(x), where 0 < π(x) < 1

Thus

$\mathrm{E}\{r^0 \mid X, Z = 0\} = \mathrm{E}\{r^0 \mid X, Z = 1\} = \mathrm{E}\{r^0 \mid X\}$

$\mathrm{E}\{r^1 \mid X, Z = 0\} = \mathrm{E}\{r^1 \mid X, Z = 1\} = \mathrm{E}\{r^1 \mid X\}$

Conditional independence of Z (assignment) and (r0, r1) given X implies

$\mathrm{E}\{r^1 - r^0 \mid X\} = \mathrm{E}\{r^1 \mid X\} - \mathrm{E}\{r^0 \mid X\} = \mathrm{E}\{r^1 \mid X, Z = 1\} - \mathrm{E}\{r^0 \mid X, Z = 0\}$

Thus the experiment estimates the conditional causal effect given X

Assignment by Covariate Value

The conditional causal effect of treatment τ(x) might be called the local average treatment effect at X = x

The weighted average of local average treatment effects

$\sum_x w_x \, \tau(x)$

estimates the average causal effect of treatment

Note that the overall treatment-control mean difference (even controlling for X) does not necessarily estimate the average causal effect of treatment, because there may be more
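A small numerical illustration of why weighting matters: the sketch below uses two hypothetical strata with different assignment probabilities π(x), different baseline responses, and different local effects τ(x). It assumes the weights are the population shares of each stratum, and it compares the weighted average of local effects with the pooled treatment-control difference that ignores X. All numbers are invented for the example.

```python
from fractions import Fraction as F

# Hypothetical strata: share of units, assignment probability pi(x),
# mean control response E[r0 | x], and local effect tau(x)
strata = {
    0: {"share": F(1, 2), "pi": F(1, 5), "r0": F(0),  "tau": F(1)},
    1: {"share": F(1, 2), "pi": F(4, 5), "r0": F(10), "tau": F(3)},
}

# Local effects weighted by the share of units at each x (assumed weights)
ate = sum(s["share"] * s["tau"] for s in strata.values())

# Pooled treatment-control mean difference, ignoring X
treated_mass = sum(s["share"] * s["pi"] for s in strata.values())
control_mass = sum(s["share"] * (1 - s["pi"]) for s in strata.values())
mean_treated = sum(
    s["share"] * s["pi"] * (s["r0"] + s["tau"]) for s in strata.values()
) / treated_mass
mean_control = sum(
    s["share"] * (1 - s["pi"]) * s["r0"] for s in strata.values()
) / control_mass
pooled_diff = mean_treated - mean_control
print(ate, pooled_diff)
```

Here the weighted average of local effects is 2, while the pooled difference is 43/5, because the treatment group is drawn mostly from the stratum with the higher baseline response.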

Regression Discontinuity Designs

Regression discontinuity designs (RDD) assign to treatment by covariate value, but assign all units with X > c to treatment:

$\pi(x) = 1$ if $x > c$, and $\pi(x) = 0$ if $x \le c$

They therefore violate the principle that 0 < π(x) < 1

However, RDDs can estimate the local average causal effect of the treatment at X = c

The reason is that the RDD is a randomized experiment at the cutpoint X = c

More properly, the limit as x → c is a randomized experiment.

Regression Discontinuity Designs

Note that the RDD can support estimation of causal effects

The causal effect that can be estimated, τ(c), is

$\tau(c) = \lim_{x \downarrow c} \mathrm{E}\{r^1 \mid X = x, Z = 1\} - \lim_{x \uparrow c} \mathrm{E}\{r^0 \mid X = x, Z = 0\}$

In other words, the causal effect (local average treatment effect) at the value X = c, which is the gap or discontinuity at X = c

But not every analysis of the design estimates the causal effect

Analyses that use models assuming functional form (e.g., linear regression) depend on that functional form assumption

Regression Discontinuity Designs

Nonparametric regression methods can, in principle, provide model-free estimates of the causal effect of treatment at X = c

But these methods themselves make technical assumptions (e.g., about bandwidth, etc.)

Thus estimation of treatment effects in RDD is in practice somewhat model dependent
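One common way to estimate the gap at the cutpoint is to fit a straight line on each side of c using only observations within a bandwidth h, and then take the difference of the two fitted values at X = c. The sketch below does this on simulated data. The cutpoint, bandwidth, sample size, and data-generating model are all hypothetical, and the choice of h is exactly the kind of technical assumption mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, c, h = 2000, 0.0, 0.5       # sample size, cutpoint, bandwidth (all hypothetical)
true_jump = 2.0

x = rng.uniform(-1.0, 1.0, size=n)
z = (x > c).astype(float)      # sharp assignment: treated if and only if X > c
y = 1.0 + 0.5 * x + true_jump * z + rng.normal(scale=0.3, size=n)

def intercept_at_cut(mask):
    """Fit a straight line to (x - c, y) on one side; return its value at x = c."""
    slope, intercept = np.polyfit(x[mask] - c, y[mask], 1)
    return intercept

above = (x > c) & (x <= c + h)     # treated units near the cutpoint
below = (x <= c) & (x >= c - h)    # control units near the cutpoint
tau_c_hat = intercept_at_cut(above) - intercept_at_cut(below)
print(tau_c_hat)
```

Because the simulated response really is linear in x, the estimate lands near the true jump of 2. With a curved response function, the answer would depend on h and on the order of the local fit.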

Designs with multiple cutpoints can provide estimates of treatment effects at multiple points or more externally valid average causal effects

Nonequivalent Control Group Designs

These designs compare a treatment group with a (non-randomized) comparison group

There is a huge range of quality in these designs, ranging from pretty good to awful

Often matching or adjustment for covariates (a form of pseudo-matching), or both, are used

Can such designs ever provide estimates of average causal effects?

Yes, but essentially never estimates that are model free


How well they work depends on how well the analytic model captures essential features of the data

This is not always possible to determine empirically

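If we can assume conditional independence of Z (assignment) and (r0, r1) given X or even that

$\mathrm{E}\{r^0 \mid X, Z = 0\} = \mathrm{E}\{r^0 \mid X\}$

$\mathrm{E}\{r^1 \mid X, Z = 1\} = \mathrm{E}\{r^1 \mid X\}$

Then the experiment can estimate the causal effect of treatment, since

$\mathrm{E}\{r^1 - r^0 \mid X\} = \mathrm{E}\{r^1 \mid X\} - \mathrm{E}\{r^0 \mid X\} = \mathrm{E}\{r^1 \mid X, Z = 1\} - \mathrm{E}\{r^0 \mid X, Z = 0\}$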


Note that this is the equivalent of making the treatment assignment “as if random” conditional on the covariate (or matching variable) X

This is the basic strategy of matching for causal inference (e.g., Rubin, Rosenbaum, Cochran)

It is also the basic strategy for inference under missing data

Find covariates so that, conditional on the observed covariates, the missing data is “as if random”

In missing data theory, this is called “strong ignorability”


This is all very abstract

Make it concrete by considering response functions—that is r0 or r1 as a function of covariates or other effects

For example, suppose that

$r_i^0 = \alpha + \beta x_i + \varepsilon_i^0$

$r_i^1 = \alpha + \tau + \beta x_i + \varepsilon_i^1$

and that $\varepsilon_i^0$ and $\varepsilon_i^1$ are independent of x

Then it easily follows that the usual estimate of the average treatment effect is unbiased


But suppose that the response functions are a little different

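$r_i^0 = \alpha + \beta_0 x_i + \varepsilon_i^0$

$r_i^1 = \alpha + \tau + \beta_1 x_i + \varepsilon_i^1$

and that $\varepsilon_i^0$ and $\varepsilon_i^1$ are independent of x

Then it easily follows that the usual estimate of the treatment effect is biased: it depends on the group covariate means $\bar{x}_1$ and $\bar{x}_0$ and on an “average” of β0 and β1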


The analysis could be “fixed up” to remove the bias if we knew the response function

But that is exactly the point

To get an unbiased estimate of the causal effect, you have to know the right model, so analyses will be model dependent

It is not easy (maybe impossible) to know what the right model is

Moreover, I chose a very simple model (homogeneous treatment effects with responses a linear function of the observed covariates)
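The model dependence can be seen in a small simulation. The sketch below takes "the usual estimate" to be the coefficient on Z from a regression of Y on Z and x with a common slope, which is an assumption about what the slides intend. It compares that estimate with the sample average of r1 − r0 under equal and unequal slopes. The group covariate distributions and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000                      # units per group (hypothetical)
alpha, tau = 0.0, 2.0

def ancova_estimate(beta0, beta1):
    """Regress Y on Z and x with a common slope; return (estimate, true ATE)."""
    x_c = rng.normal(0.0, 1.0, size=n)          # comparison-group covariate
    x_t = rng.normal(1.0, 2.0, size=n)          # treatment group differs on x
    x = np.concatenate([x_c, x_t])
    z = np.concatenate([np.zeros(n), np.ones(n)])
    r0 = alpha + beta0 * x + rng.normal(size=2 * n)
    r1 = alpha + tau + beta1 * x + rng.normal(size=2 * n)
    y = np.where(z == 1, r1, r0)                # only one response is observed
    design = np.column_stack([np.ones(2 * n), z, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1], np.mean(r1 - r0)

est_same, ate_same = ancova_estimate(beta0=1.0, beta1=1.0)
est_diff, ate_diff = ancova_estimate(beta0=1.0, beta1=2.0)
print(est_same, ate_same)   # equal slopes: the two values nearly agree
print(est_diff, ate_diff)   # unequal slopes: the two values differ
```

With equal slopes the adjusted estimate tracks the average effect closely. With unequal slopes the same analysis misses it, even though nothing about the assignment mechanism has changed. Only the response function differs, and that is not something the analyst observes directly.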

Differences in Differences

The difference in differences idea can be seen as a particular kind of nonequivalent control group design

It is frequently used to evaluate the effects of policies in education and elsewhere

Assume that there is a series of longitudinal observations in locations (e.g., states) where a policy has been implemented at some time in some locations

Crudely, we estimate the effect of a policy by comparing

• the difference in outcome before and after the policy is implemented for individuals affected by the policy, compared to

• the difference for individuals unaffected by the policy

That is why it is a difference in differences estimator
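In its crudest form the estimator needs only four group means. The sketch below uses invented numbers purely to show the arithmetic; the group and period labels are placeholders.

```python
# Hypothetical mean outcomes by group and period
means = {
    ("policy", "before"): 50.0,
    ("policy", "after"): 58.0,
    ("no_policy", "before"): 47.0,
    ("no_policy", "after"): 51.0,
}

# Before-after change within each group
change_policy = means[("policy", "after")] - means[("policy", "before")]
change_no_policy = means[("no_policy", "after")] - means[("no_policy", "before")]

# Difference of the two differences
did_estimate = change_policy - change_no_policy
print(did_estimate)
```

The affected group improved by 8 and the unaffected group by 4, so the difference in differences estimate is 4. The unaffected group's change stands in for what would have happened to the affected group without the policy.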

Differences in Differences

More elaborate (and convincing) analyses control for location and time or model variation as random effects

Let Yist be the outcome for individual i in location s at time t

Let Xist be the corresponding individual level covariates

Then the model might be

Yist = αs + πt + γXist + βTst + εist

where αs and πt are location and time fixed effects, γ is a vector of covariate effects, Tst is a dummy variable for treatment, and εist is a residual

There may be clustering by location, which needs to be taken into account
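A rough sketch of fitting this kind of model by least squares with location and time dummies follows. The data are simulated, and the number of locations and periods, the adoption pattern, and the values of β and γ are all hypothetical. The sketch reports only the point estimate of β. It uses a single covariate and makes no attempt at standard errors, so it does not address the clustering issue.

```python
import numpy as np

rng = np.random.default_rng(4)
S, T, m = 10, 8, 200            # locations, time periods, individuals per cell
beta, gamma = 1.5, 0.7          # hypothetical policy and covariate effects

s_idx = np.repeat(np.arange(S), T * m)          # location of each individual
t_idx = np.tile(np.repeat(np.arange(T), m), S)  # time period of each individual
alpha_s = rng.normal(size=S)                    # location fixed effects
pi_t = rng.normal(size=T)                       # time fixed effects

# Policy adopted in locations 0-4 from period 4 onward
treat = ((s_idx < 5) & (t_idx >= 4)).astype(float)
x = rng.normal(size=S * T * m)
y = (alpha_s[s_idx] + pi_t[t_idx] + gamma * x + beta * treat
     + rng.normal(size=S * T * m))

# Dummies for every location and for all but the first period
design = np.column_stack([
    np.eye(S)[s_idx],
    np.eye(T)[t_idx][:, 1:],
    x,
    treat,
])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_hat = coef[-1]             # coefficient on the treatment dummy
print(beta_hat)
```

Dropping one time dummy avoids exact collinearity with the full set of location dummies. The coefficient on the treatment dummy is then identified from within-location changes relative to the common time pattern.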

Differences in Differences

Obviously the difference in differences estimator has great appeal

Given a good longitudinal data set, it is easy to use

It is simple to understand and explain to policy makers

It is a natural analysis to learn from “natural experiments” where a policy has been tried some place and not others or has been tried at different times in different locations

Differences in Differences

This model may seem hard to formulate in causal model terms

The treatment effect is identified by the difference between post-policy and pre-policy outcomes, in the treatment (got policy) group versus the control group

Let $r_i^0$ and $r_i^1$ be the possible outcomes after treatment and X be the pretreatment variable

This estimate is estimating

$\mathrm{E}\{r^1 - X \mid Z = 1\} - \mathrm{E}\{r^0 - X \mid Z = 0\}$, not $\mathrm{E}\{r^1\} - \mathrm{E}\{r^0\}$

It can estimate the average causal effect under several circumstances

Differences in Differences

This estimate is estimating

$\mathrm{E}\{r^1 - X \mid Z = 1\} - \mathrm{E}\{r^0 - X \mid Z = 0\}$

It can estimate the average causal effect under some circumstances

For example, if the response functions are

$r_i^0 = \alpha_i + x_i + \varepsilon_i^0$

$r_i^1 = \alpha_i + \tau + x_i + \varepsilon_i^1$

and $\varepsilon_i^0$ and $\varepsilon_i^1$ are independent of xi, then the difference in differences estimate does estimate τ, the average causal effect of treatment

What Can Go Wrong?

One big problem

Z can be correlated with (r0 – X , r1 – X)

• X can cause both the policy and be correlated with outcome

• Something else can cause both X and Z

• This is the general endogeneity problem

What Can Go Wrong?

Informal checks

• Look at trends beyond the time of policy implementation

• Estimate effects of treatment where there is no policy change as a check (you should see no effect)

These are suggestive not definitive

They can invalidate an analysis, not validate one

What Can Go Wrong?

One smaller problem

The data often exhibit large autocorrelations, and this can lead to large underestimates of standard errors, making tests reject (far) too often

There are three reasons for this:

• Data are often based on long time series

• Data are highly positively correlated over time

• The treatment variable does not change much

What Can Go Wrong?

The standard error problem is difficult to solve

Parametric analysis (generalized least squares with autocorrelation) can be done, but inference for autocorrelation is poor

Randomization tests seem to perform well for problems like these

Collapsing the data into two time periods is sometimes useful and improves performance of tests
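The last two ideas can be combined in a short sketch: collapse each location's series into one pre-period mean and one post-period mean, then run a randomization test that re-labels which locations count as treated. Everything here is an illustrative assumption, not a procedure taken from the slides: the AR(1) error process, the effect size, the number of locations, and the use of all possible re-labelings of 5 treated locations out of 10.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
S, T, rho = 10, 20, 0.8         # locations, periods, autocorrelation (hypothetical)
effect = 2.0
treated = (0, 1, 2, 3, 4)       # locations that adopt the policy at period T // 2

# Location-level series with AR(1) errors
y = np.zeros((S, T))
for s in range(S):
    e = 0.0
    for t in range(T):
        e = rho * e + rng.normal(scale=0.3)
        y[s, t] = e + (effect if s in treated and t >= T // 2 else 0.0)

# Collapse each location to one pre-period and one post-period mean
change = y[:, T // 2:].mean(axis=1) - y[:, :T // 2].mean(axis=1)

def did(group):
    """Mean change in `group` minus mean change in the remaining locations."""
    rest = [s for s in range(S) if s not in group]
    return change[list(group)].mean() - change[rest].mean()

observed = did(treated)

# Re-label which 5 locations count as "treated" in every possible way
null_stats = [did(g) for g in combinations(range(S), 5)]
p_value = np.mean([abs(v) >= abs(observed) - 1e-12 for v in null_stats])
print(observed, p_value)
```

After collapsing, each location contributes a single number, so serial correlation within a location no longer inflates the apparent sample size. The reference distribution comes from the re-labelings themselves rather than from a parametric model of the autocorrelation.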

Conclusion

Without randomization, causal inference is much harder and more model dependent

References

Abadie, A. (2000). Semiparametric Difference-in-Differences Estimators. Working Paper, Kennedy School of Government, Harvard University.

Bertrand, M., Duflo, E., & Mullainathan, S. (2001). How much should we trust difference in differences estimators? MIT Department of Economics Working Paper Series 01-34.

Meyer, B. (1995). Natural and Quasi-Natural Experiments in Economics. Journal of Business and Economic Statistics, 13, 151-162.

Moulton, B. R. (1990). An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables in Micro Units. Review of Economics and Statistics, 72, 334-338.

Newey, W., & West, K. D. (1987). A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica, 55, 703-708.

Nickell, S. (1981). Biases in Dynamic Models with Fixed Effects. Econometrica, 49, 1417-1426.

Rosenbaum, P. (1993). Hodges-Lehmann Point Estimates of Treatment Effect in Observational Studies. Journal of the American Statistical Association, 88, 1250-1253.

Rosenbaum, P. (1996). Observational Studies and Nonrandomized Experiments. In S. Ghosh and C. R. Rao (Eds.), Handbook of Statistics, 13.

Solon, G. (1984). Estimating Auto-correlations in Fixed-Effects Models. NBER Technical Working Paper No. 32.
