KING, Gary and ZENG, Langche. When Can History Be Our Guide? Counterfactual Inference. Political Methodology.
Political Methodology
Committee on Concepts and Methods Working Paper Series
No. 1, April 2005

When Can History be Our Guide?
The Pitfalls of Counterfactual Inference

Gary King
Harvard University

Langche Zeng
George Washington University

C&M: The Committee on Concepts and Methods, www.concepts-methods.org
IPSA: International Political Science Association, www.ipsa.ca
CIDE: Centro de Investigación y Docencia Económicas, www.cide.edu
Editor
Andreas Schedler (CIDE, Mexico City)

Editorial Board
José Antonio Cheibub, Yale University
Michael Coppedge, University of Notre Dame
John Gerring, Boston University
George J. Graham, Vanderbilt University
Russell Hardin, New York University
Evelyne Huber, University of North Carolina at Chapel Hill
James Johnson, University of Rochester
Gary King, Harvard University
Bernhard Kittel, University of Amsterdam
Gerardo L. Munck, University of Southern California, Los Angeles
Frederic C. Schaffer, Massachusetts Institute of Technology
Kathleen Thelen, Northwestern University
The C&M working paper series are published by the Committee on Concepts and Methods (C&M), the Research Committee No. 1 of the International Political Science Association (IPSA), hosted at the Center for Economic Research and Teaching (CIDE) in Mexico City. C&M working papers are meant to share work in progress in a timely way before formal publication. Authors bear full responsibility for the content of their contributions. All rights reserved.

The Committee on Concepts and Methods (C&M) promotes conceptual and methodological discussion in political science. It provides a forum of debate between methodological schools who otherwise tend to conduct their deliberations on separate tables. It publishes two series of working papers: Political Concepts and Political Methodology.

Political Concepts contains work of excellence on political concepts and political language. It seeks to include innovative contributions to concept analysis, language usage, concept operationalization, and measurement.

Political Methodology contains work of excellence on methods and methodology in the study of politics. It invites innovative work on fundamental questions of research design, the construction and evaluation of empirical evidence, theory building and theory testing. The series welcomes, and hopes to foster, contributions that cut across conventional methodological divides, as between quantitative and qualitative methods, or between interpretative and observational approaches.

Submissions. All papers are subject to review by either a member of the Editorial Board or an external reviewer. Authors interested in including their work in the C&M Series may seek initial endorsement by one editorial board member. Alternatively, they may send their paper to [email protected].

The C&M webpage offers full access to past working papers. It also permits readers to comment on the papers.

www.concepts-methods.org
1 Thanks to Jim Alt, Scott Ashworth, Neal Beck, Jack Goldstone, Sander Greenland, Orit Kedar, Walter Mebane, Maurizio Pisati, Kevin Quinn, Jas Sekhon, and Simon Jackman for helpful discussions; Michael Doyle and Nicholas Sambanis for their data and replication information; and the National Institutes of Aging (P01 AG17625-01), the National Science Foundation (SES-0318275, IIS-9874747), and the Weatherhead Initiative for research support. Software to implement the methods in this paper is available from http://GKing.Harvard.Edu.

2 David Florence Professor of Government, Harvard University (Center for Basic Research in the Social Sciences, 34 Kirkland Street, Harvard University, Cambridge MA 02138; http://GKing.Harvard.Edu, [email protected], (617) 495-2027).

3 Associate Professor of Political Science, George Washington University, 2201 G Street NW, Washington, DC 20052, [email protected].
Abstract

Inferences about counterfactuals are essential for prediction, answering "what if" questions, and estimating causal effects. However, when the counterfactuals posed are too far from the data at hand, conclusions drawn from well-specified statistical analyses become based on speculation and convenient but indefensible model assumptions rather than empirical evidence. Unfortunately, standard statistical approaches assume the veracity of the model rather than revealing the degree of model dependence, and so this problem can be hard to detect. We develop easy-to-apply methods to evaluate counterfactuals that do not require sensitivity testing over specified classes of models. If an analysis fails the tests we offer, then we know that substantive results are sensitive to at least some modeling choices that are not based on genuine prior knowledge or empirical evidence. We use these methods to evaluate the extensive scholarly literatures on the effects of changes in the degree of democracy in a country (on any dependent variable) and separate analyses of the effects of UN peacebuilding efforts, and find evidence that many scholars are inadvertently drawing conclusions based more on modeling hypotheses than on their empirical evidence. For some research questions, history contains insufficient information to be our guide.
1 Introduction
Social science is about making inferences using facts we know to learn about facts we do not know. Some inferential targets (the facts we do not know) are factual, which means that they exist even if we do not know them. In early 2003, many wanted to know whether Saddam Hussein survived the war in Iraq. Although the U.S. military has had considerable difficulty determining the answer, he was either alive or dead. In contrast, other inferential targets are counterfactual, and thus do not exist, at least not yet. Counterfactual inference is crucial for studying "what if" questions, such as whether the Americans and British would have invaded Iraq if the 9/11/2001 attack on the World Trade Center had not occurred. Counterfactuals are also crucial for making forecasts, such as whether there will be peace in the Mideast in the next two years, since the quantity of interest is not knowable at the time of the forecast, although it will eventually become known (i.e., factual). Counterfactuals are essential as well in making causal inferences, since causal effects are differences between factual and counterfactual inferences: for example, how much more international trade would Syria have engaged in during 2003 if the Iraqi war had been averted?

Counterfactual inference has thus been a central topic of methodological discussion in political science (Fearon, 1991; Thorson and Sylvan, 1982; Tetlock and Belkin, 1996; Tetlock and Lebow), psychology (Tetlock, 1999; Tetlock et al., 2000), history (Murphy, 1969; Gould, 1969; Dozois and Schmidt, 1998; Tally, 2000), philosophy (Lewis, 1973; Kvart, 1986), computer science (Pearl, 2000), statistics (Rubin, 1974; Holland, 1986), and other disciplines. As scholars have long recognized, however, some counterfactuals are more amenable to empirical analysis than others. In particular, some counterfactuals are more strained, farther from the data, or otherwise unrealistic.
For example, with a sample of U.S. time series data, we could reasonably ask how much U.S. presidential approval would drop if inflation increased by two percentage points, and we could generate a fairly certain answer. We should not expect to get a valid answer from the same data, given any model, if we asked how much approval would drop if inflation increased by 200 percentage points. Remarkably, however, whatever statistical model we used to compute the first counterfactual inference could also be used to compute the second. Our confidence interval for the second inference would be somewhat wider than the first, but the second inference would be far more uncertain than the confidence interval indicates. The confidence interval is not wrong: if the other assumptions of the model are correct, it accurately portrays the uncertainties, conditional on the model. The problem is that we have little reason to believe the model when it veers so far from the data. In other words, the second inference is far more model dependent than the first. Thus, for example, if the model used were a linear regression, the 2% inference would not depend very heavily on the exact linearity assumption as long as the model fits the observed data reasonably well and the underlying true conditional expectation function is reasonably smooth. However, the 200% inference is so far from American historical experience that any conclusion will be very sensitive to most features of the model, no matter how good the fit of the model to the data. The linearity assumption in this context thus looms very large. Extrapolate using a slightly quadratic function instead of a linear one, even one that fits the in-sample data only slightly differently, and inferences will differ dramatically, even if little evidence exists with which to distinguish between the two models.
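The point can be illustrated with a small simulation (a sketch with invented data, not the authors' presidential approval series): a linear and a slightly quadratic fit that are nearly indistinguishable on the observed range can disagree wildly once the counterfactual moves far outside it.

```python
import numpy as np

# Invented "historical" data: inflation varies over a modest range,
# approval responds with mild curvature plus noise.
rng = np.random.default_rng(0)
inflation = rng.uniform(0, 6, 200)
approval = 60 - 2 * inflation - 0.05 * inflation**2 + rng.normal(0, 2, 200)

# Fit a linear model and a slightly quadratic one to the same data.
linear = np.polyfit(inflation, approval, 1)
quadratic = np.polyfit(inflation, approval, 2)

# A modest counterfactual (2 points) vs. an extreme one (200 points):
for x in (2, 200):
    print(x, np.polyval(linear, x), np.polyval(quadratic, x))
```

At 2 points the two predictions nearly coincide; at 200 points they diverge enormously, even though nothing in the data distinguishes the two models.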
But how can we tell how model dependent our inferences are when the counterfactual is not so obviously extreme, or when it involves more than one explanatory variable? The answer to this question cannot come from any of the model-based quantities we normally compute and our statistics programs typically report, such as standard errors, confidence intervals, coefficients, likelihood ratios, predicted values, test statistics, first differences, p-values, etc. To understand how far from the facts our counterfactual inferences are, and thus how model dependent our inferences are, we need to look elsewhere. At present, scholars study model dependence primarily via sensitivity analyses: changing the model and assessing how much our conclusions change. If the changes are substantively large for models in a particular class, then inferences are deemed model dependent. If the class of models examined are all a priori reasonable, and conclusions change a lot as the models within the class change, then the analyst may conclude that the data contain little or no information about the counterfactual question at hand. This is a fine approach, but it is insufficient in circumstances where the class of possible models cannot be easily formalized and identified, or where the models within a particular class cannot feasibly be enumerated and run, i.e., most of the time. In practice, the class of models chosen are those that are convenient, such as those with different control variables under the same functional form. The identified class of models normally excludes at least some models that have a reasonable probability of returning different substantive conclusions. Most often, this approach is skipped entirely. Our analysis provides several easy-to-apply methods that reveal the degree of model dependence without requiring one to run all the models. As a consequence, the method applies for the class of nearly all models, whether or not they are formalized, enumerated, and run, and for the class of all possible dependent variables,
conditional only on the choice of a set of explanatory variables. If an analysis fails our tests, then we know it will fail a sensitivity test too, but without being in the impossible position of having to run all possible models to find out.
Our field includes many discussions of the problem of strained counterfactuals in qualitative research. For example, Lebow (2000) and Fearon (1991) distinguish between "miracle" and "plausible" counterfactuals and offer qualitative ways of judging the difference. Tetlock and Belkin (1996: ch. 1) also discuss criteria for judging counterfactuals (of which "historical consistency" may be of most relevance to our analysis). Although this may well be the most serious problem confronting quantitative comparative politics and international relations, quantitative empirical scholarship rarely addresses the issue. Yet, it is hard to think of many quantitative analyses in comparative politics and international relations in recent years that do not stop to interpret their results by asking what happens, for example, to the probability of conflict if all control variables are set to their means and the key causal variable is changed from its 25th to its 75th percentile value (King, Tomz and Wittenberg, 2000). Every one of these analyses is making a counterfactual prediction, and every one needs to be evaluated by the same ideas well known in qualitative research. In this paper, we provide quantitative measures of these and related criteria that are meant to complement the ideas for qualitative research discussed by many authors.
We offer two empirical examples. The first evaluates inferences in the scholarly literatures on the effects of democracy. These effects (on any of the dependent variables used in the literature) have long been among the most studied questions in comparative politics and international relations. Our results demonstrate that numerous scholars in these literatures are asking counterfactual questions that are far from their data, and are therefore inadvertently drawing conclusions about the effects of democracy based on indefensible model assumptions rather than empirical evidence. The results also show that almost all analyses about democracy include at least some counterfactuals with little empirical support.

Whereas this example about democracy applies approximately to a large array of prior work, we also introduce an example that applies exactly to one groundbreaking study on designing appropriate peacebuilding strategies (Doyle and Sambanis, 2000). We replicate this work and, applying our methods to these data, we find that the central causal inference in the study involves counterfactuals that are too far from observed data to draw reliable inferences with the authors' methods or any others. As a result, the data do not contain sufficient information to answer the questions asked, and the conclusions offered in the study are thus based primarily on modeling assumptions rather than on empirical evidence.1
Section 2 shows more specifically how to identify questions about the future and "what if" scenarios that cannot be answered well in given data sets. This section introduces several new approaches for assessing how based in factual evidence a given counterfactual is. Section 3 provides a new decomposition of the bias in estimating causal effects using observational data that is more suited to the problems most prevalent in political science. This decomposition enables us to identify causal questions without good causal answers in given data sets, and shows how to narrow these questions, in some cases, to those that can be answered more decisively. We use each of our methods to evaluate counterfactuals regarding the effects of democracy and UN peacekeeping. Section 4 concludes.
2 Forecasts and What If Questions
Although statistical technology sometimes differs for making forecasts and estimating the answers to "what if" questions (e.g., Gelman and King, 1994), the logic is sufficiently similar that we consider them together. Although our suggestions are general, we use aspects of the international conflict literature as a running example to fix ideas. Thus, let Y, our outcome variable, denote the degree of conflict initiated by a country, and X denote a vector of explanatory variables, including measures such as GDP and democracy.2 In regression-type models, including least squares, logit, probit, event counts, duration models, and most others used in the social sciences, we usually compute forecasts and answers to "what if" questions using the model-based conditional expected value of Y given a chosen vector of values x of the explanatory variables, X.

The model typically includes a specification for (i.e., an assumption about) the conditional expectation function (CEF):

E(Y|X) = g(X, β),   (1)

where g(·) is some specified parametric functional form and β is a vector of parameters to be estimated. To make a forecast, we plug the specified vector of values x, and the point estimate of β, which we denote β̂, into this CEF and compute the estimated CEF or predicted value:

Ê(Y|x) = g(x, β̂).   (2)

The estimated CEF for the familiar linear regression model, for example, is xβ̂ = β̂0 + β̂1x1 + ··· + β̂kxk; for the logit model it is [1 + e^(−xβ̂)]^(−1); for the exponential duration model it is e^(xβ̂); count models are usually specified with e^(xβ̂), etc.3
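As a concrete sketch, evaluating any of these estimated CEFs is just a function evaluation at the chosen x. The coefficient values and covariates below are hypothetical, not estimates from any model discussed here:

```python
import numpy as np

def linear_cef(x, beta):
    """Linear regression: E(Y|x) = x'beta (x includes a leading 1)."""
    return x @ beta

def logit_cef(x, beta):
    """Logit: E(Y|x) = [1 + exp(-x'beta)]^(-1)."""
    return 1.0 / (1.0 + np.exp(-(x @ beta)))

def exponential_cef(x, beta):
    """Exponential duration or count model: E(Y|x) = exp(x'beta)."""
    return np.exp(x @ beta)

beta_hat = np.array([0.5, -1.2, 0.8])  # hypothetical point estimates
x = np.array([1.0, 0.3, 2.0])          # constant, plus two covariate values

print(linear_cef(x, beta_hat))       # x'beta
print(logit_cef(x, beta_hat))        # always in (0, 1)
print(exponential_cef(x, beta_hat))  # always positive
```

Note that nothing in these functions restricts x to values resembling the observed data, which is exactly the problem taken up next.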
Interestingly, each of these CEFs can be computed for any (real) values of x. The model never complains, and exactly the same calculation can be applied for any value of x. However, even if the model fits the data we have in our sample well, a vector x far from any rows in the matrix X is not likely to produce accurate forecasts. If a linear model indicates that one more year of education will earn you an extra $1,000 in annual income, the model also implies that 10 more years of education will get you $10,000 in extra annual income. In fact, it also says, with as straight a face as a statistical model ever offers, that 50 years more of education will raise your salary by $50,000. Even though no statistical assumption may be violated as a result of your choice of any set of real numbers for x, the model obviously produces better forecasts (and "what if" evaluations) for some values of x than others. Predictive confidence intervals for forecasts farther from the data are usually larger, but confidence intervals computed in the usual way still assume the veracity of the model no matter how far the counterfactual is from the data.

Worrying about model choice may be good in general, but it will not help here. Other models will not do verifiably better with the same data; one cannot determine from the evidence which model is more appropriate. So searching for a better model, without better data, better theory, or a different counterfactual question, is in this case simply futile. We merely need to recognize that some questions cannot be answered from some data sets. Our linearity (or other functional form) assumptions are written globally, for any value of x, but in fact are relevant only locally, in or near our observed data. In this paper, we seek to understand where "local" ends and "global" begins. For forecasting and analyzing "what if" questions, our task comes down to seeing how far x is from the observed X.

Clearly the best solution to these problems is scholarly judgment, but such judgments have failed us in the literatures on democracy, examples of which we provide below, and in many other areas. Nonetheless, our goal is to offer some tools to aid human judgment, not to replace it. Judgment, of course, is not a statistical issue, and so we do not always have the luxury of appealing to statistical theory to derive our methods.

We now offer several procedures that can be used to assess whether a question posed can be reliably answered from any statistical model. We will not assume knowledge of the functional form, model, estimator, or dependent variable, but we do assume that the CEF is fairly smooth (a concept we define formally below). Such is not always the case, but the idea comports with most statistical models and theoretical processes in the discipline.
2.1 Interpolation vs. Extrapolation
A key distinction in assessing the counterfactual question x is whether answering it by computing E(Y|x) would involve interpolation or extrapolation (e.g., Kuo, 2001; Hastie et al., 2001). Assuming some minimal smoothness in the CEF in the region near the data implies that the data will in general contain more information about counterfactual questions that require interpolation than about those that require extrapolation. Hence, answering a question involving extrapolation is normally far more model dependent than answering one involving interpolation.
For intuition, imagine we have observed data on income for those with high school degrees and for those with four-year college degrees, and we wish to estimate what income someone with a two-year associate's degree would have. This is a simple counterfactual "what if" question because we have no data on people with two-year degrees. The interpolation task, then, is to draw some curve from expected income among high school graduates to expected income among four-year college graduates. Without any assumptions, this curve could go anywhere, and our inferred income for those with two-year degrees would not be constrained at all. Imposing the assumption that the CEF is smooth (i.e., that it contains no sharp changes of direction and that it does not bend too fast or too many times between the two end points, a concept we formalize in Appendix B) is quite reasonable for this example and indeed for most political science processes. The consequence of this smoothness assumption is to narrow greatly the range of income into which the interpolated value can fall. Even if it is higher than income for four-year college graduates or lower than that for high school graduates, it probably won't be too much outside this range. However, now suppose we observe the same data but need to extrapolate to the income for those with master's degrees. We could impose some smoothness again, but even allowing one bend in the curve could make the extrapolation change a lot more than the interpolation. One way to look at this is that the same level of smoothness (say, the number of changes of direction allowed) constrains the interpolated value more than the extrapolated value, since for interpolation any change in direction must be accompanied by a change back to intersect the other observed point. With extrapolation, one change need not be matched with a change in the opposite direction, since there exists no observed point on the other side of the counterfactual being estimated.

Thus, if we learn that a counterfactual question involves extrapolation, we still might wish to proceed if the question is sufficiently important, but we would be aware of how much more model dependent our answers will be. How to determine whether a question involves extrapolation with one variable
should now be obvious; fortunately, doing it for more than one requires only one additional concept. Indeed, our main message in this section is that, mathematically, questions that involve interpolation are values of the vector x that fall in the convex hull of X, and with this definition (and our explanations below) researchers can easily check whether a particular question requires extrapolation.4
Formally, the convex hull of a set of points is the smallest convex set that contains them.5 This is easiest to see graphically, such as via the example in Figure 1 for one explanatory variable (on the left) and for two (on the right), given simulated data. The small vertical lines in the left graph denote data points on the one explanatory variable in that example. The convex hull for one variable is marked by the maximum and minimum data points: any counterfactual question between those points requires interpolation; points outside involve extrapolation. (The left graph also includes a nonparametric density estimate, a smooth version of a histogram, that gives another view of the same data.) For two explanatory variables, the convex hull is given by a polygon with extreme data points as vertices, such that for any two points in the polygon, all points on the line connecting them are also in the polygon (i.e., the polygon is a convex set). A counterfactual question x that appears outside the polygon requires extrapolation. Anything inside involves interpolation.

Although Figure 1 only portrays convex hulls for one and two explanatory variables, the concept is well defined for any number of dimensions. For three explanatory variables, and thus three dimensions, the convex hull could be found by "shrink wrapping" the fixed points in three-dimensional space. The shrink-wrapped surface encloses counterfactual questions requiring interpolation. For four or more explanatory variables, the convex hull is more difficult to visualize, but we can still easily assess whether a point lies within the hull based on the mathematical properties of the hull. Appendix A shows how to do this by solving a standard linear programming problem, meaning that generally available linear programming software can be employed to solve the problem quickly.
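The check can be sketched as follows, using the standard linear-programming feasibility formulation (a minimal version, not necessarily Appendix A's exact setup): x lies in the convex hull of the rows of X exactly when some convex combination of those rows equals x.

```python
import numpy as np
from scipy.optimize import linprog

def requires_extrapolation(x, X):
    """Return True if counterfactual x lies outside the convex hull of the
    rows of X, i.e., if no weights w >= 0 with sum(w) = 1 satisfy X'w = x."""
    n = X.shape[0]
    A_eq = np.vstack([X.T, np.ones(n)])  # stack X'w = x and 1'w = 1
    b_eq = np.append(x, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return not res.success  # infeasible program => outside the hull

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])      # three data points
print(requires_extrapolation(np.array([0.2, 0.2]), X))  # False: interpolation
print(requires_extrapolation(np.array([2.0, 2.0]), X))  # True: extrapolation
```

Because only feasibility matters, the objective vector is all zeros; any off-the-shelf LP solver can answer the question quickly, even with many explanatory variables.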
If we make no assumptions about the true CEF, then we can learn nothing about counterfactuals other than those that are coincident with the data points X (the "factuals"). Counterfactual inference thus requires some type of additional assumption, which almost always involves a version of smoothness in the CEF. We shall assume only a minimal degree of smoothness, but with it we are able to show, in Appendix B, the informational advantages of interpolation over extrapolation.

Outliers in X would make this extrapolation-detection method inefficient. That is, some questions x would be identified as requiring interpolation even though they would really involve extrapolation.
However, if this method identifies a question as requiring extrapolation, then the suspicion of outliers in X would only make the finding more solid. Of course, outliers need to be identified before, or as part of, any estimation strategy, or the model applied will not be reasonably estimated, even if the counterfactual question asked is close to the data. As such, the problem of outliers is not unique to the issue at hand.

Similarly, if specific forms of nonlinearity and interactions (such as squared terms or products of existing variables) are known to be present in the CEF, explicitly using them as additional input variables will result in a CEF that is a smoother function of the larger set of inputs. As a result, counterfactual inference within the convex hull of this larger data matrix X will be more accurate, and so the results of the test we propose will be more informative about the approximation error in making interpolations.6 However, researchers should not include these extra terms in their input data X unless they know they belong in the CEF: putting them in when they do not belong could cause one to conclude incorrectly that a counterfactual requires extrapolation.
2.2 How Far Is The Counterfactual from the Data?
While the informational advantage for interpolation holds in general, if we are willing to assume that the CEF is smooth in the neighborhood of the observed data, then points just outside the convex hull are arguably less of an extrapolation than those farther outside. Another related issue is that it is (empirically rarely but theoretically) possible for a point just inside the extrapolation region (i.e., just outside the convex hull of X) to be closer to a large number of data points than one inside the hull that occupies a large empty region away from most of the data. These qualifications, as well as the need to evaluate counterfactuals within the interpolation region, point to the need for other methods that could be used simultaneously, a problem to which we now turn. Thus, in addition to assessing whether a counterfactual question requires interpolation or extrapolation, we also examine how far a counterfactual is from the data via a distance or closeness measure.

Our goal is some measure of the fraction of observations (rows) in X nearby the counterfactual x. To create this measure, we begin with a measure of distance between two points (or rows) xi and xj based on Gower's (1966, 1971) nonparametric measure. It is defined simply as the average absolute distance between the elements of the two points, divided by the range of the data:

G2ij = (1/K) Σ_{k=1}^{K} |xik − xjk| / rk,   (3)

where the range is rk = max(X.k) − min(X.k), and the min and max functions return the smallest and
largest elements, respectively, in the set including the kth element of the explanatory variables X. Thus, the elements of the measure are normalized for each variable to range between zero and one, and then averaged. The measure is designed to apply to all types of variables, including both quantitative and qualitative data.7 Since x may be outside the convex hull of X, our modified version of G2 may range anywhere from zero on up. Thus, if G2 = 0, then x and the row in question of X are identical, and the larger G2ij, the more different the two rows are. (If G2 > 1, then x lies outside the convex hull of X, but the reverse does not necessarily hold.) We interpret G2 as the distance between the two points as a proportion of the distance across the data, X. So G2 = 0.3 means that to get from one point to the other, one needs to go 30% of the way across the data set.
With G2 applied to our problem, we need to summarize n numbers: the distances between x and each row in X. If space permits, we suggest presenting a cumulative frequency plot of G2: vertically, the fraction of rows in X with G2 values less than the given value on the horizontal axis. If space is short, such as when many counterfactuals need to be evaluated, any fixed point on this graph could be used as a one-number summary. For example, we could report the fraction of rows in the data with G2 values less than, say, 10% (or some other number, such as one based on the generalized variance of X).8 We interpret the resulting measure as the fraction of the observed data nearby the counterfactual.
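Both the distance in equation (3) and the one-number summary can be sketched in a few lines. This minimal version handles quantitative variables only (Gower's full measure also accommodates qualitative ones), and the data and 10% cutoff are illustrative:

```python
import numpy as np

def gower_g2(x_star, X):
    """G2 between a counterfactual x_star and each row of X: absolute
    elementwise differences, scaled by each variable's range in X,
    averaged over the K variables (equation 3)."""
    ranges = X.max(axis=0) - X.min(axis=0)
    return (np.abs(X - x_star) / ranges).mean(axis=1)

def fraction_nearby(x_star, X, cutoff=0.10):
    """One-number summary: fraction of rows of X with G2 below `cutoff`."""
    return float((gower_g2(x_star, X) < cutoff).mean())

# Tiny illustration: three observations on two explanatory variables.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
x_star = np.array([1.0, 10.0])     # identical to the first row

print(gower_g2(x_star, X))         # [0.0, 0.5, 1.0]
print(fraction_nearby(x_star, X))  # one of three rows has G2 < 0.10
```

Because each variable is scaled by its own range, the measure is unit-free, which is what licenses the "30% of the way across the data set" reading above.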
2.3 Counterfactuals About Democracy
We now apply these methods of evaluating counterfactuals to address one of the most asked questions in political science: what is the effect of a democratic form of government (as compared to less democratic forms)? We study counterfactuals relating to the degree of democracy using data collected by the State Failure Task Force (Esty et al., 1998); see King and Zeng (2002) for an independent evaluation. These data are easily the best ever collected in this area, in part because the authors had considerable resources from the federal government to marshal for their data collection efforts, and so the usual scarcity of time, resources, expertise, etc., that affects most data collection efforts was not a constraint here. The only limitation on the types, and especially the combinations, of data the task force could collect was the world: that is, countries can be found with only a finite number of bundles of characteristics, and this constraint affects everyone studying counterfactuals about democracy, no matter what the dependent variable. Thus, to the extent that we find that certain counterfactual questions of interest are unanswerable, our point regarding problems in the literatures on the effects of democracy is all the firmer.
After extensive searches, Esty et al. (1998) used as explanatory variables trade openness (as a proxy for economic conditions and government effectiveness), the infant mortality rate, and democracy. Democracy is coded as two dummy variables representing autocracy, partial democracy, and full democracy. King and Zeng (2002) added to these the fraction of the population in the military, population density, and legislative effectiveness.

The task force's dependent variable is the onset of state failure, but our analyses apply to all dependent variables. What would happen if more of the world were democratic is a question that underlies much other work in comparative politics and international relations over the last half century, as well as a good deal of American foreign policy. To see how widely our analyses apply, we began collecting other articles in the field that use a set of explanatory variables with a fair degree of overlap with the set used here, and stopped at twenty after searching only the last few years. The methods presented in this section would need to be repeated to draw more certain conclusions from each of these other articles, but the overlap was sufficient to infer that the results presented here will likely apply without modification to a large number of articles in our field.
We begin a description of our empirical analyses with four clear examples, the first two obviously extrapolations and the second two obviously interpolations, and then move to averages of many other cases of more substantive interest. Before turning to empirically reasonable counterfactuals, we begin with examples that are deliberately extreme. Extreme examples are of course useful for ensuring expository clarity, but they are also useful here since, although almost no serious researcher would expect the data to provide information about such counterfactuals if intentionally asked, almost all empirical analysts estimating the effects of democracy have implicitly asked precisely these questions. This is always the case when all observations are used in the estimation and causal effect evaluation, as is typical in the literature. So although the two examples we now introduce are obviously extreme, we show that many actually asked in the literature are in fact also quite extreme.
We begin by asking what would have happened if Canada in 1996 had become an autocracy, with its values on other variables remaining at their actual values. We find, as we would expect, that this extreme counterfactual is outside the convex hull of the observed data and therefore requires extrapolation. In other words, we can ask what would have happened if Canada had become autocratic in 1996, but we cannot use history as our guide, since the world (and therefore our data) includes no examples of autocracies that are similar enough to Canada on other measured characteristics. Similarly, if we ask what
would have happened if Saudi Arabia in 1996 had become a full democracy, we would also be required
to make an extrapolation, since it too falls outside the convex hull.
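The hull membership test used throughout this section can be posed as a small linear program: x lies in the convex hull of the rows of X exactly when some convex combination of those rows equals x. The sketch below is our own illustration, assuming SciPy is available; it is not the production code used for the results reported here:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, X):
    """Test whether x is a convex combination of the rows of X:
    find lambda >= 0 with sum(lambda) = 1 and X' lambda = x."""
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    n = X.shape[0]
    A_eq = np.vstack([X.T, np.ones((1, n))])      # stack X' lambda = x and sum-to-one
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.status == 0  # status 0: a feasible combination was found
```

Because only feasibility matters, the objective vector is all zeros; the solver either finds valid weights (interpolation is possible) or proves none exist (the counterfactual requires extrapolation).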
We now ask two counterfactual questions that are as obviously reasonable as the last two were unreasonable. Thus, we ask: what would have happened if Poland had become an autocracy in 1990 (i.e., just after it became a democracy)? From qualitative information available about Poland, this counterfactual is quite plausible, and many even thought (and worried) about it actually occurring at the time. Our analysis confirms the plausibility of this suspicion, since this question falls within the convex hull; analyzing it would require interpolation and thus not much model dependence. In other words, the world has included examples of autocracies that are like Poland in other respects, and so history can be our guide. Another reasonable counterfactual is to ask what would have happened had Hungary become a full democracy in 1989 (i.e., just before it actually did become a democracy). This question is also in the convex hull and would also require only interpolation to produce specific estimates.
We now further analyze these four counterfactual questions using our modified Gower distance measure. The question is how far the counterfactual x is from each row in the observed data set X, and so the distance measure applied to the entire data set gives n numbers. We summarize these numbers, without loss of information, in the cumulative frequency plots in Figure 2. The left plot includes counterfactuals that change autocracies to democracies, and the right plot is the reverse. The dark line in each graph refers to a counterfactual within the convex hull, and the dashed line is for a counterfactual outside the hull. Each line gives the cumulative distribution of our modified Gower distance measure. Take, for example, the value of G² (given horizontally) of 0.1 (an average distance of 10% of the way from the minimum to the maximum values on each variable in X).9

Essentially no real country-years are within 0.10 of this counterfactual for changing Saudi Arabia to a democracy, but about 25% of the data are within this distance for Hungary. Similarly, just a few observations in the data are within even 0.15 of Canada changing to an autocracy, although about a quarter of the country-years are within this distance for Poland. Of course, the full cumulative densities in the figure provide more information than this one point.
We now examine a larger set of counterfactuals all at once, with more convenient numerical summaries along the lines of our verbal description of Figure 2. We start with all variables set at their actual values and then ask what would happen to all autocracies if they became full democracies, and to all full democracies if they became autocracies. This analysis includes 5,814 country-years: 1,775
tracking full democracies and 4,039 representing autocracies. What we found was that only 28.4% of the country-years in this widely examined counterfactual fell within the convex hull of the observed data. This means that to analyze this counterfactual in practice, 71.6% of the country-years would require extrapolation and would thus be highly model dependent, regardless of the model applied or dependent variable analyzed. As Table 1 summarizes, the result is not symmetric: among the full democracies switched to autocracies, 53% require interpolation, whereas among the autocracies switched to full democracies, only 17% are interpolation problems. Unfortunately, little discussion in the literature reflects these facts.
The first few columns of Table 1 break down this average result for counterfactuals from three different regions. The rest of the table provides the fraction of countries within a modified Gower distance of about 0.1 of a counterfactual, averaged over all counterfactuals within a given region and type of change in democracy. For example, across the 4,039 country-years where we could hypothetically change autocracies to partial democracies, an average of only 4.2% of the data points are this close to the counterfactual.
The overall picture in this table is striking. Studying the effects of changes in democracy has been a major project within comparative politics and international relations for at least half a century. This table applies to almost every such analysis with democracy as an explanatory variable, in any field with the same or similar control variables, regardless of the choice of dependent variable. Some areas and counterfactuals are less strained than others, but the results here show that most inferences in these fields (or most countries within each analysis) are highly model dependent, based much more on unverifiable assumptions about the model than on empirical data. A large fraction are highly model-dependent extrapolations, but even those that are interpolations are fairly distant from the available data. This result varies by region and by counterfactual, and it would of course vary more if we changed the set of explanatory variables, but no matter how you look at it, the problem of reaching beyond one's necessarily limited data comes through in Table 1 with clarity.

Numerous interesting case studies could emerge from analyses like these. For example, public policy makers and the media spent considerable time debating what would happen if Haiti became more of a democracy. In the early to mid-1990s, we find that the counterfactual of moving Haiti from a partial to a full democracy was in the convex hull, and hence a question that had a chance of being answered accurately with the available data. By 1996, conditions had worsened in the country, and this counterfactual became more counter to the facts, moving well out of the hull and thus requiring extrapolation.
2.4 Counterfactuals About UN Peacekeeping
In the first quantitative analysis of the correlates of successful peacebuilding and of the contribution of UN operations to peacebuilding outcomes, Doyle and Sambanis (2000, p. 782) build and analyze a data set on 124 post-World War II civil wars. They characterize their results as providing "broad guidelines for designing the appropriate peacebuilding strategy" (p. 779). This is clearly a pathbreaking work about a deeply important question of public policy, and it opens up a whole new area of analysis for our field. Doyle and Sambanis were also unusually helpful and speedy in providing us data and documentation on their analyses. On the theory that no good deed goes unpunished, however, we demonstrate in this section that the conclusions of their article in large part rely on modeling assumptions rather than on the empirical data used in the analysis. However, our results should not be viewed as taking anything away from Doyle and Sambanis's achievement in asking previously unanalyzed questions and proposing a new quantitative framework for answering them. They should also obviously not be faulted for being unaware of the methods introduced in our article, coming as they do years after their article was published. We also have no reason to doubt the veracity of their hypotheses, only the available data used to support them.
We focus here on Doyle and Sambanis's "main concern[, which] is how international capacities, UN peace operations in particular, influence the probability of peacebuilding success" (2000, p. 783). We begin by replicating their key logistic regression model (numbered A8 in their Table 3, p. 790) and report the results in Table 2, marked "original model." Doyle and Sambanis report results as odds ratios, whereas we report the more traditional logit coefficients, but the replication is otherwise exact.
The theoretical justification for Doyle and Sambanis's logit specification comes from what they call their "interactive" model, which posits peacebuilding success as the result of interactions among the level of hostility, local capacities, and international capacities such as UN involvement. We could study any aspect of their statistical specification, but for illustration we focus on the effect of most interest to the authors. This is the effect of multidimensional UN peacekeeping operations, which include missions with extensive civilian functions, including economic reconstruction, institutional reform, and election oversight, and which they find are "extremely significant and positively associated with peacebuilding" (p. 791). In their original model in Table 2, this is their UNOP4 variable, which is dichotomous (one for UN peacekeeping operations and zero otherwise). Clearly the result is large
for a logit coefficient, and considerably larger than its standard error. When translated into an odds ratio, which is Doyle and Sambanis's preferred form, this coefficient means that the odds of peacebuilding success with a UN peacekeeping operation are 23 times larger than with no such operation, holding constant a list of potential confounding control variables. This is certainly a striking result, and one we return to in Section 3.3 when we discuss causal inferences.
In this section, we examine the counterfactuals of interest. Assessing the causal effects of UN peacekeeping operations implicitly involves asking: in countries with UN involvement, how much peacebuilding success would we have witnessed if the UN had not gotten involved? Similarly, in civil wars without UN involvement, how much success would there have been if the UN had gotten involved? In other words, the authors are interested in predictions with the dichotomous UNOP4 variable set to 1 − UNOP4, which means they have one counterfactual for each observation. We studied this with the methods of Section 2.2. To begin with, we checked how many counterfactuals are in the convex hull of the observed data. We found none. That is, every single counterfactual in the entire data set is a risky extrapolation rather than what would have been a comparatively safer interpolation. We also computed the Gower distance and found that very few of the counterfactuals were near much of the data within the convex hull. For example, we computed that, across all counterfactuals, an average of only 1.3% of data points were within 10% of the distance across the data set (G² of 0.1 or less). Thus, not only are the counterfactuals all extrapolations, but they do not lie just outside the convex hull and are instead mostly fairly extreme extrapolations beyond the data. These results strongly indicate that the available data used in Doyle and Sambanis's study contain little information to answer the key causal questions the authors ask, and hence the conclusions reached there are based more on theory than on empirical evidence.
We now proceed to give relatively simple examples of the consequences of this result. We begin by making only one change in the logit specification: including a simple interaction between UNOP4 and the duration of the civil war, leaving the rest of the specification as is. (This is of course only one illustrative example, and not the only aspect of the specification sensitive to assumptions.) Including the interaction would seem highly consistent with the "interactive" theory put forward in the article, and so it would not seem possible to exclude it without empirical evaluation. Excluding this interaction, as the original specification does, is equivalent to assuming that the effect of UN peacekeeping operations is identical for wars of all lengths (except for the nonlinearities in the logit model). This may be true,
but it would be easy to develop reasons to the contrary that are not contradicted by known results in the literature.

The result of this new specification is given in the second set of columns in Table 2. The coefficient on the interaction is positive and clearly distinguishable from zero (the p-value is 0.001), and so there is some evidence, by the usual rules of thumb, to indicate that this model is to be preferred to the original one. The likelihoods of the two models indicate that the fits of the two models are highly similar, although they slightly favor our modified model.
We now offer the left graph in Figure 3, which plots predicted values from both models based on the actual values of UNOP4 and the other explanatory variables. This graph shows that almost all the predicted values from the two models fall on the 45-degree line, which marks identical predictions. Only six points fall markedly off the 45-degree line. Of these points, we marked with a circle the five points for which the model with the interaction fits the in-sample data better than the original model, and with a square the one point for which the original model fits better than the modified model. However one looks at the results in this graph, we conclude that the two models give extremely similar factual predictions.
We now turn to the counterfactuals of interest by setting UNOP4 to 1 − UNOP4, leaving all other variables at their observed values, computing the same predicted values, and redoing the same plot. This is the central question in the article: what happens if all the civil wars with UN intervention had not had UN intervention, and all the civil wars without UN intervention had had it? The logit model in the article estimates these counterfactual effects. The graph on the right in Figure 3 gives the results by again plotting the predictions from the original vs. the modified model. The result could hardly be more dramatic, with very few of the points anywhere near the 45-degree line. That is, for any probability from the original model on the vertical axis (say 0.5, for example), the probabilities from the modified model are spread horizontally over almost the entire range from 0 to 1. This remarkable result indicates that two models that differ only very slightly in their fit to the in-sample data are giving wildly different counterfactual predictions. Of course, these are precisely the results we would expect when counterfactual predictions are well outside the convex hull: although the two models we chose fit the factuals almost identically, their counterfactual predictions are completely different because the counterfactuals are far away from the data.
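The mechanism at work in Figure 3 is general and easy to reproduce. The stylized sketch below is our own construction (not the Doyle and Sambanis specification, and using ordinary polynomial fits rather than logit for simplicity): two specifications that fit the observed range of the data almost identically diverge once asked to predict far outside it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 2 * x + rng.normal(0, 0.1, 200)   # data generated from a linear model

# Two specifications that both fit the observed range well
linear = np.polyfit(x, y, 1)
cubic = np.polyfit(x, y, 3)

x_interp = 0.5   # inside the convex hull of the data
x_extrap = 5.0   # far outside it

# Disagreement between the two models at each point
gap_interp = abs(np.polyval(linear, x_interp) - np.polyval(cubic, x_interp))
gap_extrap = abs(np.polyval(linear, x_extrap) - np.polyval(cubic, x_extrap))
# The models nearly agree at x_interp but diverge sharply at x_extrap.
```

The data cannot adjudicate between the two specifications, yet their extrapolated predictions differ by orders of magnitude more than their interpolated ones: model dependence in miniature.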
We return to the analysis of these data, as well as the massively different substantive implications of these results, in Section 3.3.
3 Causal Effects
We now turn to counterfactual evaluation as part of causal inference in general, and about the effects of democracy and UN peacekeeping operations in particular. To fix ideas, we begin with a version of the democratic peace hypothesis, which states that democracies are less conflictual than nondemocracies. Our discussion generalizes to all possible dependent variables. Let D denote the treatment (or key causal) variable, where D = 1 denotes democracy and D = 0 denotes nondemocracy.10 The dependent variable is Y, the degree of conflict.
To define the causal effect of democracy on conflict, we denote Y1 as the degree of conflict that would be observed if the country were a democracy and Y0 as the degree of conflict if the country were a nondemocracy. Obviously, only one of Y0 and Y1 is observed for any one country at any given time, but not both, since (in our present simplified formulation) a country either is or is not a democracy. That is, we observe only Y = Y0(1 − D) + Y1D.
In principle, the democracy variable can have a different causal effect for every country in the sample. We can then define the causal effect of democracy by averaging over the whole world, or for the democracies and nondemocracies separately (or for any other subset of countries). For democracies, it is the so-called average causal effect among the treated, which we define as follows:

τ = E(Y1|D = 1) − E(Y0|D = 1) (4)
  = Factual − Counterfactual

We call the first term "factual" since Y1 is observable when D = 1, although the expected value still may need to be estimated. We refer to the second as "counterfactual" because Y0 (the degree of conflict that would be initiated by a country if it were not a democracy) is not observed and indeed is unobservable in democratic countries (D = 1). The causal effect for nondemocracies (D = 0) is directly analogous and also involves factual and counterfactual terms.
Although medical researchers are usually interested in τ, political scientists normally focus on the average causal effect for the entire set of observations,

θ = E(Y1) − E(Y0), (5)

where both terms have a counterfactual element, since each expectation is taken over all countries, but Y1 is only observed for democracies and Y0 only for nondemocracies. These definitions of causal effects are used in a wide variety of literatures (Rubin, 1974; Holland, 1986; Robins, 1999a, 1999b; King, Keohane, and Verba, 1994; Pearl, 2000).
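The notation can be made concrete with a short simulation. In the sketch below (our own toy construction, with names and the data generating process invented for illustration), both potential outcomes are generated, so the estimands are known exactly; in real data, of course, only one potential outcome per unit is ever observed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
y0 = rng.normal(5.0, 1.0, n)   # conflict if the country is a nondemocracy
y1 = y0 - 2.0                  # conflict if a democracy: true effect is -2
d = rng.integers(0, 2, n)      # randomly assigned "democracy" indicator

y = y0 * (1 - d) + y1 * d      # the only quantity we actually observe

# ATT: E(Y1|D=1) - E(Y0|D=1); computable here only because we simulated both
att = y1[d == 1].mean() - y0[d == 1].mean()

# Under random assignment, the observed difference in means recovers it
diff_means = y[d == 1].mean() - y[d == 0].mean()
```

With random assignment the difference in observed means is close to the true effect; the sections that follow are about what goes wrong when assignment is not random.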
A counterfactual x in this context therefore takes the form of some observed data with only one element changed; for example, Mexico with all its attributes fixed but with the regime type changed to autocracy. Fortunately, we can easily evaluate how reasonable it is to ask about this counterfactual in one's data with the methods already introduced in Section 2: by checking whether x falls in the convex hull of the observed X and computing the distance from x to X. In addition, since x has only one counterfactual element, we show below that we can also easily consult another criterion: whether x falls on the support of P(X).11
In most cases, the true causal effect, τ or θ, is unknown and needs to be estimated, often from observational data, since social experiments are costly or, in our case, infeasible. In Section 3.1, we discuss the sources of potential problems in using observational data to estimate causal effects. We focus on τ there for expository purposes, as is usual in the statistical literature, but, as our proofs in Appendix C show, our results hold for the effect on nondemocracies and the more relevant overall effect, θ, as well. Our empirical examples analyze the overall average causal effect, which is the usual parameter of interest in political science. In addition to illuminating sources of potential problems in causal inference, the decomposition of the estimation bias shows that inference involving counterfactuals not on the common support is a critical source of bias.

3.1 Decomposition of Causal Effect Estimation Bias
We begin with the simplest estimator of τ using observational data, the difference in means (or, equivalently, the coefficient on D from a regression of Y on D):

d = mean(Y|D = 1) − mean(Y|D = 0) = mean(Y1|D = 1) − mean(Y0|D = 0), (6)

where mean(a) = Σi ai/n (for any vector a with elements ai, i = 1, . . . , n). The first equality is the data-based analogue to (4), whereas the second recognizes that, for example, when D = 1, Y = Y1. To identify the sources of potential problems in using observational data for causal inference, we now present a new decomposition of the bias E(d − τ) of d as an estimator of the causal effect τ. This decomposition generalizes Heckman et al.'s (1998b) three-part decomposition. Their decomposition was applied to a simpler problem that does not adequately represent the full range of issues in causal inference in political science. Our new version helps to identify and understand the threats to causal inference in our discipline, as well as to focus in on where counterfactual inference is most at issue. Thus, we show that
bias ≡ E(d − τ)
     = E[mean(Y1|D = 1) − mean(Y0|D = 0) − τ]
     = [E(Y1|D = 1) − E(Y0|D = 0)] − [E(Y1|D = 1) − E(Y0|D = 1)]
     = E(Y0|D = 1) − E(Y0|D = 0) (7)
     = Δx + Δz + Δn + Δd (8)
We derive the last equality and give the mathematical definitions of the terms Δx, Δz, Δn, and Δd in Appendix C. These four terms denote the four sources of bias in using observational data, with the subscripts being mnemonics for the components. They are due to, respectively, omitting relevant control variables (Δx), controlling for variables that have been affected by the key causal variable (Δz), insufficient overlap in the distributions of the control variables for the democracies and nondemocracies (Δn), and density differences in the control variable distributions (Δd). We now explain and interpret each of these components, and then apply them to the effects of democracy.
The absence of all bias in estimating τ with d would be assured if we knew (from Equation 7) that

E(Y0|D = 1) = E(Y0|D = 0). (9)

Assumption (9) says that it is safe to use the observed control group outcome (Y0|D = 0, the level of conflict initiated by nondemocracies) in place of the unobserved counterfactual (Y0|D = 1, the level of conflict that would be initiated by democracies if they were actually nondemocracies). Since this is rarely the case, we introduce control variables: let Z denote a vector of control variables (explanatory variables aside from D), such that X = {D, Z}. If, after conditioning on Z, treatment assignment is random (that is, if we measure and control for the right set of control variables, those that are related to D and affect Y), so that

E(Y0|D = 1, Z) = E(Y0|D = 0, Z) (10)

holds, then from (21) in Appendix C, the first component of bias vanishes: Δx = 0. Thus, this first component of bias, Δx, is due to pertinent control variables being omitted from X so that (10) is violated. This is the familiar omitted variable bias, which can plague any model. It can also be due to controlling for irrelevant variables in certain situations, and so Z should be minimally sufficient (Greenland et al., 1999).
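Omitted variable bias is easy to exhibit in simulation. In this sketch (our own toy data generating process, with invented coefficients), Z drives both the treatment and the outcome; omitting it inflates the estimated effect, while conditioning on it recovers the truth:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
z = rng.normal(0, 1, n)                          # confounder: affects D and Y
d = (z + rng.normal(0, 1, n) > 0).astype(float)  # treatment more likely when z is high
y = 1.0 * d + 2.0 * z + rng.normal(0, 1, n)      # true effect of D is 1.0

naive = y[d == 1].mean() - y[d == 0].mean()      # omits z: badly biased upward

# Conditioning on z (here via least squares) removes the bias
Xmat = np.column_stack([np.ones(n), d, z])
adjusted = np.linalg.lstsq(Xmat, y, rcond=None)[0][1]  # coefficient on d
```

The naive difference in means roughly triples the true effect because democracies in this toy world differ from nondemocracies on Z even before treatment.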
The second component of bias in our decomposition, Δz, deviates from zero when some of the control variables Z are at least in part consequences of the key causal variable D. If Z includes these post-treatment variables, then when the key causal variable D changes, the post-treatment variables may change too. Thus, if we denote Z1 and Z0 as the values that Z takes on when D = 1 and D = 0, respectively (components of Z that are strictly pre-treatment do not change between Z0 and Z1), then, as with Y, either Z1 or Z0, but not both, is observed for any one observation, and the observed Z = Z0(1 − D) + Z1D will be different from the counterfactuals Z0 and Z1, resulting in a nonzero Δz in (8).
As a simple example that illustrates the bias of controlling for post-treatment variables, suppose we are predicting the vote with partisan identification. If we control for the intended vote five minutes before walking into the voting booth, our estimate of the effect of partisan identification would be nearly zero. The reason is that we would be inappropriately controlling for the consequences of our key causal variable, and thus for most of its effects, biasing the overall estimate. Yet we certainly should control for a pre-treatment variable like race, which cannot be a consequence of partisan identification but may be a confounding variable that needs to be controlled. Thus, causal models require separating out the pre- and post-treatment variables and controlling only for the pre-treatment, background characteristics.
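The voting example can be simulated directly. The variable names and data generating process below are our own invention; the point is only the mechanics of the bias:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
pid = rng.normal(0, 1, n)               # partisan identification (the treatment)
intent = pid + rng.normal(0, 0.1, n)    # vote intention: a post-treatment variable
vote = intent + rng.normal(0, 0.1, n)   # the final vote

def slope(cols, y):
    """Least-squares coefficient on the first variable in cols."""
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

effect_correct = slope([pid], vote)          # near 1: the total effect
effect_biased = slope([pid, intent], vote)   # near 0: intent absorbs the effect
```

Conditioning on intention blocks the very pathway through which partisanship affects the vote, driving the estimated effect toward zero.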
Post-treatment variable bias may well be the largest overlooked component of bias in estimating causal effects in political science (see King, Keohane, and Verba, 1994: 173ff). It is known in the statistical literature, but is assumed away in most models and decompositions (Frangakis and Rubin, 2001). This decision may be reasonable in other fields, where the distinction between pre- and post-treatment variables is easier to recognize and avoid, but in political science, especially comparative politics and international relations, the problem is often severe. For example, is GDP a consequence or a cause of democracy? How about education levels? Fertility rates? Infant mortality? Trade levels? Are international institutions causes or consequences of international cooperation? Many or possibly even most variables in these literatures are both causes and consequences of whatever is regarded as the treatment (or key causal) variable. As Lebow (2000: 575) explains, "Scholars not infrequently assume that one aspect of the past can be changed and everything else kept constant, . . . [but these] Surgical counterfactuals are no more realistic than surgical air strikes." This is especially easy to see in quantitative research, where each of the variables in an estimation takes its turn, in different paragraphs of an article, playing the role of the treatment. However, only in rare statistical models, and only under stringent assumptions, is it possible to estimate more than one causal effect from a single model.
To avoid this component of bias, Δz, we need to ensure that we control for no post-treatment variables, or that the distribution of our post-treatment variables does not vary with D:

P(Z0|D = 1) = P(Z1|D = 1). (11)

If this assumption holds, then Δz in (8) vanishes.
The fundamental problem with much research in comparative politics and international relations is not merely the bias induced by controlling for post-treatment variables. The problem is that even if dropping these variables alleviates post-treatment bias, it will likely also induce omitted variable bias. In our field, unfortunately, we cannot get around considering both Δx and Δz together, and in many situations we cannot fix one without making the other worse. The same is not true in all fields (which is perhaps the reason the Δz component was ignored by Heckman et al., 1998), but it is rampant in ours.

Unfortunately, the news gets worse, since even the methodologist's last resort ("try it both ways, and if it doesn't make a difference, you can ignore the problem") does not work here. Rosenbaum (1984) studies the situation where we run two analyses, one including and one excluding the variables that are partly consequences and partly causes of X. He shows that the true effect could be greater than both of these estimates or less than both. It is hard to emphasize sufficiently how prevalent this problem is in comparative politics and international relations.
Although we have no solution to this problem, we can offer one way to avoid both Δz and Δx in the presence of variables that are partially post-treatment. Aside from choosing better research designs in the first place, of course, our suggestion is to study multiple-variable causal effects. If we cannot study the effect of democracy controlling for GDP because higher GDP is in part a consequence of democracy, we may be able to study the joint causal effect of a change from nondemocracy to democracy and a simultaneous increase in GDP. This counterfactual is more realistic, i.e., closer to the data, because it reflects changes that actually occur in the world and does not require us to imagine holding constant variables that do not stay constant in nature. If this alternative formulation provides an interesting research question, then it can be studied without bias due to Δz, since the joint causal effect will not be affected by post-treatment bias. Moreover, the multiple-variable causal effect might also have no omitted variable bias Δx, since both variables would be part of the treatment and could not be potential confounders. Of course, if this question is not of interest and we need to stick with the original question, then no easy solution exists at present. At that point, we should recognize that the counterfactual question being posed is too unrealistic and too strained to provide a reasonable answer using the given
data with any statistical model. Either way, this is a serious problem that needs to move higher on the
agenda of political methodology.
The last two components of bias, nonoverlap, Δn, and density differences, Δd, are both affected by aspects of the distribution of the control variables, Z. For clarity, during our discussion here we assume that the two components of bias we have previously discussed are not an issue, so that we have no post-treatment bias and we have the minimally sufficient set of pre-treatment variables Z to control for. From (19) and (20), we see that what contributes to the remaining two components of bias is the difference in the distribution of Z across the treatment (democracies) and control (nondemocracies) groups. These differences are precisely what create bias: for example, before assigning a drug or a placebo to two groups of patients, we first need to ensure that the two groups are the same on all relevant pre-treatment characteristics (i.e., Z). If, before treatment, the control group is on average healthier than the treatment group, then the effect of the drug will not be estimated correctly.
We parse these group differences into two forms. First, the support of the distributions might differ; that is, there may be certain values of Z that some members of one group take on but no members of the other group have (for example, we might observe no full democracies with infant mortality rates as high as in some of the autocracies). This part of the difference constitutes nonoverlap bias, Δn. Intuitively, in regions of nonoverlap, the treatment and control groups are simply incomparable, and thus the counterfactuals underlying the causal inference are not sufficiently close to the data to draw valid inferences. The second form of difference is the unequal distribution of Z across the two groups in the region of common support (for example, we may have observations on both democracies and autocracies with relatively low infant mortality rates, but many more such observations are democracies than autocracies). This component, Δd, contributes to the bias, but by controlling for Z with the right functional form, or via weighting or some other method, we may be able to eliminate it.
Thus for the last two components of bias to vanish we need to ensure that the distribution of Z in the treatment group is the same as that in the control group. That is,

P(Z|D = 1) = P(Z|D = 0). (12)

If we restrict inference to the overlap region only, then we do not have the problem of comparing the incomparable, and the nonoverlap bias Δn is eliminated. This would change the question being asked, but at least the result would be a question that has an answer, which of course would be useful only if the resulting question is of interest. This step is of course directly analogous to and indeed closely
related to the criterion that counterfactuals should be in the convex hull. And if, in the region of valid inference (overlap), we correctly adjust for the density differences across the two groups, then Δd is eliminated. With both steps together, (12) holds.
So the question is how to ensure (12); that is, how to identify regions of density overlap, and how to adjust for differences in this region. It is easy to see, for the counterfactual x implied in the second term of (4), that being on the common support of P(Z|D = 1) and P(Z|D = 0) is in fact equivalent to being on the support of P(X), so identifying the regions of density overlap is also a direct check of the quality of the key counterfactual required for causal inference. In the simple case when Z contains just one variable, such as GDP, we could plot the histogram of GDP for democracies and compare it to that for nondemocracies. Nonoverlap can easily be identified by comparing these histograms (or nonparametric density estimates), which also readily reveal the nature of density differences across the two groups.
To avoid inference on the nonoverlap region, we would simply restrict our analysis to data with the same range of observations on GDP for both groups. To adjust for the density differences, some form of nonparametric matching, such as subclassification on GDP, can easily be applied, as there is only one variable to match on. Alternatively, a parametric model of adequate functional flexibility can be used to control for GDP.
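The single-variable overlap check just described can be sketched in a few lines. The data below are simulated purely for illustration; the variable names and distributions are our invention, not actual GDP figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated logged GDP for democracies (treated) and nondemocracies
# (control); values are invented for illustration only.
gdp_dem = rng.normal(9.0, 1.0, 300)
gdp_aut = rng.normal(7.5, 1.2, 400)

# The region of common support is the intersection of the observed ranges.
lo = max(gdp_dem.min(), gdp_aut.min())
hi = min(gdp_dem.max(), gdp_aut.max())

# Restricting the analysis to this region eliminates the nonoverlap bias.
dem_overlap = gdp_dem[(gdp_dem >= lo) & (gdp_dem <= hi)]
aut_overlap = gdp_aut[(gdp_aut >= lo) & (gdp_aut <= hi)]

# Histograms (or kernel density estimates) on a common grid then reveal
# the density differences that remain within the overlap region.
bins = np.linspace(lo, hi, 21)
dem_density, _ = np.histogram(dem_overlap, bins=bins, density=True)
aut_density, _ = np.histogram(aut_overlap, bins=bins, density=True)
```

Any density estimator would do here; the point is only that with one control variable the comparison is trivial, which is exactly what fails in higher dimensions.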
In most real applications, of course, Z contains many control variables, and so checking assumption (12) would involve estimating and comparing two multidimensional densities in hyperspace. For more than a few explanatory variables, this is a formidable task (essentially impossible without stringent assumptions). Adjusting the density differences by matching would be extremely difficult or infeasible, since with many matching variables matches are hard to find; similarly, parametric models suffer from the curse of dimensionality, as correct functional forms are difficult to identify or to estimate efficiently due to the large number of parameters.
Fortunately, this curse-of-dimensionality problem has been solved with the use of the propensity score, π = Pr(D = 1|Z), the probability of D = 1 given the control variables Z. The propensity score summarizes the relevant aspects of the multidimensional Z with a scalar π. Rosenbaum and Rubin (1983) show that conditioning on π provides a substitute for Equation 12 that is true without assumptions:

P(Z|D = 1, π) = P(Z|D = 0, π). (13)

This statement is interesting in and of itself, since it implies that if we can estimate π, and control for
it, we will be able to eliminate bias even if assumption (12) does not hold. For present purposes, Equation (13) enables us to show that assumption (12) can be expressed in an equivalent and easier-to-verify form. That is, the assumption sufficient to eliminate the two remaining components of bias (so that Δn = Δd = 0) is that the distribution of π for democracies is identical to the distribution of π for nondemocracies:12

P(π|D = 1) = P(π|D = 0). (14)

This expression solves the curse-of-dimensionality problem in Equation 12, and so is easier to use, because it requires only the comparison of two unidimensional densities of π (one for democracies and one for nondemocracies), and the adjustment of density differences based on only one control variable.
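To make this construction concrete, the following sketch fits the propensity score by logistic regression on simulated data. The data-generating values and the plain gradient-ascent fitting routine are our own illustrative choices; any standard logit routine would serve:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: n units, two pre-treatment covariates Z, and a binary
# treatment D whose assignment probability depends on Z.
n = 500
Z = rng.normal(size=(n, 2))
true_beta = np.array([1.0, -0.5])
D = (rng.uniform(size=n) < 1 / (1 + np.exp(-(Z @ true_beta)))).astype(float)

# Fit pi = Pr(D = 1 | Z) by logistic regression, here via plain gradient
# ascent on the log-likelihood.
X = np.column_stack([np.ones(n), Z])
beta = np.zeros(3)
for _ in range(1000):
    p = 1 / (1 + np.exp(-(X @ beta)))
    beta += 0.1 * X.T @ (D - p) / n
pi = 1 / (1 + np.exp(-(X @ beta)))

# Comparing the one-dimensional densities of pi across the two groups
# replaces the infeasible comparison of multidimensional densities of Z.
pi_treated = pi[D == 1]
pi_control = pi[D == 0]
```

The treated group should, and here does, have systematically higher estimated scores; it is the regions where one group's density is zero that flag nonoverlap.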
3.2 The Causal Effect of Democracy
In this section we use the ideas discussed in Section 3.1 to analyze the literature studying the causal effect of democracy. We have already discussed the serious problem with variables that are partially a consequence and partially a cause of democracy; these unfortunately include many, perhaps even most, of the other variables included in regression models alongside democracy: economic variables, features of the political system, indicators of conflict, measures of participation in international institutions, etc.
We add to this discussion an analysis of the Δn component of bias, which is also relevant to our theme of counterfactual evaluation. We take the state failure data and consider the causal effect of partial democracy vs. autocracy, and then separately the effect of full democracy vs. autocracy. (The analysis here applies to the causal effect of these changes on any dependent variable.) To identify regions of nonoverlap, i.e., areas in which one type of country does not have a comparable counterpart in the other, and therefore where the counterfactuals underlying the causal inference are not reasonably factual, we compute the propensity score and plot the density estimates (smooth versions of histograms, computed with kernels that have bounded support) for partial democracies and autocracies (in the left graph) and for full democracies and autocracies (in the right graph) in Figure 4.
Figure 4 clearly shows that the densities within each pair of comparisons are not the same, which is no surprise. The ex ante probability of being a democracy should be higher for actual democracies than for actual autocracies, and this is clearly the case: most of the mass for democracies is on the right of each graph (indicating high probability values) and even more of the density for autocracies is at the left (indicating low probability values). Also as expected, the differences are more dramatic for the causal effect of full democracy vs. autocracy (on the right) than for partial democracy vs. autocracy (on the
left). The extent to which the two densities are not identical in each graph is a vivid description of the bias of the difference-in-means estimator (6) due to the Δn and Δd components. If, in contrast, the values of D were assigned randomly, then the ex ante probabilities would not be related to D, and so the two densities would be the same. It is the observational, nonrandom aspects of this assignment that produce these components of bias. In the figure, the nonoverlap bias, Δn, refers to portions of each graph for which one curve has positive density and the other has zero density. In each plot in this figure, areas of nonoverlap are at the ends. For example, no actual autocracies have an ex ante probability of being a partial democracy or full democracy of more than about 0.8, yet a large fraction of democracies have ex ante probabilities this high. Similarly, a large number of autocracies have ex ante probabilities of being democracies of less than about 0.02, but no democracies can be found with such low probabilities.

Since datasets that overlap the one we use have been analyzed in so many publications, Figure 4
reveals that nonoverlap bias is a substantial problem in the literature on the causal effects of democracy (for any dependent variable). Discussion of these issues in the quantitative empirical literature is rare, and we know of very few studies that attempt formally to diagnose this issue, much less to do anything about it. Making inferences in regions of nonoverlap requires highly model-dependent extrapolation to areas where no data exist. The problem is closely related to the "what if" counterfactuals that lie outside the convex hull that we discuss earlier (at the limit, being in the overlap region is a sufficient condition for being in the convex hull). In the causal inference literature, where scholars have begun to pay attention to the problem, the solution mostly takes the form of changing the question to be asked, that is, eliminating the portions of the population to which the areas of nonoverlap correspond and hence changing the target of inference. The counterfactuals required for inference on the original population are considered too strained for that purpose. Just as we recommend in Section 2, we must conclude that these causal effects cannot be reliably estimated in existing data. In our empirical example here, the area of nonoverlap in both graphs in Figure 4 is substantial. In the comparison between full democracy vs. autocracy in particular, the nonoverlap area is so large that the counterfactuals for causal inferences would be too far
from the data to allow empirical justification for the vast majority of such inferences.
Before ending this section, we comment briefly on the component of bias due to density differences in the overlap region, Δd. This can be a substantial component of the total bias in many empirical problems. Unlike the nonoverlap problem, which is just beginning to receive attention, however,
adjusting for density differences is a topic extensively researched and a practice routinely performed in the literatures on democracy. A major function of the standard regression-type models used in political science, for example, is precisely that purpose. Parametric models are subject to incorrect functional form assumptions, so nonparametric methods such as matching or inverse propensity score weighting can be used instead (e.g., Rosenbaum and Rubin, 1984; Heckman et al., 1998; Robins, 1999, 1999b; Winship and Morgan, 1999).
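The logic of the weighting adjustment can be illustrated with a minimal simulation. The true effect, the confounding structure, and the use of the true rather than estimated propensity score are all simplifying choices of our own:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: one confounder z, treatment d assigned with probability
# sigmoid(z), and an outcome with a true treatment effect of 2.0.
n = 5000
z = rng.normal(size=n)
pi_true = 1 / (1 + np.exp(-z))
d = (rng.uniform(size=n) < pi_true).astype(float)
y = 2.0 * d + 3.0 * z + rng.normal(size=n)

# The naive difference in means is badly biased because z is distributed
# differently across the treated and control groups.
naive = y[d == 1].mean() - y[d == 0].mean()

# Inverse propensity score weighting corrects the density difference:
# each unit is weighted by the inverse probability of the treatment
# condition it actually received. (The true score is used for clarity;
# in practice it must be estimated.)
ipw = np.mean(d * y / pi_true) - np.mean((1 - d) * y / (1 - pi_true))
```

In this simulation the weighted estimate lands much closer to the true effect of 2.0 than the naive difference in means does, which is the whole point of the adjustment.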
We conclude this section by noting that although decompositions of bias are very useful in understanding the problems, all four components of bias must ordinarily be eliminated simultaneously to produce reliable inferences. With one exception that we discuss momentarily, a partial "fix" could even make the total bias worse. For example, if Δz and Δx are both known not to be zero, it is possible (although perhaps unlikely) that fixing one rather than both components may actually increase the total bias.

The exception is Δn. Although blind luck can always occur, eliminating the nonoverlap region, and hence recognizing the impossibility of making certain causal inferences, should always be helpful. It may be disappointing, of course, to know that the desired questions have no good answers in available data, but it is better to know this than to ignore it.
3.3 The Causal Effect of UN Peacekeeping
We now apply the methods of estimating causal effects introduced in Section 3.1 to the Doyle and Sambanis (2000) example we first reanalyzed in Section 2.4. In that section, we showed that all the counterfactuals in these data were extrapolations far from the convex hull, and thus that inferences about them are highly model dependent. We now demonstrate that the specific causal effects of interest are also highly dependent on modeling assumptions implicit in their logit specification.
Thus, we begin in Figure 5 by plotting the densities (via histograms) of the propensity score (the probability of a UN peacekeeping operation given the other explanatory variables): one for cases where there was a UN peacekeeping operation (UNOP4=1) and one for cases without an operation (UNOP4=0). As in Figure 4, the extent to which the two histograms overlap is a measure of the common support of the two distributions. Unfortunately, the figure demonstrates that very little of the distributions overlap at all. Hence, there exists almost no leverage in the data on which to base causal inferences about the effect of UN peacekeeping operations (UNOP4=1 for only seven observations in the data). There is some overlap, but the very high correlations between peacekeeping operations and
the control variables make parsing out the effect of the former exceptionally difficult and highly model dependent.
We now demonstrate the consequences of this lack of sufficient common support for model dependence in two ways, one parametric and the other nonparametric. For the first, we compare estimates of the causal effects in the two logit analyses in Table 2, which fit the data almost identically and which differ only by one interaction term. From these logit models, we compute the marginal effects of UN peacekeeping operations (from UNOP4=0 to 1) as a function of the duration of the civil war, holding all other variables constant at their means. Figure 6 plots these results.
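The marginal-effect calculation behind such a figure can be sketched as follows. The coefficient values below are invented for illustration and are not the Doyle and Sambanis estimates; the point is only the mechanics of switching the treatment indicator while holding everything else fixed:

```python
import numpy as np

def logit_prob(X, beta):
    """Predicted probability from a logit model."""
    return 1 / (1 + np.exp(-(X @ beta)))

# Illustrative coefficients: intercept, treatment indicator, war duration,
# and one background control held at its mean. Values are hypothetical.
beta = np.array([-1.0, 1.5, -0.05, 0.3])
control_mean = 0.2
durations = np.linspace(1, 100, 100)

# Design matrices with the treatment switched on (X1) and off (X0),
# all other variables held constant.
X1 = np.column_stack([np.ones_like(durations), np.ones_like(durations),
                      durations, np.full_like(durations, control_mean)])
X0 = X1.copy()
X0[:, 1] = 0.0

# The marginal (first-difference) effect at each duration.
marginal_effect = logit_prob(X1, beta) - logit_prob(X0, beta)
```

Note that even with no interaction term the effect varies with duration through the curvature of the logit function, which is exactly why two specifications with nearly identical fit can imply very different effect profiles.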
The vertical axis in Figure 6 is the marginal effect, which, conditional on the veracity of the logit model, is a causal effect. The horizontal axis is the duration of the civil war. The dotted line is the causal effect of UN peacekeeping estimated by the model originally presented in Doyle and Sambanis (2000). Even without a formal interaction term, the nonlinearities of the logit model allow the effect of UNOP4 to vary with civil war duration, here almost linearly, and the effect is clear: the effect of UN peacekeeping operations is smaller for civil wars that have gone on for a longer time before the UN stepped in. This is quite a plausible result, since we might expect that long conflicts would be more difficult to resolve. However, as it turns out, the empirical support for this result depends almost entirely on unsupported modeling assumptions. This can be seen by examining the solid line, which portrays the causal effect for our modified model. As can be plainly seen, the effects for the two models are massively different. In the modified model, the probability of success is hardly affected by UN operations for relatively short wars; it then increases quickly before declining in parallel with the original model. The huge differences for shorter wars, and the opposite slopes of the two lines in that region, suggest diametrically opposed policy implications for UN missions. Although the two models give nearly identical fits to the factual data, the counterfactuals are far enough from the observed data to make conclusions highly sensitive to modeling assumptions.
Finally, the differences between the two logit models in Table 2 suggest that there may be a wide variety of other models that would fit nearly as well or slightly better, but would give still more divergent estimates of causal effects. We therefore estimate the causal effect nonparametrically via matching, giving up the logit functional form assumption entirely and keeping only the assumption that all the right control variables have been identified and included. The problem, of course, is that exact matches are impossible: in no case does there exist a civil war with a UN peacekeeping mission that is alike in all respects to a civil war without one. The lack of exact matches means that assumptions must be added to the matching process, which we do via five separate approaches in Table 3.13
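For intuition, a one-to-one nearest-neighbor matching estimator on a single score can be sketched as follows. The data are simulated with a true effect of 1.0; none of this reproduces the five approaches reported in Table 3:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: few treated units (as with the seven UN operations),
# many controls, and one matching variable such as an estimated
# propensity score. The true treatment effect is 1.0; all values are
# illustrative only.
score_treated = rng.uniform(0.3, 0.7, 7)
score_control = rng.uniform(0.0, 0.6, 100)
y_treated = 1.0 + 2.0 * score_treated + rng.normal(0, 0.1, 7)
y_control = 2.0 * score_control + rng.normal(0, 0.1, 100)

# Match each treated unit to the control with the closest score, then
# average the within-pair outcome differences. Treated units whose
# scores exceed every control's (here, those above 0.6) have no close
# counterpart: exactly the stretched-counterfactual problem discussed
# in the text.
matches = np.array([np.argmin(np.abs(score_control - s))
                    for s in score_treated])
att = np.mean(y_treated - y_control[matches])
```

Each real matching approach differs in how it handles the treated units with no close counterpart: dropping them changes the question, while keeping them stretches the counterfactual.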
Table 3 demonstrates the sensitivity of conclusions about the effects of UN peacekeeping operations to the particular assumptions one makes. In the table, only one of the five estimates is significantly greater than zero (in the sense that the 95% confidence interval excludes zero), and only two of the five point estimates are of any substantial size. All have very wide confidence intervals. Moreover, the one significant result involves probably the least plausible set of assumptions, since all cases without UN peacekeeping operations are used as matches no matter how different they are (as measured by the control variables) from the UN peacekeeping cases, and thus inferences are made no matter how stretched the counterfactuals are. We conclude that any conclusions about the effects of UN peacekeeping operations are more theoretical than empirical. The dataset used does not, by itself or with a reasonable set of given assumptions, support the conclusions offered.
4 Concluding Remarks
Even "far out" questions with answers that are highly model dependent may still be important enough to warrant further study. For a few examples: What would happen to the stability of the U.S. government if inflation increased to 200%, or had Gore refused to concede the 2000 election? What would the future of military conflict be in a world without nation states? How bad would the devastation be from a third world war? If a new virulent infectious disease ten times as bad as AIDS struck the developed world and lasted longer than the AIDS crisis, would current international institutions survive? Scholars can and certainly should still ask questions like these, but we would be better served if we knew whether and to what degree our answers are based on empirical evidence rather than model assumptions. Sometimes, with the data at hand, no statistical model can give valid answers, and we must rely on theory or new data collect