
American Political Science Review, Page 1 of 22

doi:10.1017/S0003055419000194 © American Political Science Association 2019

Declaring and Diagnosing Research Designs

GRAEME BLAIR University of California, Los Angeles

JASPER COOPER University of California, San Diego

ALEXANDER COPPOCK Yale University

MACARTAN HUMPHREYS WZB Berlin and Columbia University

Researchers need to select high-quality research designs and communicate those designs clearly to readers. Both tasks are difficult. We provide a framework for formally “declaring” the analytically relevant features of a research design in a demonstrably complete manner, with applications to qualitative, quantitative, and mixed methods research. The approach to design declaration we describe requires defining a model of the world (M), an inquiry (I), a data strategy (D), and an answer strategy (A). Declaration of these features in code provides sufficient information for researchers and readers to use Monte Carlo techniques to diagnose properties such as power, bias, accuracy of qualitative causal inferences, and other “diagnosands.” Ex ante declarations can be used to improve designs and facilitate preregistration, analysis, and reconciliation of intended and actual analyses. Ex post declarations are useful for describing, sharing, reanalyzing, and critiquing existing designs. We provide open-source software, DeclareDesign, to implement the proposed approach.

As empirical social scientists, we routinely face two research design problems. First, we need to select high-quality designs, given resource constraints. Second, we need to communicate those designs to readers and reviewers.

To select strong designs, we often rely on rules of thumb, simple power calculators, or principles from the methodological literature that typically address one component of a design while assuming optimal conditions for others. These relatively informal practices can result in the selection of suboptimal designs, or worse, designs that are simply too weak to deliver useful answers.

To convince others of the quality of our designs, we often defend them with references to previous studies that used similar approaches, with power analyses that may rely on assumptions unknown even to ourselves, or with ad hoc simulation code. In cases of dispute over the merits of different approaches, disagreements sometimes fall back on first principles or epistemological debates rather than on demonstrations of the conditions under which one approach does better than another.

In this paper we describe an approach to address these problems. We introduce a framework—MIDA—that asks researchers to specify information about their background model (M), their inquiry (I), their data strategy (D), and their answer strategy (A). We then introduce the notion of “diagnosands,” or quantitative summaries of design properties. Familiar diagnosands include statistical power, the bias of an estimator with respect to an estimand, or the coverage probability of a procedure for generating confidence intervals. We say a design declaration is “diagnosand-complete” when a diagnosand can be estimated from the declaration. We do not have a general notion of a complete design, but rather adopt an approach in which the purposes of the design determine which diagnosands are valuable and in turn what features must be declared. In practice, domain-specific standards might be agreed upon among members of particular research communities. For instance, researchers concerned about the policy impact of a given treatment might require a design that is diagnosand-complete for an out-of-sample diagnosand, such as bias relative to the population average treatment effect. They may also consider a diagnosand directly related to policy choices, such as the probability of making the right policy decision after research is conducted.

Graeme Blair, Assistant Professor of Political Science, University of California, Los Angeles, [email protected], https://graemeblair.com.

Jasper Cooper, Assistant Professor of Political Science, University of California, San Diego, [email protected], http://jasper-cooper.com.

Alexander Coppock, Assistant Professor of Political Science, Yale University, [email protected], https://alexandercoppock.com.

Macartan Humphreys, WZB Berlin, Professor of Political Science, Columbia University, [email protected], http://www.macartan.nyc.

Authors are listed in alphabetical order. This work was supported in part by a grant from the Laura and John Arnold Foundation and seed funding from EGAP—Evidence in Governance and Politics. Errors remain the responsibility of the authors. We thank the Associate Editor and three anonymous reviewers for generous feedback. In addition, we thank Peter Aronow, Julian Brückner, Adrian Dușa, Adam Glynn, Donald Green, Justin Grimmer, Kolby Hansen, Erin Hartman, Alan Jacobs, Tom Leavitt, Winston Lin, Matto Mildenberger, Matthias Orlowski, Molly Roberts, Tara Slough, Gosha Syunyaev, Anna Wilke, Teppei Yamamoto, Erin York, Lauren Young, and Yang-Yang Zhou; seminar audiences at Columbia, Yale, MIT, WZB, NYU, Mannheim, Oslo, Princeton, the Southern California Methods Workshop, and the European Field Experiments Summer School; as well as participants at the EPSA 2016, APSA 2016, EGAP 18, BITSS 2017, and SPSP 2018 meetings for helpful comments. We thank Clara Bicalho, Neal Fultz, Sisi Huang, Markus Konrad, Lily Medina, Pete Mohanty, Aaron Rudkin, Shikhar Singh, Luke Sonnet, and John Ternovski for their many contributions to the broader project. The methods proposed in this paper are implemented in an accompanying open-source software package, DeclareDesign (Blair et al. 2018). Replication files are available at the American Political Science Review Dataverse: https://doi.org/10.7910/DVN/XYT1VB.

Received: January 15, 2018; revised: August 14, 2018; accepted: February 14, 2019.


We acknowledge that although many aspects of design quality can be assessed through design diagnosis, many cannot. For instance, the contribution to an academic literature, relevance to a policy decision, and impact on public debate are unlikely to be quantifiable ex ante.

Using this framework, researchers can declare a research design as a computer code object and then diagnose its statistical properties on the basis of this declaration. We emphasize that the term “declare” does not imply a public declaration or even necessarily a declaration before research takes place. A researcher may declare the features of designs in our framework for their own understanding, and declaring designs may be useful before or after the research is implemented. Researchers can declare and diagnose their designs with the companion software for this paper, DeclareDesign, but the principles of design declaration and diagnosis do not depend on any particular software implementation.

The formal characterization and diagnosis of designs before implementation can serve many purposes. First, researchers can learn about and improve their inferential strategies. Done at this stage, diagnosis of a design and alternatives can help a researcher select from a range of designs, conditional upon beliefs about the world. Later, a researcher may include design declaration and diagnosis as part of a preanalysis plan or in a funding request. At this stage, the full specification of a design serves a communication function and enables third parties to understand a design and an author’s intentions. Even if declared ex post, formal declaration still has benefits. The complete characterization can help readers understand the properties of a research project, facilitate transparent replication, and can help guide future (re-)analysis of the study data.

The approach we describe is clearly more easily applied to some types of research than others. In prospective confirmatory work, for example, researchers may have access to all design-relevant information prior to launching their study. For more inductive research, by contrast, researchers may simply not have enough information about possible quantities of interest to declare a design in advance. Although in some cases the design may still be usefully declared ex post, in others it may not be possible to fully reconstruct the inferential procedure after the fact. For instance, although researchers might be able to provide compelling grounds for their inferences, they may not be able to describe what inferences they would have drawn had different data been realized. This may be particularly true of interpretivist approaches and approaches to process tracing that work backwards from outcomes to a set of possible causes that cannot be prespecified. We acknowledge from the outset that variation in research strategy limits the utility of our procedure for different types of research. Even still, we show that our framework can accommodate discovery, qualitative inference, and different approaches to mixed methods research, as well as designs that focus on “effects-of-causes” questions, often associated with quantitative approaches, and “causes-of-effects” questions, often associated with qualitative approaches.

Formally declaring research designs as objects in the manner we describe here brings, we hope, four benefits. It can facilitate the diagnosis of designs in terms of their ability to answer the questions we want answered under specified conditions; it can assist in the improvement of research designs through comparison with alternatives; it can enhance research transparency by making design choices explicit; and it can provide strategies to assist principled replication and reanalysis of published research.

RESEARCH DESIGNS AND DIAGNOSANDS

We present a general description of a research design as the specification of a problem and a strategy to answer it. We build on two influential research design frameworks. King, Keohane, and Verba (1994, 13) enumerate four components of a research design: a theory, a research question, data, and an approach to using the data. Geddes (2003) articulates the links between theory formation, research question formulation, case selection and coding strategies, and strategies for case comparison and inference. In both cases, the set of components are closely aligned to those in the framework we propose. In our exposition, we also employ elements from Pearl’s (2009) approach to structural modeling, which provides a syntax for mapping design inputs to design outputs, as well as the potential outcomes framework as presented, for example, in Imbens and Rubin (2015), which many social scientists use to clarify their inferential targets. We characterize the design problem at a high level of generality with the central focus being on the relationship between questions and answer strategies. We further situate the framework within existing literature below.

Elements of a Research Design

The specification of a problem requires a description of the world and the question to be asked about the world as described. Providing an answer requires a description of what information is used and how conclusions are reached given this information.

At its most basic, we think of a research design as including four elements ⟨M, I, D, A⟩:

1. A causal model, M, of how the world works.1 In general, following Pearl’s definition of a probabilistic causal model (Pearl 2009), we assume that a model contains three core elements. First, a specification of the variables X about which research is being conducted.

1 Though M is a causal model of the world, such a model can be used for both causal and non-causal questions of interest.


This includes endogenous and exogenous variables (V and U respectively) and the ranges of these variables. In the formal literature this is sometimes called the signature of a model (e.g., Halpern 2000). Second, a specification of how each endogenous variable depends on other variables (the “functional relations” or, as in Imbens and Rubin (2015), “potential outcomes”), F. Third, a probability distribution over exogenous variables, P(U).

2. An inquiry, I, about the distribution of variables, X, perhaps given interventions on some variables. Using Pearl’s notation, we can distinguish between questions that ask about the conditional values of variables, such as Pr(X1 | X2 = 1), and questions that ask about values that would arise under interventions: Pr(X1 | do(X2 = 1)).2

We let a^M denote the answer to I under the model. Conditional on the model, a^M is the value of the estimand, the quantity that the researcher wants to learn about.

3. A data strategy, D, generates data d on X under model M with probability P_M(d | D). The data strategy includes sampling strategies and assignment strategies, which we denote with P_S and P_Z respectively. Measurement techniques are also a part of data strategies and can be thought of as procedures by which unobserved latent variables are mapped (possibly with error) into observed variables.

4. An answer strategy, A, that generates answer a^A using data d.

A key feature of this bare specification is that if M, D, and A are sufficiently well described, the answer to question I has a distribution P_M(a^A | D). Moreover, one can construct a distribution of comparisons of this answer to the correct answer, under M, for example by assessing P_M(a^A − a^M | D). One can also compare this to results under different data or analysis strategies, P_M(a^A − a^M | D′) and P_M(a^A′ − a^M | D), and to answers generated under alternative models, P_M(a^A − a^M′ | D), as long as these possess signatures that are consistent with inquiries and answer strategies.

MIDA captures the analysis-relevant features of a design, but it does not describe substantive elements, such as how theories are derived, how interventions are implemented, or even, qualitatively, how outcomes are measured. Yet many other aspects of a design that are not explicitly labeled in these features enter into this framework if they are analytically relevant. For example, if treatment effects decay, logistical details of data collection (such as the duration of time between a treatment being administered and endline data collection) may enter into the model. Similarly, if a researcher anticipates noncompliance, substantive knowledge of how treatments are taken up can be included in many parts of the design.

Diagnosands

The ability to calculate distributions of answers, given a model, opens multiple avenues for assessment and critique. How good is the answer you expect to get from a given strategy? Would you do better, given some desideratum, with a different data strategy? With a different analysis strategy? How good is the strategy if the model is wrong in one way or another?

To allow for this kind of diagnosis of a design, we introduce two further concepts, both functions of research designs. These are quantities that a researcher or a third party could calculate with respect to a design.

1. A diagnostic statistic is a summary statistic generated from a “run” of a design—that is, the results given a possible realization of variables, given the model and data strategy. For example, the statistic e = “difference between the estimated and the actual average treatment effect” is a diagnostic statistic that requires specifying an estimand. The statistic s = 1(p ≤ 0.05), interpreted as “the result is considered statistically significant at the 5% level,” is a diagnostic statistic that does not require specifying an estimand, but it does presuppose an answer strategy that reports a p-value.

Diagnostic statistics are governed by probability distributions that arise because both the model and the data generation, given the model, may be stochastic.

2. A diagnosand is a summary of the distribution of a diagnostic statistic. For example, (expected) bias in the estimated treatment effect is E(e) and statistical power is E(s).

To illustrate, consider the following design. A model M specifies three variables X, Y, and Z defined on the real number line that form the signature. In addition, we assume functional relationships between them that allow for the possibility of confounding (for example, Y = bX + Z + ε_Y; X = Z + ε_X, with Z, ε_X, and ε_Y distributed standard normal). The inquiry I is “what would be the average effect of a unit increase in X on Y in the population?” The specification of this question depends on the signature of the model, but not the functional relations of the model. The answer provided by the model does of course depend on the functional relations. Consider now a data strategy, D, in which data are gathered on X and Y for n randomly selected units. An answer, a^A, is then generated using ordinary least squares as the answer strategy, A.

We have specified all the components of MIDA. We now ask: How strong is this research design? One way to answer this question is with respect to the diagnosand “bias.” Here the model provides an answer, a^M, to the inquiry, so the distribution of bias given the model, a^A − a^M, can be calculated.

In this example, the expected performance of the design may be poor, as measured by the bias diagnosand, because the data and analysis strategy do not handle the confounding described by the model (see Supplementary Materials Section 1 for a formal declaration and diagnosis of this design).

2 The distinction lies in whether the conditional probability is recorded through passive observation or active intervention to manipulate the probabilities of the conditioning distribution. For example, Pr(X1 | X2 = 1) might indicate the conditional probability that it is raining, given that Jack has his umbrella, whereas Pr(X1 | do(X2 = 1)) would indicate the probability of rain, given Jack is made to carry an umbrella.


In comparison, better performance may be achieved through an alternative data strategy (e.g., where D′ randomly assigned X before recording X and Y) or an alternative analysis strategy (e.g., A′ conditions on Z). These design evaluations depend on the model, and so one might reasonably ask how performance would look were the model different (for example, if the underlying process involved nonlinearities).

In all cases, the evaluation of a design depends on the assessment of a diagnosand and on comparing the diagnoses to what could be achieved under alternative designs.
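The confounded design just described can be diagnosed directly by simulation. The following is a minimal base-R sketch (not the formal declaration in Supplementary Materials Section 1) that estimates the bias diagnosand for the answer strategy A (regress Y on X) and for the alternative A′ (condition on Z); the values b = 1, n = 100, and m = 2,000 simulation runs are illustrative assumptions.

    # Minimal base-R sketch of design diagnosis by simulation (illustrative values).
    set.seed(1)
    b <- 1      # effect of X on Y under the model (assumed)
    n <- 100    # number of randomly selected units in the data strategy (assumed)
    m <- 2000   # number of simulation runs (assumed)

    one_run <- function() {
      # Model M: Z confounds X and Y
      Z <- rnorm(n)
      X <- Z + rnorm(n)
      Y <- b * X + Z + rnorm(n)
      a_M <- b  # Inquiry I: average effect of a unit increase in X on Y
      # Answer strategies: A regresses Y on X; A' also conditions on Z
      a_A      <- unname(coef(lm(Y ~ X))["X"])
      a_Aprime <- unname(coef(lm(Y ~ X + Z))["X"])
      c(A = a_A - a_M, Aprime = a_Aprime - a_M)  # diagnostic statistics e = a_A - a_M
    }

    e <- replicate(m, one_run())
    rowMeans(e)  # diagnosand: bias = E(e); large for A, near zero for A'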

Choice of Diagnosands

What diagnosands should researchers choose? Although researchers commonly focus on statistical power, a larger range of diagnosands can be examined and may provide more informative diagnoses of design quality. We list and describe some of these in Table 1, indicating for each the design information that is required in order to calculate them.

The set listed here includes many canonical diagnosands used in classical quantitative analyses. Diagnosands can also be defined for design properties that are often discussed informally but rarely subjected to formal investigation. For example, one might define an inference as “robust” if the same inference is made under different analysis strategies. One might conclude that an intervention gives “value for money” if estimates are of a certain size, and be interested in the probability that a researcher is correct in concluding that an intervention provides value for money.

We believe there is not yet a consensus around diagnosands for qualitative designs. However, in certain treatments clear analogues of diagnosands exist, such as sampling bias or estimation bias (e.g., Herron and Quinn 2016). There are indeed notions of power, coverage, and consistency for QCA researchers (e.g., Baumgartner and Thiem 2017; Rohlfing 2018) and concerns around correct identification of causes of effects, or of causal pathways, for scholars using process tracing (e.g., Bennett 2015; Fairfield and Charman 2017; Humphreys and Jacobs 2015; Mahoney 2012).

Though many of these diagnosands are familiar to scholars using frequentist approaches, analogous diagnosands can be used to assess Bayesian estimation strategies (see Rubin 1984), and as we illustrate below, some diagnosands are unique to Bayesian answer strategies.

Given that there are many possible diagnosands, the overall evaluation of a design is both multidimensional and qualitative. For some diagnosands, quality thresholds have been established through common practice, such as the standard power target of 0.80. Some researchers are unsatisfied unless the “bias” diagnosand is exactly zero.

TABLE 1. Examples of Diagnosands and the Elements of the Model (M), Inquiry (I), Data Strategy (D), and Answer Strategy (A) Required in Order for a Design to be Diagnosand-Complete for Each Diagnosand

Power: Probability of rejecting the null hypothesis of no effect. Requires M, D, A.
Estimation bias: Expected difference between estimate and estimand. Requires M, I, D, A.
Sampling bias: Expected difference between population average treatment effect and sample average treatment effect (Imai, King, and Stuart 2008). Requires M, I, D.
RMSE: Root mean-squared error. Requires M, I, D, A.
Coverage: Probability the confidence interval contains the estimand. Requires M, I, D, A.
SD of estimates: Standard deviation of estimates. Requires M, D, A.
SD of estimands: Standard deviation of estimands. Requires M, I, D.
Imbalance: Expected distance of covariates across treatment conditions. Requires M, D.
Type S rate: Probability the estimate has the incorrect sign, if statistically significant (Gelman and Carlin 2014). Requires M, I, D, A.
Exaggeration ratio: Expected ratio of the absolute value of the estimate to the estimand, if statistically significant (Gelman and Carlin 2014). Requires M, I, D, A.
Value for money: Probability that a decision based on the estimated effect yields net benefits. Requires M, I, D, A.
Robustness: Joint probability of rejecting the null hypothesis across multiple tests. Requires M, D, A.


Yet for most diagnosands, we only have a sense of better and worse, and improving one can mean hurting another, as in the classic bias-variance tradeoff. Our goal is not to dichotomize designs into high and low quality, but instead to facilitate the assessment of design quality on dimensions important to researchers.

What is a Complete Research Design Declaration?

A declaration of a research design that is in some sense complete is required in order to implement it, communicate its essential features, and assess its properties. Yet existing definitions make clear that there is no single conception of a complete research design: at the time of writing, the Consolidated Standards of Reporting Trials (CONSORT) Statement widely used in medicine includes 22 features, while other proposals range from nine to 60 components.3

We propose a conditional conception of completeness: we say a design is “diagnosand-complete” for a given diagnosand if that diagnosand can be calculated from the declared design. Thus a design that is diagnosand-complete for one diagnosand may not be for another. Consider, for example, the diagnosand statistical power. Power is the probability of obtaining a statistically significant result. Equivalently, it is the probability that the p-value is lower than a critical value (e.g., 0.005). Thus, power-completeness requires that the answer strategy return a p-value and that a significance threshold be specified. It does not, however, require a well-defined estimand, such as a true effect size (see Table 1, where the power diagnosand does not require I). In contrast, bias- or RMSE-completeness does not require a hypothesis test, but does require the specification of an estimand.
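As a small illustration in the declare_diagnosands() syntax shown later in Table 3, the two kinds of completeness place different demands on the declared design. The estimate and estimand column names follow Table 3; the p.value column and the 0.05 threshold are assumptions about what the answer strategy reports.

    # A sketch in the declare_diagnosands() syntax of Table 3 (column names assumed).
    # Power-completeness: A must report a p-value and a threshold must be chosen; no estimand needed.
    power_diagnosand <- declare_diagnosands(power = mean(p.value <= 0.05))
    # Bias-completeness: an estimand must be declared, but no hypothesis test is required.
    bias_diagnosand  <- declare_diagnosands(bias = mean(estimate - estimand))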

Diagnosand-completeness is a desirable property to the extent that it means a diagnosand can be calculated. How useful diagnosand-completeness is depends on whether the diagnosand is worth knowing. Thus, evaluating completeness should focus first on whether the diagnosands for which completeness holds are indeed useful ones.

The utility of a diagnosis depends in part on whether the information underlying the declaration is believable. For instance, a design may be bias-complete, but only under the assumptions of a given spillover structure. Readers might disagree with these assumptions. Even in this case, however, an advantage of declaration is a clarification of the conditions for completeness.

EXISTING APPROACHES TO LEARNING ABOUT RESEARCH DESIGNS

Much quantitative research design advice focuses on one aspect of design at a time, rather than on the ways in which multiple components of a research design relate to each other. Statistics articles and textbooks tend to focus on a specific class of estimators (Angrist and Pischke 2008; Imbens and Rubin 2015; Rosenbaum 2002), set of estimands (Heckman, Urzua, and Vytlacil 2006; Imai, King, and Stuart 2008; Deaton 2010; Imbens 2010), data collection strategies (Lohr 2010), or ways of thinking about data-generation models (Gelman and Hill 2006; Pearl 2009). In Shadish, Cook, and Campbell (2002, 156), for example, the “elements of a design” consist of “assignment, measurement, comparison groups and treatments,” a definition that does not include questions of interest or estimation strategies. In some instances, quantitative researchers do present multiple elements of research design. Gerber and Green (2012), for example, examine data-generating models, estimands, assignment and sampling strategies, and estimators for use in experimental causal inference; and Shadish, Cook, and Campbell (2002) and Dunning (2012) similarly describe the various aspects of designing quasi-experimental research and exploiting natural experiments.

In contrast, a number of qualitative treatments focus on integrating the many stages of a research design, from theory generation, to case selection, measurement, and inference. In an influential book on mixed method research design for comparative politics, for example, Geddes (2003) articulates the links between theory formation (M), research question formulation (I), case selection and coding strategies (D), and strategies for case comparison and inference (A). King, Keohane, and Verba (1994) and the ensuing discussion in Brady and Collier (2010) highlight how alternative qualitative strategies present tradeoffs in terms of diagnosands such as bias and generalizability. However, few of these texts investigate those diagnosands formally in order to measure the size of the tradeoffs between alternative qualitative strategies.4 Qualitative approaches, including process tracing and qualitative comparative analysis, sometimes appear almost hermetic, complete with specific epistemologies, types of research questions, modes of data gathering, and analysis. Though integrated, these strategies are often not formalized. And if they are, it is seldom in a way that enables comparison with other approaches or quantification of design tradeoffs.

MIDA represents an attempt to thread the needle between these two traditions. Quantifying the strength of designs necessitates a language for formally describing the essential features of a design. The relatively fragmented manner in which quantitative design is thought of in existing work may produce real research risks for individual research projects. In contrast, the more holistic approaches of some qualitative traditions offer many benefits, but formal design diagnosis can be difficult. Our hope is that MIDA provides a framework for doing both at once.

3 See “Pre Analysis Plan Template” (60 features); World Bank Development Impact Blog (nine features).

4 Some exceptions are provided on page 4. Herron and Quinn (2016), for example, conduct a formal investigation of the RMSE and bias exhibited by the alternative case selection strategies proposed in an influential piece by Seawright and Gerring (2008).


A useful way to illustrate the fragmented nature of thinking on research design among quantitative scholars is to examine the tools that are actually used to do research design. Perhaps the most prominent of these are “power calculators.” These have an all-design flavor in the sense that they ask whether, given an answer strategy, a data collection strategy is likely to return a statistically significant result. Power calculations like these are done using formulae (e.g., Cohen 1977; Haseman 1978; Lenth 2001; Muller et al. 1992; Muller and Peterson 1984); software tools such as Web applications and general statistical software (e.g., easypower for R and Power and Sample Size for Stata), as well as standalone tools (e.g., Optimal Design, G*Power, nQuery, SPSS Sample Power); and sometimes Monte Carlo simulations.

In most cases these tools, though touching on multiple parts of a design, in fact leave almost no scope to describe what the data generating processes can be, what the questions of interest are, and what types of analyses will be undertaken. We conducted a census of currently available diagnostic tools (mainly power calculators) and assessed their ability to correctly diagnose three variants of a common experimental design, in which assignment probabilities are heterogeneous by block.5 The first variant simply uses a difference-in-means estimator (DIM), the second conditions on block fixed effects (BFE), and the third includes inverse-probability weighting to account for the heterogeneous assignment probabilities (BFE-IPW).

We found that the vast majority of tools used are unable to correctly characterize the tradeoffs these three variants present. As shown in Table 2, none of the tools was able to diagnose the design while taking account of important features that bias unweighted estimators.6 In our simulations the result is an overstatement of the power of the difference-in-means.

Because no tool was able to account for weighting in the estimator, none was able to calculate the power for the IPW-BFE answer strategy. Moreover, no tool sought to calculate the design’s bias, root mean-squared error, or coverage (which require information on I). The companion software to this article, which was designed based on MIDA, illustrates that power is a misleading indicator of quality in this context. While the IPW-BFE estimator is better powered and less biased than the BFE estimator, its purported efficiency is misleading. IPW-BFE is better powered than DIM and BFE because it produces biased variance estimates that lead to a coverage probability that is too low. In terms of RMSE and the standard deviation of estimates, the IPW-BFE strategy does not outperform the BFE estimator. This exercise should not be taken as proof of the superiority of one strategy over another in general; instead we learn about their relative performance for particular diagnosands for the specific design declared.
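To see why diagnosing such a design requires more than the number of units, the following is a minimal base-R sketch (a simplified variant with only two estimators, not the authors' simulation code) with two blocks, heterogeneous assignment probabilities, and block-correlated treatment effects; the block sizes, probabilities, and effects are illustrative assumptions. The unweighted difference-in-means is biased, while an inverse-probability-weighted comparison is not.

    # Minimal base-R sketch: heterogeneous assignment probabilities bias the unweighted DIM.
    set.seed(2)
    sims  <- 2000
    n_b   <- c(100, 100)   # block sizes (assumed)
    p_b   <- c(0.2, 0.8)   # assignment probabilities by block (assumed)
    tau_b <- c(0.0, 1.0)   # block average treatment effects (assumed)

    one_run <- function() {
      block <- rep(1:2, n_b)
      Y0 <- rnorm(sum(n_b))
      Y1 <- Y0 + tau_b[block]
      Z  <- rbinom(sum(n_b), 1, p_b[block])
      Y  <- ifelse(Z == 1, Y1, Y0)
      ate <- mean(Y1 - Y0)  # estimand: population average treatment effect
      dim_est <- mean(Y[Z == 1]) - mean(Y[Z == 0])  # unweighted difference in means
      w       <- ifelse(Z == 1, 1 / p_b[block], 1 / (1 - p_b[block]))
      ipw_est <- weighted.mean(Y[Z == 1], w[Z == 1]) - weighted.mean(Y[Z == 0], w[Z == 0])
      c(DIM = dim_est - ate, IPW = ipw_est - ate)
    }

    rowMeans(replicate(sims, one_run()))  # DIM is biased upward; IPW is approximately unbiased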

We draw a number of conclusions from this review of tools.

First, researchers are generally not designing studies using the actual strategies that they will use to conduct analysis. From the perspective of the overall designs, the power calculations are providing the wrong answer.

Second, the tools can drive scholars toward relatively narrow design choices. The inputs to most power calculators are data strategy elements like the number of units or clusters.

TABLE 2. Existing Tools Cannot Declare Many Core Elements of Designs and, as a Result, Can Only Calculate Some Diagnosands

(a) Declare design elements
  (M) Effect and block size correlated   0/30
  (I) Estimand                           0/30
  (D) Sampling procedure                 0/30
  (D) Assignment procedure               0/30
  (D) Block sizes vary                   1/30
  (A) Probability weighting              0/30

(b) Diagnosis capabilities
  Power (DIM estimator)                  28/30
  Power (BFE estimator)                  13/30
  Power (IPW-BFE estimator)              0/30
  Bias (any estimator)                   0/30
  Coverage (any estimator)               0/30
  SD of estimates (any estimator)        0/30

Note: Panel (a) indicates the number of tools that allow declaration of a particular feature of the design as part of the diagnosis. In the first row, for example, 0/30 indicates that no tool allows researchers to declare correlated effect and block sizes. Panel (b) indicates the number of tools that can perform a particular diagnosis. Results correspond to a design tool census concluded in July 2017 and do not include tools published since then.

5 We assessed tools listed in four reviews of the literature (Green and MacLeod 2016; Groemping 2016; Guo et al. 2013; Kreidler et al. 2013), in addition to the first thirty results from Google searches of the terms “statistical bias calculator,” “statistical power calculator,” and “sample size calculator.” We found no admissible tools using the term “statistical bias calculator.” Thirty of the 143 tools we identified were able to diagnose inferential properties of designs, such as their power. See Supplementary Materials Section 2 for further details on the tool survey.

6 For example, no tool could account for: the posited correlation between block size and potential outcomes; the sampling strategy; the exact randomization procedure; the formal definition of the estimand as the population average treatment effect; or the use of inverse-probability weighting. The one tool (GLIMMPSE) that was able to account for the blocking strategy encountered an error and was unable to produce diagnostic statistics.


Power calculators do not generally focus on broader aspects of a design, like alternative assignment procedures or the choice of estimator. While researchers may have an awareness that such tradeoffs exist, quantifying the extent of the tradeoff is by no means obvious until the model, inquiry, data strategy, and answer strategy are declared in code.

Third, the tools focus attention on a relatively narrow set of questions for evaluating a design. While understanding power is important for some designs, the range of possible diagnosands of interest is much broader. Quantitative researchers tend to focus on power, when other diagnosands such as bias, coverage, or RMSE may also be important. MIDA makes clear, however, that these features of a design are often linked in ways that current practice obscures.

A second illustration of risks arising from a fragmented conceptualization of research design comes from debates over the disproportionate focus on estimators to the detriment of careful consideration of estimands. Huber (2013), for example, worries that the focus on identification leads researchers away from asking compelling questions. In the extreme, the estimators themselves (and not the researchers) appear to select the estimand of interest. Thus, Deaton (2010) highlights how instrumental variables approaches identify effects for a subpopulation of compliers. Who the compliers are is jointly determined by the characteristics of the subjects and also by the data strategy. The implied estimand (the Local Average Treatment Effect, sometimes called the Complier Average Causal Effect) may or may not be of theoretical interest. Indeed, as researchers swap one instrument for another, the implied estimand changes. Deaton’s worry is that researchers are getting an answer, but they do not know what the question is.7 Were the question posed as the average effect of a treatment, then the performance of the instrument would depend on how well the instrumental variables regression estimates that quantity, and not on how well it answers the question for a different subpopulation. This is not done in usual practice, however, as estimands are often not included as part of a research design.

To illustrate risks arising from the combination of a fractured approach to design in the formal quantitative literature and the holistic but often less formal approaches in the qualitative literature, we point to difficulties these approaches have in learning from each other.

Goertz and Mahoney (2012) tell a tale of two cultures in which qualitative and quantitative researchers differ not just in the analytic tools they use, but in very many ways, including, fundamentally, in their conceptualizations of causation and the kinds of questions they ask. The authors claim (though not all would agree) that qualitative researchers think of causation in terms of necessary and/or sufficient causes, whereas many quantitative researchers focus on potential outcomes, average effects, and structural equations. One might worry that such differences would preclude design declaration within a common framework, but they need not, at least for qualitative scholars who consider causes in counterfactual terms.8

For example, a representation of a causal process in terms of causal configurations might take the form Y = AB + C, meaning that the presence of A and B or the presence of C is sufficient to produce Y. This configuration statement maps directly into a potential outcomes function (or structural equation) of the form Y(A, B, C) = max(AB, C). Given this, the marginal effect of one variable, conditional on others, can be translated into the conditions in which the variable is difference-making in the sense of altering relevant INUS9 conditions: E(Y(A = 1 | B, C) − Y(A = 0 | B, C)) = E(1(B = 1, C = 0)) = Pr(B = 1, C = 0).10 Describing these differences in notation as differences in notions of causality suggests that there is limited scope for considering designs that mix approaches, and that there is little that practitioners of one approach can say to practitioners of another approach. In contrast, clarification that the difference is one regarding the inquiry—i.e., which combinations of variables guarantee a given outcome and not the average marginal effect of a variable across conditions—opens up the possibility to assess how quantitative estimation strategies fare when applied to estimating this estimand.
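The mapping from a configurational statement to a potential outcomes function is easy to make concrete. The following R sketch, under an assumed joint distribution in which A, B, and C are independent coin flips (an illustrative assumption, not part of the example above), computes the average marginal effect of A implied by Y(A, B, C) = max(AB, C) and checks that A is difference-making exactly when B = 1 and C = 0.

    # A sketch: the configurational model Y = AB + C as a potential outcomes function.
    po <- function(A, B, C) pmax(A * B, C)   # Y(A, B, C) = max(AB, C)

    # Assumed joint distribution: A, B, C independent Bernoulli(0.5) (illustrative only)
    grid <- expand.grid(A = 0:1, B = 0:1, C = 0:1)
    grid$prob <- 1 / 8

    # Average marginal effect of A: E[Y(1, B, C) - Y(0, B, C)] = Pr(B = 1, C = 0) = 0.25
    with(grid, sum(prob * (po(1, B, C) - po(0, B, C))))

    # A is difference-making exactly in the INUS condition B = 1, C = 0
    with(grid, all((po(1, B, C) - po(0, B, C)) == (B == 1 & C == 0)))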

A second point of difference is nicely summarized by Goertz and Mahoney (2012, 230): “qualitative analysts adopt a ‘causes-of-effects’ approach to explanation [… whereas] statistical researchers follow the ‘effects-of-causes’ approach employed in experimental research.” We agree with this association, though from a MIDA perspective we see such distinctions as differences in estimands and not as differences in ontology. Conditioning on a given X and Y, the effects-of-causes question is E(Yi(Xi = 1) − Yi(Xi = 0)). By contrast, the causes-of-effects question can be written Pr(Yi(0) = 0 | Xi = 1, Yi(1) = 1). This expression asks what are the chances that Y would have been 0 if X were 0 for a unit i for which X was 1 and Yi(1) was 1. The two questions are of a similar form, though the causes-of-effects question is harder to answer (Dawid 2000). Once thought of as questions about what the estimand is, one can assess directly when one or another estimation strategy is more or less effective at facilitating inference about the estimand of interest. In fact, experiments are in general not able to solve the identification problem for causes-of-effects questions (Dawid 2000), and this may be one reason why these questions are often ignored by quantitative researchers. Exceptions include Yamamoto (2012) and Balke and Pearl (1994).
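Because both quantities are defined over the same model, a declared design can target either one. The following base-R sketch assumes full knowledge of both potential outcomes for every unit (something no data strategy delivers) under an assumed data-generating process, simply to show that the two inquiries are different summaries of the same potential outcomes.

    # A sketch: effects-of-causes vs. causes-of-effects as two estimands over one model (assumed DGP).
    set.seed(3)
    N  <- 1e5
    U  <- rnorm(N)               # unobserved heterogeneity
    Y0 <- as.numeric(U > 0.5)    # potential outcome under X = 0
    Y1 <- as.numeric(U > -0.5)   # potential outcome under X = 1
    X  <- rbinom(N, 1, 0.5)      # treatment assignment

    mean(Y1 - Y0)                    # effects-of-causes: E[Y(1) - Y(0)]
    mean(Y0[X == 1 & Y1 == 1] == 0)  # causes-of-effects: Pr(Y(0) = 0 | X = 1, Y(1) = 1)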

Below, we demonstrate gains from declaration of designs in a common framework by providing examples

7 Aronow and Samii (2016) express a similar concern for models using regression with controls.

8 Schneider and Wagemann (2012, 320–1) also note that there are no grounds to assume incommensurability, noting that “if set-theoretic, method-specific concepts… can be translated into the potential outcomes framework, the communication between scholars from different research traditions will be facilitated.” See also Mahoney (2008) on the consistency of these conceptualizations.
9 An INUS condition is “an insufficient but non-redundant part of an unnecessary but sufficient condition” (Mackie 1974).
10 Goertz and Mahoney (2012, 59) also make the point that the difference is in practice, and is not fundamental: “Within quantitative research, it does not seem useful to group cases according to common causal configurations on the independent variables. Although one could do this, it is not a practice within the tradition.” (Emphasis added.)


of design declaration for crisp-set qualitative comparative analysis (Ragin 1987), nested case analysis (Lieberman 2005), and CPO (causal process observation) process tracing (Collier 2011; Fairfield 2013), alongside experimental and quasi-experimental designs.

Overall, this discussion suggests that the common ways in which designs are conceptualized produce three distinct problems. First, the different components of a design may not be chosen to work optimally together. Second, consideration is unevenly distributed across components of a design. Third, the absence of a common framework across research traditions obscures where the points of overlap and difference lie and may limit both critical assessment of approaches and cross-fertilization. We hope that the MIDA framework and tools can help address these challenges.

DECLARING AND DIAGNOSING RESEARCH DESIGNS IN PRACTICE

A design that can be declared in computer code can then be simulated in order to diagnose its properties. The approach to declaration that we advocate is one that conceives of a design as a concatenation of steps. To illustrate, the top panel of Table 3 shows how to declare a design in code using the companion software to this paper, DeclareDesign (Blair et al. 2018). The resulting set of objects (p_U, f_Y, I, p_S, p_Z, R, and A) are all steps. Formally, each of these steps is a function. The design is the concatenation of these, which we represent using the “+” operator: design <- p_U + f_Y + I + p_S + p_Z + R + A. A single simulation runs through these steps, calling each of the functions successively. A design diagnosis conducts m simulations, then summarizes the resulting distribution of diagnostic statistics in order to estimate the diagnosand.

Diagnosands can be estimated with higher levels of precision by increasing m. However, simulations are often computationally expensive. In order to assess whether researchers have conducted enough simulations to be confident in their diagnosand estimates, we recommend estimating the sampling distributions of the diagnosands via the nonparametric bootstrap.11

TABLE 3. A Procedure for Declaring and Diagnosing Research Designs Using the Companion Software DeclareDesign (Blair et al. 2018)

Design declaration
  M: Declare background variables       p_U <- declare_population(N = 200, u = rnorm(N))
     Declare functional relations       f_Y <- declare_potential_outcomes(Y ~ Z + u)
  I: Declare inquiry                     I  <- declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0))
  D: Declare sampling                   p_S <- declare_sampling(n = 100)
     Declare assignment                 p_Z <- declare_assignment(m = 50)
     Declare outcome revelation          R  <- declare_reveal(Y, Z)
  A: Declare answer strategy             A  <- declare_estimator(Y ~ Z, estimand = "ATE")
  Declare design, ⟨M, I, D, A⟩          design <- p_U + f_Y + I + p_S + p_Z + R + A

Design simulation (1 draw)
  1. Draw a population u using P(U)                             u   <- p_U()
  2. Generate potential outcomes using f_Y                      D   <- f_Y(u)
  3. Calculate estimand a^M                                     a_M <- I(D)
  4. Draw data d, given model assumptions and data strategies   d   <- R(p_Z(p_S(D)))
  5. Calculate answer a^A using A and d                         a_A <- A(d)
  6. Calculate a diagnostic statistic e using a^A and a^M       e   <- a_A["estimate"] - a_M["estimand"]

Design diagnosis (m draws)
  Declare a diagnosand      bias <- declare_diagnosands(bias = mean(estimate - estimand))
  Calculate a diagnosand    diagnose_design(design, diagnosands = bias, sims = m)

Note: The top panel includes each element of a design that can be declared along with code used to declare them. The middle panel describes steps to simulate that design. The bottom panel includes the procedure to diagnose the design.
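Gathered into a single script, the declaration in Table 3 can be run end to end; aside from the library() call and the choice of sims, every line follows the table, and the code assumes the DeclareDesign API as described in this paper (later package versions may differ).

    # The Table 3 declaration as one script (DeclareDesign API as described in this paper).
    library(DeclareDesign)

    p_U <- declare_population(N = 200, u = rnorm(N))
    f_Y <- declare_potential_outcomes(Y ~ Z + u)
    I   <- declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0))
    p_S <- declare_sampling(n = 100)
    p_Z <- declare_assignment(m = 50)
    R   <- declare_reveal(Y, Z)
    A   <- declare_estimator(Y ~ Z, estimand = "ATE")

    design <- p_U + f_Y + I + p_S + p_Z + R + A

    bias <- declare_diagnosands(bias = mean(estimate - estimand))
    diagnose_design(design, diagnosands = bias, sims = 500)  # sims chosen for illustration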

11 In their paper on simulating clinical trials through Monte Carlo, Morris, White, and Crowther (2019) provide helpful analytic formulae for deriving Monte Carlo standard errors for several diagnosands (“performance measures”). In the companion software, we adopt a nonparametric bootstrap approach that is able to calculate standard errors for any user-provided diagnosand.


With the estimated diagnosand and its standard error, we can characterize our uncertainty about whether the range of likely values of the diagnosand compares favorably to reference values such as statistical power of 0.8.12
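As a sketch of the bootstrap logic (not the companion software's implementation), given m simulated diagnostic statistics one can estimate a diagnosand and its Monte Carlo standard error as follows; the simulated statistics below are placeholders.

    # A sketch: nonparametric bootstrap standard error for a diagnosand estimate.
    set.seed(4)
    e <- rnorm(500, mean = 0.5, sd = 1)   # placeholder diagnostic statistics from m = 500 runs

    diagnosand_hat <- mean(e)             # plug-in estimate of the diagnosand (here, bias)
    boot_means <- replicate(2000, mean(sample(e, replace = TRUE)))
    c(estimate = diagnosand_hat, se = sd(boot_means))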

Design diagnosis places a burden on researchers to come up with a causal model, M. Since researchers presumably want to learn about the model, declaring it in advance may seem to beg the question. Yet declaring a model is often unavoidable when diagnosing designs. In practice, doing so is already familiar to any researcher who has calculated the power of a design, which requires the specification of effect sizes. The seeming arbitrariness of the declared model can be mitigated by assessing the sensitivity of diagnosis to alternative models and strategies, which is relatively straightforward given a diagnosand-complete design declaration. Further, researchers can inform their substantive models with existing data, such as baseline surveys. Just as power calculators focus attention on minimum detectable effects, design declaration offers a tool to demonstrate design properties and how they change depending on researcher assumptions.

In the next sections, we illustrate how research designs that aim to answer descriptive, causal, and exploratory research questions can be declared and diagnosed in practice. We then describe how the estimand-focused approach we propose works with designs that focus less on estimand estimation and more on modeling data generating processes. In all cases, we highlight potential gains from declaring designs using the MIDA framework.

Descriptive Inference

Descriptive research questions often center on measuring a parameter in a sample or in the population, such as the proportion of voters in the United States who support the Democratic candidate for president. Although seemingly very different from designs that focus on causal inference, because of the lack of explanatory variables, the formal differences are not great.

Survey Designs

We examine an estimator of candidate support that conditions on being a “likely voter.” For this problem, the data that help researchers predict who will vote are of critical importance. In Supplementary Materials Section 3.1, we declare a model in which latent voters are likely to vote for a candidate, but overstate their true propensity to vote. The inquiry is the true underlying support for the candidate among those who will vote, while the data strategy involves taking a random sample from the national adult population and asking survey questions that measure vote intention and likelihood of voting. As an answer strategy, we estimate support for the candidate among likely voters. The diagnosis shows that when people misreport whether they vote, estimates of candidate support may be biased, a commonplace observation about the weaknesses of survey measures. The utility of design declaration here is that we can calibrate how far off our estimates will be under reasonable models of misreporting.
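A minimal base-R sketch of this logic (not the declaration in Supplementary Materials Section 3.1) is below; the turnout rate, support rates, and misreporting rate are illustrative assumptions.

    # A sketch: bias in "likely voter" estimates of support when nonvoters overstate turnout.
    set.seed(5)
    sims <- 2000
    N    <- 1000

    one_run <- function() {
      votes      <- rbinom(N, 1, 0.5)                             # who will actually vote (assumed)
      support    <- rbinom(N, 1, ifelse(votes == 1, 0.55, 0.40))  # voters lean toward the candidate (assumed)
      says_votes <- ifelse(votes == 1, 1, rbinom(N, 1, 0.30))     # 30% of nonvoters overstate turnout (assumed)
      estimand   <- mean(support[votes == 1])        # true support among actual voters
      estimate   <- mean(support[says_votes == 1])   # support among self-reported likely voters
      estimate - estimand
    }

    mean(replicate(sims, one_run()))  # bias: negative here, because misreporting nonvoters lean away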

Bayesian Descriptive Inference

Although our simulation approach has a frequentist flavor, the MIDA framework itself can also be applied to Bayesian strategies. In Supplementary Materials Section 3.2, we declare a Bayesian descriptive inference design. The model stipulates a latent probability of success for each unit, and makes one binomial draw for each according to this probability. The inquiry is the true latent probability, and the data strategy involves a random sample of relatively few units. We consider two answer strategies: first, we stipulate uniform priors, with a mean of 0.50 and a standard deviation of 0.29; in the second, we place more prior probability mass at 0.50, with a standard deviation of 0.11.

Once declared, the design can be diagnosed not only in terms of its bias, but also as a function of quantities specific to Bayesian estimation approaches, such as the expected shift in the location and scale of the posterior distribution relative to the prior distribution. The diagnosis shows that the informative prior approach yields more certain and more biased inferences than the uniform prior approach. In terms of the bias-variance tradeoff, the informative priors decrease the posterior standard deviation by 40% relative to the uniform priors, but increase the bias by 33%.
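The following Beta-binomial sketch illustrates this comparison; it is not the declared design in Supplementary Materials Section 3.2. We assume a Beta(1, 1) prior for the uniform strategy (mean 0.50, sd 0.29), a Beta(10, 10) prior for the informative strategy (mean 0.50, sd roughly 0.11), and an illustrative true probability of 0.7.

# Sketch: bias and posterior spread under two priors (illustrative assumptions)
set.seed(3)

diagnose <- function(a, b, n = 20, pi_true = 0.7, sims = 2000) {
  draws <- replicate(sims, {
    successes <- rbinom(1, size = n, prob = pi_true)
    post_mean <- (a + successes) / (a + b + n)
    post_sd   <- sqrt((a + successes) * (b + n - successes) /
                        ((a + b + n)^2 * (a + b + n + 1)))
    c(mean = post_mean, sd = post_sd)
  })
  c(bias              = mean(draws["mean", ]) - pi_true,
    mean_posterior_sd = mean(draws["sd", ]))
}

rbind(uniform     = diagnose(a = 1,  b = 1),
      informative = diagnose(a = 10, b = 10))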

Causal Inference

The approach to design diagnosis we propose can be used to declare and diagnose a range of research designs typically employed to answer causal questions in the social sciences.

Process Tracing

Although not all approaches to process tracing are readily amenable to design declaration (e.g., theory-building process tracing, see Beach and Pedersen 2013, 16), some are. We focus here on Bayesian frameworks that have been used to describe process tracing logics (e.g., Bennett 2015; Humphreys and Jacobs 2015; Fairfield and Charman 2017). In these approaches, "causal process observations" (CPOs) are believed to be observed with different probabilities depending on the causal process that has played out in a case. Ideal-type CPOs as described by Van Evera (1997) are "hoop tests" (CPOs that are nearly certain to be seen if the hypothesis is true, but likely either way), "smoking-gun tests" (CPOs that are unlikely to be seen in general but are extremely unlikely if a hypothesis is false), and "doubly-decisive tests" (CPOs that are likely to be seen if and only if a hypothesis is true).13 Unlike much quantitative

12 This procedure depends on the researcher choosing a "good" diagnosand estimator. In nearly all cases, diagnosands will be features of the distribution of a diagnostic statistic that, given i.i.d. sampling, can be consistently estimated via plug-in estimation (for example taking sample means). Our simulation procedure, by construction, yields i.i.d. draws of the diagnostic statistic.

13 See also Collier, Brady, and Seawright (2004), Mahoney (2012), Bennett and Checkel (2014), Fairfield (2013).


inference, such studies often pose "causes-of-effects" inquiries (did the presence of a strong middle class cause a revolution?), and not "effects-of-causes" questions (what is the average effect of a strong middle class on the probability of a revolution happening?) (Goertz and Mahoney 2012). Such inquiries often imply a hypothesis—"the strong middle class caused the revolution," say—that can be investigated using Bayes' rule.

Formalizing this kind of process-tracing exercise leads to non-obvious insights about the tradeoffs involved in committing to one or another CPO strategy ex ante. We declare a design based on a model of the world in which both the driver, X, and the outcome, Y, might be present in a given case either because X caused Y or because Y would have been present regardless of X (or perhaps, an alternative cause was responsible for Y). See Supplementary Materials Section 3.3. The inquiry is whether X in fact caused Y in the specific case under analysis (i.e., would Y have been different if X were different?). The data strategy consists of selecting one case from a population of cases, based on the fact that both X and Y are present, and then collecting two causal process observations. Even before diagnosis, the declaration of the design illustrates an important point: the case selection strategy informs the answer strategy by enabling the researcher to narrow down the number of causal processes that might be at play. This greatly simplifies the application of Bayes' rule to the case in question.

Importantly, the researcher attaches two different ex ante probabilities to the observation of confirmatory evidence in each CPO, depending on whether X did or did not cause Y. Specifically, the first CPO contains evidence that is more likely to be seen when the hypothesis is true, Pr(E1|H) = 0.75, but even when H is false and Y happened irrespective of X, there is some probability of observing the first piece of evidence: Pr(E1|¬H) = 0.25. The first CPO thus constitutes a "straw-in-the-wind" test (albeit a reasonably strong one). By contrast, the probability of observing the evidence in the second CPO when the hypothesis that X caused Y is true, Pr(E2|H), is 0.30, whereas the probability of observing the evidence when the hypothesis is false, Pr(E2|¬H), is only 0.05. The second CPO thus constitutes a "smoking gun" test of H. Observing the second piece of evidence is more informative than observing the first, because it is so unlikely to observe a smoking gun when the hypothesis is false.

Diagnosis reveals that a researcher who relied solely on the weaker "straw-in-the-wind" test would make better inferences on average than one who relied solely on the "smoking gun" test. One does better relying on the straw because, even if it is less informative when observed, it is much more commonly observed than the smoking gun, which is an informative, but rare, clue. The Collier (2011, 826) assertion that, of the four tests, straws-in-the-wind are "the weakest and place the least demand on the researcher's knowledge and assumptions" might thus be seen as an advantage rather than a disadvantage. In practice, of course, scholars often seek multiple CPOs, possibly of different strength (see, for example, Fairfield 2013). In such cases, the diagnosis suggests the learning depends on the ways in which

these CPOs are correlated. There are large gains from seeking two CPOs when they are negatively correlated—for example, if they arise from alternative causal processes. But there are weak gains when CPOs arise from the same process. Presentations of process tracing rarely describe correlations between CPO probabilities, yet the need to specify these (and the gain from doing so) presents itself immediately when a process tracing design is declared.
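A minimal sketch of the single-clue comparison is below. The clue probabilities are those given in the text; the flat prior Pr(H) = 0.5 and the use of the mean squared error of the posterior as the diagnosand are illustrative assumptions rather than the declaration in Supplementary Materials Section 3.3.

# Sketch: Bayesian updating from a single causal process observation
set.seed(4)

posterior <- function(seen, p_E_H, p_E_notH, prior = 0.5) {
  like_H    <- ifelse(seen, p_E_H,    1 - p_E_H)
  like_notH <- ifelse(seen, p_E_notH, 1 - p_E_notH)
  like_H * prior / (like_H * prior + like_notH * (1 - prior))
}

diagnose_clue <- function(p_E_H, p_E_notH, sims = 50000) {
  H    <- rbinom(sims, 1, 0.5)                          # did X cause Y in this case?
  seen <- rbinom(sims, 1, ifelse(H == 1, p_E_H, p_E_notH)) == 1
  post <- posterior(seen, p_E_H, p_E_notH)
  mean((post - H)^2)                                    # mean squared error of beliefs
}

c(straw_in_the_wind = diagnose_clue(0.75, 0.25),        # roughly 0.19
  smoking_gun       = diagnose_clue(0.30, 0.05))        # roughly 0.22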

Qualitative Comparative Analysis (QCA)

One approach to mixed methods research focuses on identifying ways that causes combine to produce outcomes. What, for instance, are the combinations of demography, natural resource abundance, and institutional development that give rise to civil wars? An answer might be of the form: conflicts arise when there is natural resource abundance and weak institutional structure or when there are deep ethnic divisions. The key idea is that different configurations of conditions can lead to the same outcome (equifinality) and the interest is in assessing which combinations of conditions matter.

Many applications of qualitative comparative analysis use Boolean minimization algorithms to assess which configurations of factors are associated with different outcomes. Critics have highlighted that these algorithms are sensitive to measurement error (Hug 2013). Pointing to such sensitivity, some even go as far as to call for the rejection of QCA as a framework for inquiry (Lucas and Szatrowski 2014; for a nuanced response, see Baumgartner and Thiem 2017).

However, a formal declaration of a QCA design makes clear that these criticisms unnecessarily conflate QCA answer strategies with their inquiries (for a similar argument, see Collier 2014). Contrary to claims that regression analysis and QCA stem from fundamentally different ontologies (Thiem, Baumgartner, and Bol 2016), we show that saturated regression analysis may mitigate measurement error concerns in QCA. This simple proof of concept joins efforts toward unifying QCA with aspects of mainstream statistics (Braumoeller 2003; Rohlfing 2018) and other qualitative approaches (Rohlfing and Schneider 2018).

In Supplementary Materials Section 3.4 we declare a QCA design, focusing on the canonical case of binary variables ("crisp-set QCA"). The model features an outcome Y that arises in a case if and only if cause A is absent and cause B is present (Y = a*B). The approach extends readily to cases with many causes in complex configurations. For our inquiry, we wish to know the true minimal set of configurations of conditions that are sufficient to cause Y. The data strategy involves measuring and encoding knowledge about Y in a truth table. We allow for some error in this process. As in Rohlfing (2018), we are agnostic as to how this error arises: it may be that scholarly debate generates epistemic uncertainty about whether Y is truly present or absent in a given case, or that there is measurement error due to sampling variability.


For answer strategies, we compare two QCA minimization approaches. The first employs the classical Quine-McCluskey (QMC) minimization algorithm (see Dusa and Thiem 2015, for a definition) and the second the "Consistency Cubes" (CCubes) algorithm (Dusa 2018) to solve for the set of causal conditions that produces Y. This comparison demonstrates the utility of declaration and diagnosis for researchers using QCA algorithms, who might worry about whether their choice of algorithm will alter their inferences.14 We show that, at least in simple cases such as this, such concerns are minimal.

We also consider how ordinary least squares minimization performs when targeting a QCA estimand. The right hand side of the regression includes indicators for membership in all feasible configurations of A and B. Configurations that predict the presence of Y with probability greater than 0.5 are then included in the set of sufficient conditions.
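A minimal sketch of this regression-based answer strategy is below. The number of cases, the 5% rate of miscoded outcomes, and the true rule Y = a*B are illustrative assumptions; the full declaration is in Supplementary Materials Section 3.4.

# Sketch: recovering a crisp-set QCA solution via saturated regression
set.seed(5)
n <- 200
A <- rbinom(n, 1, 0.5)
B <- rbinom(n, 1, 0.5)
Y_true <- as.numeric(A == 0 & B == 1)           # true rule: Y = a*B
Y <- ifelse(rbinom(n, 1, 0.05) == 1, 1 - Y_true, Y_true)  # miscoded outcomes

# Saturated regression: indicators for every configuration of A and B
fit <- lm(Y ~ A * B)

# Predicted probability of Y for each of the four configurations
configs <- expand.grid(A = c(0, 1), B = c(0, 1))
configs$pred <- predict(fit, newdata = configs)

# Configurations with predicted Pr(Y) > 0.5 enter the estimated sufficient set
configs[configs$pred > 0.5, c("A", "B")]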

The diagnosis of this design shows that QCA algorithms can be successful at pinpointing exactly the combination of conditions that give rise to outcomes. When there is no error and the sample is large enough to ensure sufficient variation in the data, QMC and CCubes successfully recover the correct configuration 100% of the time. The diagnosis also confirms that QCA via saturated regression can recover the data generating process correctly and the configuration of causes estimand can then be computed, correctly, from estimated marginal effects.

This last point is important for thinking through the gains from employing the MIDA framework. The declaration clarifies that QCA is not equivalent to saturated regression: without substantial transformation, regression does not target the QCA estimands (Thiem, Baumgartner, and Bol 2016). However, it also clarifies that regression models can be integrated into classical QCA inquiries, and do very well. Using regression to perform QCA is equivalent to QMC and CCubes when there is no error, and even slightly outperforms these algorithms (on the diagnosands we consider) in the presence of measurement error. More work is required to understand the conditions under which the approaches perform differently.

However, the declaration and diagnosis illustrate that there need not be a tension between regression as an estimation procedure and causal configurations as an estimand. Rather than seeing them as rival research paradigms, scholars interested in QCA estimands can combine the machinery developed in the QCA literature to characterize configurations of conditions with the machinery developed in the broader statistical literature to uncover data generating processes. Thus, for instance, in answer to critiques that the method does not have a strategy for causal identification (Tanner 2014), one could in principle try to declare designs in which instrumental variables strategies, say, are used in combination with QCA estimands.

Nested Mixed Methods

A second approach to mixed methods research nests qualitative small N analysis within a strategy that involves movement back and forwards between large N theory testing and small N theory validation and theory generation. Lieberman (2005) describes a strategy of nested analysis of this form. In Supplementary Materials Section 3.5, we specify the estimands and analysis strategies implied by the procedure proposed in Lieberman (2005). In our declaration, we assume a model with binary variables and an inquiry focused on the relationship between X and Y (both causes-of-effects and effects-of-causes are studied). The model allows for the possibility that there are variables that are not known to the researcher when conducting large N analysis, but might modify or confound the relationship between X and Y. The data strategy and answer strategies are quite complex and integrated with each other. The researcher begins by analyzing a data set involving X and Y. If the quantitative analysis is "successful" (defined in terms of sufficient residual variance explained), the researcher engages in within-case "on the regression line" analysis. Using within-case data, the researcher assesses the extent to which X plausibly caused Y (or not X caused not Y) in these cases. If the qualitative or quantitative analyses reject the model, then a new qualitative analysis is undertaken to better understand the relationship between X and Y. In the design, this qualitative exploration is treated as the possibility of discovering the importance of a third variable that may moderate the effect of X on Y. If an alternative model is successfully developed, it is then tested on the same large N data.

Diagnosis of this design illustrates some of its advantages. In particular, in some settings the within-case analysis can guide researchers to models that better capture data generating processes and improve identification. The declaration also highlights the design features that are left to researchers. How many cases should be gathered and how should they be selected? What thresholds should be used to decide whether a theory is successful or not? The design diagnosis suggests interesting interactions between these design elements. For instance, if the bar for success in the theory testing stage is low in terms of the minimum share of cases explained that are considered adequate, then the researcher might be better off sampling fewer qualitative cases in the testing stage and more in the development stage. More variability in the first stage makes it more likely that one would reject a theory, which might in turn lead to the discovery of a better theory.

Observational Regression-Based Strategies

Many observational studies seek to make causal claims, but do not explicitly employ the potential outcomes framework, instead describing inquiries in terms of model parameters. Sometimes studies describe their

14 For both methods, we use the "parsimonious" solution and not the "conservative" or "intermediate" solutions that have been criticized in Baumgartner and Thiem (2017), though our declaration could easily be modified to check the performance of these alternative solutions.


goal as the estimation of a parameter b from a model of the form y_i = a + b·x_i + ε_i. What is the estimand here? If we believe that this model describes the true data generating process, then b is an estimand: it is the true (constant) marginal effect of x on y. But what if we are wrong about the model? We run into a problem if we want to assess the properties of strategies under different assumptions about data generation when the inquiry itself depends on the data generating model.

To address this problem, we can declare an inquiry as a summary of differences in potential outcomes across conditions, b. Such a summary might derive from a simple comparison of potential outcomes—for example τ ≡ E_x E_i (Y_i(x) − Y_i(x − 1)) captures the difference in outcomes between having income x and having a dollar less, x − 1, for different possible income levels. Or it could be a parameter from a model applied to the potential outcomes. For example we might define a and b as the solutions to:

$$\min_{(a,b)} \sum_i \int \big(Y_i(x) - a - bx\big)^2 f(x)\, dx$$

Here Y_i(x) is the (unknown) potential outcome for unit i in condition x. Estimand b can be thought of as the coefficient one would get on x if one were able to regress all possible potential outcomes on all possible conditions for all units (given density of interest f(x)).15

Our data strategy will simply consist of the passive observation of units in the population, and we assess the performance of an answer strategy employing an OLS model to estimate b under different conditions.

To illustrate, we declare a design that lets us quickly assess the properties of a regression estimate under the assumption that in the true data-generating process y is in fact a nonlinear function of x (Supplementary Materials Section 3.6). Diagnosis of the design shows that under uniform random assignment of x, the linear regression returns an unbiased estimate of a (linear) estimand, even though the true data generating process is nonlinear. Interestingly, with the design in hand, it is easy to see that unbiasedness is lost in a design in which different values of x_i are assigned with differing probabilities. The benefit of declaration here is that, without defining I, it is hard to see the conditions under which A is biased or unbiased. Declaration and diagnosis clarify that, even though the answer strategy "assumes" a linear relationship in M that does not hold, under certain conditions OLS is still able to estimate a linear summary of that relationship.
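The following sketch illustrates the point under assumed functional forms (a quadratic data generating process and a uniform density f over the support of x); it is not the declaration in Supplementary Materials Section 3.6.

# Sketch: a linear summary estimand under a nonlinear data generating process
set.seed(6)

simulate_once <- function(uniform_assignment = TRUE, n = 2000) {
  x <- if (uniform_assignment) {
    sample(0:10, n, replace = TRUE)                    # uniform "assignment" of x
  } else {
    sample(0:10, n, replace = TRUE, prob = (1:11)^2)   # unequal assignment probabilities
  }
  y <- 0.5 * x^2 + rnorm(n)                            # nonlinear true process
  unname(coef(lm(y ~ x))["x"])                         # OLS answer strategy
}

# Estimand: slope from projecting the potential outcomes on x with uniform f(x)
x_grid <- 0:10
y_grid <- 0.5 * x_grid^2
estimand <- unname(coef(lm(y_grid ~ x_grid))["x_grid"])

c(estimand   = estimand,
  uniform    = mean(replicate(200, simulate_once(TRUE))),   # approximately unbiased
  nonuniform = mean(replicate(200, simulate_once(FALSE))))  # biased for this estimand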

Matching on Observables

In many observational research designs, the processes by which units are assigned to treatment are not known with certainty. In matching designs, the effects of an unknown assignment procedure may, for example, be assessed by matching units on their observable traits under an assumption of as-if random assignment between matched pairs. Diagnosis in such instances can shed light on risks when such assumptions are not justified. In Supplementary Materials Section 3.7, we declare a design with a model in which three observable random variables are combined in a probit process that assigns the treatment variable, Z. The inquiry pertains to the average treatment effect of Z on the outcome Y among those actually assigned to treatment, which we estimate using an answer strategy that tries to reconstruct the assignment process to calculate a_A. Our diagnosis shows that matching improves mean-squared-error (E[(a_A − a_M)²]) relative to a naive difference-in-means estimator of the treatment effect on the treated (ATT), but can nevertheless remain biased (E[a_A − a_M] ≠ 0) if the matching algorithm does not successfully pair units with equal probabilities of assignment, i.e., if matching has not eliminated all sources of confounding. The chief benefit of the MIDA declaration here is to separate out beliefs about the data generating process (M) from the details of the answer strategy (A), whose robustness to alternative data generating processes can then be assessed.
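A simplified sketch of this design follows. It substitutes nearest-neighbor matching on an estimated propensity score for the matching procedure declared in Supplementary Materials Section 3.7, and all parameter values are illustrative.

# Sketch: probit assignment on three observables, matching versus naive estimator
set.seed(7)

simulate_once <- function(n = 500) {
  X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
  Z  <- rbinom(n, 1, pnorm(0.5 * X1 + 0.5 * X2 + 0.5 * X3))   # probit assignment
  Y0 <- X1 + X2 + X3 + rnorm(n)
  Y1 <- Y0 + 1                                                # true ATT = 1
  Y  <- ifelse(Z == 1, Y1, Y0)

  naive <- mean(Y[Z == 1]) - mean(Y[Z == 0])                  # naive difference in means

  # Match each treated unit to the nearest control on the estimated propensity score
  pscore  <- fitted(glm(Z ~ X1 + X2 + X3, family = binomial(link = "probit")))
  treated <- which(Z == 1); control <- which(Z == 0)
  matches <- vapply(treated,
                    function(i) control[which.min(abs(pscore[control] - pscore[i]))],
                    integer(1))
  matched <- mean(Y[treated] - Y[matches])

  c(naive = naive - 1, matched = matched - 1)                 # errors relative to the ATT
}

errors <- replicate(500, simulate_once())
rbind(bias = rowMeans(errors), rmse = sqrt(rowMeans(errors^2)))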

Regression Discontinuity

While in many observational settings researchers do not know the assignment process, in others, researchers may know how assignment works without necessarily controlling it. In regression discontinuity designs, causal identification is often premised on the claim that potential outcomes are continuous at a critical threshold (see De la Cuesta and Imai 2016; Sekhon and Titiunik 2016). The declaration of such designs involves a model that defines the unknown potential outcomes functions mapping average outcomes to the running and treatment variables. Our inquiry is the difference in the conditional expectations of the two potential outcomes functions at the discontinuity. The data strategy involves passive observation and collection of the data. The answer strategy is a polynomial regression in which the assignment variable is linearly interacted with a fourth order polynomial transformation of the running variable. In Supplementary Materials Section 3.8, we declare and diagnose such a design.

The declaration highlights a difference between this design and many others: the estimand here is not an average of potential outcomes of a set of sample units, but rather an unobservable quantity defined at the limit of the discontinuity. This feature makes the definition of diagnosands such as bias or external validity conceptually difficult. If researchers postulate unobservable counterfactuals, such as the "treated" outcome for a unit located below the treatment threshold, then the usefulness of the regression discontinuity estimate of the average treatment effect for a specific set of units can be assessed.
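A minimal sketch of such a declaration is below. The two conditional expectation functions, the noise level, and the cutoff at zero are illustrative assumptions, not the declaration in Supplementary Materials Section 3.8.

# Sketch: regression discontinuity with a fourth-order polynomial answer strategy
set.seed(8)

mu0 <- function(r) 0.5 * r + 0.25 * r^2              # assumed control CEF
mu1 <- function(r) 0.5 + 0.5 * r - 0.25 * r^2        # assumed treated CEF
estimand <- mu1(0) - mu0(0)                          # jump at the cutoff

simulate_once <- function(n = 1000) {
  R <- runif(n, -1, 1)                               # running variable (cutoff at 0)
  Z <- as.numeric(R >= 0)                            # assignment rule
  Y <- ifelse(Z == 1, mu1(R), mu0(R)) + rnorm(n, sd = 0.2)
  # Treatment interacted with a fourth-order polynomial in the running variable
  fit <- lm(Y ~ Z * (R + I(R^2) + I(R^3) + I(R^4)))
  predict(fit, newdata = data.frame(R = 0, Z = 1)) -
    predict(fit, newdata = data.frame(R = 0, Z = 0))
}

estimates <- replicate(500, simulate_once())
c(estimand = estimand, bias = mean(estimates) - estimand, sd = sd(estimates))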

Experimental Design

In experimental research, researchers are in control of sample construction and assignment of treatments, which makes declaring these parts of the design straightforward. A common choice faced in experimental research is between employing a 2-by-2 factorial

15 An alternative might be to imagine a marginal effect conditional on actual assignment: if x_i is the observed treatment received by unit i, define, for small δ, τ ≡ E[Y_i(x_i) − Y_i(x_i − δ)]/δ.


design or a three-arm trial where the "both" condition is excluded. Suppose we are interested in the effect of each of two treatments when the other condition is set to control. Should we choose a factorial design or a three-arm design? Focusing for simplicity on the effect of a single treatment, we declare two designs under a range of alternative models to help assess the tradeoffs. For both designs, we consider models M1, …, MK, where we let the interaction between treatments vary over the range −0.2 to +0.2. Our inquiry is always the average treatment effect of treatment 1 given all units are in the control condition for treatment 2. We consider two alternative data strategies: an assignment strategy in which subjects are assigned to a control condition, treatment 1, or treatment 2, each with probability 1/3; and an alternative strategy in which we assign subjects to each of four possible combinations of factors with probability 1/4. The answer strategy in both cases involves a regression of the outcome on both treatment indicators with no interaction term included.

We declare and diagnose this design and confirm that neither design exhibits bias when the true interaction term is equal to zero (Figure 1 left panel). The details of the declaration can be found in Supplementary Materials Section 3.9. However, when the interaction between the two treatments is stronger, the factorial design renders estimates of the effect of treatment 1 that are more and more biased relative to the "pure" main effect estimand. Moreover, there is a bias-variance tradeoff in choosing between the two designs when the interaction is weak (Figure 1 right panel). When the interaction term is close to zero, the factorial design is preferred, because it is more powerful: it compares one half of the subject pool to the other half, whereas the three-arm design only compares a third to a third. However, as the magnitude of the interaction term increases, the precision gains are offset by the increase in bias documented in the left panel. When the true interaction between treatments is large, the three-arm design is then preferred. This exercise highlights key points of design guidance. Researchers often select factorial designs because they expect interaction effects, and indeed factorial designs are required to assess these. However if the scientific question of interest is the pure effect of each treatment, researchers should (perhaps counterintuitively) use a factorial design if they expect weak interaction effects. An integrated approach to design declaration here illustrates non-trivial interactions between the data strategy, on the one hand, and the ability of answers (a_A) to approximate the estimand (a_M), on the other.
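A compact sketch of this comparison follows; the outcome model, effect sizes, and sample size are illustrative assumptions rather than the declaration in Supplementary Materials Section 3.9.

# Sketch: bias of the "pure" main effect estimate under factorial and three-arm designs
set.seed(9)

simulate_once <- function(interaction, design, n = 900) {
  if (design == "factorial") {
    Z1 <- rbinom(n, 1, 0.5); Z2 <- rbinom(n, 1, 0.5)           # 2-by-2 assignment
  } else {
    arm <- sample(c("control", "T1", "T2"), n, replace = TRUE) # three arms
    Z1 <- as.numeric(arm == "T1"); Z2 <- as.numeric(arm == "T2")
  }
  Y <- 0.2 * Z1 + 0.2 * Z2 + interaction * Z1 * Z2 + rnorm(n)
  # Answer strategy: regression on both treatment indicators, no interaction term
  unname(coef(lm(Y ~ Z1 + Z2))["Z1"]) - 0.2      # error relative to the estimand (0.2)
}

sapply(c(none = 0, strong = 0.2), function(g) {
  c(factorial = mean(replicate(500, simulate_once(g, "factorial"))),
    three_arm = mean(replicate(500, simulate_once(g, "three_arm"))))
})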

Designs for Discovery-Oriented Research

In some research projects, the ultimate hypotheses that are assessed are not known at the design stage. Some inductive designs are entirely unstructured and explore a variety of data sources with a variety of methods within a general domain of interest until a new insight is uncovered. Yet many can be described in a more structured way.

In studying textual data, for example, a researcher may have a procedure for discovering the "topics" that are discussed in a corpus of documents. Before beginning the research, the set of topics and even the number of topics is unknown. Instead, the researcher selects a model for estimating the content of a fixed number of topics (e.g., Blei, Ng, and Jordan 2003) and a procedure for evaluating the model fit used to select which number of topics fits the data best. Such a design is inductive, yet the analytical discovery process can be described and evaluated.

We examine a data analysis procedure in which the researcher assesses possible analysis strategies in a first stage on half of the data and in the second stage applies her preferred procedure to the second half of the data. Split-sample procedures such as this enable researchers to learn about the data inductively while protecting against Type I errors (for an early discussion of the design, see Cox 1975). In Supplementary Materials Section 3.10, we declare a design in which the model stipulates a treatment of interest, but also specifies

FIGURE 1. Diagnoses of Designs With Factorial or Three-Arm Assignment Strategies Illustrate a Bias-Variance Tradeoff

Bias (left), root mean-squared-error (center), and power (right) are displayed for two assignment strategies, a 2 × 2 treatment arm factorial design (black solid lines; circles) and a three-arm design (gray dashed lines; triangles), according to varying interaction effect sizes specified in the potential outcomes function (x axis). The third panel also shows power for the interaction effect (squares) from the factorial design.


groups for which there might be heterogeneous treatment effects. The main inquiry pertains to the treatment effect, but the researchers anticipate that they may be interested in testing for heterogeneous treatment effects if they observe prima facie evidence for it. The data strategy involves random assignment. The answer strategy involves examination of main effects, but in addition the researchers examine heterogeneous treatment effects inside a random subgroup of the data. If they find evidence of differential effects they specify a new inquiry which is assessed on the remaining data. The results on heterogeneous effects are compared against a strategy that simply reports discoveries found using complete data, rather than on split data (we call this the "unprincipled" approach).

We see lower bias from principled discovery than from unprincipled discovery, as one might expect. The declaration and diagnosis also highlight tradeoffs in terms of mean squared error. Mean squared error is not necessarily lower for the principled approach since fewer data are used in the final test. Moreover, the principled strategy is somewhat less likely to produce a result at all since it is less likely that a result would be discovered in a subset of the data than in the entire data set. With this design declared, one can assess what an optimal division of units into training and testing data might be given different hypothesized effect sizes.
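The following sketch illustrates the split-sample logic; the moderator, effect sizes, and significance threshold are illustrative assumptions, not the declaration in Supplementary Materials Section 3.10.

# Sketch: principled (split-sample) versus unprincipled discovery of a heterogeneous effect
set.seed(10)

simulate_once <- function(n = 1000, het_effect = 0.2) {
  G <- rbinom(n, 1, 0.5)                      # candidate moderator
  Z <- rbinom(n, 1, 0.5)                      # random assignment
  Y <- 0.2 * Z + het_effect * Z * G + rnorm(n)
  d <- data.frame(Y, Z, G)

  p_int   <- function(dat) summary(lm(Y ~ Z * G, data = dat))$coefficients["Z:G", "Pr(>|t|)"]
  est_int <- function(dat) unname(coef(lm(Y ~ Z * G, data = dat))["Z:G"])

  train <- sample(n, n / 2)
  # Principled: discover on the training half, then estimate on the held-out half
  principled   <- if (p_int(d[train, ]) < 0.05) est_int(d[-train, ]) else NA
  # Unprincipled: discover and report on the full data
  unprincipled <- if (p_int(d) < 0.05) est_int(d) else NA

  c(principled = principled, unprincipled = unprincipled)
}

results <- replicate(1000, simulate_once())
cbind(bias           = rowMeans(results - 0.2, na.rm = TRUE),  # among reported discoveries
      share_reported = rowMeans(!is.na(results)))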

Designs for Modeling Data Generation Processes

For most designs we have described, the estimand of interest is a number: an average level, a causal effect, or a summary of causal effects. Yet in some situations, researchers seek not to estimate a particular number, but rather to model a data generating process. For work of this kind, the data generating process is the estimand, rather than any particular comparison of potential outcomes. This was the case for the qualitative QCA design we looked at, in which the combination of conditions that produce an outcome was the estimand. This model-focused orientation is also common for quantitative researchers. In the example from Observational Regression-Based Strategies, we noted that a researcher might be interested not in the average effect resulting from a change in X over some range, but in estimating a function f*_Y(X) (which itself might be used to learn about different quantities of interest). This kind of approach can be handled within the MIDA framework in two ways. One asks the researcher to identify the ultimate quantities of interest ex ante and to treat these as the estimands. In this case, the model generated to make inferences about quantities of interest is thought of as part of the answer strategy, a, and not part of i. A second approach posits a true underlying DGP as part of m, f**_Y. The estimand is then also a function, f*_Y, which could be f**_Y itself or an approximation.16 An estimate is a function f_Y that aims to approximate f*_Y. In this case, it is difficult to think of diagnosands like bias or coverage when comparing f*_Y to f_Y, but diagnosands can still be constructed that measure the success of the modeling. For instance, for a range of values of X we could compare values of f_Y(X) to f*_Y(X), or employ familiar statistics of goodness of fit, such as the R². The MIDA framework forces clarity regarding which of these approaches a design is using, and as a consequence, what kinds of criticisms of a design are on target. For instance, returning to the regression strategies example: if a linear model is used to estimate a linear estimand, it may behave well for that purpose even when the underlying process is very nonlinear. If, however, the goal is to estimate the shape of the data generating process, the linear estimator will surely fare poorly.

***

The research designs we have described in this section are varied in the intellectual traditions as well as inferential goals they represent. Yet commonalities emerge, which enabled us to declare each design in terms of MIDA. Exploring this broad set of research practices through MIDA clarified non-obvious aspects of the designs, such as the target of inference (Inquiry) in QCA designs or regression discontinuity designs with finite units, as well as the subtle implications of beliefs about heterogeneity in treatment effects (Model) for selecting between three-arm and 2 × 2 factorial designs.

PUTTING DECLARATIONS AND DESIGN DIAGNOSIS TO USE

We have described and illustrated a strategy for declaring research designs for which "diagnosands" can be estimated given conjectures about the world. How might declaring and diagnosing research designs in this way affect the practices of authors, readers, and replication authors? We describe implications for how designs are chosen, communicated, and challenged.

Making Design Choices

The move toward increasing credibility of research in the social sciences places a premium on considering alternative data strategies and analysis strategies at early stages of research projects, not only because it reduces researcher discretion after observing outcomes, but more importantly because it can improve the quality of the final research design. While there is nothing new about the idea of determining features such as sampling and estimation strategies ex ante, in practice many designs are finalized late in the research process, after data are collected. Frontloading design decisions is difficult not only because existing tools are rudimentary and often misleading, but because it is not clear in current practice what features of a design must be considered ex ante.

We provide a framework for identifying which features affect the assessment of a design's properties, declaring designs and diagnosing their inferential quality, and frontloading design decisions. Declaring the design's features in code enables direct exploration of alternative

16 For instance researchers might be interested in a "conditional expectation function," or in locating a parameter vector that can render a model as good as possible—such as minimizing the Kullback-Leibler information criterion (White 1982).


data and analysis strategies using simulated data; evaluating alternative strategies through diagnosis; and exploring the robustness of a chosen strategy to alternative models. Researchers can undertake each step before study implementation or data collection.

Communicating Design Choices

Bias in published results can arise for many reasons. For example, researchers may deliberately or inadvertently select analysis strategies because they produce statistically significant results. Proposed solutions to reduce this kind of bias focus on various types of preregistration of analysis strategies by researchers (Casey, Glennerster, and Miguel 2012; Green and Lin 2016; Nosek et al. 2015; Rennie 2004; Zarin and Tse 2008). Study registries are now operating in numerous areas of social science, including those hosted by the American Economic Association, Evidence in Governance and Politics, and the Center for Open Science. Bias may also arise from reviewers basing publication recommendations on statistical significance. Results-blind review processes are being introduced in some journals to address this form of bias (e.g., Findley et al. 2016).

However, the effectiveness of design registries and results-blind review in reducing the scope for either form of publication bias depends on clarity over which elements must be included to describe the design. In practice, some registries rely on checklists and pre-analysis plans exhibit great variation, ranging from lists of written hypotheses to all-but-results journal articles. In our view, the solution to this problem does not lie in ever-more-specific questionnaires, but rather in a new way of characterizing designs whose analytic features can be diagnosed through simulation.

The actions to be taken by researchers are described by the data strategy and the answer strategy; these two features of a design are clearly relevant elements of a preregistration document. In order to know which design choices were made ex ante and which were arrived at ex post, researchers need to communicate their data and answer strategies unambiguously. However, assessing whether the data and answer strategies are any good usually requires specifying a model and an inquiry. Design declaration can clarify for researchers and third parties what aspects of a study need to be specified in order to meet standards for effective preregistration. Rather than asking: "are the boxes checked?" the question becomes: "can it be diagnosed?" The relevant diagnosands will likely depend on the type of research design. However, if an experimental design is, for example, "bias complete," then we know that sufficient information has been given to define the question, data, and answer strategy unambiguously.

Declaration of a design in code also enables a final and infrequently practiced step of the registration process, in which the researcher "reports and reconciles" the final with the planned analysis. Identifying how and whether the features of a design diverge between ex ante and ex post declarations highlights deviations from the pre-analysis plan. The magnitude of such deviations determines whether results should be considered exploratory or confirmatory. At present, this exercise requires a review of dozens of pages of text, such that differences (or similarities) are not immediately clear even to close readers. Reconciliation of designs declared in code can be conducted automatically, by comparing changes to the code itself (e.g., a move from the use of a stratified sampling function to simple random sampling) and by comparing key variables in the designs such as sample sizes.

Challenging Design Choices

The independent replication of the results of studies after their publication is an essential component of the shift toward more credible science. Replication—whether verification, reanalysis of the original data, or reproduction using fresh studies—provides incentives for researchers to be clear and transparent in their analysis strategies, and can build confidence in findings.17

In addition to rendering the design more transparent, diagnosand-complete declaration can allow for a different approach to the re-analysis and critique of published research. A standard practice for replicators engaging in reanalysis is to propose a range of alternative strategies and assess the robustness of the data-dependent estimates to different analyses. The problem with this approach is that, when divergent results are found, third parties do not have clear grounds to decide which results to believe. This issue is compounded by the fact that, in changing the analysis strategy, replicators risk departing from the estimand of the original study, possibly providing different answers to different questions. In the worst case scenario, it can be difficult to determine what is learned both from the original study and from the replication.

A more coherent strategy facilitated by design simulations would be to use a diagnosand-complete declaration to conduct "design replication." In a design replication, a scholar restates the essential design characteristics to learn about what the study could have revealed, not just what the original author reports was revealed. This helps to answer the question: under what conditions are the results of a study to be believed? By emphasizing abstract properties of the design, design replication provides grounds to support alternative analyses on the basis of the original authors' intentions and not on the basis of the degree of divergence of results. Conversely, it provides authors with grounds to question claims made by their critics.

Table 4 illustrates situations that may arise. In a declared design an author might specify situation 1: a set of claims on the structure of the variables and their potential outcomes (the model) and an estimator (the answer strategy). A critic might then question the claims on potential outcomes (for example, questioning a no-spillovers assumption) or question estimation strategies

17 For a discussion of the distinctions between these different modes of replication, see Clemens (2017).


(for example, arguing for inclusion or exclusion of a control variable from an analysis), or both.

In this context, there are several possible criteria for admitting alternative answer strategies:

• Home Ground Dominance. If ex ante the diagnostics for situation 3 are better than for 1 then this gives grounds to switch to 3. That is, if a critic can demonstrate that an alternative estimation strategy outperforms an original estimation strategy even under the data generating process assumed by an original researcher, then they have strong grounds to propose a change in strategies. Conversely, if an alternative estimation strategy produces different results, conditional on the data, but does not outperform the original strategy given the original assumptions, this gives grounds to question the reanalysis.

• Robustness to Alternative Models. If the diagnostics in situation 3 are as good as in 1 but are better in situation 4 than in situation 2 this provides a robustness argument for altering estimation strategies. For example, in a design with heterogeneous probabilities by block, an inverse propensity-weighted estimator will do about as well as a fixed effects estimator in terms of bias when treatment effects are constant, but will perform better on this dimension when effects are heterogeneous (a minimal sketch of this comparison follows the list).

• Model Plausibility. If the diagnostics in situation 1 are better than in situation 3, but the diagnostics in situation 4 are better than in situation 2, then things are less clear and the justification of a change in estimators depends on the plausibility of the different assumptions about potential outcomes.
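The sketch below illustrates the comparison referenced in the "Robustness to Alternative Models" criterion: two blocks with different assignment probabilities, a block fixed-effects estimator versus an inverse-propensity-weighted (IPW) estimator, under constant and under heterogeneous treatment effects. Block sizes, assignment probabilities, and effect sizes are illustrative assumptions.

# Sketch: fixed effects versus IPW under two models of treatment effects
set.seed(11)

simulate_once <- function(heterogeneous, n_per_block = 500) {
  block <- rep(c(1, 2), each = n_per_block)
  p     <- ifelse(block == 1, 0.2, 0.5)            # heterogeneous assignment probabilities
  Z     <- rbinom(length(block), 1, p)
  tau   <- if (heterogeneous) ifelse(block == 1, 0, 1) else rep(0.5, length(block))
  Y     <- tau * Z + rnorm(length(block))
  ate   <- mean(tau)                               # estimand: average treatment effect

  fe  <- unname(coef(lm(Y ~ Z + factor(block)))["Z"])   # block fixed effects
  w   <- ifelse(Z == 1, 1 / p, 1 / (1 - p))              # inverse propensity weights
  ipw <- weighted.mean(Y[Z == 1], w[Z == 1]) - weighted.mean(Y[Z == 0], w[Z == 0])

  c(fe = fe - ate, ipw = ipw - ate)                # errors relative to the estimand
}

rbind(constant      = rowMeans(replicate(500, simulate_once(FALSE))),
      heterogeneous = rowMeans(replicate(500, simulate_once(TRUE))))   # bias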

The normative value or relative ranking of these criteria should be left to individual research communities. Without a declared design, in particular the model and inquiry, none of these criteria can be evaluated, complicating the defense of claims for both the critic and the original author.

APPLICATION: DESIGN REPLICATION OF Bjorkman and Svensson (2009)

We illustrate the insights that a formalized approach to design declaration can reveal through an application to the design of Bjorkman and Svensson (2009), which investigated whether community-based monitoring can improve health outcomes in rural Uganda.

We conduct a "design replication:" using available information, we posit a Model, Inquiry, Data, and Answer strategy to assess properties of Bjorkman and Svensson (2009). This design replication can be contrasted with the kind of reanalysis of the study's data that has been conducted by Donato and Garcia Mosqueira (2016) or the reproduction by Raffler, Posner, and Parkerson (2019) in which the experiment was conducted again.

The exercise serves three purposes: first, it sheds light on the sorts of insights the design can produce without using the original study's data or code; second, it highlights how difficulties can arise from designs in which the inquiry is not well-defined; third, we can assess the properties of replication strategies, notably those pursued by Donato and Garcia Mosqueira (2016) and Raffler, Posner, and Parkerson (2019), in order to make clearer the contributions of such efforts.

In the original study, Bjorkman and Svensson (2009) estimate the effects of treatment on two important indicators: child mortality, defined as the number of deaths per 1,000 live births among under-5 year-olds (taken at the catchment-area level), and weight-for-age z-scores, which are calculated by subtracting from an infant's weight the median for their age from a reference population, and dividing by the standard deviation of that population. In the original design, the authors estimate a positive effect of the intervention on weight among surviving infants. They also find that the treatment greatly decreases child mortality.

We briefly outline the steps of our design replication here, and present more detail in Supplementary Materials Section 4.

We began by positing a model of the world in which unobserved variables, "family health" and "community health," determine both whether infants survive early childhood and whether they are malnourished.

Our attempt to define the study's inquiry met with a difficulty: the weight of infants in control areas whose lives would have been saved if they had been in the treatment is undefined (for a discussion of the general problem known as "truncation-by-death," see Zhang and Rubin 2003). Unless we are willing to make conjectures about undefined states of the world (such as the control weight of a child who would not have survived if assigned to the control), we can only define the average difference in individuals' potential outcomes for those children whose survival is unaffected by the treatment:

TABLE 4. Diagnosis Results Given Alternative Assumptions About the Model and Alternative Answer Strategies

                                        Author's assumed model    Alternative claims on model
Author's proposed answer strategy       1                         2
Alternative answer strategy             3                         4

Note: Four scenarios encountered by researchers and reviewers of a study are considered depending on whether the model or the answer strategy differ from the author's original strategy and model.


E[Weight(Z = 1) − Weight(Z = 0) | Alive(Z = 0) = Alive(Z = 1) = 1].18

As in the original article we stratify sampling on catchment area and cluster-assign households in 25 of the 50 catchment areas to the intervention.

We estimate mortality at the cluster level and weight-for-age among living children at the household level, as in Bjorkman and Svensson (2009).

Figure 2 illustrates how the existence of an effect on mortality can pose problems for the unbiased estimation of an effect on weight-for-age. The histograms represent the sampling distributions of the average effect estimates of community monitoring on infant mortality and weight-for-age. The dotted vertical line represents the true average effect (a_M). The mortality estimand is defined at the cluster level and the weight-for-age estimand is defined for infants who would survive regardless of treatment status. The dashed line represents the average answer, i.e., the answer we expect the design to provide (E[a_A]). The weight-for-age answer strategy simply compares the weights of surviving infants across treatment and control. Under our postulated model of the world, the estimates of the effect on weight-for-age are biased downwards because it is precisely those infants with low health outcomes whose lives are saved by the treatment.

We draw upon the "robustness to alternative models" criterion (described in the previous section) to argue for an alternative answer strategy that exhibits less bias under plausible conjectures about the world.

An alternative answer strategy is to attempt to subset the analysis of the weight effects to a group of infants whose survival does not depend on the treatment. This approach is equivalent to the "find always-responders" strategy for avoiding post-treatment bias in audit studies (Coppock 2019). In the original study, for example, the effects on survival are much larger among infants younger than two years old. If indeed the survival of infants above this age threshold is unaffected by the treatment, then it is possible to provide unbiased estimates of the weight-for-age effect, if only among this group. In terms of bias, such an approach does at least as well if we assume that there is no correlation between weight and mortality, and better if such a correlation does exist. It thus satisfies the "robustness to alternative models" criterion.

A reasonable counter to this replication effort might be to say that the alternative answer strategy does not meet the criterion of "home ground dominance" with respect to RMSE. The increase in variance from subsetting to a smaller group may outweigh the bias reduction that it entails. In both cases, transparent arguments can be made by formally declaring and comparing the original and modified designs.

The design replication also highlights the relatively low power of the weight-for-age estimator. As Gelman and Carlin (2014) have shown, conditioning on statistical significance in such contexts can pose risks of exaggerating the true underlying effect size. Based on our assumptions, what can we say here, specifically, about the risk of exaggeration? How effectively does a design such as that used in the replication by Raffler, Posner, and Parkerson (2019) mitigate this risk? To answer this question, we modify the sampling strategy of our simulation of the original study to include 187 clusters instead of 50.19 We

FIGURE 2. Data-independent Replication of Estimates in Bjorkman and Svensson (2009)

Histograms display the frequency of simulated estimates of the effect of community monitoring on infant mortality (left) and on weight-for-age (right). The dashed vertical line shows the average estimate, the dotted vertical line shows the average estimand.

18 Of course, we could define our estimand as the difference in average weights for any surviving children in either state of the world: E[Weight(Z = 1) | Alive(Z = 1) = 1] − E[Weight(Z = 0) | Alive(Z = 0) = 1]. This estimand would lead to very aberrant conclusions. Suppose, for example, that only one child with a very healthy weight survived in the control and all children, with weights ranging from healthy to very unhealthy, survived in the treatment. Despite all those lives saved, this estimand would suggest that the treatment has a large negative impact on health.

19 Raffler, Posner, and Parkerson (2019) employ a factorial design which breaks down the original intervention into two subcomponents: interface meetings between the community and village health teams, on the one hand, and integration of report cards into the action plans of health centers, on the other. We augment the sample size here only by the number of clusters corresponding to the pure control and both-arm conditions, as the other conditions of the factorial were not included in the original design. Including those other 189 clusters would only strengthen the conclusions drawn.


then define the diagnosand of interest as the "exaggeration ratio" (Gelman and Carlin 2014): the ratio of the absolute value of the estimate to the absolute value of the estimand, given that the estimated effect is significant at the α = 0.05 level. This diagnosand thus provides a measure of how much the design exaggerates effect sizes conditional on statistical significance.

The original design exhibits a high exaggeration ratio, according to the assumptions employed in the simulations: on average, statistically significant estimates tend to exaggerate the true effect of the intervention on mortality by a factor of two and on weight-for-age by a factor of four. In other words, even though the study estimates effects on mortality in an unbiased manner, limiting attention to statistically significant effects provides estimates that are twice as large in absolute value as the true effect size on average. By contrast, using the same sample size as that employed in Raffler, Posner, and Parkerson (2019) reduces the exaggeration ratio on the mortality estimand to where it should be, around one.
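As a stand-alone illustration of this diagnosand, the sketch below computes an exaggeration ratio for a simple two-arm comparison; the effect size and sample sizes are illustrative and unrelated to the Bjorkman and Svensson (2009) declaration.

# Sketch: exaggeration ratio conditional on statistical significance
set.seed(12)

simulate_once <- function(n_per_arm, effect = 0.2) {
  Y0 <- rnorm(n_per_arm)
  Y1 <- rnorm(n_per_arm, mean = effect)
  c(estimate = mean(Y1) - mean(Y0), p = t.test(Y1, Y0)$p.value)
}

exaggeration_ratio <- function(n_per_arm, sims = 5000, effect = 0.2) {
  draws <- replicate(sims, simulate_once(n_per_arm, effect))
  sig <- draws["p", ] < 0.05                       # keep only significant estimates
  mean(abs(draws["estimate", sig]) / abs(effect))  # average |estimate| / |estimand|
}

c(small_study = exaggeration_ratio(25), larger_study = exaggeration_ratio(200))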

Finally, we can also address the analytic replication by Donato and Garcia Mosqueira (2016). The replicators (D&M) noted that the eighteen community-based organizations who carried out the original "power to the people" (P2P) intervention were active in 64% of the treatment communities and 48% of the control communities. Donato and Garcia Mosqueira (2016) posit that prior presence of these organizations may be correlated with health outcomes, and therefore include in their analytic replication of the mortality and weight-for-age regressions both an indicator for CBO presence and the interaction of the intervention with CBO presence. The inclusion of these terms into the regression reduces the magnitude of the coefficients on the intervention indicator and thereby increases the p-values above the α = 0.1 threshold in some cases. The original authors (B&S) criticized the replicators' decision to include CBO presence as a regressor, on the grounds that in any such study it is possible to find some unrelated variable whose inclusion will increase the standard error of the treatment effect estimate.

In short, the original authors and the replicators make a set of contrasting claims about the true model of the world: B&S claim that CBO presence is unrelated to the outcome of interest (Bjorkman Nyqvist and Svensson 2016), whereas D&M claim that CBO presence might indeed affect (or be otherwise correlated with) health outcomes. As we argued in the previous section, diagnosis of the properties of the answer strategy under these competing claims should determine which answer strategy is best justified.

Since we do not know whether the replicators would have conditioned on CBO presence and its interaction with the intervention if it had not been imbalanced, we modify the original design to include four different replicator strategies: the first ignores CBO presence as in the original study; the second includes CBO presence irrespective of imbalance; the third includes an indicator for CBO presence only if the CBO presence is significantly imbalanced among the 50 treatment and control clusters at the α = 0.05 level; and the last strategy includes terms for both CBO presence and an interaction of CBO presence with the treatment irrespective of imbalance. We consider how these strategies perform under a model in which CBO presence is unrelated to health outcomes, and another in which, as claimed by the replicators, CBO presence is highly correlated with health outcomes.

Including the interaction term is a strictly dominated strategy from the standpoint of reducing mean squared error: irrespective of whether CBO presence is correlated with health outcomes or imbalanced, the RMSE expected under this strategy is higher than under any other strategy. Thus, based on a criterion of "Home Ground Dominance" in favor of B&S, one would be justified in discounting the importance of the replicators' observation that "including the interaction term leads to a further reduction in magnitude and significance" of the estimated treatment effect (Donato and Garcia Mosqueira 2016, 19).

Supposing now that there is no correlation between CBO presence and health outcomes, inclusion of the CBO indicator does increase RMSE ever so slightly in those instances where there is imbalance, and the standard errors are ever so slightly larger. On average, however, the strategies of conditioning on CBO presence regardless of balance and conditioning on CBO presence only if imbalanced perform about as well as a strategy of ignoring CBO presence when there is no underlying correlation. However, when there is a correlation between health outcomes and CBO presence, strategies that include CBO presence improve RMSE considerably, especially when there is imbalance. Thus, D&M could make a "Robustness to Alternative Models" claim in defense of their inclusion of the CBO dummy: including CBO presence does not greatly diminish inferential quality on average, even if there is no correlation in CBO presence and outcomes; and if there is such a correlation, including CBO presence in the regression specification strictly improves inferences. In sum, a diagnostic approach to replication clarifies that one should resist updating beliefs about the study based on the use of interaction terms, but that the inclusion of the CBO indicator only harms inferences in a very small subset of cases. In general, including it does not worsen inferences and in many cases can improve them. This approach helps to clarify which points of disagreement are most critical for how the scientific community should interpret and learn from replication efforts.
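The RMSE comparisons described here rest on a diagnosand that can be declared in a single line; the sketch below is illustrative rather than the exact code behind these results.

library(DeclareDesign)

# A hedged sketch: root mean squared error of the estimates with respect to the
# estimand, computed over the simulated designs.
rmse_diagnosand <- declare_diagnosands(
  rmse = sqrt(mean((estimate - estimand)^2))
)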

CONCLUSION

We began with two problems faced by empirical social science researchers: selecting high quality designs and communicating them to others. The preceding sections have demonstrated how the MIDA framework can address both challenges. Once designs are declared in MIDA terms, diagnosing their properties and improving them becomes straightforward. Because MIDA describes a grammar of research designs that applies across a very broad range of empirical research traditions, it enables efficient sharing of designs with others.

Designing high quality research is difficult and comes with many pitfalls, only a subset of which are ameliorated by the MIDA framework.

Others we fail to address entirely and in some cases, we may even exacerbate them. We outline four concerns.

The first is the worry that evaluative weight could get placed on essentially meaningless diagnoses. Given that design declaration includes declarations of conjectures about the world, it is possible to choose inputs so that a design passes any diagnostic test set for it. For instance, a simulation-based claim to unbiasedness that incorporates all features of a design is still only good with respect to the precise conditions of the simulation (in contrast, analytic results, when available, may extend over general classes of designs). Still worse, simulation parameters might be selected because of their properties. A power analysis, for instance, may be useless if implausible parameters are chosen to raise power artificially. While MIDA may encourage more honest declarations, there is nothing in the framework that enforces them. As ever, garbage-in, garbage-out.

Second, we see a risk that research may get evaluated on the basis of a narrow, but perhaps inappropriate set of diagnosands. Statistical power is often invoked as a key design feature, but there may be little value in knowing the power of a study that is biased away from its target of inference. The appropriateness of the diagnosand depends on the purposes of the study. As MIDA is silent on the question of a study's purpose, it cannot guide researchers or critics to the appropriate set of diagnosands by which to evaluate a design. An advantage of the approach is that the choice of diagnosands gets highlighted and new diagnosands can be generated in response to substantive concerns.

Third, emphasis on the statistical properties of a design can obscure the substantive importance of a question being answered or other qualitative features of a design. A similar concern has been raised regarding the "identification revolution," where a focus on identification risks crowding out attention to the importance of questions being addressed (Huber 2013). Our framework can help researchers determine whether a particular design answers a question well (or at all), and it also nudges them to make sure that their questions are defined clearly and independently of their answer strategies. It cannot, however, help researchers choose good questions.

Finally, we see a risk that the variation in the suitability of design declaration to different research strategies may be taken as evidence of the relative superiority of different types of research strategies. While we believe that the range of strategies that can be declared and diagnosed is wider than what one might at first think possible, there is no reason to believe that all strong designs can be declared either ex ante or ex post. An advantage of our framework, we hope, is that it can help clarify when a strategy can or cannot be completely declared. When a design cannot be declared, non-declarability is all the framework provides, and in such cases we urge caution in drawing conclusions about design quality.

We conclude on a practical note. In the end, we are asking that scholars add a step to their workflow. We want scholars to formally declare and diagnose their research designs both in order to learn about them and to improve them. Much of the work of declaring and diagnosing designs is already part of how social scientists conduct research: grant proposals, IRB protocols, preanalysis plans, and dissertation prospectuses contain design information and justifications for why the design is appropriate for the question. The lack of a common language to describe designs and their properties, however, seriously hampers the utility of these practices for assessing and improving design quality. We hope that the inclusion of a declaration and diagnosis step in the research process can help address this basic difficulty.

SUPPLEMENTARY MATERIAL

To view supplementary material for this article, please visit https://doi.org/10.1017/S0003055419000194.

Replication materials can be found on Dataverse at: https://doi.org/10.7910/DVN/XYT1VB.

REFERENCES

Angrist, Joshua D., and Jorn-Steffen Pischke. 2008. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton, NJ: Princeton University Press.

Aronow, Peter M., and Cyrus Samii. 2016. "Does Regression Produce Representative Estimates of Causal Effects?" American Journal of Political Science 60 (1): 250–67.

Balke, Alexander, and Judea Pearl. 1994. Counterfactual Probabilities: Computational Methods, Bounds and Applications. In Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence. Burlington, MA: Morgan Kaufmann Publishers, 46–54.

Baumgartner, Michael, and Alrik Thiem. 2017. "Often Trusted but Never (Properly) Tested: Evaluating Qualitative Comparative Analysis." Sociological Methods & Research: 1–33. Published first online 3 May 2017.

Beach, Derek, and Rasmus Brun Pedersen. 2013. Process-Tracing Methods: Foundations and Guidelines. Ann Arbor, MI: University of Michigan Press.

Bennett, Andrew. 2015. Disciplining Our Conjectures: Systematizing Process Tracing with Bayesian Analysis. In Process Tracing, eds. Andrew Bennett and Jeffrey T. Checkel. Cambridge: Cambridge University Press, 276–98.

Bennett, Andrew, and Jeffrey T. Checkel, eds. 2014. Process Tracing. Cambridge: Cambridge University Press.

Bjorkman, Martina, and Jakob Svensson. 2009. "Power to the People: Evidence from a Randomized Field Experiment of a Community-Based Monitoring Project in Uganda." Quarterly Journal of Economics 124 (2): 735–69.

Bjorkman Nyqvist, Martina, and Jakob Svensson. 2016. "Comments on Donato and Mosqueira's (2016) 'Additional Analyses' of Bjorkman and Svensson (2009)." Unpublished research note.

Blair, Graeme, Jasper Cooper, Alexander Coppock, Macartan Humphreys, and Neal Fultz. 2018. "DeclareDesign." Software package for R, available at http://declaredesign.org.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3: 993–1022.

Brady, Henry E., and David Collier. 2010. Rethinking Social Inquiry: Diverse Tools, Shared Standards. Lanham, MD: Rowman & Littlefield Publishers.

Braumoeller, Bear F. 2003. "Causal Complexity and the Study of Politics." Political Analysis 11 (3): 209–33.

Casey, Katherine, Rachel Glennerster, and Edward Miguel. 2012. "Reshaping Institutions: Evidence on Aid Impacts Using a Pre-Analysis Plan." Quarterly Journal of Economics 127 (4): 1755–812.

Clemens, Michael A. 2017. "The Meaning of Failed Replications: A Review and Proposal." Journal of Economic Surveys 31 (1): 326–42.

Cohen, Jacob. 1977. Statistical Power Analysis for the Behavioral Sciences. New York, NY: Academic Press.

Collier, David. 2011. "Understanding Process Tracing." PS: Political Science & Politics 44 (4): 823–30.

Collier, David. 2014. "Comment: QCA Should Set Aside the Algorithms." Sociological Methodology 44 (1): 122–6.

Collier, David, Henry E. Brady, and Jason Seawright. 2004. Sources of Leverage in Causal Inference: Toward an Alternative View of Methodology. In Rethinking Social Inquiry: Diverse Tools, Shared Standards, eds. David Collier and Henry E. Brady. Lanham, MD: Rowman and Littlefield, 229–66.

Coppock, Alexander. 2019. "Avoiding Post-Treatment Bias in Audit Experiments." Journal of Experimental Political Science 6 (1): 1–4.

Cox, David R. 1975. "A Note on Data-Splitting for the Evaluation of Significance Levels." Biometrika 62 (2): 441–4.

Dawid, A. Philip. 2000. "Causal Inference without Counterfactuals." Journal of the American Statistical Association 95 (450): 407–24.

De la Cuesta, Brandon, and Kosuke Imai. 2016. "Misunderstandings about the Regression Discontinuity Design in the Study of Close Elections." Annual Review of Political Science 19: 375–96.

Deaton, Angus S. 2010. "Instruments, Randomization, and Learning about Development." Journal of Economic Literature 48 (2): 424–55.

Donato, Katherine, and Adrian Garcia Mosqueira. 2016. "Power to the People? A Replication Study of a Community-Based Monitoring Programme in Uganda." 3ie Replication Papers 11.

Dunning, Thad. 2012. Natural Experiments in the Social Sciences: A Design-Based Approach. Cambridge: Cambridge University Press.

Dusa, Adrian. 2018. QCA with R. A Comprehensive Resource. New York, NY: Springer.

Dusa, Adrian, and Alrik Thiem. 2015. "Enhancing the Minimization of Boolean and Multivalue Output Functions with eQMC." Journal of Mathematical Sociology 39 (2): 92–108.

Fairfield, Tasha. 2013. "Going where the Money Is: Strategies for Taxing Economic Elites in Unequal Democracies." World Development 47: 42–57.

Fairfield, Tasha, and Andrew E. Charman. 2017. "Explicit Bayesian Analysis for Process Tracing: Guidelines, Opportunities, and Caveats." Political Analysis 25 (3): 363–80.

Findley, Michael G., Nathan M. Jensen, Edmund J. Malesky, and Thomas B. Pepinsky. 2016. "Can Results-Free Review Reduce Publication Bias? The Results and Implications of a Pilot Study." Comparative Political Studies 49 (13): 1667–703.

Geddes, Barbara. 2003. Paradigms and Sand Castles: Theory Building and Research Design in Comparative Politics. Ann Arbor, MI: University of Michigan Press.

Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.

Gelman, Andrew, and John Carlin. 2014. "Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors." Perspectives on Psychological Science 9 (6): 641–51.

Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York, NY: W. W. Norton.

Goertz, Gary, and James Mahoney. 2012. A Tale of Two Cultures: Qualitative and Quantitative Research in the Social Sciences. Princeton, NJ: Princeton University Press.

Green, Donald P., and Winston Lin. 2016. "Standard Operating Procedures: A Safety Net for Pre-analysis Plans." PS: Political Science and Politics 49 (3): 495–9.

Green, Peter, and Catriona J. MacLeod. 2016. "SIMR: An R Package for Power Analysis of Generalized Linear Mixed Models by Simulation." Methods in Ecology and Evolution 7 (4): 493–8.

Groemping, Ulrike. 2016. "Design of Experiments (DoE) & Analysis of Experimental Data." Last accessed May 11, 2017. https://cran.r-project.org/web/views/ExperimentalDesign.html.

Guo, Yi, Henrietta L. Logan, Deborah H. Glueck, and Keith E. Muller. 2013. "Selecting a Sample Size for Studies with Repeated Measures." BMC Medical Research Methodology 13 (1): 100.

Halpern, Joseph Y. 2000. "Axiomatizing Causal Reasoning." Journal of Artificial Intelligence Research 12: 317–37.

Haseman, Joseph K. 1978. "Exact Sample Sizes for Use with the Fisher-Irwin Test for 2 × 2 Tables." Biometrics 34 (1): 106–9.

Heckman, James J., Sergio Urzua, and Edward Vytlacil. 2006. "Understanding Instrumental Variables in Models with Essential Heterogeneity." The Review of Economics and Statistics 88 (3): 389–432.

Herron, Michael C., and Kevin M. Quinn. 2016. "A Careful Look at Modern Case Selection Methods." Sociological Methods & Research 45 (3): 458–92.

Huber, John. 2013. "Is Theory Getting Lost in the 'Identification Revolution'?" The Monkey Cage blog post.

Hug, Simon. 2013. "Qualitative Comparative Analysis: How Inductive Use and Measurement Error Lead to Problematic Inference." Political Analysis 21 (2): 252–65.

Humphreys, Macartan, and Alan M. Jacobs. 2015. "Mixing Methods: A Bayesian Approach." American Political Science Review 109 (4): 653–73.

Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. "Misunderstandings between Experimentalists and Observationalists about Causal Inference." Journal of the Royal Statistical Society: Series A 171 (2): 481–502.

Imbens, Guido W. 2010. "Better LATE Than Nothing: Some Comments on Deaton (2009) and Heckman and Urzua (2009)." Journal of Economic Literature 48 (2): 399–423.

Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge: Cambridge University Press.

King, Gary, Robert O. Keohane, and Sidney Verba. 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton, NJ: Princeton University Press.

Kreidler, Sarah M., Keith E. Muller, Gary K. Grunwald, Brandy M. Ringham, Zacchary T. Coker-Dukowitz, Uttara R. Sakhadeo, Anna E. Baron, and Deborah H. Glueck. 2013. "GLIMMPSE: Online Power Computation for Linear Models with and without a Baseline Covariate." Journal of Statistical Software 54 (10): 1–26.

Lenth, Russell V. 2001. "Some Practical Guidelines for Effective Sample Size Determination." The American Statistician 55 (3): 187–93.

Lieberman, Evan S. 2005. "Nested Analysis as a Mixed-Method Strategy for Comparative Research." American Political Science Review 99 (3): 435–52.

Lohr, Sharon. 2010. Sampling: Design and Analysis. Boston: Brooks Cole.

Lucas, Samuel R., and Alisa Szatrowski. 2014. "Qualitative Comparative Analysis in Critical Perspective." Sociological Methodology 44 (1): 1–79.

Mackie, John Leslie. 1974. The Cement of the Universe: A Study of Causation. Oxford: Oxford University Press.

Mahoney, James. 2008. "Toward a Unified Theory of Causality." Comparative Political Studies 41 (4–5): 412–36.

Mahoney, James. 2012. "The Logic of Process Tracing Tests in the Social Sciences." Sociological Methods & Research 41 (4): 570–97.

Morris, Tim P., Ian R. White, and Michael J. Crowther. 2019. "Using Simulation Studies to Evaluate Statistical Methods." Statistics in Medicine.

Muller, Keith E., and Bercedis L. Peterson. 1984. "Practical Methods for Computing Power in Testing the Multivariate General Linear Hypothesis." Computational Statistics & Data Analysis 2 (2): 143–58.

Muller, Keith E., Lisa M. Lavange, Sharon Landesman Ramey, and Craig T. Ramey. 1992. "Power Calculations for General Linear Multivariate Models Including Repeated Measures Applications." Journal of the American Statistical Association 87 (420): 1209–26.

Nosek, Brian A., George Alter, George C. Banks, Denny Borsboom, Sara D. Bowman, Steven J. Breckler, Stuart Buck, et al. 2015. "Promoting an Open Research Culture: Author Guidelines for Journals Could Help to Promote Transparency, Openness, and Reproducibility." Science 348 (6242): 1422.

Pearl, Judea. 2009. Causality. Cambridge: Cambridge University Press.

Raffler, Pia, Daniel N. Posner, and Doug Parkerson. 2019. "The Weakness of Bottom-Up Accountability: Experimental Evidence from the Ugandan Health Sector." Working Paper.

Ragin, Charles. 1987. The Comparative Method: Moving beyond Qualitative and Quantitative Strategies. Berkeley, CA: University of California Press.

Rennie, Drummond. 2004. "Trial Registration." Journal of the American Medical Association 292 (11): 1359–62.

Rohlfing, Ingo. 2018. "Power and False Negatives in Qualitative Comparative Analysis: Foundations, Simulation and Estimation for Empirical Studies." Political Analysis 26 (1): 72–89.

Rohlfing, Ingo, and Carsten Q. Schneider. 2018. "A Unifying Framework for Causal Analysis in Set-Theoretic Multimethod Research." Sociological Methods & Research 47 (1): 37–63.

Rosenbaum, Paul R. 2002. Observational Studies. New York, NY: Springer.

Rubin, Donald B. 1984. "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician." Annals of Statistics 12 (4): 1151–72.

Schneider, Carsten Q., and Claudius Wagemann. 2012. Set-Theoretic Methods for the Social Sciences: A Guide to Qualitative Comparative Analysis. Cambridge: Cambridge University Press.

Seawright, Jason, and John Gerring. 2008. "Case Selection Techniques in Case Study Research: A Menu of Qualitative and Quantitative Options." Political Research Quarterly 61 (2): 294–308.

Sekhon, Jasjeet S., and Rocío Titiunik. 2016. "Understanding Regression Discontinuity Designs as Observational Studies." Observational Studies 2: 173–81.

Shadish, William, Thomas D. Cook, and Donald Thomas Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, MA: Houghton Mifflin.

Tanner, Sean. 2014. "QCA Is of Questionable Value for Policy Research." Policy and Society 33 (3): 287–98.

Thiem, Alrik, Michael Baumgartner, and Damien Bol. 2016. "Still Lost in Translation! A Correction of Three Misunderstandings between Configurational Comparativists and Regressional Analysts." Comparative Political Studies 49 (6): 742–74.

Van Evera, Stephen. 1997. Guide to Methods for Students of Political Science. Ithaca, NY: Cornell University Press.

White, Halbert. 1982. "Maximum Likelihood Estimation of Misspecified Models." Econometrica 50 (1): 1–25.

Yamamoto, Teppei. 2012. "Understanding the Past: Statistical Analysis of Causal Attribution." American Journal of Political Science 56 (1): 237–56.

Zarin, Deborah A., and Tony Tse. 2008. "Moving towards Transparency of Clinical Trials." Science 319 (5868): 1340–2.

Zhang, Junni L., and Donald B. Rubin. 2003. "Estimation of Causal Effects via Principal Stratification When Some Outcomes Are Truncated by 'Death'." Journal of Educational and Behavioral Statistics 28 (4): 353–68.

APPENDIX

In this appendix, we demonstrate how each diagnosand-relevant feature of a simple design can be defined in code, with an application in which the assignment procedure is known, as in an experimental or quasi-experimental design.

M(1) The population. Defines the population variables, including both observed and unobserved X. In the example below we define a function that returns a normally distributed variable of a given size. Critically, the declaration is not a declaration of a particular realization of data but of a data generating process. Researchers will typically have a sense of the distribution of covariates from previous work, and may even have an existing data set of the units that will be in the study with background characteristics. Researchers should assess the sensitivity of their diagnosands to different assumptions about P(U).

population <- declare_population(N = 1000, u = rnorm(N))
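To act on the recommendation to probe sensitivity to assumptions about P(U), one might declare an alternative population and re-diagnose the design with it. The heavy-tailed disturbance below, continuing the running example, is an illustrative choice of ours rather than part of the original design.

# A hedged sketch of the sensitivity check recommended above: an alternative
# population declaration with a heavier-tailed disturbance (df = 3 is an
# illustrative choice), to be swapped in when re-diagnosing the design.
population_alt <- declare_population(N = 1000, u = rt(N, df = 3))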

M(2) The structural equations, or potential outcomes function. The potential outcomes function defines conjectured potential outcomes given interventions Z and parents. In the example below the potential outcomes function maps from a treatment condition vector (Z) and background data u, generated by P(U), to a vector of outcomes. In this example the potential outcomes function satisfies a SUTVA condition: each unit's outcome depends on its own condition only, though in general, since Z is a vector, it need not.

potential_outcomes <- declare_potential_outcomes(Y ~ 0.25 * Z + u)

In many cases, the potential outcomes function (or its features) is the very thing that the study sets out to learn, so it can seem odd to assume features of it. We suggest two approaches to developing potential outcomes functions that will yield useful information about the quality of designs. First, consider a null potential outcomes function in which the variables of interest are set to have no effect on the outcome whatsoever. Diagnosands such as bias can then be assessed relative to a true estimand of zero. This approach will not work for diagnosands like power or the Type-S rate. Second, set a series of potential outcomes functions that correspond to competing theories. This approach enables the researcher to judge whether the design yields answers that help adjudicate between the theories.
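As a hedged illustration of these two approaches, continuing the running example (the 0.5 effect size for the competing theory is our own placeholder):

# (1) Null potential outcomes: Z has no effect, so the true ATE is zero.
potential_outcomes_null <- declare_potential_outcomes(Y ~ 0 * Z + u)

# (2) Competing theories: a second conjecture under which Z has a larger effect.
potential_outcomes_alt <- declare_potential_outcomes(Y ~ 0.5 * Z + u)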

I Estimands. The estimand function creates a summary of potential outcomes. In principle, the estimand function can also take realizations of assignments as arguments, in order to calculate post-treatment estimands. Below, the estimand is the Average Treatment Effect, or the average difference between treated and untreated potential outcomes.

estimand <- declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0))
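The possibility of post-treatment estimands mentioned above can be made concrete with a hedged sketch like the following, which is our own illustration rather than part of the original appendix.

# A hedged sketch of a post-treatment estimand, continuing the running example:
# the average treatment effect among treated units (ATT). Because it depends on
# the realized assignment Z, this estimand step would need to be placed after
# the assignment step when the design is assembled. The "ATT" label is ours.
estimand_att <- declare_estimand(ATT = mean(Y_Z_1[Z == 1] - Y_Z_0[Z == 1]))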

D(1) The sampling strategy. Defines the distribution over possible samples for which outcomes are measured, pS. In the example below each unit generated by P(U) is sampled with 10% probability. Again, sampling describes a sampling strategy and not an actual sample.

sampling <- declare_sampling(n = 100)

D(2) The treatment assignment strategy. Defines the strategy for assigning variables under the notional control of researchers. In this example each sampled unit is assigned to treatment independently with probability 0.5. In designs in which the sampling process or the assignment process are in the control of researchers, pz is known. In observational designs, researchers either know or assume pz based on substantive knowledge. We make explicit here an additional step in which the outcome Y is revealed after Z is determined.

assignment <- declare_assignment(m = 50)
reveal <- declare_reveal(Y, Z)

A The answer strategies are functions that use information from realized data and the design, but do not have access to the full schedule of potential outcomes. In the declaration we associate estimators with estimands and we record a set of summary statistics that are required to compute diagnostic statistics.

In the example below, an estimator function takes data and returns an estimate of a treatment effect using the difference-in-means estimator, as well as a set of associated statistics, including the standard error, p-value, and the confidence interval.

estimator <- declare_estimator(Y ~ Z, model = lm_robust, estimand = "ATE")

We then declare the design by adding together the elements. Order matters. Since we have defined the estimand before the sampling step, our estimand is the Population Average Treatment Effect, not the Sample Average Treatment Effect. We have also included a declare_reveal() step between the assignment and estimation steps that reveals the outcome Y on the basis of the potential outcomes and a realized random assignment.

design <- population + potential_outcomes + estimand +
  sampling + assignment + reveal + estimator
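To make the role of ordering concrete, the hedged sketch below moves the estimand after the sampling step so that the target of inference becomes the Sample Average Treatment Effect; the "SATE" label and object names are ours.

# A hedged sketch, continuing the running example: placing the estimand after
# the sampling step targets the Sample Average Treatment Effect instead.
sate <- declare_estimand(SATE = mean(Y_Z_1 - Y_Z_0))
estimator_sate <- declare_estimator(Y ~ Z, model = lm_robust, estimand = "SATE")

design_sate <- population + potential_outcomes +
  sampling + sate + assignment + reveal + estimator_sate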

These six features represent the study. In order to assess the completeness of a declaration and to learn about the properties of the study, we also define functions for the diagnostic statistics, t(D, Y, f), and diagnosands, u(D, Y, f, g). For simplicity, the two can be coded as a single function. For example, the bias of the design can be declared as a diagnosand as follows:

diagnosand <- declare_diagnosands(bias = mean(estimate - estimand))

Diagnosing the design involves simulating the design many times, then calculating the value of the diagnosand from the resulting simulations.

diagnosis <- diagnose_design(design = design, diagnosands = diagnosand,
                             sims = 500, bootstrap_sims = FALSE)

The diagnosis returns an estimate of the diagnosands, along with other metadata associated with the simulations.
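As a usage note (our own sketch, not from the original appendix), the diagnosis object can be printed directly, and, assuming the accessor functions exported by DeclareDesign at the time of writing, its components can be extracted as follows:

# A hedged usage sketch: inspecting the diagnosis object from the running example.
diagnosis                    # prints a summary table of the diagnosand estimates
get_diagnosands(diagnosis)   # the diagnosand estimates as a data frame
get_simulations(diagnosis)   # the underlying simulation-level results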
