Assessing J4P Projects: Responding constructively to pervasive challenges


  • Assessing J4P Projects: Responding constructively to pervasive challenges
    Michael Woolcock, Development Research Group, The World Bank
    Washington, June 6, 2007

  • Overview

    Three challenges:
    1. Allocating development resources
    2. Assessing project effectiveness (in general)
    3. Assessing J4P effectiveness (in particular)
    Discussion of options and strategies for assessing J4P pilots

  • Three challenges
    How to allocate development resources?
    How to assess project effectiveness (in general)?
    How to assess J4P effectiveness (in particular)?

  • 1. Allocating development resources
    How to allocate finite resources to projects believed likely to have a positive development impact?
    Allocations are made for good and bad reasons, only a part of which is evidence-based, but most of which is theory-based, i.e., done because of an implicit (if not explicit) belief that Intervention A will cause Impact B in Place C, net of Factors D and E, for Reasons F and G.
    E.g., micro-credit will raise the income of villagers in Flores, independently of their education and wealth, because it enhances their capacity to respond to shocks (floods, illness) and enables larger-scale investment in productive assets (seeds, fertilizer).

  • Allocating development resources
    Imperatives of the prevailing resource allocation mechanisms (e.g., those of the World Bank) strongly favor one-size-fits-all policy solutions (despite protestations to the contrary!) that deliver predictable, readily measurable results in a short time frame: roads, electrification, immunization.
    Projects that diverge from this structure, e.g., J4P, enter the resource allocation game at a disadvantage. But the obligation to demonstrate impact (rightly) remains; just need to enter the fray well armed, empirically and politically.

  • 2. How to Assess Project Effectiveness?
    Need to disentangle the effect of a given intervention over and above other factors occurring simultaneously: distinguishing between the signal and the noise.
    Is my job creation program reducing unemployment, or is it just the booming economy?
    Furthermore, an intervention itself may have many components. TTLs are most immediately concerned about which aspect is most important, or the binding constraint (important as this is, it is not the same thing as assessing impact).
    Need to be able to make defensible causal claims about project efficacy even (especially) when apparently rigorous econometric methods aren't suitable or available.
    Thus the need to change both the terms and the content of the debate.

  • Impact Evaluation 101
    Core evaluation challenge: disentangling effects of people, place, and project (or policy) from what would have happened otherwise, i.e., the need for a counterfactual (but this is rarely observed).
    Tin standard: beneficiary assessments, administrative checks
    Silver: double difference (before/after, program/control)
    Gold: randomized allocation, natural experiments
    (Diamond?): randomized, triple-blind, placebo-controlled, cross-over
    Alchemy?: making gold with what you have, given prevailing constraints (people, money, time, logistics, politics)

  • We observe an outcome indicator
    [Diagram: outcome Y plotted over time; value Y0 at t=0, with the intervention marked and Y1 (observed) labeled]

  • and its value rises after the program
    [Diagram: the observed value rises from Y0 at t=0 to Y1 (observed) at t=1, following the intervention]

  • However, we need to identify the counterfactual
    [Diagram: the counterfactual Y1* (what the outcome would have been at t=1 without the intervention) is shown alongside the observed Y1]

  • since only then can we determine the impact of the intervention
    [Diagram: Impact = Y1 - Y1*, the gap at t=1 between the observed outcome Y1 and the counterfactual Y1*]
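
To make the counterfactual arithmetic concrete, here is a minimal sketch in Python with purely hypothetical numbers, contrasting the naive before/after change (Y1 - Y0) with the impact defined above (Y1 - Y1*):

```python
# Minimal sketch of the counterfactual logic, using made-up numbers.
y0 = 100.0                 # outcome at t=0, before the intervention
y1_observed = 130.0        # outcome at t=1, with the intervention
y1_counterfactual = 118.0  # what the outcome would have been at t=1 without it (never observed in practice)

naive_before_after = y1_observed - y0             # 30: confounds the program with everything else
impact = y1_observed - y1_counterfactual          # 12: the effect attributable to the intervention

print(f"Naive before/after change: {naive_before_after}")
print(f"Impact (Y1 - Y1*): {impact}")
```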

  • Problems when evaluation is not built in ex-ante
    Need a reliable comparison group:
    Before/after: other things may happen
    Units with/without the policy: may be different for reasons other than the policy (e.g., because the policy is placed in specific areas)

  • How can we fill in the missing data on the counterfactual?
    Randomization
    Quasi-experiment: matching, propensity-score matching, difference-in-difference, matched double difference, regression discontinuity design, instrumental variables
    Comparison group designs: designs pairing jurisdictions, lagged start designs, naturally occurring comparison groups

  • 1. Randomization

    The randomized-out group reveals the counterfactual: only a random sample participates.
    As long as the assignment is genuinely random, impact is revealed in expectation.
    Randomization is the theoretical ideal and the benchmark for non-experimental methods; identification issues are more transparent compared with other evaluation techniques.
    But there are problems in practice:
    Internal validity: selective non-compliance
    External validity: difficult to extrapolate results from a pilot experiment to the whole population
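
As a rough illustration of why randomization works, the following sketch uses synthetic data (the outcome model and effect size are assumptions invented for this example): with genuinely random assignment, the control-group mean stands in for the counterfactual, so a simple difference in means recovers the effect in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical population: baseline outcomes vary for reasons unrelated to the program.
baseline = rng.normal(loc=50, scale=10, size=n)
true_effect = 5.0  # assumed "true" treatment effect, unknown to the evaluator

# Genuinely random assignment to treatment and control groups.
treated = rng.random(n) < 0.5
outcome = baseline + true_effect * treated + rng.normal(scale=5, size=n)

# Because assignment is random, the control-group mean stands in for the counterfactual.
estimated_impact = outcome[treated].mean() - outcome[~treated].mean()
print(f"Estimated impact: {estimated_impact:.2f} (true effect: {true_effect})")
```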

  • An example from Mexico
    Progresa: grants to poor families (women), conditional on preventive health care and school attendance for children
    The Mexican government wanted an evaluation; the order of community phase-in was random
    Results: child illness down 23%; height increased 1-4 cm; 3.4% increase in enrollment
    After the evaluation: PROGRESA expanded within Mexico, and similar programs were adopted throughout other Latin American countries

  • An example from Kenya
    School-based de-worming: treat with a single pill every 6 months, at a cost of 49 cents per student per year
    27% of treated students had moderate-to-heavy infection, versus 52% of the comparison group
    Treatment reduced school absenteeism by 25%, or 7 percentage points
    Costs only $3 per additional year of school participation

  • 2. Matching

    Matched comparators identify the counterfactual.
    Propensity-score matching: match on the basis of the probability of participation.
    Match participants to non-participants from a larger survey; the matches are chosen on the basis of similarities in observed characteristics.
    This assumes no selection bias based on unobservable heterogeneity (i.e., things that are not readily measurable by orthodox surveys, such as motivation, connections).
    The validity of matching methods depends heavily on data quality.
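
A minimal propensity-score matching sketch using synthetic data; the covariates, the logistic model for the participation probability, and the one-to-one nearest-neighbour matching rule are illustrative assumptions rather than anything prescribed in the presentation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical observed characteristics (e.g., education, wealth).
X = rng.normal(size=(n, 2))

# Participation depends on observables only (the key matching assumption).
participate_prob = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
treated = rng.random(n) < participate_prob
outcome = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 3.0 * treated + rng.normal(size=n)  # assumed true effect = 3

# 1. Estimate the propensity score: probability of participation given observables.
pscore = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2. For each participant, find the nearest non-participant on the propensity score.
treated_idx = np.where(treated)[0]
control_idx = np.where(~treated)[0]
matches = control_idx[np.abs(pscore[control_idx][None, :] - pscore[treated_idx][:, None]).argmin(axis=1)]

# 3. Average outcome gap between participants and their matched comparators.
att = (outcome[treated_idx] - outcome[matches]).mean()
print(f"Matched estimate of the treatment effect: {att:.2f}")
```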

  • 3. Difference-in-difference (double difference)
    Collect baseline data on non-participants and (probable) participants before the program.
    Compare with data after the program. Subtract the two differences, or use a regression with a dummy variable for participants.
    This allows for selection bias, but it must be time-invariant and additive.
    Observed changes over time for non-participants provide the counterfactual for participants.
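
A minimal double-difference sketch with hypothetical group means; the numbers are invented so that a time-invariant, additive gap between the groups is visibly differenced away.

```python
# Hypothetical group means before (t=0) and after (t=1) the program.
participants_before, participants_after = 40.0, 55.0
nonparticipants_before, nonparticipants_after = 30.0, 38.0  # groups differ at baseline: selection bias

# Change over time within each group.
change_participants = participants_after - participants_before           # 15
change_nonparticipants = nonparticipants_after - nonparticipants_before  # 8: the counterfactual trend

# Double difference: a time-invariant, additive selection bias cancels out.
did_impact = change_participants - change_nonparticipants                # 7
print(f"Difference-in-difference estimate of impact: {did_impact}")
```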

  • The Assessing J4P Challenge
    You're a star in development if you devise a best practice and a tool kit, i.e., a universal, easy-to-administer solution to a common problem.
    There are certain problems for which finding such a universal solution is both desirable and possible (e.g., TB, roads for high-rainfall environments).
    But many key problems, such as those pertaining to local governance and law reform (e.g., J4P), inherently require context-specific solutions that are heavily dependent on negotiation and teamwork, not a technology (pills, bridges, seeds).
    It is not clear that if such a project works here it will also work there, or that bigger will be better.
    Assessing such complex projects is enormously difficult.

  • Why are complex interventions so hard to evaluate? A simple example
    You are the inventor of BrightSmile, a new toothpaste that you are sure makes teeth whiter and reduces cavities without any harmful side effects. How would you prove this to public health officials and (say) Colgate?
    Hopefully (!), you would be able to:
    Randomly assign participants to a treatment and a control group (and then have them switch after a certain period)
    Make sure both groups brushed the same way, with the same frequency, using the same amount of paste and the same type of brush
    Ensure nobody (except an administrator) knew who was in which group
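
A sketch of the assignment step only, under assumptions not in the original slide (invented participant IDs, a two-period crossover schedule, and coded product labels standing in for blinding):

```python
import random

random.seed(42)
participants = [f"P{i:03d}" for i in range(1, 41)]  # hypothetical participant IDs

# Randomly split into two arms of equal size.
shuffled = random.sample(participants, len(participants))
arm_a, arm_b = shuffled[: len(shuffled) // 2], shuffled[len(shuffled) // 2 :]

# Two-period crossover: each arm switches products at the midpoint.
# Only the administrator's key maps the coded labels to BrightSmile vs. placebo.
schedule = {pid: ("X", "Y") for pid in arm_a}        # period 1: product X, period 2: product Y
schedule.update({pid: ("Y", "X") for pid in arm_b})  # period 1: product Y, period 2: product X

print(schedule["P001"])  # e.g., ('X', 'Y'); the participant never learns which code is which
```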

  • Cf. Demonstrating impact of BrightSmile vs. J4P projects
    Enormously difficult (methodologically, logistically, and empirically) to formally identify impact; equally problematic to draw general policy implications, especially for other countries.
    Prototypical complex CDD/J4P project:
    Open project menu: unconstrained content of intervention
    Highly participatory: communities control resources and decision-making
    Decentralized: local providers and communities given a high degree of discretion in implementation
    Emphasis on building capabilities and the capacity for collective action
    Context-specific: project is (in principle) designed to respond to and reflect local cultural realities
    Project's impact may be non-additive (e.g., stepwise, exponential, high initially then tapering off) [DIAGRAM]

  • How does J4P work over time? (or, what is its functional form?)
    [Diagram: four hypothetical impact-over-time curves, panels A-D, labeled CCTs?, Governance?, AIDS awareness?, Bridges?]

  • How does J4P work over time? (or, what is its functional form?)
    [Diagram: four more impact-over-time curves, panels E-H, labeled Empowerment?, Pest control? (e.g., cane toads), Unknown/Unknowable?, and ?]
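
Purely to illustrate what "functional form" means here, the sketch below generates a few hypothetical impact-over-time curves; the shapes and parameters are invented and do not correspond to any actual project.

```python
import numpy as np

t = np.linspace(0, 10, 200)  # time since project start, arbitrary units

# Illustrative impact trajectories (all hypothetical):
linear = 2.0 * t                         # steady, additive gains
stepwise = 10.0 * (t > 4)                # nothing until a threshold is crossed, then a jump
s_curve = 20.0 / (1 + np.exp(-(t - 5)))  # slow start, acceleration, then a plateau
boom_bust = 15.0 * t * np.exp(-0.6 * t)  # high early impact that later tapers off (or backfires)

for name, curve in [("linear", linear), ("stepwise", stepwise),
                    ("s-curve", s_curve), ("boom-bust", boom_bust)]:
    print(f"{name:>9}: impact at t=2 is {curve[np.searchsorted(t, 2)]:.1f}, at t=10 is {curve[-1]:.1f}")
```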

  • Science, Complexity, and Evaluation
    [Diagram; only the axis/legend fragments "Lo", "Many", "Wide", "Narrow" survive from the original slide]

  • So, what can we do when...
    Inputs are variables (not constants)? Facilitation/participation vs. tax cuts (seeds, pills, etc.); teaching vs. textbooks; therapy vs. medicine
    Adapting to context is an explicit, desirable feature? Each context/project nexus is thus idiosyncratic
    Outcomes are inherently hard to define and measure? E.g., empowerment, collective action, conflict mediation, social capital

  • Using Mixed Methods to Make Causal Claims
    Alternative approaches to understanding causality:
    Econometrics: robustness tests on large-N datasets; controlling statistically for various contending factors
    History: processes (process tracing), conjunctures shaping single/rare events
    Anthropology: deep knowledge of contexts
    Exploring inductive approaches, cf. courtroom lawyers: present various types and quality of evidence (qualitative and quantitative) to test particular hypotheses about the efficacy of J4P
    The art of research/evaluation is knowing how to work within time, budgetary, and human resource constraints to answer important problems, drawing on an optimal package of data and the methods available to procure and interpret it

  • Techniques, Tools, Instruments
    Interviews (individuals, key informants); discussion groups; literature search; archive/file review; questionnaire survey; case study; aptitude or knowledge test; opinion poll; content analysis (e.g., of newspapers)
    Practically all the techniques used in the social sciences, especially in statistics, can be used for evaluation.

  • Be innovative on sampling
    Can't really take random samples, or assign villagers to treatment and control groups (though one may be able to do this with specific aspects of projects, e.g., Olken)
    Comparative case study methods use theory (or knowledge of context!) to identify 2-3 factors deemed most likely to influence project impact, e.g., quality of local government, access to markets, etc.
    Control for these contextual effects by selecting high and low areas of each variable, then use propensity-score matching methods, plus qualitative insights, to select matched program and comparison areas (see the sketch below)
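
A minimal sketch of that sampling logic, with invented area-level data, two assumed contextual variables, and median cut-offs; the within-cell matching itself could then proceed as in the earlier propensity-score sketch.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical area-level contextual data: governance quality and market access scores.
n_areas = 200
governance = rng.uniform(0, 1, n_areas)
market_access = rng.uniform(0, 1, n_areas)
has_program = rng.random(n_areas) < 0.3  # whether the J4P pilot operates in the area

# Stratify areas into high/low cells on each contextual variable (median splits assumed).
gov_high = governance > np.median(governance)
mkt_high = market_access > np.median(market_access)

# Within each of the four cells, list program areas and the pool of candidate comparison areas.
for g in (True, False):
    for m in (True, False):
        cell = (gov_high == g) & (mkt_high == m)
        program_areas = np.where(cell & has_program)[0]
        comparison_pool = np.where(cell & ~has_program)[0]
        print(f"governance={'high' if g else 'low'}, market access={'high' if m else 'low'}: "
              f"{len(program_areas)} program areas, {len(comparison_pool)} candidate comparisons")
```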

  • Impact Evaluation Helps Us
    To determine mean impact: very important for policy decisions
    But it provides little grounds for answering other key questions, for example:
    Would a scaled-up project be even better?
    Can the same results be expected elsewhere?
    Where is there room for improvement?
    Which aspects of a multi-faceted project are most important (and/or the binding constraint)?