CHAPTER 17

Program Evaluation: Principles, Procedures, and Practices

Aurelio José Figueredo, Sally Gayle Olderbak, Gabriel Lee Schlomer, Rafael Antonio Garcia, and Pedro Sofio Abril Wolf

Abstract

This chapter provides a review of the current state of the principles, procedures, and practices within program evaluation. We address a few incisive and difficult questions about the current state of the field: (1) What are the kinds of program evaluations? (2) Why do program evaluation results often have so little impact on social policy? (3) Does program evaluation suffer from a counterproductive system of incentives? and (4) What do program evaluators actually do? We compare and contrast the merits and limitations, strengths and weaknesses, and relative progress of the two primary contemporary movements within program evaluation, Quantitative Methods and Qualitative Methods, and we propose an epistemological framework for integrating the two movements as complementary forms of investigation, each contributing to different stages in the scientific process. In the final section, we provide recommendations for systemic institutional reforms addressing identified structural problems within the real-world practice of program evaluation.

Key Words: Program evaluation, formative evaluation, summative evaluation, social policy, moral hazards, perverse incentives, quantitative methods, qualitative methods, context of discovery, context of justification

Introduction

President Barack Obama's 2010 Budget included many statements calling for the evaluation of more U.S. Federal Government programs (Office of Management and Budget, 2009). But what precisely is meant by the term evaluation? Who should conduct these evaluations? Who should pay for these evaluations? How should these evaluations be conducted?

This chapter provides a review of the principles, procedures, and practices within program evaluation. We start by posing and addressing a few incisive and difficult questions about the current state of that field:

1. What are the different kinds of program evaluations?

2. Why do program evaluation results often have so little impact on social policy?

3. Does program evaluation suffer from a counterproductive system of incentives?

We then ask a fourth question regarding the real-world practice of program evaluation: What do program evaluators actually do? In the two sections that follow, we try to answer this question by reviewing the merits and limitations, strengths and weaknesses, and relative progress of the two primary contemporary "movements" within program evaluation and the primary methods of evaluation upon which they rely: Part 1 addresses Quantitative Methods and Part 2 addresses Qualitative Methods. Finally, we propose a framework for the integration of the two movements as complementary forms of investigation in program evaluation, each contributing to different stages in the scientific process. In the final section, we provide recommendations for systemic institutional reforms addressing identified structural problems within the real-world practice of program evaluation.

What Are the Different Kinds of Program Evaluations?

Scriven (1967) introduced the important distinction between summative and formative program evaluations. The goal of a summative evaluation is to judge the merits of a fixed, unchanging program as a finished product, relative to potential alternative programs. This judgment should consist of an analysis of the costs and benefits of the program, as compared with other programs targeted at similar objectives, to justify the expenses and opportunity costs society incurs in implementing one particular program as opposed to an alternative program, as well as in contrast to doing nothing at all. Further, a summative evaluation must examine both the intended and the unintended outcomes of the programmatic intervention and not just the specific stated goals, as represented by the originators, administrators, implementers, or advocates of the program (Scriven, 1991). A formative evaluation, on the other hand, is an ongoing evaluation of a program that is not fixed but is still in the process of change. The goal of a formative evaluation is to provide feedback to the program managers about what is and what is not working well, with the purpose of improving the program rather than rendering a final judgment on its relative merits.

The purely dichotomous and mutually exclusive model defining the differences between summative and formative evaluations has been softened and qualified somewhat over the years. Tharp and Gallimore (1979, 1982), in their research and development (R&D) program for social action, proposed a model of evaluation succession, patterned on the analogy of ecological succession, wherein an ongoing, long-term evaluation begins as a formative program evaluation and acquires features of a summative program evaluation as the program naturally matures, aided by the continuous feedback from the formative program evaluation process. Similarly, Patton (1996) has proposed a putatively broader view of program evaluation that falls between the poles of the summative-versus-formative dichotomy: (1) knowledge-generating evaluation, an evaluation designed to increase our conceptual understanding of a particular topic; (2) developmental evaluation, an ongoing evaluation that strives to continuously improve the program; and (3) using the evaluation processes, which involves more intently engaging the stakeholders, and others associated with the evaluation, to think more about the program and ways to improve its efficacy or effectiveness. Patton has argued that the distinction between summative and formative evaluation is diminishing and that there is a movement within the field of program evaluation toward more creative uses and applications of evaluation. What he termed knowledge-generating evaluation is a form of evaluation focused not on the instrumental use of evaluation findings (e.g., making decisions based on the results of the evaluation) but, rather, on the conceptual use of evaluation findings (e.g., theory construction).

A developmental evaluation (Patton, 1994) is a form of program evaluation that is ongoing and is focused on the development of the program. Evaluators provide constant feedback, but not always in the form of official reports. Developmental evaluation assumes that components of the program under evaluation are constantly changing, and so the evaluation is not geared toward eventually requiring a summative program evaluation but, rather, is focused on constantly adapting and evolving the evaluation to fit the evolving program. Patton (1996) proposed that program evaluators should focus not only on reaching the evaluation outcomes, but also on the process of the evaluation itself, in that the evaluation itself can be "participatory and empowering . . . increasing the effectiveness of the program through the evaluation process rather than just the findings" (p. 137).

Stufflebeam (2001) has presented a larger classification of the different kinds of evaluation, consisting of 22 alternative approaches to evaluation that can be classified into four categories. Stufflebeam's first category is called Pseudoevaluations and encompasses evaluation approaches that are often motivated by politics, which may lead to misleading or invalid results. Pseudoevaluation approaches include: (1) Public Relations-Inspired Studies and (2) Politically Controlled Studies (for a description of each of the 22 evaluation approaches, please refer to Stufflebeam's [2001] original paper). Stufflebeam's second category is called Questions-and-Methods Evaluation Approaches (Quasi-Evaluation Studies) and encompasses evaluation approaches geared to address a particular question or apply a particular method, which often results in narrowing the scope of the evaluation. This category includes: (3) Objectives-Based Studies; (4) Accountability, Particularly Payment by Results; (5) Objective Testing Programs; (6) Outcome Evaluation as Value-Added Assessment; (7) Performance Testing; (8) Experimental Studies; (9) Management Information Systems; (10) Benefit–Cost Analysis Approach; (11) Clarification Hearing; (12) Case Study Evaluations; (13) Criticism and Commentary; (14) Program Theory-Based Evaluation; and (15) Mixed-Methods Studies.

Stufflebeam's (2001) third category, Improvement/Accountability-Oriented Evaluation Approaches, is the most similar to the commonly used definition of program evaluation and encompasses approaches that are extensive and expansive in their approach and selection of outcome variables and that use a multitude of qualitative and quantitative methodologies for assessment. These approaches include: (16) Decision/Accountability-Oriented Studies; (17) Consumer-Oriented Studies; and (18) Accreditation/Certification Approach. Stufflebeam's fourth category is called Social Agenda/Advocacy Approaches and encompasses evaluation approaches that are geared toward directly benefiting the community in which they are implemented, sometimes so much so that the evaluation may be biased, and that are heavily influenced by the perspectives of the stakeholders. These approaches include: (19) Client-Centered Studies (or Responsive Evaluation); (20) Constructivist Evaluation; (21) Deliberative Democratic Evaluation; and (22) Utilization-Focused Evaluation.

These different types of program evaluations are not exhaustive of all the types that exist, but they are the ones that we consider most relevant to the current analysis and ultimate recommendations.

Why Do Program Evaluation Results Often Have So Little Impact on Social Policy?

At the time of writing, the answer to this question is not completely knowable. Until we have more research on this point, we cannot completely document the impact that program evaluation has on public policy. Many other commentators on program evaluation (e.g., Weiss, 1999), however, have made the point that program evaluation does not have as much of an impact on social policy as we would like it to have. To illustrate this point, we will use two representative case studies: the Kamehameha Early Education Project (KEEP) and Drug Abuse Resistance Education (DARE). Although the success or failure of a program and the success or failure of a program evaluation are two different things, one is intimately related to the other, because the success or failure of the program evaluation is necessarily considered in reference to the success or failure of the program under evaluation.

The Frustrated Goals of Program Evaluation

When it comes to public policy, one goal of an evaluation should be to help funding agencies, such as governmental entities, decide whether to terminate, cut back, continue, scale up, or disseminate a program depending on its success or failure; this would be the main goal of a summative program evaluation. An alternative goal might be to suggest modifications to existing programs in response to data gathered and analyzed during an evaluation, which would be the main goal of a formative program evaluation. Although these two goals are the primary purposes of program evaluation, in reality policymakers rarely utilize the evaluation findings for these goals and rarely make decisions based on the results of evaluations. Even an evaluation that was successful in its process can be blatantly ignored and thus result in a failure in its outcome. We relate this undesirable state of affairs, further below, to the concept of a market failure from economic theory.

According to Weiss (1999), there are four major reasons that program evaluations may not have a direct impact on decisions by policymakers (the "Four I's"). First, when making decisions, a host of competing interests present themselves. Because of this competition, the results of different evaluations can be used to the benefit or detriment of the causes of various interested parties. Stakeholders with conflicting interests can put the evaluator between a rock and a hard place. An example of this is when a policymaker receives negative feedback regarding a program. On the one hand, the policymaker is interested in supporting successful programs, but on the other hand, a policymaker who needs to get re-elected might not want to be perceived as "the guy who voted no on drug prevention." Second, the ideologies of different stakeholder groups can also be a barrier to the utilization of program evaluation results. These ideologies filter potential solutions and restrict the results to which policymakers will listen. This occurs most often when the ideology claims that something is "fundamentally wrong."

For example, an abstinence-only program, designed to prevent teenage pregnancy, may be in competition with a program that works better; but because the better program passes out condoms to teenagers, the abstinence-only plan may be funded because of the ideologies of the policymakers or their constituents. Third, the information contained in the evaluation report itself can be a barrier. The results of evaluations are not the only source of information and are often not the most salient. Policymakers often have extensive information regarding a potential policy, and the results of the evaluation are competing with every other source of information that can enter the decision-making process. Finally, the institutional characteristics of the program itself can become a barrier. The institution is made up of people working within the context of a set structure and a history of behavior. Because of these institutional characteristics, change may be difficult or even considered "off-limits." For example, if an evaluation results in advocating the elimination of a particular position, then the results may be overlooked because the individual currently in that position is 6 months from retirement. Please note that we are not making a value judgment regarding the relative merits of such a decision but merely describing the possible situation.

The utilization of the results of an evaluation is the primary objective of an evaluation; however, it is often the case that evaluation results are put aside in favor of other, less optimal actions (Weiss, 1999). This is not a problem novel to program evaluators but a problem that burdens most applied social science. A prime example of this problem is that of the reliability of eyewitness testimony. Since Elizabeth Loftus published her 1979 book, Eyewitness Testimony, there has been extensive work done on the reliability of eyewitnesses and the development of false memories. Nevertheless, it took 20 years for the U.S. Department of Justice to institute national standards reflecting the implications of these findings (Wells et al., 2000). Loftus did accomplish what Weiss refers to as "enlightenment" (Weiss, 1980), or the bringing of scientific data into the applied realm of policymaking. Although ideally programs would implement evaluation findings immediately, this simply does not often happen. As stated by Weiss (1999), the volume of information that organizations or policymakers have regarding a particular program is usually too vast to be overthrown by one dissenting evaluation. These problems appear to be inherent in the social sciences and in program evaluation, and it is unclear how to ameliorate them.

To illustrate how programs and program evaluations can succeed or fail, we use two representative case studies: one notable success of the program evaluation process, KEEP, and one notable failure of the program evaluation process, DARE.

Kamehameha Early Education Project

A classic example of a successful program evaluation described by Tharp and Gallimore (1979) was that of KEEP. The Kamehameha Early Education Project was started in 1970 to improve the reading and general education of Hawaiian children. The project worked closely with program evaluators to identify solutions for many of the unique problems faced by young Hawaiian-American children in their education, from kindergarten through third grade, and to discover methods for disseminating these solutions to the other schools in Hawaii. The evaluation took 7 years before significant improvement was seen and involved a multidisciplinary approach, including theoretical perspectives from the fields of psychology, anthropology, education, and linguistics.

Based on their evaluation of KEEP, Tharp and Gallimore (1979) identified four necessary conditions for a successful program evaluation: (1) longevity—evaluations need time to take place, which requires stability in other areas of the program; (2) stability in the values and goals of the program; (3) stability of funding; and (4) the opportunity for the evaluators' recommendations to influence the procedure of the program.

In terms of the "Four I's," the interests of KEEP were clear and stable. The project was interested in improving general education processes. In terms of ideology and information, KEEP members believed that the evaluation process was vital to its success and trusted the objectivity of the evaluators, taking their suggestions to heart. From its inception, the institution had an evaluation system built in. Since continuing evaluations were in process, the program itself had no history of institutional restriction of evaluations.

Drug Abuse Resistance Education

In this notable case, we are not so much highlighting the failure of a specific program evaluation, or of a specific program per se, as highlighting the institutional failure of program evaluation as a system, at least as currently structured in our society. In the case of DARE, a series of program evaluations produced results that, in the end, were not acted upon. Rather, what should have been recognized as a failed program lives on to this day. The DARE program was started in 1983, and the goal of the program was to prevent drug use. Although there are different DARE curricula, depending on the targeted age group, the essence of the program is that uniformed police officers deliver a curriculum in safe classroom environments aimed at preventing drug use among the students. As of 2004, DARE had been the most successful school-based prevention program in attracting federal money: The estimated average federal expenditure is three-quarters of a billion dollars per year (West & O'Neal, 2004). Although DARE is successful at infiltrating school districts and attracting tax dollars, research spanning more than two decades has shown that the program is ineffective at best and detrimental at worst. One of the more recent meta-analyses (West & O'Neal, 2004) estimated that the average effect size for DARE's effectiveness was extremely low and not even statistically significant (r = 0.01; Cohen's d = 0.02; 95% confidence interval = −0.04, 0.08).
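For readers comparing the two effect-size metrics reported by West and O'Neal (2004), a standard two-group conversion from a correlation r to Cohen's d (a textbook approximation, not a formula given in this chapter) reproduces the reported value:

\[ d \;=\; \frac{2r}{\sqrt{1 - r^{2}}} \;=\; \frac{2(0.01)}{\sqrt{1 - (0.01)^{2}}} \;\approx\; 0.02 \]

An effect of roughly two-hundredths of a standard deviation is negligible for any practical purpose, which is why a confidence interval spanning zero is unsurprising here.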

Early studies pointed to the ineffectiveness of the DARE program (Ennett, Tobler, Ringwalt, & Flewelling, 1994; Clayton, Cattarello, & Johnstone, 1996; Dukes, Ullman, & Stein, 1996). In response to much of this research, the Surgeon General placed the DARE program in the "Does Not Work" category of programs in 2001. In 2003, the U.S. Government Accountability Office (GAO) wrote a letter to congressmen citing a series of empirical studies from the 1990s showing that in some cases DARE is actually iatrogenic, meaning that it does more harm than good.

Despite all the evidence, DARE is still heavily funded by tax dollars through the following government agencies: the California National Guard, Combined Federal Campaign (CFC), Florida National Guard, St. Petersburg College Multijurisdictional Counterdrug Task Force Training, Indiana National Guard, Midwest Counterdrug Training Center/National Guard, U.S. Department of Defense, U.S. Department of Justice, Bureau of Justice Assistance (BJA), Drug Enforcement Administration, Office of Juvenile Justice and Delinquency Prevention, and the U.S. Department of State.

These are institutional conflicts of interest. As described above, few politicians want to be perceived as "the guy who voted against drug prevention." The failure of DARE stems primarily from these conflicts of interest. In lieu of any better options, the U.S. Federal Government continues to support DARE, simply because to not do so might appear as if they were doing nothing. At the present writing in 2012, DARE has been in effect for 29 years. Attempting to change the infrastructure of a longstanding program like this would be met with a great deal of resistance.

We chose the DARE example specifically because it is a long-running one, as it takes years to make the determination that somewhere something in the system of program evaluation has failed. If this chapter were being written in the early 1990s, people in the field of program evaluation might reasonably have been predicting that, based on the data available, this program would either be substantially modified or discontinued. Instead, close to two decades later and after being blacklisted by the government, it is still a very well-funded program. One may argue that the program evaluators themselves did their job; however, what is the use of program evaluation if policymakers do not follow recommendations based on data produced by evaluations? Both the scientific evidence and the anecdotal evidence suggest that programs with built-in evaluations result in better utilization of evaluation results and suggestions. This may partly result from better communication between the evaluator and the stakeholders, but if the evaluator is on a first-name basis (or maybe goes golfing) with the stakeholders, then what happens to his or her ability to remain objective? We will address these important issues in the sections that immediately follow by exploring the extant system of incentives shaping the practice of program evaluation.

What System of Incentives Governs the Practice of Program Evaluation?

Who Are Program Evaluators?

On October 19, 2010, we conducted a survey of the brief descriptions of qualifications and experience posted by program evaluators (344 postings in total) under the "Search Resumes" link on the American Evaluation Association (AEA) website (http://www.eval.org/find_an_evaluator/evaluator_search.asp). Program evaluators' skills were evenly split in their levels of quantitative (none: 2.0%; entry: 16.9%; intermediate: 37.5%; advanced: 34.9%; expert: 8.4%; strong: 0.3%) and qualitative evaluation experience (none: 1.5%; entry: 17.2%; intermediate: 41.6%; advanced: 27.3%; expert: 12.2%; strong: 0.3%). Program evaluators also expressed a range of years they were involved with evaluation (<1 year: 12.5%; 1–2 years: 20.1%; 3–5 years: 24.1%; 6–10 years: 19.8%; >10 years: 23.5%).

In general, program evaluators were highly educated, with the highest degree attained being either a master's (58.8%) or a doctorate of some sort (36.0%); fewer program evaluators had only an associate's (0.3%) or bachelor's degree (5.0%). The degree specializations were also widely distributed. Only 12.8% of the program evaluators with posted resumes described their education as including some sort of formal training specifically in evaluation. The most frequently mentioned degree specialization was in some field related to Psychology (25.9%), including social psychology and social work. The next most common specialization was in Education (15.4%), followed by Policy (14.0%), Non-Psychology Social Sciences (12.2%), Public Health or Medicine (11.6%), Business (11.3%), Mathematics or Statistics (5.8%), Communication (2.9%), Science (2.3%), Law or Criminal Justice (2.0%), Management Information Systems and other areas related to Technology (1.5%), Agriculture (1.5%), and Other, such as Music (1.7%).

For Whom Do Program Evaluators Work?

We sampled job advertisements for program evaluators using several Internet job-search sites: usajobs.gov, jobbing.com, and the human resources pages for government agencies such as the National Institutes of Health (NIH), the National Institute of Mental Health (NIMH), the Centers for Disease Control (CDC), and the GAO. Based on this sampling, we determined that there are four general types of program evaluation jobs.

Many agencies that deliver or implement social programs organize their own program evaluations, and these account for the first, second, and third types of program evaluation jobs available. The first type of program evaluation job is obtained in response to a call or request for proposals for a given evaluation. The second type of program evaluation job is obtained when the evaluand (the program under evaluation) is asked to hire an internal program evaluator to conduct a summative evaluation. The third general type of program evaluation job is obtained when a program evaluator is hired to conduct a formative evaluation; this category could include an employee of the evaluand who serves multiple roles in the organization, such as secretary and data collector.

We refer to the fourth type of program evaluation job as the Professional Government Watchdog. That type of evaluator works for an agency like the GAO.

The GAO is an independent agency that answers directly to Congress. The GAO has 3,300 workers (http://www.gao.gov/about/workforce/) working in roughly 13 groups: (1) Acquisition and Sourcing Management; (2) Applied Research and Methods; (3) Defense Capabilities and Management; (4) Education, Workforce, and Income Security; (5) Financial Management and Assurance; (6) Financial Markets and Community Investment; (7) Health Care; (8) Homeland Security and Justice; (9) Information Technology; (10) International Affairs and Trade; (11) Natural Resources and Environment; (12) Physical Infrastructure; and (13) Strategic Issues. Each of these groups is tasked with the oversight of a series of smaller agencies that deal with that group's content. For example, the Natural Resources and Environment group oversees the Department of Agriculture, Department of Energy, Department of the Interior, Environmental Protection Agency, Nuclear Regulatory Commission, Army Corps of Engineers, National Science Foundation, National Marine Fisheries Service, and the Patent and Trademark Office.

With the many billions of dollars being spent by the U.S. government on social programs, we sincerely doubt that 3,300 workers can possibly process all the program evaluations performed for the entire federal government. Recall that the estimated average federal expenditure for DARE alone is three-quarters of a billion dollars per year and that this program has been supported continuously since 1983. We believe that such colossal annual expenditures should include enough to pay for a few more of these "watchdogs," or at least justify the additional expense of doing so.

Who Pays the Piper?

The hiring of an internal program evaluator for the purpose of a summative evaluation is a recipe for an ineffective evaluation. There is a danger that the program evaluator can become what Scriven (1976, 1983) has called a program advocate. According to Scriven, these program evaluators are not necessarily malicious but, rather, could be biased as a result of the nature of the relationship between the program evaluator, the program funder, and the program management. The internal evaluator is generally employed by, and answers to, the management of the program and not directly to the program funder. In addition, because the program evaluator's job relies on the perceived "success" of the evaluation, there is an incentive to bias the results in favor of the program being evaluated. Scriven has argued that this structure may develop divided loyalties between the program being evaluated and the agency funding the program (Shadish, Cook, & Leviton, 1991). Scriven (1976, 1983) has also argued that summative evaluations are necessary for a society to optimize resource allocation, but he has recommended that we periodically re-assign program evaluators to different program locations to prevent individual evaluators from being co-opted into local structures. The risks of co-opting are explained in the next section.

Moral Hazards and Perverse Incentives

As a social institution, the field of program evaluation has professed very high ethical standards. For example, in 1994 The Joint Committee on Standards for Educational Evaluation produced the Second Edition of an entire 222-page volume on professional standards in program evaluation. Not all were what we would typically call ethical standards per se, but one of the four major categories of professional evaluation standards was called Propriety Standards and addressed what most people would refer to as ethical concerns. The other three categories were denoted Utility Standards, Feasibility Standards, and Accuracy Standards. Although it might be argued that a conscientious program evaluator is ethically obligated to carefully consider the utility, feasibility, and accuracy of the evaluation, it is easy to imagine how an occasional failure in any of these other areas might stem from factors other than an ethical lapse.

So why do we need any protracted consideration of moral hazards and perverse incentives in a discussion of program evaluation? We should make clear at the outset that we do not believe that most program evaluators are immoral or unethical. It is important to note that in most accepted uses of the term, the expression moral hazard makes no assumptions, positive or negative, about the relative moral character of the parties involved, although in some cases the term has unfortunately been used in that pejorative manner. The term moral hazard only refers (or should only refer) to the structure of perverse incentives that constitute the particular hazard in question (Dembe & Boden, 2000). We wish to explicitly avoid the implication that there are immoral or unethical individuals or agencies out there that intentionally corrupt the system for their own selfish benefit. Unethical actors hardly need moral hazards to corrupt them: They are presumably already immoral and can therefore be readily corrupted with little provocation. It is the normally moral or ethical people about whom we need to worry under the current system of incentives, because this system may actually penalize them for daring to do the right thing for society.

Moral hazards and perverse incentives refer to conditions under which the incentive structures in place tend to promote socially undesirable or harmful behavior (e.g., Pauly, 1974). Economic theory refers to the socially undesirable or harmful consequences of such behavior as market failures, which occur when there is an inefficient allocation of goods and services in a market. Arguably, continued public or private funding of an ineffective or harmful social program therefore constitutes a market failure, where the social program is conceptualized as the product that is being purchased. In economics, one of the well-documented causes of market failures is incomplete or incorrect information on which the participants in the market base their decisions. That is how these concepts may relate to the field of program evaluation.

One potential source of incomplete or incorrect information is referred to in economic theory as information asymmetry, which occurs in economic transactions where one party has access to either more or better information than the other party. Information asymmetry may thus lead to moral hazard, where one party to the transaction is insulated from the adverse consequences of a decision but has access to more information than another party (specifically, the party that is not insulated from the adverse consequences of the decision in question). Thus, moral hazards are produced when the party with more information has an incentive to act contrary to the interests of the party with less information. Moral hazard arises because one party does not risk the full consequences of its own decisions and presumably acquires the tendency to act less cautiously than it otherwise would, leaving another party to suffer the consequences of those possibly ill-advised decisions.

Furthermore, a principal-agent problem might also exist, where one party, called an agent, acts on behalf of another party, called the principal. Because the principal usually cannot completely monitor the agent, the situation often develops where the agent has access to more information than the principal does. Thus, if the interests of the agent and the principal are not perfectly consistent and mutually aligned with each other, the agent may have an incentive to behave in a manner that is contrary to the interests of the principal. This is the problem of perverse incentives, which are incentives that have unintended and undesirable effects ("unintended consequences"), defined as being against the interests of the party providing the incentives (in this case, the principal). A market failure becomes more than a mere mistake and instead becomes the inevitable product of a conflict of interests between the principal and the agent. A conflict of interests may lead the agent to manipulate the information that they provide to the principal. The information asymmetry thus generated will then lead to the kind of market failure referred to as adverse selection. Adverse selection is a market failure that occurs when information asymmetries between buyers and sellers lead to suboptimal purchasing decisions on the part of the buyer, such as buying worthless or detrimental goods or services (perhaps like DARE?).

When applying these economic principles to the field of program evaluation, it becomes evident that because program evaluators deal purely in information, and this information might be manipulated—either by them or by the agencies for which they work (or both, in implicit or explicit collusion)—we have a clear case of information asymmetry. This information asymmetry, under perverse incentives, may lead to a severe conflict of interests between the society or funding agency (the principal) and the program evaluator (the agent). This does not mean that the agent must perforce be corrupted, but the situation does create a moral hazard for the agent, regardless of any individual virtues. If the perverse incentives are acted on (meaning that they indeed elicit impropriety), then economic theory clearly predicts a market failure, and specifically adverse selection on the part of the principal.
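To make the structure of this incentive problem concrete, the following is a minimal numerical sketch in Python. All of the payoffs and probabilities are hypothetical values invented for illustration; they do not come from this chapter or from any real evaluation contract. The sketch simply shows that when contract renewal depends on favorable findings and biased reporting is unlikely to be detected, the expected payoff for shading a report can exceed the expected payoff for reporting honestly.

# Hypothetical illustration of the evaluator's (agent's) incentive problem
# under information asymmetry. All values are invented for illustration only.

def expected_payoffs(contract_value, p_renew_favorable, p_renew_unfavorable,
                     p_bias_detected, reputational_cost):
    """Return the evaluator's expected payoff for honest versus biased reporting."""
    honest = p_renew_unfavorable * contract_value
    biased = p_renew_favorable * contract_value - p_bias_detected * reputational_cost
    return honest, biased

honest, biased = expected_payoffs(
    contract_value=100_000,      # value of future evaluation contracts (hypothetical)
    p_renew_favorable=0.9,       # renewal likely if the findings look favorable
    p_renew_unfavorable=0.3,     # renewal unlikely if the findings are negative
    p_bias_detected=0.05,        # the funder (principal) rarely audits the evaluation
    reputational_cost=250_000,   # penalty if biased reporting is exposed
)
print(f"Expected payoff for honest reporting: ${honest:,.0f}")  # $30,000
print(f"Expected payoff for biased reporting: ${biased:,.0f}")  # $77,500

The particular numbers are beside the point; what matters is that, so long as detection is improbable and renewal hinges on favorable findings, the incentive structure itself penalizes honesty, which is exactly the sense in which the moral hazard operates independently of any individual evaluator's virtue.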

Getting back to the question of the professional standards actually advocated within program evaluation, how do these lofty ideals compare to the kind of behavior that might be expected under moral hazards and perverse incentives, presuming that program evaluators are subject to the same kinds of motivations, fallibilities, and imperfections as the rest of humanity? The Joint Committee on Standards for Educational Evaluation (1994) listed the following six scenarios as examples of conflicts of interest:

– Evaluators might benefit or lose financially, long term or short term, depending on what evaluation results they report, especially if the evaluators are connected financially to the program being evaluated or to one of its competitors.

– The evaluators' jobs and/or ability to get future evaluation contracts might be influenced by their reporting of either positive or negative findings.

– The evaluator's personal friendships or professional relationships with clients may influence the design, conduct, and results of an evaluation.

– The evaluator's agency might stand to gain or lose, especially if they trained the personnel or developed the materials involved in the program being evaluated.

– A stakeholder or client with a personal financial interest in a program may influence the evaluation process.

– A stakeholder or client with a personal professional interest in promoting the program being evaluated may influence the outcome of an evaluation by providing erroneous surveys or interview responses. (p. 115)

In response to these threats to the integrity of a program evaluation, the applicable Propriety Standard reads: "Conflicts of interest should be dealt with openly and honestly, so that it does not compromise the evaluation processes and results" (The Joint Committee on Standards for Educational Evaluation, 1994, p. 115). Seven specific guidelines are suggested for accomplishing this goal, but many of them appear to put the onus on the individual evaluators and their clients to avoid the problem. For example, the first three guidelines recommend that the evaluator and the client jointly identify in advance possible conflicts of interest, agree in writing to preventive procedures, and seek more balanced outside perspectives on the evaluation. These are all excellent suggestions and should work extremely well in all cases, except where either the evaluator, the client, or both are actually experiencing real-world conflicts of interests. Another interesting guideline is: "Make internal evaluators directly responsible to agency heads, thus limiting the influence other agency staff might have on the evaluators" (p. 116). We remain unconvinced that the lower-echelon and often underpaid agency staff have more of a vested interest in the outcome of an evaluation than the typically more highly paid agency head presumably managing the program being evaluated.

A similar situation exists with respect to the Propriety Standards for the Disclosure of Findings: "The formal parties to an evaluation should ensure that the full set of evaluation findings along with pertinent limitations are made accessible to the persons affected by the evaluation, and any others with expressed legal rights to receive the results" (The Joint Committee on Standards for Educational Evaluation, 1994, p. 109). This statement implicitly recognizes the problem of information asymmetry described above but leaves it up to the "formal parties to an evaluation" to correct the situation. In contrast, we maintain that these are precisely the interested parties that will be most subject to moral hazards and perverse incentives and are therefore the least motivated by the financial, professional, and possibly even political incentives currently in place to act in the broader interests of society as a whole in the untrammelled public dissemination of information.

Besides financial gain or professional advancement, Stufflebeam (2001) has recognized that political gains and motivations also play a role in the problem of information asymmetry:

The advance organizers for a politically controlled study include implicit or explicit threats faced by the client for a program evaluation and/or objectives for winning political contests. The client's purpose in commissioning such a study is to secure assistance in acquiring, maintaining, or increasing influence, power, and/or money. The questions addressed are those of interest to the client and special groups that share the client's interests and aims. Two main questions are of interest to the client: What is the truth, as best can be determined, surrounding a particular dispute or political situation? What information would be advantageous in a potential conflict situation? . . . Generally, the client wants information that is as technically sound as possible. However, he or she may also want to withhold findings that do not support his or her position. The strength of the approach is that it stresses the need for accurate information. However, because the client might release information selectively to create or sustain an erroneous picture of a program's merit and worth, might distort or misrepresent the findings, might violate a prior agreement to fully release findings, or might violate a "public's right to know" law, this type of study can degenerate into a pseudoevaluation. (pp. 10–11)

By way of solutions, Stufflebeam (2001) then offers:

While it would be unrealistic to recommend that administrators and other evaluation users not obtain and selectively employ information for political gain, evaluators should not lend their names and endorsements to evaluations presented by their clients that misrepresent the full set of relevant findings, that present falsified reports aimed at winning political contests, or that violate applicable laws and/or prior formal agreements on release of findings. (p. 10)

Like most of the guidelines offered by The Joint Committee on Standards for Educational Evaluation (1994) for the Disclosure of Findings, this leaves it to the private conscience of the individual administrator or evaluator not to abuse a position of privileged access to the information produced by program evaluation. It also necessarily relies on the individual administrator's or evaluator's self-reflective and self-critical awareness of any biases or selective memory for facts that he or she might bring to the evaluation process, so as to remain intellectually alert and on guard against them.

To be fair, some of the other suggestions offered in both of these sections of the Propriety Standards are more realistic, but it is left unclear exactly who is supposed to be specifically charged with either implementing or enforcing them. If it is again left up to either the evaluator or the client, acting either individually or in concert, it hardly addresses the problems that we have identified. We will take up some of these suggestions later in this chapter and make specific recommendations for systemic institutional reforms as opposed to individual exhortations to virtue.

As should be clear from our description of the nature of the problem, it is impossible under information asymmetry to identify specific program evaluations that have been subject to these moral hazards, precisely because they are pervasive and not directly evident (almost by definition) in any individual final product. There is so much evidence for these phenomena from other fields, such as experimental economics, that the problems we are describing should be considered more than unwarranted speculation. This is especially true in light of the fact that some of our best hypothetical examples came directly from the 1994 book cited above on professional evaluation standards, indicating that these problems have been widely recognized for some time. Further, we do not think that we are presenting a particularly pejorative view of program evaluation collectively or of program evaluators individually: we are instead describing how some of the regrettable limitations of human nature, common to all areas of human endeavor, are exacerbated by the way that program evaluations are generally handled at the institutional level. The difficult situation of the honest and well-intentioned program evaluator under the current system of incentives is just a special case of this general human condition, which subjects both individuals and agencies to a variety of moral hazards.

Cui Bono? The Problem of Multiple Stakeholders

In his historic speech Pro Roscio Amerino, delivered in 80 BC, Marcus Tullius Cicero is quoted as having said (Berry, 2000):

The famous Lucius Cassius, whom the Roman people used to regard as a very honest and wise judge, was in the habit of asking, time and again, "To whose benefit?"

That speech made the expression "cui bono?" famous for the two millennia that followed. In program evaluation, we have a technical definition for the generic answer to that question. Stakeholders are defined as the individuals or organizations that are either directly or indirectly affected by the program and its evaluation (Rossi & Freeman, 1993). Although a subtle difference here is that stakeholders can either gain or lose and do not always stand to benefit, the principle is the same. Much of what has been written about stakeholders in program evaluation is emphatic on the point that the paying client is neither the only, nor necessarily the most important, stakeholder involved. The evaluator is responsible for providing information to a multiplicity of different interest groups. This casts the program evaluator more in the role of a public servant than of a private contractor.

For example, The Joint Committee on Standards for Educational Evaluation (1994) addressed the problem of multiple stakeholders under several different and very interesting headings. First, under Utility Standards, they state that Stakeholder Identification is necessary so that "[p]ersons involved in or affected by the evaluation should be identified, so that their needs can be addressed" (p. 23). This standard presupposes the rather democratic and egalitarian assumption that the evaluation is being performed to address the needs of all affected and not just those of the paying client.

Second, in the Feasibility Standards, under Political Viability, they explain that "[t]he evaluation should be planned and conducted with anticipation of the different positions of various interest groups, so that their cooperation might be obtained, and so that possible attempts by any of these groups to curtail evaluation operations or to bias or misapply the results can be averted or counteracted" (p. 63). This standard instead presupposes that the diverse stakeholder interests have to be explicitly included within the evaluation process because of political expediency, at the very least as a practical matter of being able to effectively carry out the evaluation, given the possible interference by these same special interest groups. The motivation of the client in having to pay to have these interests represented, and of the evaluator in recommending that this be done, might therefore be one of pragmatic or "enlightened" self-interest rather than of purely altruistic and public-spirited goals.

Third, in the Propriety Standards, under Service Orientation, they state: "Evaluations should be designed to assist organizations to address and effectively serve the needs of the targeted participants" (p. 83). This standard presupposes that both the client, directly, and the evaluator, indirectly, are engaged in public service for the benefit of these multiple stakeholders. Whether this results from enlightened self-interest on either of their parts, with an eye to the possible undesirable consequences of leaving any stakeholder groups unsatisfied, or from disinterested and philanthropic communitarianism is left unclear.

Fourth, in the Propriety Standards, under Disclosure of Findings, as already quoted above, there is the statement that the full set of evaluation findings should be made accessible to all the persons affected by the evaluation and not just to the client. This standard again presupposes that the evaluation is intended and should be designed for the ultimate benefit of all persons affected. So all persons affected are evidently the answer to "cui bono?" As another ancient aphorism goes, "vox populi, vox dei" ("the voice of the people is the voice of god," first attested to have been used by Alcuin of York, who disagreed with the sentiment, in a letter to Charlemagne in 798 AD; Page, 1909, p. 61).

Regardless of the subtle differences in perspective among many of these standards, all of them present us with a very broad view of those for whom program evaluators should actually take themselves to be working. These standards again reflect very lofty ethical principles. However, we maintain that the proposed mechanisms and guidelines for achieving those goals remain short of adequate to ensure success.

What Do Program Evaluators Actually Do? Part I: Training and Competencies

Conceptual Foundations of Professional Training

Recent attempts have been made (King, Stevahn, Ghere, & Minnema, 2001; Stevahn, King, Ghere, & Minnema, 2005) at formalizing the competencies, and the subsequent training, necessary for program evaluators. These studies have relied on the opinions of practicing evaluators regarding the essential competencies of an effective evaluator. Participants were asked to rate the perceived importance of a variety of skills that an evaluator should presumably have. In this study (King et al., 2001), there was remarkably general agreement among evaluators on the competencies that an evaluator should possess. For example, high agreement was observed for characteristics such as the ability to collect, analyze, and interpret data as well as to report the results. In addition, there was almost universal agreement regarding the evaluator's ability to frame the evaluation question as well as understand the evaluation process. These areas of agreement suggest that the essential training that evaluators should have is in the areas of data-collection methods and data-analytic techniques. Surprisingly, however, there was considerable disagreement regarding the ability to do research-oriented activities, drawing a line between conducting evaluation and conducting research. Nonetheless, we believe that training in research-oriented activities is essential to program evaluation because the same skills, such as framing questions, collecting data, and analyzing and interpreting data, are gained through formal training in research methods. This evidently controversial position will be defended further below. Formal training standards have not yet been developed for the field of evaluation (Stevahn et al., 2005). However, it does appear that the training necessary to be an effective evaluator includes formal and rigorous training in both research methods and the statistical models that are most appropriate to those methods. Further below, we outline some of the research methodologies and statistical models that are most common within program evaluation.

In addition to purely data-analytic models, however, logic models provide program evaluators with an outline, or roadmap, for achieving the outcome goals of the program and illustrate the relationships between the resources available, the planned activities, and the outcome goals. The selection of outcome variables is important because these are directly relevant to the assessment of the success of the program. An outcome variable refers to the chosen changes that are desired by the program of interest. Outcome variables can be specified at the level of the individual, group, or population and can refer to a change in specific behaviors, practices, or ways of thinking. A generic outline for developing a logic model is presented by the United Way (1996). They define a logic model as including four components. The first component is called Inputs and refers to the resources available to the program, including financial funds, staff, volunteers, equipment, and any potential constraints, such as licensure. The second component is called Activities and refers to any planned services by the program, such as tutoring, counseling, or training. The third component is called Outputs and refers to the number of participants reached, activities performed, products or services delivered, and so forth. The fourth component is called Outcomes and refers to the benefits produced by those outputs for the participants or community that the program was directed to help. Each component of the logic model can be further divided into initial or intermediate goals, with a long- or short-term timeframe, and can include multiple items within each component.
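As an illustration of how these four components can be organized, the following is a minimal sketch in Python of a logic model represented as a plain data structure. The class, its field names, and the tutoring example are our own hypothetical constructions for illustration; they are not taken from the United Way (1996) materials or from any program discussed in this chapter.

# A minimal sketch of a four-component logic model as a plain data structure.
# The class, field names, and example values are hypothetical illustrations.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LogicModel:
    inputs: List[str] = field(default_factory=list)      # resources available to the program
    activities: List[str] = field(default_factory=list)  # planned services the program delivers
    outputs: List[str] = field(default_factory=list)     # direct products of those activities
    outcomes: List[str] = field(default_factory=list)    # benefits to participants or community

# Hypothetical tutoring program, for illustration only.
tutoring = LogicModel(
    inputs=["grant funding", "certified tutors", "classroom space"],
    activities=["weekly one-on-one tutoring sessions"],
    outputs=["120 students tutored per semester"],
    outcomes=["improved reading scores", "higher grade-promotion rates"],
)
print(tutoring.outcomes)  # ['improved reading scores', 'higher grade-promotion rates']

Representing a logic model this explicitly, even informally, makes it easier to check that every planned activity is linked to at least one measurable output and outcome before data collection begins.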

Table 17.1 displays an example of a logic model. The logic model shown is a tabular representation that we prepared of the VERB Logic Model developed for the Youth Media Campaign Longitudinal Survey, 2002–2004 (Centers for Disease Control, 2007). This logic model describes the sequence of events envisioned by the program for bringing about behavior change, presenting the expected relations between the campaign inputs, activities, impacts, and outcomes. A PDF of the original figure can be downloaded directly from the CDC website (http://www.cdc.gov/youthcampaign/research/PDF/LogicModel.pdf).

We believe that it is essential for program evaluators to be trained in the development and application of logic models because they can assist immensely in both the design and the analysis phases of the program evaluation. It is also extremely important that the collaborative development of logic models be used as a means of interacting and communicating with the program staff and stakeholders during this process, as an additional way of making sure that their diverse interests and concerns are addressed in the evaluation of the program.

Conceptual Foundations of Methodological and Statistical Training

In response to a previous assertion by Shadish, Cook, and Leviton (1991) that program evaluation was not merely "applied social science," Sechrest and Figueredo (1993) argued that the reason that this was so was:


Table 17.1. Example of a Logic Model: Youth Media Campaign Longitudinal Survey, 2002–2004

Input: Consultants; Staff; Research and evaluation; Contractors; Community infrastructure; Partnership

Activities: Advertising; Promotions; Web; Public relations; National and community outreach

Short-term outcomes: Tween and parent awareness of the campaign brand and its messages; "buzz" about the campaign and brand messages

Mid-term outcomes: Changes in subjective norms, beliefs, self-efficacy, and perceived behavioral control

Long-term outcomes: Tweens engaging in and maintaining physical activity, leading to reduced chronic disease and possibly reduced unhealthy risky behaviors

Shadish et al. (1991) appeal to the peculiar problems manifest in program evaluation. However, these various problems arise not merely in program evaluation but whenever one tries to apply social science. The problems, then, arise not from the perverse peculiarities of program evaluation but from the manifest failure of much of mainstream social science and the identifiable reasons for that failure. (pp. 646–647)

These "identifiable reasons" consisted primarily of various common methodological practices that led to the "chronically inadequate external validity of the results of the dominant experimental research paradigm" (p. 647) that had been inadvisedly adopted by mainstream social science.

According to Sechrest and Figueredo (1993), the limitations of these sterile methodological practices were very quickly recognized by program evaluators, who almost immediately began creating the quasi-experimental methods that were more suitable for real-world research and quickly superseded the older laboratory-based methods, at least within program evaluation:

Arguably, for quasi-experimentation, the more powerful and sophisticated intellectual engines of causal inference are superior, by now, to those of the experimental tradition. (p. 647)

The proposed distinction between program evaluation and applied social science was therefore more a matter of practice than a matter of principle. Program evaluation had adopted methodological practices that were appropriate to its content domain, which mainstream social science had not. The strong implication was that the quasi-experimental methodologies developed within program evaluation would very likely be more suitable for applied social science in general than the dominant experimental paradigm.

Similarly, we extend this line of reasoning to argue that program evaluators do not employ a completely unique set of statistical methods either. However, because program evaluators disproportionately employ a certain subset of research methods, which are now in more general use throughout applied psychosocial research, it necessarily follows that they must therefore disproportionately employ a certain subset of statistical techniques that are appropriate to those particular designs. In the sections below, we therefore concentrate on the statistical techniques that are in most common use in program evaluation, although these data-analytic methods are not unique to program evaluation per se.

What Do Program Evaluators Actually Do? Part II: Quantitative Methods

Foundations of Quantitative Methods: Methodological Rigor

Even its many critics acknowledge that the hallmark and main strength of the so-called quantitative approach to program evaluation resides primarily in its methodological rigor, whether it is applied in shoring up the process of measurement or in buttressing the strength of causal inference. In the following sections, we review a sampling of the methods used in quantitative program evaluation to achieve the sought-after methodological rigor, which is the "Holy Grail" of the quantitative enterprise.

Evaluation-Centered Validity

Within program evaluation, and the social sciences in general, several types of validity have been identified. Cook and Campbell (1979) formally distinguished between four types of validity more specific to program evaluation: (1) internal validity, (2) external validity, (3) statistical conclusion validity, and (4) construct validity. Internal validity refers to establishing the causal relationship between two variables, such as treatment and outcome; external validity refers to supporting the generalization of results beyond a specific study; statistical conclusion validity refers to applying statistical techniques appropriately to a given problem; and construct validity falls within a broader class of validity issues in measurement (e.g., face validity, criterion validity, concurrent validity, etc.) but specifically consists of assessing and understanding program components and outcomes accurately. In the context of a discussion of methods in program evaluation, two forms of validity take primacy: internal and external validity. Each validity type is treated in more detail in the following sections.

internal validity

The utility of a given method in program evaluation is generally measured in terms of how internally valid it is believed to be. That is, the effectiveness of a method in its ability to determine the causal relationship between the treatment and outcome is typically considered in the context of threats to internal validity. There are several different types of threat to internal validity, each of which applies to greater and lesser degrees depending on the given method of evaluation. Here we describe a few possible threats to internal validity.

selection bias

Selection bias is the greatest threat to internal validity for quasi-experimental designs. Selection bias is generally a problem when comparing experimental and control groups that have not been created by the random assignment of participants. In such quasi-experiments, group membership (e.g., treatment vs. control) may be determined by some unknown or little-known variable that may contribute to systematic differences between the groups and may thus become confounded with the treatment. History is another internal validity threat. History refers to any events, not manipulated by the researcher, that occur between the treatment and the posttreatment outcome measurement and that might even partially account for that posttreatment outcome. Any events that coincide with the treatment, whether systematically related to the treatment or not, that could produce the treatment effects on the outcome are considered history threats. For example, practice effects in test taking could account for differences between pretest and posttreatment scores if the same type of measure is given at each measurement occasion. Maturation is the tendency for changes in an outcome to spontaneously occur over time. For example, consider a program aimed at increasing formal operations in adolescents. Because formal operations tend to increase over time during adolescence, the results of any program designed to promote formal operations during this time period would be confounded with the natural maturational tendency for formal operations to improve with age. Finally, regression to the mean may pose another threat to internal validity. These regression artifacts generally occur when participants are selected into treatment groups or programs because they are unusually high or low on certain characteristics. When individuals deviate substantially from the mean, this might in part be attributable to errors of measurement. In such cases, it might be expected that, over time, their observed scores will naturally regress back toward the mean, which is more representative of their true scores. In research designs where individuals are selected in this way, programmatic effects are difficult to distinguish from those of regression toward the mean. Several other forms of threats to internal validity are also possible (for examples, see Shadish, Cook, & Campbell, 2002; Mark & Cook, 1984; Smith, 2010).
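
As a concrete illustration of the regression artifact just described, the following simulation (our own illustrative sketch, not drawn from the sources cited above) selects individuals with extreme observed pretest scores and shows that their mean moves back toward the population mean at a second measurement even though no treatment was applied.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 10_000
true_score = rng.normal(100, 10, n)          # stable "true" ability
pretest = true_score + rng.normal(0, 5, n)   # observed score = true score + measurement error
posttest = true_score + rng.normal(0, 5, n)  # second measurement, no treatment given

# Select the participants with the lowest 10% of observed pretest scores,
# as a needs-based program might do.
selected = pretest < np.quantile(pretest, 0.10)

print(f"Selected group pretest mean:  {pretest[selected].mean():.1f}")
print(f"Selected group posttest mean: {posttest[selected].mean():.1f}")
# The posttest mean is closer to 100 purely because of measurement error,
# a shift that could be mistaken for a program effect.
```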

external validity

External validity refers to the generalizability of findings, or the application of results beyond the given sample in a given setting. The best way to defend against threats to external validity is to conduct randomized experiments on representative samples, where participants are first randomly drawn from the population and then randomly assigned to the treatment and control groups. Because no prior characteristics are systematically shared by the members of either the control or the treatment group that also systematically differ between those groups, it can be extrapolated that the effect of a program is applicable to others beyond the specific sample assessed. This is not to say that the results of a randomized experiment will be applicable to all populations. For example, if a program is specific to adolescence and was only tested on adolescents, then the impact of the treatment may be specific to adolescents. By contrast, evaluations that involve groups that were nonrandomly assigned face the possibility that the effect of the treatment is specific to the population being sampled and thus becomes ungeneralizable to other populations. For example, if a program is designed to reduce the recidivism rates of violent criminals, but the participants in a particular program are those who committed a specific violent crime, then the estimated impact of that program may be specific to only those individuals who committed that specific crime and not generalizable to other violent offenders.

Randomized Experiments

Randomized experiments are widely believed to offer evaluators the most effective way of assessing the causal influence of a given treatment or program (St. Pierre, 2004). The simplest type of randomized experiment is one in which individuals are randomly assigned to one of at least two groups—typically a treatment and a control group. By virtue of random assignment, the groups are approximately equivalent in their characteristics, and thus threats to internal validity as a result of selection bias are, by definition, ruled out. Thus, the only systematic difference between the groups is implementation of the treatment (or program participation), so that any systematic differences between groups can be safely attributed to receiving or not receiving the treatment. It is the goal of the evaluator to assess this degree of difference to determine the effectiveness of the treatment or program (Heckman & Smith, 1995; Boruch, 1997).
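
A minimal sketch of the two-group randomized experiment just described, using simulated data and our own variable names: participants are randomly assigned, and the treatment effect is estimated as the difference in group means, tested here with an independent-samples t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n = 200
treatment = rng.permutation(np.repeat([0, 1], n // 2))    # random assignment to two equal groups
outcome = 50 + 5 * treatment + rng.normal(0, 10, n)       # hypothetical outcome with a true effect of 5

effect = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
t_stat, p_value = stats.ttest_ind(outcome[treatment == 1], outcome[treatment == 0])

print(f"Estimated treatment effect: {effect:.2f} (t = {t_stat:.2f}, p = {p_value:.4f})")
```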

Although randomized experiments might provide the best method for establishing the causal influence of a treatment or program, they are not without their problems. For example, it may simply be undesirable or unfeasible to randomly assign participants to different groups. Randomized experiments may be undesirable if results are needed quickly. In some cases, implementation of the treatment may take several months or even years to complete, precluding timely assessment of the treatment's effectiveness. In addition, it is not feasible to randomly assign participant characteristics. That is, questions involving race or sex, for example, cannot be randomly assigned, and, therefore, use of a randomized experiment to answer questions that center on these characteristics is impossible. Although experimental methods are useful for eliminating such confounds by distributing participant characteristics evenly across groups, when research questions center on these prior participant characteristics, experimental methods are not feasible for this kind of problem. In addition, there are ethical considerations that must be taken into account before randomly assigning individuals to groups. For example, it would be unethical to assign participants to a cigarette-smoking condition or another condition that may cause harm. Furthermore, it is ethically questionable to withhold effective treatment from some individuals and administer treatment to others, such as in cancer treatment or education programs (see Cook, Cook, & Mark, 1977; Shadish et al., 2002). Randomized experiments may also suffer other forms of selection bias that are insensitive to randomization. For example, selective attrition from treatments may create nonequivalent groups if some individuals are systematically more likely to drop out than others (Smith, 2010). Randomized experiments may also suffer from a number of other drawbacks. For a more technical discussion of the relationship between randomized experiments and causal inference, see Cook, Scriven, Coryn, and Evergreen (2010).

Quasi-Experiments

Quasi-experiments are identical to randomized experiments with the exception of one element: randomization. In quasi-experimental designs, participants are not randomly assigned to different groups, and thus the groups are considered nonequivalent. However, during data analysis, a program evaluator may attempt to construct equivalent groups through matching. Matching involves creating control and treatment groups that are similar in their characteristics, such as age, race, and sex. Attempts to create equivalent groups through matching may result in undermatching, where groups may be similar in one characteristic (such as race) but nonequivalent in others (such as socioeconomic status). In such situations, a program evaluator may make use of statistical techniques that control for undermatching (Smith, 2010) or decide to focus on matching only those characteristics that could moderate the effects of the treatment.
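
The sketch below illustrates one simple form of the matching idea described above: for each treated case, the nearest untreated case on a single observed covariate (here, age) is selected as its comparison, with replacement. This is our own illustrative nearest-neighbor example under those assumptions, not a full matching or propensity-score procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical quasi-experimental data in which treated participants tend to be older.
df = pd.DataFrame({
    "treated": np.r_[np.ones(50, dtype=int), np.zeros(200, dtype=int)],
    "age": np.r_[rng.normal(40, 5, 50), rng.normal(30, 8, 200)],
})

treated = df[df["treated"] == 1]
controls = df[df["treated"] == 0]

# For each treated case, pick the control case closest in age (matching with replacement).
matched_idx = [(controls["age"] - age).abs().idxmin() for age in treated["age"]]
matched_controls = controls.loc[matched_idx]

print("Mean age, treated:          ", round(treated["age"].mean(), 1))
print("Mean age, all controls:     ", round(controls["age"].mean(), 1))
print("Mean age, matched controls: ", round(matched_controls["age"].mean(), 1))
```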

Much debate surrounds the validity of using randomized experiments versus quasi-experiments in establishing causality (see, for example, Cook et al., 2010). Our goal in this section is not to evaluate the tenability of asserting causality within quasi-experimental designs (interested readers are referred to Cook & Campbell, 1979) but, rather, to describe some of the more common methods that fall under the rubric of quasi-experiments and how they relate to program evaluation.

one-group, posttest-only design

Also called the one-shot case study (Campbell, 1957), the one-group, posttest-only design provides the evaluator with information only about treatment participants and only after the treatment has been administered. It contains neither a pretest nor a control group, and thus conclusions about program impact are generally ambiguous. This design can be diagrammed:

NR X O1

The NR refers to the nonrandom participation in this group. The X refers to the treatment, which, from left to right, indicates that it temporally precedes the outcome (O), and the subscript 1 indicates that the outcome was measured at time-point 1. Although simple in its formulation, this design has a number of drawbacks that may make it undesirable. For example, this design is vulnerable to several threats to internal validity, particularly history threats (Kirk, 2009; Shadish et al., 2002). Because there is no other group with which to make comparisons, it is unknown if the treatment is directly associated with the outcome or if other events that coincide with treatment implementation confound treatment effects.

Despite these limitations, there is one circumstance in which this design might be appropriate. As discussed by Kirk (2009), the one-group, posttest-only design may be useful when sufficient knowledge about the expected value of the dependent variable in the absence of the treatment is available. For example, consider high school students who have taken a course in calculus and recently completed an exam. To assess the impact of the calculus course, one would have to determine the average expected grade on the exam had the students not taken the course and compare it to the scores they actually received (Shadish et al., 2002). In this situation, the expected exam grade for students had they not taken the course would likely be very low compared to the students' actual grades. Thus, this technique is only likely to be useful when the size of the effect (taking the class) is relatively large and distinct from alternative possibilities (such as a history threat).
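
In this special case, the analysis reduces to comparing the observed posttest scores against the value assumed for the no-treatment condition. The sketch below, with made-up scores and a made-up expected value, does this with a one-sample t-test.

```python
import numpy as np
from scipy import stats

# Hypothetical exam scores (percent correct) after the calculus course.
observed_scores = np.array([78, 85, 91, 69, 88, 74, 95, 82, 77, 90])

# Assumed expected score had the students not taken the course,
# justified here only by prior knowledge, not by a concurrent control group.
expected_without_course = 25.0

t_stat, p_value = stats.ttest_1samp(observed_scores, expected_without_course)
print(f"Mean observed score: {observed_scores.mean():.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4g} against an expected value of {expected_without_course}")
```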

posttest-only, nonequivalent groups design

This design is similar to the one-group, posttest-only design in that only posttest measures are available; however, in this design, a comparison group is available. Unlike a randomized experiment with participants randomly assigned to a treatment and a control group, in this design participant group membership is not randomized. This design can be diagrammed:

NR   X   O1
- - - - - - - - - -
NR       O1

Interpretation of this diagram is similar to that of the previous one; however, in this diagram, the dashed line indicates that the participants in each of these groups are different individuals. It is important to note that the individuals in these two groups represent nonequivalent groups and may be systematically different from each other in some uncontrolled extraneous characteristics. This design is a significant improvement over the one-group, posttest-only design in that a comparison group that has not experienced the treatment can be compared on the dependent variable of interest. The principal drawback, however, is that this method may suffer from selection bias if the control and treatment groups differ from each other in a systematic way that is not related to the treatment (Mark & Cook, 1984). For example, participants selected into a treatment based on their need for the treatment may differ on characteristics other than treatment need from those not selected into the treatment.

Evaluators may implement this method when pretest information is not available, such as when a treatment starts before the evaluator has been consulted. In addition, an evaluator may choose to use this method if pretest measurements have the potential to influence posttest outcomes (Willson & Putnam, 1982). For example, consider a program designed to increase spelling ability in middle childhood. At pretest and posttest, children are given a list of words to spell. Program effectiveness would then be assessed by estimating the improvement in spelling, comparing spelling performance before and after the program. However, if the same set of words were given to the children at posttest that were administered in the pretest, then the effect of the program might be confounded with a practice effect.

Although it is possible that pretest measures may influence posttest outcomes, such situations are likely to be relatively rare. In addition, the costs of not including a pretest may significantly outweigh the potential benefits (see Shadish et al., 2002).

one-group, pretest–posttest design

In the pretest–posttest design, participants are assessed before the treatment and assessed again after the treatment has been administered. However, there is no control group comparison. The form of this design is:

NR   O1   X   O2

This design provides a baseline with which to compare the same participants before and after treatment. Change in the outcome between pretest and posttest is commonly attributed to the treatment. This attribution, however, may be misinformed, as the design is vulnerable to threats to internal validity. For example, history threats may occur if uncontrolled extraneous events coincide with treatment implementation. In addition, maturation threats may also occur if the outcome of interest is related with time. Finally, if the outcome measure was unusually high or low at pretest, then the change detected by the posttest may not be the result of the treatment but, rather, of regression toward the mean (Mark & Cook, 1984).

Program evaluators might use this method when it is not feasible to administer a program only to one set of individuals and not to another. For example, this method would be useful if a program has been administered to all students in a given school, where there cannot be a comparative control group.

pretest and posttest, nonequivalent groups design

The pretest and posttest, nonequivalent groups design is probably the design most commonly used by program evaluators (Shadish et al., 2002). This design combines the previous two designs by including not only pretest and posttest measures but also a control group at pretest and posttest. This design can be diagrammed:

NR   O1   X   O2
- - - - - - - - - -
NR   O1        O2

The advantage of this design is that threats to internal validity can more easily be ruled out (Mark & Cook, 1984). When threats to internal validity are plausible, they can be more directly assessed in this design. Further, in the context of this design, statistical techniques are available to help account for potential biases (Kenny, 1975). Indeed, several authors recommend that the data be analyzed in a variety of ways to determine the proper effect size of the treatment and to evaluate the potential for selection bias that might be introduced as a result of nonrandom groups (see Cook & Campbell, 1979; Reichardt, 1979; Bryk, 1980).
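
One common way to adjust for pretest differences in this design is to regress the posttest on the pretest and a group indicator, in the spirit of the covariance-adjustment approaches cited above. The sketch below, with simulated data and our own variable names, fits such a model with ordinary least squares; it is one of several defensible analyses rather than the single recommended analysis for this design.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)

n = 300
group = rng.integers(0, 2, n)                                      # 1 = treatment, 0 = control (nonrandomly formed)
pretest = 50 + 5 * group + rng.normal(0, 10, n)                    # groups already differ at pretest
posttest = 10 + 0.8 * pretest + 4 * group + rng.normal(0, 8, n)    # true program effect of 4

df = pd.DataFrame({"group": group, "pretest": pretest, "posttest": posttest})

# Covariance-adjusted estimate of the treatment effect.
fit = smf.ols("posttest ~ pretest + group", data=df).fit()
print(fit.params["group"], fit.pvalues["group"])
```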

In summary, the pretest and posttest, nonequivalent groups design, although not without its flaws, is a relatively effective technique for assessing treatment impact. An inherent strength of this design is that, with the exception of selection bias as a result of nonrandom groups, no single general threat to internal validity can be assigned. Rather, threats to internal validity are likely to be specific to the given problem under evaluation.

interrupted time series design

The interrupted time series design is essentially an extension of the pretest and posttest, nonequivalent groups design, although it is not strictly necessary to include a control group. Ideally, this design consists of repeated measures of some outcome prior to treatment, implementation of the treatment, and then repeated measures of the outcome after treatment. The general form of this design can be diagrammed:

NR   O1 O2 O3 O4 O5   X   O6 O7 O8 O9 O10
- - - - - - - - - -
NR   O1 O2 O3 O4 O5        O6 O7 O8 O9 O10

In this diagram, the first line of Os refers to the treatment group, which can be identified by the X among the Os. The second line of Os refers to the control condition, as indicated by the lack of an X. The dashed line between the two conditions indicates that the participants are different between the two groups, and the NR indicates that individuals are nonrandomly distributed between the groups.

The interrupted time series design is considered by many to be the most powerful quasi-experimental design for examining the longitudinal effects of treatments (Wagner et al., 2002). Several pieces of information can be gained about the impact of a treatment. The first is a change in the level of the outcome (as indicated by a change in the intercept of the regression line) after the treatment. This simply means that change in mean levels of the outcome as a result of the treatment can be assessed. The second is change in the temporal trajectory of the outcome (as indicated by a change in the slope of the regression line). Because of the longitudinal nature of the data, the temporal trajectories of the outcome can be assessed both pre- and post-treatment, and any change in the trajectories can be estimated. Other effects can be assessed as well, such as any changes in the variances of the outcomes after treatment, whether the effect of the treatment is continuous or discontinuous, and whether the effect of the treatment is immediate or delayed (see Shadish et al., 2002). Thus, several different aspects of treatment implementation can be assessed with this design.
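
The level change and slope change just described can be estimated with a segmented (interrupted) regression of the outcome on time, a post-treatment indicator, and the time elapsed since the treatment. The sketch below simulates a single treated series and fits such a model with ordinary least squares; the variable names and simulated values are our own illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# 20 monthly observations; the treatment is introduced after month 10.
time = np.arange(1, 21)
post = (time > 10).astype(int)                 # 1 after the treatment starts
time_since = np.clip(time - 10, 0, None)       # months elapsed since the treatment

# Simulated outcome: an upward trend, then a drop in level and a flatter slope.
outcome = 20 + 1.0 * time - 5 * post - 0.7 * time_since + rng.normal(0, 1.5, len(time))
df = pd.DataFrame({"time": time, "post": post, "time_since": time_since, "outcome": outcome})

fit = smf.ols("outcome ~ time + post + time_since", data=df).fit()
print(fit.params)
# `post` estimates the immediate change in level; `time_since` estimates the
# change in slope after the treatment, relative to the pre-treatment trend.
```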

In addition to its utility, the interrupted time series design (with a control group) is robust against many forms of internal validity threat. For example, with a control group added to the model, history is no longer a threat because any external event that might have co-occurred with the treatment should have affected both groups, presumably equally. In addition, systematic pretest differences between the treatment and control groups can be more accurately assessed because there are several pretest measures. Overall, the interrupted time series design with a nonequivalent control group is a very powerful design (Mark & Cook, 1984).

One barrier to this design is that several measurements are needed both before and after treatment. This may be impossible if the evaluator was not consulted until after the treatment was implemented. In addition, some evaluators may have to rely on the availability of existing data that they did not collect or on historical records. These limitations may place constraints on the questions that can be asked by the evaluator.

regression discontinuity design

First introduced to the evaluation community by Thistlethwaite and Campbell (1960), the regression-discontinuity design (RDD) provides a powerful and unbiased method for estimating treatment effects that rivals that of a randomized experiment (see Huitema, 1980). The RDD contains both a treatment and a control group. Unlike other quasi-experimental designs, however, the determination of group membership is perfectly known. That is, in the RDD, participants are assigned to either a treatment or a control group based on a particular cutoff (see also Trochim, 1984, for a discussion of so-called fuzzy regression discontinuity designs). The RDD takes the following form:

OA   C   X   O2
OA   C        O2

OA refers to the pretest measure for which the criterion for group assignment is determined, C refers to the cutoff score for group membership, X refers to the treatment, and O2 refers to the measured outcome. As an example, consider the case where elementary school students are assigned to a program aimed at increasing reading comprehension. Assignment to the program versus no program is determined by a particular cutoff score on a pretest measure of reading comprehension. In this case, group membership (control vs. treatment) is not randomly assigned; however, the principle or decision rule for assignment is perfectly known (e.g., the cutoff score). By directly modeling the known determinant of group membership, the evaluator is able to completely account for the selection process that determined group membership.

The primary threat to the internal validity of the RDD is history, although the tenability of this factor as a threat is often questionable. More importantly, the analyses of RDDs are by nature complex, and correctly identifying the functional forms of the regression parameters (linear, quadratic, etc.) can have a considerable impact on determining the effectiveness of a program (see Reichardt, 2009, for a review).
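
In its simplest (sharp) form, the RDD can be analyzed by regressing the outcome on the assignment variable centered at the cutoff plus a treatment indicator; the treatment coefficient then estimates the jump at the cutoff. The sketch below uses simulated data and assumes a purely linear functional form, which, as noted above, is exactly the kind of assumption that would need to be checked (e.g., by adding interaction and polynomial terms).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(13)

n = 500
pretest = rng.uniform(0, 100, n)              # assignment variable (reading pretest)
cutoff = 40
treated = (pretest < cutoff).astype(int)      # students below the cutoff receive the program

# Simulated outcome: depends on the pretest plus a true program effect of 8 at the cutoff.
outcome = 20 + 0.5 * pretest + 8 * treated + rng.normal(0, 5, n)

df = pd.DataFrame({
    "centered": pretest - cutoff,             # center the assignment variable at the cutoff
    "treated": treated,
    "outcome": outcome,
})

fit = smf.ols("outcome ~ centered + treated", data=df).fit()
print(fit.params["treated"])                  # estimated discontinuity (program effect)
# A fuller analysis would also test centered:treated and higher-order terms
# to guard against misspecifying the functional form.
```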

Measurement and Measurement Issues in Program Evaluation

In the context of program evaluation, three types of measures should be considered: (1) input measures, (2) process measures, and (3) outcome measures (Hollister & Hill, 1995). Input measures consist of more general measures about the program and the participants in it, such as the number of individuals in a given program or the ethnic composition of program participants. Process measures center on the delivery of the program, such as a measure of teaching effectiveness in a program designed to improve reading comprehension in schoolchildren. Outcome measures are those measures that focus on the ultimate result of the program, such as a measure of reading comprehension at the conclusion of the program. Regardless of the type of measurement being applied, it is imperative that program evaluators utilize measures that are consistent with the goals of the evaluation. For example, in an evaluation of the performance of health-care systems around the world, the World Health Organization (WHO) published a report (World Health Organization, 2000) that estimated how well the health-care systems of different countries were functioning. As a part of this process, the authors of the report sought to make recommendations based on empirical evidence rather than WHO ideology. However, their measure of overall health system functioning was based, in part, on an Internet-based questionnaire of 1000 respondents, half of whom were WHO employees. In this case, the measure used to assess health system functioning was inconsistent with the goals of the evaluation, and this problem did not go unnoticed (see Williams, 2001). Evaluators should consider carefully what the goals of a given program are and choose measures that are appropriate to those goals.


An important part of choosing measures appropriate to the goals of a program is choosing measures that are psychometrically sound. At a minimum, measures should be chosen that have been demonstrated in past research to have adequate internal consistency. In addition, if the evaluator intends to administer a test multiple times, then the chosen measure should have good test–retest reliability. Similarly, if the evaluator chooses a measure that is scored by human raters, then the measure should show good inter-rater reliability. In addition to these basic characteristics of reliability, measures should also have good validity, in that they actually measure the constructs that they are intended to measure. Published measures are more likely to already possess these qualities and thus may be less problematic when choosing among possible measures.
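
As one illustration of checking internal consistency, the sketch below computes Cronbach's alpha for a small set of item responses using the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total score). The item data are invented for the example.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) array of item scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses: 6 respondents answering a 4-item scale (1-5 ratings).
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
])

print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```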

It may be the case, however, that either an evaluator is unable to locate an appropriate measure or no appropriate measure currently exists. In this case, evaluators may consider developing their own scales of measurement as part of the process of program evaluation. Smith (2010) has provided a useful tutorial on constructing a survey-based scale for program evaluation. Rather than restate these points, however, we discuss some of the issues that an evaluator may face when constructing new measures in the process of program evaluation. Probably the most important point is that there is no way, a priori, to know that the measure being constructed is valid, in that it measures what it is intended to measure. Presumably the measure will be high in face validity, but this does not necessarily translate into construct validity. Along these lines, if an evaluator intends to create their own measure of a given construct in the context of an evaluation, then the measure should be properly vetted regarding its utility in assessing program components prior to making any very strong conclusions.

One way to validate a new measure is to add additional measures in the program evaluation to show convergent and divergent validity. In addition, wherever possible, it would be ideal if pilot data on the constructed measure could be obtained from some of the program participants to help evaluate the psychometric properties of the measure prior to its administration to the larger sample that will constitute the formal program evaluation.

Another problem that program evaluators may face is that of "re-inventing the wheel" when creating a measure from scratch. When constructing a measure, program evaluators are advised to research the construct that they intend to measure so that useful test items can be developed. One way to avoid re-inventing the wheel may be either to borrow items from other validated scales or to modify an existing scale to suit the needs of the program and evaluation, while properly citing the original sources. Collaboration with academic institutions can help facilitate this process by providing resources to which an evaluator may not already have access.

Statistical Techniques in Program Evaluation

Program evaluators may employ a wide variety of techniques to analyze the results of their evaluation. These techniques range from "simple" correlations, t-tests, and analyses of variance (ANOVAs) to more intensive techniques such as multilevel modeling, structural equation modeling, and latent growth curve modeling. It is often the case that the research method chosen for the evaluation dictates the statistical technique used to analyze the resultant data. For experimental designs and quasi-experimental designs, various forms of ANOVA, multiple regression, and non-parametric statistics may suffice. However, for longitudinal designs, there may be more options for the program evaluator in terms of how to analyze the data. In this section, we discuss some of the analytical techniques that might be employed when analyzing longitudinal data and, more specifically, the kind of longitudinal data derived from an interrupted time series design. For example, we discuss the relative advantages and disadvantages of repeated measures analysis of variance (RM-ANOVA), multilevel modeling, and latent growth curve modeling. For a more systematic review of some of the more basic statistical techniques in program evaluation, readers are referred to Newcomer and Wirtz (2004).

To discuss the properties of each of these techniques, consider a hypothetical longitudinal study on alcohol use among adolescents. Data on alcohol consumption were collected starting when the adolescents were in sixth grade and continued through the twelfth grade. As a part of the larger longitudinal study, a group of adolescents were enrolled in a program aimed at reducing alcohol consumption during adolescence. The task of the evaluator is to determine the effectiveness of the program in reducing alcohol use across adolescence.

One way to analyze such data would be to use RM-ANOVA. In this analysis, the evaluator would have several measures of alcohol consumption across time and another binary variable that coded whether a particular adolescent received the program. When modeling these data, the repeated measures of alcohol consumption would be treated as a repeated (within-subjects) factor, whereas the binary program variable would be treated as a fixed (between-subjects) factor. The results of this analysis would indicate the functional form of the alcohol consumption trend over time as well as whether the trend differed between the two groups (program vs. no program). The advantage of the repeated measures technique is that the full form of the alcohol consumption trajectory can be modeled, and increases and decreases in alcohol consumption can easily be graphically displayed (e.g., in SPSS). In addition, the shape of the trajectory (e.g., linear, quadratic, cubic, etc.) of alcohol consumption can be tested empirically through significance testing. The primary disadvantage of RM-ANOVA in this case is that the test of the difference between the two groups is limited to the shape of the overall trajectory and cannot be extended to specific periods of time. For example, prior to the treatment, we would expect that the two groups should not differ in their alcohol consumption trajectories; only after the treatment do we expect differences. Rather than specifically testing the difference in trajectories following the treatment, a test is being conducted about the overall shape of the curves. In addition, this technique cannot test the assumption that the two groups are equal in their alcohol consumption trajectories prior to the treatment, a necessary precondition for making inferences about the effectiveness of the program. To test these assumptions, we need to move to multilevel modeling (MLM).
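
As a sketch of the RM-ANOVA approach just described, the example below simulates long-format data for the hypothetical alcohol study and runs a mixed (within-between) ANOVA with the pingouin library. The simulated data, variable names, and choice of library are our own illustrative assumptions, not part of the chapter's sources.

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(1)

# Simulated long-format data: 100 adolescents, grades 6-12, about half in the program.
n, grades = 100, np.arange(6, 13)
ids = np.repeat(np.arange(n), len(grades))
grade = np.tile(grades, n)
program = np.repeat(rng.integers(0, 2, n), len(grades))
alcohol = (0.5 * (grade - 6)                               # rising consumption with age
           - 0.3 * np.clip(grade - 8, 0, None) * program   # program bends the trend after grade 8
           + rng.normal(0, 1.0, len(ids)))

df = pd.DataFrame({"id": ids, "grade": grade,
                   "program": np.where(program == 1, "program", "none"),
                   "alcohol": alcohol})

# Grade is the repeated (within-subjects) factor; program is the between-subjects factor.
aov = pg.mixed_anova(data=df, dv="alcohol", within="grade",
                     subject="id", between="program")
print(aov.round(3))
```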

Multilevel modeling is a statistical technique designed for use with data that violate the assumption of independence (see Kenny, Kashy, & Cook, 2006). The assumption of independence states that, after controlling for an independent variable, the residual variance between variables should be independent. Longitudinal data (as well as dyadic data) tend to violate this assumption. The major advantage of MLM is that the structure of these residual covariances can be directly specified (see Singer, 1998, for examples). In addition, and more specifically in reference to the current program evaluation example, the growth function of longitudinal data can be more directly specified in a number of flexible ways (see, for example, Singer & Willett, 2003, p. 138). One interesting technique that has seen little utilization in the evaluation field is what has been called a piecewise growth model (see Seltzer, Frank, & Bryk, 1994, for an example). In this model, rather than specifying a single linear or curvilinear slope, two slopes with a single intercept are modeled. The initial slope models change up to a specific point, whereas the subsequent slope models change after that point. The utility of this method as it applies to time series analysis is perhaps by now apparent: trajectories of change can be modeled before and after the implementation of a treatment, intervention, or program. In terms of the present example, change in alcohol consumption can be modeled for the entire sample before and after the program implementation. Importantly, different slopes can be estimated for the two different groups (program vs. no program) and empirically tested for differences in the slopes. For example, consider a model that specified a linear growth trajectory for the initial slope (prior to the program) and another linear growth trajectory for the subsequent slope (after the program). In a piecewise growth model, significance testing (as well as the estimation of effect sizes) can be performed separately for both the initial slope and the subsequent slope. Further, by adding the fixed effect of program participation (program vs. no program), initial and subsequent slopes for the different groups can be modeled, and the differences between the initial and subsequent slopes for the two groups can be tested. With piecewise growth modeling, the evaluator can test the assumption that the initial slopes of the two groups are, in fact, the same, as well as test the hypothesis that, following the program, the growth trajectories of the two groups differ systematically, with the intended effect being that the program group shows a less positive or even negative slope over time (increased alcohol consumption among adolescents being presumed undesirable).
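
The piecewise growth model just described can be fit as a mixed-effects regression with two time variables, one carrying the pre-program slope and one carrying the added post-program slope, plus their interactions with program membership. The sketch below does this with a random-intercept MixedLM in statsmodels on simulated data; the data and variable names are our own, and a fuller model might also include random slopes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Simulated long-format data: 200 adolescents, grades 6-12, program starts after grade 8.
n, grades = 200, np.arange(6, 13)
ids = np.repeat(np.arange(n), len(grades))
grade = np.tile(grades, n)
program = np.repeat(rng.integers(0, 2, n), len(grades))
time = grade - 6                                   # pre-program "piece"
time_post = np.clip(grade - 8, 0, None)            # post-program "piece"
alcohol = (1.0 + 0.4 * time
           - 0.3 * time_post * program             # program bends the slope downward
           + np.repeat(rng.normal(0, 0.8, n), len(grades))  # person-level intercepts
           + rng.normal(0, 0.5, len(ids)))

df = pd.DataFrame({"id": ids, "program": program, "time": time,
                   "time_post": time_post, "alcohol": alcohol})

# Random intercept per adolescent; time:program tests pre-program slope differences
# (ideally near zero), and time_post:program tests the post-program change in slope.
fit = smf.mixedlm("alcohol ~ time + time_post + program + time:program + time_post:program",
                  data=df, groups=df["id"]).fit()
print(fit.summary())
```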

Although this method is very useful for interrupted time series designs, it is not without its drawbacks. Perhaps one drawback is the complexity of model building; however, this drawback is quickly ameliorated with some research on the topic and perhaps some collaboration. Another drawback to this technique is that the change in the subsequent slope may be driven primarily by a large change in behavior immediately following the program and does not necessarily indicate a lasting change over time. Other modeling techniques can be used to explore such variations in behavioral change over time. The interested reader can refer to Singer and Willett (2003).

Structural equation modeling can also be used to model longitudinal data through the use of latent growth curve models. For technical details on how to specify a latent growth curve model, the interested reader can refer to Duncan, Duncan, and Stryker (2006). The primary advantage of using latent growth curve modeling over MLM is that latent variables can be used (indeed, piecewise growth models can be estimated in a latent growth model framework as well; see Muthén & Muthén, 2009, p. 105). In addition, more complex models, such as multilevel latent growth curve models, can be implemented. Such models also account for the interdependence of longitudinal data but are also useful when data are nested—for example, when there are longitudinal data on alcohol consumption in several different schools. These models can become increasingly complex, and it is recommended that evaluators without prior knowledge of this statistical technique seek the advice of, and possible collaboration with, experts on this topic.

What Do Program Evaluators Actually Do? Part III: Qualitative Methods

Foundations of Qualitative Methods: Credibility and Quality

The two principal pillars on which qualitative program evaluation rests are credibility and quality. These two concepts lie at the heart of all qualitative research, regardless of any more specific philosophical or ideological subscriptions (Patton, 1999). Although these concepts are not considered to be purely independent of each other in the literature, for the sake of clarity of explanation, we will treat them as such unless otherwise specified.

credibility

When performing a literature search on the credibility concept within the qualitative paradigms, the emphasis seems to fall primarily on the researcher and only secondarily on the research itself. The points most notably brought to light are those of researcher competence and trustworthiness.

competence

Competence is the key to establishing the credibility of a researcher. If a researcher is deemed incompetent, then the credibility and quality of the entire study immediately come into question. One of the biggest issues lies with the training of qualitative researchers in methods. In a classic example of the unreliability of eyewitness testimonies, Katzer, Cook, and Crouch (1978) point out what can happen when sufficient training does not occur. Ignorance is not bliss, at least in science. Giving any researcher tools without the knowledge to use them is simply bad policy. Subsequent to their initial training, the next most important consideration with respect to competence is the question of the researcher's scientific "track record." If an evaluator has repeatedly demonstrated the ability to perform high-quality research, then it can be assumed that the researcher is competent.

trustworthiness

Something else to note when considering the credibility of an evaluator is trustworthiness. There is little doubt that the researcher's history must be taken into account (Patton, 1999). Without knowing where the researcher is "coming from," in terms of possible ideological commitments, the reports made by a given evaluator may appear objective but might actually be skewed by personal biases. This is especially a problem with the more phenomenological methods of qualitative program evaluation, such as interpretive and social constructionist approaches. As Denzin (1989) and many others have pointed out, pure neutrality or impartiality is rare. This means that not being completely forthright about any personal biases should be a "red flag" regarding the trustworthiness (or lack thereof) of the evaluator.

judging credibility

There are those who argue that credibility and trustworthiness are not traits that an evaluator can achieve alone but, rather, must be established by the stakeholders, presumably democratically and with all providing equal input (Atkinson, Heath, & Chenail, 1991). This notion seems to be akin to that of external validity. It is also fundamentally different from another school of thought that claims to be able to increase "truth value" via external auditing (Lincoln & Guba, 1985). As with external validation, Atkinson and colleagues would argue that evaluators are not in a position to judge their own work and that separate entities should be responsible for such judging. According to this perspective, stakeholders need to evaluate the evaluators. If we continue down that road, then the evaluators of the evaluators might need to be evaluated, and they in turn will need to be evaluated, and so on and so forth. As the Sixth Satire, written by the first-century Roman poet Decimus Iunius Iuvenalis, asks: "quis custodiet ipsos custodes?" ("who shall watch the watchers?"; Ramsay, 1918). The way around this infinite regress is to develop some sort of standard against which a given researcher can be compared.


Evaluators can only be as credible as the system that brought them to their current positions. Recall that there is a diverse array of backgrounds among program evaluators and a broad armamentarium of research methods and statistical models available from which they can select, as well as the fact that there are currently no formal training standards in program evaluation (Stevahan et al., 2005). Until a standard of training is in place, there is no objective way to assess the credibility of a researcher, and evaluators are forced to rely on highly subjective measures of credibility, fraught with biases and emotional reactions.

Quality

The other key concern in qualitative program evaluation is quality. Quality concerns echo those voiced regarding questions of reliability and validity in quantitative research, although the framing of these concepts is done within the philosophical framework of the research paradigm (Golafshani, 2003). Patton, as the "go-to guy" for how to do qualitative program evaluations, has applied quantitative principles to qualitative program evaluation throughout his works (Patton, 1999, 1997, 1990), although they seem to fall short in application. His primary emphases are on rigor in testing and interpretation.

rigorous testing

Apart from being thorough in the use of any single qualitative method, there appears to be a single key issue with respect to testing rigor, and this is called triangulation.

Campbell discussed the concept of methodological triangulation (Campbell, 1953, 1956; Campbell & Fiske, 1959). Triangulation is the use of multiple methods, each having its own unique biases, to measure a particular phenomenon. This multiple convergence allows the systematic variance ascribable to the "trait" being measured by multiple indicators to be partitioned from the systematic variance associated with each "method" and from the unsystematic variance attributable to the inevitable and random "error" of measurement, regardless of the method used. Within the context of qualitative program evaluation, this can consist either of mixing quantitative and qualitative methods or of mixing different qualitative methods. Patton (1999) outspokenly supported the use of either form of triangulation, because each method of measurement has its own advantages and disadvantages.

Other contributors to this literature have claimed that the "jury is still out" concerning the advantages of triangulation (Barbour, 1998) and that clearer definitions are needed to determine triangulation's applicability to qualitative methods. Barbour's claim seems unsupported because there is a clear misinterpretation of Patton's work. Patton advocates a convergence of evidence. Because the nature of qualitative data is not as precise as that of quantitative data, traditional hypothesis testing is virtually impossible. Barbour is under the impression that Patton is referring to perfectly congruent results. This is obviously not possible because, as stated above, there will always be some divergence between different measures depending on which method of measurement is used. Patton is advocating the use of multiple and mixed methods to produce consistent results. One example of how to execute triangulation within the qualitative paradigm focused on three different educational techniques (Oliver-Hoyo & Allen, 2006). For cooperative grouping, hands-on activities, and graphical skills, these authors used interviews, reflective journal entries, surveys, and field notes. The authors found that the exclusive use of surveys would have led to different conclusions, because the results of the surveys alone indicated that there was either no change or a negative change, whereas the other methods instead indicated that there was a positive change with the use of these educational techniques. This demonstrates the importance of using triangulation. When results diverge, meaning that they show opposing trends using different methods, the accuracy of the findings falls into question.

Lincoln and Guba (1985) have also discussed the importance of triangulation but have emphasized its importance in increasing the rigor and trustworthiness of research with respect to the interpretation stage. This is ultimately because all methods will restrict what inferences can be made from a qualitative study.

rigorous interpretation

As with quantitative program evaluation, qualitative methods require rigorous interpretation at two levels: the microscale, which is the sample, and the macroscale, which is the population for quantitative researchers and is most often the social or global implication for qualitative researchers.

Looking at qualitative data is reminiscent of exploratory methods in quantitative research but without the significance tests. Grounded Theory is one such analytic method. The job of the researcher is to systematically consider all of the data and to extract theory from the data (Strauss & Corbin, 1990). The only exception made is for theory extension, when beginning with a preconceived theory is acceptable.

Repeatedly throughout the literature (e.g., Patton, 1999; Atkinson, Heath, & Chenail, 1991; Lincoln & Guba, 1985), the evaluator is emphasized as the key instrument in the analysis of data. Although statistics can be helpful, they are seen as restrictive and as overriding any "insight" from the researcher. Analysis necessarily depends on the "astute pattern recognition" abilities of the investigating researcher (Patton, 1999). What Leech and Onwuegbuzie (2007) have called "data analysis triangulation" is essentially an extension of the triangulation concept described by Patton (1999) as applied to data analytics. The idea is that by analyzing data with different techniques, convergence can be determined, making the findings more credible or trustworthy.

Because a large part of qualitative inquiry is subjective and dependent on a researcher's creativity, Patton (1999) has advocated reporting all relevant data and making explicit all thought processes, thus avoiding the problem of interpretive bias. This may allow anyone who reads the evaluation report to determine whether the results and suggestions were sufficiently grounded. Shek et al. (2005) have outlined the necessary steps that must occur to demonstrate that the researcher is not simply forcing their opinions into their research.

Qualitative Methods in Program Evaluation

The most common methods in qualitative program evaluation are straightforward and fall into one of two broad categories: first-party or third-party methods (done from the perspective of the evaluands, which are the programs being evaluated). These methods are also used by more quantitative fields of inquiry, although they are not usually framed as part of the research process.

first-party methods

When an evaluator directly asks questions of the entities being evaluated, the evaluator is utilizing a first-party method. Included in this method are techniques such as interviews (whether of individuals or focus groups), surveys, open-ended questionnaires, and document analyses.

Interviews, surveys, and open-ended questionnaires are similar in nature. In interviews, the researcher begins with a set of potential questions, and, depending on the way in which the individuals within the entity respond, the questions will move in a particular direction. The key here is that the questioning is fluid, open, and not a forced choice. In the case of surveys and open-ended questionnaires, fixed questions are presented to the individual, but the potential answers are left as open as possible, such as in short-answer responding. As with interviews, the questioning is, wherever possible, open and not a forced choice (see Leech & Onwuegbuzie, 2007; Oliver-Hoyo & Allen, 2006; Pugach, 2001; Patton, 1999).

Although document analysis is given its own category in the literature (Pugach, 2001; Patton, 1999), it seems more appropriate to include the document analysis technique along with the other first-party methods. Document analysis will usually be conducted on prior interviews, transcribed statements, or other official reports. It involves doing "archival digging" to gather data for the evaluation. Pulling out key "success" or "failure" stories is pivotal to performing these kinds of analyses, and such stories are utilized as often as possible for illustrative purposes.

The unifying theme of these three techniques is that the information comes from within the entity being evaluated.

third-party methods

The other primary type of methodology used in qualitative research is third-party methods. The two major third-party methods are naturalistic observations and case studies. These methods are more phenomenological in nature and require rigorous training on the part of the researcher for proper execution. These methods are intimately tied to the Competence section above.

Naturalistic observation has been used by biological and behavioral scientists for many years and involves observation of behavior within its natural context. This method involves observing some target (whether a human or a nonhuman animal) performing a behavior in its natural setting. This is most often accomplished by reviewing video recordings or by recording the target in person while not interacting with the target. There are, however, many cases of researchers interacting with the target and then "going native," or becoming a member of the group they initially sought to study (Patton, 1999). Some of the most prominent natural scientists have utilized this method (e.g., Charles Darwin, Jane Goodall, and Isaac Newton). According to Patton (1999), there are well-documented problems with this method, including phenomena like researcher presence effects, "going native," researcher biases, and concerns regarding researcher training. Despite the inherent risks and problems with naturalistic observation, it has been, and will likely continue to be, a staple method within scientific inquiry.

Case studies can be special cases of naturalistic observation or can be a special kind of "artificial" observation. Case studies provide extensive detail about a few individuals (Banfield & Cayago-Gicain, 2006; Patton, 1999) and can simply be used to demonstrate a point (as in Abma, 2000). Case studies usually take a substantial amount of time to gather appropriate amounts of idiographic data. This method utilizes any records the researcher can get their hands on regarding the individual being studied (self-report questionnaires, interviews, medical records, performance reviews, financial records, etc.). As with naturalistic observation, case study researchers must undergo much training before they can be deemed "capable" of drawing conclusions based on a single individual. The problems with case studies are all of those found in naturalistic observation but with the addition of a greater probability of sampling error. Because case studies are so intensive, they are often also very expensive. The salience and exhaustiveness of a few cases make it difficult to notice larger, nominal trends in the data (Banfield & Cayago-Gicain, 2006). This could also put a disproportionate emphasis on the "tails" of the distribution, although that may be precisely what the researcher wants to accomplish (see the next section).

Critiques/Criticisms of Quantitative Methods

One of the major critiques of quantitative methods by those in qualitative evaluation is that of credibility. According to those using qualitative evaluation methods, the relevance of findings from quantitative evaluation to what is "important," or to "the essence" of the question, is rather poor (see discussion in Reichardt & Rallis, 1994a, 1994b). Recall that according to Atkinson et al. (1991), the relevance of findings, and whether they are appropriate, cannot be determined by the evaluator; the stakeholders are the only ones who can determine relevance. Although there are those in qualitative program evaluation who think almost everything is caused by factors like "social class" and "disparity in power," Atkinson et al. would argue that the evaluator is not able to determine what is or is not relevant to the reality experienced by the stakeholders.

Another criticism is that quantitative research tends to focus simply on the majority, neglecting the individuals at the outer ends of the normal distribution. This is a valid critique of those quantitative researchers who tend to "drop" their outliers for better model fits. Banfield and Cayago-Gicain (2006) have pointed out that qualitative research allows for more detail on a smaller sample, which allows more of the context surrounding individuals to be presented. With additional knowledge from the "atypical" cases (the tails of the distribution), theory can be extracted that best fits all of the data, not just the "typical" person.
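A toy numerical illustration of this critique, with simulated data rather than anything from the chapter: routinely discarding tail cases can noticeably shift an estimated program effect and removes exactly the participants a qualitative evaluator might want to examine.

```python
# Minimal sketch (simulated data, illustrative only): how "dropping" tail cases
# can change a program-effect estimate and hide atypical participants.
import random
import statistics

random.seed(42)
# Simulated outcome gains: most participants gain a little; a few gain a lot or lose ground.
gains = [random.gauss(2.0, 1.0) for _ in range(95)] + [12.0, 10.5, -8.0, -6.5, 11.0]

def trimmed(values, z=2.0):
    """Drop observations more than z standard deviations from the mean."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [v for v in values if abs(v - m) <= z * s]

kept = trimmed(gains)
print("mean gain, all cases:        ", round(statistics.mean(gains), 2))
print("mean gain, outliers dropped: ", round(statistics.mean(kept), 2))
print("participants discarded:      ", len(gains) - len(kept))
```

The discarded cases are precisely the "tails" whose stories qualitative case studies are designed to capture.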

Beyond the Qualitative/Quantitative Debate

Debate about the superiority of qualitative versus quantitative methodology has a long history in program evaluation. Prior to the 1970s, randomized experiments were considered the gold standard in impact assessment. More and more, however, as evaluators realized the limitations of randomized experiments, quasi-experiments became more acceptable (Madey, 1982). It was also not until the early 1970s that qualitative methods became more acceptable; however, epistemological differences between the two camps perpetuated the debate, even leading to distrust and slander between followers of the different perspectives (Kidder & Fine, 1987). In an effort to stem the tide of the qualitative–quantitative debate, some evaluators have long called for integration between the two approaches. By recognizing that the methods typically associated with the qualitative and quantitative paradigms are not inextricably linked to those paradigms (Reichardt & Cook, 1979), an evaluator has greater flexibility to choose the specific methods that are simply the most appropriate for a given evaluation question (Howe, 1988). Further, others have pointed out that because the qualitative and quantitative approaches are not entirely incompatible (e.g., Reichardt & Rallis, 1994a, 1994b), common ground can be found between the two methods when addressing evaluation questions.

An evaluator thus may choose to use quantitative or qualitative methods alone or may choose to use both methods in what is known as a mixed methods design. A mixed methods approach to evaluation has been advocated on the basis that the two methods (1) provide cross-validation (triangulation) of results and (2) complement each other, where the relative weakness of one method becomes the relative strength of the other. For example, despite the purported epistemological differences between the two paradigms, the different approaches to evaluation often lead to the same answers (Sale, Lohfeld, & Brazil, 2002). Thus, combining both methods in the same evaluation can result in converging lines of evidence. Further, each method can be used to complement the other. For example, the use of qualitative data collection techniques can help in the development or choice of measurement instruments, as the personal interaction with individual participants may pave the way for collecting more sensitive data (Madey, 1982).
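As a crude illustration of the triangulation idea, and not a procedure taken from the chapter, an evaluator might check whether the quantitative effect estimate and the balance of qualitatively coded interview statements point in the same direction; the data below are hypothetical.

```python
# Minimal sketch (hypothetical data): a crude triangulation check in a mixed
# methods design, comparing the direction of a quantitative effect estimate
# with the balance of qualitatively coded interview statements.
from statistics import mean

# Quantitative strand: outcome scores for program and comparison groups.
program_scores = [14, 16, 15, 18, 17, 13]
comparison_scores = [12, 11, 13, 12, 14, 10]
quant_effect = mean(program_scores) - mean(comparison_scores)

# Qualitative strand: interview statements already coded by the analyst.
coded_statements = ["positive"] * 19 + ["negative"] * 6
qual_balance = coded_statements.count("positive") - coded_statements.count("negative")

converges = (quant_effect > 0) == (qual_balance > 0)
print(f"Quantitative effect estimate: {quant_effect:+.1f}")
print(f"Qualitative code balance:     {qual_balance:+d}")
print("Strands converge." if converges else "Strands diverge; probe further.")
```

Agreement in direction is of course only the weakest form of triangulation; divergence is often the more informative outcome, because it flags something one strand has missed.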

Despite the promise of integrating qualitative and quantitative methods through a mixed methods approach, Sale et al. (2002) challenged the notion that qualitative and quantitative methods are separable from their respective paradigms, contrary to the position advocated by Reichardt and Cook (1979). Indeed, these authors have suggested that because the two approaches deal with fundamentally different perspectives, the use of both methods to triangulate or complement each other is invalid. Rather, mixed methods should be used in conjunction with one another only with the recognition of the different questions that they address. In this view, it should be recognized that qualitative and quantitative methods do address different questions, but at the same time they can show considerable overlap. Thus, mixed methods designs provide a more complete picture of the evaluation space by providing all three components: cross-validation, complementarity, and unique contributions from each method.

Despite the utility in principle of integrating both qualitative and quantitative methods in evaluation and the more recent developments in mixed methodology (see Greene & Caracelli, 1997), the overwhelming majority of published articles in practice employ either qualitative or quantitative methods to the exclusion of the other. Perhaps one reason for the persistence of the single-methodology approach is the lack of training in both approaches in evaluation training programs. For example, the AEA website (http://www.eval.org) lists 51 academic programs that have an evaluation focus or evaluation option. In a review of each of these programs, we found that none of the evaluation programs had a mixed methods focus. Moreover, when programs did have a focus, it was on quantitative methods. Further, within these programs quantitative methods and qualitative methods were generally taught in separate classes, and there was no evidence of any class in any program that was focused specifically on mixed methods designs. Indeed, Johnson and Onwuegbuzie (2004) have noted that ". . . graduate students who graduate from educational institutions with an aspiration to gain employment in the world of academia or research are left with the impression that they have to pledge allegiance to one research school of thought or the other" (p. 14). Given the seeming utility of a mixed methods approach, it is unfortunate that more programs do not offer specific training in these techniques.

Competing Paradigms or Possible Integration?

In summary, the quantitative and qualitative approaches to program evaluation have been widely represented as incommensurable Kuhnian paradigms (e.g., Guba & Lincoln, 1989). On the other hand, it has been suggested that perhaps the road to reconciliation lies with Reichenbach's (1938) important distinction between the context of discovery versus the context of justification in scientific research. Sechrest and Figueredo (1993) paraphrased their respective definitions:

In the context of discovery, free reign is given to speculative mental construction, creative thought, and subjective interpretation. In the context of justification, unfettered speculation is superseded by severe testing of formerly favored hypotheses, observance of a strict code of scientific objectivity, and the merciless exposure of one's theories to the gravest possible risk of falsification. (p. 654)

Based on that philosophical perspective, Sechrest and Figueredo (1993) recommended the following methodological resolution of the quantitative/qualitative debate:

We believe that some proponents of qualitative methods have incorrectly framed the issue as an absolute either/or dichotomy. Many of the limitations that they attribute to quantitative methods have been discoursed upon extensively in the past. The distinction made previously, however, was not between quantitative and qualitative, but between exploratory and confirmatory research. This distinction is perhaps more useful because it represents the divergent properties of two complementary and sequential stages of the scientific process, rather than two alternative procedures . . . Perhaps a compromise is possible in light of the realization that although rigorous theory testing is admittedly sterile and nonproductive without adequate theory development, creative theory construction is ultimately pointless without scientific verification. (p. 654)

We also endorse that view. However, in case Sechrest and Figueredo (1993) were not completely clear the first time, we will restate this position here a little more emphatically. We believe that qualitative methods are most useful in exploratory research, meaning early in the evaluation process, the so-called context of discovery, in that they are more flexible and open and permit the researcher to follow intuitive leads and discover previously unknown and unimagined facts that were quite simply not predicted by existing theory. Qualitative methods are therefore a useful tool for theory construction. However, the potentially controversial part of this otherwise conciliatory position is that it is our considered opinion that qualitative methods are inadequate for confirmatory research, the so-called context of justification, in that they do not, and cannot even in principle, be designed to rigorously subject our theories to critical risk of falsification, as by comparison to alternative theories (Chamberlin, 1897; Platt, 1964; Popper, 1959; Lakatos, 1970, 1978). For that purpose, quantitative methods necessarily excel because of their greater methodological rigor and because they are equipped to do just that. Quantitative methods are therefore a more useful tool for theory testing. This does not make quantitative evaluation in any way superior to qualitative evaluation, in that exploration and confirmation are both part of the necessary cycle of scientific research.

It is virtually routine in many other fields, such as in the science of ethology, to make detailed observations regarding the natural history of any species before generating testable hypotheses that predict their probable behavior. In cross-cultural research, it is standard practice to do the basic ethnographical exploration of any new society under study prior to making any comparative behavioral predictions. These might be better models for program evaluation to follow than constructing the situation as an adversarial one between supposedly incommensurable paradigms.

Conclusions and Recommendations for the Future

As a possible solution to some of the structural problems, moral hazards, and perverse incentives in the practice of program evaluation that we have reviewed, Scriven (1976, 1991) long ago suggested that the program funders should pay for summative evaluations and pay the summative evaluators directly. We completely agree with this because we believe that the summative program evaluators must not have to answer to the evaluands and that the results of the evaluation should not be "filtered" through them.

For example, in the Propriety Standards for Conflicts of Interest, The Joint Committee on Standards for Educational Evaluation (1994) has issued the following guideline: "Wherever possible, obtain the evaluation contract from the funding agency directly, rather than through the funded program or project" (p. 116). Our only problem with this guideline is that the individual evaluator is called on to implement this solution. Should an ethical evaluator then decline contracts offered by the funded program or project? This is not a realistic solution to the problem. As a self-governing society, we should simply not accept summative evaluations in which the funded programs or projects (evaluands) have contracted their own program evaluators. This is a simple matter of protecting the public interest by making the necessary institutional adjustments to address a widely recognized moral hazard.

Similarly, in the Propriety Standards for Disclosure of Findings, The Joint Committee on Standards for Educational Evaluation (1994) has issued various guidelines for evaluators to negotiate in advance with clients for complete, unbiased, and detailed disclosure of all evaluation findings to all directly and indirectly affected parties. The problem is that there is currently no incentive in place for an individual evaluator to do so, because demanding such unrestricted dissemination of information, to which almost no client on this planet is very likely to agree, would jeopardize the award of the evaluation contract.

On the other hand, we recommend that the evaluands should pay for formative evaluations and pay the formative evaluators directly. This is because we believe that formative evaluators should provide continuous feedback to the evaluands and not publish those results externally before the program is fully mature (e.g., Tharp & Gallimore, 1979). That way, the formative evaluator can gain the complete trust and cooperation of the program administrators and the program staff. Stufflebeam (2001) writes:

Clients sometimes can legitimately commission covert studies and keep the findings private, while meeting relevant laws and adhering to an appropriate advance agreement with the evaluator. This can be the case in the United States for private organizations not governed by public disclosure laws. Furthermore, an evaluator, under legal contractual agreements, can plan, conduct, and report an evaluation for private purposes, while not disclosing the findings to any outside party. The key to keeping client-controlled studies in legitimate territory is to reach appropriate, legally defensible, advance, written agreements and to adhere to the contractual provisions concerning release of the study's findings. Such studies also have to conform to applicable laws on release of information. (p. 15)

In summary, summative evaluations should generally be external, whereas formative evaluations should generally be internal. Only strict adherence to these guidelines will provide the correct incentive system for all the parties concerned, including the general public, which winds up paying for all this. The problem essentially boils down to one of intellectual property. Who actually owns the data generated by a program evaluation? In a free market society, the crude but simple answer to this question is typically "whoever is paying for it!" In almost no case is it the program evaluator, who is typically beholden to one party or another for employment. We should therefore arrange for the owner of that intellectual property to be in every case the party whose interests are best aligned with those of the society as a whole. In the case of a formative evaluation, that party is the program-providing agency (the evaluand) seeking to improve its services with a minimum of outside interference, whereas in the case of a summative evaluation, that party is the program-funding agency charged with deciding whether any particular program is worth society's continuing investment and support.

Many informative and insightful comparisons and contrasts have been made on the relative merits and limitations of internal and external evaluators (e.g., Braskamp, Brandenburg, & Ory, 1987; Love, 1991; Mathison, 1994; Meyers, 1981; Newman & Brown, 1996; Owen & Rogers, 1999; Patton, 1997; Tang, Cowling, Koumijian, Roeseler, Lloyd, & Rogers, 2002; Weiss, 1998). Although all of those considerations are too many to list here, internal evaluators are generally valued for their greater availability and lower cost, as well as for their greater contextual knowledge of the particular organization and their ability to obtain a greater degree of commitment from stakeholders to the ultimate recommendations of the evaluation, based on the perceived legitimacy obtained through their direct experience in the program. We believe that these various strengths of internal evaluators are ideally suited to the needs of formative evaluation; however, some of these same characteristics might compromise their credibility in the context of a summative evaluation. In contrast, external evaluators are generally valued for their greater technical expertise as well as for their greater independence and objectivity, including greater accountability to the public interest and ability to criticize the organization being evaluated, and hence their greater ability to potentially position themselves as mediators or arbiters between the stakeholders. We believe that these various strengths of external evaluators are ideally suited to the needs of summative evaluation; however, some of these same characteristics might compromise their effectiveness in the context of a formative evaluation.

A related point is that qualitative methods are arguably superior for conducting the kind of exploratory research often needed in a formative evaluation, whereas quantitative methods are arguably superior for conducting the confirmatory research often needed in a summative evaluation. By transitive inference with our immediately prior recommendation, we would envision qualitative methods being of greater use to internal evaluators and quantitative methods being of greater use to external evaluators, if each method is applied to what it excels at achieving, within its contingently optimal context. With these conclusions, we make our final recommendation that the qualitative/quantitative debate be officially ended, with the recognition that both kinds of research have their proper and necessary place in the cycle of scientific research and, by logical implication, in that of program evaluation. Each side must abandon the claim that its preferred methods can do it all and, in the spirit of the great evaluation methodologist and socio-cultural evolutionary theorist Donald Thomas Campbell, recognize that all of our methods are fallible (Campbell & Fiske, 1959) and that only by exploiting their mutual complementarities can we put all of the interlocking fish scales of omniscience back together (Campbell, 1969).

References

Abma, T. A. (2000). Stakeholder conflict: A case study. Evaluation and Program Planning, 23, 199–210.

Atkinson, B., Heath, A., & Chenail, R. (1991). Qualitative research and the legitimization of knowledge. Journal of Marital and Family Therapy, 17(2), 175–180.

Barbour, R. S. (1998). Mixing qualitative methods: Quality assurance or qualitative quagmire? Qualitative Health Research, 8(3), 352–361.

Banfield, G., & Cayago-Gicain, M. S. (2006). Qualitative approaches to educational evaluation: A regional conference-workshop. International Education Journal, 7(4), 510–513.

Berry, D. H. (Trans.). (2000). Cicero: Defense speeches. New York: Oxford University Press.

Boruch, R. F. (1997). Randomized experiments for planning and evaluation: A practical guide. Thousand Oaks, CA: Sage.

Braskamp, L. A., Brandenburg, D. C., & Ory, J. C. (1987). Lessons about clients' expectations. In J. Nowakowski (Ed.), The client perspective on evaluation (New Directions for Program Evaluation, No. 36, pp. 63–74). San Francisco, CA: Jossey-Bass.

Bryk, A. S. (1980). Analyzing data from premeasure/postmeasure designs. In S. Anderson, A. Auquier, W. Vandaele, & H. I. Weisburg (Eds.), Statistical methods for comparative studies (pp. 235–260). Hoboken, NJ: John Wiley & Sons.

Campbell, D. T. (1953). A study of leadership among submarine officers. Columbus, OH: The Ohio State University, Personnel Research Board.

Campbell, D. T. (1956). Leadership and its effects upon the group. Columbus, OH: Bureau of Business Research, The Ohio State University.

Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297–312.

Campbell, D. T. (1969). Ethnocentrism of disciplines and the fish-scale model of omniscience. In M. Sherif & C. W. Sherif (Eds.), Interdisciplinary relationships in the social sciences (pp. 328–348). Chicago, IL: Aldine.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.

Center for Disease Control. (2007). Youth media campaign, VERB logic model. Retrieved May 18, 2012, from http://www.cdc.gov/youthcampaign/research/logic.htm

Chamberlin, T. C. (1897). The method of multiple working hypotheses. Journal of Geology, 5, 837–848.

Clayton, R. R., Cattarello, A. M., & Johnstone, B. M. (1996). The effectiveness of Drug Abuse Resistance Education (Project DARE): 5-year follow-up results. Preventive Medicine, 25(3), 307–318.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago, IL: Rand-McNally.

Cook, T. D., Cook, F. L., & Mark, M. M. (1977). Randomized and quasi-experimental designs in evaluation research: An introduction. In L. Rutman (Ed.), Evaluation research methods: A basic guide (pp. 101–140). Beverly Hills, CA: Sage.

Cook, T. D., Scriven, M., Coryn, C. L. S., & Evergreen, S. D. H. (2010). Contemporary thinking about causation in evaluation: A dialogue with Tom Cook and Michael Scriven. American Journal of Evaluation, 31, 105–117.

Dembe, A. E., & Boden, L. I. (2000). Moral hazard: A question of morality? New Solutions, 10(3), 257–279.

Denzin, N. K. (1989). Interpretive interactionism. Newbury Park, CA: Sage.

Dukes, R. L., Stein, J. A., & Ullman, J. B. (1996). Long-term impact of Drug Abuse Resistance Education (D.A.R.E.). Evaluation Review, 21(4), 483–500.

Dukes, R. L., Ullman, J. B., & Stein, J. A. (1996). Three-year follow-up of Drug Abuse Resistance Education (D.A.R.E.). Evaluation Review, 20(1), 49–66.

Duncan, T. E., Duncan, S. C., & Strycker, L. A. (2006). An introduction to latent variable growth curve modeling: Concepts, issues, and applications (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.

Ennett, S. T., Tobler, N. S., Ringwalt, C. L., & Flewelling, R. L. (1994). How effective is Drug Abuse Resistance Education? A meta-analysis of Project DARE outcome evaluations. American Journal of Public Health, 84(9), 1394–1401.

General Accounting Office. (2003). Youth illicit drug use prevention (Report No. GAO-03-172R). Washington, DC: Author.

Golafshani, N. (2003). Understanding reliability and validity in qualitative research. The Qualitative Report, 8(4), 597–606.

Greene, J. C., & Caracelli, V. J. (1997). Defining and describing the paradigm issue in mixed-method evaluation. In J. C. Greene & V. J. Caracelli (Eds.), Advances in mixed-method evaluation: The challenges and benefits of integrating diverse paradigms (New Directions for Evaluation, No. 74, pp. 5–17). San Francisco, CA: Jossey-Bass.

Guba, E. G., & Lincoln, Y. S. (1989). Fourth generation evaluation. Newbury Park, CA: Sage.

Heckman, J. J., & Smith, J. A. (1995). Assessing the case for social experiments. Journal of Economic Perspectives, 9, 85–110.

Hollister, R. G., & Hill, J. (1995). Problems in the evaluation of community-wide initiatives. In J. P. Connell, A. C. Kubisch, L. B. Schorr, & C. H. Weiss (Eds.), New approaches to evaluating community initiatives: Concepts, methods, and contexts (pp. 127–172). Washington, DC: Aspen Institute.

Howe, K. R. (1988). Against the quantitative-qualitative incompatibility thesis or dogmas die hard. Educational Researcher, 17, 10–16.

Huitema, B. E. (1980). The analysis of covariance and alternatives. New York, NY: John Wiley & Sons.

Johnson, R. B., & Onwuegbuzie, A. J. (2004). Mixed-methods research: A research paradigm whose time has come. Educational Researcher, 33, 14–26.

Katzer, J., Cook, K., & Crouch, W. (1978). Evaluating information: A guide for users of social science research. Reading, MA: Addison-Wesley.

Kenny, D. A. (1975). A quasi-experimental approach to assessing treatment effects in the non-equivalent control group design. Psychological Bulletin, 82, 345–362.

Kenny, D. A., Kashy, D. A., & Cook, W. L. (2006). Dyadic data analysis. New York: Guilford Press.

Kidder, L. H., & Fine, M. (1987). Qualitative and quantitative methods: When stories converge. In M. M. Mark & R. L. Shotland (Eds.), Multiple methods in program evaluation (pp. 57–75). Indianapolis, IN: Jossey-Bass.

King, J. A., Stevahn, L., Ghere, G., & Minnema, J. (2001). Toward a taxonomy of essential evaluator competencies. American Journal of Evaluation, 22, 229–247.

Kirk, R. E. (2009). Experimental design. In R. E. Millsap & A. Maydeu-Olivares (Eds.), The Sage handbook of quantitative methods in psychology (pp. 23–45). Thousand Oaks, CA: Sage.

Lakatos, I. (1970). Falsification and the methodology of scientific research programmes. In I. Lakatos & A. Musgrave (Eds.), Criticism and the growth of knowledge (pp. 91–196). Cambridge, UK: Cambridge University Press.

Lakatos, I. (1978). The methodology of scientific research programmes. Cambridge, UK: Cambridge University Press.

Leech, N. L., & Onwuegbuzie, A. J. (2007). An array of qualitative data analysis tools: A call for data analysis triangulation. School Psychology Quarterly, 22(4), 557–584.

Loftus, E. F. (1979). Eyewitness testimony. Cambridge, MA: Harvard University Press.

Love, A. J. (1991). Internal evaluation: Building organizations from within. Newbury Park, CA: Sage.

Madey, D. L. (1982). Some benefits of integrating qualitative and quantitative methods in program evaluation. Educational Evaluation and Policy Analysis, 4, 223–236.

Mark, M. M., & Cook, T. D. (1984). Design of randomized experiments and quasi-experiments. In L. Rutman (Ed.), Evaluation research methods: A basic guide (pp. 65–120). Beverly Hills, CA: Sage.

Mathison, S. (1994). Rethinking the evaluator role: Partnerships between organizations and evaluators. Evaluation and Program Planning, 17(3), 299–304.

Meyers, W. R. (1981). The evaluation enterprise: A realistic appraisal of evaluation careers, methods, and applications. San Francisco, CA: Jossey-Bass.

Muthén, L. K., & Muthén, B. O. (1998–2009). Mplus user's guide: Statistical analysis with latent variables. Los Angeles, CA: Muthén & Muthén.

Newcomer, K. E., & Wirtz, P. W. (2004). Using statistics in evaluation. In J. S. Wholey, H. P. Hatry, & K. E. Newcomer (Eds.), Handbook of practical program evaluation (pp. 439–478). San Francisco, CA: John Wiley & Sons.

Newman, D. L., & Brown, R. D. (1996). Applied ethics for program evaluation. San Francisco, CA: Sage.

Office of Management and Budget. (2009). A new era of responsibility: Renewing America's promise. Retrieved May 18, 2012, from http://www.gpoaccess.gov/usbudget/fy10/pdf/fy10-newera.pdf

Oliver-Hoyo, M., & Allen, D. (2006). The use of triangulation methods in qualitative educational research. Journal of College Science Teaching, 35, 42–47.

Owen, J. M., & Rogers, P. J. (1999). Program evaluation: Forms and approaches (2nd ed.). St Leonards, NSW: Allen & Unwin.

Page, R. B. (1909). The letters of Alcuin. New York: The Forest Press.

Patton, M. Q. (1990). Qualitative evaluation and research methods. Thousand Oaks, CA: Sage.

Patton, M. Q. (1994). Developmental evaluation. Evaluation Practice, 15(3), 311–319.

Patton, M. Q. (1996). A world larger than formative and summative. Evaluation Practice, 17(2), 131–144.

Patton, M. Q. (1997). Utilization-focused evaluation: The new century text (3rd ed.). Thousand Oaks, CA: Sage.

Patton, M. Q. (1999). Enhancing the quality and credibility of qualitative analysis. Health Services Research, 34(5, Pt. 2), 1189–1208.

Pauly, M. V. (1974). Overinsurance and public provision of insurance: The roles of moral hazard and adverse selection. Quarterly Journal of Economics, 88, 44–62.

Platt, J. R. (1964). Strong inference. Science, 146, 347–353.

Popper, K. (1959). The logic of scientific discovery. New York: Basic Books.

Pugach, M. C. (2001). The stories we choose to tell: Fulfilling the promise of qualitative research for special education. The Council for Exceptional Children, 67(4), 439–453.

Ramsay, G. G. (Trans.). (1918). Juvenal and Persius. New York: Putnam.

Reichardt, C. S. (1979). The statistical analysis of data from non-equivalent groups design. In T. D. Cook & D. T. Campbell (Eds.), Quasi-experimentation: Design and analysis issues for field settings (pp. 147–206). Chicago, IL: Rand-McNally.

Reichardt, C. S. (2009). Quasi-experimental design. In R. E. Millsap & A. Maydeu-Olivares (Eds.), The Sage handbook of quantitative methods in psychology (pp. 46–71). Thousand Oaks, CA: Sage.

Reichardt, C. S., & Cook, T. D. (1979). Beyond qualitative versus quantitative methods. In T. D. Cook & C. S. Reichardt (Eds.), Qualitative and quantitative methods in evaluation research (pp. 7–32). Beverly Hills, CA: Sage.

Reichardt, C. S., & Rallis, S. F. (1994a). The relationship between the qualitative and quantitative research traditions. New Directions for Program Evaluation, 61, 5–11.

Reichardt, C. S., & Rallis, S. F. (1994b). Qualitative and quantitative inquiries are not incompatible: A call for a new partnership. New Directions for Program Evaluation, 61, 85–91.

Reichenbach, H. (1938). Experience and prediction. Chicago, IL: University of Chicago Press.

Rossi, P. H., & Freeman, H. E. (1993). Evaluation: A systematic approach (5th ed.). Newbury Park, CA: Sage.

Sale, J. E. M., Lohfeld, L. H., & Brazil, K. (2002). Revisiting the quantitative-qualitative debate: Implications for mixed-methods research. Quality & Quantity, 36, 43–53.

Scriven, M. (1967). The methodology of evaluation. In M. E. Gredler (Ed.), Program evaluation (p. 16). Englewood Cliffs, NJ: Prentice Hall, 1996.

Scriven, M. (1976). Evaluation bias and its control. In C. C. Abt (Ed.), The evaluation of social programs (pp. 217–224). Beverly Hills, CA: Sage.

Scriven, M. (1983). Evaluation ideologies. In G. F. Madaus, M. Scriven, & D. L. Stufflebeam (Eds.), Evaluation models: Viewpoints on educational and human services evaluation (pp. 229–260). Boston, MA: Kluwer-Nijhoff.

Scriven, M. (1991). Pros and cons about goal-free evaluation. Evaluation Practice, 12(1), 55–76.

Sechrest, L., & Figueredo, A. J. (1993). Program evaluation. Annual Review of Psychology, 44, 645–674.

Seltzer, M. H., Frank, K. A., & Bryk, A. S. (1994). The metric matters: The sensitivity of conclusions about growth in student achievement to choice of metric. Educational Evaluation and Policy Analysis, 16, 41–49.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

Shadish, W. R., Cook, T. D., & Leviton, L. C. (2001). Foundations of program evaluation: Theories of practice. Newbury Park, CA: Sage.

Shek, D. T. L., Tang, V. M. Y., & Han, X. Y. (2005). Evaluation of evaluation studies using qualitative research methods in the social work literature (1990–2003): Evidence that constitutes a wake-up call. Research on Social Work Practice, 15, 180–194.

Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 24, 323–355.

Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York: Oxford University Press.

Smith, M. J. (2010). Handbook of program evaluation for social work and health professionals. New York: Oxford University Press.

St. Pierre, R. G. (2004). Using randomized experiments. In J. S. Wholey, H. P. Hatry, & K. E. Newcomer (Eds.), Handbook of practical program evaluation (2nd ed., pp. 150–175). San Francisco, CA: John Wiley & Sons.

Stevahn, L., King, J. A., Ghere, G., & Minnema, J. (2005). Establishing essential competencies for program evaluators. American Journal of Evaluation, 26, 43–59.

Strauss, A., & Corbin, J. (1990). Basics of qualitative research: Grounded theory procedures and techniques. Newbury Park, CA: Sage.

Stufflebeam, D. L. (2001). Evaluation models. New Directions for Evaluation, 89, 7–98.

Tang, H., Cowling, D. W., Koumijian, K., Roeseler, A., Lloyd, J., & Rogers, T. (2002). Building local program evaluation capacity toward a comprehensive evaluation. In R. Mohan, D. J. Bernstein, & M. D. Whitsett (Eds.), Responding to sponsors and stakeholders in complex evaluation environments (New Directions for Evaluation, No. 95, pp. 39–56). San Francisco, CA: Jossey-Bass.

Tharp, R., & Gallimore, R. (1979). The ecology of program research and development: A model of evaluation succession. In L. B. Sechrest, S. G. West, M. A. Phillips, R. Redner, & W. Yeaton (Eds.), Evaluation studies review annual (Vol. 4, pp. 39–60). Beverly Hills, CA: Sage.

Tharp, R., & Gallimore, R. (1982). Inquiry process in program development. Journal of Community Psychology, 10(2), 103–118.

The Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards (2nd ed.). Thousand Oaks, CA: Sage.

Thistlethwaite, D. L., & Campbell, D. T. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology, 51, 309–317.

Trochim, W. M. K. (1984). Research design for program evaluation: The regression-discontinuity approach. Newbury Park, CA: Sage.

United Way. (1996). Guide for logic models and measurements. Retrieved May 18, 2012, from http://www.yourunitedway.org/media/GuideforLogModelandMeas.ppt

Wagner, A. K., Soumerai, S. B., Zhang, F., & Ross-Degnan, D. (2002). Segmented regression analysis of interrupted time series studies in medication use research. Journal of Clinical Pharmacy and Therapeutics, 27, 299–309.

Weiss, C. H. (1980). Knowledge creep and decision accretion. Knowledge: Creation, Diffusion, Utilization, 1(3), 381–404.

Weiss, C. H. (1998). Evaluation: Methods for studying programs and policies (2nd ed.). Upper Saddle River, NJ: Prentice Hall.

Weiss, C. H. (1999). The interface between evaluation and public policy. Evaluation, 5(4), 468–486.

Wells, G. L., Malpass, R. S., Lindsay, R. C. L., Fisher, R. P., Turtle, J. W., & Fulero, S. M. (2000). From the lab to the police station: A successful application of eyewitness research. American Psychologist, 55(6), 581–598.

West, S. L., & O'Neal, K. K. (2004). Project D.A.R.E. outcome effectiveness revisited. American Journal of Public Health, 94(6), 1027–1029.

Williams, A. (2001). Science or marketing at WHO? Commentary on 'World Health 2000'. Health Economics, 10, 93–100.

Willson, E. B., & Putnam, R. R. (1982). A meta-analysis of pretest sensitization effects in experimental design. American Educational Research Journal, 19, 249–258.

World Health Organization. (2000). The world health report 2000 – Health systems: Improving performance. Geneva, Switzerland: World Health Organization.
