7 Branch&Pennypacker Generalization and Generality


DOI: 10.1037/13937-007. APA Handbook of Behavior Analysis: Vol. 1. Methods and Principles, G. J. Madden (Editor-in-Chief). Copyright 2013 by the American Psychological Association. All rights reserved.

C H A P T E R 7

GENERALITY AND GENERALIZATION

OF RESEARCH FINDINGS

Marc N. Branch and Henry S. Pennypacker

For generalization, psychologists must

finally rely, as has been done in all the

older sciences, on replication. (Cohen,

1994, p. 997)

Confirmation comes from repetition. . . .

Repetition is the basis for judging . . . sig-

nificance and confidence. (Tukey, 1969,

pp. 84–85)

As the general psychology research community

becomes increasingly aware (e.g., Cohen, 1994; Lof-

tus, 1991, 1996; Wilkinson & Task Force on Statisti-

cal Inference, 1999) of the limitations of traditional

group designs and statistical inference methods with

regard to assessing reliability and generality of

research findings, we present an alternative approach

that has been substantially developed in the branch of

psychology now known as behavior analysis. In this

chapter, we outline how individual subject methods,

that is, so-called single-case designs, provide straight-

forward and, in principle, simple methods to assess

the reliability and generality of research findings.

OVERVIEW

The chapter consists of three major sections. In the first, we summarize the limitations of traditional methods, especially as they relate to assessing reliability and generality of research findings concerning behavior. We make the case that traditional methods have obscured an important distinction that has led to psychology's consisting of two related, but separable, subject matters, behavioral science and actuarial science. We also focus on the issue of generality across individuals and how traditional methods can give the illusion of such generality. In the second major section, we discuss dimensions of generality in addition to generality across individuals. Here we define scientific generality and several other forms of generality as well. In so doing, we introduce the roles of replication, both direct and systematic, in assessing generality of research results. We argue that replication, instead of statistical inference, is an alternative primary method not only for determining the reliability of results but also for assessing and characterizing the generality of scientific findings. In the third major section, we discuss generalization of treatment effects, the fundamentals of technology transfer, and the practices that characterize translational research. There, we write of programming for and assessment of generalizability of scientific findings to applied settings. We expand our view then to the engineering issues of technology development (or technology transfer and translational research) as a capstone demonstration of generalization based on an understanding of generality of research findings.

LIMITATIONS OF TRADITIONAL

METHODS

The traditional group-mean, statistical-inference

approach to analyzing research results has faced

Preparation of this chapter was supported by National Institute on Drug Abuse Grant DA004074.



consistent criticism for more than 4 decades (e.g.,

Bakan, 1966; Carver, 1978; Cohen, 1994; Gigeren-

zer, Krauss, & Vitouch, 2004; Loftus, 1991, 1996;

Meehl, 1967, 1978; Nickerson, 2000; Rozeboom,

1960). Most of that criticism has focused on what

those methods have to say about the reliability of

research findings, which is appropriate because if findings are not reliable, there is no need to assess

their generality. These methods, however, have also

been criticized with respect to theory testing and

development, issues that directly relate to generality.

We treat these two categories of criticism separately.

Significance Testing and Reliability

After all of the carefully reasoned criticism of significance testing that has been published, one would

hope that a clear understanding of its limits would

exist among professional psychologists. That, however, appears not to be true, as noted by Cohen

(1994), who lamented that

after 4 decades of severe criticism, the ritual of null hypothesis significance testing . . . still persists. [As does] near universal misinterpretation of p as the probability that H0 is false, [and] the misinterpretation that its complement is the probability of successful replication. (p. 997)

Cohen's assertion is supported by survey evidence revealing that a substantial majority of academic research psychologists incorrectly interpret p values and statistical significance (Haller & Krauss, 2002; Kalinowski, Fidler, & Cumming, 2008; Oakes, 1986). That a significant proportion of professional psychologists do not appreciate what statistical significance and, especially, p values represent is apparent testimony to a weakness in the training of research psychologists, a failing that lies at the feet of those of us who are engaged in teaching them. In fact, Haller and Krauss (2002) included a sample of statistical methodology instructors in their study and found that 80% of them were mistaken in their understanding of p values, so it comes as less of a surprise that the misconceptions are widespread. The following discussion, therefore, is another attempt to make clear what a p value is and what it means.

A p value, which results from a significance test, is a conditional probability. Specifically, it is the probability, if the null hypothesis is true, of obtaining data of a particular sort. That is, in algebraic symbols, it is p = P(Data|H0). The important point is that p ≠ P(H0|Data), which is what a researcher would presumably really like to know. In other words, a p value does not provide quantitative information about whether the null hypothesis is true, which is apparently widely misunderstood. Because it does not provide the oft-assumed information about the likelihood of the null hypothesis being true, a p value of .01 does not mean that the probability of the null hypothesis being true is 1 in 100. In fact, it conveys nothing quantitative about the truth of the null hypothesis. To see why, note that changing the order of conditionality in a conditional probability is crucially important. Consider such examples as P(Dead|Electrocuted) versus P(Electrocuted|Dead) or P(Cloudy|Raining) versus P(Raining|Cloudy). The first probability in each pair tells nothing about the second, just as P(Data|H0) reveals nothing about P(H0|Data). A p value, therefore, has quantitative meaning only if the null hypothesis is true, but when performing statistical tests not only does one not know whether the null hypothesis is true, one probably assumes it is not. The important fact is that a finding of statistical significance, via a small p value, does not imply that the null hypothesis is unlikely to be true. The incorrect logic underlying the mistaken conclusion (cf. Falk & Greenbaum, 1995) apparently goes as follows: If the null hypothesis is true, data of a certain sort are unlikely. I obtained data of that sort, so therefore the null hypothesis is unlikely to be true. That so-called logic is precisely the same as the following: If the next person I meet is an American, he or she is unlikely to be the President. I just met the President. Therefore, he or she is unlikely to be an American.

The fundamental misunderstanding of what a p value is leads directly to the more serious problem of assuming that it indicates something quantitative about the reliability, that is, the likelihood of replication, of the finding. A common misunderstanding (see Haller & Krauss, 2002, and Oakes, 1986, for evidence) is that a p value, for example of .01, is the complement of the probability of replication should the experiment be repeated. That is, the mistaken assumption is that if one conducted the experiment 100 times, one should replicate the result on 99 of those occasions (at least on average). If one knew that the null hypothesis were true, then that would be a correct interpretation of the p value. Of course, though, one does not know whether H0 is true (again, one usually hopes it is not). In fact, one conducts the statistical test so that one can make what one (mistakenly) hopes is an educated guess about whether it is true. Thus, to say on the basis of a small p value that a result is statistically reliable is to strain the meaning of reliable beyond reasonable limits.
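The asymmetry between P(Data|H0) and P(H0|Data) can be made concrete with Bayes' rule. The following Python sketch is illustrative only: the function name `posterior_null` and every number in it (the prior probabilities of H0 and the probabilities of the data under an alternative hypothesis) are assumptions chosen for the example, not quantities given in this chapter. It holds P(Data|H0) fixed at .01 and shows how widely P(H0|Data) can vary with quantities that a significance test does not supply.

```python
def posterior_null(p_data_given_h0, p_data_given_h1, prior_h0):
    """Return P(H0 | Data) by Bayes' rule, assuming H0 and H1 are exhaustive."""
    prior_h1 = 1.0 - prior_h0
    joint_h0 = p_data_given_h0 * prior_h0   # P(Data and H0)
    joint_h1 = p_data_given_h1 * prior_h1   # P(Data and H1)
    return joint_h0 / (joint_h0 + joint_h1)


# Held constant in every scenario: the probability of the data if H0 is true.
P_DATA_GIVEN_H0 = 0.01

# Hypothetical scenarios: (prior probability of H0, P(Data | H1)).
scenarios = [
    (0.50, 0.50),
    (0.90, 0.05),
    (0.99, 0.02),
]

posteriors = [
    posterior_null(P_DATA_GIVEN_H0, p_data_given_h1, prior_h0)
    for prior_h0, p_data_given_h1 in scenarios
]

for (prior_h0, p_data_given_h1), post in zip(scenarios, posteriors):
    print(f"P(H0)={prior_h0:.2f}  P(Data|H1)={p_data_given_h1:.2f}  "
          f"P(H0|Data)={post:.3f}")
```

Under these assumed inputs, the same P(Data|H0) of .01 is compatible with a posterior probability of the null hypothesis ranging from roughly .02 to roughly .98, which is the sense in which the p value alone conveys nothing quantitative about the truth of H0.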

This limitation of statistical significance is not based on technical details of the null hypothesis. That is, the problem does not lie with whether the underlying distribution is formally normal or near normal or whether the statistical test involved is demonstrably robust with respect to violations of assumptions about the underlying distribution. The limitation is based in the logic of the approach. All the assumptions about the distributional characteristics of the null hypothesis might in fact be true, but that is not relevant when one is speaking of what a p value indicates.

A major limitation of statistical significance,

therefore, is that it does not provide direct information about the reliability of research findings. Without knowledge about reliability, no examination of

generality can occur because repeatability is the

most basic test of generality. Notwithstanding that

limitation, however, significance testing based on

group means may be seen, incorrectly, to have

implications for generality of findings across sub-

jects. Adherence to this view unfortunately gains

strength as sample size increases. In fact, however,

regardless of sample size, no information about

intersubject generality can be extracted from a significance statement because no knowledge is

afforded concerning the number of subjects for

whom the effect actually occurred. We examine

the implications of this fact in more detail below.

Aside from the limits surrounding reliability just

described, other characteristics of group-mean data

warrant examination as we move into a discussion

of generality. It is here that we show that psychol-

ogy, presumably because of the widespread use of

significance testing, has developed two distinguish-

able subject matters.

Significance Testing and Generality

Traditional significance testing approaches in psychology are generally based on data averaged across

individuals. As is well known, the mean from a

group of individuals (a sample) provides an estimate

of the mean of the entire population from which the

sample is drawn, and that estimate can be bounded

by confidence intervals that provide information

(not the probability, however, that the population

mean falls within the interval; see Smithson, 2003)

about how confident one can be that the population

mean lies within such intervals. Thus, the sample

mean provides information about a parameter that applies to the entire population. That fact appears to

imply substantial generality; it applies to the entire

population (however delimited), so generality

appears maximized. This raises two important issues.

First is the question of representativeness of the

means, both sample and population. That is, identi-

cal or similar means can result from substantially

different distributions of scores. Two examples that

illustrate this fact are given in Figures 7.1 and 7.2.

In Figure 7.1, four distributions of 20 scores are arrayed horizontally in the upper panel. In the top row, the values are arithmetically separated, whereas in the other three, they are clustered in various ways. Note that none of the four is particularly normal in appearance, that is, clustered in the middle. The four plots in the lower panel show (with the top plot corresponding to the top distribution in the upper panel, and so on) the means (solid points) and standard deviations (bars) of the four distributions. They are, as planned, identical. These data show that identical means and standard deviations, the stock in trade of inferential statistics, can be obtained from very different distributions of values. That is, in these cases the means and standard deviations do not provide a particularly informative or representative indication of what the individual values are, which implies that when dealing with averages of measures, or averages across individuals, attention must be paid to the representativeness of the mean, not just its value, or even its standard deviation. Figure 7.2, which contains what is known as Anscombe's quartet (Anscombe, 1973), provides an even more dramatic illustration of how focusing only on the average of a set of numbers can lead one to miss important features of that set. The four graphs in Figure 7.2 plot 11 values in x–y coordinates and show the best-fitting (via the least-squares method) straight line to the data. Obviously, the distributions of points are quite different in the four sets. Nevertheless, the means for the x values are all the same, as are their standard deviations. The same is true for the y values (yielding eight instances of the sort shown in Figure 7.1). In addition, the slopes and intercepts of the straight lines are identical for all four sets, as are the sums of squared errors and sums of squared residuals. Thus, all four would yield the same correlation coefficient describing the relation between x and y.

FIGURE 7.1. Upper panel: Four distributions of values, with each symbol representing one value on the x-axis. Lower panel: The corresponding means and standard deviations of the four distributions from the upper panel. From The Elements of Graphing Data (rev. ed., p. 215), by W. S. Cleveland, 1994, Summit, NJ: Hobart Press. Copyright 1994 by AT&T Bell Laboratories. Reprinted with permission.

FIGURE 7.2. Anscombe's quartet. Each of the four graphs shows 11 x–y pairs and the best-fitting (least-squares estimate) straight line through the points. The slopes and intercepts of the lines are identical. From "Graphs in Statistical Analysis," by F. J. Anscombe, 1973, American Statistician, 27, pp. 19–20. Copyright 1973 by the American Statistical Association. Adapted with permission. All rights reserved.
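The summary statistics of Anscombe's quartet can be checked directly. The Python sketch below is offered only as an illustration: the data values are those of the quartet as commonly reproduced from Anscombe (1973), and the helper names (`mean`, `sample_sd`, `least_squares`, `pearson_r`) are our own. The summaries agree across the four sets to about two decimal places rather than exactly.

```python
from math import sqrt

# Anscombe's (1973) quartet; sets I-III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I":   (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def least_squares(xs, ys):
    """Slope and intercept of the least-squares line of y on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson_r(xs, ys):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


summaries = {}
for name, (xs, ys) in QUARTET.items():
    slope, intercept = least_squares(xs, ys)
    summaries[name] = {
        "mean_x": mean(xs), "sd_x": sample_sd(xs),
        "mean_y": mean(ys), "sd_y": sample_sd(ys),
        "slope": slope, "intercept": intercept, "r": pearson_r(xs, ys),
    }

for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Printing the raw (x, y) pairs next to these summaries makes the point of the figure: nearly indistinguishable summaries are produced by sets whose individual values are arranged in very different ways.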

The point of these illustrations is to indicate

that a sample mean, even though a predictor of a

population mean, is not necessarily a good descrip-

tion of individual values, so it is not necessarily a

good indicator of the generality across individual

measures. When the measures come from individ-

ual people (or other nonhuman animals), it follows

that the average of the group may not reveal, and

may well conceal, much about individuals. It is

important to remember, therefore, that sample means from a group of individuals permit inferences about the population average, but these

means do not permit inferences to individuals

unless it is demonstrated that the mean is, in fact,

representative of individuals. Surprisingly, it is rare

in psychology to see the issue of representativeness

of an average even mentioned, although recently,

in the domain of randomized clinical trials, the

limitations attendant to group averages have been

gaining increased mention (e.g., Penston, 2005;

Williams, 2010).

Many experimental designs, nevertheless, involve

comparison across groups with large numbers of

subjects, which raises the question of the practical-

ity of presenting the data of every individual. The

concern is legitimate, but the problem is not solved

by resorting to the study of group averages only.

Excellent techniques for comparing distributions,

like stem-and-leaf plots, box plots, and quantile–quantile plots, are available (Cleveland, 1994;

Tukey, 1977). They provide a more complete

description of measures from individuals, or a useful

subset (as can be the case with quantile–quantile plots), than do simple means and standard errors or means and confidence intervals. We presume that as null-hypothesis significance-testing approaches

become less prevalent, more effort will be directed

toward developing new and better techniques for

comparing distributions, methods that will include

and make evident the measures from individuals.

Two Separable Subject Matters for Psychology?

In some instances, the difference between a population parameter, such as the population average, and the activity of an individual is obvious. For example, consider the average rate of pregnancy in women

between 20 and 30 years old. Suppose that rate is

7%. That, of course, is a useful statistic and can be

used to predict how many women in that age cate-

gory will be pregnant. More important for the pres-

ent purposes, however, is that the value, 7%, applies

to no individual woman. That is, no woman is 7%

pregnant. A woman is either pregnant or she is not.

What of situations, however, in which an average

is representative of the behavior of individuals? For

example, suppose that a particular teaching technique is discovered to result in a 10% increase in

performance on some examination and that the

improvement is at or near 10% for every individual.

Is that not a case in which a group average would

permit estimation of a population mean that is, in

fact, a good descriptor of the effect of the training

for individuals and, because it applies to the popula-

tion, has wide generality? The answer is yes and no.

The point to be made here is somewhat subtle,

and so we elaborate on it with an example. Consider

a situation in which a scientist is trying to determine the relation between amount of practice at solving

five-letter anagrams and subsequent speed at solving

six-letter anagrams. Suppose, specifically, that no

practice and 10, 50, 100, and 200 anagrams of prac-

tice are to be compared. After the practice, subjects

who have never previously solved anagrams, except

for those seen in the practice phase, are given 50 new



anagrams to solve, and the time to complete is

recorded. Because total practice might be a determi-

nant of speed, the scientist opts to use a between-

groups design, with each group being exposed to

one of the practice regimens. That is, the hope is to

extract the seemingly pure relation between practice

and later speed, uncontaminated by prior relevant practice. The scientist then averages the data from

each group and uses those means to describe the

function relating amount of practice to speed of solv-

ing the new, more difficult anagrams. In an actual

case, variability would likely be found among indi-

viduals within each group, so one issue would be

how representative the average is of each member

of each group. For our example, however, assume

that the average is representative, even perfectly so

(i.e., every subject in a group gives exactly the same

value). The scientist has generated a function, probably one that describes an increase in speed of solving

anagrams as a function of amount of prior practice.

In our example, that function allows us to predict

exactly what an individual would do if exposed to a

certain amount of practice. Even though the means

for each group are representative and therefore per-

mit prediction about individual behavior, the impor-

tant point is that the function has no meaning for

an individual. That is, the function does not describe

something that would occur for an individual

because no individual can be exposed to different amounts of practice for the first time. The function

is an actuarial account, not a description of a

behavioral process. It is, of course, to the extent

that the means are representative, a useful finding.

It is just not descriptive of a behavioral process in

an individual. To examine the same issue at the

level of an individual would require investigation

of sequences of amounts of practice, and that

examination would have to include experiments

that factor in the role of repeated practice. Obviously, such an endeavor is considerably more complicated than the study that generated the actuarial

curve, but it is the only way to develop a science of

individual behavior. The ontogenetic roots of

behavior cumulate over lifetimes. In later portions

of this chapter, we discuss how the complications

may be confronted.

The point is not to diminish the value of actuar-

ial data, nor to suggest that psychologists abandon

the collection and analysis of such data. If means are

highly representative, such data can offer predic-

tions at the individual subject level. Even if the

means are not highly representative, organizations

such as insurance companies and governments can and do make important use of such information in

determining appropriate shared risk or regulatory

policy, respectively. The point is, using insurance

rates as an example, that just because you are in

a particular group, for example, that of drivers

between the ages of 16 and 25, for which the mean

rate of accidents is higher than for another group,

does not indicate that you personally are more likely

to have an automobile accident. It does mean, however, that for the insurance company to remain profitable, insurance rates need to be higher for all members of the group. Similarly, with respect to

health policy, even though most people who smoke

cigarettes do not get lung cancer, the incidence of

lung cancer, on a relative basis, is substantially

greater, on average, in that group. Because the group

is large, even a low incidence rate yields a substan-

tial number of actual lung cancer cases, so it is in

the government's, and the population's, interest to

reduce the number of people who smoke cigarettes.

The crux of the matter is that actuarial and

behavioral data, although related in that the former depend on the latter, are distinguishable and, therefore, should be distinguished. Psychology, to the

extent that it relies solely on the methods of inferen-

tial statistics that use averages across individuals,

becomes an actuarial science, not a science of behav-

ioral processes. The methods described in this

chapter are aimed at including in psychology its

oft-stated goal of being a science of behavior (or of

the mind). Behavioral and inferred mental processes

really make sense only at the level of the individual.

(The same is true of physiology, which has become a rather exact science in part because of the influence of Claude Bernard, 1865/1957.) A person's behavior,

including thinking, imagining, and so forth, is par-

ticular to that person. That is, people do not share

their minds or their behavior with others, just as

they do not share their physiology. A counterargument



is that behavior and mental activity are too variable

from individual to individual to permit a scientific

analysis. We based this chapter on the more opti-

mistic view that such activity is amenable to study at

the level of the individual. Because a good deal of

application of psychological knowledge involves

dealing with individuals, for example, as in psychotherapy, understanding at the level of the individual

remains a worthy goal. Support for the viewpoint

that a science of individual behavior is possible,

however, requires an elaboration of how an individual subject-based analysis can yield information that

is more generally applicable to more than one or a

few individuals.

Why Single-Case Designs Do Not Mean That N = 1

Traditional approaches, with the attendant limitations described thus far, likely arose, at least in part,

because of a legitimate concern about focusing

research on individual subjects who are studied

repeatedly through time (more on this later). Such

research is usually performed with relatively few

subjects, leaving open the possibility that effects

seen might be limited with respect to generality

across other individuals. An example, modeled after

one offered by Sidman (1960), provides a response

to such misgivings. Suppose we were interested in

whether listening to classical music while solving arithmetic problems improves accuracy. Using a

single-case approach, the study is started with a single

subject. We might first establish a baseline of accu-

racy (more on this later) by measuring it over several

successive exposures. Next, we would test the sub-

ject with the music present and then with it absent.

Suppose we find that accuracy is increased when

music is present and reverts to normal when it is

not. Suppose also that unbeknownst to us, the

effect music will have depends on the baseline level

of accuracy; if accuracy is initially low, it is enhanced by the presence of music, whereas if it is

initially high, it is reduced when the music is on.

We might mistakenly conclude, on the basis of the

results from the one subject, that music increases

accuracy of solving the kinds of arithmetic prob-

lems used.

Let us compare how a more traditional between-groups approach might fare in dealing with the issue. We apply music to one group and not to another. What will result will depend on the distribution of baseline accuracy across individuals. Figure 7.3 shows three possible population distributions. In B, most people have low accuracy, in C most have high accuracy, and in A people fall into two groups with respect to baseline accuracy. If one performed the experiment on groups and took the group mean to be the indicator of the effect of the independent variable, the conclusion would depend on the underlying distribution. In A, the conclusion might well be that music has no effect, with the lowered accuracy in people with high baseline accuracy canceling out the increases that result among those with low baseline accuracy. If the population is distributed as in B, the conclusion would be that music increases accuracy because the mean would move in the direction of improved accuracy. The important point is that simply considering the group average makes it less likely that the baseline dependency that underlies the effect will be seen.

FIGURE 7.3. Three hypothetical frequency distributions characterizing the number of people displaying different baseline rates. From Tactics of Scientific Research: Evaluating Experimental Data in Psychology (p. 149), by M. Sidman, 1960, New York, NY: Basic Books. Copyright 1988 by Murray Sidman. Reprinted with permission.

Let us now compare what might transpire with

the single-case approach, an approach based on rep-

lication. Having seen the effect in the first subject,

we recruit a second and do the experiment again.

Suppose that the population distribution is as

depicted in Figure 7.3B. The most likely scenario is

that the second subject will also have low baseline

accuracy because someone sampled from the population is most likely to manifest modal characteristics. We get the same result and could, mistakenly,

conclude that music enhances arithmetic accuracy.

That is, we make the same mistake as with the

group-average approach. The difference between the

two approaches, however, is that the group mean

approach makes it more difficult to discover the

underlying, real effect. The single-case approach,

however, if enough replications are done, will even-

tually and inevitably reveal the problem because

sooner or later someone with high baseline accuracy will be examined and show a decrease. A key phrase in the previous sentence is "if enough replications are done." Whether that happens is likely to depend

on the perceived importance of the effect. If it is

deemed important, it is likely to be subjected to

additional research, which will, in turn, lead to addi-

tional replications. Thus, the single-case approach is

not some sort of panacea with respect to identifying

such relations, but it offers a direct path to correc-

tive action. Of course, it is possible to ferret out the

baseline dependency using a group-mean approach, but that will happen only if attention is paid to the

data of individual subjects in a group. In the single-case approach, those data are automatically scrutinized. A major point is that single case does not

necessarily imply that only one or even only a few

subjects be examined. Some research questions

might involve examination of many subjects. (We

discuss later how to decide how many subjects to

test.) What the approach involves is studying each

subject essentially as an independent experiment.

Generality across subjects is therefore examined

directly by seeing how often the experiment's effects

are replicated. A second major point is that the

apparent virtues of studying many subjects, a standard aspect of traditional research designs in psychology, are realized only if the data from each

subject are individually analyzed.
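The contrast between the group mean and the individual records in the music example can be sketched numerically. In the following Python illustration, everything quantitative is hypothetical: the baseline accuracies (30% and 80%), the threshold separating low from high baselines, the 10-point size of the change, and the composition of the three populations, which only loosely mimic distributions A, B, and C of Figure 7.3. The helper names are likewise our own.

```python
from statistics import mean

THRESHOLD = 50  # hypothetical boundary between low and high baseline accuracy (%)
SHIFT = 10      # hypothetical size of the change produced by music (points)


def change_with_music(baseline):
    """Music raises accuracy when baseline is low and lowers it when high."""
    return SHIFT if baseline < THRESHOLD else -SHIFT


def make_population(n_low, n_high, low=30, high=80):
    """Build a list of baseline accuracies with n_low low and n_high high scorers."""
    return [low] * n_low + [high] * n_high


populations = {
    "A (two clusters)": make_population(50, 50),
    "B (mostly low)": make_population(90, 10),
    "C (mostly high)": make_population(10, 90),
}


def summarize(baselines):
    """Group-mean change alongside the tally an individual analysis would give."""
    changes = [change_with_music(b) for b in baselines]
    return {
        "mean_change": mean(changes),
        "n_improved": sum(c > 0 for c in changes),
        "n_worsened": sum(c < 0 for c in changes),
    }


summaries = {name: summarize(baselines) for name, baselines in populations.items()}

for name, summary in summaries.items():
    print(name, summary)
```

With these assumed values, population A yields a mean change of zero even though every individual changed, and population B yields a positive mean even though 10 of its 100 members got worse. The tallies of improved and worsened individuals, which correspond to treating each subject as a separate experiment, expose the baseline dependency that the means conceal.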

Null-Hypothesis Significance Testing and Theory Development

A major goal in any science is the development of

theory, and there is a sense in which theory has clear

relevance to generality. Effective theories are those

that account for a wide array of research results. That

is, they apply generally. The way in which significance testing is most commonly used in psychology, however, militates against the orderly development

and testing of theories and against the analysis of

competing theories. The problem was first identified

as a paradox by Meehl (1967; see also Meehl, 1978).

The problem is a logical one based largely on the choice of the null hypothesis as "no effect." The logic of the common approach is as follows. An investigator has a hypothesis that imposition of a variable, X, will change another measure, Y. This hypothesis is sometimes called the alternative hypothesis. The null hypothesis is then chosen to be that X will not change Y, that is, that it will be without effect. Next, the X condition is imposed, and Y is measured. A comparison is then made of Y without X and Y with X. A statistic is then calculated that is generally a ratio of changes in Y as a result of X over changes in Y as a result of anything else. In more technical terms, the statistic is effect variance over error variance. The larger the statistic, the smaller the p value, and the more likely it is that statistical significance is achieved and the null hypothesis rejected. Standard teaching demands that even though one can decide to reject the null hypothesis, logic prevents one from accepting the alternative hypothesis. Instead, one would say that if the null hypothesis is rejected, the alternative hypothesis gains support.

The paradox noted by Meehl (1967) arises

from the nature of the statistic itself. The size of the


9/25


159

statistic is controlled by two values, the effect size

and the error variance, so it can be increased in two

ways. The way of interest for this discussion is via

a decrease in error variance, the denominator. A

major way of decreasing error variance is through

increased experimental rigor (one avenue of which

is to increase the number of subjects). To the degreethat extraneous variables (the anything else men-

tioned earlier) can be eliminated or held constant,

error variance should decrease, making it more

likely that that statistic will be large enough to war-

rant a decision as to statistical significance. The

paradox, therefore, is that as experimental rigor is increased (that is, as experimental techniques are refined and improved), statistical significance becomes more likely, with the consequence that the alternative hypothesis gains support, no matter what the alternative hypothesis is. That does not seem like a recipe for cumulative progress in science. Simple null-hypothesis significance testing with the null

hypothesis set at no effect cannot, by itself, help to

develop theory.
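The arithmetic behind the paradox can be sketched in a few lines. What follows is only an illustration under stated assumptions, not part of the original analysis: it assumes a two-group comparison with equal group sizes and a common standard deviation (the two-group analogue of effect variance over error variance), and the function name `t_statistic` and every numerical value are hypothetical. The raw effect is held fixed; only the error term and the number of subjects change.

```python
import math

def t_statistic(mean_diff, error_sd, n_per_group):
    """Two-sample t for equal group sizes and a common SD (illustrative assumption)."""
    standard_error = error_sd * math.sqrt(2.0 / n_per_group)
    return mean_diff / standard_error

# Same raw effect (5 units) throughout; only experimental rigor changes.
effect = 5.0
loose = t_statistic(effect, error_sd=20.0, n_per_group=10)
tight = t_statistic(effect, error_sd=5.0, n_per_group=10)
large = t_statistic(effect, error_sd=20.0, n_per_group=160)

print(f"loose control, n=10:  t = {loose:.2f}")
print(f"tight control, n=10:  t = {tight:.2f}")
print(f"loose control, n=160: t = {large:.2f}")
```

In this sketch, quartering the error standard deviation and multiplying the number of subjects by 16 enlarge the statistic by the same factor, although the effect itself never changes.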

Meehl (1967) described one approach that can

obviate this paradox, which is to use significance

testing with a null hypothesis that is not no effect.

Instead, the null hypothesis is what the theory (or

alternative hypothesis) predicts. Consider how the

logic operates when this tactic is used. As experimental rigor increases, error variance is decreased, making it more likely that the resulting statistic will reach a critical value. When that value is achieved, the null hypothesis is rejected, but in this case it is the investigator's theory that is rejected. Rather than increased experimental rigor resulting in its being easier for one's theory to gain support, it results in its being easier to reject one's theory. Increasing

experimental control puts the theory to a more rig-

orous test, not an easier one as is the case when

using the no-effect, or no-difference, null hypothesis. The harder one works to reject a theory and fails to succeed, the more confidence one has in the

theory.
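To make the contrast concrete, the sketch below computes a one-sample t statistic against two different null hypotheses: no effect (zero) and a value predicted by a quantitative theory. It is an illustration under stated assumptions rather than a procedure taken from Meehl (1967): the helper `one_sample_t`, the observed mean, the predicted value, and the standard deviations are all invented, and the observed mean is deliberately set slightly off the prediction.

```python
import math

def one_sample_t(sample_mean, null_value, sd, n):
    """t statistic for testing a sample mean against a specified null value."""
    return (sample_mean - null_value) / (sd / math.sqrt(n))

observed_mean = 10.0   # hypothetical observed effect
predicted = 12.0       # hypothetical value predicted by a quantitative theory
n = 25

for sd in (20.0, 10.0, 2.0):  # decreasing error corresponds to increasing rigor
    t_no_effect = one_sample_t(observed_mean, 0.0, sd, n)
    t_theory = one_sample_t(observed_mean, predicted, sd, n)
    print(f"sd={sd:5.1f}  t vs no effect = {t_no_effect:6.2f}  "
          f"t vs theory = {t_theory:6.2f}")
```

As the error term shrinks, both statistics grow in magnitude. With the no-effect null, that growth counts in the theory's favor; with the theory's prediction as the null, the same growth threatens the theory, because even a small departure from the prediction becomes detectable.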

Training in statistical inference, at least for psy-

chologists, does not usually emphasize that the null

hypothesis need not be no effect. It can, neverthe-

less, as just noted, be some particular effect. Note

that it has to be some specific value other than zero.

The use of a particular value as the null hypothesis

therefore requires that one's theory be quantitative

enough to generate a specific value. This approach is

what characterizes tests of goodness of fit (those that

use significance tests) of quantitatively specified

functions.

This approach of setting the null hypothesis at a value predicted by theory is nevertheless not

immune to the previously described weaknesses of

significance testing in general. If, however, signifi-

cance testing is used to make decisions, at least this

latter approach does not suffer from the weakness of

making it easier to support a researcher's theory,

regardless of what it is, as methods improve.

In this section of the chapter, we have made the

case, we hope, that commonly used psychology

research methods have limitations in assessing reliability and generality of research findings. In addition, the methods have resulted in many areas of

psychology being largely actuarial, group-average

focused science rather than aimed at the behavior of

individuals. In the next section, we describe the

basics of an alternative approach that is based on

replication rather than significance testing and

group averages. It is useful to remember that impor-

tant science was conducted before the invention of

significance testing, and what follows is a descrip-

tion of the application of methods used to establish

most of modern physics and chemistry (and physiology) to the study of behavior. The approach focuses

on understanding behavioral processes, rather than

actuarial ones, and has already yielded a good deal

of success, as other chapters in Volume 2 of this

handbook illustrate. We should note, nevertheless,

that even if the goal is actuarial prediction and influ-

ence, the methods of statistical inference are limited

in what they can achieve with respect to reliability

of research findings. As we argue, the only sure way

to examine reliability of results is to repeat them, so

replication is the key strategy for both subject matters of psychology.

ASSESSING RELIABILITY AND

GENERALITY VIA REPLICATION

The two distinguishable categories of replication are

direct replication and systematic replication,



although, as we show, the distinction is not a sharp

one. Most researchers are familiar with the concept

of direct replication, which refers to repeating an

experiment as exactly as possible. If the results are

the same or similar enough, the initial effect is said

to be replicated. Direct replication, therefore, is

mainly used to assess the reliability of a research finding, but as we show, there is a sense in which it

also provides information about generality. System-

atic replication is the designation for a repetition

of the experiment with something altered to see

whether the effect can be observed in changed cir-

cumstances. If the results are replicated, then the

generality of the finding is extended to the new cir-

cumstances. Many varieties of systematic replication

exist, and it is the strategy most relevant to examin-

ing the generality of research findings.

Direct Replication: Within-Subject Reliability and Baselines

In the first part of this section, we describe the

methods and roles of direct replication with the

same experimental subject (i.e., a truly single-case

experiment). We open with this simplest case, and

with an example, not only to illustrate how the strat-

egy can be used, but also to deal more clearly with

reservations about and limitations of the approach

as well as how decisions about characteristics of the

replicative process may be made.

For our example, suppose that we want to measure the amount of a certain kind of food eaten after

some period without food. We let our subject eat

after 12 hours of fasting; suppose that she eats

250 grams. Direct replication of this observation

would require that we do the same test again. One

possible, but unlikely, result would be that she

would eat 250 grams again, providing an exact repli-

cation. The amount eaten would more likely be

slightly different, say 245 grams. We might then

conduct another replication to see whether the trend toward eating less was replicable. Suppose on that

occasion our subject eats 257 grams, making it less

likely that there is a trend toward less ingestion with

successive tests. We could repeat the process again

and again. By repeatedly observing the amount eaten

after a 12-hour fast, we gain more confidence with

each successive measurement about how much our

subject will eat of that particular food after 12 hours

of not eating.
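The bookkeeping in this example can be sketched as follows. This is only an illustration: the three intake values come from the example above, whereas the helper name `running_summary` is hypothetical and does not refer to any established procedure.

```python
def running_summary(observations):
    """Return (running mean, running range) after each successive observation."""
    summary = []
    for i in range(1, len(observations) + 1):
        seen = observations[:i]
        mean = sum(seen) / i
        spread = max(seen) - min(seen)
        summary.append((mean, spread))
    return summary

grams_eaten = [250, 245, 257]  # values from the example in the text

for test, (mean, spread) in enumerate(running_summary(grams_eaten), start=1):
    print(f"after test {test}: mean = {mean:.1f} g, range = {spread} g")
```

Each added observation updates both the central value and the spread, which is the information an investigator weighs when judging whether further replications are needed.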

One thing that direct replication can provide, via

a sequence of direct, intrasubject replications such

as that just described, is a baseline. The left segment

of Figure 7.4 shows that there appears to be a steady

baseline amount of intake in our example. A question that might arise is how many observations are

needed to establish a baseline, that is, to come up

with a convincing assessment? The answer is that it

depends. There is no rule or convention about how

many replications are needed to render an outcome

considered reliable in the eyes of the scientific com-

munity. One factor of importance is how much is

already known. In some of the more advanced

physical sciences, a single replication (usually by a

different research team) might be adequate. In our

example, the researcher might have conducted similar research previously, discovered that the baseline

value does not change after 10 observations, and

thus deemed 10 replications enough. The researcher

who chooses replication as a strategy to determine

reliability of findings, therefore, does not have the

comfort of a set of conventions (akin to those avail-

able to investigators who use conventional levels

of statistical significance) to decide whether an effect is reliable enough to warrant reporting to the scientific community. Instead, the investigator's judgment plays a role, and his or her scientific reputation is dependent to some degree on that judgment. One of the comforts of a set of conventions is that if a researcher abides by them and results are later discovered, via failed attempts at replication, not to be reliable, that researcher's reputation suffers little. In contrast, one can argue that there are both advantages and disadvantages to relying on replication. Important advantages are having the benefit of informed judgment, especially of a seasoned investigator, and the fact that social pressure rides more directly on the researcher's reputation. The disadvantage comes from the lack of an agreed-on set of conventions. Principled arguments about which is better for science can be made for both positions, but we favor the view that science, as a social-behavioral activity, will fare better, or at least no worse, if researchers are held more accountable for their conclusions about reliability and generality than for their adherence to a set of arbitrary, often misunderstood conventions.

FIGURE 7.4. Hypothetical data from a series of observations of eating. The first 10 points and last six points are amounts eaten of Food 1. The middle six points are amounts eaten of Food 2. [Plot not reproduced: x-axis, Successive Tests; y-axis, Grams Eaten; phases labeled Baseline - Food 1, Food 2, Food 1.]

Returning to the role of a baseline construed as a

set of intrasubject replications, such baselines can

serve as bases of comparison for effects of experi-

mental changes. For example, after establishing a

baseline of eating of the first food, we could change

what the food is, perhaps less tasty or more or less

calorie laden. The second set of points in Figure 7.4,

which in essence depict measures from a second

set of replications, have been chosen to indicate a

decrease. The reliability of the effect is illustrated by the successive similarity of values, and judgments

about how many replications are needed would

be based on the same sorts of considerations as

involved in the original baseline. A usual check

would involve return to the original food, and the

third set of points indicates a likely result, once

again with a series of replications. The overall exper-

iment, therefore, is an example of the ubiquitous

A-B-A design (see Chapter 1, this volume).
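The structure of such an A-B-A comparison can be sketched in code. The values below are invented for illustration and are not read from Figure 7.4; only the phase lengths (10, 6, and 6 observations) follow the figure caption. The non-overlap check in `ranges_overlap` is one simple convention chosen for this sketch, an assumption on our part, since the approach described here relies on the investigator's judgment rather than on a fixed rule.

```python
from statistics import mean

# Invented values for illustration only; not read from Figure 7.4.
phases = [
    ("A1: Food 1", [250, 245, 257, 252, 248, 251, 249, 254, 247, 250]),
    ("B:  Food 2", [180, 172, 176, 169, 174, 171]),
    ("A2: Food 1", [246, 252, 249, 255, 248, 251]),
]

def ranges_overlap(a, b):
    """True if the value ranges of two phases share any point."""
    return min(a) <= max(b) and min(b) <= max(a)

for label, values in phases:
    print(f"{label}: mean = {mean(values):.1f} g, "
          f"range = {min(values)}-{max(values)} g")

a1, b, a2 = (values for _, values in phases)
effect_seen = not ranges_overlap(a1, b)  # change when Food 2 is introduced
effect_reversed = not ranges_overlap(b, a2) and ranges_overlap(a1, a2)
print("effect on introducing Food 2:", effect_seen)
print("recovery on return to Food 1:", effect_reversed)
```

Within each phase, the similarity of successive values speaks to reliability; across phases, the change and its reversal speak to the role of the independent variable.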

Replication, of course, need not refer only to a

series of successive measurements under identical conditions to produce a baseline. If the type of finding summarized in Figure 7.4 were especially counterintuitive or at considerable odds with existing

knowledge, one might well repeat the entire project,

Food 1 to Food 2 to Food 1, and that, too, would

constitute a direct intrasubject replication. In fact,

the entire project could be carried out multiple

times if, in the investigator's judgment, such confirmation was necessary. Each successful replication

increases confidence that the independent variable,

change of food type, is responsible for the change

in eating.

Direct Replication: Between-Subjects Reliability and Generality

After all this work, an immediate limitation is that

the findings, so far as we know, may well apply

only to the one person studied. Our first result is

based on intrasubject replication. If the goal of the

research was to see whether the change in food can

influence eating, then it may be the case that no fur-

ther replication is needed. It is likely, however, that

our interest extends beyond what is possible to what

is common. In that case, additional replication is in

order, which brings us to the next type of direct replication, replication with different subjects, or intersubject replication. Intersubject replication is used to examine generality, in this case across subjects, and in this single-case design N is extended to more

than 1. Intersubject replication makes clear the fuzz-

iness of the distinction between direct and system-

atic replication. The latter is generally defined as a

replication with something changed (see below),

and a new subject is certainly a change. We also sug-

gest that systematic replication is a main strategy for

assessing generality, and by studying a second subject, generality across individuals is on trial. It is

even possible to suggest that most replications, even

intrasubject replications, are, in fact, systematic. For

example, in the intrasubject replication described

above, time is different for successive observations,

and the subject brings a different history to each

observation period. It nevertheless has become stan-

dard to characterize replications in which the proce-

dures are essentially the same as direct replications.

As we outline shortly, systematic replications are

characterized by changes in procedure or conditions that can be quite substantial.

As noted in the section "Significance Testing and Generality" earlier in this chapter, an emphasis on

replication with individual subjects approaches the

issue of subject generality by increasing the number

of subjects studied. Suppose, for the sake of our

example, we study a second subject, performing the



entire experiment, baseline, new food, baseline, and

the whole sequence, over again. There are two major

classes of outcomes. One, we get the same effect.

Two, we do not. Let us deal initially with the former

possibility. The first issue is what we would accept

as "same." The second person's baseline level would likely not be exactly the same, and in fact, it might be notably different, averaging, say, 200 grams.

Should we count that as a failure to replicate? The

answer is (again), it depends. If our major concern

was the exact amount eaten and the factors contrib-

uting to that, then the result might well be consid-

ered a failure to replicate. We will hold off for a bit,

however, on what to do in the face of such failures,

and move forward on the assumption that we are

not so much concerned with the exact amount eaten

as with whether the change in food results in a

change in amount eaten. In that case, we might replicate, with the second subject, the whole sequence

of conditions, Food 1, Food 2, and back to Food 1.

Two possibilities exist: The results are the same as

for the first subject or they are not, and again, consequently, an important issue is what is meant by "same." The results are unlikely, especially in behavioral science, to be identical quantitatively, and,

in fact, if the baseline is different, the change in

intake cannot be identical in both absolute and rela-

tive terms, so we are left to decide whether to focus

on what is different or on what is similar. In this stage of the discussion, let us assume that intake

decreased, as it had for the first subject. In that case,

we might feel confident that an important feature of

the data has been replicated. A next question, then,

would be whether additional replication with other

subjects is needed. In this particular example, the

answer would most likely be yes, but as is generally

the case, the real answer is that it depends on what

the goals of the experiment are.
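The point that a change cannot match in both absolute and relative terms when baselines differ is simple arithmetic. In the sketch below, the baselines of 250 and 200 grams come from the example, whereas the intake values under Food 2 are invented for illustration and the helper name `change` is hypothetical.

```python
def change(baseline, treatment):
    """Return (absolute change in grams, relative change as a fraction of baseline)."""
    absolute = treatment - baseline
    return absolute, absolute / baseline

# Subject 1: baseline 250 g; assume intake falls to 200 g with Food 2.
abs_1, rel_1 = change(250, 200)

# Subject 2: baseline 200 g. Match the absolute change, then the relative one.
abs_same, rel_if_abs_same = change(200, 150)   # same 50 g drop
abs_if_rel_same, rel_same = change(200, 160)   # same 20% drop

print(f"Subject 1: {abs_1} g, {rel_1:.0%}")
print(f"Subject 2, same absolute drop: {abs_same} g, {rel_if_abs_same:.0%}")
print(f"Subject 2, same relative drop: {abs_if_rel_same} g, {rel_same:.0%}")
```

Matching the 50-gram drop forces a different percentage, and matching the percentage forces a different number of grams, so the judgment of what counts as "same" has to rest on the direction and general form of the change.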

Behavioral scientists, by and large, tend to focus

on similarities rather than differences, so if features of data reveal similarity across individuals, those

similarities are likely to be pursued. Consider, there-

fore, a situation in which the data for the second

subject are dissimilar, not only in quantitative terms

but in qualitative ones as well. For example, sup-

pose that for the second subject the change from

Food 1 to Food 2 results in an increase in amount

eaten rather than a decrease. Here, there is no ques-

tion that an important aspect of the first result has

not been replicated. What is to be done then? The

answer lies in the assumption of determinism that is

at the core of behavioral science. If there is a differ-

ence observed between Subject 1 and Subject 2, that

difference is the result of some other influence. That is, people do not differ for no reason. In fact, the

failure to replicate the exact intake levels at baseline

must also be a result of some factor. Failure to repli-

cate, therefore, is an occasion on which to initiate a

search for the variable or variables responsible for

the differences in outcomes. Suppose, for example,

that Subject 1 was female, and Subject 2 was male.

Tests with other men and women (note the expan-

sion of N) could reveal whether this factor was

important in determining the outcome. Similarly,

we have already assumed different baseline levels, so it might be the case that baseline level is related to

the direction of change in intake, a hypothesis that

can be examined by studying additional subjects. It

is interesting that examination of this second possi-

bility could be aided if the issue of different base-

lines between Subject 1 and Subject 2 had been

assumed to be a failure to replicate. In that case, we

would have focused on reasons for the difference

and may have identified factors that determine base-

line level. If that were so, it might be possible to

control the baseline levels and to change them systematically, thus providing a direct method for

studying the relation between baseline level and the

effect of changing the food.

Another possible reason that disparate effects are

observed between subjects is differing sensitivity to

the particular value of the independent variable

used. In the example just described, the indepen-

dent variable was characterized qualitatively as a

change in food type, making sensitivity to it difficult to assess. If, however, the independent variable can be characterized quantitatively, for instance by carbohydrate content in our example,

the technique of systematic replication, elaborated

below, can be used to examine the possibility.

An important issue in considering direct replica-

tion arises when intersubject replication succeeds

but intrasubject replication does not. Taking our

example, suppose that when the conditions were



changed back to Food 1 with our first subject (cf.

Figure 7.4), eating remained at the lower level,

which would prevent replication of the effect in

Subject 1. Such a result indicates either that some

variable other than the change of food was responsi-

ble for the decrease in eating or that the exposure to

Food 2 has produced a long-lasting change in eating. Support for the second view can come from

attempts at intersubject replication. If experiments

with subsequent subjects reveal that a shift from

Food 1 to Food 2 results in a relatively permanent

decrease in eating, the effect is verified.

When initial effects are not recaptured after

intervening experience that produces a change, the

change is said to be irreversible. Using replication to

examine irreversible effects requires intersubject

replication, so we have here another instance in

which N = 1 does not mean that only one subject need be studied. Many effects in psychology are irreversible, for example, those that we call learning, so

the individual subject approach requires that inter-

subject replication be used to assess the reliability of

such effects, and in so doing the generality of the

effect across subjects is automatically examined.

A focus on each subject individually, of course,

does not prevent the use of traditional data analysis

approaches, should an investigator be so inclined

(for inferential statistical analyses appropriate to

single-case research designs, see Chapters 11 and 12, this volume). Some, for example, might want to

present group averages so that actuarial predictions

can be made. Standard techniques can be used sim-

ply by engaging in the usual sorts of data manipula-

tion. An emphasis on the data from individuals,

nevertheless, can be used to enhance the presenta-

tion. For example, consider a study by Dunn,

Sigmon, Thomas, Heil, and Higgins (2008), who

compared two conditions aimed at reducing cigarette smoking. In one, vouchers were given contingent on breath samples that indicated that no smoking had occurred, whereas in the other the

vouchers were given independently of whether the

subject had smoked. Figure 7.5 shows some of the

results. The bars show group means, and the dots

show data from each individual, illustrating the

degree to which effects were replicable across

patients and the representativeness of the group

averages. Such a display of data provides considerably more useful information than do presentations

that include only means or results of tests of statisti-

cal significance.

Systematic Replication: Parametric Experiments

To this point, our emphasis has been on the

intra- and intersubject generality and reliability of

effects, and we have argued that individual subject

approaches can be effectively used to assess it. Gen-

erality of effects, however, is not limited to general-

ity across individuals, and it is to other forms of

generality, culminating with scientific generality, to

which we now turn.

FIGURE 7.5. Number of days of continuous abstinence from smoking cigarettes in two groups of subjects. Circles are data from individuals. Open bars and brackets show the group means and standard errors of those means. Subjects represented by the left bar received vouchers contingent on abstinence, whereas those represented by the right bar received vouchers independent of their behavior. The top bracket and asterisk indicate that the mean difference was statistically significant at the .01 level. From "Voucher-Based Contingent Reinforcement of Smoking Abstinence Among Methadone-Maintained Patients: A Pilot Study," by K. E. Dunn, S. C. Sigmon, C. S. Thomas, S. H. Heil, and S. T. Higgins, 2008, Journal of Applied Behavior Analysis, 41, p. 533. Copyright 2008 by the Society for the Experimental Analysis of Behavior, Inc. Reprinted with permission.

As noted earlier, systematic replication refers to replication with something changed, and, as also noted, a case can be made that replication with a new subject is a form of systematic replication in that it is an experiment with something changed,

namely the experimental subject. From such replica-

tions come assessments of the across-subject gener-

ality of effects. In this section, we discuss other sorts

of changes between experiments that constitute sys-

tematic replication. To do so, let us begin again with

our example of effects of food type on eating. Suppose that after obtaining the data in Figure 7.4, we

perform a systematic replication of the study rather

than a direct repetition. For example, we might

notice that Food 2's carbohydrate content is higher

than that of Food 1. We decide, therefore, to alter

the carbohydrate content of Food 2 (and let us

assume, likely impossible, without changing the

taste) so that it matches that of Food 1, and repeat

the experiment. Such an experiment would examine

the generality of Food 2's effect on eating to a new carbohydrate level. If adjusting Food 2's carbohydrate amount to equal that of Food 1 resulted in the

switch in foods having no effect on eating, two

things can be concluded. One, the original result

was not replicated. In such cases, it is often wise

to replicate the original experiment to determine

whether unknown variables might have been

responsible. Two, carbohydrate amount is identified

as a likely important variable. Thus, systematic rep-

lication is not only a method for discovering gener-

ality of effects, it is also an approach that can lead to

finding controlling variables.

Continuing our description of types of systematic

replication, let us assume we decide to examine

more fully the role of carbohydrates in eating. Our

original experiment may be conducted several times

but with a different carbohydrate mix for Food 2 on

each occasion. Each repetition of the experiment,

then, constitutes a systematic replication because a

new value of carbohydrate is used for each instance.

Experiments that systematically vary the value of a

variable are called parametric experiments, and they play an especially important role in assessing generality. Consider the data in Figure 7.6, which are

constructed to emulate what might result if several

intersubject replications of a parametric experiment

were conducted.

Parametric examination provides a number of

advantages when assessing the reliability and gener-

ality of results. First, had only a single value of the

independent variable been assessed, we might have been less than impressed with the degree of intersubject replicability of the data. The results of parametric examination, however, reveal a good deal of

similarity across the three subjects: All show the

same basic relation. At low percentages, the amount

eaten is roughly constant within each individual.

As the percentage increases, the amount eaten

decreases until the percentage reaches a value above

which further increases are associated with no

changes in amount eaten. Second, and this is a key

characteristic of parametric evaluation, the data suggest that only a range of levels of the independent

variable result in a change in behavior. That is, para-

metric experiments permit the identification of

boundary conditions, or limiting conditions, outside

of which a variable is relatively ineffective. As we

show later when dealing with the issue of scientific

generality, information about boundary conditions

can be extremely important.

Figure 7.6 also illustrates how parametric experi-

ments can help deal with the problem of lack of

intersubject replicability when a single value of an independent variable is examined. Recalling our

original example of comparison of food types, con-

sider what could have happened if our first two sub-

jects were Subjects 1 and 3 of Figure 7.6 and Food 1

had contained 20% carbohydrate and Food 2 had

contained 25%. Changing the food type would have

produced a change for Subject 1 but not for Subject 3, leading to a conclusion that we had failed to replicate the food change effect across subjects. The parametric examination, however, shows that both subjects are similar in how food intake was influenced by carbohydrate content, except that behavior of the two subjects was sensitive in a slightly different range. One of the most satisfying outcomes of parametric experiments is when they reveal similarities that cannot be judged when only a single value of an independent variable is tested.

FIGURE 7.6. Hypothetical data for three subjects showing the relationship between carbohydrate content and amount eaten.
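The contrast between a single-value comparison and a parametric one can be sketched in code. The piecewise-linear function `intake` and all of its parameters are invented for illustration; they are not read from Figure 7.6 and only mimic the general form described in the text (constant intake at low percentages, a decline over a middle range, and no further change above it). The 20% and 25% carbohydrate values are those of the example.

```python
def intake(carb_percent, high, low, start, end):
    """Hypothetical piecewise-linear intake function (grams).

    Intake stays at `high` up to `start` percent, falls linearly to `low`
    at `end` percent, and stays at `low` above that.
    """
    if carb_percent <= start:
        return high
    if carb_percent >= end:
        return low
    fraction = (carb_percent - start) / (end - start)
    return high - fraction * (high - low)

# Invented parameters; not read from Figure 7.6.
subject_1 = dict(high=250, low=150, start=15, end=30)
subject_3 = dict(high=220, low=130, start=25, end=40)

# Single-value comparison: Food 1 at 20%, Food 2 at 25% carbohydrate.
for name, params in (("Subject 1", subject_1), ("Subject 3", subject_3)):
    delta = intake(25, **params) - intake(20, **params)
    print(f"{name}: change from 20% to 25% = {delta:.1f} g")

# Parametric comparison: the same form of function over shifted ranges.
for pct in range(10, 50, 5):
    print(pct, round(intake(pct, **subject_1)), round(intake(pct, **subject_3)))
```

With these assumed parameters, the single comparison shows a decrease for one subject and no change for the other, whereas the full sweep shows the same form of relation in both, displaced along the carbohydrate axis.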

It is worth noting, too, that parametric experi-

ments can reveal that apparent intersubject replica-

bility can be misleading regarding how a variable

influences behavior. It is possible that tests with a

single value of an independent variable might lead

to very similar quantitative results for several sub-

jects, whereas a parametric analysis reveals that very

different functions describing the relation between the independent variable and behavior happen to cross or come

close together at the particular value of the indepen-

dent variable evaluated.
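A minimal sketch of this situation, using two invented linear functions (the names `increasing` and `decreasing` and their coefficients are hypothetical and chosen only so that the lines cross at the tested value):

```python
def increasing(x):
    """Hypothetical subject whose intake rises with the independent variable."""
    return 100 + 5 * x

def decreasing(x):
    """Hypothetical subject whose intake falls with the independent variable."""
    return 300 - 5 * x

tested_value = 20  # the single value an experimenter happened to choose
print(increasing(tested_value), decreasing(tested_value))

# Sampling additional values exposes the opposite relations.
for x in (10, 20, 30):
    print(x, increasing(x), decreasing(x))
```

At the tested value the two hypothetical subjects yield identical results, and only the additional values reveal that the underlying relations run in opposite directions.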

Parametric experiments illustrate one of the

strengths of being able to characterize independent

variables quantitatively. Experiments that determine

how much of this yields how much of that provide

more information about generality than do experi-

ments that simply test whether a particular value

of an independent variable has an effect. They can

identify similarity where none is evident with a single value of an independent variable, and they can

also determine whether apparent similarity is

unrepresentative.

We should note that parametric experiments are

not limited in application to only primary indepen-

dent variables, such as that shown in our fictitious

example. Any variable associated with an experiment

can be systematically varied. As an example, the

experiment just described could be conducted under

a range of temperatures, a range of degrees of hydration of the subjects, a range of times without food before the test, and any of several other variables.

Those experiments, too, would provide information

about the range of conditions under which the inde-

pendent variable of carbohydrate content exerts its

effects in the circumstances of the experiment.

Parametric experiments, although very important,

are not the only kind of systematic replications. One

other type involves using earlier findings as a starting

point, or baseline, for examination of other variables.

As an example, consider the phenomenon of false

memory in the laboratory, produced by a procedure

originally developed by Deese (1959) and later elabo-

rated by Roediger and McDermott (1995). In these

studies, subjects said they recalled or recognized words that were not presented. A great deal of

research followed the original demonstrations, and

these experiments varied procedural details, measure-

ment techniques, subject characteristics, and so forth.

In each instance, therefore, in which the false memory

effect was reproduced, the reliability of the phenome-

non was demonstrated and its generality extended.

Using the reproduction of previous findings as a start-

ing point for subsequent research, therefore, is a use-

ful and productive technique for examining reliability

and generality of research outcomes.

Sidman (1960), in his characterization of techniques of systematic replication, described a type he

called systematic replication by affirming the conse-

quent (p. 127). Essentially, this approach is very

similar to the idea of hypothesis testing because the

systematic replication is not based on simply chang-

ing some aspect of the experiment to see whether

effects can still be reproduced but rather on what the

investigator sees to be the implications of previous

results. That is, the replication may be based on the

investigator's interpretation of what the data mean. For example, consider our fictitious study of the

effects of carbohydrate content on eating. That

result, and perhaps those of other experiments,

might suggest that the phenomenon is not specific

to eating. Carbohydrate ingestion possibly leads to

general lethargy or low motivation for voluntary

behavior. If we suspect that, we might devise other

experiments that could be viewed as systematic rep-

lications based on the possible implications of the

previous findings. If the results were consistent with

the lethargy interpretation, the view would gain in credence; if they were not, the view might well be

abandoned. As Sidman (1960) noted, definite con-

clusions may not be drawn from successful replica-

tions by affirming the consequent, but, as he also

noted, the approach is essential to science. The

degree to which one's confidence in an interpretation of data grows with successful replications



depends on many things, not the least of which is

how counterintuitive the predicted outcome is.

Types of Generality Assessed and Established by Systematic Replication

Johnston and Pennypacker (2009) offered a useful

characterization of the dimensions along which generality can be examined. They initially suggested a dichotomy between "generality of" and "generality across." "Generality across" is simple to understand.

As we have already noted, replication can be used to

determine generality across subjects or situations, a

type of generality usually of considerable interest.

Systematic replication comes to the fore in the

assessment of generality across species and across

settings. By definition, systematic replication is an

attempt at replication with something different, so if

the species is changed, or if something (or a lot) about the setting is altered, the replication attempt is

a systematic one. In both cases, the issue of what

constitutes a successful replication may arise. Con-

sider, for example, if we decided to attempt a cross-

species replication of our experiments with food

types, and our new species was a mouse. Obviously,

mice would eat considerably less, and therefore a

precise, quantitative replication would not be possi-

ble. We might (actually, probably would), however,

argue that the replication was successful if the relation

between carbohydrate content and eating was replicated, that is, if at low concentrations there was

little effect on eating, but as carbohydrate content

increased, the amount eaten decreased until some

level was reached above which further decreases were

not seen (cf. Figure 7.6).

What if the content values at which the decreases

begin and end differ between the species? For exam-

ple, mice may begin to show a decline when the

food reaches 15% carbohydrate, whereas with the

humans, decreases are not evident until the food

contains 25% carbohydrate. Is that a failure to replicate? Again, the answer is yes and no. The business

of science is to find regularities in nature, so empha-

sis is properly placed on similarities. Differences

virtually always exist, so they are easy to find. Nevertheless,

they cannot be ignored entirely; their

main role is not to indicate that the similarities

evident are somehow unimportant, but rather to

promote further research into the origins of the dif-

ferences if the differences are judged to be impor-

tant. The scientist and the scientific community

make judgments about the need for further investi-

gation of the differences that are always present in

replications.

Generality of also plays an essential role in science. Johnston and Pennypacker (2009) described

several categories of generality of, but here we focus

on one in hopes of making the concept clear: gener-

ality of process. Our example is a behavioral process

familiar to most psychologists, specifically the process

of reinforcement of operant (purposive) behavior.

Reinforcement refers to the increase in likelihood

of behavior as a result of earlier instances being fol-

lowed by certain consequences, which is the pro-

cess. Systematic replications across an immense

range of behavioral activities and a very large range of consequences have been shown to provide

instances of the process. For example, in addition to

the traditional lever press and key peck, activities

ranging from the electrical activity of an imperceptible

movement of the thumb (Hefferline, Keenan,

& Harford, 1959), to vocal responses of chicks

(Lane, 1960), to generalized imitation in children

with developmental delays (Baer, Peterson, & Sher-

man, 1967), to the extensive range of activities

described in the use of reinforcement in the treatment

of behavior disorders (e.g., Martin & Pear, 2007; Ullman & Krasner, 1966) have all been shown

to be instances of the process. Similarly, the range of

events used as effective consequences to produce

reinforcement is also broad. Consequences such as

praise, food, intravenous drug administration, open-

ing a window, reducing a loud noise, access to exer-

cise, and many, many others have been effectively

used to produce reinforcement. All the reports

may be viewed as describing systematic replications

of the earliest experiments on the process (e.g.,

Skinner, 1932; Thorndike, 1898).

This generality of process is what stands as the

rationale for speaking of reinforcement theory. The

argument is similar to that offered for the motion of

objects. Whatever those objects are, and whether

they are falling, floating, being ballistically pro-

jected, or orbiting in outer space, they can be sub-

sumed under the notion of gravitational attraction,



Newton's theory of gravity. An even more dramatic

example is provided by living things. All manner of

plants and animals populate the earth, and their dif-

ferences are obvious and virtually countless. What is

less obvious but explains the variety is that all life

can be considered to have developed from the operation

of three processes: variation, selection, and retention (Darwin, 1859). The sameness of cellular

architecture, including nuclear material (e.g., DNA

and RNA), also attests to the similarity. Likewise, all

the myriad instances of reinforcement suggest that

considering them instances of a single process is rea-

sonable. As noted earlier, an important goal of sci-

ence is to discover uniformities. In fact, as Duhem

(1954) noted, one of the key features of explanation

is identification of the like in the unlike. Objects

look different, are made of different substances, and

may or may not be moving in a variety of ways, but they are similar in how they are affected by gravity.

Behavioral activities take on many forms, and as just

noted, so can the consequences of those activities.

Nevertheless, they can (on many occasions) exhibit

the phenomenon known theoretically as reinforce-

ment, an instance of generality of process.

Scientific Generality

Another extremely important concept is scientific

generality, a type of generality that has some counterintuitive

characteristics. Scientific generality is important for at least two reasons. One, scientific

generality speaks to scientists' ability to reproduce

their own findings and those of other scientists, as

well. Two, scientific generality speaks directly to

the possibility of effective application and translation

of laboratory findings to the world at large, as discussed

more fully in the last section of this

chapter. Scientific generality is defined by knowl-

edgeable reproducibility. That is, it is not character-

ized in terms of breadth of applicability, but instead

in terms of identification of factors that are required for a phenomenon to occur. To illustrate the difference

between scientific generality and, for example,

generality across people, consider again the fictitious

experiment on food types. Suppose that the original

experiments were all performed with male subjects.

On an attempt at replication with female subjects,

it is discovered that food type, or carbohydrate

composition, has no effect at all on eating. That, of

course, would be a clear indication of a limit to the

across-subjects generality of the effect on eating. It

would, however, represent an increase in scientific

generality because it specifies more clearly the con-

ditions required to produce the phenomenon of

reduced food intake. As stated by Johnston and Pennypacker (2009), "A procedure can be quite

valuable even though it is effective under a narrow

range of conditions, as long as we know what those

conditions are" (pp. 343–344). The vital role that

systematic replication, and even failures of system-

atic replication, can play in establishing scientific

generality therefore becomes evident. Scientific gen-

erality represents an understanding of the variables

responsible for a phenomenon.

GENERALIZATION, TECHNOLOGY

TRANSFER, AND TRANSLATIONAL

RESEARCH

The function of any science is the acquisition of

basic knowledge. A secondary benefit is often the

possibility of applying that knowledge in ways that

impart benefit to some element of the culture at

large. For example, Galileo's basic astronomic observations

eventually led to improved navigation procedures

with attendant benefits to the colonial powers

of 17th-century Europe. Pasteur's discovery in 1863 of the microorganisms that sour wine and turn it

into vinegar, and the observation that heat would

kill them, led eventually to the germ theory of dis-

ease and the development of vaccines.

In the case of behavior analysis, a relatively

young science, sufficient basic knowledge has been

acquired to permit vigorous attempts at application.

A discipline known as applied behavior analysis,

discussed extensively elsewhere in Volume 2 of this

handbook, is the primary result of these efforts,

although applications of the findings of behavior analysis are to be found in a variety of other disciplines,

including medicine, education, and management,

to name but a few.

In this section, we describe issues surrounding

attempts to apply laboratory research findings in the

wider world. Specifically, we discuss topics

related to applying research findings from controlled



laboratory or therapeutic settings to new situations

or less controlled environments. First, we describe

the issue of generalization of behavioral treatment

effects from treatment settings to real-world circum-

stances. Then we outline basic general strategies

for effective transfer of technologies, taking into

account the known scientific generality of behavioral processes. Finally, we offer comments on the

notion of translational research, a matter of much

contemporary interest.

Generalization of Applications

One of the earliest subjects of discussion that arose

with the development of behavior therapy and

behavior modification techniques was the issue

referred to as generalization (e.g., Yates, 1970). Specifically,

there was concern about whether improvements

produced in a therapy setting would also appear in other, nontherapy (e.g., everyday life) situations.

The term generalization was borrowed from

a core behavioral process discovered by experimen-

tal psychologists, that after learning to respond in a

particular way in the presence of a particular stimulus,

say, a tone of a particular frequency, the same behavior may

also occur in the presence of other more or less sim-

ilar stimuli, say, other frequencies of the tone. It is

an apparently simple logical step to suggest that

behavior learned in a therapy environment might

also appear in nontherapy, real-world environments, and when it does so, the result can be called generalization

(but see Johnston, 1979, for problems with

such a simple extrapolation). Because applied

behavior analysis generally involves establishing

conditions that alter behavior, the issue of whether

those changes are restricted to the learning situa-

tions arranged or whether they become evident in

other situations is usually important. For example,

if a child who engages in aggressive behavior is

exposed to a treatment program to reduce aggression,

a goal would be to decrease aggression not only in the treatment setting but in all settings.

In a seminal article, Stokes and Baer (1977) dis-

cussed the issue of generalization of treatment

effects. A key contribution of their article was to

indicate that in general, if effects of a treatment are

to be manifested in a variety of circumstances,

achieving that outcome must be considered in

designing the intervention intended to effect the

change in behavior. That is, it is not always suffi-

cient to simply arrange circumstances that produce

a desired change in behavior in the circumscribed

environment in which the treatment is undertaken.

Instead, procedures should be used that increase the

probability that the change will be enduring and manifested in those parts of a client's environment

in which the changes are useful. That insight has

been followed by the development of general strate-

gies to enhance the likelihood that behavior changes

occur not only in the treatment environment but

also in other appropriate ones.

For example, Miltenberger (2008) described

several general strategies that can be used to pro-

mote generalization of treatment effects. The most

direct strategy is to arrange for rewards to occur

immediately after instances of generalization occur. Such an approach essentially entails taking treatment

to the environments in which it is hoped the

new behavioral patterns will occur. That is, the

training environment is not explicitly delimited.

Such an approach is now widespread in applied

behavior analysis partly as a consequence of an

emphasis on analyzing reinforcement functions

before implementing treatment (see Iwata, Dorsey,

Slifer, Bauman, & Richman, 1982). This approach

to problem behavior entails discovering whether

the behavior is maintained by reinforcement, and if it is, identifying what the reinforcers are in the

environments in which the problem behavior

occurs. Once the reinforcers responsible for the

maintenance of the problem behavior are identi-

fied, then procedures based on that knowledge are

implemented in the situations in which the behav-

ior occurs.

A related second strategy identified by Milten-

berger (2008) is consideration of the conditions

operating in the environments in which the changed

behavior would be occurring. The idea here is that behavior that is changed in the therapeutic setting,

for example, learning effective social skills, will lead,

if performed, to more satisfying social interactions

in the nontherapy environment, and those successes

will help to solidify the gains made in the therapy

sessions. In designing the therapeutic goals, there-

fore, consideration is given to what sorts of behavior



are most likely to be successful in the nontherapy

environment.

A less obvious strategy applies when the nonther-

apy environment appears to offer little or no support

for the changed behavior. An example is when ther-

apy is aimed at training an adolescent to walk away

from aggressive provocation in a schoolyard. Behaving in such an aggression-thwarting manner is not

likely to result in positive outcomes with peers, who

are instead likely to provide taunts and jeers after

such actions. In such a case, it may be prudent to try

to change the normal consequences in the school-

yard by having teachers or other monitors provide

positive consequences (perhaps in the form of privi-

leges, praise, etc.) for such actions when they occur.

That is, the strategy here involves altering the con-

tingencies operating in the nontherapy environment.

A fourth general strategy is to try to make the therapy setting more like the nontherapy environment

in which the changed behavior is to occur. A

study by Poche, Brouwer, and Swearingen (1981)

illustrated this approach. They taught abduction

prevention skills to preschool children, but in so

doing incorporated a relatively large number of

abduction lures in the training. The intent was that

by including a wide variety of possible lures that

might be used by would-be kidnappers, the training

would be more effective in real-world situations

than if it had not involved those variations. The general strategy in this case was to train with as many

likely relevant situations as possible. Another way to

view this strategy is that it involves incorporating

stimuli that are present in the nontherapy environ-

ment into the training.

A fifth approach is somewhat less intuitive, but

research has suggested that it may be effective. The

core idea is that if a variety of different forms of

effective behavior are established by the therapy or

training, the chance of effective behavior occurring

in the nontherapy environment is better, and as a result the successful behavior will be supported and

continue to occur. As a simple illustration, Milten-

berger (2008) offered the example of teaching a shy

person a variety of specific ways to ask for a date,

which provides the person with several actions to

try, some of which are likely to be successful outside

of therapy.

In this section, we focused on particular strate-

gies for ensuring that desired changes in behavior

established through therapeutic methods occur and

persist in nontraining or nontherapy environments,

that is, in the everyday world. Employment of tactics

emerging from the strategies described has yielded

many successes, and the methods are part of the armamentarium of applied behavior analysts. These

techniques to promote generalization of behavior

changes have emerged from a consideration of

fundamental behavioral processes that have been

identified and analyzed in basic research and then

subsequently validated as effective through applied

research. They represent, consequently, what can be

called successful transfer from basic science to effec-

tive technology, namely, an instance of what has

come to be called technology transfer. In the next

section, we discuss some general principles of effective technology transfer.

Technology Transfer

People often use the term technology to refer to the

body o
